Hardware Acceleration

Hardware support in modern GPUs: Matrix-Multiply-Accumulate (MMA)

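Modern GPUs (such as the NVIDIA RTX series) include dedicated hardware units for matrix multiply-accumulate (MMA), for example Tensor Cores.
These units efficiently execute large matrix operations at low precision (FP16, FP8, INT8, and even FP4).
On top of the traditional SIMT (Single Instruction, Multiple Threads) model, a cooperative execution mode is introduced so that the threads of a group work together to make full use of the MMA units.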
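
As a rough illustration of what a single MMA operation computes, the plain-Python sketch below evaluates D = A·B + C on a tiny tile. The name `mma_tile` and the 2×2 tile size are assumptions made for this example only; real hardware works on fixed-size low-precision tiles and spreads the work across the threads of a warp or subgroup.

```python
def mma_tile(a, b, c):
    """Return D = A @ B + C for small matrices given as nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    # Start from the accumulator C, then add the products on top of it.
    d = [[c[i][j] for j in range(cols)] for i in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                d[i][j] += a[i][k] * b[k][j]
    return d


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]
D = mma_tile(A, B, C)
```

The point of the "accumulate" part is that the result is added onto an existing matrix, which is exactly the shape of a layer evaluation (weights times input, plus bias).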

MMA programming interfaces (intrinsics)

Different graphics APIs expose low-level access to the MMA hardware:

Platform / API	Extension / feature
Vulkan	`VK_KHR_cooperative_matrix`, `VK_NV_cooperative_matrix`
DirectX 12	Linear Algebra Matrix Conversion APIs (such as `ConvertLinearAlgebraMatrix`), Shader Model 6.9
Metal	SIMD Group Matrix (Metal 4)

To see concretely how MMA is used, take a single layer of a neural network as an example:

struct NetworkParameters<int InputSize, int OutputSize>
{
    float weights[InputSize][OutputSize];
    float biases[OutputSize];
    Array<float, OutputSize> eval(float input[InputSize])
    {
        float output[OutputSize] = biases;
        output += MatMulVec(weights, input);
        for (int i = 0; i < OutputSize; ++i)
            output[i] = max(output[i] * 0.01f, output[i]); // Leaky ReLU
        return output;
    }
}
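
For readers who want to check the arithmetic on the CPU, here is a minimal Python reference of the same layer. It assumes the weights are stored as OutputSize rows of InputSize values, so that the layer computes Leaky ReLU(M·v + B); the names `layer_eval` and `leaky_relu` are illustrative and not part of any API.

```python
def leaky_relu(x, slope=0.01):
    """max(slope * x, x): identity for x >= 0, scaled by slope for x < 0."""
    return max(x * slope, x)


def layer_eval(weights, biases, inputs):
    """Evaluate one layer: leaky_relu(weights @ inputs + biases)."""
    out = []
    for row, b in zip(weights, biases):
        acc = b
        for w, v in zip(row, inputs):
            acc += w * v
        out.append(leaky_relu(acc))
    return out


M = [[1.0, -2.0], [0.5, 0.5], [-1.0, 0.0]]  # 3 outputs, 2 inputs
B = [0.0, 1.0, -1.0]
v = [2.0, 3.0]
u = layer_eval(M, B, v)
```

The pre-activation values here are -4, 3.5 and -3, so the first and third outputs are scaled by 0.01 while the second passes through unchanged.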

With cooperative vectors, the same layer becomes:

struct NetworkParameters<int InputSize, int OutputSize>
{
    half* weights;
    half* biases;

    CoopVec<half, OutputSize> eval(CoopVec<half, InputSize> input)
    {
        let output = coopVecMatMulAdd<half, OutputSize>(
            input, CoopVecComponentType::Float16,   // input data format
            weights, CoopVecComponentType::Float16, // weight data format
            biases, CoopVecComponentType::Float16,  // bias data format
            CoopVecMatrixLayout::RowMajor,          // matrix layout: row-major
            false,                                  // transpose the matrix before multiplying?
            sizeof(half) * InputSize);              // matrix row stride in bytes
        return max(output * 0.01h, output); // Leaky ReLU activation
    }
};

This version uses 16-bit floats (half) because the hardware does not support 32-bit floats for these operations.
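
To get a feel for what storing weights as 16-bit floats costs in precision, the sketch below rounds a Python float to IEEE 754 binary16 using the standard `struct` module (format code `e`). The helper name `to_half` is an assumption for this example; it only illustrates rounding and says nothing about how a particular GPU handles half values internally.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]


weight = 0.1
stored = to_half(weight)       # the value a half-precision buffer would hold
error = abs(stored - weight)   # rounding error introduced by the conversion
```

With only 10 explicit mantissa bits, a value like 0.1 is stored with a relative error on the order of 1e-4, which is usually acceptable for network weights but worth keeping in mind when accumulating many small gradients.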

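As written, this layer cannot yet be used for backpropagation during training. Two problems remain:

Where should the weight gradients be stored?
How do we make the layer differentiable?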

The first problem can be solved by storing the gradient of each parameter directly in the struct:

half* weights;
half* biases;
half* weightsGrad;
half* biasesGrad;

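For the second problem, first consider why the layer is not differentiable.

The CoopVec type is not differentiable, so it does not support automatic differentiation.

coopVecMatMulAdd is a non-differentiable intrinsic.

It is a highly optimized, low-level hardware instruction or library function for fast matrix multiply-add. It is designed as a fast but opaque operation and has no built-in gradient logic.

To make the layer differentiable, we wrap the non-differentiable CoopVec type in a differentiable type, MLVec<N>, so that it can take part in automatic differentiation.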

struct MLVec<int N> : IDifferentiable
{
    CoopVec<half, N> data;
    typealias Differential = This;
    static Differential dadd(Differential d0, Differential d1)
    {
        return {d0.data + d1.data};
    }
    static Differential dmul<U:__BuiltinRealType>(U s, Differential d)
    {
        return {d.data * __realCast<half>(s)};
    }
    static Differential dzero()
    {
        return {};
    }
}
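
The three operations on the differential type (zero, add, scale) are what an automatic-differentiation system needs in order to initialize a gradient, sum contributions from several uses of a value, and scale a gradient by a scalar. The Python sketch below mirrors that idea; the class name `DiffVec` is hypothetical, and `dzero` takes an explicit length because Python has no compile-time size parameter like `N`.

```python
class DiffVec:
    """Toy differential type with the zero / add / scale operations."""

    def __init__(self, data):
        self.data = list(data)

    @staticmethod
    def dzero(n):
        return DiffVec([0.0] * n)

    @staticmethod
    def dadd(d0, d1):
        return DiffVec(a + b for a, b in zip(d0.data, d1.data))

    @staticmethod
    def dmul(s, d):
        return DiffVec(s * a for a in d.data)


# A value used in two places receives the sum of both gradient contributions.
grad = DiffVec.dzero(3)
grad = DiffVec.dadd(grad, DiffVec([1.0, 2.0, 3.0]))
grad = DiffVec.dadd(grad, DiffVec.dmul(0.5, DiffVec([2.0, 2.0, 2.0])))
```

After these three steps `grad.data` holds the element-wise sum of the first contribution and half of the second.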

struct NetworkParameters<int InputSize, int OutputSize>
{
    ...
    half* weightsGrad;
    half* biasesGrad;

    MLVec<OutputSize> eval(MLVec<InputSize> input) { ... }

    [BackwardDerivativeOf(eval)]
    void evalBwd(
        inout DifferentialPair<MLVec<InputSize>> input,
        MLVec<OutputSize> resultGrad)
    {
        ...
    }
}

Next, we look at how the gradients are computed across multiple threads.

The original (forward) computation is:

\[ \mathbf{u} = M\mathbf{v} + B \]

float identity(float x) { return x; }

[BackwardDerivativeOf(identity)]
void identityBwd(inout DifferentialPair<float> x, float dOut) {
    printf("Gradient at this point: %f\n", dOut);
    x.d += dOut;
}

The corresponding gradients, where \(\otimes\) denotes the outer product, are:

\[ \begin{aligned} dM &\mathrel{+}= d\mathbf{u} \otimes \mathbf{v}\\ dB &\mathrel{+}= d\mathbf{u}\\ d\mathbf{v} &= M^T d\mathbf{u} \end{aligned} \]
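
These gradient formulas can be sanity-checked numerically. The sketch below is a plain-Python check, not part of the shader code: it defines a scalar loss whose gradient with respect to u equals a chosen du, evaluates the three formulas, and compares one entry of each against a finite difference. All names (`forward`, `backward`, `loss`) and the test values are assumptions made for this example.

```python
def forward(M, B, v):
    """u = M v + B (no activation)."""
    return [b + sum(m * x for m, x in zip(row, v)) for row, b in zip(M, B)]


def backward(M, v, du):
    """Gradients of the linear layer for an upstream gradient du."""
    dM = [[g * x for x in v] for g in du]               # du (outer) v
    dB = list(du)                                       # du
    dv = [sum(M[i][j] * du[i] for i in range(len(M)))   # M^T du
          for j in range(len(v))]
    return dM, dB, dv


def loss(M, B, v, du):
    """Scalar loss sum(du * u), whose gradient w.r.t. u is exactly du."""
    return sum(g * u for g, u in zip(du, forward(M, B, v)))


M = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [0.5, -0.5]
v = [1.0, -1.0, 2.0]
du = [0.25, -2.0]
dM, dB, dv = backward(M, v, du)

# Finite-difference estimates for one entry of dM and one entry of dv.
eps = 1e-6
base = loss(M, B, v, du)
M_pert = [row[:] for row in M]
M_pert[1][2] += eps
fd_dM_12 = (loss(M_pert, B, v, du) - base) / eps
v_pert = v[:]
v_pert[0] += eps
fd_dv_0 = (loss(M, B, v_pert, du) - base) / eps
```

Because the layer is linear in both M and v, the finite-difference estimates agree with `dM[1][2]` and `dv[0]` up to floating-point rounding.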

With these formulas we can work out what each thread has to do:

[BackwardDerivativeOf(eval)]
void evalBwd(
    inout DifferentialPair<MLVec<InputSize>> input,
    MLVec<OutputSize> resultGrad)
{
    let fwd = eval(input.p);
    // Back-prop resultGrad through activation.
    for (int i = 0; i < OutputSize; i++)
        if (fwd.data[i] < 0.0)
            resultGrad.data[i] *= 0.01h;

    // Back-prop gradients to the weights matrix.
    coopVecOuterProductAccumulate(
        resultGrad.data,
        input.p.data,
        weightsGrad, 0,
        CoopVecMatrixLayout.TrainingOptimal, CoopVecComponentType.Float16);

    // Back-prop gradients to the biases vector.
    coopVecReduceSumAccumulate(resultGrad.data, biasesGrad, biasesOffset);

    // Back-prop gradients to the input vector.
    let dInput = coopVecMatMul<half, InputSize>(
        resultGrad.data, CoopVecComponentType.Float16,
        weights, CoopVecComponentType.Float16,
        CoopVecMatrixLayout.ColumnMajor, false, sizeof(half)*InputSize);
    input = {input.p, {dInput}};
}
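
The division of work between threads can be mimicked on the CPU. In the hedged Python sketch below, each "thread" handles one input vector: it adds its contribution to the shared weight and bias gradient buffers and returns its own input gradient. The sequential loop stands in for parallel execution, where the accumulation would have to be atomic or otherwise synchronized; all names and values are assumptions for this illustration.

```python
def leaky_relu(x, slope=0.01):
    return max(x * slope, x)


def forward(M, B, v):
    """Leaky ReLU(M v + B) for one input vector."""
    return [leaky_relu(b + sum(m * x for m, x in zip(row, v)))
            for row, b in zip(M, B)]


def backward(M, B, v, result_grad, weights_grad, biases_grad):
    """Accumulate parameter gradients in place; return the input gradient."""
    fwd = forward(M, B, v)
    # Back-propagate through the activation: scale where the output was negative.
    du = [g * 0.01 if o < 0.0 else g for o, g in zip(fwd, result_grad)]
    for i, g in enumerate(du):
        biases_grad[i] += g                  # dB += du
        for j, x in enumerate(v):
            weights_grad[i][j] += g * x      # dM += du (outer) v
    # dv = M^T du, private to this thread
    return [sum(M[i][j] * du[i] for i in range(len(M)))
            for j in range(len(v))]


M = [[1.0, -1.0], [2.0, 0.0]]
B = [0.0, 0.0]
weights_grad = [[0.0, 0.0], [0.0, 0.0]]  # shared by all threads
biases_grad = [0.0, 0.0]                 # shared by all threads

thread_inputs = [[1.0, 2.0], [3.0, 1.0]]
input_grads = [backward(M, B, v, [1.0, 1.0], weights_grad, biases_grad)
               for v in thread_inputs]
```

The shared buffers end up holding the sum of both threads' contributions, while each entry of `input_grads` belongs to a single thread. This is why the weight and bias updates are "accumulate" operations and the input gradient is a plain assignment.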

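What is a "layout" (memory layout)?

In high-performance computing, especially on GPUs and dedicated AI accelerators, the way a matrix is arranged in memory (its layout) strongly affects access efficiency and compute speed.

Common layouts include:

RowMajor: elements are stored contiguously row by row.
ColumnMajor: elements are stored contiguously column by column.
TrainingOptimal: an internal layout optimized for gradient accumulation during training. It may not be a standard row-major or column-major order; it can instead be tiled, aligned for vectorized access, or otherwise tuned for the hardware's SIMD units or tensor cores.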
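
The two standard layouts differ only in how a (row, column) pair maps to a linear offset. The Python sketch below shows both mappings for a 2×3 matrix; the function names are illustrative, and the TrainingOptimal layout is deliberately left out because its exact arrangement is implementation-defined.

```python
def row_major_index(row, col, num_rows, num_cols):
    """Offset of element (row, col) when rows are stored contiguously."""
    return row * num_cols + col


def column_major_index(row, col, num_rows, num_cols):
    """Offset of element (row, col) when columns are stored contiguously."""
    return col * num_rows + row


matrix = [[1, 2, 3],
          [4, 5, 6]]
num_rows, num_cols = 2, 3

row_major = [0] * (num_rows * num_cols)
column_major = [0] * (num_rows * num_cols)
for r in range(num_rows):
    for c in range(num_cols):
        row_major[row_major_index(r, c, num_rows, num_cols)] = matrix[r][c]
        column_major[column_major_index(r, c, num_rows, num_cols)] = matrix[r][c]
```

Here `row_major` is `[1, 2, 3, 4, 5, 6]` and `column_major` is `[1, 4, 2, 5, 3, 6]`: the same six values, in a different order.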

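coopVecOuterProductAccumulate is a highly optimized low-level function (possibly provided by the compiler or a hardware library), and it assumes that the weight gradients are already stored in the format best suited to accumulation during training. The developer must therefore allocate the weight-gradient buffer in the TrainingOptimal layout when initializing it.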

Note

let dInput = coopVecMatMul<half, InputSize>(
    resultGrad.data, CoopVecComponentType.Float16,
    weights, CoopVecComponentType.Float16,
    CoopVecMatrixLayout.ColumnMajor, false, sizeof(half)*InputSize);

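To compute \(M^T d\mathbf{u}\) efficiently, the call declares the memory layout of the weights matrix as ColumnMajor. Reading a buffer that was written in row-major order as if it were column-major yields the transposed matrix, so the matrix-multiply routine produces the transposed product directly, without rearranging any data in memory.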
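
The layout trick can be demonstrated with a few lines of Python. The sketch below stores a 2×3 matrix M in row-major order, then reads the same buffer as a 3×2 column-major matrix and uses it to compute \(M^T d\mathbf{u}\). The helper names and the sample values are assumptions for this example.

```python
def read_row_major(buf, rows, cols):
    return [[buf[r * cols + c] for c in range(cols)] for r in range(rows)]


def read_column_major(buf, rows, cols):
    return [[buf[c * rows + r] for c in range(cols)] for r in range(rows)]


def mat_vec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


buf = [1, 2, 3, 4, 5, 6]             # M (2x3) written in row-major order
M = read_row_major(buf, 2, 3)        # [[1, 2, 3], [4, 5, 6]]
Mt = read_column_major(buf, 3, 2)    # same buffer, read as a 3x2 column-major matrix

du = [1, 10]
dv = mat_vec(Mt, du)                 # equals M^T du
```

No element of `buf` is moved: only the indexing rule changes, which is why declaring a different layout is a cheap way to obtain the transpose.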
