
Assignment 1

Deriving the Softmax Gradient with Respect to W

Consider a single sample \(\mathbf{x}_i \in \mathbb{R}^D\) with ground-truth class label \(y_i \in \{0, 1, \dots, C-1\}\).

  • Weight matrix \(W \in \mathbb{R}^{D \times C}\): its \(j\)-th column \(\mathbf{W}_j\) is the weight vector for class \(j\).
  • Scores (logits):
\[ s_j = \mathbf{x}_i^T \mathbf{W}_j \quad \text{or} \quad \mathbf{s} = W^T \mathbf{x}_i \in \mathbb{R}^C \]
  • Softmax probabilities (the predicted distribution):
\[ p_j = \frac{e^{s_j}}{\sum_{k=1}^C e^{s_k}} = \frac{e^{\mathbf{x}_i^T \mathbf{W}_j}}{\sum_{k=1}^C e^{\mathbf{x}_i^T \mathbf{W}_k}} \]
  • Cross-entropy loss (single sample):
\[ L_i = -\log p_{y_i} = -\log \left( \frac{e^{s_{y_i}}}{\sum_{k=1}^C e^{s_k}} \right) = -s_{y_i} + \log \left( \sum_{k=1}^C e^{s_k} \right) \]
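
To make these definitions concrete, here is a minimal NumPy sketch that computes the scores, the softmax probabilities, and the single-sample loss; the sizes, data, and label are made up purely for illustration.

import numpy as np

# Illustrative sizes and data (not from the assignment itself)
D, C = 4, 3
rng = np.random.default_rng(0)
x_i = rng.standard_normal(D)             # one sample, shape (D,)
W = 0.01 * rng.standard_normal((D, C))   # weight matrix, shape (D, C)
y_i = 1                                  # assumed true class label

s = W.T @ x_i                            # scores, shape (C,)
s -= s.max()                             # numerical stability; does not change p
p = np.exp(s) / np.exp(s).sum()          # softmax probabilities, shape (C,)
L_i = -np.log(p[y_i])                    # cross-entropy loss for this sample
print(p, L_i)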

Goal: compute the gradient of the loss \(L_i\) with respect to the weight matrix \(W\), i.e. \(\frac{\partial L_i}{\partial W}\).


Since \(L_i\) depends on \(\mathbf{W}_j\) only through the score \(s_j\), we apply the chain rule:

\[ \frac{\partial L_i}{\partial \mathbf{W}_j} = \frac{\partial L_i}{\partial s_j} \cdot \frac{\partial s_j}{\partial \mathbf{W}_j} \]

where \(s_j = \mathbf{x}_i^T \mathbf{W}_j\), so \(\frac{\partial s_j}{\partial \mathbf{W}_j} = \mathbf{x}_i\) (the derivative of a scalar with respect to a vector).

Therefore:

\[ \boxed{\frac{\partial L_i}{\partial \mathbf{W}_j} = \frac{\partial L_i}{\partial s_j} \cdot \mathbf{x}_i} \]

The problem now reduces to computing \(\frac{\partial L_i}{\partial s_j}\).


Recall the loss function:

\[ L_i = -s_{y_i} + \log \left( \sum_{k=1}^C e^{s_k} \right) \]

We consider two cases:

✅ Case 1: \(j = y_i\) (the current class is the true class)

\[ \begin{align*} \frac{\partial L_i}{\partial s_j} &= \frac{\partial}{\partial s_j} \left( -s_j + \log \left( \sum_k e^{s_k} \right) \right) \\ &= -1 + \frac{1}{\sum_k e^{s_k}} \cdot \frac{\partial}{\partial s_j} \left( \sum_k e^{s_k} \right) \\ &= -1 + \frac{1}{\sum_k e^{s_k}} \cdot e^{s_j} \\ &= -1 + p_j \end{align*} \]

✅ Case 2: \(j \neq y_i\) (the current class is not the true class)

In this case \(-s_{y_i}\) does not depend on \(s_j\), so that term's derivative is 0.

\[ \begin{align*} \frac{\partial L_i}{\partial s_j} &= \frac{\partial}{\partial s_j} \left( \log \left( \sum_k e^{s_k} \right) \right) \\ &= \frac{1}{\sum_k e^{s_k}} \cdot e^{s_j} \\ &= p_j \end{align*} \]

Combining the two cases:

\[ \frac{\partial L_i}{\partial s_j} = \begin{cases} p_j - 1 & \text{if } j = y_i \\ p_j & \text{if } j \neq y_i \end{cases} \]
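
This expression is easy to verify numerically: a small finite-difference check on the scores, with illustrative values for \(C\), \(\mathbf{s}\), and \(y_i\), should reproduce \(p_j - \mathbf{1}(j = y_i)\).

import numpy as np

# Finite-difference check of dL/ds = p - one_hot(y); all data here is illustrative.
rng = np.random.default_rng(1)
C = 5
s = rng.standard_normal(C)
y = 2  # assumed true class

def loss_from_scores(s, y):
    s = s - s.max()                          # stability shift, leaves the loss unchanged
    return -s[y] + np.log(np.exp(s).sum())

p = np.exp(s - s.max()) / np.exp(s - s.max()).sum()
analytic = p.copy()
analytic[y] -= 1                             # p_j - 1{j == y}

h = 1e-6
numeric = np.zeros(C)
for j in range(C):
    e = np.zeros(C)
    e[j] = h
    numeric[j] = (loss_from_scores(s + e, y) - loss_from_scores(s - e, y)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))    # should be tiny (~1e-9 or smaller)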

Substituting into the chain rule:

\[ \boxed{ \frac{\partial L_i}{\partial \mathbf{W}_j} = \begin{cases} (p_j - 1) \cdot \mathbf{x}_i & \text{if } j = y_i \\ p_j \cdot \mathbf{x}_i & \text{if } j \neq y_i \end{cases} } \]

We can unify the two cases with a one-hot vector \(\mathbf{t} \in \mathbb{R}^C\), where \(t_j = \mathbf{1}(j = y_i)\), i.e. the one-hot encoding of the true label.

Define the predicted probability vector \(\mathbf{p} = [p_1, p_2, \dots, p_C]^T\).

Then:

\[ \nabla_{\mathbf{s}} L_i = \mathbf{p} - \mathbf{t} \]

Since \(\mathbf{s} = W^T \mathbf{x}_i\), we have:

\[ \nabla_W L_i = \mathbf{x}_i (\nabla_{\mathbf{s}} L_i)^T = \mathbf{x}_i (\mathbf{p} - \mathbf{t})^T \]

This outer product is a \(D \times C\) matrix whose \(j\)-th column is \((p_j - t_j)\,\mathbf{x}_i\), which agrees with the case-by-case result above.
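
A quick sketch of the outer-product form, again with made-up \(\mathbf{x}_i\), \(\mathbf{p}\), and \(y_i\), showing that each column of the outer product matches the case-by-case result:

import numpy as np

# Outer-product form of the per-sample gradient; the data is illustrative.
rng = np.random.default_rng(2)
D, C = 4, 3
x_i = rng.standard_normal(D)
p = rng.random(C)
p /= p.sum()                    # stand-in for a softmax output
y_i = 0
t = np.zeros(C)
t[y_i] = 1                      # one-hot label

dW_i = np.outer(x_i, p - t)     # shape (D, C)

# Column j equals (p_j - t_j) * x_i, matching the two cases above
assert np.allclose(dW_i[:, 1], (p[1] - t[1]) * x_i)
assert np.allclose(dW_i[:, y_i], (p[y_i] - 1) * x_i)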


For a minibatch of \(N\) samples, the total loss is:

\[ L = \frac{1}{N} \sum_{i=1}^N L_i + \lambda \|W\|^2 \]

The gradient is:

\[ \nabla_W L = \frac{1}{N} \sum_{i=1}^N \mathbf{x}_i (\mathbf{p}^{(i)} - \mathbf{t}^{(i)})^T + 2\lambda W \]

where \(\mathbf{p}^{(i)}\) and \(\mathbf{t}^{(i)}\) are the predicted probability vector and the one-hot label of the \(i\)-th sample.


The corresponding code:

import numpy as np

def softmax_loss_naive(W, X, y, reg):
    loss = 0.0
    dW = np.zeros_like(W)
    num_train = X.shape[0]

    for i in range(num_train):
        scores = X[i].dot(W)
        scores -= np.max(scores)  # numerical stability: shift so the max score is 0
        exp_scores = np.exp(scores)
        probs = exp_scores / np.sum(exp_scores)

        loss -= np.log(probs[y[i]])

        # Accumulate the gradient, one class (column of W) at a time
        for j in range(W.shape[1]):
            if j == y[i]:
                dW[:, j] += (probs[j] - 1) * X[i]
            else:
                dW[:, j] += probs[j] * X[i]

    loss /= num_train
    loss += reg * np.sum(W * W)
    dW /= num_train
    dW += 2 * reg * W

    return loss, dW
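
A quick way to sanity-check this implementation is to compare one entry of dW against a numerical gradient on tiny random inputs; the shapes and data below are arbitrary assumptions.

import numpy as np

# Tiny random problem for a finite-difference check; shapes are illustrative.
rng = np.random.default_rng(0)
N, D, C = 5, 6, 3
X = rng.standard_normal((N, D))
y = rng.integers(0, C, size=N)
W = 0.01 * rng.standard_normal((D, C))
reg = 0.1

loss, dW = softmax_loss_naive(W, X, y, reg)

h = 1e-5
idx = (2, 1)                                 # check one arbitrary entry of W
W_plus, W_minus = W.copy(), W.copy()
W_plus[idx] += h
W_minus[idx] -= h
num_grad = (softmax_loss_naive(W_plus, X, y, reg)[0]
            - softmax_loss_naive(W_minus, X, y, reg)[0]) / (2 * h)
print(dW[idx], num_grad)                     # the two values should agree closely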

We can vectorize this computation:

\[ \nabla_W L = \frac{1}{N} \mathbf{X}^T (\mathbf{P} - \mathbf{T}) + 2\lambda W \]

where \(\mathbf{P}\) and \(\mathbf{T}\) are \(N \times C\) matrices (the matrix of predicted probabilities and the one-hot label matrix, respectively), and \(\mathbf{X}\) is the \(N \times D\) data matrix whose \(i\)-th row is \(\mathbf{x}_i^T\), matching the layout used in the code below.
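
As a direct transcription of this formula, here is a sketch that builds \(\mathbf{T}\) explicitly (the vectorized implementation below avoids materializing \(\mathbf{T}\) by subtracting 1 in place); the random data is illustrative.

import numpy as np

# Direct transcription of the vectorized gradient formula with an explicit one-hot matrix T.
rng = np.random.default_rng(3)
N, D, C = 5, 6, 3
X = rng.standard_normal((N, D))
y = rng.integers(0, C, size=N)
W = 0.01 * rng.standard_normal((D, C))
reg = 0.1

scores = X @ W                                                   # (N, C)
scores -= scores.max(axis=1, keepdims=True)                      # numerical stability
P = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # (N, C)
T = np.eye(C)[y]                                                 # one-hot labels, (N, C)

dW = X.T @ (P - T) / N + 2 * reg * W                             # (D, C)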

The corresponding code:

def softmax_loss_vectorized(W, X, y, reg):
    """
    Softmax loss function, vectorized version.

    Inputs and outputs are the same as softmax_loss_naive.
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)

    num_train = X.shape[0]

    # 1. Compute scores (N x C)
    scores = X.dot(W)  # Shape: (N, C)

    # 2. Numerical stability: subtract max from each row
    scores -= np.max(scores, axis=1, keepdims=True)  # row-wise max has shape (N, 1); scores stays (N, C)

    # 3. Compute softmax probabilities
    exp_scores = np.exp(scores)  # Shape: (N, C)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)  # Shape: (N, C)

    # 4. Compute the loss: average cross-entropy loss + regularization
    correct_logprobs = -np.log(probs[np.arange(num_train), y])  # Log probability of true class
    loss = np.sum(correct_logprobs) / num_train
    loss += reg * np.sum(W * W)  # Add regularization

    # 5. Compute the gradient
    # Start with the gradient of the softmax loss
    dscores = probs.copy()  # Shape: (N, C)
    dscores[np.arange(num_train), y] -= 1  # Subtract 1 from correct class scores
    dscores /= num_train  # Average over the batch

    # Backpropagate gradient to W: dW = X^T * dscores
    dW = X.T.dot(dscores)  # Shape: (D, C)

    # Add gradient of regularization
    dW += 2 * reg * W

    return loss, dW
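
Assuming both functions above are defined in the same file, a quick consistency check with arbitrary random inputs shows that the two versions agree up to numerical precision.

import numpy as np

# Consistency check: the naive and vectorized versions should match closely.
rng = np.random.default_rng(4)
N, D, C = 10, 8, 4
X = rng.standard_normal((N, D))
y = rng.integers(0, C, size=N)
W = 0.01 * rng.standard_normal((D, C))
reg = 0.05

loss_naive, dW_naive = softmax_loss_naive(W, X, y, reg)
loss_vec, dW_vec = softmax_loss_vectorized(W, X, y, reg)

print(abs(loss_naive - loss_vec))                     # expect ~0
print(np.linalg.norm(dW_naive - dW_vec, ord='fro'))   # expect ~0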

If this helped you, please give it a like and a star => GitHub stars
Feel free to discuss with me!