Assignment 1
Deriving the Gradient of the Softmax Loss with Respect to W
Consider a single sample \(\mathbf{x}_i \in \mathbb{R}^D\) with ground-truth class label \(y_i \in \{0, 1, \dots, C-1\}\).
- Weight matrix \(W \in \mathbb{R}^{D \times C}\): its \(j\)-th column \(\mathbf{W}_j\) is the weight vector of class \(j\).
- Scores (logits): \(\mathbf{s} = W^T \mathbf{x}_i \in \mathbb{R}^C\), with components \(s_j = \mathbf{x}_i^T \mathbf{W}_j\).
- Softmax probabilities (predicted distribution): \(p_j = \frac{e^{s_j}}{\sum_k e^{s_k}}\) (see the remark on numerical stability after this list).
- Cross-entropy loss (single sample): \(L_i = -\log p_{y_i} = -s_{y_i} + \log \sum_k e^{s_k}\).
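A remark that both implementations below rely on: the softmax probabilities are unchanged when a constant is subtracted from every score, which is exploited for numerical stability by subtracting the maximum score before exponentiating:
\[
\frac{e^{s_j - c}}{\sum_k e^{s_k - c}} = \frac{e^{-c}\, e^{s_j}}{e^{-c} \sum_k e^{s_k}} = \frac{e^{s_j}}{\sum_k e^{s_k}}
\qquad \text{for any constant } c,\ \text{e.g. } c = \max_k s_k .
\]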
Goal: compute the gradient \(\frac{\partial L_i}{\partial W}\) of the loss \(L_i\) with respect to the weight matrix \(W\).
Since \(L_i\) depends on \(\mathbf{W}_j\) only through the score \(s_j\), we apply the chain rule:
\[
\frac{\partial L_i}{\partial \mathbf{W}_j} = \frac{\partial L_i}{\partial s_j} \cdot \frac{\partial s_j}{\partial \mathbf{W}_j}
\]
where:
- \(s_j = \mathbf{x}_i^T \mathbf{W}_j\)
- \(\frac{\partial s_j}{\partial \mathbf{W}_j} = \mathbf{x}_i\) (the derivative of a scalar with respect to a vector)
Therefore:
\[
\frac{\partial L_i}{\partial \mathbf{W}_j} = \frac{\partial L_i}{\partial s_j}\, \mathbf{x}_i
\]
The problem now reduces to computing \(\frac{\partial L_i}{\partial s_j}\).
Recall the loss function:
\[
L_i = -\log p_{y_i} = -s_{y_i} + \log \sum_k e^{s_k}
\]
We distinguish two cases:
✅ Case 1: \(j = y_i\) (the current class is the true class)
\[
\frac{\partial L_i}{\partial s_{y_i}} = -1 + \frac{e^{s_{y_i}}}{\sum_k e^{s_k}} = p_{y_i} - 1
\]
✅ Case 2: \(j \neq y_i\) (the current class is not the true class)
Here \(-s_{y_i}\) does not depend on \(s_j\), so its derivative is 0 and only the log-sum-exp term contributes:
\[
\frac{\partial L_i}{\partial s_j} = \frac{e^{s_j}}{\sum_k e^{s_k}} = p_j
\]
Putting the two cases together:
\[
\frac{\partial L_i}{\partial s_j} = p_j - \mathbf{1}(j = y_i)
\]
Substituting into the chain rule:
\[
\frac{\partial L_i}{\partial \mathbf{W}_j} = \big(p_j - \mathbf{1}(j = y_i)\big)\,\mathbf{x}_i
\]
We can express both cases uniformly with a one-hot vector \(\mathbf{t} \in \mathbb{R}^C\), where \(t_j = \mathbf{1}(j = y_i)\), i.e. the one-hot encoding of the true label.
Define the predicted probability vector \(\mathbf{p} = [p_1, p_2, \dots, p_C]^T\).
Then:
\[
\frac{\partial L_i}{\partial \mathbf{s}} = \mathbf{p} - \mathbf{t}
\]
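As a quick sanity check with made-up numbers: suppose \(C = 3\), \(\mathbf{p} = [0.2, 0.5, 0.3]^T\), and \(y_i = 1\), so \(\mathbf{t} = [0, 1, 0]^T\). Then
\[
\frac{\partial L_i}{\partial \mathbf{s}} = \mathbf{p} - \mathbf{t} = [0.2,\; -0.5,\; 0.3]^T,
\]
so a gradient-descent step raises the true-class score in proportion to \(1 - p_{y_i}\) and lowers each wrong-class score in proportion to its predicted probability.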
Since \(\mathbf{s} = W^T \mathbf{x}_i\), it follows that:
\[
\frac{\partial L_i}{\partial W} = \mathbf{x}_i\, (\mathbf{p} - \mathbf{t})^T
\]
This outer product is a \(D \times C\) matrix whose \(j\)-th column is \((p_j - t_j)\,\mathbf{x}_i\), consistent with the case analysis above.
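A minimal NumPy sketch of this single-sample gradient; the sizes, the sample x, and the label y below are made up for illustration:

```python
import numpy as np

D, C = 4, 3                                   # made-up feature / class counts
x = np.random.randn(D)                        # one sample x_i
s = np.random.randn(C)                        # its scores (made up here)
p = np.exp(s - s.max())
p /= p.sum()                                  # softmax probabilities
y = 1                                         # made-up true label
t = np.zeros(C)
t[y] = 1.0                                    # one-hot target

dW_i = np.outer(x, p - t)                     # x_i (p - t)^T, shape (D, C)

# Column j equals (p_j - t_j) * x_i, matching the case analysis above.
for j in range(C):
    assert np.allclose(dW_i[:, j], (p[j] - t[j]) * x)
```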
For a minibatch of \(N\) samples, the total loss is the average cross-entropy plus an L2 regularization term (matching the code below):
\[
L = \frac{1}{N} \sum_{i=1}^{N} L_i + \lambda \sum_{k,l} W_{k,l}^2
\]
where \(\lambda\) denotes the regularization strength (reg in the code).
The gradient is:
\[
\frac{\partial L}{\partial W} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \big(\mathbf{p}^{(i)} - \mathbf{t}^{(i)}\big)^T + 2\lambda W
\]
where \(\mathbf{p}^{(i)}\) and \(\mathbf{t}^{(i)}\) are the predicted probability vector and the one-hot label of the \(i\)-th sample.
The corresponding code:
```python
import numpy as np

def softmax_loss_naive(W, X, y, reg):
    """
    Softmax loss function, naive implementation with explicit loops.

    Inputs:
    - W: (D, C) weight matrix
    - X: (N, D) minibatch of data
    - y: (N,) array of training labels
    - reg: (float) regularization strength

    Returns a tuple of (loss, gradient dW with the same shape as W).
    """
    loss = 0.0
    dW = np.zeros_like(W)
    num_train = X.shape[0]

    for i in range(num_train):
        scores = X[i].dot(W)
        scores -= np.max(scores)                  # numerical stability
        exp_scores = np.exp(scores)
        probs = exp_scores / np.sum(exp_scores)
        loss -= np.log(probs[y[i]])

        # Gradient w.r.t. each column: dW[:, j] += (p_j - 1(j == y_i)) * x_i
        for j in range(W.shape[1]):
            if j == y[i]:
                dW[:, j] += (probs[j] - 1) * X[i]
            else:
                dW[:, j] += probs[j] * X[i]

    loss /= num_train
    loss += reg * np.sum(W * W)
    dW /= num_train
    dW += 2 * reg * W
    return loss, dW
```
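Before moving on, the analytic gradient can be checked against a numerical (central-difference) gradient. A minimal sketch, assuming softmax_loss_naive as defined above and made-up random shapes and data:

```python
import numpy as np

np.random.seed(0)
N, D, C = 5, 4, 3                             # made-up sizes
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)
W = np.random.randn(D, C) * 0.01
reg = 0.1

loss, dW = softmax_loss_naive(W, X, y, reg)

h = 1e-5
for _ in range(5):                            # check a few random entries of dW
    ix = tuple(np.random.randint(s) for s in W.shape)
    W[ix] += h
    loss_plus, _ = softmax_loss_naive(W, X, y, reg)
    W[ix] -= 2 * h
    loss_minus, _ = softmax_loss_naive(W, X, y, reg)
    W[ix] += h                                # restore W
    grad_num = (loss_plus - loss_minus) / (2 * h)
    rel_err = abs(grad_num - dW[ix]) / max(abs(grad_num), abs(dW[ix]), 1e-12)
    print(ix, rel_err)                        # small values (e.g. < 1e-6) indicate a correct gradient
```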
We can vectorize this computation:
\[
\frac{\partial L}{\partial W} = \frac{1}{N}\, X^T (\mathbf{P} - \mathbf{T}) + 2\lambda W
\]
where \(\mathbf{P}\) and \(\mathbf{T}\) are both \(N \times C\) matrices, namely the matrix of predicted probabilities and the matrix of one-hot labels (row \(i\) holds \(\mathbf{p}^{(i)}\) and \(\mathbf{t}^{(i)}\)), and \(X\) is the \(N \times D\) data matrix, so \(X^T (\mathbf{P} - \mathbf{T})\) has shape \(D \times C\).
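For illustration only, a hedged sketch that materializes \(\mathbf{T}\) explicitly with np.eye (the helper name softmax_grad_with_onehot is made up); the implementation below reaches the same result without building \(\mathbf{T}\), by subtracting 1 at the label positions:

```python
import numpy as np

def softmax_grad_with_onehot(W, X, y, reg):
    # Gradient via the explicit matrix form dW = X^T (P - T) / N + 2 * reg * W
    N = X.shape[0]
    scores = X.dot(W)
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)               # (N, C) probability matrix
    T = np.eye(W.shape[1])[y]                       # (N, C) one-hot label matrix
    return X.T.dot(P - T) / N + 2 * reg * W         # (D, C)
```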
The corresponding code:
```python
def softmax_loss_vectorized(W, X, y, reg):
    """
    Softmax loss function, vectorized version.
    Inputs and outputs are the same as softmax_loss_naive.
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)
    num_train = X.shape[0]

    # 1. Compute scores (N x C)
    scores = X.dot(W)                                                # Shape: (N, C)

    # 2. Numerical stability: subtract the row-wise max from each row
    scores -= np.max(scores, axis=1, keepdims=True)                  # max has shape (N, 1); scores stays (N, C)

    # 3. Compute softmax probabilities
    exp_scores = np.exp(scores)                                      # Shape: (N, C)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)   # Shape: (N, C)

    # 4. Compute the loss: average cross-entropy loss + regularization
    correct_logprobs = -np.log(probs[np.arange(num_train), y])       # Log probability of the true class
    loss = np.sum(correct_logprobs) / num_train
    loss += reg * np.sum(W * W)                                      # Add regularization

    # 5. Compute the gradient
    # Start with the gradient of the loss w.r.t. the scores: P - T
    dscores = probs.copy()                                           # Shape: (N, C)
    dscores[np.arange(num_train), y] -= 1                            # Subtract 1 at the correct class
    dscores /= num_train                                             # Average over the batch

    # Backpropagate to W: dW = X^T (P - T) / N
    dW = X.T.dot(dscores)                                            # Shape: (D, C)

    # Add gradient of regularization
    dW += 2 * reg * W

    return loss, dW
```
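Finally, a minimal usage sketch comparing the two implementations on random data; the shapes and values are made up, and both functions are assumed to be defined as above:

```python
import numpy as np

np.random.seed(1)
N, D, C = 100, 20, 10                         # made-up sizes
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)
W = np.random.randn(D, C) * 0.001
reg = 0.05

loss_naive, dW_naive = softmax_loss_naive(W, X, y, reg)
loss_vec, dW_vec = softmax_loss_vectorized(W, X, y, reg)

# Both differences should be at (or very near) floating-point zero.
print("loss difference:", abs(loss_naive - loss_vec))
print("gradient difference:", np.linalg.norm(dW_naive - dW_vec, ord="fro"))
```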