Softmax Attention

agicy · 2026/6/6 · about a 1-minute read

Original problem: LeetGPU - Softmax Attention

Problem Description

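Write a GPU program that computes the softmax attention operation on matrices. Given a query matrix $Q$ ($M \times d$), a key matrix $K$ ($N \times d$), and a value matrix $V$ ($N \times d$), compute the output matrix: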

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^T}{\sqrt{d}}\right) V$$

where softmax is applied row-wise.
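Written out element by element, "row-wise" means that each query row is normalized over all $N$ keys. The symbols $S$ and $P$ below are shorthand introduced here for the raw scores and the softmax weights; $i$ indexes queries, $j$ indexes keys, and $c$ indexes the $d$ feature columns:

$$S_{ij} = \sum_{c=1}^{d} Q_{ic} K_{jc}, \qquad P_{ij} = \frac{\exp\!\left(S_{ij} / \sqrt{d}\right)}{\sum_{j'=1}^{N} \exp\!\left(S_{ij'} / \sqrt{d}\right)}, \qquad \text{Attention}(Q, K, V)_{ic} = \sum_{j=1}^{N} P_{ij} V_{jc}$$

Each row of $P$ sums to 1, so every output row is a weighted average of the rows of $V$.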

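Implementation Requirements

Only native GPU features are allowed (no external libraries).
The solve function signature must remain unchanged.
The final result must be stored in the output matrix output.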

Examples

Example 1

Input:  Q (2×4): [[1,0,0,0], [0,1,0,0]]
        K (3×4): [[1,0,0,0], [0,1,0,0], [0,0,1,0]]
        V (3×4): [[1,2,3,4], [5,6,7,8], [9,10,11,12]]
Output: (2×4): [[4.29, 5.29, 6.29, 7.29], [5.00, 6.00, 7.00, 8.00]]

Example 2

Input:  Q (1×2): [[1, 2]]
        K (2×2): [[1, 0], [0, 1]]
        V (2×2): [[3, 4], [5, 6]]
Output: (1×2): [[4.34, 5.34]]
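Both examples can be reproduced with a small CPU reference. The sketch below is plain Python rather than GPU code; the function name `attention` and the list-of-lists layout are choices made here for illustration, not part of the problem statement.

```python
import math

def attention(Q, K, V):
    """Naive softmax attention on lists of lists: softmax(Q K^T / sqrt(d)) V."""
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in Q:
        # Scaled score of this query against every key.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        # Row softmax; the row maximum is subtracted before exponentiating.
        mx = max(scores)
        exps = [math.exp(s - mx) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the value rows.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

# Example 1
Q1 = [[1, 0, 0, 0], [0, 1, 0, 0]]
K1 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
V1 = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
out1 = attention(Q1, K1, V1)

# Example 2
Q2 = [[1, 2]]
K2 = [[1, 0], [0, 1]]
V2 = [[3, 4], [5, 6]]
out2 = attention(Q2, K2, V2)

for row in out1 + out2:
    print(["%.2f" % x for x in row])
```

A reference like this is also handy for checking a GPU kernel on small random inputs before submitting it.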

Constraints

$Q$ is $M \times d$; $K$ and $V$ are $N \times d$.
$1 \le M, N \le 100{,}000$ and $1 \le d \le 128$.
Performance is measured with $M = 512,\ N = 256$.
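To get a feel for these bounds, it helps to estimate how much memory the intermediate $M \times N$ score matrix $QK^T$ would take if it were stored in full. The helper name below is hypothetical, and 4 bytes per element assumes `float` storage, matching the `float*` arguments of the solve function.

```python
def score_matrix_bytes(M, N, bytes_per_float=4):
    """Bytes needed to hold a full M x N matrix of 32-bit floats."""
    return M * N * bytes_per_float

benchmark = score_matrix_bytes(512, 256)            # performance-test size
worst_case = score_matrix_bytes(100_000, 100_000)   # upper constraint bounds

print(benchmark)    # 524288 bytes, i.e. 512 KiB
print(worst_case)   # 40000000000 bytes, i.e. 40 GB
```

The benchmark size fits comfortably, but a solution that must be correct over the whole constraint range cannot rely on materializing all scores at once.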

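Softmax attention is the core operation of the Transformer and breaks down into three steps: $S = QK^T$ (matrix multiplication), $P = \text{row-softmax}(S / \sqrt{d})$, and $O = PV$ (matrix multiplication). The main challenge is the numerical stability of softmax: exponentiating large scores directly can overflow, so each row's maximum is found first and subtracted from every score before exponentiating (the "max trick"), which leaves the result unchanged. When $N$ is large, the $M \times N$ score matrix $QK^T$ can be too large to store, so it has to be computed in tiles to save GPU memory; this is the core idea of FlashAttention.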
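The link between the max trick and tiling can be sketched for a single query row. The code below is an illustrative plain-Python sketch of the streaming ("online") softmax idea, not FlashAttention itself; the function names and the `block` parameter are made up for this example. It keeps a running maximum, a running denominator, and a running weighted sum, and rescales the old state whenever a new block raises the maximum.

```python
import math

def softmax_weighted_sum(scores, values):
    """Two-pass reference: find the max, then exponentiate and normalize."""
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    return sum(e * v for e, v in zip(exps, values)) / sum(exps)

def online_softmax_weighted_sum(scores, values, block=4):
    """One pass over blocks of keys; only O(1) state is kept per query row."""
    m = float("-inf")  # running max of the scores seen so far
    l = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running sum of exp(score - m) * value
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        m_new = max(m, max(s_blk))
        corr = math.exp(m - m_new)  # rescales old state; 0.0 on the first block
        p = [math.exp(s - m_new) for s in s_blk]
        l = l * corr + sum(p)
        acc = acc * corr + sum(pi * vi for pi, vi in zip(p, v_blk))
        m = m_new
    return acc / l

# Scores this large would overflow math.exp without the max subtraction.
scores = [1000.0, 1001.0, 999.5, 1002.0, 998.0, 1001.5, 1000.5]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
two_pass = softmax_weighted_sum(scores, values)
one_pass = online_softmax_weighted_sum(scores, values, block=3)
print(two_pass, one_pass)
```

Both functions should agree up to floating-point rounding. In a real attention kernel the scalar `acc` becomes a vector of length $d$, one entry per column of $V$.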

Implementation

CUDA

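#include <cuda_runtime.h>
#include <math.h>

#define MAX_D 128  // from the constraint 1 <= d <= 128

// One thread per query row. No per-row buffer of length N is kept,
// so N can be as large as the constraints allow.
__global__ void attention_kernel(const float* Q, const float* K, const float* V,
    float* O, int M, int N, int d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // query position
    if (i >= M) return;  // guard the last, partially filled block
    float scale = 1.0f / sqrtf((float)d);
    // Pass 1: row maximum of the scaled scores (needed for the max trick)
    float mx = -INFINITY;
    for (int j = 0; j < N; j++) {
        float dot = 0.0f;
        for (int k = 0; k < d; k++) dot += Q[i*d+k] * K[j*d+k];
        mx = fmaxf(mx, dot * scale);
    }
    // Pass 2: recompute each score, accumulate the softmax denominator
    // and the unnormalized weighted sum of the V rows
    float acc[MAX_D];
    for (int k = 0; k < d; k++) acc[k] = 0.0f;
    float sm = 0.0f;
    for (int j = 0; j < N; j++) {
        float dot = 0.0f;
        for (int k = 0; k < d; k++) dot += Q[i*d+k] * K[j*d+k];
        float w = expf(dot * scale - mx);
        sm += w;
        for (int k = 0; k < d; k++) acc[k] += w * V[j*d+k];
    }
    // Normalize by the softmax denominator
    for (int k = 0; k < d; k++) O[i*d+k] = acc[k] / sm;
}
extern "C" void solve(const float* Q, const float* K, const float* V, float* O, int M, int N, int d) {
    int threads = 128;
    int blocks = (M + threads - 1) / threads;
    attention_kernel<<<blocks, threads>>>(Q, K, V, O, M, N, d);
    cudaDeviceSynchronize();
}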

Triton

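import math
import triton
import triton.language as tl

@triton.jit
def attention_kernel(Q_ptr, K_ptr, V_ptr, O_ptr, M, N, d, scale,
                     BLOCK_M: tl.constexpr, BLOCK_D: tl.constexpr):
    i = tl.program_id(0) * BLOCK_M + tl.arange(0, BLOCK_M)  # query rows of this program
    mask_i = i < M
    d_range = tl.arange(0, BLOCK_D)  # BLOCK_D is d rounded up to a power of two
    mask_d = d_range < d
    mask = mask_i[:, None] & mask_d[None, :]
    qi = tl.load(Q_ptr + i[:, None]*d + d_range[None, :], mask=mask, other=0.0)
    m_i = tl.full((BLOCK_M,), float("-inf"), tl.float32)  # running row max
    l_i = tl.zeros((BLOCK_M,), tl.float32)                # running softmax denominator
    acc = tl.zeros((BLOCK_M, BLOCK_D), tl.float32)        # unnormalized output rows
    for j in range(N):
        kj = tl.load(K_ptr + j*d + d_range, mask=mask_d, other=0.0)
        vj = tl.load(V_ptr + j*d + d_range, mask=mask_d, other=0.0)
        score = tl.sum(qi * kj[None, :], axis=1) * scale
        # Online softmax over the keys: rescale the old state when the max grows
        m_new = tl.maximum(m_i, score)
        corr = tl.exp(m_i - m_new)
        p = tl.exp(score - m_new)
        l_i = l_i * corr + p
        acc = acc * corr[:, None] + p[:, None] * vj[None, :]
        m_i = m_new
    tl.store(O_ptr + i[:, None]*d + d_range[None, :], acc / l_i[:, None], mask=mask)

def solve(Q, K, V, output, M, N, d):
    # Launcher sketch: the argument order mirrors the CUDA solve and is an assumption
    BLOCK_M = 16
    attention_kernel[(triton.cdiv(M, BLOCK_M),)](Q, K, V, output, M, N, d, 1.0 / math.sqrt(d),
                                                 BLOCK_M=BLOCK_M, BLOCK_D=triton.next_power_of_2(d))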