Causal Self-Attention

agicy · 2026/6/6 · about 1 minute read

Original problem: LeetGPU - Causal Self-Attention

Problem Description

Implement causal (masked) self-attention. Given a query matrix $Q$ ($M \times d$), a key matrix $K$ ($M \times d$), and a value matrix $V$ ($M \times d$), compute:

$$\text{Attention}_{\text{causal}}(Q, K, V) = \text{softmax}\!\left(\text{mask}\!\left(\frac{Q K^T}{\sqrt{d}}\right)\right) V$$

where the causal mask sets every key position after the current query position to $-\infty$:

$$\text{mask}(a_{ij}) = \begin{cases} a_{ij}, & j \le i \\ -\infty, & j > i \end{cases}$$

The softmax is applied row-wise. All data is float32.
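The definition above can be sketched as a CPU reference in plain Python, useful for checking a GPU kernel on small inputs. This is a minimal sketch, not the submitted solution: the function name `causal_attention` and the list-of-lists layout are assumptions made for illustration. Masked positions receive weight $e^{-\infty} = 0$, so the sketch simply skips keys with $j > i$.

```python
import math

def causal_attention(Q, K, V):
    """Reference causal self-attention on M x d lists of lists (hypothetical helper)."""
    M, d = len(Q), len(Q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for i in range(M):
        # Keys with j > i are masked to -inf, i.e. weight 0, so only j <= i is visited.
        scores = [scale * sum(q * k for q, k in zip(Q[i], K[j])) for j in range(i + 1)]
        mx = max(scores)  # subtract the row maximum for numerical stability
        weights = [math.exp(s - mx) for s in scores]
        total = sum(weights)
        out.append([sum(w * V[j][c] for j, w in enumerate(weights)) / total
                    for c in range(d)])
    return out
```

Because row $i$ only touches keys $0..i$, the work per row grows linearly with $i$, which is roughly half the cost of unmasked attention over the whole matrix.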

Implementation Requirements

External libraries are not allowed.
The signature of the solve function must remain unchanged.

Example

Q (2×4): [[1,0,0,0],[0,1,0,0]]
K (2×4): [[1,0,0,0],[0,1,0,0]]
V (2×4): [[1,2,3,4],[5,6,7,8]]
→ Row 0 attends only to pos 0; Row 1 attends to pos 0 and pos 1
Output (2×4): [[1,2,3,4],[3.49,4.49,5.49,6.49]]
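
The numbers in this example can be checked by hand, with $\sqrt{d} = \sqrt{4} = 2$. For row 0 only key 0 is visible, so its softmax weight is 1 and the output is simply $V_0 = [1,2,3,4]$. For row 1:

$$\frac{q_1 \cdot k_0}{\sqrt{d}} = 0, \qquad \frac{q_1 \cdot k_1}{\sqrt{d}} = \frac{1}{2}$$

$$\text{softmax}\!\left([0,\ 0.5]\right) = \left[\frac{1}{1 + e^{0.5}},\ \frac{e^{0.5}}{1 + e^{0.5}}\right] \approx [0.3775,\ 0.6225]$$

$$O_1 \approx 0.3775 \cdot [1,2,3,4] + 0.6225 \cdot [5,6,7,8] \approx [3.49,\ 4.49,\ 5.49,\ 6.49]$$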

Constraints

$1 \le M \le 10{,}000$, $1 \le d \le 128$.
All elements lie in the range $[-100, 100]$.
Performance is measured at $M = 5{,}000$.
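
These bounds suggest that materializing the full score matrix is costly, which is one motivation for processing scores in a streaming fashion. The back-of-the-envelope calculation below is only a sketch; the helper names `score_matrix_bytes` and `causal_entries` are made up for illustration.

```python
def score_matrix_bytes(M, bytes_per_elem=4):
    """Bytes needed to hold a full M x M score matrix (float32 by default)."""
    return M * M * bytes_per_elem

def causal_entries(M):
    """Number of unmasked entries (j <= i) in an M x M causal score matrix."""
    return M * (M + 1) // 2

print(score_matrix_bytes(10_000))  # 400000000 bytes, i.e. 400 MB at the largest M
print(score_matrix_bytes(5_000))   # 100000000 bytes, i.e. 100 MB at the benchmark size
print(causal_entries(5_000))       # 12502500 score entries actually needed
```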

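Approach

Causal self-attention is standard self-attention plus a lower-triangular mask. The mask can be implemented by setting the strictly upper-triangular entries ($j > i$) of the score matrix to $-\infty$ before the softmax, so that they receive zero weight. For a decoder in the generation phase, each step has only one query token, so $Q$ is $1 \times d$ and $QK^T$ is a $1 \times M$ vector. The arithmetic is tiny; the bottleneck is loading all of $K$ and $V$ (the KV-cache).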
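
The single-query decode step described above can be sketched in plain Python. This is an illustration under stated assumptions, not part of the original solution: the name `decode_step` and the list-based cache layout are hypothetical. It uses a streaming (online) softmax, which keeps a running maximum and rescales earlier terms, so the cache is read once and no score buffer is needed. No mask appears because the newest query is allowed to see every cached position.

```python
import math

def decode_step(q, K_cache, V_cache):
    """Attention output for one new query q (length d) over a non-empty KV-cache."""
    d = len(q)
    scale = 1.0 / math.sqrt(d)
    mx, denom = -math.inf, 0.0   # running max and running softmax denominator
    acc = [0.0] * d              # running weighted sum of value rows
    for k_row, v_row in zip(K_cache, V_cache):
        s = scale * sum(a * b for a, b in zip(q, k_row))
        new_mx = max(mx, s)
        corr = math.exp(mx - new_mx)  # rescales earlier terms; exp(-inf) = 0 on the first step
        p = math.exp(s - new_mx)
        denom = denom * corr + p
        acc = [a * corr + p * v for a, v in zip(acc, v_row)]
        mx = new_mx
    return [a / denom for a in acc]
```

Each iteration does $O(d)$ arithmetic on $2d$ freshly loaded values, which is why memory traffic for $K$ and $V$ rather than compute tends to dominate this phase.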

Implementation

CUDA

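#include <cuda_runtime.h>
#include <math.h>

#define MAX_D 128  // from the constraint d <= 128

// One thread per query row i. A streaming (online) softmax keeps a running
// max, a running denominator, and a running output row, so no O(M) score
// buffer is needed and rows up to M = 10,000 are handled safely.
__global__ void causal_attn(const float* Q, const float* K, const float* V, float* O, int M, int d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= M) return;
    float scale = 1.0f / sqrtf((float)d);
    float acc[MAX_D];
    for (int k = 0; k < d; k++) acc[k] = 0.0f;
    float mx = -INFINITY, sm = 0.0f;
    for (int j = 0; j <= i; j++) {  // causal mask: only keys j <= i contribute
        float dot = 0.0f;
        for (int k = 0; k < d; k++) dot += Q[i*d+k] * K[j*d+k];
        float s = dot * scale;
        float new_mx = fmaxf(mx, s);
        float corr = expf(mx - new_mx);  // rescale earlier terms; 0 on the first iteration
        float p = expf(s - new_mx);
        sm = sm * corr + p;
        for (int k = 0; k < d; k++) acc[k] = acc[k] * corr + p * V[j*d+k];
        mx = new_mx;
    }
    for (int k = 0; k < d; k++) O[i*d+k] = acc[k] / sm;
}

// Q, K, V, O are device pointers.
extern "C" void solve(const float* Q, const float* K, const float* V, float* O, int M, int d) {
    int threads = 128;
    int blocks = (M + threads - 1) / threads;
    causal_attn<<<blocks, threads>>>(Q, K, V, O, M, d);
    cudaDeviceSynchronize();
}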

Triton

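import triton, triton.language as tl
# Assumed host-side launch: grid=(triton.cdiv(M, BLOCK),), scale=d ** -0.5, BLOCK_D=triton.next_power_of_2(d)
@triton.jit
def causal_attn(Q_ptr, K_ptr, V_ptr, O_ptr, M, d, scale, BLOCK: tl.constexpr, BLOCK_D: tl.constexpr):
    # Each program handles BLOCK query rows; BLOCK_D is a power of two >= d.
    i = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK); mask = i < M
    d_range = tl.arange(0, BLOCK_D); d_mask = d_range < d
    io_mask = mask[:, None] & d_mask[None, :]
    qi = tl.load(Q_ptr + i[:, None] * d + d_range[None, :], mask=io_mask, other=0.0)
    mx = tl.full((BLOCK,), float('-inf'), tl.float32)  # running row max
    sm = tl.zeros((BLOCK,), tl.float32)                # running softmax denominator
    acc = tl.zeros((BLOCK, BLOCK_D), tl.float32)       # running weighted sum of V rows
    for j in range(0, M):
        kj = tl.load(K_ptr + j * d + d_range, mask=d_mask, other=0.0)
        vj = tl.load(V_ptr + j * d + d_range, mask=d_mask, other=0.0)
        score = tl.sum(qi * kj[None, :], axis=1) * scale
        score = tl.where(i >= j, score, float('-inf'))  # causal mask: key j is visible only to rows i >= j
        new_mx = tl.maximum(mx, score)
        corr = tl.exp(mx - new_mx); p = tl.exp(score - new_mx)  # online-softmax rescaling
        sm = sm * corr + p
        acc = acc * corr[:, None] + p[:, None] * vj[None, :]
        mx = new_mx
    tl.store(O_ptr + i[:, None] * d + d_range[None, :], acc / sm[:, None], mask=io_mask)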