Attention with Linear Biases

agicy · 2026/6/6 · about 1 minute read

Original problem: LeetGPU - Attention with Linear Biases

Problem Description

Implement the ALiBi (Attention with Linear Biases) attention mechanism proposed in the paper "Train Short, Test Long". Given a query matrix $Q$ ($M \times d$), a key matrix $K$ ($N \times d$), and a value matrix $V$ ($N \times d$), compute:

$$
\text{Attention}_{\text{ALiBi}}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^T}{\sqrt{d}} + \alpha \cdot \Delta\right) V
$$

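Here $\alpha$ is the slope that controls the linear bias, and $\Delta$ is the $M \times N$ matrix with entries $\Delta_{ij} = i - j$, the relative position between query $i$ and key $j$. The softmax is applied row-wise, and all data is float32.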
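To make the formula concrete, here is a minimal pure-Python reference sketch. It is not the GPU solution; the function name `alibi_attention` and the nested-list representation are assumptions chosen for illustration, and the inputs are those of the problem's Example 2.

```python
import math

def alibi_attention(Q, K, V, alpha):
    """Reference ALiBi attention on nested lists: Q is M x d, K and V are N x d."""
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for i, q in enumerate(Q):
        # score_j = (q . k_j) / sqrt(d) + alpha * (i - j)
        scores = [sum(a * b for a, b in zip(q, k)) * scale + alpha * (i - j)
                  for j, k in enumerate(K)]
        # Row-wise softmax, shifted by the row maximum for numerical stability.
        mx = max(scores)
        exps = [math.exp(s - mx) for s in scores]
        total = sum(exps)
        # Output row = softmax-weighted average of the rows of V.
        out.append([sum(e * v[c] for e, v in zip(exps, V)) / total
                    for c in range(d)])
    return out

# Inputs of Example 2 from the problem statement.
result = alibi_attention([[1, 2]], [[1, 0], [0, 1]], [[3, 4], [5, 6]], alpha=0.8)
print([[round(x, 2) for x in row] for row in result])  # [[3.95, 4.95]]
```

A reference like this is mainly useful for checking a kernel's output on small inputs before benchmarking.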

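Implementation Requirements

External libraries are not allowed.
The solve function signature must remain unchanged.
The final result must be stored in the output matrix output.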

Examples

Example 1

Input:  Q (2×4): [[1,0,0,0],[0,1,0,0]]
        K (3×4): [[1,0,0,0],[0,1,0,0],[0,0,1,0]]
        V (3×4): [[1,2,3,4],[5,6,7,8],[9,10,11,12]]
        α = 0.5
Output: (2×4): [[3.05,4.05,5.05,6.05],[3.93,4.93,5.93,6.93]]
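
As a sanity check, the second row of Example 1 (query index $i = 1$, counting from 0) can be worked by hand. With $d = 4$ and $\alpha = 0.5$, and writing $V_j$ for row $j$ of $V$:

$$
\begin{aligned}
\frac{Q_1 K^T}{\sqrt{4}} &= [0,\ 0.5,\ 0], \qquad \alpha\,(1 - j) = [0.5,\ 0,\ -0.5] \\
s &= [0.5,\ 0.5,\ -0.5] \\
\text{softmax}(s) &= \frac{[e^{0.5},\ e^{0.5},\ e^{-0.5}]}{2e^{0.5} + e^{-0.5}} \approx [0.4223,\ 0.4223,\ 0.1554] \\
\text{output}_1 &\approx 0.4223\,V_0 + 0.4223\,V_1 + 0.1554\,V_2 \approx [3.93,\ 4.93,\ 5.93,\ 6.93]
\end{aligned}
$$

The key at $j = 0$ receives a bias of $+0.5$ and the key at $j = 2$ receives $-0.5$, so with $\alpha > 0$ this formulation favors keys whose index is smaller than the query's.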

Example 2

Input:  Q (1×2): [[1,2]], K (2×2): [[1,0],[0,1]], V (2×2): [[3,4],[5,6]], α=0.8
Output: (1×2): [[3.95,4.95]]

Constraints

$Q$ is $M \times d$; $K$ and $V$ are $N \times d$.
$1 \le M, N \le 2{,}048$, $1 \le d \le 1{,}024$.
$-1.0 \le \alpha \le 1.0$.
Performance is measured at $M = N = 2{,}048$.

Approach

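ALiBi replaces positional encodings with a simple linear bias, which lets the model extrapolate to sequence lengths not seen during training. In terms of implementation, the bias term $\alpha \cdot (i - j)$ is added to the standard softmax-attention scores $QK^T/\sqrt{d}$ before the row-wise softmax. In this problem the bias covers the full $M \times N$ score matrix; only in the causal-mask setting is it restricted to the lower triangle ($j \le i$). Because each entry depends only on $i$ and $j$, it can be computed on the fly inside the kernel. Precomputing it in constant memory is not practical at the benchmark size: a $2{,}048 \times 2{,}048$ float32 matrix takes 16 MiB, while CUDA constant memory is limited to 64 KB.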
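
The bias term described above can be sketched in a few lines of plain Python. The helper name `alibi_bias` is hypothetical and only illustrates that each entry is a function of the indices $i$ and $j$, so a kernel can recompute it wherever it is needed rather than load it from memory.

```python
def alibi_bias(M, N, alpha):
    """Full M x N ALiBi bias matrix with entries alpha * (i - j)."""
    return [[alpha * (i - j) for j in range(N)] for i in range(M)]

# Bias for the shapes of Example 1 (M = 2, N = 3, alpha = 0.5).
bias = alibi_bias(2, 3, 0.5)
print(bias)  # [[0.0, -0.5, -1.0], [0.5, 0.0, -0.5]]

# Size of a materialized float32 bias at the benchmark shape M = N = 2048.
full_bytes = 2048 * 2048 * 4
print(full_bytes / 2**20)  # 16.0 (MiB)
```

Since one multiply per score is far cheaper than a memory read, computing the bias inline is usually the simpler choice.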

Implementation

CUDA

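#include <cuda_runtime.h>
#include <math.h>

// One block per query row i. The scores of that row live in dynamic shared
// memory (N floats, at most 8 KB for N <= 2048), so no fixed-size local array.
__global__ void alibi_kernel(const float* Q, const float* K, const float* V,
    float* O, int M, int N, int d, float alpha) {
    extern __shared__ float scores[];
    int i = blockIdx.x;
    float scale = 1.0f / sqrtf((float)d);
    // Step 1: threads stride over keys j and compute the scaled, biased scores.
    for (int j = threadIdx.x; j < N; j += blockDim.x) {
        float dot = 0.0f;
        for (int k = 0; k < d; k++) dot += Q[i*d+k] * K[j*d+k];
        scores[j] = dot * scale + alpha * (float)(i - j);
    }
    __syncthreads();
    // Step 2: every thread scans the row for its maximum (redundant but simple).
    float mx = -INFINITY;
    for (int j = 0; j < N; j++) mx = fmaxf(mx, scores[j]);
    __syncthreads();  // all reads of the raw scores finish before they are overwritten
    // Step 3: replace each score by exp(score - max), then sum the row.
    for (int j = threadIdx.x; j < N; j += blockDim.x) scores[j] = expf(scores[j] - mx);
    __syncthreads();
    float sm = 0.0f;
    for (int j = 0; j < N; j++) sm += scores[j];
    // Step 4: threads stride over output columns k, so reads of V are coalesced.
    for (int k = threadIdx.x; k < d; k += blockDim.x) {
        float sum = 0.0f;
        for (int j = 0; j < N; j++) sum += scores[j] * V[j*d+k];
        O[i*d+k] = sum / sm;
    }
}

extern "C" void solve(const float* Q, const float* K, const float* V, float* O,
    int M, int N, int d, float alpha) {
    const int threads = 256;
    // Third launch parameter: bytes of dynamic shared memory for the scores array.
    alibi_kernel<<<M, threads, N * sizeof(float)>>>(Q, K, V, O, M, N, d, alpha);
    cudaDeviceSynchronize();
}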

Triton

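import triton
import triton.language as tl

# Launch with scale = d ** -0.5, BLOCK_D = triton.next_power_of_2(d),
# and grid = (triton.cdiv(M, BLOCK_M),).
@triton.jit
def alibi_kernel(Q_ptr, K_ptr, V_ptr, O_ptr, M, N, d, alpha, scale,
                 BLOCK_M: tl.constexpr, BLOCK_D: tl.constexpr):
    i = tl.program_id(0) * BLOCK_M + tl.arange(0, BLOCK_M)  # query rows of this program
    d_range = tl.arange(0, BLOCK_D)  # tl.arange needs a power-of-two length
    col_mask = d_range < d
    mask = (i < M)[:, None] & col_mask[None, :]
    qi = tl.load(Q_ptr + i[:, None] * d + d_range[None, :], mask=mask, other=0.0)
    # Online softmax state per row: running max, running sum, unnormalized output.
    m_run = tl.full((BLOCK_M,), float("-inf"), tl.float32)
    l_run = tl.zeros((BLOCK_M,), tl.float32)
    acc = tl.zeros((BLOCK_M, BLOCK_D), tl.float32)
    for j in range(N):
        kj = tl.load(K_ptr + j * d + d_range, mask=col_mask, other=0.0)
        vj = tl.load(V_ptr + j * d + d_range, mask=col_mask, other=0.0)
        score = tl.sum(qi * kj[None, :], axis=1) * scale + alpha * (i - j).to(tl.float32)
        m_new = tl.maximum(m_run, score)
        corr = tl.exp(m_run - m_new)  # rescales earlier partial sums to the new max
        p = tl.exp(score - m_new)
        l_run = l_run * corr + p
        acc = acc * corr[:, None] + p[:, None] * vj[None, :]
        m_run = m_new
    tl.store(O_ptr + i[:, None] * d + d_range[None, :], acc / l_run[:, None], mask=mask)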
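
A loop that visits one key at a time, as the Triton kernel does, needs a softmax that can be updated incrementally, because the row maximum and the normalizer are not known until every key has been seen. A minimal pure-Python sketch of that streaming ("online") softmax follows. The function names and the sample scores are illustrative assumptions, not part of the original solution.

```python
import math

def streaming_softmax_average(scores, values):
    """Softmax-weighted average of `values`, visiting each (score, value) once."""
    m = float("-inf")               # running maximum of the scores seen so far
    l = 0.0                         # running sum of exp(score - m)
    acc = [0.0] * len(values[0])    # running weighted sum, expressed relative to m
    for s, v in zip(scores, values):
        m_new = max(m, s)
        corr = math.exp(m - m_new)  # rescales old partial sums; 0.0 on the first step
        p = math.exp(s - m_new)
        l = l * corr + p
        acc = [a * corr + p * x for a, x in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]

def two_pass_softmax_average(scores, values):
    """Conventional version: find the maximum first, then normalize."""
    mx = max(scores)
    w = [math.exp(s - mx) for s in scores]
    total = sum(w)
    return [sum(wi * v[c] for wi, v in zip(w, values)) / total
            for c in range(len(values[0]))]

scores = [0.5, -0.5, -1.0]  # sample scores for one query row
values = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
a = streaming_softmax_average(scores, values)
b = two_pass_softmax_average(scores, values)
# The two formulations should agree up to floating-point rounding.
print(max(abs(x - y) for x, y in zip(a, b)) < 1e-9)
```

The design trade-off is extra exponentials per step in exchange for never storing a full row of scores.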