Sliding Window Self-Attention

agicy · 2026/6/6 · about 1 minute read

Original problem: LeetGPU - Sliding Window Self-Attention

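Problem Description

Implement sliding window self-attention. Given a query matrix $Q$, a key matrix $K$, and a value matrix $V$ (each $M \times d$) together with a window size $W$, each position $i$ attends only to positions $j \in [\max(0, i - W), i]$ inside the window (a causal window):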

$$\text{score}_{i,j} = \frac{Q_i \cdot K_j}{\sqrt{d}}$$

$$\text{output}_i = \sum_{j=\max(0, i-W)}^{i} \text{softmax}(\text{score}_{i,*})_j \cdot V_j$$

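Implementation Requirements

Attention scores outside the window are set to $-\infty$ before the softmax, so those positions receive zero weight. All data is float32.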
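
As a cross-check for the definition above, here is a minimal pure-Python sketch rather than an optimized implementation. The function name `sliding_window_attention` and the list-of-rows layout are assumptions made for illustration; they are not part of the LeetGPU interface.

```python
import math

def sliding_window_attention(Q, K, V, W):
    """Reference causal sliding window attention on lists of rows (illustrative)."""
    M, d = len(Q), len(Q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for i in range(M):
        start = max(0, i - W)  # first key visible to query i
        # Scaled dot-product scores for the keys inside the window only;
        # skipping the other keys is equivalent to scoring them as -inf.
        scores = [scale * sum(q * k for q, k in zip(Q[i], K[j]))
                  for j in range(start, i + 1)]
        mx = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - mx) for s in scores]
        total = sum(weights)
        row = [sum(w * V[start + t][c] for t, w in enumerate(weights)) / total
               for c in range(d)]
        out.append(row)
    return out

# With all-zero keys every score is 0, so the weights are uniform and each
# output row is the mean of V over that row's window.
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
Z = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
print(sliding_window_attention(V, Z, V, W=1))  # [[1.0, 0.0], [0.5, 0.5], [1.0, 1.5]]
```

Row 0 sees only itself, row 1 averages rows 0 and 1, and row 2 averages rows 1 and 2 because row 0 has left its window.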

Constraints

$1 \le M \le 10{,}000$, $1 \le d \le 128$, $1 \le W \le M$.
Performance is measured at $M = 5{,}000$.

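Approach

Sliding window attention restricts each query to a local window, reducing the number of score evaluations from $O(M^2)$ to $O(M \cdot W)$ (each score is a length-$d$ dot product). This avoids the memory blowup of the full attention matrix on long sequences, and the technique is used by models such as Mistral. A GPU implementation can exploit the fact that each query reads only the K and V rows inside its window: neighboring queries share most of those rows, so loading them into shared memory tile by tile significantly reduces global memory traffic. Combining the causal mask ($j \le i$) with the window mask ($j \ge i - W$) limits each query to at most $W + 1$ keys.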
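
The window bounds, the reduced score count, and the key range shared by a tile of queries can be made concrete with a few helper functions. This is only a sketch: the helper names, the tile size of 64, and the window size of 128 are assumptions chosen for illustration and are not given by the problem.

```python
def window_bounds(i, W):
    """Inclusive key range [start, i] visible to query i."""
    return max(0, i - W), i

def score_count(M, W):
    """Number of (i, j) pairs evaluated under the causal window mask."""
    return sum(i - max(0, i - W) + 1 for i in range(M))

def tile_key_range(b, B, W, M):
    """Inclusive key range needed by the query tile [b, min(b + B, M))."""
    last = min(b + B, M) - 1
    return max(0, b - W), last

M, W, B = 5000, 128, 64          # W and B are assumed values
print(window_bounds(3, W))       # (0, 3): early rows have a truncated window
print(score_count(M, W))         # 636744 scores with the window
print(score_count(M, M))         # 12502500 scores for plain causal attention
print(tile_key_range(1024, B, W, M))  # (896, 1087)
```

In the last case a tile of 64 queries needs only 192 distinct K/V rows, whereas loading each query's window separately would read 64 × 129 = 8256 rows. That reuse is what a shared-memory tiling strategy exploits.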

Implementation

CUDA

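#include <cuda_runtime.h>
#include <math.h>

#define MAX_D 128  // upper bound on d from the constraints

// One thread per query row. The softmax is accumulated in a single streaming
// pass, so no per-window score buffer is needed (W + 1 can be as large as M).
__global__ void sw_attn(const float* Q, const float* K, const float* V, float* O, int M, int d, int W) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= M) return;
    float scale = 1.0f / sqrtf((float)d);
    float acc[MAX_D];                        // unnormalized output row
    for (int k = 0; k < d; k++) acc[k] = 0.0f;
    float mx = -INFINITY, sm = 0.0f;         // running max and running sum of exp
    int start = (i > W) ? i - W : 0;         // window is [max(0, i - W), i]
    for (int j = start; j <= i; j++) {
        float dot = 0.0f;
        for (int k = 0; k < d; k++) dot += Q[i*d+k] * K[j*d+k];
        float s = dot * scale;
        float new_mx = fmaxf(mx, s);
        float corr = expf(mx - new_mx);      // rescales earlier terms; 0 on the first key
        float p = expf(s - new_mx);
        sm = sm * corr + p;
        for (int k = 0; k < d; k++) acc[k] = acc[k] * corr + p * V[j*d+k];
        mx = new_mx;
    }
    for (int k = 0; k < d; k++) O[i*d+k] = acc[k] / sm;
}

// Q, K, V, O are assumed to be device pointers; they are passed straight to the kernel.
extern "C" void solve(const float* Q, const float* K, const float* V, float* O, int M, int d, int W) {
    int threads = 128;
    int blocks = (M + threads - 1) / threads;
    sw_attn<<<blocks, threads>>>(Q, K, V, O, M, d, W);
    cudaDeviceSynchronize();
}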
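
The per-row softmax does not strictly need a score buffer sized to the window: the running maximum and running sum can be updated as each key is visited, a scheme often called online softmax. The sketch below, in Python for readability, compares a buffered two-pass version with a one-pass version. Both function names are hypothetical and only illustrate the idea.

```python
import math

def weighted_sum_two_pass(scores, values):
    """Buffer all scores, then normalize: max, exp, sum, weighted average."""
    mx = max(scores)
    weights = [math.exp(s - mx) for s in scores]
    total = sum(weights)
    d = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) / total for c in range(d)]

def weighted_sum_online(scores, values):
    """Single pass: keep a running max, running sum, and rescaled accumulator."""
    d = len(values[0])
    mx, total, acc = float("-inf"), 0.0, [0.0] * d
    for s, v in zip(scores, values):
        new_mx = max(mx, s)
        corr = math.exp(mx - new_mx)  # rescales earlier terms; 0.0 on the first step
        p = math.exp(s - new_mx)
        total = total * corr + p
        acc = [a * corr + p * x for a, x in zip(acc, v)]
        mx = new_mx
    return [a / total for a in acc]

scores = [0.5, 2.0, -1.0, 3.0]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, -1.0]]
a = weighted_sum_two_pass(scores, values)
b = weighted_sum_online(scores, values)
print(max(abs(x - y) for x, y in zip(a, b)) < 1e-12)  # True
```

The one-pass form uses memory proportional to $d$ rather than to the window length, which matters here because $W + 1$ can be as large as $M$.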

Triton

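import math, triton, triton.language as tl
@triton.jit
def sw_attn(Q_ptr, K_ptr, V_ptr, O_ptr, M, d, W, scale, BLOCK_D: tl.constexpr):
    i = tl.program_id(0)                                  # one program per query row
    offs = tl.arange(0, BLOCK_D)                          # BLOCK_D: d rounded up to a power of two
    dmask = offs < d
    q = tl.load(Q_ptr + i * d + offs, mask=dmask, other=0.0)
    acc = tl.zeros((BLOCK_D,), dtype=tl.float32)          # unnormalized output row
    mx = tl.full((1,), float("-inf"), dtype=tl.float32)   # running max
    sm = tl.zeros((1,), dtype=tl.float32)                 # running sum of exp
    for j in range(tl.maximum(i - W, 0), i + 1):          # keys in [max(0, i - W), i]
        k = tl.load(K_ptr + j * d + offs, mask=dmask, other=0.0)
        v = tl.load(V_ptr + j * d + offs, mask=dmask, other=0.0)
        s = tl.sum(q * k, axis=0) * scale
        new_mx = tl.maximum(mx, s)
        corr = tl.exp(mx - new_mx)                        # rescales earlier terms; 0 on the first key
        p = tl.exp(s - new_mx)
        sm = sm * corr + p
        acc = acc * corr + p * v
        mx = new_mx
    tl.store(O_ptr + i * d + offs, acc / sm, mask=dmask)

def solve(Q, K, V, O, M, d, W):  # launcher name mirrors the CUDA entry point; assumes contiguous float32 GPU tensors
    sw_attn[(M,)](Q, K, V, O, M, d, W, 1.0 / math.sqrt(d), BLOCK_D=triton.next_power_of_2(d))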
