INT8 KV-Cache Attention

agicy · 2026/6/6 · about 2 min read

Original problem: LeetGPU - INT8 KV-Cache Attention

Problem Description

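Implement decode-phase multi-head attention in which the key and value caches are stored in int8 with per-token scale factors. Compared with float32, this layout cuts KV-cache bandwidth to roughly a quarter (roughly half compared with float16), and it is widely used in production LLM serving systems such as TensorRT-LLM and vLLM.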

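Given the query tensor $Q$ for a single new token, the int8 key cache $K_{int8}$, the int8 value cache $V_{int8}$, and the per-token scale factors $k\_scale$ and $v\_scale$, dequantize the caches and compute the scaled dot-product attention output.

Dequantization: $K_{float}[h, s, d] = K_{int8}[h, s, d] \times k\_scale[h, s]$ (and likewise for $V$). Attention uses $1/\sqrt{head\_dim}$ as the scaling factor.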
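
The dequantize-then-attend definition above can be written out as a slow reference implementation, which is handy for checking a GPU kernel on small inputs. This is a minimal sketch in plain Python, not the intended solution. The function name `int8_kv_attention` and the nested-list layout (`Q` as `[H][D]`, caches as `[H][S][D]`, scales as `[H][S]`) are assumptions made for illustration. The `V_int8` rows in the usage are derived from the `V_float` values listed in the example divided by `v_scale`; they are inferred, not copied from the problem statement.

```python
import math

def int8_kv_attention(Q, K_int8, V_int8, k_scale, v_scale):
    """Reference decode attention over an int8 KV cache (hypothetical helper)."""
    out = []
    for h in range(len(Q)):
        q = Q[h]
        inv_sqrt_d = 1.0 / math.sqrt(len(q))
        # Dequantize each key row with its per-token scale, then take the scaled dot product.
        scores = [
            inv_sqrt_d * k_scale[h][s] * sum(qi * ki for qi, ki in zip(q, K_int8[h][s]))
            for s in range(len(K_int8[h]))
        ]
        # Numerically stable softmax over the sequence axis.
        m = max(scores)
        exps = [math.exp(x - m) for x in scores]
        total = sum(exps)
        # Weighted sum of dequantized value rows.
        row = [0.0] * len(q)
        for s, e in enumerate(exps):
            w = e / total * v_scale[h][s]
            for d, v in enumerate(V_int8[h][s]):
                row[d] += w * v
        out.append(row)
    return out

Q = [[1.0, 1.0, 1.0, 1.0]]
K_int8 = [[[100, 0, 0, 0], [0, 100, 0, 0], [0, 0, 100, 0]]]
V_int8 = [[[10, 20, 30, 40], [50, 60, 70, 80], [90, 100, 110, 120]]]  # inferred from V_float / 0.1
k_scale = [[0.1, 0.1, 0.1]]
v_scale = [[0.1, 0.1, 0.1]]
result = int8_kv_attention(Q, K_int8, V_int8, k_scale, v_scale)
print([round(x, 6) for x in result[0]])  # [5.0, 6.0, 7.0, 8.0]
```

Because all three keys produce the same score, the softmax weights are uniform and the output is the mean of the dequantized value rows.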

Implementation Requirements

Implement solve(Q, K_int8, V_int8, k_scale, v_scale, output, num_heads, seq_len, head_dim).
External libraries are not allowed.
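
The CUDA code on this page addresses every tensor through a flat pointer. A minimal sketch of that indexing, assuming contiguous row-major buffers of shape [num_heads, seq_len, head_dim] for the caches and [num_heads, seq_len] for the scales. The helper names `kv_offset` and `scale_offset` are illustrative and not part of the problem statement.

```python
def kv_offset(h, s, k, seq_len, head_dim):
    """Flat offset of element [h, s, k] in a [num_heads, seq_len, head_dim] cache."""
    return (h * seq_len + s) * head_dim + k

def scale_offset(h, s, seq_len):
    """Flat offset of the scale for token s of head h in a [num_heads, seq_len] array."""
    return h * seq_len + s

# With seq_len=3 and head_dim=4, token 1 of head 0 starts right after the first row.
print(kv_offset(0, 1, 0, seq_len=3, head_dim=4))  # 4
```

Every element in a token's row shares one scale, which is why the scale offset has no head_dim term.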

Example

num_heads=1, seq_len=3, head_dim=4
Q = [[1,1,1,1]]
K_int8 = [[[100,0,0,0],[0,100,0,0],[0,0,100,0]]], k_scale = [0.1,0.1,0.1]
V_int8 = [[[10,20,30,40],...]], v_scale = [0.1,0.1,0.1]
K_float = [[1,0,0,0],[0,1,0,0],[0,0,1,0]], V_float = [[1,2,3,4],[5,6,7,8],[9,10,11,12]]
Scores = Q · K_float^T / sqrt(4) = [0.5,0.5,0.5], softmax(Scores) = [1/3,1/3,1/3]
Output = [5,6,7,8]
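
The example pairs int8 codes with a scale of 0.1. The problem only asks for dequantization, but when building test inputs it can help to go the other way. A minimal sketch, assuming round-to-nearest with clamping to the int8 range. The function names and the rounding rule are assumptions for illustration, not something the problem statement specifies.

```python
def quantize_row(row, scale):
    """Map floats to int8 codes for a given per-token scale (round, then clamp)."""
    return [max(-128, min(127, round(x / scale))) for x in row]

def dequantize_row(codes, scale):
    """Inverse mapping used by the attention kernel: code * scale."""
    return [c * scale for c in codes]

codes = quantize_row([1.0, 2.0, 3.0, 4.0], 0.1)
print(codes)  # [10, 20, 30, 40]
restored = dequantize_row(codes, 0.1)
```

Values whose magnitude exceeds about 127 times the scale saturate at the clamp, so the round trip is only approximate in general.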

Constraints

$1 \le num\_heads \le 64$, $1 \le seq\_len \le 32{,}768$, $8 \le head\_dim \le 256$ (a multiple of 8).
$K_{int8}, V_{int8}$ values lie in $[-128, 127]$, and the scale factors are positive float32 values.
Performance is measured with $num\_heads=32, seq\_len=8{,}192, head\_dim=128$.
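
To get a feel for how much data a decode step has to read at the benchmark shape, the cache sizes can be counted directly. A rough sketch that only counts K and V bytes plus float32 per-token scales. The helper `kv_cache_bytes` is hypothetical, and the float16 and float32 rows are baselines added here for comparison.

```python
def kv_cache_bytes(num_heads, seq_len, head_dim, bytes_per_elem, per_token_scale):
    """Bytes occupied by the K and V caches together."""
    elems = num_heads * seq_len * head_dim
    scale_bytes = num_heads * seq_len * 4 if per_token_scale else 0  # float32 scales
    return 2 * (elems * bytes_per_elem + scale_bytes)

H, S, D = 32, 8192, 128
int8_bytes = kv_cache_bytes(H, S, D, 1, True)
fp16_bytes = kv_cache_bytes(H, S, D, 2, False)
fp32_bytes = kv_cache_bytes(H, S, D, 4, False)
print(int8_bytes, fp16_bytes, fp32_bytes)  # 69206016 134217728 268435456
```

That is 66 MiB for int8 with scales against 128 MiB for float16 and 256 MiB for float32. The scales add about 3% on top of the raw int8 data at head_dim=128.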

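Code Implementation

CUDA

#include <cuda_runtime.h>
#include <math.h>
#include <stdint.h>
#define TILE 256  // threads per block = tokens per tile; must be >= head_dim (max 256)
// One block per head. Scores are processed tile by tile with an online softmax,
// so no per-thread score buffer is needed and seq_len is not limited by the block size.
__global__ void int8_kv_attn(const float* Q, const int8_t* K_i8, const int8_t* V_i8,
    const float* ks, const float* vs, float* O, int H, int S, int d) {
    __shared__ float q[TILE], p[TILE], pv[TILE];
    int h = blockIdx.x, t = threadIdx.x;
    const int8_t *K = K_i8 + (size_t)h * S * d, *V = V_i8 + (size_t)h * S * d;
    if (t < d) q[t] = Q[h * d + t];
    float scale = 1.0f / sqrtf((float)d), m = -INFINITY, l = 0.0f, acc = 0.0f;
    for (int base = 0; base < S; base += TILE) {
        __syncthreads();  // q is loaded; the previous tile is fully consumed
        int j = base + t, n = min(TILE, S - base);
        float dot = 0.0f;
        if (j < S) for (int k = 0; k < d; k++) dot += q[k] * (float)K[(size_t)j * d + k];
        p[t] = (j < S) ? dot * ks[h * S + j] * scale : -INFINITY;  // dequantized, scaled score
        __syncthreads();
        float m_new = m;
        for (int i = 0; i < n; i++) m_new = fmaxf(m_new, p[i]);  // running max over tiles so far
        __syncthreads();  // every thread has read the raw scores
        float e = (j < S) ? expf(p[t] - m_new) : 0.0f;
        p[t] = e; pv[t] = (j < S) ? e * vs[h * S + j] : 0.0f;  // fold v_scale into the weight
        __syncthreads();
        float c = expf(m - m_new); l *= c; acc *= c; m = m_new;  // rescale earlier tiles
        for (int i = 0; i < n; i++) { l += p[i]; if (t < d) acc += pv[i] * (float)V[(size_t)(base + i) * d + t]; }
    }
    if (t < d) O[h * d + t] = acc / l;  // thread t owns output dimension t
}
extern "C" void solve(const float* Q, const int8_t* K_i8, const int8_t* V_i8,
    const float* ks, const float* vs, float* O, int H, int S, int d) {
    int8_kv_attn<<<H, TILE>>>(Q, K_i8, V_i8, ks, vs, O, H, S, d); cudaDeviceSynchronize();
}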
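
With seq_len allowed to reach 32,768, keeping every score in a fixed-size buffer does not scale. One widely used alternative is the online softmax, which reads the scores tile by tile and rescales a running sum and a running accumulator whenever the maximum grows. The sketch below illustrates that recurrence in plain Python under stated assumptions: the function name, the `tile` parameter and the list-based inputs are all hypothetical, and the values are taken to be already dequantized.

```python
import math

def online_softmax_weighted_sum(scores, values, tile=4):
    """Return sum_j softmax(scores)[j] * values[j], reading scores tile by tile."""
    m = float("-inf")   # running maximum of the scores seen so far
    l = 0.0             # running sum of exp(score - m)
    acc = [0.0] * len(values[0])
    for start in range(0, len(scores), tile):
        chunk = scores[start:start + tile]
        m_new = max(m, max(chunk))
        c = math.exp(m - m_new)   # rescale factor; exp(-inf) is 0.0 on the first tile
        l *= c
        acc = [a * c for a in acc]
        for off, s in enumerate(chunk):
            p = math.exp(s - m_new)
            l += p
            for d, v in enumerate(values[start + off]):
                acc[d] += p * v
        m = m_new
    return [a / l for a in acc]

scores = [0.3, -1.2, 2.5, 0.0, 4.1, -0.7]
values = [[float(i), float(2 * i)] for i in range(6)]
tiled = online_softmax_weighted_sum(scores, values, tile=4)
```

The result matches an ordinary two-pass softmax up to floating-point rounding, while only ever holding one tile of scores at a time.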

Triton

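import triton, triton.language as tl
@triton.jit
def int8_kv(Q_ptr,Ki8_ptr,Vi8_ptr,ks_ptr,vs_ptr,O_ptr, H,S,d,scale, D:tl.constexpr, BLOCK:tl.constexpr):
    h=tl.program_id(0)                        # one program per head
    dr=tl.arange(0,D); dm=dr<d                # D = head_dim padded to a power of two
    q=tl.load(Q_ptr+h*d+dr,mask=dm,other=0.0)
    acc=tl.zeros((D,),tl.float32)
    l=tl.sum(acc,axis=0); m=l-float("inf")    # fp32 scalars: running sum and running max
    for start in range(0,S,BLOCK):            # online softmax over BLOCK-token tiles
        j=start+tl.arange(0,BLOCK); jm=j<S
        off=h*S*d+j[:,None]*d+dr[None,:]; msk=jm[:,None]&dm[None,:]
        k=tl.load(Ki8_ptr+off,mask=msk,other=0).to(tl.float32)
        s=tl.sum(k*q[None,:],axis=1)*tl.load(ks_ptr+h*S+j,mask=jm,other=0.0)*scale
        s=tl.where(jm,s,float("-inf"))        # padded tokens get zero weight
        m_new=tl.maximum(m,tl.max(s,axis=0))
        p=tl.exp(s-m_new); c=tl.exp(m-m_new)  # c rescales the earlier tiles
        v=tl.load(Vi8_ptr+off,mask=msk,other=0).to(tl.float32)
        pv=p*tl.load(vs_ptr+h*S+j,mask=jm,other=0.0)  # fold v_scale into the weight
        l=l*c+tl.sum(p,axis=0); acc=acc*c+tl.sum(pv[:,None]*v,axis=0); m=m_new
    tl.store(O_ptr+h*d+dr,acc/l,mask=dm)

def solve(Q,K_int8,V_int8,k_scale,v_scale,output,num_heads,seq_len,head_dim):
    int8_kv[(num_heads,)](Q,K_int8,V_int8,k_scale,v_scale,output,num_heads,seq_len,head_dim,
        head_dim**-0.5, D=triton.next_power_of_2(head_dim), BLOCK=32)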
