INT4 Weight-Only Quantized MatMul

agicy · 2026/6/6 · about a 2-minute read


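Original problem: LeetGPU - INT4 Weight-Only Quantized MatMul

Problem Description

Implement weight-only INT4 quantized matrix multiplication (W4A16), a core kernel in modern LLM inference. Given a float16 activation matrix $x$ ($M \times K$) and a weight matrix stored in packed INT4 format, compute the output matrix $y = x \times W^T$ ($M \times N$), where $W$ is the dequantized float16 weight matrix ($N \times K$).

Packing format: each byte of w_q stores two INT4 weights. The high nibble (bits 7–4) stores $w[n, 2i]$ and the low nibble (bits 3–0) stores $w[n, 2i+1]$. INT4 values are stored unsigned in the range $[0, 15]$ with an offset of 8, so the signed weight is $\text{nibble} - 8$, with range $[-8, 7]$.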
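The packing rule can be illustrated with a minimal pure-Python sketch. The helper names `pack_int4_row` and `unpack_int4_row` are hypothetical and are not part of the problem's API; they only mirror the byte layout described above.

```python
def pack_int4_row(signed_weights):
    """Pack signed INT4 weights (each in [-8, 7]) into bytes, two per byte."""
    if len(signed_weights) % 2 != 0:
        raise ValueError("row length must be even")
    packed = bytearray()
    for i in range(0, len(signed_weights), 2):
        hi = signed_weights[i] + 8      # even index -> high nibble, offset by 8
        lo = signed_weights[i + 1] + 8  # odd index  -> low nibble, offset by 8
        if not (0 <= hi <= 15 and 0 <= lo <= 15):
            raise ValueError("weights must lie in [-8, 7]")
        packed.append((hi << 4) | lo)
    return bytes(packed)


def unpack_int4_row(packed):
    """Recover the signed weights from one packed row."""
    out = []
    for byte in packed:
        out.append((byte >> 4) - 8)   # high nibble first
        out.append((byte & 0xF) - 8)  # then low nibble
    return out
```

For instance, the signed pair (1, 1) is stored as the nibbles (9, 9), that is, the byte 0x99.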

Dequantization: weights are dequantized per group. Every $group\_size$ consecutive weights along the $K$ dimension share one float16 scale factor:

$$W[n, k] = (w\_q\_nibble[n, k] - 8) \times scales[n, \lfloor k / group\_size \rfloor]$$
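As a rough illustration of the formula, the sketch below dequantizes a single packed row in plain Python. The function name `dequantize_row` is hypothetical, and it works in Python floats rather than float16, so it shows the indexing only, not the exact rounding.

```python
def dequantize_row(packed, scales, group_size):
    """Dequantize one row of w_q.

    packed: K // 2 bytes; scales: K // group_size scale factors for this row.
    """
    weights = []
    for k in range(2 * len(packed)):
        byte = packed[k // 2]
        # Even k lives in the high nibble, odd k in the low nibble.
        nibble = (byte >> 4) if k % 2 == 0 else (byte & 0xF)
        # Remove the offset of 8, then apply the scale of the group containing k.
        weights.append((nibble - 8) * scales[k // group_size])
    return weights
```

With `group_size=2`, the bytes 0x99 and 0xAA and the scales 0.5 and 2.0 give the weights 0.5, 0.5, 4.0, 4.0.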

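Implementation Requirements

External libraries are not allowed.
The solve function signature must remain unchanged.
The final result must be stored in $y$.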

Example

Input:  M=2, N=4, K=4, group_size=2
        x (float16, 2×4): [[1,0,1,0],[0,1,0,1]]
        w_q (uint8, 4×2): packed; the signed INT4 values, one [high, low] pair per byte, are [[1,1],[1,1],[2,2],[2,2],[-1,-1],[-1,-1],[0,0],[0,0]]
        scales (float16, 4×2): all 0.5
        W_dequant = (nibble-8)*0.5
Output: y = x × W^T (float16, 2×4): [[1,2,-1,0],[1,2,-1,0]]
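The example can be checked with a small pure-Python reference. This is a sketch for verification only, not a submission: `w4a16_reference` is a hypothetical name, the inputs are nested lists rather than device buffers, and the arithmetic uses Python floats instead of float16.

```python
def w4a16_reference(x, w_q, scales, M, N, K, group_size):
    """Plain-Python reference for y = x @ W^T with packed INT4 weights.

    x: M rows of K values; w_q: N rows of K // 2 packed bytes;
    scales: N rows of K // group_size scale factors.
    """
    y = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc = 0.0
            for k in range(K):
                byte = w_q[n][k // 2]
                nibble = (byte >> 4) if k % 2 == 0 else (byte & 0xF)
                acc += x[m][k] * (nibble - 8) * scales[n][k // group_size]
            y[m][n] = acc
    return y


# Signed values 1, 2, -1, 0 are stored as nibbles 9, 10, 7, 8.
x = [[1, 0, 1, 0], [0, 1, 0, 1]]
w_q = [[0x99, 0x99], [0xAA, 0xAA], [0x77, 0x77], [0x88, 0x88]]
scales = [[0.5, 0.5]] * 4
y = w4a16_reference(x, w_q, scales, M=2, N=4, K=4, group_size=2)
print(y)  # [[1.0, 2.0, -1.0, 0.0], [1.0, 2.0, -1.0, 0.0]]
```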

Constraints

$1 \le M, N, K \le 8{,}192$.
$K$ is divisible by both 2 and $group\_size$.
$group\_size \in \{2, 4, 8, 16, 32, 64, 128\}$.
x and scales are float16, w_q is uint8, and y is float16.
Performance is measured with $M=N=K=4{,}096$ and $group\_size=128$.
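To get a feel for the benchmark configuration, the buffer sizes implied by these constraints can be worked out directly. The helper `buffer_sizes` below is a hypothetical illustration that only applies the shapes and dtypes stated above.

```python
def buffer_sizes(M, N, K, group_size):
    """Return the size in bytes of each buffer for a given problem shape."""
    fp16_bytes = 2
    return {
        "x": M * K * fp16_bytes,
        "w_q": N * (K // 2),  # two INT4 weights per uint8
        "scales": N * (K // group_size) * fp16_bytes,
        "y": M * N * fp16_bytes,
    }


sizes = buffer_sizes(4096, 4096, 4096, 128)
for name, nbytes in sizes.items():
    print(f"{name}: {nbytes / 2**20:.2f} MiB")
```

At the benchmark size the packed weights take 8 MiB and the scales 0.25 MiB, compared with 32 MiB each for x and y.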

Code Implementation

CUDA

#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <stdint.h>

// Naive kernel: one thread computes one output element y[row, col].
__global__ void w4a16_matmul(const __half* x, const uint8_t* w_q, const __half* scales,
                             __half* y, int M, int K, int N, int group_size) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float sum = 0.0f;  // accumulate in float32
        for (int k = 0; k < K; k++) {
            // Even k sits in the high nibble, odd k in the low nibble.
            uint8_t byte = w_q[col * (K / 2) + k / 2];
            int nibble = (k % 2 == 0) ? (byte >> 4) : (byte & 0xF);
            // Remove the offset of 8, then apply the per-group scale.
            float scale = __half2float(scales[col * (K / group_size) + k / group_size]);
            float w_f = (float)(nibble - 8) * scale;
            sum += __half2float(x[row * K + k]) * w_f;
        }
        y[row * N + col] = __float2half(sum);
    }
}

extern "C" void solve(const __half* x, const uint8_t* w_q, const __half* scales,
                      __half* y, int M, int K, int N, int group_size) {
    dim3 t(16, 16), b((N + 15) / 16, (M + 15) / 16);
    w4a16_matmul<<<b, t>>>(x, w_q, scales, y, M, K, N, group_size);
    cudaDeviceSynchronize();
}

Triton

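import triton
import triton.language as tl

@triton.jit
def w4a16_matmul(x_ptr, w_q_ptr, scales_ptr, y_ptr, M, N, K, group_size,
                 BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program computes one BLOCK_M x BLOCK_N tile of y.
    pm = tl.program_id(0)
    pn = tl.program_id(1)
    rm = pm * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pn * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), tl.float32)
    for k in range(0, K, BLOCK_K):
        kk = k + rk
        # Mask out-of-range indices so partial tiles do not read past the buffers.
        a_mask = (rm[:, None] < M) & (kk[None, :] < K)
        w_mask = (rn[:, None] < N) & (kk[None, :] < K)
        a = tl.load(x_ptr + rm[:, None] * K + kk[None, :], mask=a_mask, other=0.0)
        w_byte = tl.load(w_q_ptr + rn[:, None] * (K // 2) + kk[None, :] // 2, mask=w_mask, other=0)
        # High nibble holds even k, low nibble holds odd k; subtract the offset of 8.
        w_hi = ((w_byte >> 4).to(tl.int32) - 8).to(tl.float32)
        w_lo = ((w_byte & 0xF).to(tl.int32) - 8).to(tl.float32)
        s = tl.load(scales_ptr + rn[:, None] * (K // group_size) + kk[None, :] // group_size, mask=w_mask, other=0.0)
        w = tl.where(kk[None, :] % 2 == 0, w_hi, w_lo) * s
        acc += tl.dot(a.to(tl.float32), tl.trans(w.to(tl.float32)))
    y_mask = (rm[:, None] < M) & (rn[None, :] < N)
    tl.store(y_ptr + rm[:, None] * N + rn[None, :], acc.to(tl.float16), mask=y_mask)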
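The Triton kernel above has no host-side launch code. As a hedged sketch of one piece of it, the launch grid could be sized as follows; `cdiv`, `triton_grid`, and the 64×64 block sizes are assumptions chosen for illustration, not values from the original post. Note that the kernel maps program axis 0 to row tiles (M) and axis 1 to column tiles (N), the reverse of the CUDA version, where blockIdx.x covers N.

```python
def cdiv(a, b):
    """Ceiling division for positive integers."""
    return (a + b - 1) // b


def triton_grid(M, N, BLOCK_M, BLOCK_N):
    """Grid for the kernel: axis 0 indexes row tiles of y, axis 1 column tiles."""
    return (cdiv(M, BLOCK_M), cdiv(N, BLOCK_N))


# Benchmark shape with hypothetical 64x64 tiles.
grid = triton_grid(4096, 4096, BLOCK_M=64, BLOCK_N=64)
print(grid)  # (64, 64)
```

A launch would then presumably look like `w4a16_matmul[grid](...)` with the block sizes passed as keyword arguments. When M or N is not a multiple of the block size, the edge tiles are only safe if the kernel masks out-of-range indices.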
