FP16 Batched Matrix Multiplication

agicy · 2026/6/6 · about a 1-minute read

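Original problem: LeetGPU - FP16 Batched Matrix Multiplication

Problem Description

Implement batched matrix multiplication in FP16. Given a batch of matrices $A$ of shape $[B, M, K]$ and a batch of matrices $B$ of shape $[B, K, N]$ (both of FP16/half type), compute the output batch $C$ of shape $[B, M, N]$: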

$$C_b = A_b \times B_b$$

Accumulate in FP32 for better precision, then convert the final result back to FP16. All matrices are stored in row-major order.
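
Written out per element, the definition and the row-major flat offsets it implies can be sketched as follows. The zero-based index names $b$, $i$, $j$, $k$ and the $\operatorname{offset}$ notation are introduced here for illustration and are not part of the original statement.

```latex
\begin{aligned}
C_b[i][j] &= \sum_{k=0}^{K-1} A_b[i][k]\, B_b[k][j],
  \qquad 0 \le b < B,\ 0 \le i < M,\ 0 \le j < N \\
\operatorname{offset}(A_b[i][k]) &= b\,M K + i\,K + k \\
\operatorname{offset}(B_b[k][j]) &= b\,K N + k\,N + j \\
\operatorname{offset}(C_b[i][j]) &= b\,M N + i\,N + j
\end{aligned}
```

Each sum is accumulated in FP32 and rounded to FP16 once, when it is stored.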

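Implementation Requirements

External libraries are not allowed.
The solve function signature must remain unchanged.
Accumulate in FP32 and store the final result in $C$ as half.

Example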

Input:  B=2, M=2, K=3, N=2
        A[0]=[[1,2,3],[4,5,6]], A[1]=[[7,8,9],[10,11,12]]
        B[0]=[[1,2],[3,4],[5,6]], B[1]=[[6,5],[4,3],[2,1]]
Output: C[0]=[[22,28],[49,64]], C[1]=[[92,68],[128,95]]
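
The example can be checked with a small CPU reference. This is a minimal sketch in pure Python, not the submitted solution: `to_fp16` and `bmm_reference` are hypothetical helper names, and a Python double stands in for the FP32 accumulator.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def bmm_reference(A, B, batch, M, K, N):
    """Batched matmul over flat row-major lists; returns a flat list for C."""
    C = [0.0] * (batch * M * N)
    for b in range(batch):
        for i in range(M):
            for j in range(N):
                acc = 0.0  # wide accumulator (a double here, standing in for FP32)
                for k in range(K):
                    acc += A[b * M * K + i * K + k] * B[b * K * N + k * N + j]
                C[b * M * N + i * N + j] = to_fp16(acc)  # single rounding to FP16
    return C

# Flattened inputs from the example: A is [2, 2, 3], B is [2, 3, 2]
A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
B = [1, 2, 3, 4, 5, 6, 6, 5, 4, 3, 2, 1]
C = bmm_reference(A, B, batch=2, M=2, K=3, N=2)
print(C)  # [22.0, 28.0, 49.0, 64.0, 92.0, 68.0, 128.0, 95.0]
```

The printed list is C[0] followed by C[1], both in row-major order.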

Constraints

$1 \le B \le 128$, $1 \le M, N, K \le 1{,}024$.
Performance is tested at $K = M = N = 256$.
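
A quick back-of-the-envelope check of what these bounds imply for buffer sizes and index arithmetic. The variable names below are illustrative. The batch size used in the performance test is not stated, so the FLOP count is given per batch.

```python
B_MAX, DIM_MAX = 128, 1024             # upper bounds from the constraints
elems = B_MAX * DIM_MAX * DIM_MAX      # elements in each of A, B and C at the upper bound
bytes_fp16 = elems * 2                 # FP16 takes 2 bytes per element
fits_int32 = elems - 1 <= 2**31 - 1    # largest flat offset versus INT_MAX

perf_flops_per_batch = 2 * 256 * 256 * 256  # one multiply and one add per (i, j, k)

print(elems, bytes_fp16 // 2**20, fits_int32, perf_flops_per_batch)
```

At the upper bound each tensor holds $2^{27}$ elements (256 MiB in FP16), so flat offsets still fit in a 32-bit signed int.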

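Approach

The structure is the same as FP32 batched matrix multiplication, but with FP16 storage and FP32 accumulation. On GPUs with Tensor Cores (Volta and later), FP16 matrix-multiply instructions can raise throughput substantially. The key point: FP16 has limited precision (an 11-bit significand, roughly 3.3 significant decimal digits), so the accumulator must be FP32 to keep rounding error from building up across the $K$ terms. The standard tools are the __half type and the __half2float/__float2half conversion intrinsics. __hmul/__hadd compute in FP16, so they are not suitable for the accumulation itself.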
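
To see why the accumulator width matters, the sketch below emulates both choices on the CPU using only the standard library. `round_to` and `dot` are hypothetical helpers, and the inputs are made up for illustration. Rounding the running sum after every add models an accumulator of that width.

```python
import struct

def round_to(fmt: str, x: float) -> float:
    """Round x to IEEE 754 binary16 (fmt='e') or binary32 (fmt='f')."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def dot(a, b, acc_fmt: str) -> float:
    """Dot product of FP16 inputs, rounding the running sum to acc_fmt after every add."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_to(acc_fmt, acc + x * y)
    return round_to("e", acc)  # the final store is FP16 in both cases

K = 1024
a = [1.5] * K  # hypothetical inputs, both exactly representable in FP16
b = [2.0] * K  # every product is 3.0, so the exact result is 3072

fp32_acc = dot(a, b, "f")
fp16_acc = dot(a, b, "e")
print("exact:", 3.0 * K, "FP32 accumulator:", fp32_acc, "FP16 accumulator:", fp16_acc)
```

Once the running sum passes 2048, FP16 can only represent even integers, so every further `+ 3.0` is rounded and the FP16 accumulator drifts away from 3072. The FP32 accumulator stays exact for this input.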

代码实现

CUDA

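#include <cuda_runtime.h>
#include <cuda_fp16.h>  // defines __half, __half2float, __float2half

// One thread computes one element of C for one batch. The batch count is named
// BATCH so that it does not collide with the pointer parameter B.
__global__ void fp16_bmm(const __half* A, const __half* B, __half* C, int BATCH, int M, int K, int N) {
    int b = blockIdx.z;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (b < BATCH && row < M && col < N) {
        float sum = 0.0f;  // FP32 accumulator
        for (int i = 0; i < K; i++)
            sum += __half2float(A[b * M * K + row * K + i]) * __half2float(B[b * K * N + i * N + col]);
        C[b * M * N + row * N + col] = __float2half(sum);  // round to FP16 once, at the end
    }
}

extern "C" void solve(const __half* A, const __half* B, __half* C, int BATCH, int M, int K, int N) {
    dim3 threads(16, 16);
    dim3 blocks((N + 15) / 16, (M + 15) / 16, BATCH);  // ceil-divide N and M, one grid layer per batch
    fp16_bmm<<<blocks, threads>>>(A, B, C, BATCH, M, K, N);
    cudaDeviceSynchronize();
}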
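
The launch geometry used by solve (16x16 thread blocks, a ceil-divided grid over columns and rows, one grid layer per batch) can be reproduced on the host side as a sanity check. This is an illustrative sketch: `ceil_div` and `launch_shape` are hypothetical helpers, and the batch size of 2 is an arbitrary choice.

```python
def ceil_div(a: int, b: int) -> int:
    """Smallest integer q with q * b >= a, computed as (a + b - 1) // b."""
    return (a + b - 1) // b

def launch_shape(batch: int, M: int, N: int, tile: int = 16):
    """Grid (x, y, z) and threads per block for one thread per output element."""
    grid = (ceil_div(N, tile), ceil_div(M, tile), batch)  # x -> columns, y -> rows, z -> batch
    return grid, tile * tile

print(launch_shape(batch=2, M=256, N=256))  # ((16, 16, 2), 256)
print(launch_shape(batch=2, M=2, N=2))      # ((1, 1, 2), 256)
```

In the second case a single 16x16 block covers a 2x2 output, which is why the kernel needs its row < M and col < N bounds check. The batch limit of 128 is well inside CUDA's gridDim.z maximum of 65535.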

Triton

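import triton, triton.language as tl

@triton.jit
def fp16_bmm(A_ptr, B_ptr, C_ptr, B, M, K, N, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Grid axes: 0 -> row tiles, 1 -> column tiles, 2 -> batch
    pm = tl.program_id(0)
    pn = tl.program_id(1)
    pb = tl.program_id(2)
    rm = pm * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pn * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), tl.float32)  # FP32 accumulator
    for k in range(0, K, BLOCK_K):
        # Mask out-of-range rows, columns and K indices; masked lanes load 0 and contribute nothing
        a = tl.load(A_ptr + pb * M * K + rm[:, None] * K + (k + rk)[None, :], mask=(rm[:, None] < M) & ((k + rk)[None, :] < K), other=0.0)
        b = tl.load(B_ptr + pb * K * N + (k + rk)[:, None] * N + rn[None, :], mask=((k + rk)[:, None] < K) & (rn[None, :] < N), other=0.0)
        # Pass the FP16 tiles to tl.dot directly; its default out_dtype is float32
        acc += tl.dot(a, b)
    tl.store(C_ptr + pb * M * N + rm[:, None] * N + rn[None, :], acc.to(tl.float16), mask=(rm[:, None] < M) & (rn[None, :] < N))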
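
The block-tiled pattern behind a kernel like this is easier to follow when emulated on the CPU: each (row tile, column tile) pair owns an accumulator, marches along K in steps of the K-block size, and skips out-of-range indices at the ragged edges. The sketch below is a pure-Python illustration for a single batch, not Triton code. `tiled_matmul` is a hypothetical helper, and the tiny tile sizes are chosen only to keep the example readable.

```python
def tiled_matmul(A, B, M, K, N, BM=2, BN=2, BK=2):
    """Single-batch matmul over flat row-major lists, computed tile by tile."""
    C = [0.0] * (M * N)
    for m0 in range(0, M, BM):                # one (m0, n0) pair per "program"
        for n0 in range(0, N, BN):
            acc = [[0.0] * BN for _ in range(BM)]  # per-tile accumulator
            for k0 in range(0, K, BK):        # march along K in BK-sized steps
                for i in range(BM):
                    for j in range(BN):
                        for k in range(BK):
                            r, c, kk = m0 + i, n0 + j, k0 + k
                            if r < M and c < N and kk < K:  # plays the role of the load mask
                                acc[i][j] += A[r * K + kk] * B[kk * N + c]
            for i in range(BM):
                for j in range(BN):
                    if m0 + i < M and n0 + j < N:           # plays the role of the store mask
                        C[(m0 + i) * N + (n0 + j)] = acc[i][j]
    return C

# A[0] and B[0] from the example: M=2, K=3, N=2, where K=3 is not a multiple of BK=2
result = tiled_matmul([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6], M=2, K=3, N=2)
print(result)  # [22.0, 28.0, 49.0, 64.0]
```

Without the bounds checks, the last K step would read past the end of each row. A GPU kernel that loads tiles without masks has the same problem whenever M, N or K is not a multiple of its block sizes.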