FP16 Dot Product

agicy, 2026/6/6

Original problem: LeetGPU - FP16 Dot Product

Problem Description

Write a GPU program that computes the dot product of two vectors of 16-bit floating-point numbers (FP16/half):

$$A \cdot B = \sum_{i=0}^{n-1} A_i \cdot B_i$$

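All inputs are stored in FP16. Accumulation is done in FP32 for best precision, and the final result is converted back to FP16.

Implementation Requirements

External libraries are not allowed.
The solve function signature must remain unchanged.
Accumulate in FP32; the final result is stored as a half in the output variable.

Examples

Example 1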

Input:  A = [1.0, 2.0, 3.0, 4.0], B = [5.0, 6.0, 7.0, 8.0]
Output: 70.0

Example 2

Input:  A = [0.5, 1.5, 2.5], B = [2.0, 3.0, 4.0]
Output: 15.5
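
The precision rules above (FP16 inputs, FP32 accumulation, one final conversion to FP16) can be checked on the CPU with a small reference. This is only a sketch: the helper names `to_fp16`, `to_fp32`, and `dot_fp16_reference` are made up for illustration, and rounding is emulated with the standard `struct` module (format `e` is IEEE 754 binary16, `f` is binary32) rather than with real GPU arithmetic.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 binary16 value (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 binary32 value (hypothetical helper)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_fp16_reference(a, b):
    """FP16 inputs, FP32 multiply-accumulate, FP16 result."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp32(to_fp16(x) * to_fp16(y))  # inputs are halves, product kept in FP32
        acc = to_fp32(acc + prod)                # running sum kept in FP32
    return to_fp16(acc)                          # single final rounding to FP16


print(dot_fp16_reference([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]))  # 70.0
print(dot_fp16_reference([0.5, 1.5, 2.5], [2.0, 3.0, 4.0]))            # 15.5
```

Both examples from the problem statement are reproduced exactly, since every intermediate value is representable in FP16.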

Constraints

$A$ and $B$ have the same length.
$1 \le N \le 100{,}000{,}000$.
Performance is measured at $N = 100{,}000{,}000$.

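Approach

The FP16 dot product has the same structure as the FP32 version, except that the inputs are half precision. Each thread converts its element pairs to FP32, multiplies and accumulates them in FP32, and the block then performs a reduction over the per-thread sums. Mixed precision (FP16 inputs → FP32 accumulation → FP16 output) is a standard pattern in modern ML inference, and Tensor Cores support it natively. For large $N$ the accumulation strategy matters: a running sum kept in FP16 loses precision quickly, so every partial sum stays in FP32 and only the final result is rounded to FP16.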
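
To make the precision concern concrete, the sketch below sums 4096 ones twice, once rounding the running total to FP16 after every addition and once rounding to FP32. It is an illustration under assumptions: the helpers `to_fp16`, `to_fp32`, and `accumulate` are hypothetical, and rounding is emulated on the CPU with the standard `struct` module.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 binary16 value (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 binary32 value (hypothetical helper)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(values, round_fn):
    """Sequential sum where the running total is rounded after each addition."""
    acc = 0.0
    for v in values:
        acc = round_fn(acc + v)
    return acc


ones = [1.0] * 4096
fp16_sum = accumulate(ones, to_fp16)
fp32_sum = accumulate(ones, to_fp32)
print(fp16_sum, fp32_sum)  # 2048.0 4096.0
```

The FP16 total stalls at 2048 because the spacing between adjacent FP16 values is 2 from that point on, so adding 1 rounds back to the same number. The FP32 accumulator has no such problem at this scale.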

Implementation

CUDA

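#include <cuda_fp16.h>
#include <cuda_runtime.h>
// Each block reduces its FP32 partial sums, then adds the result to a global FP32 accumulator.
__global__ void fp16_dot(const __half* A, const __half* B, float* acc, int N) {
    __shared__ float sdata[256]; int tid = threadIdx.x; float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < N; i += gridDim.x * blockDim.x)
        sum += __half2float(A[i]) * __half2float(B[i]);  // multiply-accumulate in FP32
    sdata[tid] = sum; __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction in shared memory
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(acc, sdata[0]);  // one atomic per block instead of overwriting output
}
// Runs after fp16_dot on the same stream: rounds the FP32 total to FP16 exactly once.
__global__ void store_half(const float* acc, __half* output) { *output = __float2half(*acc); }
extern "C" void solve(const __half* A, const __half* B, __half* output, int N) {
    float* acc;
    cudaMalloc(&acc, sizeof(float));
    cudaMemset(acc, 0, sizeof(float));  // all-zero bits are 0.0f
    fp16_dot<<<min((N + 255) / 256, 1024), 256>>>(A, B, acc, N);
    store_half<<<1, 1>>>(acc, output);
    cudaDeviceSynchronize();
    cudaFree(acc);
}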
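
The grid-stride loop and the shared-memory tree reduction in the kernel can be traced on the CPU. The sketch below is a sequential simulation, not GPU code: `simulate_dot`, `grid_dim`, and `block_dim` are hypothetical names, the tiny grid (2 blocks of 4 threads) is chosen only so the numbers are easy to follow, and plain Python floats are used, so FP16/FP32 rounding is not modeled. `block_dim` is assumed to be a power of two, as in the kernel.

```python
def simulate_dot(a, b, grid_dim=2, block_dim=4):
    """CPU trace of a grid-stride loop followed by a per-block tree reduction."""
    n = len(a)
    block_partials = []
    for block in range(grid_dim):
        # Phase 1: each "thread" starts at its global index and strides by the grid size.
        sdata = []
        for tid in range(block_dim):
            s = 0.0
            for i in range(block * block_dim + tid, n, grid_dim * block_dim):
                s += a[i] * b[i]
            sdata.append(s)
        # Phase 2: tree reduction, halving the number of active threads each step.
        stride = block_dim // 2
        while stride > 0:
            for tid in range(stride):
                sdata[tid] += sdata[tid + stride]
            stride //= 2
        block_partials.append(sdata[0])
    # Phase 3: the per-block partial sums still have to be combined into one total.
    return sum(block_partials), block_partials


a = [float(i) for i in range(1, 11)]  # 1.0 .. 10.0
b = [1.0] * 10
total, partials = simulate_dot(a, b)
print(total, partials)  # 55.0 [29.0, 26.0]
```

With more than one block, no single block holds the full sum, which is why the per-block results need a final combining step.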

Triton

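import triton, triton.language as tl

@triton.jit
def fp16_dot(A_ptr, B_ptr, acc_ptr, N, BLOCK: tl.constexpr):
    # acc_ptr must point to a zero-initialized FP32 scalar so that the cross-block sum stays in FP32.
    # Launch with a 1D grid of triton.cdiv(N, BLOCK) programs, then cast the FP32 total to FP16 for the output.
    idx = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = idx < N  # lanes past the end of the vectors load 0.0 and contribute nothing
    a = tl.load(A_ptr + idx, mask=mask, other=0.0)
    b = tl.load(B_ptr + idx, mask=mask, other=0.0)
    tl.atomic_add(acc_ptr, tl.sum(a.to(tl.float32) * b.to(tl.float32), axis=0))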
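
The Triton kernel has no loop, so it relies on being launched with enough programs to cover all $N$ elements, with the mask zeroing out the lanes that fall past the end. The sketch below mimics that partitioning in plain Python. It is an assumption-laden illustration: `blockwise_dot` and `block` are hypothetical names, the `+=` on `total` stands in for the atomic add, and FP16/FP32 rounding is not modeled.

```python
def blockwise_dot(a, b, block=4):
    """Split the vectors into fixed-size blocks, mask the tail, and sum block results."""
    n = len(a)
    num_programs = (n + block - 1) // block  # ceil(n / block), one program per block
    total = 0.0
    for pid in range(num_programs):
        idx = [pid * block + k for k in range(block)]
        mask = [i < n for i in idx]
        # Masked load: out-of-range lanes read 0.0 instead of touching memory.
        av = [a[i] if m else 0.0 for i, m in zip(idx, mask)]
        bv = [b[i] if m else 0.0 for i, m in zip(idx, mask)]
        total += sum(x * y for x, y in zip(av, bv))  # stands in for the atomic add
    return total, num_programs


print(blockwise_dot([0.5, 1.5, 2.5], [2.0, 3.0, 4.0]))  # (15.5, 1)
print(blockwise_dot([1.0] * 10, [2.0] * 10))            # (20.0, 3)
```

In the second call the last program covers indices 8 to 11, and only the first two lanes are inside the vectors.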