Batch Normalization

agicy · 2026/6/6 · about 1 min read

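Original problem: LeetGPU - Batch Normalization

Problem Description

Write a GPU program that implements the forward pass of batch normalization for a 2D input tensor. Given an input tensor of shape $[N, C]$ ($N$ is the batch size, $C$ is the number of features), compute the normalized output using the learnable scale ($\gamma$) and shift ($\beta$) parameters.

For each feature channel $j$, batch normalization computes: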

$$
\begin{aligned}
\mu_j &= \frac{1}{N}\sum_{i=1}^{N} x_{i,j} \\
\sigma_j^2 &= \frac{1}{N}\sum_{i=1}^{N} (x_{i,j} - \mu_j)^2 \\
\hat{x}_{i,j} &= \frac{x_{i,j} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}} \\
y_{i,j} &= \gamma_j \hat{x}_{i,j} + \beta_j
\end{aligned}
$$
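To make the four formulas concrete, here is a minimal CPU reference in plain Python. It is only a sketch for checking results by hand: the name `batch_norm_reference` and the list-of-rows layout are assumptions made for illustration and are not part of the LeetGPU interface.

```python
import math

def batch_norm_reference(x, gamma, beta, eps=1e-5):
    """Batch-normalize a list of N rows with C features each (hypothetical helper)."""
    n_rows, n_cols = len(x), len(x[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        column = [row[j] for row in x]
        # mu_j: mean of channel j over the batch.
        mu = sum(column) / n_rows
        # sigma_j^2: biased variance (divide by N, not N - 1).
        var = sum((v - mu) ** 2 for v in column) / n_rows
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n_rows):
            # y = gamma * x_hat + beta
            out[i][j] = gamma[j] * (column[i] - mu) * inv_std + beta[j]
    return out
```

Calling it with the inputs from the examples in this post is a quick way to compare a GPU result against the definition.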

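Implementation Requirements

External libraries are not allowed.
The signature of the solve function must remain unchanged.
The final result must be stored in the output tensor.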

Examples

Example 1

Input:  input = [[1,2],[3,4],[5,6]] (N=3, C=2), gamma=[1,1], beta=[0,0], eps=1e-5
Output: [[-1.224, -1.224], [0, 0], [1.224, 1.224]]

Example 2

Input:  input = [[0,1],[2,3]] (N=2, C=2), gamma=[2,0.5], beta=[1,-1], eps=1e-5
Output: [[-1, -1.5], [3, -0.5]]
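As a sanity check on Example 2, the arithmetic can be followed by hand. This sketch ignores $\epsilon$, which shifts each result by no more than about $10^{-5}$ here:

$$
\begin{aligned}
\text{channel } 0:\quad & x = (0, 2),\ \mu_0 = 1,\ \sigma_0^2 = \tfrac{(0-1)^2 + (2-1)^2}{2} = 1,\ \hat{x} = (-1, 1),\ y = 2\hat{x} + 1 = (-1, 3) \\
\text{channel } 1:\quad & x = (1, 3),\ \mu_1 = 2,\ \sigma_1^2 = \tfrac{(1-2)^2 + (3-2)^2}{2} = 1,\ \hat{x} = (-1, 1),\ y = 0.5\,\hat{x} - 1 = (-1.5, -0.5)
\end{aligned}
$$

Reading the two channels back as columns gives [[-1, -1.5], [3, -0.5]].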

Constraints

$1 \le N \le 10{,}000$, $1 \le C \le 1{,}024$.
$\epsilon = 10^{-5}$.
$-100.0 \le \text{input} \le 100.0$, $0.1 \le \gamma \le 10.0$, $-10.0 \le \beta \le 10.0$.
Performance is measured at $N = 5{,}000$.

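Approach

BatchNorm takes two stages: the first reduces each channel over the batch to obtain $\mu_j$ and $\sigma_j^2$, and the second normalizes and scales element by element. The reduction over the $N$ dimension can be done efficiently with warp shuffles and shared memory. When $C$ is large, each block can handle one channel, with many blocks covering all channels in parallel. Alternatively, Welford's online algorithm computes the mean and variance together in a single pass, which reduces memory traffic. The code in this post takes the simplest route instead: one thread (or one Triton lane) per channel, looping over $N$.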
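The Welford idea mentioned above can be sketched on the CPU in plain Python. The function names and the `num_workers` parameter are hypothetical: each "worker" stands in for a GPU thread that folds a strided slice of one channel into a running `(count, mean, m2)` state, and the merge step stands in for the reduction that a real kernel would perform through warp shuffles or shared memory.

```python
def welford_update(state, x):
    """Fold one sample into a running (count, mean, m2) state."""
    count, mean, m2 = state
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)  # uses the updated mean
    return count, mean, m2

def welford_merge(a, b):
    """Combine two partial states, as a parallel reduction step would."""
    count_a, mean_a, m2_a = a
    count_b, mean_b, m2_b = b
    count = count_a + count_b
    if count == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * count_b / count
    m2 = m2_a + m2_b + delta * delta * count_a * count_b / count
    return count, mean, m2

def channel_stats(values, num_workers=4):
    """Mean and biased variance of one channel from strided partial states."""
    partials = []
    for w in range(num_workers):
        state = (0, 0.0, 0.0)
        for v in values[w::num_workers]:  # worker w reads every num_workers-th sample
            state = welford_update(state, v)
        partials.append(state)
    total = (0, 0.0, 0.0)
    for p in partials:
        total = welford_merge(total, p)
    count, mean, m2 = total
    return mean, m2 / count  # assumes at least one sample (N >= 1)
```

For instance, `channel_stats([1.0, 3.0, 5.0])` should reproduce the mean 3 and variance 8/3 of the first channel in Example 1. Every sample is read once, and the merge is order-independent up to rounding, which is what makes it suitable for a tree reduction.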

Code

CUDA

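#include <cuda_runtime.h>
#include <math.h>

// One thread per channel c. Adjacent threads read adjacent columns, so loads are coalesced.
__global__ void bn_kernel(const float* input, float* output, const float* gamma,
    const float* beta, int N, int C) {
    int c = threadIdx.x + blockIdx.x * blockDim.x;
    if (c < C) {
        // Pass 1: mean over the batch dimension.
        float sum = 0.0f;
        for (int n = 0; n < N; n++) sum += input[n*C + c];
        float mean = sum / N;
        // Pass 2: variance as the mean squared deviation. The shortcut sq/N - mean*mean
        // can go negative through cancellation in float, and rsqrtf of a negative is NaN.
        float sq = 0.0f;
        for (int n = 0; n < N; n++) { float d = input[n*C + c] - mean; sq += d*d; }
        float inv = rsqrtf(sq / N + 1e-5f);  // eps = 1e-5, as fixed by the constraints
        // Pass 3: normalize, scale, and shift.
        for (int n = 0; n < N; n++) output[n*C + c] = gamma[c]*(input[n*C + c] - mean)*inv + beta[c];
    }
}

extern "C" void solve(const float* input, float* output, const float* gamma,
    const float* beta, int N, int C) {
    // Round the grid up so every channel gets a thread.
    bn_kernel<<<(C + 255) / 256, 256>>>(input, output, gamma, beta, N, C);
    cudaDeviceSynchronize();
}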

Triton

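import triton
import triton.language as tl

@triton.jit
def bn_kernel(input_ptr, output_ptr, gamma_ptr, beta_ptr, N: tl.constexpr, C: tl.constexpr, BLOCK: tl.constexpr):
    # Each program handles BLOCK adjacent channels; lanes past C are masked off.
    c = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = c < C
    gamma = tl.load(gamma_ptr + c, mask=mask, other=1.0)
    beta = tl.load(beta_ptr + c, mask=mask, other=0.0)
    # Pass 1: per-channel mean.
    acc = tl.zeros((BLOCK,), tl.float32)
    for n in range(N):
        acc += tl.load(input_ptr + n * C + c, mask=mask, other=0.0)
    mean = acc / N
    # Pass 2: variance as the mean squared deviation, which cannot go negative
    # the way sq/N - mean*mean can under float cancellation.
    sq = tl.zeros((BLOCK,), tl.float32)
    for n in range(N):
        d = tl.load(input_ptr + n * C + c, mask=mask, other=0.0) - mean
        sq += d * d
    inv = 1.0 / tl.sqrt(sq / N + 1e-5)
    # Pass 3: normalize, scale, and shift.
    for n in range(N):
        x = tl.load(input_ptr + n * C + c, mask=mask, other=0.0)
        tl.store(output_ptr + n * C + c, gamma * (x - mean) * inv + beta, mask=mask)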