Llama Transformer Block

agicy · 2026/6/6 · about 1 minute

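Original problem: LeetGPU - Llama Transformer Block

Problem Description

Implement a single Llama-style Transformer decoder block. Given an input tensor $x$ of shape $(seq\_len, 512)$, a packed weight buffer, and precomputed RoPE tables, compute the output using the pre-norm architecture: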

$$\begin{aligned} x' &= x + \text{GQA}(\text{RMSNorm}_1(x), \text{RoPE}) \\ \text{output} &= x' + \text{SwiGLU}(\text{RMSNorm}_2(x')) \end{aligned}$$
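
To make the residual structure concrete, here is a minimal pure-Python sketch of the two equations. The names `add` and `decoder_block`, and the idea of passing the sub-layers in as callables, are illustrative assumptions rather than part of the problem statement.

```python
def add(a, b):
    """Element-wise sum of two (seq_len, d_model) matrices stored as nested lists."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def decoder_block(x, norm1, attention, norm2, ffn):
    """Pre-norm block: each sub-layer reads a normalized copy of its input,
    while the residual connection adds the un-normalized input back."""
    h = add(x, attention(norm1(x)))   # x' = x + GQA(RMSNorm_1(x), RoPE)
    return add(h, ffn(norm2(h)))      # output = x' + SwiGLU(RMSNorm_2(x'))
```

Because the normalization sits inside each residual branch, a sub-layer that outputs zeros leaves the input unchanged.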

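Key differences between Llama and GPT-2:

RMSNorm replaces LayerNorm (no mean subtraction)
Grouped Query Attention (GQA) replaces full multi-head attention (fewer KV heads than Q heads)
RoPE rotary position embeddings
SwiGLU (rather than GELU) as the FFN activation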
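
As a rough illustration of the first and third items, the sketch below implements RMSNorm and a RoPE rotation for a single vector in pure Python. The value `eps=1e-5` and the half-split pairing of dimensions are assumptions: the text above does not specify either, and the actual problem may use a different epsilon or an interleaved layout.

```python
import math


def rms_norm(row, gain, eps=1e-5):
    """Scale by the root mean square; unlike LayerNorm, no mean is subtracted.
    eps is an assumed value, not taken from the problem statement."""
    mean_sq = sum(v * v for v in row) / len(row)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(row, gain)]


def rope_rotate(vec, cos, sin):
    """Rotate each pair (vec[i], vec[i + half]) by the angle given by cos[i], sin[i].
    Assumption: half-split pairing; an interleaved layout would pair (2i, 2i+1)."""
    half = len(vec) // 2
    out = [0.0] * len(vec)
    for i in range(half):
        a, b = vec[i], vec[i + half]
        out[i] = a * cos[i] - b * sin[i]
        out[i + half] = a * sin[i] + b * cos[i]
    return out
```

Since RoPE is a pure rotation, it preserves the norm of each query and key vector; only the relative angle between positions changes.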

Architecture parameters: $d_{model}=512$, $n_{heads}=8$, $n_{kv\_heads}=4$, $d_{ffn}=1{,}368$.
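
The derived sizes follow directly from these parameters. The arithmetic below is a sketch that assumes the usual bias-free Llama weight set (Q, K, V, and output projections, two RMSNorm gain vectors, and gate/up/down FFN matrices); the actual layout of the packed weight buffer is not given here.

```python
D_MODEL, N_HEADS, N_KV_HEADS, D_FFN = 512, 8, 4, 1368

d_head = D_MODEL // N_HEADS      # 64 dimensions per head
group = N_HEADS // N_KV_HEADS    # 2 Q heads share each KV head
kv_dim = N_KV_HEADS * d_head     # 256 columns for K and for V

# Assumed weight set, no biases: Wq, Wk, Wv, Wo, then gate/up/down, then two norm gains.
attn_params = D_MODEL * D_MODEL + 2 * D_MODEL * kv_dim + D_MODEL * D_MODEL
ffn_params = 3 * D_MODEL * D_FFN
norm_params = 2 * D_MODEL
total_params = attn_params + ffn_params + norm_params

print(d_head, group, kv_dim)     # 64 2 256
print(total_params)              # 2888704 floats under the stated assumption
```

Under that assumption the FFN matrices account for roughly 73% of the weights, so the SwiGLU matmuls dominate weight traffic.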

Implementation Requirements

External libraries are not allowed.
The solve function signature must remain unchanged.

Constraints

$1 \le seq\_len \le 4{,}096$.
Performance is measured at $seq\_len = 1{,}024$.

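Approach

A Llama block follows the same computation pattern as a GPT-2 block, with two differences to handle: GQA requires broadcasting KV heads (every 2 Q heads share 1 KV head, since $n_{heads}/n_{kv\_heads} = 8/4 = 2$), and RMSNorm is simpler than LayerNorm because it skips subtracting the mean of the vector. SwiGLU first computes the gate and up projections (which can be merged into a single matmul), then applies SiLU gating followed by the element-wise multiplication, and finally the down projection. RoPE can be fused into the attention kernel.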
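
The KV-head broadcast and the SwiGLU gating step can be sketched in a few lines of pure Python. The helper names are hypothetical, and the mapping assumes that consecutive Q heads form a group (heads 0 and 1 read KV head 0, and so on), which is the common Llama convention but is not stated in the problem text.

```python
import math


def kv_head_for(q_head, n_heads=8, n_kv_heads=4):
    """Index of the KV head read by a given Q head.
    Assumption: consecutive Q heads form one group."""
    return q_head // (n_heads // n_kv_heads)


def silu(v):
    """SiLU(v) = v * sigmoid(v), written to avoid overflow for large |v|."""
    if v >= 0:
        return v / (1.0 + math.exp(-v))
    e = math.exp(v)
    return v * e / (1.0 + e)


def swiglu(gate, up):
    """Element-wise SiLU(gate) * up, applied between the gate/up and down projections."""
    return [silu(g) * u for g, u in zip(gate, up)]


print([kv_head_for(h) for h in range(8)])  # [0, 0, 1, 1, 2, 2, 3, 3]
```

In a kernel, this mapping means K and V never need to be physically duplicated: each Q head simply indexes the K/V slice of its group.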

Code

CUDA

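#include <cuda_runtime.h>
#include <math.h>

#define D 512     // d_model
#define FFN 1368  // d_ffn
#define H 8       // number of Q heads
#define KVH 4     // number of KV heads
#define DK 64     // per-head dimension, D / H

// Placeholder kernel: copies x to out. The full pipeline is still to be implemented:
// RMSNorm -> GQA with RoPE -> residual -> RMSNorm -> SwiGLU -> residual
__global__ void llama_block(const float* x, const float* w, const float* rope_cos,
                            const float* rope_sin, float* out, int S) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= S * D) return;  // guard the tail of the last block
    out[i] = x[i];           // placeholder
}

extern "C" void solve(const float* x, const float* w, const float* rope_cos,
                      const float* rope_sin, float* out, int S) {
    int threads = 256;
    int blocks = (S * D + threads - 1) / threads;  // one thread per output element
    llama_block<<<blocks, threads>>>(x, w, rope_cos, rope_sin, out, S);
    cudaDeviceSynchronize();
}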

Triton

import triton
import triton.language as tl

@triton.jit
def llama_block(x_ptr, w_ptr, cos_ptr, sin_ptr, out_ptr, S, BLOCK: tl.constexpr):
    # Placeholder kernel: copies x to out. Still to be implemented:
    # RMSNorm -> GQA+RoPE -> residual -> RMSNorm -> SwiGLU -> residual
    idx = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = idx < S * 512  # 512 = d_model
    x = tl.load(x_ptr + idx, mask=mask)
    tl.store(out_ptr + idx, x, mask=mask)  # placeholder