Top-p Sampling

agicy · 2026/6/6

Original problem: LeetGPU - Top-p Sampling

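Problem Description

Write a GPU program that implements top-p sampling (nucleus sampling) for LLM inference. Top-p sampling is a text generation technique that samples from the smallest set of tokens whose cumulative probability reaches a threshold $p$. It balances randomness and quality better than pure top-k or greedy sampling.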

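Given the logits (unnormalized scores) of a language model, perform the following steps:

1. Convert the logits to probabilities with softmax.
2. Sort the tokens by probability in descending order.
3. Find the smallest set of tokens whose cumulative probability is $\ge p$ (the "nucleus").
4. Renormalize the probabilities inside the nucleus so that they sum to 1.
5. Sample one token from the nucleus using the provided random seed.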
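
The five steps above can be sketched on the CPU as a reference for checking GPU output. This is a minimal sketch, not the LeetGPU reference solution: the function name `top_p_sample` is hypothetical, and Python's `random.Random` stands in for the GPU random number generator, so the token drawn for a given seed will generally differ from a CUDA or Triton implementation.

```python
import math
import random

def top_p_sample(logits, p, seed):
    """Hypothetical CPU reference for top-p sampling (not the official solution)."""
    # Step 1: numerically stable softmax.
    mx = max(logits)
    exps = [math.exp(x - mx) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Step 2: token indices sorted by descending probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Step 3: smallest prefix whose cumulative probability reaches p.
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= p:
            break
    # Steps 4 and 5: drawing r in [0, cum) is equivalent to renormalizing
    # the nucleus to sum to 1 and drawing r in [0, 1).
    r = random.Random(seed).random() * cum
    running = 0.0
    for i in nucleus:
        running += probs[i]
        if r < running:
            return i
    return nucleus[-1]  # guard against floating-point rounding
```

When a single token already carries at least $p$ of the probability mass, the nucleus has one element and the function returns that token for every seed.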

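Implementation Requirements

External libraries are not allowed.
The signature of the solve function must remain unchanged.
The softmax computation must be numerically stable.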
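
The stability requirement matters because a logit of 100 gives $e^{100} \approx 2.7 \times 10^{43}$, which exceeds the largest finite float32 value (about $3.4 \times 10^{38}$). Subtracting the maximum logit first keeps every exponent argument at or below 0. A small sketch in plain Python follows; the helper name `stable_softmax` is illustrative only.

```python
import math

FLOAT32_MAX = 3.4028235e38  # largest finite float32 value

def stable_softmax(logits):
    """Softmax with the maximum subtracted, so exp() never sees a positive argument."""
    mx = max(logits)
    exps = [math.exp(x - mx) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# exp(100) fits in a Python float (float64) but would overflow to inf in float32.
naive_overflows_in_float32 = math.exp(100.0) > FLOAT32_MAX
probs = stable_softmax([100.0, 99.0, -100.0])
```

Because the shift cancels between numerator and denominator, the result is mathematically identical to the unshifted softmax.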

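Examples

Example 1

Input:  logits = [1.0, 2.0, 3.0, 0.5], p = 0.9, seed = 42
Output: sampled_token = 2 or 1 (randomly sampled from the highest-probability tokens; note that under the $\ge p$ rule token 0 also belongs to the nucleus, because tokens 2 and 1 together cover only about 0.863)

Example 2

Input:  logits = [10.0, 1.0, 1.0], p = 0.5, seed = 123
Output: sampled_token = 0 (a single token holds almost all of the probability mass)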
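
To see where the threshold is crossed in the two examples, the sorted probabilities and their running sums can be tabulated. The helper `cumulative_table` is a hypothetical name used only for this illustration.

```python
import math
from itertools import accumulate

def cumulative_table(logits):
    """Return (token, probability, cumulative probability) rows in descending order."""
    mx = max(logits)
    exps = [math.exp(x - mx) for x in logits]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), key=lambda t: -t[0])
    running = accumulate(prob for prob, _ in ranked)
    return [(tok, prob, cum) for (prob, tok), cum in zip(ranked, running)]

for name, logits in [("Example 1", [1.0, 2.0, 3.0, 0.5]), ("Example 2", [10.0, 1.0, 1.0])]:
    print(name)
    for tok, prob, cum in cumulative_table(logits):
        print(f"  token {tok}: prob = {prob:.4f}, cumulative = {cum:.4f}")
```

For Example 1 the running sum is about 0.631 after token 2 and about 0.863 after token 1, so with $p = 0.9$ a third token (token 0, cumulative about 0.948) is needed to satisfy the $\ge p$ rule. For Example 2, token 0 alone carries more than 0.999 of the mass.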

Constraints

$3 \le vocab\_size \le 50{,}000$
$-100.0 \le logits[i] \le 100.0$
$0.0 < p \le 1.0$
$0 \le sampled\_token < vocab\_size$
Performance is measured at $vocab\_size = 50{,}000$.

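Approach

The performance bottleneck of top-p sampling is the sort: up to 50,000 probabilities must be arranged in descending order. On a GPU, radix sort or bitonic sort handles a vocabulary of this size efficiently. After sorting, a prefix-sum scan locates the cutoff where the cumulative probability reaches $p$, and a weighted random draw then selects a token from within the nucleus. In practice, the random number for that draw is usually generated with cuRAND.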
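
The prefix-sum step parallelizes well. As a rough illustration, the sketch below simulates a Hillis-Steele inclusive scan in plain Python: each pass of the outer loop corresponds to one parallel step in which every element adds the value `stride` positions to its left, so $n$ elements need about $\log_2 n$ passes. The function names are hypothetical, and a real GPU kernel would use shared memory and synchronization rather than list copies.

```python
def inclusive_scan(values):
    """Simulate a Hillis-Steele inclusive prefix sum; each loop pass is one parallel step."""
    out = list(values)
    stride = 1
    while stride < len(out):
        prev = out[:]  # snapshot: every "thread" reads values from the previous step
        for i in range(stride, len(out)):
            out[i] = prev[i] + prev[i - stride]
        stride *= 2
    return out

def nucleus_size(sorted_probs, p):
    """Number of leading tokens needed for the cumulative probability to reach p."""
    cum = inclusive_scan(sorted_probs)
    below = sum(1 for c in cum if c < p)  # a parallel count (reduction) on the GPU
    return min(below + 1, len(sorted_probs))

sizes = [nucleus_size([0.5, 0.25, 0.125, 0.125], p) for p in (0.4, 0.75, 1.0)]
```

Counting the entries that are still below $p$ and adding one gives the cutoff without a sequential search, which is why the scan and a reduction are enough for this step.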

Implementation

CUDA

#include <cuda_runtime.h>
#include <math.h>
#include <curand_kernel.h>

// Single-thread reference kernel: correct for any N, but not tuned for speed.
// probs and idx are N-element scratch buffers in global memory.
__global__ void topp_kernel(const float* logits, int* output, float* probs, int* idx,
                            int N, float p, unsigned long long seed) {
    if (blockIdx.x != 0 || threadIdx.x != 0) return;
    // Numerically stable softmax: subtract the maximum over all N logits
    float mx = -INFINITY, sm = 0.0f;
    for (int i = 0; i < N; i++) mx = fmaxf(mx, logits[i]);
    for (int i = 0; i < N; i++) { probs[i] = expf(logits[i] - mx); sm += probs[i]; idx[i] = i; }
    for (int i = 0; i < N; i++) probs[i] /= sm;
    // Partial selection sort (descending) that stops as soon as the nucleus is complete
    float cum = 0.0f; int cutoff = N;
    for (int i = 0; i < N; i++) {
        int best = i;
        for (int j = i + 1; j < N; j++) if (probs[idx[j]] > probs[idx[best]]) best = j;
        int t = idx[i]; idx[i] = idx[best]; idx[best] = t;
        cum += probs[idx[i]];
        if (cum >= p) { cutoff = i + 1; break; }
    }
    // Sample inside the nucleus. Scaling r by the nucleus mass is equivalent to
    // renormalizing the nucleus probabilities so that they sum to 1.
    curandState_t state; curand_init(seed, 0, 0, &state);
    float r = curand_uniform(&state) * cum;  // curand_uniform returns a value in (0, 1]
    float c = 0.0f;
    for (int i = 0; i < cutoff; i++) { c += probs[idx[i]]; if (r <= c) { *output = idx[i]; return; } }
    *output = idx[cutoff - 1];  // rounding fallback: last token in the nucleus
}
extern "C" void solve(const float* logits, int* output, int N, float p, unsigned long long seed) {
    float* probs; int* idx;  // scratch space sized for the full vocabulary
    cudaMalloc(&probs, N * sizeof(float));
    cudaMalloc(&idx, N * sizeof(int));
    topp_kernel<<<1, 1>>>(logits, output, probs, idx, N, p, seed);
    cudaDeviceSynchronize();
    cudaFree(probs); cudaFree(idx);
}

Triton

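import triton
import triton.language as tl

@triton.jit
def topp_kernel(logits_ptr, output_ptr, seed, N: tl.constexpr, p: tl.constexpr, BLOCK: tl.constexpr):
    # Single program; BLOCK must be a power of two >= N, e.g. triton.next_power_of_2(N)
    pos = tl.arange(0, BLOCK)
    logits = tl.load(logits_ptr + pos, mask=pos < N, other=-float("inf"))
    # Stable softmax; padded lanes end up with probability 0
    e = tl.exp(logits - tl.max(logits, axis=0))
    probs = e / tl.sum(e, axis=0)
    # tl.sort returns values only, so pack (probability bits, token index) into one int64 key.
    # Bit patterns of non-negative float32 values sort in the same order as the floats.
    keys = (probs.to(tl.int32, bitcast=True).to(tl.int64) << 32) | pos.to(tl.int64)
    keys = tl.sort(keys, descending=True)
    sorted_probs = (keys >> 32).to(tl.int32).to(tl.float32, bitcast=True)
    sorted_idx = keys.to(tl.int32)  # truncation keeps the low 32 bits (the token index)
    cumsum = tl.cumsum(sorted_probs, axis=0)
    # A token is in the nucleus if the mass of all higher-ranked tokens is still below p
    in_nucleus = ((cumsum - sorted_probs) < p) & (pos < N)
    mass = tl.max(tl.where(in_nucleus, cumsum, 0.0), axis=0)
    # Every lane uses offset 0, so all lanes share one uniform draw in [0, 1)
    r = tl.rand(seed, tl.zeros_like(pos))
    hit = in_nucleus & (cumsum >= r * mass)
    first = tl.min(tl.where(hit, pos, BLOCK), axis=0)
    tl.store(output_ptr, tl.sum(tl.where(pos == first, sorted_idx, 0), axis=0))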