Matrix Copy

agicy, 2026/6/6, about 1 minute read

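Original problem: LeetGPU - Matrix Copy

Problem Description

Write a GPU program that copies an $N \times N$ matrix of 32-bit floating-point numbers from the input array $A$ to the output array $B$ on the GPU. The program should perform a direct element-wise copy so that $B[i][j] = A[i][j]$ holds for all valid indices.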

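Implementation Requirements

External libraries are not permitted.
The solve function signature must remain unchanged.
The final result must be stored in matrix $B$.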

Examples

Example 1

Input:  A = [[1.0, 2.0], [3.0, 4.0]]
Output: B = [[1.0, 2.0], [3.0, 4.0]]

Example 2

Input:  A = [[5.5, 6.6, 7.7], [8.8, 9.9, 10.1], [11.2, 12.3, 13.4]]
Output: B = [[5.5, 6.6, 7.7], [8.8, 9.9, 10.1], [11.2, 12.3, 13.4]]

Constraints

$1 \le N \le 4{,}096$.
All elements are 32-bit floating-point numbers.
Performance is measured at $N = 4{,}096$.
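To get a feel for the benchmark size, the arithmetic below works out how much data the copy moves at $N = 4{,}096$. It is only a back-of-the-envelope sketch: it assumes the copy reads every element of $A$ once and writes every element of $B$ once, and it also checks that $N \times N$ fits in a signed 32-bit index such as the `int N2` used in the CUDA code.

```python
# Back-of-the-envelope sizing for the benchmark case N = 4096.
N = 4096
BYTES_PER_FLOAT = 4  # one 32-bit float

num_elements = N * N
bytes_per_matrix = num_elements * BYTES_PER_FLOAT
total_traffic = 2 * bytes_per_matrix  # assumption: one read of A plus one write of B

INT32_MAX = 2**31 - 1
fits_in_int32 = num_elements <= INT32_MAX

print(num_elements)               # 16777216
print(bytes_per_matrix // 2**20)  # 64  (MiB per matrix)
print(total_traffic // 2**20)     # 128 (MiB moved in total)
print(fits_in_int32)              # True
```

Since the kernel does no arithmetic on the data, its running time should be governed almost entirely by how quickly these 128 MiB can be moved.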

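Approach

Matrix copy is one of the most basic GPU operations. It can be done directly with cudaMemcpy (device to device), or with a hand-written element-wise copy kernel. The value of writing the kernel by hand lies in seeing why coalesced memory access matters: threads should walk the matrix in row-major order, with consecutive threads touching consecutive addresses, so that each warp's loads and stores combine into as few memory transactions as possible.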
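The effect of access order can be illustrated on the CPU with a simplified model. The sketch below assumes a warp of 32 threads and 128-byte aligned memory segments (a common simplification, not a statement about any particular GPU), and counts how many segments one warp touches in a row-major $N \times N$ float matrix. The helper names `segments_touched` and `warp_addresses` are made up for this illustration.

```python
def segments_touched(addresses, segment_bytes=128):
    """Count the distinct aligned segments covered by a list of byte offsets."""
    return len({addr // segment_bytes for addr in addresses})


def warp_addresses(N, row, col, along_row, warp_size=32, elem_bytes=4):
    """Byte offsets accessed by one warp in a row-major N x N float matrix."""
    if along_row:
        # Consecutive threads take consecutive columns: offsets differ by 4 bytes.
        return [(row * N + col + t) * elem_bytes for t in range(warp_size)]
    # Consecutive threads take consecutive rows: offsets differ by N * 4 bytes.
    return [((row + t) * N + col) * elem_bytes for t in range(warp_size)]


N = 4096
coalesced = segments_touched(warp_addresses(N, 0, 0, along_row=True))
strided = segments_touched(warp_addresses(N, 0, 0, along_row=False))
print(coalesced, strided)  # 1 32
```

Under this model, the row-wise mapping serves a whole warp from a single segment, while the column-wise mapping needs a separate segment for every thread. Indexing the matrix as a flat array of $N \times N$ elements gives the row-wise pattern automatically.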

Implementation

CUDA

#include <cuda_runtime.h>

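// Scalar copy: one thread per element of the flattened N*N array.
__global__ void matrix_copy_kernel(const float* A, float* B, int N2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N2) {  // guard the threads past the end in the last block
        B[i] = A[i];
    }
}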

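// float4 vectorized copy (LDG.128 + STG.128): each iteration moves 16 bytes.
// The float4 casts assume A and B are 16-byte aligned (true for cudaMalloc allocations).
// Alternative to matrix_copy_kernel; solve below does not launch it.
__global__ void matrix_copy_float4(const float* A, float* B, int N2) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;  // total number of threads in the grid
    int N4 = N2 / 4;                      // number of complete float4 groups
    const float4* A4 = (const float4*)A;
    float4* B4 = (float4*)B;
    // Grid-stride loop over the float4 groups.
    for (int i = idx; i < N4; i += stride) {
        B4[i] = A4[i];
    }
    // Scalar tail: the remaining N2 % 4 elements.
    for (int i = N4 * 4 + idx; i < N2; i += stride) {
        B[i] = A[i];
    }
}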

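extern "C" void solve(const float* A, float* B, int N) {
    int N2 = N * N;  // total element count; at most 4096 * 4096, which fits in int
    int threadsPerBlock = 256;
    // Round up so the last, partially filled block is still launched.
    int blocksPerGrid = (N2 + threadsPerBlock - 1) / threadsPerBlock;
    matrix_copy_kernel<<<blocksPerGrid, threadsPerBlock>>>(A, B, N2);
    cudaDeviceSynchronize();
}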
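The index pattern of matrix_copy_float4 can be checked on the CPU. The sketch below is a plain Python emulation, not GPU code: it replays the grid-stride vector loop and the scalar tail loop for every emulated thread and counts how often each element is written. The function name `float4_copy_emulated` and the tiny grid dimensions are hypothetical choices for this illustration.

```python
def float4_copy_emulated(a, grid_dim, block_dim):
    """CPU emulation of the indexing used by matrix_copy_float4."""
    n2 = len(a)
    b = [None] * n2
    writes = [0] * n2                 # how many times each element is stored
    stride = grid_dim * block_dim     # total number of emulated threads
    n4 = n2 // 4                      # number of complete 4-float groups
    for idx in range(stride):         # one outer iteration per emulated thread
        # Vector part: each step copies one group of four floats.
        for i in range(idx, n4, stride):
            for k in range(4):
                b[4 * i + k] = a[4 * i + k]
                writes[4 * i + k] += 1
        # Scalar tail: the remaining n2 % 4 elements.
        for i in range(n4 * 4 + idx, n2, stride):
            b[i] = a[i]
            writes[i] += 1
    return b, writes


a = [float(i) for i in range(3 * 3)]  # N = 3, so N * N = 9 is not a multiple of 4
b, writes = float4_copy_emulated(a, grid_dim=2, block_dim=2)
print(b == a, set(writes))  # True {1}
```

Every element is written exactly once even though the element count is not divisible by four and there are fewer threads than elements, which is the property the two loops in the kernel are meant to guarantee.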

Triton

import triton
import triton.language as tl

@triton.jit
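def matrix_copy_kernel(A_ptr, B_ptr, N: tl.constexpr, BLOCK_SIZE: tl.constexpr):
    # Each program copies one contiguous block of BLOCK_SIZE elements.
    idx = tl.program_id(0) * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    # Mask off the lanes that fall past the end of the N*N array.
    mask = idx < N * N
    tl.store(B_ptr + idx, tl.load(A_ptr + idx, mask=mask), mask=mask)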
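The Triton kernel is shown without its launch code. As a rough guide to how many programs it needs, the sketch below emulates the masked block copy in plain Python (no Triton and no GPU involved). It assumes a one-dimensional launch grid of `ceil(N * N / BLOCK_SIZE)` programs, which is the value `triton.cdiv(N * N, BLOCK_SIZE)` would compute; `masked_block_copy` and the local `cdiv` helper are hypothetical names for this illustration.

```python
def cdiv(x, y):
    """Ceiling division, the same rounding that triton.cdiv performs."""
    return (x + y - 1) // y


def masked_block_copy(a, block_size):
    """CPU emulation of the masked block copy; one loop iteration per program."""
    total = len(a)
    b = [None] * total
    num_programs = cdiv(total, block_size)  # assumed 1-D launch grid
    for pid in range(num_programs):
        idx = [pid * block_size + k for k in range(block_size)]
        mask = [i < total for i in idx]
        for i, keep in zip(idx, mask):
            if keep:  # masked-off lanes neither load nor store
                b[i] = a[i]
    return b, num_programs


a = [float(i) for i in range(3 * 3)]  # N = 3, so N * N = 9 elements
b, num_programs = masked_block_copy(a, block_size=4)
print(b == a, num_programs)  # True 3
```

The last program covers indices 8 to 11, and the mask keeps only index 8, so nothing outside the array is touched. In the real kernel, BLOCK_SIZE must be a power of two because `tl.arange` requires it.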