Matrix Multiplication

agicy · 2026/6/6 · about 1 minute read

Original problem: LeetGPU - Matrix Multiplication

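Problem Description

Write a GPU program that multiplies two matrices of 32-bit floating-point numbers. Given matrix $A$ ($M \times N$) and matrix $B$ ($N \times K$), compute the product $C = A \times B$, which has dimensions $M \times K$. All matrices are stored in row-major format.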

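Implementation Requirements

External libraries are not allowed.
The signature of the solve function must remain unchanged.
The final result must be stored in matrix $C$.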

Examples

Example 1

Input:  Matrix A (2×2): [[1.0, 2.0], [3.0, 4.0]]
        Matrix B (2×2): [[5.0, 6.0], [7.0, 8.0]]
Output: Matrix C (2×2): [[19.0, 22.0], [43.0, 50.0]]

Example 2

Input:  Matrix A (1×3): [[1.0, 2.0, 3.0]]
        Matrix B (3×1): [[4.0], [5.0], [6.0]]
Output: Matrix C (1×1): [[32.0]]
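
The two examples above can be checked on the CPU with a small reference that uses the same row-major indexing as the GPU kernels. This is only a sketch for verification: the function name `matmul_row_major` is illustrative and not part of the LeetGPU interface, and the matrices are passed as flat Python lists.

```python
def matmul_row_major(A, B, M, N, K):
    """Reference C = A x B on flat row-major lists (A is M x N, B is N x K)."""
    C = [0.0] * (M * K)
    for row in range(M):
        for col in range(K):
            acc = 0.0
            for i in range(N):
                # Row-major: element (r, c) of a matrix with W columns is at r * W + c.
                acc += A[row * N + i] * B[i * K + col]
            C[row * K + col] = acc
    return C


# Example 1 from the problem statement
print(matmul_row_major([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], 2, 2, 2))
# prints [19.0, 22.0, 43.0, 50.0]

# Example 2 from the problem statement
print(matmul_row_major([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], 1, 3, 1))
# prints [32.0]
```

The inner loop is the same dot product that each GPU thread performs in a one-thread-per-element kernel.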

Constraints

$1 \le M, N, K \le 8{,}192$.
All elements are 32-bit floating-point numbers.
Performance is measured with $M = 8{,}192,\ N = 6{,}144,\ K = 4{,}096$.
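
To get a feel for the benchmark size, the following sketch works out the arithmetic cost and the memory footprint of the three matrices. It assumes the usual count of one multiply and one add per inner-loop step; the variable names are illustrative.

```python
M, N, K = 8192, 6144, 4096
BYTES_PER_FLOAT = 4
MIB = 1024 * 1024

flops = 2 * M * N * K                # one multiply and one add per inner-loop step
bytes_a = M * N * BYTES_PER_FLOAT    # A is M x N
bytes_b = N * K * BYTES_PER_FLOAT    # B is N x K
bytes_c = M * K * BYTES_PER_FLOAT    # C is M x K

print(f"FLOPs: {flops:,}")
print(f"A: {bytes_a // MIB} MiB, B: {bytes_b // MIB} MiB, C: {bytes_c // MIB} MiB")
# prints:
# FLOPs: 412,316,860,416
# A: 192 MiB, B: 96 MiB, C: 128 MiB
```

Roughly 412 GFLOP of work over about 416 MiB of data means each loaded value has to be reused many times for the kernel to avoid being limited by memory bandwidth.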

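Approach

Matrix multiplication is the next step after "Hello World" in GPU programming. In the naive implementation each thread computes one output element, but every thread re-reads a full row of $A$ and a full column of $B$ from global memory, so the same values are fetched many times and bandwidth is wasted. The core optimization is shared-memory tiling: each thread block loads sub-blocks (tiles) of $A$ and $B$ into shared memory and reuses them there, which reduces global memory traffic by roughly a factor of the tile width.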
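
The tiling pattern can be sketched on the CPU to show where the savings come from. This is not a GPU kernel: the function `matmul_tiled`, the tile size `T`, and the load counter are illustrative assumptions, with the lists `tile_a` and `tile_b` standing in for shared memory and each `(r0, c0)` pair standing in for one thread block.

```python
def matmul_tiled(A, B, M, N, K, T):
    """Blocked C = A x B on flat row-major lists; returns (C, element loads from A and B)."""
    C = [0.0] * (M * K)
    loads = 0
    for r0 in range(0, M, T):        # each (r0, c0) pair plays the role of a thread block
        for c0 in range(0, K, T):
            rows = min(T, M - r0)    # edge tiles may be smaller than T
            cols = min(T, K - c0)
            acc = [[0.0] * cols for _ in range(rows)]
            for k0 in range(0, N, T):
                depth = min(T, N - k0)
                # Stage one tile of A and one tile of B (the "shared memory" copies).
                tile_a = [[A[(r0 + r) * N + k0 + k] for k in range(depth)] for r in range(rows)]
                tile_b = [[B[(k0 + k) * K + c0 + c] for c in range(cols)] for k in range(depth)]
                loads += rows * depth + depth * cols
                # All partial products below read only the staged tiles.
                for r in range(rows):
                    for c in range(cols):
                        for k in range(depth):
                            acc[r][c] += tile_a[r][k] * tile_b[k][c]
            for r in range(rows):
                for c in range(cols):
                    C[(r0 + r) * K + c0 + c] = acc[r][c]
    return C, loads


M = N = K = 8
A = [float(i % 7) for i in range(M * N)]
B = [float(i % 5) for i in range(N * K)]
C, tiled_loads = matmul_tiled(A, B, M, N, K, T=4)
naive_loads = 2 * M * N * K          # one-thread-per-element: N loads from A and N from B each
print(naive_loads, tiled_loads)      # prints 1024 256
```

With square tiles that divide the matrix dimensions evenly, the tiled version performs $2MNK/T$ element loads instead of $2MNK$, so a tile width of 4 gives the 4x reduction seen here. In a CUDA version, the staging step would typically be a cooperative load into `__shared__` arrays followed by `__syncthreads()`.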

Code

CUDA

#include <cuda_runtime.h>

// Naive implementation: each thread computes one element of C
__global__ void matmul_naive(const float* A, const float* B, float* C, int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < K) {
        float sum = 0.0f;
        for (int i = 0; i < N; i++) {
            sum += A[row * N + i] * B[i * K + col];
        }
        C[row * K + col] = sum;
    }
}

extern "C" void solve(const float* A, const float* B, float* C, int M, int N, int K) {
    dim3 threads(16, 16);
    dim3 blocks((K + 15) / 16, (M + 15) / 16);
    matmul_naive<<<blocks, threads>>>(A, B, C, M, N, K);
    cudaDeviceSynchronize();
}

Triton

import triton
import triton.language as tl

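@triton.jit
def matmul_kernel(A_ptr, B_ptr, C_ptr,
                  M: tl.constexpr, N: tl.constexpr, K: tl.constexpr,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program computes one BLOCK_M x BLOCK_N tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)  # row indices of C (and A)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)  # column indices of C (and B)
    rk = tl.arange(0, BLOCK_K)                    # offsets along the shared dimension N
    # Row-major pointers to the first A tile (row stride N) and B tile (row stride K).
    A = A_ptr + rm[:, None] * N + rk[None, :]
    B = B_ptr + rk[:, None] * K + rn[None, :]
    acc = tl.zeros((BLOCK_M, BLOCK_N), tl.float32)
    for k in range(0, N, BLOCK_K):
        # Out-of-range elements are loaded as 0.0 so they do not affect the sum.
        a = tl.load(A, mask=(rm[:, None] < M) & (rk[None, :] < N - k), other=0.0)
        b = tl.load(B, mask=(rk[:, None] < N - k) & (rn[None, :] < K), other=0.0)
        acc += tl.dot(a, b)
        # Advance both tiles by BLOCK_K along the shared dimension.
        A += BLOCK_K
        B += BLOCK_K * K
    tl.store(C_ptr + rm[:, None] * K + rn[None, :], acc,
             mask=(rm[:, None] < M) & (rn[None, :] < K))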