Matrix Transpose

agicy · 2026/6/6 · about 1 minute read

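Original problem: LeetGPU - Matrix Transpose

Problem Description

Write a GPU program that transposes a matrix of 32-bit floating-point numbers. Transposing a matrix swaps its rows and columns: given a $rows \times cols$ matrix $A$, its transpose $A^T$ has dimensions $cols \times rows$ and satisfies: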

A^T[j][i] = A[i][j]

All matrices are stored in row-major format.
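
As a reference for the indexing, the definition above can be written against flat row-major buffers. The following is a minimal CPU sketch in plain Python, not part of the LeetGPU solution; the helper name `transpose_row_major` is made up for illustration.

```python
def transpose_row_major(a, rows, cols):
    """Transpose a rows x cols matrix stored as a flat row-major list."""
    out = [0.0] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            # A[i][j] lives at i * cols + j; A^T[j][i] lives at j * rows + i
            out[j * rows + i] = a[i * cols + j]
    return out


# Example 1 from the problem statement: a 2x3 matrix
print(transpose_row_major([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 2, 3))
# prints [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
```

The two flat offsets, `i * cols + j` on the input side and `j * rows + i` on the output side, are the same ones every GPU version has to compute.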

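Implementation Requirements

External libraries are not allowed.
The signature of the solve function must remain unchanged.
The final result must be stored in the output matrix.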

Examples

Example 1

Input:  2×3 matrix: [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
Output: 3×2 matrix: [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]

Example 2

Input:  3×1 matrix: [[1.0], [2.0], [3.0]]
Output: 1×3 matrix: [[1.0, 2.0, 3.0]]

Constraints

$1 \le rows, cols \le 8{,}192$.
Performance is measured with $cols = 6{,}000,\ rows = 7{,}000$.
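
To get a feel for the benchmark size, the arithmetic below estimates the data volume and the launch grid. It is only a back-of-the-envelope sketch, assuming the 32×32 tiling (`TILE`) used by the CUDA code in this post and counting one read and one write per element.

```python
TILE = 32                 # tile edge, matching the TILE macro in the CUDA code
rows, cols = 7000, 6000   # benchmark size from the constraints
BYTES_PER_FLOAT = 4


def ceil_div(a, b):
    return (a + b - 1) // b


elements = rows * cols
bytes_moved = 2 * elements * BYTES_PER_FLOAT  # each element is read once and written once
grid_x = ceil_div(cols, TILE)
grid_y = ceil_div(rows, TILE)

print(elements)        # 42000000
print(bytes_moved)     # 336000000
print(grid_x, grid_y)  # 188 219
```

Neither dimension is a multiple of 32, so the last row and column of blocks are only partially filled, which is why the kernels need bounds checks.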

Approach

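The core challenge in matrix transpose is uncoalesced memory access: reading in row-major order but writing along the column direction (or vice versa) sharply reduces the efficiency of global memory transactions. The standard optimization uses shared memory as a staging buffer: a tile is read into shared memory with coalesced accesses, then written out from shared memory, again with coalesced accesses.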
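
The cost difference can be illustrated by counting how many memory segments one warp touches. The sketch below uses a deliberately simplified model: 32 threads per warp, one transaction per 128-byte aligned segment, and buffers that start on a 128-byte boundary. Real hardware is more nuanced, so treat the numbers as an illustration only; `segments_touched` is a hypothetical helper.

```python
WARP = 32
FLOAT_BYTES = 4
SEGMENT_BYTES = 128  # simplified model: one transaction per 128-byte aligned segment


def segments_touched(element_indices):
    """Number of distinct 128-byte segments covered by a warp's float accesses."""
    return len({(idx * FLOAT_BYTES) // SEGMENT_BYTES for idx in element_indices})


rows, cols = 7000, 6000
i, j0 = 0, 0  # the warp handles input elements (i, j0) .. (i, j0 + 31)

# Reading along a row of the input: consecutive threads touch consecutive floats
read_idx = [i * cols + (j0 + t) for t in range(WARP)]
# Writing the same elements into the transposed output: stride of `rows` floats
write_idx = [(j0 + t) * rows + i for t in range(WARP)]

print(segments_touched(read_idx))   # 1
print(segments_touched(write_idx))  # 32
```

Under this model the strided side needs 32 transactions to move the same 128 bytes of useful data, which is the inefficiency the shared-memory staging removes.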

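A more advanced approach pads the shared-memory tile so that accesses are free of bank conflicts, and uses vectorized loads (float4) to reduce the number of memory transactions.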
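
Why one extra column helps can be checked with a small calculation. The sketch assumes 32 shared-memory banks, each 4 bytes wide, and a warp in which thread `t` reads `tile[t][column]`, that is, a column of the tile. The function name `max_bank_conflict` is invented for this illustration.

```python
BANKS = 32  # assumption: 32 banks, each 4 bytes wide


def max_bank_conflict(width, column):
    """Worst-case number of threads in a warp hitting the same bank when
    thread t reads tile[t][column] from a float tile with `width` columns."""
    hits = [0] * BANKS
    for t in range(32):
        word = t * width + column  # flat index of tile[t][column] in 4-byte words
        hits[word % BANKS] += 1
    return max(hits)


print(max_bank_conflict(32, 0))  # 32: every thread maps to the same bank
print(max_bank_conflict(33, 0))  # 1: the padding column spreads threads over all banks
```

With a row width of 33 words, consecutive rows are shifted by one bank, so a column read touches 32 different banks. This is the effect of the `tile[TILE][TILE + 1]` declaration in the CUDA code.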

Implementation

CUDA

#include <cuda_runtime.h>

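// Naive transpose: swap the row and column indices directly.
// threadIdx.x indexes the row i here, so a warp's writes to output are contiguous
// while its reads from input are strided by cols (uncoalesced).
// Unlike the launch in solve below, this kernel needs grid.x to cover rows and grid.y to cover cols.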
__global__ void transpose_naive(const float* input, float* output, int rows, int cols) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < rows && j < cols) {
        output[j * rows + i] = input[i * cols + j];
    }
}

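// Shared-memory version: coalesced read into shared memory, then coalesced write to output
#define TILE 32
__global__ void transpose_shared(const float* input, float* output, int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];  // +1 column of padding avoids bank conflicts
    // Phase 1: x is the input column, y is the input row; a warp reads one contiguous row segment
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < cols && y < rows) {
        tile[threadIdx.y][threadIdx.x] = input[y * cols + x];
    }
    // The barrier sits outside the bounds check so that every thread in the block reaches it
    __syncthreads();
    // Phase 2: swap the block indices; x is now the output column (input row), y the output row
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < rows && y < cols) {
        // Read the tile transposed so that the global write stays contiguous along x
        output[y * rows + x] = tile[threadIdx.x][threadIdx.y];
    }
}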

extern "C" void solve(const float* input, float* output, int rows, int cols) {
    dim3 threads(TILE, TILE);
    dim3 blocks((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    transpose_shared<<<blocks, threads>>>(input, output, rows, cols);
    cudaDeviceSynchronize();
}
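
The index arithmetic of `transpose_shared` can be checked on the CPU, including the partially filled edge tiles. The following plain-Python emulation is a sketch for verification only: it loops over blocks and threads sequentially, and the two loops per block stand in for the code before and after `__syncthreads()`. The function name `transpose_tiled` and the test size 70×45 are chosen for illustration.

```python
TILE = 32


def transpose_tiled(inp, rows, cols):
    """CPU emulation of the tiled kernel's index arithmetic on flat row-major lists."""
    out = [None] * (rows * cols)
    grid_x = (cols + TILE - 1) // TILE
    grid_y = (rows + TILE - 1) // TILE
    for by in range(grid_y):
        for bx in range(grid_x):
            tile = [[None] * (TILE + 1) for _ in range(TILE)]
            # Phase 1: each "thread" (tx, ty) loads one element into the tile
            for ty in range(TILE):
                for tx in range(TILE):
                    x, y = bx * TILE + tx, by * TILE + ty
                    if x < cols and y < rows:
                        tile[ty][tx] = inp[y * cols + x]
            # Phase 2 (after the barrier): block indices are swapped for the write
            for ty in range(TILE):
                for tx in range(TILE):
                    x, y = by * TILE + tx, bx * TILE + ty
                    if x < rows and y < cols:
                        out[y * rows + x] = tile[tx][ty]
    return out


rows, cols = 70, 45  # deliberately not multiples of TILE
inp = [float(k) for k in range(rows * cols)]
out = transpose_tiled(inp, rows, cols)
ok = all(out[j * rows + i] == inp[i * cols + j] for i in range(rows) for j in range(cols))
print(ok)  # True
```

The check passes because the thread that reads `tile[tx][ty]` in phase 2 passes its bounds test exactly when the thread that wrote that slot in phase 1 passed its own test, so no unwritten tile entry is ever stored to the output.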

Triton

import triton
import triton.language as tl

@triton.jit
def transpose_kernel(input_ptr, output_ptr,
                     rows: tl.constexpr, cols: tl.constexpr,
                     BLOCK: tl.constexpr):
    pid_i = tl.program_id(0)
    pid_j = tl.program_id(1)
    i = pid_i * BLOCK + tl.arange(0, BLOCK)
    j = pid_j * BLOCK + tl.arange(0, BLOCK)
    mask_i = i[:, None] < rows
    mask_j = j[None, :] < cols
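    # x[a, b] holds input[i[a], j[b]]
    x = tl.load(input_ptr + i[:, None] * cols + j[None, :], mask=mask_i & mask_j)
    # Store x[a, b] to output[j[b], i[a]]; the offsets keep the (i, j) tile layout, so the load mask is reused
    tl.store(output_ptr + j[None, :] * rows + i[:, None], x, mask=mask_i & mask_j)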