KV Cache: The Performance Bottleneck of LLM Inference and the Art of Optimizing It

agicy · 2026/3/6 · about 11 min read

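When you interact with frontier models such as DeepSeek, Qwen, or Doubao-Seed, one thing stands out: the response is generated far more slowly than the input is processed. The core technical bottleneck behind this gap lies in the KV Cache mechanism that LLM inference depends on.

KV Cache is a classic space-time trade-off: it greatly accelerates the decoding process of auto-regressive generation. As model size and context length grow, however, it has become the main constraint on GPU memory usage and system throughput.

This article examines how KV Cache works, the Memory Wall challenge it creates, and how optimization techniques such as MQA/GQA, PagedAttention, and PD disaggregation try to break through this bottleneck.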

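Core concept

At its core, KV Cache trades GPU memory for compute time. It avoids recomputing the Key and Value vectors of historical tokens at every generation step, which reduces the per-step cost of autoregressive generation from $\mathcal{O}(t^2)$ to $\mathcal{O}(t)$.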

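Computational redundancy in autoregressive generation

In the Transformer decoding phase, generation proceeds one token at a time.

Suppose the sequence $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_t$ has already been generated and the goal is to predict $\mathbf{x}_{t+1}$. The model that makes this prediction is a stack of alternating Self-Attention layers and Feed-Forward Network (FFN) layers.

FFN layer: mainly performs feature transformation. Its cost is $\mathcal{O}(d^2)$ per token ($d$ is the hidden dimension) and does not depend on the sequence length.
Attention layer: captures contextual dependencies. Its cost grows with the sequence length $t$.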

The core computation of the Attention layer is:

$$\text{Attention}(\mathbf{q}_t, \mathbf{K}_{\le t}, \mathbf{V}_{\le t}) = \text{Softmax}\left(\frac{\mathbf{q}_t \mathbf{K}_{\le t}^\top}{\sqrt{d}}\right) \mathbf{V}_{\le t}$$

Here $\mathbf{K}_{\le t}$ and $\mathbf{V}_{\le t}$ denote the collections of all historical Key and Value vectors up to the current step $t$ (in multi-head attention, the $d$ in the scaling factor is the per-head dimension):

$$\mathbf{K}_{\le t} = [\mathbf{k}_1, \mathbf{k}_2, \dots, \mathbf{k}_t], \quad \mathbf{V}_{\le t} = [\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_t]$$
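
As a minimal sketch of the formula above, the following pure-Python function computes attention for a single query vector. The function name `attention` and the toy inputs are illustrative assumptions rather than part of any library; a real implementation would use batched tensor operations.

```python
import math

def attention(q, keys, values):
    """Single-query scaled dot-product attention: Softmax(q K^T / sqrt(d)) V."""
    d = len(q)
    # One score per stored key: dot(q, k_i) / sqrt(d).
    scores = [sum(qj * kj for qj, kj in zip(q, k)) / math.sqrt(d) for k in keys]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors.
    dim_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]

# Toy example (hypothetical numbers): q is all zeros, so both keys score equally
# and the output is the plain average of the two value vectors.
out = attention([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]], [[2.0, 0.0], [0.0, 2.0]])
print(out)
```

Note that the function takes only one query but every key and value seen so far, which is exactly the asymmetry that the cache exploits.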

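The root of the redundancy:
Compare the sets of Key vectors that the Attention layer needs in two consecutive generation steps:

When generating $\mathbf{x}_t$ (current step $t-1$):
The Key vectors of the first $t-1$ tokens are needed:
$\mathbf{K}_{\le t-1} = [{\color{red}\mathbf{k}_1, \mathbf{k}_2, \dots, \mathbf{k}_{t-1}}]$
When generating $\mathbf{x}_{t+1}$ (current step $t$):
The Key vectors of the first $t$ tokens are needed:
$\mathbf{K}_{\le t} = [{\color{red}\mathbf{k}_1, \mathbf{k}_2, \dots, \mathbf{k}_{t-1}}, \mathbf{k}_t]$

The sequence marked in red, $\color{red}{\mathbf{k}_1, \dots, \mathbf{k}_{t-1}}$, was already computed once in the previous step, when $\mathbf{x}_t$ was generated.

Without a cache, these vectors have to be recomputed through the projection matrix $\mathbf{W}_K$ when generating $\mathbf{x}_{t+1}$. Over $t$ generation steps, $\mathbf{k}_1$ is computed $t$ times, $\mathbf{k}_2$ is computed $t-1$ times, and so on, for a total of $t(t+1)/2$ Key projections instead of $t$. This $\mathcal{O}(t^2)$ redundant computation is a huge waste of compute, and the same holds for the Value vectors.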
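
To make the count concrete, here is a small sketch that tallies how many Key projections are performed over a run of decoding steps, with and without a cache. The helper `key_projections` is a hypothetical name introduced only for this illustration.

```python
def key_projections(num_steps: int, use_cache: bool) -> int:
    """Count Key projections performed over `num_steps` generation steps."""
    total = 0
    for step in range(1, num_steps + 1):
        # Step `step` attends over tokens 1..step.
        # Without a cache every one of those keys is projected again;
        # with a cache only the newest key is projected.
        total += 1 if use_cache else step
    return total

t = 1000
without_cache = key_projections(t, use_cache=False)  # t * (t + 1) / 2
with_cache = key_projections(t, use_cache=True)      # t
print(without_cache, with_cache)
```

For 1000 steps the uncached run performs 500500 projections against 1000 for the cached run, and the gap widens quadratically as the sequence grows.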

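Invariance under positional encoding

For most positional encoding schemes, given a fixed historical input $\mathbf{x}_i$ and its position $i$, the transformation that produces its Key vector $\mathbf{k}_i$ is deterministic and does not change with the current generation step $t$. This is what keeps the cache valid.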
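
One way to see this is with rotary position embeddings (RoPE), used here purely as an example since the article does not name a specific scheme. The sketch below applies a single 2-D rotation pair; `rope_2d`, `dot`, and the numeric values are assumptions made for illustration. The key is rotated using only its own position, so the rotated key can be stored once and reused at every later step.

```python
import math

def rope_2d(vec, pos, theta=1.0):
    """Rotate a 2-D vector by the angle pos * theta (one RoPE frequency pair)."""
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

k_raw = (0.3, -1.2)   # hypothetical un-rotated key
q_raw = (0.7, 0.5)    # hypothetical un-rotated query

# The key at position 3 is rotated once, using only its own position.
k3 = rope_2d(k_raw, pos=3)

# A later step (query at position 5) reuses the stored k3 unchanged.
score_at_step_5 = dot(rope_2d(q_raw, pos=5), k3)

# The score depends only on the relative offset: shifting both positions by 4
# (query at 9, key at 7) gives the same value.
score_shifted = dot(rope_2d(q_raw, pos=9), rope_2d(k_raw, pos=7))
print(score_at_step_5, score_shifted)
```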

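How KV Cache works

The core idea of KV Cache is straightforward: cache the results of past computation and compute incrementally.

The standard inference flow is split into two phases:

Prefill phase (first-token generation):
Compute $\mathbf{K}$ and $\mathbf{V}$ for all Prompt tokens in parallel and store them in the cache structure in GPU memory. This phase is compute bound.
Decode phase (token-by-token generation):
- Compute $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t$ only for the newest token $\mathbf{x}_t$.
- Read the cached history $\mathbf{K}_{1:t-1}, \mathbf{V}_{1:t-1}$ from GPU memory.
- Append $\mathbf{k}_t, \mathbf{v}_t$ to the end of the cache to form the complete $\mathbf{K}_{1:t}, \mathbf{V}_{1:t}$.
- Run the Attention computation and generate $\mathbf{x}_{t+1}$.

With KV Cache, the per-step compute cost of the Decode phase drops from $\mathcal{O}(t^2)$ to $\mathcal{O}(t)$, which makes generating long sequences feasible.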
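
The two phases can be sketched as a toy single-layer, single-head model in pure Python. Everything here is an assumption made for illustration: the random projection matrices, the helper names, and the fact that the "generated" tokens are given as fixed embeddings instead of being sampled. The sketch checks that decoding with a cache yields the same attention outputs as recomputing every Key and Value from scratch.

```python
import math
import random

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def attention(q, keys, values):
    """Single-query scaled dot-product attention."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]

def random_matrix(rng, d):
    return [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)]

rng = random.Random(0)
d = 4
W_q, W_k, W_v = (random_matrix(rng, d) for _ in range(3))
tokens = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(6)]  # x_1..x_6
prompt, later = tokens[:4], tokens[4:]

# Prefill: project K and V for every prompt token and store them.
k_cache = [matvec(W_k, x) for x in prompt]
v_cache = [matvec(W_v, x) for x in prompt]

# Decode: per step, project only the newest token, append, then attend.
cached_outputs = []
for x in later:
    q = matvec(W_q, x)
    k_cache.append(matvec(W_k, x))
    v_cache.append(matvec(W_v, x))
    cached_outputs.append(attention(q, k_cache, v_cache))

# Reference without a cache: re-project the whole history at every step.
reference_outputs = []
for t in range(len(prompt) + 1, len(tokens) + 1):
    history = tokens[:t]
    keys = [matvec(W_k, x) for x in history]
    vals = [matvec(W_v, x) for x in history]
    reference_outputs.append(attention(matvec(W_q, history[-1]), keys, vals))

max_diff = max(abs(a - b)
               for out_c, out_r in zip(cached_outputs, reference_outputs)
               for a, b in zip(out_c, out_r))
print(max_diff)
```

The cached path performs one Key and one Value projection per decode step, while the reference path performs as many as there are tokens in the history.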

<template>
  <div class="kv-cache-simulator">
    <h3>KV Cache Mechanism Demo</h3>
    
    <div class="simulator-container">
      <div class="controls-wrapper">
        <div class="controls-left">
          <div class="control-section">
            <h4>Control Panel</h4>
            
            <div class="control-group">
              <label>Prompt: </label>
              <div class="prompt-display">
                {{ promptText }}
              </div>
            </div>

            <div class="action-buttons">
              <button 
                class="btn-primary" 
                @click="startPrefill" 
                :disabled="simStage !== 'idle'"
              >
                Prefill
              </button>
              <button 
                class="btn-success" 
                @click="stepDecode" 
                :disabled="simStage !== 'decode' || isDecoding"
              >
                Decode
              </button>
              <button 
                class="btn-danger" 
                @click="reset"
              >
                Reset
              </button>
            </div>
          </div>
        </div>

        <div class="controls-right">
          <div class="control-section">
            <h4>GPU Memory Estimate (FP16)</h4>
            <div class="metrics-grid">
              <div class="metric">
                <span class="label">Tokens:</span>
                <span class="value">{{ totalTokens }}</span>
              </div>
              <div class="metric">
                <span class="label">KV Cache size:</span>
                <span class="value">{{ cacheSizeMB }} MiB</span>
              </div>
            </div>
            <div class="memory-bar">
              <div class="memory-used" :style="{ width: memoryUsagePercent + '%' }"></div>
            </div>
            <p class="note">Assumptions: L=32, D=4096, Batch=1</p>
          </div>
        </div>
      </div>

      <div class="visualization">
        <div class="attention-view">
          <h4>KV Cache State & Attention</h4>
          
          <div class="matrix-container-horizontal">
            <!-- Row Headers -->
            <div class="row-headers">
              <div class="row-header">K-Cache</div>
              <div class="row-header">V-Cache</div>
              <div class="row-header">Q (Query)</div>
              <div class="row-header">Token</div>
            </div>

            <!-- Scrollable Content -->
            <div class="matrix-scroll-area">
              <div class="matrix-content">
                <!-- Columns -->
                <div 
                  v-for="(token, index) in tokens" 
                  :key="'col-' + index"
                  class="matrix-col"
                  :class="{ 'active-col': activeQ === index }"
                >
                  <!-- K Cache -->
                  <div class="vector-cell k-cell cached" :class="{ 
                    'attention-target': activeQ >= 0 && index <= activeQ,
                    'prefill-flash': activeQ === -2 && index < promptTokens.length
                  }">
                    K<sub>{{index}}</sub>
                  </div>

                  <!-- V Cache -->
                  <div class="vector-cell v-cell cached" :class="{ 
                    'attention-target': activeQ >= 0 && index <= activeQ,
                    'prefill-flash': activeQ === -2 && index < promptTokens.length
                  }">
                    V<sub>{{index}}</sub>
                  </div>

                  <!-- Q Vector -->
                  <div class="vector-cell q-cell" :class="{ 'active': activeQ === index, 'faded': activeQ !== index && activeQ !== -1 }">
                    <span v-if="activeQ === index">Q<sub>{{index}}</sub></span>
                    <span v-else class="placeholder">-</span>
                  </div>

                  <!-- Token -->
                  <div class="token-cell" :class="{ 
                    'is-prompt': index < promptTokens.length, 
                    'is-generated': index >= promptTokens.length
                  }">
                    {{ token }}
                    <div class="token-idx">{{ index }}</div>
                  </div>
                </div>
                
                <!-- Placeholder for next token (Always visible to prevent layout shift) -->
                <div class="matrix-col placeholder-col">
                  <div class="vector-cell placeholder">...</div>
                  <div class="vector-cell placeholder">...</div>
                  <div class="vector-cell placeholder">...</div>
                  <div class="token-cell" :class="{ 'is-generating': isDecoding }">...</div>
                </div>
              </div>
            </div>
          </div>
          
          <div class="status-message">
            {{ statusMessage }}
          </div>
        </div>
      </div>
    </div>
  </div>
</template>

<script>
export default {
  data() {
    return {
      promptText: "春江潮水连海平，海上明月共潮生。",
      simStage: 'idle', // idle, prefill, decode
      tokens: [],
      promptTokens: [],
      isDecoding: false,
      activeQ: -1,
      statusMessage: "Ready. Click Prefill to process the prompt.",
      // Mock parameters for memory calculation
      LAYERS: 32,
      HIDDEN_SIZE: 4096,
      BYTES_PER_PARAM: 2, // FP16
      // Custom tokenization for prompt
      promptTokenList: ["春江", "潮水", "连", "海平", "，", "海上", "明月", "共", "潮生", "。"],
      generatedWords: [
        "滟滟", "随波", "千万", "里", "，", "何处", "春江", "无", "月明", "！",
        "江流", "宛转", "绕", "芳甸", "，", "月照", "花林", "皆", "似霰", "。",
        "空里", "流霜", "不觉", "飞", "，", "汀上", "白沙", "看不", "见", "。",
        "江天", "一色", "无", "纤尘", "，", "皎皎", "空中", "孤", "月轮", "。",
        "江畔", "何人", "初", "见月", "？", "江月", "何年", "初", "照人", "？",
        "人生", "代代", "无穷", "已", "，", "江月", "年年", "望", "相似", "。",
        "不知", "江月", "待", "何人", "，", "但见", "长江", "送", "流水", "。",
        "白云", "一片", "去", "悠悠", "，", "青枫", "浦上", "不胜", "愁", "。",
        "谁家", "今夜", "扁舟", "子", "？", "何处", "相思", "明月", "楼", "？",
        "可怜", "楼上", "月", "徘徊", "，", "应照", "离人", "妆镜", "台", "。",
        "玉户", "帘中", "卷", "不去", "，", "捣衣", "砧上", "拂", "还来", "。",
        "此时", "相望", "不", "相闻", "，", "愿逐", "月华", "流", "照君", "。",
        "鸿雁", "长飞", "光", "不度", "，", "鱼龙", "潜跃", "水", "成文", "。",
        "昨夜", "闲潭", "梦", "落花", "，", "可怜", "春半", "不", "还家", "。",
        "江水", "流春", "去", "欲尽", "，", "江潭", "落月", "复", "西斜", "。",
        "斜月", "沉沉", "藏", "海雾", "，", "碣石", "潇湘", "无", "限路", "。",
        "不知", "乘月", "几人", "归", "，", "落月", "摇情", "满", "江树", "。",
        "End"
      ],
      genIndex: 0
    }
  },
  computed: {
    totalTokens() {
      return this.tokens.length
    },
    cacheSizeMB() {
      // KV Cache = 2 * L * SeqLen * D * 2 bytes
      const bytes = 2 * this.LAYERS * this.totalTokens * this.HIDDEN_SIZE * this.BYTES_PER_PARAM
      return (bytes / (1024 * 1024)).toFixed(2)
    },
    memoryUsagePercent() {
      // The bar is full at 64 tokens (a small cap chosen for visualization only)
      return Math.min(100, (this.totalTokens / 64) * 100)
    }
  },
  mounted() {
    console.log("Component mounted (Options API)")
    this.reset()
  },
  methods: {
    async startPrefill() {
      if (!this.promptText || !this.promptText.trim()) {
        console.warn("Prompt is empty")
        return
      }
      
      // Reset visualization but keep prompt text
      this.tokens = []
      this.genIndex = 0
      this.activeQ = -1
      this.isDecoding = false
      
      this.simStage = 'prefill'
      this.statusMessage = "Prefill phase: computing K/V for all prompt tokens in parallel..."
      
      // Use predefined prompt tokens instead of splitting by space
      const rawTokens = this.promptTokenList
      this.promptTokens = [...rawTokens]
      
      // Simulate prefill animation (Batch all at once to show parallelism)
      // In prefill phase, we compute K/V for all prompt tokens in parallel
      
      this.tokens = [...rawTokens]
      this.scrollToRight()
      
      // Flash all KV cells to indicate parallel computation
      this.activeQ = -2 // Special state for prefill flash
      
      await this.sleep(800) 
      
      if (this.simStage !== 'prefill') return // Stop if reset occurred
      this.statusMessage = "Prefill done. K/V for all prompt tokens were computed in parallel and stored in the cache."
      this.simStage = 'decode'
      this.activeQ = -1
    },
    async stepDecode() {
      if (this.genIndex >= this.generatedWords.length) {
        this.statusMessage = "Generation finished."
        return
      }
      
      this.isDecoding = true
      this.scrollToRight()
      
      const newToken = this.generatedWords[this.genIndex]
      const currentIndex = this.tokens.length // Index of the new token to be generated
      
      // 1. Show Q for the LAST token (the one generating the new one)
      // Actually, in autoregressive, we use the LAST token's Q to query all previous K/V to predict NEXT token
      const queryIndex = currentIndex - 1
      
      this.statusMessage = `Decode step: computing the Q vector for token "${this.tokens[queryIndex]}"...`
      this.activeQ = queryIndex
      this.scrollToRight() // Ensure Q is visible
      
      await this.sleep(600)
      if (!this.isDecoding) return // Stop if reset
      
      this.statusMessage = `Attention: Q${queryIndex} queries the KV Cache (K0...K${queryIndex})...`
      await this.sleep(800)
      if (!this.isDecoding) return // Stop if reset
      
      // 2. Generate new token
      this.statusMessage = `Generated new token: "${newToken}"`
      this.tokens.push(newToken)
      this.activeQ = -1 // Q is transient, gone
      this.scrollToRight() // Scroll to show new token
      
      // 3. Cache new KV
      this.statusMessage = `Appending the K/V of "${newToken}" to the cache.`
      await this.sleep(400)
      if (!this.isDecoding) return // Stop if reset
      
      this.genIndex++
      this.isDecoding = false
      this.statusMessage = "Waiting for the next decode step..."
      this.scrollToRight()
    },
    reset() {
      console.log("Resetting state...")
      this.simStage = 'idle'
      this.tokens = []
      this.promptTokens = []
      this.genIndex = 0
      this.activeQ = -1
      this.isDecoding = false
      this.statusMessage = "Ready. Click Prefill to process the prompt."
      // Reset prompt if empty or different
      if (!this.promptText) this.promptText = "春江潮水连海平，海上明月共潮生。"
    },
    sleep(ms) {
      return new Promise(resolve => setTimeout(resolve, ms))
    },
    scrollToRight() {
      this.$nextTick(() => {
        const container = this.$el.querySelector('.matrix-scroll-area')
        if (container) {
          // Use scrollTo with smooth behavior for better UX, or auto for instant jump
          // But sometimes smooth scrolling can be interrupted by updates
          container.scrollTo({
            left: container.scrollWidth,
            behavior: 'auto' // Use auto to prevent scroll position lag/bounce when content size changes
          })
        }
      })
    }
  }
}
</script>

<style scoped>
.kv-cache-simulator {
  border: 1px solid #e0e0e0;
  border-radius: 8px;
  background: #f9f9f9;
  padding: 20px;
  margin: 20px 0;
  font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif;
}

h3 {
  margin-top: 0;
  margin-bottom: 20px;
  text-align: center;
  color: #333;
}

h4 {
  margin: 0 0 10px 0;
  font-size: 14px;
  color: #666;
  text-transform: uppercase;
  letter-spacing: 0.5px;
}

.simulator-container {
  display: flex;
  flex-direction: column;
  gap: 20px;
}

.controls-wrapper {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 20px;
}

.controls-left, .controls-right {
  display: flex;
  flex-direction: column;
}

.control-section {
  background: white;
  padding: 15px;
  border-radius: 6px;
  box-shadow: 0 2px 4px rgba(0,0,0,0.05);
  height: 100%;
}

.prompt-display {
  width: 100%;
  padding: 8px;
  background: #f5f5f5;
  border: 1px solid #ddd;
  border-radius: 4px;
  margin-top: 5px;
  box-sizing: border-box;
  color: #555;
  font-family: monospace;
}

.action-buttons {
  display: flex;
  gap: 10px;
  margin-top: 15px;
}

.action-buttons button {
  flex: 1;
}

button {
  padding: 8px 16px;
  border: none;
  border-radius: 4px;
  cursor: pointer;
  font-weight: 600;
  transition: all 0.2s;
}

button:disabled {
  opacity: 0.5;
  cursor: not-allowed;
}

.btn-primary { background: #1890ff; color: white; }
.btn-primary:hover:not(:disabled) { background: #40a9ff; }

.btn-success { background: #52c41a; color: white; }
.btn-success:hover:not(:disabled) { background: #73d13d; }

.btn-danger { background: #ff4d4f; color: white; }
.btn-danger:hover:not(:disabled) { background: #ff7875; }

/* Metrics Styles */
.metrics-grid {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 10px;
  margin-bottom: 10px;
}

.metric {
  display: flex;
  flex-direction: column;
}

.metric .label { font-size: 12px; color: #999; }
.metric .value { font-size: 16px; font-weight: bold; color: #333; }

.memory-bar {
  height: 8px;
  background: #f0f0f0;
  border-radius: 4px;
  overflow: hidden;
  margin-bottom: 5px;
}

.memory-used {
  height: 100%;
  background: linear-gradient(90deg, #1890ff, #722ed1);
  transition: width 0.3s ease;
}

.note {
  font-size: 11px;
  color: #bbb;
  margin: 0;
}

/* Visualization Styles */
.visualization {
  background: white;
  padding: 20px;
  border-radius: 6px;
  box-shadow: 0 2px 4px rgba(0,0,0,0.05);
  display: flex;
  flex-direction: column;
  gap: 20px;
  overflow: hidden;
}

.matrix-container-horizontal {
  display: flex;
  border: 1px solid #eee;
  border-radius: 6px;
  overflow: hidden;
  position: relative;
}

.row-headers {
  display: flex;
  flex-direction: column;
  background: #f9f9f9;
  border-right: 1px solid #eee;
  z-index: 2;
  box-shadow: 2px 0 5px rgba(0,0,0,0.05);
  flex-shrink: 0;
  width: 80px;
}

.row-header {
  height: 40px;
  display: flex;
  align-items: center;
  justify-content: center;
  font-size: 11px;
  font-weight: bold;
  color: #666;
  border-bottom: 1px solid #eee;
}
.row-header:last-child {
  border-bottom: none;
}

.matrix-scroll-area {
  overflow-x: auto;
  flex: 1;
  padding-bottom: 5px; /* Space for scrollbar */
}

.matrix-content {
  display: flex;
  min-width: min-content;
}

.matrix-col {
  display: flex;
  flex-direction: column;
  width: 50px;
  border-right: 1px solid #f0f0f0;
  flex-shrink: 0;
  transition: background 0.3s;
}

.matrix-col.active-col {
  background: #fffbe6;
}

.vector-cell {
  height: 40px;
  display: flex;
  align-items: center;
  justify-content: center;
  font-size: 11px;
  font-weight: bold;
  border-bottom: 1px solid #eee;
  position: relative;
}

.token-cell {
  height: 40px;
  display: flex;
  align-items: center;
  justify-content: center;
  font-size: 12px;
  font-weight: bold;
  position: relative;
  background: #fafafa;
}

.token-cell.is-prompt { color: #0050b3; background: #e6f7ff; }
.token-cell.is-generated { color: #389e0d; background: #f6ffed; }
.token-cell.is-generating { color: #999; border-style: dashed; }

.token-idx {
  position: absolute;
  bottom: 2px;
  right: 2px;
  font-size: 8px;
  color: #999;
}

.q-cell.active { 
  background: #fff7e6; 
  color: #d46b08; 
  box-shadow: inset 0 0 0 2px #fa8c16;
}
.q-cell.faded { opacity: 0.3; }

.k-cell.cached { color: #531dab; background: rgba(211, 173, 247, 0.1); transition: all 0.3s; }
.v-cell.cached { color: #006d75; background: rgba(135, 232, 222, 0.1); transition: all 0.3s; }

.attention-target {
  box-shadow: inset 0 0 0 1px currentColor;
  font-weight: 800;
  transform: scale(0.95);
}
.k-cell.cached.attention-target { background: rgba(211, 173, 247, 0.3); }
.v-cell.cached.attention-target { background: rgba(135, 232, 222, 0.3); }

/* Prefill flash animation */
@keyframes prefill-pulse {
  0% { transform: scale(1); opacity: 0.5; }
  50% { transform: scale(1.05); opacity: 1; box-shadow: 0 0 8px rgba(24, 144, 255, 0.5); }
  100% { transform: scale(1); opacity: 1; }
}

.prefill-flash {
  animation: prefill-pulse 0.6s ease-in-out;
  border-color: #1890ff !important;
  color: #1890ff !important;
  background: #e6f7ff !important;
  font-weight: bold;
}

.placeholder { color: #eee; }

@media (max-width: 768px) {
  .controls-wrapper {
    grid-template-columns: 1fr;
  }
}
</style>

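Think about it: why is there no Q-Cache?

Since we cache $\mathbf{K}$ and $\mathbf{V}$, a natural question is: why is there no need to cache $\mathbf{Q}$?

The answer lies in the different roles that $\mathbf{Q}$ and $\mathbf{K}$ / $\mathbf{V}$ play in the Attention mechanism:

$\mathbf{K}$ and $\mathbf{V}$ are the objects being queried: they represent historical information. For tokens that have already been generated (for example $\mathbf{x}_1, \dots, \mathbf{x}_{t-1}$), their content and positions stay fixed in all later generation steps. Their projected vectors $\mathbf{k}$ and $\mathbf{v}$ can therefore be computed once and stored for reuse by every subsequent step.
$\mathbf{Q}$ is the one asking: it represents the attention focus of the current token $\mathbf{x}_t$.
- When generating $\mathbf{x}_{t+1}$, we use $\mathbf{q}_t$ to query $\mathbf{K}_{\le t}$.
- When generating $\mathbf{x}_{t+2}$, we use $\mathbf{q}_{t+1}$ to query $\mathbf{K}_{\le t+1}$.

The key point: $\mathbf{q}_t$ is used exactly once, at the moment $\mathbf{x}_{t+1}$ is generated. Once $\mathbf{x}_{t+1}$ has been produced, $\mathbf{q}_t$ has done its job and takes no part in computing $\mathbf{x}_{t+2}, \mathbf{x}_{t+3}$. Those later steps need the new vectors $\mathbf{q}_{t+1}, \mathbf{q}_{t+2}$.

The vector $\mathbf{q}_t$ is therefore transient: it is discarded after use and does not need to be persisted the way $\mathbf{k}_t, \mathbf{v}_t$ are.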

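The memory killer: the Memory Wall challenge

KV Cache solves the compute (FLOPs) problem, but it introduces serious bottlenecks in GPU memory capacity and memory bandwidth.

Arithmetic intensity analysis

In the Decode phase, the core operation is matrix-vector multiplication (GEMV). For each request in a batch, the GPU has to move that request's large KV Cache matrices out of GPU memory only to multiply them with a tiny $\mathbf{q}_t$ vector. The resulting arithmetic intensity, the number of floating-point operations performed per byte transferred, is very low. LLM decoding is therefore usually memory-bandwidth bound and not limited by the speed of the compute cores.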
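
A back-of-the-envelope sketch makes "very low" concrete. The function below counts only the two GEMV operations of decode attention for one head and ignores softmax and the query read; the hardware figures are approximate, assumed values for an A100-class GPU and are not taken from the article.

```python
def decode_attention_intensity(seq_len, head_dim, bytes_per_elem=2):
    """FLOPs per byte of KV Cache read, for one head in one decode step."""
    # q . K^T: seq_len dot products of length head_dim (2 FLOPs per multiply-add),
    # then weights . V: another seq_len * head_dim multiply-adds.
    flops = 2 * seq_len * head_dim + 2 * seq_len * head_dim
    # K and V each hold seq_len * head_dim elements that are read once.
    bytes_read = 2 * seq_len * head_dim * bytes_per_elem
    return flops / bytes_read

intensity = decode_attention_intensity(seq_len=2048, head_dim=128)

# Assumed, approximate peak figures (illustrative only):
peak_flops = 312e12       # ~312 TFLOP/s FP16 tensor-core throughput
peak_bandwidth = 2.0e12   # ~2 TB/s HBM bandwidth
machine_balance = peak_flops / peak_bandwidth  # FLOPs the GPU can do per byte moved

print(intensity, machine_balance)
```

Under these assumptions decode attention performs about 1 FLOP per byte in FP16, while the GPU could sustain on the order of 150 FLOPs per byte, so the compute units spend most of the step waiting on memory.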

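Estimating memory usage:
Take the Qwen-72B model as an example (n_layers=80, d_model=8192), using FP16 precision (2 bytes) and a context length of 2048.
The KV Cache size for a single request is (the leading factor of 2 accounts for storing both K and V):

$$2 \times \text{n\_layers} \times \text{seq\_len} \times \text{d\_model} \times 2 \text{ bytes} = 2 \times 80 \times 2048 \times 8192 \times 2 \text{ bytes} = 5 \text{ GiB}$$

If the batch size grows to 32, the KV Cache alone takes 160 GiB of GPU memory, far beyond the physical limit of a single A100 (80 GB). Memory capacity directly caps the maximum concurrency (batch size) the system can support, which in turn caps throughput.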
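
The same arithmetic as a small helper, using the figures quoted above. The function name `kv_cache_bytes` is introduced here for illustration, and the formula assumes standard multi-head attention in which every layer caches full `d_model`-wide K and V vectors.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_layers, d_model, seq_len, batch_size=1, bytes_per_elem=2):
    """KV Cache size in bytes; the leading 2 accounts for K and V."""
    return 2 * n_layers * seq_len * d_model * bytes_per_elem * batch_size

single = kv_cache_bytes(n_layers=80, d_model=8192, seq_len=2048)
batch_32 = kv_cache_bytes(n_layers=80, d_model=8192, seq_len=2048, batch_size=32)

print(single / GIB, batch_32 / GIB)
```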

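A note on precision

The calculation above assumes FP16/BF16 (2 bytes per element). With INT8 quantization of the cache, memory usage is halved; with INT4 quantization, it drops to 1/4 (ignoring the small overhead of storing quantization scales). This is currently one of the important ways to relieve memory pressure.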
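
As a rough illustration of what INT8 storage means for a cached vector, here is a generic symmetric per-vector quantization sketch. It is not the scheme of any particular inference engine; the function names and the sample values are assumptions. Each element is stored as a 1-byte code, plus one floating-point scale per vector.

```python
def quantize_int8(vec):
    """Symmetric per-vector quantization: one float scale plus int8 codes."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return codes, scale

def dequantize_int8(codes, scale):
    """Recover approximate floating-point values from the codes."""
    return [c * scale for c in codes]

key = [0.5, -1.27, 0.003, 1.0]   # hypothetical cached key vector
codes, scale = quantize_int8(key)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(key, restored))
print(codes, scale, max_error)
```

The rounding error per element is bounded by half the scale, which is the accuracy given up in exchange for the smaller footprint.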

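Paths to optimization: a joint breakthrough across algorithms, systems, and architecture

To relieve the pressure caused by KV Cache, academia and industry have proposed a range of optimizations.

Algorithm-level optimization: MQA and GQA

Both methods change the model structure to directly reduce the number of Key and Value vectors that must be cached.

MHA (Multi-Head Attention): the standard Transformer structure. Each head has its own $\mathbf{K}, \mathbf{V}$ projections. Cache usage is the largest.
MQA (Multi-Query Attention): all heads share a single set of $\mathbf{K}, \mathbf{V}$ projections, and only $\mathbf{Q}$ remains multi-headed. The cache shrinks to $\frac{1}{h}$ of its original size ($h$ is the number of heads).
- Advantage: very low memory usage and noticeably faster inference.
- Cost: reduced model capacity, which may lower generation quality.
GQA (Grouped-Query Attention): the compromise adopted by LLaMA-2 (in its larger variants) and LLaMA-3. Heads are divided into groups, and the heads within a group share $\mathbf{K}, \mathbf{V}$.
- Balance point: quality close to MHA, with speed and memory benefits close to MQA.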
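
The three variants differ only in how many KV heads are cached, which the following sketch makes explicit. The model configuration (32 layers, 32 query heads, head dimension 128, 4096-token context) is a hypothetical example chosen for round numbers, and both helper names are assumptions.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV Cache size for one request; the leading 2 covers K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def kv_head_for(query_head, n_q_heads, n_kv_heads):
    """Index of the shared KV head that a given query head reads from."""
    group_size = n_q_heads // n_kv_heads
    return query_head // group_size

n_layers, n_q_heads, head_dim, seq_len = 32, 32, 128, 4096
sizes = {
    name: kv_cache_bytes(n_layers, n_kv, head_dim, seq_len)
    for name, n_kv in [("MHA", 32), ("GQA", 8), ("MQA", 1)]
}
for name, size in sizes.items():
    print(f"{name}: {size / 2**20:.0f} MiB")

# With 8 KV heads, query heads 4..7 all read KV head 1.
print([kv_head_for(h, n_q_heads, 8) for h in range(4, 8)])
```

In this configuration the cache is 2048 MiB for MHA, 512 MiB for GQA with 8 KV heads, and 64 MiB for MQA.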

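Why did LLaMA-3 choose GQA?

Experiments indicate that on large models GQA loses almost no accuracy while shrinking the KV Cache, and with it the memory bandwidth needed per decoding step, by a factor equal to the number of query heads that share each KV head. That makes it one of the most cost-effective Attention variants available today.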

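System-level optimization: PagedAttention

PagedAttention, proposed by the Berkeley team behind vLLM, addresses GPU memory fragmentation.

Inspiration: virtual memory in operating systems
Traditional KV Cache memory management usually pre-allocates one contiguous chunk of GPU memory per request (sized for the maximum sequence length), which causes severe internal fragmentation.
PagedAttention borrows the idea of paging from operating systems:

Split the KV Cache into fixed-size blocks.
Physical memory blocks do not have to be contiguous.
Maintain a block table (the analogue of a page table) that maps the logical token sequence to physical memory blocks.

Key advantages:

Near-zero waste: memory is allocated on demand, so the waste caused by pre-allocation disappears. Only the last, partially filled block of each sequence can hold unused slots.
Efficient sharing: a prefix common to concurrent requests (such as a System Prompt) can be mapped to the same physical blocks, with copy-on-write applied when a request needs to write into a shared block. This greatly reduces memory overhead.
Higher throughput: better memory utilization lets the system run larger batch sizes, which significantly raises overall throughput.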
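
The block table idea can be sketched in a few dozen lines. This is a simplified illustration and not vLLM's actual implementation: the class names `BlockAllocator` and `Sequence` are hypothetical, no tensor data is stored, and the copy step of copy-on-write is represented only by allocating a fresh block.

```python
class BlockAllocator:
    """Hands out fixed-size physical blocks and tracks how many sequences use each."""
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.ref_count = {}

    def allocate(self):
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block):
        self.ref_count[block] += 1
        return block

    def release(self, block):
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            del self.ref_count[block]
            self.free.append(block)

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % self.allocator.block_size == 0:
            # Last block is full (or there is none yet): allocate on demand.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.ref_count[last] > 1:
                # Copy-on-write: take a private block before writing into a
                # shared one (a real system would also copy the block's data).
                self.allocator.release(last)
                self.block_table[-1] = self.allocator.allocate()
        self.num_tokens += 1

    def physical_slot(self, token_index):
        block = self.block_table[token_index // self.allocator.block_size]
        return block, token_index % self.allocator.block_size

    def fork(self):
        """Create a sequence that shares every existing block (common prefix)."""
        child = Sequence(self.allocator)
        child.block_table = [self.allocator.share(b) for b in self.block_table]
        child.num_tokens = self.num_tokens
        return child

allocator = BlockAllocator(num_blocks=8, block_size=4)
parent = Sequence(allocator)
for _ in range(6):          # 6 tokens occupy 2 blocks, the second half full
    parent.append_token()
child = parent.fork()       # shares both blocks, no new memory used
child.append_token()        # writes into the shared last block -> copy-on-write
print(parent.block_table, child.block_table, len(allocator.free))
```

After the fork the first block stays shared between the two sequences, and only the partially filled last block is duplicated once the child writes to it.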

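Architecture-level optimization: PD disaggregation (Prefill-Decode Disaggregation)

Beyond algorithmic and memory-management optimizations, PD disaggregation is an architecture-level strategy that addresses the mismatch between the compute characteristics of the Prefill and Decode phases.

The core tension: asymmetric compute characteristics

In the standard inference flow, the two phases place very different demands on hardware:

Prefill phase (Throughput-Bound): processes the many tokens of the Prompt in one pass and is a typical compute-bound task. It benefits from strong matrix-compute capability (TFLOPS) so that the first token is produced quickly (TTFT, Time To First Token).
Decode phase (Latency-Bound): generates tokens one by one. Each step moves a lot of data but does little computation, making it a typical memory-bound task. It is limited by memory bandwidth, which determines how smooth generation feels (TPOT, Time Per Output Token).

In traditional homogeneous serving, Prefill and Decode requests are mixed on the same GPU. This causes severe resource contention:

Head-of-Line Blocking: when a Prefill request with a long Prompt (for example 100k tokens) arrives, it instantly saturates the compute units. Decode requests already in progress are forced to wait, and users whose replies are being generated notice an obvious stall.
Resource mismatch: to meet Decode's memory needs we might choose GPUs with large memory, but that may not be cost-effective for the compute needs of the Prefill phase, and vice versa.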

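The solution: a heterogeneous pipeline

PD disaggregation splits the inference cluster into two kinds of dedicated instances:

Prefill instances: focus on processing Prompt input and quickly computing the initial KV Cache. These nodes can be equipped with high-compute GPUs (such as H100).
Decode instances: focus on receiving the KV Cache and performing the subsequent token-by-token generation. These nodes can be equipped with large-memory GPUs (such as A100-80G, or even cheaper inference cards).

Workflow:

Dispatch: a global scheduler receives the request and sends the Prompt to a Prefill instance.
Compute and transfer: once the Prefill instance has computed the KV Cache, it does not continue generating. Instead it migrates the cache to a Decode instance over a high-speed interconnect (such as NVLink, RDMA, or PCIe).
Generate: the Decode instance receives the cache, takes over the remaining token generation, and streams the results back to the user.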
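
The three workflow steps can be mocked up as plain functions to show the hand-off. This is a toy sketch with no real model or network: `KVHandle`, the instance names, and the placeholder token strings are all assumptions, and the cache is represented only by a token count.

```python
from dataclasses import dataclass

@dataclass
class KVHandle:
    request_id: int
    num_tokens: int   # number of tokens whose K/V are in the cache
    location: str     # instance that currently holds the cache

def prefill(request_id, prompt_tokens):
    """Prefill instance: process the whole prompt, build the cache, emit the first token."""
    handle = KVHandle(request_id, num_tokens=len(prompt_tokens), location="prefill-0")
    first_token = f"tok{handle.num_tokens}"
    return handle, first_token

def migrate(handle, target):
    """Stand-in for moving the cache over the interconnect."""
    handle.location = target
    return handle

def decode(handle, first_token, max_new_tokens):
    """Decode instance: stream tokens, growing the cache by one entry per step."""
    yield first_token
    for _ in range(max_new_tokens - 1):
        handle.num_tokens += 1            # K/V of the previous token is appended
        yield f"tok{handle.num_tokens}"

def serve(request_id, prompt_tokens, max_new_tokens):
    """Global scheduler: dispatch to prefill, migrate the cache, then decode."""
    handle, first_token = prefill(request_id, prompt_tokens)
    handle = migrate(handle, "decode-0")
    return list(decode(handle, first_token, max_new_tokens)), handle

output, handle = serve(request_id=1, prompt_tokens=["a", "b", "c"], max_new_tokens=4)
print(output, handle)
```

The prefill function never enters the decode loop, and the decode function never sees the prompt itself, only the migrated cache handle.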

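Challenges and advantages

Advantages:

No interference: Decode requests are no longer blocked by Prefill requests, so inter-token latency is more stable and P99 latency drops significantly.
Independent scaling: the numbers of Prefill and Decode nodes can be adjusted independently to match the actual traffic profile (Prompt length vs. generation length).
Higher throughput: the two phases are pipelined across instances, so each GPU runs the workload it handles best, which significantly raises the throughput of the whole cluster.

Challenges:

Transfer overhead: the KV Cache is large (as computed earlier, often several GB). If moving the cache between Prefill and Decode nodes cannot be hidden behind computation and adds noticeably to the latency before decoding can start, the gain from disaggregation is eroded. PD disaggregation therefore usually relies on high-speed network infrastructure (such as InfiniBand or NVLink Switch).
Scheduling complexity: sophisticated global scheduling is needed to balance load and to manage the lifecycle and transfer of KV Caches.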
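
To get a feel for the transfer overhead, the sketch below divides the 5 GiB single-request cache estimated earlier by a few nominal link bandwidths. The bandwidth figures are assumed peak values for illustration, not measurements, and real transfers add protocol and scheduling overhead on top.

```python
def transfer_seconds(cache_bytes, link_gb_per_s):
    """Lower bound on transfer time: size divided by nominal link bandwidth."""
    return cache_bytes / (link_gb_per_s * 1e9)

cache_bytes = 5 * 1024**3   # 5 GiB KV Cache for one request

# Assumed nominal bandwidths in GB/s (illustrative only):
links = {
    "PCIe 4.0 x16": 32,
    "400 Gb/s InfiniBand": 50,
    "NVLink (A100 generation)": 600,
}
times = {name: transfer_seconds(cache_bytes, bw) for name, bw in links.items()}
for name, seconds in times.items():
    print(f"{name}: {seconds * 1000:.1f} ms")
```

Under these assumptions the transfer takes on the order of 100 ms or more over PCIe or InfiniBand and under 10 ms over NVLink, which is why the choice of interconnect matters so much for this architecture.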

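Network bandwidth is the new bottleneck

In a PD-disaggregated architecture, the interconnect bandwidth between GPUs (Inter-GPU Bandwidth) becomes the new performance bottleneck. If the network is not fast enough, cache transfer slows the whole system down instead.

This architecture has become a mainstream approach behind long-context LLM services such as DeepSeek and Moonshot.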

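Summary

KV Cache is a double-edged sword in LLM inference: it is the accelerator that makes real-time generation possible, and also the beast that devours GPU memory.
From the structural slimming of MQA/GQA, to the memory-management overhaul of PagedAttention, to the architectural evolution of PD disaggregation, the changes in this field reflect an important trend in AI system design: deep co-design of algorithms, systems, and architecture.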