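Mooncake in SGLang PD Disaggregation
¶Architecture Overview
The Prefill-Decode (PD) disaggregation architecture adopted by SGLang is an important optimization for modern large language model inference. It splits the traditional monolithic inference process into two independent phases:
- Prefill phase: processes the input prompt and generates the initial Key-Value (KV) cache
- Decode phase: generates tokens autoregressively on top of the KV cache
Mooncake acts as the core coordinating component that connects these two physically separate instances and lets them work together efficiently.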
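To make the split concrete, here is a minimal sketch of the two phases as plain Python functions. It is a toy model, not SGLang code: `KVCache`, `prefill`, and `decode` are hypothetical names, and the "model" simply emits the current cache length as the next token id.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Toy stand-in for a per-request KV cache: one entry per processed token."""
    entries: list = field(default_factory=list)


def prefill(prompt_tokens):
    """Prefill phase: process the whole prompt once and build the KV cache."""
    cache = KVCache()
    for tok in prompt_tokens:
        cache.entries.append(("kv", tok))  # placeholder for real key/value tensors
    return cache


def decode(cache, max_new_tokens):
    """Decode phase: generate tokens one at a time, extending the cache each step."""
    generated = []
    for _ in range(max_new_tokens):
        next_tok = len(cache.entries)  # dummy "model": next token id = cache length
        cache.entries.append(("kv", next_tok))
        generated.append(next_tok)
    return generated


# In a PD-disaggregated deployment the two calls below would run on different
# instances, with the cache shipped between them by the transfer layer.
cache = prefill([101, 102, 103])
tokens = decode(cache, max_new_tokens=2)
```

The only state that crosses the boundary between the two functions is the cache, which is why the rest of the design centers on moving the KV cache efficiently.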
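¶Workflow Details
¶Initialization and Resource Preparation
When a request arrives, the Decode node first uses its prealloc component to reserve memory for the KV cache. Because the space is reserved ahead of time, the incoming transfer already has a destination and no memory has to be allocated while data is in flight.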
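The preallocation step can be sketched as a fixed-size slot pool. This is an illustration under assumptions, not the actual SGLang allocator: `KVSlotPool` and its methods are hypothetical, and real allocators work on GPU memory pages rather than Python lists.

```python
class KVSlotPool:
    """Hypothetical fixed-size pool of KV cache slots, reserved before transfer."""

    def __init__(self, num_slots):
        self.free = list(range(num_slots))
        self.owner = {}  # request id -> list of slot indices

    def prealloc(self, req_id, num_tokens):
        """Reserve one slot per prompt token, or return None if the pool is full."""
        if num_tokens > len(self.free):
            return None  # caller keeps the request queued and retries later
        slots = [self.free.pop() for _ in range(num_tokens)]
        self.owner[req_id] = slots
        return slots

    def release(self, req_id):
        """Return a finished request's slots to the pool."""
        self.free.extend(self.owner.pop(req_id, []))


pool = KVSlotPool(num_slots=8)
slots_a = pool.prealloc("req-0", 5)
slots_b = pool.prealloc("req-1", 5)  # not enough room yet, returns None
pool.release("req-0")
slots_c = pool.prealloc("req-1", 5)  # succeeds once req-0 has released its slots
```

Returning `None` instead of raising keeps the failure path cheap: a request that cannot be placed simply stays in the queue until slots free up.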
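¶Bootstrap Coordination
The Decode node's receiver sends a bootstrap signal to the Prefill node through Mooncake, which triggers prefill processing. Because the Decode side initiates the handshake, it controls when Prefill work starts, and the Prefill side knows where the KV cache should go before it begins sending.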
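The handshake can be sketched with an in-memory registry. Everything here is an assumption made for illustration: `BootstrapServer`, `register`, `bootstrap`, and `poll` are hypothetical names, and the address string and descriptor fields are made up. A real deployment would exchange this information over the network.

```python
class BootstrapServer:
    """Hypothetical in-memory registry standing in for the bootstrap server."""

    def __init__(self):
        self.prefill_ranks = {}  # rank -> address of that rank's KV manager
        self.pending = {}        # rank -> list of receiver descriptors

    def register(self, rank, address):
        """Called by each prefill KV manager at startup."""
        self.prefill_ranks[rank] = address
        self.pending.setdefault(rank, [])

    def bootstrap(self, rank, receiver_info):
        """Called by a decode receiver: announce where the KV cache should land."""
        if rank not in self.prefill_ranks:
            raise KeyError(f"prefill rank {rank} is not registered")
        self.pending[rank].append(receiver_info)
        return self.prefill_ranks[rank]

    def poll(self, rank):
        """Called by the prefill side to pick up newly bootstrapped requests."""
        ready, self.pending[rank] = self.pending[rank], []
        return ready


server = BootstrapServer()
server.register(rank=0, address="prefill-0:9000")  # illustrative address
peer = server.bootstrap(0, {"req_id": "req-0", "dst_slots": [3, 4, 5]})
ready = server.poll(0)
```

The registry only brokers the introduction. Once the prefill side holds the receiver descriptor, the KV data itself can flow directly between the two peers.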
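¶Architecture Design
¶Prefill Node Components
- Bootstrap Server (Mooncake): the core coordinating component; manages node registration and connections
- KV Manager: manages the lifecycle of the KV cache
- KV Sender: transfers the KV cache, with support for chunked and streaming transmission
- Scheduler: task scheduling and resource allocation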
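The chunked transmission done by the KV Sender can be sketched as a generator. This is a minimal sketch under assumptions: `iter_chunks` and the tuple layout are hypothetical, and `aux` stands for whatever metadata accompanies the final chunk (Figure 1 labels that step "last chunk send aux").

```python
def iter_chunks(kv_pages, chunk_size, aux):
    """Yield (payload, is_last, aux) tuples; aux rides along with the final chunk only."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total = len(kv_pages)
    for start in range(0, total, chunk_size):
        end = min(start + chunk_size, total)
        is_last = end == total
        yield kv_pages[start:end], is_last, (aux if is_last else None)


# Ten dummy pages sent in chunks of four; the aux payload is illustrative.
chunks = list(iter_chunks(list(range(10)), chunk_size=4, aux={"first_token": 42}))
```

Marking the last chunk explicitly lets the receiving side decide when a request is complete without knowing the total size in advance.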
¶Decode Node Components
- KV Receiver: receives and processes the incoming KV cache
- KV Manager: manages the received cache data
- Scheduler: decode scheduling and token generation control
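On the Decode side, a request moves from the transfer queue to the waiting queue once its last chunk has arrived. The sketch below illustrates that hand-off. The class and function names are hypothetical and only mirror the queue names used in Figure 1.

```python
from collections import deque


class Receiver:
    """Hypothetical receiver that collects the chunks of one request."""

    def __init__(self, req_id):
        self.req_id = req_id
        self.pages = []
        self.aux = None
        self.done = False

    def on_chunk(self, payload, is_last, aux=None):
        """Append a chunk; the last chunk also carries the aux metadata."""
        self.pages.extend(payload)
        if is_last:
            self.aux = aux
            self.done = True


def pop_transferred(transfer_queue, waiting_queue):
    """Move every request whose transfer finished into the decode waiting queue."""
    still_pending = deque()
    while transfer_queue:
        recv = transfer_queue.popleft()
        (waiting_queue if recv.done else still_pending).append(recv)
    transfer_queue.extend(still_pending)


transfer_queue = deque([Receiver("req-0"), Receiver("req-1")])
waiting_queue = deque()
transfer_queue[0].on_chunk([0, 1], is_last=False)
transfer_queue[0].on_chunk([2], is_last=True, aux={"first_token": 42})
transfer_queue[1].on_chunk([7], is_last=False)  # req-1 is still in flight
pop_transferred(transfer_queue, waiting_queue)
```

After the call, `req-0` sits in the waiting queue ready for the scheduler, while `req-1` stays in the transfer queue until its final chunk arrives.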
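¶Technical Advantages and Value
¶Performance
- Lower TTFT: prefill runs as a pipeline, so the KV transfer of one batch overlaps with the forward pass of the next, which shortens time to first token
- Higher throughput: separating Prefill and Decode lets each side be scaled and tuned independently
- Better resource utilization: each node type runs a single kind of workload, so its hardware can be used more efficiently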
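The effect of overlapping transfer with the next forward pass can be checked with a small timing model. This is a back-of-the-envelope sketch, not a measurement: the 100 ms forward time and 40 ms transfer time are made-up inputs, all requests are assumed to arrive at time zero, and `kv_arrival_times` is a hypothetical helper.

```python
def kv_arrival_times(forward_ms, transfer_ms, overlap):
    """Return the time at which each batch's KV cache has fully arrived.

    overlap=True:  the forward pass of batch i+1 runs while batch i is transferred.
    overlap=False: the forward pass of batch i+1 waits for the transfer of batch i.
    """
    gpu_free = 0.0     # when the GPU can start the next forward pass
    sender_free = 0.0  # when the sender can start the next transfer
    arrivals = []
    for fwd, xfer in zip(forward_ms, transfer_ms):
        fwd_done = gpu_free + fwd
        xfer_done = max(fwd_done, sender_free) + xfer
        sender_free = xfer_done
        gpu_free = fwd_done if overlap else xfer_done
        arrivals.append(xfer_done)
    return arrivals


# Hypothetical numbers: three batches, 100 ms forward and 40 ms transfer each.
serial = kv_arrival_times([100, 100, 100], [40, 40, 40], overlap=False)
piped = kv_arrival_times([100, 100, 100], [40, 40, 40], overlap=True)
```

Under these assumed inputs the first batch arrives at the same time in both modes, and every later batch arrives earlier with overlap because its forward pass no longer waits for the previous transfer.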
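¶Architectural Flexibility
- Independent scaling: Prefill and Decode can each be scaled out according to their own load
- Fault isolation: a failure in one Prefill or Decode instance does not have to bring down the rest of the system
- Mixed deployment: nodes with different hardware configurations can be deployed together
¶Operational Advantages
- Service discovery: automated node management and connection setup
- Load balancing: request distribution and resource scheduling across nodes
- Monitoring and diagnostics: metrics and diagnostic capabilities for each stage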
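This Mooncake-based PD disaggregation architecture gives large language model serving a scalable, high-performance, and highly available infrastructure, and it represents an important direction for modern AI inference architectures.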
sequenceDiagram
    autonumber
    box Prefill
        participant forward_A
        participant forward_B
        participant CPULoop
        participant waiting_queue_p as waiting_queue
        participant sender
    end
    box Decode
        participant receiver
        participant transfer_queue
        participant prealloc
        participant waiting_queue_d as waiting_queue
        participant scheduler
    end
    Note over forward_A,scheduler: requests arrive, TTFT starts
    activate scheduler
    activate scheduler
    activate scheduler
    loop
        prealloc->>prealloc: alloc for KV cache
    end
    prealloc->>transfer_queue: pop_prealloc
    activate transfer_queue
    activate transfer_queue
    activate transfer_queue
    Note over transfer_queue,receiver: init receiver
    activate receiver
    activate receiver
    activate receiver
    receiver->>waiting_queue_p: bootstrap prealloc
    waiting_queue_p->>CPULoop: new-seqs
    CPULoop->>+forward_A: batch 0 start
    waiting_queue_p->>CPULoop: new-seqs
    CPULoop->>+forward_B: batch 1 start
    forward_A->>-CPULoop: batch 0 sync
    CPULoop->>+sender: trans batch 0
    loop Every chunk
        sender->>receiver: send chunk
    end
    sender->>receiver: last chunk send aux
    sender->>-CPULoop: trans batch 0 finish
    receiver->>-transfer_queue: start decode
    transfer_queue->>-waiting_queue_d: pop_transfer
    Note right of scheduler: TTFT for req 0
    waiting_queue_d->>scheduler: first token
    deactivate scheduler
    waiting_queue_p->>CPULoop: new-seqs
    CPULoop->>+forward_A: batch 2 start
    forward_B->>-CPULoop: batch 1 sync
    CPULoop->>+sender: trans batch 1
    loop Every chunk
        sender->>receiver: send chunk
    end
    sender->>receiver: last chunk send aux
    sender->>-CPULoop: trans batch 1 finish
    receiver->>-transfer_queue: start decode
    transfer_queue->>-waiting_queue_d: pop_transfer
    Note right of scheduler: TTFT for req 1
    waiting_queue_d->>scheduler: first token
    deactivate scheduler
    forward_A->>-CPULoop: batch 2 sync
    CPULoop->>+sender: trans batch 2
    loop Every chunk
        sender->>receiver: send chunk
    end
    sender->>receiver: last chunk send aux
    sender->>-CPULoop: trans batch 2 finish
    receiver->>-transfer_queue: start decode
    transfer_queue->>-waiting_queue_d: pop_transfer
    Note right of scheduler: TTFT for req 2
    waiting_queue_d->>scheduler: first token
    deactivate scheduler
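Figure 1: PD disaggregation interaction sequence diagram. It shows the complete workflow of the Prefill and Decode nodes under Mooncake coordination, including resource preallocation, bootstrap coordination, parallel processing, chunked transfer, and decode start.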
graph
    subgraph Prefill_Node [Prefill Node]
        BS[Bootstrap Server]
        subgraph GPU_P1 [GPU 1]
            subgraph SP1 [Scheduler]
                KVM_P1[KV Manager]
                KVS_P1[KV Sender]
            end
        end
        subgraph GPU_P2 [GPU 2]
            subgraph SP2 [Scheduler]
                KVM_P2[KV Manager]
                KVS_P2[KV Sender]
            end
        end
    end
    subgraph Decode_Node [Decode Node]
        subgraph GPU_D1 [GPU 1]
            subgraph SD1 [Scheduler]
                KVM_D1[KV Manager]
                KVR_D1[KV Receiver]
            end
        end
        subgraph GPU_D2 [GPU 2]
            subgraph SD2 [Scheduler]
                KVM_D2[KV Manager]
                KVR_D2[KV Receiver]
            end
        end
    end
    KVM_P1 -->|register| BS
    KVM_P2 -->|register| BS
    KVR_D1 --->|bootstrap, send receiver info| BS
    KVR_D2 --->|bootstrap, send receiver info| BS
    KVS_P1 -->|send KV| KVR_D1
    KVS_P2 -->|send KV| KVR_D2
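Figure 2: Mooncake connection architecture. It shows a star topology with Mooncake as the central hub, coordinating service registration, bootstrap connections, and direct data transfer between Prefill and Decode nodes.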
Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit JMY Space when reposting!