Seesaw: High-Throughput Inference via Model Re-Sharding
A team of researchers from CentML and the University of Toronto analyzed LLM parallelization methods and developed Seesaw, an LLM inference engine optimized for throughput-oriented tasks.

Published: May 12, 2025
Background
Large language model (LLM) inference involves two distinct stages: prefilling (processing the input) and decoding (generating new text). During prefill, the entire input prompt is processed at once, while decoding generates new tokens one at a time. These stages have fundamentally different computational characteristics:
- Prefill stage: Processes multiple tokens simultaneously, with computation and communication dominating runtime
- Decode stage: Processes one token at a time, with a greater share of runtime spent transferring model weights. Batching many requests is important for high decoding throughput because it makes better use of the available compute (a rough back-of-the-envelope comparison follows this list)
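To make the contrast concrete, here is a rough sketch (not taken from the paper) comparing the arithmetic intensity of a single weight matrix multiply in the two stages. The layer sizes and token counts are illustrative assumptions:

```python
def matmul_intensity(num_tokens: int, d_model: int, d_ff: int, bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weight traffic for a [num_tokens, d_model] x [d_model, d_ff] matmul."""
    flops = 2 * num_tokens * d_model * d_ff            # multiply-adds
    weight_bytes = d_model * d_ff * bytes_per_param    # weights are read at least once per pass
    return flops / weight_bytes

d_model, d_ff = 4096, 11008  # roughly 7B-scale layer dimensions (assumed)

# Prefill: thousands of prompt tokens amortize each weight read -> compute-bound.
print("prefill intensity:", matmul_intensity(num_tokens=2048, d_model=d_model, d_ff=d_ff))

# Decode: one new token per sequence; even with a batch of 64 the intensity stays
# low, so runtime is dominated by streaming weights from memory.
print("decode intensity: ", matmul_intensity(num_tokens=64, d_model=d_model, d_ff=d_ff))
```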
Because the two stages have such different computational profiles, no single parallelization strategy is optimal for both. For example, prefill favors pipeline parallelism while decode favors tensor parallelism. However, current LLM inference systems apply one static parallelization strategy across both stages.
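The communication side of this trade-off can be sketched the same way. Under Megatron-style tensor parallelism, each transformer layer all-reduces an activation tensor of shape [num_tokens, d_model], so the volume grows with the number of tokens being processed, whereas pipeline parallelism only sends activations across stage boundaries. The numbers below are illustrative assumptions, not measurements from the paper:

```python
def tp_allreduce_bytes_per_layer(num_tokens: int, d_model: int, bytes_per_elem: int = 2) -> int:
    """Approximate bytes all-reduced per layer under tensor parallelism (two all-reduces per layer)."""
    return 2 * num_tokens * d_model * bytes_per_elem

d_model = 4096
prefill_tokens = 2048   # one long prompt processed at once
decode_tokens = 64      # batch of 64 sequences, one new token each

print("TP comm per layer, prefill:", tp_allreduce_bytes_per_layer(prefill_tokens, d_model) / 2**20, "MiB")
print("TP comm per layer, decode: ", tp_allreduce_bytes_per_layer(decode_tokens, d_model) / 2**20, "MiB")
```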
To reduce these inefficiencies, a team of researchers from CentML and the University of Toronto first analyzed how different parallelization methods perform in the prefill and decode stages of LLM processing, considering all computational costs. Their analysis led them to the development of Seesaw, an LLM inference engine optimized for throughput-oriented tasks. The paper on their work has since been accepted to the 2025 MLSys proceedings. Read on for a summary of this novel approach to parallelization during inference.
The Seesaw Approach
Seesaw introduces dynamic model re-sharding, which reconfigures parallelization strategies between prefill and decode stages. This approach tailors parallelization to each stage’s unique computational demands, reduces communication overhead during prefill, and enhances memory efficiency during decode.
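As a rough illustration of what re-sharding means at the tensor level, the following minimal sketch (not the authors' implementation; names, shapes, and the tensor-parallel degree are assumed) splits one layer's weight column-wise for a tensor-parallel decode layout and reassembles it when switching back to a prefill layout:

```python
import numpy as np

def shard_for_tensor_parallel(weight: np.ndarray, tp_degree: int) -> list[np.ndarray]:
    """Split a [d_in, d_out] weight column-wise into tp_degree shards."""
    return np.split(weight, tp_degree, axis=1)

def gather_from_tensor_parallel(shards: list[np.ndarray]) -> np.ndarray:
    """Reassemble the full weight, e.g. when switching back to a pipeline-style layout."""
    return np.concatenate(shards, axis=1)

weight = np.random.randn(4096, 11008).astype(np.float16)  # one MLP projection (assumed size)

# Prefill: the layer lives wholly on its pipeline stage.
prefill_layout = weight

# Transition to decode: re-shard the same parameters across 4 tensor-parallel ranks.
decode_layout = shard_for_tensor_parallel(prefill_layout, tp_degree=4)
assert all(shard.shape == (4096, 11008 // 4) for shard in decode_layout)

# Transition back before the next prefill phase.
assert np.array_equal(gather_from_tensor_parallel(decode_layout), weight)
```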
Technical Innovations
However, simply switching parallelization strategies creates a new challenge: the overhead of frequent transitions between stages. To keep re-sharding overhead low while still processing large batches in the decode stage, Seesaw introduces two complementary techniques. First, tiered KV cache buffering uses CPU memory as additional staging space, so more requests can be prefilled before the switch. By managing the KV cache across memory tiers, the system can refill GPU memory from the prefill results staged in CPU memory, maintaining large decoding batch sizes. To hide the cost of moving caches between tiers, Seesaw additionally uses asynchronous pipelining, overlapping CPU-GPU data transfers with computation.
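Overlapping transfers with computation in this way is a standard CUDA-streams pattern. The sketch below shows the general idea in PyTorch; it is not Seesaw's code, and `run_decode_step`, the tensor shapes, and the buffer sizes are made-up placeholders:

```python
import torch

assert torch.cuda.is_available(), "this sketch assumes a CUDA device"

copy_stream = torch.cuda.Stream()  # dedicated stream for KV-cache transfers
cpu_kv_block = torch.randn(2, 1024, 32, 128).to(torch.float16).pin_memory()

def run_decode_step(gpu_batch: torch.Tensor) -> torch.Tensor:
    """Placeholder for one decoding iteration on the GPU-resident batch."""
    return gpu_batch * 1.0

gpu_batch = torch.randn(64, 4096, device="cuda", dtype=torch.float16)

# Kick off the next KV block's host-to-device copy on the side stream ...
with torch.cuda.stream(copy_stream):
    gpu_kv_block = cpu_kv_block.to("cuda", non_blocking=True)

# ... while the default stream keeps computing on data that is already resident.
out = run_decode_step(gpu_batch)

# Before touching the transferred block, make the default stream wait for the copy.
torch.cuda.current_stream().wait_stream(copy_stream)
print(out.shape, gpu_kv_block.shape)
```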
Second, transition-minimizing scheduling reduces overhead by switching to decode mode only once the CPU cache is completely full. This approach maintains large batch sizes during decoding while minimizing the number of stage transitions, significantly improving throughput.
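Put together, the scheduling policy can be sketched as a simple loop. The function below uses invented names and a toy buffer capacity for illustration; it is not Seesaw's actual scheduler:

```python
from collections import deque

CPU_BUFFER_CAPACITY = 8  # max number of prefilled requests buffered in CPU memory (assumed)

def run_schedule(pending_requests: deque) -> None:
    cpu_buffer = []
    while pending_requests or cpu_buffer:
        # Stay in prefill mode until the CPU-side KV buffer is full (or work runs out),
        # so each expensive re-sharding transition is amortized over many requests.
        while pending_requests and len(cpu_buffer) < CPU_BUFFER_CAPACITY:
            request = pending_requests.popleft()
            cpu_buffer.append(f"kv_cache({request})")  # stand-in for prefill + offload to CPU

        print(f"re-shard to decode layout; decoding batch of {len(cpu_buffer)}")
        cpu_buffer.clear()  # stand-in for streaming caches back to the GPU and decoding to completion
        if pending_requests:
            print("re-shard back to prefill layout")

run_schedule(deque(f"req{i}" for i in range(20)))
```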
Results and Impact
Seesaw's dynamic model re-sharding automatically switches between parallelization strategies across the stages of LLM inference, and relies on tiered memory buffering and transition-minimizing scheduling to keep switching costs low. This research highlights the importance of stage-specific optimization for LLM inference systems, especially for throughput-oriented tasks where maximizing tokens processed per second is critical.
The results are impressive: Seesaw delivers a 1.36x average throughput improvement over current state-of-the-art inference engines. Read the full paper to learn more about how the team achieved these results.
Author: Qidong Su