How CentML Achieved 2x Inference Speed on DeepSeek-R1 using Speculative Decoding
Learn about the challenges of implementing this technique, how CentML leveraged DeepSeek’s MTP module, and the open-source contributions that make this breakthrough possible.

Background
Since the release of the DeepSeek-R1 language model by Chinese AI startup DeepSeek, the open-source community has been racing to reproduce the inference speed of DeepSeek’s native hosting. Low-level GPU kernel improvements and hardware-specific tuning have raised single-request decoding speed from roughly 15 tokens/second in early reproductions to a heavily optimized 35 tokens/second (as reproduced by both vLLM and SGLang). With the low-hanging optimization fruit now picked, higher-level advancements are needed to accelerate inference further. One powerful tool is speculative decoding: a strategy that effectively lets a model produce multiple tokens per step while guaranteeing exactly the same output as standard decoding.
To accelerate inference, a smaller draft model produces several candidate tokens in sequence, faster than the large model could generate them itself. The large model then evaluates all of the proposed tokens in a single parallel pass, which GPUs handle far more efficiently than generating the same tokens one at a time. This verification always yields at least one token, plus any proposed tokens that the large model would have generated anyway.
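To make the draft-and-verify idea concrete, here is a minimal Python sketch of a single speculative-decoding step under greedy decoding. The `draft_model` and `target_model` callables are illustrative stand-ins rather than any real model API, and production systems verify full probability distributions with rejection sampling rather than the simple greedy matching shown here.

```python
# Minimal sketch of one draft-and-verify step (greedy variant).
# `draft_model` and `target_model` are stand-in callables, not real models:
# the draft model returns the next token for a sequence, while the target
# model returns a next-token prediction for every position in one pass
# (mimicking a single parallel forward pass on the GPU).
from typing import Callable, List


def speculative_step(
    prefix: List[int],
    draft_model: Callable[[List[int]], int],
    target_model: Callable[[List[int]], List[int]],
    num_draft_tokens: int = 4,
) -> List[int]:
    """Return at least one token identical to what the target model would have
    produced on its own, plus any draft tokens it agrees with."""
    # 1. Draft: the small model proposes tokens sequentially (cheap but serial).
    draft, seq = [], list(prefix)
    for _ in range(num_draft_tokens):
        tok = draft_model(seq)
        draft.append(tok)
        seq.append(tok)

    # 2. Verify: the large model scores prefix + draft in ONE parallel pass.
    #    predictions[i] is the target model's choice after seeing seq[: i + 1].
    predictions = target_model(list(prefix) + draft)

    # 3. Accept the longest prefix of the draft the target model agrees with,
    #    then emit the target model's own token at the first disagreement
    #    (or a "bonus" token if every draft token was accepted).
    accepted = []
    for i, tok in enumerate(draft):
        target_tok = predictions[len(prefix) - 1 + i]
        if tok == target_tok:
            accepted.append(tok)
        else:
            accepted.append(target_tok)
            break
    else:
        accepted.append(predictions[-1])
    return accepted
```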
Challenges
Unfortunately, the draft-model strategy does not work with DeepSeek-R1 out of the box: because the brand-new model uses a new tokenizer, there is no existing draft model capable of generating tokens in the same vocabulary as DeepSeek-R1. To unlock speculative decoding, we repurposed an auxiliary component that DeepSeek used to train R1: the multi-token prediction (MTP) modules.
During training, R1 used a pair of modules that predict the next tokens in the sequence, which improves stability over the long training run. The report mentions that these modules might be compatible with speculative decoding at inference time, but it provides no details and no reference implementation. Further, DeepSeek quietly omitted half of the module weights from its release, publishing only one of the two modules used during training.
Figure from DeepSeek’s technical report on the MTP modules, of which only Module 1 is released.
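To give a concrete picture of the module the report describes, here is a rough PyTorch sketch of its structure. It assumes, per the report, that the embedding and output head are shared with the main model, that the normalized previous hidden state is concatenated with the normalized embedding of the next known token and projected back to the hidden size, and that the result passes through one transformer block. The plain `TransformerEncoderLayer` and the shape choices are stand-ins; DeepSeek’s actual block uses MLA attention and MoE layers.

```python
# Rough sketch of the MTP module structure described in the DeepSeek report.
# Shapes and the TransformerEncoderLayer are illustrative stand-ins.
import torch
import torch.nn as nn


class MTPModule(nn.Module):
    def __init__(self, hidden_size: int,
                 shared_embedding: nn.Embedding, shared_head: nn.Linear):
        super().__init__()
        self.embedding = shared_embedding            # shared with the main model
        self.head = shared_head                      # shared output projection
        self.norm_hidden = nn.RMSNorm(hidden_size)   # requires PyTorch >= 2.4
        self.norm_embed = nn.RMSNorm(hidden_size)
        self.proj = nn.Linear(2 * hidden_size, hidden_size)
        # Stand-in for DeepSeek's transformer block (MLA attention + MoE).
        self.block = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=8, batch_first=True)

    def forward(self, prev_hidden: torch.Tensor, next_tokens: torch.Tensor):
        """prev_hidden: [batch, seq, hidden] hidden states from the main model
        (or the previous MTP step); next_tokens: [batch, seq] token ids shifted
        one position ahead. Returns (logits, new_hidden)."""
        tok_emb = self.embedding(next_tokens)
        fused = self.proj(torch.cat(
            [self.norm_hidden(prev_hidden), self.norm_embed(tok_emb)], dim=-1))
        new_hidden = self.block(fused)
        logits = self.head(new_hidden)               # predicts the token after next
        return logits, new_hidden
```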
Implementation
Speculative decoding can still be achieved with the single MTP module DeepSeek released. At CentML, we created a reference implementation of the MTP code based on the description in the DeepSeek report and used the module as the drafter for EAGLE-style speculative decoding in vLLM. To extract further efficiency from the single module, we borrowed a technique from the EAGLE paper, which uses a module similar in style to DeepSeek’s MTP module: the module’s output is fed back into itself to auto-regressively generate additional draft tokens. Draft quality falls off slightly for longer draft sequences, but the extra accepted tokens make it worthwhile.
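The sketch below illustrates this recursive drafting loop under the same assumed MTP interface as in the earlier sketch: the module takes the target model’s last hidden state and last emitted token, predicts the following token, and then feeds its own output hidden state and prediction back into itself to extend the draft. Names and shapes are illustrative; this is not the code that ships in vLLM.

```python
# Simplified EAGLE-style recursive drafting with a single MTP module.
# `mtp_module` is assumed to follow the (hidden, tokens) -> (logits, hidden)
# interface of the MTPModule sketch above; names here are illustrative.
import torch


@torch.no_grad()
def draft_with_mtp(mtp_module, last_hidden: torch.Tensor,
                   last_token: torch.Tensor, num_draft_tokens: int = 3):
    """last_hidden: [batch, 1, hidden] final hidden state from the target model;
    last_token: [batch, 1] the token the target model just emitted.
    Returns a [batch, num_draft_tokens] tensor of proposed tokens."""
    hidden, token = last_hidden, last_token
    proposals = []
    for _ in range(num_draft_tokens):
        # One MTP step: predict the token that follows `token`.
        logits, hidden = mtp_module(hidden, token)
        token = logits[:, -1:].argmax(dim=-1)        # greedy draft token, [batch, 1]
        proposals.append(token)
        # Feeding `hidden` and `token` back in is the recursive trick borrowed
        # from EAGLE; draft quality degrades slowly as the sequence grows.
    return torch.cat(proposals, dim=-1)
```

The proposed tokens are then verified by the full model in a single parallel pass, exactly as in standard speculative decoding.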
Results
The results speak for themselves: using the single MTP module naively gives a 1.6x speedup, and enabling recursive generation from that module raises the margin to 2x over the baseline, generating up to 70 tokens/second! CentML’s implementation is available to the open-source community through our vLLM contributions, and we continue to improve the MTP implementation upstream in vLLM. To see our results in action, give our serverless endpoint of DeepSeek-R1 a try.
