Simplicity and Speed: How CentML Sets Record Performance on DeepSeek-R1

Discover how CentML optimized DeepSeek-R1 for record-breaking inference speed using Hidet


Introduction

We are thrilled to announce that DeepSeek-R1 AWQ is now available on the CentML Platform. While the base DeepSeek-R1 model provides superb reasoning, coding, and multilingual capabilities, it requires significant computational resources. Even though it was trained using FP8 instead of the standard BF16 to optimize memory usage, the original DeepSeek-R1 is still a large 685B parameter model that is best deployed on multiple NVIDIA H200 GPUs.

The AWQ (Activation-aware Weight Quantization) version of DeepSeek-R1 maintains the base model’s capabilities while cutting its memory requirements roughly in half. Quantization reduces a model’s overall size by lowering the precision of its weights, trading a small amount of numerical accuracy for more efficient memory usage and execution. (For more on the impact of quantization, see the footnote at the end of this post.) As noted, the DeepSeek-R1 base model uses FP8 for inference, while the AWQ version uses INT4, further reducing its memory requirements and making it more suitable for production environments. Thanks to this optimized memory usage, we have been able to deploy DeepSeek-R1 AWQ on smaller, more cost-effective NVIDIA H100 GPUs.
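
As a rough illustration of the weight-memory savings, the back-of-envelope estimate below counts weights only and ignores KV cache, activations, and quantization scales/metadata, so the real deployment footprint will differ:

```python
# Weight-only memory estimate for a 685B-parameter model (illustrative only;
# ignores KV cache, activations, and quantization scales/metadata).
params = 685e9
print(f"FP8  weights (~1 byte/param):   {params * 1.0 / 1e9:.0f} GB")
print(f"INT4 weights (~0.5 byte/param): {params * 0.5 / 1e9:.0f} GB")
```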

Challenges in scaling

While the benefits of model quantization are clear, it has been challenging to apply at scale. Manual optimization typically achieves significantly better results than compiler optimization, but it requires a deep understanding of hardware architecture and low-level programming, and it is extremely time-consuming. To efficiently develop high-performance quantized kernels for DeepSeek-R1 AWQ, we leveraged Hidet, our open-source deep learning compiler. Hidet applies advanced graph-level and operator-level optimizations to speed up model inference without impacting accuracy.
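
For readers who want to try Hidet itself, the sketch below shows the usual open-source entry point: using Hidet as a torch.compile backend. This is only a minimal example, not the specialized DeepSeek-R1 pipeline described in this post, and configuration option names may differ between Hidet versions:

```python
import torch
import hidet  # pip install hidet; registers the "hidet" torch.compile backend

# Widen Hidet's kernel auto-tuning search (larger values tune longer, run faster).
hidet.torch.dynamo_config.search_space(2)

model = torch.nn.Linear(4096, 4096, bias=False).half().cuda().eval()
x = torch.randn(8, 4096, dtype=torch.float16, device="cuda")

compiled = torch.compile(model, backend="hidet")  # graph- and operator-level optimization
with torch.inference_mode():
    y = compiled(x)
```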

 

Unlocking the power of Mixture of Experts (MoE)

A key factor in the performance of DeepSeek-R1 is its MoE layer. MoE improves model quality while keeping inference cost manageable by routing each token to a small subset of specialized sub-models (experts) rather than activating the entire network, and DeepSeek-R1 AWQ retains this expert structure even after quantization.

There are many ways to implement an MoE layer in LLM serving. The open-source vLLM inference engine uses the Mixed Auto-Regressive Linear (MARLIN) kernel library to implement the MoE layer, which results in multiple invocations of per-expert CUDA kernels. A significantly more efficient fused MoE is implemented in the Triton language. While Triton allows programmers to quickly develop well-performing GPU code, its high level of abstraction and lack of explicit access to shared memory make it difficult to maximize performance for quantized kernels. By implementing the MoE layer in our Hidet compiler, we achieved a 1.9x to 11.3x speedup over the Triton MoE implementation.
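
To make the routing idea concrete, here is a minimal sketch of top-k MoE gating and dispatch in PyTorch. The sizes and expert definitions are stand-ins for illustration, not DeepSeek-R1’s actual configuration; the per-expert loop mirrors what a per-expert-kernel implementation effectively does on the GPU, whereas a fused MoE kernel handles all experts’ tokens in far fewer launches:

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes for illustration only (not DeepSeek-R1's configuration).
hidden, n_experts, top_k, n_tokens = 1024, 8, 2, 16

gate = torch.nn.Linear(hidden, n_experts, bias=False)
experts = torch.nn.ModuleList([
    torch.nn.Sequential(
        torch.nn.Linear(hidden, 4 * hidden),
        torch.nn.SiLU(),
        torch.nn.Linear(4 * hidden, hidden),
    )
    for _ in range(n_experts)
])
x = torch.randn(n_tokens, hidden)

# 1. Router: each token selects its top-k experts and their mixing weights.
scores = F.softmax(gate(x), dim=-1)                    # [tokens, experts]
weights, expert_ids = scores.topk(top_k, dim=-1)       # [tokens, top_k]
weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts

# 2. Dispatch: this naive loop runs one expert at a time, much like launching a
#    separate CUDA kernel per expert.
out = torch.zeros_like(x)
for e in range(n_experts):
    token_idx, slot = (expert_ids == e).nonzero(as_tuple=True)
    if token_idx.numel() == 0:
        continue
    out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * experts[e](x[token_idx])
```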

What makes Hidet so efficient?

  • An automatic layout inference algorithm that eliminates the extremely challenging process of designing tensor layouts within a GPU memory hierarchy.  
  • Native and Pythonic support for hardware intrinsics and CUDA primitives such as cp.async, mma and wgmma instructions.
  • An extensive auto-tuning framework that automatically searches over candidate kernel configurations to maximize performance (a simplified illustration of the idea follows this list).
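
The toy benchmark below only conveys the auto-tuning idea, i.e. time a space of candidate implementations of one operator and keep the fastest. Hidet’s real tuner searches far richer schedules (tile shapes, layouts, pipelining, intrinsics) and is not limited to the two hand-written candidates used here:

```python
import time
import torch

M, N, K = 4096, 4096, 4096
a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)

def bench(fn, iters=20):
    fn(); torch.cuda.synchronize()          # warm up
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# Two hand-written candidate "schedules" for the same matmul.
candidates = {
    "single matmul": lambda: a @ b,
    "split-K (2 chunks)": lambda: a[:, : K // 2] @ b[: K // 2] + a[:, K // 2 :] @ b[K // 2 :],
}
timings = {name: bench(fn) for name, fn in candidates.items()}
best = min(timings, key=timings.get)
print(f"fastest candidate: {best} ({timings[best] * 1e3:.2f} ms)")
```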

To unleash the full potential of this model, we also used speculative decoding specifically tailored to this quantized model. As shown in the chart below, by leveraging Hidet’s MoE layer implementation, we achieved a 1.4x to 2.1x improvement in DeepSeek-R1 AWQ latency and throughput compared to the initial version of the model that used an MoE layer implemented in Triton.

[Chart: DeepSeek-R1 AWQ latency and throughput, Hidet MoE layer vs. Triton MoE layer]
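
For readers unfamiliar with speculative decoding, the minimal sketch below shows the general idea with greedy verification: a cheap draft model proposes a short block of tokens and the large target model checks them, keeping the longest agreeing prefix. Both models are stand-in functions here, and real systems verify the whole block in a single batched forward pass of the target model rather than one call per position:

```python
import torch

torch.manual_seed(0)
vocab, k = 100, 4

def greedy_next(seed, tokens):
    # Stand-in for a language model's greedy next-token choice given a context.
    g = torch.Generator().manual_seed(seed * 1000 + len(tokens))
    return int(torch.randn(vocab, generator=g).argmax())

prompt, draft_seed, target_seed = [1, 2, 3], 7, 42

# 1. The cheap draft model proposes k tokens autoregressively.
ctx = list(prompt)
for _ in range(k):
    ctx.append(greedy_next(draft_seed, ctx))
proposed = ctx[len(prompt):]

# 2. The target model verifies the proposals: stop at the first disagreement and
#    substitute the target's own token there, so every step emits at least one token.
accepted, ctx = [], list(prompt)
for tok in proposed:
    target_tok = greedy_next(target_seed, ctx)
    if target_tok != tok:
        accepted.append(target_tok)
        break
    accepted.append(tok)
    ctx.append(tok)

print("proposed:", proposed, "accepted:", accepted)
```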

Our optimized AWQ model (DeepSeek-R1-Slim-INT4) is now available on the CentML Platform running on NVIDIA H100 GPUs. Give it a try!

Conclusion

Combining performance optimizations with engineering simplicity is at the core of CentML’s mission. We continuously work to provide our customers and the broader open-source community with the most efficient, affordable AI infrastructure. Try out DeepSeek-R1-Slim-INT4, along with accelerated models from Llama, Microsoft, and Qwen, on our serverless endpoints today!

 

Footnote: Quantization Significance and Benefits

2024 was the year of open-source LLMs. Just in the past 9 months, we have witnessed the release of Mixtral, Llama3.2, Llama3.3, Qwen2.5, and DeepSeek-v3. Together, the AI research community has made significant progress in expanding the frontier of LLM capabilities and continues to challenge the reign of proprietary models. 

The biggest advantage of open-source LLMs is their customizability, including model size, hardware selection, and fine-tuning. Start-ups and enterprises are realizing the value of these OSS models, driven by their need for innovation, customization, and control.

The proliferation of open-source releases has also resulted in the increased adoption of quantized models, which represent the original 16-bit floating-point weights, activations, and KV cache in a lower-precision number format. Quantization techniques have proven to be extremely effective in addressing memory resource challenges. Over the last year, we have seen a diverse set of techniques, ranging from post-training quantization (AWQ, GPTQ) to quantization-aware training (Llama Guard 3). Achieving state-of-the-art inference performance is still very challenging, as CUDA kernel development is oftentimes difficult and error-prone. CentML’s compiler and inference engines automate this optimization work, so customers can access the latest quantization techniques and models on the CentML Platform with a single click.
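
As a concrete (and heavily simplified) sketch of what weight quantization does, the snippet below performs a symmetric, group-wise 4-bit quantize/dequantize round trip on a random weight matrix. Real AWQ additionally rescales salient channels using activation statistics, and production kernels pack two 4-bit values per byte rather than storing them in int8 as done here:

```python
import torch

group_size = 128
w = torch.randn(4096, 4096, dtype=torch.float16)        # a 16-bit weight matrix

# Quantize: one scale per group of 128 weights, mapping values into [-8, 7].
wg = w.float().reshape(-1, group_size)
scale = wg.abs().amax(dim=1, keepdim=True) / 7
q = torch.clamp(torch.round(wg / scale), -8, 7).to(torch.int8)  # 4-bit values (stored in int8)

# Dequantize and measure how much the weights moved.
w_hat = (q.float() * scale).reshape_as(w).half()
rel_err = ((w - w_hat).float().abs().mean() / w.float().abs().mean()).item()
print(f"relative reconstruction error: {rel_err:.2%}")

# Storage: 2 bytes/param in BF16/FP16 vs ~0.5 bytes/param at 4 bits (plus small
# per-group scales), i.e. roughly a 4x weight-memory reduction.
```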


Authors: Xiao Zhang, Xin Li, Max (Yang) Hu, Emily Hutson, Tatiana Shpeisman