Democratizing the Opportunity to Build the Next Generation of AI Applications

Chances are that if you’re a developer looking to build or deploy an AI model these days, you’re running into a bottleneck — getting and paying for access to accelerators, especially top-of-the-line hardware like NVIDIA’s H100 GPUs and Google’s TPUs. Developers everywhere want to build experiences on the bleeding edge of language and vision models, and over the past decade each generation of these models has demanded more compute than the last. The practical result is that demand for GPUs has outrun supply.

Against this backdrop, CentML is building the next generation of compilers and software acceleration for AI workloads, allowing your models to make the most of the GPUs they run on. The team consists of compiler, computer architecture, and distributed systems researchers from the University of Toronto and the Vector Institute who came together last year in anticipation of the accelerated shift toward AI adoption that we now observe.

Having led several influential research efforts on everything from memory compression to neural network benchmarking, as well as building software used in industry by chip manufacturers and cloud providers, the team began productizing their expertise. Now, CentML is releasing compilers and fully managed inference solutions that can drastically improve what developers can expect from their GPUs. For example, the team has sped up LLaMa2 inference by 1.69x on A100 series GPUs and by 1.24x on A10G series GPUs, without any loss in model accuracy. Importantly, CentML’s optimizations are orthogonal to (and can be combined with) several other approaches to speeding up AI inference, like FlashAttention-2, QLoRA, and vLLM. For startups and enterprises looking to offer AI products, this means they can rent or buy fewer, cheaper, or more readily available GPUs to deliver the same bleeding-edge AI experience.

To accomplish this, the team is rebuilding several components of the software stack we all use to leverage the capabilities of GPUs. At the core of this approach are CentML’s deep learning compilers, which push the frontier beyond that of its open-source release, Hidet. At a high level, a deep learning compiler takes in an architecture (e.g., a Transformer or diffusion model) that AI developers wish to use and maps the computational tasks that architecture is composed of to a set of kernels (or functions) that run on GPUs. The quality of that mapping depends on several factors — for example, how efficiently the kernels access memory, or how well computation overlaps with communication. Shaving off even tiny amounts of time per operation can add up to big differences for users when large models perform billions of calculations for every input prompt.
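To make the kernel-mapping idea concrete, here is a toy sketch (not CentML’s actual compiler, and using NumPy on the CPU rather than GPU kernels) of why the mapping matters. A naive mapping runs one kernel per operation, writing each intermediate result back to memory; a fusing compiler can emit a single kernel that computes the whole expression in one pass, cutting memory traffic:

```python
import numpy as np

# Toy "graph": y = relu(x @ W + b), expressed as three separate ops.
def matmul(x, w):
    return x @ w

def add(t, b):
    return t + b

def relu(t):
    return np.maximum(t, 0.0)

def run_unfused(x, w, b):
    # Naive mapping: one kernel per op; each intermediate tensor
    # is materialized and round-tripped through memory.
    t0 = matmul(x, w)
    t1 = add(t0, b)
    return relu(t1)

def run_fused(x, w, b):
    # A fusing compiler can emit one kernel for the whole expression:
    # a single pass with no intermediate tensors written out.
    return np.maximum(x @ w + b, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w = rng.normal(size=(8, 3))
b = rng.normal(size=3)

# Both mappings compute the same result; the fused one does less memory work.
assert np.allclose(run_unfused(x, w, b), run_fused(x, w, b))
```

On a real GPU the savings come from avoiding kernel-launch overhead and round trips to device memory, and a production compiler also tunes details this sketch ignores, such as tiling, scheduling, and overlapping computation with communication.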

As startups and enterprises alike look to reimagine the next generation of software, we see some teams unable to participate fully, constrained by expertise, budget or access. After we learned about and used CentML’s technology, we saw a tremendous opportunity to democratize access to training and deploying the best and most performant models. Sign up here to begin accelerating your AI workloads with CentML.
