Large language models (LLMs) have taken the machine learning community by storm, thanks to their Transformer architecture, which excels at learning from vast amounts of unstructured data, be it text, images, video, or audio. Their capabilities extend to a wide range of tasks, both extractive, such as text classification, and generative, such as text summarization and text-to-image generation.
As their name suggests, LLMs are large, often exceeding the 10-billion parameter mark. Some, like the BLOOM model, boast more than 100 billion parameters. The hefty computational power required for these LLMs, typically sourced from high-end GPUs, can lead to high costs, posing a significant barrier for many organizations seeking to harness state-of-the-art LLMs in their applications.
In this blog post, we delve into the optimization techniques that can effectively reduce the size and inference latency of LLMs, enabling them to run efficiently on Intel CPUs.
The Concept of Quantization
LLMs are usually trained with 16-bit floating-point parameters (FP16/BF16), which means that storing a single weight or activation value requires 2 bytes of memory. Moreover, floating-point arithmetic is more complex and slower than integer arithmetic and demands additional computing power.
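As a quick back-of-the-envelope sketch (the 7-billion-parameter count below is an illustrative round number, not a specific model), here is what this means for the memory footprint of a model's weights:

```python
# Back-of-the-envelope memory footprint for model weights.
num_params = 7_000_000_000

bytes_fp16 = num_params * 2   # 2 bytes per FP16/BF16 parameter
bytes_int8 = num_params * 1   # 1 byte per INT8 parameter

print(f"FP16/BF16 weights: {bytes_fp16 / 1e9:.1f} GB")  # ~14.0 GB
print(f"INT8 weights:      {bytes_int8 / 1e9:.1f} GB")  # ~7.0 GB
```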
Quantization, a model compression technique, aims to tackle both these issues by limiting the range of unique values that model parameters can take. This technique can reduce models to lower precision like 8-bit integers (INT8), thereby shrinking them and replacing complex floating-point operations with simpler and faster integer operations.
Effectively, quantization rescales model parameters to smaller value ranges. When done right, it can shrink your model by at least 2x, with little to no impact on model accuracy.
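Here is a minimal sketch of symmetric 8-bit quantization applied to a single tensor. Production toolkits add calibration, zero-points, and per-channel scales, but the core idea is the same:

```python
import numpy as np

# Symmetric INT8 quantization of a single tensor: map the largest magnitude to 127.
def quantize_int8(x: np.ndarray):
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

# Approximate reconstruction of the original values from the INT8 codes.
def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(x)
print(np.abs(x - dequantize(q, scale)).max())  # small quantization error
```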
Quantization can be applied during training (quantization-aware training - QAT), which typically yields the best results. Alternatively, if you wish to quantize an existing model, you can apply post-training quantization (PTQ), a significantly faster technique requiring minimal computing power.
Several quantization tools are available, including built-in support in PyTorch and the Hugging Face Optimum Intel library, which offers developer-friendly APIs for both QAT and PTQ.
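To make this concrete, here is a minimal PTQ sketch using PyTorch's built-in dynamic quantization on a toy model; it is not the exact Optimum Intel recipe, just an illustration of how little code PTQ can take:

```python
import torch

# Toy model standing in for a real network; any nn.Module with Linear layers works.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
)

# Post-training dynamic quantization: weights are stored in INT8,
# activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized_model(x).shape)
```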
The Challenge with Quantizing LLMs
Recent studies indicate that current quantization techniques don't mesh well with LLMs. In particular, LLMs exhibit large-magnitude outliers in specific activation channels across all layers and tokens. These outliers make the Transformer layers less "quantization-friendly": quantization either truncates the outliers or underflows the low-magnitude activations, and both effects significantly degrade model quality. Moreover, quantization-aware training, which requires additional model training, is often impractical due to a lack of compute resources and data.
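A tiny numerical sketch, with made-up activation values, shows why a single outlier is enough to cause trouble when one scale is shared across a whole tensor:

```python
import numpy as np

# One channel carries a large-magnitude outlier; the shared scale is set by it.
activations = np.array([0.02, -0.015, 0.03, 60.0], dtype=np.float32)

scale = np.abs(activations).max() / 127.0      # the outlier dictates the scale
quantized = np.round(activations / scale).astype(np.int8)

print(quantized)  # the small activations all collapse to 0
```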
Enter SmoothQuant, a novel quantization technique that effectively addresses this issue. It applies a joint mathematical transformation to weights and activations, reducing the ratio between outlier and non-outlier values for activations, albeit at the cost of increasing the ratio for weights. This transformation makes the layers of the Transformer more "quantization-friendly" and enables 8-bit quantization without compromising model quality. Consequently, SmoothQuant yields smaller, faster models that perform well on Intel CPU platforms.
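Here is a minimal sketch of the per-channel smoothing idea behind SmoothQuant; the shapes, the simulated outlier channel, and the alpha value are illustrative, and the real implementation folds these factors into the model weights offline:

```python
import numpy as np

# Migrate quantization difficulty from activations to weights with a
# per-channel smoothing factor. alpha controls the migration strength.
alpha = 0.5

X = np.random.randn(8, 4).astype(np.float32)   # activations (tokens x channels)
X[:, 2] *= 50.0                                # simulate an outlier channel
W = np.random.randn(4, 16).astype(np.float32)  # weights (channels x out_features)

x_max = np.abs(X).max(axis=0)                  # per-channel activation range
w_max = np.abs(W).max(axis=1)                  # per-channel weight range
s = x_max**alpha / w_max**(1 - alpha)          # smoothing factors

X_smooth = X / s                               # activations become easier to quantize
W_smooth = W * s[:, None]                      # weights absorb the difficulty

# The layer's mathematical output is unchanged by the transformation.
assert np.allclose(X @ W, X_smooth @ W_smooth, atol=1e-3)
```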
The Success of SmoothQuant with Popular LLMs
Intel has successfully quantized several LLMs with SmoothQuant-O3 and evaluated their accuracy using the Language Model Evaluation Harness. The results are encouraging: the quantized models are approximately 2x smaller than the pre-trained 16-bit models, most metrics improve, and those that don't are only marginally penalized.
A significant benefit of working with smaller models is a substantial reduction in inference latency. This was demonstrated with real-time text generation using the MPT-7B-chat model on a single-socket Intel Sapphire Rapids CPU with 32 cores and a batch size of 1.
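For reference, here is what straightforward CPU text generation with the original 16-bit checkpoint looks like using the transformers library; the model id and generation settings are assumptions for illustration, and the latency gains above come from an INT8-quantized version of the model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Plain CPU text generation with the unquantized checkpoint (illustrative only).
model_id = "mosaicml/mpt-7b-chat"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)

inputs = tokenizer("What is quantization?", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```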
The emergence of relatively smaller models like Alpaca, BloomZ, and Vicuna opens up new opportunities for enterprises to lower the cost of fine-tuning and inference in production. High-quality quantization brings high-quality chat experiences to Intel CPU platforms, without the need to run colossal LLMs and complex AI accelerators.
In collaboration with Intel, we're hosting an exciting new demo in Spaces called Q8-Chat. Q8-Chat offers a ChatGPT-like chat experience while running on a single-socket Intel Sapphire Rapids CPU with 32 cores and a batch size of 1.
The Future of Quantization
We're currently integrating these new quantization techniques into the Hugging Face Optimum Intel library through the Intel Neural Compressor. Soon, you'll be able to replicate these demos with just a few lines of code.
The future is 8-bit, and it's almost here. Stay tuned!