What is AI model quantization in simple terms?

Quantization is the process of reducing the numerical precision used to store a model's weights. A standard trained model stores each weight as a 32-bit floating-point number. Quantization converts those weights to 16-bit, 8-bit, or 4-bit integers. Smaller numbers take less memory, so the model fits on consumer hardware. The tradeoff is a tiny amount of precision loss — for most tasks at 8-bit, the output is indistinguishable from the original; at 4-bit, there is a slight but acceptable degradation.

Does quantizing a model make it significantly worse?

FP16 quantization (half-precision) produces virtually no detectable quality loss versus FP32. INT8 quantization introduces negligible degradation on most tasks — studies show less than 1% performance drop on standard benchmarks. INT4 introduces noticeable but task-dependent degradation: for casual chat and code assistance, most users cannot tell the difference. For demanding reasoning or math tasks, a Q4 model may make slightly more errors than Q8. Advanced quantization methods like AWQ and K-quants (used in llama.cpp GGUF files) mitigate INT4 quality loss by identifying and preserving the most important weights at higher precision.

What is the difference between GGUF, GPTQ, and AWQ?

These are three different quantization formats optimised for different hardware. GGUF (used by llama.cpp) is the most accessible — it runs on CPUs, Apple Silicon, and NVIDIA GPUs with partial GPU offloading. It is the format used by Ollama and LM Studio. See **[what is llama.cpp?](/blog/what-is-llama-cpp-run-models-locally-2026)** for install and server setup. GPTQ is optimised for NVIDIA CUDA — faster inference on GPU but requires a CUDA device and is less flexible. AWQ (Activation-aware Weight Quantization) is a more sophisticated GPU-focused format that achieves better quality than GPTQ at the same bit width by identifying which weights are most critical and protecting them at higher precision. For most local users, start with GGUF. Switch to AWQ if you have a CUDA GPU and care about inference speed.

Which quantization level should I use for a 7B model on 8GB VRAM?

Q4_K_M is the standard recommendation for 8GB VRAM. A 7B model at Q4_K_M requires approximately 4.1GB of VRAM for weights, leaving room for the KV cache during inference. Q5_K_M (approximately 4.8GB) is worth trying if your GPU has the headroom — it recovers some of the quality lost at Q4. Avoid Q2 and Q3 unless you are severely memory-constrained; the quality degradation at those levels becomes noticeable even for casual tasks. Q8_0 (approximately 7.2GB) gives near-lossless quality but leaves very little room for KV cache on an 8GB card — context length will be constrained.

Can I run a quantized model on an iPhone or Android device?

Yes, with the right model size and quantization. Phones with 6-8GB of RAM (recent flagship iPhones, Pixel 9 Pro, Galaxy S25) can run 1B-3B parameter models at Q4 quantization. Apple's Core ML and Google's AI Edge frameworks enable on-device inference. Meta's Llama 3.2 1B, Microsoft's Phi-4 Mini, and Google's Gemma 3 1B are all designed for mobile deployment. At Q4, a 3B model needs roughly 2GB — within reach of modern phones. The second YouTube embed in this article shows DeepSeek-R1 running fully offline on a phone using this approach.

What Is AI Model Quantization? Running Frontier AI Locally | explainx.ai Blog

The problem is simple: you download a state-of-the-art language model and discover it requires more VRAM than your GPU has. The solution is quantization — and understanding it is the difference between being locked out of local AI and running frontier-quality models on hardware you already own.

The Memory Problem That Makes Quantization Necessary

Before explaining what quantization is, it helps to understand exactly why it matters.

Every parameter in a language model is a number — specifically, a weight that gets multiplied and added during inference. How many bits you use to represent that number directly determines how much memory the model occupies.

The standard training precision is FP32 (32-bit floating point). Each parameter takes 4 bytes. For a model with 7 billion parameters:

7,000,000,000 × 4 bytes = 28,000,000,000 bytes = approximately 28GB of VRAM

An RTX 4090 — the most powerful consumer GPU available in 2026 — has 24GB of VRAM. That means a 7B model at full training precision does not fit on the best consumer GPU money can buy. A 13B model needs 52GB. A 70B model needs 280GB.

Running even a moderately sized open-source model at full precision requires either expensive datacenter hardware or multiple high-end GPUs combined. For most developers, researchers, and enthusiasts, full-precision local inference is simply not possible with realistic hardware budgets.

This is the problem quantization solves.

What Quantization Actually Is

Quantization is the process of representing model weights using fewer bits. Instead of 32-bit floating-point numbers, you use lower-precision formats:

FP16 / BF16: 16-bit floating point (half precision). Still floats, just half the bits.
INT8: 8-bit integer. A single whole number from -128 to 127.
INT4: 4-bit integer. A whole number from -8 to 7.

Each halving of precision roughly halves the memory requirement. This is not lossless compression — you are genuinely throwing away information. But neural networks are surprisingly robust to this kind of precision reduction, because weights are redundant and the model's behavior emerges from millions of interactions rather than any single number.

The practical insight: most of the bits in a 32-bit weight are noise from the perspective of model behavior. The significant information can be captured in far fewer bits with minimal impact on output quality.

How Precision Levels Map to Memory Requirements

Here is the concrete relationship between quantization level and VRAM requirements across three common model sizes:

Precision	Bits per weight	7B model	13B model	70B model
FP32 (baseline)	32	~28 GB	~52 GB	~280 GB
FP16 / BF16	16	~14 GB	~26 GB	~140 GB
INT8	8	~7 GB	~13 GB	~70 GB
INT4	4	~3.5 GB	~6.5 GB	~35 GB

These are weight-only numbers. Real inference also requires memory for the KV cache (which stores attention state across the context window) and activations. Add roughly 1-3GB overhead for a 7B model running typical context lengths. But the weight memory is the dominant constraint, and the table above shows why quantization is transformative: a 70B model that cannot fit in an entire server rack's worth of consumer GPUs at FP32 becomes runnable on a single RTX 3090 (24GB) at INT4.

Note that FP16 and BF16 deserve separate treatment. FP16 (IEEE 754 half-precision) and BF16 (Brain Float 16, used by Google and widely adopted since) both use 16 bits but allocate them differently: BF16 keeps the same exponent range as FP32 but reduces mantissa precision, making it more numerically stable during training. For inference, they behave similarly and either is commonly referred to as "half precision."

Does Quantization Hurt Quality?

This is the most important practical question, and the honest answer is nuanced.

FP16/BF16: Virtually no measurable quality loss. Nearly all open-source models are released in BF16 as standard. The difference from FP32 is undetectable in practice.

INT8: Negligible loss for most tasks. Benchmark studies consistently show less than 1% performance difference on standard evaluations. For tasks like summarization, coding assistance, Q&A, and translation, INT8 is effectively lossless.

INT4: Task-dependent degradation. For casual conversation, general writing, and code completion, most users cannot perceive a difference. For demanding tasks — competition mathematics, formal reasoning chains, nuanced multi-step inference — a Q4 model makes more errors than Q8. The degradation is real but often acceptable.

INT2 / INT3: Significant quality drop. These extreme quantizations are useful for research and highly constrained edge deployments, but not recommended for production use where quality matters.

The quality loss is also non-uniform across the model. Some components are far more sensitive to precision reduction than others:

Embedding layers: High sensitivity. These store the model's representation of every token and are often kept at higher precision even in aggressive quantization schemes.
Attention heads (Q, K, V projections): Moderately sensitive. The attention mechanism is critical for in-context reasoning.
Feed-forward network mid-layers: Relatively robust. These can tolerate more aggressive quantization with less quality impact.
Output projection (lm_head): High sensitivity. This layer converts internal representations to vocabulary probabilities — quantizing it aggressively degrades output quality.

This non-uniformity is why advanced quantization methods outperform naive approaches. Rather than quantizing every weight identically, they identify which parts of the model are most sensitive and protect them at higher precision.

Post-Training Quantization vs Quantization-Aware Training

There are two fundamentally different approaches to quantization, and understanding the distinction matters for evaluating quality claims.

Post-Training Quantization (PTQ)

PTQ takes a model that has already been fully trained at standard precision and converts its weights to lower precision after the fact. The model never sees quantized values during training — you are simply rounding the learned weights to fit in fewer bits.

PTQ is fast and cheap. You do not need to retrain the model; you convert the weights in minutes to hours depending on model size. The vast majority of quantized models available for download use PTQ. Tools like llama.cpp, AutoGPTQ, and AutoAWQ all operate as PTQ systems.

The limitation of PTQ is that the model was not trained to compensate for quantization error. The weights being rounded were optimized assuming FP32 precision, so there is inherent information loss.

Quantization-Aware Training (QAT)

QAT integrates simulated quantization into the training process itself. During training, the model uses quantized arithmetic (or simulated quantized arithmetic) so the gradients teach the model to be robust to the precision reduction it will face at inference time.

The result is a model that genuinely adapts to lower precision — the learned weights are shaped to work well quantized rather than being rounded versions of high-precision weights. QAT models consistently outperform PTQ models at the same bit width, especially at aggressive quantization levels like INT4.

The tradeoff: QAT requires either training from scratch or fine-tuning a substantial portion of the model under quantization simulation, which is expensive. It requires the compute and time of a significant model training run, not just a post-processing step.

Most open-source quantized models available today use PTQ. As QAT adoption grows — particularly for purpose-built edge models — expect to see quality improve even further at low bit widths without size increases.

GGUF and llama.cpp: How Local AI Became Accessible

The single most important development in consumer local AI was Georgi Gerganov's llama.cpp project and its associated file format, GGUF (GPT-Generated Unified Format, introduced in August 2023 as the successor to the earlier GGML format).

Before llama.cpp, running language models locally required GPU-specific frameworks, CUDA installations, Python environments with complex dependencies, and significant technical expertise. llama.cpp changed this: it is a pure C/C++ inference engine that runs on virtually any hardware — including CPU-only execution, Apple Silicon (via Metal), NVIDIA GPUs (via CUDA), AMD GPUs (via ROCm), and even mobile devices.

GGUF is the container format llama.cpp uses. A .gguf file contains everything needed to run a model: the quantized weights, the tokenizer, and model metadata. You download a single file and point llama.cpp (or any compatible tool) at it.

GGUF Quantization Naming Convention

GGUF files use a specific naming convention that encodes the quantization configuration:

snippet

Model-Name-Q{bits}_K_{size}.gguf

Q: The number of bits per weight. Q4 = 4 bits, Q5 = 5 bits, Q8 = 8 bits.

K: Indicates "K-quant" — a family of quantization methods developed for llama.cpp that use adaptive quantization, grouping weights and applying different scales per group for better quality than naive quantization at the same bit width.

: S (small), M (medium), or L (large) — referring to the size of the calibration dataset used during quantization. M is the most common and represents a good balance.

The full spectrum from most aggressive to least:

Format	Bits	Approx quality	Best use
Q2_K	2.x	Significantly degraded	Extreme memory constraints only
Q3_K_S	3.x	Noticeable degradation	Very tight VRAM, accept quality tradeoff
Q3_K_M	3.x	Moderate degradation	Emergency use
Q4_K_S	4.x	Good	8GB VRAM devices, prioritize fit over quality
Q4_K_M	4.x	Very good	Standard recommended choice for 8-12GB VRAM
Q5_K_S	5.x	Very good	12GB+ VRAM, slight quality improvement over Q4
Q5_K_M	5.x	Excellent	Recommended when you have the VRAM headroom
Q6_K	6.x	Near-lossless	24GB+ VRAM, negligible quality difference from Q8
Q8_0	8	Near-lossless	Maximum quality on large VRAM, slowest

The _0 in Q8_0 (versus Q8_K) indicates a different quantization implementation — legacy in this case; Q8_0 predates the K-quant method but remains widely supported.

GPTQ, AWQ, and GGUF: Choosing the Right Format

Three formats dominate quantized model distribution. They solve the same core problem but with different tradeoffs:

GGUF (llama.cpp)

The most accessible format. Runs on any hardware, does not require a GPU. llama.cpp intelligently splits model layers between GPU and CPU memory ("GPU offloading"), so if your GPU has 8GB of VRAM but the model needs 12GB, it loads as much as fits in VRAM and handles the rest on CPU. Inference is slower in this split configuration, but it works.

Use GGUF if: You want maximum compatibility, you are using Ollama or LM Studio, you have limited or no GPU, or you are on Apple Silicon.

GPTQ

GPTQ (Generative Pre-Training Quantization) is a PTQ method specifically optimized for CUDA GPUs. It uses second-order information (the Hessian of the loss) to minimize quantization error layer by layer, producing better quality than naive rounding at the same bit width. Inference via AutoGPTQ or ExLlamaV2 is fast on CUDA.

Use GPTQ if: You have a CUDA GPU and want faster inference than llama.cpp provides, and the model you need exists in GPTQ format.

AWQ (Activation-Aware Weight Quantization)

AWQ is a more sophisticated PTQ method. The key insight: not all weights are equally important. Some weights are multiplied by large activations during inference — those are the ones that matter most, and rounding them aggressively causes large output errors. AWQ identifies the salient weights by analyzing activation statistics from a small calibration dataset, then protects those weights at higher precision while aggressively quantizing the rest.

The result is measurably better quality than GPTQ at the same bit width. AutoAWQ is the standard implementation, and many models are available pre-quantized in AWQ format on Hugging Face.

Use AWQ if: You have a CUDA GPU, quality at INT4 matters, and you want the best possible output from 4-bit weights.

Summary comparison

Format	Hardware	CPU fallback	Quality (INT4)	Speed	Ease of use
GGUF	Any	Yes	Good (K-quants)	Medium	Excellent
GPTQ	CUDA GPU	No	Good	Fast	Moderate
AWQ	CUDA GPU	No	Best	Fast	Moderate

What to Run Where: The Practical Hardware Guide

Setting up a quantized LLM locally — the complete workflow from download to inference.

The following table maps common hardware configurations to realistic model recommendations. VRAM is the binding constraint; system RAM matters for CPU offloading.

Hardware	VRAM	What fits	Recommended quantization
RTX 4060 8GB	8 GB	7B models	Q4_K_M (fits cleanly, leaves KV cache room)
RTX 3060 12GB	12 GB	7B-13B models	7B at Q5_K_M or Q8; 13B at Q4_K_M
RTX 3080 10GB	10 GB	7B models	Q5_K_M, some 13B at Q4_K_S
RTX 3090 / 4090 24GB	24 GB	Up to 30B	13B at Q8, 30B at Q4_K_M, 7B at full BF16
2× RTX 3090 (NVLink)	48 GB	Up to 70B	70B at Q4_K_M, 30B at Q8
M2 Mac 32GB unified	32 GB	13B-30B	MLX native BF16, or Q5 GGUF via Ollama
M3 Max / M4 Max 64GB	64 GB	Up to 70B	MLX BF16 or Q8; exceptional performance
Phone 6-8GB RAM	6-8 GB	1B-3B	Q4 or Q3 via Core ML / AI Edge

A few important notes on this table:

Apple Silicon uses unified memory — the same pool serves both CPU and GPU. This means an M2 Mac with 32GB of RAM can load a model up to ~28GB into "GPU" memory. MLX (Apple's machine learning framework) exploits this for native-speed inference on Apple models. For GGUF files, Ollama uses Metal acceleration automatically on M-series Macs.

GPU offloading in llama.cpp means you can run models larger than your VRAM with a speed penalty. If a 13B Q4 model needs 8GB and you have 6GB, you can offload most layers to GPU and run the rest on CPU. The speed will be slower but the model will work.

Context length affects VRAM. The KV cache scales with both context length and number of layers. Running a 7B model at 8K context needs more VRAM than the same model at 2K context. If you are near the edge of your VRAM budget, reduce context length before reducing quantization level.

VibeThinker 3B: What Quantization Enables in 2026

The most striking illustration of what quantization makes possible in 2026 is VibeThinker 3B — a 3-billion-parameter model based on Qwen 2.5-Coder that reportedly achieves 94.3 on AIME 2026 and 96.1% on unseen LeetCode contests, placing it in the performance tier of Claude Opus 4.5 on coding benchmarks.

Here is what quantization means for running VibeThinker 3B:

Precision	VRAM required	Hardware that works
BF16 (full)	~6 GB	RTX 3080 10GB, RTX 4060 Ti 16GB, M-series 8GB+
Q8_0	~3.5 GB	Any GPU with 4GB+ VRAM
Q4_K_M	~2 GB	Laptop GPUs, integrated graphics, phones

A model achieving Opus 4.5-class coding performance fitting in 2GB of VRAM — runnable on a gaming laptop, an older discrete GPU, or even Apple Silicon iPad hardware — would have been implausible two years ago. It is now real, and quantization is the enabling technology.

This is the larger trend: the quality-per-bit of compressed models is improving fast enough that the effective performance gap between local quantized AI and cloud API AI is narrowing considerably. The comparison of closed-source vs local open-source models in 2026 tells a very different story than the same comparison in 2024.

Mobile and Edge Deployment: Quantization at the Extreme

Phones represent the most constrained deployment environment for language models: 6-8GB of total system RAM (not VRAM), thermal envelopes designed for sustained use at milliwatts rather than watts, and no discrete GPU in the traditional sense.

Despite these constraints, 2026 has seen working on-device AI deployments on consumer phones become genuinely viable:

Quantization taken to mobile: running a capable model on-device with no cloud dependency.

On-device inference frameworks for mobile:

Apple Core ML / MLX: Apple Silicon neural engines (ANE) in iPhones and iPads accelerate quantized model inference natively. The ANE is specifically designed for INT8 and lower-precision matrix operations. Models under 2GB at Q4 are practical on current flagship iPhones.

Google AI Edge (formerly MediaPipe LLM): Google's on-device inference stack supports INT4 and INT8 models on Pixel and compatible Android devices. Gemma 3 1B was explicitly designed for this deployment path.

llama.cpp on Android: For advanced users, llama.cpp compiles for Android via Termux and supports GGUF files directly. Performance is slower than native framework options but maximum compatibility.

ExecuTorch (Meta): Meta's mobile inference framework, used to deploy Llama 3.2 models on iOS and Android. Supports INT4 quantization with the same K-quant quality improvements developed for llama.cpp.

The viable model range for phones in 2026:

Model size	Q4 VRAM	Phone viability	Use case
1B	~700 MB	Any modern phone	Always-on assistant, quick Q&A
3B	~2 GB	Flagship phones (8GB+ RAM)	Coding help, writing, reasoning
7B	~4 GB	Not practical	Requires more RAM than most phones have

The second YouTube embed above demonstrates DeepSeek-R1 running fully offline on a phone — no internet connection, no API call, all inference on-device. At Q3 or Q4 quantization of the distilled 1.5B version, this is achievable on a current flagship phone.

The implications for privacy are significant: sensitive documents, personal correspondence, and confidential code can be processed by an LLM without any data leaving the device.

The Quantization Toolchain: What to Use

Getting started with quantized local AI requires picking an inference tool. Here is the 2026 landscape:

Ollama

The simplest path. One command downloads, quantizes, and runs a model:

bash

ollama run qwen2.5-coder:7b

Ollama automatically selects a quantization level based on your available VRAM, handles GPU detection, and exposes a local API compatible with OpenAI's client libraries. For most users starting with local AI, Ollama is the right first tool.

Quantization note: Ollama uses GGUF files under the hood. The models in its library come pre-quantized, typically at Q4_K_M for consumer hardware. You can pull specific quantizations:

bash

ollama pull qwen2.5:14b-instruct-q8_0

LM Studio

LM Studio is a GUI application that wraps llama.cpp. Download models from Hugging Face directly within the app, select your quantization level from a dropdown, and run inference with a chat interface. No terminal required. The best option for non-developers who want to explore local models.

llama.cpp Directly

For maximum control, compile and run llama.cpp directly. This gives you access to every GGUF quantization parameter, batch size tuning, context length control, and CPU thread configuration:

bash

./llama-cli -m ./models/qwen2.5-7b-q4_k_m.gguf \
  -n 512 \
  --n-gpu-layers 35 \
  -p "Explain transformer attention in plain English"

The --n-gpu-layers flag controls how many layers to offload to GPU — tune this to match your VRAM.

MLX (Apple Silicon)

For M-series Macs, Apple's MLX framework is purpose-built for unified memory inference and outperforms llama.cpp's Metal backend for many models:

bash

pip install mlx-lm
python -m mlx_lm.generate --model mlx-community/Qwen2.5-7B-Instruct-4bit \
  --prompt "Write a Python quicksort implementation"

MLX models on Hugging Face are pre-converted to Apple's format, often at 4-bit (equivalent to Q4) quantization. The mlx-community organization maintains many current models.

AutoGPTQ and AutoAWQ

For users who want to quantize a model themselves rather than downloading a pre-quantized version, these libraries handle the process:

python

from auto_gptq import AutoGPTQForCausalLM
model = AutoGPTQForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantize_config={"bits": 4, "group_size": 128}
)
model.quantize(calibration_dataset)
model.save_quantized("./qwen2.5-7b-gptq-4bit")

This is primarily useful when a model you want has not yet been quantized by the community, or when you need a specific configuration (different group size, different calibration data).

Why Quantization Quality Is Better Than You'd Expect

A common intuition is that reducing from 32 bits to 4 bits — an 8x reduction — should cause catastrophic quality loss. It usually does not, and understanding why is useful for calibrating your expectations.

Neural networks are massively redundant. The millions of parameters in a model do not each encode unique irreplaceable information. The model's knowledge is distributed across the entire parameter space, which means small perturbations to individual weights (rounding to a nearby integer) are absorbed by the redundancy.

The important information is coarser than you think. FP32 provides about 7 decimal digits of precision. INT8 provides about 2.4 decimal digits. But the critical signal — the sign and rough magnitude of a weight — survives in 4 bits. The low-order precision bits in FP32 weights are effectively noise at the scale of what matters for model behavior.

K-quants and grouped quantization preserve structure. Rather than quantizing every weight independently, modern quantization methods group weights (e.g., 128 weights per group) and store a scale factor per group. Weights within a group are quantized relative to the group's scale. This means the relative magnitudes within a group are preserved, which is what the model actually uses during computation.

Outlier handling is critical. The failure mode for naive INT8 quantization is outlier weights — individual parameters with magnitudes much larger than the distribution. If you map the range [-100, +100] to 256 integer values, a weight of 0.5 gets rounded so aggressively that its value is nearly meaningless. LLM.int8() (a foundational INT8 quantization paper) identified this problem and proposed handling outliers in FP16 while quantizing the rest to INT8. AWQ takes a similar approach at INT4.

The 2026 Context: Why This Matters Now

Three years ago, running a capable language model locally was a research exercise. The models were smaller, the tools were rougher, and the gap between local and cloud quality was enormous.

In 2026, that gap has narrowed to single digits on most benchmarks. The open-weight model ecosystem has accelerated dramatically. Quantization tooling has matured from research prototypes to consumer-friendly applications. And the hardware accessible to individual developers has enough capability to run quantized models at genuinely useful speeds.

The local AI vs cloud AI tradeoffs are no longer theoretical. Developers are choosing local models for the volume of their work and reserving cloud API credits for tasks that genuinely need frontier capability. Building a full personal AI system — model, inference engine, workflow automation, and tool integrations running entirely on local hardware — is now a practical weekend project, not a multi-month infrastructure undertaking.

Quantization is the technical foundation making all of this possible. Understanding how many parameters a model has gives you the raw count; quantization determines whether those parameters can fit in your hardware.

The models keep improving. DeepSeek V4-Pro's architecture with its Mixture-of-Experts design, for instance, is particularly quantization-friendly: only a fraction of parameters are active per token, so the effective compute reduction from quantization compounds with the efficiency from MoE sparsity. The result is increasingly capable models that fit in increasingly modest hardware.

The trajectory is clear: quantization will continue to improve, the quality floor will rise, and the hardware required to run capable local AI will continue to fall. Understanding quantization fundamentals positions you to take advantage of every improvement as it arrives.

Getting Started: The One-Minute Path to Running a Quantized Model

If you want to run a quantized model right now:

Step 1: Install Ollama from ollama.com (macOS, Linux, or Windows).

Step 2: Pull and run a model:

bash

# 7B model, runs on 6GB+ VRAM or Apple Silicon
ollama run llama3.2:7b

# 3B model for lower-resource machines
ollama run phi4-mini

# Coding-focused model
ollama run qwen2.5-coder:7b-instruct-q4_K_M

Step 3: Interact via the terminal interface, or point any OpenAI-compatible client at http://localhost:11434/v1.

Ollama handles quantization selection automatically based on your hardware. The model it pulls will be a GGUF file at an appropriate quantization level for your GPU — most commonly Q4_K_M for consumer hardware. You will be running a quantized frontier-class model on local hardware with a single command.

For more advanced configuration — custom quantization levels, GPU layer control, or building a complete local AI workflow — the personal AI system guide covers the full stack from hardware selection to workflow automation.

The Memory Problem That Makes Quantization Necessary

Before explaining what quantization is, it helps to understand exactly why it matters.

The standard training precision is FP32 (32-bit floating point). Each parameter takes 4 bytes. For a model with 7 billion parameters:

7,000,000,000 × 4 bytes = 28,000,000,000 bytes = approximately 28GB of VRAM

This is the problem quantization solves.

What Quantization Actually Is

Quantization is the process of representing model weights using fewer bits. Instead of 32-bit floating-point numbers, you use lower-precision formats:

FP16 / BF16: 16-bit floating point (half precision). Still floats, just half the bits.
INT8: 8-bit integer. A single whole number from -128 to 127.
INT4: 4-bit integer. A whole number from -8 to 7.

How Precision Levels Map to Memory Requirements

Here is the concrete relationship between quantization level and VRAM requirements across three common model sizes:

Precision	Bits per weight	7B model	13B model	70B model
FP32 (baseline)	32	~28 GB	~52 GB	~280 GB
FP16 / BF16	16	~14 GB	~26 GB	~140 GB
INT8	8	~7 GB	~13 GB	~70 GB
INT4	4	~3.5 GB	~6.5 GB	~35 GB

Does Quantization Hurt Quality?

This is the most important practical question, and the honest answer is nuanced.

FP16/BF16: Virtually no measurable quality loss. Nearly all open-source models are released in BF16 as standard. The difference from FP32 is undetectable in practice.

INT2 / INT3: Significant quality drop. These extreme quantizations are useful for research and highly constrained edge deployments, but not recommended for production use where quality matters.

The quality loss is also non-uniform across the model. Some components are far more sensitive to precision reduction than others:

Embedding layers: High sensitivity. These store the model's representation of every token and are often kept at higher precision even in aggressive quantization schemes.
Attention heads (Q, K, V projections): Moderately sensitive. The attention mechanism is critical for in-context reasoning.
Feed-forward network mid-layers: Relatively robust. These can tolerate more aggressive quantization with less quality impact.
Output projection (lm_head): High sensitivity. This layer converts internal representations to vocabulary probabilities — quantizing it aggressively degrades output quality.

Post-Training Quantization vs Quantization-Aware Training

There are two fundamentally different approaches to quantization, and understanding the distinction matters for evaluating quality claims.

Post-Training Quantization (PTQ)

The limitation of PTQ is that the model was not trained to compensate for quantization error. The weights being rounded were optimized assuming FP32 precision, so there is inherent information loss.

Quantization-Aware Training (QAT)

GGUF and llama.cpp: How Local AI Became Accessible

GGUF Quantization Naming Convention

GGUF files use a specific naming convention that encodes the quantization configuration:

snippet

Model-Name-Q{bits}_K_{size}.gguf

Q: The number of bits per weight. Q4 = 4 bits, Q5 = 5 bits, Q8 = 8 bits.

: S (small), M (medium), or L (large) — referring to the size of the calibration dataset used during quantization. M is the most common and represents a good balance.

The full spectrum from most aggressive to least:

Format	Bits	Approx quality	Best use
Q2_K	2.x	Significantly degraded	Extreme memory constraints only
Q3_K_S	3.x	Noticeable degradation	Very tight VRAM, accept quality tradeoff
Q3_K_M	3.x	Moderate degradation	Emergency use
Q4_K_S	4.x	Good	8GB VRAM devices, prioritize fit over quality
Q4_K_M	4.x	Very good	Standard recommended choice for 8-12GB VRAM
Q5_K_S	5.x	Very good	12GB+ VRAM, slight quality improvement over Q4
Q5_K_M	5.x	Excellent	Recommended when you have the VRAM headroom
Q6_K	6.x	Near-lossless	24GB+ VRAM, negligible quality difference from Q8
Q8_0	8	Near-lossless	Maximum quality on large VRAM, slowest

The _0 in Q8_0 (versus Q8_K) indicates a different quantization implementation — legacy in this case; Q8_0 predates the K-quant method but remains widely supported.

GPTQ, AWQ, and GGUF: Choosing the Right Format

Three formats dominate quantized model distribution. They solve the same core problem but with different tradeoffs:

GGUF (llama.cpp)

Use GGUF if: You want maximum compatibility, you are using Ollama or LM Studio, you have limited or no GPU, or you are on Apple Silicon.

GPTQ

Use GPTQ if: You have a CUDA GPU and want faster inference than llama.cpp provides, and the model you need exists in GPTQ format.

AWQ (Activation-Aware Weight Quantization)

The result is measurably better quality than GPTQ at the same bit width. AutoAWQ is the standard implementation, and many models are available pre-quantized in AWQ format on Hugging Face.

Use AWQ if: You have a CUDA GPU, quality at INT4 matters, and you want the best possible output from 4-bit weights.

Summary comparison

Format	Hardware	CPU fallback	Quality (INT4)	Speed	Ease of use
GGUF	Any	Yes	Good (K-quants)	Medium	Excellent
GPTQ	CUDA GPU	No	Good	Fast	Moderate
AWQ	CUDA GPU	No	Best	Fast	Moderate

What to Run Where: The Practical Hardware Guide

Setting up a quantized LLM locally — the complete workflow from download to inference.

The following table maps common hardware configurations to realistic model recommendations. VRAM is the binding constraint; system RAM matters for CPU offloading.

Hardware	VRAM	What fits	Recommended quantization
RTX 4060 8GB	8 GB	7B models	Q4_K_M (fits cleanly, leaves KV cache room)
RTX 3060 12GB	12 GB	7B-13B models	7B at Q5_K_M or Q8; 13B at Q4_K_M
RTX 3080 10GB	10 GB	7B models	Q5_K_M, some 13B at Q4_K_S
RTX 3090 / 4090 24GB	24 GB	Up to 30B	13B at Q8, 30B at Q4_K_M, 7B at full BF16
2× RTX 3090 (NVLink)	48 GB	Up to 70B	70B at Q4_K_M, 30B at Q8
M2 Mac 32GB unified	32 GB	13B-30B	MLX native BF16, or Q5 GGUF via Ollama
M3 Max / M4 Max 64GB	64 GB	Up to 70B	MLX BF16 or Q8; exceptional performance
Phone 6-8GB RAM	6-8 GB	1B-3B	Q4 or Q3 via Core ML / AI Edge

A few important notes on this table:

VibeThinker 3B: What Quantization Enables in 2026

Here is what quantization means for running VibeThinker 3B:

Precision	VRAM required	Hardware that works
BF16 (full)	~6 GB	RTX 3080 10GB, RTX 4060 Ti 16GB, M-series 8GB+
Q8_0	~3.5 GB	Any GPU with 4GB+ VRAM
Q4_K_M	~2 GB	Laptop GPUs, integrated graphics, phones

Mobile and Edge Deployment: Quantization at the Extreme

Despite these constraints, 2026 has seen working on-device AI deployments on consumer phones become genuinely viable:

Quantization taken to mobile: running a capable model on-device with no cloud dependency.

On-device inference frameworks for mobile:

The viable model range for phones in 2026:

Model size	Q4 VRAM	Phone viability	Use case
1B	~700 MB	Any modern phone	Always-on assistant, quick Q&A
3B	~2 GB	Flagship phones (8GB+ RAM)	Coding help, writing, reasoning
7B	~4 GB	Not practical	Requires more RAM than most phones have

The implications for privacy are significant: sensitive documents, personal correspondence, and confidential code can be processed by an LLM without any data leaving the device.

The Quantization Toolchain: What to Use

Getting started with quantized local AI requires picking an inference tool. Here is the 2026 landscape:

Ollama

The simplest path. One command downloads, quantizes, and runs a model:

bash

ollama run qwen2.5-coder:7b

Quantization note: Ollama uses GGUF files under the hood. The models in its library come pre-quantized, typically at Q4_K_M for consumer hardware. You can pull specific quantizations:

bash

ollama pull qwen2.5:14b-instruct-q8_0

LM Studio

llama.cpp Directly

For maximum control, compile and run llama.cpp directly. This gives you access to every GGUF quantization parameter, batch size tuning, context length control, and CPU thread configuration:

bash

./llama-cli -m ./models/qwen2.5-7b-q4_k_m.gguf \
  -n 512 \
  --n-gpu-layers 35 \
  -p "Explain transformer attention in plain English"

The --n-gpu-layers flag controls how many layers to offload to GPU — tune this to match your VRAM.

MLX (Apple Silicon)

For M-series Macs, Apple's MLX framework is purpose-built for unified memory inference and outperforms llama.cpp's Metal backend for many models:

bash

pip install mlx-lm
python -m mlx_lm.generate --model mlx-community/Qwen2.5-7B-Instruct-4bit \
  --prompt "Write a Python quicksort implementation"

MLX models on Hugging Face are pre-converted to Apple's format, often at 4-bit (equivalent to Q4) quantization. The mlx-community organization maintains many current models.

AutoGPTQ and AutoAWQ

For users who want to quantize a model themselves rather than downloading a pre-quantized version, these libraries handle the process:

python

from auto_gptq import AutoGPTQForCausalLM
model = AutoGPTQForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantize_config={"bits": 4, "group_size": 128}
)
model.quantize(calibration_dataset)
model.save_quantized("./qwen2.5-7b-gptq-4bit")

This is primarily useful when a model you want has not yet been quantized by the community, or when you need a specific configuration (different group size, different calibration data).

Why Quantization Quality Is Better Than You'd Expect

The 2026 Context: Why This Matters Now

Three years ago, running a capable language model locally was a research exercise. The models were smaller, the tools were rougher, and the gap between local and cloud quality was enormous.

Getting Started: The One-Minute Path to Running a Quantized Model

If you want to run a quantized model right now:

Step 1: Install Ollama from ollama.com (macOS, Linux, or Windows).

Step 2: Pull and run a model:

bash

# 7B model, runs on 6GB+ VRAM or Apple Silicon
ollama run llama3.2:7b

# 3B model for lower-resource machines
ollama run phi4-mini

# Coding-focused model
ollama run qwen2.5-coder:7b-instruct-q4_K_M

Step 3: Interact via the terminal interface, or point any OpenAI-compatible client at http://localhost:11434/v1.

The Memory Problem That Makes Quantization Necessary

What Quantization Actually Is

How Precision Levels Map to Memory Requirements

Does Quantization Hurt Quality?

Post-Training Quantization vs Quantization-Aware Training

Post-Training Quantization (PTQ)

Quantization-Aware Training (QAT)

GGUF and llama.cpp: How Local AI Became Accessible

GGUF Quantization Naming Convention

GPTQ, AWQ, and GGUF: Choosing the Right Format

GGUF (llama.cpp)

GPTQ

AWQ (Activation-Aware Weight Quantization)

Summary comparison

What to Run Where: The Practical Hardware Guide

VibeThinker 3B: What Quantization Enables in 2026

Mobile and Edge Deployment: Quantization at the Extreme

The Quantization Toolchain: What to Use

Ollama

LM Studio

llama.cpp Directly

MLX (Apple Silicon)

AutoGPTQ and AutoAWQ

Why Quantization Quality Is Better Than You'd Expect

The 2026 Context: Why This Matters Now

Getting Started: The One-Minute Path to Running a Quantized Model

The Memory Problem That Makes Quantization Necessary

What Quantization Actually Is

How Precision Levels Map to Memory Requirements

Does Quantization Hurt Quality?

Post-Training Quantization vs Quantization-Aware Training

Post-Training Quantization (PTQ)

Quantization-Aware Training (QAT)

GGUF and llama.cpp: How Local AI Became Accessible

GGUF Quantization Naming Convention

GPTQ, AWQ, and GGUF: Choosing the Right Format

GGUF (llama.cpp)

GPTQ

AWQ (Activation-Aware Weight Quantization)

Summary comparison

What to Run Where: The Practical Hardware Guide

VibeThinker 3B: What Quantization Enables in 2026

Mobile and Edge Deployment: Quantization at the Extreme

The Quantization Toolchain: What to Use

Ollama

LM Studio

llama.cpp Directly

MLX (Apple Silicon)

AutoGPTQ and AutoAWQ

Why Quantization Quality Is Better Than You'd Expect

The 2026 Context: Why This Matters Now

Getting Started: The One-Minute Path to Running a Quantized Model

Related posts

Musk: Long-Term, 99.99% of AI Compute Goes to Space

Kimi K3 1-Bit GGUF: 1.56TB Shrunk to 594GB, ~79% Accuracy Kept

Top 10 Open-Weight Models You Can Actually Run on a Laptop

Related posts

Musk: Long-Term, 99.99% of AI Compute Goes to Space

Kimi K3 1-Bit GGUF: 1.56TB Shrunk to 594GB, ~79% Accuracy Kept

Top 10 Open-Weight Models You Can Actually Run on a Laptop