There is a persistent myth in the local AI community: to run a 70B model at FP16, you need 140GB of VRAM, and even a 4-bit quantized version requires ~40GB. Most consumer GPUs top out at 24GB. Therefore, 70B at anything approaching full precision requires a multi-GPU server or an expensive cloud instance.
AirLLM breaks this assumption. Not by magic — by trading speed for accessibility in a principled way that the community has found genuinely useful (21,000+ GitHub stars).
The technique is not new. Loading model layers sequentially from disk was used in academic settings before AirLLM made it accessible. What AirLLM did was turn it into a clean open-source library that works with Hugging Face models and takes three lines of code.
The Core Insight: You Don't Need All Layers at Once
A transformer language model is a stack of transformer blocks, each combining attention with a feed-forward network. During a forward pass, the computation flows sequentially:
- Layer 1 processes the input, produces an output
- Layer 2 takes Layer 1's output, processes it, produces an output
- ... repeated through all layers
- Final layer produces the logits (predictions)
Standard inference loads every layer into GPU memory before any computation starts. This is fast — everything is resident — but requires the full model to fit in VRAM simultaneously.
AirLLM's observation: at any point in the computation, only one layer's weights are actually being used. The rest are just waiting.
So why load them all at once?
AirLLM's approach:
- Split the model into individual layer shards, save to disk
- During inference, load layer 1 → compute → free layer 1 from VRAM
- Load layer 2 → compute → free layer 2 from VRAM
- Continue through all layers
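The steps above can be sketched with a toy model. This is a minimal illustration using only the standard library, not AirLLM's implementation: the "layers" are small weight matrices, the shard format is pickle, and every function name here is hypothetical.

```python
import math
import os
import pickle
import random
import tempfile


def make_toy_model(n_layers=4, dim=3, seed=0):
    """Build a toy model: one dim x dim weight matrix per layer."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
            for _ in range(n_layers)]


def save_layer_shards(layers, shard_dir):
    """Write each layer's weights to its own file; return the paths in order."""
    paths = []
    for i, weights in enumerate(layers):
        path = os.path.join(shard_dir, f"layer_{i:03d}.pkl")
        with open(path, "wb") as f:
            pickle.dump(weights, f)
        paths.append(path)
    return paths


def apply_layer(weights, x):
    """Toy layer: matrix-vector product followed by tanh."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]


def streamed_forward(shard_paths, x):
    """Forward pass that holds only one layer's weights in memory at a time."""
    for path in shard_paths:
        with open(path, "rb") as f:
            weights = pickle.load(f)   # load this layer only
        x = apply_layer(weights, x)    # compute with it
        del weights                    # release it before loading the next layer
    return x


def resident_forward(layers, x):
    """Baseline: every layer is already in memory."""
    for weights in layers:
        x = apply_layer(weights, x)
    return x


layers = make_toy_model()
with tempfile.TemporaryDirectory() as shard_dir:
    paths = save_layer_shards(layers, shard_dir)
    print(streamed_forward(paths, [1.0, 0.5, -0.5]))
```

Both forward passes produce the same output. The streamed version trades a disk read per layer for a much smaller memory footprint.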
Peak VRAM usage: one layer's weights + working activations. For a 70B model with ~80 layers, each layer at FP16 is roughly 1.75GB. A 4GB GPU can handle that.
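The per-layer figure is simple arithmetic. The sketch below assumes parameters are spread evenly across layers, which ignores the embeddings and output head, so treat it as an approximation. The function name is illustrative.

```python
def per_layer_gb(total_params, n_layers, bytes_per_param):
    """Approximate weight size of one layer in GB (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / n_layers / 1e9


layer_gb = per_layer_gb(70e9, 80, 2)   # FP16 stores 2 bytes per parameter
headroom_gb = 4 - layer_gb             # what a 4GB GPU has left for activations

print(f"One FP16 layer: {layer_gb:.2f} GB")
print(f"Headroom on a 4GB GPU: {headroom_gb:.2f} GB")
```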
What This Costs: The Speed Trade-off
Every forward pass, and therefore every generated token, reads the full model from disk sequentially. That I/O is the bottleneck.
| Hardware | 70B speed estimate | Comment |
|---|---|---|
| NVMe SSD + GPU | 1–3 tokens/sec | Best-case local setup |
| SATA SSD + GPU | 0.5–1 tokens/sec | Meaningfully slower |
| HDD + GPU | < 0.3 tokens/sec | Not practical |
Compare that to full-VRAM inference on appropriately sized hardware: 15–30 tokens/sec for 70B on a high-end multi-GPU setup. Against the SSD estimates above, AirLLM is roughly 5–60x slower, and slower still on an HDD.
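A back-of-envelope bound can help sanity-check throughput estimates on your own hardware. The sketch below assumes every generated token re-reads all weights from disk with nothing cached in RAM. The read rates are illustrative assumptions, not measurements, and the names are hypothetical.

```python
def cold_read_seconds_per_token(model_gb, read_gb_per_sec):
    """Lower bound on time per token when all weights are re-read from disk."""
    return model_gb / read_gb_per_sec


# Assumed sequential read rates in GB/s; replace these with measured values.
ASSUMED_READ_RATES = {"NVMe SSD": 5.0, "SATA SSD": 0.5, "HDD": 0.15}

MODEL_GB = 140  # 70B parameters at FP16

for disk, rate in ASSUMED_READ_RATES.items():
    seconds = cold_read_seconds_per_token(MODEL_GB, rate)
    print(f"{disk}: at least {seconds:.0f} s per token when uncached")
```

Under these assumed rates, an uncached FP16 pass takes tens of seconds per token even on NVMe. That suggests real throughput depends heavily on compression and on how much of the model the OS file cache can hold. Measuring on the target machine is more reliable than any estimate.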
This is the honest trade-off. AirLLM does not make 70B fast on a 4GB GPU. It makes 70B possible on a 4GB GPU.
The use cases where this trade-off makes sense:
- Batch generation tasks where latency doesn't matter (summarizing documents overnight)
- Low-frequency queries where the alternative is not running the model at all
- Development and experimentation where you need full-size model behavior
- Hardware-constrained environments (edge devices, old workstations)
Quick Start
pip install airllm
from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")
input_text = ['Explain the trade-offs in distributed system design.']
input_tokens = model.tokenizer(
    input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False
)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True
)
output = model.tokenizer.decode(generation_output.sequences[0])
print(output)
Important: On first run, AirLLM downloads the model from Hugging Face and splits it into layer shards. During splitting this needs roughly 2x the model size in disk space (the original plus the shards), so a 70B FP16 model needs ~280GB free. Passing delete_original=True to from_pretrained removes the original download once splitting finishes, reclaiming about half of that space.
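Checking free disk space before the first run can save a failed conversion. The sketch below applies the 2x rule of thumb from the paragraph above. The helper names are hypothetical, and the path should be wherever the download and shards will actually live.

```python
import shutil


def split_space_plan_gb(model_gb):
    """Disk space needed during and after splitting, per the 2x rule of thumb."""
    return {
        "peak_during_split": 2 * model_gb,   # original download + layer shards
        "after_delete_original": model_gb,   # shards only
    }


def has_room_for_split(path, model_gb):
    """True if the filesystem holding `path` has enough free space for the split."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= split_space_plan_gb(model_gb)["peak_during_split"]


print(split_space_plan_gb(140))
print("Enough room in the current directory:", has_room_for_split(".", 140))
```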
Getting 3x Faster With Compression
Layer-by-layer loading is slow because loading from disk is slow. Make the layers smaller and loading gets faster.
AirLLM's compression is block-wise quantization of weights only — not activations. This distinction matters:
Many quantization schemes quantize both weights and activations, because both affect compute throughput. AirLLM's bottleneck is disk I/O, not compute, so it quantizes only the weights. That is easier to do accurately because weights are static, while activations are dynamic runtime values.
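To make "block-wise, weights only" concrete, here is a minimal sketch of symmetric absmax quantization with one scale per block. It is a simplified stand-in: AirLLM relies on bitsandbytes, whose 4-bit formats differ from this plain linear scheme. The function names and block size are assumptions for illustration.

```python
def quantize_blockwise(weights, block_size=64, bits=4):
    """Quantize a flat list of weights, storing one scale per block."""
    qmax = 2 ** (bits - 1) - 1                             # 7 for 4-bit codes
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        scale = max(abs(w) for w in block) / qmax or 1.0   # avoid a zero scale
        codes = [round(w / scale) for w in block]          # integers in [-qmax, qmax]
        blocks.append((scale, codes))
    return blocks


def dequantize_blockwise(blocks):
    """Reconstruct approximate weights from (scale, codes) pairs."""
    return [code * scale for scale, codes in blocks for code in codes]


weights = [0.02, -0.5, 0.31, 0.07, 12.0, -3.0, 0.4, 8.5]
blocks = quantize_blockwise(weights, block_size=4)
restored = dequantize_blockwise(blocks)
for original, approx in zip(weights, restored):
    print(f"{original:8.3f} -> {approx:8.3f}")
```

Because each block has its own scale, the large values in the second block do not degrade the precision of the small values in the first. That is the main reason for quantizing per block rather than per tensor.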
pip install -U bitsandbytes
pip install -U airllm
from airllm import AutoModel
model = AutoModel.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct",
    compression='4bit'  # or '8bit' for a quality/speed balance
)
Reported speedup: ~3x. Accuracy impact: comparable to standard 4-bit quantization — very small for most tasks, slightly more on specialized domains.
With 4-bit compression enabled, a 70B model's layer shards shrink to roughly 35GB in total (from ~140GB at FP16), plus a small overhead for per-block quantization constants, so the disk-to-VRAM loading step gets proportionally faster.
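A quick size check for each compression setting, counting weight bits only and ignoring the per-block scale overhead. The function name is illustrative.

```python
def shard_total_gb(total_params, bits_per_weight):
    """Total size of the weight shards in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


BASELINE_BITS = 16  # FP16

for label, bits in [("FP16", 16), ("8bit", 8), ("4bit", 4)]:
    size = shard_total_gb(70e9, bits)
    print(f"{label}: ~{size:.0f} GB ({bits / BASELINE_BITS:.0%} of the FP16 bytes)")
```

Reading a quarter of the bytes is an upper bound on the load-time gain. Dequantization cost and fixed per-layer overheads are plausible reasons the reported speedup is closer to 3x than 4x.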
Configuration Reference
model = AutoModel.from_pretrained(
    "model-id-or-local-path",
    compression='4bit',             # '4bit', '8bit', or None
    profiling_mode=False,           # True to log per-layer timing
    layer_shards_saving_path=None,  # Custom path for shards
    hf_token=None,                  # For gated models
    prefetching=True,               # Overlap load + compute (~10% speed boost)
    delete_original=False,          # Remove original after splitting
)
The prefetching parameter is the most impactful free optimization — it starts loading the next layer into a buffer while the current layer is computing. Enabled by default for Llama models.
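The overlap idea can be sketched with a background thread: while layer i is computing, layer i+1 is already loading. This is a generic illustration of prefetching, not AirLLM's internal code. The loader and compute functions are toy stand-ins with hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_serial(n_layers, x, load_fn, compute_fn):
    """Baseline: load a layer, compute with it, then load the next."""
    for i in range(n_layers):
        x = compute_fn(load_fn(i), x)
    return x


def run_with_prefetch(n_layers, x, load_fn, compute_fn):
    """Load layer i+1 on a background thread while layer i is computing."""
    if n_layers == 0:
        return x
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_fn, 0)
        for i in range(n_layers):
            weights = pending.result()                 # wait for layer i
            if i + 1 < n_layers:
                pending = pool.submit(load_fn, i + 1)  # start loading layer i+1
            x = compute_fn(weights, x)                 # runs while the load proceeds
    return x


# Toy stand-ins: each layer's "weights" are a single number.
def toy_load(i):
    return i + 1


def toy_compute(weights, x):
    return 2 * x + weights


print(run_with_prefetch(5, 1, toy_load, toy_compute))
```

The gain is limited by whichever step is shorter. When disk reads dominate, hiding the compute behind them saves only the compute time, which fits a modest speedup rather than a dramatic one.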
Supported Models and AutoModel Detection
AutoModel.from_pretrained() detects the model class automatically. No need to import AirLLMLlama2 specifically.
Supported:
- Llama 2, 3, 3.1 (7B through 405B)
- Mistral and Mixtral
- Qwen and Qwen2.5
- ChatGLM (use AutoModel, not the Llama class)
- Baichuan
- InternLM
- All safetensors-format models in the Open LLM Leaderboard top 10
For gated models:
model = AutoModel.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    hf_token="your_hf_token"
)
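Hard-coding a token in a script makes it easy to leak. One common alternative is to read it from an environment variable. The helper below is a hypothetical sketch, and the variable name HF_TOKEN is an assumption about how your environment is set up.

```python
import os


def resolve_hf_token(env_var="HF_TOKEN"):
    """Return the token from the environment, or None if it is unset or empty."""
    token = os.environ.get(env_var, "").strip()
    return token or None


token = resolve_hf_token()
print("Token found" if token else "No token set; gated models will fail to download")

# Usage with the call shown above:
# model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf", hf_token=token)
```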
macOS: Unified Memory Changes the Math
Apple Silicon machines have unified memory — VRAM and system RAM are the same pool. A Mac Studio with 192GB unified memory can technically hold a 70B FP16 model in memory without AirLLM. A Mac with 16GB can't.
AirLLM works on Apple Silicon (M1/M2/M3 only). Install mlx and torch, make sure you are running a native arm64 Python build rather than one running under Rosetta, and run the same code as on Linux.
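A quick way to confirm the interpreter is native is to look at what Python reports for the machine architecture: a native build on Apple Silicon reports arm64, while a build running under Rosetta reports x86_64. The helper name below is hypothetical.

```python
import platform


def is_native_apple_silicon(system=None, machine=None):
    """True when running a native arm64 Python on macOS."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return system == "Darwin" and machine == "arm64"


print("System:", platform.system(), "| Machine:", platform.machine())
print("Native Apple Silicon Python:", is_native_apple_silicon())
```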
The relevant consideration: a 4GB limit on a Mac corresponds to an old MacBook Air or a non-unified-memory setup. Most modern Apple Silicon Macs have 16–192GB of unified memory, which changes when AirLLM is useful. On those machines, layer-by-layer loading is worthwhile mainly for models larger than the available unified memory.
The Practical Limit
AirLLM does not solve the problem for production serving. It solves the problem for individual inference where hardware is the constraint.
If you need:
- High throughput (dozens of concurrent users)
- Low latency (< 2 second response times)
- Reliable uptime
...AirLLM is not the answer. Those requirements need hardware that fits the model.
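The latency point follows directly from the throughput estimates earlier in the article. The sketch below uses a hypothetical 200-token reply and the 2-second budget from the list above. The function names are illustrative.

```python
def response_seconds(n_tokens, tokens_per_sec):
    """Time to generate a full reply at a steady decoding rate."""
    return n_tokens / tokens_per_sec


def meets_latency_budget(n_tokens, tokens_per_sec, budget_seconds=2.0):
    """True if the whole reply fits inside the latency budget."""
    return response_seconds(n_tokens, tokens_per_sec) <= budget_seconds


REPLY_TOKENS = 200  # assumed reply length

for rate in (0.5, 1, 3, 30):
    seconds = response_seconds(REPLY_TOKENS, rate)
    verdict = "ok" if meets_latency_budget(REPLY_TOKENS, rate) else "too slow"
    print(f"{rate} tokens/sec -> {seconds:.0f} s ({verdict})")
```

Even at the best-case local estimate of 3 tokens/sec, a reply of this length takes over a minute. That is fine for an overnight batch job but not for interactive serving.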
AirLLM's value is exactly where its trade-off is acceptable: personal use, development workflows, research environments, and batch tasks where the alternative is not running the model at all.
At 21,000+ stars, the community has clearly found that trade-off acceptable often enough.