On June 25, 2026, Liquid AI released LFM2.5-230M — its smallest foundation model yet, and one of the clearest 2026 statements about where the edge-AI market is heading: not bigger models in the cloud, but fast, open-weight models that run agentic tool loops on the device you already have.
Liquid AI's framing on X (@liquidai) and in the official blog post is explicit: LFM2.5-230M is built to run anywhere — cloud GPUs, phone CPUs, Raspberry Pi boards, and robot onboard computers — and to power data extraction pipelines and lightweight on-device agentic workloads, not frontier math or long-form creative writing.
TL;DR
| Spec | LFM2.5-230M |
|---|---|
| Parameters | 230M (smallest in LFM2.5 family) |
| Architecture | LFM2 (Liquid Foundation Model v2) |
| Pre-training | 19T tokens + 32K context extension |
| Post-training | SFT (distilled from LFM2.5-350M) → DPO → multi-domain RL |
| Variants | LFM2.5-230M-Base, LFM2.5-230M (post-trained) |
| Availability | Hugging Face (open-weight) |
| Phone CPU decode | 213 tok/s (Galaxy S25 Ultra, Snapdragon Gen4) |
| Pi 5 CPU decode | 42 tok/s (Raspberry Pi 5) |
| Best for | Tool use, data extraction, instruction following |
| Avoid for | Advanced math, code generation, creative writing |
| Inference | llama.cpp, MLX, vLLM, SGLang, ONNX |
Why Liquid AI Built a 230M Model
The small-model landscape in mid-2026 splits into two camps:
- Reasoning specialists — models like VibeThinker-3B that compress verifiable math and coding into compact parameter counts.
- Edge agents — models optimized for speed, tool calling, and structured extraction on constrained hardware.
LFM2.5-230M is firmly in the second camp. Liquid AI is not trying to beat Claude Fable 5 on SWE-Bench. It is trying to make "hold still for 2 seconds, then walk forward at 1 meter per second" parse into a valid multi-step robot skill plan — on a Jetson Orin, with no cloud round-trip.
That use case — natural language → structured tool calls → physical action — is the same pattern emerging across home automation, industrial IoT, and phone-based agents. The bottleneck is not raw IQ. It is latency, memory footprint, and inference cost per tool loop.
Training Recipe
Liquid AI's post-training pipeline is designed to preserve downstream fine-tuning flexibility while shipping strong default capability:
Stage 1: Supervised fine-tuning with distillation
The 230M model learns from LFM2.5-350M — a larger sibling in the same architecture family. Distillation from a bigger in-family model is a proven pattern for small models: the teacher provides richer supervision signals than raw pre-training alone, without requiring the student to match the teacher on every task class.
Stage 2: Direct preference optimization (DPO)
DPO aligns the model with human-preferred outputs without a separate reward model training loop — lighter-weight than classic RLHF for a model this size.
Stage 3: Multi-domain reinforcement learning
RL across multiple domains pushes tool-use and extraction behavior beyond what SFT alone achieves — similar in spirit to the multi-domain RL stage in other 2026 small-model pipelines, but tuned for applied tasks rather than competition math.
The base checkpoint (LFM2.5-230M-Base) skips post-training for developers who want a clean starting point for domain-specific fine-tunes.
Benchmarks: Beats Models Twice Its Size — on the Right Tasks
Liquid AI evaluated LFM2.5-230M across ten benchmarks. The headline from the blog post: despite 230M parameters, it competes with and often beats models more than twice as large on instruction following, data extraction, and tool use.
Knowledge and instruction following
| Model | GPQA Diamond | MMLU-Pro | IFEval | IFBench | Multi-IF |
|---|---|---|---|---|---|
| LFM2.5-230M | 25.41 | 20.25 | 71.71 | 38.40 | 37.70 |
| LFM2.5-350M | 30.64 | 20.01 | 76.96 | 40.69 | 44.92 |
| LFM2-350M | 27.58 | 19.29 | 64.96 | 18.20 | 32.92 |
| Granite 4.0-H-350M | 22.32 | 13.14 | 61.27 | 17.22 | 28.70 |
| Qwen3.5-0.8B (Instruct) | 27.41 | 37.42 | 59.94 | 22.87 | 41.68 |
| Gemma 3 1B IT | 23.89 | 14.04 | 63.49 | 20.33 | 44.25 |
On IFEval and IFBench, LFM2.5-230M leads Gemma 3 1B and Qwen3.5-0.8B despite being 3–4× smaller. On broad knowledge (MMLU-Pro), Qwen3.5-0.8B still wins — consistent with the Parametric Compression-Coverage pattern: knowledge coverage scales with parameters differently than instruction-following discipline.
Tool use and data extraction
| Model | CaseReportBench | BFCLv3 | BFCLv4 | τ²-Bench Telecom | τ²-Bench Retail |
|---|---|---|---|---|---|
| LFM2.5-230M | 22.51 | 43.26 | 21.03 | 5.26 | 13.68 |
| LFM2.5-350M | 32.45 | 44.11 | 21.86 | 18.86 | 17.84 |
| LFM2-350M | 11.67 | 22.95 | 12.29 | 10.82 | 5.56 |
| Granite 4.0-H-350M | 12.44 | 43.07 | 13.28 | 13.74 | 6.14 |
| Qwen3.5-0.8B (Instruct) | 13.83 | 35.08 | 18.70 | 12.57 | 6.14 |
BFCLv3 (Berkeley Function Calling Leaderboard) scores above 43 put LFM2.5-230M in the same tier as Granite 4.0-H-350M — a model with ~50% more parameters. CaseReportBench (structured medical/clinical data extraction) at 22.51 beats Qwen3.5-0.8B (13.83) and LFM2-350M (11.67) by wide margins.
The τ²-Bench telecom scores are low across the board for 230M — multi-turn customer-service simulation is hard at this scale. Retail is relatively stronger (13.68), suggesting the model handles simpler structured agent scenarios better than long conversational tool chains.
CPU Speed: 213 tok/s on a Phone, 42 tok/s on a Pi
Raw benchmark scores matter less if inference is too slow for real-time agents. Liquid AI's CPU numbers are the release's most practical signal:
| Platform | Hardware | Decode throughput |
|---|---|---|
| Samsung Galaxy S25 Ultra | Qualcomm Snapdragon Gen4 (CPU) | 213 tok/s |
| Raspberry Pi 5 | ARM CPU | 42 tok/s |
Liquid AI compares LFM2.5-230M against similar-sized attention-based and hybrid models (SSM hybrids, Gated Delta Networks) and reports the highest prefill and decode throughput in its class with the smallest memory footprint.
Flash-attention tuning is device-specific: enabled (-fa 1) on Raspberry Pi 5, disabled (-fa 0) on Snapdragon Gen4 — a reminder that edge deployment is as much about per-platform tuning as model selection. See our quantization guide for the broader stack of techniques that make sub-billion models viable on consumer hardware.
Inference Ecosystem: Day-One Support
LFM2.5-230M ships with checkpoints across the edge-to-cloud inference stack:
| Runtime | Use case |
|---|---|
| llama.cpp | GGUF checkpoints for Raspberry Pi, phones, embedded |
| MLX | Apple Silicon (Mac, iPhone via future MLX ports) |
| vLLM / SGLang | GPU-accelerated production serving |
| ONNX | Cross-platform deployment across diverse accelerators |
For production GPU serving, Liquid AI also benchmarks an internal inference stack against SGLang-served competitors — reporting lower end-to-end latency across concurrency levels for LFM2.5 models.
Unitree G1 Demo: Natural Language → Robot Skills
The most visually compelling demo in the release is not a benchmark table — it is a Unitree G1 humanoid robot running LFM2.5-230M entirely on-device on its onboard NVIDIA Jetson Orin.
The architecture:
- User speaks a free-form natural-language command.
- LFM2.5-230M (after a quick fine-tune) acts as a skill-selection layer.
- The model decomposes the instruction into a sequence of tool calls.
- Each tool call invokes a pre-trained low-level skill from NVIDIA's SONIC framework — timed walking, velocity targets, one-legged kneel holds, etc.
Example command from Liquid AI's blog:
"Hold still for 2 seconds, then walk forward at 1 meter per second for 3 meters, hold a forward one-leg kneel for 5 seconds, and walk backward at 0.5 meters per second for 3 meters"
The model outputs a structured multi-step plan chaining skills like timed walking and kneel holds — without cloud inference.
This parallels the Gemma 4 + Open Duck Mini demo at Google I/O 2026 — but with a different model class: 230M parameters focused on tool decomposition rather than 2B multimodal conversation. Both demos point to the same product direction — robots and edge devices need a language-to-action compiler, not a chatbot.
Liquid AI's demo video: YouTube Shorts — Unitree G1 + LFM2.5-230M
Where LFM2.5-230M Fits in the Small-Model Landscape
| Model | Params | Strength | Weakness |
|---|---|---|---|
| LFM2.5-230M | 230M | Speed, tool use, extraction, edge agents | Math, code, creative writing |
| MiniCPM5-1B | 1B | Broad open-model intelligence at 0.5GB | Heavier than 230M for pure tool loops |
| VibeThinker-3B | 3B | AIME 94.3, frontier verifiable reasoning | Too large for Pi-class real-time agents |
| Gemma 4 E2B | 2B | Multimodal on-device (vision + speech) | Different deployment path (LiteRT) |
Liquid AI's honest limitation statement is refreshing: do not use LFM2.5-230M for advanced math, code generation, or creative writing. That clarity helps developers route tasks correctly — use a 230M model for the tool-selection layer in a pipeline, and call a larger model (or cloud API) only when the subtask requires it.
For agentic coding on developer machines, models like Claude Opus 4.8 or OpenRouter Fusion remain the practical choice while Fable 5 stays suspended. LFM2.5-230M targets a different surface: phones, robots, home automation, and high-volume extraction pipelines where cost and latency dominate.
Get Started
Both checkpoints are available now:
- LFM2.5-230M — post-trained, ready for tool-use and extraction workloads
- LFM2.5-230M-Base — pre-trained base for custom fine-tuning
Download from Hugging Face and follow Liquid AI's documentation for local run and fine-tune instructions.
Liquid AI's broader LFM2.5 family spans base models, audio variants, and vision variants under one architecture — positioning the company as an efficiency-first alternative to scaling-parameter frontier labs.
Related ExplainX coverage
| Post | Connection |
|---|---|
| Gemma 4 + Open Duck Mini | On-device robot demo on Pi 5 and Jetson Orin |
| MiniCPM5-1B | Another open small-model breakthrough at sub-1B scale |
| VibeThinker-3B | Opposite end: frontier reasoning in a compact model |
| AI Model Quantization Guide | How sub-billion models run on phones and edge boards |
| NVIDIA N1X at Computex 2026 | On-device AI compute trend on consumer hardware |
Summary
LFM2.5-230M is Liquid AI's bet that the next wave of useful AI is not another 100B-parameter cloud model — it is a 230M-parameter open-weight agent that runs at 213 tok/s on a phone CPU, parses natural language into tool calls on a humanoid robot, and beats models twice its size on instruction following and data extraction.
It is explicitly not a reasoning or coding model. It is an edge agent compiler — fast, small, and deployable everywhere from a Raspberry Pi to a Jetson Orin to a Snapdragon phone.
Last updated: June 26, 2026. Specs and benchmarks sourced from Liquid AI's blog post and @liquidai on X, published June 25, 2026.