Per its README, Gemma Chat is a local-first desktop app: Electron + Vite + React 19 + TypeScript + Tailwind on the surface, MLX-LM underneath for Gemma 4 on Apple Silicon, with optional Ollama compatibility called out in the repo description. The project bills itself as “vibe code without the internet” after the initial model pull: no API keys required, MIT licensed.
This article is an ExplainX field guide: stack, model sizing, how the agent loop is described upstream, and what to validate if you fork it for your team.
## TL;DR
| Question | Short answer |
|---|---|
| What is it? | Desktop chat + coding agent for Gemma 4, running via MLX on Mac (Apple Silicon). |
| Why care? | A concrete open-source reference for offline-capable assistant UX tied to Google’s open Gemma line and Apple’s MLX runtime. |
| Primary source | github.com/ammaarreshi/gemma-chat |
| Creator signal | Ammaar Reshi—public launch thread and Google Gemma account amplification (April 2026); star/fork counts change—check the repo badge row. |
| License | MIT (per repository LICENSE). |
## What shipped
The README frames two modes:
- Build mode — A coding agent with a live preview: the model writes multi-file HTML/CSS/JS-style trees into a sandboxed workspace while the UI streams updates.
- Chat mode — Conversational use with tools (upstream mentions web search, URL fetch, calculator, bash in feature list).
Supporting pieces called out there include model switching across several Gemma 4 variants, voice input via Whisper (transformers.js / WASM in-browser path per stack table), and first-run automation: Python venv + MLX provisioning.
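First-run provisioning in this style of app usually amounts to the Electron main process spawning a few Python commands. Here is a hypothetical sketch of that sequence; the paths, package names, and flags are illustrative assumptions, not the repo's actual setup scripts, which you should read before relying on any of this:

```typescript
// Hypothetical first-run provisioning plan: create a Python venv and install
// MLX-LM into it. All paths, package names, and flags here are assumptions for
// illustration; check the repo's real setup scripts before trusting them.

interface Step {
  cmd: string;
  args: string[];
}

function provisioningPlan(appDataDir: string): Step[] {
  const venv = `${appDataDir}/venv`;
  const pip = `${venv}/bin/pip`;
  return [
    { cmd: "python3", args: ["-m", "venv", venv] },      // isolated env per app
    { cmd: pip, args: ["install", "--upgrade", "pip"] }, // avoid stale resolver
    { cmd: pip, args: ["install", "mlx-lm"] },           // MLX runtime + server
  ];
}

// In an Electron app, each step would typically run via child_process.spawn,
// streaming stdout back to the renderer for a first-run progress UI.
```

Modeling provisioning as a data structure rather than inline shell strings makes the sequence easy to show in a progress UI and to retry step-by-step when a download fails.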
## How the agent loop is described
The README’s architecture section is worth reading directly. In Build mode the story is:
- Stream tokens from a local MLX server.
- Parse XML `<action>` blocks from the stream (upstream notes small models behave more reliably with XML than with JSON tool calls).
- Execute actions (file writes, bash, etc.) and feed results back, up to ~40 rounds per user message in the documented design.
- Flush partial file writes on a timer so the preview iframe can reload while generation is in flight.
That pattern—stream → parse imperative actions → mutate workspace → loop—is the same family of “local Codex-style” loops teams are standardizing on in 2026; here it is bound to Gemma + MLX instead of a hosted API.
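As a minimal TypeScript sketch of that loop family: the tag shape, round cap, and helper signatures below are invented for illustration and will differ from the repo's actual parser and IPC, but the control flow (buffer the stream, drain complete action blocks, execute, feed results back) matches the documented design.

```typescript
// Sketch of a stream -> parse <action> -> execute -> loop cycle.
// Tag shape and helper names are assumptions, not this repo's real schema.

type Action = { type: string; body: string };

const MAX_ROUNDS = 40; // upstream documents roughly this cap per user message

// Pull complete <action type="...">...</action> blocks out of a buffer,
// returning parsed actions plus the unconsumed tail (partial blocks wait
// until more tokens arrive).
function drainActions(buffer: string): { actions: Action[]; rest: string } {
  const actions: Action[] = [];
  const re = /<action type="([^"]+)">([\s\S]*?)<\/action>/g;
  let lastEnd = 0;
  let m: RegExpExecArray | null;
  while ((m = re.exec(buffer)) !== null) {
    actions.push({ type: m[1], body: m[2] });
    lastEnd = re.lastIndex;
  }
  return { actions, rest: buffer.slice(lastEnd) };
}

async function agentLoop(
  generate: (history: string[]) => AsyncIterable<string>, // token stream from a local server
  execute: (a: Action) => Promise<string>,                // file write / bash / etc.
): Promise<void> {
  const history: string[] = [];
  for (let round = 0; round < MAX_ROUNDS; round++) {
    let buffer = "";
    const pending: Action[] = [];
    for await (const token of generate(history)) {
      buffer += token;
      const { actions, rest } = drainActions(buffer);
      pending.push(...actions);
      buffer = rest;
    }
    if (pending.length === 0) return; // no actions emitted: the turn is done
    for (const a of pending) history.push(await execute(a)); // feed results back
  }
}
```

The regex-drain approach is one reason small models do better with XML here: a partial `<action>` block simply stays in the buffer until its closing tag streams in, whereas partial JSON needs a recovering parser.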
## Models and memory (from upstream table)
The project’s README publishes a simple matrix. Paraphrased here—re-verify on the repo before you buy hardware:
| Variant (as labeled upstream) | Approximate size | Notes |
|---|---|---|
| Gemma 4 E2B | ~1.5 GB | Faster, lighter tasks |
| Gemma 4 E4B | ~3 GB | Recommended balance in README |
| Gemma 4 27B MoE | ~8 GB | Stronger reasoning; 16 GB+ RAM class machine |
| Gemma 4 31B | ~18 GB | Heaviest; 32 GB+ RAM class machine |
Community replies on X have asked the same question your laptop will ask: which row is “enough” for acceptable latency on your thermal budget? There is no substitute for local profiling on the exact chip and cooling you ship with.
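As a rough planning aid, the upstream matrix can be folded into a tiny fit-check helper. The numbers below are the README's approximations restated, not measurements; the RAM floors for the two small variants are our assumption, and real headroom also depends on context length, KV cache, and everything else the machine is running.

```typescript
// Rough fit check from the README's size matrix. Approximate values; the
// 8 GB floors for E2B/E4B are assumptions, not documented upstream.

interface Variant {
  label: string;
  weightsGB: number;  // approximate download size per upstream table
  ramFloorGB: number; // "class machine" hint from the table
}

const VARIANTS: Variant[] = [
  { label: "Gemma 4 E2B", weightsGB: 1.5, ramFloorGB: 8 },   // floor assumed
  { label: "Gemma 4 E4B", weightsGB: 3, ramFloorGB: 8 },     // floor assumed
  { label: "Gemma 4 27B MoE", weightsGB: 8, ramFloorGB: 16 },
  { label: "Gemma 4 31B", weightsGB: 18, ramFloorGB: 32 },
];

// Largest variant whose documented RAM floor fits the machine, or undefined.
function largestFit(machineRamGB: number): Variant | undefined {
  return [...VARIANTS].reverse().find((v) => machineRamGB >= v.ramFloorGB);
}
```

By this table a 16 GB machine lands on the 27B MoE row, but that says nothing about tokens per second under sustained thermal load; treat it as a shortlist filter before profiling, not a recommendation.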
## Getting started (upstream commands)
From the README’s Getting Started block:
```bash
git clone https://github.com/ammaarreshi/gemma-chat.git
cd gemma-chat
npm install
npm run dev
```
Note: Some README snapshots on the web have referenced alternate clone URLs; use the repository you intend to fork and verify default branch and package scripts in package.json before documenting runbooks internally.
Packaging:
```bash
npm run dist
```
Upstream states this yields a .dmg for drag-to-Applications installs.
## Tradeoffs practitioners are already naming
- Offline inference ≠ offline everything. Installing npm dependencies, reading live API docs, and shipping CI/CD still want a network, even when the model weights never leave the machine. That distinction matters for security reviews (“data never hits OpenAI”) versus program reality (“the loop still phones home for packages”).
- First-run downloads are the fragile step: public replies mention crashes during model download; triage via Issues and pinned guidance rather than assumptions.
- Ecosystem routing: Comments ask for tighter integration with existing local weight stores (for example pointing at Ollama or LM Studio). The repo description already mentions Ollama; whether that satisfies “use my existing cache” is an integration detail to confirm in code and docs.
- Speech-to-text: A reply thread references MLX-VLM-style server paths for STT—interesting for forks, not something to assert without matching commit and IPC in this repo.
## Why ExplainX readers should care
ExplainX indexes skills, tools, agents, and MCP servers for teams that ship with assistants. Gemma Chat is a reference for one slice of that map: desktop shell + local weights + tool protocol + workspace sandbox. Whether you adopt it directly or borrow patterns, the artifact is inspectable in MIT-licensed source.
## Related on ExplainX
- What are agent skills? — how portable SKILL.md-style capability fits next to local agent loops
- Google Chrome “skills” and Gemini — Google’s product-surface automation story vs local Gemma runs
- What is MCP? — when your local app exposes tools to hosts
- GLM-5.1, Hugging Face, Ollama — another local weights + stack matrix explainer for comparison
## Sources
- Repository: github.com/ammaarreshi/gemma-chat
- Gemma (Google DeepMind open models): positioning and ecosystem context via Google Gemma on X and official Gemma documentation—use those for model policy and license nuance beyond this app.
- MLX: Apple’s machine learning research materials on MLX / MLX-LM for runtime semantics.
Star counts, default models, and README clone URLs drift quickly after a viral launch. Reconcile any numbers in this post with the live GitHub page and Issues before budgeting hardware or support.