
Gemma Chat: offline vibe coding with Gemma 4 and MLX on Mac

An Electron app that runs Gemma 4 on Apple Silicon with MLX-LM: build and chat modes, model sizing, setup, and when offline helps versus when you still need the network. MIT: github.com/ammaarreshi/gemma-chat

5 min read · ExplainX Team
Tags: Gemma, Local LLM, Apple Silicon, MLX, Open source, Vibe coding



Per its README, Gemma Chat is a local-first desktop app: Electron + Vite + React 19 + TypeScript + Tailwind on the surface, MLX-LM underneath for Gemma 4 on Apple Silicon, with optional Ollama compatibility called out in the repo description. The project bills itself as “vibe code without the internet” after the initial model pull: no API keys on the local path, and an MIT license.

This article is an ExplainX field guide: stack, model sizing, how the agent loop is described upstream, and what to validate if you fork it for your team.

TL;DR

| Question | Short answer |
| --- | --- |
| What is it? | Desktop chat + coding agent for Gemma 4, running via MLX on Mac (Apple Silicon). |
| Why care? | A concrete open-source reference for offline-capable assistant UX tied to Google’s open Gemma line and Apple’s MLX runtime. |
| Primary source | github.com/ammaarreshi/gemma-chat |
| Creator signal | Ammaar Reshi: public launch thread and Google Gemma account amplification (April 2026); star/fork counts change, so check the repo badge row. |
| License | MIT (per repository LICENSE). |

What shipped

The README frames two modes:

  1. Build mode — A coding agent with a live preview: the model writes multi-file HTML/CSS/JS-style trees into a sandboxed workspace while the UI streams updates.
  2. Chat mode — Conversational use with tools (the upstream feature list mentions web search, URL fetch, calculator, and bash).

Supporting pieces called out there include model switching across several Gemma 4 variants, voice input via Whisper (a transformers.js / WASM in-browser path, per the stack table), and first-run automation that provisions a Python venv plus MLX.

How the agent loop is described

The README’s architecture section is worth reading directly. In Build mode the story is:

  • Stream tokens from a local MLX server.
  • Parse XML <action> blocks from the stream (upstream notes small models behaving more reliably with XML than JSON tool calls).
  • Execute actions (file writes, bash, etc.) and feed results back—up to ~40 rounds per user message in the documented design.
  • Flush partial file writes on a timer so the preview iframe can reload while generation is in flight. (The sketch below mirrors this loop.)
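
For concreteness, here is a minimal TypeScript sketch of that shape. None of it is the repo’s actual code: streamTokens, executeAction, and the <action> attribute schema are all assumptions standing in for whatever the real implementation does.

// Hypothetical shape of the documented Build-mode loop; not the repo's code.
type Action = { type: string; path?: string; body: string };

declare function streamTokens(prompt: string): Promise<string>; // local MLX server call (assumed)
declare function executeAction(a: Action): Promise<string>;     // file-write / bash dispatcher (assumed)

const MAX_ROUNDS = 40; // README describes up to ~40 action rounds per user message

// Pull <action type="..." path="...">...</action> blocks out of raw model text.
// Upstream notes small models handle XML tool calls more reliably than JSON.
function parseActions(text: string): Action[] {
  const re = /<action\s+type="([^"]+)"(?:\s+path="([^"]+)")?\s*>([\s\S]*?)<\/action>/g;
  return [...text.matchAll(re)].map((m) => ({ type: m[1], path: m[2], body: m[3] }));
}

async function buildLoop(userMessage: string): Promise<void> {
  let transcript = userMessage;
  for (let round = 0; round < MAX_ROUNDS; round++) {
    const reply = await streamTokens(transcript);   // 1. stream tokens
    const actions = parseActions(reply);            // 2. parse imperative actions
    if (actions.length === 0) break;                // no actions left: the model is done
    transcript += `\n${reply}`;
    for (const action of actions) {                 // 3. mutate workspace, feed results back
      const result = await executeAction(action);
      transcript += `\n<result>${result}</result>`;
    }
    // 4. Separately, a timer flushes partial file writes so the preview
    //    iframe can reload while generation is still in flight.
  }
}

The XML-over-JSON choice matters for small models: a regex over <action> tags still finds every complete block in partially streamed output, where a truncated JSON object would fail to parse at all.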

That pattern (stream → parse imperative actions → mutate workspace → loop) belongs to the same family of “local Codex-style” loops teams are standardizing on in 2026; here it is bound to Gemma + MLX instead of a hosted API.

Models and memory (from upstream table)

The project’s README publishes a simple matrix. Paraphrased here—re-verify on the repo before you buy hardware:

| Variant (as labeled upstream) | Approximate size | Notes |
| --- | --- | --- |
| Gemma 4 E2B | ~1.5 GB | Faster, lighter tasks |
| Gemma 4 E4B | ~3 GB | Recommended balance in README |
| Gemma 4 27B MoE | ~8 GB | Stronger reasoning; 16 GB+ RAM class machine |
| Gemma 4 31B | ~18 GB | Heaviest; 32 GB+ RAM class machine |

Community replies on X have asked the same question your laptop will ask: which row is “enough” for acceptable latency on your thermal budget? There is no substitute for local profiling on the exact chip and cooling you ship with.
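
To put numbers on that, time the stream yourself. Below is a minimal sketch, assuming the local MLX server exposes an OpenAI-compatible streaming endpoint (mlx_lm.server does); the port, model name, and prompt are placeholders to adjust for your setup.

// Rough throughput probe against a local OpenAI-compatible endpoint.
// Endpoint, port, and model name are assumptions — adjust to your setup.
const ENDPOINT = "http://127.0.0.1:8080/v1/chat/completions";

async function profile(model: string, prompt: string): Promise<void> {
  const started = Date.now();
  let firstToken = 0;
  let chunks = 0;

  const res = await fetch(ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model,
      stream: true,
      messages: [{ role: "user", content: prompt }],
    }),
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // Each SSE frame is "data: {json}\n\n"; counting content frames is a
    // crude proxy for tokens, but close enough for comparing model rows.
    for (const line of decoder.decode(value).split("\n")) {
      if (!line.startsWith("data:") || line.includes("[DONE]")) continue;
      if (firstToken === 0) firstToken = Date.now() - started;
      chunks++;
    }
  }

  const totalSec = (Date.now() - started) / 1000;
  console.log(`${model}: first token ${firstToken} ms, ~${(chunks / totalSec).toFixed(1)} chunks/s`);
}

await profile("gemma-4-e4b", "Write a 200-word product description.");

Run it once per table row on the machine you actually ship with; time-to-first-token under sustained thermal load is usually the number that decides the row.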

Getting started (upstream commands)

From the README’s Getting Started block:

git clone https://github.com/ammaarreshi/gemma-chat.git
cd gemma-chat
npm install
npm run dev

Note: Some README snapshots on the web have referenced alternate clone URLs; use the repository you intend to fork, and verify the default branch and package scripts in package.json before documenting runbooks internally.

Packaging:

npm run dist

Upstream states this yields a .dmg for drag-to-Applications installs.

Tradeoffs practitioners are already naming

  • Offline inference ≠ offline everything. Installing npm dependencies, reading live API docs, and shipping CI/CD still want a network—even when the model weights never leave the machine. That distinction matters for security reviews (“data never hits OpenAI”) vs program reality (“the loop still phones home for packages”).
  • First-run downloads are the fragile step: public replies mention crashes during model download—triage via Issues and pinned guidance rather than assumptions.
  • Ecosystem routing: Comments ask for tighter integration with existing local weight stores (for example, pointing at Ollama or LM Studio). The repo description already mentions Ollama; whether that satisfies “use my existing cache” is an integration detail to confirm in code and docs (see the sketch after this list).
  • Speech-to-text: A reply thread references MLX-VLM-style server paths for STT—interesting for forks, not something to assert without matching commit and IPC in this repo.
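
On the Ollama point specifically, the “use my existing cache” question is easy to probe before forking: Ollama’s documented local API lists already-pulled models at GET /api/tags on its default port. A minimal sketch; field handling is kept to name and size, and the response shape is worth re-verifying against Ollama’s API docs.

// List models an existing local Ollama daemon has already pulled.
// Default port 11434 and the /api/tags route are Ollama's documented API.
interface OllamaTag {
  name: string;
  size: number; // bytes on disk
}

async function listLocalModels(): Promise<OllamaTag[]> {
  const res = await fetch("http://127.0.0.1:11434/api/tags");
  if (!res.ok) throw new Error(`Ollama not reachable: ${res.status}`);
  const body = (await res.json()) as { models: OllamaTag[] };
  return body.models;
}

for (const m of await listLocalModels()) {
  console.log(`${m.name} (${(m.size / 1e9).toFixed(1)} GB)`);
}

If the names that come back overlap with the Gemma variants in the table above, a fork that routes through Ollama could skip its own download step entirely; that is the integration detail to confirm in the actual code.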

Why ExplainX readers should care

ExplainX indexes skills, tools, agents, and MCP servers for teams that ship with assistants. Gemma Chat is a reference for one slice of that map: desktop shell + local weights + tool protocol + workspace sandbox. Whether you adopt it directly or borrow patterns, the artifact is inspectable in MIT-licensed source.


Sources

  • Repository: github.com/ammaarreshi/gemma-chat
  • Gemma (Google DeepMind open models): positioning and ecosystem context via Google Gemma on X and official Gemma documentation—use those for model policy and license nuance beyond this app.
  • MLX: Apple’s machine learning research materials on MLX / MLX-LM for runtime semantics.

Star counts, default models, and README clone URLs drift quickly after a viral launch. Reconcile any numbers in this post with the live GitHub page and Issues before budgeting hardware or support.
