Why do local LLMs get stuck in repetition loops?

Repetition loops in quantized local models are a sampling problem, not just a model quality problem. Small errors introduced by low-bit quantization accumulate over long generations, and default samplers (plain top_p/top_k) don't correct for them, so the model falls into a repeating pattern.

What is the DRY sampler and does it fix looping?

DRY (Don't Repeat Yourself) is a repetition penalty sampler merged into llama.cpp that detects and penalizes repeated sequences rather than just repeated tokens. Users report it substantially reduces looping in models like Qwen 3.6 27B and GLM-5.2 quants without needing a bigger or less-compressed model.

What are top_n_sigma and XTC samplers?

top_n_sigma is a sampling method that dynamically narrows the candidate token set based on the statistical spread of the logits (it keeps tokens within n standard deviations of the top logit) rather than a fixed top_k or top_p cutoff. XTC (Exclude Top Choices) is presented as a diversification alternative to temperature: some of the time it skips the most predictable tokens instead of picking them. Both are aimed at reducing repetitive, generic output.

Which harness is best for running local LLMs as coding agents?

There is no single winner as of mid-2026. Pi is favored for extensibility, Vibe and OpenCode for simplicity, and Zerostack for minimalism. Whichever harness is used, it needs a good eval loop and tool-calling setup — the model alone is not enough to get good agentic behavior out of a local rig.

How do you safely sandbox a local AI agent that has full filesystem access?

Options range from lightweight namespace isolation (bubblewrap) to microVMs (libkrun, crun-vm) to full VMs with QEMU/libvirt. Full or microVMs are recommended when running an agent in "yolo mode" with unrestricted tool access, since they provide an actual security boundary rather than a permissions convention.

Is Whisper still the best local speech-to-text model in 2026?

No — Whisper large-v3 remains usable but is no longer state of the art. NVIDIA's Parakeet TDT v3 is roughly half the size, faster, and has a lower word error rate for non-streaming transcription, though its tooling ecosystem (ONNX/Nemotron-based) is less mature than Whisper's.

Local LLMs Keep Looping? Fix It With Samplers, Not More VRAM | explainx.ai Blog


Jamesob's $46,000 four-GPU rig for running GLM-5.2 locally got the headline attention on Hacker News. But the more useful conversation happened in the comments, where people running much cheaper local setups — a couple of RTX 3090s, an M5 MacBook, a single Arc B70 — traded fixes for the problem that actually shows up day to day: the model gets stuck in a loop halfway through a task, and buying a less-quantized model isn't the only way to fix it.

Here's what the thread converged on, organized by what you'd actually do about it.

TL;DR: What Actually Fixes Local LLM Problems

Problem	Fix people reported working	Not the fix
Model loops mid-task on long context	DRY repetition penalty + top_n_sigma / XTC samplers in llama.cpp	Buying a bigger or less-quantized model alone
Agent forgets which harness it's running in	Keep tasks short, or use a MoE with a stronger harness; smaller/quantized dense models (e.g. Qwen 3.6 27B at 8-bit) lose tool-call format fidelity on long-horizon tasks	Assuming any local model is drop-in for Claude Code / Codex
Agent has full filesystem + shell access	microVMs (libkrun) or full VMs (QEMU/libvirt) as a real security boundary	Relying on bubblewrap alone if you're running "yolo mode"
Choosing a local speech-to-text model	Parakeet TDT v3 (smaller, faster, lower WER) for most cases; Whisper large-v3 if you need mature tooling	Assuming Whisper is still SOTA by default
$2-3k budget, want a real coding assistant	2× RTX 3090 (48GB) running Qwen 3.6 27B beats an M-series MacBook on memory bandwidth for sustained agent workloads	Buying a MacBook purely for local inference speed

The Looping Problem Is a Sampling Problem, Not Just a Quantization Problem

explainx.ai has already covered why 4-bit and REAP-pruned quantization degrades quality on long-horizon tasks — errors compound over a long generation. What the HN thread added is that a meaningful chunk of that degradation is fixable at the sampler level, without touching the weights at all.

One commenter (running Qwen 3.6 27B dense, undisclosed harness) put it directly:

"Looping, like most other phenomenons related to LLMs, is a sampling problem and can be easily solved with the DRY penalty... The same guy who wrote heretic invented the SOTA antilooping and diversification strategies."

That's a reference to the creator of Heretic, the automated LLM censorship-removal tool covered in an explainx.ai guide, whose sampling work has apparently carried over into anti-repetition tooling too.

The three samplers worth knowing

DRY (Don't Repeat Yourself): detects repeated sequences, not just repeated tokens, and penalizes continuing them. It is already part of llama.cpp's default sampler chain (penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature per the project docs), but being in the chain is not the same as being active: its multiplier defaults to 0, so it does nothing until you set one.
top_n_sigma: keeps only tokens whose logit falls within n standard deviations of the top logit, rather than applying a fixed top_k/top_p cutoff, so the candidate pool adapts to how peaked or flat the distribution is in each context.
XTC (Exclude Top Choices): some of the time, deliberately removes the most predictable candidates (those above a probability threshold, keeping only the least likely of them), as an alternative to raising temperature that adds diversity without just adding noise.
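
To make the three mechanisms concrete, here is a minimal Python sketch of each as a standalone function. It is an illustration under simplifying assumptions, not llama.cpp's implementation: the function names, the dict-based logit representation, and the default values are hypothetical choices for this example, and the DRY version leaves out sequence breakers and any lookback limit.

```python
import math
import random
import statistics


def dry_penalty(context, candidate, multiplier=0.8, base=1.75, allowed_length=2):
    """Amount to subtract from `candidate`'s logit (0.0 if it extends no long repeat)."""
    longest = 0
    # Visit every earlier occurrence of the candidate token.
    for i, tok in enumerate(context):
        if tok != candidate:
            continue
        # Count how many tokens just before position i match the end of the context.
        n = 0
        while n < i and context[i - 1 - n] == context[len(context) - 1 - n]:
            n += 1
        longest = max(longest, n)
    if longest < allowed_length:
        return 0.0
    # The penalty grows exponentially with the length of the repeated sequence.
    return multiplier * base ** (longest - allowed_length)


def top_n_sigma(logits, n=1.0):
    """Keep tokens whose logit is within n standard deviations of the top logit."""
    values = list(logits.values())
    threshold = max(values) - n * statistics.pstdev(values)
    return {tok: logit for tok, logit in logits.items() if logit >= threshold}


def xtc(probs, threshold=0.1, probability=0.5, rng=random):
    """With chance `probability`, drop all tokens at or above `threshold` except the least likely of them."""
    if rng.random() >= probability:
        return dict(probs)  # most of the time, leave the distribution alone
    above = [tok for tok, p in probs.items() if p >= threshold]
    if len(above) < 2:
        return dict(probs)  # nothing to exclude without emptying the top group
    keep = min(above, key=lambda tok: probs[tok])
    # Renormalizing the surviving probabilities is omitted for brevity.
    return {tok: p for tok, p in probs.items() if tok == keep or p < threshold}
```

Calling `dry_penalty([1, 2, 3, 4, 1, 2, 3], 4)` penalizes token 4 because emitting it would repeat the earlier 1, 2, 3, 4 run, while a token that has not followed that run gets no penalty.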

One user reported pushing a model as low as 1.58-bit quantization and still getting usable output "by simply not using the garbage default top_p and top_k that vendors continue to wrongly recommend" — an aggressive claim, but directionally consistent with everyone else's experience that default sampler configs leave real quality on the table.

Practical takeaway: before concluding a quantized model is "too dumb" for a task, check whether DRY and top_n_sigma are actually enabled in your inference server. llama.cpp ships them, but not every wrapper or alternative server (Ollama, LM Studio, custom vLLM configs) exposes or enables them by default.
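
One way to check is to set the sampler parameters explicitly per request instead of trusting server defaults. The sketch below assumes a llama.cpp server listening on 127.0.0.1:8080 and uses the field names from its /completion endpoint as documented at the time of writing; verify them against your build. The numeric values are illustrative starting points, not recommendations from the thread.

```python
import json
import urllib.request


def build_payload(prompt, n_predict=512):
    """Request body with the anti-looping samplers switched on explicitly."""
    return {
        "prompt": prompt,
        "n_predict": n_predict,
        # DRY is inactive while the multiplier is 0, so set it explicitly.
        "dry_multiplier": 0.8,
        "dry_base": 1.75,
        "dry_allowed_length": 2,
        # Illustrative values; tune per model.
        "top_n_sigma": 1.0,
        "xtc_probability": 0.5,
        "xtc_threshold": 0.1,
    }


def complete(prompt, url="http://127.0.0.1:8080/completion"):
    """Send the request to an assumed local llama.cpp server and return the text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())["content"]
```

If a wrapper silently drops fields it does not recognize, the output will look the same with and without them, which is itself a useful signal that the samplers are not reaching the backend.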

What "Convincing Itself It's a Different Harness" Tells You

A sharper failure mode than looping: a quantized model losing track of which agent harness it's running inside, and calling tools that don't exist in that environment. One user described an 8-bit, non-MoE 26B model running in the Pi harness that "somehow convinced itself it was running in Claude instead and started trying to call Claude tools that didn't exist."

This is a tool-calling format fidelity issue, and it gets worse the longer the context runs — which is the same compounding-error mechanism behind looping, just manifesting as hallucinated tool syntax instead of repeated text. If you're building agent workflows on local models, treat long-horizon, many-tool-call sessions as the stress test, not simple chat.
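
A stress test like that needs a number to watch. The sketch below is one hypothetical way to get it: count the share of tool calls that name a tool the harness never registered, per window of the session. The function names and the `{"name": ...}` call shape are assumptions for illustration, not any harness's actual log format.

```python
def hallucinated_tool_rate(tool_calls, available_tools):
    """Fraction of calls that name a tool the harness never registered."""
    if not tool_calls:
        return 0.0
    unknown = [call for call in tool_calls if call["name"] not in available_tools]
    return len(unknown) / len(tool_calls)


def rate_by_window(tool_calls, available_tools, window=20):
    """The same rate per consecutive window, to show drift as the context grows."""
    return [
        hallucinated_tool_rate(tool_calls[i:i + window], available_tools)
        for i in range(0, len(tool_calls), window)
    ]
```

A rate that stays near zero early in the session and climbs in later windows matches the compounding pattern described above.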

Choosing a Harness for Local Models

There's no consensus winner yet, but the thread converged on a rough shortlist:

Harness	Reported strength
Pi	Extensibility — favored for custom tool integration and MCP-style setups
Vibe	Clean UI, "good enough" for most agentic tasks
OpenCode	Straightforward, works well paired with Qwen 3.6 27B
Zerostack	Minimal — least abstraction, fewest surprises
Dirac	Feature-rich but reported as buggier than the alternatives

Whichever harness you pick, the sampler configuration and eval loop matter more than the harness UI — a well-tuned Qwen 3.6 27B in a bare-bones harness will outperform a poorly-sampled GLM-5.2 REAP quant in a polished one.

Sandboxing Agents With Real Filesystem Access

Running a local model with tool-calling access to your actual filesystem and shell raises the same question the industry keeps re-learning: what's the actual security boundary, and does it hold up if the agent (or a prompt injected into its context) tries to escape it?

The thread laid out three tiers, roughly in order of isolation strength:

bubblewrap — namespace-based sandboxing, lightweight, but requires manually bootstrapping each dev tool's config dirs, package caches, and mount points. Workable for constrained agents, more fragile for "give it a full dev environment" setups.
microVMs (libkrun, crun-vm) — a real hardware-virtualization security boundary with much less setup friction than bubblewrap for a full filesystem; you can hand the agent a whole disk image and run in permissive mode without hand-tuning namespace mounts.
Full VMs (QEMU/libvirt) — the most battle-tested and configurable option, standard for GPU passthrough scenarios, at the cost of more resource overhead.
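
To show what "manually bootstrapping mount points" means in practice for the bubblewrap tier, here is a small Python sketch that assembles a bwrap command line. It assumes a merged-/usr Linux distribution and a hypothetical project directory. A real dev environment would need more binds (language toolchains, package caches, config dirs), which is where the fragility comes from.

```python
def bwrap_argv(project_dir, command):
    """Build a bubblewrap invocation that exposes only /usr (read-only) and one project dir."""
    return [
        "bwrap",
        "--unshare-all",        # new namespaces, including network (add --share-net to reach an API)
        "--die-with-parent",    # kill the sandbox if the launcher exits
        "--ro-bind", "/usr", "/usr",
        "--symlink", "usr/bin", "/bin",
        "--symlink", "usr/lib", "/lib",
        "--symlink", "usr/lib64", "/lib64",
        "--proc", "/proc",
        "--dev", "/dev",
        "--tmpfs", "/tmp",
        "--bind", project_dir, project_dir,  # the only writable host path
        "--chdir", project_dir,
        *command,
    ]
```

Passing the result to `subprocess.run` launches the command inside the sandbox; printing it with `shlex.join` first is a cheap way to review exactly what the agent will be able to see.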

One nuance worth internalizing: you don't need to give the sandboxed VM GPU passthrough. Run the inference server (llama.cpp/vLLM) on the host, and put only the part with shell and filesystem access inside the VM or microVM, connecting to the inference server over the local network as an OpenAI-compatible API. That keeps your GPU driver stack out of the trust boundary and reduces what the sandbox actually needs to virtualize. There are two ways to draw the line. Putting the whole agent/harness in the VM is simplest, but anything the harness holds (session data, auth keys) then sits inside the sandbox too. Keeping the harness process on the host and delegating only tool execution to the VM means a compromised sandbox can't touch host secrets, session data, or auth keys.
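
From inside the guest, the only thing the agent needs is the host's address. A minimal sketch of that call, assuming the host is reachable at 192.168.122.1 (a common libvirt NAT gateway, but an assumption here) and the inference server exposes an OpenAI-compatible /v1/chat/completions route on port 8080; the model name "local" is a placeholder:

```python
import json
import urllib.request

# Assumed address of the host-side inference server as seen from the guest.
HOST_API = "http://192.168.122.1:8080/v1"


def chat_request(messages, model="local", base_url=HOST_API):
    """Build an OpenAI-style chat completion request aimed at the host."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(messages):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(chat_request(messages), timeout=300) as response:
        return json.loads(response.read())["choices"][0]["message"]["content"]
```

Nothing in this path touches the GPU from the guest side, so the VM can stay a plain CPU-only image.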

If you're just running a chat interface with no autonomous shell access, none of this applies. The risk is specific to giving a model unattended tool-calling and filesystem permissions, which is the entire point of running agentic coding harnesses like the ones discussed in explainx.ai's agent skills security coverage.

Local Speech-to-Text: Whisper Isn't the Default Answer Anymore

If your local stack includes voice input (dictation, agent handoff, meeting transcription), Whisper large-v3 is still workable but no longer the automatic pick:

Model	Size	Notes
Whisper large-v3	~3GB (fp16)	Mature ecosystem — vLLM support, desktop apps like Buzz, widest tooling
Parakeet TDT v3	~half the size	Faster, lower word-error-rate for non-streaming transcription; needs ONNX/Nemotron tooling on non-CUDA hardware
Voxtral	Larger	Reported as even more accurate than Parakeet, at a bigger resource footprint

For most local setups, Parakeet v3 is the pragmatic default now — it fits comfortably alongside a coding model on a single consumer GPU (under 600MB VRAM in some configs) and doesn't meaningfully compromise transcription quality.
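
If you'd rather check the word-error-rate claim on your own audio than take it on trust, WER is simple to compute: word-level edit distance divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and lowercasing as the only normalization (real benchmarks normalize punctuation and numbers more carefully):

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

Run both models over the same handful of recordings with a hand-checked transcript and compare the averages.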

The Budget Tier Nobody's GPU Envy Mentions

Not everyone needs 384GB of VRAM. The HN thread's most-repeated budget recommendation was 2× RTX 3090 (48GB total, ~$1,500-2,000 on the used market) running Qwen 3.6 27B at 4-bit — with real numbers from someone who benchmarked both a 2×3090 rig and an M3 MacBook Pro on the same model:

Qwen 3.6 27B (int4) on 2× RTX 3090: 68 tok/s at concurrency 1, up to 363 tok/s at concurrency 32, ~1,520 tok/s prompt processing.
Qwen 3.6 27B (int4) on M3 MacBook Pro (36GB): 18 tok/s at concurrency 1 — under a third of the GPU rig's speed.

The gap comes down to memory bandwidth: dual 3090s deliver roughly 1.87 TB/s combined versus an M-series MacBook's 0.3-0.63 TB/s depending on config, and token generation is bandwidth-bound. explainx.ai's existing MacBook vs. dedicated GPU comparison covers the general trade-off; this is the concurrency data point that makes it concrete for agentic workloads specifically, where you're often running multiple parallel tool calls rather than a single chat stream.
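
A back-of-the-envelope check shows why bandwidth sets the ceiling. The sketch assumes every weight is read once per generated token, that a 27B model at 4 bits per weight occupies about 13.5 GB, and that the MacBook sits at the low end (0.3 TB/s) of the range quoted above; KV-cache traffic and multi-GPU overhead are ignored.

```python
def bandwidth_bound_tok_s(bandwidth_gb_s, params_billion, bits_per_weight):
    """Upper bound on single-stream tokens/s if each token reads all weights once."""
    weights_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / weights_gb


rig_ceiling = bandwidth_bound_tok_s(1870, 27, 4)  # 2x RTX 3090, combined bandwidth
mac_ceiling = bandwidth_bound_tok_s(300, 27, 4)   # assumed low-end M-series figure

# Reported single-stream speeds as a share of each ceiling.
rig_efficiency = 68 / rig_ceiling
mac_efficiency = 18 / mac_ceiling

print(f"rig: ceiling {rig_ceiling:.0f} tok/s, reported 68 ({rig_efficiency:.0%})")
print(f"mac: ceiling {mac_ceiling:.0f} tok/s, reported 18 ({mac_efficiency:.0%})")
```

Under these assumptions the ceilings come out near 139 and 22 tok/s, so both reported figures sit below their bound. The 363 tok/s at concurrency 32 does not contradict the single-stream ceiling: batching lets one pass over the weights serve many requests at once.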

If you already own an M-series Mac, it remains a fine way to get started with local models — just don't expect it to match a dedicated GPU rig once you're running agents at any real concurrency.
