Local LLMs Keep Looping? Fix It With Samplers, Not More VRAM
Hacker News dissects why quantized local models like Qwen 3.6 and GLM-5.2 loop mid-task, and how DRY, top_n_sigma, and XTC sampling fix it without buying more hardware β plus harness and sandboxing picks.
Jamesob's $46,000 four-GPU rig for running GLM-5.2 locally got the headline attention on Hacker News. But the more useful conversation happened in the comments, where people running much cheaper local setups β a couple of RTX 3090s, an M5 MacBook, a single Arc B70 β traded fixes for the problem that actually shows up day to day: the model gets stuck in a loop halfway through a task, and buying a less-quantized model isn't the only way to fix it.
Here's what the thread converged on, organized by what you'd actually do about it.
TL;DR: What Actually Fixes Local LLM Problems
Problem
Fix people reported working
Not the fix
Model loops mid-task on long context
DRY repetition penalty + top_n_sigma / XTC samplers in llama.cpp
Buying a bigger or less-quantized model alone
Agent forgets which harness it's running in
Smaller/quantized dense models (e.g. Qwen 3.6 27B at 8-bit) lose tool-call format fidelity on long-horizon tasks; keep tasks short or use a MoE with a stronger harness
Assuming any local model is drop-in for Claude Code / Codex
Agent has full filesystem + shell access
microVMs (libkrun) or full VMs (QEMU/libvirt) as a real security boundary
Relying on bubblewrap alone if you're running "yolo mode"
Choosing a local speech-to-text model
Parakeet TDT v3 (smaller, faster, lower WER) for most cases; Whisper large-v3 if you need mature tooling
Assuming Whisper is still SOTA by default
$2-3k budget, want a real coding assistant
2Γ RTX 3090 (48GB) running Qwen 3.6 27B beats an M-series MacBook on memory bandwidth for sustained agent workloads
Buying a MacBook purely for local inference speed
The Looping Problem Is a Sampling Problem, Not Just a Quantization Problem
explainx.ai has already covered why 4-bit and REAP-pruned quantization degrades quality on long-horizon tasks β errors compound over a long generation. What the HN thread added is that a meaningful chunk of that degradation is fixable at the sampler level, without touching the weights at all.
One commenter (running Qwen 3.6 27B dense, undisclosed harness) put it directly:
"Looping, like most other phenomenons related to LLMs, is a sampling problem and can be easily solved with the DRY penalty... The same guy who wrote heretic invented the SOTA antilooping and diversification strategies."
That's a direct reference to the creator of Heretic, explainx.ai's guide to automated LLM censorship removal, whose sampling work has apparently carried over into anti-repetition tooling too.
The three samplers worth knowing
DRY (Don't Repeat Yourself) β detects repeated sequences, not just repeated tokens, and penalizes continuing them. Already merged into llama.cpp's default sampler chain (penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature per the project docs).
top_n_sigma β narrows the candidate token pool based on the statistical spread of logits rather than a fixed top_k/top_p cutoff, which adapts better across different generation contexts than a static threshold.
XTC (Exclude Top Choices) β deliberately excludes the single most predictable next token some of the time, used as an alternative to raising temperature for diversity without just adding noise.
One user reported pushing a model as low as 1.58-bit quantization and still getting usable output "by simply not using the garbage default top_p and top_k that vendors continue to wrongly recommend" β an aggressive claim, but directionally consistent with everyone else's experience that default sampler configs leave real quality on the table.
Practical takeaway: before concluding a quantized model is "too dumb" for a task, check whether DRY and top_n_sigma are actually enabled in your inference server β llama.cpp ships them, but not every wrapper (Ollama, LM Studio, custom vLLM configs) exposes or enables them by default.
What "Convincing Itself It's a Different Harness" Tells You
A sharper failure mode than looping: a quantized model losing track of which agent harness it's running inside, and calling tools that don't exist in that environment. One user described an 8-bit, non-MoE 26B model running in the Pi harness that "somehow convinced itself it was running in Claude instead and started trying to call Claude tools that didn't exist."
This is a tool-calling format fidelity issue, and it gets worse the longer the context runs β which is the same compounding-error mechanism behind looping, just manifesting as hallucinated tool syntax instead of repeated text. If you're building agent workflows on local models, treat long-horizon, many-tool-call sessions as the stress test, not simple chat.
Choosing a Harness for Local Models
There's no consensus winner yet, but the thread converged on a rough shortlist:
Harness
Reported strength
Pi
Extensibility β favored for custom tool integration and MCP-style setups
Vibe
Clean UI, "good enough" for most agentic tasks
OpenCode
Straightforward, works well paired with Qwen 3.6 27B
Zerostack
Minimal β least abstraction, fewest surprises
Dirac
Feature-rich but reported as buggier than the alternatives
Whichever harness you pick, the sampler configuration and eval loop matter more than the harness UI β a well-tuned Qwen 3.6 27B in a bare-bones harness will outperform a poorly-sampled GLM-5.2 REAP quant in a polished one.
Sandboxing Agents With Real Filesystem Access
Running a local model with tool-calling access to your actual filesystem and shell raises the same question the industry keeps re-learning: what's the actual security boundary, and does it hold up if the agent (or a prompt injected into its context) tries to escape it?
The thread laid out three tiers, roughly in order of isolation strength:
bubblewrap β namespace-based sandboxing, lightweight, but requires manually bootstrapping each dev tool's config dirs, package caches, and mount points. Workable for constrained agents, more fragile for "give it a full dev environment" setups.
microVMs (libkrun, crun-vm) β a real hardware-virtualization security boundary with much less setup friction than bubblewrap for a full filesystem; you can hand the agent a whole disk image and run in permissive mode without hand-tuning namespace mounts.
Full VMs (QEMU/libvirt) β the most battle-tested and configurable option, standard for GPU passthrough scenarios, at the cost of more resource overhead.
One nuance worth internalizing: you don't need to give the sandboxed VM GPU passthrough. Run the inference server (llama.cpp/vLLM) on the host, and only put the agent/harness β the part with shell and filesystem access β inside the VM or microVM, connecting to the inference server over the local network as an OpenAI-compatible API. That keeps your GPU driver stack out of the trust boundary and reduces what the sandbox actually needs to virtualize. It also means a compromised sandbox can't touch host secrets, session data, or auth keys, provided the harness process itself stays on the host and only tool execution is delegated to the VM.
If you're just running a chat interface with no autonomous shell access, none of this applies β the risk is specific to giving a model unattended tool-calling and filesystem permissions, which is the entire point of running agentic coding harnesses like the ones covered in explainx.ai's agent skills security coverage.
Local Speech-to-Text: Whisper Isn't the Default Answer Anymore
If your local stack includes voice input (dictation, agent handoff, meeting transcription), Whisper large-v3 is still workable but no longer the automatic pick:
Faster, lower word-error-rate for non-streaming transcription; needs ONNX/Nemotron tooling on non-CUDA hardware
Voxstral
Larger
Reported as even more accurate than Parakeet, at a bigger resource footprint
For most local setups, Parakeet v3 is the pragmatic default now β it fits comfortably alongside a coding model on a single consumer GPU (under 600MB VRAM in some configs) and doesn't meaningfully compromise transcription quality.
The Budget Tier Nobody's GPU Envy Mentions
Not everyone needs 384GB of VRAM. The HN thread's most-repeated budget recommendation was 2Γ RTX 3090 (48GB total, ~$1,500-2,000 on the used market) running Qwen 3.6 27B at 4-bit β with real numbers from someone who benchmarked both a 2Γ3090 rig and an M3 MacBook Pro on the same model:
Qwen 3.6 27B (int4) on 2Γ RTX 3090: 68 tok/s at concurrency 1, up to 363 tok/s at concurrency 32, ~1,520 tok/s prompt processing.
Qwen 3.6 27B (int4) on M3 MacBook Pro (36GB): 18 tok/s at concurrency 1 β under a third of the GPU rig's speed.
The gap comes down to memory bandwidth: dual 3090s deliver roughly 1.87 TB/s combined versus an M-series MacBook's 0.3-0.63 TB/s depending on config, and token generation is bandwidth-bound. explainx.ai's existing MacBook vs. dedicated GPU comparison covers the general trade-off; this is the concurrency data point that makes it concrete for agentic workloads specifically, where you're often running multiple parallel tool calls rather than a single chat stream.
If you already own an M-series Mac, it remains a fine way to get started with local models β just don't expect it to match a dedicated GPU rig once you're running agents at any real concurrency.
This article reflects sampler names, harness comparisons, and hardware pricing discussed as of July 4, 2026. llama.cpp sampler defaults and harness feature sets change frequently β check the project docs before relying on specific flags.