What is the GPT-5.5 Codex "516 bug"?

Developers analyzing Codex's own token_count telemetry found that GPT-5.5 responses disproportionately stop reasoning at exactly 516 tokens, with secondary spikes at 1034, 1552, 2070, 2588, and 3106, each exactly 518 tokens after the last. Runs that land on these values are far more likely to produce wrong answers than runs with organically varying reasoning-token counts, which is what makes the pattern look like a bug rather than normal variation.

Is this confirmed to be an intentional change by OpenAI?

No. As of publication, OpenAI has not issued a public technical explanation. The GitHub issue (openai/codex #30364) asks the Codex team to investigate whether it's a reasoning-budget cap, a routing/fallback behavior, or a batching artifact. Commenters proposed theories — parallel reasoning generation in fixed-size chunks, a thresholded budget, prompt-cache-ratio correlation — but none is confirmed.

Which models are affected?

The clustering is heavily concentrated in GPT-5.5. Multiple independent local-telemetry analyses show it roughly 15-45x more likely to land exactly on these threshold values than GPT-5.4, GPT-5.3-codex, or GPT-5.2, and one reproduction measured a gap of about 180x against gpt-5.3-codex. Several commenters report the pattern is much rarer or absent entirely in earlier Codex-branded models.

Is there a workaround?

One HN commenter reported that prefacing the prompt with "Use maximum effort as principal software architect to think and reason for at least 60s" caused a previously-failing test prompt to pass consistently — though this is an anecdotal single-prompt result, not a verified fix. Switching to gpt-5.3-codex or gpt-5.4 for high-stakes tasks, or cross-checking suspicious short-reasoning answers with a second model, are more reliable mitigations until OpenAI responds.

GPT-5.5 Codex 516-Token Bug: Evidence and Theories Explained | explainx.ai Blog

How can I check if I'm affected?

Scan your local Codex session logs (~/.codex/sessions/**/*.jsonl and ~/.codex/archived_sessions/) for event_msg.payload.type == "token_count" events, and look at payload.info.last_token_usage.reasoning_output_tokens. If a meaningful share of your GPT-5.5 responses land on exactly 516, 1034, or 1552, you're seeing the same pattern reported in the GitHub issue.


If GPT-5.5 has felt inconsistent in Codex lately (brilliant on one run, confidently wrong on the next identical prompt), there's now a specific, numeric pattern behind at least some of that feeling. A GitHub issue opened against openai/codex documents GPT-5.5 responses disproportionately terminating their reasoning at exactly 516 tokens, with secondary spikes at 1034, 1552, 2070, 2588, and 3106, each exactly 518 tokens after the last. In the small repeated-run tests reported so far, those exact-value runs were wrong more often than runs with other token counts.

This isn't a single anecdote. By the time the discussion reached Hacker News, at least six developers had independently mined their own local Codex telemetry and found the same comb-like pattern, with dataset sizes ranging from tens of thousands to over 200,000 token records each.

The Original Report

The GitHub issue, filed by user vguptaa45, analyzed Codex's token_count telemetry across a February–June 2026 window:

Metric	Value
Response-level token records analyzed	390,195
Sessions represented	865
Exact `reasoning_output_tokens = 516` events	3,363
GPT-5.5 share of all responses	19.3%
GPT-5.5 share of exact-516 events	82.0%
GPT-5.5 exact-516 / ≥516 ratio	44.0%
Non-GPT-5.5 exact-516 / ≥516 ratio	1.3%

GPT-5.5 accounted for only about a fifth of all analyzed responses but over 80% of the exact-516 events. Among responses that reached at least 516 reasoning tokens, 44.0% of GPT-5.5's stopped at exactly 516, against 1.3% for every other model combined, a ratio of roughly 34x. The report also found that overall reasoning-token intensity had been falling month over month (mean reasoning tokens dropped from 268 in February to 107 in May) at the same time exact-516 clustering was rising sharply (from 0.11% in February to over 53% in May). That is the opposite of what you'd expect if GPT-5.5 were simply reasoning less on easier tasks.
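The headline multiplier depends on which rows of the table you divide. The short calculation below is a sketch that uses only the rounded figures from the table; the variable names are ours, and the per-response figure is a derived estimate, not a number the issue reports.

```python
# Figures copied from the table in the issue (percentages are rounded).
total_records = 390_195
exact_516_events = 3_363
gpt55_share_of_all = 0.193       # GPT-5.5 share of all responses
gpt55_share_of_exact = 0.820     # GPT-5.5 share of exact-516 events
gpt55_exact_over_ge516 = 0.440   # GPT-5.5 exact-516 / >=516 ratio
other_exact_over_ge516 = 0.013   # non-GPT-5.5 exact-516 / >=516 ratio

# Comparison 1: exact-516 as a share of responses that reached >= 516 tokens.
ratio_lift = gpt55_exact_over_ge516 / other_exact_over_ge516

# Comparison 2: exact-516 events per response, GPT-5.5 vs. everything else.
gpt55_rate = (exact_516_events * gpt55_share_of_exact) / (
    total_records * gpt55_share_of_all
)
other_rate = (exact_516_events * (1 - gpt55_share_of_exact)) / (
    total_records * (1 - gpt55_share_of_all)
)
per_response_lift = gpt55_rate / other_rate

print(f"exact-516 / >=516 lift: {ratio_lift:.1f}x")        # 33.8x
print(f"per-response lift:      {per_response_lift:.1f}x")  # 19.0x
```

Both views point the same way. The second is smaller because it counts every response in the denominator, including the many that never reach 516 reasoning tokens.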

The issue explicitly avoided overclaiming: "I am not claiming this proves hidden chain-of-thought truncation. The narrower claim is that Codex telemetry shows a GPT-5.5-specific fixed-token clustering anomaly that looks consistent with thresholded reasoning-budget behavior."

Independent Reproductions

What makes this credible rather than a single skewed dataset is how many people reproduced it on their own logs, with different tooling and different time windows:

One developer scanning 1,615 JSONL files (204,959 token records) found GPT-5.5/xhigh hitting exact-516 at a 31.9% rate versus a much smaller share for gpt-5.3-codex-spark in the same window, and further broke the pattern down by cache-ratio bucket — finding the exact-516 concentration was actually highest in lower cached-input-token bins, not simply an artifact of prompt caching.
Another, scanning 4,487 files (180,559 records) across February–July, found GPT-5.5's exact-516 rate at 50.1% of all responses reaching ≥516 tokens, against just 0.28% for gpt-5.3-codex in the same dataset — a roughly 180x difference between models on the exact same local machine and account.
On a smaller, GPT-5.5-only sample (57,813 records), one commenter found the exact-516 rate climbing from 26.7% in May to 48.1% by early July, suggesting the pattern, whatever its cause, has been getting more pronounced over time rather than resolving on its own.
A community reproduction using the CLI directly ran the same logic-puzzle prompt five times and got reasoning-token counts of 24, 27, 12, 21, and 21, none of them near a threshold. In a separate small-sample test, 4 of 10 identical runs hit exactly 516 reasoning tokens, and all four were wrong.
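The cache-ratio breakdown from the first reproduction can be approximated with a small bucketing helper. This is a sketch, not that developer's script: it assumes each usage record exposes `input_tokens` and `cached_input_tokens` alongside `reasoning_output_tokens` (only the last field name appears in the issue), and the four-bucket split is an arbitrary choice.

```python
from collections import defaultdict


def cache_bucket(usage, buckets=4):
    """Return the cached-input ratio bucket index (0 = least cached), or None."""
    total = usage.get("input_tokens", 0)      # assumed field name
    cached = usage.get("cached_input_tokens", 0)  # assumed field name
    if total <= 0:
        return None
    ratio = min(cached / total, 1.0)
    # Clamp so a fully cached prompt (ratio 1.0) lands in the top bucket.
    return min(int(ratio * buckets), buckets - 1)


def exact_516_rate_by_bucket(usages, buckets=4):
    """Share of responses per bucket whose reasoning count is exactly 516."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for usage in usages:
        bucket = cache_bucket(usage, buckets)
        if bucket is None:
            continue
        totals[bucket] += 1
        if usage.get("reasoning_output_tokens") == 516:
            hits[bucket] += 1
    return {bucket: hits[bucket] / totals[bucket] for bucket in sorted(totals)}
```

If the exact-516 rate stays high in bucket 0, where little of the prompt was cached, that would match the reported finding that caching alone does not explain the spike.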

What's Actually Being Debated

Three open questions define the discussion, and none of them has a confirmed answer as of publication:

Is this a budget cap, a batching artifact, or a scheduler behavior? One theory, floated on Hacker News, is that Codex might be generating reasoning in fixed-size chunks (roughly 512-token increments) for a throughput or parallelization gain. That would explain the evenly spaced thresholds (516, 1034, and 1552 are each exactly 518 apart) without implying an intentional quality cut. Another theory ties it to prompt-cache ratio, though the data on that specific correlation is mixed across the independent reproductions.
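The spacing claim is easy to verify from the reported values alone. The snippet below checks the gaps and offers a small helper (our own naming) that tests whether a token count sits on the 516 + 518k comb.

```python
# Spike locations as reported in the GitHub issue.
REPORTED = [516, 1034, 1552, 2070, 2588, 3106]

# Differences between consecutive spikes.
gaps = [later - earlier for earlier, later in zip(REPORTED, REPORTED[1:])]


def comb_index(tokens, first=516, step=518):
    """Return k if tokens == first + k * step for some k >= 0, else None."""
    if tokens < first:
        return None
    k, remainder = divmod(tokens - first, step)
    return k if remainder == 0 else None


print(gaps)              # [518, 518, 518, 518, 518]
print(comb_index(1552))  # 2
print(comb_index(24))    # None
```

Every gap is 518, which is 6 more than 512, and the first spike is 4 more than 512. Those offsets are an observation about the numbers, not evidence for any particular theory.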

Is it new, or has it always been there and only recently got measured? The monthly breakdowns in multiple reports show the clustering was near-zero in February 2026 and climbed sharply from March onward — which argues against "this is just how reasoning models have always worked" and toward "something changed."
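A monthly breakdown like the ones cited can be computed from (timestamp, reasoning-token) pairs. This sketch assumes you have already extracted an ISO 8601 timestamp for each token_count event (the timestamp's location in the log is not described in the issue), and it uses the exact-516 / ≥516 ratio from the original report's table as the metric.

```python
from collections import defaultdict
from datetime import datetime


def monthly_exact_516_ratio(events):
    """Map 'YYYY-MM' to exact-516 count divided by count of responses >= 516.

    `events` is an iterable of (iso_timestamp, reasoning_output_tokens).
    Months with no responses reaching 516 tokens are omitted.
    """
    reached = defaultdict(int)
    exact = defaultdict(int)
    for stamp, tokens in events:
        # Accept a trailing "Z" on Python versions that do not parse it.
        when = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        month = when.strftime("%Y-%m")
        if tokens >= 516:
            reached[month] += 1
            if tokens == 516:
                exact[month] += 1
    return {month: exact[month] / reached[month] for month in sorted(reached)}
```

A ratio near zero in early months that climbs later would reproduce the "something changed" shape described above. A flat ratio across months would argue against it.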

Does hitting exactly 516 reasoning tokens cause wrong answers, or correlate with them? The strongest evidence here is the repeated-run comparison: identical prompts run multiple times, where runs landing on the suspicious exact values were wrong far more often than runs with organically varying token counts. That's a correlation from a small sample, not a controlled causal test, but it's the specific claim that turned this from "the model feels inconsistent" into a reproducible, falsifiable bug report.
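If you rerun a prompt yourself, a tally like the following keeps the comparison honest by separating on-threshold runs from the rest. It is a sketch with hypothetical names: you supply the reasoning-token count and your own correctness judgment for each run.

```python
# Reported spike values: 516, then steps of 518.
THRESHOLDS = frozenset(516 + 518 * k for k in range(6))


def tally_runs(runs):
    """Count wrong answers for on-threshold and off-threshold runs.

    `runs` is an iterable of (reasoning_tokens, is_correct) pairs.
    Returns {"on_threshold": (wrong, total), "off_threshold": (wrong, total)}.
    """
    wrong = {"on_threshold": 0, "off_threshold": 0}
    total = {"on_threshold": 0, "off_threshold": 0}
    for tokens, is_correct in runs:
        group = "on_threshold" if tokens in THRESHOLDS else "off_threshold"
        total[group] += 1
        if not is_correct:
            wrong[group] += 1
    return {group: (wrong[group], total[group]) for group in total}
```

The reported test corresponds to an on-threshold tally of 4 wrong out of 4. The outcomes of the other six runs are not stated in the source, so the off-threshold side of that comparison is unknown, and with samples this small the tally describes a correlation only.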

How to Check Your Own Logs

If you use Codex CLI or the Codex Desktop app, you can run the same check the GitHub issue's commenters did. Codex writes session telemetry to JSONL files under ~/.codex/sessions/ and ~/.codex/archived_sessions/. Each token_count event includes payload.info.last_token_usage.reasoning_output_tokens — scan for how often that value lands on exactly 516, 1034, or 1552 versus varying naturally, and split by model and effort level (the pattern is reported as strongest at xhigh effort).
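A minimal scanner for that check might look like the following. The event type and the `payload.info.last_token_usage.reasoning_output_tokens` path come from the issue; how the model name is recorded does not, so this sketch assumes the model is whatever the most recent record with a `payload.model` string says, and reports "unknown" otherwise. Effort level is left out for the same reason.

```python
import json
from collections import Counter
from pathlib import Path

# Reported spike values: 516, then steps of 518.
THRESHOLDS = frozenset(516 + 518 * k for k in range(6))


def reasoning_counts(lines):
    """Yield (model, reasoning_output_tokens) for each token_count event."""
    model = "unknown"
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or corrupt lines
        payload = record.get("payload") if isinstance(record, dict) else None
        if not isinstance(payload, dict):
            continue
        # Assumption: the active model appears as payload["model"] on some record.
        if isinstance(payload.get("model"), str):
            model = payload["model"]
        if record.get("type") != "event_msg" or payload.get("type") != "token_count":
            continue
        usage = (payload.get("info") or {}).get("last_token_usage") or {}
        tokens = usage.get("reasoning_output_tokens")
        if isinstance(tokens, int):
            yield model, tokens


def summarize(pairs):
    """Return {model: Counter with total, at_least_516, on_threshold}."""
    stats = {}
    for model, tokens in pairs:
        counter = stats.setdefault(model, Counter())
        counter["total"] += 1
        if tokens >= 516:
            counter["at_least_516"] += 1
        if tokens in THRESHOLDS:
            counter["on_threshold"] += 1
    return stats


def scan(root=None):
    """Aggregate stats over sessions/ and archived_sessions/ under ~/.codex."""
    root = Path(root) if root is not None else Path.home() / ".codex"
    totals = {}
    for sub in ("sessions", "archived_sessions"):
        base = root / sub
        if not base.is_dir():
            continue
        for path in base.rglob("*.jsonl"):
            with path.open(encoding="utf-8", errors="replace") as handle:
                for model, counter in summarize(reasoning_counts(handle)).items():
                    totals.setdefault(model, Counter()).update(counter)
    return totals
```

Calling `scan()` returns per-model counters. Dividing `on_threshold` by `at_least_516` gives a figure comparable to the ratio in the original report's table.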

Reported mitigations, none independently verified at scale:

Prefacing prompts with an explicit reasoning-time instruction. One commenter reported that adding "Use maximum effort as principal software architect to think and reason for at least 60s" to the start of a previously-failing prompt made it pass consistently across retries — but this is a single anecdotal test case, not a benchmarked fix.
Falling back to gpt-5.3-codex or gpt-5.4 for high-stakes tasks where a wrong answer is costly, since both showed dramatically lower exact-516 rates across every independent reproduction.
Cross-checking suspiciously short or unusually confident GPT-5.5 answers with a second model. The same model-routing pattern developers already use to route work by cost, intelligence, and taste applies directly here: send anything unsupervised and high-stakes for a second opinion rather than trusting a single model's output blindly, especially while a known-but-unexplained anomaly is active.
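The cross-checking rule above can be written as a small predicate. This is one possible reading of the advice, not a tested policy: the function name, its parameters, and the "gpt-5.5" model-name prefix are all assumptions made for illustration.

```python
# Reported spike values: 516, then steps of 518.
THRESHOLDS = frozenset(516 + 518 * k for k in range(6))


def needs_second_opinion(model, reasoning_tokens, high_stakes, supervised):
    """Return True when an answer should be re-checked by another model."""
    # Unsupervised, high-stakes work always gets a second look.
    if high_stakes and not supervised:
        return True
    # Otherwise flag only GPT-5.5 answers whose reasoning count hit a spike.
    return model.startswith("gpt-5.5") and reasoning_tokens in THRESHOLDS


print(needs_second_opinion("gpt-5.5", 516, high_stakes=False, supervised=True))  # True
print(needs_second_opinion("gpt-5.5", 301, high_stakes=False, supervised=True))  # False
```

The threshold check is cheap because the reasoning-token count is already in the session log, so it can run after every response without extra API calls.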

Why This Matters Beyond Codex

This isn't the first time a coding agent's users have suspected a "quiet" quality regression — Claude Code went through a similar public debate in prior months, and the same "is it real or is it vibes" argument played out on Hacker News both times. What's different here is that this report has a specific, checkable signature: an exact numeric value, a reproducible telemetry field, and multiple independent datasets landing on the same number. That's a meaningfully higher bar of evidence than "it feels dumber lately," and it's part of why OpenAI is likely to face pressure to respond with a technical explanation rather than a generic "we haven't detected any regression."

If you're deciding between Codex and Claude Code for a given task right now, our Claude Code vs. Codex comparison and the rate-limit boost coverage are useful companions — both tools have had their own credibility moments in 2026, and neither is immune to this kind of underlying-infrastructure uncertainty.

