A GitHub issue analyzing over 390,000 token records shows GPT-5.5 in Codex disproportionately stopping reasoning at exactly 516 tokens, and those runs are far more likely to be wrong. Here's the evidence, the theories, and how to check if it's affecting you.
If GPT-5.5 has felt inconsistent in Codex lately (brilliant on one run, confidently wrong on the next identical prompt), there's now a specific, numeric pattern behind at least some of that feeling. A GitHub issue opened against openai/codex documents GPT-5.5 responses disproportionately terminating their reasoning at exactly 516 tokens, with secondary spikes at 1034, 1552, 2070, 2588, and 3106, each 518 tokens apart, and those exact-value runs are measurably more likely to be wrong.
This isn't a single anecdote. By the time the discussion reached Hacker News, at least six developers had independently mined their own local Codex telemetry and found the same comb-like pattern, with dataset sizes ranging from tens of thousands to over 200,000 token records each.
The Original Report
The GitHub issue, filed by user vguptaa45, analyzed Codex's token_count telemetry across a February to June 2026 window:
| Metric | Value |
| --- | --- |
| Response-level token records analyzed | 390,195 |
| Sessions represented | 865 |
| Exact reasoning_output_tokens = 516 events | 3,363 |
| GPT-5.5 share of all responses | 19.3% |
| GPT-5.5 share of exact-516 events | 82.0% |
| GPT-5.5 exact-516 / ≥516 ratio | 44.0% |
| Non-GPT-5.5 exact-516 / ≥516 ratio | 1.3% |
GPT-5.5 accounted for only about a fifth of all analyzed responses but over 80% of the exact-516 events. Among responses that reached at least 516 reasoning tokens, it landed on exactly 516 at a rate roughly 34x higher than every other model combined (44.0% versus 1.3%). The report also found that overall reasoning-token intensity had been falling month over month (mean reasoning tokens dropped from 268 in February to 107 in May) at the same time exact-516 clustering was rising sharply (from 0.11% in February to over 53% in May). That is the opposite of what you'd expect if GPT-5.5 were simply reasoning less on easier tasks.
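The headline multiples can be re-derived from the table in a few lines. This is only a sketch of the arithmetic, using the percentages reported in the issue; the variable names are illustrative and not taken from the report.

```python
# Percentages copied from the table in the original report.
gpt55_exact_ratio = 44.0   # GPT-5.5: exact-516 responses / responses with >=516 reasoning tokens
other_exact_ratio = 1.3    # all other models combined, same ratio

relative_rate = gpt55_exact_ratio / other_exact_ratio
print(f"Exact-516 rate, GPT-5.5 vs. others: about {relative_rate:.0f}x")

# Over-representation: share of exact-516 events vs. share of all responses.
share_of_events = 82.0
share_of_responses = 19.3
over_representation = share_of_events / share_of_responses
print(f"Over-representation among exact-516 events: about {over_representation:.1f}x")
```

The two numbers measure different things: the first compares how often each group lands on exactly 516 once it reaches that length, while the second compares GPT-5.5's share of the anomaly to its share of overall traffic.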
The issue explicitly avoided overclaiming: "I am not claiming this proves hidden chain-of-thought truncation. The narrower claim is that Codex telemetry shows a GPT-5.5-specific fixed-token clustering anomaly that looks consistent with thresholded reasoning-budget behavior."
Independent Reproductions
What makes this credible rather than a single skewed dataset is how many people reproduced it on their own logs, with different tooling and different time windows:
One developer scanning 1,615 JSONL files (204,959 token records) found GPT-5.5/xhigh hitting exact-516 at a 31.9% rate versus a much smaller share for gpt-5.3-codex-spark in the same window, and further broke the pattern down by cache-ratio bucket, finding the exact-516 concentration was actually highest in lower cached-input-token bins, not simply an artifact of prompt caching.
Another, scanning 4,487 files (180,559 records) across February to July, found GPT-5.5's exact-516 rate at 50.1% of all responses reaching ≥516 tokens, against just 0.28% for gpt-5.3-codex in the same dataset: a roughly 180x difference between models on the exact same local machine and account.
On a smaller, single-model-affected sample (all GPT-5.5, 57,813 records), one commenter found the exact-516 rate climbing from 26.7% in May to 48.1% by early July, suggesting the pattern, whatever its cause, has been getting more pronounced over time, not resolving on its own.
A community reproduction using the CLI directly ran the same logic puzzle prompt five times and got reasoning-token counts of 24, 27, 12, 21, and 21, none of them near a threshold. In a separate small-sample test, 4 of 10 identical runs hit exactly 516 reasoning tokens, and all four were wrong.
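The cache-ratio breakdown mentioned in the first reproduction can be approximated as follows. This is a rough sketch, not the commenter's actual script: it takes plain tuples rather than raw log records, the tuple layout and the function name are hypothetical, and restricting the denominator to responses with at least 516 reasoning tokens is an assumption borrowed from the ratio used in the original report.

```python
from collections import defaultdict

def exact_516_rate_by_cache_bucket(records, buckets=4):
    """Group responses by cached-input ratio and report the exact-516 rate per bucket.

    records: iterable of (cached_input_tokens, input_tokens, reasoning_tokens)
    tuples. This layout is an assumption made for illustration.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for cached, total_input, reasoning in records:
        # Skip records with no input tokens or fewer than 516 reasoning tokens.
        if total_input <= 0 or reasoning < 516:
            continue
        ratio = cached / total_input
        # Map ratio in [0, 1] to a bucket index; a ratio of 1.0 goes in the top bucket.
        bucket = min(int(ratio * buckets), buckets - 1)
        totals[bucket] += 1
        if reasoning == 516:
            hits[bucket] += 1
    return {b: hits[b] / totals[b] for b in sorted(totals)}

sample = [(0, 100, 516), (10, 100, 600), (90, 100, 516), (95, 100, 700)]
print(exact_516_rate_by_cache_bucket(sample))
```

If the exact-516 rate were purely a caching artifact, it should rise with the bucket index; the reproduction above reported the concentration was highest in the lower bins.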
What's Actually Being Debated
Three open questions define the discussion, and none of them has a confirmed answer as of publication:
Is this a budget cap, a batching artifact, or a scheduler behavior? One theory, floated on Hacker News, is that Codex might be generating reasoning in fixed-size chunks (roughly 512-token increments) for a throughput or parallelization gain, which would explain the evenly spaced thresholds (516, 1034, and 1552 are each 518 apart) without implying an intentional quality cut. Another theory ties it to prompt-cache ratio, though the data on that specific correlation is mixed across the independent reproductions.
Is it new, or has it always been there and only recently been measured? The monthly breakdowns in multiple reports show the clustering was near-zero in February 2026 and climbed sharply from March onward, which argues against "this is just how reasoning models have always worked" and toward "something changed."
Does hitting exactly 516 reasoning tokens cause wrong answers, or correlate with them? The strongest evidence here is the direct A/B: identical prompts run multiple times, where runs landing on the suspicious exact values were wrong far more often than runs with organically varying token counts. That's a correlation from a small sample, not a controlled causal test, but it's the specific claim that turned this from "the model feels inconsistent" into a reproducible, falsifiable bug report.
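The evenly spaced thresholds in the first question can be expressed as a simple membership check. A minimal sketch, assuming the comb continues as 516 + 518k; the source only reports spikes up to 3106, so anything beyond that is an extrapolation, and the function name is illustrative.

```python
from typing import Optional

BASE = 516   # first reported spike
STEP = 518   # spacing between the reported spikes

def comb_index(reasoning_tokens: int) -> Optional[int]:
    """Return k if reasoning_tokens == BASE + k * STEP for some k >= 0, else None."""
    offset = reasoning_tokens - BASE
    if offset >= 0 and offset % STEP == 0:
        return offset // STEP
    return None

reported = [516, 1034, 1552, 2070, 2588, 3106]
print([comb_index(v) for v in reported])  # [0, 1, 2, 3, 4, 5]
print(comb_index(517))                    # None
```

A check like this only flags values that sit on the comb; it says nothing about why they land there, which is the open question.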
How to Check Your Own Logs
If you use Codex CLI or the Codex Desktop app, you can run the same check the GitHub issue's commenters did. Codex writes session telemetry to JSONL files under ~/.codex/sessions/ and ~/.codex/archived_sessions/. Each token_count event includes payload.info.last_token_usage.reasoning_output_tokens. Scan for how often that value lands on exactly 516, 1034, or 1552 versus varying naturally, and split by model and effort level (the pattern is reported as strongest at xhigh effort).
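One way to run that scan is sketched below in Python. The directories and the payload.info.last_token_usage.reasoning_output_tokens path come from the description above; everything else is an assumption, including the *.jsonl file extension, the one-JSON-object-per-line layout, and the function names. It is not the script the issue's commenters used.

```python
import json
from collections import Counter
from pathlib import Path

SUSPECT_VALUES = {516, 1034, 1552}
FIELD_PATH = ("payload", "info", "last_token_usage", "reasoning_output_tokens")

def reasoning_tokens(line):
    """Extract reasoning_output_tokens from one JSONL line, or None if absent."""
    try:
        node = json.loads(line)
    except json.JSONDecodeError:
        return None
    for key in FIELD_PATH:
        if not isinstance(node, dict):
            return None
        node = node.get(key)
    return node if isinstance(node, int) else None

def summarize(lines):
    """Count records overall, records at or above 516, and hits on each suspect value."""
    counts = Counter()
    for line in lines:
        value = reasoning_tokens(line)
        if value is None:
            continue
        counts["total"] += 1
        if value >= 516:
            counts["at_least_516"] += 1
        if value in SUSPECT_VALUES:
            counts[value] += 1
    return counts

def scan(root):
    """Aggregate counts over every .jsonl file under root (assumed extension)."""
    counts = Counter()
    root = Path(root).expanduser()
    if not root.is_dir():
        return counts
    for path in root.rglob("*.jsonl"):
        with path.open(encoding="utf-8", errors="replace") as handle:
            counts += summarize(handle)
    return counts

if __name__ == "__main__":
    for directory in ("~/.codex/sessions", "~/.codex/archived_sessions"):
        print(directory, dict(scan(directory)))
```

Dividing the count for 516 by at_least_516 gives the same exact-516 / ≥516 ratio used in the report. The sketch does not split by model or effort level, because the description above does not say where those fields live in each record; add that grouping once you have confirmed the field names in your own logs.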
Reported mitigations, none independently verified at scale:
Prefacing prompts with an explicit reasoning-time instruction. One commenter reported that adding "Use maximum effort as principal software architect to think and reason for at least 60s" to the start of a previously-failing prompt made it pass consistently across retries, but this is a single anecdotal test case, not a benchmarked fix.
Falling back to gpt-5.3-codex or gpt-5.4 for high-stakes tasks where a wrong answer is costly, since both showed dramatically lower exact-516 rates across every independent reproduction.
Cross-checking suspiciously short or unusually confident GPT-5.5 answers with a second model. The same model-routing pattern developers already use to route work by cost, intelligence, and taste applies directly here: route anything unsupervised and high-stakes to a second opinion rather than trusting a single model's output blindly, especially while a known-but-unexplained anomaly is active.
Why This Matters Beyond Codex
This isn't the first time a coding agent's users have suspected a "quiet" quality regression. Claude Code went through a similar public debate in prior months, and the same "is it real or is it vibes" argument played out on Hacker News both times. What's different here is that this report has a specific, checkable signature: an exact numeric value, a reproducible telemetry field, and multiple independent datasets landing on the same number. That's a meaningfully higher bar of evidence than "it feels dumber lately," and it's part of why OpenAI is likely to face pressure to respond with a technical explanation rather than a generic "we haven't detected any regression."
If you're deciding between Codex and Claude Code for a given task right now, our Claude Code vs. Codex comparison and the rate-limit boost coverage are useful companions. Both tools have had their own credibility moments in 2026, and neither is immune to this kind of underlying-infrastructure uncertainty.
Figures and reproductions cited here reflect the GitHub issue and Hacker News discussion as of July 5, 2026. OpenAI had not published an official technical response as of this writing; check the issue thread for updates before drawing firm conclusions.