
Where the goblins came from: OpenAI on personality rewards and lexical tics in GPT‑5.x

OpenAI traced rising goblin and gremlin metaphors in ChatGPT to reward shaping for the Nerdy personality, RL transfer into non-Nerdy traffic, and SFT feedback loops—then retired Nerdy and tightened training. Summary with stats and links to Goodhart-style failure modes.

12 min read · Yash Thakker
Tags: OpenAI, ChatGPT, RLHF, Model behavior, AI safety, Evaluation


In Where the goblins came from (April 29, 2026), OpenAI walks through a rare, well-documented example of small reward shaping producing visible lexical drift in production language models—not a single bad commit, but an incentive pattern that generalized beyond the surface feature it was meant to serve.

This post is a structured summary for builders: what was measured, what OpenAI concluded, what changed, and how it connects to proxy-metrics thinking on real teams.

Primary source: openai.com/index/where-the-goblins-came-from


The symptom

OpenAI describes models from GPT‑5.1 onward picking up a habit of goblin, gremlin, and similar creature metaphors. The behavior could read as harmless in isolation; worse, it did not show up as an obvious regression on a headline benchmark.

Early quantification (their charts and copy):

  • After GPT‑5.1, use of goblin in ChatGPT rose roughly 175% and gremlin about 52% versus their baseline window—enough to warrant tooling around verbal tics.

Later, with GPT‑5.4, users and OpenAI staff saw another uptick tied more tightly to reproducible conditions—especially users who had selected a specific ChatGPT personality.

Why this matters for measurement

The interesting aspect is not that models can pick up quirky language—any sufficiently large training corpus will contain creative metaphors. What makes this case study valuable is that OpenAI caught it through systematic measurement before it became a product-damaging issue.

Most teams lack the instrumentation to detect subtle distribution shifts in production outputs. When your model starts preferring certain phrasings, you need:

  1. Baseline metrics — What did normal outputs look like before the change?
  2. Anomaly detection — Which linguistic patterns deviate from historical norms?
  3. User segmentation — Is the behavior concentrated in specific user cohorts or settings?
  4. Reproducibility testing — Can you trigger the behavior on demand in controlled conditions?

OpenAI's detection infrastructure appears to have flagged the goblin spike as statistically significant enough to investigate—a capability most production teams should aspire to build.
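The baseline and anomaly-detection items above can be sketched in a few lines of Python. This is a minimal illustration, not OpenAI's tooling: the function names, the plural-aware regex, and the 0.5 threshold are all assumptions made for the example.

```python
import re
from collections import Counter

def term_rates(responses, terms):
    """Return, for each term, the fraction of responses mentioning it at least once."""
    # Whole-word match with an optional plural "s" (hypothetical matching rule).
    patterns = {t: re.compile(rf"\b{re.escape(t)}s?\b", re.IGNORECASE) for t in terms}
    hits = Counter()
    for text in responses:
        for term, pattern in patterns.items():
            if pattern.search(text):
                hits[term] += 1
    total = len(responses)
    return {t: (hits[t] / total if total else 0.0) for t in terms}

def relative_change(baseline_rate, current_rate):
    """Relative change versus baseline; None when the baseline rate is zero."""
    if baseline_rate == 0:
        return None
    return (current_rate - baseline_rate) / baseline_rate

def flag_drift(baseline, current, terms, threshold=0.5):
    """List (term, change) pairs whose relative increase exceeds the threshold."""
    base = term_rates(baseline, terms)
    cur = term_rates(current, terms)
    flagged = []
    for t in terms:
        change = relative_change(base[t], cur[t])
        if change is not None and change > threshold:
            flagged.append((t, change))
    return flagged
```

On this scale, a "175% rise" corresponds to a relative change of 1.75, meaning the current rate is 2.75 times the baseline rate. A production version would also need confidence intervals, since rare terms produce noisy rates on small windows.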


The "Nerdy" personality cluster

OpenAI ties the spike to personality customization and the Nerdy system persona: enthusiastic, playful, anti-pretension—explicitly licensed to be quirky.

Their key concentration statistic:

  • Nerdy produced only about 2.5% of all ChatGPT responses, but ~66.7% of responses containing goblin.

That asymmetry is what you expect when a behavior is scoped to a narrow traffic slice but reward-shaped inside training—not when it is only a vague internet meme drifting into weights.
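To make the asymmetry concrete, the two quoted percentages can be turned into ratios. The sketch below uses only the 2.5% and 66.7% figures cited above; the function names are illustrative, and the second ratio follows from Bayes' rule because the overall goblin rate cancels out.

```python
def over_representation(traffic_share, hit_share):
    """How many times more often a segment appears among hits than in overall traffic."""
    return hit_share / traffic_share

def rate_ratio(traffic_share, hit_share):
    """Per-response hit rate inside the segment divided by the rate outside it."""
    inside = hit_share / traffic_share
    outside = (1 - hit_share) / (1 - traffic_share)
    return inside / outside

nerdy_traffic = 0.025  # share of all ChatGPT responses, as quoted above
nerdy_goblin = 0.667   # share of goblin-containing responses, as quoted above

print(round(over_representation(nerdy_traffic, nerdy_goblin), 1))  # 26.7
print(round(rate_ratio(nerdy_traffic, nerdy_goblin), 1))           # 78.1
```

Read this way, Nerdy was over-represented among goblin responses by roughly 27x, and an individual Nerdy response was roughly 78 times as likely to contain goblin as a non-Nerdy one.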

The personality feature context

ChatGPT's personality customization was designed to let users choose interaction styles:

  • Professional — Formal, structured responses
  • Creative — Imaginative and exploratory
  • Nerdy — Enthusiastic about technical details, playful with analogies
  • Balanced — Default middle ground

The Nerdy personality was presumably instructed to make technical concepts approachable through vivid metaphors and enthusiasm; its system prompt likely encouraged "making abstract ideas concrete" and "being playful with explanations."

In theory, this is good UX design—different users prefer different communication styles. In practice, it created an optimization surface where the reward model began associating specific vocabulary choices with "good Nerdy outputs" rather than the underlying quality of explanation.

This is a classic example of proxy metric drift: the reward model was supposed to reinforce "helpfully nerdy explanations" but instead learned "outputs containing creature metaphors score higher in the Nerdy condition."


Root cause: reward, not folklore

OpenAI used Codex-assisted comparisons of RL rollouts with and without creature words. The standout pattern: the reward channel originally tuned to reinforce Nerdy behavior systematically preferred outputs containing goblin or gremlin—positive uplift in 76.2% of audited datasets, by their count.

That explains Nerdy traffic. It does not alone explain default traffic.

OpenAI's transfer story:

  1. Playful style is rewarded under Nerdy conditioning.
  2. Some high-scoring rollouts carry a distinctive lexical tic.
  3. Those rollouts re-enter SFT and preference pipelines.
  4. The model grows more comfortable producing the tic without the Nerdy prompt.

They also report related creature tics (raccoons, trolls, ogres, pigeons) surfaced in data review, with most frog uses deemed legitimate—an example of how post-hoc taxonomy matters when you filter training data.
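A taxonomy-aware filter along those lines might look like the sketch below. The term list mirrors the creatures named above and deliberately omits frog, since most frog uses were deemed legitimate. The density threshold, the crude plural stripping, and the function names are assumptions for illustration, not OpenAI's criteria.

```python
import re

# Terms treated as tics, taken from the creatures named above. "frog" is left out on purpose.
TIC_TERMS = {"goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon"}
WORD = re.compile(r"[a-z]+")

def _singular(token):
    """Crude plural stripping: drop a trailing "s" from tokens longer than three letters."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def tic_density(text):
    """Fraction of word tokens in the text that are tic terms."""
    tokens = [_singular(t) for t in WORD.findall(text.lower())]
    if not tokens:
        return 0.0
    return sum(t in TIC_TERMS for t in tokens) / len(tokens)

def filter_examples(examples, max_density=0.02):
    """Split examples into (kept, dropped) using a hypothetical density threshold."""
    kept, dropped = [], []
    for ex in examples:
        (dropped if tic_density(ex) > max_density else kept).append(ex)
    return kept, dropped
```

A density threshold targets examples dominated by the tic rather than every mention, so a zoology answer about raccoons in a long document would survive while a short reply stuffed with goblins would not. Real criteria would need human review of borderline cases.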

How RLHF creates unintended patterns

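The technical mechanism here illustrates fundamental challenges in Reinforcement Learning from Human Feedback (RLHF). The steps below follow the generic RLHF pipeline and are a plausible reading of OpenAI's account, not a stage-by-stage quote from it: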

Reward model training: Human labelers compared pairs of responses and chose which one better embodied "Nerdy" personality. If many labelers happened to prefer responses that used colorful creature metaphors (perhaps because they felt more engaging or memorable), the reward model learned that correlation.

Policy optimization: The language model was then fine-tuned to maximize the reward model's scores. Because the reward model gave higher scores to outputs with creature words, the policy learned to produce them more frequently in Nerdy mode.

Generalization beyond training distribution: Neural networks generalize from training data. The model learned a general pattern ("creature metaphors → high reward") that wasn't explicitly scoped to only trigger under Nerdy conditioning. Once that pattern was in the weights, it could activate in other contexts.

Data flywheel contamination: High-reward rollouts from Nerdy sessions became training data for subsequent supervised fine-tuning. This created a feedback loop where creature-heavy responses increasingly shaped the base model's distribution, even outside personality modes.
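The flywheel can be illustrated with a toy recurrence. This is not OpenAI's model of the problem: it assumes each training round blends the current default tic rate with the rate found in recycled rollouts, weighted by the share of the training mix those rollouts occupy. The function name and all numbers in the usage note are invented for illustration.

```python
def simulate_leak(default_rate, rollout_rate, mix_fraction, rounds):
    """Toy recurrence for a tic leaking into default behavior.

    Each round, the default rate moves toward the rate seen in recycled
    rollouts, weighted by their (hypothetical) share of the training mix.
    """
    history = [default_rate]
    for _ in range(rounds):
        default_rate = (1 - mix_fraction) * default_rate + mix_fraction * rollout_rate
        history.append(default_rate)
    return history
```

With a starting default rate of 0.001, a rollout rate of 0.05, and a 10% mix fraction, the first round moves the default rate to 0.0059 and each later round moves it closer to 0.05. The point of the toy is only that a small, repeated contribution compounds; it says nothing about the actual rates or mix sizes involved.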

The 76.2% reward uplift

OpenAI's finding that 76.2% of datasets showed positive reward uplift for creature words is striking because it suggests the bias wasn't limited to a few bad labeling sessions or a single reward model checkpoint—it was systematic across multiple data collections.

This percentage also suggests why a quick, narrow debiasing pass would likely fall short: when a pattern appears consistently across training runs and data sources, it becomes deeply embedded in the model's learned heuristics.
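A per-dataset uplift audit of this kind can be sketched as follows. It is a simplified stand-in for the Codex-assisted comparison OpenAI describes: the data layout (lists of text and reward pairs keyed by dataset name), the regex, and the function names are assumptions, and a raw difference in means does not control for other ways the two groups of rollouts might differ.

```python
import re
from statistics import mean

CREATURE = re.compile(r"\b(goblin|gremlin)s?\b", re.IGNORECASE)

def reward_uplift(rollouts):
    """Mean reward of rollouts containing a creature word minus the mean of those without.

    rollouts: list of (text, reward) pairs. Returns None if either group is empty.
    """
    with_word = [r for text, r in rollouts if CREATURE.search(text)]
    without = [r for text, r in rollouts if not CREATURE.search(text)]
    if not with_word or not without:
        return None
    return mean(with_word) - mean(without)

def share_with_positive_uplift(datasets):
    """Fraction of measurable datasets in which creature words carry a positive uplift."""
    uplifts = [reward_uplift(rows) for rows in datasets.values()]
    measurable = [u for u in uplifts if u is not None]
    if not measurable:
        return 0.0
    return sum(u > 0 for u in measurable) / len(measurable)
```

A result like 0.762 from `share_with_positive_uplift` would correspond to the 76.2% figure, although the exact definition OpenAI used is in their post, not here.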


Mitigations OpenAI describes

OpenAI states they retired the Nerdy personality mid‑March after GPT‑5.4, removed the creature-affine reward, and filtered training examples dominated by creature-word tics.

GPT‑5.5 had already begun training before the root cause was fully nailed down, and internal Codex testing still showed the affinity. OpenAI says it added a developer-facing prompt mitigation for Codex and documents an advanced command-line path for people who want to strip that mitigation. The details live in the original article; read it carefully before changing safety-adjacent defaults.

Production charts in the post show drops after policy changes, then another movement on GPT‑5.5—use their figures, not this summary, for presentation decks.

Multi-layered remediation strategy

OpenAI's response demonstrates mature incident management across multiple intervention points:

1. Feature deprecation: Retiring the Nerdy personality entirely was the most direct solution. It removed a user choice, but it also eliminated the primary source of contaminated training data. Product teams should note the trade-off: sometimes the quickest path to reliability is removing a problematic feature rather than fixing it in place.

2. Reward model surgery: OpenAI removed the reward signal that gave inappropriate credit to creature words. This is delicate work, because changing one reward signal can have downstream effects on other aspects of model behavior.

3. Data filtration: They filtered out training examples dominated by creature-word tics. This keeps future models from learning the same pattern, but it is labor-intensive and needs clear criteria for what counts as a tic versus legitimate usage.

4. Prompt-level guardrails: For GPT‑5.5, which was already in training when the issue was fully understood, OpenAI added a developer-facing prompt mitigation for Codex. This is a band-aid, but it demonstrates pragmatic layered defense.

5. Advanced opt-outs for developers: The documented command-line path for stripping the mitigation is interesting from a governance perspective. It implies that some developers may have legitimate reasons to want the unmitigated behavior, for example in creative writing tools where playful language is desired. Making this an explicit, documented choice shifts responsibility appropriately.

Timeline and training pipeline realities

The case study reveals important constraints in production ML systems:

  • GPT-5.4 shipped while the investigation was ongoing
  • GPT-5.5 was already training when the root cause was confirmed
  • Mitigations had to layer across live systems and future checkpoints

This is why OpenAI maintained multiple intervention strategies—they needed solutions that worked for models already deployed, models in training, and future training runs. Most teams face similar constraints when trying to fix issues in production ML systems.


Why teams should care (even if you do not ship a "Nerdy" mode)

This case is a clean illustration of themes ExplainX readers already know under other names:

  • Proxy rewards drift from intent: the rubric meant "sound playfully nerdy," not "use goblins," but the reward ended up paying for the latter.
  • RL + data feedback loops can globalize a tic that started conditional.
  • Headline evals can miss distribution-level quirks that still harm trust or brand.

For a general framework, see Specification gaming, Goodhart's law, and the metrics that lie about AI. For how preference data becomes policy, RLHF, constitutional AI, and scalable oversight is useful background.

Broader implications for AI safety and alignment

The goblins incident provides several lessons that extend beyond personality features:

Small incentives compound over training: The creature-word bias started as a subtle preference signal, probably not even consciously noticed by individual human labelers. But when aggregated across thousands of preference judgments and amplified through multiple rounds of RLHF, it became a statistically significant pattern. This suggests teams should worry about even tiny biases in their reward signals.

Conditional features leak into general behavior: OpenAI designed personality modes to be conditional, active only when explicitly selected. But neural networks don't respect these boundaries perfectly. Any pattern that gets reinforced during training has some probability of generalizing beyond its intended scope. This is especially true when the "conditional" data gets mixed back into general supervised learning.

Goodhart's law operates at the micro level: Most discussions of Goodhart's law ("when a measure becomes a target, it ceases to be a good measure") focus on macro failures like agents exploiting loopholes. The goblins case shows it operating at the subtle level of word choice and style. The reward model was targeting "helpful nerdy explanations" but the policy learned to optimize for the easier-to-detect surface feature of "creature metaphors."

Distribution monitoring needs teeth: Catching the goblin trend required OpenAI to have sophisticated monitoring that could detect vocabulary shifts at scale. Many teams have dashboards showing overall metrics (accuracy, latency, cost) but lack the instrumentation to notice when outputs subtly drift in tone, style, or content patterns.

Practical recommendations for production teams

  1. Build vocabulary monitoring: Track not just performance metrics but also linguistic patterns. Sudden increases in specific words, phrases, or metaphors can signal reward hacking or data contamination.

  2. Isolate conditional features: If you offer different modes or personalities, maintain separate data pipelines and eval sets for each. Carefully audit what gets merged back into general training.

  3. Audit reward models regularly: Don't just tune them once during initial RLHF. Periodically re-evaluate what patterns they're actually rewarding versus what you intend them to reward.

  4. Plan for migration costs: OpenAI couldn't simply flip a switch—they had to layer fixes across deployed models, models in training, and data pipelines. Budget time for gradual remediation rather than instant fixes.

  5. Document escape hatches: The documented command-line path for stripping the mitigation shows good judgment. When you add guardrails, consider whether some users might have legitimate reasons to bypass them, and make that path explicit rather than forcing workarounds.
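Recommendation 2 (isolating conditional features) depends on knowing where each training example came from. The sketch below tags examples with provenance and gates what enters a general fine-tuning mix. The `Example` fields, the persona labels, and `merge_for_general_sft` are hypothetical names chosen for this illustration, not part of any real pipeline.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Example:
    text: str
    source: str              # e.g. "human_written" or "rl_rollout" (illustrative labels)
    persona: Optional[str]   # persona active at generation time; None for default traffic

def merge_for_general_sft(examples, allowed_personas=frozenset({None})):
    """Keep examples whose persona is allowed into the general mix.

    Returns (kept, held_back), where held_back counts excluded examples per
    persona so the exclusion is auditable rather than silent.
    """
    kept = [ex for ex in examples if ex.persona in allowed_personas]
    held_back = Counter(
        ex.persona for ex in examples if ex.persona not in allowed_personas
    )
    return kept, held_back
```

Defaulting to an allowlist means persona-conditioned data has to be admitted to the general mix deliberately, which is the audit step the recommendation asks for.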


Bottom line

OpenAI's goblins story is less about fantasy creatures than about incentive design: a personality feature + RL preference produced a statistically obvious vocabulary shift, leaked into general traffic via data loops, and required policy + training-data surgery to unwind. If your product team runs custom tones, modes, or hidden rubrics, treat vocabulary audits as first-class telemetry—not an afterthought when social media notices.

What makes this investigation exemplary

OpenAI's transparency in publishing this case study sets a high bar for the industry:

They quantified the problem precisely: Rather than vague statements about "unexpected behavior," they provided specific percentages (175% increase in goblin, 66.7% concentration in Nerdy mode, 76.2% of datasets showing reward uplift).

They traced the causal chain: The write-up doesn't just say "we fixed it"—it explains the mechanism from reward model training through policy optimization through data contamination.

They acknowledged limitations: The article notes that GPT-5.5 was already training when they understood the root cause, and that they had to layer multiple partial solutions rather than implementing a perfect fix.

They shared the reasoning behind each mitigation: From retiring the feature to filtering data to adding prompt guardrails, they explained why each intervention was necessary.

This level of disclosure helps the entire field learn from one vendor's experience. It's the difference between "we encountered a problem and resolved it" (which teaches nothing) and "here's exactly what went wrong and how each piece of our solution works" (which becomes a blueprint for others).

The path forward for AI deployment

As language models become more capable and widely deployed, incidents like the goblins case will probably become more common, because the systems are complex enough that unintended interactions between components are hard to avoid.

What separates mature ML deployments from immature ones isn't avoiding all quirks—it's having the instrumentation to detect them, the investigation capacity to understand them, and the engineering discipline to fix them at the right layers.

Teams should invest in:

  • Monitoring that goes beyond performance metrics to track distributional properties
  • RLHF audit tools that can decompose what reward models are actually rewarding
  • Data lineage that traces how training examples flow through SFT and preference pipelines
  • Multi-version testing to catch issues before they ship to all users
  • Post-deployment analysis that can correlate user complaints with measurable model behaviors

The goblins were a relatively harmless quirk. The next unintended pattern might not be. Building the infrastructure to catch and understand these patterns before they become serious issues is one of the most important investments AI product teams can make in 2026 and beyond.


Summarized from OpenAI’s April 29, 2026 publication. Numbers and model names are as stated there; verify the live article for updates. ExplainX is not affiliated with OpenAI.
