Can You Trust AI Visibility Scores? Why AEO Dashboards Oversell Precision
AI visibility tools promise rank #4 and 17% share of voice — but ChatGPT and Claude answers are noisy, geographic, and nondeterministic. What AEO metrics can honestly tell you, Canonry's critique, and better measurement approaches.
Can you trust your AI visibility score? A July 2026 Hacker News thread on Canonry's essay, "Every AI Visibility Tool Is Lying to You," hit a nerve. The point is not that brands should ignore citations in ChatGPT, Claude, Gemini, and Perplexity. It is that dashboards turn messy, personalized, nondeterministic answers into fake precision: you are rank #4, 17% share of voice, +2 positions this week.
One HN comment captured the implementation dread: "I'm implementing changes suggested by a stochastic model based on a limited set of searches on other stochastic models."
This post answers what AEO / GEO measurement can honestly tell you — and lists practical approaches (Canonry's distribution model among them, not the only one).
TL;DR — what people are asking
| Question | Honest answer |
| --- | --- |
| Are visibility tools useless? | No: good for directional gaps (invisible on category prompts, missing in NYC) |
| Is "rank #4 in ChatGPT" real? | Over-precise: one sample from a distribution unless proven with variance |
| Scrape vs API? | Different instruments; neither equals "what every customer sees" without disclosure |
| Why three tools disagree? | Prompt sets + scoring formulas manufacture different headline numbers |
| Local businesses? | Global rank is meaningless; geography must be explicit |
| What works better? | Repeated runs + raw evidence + GEO fundamentals |
The measurement problem in one sentence
The same prompt often produces a different brand list on the next run. SparkToro and Gumshoe's volunteer studies and Thinking Machines Lab's analysis of batching variance (instability in production even at temperature zero) both show this. A point estimate without a distribution is decoration.
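To make "a distribution, not a point estimate" concrete, here is a minimal sketch that turns repeated runs of one prompt into a mention rate with an interval. It assumes each run is reduced to a yes/no "was the brand mentioned" and uses the standard Wilson score interval; the 3-of-10 figure is a made-up illustration, not data from any study cited here.

```python
import math

def wilson_interval(hits: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = hits / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Hypothetical example: the brand appeared in 3 of 10 runs of the same prompt.
low, high = wilson_interval(3, 10)
print(f"mention rate 30%, 95% interval {low:.0%} to {high:.0%}")
```

With only ten runs the interval spans roughly 11% to 60%, which is why a single headline percentage from a handful of samples says little on its own.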
Why dashboards oversell — six mechanisms
1. Frontend scrape = one synthetic user
Scraping ChatGPT or Claude sounds like measuring the real product. In practice it measures one account with one geography, memory state, subscription tier, and IP address. A buyer in Brooklyn and a datacenter browser asking "best CRM for seed-stage startup" are different experiments.
2. API ≠ consumer app
API calls are cheaper, repeatable, and auditable, but they may lack consumer memory, routing, shopping modules, and UI-specific retrieval. OpenAI's API requires web search to be enabled explicitly as a tool; Gemini has its own grounding configuration. API measurement is valid when labeled as API, not as "what the app showed your buyer."
3. Prompt sets manufacture the score
Vendors track 100–1,000 prompts (Profound's own design guidance). Change the list, say from "SEO agency" to "best AEO agency NYC," and "visibility" changes. Given the same evidence, three scoring formulas (mention SOV, position-weighted, citation-based) can yield 20%, 16.8%, or 31.4%, as in Digital Applied's framework example.
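A small sketch shows how the formula alone moves the headline number. The brands, answers, and the exact formula definitions below are illustrative assumptions (each vendor defines these differently), and the output is not meant to reproduce Digital Applied's figures.

```python
from collections import Counter

# Hypothetical evidence: each answer lists brands in order of appearance,
# plus the brands whose domains were cited as sources.
answers = [
    {"brands": ["Acme", "Bolt", "Cora"], "cited": ["Acme"]},
    {"brands": ["Bolt", "Acme"], "cited": ["Bolt"]},
    {"brands": ["Bolt", "Cora"], "cited": []},
    {"brands": ["Cora", "Bolt", "Acme"], "cited": ["Acme", "Cora"]},
]

def mention_sov(answers, brand):
    """Share of all brand mentions that belong to `brand`."""
    counts = Counter(b for a in answers for b in a["brands"])
    return counts[brand] / sum(counts.values())

def position_weighted(answers, brand):
    """Same idea, but a mention at position k counts 1/k (assumed weighting)."""
    weights = Counter()
    for a in answers:
        for pos, b in enumerate(a["brands"], start=1):
            weights[b] += 1 / pos
    return weights[brand] / sum(weights.values())

def citation_share(answers, brand):
    """Share of all source citations that point at `brand`."""
    counts = Counter(b for a in answers for b in a["cited"])
    total = sum(counts.values())
    return counts[brand] / total if total else 0.0

for name, fn in [("mention SOV", mention_sov),
                 ("position-weighted", position_weighted),
                 ("citation-based", citation_share)]:
    print(f"{name}: {fn(answers, 'Acme'):.1%}")
```

On this toy evidence the same brand scores 30.0%, 27.5%, and 50.0% depending on the formula, so a score is only comparable across tools when the formula is disclosed.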
4. Geography breaks global leaderboards
"Best roofing company near me" is local. A single global number without city, proxy, or explicit geo context is marketing math.
5. Model drift moves the goalposts
Chen, Zaharia, and Zou documented GPT-4 behavior shifts under the same public model name (for example, accuracy on a prime-identification task fell from 84% to 51% between March and June 2023). OpenAI rolled back a GPT-4o update in April 2025 for being too agreeable. Your "+2 this week" may be the model, not your blog post.
6. Recommendations stacked on recommendations
Tool runs prompts → another model summarizes into prose → your team ships portal copy. Probability layered on probability — the HN engineer's complaint is structurally fair.
What AI visibility can honestly tell you
Directional, probabilistic findings are useful:
Invisible on commercial prompts buyers ask
Strong on branded prompts, weak on category prompts
Competitor cited more often with source links
Visible in New York, blank in Los Angeles
Schema/citation change correlates with more mentions over repeated runs
Fake precision:
"You are rank #4"
"17% AI share of voice" (single number, no interval)
Canonry's specific pitch is to treat visibility as a distribution: multiple runs, multiple providers, explicit geolocation on APIs, and stored evidence. For agencies serving Chicago HVAC or Brooklyn hospitality, it proposes local-first execution: run probes from machines in-market instead of only from vendor cloud regions. That addresses one scrape bias; it does not eliminate model nondeterminism.
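One way to read "visibility as a distribution" is to never collapse providers and cities into a single number. The sketch below is not Canonry's implementation; it assumes a hypothetical run log with one row per probe and reports hits, run count, and rate per (provider, city) pair.

```python
from collections import defaultdict

# Hypothetical run log: one row per probe, with provider and city made explicit.
runs = [
    {"provider": "chatgpt", "city": "New York", "mentioned": True},
    {"provider": "chatgpt", "city": "New York", "mentioned": True},
    {"provider": "chatgpt", "city": "New York", "mentioned": False},
    {"provider": "chatgpt", "city": "Los Angeles", "mentioned": False},
    {"provider": "chatgpt", "city": "Los Angeles", "mentioned": False},
    {"provider": "gemini", "city": "New York", "mentioned": True},
    {"provider": "gemini", "city": "Los Angeles", "mentioned": False},
]

def mention_rates(runs):
    """Return {(provider, city): (hits, total, rate)} without averaging across groups."""
    tally = defaultdict(lambda: [0, 0])
    for r in runs:
        cell = tally[(r["provider"], r["city"])]
        cell[0] += int(r["mentioned"])
        cell[1] += 1
    return {key: (hits, total, hits / total) for key, (hits, total) in tally.items()}

rates = mention_rates(runs)
for (provider, city), (hits, total, rate) in sorted(rates.items()):
    print(f"{provider:8s} {city:12s} {hits}/{total} runs ({rate:.0%})")
```

Keeping the run count next to each rate makes thin cells obvious: "1/1 runs" and "9/10 runs" should not be read with the same confidence.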
Audit checklist before you buy or act
Ask any tool (or your internal script):
Frontend scrape, API, or both?
Whose account, tier, memory, location?
How many runs per prompt are collapsed into one number?
Variance or confidence intervals reported?
Full prompt list and weights?
Scoring formula (mention vs citation vs position)?
Raw answers and cited domains retained?
Model version logged to separate drift from your changes?
Without answers, do not rewrite production docs on a single score.
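The last checklist item, logging the model version, can be sketched as follows. The version labels, dates, and field names are hypothetical; the idea is simply to compare before and after a content change only within a model version that was observed in both periods.

```python
from collections import defaultdict
from datetime import date

# Hypothetical run log; "model-v1" and "model-v2" are placeholder version labels.
runs = [
    {"model_version": "model-v1", "date": date(2026, 7, 1), "mentioned": False},
    {"model_version": "model-v1", "date": date(2026, 7, 2), "mentioned": True},
    {"model_version": "model-v1", "date": date(2026, 7, 10), "mentioned": True},
    {"model_version": "model-v1", "date": date(2026, 7, 11), "mentioned": True},
    {"model_version": "model-v2", "date": date(2026, 7, 12), "mentioned": True},
]

def rates_by_version_and_period(runs, change_date):
    """Mention rate per (model_version, 'before' or 'after' the content change)."""
    cells = defaultdict(lambda: [0, 0])
    for r in runs:
        period = "after" if r["date"] >= change_date else "before"
        cell = cells[(r["model_version"], period)]
        cell[0] += int(r["mentioned"])
        cell[1] += 1
    return {key: hits / total for key, (hits, total) in cells.items()}

def comparable_versions(rates):
    """Versions seen both before and after; only these isolate your own change."""
    before = {version for version, period in rates if period == "before"}
    after = {version for version, period in rates if period == "after"}
    return before & after

rates = rates_by_version_and_period(runs, change_date=date(2026, 7, 8))
usable = comparable_versions(rates)
print(rates)
print("versions usable for before/after comparison:", usable)
```

Here only "model-v1" supports a before/after comparison; the "model-v2" runs exist only after the change, so any lift there could be drift rather than your edit.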
What to do Monday morning
Pick 20–50 real buyer prompts — not only vanity category terms
Run each 5–10 times across ChatGPT, Claude, Gemini, Perplexity
Store JSON — brands mentioned, citations, timestamp, city
Treat recommendations as experiments: A/B test the docs, then measure signups and support tickets
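The "store JSON" step above might look like the sketch below: one record per run, appended as a line of JSON so the raw evidence survives any later rescoring. The field names and the example values are assumptions for illustration, not a schema used by any tool mentioned here.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone

@dataclass
class ProbeRecord:
    """One run of one prompt against one provider (hypothetical schema)."""
    prompt: str
    provider: str
    city: str
    brands: list[str]
    citations: list[str]
    raw_answer: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def to_jsonl_line(record: ProbeRecord) -> str:
    """Serialize a record as a single JSON line suitable for an append-only log."""
    return json.dumps(asdict(record), ensure_ascii=False)

# Hypothetical example record.
record = ProbeRecord(
    prompt="best CRM for seed-stage startup",
    provider="chatgpt",
    city="Brooklyn",
    brands=["Acme", "Bolt"],
    citations=["example.com"],
    raw_answer="(full answer text stored here)",
)
line = to_jsonl_line(record)
print(line)
```

Appending each `line` to a `.jsonl` file keeps every run inspectable, so a score can always be traced back to the answers that produced it.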
Connection to Fable "inner voice" and model stacks
A separate July 2026 thread, about Fable leaking reasoning traces, reinforces the same theme: you are watching one layer of a stochastic stack. Visibility scores read model outputs; implementation guides read another model's summary of those outputs. Show your work or stay skeptical.
Analysis informed by Canonry's June 30, 2026 essay and HN discussion July 2026. Vendor features change — verify methodology on any tool before budget decisions.