Are AI visibility tools useless?

No — they are directionally useful when they reveal invisibility on commercial prompts, geography gaps, or competitor citation frequency. They become misleading when sold as precise ranks ("you are #4") or single share-of-voice percentages without showing prompt sets, variance, raw answers, and methodology.

Why is scraping ChatGPT or Claude inaccurate for visibility?

A frontend scrape captures one account, geography, memory state, and session. Change any variable and brand recommendations can change. Mass scraping adds automation bias (datacenter IPs, rate limits, anti-abuse). It is one lab sample, not what all buyers see.

Is the API more accurate than scraping the consumer app?

Different instrument, not automatically better. APIs are repeatable and auditable but may omit consumer-only features (memory, routing, modules). Consumer scrapes may include personalization artifacts. Honest tools label which surface they measure and do not claim it equals every user session.

Why do three AI visibility tools disagree on the same prompts?

They use different prompt sets, weights, geographies, run counts, scoring formulas (mention vs citation vs position-weighted share of voice), and provider mixes. Digiday reported Paul Dyer (/prompt CEO) saying that three tools run on identical prompts yield three different answers. Without disclosed methodology, headline numbers are constructed metrics.

What should I do instead of trusting one visibility rank?

Run your own repeated prompt panel across providers and cities, store raw answers, track distributions not point estimates, fix GEO fundamentals (citations, entity consistency, schema), and treat vendor dashboards as sampling programs — not ground truth. Canonry and similar tools that expose evidence help; polished leaderboards without variance do not.

Should I implement copy changes suggested by an AI visibility tool?

Treat recommendations as hypotheses. HN engineers report implementing portal changes from visibility tools whose results had been pre-processed into prose by another model: stochastic suggestions layered on stochastic outputs. Validate with citation checks, branded search, and conversion metrics before rewriting production docs.

AI Visibility Tools: Can You Trust AEO Scores? (2026) | explainx.ai Blog


Can you trust your AI visibility score? A July 2026 Hacker News thread on Canonry's essay — "Every AI Visibility Tool Is Lying to You" — hit a nerve. Not because brands should ignore ChatGPT, Claude, Gemini, and Perplexity citations — but because dashboards turn messy, personalized, nondeterministic answers into fake precision: you are rank #4, 17% share of voice, +2 positions this week.

One HN comment captured the implementation dread: "I'm implementing changes suggested by a stochastic model based on a limited set of searches on other stochastic models."

This post answers what AEO / GEO measurement can honestly tell you — and lists practical approaches (Canonry's distribution model among them, not the only one).

TL;DR — what people are asking

Question	Honest answer
Are visibility tools useless?	No — good for directional gaps (invisible on category prompts, missing in NYC)
Is "rank #4 in ChatGPT" real?	Over-precise — one sample from a distribution unless proven with variance
Scrape vs API?	Different instruments — neither equals "what every customer sees" without disclosure
Why three tools disagree?	Prompt sets + scoring formulas manufacture different headline numbers
Local businesses?	Global rank is meaningless — geography must be explicit
What works better?	Repeated runs + raw evidence + GEO fundamentals

The measurement problem in one sentence

The same prompt often produces a different brand list on the next run. SparkToro/Gumshoe volunteer studies and Thinking Machines Lab's work on batching variance (instability even at temperature zero in production) both show this. A point estimate without a distribution is decoration.
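To make "a point estimate without a distribution" concrete, here is a minimal Python sketch that turns repeated runs into a presence rate with an interval around it. It assumes each run is an independent yes/no observation (brand mentioned or not) and uses the Wilson score interval, which is one common choice and not something the cited studies prescribe. The function name and the 4-of-10 figures are illustrative.

```python
import math

def wilson_interval(hits: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = hits / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Hypothetical example: the brand appeared in 4 of 10 repeated runs of one prompt.
low, high = wilson_interval(hits=4, runs=10)
print(f"presence rate 40%, interval {low:.0%} to {high:.0%}")
```

With only ten runs, a 40% presence rate is compatible with anything from roughly 17% to 69%, which is why a single headline number says little.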

Why dashboards oversell — six mechanisms

1. Frontend scrape = one synthetic user

Scraping ChatGPT or Claude sounds like measuring the real product. In practice it is one account, one geography, one memory state, one subscription tier, and one IP address. A buyer in Brooklyn and a datacenter browser asking "best CRM for seed-stage startup" are different experiments.

2. API ≠ consumer app

API calls are cheaper, repeatable, and auditable, but they may lack consumer memory, routing, shopping modules, and UI-specific retrieval. OpenAI's API requires web search to be enabled explicitly as a tool; Gemini has its own grounding config. API measurement is valid when labeled as API, not as "what the app showed your buyer."

3. Prompt sets manufacture the score

Vendors track 100–1,000 prompts (Profound's own design guidance). Change the list ("best AEO agency NYC" vs "SEO agency") and "visibility" changes. Apply three scoring formulas (mention share of voice, position-weighted, citation-based) to the same evidence and you can get 20%, 16.8%, and 31.4%, as in Digital Applied's framework example.
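A small Python sketch shows how the three formula families can disagree on identical evidence. The data, brand names, and the 1/position weighting are all assumptions made for illustration; this does not reproduce Digital Applied's example or any vendor's actual formula.

```python
from collections import Counter

# Hypothetical evidence: each answer lists brands in order of appearance
# and the brands whose domains were cited as sources.
answers = [
    {"mentions": ["Acme", "Beta", "Us"], "cited": ["Acme"]},
    {"mentions": ["Us", "Acme"], "cited": ["Us", "Acme"]},
    {"mentions": ["Beta", "Acme"], "cited": []},
    {"mentions": ["Acme", "Us", "Beta"], "cited": ["Us"]},
]

def mention_share(answers, brand):
    """Brand mentions divided by all brand mentions."""
    counts = Counter(b for a in answers for b in a["mentions"])
    return counts[brand] / sum(counts.values())

def position_weighted_share(answers, brand):
    """Same as mention share, but a mention at position k counts 1/k (assumed weighting)."""
    weights = Counter()
    for a in answers:
        for position, b in enumerate(a["mentions"], start=1):
            weights[b] += 1 / position
    return weights[brand] / sum(weights.values())

def citation_share(answers, brand):
    """Brand citations divided by all citations."""
    counts = Counter(b for a in answers for b in a["cited"])
    total = sum(counts.values())
    return counts[brand] / total if total else 0.0

for score in (mention_share, position_weighted_share, citation_share):
    print(f"{score.__name__}: {score(answers, 'Us'):.1%}")
```

On this made-up panel the brand "Us" scores 30.0%, 27.5%, and 50.0% depending only on the formula, so a headline percentage is hard to interpret unless the tool discloses which formula it uses.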

4. Geography breaks global leaderboards

"Best roofing company near me" is local. A single global number without city, proxy, or explicit geo context is marketing math.

5. Model drift moves the goalposts

Chen, Zaharia, and Zou documented GPT-4 behavior shifts under the same public name (for example, accuracy on identifying prime numbers fell from 84% to 51% between March and June 2023). OpenAI rolled back a GPT-4o update for being too agreeable (April 2025). Your "+2 this week" may be the model, not your blog post.
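One way to keep drift from masquerading as a content win is to log the model version with every run and compare weeks only within the same version. The sketch below assumes a hypothetical log of (week, model version, brand mentioned) rows; the version labels and counts are invented.

```python
from collections import defaultdict

# Hypothetical log rows: (week, model_version, brand_was_mentioned)
rows = [
    ("W1", "v1", True), ("W1", "v1", False), ("W1", "v1", False), ("W1", "v1", False),
    ("W2", "v1", True), ("W2", "v1", False),
    ("W2", "v2", True), ("W2", "v2", True),
]

def naive_weekly_rate(rows, week):
    """Presence rate for a week, ignoring which model version answered."""
    hits = [mentioned for w, _, mentioned in rows if w == week]
    return sum(hits) / len(hits)

def presence_by_week_and_version(rows):
    """Return {(week, model_version): presence rate} so weeks are compared like for like."""
    tally = defaultdict(lambda: [0, 0])  # key -> [hits, runs]
    for week, version, mentioned in rows:
        tally[(week, version)][0] += int(mentioned)
        tally[(week, version)][1] += 1
    return {key: hits / runs for key, (hits, runs) in tally.items()}

print(naive_weekly_rate(rows, "W1"), naive_weekly_rate(rows, "W2"))
print(presence_by_week_and_version(rows))
```

In this toy log the naive weekly rate jumps from 25% to 75%, but within version v1 it only moves from 25% to 50%; the rest coincides with v2 appearing. The sample is far too small to conclude anything, and that is part of the point.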

6. Recommendations stacked on recommendations

Tool runs prompts → another model summarizes into prose → your team ships portal copy. Probability layered on probability — the HN engineer's complaint is structurally fair.

What AI visibility can honestly tell you

Directional, probabilistic findings are useful:

Invisible on commercial prompts buyers ask
Strong on branded prompts, weak on category prompts
Competitor cited more often with source links
Visible in New York, blank in Los Angeles
Schema/citation change correlates with more mentions over repeated runs

Fake precision:

"You are rank #4"
"17% AI share of voice" (single number, no interval)
"This week's lift was caused by last week's post"
"This screenshot is what customers see"

Approaches that measure more honestly

Not one vendor — a stack:

Approach	What it measures	Strength	Weakness
Vendor dashboard (Profound, etc.)	Sampled prompt panel	Fast baseline	Often hides methodology
Canonry	Repeated API observations + evidence; local runs for geo	Distribution mindset; auditable runs	Costs more; still not every user session
DIY prompt panel	Your prompts × providers × cities × N runs	Full control	Labor; no pretty UI
GEO content work	Citations, schema, entity consistency	Compounds over months	Not a weekly rank chart
seo-geo agent checks	Princeton GEO methods in content	Improves cite-worthiness	Not competitive monitoring
Classic analytics + branded search	Traffic, conversions, Search Console	Grounded in outcomes	Misses dark ChatGPT referrals

Canonry's specific pitch: treat visibility as a distribution, with multiple runs, multiple providers, explicit geolocation on APIs, and stored evidence. The pitch includes local-first execution for agencies serving, say, Chicago HVAC or Brooklyn hospitality clients: probes run from machines in-market instead of only from vendor cloud regions. That addresses one scrape bias; it does not eliminate model nondeterminism.

Audit checklist before you buy or act

Ask any tool (or your internal script):

Frontend scrape, API, or both?
Whose account, tier, memory, location?
How many runs per prompt are collapsed into one number?
Variance or confidence intervals reported?
Full prompt list and weights?
Scoring formula (mention vs citation vs position)?
Raw answers and cited domains retained?
Model version logged to separate drift from your changes?

Without answers, do not rewrite production docs on a single score.

What to do Monday morning

Pick 20–50 real buyer prompts — not only vanity category terms
Run each 5–10 times across ChatGPT, Claude, Gemini, Perplexity
Store JSON — brands mentioned, citations, timestamp, city
Plot presence rate, not rank
Fix GEO basics — FAQ schema, stats, linked sources (citation is the new ranking)
Review recommendations as experiments: A/B test docs, then measure signups and support tickets
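The storage and plotting steps above can be sketched in a few lines of Python. Everything here is an assumption for illustration: the field names, the JSON Lines file layout, and the idea that brand and citation extraction has already happened upstream. The provider call itself is deliberately left out, because each API has its own client and search or grounding settings.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone

def make_record(prompt, provider, city, answer_text, brands, citations):
    """One observation = one run of one prompt on one provider for one city."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "provider": provider,
        "city": city,
        "answer_text": answer_text,  # keep the raw answer as evidence
        "brands": brands,
        "citations": citations,
    }

def append_jsonl(path, record):
    """Append one record per line so earlier evidence is never overwritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def presence_rate(records, brand):
    """Share of runs mentioning `brand`, grouped by (provider, city)."""
    tally = defaultdict(lambda: [0, 0])  # key -> [hits, runs]
    for r in records:
        key = (r["provider"], r["city"])
        tally[key][0] += int(brand in r["brands"])
        tally[key][1] += 1
    return {key: hits / runs for key, (hits, runs) in tally.items()}
```

A usage pattern might be: call `make_record` after every run, `append_jsonl("panel.jsonl", record)` to store it, then feed the loaded records to `presence_rate` and chart the result per provider and city instead of a rank.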

Connection to Fable "inner voice" and model stacks

A separate July 2026 thread, about Fable leaking reasoning traces, reinforces the same theme: you are watching one layer of a stochastic stack. Visibility scores read model outputs; implementation guides read another model's summary of those outputs. Show your work or stay skeptical.


