AI Visibility Scorecard · Methodology

How the scorecard measures, exactly.

Ask any AI-visibility tool four questions: show me the prompts, show me the runs, show me the math, show me the error bars. This page is our answer to all four.

The prompts come from real buyer research, and you see every one.

We don't invent prompts. A live web-search pass pulls what buyers in your category actually ask: Google "People Also Ask" questions, Reddit threads Google already ranks, Quora questions, G2 and Capterra comparison pages, and the "best of" listicles AI engines cite. From that pool we pick 5 prompts drawn from six buyer-intent patterns (category, comparison, alternatives, use case, pain, brand), preferring verbatim questions and the highest-intent slots. We keep the set small on purpose: running fewer prompts many more times each is what makes the result statistically meaningful instead of anecdotal. Every report lists all 5 prompts and the source pages that informed them.

Before the scan spends anything, you confirm the category, buyer, and competitor set we inferred. Wrong inputs make worthless reports; you're the one person who can catch them.

One engine, chosen for your buyers, with real web search.

Each scan runs on the single AI assistant your buyer ICP uses most: ChatGPT, Claude, or Perplexity. The report states the pick, the rationale, and the confidence. Under the hood these are the providers' API models with live web search enabled (GPT-5.5 with web search, Claude Sonnet 4.6 with web search, Perplexity Sonar). Consumer apps can layer personalization and memory on top; we measure the logged-out baseline, the answer a stranger gets.

The engine call contains only the buyer prompt. We never put your brand or competitor names in the question, because that's a leading question and it contaminates the answer.

Twenty runs per prompt, because one run is a coin flip.

Language models are stochastic: the identical question can return different answers. So every prompt runs 20 times, and the report shows a mention rate (the share of those 20 runs that named you), not a single snapshot. Twenty runs is enough to put a confidence interval around each rate, which is why we trade breadth for depth: a few prompts measured 20 times each tell you more that you can trust than many prompts measured once or twice. The report shows the Wilson 95% interval on every prompt and leads with the ones it can stand behind.

Re-running a scan can still shift per-prompt rates a little in either direction; the interval is the honest read on how much. Compare intervals across monthly scans, not single-point deltas.
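
For readers who want to check the math, here is a minimal sketch of the Wilson 95% score interval in Python. The function names, the z value of 1.96, and the overlap check are our illustration for this page, not the production code; treating "intervals overlap" as "no clear change" is a conservative rule of thumb, not a formal significance test.

```python
import math


def wilson_interval(mentions: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a mention rate (z = 1.96 gives roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= mentions <= runs:
        raise ValueError("mentions must be between 0 and runs")
    p = mentions / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half_width = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    last_month = wilson_interval(8, 20)
    this_month = wilson_interval(11, 20)
    print(f"last month: {last_month[0]:.1%} to {last_month[1]:.1%}")
    print(f"this month: {this_month[0]:.1%} to {this_month[1]:.1%}")
    print("overlap:", intervals_overlap(last_month, this_month))
```

A prompt that named you in 8 of 20 runs comes out at roughly 22% to 61%. Moving from 8 to 11 mentions between scans leaves the two intervals overlapping, so the sketch would not call that a change.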

Mentions are detected by code, not model self-report.

After the engine answers, our code scans the text for your brand (and the confirmed competitor names) with exact, whole-word matching. We don't ask the model whether it mentioned you; models are unreliable narrators about their own output. The same goes for citations: a citation counts as yours only when its URL resolves to your domain.
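
A minimal sketch of what code-based detection can look like, in Python. Several details here are assumptions made for illustration: the function names, case-insensitive matching, counting subdomains as yours, and comparing only the URL's hostname rather than following redirects.

```python
import re
from urllib.parse import urlparse


def mentions_brand(answer: str, brand: str) -> bool:
    """Whole-word match: 'Acme' matches 'Acme.' but not 'Acmeology'."""
    # Lookarounds instead of \b so brands ending in punctuation still work.
    pattern = r"(?<!\w)" + re.escape(brand) + r"(?!\w)"
    # Case-insensitive matching is an assumption of this sketch.
    return re.search(pattern, answer, flags=re.IGNORECASE) is not None


def citation_is_ours(url: str, domain: str) -> bool:
    """True when the URL's host is the domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    domain = domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


def mention_count(answers: list[str], brand: str) -> int:
    """How many of the collected answers named the brand."""
    return sum(mentions_brand(answer, brand) for answer in answers)


if __name__ == "__main__":
    answers = ["Top picks: Acme, Globex.", "Try Globex first.", "Acmeology is unrelated."]
    print(mention_count(answers, "Acme"), "of", len(answers))
    print(citation_is_ours("https://docs.acme.com/pricing", "acme.com"))
```

The suffix check requires a dot before the domain, so a lookalike host such as notacme.com or acme.com.evil.io is not counted as yours.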

Every URL is verified live before it reaches the report.

Citation URLs come from the engine's actual web searches, and we still check every one of them at scan time. Unreachable URLs are removed from the report and excluded from the fix evidence. The report states how many URLs were checked and how many were dropped. If you can click it, it resolved.

The fixes cite evidence, or they don't ship.

The 3 to 5 named fixes are generated from the aggregated scan, and each one must reference verified citation URLs and the specific prompts it would unlock. Anything that can't point to evidence gets dropped. And if a scan's collection falls below our quality gate (at least 80% of calls succeeded and every prompt has at least 3 clean runs), we don't render a report at all; we rerun the scan and email you the finished version. A report built on partial data would show false zeros, and we'd rather be slow than wrong.
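
The quality gate can be sketched in a few lines of Python. The data shape (one list of booleans per prompt) and the function name are hypothetical, and the sketch assumes a "clean run" is the same thing as a successful call, which the real pipeline may define more strictly.

```python
def passes_quality_gate(
    runs_by_prompt: dict[str, list[bool]],
    min_success_rate: float = 0.80,
    min_clean_runs: int = 3,
) -> bool:
    """Apply both gate conditions: overall success rate and per-prompt floor.

    Each list holds one flag per engine call: True for a clean, successful run.
    """
    all_runs = [ok for runs in runs_by_prompt.values() for ok in runs]
    if not all_runs:
        return False  # nothing collected, nothing to report
    success_rate = sum(all_runs) / len(all_runs)
    every_prompt_covered = all(
        sum(runs) >= min_clean_runs for runs in runs_by_prompt.values()
    )
    return success_rate >= min_success_rate and every_prompt_covered


if __name__ == "__main__":
    scan = {f"prompt {i}": [True] * 17 + [False] * 3 for i in range(1, 6)}
    print(passes_quality_gate(scan))  # 85% success, 17 clean runs per prompt
```

Both conditions matter: a scan can clear 80% overall while one prompt has almost no usable runs, and that prompt would otherwise show up in the report as a false zero.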

What we don't claim.

We don't claim a rate is exact to the percentage point: at N=20 it's a range, and we show the range. We don't claim the API baseline is identical to every consumer app session. We don't collapse everything into one proprietary score. The unit of truth is a rate per prompt, with its interval and the raw answers attached, and you can check all of it.

Run it on your company.

Free. About 3 minutes. Every prompt and answer shown.

Run my scorecard