Methodology: how we score the five GEO dimensions
An engineering deep-dive on how Enso Insights scores brands. Dual-engine consensus, grounding, the dehedging layer, and the bits we’re still wrong about.
This question comes up a lot from technically-minded readers: how do you actually compute the number? This post is the long-form answer. The short version lives on /methodology; if you’d rather just see the rubric without the implementation notes, start there.
The prompt suite
Every audit runs a fixed suite of prompts against each engine. The suite is split into four classes:
- Unbranded category prompts — “best {category} for {use_case}”. Twelve variants per audit. Drives Awareness.
- Branded comparison prompts — “{brand} vs {competitor} for {use_case}”. Drives Sentiment and Defensibility.
- Authority probes — “tell me about {brand} with sources”. Drives Authority via citation density and primary-source share.
- Consistency probes — paraphrased re-asks of the above. Cross-engine and within-engine agreement drives Consistency.
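As a concrete illustration, the four classes can be sketched as a small template expander. The template strings mirror the post; the function name, the single-paraphrase stub, and the exact expansion logic are assumptions for illustration, not the production suite builder:

```python
# Hypothetical sketch of the four prompt classes. Template strings come
# from the post; everything else here is illustrative.
PROMPT_TEMPLATES = {
    "unbranded": "best {category} for {use_case}",
    "branded": "{brand} vs {competitor} for {use_case}",
    "authority": "tell me about {brand} with sources",
}

def build_suite(brand, category, competitors, use_cases):
    """Expand templates into concrete prompts. Consistency probes are
    paraphrased re-asks; the paraphraser is stubbed as a suffix here."""
    prompts = []
    for uc in use_cases:
        prompts.append(PROMPT_TEMPLATES["unbranded"].format(
            category=category, use_case=uc))
        for comp in competitors:
            prompts.append(PROMPT_TEMPLATES["branded"].format(
                brand=brand, competitor=comp, use_case=uc))
    prompts.append(PROMPT_TEMPLATES["authority"].format(brand=brand))
    # Consistency probes: one paraphrased re-ask per prompt (stub).
    prompts += [p + " (rephrased)" for p in prompts]
    return prompts
```

The same expansion runs once per engine, so a suite of this shape stays small enough to afford the second engine the post argues for.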
We deliberately keep the suite small (roughly 44 prompts per audit). Bigger suites produce more “data” without producing more signal, and they explode the per-audit token bill. We’d rather use the budget on a second engine.
Dual-engine consensus
Single-model scoring is a coin flip. The same prompt, asked of GPT-4 today and Gemini 2.5 Pro tomorrow, will produce different competitor lists, different polarity, and sometimes a different category framing. If you only run one engine you’re trusting whichever way it leaned that day.
We run both. Per-engine scores are computed independently, then combined into a confidence-weighted consensus blend.
Disagreement is treated as a Consistency penalty, not as noise to average away. A brand that gets 80 from GPT and 50 from Gemini is in trouble — the two engines have learned materially different stories, and that’s a problem to fix, not a number to smooth.
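A minimal sketch of what a confidence-weighted blend with a separate disagreement penalty could look like. The weights, the penalty scaling, and the function shape are assumptions, not the shipped formula:

```python
# Illustrative consensus blend of two per-engine scores.
# conf_a / conf_b are per-engine confidence weights (e.g. grounding quality).
def consensus(score_a, score_b, conf_a, conf_b):
    """Blend scores by confidence. Disagreement is not averaged away:
    it is returned separately so it can feed a Consistency penalty."""
    blend = (conf_a * score_a + conf_b * score_b) / (conf_a + conf_b)
    disagreement = abs(score_a - score_b)
    consistency_penalty = disagreement * 0.5  # illustrative scaling
    return blend, consistency_penalty
```

With the 80-vs-50 split from the post, the blend lands in the mid-60s while the 30-point disagreement produces a large Consistency penalty, which is exactly the "problem to fix" signal described above.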
Why grounding matters
Pure-parametric LLM knowledge is a snapshot of the world at the model’s training cutoff. For GEO that snapshot is useless: brands move fast, competitive sets reshuffle every quarter, and an audit grounded in 2024 data is wrong for 2026 buyers.
We force grounding via:
- Google Search via Gemini’s native grounding tool
- Brave Search LLM context as a third source feeding GPT
- Per-prompt grounding-quality scoring (number of citations, source diversity, primary-source share)
Grounding quality enters the per-engine weight. An engine that produced a great-looking answer with zero citations gets weighted down — we trust the engine that did its homework.
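The three grounding signals named above (citation count, source diversity, primary-source share) can be sketched as a simple quality score. The weighting, the saturation cap, and the equal-thirds average are assumptions; only the three inputs come from the post:

```python
# Illustrative per-prompt grounding-quality score in [0, 1].
def grounding_quality(citations):
    """citations: list of (domain, is_primary) tuples extracted
    from an engine's answer. No citations means zero weight."""
    if not citations:
        return 0.0  # the uncited "great-looking answer" gets weighted down
    n = len(citations)
    diversity = len({domain for domain, _ in citations}) / n
    primary_share = sum(1 for _, is_primary in citations if is_primary) / n
    density = min(n / 5, 1.0)  # saturate at 5 citations (assumed cap)
    return round((density + diversity + primary_share) / 3, 3)
```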
The dehedging layer
Models love hedges. “Appears,” “seems,” “may,” “trends suggest,” “is generally regarded as.” If we let those through unmodified, every Sentiment score would gravitate toward 60 and every Authority score would compress around 70. Useless.
Dehedging happens in two layers:
- System prompt instructs the engine to commit to claims with a confidence level (LOW/MED/HIGH) instead of using soft-language hedges.
- Post-process downgrades Authority by 1–3 points per detected hedge phrase that slipped through. The list of penalized phrases is versioned in source control.
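The post-process layer can be sketched as a phrase scan plus a per-hit deduction. The phrase list below is a toy sample of the versioned list, and the flat 2-point penalty is one point inside the 1–3 range the post describes, chosen here for illustration:

```python
import re

# Toy sample of the penalized-phrase list (the real list is versioned).
HEDGE_PHRASES = ["appears", "seems", "may", "trends suggest",
                 "is generally regarded as"]
PENALTY_PER_HEDGE = 2  # illustrative midpoint of the 1-3 range

def dehedge_authority(score, response_text):
    """Count hedge phrases that slipped past the system prompt and
    downgrade the Authority score accordingly."""
    text = response_text.lower()
    hits = sum(len(re.findall(r"\b" + re.escape(p) + r"\b", text))
               for p in HEDGE_PHRASES)
    return max(0, score - hits * PENALTY_PER_HEDGE), hits
```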
Net effect: scores spread out into a usable range. A hedged response that would naively score 72 lands at 58. That’s the score you actually want to act on.
Category normalization
A 72 in B2B hardware is not a 72 in CPG. Hardware buyers ask LLMs technical questions with narrow answers; CPG buyers ask broad questions with diffuse answers. Awareness scores have completely different baselines.
We maintain rough category norms across about 30 industry categories. Every score is reported as raw value + gap vs. category norm. The dashboard has a dedicated Gap-vs-Norm chart so the reader can see at a glance where they’re under-indexing relative to peers.
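The raw-plus-gap reporting described above reduces to a lookup and a subtraction. The norm values below are invented placeholders, not real category norms:

```python
# Hypothetical category norm table (values are placeholders).
CATEGORY_NORMS = {"b2b_hardware": 64.0, "cpg": 48.0}

def report_score(raw, category):
    """Report a raw score alongside its gap vs. the category norm,
    so the same raw number reads differently in different categories."""
    norm = CATEGORY_NORMS[category]
    return {"raw": raw, "norm": norm, "gap_vs_norm": round(raw - norm, 1)}
```

Under these placeholder norms, the same raw 72 shows a much larger gap in CPG than in B2B hardware, which is the point of the Gap-vs-Norm chart.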
Where we’re still wrong
In the spirit of honesty, here are the things we know are imperfect:
- Category norms are bootstrapped from our own audit corpus. As more brands get audited the norms get sharper, but the early ones (smallest categories) carry ±10 points of uncertainty.
- Sentiment classifiers are tuned for English. Non-English audits ship with a 0.8 confidence cap and we tell you in the report.
- Defensibility is the most heuristic dimension — it’s reading risk language out of the prompt suite. It’s directional, not predictive. If your Defensibility drops 20 points, that’s a signal worth investigating, but it’s not a forecast.
We version the methodology and call out changes in the changelog. If a future improvement moves your historical scores, we re-baseline the trendline so you’re comparing like with like.
Written by The Enso team. Have a question or correction? Email us.