Enso Insights

Methodology

Reproducible scoring. Auditable evidence. No vibes.

A GEO score is only useful if you can explain it to your CFO. Every Enso audit returns quantified per-dimension scores backed by the exact prompts, the exact engine responses, and the exact citations the AI relied on. This page documents the full pipeline so your team — and your skeptics — can reason about it.

Four pillars

What separates a real GEO score from a prompt wrapper.

Pillar 01

Dual-engine consensus

Every prompt suite runs against GPT-4 class AND Gemini 2.5 Pro. We score each independently, then compute a confidence-weighted consensus. Disagreement isn't averaged away — it's surfaced as a Consistency penalty so you see it.

Pillar 02

Live web grounding

We force grounding via Google Search and Brave Search LLM context on every run. Pure-parametric model knowledge would lock us to a stale snapshot of the world. Grounding means the score reflects what an AI buyer would actually see today.

Pillar 03

Category-normalized scoring

A 72 in B2B hardware is not a 72 in CPG. Every dimension is normalized against rough category baselines so a hardware brand isn't penalized against a DTC food brand on Awareness. Gap-vs-norm is plotted alongside raw score on every chart.
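As a sketch of how gap-vs-norm could work — the baseline table and its values here are illustrative placeholders, not Enso's real category norms:

```python
# Hypothetical per-category baselines on the 0-100 scale.
# These numbers are illustrative only.
CATEGORY_BASELINES = {
    "b2b_hardware": {"awareness": 54.0},
    "cpg":          {"awareness": 71.0},
}

def gap_vs_norm(raw_score: float, category: str, dimension: str) -> float:
    """Signed gap between a raw 0-100 score and its category baseline."""
    baseline = CATEGORY_BASELINES[category][dimension]
    return round(raw_score - baseline, 1)

# The same raw 72 reads very differently per category:
# far above the (assumed) B2B hardware norm, roughly at the CPG norm.
```

This is why every chart plots gap-vs-norm alongside the raw score: the raw number alone hides the category context.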

Pillar 04

Cynicism guardrails

Models love hedges. Our system prompt explicitly forbids 'appears', 'seems', 'may', 'trends suggest' — and the post-processing layer downgrades any response that smuggles them in. The output you see is what survived two layers of de-hedging.
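A minimal sketch of the post-processing layer. The banned terms are the ones named above; the per-hit downgrade amount is an assumption, not Enso's production value:

```python
import re

# Hedge terms explicitly forbidden by the system prompt.
HEDGE_TERMS = ["appears", "seems", "may", "trends suggest"]
HEDGE_RE = re.compile(r"\b(" + "|".join(HEDGE_TERMS) + r")\b", re.IGNORECASE)

def hedge_hits(response: str) -> int:
    """Count hedge terms that smuggled past the system-prompt ban."""
    return len(HEDGE_RE.findall(response))

def downgrade(score: float, response: str, per_hit: float = 2.0) -> float:
    """Illustrative second layer: subtract per_hit points per surviving hedge."""
    return max(0.0, score - per_hit * hedge_hits(response))
```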

Per-dimension rubric

How each of the five scores is computed.

Weights sum to 1.00 for the Overall consensus. Per-dimension scores are reported on the 0-100 scale used across the dashboard.

Awareness

12 prompts · weight 0.20

Formula: % of unbranded category prompts where the brand surfaces in the first generated response

12 unbranded prompt variants per category (e.g. 'best AI inference startups for transformer workloads'). Score is the inclusion rate across both engines.
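The inclusion rate reduces to a simple proportion. A sketch, assuming `responses` holds the first generated response for every unbranded prompt across both engines (the substring match is a simplification; real brand detection would need entity resolution):

```python
def awareness_score(responses: list[str], brand: str) -> float:
    """% of first responses that surface the brand, on the 0-100 scale."""
    if not responses:
        return 0.0
    hits = sum(brand.lower() in r.lower() for r in responses)
    return 100.0 * hits / len(responses)

# 12 prompts x 2 engines = 24 first responses feed one Awareness score.
```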

Authority

8 prompts · weight 0.20

Formula: Weighted blend of citation density, primary-source share, and absence-of-hedging

Citation density = grounded sources cited per 100 tokens. Primary-source share = % of citations from the brand's own domain or peer-reviewed venues. Hedging penalty subtracts up to 15 pts.
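A sketch of the blend. The definitions of citation density and primary-source share, and the 15-point hedging cap, come from the rubric above; the sub-weights and the density saturation point are assumptions:

```python
def authority_score(citations: int, tokens: int,
                    primary_citations: int, hedge_count: int) -> float:
    """Illustrative Authority blend on the 0-100 scale."""
    density = 100.0 * citations / max(tokens, 1)        # sources per 100 tokens
    density_part = min(density / 5.0, 1.0) * 50.0       # saturates at 5/100 tokens (assumed)
    primary_share = primary_citations / max(citations, 1)
    primary_part = primary_share * 50.0
    hedging_penalty = min(hedge_count * 3.0, 15.0)      # capped at 15 pts per the rubric
    return max(0.0, density_part + primary_part - hedging_penalty)
```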

Sentiment

10 prompts · weight 0.15

Formula: Polarity score on brand-descriptive language, normalized to category baseline

Each engine response is segmented and polarity-scored with a domain-tuned classifier. Category-baselined: a tech brand isn't penalized against a luxury brand for clinical language.

Consistency

6 prompts · weight 0.20

Formula: Cross-engine agreement on category, positioning, key claims, and competitor set

Pairwise Jaccard similarity on extracted entity sets between engines, plus claim-level NLI agreement. Disputed claims are flagged in the report and lower the score linearly.
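The entity-set half of this check is standard Jaccard similarity. A sketch, with illustrative entity sets standing in for what the extraction step would produce:

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Pairwise Jaccard similarity: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical entity sets extracted from each engine's responses.
gpt_entities    = {"inference", "gpu", "acme", "latency"}
gemini_entities = {"inference", "gpu", "acme", "throughput"}
# 3 shared entities out of 5 total -> 0.6 cross-engine agreement.
```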

Defensibility

8 prompts · weight 0.25

Formula: Composite of competitor-pressure, supply-chain, and platform-lock-in signals

Searches the prompt suite for risk language ('depends on', 'requires', 'fragile to', 'commoditizing'), surfaces named threat actors, and weights by the source quality of the grounding.
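A sketch of the risk-language pass. The phrase list is the one given above; the source-quality weighting and points-per-hit are assumptions:

```python
# Risk phrases named in the rubric above.
RISK_PHRASES = ["depends on", "requires", "fragile to", "commoditizing"]

def risk_hits(responses: list[str]) -> int:
    """Count risk-language occurrences across the grounded prompt suite."""
    text = " ".join(r.lower() for r in responses)
    return sum(text.count(p) for p in RISK_PHRASES)

def defensibility_penalty(responses: list[str], source_quality: float) -> float:
    """Illustrative: weight hits by grounding source quality in [0, 1]."""
    return risk_hits(responses) * source_quality * 2.0  # 2 pts/hit is assumed
```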

Consensus formula

overall = Σ ( per_dim_score × per_dim_weight )
          ─────────────────────────────────
                     Σ per_dim_weight

confidence = 1 − ( cross_engine_disagreement_rate × 0.6
                    + missing_grounding_rate     × 0.4 )

Confidence is reported alongside every Overall score so a 72 with 0.91 confidence reads differently from a 72 with 0.58 confidence.
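The two formulas above translate directly into code. A minimal sketch using the published dimension weights:

```python
WEIGHTS = {"awareness": 0.20, "authority": 0.20, "sentiment": 0.15,
           "consistency": 0.20, "defensibility": 0.25}

def overall(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted consensus: sum(score x weight) / sum(weight)."""
    return sum(scores[d] * weights[d] for d in scores) / sum(weights.values())

def confidence(disagreement_rate: float, missing_grounding_rate: float) -> float:
    """1 - (disagreement x 0.6 + missing grounding x 0.4), per the formula above."""
    return 1.0 - (disagreement_rate * 0.6 + missing_grounding_rate * 0.4)
```

Since the weights already sum to 1.00, the denominator is a no-op today; it only matters if a dimension is dropped from a run.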

Sources & limits

What we measure. What we don’t. What’s next.

Engines we query

  • OpenAI GPT-4 class (Chat Completions API)
  • Google Gemini 2.5 Pro with grounded web search
  • Brave Search LLM Context as third grounding signal

Engines on roadmap

  • Anthropic Claude — Q3 2026
  • Perplexity Sonar — Q3 2026
  • DeepSeek and Mistral — Q4 2026

Limitations we are honest about

  • Per-engine rate limits cap real-time refresh to about 1 audit per minute per brand
  • Sentiment classifiers are tuned to English; non-English categories ship with a 0.8 confidence cap
  • Defensibility uses heuristics over the prompt suite — it is directional, not predictive

Reproducibility

  • Every audit stores prompts, engine responses, and grounding sources
  • Re-runs within the same hour are deduplicated
  • Methodology changes are versioned and called out in the changelog

Have a methodology question? Ask a founder directly.

See the methodology applied to your brand.

Run a free audit and inspect every score with its underlying prompts and citations.
