Methodology · March 15, 2026 · 9 min read

What 'dual-engine consensus' actually means (and why averaging is wrong)

Two AI engines aren't twice as good as one. They're a different unit of measurement entirely — when you read them right. Here's the math, the philosophy, and the practical reason we never average ChatGPT and Gemini scores into one number.

When we tell people Enso runs every audit through both ChatGPT and Gemini, the most common response is: “Cool, so you average the two scores.” We don’t. Averaging would throw away the most informative signal in the entire measurement framework, which is the disagreement between the two engines. This post is the long answer to why dual-engine measurement is a genuinely different discipline from single-engine measurement, and why the right operation on two scores is consensus analysis, not arithmetic.

Why one engine is never enough

Every LLM has a personality. ChatGPT is, on average, more decisive in its recommendations and more likely to name a leader. Gemini is, on average, more cautious, more likely to present a list, more likely to defer to user decision-making. These differences aren’t bugs to engineer around — they’re reflections of the training data, the RLHF objectives, and the safety tuning each model went through. They are stable properties of each engine.

That stability matters because it means a brand’s score on a single engine is partially a measurement of the engine’s own bias toward decisiveness, not just the brand’s actual category position. A 9/10 from ChatGPT is not the same thing as a 9/10 from Gemini. ChatGPT scores run higher across the board because ChatGPT is more willing to commit. Gemini scores run lower because Gemini hedges more, on average, regardless of brand quality.

If you only measure one engine, you don’t know whether a score change is a brand-position change or an engine artifact. You can’t separate the signal from the instrument.

The math: why averaging is wrong

Suppose your brand scores 9/10 on ChatGPT and 4/10 on Gemini for the same prompt set. The naive average is 6.5. The naive average is also useless, because it conflates two scenarios that are radically different:

  • Scenario A: ChatGPT recommends you enthusiastically because your category page on Wikipedia is well-written. Gemini doesn’t cite you because your Crunchbase profile is weak and Gemini grounds harder on Crunchbase. Diagnosis: fix Crunchbase and both scores go up. The underlying brand position is good.
  • Scenario B: ChatGPT recommends you because of a single high-authority blog post that doesn’t generalize. Gemini doesn’t cite you because nothing else in your category presence is working. Diagnosis: the ChatGPT score is fragile and will collapse the moment that one blog post drops out of the index. The underlying brand position is poor.

Both scenarios produce the same 6.5 average. The actions they require are different. One is a focused fix on a single source. The other is a category-presence rebuild. Averaging the scores erases the distinction entirely. The only way to recover it is to look at the gap, the prompts the gap appears on, and the grounding sources each engine used to reach its answer. That’s consensus analysis, not averaging.
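
To make the lossy-summary point concrete, here’s a minimal sketch in Python. The EngineResult record, the source lists, and the 2-point “aligned” threshold are illustrative assumptions for this post, not Enso’s actual data model or cutoffs:

```python
from dataclasses import dataclass

@dataclass
class EngineResult:
    score: float            # 0-10 visibility score from one engine
    top_sources: list[str]  # domains the engine grounded its answer on

def naive_average(a: EngineResult, b: EngineResult) -> float:
    # The lossy summary: both scenarios below collapse to the same 6.5.
    return (a.score + b.score) / 2

def consensus_diagnosis(chatgpt: EngineResult, gemini: EngineResult) -> str:
    # Keep the gap and the source diff; that's the actionable part.
    gap = abs(chatgpt.score - gemini.score)
    if gap < 2:  # illustrative "aligned" threshold
        return "aligned: read the level, not the gap"
    only_chatgpt = sorted(set(chatgpt.top_sources) - set(gemini.top_sources))
    return f"split of {gap:.0f} points; sources only ChatGPT used: {only_chatgpt}"

# Scenario A: broad presence behind ChatGPT, one weak source behind Gemini.
scenario_a = (EngineResult(9, ["en.wikipedia.org", "g2.com"]),
              EngineResult(4, ["crunchbase.com"]))
# Scenario B: a single fragile blog post carrying the entire ChatGPT score.
scenario_b = (EngineResult(9, ["one-blog.example"]),
              EngineResult(4, []))

for name, (c, g) in (("A", scenario_a), ("B", scenario_b)):
    print(f"Scenario {name}: average={naive_average(c, g)}",
          f"| {consensus_diagnosis(c, g)}")
```

Both scenarios print the same 6.5 average; only the consensus line tells you which sources to go look at.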

The three modes of dual-engine analysis

Once you accept that the gap matters more than the average, you start to see three distinct “modes” in any dual-engine result. Each one means something different and points at a different next move; a short classification sketch follows the list.

  • Aligned-high. Both engines score the brand strongly. This is the boardroom-defensible outcome — “we are reliably the answer in this category, regardless of which AI the buyer asks.” The right action is usually defense: lock in what’s working, monitor for erosion, don’t change anything that’s creating the result.
  • Aligned-low. Both engines score the brand weakly. This is honest, painful, useful data. The brand has a structural absence in the category, not a model-specific one. The right action is to invest in the two or three category-defining grounding sources (the well-known authority sites and structured comparison pages) where both engines clearly look. This is also where the most leverage exists — fixing aligned-low almost always moves both scores at once.
  • Split. One engine scores the brand strongly, the other doesn’t. This is the most common outcome and the most informative. The split tells you that the brand’s category presence is uneven across grounding sources. The remediation is targeted rather than systemic — go fix the specific source the weaker engine relies on. Splits are also where most of the “quick wins” in GEO live, because the brand has already proven it can be cited; the fix is plumbing, not narrative.
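
A minimal classifier for the three modes, in the same illustrative Python as above. The 7-point “strong” bar and the 3-point split gap are hypothetical thresholds, not Enso’s production cutoffs:

```python
def consensus_mode(chatgpt: float, gemini: float,
                   strong: float = 7.0, split_gap: float = 3.0) -> str:
    """Classify a dual-engine result (0-10 scores) into one of three modes."""
    if abs(chatgpt - gemini) >= split_gap:
        return "split"         # targeted fix: trace the weaker engine's sources
    if min(chatgpt, gemini) >= strong:
        return "aligned-high"  # defend: lock in what's working
    return "aligned-low"       # rebuild: invest in category-defining sources

print(consensus_mode(9, 8))  # aligned-high
print(consensus_mode(3, 2))  # aligned-low
print(consensus_mode(9, 4))  # split
```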

The philosophy: two engines as a triangulation

Scientists don’t measure things twice with different instruments because two measurements average into a more accurate one. They do it because two measurements with different instrument biases dramatically increase confidence that the underlying phenomenon is real when they agree — and point straight at the source of error when they disagree.

Dual-engine GEO measurement is the same idea, applied to a different domain. ChatGPT and Gemini have different biases. When they agree, you can defend the result in front of any executive without footnotes. When they disagree, you have a free diagnostic — the disagreement itself is the bug report. A measurement system that throws away the disagreement is, in a real sense, throwing away the most scientifically valuable part of the data.

Measure twice with different instruments. If the measurements agree, you have a fact. If they disagree, you have a question worth answering.

Why not three engines, or four?

We’ve written separately about why we hold the line at two engines. The short version: ChatGPT and Gemini together cover the overwhelming majority of buyer-side AI search volume in the categories Enso’s customers operate in, and adding a third engine raises the cost of every measurement without a comparable gain in informational value. The marginal third engine’s data is mostly redundant with whichever of the first two it aligns with more closely.

There’s also a measurement-stability argument. Two engines is the smallest set in which the agreement/disagreement framework above works. One engine has no consensus to compute. A third engine collapses the framework back into majority rule, which is just averaging dressed up — the third engine’s vote breaks ties, which is exactly where you most want to preserve the disagreement rather than vote it away. Two is the only number that forces you to look at the gap honestly.

What this looks like in your dashboard

Three concrete artifacts a CMO should expect from any dual-engine measurement system worth its price:

  • Two scores, never one. ChatGPT and Gemini should be reported separately, on the same scale, with no composite headline number that hides the gap. If your vendor shows you a single “AI score,” ask them how they computed it. If the answer is a weighted average, you’re back to the lossy-summary problem from above.
  • Per-prompt agreement breakdown. A dual-engine system should be able to tell you which specific prompts the two engines agreed on and which they split on. The split prompts are your work list. The aligned ones are your moat (or your problem, if aligned-low). A sketch of this breakdown follows the list.
  • Grounding-source disclosure where possible. When an engine reports a citation, you should be able to trace at least the top-cited domains the engine grounded on. This is what lets you act on a split — knowing ChatGPT cited you because of source X and Gemini missed you because of source Y is the actionable form of the data.
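
Once both engines’ per-prompt scores sit side by side, the agreement breakdown from the second bullet is a few lines of logic. The prompts and the 3-point split threshold below are made up for illustration:

```python
SPLIT_GAP = 3.0  # illustrative; a real system would tune this per scale

def agreement_breakdown(scores: dict[str, tuple[float, float]]) -> dict[str, list[str]]:
    """scores maps prompt -> (chatgpt_score, gemini_score), both on 0-10."""
    report: dict[str, list[str]] = {"split": [], "aligned": []}
    for prompt, (chatgpt, gemini) in scores.items():
        bucket = "split" if abs(chatgpt - gemini) >= SPLIT_GAP else "aligned"
        report[bucket].append(prompt)
    return report

audit = {  # hypothetical prompt set
    "best platform for mid-market teams": (9.0, 8.0),
    "alternatives to the category leader": (8.0, 3.0),
    "top tools for first-time buyers": (4.0, 3.0),
}
report = agreement_breakdown(audit)
print("work list (split):", report["split"])
print("moat or problem (aligned):", report["aligned"])
```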

Dual-engine measurement isn’t a feature you buy because more is better. It’s a different unit of measurement that produces a different kind of answer. Single-engine results give you a guess. Dual-engine consensus gives you a fact, plus a free bug report attached. The cost of the second engine pays for itself in the first decision you don’t make wrong.


Written by The Enso team. Have a question or correction? Email us.
