Steve Kim · 2026-06-18 · 6 min read

SEO A/B testing for developers

You know how to A/B test a checkout flow: split users 50/50, ship the variant to half of them, measure the lift. Then you try to A/B test an SEO change the same way — and quietly break your rankings. SEO split testing works, but the unit you randomize is not the user. It's the page.

Conversion-rate testing and SEO testing look like the same problem, so engineers reach for the same tool. They shouldn't. The whole method has to change, because the visitor you're optimizing for is Googlebot, and Googlebot is one user. You can't show it version A on Tuesday and version B on Wednesday and call that a test. You'll either measure noise or get flagged for cloaking.

Here's the version of A/B testing that actually holds up for organic search — and the honest limits of when you can run one at all.

The unit of randomization is the page, not the user

A CRO test splits people: a cookie or a feature flag sends user 1 to the control and user 2 to the variant, and you compare conversion between the two cohorts. That works because you have thousands of independent users to randomize over.

In organic search you have effectively one user whose behavior you're trying to change — the crawler that decides your rankings. You can't split it into cohorts. So instead of splitting users across one page, an SEO test splits pages across one change:

1. Take a large group of structurally similar pages served by the same template: product pages, location pages, blog posts, category pages.
2. Randomly assign each page to a control group or a variant group.
3. Apply the change (a new title format, added structured data, a copy block, an internal-linking module) to the variant group only, at the template or server level, so every visitor and the crawler sees the same thing on a given page.
4. Measure the difference in organic performance between the two groups.
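
The assignment step can be sketched in a few lines. This is a minimal illustration, not a prescribed method: it assumes pages are identified by URL path and uses a salted hash as a stand-in for a random draw, so the same page always lands in the same group. The function names, the salt, and the example paths are all hypothetical.

```python
import hashlib

def assign_group(url_path: str, salt: str = "title-test-01") -> str:
    """Deterministically bucket one page into 'control' or 'variant'.

    The salt is a hypothetical per-experiment label; changing it reshuffles pages.
    """
    digest = hashlib.sha256(f"{salt}:{url_path}".encode("utf-8")).hexdigest()
    return "variant" if int(digest, 16) % 2 else "control"

def split_pages(url_paths, salt: str = "title-test-01") -> dict:
    """Split a template's URL set into the two groups."""
    groups = {"control": [], "variant": []}
    for path in sorted(set(url_paths)):
        groups[assign_group(path, salt)].append(path)
    return groups

# Made-up URL set standing in for one template's pages.
pages = [f"/products/item-{i}" for i in range(200)]
groups = split_pages(pages)
print(len(groups["control"]), len(groups["variant"]))
```

A plain hash split does not guarantee the two groups had similar traffic before the test. If they differ badly, reshuffling with a new salt or stratifying by prior traffic is a reasonable next step.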

This only works if you have the pages for it. A test needs a group large enough to carry statistical signal — typically dozens to hundreds of comparable URLs per side. That's the first hard constraint, and it's the one that decides whether you can run a clean test at all.

Why you can't just split the users (the cloaking trap)

The reason this matters isn't pedantic. Serving different content based on who's asking is exactly the behavior Google's spam policies define as cloaking — showing the crawler something other than what users get. Search bots can't be reliably cookied, so a user-split tool tends to feed Googlebot a random mix of variants across visits, which is both unmeasurable and, if it diverges from what humans see, penalizable. Google has been explicit that cloaking can get a site demoted or removed from results.

That's why SEO split tests are implemented server-side, per page — never per request based on the visitor. Each variant-group URL renders one consistent version for everyone. You're not hiding anything from the crawler; you're changing a real subset of pages and watching what happens.
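
As a rough sketch of what "per page, never per request" means in code: the variant decision below depends only on the URL path. The user agent is accepted as a parameter purely to show that it is ignored. The title formats, the store name, and the function names are invented for illustration.

```python
import hashlib

def page_group(url_path: str, salt: str = "title-test-01") -> str:
    """Bucket a page by its URL alone (hypothetical salted-hash assignment)."""
    digest = hashlib.sha256(f"{salt}:{url_path}".encode("utf-8")).hexdigest()
    return "variant" if int(digest, 16) % 2 else "control"

def render_title(url_path: str, product_name: str, user_agent: str = "") -> str:
    """Render the <title> text for a page.

    user_agent is deliberately unused: crawlers and humans must get the
    same output for the same URL, otherwise this drifts into cloaking.
    """
    if page_group(url_path) == "variant":
        return f"Buy {product_name} Online | Example Store"
    return f"{product_name} | Example Store"

# The same URL yields the same title no matter who asks.
for_bot = render_title("/products/item-7", "Blue Widget", user_agent="Googlebot")
for_human = render_title("/products/item-7", "Blue Widget", user_agent="Mozilla/5.0")
print(for_bot == for_human)
```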

Measuring the result: you forecast the counterfactual

Here's where it stops looking like a t-test on two conversion rates. You can't just compare "variant clicks" to "control clicks," because the two page groups never had identical traffic to begin with, and organic traffic drifts with seasonality, demand, and algorithm updates the whole time your test runs.

So the question you actually answer is counterfactual: what would the variant group's traffic have done if you'd changed nothing — and how far did reality diverge from that?

The standard tool is CausalImpact, an open-source R package from Google built on Bayesian structural time-series models (Brodersen et al., 2015). You feed it the variant group's metric (clicks, say) as the response series and the control group's as a covariate. Before the change, the two move together. The model learns that relationship; after the change, it projects where the variant should have been based on the still-unchanged control, and measures the gap.

The result isn't variant-minus-control. It's variant-minus-forecast: the gap between what the variant group actually did and what the model predicted it would do with no change. That gap is the attributed effect, and it comes with an uncertainty interval (a Bayesian credible interval, in CausalImpact's case), not just a point estimate.
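
To make variant-minus-forecast concrete, here is a deliberately simplified stand-in. It is not CausalImpact and not a Bayesian structural time-series model: it fits an ordinary least-squares line from control clicks to variant clicks on the pre-change days, projects that line over the post-change days, and reports the gap. It produces no uncertainty interval. The daily click numbers are made up, and every name is hypothetical.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def estimate_effect(control, variant, change_index):
    """Forecast the variant from the control and measure the post-change gap."""
    # Learn the pre-change relationship between the two groups.
    slope, intercept = fit_line(control[:change_index], variant[:change_index])
    # Counterfactual: where the variant "should" be, given the untouched control.
    forecast = [intercept + slope * c for c in control[change_index:]]
    actual = variant[change_index:]
    gaps = [a - f for a, f in zip(actual, forecast)]
    return {
        "forecast": forecast,
        "mean_daily_gap": mean(gaps),
        "relative_lift": sum(gaps) / sum(forecast),
    }

# Made-up daily clicks; the change ships on day index 5.
control = [100, 110, 90, 105, 95, 120, 100, 110]
variant = [210, 230, 190, 220, 200, 270, 230, 250]
result = estimate_effect(control, variant, change_index=5)
print(result["mean_daily_gap"], round(result["relative_lift"], 3))
```

Note that a raw comparison would report the variant as roughly double the control, which says nothing about the change. The forecast absorbs that baseline difference and leaves only the divergence that appeared after the change.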

Two numbers decide whether you believe it. You want the 95% interval on the effect to exclude zero, meaning the model gives less than a 5% chance that the gap is just noise, and you have to let the test run long enough, usually 2–4 weeks, for the crawl-and-reindex lag to play out and the interval to tighten. Call a winner on day three and you're reading provisional data and weekend seasonality, not a result.
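
Those two checks are easy to encode as a gate in front of any "we have a winner" message. The sketch below assumes you already have the lower and upper bounds of a 95% interval on the effect from whatever model you ran. The 14-day floor is taken from the low end of the 2–4 week range and is an assumption to tune, not a rule. The function name and return labels are hypothetical.

```python
def read_test(days_running: int, interval_low: float, interval_high: float,
              min_days: int = 14) -> str:
    """Decide how to read a split test from its duration and effect interval."""
    if days_running < min_days:
        # Too early: crawl and reindex lag has not played out yet.
        return "keep running"
    if interval_low > 0:
        return "positive"      # whole interval above zero
    if interval_high < 0:
        return "negative"      # whole interval below zero
    return "inconclusive"      # interval still straddles zero

print(read_test(3, 5.0, 20.0))    # day three, even with a promising interval
print(read_test(21, 5.0, 20.0))
print(read_test(21, -5.0, 20.0))
```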

CRO test vs. SEO split test, side by side

	CRO A/B test	SEO split test
What you randomize	Users	Pages (a template's URL set)
Who you optimize for	Many human visitors	One crawler (Googlebot)
How variants are served	Per user, client-side / flag	Per page, server / template level
Splitting by visitor is	The whole method	Cloaking — a policy violation
How you measure	Conversion rate, A vs B	Variant vs forecast counterfactual
Typical time to read	Hours to days	2–4 weeks (crawl + index lag)
Minimum to run one	Enough traffic	Enough similar pages

When you can't run a split test at all

The catch sits in that last row. A clean SEO split test needs a large set of interchangeable pages. Plenty of the changes you most want to validate don't have that:

Your homepage, your top three money pages, a single high-intent landing page — there's only one of each, so there's no group to split.
Sitewide changes — a new nav, a performance fix, a global schema rollout — hit every page at once, so there's no control group left.

For those, you can't randomize, and the gold-standard experiment is off the table. What's left is the quasi-experimental version of the same idea: ship the change to everything, then build the counterfactual from the affected pages' own pre-change trend (with any unaffected pages as covariates) and measure the divergence after the crawl lag. It's weaker than a randomized split test, because you're trusting a model of "what would have happened" instead of holding out a real control. But applied honestly, with the same confidence-interval discipline, it's how you attribute a change you couldn't randomize.
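
A bare-bones version of that quasi-experimental approach is an interrupted time series: fit a trend to the pre-change days, extend it past the change, and compare. The sketch below assumes a simple linear trend with no seasonality and no covariates, which real Search Console data will rarely satisfy. The residual spread it returns is only a crude noise yardstick, not a proper interval. The click numbers are invented and all names are hypothetical.

```python
from statistics import mean, pstdev

def trend_counterfactual(series, change_index):
    """Project the pre-change linear trend forward and measure the divergence."""
    days = list(range(len(series)))
    pre_x, pre_y = days[:change_index], series[:change_index]
    mx, my = mean(pre_x), mean(pre_y)
    slope = (sum((x - mx) * (y - my) for x, y in zip(pre_x, pre_y))
             / sum((x - mx) ** 2 for x in pre_x))
    intercept = my - slope * mx
    # How far the pre-change data strays from its own trend line.
    pre_residuals = [y - (intercept + slope * x) for x, y in zip(pre_x, pre_y)]
    forecast = [intercept + slope * x for x in days[change_index:]]
    gaps = [a - f for a, f in zip(series[change_index:], forecast)]
    return {
        "forecast": forecast,
        "mean_gap": mean(gaps),
        "pre_noise": pstdev(pre_residuals),
    }

# Made-up daily clicks for one page; the change ships on day index 6.
clicks = [100, 102, 104, 106, 108, 110, 122, 124, 126]
its = trend_counterfactual(clicks, change_index=6)
print(its["mean_gap"], its["pre_noise"])
```

Because there is no held-out control here, anything else that happened on the change date (an algorithm update, a seasonal spike) lands in the same gap. That is the weakness the paragraph above describes.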

That observational case is the common one for most teams, and it's exactly the method Code Results automates: it runs changepoint detection on your Search Console history, builds the no-change counterfactual, and lines each ranking shift up against the pull requests that landed before it — so even when you can't run a textbook split test, "did that change actually move us?" has a real answer instead of a guess.

