Gary Illyes needed a data-backed answer to a simple question: what robots.txt directives are webmasters actually using beyond the handful Google officially documents? Getting that answer meant building a pipeline from scratch using HTTP Archive, custom JavaScript metrics, and BigQuery. Along the way he paid a few hundred dollars for a single wrong query and discovered that robots.txt files were not in the dataset he expected. Here is everything that happened and what the data shows.
Why Google needed the data
Someone submitted a GitHub PR to the official robots.txt repository asking Google to add two new directives to its unsupported-tags list — the one Search Console uses to flag directives it recognizes but does not act on. The PR was solid. The problem was that one PR is not evidence of broad usage.
John Mueller proposed a better approach: look at the top ten or fifteen directives that webmasters are actually using across the web and add the most prevalent ones in a single documentation update. That way the list would be grounded in real usage data rather than individual requests. Gary Illyes took on the task.
What HTTP Archive is and how it works
HTTP Archive is a public dataset that has been crawling the web since at least 2019. The URL list comes from the Chrome UX Report — an opt-in aggregate of real user visits — and runs to roughly 16 million URLs, historically home pages but increasingly including secondary pages as well.
Each URL is not just downloaded; it is rendered in a full browser instance via WebPageTest. That render step is what makes it possible to measure Core Web Vitals, Lighthouse scores, CSS usage, and anything else that requires JavaScript execution. It also makes it possible to run custom analysis scripts against every page in the crawl — those are called custom metrics.
Custom metrics are JavaScript functions anyone can contribute via a public GitHub repository. They run at render time, and their output lands in a BigQuery dataset that is publicly queryable. The HTTP Archive team uses this system to power the annual Web Almanac, a report on how the web is built across dozens of dimensions.
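For orientation, a custom metric is conceptually nothing more than a self-contained snippet of JavaScript that executes in the rendered page and returns a JSON-serializable value, which then becomes a queryable field in BigQuery. A hypothetical sketch of that shape — not taken from the actual custom metrics repository, which has its own conventions:

```javascript
// Hypothetical custom-metric sketch, for illustration only: runs in the
// rendered page and returns a JSON-serializable object that HTTP Archive
// would store alongside the page's other crawl data.
(() => {
  const metaRobots = document.querySelector('meta[name="robots"]');
  return {
    title: document.title,
    meta_robots: metaRobots ? metaRobots.content : null,
  };
})();
```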
The expensive query that returned nothing useful
Gary’s first move was to query the HTTP Archive BigQuery dataset directly. He wrote one large query to look for robots.txt data. BigQuery bills by the amount of data scanned, and he had no cost controls in place. That query cost him several hundred dollars.
The result: robots.txt files were not in the dataset. HTTP Archive crawls page URLs, not the /robots.txt endpoint at the root of each domain. The data he needed did not exist in any queryable form. He had to create it.
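BigQuery does offer guardrails against exactly this kind of surprise. A hedged sketch using the Node.js client (@google-cloud/bigquery): a dry run reports how many bytes a query would scan without executing or billing anything, and a bytes-billed cap — assumed here to pass through to the job configuration as maximumBytesBilled — makes the job fail instead of charging past the limit. Any SQL or table names you pass in are your own; nothing below references a real HTTP Archive table.

```javascript
// Hedged sketch: estimate a BigQuery scan before paying for it, then run it
// under a hard cap. Assumes @google-cloud/bigquery and default credentials.
const { BigQuery } = require('@google-cloud/bigquery');

async function estimateThenRun(sql) {
  const bigquery = new BigQuery();

  // Dry run: nothing executes, nothing is billed; statistics report scan size.
  const [dryJob] = await bigquery.createQueryJob({ query: sql, dryRun: true });
  const bytes = Number(dryJob.metadata.statistics.totalBytesProcessed);
  console.log(`This query would scan ~${(bytes / 1e12).toFixed(2)} TB`);

  // Real run with a cap (assumption: maximumBytesBilled is forwarded to the
  // job configuration); the job errors out instead of billing past 100 GB.
  const [job] = await bigquery.createQueryJob({
    query: sql,
    maximumBytesBilled: String(100 * 1e9),
  });
  const [rows] = await job.getQueryResults();
  return rows;
}
```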
How they built the custom metric
Barry Pollard pointed Gary to the HTTP Archive custom metrics repository on GitHub. There was already a JavaScript function there that counted a small fixed list of known robots.txt directives — noindex, noarchive, crawl-delay, and a few others — but it only looked for directives the team already knew about. Gary needed the opposite: a script that surfaces every directive in use, including unknown ones.
The approach: imitate what a C++ robots.txt parser does. Go line by line. For each line, look for anything that resembles a key-value pair separated by a colon. Extract the key. Do not filter for known values — collect everything.
To match those pairs reliably, Gary needed a regex. He freely admits he is bad at writing them, so he used an AI chatbot to generate one. The result was a complex pattern he then ran through a fuzzer — essentially a tool that throws random inputs at a function to find edge cases — until the fuzzer could no longer break it. Satisfied, he submitted the metric to the HTTP Archive repository in early February. It was merged and made it into the next crawl run. The output is a JSON object per page containing every key extracted from that domain’s robots.txt file, stored in the custom metrics BigQuery dataset.
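A minimal sketch of that line-by-line key extraction, for illustration only; it is not the merged metric, and the regex here is far simpler than the fuzz-tested pattern Gary shipped:

```javascript
// Illustrative sketch: walk a robots.txt body line by line, treat anything
// that looks like "key: value" as a directive, and count every key found,
// with no allow-list of known directives.
function extractRobotsKeys(robotsTxt) {
  const counts = {};
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // drop comments and whitespace
    if (!line) continue;
    const match = line.match(/^([^:\s][^:]*?)\s*:\s*(.*)$/); // key : value
    if (!match) continue;
    const key = match[1].toLowerCase();
    counts[key] = (counts[key] || 0) + 1; // collect everything, known or not
  }
  return counts;
}

// extractRobotsKeys("User-agent: *\nDisallow: /tmp/\nUnicorns: allowed")
// -> { "user-agent": 1, "disallow": 1, "unicorns": 1 }
```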
What the distribution actually shows
Once the crawl completed, Gary queried the custom metrics dataset. The distribution of robots.txt directives shows an extremely sharp drop-off after the top three: allow, disallow, and user-agent. Even on a log scale, the decline is steep. Everything beyond those three appears in a small fraction of files.
There is a long tail of what Gary calls broken files — robots.txt endpoints that return HTML error pages, CSS, or other non-text responses. The script dutifully picked up “directives” like padding, img, color, and width from these malformed responses. Martin Splitt noted that filtering by HTTP 200 status and checking content-type would clean most of these out in a future version.
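A hedged illustration of that filtering, assuming the metric fetches /robots.txt itself (the real metric may obtain the response another way): only hand the body to the key extractor when the endpoint answers 200 with a plain-text content type.

```javascript
// Hypothetical sketch: gate the parse on status and content-type so HTML
// error pages and CSS never produce fake "directives" like color or width.
async function fetchRobotsBody(origin) {
  const response = await fetch(new URL('/robots.txt', origin));
  const contentType = (response.headers.get('content-type') || '').toLowerCase();
  if (response.status !== 200 || !contentType.startsWith('text/plain')) {
    return null; // broken endpoint: skip it rather than harvest noise
  }
  return response.text(); // safe to feed into a key extractor
}
```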
The Web Almanac’s existing robots.txt chapter adds further context to the picture:
- 84.9% of robots.txt endpoints in the crawl set return a 200 status. 13% return a 404. Everything else — timeouts, 3xx, 5xx — is under 1% each.
- AdsBot-Google appears in 9.8% of robots.txt files — more often than Googlebot, which appears in only 6.2%.
- Most files use a wildcard asterisk as the user-agent, meaning they write rules for all crawlers rather than targeting specific ones.
- Robots.txt file sizes are mostly between 0 and 100 KB.
What this means for the robots.txt documentation
The practical output: Google now has a ranked, data-backed list of which directives are actually in use across the web. Instead of adding one unsupported tag because a PR asked for it, the team can add the most prevalent undocumented directives in a single documentation update — giving developers a clear picture of what Search Console recognizes but does not act on.
This custom metric is now part of the standard HTTP Archive crawl. Any researcher can query it through BigQuery. Martin also noted it will likely feed directly into the SEO chapter of the 2026 Web Almanac.
What site owners should take from this
- Allow, disallow, and user-agent cover the overwhelming majority of all real robots.txt usage. Any directive beyond those three is edge-case territory. If you are using something more exotic, verify Google actually acts on it — the Search Console unsupported-tags list is where to check.
- AdsBot-Google showing up more than Googlebot is deliberate. AdsBot crawls ad landing pages independently of standard crawl budget. Sites running Google Ads commonly add AdsBot-specific rules to control where it goes. If you run paid campaigns and have never checked your robots.txt for AdsBot, you should.
- A robots.txt endpoint that returns HTML is a problem. If your CMS is serving a 200 with an HTML error page at /robots.txt instead of a proper text/plain response, parsers will either extract nonsense directives or ignore the file depending on how strictly they handle content-type. This is one of the basics a proper technical AEO foundations audit catches — check your actual response headers at /robots.txt, not just the page content.
- The HTTP Archive custom metrics dataset is publicly accessible. If you want to run your own analysis on robots.txt patterns across the web, the data Gary built is now in the BigQuery dataset. Be aware of query costs and set spending limits before you run anything at scale.
Takeaway for marketing professionals
robots.txt is not just a Googlebot concern. Every AI answer engine — ChatGPT’s web crawler, Perplexity’s crawler, Claude’s web access, Bing’s crawler feeding Copilot — checks your robots.txt before deciding what to index and cite. The directive patterns Google just mapped across 16 million sites apply equally to the crawlers that determine whether your brand appears in AI-generated answers.
The HTTP Archive data makes one thing particularly relevant for brands investing in Answer Engine Optimization: most sites write wildcard user-agent rules. A single Disallow: / under User-agent: * blocks every crawler on the web — Googlebot, GPTBot, ClaudeBot, PerplexityBot — in one line. If your robots.txt has broad wildcard blocks for staging paths, internal tools, or parameter URLs, verify those blocks are not catching paths that AI crawlers need to reach to understand your brand’s context and expertise.
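For illustration only (paths and crawler names are placeholders), this is how a wildcard group and a named-crawler group interact. Under the robots.txt standard, a crawler obeys the most specific group that matches its user agent, so naming GPTBot replaces the wildcard rules for that crawler rather than adding to them:

```
# Illustrative robots.txt, not a recommendation.
User-agent: *
Disallow: /staging/
Disallow: /internal/

# GPTBot ignores the * group above and follows only this group,
# so any path it should still avoid must be repeated here.
User-agent: GPTBot
Disallow: /internal/
```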
Three checks worth doing now:
- Audit your wildcard rules against AI crawlers. Pull your live robots.txt and check whether any Disallow rules under User-agent: * prevent GPTBot, ClaudeBot, or PerplexityBot from reaching your most important pages. If they do, add explicit allow rules for those crawlers or restructure the block.
- Consider whether AI crawlers should be explicitly named. The Web Almanac data shows Googlebot appears in only 6.2% of robots.txt files — most sites rely on wildcard rules. That worked when only Googlebot determined search visibility. Now that a dozen AI crawlers each power a different answer engine, named-crawler rules give you precise control over which engines can cite your content and which cannot.
- A broken robots.txt is a brand visibility problem, not just a technical one. If your /robots.txt endpoint returns an HTML error page, AI crawlers either ignore the file entirely or parse it and extract nonsense. Either outcome degrades your brand’s presence in AI answers. Tracking how AI engines represent your brand is only meaningful if those engines can actually crawl what you want them to see — and robots.txt is the front door.
Source
- Google Search Central — Search Off the Record: Analysing Robots.txt at scale with HTTP Archive and BigQuery