
AI crawling is the act of an AI engine fetching content from the open web, either to train a language model or to retrieve information at query time. The primitive is the same: an HTTP request from a recognizable user agent. But the purpose, frequency, and stakes differ from traditional search crawling.
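Concretely, a single crawl is one HTTP request whose User-Agent header declares the bot. On the wire it looks something like the sketch below; the token and version string follow the shape OpenAI publishes for GPTBot, but treat the exact values (and the path) as illustrative:

    GET /size-guide HTTP/1.1
    Host: www.example.com
    User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.2; +https://openai.com/gptbot
    Accept: text/html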

The two purposes

AI crawling splits into two categories that should be considered separately:

  • Training crawls — bulk collection of content used to train or update a language model. Infrequent, slow-moving, and indirect in their impact on visibility. A page absorbed into training data may influence model outputs for years; it has no direct citation effect.
  • Retrieval crawls — on-demand fetches at query time, when the engine needs current information to answer a user’s question. These directly produce citations. A page that’s reachable at query time can be cited; a page that isn’t, can’t.

A site can choose to allow one and block the other. See training vs crawling.
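As a sketch using OpenAI's published tokens (GPTBot collects content for training; OAI-SearchBot and ChatGPT-User handle search indexing and on-demand retrieval), a robots.txt that permits retrieval while opting out of training could look like this. Verify the current token names against each vendor's documentation before relying on them:

    # robots.txt sketch: opt out of training collection, keep retrieval open
    User-agent: GPTBot
    Disallow: /

    User-agent: OAI-SearchBot
    Allow: /

    User-agent: ChatGPT-User
    Allow: /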

Why AI crawling matters

The whole AEO pipeline depends on crawling working:

  • Pages that crawlers can’t reach can’t be retrieved.
  • Pages that crawlers can reach but can’t render correctly are partially or fully invisible.
  • Pages that crawlers reach and render but can’t parse cleanly produce weak retrieval.

Most AEO failures at the technical layer trace back to a crawling problem.
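One way to smoke-test all three failure modes is to fetch a page the way a crawler would: with the bot's user agent, a short timeout, and a check that critical copy is present in the raw HTML rather than injected client-side. A minimal sketch in Python using the requests library; the URL, user-agent string, and marker phrase are placeholders:

    import requests

    URL = "https://www.example.com/size-guide"   # page to test (placeholder)
    UA = "Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot"  # illustrative UA
    MARKER = "runs half a size small"            # phrase that must be server-rendered

    try:
        resp = requests.get(URL, headers={"User-Agent": UA}, timeout=5)
    except requests.RequestException as exc:
        print(f"UNREACHABLE: {exc}")             # failure mode 1: can't be retrieved
    else:
        if resp.status_code != 200:
            print(f"BLOCKED OR BROKEN: HTTP {resp.status_code}")
        elif MARKER not in resp.text:
            print("INVISIBLE: critical copy missing from raw HTML")  # failure modes 2 and 3
        else:
            print(f"OK in {resp.elapsed.total_seconds():.2f}s")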

What’s in this subsection

  • AI crawlers — the specific bots, their user agents, and how they behave.
  • llms.txt — the markdown discovery file designed for AI consumption.
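As a preview of that page, here is what a minimal llms.txt can look like under the llmstxt.org proposal: an H1 title, a one-line blockquote summary, then H2 sections of annotated links. The file contents and URLs below are hypothetical:

    # AwesomeShoes Co.
    > Direct-to-consumer running shoes; guides cover sizing, fit, and care.

    ## Guides
    - [Size guide](https://www.example.com/size-guide.md): how our sizes map to EU/US/UK
    - [Fit guidance](https://www.example.com/fit.md): widths, arch support, break-in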

The basic flow

For a typical AI engine answering a user query:

  1. The engine receives the query.
  2. The retrieval layer issues fetches for candidate URLs, often through the engine’s search partner (Bing for ChatGPT, Google for Gemini grounding, an internal index for Perplexity).
  3. The crawler bot fetches the URL with its declared user agent.
  4. The page is rendered, parsed, and chunked into passages.
  5. The engine selects passages and composes the answer.

The crawler step happens within seconds of the user query. Sites that respond slowly, render incompletely, or block the user agent get dropped from consideration.
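A toy sketch of steps 3 through 5, assuming nothing about any particular engine's internals: fetch with a declared user agent and a hard timeout, strip markup, chunk into fixed-size passages, and rank passages by term overlap with the query. Real engines use full rendering and learned rankers; this only illustrates why slow or client-rendered pages fall out at this stage.

    import re
    import requests

    UA = "Mozilla/5.0; compatible; ExampleAIBot/1.0"  # hypothetical declared user agent

    def fetch(url: str) -> str:
        """Step 3: fetch with a declared UA and a hard timeout (slow sites get dropped)."""
        resp = requests.get(url, headers={"User-Agent": UA}, timeout=5)
        resp.raise_for_status()
        return resp.text

    def chunk(html: str, size: int = 80) -> list[str]:
        """Step 4: strip tags and split the remaining text into fixed-size word passages."""
        text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping, enough for a sketch
        words = re.findall(r"\w+", text)
        return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

    def select(passages: list[str], query: str, k: int = 3) -> list[str]:
        """Step 5 input: rank passages by overlap with the query's terms."""
        terms = set(query.lower().split())
        return sorted(passages, key=lambda p: -len(terms & set(p.lower().split())))[:k]

    if __name__ == "__main__":
        html = fetch("https://www.example.com/size-guide")  # candidate URL (placeholder)
        best = select(chunk(html), "do AwesomeShoes trail runners run small")
        print(best[0][:200])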

Implementation example

AwesomeShoes Co.’s DevOps engineer notices that AI retrieval crawls spike during product launches, but some requests return timeouts on category pages. The support team also reports fewer answer-engine citations during the same period.

Implementation discussion: the DevOps engineer introduces response caching for high-traffic templates, the web engineer moves critical fit guidance out of client-side-only rendering so it appears in the initial HTML, and the AEO lead verifies that allowed AI user agents receive complete rendered content. The team then tracks crawl success rate and citation recovery on launch-week queries to confirm the fix is durable, not just a temporary performance gain.
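To make "crawl success rate" concrete, a sketch that scans a combined-format access log for known AI user-agent tokens and reports the share of 2xx responses per bot. The token list and log path are assumptions; extend both for your stack:

    import re
    from collections import Counter

    AI_TOKENS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "PerplexityBot", "ClaudeBot"]
    # Combined log format: ... "GET /x HTTP/1.1" 200 1234 "referrer" "user agent"
    LINE = re.compile(r'" (\d{3}) \S+ "[^"]*" "([^"]*)"')

    hits, oks = Counter(), Counter()
    with open("access.log") as log:  # log path is an assumption
        for line in log:
            m = LINE.search(line)
            if not m:
                continue
            status, ua = m.groups()
            for token in AI_TOKENS:
                if token in ua:
                    hits[token] += 1
                    if status.startswith("2"):
                        oks[token] += 1

    for token in AI_TOKENS:
        if hits[token]:
            print(f"{token}: {oks[token] / hits[token]:.0%} of {hits[token]} fetches OK")

A falling success rate for retrieval bots during launch weeks is exactly the signal the team in this example was chasing.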
