AI crawlers are the bots that AI engines use to fetch web content. Each major engine operates one or more, each with its own user agent, IP range, and purpose. Treating them as a single category — “AI bots” — leads to the wrong access rules. They differ in important ways.
Why each crawler is its own thing
A site that wants to be cited by ChatGPT needs to allow OAI-SearchBot and ChatGPT-User (the search and user-fetch bots), but may want to block GPTBot (the training bot) if it doesn’t want its content used for model training. These are three different decisions about the same vendor.
Without distinguishing the bots, sites end up either:
- Blocking everything by default, losing visibility entirely, or
- Allowing everything by default, including training crawls they could legitimately opt out of.
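For the OpenAI case above, the three decisions translate into per-bot robots.txt rules. A minimal sketch — robots.txt is allow-by-default, so the search and user-fetch bots strictly need no rule; the explicit Allow lines are included for clarity:

```
# Opt out of model training
User-agent: GPTBot
Disallow: /

# Stay visible in search and user-initiated fetches
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
```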
The major operators
Each operator is covered in detail elsewhere in this section. Quick summary:
- OpenAI — GPTBot (training), OAI-SearchBot (search index), ChatGPT-User (user-initiated fetches).
- Anthropic — ClaudeBot (training), Claude-User (user-initiated fetches via web search).
- Perplexity — PerplexityBot (search index), Perplexity-User (user-initiated fetches).
- Google — Google-Extended (controls AI training inclusion via existing Googlebot infrastructure).
- Microsoft — Bingbot (the same bot that powers traditional Bing search; AI features in Copilot and ChatGPT search ride on this index).
Plus a longer tail of smaller engines and aggregators.
Behavior patterns
AI crawlers fall into broad behavioral types:
- Training crawlers make slow, broad sweeps of the web. They respect robots.txt (usually) and obey crawl-delay directives. Blocking them affects training inclusion, not citation.
- Search crawlers maintain a fresh index. They behave like classic search crawlers — predictable, polite, identifiable.
- User-initiated fetches happen when a user asks a question that triggers a fetch. These are time-sensitive and may not respect robots.txt in the same way, since the request is on behalf of a user, not the bot.
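One practical use of this taxonomy is tagging server-log entries by crawler type. A minimal sketch — the user-agent tokens are the documented bot names listed above, while the function name and dictionary structure are illustrative. Note that Google-Extended is a robots.txt token, not a user agent, so it never appears in logs and is not matched here:

```python
# Map documented AI-crawler user-agent tokens to their behavioral type.
CRAWLER_TYPES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "OAI-SearchBot": "search",
    "PerplexityBot": "search",
    "ChatGPT-User": "user-fetch",
    "Claude-User": "user-fetch",
    "Perplexity-User": "user-fetch",
}

def classify_crawler(user_agent: str) -> str:
    """Return 'training', 'search', 'user-fetch', or 'other' for a UA string."""
    for token, crawler_type in CRAWLER_TYPES.items():
        if token in user_agent:
            return crawler_type
    return "other"
```

A log line claiming GPTBot then counts against training access, while one claiming OAI-SearchBot counts toward citation visibility.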
See crawler types for the full taxonomy.
What’s in this subsection
- List of AI crawlers — the canonical reference of user agents and IP ranges.
- Verify AI crawlers — confirming a request actually comes from the engine it claims.
- Crawler types — training vs search vs user-initiated.
- AI crawler allowlist — controlling which crawlers can reach the site.
- robots.txt for AI crawlers — the standard control file.
- Managing AI crawlers — operational handling.
- JavaScript and AI crawlers — rendering issues.
- Pagination and AI crawlers — pagination patterns and their failure modes.
The default policy question
Every site needs a stance on AI crawlers. The options:
- Allow all — maximize visibility, accept that content may be used for training.
- Allow search and user-fetch, block training — appear in citations, opt out of model training.
- Block all — opt out entirely.
Most sites benefit from option two. Pure-play publishers and rights-sensitive content owners often choose option three, at least temporarily, while licensing arrangements develop.
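Across the major operators, option two reduces to blocking only the training crawlers in robots.txt; everything else stays allowed by default. A sketch (Google-Extended appears here because it is a robots.txt token rather than a separate bot):

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```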
Implementation example
At AwesomeShoes Co., the infrastructure lead finds that the team accidentally blocked both OAI-SearchBot and ChatGPT-User while trying to stop training access from GPTBot. The business problem is clear: product comparison pages stop appearing in answer citations right before a seasonal campaign.
Implementation discussion: the AEO manager defines a per-bot policy, the DevOps engineer applies separate allow rules for search and user-initiated fetch bots, and the security engineer keeps training bots blocked per content-rights policy. They validate the fix with user-agent plus IP verification logs and monitor citation recovery on priority shoe-fit queries.
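The user-agent-plus-IP check the team runs can be sketched as follows. Several operators publish the IP ranges their bots crawl from; the CIDR ranges below are placeholders from the RFC 5737 documentation blocks, not real published values, and the function name is illustrative:

```python
import ipaddress

# Placeholder ranges; in practice, load the operator's published list
# (e.g. OpenAI publishes its crawler IP ranges as JSON).
PUBLISHED_RANGES = {
    "OAI-SearchBot": ["192.0.2.0/24"],
    "ChatGPT-User": ["198.51.100.0/24"],
}

def verify_crawler_ip(bot_name: str, client_ip: str) -> bool:
    """True if client_ip falls inside the published ranges for bot_name."""
    ip = ipaddress.ip_address(client_ip)
    return any(
        ip in ipaddress.ip_network(cidr)
        for cidr in PUBLISHED_RANGES.get(bot_name, [])
    )
```

A request whose user agent claims OAI-SearchBot but whose IP fails this check is spoofed and can be dropped without affecting real crawler access.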