
AI crawlers split into three behavioral categories: training crawlers, search crawlers, and user-initiated fetches. Each has a different purpose, a different cadence, and a different relationship to the operator’s products. The same site can reasonably allow some categories and block others.

Training crawlers

Training crawlers fetch content to be incorporated into model training datasets. Their behavior:

  • Slow, broad sweeps of the open web rather than focused fetches.
  • Generally respect robots.txt directives, including Crawl-delay.
  • Consume content for indirect, long-term effects — the model learns from the page, but a specific request to that model may or may not produce a citation referencing the page.
  • No real-time stake. Blocking a training crawler today only affects future model versions.

Examples: GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, CCBot (Common Crawl).

The decision to allow or block training crawlers comes down to:

  • Rights and licensing. Some publishers prefer to license training rights rather than offer them free.
  • Reputation. Some brands want to signal opt-out from AI training as part of their public stance.
  • Pragmatism. For most sites, allowing training contributes to long-term recognition by AI systems, similar to how Common Crawl participation shaped earlier web tooling.

See training vs crawling for the broader topic.
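A rights-driven opt-out for this category can be expressed directly in robots.txt. A minimal sketch using the training-crawler tokens listed above (grouping several User-agent lines over one rule set is valid under the Robots Exclusion Protocol); adjust the list to the bots you actually see in your logs:

```
# Block training crawlers site-wide; all other bots are unaffected.
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: CCBot
Disallow: /
```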

Search crawlers

Search crawlers maintain an index that the engine queries at user request time. Their behavior:

  • Continuous, predictable crawls like classic search bots.
  • Respect robots.txt and standard directives.
  • Direct AEO impact. A page in the index can be cited; a page not in the index cannot.
  • Standard-bot etiquette. Identifiable user agent, polite rates, willingness to honor crawl signals.

Examples: OAI-SearchBot, PerplexityBot, Bingbot, Googlebot, Claude-SearchBot.

Most sites should allow search crawlers unless they have specific reasons to opt out. Blocking a search crawler removes the site from that engine’s citation pool entirely.

User-initiated fetches

User-initiated fetch bots are triggered when an end user asks the AI engine a question that requires fetching a specific URL, often a URL the user named directly or that the engine identified as relevant in real time. Their behavior:

  • Bursty, on-demand fetches rather than systematic crawls.
  • May not respect robots.txt in the same way — the request is technically on behalf of the user, not the bot. Operators handle this differently.
  • Highest direct AEO impact for time-sensitive queries. When a user asks a question and the engine fetches a URL right then to answer, that fetch decides whether the page makes it into the answer.
  • Lower request volume per site than the other categories, but each request matters more.

Examples: ChatGPT-User, Claude-User, Perplexity-User.

Sites should generally allow user-initiated fetch bots even if they block training and search bots. Blocking these specifically denies users who explicitly asked for the page.
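That split — block training and search, but let explicitly requested fetches through — also maps cleanly onto robots.txt. A sketch using the bot tokens from the examples above (an empty Disallow permits everything for that group):

```
# Training and search crawlers: blocked
User-agent: GPTBot
User-agent: OAI-SearchBot
Disallow: /

# User-initiated fetches: allowed
User-agent: ChatGPT-User
Disallow:
```

Note that per the operator policies below, some user-fetch bots may not consult robots.txt at all, so the Allow side of this split is partly advisory.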

How operator policies handle the split

Each operator handles user-initiated requests slightly differently:

  • OpenAI documents ChatGPT-User as acting on behalf of a user — robots.txt rules targeting it are honored, but their stance is that these are user requests, not bot crawls.
  • Anthropic documents Claude-User as also subject to robots.txt but emphasizes it represents an explicit user request.
  • Perplexity explicitly states that Perplexity-User “generally ignores robots.txt since users initiated the requests.”

The operator differences matter when designing access policy. A robots.txt rule may be honored by one operator and ignored by another for the equivalent user-fetch bot.
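Because robots.txt is advisory and not every operator honors it for user-fetch bots, a policy that must hold regardless of operator behavior has to be enforced at the server. A minimal nginx sketch (server name and bot list are illustrative, mirroring the training-crawler tokens above) that denies training bots at the edge while letting everything else through:

```nginx
# Classify requests by user agent; only the listed training bots match.
map $http_user_agent $is_training_bot {
    default             0;
    ~*GPTBot            1;
    ~*ClaudeBot         1;
    ~*Google-Extended   1;
    ~*Applebot-Extended 1;
    ~*CCBot             1;
}

server {
    listen 80;
    server_name example.com;

    # Deny training bots; search and user-fetch bots pass through.
    if ($is_training_bot) {
        return 403;
    }
}
```

Server-side blocking is a harder signal than robots.txt: it cannot be ignored, but it also forfeits the goodwill of bots that would have complied voluntarily.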

Designing a policy

A common policy split:

| Category | Default | Reasoning |
|---|---|---|
| Training crawlers | Allow or block per content rights | Indirect long-term effect; rights-driven decision |
| Search crawlers | Allow | Direct citation impact; foundational to AEO |
| User-initiated fetches | Allow | Explicit user intent; blocking denies the user |

Diverging from this default needs a specific reason. Many sites end up with the right defaults by accident; the recommended path is to make the choice deliberately.
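The table's defaults fit in a single robots.txt. A sketch assuming the site blocks training on rights grounds, using the bot tokens named earlier in this article (trim or extend per the operators in your logs):

```
# Training crawlers: blocked per content rights
User-agent: GPTBot
User-agent: Google-Extended
User-agent: CCBot
Disallow: /

# Search crawlers: explicitly allowed
User-agent: OAI-SearchBot
User-agent: PerplexityBot
Disallow:

# User-initiated fetches: explicitly allowed
User-agent: ChatGPT-User
User-agent: Claude-User
Disallow:

# Everyone else: default allow
User-agent: *
Disallow:
```

Listing the allowed bots explicitly (rather than relying on the `*` group) matters because a bot that matches a named group ignores the wildcard group entirely — the named groups keep the policy stable even if `*` later gains restrictions.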

Implementation example

AwesomeShoes Co. needs a crawler policy that protects premium launch content while preserving answer visibility for core buying guides. The policy owner (head of content operations) works with legal, security, and ecommerce to define category-by-category crawler behavior.

Implementation discussion: training crawlers are blocked on rights-sensitive launch pages, search crawlers are allowed across indexable commerce and guide URLs, and user-initiated fetch bots are allowed to support direct shopper questions in assistants. The data analyst reviews citation presence, crawl access logs, and support-ticket trends to verify the policy is understandable, business-aligned, and producing useful outcomes.
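The analyst's crawl-log review can be sketched as a short script. This assumes combined-format access logs and uses the user-agent tokens from this article; the category map is illustrative, not exhaustive:

```python
import re
from collections import Counter

# Map user-agent substrings to the three crawler categories from this article.
CATEGORIES = {
    "training": ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended", "CCBot"],
    "search": ["OAI-SearchBot", "PerplexityBot", "Bingbot", "Googlebot", "Claude-SearchBot"],
    "user_fetch": ["ChatGPT-User", "Claude-User", "Perplexity-User"],
}

def classify(user_agent: str) -> str:
    """Return the crawler category for a user-agent string, or 'other'."""
    for category, tokens in CATEGORIES.items():
        if any(token in user_agent for token in tokens):
            return category
    return "other"

def tally(log_lines):
    """Count requests per category across combined-format log lines."""
    counts = Counter()
    for line in log_lines:
        # The user agent is the last double-quoted field in combined log format.
        quoted = re.findall(r'"([^"]*)"', line)
        user_agent = quoted[-1] if quoted else ""
        counts[classify(user_agent)] += 1
    return counts
```

Run against a day of logs, the tally shows whether blocked categories are actually staying away and whether allowed categories are fetching the pages the policy intends them to reach.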
