Training crawlers are AI bots that fetch content for inclusion in language model training datasets. They do not produce citations directly. Their effect is indirect and slow: content they ingest becomes part of what the model “knows,” shaping responses for years afterward.
What they do
Training crawlers operate like classic search crawlers from a request-pattern standpoint:
- Fetch URLs at a steady, slow rate.
- Respect robots.txt directives.
- Identify themselves with a clear user agent (see the sketch after this list).
- Recrawl periodically to capture changes.
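Because these bots identify themselves, they can be recognized from the User-Agent header of each request. A minimal sketch of that check in Python, using name substrings drawn from the table below (exact user-agent strings vary by operator and version, so consult each operator's documentation):

```python
# Rough classification of a request's User-Agent header against known
# training-crawler names (substring match; exact UA strings vary by operator).
TRAINING_CRAWLER_NAMES = [
    "GPTBot", "ClaudeBot", "CCBot", "Bytespider",
    "FacebookBot", "Meta-ExternalAgent", "Cohere-AI",
]

def is_training_crawler(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(name.lower() in ua for name in TRAINING_CRAWLER_NAMES)

# Example: is_training_crawler("Mozilla/5.0 (compatible; GPTBot/1.1)") -> True
```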
What’s different is the destination of the fetched content. Instead of feeding a search crawler’s index, the content goes into datasets used to train or update large language models. Once a page is included, the model retains a representation of its content even if the page later changes or is taken down.
Major training crawlers
| Bot | Operator | Notes |
|---|---|---|
| GPTBot | OpenAI | Default training crawler; respects robots.txt |
| ClaudeBot | Anthropic | Training crawler for Claude models |
| Google-Extended | Google | Token in robots.txt; doesn’t crawl, just signals AI training opt-out |
| Applebot-Extended | Apple | Token for opting out of Apple AI training without affecting Search |
| CCBot | Common Crawl | Open dataset; many models train on it |
| Bytespider | ByteDance | Training crawler for TikTok/Douyin AI |
| FacebookBot | Meta | Training crawls |
| Meta-ExternalAgent | Meta | AI features and training |
| Cohere-AI | Cohere | Training |
Some, like Google-Extended and Applebot-Extended, are not bots that actually crawl. They’re robots.txt tokens used to opt out of AI training while keeping classic search crawling intact.
The decision: allow or block
Allowing training crawlers means content may be incorporated into future model versions. Blocking them means the content is excluded from training but may still be retrieved by search crawlers and user-initiated fetches.
Reasons to allow:
- Long-term recognition. A model that has seen a brand’s content may surface it in answers even without retrieval at query time.
- Authority signaling. Brands consistently included in training datasets gain reputational presence in the wider AI ecosystem.
- No direct cost. Training crawlers are polite; they don’t significantly affect server load.
Reasons to block:
- Content rights. Publishers, journalists, and rights-holders may not want their work used to train models without compensation.
- Reputational stance. Some brands want to publicly signal opt-out as part of their position on AI.
- Competitive concern. Specialized content (proprietary research, internal documentation) may be valuable specifically because it’s not publicly trainable.
Many sites end up at “allow training but block training of premium or paid content,” using directory-level rules.
How blocking works
The standard mechanisms:
- robots.txt rules for AI crawlers: a `User-agent: GPTBot` group with `Disallow: /` opts out of training crawls. Most operators respect this.
- The Google-Extended token: opts out of Gemini and other Google AI training while keeping Search.
- The Applebot-Extended token: opts out of Apple AI training while keeping Apple search.
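Taken together, a robots.txt exercising all three mechanisms might look like the following. This is a sketch of a full opt-out; bot token names should be checked against current operator documentation before use.

```
# Opt out of OpenAI training crawls
User-agent: GPTBot
Disallow: /

# Signal-only token: opts out of Google AI training, does not affect Googlebot
User-agent: Google-Extended
Disallow: /

# Signal-only token: opts out of Apple AI training, does not affect Applebot search
User-agent: Applebot-Extended
Disallow: /
```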
For sites using a CMS, these are usually one-line additions to the live robots.txt. For sites with multiple environments, the rules need to live in version-controlled config and be tested on every deploy.
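A minimal sketch of such a deploy-time check, assuming Python is available in the pipeline. The domain, the paths, and the policy being verified (training crawlers blocked from a hypothetical /research/ section, allowed elsewhere) are placeholders for illustration:

```python
# Deploy-time check: fetch the live robots.txt and confirm the
# training-crawler rules behave as intended.
from urllib.robotparser import RobotFileParser

ROBOTS_URL = "https://www.example.com/robots.txt"  # live domain (placeholder)

def check_rules() -> bool:
    rp = RobotFileParser()
    rp.set_url(ROBOTS_URL)
    rp.read()  # fetches and parses the live file

    checks = [
        # (user agent, URL, expected to be allowed?)
        ("GPTBot", "https://www.example.com/research/fit-report", False),
        ("GPTBot", "https://www.example.com/guides/sizing", True),
        ("Googlebot", "https://www.example.com/research/fit-report", True),
    ]
    ok = True
    for agent, url, expected in checks:
        allowed = rp.can_fetch(agent, url)
        if allowed != expected:
            print(f"FAIL: {agent} -> {url}: allowed={allowed}, expected={expected}")
            ok = False
    return ok

if __name__ == "__main__":
    raise SystemExit(0 if check_rules() else 1)
```

Running this as a CI step on every deploy catches the common failure mode: an environment-specific robots.txt overwriting the production rules.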
What blocking does not do
- It doesn’t remove content from already-trained models. Once content is in a training set, retraining on a clean set is the only way to fully remove it, and that usually doesn’t happen until the next major model version.
- It doesn’t affect retrieval. A page blocked from training crawls can still be retrieved at query time by search crawlers and user-fetch bots.
- It doesn’t affect classic Search ranking. Google-Extended does not influence Googlebot.
Auditing training-crawler access
Quarterly checks:
- Confirm the intended training-crawler rules are in robots.txt for the live domain.
- Check operator documentation for new training-crawler bot names.
- Review server logs for training-crawler activity that shouldn’t be there; unexpected hits usually point to a robots.txt issue or an operator change (a sketch of this check follows the list).
- For premium content sections, confirm directory-level rules are still in place after any restructure.
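A minimal sketch of that log review, assuming combined-format access logs and a hypothetical premium prefix of /research/. The user-agent substrings come from the table above and should be kept in sync with operator documentation:

```python
# Scan an access log for training-crawler requests, flagging any hits to
# paths that policy says should be excluded from training crawls.
import re
import sys
from collections import Counter

TRAINING_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider", "Meta-ExternalAgent"]
PREMIUM_PREFIXES = ["/research/"]  # hypothetical premium section

# Combined log format: ... "GET /path HTTP/1.1" status size "referer" "user-agent"
LINE_RE = re.compile(r'"[A-Z]+ (?P<path>\S+) [^"]*" \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"')

def audit(log_path: str) -> None:
    hits = Counter()          # requests per training crawler
    premium_hits = Counter()  # requests to premium paths per crawler
    with open(log_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = LINE_RE.search(line)
            if not m:
                continue
            ua, path = m.group("ua"), m.group("path")
            for bot in TRAINING_CRAWLERS:
                if bot.lower() in ua.lower():
                    hits[bot] += 1
                    if any(path.startswith(p) for p in PREMIUM_PREFIXES):
                        premium_hits[bot] += 1
    for bot, count in hits.most_common():
        flag = f"  <-- {premium_hits[bot]} premium hits" if premium_hits[bot] else ""
        print(f"{bot}: {count} requests{flag}")

if __name__ == "__main__":
    audit(sys.argv[1])
```

Any premium hits in the output mean the directory-level rules are missing, mis-scoped, or being ignored, and warrant a closer look at the live robots.txt.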
Implementation example
AwesomeShoes Co. publishes paid fit-research reports and free buying guides. The content operations lead needs a policy that protects premium IP while still allowing broad brand understanding in AI systems.
Implementation discussion: the legal lead and SEO lead define the directory-level policy, and the platform engineer implements it by blocking training crawlers on premium research paths while keeping free guide paths open where policy allows. The team reviews logs quarterly to confirm premium sections remain excluded and that the treatment of public educational pages stays consistent with the company’s visibility goals.
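A sketch of the resulting robots.txt, using hypothetical directory names (/research/ for the paid reports, /guides/ for the free content) and a non-exhaustive set of training crawlers from the table above:

```
# Keep paid fit-research reports out of training crawls
User-agent: GPTBot
Disallow: /research/

User-agent: ClaudeBot
Disallow: /research/

User-agent: CCBot
Disallow: /research/

# Free buying guides under /guides/ remain open to all crawlers
```

The list of blocked agents would grow as new training-crawler names appear in operator documentation, which is exactly what the quarterly audit above is meant to catch.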