robots.txt is the public, advisory file at the root of a domain that tells crawlers what they’re allowed to fetch. For AI crawlers it works the same way it always has for search bots: a User-agent line followed by Allow and Disallow directives. The differences are that there are more bots to consider and that some operators handle the directives slightly differently.
What goes in robots.txt for AI crawlers
A typical opt-in policy for a site that wants AI visibility:
```
# Search and user-fetch bots: allow everything
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: Claude-User
Allow: /
User-agent: Bingbot
Allow: /
User-agent: Googlebot
Allow: /
# Training bots: allow
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
# Default for everything else
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
```
A typical “block training, allow citation” policy:
```
# Search and user-fetch bots: allow
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: Claude-User
Allow: /
# Training bots: block
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: CCBot
Disallow: /
# Default
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
```
Per-directory rules
Common patterns:
Block AI crawlers from premium content:
```
User-agent: GPTBot
Disallow: /premium/
Disallow: /paid-research/
User-agent: ClaudeBot
Disallow: /premium/
Disallow: /paid-research/
```
Block AI crawlers from staging or QA paths:
```
User-agent: *
Disallow: /staging/
Disallow: /qa/
```
Allow user-fetch bots while blocking systematic indexing:
```
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Allow: /
```
This last pattern works because user-initiated fetches are usually for specific URLs the user named, not for indexing. The site says “don’t index me, but if a user explicitly asks about a page, serve them.”
Operators that ignore robots.txt
Most major AI crawlers honor robots.txt for systematic crawls. The exceptions:
- Perplexity-User explicitly does not respect robots.txt because the request is on behalf of a user, not the bot.
- Some scrapers masquerading as AI crawler user agents ignore everything.
- Older or smaller AI bots with less mature compliance.
For requests that ignore robots.txt, the only defense is verification at the WAF or CDN level. Robots.txt is the polite signal; verification is the actual control (see verify AI crawlers).
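As a rough sketch of that verification layer (a hypothetical Python check, not any vendor's API): reverse-resolve the source IP, require a hostname suffix the operator documents, then forward-resolve to confirm the hostname maps back to the same IP. The suffixes below are Google's documented ones for Googlebot; most AI operators publish IP ranges rather than verification hostnames, and you would swap in an IP-range lookup at the same point.
```
import socket

def is_verified_crawler(ip: str, documented_suffixes: tuple[str, ...]) -> bool:
    """Reverse-then-forward DNS check for a request claiming a crawler UA.

    1. Reverse-resolve the source IP to a hostname.
    2. Require the hostname to end in a suffix the operator documents.
    3. Forward-resolve that hostname and require it to map back to the IP.
    """
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)  # reverse lookup
        if not hostname.endswith(documented_suffixes):
            return False
        _name, _aliases, forward_ips = socket.gethostbyname_ex(hostname)
        return ip in forward_ips  # forward-confirm
    except (socket.herror, socket.gaierror):
        return False  # unresolvable IPs fail verification

# Example: a request whose User-Agent claims to be Googlebot.
# For AI crawlers whose operators publish IP ranges instead of
# verification hostnames, replace this with an IP-range check.
print(is_verified_crawler("66.249.66.1", (".googlebot.com", ".google.com")))
```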
How AI engines read robots.txt
The mechanics are standard:
- The bot fetches `https://example.com/robots.txt`.
- It parses the file into rules per user agent.
- For each URL it considers fetching, it checks the rules for its own user agent (or `*` if there is no specific match).
- The most specific match wins.
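Those steps can be reproduced with Python's standard-library parser, as a quick sanity check against the placeholder domain used in this article. One caveat: `urllib.robotparser` applies rules in file order rather than strict longest-match, so it can disagree with RFC 9309 on overlapping `Allow`/`Disallow` pairs.
```
from urllib.robotparser import RobotFileParser

# Step 1-2: fetch and parse the live file.
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

# Step 3-4: check a candidate URL against the rules for one user agent
# (the parser falls back to the * group when there is no specific match).
print(rp.can_fetch("GPTBot", "https://example.com/guides/best-shoes"))
```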
A few engine-specific behaviors:
- OpenAI treats `GPTBot`, `OAI-SearchBot`, and `ChatGPT-User` as separate user agents. A `Disallow` rule for `GPTBot` does not apply to the others.
- Google treats `Google-Extended` as a separate token from `Googlebot`. Blocking `Google-Extended` opts the site out of Gemini training without affecting Search.
- Perplexity treats `PerplexityBot` (search) and `Perplexity-User` (user-initiated) as separate, with the user-initiated one ignoring rules as noted.
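That token separation is easy to confirm locally; a minimal sketch with the rules inlined:
```
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: GPTBot",
    "Disallow: /",
    "User-agent: *",
    "Allow: /",
])

# The GPTBot group does not bleed into other OpenAI agents:
print(rp.can_fetch("GPTBot", "https://example.com/post"))         # False
print(rp.can_fetch("OAI-SearchBot", "https://example.com/post"))  # True (falls through to *)
```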
Common mistakes
- Using `*` to block AI crawlers without realizing it blocks Googlebot too. A blanket `User-agent: * / Disallow: /AIBots/` does nothing, because robots.txt doesn't pattern-match user agents that way; you have to list each one.
- Inconsistent rules between staging and production. A staging robots.txt that blocks everything ends up in production after a deploy, and visibility collapses overnight.
- Trailing slash mismatches. Rules are path prefixes: `Disallow: /admin/` blocks everything under `/admin/` but not `/admin` itself, while `Disallow: /admin` blocks both (and anything else starting with `/admin`, such as `/administrator`). Pick the prefix that matches your intent; see the sketch after this list.
- Treating robots.txt as a security control. It's advisory. Sensitive content should be authenticated, not just disallowed in robots.txt.
- Forgetting to include the sitemap line. `Sitemap: https://example.com/sitemap.xml` helps crawlers discover the site structure.
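Here is the trailing-slash sketch promised above, using the same standard-library parser:
```
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",  # prefix match: covers /admin/... but not /admin itself
])

print(rp.can_fetch("*", "https://example.com/admin"))           # True  (not covered)
print(rp.can_fetch("*", "https://example.com/admin/settings"))  # False (covered)
```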
Validation
After any robots.txt change:
- Fetch the file and confirm it’s syntactically clean.
- Test specific user agent / URL combinations using a robots.txt tester.
- Check Google Search Console’s robots.txt report for parse errors.
- Run a crawl from a tool that respects robots.txt and confirm the expected URLs are reachable.
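The user-agent/URL testing step can live in CI. A minimal sketch using only the standard library, with made-up expectations for the placeholder domain; adjust the list to your own policy intent:
```
import sys
from urllib.robotparser import RobotFileParser

# (agent, url, expected_allowed) -- encode your policy intent here.
EXPECTATIONS = [
    ("GPTBot", "https://example.com/premium/report", False),
    ("OAI-SearchBot", "https://example.com/guides/", True),
    ("PerplexityBot", "https://example.com/guides/", True),
]

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

failed = False
for agent, url, expected in EXPECTATIONS:
    actual = rp.can_fetch(agent, url)
    if actual != expected:
        failed = True
        print(f"FAIL: {agent} on {url}: allowed={actual}, expected {expected}")

sys.exit(1 if failed else 0)
```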
Where to keep robots.txt
The file lives at https://example.com/robots.txt and nowhere else. Common platform-specific notes:
- WordPress: the default robots.txt is generated by the platform; an "edit robots.txt" plugin or a theme override is needed for custom rules.
- Static site generators: include robots.txt in the build output.
- Multi-region sites: each region’s domain needs its own robots.txt; rules don’t inherit across domains (coordinate with multilingual AEO).
Implementation example
AwesomeShoes Co. wants to keep training crawlers away from paid research pages while preserving visibility in answer engines for public buying guides. The SEO lead owns the robots policy, but implementation spans content, legal, and infrastructure teams.
Implementation discussion: the SEO lead defines per-bot directives, legal confirms rights-sensitive paths, and DevOps deploys a version-controlled robots.txt with environment checks to prevent staging rules leaking into production. The team validates with robots testers and live fetch checks so policy intent and production behavior stay aligned.