
robots.txt for AI Crawlers

robots.txt is the public, advisory file at the root of a domain that tells crawlers what they’re allowed to fetch. For AI crawlers it works the same way it always has for search bots: a User-agent line followed by Allow and Disallow directives. The differences are that there are more bots to consider and that some operators handle the directives slightly differently.

What goes in robots.txt for AI crawlers

A typical opt-in policy for a site that wants AI visibility:

```
# Search and user-fetch bots: allow everything
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Bingbot
Allow: /

User-agent: Googlebot
Allow: /

# Training bots: allow
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

# Default for everything else
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```

A typical “block training, allow citation” policy:

```
# Search and user-fetch bots: allow
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

# Training bots: block
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Default
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```

Per-directory rules

Common patterns:

Block AI crawlers from premium content:

```
User-agent: GPTBot
Disallow: /premium/
Disallow: /paid-research/

User-agent: ClaudeBot
Disallow: /premium/
Disallow: /paid-research/
```

Block AI crawlers from staging or QA paths:

```
User-agent: *
Disallow: /staging/
Disallow: /qa/
```

Allow user-fetch bots while blocking systematic indexing:

```
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Allow: /
```

This last pattern works because user-initiated fetches are usually for specific URLs the user named, not for indexing. The site says “don’t index me, but if a user explicitly asks about a page, serve them.”

Operators that ignore robots.txt

Most major AI crawlers honor robots.txt for systematic crawls. The exceptions:

  • Perplexity-User explicitly does not respect robots.txt because the request is on behalf of a user, not the bot.
  • Some scrapers masquerading as AI crawler user agents ignore everything.
  • Older or smaller AI bots with less mature compliance.

For requests that ignore robots.txt, the only defense is verification at the WAF or CDN level. Robots.txt is the polite signal; verification is the actual control (see verify AI crawlers).
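One common form of that verification is forward-confirmed reverse DNS (FCrDNS): reverse-resolve the client IP, check the hostname against the operator's published domains, then forward-resolve the hostname and confirm it maps back to the same IP. A minimal sketch, assuming an illustrative suffix list (use each operator's documented verification domains, not these):

```python
# FCrDNS sketch: reverse lookup, hostname suffix check, forward
# confirmation. The suffix tuple is an assumption for illustration;
# real deployments should use the operator-published domains and
# cache results.
import socket

ASSUMED_SUFFIXES = (".googlebot.com", ".openai.com", ".anthropic.com")

def verify_crawler_ip(ip: str, suffixes=ASSUMED_SUFFIXES) -> bool:
    try:
        hostname = socket.gethostbyaddr(ip)[0]           # reverse lookup
    except OSError:
        return False
    if not hostname.endswith(suffixes):
        return False                                     # wrong operator
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward confirm
    except OSError:
        return False
    return ip in forward_ips
```

A WAF or CDN rule would run a cached check like this before trusting a crawler user-agent string; requests that fail fall through to rate limiting or a block.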

How AI engines read robots.txt

The mechanics are standard:

  1. Bot fetches https://example.com/robots.txt.
  2. Parses the file into rules per user agent.
  3. For each URL it considers fetching, it checks the rules for its own user agent (or * if no specific match).
  4. Most-specific match wins.

A few engine-specific behaviors:

  • OpenAI treats GPTBot, OAI-SearchBot, and ChatGPT-User as separate user agents. A Disallow rule for GPTBot does not apply to the others.
  • Google treats Google-Extended as a separate token from Googlebot. Blocking Google-Extended opts the site out of Gemini training without affecting Search.
  • Perplexity treats PerplexityBot (search) and Perplexity-User (user-initiated) as separate, with the user-initiated one ignoring rules as noted.
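The per-token separation can be spot-checked with Python's stdlib robots.txt parser; it is only a rough approximation of production crawler parsers, but good enough to confirm that a GPTBot rule does not spill over to OpenAI's other agents:

```python
# Sketch: a Disallow group for GPTBot leaves OAI-SearchBot and
# ChatGPT-User on the catch-all * group.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines())

for agent in ("GPTBot", "OAI-SearchBot", "ChatGPT-User"):
    print(agent, rp.can_fetch(agent, "https://example.com/post"))
# GPTBot False, OAI-SearchBot True, ChatGPT-User True
```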

Common mistakes

  • Blocking AI crawlers with the wildcard. User-agent: * with Disallow: / shuts out Googlebot and every other well-behaved bot, not just AI crawlers, and the * token is the only wildcard robots.txt supports in a User-agent line; bot names are never pattern-matched. You have to list each crawler by name.
  • Inconsistent rules between staging and production. A staging robots.txt that blocks everything ends up in production after a deploy, and visibility collapses overnight.
  • Trailing slash confusion. Disallow rules are path-prefix matches: Disallow: /admin blocks /admin, /admin/, and even /administrator, while Disallow: /admin/ blocks everything under the directory but not /admin itself. Pick the prefix deliberately.
  • Treating robots.txt as a security control. It’s advisory. Sensitive content should be authenticated, not just disallowed in robots.txt.
  • Forgetting to include the sitemap line. Sitemap: https://example.com/sitemap.xml helps crawlers discover the site structure.
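The prefix behavior is easy to confirm with Python's stdlib parser (individual crawlers may differ at the margins, but prefix matching is the standard):

```python
# Demonstrates that Disallow values are path-prefix matches, not
# exact-path or directory-only matches.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /admin
""".splitlines())

rp.can_fetch("*", "https://example.com/admin")          # blocked
rp.can_fetch("*", "https://example.com/admin/users")    # blocked (prefix)
rp.can_fetch("*", "https://example.com/administrator")  # blocked too
rp.can_fetch("*", "https://example.com/about")          # allowed
```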

Validation

After any robots.txt change:

  • Fetch the file and confirm it’s syntactically clean.
  • Test specific user agent / URL combinations using a robots.txt tester.
  • Check Google Search Console’s robots.txt report for parse errors.
  • Run a crawl from a tool that respects robots.txt and confirm the expected URLs are reachable.
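The user agent / URL checks can also run as a pre-deploy gate. A sketch, assuming the policy text is read from the build output and that the agents and URLs below stand in for the real site's:

```python
# CI-style regression check: parse the robots.txt about to ship and
# assert the policy intent, e.g. that a Gemini-training opt-out does
# not spill over to Search. The literal policy here is illustrative;
# in CI, read it from the build artifact.
from urllib.robotparser import RobotFileParser

robots_text = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_text.splitlines())

assert not rp.can_fetch("Google-Extended", "https://example.com/guide")
assert rp.can_fetch("Googlebot", "https://example.com/guide")
```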

Where to keep robots.txt

The file lives at https://example.com/robots.txt and nowhere else. Common platform-specific notes:

  • WordPress: the default robots.txt is a virtual file generated by the platform; custom rules require an SEO plugin with a robots.txt editor or a physical robots.txt file placed in the web root.
  • Static site generators: include robots.txt in the build output.
  • Multi-region sites: each region’s domain needs its own robots.txt; rules don’t inherit across domains (coordinate with multilingual AEO).

Implementation example

AwesomeShoes Co. wants to keep training crawlers away from paid research pages while preserving visibility in answer engines for public buying guides. The SEO lead owns the robots policy, but implementation spans content, legal, and infrastructure teams.

Implementation discussion: the SEO lead defines per-bot directives, legal confirms rights-sensitive paths, and DevOps deploys a version-controlled robots.txt with environment checks to prevent staging rules leaking into production. The team validates with robots testers and live fetch checks so policy intent and production behavior stay aligned.
