
An AI crawler allowlist is the explicit set of bots a site has decided to permit, enforced at the network and application layers. The allowlist is the operational counterpart to robots.txt for AI crawlers: where robots.txt is advisory, the allowlist actually controls what reaches the application.

Why an allowlist matters

Robots.txt is a request to well-behaved bots. It does not stop scrapers, spoofed user agents, or operators that ignore it. Real access control happens at:

  • The CDN edge (Cloudflare, Fastly, Akamai).
  • The Web Application Firewall (Cloudflare WAF, AWS WAF, custom rules).
  • The application server (rate limiting, IP filtering).

Each of these has rules that decide which requests get through. The allowlist is the canonical list of crawlers that should pass through every layer.
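As an illustration of the application-server layer, the check below is a minimal sketch of why a user-agent match alone is not enough: the request must also come from a range the operator publishes. The `ALLOWLIST` mapping, the function name, and the CIDR blocks (documentation-only addresses, not real crawler IPs) are assumptions made for the example.

```python
import ipaddress

# Hypothetical allowlist: user-agent token -> CIDR ranges published by the operator.
# These ranges are documentation-only addresses (RFC 5737), not real crawler IPs.
ALLOWLIST = {
    "GPTBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24"],
}


def is_verified_crawler(user_agent: str, client_ip: str, allowlist=ALLOWLIST) -> bool:
    """Return True only if the user agent names an allowlisted crawler
    AND the client IP falls inside one of that crawler's ranges."""
    try:
        ip = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # malformed address: treat as unverified
    ua = user_agent.lower()
    for token, cidrs in allowlist.items():
        if token.lower() in ua:
            return any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)
    return False  # user agent does not match any allowlisted crawler


print(is_verified_crawler("Mozilla/5.0 (compatible; GPTBot/1.1)", "192.0.2.10"))   # True
print(is_verified_crawler("Mozilla/5.0 (compatible; GPTBot/1.1)", "203.0.113.5"))  # False (spoofed UA)
```

The same two-part test (claimed identity plus source range) is what the CDN and WAF rules would express in their own syntax.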

Designing the allowlist

The allowlist is derived from policy decisions:

  1. Decide which crawler types to permit (training / search / user-initiated).
  2. List the specific user agents for each permitted operator.
  3. Pair each user agent with its IP range source for verification.
  4. Document any exceptions (premium content directories, region-specific blocks, etc.).
  5. Translate the allowlist into the rule format each layer requires, especially WAF configuration.
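Step 5 can be sketched as a small generator that turns allowlist entries into WAF block expressions. The expression syntax here is modelled on Cloudflare's Rules language (`http.user_agent contains`, `ip.src in {...}`); the `Crawler` class, the function names, and the CIDR values are hypothetical, and the output should be checked against the documentation of whichever WAF is actually in use.

```python
from dataclasses import dataclass, field


@dataclass
class Crawler:
    name: str      # user-agent token, e.g. "GPTBot"
    category: str  # "search", "user-initiated", or "training"
    policy: str    # "allow" or "block"
    cidrs: list[str] = field(default_factory=list)  # operator-published ranges


def block_expression(crawler: Crawler) -> str:
    """Build a block-rule expression for one crawler.

    Blocked crawlers are matched on user agent alone. Allowed crawlers
    get a rule that blocks requests claiming the user agent from
    outside the published ranges (likely spoofing)."""
    ua_match = f'http.user_agent contains "{crawler.name}"'
    if crawler.policy == "block":
        return f"({ua_match})"
    if not crawler.cidrs:
        raise ValueError(f"{crawler.name}: allowed crawler has no IP ranges to verify against")
    ranges = " ".join(crawler.cidrs)
    return f"({ua_match} and not ip.src in {{{ranges}}})"


def to_block_rules(crawlers: list[Crawler]) -> list[str]:
    return [block_expression(c) for c in crawlers]


# Example entries; the ranges are documentation-only placeholders.
rules = to_block_rules([
    Crawler("OAI-SearchBot", "search", "allow", ["192.0.2.0/24"]),
    Crawler("GPTBot", "training", "block"),
])
for rule in rules:
    print(rule)
```

Generating rules from the list, rather than editing them by hand in each console, keeps every layer consistent with the same policy decisions.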

A working allowlist template

```
# AI crawler allowlist - derived from policy
# Last updated: yyyy-mm-dd

[search-crawlers]
- name: OAI-SearchBot
  operator: OpenAI
  ip-source: openai.com/searchbot.json
  policy: allow
- name: PerplexityBot
  operator: Perplexity
  ip-source: perplexity.com/perplexitybot.json
  policy: allow
- name: Bingbot
  operator: Microsoft
  ip-source: bing.com/toolbox/bingbot.json
  policy: allow
- name: Googlebot
  operator: Google
  ip-source: developers.google.com/search/apis/ipranges/googlebot.json
  policy: allow

[user-initiated]
- name: ChatGPT-User
  operator: OpenAI
  ip-source: openai.com/chatgpt-user.json
  policy: allow
- name: Claude-User
  operator: Anthropic
  ip-source: anthropic.com/claude-user.json
  policy: allow
- name: Perplexity-User
  operator: Perplexity
  ip-source: perplexity.com/perplexity-user.json
  policy: allow

[training]
- name: GPTBot
  operator: OpenAI
  ip-source: openai.com/gptbot.json
  policy: block-or-allow-per-content-rights
- name: ClaudeBot
  operator: Anthropic
  ip-source: anthropic.com/claudebot.json
  policy: block-or-allow-per-content-rights
```

This file lives in version control and is the single source of truth for the security and infrastructure teams.
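Because the file is the source of truth, it helps if tooling can read it. The parser below is a minimal sketch for the template's layout (comment lines, `[section]` headers, entries that start with a list marker, and `key: value` fields); the function names are hypothetical, and a team that stores the list as YAML or JSON would use a standard parser instead.

```python
def parse_allowlist(text: str) -> dict[str, list[dict[str, str]]]:
    """Parse the allowlist template into {section: [entry, ...]}."""
    sections: dict[str, list[dict[str, str]]] = {}
    section = None
    entry = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            sections[section] = []
            entry = None
            continue
        starts_entry = line[0] in "-•"  # a list marker opens a new crawler entry
        if starts_entry:
            line = line[1:].strip()
        key, sep, value = line.partition(":")
        if not sep or section is None:
            raise ValueError(f"unparseable line: {raw!r}")
        if starts_entry:
            entry = {}
            sections[section].append(entry)
        elif entry is None:
            raise ValueError(f"field outside an entry: {raw!r}")
        entry[key.strip()] = value.strip()
    return sections


def unresolved_policies(sections: dict[str, list[dict[str, str]]]) -> list[str]:
    """Names of crawlers whose policy is not yet a concrete allow/block decision."""
    return [
        e.get("name", "?")
        for entries in sections.values()
        for e in entries
        if e.get("policy") not in ("allow", "block")
    ]
```

A check such as `unresolved_policies` could run in CI so that placeholder values like `block-or-allow-per-content-rights` are resolved before rules are generated.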

Implementation example: at AwesomeShoes Co., the security lead owns the allowlist file, the DevOps engineer maps each permitted crawler to edge and WAF rules, and the ecommerce manager flags premium catalog sections that require stricter access. The team reviews bot access outcomes weekly so policy, infrastructure behavior, and merchandising priorities stay aligned.

Keeping the allowlist current

IP ranges change. New crawlers appear. The allowlist needs to be:

  • Automated where possible: a scheduled job fetches each operator’s published ranges and updates the rules at every layer (see IP whitelisting).
  • Reviewed quarterly for new operators or behavior changes.
  • Tested after every deploy that touches infrastructure or routing.

Manually maintained allowlists go stale within weeks.
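The scheduled job could look roughly like the sketch below. It assumes the operator publishes a JSON document shaped like `{"prefixes": [{"ipv4Prefix": "..."}, {"ipv6Prefix": "..."}]}`; that layout is an assumption to verify per operator, and the function names are hypothetical.

```python
import ipaddress
import json
import urllib.request


def parse_prefixes(payload: str) -> set[str]:
    """Extract normalized CIDR strings from a published ranges document.

    Assumes the shape {"prefixes": [{"ipv4Prefix": ...}, {"ipv6Prefix": ...}]}."""
    data = json.loads(payload)
    cidrs = set()
    for item in data.get("prefixes", []):
        for key in ("ipv4Prefix", "ipv6Prefix"):
            if key in item:
                # ip_network validates the value; a bad entry raises ValueError.
                cidrs.add(str(ipaddress.ip_network(item[key], strict=False)))
    return cidrs


def fetch_prefixes(url: str, timeout: float = 10.0) -> set[str]:
    """Download and parse one operator's published ranges."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_prefixes(resp.read().decode("utf-8"))


def diff_ranges(current: set[str], published: set[str]) -> dict[str, list[str]]:
    """Ranges to add to and remove from the deployed rules."""
    return {
        "add": sorted(published - current),
        "remove": sorted(current - published),
    }
```

A job would call `fetch_prefixes` for each `ip-source`, pass the result to `diff_ranges` with the currently deployed ranges, and open a change for review or apply it automatically when the diff is non-empty.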

What an allowlist is not

  • Not a substitute for robots.txt. Robots.txt remains the public signal of intent.
  • Not a complete defense. Sophisticated scrapers can spoof user agents and rotate IPs. The allowlist denies the easy abuses, not the determined ones.
  • Not a replacement for content access control. Paywalls, authentication, and per-page directives still apply.

Before shipping changes, the team asks: does this policy make sense for current business goals, does it solve the visibility problem, and is the rule set clear enough for another engineer to operate safely?
