An AI crawler allowlist is the explicit set of bots a site has decided to permit, enforced at the network and application layers. The allowlist is the operational counterpart to robots.txt for AI crawlers: where robots.txt is advisory, the allowlist actually controls what reaches the application.
Why an allowlist matters
Robots.txt is a request to well-behaved bots. It does not stop scrapers, spoofed user agents, or operators that ignore it. Real access control happens at:
- The CDN edge (Cloudflare, Fastly, Akamai).
- The Web Application Firewall (Cloudflare WAF, AWS WAF, custom rules).
- The application server (rate limiting, IP filtering).
Each of these has rules that decide which requests get through. The allowlist is the canonical list of crawlers that should pass through every layer.
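To make the application-layer piece concrete, here is a minimal Python sketch: it checks whether a request whose user agent claims to be an allowlisted crawler actually originates from that operator's published IP ranges. The `ALLOWLIST_RANGES` structure and the CIDR blocks are illustrative placeholders, not real operator data; in practice the ranges come from the published feeds listed later in the template.

```python
# Minimal sketch of an application-layer allowlist check: a request that
# claims to be an allowlisted crawler is accepted only if its source IP
# falls inside that operator's published ranges. Ranges here are
# documentation placeholders, not real operator data.
import ipaddress

# user-agent substring -> CIDR ranges loaded from the operator's published feed
ALLOWLIST_RANGES = {
    "OAI-SearchBot": ["192.0.2.0/24"],      # placeholder range
    "PerplexityBot": ["198.51.100.0/24"],   # placeholder range
}

def is_verified_crawler(user_agent: str, client_ip: str) -> bool:
    """Return True only if the claimed crawler's IP is inside its published ranges."""
    ip = ipaddress.ip_address(client_ip)
    for bot, cidrs in ALLOWLIST_RANGES.items():
        if bot in user_agent:
            return any(ip in ipaddress.ip_network(c) for c in cidrs)
    return False  # unknown bots fall through to normal traffic handling

# A matching user agent from an unlisted IP is treated as a spoof and rejected.
print(is_verified_crawler("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)", "192.0.2.7"))    # True
print(is_verified_crawler("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)", "203.0.113.9"))  # False
```

The same check can run at the edge or in the WAF; the point is that the allowlist, not the user-agent string alone, decides what gets through.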
What’s in this subsection
- WAF configuration — concrete WAF rules for the major platforms.
- IP whitelisting — keeping IP-based rules current with operator-published ranges.
Designing the allowlist
The allowlist is derived from policy decisions:
- Decide which crawler types to permit (training / search / user-initiated).
- List the specific user agents for each permitted operator.
- Pair each user agent with its IP range source for verification.
- Document any exceptions (premium content directories, region-specific blocks, etc.).
- Translate the allowlist into the rule format each layer requires, especially WAF configuration.
A working allowlist template
```yaml
# AI crawler allowlist — derived from policy
# Last updated: yyyy-mm-dd

search-crawlers:
  - name: OAI-SearchBot
    operator: OpenAI
    ip-source: openai.com/searchbot.json
    policy: allow
  - name: PerplexityBot
    operator: Perplexity
    ip-source: perplexity.com/perplexitybot.json
    policy: allow
  - name: Bingbot
    operator: Microsoft
    ip-source: bing.com/toolbox/bingbot.json
    policy: allow
  - name: Googlebot
    operator: Google
    ip-source: developers.google.com/search/apis/ipranges/googlebot.json
    policy: allow

user-initiated:
  - name: ChatGPT-User
    operator: OpenAI
    ip-source: openai.com/chatgpt-user.json
    policy: allow
  - name: Claude-User
    operator: Anthropic
    ip-source: anthropic.com/claude-user.json
    policy: allow
  - name: Perplexity-User
    operator: Perplexity
    ip-source: perplexity.com/perplexity-user.json
    policy: allow

training:
  - name: GPTBot
    operator: OpenAI
    ip-source: openai.com/gptbot.json
    policy: block-or-allow-per-content-rights
  - name: ClaudeBot
    operator: Anthropic
    ip-source: anthropic.com/claudebot.json
    policy: block-or-allow-per-content-rights
```
This file lives in version control and is the single source of truth for the security and infrastructure teams.
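As a rough illustration of the translation step, the sketch below parses the template (assuming it is saved as valid YAML under a hypothetical name like `ai-crawler-allowlist.yaml`) and emits a user-agent match expression in the general shape of Cloudflare's rules language. The file name, section handling, and deployment details are assumptions; real rules should also pin each user agent to verified IPs or ASNs rather than matching on the string alone.

```python
# Rough sketch of the translation step: read the allowlist file and emit a
# user-agent match expression for a WAF rule. The expression follows the
# general shape of Cloudflare's rules language; treat it as illustrative and
# check the target platform's documentation before deploying.
# Requires PyYAML (pip install pyyaml).
import yaml

with open("ai-crawler-allowlist.yaml") as f:   # hypothetical path to the template above
    allowlist = yaml.safe_load(f)

allowed = [
    entry["name"]
    for section in ("search-crawlers", "user-initiated")
    for entry in allowlist.get(section, [])
    if entry["policy"] == "allow"
]

# One clause per permitted crawler; the surrounding rule would skip bot
# mitigation for matches and leave everything else to the default policy.
expression = " or ".join(f'(http.user_agent contains "{name}")' for name in allowed)
print(expression)
# (http.user_agent contains "OAI-SearchBot") or (http.user_agent contains "PerplexityBot") or ...
```

Generating rules from the file, rather than hand-editing them in each console, keeps the edge, WAF, and application layers in sync with the single source of truth.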
Implementation example: at AwesomeShoes Co., the security lead owns the allowlist file, the DevOps engineer maps each permitted crawler to edge and WAF rules, and the ecommerce manager flags premium catalog sections that require stricter access. The team reviews bot access outcomes weekly so policy, infrastructure behavior, and merchandising priorities stay aligned.
Keeping the allowlist current
IP ranges change. New crawlers appear. The allowlist needs to be:
- Automated where possible: a scheduled job fetches each operator’s published ranges and updates the rules (see IP whitelisting, and the sketch below).
- Reviewed quarterly for new operators or behavior changes.
- Tested after every deploy that touches infrastructure or routing.
Manually maintained allowlists go stale within weeks.
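A minimal sketch of that scheduled job, using only the Python standard library: it fetches each published feed listed in the allowlist and extracts the CIDR prefixes that the downstream rules consume. The feed URLs are the ones from the template above, and the `prefixes` schema is an assumption modeled on Google's published format; confirm each operator's actual schema before wiring this into rule updates.

```python
# Sketch of a scheduled refresh job: pull each operator's published range feed
# and extract the CIDR lists that the edge and WAF rules consume. The schema
# assumed here (Google-style "prefixes" entries with ipv4Prefix/ipv6Prefix
# keys) is followed by several operators, but verify each feed before use.
import json
import urllib.request

FEEDS = {
    "OAI-SearchBot": "https://openai.com/searchbot.json",
    "GPTBot": "https://openai.com/gptbot.json",
    "Googlebot": "https://developers.google.com/search/apis/ipranges/googlebot.json",
}

def fetch_ranges(url: str) -> list[str]:
    """Download one feed and return its IPv4/IPv6 prefixes."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    return [
        p.get("ipv4Prefix") or p.get("ipv6Prefix")
        for p in data.get("prefixes", [])
    ]

if __name__ == "__main__":
    for bot, url in FEEDS.items():
        ranges = fetch_ranges(url)
        print(f"{bot}: {len(ranges)} prefixes")  # feed the result into rule updates
```

Run on a schedule, diffed against the previous fetch, and pushed through the normal deploy pipeline, this is what keeps the IP-based rules from drifting.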
What an allowlist is not
- Not a substitute for robots.txt. Robots.txt remains the public signal of intent.
- Not a complete defense. Sophisticated scrapers can spoof user agents and rotate IPs. The allowlist denies the easy abuses, not the determined ones.
- Not a replacement for content access control. Paywalls, authentication, and per-page directives still apply.
Before shipping changes, the team asks: does this policy make sense for current business goals, does it solve the visibility problem, and is the rule set clear enough for another engineer to operate safely?