An AI crawler allowlist is the explicit set of bots a site has decided to permit, enforced at the network and application layers. The allowlist is the operational counterpart to robots.txt for AI crawlers: where robots.txt is advisory, the allowlist actually controls what reaches the application.
Why an allowlist matters
Robots.txt is a request to well-behaved bots. It does not stop scrapers, spoofed user agents, or operators that ignore it. Real access control happens at:
- The CDN edge (Cloudflare, Fastly, Akamai).
- The Web Application Firewall (Cloudflare WAF, AWS WAF, custom rules).
- The application server (rate limiting, IP filtering).
Each of these has rules that decide which requests get through. The allowlist is the canonical list of crawlers that should pass through every layer.
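To make the application-layer piece concrete, here is a minimal Python sketch: it checks whether a request whose user agent claims to be an allowlisted crawler actually originates from that operator's published IP ranges. The `ALLOWLIST_RANGES` structure and the CIDR blocks are illustrative placeholders, not real operator data; in practice the ranges come from the published feeds listed later in the template.

```python
# Minimal sketch of an application-layer allowlist check: a request that
# claims to be an allowlisted crawler is accepted only if its source IP
# falls inside that operator's published ranges. Ranges here are
# documentation placeholders, not real operator data.
import ipaddress

# user-agent substring -> CIDR ranges loaded from the operator's published feed
ALLOWLIST_RANGES = {
    "OAI-SearchBot": ["192.0.2.0/24"],      # placeholder range
    "PerplexityBot": ["198.51.100.0/24"],   # placeholder range
}

def is_verified_crawler(user_agent: str, client_ip: str) -> bool:
    """Return True only if the claimed crawler's IP is inside its published ranges."""
    ip = ipaddress.ip_address(client_ip)
    for bot, cidrs in ALLOWLIST_RANGES.items():
        if bot in user_agent:
            return any(ip in ipaddress.ip_network(c) for c in cidrs)
    return False  # unknown bots fall through to normal traffic handling

# A matching user agent from an unlisted IP is treated as a spoof and rejected.
print(is_verified_crawler("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)", "192.0.2.7"))    # True
print(is_verified_crawler("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)", "203.0.113.9"))  # False
```

The same check can run at the edge or in the WAF; the point is that the allowlist, not the user-agent string alone, decides what gets through.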
What’s in this subsection
- WAF configuration — concrete WAF rules for the major platforms.
- IP whitelisting — keeping IP-based rules current with operator-published ranges.
Designing the allowlist
The allowlist is derived from policy decisions:
- Decide which crawler types to permit (training / search / user-initiated).
- List the specific user agents for each permitted operator.
- Pair each user agent with its IP range source for verification.
- Document any exceptions (premium content directories, region-specific blocks, etc.).
- Translate the allowlist into the rule format each layer requires, especially WAF configuration.
A working allowlist template
```yaml
# AI crawler allowlist — derived from policy
# Last updated: yyyy-mm-dd

search-crawlers:
  - name: OAI-SearchBot
    operator: OpenAI
    ip-source: openai.com/searchbot.json
    policy: allow
  - name: PerplexityBot
    operator: Perplexity
    ip-source: perplexity.com/perplexitybot.json
    policy: allow
  - name: Bingbot
    operator: Microsoft
    ip-source: bing.com/toolbox/bingbot.json
    policy: allow
  - name: Googlebot
    operator: Google
    ip-source: developers.google.com/search/apis/ipranges/googlebot.json
    policy: allow

user-initiated:
  - name: ChatGPT-User
    operator: OpenAI
    ip-source: openai.com/chatgpt-user.json
    policy: allow
  - name: Claude-User
    operator: Anthropic
    ip-source: anthropic.com/claude-user.json
    policy: allow
  - name: Perplexity-User
    operator: Perplexity
    ip-source: perplexity.com/perplexity-user.json
    policy: allow

training:
  - name: GPTBot
    operator: OpenAI
    ip-source: openai.com/gptbot.json
    policy: block-or-allow-per-content-rights
  - name: ClaudeBot
    operator: Anthropic
    ip-source: anthropic.com/claudebot.json
    policy: block-or-allow-per-content-rights
```
This file lives in version control and is the single source of truth for the security and infrastructure teams.
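As a rough illustration of the translation step, the sketch below parses the template (assuming it is saved as valid YAML under a hypothetical name like `ai-crawler-allowlist.yaml`) and emits a user-agent match expression in the general shape of Cloudflare's rules language. The file name, section handling, and deployment details are assumptions; real rules should also pin each user agent to verified IPs or ASNs rather than matching on the string alone.

```python
# Rough sketch of the translation step: read the allowlist file and emit a
# user-agent match expression for a WAF rule. The expression follows the
# general shape of Cloudflare's rules language; treat it as illustrative and
# check the target platform's documentation before deploying.
# Requires PyYAML (pip install pyyaml).
import yaml

with open("ai-crawler-allowlist.yaml") as f:   # hypothetical path to the template above
    allowlist = yaml.safe_load(f)

allowed = [
    entry["name"]
    for section in ("search-crawlers", "user-initiated")
    for entry in allowlist.get(section, [])
    if entry["policy"] == "allow"
]

# One clause per permitted crawler; the surrounding rule would skip bot
# mitigation for matches and leave everything else to the default policy.
expression = " or ".join(f'(http.user_agent contains "{name}")' for name in allowed)
print(expression)
# (http.user_agent contains "OAI-SearchBot") or (http.user_agent contains "PerplexityBot") or ...
```

Generating rules from the file, rather than hand-editing them in each console, keeps the edge, WAF, and application layers in sync with the single source of truth.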
Implementation example: at AwesomeShoes Co., the security lead owns the allowlist file, the DevOps engineer maps each permitted crawler to edge and WAF rules, and the ecommerce manager flags premium catalog sections that require stricter access. The team reviews bot access outcomes weekly so policy, infrastructure behavior, and merchandising priorities stay aligned.
Keeping the allowlist current
IP ranges change. New crawlers appear. The allowlist needs to be:
- Automated where possible: a scheduled job fetches each operator’s published ranges and updates the rules (see IP whitelisting, and the sketch below).
- Reviewed quarterly for new operators or behavior changes.
- Tested after every deploy that touches infrastructure or routing.
Manually maintained allowlists go stale within weeks.
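A minimal sketch of that scheduled job, using only the Python standard library: it fetches each published feed listed in the allowlist and extracts the CIDR prefixes that the downstream rules consume. The feed URLs are the ones from the template above, and the `prefixes` schema is an assumption modeled on Google's published format; confirm each operator's actual schema before wiring this into rule updates.

```python
# Sketch of a scheduled refresh job: pull each operator's published range feed
# and extract the CIDR lists that the edge and WAF rules consume. The schema
# assumed here (Google-style "prefixes" entries with ipv4Prefix/ipv6Prefix
# keys) is followed by several operators, but verify each feed before use.
import json
import urllib.request

FEEDS = {
    "OAI-SearchBot": "https://openai.com/searchbot.json",
    "GPTBot": "https://openai.com/gptbot.json",
    "Googlebot": "https://developers.google.com/search/apis/ipranges/googlebot.json",
}

def fetch_ranges(url: str) -> list[str]:
    """Download one feed and return its IPv4/IPv6 prefixes."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    return [
        p.get("ipv4Prefix") or p.get("ipv6Prefix")
        for p in data.get("prefixes", [])
    ]

if __name__ == "__main__":
    for bot, url in FEEDS.items():
        ranges = fetch_ranges(url)
        print(f"{bot}: {len(ranges)} prefixes")  # feed the result into rule updates
```

Run on a schedule, diffed against the previous fetch, and pushed through the normal deploy pipeline, this is what keeps the IP-based rules from drifting.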
What an allowlist is not
- Not a substitute for robots.txt. Robots.txt remains the public signal of intent.
- Not a complete defense. Sophisticated scrapers can spoof user agents and rotate IPs. The allowlist denies the easy abuses, not the determined ones.
- Not a replacement for content access control. Paywalls, authentication, and per-page directives still apply.
Before shipping changes, the team asks: does this policy make sense for current business goals, does it solve the visibility problem, and is the rule set clear enough for another engineer to operate safely?