Web Application Firewalls (WAFs) are the most common cause of AI crawler access failures. A site can have a permissive robots.txt for AI crawlers, valid schema, and good content, and still be invisible to AI engines because a default WAF rule is blocking the crawler. Configuring the WAF correctly is the highest-leverage technical fix in AEO.
Why WAFs block AI crawlers
WAFs use rule sets that flag suspicious traffic. AI crawlers often look suspicious to default rules:
- They request many pages from a single IP range.
- They don’t render JavaScript or maintain cookies.
- They sometimes use user agent strings the WAF’s rule set doesn’t recognize as legitimate.
- Their request patterns look like scrapers if classification is naive.
Most WAF vendors now ship rule sets that explicitly handle the major AI crawlers. But defaults vary, and a site that hasn’t audited its rules in 12+ months is likely blocking at least one important crawler.
Cloudflare
Cloudflare offers explicit AI bot management.
The “Block AI Bots” toggle: Cloudflare added a one-click “block AI bots” option in 2024. Toggling it on blocks training crawlers and, depending on configuration, may block search and user-fetch crawlers as well. Sites doing AEO want this toggle off, or want to selectively allow specific bots.
Verified Bots list: Cloudflare maintains a Verified Bots program. Crawlers on the list are exempted from standard managed challenges. Confirm the AI crawlers a site wants to allow appear on the verified list.
Custom rules pattern:
```
# Allow OpenAI, Anthropic, and Perplexity user-fetch bots
(http.user_agent contains "ChatGPT-User") or
(http.user_agent contains "Claude-User") or
(http.user_agent contains "Perplexity-User")
# Action: Skip remaining security rules
```
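A user-agent-only match like this is spoofable (the same weakness called out under common mistakes below), so it trades strictness for simplicity. Where the operator publishes IP ranges for a bot, prefer the IP-verified form shown next.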
For IP-based verification:
```
(http.user_agent contains "GPTBot") and
(ip.src in $openai_gptbot_ips)
# Action: Skip remaining security rules
```
The IP list should be maintained as a Cloudflare list, refreshed by an automated job.
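A minimal sketch of such a refresh job in Python, assuming the requests library, Cloudflare's Lists API, and OpenAI's published GPTBot range file (the environment variable names are placeholders, and the range-file format should be confirmed against the operator's current documentation):

```python
import os

import requests

# Hypothetical environment-driven config; the list ID comes from the
# Cloudflare dashboard or API, and the token needs list-edit permission.
ACCOUNT_ID = os.environ["CF_ACCOUNT_ID"]
LIST_ID = os.environ["CF_LIST_ID"]  # backs the $openai_gptbot_ips list
API_TOKEN = os.environ["CF_API_TOKEN"]

# OpenAI publishes GPTBot ranges in a Googlebot-style JSON file.
RANGES_URL = "https://openai.com/gptbot.json"

def refresh_gptbot_list() -> None:
    # Fetch the operator's currently published CIDR ranges.
    data = requests.get(RANGES_URL, timeout=10).json()
    items = [{"ip": p["ipv4Prefix"]} for p in data.get("prefixes", [])
             if "ipv4Prefix" in p]

    # PUT on the list's items endpoint replaces the list contents wholesale.
    resp = requests.put(
        f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}"
        f"/rules/lists/{LIST_ID}/items",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json=items,
        timeout=10,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    refresh_gptbot_list()
```

Run on a daily or weekly schedule, this keeps the list within the operator's publication cadence without manual edits.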
AWS WAF
AWS WAF supports custom rules with user agent and IP matching.
Pattern:
- Create an IP set per AI crawler operator (openai-gptbot-ips, perplexity-bot-ips, etc.).
- Create a regex pattern set matching the user agents.
- Build a rule combining the user agent match AND the IP set match.
- Set the action to Allow.
- Place the rule above any “block bots” rules in the priority order.
```
Rule: Allow Verified AI Crawlers
Priority: 10 (high)
Conditions:
- User-Agent matches: ^.*(GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Claude-User|PerplexityBot|Perplexity-User|Bingbot).*$
- AND Source IP in: verified-ai-crawler-ips
Action: Allow
```
The IP set is updated by a Lambda function that fetches each operator’s published ranges on a schedule.
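A sketch of that Lambda using boto3, assuming a CloudFront-scoped IP set and Googlebot-style range files (the set ID and source list are illustrative):

```python
import json
import urllib.request

import boto3

# Illustrative values; the real ID comes from `aws wafv2 list-ip-sets`.
IP_SET_NAME = "verified-ai-crawler-ips"
IP_SET_ID = "REPLACE-WITH-IP-SET-ID"
SCOPE = "CLOUDFRONT"  # CLOUDFRONT-scoped sets require a us-east-1 client

# One published range file per operator; extend as operators are added.
SOURCES = ["https://openai.com/gptbot.json"]

wafv2 = boto3.client("wafv2", region_name="us-east-1")

def handler(event, context):
    addresses = []
    for url in SOURCES:
        with urllib.request.urlopen(url, timeout=10) as resp:
            data = json.load(resp)
        addresses += [p["ipv4Prefix"] for p in data.get("prefixes", [])
                      if "ipv4Prefix" in p]

    # wafv2 uses optimistic locking: fetch the current LockToken, then
    # replace the set's addresses wholesale in a single update call.
    current = wafv2.get_ip_set(Name=IP_SET_NAME, Scope=SCOPE, Id=IP_SET_ID)
    wafv2.update_ip_set(
        Name=IP_SET_NAME,
        Scope=SCOPE,
        Id=IP_SET_ID,
        Addresses=addresses,
        LockToken=current["LockToken"],
    )
```

An EventBridge schedule (daily or weekly) is a natural trigger for this handler.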
Akamai / Fastly / others
Pattern is similar:
- Define the user agents to recognize.
- Define the IP ranges to verify against.
- Create a rule that allows the combination and skips bot mitigation.
- Place the rule above bot mitigation rules.
Each platform has its own syntax but the logic is identical.
Application-layer rate limiting
If the application server applies its own rate limits (Express middleware, Django throttling, etc.), AI crawlers need to be exempt or given a higher limit. Search crawlers in particular issue many requests in short bursts during recrawls.
Standard pattern:
```
if isVerifiedCrawler(request):
    skipRateLimit()
```
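As a concrete sketch, here is a hypothetical Django-style middleware in Python; the token list, placeholder range, and skip_rate_limit flag are all illustrative, and production code should source the ranges from the same refreshed operator lists the WAF verifies against:

```python
import ipaddress

# Illustrative token list and ranges; load real ranges from the
# refreshed operator lists described above.
CRAWLER_TOKENS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User",
                  "ClaudeBot", "Claude-User",
                  "PerplexityBot", "Perplexity-User", "Bingbot")
VERIFIED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]  # placeholder CIDR

def is_verified_crawler(request) -> bool:
    ua = request.headers.get("User-Agent", "")
    if not any(token in ua for token in CRAWLER_TOKENS):
        return False
    ip = ipaddress.ip_address(request.META["REMOTE_ADDR"])
    return any(ip in net for net in VERIFIED_RANGES)

class CrawlerRateLimitExemption:
    """Flag verified crawlers so downstream throttling can skip them
    or apply a higher limit."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        request.skip_rate_limit = is_verified_crawler(request)
        return self.get_response(request)
```

Downstream throttling then checks request.skip_rate_limit before counting a request against a limit.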
Testing the configuration
After any WAF change, test with curl:
```bash
curl -A "GPTBot" -I https://example.com/sample-page
curl -A "PerplexityBot" -I https://example.com/sample-page
curl -A "ChatGPT-User" -I https://example.com/sample-page
```
Each should return 200 OK, not a 403, 429, or 503. (Note that -I sends a HEAD request; drop the flag to confirm the full page body is actually served.)
For deeper testing:
- Run a request volume test (a hundred requests from a known crawler IP) and confirm no rate limiting kicks in; see the harness sketch after this list.
- Test from an IP outside the operator’s verified range with the same user agent and confirm the WAF blocks it. If the spoofed request also succeeds, the rule is matching on user agent alone and IP verification isn’t enforced.
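A small harness for the volume test, a sketch assuming Python with requests (the URL and request count are illustrative):

```python
import requests

URL = "https://example.com/sample-page"  # illustrative target
HEADERS = {"User-Agent": "GPTBot"}       # matches the curl tests above

# Volume test: run from a known crawler IP (or an exempted test IP).
# Every response should be 200; any 429 means rate limiting still applies.
codes = [requests.get(URL, headers=HEADERS, timeout=10).status_code
         for _ in range(100)]
print({code: codes.count(code) for code in set(codes)})

# Spoof test: rerun this script from an IP outside the operator's
# published ranges. A healthy configuration blocks those requests; if
# they still return 200, the rule is matching on user agent alone.
```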
Common mistakes
- Allowing by user agent without verifying IP. This lets scrapers spoof the user agent and bypass other defenses.
- One-time configuration. WAF rules need review whenever operators publish new IP ranges or launch new crawlers.
- Stage-only testing. Production WAF often differs from staging. Test on production after deploys.
- Forgetting CDN-level rules. Cloudflare/Fastly rules can override or precede WAF rules. Audit both.
- Generic “block bots” rules placed above the AI crawler allow rules in priority order.
Auditing cadence
Monthly:
- Test all permitted crawlers with curl.
- Review WAF logs for blocks of intended-allow user agents (a log-scan sketch follows this list).
- Confirm IP set freshness against operator publication dates.
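The log review can be automated; this sketch assumes a hypothetical JSON-lines export with action and user_agent fields, since real field names vary by WAF vendor:

```python
import json
from collections import Counter

# Intended-allow user agents from the WAF rules above.
ALLOWED_UAS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
               "Claude-User", "PerplexityBot", "Perplexity-User", "Bingbot")

blocked = Counter()
with open("waf.log") as f:  # hypothetical JSON-lines log export
    for line in f:
        event = json.loads(line)
        if event.get("action") != "BLOCK":
            continue
        ua = event.get("user_agent", "")
        for token in ALLOWED_UAS:
            if token in ua:
                blocked[token] += 1

# Any nonzero count means an intended-allow crawler is being blocked.
for token, count in blocked.most_common():
    print(f"{token}: {count} blocked requests")
```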
Quarterly:
- Full WAF rule audit.
- Update operator user agent and IP lists from canonical sources.
- Verify logging captures crawler-specific request data.
Implementation example
During a new product-line rollout at AwesomeShoes Co., the technical SEO lead notices a sudden drop in answer citations for size-guide pages. The cause turns out to be a recently enabled aggressive bot rule in the WAF that now challenges legitimate AI crawler requests.
Implementation discussion: the security engineer creates verified-crawler allow rules above generic bot blocks, the DevOps engineer refreshes operator IP sets automatically, and the SEO lead validates priority URLs with controlled crawler-header tests. They compare blocked-request logs and citation recovery week over week to confirm the fix is both secure and effective.