
Web Application Firewalls are the most common cause of AI crawler access failures. A site can have a permissive robots.txt for AI crawlers, valid schema, and good content, and still be invisible to AI engines because a default WAF rule is blocking the crawler. Configuring the WAF correctly is the highest-leverage technical fix in AEO.

Why WAFs block AI crawlers

WAFs use rule sets that flag suspicious traffic. AI crawlers often look suspicious to default rules:

  • They request many pages from a single IP range.
  • They don’t render JavaScript or maintain cookies.
  • They sometimes use user agent strings the WAF’s rule set doesn’t recognize as legitimate.
  • Their request patterns look like scrapers if classification is naive.

Most WAF vendors now ship rule sets that explicitly handle the major AI crawlers. But defaults vary, and a site that hasn’t audited its rules in 12+ months is likely blocking at least one important crawler.

Cloudflare

Cloudflare offers explicit AI bot management.

The “Block AI Bots” toggle: Cloudflare added a one-click “block AI bots” option in 2024. Toggling it on blocks training crawlers and, depending on configuration, may block search and user-fetch crawlers as well. Sites doing AEO want this toggle off, or want to selectively allow specific bots.

Verified Bots list: Cloudflare maintains a Verified Bots program; crawlers on the list are allowed past standard managed challenges. Confirm that the AI crawlers a site wants to allow appear on the verified list.

Custom rules pattern:

```
# Allow OpenAI, Anthropic, Perplexity user-fetch bots
(http.user_agent contains "ChatGPT-User") or
(http.user_agent contains "Claude-User") or
(http.user_agent contains "Perplexity-User")
# Action: Skip remaining security rules
```

For IP-based verification:

```
(http.user_agent contains "GPTBot") and
(ip.src in $openai_gptbot_ips)
# Action: Skip remaining security rules
```

The IP list should be maintained as a Cloudflare list, refreshed by an automated job.
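
A minimal sketch of such a job in Python, assuming OpenAI publishes its GPTBot ranges at https://openai.com/gptbot.json and using the Cloudflare Lists API to replace the list’s items; the account ID, list ID, token, and the exact prefix-file shape are assumptions to verify:

```python
# refresh_gptbot_list.py - sketch of the automated Cloudflare list refresh.
# Assumes OpenAI publishes GPTBot ranges at https://openai.com/gptbot.json
# in the common {"prefixes": [{"ipv4Prefix": ...}]} shape; confirm against
# the live file. Account ID, list ID, and token are placeholders.
import os

import requests

RANGES_URL = "https://openai.com/gptbot.json"
ACCOUNT_ID = os.environ["CF_ACCOUNT_ID"]    # placeholder
LIST_ID = os.environ["CF_GPTBOT_LIST_ID"]   # placeholder
API_TOKEN = os.environ["CF_API_TOKEN"]      # placeholder


def fetch_ranges() -> list[str]:
    """Pull the operator's currently published CIDR ranges."""
    data = requests.get(RANGES_URL, timeout=10).json()
    return [
        prefix[key]
        for prefix in data.get("prefixes", [])
        for key in ("ipv4Prefix", "ipv6Prefix")
        if key in prefix
    ]


def replace_list_items(cidrs: list[str]) -> None:
    """Replace every item in the Cloudflare IP list with the fresh ranges."""
    url = (
        "https://api.cloudflare.com/client/v4/accounts/"
        f"{ACCOUNT_ID}/rules/lists/{LIST_ID}/items"
    )
    resp = requests.put(
        url,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json=[{"ip": cidr} for cidr in cidrs],
        timeout=30,
    )
    resp.raise_for_status()


if __name__ == "__main__":
    replace_list_items(fetch_ranges())
```

Pointed at each operator’s published range URL, the same job keeps one list per crawler current, so a rule like `$openai_gptbot_ips` never references stale ranges.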

AWS WAF

AWS WAF supports custom rules with user agent and IP matching.

Pattern:

  1. Create an IP set per AI crawler operator (openai-gptbot-ips, perplexity-bot-ips, etc.).
  2. Create a regex pattern set matching the user agents.
  3. Build a rule combining user agent match AND IP set match.
  4. Set the action to Allow.
  5. Place the rule above any “block bots” rules in the priority order.

```
Rule: Allow Verified AI Crawlers
Priority: 10 (high)
Conditions:
  - User-Agent matches: ^.*(GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Claude-User|PerplexityBot|Perplexity-User|Bingbot).*$
  - AND Source IP in: verified-ai-crawler-ips
Action: Allow
```

The IP set is updated by a Lambda function that fetches each operator’s published ranges on a schedule.
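
A sketch of that Lambda in Python with boto3, assuming the IP set already exists and the operator publishes a Google-style prefixes file; the range URL and IP set identifiers are placeholders:

```python
# Sketch of the scheduled Lambda that refreshes the verified-ai-crawler-ips
# IP set. The range URL and IP set identifiers are placeholders; the prefix
# file is assumed to follow the common {"prefixes": [...]} shape.
import json
import urllib.request

import boto3

RANGES_URL = "https://openai.com/gptbot.json"  # operator's published ranges
IP_SET_NAME = "verified-ai-crawler-ips"
IP_SET_ID = "REPLACE_WITH_IP_SET_ID"           # placeholder
SCOPE = "CLOUDFRONT"                           # "REGIONAL" for regional WAFs

wafv2 = boto3.client("wafv2", region_name="us-east-1")


def lambda_handler(event, context):
    with urllib.request.urlopen(RANGES_URL, timeout=10) as resp:
        data = json.load(resp)
    # An AWS WAF IP set holds a single address version; IPv6 ranges
    # need their own set and a second rule condition.
    cidrs = [p["ipv4Prefix"] for p in data.get("prefixes", []) if "ipv4Prefix" in p]

    # update_ip_set overwrites the whole set and requires the current
    # LockToken for optimistic locking.
    current = wafv2.get_ip_set(Name=IP_SET_NAME, Scope=SCOPE, Id=IP_SET_ID)
    wafv2.update_ip_set(
        Name=IP_SET_NAME,
        Scope=SCOPE,
        Id=IP_SET_ID,
        Addresses=cidrs,
        LockToken=current["LockToken"],
    )
    return {"updated": len(cidrs)}
```

An EventBridge schedule (daily is plenty for most operators) triggers the function.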

Akamai / Fastly / others

Pattern is similar:

  1. Define the user agents to recognize.
  2. Define the IP ranges to verify against.
  3. Create a rule that allows the combination and skips bot mitigation.
  4. Place the rule above bot mitigation rules.

Each platform has its own syntax but the logic is identical.
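
That identical logic, written out as a short Python sketch; the names are illustrative and the CIDRs are RFC 5737 test ranges, not real operator data:

```python
# Platform-neutral sketch of the allow predicate every rule above encodes:
# exempt a request from bot mitigation only when the user agent names a
# known crawler AND the source IP is inside that operator's published
# ranges. The CIDRs below are illustrative, not real operator ranges.
import ipaddress

VERIFIED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],            # illustrative CIDR (TEST-NET-1)
    "PerplexityBot": ["198.51.100.0/24"],  # illustrative CIDR (TEST-NET-2)
}


def should_skip_bot_mitigation(user_agent: str, src_ip: str) -> bool:
    ip = ipaddress.ip_address(src_ip)
    for token, cidrs in VERIFIED_RANGES.items():
        if token in user_agent:
            return any(ip in ipaddress.ip_network(c) for c in cidrs)
    return False  # unknown UA, or a spoofed UA from an unverified IP
```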

Application-layer rate limiting

If the application server applies its own rate limits (Express middleware, Django throttling, etc.), AI crawlers need to be exempt or given a higher limit. Search crawlers in particular issue many requests in short bursts during recrawls.

Standard pattern:

```
if isVerifiedCrawler(request):
    skipRateLimit()
```
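
A concrete version of that pattern as a Django REST Framework throttle class; `is_verified_crawler` is a hypothetical helper that must apply the same UA-plus-IP check as the WAF rules, never the user agent alone:

```python
# Sketch of a crawler-aware throttle for Django REST Framework.
# is_verified_crawler is a hypothetical helper; it must verify the
# source IP against operator ranges, not trust the user agent alone.
from rest_framework.throttling import AnonRateThrottle

from myapp.crawlers import is_verified_crawler  # hypothetical module


class CrawlerAwareThrottle(AnonRateThrottle):
    rate = "60/min"  # limit for ordinary anonymous traffic

    def allow_request(self, request, view):
        ua = request.META.get("HTTP_USER_AGENT", "")
        ip = request.META.get("REMOTE_ADDR", "")
        if is_verified_crawler(ua, ip):
            return True  # verified crawlers bypass the limit entirely
        return super().allow_request(request, view)
```

Wire it in via a view’s `throttle_classes` or the global `DEFAULT_THROTTLE_CLASSES` setting; Express and other stacks follow the same shape, running the verification check before the limiter.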

Testing the configuration

After any WAF change, test with curl:

```bash
curl -A "GPTBot" -I https://example.com/sample-page
curl -A "PerplexityBot" -I https://example.com/sample-page
curl -A "ChatGPT-User" -I https://example.com/sample-page
```

Each should return 200 OK, not a 403, 429, or 503. (The -I flag sends a HEAD request; drop it to confirm the body is the actual content rather than a challenge page.)

For deeper testing:

  • Run a request volume test (a hundred requests from a known crawler IP) and confirm no rate limiting kicks in; a test harness sketch follows this list.
  • Test from an IP outside the operator’s verified range with the same user agent and confirm the WAF blocks it. If the spoofed request also succeeds, the rule is trusting the user agent alone and IP verification isn’t enforced.
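
A small Python harness covering both checks, mirroring the curl examples above (URL and user agents are this page’s samples):

```python
# Volume-test harness mirroring the curl checks above: a burst of requests
# per allowed user agent, flagging anything other than 200. Run once from a
# verified crawler IP (expect all 200s) and once from an unverified IP with
# the same user agents (expect blocks if IP verification is enforced).
import requests

URL = "https://example.com/sample-page"
USER_AGENTS = ["GPTBot", "PerplexityBot", "ChatGPT-User"]
BURST = 100  # requests per agent

for ua in USER_AGENTS:
    session = requests.Session()
    session.headers["User-Agent"] = ua
    codes = [session.get(URL, timeout=10).status_code for _ in range(BURST)]
    ok = sum(1 for c in codes if c == 200)
    bad = sorted({c for c in codes if c != 200})
    print(f"{ua}: {ok}/{BURST} returned 200" + (f", other codes: {bad}" if bad else ""))
```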

Common mistakes

  • Allowing user agent without verifying IP. Lets scrapers spoof the user agent and bypass other defenses.
  • One-time configuration. WAF rules need to be reviewed when operators publish new IP ranges or new crawlers.
  • Staging-only testing. The production WAF often differs from staging; retest on production after every deploy.
  • Forgetting CDN-level rules. Cloudflare/Fastly rules can override or precede WAF rules. Audit both.
  • Generic “block bots” rules placed above the AI crawler allow rules in priority order.

Auditing cadence

Monthly:

  • Test all permitted crawlers with curl.
  • Review WAF logs for blocks of intended-allow user agents.
  • Confirm IP set freshness against operator publication dates.

Quarterly:

  • Full WAF rule audit.
  • Update operator user agent and IP lists from canonical sources.
  • Verify logging captures crawler-specific request data.

Implementation example

During a new product-line rollout at AwesomeShoes Co., the technical SEO lead notices a sudden drop in answer citations for size-guide pages. An aggressive bot rule had recently been enabled on the WAF and was now challenging legitimate AI crawler requests.

The fix follows the patterns above: the security engineer creates verified-crawler allow rules above the generic bot blocks, the DevOps engineer automates refreshes of the operator IP sets, and the SEO lead validates priority URLs with controlled crawler-header tests. They compare blocked-request logs and citation recovery week over week to confirm the fix is both secure and effective.
