
IP whitelisting for AI crawlers is the practice of allowing requests through privileged paths only when the source IP matches a published operator range. It’s the second half of crawler verification: user agent says who the request claims to be from, IP confirms it.

Why IP whitelisting is needed

Without it, anyone can send a request claiming to be GPTBot or PerplexityBot. The user agent is a string the client controls. Trusting it alone means:

  • Scrapers bypass rate limits by impersonating crawlers.
  • Attackers use known-good user agents to evade bot detection.
  • Analytics over-count “AI crawler traffic” because spoofed requests pollute the data.

IP whitelisting makes the user-agent claim verifiable.

Where operators publish IP ranges

| Operator | Source URL | Format |
|---|---|---|
| OpenAI | platform.openai.com/docs (links to JSON files per bot) | JSON |
| Anthropic | docs.anthropic.com (per-bot JSON) | JSON |
| Perplexity | perplexity.com/perplexitybot.json, perplexity.com/perplexity-user.json | JSON |
| Google | gstatic.com/ipranges/googlebot.json, gstatic.com/ipranges/special-crawlers.json | JSON |
| Microsoft | bing.com/toolbox/bingbot.json | JSON |

Each list is publicly available. Most include CIDR ranges with metadata indicating which bot each range serves.

Implementation pattern

  1. Fetch each operator’s published ranges on a schedule (daily is overkill, weekly is fine for most).
  2. Parse the CIDRs into a working IP set.
  3. Apply the set at the WAF, CDN, or application layer.
  4. On each request whose user agent claims an operator’s bot, check the source IP against that operator’s ranges.

Pseudocode:

```
function fetchAllRanges():
    ranges = {}
    ranges['openai-gptbot'] = fetchJSON('https://openai.com/gptbot.json').prefixes
    ranges['openai-searchbot'] = fetchJSON('https://openai.com/searchbot.json').prefixes
    ranges['perplexity-bot'] = fetchJSON('https://perplexity.com/perplexitybot.json').prefixes
    ranges['perplexity-user'] = fetchJSON('https://perplexity.com/perplexity-user.json').prefixes
    // ... per operator
    return ranges

function isVerifiedCrawler(request):
    ranges = getCachedRanges()
    ua = request.userAgent
    ip = request.sourceIP
    if ua.contains('GPTBot') and ip.in(ranges['openai-gptbot']):
        return true
    // ... per crawler
    return false
```
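For something closer to runnable, here is the same flow sketched in Python with the standard-library ipaddress module. The URLs mirror the pseudocode above, and the response schema is assumed to follow the Googlebot-style `{"prefixes": [{"ipv4Prefix": ...}]}` layout; confirm both against each operator’s current documentation before relying on them.

```python
import ipaddress
import requests

# Bot name -> published range URL. These mirror the pseudocode above;
# the operators' actual endpoints may differ.
RANGE_SOURCES = {
    "openai-gptbot": "https://openai.com/gptbot.json",
    "openai-searchbot": "https://openai.com/searchbot.json",
    "perplexity-bot": "https://perplexity.com/perplexitybot.json",
}

def fetch_all_ranges():
    """Fetch each operator's published CIDRs and parse them into network objects."""
    ranges = {}
    for bot, url in RANGE_SOURCES.items():
        data = requests.get(url, timeout=10).json()
        # Assumes the Googlebot-style schema: {"prefixes": [{"ipv4Prefix": "..."}]}
        ranges[bot] = [
            ipaddress.ip_network(p.get("ipv4Prefix") or p.get("ipv6Prefix"))
            for p in data["prefixes"]
            if "ipv4Prefix" in p or "ipv6Prefix" in p
        ]
    return ranges

def ip_in_ranges(ip, networks):
    """True if the source IP falls inside any of the bot's published CIDRs."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

def is_verified_crawler(user_agent, source_ip, ranges):
    """The user agent claims the bot; the IP check confirms or refutes the claim."""
    if "GPTBot" in user_agent:
        return ip_in_ranges(source_ip, ranges["openai-gptbot"])
    if "PerplexityBot" in user_agent:
        return ip_in_ranges(source_ip, ranges["perplexity-bot"])
    # ... one branch per crawler you care about
    return False
```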

Where to apply the whitelist

Three layers, each with different tradeoffs:

CDN edge

Pros: stops bad traffic before it reaches origin. Lowest latency for legitimate traffic. Easy to configure for major CDNs.

Cons: less granular than application-layer logic. Limited ability to combine with other signals.

Cloudflare, Fastly, and Akamai all support IP-list-based rules at the edge.
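As one illustration, the fetched CIDRs can be pushed into a Cloudflare IP list that edge rules then reference (for example: block requests whose user agent claims GPTBot but whose source IP is not in the list). This sketch assumes Cloudflare’s Lists API “replace all items” endpoint and a pre-created list; the account ID, list ID, and list name are placeholders to check against Cloudflare’s current API documentation.

```python
import os
import requests

CF_API = "https://api.cloudflare.com/client/v4"
ACCOUNT_ID = os.environ["CF_ACCOUNT_ID"]       # placeholder: your Cloudflare account ID
LIST_ID = os.environ["CF_GPTBOT_LIST_ID"]      # placeholder: a pre-created IP list for GPTBot
HEADERS = {"Authorization": f"Bearer {os.environ['CF_API_TOKEN']}"}

def push_list(cidrs):
    """Replace the IP list's contents with the latest verified CIDRs."""
    items = [{"ip": cidr} for cidr in cidrs]
    resp = requests.put(
        f"{CF_API}/accounts/{ACCOUNT_ID}/rules/lists/{LIST_ID}/items",
        headers=HEADERS,
        json=items,
    )
    resp.raise_for_status()

# An edge rule can then express roughly: block when the UA contains "GPTBot"
# and ip.src is not in the referenced list.
```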

WAF

Pros: integrates with the rest of the security ruleset. Logging is centralized.

Cons: depends on WAF vendor capabilities. AWS WAF supports IP sets natively; Cloudflare WAF integrates with edge IP rules.
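On AWS, the refreshed CIDRs can land in a WAFv2 IP set that a rule references. A minimal sketch with boto3, assuming the IP set already exists (its name and ID below are placeholders):

```python
import boto3

# WAFv2 client; CLOUDFRONT-scoped resources live in us-east-1
waf = boto3.client("wafv2", region_name="us-east-1")

def update_verified_ip_set(name, ip_set_id, cidrs, scope="CLOUDFRONT"):
    """Replace a WAFv2 IP set's addresses with the latest verified crawler CIDRs."""
    # WAFv2 requires the current LockToken for optimistic concurrency control
    current = waf.get_ip_set(Name=name, Scope=scope, Id=ip_set_id)
    waf.update_ip_set(
        Name=name,
        Scope=scope,
        Id=ip_set_id,
        Addresses=cidrs,                 # e.g. ["203.0.113.0/24", ...]
        LockToken=current["LockToken"],
    )

# update_verified_ip_set("verified-gptbot", "abc123-placeholder", gptbot_cidrs)
```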

Application

Pros: most flexible. Can integrate with custom logic (e.g., serve different content to verified crawlers).

Cons: traffic reaches origin even if blocked. Higher latency for legitimate crawlers.

The standard pattern is verification at the CDN or WAF, with application-layer logic for any per-request behavior that depends on whether the request is a verified crawler.
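For the application-layer piece, the check can sit in middleware so downstream handlers only see a verdict. A sketch with Flask, reusing the hypothetical `is_verified_crawler` helper from the earlier sketch and a hypothetical `get_cached_ranges()` accessor (one version appears under Refresh cadence below):

```python
from flask import Flask, g, request

app = Flask(__name__)

KNOWN_CRAWLER_TOKENS = ("GPTBot", "OAI-SearchBot", "PerplexityBot", "ClaudeBot")

@app.before_request
def classify_request():
    ua = request.headers.get("User-Agent", "")
    # Behind a CDN, trust the CDN-provided client IP header rather than remote_addr
    ip = request.headers.get("X-Forwarded-For", request.remote_addr or "").split(",")[0].strip()

    if not any(token in ua for token in KNOWN_CRAWLER_TOKENS):
        g.crawler_verdict = "not-a-known-crawler"
    elif is_verified_crawler(ua, ip, get_cached_ranges()):   # helpers from the other sketches
        g.crawler_verdict = "verified"
    else:
        g.crawler_verdict = "spoofed"

@app.route("/docs/<path:page>")
def docs(page):
    # Example of per-request behavior keyed off the verdict
    if g.crawler_verdict == "verified":
        return f"full pre-rendered content for {page}"   # stand-in for a crawler-friendly render
    return f"default content for {page}"
```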

Refresh cadence

  • Weekly automated fetch (a refresh sketch follows this list). Manual fetches go stale.
  • Cache the ranges in a centralized store (Redis, a config service, a Cloudflare list, an AWS IP set).
  • Alert on fetch failures. A failed fetch means the rules are running against stale data.
  • Quarterly review to add any new operators or remove deprecated ranges.
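A minimal refresh job tying these points together, assuming Redis as the centralized store and a placeholder alert() hook; `fetch_all_ranges` comes from the earlier sketch:

```python
import ipaddress
import json
import time
import redis

r = redis.Redis()
CACHE_KEY = "crawler-ip-ranges"

def alert(message):
    # Placeholder: wire this to PagerDuty, Slack, email, etc.
    print(f"ALERT: {message}")

def refresh_ranges():
    """Weekly job: fetch fresh ranges; on failure, keep the last known good list."""
    try:
        ranges = fetch_all_ranges()   # from the earlier sketch
        serializable = {bot: [str(net) for net in nets] for bot, nets in ranges.items()}
        r.set(CACHE_KEY, json.dumps(serializable))
        r.set(f"{CACHE_KEY}:fetched-at", int(time.time()))
    except Exception as exc:
        # Do not clear the cache; rules keep running on the last known good data
        alert(f"crawler IP range fetch failed, still serving stale list: {exc}")

def get_cached_ranges():
    """Read the cached CIDRs back as network objects for request-time checks."""
    raw = json.loads(r.get(CACHE_KEY) or "{}")
    return {bot: [ipaddress.ip_network(c) for c in cidrs] for bot, cidrs in raw.items()}
```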

Common mistakes

  • Hardcoded IP lists in code or config files. Operators update ranges, code does not.
  • No fallback when fetch fails. Better to keep using the last known good list than to fail open.
  • Whitelisting too broadly. Allowing all AWS IP ranges because a crawler runs on AWS lets every AWS-hosted scraper through.
  • One whitelist for all crawlers. Each operator has different ranges. A unified whitelist either over-allows or under-allows.
  • Not testing after operator changes. When an operator adds new IP ranges, requests from those IPs fail until the next automated refresh.

Logging verification outcomes

Per request, log:

  • Source IP.
  • User agent.
  • Whether the user agent matched a known crawler pattern.
  • Whether the source IP was in the matching operator’s verified range.
  • The verification verdict: verified / spoofed / not-a-known-crawler.

This data answers questions like “how much of our claimed AI crawler traffic is real” and “did we have any blocks on verified crawlers in the last week.”
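One way to emit these fields is a structured JSON log line per request; the field names below are illustrative rather than a standard.

```python
import json
import logging

log = logging.getLogger("crawler-verification")

def log_verification(source_ip, user_agent, matched_crawler, ip_verified):
    """matched_crawler is None when the UA matched no known crawler pattern."""
    if matched_crawler is None:
        verdict = "not-a-known-crawler"
    elif ip_verified:
        verdict = "verified"
    else:
        verdict = "spoofed"
    log.info(json.dumps({
        "source_ip": source_ip,
        "user_agent": user_agent,
        "matched_crawler": matched_crawler,        # e.g. "openai-gptbot"
        "ip_in_operator_range": bool(ip_verified),
        "verdict": verdict,
    }))

# log_verification("203.0.113.7", "Mozilla/5.0 ... GPTBot/1.1", "openai-gptbot", False)
```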

Implementation example

AwesomeShoes Co.’s platform engineer notices that “AI crawler” traffic jumps after a promotion, but citation performance does not improve. The likely problem is spoofed user agents inflating logs and hiding real crawler failures.

Implementation discussion: the platform engineer automates weekly range fetches, the security analyst enforces user-agent-plus-IP verification at the WAF, and the AEO owner tracks verified-crawler success rate alongside citation trends. If verified access rises without citation gains, content quality is the next bottleneck; if verified access drops, infrastructure fixes stay the priority.
