
Crawler user agent strings can be spoofed. Anyone can send an HTTP request claiming to be GPTBot or PerplexityBot. Verifying that a request actually comes from the operator it claims is necessary before granting privileged access — for example, before exempting the request from rate limits or WAF configuration rules.

Why verification matters

Two real risks if user agent alone is trusted:

  • Scrapers impersonating legitimate bots to bypass rate limits and harvest content.
  • Attackers using AI bot user agents as a known-good string to evade simple bot detection.

The defense is to verify the request’s origin, not just its self-reported identity.

Two verification methods

IP allowlisting

Most operators publish their crawler IP ranges. The verification:

  1. Read the request’s source IP.
  2. Check whether the IP falls in the operator’s published list.
  3. If yes, trust the user agent; if no, treat as untrusted regardless of user agent.
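The three steps above can be sketched in Python with the standard `ipaddress` module. This is a minimal sketch: the function name `ip_in_ranges` is hypothetical, and the example CIDR blocks are reserved documentation ranges, not any operator's real crawler ranges.

```python
import ipaddress

def ip_in_ranges(source_ip: str, cidr_ranges: list[str]) -> bool:
    """Return True if source_ip falls inside any of the given CIDR ranges."""
    try:
        ip = ipaddress.ip_address(source_ip)
    except ValueError:
        return False  # malformed address: treat as untrusted
    for cidr in cidr_ranges:
        network = ipaddress.ip_network(cidr, strict=False)
        # Only compare addresses and networks of the same IP version.
        if ip.version == network.version and ip in network:
            return True
    return False

# Placeholder ranges (reserved for documentation), standing in for a published list.
EXAMPLE_RANGES = ["192.0.2.0/24", "2001:db8::/32"]
```

A request whose user agent claims to be a crawler would be trusted only when `ip_in_ranges(request_ip, operator_ranges)` returns `True`.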

Each operator publishes ranges differently:

  • OpenAI publishes IP ranges for each bot in their documentation.
  • Anthropic publishes ranges in their docs.
  • Perplexity publishes JSON files at fixed URLs (perplexity.com/perplexitybot.json, perplexity.com/perplexity-user.json).
  • Google publishes ranges at gstatic.com/ipranges/ files.

The list-fetching has to be automated. Hardcoded IPs go stale.
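A fetch-and-parse step might look like the sketch below. It assumes the published file is JSON with a top-level `prefixes` array whose entries carry an `ipv4Prefix` or `ipv6Prefix` key, which is the shape Google uses for its range files. Other operators may use a different layout, so inspect each file before relying on this. `parse_prefixes` and `fetch_ranges` are hypothetical helper names.

```python
import json
import urllib.request

def parse_prefixes(document: str) -> list[str]:
    """Extract CIDR strings from a JSON document (assumed 'prefixes' layout)."""
    data = json.loads(document)
    ranges = []
    for entry in data.get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            ranges.append(prefix)
    return ranges

def fetch_ranges(url: str, timeout: float = 10.0) -> list[str]:
    """Download a published range file and return its CIDR strings."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_prefixes(response.read().decode("utf-8"))
```

Keeping parsing separate from fetching makes the parser easy to test against a saved copy of each operator's file.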

Reverse DNS

Some operators support reverse DNS verification:

  1. Get the request’s source IP.
  2. Run a reverse DNS lookup (PTR record).
  3. Check the hostname matches the expected pattern (e.g., *.googlebot.com for Googlebot).
  4. Forward-confirm by resolving the hostname back to an IP and confirming it matches the source.

Reverse DNS is more flexible than IP lists because it doesn’t require maintaining a list, but it’s slower per request and not all operators support it.
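The four reverse DNS steps can be sketched with Python's `socket` module. This is an illustration under assumptions: `verify_by_reverse_dns` is a hypothetical name, the resolver functions are injectable so the logic can be tested without network access, and suffixes are expected to start with a dot (for example `.googlebot.com`) so that a hostname such as `evilgooglebot.com` does not match.

```python
import ipaddress
import socket

def default_reverse(ip: str) -> str:
    """PTR lookup: return the primary hostname for an IP."""
    return socket.gethostbyaddr(ip)[0]

def default_forward(hostname: str) -> set[str]:
    """Forward lookup: return all addresses the hostname resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}

def verify_by_reverse_dns(source_ip, allowed_suffixes,
                          reverse=default_reverse, forward=default_forward) -> bool:
    try:
        hostname = reverse(source_ip).rstrip(".").lower()
        # Step 3: the hostname must end with an expected suffix.
        if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
            return False
        # Step 4: forward-confirm that the hostname resolves back to the source IP.
        resolved = {ipaddress.ip_address(addr) for addr in forward(hostname)}
        return ipaddress.ip_address(source_ip) in resolved
    except (OSError, ValueError):
        return False  # lookup failure or malformed address: treat as untrusted
```

Because each check costs two DNS lookups, results are usually worth caching per IP for a short period.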

Implementation pattern

The standard pattern, in pseudocode:

```
function isVerifiedCrawler(request):
    ip = request.sourceIP
    ua = request.userAgent

    if ua.contains("GPTBot"):
        return ip in openAITrainingIPs()
    if ua.contains("OAI-SearchBot"):
        return ip in openAISearchIPs()
    if ua.contains("PerplexityBot"):
        return ip in perplexityBotIPs()
    // … per crawler

    return false
```

The IP lists are cached for 24 hours and refreshed from the operator’s published source.
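One way to combine the pseudocode with a 24-hour cache is sketched below in Python. The class `CachedRanges`, the function `is_verified_crawler`, and the injectable `fetch` and `clock` parameters are all hypothetical names chosen for illustration. The sketch also assumes a fail-closed choice: if a refresh fails, the previous list is kept, and if no list has ever loaded, nothing is verified.

```python
import ipaddress
import time
from typing import Callable

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24 hours, as described above

class CachedRanges:
    """Holds one operator's CIDR list and refreshes it when the TTL expires."""

    def __init__(self, fetch: Callable[[], list[str]],
                 ttl: float = CACHE_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch, self._ttl, self._clock = fetch, ttl, clock
        self._networks = []
        self._expires_at = None  # None means "never loaded"

    def _refresh_if_stale(self) -> None:
        now = self._clock()
        if self._expires_at is not None and now < self._expires_at:
            return
        try:
            networks = [ipaddress.ip_network(c, strict=False) for c in self._fetch()]
        except (OSError, ValueError):
            return  # keep the previous list and retry on the next call
        self._networks = networks
        self._expires_at = now + self._ttl

    def contains(self, source_ip: str) -> bool:
        self._refresh_if_stale()
        try:
            ip = ipaddress.ip_address(source_ip)
        except ValueError:
            return False
        return any(ip.version == n.version and ip in n for n in self._networks)

def is_verified_crawler(user_agent: str, source_ip: str,
                        ranges_by_token: dict[str, CachedRanges]) -> bool:
    """Match the user agent to a crawler token, then check that crawler's ranges."""
    for token, ranges in ranges_by_token.items():
        if token in user_agent:
            return ranges.contains(source_ip)
    return False  # not a known crawler
```

In use, `ranges_by_token` would map tokens such as `"GPTBot"` to a `CachedRanges` built around a fetch function for that operator's published list.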

Where to enforce verification

  • At the WAF. The verification result decides whether to allow the request through, rate-limit it, or block it.
  • In application logging. Log requests as “verified crawler” or “unverified” so analytics treat them differently.
  • Not in robots.txt for AI crawlers. Robots.txt is advisory. Verification happens at the network layer.
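How a verification result maps to a WAF decision depends on local policy. The sketch below shows one possible mapping in Python. The `Action` enum, the `waf_action` function, and the `block_spoofed` switch are hypothetical, and the policy itself is an assumption rather than a recommendation from any WAF vendor.

```python
from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    RATE_LIMIT = "rate_limit"
    BLOCK = "block"

def waf_action(claims_crawler: bool, verified: bool,
               block_spoofed: bool = False) -> Action:
    """Map a verification outcome to a WAF action (illustrative policy)."""
    if claims_crawler and verified:
        return Action.ALLOW  # verified crawler: exempt from standard limits
    if claims_crawler:
        # Crawler user agent from an unverified origin: likely spoofed.
        return Action.BLOCK if block_spoofed else Action.RATE_LIMIT
    return Action.RATE_LIMIT  # ordinary traffic: standard limits apply
```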

What not to do

  • Don’t trust user agent alone for any privileged decision. Rate limits, content access, schema-served-only-to-bots — all require verification first.
  • Don’t block based on IP without checking user agent. Some legitimate bots run on shared infrastructure, so their IP space also carries unrelated traffic.
  • Don’t manually maintain IP lists. Operators update them. Automate the fetch.

Logging verification outcomes

Useful log fields per request:

  • Source IP.
  • User agent.
  • Verification result: verified / unverified / not-a-known-crawler.
  • Operator (if verified): OpenAI / Anthropic / Perplexity / etc.
  • Crawler type (if verified): training / search / user-initiated.

This makes it possible to answer “did GPTBot actually crawl us last week, or was it a scraper” without guessing.
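The fields above could be emitted as one structured log line per request. The Python sketch below is illustrative: `CrawlerLogRecord`, `build_record`, and `to_log_line` are hypothetical names, and the `KNOWN_CRAWLERS` table lists only two example tokens; a real deployment would cover every crawler it verifies.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class CrawlerLogRecord:
    source_ip: str
    user_agent: str
    verification: str  # "verified", "unverified", or "not-a-known-crawler"
    operator: Optional[str] = None      # set only when verified
    crawler_type: Optional[str] = None  # set only when verified

# Example mapping of user agent token to (operator, crawler type).
KNOWN_CRAWLERS = {
    "GPTBot": ("OpenAI", "training"),
    "OAI-SearchBot": ("OpenAI", "search"),
}

def build_record(source_ip: str, user_agent: str, ip_verified: bool) -> CrawlerLogRecord:
    for token, (operator, crawler_type) in KNOWN_CRAWLERS.items():
        if token in user_agent:
            if ip_verified:
                return CrawlerLogRecord(source_ip, user_agent, "verified",
                                        operator, crawler_type)
            return CrawlerLogRecord(source_ip, user_agent, "unverified")
    return CrawlerLogRecord(source_ip, user_agent, "not-a-known-crawler")

def to_log_line(record: CrawlerLogRecord) -> str:
    """Serialize the record as a single JSON line for central logging."""
    return json.dumps(asdict(record), sort_keys=True)
```

Filtering logs on `verification == "verified"` then separates real crawler visits from traffic that only borrowed the user agent.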

Implementation example

At AwesomeShoes Co., traffic labeled as PerplexityBot spikes overnight, but citation performance does not move. The security analyst suspects spoofed crawler traffic is polluting dashboards and bypassing rate policies.

Implementation discussion: the platform engineer enforces user-agent-plus-IP verification at the WAF, logs verification outcomes centrally, and routes unverified traffic through stricter bot controls. The AEO manager then reports only verified crawler activity when evaluating visibility changes, making decisions based on reliable signals.
