The Server-Side Playbook for Indexation, Training and Retrieval Bots
Treat AI crawlers as one flow and you waste crawl budget and bandwidth. How to segment bots, cache with 304s, guard 103 Early Hints, and catch spoofers.
Bot traffic stopped being a single flow worth a single rule. An enterprise access log now holds at least three populations with different intentions, and serving them the same way leaves crawl budget and bandwidth on the floor. The split is worth making explicit before any caching rule goes in.
Indexation bots like Googlebot and Bingbot crawl to populate a search index. Training bots like GPTBot and ClaudeBot pull content to train models. Retrieval bots like OAI-SearchBot and PerplexityBot fetch a page in real time to ground a live answer with a citation. Same request on the wire. Three different reasons, and three different values to your business.
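As a minimal sketch of that split, the three populations can be separated by a substring match on the user agent. The token lists below contain only the crawlers named above, and the names `BotClass` and `classify_user_agent` are illustrative assumptions, not part of any library. The result is only what the client claims to be; it says nothing about whether the claim is true.

```python
from enum import Enum


class BotClass(Enum):
    INDEXATION = "indexation"
    TRAINING = "training"
    RETRIEVAL = "retrieval"
    OTHER = "other"


# Lowercase substrings for the crawlers named in the text; extend as needed.
_TOKENS = {
    BotClass.INDEXATION: ("googlebot", "bingbot"),
    BotClass.TRAINING: ("gptbot", "claudebot"),
    BotClass.RETRIEVAL: ("oai-searchbot", "perplexitybot"),
}


def classify_user_agent(user_agent: str) -> BotClass:
    """Return the population a user agent string claims to belong to."""
    ua = user_agent.lower()
    for bot_class, tokens in _TOKENS.items():
        if any(token in ua for token in tokens):
            return bot_class
    return BotClass.OTHER
```

Anything that does not match falls into `OTHER`, which keeps ordinary browsers and unknown bots out of the three buckets by default.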
The volume is no longer a rounding error. GPTBot requests grew 305% year over year between May 2024 and May 2025, climbing from ninth to third among all crawlers (Cloudflare, 2025). Here is where the load sits now.
| Crawler group | Share of HTML request traffic | Source |
|---|---|---|
| Googlebot | 4.5% | Cloudflare, 2025 |
| All AI crawlers combined | 4.2% | Cloudflare, 2025 |
Taken together, AI crawlers now rival Googlebot for raw crawl load on your origin. That changes the math. Managing it takes caching and verification at the protocol level, not a plugin.
Conditional validation with 304 saves the most for the least
The cheapest crawl-budget win is the response you don't render. Conditional validation makes that possible. It lets a crawler ask whether anything changed before you build the page. The crawler sends If-Modified-Since and If-None-Match headers carrying the Last-Modified date and ETag it saw last time. Your origin compares them against current state and, when nothing has changed, returns 304 Not Modified with headers only and an empty body.
That empty body does the work. The origin skips template rendering, database calls, and content transfer, and the crawler gets its answer in a fraction of the bytes and time a full 200 would cost. A bot that spends less time per unchanged URL has budget left to reach URLs it hasn't seen, so correct 304 handling tends to lift the count of unique URLs crawled per day rather than just trimming the bill.
Two failure modes are worth checking. An origin that generates a fresh ETag on every request defeats the mechanism, since the validator never matches. And a CDN that strips conditional headers before they reach the origin hides the signal entirely. Confirm both before assuming 304s are working.
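The check the origin performs can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a drop-in handler: `make_etag` and `is_not_modified` are hypothetical names, the ETag is derived from a content hash so it stays stable while the content is unchanged, `last_modified` is assumed to be a timezone-aware datetime, and the comma split on If-None-Match ignores the rare case of commas inside a quoted tag.

```python
import hashlib
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def make_etag(body: bytes) -> str:
    """Derive the ETag from content so it only changes when the content does."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def _opaque(tag: str) -> str:
    """Drop the weak-validator prefix so W/"x" and "x" compare equal."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def is_not_modified(request_headers: dict, etag: str, last_modified: datetime) -> bool:
    """Return True when the origin can answer 304 instead of rendering."""
    headers = {k.lower(): v for k, v in request_headers.items()}

    # If-None-Match takes precedence over If-Modified-Since when both are sent.
    if_none_match = headers.get("if-none-match")
    if if_none_match is not None:
        candidates = [_opaque(t) for t in if_none_match.split(",")]
        return "*" in candidates or _opaque(etag) in candidates

    if_modified_since = headers.get("if-modified-since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return False  # unparseable date: fall back to a full response
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution.
        return last_modified.replace(microsecond=0) <= since

    return False
```

Hashing the rendered body still costs a render, so in practice the validator would more likely come from a stored revision ID or update timestamp. The point of the sketch is the stability property: the same content always yields the same ETag.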
HTTP 103 Early Hints needs a guard, not a blanket
Early Hints is a real user-experience win and a real crawler hazard, and the resolution is narrower than most performance guides suggest. A 103 Early Hints response preloads critical resources and improves Largest Contentful Paint for human visitors. The problem is that a 103 is an interim, headers-only response sent ahead of the final one. Googlebot does not support experimental HTTP features, and a client that isn't expecting the interim response can read it as a bad response.
Google's own guidance from Gary Illyes is specific. Emit 103 only when the request carries sec-fetch-mode: navigate, which real browser navigations send and search crawlers do not (Google, via Search Engine Roundtable). That single condition is the fix. You don't need to fingerprint every bot and maintain a blocklist for Early Hints; you gate the feature on the header that distinguishes a human navigation from a crawler fetch. The same logic applies to prefetch hints and Real User Monitoring beacons, none of which a crawler needs and all of which add noise to a crawl.
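The gate itself is a one-line check. A minimal sketch, assuming the request headers arrive as a plain dict; `should_send_early_hints` is a hypothetical helper name, and how the 103 actually gets emitted depends on your server or CDN.

```python
def should_send_early_hints(request_headers: dict) -> bool:
    """Emit 103 only for requests that declare a browser navigation."""
    headers = {k.lower(): v for k, v in request_headers.items()}
    return headers.get("sec-fetch-mode", "").strip().lower() == "navigate"
```

A request without the header, or with a different mode such as `cors` or `no-cors`, gets the normal final response and nothing before it.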
Verifying the bot before you trust the user agent
A user agent string is plain text. Scrapers wear Googlebot's like a disguise to slip past rate limits. A reliable check is the one Google documents, double-reverse DNS (Google also publishes its crawler IP ranges as an alternative).
The pipeline runs in three steps. Take the client IP and do a reverse lookup to get a hostname. Confirm the hostname sits under a trusted domain, googlebot.com or google.com for Google's crawlers. Then do a forward lookup on that hostname and confirm it resolves back to the original IP. A spoofer controlling its own PTR records can fake the first step but not the round trip. Cache the verdict per IP in something fast like Redis or KeyDB, because Googlebot crawls from a small address pool and you don't want a DNS round trip on every hit.
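The three steps can be sketched with the standard library's `socket` module. This is an illustration under assumptions, not a production verifier: `is_verified_googlebot` is a hypothetical name, the in-process dict stands in for Redis or KeyDB, verdicts never expire here, and a transient DNS failure is cached as a negative verdict, which a real deployment would want to handle with a short TTL. The resolver functions are parameters so the logic can be exercised without network access.

```python
import socket
from typing import Callable, Dict, List

# Trusted parent domains for Google's crawlers, as named in the text.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip: str) -> str:
    """Step 1: PTR lookup, IP to hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname: str) -> List[str]:
    """Step 3: forward lookup, hostname to all of its addresses."""
    return sorted({info[4][0] for info in socket.getaddrinfo(hostname, None)})


_verdicts: Dict[str, bool] = {}  # stand-in for a shared cache with a TTL


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
    cache: Dict[str, bool] = _verdicts,
) -> bool:
    if ip in cache:
        return cache[ip]
    try:
        hostname = reverse(ip).rstrip(".").lower()
        # Step 2: suffix match with a leading dot, so "googlebot.com.evil.example" fails.
        verdict = hostname.endswith(TRUSTED_SUFFIXES) and ip in forward(hostname)
    except OSError:  # socket.herror and socket.gaierror both derive from OSError
        verdict = False
    cache[ip] = verdict
    return verdict
```

The leading dot in each suffix matters. A spoofer can point a PTR record at any hostname it likes, including one that merely contains a trusted domain, so the match has to be on the end of the name, and the forward lookup has to return the original IP.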
At the edge, Cloudflare exposes the same decision as variables you can act on before the request reaches your origin. cf.bot_management.verified_bot flags confirmed legitimate crawlers, cf.verified_bot_category names the type, and cf.bot_management.score rates each request from 1 to 99, where lower scores mean more likely automated. JA3 and JA4 TLS fingerprints catch clients whose handshake doesn't match the browser they claim to be. The pattern is to allow verified search engines, challenge or block the spoofers, and never make that call on the user agent alone.
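The decision pattern can be written out as plain logic. The sketch below is Python for illustration only; at the edge it would be expressed as a Cloudflare rule over the fields named above, not as application code. `edge_action` is a hypothetical function, and the default threshold of 30 is an assumption to tune against your own traffic.

```python
def edge_action(
    claims_search_crawler: bool,
    verified_bot: bool,
    bot_score: int,
    challenge_below: int = 30,  # assumed threshold, not a recommendation
) -> str:
    """Return "allow", "challenge", or "block" for one request."""
    if verified_bot:
        return "allow"  # confirmed crawler, regardless of score
    if claims_search_crawler:
        return "block"  # user agent says search crawler, verification says no
    if bot_score < challenge_below:
        return "challenge"  # low score means likely automation
    return "allow"
```

The user agent appears only as a claim to be checked against the verified flag. It never grants access on its own.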
The log fields that make this measurable
None of the above is observable without the right fields in your access log. Five carry most of the diagnostic weight.
| Log field | Example value | What it tells you |
|---|---|---|
| `$remote_addr` | 66.249.66.1 | Verify bot authenticity, isolate spoofers |
| `$status` | 304, 301, or 410 | Cache hit ratio, redirect loops, crawl efficiency |
| `$body_bytes_sent` | numeric bytes | Payload bloat and uncompressed responses |
| `$http_user_agent` | crawler identifier | Classify indexation, training, or retrieval |
| `$time_local` | localized timestamp | Crawl frequency spikes and discovery rate |
Join those over a month and the picture assembles itself: which vendors crawl most, how much of it returns 304 versus a full render, where spoofers cluster, and which sections eat budget without earning a citation.
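A first pass at that join can be sketched as below. It assumes nginx's default `combined` log format; a custom `log_format` would need a different pattern. `summarize` and the token list are illustrative names, and the grouping trusts the user agent as written, so spoofed requests are counted under the crawler they imitate until the IP check is added.

```python
import re
from collections import defaultdict

# Matches nginx's default "combined" format:
# $remote_addr - $remote_user [$time_local] "$request" $status
# $body_bytes_sent "$http_referer" "$http_user_agent"
LINE_RE = re.compile(
    r'(?P<remote_addr>\S+) \S+ \S+ \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# Crawlers named in the text; extend as needed.
CRAWLER_TOKENS = (
    "Googlebot", "Bingbot", "GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot",
)


def summarize(lines):
    """Count requests, 304 responses, and bytes sent per claimed crawler."""
    stats = defaultdict(lambda: {"requests": 0, "not_modified": 0, "bytes": 0})
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        ua = match["user_agent"].lower()
        crawler = next((t for t in CRAWLER_TOKENS if t.lower() in ua), None)
        if crawler is None:
            continue  # not one of the tracked crawlers
        row = stats[crawler]
        row["requests"] += 1
        if match["status"] == "304":
            row["not_modified"] += 1
        row["bytes"] += int(match["body_bytes_sent"])
    return dict(stats)
```

Passing an open log file to `summarize` yields one row per crawler. Dividing `not_modified` by `requests` gives the share of fetches answered without a render.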
The work sits at the seam between SEO and infrastructure, which is exactly why it tends to fall through the cracks at growing companies. If you want a second read on which crawlers are worth your origin's resources and how your caching is actually behaving under bot load, the AI Search & SEO Audit covers this layer, server logs included.
For the indexation side of the same problem, see why Google crawls pages it won't index. For the verified-Googlebot log script that pairs with this, see find ghost crawls with a Python log script.