When Googlebot Keeps Crawling Pages It Refuses to Index
Why crawled pages drop out of Google's index, how to find ghost crawls in your server logs, and when to prune, noindex, or consolidate programmatic pages.
Open the Page indexing report in Search Console on any large site and the pattern is usually there. "Crawled, currently not indexed" keeps growing while the indexed count stays flat. Googlebot visits, fetches a 200, and walks away without keeping anything. Discovery stopped being the bottleneck years ago. Retention is the bottleneck now.
Google's own documentation is direct about this. Crawling never guarantees indexing, and pages that make it in can be dropped later when the index has better candidates for the same intent. Indexing is a selection process with a cost attached, and pages compete for a slot. Thin programmatic variations, near-duplicates, and template-heavy pages lose that competition quietly. No error, no notification.
The economics behind the selection
Storing and serving a page costs Google money for as long as the page stays in the index. So the index favors pages that earn their slot, and the crawl system rations attention by site. Google's crawl budget documentation puts thresholds on when this rationing starts to bite.
| Site profile (Google Search Central) | Crawl budget becomes a real constraint |
|---|---|
| 1 million+ unique pages, content changing weekly | Yes |
| 10,000+ unique pages, content changing daily | Yes |
| Smaller sites with stable content | Rarely |
Below those lines, crawl budget talk is mostly noise. Above them, every low-value URL Googlebot fetches is a higher-value URL it didn't, and index retention starts behaving like a portfolio decision someone else makes about your site.
Ghost crawls are the symptom worth logging
Call them ghost crawls. URLs that Googlebot fetches on a regular schedule, gets a 200 for, and never indexes, or indexes briefly and drops. Search Console won't show you the pattern directly because the Crawl Stats report and the Page indexing report live in different views. Your logs show it plainly.
The audit takes three inputs. Raw access logs, a verified-Googlebot filter, and an export of your indexed URLs.
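The verified-Googlebot filter is the input most often skipped. A minimal sketch of the reverse-then-forward DNS check, assuming the `host` utility is installed and prints its usual "domain name pointer" and "has address" lines; the function names are illustrative, and IPv6 addresses may need normalizing before the string comparison holds.

```shell
# Pure string check: is this reverse-DNS name under a Google crawler domain?
# A trailing dot, as printed by `host`, is stripped before matching.
is_google_crawler_host() {
  case "${1%.}" in
    *.googlebot.com|*.google.com) return 0 ;;
    *) return 1 ;;
  esac
}

# verify_googlebot_ip IP
# Succeeds only if the PTR name is in a Google domain AND that name
# resolves back to the same IP (forward confirmation).
verify_googlebot_ip() {
  ip=$1
  name=$(host "$ip" | awk '/domain name pointer/ { print $NF; exit }')
  is_google_crawler_host "$name" || return 1
  host "${name%.}" |
    awk -v ip="$ip" '/has (IPv6 )?address/ && $NF == ip { ok = 1 } END { exit !ok }'
}

# Usage (assumes the client IP is the first field of each log line):
#   grep Googlebot access.log | awk '{print $1}' | sort -u |
#     while read -r ip; do verify_googlebot_ip "$ip" && echo "$ip"; done > verified_ips.txt
```

Checking each distinct IP once and caching the result keeps the DNS load small, since Googlebot reuses a limited set of addresses within any log window.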
# Googlebot hits per URL path over the log window (field 7 in the combined log format)
grep Googlebot access.log | awk '{print $7}' | sort | uniq -c | sort -rn > crawled.txt
# Paths crawled 5+ times that appear nowhere in the indexed export
awk '$1 >= 5 {print $2}' crawled.txt | sort > hot.txt
# comm needs both inputs sorted; the export must list paths in the same form as the log
sort -u indexed_urls.txt > indexed_sorted.txt
comm -23 hot.txt indexed_sorted.txt > ghost_crawls.txt
Two cautions from doing this on real logs. Filter by verified Googlebot (reverse DNS resolving to googlebot.com or google.com, then forward-confirm the hostname back to the same IP), because scrapers fake the user agent constantly. And read the result by directory, not by URL. Ghost crawls cluster in sections (faceted filters, paginated archives, parameter variants, location pages), and the section is the unit you'll act on.
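Reading the result by directory can be as simple as counting ghost URLs per top-level path segment. A minimal sketch, assuming ghost_crawls.txt holds one URL path per line; the function name `ghost_by_section` is illustrative, and deeper sites may want two segments instead of one.

```shell
# ghost_by_section FILE
# Counts ghost-crawl paths per top-level directory, largest section first.
# Query strings and fragments are dropped so parameter variants roll up
# into the section they belong to; root-level pages are grouped under "/".
ghost_by_section() {
  awk '{
    sub(/[?#].*/, "", $0)          # strip ?query and #fragment
    n = split($0, part, "/")       # "/a/b" -> "", "a", "b"
    if (n > 2) sec = "/" part[2] "/"
    else       sec = "/"
    count[sec]++
  }
  END { for (s in count) print count[s], s }' "$1" | sort -rn
}

# Usage:
#   ghost_by_section ghost_crawls.txt | head -n 20
```

A section that dominates this list is a candidate for one decision applied to the whole template, not a URL-by-URL review.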
Information density decides who survives
A pattern shows up again and again in the ghost-crawl list. Pages where the template outweighs the content. Strip the header, footer, sidebar, and recommendation blocks from a typical programmatic page and count what's left that exists nowhere else on the site. On losing pages it's often a hundred words wrapped in two thousand words of shell.
Google's quality rater guidelines have distinguished main content from supplementary content for years, and the indexing system behaves consistently with that split. A crawler that fetches fifty location pages and finds the same shell with a swapped city name learns the section's value quickly.
Measuring this doesn't need bespoke tooling. Screaming Frog's near-duplicate report with a custom content area set, or a rendered-DOM word count of the main content block against the page's total word count, gets you a density score per template. As a rule of thumb, sections sitting under roughly one-third unique content are the first place to look.
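The word-count version of that score is plain arithmetic. A rough sketch, assuming you have already exported two text files per sample page, one with only the main content block and one with all rendered text; the file names and the `density_pct` helper are hypothetical.

```shell
# density_pct MAIN_TEXT_FILE FULL_TEXT_FILE
# Prints main-content words as a whole-number percentage of all words on the page.
density_pct() {
  # $(( ... )) normalizes the padded output some wc implementations print
  main_words=$(( $(wc -w < "$1") ))
  total_words=$(( $(wc -w < "$2") ))
  if [ "$total_words" -eq 0 ]; then
    echo 0
    return 0
  fi
  echo $(( main_words * 100 / total_words ))
}

# Usage: flag a template sample that falls under the one-third rule of thumb
#   pct=$(density_pct main.txt full.txt)
#   [ "$pct" -lt 33 ] && echo "low density: ${pct}%"
```

For a page with a hundred words of main content inside two thousand words of shell, this gives 100 × 100 / 2100, which integer division rounds down to 4.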
Prune, noindex, or consolidate
The brief version of a decision that gets overcomplicated.
- Pages with no search value and no users. Return 410 (or 404) and remove internal links to them. Dead weight earns nothing by existing.
- Pages users need but search doesn't. Apply noindex and let them be crawled. Blocking them in robots.txt instead is the classic mistake, because a page Google can't crawl can't show its noindex, and URL-only indexing can keep it half-alive.
- Overlapping variants with real demand spread across them. Consolidate to one canonical page and 301-redirect the variants to it. The variants' accumulated signals transfer through the redirect, whereas a noindex just throws them away.
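The robots.txt mistake in the second case is easy to check for mechanically. A deliberately simplified sketch, assuming a local copy of robots.txt and a file of noindexed URL paths (both file names are hypothetical): it treats every Disallow value as a plain path prefix and ignores user-agent groups, wildcards, and Allow overrides, so treat hits as leads to confirm, not verdicts.

```shell
# blocked_noindex ROBOTS_TXT NOINDEX_PATHS
# Prints each noindexed path that starts with a Disallow prefix,
# followed by a tab and the matching rule.
blocked_noindex() {
  awk '
    FILENAME == ARGV[1] {                 # first file: collect Disallow prefixes
      line = $0
      sub(/\r$/, "", line)                # tolerate CRLF line endings
      sub(/[ \t]*#.*/, "", line)          # drop comments
      if (tolower(line) ~ /^[ \t]*disallow[ \t]*:/) {
        sub(/^[^:]*:[ \t]*/, "", line)    # keep only the rule value
        sub(/[ \t]+$/, "", line)
        if (line != "") rules[++n] = line # an empty Disallow blocks nothing
      }
      next
    }
    {                                     # second file: one URL path per line
      for (i = 1; i <= n; i++)
        if (index($0, rules[i]) == 1) { print $0 "\t" rules[i]; break }
    }
  ' "$1" "$2"
}

# Usage:
#   blocked_noindex robots.txt noindex_paths.txt
```

Any path this prints carries a noindex that Googlebot may never get to read, which is the half-alive state described above.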
Pruning feels destructive, so teams default to keeping everything indexed and hoping. The sites that win the retention game treat the index footprint as something they curate. Fewer, denser pages crawled more often beats a long tail of shells crawled once a quarter.
Retention is upstream of AI citations too
The post-index world has a second customer. AI Overviews ground their answers in Google's index, and answer engines that run their own crawlers apply the same economics with smaller budgets. A page pruned for low density isn't just absent from rankings. It's absent from the pool of passages any answer engine can quote. Index retention has become the entry ticket for both result types.
A useful next step this quarter is a one-time log audit against your indexed export, section by section, before deciding what to build next. If you'd rather have a second pair of eyes on the result, the AI Search & SEO Audit on this site covers exactly this diagnostic, crawl economics included.