Technical SEO

When Googlebot Keeps Crawling Pages It Refuses to Index

Why crawled pages drop out of Google's index, how to find ghost crawls in your server logs, and when to prune, noindex, or consolidate programmatic pages.

Open the Page indexing report in Search Console on any large site and the pattern is usually there. "Crawled, currently not indexed" keeps growing while the indexed count stays flat. Googlebot visits, fetches a 200, and walks away without keeping anything. Discovery stopped being the bottleneck years ago. Retention is the bottleneck now.

Google's own documentation is direct about this. Crawling never guarantees indexing, and pages that make it in can be dropped later when the index has better candidates for the same intent. Indexing is a selection process with a cost attached, and pages compete for a slot. Thin programmatic variations, near-duplicates, and template-heavy pages lose that competition quietly. No error, no notification.

The economics behind the selection

Storing and serving a page costs Google money for as long as the page stays in the index. So the index favors pages that earn their slot, and the crawl system rations attention by site. Google's crawl budget documentation puts thresholds on when this rationing starts to bite.

| Site profile (Google Search Central) | Crawl budget becomes a real constraint |
| --- | --- |
| 1 million+ unique pages, content changing weekly | Yes |
| 10,000+ unique pages, content changing daily | Yes |
| Smaller sites with stable content | Rarely |

Below those lines, crawl budget talk is mostly noise. Above them, every low-value URL Googlebot fetches is a higher-value URL it didn't fetch, and index retention starts behaving like a portfolio decision someone else makes about your site.

Ghost crawls are the symptom worth logging

Call them ghost crawls. URLs that Googlebot fetches on a regular schedule, gets a 200 for, and never indexes, or indexes briefly and then drops. Search Console won't show you the pattern directly because the crawl stats and the indexing report live in different views. Your logs show it plainly.

The audit takes three inputs. Raw access logs, a verified-Googlebot filter, and an export of your indexed URLs.

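# Googlebot hits per URL over the log window
# (field 7 is the request path in the common/combined log format)
grep Googlebot access.log | awk '{print $7}' | sort | uniq -c | sort -rn > crawled.txt

# URLs crawled 5+ times that appear nowhere in the indexed export
# (comm needs both inputs sorted, and indexed_urls.txt must list paths in the same form as the log)
awk '$1 >= 5 {print $2}' crawled.txt | sort > hot.txt
sort -u indexed_urls.txt > indexed_sorted.txt
comm -23 hot.txt indexed_sorted.txt > ghost_crawls.txt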
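
The comparison only works if both files describe URLs the same way. Access logs record request paths (/shoes/red), while an indexed-URL export usually holds absolute URLs, and in that case comm matches nothing and every hot URL looks like a ghost. A minimal sketch of the normalization, assuming one absolute URL per line and a single host; the function name urls_to_paths and the file name indexed_export.txt are placeholders, not part of any tool:

```shell
# Hypothetical helper: turn absolute URLs on stdin into request paths
# so they compare cleanly against field 7 of the access log.
# Assumes one URL per line and a single host (scheme and host are dropped).
urls_to_paths() {
  sed -e '/^[[:space:]]*$/d' \
      -e 's#^[A-Za-z][A-Za-z]*://[^/]*##' \
      -e 's#^$#/#' | sort -u
}

# Usage (file names are placeholders):
#   urls_to_paths < indexed_export.txt > indexed_urls.txt
```

Query strings are left in place on purpose, since the log's request field carries them too.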

Two cautions from doing this on real logs. Filter by verified Googlebot, because scrapers fake the user agent constantly: a reverse DNS lookup on the requesting IP should return a googlebot.com hostname, and a forward lookup on that hostname should return the same IP. And read the result by directory, not by URL. Ghost crawls cluster in sections (faceted filters, paginated archives, parameter variants, location pages), and the section is the unit you'll act on.
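
Both cautions can be sketched in a few lines of shell. This is a rough illustration rather than a finished tool: it assumes the `host` DNS utility is installed, that client IPs are the first field of each log line, and that the first path segment is a good enough stand-in for "section". All function names are made up for the example.

```shell
# Sketch only. Function names are hypothetical; requires the `host` utility.

# True when a reverse-DNS name sits under a Google crawler domain.
# The article names googlebot.com; Google's verification docs also list google.com.
is_googlebot_host() {
  case "$1" in
    *.googlebot.com|*.google.com) return 0 ;;
    *) return 1 ;;
  esac
}

# Reverse lookup, domain check, then forward-confirm back to the same IP.
verify_googlebot_ip() {
  ip=$1
  name=$(host "$ip" 2>/dev/null | awk '/domain name pointer/ {print $NF; exit}')
  name=${name%.}   # host prints PTR names with a trailing dot
  is_googlebot_host "$name" || return 1
  host "$name" 2>/dev/null | awk -v ip="$ip" '$NF == ip {ok = 1} END {exit !ok}'
}

# Roll a list of paths (stdin) up to their first directory,
# printing "count section" with the busiest section first.
count_by_section() {
  awk '{
    path = $0
    sub(/[?#].*$/, "", path)          # ignore query strings and fragments
    n = split(path, seg, "/")
    section = (n > 2) ? "/" seg[2] "/" : "/"
    count[section]++
  }
  END { for (s in count) print count[s], s }' | sort -k1,1nr
}
```

Running `count_by_section < ghost_crawls.txt` gives a per-section tally to act on, and `verify_googlebot_ip` can be looped over the distinct IPs in the log (ideally with results cached, since each check costs two DNS lookups) before the Googlebot grep is trusted.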

Information density decides who survives

A pattern shows up again and again in the ghost-crawl list. Pages where the template outweighs the content. Strip the header, footer, sidebar, and recommendation blocks from a typical programmatic page and count what's left that exists nowhere else on the site. On losing pages it's often a hundred words wrapped in two thousand words of shell.

Google's quality rater guidelines have distinguished main content from supplementary content for years, and the indexing system behaves consistently with that split. A crawler that fetches fifty location pages and finds the same shell with a swapped city name learns the section's value quickly.

Measuring this doesn't need bespoke tooling. Screaming Frog's near-duplicate report with a custom content area set, or a rendered-DOM word count of the main content block against the page's total word count, gets you a density score per template. Sections sitting under roughly one-third unique content are the first place to look.
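
The word-count version of that score is simple arithmetic. A minimal sketch, assuming you already have two plain-text dumps per sample page (the main content block and the full rendered page); the function names and the 33 percent cut-off are illustrative choices, and a raw word ratio is only a proxy for content that is truly unique across the site:

```shell
# Sketch only: function names and the 33% cut-off are assumptions.
# $1 = text of the main content block, $2 = text of the full rendered page.

# Percentage of the page's words that sit in the main content block.
density_pct() {
  main_words=$(wc -w < "$1" | tr -d '[:space:]')
  total_words=$(wc -w < "$2" | tr -d '[:space:]')
  if [ "$total_words" -eq 0 ]; then echo 0; return 0; fi
  echo $(( 100 * main_words / total_words ))
}

# Succeeds when a page falls under the rough one-third line.
is_thin() {
  [ "$(density_pct "$1" "$2")" -lt 33 ]
}
```

Scoring a handful of sample pages per template is usually enough, because pages built from the same template tend to land close together.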

Prune, noindex, or consolidate

The brief version of a decision that gets overcomplicated.

Pruning feels destructive, so teams default to keeping everything indexed and hoping. The sites that win the retention game treat the index footprint as something they curate. Fewer, denser pages crawled more often beats a long tail of shells crawled once a quarter.

Retention is upstream of AI citations too

The post-index world has a second customer. AI Overviews ground their answers in Google's index, and answer engines that run their own crawlers apply the same economics with smaller budgets. A page pruned for low density isn't just absent from rankings. It's absent from the pool of passages any answer engine can quote. Index retention has become the entry ticket, for both result types.

A useful next step this quarter is a one-time log audit against your indexed export, section by section, before deciding what to build next. If you'd rather have a second pair of eyes on the result, the AI Search & SEO Audit on this site covers exactly this diagnostic, crawl economics included.

