Technical SEO

A Python Script That Finds Pages Googlebot Crawls but Won't Keep

A walkthrough of a log-analysis script that verifies real Googlebot hits, scores crawl frequency, and flags URLs missing from the index.

Three lines of grep were enough to start the ghost-crawl audit in the previous post on index retention. Shell pipelines stop scaling right around the point the question gets interesting, when you want verified bot traffic only, status-code filters, and output your team can sort in a spreadsheet. This is the Python version, short enough to read in one sitting and honest about the part most published versions of this script fake.

That part is verification. A user-agent string is a text field anyone can send, and scrapers impersonate Googlebot constantly because it gets them past rate limiters. A log analysis that filters on the string "Googlebot" alone counts an unknown amount of fake traffic, and every number downstream inherits the error.

Parsing the log and checking who's really calling

The combined log format most servers write includes the user agent in the final quoted field. The regex captures it, the filter does a cheap string check first, and the expensive check runs only on survivors.

import re
import socket
from functools import lru_cache
import pandas as pd

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>.*?)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>.*?)" "(?P<agent>.*?)"'
)

@lru_cache(maxsize=4096)
def is_real_googlebot(ip):
    """Reverse DNS, then forward-confirm. Google's documented method."""
    try:
        host = socket.gethostbyaddr(ip)[0]
        if not host.endswith(('.googlebot.com', '.google.com')):
            return False
        return ip in {a[4][0] for a in socket.getaddrinfo(host, None)}
    except OSError:
        return False

Two lookups, not one, and the second is the point. Reverse DNS alone can be spoofed by anyone who controls their own PTR records, so you resolve the hostname back to an IP and confirm it matches the one in the log. Google documents exactly this procedure for crawler verification. The lru_cache keeps it fast, since Googlebot visits from a modest pool of addresses and each one only needs checking once per run.
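To see why the second lookup matters, here is a minimal sketch of the same round-trip with the two DNS calls passed in as parameters, so the logic can be exercised offline. The function name forward_confirmed, the injectable reverse_lookup and forward_lookup arguments, and the fake DNS tables are all assumptions made for illustration; the addresses come from reserved documentation ranges and the hostnames are invented.

```python
import socket

GOOGLE_SUFFIXES = ('.googlebot.com', '.google.com')

def _dns_reverse(ip):
    # Real PTR lookup, same call the article's function uses
    return socket.gethostbyaddr(ip)[0]

def _dns_forward(host):
    # Real forward lookup, collected into a set of address strings
    return {a[4][0] for a in socket.getaddrinfo(host, None)}

def forward_confirmed(ip, reverse_lookup=_dns_reverse, forward_lookup=_dns_forward):
    """True only if ip -> hostname -> ip round-trips through a Google hostname.

    The two lookup parameters are hypothetical injection points for testing.
    """
    try:
        host = reverse_lookup(ip)
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        return ip in forward_lookup(host)
    except OSError:
        return False

# Invented DNS data: PTR records are set by whoever owns the IP block,
# forward records only by whoever owns the domain.
FAKE_PTR = {
    '192.0.2.10': 'crawl-192-0-2-10.googlebot.com',     # honest crawler
    '198.51.100.7': 'crawl-198-51-100-7.googlebot.com',  # spoofed PTR
    '203.0.113.5': 'scraper.example.net',                # not even pretending
}
FAKE_A = {
    'crawl-192-0-2-10.googlebot.com': {'192.0.2.10'},
    # The spoofed hostname does not resolve back to the spoofer's IP
    'crawl-198-51-100-7.googlebot.com': {'192.0.2.99'},
}

def fake_reverse(ip):
    if ip not in FAKE_PTR:
        raise OSError('no PTR record')
    return FAKE_PTR[ip]

def fake_forward(host):
    return FAKE_A.get(host, set())

for ip in ['192.0.2.10', '198.51.100.7', '203.0.113.5', '192.0.2.200']:
    print(ip, forward_confirmed(ip, fake_reverse, fake_forward))
```

Only the first address passes. The second one clears the hostname check and fails on the forward confirmation, which is the case a reverse-only check would wave through.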

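def parse_googlebot_hits(log_path):
    records = []
    # errors='replace' stops one malformed byte from aborting the whole run
    with open(log_path, encoding='utf-8', errors='replace') as f:
        for line in f:
            m = LOG_PATTERN.match(line)
            if not m:
                continue  # not combined log format, skip it
            d = m.groupdict()
            if 'Googlebot' not in d['agent']:
                continue  # cheap string check first
            if not is_real_googlebot(d['ip']):
                continue  # spoofed UA, skip it
            records.append({
                'timestamp': d['date'],
                'path': d['path'],
                'status': int(d['status']),
            })
    # Explicit columns keep the schema intact when no hits survive the filters
    return pd.DataFrame(records, columns=['timestamp', 'path', 'status'])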

One deliberate choice here. The path keeps its query string. Faceted and parameter URLs are often exactly where crawl waste lives, and stripping parameters would hide the worst offenders inside their clean parents. Strip them later if your site doesn't use parameters for content.
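A quick way to check what the pattern keeps is to run it on one line. This sketch repeats the regex from above so it runs on its own; the sample log line is made up, with an address from a reserved documentation range.

```python
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>.*?)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>.*?)" "(?P<agent>.*?)"'
)

# Invented combined-format line for illustration
SAMPLE_LINE = (
    '192.0.2.10 - - [12/Mar/2024:06:25:14 +0000] '
    '"GET /shoes?color=red&page=3 HTTP/1.1" 200 5123 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

fields = LOG_PATTERN.match(SAMPLE_LINE).groupdict()
print(fields['path'])                  # query string is still attached
print(fields['status'])                # a string until int() is applied
print('Googlebot' in fields['agent'])  # the cheap pre-filter
```

The path comes back as /shoes?color=red&page=3, so each parameter combination is counted as its own URL.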

Scoring retention risk

With verified hits in a DataFrame, the risk logic is a frequency count joined against your indexed URL list.

def retention_risks(hits, indexed_paths, min_crawls=5):
    ok = hits[hits['status'] == 200]
    counts = (ok['path'].value_counts()
                .rename_axis('path')
                .reset_index(name='crawls'))
    counts['indexed'] = counts['path'].isin(indexed_paths)
    return counts[(~counts['indexed']) & (counts['crawls'] >= min_crawls)]

The 200-only filter is doing real work. Status codes split the problem. A URL crawled twenty times that returns 404 is a cleanup task for your redirect map. A URL crawled twenty times that returns 200 and never gets indexed is a value judgment Google made about your content, and that's the list this script exists to produce.
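The other half of that split, the frequently crawled URLs that do not return 200, can be pulled from the same DataFrame. A minimal sketch, assuming the hits frame has the path and status columns built above; the function name crawl_errors and the toy data are invented for illustration.

```python
import pandas as pd

def crawl_errors(hits, min_crawls=5):
    """Frequently crawled URLs that did not return 200, one row per (path, status)."""
    bad = hits[hits['status'] != 200]
    counts = (bad.groupby(['path', 'status'])
                 .size()
                 .reset_index(name='crawls'))
    return (counts[counts['crawls'] >= min_crawls]
            .sort_values('crawls', ascending=False)
            .reset_index(drop=True))

# Toy data: six 404s, five 301s, eight healthy 200s, one stray 404
hits = pd.DataFrame({
    'path': ['/old-page'] * 6 + ['/moved'] * 5 + ['/fine'] * 8 + ['/rare-404'],
    'status': [404] * 6 + [301] * 5 + [200] * 8 + [404],
})

errors = crawl_errors(hits)
print(errors)
```

Here /old-page and /moved make the list, while /fine and the single stray 404 do not. This is the redirect-map worklist, kept separate from the retention list so the two problems don't get mixed.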

Five crawls per log window is a starting threshold for separating deliberate revisits from one-off discovery fetches, not a fixed rule. Tune it to your window length: five hits in a week means something different from five hits in a quarter.
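One way to tune the threshold is to scale it by how long the log actually runs. This is a sketch under a loud assumption: that a per-week baseline scales linearly with window length, which is a convenience rather than anything Google documents. The name scaled_min_crawls and the per_week parameter are hypothetical; the timestamp format is the one combined logs normally use.

```python
from datetime import datetime
import math

LOG_TIME_FORMAT = '%d/%b/%Y:%H:%M:%S %z'  # e.g. 12/Mar/2024:06:25:14 +0000

def scaled_min_crawls(timestamps, per_week=5):
    """Scale a weekly crawl threshold to the span the timestamps cover.

    Linear scaling is an assumption; windows shorter than a week
    fall back to the weekly baseline.
    """
    parsed = [datetime.strptime(ts, LOG_TIME_FORMAT) for ts in timestamps]
    if not parsed:
        return per_week
    span_days = (max(parsed) - min(parsed)).total_seconds() / 86400
    weeks = max(span_days / 7, 1.0)
    return math.ceil(per_week * weeks)

four_weeks = ['01/Mar/2024:00:00:00 +0000', '29/Mar/2024:00:00:00 +0000']
three_days = ['01/Mar/2024:00:00:00 +0000', '04/Mar/2024:00:00:00 +0000']
print(scaled_min_crawls(four_weeks))
print(scaled_min_crawls(three_days))
```

The result can be passed as min_crawls to retention_risks, with hits['timestamp'] as the input.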

Rolling the output up to sections

Individual URLs are the wrong unit for action, the same lesson as the shell version. One extra line groups risks by top-level directory, which turns four thousand flagged URLs into a decision about six templates.

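risks = retention_risks(parse_googlebot_hits('access.log'), indexed_paths)
# assign() returns a new frame, which avoids pandas' SettingWithCopyWarning on the filtered result
risks = risks.assign(section=risks['path'].str.extract(r'^(/[^/?]*)', expand=False))
print(risks.groupby('section')['crawls'].sum().sort_values(ascending=False))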

A typical result puts most of the wasted crawl in two or three sections (filtered listings, tag archives, near-duplicate location pages), and those sections become the prune, noindex, or consolidate conversation from the retention post.
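If the summed crawls alone don't show how many URLs sit behind each section, the rollup can carry both numbers. A minimal sketch on invented data, assuming a risks frame with the path and crawls columns produced above; the urls column name is an illustrative choice.

```python
import pandas as pd

# Toy stand-in for the output of retention_risks
risks = pd.DataFrame({
    'path': ['/tags/red', '/tags/blue', '/listing?sort=price',
             '/listing?page=2', '/listing/shoes?size=9', '/about'],
    'crawls': [7, 5, 12, 9, 6, 5],
})

summary = (
    risks
    .assign(section=risks['path'].str.extract(r'^(/[^/?]*)', expand=False))
    .groupby('section')
    .agg(urls=('path', 'size'), crawls=('crawls', 'sum'))
    .sort_values('crawls', ascending=False)
)
print(summary)
```

A section with many URLs and few crawls each points at a different fix than a section with a handful of URLs crawled heavily, so having both columns side by side helps frame the conversation.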

Getting the indexed list without guessing

The indexed_paths set is the input that decides whether the output means anything, and there's no single export that hands it over. Two workable sources, used together when stakes are high.
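Whichever source you use, the list has to match the shape of the log's path field before isin can find anything. Exports usually hold absolute URLs, while the log holds a path plus query string. A minimal sketch of that normalization, assuming the export is already loaded as a list of URL strings; the function names and the example.com URLs are hypothetical.

```python
from urllib.parse import urlsplit

def url_to_log_path(url):
    """Reduce an absolute URL to the path-plus-query form found in access logs."""
    parts = urlsplit(url.strip())
    path = parts.path or '/'   # a bare domain is logged as "/"
    # Fragments never reach the server, so they are dropped
    return f'{path}?{parts.query}' if parts.query else path

def to_indexed_paths(urls):
    """Build the set that retention_risks expects, skipping blank rows."""
    return {url_to_log_path(u) for u in urls if u.strip()}

exported = [
    'https://www.example.com/shoes?color=red',
    'https://www.example.com',
    'https://www.example.com/about/#team',
    '',
]
indexed_paths = to_indexed_paths(exported)
print(sorted(indexed_paths))
```

This does not reconcile trailing slashes or percent-encoding differences, so a quick check of how many log paths match at all is worth doing before trusting the output.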

For a site under a few thousand pages, a week of logs and one afternoon cover the whole loop. Run it quarterly and the trend line tells you more than any single run, since a growing ghost-crawl list is the earliest signal that a template is losing the index's confidence.

If you'd rather build this capability into your own team than borrow a script, the training programmes on this site teach exactly this kind of log work, cohort format, on your own data.

