
Small-team SEO crawl checklist: what to crawl and what to ignore

Most crawl advice is written for enterprise teams. Here's the version for startups with one marketer running Screaming Frog once a quarter.

If you have one marketer, no dev budget this quarter, and a site under 10,000 pages, you do not need to crawl your entire site every week. You also do not need to fix every red cell in a 47-tab spreadsheet.

You need a quarterly crawl that answers three questions: Can Google and AI engines reach my important pages? Are those pages readable? Is anything blocking citations or rankings that I can fix without engineering?

Most crawl guides are written for enterprise teams with dedicated SEO engineers. This is the version for you.

What to crawl

Run a full crawl of your primary domain. Include staging if you are about to push a major redesign, otherwise skip it.

Set your crawl tool to follow internal links only. Respect robots.txt. Limit external links to one hop (you want to see what you are linking to, but you do not need to crawl Wikipedia).

If your site has more than 5,000 pages, filter the export down to:

You are not trying to achieve 100% coverage. You are trying to spot problems on pages that matter.

The five things I check first

I open every crawl export the same way. I sort by HTTP status, then by indexability, then by page depth. Here is what I am looking for.
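If you would rather script that sort than click through a spreadsheet, here is a minimal sketch. It assumes each export row has already been loaded into a dict; the field names `status`, `indexable`, and `depth` are hypothetical and will differ by crawl tool.

```python
def sort_crawl_rows(rows):
    """Order rows by HTTP status, then indexability, then page depth."""
    # False sorts before True, so non-indexable pages surface first
    # within each status code.
    return sorted(rows, key=lambda r: (r["status"], r["indexable"], r["depth"]))


# Hypothetical sample rows standing in for a real crawl export.
rows = [
    {"url": "/blog/post", "status": 200, "indexable": True, "depth": 3},
    {"url": "/pricing", "status": 200, "indexable": False, "depth": 1},
    {"url": "/old-page", "status": 404, "indexable": False, "depth": 2},
    {"url": "/", "status": 200, "indexable": True, "depth": 0},
]
ordered = [r["url"] for r in sort_crawl_rows(rows)]
```

In this sample the non-indexable `/pricing` page lands at the top of the 200 block, which is exactly the kind of row you want to see first.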

1. 404s that should not be 404s

Filter to status code 404. Export the list. Cross-reference it against past organic traffic and inbound links.

If a 404 used to get traffic or still has inbound links, either restore it or set up a 301 redirect to the closest live equivalent. If it is an old blog post with no traffic and no inbound links, leave it dead.

Do not batch-redirect everything to your homepage. That wastes link equity, and Google tends to treat redirects to an irrelevant page as soft 404s anyway. Redirect to the most relevant live page, or let it 404.
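The 404 decision above can be written down as a small rule. This is a sketch only: the `DeadPage` fields are hypothetical names for data you would pull from your analytics and backlink exports, and choosing the "closest live equivalent" is still a human call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeadPage:
    url: str
    past_clicks: int                    # assumed: organic clicks before the page broke
    inbound_links: int                  # assumed: external links still pointing at it
    closest_live: Optional[str] = None  # most relevant live page, picked by a human


def triage_404(page: DeadPage) -> str:
    """Return the action for one 404, following the rule described above."""
    if page.past_clicks == 0 and page.inbound_links == 0:
        return "leave as 404"
    if page.closest_live:
        return f"301 to {page.closest_live}"
    # It has value but no relevant target, so do not fall back to the homepage.
    return "restore the page"
```

For example, `triage_404(DeadPage("/old-guide", 120, 4, "/guides/crawling"))` returns `"301 to /guides/crawling"`, while a dead post with no clicks and no links is left alone.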

2. Orphan pages with traffic

Orphan pages are live URLs that have zero internal links pointing to them. They exist and may even be indexed, but no visitor or crawler can reach them by clicking through your site.

Filter your crawl to pages with zero inbound internal links. Cross-reference against organic traffic in the last 90 days.

If an orphan page is getting clicks from Google or appearing in AI answers, it needs internal links. Add it to a relevant category hub, link it from a related blog post, or include it in a "related articles" module.

If it has no traffic and no reason to exist, either delete it or noindex it.
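Here is one way to sketch the orphan cross-reference. It assumes a crawl export with an inbound internal link count per URL and a separate 90-day clicks lookup; the `inlinks` field and the `clicks_90d` mapping are hypothetical names, not any tool's actual columns.

```python
def triage_orphans(crawl_rows, clicks_90d):
    """Map each orphan URL to a suggested action."""
    actions = {}
    for row in crawl_rows:
        if row["inlinks"] > 0:
            continue  # not an orphan
        clicks = clicks_90d.get(row["url"], 0)
        if clicks > 0:
            actions[row["url"]] = "add internal links"
        else:
            # Still needs a human check for "no reason to exist".
            actions[row["url"]] = "review: delete or noindex"
    return actions


# Hypothetical sample data.
crawl_rows = [
    {"url": "/guides/pricing-models", "inlinks": 0},
    {"url": "/landing/2021-webinar", "inlinks": 0},
    {"url": "/blog/launch", "inlinks": 7},
]
clicks_90d = {"/guides/pricing-models": 340, "/blog/launch": 95}
orphan_actions = triage_orphans(crawl_rows, clicks_90d)
```

The script cannot see AI-answer citations, so if you track those separately, treat a cited orphan the same way as one with clicks.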

3. Indexable pages blocked by robots.txt or meta robots

Filter to pages that return a 200 status code but are blocked by robots.txt, a noindex tag, or an X-Robots-Tag header.

If the page is meant to be indexed (a product page, a key landing page, a pillar post), remove the block. If it is a checkout flow, a thank-you page, or a customer portal, the block is correct.

The mistake I see most often: a staging directive (noindex) left in place after launch, or a robots.txt disallow that was meant for /admin/ accidentally covering /blog/.

Check your three to five most important landing pages manually. If any are blocked, escalate to your developer immediately.
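If you want to script that manual check, the sketch below tests one URL against all three block types using only Python's standard library. It assumes you have already fetched the robots.txt text, the page HTML, and the response headers yourself; `blocked_reasons` and the sample values are illustrative, and the header lookup expects the exact key `X-Robots-Tag`.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Records whether a <meta name="robots"> tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            if "noindex" in (attr.get("content") or "").lower():
                self.noindex = True


def blocked_reasons(url, robots_txt, html, headers):
    """List which of the three mechanisms block this URL, if any."""
    reasons = []
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch("Googlebot", url):
        reasons.append("robots.txt")
    finder = RobotsMetaFinder()
    finder.feed(html)
    if finder.noindex:
        reasons.append("meta noindex")
    if "noindex" in headers.get("X-Robots-Tag", "").lower():
        reasons.append("X-Robots-Tag")
    return reasons


# Hypothetical robots.txt where a stray rule also covers the blog.
robots_txt = "User-agent: *\nDisallow: /admin/\nDisallow: /blog/\n"
blog_check = blocked_reasons("https://example.com/blog/post", robots_txt, "<html></html>", {})
```

An empty list means none of the three blocks apply. Anything else on a key landing page is the escalation case.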

4. Thin or duplicate title tags and meta descriptions

Sort by title tag. Look for exact duplicates, near-duplicates, or titles under 20 characters.

Duplicate titles confuse Google about which page to rank for a given query. They also make it harder for AI engines to differentiate between your pages when deciding what to cite.
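The title check can be sketched in a few lines. This assumes a list of `(url, title)` pairs from your export. "Near-duplicate" here means titles that match once case and punctuation are stripped, which is one simple definition among several, and the 20-character floor comes from the rule above.

```python
import re
from collections import defaultdict


def audit_titles(pages, min_len=20):
    """Return (duplicate groups, URLs with short titles)."""
    groups = defaultdict(list)
    short = []
    for url, title in pages:
        # Normalize so "Pricing | Acme" and "Pricing - Acme" collide.
        key = re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()
        groups[key].append(url)
        if len(title.strip()) < min_len:
            short.append(url)
    dupes = {key: urls for key, urls in groups.items() if len(urls) > 1}
    return dupes, short


# Hypothetical sample titles.
pages = [
    ("/pricing", "Pricing | Acme"),
    ("/plans", "Pricing - Acme"),
    ("/", "Home"),
    ("/blog/audit", "How to audit a small site | Acme Blog"),
]
dupes, short = audit_titles(pages)
```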

Fix duplicates on:

Use a formula if you need to scale this quickly. For product pages: `[Product Name] | [Category] | [Brand]`. For blog posts: `[Headline] | [Brand Blog]`.
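A minimal sketch of those two formulas as functions, in case you are generating titles from a product feed or CMS export. The function names are hypothetical, and reading `[Brand Blog]` as the brand name followed by "Blog" is my assumption.

```python
def product_title(product, category, brand):
    """Build '[Product Name] | [Category] | [Brand]'."""
    return " | ".join(part.strip() for part in (product, category, brand))


def blog_title(headline, brand):
    """Build '[Headline] | [Brand Blog]'."""
    return f"{headline.strip()} | {brand.strip()} Blog"
```

For example, `product_title("Trail Shoe", "Footwear", "Acme")` gives `Trail Shoe | Footwear | Acme`.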

Meta descriptions matter less for rankings, but they still appear in classic search results and sometimes get pulled into AI answer snippets. If yours are missing or identical across dozens of pages, write unique ones for your top 20 traffic-driving URLs. Ignore the rest until you have time.

5. Deep pages that should not be deep

Page depth is the number of clicks from your homepage. A page that is five or six clicks deep is hard for crawlers to find and unlikely to rank well.
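Your crawl tool reports depth for you, but it helps to see what the number means. The sketch below computes it with a breadth-first search over an internal link graph; the `links` mapping is a hypothetical stand-in for the link data in a crawl export.

```python
from collections import deque


def page_depths(links, home="/"):
    """Return the minimum number of clicks from the homepage to each page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical link graph: page -> pages it links to.
links = {
    "/": ["/products", "/blog"],
    "/products": ["/products/widget"],
    "/blog": ["/blog/post-1"],
    "/blog/post-1": ["/products/widget", "/case-studies/2019"],
}
depths = page_depths(links)
```

A page that never shows up in the result cannot be reached by clicking at all, which is the orphan case from check 2.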

Filter to pages deeper than level 4. Cross-reference against traffic or conversion value.

If a deep page is commercially important (a product page, a lead-gen landing page), bring it closer to the surface. Add it to your main navigation, link it from your homepage, or feature it in a category hub.

If it is a low-value archive page or an old case study that no longer reflects your offering, leave it deep or delete it.
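The deep-page filter can be sketched the same way. It assumes each row carries a `depth` value and that you maintain your own set of commercially important URLs; both names are hypothetical.

```python
def triage_deep_pages(rows, important_urls, max_depth=4):
    """Split pages deeper than max_depth into (surface, leave or delete)."""
    surface, leave = [], []
    for row in rows:
        if row["depth"] <= max_depth:
            continue  # shallow enough already
        if row["url"] in important_urls:
            surface.append(row["url"])
        else:
            leave.append(row["url"])
    return surface, leave


# Hypothetical sample data.
rows = [
    {"url": "/products/widget-pro", "depth": 6},
    {"url": "/archive/2018/03", "depth": 5},
    {"url": "/pricing", "depth": 1},
]
surface, leave = triage_deep_pages(rows, important_urls={"/products/widget-pro", "/pricing"})
```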

What to ignore (for now)

Here is what I do not prioritize in a small-team crawl:

Render issues and JavaScript errors. If your site is built on Next.js, Nuxt, or another modern framework and Google Search Console is not showing coverage errors, your rendering is probably fine. Do not spend hours diagnosing this unless you see a clear indexing problem.

Hreflang mistakes. If you are only operating in one language or one market, hreflang does not apply to you. If you have multiple country sites, fix hreflang only after you have fixed the five issues above.

Canonical tag conflicts. Check canonicals on your ten highest-traffic pages. If those are clean, move on. Do not try to audit canonical tags across 3,000 URLs unless you are seeing widespread indexing issues in Search Console.

Image alt text coverage. Alt text matters for accessibility and for image search, but it rarely moves the needle on organic traffic for a startup. Fix it on product pages and key visuals. Ignore it on decorative UI elements.

Page speed scores below 90. A Lighthouse score of 65 is not ideal, but it is not killing your rankings if your Core Web Vitals (LCP, CLS, INP) are passing in Search Console. Optimize speed after you fix crawlability and content.

How often to crawl

Quarterly is enough for most startups. Crawl more often if:

Monthly crawls make sense once you cross 5,000 indexed pages or you have a dedicated SEO hire. Weekly crawls are for enterprise sites with continuous deployment and large engineering teams.

What to do with the results

Export your findings into a simple tracker: URL, issue type, priority (high, medium, low), owner, status.

High priority: anything blocking indexing, breaking user experience, or affecting your top 20 traffic pages.

Medium priority: duplicate titles, orphan pages with some traffic, 404s with inbound links.

Low priority: deep pages with no traffic, missing alt text on blog images, old redirects that still work.
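Those three tiers can be sketched as a small script that turns findings into the tracker format. The issue labels (`blocked_from_indexing` and so on) are names I made up for illustration; map them to whatever your crawl tool calls each issue.

```python
import csv
import io

# Hypothetical issue labels for the tiers described above.
HIGH = {"blocked_from_indexing", "broken_ux"}
MEDIUM = {"duplicate_title", "orphan_with_traffic", "404_with_inbound_links"}
ORDER = {"high": 0, "medium": 1, "low": 2}


def priority(issue_type, url, top_pages):
    """Anything on a top traffic page is high, whatever the issue type."""
    if issue_type in HIGH or url in top_pages:
        return "high"
    if issue_type in MEDIUM:
        return "medium"
    return "low"


def build_tracker(findings, top_pages, owner="unassigned"):
    """Return tracker CSV text, highest priority first."""
    rows = [(url, issue, priority(issue, url, top_pages)) for url, issue in findings]
    rows.sort(key=lambda r: ORDER[r[2]])
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["URL", "issue type", "priority", "owner", "status"])
    for url, issue, prio in rows:
        writer.writerow([url, issue, prio, owner, "open"])
    return buf.getvalue()


# Hypothetical findings from one crawl.
findings = [
    ("/archive/2018", "deep_page_no_traffic"),
    ("/plans", "duplicate_title"),
    ("/pricing", "duplicate_title"),
    ("/blog/guide", "blocked_from_indexing"),
]
tracker_csv = build_tracker(findings, top_pages={"/pricing"})
```

Note that `/pricing` is bumped to high here because it is in the top-pages set, even though a duplicate title is normally medium.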

Batch the high-priority fixes and either handle them yourself (if they are content or meta tag edits) or hand them to your developer in a single ticket with context. Do not drip-feed fixes one at a time. That is how crawl audits die in the backlog.

If you need help deciding what to fix first or you want a second pair of eyes on a messy crawl export, book a free 30-minute strategy call. I will walk you through the three changes that will move the needle fastest for your site.

The crawl is not the strategy

A crawl tells you what is broken. It does not tell you what to build, what to write, or which queries to target. That is a separate exercise (I covered it in "how to pick the 20 queries that decide your AI search strategy").

Run the crawl. Fix what is blocking your best pages. Then get back to publishing, building links, and showing up in the answers your customers are actually reading.

