
Your robots.txt and sitemap decisions, laid out for AI search

Open your robots.txt right now. Does it block GPTBot? Allow PerplexityBot? You need clear answers for both files before AI engines crawl you this week.

Open your robots.txt file right now. Type yourdomain.com/robots.txt into a browser. Does it block GPTBot? Allow PerplexityBot? Does it point to a sitemap? If you don't have clear answers to those three questions, you are making crawl decisions by accident rather than by strategy. This post gives you the decision framework to fix that in the next hour.

I run these audits with SME clients in Southeast Asia every week. The pattern repeats. Their robots.txt either blocks everything by default (because a developer copy-pasted a cautious template three years ago) or allows everything with no thought about which bots serve citations and which ones scrape for training. Their sitemap includes noindex pages, or hasn't been updated since 2023, or doesn't exist at all. Then they wonder why ChatGPT never cites them and Perplexity pulls from competitors.

The fix is not complicated. You need to make two file decisions and you need to make them based on what each AI crawler actually does.

The crawlers you actually need to decide about

The three most active are GPTBot (OpenAI), ClaudeBot (Anthropic), and PerplexityBot (Perplexity). Each has a different job.

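GPTBot is OpenAI's primary crawler. It gathers content that can improve future models and supports ChatGPT search. (OpenAI's crawler documentation also lists a dedicated search crawler, OAI-SearchBot; if citations are your goal, make sure it is not blocked either.) ClaudeBot collects data for Anthropic's Claude models. PerplexityBot indexes pages so Perplexity can cite them in its answer engine, which often links directly back to sources.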

Then there are the user-triggered fetchers. ChatGPT-User is not a bulk crawler at all. Instead, it fetches a single page in real time when a ChatGPT user clicks a link or asks a question that needs current information. Treating it the same as GPTBot is a common and costly mistake.

The decision you need to make is not "block all AI" or "allow all AI." It is which bot serves which purpose for your brand, and whether you care more about citations (visibility) or training-data control.

Here is the working decision table I use with clients.

| Bot | Purpose | If you block it | Typical posture for SMEs |
| --- | --- | --- | --- |
| GPTBot | Model training + ChatGPT search | No ChatGPT search citations | Allow (citations matter more than training anxiety) |
| ChatGPT-User | Live fetch for user queries | ChatGPT can't retrieve your page when asked | Allow (this is the citation path) |
| ClaudeBot | Model training for Claude | Reduced Claude citation likelihood | Allow |
| PerplexityBot | Indexing for Perplexity citations | Perplexity can't cite you, no direct links back | Allow |
| Google-Extended | Gemini training (not Google Search) | Does not affect classic Google Search indexing | Selective (many block this, allow the others) |
| CCBot (Common Crawl) | Training datasets for many models | Reduced presence in derivative models | Block (training-only, no citation value) |

The cleanest posture for a startup or SME that wants AI visibility is this. Allow the citation bots (GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot). Block the pure-training crawlers (CCBot, optionally Google-Extended). Protect gated paths (checkout, admin, account).

Here is what that looks like in robots.txt.

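# Citation bots share one group. A crawler obeys only the most specific
# group that names it, so it never reads the "*" rules further down.
# The gated paths therefore have to be repeated here.
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: PerplexityBot
Disallow: /admin/
Disallow: /checkout/
Disallow: /account/
Allow: /

# Training-only crawler: blocked everywhere.
User-agent: CCBot
Disallow: /

# Every other crawler: full access except the gated paths.
User-agent: *
Disallow: /admin/
Disallow: /checkout/
Disallow: /account/

Sitemap: https://yourdomain.com/sitemap.xml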

That file lives at the root of your domain. A change takes effect the next time each crawler re-fetches the file, usually within 24 hours.
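Before deploying, it can help to test the rules offline. The sketch below uses Python's standard-library urllib.robotparser against a small inline robots.txt modeled on the posture above. The inline file, the check_access helper, and the yourdomain.com URLs are illustrative assumptions, not your live configuration. One caveat: Python's parser applies rules in file order (first match wins), while many crawlers use the longest matching path, so the Disallow lines are placed before Allow: / to get the same answer under both readings.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt (an assumption for this sketch, not a live file).
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: PerplexityBot
Disallow: /admin/
Allow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def check_access(robots_txt, bots, url):
    """Return {bot_name: True/False} for whether each bot may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


if __name__ == "__main__":
    bots = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "CCBot"]
    for url in ("https://yourdomain.com/blog/post",
                "https://yourdomain.com/admin/users"):
        print(url, check_access(ROBOTS_TXT, bots, url))
```

To check a real file, paste its contents into ROBOTS_TXT and list the URLs you care about. A citation bot that comes back allowed for a gated path is the signal that its group is missing the Disallow lines.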

One more nuance that catches teams: robots.txt does not govern every fetch. Perplexity has still been able to access content on domains where its crawlers were successfully blocked. When someone asks Perplexity about a restricted domain, the user-initiated fetcher (Perplexity-User) can bypass robots.txt. OpenAI respects robots.txt and does not try to evade either a robots.txt directive or a network-level block; when tested, ChatGPT-User fetched the robots file and stopped crawling once it was disallowed. This is the implementation gap. Most bots honor the file. A few have carve-outs for "user-driven" fetches. You can block those at the firewall or CDN level if it matters to you, but for public marketing content the trade-off usually isn't worth it.

The sitemap file that actually helps crawlers find you

Your sitemap.xml is the map you hand to every crawler. A clean sitemap should only include canonical, indexable URLs and stay aligned with robots.txt to avoid sending mixed crawl signals.

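The mistakes I see in SME audits are predictable: sitemaps that list noindex pages, sitemaps that have not been updated since 2023, and sites with no sitemap at all.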

The real work is choosing the right URLs, keeping the file clean, and maintaining trustworthy lastmod data. Stop spending time tuning priority scores. Spend it making sure every URL in the sitemap is a page you actually want indexed, is live (not a 404 or 301), and is crawlable (not blocked by robots.txt or noindex).

Here is the small-team checklist I give clients.

| Check | What it means | How to verify |
| --- | --- | --- |
| Only indexable URLs | No noindex, no 404s, no redirects | Crawl your sitemap in Screaming Frog, filter by status code and meta robots |
| Canonical URLs only | No duplicates, no tracking parameters | Check that every URL in the sitemap matches its own canonical tag |
| Aligned with robots.txt | No URLs listed that robots.txt blocks | Cross-check Disallow paths in robots.txt against sitemap URLs |
| Auto-updates | Sitemap regenerates when you publish | In 2026, your sitemap should update automatically whenever you add, change, or remove a page. CMS users can rely on plugins or native settings. |
| Submitted to Search Console | Google knows where to fetch it | Go to Google Search Console, Sitemaps, verify submission and index coverage |
| Referenced in robots.txt | Crawlers read robots.txt first | Add the Sitemap line at the bottom of robots.txt |
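The "aligned with robots.txt" check is easy to script. Here is a minimal sketch using only the Python standard library. The sample sitemap, the sample robots.txt, and the function names are assumptions made up for illustration. It covers only that one row of the checklist; status codes, noindex tags, and canonicals still need a crawler such as Screaming Frog.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Illustrative inputs (assumptions for this sketch).
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://yourdomain.com/</loc></url>
  <url><loc>https://yourdomain.com/blog/post</loc></url>
  <url><loc>https://yourdomain.com/checkout/cart</loc></url>
</urlset>
"""

SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /checkout/
"""


def sitemap_urls(sitemap_xml):
    """Extract every <loc> value from a urlset sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text]


def blocked_sitemap_urls(sitemap_xml, robots_txt, user_agent="*"):
    """Return sitemap URLs that robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in sitemap_urls(sitemap_xml)
            if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    for url in blocked_sitemap_urls(SAMPLE_SITEMAP, SAMPLE_ROBOTS):
        print("Listed in sitemap but blocked by robots.txt:", url)
```

Any URL this prints is a mixed signal: either remove it from the sitemap or remove the Disallow rule that covers it.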

If your site has fewer than 500 pages, one sitemap file is fine. If you are over 10,000 pages or running e-commerce with product variants, split into multiple files organized under a sitemap index. Past the sitemap protocol's limit of 50,000 URLs (or 50 MB uncompressed) per file, splitting is mandatory.
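For a custom-built site, splitting can be sketched roughly as follows. The 50,000 figure is the sitemap protocol's per-file URL limit; the sitemap-N.xml naming scheme, the function names, and the yourdomain.com base URL are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit in the sitemap protocol


def split_urls(urls, size=MAX_URLS_PER_FILE):
    """Split a list of URLs into chunks of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def render_index(base_url, file_count):
    """Build a sitemap index pointing at sitemap-1.xml .. sitemap-N.xml.

    The file naming scheme is hypothetical; use whatever your site serves.
    """
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for n in range(1, file_count + 1):
        child = ET.SubElement(index, "sitemap")
        ET.SubElement(child, "loc").text = f"{base_url}/sitemap-{n}.xml"
    return ET.tostring(index, encoding="unicode")


if __name__ == "__main__":
    urls = [f"https://yourdomain.com/product/{i}" for i in range(120_000)]
    chunks = split_urls(urls)
    print(len(chunks), "sitemap files needed")
    print(render_index("https://yourdomain.com", len(chunks)))
```

The index file is what you reference in robots.txt and submit to Search Console; the child files are discovered through it.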

Dynamic sitemaps are table stakes now. A sitemap that updates automatically is usually the best option for blogs, e-commerce sites, and any site that publishes regularly. WordPress plugins (Yoast, Rank Math) handle this. Shopify does it natively. If your site is custom-built, ask your developer to set up a script that regenerates sitemap.xml every time you publish or update a page. A static sitemap that someone has to remember to regenerate manually is a liability.
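For the custom-built case, the regeneration script can be quite small. The sketch below assumes your publishing system can hand over a list of (URL, last-modified date) pairs; that input shape, the render_sitemap name, and the sample pages are hypothetical. It writes lastmod from the real modification date, which is what keeps the field trustworthy.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def render_sitemap(pages):
    """Render sitemap XML from (url, last_modified_date) pairs.

    `pages` should contain only canonical, indexable URLs; filtering is
    the caller's job in this sketch.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, modified in sorted(pages):
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        # lastmod in YYYY-MM-DD form, taken from the page's actual edit date
        ET.SubElement(entry, "lastmod").text = modified.isoformat()
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical pages; in practice these come from your CMS or database.
    pages = [
        ("https://yourdomain.com/", date(2026, 1, 15)),
        ("https://yourdomain.com/blog/robots-txt-guide", date(2026, 1, 10)),
    ]
    with open("sitemap.xml", "w", encoding="utf-8") as handle:
        handle.write(render_sitemap(pages))
```

Call it from whatever hook fires on publish or update, so the file never depends on someone remembering to run it.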

Why this matters more for AI search than it did for Google

Google has had two decades to build fallback discovery paths. It follows links, it scrapes social signals, it has Search Console where you can force a URL into the index. AI crawlers are newer and less forgiving. They also lean on indexes other than Google's: ChatGPT search has drawn on Bing's index, yet in the audits I run, I almost always see teams who have optimized for Google but never checked their index coverage in Bing Webmaster Tools. Check it before doing any ChatGPT optimization.

Perplexity uses real-time web retrieval, pulling from indexed, crawlable web content. ChatGPT's browsing capability similarly relies on accessible, indexed content. Content that isn't indexed, or content that is indexed slowly, has a narrower pathway into these AI-generated responses. If your content isn't in the index, it effectively doesn't exist from the perspective of AI systems that retrieve from live web sources.

The crawl-index-cite pipeline is stricter for AI engines because the engines are newer, their indexes are smaller, and they don't have the muscle memory of a decade of manual submissions and bug reports. If your robots.txt blocks them by accident, they move on. If your sitemap is broken, they don't come back as often.

The operational fix is this. Make your crawl decision explicit (the table above). Make your sitemap clean and current (the checklist above). Then verify that the major AI bots are actually visiting you. Check your server logs for the GPTBot, ClaudeBot, and PerplexityBot user-agent strings. If you see zero requests in the last 30 days, something is probably blocking them and you need to find out what.
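The log check can be done with a short script. This is a minimal sketch that counts lines mentioning each bot name anywhere in the line. The sample log lines are made up for illustration, and the access.log path in the closing note is a placeholder for wherever your server writes its logs.

```python
from collections import Counter

AI_BOTS = ("GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot")


def count_bot_hits(log_lines, bots=AI_BOTS):
    """Count log lines that mention each bot name (case-insensitive)."""
    hits = Counter({bot: 0 for bot in bots})
    for line in log_lines:
        lowered = line.lower()
        for bot in bots:
            if bot.lower() in lowered:
                hits[bot] += 1
    return hits


if __name__ == "__main__":
    # Made-up sample lines; real user-agent strings will be longer.
    sample = [
        '203.0.113.5 - - [10/Jan/2026] "GET / HTTP/1.1" 200 "-" "GPTBot/1.1"',
        '203.0.113.9 - - [10/Jan/2026] "GET /blog HTTP/1.1" 200 "-" "PerplexityBot/1.0"',
        '198.51.100.7 - - [10/Jan/2026] "GET / HTTP/1.1" 200 "-" "Mozilla/5.0"',
    ]
    for bot, count in count_bot_hits(sample).items():
        print(f"{bot}: {count}")
```

To run it on a real file, pass an open file handle instead of the sample list, for example count_bot_hits(open("access.log", encoding="utf-8", errors="replace")). A bot that stays at zero over 30 days of logs is the one to investigate at the firewall or CDN.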

I have worked with clients who spent three months optimizing content for ChatGPT citations, only to discover their firewall was blocking all OpenAI IPs because a security vendor's default ruleset flagged them as "scrapers." The content was perfect. The structure was right. The crawlers couldn't reach the site. Check access first, then optimize content.

The two-file operating system

Your robots.txt is access control. Your sitemap.xml is the priority map. Both need to agree, both need to stay current, and both need to account for the fact that AI crawlers now matter as much as Googlebot.

Open robots.txt, apply the decision table, save. Open Search Console, check sitemap coverage, fix any 404s or noindex conflicts, resubmit. Then move on to the content work that actually earns citations.

If you need help deciding which paths to block or how to structure a sitemap index for a large site, the consultancy includes a technical audit that maps this out. If your team wants to own this internally, the training workshop covers robots.txt strategy, sitemap hygiene, and crawler verification in the first session.

You can also just open the files right now and apply the tables above. The decisions are not complicated once you know what each bot does.

