Your robots.txt and sitemap decisions, laid out for AI search
Open your robots.txt file right now: type your-domain.com/robots.txt into a browser. Does it block GPTBot? Allow PerplexityBot? Does it point to a sitemap? If you don't have clear answers to those three questions, you are making crawl decisions by accident instead of by strategy. This post gives you the decision framework to fix that in the next hour.
I run these audits with SME clients in Southeast Asia every week. The pattern repeats. Their robots.txt either blocks everything by default (because a developer copy-pasted a cautious template three years ago) or allows everything with no thought about which bots serve citations versus which ones scrape for training. Their sitemap includes noindex pages, or doesn't exist at all, or hasn't been updated since 2023. Then they wonder why ChatGPT never cites them and Perplexity pulls from competitors.
The fix is not complicated. You need to make two file decisions and you need to make them based on what each AI crawler actually does.
The crawlers you actually need to decide about
The three most active AI crawlers are GPTBot (OpenAI), ClaudeBot (Anthropic), and PerplexityBot (Perplexity). Each has a different job.
GPTBot is OpenAI's primary crawler. It gathers content that can improve future models. OpenAI also documents a separate crawler, OAI-SearchBot, for surfacing sites in ChatGPT search, so a GPTBot rule on its own does not decide search visibility. ClaudeBot collects data for Anthropic's Claude models. PerplexityBot indexes pages so Perplexity can cite them in its answer engine, which often links directly back to sources.
Then there are the user-triggered fetchers. ChatGPT-User is not a bulk crawler at all. Instead, it fetches a single page in real time when a ChatGPT user clicks a link or asks a question that needs current information. Treating it the same as GPTBot is a common and costly mistake.
The decision you need to make is not "block all AI" or "allow all AI." It is which bot serves which purpose for your brand, and whether you care more about citations (visibility) or training-data control.
Here is the working decision table I use with clients.
| Bot | Purpose | If you block it | Typical posture for SMEs |
|---|---|---|---|
| GPTBot | Model training (ChatGPT search uses the separate OAI-SearchBot) | Content left out of future training; keep OAI-SearchBot unblocked if you want ChatGPT search citations | Allow (citations matter more than training anxiety) |
| ChatGPT-User | Live fetch for user queries | ChatGPT can't retrieve your page when asked | Allow (this is the citation path) |
| ClaudeBot | Model training for Claude | Reduced Claude citation likelihood | Allow |
| PerplexityBot | Indexing for Perplexity citations | Perplexity can't cite you, no direct links back | Allow |
| Google-Extended | Gemini training (not Google Search) | Does not affect classic Google Search indexing | Selective (many block this, allow the others) |
| CCBot (Common Crawl) | Training datasets for many models | Reduced presence in derivative models | Block (training-only, no citation value) |
The cleanest posture for a startup or SME that wants AI visibility is this. Allow the citation bots (GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot). Block the pure-training crawlers (CCBot, optionally Google-Extended). Protect gated paths (checkout, admin, account).
Here is what that looks like in robots.txt.
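```
# Citation bots: allowed, except gated paths.
# A bot that matches a named group ignores the * group,
# so the gated paths are repeated here.
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: PerplexityBot
Disallow: /admin/
Disallow: /checkout/
Disallow: /account/
Allow: /

# Training-only crawler: blocked.
User-agent: CCBot
Disallow: /

# Everyone else: gated paths only.
User-agent: *
Disallow: /admin/
Disallow: /checkout/
Disallow: /account/

Sitemap: https://yourdomain.com/sitemap.xml
```
That file lives at the root of your domain. Any change takes effect on the next crawl, usually within 24 hours.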
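A caveat worth testing before you ship any robots.txt: a crawler that matches a named group follows only that group and ignores the `User-agent: *` rules, so gated paths have to be repeated wherever a bot is named. The sketch below checks this with Python's standard `urllib.robotparser`. The two sample files, the `is_allowed` helper, and the example.com URLs are made up for illustration, and Python's parser only approximates how any given crawler evaluates rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: GPTBot has its own group, everyone else falls to "*".
NAMED_GROUP_ONLY = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /admin/
"""

# Same idea, but the gated path is repeated inside the named group.
# Disallow comes first because this parser applies the first matching rule.
NAMED_GROUP_WITH_GATES = """\
User-agent: GPTBot
Disallow: /admin/
Allow: /

User-agent: *
Disallow: /admin/
"""


def is_allowed(robots_txt, agent, path):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "https://example.com" + path)


if __name__ == "__main__":
    samples = [("named group only", NAMED_GROUP_ONLY),
               ("gates repeated", NAMED_GROUP_WITH_GATES)]
    for label, text in samples:
        for agent in ("GPTBot", "SomeOtherBot"):
            verdict = is_allowed(text, agent, "/admin/settings")
            print(f"{label:18} {agent:13} /admin/settings allowed={verdict}")
```

In the first file GPTBot is let into /admin/ even though the catch-all group disallows it. In the second file it is not, while the rest of the site stays open to it.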
One more nuance that catches teams: blocking a crawler in robots.txt does not always block the product behind it. Perplexity has reportedly still been able to answer questions about domains that had successfully blocked its crawler, because its user-initiated fetcher (Perplexity-User) can bypass robots.txt. OpenAI, by contrast, respects robots.txt and does not try to evade either a robots.txt directive or a network-level block. When tested, ChatGPT-User fetched the robots file and stopped crawling once it was disallowed. This is the implementation gap. Most bots honor the file. A few have carve-outs for "user-driven" fetches. You can block those at the firewall or CDN level if it matters to you, but for public marketing content the trade-off usually isn't worth it.
The sitemap file that actually helps crawlers find you
Your sitemap.xml is the map you hand to every crawler. A clean sitemap should only include canonical, indexable URLs and stay aligned with robots.txt to avoid sending mixed crawl signals.
The mistakes I see in SME audits are predictable.
- The sitemap includes pages set to noindex. If a page is set to noindex, it should not be in your XML sitemap. A sitemap is you telling search engines "please index this." A noindex tag says the opposite.
- The sitemap lists URLs blocked in robots.txt. Your robots.txt file and sitemap need to agree. Blocking pages in robots.txt while listing them in your sitemap is a classic mistake and wastes crawl budget.
- The sitemap hasn't been updated in months, so new content never gets discovered.
- The sitemap has grown too large and nobody split it into an index file.
The real work is choosing the right URLs, keeping the file clean, and maintaining trustworthy lastmod data. Stop spending time tuning priority scores. Spend it making sure every URL in the sitemap is a page you actually want indexed, is live (not a 404 or 301), and is crawlable (not blocked by robots.txt or noindex).
Here is the small-team checklist I give clients.
| Check | What it means | How to verify |
|---|---|---|
| Only indexable URLs | No noindex, no 404s, no redirects | Crawl your sitemap in Screaming Frog, filter by status code and meta robots |
| Canonical URLs only | No duplicates, no tracking parameters | Check that every URL in the sitemap matches its own canonical tag |
| Aligned with robots.txt | No URLs listed that robots.txt blocks | Cross-check Disallow paths in robots.txt against sitemap URLs |
| Auto-updates | Sitemap regenerates when you publish | Publish or edit a test page, then confirm its URL and lastmod change in sitemap.xml with no manual step. CMS users can rely on plugins or native settings |
| Submitted to Search Console | Google knows where to fetch it | Go to Google Search Console, Sitemaps, verify submission and index coverage |
| Referenced in robots.txt | Crawlers read robots.txt first | Add sitemap location line at the bottom of robots.txt |
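The "aligned with robots.txt" check in the table can be scripted. What follows is a minimal sketch using only the Python standard library, not a replacement for a full crawl. The sample sitemap, the sample robots.txt, and the helper names are made up for illustration; in practice you would load both files from your own domain.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical inputs; swap in the contents of your real files.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/ai-search</loc></url>
  <url><loc>https://example.com/account/login</loc></url>
</urlset>
"""

ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
"""


def sitemap_urls(sitemap_xml):
    """Extract every <loc> value from a urlset sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text]


def blocked_sitemap_urls(sitemap_xml, robots_txt, agents):
    """Map each sitemap URL to the agents that robots.txt blocks from it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    conflicts = {}
    for url in sitemap_urls(sitemap_xml):
        blocked_for = [a for a in agents if not parser.can_fetch(a, url)]
        if blocked_for:
            conflicts[url] = blocked_for
    return conflicts


if __name__ == "__main__":
    found = blocked_sitemap_urls(SITEMAP_XML, ROBOTS_TXT, ["Googlebot", "GPTBot"])
    for url, agents in found.items():
        print(f"{url} is in the sitemap but blocked for: {', '.join(agents)}")
```

Any URL this reports should either leave the sitemap or be unblocked in robots.txt. Which one depends on whether you want the page indexed.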
If your site has fewer than 500 pages, one sitemap file is fine. If you are over 10,000 pages or running e-commerce with product variants, split into multiple files organized under a sitemap index. In between, a single file still works. Past the protocol limit of 50,000 URLs (or 50 MB uncompressed) per file, splitting is mandatory, not optional.
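Splitting is mechanical once you have the URL list. The sitemap protocol caps a single file at 50,000 URLs, so the sketch below chunks the list and writes an index that points at each chunk. The file naming scheme (`sitemap-1.xml`, `sitemap-2.xml`, ...) and the `build_sitemaps` helper are assumptions for illustration, not a convention any crawler requires.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file limit in the sitemap protocol


def _urlset(urls):
    """Render one <urlset> document for a list of URLs."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = url
    return ET.tostring(root, encoding="unicode")


def build_sitemaps(urls, base_url, max_urls=MAX_URLS_PER_FILE):
    """Return {filename: xml}. One file if it fits, else chunks plus an index."""
    chunks = [urls[i:i + max_urls] for i in range(0, len(urls), max_urls)]
    if len(chunks) <= 1:
        return {"sitemap.xml": _urlset(urls)}

    files = {}
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for n, chunk in enumerate(chunks, start=1):
        name = f"sitemap-{n}.xml"  # hypothetical naming scheme
        files[name] = _urlset(chunk)
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = f"{base_url}/{name}"
    # The index takes the well-known name so robots.txt can keep pointing at it.
    files["sitemap.xml"] = ET.tostring(index, encoding="unicode")
    return files


if __name__ == "__main__":
    demo_urls = [f"https://example.com/product/{i}" for i in range(5)]
    # A tiny limit of 2 forces a split so the behavior is visible.
    for name, xml in sorted(build_sitemaps(demo_urls, "https://example.com", max_urls=2).items()):
        print(name, len(xml), "bytes")
```

Keeping the index at `/sitemap.xml` means the `Sitemap:` line in robots.txt and the Search Console submission do not change when the site grows past one file.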
Dynamic sitemaps are table stakes now. A sitemap that updates automatically is usually the best option for blogs, e-commerce sites, and any site that publishes regularly. WordPress plugins (Yoast, Rank Math) handle this. Shopify does it natively. If your site is custom-built, ask your developer to set up a script that regenerates sitemap.xml every time you publish or update a page. A static sitemap that someone has to remember to regenerate manually is a liability.
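For a custom-built site, the regeneration script can be small. The sketch below assumes a hypothetical `Page` record with a URL, a last-modified date, and an indexable flag. How your CMS or database exposes those fields will differ, and the dates shown are invented.

```python
from dataclasses import dataclass
from datetime import date
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


@dataclass
class Page:
    """Hypothetical page record; adapt to whatever your CMS stores."""
    url: str
    last_modified: date
    indexable: bool = True  # False for noindex, redirected, or removed pages


def render_sitemap(pages):
    """Build sitemap XML from page records, skipping non-indexable pages."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in sorted(pages, key=lambda p: p.url):
        if not page.indexable:
            continue
        node = ET.SubElement(root, "url")
        ET.SubElement(node, "loc").text = page.url
        # lastmod comes from the real edit date, never from "now".
        ET.SubElement(node, "lastmod").text = page.last_modified.isoformat()
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"


def write_sitemap(pages, path="sitemap.xml"):
    """Call this from your publish or update hook."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(render_sitemap(pages))


if __name__ == "__main__":
    demo_pages = [
        Page("https://example.com/", date(2025, 1, 10)),
        Page("https://example.com/blog/ai-search", date(2025, 1, 12)),
        Page("https://example.com/thank-you", date(2025, 1, 5), indexable=False),
    ]
    print(render_sitemap(demo_pages))
```

Filtering on `indexable` at render time is what keeps noindex and redirected pages out of the file without anyone having to remember.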
Why this matters more for AI search than it did for Google
Google has had two decades to build fallback discovery paths. It follows links, it scrapes social signals, and it has Search Console, where you can request indexing for a specific URL. AI crawlers are newer and less forgiving. They also lean on other indexes: ChatGPT search has drawn in part on Bing's. Yet in the audits I run, I almost always see teams who have optimized for Google but never checked their index coverage in Bing Webmaster Tools. Check it before doing any ChatGPT optimization.
Perplexity uses real-time web retrieval, pulling from indexed, crawlable web content. ChatGPT's browsing capability similarly relies on accessible, indexed content. Content that isn't indexed, or content that is indexed slowly, has a narrower pathway into these AI-generated responses. If your content isn't in the index, it effectively doesn't exist from the perspective of AI systems that retrieve from live web sources.
The crawl-index-cite pipeline is stricter for AI engines because the engines are newer, their indexes are smaller, and they don't have the muscle memory of a decade of manual submissions and bug reports. If your robots.txt blocks them by accident, they move on. If your sitemap is broken, they don't come back as often.
The operational fix is this. Make your crawl decision explicit (the table above). Make your sitemap clean and current (the checklist above). Then verify that the major AI bots are actually visiting you. Check your server logs for the GPTBot, ClaudeBot, and PerplexityBot user-agent strings. If you see zero requests in the last 30 days, something is blocking them and you need to find out what.
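The log check can be a few lines of Python. This is a rough sketch that counts requests per bot by substring match on each log line. The sample lines are invented and their user-agent strings are simplified, so adjust the matching to your server's log format. User-agent strings can also be spoofed, so treat the counts as a first signal rather than proof.

```python
from collections import Counter

AI_BOTS = ("GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot")

# Invented access-log lines with simplified user-agent strings.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2025:10:00:00 +0000] "GET /blog/post HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [01/Jan/2025:10:05:00 +0000] "GET /pricing HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '198.51.100.7 - - [01/Jan/2025:10:06:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
]


def count_bot_hits(log_lines, bots=AI_BOTS):
    """Count log lines that mention each bot name (case-insensitive)."""
    hits = Counter({bot: 0 for bot in bots})
    for line in log_lines:
        lowered = line.lower()
        for bot in bots:
            if bot.lower() in lowered:
                hits[bot] += 1
                break  # one request belongs to at most one bot
    return hits


def silent_bots(hits):
    """Bots with zero requests: the ones to investigate first."""
    return [bot for bot, count in hits.items() if count == 0]


if __name__ == "__main__":
    hits = count_bot_hits(SAMPLE_LOG)
    for bot, count in hits.items():
        print(f"{bot:15} {count}")
    print("No visits from:", ", ".join(silent_bots(hits)) or "none")
```

To run it on real data, pass an open log file in place of `SAMPLE_LOG`, since a file object yields lines the same way a list does.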
I have worked with clients who spent three months optimizing content for ChatGPT citations, only to discover their firewall was blocking all OpenAI IPs because a security vendor's default ruleset flagged them as "scrapers." The content was perfect. The structure was right. The crawlers couldn't reach the site. Check access first, then optimize content.
The two-file operating system
Your robots.txt is access control. Your sitemap.xml is the priority map. Both need to agree, both need to stay current, and both need to account for the fact that AI crawlers now matter as much as Googlebot.
Open robots.txt, apply the decision table, save. Open Search Console, check sitemap coverage, fix any 404s or noindex conflicts, resubmit. Then move on to the content work that actually earns citations.
If you need help deciding which paths to block or how to structure a sitemap index for a large site, the consultancy includes a technical audit that maps this out. If your team wants to own this internally, the training workshop covers robots.txt strategy, sitemap hygiene, and crawler verification in the first session.
You can also just open the files right now and apply the tables above. The decisions are not complicated once you know what each bot does.