
How to Block AI Crawlers in robots.txt (2026)

Blocking 'Claude-Web' does nothing in 2026 because the string is deprecated. Here is how to block the AI training crawlers that matter while keeping your ChatGPT and Perplexity citations.

SZ
Founder, Molixa

To block AI crawlers in robots.txt, add Disallow: / rules for the current training bots by their real user-agent names, mainly GPTBot, ClaudeBot, CCBot, and Google-Extended. Skip deprecated strings like Claude-Web, which no longer do anything, and leave the AI search bots allowed if you still want ChatGPT and Perplexity to cite you.

The mistake almost every guide makes is treating "AI crawlers" as one thing to blanket-block. They are not. Some bots scrape your content to train models. Others fetch your pages live to answer a user's question and cite you. Block the wrong group and you vanish from AI search results while the training scrapers you meant to stop keep ignoring your outdated rules.

Training Bots vs Search Bots: Block One, Keep the Other

This split is the single most important concept on the page, and it is the one most posts miss. Before you write a single rule, decide which group you actually want to stop.

| Crawler purpose | What it does | Should you block it? |
| --- | --- | --- |
| Training crawlers | Scrape content to train or improve AI models | Block if you do not want your content used for training |
| AI search crawlers | Fetch pages live to answer queries and cite sources | Usually allow, blocking removes you from AI answers |
| Traditional search crawlers | Index pages for Google, Bing, etc. | Never block, this destroys your SEO |

Blocking GPTBot stops OpenAI from training on your content, but it does not stop OAI-SearchBot, the agent that powers citations in ChatGPT Search. If you block both, you protect your content from training and disappear from ChatGPT's cited sources. That may be what you want, or it may be a costly accident. Decide deliberately.

The clean default for most publishers: block the training scrapers, keep the search agents. You opt out of free model training while staying visible (and linked) inside AI answers.
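To make the split concrete, here is a minimal Python sketch that parses a two-group rule set with the standard library's urllib.robotparser. The rule text and the example.com URL are illustrative assumptions, and the standard-library parser only approximates how each vendor's crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: block OpenAI's training bot, allow its search bot.
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

def is_allowed(robots_txt: str, agent: str, url: str = "https://example.com/post") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

training_allowed = is_allowed(RULES, "GPTBot")
search_allowed = is_allowed(RULES, "OAI-SearchBot")
print("GPTBot allowed:", training_allowed)
print("OAI-SearchBot allowed:", search_allowed)
```

Under these rules the training bot is refused and the search bot is permitted, which is the "block one, keep the other" default described above.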

The Deprecated User-Agents Still Floating Around

A lot of copy-pasted robots.txt advice blocks user-agent strings that no longer do anything. If your file lists these and you expect them to stop Anthropic or anyone else, it will not work.

  • Claude-Web is deprecated. It was an old Anthropic user-agent and is no longer the string that fetches content for training or live retrieval. Blocking it has no effect on Anthropic's current crawlers.
  • anthropic-ai is likewise outdated as a blocking target. Anthropic's active training crawler identifies as ClaudeBot.

If your goal is to stop Anthropic's model training, the directive that matters in 2026 is ClaudeBot, not Claude-Web or anthropic-ai. Blocking the dead strings while leaving ClaudeBot allowed is the exact failure pattern this guide exists to fix.
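One way to catch this failure pattern is a small audit script. The sketch below is an assumption-laden illustration: the function name, the report shape, and the sample file are hypothetical, and the deprecated list simply mirrors the two strings named in this section.

```python
import re

# Strings this guide treats as deprecated, and the current Anthropic training crawler.
DEPRECATED = {"claude-web", "anthropic-ai"}
CURRENT = "claudebot"

def audit_anthropic_rules(robots_txt: str) -> dict:
    """Report deprecated user-agent tokens and whether ClaudeBot is named at all."""
    agents = {
        match.group(1).strip().lower()
        for match in re.finditer(r"(?im)^\s*user-agent\s*:\s*([^#\n]+)", robots_txt)
    }
    return {
        "stale": sorted(agents & DEPRECATED),
        "names_claudebot": CURRENT in agents,
    }

# Hypothetical file copied from an outdated guide.
old_file = "User-agent: Claude-Web\nDisallow: /\n\nUser-agent: anthropic-ai\nDisallow: /\n"
report = audit_anthropic_rules(old_file)
print(report)
```

A report with entries under "stale" and "names_claudebot" set to False is the signature of a file that looks like an opt-out but leaves the active crawler untouched.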

The Current AI Crawler User-Agents (2026)

Here are the user-agents worth knowing, grouped by what they do. User-agent matching in robots.txt is case-insensitive, and each User-agent line targets exactly one token.

| User-agent | Operator | Purpose | Typical choice |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Training data collection | Block to opt out of training |
| OAI-SearchBot | OpenAI | ChatGPT Search results and citations | Allow to stay in ChatGPT answers |
| ChatGPT-User | OpenAI | Live fetch when a user asks ChatGPT to visit a page | Allow for on-demand user fetches |
| ClaudeBot | Anthropic | Training and crawling | Block to opt out of training |
| Claude-SearchBot | Anthropic | Search indexing for Claude | Allow to stay visible in Claude search |
| Claude-User | Anthropic | Live fetch for a user request | Allow for on-demand user fetches |
| Google-Extended | Google | Gemini training and grounding control | Block to opt out of Gemini training |
| Googlebot | Google | Core Search index and AI Overviews | Never block |
| CCBot | Common Crawl | Open dataset many models train on | Block to opt out of dataset scraping |
| PerplexityBot | Perplexity | Indexing for Perplexity answers | Allow to be cited by Perplexity |
| Bytespider | ByteDance | Aggressive training crawler | Block, frequently ignores rules |
| Meta-ExternalAgent | Meta | AI training crawler | Block to opt out of Meta training |

Two important nuances. First, Google-Extended only controls Gemini training and grounding; it does not affect your normal Google ranking, which Googlebot governs. You can opt out of Gemini training without harming SEO. Second, some crawlers (Bytespider is the usual offender) have a reputation for ignoring robots.txt, so for those you may need server-level blocking, not just a polite directive.
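Server-level blocking can take many forms (web server config, CDN rules, a firewall). As one hedged illustration, here is a minimal WSGI middleware sketch in Python that answers 403 when the User-Agent header contains a denied substring. The deny list, the function names, and the demo app are all hypothetical, and because User-Agent headers can be spoofed this is a first filter, not a guarantee.

```python
# Hypothetical deny list; extend it with whichever crawlers ignore your robots.txt.
DENIED_AGENT_SUBSTRINGS = ("bytespider",)

def block_crawlers(app):
    """Wrap a WSGI app and answer 403 when the User-Agent matches the deny list."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in DENIED_AGENT_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware

def demo_app(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]

def status_for(user_agent: str) -> str:
    """Call the wrapped app with a fake request and return the status line."""
    seen = []
    app = block_crawlers(demo_app)
    app({"HTTP_USER_AGENT": user_agent}, lambda status, headers: seen.append(status))
    return seen[0]

print(status_for("Mozilla/5.0 (compatible; Bytespider)"))
print(status_for("Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"))
```

The match is a case-insensitive substring test, so it catches the token wherever it appears in the full User-Agent string.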

The robots.txt Rules You Actually Want

Here is a current, copy-ready robots.txt that blocks the major training crawlers, keeps the AI search agents allowed so you stay citable, and never touches traditional search bots. Place it at the root of your domain, at https://yourdomain.com/robots.txt.

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Allow AI search crawlers so you stay cited in AI answers
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Never block traditional search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Default for everything else
User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

If you want to block all AI training and all AI search (full opt-out, accepting that you lose AI citations), change the search-bot blocks to Disallow: / as well. If you only want to opt out of training while staying maximally visible, the file above is your template. Build a customized version with your own domain and rules using the free robots.txt generator, which assembles the correct current syntax so you do not paste a stale string by accident.

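Rules apply only to the group they sit in. A Disallow: / under one User-agent line does not cascade to the groups that follow, so every bot you want to block has to be named explicitly. (You can stack several User-agent lines above one shared set of rules, but each bot still needs its own line.)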
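A quick way to see how rule scope works is to parse two tiny files. This Python sketch uses the standard library's urllib.robotparser, which only approximates real crawler behavior; the helper name and both sample files are illustrative assumptions. The second file stacks two User-agent lines above one rule, a grouping that RFC 9309 permits.

```python
from urllib.robotparser import RobotFileParser

def blocked_agents(robots_txt: str, agents: list[str]) -> list[str]:
    """Return the agents that may NOT fetch the site root under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, "/")]

AGENTS = ["GPTBot", "ClaudeBot"]

# One rule under one agent: only that agent is blocked.
single = "User-agent: GPTBot\nDisallow: /\n"

# Two User-agent lines stacked above one rule: both agents share it.
grouped = "User-agent: GPTBot\nUser-agent: ClaudeBot\nDisallow: /\n"

print(blocked_agents(single, AGENTS))
print(blocked_agents(grouped, AGENTS))
```

In the first file ClaudeBot stays allowed because nothing names it; in the second, both bots are blocked by the shared rule.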

How to Block AI Crawlers Step by Step

Step 1: Decide your policy

Pick one of three stances:

  • Allow everything: you accept training and want every citation.
  • Block training, allow search: the common publisher default, where you opt out of training but stay cited.
  • Block everything: full opt-out, with no AI use of your content and no AI citations.

Your robots.txt follows from that single decision, so make it first.

Step 2: Write the user-agent blocks

Add a User-agent: line for each crawler, followed by Disallow: / to block or Allow: / to permit. Use the current names from the table above (GPTBot and ClaudeBot, for example), not the deprecated Claude-Web. Group your blocks and your allows so the file stays readable. The free robots.txt generator lets you toggle each bot and outputs the exact syntax.
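If you would rather script the file than hand-edit it, the idea can be sketched in a few lines of Python. Everything here is illustrative: the policy labels ("allow_all", "block_training", "block_all") are hypothetical names for the three stances from Step 1, the bot lists copy the table above, and the sitemap URL is a placeholder.

```python
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot", "Bytespider", "Meta-ExternalAgent"]
SEARCH_BOTS = ["OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot", "PerplexityBot"]

def build_robots_txt(policy: str, sitemap_url: str) -> str:
    """Render robots.txt for 'allow_all', 'block_training', or 'block_all'."""
    if policy not in {"allow_all", "block_training", "block_all"}:
        raise ValueError(f"unknown policy: {policy}")
    blocked = []
    if policy in {"block_training", "block_all"}:
        blocked += TRAINING_BOTS
    if policy == "block_all":
        blocked += SEARCH_BOTS
    lines = []
    for bot in blocked:
        # One explicit group per blocked bot keeps the output easy to audit.
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    # Every crawler not named above (Googlebot, Bingbot, ...) falls through to this group.
    lines += ["User-agent: *", "Allow: /", "", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"

robots_txt = build_robots_txt("block_training", "https://yourdomain.com/sitemap.xml")
print(robots_txt)
```

Note that "block_all" here means all AI crawlers in the two lists; traditional search bots are never added to the blocked set.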

Step 3: Add a noai meta tag or X-Robots-Tag (optional belt and braces)

robots.txt asks crawlers not to fetch a page. It does not control what happens to content already collected, and only well-behaved crawlers honor it. For an extra signal, add an X-Robots-Tag: noai, noimageai HTTP header or a page-level <meta name="robots" content="noai"> tag. These values are non-standard and support is not universal, but they express intent that some operators respect.
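How you set the header depends on your stack. As a sketch only, assuming a Python WSGI application, a small middleware can append it to every response. The function names and the demo app are hypothetical; the header value is the one quoted above.

```python
def add_noai_header(app, value="noai, noimageai"):
    """Wrap a WSGI app so every response carries an X-Robots-Tag header."""
    def middleware(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Append rather than replace, so an existing X-Robots-Tag survives.
            return start_response(status, list(headers) + [("X-Robots-Tag", value)], exc_info)
        return app(environ, patched_start_response)
    return middleware

def page_app(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<p>Hello</p>"]

def response_headers(app):
    """Call the app with a fake request and return the headers it sent."""
    captured = {}
    def fake_start_response(status, headers, exc_info=None):
        captured["headers"] = headers
    app({}, fake_start_response)
    return captured["headers"]

headers = response_headers(add_noai_header(page_app))
print(headers)
```

Keep in mind that a crawler you have disallowed in robots.txt never requests the page, so it never sees this header; the signal only reaches bots that are allowed to fetch.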

Step 4: Deploy at the domain root and test

Upload the file so it resolves at exactly https://yourdomain.com/robots.txt. A robots.txt file only applies to its own host and protocol, so each subdomain needs its own. Fetch the URL in a browser to confirm it loads as plain text, then check it in Google Search Console's robots.txt report to make sure you did not accidentally block Googlebot. Pair it with a current XML sitemap, which you can build with the free sitemap generator, and reference that sitemap at the bottom of the file as shown above.
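The host-and-protocol rule is easy to get wrong with subdomains, so here is a small Python sketch of it. The helper names are hypothetical and shop.yourdomain.com is an invented example host; the second function makes a live network request and is shown for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen

def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL that governs `page_url` (same scheme and host)."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

def looks_like_plain_text(robots_url: str) -> bool:
    """Fetch robots.txt over the network and check status and content type."""
    with urlopen(robots_url, timeout=10) as response:
        return response.status == 200 and response.headers.get_content_type() == "text/plain"

main_site = robots_url_for("https://yourdomain.com/blog/post?x=1")
subdomain = robots_url_for("https://shop.yourdomain.com/cart")
print(main_site)
print(subdomain)
```

The two calls resolve to different files, which is why rules deployed on the main host do nothing for a subdomain.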

Will Blocking AI Crawlers Hurt Your SEO?

Blocking AI training crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended) has no effect on your traditional Google or Bing rankings, because those are governed by Googlebot and Bingbot, which you keep allowed. You can opt out of model training and rank exactly as before.

Blocking AI search crawlers is different. If you disallow OAI-SearchBot, Claude-SearchBot, or PerplexityBot, you remove yourself from those engines' cited answers, which is a growing source of referral traffic. And blocking Googlebot, which a careless wildcard rule can do, is genuinely harmful and de-indexes you from Google. Keep your search-engine and AI-search agents allowed unless you have a specific reason not to.
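The careless-wildcard case is worth testing before you deploy. This Python sketch uses the standard library's urllib.robotparser as an approximation of crawler behavior; both sample files and the helper name are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

def googlebot_allowed(robots_txt: str) -> bool:
    """Return True if Googlebot may fetch the site root under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", "/")

# A blanket wildcard block catches every bot without its own group, Googlebot included.
careless = "User-agent: *\nDisallow: /\n"

# A specific Googlebot group takes precedence over the wildcard group.
safer = "User-agent: Googlebot\nAllow: /\n\nUser-agent: *\nDisallow: /\n"

print("careless:", googlebot_allowed(careless))
print("safer:", googlebot_allowed(safer))
```

Running a check like this against your real file is a cheap guard against de-indexing yourself with a single wildcard line.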

The other piece of on-page control worth setting correctly is your meta tags, so a careless noindex or canonical does not undercut the same pages. Our free meta tag generator helps you audit those. And if you are thinking about how AI engines read your content beyond crawling, our guide on whether FAQ schema is dead in 2026 covers the structured data those same engines parse to cite you.

The Bottom Line

Blocking AI crawlers in robots.txt works in 2026, but only if you use the right names and the right strategy. Block the training scrapers (GPTBot, ClaudeBot, CCBot, Google-Extended, Bytespider, Meta-ExternalAgent) by their current user-agents, ignore deprecated strings like Claude-Web that do nothing, and keep the AI search agents and traditional search bots allowed unless you genuinely want to disappear from AI answers.

Decide your policy first, write explicit blocks for each bot, deploy at the domain root, and confirm in Search Console that Googlebot is untouched. The free robots.txt generator builds a current, syntactically correct file so you stop the crawlers you mean to and keep the visibility you want.

Frequently Asked Questions

How do I block AI crawlers in robots.txt? Add a User-agent: block with Disallow: / for each training crawler you want to stop, using current names like GPTBot, ClaudeBot, CCBot, and Google-Extended. Keep traditional search bots like Googlebot allowed, and decide separately whether to allow AI search agents like OAI-SearchBot and PerplexityBot so you stay cited.

Does blocking Claude-Web stop Anthropic from training on my content? No. Claude-Web is deprecated and blocking it has no effect. Anthropic's active training crawler identifies as ClaudeBot, so that is the user-agent to disallow in 2026. Blocking the old strings while leaving ClaudeBot allowed is the most common reason people think they opted out when they did not.

Will blocking AI crawlers hurt my Google rankings? No, as long as you only block AI bots and keep Googlebot allowed. Blocking GPTBot, ClaudeBot, CCBot, or Google-Extended does not affect traditional Google or Bing rankings, which are controlled by Googlebot and Bingbot. Note that Google-Extended only opts you out of Gemini training, not Search.

What is the difference between GPTBot and OAI-SearchBot? GPTBot collects data to train OpenAI's models, so block it to opt out of training. OAI-SearchBot fetches pages to power ChatGPT Search results and citations, so allow it if you want ChatGPT to surface and link your content. Blocking both opts you out of training and removes you from ChatGPT's cited answers.

Can AI crawlers ignore my robots.txt? Yes. robots.txt is a request, not enforcement. Reputable crawlers like GPTBot and ClaudeBot honor it, but some, such as Bytespider, have a reputation for ignoring it. For those, block them at the server or firewall level rather than relying on robots.txt alone. An X-Robots-Tag HTTP header adds a further signal, but it is also only a request.

Should I block all AI bots or just training bots? For most publishers, block training bots and allow AI search bots. That opts you out of free model training while keeping you visible and cited in ChatGPT Search, Claude, and Perplexity, a growing traffic source. Full blocking makes sense only if you do not want your content used by AI in any form, including citations.
