Blocking Bad LLM Bots For SEO Performance

Summary

The blog post addresses the rise of Large Language Models (LLMs) and their impact on web content scraping, particularly the detrimental effects of LLM scraping operations on websites. It highlights the need for website owners to be vigilant and proactive in blocking bad LLM bots from accessing their content, providing a list of user-agents to consider blocking in Robots.txt and IP addresses to block at the server level. The post also discusses the lack of consent infrastructure in the current web ecosystem, emphasizing the importance of establishing guidelines for AI bots to respect website content and consent agreements. Additionally, it offers practical recommendations for identifying and blocking suspicious IP addresses associated with LLM scraping activities. The overall aim is to empower website owners to protect their content, optimize server performance, and enhance their SEO efforts in the face of increasing LLM-related challenges.

I concede that if real human users are going to use services like ChatGPT or Claude to find things, then we as SEOs and marketers need to ensure our clients are visible there and encourage these systems to play nice with the open web. When used properly, LLM-AI systems provide a boon to human productivity and consumer decision making. However, with every new announcement of some AI company landing a massive investment or being bought for billions of dollars, more and more LLM scraping operations pop up. Not only do these services seek to steal content from websites without ever giving anything back, all in hopes of making a big payday, they also cause increased bandwidth fees and even downtime.

In this article we discuss blocking known or suspected bad LLM bots from accessing your server / website, keeping your content available to the LLM-based services that might bring you value instead of the ones causing you harm.

Bad in this context could mean a few different things:

  • Doesn’t add value to the source website in terms of new traffic, sales, lead generation, etc.
  • Takes large volumes of content, likely to be used or repurposed for its own needs without your permission.
  • Scrapes pages at a high velocity, potentially causing load time issues for real humans using your website or, worse, crashing it.
  • Takes any quantity of content from your site in order to help one or more of your competitors perform better than you and/or take your customers.

Bad LLM bots (and honestly all kinds of scraper bots) can cause a host of issues. When we look beyond classic SEO to LLM Optimization (or GEO, or AI SEO, or ChatGPT SEO, or whatever it’s being referred to as today), we see a rise in the potential downsides and performance complications.

  1. Slow load times or website downtime due to sustained heavy bot traffic – Many web hosting services have limits on how many concurrent connections / visits / page views / etc. a website can have at one time. When you go over this limit, the system might shut your entire website down momentarily, displaying an error message, OR dramatically slow the site for new users.
  2. Regular search engines like Google and Bing aren’t dead yet, and are still the preferred method for billions of web users around the globe – Allowing unfettered access to your content could lead to a major rise in the number of competitors, reducing the value of your SEO efforts.
  3. Fewer reasons to visit a website, open an app, or engage with a brand – If users can just query all of the information about you inside of random systems that give zero back, then the lifetime value of your SEO and content efforts could collapse. Here we specifically mean systems that give absolutely zero back, such as a company taking your content, adding labeling for AI, and selling it to researchers.

It is our hope that by publishing this document, including suspected bad LLM user-agents and IP addresses, we can all collectively start selecting the LLMs that provide value back to the web and curtail the open theft that has been happening – or, at the very least, improve your server load times to aid your current SEO efforts and give you the tools to decide which bots to allow and which ones to block.

1. The Core Problem: The Lack of Consent Infrastructure

I won’t get into the deep details here, but the core of this problem is a lack of proper consent infrastructure. The early web established this as what is called the “handshake” between websites and search engines: search engines gained consent to crawl websites if they agreed to follow Robots.txt directives and to use the information they found to rank those pages, not compete with them, and websites gained valuable traffic in return.

LLMs break this decades-old arrangement and instead take whatever they want, use it however they want, and give nothing back. Most LLM-AI systems do not even inform the world of their operations until after they have built their first several pre-trained models, and only then do they release a way to block their AI scraping. Even in these cases, AI scraping bots like ChatGPT-User, Google-Extended, and Meta-externalagent often ignore Robots.txt directives and steal the content they want anyway. Some use clever tricks to get around this agreement; for example, the ChatGPT-User bot uses internal Bing data from Microsoft and visits the site live from a Microsoft server, bypassing Robots.txt blocking even if specifically declared. Google claims to respect the Google-Extended block in Robots.txt for training models like Gemini that power AI Mode, but uses data from GoogleBot to pull content into AI Overviews – effectively stealing content from your website and reproducing it even if you specifically asked Google not to (and often without linking to you or giving you credit either!).

Google, OpenAI, Anthropic, Perplexity, and Meta have all had a chance to establish a consent framework that would allow websites to block AI bots from their content or grant them access, and there is a growing number of community initiatives in this direction. However, so far all of the major LLM-AI systems have outright refused to adopt such principles and instead crawl at will.

This means that those who wish to make their fortune chasing AI – hoping to sell to or usurp these tech companies – are also ignoring the concepts of consent and incentives, and normal website owners, publishers, and small businesses are paying the price.

2. User Agents to Consider Blocking in Robots.txt

This is a list of bots / user-agents you should block in Robots.txt if you do not want any of your content included in an LLM-AI system. Again, there is no universal way to remove content from training data or to block specific pages/sections from being used in it. As explained above, many LLM-AI systems are also finding ways around Robots.txt directives and may take your content regardless.

User-agent = Google-Extended
Google’s AI scraper bot. If you block this user-agent, Google promises not to use your data for pre-training. However, they will still use your content for other AI implementations such as AI Overviews.

Recommendation: Block
Scraping frequency: High

User-agent = meta-externalagent
Meta’s AI scraper bot. If you block this user-agent, Meta will not use your content to pre-train their LLM systems. There is currently no value in being scraped by Meta; they give absolutely nothing back.

Recommendation: Block
Scraping frequency: High

User-agent = GPTBot
This is OpenAI’s main crawler that helps them build foundational models. There is little value in your content being in the training system unless you can determine that your brand entity will benefit from being included.

Recommendation: Block
Scraping frequency: Moderate

User-agent = ChatGPT-User
This is OpenAI’s web crawler that fetches content on behalf of a user. This does not mean a user sees your website or clicks to visit it, but that ChatGPT fetched content from your site for an operation performed on that user’s behalf. If the user wants to read a subscriber-gated article, wants to write a new article based on your content, or asks a question that ChatGPT decides to answer by fetching your page, ChatGPT uses this user-agent.

Recommendation: Do not block – It is pointless to block this bot (we know firsthand). Like many other tech companies, OpenAI ignores Robots.txt directives when the crawl is “user initiated” and one-time.
Scraping frequency: Low

User-agent = OAI-SearchBot
OpenAI says they use this bot when a user searches for something in ChatGPT, though we rarely see it in logs right now. OpenAI promises data from this bot is not used to train AI systems, and it could provide value if the number of users searching through ChatGPT climbs.

Recommendation: Do not Block
Scraping frequency: Low

User-agent = PerplexityBot
This is Perplexity’s main search bot. Allowing it is required to appear in their search results, and Perplexity claims it is not used to gather data for pre-training models. However, Perplexity never says how to keep them from using your content in their LLM models, so allow this user-agent with caution.

Recommendation: Do not Block
Scraping frequency: Moderate

User-agent = Perplexity-User
This is the bot that crawls pages to perform “actions” for Perplexity’s users. Perplexity does not publish which activities their users most frequently perform, but it is safe to say none of those actions are valuable to you, since this bot is not used in calculating search responses. Most likely the actions involve combing through or rewriting articles, meaning allowing this bot would simply make stealing your content even easier than it already is.

Recommendation: Block
Scraping frequency: Low

User-agent = anthropic-ai
Anthropic does not comment on their bots or publish documentation on them that we can readily find. This bot is rumored to exist, but we haven’t seen it in our logs yet. If it does exist, it is most likely Anthropic’s bot for gathering content for foundational pre-trained models.

Recommendation: Block
Scraping frequency: Low

User-agent = AwarioSmartBot
This bot belongs to a company that builds chatbots for other companies. If your company uses Awario, you will not want to block this crawler. However, in nearly 100% of the cases we’ve seen so far, the bot is crawling pages without the website owner’s knowledge or consent – most likely to aid a competitor.

Recommendation: Block – Unless you use Awario on your site.
Scraping frequency: High

How to block all of these bots in Robots.txt


User-agent: Google-Extended
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: AwarioSmartBot
Disallow: /
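
Because some bots ignore Robots.txt entirely, you can add a second layer of enforcement by refusing these user-agents at the web server itself. Below is a minimal sketch for Apache via .htaccess, assuming mod_rewrite is enabled (common on cPanel hosts); adjust the agent list to match your own block decisions and test carefully before deploying:

# Return 403 Forbidden to the AI user-agents blocked above
# Assumes Apache with mod_rewrite; edit the list to match your choices
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Google-Extended|meta-externalagent|GPTBot|Perplexity-User|anthropic-ai|AwarioSmartBot) [NC]
RewriteRule .* - [F,L]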

3. IP Addresses of Unknown Bots to Block

These are IP addresses of suspected LLM bots. The IP addresses were found in raw access or visitor logs and display behavior similar to other voracious bots. You can block IP addresses at the server level, in cPanel under “IP Blocker”, or in various other ways such as using Wordfence for WordPress.

IP Address = 206.41.168.153
This is the IP address of a bot that doesn’t declare a user-agent and traces back to an agentic AI company based in Milwaukee, WI called “Thoughtport”.

Recommendation: Block
Scraping frequency: High

IP Address = 103.208.70.211
This is a suspected LLM scraper codenamed “Tataskythief” running on the Tata Sky network in India. There is no user-agent declared, but it has a voracious appetite, sucking up 50 to 60 documents per minute before leaving your site alone for a while once it has taken everything.

Recommendation: Block
Scraping frequency: High

IP Address = 152.59.163.47
This is a suspected LLM scraper codenamed “Reliancethief” running on the Reliance Industries Limited network. There is no user-agent declared, but it has a similarly high appetite, gobbling up 30 to 40 documents per minute.

Recommendation: Block
Scraping frequency: High

IP Address = 188.213.34.101
This is a suspected LLM scraper codenamed “M247thief” running on the M247 hosting system in Romania. This bot is extra bad, as it is known to click on Google Ads in order to scrape content while also downing dozens of documents per minute.

Recommendation: Block
Scraping frequency: High

IP Address = 158.181.11.203
This is a suspected LLM scraper codenamed “Megalinethief” running on the Megaline network out of Kyrgyzstan. While this bot crawls at a much lower rate than the others, unless you’re in Kyrgyzstan there is likely little to zero value in being crawled and having your content scraped by this system.

Recommendation: Block
Scraping frequency: High

IP Addresses of suspected or known LLM scrapers / bots we recommend you block:


206.41.168.153
103.208.70.211
152.59.163.47
188.213.34.101
158.181.11.203
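
If your host runs Apache 2.4, one way to deny all of these addresses at once (a sketch; cPanel’s “IP Blocker” or Wordfence accomplish the same thing) is a .htaccess block like this:

# Deny the suspected scraper IPs listed above (Apache 2.4 syntax)
<RequireAll>
    Require all granted
    Require not ip 206.41.168.153
    Require not ip 103.208.70.211
    Require not ip 152.59.163.47
    Require not ip 188.213.34.101
    Require not ip 158.181.11.203
</RequireAll>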

If you’re trying to find similar ones, we recommend doing some rough fingerprinting: look for IP addresses in your raw access or visitor logs that match a profile similar to this (a sketch of one way to automate this follows the list):

1. IP address does not route to an ISP or cellular data provider – IP addresses that route to hosting companies or tech companies you do not use are likely scrapers of some kind.

2. IP address does not route to a country or region where you do business – New LLM scrapers are popping up globally, working to steal content for pre-training in order to try and get in on the AI-LLM gold rush.

3. IP address shows access patterns that don’t match standard user behavior, such as accessing dozens of documents in one minute, accessing 404 pages, etc. – LLM scrapers are generally not rate-limited by their creators and access all kinds of content quickly and in haphazard fashion. This is usually a tell-tale sign of an LLM scraping operation.
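
To automate that third check, here is a rough sketch in Python that counts requests per IP per minute in a combined-format access log and prints addresses that exceed a bot-like threshold. The log path, log format, and the 50-per-minute threshold are all assumptions you should adjust for your own server:

import re
from collections import Counter

LOG_PATH = "access.log"   # path to your raw access log (adjust for your host)
THRESHOLD = 50            # requests per minute that look non-human (assumption)

# Capture the client IP and the timestamp truncated to the minute
# from a standard common/combined log format line
line_re = re.compile(r'^(\S+) \S+ \S+ \[([^:]+:\d+:\d+):\d+ ')

hits = Counter()
with open(LOG_PATH) as f:
    for line in f:
        m = line_re.match(line)
        if m:
            ip, minute = m.groups()
            hits[(ip, minute)] += 1

# Print each IP that ever exceeded the per-minute threshold
flagged = sorted({ip for (ip, minute), count in hits.items() if count >= THRESHOLD})
for ip in flagged:
    print(ip)

Any IP this flags is still worth checking manually (reverse DNS, whois) before you block it, since shared ISPs and legitimate crawlers can also spike.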

Resources

Dark Visitors Agent list: https://darkvisitors.com/agents
Google User Agent Overview: https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
Meta Web Crawlers: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/
OpenAI Crawlers: https://platform.openai.com/docs/bots/overview-of-openai-crawlers
Perplexity Crawlers: https://docs.perplexity.ai/guides/bots

Joe Youngblood


Joe Youngblood is a top Dallas SEO, Digital Marketer, and Marketing Theorist. When he's not working with clients or writing about marketing he spends time supporting local non-profits and taking his dogs to various parks.
