How to Run an AI Visibility Audit Using Common Crawl’s New Field Guide

The blog post discusses Common Crawl’s new field guide called “The AI Visibility Audit” authored by Stephen Burns, providing a practical framework for auditing a website’s accessibility to AI-related crawlers and datasets. The audit delves into areas such as CCBot access, Common Crawl index coverage, harmonic centrality, structured data completeness, and server-side rendering. It emphasizes understanding the importance of AI visibility in the evolving landscape of search, training data, and retrieval systems. The post also touches on the nuances of intentional blocking of AI crawlers, the significance of deliberate decisions in AI visibility strategy, and the relevance of such audits for various business sectors like local SEO, ecommerce, and content-driven sites. The analysis provides actionable insights for conducting AI visibility audits and highlights the necessity of verifying technical setups align with intended goals.

Common Crawl recently published a new field guide called The AI Visibility Audit, written by Stephen Burns, Web Intelligence Lead at the Common Crawl Foundation.

The guide is worth paying attention to because it gives SEOs and business owners a practical framework for auditing something most traditional SEO audits still do not cover very well: whether a website is actually reachable by the crawlers, indexes, and datasets that can influence AI visibility.

To be clear, I am not affiliated with Common Crawl or Stephen Burns. This article is a third-party look at the guide from the perspective of an SEO consultant, with additional notes on how business owners, in-house marketers, and agencies can use the framework in the real world. I urge everyone to read the source material directly at the link below:

Common Crawl AI Visibility Audit Guide (PDF)

Common Crawl and Stephen Burns deserve credit for creating both the field guide and the five-check audit framework discussed below.

What Is an AI Visibility Audit?

An AI visibility audit checks whether a website can be accessed, captured, understood, and potentially used by AI-related crawlers, data sources, and retrieval systems.

This is different from a normal SEO audit.

A traditional SEO audit usually asks questions like:

Can Googlebot crawl the site?
Are important pages indexed?
Are titles, headings, canonicals, and internal links set up correctly?
Does the site have enough useful content?
Does the site have authority and backlinks?
Are pages ranking and converting?

Those questions still matter. None of that goes away.

But AI visibility adds a new layer upstream of rankings. Before an AI system can recommend, summarize, cite, or “know” your business, your content has to be reachable by the systems that discover and collect web data.

That is the main point of Stephen Burns’ field guide. A page can rank well in Google and still be mostly invisible to AI systems if it is blocked from the wrong crawlers, missing from important crawl datasets, hidden behind JavaScript, or disconnected from the parts of the web that get crawled frequently.

The Big Idea: Search Is Becoming “Index, Rank, Train, and Retrieve”

For years, SEO has mostly revolved around crawling, indexing, ranking, and search result presentation.

AI systems add two more ideas to the mix:

1. Training Data

Some AI systems learn from large datasets built from web crawls, licensed data, user interactions, documents, and other sources. If your website is included in the right datasets before a model is trained, then your content may have a chance to become part of what the model “knows.”

This does not guarantee that a model will mention your business. It does not guarantee citations. It does not mean you can “rank” in AI the same way you rank in Google.

But if your website is completely absent from important crawl datasets, you may be missing the earliest layer of AI visibility.

2. Retrieval

Many AI systems also use live retrieval. This means the system may search, browse, or fetch current information at the time a user asks a question.

This is especially important for recent information, local businesses, product availability, pricing, news, events, reviews, and anything published after a model’s training cutoff.

In plain English: some AI visibility comes from what the model already learned, and some comes from what the system can retrieve right now.

A complete AI visibility audit should care about both.

Why Common Crawl Matters

Common Crawl is a nonprofit organization that crawls the open web and publishes large web crawl datasets for public use. Its crawler is called CCBot.

Common Crawl’s data has been used in research, machine learning, search, and AI development. That does not mean Common Crawl controls ChatGPT, Gemini, Claude, Perplexity, or any other AI product. It does not.

However, Common Crawl can be an important upstream source of web data. That means SEOs should at least know whether their websites are accessible to CCBot and present in the Common Crawl Index.

For business owners, here is the simple version:

If AI systems learn from web-scale datasets, and your website is blocked from those datasets, you may be making it harder for AI systems to understand or recommend your business later.

The Five Checks in Stephen Burns’ AI Visibility Audit Framework

Stephen Burns’ field guide lays out five main checks:

CCBot access check
Common Crawl Index coverage audit
Harmonic Centrality check
Structured data completeness
Server-side rendering audit

Here is how each one works and how SEOs can use it.

1. CCBot Access Check

The first question is the most basic:

Can Common Crawl’s crawler actually access your website?

If CCBot is blocked, your site may not appear in Common Crawl’s archive. If it does not appear in the archive, it may be absent from datasets that rely on Common Crawl.

There are two main places this can go wrong.

Robots.txt Blocking

Your robots.txt file might include a rule that blocks CCBot:

User-agent: CCBot
Disallow: /

You may also see rules for other AI-related crawlers, such as GPTBot, ClaudeBot, Google-Extended, PerplexityBot, or other user agents.

Blocking these bots is not always a mistake. Some publishers and creators intentionally do not want their content used in AI training. That is a legitimate business and legal decision.

The problem is when a business blocks AI-related crawlers without knowing it.

This can happen because of:

A developer decision made during a previous site build
A WordPress security plugin
A CDN setting
A managed robots.txt feature
A blanket “block AI bots” toggle
A copied robots.txt file from another website
A temporary rule that never got removed

CDN, Firewall, or WAF Blocking

The second issue is more difficult to spot.

Your robots.txt file may look clean, but your CDN, firewall, or bot-management system may still block CCBot before it reaches the server.

In that case, your site appears open when you look at robots.txt, but the crawler may receive a 403 Forbidden response when it tries to access the site.

This is why the guide recommends checking both robots.txt and the actual server response.

How to Test CCBot Access

First, check robots.txt:

curl -s https://example.com/robots.txt

Look for rules that block CCBot or other AI-related crawlers.
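If you want something more repeatable than reading the file by eye, this step can be scripted. What follows is a minimal sketch, not a full robots.txt parser. The function name robots_blanket_block is my own hypothetical helper, and it only detects a blanket "Disallow: /" inside a group that names the bot explicitly. It ignores Allow overrides, path-specific rules, and the wildcard (*) group, so treat a "not blocked" result as a prompt to read the file, not as proof.

```shell
# Hypothetical helper: reads robots.txt on stdin and prints "blocked" if a
# group naming the given user agent contains a blanket "Disallow: /".
# Simplified on purpose: no Allow handling, no "*" fallback group.
robots_blanket_block() {
  awk -v bot="$1" '
    BEGIN { want = tolower(bot); in_group = 0; prev_ua = 0; blocked = 0 }
    {
      sub(/#.*/, "")                        # strip comments
      line = $0
      gsub(/^[ \t]+|[ \t\r]+$/, "", line)   # trim surrounding whitespace
      if (line == "") next
      split_at = index(line, ":")
      if (split_at == 0) next
      field = tolower(substr(line, 1, split_at - 1))
      value = substr(line, split_at + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", field)
      gsub(/^[ \t]+|[ \t]+$/, "", value)
      if (field == "user-agent") {
        if (!prev_ua) in_group = 0          # a rule line came before: new group
        if (tolower(value) == want) in_group = 1
        prev_ua = 1
        next
      }
      prev_ua = 0
      if (in_group && field == "disallow" && value == "/") blocked = 1
    }
    END { print (blocked ? "blocked" : "not blocked") }
  '
}

# Example use against a live site (not run here):
#   curl -s https://example.com/robots.txt | robots_blanket_block CCBot
```

The same function can be pointed at GPTBot, ClaudeBot, or any other user agent name by changing the argument.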

Then test the server response using a CCBot user agent:

curl -A "CCBot/2.0" -I https://example.com/

A good response usually looks like this:

HTTP/2 200

A blocked response may look like this:

HTTP/2 403

Then compare it against a normal browser user agent:

curl -A "Mozilla/5.0" -I https://example.com/

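If the browser user agent gets a 200 but CCBot gets a 403, then the site is likely blocking CCBot at the firewall, CDN, or bot-management layer. Keep in mind that this test sends the CCBot user agent from your own IP address, so a system that verifies crawler IP addresses may treat your test differently from genuine CCBot traffic.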
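The comparison above can be wrapped in two small functions so it is easy to repeat across several URLs. This is a sketch under my own assumptions: status_for and classify_access are hypothetical names, the list of "blocked" status codes (401, 403, 429) is my choice and not something the guide prescribes, and status_for issues a normal GET request (discarding the body) instead of the HEAD request that -I sends.

```shell
# Hypothetical helper: print only the HTTP status code returned for a URL
# when requested with the given user-agent string.
status_for() {
  curl -s -o /dev/null -w '%{http_code}' -A "$1" "$2"
}

# Hypothetical helper: interpret a pair of status codes.
# First argument: status seen by the CCBot user agent.
# Second argument: status seen by the browser user agent.
classify_access() {
  case "$1:$2" in
    2??:2??) echo "open" ;;
    401:2??|403:2??|429:2??) echo "likely blocked for CCBot" ;;
    *) echo "inconclusive" ;;
  esac
}

# Example use against a live site (not run here):
#   classify_access "$(status_for 'CCBot/2.0' https://example.com/)" \
#                   "$(status_for 'Mozilla/5.0' https://example.com/)"
```

An "inconclusive" result covers cases such as both requests being refused or redirected, which need a closer manual look.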

What to Do If CCBot Is Blocked

If the business wants AI visibility, accidental blocking should usually be fixed.

Places to check include:

Cloudflare bot settings
Akamai bot management
Fastly edge rules
WordPress security plugins
Server firewall rules
Managed robots.txt settings
Custom nginx or Apache rules

The important point is this: do not assume robots.txt tells the whole story. A lot of AI crawler blocking now happens at the edge before the request ever reaches the website.

2. Common Crawl Index Coverage Audit

Once you know the site is open to CCBot, the next question is:

Has Common Crawl actually captured the site?

Permission and presence are not the same thing.

A site may allow CCBot but still have poor coverage in Common Crawl. This can happen if the site is new, weakly linked, technically difficult to crawl, mostly rendered with JavaScript, blocked in the past, or not very central in the web graph.

How to Check Common Crawl Coverage

Common Crawl provides a free public index at:

https://index.commoncrawl.org/

You can query a recent crawl index like this:

curl "https://index.commoncrawl.org/CC-MAIN-2026-21-index?url=example.com/*&output=json" | head

Replace the crawl ID with a current Common Crawl index (the available crawl IDs are listed at https://index.commoncrawl.org/collinfo.json) and replace example.com with the domain you are auditing.
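The raw output is one JSON record per line, which gets hard to read quickly. Here is a rough sketch for boiling it down to a capture count and the most recent capture time. It assumes each record carries a "timestamp" field holding a digit string, which is what I have seen the index return, but check a few raw lines first. summarize_captures is a hypothetical name of mine.

```shell
# Hypothetical helper: reads index records (one JSON object per line) on
# stdin and prints how many captures were found plus the latest timestamp.
# Assumes each record contains a field like "timestamp": "20240105120000".
summarize_captures() {
  awk '
    match($0, /"timestamp": *"[0-9]+"/) {
      ts = substr($0, RSTART, RLENGTH)
      gsub(/[^0-9]/, "", ts)          # keep only the digits
      count++
      if (ts > latest) latest = ts    # equal-length digit strings sort by time
    }
    END { printf "captures=%d latest=%s\n", count, (latest == "" ? "none" : latest) }
  '
}

# Example use against the live index (not run here); substitute a current
# crawl ID for the placeholder:
#   curl -s "https://index.commoncrawl.org/<crawl-id>-index?url=example.com/*&output=json" \
#     | summarize_captures
```

A result of captures=0 for a crawl does not by itself mean the site is blocked. It is worth repeating the query against a few different crawl IDs before drawing conclusions.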

What to Look For

When reviewing Common Crawl coverage, look for:

Whether the domain appears at all
How recently it was crawled
How many URLs appear
Whether important pages are included
Whether only the homepage appears
Whether blog posts, category pages, product pages, or service pages are present
Whether low-value parameter URLs are being captured instead of canonical URLs
Whether old URLs appear but new URLs do not
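For the parameter-URL item in particular, a quick count can show whether crawl budget is going to query-string variants. This is only a sketch: count_parameter_urls is a hypothetical helper, it assumes each index record has a "url" field, and it treats any URL containing a question mark as a parameter URL, which will not fit every site.

```shell
# Hypothetical helper: reads index records (one JSON object per line) on
# stdin and reports how many captured URLs contain a query string.
# Assumes each record contains a field like "url": "https://example.com/".
count_parameter_urls() {
  awk '
    match($0, /"url": *"[^"]*"/) {
      url = substr($0, RSTART, RLENGTH)
      total++
      if (index(url, "?") > 0) with_params++
    }
    END { printf "total=%d with_params=%d\n", total, with_params }
  '
}

# Example use (not run here), reading records saved from an index query
# into a local file with a name of your choosing:
#   count_parameter_urls < captures.jsonl
```

If most of the captured URLs carry parameters while the canonical pages are missing, that points toward internal linking and canonicalization work.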

What Good Coverage Looks Like

For a healthy business website, you usually want to see more than just the homepage.

Depending on the site, important URLs may include:

Homepage
About page
Service pages
Location pages
Product pages
Category pages
Blog posts
Buying guides
Author pages
Research pages
Resource pages

For local businesses, I would pay close attention to service pages, city pages, reviews, FAQs, and pages that clearly explain what the business does and where it operates.

For ecommerce websites, I would pay attention to category pages, product pages, buying guides, brand pages, and comparison content.

For publishers, I would check recent articles, evergreen explainers, author pages, category pages, and high-performing older content.

What Poor Coverage Looks Like

Bad signs include:

No results for the domain
Only the homepage appears
Important sections are missing
The most recent crawl is very old
Only parameter URLs appear
Only old URLs from a previous site version appear
JavaScript app shell URLs appear without meaningful content

If the site is technically open but barely appears in Common Crawl, the next step is to investigate discovery, internal links, backlinks, rendering, crawl depth, and site architecture.

3. Harmonic Centrality Check

This is one of the more interesting parts of the guide because it connects AI visibility back to the structure of the web.

Common Crawl uses web graph data to help understand and prioritize the web. Stephen Burns explains that Harmonic Centrality can help describe how close a domain is to the “core” of the web’s link structure.

This is different from thinking only in terms of backlink volume.

A website with many low-quality or isolated links may not be as central as a website with fewer links from highly connected, trusted, frequently crawled parts of the web.

Why Harmonic Centrality Matters

If a domain is more central in the web graph, it may be crawled more often and more deeply.

If a domain is less central, it may still be accessible, but crawled less frequently or less completely.

For AI visibility, that matters because being technically crawlable is not the same as being prioritized.

How to Check It

The guide references this community tool:

https://webgraph.metehan.ai/

This is not an official Common Crawl product. It is a community tool built on Common Crawl Web Graph data. That means it can be useful for quick checks, but SEOs should avoid treating it as a perfect or final score.

How SEOs Should Use This

Do not turn Harmonic Centrality into another fake “domain authority” metric.

Use it as a strategic signal.

If a site has weak Common Crawl coverage and weak centrality, then link building, digital PR, local mentions, and industry citations may help with more than just Google rankings. They may also improve the site’s ability to be discovered and crawled by systems that rely on the web graph.

For local businesses, that could mean earning mentions from:

Local news websites
Chambers of commerce
Industry associations
Local universities
Relevant local directories
Community organizations
Event pages
Vendor and partner websites

For ecommerce brands, that could mean earning mentions from:

Buying guides
Product reviews
Industry publications
Manufacturer pages
Comparison articles
Affiliate publishers
Category resource pages

For B2B companies, that could mean earning mentions from:

Partner pages
Integration marketplaces
Software review sites
Conference pages
Podcast pages
Research citations
Industry reports

This is where AI visibility overlaps with brand building. The more your brand is connected to trusted and central parts of the web, the easier it may be for crawlers and AI systems to discover, understand, and associate your business with its topics.

4. Structured Data Completeness

The fourth check is structured data.

Structured data does not force an AI system to cite you. It does not guarantee rankings. It does not automatically make your business show up in ChatGPT, Gemini, Claude, or Perplexity.

But it can help machines understand the entities on your website.

That matters because AI visibility is not just about keywords. It is also about entity understanding.

You want machines to understand:

Who the business is
What the business does
Where the business operates
Who wrote the content
What products or services are described
Which pages are related
Which reviews belong to which business
Which organization owns the website
Which social profiles and external references confirm the entity

Structured Data Types to Review

Depending on the website, useful schema types may include:

Organization
LocalBusiness
Article
BlogPosting
Product
Service
FAQPage
BreadcrumbList
Person
Review
AggregateRating
WebSite
WebPage

For local SEO, I would pay special attention to LocalBusiness schema, service area information, sameAs links, reviews, and NAP consistency.

For author-driven websites, I would review Person schema, author pages, article schema, and sameAs links.

For ecommerce websites, I would review Product schema, offers, availability, reviews, brand, GTIN or MPN fields where applicable, and breadcrumbs.

How to Test Structured Data

Use Google’s Rich Results Test:

https://search.google.com/test/rich-results

You can also use Schema.org’s validator:

https://validator.schema.org/

The goal is not to add every schema type possible. The goal is to make the important entities clear, accurate, and consistent.
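As a quick complement to those validators, you can list which schema types a page declares in its raw HTML. This is a rough sketch and not a validator: list_schema_types is a hypothetical helper, it simply pattern-matches "@type" values with a single string value, and it will miss array-valued types as well as microdata and RDFa markup.

```shell
# Hypothetical helper: reads HTML on stdin and prints the distinct values of
# "@type" keys that are written as a single quoted string.
list_schema_types() {
  grep -o '"@type"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed 's/.*"\([^"]*\)"$/\1/' \
    | sort -u
}

# Example use against a live page (not run here):
#   curl -s https://example.com/key-page | list_schema_types
```

Because it reads the raw HTML, an empty result on a page that passes the Rich Results Test can also hint that the schema is being injected by JavaScript.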

Common Structured Data Problems

Missing Organization schema
Wrong business type
Inconsistent name, address, or phone number
Missing author markup on articles
Product schema without offers or availability
Incorrect review schema
FAQ schema that does not match visible content
Breadcrumb schema that conflicts with the site structure
Old schema left behind after a redesign
Multiple plugins outputting conflicting schema

This is usually one of the easier parts of the audit to fix. If the site is already crawlable, structured data can help make the content more machine-readable.

5. Server-Side Rendering Audit

The fifth check is whether important content exists in the raw HTML.

This matters because many crawlers do not behave like full modern browsers. Some may not execute JavaScript. Some may capture the initial HTML and move on. Some may not wait for client-side content to load.

If the content only appears after JavaScript runs, some crawlers may see a mostly empty page.

How to Check Raw HTML Visibility

Pick an important page and search the raw HTML for a unique phrase from the visible content.

curl -s https://example.com/key-page | grep -i "unique headline text"

If the phrase appears, the content is present in the raw HTML.

If the phrase does not appear, the content may be injected by JavaScript after the initial page load.
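To run the same check across several templates, the grep step can be wrapped in a small function. A minimal sketch, with hypothetical helper names (phrase_in_html and check_page) and a fixed-string, case-insensitive match as my own choices:

```shell
# Hypothetical helper: reads HTML on stdin and reports whether the given
# phrase occurs in it (case-insensitive, matched as a literal string).
phrase_in_html() {
  if grep -qiF -- "$1"; then
    echo "present"
  else
    echo "missing"
  fi
}

# Hypothetical helper: fetch a URL without executing JavaScript and report
# whether the phrase is in the raw response.
check_page() {
  printf '%s: ' "$1"
  curl -s "$1" | phrase_in_html "$2"
}

# Example use against live pages (not run here):
#   check_page https://example.com/key-page "unique headline text"
#   check_page https://example.com/another-page "another unique phrase"
```

Pick phrases from the main body copy instead of the title tag or navigation, since those often exist in the initial HTML even when the body is rendered client-side.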

Pages to Test

Do not only test the homepage.

Test important templates and page types, including:

Top service pages
Top location pages
Top blog posts
Product pages
Category pages
Author pages
FAQ pages
Comparison pages
Review pages

This is especially important for websites built with heavy JavaScript frameworks, headless CMS setups, client-side rendered apps, faceted ecommerce systems, and some modern page builders.

What to Fix

If important content is missing from raw HTML, consider:

Server-side rendering
Static generation
Hybrid rendering
Prerendering key pages
Reducing reliance on client-side injected content
Making core copy, internal links, schema, and metadata available in the initial HTML

This is not only an AI visibility issue. It is also a technical SEO issue.

Verify Real CCBot Traffic Before Making Decisions

One of the most useful warnings in Stephen Burns’ guide is that user-agent strings are not proof.

Anyone can claim to be CCBot. A scraper, spam bot, or bad actor can send requests using the CCBot user-agent string.

That means you should not blame Common Crawl for crawler traffic unless you verify that the requests are actually coming from Common Crawl.

The correct process is a forward-confirmed reverse DNS check.

Common Crawl says the IP address behind a real CCBot request should reverse-resolve to a hostname ending in .crawl.commoncrawl.org, and that hostname should then resolve back to the same IP address.

Example:

host 18.97.14.84

Then verify the hostname resolves back:

host 18-97-14-84.crawl.commoncrawl.org
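The two lookups can be combined into one scripted check. Treat the following as a sketch under stated assumptions: is_ccbot_hostname and verify_ccbot_ip are hypothetical names, the parsing relies on the usual "domain name pointer" and "has address" wording in the output of the host command, and only IPv4 addresses are handled.

```shell
# Hypothetical helper: succeeds only if the hostname sits under
# crawl.commoncrawl.org (with or without the trailing dot DNS tools print).
is_ccbot_hostname() {
  case "$1" in
    *.crawl.commoncrawl.org|*.crawl.commoncrawl.org.) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: forward-confirmed reverse DNS for one IPv4 address.
# The body runs in a subshell, so "exit" only ends this function.
verify_ccbot_ip() (
  ip=$1
  # Reverse lookup: take the hostname from the "domain name pointer" line.
  name=$(host "$ip" | awk '/domain name pointer/ { print $NF; exit }')
  is_ccbot_hostname "$name" || { echo "not CCBot"; exit 1; }
  # Forward lookup: the hostname must resolve back to the same IP.
  if host "$name" | awk -v ip="$ip" \
       '/has address/ && $NF == ip { found = 1 } END { exit !found }'; then
    echo "verified"
  else
    echo "not CCBot"
    exit 1
  fi
)

# Example use with an IP taken from your server logs (not run here):
#   verify_ccbot_ip 18.97.14.84
```

The suffix check matters because anyone who controls reverse DNS for their own IP range can set it to an arbitrary hostname. The forward lookup is what confirms the claim.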

Common Crawl also publishes CCBot IP ranges here:

https://index.commoncrawl.org/ccbot.json

This matters because some site owners may block CCBot after seeing what they think is bad bot behavior, when the traffic may actually be from an impostor using the CCBot name.

The Business Owner Version of the Audit

If you own a business and do not want to run command-line tests yourself, ask your SEO, developer, or web host these questions:

Does our robots.txt file block CCBot, GPTBot, ClaudeBot, Google-Extended, or other AI-related crawlers?
Does our CDN or firewall block these bots even if robots.txt allows them?
Are we present in the Common Crawl Index?
Are our most important pages present in Common Crawl?
Are our important pages visible in the raw HTML?
Do we have valid structured data on our important pages?
Are we earning links and mentions from trusted, relevant, and well-connected websites?

If the answer to several of these questions is “I don’t know,” an AI visibility audit is probably worth doing.

The SEO Consultant Version of the Audit

If you are an SEO consultant or agency, this can become a practical technical deliverable.

It does not need to be a massive report.

A useful AI visibility audit could include:

Executive summary
CCBot robots.txt access result
CCBot server response result
AI crawler robots.txt review
CDN/WAF access notes
Common Crawl Index presence
Important URL coverage sample
Most recent Common Crawl capture date
Harmonic Centrality / Web Graph notes
Structured data issues
Raw HTML rendering issues
Priority fixes
Recommended next steps

Example AI Visibility Audit Scorecard

Audit Area	Status	What It Means	Recommended Fix
CCBot robots.txt access	Pass / Fail	Shows whether CCBot is allowed by robots.txt.	Remove accidental disallow rules if AI visibility is desired.
CCBot server response	Pass / Fail	Shows whether the CDN, WAF, or server allows CCBot requests.	Adjust CDN, firewall, or bot-management settings.
Common Crawl Index coverage	Present / Missing / Thin	Shows whether the domain is actually captured.	Improve crawlability, links, rendering, and access.
Harmonic Centrality	Strong / Moderate / Weak	Suggests whether the domain may be prioritized or deprioritized for crawling.	Earn links and mentions from better-connected websites.
Structured data	Complete / Partial / Missing	Shows whether important entities are machine-readable.	Add or fix Organization, Article, Product, LocalBusiness, Breadcrumb, and author markup.
Server-side rendering	Pass / Fail	Shows whether important content exists in raw HTML.	Use SSR, static generation, hybrid rendering, or prerendering for key pages.

What If Your Website Blocks AI Crawlers?

The answer depends on your goals.

Some publishers, creators, and businesses may intentionally want to block AI training crawlers. That can be a valid strategy, especially for organizations with premium content, licensing concerns, legal concerns, or strong objections to AI training.

But many businesses probably do not want to be invisible to AI systems by accident.

For example, most local businesses, ecommerce stores, SaaS companies, B2B companies, professional service firms, and consultants likely want to be discoverable in AI-generated recommendations and research workflows.

For those businesses, accidentally blocking CCBot and other AI-related crawlers may be a problem.

The key is not that every business should allow every bot. The key is that the decision should be intentional.

Do not let a default CDN setting decide your AI visibility strategy for you.

Should Every Business Allow Every AI Bot?

No.

This is where the conversation needs nuance.

There are different kinds of bots:

Training data crawlers
Search crawlers
Retrieval crawlers
AI assistant crawlers
SEO tool crawlers
Scrapers pretending to be legitimate bots
Spam bots

A business may want to allow some and block others.

A company may want its marketing pages discoverable but not private app routes. A publisher may want search visibility but not training use. An ecommerce store may want product pages accessible but still block aggressive scrapers.

The audit does not force one answer. It helps you understand what is actually happening so you can make a deliberate decision.

How This Applies to Local SEO

Local SEO is one of the areas where this matters most.

People are increasingly asking AI systems questions like:

Who is the best plumber near me?
What are the top family lawyers in Dallas?
Which med spas offer semaglutide in Plano?
What are the best restaurants for a birthday dinner in Fort Worth?
Which roofing companies have good reviews in McKinney?

AI systems may answer these questions using a mix of trained knowledge, live search, business profiles, reviews, local directories, citations, and website content.

An AI visibility audit will not replace local SEO basics. Businesses still need Google Business Profile optimization, reviews, local citations, service pages, local links, and accurate NAP information.

But it can help make sure the business’s own website is not excluded from the AI discovery layer.

For local businesses, I would combine Stephen Burns’ audit framework with:

Google Business Profile optimization
Bing Places optimization
Apple Business Connect
LocalBusiness schema
Review acquisition
Local citations
Local news mentions
City and service pages
Authoritative local links
Clear NAP consistency

How This Applies to Ecommerce SEO

For ecommerce websites, an AI visibility audit can reveal whether product and category content is accessible to AI systems.

This matters because users are asking AI tools for product recommendations, comparisons, gift ideas, buying guides, and alternatives.

For ecommerce, I would pay attention to:

Product page coverage in Common Crawl
Category page coverage
Buying guide coverage
Brand page coverage
Product schema
Review schema
Server-side rendering
Faceted navigation problems
Canonical tags
Thin product descriptions
Duplicate manufacturer copy
Blocked scripts or resources

If your product content is not accessible, rendered, structured, or included in important crawl datasets, competitors may have an advantage in future AI shopping recommendations.

How This Applies to Content SEO

For content-heavy sites, the AI visibility audit is a reminder that publishing is not enough.

Your best content needs to be:

Discoverable
Crawlable
Accessible
Present in raw HTML
Structured
Internally linked
Externally referenced
Associated with clear authors and entities
Updated when necessary

This is especially important for evergreen guides, research reports, glossary pages, statistics pages, and thought leadership content.

If your best content is blocked, hidden behind JavaScript, disconnected from the web graph, or missing structured data, it may not perform as well in AI discovery systems as it should.

A Practical AI Visibility Audit Checklist

Access

Review robots.txt for CCBot rules.
Review robots.txt for GPTBot, ClaudeBot, Google-Extended, PerplexityBot, and other AI-related bots.
Check whether a CDN or security provider is managing robots.txt.
Test the homepage with a CCBot user agent.
Test important internal pages with a CCBot user agent.
Compare CCBot responses against normal browser responses.
Check CDN/WAF settings for AI bot blocking.

Common Crawl Coverage

Query the Common Crawl Index for the root domain.
Check the most recent crawl date.
Check whether important sections are present.
Check whether important URLs are missing.
Compare coverage across multiple crawl snapshots if needed.
Look for crawl waste, parameter URLs, and low-value URLs.

Centrality and Discovery

Check Web Graph / Harmonic Centrality signals.
Review external links and brand mentions.
Identify whether the site is connected to trusted industry or local sources.
Look for isolated pages with weak internal links.
Build links and mentions around relevance and web connectivity, not just raw link volume.

Structured Data

Validate Organization or LocalBusiness schema.
Validate Article or BlogPosting schema.
Validate Product schema where applicable.
Validate BreadcrumbList schema.
Check author markup.
Check sameAs links.
Remove duplicate or conflicting schema from multiple plugins.

Rendering

Fetch raw HTML for important pages.
Confirm core body content appears without JavaScript.
Confirm internal links appear without JavaScript.
Confirm schema appears in the source HTML.
Confirm titles, meta descriptions, canonicals, and headings are present.
Flag pages where crawlers may see an empty app shell.

Reporting

Create a one-page scorecard.
Separate urgent fixes from strategic improvements.
Explain what each issue means in plain English.
Document whether blocking appears intentional or accidental.
Include screenshots or command outputs when helpful.
Recommend next steps based on the business’s AI visibility goals.

Important Caveat: Common Crawl Inclusion Does Not Guarantee AI Visibility

Being included in Common Crawl does not guarantee that ChatGPT, Gemini, Claude, Perplexity, AI Overviews, or any other AI system will mention your brand.

There are many steps between being crawled and being recommended.

Your content may be filtered. It may not be used by a specific model. It may be outweighed by stronger sources. It may be retrieved but not cited. It may be understood but not selected. It may be outdated by the time a user asks a question.

So the right claim is not:

“Get into Common Crawl and you will rank in AI.”

The better claim is:

“If your site is blocked from important crawl and retrieval systems, you may be removing yourself from AI visibility opportunities before the competition even starts.”

That is the value of the audit.

Important Caveat: Some Sites Should Intentionally Opt Out

The guide also gives space to the other side of the issue.

Some publishers, creators, and businesses may not want their content included in AI training datasets. That is a legitimate position.

The same audit process can help those sites too.

Instead of asking, “Are we accidentally blocked?” they can ask, “Are we actually opted out in the way we intended?”

That distinction matters because blocking rules can be inconsistent across bots. A rule that blocks one crawler may not block another. A CDN setting may say one thing while robots.txt says another. A business may believe it opted out when the technical setup does not fully match that intent.

In other words, the audit is useful whether your strategy is inclusion or exclusion. The point is to know what is actually happening.

Final Thoughts

Stephen Burns and Common Crawl deserve credit for putting a practical framework around a part of AI SEO that has been easy to talk about but harder to audit.

The most useful idea in the guide is that AI visibility starts before the answer engine. It starts at the crawl layer.

For years, SEOs have audited whether Google can crawl, render, index, and rank a page. Now we also need to ask whether AI-related crawlers and datasets can access, capture, and understand that same content.

That does not mean every business should blindly open the door to every bot.

It does mean business owners should make that decision intentionally.

If you want to appear in AI systems, being blocked from the crawl is a problem.

If you want to stay out of AI systems, you need to verify that your blocking strategy actually works.

Either way, this is now part of the SEO conversation.

FAQs

What is an AI visibility audit?

An AI visibility audit checks whether a website is accessible, understandable, and discoverable by AI-related crawlers, datasets, and retrieval systems. It usually includes crawler access, Common Crawl coverage, structured data, rendering, and web graph visibility checks.

Who created the AI Visibility Audit field guide?

The field guide The AI Visibility Audit was created by Stephen Burns, Web Intelligence Lead at the Common Crawl Foundation. Common Crawl published the guide as a free resource for SEOs and GEO practitioners.

What is Common Crawl?

Common Crawl is a nonprofit organization that crawls the open web and publishes large web crawl datasets for public use. These datasets are used in research, machine learning, and other applications.

What is CCBot?

CCBot is Common Crawl’s web crawler. It visits publicly accessible web pages and follows robots.txt rules. If your website blocks CCBot, your content may be excluded from Common Crawl’s datasets.

Does blocking CCBot hurt Google rankings?

Blocking CCBot should not directly hurt Google rankings because CCBot is not Googlebot. However, if your goal is AI visibility, blocking CCBot may reduce your chances of being included in Common Crawl datasets that may be used by AI systems.

Does allowing CCBot guarantee that ChatGPT will mention my business?

No. Allowing CCBot does not guarantee visibility in ChatGPT or any other AI system. It only helps make sure your site is not excluded at the crawl layer. AI systems still filter, retrieve, rank, summarize, and select information in different ways.

Can my site rank in Google but be invisible to AI systems?

Yes. A site can rank in Google but still be blocked from AI-related crawlers, missing from Common Crawl, hidden behind JavaScript, or inaccessible to live retrieval systems.

Should local businesses care about Common Crawl?

Yes, especially if they care about future AI visibility. Local businesses should still focus on Google Business Profile, Bing Places, reviews, citations, local links, and useful website content, but they should also make sure their site is not accidentally blocking AI-related crawlers.

Should ecommerce stores run an AI visibility audit?

Yes. Ecommerce stores should check whether product pages, category pages, buying guides, reviews, and brand pages are accessible, rendered in raw HTML, marked up with structured data, and included in Common Crawl where appropriate.

What is Harmonic Centrality?

Harmonic Centrality is a web graph measure that helps describe how close a domain is to the core of the web’s link structure. In the context of Common Crawl, it can help explain why some sites may be crawled more frequently or deeply than others.

Is this the same as GEO?

It overlaps with GEO, or generative engine optimization, but it is more specific. This audit focuses on crawl access, training-data visibility, retrieval access, structured data, and rendering. GEO can also include content strategy, entity optimization, digital PR, citations, and direct testing of AI answers.

How often should you run an AI visibility audit?

For most businesses, running this quarterly or after major website changes is a good starting point. You should also run it after changing CDN, firewall, robots.txt, JavaScript framework, CMS, security plugin, or hosting settings.

What is the biggest mistake businesses make with AI crawler access?

The biggest mistake is assuming nothing is blocked. Many sites may block AI crawlers through a CDN, WAF, plugin, or managed robots.txt setting without the business owner realizing it.

Should I block AI crawlers or allow them?

That depends on your business goals. If you want more AI visibility, accidental blocking is probably bad. If you are a publisher or creator who does not want content used in AI training, blocking may be intentional. The important thing is to make the decision deliberately and verify that the technical setup matches your intent.

Methodology

This article is based on a third-party review of The AI Visibility Audit, a field guide written by Stephen Burns, Web Intelligence Lead at the Common Crawl Foundation, along with Common Crawl’s announcement article introducing the guide. The analysis above adds practical SEO consulting context for business owners, local SEOs, ecommerce SEOs, content SEOs, and technical SEOs.

Common Crawl and Stephen Burns should receive full credit for the creation of the field guide and the five-check audit framework discussed in this article.
