How to Run an AI Visibility Audit Using Common Crawl’s New Field Guide

Table of Contents
  • What Is an AI Visibility Audit?
  • The Big Idea: Search Is Becoming “Index, Rank, Train, and Retrieve”
  • Why Common Crawl Matters
  • The Five Checks in Stephen Burns’ AI Visibility Audit Framework
  • 1. CCBot Access Check
  • 2. Common Crawl Index Coverage Audit
  • 3. Harmonic Centrality Check
  • 4. Structured Data Completeness
  • 5. Server-Side Rendering Audit
  • Verify Real CCBot Traffic Before Making Decisions
  • The Business Owner Version of the Audit
  • The SEO Consultant Version of the Audit
  • What If Your Website Blocks AI Crawlers?
  • Should Every Business Allow Every AI Bot?
  • How This Applies to Local SEO
  • How This Applies to Ecommerce SEO
  • How This Applies to Content SEO
  • A Practical AI Visibility Audit Checklist
  • Important Caveat: Common Crawl Inclusion Does Not Guarantee AI Visibility
  • Important Caveat: Some Sites Should Intentionally Opt Out
  • Final Thoughts
  • FAQs
  • Methodology

    Summary

    This post discusses Common Crawl’s new field guide, “The AI Visibility Audit,” written by Stephen Burns, which provides a practical framework for auditing a website’s accessibility to AI-related crawlers and datasets. The audit covers CCBot access, Common Crawl index coverage, harmonic centrality, structured data completeness, and server-side rendering. It explains why AI visibility matters as search expands to include training data and retrieval systems. The post also covers intentional blocking of AI crawlers, the importance of making AI visibility decisions deliberately, and how such audits apply to local SEO, ecommerce, and content-driven sites. It provides actionable steps for conducting an AI visibility audit and highlights the need to verify that a site’s technical setup matches its intended goals.

    Common Crawl recently published a new field guide called The AI Visibility Audit, written by Stephen Burns, Web Intelligence Lead at the Common Crawl Foundation.

    The guide is worth paying attention to because it gives SEOs and business owners a practical framework for auditing something most traditional SEO audits still do not cover very well: whether a website is actually reachable by the crawlers, indexes, and datasets that can influence AI visibility.

    I am not affiliated with Common Crawl or Stephen Burns. This article is a third-party look at the guide from the perspective of an SEO consultant, with additional notes on how business owners, in-house marketers, and agencies can use the framework in the real world. I urge everyone to read the source material directly.

    Common Crawl and Stephen Burns deserve credit for creating the field guide and the five-check audit framework discussed below.

    What Is an AI Visibility Audit?

    An AI visibility audit checks whether a website can be accessed, captured, understood, and potentially used by AI-related crawlers, data sources, and retrieval systems.

    This is different from a normal SEO audit.

    A traditional SEO audit usually asks questions like:

    • Can Googlebot crawl the site?
    • Are important pages indexed?
    • Are titles, headings, canonicals, and internal links set up correctly?
    • Does the site have enough useful content?
    • Does the site have authority and backlinks?
    • Are pages ranking and converting?

    Those questions still matter. None of that goes away.

    But AI visibility adds a new layer upstream of rankings. Before an AI system can recommend, summarize, cite, or “know” your business, your content has to be reachable by the systems that discover and collect web data.

    That is the main point of Stephen Burns’ field guide. A page can rank well in Google and still be mostly invisible to AI systems if it is blocked from the wrong crawlers, missing from important crawl datasets, hidden behind JavaScript, or disconnected from the parts of the web that get crawled frequently.

    The Big Idea: Search Is Becoming “Index, Rank, Train, and Retrieve”

    For years, SEO has mostly revolved around crawling, indexing, ranking, and search result presentation.

    AI systems add two more ideas to the mix:

    1. Training Data

    Some AI systems learn from large datasets built from web crawls, licensed data, user interactions, documents, and other sources. If your website is included in the right datasets before a model is trained, then your content may have a chance to become part of what the model “knows.”

    This does not guarantee that a model will mention your business. It does not guarantee citations. It does not mean you can “rank” in AI the same way you rank in Google.

    But if your website is completely absent from important crawl datasets, you may be missing the earliest layer of AI visibility.

    2. Retrieval

    Many AI systems also use live retrieval. This means the system may search, browse, or fetch current information at the time a user asks a question.

    This is especially important for recent information, local businesses, product availability, pricing, news, events, reviews, and anything published after a model’s training cutoff.

    In plain English: some AI visibility comes from what the model already learned, and some comes from what the system can retrieve right now.

    A complete AI visibility audit should care about both.

    Why Common Crawl Matters

    Common Crawl is a nonprofit organization that crawls the open web and publishes large web crawl datasets for public use. Its crawler is called CCBot.

    Common Crawl’s data has been used in research, machine learning, search, and AI development. That does not mean Common Crawl controls ChatGPT, Gemini, Claude, Perplexity, or any other AI product. It does not.

    However, Common Crawl can be an important upstream source of web data. That means SEOs should at least know whether their websites are accessible to CCBot and present in the Common Crawl Index.

    For business owners, here is the simple version:

    If AI systems learn from web-scale datasets, and your website is blocked from those datasets, you may be making it harder for AI systems to understand or recommend your business later.

    The Five Checks in Stephen Burns’ AI Visibility Audit Framework

    Stephen Burns’ field guide lays out five main checks:

    1. CCBot access check
    2. Common Crawl Index coverage audit
    3. Harmonic Centrality check
    4. Structured data completeness
    5. Server-side rendering audit

    Here is how each one works and how SEOs can use it.

    1. CCBot Access Check

    The first question is the most basic:

    Can Common Crawl’s crawler actually access your website?

    If CCBot is blocked, your site may not appear in Common Crawl’s archive. If it does not appear in the archive, it may be absent from datasets that rely on Common Crawl.

    There are two main places this can go wrong.

    Robots.txt Blocking

    Your robots.txt file might include a rule that blocks CCBot:

    User-agent: CCBot
    Disallow: /

    You may also see rules for other AI-related crawlers, such as GPTBot, ClaudeBot, Google-Extended, PerplexityBot, or other user agents.

    Blocking these bots is not always a mistake. Some publishers and creators intentionally do not want their content used in AI training. That is a legitimate business and legal decision.

    The problem is when a business blocks AI-related crawlers without knowing it.

    This can happen because of:

    • A developer decision made during a previous site build
    • A WordPress security plugin
    • A CDN setting
    • A managed robots.txt feature
    • A blanket “block AI bots” toggle
    • A copied robots.txt file from another website
    • A temporary rule that never got removed

    CDN, Firewall, or WAF Blocking

    The second issue is more difficult to spot.

    Your robots.txt file may look clean, but your CDN, firewall, or bot-management system may still block CCBot before it reaches the server.

    In that case, your site appears open when you look at robots.txt, but the crawler may receive a 403 Forbidden response when it tries to access the site.

    This is why the guide recommends checking both robots.txt and the actual server response.

    How to Test CCBot Access

    First, check robots.txt:

    curl -s https://example.com/robots.txt

    Look for rules that block CCBot or other AI-related crawlers.
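
    On sites with long robots.txt files, this review can be scripted. The following is a minimal sketch, not a definitive implementation: it uses Python's standard-library urllib.robotparser to report which AI-related user agents are disallowed from a given URL. The function name blocked_agents and the agent list are illustrative choices, and Python's parser may read edge cases differently from the crawlers themselves.

```python
from urllib.robotparser import RobotFileParser

# Illustrative list taken from the crawlers named in this article.
AI_AGENTS = ["CCBot", "GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot"]


def blocked_agents(robots_txt: str, url: str = "https://example.com/") -> list[str]:
    """Return the AI-related agents that robots.txt disallows for `url`."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_AGENTS if not parser.can_fetch(agent, url)]
```

    Pass in the text returned by the curl command above. An empty list means none of the listed agents is blocked by robots.txt for that URL, which says nothing yet about CDN or firewall rules.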

    Then test the server response using a CCBot user agent:

    curl -A "CCBot/2.0" -I https://example.com/

    A good response usually looks like this:

    HTTP/2 200

    A blocked response may look like this:

    HTTP/2 403

    Then compare it against a normal browser user agent:

    curl -A "Mozilla/5.0" -I https://example.com/

    If the browser user agent gets a 200 but CCBot gets a 403, then the site is likely blocking CCBot at the firewall, CDN, or bot-management layer.
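
    The same comparison can be automated across many URLs. This is a rough sketch using only the Python standard library; the function names and result labels are hypothetical, and the logic mirrors only the 200 versus 403 case described above. Note that urllib follows redirects, so the status reported is the final one, unlike curl -I without -L.

```python
import urllib.error
import urllib.request


def head_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Send a HEAD request with the given User-Agent and return the status code."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx responses; the code is still the answer we want.
        return error.code


def diagnose_access(bot_status: int, browser_status: int) -> str:
    """Classify the pair of status codes (labels are illustrative)."""
    if bot_status == 200 and browser_status == 200:
        return "open"
    if bot_status == 403 and browser_status == 200:
        return "likely blocked for CCBot at the CDN, firewall, or bot-management layer"
    return "inconclusive: inspect both responses manually"
```

    A typical call would be diagnose_access(head_status(url, "CCBot/2.0"), head_status(url, "Mozilla/5.0")). Anything other than the two clear cases is deliberately reported as inconclusive rather than guessed at.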

    What to Do If CCBot Is Blocked

    If the business wants AI visibility, accidental blocking should usually be fixed.

    Places to check include:

    • Cloudflare bot settings
    • Akamai bot management
    • Fastly edge rules
    • WordPress security plugins
    • Server firewall rules
    • Managed robots.txt settings
    • Custom nginx or Apache rules

    The important point is this: do not assume robots.txt tells the whole story. A lot of AI crawler blocking now happens at the edge before the request ever reaches the website.

    2. Common Crawl Index Coverage Audit

    Once you know the site is open to CCBot, the next question is:

    Has Common Crawl actually captured the site?

    Permission and presence are not the same thing.

    A site may allow CCBot but still have poor coverage in Common Crawl. This can happen if the site is new, weakly linked, technically difficult to crawl, mostly rendered with JavaScript, blocked in the past, or not very central in the web graph.

    How to Check Common Crawl Coverage

    Common Crawl provides a free public index at:

    https://index.commoncrawl.org/

    You can query a recent crawl index like this:

    curl "https://index.commoncrawl.org/CC-MAIN-2026-21-index?url=example.com/*&output=json" | head

    Replace the crawl ID with a current Common Crawl index and replace example.com with the domain you are auditing.
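
    Because crawl IDs change with each release, it can help to build the query URL in code. The sketch below assumes that the collection list published at https://index.commoncrawl.org/collinfo.json is a JSON array of objects with an "id" field in the form CC-MAIN-YYYY-WW; that format is an assumption to verify against the live response, and both helper names are hypothetical.

```python
import re
from urllib.parse import urlencode

# Assumed ID shape for regular crawls, e.g. "CC-MAIN-2026-21".
CRAWL_ID = re.compile(r"^CC-MAIN-\d{4}-\d{2}$")


def latest_crawl_id(collections: list[dict]) -> str:
    """Pick the most recent crawl ID from an already-parsed collection list."""
    ids = [c["id"] for c in collections if CRAWL_ID.match(str(c.get("id", "")))]
    if not ids:
        raise ValueError("no CC-MAIN-YYYY-WW crawl IDs found")
    # Year and week are zero-padded, so string order matches date order.
    return max(ids)


def index_query_url(crawl_id: str, domain: str) -> str:
    """Build the index query URL for every captured URL under a domain."""
    query = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{query}"
```

    The wildcard is percent-encoded in the result, which is equivalent to the quoted curl URL above.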

    What to Look For

    When reviewing Common Crawl coverage, look for:

    • Whether the domain appears at all
    • How recently it was crawled
    • How many URLs appear
    • Whether important pages are included
    • Whether only the homepage appears
    • Whether blog posts, category pages, product pages, or service pages are present
    • Whether low-value parameter URLs are being captured instead of canonical URLs
    • Whether old URLs appear but new URLs do not
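
    Several of these questions can be answered by summarizing the index output rather than reading it line by line. The following is a sketch under stated assumptions: it expects the text returned by the index query with output=json, one JSON object per line, each with a "url" field and a 14-digit "timestamp" field. The function name and the keys of the returned dictionary are illustrative.

```python
import json
from urllib.parse import urlsplit


def summarize_captures(index_output: str) -> dict:
    """Summarize line-delimited JSON records from a Common Crawl index query."""
    urls = set()
    latest = ""
    for line in index_output.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        url = record.get("url")
        if not url:
            continue  # skip objects that are not captures, such as error messages
        urls.add(url)
        # Timestamps are fixed-width digit strings, so string comparison works.
        latest = max(latest, record.get("timestamp", ""))
    paths = {urlsplit(u).path or "/" for u in urls}
    return {
        "unique_urls": len(urls),
        "latest_capture": latest or None,
        "homepage_only": bool(urls) and paths == {"/"},
        "parameter_urls": sum(1 for u in urls if urlsplit(u).query),
    }
```

    A result with homepage_only set to True, or with most URLs counted as parameter_urls, points to the coverage problems listed above and is a prompt for manual review rather than a verdict.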

    What Good Coverage Looks Like

    For a healthy business website, you usually want to see more than just the homepage.

    Depending on the site, important URLs may include:

    • Homepage
    • About page
    • Service pages
    • Location pages
    • Product pages
    • Category pages
    • Blog posts
    • Buying guides
    • Author pages
    • Research pages
    • Resource pages

    For local businesses, I would pay close attention to service pages, city pages, reviews, FAQs, and pages that clearly explain what the business does and where it operates.

    For ecommerce websites, I would pay attention to category pages, product pages, buying guides, brand pages, and comparison content.

    For publishers, I would check recent articles, evergreen explainers, author pages, category pages, and high-performing older content.

    What Poor Coverage Looks Like

    Bad signs include:

    • No results for the domain
    • Only the homepage appears
    • Important sections are missing
    • The most recent crawl is very old
    • Only parameter URLs appear
    • Only old URLs from a previous site version appear
    • JavaScript app shell URLs appear without meaningful content

    If the site is technically open but barely appears in Common Crawl, the next step is to investigate discovery, internal links, backlinks, rendering, crawl depth, and site architecture.

    3. Harmonic Centrality Check

    This is one of the more interesting parts of the guide because it connects AI visibility back to the structure of the web.

    Common Crawl uses web graph data to help understand and prioritize the web. Stephen Burns explains that Harmonic Centrality can help describe how close a domain is to the “core” of the web’s link structure.

    This is different from thinking only in terms of backlink volume.

    A website with many low-quality or isolated links may not be as central as a website with fewer links from highly connected, trusted, frequently crawled parts of the web.
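
    To make the idea concrete: in its standard graph-theory definition, the harmonic centrality of a node is the sum of 1/d over every other node that can reach it, where d is the shortest-path distance in links. The toy sketch below applies that definition to a made-up link graph. It is not Common Crawl's pipeline, which works on host-level and domain-level graphs at a very different scale, and every site name in it is hypothetical.

```python
from collections import deque


def harmonic_centrality(links: dict[str, list[str]], target: str) -> float:
    """Sum of 1/distance over every other node that can reach `target`."""
    # Reverse the edges so a breadth-first search from the target walks
    # links backwards, toward the pages that point at it.
    incoming: dict[str, list[str]] = {}
    for source, targets in links.items():
        for linked in targets:
            incoming.setdefault(linked, []).append(source)

    distance = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for source in incoming.get(node, []):
            if source not in distance:
                distance[source] = distance[node] + 1
                queue.append(source)
    return sum(1.0 / d for node, d in distance.items() if node != target)


# Hypothetical graph: "x" has one link from a well-connected hub,
# "y" has two links from isolated pages that nothing else links to.
toy_web = {
    "n1": ["hub"], "n2": ["hub"], "n3": ["hub"], "n4": ["hub"],
    "hub": ["x"],
    "i1": ["y"], "i2": ["y"],
}
```

    Here harmonic_centrality(toy_web, "x") is 3.0 and harmonic_centrality(toy_web, "y") is 2.0, so the site with a single link from a connected hub scores higher than the site with two isolated links.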

    Why Harmonic Centrality Matters

    If a domain is more central in the web graph, it may be crawled more often and more deeply.

    If a domain is less central, it may still be accessible, but crawled less frequently or less completely.

    For AI visibility, that matters because being technically crawlable is not the same as being prioritized.

    How to Check It

    The guide references this community tool:

    https://webgraph.metehan.ai/

    This is not an official Common Crawl product. It is a community tool built on Common Crawl Web Graph data. That means it can be useful for quick checks, but SEOs should avoid treating it as a perfect or final score.

    How SEOs Should Use This

    Do not turn Harmonic Centrality into another fake “domain authority” metric.

    Use it as a strategic signal.

    If a site has weak Common Crawl coverage and weak centrality, then link building, digital PR, local mentions, and industry citations may help with more than Google rankings. They may also improve the site’s ability to be discovered and crawled by systems that rely on the web graph.

    For local businesses, that could mean earning mentions from:

    • Local news websites
    • Chambers of commerce
    • Industry associations
    • Local universities
    • Relevant local directories
    • Community organizations
    • Event pages
    • Vendor and partner websites

    For ecommerce brands, that could mean earning mentions from:

    • Buying guides
    • Product reviews
    • Industry publications
    • Manufacturer pages
    • Comparison articles
    • Affiliate publishers
    • Category resource pages

    For B2B companies, that could mean earning mentions from:

    • Partner pages
    • Integration marketplaces
    • Software review sites
    • Conference pages
    • Podcast pages
    • Research citations
    • Industry reports

    This is where AI visibility overlaps with brand building. The more your brand is connected to trusted and central parts of the web, the easier it may be for crawlers and AI systems to discover, understand, and associate your business with its topics.

    4. Structured Data Completeness

    The fourth check is structured data.

    Structured data does not force an AI system to cite you. It does not guarantee rankings. It does not automatically make your business show up in ChatGPT, Gemini, Claude, or Perplexity.

    But it can help machines understand the entities on your website.

    That matters because AI visibility is not just about keywords. It is also about entity understanding.

    You want machines to understand:

    • Who the business is
    • What the business does
    • Where the business operates
    • Who wrote the content
    • What products or services are described
    • Which pages are related
    • Which reviews belong to which business
    • Which organization owns the website
    • Which social profiles and external references confirm the entity

    Structured Data Types to Review

    Depending on the website, useful schema types may include:

    • Organization
    • LocalBusiness
    • Article
    • BlogPosting
    • Product
    • Service
    • FAQPage
    • BreadcrumbList
    • Person
    • Review
    • AggregateRating
    • WebSite
    • WebPage

    For local SEO, I would pay special attention to LocalBusiness schema, service area information, sameAs links, reviews, and NAP consistency.

    For author-driven websites, I would review Person schema, author pages, article schema, and sameAs links.

    For ecommerce websites, I would review Product schema, offers, availability, reviews, brand, GTIN or MPN fields where applicable, and breadcrumbs.

    How to Test Structured Data

    Use Google’s Rich Results Test:

    https://search.google.com/test/rich-results

    You can also use Schema.org’s validator:

    https://validator.schema.org/

    The goal is not to add every schema type possible. The goal is to make the important entities clear, accurate, and consistent.
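
    Alongside the validators, a quick inventory of which schema types a page actually declares can be useful. The following is a minimal sketch using only the Python standard library: it collects the @type values from JSON-LD script blocks in a page's HTML. It ignores Microdata and RDFa, does not validate anything, and the names jsonld_types and _JsonLdCollector are illustrative.

```python
import json
from html.parser import HTMLParser


class _JsonLdCollector(HTMLParser):
    """Collect the text of <script type="application/ld+json"> blocks."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self._buffer: list[str] = []
        self.blocks: list[str] = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self.blocks.append("".join(self._buffer))
            self._inside = False


def jsonld_types(page_html: str) -> list[str]:
    """Return the sorted, de-duplicated @type values found in JSON-LD blocks."""
    collector = _JsonLdCollector()
    collector.feed(page_html)
    collector.close()
    found: list[str] = []

    def walk(node):
        if isinstance(node, dict):
            declared = node.get("@type")
            if isinstance(declared, str):
                found.append(declared)
            elif isinstance(declared, list):
                found.extend(t for t in declared if isinstance(t, str))
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    for block in collector.blocks:
        try:
            walk(json.loads(block))
        except json.JSONDecodeError:
            continue  # malformed JSON-LD is itself worth flagging in an audit
    return sorted(set(found))
```

    Running this against the raw HTML of a page, rather than the rendered DOM, also shows whether the markup is present before JavaScript runs.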

    Common Structured Data Problems

    • Missing Organization schema
    • Wrong business type
    • Inconsistent name, address, or phone number
    • Missing author markup on articles
    • Product schema without offers or availability
    • Incorrect review schema
    • FAQ schema that does not match visible content
    • Breadcrumb schema that conflicts with the site structure
    • Old schema left behind after a redesign
    • Multiple plugins outputting conflicting schema

    This is usually one of the easier parts of the audit to fix. If the site is already crawlable, structured data can help make the content more machine-readable.

    5. Server-Side Rendering Audit

    The fifth check is whether important content exists in the raw HTML.

    This matters because many crawlers do not behave like full modern browsers. Some may not execute JavaScript. Some may capture the initial HTML and move on. Some may not wait for client-side content to load.

    If the content only appears after JavaScript runs, some crawlers may see a mostly empty page.

    How to Check Raw HTML Visibility

    Pick an important page and search the raw HTML for a unique phrase from the visible content.

    curl -s https://example.com/key-page | grep -i "unique headline text"

    If the phrase appears, the content is present in the raw HTML.

    If the phrase does not appear, the content may be injected by JavaScript after the initial page load.
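
    When testing several phrases or several pages, the grep approach can be wrapped in a small helper. This sketch, with a hypothetical function name, strips tags crudely, decodes HTML entities, and ignores case and extra whitespace, so a headline split across inline tags or containing "&amp;" still matches. Like grep, it cannot tell body copy from text inside a script block, so treat a match as a hint and not as proof.

```python
import html
import re


def missing_phrases(raw_html: str, phrases: list[str]) -> list[str]:
    """Return the phrases that do not appear in the unrendered HTML."""
    text = re.sub(r"<[^>]+>", " ", raw_html)  # crude tag removal, not a parser
    text = html.unescape(text)
    text = re.sub(r"\s+", " ", text).casefold()
    return [
        phrase
        for phrase in phrases
        if re.sub(r"\s+", " ", phrase).casefold() not in text
    ]
```

    Feed it the output of curl -s for each key template. Any phrase it returns is a candidate for content that only appears after JavaScript runs.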

    Pages to Test

    Do not only test the homepage.

    Test important templates and page types, including:

    • Top service pages
    • Top location pages
    • Top blog posts
    • Product pages
    • Category pages
    • Author pages
    • FAQ pages
    • Comparison pages
    • Review pages

    This is especially important for websites built with heavy JavaScript frameworks, headless CMS setups, client-side rendered apps, faceted ecommerce systems, and some modern page builders.

    What to Fix

    If important content is missing from raw HTML, consider:

    • Server-side rendering
    • Static generation
    • Hybrid rendering
    • Prerendering key pages
    • Reducing reliance on client-side injected content
    • Making core copy, internal links, schema, and metadata available in the initial HTML

    This is not only an AI visibility issue. It is also a technical SEO issue.

    Verify Real CCBot Traffic Before Making Decisions

    One of the most useful warnings in Stephen Burns’ guide is that user-agent strings are not proof.

    Anyone can claim to be CCBot. A scraper, spam bot, or bad actor can send requests using the CCBot user-agent string.

    That means you should not blame Common Crawl for crawler traffic unless you verify that the requests are actually coming from Common Crawl.

    The correct process is a forward-confirmed reverse DNS check.

    Common Crawl says real CCBot requests should resolve to a .crawl.commoncrawl.org hostname and then resolve back to the same IP address.

    Example:

    host 18.97.14.84

    Then verify the hostname resolves back:

    host 18-97-14-84.crawl.commoncrawl.org
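
    For log analysis at volume, the two lookups can be combined into one check. The following is a sketch of a forward-confirmed reverse DNS test using Python's socket module. The function name is hypothetical, and the lookup functions are injectable only so the logic can be exercised without network access. IPv6 addresses may need normalizing before comparison, which this sketch does not do.

```python
import socket


def is_verified_ccbot(ip: str, reverse_lookup=None, forward_lookup=None) -> bool:
    """Forward-confirmed reverse DNS check for a claimed CCBot request."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: [
            info[4][0] for info in socket.getaddrinfo(host, None)
        ]
    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record or lookup failure
    # Step 1: the reverse lookup must land under Common Crawl's crawl hostname.
    if not hostname.endswith(".crawl.commoncrawl.org"):
        return False
    # Step 2: that hostname must resolve back to the original IP address.
    try:
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

    Both steps matter: a reverse record alone can be set by whoever controls the IP block, so the forward lookup is what confirms the hostname really belongs to Common Crawl.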

    Common Crawl also publishes CCBot IP ranges here:

    https://index.commoncrawl.org/ccbot.json
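
    Once the published ranges are extracted from that file, checking an address against them is straightforward with Python's ipaddress module. This sketch deliberately takes a plain list of CIDR strings and makes no assumption about the JSON layout of ccbot.json. The function name is hypothetical, and the ranges in any example should come from the live file, not from this article.

```python
import ipaddress


def ip_in_ranges(ip: str, cidr_ranges: list[str]) -> bool:
    """Return True if `ip` falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    for cidr in cidr_ranges:
        # strict=False tolerates ranges written with host bits set.
        network = ipaddress.ip_network(cidr, strict=False)
        # Membership is simply False when IPv4 and IPv6 are mixed.
        if address in network:
            return True
    return False
```

    A range check is a quick first filter for log analysis. It works well alongside the DNS verification described above.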

    This matters because some site owners may block CCBot after seeing what they think is bad bot behavior, when the traffic may actually be from an impostor using the CCBot name.

    The Business Owner Version of the Audit

    If you own a business and do not want to run command-line tests yourself, ask your SEO, developer, or web host these questions:

    1. Does our robots.txt file block CCBot, GPTBot, ClaudeBot, Google-Extended, or other AI-related crawlers?
    2. Does our CDN or firewall block these bots even if robots.txt allows them?
    3. Are we present in the Common Crawl Index?
    4. Are our most important pages present in Common Crawl?
    5. Are our important pages visible in the raw HTML?
    6. Do we have valid structured data on our important pages?
    7. Are we earning links and mentions from trusted, relevant, and well-connected websites?

    If the answer to several of these questions is “I don’t know,” an AI visibility audit is probably worth doing.

    The SEO Consultant Version of the Audit

    If you are an SEO consultant or agency, this can become a practical technical deliverable.

    It does not need to be a massive report.

    A useful AI visibility audit could include:

    • Executive summary
    • CCBot robots.txt access result
    • CCBot server response result
    • AI crawler robots.txt review
    • CDN/WAF access notes
    • Common Crawl Index presence
    • Important URL coverage sample
    • Most recent Common Crawl capture date
    • Harmonic Centrality / Web Graph notes
    • Structured data issues
    • Raw HTML rendering issues
    • Priority fixes
    • Recommended next steps

    Example AI Visibility Audit Scorecard

    Audit Area | Status | What It Means | Recommended Fix
    CCBot robots.txt access | Pass / Fail | Shows whether CCBot is allowed by robots.txt. | Remove accidental disallow rules if AI visibility is desired.
    CCBot server response | Pass / Fail | Shows whether the CDN, WAF, or server allows CCBot requests. | Adjust CDN, firewall, or bot-management settings.
    Common Crawl Index coverage | Present / Missing / Thin | Shows whether the domain is actually captured. | Improve crawlability, links, rendering, and access.
    Harmonic Centrality | Strong / Moderate / Weak | Suggests whether the domain may be prioritized or deprioritized for crawling. | Earn links and mentions from better-connected websites.
    Structured data | Complete / Partial / Missing | Shows whether important entities are machine-readable. | Add or fix Organization, Article, Product, LocalBusiness, Breadcrumb, and author markup.
    Server-side rendering | Pass / Fail | Shows whether important content exists in raw HTML. | Use SSR, static generation, hybrid rendering, or prerendering for key pages.

    What If Your Website Blocks AI Crawlers?

    The answer depends on your goals.

    Some publishers, creators, and businesses may intentionally want to block AI training crawlers. That can be a valid strategy, especially for organizations with premium content, licensing concerns, legal concerns, or strong objections to AI training.

    But many businesses probably do not want to be invisible to AI systems by accident.

    For example, most local businesses, ecommerce stores, SaaS companies, B2B companies, professional service firms, and consultants likely want to be discoverable in AI-generated recommendations and research workflows.

    For those businesses, accidentally blocking CCBot and other AI-related crawlers may be a problem.

    The key is not that every business should allow every bot. The key is that the decision should be intentional.

    Do not let a default CDN setting decide your AI visibility strategy for you.

    Should Every Business Allow Every AI Bot?

    No.

    This is where the conversation needs nuance.

    There are different kinds of bots:

    • Training data crawlers
    • Search crawlers
    • Retrieval crawlers
    • AI assistant crawlers
    • SEO tool crawlers
    • Scrapers pretending to be legitimate bots
    • Spam bots

    A business may want to allow some and block others.

    A company may want its marketing pages discoverable but not private app routes. A publisher may want search visibility but not training use. An ecommerce store may want product pages accessible but still block aggressive scrapers.

    The audit does not force one answer. It helps you understand what is actually happening so you can make a deliberate decision.

    How This Applies to Local SEO

    Local SEO is one of the areas where this matters most.

    People are increasingly asking AI systems questions like:

    • Who is the best plumber near me?
    • What are the top family lawyers in Dallas?
    • Which med spas offer semaglutide in Plano?
    • What are the best restaurants for a birthday dinner in Fort Worth?
    • Which roofing companies have good reviews in McKinney?

    AI systems may answer these questions using a mix of trained knowledge, live search, business profiles, reviews, local directories, citations, and website content.

    An AI visibility audit will not replace local SEO basics. Businesses still need Google Business Profile optimization, reviews, local citations, service pages, local links, and accurate NAP information.

    But it can help make sure the business’s own website is not excluded from the AI discovery layer.

    For local businesses, I would combine Stephen Burns’ audit framework with:

    • Google Business Profile optimization
    • Bing Places optimization
    • Apple Business Connect
    • LocalBusiness schema
    • Review acquisition
    • Local citations
    • Local news mentions
    • City and service pages
    • Authoritative local links
    • Clear NAP consistency

    How This Applies to Ecommerce SEO

    For ecommerce websites, an AI visibility audit can reveal whether product and category content is accessible to AI systems.

    This matters because users are asking AI tools for product recommendations, comparisons, gift ideas, buying guides, and alternatives.

    For ecommerce, I would pay attention to:

    • Product page coverage in Common Crawl
    • Category page coverage
    • Buying guide coverage
    • Brand page coverage
    • Product schema
    • Review schema
    • Server-side rendering
    • Faceted navigation problems
    • Canonical tags
    • Thin product descriptions
    • Duplicate manufacturer copy
    • Blocked scripts or resources

    If your product content is not accessible, rendered, structured, or included in important crawl datasets, competitors may have an advantage in future AI shopping recommendations.

    How This Applies to Content SEO

    For content-heavy sites, the AI visibility audit is a reminder that publishing is not enough.

    Your best content needs to be:

    • Discoverable
    • Crawlable
    • Accessible
    • Rendered in raw HTML
    • Structured
    • Internally linked
    • Externally referenced
    • Associated with clear authors and entities
    • Updated when necessary

    This is especially important for evergreen guides, research reports, glossary pages, statistics pages, and thought leadership content.

    If your best content is blocked, hidden behind JavaScript, disconnected from the web graph, or missing structured data, it may not perform as well in AI discovery systems as it should.

    A Practical AI Visibility Audit Checklist

    Access

    • Review robots.txt for CCBot rules.
    • Review robots.txt for GPTBot, ClaudeBot, Google-Extended, PerplexityBot, and other AI-related bots.
    • Check whether a CDN or security provider is managing robots.txt.
    • Test the homepage with a CCBot user agent.
    • Test important internal pages with a CCBot user agent.
    • Compare CCBot responses against normal browser responses.
    • Check CDN/WAF settings for AI bot blocking.

    Common Crawl Coverage

    • Query the Common Crawl Index for the root domain.
    • Check the most recent crawl date.
    • Check whether important sections are present.
    • Check whether important URLs are missing.
    • Compare coverage across multiple crawl snapshots if needed.
    • Look for crawl waste, parameter URLs, and low-value URLs.

    Centrality and Discovery

    • Check Web Graph / Harmonic Centrality signals.
    • Review external links and brand mentions.
    • Identify whether the site is connected to trusted industry or local sources.
    • Look for isolated pages with weak internal links.
    • Build links and mentions around relevance and web connectivity, not just raw link volume.

    Structured Data

    • Validate Organization or LocalBusiness schema.
    • Validate Article or BlogPosting schema.
    • Validate Product schema where applicable.
    • Validate BreadcrumbList schema.
    • Check author markup.
    • Check sameAs links.
    • Remove duplicate or conflicting schema from multiple plugins.

    Rendering

    • Fetch raw HTML for important pages.
    • Confirm core body content appears without JavaScript.
    • Confirm internal links appear without JavaScript.
    • Confirm schema appears in the source HTML.
    • Confirm titles, meta descriptions, canonicals, and headings are present.
    • Flag pages where crawlers may see an empty app shell.

    Reporting

    • Create a one-page scorecard.
    • Separate urgent fixes from strategic improvements.
    • Explain what each issue means in plain English.
    • Document whether blocking appears intentional or accidental.
    • Include screenshots or command outputs when helpful.
    • Recommend next steps based on the business’s AI visibility goals.

    Important Caveat: Common Crawl Inclusion Does Not Guarantee AI Visibility

    Being included in Common Crawl does not guarantee that ChatGPT, Gemini, Claude, Perplexity, AI Overviews, or any other AI system will mention your brand.

    There are many steps between being crawled and being recommended.

    Your content may be filtered. It may not be used by a specific model. It may be outweighed by stronger sources. It may be retrieved but not cited. It may be understood but not selected. It may be outdated by the time a user asks a question.

    So the right claim is not:

    “Get into Common Crawl and you will rank in AI.”

    The better claim is:

    “If your site is blocked from important crawl and retrieval systems, you may be removing yourself from AI visibility opportunities before the competition even starts.”

    That is the value of the audit.

    Important Caveat: Some Sites Should Intentionally Opt Out

    The guide also gives space to the other side of the issue.

    Some publishers, creators, and businesses may not want their content included in AI training datasets. That is a legitimate position.

    The same audit process can help those sites too.

    Instead of asking, “Are we accidentally blocked?” they can ask, “Are we actually opted out in the way we intended?”

    That distinction matters because blocking rules can be inconsistent across bots. A rule that blocks one crawler may not block another. A CDN setting may say one thing while robots.txt says another. A business may believe it opted out when the technical setup does not fully match that intent.
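
    One way to test intent against reality is to write the intended policy down and compare it with what robots.txt actually permits. The sketch below does that with Python's urllib.robotparser. The function name and the shape of the expectations mapping are hypothetical. Python's parser applies the first matching rule in a group, while RFC 9309 specifies longest-match, so results for files with overlapping Allow and Disallow rules should be double-checked. Rules enforced at the CDN or firewall are outside its view.

```python
from urllib.robotparser import RobotFileParser


def policy_mismatches(
    robots_txt: str,
    expectations: dict[tuple[str, str], bool],
    origin: str = "https://example.com",
) -> list[tuple[str, str, bool]]:
    """Compare intended access per (agent, path) with what robots.txt allows.

    Returns (agent, path, actually_allowed) for every expectation that fails.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    mismatches = []
    for (agent, path), should_allow in expectations.items():
        allowed = parser.can_fetch(agent, origin + path)
        if allowed != should_allow:
            mismatches.append((agent, path, allowed))
    return mismatches


# Hypothetical site that believes it has opted out of two training crawlers.
sample_robots = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /app/
Allow: /

User-agent: *
Disallow: /app/
"""

intended = {
    ("GPTBot", "/blog/post"): False,
    ("ClaudeBot", "/blog/post"): False,  # intended opt-out with no matching rule
    ("CCBot", "/blog/post"): True,
    ("CCBot", "/app/settings"): False,
}
```

    With this sample, policy_mismatches(sample_robots, intended) returns a single entry for ClaudeBot, because the file blocks GPTBot by name but leaves ClaudeBot under the general rules.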

    In other words, the audit is useful whether your strategy is inclusion or exclusion. The point is to know what is actually happening.

    Final Thoughts

    Stephen Burns and Common Crawl deserve credit for putting a practical framework around a part of AI SEO that has been easy to talk about but harder to audit.

    The most useful idea in the guide is that AI visibility starts before the answer engine. It starts at the crawl layer.

    For years, SEOs have audited whether Google can crawl, render, index, and rank a page. Now we also need to ask whether AI-related crawlers and datasets can access, capture, and understand that same content.

    That does not mean every business should blindly open the door to every bot.

    It does mean business owners should make that decision intentionally.

    If you want to appear in AI systems, being blocked from the crawl is a problem.

    If you want to stay out of AI systems, you need to verify that your blocking strategy actually works.

    Either way, this is now part of the SEO conversation.

    FAQs

    What is an AI visibility audit?

    An AI visibility audit checks whether a website is accessible, understandable, and discoverable by AI-related crawlers, datasets, and retrieval systems. It usually includes crawler access, Common Crawl coverage, structured data, rendering, and web graph visibility checks.

    Who created the AI Visibility Audit field guide?

    The field guide, “The AI Visibility Audit,” was written by Stephen Burns, Web Intelligence Lead at the Common Crawl Foundation. Common Crawl published the guide as a free resource for SEOs and GEO practitioners.

    What is Common Crawl?

    Common Crawl is a nonprofit organization that crawls the open web and publishes large web crawl datasets for public use. These datasets are used in research, machine learning, and other applications.

    What is CCBot?

    CCBot is Common Crawl’s web crawler. It visits publicly accessible web pages and follows robots.txt rules. If your website blocks CCBot, your content may be excluded from Common Crawl’s datasets.

    Does blocking CCBot hurt Google rankings?

    Blocking CCBot should not directly hurt Google rankings because CCBot is not Googlebot. However, if your goal is AI visibility, blocking CCBot will likely keep your pages out of future Common Crawl datasets, which AI systems may use.

    Does allowing CCBot guarantee that ChatGPT will mention my business?

    No. Allowing CCBot does not guarantee visibility in ChatGPT or any other AI system. It only helps make sure your site is not excluded at the crawl layer. AI systems still filter, retrieve, rank, summarize, and select information in different ways.

    Can my site rank in Google but be invisible to AI systems?

    Yes. A site can rank in Google but still be blocked from AI-related crawlers, missing from Common Crawl, hidden behind JavaScript, or inaccessible to live retrieval systems.
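
    The "hidden behind JavaScript" case is easy to spot-check. A crawler that does not execute JavaScript sees only the raw HTML response, so a rough test is whether a key phrase appears in that HTML outside of script and style blocks. The sketch below uses Python's standard `html.parser`. The function name `phrase_in_raw_html` and both sample pages are hypothetical, and a real audit would run this against the HTML your server actually returns.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text that sits outside <script> and <style> elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def phrase_in_raw_html(html, phrase):
    """True if phrase appears in the HTML text without running any JavaScript."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return phrase.lower() in " ".join(parser.chunks).lower()


# Two made-up pages with the same content, delivered in different ways.
SERVER_RENDERED = "<html><body><h1>Blue Widget, size M</h1></body></html>"
CLIENT_RENDERED = (
    '<html><body><div id="root"></div>'
    '<script>render("Blue Widget, size M")</script></body></html>'
)

print(phrase_in_raw_html(SERVER_RENDERED, "blue widget"))
print(phrase_in_raw_html(CLIENT_RENDERED, "blue widget"))
```

    The first page passes because the heading is in the HTML itself. The second fails because the text only exists inside a script that a non-rendering crawler never runs.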

    Should local businesses care about Common Crawl?

    Yes, especially if they care about future AI visibility. Local businesses should still focus on Google Business Profile, Bing Places, reviews, citations, local links, and useful website content, but they should also make sure their site is not accidentally blocking AI-related crawlers.

    Should ecommerce stores run an AI visibility audit?

    Yes. Ecommerce stores should check whether product pages, category pages, buying guides, reviews, and brand pages are accessible to crawlers, present in the raw HTML response rather than injected by JavaScript, marked up with structured data, and included in Common Crawl where appropriate.
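
    For the structured data part of that check, a small script can pull JSON-LD blocks out of a page's raw HTML and report which properties are missing. This is a minimal sketch using only the Python standard library. The `missing_fields` helper and the list of "required" properties are assumptions for illustration, not an official schema.org or search engine requirement list, and the sketch does not handle `@graph` wrappers or Microdata.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects parsed JSON-LD blocks from <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self.in_jsonld = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_jsonld:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.in_jsonld:
            self.in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self.buffer)))
            except json.JSONDecodeError:
                pass  # malformed JSON-LD is skipped here; it is worth flagging in a real audit


def missing_fields(html, schema_type, required):
    """Return required keys absent from the first item of schema_type, or None if no such item."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for block in collector.blocks:
        items = block if isinstance(block, list) else [block]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == schema_type:
                return [field for field in required if field not in item]
    return None


# Made-up product page with incomplete markup.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Blue Widget", "sku": "BW-1"}
</script>
</head><body><h1>Blue Widget</h1></body></html>
"""

# Hypothetical checklist of properties this store wants on every product page.
print(missing_fields(SAMPLE_PAGE, "Product", ["name", "sku", "offers", "brand"]))
```

    Running the same check across a sample of product and category URLs gives a quick picture of where markup is incomplete or absent.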

    What is Harmonic Centrality?

    Harmonic Centrality is a web graph measure that helps describe how close a domain is to the core of the web’s link structure. In the context of Common Crawl, it can help explain why some sites may be crawled more frequently or deeply than others.
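
    To make the idea concrete: the usual definition scores a node as the sum of 1/d over every other node that can reach it, where d is the length of the shortest link path. Nodes that many others can reach in few hops score higher. The sketch below computes that on a tiny made-up link graph with a breadth-first search. It is a toy illustration of the measure, not Common Crawl's implementation, and the domain names are hypothetical.

```python
from collections import deque


def harmonic_centrality(edges):
    """Score each node as the sum of 1/distance from every node that can reach it."""
    incoming = {}
    nodes = set()
    for src, dst in edges:
        nodes.update((src, dst))
        incoming.setdefault(dst, set()).add(src)

    scores = {}
    for target in nodes:
        # Breadth-first search backwards along links, starting from the target.
        dist = {target: 0}
        queue = deque([target])
        total = 0.0
        while queue:
            node = queue.popleft()
            for pred in incoming.get(node, ()):
                if pred not in dist:
                    dist[pred] = dist[node] + 1
                    total += 1.0 / dist[pred]
                    queue.append(pred)
        scores[target] = total
    return scores


# Made-up link graph: (linking domain, linked-to domain).
EDGES = [
    ("a.example", "hub.example"),
    ("b.example", "hub.example"),
    ("c.example", "a.example"),
]

scores = harmonic_centrality(EDGES)
for domain, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{domain:12} {score:.2f}")
```

    Here hub.example scores highest because two domains link to it directly and a third reaches it in two hops, while domains with no inbound links score zero.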

    Is this the same as GEO?

    It overlaps with GEO, or generative engine optimization, but it is more specific. This audit focuses on crawl access, training-data visibility, retrieval access, structured data, and rendering. GEO can also include content strategy, entity optimization, digital PR, citations, and direct testing of AI answers.

    How often should you run an AI visibility audit?

    For most businesses, running the audit quarterly or after major website changes is a good starting point. You should also run it after changing your CDN, firewall, robots.txt, JavaScript framework, CMS, security plugin, or hosting settings.

    What is the biggest mistake businesses make with AI crawler access?

    The biggest mistake is assuming nothing is blocked. Many sites may block AI crawlers through a CDN, WAF, plugin, or managed robots.txt setting without the business owner realizing it.

    Should I block AI crawlers or allow them?

    That depends on your business goals. If you want more AI visibility, accidental blocking is probably bad. If you are a publisher or creator who does not want content used in AI training, blocking may be intentional. The important thing is to make the decision deliberately and verify that the technical setup matches your intent.

    Methodology

    This article is based on a third-party review of The AI Visibility Audit, a field guide written by Stephen Burns, Web Intelligence Lead at the Common Crawl Foundation, along with Common Crawl’s announcement article introducing the guide. The analysis above adds practical SEO consulting context for business owners, local SEOs, ecommerce SEOs, content SEOs, and technical SEOs.

    Common Crawl and Stephen Burns should receive full credit for the creation of the field guide and the five-check audit framework discussed in this article.

    Joe Youngblood

    Joe Youngblood is a top Dallas SEO, Digital Marketer, and Marketing Theorist. When he's not working with clients or writing about marketing he spends time supporting local non-profits and taking his dogs to various parks.
