Key Takeaways
- Crawl priority shapes AI training exposure. Common Crawl uses Harmonic Centrality (HC) to decide what to crawl, so high-HC sites appear more often in LLM training data.
- Link topology matters as much as link volume. A single link from a structurally central site can outperform dozens of links from isolated domains as an authority signal.
- Comparative, editorial content dominates AI citations. 32.5% of AI citations come from comparative listicles, while commercial web pages account for less than 5%.
- Co-citation patterns amplify visibility. Brands appearing alongside trusted publishers across multiple result sets are more likely to be cited in AI answers and summaries.
- Google rank still predicts AI citation probability. Sites in position 1 have a 46–48% chance of appearing in AI search results; that drops to about 20% by position 10.
For years, search engine optimization operated on a simple mental model: get indexed, improve your search ranking, earn the click. But as AI-driven search becomes the default discovery layer (through Google’s AI Overviews, ChatGPT, Perplexity, and others), that model is breaking down.
The Common Crawl Foundation recently published a study on how SEOs are using web graph data for AI ranking signals, written by Web Intelligence Lead Stephen Burns. The research details how Harmonic Centrality (a graph-based authority metric) shapes crawl priority, training data representation, and which brands artificial intelligence models remember when generating AI answers.
Here’s my analysis of what this research means for SEO strategy, digital PR, and any brand building visibility across AI search and traditional Google search.

Fractl’s Analysis
Here’s what the data means for digital PR and SEO teams, and where I see the biggest strategic opportunities for brands investing in structural authority.
Why Harmonic Centrality Changes the Game for Link Building
Most SEOs evaluate links through Domain Authority, Domain Rating, or raw backlink counts. Common Crawl’s web graph introduces a fundamentally different metric (Harmonic Centrality) that reframes how link quality maps to AI visibility:
- Structural proximity over popularity. HC measures how close a domain sits to all other domains through the shortest link paths, not just how many sites point to it.
- Crawl priority follows centrality. Common Crawl’s crawler uses HC to decide what to crawl first, so high-HC sites appear in more monthly archive snapshots.
- Training data representation compounds. 64% of the 47 LLMs analyzed used filtered Common Crawl data, and for GPT-3, over 80% of training tokens came from those datasets.
- One central link can outweigh dozens of peripheral ones. A single backlink from a domain deeply embedded in the web’s core can do more for AI visibility than volume from isolated blogs.
- Topology functions as a trust signal. Where your linking sites sit in the web graph matters for both search engines and AI tools, which is why earning high-authority backlinks from structurally central publishers is the priority.
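To make "structural proximity" concrete, here is a minimal Python sketch of harmonic centrality on a toy link graph. It uses the textbook definition (for each domain, the sum of 1 / distance over every other domain that can reach it through links), which is not necessarily the exact computation Common Crawl runs at web scale. The domain names and the graph itself are invented for illustration.

```python
from collections import deque

def harmonic_centrality(links):
    """links maps each domain to the list of domains it links to.

    Returns {domain: sum of 1/d(other, domain)} over every other domain
    that can reach it by following links.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}

    # Build the reverse graph so a BFS from a target walks its incoming links.
    incoming = {n: [] for n in nodes}
    for src, targets in links.items():
        for dst in targets:
            incoming[dst].append(src)

    scores = {}
    for target in nodes:
        dist = {target: 0}
        queue = deque([target])
        while queue:
            current = queue.popleft()
            for src in incoming[current]:
                if src not in dist:
                    dist[src] = dist[current] + 1
                    queue.append(src)
        # Unreachable domains contribute nothing (1/infinity = 0).
        scores[target] = sum(1 / d for n, d in dist.items() if n != target)
    return scores

# Hypothetical toy graph: brand.example and peripheral.example each have
# exactly one backlink, but brand.example's link comes from a site near the core.
toy_links = {
    "hub.example": ["news.example", "blog-a.example"],
    "news.example": ["hub.example", "brand.example"],
    "blog-a.example": ["hub.example"],
    "blog-b.example": ["blog-c.example"],
    "blog-c.example": ["peripheral.example"],
    "brand.example": [],
    "peripheral.example": [],
}

scores = harmonic_centrality(toy_links)
for domain in ("brand.example", "peripheral.example"):
    print(domain, round(scores[domain], 3))
```

In this toy graph both domains have a single inbound link, yet brand.example scores higher because its one link comes from a site that the rest of the graph can reach in fewer hops.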
The Shift From “Index and Rank” to “Train and Retrieve”

The old search engine optimization playbook focused on ranking factors after indexing. The new reality operates more like a four-stage funnel:
- Crawl inclusion. Is your content being collected by the datasets AI models learn from?
- Training familiarity. Does the model “know” your brand because it encounters it repeatedly in training data?
- Retrieval eligibility. Does your content appear in the result sets that AI search tools pull from in real-time?
- Citation probability. Does your brand actually get named as a source in AI answers and summaries?
This reframes how digital marketers should think about generative engine optimization. GEO isn’t a set of hacks layered on top of traditional SEO. It’s a systems-level visibility strategy that starts with the datasets artificial intelligence models learn from, months before a user types a query.
Brands with fewer traditional search rankings but stronger editorial authority sometimes outperform “SEO-first” competitors in AI-driven results. That’s because algorithm-level familiarity compounds. If your brand doesn’t appear at stage one, stages three and four never happen.
AI visibility isn’t won at ranking time. It’s won months earlier through crawl inclusion, authority proximity, and editorial context.
Content Format Matters More Than Ever for AI Search
One of the most actionable layers in the Common Crawl research comes from independent citation analysis by Brie Moreau of White Light Digital Marketing, who used DataForSEO data to study which content formats AI models actually pull from. The breakdown challenges assumptions about what earns visibility in AI search:
- Comparative listicles account for 32.5% of all AI citations, by far the largest share of any format
- Blogs and opinion pieces sit around 10%
- Commercial and product web pages barely register at less than 5%
AI tools seem to favor high-quality content that explains and synthesizes over content that sells. The formats LLMs prefer to cite (things like comparative studies, expert-interpreted datasets, industry benchmarks) are the same formats that have driven durable media coverage for years. They just happen to also be structurally aligned with how artificial intelligence selects sources for AI search responses, which look very different from traditional Google search snippets.
At Fractl, this tracks with what we’ve seen across thousands of campaigns. The original research assets that earn the most coverage (proprietary surveys, head-to-head analyses, data-driven content marketing strategies built on expert interpretation) are now earning disproportionate AI citations too.
Format is a ranking factor in its own right.
Co-Citation Patterns and Authority Signals in AI-Driven Search
The Common Crawl study’s co-citation analysis has direct implications for how digital PR teams choose where to pitch. AI search tools synthesize across multiple result sets, and brands that repeatedly appear alongside trusted publishers get pulled into the citation pool:
- ChatGPT’s web mode uses Reciprocal Rank Fusion to blend results from multiple sub-queries
- Perplexity and Google’s AI Overviews use similar synthesis approaches
- Brands appearing alongside trusted publishers across multiple result sets have a higher probability of being cited in AI answers
- A placement on a publisher that already appears frequently in AI search results (Forbes, WebMD, or a niche-industry authority) creates a co-citation signal that compounds over time
You don’t just want media coverage; you want to show up with the right neighbors.
Each high-authority brand mention reinforces your position in the network, compounding the content marketing benefits that build trust and authority across human readers, search engines, and AI systems simultaneously. Campaign-level outreach designed around publisher centrality will outperform isolated link placements every time.
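For readers unfamiliar with Reciprocal Rank Fusion, the sketch below shows the standard formula: each source scores the sum of 1 / (k + rank) across every result list it appears in. The sub-query results and domain names are made up, k = 60 is the default commonly used since the original RRF paper, and nothing here claims to mirror how ChatGPT applies it internally.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several ranked lists into one ranking.

    Each item earns 1 / (k + rank) for every list it appears in;
    items are returned sorted by total score, highest first.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc: scores[doc], reverse=True)

# Hypothetical results for three sub-queries generated from one user prompt.
sub_query_results = [
    ["forbes.example", "brand.example", "shop.example"],
    ["niche-authority.example", "brand.example", "forbes.example"],
    ["brand.example", "webmd.example", "niche-authority.example"],
]

fused = reciprocal_rank_fusion(sub_query_results)
print(fused)
```

In this example brand.example ranks first in only one of the three lists, but because it appears in all three it ends up at the top of the fused ranking. That is the mechanical reason repeated presence across result sets can matter more than a single top placement.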
Google Search Ranking Still Predicts AI Citation Probability
Moreau’s correlation data confirms that traditional SEO and AI visibility are deeply connected:
Sites in Google search position 1 have roughly a 46–48% probability of being cited by AI, dropping to about 37% at position 2 and declining to approximately 19–20% by position 10.
This is important context for teams debating whether to invest in search engine optimization or AI SEO as separate disciplines. The answer is both, because the same authority signals that improve organic search ranking (authoritative backlinks, expert-driven high-quality content, strong domain authority) also improve AI citation likelihood.
The relationship between digital PR and SEO has always been strong. The Common Crawl data shows it’s now structurally inseparable from AI search visibility as well. User behavior is shifting toward AI-driven discovery, and the brands that perform well in both systems share the same foundational strategy: earn trust through authoritative, high-quality content.
Structured Data, Schema Markup, and the Technical Layer
While Common Crawl’s research focuses on link topology and crawl priority, the technical layer of your web pages plays a complementary role in AI retrieval.
AI tools rely on well-organized, machine-readable formatting to extract information efficiently.
That means several on-page elements directly affect how easily both search engines and AI systems can parse your content:
- Schema markup (FAQs, how-to, and article schemas) makes structured data accessible to both search engines and AI systems
- Clear headings and logical content hierarchy help AI tools identify and extract relevant answers
- Fast-loading web pages with strong user experience reduce friction for both crawlers and real-time retrieval
- Prices, product details, and other structured content get surfaced more reliably when properly marked up
This doesn’t replace the authority-building strategies the Common Crawl research highlights. But strong SEO strategy in 2026 combines structural authority (high-HC backlinks, co-citation proximity) with technical precision to satisfy both algorithm-level ranking factors and real-time AI retrieval.
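As one illustration of the schema markup point, the sketch below builds an FAQ snippet in JSON-LD using schema.org’s FAQPage, Question, and Answer types. The helper name build_faq_jsonld and the sample question are hypothetical. The resulting JSON would go inside a script tag of type application/ld+json on the page.

```python
import json

def build_faq_jsonld(faqs):
    """Build a schema.org FAQPage structure from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in faqs
        ],
    }

# Hypothetical FAQ content for illustration only.
faq_markup = build_faq_jsonld([
    (
        "What is Harmonic Centrality?",
        "A graph metric that measures how close a domain is to all other domains.",
    ),
])

faq_json = json.dumps(faq_markup, indent=2)
print(faq_json)
```

Generating the markup from the same source as the visible FAQ text is one way to keep the two from drifting apart.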
What This Means for Your Strategy
The Common Crawl research points to five concrete shifts that SEO, digital PR, and content marketing teams should act on:
- Evaluate link sources for structural centrality, not just DA. Prioritize placements on publishers deeply embedded in the web’s core. These function as stronger trust signals for both search engines and AI tools.
- Invest in content formats AI prefers to cite. Comparative studies, data-led listicles, and expert-interpreted research outperform product pages and opinion content in AI answers.
- Build co-citation proximity with trusted publishers. Repeated editorial adjacency with high-authority sources compounds AI visibility and strengthens your authority signals.
- Optimize the technical layer for AI retrieval. Use structured data, schema markup, clear headings, and fast user experience so AI tools can efficiently parse your web pages.
- Treat authority building as a long-term AI investment. Improvements to your web graph position may take months to surface in AI models. The work you do now influences AI search results for years.
How This Connects to Fractl’s Research
Common Crawl’s findings reinforce what Fractl has observed across more than a decade of research-led digital PR campaigns. Our authority-first approach (fewer, higher-quality placements on top-tier publishers) was designed for search engine optimization. It turns out it is also well suited to the AI era.
Our research on AI media partnerships powering ChatGPT, Gemini, and Copilot maps the publisher networks that directly feed AI training and retrieval systems. Combined with the Common Crawl data, it offers a clear picture of which editorial ecosystems matter most for brands investing in long-term AI visibility.
Sharing this research across social media and industry channels is part of how we contribute to the broader conversation. As Harmonic Centrality moves from niche metric to industry standard, the gap between structurally authoritative brands and volume-driven competitors will only widen.

Related Reading
- GEO vs. SEO: How AI Is Redefining Search Optimization Strategies
- AI Media Partnerships Powering ChatGPT, Gemini & Copilot
- How to Leverage Internal and External Data for Content Marketing
- How SEO and Content Marketing Work Together to Create Amazing Results
FAQs
What is Harmonic Centrality, and how does it differ from PageRank?
Harmonic Centrality measures how structurally close a domain is to all other domains in the web graph. PageRank scores a node by the number and importance of the nodes linking to it. HC identifies central hubs; PageRank identifies well-referenced nodes. Common Crawl uses HC to prioritize crawl order.
Does Common Crawl data directly affect my Google search ranking?
Not directly. Google uses its own crawl and algorithm. But Common Crawl data feeds the training pipelines for many LLMs, including models behind ChatGPT and Perplexity. Strong HC improves your representation in AI training data, which shapes AI search visibility.
How can I check my site’s Harmonic Centrality score?
Metehan Yesilyurt built a free CC Rank Checker at webgraph.metehan.ai. It indexes the top 10 million domains across multiple time periods. You can also verify crawl inclusion through Common Crawl’s index server at index.commoncrawl.org.
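If you want to script the crawl-inclusion check, here is a minimal sketch that builds a query URL for Common Crawl’s index server. The helper name build_cc_index_query is hypothetical, and the crawl ID shown is only a placeholder: the list of available crawls is published at index.commoncrawl.org/collinfo.json, so pick a current one from there.

```python
from urllib.parse import urlencode

def build_cc_index_query(domain, crawl_id, limit=5):
    """Return an index-server URL that looks up captures under a domain.

    crawl_id is a Common Crawl collection name such as "CC-MAIN-2024-33"
    (a placeholder here; check collinfo.json for current crawls).
    """
    params = urlencode({
        "url": f"{domain}/*",   # wildcard: any path under the domain
        "output": "json",       # ask for JSON records
        "limit": limit,         # keep the response small
    })
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"

query_url = build_cc_index_query("example.com", "CC-MAIN-2024-33")
print(query_url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should typically return one JSON record per captured page. An empty or error response suggests the domain has no captures in that particular crawl.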
Does user intent matter for AI citations?
Yes. AI tools like ChatGPT interpret user intent to decide which sources to cite. Content that directly answers informational queries (comparisons, data summaries, expert analysis) aligns with how AI systems match sources to user behavior and intent.
Should I invest in AI SEO or traditional search engine optimization?
Both. Moreau’s data shows Google search ranking strongly predicts AI citation probability. The same authority signals, high-quality content, and trust signals that drive organic rankings also drive AI visibility. They’re increasingly inseparable.