Arne van Elk

LLM Training Data Is Not What You Think

To audit whether a site is available as training data for AI search, use the five checks proposed in the Common Crawl field guide:
1. CCBot access
2. Common Crawl Index coverage
3. Harmonic Centrality
4. Structured data completeness
5. Server-side rendering

Then review robots.txt, CDN and WAF rules, and bot-management settings, since any of them can block CCBot. Server log analysis is the best starting point here (see what we do at Oncrawl).
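
As a rough illustration of the robots.txt part of that review, here is a minimal Python sketch that uses the standard library's urllib.robotparser to test whether a robots.txt file lets the CCBot user agent fetch a URL. The function name ccbot_allowed and the example.com URLs are placeholders for illustration, not part of the field guide.

```python
from urllib.robotparser import RobotFileParser


def ccbot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets CCBot fetch `url`.

    `robots_txt` is the raw file content, so the check can run on a
    saved copy as well as on a freshly downloaded one.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)


if __name__ == "__main__":
    # Hypothetical robots.txt that blocks Common Crawl's crawler entirely.
    blocking = "User-agent: CCBot\nDisallow: /\n"
    print(ccbot_allowed(blocking, "https://example.com/blog/post"))
```

A pass here only covers robots.txt. Blocks applied by a CDN, WAF, or bot-management layer do not show up in this check, which is why the server logs remain the better source of truth.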

Check whether the domain is actually present in the Common Crawl Index. I also recommend Metehan Yeşilyurt's Harmonic Centrality rank checker: https://webgraph.metehan.ai/
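
One way to script the index check is to query the Common Crawl CDX index server, which returns one JSON record per line for each capture. The sketch below is an assumption-laden example, not an official client: the helper names are made up for illustration, it only looks at the first crawl listed in collinfo.json (assumed to be the most recent), and it treats an HTTP 404 from the index server as "no captures".

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

# Public list of crawls; each entry carries a "cdx-api" endpoint URL.
COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"


def build_index_query(cdx_api: str, domain: str, limit: int = 5) -> str:
    """Build a CDX query URL for every capture under `domain`."""
    params = {"url": domain, "matchType": "domain",
              "output": "json", "limit": limit}
    return f"{cdx_api}?{urlencode(params)}"


def parse_index_response(body: str) -> list:
    """Parse the newline-delimited JSON body into capture records."""
    records = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Keep capture records only; skip message-style lines without a URL.
        if "url" in record:
            records.append(record)
    return records


def domain_in_latest_crawl(domain: str) -> bool:
    """Network call: is `domain` present in the first listed crawl?"""
    with urlopen(COLLINFO_URL, timeout=30) as resp:
        crawls = json.load(resp)
    cdx_api = crawls[0]["cdx-api"]  # assumption: newest crawl is listed first
    try:
        with urlopen(build_index_query(cdx_api, domain), timeout=60) as resp:
            body = resp.read().decode("utf-8")
    except HTTPError as err:
        if err.code == 404:  # assumption: 404 means no captures were found
            return False
        raise
    return bool(parse_index_response(body))
```

Calling domain_in_latest_crawl("example.com") covers a single crawl. Looping over more entries from collinfo.json would give a better picture of coverage over time.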

Check whether important content is visible in the raw HTML and not only injected after JavaScript execution.
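
A simple way to approximate this check is to extract the text from the raw HTML, ignoring script and style blocks, and see whether key phrases are present. The following Python sketch uses only the standard library. The function names and sample markup are hypothetical, and it does not execute JavaScript, which is the point: it sees roughly what a non-rendering crawler would see.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of script and style."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def raw_html_text(html: str) -> str:
    """Return whitespace-normalised text found in the raw HTML."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(" ".join(extractor.chunks).split())


def missing_phrases(html: str, phrases: list) -> list:
    """Return the phrases that do not appear in the raw HTML text."""
    text = raw_html_text(html).lower()
    return [p for p in phrases if p.lower() not in text]


if __name__ == "__main__":
    # Hypothetical client-side rendered page: the phrase exists only in JS.
    page = '<div id="root"></div><script>render("Pricing plans")</script>'
    print(missing_phrases(page, ["Pricing plans"]))
```

Feed it the response body of a plain HTTP request, not the DOM saved from a browser, otherwise JavaScript-injected content will already be present.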

Shared by Arne van Elk, 1 save total

Google Search’s I/O 2026 updates: AI agents and more

"The new intelligent Search box is starting to roll out today, in all countries and languages where AI Mode is available."

Shared by Arne van Elk, 2 saves total
