
Arne van Elk

LLM Training Data Is Not What You Think

To audit whether your content is available as AI search training data, use the five checks proposed in the Common Crawl field guide:
1. CCBot access
2. Common Crawl Index coverage
3. Harmonic Centrality
4. Structured data completeness
5. Server-side rendering

Then look at robots.txt, CDN and WAF rules, and bot-management settings, since any of them can block CCBot. Server log analysis is your best starting point here (check what we do at Oncrawl).
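As a rough illustration of the CCBot access check, the sketch below tests a robots.txt body against the `CCBot` user-agent token and tallies the status codes CCBot received in access log lines. The function names are hypothetical, and the log regex assumes the common "combined" log format; adjust it to whatever your server or CDN actually writes. A 403 or 429 for CCBot in the logs, despite a permissive robots.txt, usually points at the CDN, WAF or bot-management layer.

```python
import re
from collections import Counter
from urllib.robotparser import RobotFileParser

# Assumption: combined log format, where the status code follows the quoted request.
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def ccbot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt body lets CCBot fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return bool(parser.can_fetch("CCBot", url))


def ccbot_status_counts(log_lines) -> Counter:
    """Count HTTP status codes for log lines whose user agent mentions CCBot."""
    counts = Counter()
    for line in log_lines:
        if "CCBot" not in line:
            continue
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    robots = "User-agent: CCBot\nDisallow: /private/\n\nUser-agent: *\nDisallow:\n"
    print(ccbot_allowed(robots, "https://example.com/blog/post"))
    print(ccbot_allowed(robots, "https://example.com/private/report"))
```

Note that robots.txt only tells you what is permitted; the logs tell you what actually happened at the edge.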

Check whether the domain is actually present in the Common Crawl Index. I also recommend Metehan Yeşilyurt's Harmonic Centrality rank checker: https://webgraph.metehan.ai/
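One way to check index coverage is the Common Crawl CDX index server at index.commoncrawl.org, which returns one JSON object per line for each captured URL. The sketch below only builds the query URL and parses a response body; the helper names are hypothetical, and the crawl ID `CC-MAIN-2024-10` is a placeholder (the list of available crawls is published at https://index.commoncrawl.org/collinfo.json).

```python
import json
from urllib.parse import urlencode

CDX_BASE = "https://index.commoncrawl.org"


def build_index_query(domain: str, crawl_id: str) -> str:
    """Build a CDX query URL asking for all captures under a domain."""
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{CDX_BASE}/{crawl_id}-index?{params}"


def parse_index_response(body: str) -> list:
    """Parse the newline-delimited JSON records returned by the index server."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


if __name__ == "__main__":
    # Placeholder crawl ID; pick a current one from collinfo.json.
    print(build_index_query("example.com", "CC-MAIN-2024-10"))
```

An empty result for one crawl is not conclusive on its own, so it is worth repeating the query across several recent crawl IDs before deciding a domain is missing.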

Check whether important content is visible in the raw HTML and not only injected after JavaScript execution.
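A minimal sketch of that raw-HTML check, assuming you have already saved the server response without executing JavaScript (for example with curl): it extracts text outside `script`, `style` and `template` elements and looks for a phrase you consider important. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style", "template"}


class RawTextExtractor(HTMLParser):
    """Collect text nodes that sit outside script, style and template elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def phrase_in_raw_html(html: str, phrase: str) -> bool:
    """Return True if the phrase appears in the text of the unrendered HTML."""
    extractor = RawTextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(" ".join(extractor.chunks).split()).lower()
    return " ".join(phrase.split()).lower() in text


if __name__ == "__main__":
    client_rendered = '<div id="root"></div><script>render("Pricing starts at 10")</script>'
    server_rendered = "<main><p>Pricing starts at 10</p></main>"
    print(phrase_in_raw_html(client_rendered, "Pricing starts at 10"))
    print(phrase_in_raw_html(server_rendered, "Pricing starts at 10"))
```

If the phrase is only found inside a script payload, a crawler that does not run JavaScript is unlikely to see it as page content.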

Shared by Arne van Elk, 1 save total

Arne van Elk

Google Search’s I/O 2026 updates: AI agents and more

"The new intelligent Search box is starting to roll out today, in all countries and languages where AI Mode is available."

Shared by Arne van Elk, 2 saves total

