Pierre-Carl also made a key point about the ongoing push for counter-regulation, with the New York Times suing Microsoft and OpenAI for infringement of the publisher's copyrights. This pressure will lead AI companies to sign licensing deals with rightsholders, and the first examples of such deals already exist: between Reddit and Google, and between Le Monde and OpenAI. A licensed approach to model training creates a further risk of gatekeeping, as only the largest companies will be able to afford the licensing costs. Launched a bit over a month ago, Common Corpus is an attempt to address these challenges by presenting a new way of contributing to the development of AI as a Commons.
As the largest training data set for language models based on open content to date, Common Corpus is built from open data, including administrative data as well as cultural and open-science resources, such as CC-licensed YouTube videos, 21 million digitized newspapers, and millions of books. With 180 billion words, it is currently the largest open data set in English, but it is also multilingual and includes the largest open data sets to date in French (110 billion words), German (30 billion words), Spanish, Dutch, and Italian. Developing Common Corpus was an international effort involving a spectrum of stakeholders, from the French Ministry of Culture to digital heritage researchers and the open-science LLM community, including organizations such as HuggingFace, Occiglot, Eleuther, and Nomic AI. The collaborative effort behind building the data set reflects a vision of fostering a culture of openness and accessibility in AI research. Releasing Common Corpus is an attempt at democratizing access to large, high-quality data sets that can be used for LLM training. Common Corpus aims to become a key component of a wider pretraining Commons ecosystem, alongside efforts such as the "licensed" Pile currently being prepared by Eleuther.
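For readers who want to see what "democratized access" looks like in practice, the minimal sketch below shows how one might stream a sample of Common Corpus from the Hugging Face Hub using the `datasets` library. The repository identifier ("PleIAs/common_corpus") and the "text" field are assumptions for illustration; the actual dataset name and record schema may differ.

```python
# Minimal sketch: streaming a few records from Common Corpus on the Hugging Face Hub.
# Assumptions (not confirmed by the text above): the dataset is published under an
# identifier like "PleIAs/common_corpus" and each record carries a "text" field.
from datasets import load_dataset

# Stream instead of downloading: the full corpus spans hundreds of billions of words.
corpus = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

for i, record in enumerate(corpus):
    # Print the first 200 characters of each record for a quick look at the data.
    print(record.get("text", "")[:200])
    if i >= 2:  # stop after a handful of examples
        break
```

Because the data is openly licensed, the same streaming access works for anyone, from individual researchers to small labs, without negotiating a licensing deal first.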