This link has been bookmarked by 23 people. It was first bookmarked on 04 Aug 2006, by craig mcmillan.
-
rampion: google <Bs n-grams tooo
-
Nick Gall: Google donates a huge corpus. n-grams come from statistical machine translation. They deal with sequences of words.
-
Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others. While such models have usually been estimated from training corpora containing at most a few billion words, we have been harnessing the vast power of Google's datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there's no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more - resulting in a training corpus of one trillion words from public Web pages.
-
-
Jeff Kubina: Google to make available n-gram data from their 1 trillion-word training corpus.
-
-
We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That's why we decided to share this enormous dataset with everyone. We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times.
Watch for an announcement at the LDC, who will be distributing it soon, and then order your set of 6 DVDs. And let us hear from you - we're excited to hear what you will do with the data, and we're always interested in feedback about this dataset, or other potential datasets that might be useful for the research community.
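For readers unfamiliar with the terms in the quote above, here is a minimal sketch of what "counts for all five-word sequences that appear at least 40 times" could mean in practice. It is not Google's pipeline: the function names `count_ngrams` and `filter_by_min_count` are hypothetical, tokenization is a plain whitespace split, and the toy example uses two-word sequences with a cutoff of 2 instead of five-word sequences with a cutoff of 40.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every contiguous n-word sequence in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def filter_by_min_count(counts, min_count):
    """Keep only the entries seen at least min_count times."""
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Toy corpus (an assumption for illustration; the real corpus is ~1 trillion words).
text = "the cat sat on the mat the cat sat on the rug"
tokens = text.split()

# Two-word sequences, kept only if they occur at least twice.
bigrams = count_ngrams(tokens, 2)
frequent_bigrams = filter_by_min_count(bigrams, 2)

# Vocabulary: single words kept only if they occur at least twice,
# analogous to the 200-occurrence cutoff mentioned in the quote.
vocabulary = filter_by_min_count(count_ngrams(tokens, 1), 2)

for gram, count in sorted(frequent_bigrams.items()):
    print(" ".join(gram), count)
```

At the scale described in the announcement, the same idea would need distributed processing rather than an in-memory `Counter`; the sketch only illustrates the counting and thresholding.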
-
-
Bruno Martins: There's no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more - resulting in a training corpus of one trillion words from public Web pages.