Version 13 (modified by kmaclean, 15 years ago)
ARPA
- CMU Hub4 - language model in ARPA format
- CMU ARPA-format bigram language model - 57,138 unigrams and about 10 million bigrams
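For readers unfamiliar with ARPA format: such a file is plain text that opens with a `\data\` header listing how many n-grams of each order it contains (lines of the form `ngram 1=...`), followed by one section per order (`\1-grams:`, `\2-grams:`, ...) and a closing `\end\`. The sketch below reads only that header, which is a quick way to check the size of a downloaded model. The function name `read_arpa_counts` and the toy sample are made up for illustration and do not come from any of the models listed above.

```python
import io
import re


def read_arpa_counts(stream):
    """Return {order: count} parsed from the header of an ARPA-format LM."""
    counts = {}
    in_header = False
    for raw in stream:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if not in_header:
            continue  # skip any free-form text before the header
        match = re.fullmatch(r"ngram\s+(\d+)\s*=\s*(\d+)", line)
        if match:
            counts[int(match.group(1))] = int(match.group(2))
        elif line.startswith("\\"):
            break  # first section marker (e.g. \1-grams:) ends the header
    return counts


# Toy header, invented for this example (not a real model).
SAMPLE = "\n".join([
    "\\data\\",
    "ngram 1=3",
    "ngram 2=2",
    "",
    "\\1-grams:",
    "-0.5 <s> -0.3",
])

print(read_arpa_counts(io.StringIO(SAMPLE)))
```

For a real file, the same function could be called on an open file handle, for example `read_arpa_counts(open(path, encoding="utf-8"))`, where `path` is wherever the model was saved.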
Possible sources of written data (written corpora) for the creation of Language Models
- U.S. Government Printing Office
- Project Gutenberg
- Wikipedia Spoken Articles
- Hansard Canada
- Google Research word n-gram models and training corpora
- Complete Works of William Shakespeare
- Moby Project
- Google Books
- Web 1T 5-gram Version 1 (for linguistic education and research only)
LM Toolkits
- Palmkit (translated from Japanese)
- The CMU-Cambridge Statistical Language Modeling Toolkit v2
- HTK
- SRI Language Modeling Toolkit (SRILM)