The list covers corpora in the narrow sense, i.e. balanced collections of texts with linguistic annotations, preferably available over the Internet. In Leeds we have developed a range of corpora of 100-200 million words each for Chinese, English, French, German, Italian, Polish, Russian and Spanish.
The Brown Corpus, the first large computer corpus, compiled in 1964 from texts written in the USA in 1961;
it contains 1 million words
http://www.hd.uib.no/icame/brown/bcm.html
You can search it via LDC
The LOB Corpus, the British equivalent of the Brown Corpus
http://www.hit.uib.no/icame/lob/lob-dir.htm
You can search it via VLC concordancer (Select the LOB corpus)
The British National Corpus (BNC), 100 million words
http://sara.natcorp.ox.ac.uk/lookup.html
The International Corpus of English (ICE), representing a range of English varieties, including Australian, Indian, Kenyan, etc.
http://www.ucl.ac.uk/english-usage/ice/index.htm
COBUILD-Direct, a freely available subset of The Bank of English
http://titania.cobuild.collins.co.uk/form.html
The Penn Treebank, a syntactically annotated corpus, including texts from the Brown Corpus
http://www.cis.upenn.edu/~treebank/
Russian corpora from Leeds
http://corpus.leeds.ac.uk/ruscorpora.html
Russian corpora from the University of Tübingen, including an online interface to the Uppsala corpus (which follows the Brown Corpus model)
http://www.sfb441.uni-tuebingen.de/b1/en/korpora.html
Russian newspaper corpus from the University of Moscow
http://www.philol.msu.ru/~lex/corpus/
"The Computer Fund of Russian Language"
http://www.artint.ru/cfrl/
The Russian National Corpus (ongoing project)
http://www.ruscorpora.ru/
The Russian frequency list, based on my 40-million-word corpus
Scripta Sinica, a representative Chinese corpus
http://www.sinica.edu.tw/ftms-bin/ftmsw3
Croatian National Corpus
http://www.hnk.ffzg.hr/
Czech National Corpus
http://ucnk.ff.cuni.cz/
The Prague Dependency Treebank, a Czech corpus with syntactic annotation
http://ufal.mff.cuni.cz/pdt/pdt.html
Dutch corpora
http://www.inl.nl/corp/corplex.htm
Corpus of Estonian written texts
http://psych.ut.ee/gling/en/corpusb/
Language Bank of Finland
http://www.csc.fi/hakemukset/wwwlomake/index.phtml?lomake=Kielipankki
German corpora collected at the Institut für Deutsche Sprache
http://corpora.ids-mannheim.de/~cosmas/
German DWDS corpus
http://www.dwds.de/
Greek National Corpus
http://hnc.ilsp.gr/find.asp
Hebrew newspaper corpus
http://cl.haifa.ac.il/~shlomo/corpora/
Hungarian National Corpus
http://corpus.nytud.hu/mnsz/index_eng.html
Italian written corpus
http://www.cilta.unibo.it/Portale/RicercaLinguistica/coris_eng.html
Portuguese corpora
http://acdc.linguateca.pt/
Spanish corpora
http://spraakbanken.gu.se/lb/konk/rom2/
http://www.corpusdelespanol.org/
http://www.lllf.uam.es/servicios/servicios.html
http://www.rae.es (click on 'consulta banco de datos')
Swedish corpus
http://spraakbanken.gu.se/lb/konk/
Michael Barlow's Parallel Corpora Page
http://www.ruf.rice.edu/~barlow/para.html
My Perl tools for working with parallel corpora
http://purl.org/net/concordance
The English-Norwegian parallel corpus
http://www.hf.uio.no/iba/prosjekt/
The open-source parallel corpus for OpenOffice
http://logos.uio.no/opus/
The Canadian Hansard: the debates in the Canadian House of Commons
http://www.isi.edu/natural-language/download/hansard/
English-French-Spanish CRATER corpus
http://www.comp.lancs.ac.uk/linguistics/crater/corpus.html
English-German-Russian parallel corpus:
http://corpus.leeds.ac.uk/paraquery.html
English-Japanese corpus of business letters:
http://ysomeya.hp.infoseek.co.jp/
English-Portuguese corpora:
http://www.linguateca.pt/COMPARA/
European Parliament Proceedings (European languages):
http://www.isi.edu/~koehn/publications/europarl/
Text Encoding Initiative (TEI)
Expert Advisory Group on Language Engineering Standards (EAGLES)
Ch. 5, Collocations from Christopher D. Manning and Hinrich Schütze, 1999. Foundations of Statistical Natural Language Processing.
Statistical NLP links from Chris Manning
The log-likelihood calculator, including a reference to Ted Dunning's seminal paper
The Centre for Linguistic Documentation
UCREL: Lancaster University Centre For Computer Corpus Research On Language
TRACTOR (TELRI Research Archive of Computational Tools and Resources)
Michael Barlow's Corpus Linguistics Page
Kenji Kita's list of corpora and texts
The list of corpora from the University of Essex
David Lee's Bookmarks for Corpus-Based Linguistics
Last modified on 31/05/06
by Serge Sharoff, s.sharoff
leeds.ac.uk