Useful links to corpora

I have found the following links on corpus linguistics and natural language processing useful for my own research. There are already many collections of links to corpora and computational tools (my favourite is David Lee's website), so the problem lies in selecting a golden set of links, though beauty is always in the eye of the beholder. The list below reflects my personal preferences, namely:

Representative monolingual corpora

The list covers corpora in the narrow sense, i.e. balanced collections of texts with linguistic annotations, preferably available from the Internet. In Leeds we developed a range of 100-200 million word corpora for Chinese, English, French, German, Italian, Polish, Russian and Spanish.

English

The Brown Corpus, the first large computer corpus, collected in 1964 from texts written in the USA in 1961; it contains 1 mln words
http://www.hd.uib.no/icame/brown/bcm.html
You can search it via LDC

The LOB Corpus, the British equivalent of the Brown Corpus
http://www.hit.uib.no/icame/lob/lob-dir.htm
You can search it via the VLC concordancer (select the LOB corpus)

The British National Corpus (BNC), 100 mln words
http://sara.natcorp.ox.ac.uk/lookup.html

The International Corpus of English (ICE), which represents a range of varieties of English, including Australian, Indian, Kenyan, etc.
http://www.ucl.ac.uk/english-usage/ice/index.htm

COBUILD-Direct, a freely available subset of The Bank of English
http://titania.cobuild.collins.co.uk/form.html

The Penn Treebank, a syntactically annotated corpus, including texts from the Brown Corpus
http://www.cis.upenn.edu/~treebank/

Russian

Russian corpora from Leeds
http://corpus.leeds.ac.uk/ruscorpora.html

Russian corpora from the University of Tübingen, including online interface to the Uppsala corpus (it follows the Brown Corpus model)
http://www.sfb441.uni-tuebingen.de/b1/en/korpora.html

Russian newspaper corpus from the University of Moscow
http://www.philol.msu.ru/~lex/corpus/

"The Computer Fund of Russian Language"
http://www.artint.ru/cfrl/

The Russian National Corpus (ongoing project)
http://www.ruscorpora.ru/

The Russian frequency list, based on my 40-million-word corpus

Other languages

Scripta Sinica, a representative Chinese corpus
http://www.sinica.edu.tw/ftms-bin/ftmsw3

Croatian National Corpus
http://www.hnk.ffzg.hr/

Czech National Corpus
http://ucnk.ff.cuni.cz/

The Prague Dependency Treebank, a Czech corpus with syntactic annotation
http://ufal.mff.cuni.cz/pdt/pdt.html

Dutch corpora
http://www.inl.nl/corp/corplex.htm

Corpus of Estonian written texts
http://psych.ut.ee/gling/en/corpusb/

Language Bank of Finland
http://www.csc.fi/hakemukset/wwwlomake/index.phtml?lomake=Kielipankki

German corpora collected at Institut für Deutsche Sprache
http://corpora.ids-mannheim.de/~cosmas/

German DWDS corpus
http://www.dwds.de/

Greek National Corpus
http://hnc.ilsp.gr/find.asp

Hebrew newspaper corpus
http://cl.haifa.ac.il/~shlomo/corpora/

Hungarian National Corpus
http://corpus.nytud.hu/mnsz/index_eng.html

Italian written corpus
http://www.cilta.unibo.it/Portale/RicercaLinguistica/coris_eng.html

Portuguese corpora
http://acdc.linguateca.pt/

Spanish corpora
http://spraakbanken.gu.se/lb/konk/rom2/
http://www.corpusdelespanol.org/
http://www.lllf.uam.es/servicios/servicios.html
http://www.rae.es (click on 'consulta banco de datos')

Swedish corpus
http://spraakbanken.gu.se/lb/konk/

Multilingual corpora and tools

Michael Barlow's Parallel Corpora Page
http://www.ruf.rice.edu/~barlow/para.html

My Perl tools for working with parallel corpora
http://purl.org/net/concordance

The English-Norwegian parallel corpus
http://www.hf.uio.no/iba/prosjekt/

The open-source parallel corpus for Open Office
http://logos.uio.no/opus/

The Canadian Hansard: the debates in the Canadian House of Commons
http://www.isi.edu/natural-language/download/hansard/

English-French-Spanish CRATER corpus
http://www.comp.lancs.ac.uk/linguistics/crater/corpus.html

English-German-Russian parallel corpus
http://corpus.leeds.ac.uk/paraquery.html

English-Japanese corpus of business letters
http://ysomeya.hp.infoseek.co.jp/

English-Portuguese corpora
http://www.linguateca.pt/COMPARA/

European Parliament Proceedings (European languages)
http://www.isi.edu/~koehn/publications/europarl/

Links to Corpus Standards

Text Encoding Initiative (TEI)

Corpus Encoding Standard

Expert Advisory Group on Language Engineering Standards (EAGLES)

Statistical tools and methods

Ch. 5, Collocations from Christopher D. Manning and Hinrich Schütze, 1999. Foundations of Statistical Natural Language Processing.

Statistical NLP links from Chris Manning

The log-likelihood calculator, including a reference to Ted Dunning's seminal paper
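
For readers who want to see what such a calculator computes, here is a minimal Python sketch of the log-likelihood ratio (G2) statistic for comparing the frequency of a word in two corpora. It uses the standard four-cell contingency table form; the linked calculator may use a different (for example, two-cell) variant, so the values need not match it exactly. The function name log_likelihood and its parameter names are my own and do not come from any of the tools listed here.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood ratio (G2) for a word occurring freq_a times in a
    corpus of size_a tokens and freq_b times in a corpus of size_b tokens.

    Illustrative sketch only: the names are hypothetical, and the formula
    is the general 2x2 contingency-table G2, not any specific tool's code.
    """
    # Observed 2x2 table: rows = (word, all other words), columns = corpora.
    observed = [
        [freq_a, freq_b],
        [size_a - freq_a, size_b - freq_b],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [size_a, size_b]
    grand_total = size_a + size_b

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                # Expected count if the word were equally likely in both corpora.
                e = row_totals[i] * col_totals[j] / grand_total
                g2 += o * math.log(o / e)
    return 2.0 * g2


if __name__ == "__main__":
    # A word seen 50 times in one 1000-token sample and 10 times in another.
    print(log_likelihood(50, 1000, 10, 1000))
```

The score is zero when the relative frequencies in the two corpora are identical and grows as they diverge; higher values indicate stronger evidence that the word is distributed differently in the two corpora.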

Other useful links

The Survey of the State of the Art in Human Language Technology

The Centre for Linguistic Documentation

UCREL: Lancaster University Centre For Computer Corpus Research On Language

TRACTOR (TELRI Research Archive of Computational Tools and Resources)

Michael Barlow's Corpus Linguistics Page

Kenji Kita's list of corpora and texts

The list of corpora from the University of Essex

David Lee's Bookmarks for Corpus-Based Linguistics


 


Last modified on 31/05/06 by Serge Sharoff, s.sharoff@leeds.ac.uk