School of Computing

FACULTY OF ENGINEERING

World Wide English Corpus

  • Corpora: 200,000-word web-corpora compiled from English-language websites in each domain, collected by School of Computing students 2006-08

  • Wordlists: word-types sorted by frequency in English-language web-corpora for each domain, created by School of Computing students 2006-08

  • Reports: Reports on English-language websites in each national domain (concluding whether the English is closer to UK or US English), written by School of Computing students 2006-08

  • Journals: For coursework 2006-08, School of Computing students produced research reports on text-mining applied to language data from the WWW, targeted at the readership of a language-related research journal. This page lists some relevant journals; alternatively, students can find other suitable journals, e.g. from the Leeds University Library catalogue, or by using Google to find another journal for the chosen region.

  • UK and US text samples and derived word-frequency lists for use in training WEKA classifiers. Ten 200,000-word samples from .uk domains and ten from .us domains; from each sample, a corresponding word-frequency list was derived
    (e.g. on Linux: sort ukaa | uniq -c | sort -n -r > ukwa ).

  • ukus.arff and test.arff files to train and test WEKA classifiers and clusterers, based on frequencies of colour/color and centre/center ... you can add more features!

  • draft research report on text-mining the World Wide English Corpus (with comments to show match to marking scheme) to start you off - you can rewrite this in your own words, and fill in the missing content...
  • . . .
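
The word-frequency and ARFF items above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to build ukus.arff: the relation name "ukus", the attribute names, the class labels uk/us, and the choice of relative (rather than raw) frequencies are all hypothetical. Tied counts are ordered alphabetically here, which may differ from the output order of the sort | uniq -c | sort -n -r pipeline.

```python
from collections import Counter

# Hypothetical feature set: the spelling pairs named in the list above.
FEATURES = ("colour", "color", "centre", "center")


def word_frequencies(words):
    """Return (word, count) pairs, most frequent first.

    Roughly mirrors `sort | uniq -c | sort -n -r`; ties are broken
    alphabetically, which is an assumption of this sketch.
    """
    counts = Counter(words)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def arff_row(words, label, features=FEATURES):
    """Build one ARFF data row: relative frequency per feature, then the label."""
    counts = Counter(words)
    total = len(words)
    values = [
        f"{(counts[w] / total if total else 0.0):.6f}" for w in features
    ]
    return ",".join(values + [label])


def arff_document(samples, features=FEATURES):
    """Assemble a small ARFF file from (words, label) pairs.

    The relation name and class labels are illustrative assumptions.
    """
    lines = ["@relation ukus", ""]
    for w in features:
        lines.append(f"@attribute {w} numeric")
    lines.append("@attribute class {uk,us}")
    lines += ["", "@data"]
    for words, label in samples:
        lines.append(arff_row(words, label, features))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    uk_words = "the colour of the centre".split()
    us_words = "the color of the center".split()
    print(word_frequencies(uk_words))
    print(arff_document([(uk_words, "uk"), (us_words, "us")]))
```

Adding more features, as the list suggests, would only mean extending the FEATURES tuple with further spelling variants; each entry becomes one numeric attribute.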

    Contact: Eric Atwell, Language research group, School of Computing, University of Leeds