School of Computing

FACULTY OF ENGINEERING

World Wide English Corpus

  • Corpora: 200,000-word web-corpora compiled from English-language websites in each domain, collected by School of Computing students

  • Wordlists: word-types sorted by frequency in English-language web-corpora for each domain, created by School of Computing students

  • Reports: Reports on English-language websites in each national domain (concluding whether the English is closer to UK or US English), written by School of Computing students

  • Journals: For coursework, School of Computing students produced research reports on text-mining applied to language data from the WWW, targeted at the readership of a language-related research journal. This page lists some relevant journals; alternatively, students can find other suitable journals, e.g. from the Leeds University Library catalogue or the Ranked Journal Spreadsheet from ARC, or by using Google to find another journal for the chosen region.

  • UK and US text samples and derived word-frequency lists for use in training WEKA classifiers: ten 200,000-word samples from .uk domains and ten from .us domains. From each sample, a corresponding word-frequency list was derived
    (e.g. on Linux: sort ukaa | uniq -c | sort -n -r > ukwa )

  • ukus.arff and test.arff files to train and test WEKA classifiers and clusterers, based on frequencies of colour/color and centre/center ... you can add more features!

  • Draft research report on text-mining the World Wide English Corpus (with comments showing how it matches the marking scheme) to start you off: you can rewrite this in your own words and fill in the missing content...
  • . . .
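
The word-frequency and ARFF items above can be sketched in Python. This is a minimal illustration, not the script used to build the course files. It assumes each sample file (such as ukaa) holds one word per line, which is what the sort | uniq -c pipeline implies. The helper names (read_words, word_frequencies, arff_text), the relation name, the attribute names and the use of raw counts as feature values are all hypothetical; the real ukus.arff may be laid out differently.

```python
from collections import Counter

# Spelling variants used as features; add more words here to add features.
FEATURES = ["colour", "color", "centre", "center"]


def read_words(path):
    """Read a sample file, assuming one word per line (an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def word_frequencies(words):
    """Return (word, count) pairs, most frequent first.

    Roughly what `sort | uniq -c | sort -n -r` produces. Ties are broken
    alphabetically here; the shell pipeline may order ties differently.
    """
    counts = Counter(words)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def arff_text(samples):
    """Build ARFF text from (label, words) pairs, one data row per sample.

    Feature values are raw counts, which are comparable only because
    every sample is assumed to be the same size.
    """
    lines = ["@relation ukus", ""]
    for feature in FEATURES:
        lines.append(f"@attribute {feature} numeric")
    lines.append("@attribute class {uk,us}")
    lines.append("")
    lines.append("@data")
    for label, words in samples:
        counts = Counter(words)
        values = [str(counts[feature]) for feature in FEATURES]
        lines.append(",".join(values + [label]))
    return "\n".join(lines) + "\n"
```

For example, arff_text([("uk", read_words("ukaa"))]) would give a file with a single UK row. If you add samples of different sizes, divide each count by the sample length so the features stay comparable.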

    Contact: Eric Atwell, Language research group, School of Computing, University of Leeds