World Wide English Corpus
Corpora:
200,000-word web-corpora compiled from English-language websites in each
domain, collected by School of Computing students 2006-08
Wordlists:
word-types sorted by frequency in English-language web-corpora for each
domain, created by School of Computing students 2006-08
Reports:
Reports on English-language websites in each national domain
(concluding whether the English is closer to UK or US English), written by
School of Computing students 2006-08
Journals:
For coursework in 2006-08, School of Computing students produced research
reports on text-mining applied to language data from the WWW,
targeted at the readership of a language-related research journal.
This page lists some relevant journals; alternatively, students can find
other suitable journals, e.g. from
the Leeds University Library catalogue, or by using Google to find another
journal for the chosen region.
UK and US text samples and derived word-frequency lists,
for use in training WEKA classifiers.
Ten 200,000-word samples from .uk domains and ten from .us domains;
from each wordlist, a corresponding word-frequency list was derived
(e.g. on Linux: sort ukaa | uniq -c | sort -n -r > ukwa)
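For readers without a Linux shell, the same counting step can be sketched in Python. This is only an illustration, assuming the sample has already been split into one token per item (as the sort | uniq -c pipeline implies for a file with one word per line); the function name word_frequencies and the toy token list are hypothetical and not part of the course materials.

```python
from collections import Counter


def word_frequencies(tokens):
    """Return (count, word) pairs, most frequent first.

    Mirrors `sort | uniq -c | sort -n -r`: count identical tokens,
    then order by descending count. Ties are broken alphabetically
    here so the output is deterministic (the shell pipeline makes no
    such guarantee).
    """
    counts = Counter(tokens)
    pairs = [(n, word) for word, n in counts.items()]
    return sorted(pairs, key=lambda pair: (-pair[0], pair[1]))


# Toy input, invented for illustration only.
sample = ["colour", "the", "centre", "the", "colour", "the"]
for count, word in word_frequencies(sample):
    # Same "count word" layout that uniq -c produces.
    print(f"{count:7d} {word}")
```

To apply this to a real sample file, one option is to read it line by line and pass the stripped lines as the token list.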
ukus.arff and test.arff files
to train and test WEKA classifiers and clusterers, based on frequencies
of colour/color and centre/center ... you can add more features!
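As a rough guide to adding features, the sketch below builds ARFF text from per-sample word counts. It is an assumption about a plausible layout, not a copy of the actual ukus.arff: the attribute names, the relation name, the helper arff_text and the toy counts are all hypothetical. Only the general ARFF structure (@relation, @attribute, @data) is standard WEKA input format.

```python
# Spelling-variant features; extend this list to add more features.
FEATURES = ["colour", "color", "centre", "center"]


def arff_text(samples, relation="ukus"):
    """Build ARFF text from (word_counts, label) pairs.

    word_counts maps a word to its frequency in one sample;
    label is the class of that sample, "uk" or "us".
    """
    lines = [f"@relation {relation}", ""]
    # One numeric attribute per feature word.
    lines += [f"@attribute {word} numeric" for word in FEATURES]
    # Nominal class attribute listing the allowed labels.
    lines += ["@attribute class {uk,us}", "", "@data"]
    for counts, label in samples:
        # Words absent from a sample get a frequency of 0.
        row = [str(counts.get(word, 0)) for word in FEATURES]
        lines.append(",".join(row + [label]))
    return "\n".join(lines) + "\n"


# Toy counts, invented for illustration; real values would come from
# the word-frequency lists derived from each sample.
toy = [
    ({"colour": 12, "centre": 9, "center": 1}, "uk"),
    ({"color": 15, "colour": 1, "center": 11}, "us"),
]
print(arff_text(toy))
```

Adding a feature under this layout means adding one word to FEATURES, for example another UK/US spelling pair, and regenerating both the training and the test file so their attributes match.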
Draft research report on text-mining the World Wide English Corpus
(with comments to show match to marking scheme)
to start you off - you can rewrite this in your own words, and fill in
the missing content...
. . .
Contact:
Eric Atwell, Language research group, School of Computing,
University of Leeds