Automatic Mapping Among Lexico-Grammatical Annotation Models (AMALGAM)


Eric Atwell, Language research group, School of Computing, Leeds University


The Lancaster/IBM Spoken English Corpus (SEC) Tag-set

The Spoken English Corpus (SEC) tag-set is based on the LOB corpus tag-set. As almost every SEC tag is identical to its LOB equivalent, there is no need for a separate table for SEC; the LOB corpus tag-set table should be consulted instead. The major difference between the tag-sets is that LOB differentiates between relative and interrogative WH-pronouns whereas SEC does not. For example, the LOB tags WP (WH-pronoun, interrogative, nominative or accusative) and WPR (WH-pronoun, relative, nominative or accusative) are covered by a single SEC tag. Confusingly, this tag is also called WP but, unlike in LOB, it does not imply that the WH-pronoun is interrogative. The following table details the differences between LOB and SEC with regard to WH-pronouns:

Tag | Description in SEC | Description in LOB
WP | WH-pronoun, nominative or accusative | WH-pronoun, interrogative, nominative or accusative
WPR | Not used in SEC (use WP instead) | WH-pronoun, relative, nominative or accusative
WP$ | WH-pronoun, genitive | WH-pronoun, interrogative, genitive
WP$R | Not used in SEC (use WP$ instead) | WH-pronoun, relative, genitive
WPO | WH-pronoun, accusative | WH-pronoun, interrogative, accusative
WPOR | Not used in SEC (use WPO instead) | WH-pronoun, relative, accusative
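
The WH-pronoun correspondences in the table above can be written as a simple lookup. The following is a minimal sketch only: the names LOB_TO_SEC_WH and lob_to_sec are hypothetical, and passing every other tag through unchanged is an assumption based on the statement that almost every SEC tag is identical to its LOB equivalent.

```python
# Hypothetical lookup built from the WH-pronoun table: each LOB tag is
# paired with the SEC tag that covers it.
LOB_TO_SEC_WH = {
    "WP": "WP",      # interrogative, nominative or accusative
    "WPR": "WP",     # relative, nominative or accusative
    "WP$": "WP$",    # interrogative, genitive
    "WP$R": "WP$",   # relative, genitive
    "WPO": "WPO",    # interrogative, accusative
    "WPOR": "WPO",   # relative, accusative
}


def lob_to_sec(tag: str) -> str:
    """Return the SEC tag covering a LOB WH-pronoun tag.

    Assumption: tags outside the WH-pronoun table are returned unchanged.
    This does not hold for the few LOB tags that SEC lacks, such as NC.
    """
    return LOB_TO_SEC_WH.get(tag, tag)
```

The mapping only runs from LOB to SEC. Going the other way would require deciding whether each WH-pronoun is relative or interrogative, which the SEC tag does not record.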

As its name implies, the Spoken English Corpus is composed of transcriptions of spoken English. It therefore inevitably differs from the LOB corpus, which consists of written texts only. Phenomena that occur primarily in written English will not be found in SEC. A good example is written abbreviations: in LOB these were marked in a pre-automatic-tagging phase by adding the sequence `\0' to the start of the abbreviated token, whereas this is not required in SEC.
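
When LOB and SEC tokens are processed side by side, the LOB abbreviation marker may need to be separated from the word itself. The sketch below assumes that each token is available as a plain string and that the marker is literally a backslash followed by a zero at the start of the token. The function name and the example token are hypothetical, not taken from either corpus.

```python
from typing import Tuple

# The two-character sequence described above: a backslash followed by "0".
ABBREV_MARKER = "\\0"


def split_abbrev_marker(token: str) -> Tuple[str, bool]:
    """Return (token without the marker, whether the marker was present)."""
    if token.startswith(ABBREV_MARKER):
        return token[len(ABBREV_MARKER):], True
    return token, False


# Illustrative token only; SEC tokens carry no marker and pass through as-is.
word, was_marked = split_abbrev_marker("\\0Mr")
```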

Some LOB tags do not appear in SEC even though, in theory, they would have been allowable. This is because, at just over 52 thousand words, SEC is much smaller than LOB, which has over a million words. Naturally, in such a small corpus the coverage of rare parts of speech is reduced. An example of a tag found in LOB but not in SEC is CD$, used for genitive cardinal numbers.

SEC is more selective about the use of ditto tags (those ending in a double quote character). In LOB, these were introduced during extensive post-editing. Ditto tags have less coverage in SEC partly because of a smaller post-editing phase and also, again, because the SEC is so small that rare ditto forms may not have occurred.

The smaller post-editing phase of SEC compared to LOB has blurred the distinction between the standard tags for ordinary adjectives and some types of nouns and the tags that signal word-initial capital usage. Some words that started a sentence in SEC may have retained an initial capital that would have been decapitalised during the post-edit phase in LOB. The tag pairs affected are:

JJ (adjective) and JNP (adjective, word-initial capital)
NN (noun, singular, common) and NNP (noun, singular, common, word-initial capital)
NN$ (noun, singular, common, genitive) and NNP$ (noun, singular, common, word-initial capital, genitive)
NNS (noun, plural, common) and NNPS (noun, plural, common, word-initial capital)
NNS$ (noun, plural, common, genitive) and NNPS$ (noun, plural, common, word-initial capital, genitive)
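
Because the two corpora may apply these pairs inconsistently, one option when comparing tag frequencies is to neutralise the capitalisation distinction first. The following is a minimal sketch of that idea, not a procedure described for either corpus; the names CAPITAL_TO_STANDARD and merge_capital_tag are hypothetical.

```python
# Hypothetical lookup: each word-initial-capital tag listed above is folded
# into its standard counterpart.
CAPITAL_TO_STANDARD = {
    "JNP": "JJ",
    "NNP": "NN",
    "NNP$": "NN$",
    "NNPS": "NNS",
    "NNPS$": "NNS$",
}


def merge_capital_tag(tag: str) -> str:
    """Fold a word-initial-capital tag into its standard tag.

    Tags outside the five affected pairs are returned unchanged.
    """
    return CAPITAL_TO_STANDARD.get(tag, tag)
```

Merging discards the capitalisation information in both corpora, so it is suitable only for comparisons where that distinction is not itself of interest.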

The tag NC, used for cited words, was added manually to LOB in the post-edit phase. This partially explains its absence from SEC, as it was not known to the automatic tagger.

Further information on the SEC can be found at the International Computer Archive of Modern English (ICAME) corpus collection.

Further reading:

Taylor, L.J. & G. Knowles. 1988. Manual of information to accompany the SEC corpus: The machine readable corpus of spoken English. Unit for Computer Research on the English Language, University of Lancaster.
