Automatic Mapping Among Lexico-Grammatical Annotation Models (AMALGAM)
The Spoken English Corpus (SEC) tag-set is based on the LOB corpus tag-set. Because almost every SEC tag is identical to its LOB equivalent, there is no need for a separate table for SEC; the LOB corpus tag-set table should be consulted instead. The major difference between the tag-sets is that LOB differentiates between relative and interrogative WH-pronouns whereas SEC does not. For example, the LOB tags WP (WH-pronoun, interrogative, nominative or accusative) and WPR (WH-pronoun, relative, nominative or accusative) are covered by a single SEC tag. Confusingly, this tag is also called WP but, unlike in LOB, it does not imply that the WH-pronoun is interrogative. The following table details the differences between LOB and SEC with regard to WH-pronouns:

| Tag | Description in SEC | Description in LOB |
| --- | --- | --- |
| WP | WH-pronoun, nominative or accusative | WH-pronoun, interrogative, nominative or accusative |
| WPR | Not used in SEC (use WP instead) | WH-pronoun, relative, nominative or accusative |
| WP$ | WH-pronoun, genitive | WH-pronoun, interrogative, genitive |
| WP$R | Not used in SEC (use WP$ instead) | WH-pronoun, relative, genitive |
| WPO | WH-pronoun, accusative | WH-pronoun, interrogative, accusative |
| WPOR | Not used in SEC (use WPO instead) | WH-pronoun, relative, accusative |

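The WH-pronoun differences listed above amount to a many-to-one mapping from LOB tags to SEC tags. The following Python sketch shows one way this could be written down; the names `LOB_TO_SEC_WH` and `lob_to_sec` are hypothetical, and passing every other tag through unchanged is an assumption that holds only for tags that are identical in the two tag-sets.

```python
# Mapping taken from the WH-pronoun table: each LOB relative tag collapses
# onto the SEC tag it shares with its interrogative counterpart.
LOB_TO_SEC_WH = {
    "WP": "WP",
    "WPR": "WP",
    "WP$": "WP$",
    "WP$R": "WP$",
    "WPO": "WPO",
    "WPOR": "WPO",
}


def lob_to_sec(tag: str) -> str:
    """Return the SEC tag for a LOB tag (hypothetical helper).

    Assumption: tags outside the WH-pronoun group are the same in both
    tag-sets, so they are returned unchanged.
    """
    return LOB_TO_SEC_WH.get(tag, tag)


print(lob_to_sec("WPR"))   # relative pronoun tag collapses to WP
print(lob_to_sec("WP$R"))  # relative genitive tag collapses to WP$
```

The mapping is lossy in the other direction: an SEC WP tag on its own does not say whether the LOB equivalent would be WP or WPR.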
As its name implies, the Spoken English Corpus is composed of transcriptions of spoken English. This inherently means that there will be differences between it and the LOB corpus, which consists of written texts only. Phenomena that occur primarily in written English will not be found in SEC. A good example is written abbreviations. These were marked in LOB in a pre-automatic-tagging phase by adding the sequence `\0' to the start of the abbreviated token, whereas this is not required in SEC.
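When LOB tokens are compared with SEC tokens, the abbreviation marker may need to be removed first. The following is a minimal sketch, assuming the marker is the literal two-character sequence backslash-zero at the very start of the token; the function name and the sample token are illustrative and not taken from either corpus.

```python
# The LOB abbreviation marker: a backslash followed by the digit zero.
ABBREVIATION_MARKER = "\\0"


def strip_abbreviation_marker(token: str) -> str:
    """Remove a leading LOB abbreviation marker, if present (hypothetical helper)."""
    if token.startswith(ABBREVIATION_MARKER):
        return token[len(ABBREVIATION_MARKER):]
    return token


# Illustrative token only; prints the token without its marker.
print(strip_abbreviation_marker("\\0Mr"))
```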
Some of the LOB tags do not appear in SEC even though, in theory, they would have been allowable. This is because, at just over 52 thousand words, SEC is much smaller than LOB, which has over a million words. Naturally, in such a small corpus the coverage of rare parts of speech is reduced. An example of a tag found in LOB but not in SEC is CD$, used for genitive cardinal numbers.
SEC is more selective about the use of ditto tags (those ending in a double quote character). In LOB, these were introduced during extensive post-editing. Ditto tags have less coverage in SEC partly because of its smaller post-editing phase and partly, again, because SEC is so small that rare ditto forms may not have occurred.
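Because ditto tags are identified purely by their final character, their coverage in a tagged corpus can be checked mechanically. The sketch below assumes the tags are already available as a plain list of strings; the function names and the sample tag sequence are illustrative only.

```python
from collections import Counter


def is_ditto_tag(tag: str) -> bool:
    """A ditto tag is one that ends in a double quote character."""
    return tag.endswith('"')


def ditto_tag_counts(tags):
    """Count how often each ditto tag occurs in a sequence of tags."""
    return Counter(tag for tag in tags if is_ditto_tag(tag))


# Illustrative tag sequence, not drawn from either corpus.
sample_tags = ["IN", 'IN"', "NN", "IN", 'IN"']
print(ditto_tag_counts(sample_tags))
```

Running the same count over both corpora would show which ditto forms occur in LOB but are missing from SEC.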
The smaller post-editing phase of SEC compared to LOB has blurred the distinction between the standard tags for ordinary adjectives and some types of nouns and the tags that signal word-initial capital usage. Some words that started a sentence in SEC may have retained an initial capital that would have been decapitalised during the post-edit phase in LOB. Tag pairs that are affected are:
The tag NC, used for cited words, was added manually to LOB in the post-edit phase. This partially explains its absence from SEC, as it was not known to the automatic tagger.
Further information on the SEC can be found at the International Computer Archive of Modern English (ICAME) corpus collection.
Further reading: