Automatic Mapping Among Lexico-Grammatical Annotation Models (AMALGAM)
Different word-tagging schemes assume different segmentations of text into LEXICAL UNITS, or words. For example, compound names or idiomatic phrases are sometimes given a single word-tag; in contrast, affixes are sometimes stripped off and given a separate word-tag. We therefore had to build a sophisticated alignment program to align and compare rival taggings of the same text.
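The alignment problem can be illustrated with a minimal Python sketch. This is not the AMALGAM alignment program itself: the function name `align_taggings`, the toy tags and the matching strategy (grouping tokens until both sides cover the same characters, ignoring whitespace) are all assumptions made for illustration.

```python
def _chars(token):
    # Compare tokens by their non-whitespace characters only, so that a
    # multi-word unit such as "in spite of" lines up with "in", "spite", "of".
    return "".join(token.split())


def align_taggings(a, b):
    """Group two taggings of the same text into minimal matching spans.

    Each tagging is a list of (token, tag) pairs. Returns a list of
    (span_a, span_b) pairs whose tokens cover the same characters.
    Hypothetical helper, not part of any published tool.
    """
    groups = []
    i = j = 0
    while i < len(a) and j < len(b):
        start_i, start_j = i, j
        text_a = _chars(a[i][0])
        text_b = _chars(b[j][0])
        i += 1
        j += 1
        # Extend whichever side has covered fewer characters so far.
        while len(text_a) != len(text_b):
            if len(text_a) < len(text_b):
                if i == len(a):
                    raise ValueError("taggings do not cover the same text")
                text_a += _chars(a[i][0])
                i += 1
            else:
                if j == len(b):
                    raise ValueError("taggings do not cover the same text")
                text_b += _chars(b[j][0])
                j += 1
        if text_a != text_b:
            raise ValueError(f"mismatched text: {text_a!r} vs {text_b!r}")
        groups.append((a[start_i:i], b[start_j:j]))
    if i != len(a) or j != len(b):
        raise ValueError("taggings do not cover the same text")
    return groups


# Toy taggings with made-up tags: one scheme treats the idiom as a single
# unit and splits off the genitive affix; the other does the opposite.
scheme_a = [("in spite of", "PREP"), ("John", "NOUN"), ("'s", "GEN"), ("plan", "NOUN")]
scheme_b = [("in", "PREP"), ("spite", "NOUN"), ("of", "PREP"), ("John's", "NOUN"), ("plan", "NOUN")]

groups = align_taggings(scheme_a, scheme_b)
for span_a, span_b in groups:
    print(span_a, "<->", span_b)
```

Each resulting group pairs one or more tokens from the first scheme with one or more from the second, which gives a unit at which rival tags can be compared. A real aligner would also have to cope with texts that differ slightly between corpus editions, which this sketch simply rejects.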
Originally, when re-training Brill's tagger on samples from each tagged corpus, we used the texts in their original format. However, this raised problems for corpus texts that were transcriptions of spoken dialogue, such as the London-Lund Corpus. For example, the learning algorithm generally makes use of punctuation to guide analysis, and can be misled by hesitations, false starts, etc. These differences are also problematic in mapping between schemes.
The difference between the London-Lund scheme and the others we are dealing with led us to begin re-training the Brill tagger on an edited form of the London-Lund Corpus. The main problem for us is that the London-Lund scheme is designed for a transcribed spoken corpus, whereas we want to be able to apply the scheme to written corpora. So spoken text had to be pre-processed to add some punctuation and to remove markup for pauses, repeated phrases, subaudible text, etc. Our editing process is analogous to the editing of Hansard, the official transcripts of parliamentary debates. Canadian Hansard transcripts have been widely used in NLP research as "well-behaved" spoken text.
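The kind of pre-processing described above might look roughly like the following Python sketch. The markers `<pause>` and `<unclear>...</unclear>` are hypothetical placeholders, not the actual London-Lund transcription conventions, and the function `clean_utterance` is an illustrative assumption rather than the editing procedure actually used.

```python
import re

# Hypothetical markup, chosen only for illustration.
PAUSE = re.compile(r"<pause>")
SUBAUDIBLE = re.compile(r"<unclear>.*?</unclear>")
# An immediately repeated word, e.g. "I I think" or "it it is".
REPEAT = re.compile(r"\b(\w+)(\s+\1\b)+", re.IGNORECASE)


def clean_utterance(text):
    """Turn one transcribed utterance into more 'written-like' text."""
    text = SUBAUDIBLE.sub(" ", text)   # drop subaudible stretches
    text = PAUSE.sub(" ", text)        # drop pause markers
    # Collapse repeated words. This only handles single words, and it
    # would also collapse legitimate repeats such as "had had".
    text = REPEAT.sub(r"\1", text)
    text = " ".join(text.split())      # normalise whitespace
    # Add minimal sentence-final punctuation if none is present.
    if text and text[-1] not in ".?!":
        text += "."
    return text


cleaned = clean_utterance("well <pause> I I think <unclear>mm</unclear> it it is fine")
print(cleaned)  # well I think it is fine.
```

Repeated multi-word phrases and false starts are harder to detect than repeated single words, so a rule-based pass like this would probably need to be followed by manual editing.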
This site is developed and maintained by Eric Atwell (eric@comp.leeds.ac.uk)