Automatic Mapping Among Lexico-Grammatical Annotation Models (AMALGAM)





Our approach to lexico-grammatical word classes

Project AMALGAM (Automatic Mapping Among Lexico-Grammatical Annotation Models) aims to investigate and develop methods for automatically mapping between the different syntactic annotation schemes applied to English corpora, thereby assessing their differences and improving their reusability. Annotating a single corpus with several schemes allows them to be compared directly and will provide a rich test-bed for automatic parsers.

The tag-mapping approach we eventually adopted is a hybrid. In the first stage, to replace the "old" part-of-speech tags of a corpus with those of a "new" scheme, we apply a tagger for the new scheme to just the words of the corpus. Although some of the tagging programs used to annotate the LOB, ICE, SEC and other corpora are available to us and use similar underlying algorithms, they differ in significant ways. In other cases we do not have access to the tagging programs at all, which forces us to train our own. We therefore decided to train a publicly available machine-learning system, the Brill tagger, to re-tag according to all of the schemes we are working with. Because the Brill tagger will be the sole automatic annotator for the project, we will have greater flexibility and will be able to release the AMALGAM software with a single tagger at its core.

The Brill system is first given a tagged corpus as a training set, from which it learns a complex set of tagging rules for the given lexico-grammatical annotation model. The learnt rule-set can then be applied to new text to annotate it with the given tagset. We accept the tagged or parsed corpus itself as definitive of the tagging or parsing scheme for that corpus.
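The train-then-apply cycle described above can be illustrated with a much-simplified sketch of transformation-based learning. This is not the Brill tagger itself: it assumes a single rule template ("change tag A to B when the preceding tag is C"), a most-frequent-tag lexicon as the initial tagger, and a three-sentence toy corpus. All function names and the training data are hypothetical.

```python
from collections import Counter, defaultdict

def train_lexicon(tagged_sents):
    """Most frequent tag per word, plus the most frequent tag overall as a default."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    all_tags = Counter(t for s in tagged_sents for _, t in s)
    return lexicon, all_tags.most_common(1)[0][0]

def initial_tags(words, lexicon, default):
    """Initial guess: look each word up, falling back to the default tag."""
    return [lexicon.get(w, default) for w in words]

def apply_rule(tags, rule):
    """Rule (old, new, prev): change old to new where the preceding tag is prev."""
    old, new, prev = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == old and tags[i - 1] == prev:
            out[i] = new
    return out

def count_errors(current, gold):
    return sum(c != g for cur, ref in zip(current, gold) for c, g in zip(cur, ref))

def learn_rules(tagged_sents, lexicon, default, max_rules=10):
    """Greedily pick the rule that most reduces errors against the training corpus."""
    gold = [[t for _, t in s] for s in tagged_sents]
    current = [initial_tags([w for w, _ in s], lexicon, default) for s in tagged_sents]
    rules = []
    for _ in range(max_rules):
        # Candidate rules are proposed only at positions that are currently wrong.
        candidates = {(cur[i], ref[i], cur[i - 1])
                      for cur, ref in zip(current, gold)
                      for i in range(1, len(cur)) if cur[i] != ref[i]}
        best, best_err = None, count_errors(current, gold)
        for rule in sorted(candidates):
            err = count_errors([apply_rule(c, rule) for c in current], gold)
            if err < best_err:
                best, best_err = rule, err
        if best is None:  # no candidate improves the tagging any further
            break
        rules.append(best)
        current = [apply_rule(c, best) for c in current]
    return rules

def tag(words, lexicon, default, rules):
    """Apply the learnt rule-set, in order, to new text."""
    tags = initial_tags(words, lexicon, default)
    for rule in rules:
        tags = apply_rule(tags, rule)
    return list(zip(words, tags))

# Toy training corpus (hypothetical); the tagged corpus is taken as definitive.
train = [
    [("the", "AT"), ("race", "NN")],
    [("a", "AT"), ("race", "NN")],
    [("to", "TO"), ("race", "VB")],
]
lexicon, default = train_lexicon(train)
rules = learn_rules(train, lexicon, default)
print(rules)                                        # [('NN', 'VB', 'TO')]
print(tag(["to", "race"], lexicon, default, rules)) # [('to', 'TO'), ('race', 'VB')]
```

Because the rules are learnt only from the tagged corpus, the same code could in principle be retrained for each annotation scheme by swapping the training set, which is the property the single-tagger design relies on.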

The second stage of the hybrid mapper compares the new and old tags, pair by pair, against a hand-crafted tagging interlingua to check for likely errors, and updates tags where necessary.
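A minimal sketch of such a pair-by-pair check follows. It assumes the interlingua can be approximated by a small set of coarse categories shared by both schemes, and that a disagreeing new tag is replaced by a default new-scheme tag for the old tag's category. Both the tables and the update policy are illustrative assumptions, not the project's actual interlingua, and the tag names are placeholders.

```python
# Hypothetical tables: each scheme's tags mapped to a coarse interlingua category.
OLD_TO_INTER = {"N": "NOUN", "V": "VERB", "ART": "DET"}
NEW_TO_INTER = {"NN": "NOUN", "NNS": "NOUN", "VB": "VERB", "AT": "DET"}
# Assumed fallback: the new-scheme tag to use when a new tag is rejected.
INTER_TO_NEW_DEFAULT = {"NOUN": "NN", "VERB": "VB", "DET": "AT"}

def check_pairs(words, old_tags, new_tags):
    """Return (word, final_new_tag, was_changed) for each token."""
    result = []
    for word, old, new in zip(words, old_tags, new_tags):
        old_cat = OLD_TO_INTER.get(old)
        new_cat = NEW_TO_INTER.get(new)
        if old_cat is None or new_cat is None or old_cat == new_cat:
            # Tags agree, or one of them is unknown to the tables: keep the new tag.
            result.append((word, new, False))
        else:
            # Categories disagree: treat as a likely tagger error and update.
            result.append((word, INTER_TO_NEW_DEFAULT[old_cat], True))
    return result

checked = check_pairs(["the", "race"], ["ART", "N"], ["AT", "VB"])
print(checked)  # [('the', 'AT', False), ('race', 'NN', True)]
```

Returning the `was_changed` flag alongside each tag keeps the likely errors visible for manual inspection rather than silently overwriting the tagger's output.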

For example, mapping from the SEC to the Brown lexico-grammatical annotation model is achieved by retagging SEC text with Brown tags.



This site developed and maintained by Eric Atwell (eric@comp.leeds.ac.uk)