Automatic Mapping Among Lexico-Grammatical Annotation Models (AMALGAM)

OVERVIEW




Many researchers in Linguistics gather bodies of text (corpora) and analyse them to learn more about languages. The working assumption is that the more text we have, the more information we can gain. Many research groups have attached labels (tags) to each word in their corpora, so that, for instance, each word of 'The cat sat on the mat' carries its grammatical label: 'The' is a determiner, 'cat' is a noun, 'sat' is a verb, and so on. However, not all researchers have used the same set of tags, which makes it difficult for these different research groups to work together.

For example, using 'The cat sat on the mat', researchers who use the ICE scheme will produce:

The/ART(def)
cat/N(com,sing)
sat/V(intr,past)
on/PREP(ge)
the/ART(def)
mat/N(com,sing)
./PUNC(per)


Researchers who use the LOB scheme will produce:

The/ATI
cat/NN
sat/VBD
on/IN
the/ATI
mat/NN
./.

This incompatibility is an important problem because there is some evidence that the corpora currently available are not large enough to produce a general statistical model of grammatical structure. Even corpora containing hundreds of thousands, or even millions, of words are not enough, so we need to collate them into an even larger corpus. To join them together, we need some way of mapping from one set of tags to another.
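
As a rough illustration of what mapping between tagsets means, the Python sketch below converts the ICE-tagged example sentence into LOB tags with a simple lookup table. The table is an assumption built only from the seven tokens shown above, and the names ICE_TO_LOB, parse_tagged and map_ice_to_lob are hypothetical. It is not the AMALGAM mapping algorithm: real tagsets rarely correspond one-to-one, so a plain lookup like this would not be sufficient in general.

```python
# Illustrative ICE -> LOB lookup, built only from the example sentence.
# Assumption: each ICE tag here has exactly one LOB counterpart, which
# does not hold for the full tagsets.
ICE_TO_LOB = {
    "ART(def)": "ATI",
    "N(com,sing)": "NN",
    "V(intr,past)": "VBD",
    "PREP(ge)": "IN",
    "PUNC(per)": ".",
}


def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so a token such as './PUNC(per)' works.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def map_ice_to_lob(pairs, table=ICE_TO_LOB):
    """Replace each ICE tag with its LOB counterpart; unknown tags become None."""
    return [(word, table.get(tag)) for word, tag in pairs]


ice_text = (
    "The/ART(def) cat/N(com,sing) sat/V(intr,past) on/PREP(ge) "
    "the/ART(def) mat/N(com,sing) ./PUNC(per)"
)
lob_pairs = map_ice_to_lob(parse_tagged(ice_text))
lob_text = " ".join(f"{word}/{tag}" for word, tag in lob_pairs)
print(lob_text)
```

For this one sentence the result matches the LOB tagging shown earlier. Any ICE tag missing from the table is mapped to None rather than guessed, which makes gaps in the lookup easy to spot.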

The AMALGAM project is an attempt to create a set of mapping algorithms to map between the main tagsets and phrase structure grammar schemes used in the research corpora described above. We plan to develop a Multi-tagged Corpus and Multi-Treebank, a single text-set annotated with all the above tagging and parsing schemes.



This site is developed and maintained by Eric Atwell (eric@comp.leeds.ac.uk) using text provided by the staff and students of the NLP group of the School of Computer Studies at Leeds University.