Automatic Mapping Among Lexico-Grammatical Annotation Models (AMALGAM)
Many researchers in linguistics have been gathering bodies of text (corpora)
to analyse as a way of learning more about languages. The general assumption
is that the more text is available, the more can be learned from it. Many
research groups have attached labels (tags) to each word in their corpora, so
that in 'The cat sat on the mat', for instance, each word carries its
grammatical category: 'The' is a determiner, 'cat' is a noun, 'sat' is a verb,
and so on. However, not all researchers have used the same set of tags, which
makes it difficult for different research groups to work together.
For example, using 'The cat sat on the mat', researchers who use the
ICE scheme will produce:
The/ART(def)
cat/N(com,sing)
sat/V(intr,past)
on/PREP(ge)
the/ART(def)
mat/N(com,sing)
./PUNC(per)
Researchers who use the LOB scheme will produce:
The/ATI
cat/NN
sat/VBD
on/IN
the/ATI
mat/NN
./.
This is an important problem because there is some evidence that the corpora we currently have are not large enough to produce a general statistical model of grammatical structure. Even though these corpora contain hundreds of thousands, or even millions, of words, that is not enough. We need to collate them into an even larger corpus, which means finding some way of mapping from one set of tags to another so that the corpora can be joined together.
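As a rough illustration of what mapping between tagsets involves, the Python sketch below converts the LOB-tagged example sentence into ICE tags. The lookup table is built only from the two tagged examples shown earlier, and the names `LOB_TO_ICE`, `parse_tagged` and `map_lob_to_ice` are illustrative assumptions, not the AMALGAM algorithms. A real mapping is not one-to-one: LOB's VBD, for instance, says nothing about transitivity, whereas ICE's V(intr,past) does, so a plain table like this is only adequate for this one sentence.

```python
# Illustrative LOB -> ICE lookup, derived only from the example sentence.
# Assumption: each LOB tag has exactly one ICE equivalent, which does not
# hold in general (e.g. VBD does not always correspond to an intransitive verb).
LOB_TO_ICE = {
    "ATI": "ART(def)",
    "NN": "N(com,sing)",
    "VBD": "V(intr,past)",
    "IN": "PREP(ge)",
    ".": "PUNC(per)",
}


def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the last slash so that './.' yields ('.', '.').
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def map_lob_to_ice(pairs):
    """Replace each LOB tag with its table entry, or 'UNKNOWN' if absent."""
    return [(word, LOB_TO_ICE.get(tag, "UNKNOWN")) for word, tag in pairs]


lob_sentence = "The/ATI cat/NN sat/VBD on/IN the/ATI mat/NN ./."
ice_pairs = map_lob_to_ice(parse_tagged(lob_sentence))
print(" ".join(f"{word}/{tag}" for word, tag in ice_pairs))
```

Tags missing from the table come back as `UNKNOWN` rather than being guessed, which makes the gaps in such a table easy to spot.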
The AMALGAM project is an attempt to create a set of mapping algorithms
to map between the main tagsets and phrase structure grammar schemes used
in the research corpora described above. We plan to develop a Multi-tagged
Corpus and Multi-Treebank, a single text-set annotated with all the above
tagging and parsing schemes.
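One way to picture a multi-tagged corpus is as a single token sequence in which each word carries one tag per scheme. The Python sketch below aligns the ICE and LOB analyses of the example sentence. The function name `merge_taggings` and the dictionary layout are hypothetical, chosen for illustration; they are not the project's actual data format.

```python
def merge_taggings(taggings):
    """Combine several taggings of the same text into one multi-tagged list.

    `taggings` maps a scheme name to a list of (word, tag) pairs.
    Assumption: every scheme tokenises the text identically.
    """
    schemes = list(taggings)
    lengths = {len(taggings[s]) for s in schemes}
    if len(lengths) != 1:
        raise ValueError("taggings must cover the same number of tokens")
    merged = []
    for rows in zip(*(taggings[s] for s in schemes)):
        words = {word for word, _ in rows}
        if len(words) != 1:
            raise ValueError(f"token mismatch: {sorted(words)}")
        tags = {scheme: tag for scheme, (_, tag) in zip(schemes, rows)}
        merged.append({"word": rows[0][0], "tags": tags})
    return merged


ice = [("The", "ART(def)"), ("cat", "N(com,sing)"), ("sat", "V(intr,past)"),
       ("on", "PREP(ge)"), ("the", "ART(def)"), ("mat", "N(com,sing)"),
       (".", "PUNC(per)")]
lob = [("The", "ATI"), ("cat", "NN"), ("sat", "VBD"), ("on", "IN"),
       ("the", "ATI"), ("mat", "NN"), (".", ".")]

multi_tagged = merge_taggings({"ICE": ice, "LOB": lob})
for token in multi_tagged:
    print(token["word"], token["tags"])
```

The sketch raises an error when the schemes disagree on token boundaries instead of trying to repair the alignment.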
This site developed and maintained by Eric Atwell (eric@comp.leeds.ac.uk) using text provided by the staff and students of the NLP group of the School of Computer Studies at Leeds University.