Automatic Mapping Among Lexico-Grammatical Annotation Models (AMALGAM)
Mapping between full parsing schemes is much harder. Although the parsing schemes used in several treebanks have a familial similarity, ideally we would also like to be able to include the output of robust parsers from outside this ICAME heritage, such as the IPSM MULTITREEBANK (a corpus of sentences each of which is annotated with several rival syntax trees).
Our approach uses Machine Learning of syntax, assuming each annotated Corpus is definitive of the grammatical annotation scheme to be learnt. This suggests our approach to mapping between PARSING SCHEMES, for example to reparse SEC text with the POW parsing scheme:
This requires a parsing-scheme-neutral way of representing rival parse-trees, to simplify comparison of delicacy. We have tried extracting all CONTEXT-FREE RULES from each Treebank, to use in a CHART PARSER. However, this yields upwards of 8,000 context-free grammar rules from each Corpus parsing scheme; our current chart-parsing system cannot cope with such a large grammar in reasonable time. So, we are also experimenting with alternative representations.
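The rule-extraction step described above can be sketched roughly as follows. This is only an illustration, not the project's own software: the function names `parse` and `extract_rules` are hypothetical, the bracketing format is assumed to be the `[LABEL ... LABEL]` style of the EAGLES example on this page, and words stand in as terminals where a real Treebank would normally supply wordtags.

```python
import re
from collections import Counter

# Tokens: an opening bracket with its label ("[NP"), a closing label ("NP]"),
# or any other run of non-space, non-bracket characters (a word or punctuation).
TOKEN = re.compile(r"\[[A-Za-z]+|[A-Za-z]+\]|[^\s\[\]]+")


def parse(text):
    """Parse '[LABEL ... LABEL]' bracketing into nested (label, children) tuples."""
    stack = [("ROOT", [])]  # hypothetical dummy root that collects the top node
    for tok in TOKEN.findall(text):
        if tok.startswith("["):
            stack.append((tok[1:], []))
        elif tok.endswith("]"):
            if len(stack) < 2 or stack[-1][0] != tok[:-1]:
                raise ValueError(f"unbalanced bracket at {tok!r}")
            node = stack.pop()
            stack[-1][1].append(node)
        else:
            stack[-1][1].append(tok)  # a leaf: word or punctuation mark
    if len(stack) != 1 or len(stack[0][1]) != 1:
        raise ValueError("expected exactly one complete tree")
    return stack[0][1][0]


def extract_rules(node, rules=None):
    """Count one context-free rule (lhs, rhs) per non-terminal node."""
    rules = Counter() if rules is None else rules
    label, children = node
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rules[(label, rhs)] += 1
    for child in children:
        if not isinstance(child, str):
            extract_rules(child, rules)
    return rules


EXAMPLE = ("[S[VP select [NP the text [CL[NP you NP][VP want "
           "[VP to protect VP]VP]CL]NP]VP] . S]")
rules = extract_rules(parse(EXAMPLE))
for (lhs, rhs), count in sorted(rules.items()):
    print(f"{lhs} -> {' '.join(rhs)}  ({count})")
```

Run over a whole Treebank, a counter of this kind grows with every distinct right-hand side, which is one way to see how the rule set can become too large for a chart parser.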
For WORDTAGGED corpora, we assume a sequence of word+wordtag pairs; this is amenable to N-gram-like modelling. For PARSE-TREES, an analogous N-gram-like model is used in the VERTICAL STRIP PARSER (VSP): a Vertical Strip Grammar (VSG).
For example, take the parse-tree in the EAGLES basic parsing scheme:
[S[VP select [NP the text [CL[NP you NP][VP want [VP to protect VP]VP]CL]NP]VP] . S]
                    S
                    | \
                    |  \
                    VP  \
                   /|    \
                  / |     \
                 /  NP     \
                / //|       \
               / // |        \
              / //  CL        \
             / //   | \        \
            / //    |  \        \
           / / |    NP  VP       \
          / /  |    |   | \       \
         / /   |    |   |  \       \
        / /    |    |   |   VP      \
       / |     |    |   |   | \      \
      /  |     |    |   |   |  \      \
select  the  text  you want to protect .
Another way of drawing the same tree, using only vertical and horizontal lines, is:
S____________________________________
|                                   |
VP______                            |
|      |                            |
|      NP________                   |
|      |   |    |                   |
|      |   |    CL___               |
|      |   |    |   |               |
|      |   |    NP  VP____          |
|      |   |    |   |    |          |
|      |   |    |   |    VP__       |
|      |   |    |   |    |  |       |
select the text you want to protect .
This can be chopped into a series of Vertical Strips, one for each path from root S to each leaf:
S       S       S       S       S       S       S       S
|       |       |       |       |       |       |       |
VP      VP      VP      VP      VP      VP      VP      .
|       |       |       |       |       |       |
select  NP      NP      NP      NP      NP      NP
        |       |       |       |       |       |
        the     text    CL      CL      CL      CL
                        |       |       |       |
                        NP      VP      VP      VP
                        |       |       |       |
                        you     want    VP      VP
                                        |       |
                                        to      protect
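The chopping step can be sketched in a few lines of Python. This is an illustrative sketch only: the nested `(label, children)` representation and the names `TREE` and `vertical_strips` are assumptions made here, not part of the VSP itself.

```python
# The example tree as nested (label, children) tuples; leaves are plain strings.
TREE = ("S", [
    ("VP", [
        "select",
        ("NP", [
            "the",
            "text",
            ("CL", [
                ("NP", ["you"]),
                ("VP", ["want", ("VP", ["to", "protect"])]),
            ]),
        ]),
    ]),
    ".",
])


def vertical_strips(node, path=()):
    """Return one root-to-leaf path (a Vertical Strip) per leaf, left to right."""
    if isinstance(node, str):
        return [list(path) + [node]]
    label, children = node
    strips = []
    for child in children:
        strips.extend(vertical_strips(child, (*path, label)))
    return strips


strips = vertical_strips(TREE)
for strip in strips:
    print(" | ".join(strip))
```

Each printed line corresponds to one column of the diagram above, read from top to bottom.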
This Vertical Strip representation is highly redundant, as the "top" of each strip shares its path from the root with its predecessor. So, the final VSG representation only records the path to each leaf from the point of divergence from the previous Strip:
S
|                                                       |
VP                                                      .
|       |
select  NP
        |       |       |
        the     text    CL
                        |       |
                        NP      VP
                        |       |       |
                        you     want    VP
                                        |       |
                                        to      protect
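One possible way to compute this reduced representation is to walk the tree and, for each leaf, record only the nodes opened since the previous leaf. The sketch below assumes the same nested `(label, children)` tuples as a stand-in for the tree; `TREE` and `vsg_strips` are hypothetical names chosen for illustration. Working on the tree itself, rather than comparing the labels of neighbouring strips, avoids wrongly merging two adjacent sibling nodes that happen to share a label.

```python
# The example tree as nested (label, children) tuples; leaves are plain strings.
TREE = ("S", [
    ("VP", [
        "select",
        ("NP", [
            "the",
            "text",
            ("CL", [
                ("NP", ["you"]),
                ("VP", ["want", ("VP", ["to", "protect"])]),
            ]),
        ]),
    ]),
    ".",
])


def vsg_strips(node, pending=()):
    """Return, per leaf, the labels newly opened since the previous leaf plus the leaf."""
    if isinstance(node, str):
        return [list(pending) + [node]]
    label, children = node
    strips = []
    for i, child in enumerate(children):
        # Only the first child inherits the labels still waiting to be recorded;
        # later children diverge from the previous strip below this node.
        prefix = (*pending, label) if i == 0 else ()
        strips.extend(vsg_strips(child, prefix))
    return strips


vsg = vsg_strips(TREE)
for strip in vsg:
    print(" | ".join(strip))
```

For the example sentence this yields eight short strips, one per leaf, matching the columns of the diagram above.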
This site developed and maintained by Eric Atwell (eric@comp.leeds.ac.uk)