Unification of Multiple Treebanks and Testing Them With Statistical Parser With Support of Large Corpus as a Lexical Resource

Abstract

Many Treebanks (texts annotated with parse trees) are available to researchers in the field of Natural Language Processing (NLP). All of these Treebanks are limited in size, and each one uses its own Context Free Grammar (CFG) production rules (its own formalism), because Treebank construction is time consuming and requires experts in linguistics. Treebanks can be used for statistical parsing, machine translation, and other NLP applications. In this paper, we propose building a large Treebank from multiple Treebanks for the same language, and using an annotated corpus as a lexical resource. Three English Treebanks are used in our study: the Penn Treebank (PTB), the GENIA Treebank (GTB), and the British National Corpus (BNC). The Brown corpus, which contains approximately one million tokens, each annotated with a part-of-speech (POS) tag, is used as the lexical resource. Our work starts with the unification of the POS tagsets of the three Treebanks; the Brown corpus tagset is then mapped to the unified tagset. This is done manually, based on our experience in this field. All the non-terminals in the CFG productions are also unified. The three Treebanks and the Brown corpus are then rebuilt according to these modifications. We test the proposed unification in four ways: (i) a statistical parsing test for each Treebank alone without modification, (ii) a statistical parsing test for each Treebank alone after the modification, (iii) a statistical parsing test for the collection of the three Treebanks after modification without the support of the lexical resource, and (iv) a statistical parsing test for the collection of the three Treebanks after modification with the support of the lexical resource. Unknown words are handled with a very simple proposed method. Our work shows that (a) multiple Treebanks can be unified, and that doing so increases accuracy; (b) a large annotated corpus such as the Brown corpus can be used to (i) reduce the number of unknown words and (ii) estimate probabilities that are closer to reality; and (c) the mapping between the unified tagset and the lexical tagset (used in the Brown corpus) is straightforward.
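The tagset mapping and the effect of the lexical resource on unknown words can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the unified tag names (`NOUN_PL`, `VERB_PAST`), the `TAG_MAP` table, the helper functions, and the toy data are hypothetical and are not the mapping tables or the unknown-word method used in this work.

```python
from collections import Counter

# Hypothetical mapping: (source corpus, source tag) -> unified tag.
# The unified tag names are invented for this example.
TAG_MAP = {
    ("PTB", "NNS"): "NOUN_PL",
    ("PTB", "VBD"): "VERB_PAST",
    ("BROWN", "NNS"): "NOUN_PL",
    ("BROWN", "VBD"): "VERB_PAST",
    ("BROWN", "BED"): "VERB_PAST",
}


def unify(source, tagged_tokens):
    """Relabel (word, tag) pairs with unified tags.

    Raises KeyError if a tag has no entry in the mapping table.
    """
    return [(word.lower(), TAG_MAP[(source, tag)]) for word, tag in tagged_tokens]


def build_lexicon(*corpora):
    """Count how often each word appears with each unified tag."""
    lexicon = {}
    for corpus in corpora:
        for word, tag in corpus:
            lexicon.setdefault(word, Counter())[tag] += 1
    return lexicon


def unknown_rate(lexicon, words):
    """Fraction of the given words that are absent from the lexicon."""
    missing = [w for w in words if w.lower() not in lexicon]
    return len(missing) / len(words)


# Toy data standing in for a Treebank and for the lexical resource.
treebank = unify("PTB", [("dogs", "NNS"), ("barked", "VBD")])
brown = unify("BROWN", [("cats", "NNS"), ("were", "BED"), ("barked", "VBD")])
test_words = ["dogs", "cats", "were", "slept"]

rate_without = unknown_rate(build_lexicon(treebank), test_words)
rate_with = unknown_rate(build_lexicon(treebank, brown), test_words)
print(rate_without, rate_with)
```

On this toy data, adding the second corpus lowers the share of unknown test words, and the tag counts for words seen in both sources are pooled, which is the intuition behind using a large annotated corpus as a lexical resource.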