Download nonProj2Proj 0.2 and proj2NonProj 0.2

User guide for nonProj2Proj 0.2 and proj2NonProj 0.2

The jar file nonProj2Proj.jar takes a file containing dependency trees encoded either in Malt-XML or in the format defined by the organizers of the CoNLL-X Shared Task 2006. Using graph transformations, it transforms all non-projective sentences into projective ones (Nivre and Nilsson, 2005). Several marking strategies are available for encoding the transformations (see below).
usage: java -Xmx512m -jar nonProj2Proj.jar [options] <non-proj file (in Malt-XML or CoNLL-TAB)> <marking strategy (0|1|4|6|filter)> <file format (maltxml|conll)> [charSetName (default=ISO-8859-1)]

options:
covered roots: -cr (0|1|2|3); 0=no attachment (default), 1=root attachment, 2=left attachment, 3=right attachment
If the input treebank file was encoded in MaltXML, three files are output; if it was encoded in the CoNLL format, two files are output.

The covered roots option (the -cr flag) attaches an unconnected sub-tree to the shortest arc that covers all tokens of the sub-tree. This can help when the data contains, for example, dangling punctuation in the middle of a sentence. The flag takes four values: 1, 2 and 3 attach the sub-tree to the root, left or right of the shortest covering arc, respectively, while 0 does nothing.
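The covered roots behaviour can be illustrated with a small Python sketch. It is not the code inside nonProj2Proj.jar, only one reading of the description above. Assumptions: a sentence is a dict mapping each token position to its head position, head 0 means "unattached", arcs from the artificial root 0 are not considered as covering arcs, and "root of the covering arc" is taken to mean the head end of that arc. All function names are hypothetical.

```python
def subtree_tokens(heads, root):
    """Return the set of token positions in the sub-tree rooted at `root`."""
    tokens = {root}
    changed = True
    while changed:
        changed = False
        for tok, head in heads.items():
            if head in tokens and tok not in tokens:
                tokens.add(tok)
                changed = True
    return tokens


def shortest_covering_arc(heads, root):
    """Return (head, dependent) of the shortest arc that strictly spans the
    whole sub-tree rooted at `root`, or None if no such arc exists."""
    span = subtree_tokens(heads, root)
    lo, hi = min(span), max(span)
    best = None
    for dep, head in heads.items():
        if head == 0:
            continue  # assumption: arcs from the artificial root do not count
        left, right = min(dep, head), max(dep, head)
        if left < lo and hi < right:
            if best is None or right - left < abs(best[0] - best[1]):
                best = (head, dep)
    return best


def attach_covered_roots(heads, mode):
    """Apply one of the four -cr modes (0, 1, 2, 3) to a copy of `heads`."""
    if mode not in (0, 1, 2, 3):
        raise ValueError("mode must be 0, 1, 2 or 3")
    result = dict(heads)
    if mode == 0:
        return result  # 0 = do nothing
    for tok, head in heads.items():
        if head != 0:
            continue
        arc = shortest_covering_arc(heads, tok)
        if arc is None:
            continue  # not covered by any arc, e.g. the real sentence root
        if mode == 1:
            result[tok] = arc[0]    # assumed meaning of "root": head of the arc
        elif mode == 2:
            result[tok] = min(arc)  # left end of the covering arc
        else:
            result[tok] = max(arc)  # right end of the covering arc
    return result


# Token 3 is unattached and covered by the arcs 4 -> 2 and 1 -> 4.
example = {1: 0, 2: 4, 3: 0, 4: 1}
attached = attach_covered_roots(example, 2)
```

In this example the shortest covering arc for token 3 is 4 -> 2, so mode 2 attaches token 3 to token 2, while modes 1 and 3 both attach it to token 4. Token 1, the real root of the sentence, is not covered by any arc and stays unattached.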

The optional last argument (charSetName) is only used when the input file is in the CoNLL format. If the input data comes from the CoNLL Shared Task, this argument should be UTF-8.

The -Xmx512m flag is optional, but it may help the Java JRE 5.0 cope with larger XML files. For more details, see the documentation for Java JRE 5.0.
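When nonProj2Proj is called from a script, the argument order in the usage line above is easy to get wrong. The following Python sketch assembles the argument list. The function name, the file name train.conll and the parameter names are hypothetical, and it assumes that the -cr value is passed as a separate argument after the flag.

```python
import shlex

STRATEGIES = {"0", "1", "4", "6", "filter"}
FORMATS = {"maltxml", "conll"}


def nonproj2proj_command(treebank, strategy, file_format, charset=None,
                         covered_roots=None, max_heap="512m",
                         jar="nonProj2Proj.jar"):
    """Build the argument list for one nonProj2Proj run, following the usage line."""
    strategy = str(strategy)
    if strategy not in STRATEGIES:
        raise ValueError("marking strategy must be one of 0, 1, 4, 6, filter")
    if file_format not in FORMATS:
        raise ValueError("file format must be maltxml or conll")
    cmd = ["java"]
    if max_heap:
        cmd.append("-Xmx" + max_heap)  # optional JVM heap setting
    cmd += ["-jar", jar]
    if covered_roots is not None:
        # Assumption: the flag and its value are two separate arguments.
        cmd += ["-cr", str(covered_roots)]
    cmd += [treebank, strategy, file_format]
    if charset:
        cmd.append(charset)  # only used for CoNLL input
    return cmd


cmd = nonproj2proj_command("train.conll", "1", "conll",
                           charset="UTF-8", covered_roots=1)
print(shlex.join(cmd))
# The list can then be passed to subprocess.run(cmd, check=True).
```

Building a list instead of a single string avoids shell quoting problems when file names contain spaces.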

The jar file proj2NonProj.jar does the opposite of nonProj2Proj.jar: it takes a file containing projectivized trees and deprojectivizes them. To perform some consistency checks, the program requires a file containing all the original (non-projective) dependency relations, with one dependency relation per line. The program cannot detect which marking strategy was used for the projective input, so the user must specify the deprojectivization strategy to use.

usage: java -jar proj2NonProj.jar <proj. file (in Malt-XML or CoNLL-TAB)> <non-proj. deprel file> <marking strategy (0|1|4|6)> <file format (maltxml|conll)> [charSetName (default=ISO-8859-1)]

If the input treebank file was encoded in MaltXML, two files are output; if it was encoded in the CoNLL format, one file is output.
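The deprel file required by proj2NonProj could be produced from the original CoNLL treebank along the following lines. This is a sketch under assumptions: it reads "all the original dependency relations" as the set of distinct labels in the DEPREL column (the eighth tab-separated column in the CoNLL-X format), and it writes them in sorted order, which the program may or may not require. The function names are hypothetical.

```python
def collect_deprels(conll_lines):
    """Return the distinct DEPREL labels found in CoNLL-X formatted lines."""
    labels = set()
    for line in conll_lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # blank lines separate sentences
        columns = line.split("\t")
        if len(columns) < 8:
            continue  # skip malformed lines
        labels.add(columns[7])  # DEPREL is the eighth column
    return sorted(labels)


def write_deprel_file(conll_path, deprel_path, encoding="UTF-8"):
    """Write one dependency relation per line, as described for proj2NonProj."""
    with open(conll_path, encoding=encoding) as src:
        labels = collect_deprels(src)
    with open(deprel_path, "w", encoding=encoding) as out:
        for label in labels:
            out.write(label + "\n")
    return labels


sample = [
    "1\tJohn\t_\tN\tN\t_\t2\tSBJ\t_\t_\n",
    "2\tsleeps\t_\tV\tV\t_\t0\tROOT\t_\t_\n",
    "\n",
    "1\tMary\t_\tN\tN\t_\t2\tSBJ\t_\t_\n",
    "2\truns\t_\tV\tV\t_\t0\tROOT\t_\t_\n",
]
labels = collect_deprels(sample)
```

The labels must come from the original treebank, not from the projectivized one, since the projectivized file can contain labels that encode the transformations.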

The marking strategies

nonProj2Proj can be used for preprocessing the training data for MaltParser, and the parser output can be postprocessed using proj2NonProj. The program MaltEval is well-suited for evaluation of the output.
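To judge how much a treebank is affected by the preprocessing, it can help to count its non-projective sentences first. The sketch below uses the usual definition of projectivity (an arc is projective if every token between its head and its dependent is dominated by the head). It is not taken from nonProj2Proj itself. It assumes each sentence is a dict mapping token positions to head positions, with 0 as the artificial root, and the function names are hypothetical.

```python
def is_dominated_by(heads, token, ancestor):
    """True if `ancestor` is `token` itself or lies on its path to the root."""
    seen = set()
    while token != 0 and token not in seen:
        if token == ancestor:
            return True
        seen.add(token)  # guards against cycles in malformed input
        token = heads[token]
    return token == ancestor  # only true here when ancestor is the root 0


def is_projective(heads):
    """True if every arc in the sentence is projective."""
    for dep, head in heads.items():
        lo, hi = min(dep, head), max(dep, head)
        for between in range(lo + 1, hi):
            if not is_dominated_by(heads, between, head):
                return False
    return True


def count_non_projective(sentences):
    """Number of sentences containing at least one non-projective arc."""
    return sum(1 for heads in sentences if not is_projective(heads))


projective = {1: 2, 2: 0, 3: 2}
# The arc 3 -> 1 spans token 2, which is not dominated by token 3.
non_projective = {1: 3, 2: 0, 3: 2}
total = count_non_projective([projective, non_projective])
```

Here `total` is 1, because only the second sentence contains a non-projective arc.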

Reference

Nivre, J. and Nilsson, J. (2005) Pseudo-Projective Dependency Parsing. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 99-106.