Download nonProj2Proj 0.2 and proj2NonProj 0.2
- NB: These tools have been superseded by the new version of MaltParser, which includes an implementation of pseudo-projective parsing. The old tools are only maintained for reproducibility of old results.
- [Download nonProj2Proj.jar]: the program to projectivize a treebank in MaltXML.
- [Download proj2NonProj.jar]: the program to deprojectivize a treebank in MaltXML.
The software can be used freely for non-commercial research and educational
purposes. It comes with no warranty, but we welcome all comments, bug reports,
and suggestions for improvements.
To run the programs you need the Java Runtime Environment (JRE) 5.0.
User guide for nonProj2Proj 0.2 and proj2NonProj 0.2
The jar file nonProj2Proj.jar takes a file of dependency trees encoded either in Malt-XML or in the format defined by the organizers of the CoNLL-X Shared Task 2006. Using graph transformations, it converts all non-projective sentences into projective ones (Nivre and Nilsson, 2005). Several marking strategies are available for encoding the transformations (see below).
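To make the notion of projectivity concrete: an arc is projective if its head dominates every token lying between the head and the dependent. The following is a minimal Python sketch of that check, not code taken from nonProj2Proj.jar. The `heads` list representation (index 0 standing for the artificial root) and the function names are assumptions made for illustration.

```python
def arc_is_projective(heads, dep):
    """Check one arc. heads[i] is the head of token i (tokens are 1-based);
    heads[0] is a placeholder and the value 0 denotes the artificial root.
    Assumes the input is a well-formed tree (no cycles)."""
    head = heads[dep]
    lo, hi = min(head, dep), max(head, dep)
    for k in range(lo + 1, hi):
        # Walk upwards from k until we meet the arc's head or the root.
        a = k
        while a != head and a != 0:
            a = heads[a]
        if a != head:
            return False  # token k is not dominated by the arc's head
    return True


def is_projective(heads):
    """A sentence is projective if all of its arcs are projective."""
    return all(arc_is_projective(heads, d) for d in range(1, len(heads)))


# Token 1 depends on token 3, but token 2 (the root's child) lies in between
# and is not dominated by token 3, so the sentence is non-projective.
example = [0, 3, 0, 2, 2]
```

Under this sketch, the filter marking strategy listed further down would keep only sentences for which `is_projective` returns True.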
usage: java -Xmx512m -jar nonProj2Proj.jar [options] <non-proj file (in Malt-XML or CoNLL-TAB)> <marking strategy (0|1|4|6|filter)> <file format (maltxml|conll)> [charSetName (default=ISO-8859-1)]
options:
covered roots: -cr (0|1|2|3); 0=no attachment (default), 1=root attachment, 2=left attachment, 3=right attachment
If the input treebank file was encoded in MaltXML, three files are output:
- A file containing the projectivized sentences in MaltXML
- A file containing the projectivized sentences in MaltTab
- A file containing all old and newly created dependency relations given the selected marking strategy
If the input treebank file was encoded in the CoNLL format, two files are output:
- A file containing the projectivized sentences in the CoNLL format
- A file containing all old and newly created dependency relations given the selected marking strategy
The covered roots option (the -cr flag) connects unconnected sub-trees to the shortest arc that covers all tokens of the sub-tree. This can help when the data contains, for example, dangling punctuation in the middle of a sentence. The flag takes four values: 1, 2, and 3 attach the sub-tree to the root, left, and right of the shortest covering arc, respectively, while 0 does nothing.
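One possible reading of the covered roots option is sketched below in Python. It is an illustration under stated assumptions, not the tool's actual code: "root" of the covering arc is interpreted as the arc's head, "left" and "right" as its leftmost and rightmost endpoint, arcs leaving the artificial root are not counted as covering arcs, and the `heads` list representation and function names are hypothetical.

```python
def subtree_span(heads, root):
    """Return (leftmost, rightmost) token position of the sub-tree at `root`."""
    lo = hi = root
    for tok in range(1, len(heads)):
        a = tok
        while a != 0 and a != root:
            a = heads[a]
        if a == root:
            lo, hi = min(lo, tok), max(hi, tok)
    return lo, hi


def attach_covered_roots(heads, mode):
    """mode mirrors -cr: 0 = do nothing, 1 = root (assumed: head of the arc),
    2 = left endpoint, 3 = right endpoint of the shortest covering arc.
    heads[i] is the head of token i (1-based); 0 is the artificial root."""
    heads = list(heads)
    if mode == 0:
        return heads
    for tok in range(1, len(heads)):
        if heads[tok] != 0:
            continue  # only unconnected sub-trees (attached to 0) are candidates
        lo, hi = subtree_span(heads, tok)
        covering = [
            (abs(h - d), h, d)
            for d, h in enumerate(heads)
            if d > 0 and h > 0 and min(h, d) < lo and max(h, d) > hi
        ]
        if not covering:
            continue  # a genuine root: no arc spans this sub-tree
        _, h, d = min(covering)  # shortest covering arc
        heads[tok] = {1: h, 2: min(h, d), 3: max(h, d)}[mode]
    return heads


# Token 3 (e.g. a dangling comma) is unattached but covered by the arc 2 -> 4.
dangling = [0, 2, 0, 0, 2]
```

Because both endpoints of a covering arc lie outside the sub-tree's span, none of the three attachment choices can introduce a cycle in this sketch.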
The optional last argument (charSetName) is only used when the input file is in the CoNLL format. If the input data comes from the CoNLL Shared Task, this argument should be UTF-8.
The flag -Xmx512m is optional; it sets the maximum Java heap size to 512 MB, which may help the JRE cope with larger XML files. For more details, see the documentation for Java JRE 5.0.
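The usage line above can be assembled programmatically, for instance when projectivizing several treebanks from a script. The Python helper below is a hypothetical wrapper, not part of the distribution. It only uses the arguments named in the usage line, and it assumes that the -cr value is passed as a separate, space-delimited argument and that nonProj2Proj.jar sits in the working directory. The file name `train.conll` is a placeholder.

```python
STRATEGIES = {"0", "1", "4", "6", "filter"}
FORMATS = {"maltxml", "conll"}


def nonproj2proj_command(input_file, strategy, file_format,
                         charset=None, covered_roots=None, max_heap="512m"):
    """Build the argument list for nonProj2Proj.jar following its usage line."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown marking strategy: {strategy!r}")
    if file_format not in FORMATS:
        raise ValueError(f"unknown file format: {file_format!r}")
    if covered_roots is not None and covered_roots not in (0, 1, 2, 3):
        raise ValueError("covered roots must be 0, 1, 2 or 3")

    cmd = ["java"]
    if max_heap:                       # optional heap-size flag
        cmd.append(f"-Xmx{max_heap}")
    cmd += ["-jar", "nonProj2Proj.jar"]
    if covered_roots is not None:      # options precede the positional arguments
        cmd += ["-cr", str(covered_roots)]
    cmd += [input_file, strategy, file_format]
    if charset:                        # only meaningful for CoNLL input
        cmd.append(charset)
    return cmd


# CoNLL shared-task data, Head encoding, covered roots attached to the root.
cmd = nonproj2proj_command("train.conll", "1", "conll",
                           charset="UTF-8", covered_roots=1)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the jar and the input file are in place.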
The jar file proj2NonProj.jar does the opposite of nonProj2Proj.jar: it takes a file containing projectivized trees and deprojectivizes them. To perform some consistency checks, the program requires a file containing all the non-projective (original) dependency relations, one dependency relation per line (example). The program also cannot detect which marking strategy was used for the projective input, so the user must specify the deprojectivization strategy to use.
usage: java -jar proj2NonProj.jar <proj. file (in Malt-XML or CoNLL-TAB)> <non-proj. deprel file> <marking strategy (0|1|4|6)> <file format (maltxml|conll)> [charSetName (default=ISO-8859-1)]
If the input treebank file was encoded in MaltXML, two files are output:
- A file containing the deprojectivized sentences in MaltXML
- A file containing the deprojectivized sentences in MaltTab
If the input treebank file was encoded in the CoNLL format, one file is output:
- A file containing the deprojectivized sentences in the CoNLL format
The marking strategies
- 0: corresponds to the baseline encoding (Nivre and Nilsson, 2005).
- 1: corresponds to the encoding Head (Nivre and Nilsson, 2005).
- 4: corresponds to the encoding Head + Path (Nivre and Nilsson, 2005).
- 6: corresponds to the encoding Path (Nivre and Nilsson, 2005).
- filter: only applicable to nonProj2Proj; it simply removes all non-projective sentences.
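The difference between the baseline and Head encodings can be illustrated with a minimal Python sketch of the lifting procedure described by Nivre and Nilsson (2005): the shortest non-projective arc is repeatedly re-attached to the head of its head until the tree is projective. This is not the code in nonProj2Proj.jar. The `heads`/`labels` list representation, the function names, and the `^` separator (standing in for the upward arrow used in the paper) are assumptions, and the label format the tool actually writes may differ. Only strategies 0 and 1 are sketched; Path and Head + Path additionally mark the arcs along the lifting path.

```python
LIFT_MARK = "^"  # hypothetical separator; the paper uses an upward arrow


def arc_is_projective(heads, dep):
    """True if the head of `dep` dominates every token between the two.
    heads[i] is the head of token i (1-based); 0 is the artificial root."""
    head = heads[dep]
    for k in range(min(head, dep) + 1, max(head, dep)):
        a = k
        while a != head and a != 0:
            a = heads[a]
        if a != head:
            return False
    return True


def projectivize(heads, labels, encoding="head"):
    """Lift non-projective arcs until the tree is projective.
    encoding="baseline" keeps the original labels (strategy 0);
    encoding="head" relabels a lifted arc as d^h, where d is its own label
    and h is the label of its original head (strategy 1)."""
    orig_heads, orig_labels = list(heads), list(labels)
    heads, labels = list(heads), list(labels)
    while True:
        nonproj = [d for d in range(1, len(heads))
                   if not arc_is_projective(heads, d)]
        if not nonproj:
            return heads, labels
        # Shortest non-projective arc first; ties go to the leftmost arc.
        d = min(nonproj, key=lambda t: (abs(heads[t] - t), min(heads[t], t)))
        heads[d] = heads[heads[d]]  # lift: attach to the head's head
        if encoding == "head":
            labels[d] = orig_labels[d] + LIFT_MARK + orig_labels[orig_heads[d]]


# Token 1 (label A) depends on token 3 (label B) across the root's child 2.
heads_in = [0, 3, 0, 2, 2]
labels_in = ["", "A", "ROOT", "B", "C"]
```

An arc leaving the artificial root is always projective under this check, so every arc selected for lifting has a real head and the lift is always defined. With the baseline encoding the lifted tree carries no trace of the original head, which is why deprojectivization can only recover the original arcs when one of the other encodings is used.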
nonProj2Proj can be used for preprocessing the training data for MaltParser, and the parser output can be postprocessed using proj2NonProj. The program MaltEval is well-suited for evaluation of the output.
Reference
Nivre, J. and Nilsson, J. (2005)
Pseudo-Projective Dependency Parsing.
In Proceedings of the 43rd Annual Meeting of the Association for
Computational Linguistics (ACL), pp. 99-106.