Relation to older versions:
MaltParser 0.21 uses libTimbl, part of TiMBL (Tilburg Memory-Based Learner), Version 5.1, and LIBSVM, Version 2.8, in order to learn parsing models from treebanks, and we gratefully acknowledge the use of these software packages. However, MaltParser 0.21 is a standalone application, so there is no need to install either TiMBL or LIBSVM separately.
> ./maltparser -f file
where file is the name of an option file, specifying all the parameters needed. The parser can be run in two basic modes, learning (inducing a parsing model from a treebank) and parsing (using the parsing model to parse new data). In the current version of the parser, new data must be tokenized and part-of-speech tagged in the Malt-TAB format. The option file, which also specifies the parser mode, is described in detail below.
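As an illustration of the input the parser expects, the following Python sketch reads Malt-TAB-style text into sentences. It is not part of MaltParser. The column names (form, postag, head, deprel) come from the parameter table below; the layout (one token per line, tab-separated columns, a blank line between sentences) and the function name `read_malt_tab` are assumptions made for this example.

```python
def read_malt_tab(text):
    """Split Malt-TAB-style text into sentences of token dictionaries.

    Assumed layout (not confirmed by this guide): one token per line,
    tab-separated columns, sentences separated by a blank line.
    """
    columns = ("form", "postag", "head", "deprel")
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # Assumed sentence boundary: close the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"expected at least form and postag: {line!r}")
        # Parsing input has two columns, learning input has four;
        # zip() keeps only as many column names as there are fields.
        current.append(dict(zip(columns, fields)))
    if current:
        sentences.append(current)
    return sentences
```

Tokens read from parsing input carry only the keys form and postag, while tokens read from learning input also carry head and deprel.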
$PARAMETER$
VALUE
In addition, the option file may contain comment lines starting with "--". The following table lists all the available parameters with their permissible values. Default values are marked with "*". Parameters that lack a default value must be specified in the option file (if they are required by the particular configuration of modules invoked). An example option file can be found here.
| I/O Parameters | Description | Values | Description |
| INFILE | Input file | Filename | The input (for both learning and parsing) must be in the Malt-TAB format. During learning the four columns form, postag, head, deprel are required; during parsing only the first two (form, postag) are required. An example input file can be found here. |
| OUTFILE | Output file | Filename | |
| OUTFORMAT | Output data format | TAB, MALTXML*, TIGERXML | TAB: Malt-TAB; MALTXML: Malt-XML; TIGERXML: TIGER-XML |
| VERBOSE | Output to terminal | YES*, NO | |
| MAXSENTENCELENGTH | Maximum number of tokens per sentence | Integer | Default = 512 |
| MAXTOKENLENGTH | Maximum number of characters per token | Integer | Default = 256 |
| Tagset Parameters | Description | Values | Description |
| POSSET | Part-of-speech tagset | Filename | The part-of-speech tagset must be specified in a text file with one tag per line (and no blank lines). An example file can be found here. |
| DEPSET | Dependency type tagset | Filename | The dependency type tagset must be specified in a text file with one tag per line (and no blank lines). The first tag must be the tag assigned to root nodes. An example file can be found here. |
| Parser Parameters | Description | Values | Description |
| MODE | Mode (learning or parsing) | PARSE*, LEARN | PARSE: parsing (using an induced model to parse new data); LEARN: learning (inducing a model from treebank data) |
| ALGORITHM | Parsing algorithm (see description below) | NIVRE*, COVINGTON | NIVRE: Nivre (2003, 2004); COVINGTON: Covington (2001) (incremental) |
| PARSEROPTIONS | Parser options (algorithm specific) | -a [ES] | Arc order (NIVRE): E(ager), S(tandard) |
| | | -g [NP] | Graph condition (COVINGTON): N(on-Projective), P(rojective) |
| Guide Parameters | Description | Values | Description |
| MAXFEATURES | Maximum number of features of each type | Integer | Default = 30 |
| FEATURES | Feature model specification (see description below) | Filename | Model specified in Filename.par (If no feature model specification can be loaded, a default specification equivalent to m3.par is used.) |
| Learner Parameters | Description | Values | Description |
| LEARNER | Learner type (see description below) | MBL*, SVM | MBL: memory-based learning (TiMBL); SVM: support vector machine (LIBSVM) |
| LEARNEROPTIONS | Parameter settings (learner specific) | String | TiMBL example: "-m M -k 5 -w 0 -d ID -L 3" (see TiMBL Documentation); LIBSVM example: "-t 0" (see LIBSVM Documentation) |
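The option file format described above (a parameter name enclosed in "$" on one line, its value on the next, and comment lines starting with "--") can be read with a few lines of code. The following Python sketch is not part of MaltParser; the function name `parse_option_file` is hypothetical, and skipping blank lines is an assumption made here.

```python
def parse_option_file(text):
    """Return a dict mapping parameter names to their values.

    Sketch only: expects alternating $PARAMETER$ and VALUE lines,
    with "--" comment lines (and, by assumption, blank lines) ignored.
    """
    options = {}
    pending = None  # parameter name still waiting for its value line
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("--"):
            continue  # skip blank lines and "--" comments
        if pending is None:
            is_name = len(line) > 2 and line.startswith("$") and line.endswith("$")
            if not is_name:
                raise ValueError(f"expected $PARAMETER$, got {line!r}")
            pending = line[1:-1]  # strip the surrounding "$"
        else:
            options[pending] = line
            pending = None
    if pending is not None:
        raise ValueError(f"parameter {pending} has no value")
    return options


# Hypothetical option file; the file name train.tab is made up.
example = "-- learning configuration\n$MODE$\nLEARN\n$INFILE$\ntrain.tab\n$LEARNER$\nMBL\n"
options = parse_option_file(example)
```

Values are kept as strings, so a setting such as LEARNEROPTIONS is passed on unchanged. Defaults marked with "*" in the table are not filled in by this sketch.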
<fspec> ::= <feat>+
<feat> ::= <lfeat> | <nlfeat>
<lfeat> ::= LEX \t <dstruc> \t <off> \t <suff> \n
<nlfeat> ::= (POS|DEP) \t <dstruc> \t <off> \n
<dstruc> ::= (STACK|INPUT|CONTEXT)
<off> ::= <nnint> \t <int> \t <nnint> \t <int> \t <int>
<suff> ::= <nnint>
<int> ::= (...|-2|-1|0|1|2|...)
<nnint> ::= (0|1|2|...)
As syntactic sugar, any <lfeat> or <nlfeat>
can be truncated if all remaining integer values are zero. An example feature specification can
be found here.
Each feature is specified on a single line, consisting of at least two tab-separated
columns. The first column defines the feature type to be lexical (LEX),
part-of-speech (POS), or dependency (DEP).
The second column identifies one of the main data structures in the
parser configuration, usually the stack (STACK) or the list of remaining input tokens (INPUT),
as the "base address" of the feature. (The third alternative, CONTEXT, is relevant only
together with Covington's algorithm in non-projective mode.)
The actual address is then specified by a series of
"offsets" with respect to the base address as follows:
POS STACK 0 0 0 0 0
POS INPUT 1 0 0 0 0
POS INPUT 0 -1 0 0 0
DEP STACK 0 0 1 0 0
DEP STACK 0 0 0 -1 0

The feature defined on the first line is simply the part-of-speech of the token on top of the stack (TOP). The second feature is the part-of-speech of the token immediately after the next input token in the input list (NEXT), while the third feature is the part-of-speech of the token immediately before NEXT in the original input string (which may no longer be present in either the INPUT list or the STACK). The fourth feature is the dependency type of the head of TOP (zero steps down the stack, zero steps forward/backward in the input string, one step up to the head). The fifth and final feature is the dependency type of the leftmost dependent of TOP (zero steps down the stack, zero steps forward/backward in the input string, zero steps up through heads, one step down to the leftmost dependent). Using the syntactic sugar of truncating all remaining zeros, these five features can also be specified more succinctly:
POS STACK
POS INPUT 1
POS INPUT 0 -1
DEP STACK 0 0 1
DEP STACK 0 0 0 -1

The only difference between lexical and non-lexical features is that the specification of lexical features may contain an eighth column specifying a suffix length n. By convention, if n = 0, the entire word form is included; otherwise only the n last characters are included in the feature value. Thus, the following specification defines a feature the value of which is the four-character suffix of the word form of the next left sibling of the rightmost dependent of the head of the token immediately below TOP.
LEX STACK 1 0 1 1 -1 4

Finally, it is worth noting that if any of the offsets is undefined in a given configuration, the feature is automatically assigned a null value.
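To make the truncation and suffix conventions concrete, here is a minimal Python sketch that reads one line of a feature specification according to the grammar above. It is an illustration only, not MaltParser's own code: the names `Feature`, `parse_feature` and `lexical_value` are hypothetical. It does not resolve addresses against a parser configuration.

```python
from typing import NamedTuple, Tuple


class Feature(NamedTuple):
    ftype: str                               # LEX, POS or DEP
    base: str                                # STACK, INPUT or CONTEXT
    offsets: Tuple[int, int, int, int, int]  # the five offsets of <off>
    suffix: int = 0                          # suffix length n (LEX only)


def parse_feature(line):
    """Parse one tab-separated feature line, restoring truncated zeros."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 2:
        raise ValueError("a feature needs at least two columns")
    ftype, base = cols[0], cols[1]
    if ftype not in ("LEX", "POS", "DEP"):
        raise ValueError(f"unknown feature type: {ftype}")
    if base not in ("STACK", "INPUT", "CONTEXT"):
        raise ValueError(f"unknown data structure: {base}")
    max_ints = 6 if ftype == "LEX" else 5  # LEX adds the suffix column
    ints = [int(c) for c in cols[2:]]
    if len(ints) > max_ints:
        raise ValueError("too many columns")
    ints += [0] * (max_ints - len(ints))   # undo the syntactic sugar
    suffix = ints[5] if ftype == "LEX" else 0
    # The grammar requires <nnint> for offsets 1 and 3 and for the suffix.
    if ints[0] < 0 or ints[2] < 0 or suffix < 0:
        raise ValueError("negative value where a non-negative one is required")
    return Feature(ftype, base, tuple(ints[:5]), suffix)


def lexical_value(form, suffix):
    """Whole word form if suffix == 0, otherwise its last `suffix` characters."""
    return form if suffix == 0 else form[-suffix:]
```

With this sketch, the truncated line "POS INPUT 0 -1" (tab-separated) yields the same offsets as the full line "POS INPUT 0 -1 0 0 0", and a suffix length of 4 applied to the word form "parsing" gives "sing".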
The following table shows three of the models provided with Version 0.1 of MaltParser (there called MBL2, MBL3 and MBL4 because MBL was the only learner type supported in that version). For each model we also give a link to the feature specification for that model.
| Models | Top | Next | T | N | TH | TL | TR | NL | TH | TL | TR | NL | L1 | L2 | L3 | Feature specification |
| M2 | + | + | + | + | + | + | + | m2.par | ||||||||
| M3 | + | + | + | + | + | + | + | + | + | m3.par | ||||||
| M4 | + | + | + | + | + | + | + | + | + | + | + | m4.par |