Stochastic Dependency Grammars for
Natural Language Parsing
The purpose of this project is to study stochastic dependency grammars
from three different perspectives:
- The formal perspective: What properties do
stochastic dependency grammars have with respect to expressive power
and complexity of parsing algorithms?
- The machine learning perspective: How can
stochastic dependency grammars be induced from linguistic corpus data
using supervised and/or unsupervised machine learning algorithms?
- The language technology perspective: How can stochastic dependency
grammars be used in practical parsing systems for natural language,
in particular for parsing unrestricted Swedish text?
The project is funded by the Swedish Research Council (Vetenskapsrådet
621-2002-4207).
Participants
Resources and Tools
- Malt-XML and Malt-TAB (Representation formats for dependency treebanks)
- MaltConverter (Conversion between different dependency treebank formats)
- MaltEval (Evaluation tool for taggers, parsers and treebanks using Malt-XML)
- MaltParser (Data-driven dependency parser)
- Proj (Pre- and post-processing tools for pseudo-projective parsing with MaltParser)
- Talbanken (A Swedish treebank from the 1970s)
Publications
- Nivre, J. (2002) Two Models of Stochastic Dependency Grammar.
MSI Report 02118. Växjö University: School of Mathematics and Systems
Engineering.
- Nivre, J. and Nilsson, J. (2003) Three Algorithms for Deterministic
Dependency Parsing. To be presented at NODALIDA-2003.
- Nivre, J. (2003) Optimizing a Deterministic Dependency Parser
for Unrestricted Swedish Text. In Proceedings of Promote IT,
Gotland University, 3-5 May 2003.
- Nivre, J. (2003) An Efficient Algorithm for Projective Dependency
Parsing. In Proceedings of the 8th International Workshop on Parsing
Technologies (IWPT 03), Nancy, France, 23-25 April 2003, pp. 149-160.
- Nivre, J. (2004) Inductive Dependency Parsing. MSI Report 04070.
Växjö University: School of Mathematics and Systems Engineering.
- Nivre, J., Hall, J. and Nilsson, J. (2004) Memory-Based Dependency
Parsing. In Ng, H. T. and Riloff, E. (eds.) Proceedings of the Eighth
Conference on Computational Natural Language Learning (CoNLL),
May 6-7, 2004, Boston, Massachusetts, pp. 49-56.
- Nivre, J. and Nilsson, J. (2004) Multiword Units in Syntactic
Parsing. In Dias, G., Lopes, J. G. P. and Vintar, S. (eds.)
MEMURA 2004 - Methodologies and Evaluation of Multiword Units in
Real-World Applications, Workshop at LREC 2004, May 25, 2004,
Lisbon, Portugal, pp. 39-46.
- Nivre, J. (2004) Bootstrapping Lexical Models in Deterministic
Dependency Parsing. MSI Report 04071. Växjö University: School of
Mathematics and Systems Engineering.
- Nivre, J. (2004) Incrementality in Deterministic Dependency Parsing.
In Incremental Parsing: Bringing Engineering and Cognition Together.
Workshop at ACL-2004, Barcelona, Spain, July 25, 2004.
- Nivre, J. and Scholz, M. (2004) Deterministic Dependency Parsing of
English Text. In Proceedings of COLING 2004, Geneva, Switzerland,
August 23-27, 2004.
- Nivre, J. (2005) Bootstrapping Lexical Models in Deterministic
Dependency Parsing. In Proceedings of Promote IT 2005.
Studentlitteratur, pp. 327-336.
- Nivre, J. and Nilsson, J. (2005) Pseudo-Projective Dependency Parsing.
In Proceedings of the 43rd Annual Meeting of the Association for
Computational Linguistics (ACL), pp. 99-106.
Degree Projects
- Deterministic dependency parsing of unrestricted English text:
The project consists in evaluating a family of deterministic parsing
algorithms by parsing the Wall Street Journal section of the Penn Treebank.
The parsing algorithms have previously been evaluated in parsing Swedish
text with good results (Nivre and Nilsson 2003, Nivre 2003). The goal of
this project is to see whether the results carry over to another language.
The work can be divided into three steps. The first step is to convert
the Penn Treebank data from phrase structures to dependency structures.
The second step is to develop an English grammar based on the converted
treebank and the existing Swedish grammar. The third step is to apply the
parsing algorithms to the converted treebank using the English grammar
and evaluate the results in relation to previously published results
for English dependency parsing. [Mario Scholz]
- Tokenization and sentence splitting for dependency parsing:
Tokenization and sentence splitting, i.e. the segmentation of natural
language text into tokens (words and punctuation marks) and into
sentences, is a task of underestimated complexity, which has important
implications for the overall quality of a parsing system. The goal of
this project is to investigate different techniques for tokenization
and sentence splitting and to integrate them into the MALT parser
developed at Växjö University. Ideally, this should lead to a generic
model of tokenization and sentence splitting, with a clear specification
of dependencies to other components in the system.
[Staffan Hermansson]
- Assignment of grammatical relations in dependency parsing:
The MALT parser developed at Växjö University constructs a dependency
graph for each input string and assigns grammatical functions as labels
to each dependency arc. However, because the guided parsing algorithm
is limited to strictly local information, many of these labels will be
incorrect. In particular, there will be global inconsistencies, such as
one clause having more than one subject. The goal of this project is
to investigate how a combination of linguistically motivated rules and
inductive machine learning can be used to improve the accuracy with
which grammatical functions are assigned in the dependency tree.
This will (at least initially) be applied in the form of a
post-processing phase that operates on the output of the existing
parser. [Stefan Jonasson]
- Named entity recognition for dependency parsing:
Multi-word names, such as "Växjö universitet" or "Länstrafiken i
Jämtlands län AB", are problematic in syntactic parsing, since they
do not obey ordinary rules for syntactic structure. The goal of this
project is to investigate how the recognition of these expressions,
often called "named entities", can be used to improve the accuracy
of the MALT dependency parser developed at Växjö University. The
class of expressions considered can be widened to include also other
so-called multi-word units, e.g. date expressions ("30 januari 2004")
and complex prepositions ("på grund av").
[Open]
- Deterministic dependency parsing using support vector machines:
The MALT dependency parser developed at Växjö University is a deterministic
parser that relies on data-driven classifiers to predict the next parsing
action. Currently, the best performing parser uses memory-based learning
to induce these classifiers from a treebank. The goal of this project is
to investigate the use of support vector machines in the learning phase
in order to achieve higher parsing accuracy, greater parsing speed
or both. [Open]
- Semi-deterministic dependency parsing:
The MALT dependency parser developed at Växjö University is strictly
deterministic. This is an advantage for efficiency but can be a
drawback for accuracy, because a deterministic parser cannot postpone
a structural decision until more information is available. The goal of
this project is to investigate different ways of introducing a mild
form of nondeterminism into the parsing process in order to improve
accuracy while maintaining efficiency. [Open]
- Dependency-based enhancement of translation memory:
Traditional translation memory systems do not include any information
about linguistic structure. Hence, they measure similarity between
sentences only based on string similarity. The goal of this project
is to see whether the precision with which sentences can be matched
(especially so-called fuzzy matches) can be improved by incorporating
information about dependency structure. [Open]
- Dependency-based vector space models of linguistic meaning:
Traditional vector space models of linguistic meaning, as used for example
in information retrieval, are based solely on statistics of cooccurrence
between words and documents or words and words. The goal of this project
is to investigate whether the quality of such models can be improved
using information about dependency structure. [Open]
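
Several of the projects above build on a deterministic parser in which a
classifier predicts the next parsing action. The following Python sketch
illustrates that idea under stated assumptions; it is not the MALT parser
itself. It uses one common formulation of a stack-based transition system
(shift, left-arc, right-arc, reduce), leaves out dependency labels, and
replaces the trained classifier with a hypothetical oracle that reads the
correct action off a gold-standard tree. The names parse, make_oracle and
Guide are introduced here for illustration only.

```python
from typing import Callable, Dict, List

# A guide maps a parser configuration (stack, buffer, heads so far)
# to one of the actions "SHIFT", "LEFT", "RIGHT", "REDUCE".
Guide = Callable[[List[int], List[int], Dict[int, int]], str]


def parse(n_tokens: int, guide: Guide) -> Dict[int, int]:
    """Parse tokens 1..n_tokens in one left-to-right pass.

    Token 0 is an artificial root. Returns a dict token -> head.
    """
    stack: List[int] = [0]
    buffer: List[int] = list(range(1, n_tokens + 1))
    heads: Dict[int, int] = {}
    while buffer:
        s, b = stack[-1], buffer[0]
        action = guide(stack, buffer, heads)
        if action == "LEFT" and s != 0 and s not in heads:
            heads[s] = b          # next input token becomes head of stack top
            stack.pop()
        elif action == "RIGHT":
            heads[b] = s          # stack top becomes head of next input token
            stack.append(buffer.pop(0))
        elif action == "REDUCE" and s in heads:
            stack.pop()           # stack top already has a head; discard it
        else:
            # SHIFT, also used as a fallback if a precondition fails
            stack.append(buffer.pop(0))
    return heads


def make_oracle(gold: Dict[int, int]) -> Guide:
    """Stand-in for a trained classifier: derive actions from a gold tree."""
    def guide(stack: List[int], buffer: List[int], heads: Dict[int, int]) -> str:
        s, b = stack[-1], buffer[0]
        if gold.get(s) == b:
            return "LEFT"
        if gold[b] == s:
            return "RIGHT"
        below = stack[:-1]
        # Reduce when the next token is linked to something deeper in the stack.
        if s in heads and any(gold[b] == k or gold.get(k) == b for k in below):
            return "REDUCE"
        return "SHIFT"
    return guide


words = ["Economic", "news", "had", "little", "effect", "."]
gold_heads = {1: 2, 2: 3, 3: 0, 4: 5, 5: 3, 6: 3}  # token -> head, 0 = root
predicted = parse(len(words), make_oracle(gold_heads))
for i, word in enumerate(words, start=1):
    head = "ROOT" if predicted[i] == 0 else words[predicted[i] - 1]
    print(f"{word} <- {head}")
```

In a data-driven setting the oracle would only be used to generate training
examples from a treebank, and a learned classifier (memory-based learning or
support vector machines, as in the projects above) would take the place of
the guide at parsing time. Because each action is final, a single wrong
prediction cannot be undone, which is the accuracy risk the
semi-deterministic project addresses.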