Stochastic Dependency Grammars for Natural Language Parsing

The purpose of this project is to study stochastic dependency grammars from three different perspectives.

The project is funded by the Swedish Research Council (Vetenskapsrådet 621-2002-4207).

Participants

Resources and Tools

Publications

Degree Projects

  1. Deterministic dependency parsing of unrestricted English text: The project consists in evaluating a family of deterministic parsing algorithms by parsing the Wall Street Journal section of the Penn Treebank. The parsing algorithms have previously been evaluated on Swedish text with good results (Nivre and Nilsson 2003, Nivre 2003). The goal of this project is to see whether those results carry over to English. The work can be divided into three steps. The first step is to convert the Penn Treebank data from phrase structures to dependency structures. The second step is to develop an English grammar based on the converted treebank and the existing Swedish grammar. The third step is to apply the parsing algorithms to the converted treebank using the English grammar and to evaluate the results against previously published results for English dependency parsing. [Mario Scholz]
  2. Tokenization and sentence splitting for dependency parsing: Tokenization and sentence splitting, i.e., the segmentation of natural language text into tokens (words and punctuation marks) and into sentences, is a task whose complexity is often underestimated and which has important implications for the overall quality of a parsing system. The goal of this project is to investigate different techniques for tokenization and sentence splitting and to integrate them into the MALT parser developed at Växjö University. Ideally, this should lead to a generic model of tokenization and sentence splitting, with a clear specification of its dependencies on other components in the system. [Staffan Hermansson]
  3. Assignment of grammatical relations in dependency parsing: The MALT parser developed at Växjö University constructs a dependency graph for each input string and assigns a grammatical function as a label to each dependency arc. However, because the guided parsing algorithm is limited to strictly local information, many of these labels will be incorrect. In particular, there will be global inconsistencies, such as one clause having more than one subject. The goal of this project is to investigate how a combination of linguistically motivated rules and inductive machine learning can be used to improve the accuracy with which grammatical functions are assigned in the dependency tree. This will (at least initially) take the form of a post-processing phase that applies to the output of the existing parser. [Stefan Jonasson]
  4. Named entity recognition for dependency parsing: Multi-word names, such as "Växjö universitet" or "Länstrafiken i Jämtlands län AB", are problematic in syntactic parsing, since they do not obey ordinary rules for syntactic structure. The goal of this project is to investigate how the recognition of these expressions, often called "named entities", can be used to improve the accuracy of the MALT dependency parser developed at Växjö University. The class of expressions considered can also be widened to include other so-called multi-word units, e.g. date expressions ("30 januari 2004", i.e. "30 January 2004") and complex prepositions ("på grund av", i.e. "because of"). [Open]
  5. Deterministic dependency parsing using support vector machines: The MALT dependency parser developed at Växjö University is a deterministic parser that relies on data-driven classifiers to predict the next parsing action. Currently, the best-performing parser uses memory-based learning to induce these classifiers from a treebank. The goal of this project is to investigate the use of support vector machines in the learning phase in order to achieve higher parsing accuracy, greater parsing speed, or both. [Open]
  6. Semi-deterministic dependency parsing: The MALT dependency parser developed at Växjö University is strictly deterministic. This is an advantage from the point of view of efficiency, but it can be a drawback from the point of view of accuracy, because some structural decisions need to be postponed until more information is available. The goal of this project is to investigate different ways of introducing a mild form of nondeterminism into the parsing process in order to improve accuracy while maintaining efficiency. [Open]
  7. Dependency-based enhancement of translation memory: Traditional translation memory systems do not include any information about linguistic structure. Hence, they measure similarity between sentences based only on string similarity. The goal of this project is to see whether the precision with which sentences can be matched (especially so-called fuzzy matches) can be improved by incorporating information about dependency structure. [Open]
  8. Dependency-based vector space models of linguistic meaning: Traditional vector space models of linguistic meaning, as used for example in information retrieval, are based solely on statistics of co-occurrence between words and documents or between words and words. The goal of this project is to investigate whether the quality of such models can be improved using information about dependency structure. [Open]
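
Several of the projects above (1, 5 and 6) revolve around deterministic parsing, where a guide predicts the next parsing action at each step. The Python sketch below illustrates the general idea only. It assumes a textbook arc-standard transition system, which is swapped in for illustration and is not claimed to be the algorithm used in the MALT parser. The names `parse`, `make_oracle` and the action labels are hypothetical, and the oracle that reads actions off a known tree merely stands in for a trained classifier (memory-based or SVM).

```python
from typing import Callable, Dict, List, Tuple

Arc = Tuple[int, int]  # (head, dependent); tokens are 1..n, 0 is an artificial root
Guide = Callable[[List[int], List[int], List[Arc]], str]


def parse(n_tokens: int, guide: Guide) -> List[Arc]:
    """Deterministic parse: one pass, one action per step, no backtracking.

    Assumption: arc-standard transitions (SHIFT, LEFT, RIGHT), chosen
    for illustration only.
    """
    stack: List[int] = [0]
    buffer: List[int] = list(range(1, n_tokens + 1))
    arcs: List[Arc] = []
    while buffer or len(stack) > 1:
        action = guide(stack, buffer, arcs)
        if action == "SHIFT" and buffer:
            # Move the next input token onto the stack.
            stack.append(buffer.pop(0))
        elif action == "LEFT" and len(stack) > 2:
            # Second-topmost token becomes a dependent of the topmost token.
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT" and len(stack) > 1:
            # Topmost token becomes a dependent of the token below it.
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
        else:
            raise ValueError(f"action {action!r} is not applicable here")
    return arcs


def make_oracle(heads: Dict[int, int]) -> Guide:
    """Hypothetical stand-in for a trained classifier.

    Reads the correct action off a known projective tree, given as a
    mapping from each token to its head.
    """
    def guide(stack: List[int], buffer: List[int], arcs: List[Arc]) -> str:
        attached = {dependent for _, dependent in arcs}
        if len(stack) > 2 and heads[stack[-2]] == stack[-1]:
            return "LEFT"
        if len(stack) > 1 and heads[stack[-1]] == stack[-2]:
            # Attach the top token only once it has collected all its dependents.
            if all(d in attached for d, h in heads.items() if h == stack[-1]):
                return "RIGHT"
        return "SHIFT"
    return guide


# Example sentence: "Economic news had little effect"
# Token indices:      1        2    3   4      5
heads = {1: 2, 2: 3, 3: 0, 4: 5, 5: 3}
arcs = parse(len(heads), make_oracle(heads))
print(sorted(arcs, key=lambda arc: arc[1]))
```

In a data-driven setting, the oracle would be used to generate training examples from a treebank, and the guide passed to `parse` would instead be a classifier over features of the stack, the buffer and the arcs built so far. Because each token is shifted once and attached once, the loop runs a linear number of steps in the sentence length.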