Introduction to Statistical Natural Language Processing

Before diving into the sea of probability theory and statistics, it may be good to have a basic idea of where we are going and what we want to achieve. In this introductory lecture, I will therefore try to give preliminary answers to the following two questions:

  1. What is statistics?
  2. How can it be used in natural language processing?

Statistics is a vast field, but from the point of view of natural language processing, three of its components are especially important:

  1. Probability theory: Mathematical theory of uncertainty (random experiments).
  2. Descriptive statistics: Methods for summarizing (large) datasets.
  3. Inferential statistics: Methods for drawing inferences from (large) datasets.

The use of statistics in natural language processing falls mainly into three categories:

  1. Processing: We may use probabilistic models or algorithms to process natural language input or output.
  2. Learning: We may use inferential statistics to learn from examples (corpus data). In particular, we may estimate the parameters of probabilistic models that can be used in processing.
  3. Evaluation: We may use statistics to assess the performance of language processing systems.

We can illustrate these three uses with part-of-speech tagging:

  1. Processing: A probabilistic part-of-speech tagger computes the most probable part-of-speech sequence for a given word sequence, using a probabilistic model M.
  2. Learning: The parameters of the model M used by the tagger can be estimated from corpus data using a variety of methods.
  3. Evaluation: The performance of the tagger can be evaluated by running it on a test data set and computing various statistical measures.
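The three steps above can be sketched in a few lines of Python. This is only an illustrative toy, not the tagger discussed in the course: it assumes a deliberately simple model M in which each word's tag is independent of its neighbours, so the most probable tag sequence is just the most probable tag for each word. The corpus, the tag names, and the function names (learn, tag, accuracy) are all hypothetical and chosen for the example.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpora: each sentence is a list of (word, tag) pairs.
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("chase", "VERB"), ("cats", "NOUN")],
]
TEST = [
    [("the", "DET"), ("cat", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]


def learn(corpus):
    """Learning: estimate P(tag | word) by relative frequency in the corpus."""
    counts = defaultdict(Counter)
    tag_totals = Counter()
    for sentence in corpus:
        for word, tag_label in sentence:
            counts[word][tag_label] += 1
            tag_totals[tag_label] += 1
    model = {
        word: {t: c / sum(tags.values()) for t, c in tags.items()}
        for word, tags in counts.items()
    }
    # Assumption: unseen words receive the overall most frequent tag.
    default_tag = tag_totals.most_common(1)[0][0]
    return model, default_tag


def tag(words, model, default_tag):
    """Processing: choose the most probable tag for each word under the model."""
    result = []
    for word in words:
        dist = model.get(word)
        result.append(max(dist, key=dist.get) if dist else default_tag)
    return result


def accuracy(test, model, default_tag):
    """Evaluation: proportion of test tokens that receive the correct tag."""
    correct = total = 0
    for sentence in test:
        words = [w for w, _ in sentence]
        gold = [t for _, t in sentence]
        predicted = tag(words, model, default_tag)
        correct += sum(p == g for p, g in zip(predicted, gold))
        total += len(gold)
    return correct / total


model, default_tag = learn(TRAIN)
print(tag(["the", "dog", "barks"], model, default_tag))  # ['DET', 'NOUN', 'VERB']
print(f"{accuracy(TEST, model, default_tag):.3f}")  # 0.833
```

On this toy data the only error is the unseen word "a", which falls back to the most frequent tag. A realistic tagger would typically condition on context as well, which is one reason the choice of model and estimation method matters.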

The rest of the course is organized as follows:

  1. Lectures 2-4 introduce the necessary concepts from probability theory and statistics.
  2. Lectures 5-9 deal with different areas of natural language processing, focusing on statistical methods in processing and learning.
  3. Lecture 10 is devoted to the use of statistics in evaluation.

Slides for lecture 1

Suggested Reading