Statistical Inference
Statistical inference is the science of making predictions or inferences
from finite sets of observations (samples) to potentially infinite sets
of new observations (called populations or models). We begin by defining
the fundamental notions of random sample and statistical material
for a stochastic variable X:
- A vector (X1, ..., Xn) of independent variables Xi
with the same distribution as X is said to be a random sample of X.
- A vector of values (x1, ..., xn) such that
Xi = xi in a particular experiment is called
a statistical material.
Given a random sample of a variable X, we can define new stochastic variables
that are functions of the sample, called sample variables:
- The sample mean: mn = (1/n) ∑i Xi
- The sample variance: sn² = (1/(n-1)) ∑i (Xi - mn)²
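The two sample variables above can be computed directly from a statistical material. The following is a minimal sketch; the function names and the data are made up for illustration and are not part of the course material.

```python
def sample_mean(xs):
    """Sample mean: (1/n) * sum of the observations."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Sample variance: 1/(n-1) * sum of squared deviations from the sample mean."""
    n = len(xs)
    if n < 2:
        raise ValueError("the sample variance needs at least two observations")
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (n - 1)


# Hypothetical statistical material (invented for this example).
material = [2, 4, 4, 5, 7, 8]
m_n = sample_mean(material)       # 30 / 6 = 5.0
s2_n = sample_variance(material)  # 24 / 5 = 4.8
```

Note the divisor n-1 rather than n in the variance; Python's standard `statistics.variance` uses the same convention.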
There are essentially two kinds of statistical inference:
- Estimation: Use samples and sample variables to predict
population variables.
- Hypothesis testing: Use samples and sample variables to
test hypotheses about populations and population variables.
Estimation can in turn be divided into two main types:
- Point estimation: Use a sample variable f(X1, ..., Xn)
to estimate a parameter φ.
- Interval estimation: Use sample variables
f1(X1, ..., Xn) and
f2(X1, ..., Xn) to construct
an interval such that
P(f1(X1, ..., Xn) < φ
< f2(X1, ..., Xn)) = p, where
p is the confidence level adopted.
The most common method used for (point) estimation is maximum likelihood
estimation (MLE), which consists in choosing the estimate that maximizes
the probability of the statistical material. Formally:
- Given a statistical material (x1, ..., xn) and
a set of parameters θ, the likelihood function L is:
L(x1, ..., xn, θ) =
∏i Pθ(xi)
where Pθ(xi) is the probability that the
variable Xi assumes the value xi given a set
of values for the parameters in θ.
- Maximum likelihood estimation means choosing the value of θ
for which the likelihood function is maximized:
argmaxθ L(x1, ..., xn, θ)
Special cases of MLE:
- The sample mean of X is an MLE of E[X].
- The relative frequency of x is an
MLE of P(X = x).
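To make the two special cases concrete, here is a small sketch for a discrete variable. It computes relative frequencies, checks that they give a higher likelihood than two arbitrary alternative distributions, and shows that the expectation under the relative frequencies equals the sample mean. All names and data are hypothetical, and comparing against two alternatives only illustrates the maximization; it does not prove it.

```python
from collections import Counter
from math import prod


def relative_frequencies(material):
    """Relative frequency of each observed value."""
    counts = Counter(material)
    n = len(material)
    return {value: count / n for value, count in counts.items()}


def likelihood(material, theta):
    """Product over i of P_theta(x_i); theta maps each value to its probability."""
    return prod(theta.get(x, 0.0) for x in material)


# Hypothetical statistical material over the values a, b, c.
material = ["a", "b", "a", "a", "c", "b", "a", "a"]
mle = relative_frequencies(material)  # a: 5/8, b: 2/8, c: 1/8

# Two arbitrary alternative parameter settings for comparison.
uniform = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
skewed = {"a": 0.5, "b": 0.3, "c": 0.2}

# For numeric data, the expectation under the relative frequencies
# coincides with the sample mean.
numeric = [1, 2, 2, 3, 2]
freqs = relative_frequencies(numeric)
plug_in_mean = sum(value * p for value, p in freqs.items())
sample_mean = sum(numeric) / len(numeric)
```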
MLE is a good solution to the estimation problem if the statistical
material is large enough. In practice, MLE is often suboptimal because
of sparse data: the relative frequency estimate, for example, assigns
probability zero to every value that happens not to occur in the sample.
Practical solutions to the estimation problem therefore often
use MLE as a starting point, applying more or less sophisticated
smoothing methods in order to improve the quality of the estimate.
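As one simple illustration of smoothing, the sketch below compares the relative frequency estimate with add-one (Laplace) smoothing. Add-one smoothing is chosen here only as an example; the notes do not single out a particular method, and the outcome set and data are invented.

```python
from collections import Counter


def mle_estimate(material, outcomes):
    """Relative frequency of each possible outcome (zero if unseen)."""
    counts = Counter(material)
    n = len(material)
    return {o: counts[o] / n for o in outcomes}


def add_one_estimate(material, outcomes):
    """Add-one smoothing: act as if every outcome had been seen once more."""
    counts = Counter(material)
    n = len(material)
    k = len(outcomes)
    return {o: (counts[o] + 1) / (n + k) for o in outcomes}


# Hypothetical setting: four possible outcomes, but "d" never occurs
# in this small statistical material.
outcomes = ["a", "b", "c", "d"]
material = ["a", "b", "a", "a", "c", "b"]

mle = mle_estimate(material, outcomes)           # mle["d"] is 0.0
smoothed = add_one_estimate(material, outcomes)  # smoothed["d"] is 1/10
```

The smoothed estimate still sums to one, but it moves some probability mass from the observed outcomes to the unseen one.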
The basic reasoning underlying most statistical hypothesis tests can
be summarized as follows:
- Choose a test statistic t whose distribution is known when the
null hypothesis is true.
- Use t to calculate the probability p of observing a value of t at least
as extreme as the one actually observed, given that the null hypothesis is true.
- If p < α, reject the null hypothesis, where α is the
significance level adopted.
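The three steps can be sketched for a simple case: testing whether a coin is fair. The test statistic is the number of heads in n tosses, which is binomially distributed under the null hypothesis. The example, including the counts and the choice of a two-sided test, is an assumption made for illustration only.

```python
from math import comb


def binomial_pmf(k, n, q):
    """Probability of exactly k successes in n trials with success probability q."""
    return comb(n, k) * q**k * (1 - q) ** (n - k)


def two_sided_p_value(heads, n, q=0.5):
    """Probability, under the null hypothesis, of a count at least as far
    from the expected value n*q as the observed count (symmetric case q = 0.5)."""
    observed_distance = abs(heads - n * q)
    return sum(
        binomial_pmf(k, n, q)
        for k in range(n + 1)
        if abs(k - n * q) >= observed_distance
    )


# Hypothetical experiment: 16 heads in 20 tosses.
n, heads = 20, 16
alpha = 0.05                     # significance level adopted
p = two_sided_p_value(heads, n)  # 12392 / 2**20, roughly 0.012
reject = p < alpha               # the null hypothesis "the coin is fair" is rejected
```

With 12 heads instead of 16, the same function gives a p-value of about 0.50, so the null hypothesis would not be rejected at this level.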
Essentially the same reasoning can be used for interval estimation,
where the idea is to make the width of the interval such that the true
value of the parameter is inside the interval with probability p,
where p is the adopted confidence level (typically 0.95 or 0.99).
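As a rough sketch of interval estimation, the code below builds an approximate confidence interval for E[X] around the sample mean. It assumes a large-sample normal approximation, which is one common choice and not necessarily the method used in the lecture; the simulated data and function name are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(material, p=0.95):
    """Approximate level-p confidence interval for E[X],
    assuming the sample mean is roughly normally distributed."""
    n = len(material)
    m = mean(material)
    s = stdev(material)                    # square root of the sample variance
    z = NormalDist().inv_cdf((1 + p) / 2)  # about 1.96 for p = 0.95
    half_width = z * s / sqrt(n)
    return m - half_width, m + half_width


# Simulated statistical material: 200 sums of two dice (fixed seed).
rng = random.Random(0)
material = [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(200)]

low95, high95 = mean_confidence_interval(material, p=0.95)
low99, high99 = mean_confidence_interval(material, p=0.99)
```

Raising the confidence level from 0.95 to 0.99 widens the interval, and a larger sample narrows it.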
Slides for lecture 4
Suggested Reading
- Krenn, B. & Samuelsson, C. (1997) The Linguist's
Guide to Statistics. Sections 1.5-1.7.
(Concentrate on discrete variables.)
Exercises
- Let Y be the sum of two dice and let (Y1, ..., Y10)
be a random sample of Y. Consider the following statistical material:
(y1, ..., y10) = (3, 5, 10, 6, 7, 4, 7, 11, 5, 2).
- What is the sample mean of Y?
- What is the sample variance of Y?
- What is the maximum likelihood estimate of E[Y]?
- What are the maximum likelihood estimates of P(Y = 7) and P(Y = 12)?
Solution
- Let X and Y be stochastic variables representing the word form and the number
of characters of an arbitrary English word, and consider the mini-corpus "to be or
not to be that is the question".
- Regarding the corpus as a random sample of X,
what is the maximum likelihood estimate of P(X = to)?
- What is the corresponding statistical material for the variable Y?
- What is the maximum likelihood estimate of E[Y]?
Solution