Statistical Inference

Statistical inference is the science of making predictions or inferences from finite sets of observations (samples) to potentially infinite sets of new observations (called populations or models). We begin by defining the fundamental notions of random sample and statistical material for a stochastic variable X:
  1. A vector (X1, ..., Xn) of independent variables Xi with the same distribution as X is said to be a random sample of X.
  2. A vector of values (x1, ..., xn) such that Xi = xi in a particular experiment is called a statistical material.
Given a random sample of a variable X, we can define new stochastic variables that are functions of the sample, called sample variables:
  1. The sample mean: mn = (1/n) ∑i Xi
  2. The sample variance: sn² = (1/(n-1)) ∑i (Xi - mn)²
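The two sample variables can be computed directly from a statistical material. The following is a minimal sketch in Python; the function names and the data values are hypothetical and chosen only for illustration.

```python
from statistics import mean, variance


def sample_mean(xs):
    """Sample mean: (1/n) times the sum of the observations."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


# Hypothetical statistical material (not taken from the text).
material = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]

m_n = sample_mean(material)
s2_n = sample_variance(material)

# The standard library uses the same definitions (variance divides by n - 1).
print(m_n, mean(material))
print(s2_n, variance(material))
```

Note the divisor n - 1 rather than n in the variance, which is what makes the sample variance an unbiased estimator of the population variance.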
There are essentially two kinds of statistical inference:
  1. Estimation: Use samples and sample variables to predict population variables.
  2. Hypothesis testing: Use samples and sample variables to test hypotheses about populations and population variables.
Estimation can in turn be divided into two main types:
  1. Point estimation: Use a sample variable f(X1, ..., Xn) to estimate a parameter φ.
  2. Interval estimation: Use sample variables f1(X1, ..., Xn) and f2(X1, ..., Xn) to construct an interval such that
    P(f1(X1, ..., Xn) < φ < f2(X1, ..., Xn)) = p, where p is the confidence level adopted.
The most common method used for (point) estimation is maximum likelihood estimation (MLE), which consists in choosing the estimate that maximizes the probability of the statistical material. Formally:
  1. Given a statistical material (x1, ..., xn) and a set of parameters θ, the likelihood function L is:
    L(x1, ..., xn, θ) = ∏i Pθ(xi)
    where Pθ(xi) is the probability that the variable Xi assumes the value xi given a set of values for the parameters in θ.
  2. Maximum likelihood estimation means choosing θ so that the likelihood function is maximized:
    θ* = argmaxθ L(x1, ..., xn, θ)
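To make the definition concrete, the following sketch evaluates the likelihood function for a variable with two outcomes (1 and 0), where the single parameter θ is assumed to be P(X = 1). The grid search over candidate values is only an illustrative way of finding the maximum; the data and the names are hypothetical.

```python
import math


def likelihood(material, theta):
    """Product over all observations of P_theta(x_i), with P_theta(1) = theta."""
    return math.prod(theta if x == 1 else 1.0 - theta for x in material)


# Hypothetical statistical material: 7 ones and 3 zeros.
material = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

# Candidate parameter values 0.00, 0.01, ..., 1.00.
grid = [k / 100 for k in range(101)]

# Choose the candidate that maximizes the likelihood of the material.
theta_hat = max(grid, key=lambda t: likelihood(material, t))
print(theta_hat)
```

For this material the maximum lies at the relative frequency of ones, 7/10, which is the value the grid search selects.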
Special cases of MLE:
  1. The sample mean of X is an MLE of E[X].
  2. The relative frequency of x is an MLE of P(X = x).
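The two special cases are connected: if each probability P(X = x) is estimated by a relative frequency, the expected value under the estimated distribution equals the sample mean. The sketch below checks this on hypothetical data, using exact fractions to avoid rounding; the function name is an assumption made for illustration.

```python
from collections import Counter
from fractions import Fraction


def relative_frequencies(material):
    """Estimate P(X = x) as count(x) / n for every observed value x."""
    counts = Counter(material)
    n = len(material)
    return {x: Fraction(c, n) for x, c in counts.items()}


# Hypothetical statistical material for a discrete variable.
material = [1, 2, 2, 3, 3, 3, 4, 4]

p_hat = relative_frequencies(material)

# Sample mean computed directly from the material.
sample_mean = Fraction(sum(material), len(material))

# Expected value under the estimated distribution: sum of x * P(X = x).
expected_value = sum(x * p for x, p in p_hat.items())

print(p_hat[3], sample_mean, expected_value)
```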
MLE is a good solution to the estimation problem if the statistical material is large enough. In practice, MLE is often suboptimal because of sparse data. Practical solutions to the estimation problem often use MLE as a starting point, applying more or less sophisticated smoothing methods in order to improve the quality of the estimate.
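As an illustration of the sparse data problem, the sketch below compares the relative-frequency estimate with add-one (Laplace) smoothing, which is one simple smoothing method among many and is used here only as an example. The outcome set, the observations, and the function names are hypothetical.

```python
from collections import Counter


def mle_estimate(counts, outcome, total):
    """Relative frequency: unseen outcomes get probability 0."""
    return counts[outcome] / total


def add_one_estimate(counts, outcome, total, num_outcomes):
    """Add-one smoothing: pretend every possible outcome was seen once more."""
    return (counts[outcome] + 1) / (total + num_outcomes)


# Hypothetical setting: four possible outcomes, one of which ("d") is never observed.
outcomes = ["a", "b", "c", "d"]
material = ["a", "b", "a", "c", "a", "b"]

counts = Counter(material)  # a Counter returns 0 for missing keys
total = len(material)

for x in outcomes:
    print(x, mle_estimate(counts, x, total),
          add_one_estimate(counts, x, total, len(outcomes)))
```

With only six observations, the relative-frequency estimate assigns probability 0 to the unseen outcome, while the smoothed estimate reserves some probability mass for it at the expense of the observed outcomes.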

The basic reasoning underlying most statistical hypothesis tests can be summarized as follows:

  1. Choose a test statistic t whose distribution is known when the null hypothesis is true.
  2. Use t to calculate the probability p of observing a value of t at least as extreme as the one actually observed, given that the null hypothesis is true.
  3. If p < α, reject the null hypothesis, where α is the significance level adopted.
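The three steps can be sketched for a simple case. The example below is an assumption made for illustration, not a test prescribed by the text: the null hypothesis is that a coin is fair, the test statistic t is the number of heads in n tosses (binomially distributed under the null hypothesis), and p is taken to be the one-sided probability of seeing at least as many heads as were observed.

```python
from math import comb


def binomial_p_value(heads, n, theta0=0.5):
    """P(T >= heads) when T is binomial(n, theta0), i.e. under the null hypothesis."""
    return sum(
        comb(n, k) * theta0 ** k * (1.0 - theta0) ** (n - k)
        for k in range(heads, n + 1)
    )


# Hypothetical experiment: 16 heads in 20 tosses.
n, heads = 20, 16
alpha = 0.05  # significance level

p = binomial_p_value(heads, n)
reject_null = p < alpha
print(p, reject_null)
```

Here p is about 0.0059, so the null hypothesis of a fair coin is rejected at the 0.05 level.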
Essentially the same reasoning can be used for interval estimation, where the idea is to make the width of the interval such that the true value of the parameter is inside the interval with probability p, where p is the adopted confidence level (typically 0.95 or 0.99).
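A common concrete instance is an interval for the expected value E[X] centered on the sample mean. The sketch below assumes that the sample mean is approximately normally distributed, which is reasonable for large samples; for small samples a t distribution would normally be used instead. The function name and the data are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(material, p=0.95):
    """Approximate confidence interval for E[X] at confidence level p."""
    n = len(material)
    m = mean(material)
    # Estimated standard error of the sample mean (stdev divides by n - 1).
    se = stdev(material) / math.sqrt(n)
    # Quantile of the standard normal distribution leaving (1 - p) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + p / 2)
    return m - z * se, m + z * se


# Hypothetical statistical material.
material = [4.1, 5.3, 4.8, 5.9, 5.1, 4.6, 5.4, 5.0, 4.7, 5.2]

low95, high95 = mean_confidence_interval(material, p=0.95)
low99, high99 = mean_confidence_interval(material, p=0.99)
print((low95, high95), (low99, high99))
```

Raising the confidence level from 0.95 to 0.99 widens the interval, since a wider interval is needed to contain the true value with higher probability.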

Slides for lecture 4

Exercises

  1. Let Y be the sum of two dice and let (Y1, ..., Y10) be a random sample of Y. Consider the following statistical material:
    (y1, ..., y10) = (3, 5, 10, 6, 7, 4, 7, 11, 5, 2).
    1. What is the sample mean of Y?
    2. What is the sample variance of Y?
    3. What is the maximum likelihood estimate of E[Y]?
    4. What are the maximum likelihood estimates of P(Y = 7) and P(Y = 12)?

  2. Let X and Y be stochastic variables representing the word form and the number of characters of an arbitrary English word, and consider the mini-corpus "to be or not to be that is the question".
    1. Regarding the corpus as a random sample of X, what is the maximum likelihood estimate of P(X = to)?
    2. What is the corresponding statistical material for the variable Y?
    3. What is the maximum likelihood estimate of E[Y]?