Stochastic Variables

A stochastic variable is a function X from a sample space Ω to a value space ΩX. Depending on properties of the value space, different kinds of stochastic variables can be distinguished:
  1. If ΩX is a subset of the set of real numbers, then the variable is numeric; otherwise it is categorical.
  2. If ΩX is finite or countably infinite, then X is discrete.
The frequency function (or probability function) fX gives the probability of each possible value x of X:
    fX(x) = P(X = x)
For discrete variables, this can be defined as the sum of the probabilities of all elementary outcomes (in the underlying sample space) that are mapped to x by X:
    fX(x) = ∑_{ω ∈ Ω : X(ω) = x} P({ω})
For numerical variables, there is also a (cumulative) distribution function FX:
    FX(x) = P(X ≤ x)
For discrete numerical variables, we have:
    FX(x) = ∑_{y ∈ ΩX : y ≤ x} fX(y)
Two important parameters of a numerical variable X are the expectation E[X], which is the (weighted) average value, and the variance Var[X], which is the expected (squared) deviation from the expectation. For discrete variables, they can be defined as follows:
  1. E[X] = ∑_{x ∈ ΩX} x ⋅ fX(x)
  2. Var[X] = ∑_{x ∈ ΩX} (x - E[X])² ⋅ fX(x)
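As an illustration, the definitions above can be sketched in Python for a single fair die. The die, the dictionary representation, and the names `f_X`, `cdf`, `expectation`, and `variance` are assumptions made for this example only; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Frequency function of one fair die (assumed example): f_X(x) = 1/6 for x = 1..6.
f_X = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(f, x):
    """Distribution function F_X(x) = P(X <= x) of a discrete numerical variable."""
    return sum(p for value, p in f.items() if value <= x)

def expectation(f):
    """E[X] = sum over x of x * f_X(x)."""
    return sum(x * p for x, p in f.items())

def variance(f):
    """Var[X] = sum over x of (x - E[X])^2 * f_X(x)."""
    mean = expectation(f)
    return sum((x - mean) ** 2 * p for x, p in f.items())

print(cdf(f_X, 2))       # 1/3
print(expectation(f_X))  # 7/2
print(variance(f_X))     # 35/12
```

The same three functions work for any discrete numerical variable whose frequency function is given as a dictionary from values to probabilities.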
Another useful quantity is the entropy H[X], which can be interpreted as the expected amount of information (measured in bits) gained when learning the value of X. The information value of a particular value x is denoted I[x], and the entropy is its expectation. Definitions:
  1. I[x] = - log2 fX(x)
  2. H[X] = ∑_{x ∈ ΩX} fX(x) ⋅ I[x] = - ∑_{x ∈ ΩX} fX(x) ⋅ log2 fX(x)
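A minimal Python sketch of these two definitions follows. The function names and the two example distributions are illustrative assumptions; skipping values with probability 0 reflects the usual convention that they contribute nothing to the entropy, which the text above does not state explicitly.

```python
import math

def information(p):
    """I[x] = -log2 f_X(x): information value, in bits, of a value with probability p."""
    return -math.log2(p)

def entropy(f):
    """H[X] = sum over x of f_X(x) * I[x]; values with probability 0 are skipped."""
    return sum(p * information(p) for p in f.values() if p > 0)

# Two assumed example distributions, given as {value: probability}.
fair_coin = {"heads": 0.5, "tails": 0.5}
skewed = {"a": 0.5, "b": 0.25, "c": 0.25}

print(information(0.25))   # 2.0
print(entropy(fair_coin))  # 1.0
print(entropy(skewed))     # 1.5
```

A rare value carries more information than a common one, and the entropy weights each information value by how often it occurs.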
Let X and Y be stochastic variables on the same sample space Ω, with value spaces ΩX and ΩY, respectively.
  1. The joint probability of X and Y is given by their joint probability function f(X, Y):
    f(X, Y)(x, y) = P(X = x, Y = y) = P({ ω ∈ Ω | X(ω) = x, Y(ω) = y })
  2. The conditional probability of X given Y is given by the conditional probability function fX|Y:
    fX|Y(x | y) = P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)
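The conditional probability function can be computed directly from a table of joint probabilities, as in the following sketch. The joint table and the function names are hypothetical; the marginal P(Y = y) is obtained by summing the joint probabilities over all values of X.

```python
from fractions import Fraction

# Hypothetical joint probability function f_(X,Y), stored as {(x, y): probability}.
joint = {
    ("a", 0): Fraction(1, 4),
    ("a", 1): Fraction(1, 4),
    ("b", 0): Fraction(3, 8),
    ("b", 1): Fraction(1, 8),
}

def marginal_y(joint, y):
    """P(Y = y), obtained by summing the joint probabilities over all x."""
    return sum(p for (_, v), p in joint.items() if v == y)

def conditional_x_given_y(joint, x, y):
    """f_X|Y(x | y) = P(X = x, Y = y) / P(Y = y); requires P(Y = y) > 0."""
    return joint.get((x, y), Fraction(0)) / marginal_y(joint, y)

print(marginal_y(joint, 0))                  # 5/8
print(conditional_x_given_y(joint, "a", 0))  # 2/5
print(conditional_x_given_y(joint, "b", 0))  # 3/5
```

For a fixed y, the conditional probabilities over all x sum to 1, which makes a convenient sanity check.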
There are also corresponding notions of joint and conditional entropy:
  1. Joint entropy: H[X, Y] = - ∑_{x ∈ ΩX} ∑_{y ∈ ΩY} f(X, Y)(x, y) ⋅ log2 f(X, Y)(x, y)
  2. Conditional entropy: H[X|Y] = - ∑_{x ∈ ΩX} ∑_{y ∈ ΩY} f(X, Y)(x, y) ⋅ log2 fX|Y(x | y)
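Both entropies can be computed from a joint probability table. The sketch below assumes a small hypothetical table in which X and Y are dependent; the function names are illustrative, and pairs with probability 0 are skipped.

```python
import math

# Hypothetical joint distribution f_(X,Y), stored as {(x, y): probability}.
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}

def marginal_y(joint):
    """f_Y(y), obtained by summing the joint probabilities over all x."""
    f_y = {}
    for (_, y), p in joint.items():
        f_y[y] = f_y.get(y, 0.0) + p
    return f_y

def joint_entropy(joint):
    """H[X, Y] = - sum over (x, y) of f(x, y) * log2 f(x, y)."""
    return -sum(p * math.log2(p) for p in joint.values() if p > 0)

def conditional_entropy(joint):
    """H[X|Y] = - sum over (x, y) of f(x, y) * log2 f_X|Y(x | y)."""
    f_y = marginal_y(joint)
    return -sum(p * math.log2(p / f_y[y])
                for (_, y), p in joint.items() if p > 0)

print(joint_entropy(joint))        # 1.5
print(conditional_entropy(joint))  # 0.5
```

In this table Y is a fair coin, so H[Y] = 1 bit, and the results agree with the standard identity H[X, Y] = H[Y] + H[X|Y], which can serve as a sanity check.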
The notions of joint and conditional probability can be generalized to arbitrary vectors of variables:
  1. Joint probability:
    P(X1 = x1, ..., Xn = xn) = P({ ω ∈ Ω | X1(ω) = x1, ..., Xn(ω) = xn })
  2. Conditional probability:
    P(X1 = x1, ..., Xn = xn | Y1 = y1, ..., Ym = ym) = P(X1 = x1, ..., Xn = xn, Y1 = y1, ..., Ym = ym) / P(Y1 = y1, ..., Ym = ym)
Finally, two variables X and Y are independent if and only if P(X = x, Y = y) = P(X = x) P(Y = y), for all x and y. If X and Y are independent, then the following conditions also hold:
  1. P(X = x|Y = y) = P(X = x) for all x, y
  2. P(Y = y|X = x) = P(Y = y) for all x, y
  3. H[X|Y] = H[X]
  4. H[Y|X] = H[Y]
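The defining condition for independence can be checked mechanically on a joint probability table. The following sketch uses two hypothetical tables and illustrative function names; exact fractions are used so that the equality test is reliable (with floating-point probabilities a tolerance would be needed).

```python
from fractions import Fraction
from itertools import product

def marginals(joint):
    """Return (f_X, f_Y), each obtained by summing the joint table over the other variable."""
    f_x, f_y = {}, {}
    for (x, y), p in joint.items():
        f_x[x] = f_x.get(x, 0) + p
        f_y[y] = f_y.get(y, 0) + p
    return f_x, f_y

def independent(joint):
    """True iff P(X = x, Y = y) = P(X = x) P(Y = y) for all x and y."""
    f_x, f_y = marginals(joint)
    return all(joint.get((x, y), 0) == f_x[x] * f_y[y]
               for x, y in product(f_x, f_y))

# Two separate fair coins versus one coin whose result is simply copied.
two_coins = {(x, y): Fraction(1, 4) for x in "ht" for y in "ht"}
copied_coin = {("h", "h"): Fraction(1, 2), ("t", "t"): Fraction(1, 2)}

print(independent(two_coins))    # True
print(independent(copied_coin))  # False
```

A single pair (x, y) that violates the product condition is enough to show that two variables are not independent.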

Slides for lecture 3

Suggested Reading

Exercises

NB: Solutions to exercises on entropy are based on base 2 logarithms (log2) throughout, not natural logarithms (ln).
  1. Consider the following stochastic variables and define, for each of them, a suitable domain (sample space) and range (value space):
    1. The number of syllables per word.
    2. The number of words per sentence.
    3. The lexeme of a word form.
    4. The word class (part-of-speech) of a word.
    5. The percentage of nouns in a text.
    Solution

  2. Which of the variables in Exercise 1 are numerical?
    Which of them are discrete?
    Solution

  3. Consider the experiment of throwing two dice and consider the stochastic variable X which maps the outcome to the sum of the two dice. We assume that the underlying experiment has a uniform probability distribution (i.e. that all outcomes are equally probable).
    1. What is the domain (sample space) of X?
    2. What is the range (value space) of X?
    3. Give the frequency function fX.
    4. Give the distribution function FX.
    5. Compute the expectation value E[X].
    6. Compute the variance Var[X].
    Solution

  4. Consider the experiment of randomly choosing a pair of adjacent words from a text. Let X1 be the stochastic variable which maps the first word to its word class (part-of-speech), and let X2 be the stochastic variable which maps the second word to its word class. Suppose we know the following probabilities:

    P(X2 = noun) = 0.2
    P(X2 = adjective) = 0.05
    P(X1 = article | X2 = noun) = 0.3
    P(X1 = article | X2 = adjective) = 0.6
    P(X1 = article | X2 is neither noun nor adjective) = 0

    Compute the probability that
    1. the first word is an article,
    2. the second word is a noun, given that the first is an article,
    3. the second word is an adjective, given that the first is an article,
    4. the second word is a noun or an adjective, given that the first is an article.
    Solution

  5. Show that the two variables in Exercise 4 are not independent.
    Solution

  6. Consider the experiment of randomly choosing a word from an English text, and consider the following stochastic variables:

    X(w) = w (i.e. X maps a word to its orthographic form).
    Y(w) = 1 if w="the", 0 otherwise.

    Assume that fX("the") = 0.02 (where fX is the frequency function of the variable X). Compute
    1. the frequency function fY of the variable Y,
    2. the expectation value of Y,
    3. the variance of Y.
    Solution

  7. Let X be the stochastic variable which gives us the sum of two dice. What is the surprise value (information value) associated with the following events?
    1. X = 2
    2. X = 7
    3. X > 10
    Solution

  8. Compute the entropy of variable X in Exercise 7.
    Solution

  9. Suppose the variable X has value space {x1, ..., xn} and let {p1, ..., pn} be the corresponding probabilities. The entropy H[X] is maximized when pi=1/n (for all i). What is H[X] in this case?
    Solution

  10. Consider the experiment of throwing two dice. Let X be the stochastic variable which gives 1 if the sum of the two dice is 6, and 0 otherwise, and let Y be the value of the first die. Compute
    1. H[Y]
    2. H[X|Y]
    3. H[X,Y]
    Solution

  11. Consider the experiment of randomly choosing a pair of adjacent letters from a text. Let X1 be the stochastic variable which tells us whether the first letter is a vowel or a consonant, and let X2 be the stochastic variable which gives the same information for the second letter. Suppose we know the following probabilities:

    P(X1=vowel) = P(X2=vowel) = 0.4
    P(X2=vowel|X1=vowel) = 0.01

    Compute
    1. H[X1]
    2. H[X2]
    3. H[X1,X2]
    4. H[X1|X2]
    5. H[X2|X1]
    Solution

  12. The variables in Exercise 11 are obviously not independent. Show this in at least three different ways.
    Solution