Stochastic Variables
A stochastic variable is a function X from a
sample space Ω to a value space ΩX.
Depending on properties of the value space, different kinds
of stochastic variables can be distinguished:
- If ΩX is a subset
of the set of real numbers, then the variable is numeric;
otherwise it is categorical.
- If ΩX is finite or countably infinite,
then X is discrete.
The frequency function (or probability function) fX
gives the probability of each possible value x of X:
- fX(x) = P(X = x)
For discrete variables, this can be defined as the
sum of the probabilities of all elementary outcomes
(in the underlying sample space) that are
mapped to x by X:
- fX(x) = P({u ∈ Ω | X(u) = x}) = ∑_{u : X(u) = x} P(u)
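The definition above can be illustrated with a minimal Python sketch. The experiment (two tosses of a fair coin), the variable X (number of heads) and the helper name `frequency_function` are assumptions chosen for illustration only; they do not come from the lecture material.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Hypothetical experiment: toss a fair coin twice.
# The sample space holds the elementary outcomes u, each with probability P(u).
sample_space = list(product("HT", repeat=2))
P = {u: Fraction(1, len(sample_space)) for u in sample_space}

def X(u):
    """Stochastic variable: number of heads in the outcome u."""
    return u.count("H")

def frequency_function(X, P):
    """Return fX as a dict: fX(x) is the sum of P(u) over all u with X(u) = x."""
    f = defaultdict(Fraction)
    for u, p in P.items():
        f[X(u)] += p
    return dict(f)

f_X = frequency_function(X, P)
for x in sorted(f_X):
    print(x, f_X[x])
```

Here fX(0) = 1/4, fX(1) = 1/2 and fX(2) = 1/4, and the values sum to 1 as required of a probability function. Exact fractions are used instead of floats so that the sums can be compared without rounding error.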
For numerical variables, there is also a (cumulative)
distribution function FX:
- FX(x) = P(X ≤ x)
For discrete numerical variables, we have:
- FX(x) = ∑_{y ≤ x} fX(y)
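For a discrete numerical variable, the cumulative distribution function can be sketched in Python under the standard reading FX(x) = P(X ≤ x), that is, as a sum of frequency function values. The frequency function below (number of heads in two tosses of a fair coin) and the name `distribution_function` are illustrative assumptions.

```python
from fractions import Fraction

# Assumed frequency function of a discrete numerical variable
# (number of heads in two tosses of a fair coin).
f_X = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def distribution_function(f, x):
    """FX(x) = P(X <= x): sum fX(y) over all values y with y <= x."""
    return sum((p for y, p in f.items() if y <= x), Fraction(0))

for x in sorted(f_X):
    print(x, distribution_function(f_X, x))
```

In this example FX(0) = 1/4, FX(1) = 3/4 and FX(2) = 1. The function is also defined between and outside the possible values: FX(-1) = 0 and FX(1.5) = 3/4.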
Two important parameters of a numerical variable X are the
expectation E[X], which is the (weighted) average value,
and the variance Var[X], which is the expected (squared)
deviation from the expectation. For discrete variables, they
can be defined as follows:
- E[X] = ∑_{x ∈ ΩX} x ⋅ fX(x)
- Var[X] = ∑_{x ∈ ΩX} (x - E[X])^2 ⋅ fX(x)
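The two sums translate directly into code. The following is a minimal sketch, assuming the frequency function is stored as a dict from values to probabilities; the example distribution and the function names are hypothetical.

```python
from fractions import Fraction

# Assumed frequency function (number of heads in two tosses of a fair coin).
f_X = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def expectation(f):
    """E[X]: each value x weighted by its probability fX(x)."""
    return sum(x * p for x, p in f.items())

def variance(f):
    """Var[X]: expected squared deviation from E[X]."""
    mu = expectation(f)
    return sum((x - mu) ** 2 * p for x, p in f.items())

print(expectation(f_X), variance(f_X))
```

For this distribution E[X] = 1 and Var[X] = 1/2: the values 0 and 2 each deviate from the expectation by 1 and together carry half of the probability mass.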
Another useful quantity is the entropy H[X], which can be
interpreted as the expected amount of information (measured in
bits) when learning the value of X. The information value of a
particular value x is denoted I[x]. Definitions:
- I[x] = - log2 fX(x)
- H[X] = ∑_{x ∈ ΩX} fX(x) ⋅ I[x] = - ∑_{x ∈ ΩX} fX(x) ⋅ log2 fX(x)
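As a small illustration, entropy can be computed as the probability-weighted average of the information values. The three-valued distribution and the function names below are assumptions made for the example; values with probability 0 are skipped, following the usual convention that they contribute nothing to the sum.

```python
import math

# Hypothetical frequency function of a categorical variable.
f_X = {"a": 0.5, "b": 0.25, "c": 0.25}

def information(p):
    """Information value (in bits) of a value that has probability p."""
    return -math.log2(p)

def entropy(f):
    """H[X]: expected information value, skipping zero-probability values."""
    return sum(p * information(p) for p in f.values() if p > 0)

print(information(0.25), entropy(f_X))
```

The value "a" carries 1 bit and the values "b" and "c" carry 2 bits each, so H[X] = 0.5 ⋅ 1 + 0.25 ⋅ 2 + 0.25 ⋅ 2 = 1.5 bits.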
Let X and Y be stochastic variables defined on the same sample
space Ω, with value spaces ΩX and ΩY, respectively.
- The joint probability of X and Y is given by their joint
probability function f(X, Y):
f(X, Y)(x, y) = P(X = x, Y = y) = P({u ∈ Ω | X(u) = x, Y(u) = y})
- The conditional probability of X given Y is given by the
conditional probability function fX|Y:
fX|Y(x | y) = P(X = x | Y = y) =
P(X = x, Y = y) / P(Y = y)
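A possible way to make these definitions concrete is to store the joint probability function as a table and derive the conditional probabilities from it. The table and the names `f_XY`, `marginal_Y` and `conditional_X_given_Y` are hypothetical and serve only as an illustration.

```python
from fractions import Fraction

# Hypothetical joint probability function f(X, Y), stored as {(x, y): probability}.
f_XY = {
    ("a", 0): Fraction(1, 2),
    ("a", 1): Fraction(1, 4),
    ("b", 0): Fraction(0),
    ("b", 1): Fraction(1, 4),
}

def marginal_Y(f_xy, y):
    """P(Y = y): sum the joint probabilities over all values of X."""
    return sum((p for (_, y2), p in f_xy.items() if y2 == y), Fraction(0))

def conditional_X_given_Y(f_xy, x, y):
    """P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)."""
    p_y = marginal_Y(f_xy, y)
    if p_y == 0:
        raise ValueError("P(Y = y) is zero, so the conditional probability is undefined")
    return f_xy.get((x, y), Fraction(0)) / p_y

print(conditional_X_given_Y(f_XY, "a", 0), conditional_X_given_Y(f_XY, "a", 1))
```

With this table P(Y = 0) = P(Y = 1) = 1/2, so P(X = a | Y = 0) = 1 and P(X = a | Y = 1) = 1/2. The explicit check for P(Y = y) = 0 reflects that the quotient is undefined in that case.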
There are also corresponding notions of joint and conditional
entropy:
- Joint entropy: H[X, Y] = - ∑_{x ∈ ΩX} ∑_{y ∈ ΩY} f(X, Y)(x, y) ⋅ log2 f(X, Y)(x, y)
- Conditional entropy: H[X|Y] = - ∑_{x ∈ ΩX} ∑_{y ∈ ΩY} f(X, Y)(x, y) ⋅ log2 fX|Y(x | y)
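Both entropies can be computed from a table of joint probabilities. The sketch below assumes a small hypothetical joint distribution in which cells with probability 0 are simply omitted; the function names are illustrative.

```python
import math

# Hypothetical joint probability function; zero-probability cells are omitted.
f_XY = {("a", 0): 0.5, ("a", 1): 0.25, ("b", 1): 0.25}

def marginal_Y(f_xy):
    """Return fY as a dict by summing the joint probabilities over x."""
    f_y = {}
    for (_, y), p in f_xy.items():
        f_y[y] = f_y.get(y, 0.0) + p
    return f_y

def joint_entropy(f_xy):
    """H[X, Y]: entropy of the joint probability function."""
    return -sum(p * math.log2(p) for p in f_xy.values() if p > 0)

def conditional_entropy(f_xy):
    """H[X|Y]: joint probabilities weight the log of the conditional ones."""
    f_y = marginal_Y(f_xy)
    return -sum(p * math.log2(p / f_y[y]) for (_, y), p in f_xy.items() if p > 0)

print(joint_entropy(f_XY), conditional_entropy(f_XY))
```

For this table H[X, Y] = 1.5 bits and H[X|Y] = 0.5 bits. Since H[Y] = 1 bit here, the values agree with the chain rule H[X, Y] = H[Y] + H[X|Y], which is a convenient sanity check on such computations.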
The notions of joint and conditional probability can be generalized
to arbitrary vectors of variables:
- Joint probability:
P(X1 = x1, ..., Xn = xn) =
P({u ∈ Ω | X1(u) = x1, ..., Xn(u) = xn})
- Conditional probability:
P(X1 = x1, ..., Xn = xn | Y1 = y1, ..., Ym = ym) =
P(X1 = x1, ..., Xn = xn, Y1 = y1, ..., Ym = ym) /
P(Y1 = y1, ..., Ym = ym)
Finally, two variables X and Y are independent if and only if
P(X = x, Y = y) = P(X = x) P(Y = y), for all x and y. If X and Y are
independent, then the following conditions also hold:
- P(X = x|Y = y) = P(X = x) for all x, y with P(Y = y) > 0
- P(Y = y|X = x) = P(Y = y) for all x, y with P(X = x) > 0
- H[X|Y] = H[X]
- H[Y|X] = H[Y]
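The defining condition for independence can be checked mechanically when the joint probability function is available as a table. The following sketch uses two hypothetical joint distributions; the names `marginals` and `independent` are assumptions for the example. Exact fractions are used because an equality test on floats could fail through rounding alone.

```python
from fractions import Fraction
from itertools import product

def marginals(f_xy):
    """Return (fX, fY) computed from a joint table {(x, y): probability}."""
    f_x, f_y = {}, {}
    for (x, y), p in f_xy.items():
        f_x[x] = f_x.get(x, Fraction(0)) + p
        f_y[y] = f_y.get(y, Fraction(0)) + p
    return f_x, f_y

def independent(f_xy):
    """True iff P(X = x, Y = y) = P(X = x) P(Y = y) for all x and y."""
    f_x, f_y = marginals(f_xy)
    return all(
        f_xy.get((x, y), Fraction(0)) == f_x[x] * f_y[y]
        for x, y in product(f_x, f_y)
    )

# Hypothetical examples: two fair coins tossed separately (independent) ...
two_coins = {(x, y): Fraction(1, 4) for x, y in product("HT", repeat=2)}
# ... and a coin paired with a copy of itself (dependent).
copied_coin = {("H", "H"): Fraction(1, 2), ("T", "T"): Fraction(1, 2)}

print(independent(two_coins), independent(copied_coin))
```

The first table satisfies the product condition in every cell. The second does not: P(X = H, Y = T) = 0, whereas P(X = H) P(Y = T) = 1/4.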
Slides for lecture 3
Suggested Reading
- Krenn, B. & Samuelsson, C. (1997) The Linguist's
Guide to Statistics. Sections 1.3-1.4 and 2.2.
(Concentrate on discrete variables.)
- Manning, C. D. & Schütze, H. (1999) Foundations of Statistical
Natural Language Processing. MIT Press. Chapter 2.
Exercises
NB: Solutions to exercises on entropy use base-2
logarithms (log2) throughout,
not natural logarithms (ln).
1. Consider the following stochastic variables and define,
   for each of them, a suitable domain (sample space) and range
   (value space):
   - The number of syllables per word.
   - The number of words per sentence.
   - The lexeme of a word form.
   - The word class (part-of-speech) of a word.
   - The percentage of nouns in a text.
   Solution
2. Which of the variables in Exercise 1 are numerical?
   Which of them are discrete?
   Solution
3. Consider the experiment of throwing two dice and consider the
   stochastic variable X which maps the outcome to the sum of the
   two dice. We assume that the underlying experiment
   has a uniform probability distribution (i.e. that all outcomes
   are equally probable).
   - What is the domain (sample space) of X?
   - What is the range (value space) of X?
   - Give the frequency function fX.
   - Give the distribution function FX.
   - Compute the expectation value E[X].
   - Compute the variance Var[X].
   Solution
4. Consider the experiment of randomly choosing a pair of two adjacent
   words from a text. Let X1 be the stochastic
   variable which maps the first word to its word class (part-of-speech),
   and let X2 be the stochastic
   variable which maps the second word to its word class.
   Suppose we know the following probabilities:
   - P(X2 = noun) = 0.2
   - P(X2 = adjective) = 0.05
   - P(X1 = article | X2 = noun) = 0.3
   - P(X1 = article | X2 = adjective) = 0.6
   - P(X1 = article | X2 is neither noun nor adjective) = 0
   Compute the probability that
   - the first word is an article,
   - the second word is a noun, given that the first is an article,
   - the second word is an adjective, given that the first is an article,
   - the second word is a noun or an adjective, given that the first is an article.
   Solution
5. Show that the two variables in Exercise 4 are not independent.
   Solution
6. Consider the experiment of randomly choosing a word from an English text,
   and consider the following stochastic variables:
   - X(w) = w (i.e. X maps a word to its orthographic form).
   - Y(w) = 1 if w = "the", 0 otherwise.
   Assume that fX("the") = 0.02 (where
   fX is the frequency function of
   the variable X). Compute
   - the frequency function fY of the variable Y,
   - the expectation value of Y,
   - the variance of Y.
   Solution
7. Let X be the stochastic variable which gives us
   the sum of two dice. What is the surprise value (information
   value) associated with the following events?
   - X = 2
   - X = 7
   - X > 10
   Solution
8. Compute the entropy of variable X in Exercise 7.
   Solution
9. Suppose the variable X has value space
   {x1, ..., xn} and let
   {p1, ..., pn} be the
   corresponding probabilities. The entropy H[X]
   is maximized when pi = 1/n (for all i).
   What is H[X] in this case?
   Solution
10. Consider the experiment of throwing two dice. Let X be
    the stochastic variable which gives 1 if the sum of the two dice is 6,
    and 0 otherwise. Let Y be the value of the first die.
    Compute
    - H[Y]
    - H[X|Y]
    - H[X,Y]
    Solution
11. Consider the experiment of randomly choosing a pair of two adjacent
    letters from a text. Let X1 be the stochastic
    variable which tells us whether the first letter is a vowel or a consonant,
    and let X2 be the stochastic
    variable which gives the same information for the second letter.
    Suppose we know the following probabilities:
    - P(X1 = vowel) = P(X2 = vowel) = 0.4
    - P(X2 = vowel | X1 = vowel) = 0.01
    Compute
    - H[X1]
    - H[X2]
    - H[X1,X2]
    - H[X1|X2]
    - H[X2|X1]
    Solution
12. The variables in Exercise 11 are obviously not independent. Show this
    in at least three different ways.
    Solution