Contents
1. Random variables
A random variable on a probability space (Ω,F,P) is a function whose domain is Ω. Typically the codomain will be the reals or the integers, although any set is possible. Random variables are generally written as capital letters with their arguments suppressed: rather than writing X(ω), where ω∈Ω, we write just X.
A technical condition on random variables is that the inverse image of any measurable subset of the codomain must be in F—in simple terms, if you can't nail down ω exactly, being able to tell which element of F you land in should be enough to determine the value of X(ω). For a simple random variable taking on finitely many values, this just means that X-1(x)∈F for each possible value x. For real-valued random variables the requirement is that the event [X ≤ x] is in F for any fixed x. In each case we say that X is measurable with respect to F (or just "measurable F").1 Usually we will not worry about this issue too much, but it may come up if we are varying F to represent different amounts of information available to different observers (e.g., if X and Y are the values of two dice, X is measurable to somebody who can see both dice but not to somebody who can only see the sum of the dice).
Examples of random variables:
Indicator variables: The indicator variable for an event A is a variable X that is 1 if A occurs and 0 if it doesn't (i.e., X(ω) = 1 if ω∈A and 0 otherwise). Indicator variables are often written using the Greek letter chi (e.g. χA) or using bracket notation (e.g., [A], [Y2 > 3], [all six coins come up heads]).
- Functions of random variables: Any function you are likely to run across of a random variable or random variables is a random variable, e.g. X+Y, XY, log X, etc.
- Counts: Flip a fair coin n times and let X be the number of times it comes up heads. Then X is an integer-valued random variable.
Random sets and structures: Suppose that we have a set T of n elements, and we pick out a subset U by flipping an independent fair coin for each element to decide whether to include it. Then U is a set-valued random variable. Or we could consider the infinite sequence X0, X1, X2, ..., where X0 = 0 and Xn+1 is either Xn+1 or Xn-1, which being chosen by an independent fair coin flip. Then we can think of the entire sequence X as a sequence-valued random variable.
2. The distribution of a random variable
The distribution function of a real-valued random variable describes the probability that it takes on each of its possible values; it is specified by giving a function F(x) = Pr[X ≤ x]. The reason for using Pr[X ≤ x] instead of Pr[X = x] is that it allows specifying continuous random variables such as a random variable that is uniform in the range [0,1]; this random variable has a distribution function given by F(x) = x when 0 ≤ x ≤ 1, F(x) = 0 for x < 0, and F(x) = 1 for x > 1. For discrete random variables the distribution function will have discontinuous jumps at each possible value of the variable; for example, the distribution function of a variable X that is 0 or 1 with equal probability is F(x) = 0 for x < 0, 1/2 for 0 ≤ x < 1, and 1 for x ≥ 1.
Knowing the distribution of a random variable tells you what that variable might do by itself, but doesn't tell you how it interacts with other random variables: for example, if X is 0 or 1 with equal probability then X and 1-X both have the same distribution, but they are connected in a way that is not true for X and some independent variable Y with the same distribution. For multiple variables, a joint distribution gives the probability that each variable takes on a particular value; for example, if X and Y are two independent uniform samples from the range [0,1], their distribution function F(x,y) = Pr[X ≤ x ∧ Y ≤ y] = xy (when 0 ≤ x,y ≤ 1). If instead Y = 1-X, we get the distribution function F(x,y) = Pr[X ≤ x ∧ Y ≤ y] equal to x when y ≥ 1-x and 0 when y < 1-x (assuming 0 ≤ x,y ≤ 1).
For discrete random variables, it is often more useful to look at the function f(x) = Pr[X=x]. We will often treat such a function as specifying the distribution of X even if it is not technically a distribution function.
2.1. Some standard distributions
Here are some common distributions for a random variable X:
- Bernoulli distribution
- Pr[X = 1] = p, Pr[X = 0] = q, where p is a parameter of the distribution and q = 1-p. This corresponds to a single biased coin-flip.
- Binomial distribution
Pr[X = k] = (n choose k) pk q(n-k), where n and p are parameters of the distribution and q = 1-p. This corresponds to the sum of n biased coin-flips.
- Geometric distribution
Pr[X = k] = qkp, where p is a parameter of the distribution and q is again equal to 1-p. This corresponds to number of tails we flip before we get the first head in a sequence of biased coin-flips.
- Poisson distribution
Pr[X = k] = e-λλk/k!. This is what happens to a binomial distribution when we fix p = λ/n and then take the limit as n goes to infinity; we can think of it as counting the number of events that occur in one time unit if the events occur at a constant continuous rate that averages λ events per time unit. The canonical example is radioactive decay.
- Uniform distribution
- The distribution function F of X is given by F(x) = 0 when x ≤ a, (x-a)/(b-a) when a ≤ x ≤ b, and 1 when b ≤ x, where a and b are parameters of the distribution. This is a continuous random variable that has equal probability of landing anywhere in the [a,b] interval.
- Normal distribution
- See below.
2.2. Densities
If a real-valued random variable is continuous in the sense of having a continuous distribution function, we may be able to describe its distribution by giving a density instead. The density is the derivative of the distribution function. We can also think of it as a probability at each point defined in the limit, by taking smaller and smaller regions around the point and dividing the probability of landing in the region by the size of the region.
For example, the density of a uniform [0,1] random variable is f(x) = 1 for x∈[0,1], f(x) = 0 otherwise. For a uniform [0,2] random variable, we get a density of ½ throughout the [0,2] interval. The density always integrates to 1.
Some distributions are easier to describe using densities than using distribution functions. The normal distribution, which is of central importance in statistics, has density
![\[
\frac{1}{\sqrt{2 \pi}} e^{-x^2/2}.
\] \[
\frac{1}{\sqrt{2 \pi}} e^{-x^2/2}.
\]](/pinewiki/RandomVariables?action=AttachFile&do=get&target=latex_406b4e845a1972b0901b9a599ce6977c9df339b8_p1.png)
Its distribution function is the integral of this quantity, which has no closed-form expression.
Joint densities also exist; the joint density of a pair of random variables with joint distribution function F(x,y) is given by the partial derivative f(x,y) = ∂²/∂x∂y F(x,y). The intuition here again is that we are approximating the (zero) probability at a point by taking the probability of a small region around the point and dividing by the area of the region.
3. Independence of random variables
Two random variables X and Y are independent if any pair of events of the form X∈A, Y∈B are independent. For real-valued random variables it is enough to show that their joint distribution F(x,y) is equal to the product of their individual distributions FX(x)FY(y). For real-valued random variables with densities, showing the densities multiply also works.
4. The expectation of a random variable
For a real or integer-valued random variable X, its expectation E[X] (sometimes just EX) is its average value, weighted by probability. If it is discrete, meaning that it takes on only finitely many values, then the expectation is defined by
![\[E[X] = \sum_{x} x \Pr[X=x].\] \[E[X] = \sum_{x} x \Pr[X=x].\]](/pinewiki/RandomVariables?action=AttachFile&do=get&target=latex_5c2c724af156b8e9df574c01d744b8ad99d83c3d_p1.png)
For continuous random variables the situation is more complicated and involves integration. If the variable has a density f(x), the formula is
![\[
E[X] = \int x f(x) dx.
\] \[
E[X] = \int x f(x) dx.
\]](/pinewiki/RandomVariables?action=AttachFile&do=get&target=latex_5cf3ad7b3838a08192b37b24516dddc2a0119a36_p1.png)
For continuous random variables without densities we land in a rather swampy end of integration theory. We will not talk about this case if we can help it. But in each case the expectation depends only on the distribution of X and not on its relationship to other random variables.
- Example (discrete variable)
- Let X be the number rolled with a fair six-sided die. What is E[X]? Compute E[X] = (1/6) (1+2+3+4+5+6) = 3½.
- Example (continuous variable with density)
Let X be a uniform random variable in the range [a,b]. What is E[X]? Compute E[X] = ∫[a,b] x 1/(b-a) dx = (1/(b-a))(x²/2)|x=a..b = (1/(b-a))(b²/2 - a²/2) = (b+a)/2.
Terminology note: If you hear somebody say that some random variable X takes on the value z on average, this usually means that E[X] = z.
4.1. Expectation of a sum
The expectation operator is linear: this means that E(X+Y) = E(X)+E(Y) and E(aX) = aE(X) when a is a constant. This fact holds for all random variables X and Y, whether they are independent or not, and is not hard to prove for discrete probability spaces:
![\[E[aX+Y] = \sum_{\omega \in \Omega} \Pr[\omega] (aX(\omega) + Y(\omega)] = a\sum_{\omega \in \Omega} \Pr(\omega) X(\omega) + \sum_{\omega \in \Omega} \Pr(\omega) Y(\omega) = aE[X] + E[Y].\] \[E[aX+Y] = \sum_{\omega \in \Omega} \Pr[\omega] (aX(\omega) + Y(\omega)] = a\sum_{\omega \in \Omega} \Pr(\omega) X(\omega) + \sum_{\omega \in \Omega} \Pr(\omega) Y(\omega) = aE[X] + E[Y].\]](/pinewiki/RandomVariables?action=AttachFile&do=get&target=latex_1a558ddfafb4d890531ead09dae1fe2028bbc69e_p1.png)
Linearity of expectation makes computing expectations easy. Example: Flip a fair coin n times, and let X be the number of heads. What is E[X]? We can solve this problem by letting Xi be the indicator variable for the event "coin i came up heads". Then X = ∑i Xi and EX = E[∑i X] = ∑i EXi,, = n(½) = n/2. In principle it is possible to calculate the same value from the distribution of X (which involves a lot of binomial coefficients), but linearity of expectation is much easier.
- Example
- Choose a random permutation f, i.e. a random bijection from 1..n → 1..n. What is the expected number of values i for which f(i) = i?
- Solution
Let Xi be the indicator variable for the event that f(i) = i. Then we are looking for E[X1 + X2 + ... Xn] = E[X1] + E[X2] + ... E[Xn]. But E[Xi] is just 1/n for each i, so the sum is n(1/n) = 1. (Calculating this by computing Pr[∑Xi] = x first would be very painful.)
4.2. Expectation of a product
For products of random variables, the situation is more complicated. Here the rule is that E[XY] = E[X]⋅E[Y] if X and Y are independent. But if X and Y are not independent, their expectation can be arbitrary.
- Example
- Roll 2 dice and take their product. What value do we get on average? The product formula gives E[XY] = E[X]E[Y] = (7/2)² = (49/4) = 12¼. We could also calculate this directly by summing over all 36 cases, but it would take a while.
- Example
- Roll 1 die and multiply it by itself. What value do we get on average? Here we are no longer dealing with independent random variables, so we have to do it the hard way: E[X²] = (1²+2²+3²+4²+5²+6²)/6 = 91/6 = 15⅙. This is substantially higher than when the dice are uncorrelated. (Exercise: How can you rig the second die so it still comes up with each value ⅙ of the time but minimizes E[XY]?)
We can prove the product rule without too much trouble for discrete random variables. The easiest way is to start from the right-hand side.
![\begin{eqnarray*}
E[X] \cdot E[Y]
&=&
\left(\sum_{x} x \Pr[X=x]\right)\left(\sum_{y} y \Pr[Y=y]\right)
\\
&=&
\sum_{x,y} xy \Pr[X=x] \Pr[Y=y]
\\
&=&
\sum_{z} z \left(\sum_{x,y, xy=z} \Pr[X=x] \Pr[Y=y]\right)
\\
&=&
\sum_{z} z \left(\sum_{x,y, xy=z} \Pr[X=x \wedge Y=y]\right)
\\
&=&
\sum_{z} z \Pr[XY=z]
\\
&=&
E[XY].
\end{eqnarray*} \begin{eqnarray*}
E[X] \cdot E[Y]
&=&
\left(\sum_{x} x \Pr[X=x]\right)\left(\sum_{y} y \Pr[Y=y]\right)
\\
&=&
\sum_{x,y} xy \Pr[X=x] \Pr[Y=y]
\\
&=&
\sum_{z} z \left(\sum_{x,y, xy=z} \Pr[X=x] \Pr[Y=y]\right)
\\
&=&
\sum_{z} z \left(\sum_{x,y, xy=z} \Pr[X=x \wedge Y=y]\right)
\\
&=&
\sum_{z} z \Pr[XY=z]
\\
&=&
E[XY].
\end{eqnarray*}](/pinewiki/RandomVariables?action=AttachFile&do=get&target=latex_a7c3f503b24704f1ac07401deb4bfcc00238cad4_p1.png)
Here we use independence in going from Pr[X=x] Pr[Y=y] to Pr[X=x ∧ Y=y] and use the union rule to convert the x,y sum into Pr[XY=z].
4.3. Conditional expectation
Like conditional probability, there is also a notion of conditional expectation. The simplest version of conditional expectation conditions on a single event A, is written E[X|A], and is defined for discrete random variables by
![\[E[X|A] = \sum_{x} x \Pr[X=x|A].\] \[E[X|A] = \sum_{x} x \Pr[X=x|A].\]](/pinewiki/RandomVariables?action=AttachFile&do=get&target=latex_1a326a63ee558cc8497c1d1cafc16da3a9a109b7_p1.png)
This is exactly the same as unconditioned expectation except that the probabilities are now conditioned on A.
- Example
- What is the expected value of a six-sided die conditioned on not rolling a 1? The conditional probability of getting 1 is now 0, and the conditional probability of each of the remaining 5 values is 1/5, so we get (1/5)(2+3+4+5+6) = 4.
Conditional expectation acts very much like regular expectation, so for example we have E[aX+bY|A] = aE[X|A] + bE[Y|A].
One of the most useful applications of conditional expectation is that it allows computing (unconditional) expectations by case analysis, using the fact that
- E[X] = E[X|A]Pr[A] + E[X|¬A]Pr[¬A].
- Example
- I have a 50% chance of reaching the top of Mt Everest, where Sir Edmund Hilary and Tenzing Norgay hid somewhere between 0 and 10 kilograms of gold (a random variable with uniform distribution). How much gold do I expect to bring home? Compute E[X] = E[X|reached the top]Pr[reached the top] + E[X|didn't make it]Pr[didn't make it] = 5⋅0.5 + 0⋅0.5 = 2.5.
- Example
- Suppose I flip a coin that comes up heads with probability p until I get heads. How many times on average do I flip the coin? We'll let X be the number of coin flips. Conditioning on whether the coin comes up heads on the first flip gives E[X] = 1⋅p + (1+E[X'])⋅(1-p),
where X' is random variable counting the number of coin-flips needed to get heads ignoring the first coin-flip. But since X' has the same distribution as X, we get
- EX = p + (1-p) (1+EX)
or
- EX = (p+(1-p))/p = 1/p.
So a fair coin must be flipped twice on average to get a head, which is about what we'd expect if we hadn't thought about it much.
4.3.1. Conditioning on a random variable
There is a more general notion of conditional expectation for random variables, where the conditioning is done on some other random variable Y. Unlike E[X|A], which is a constant, the expected value of X conditioned on Y, written E[X|Y], is itself a random variable: when Y = y, it takes on the value E[X|Y=y].
- Example
- Let us compute E[X+Y|X] where X and Y are the values of independent six-sided dice. When X = x, E[X+Y|X] = E[X+Y|X=x] = x+E[Y] = x + 7/2. For the full random variable we can write E[X+Y|X] = X + 7/2.
Another way to get the result in the preceding example is to use some general facts about conditional expectation:
- E[aX+bY|Z] = aE[X|Z] + bE[Y|Z]. This is the conditional-expectation version of linearity of expectation.
- E[X|X] = X. This is immediate from the definition, since E[X|X=x] = x.
- If X and Y are independent, then E[Y|X] = E[Y]. The intuition is that knowing the value of X gives no information about Y, so E[Y|X=x] = E[Y] for any x in the range of X. (To do this formally requires using the fact that Pr[Y=y|X=x] = Pr[Y=y /\ X=x] / Pr[X=x] = Pr[Y=y]Pr[X=x]/Pr[X=x] = Pr[Y=y], provided X and Y are independent.)
- Also useful: E[E[X|Y]] = E[X]. Averaging a second time removes all dependence on Y.
These in principle allow us to do very complicated calculations involving conditional expectation. For example:
- Example
- Let X and Y be the values of independent six-sided dice. What is E[X | X+Y]? Here we observe that X+Y = E[X+Y | X+Y] = E[X | X+Y] + E[Y | X+Y] = 2E[X | X+Y] by symmetry. So E[X | X+Y] = (X+Y)/2. This is pretty much what we'd expect: on average, half the total value is supplied by one of the dice. (It also works well for extreme cases like X+Y = 12 or X+Y = 2, giving a quick check on the formula.)
- Example
What is E[(X+Y)2 | X] when X and Y are independent? Compute E[(X+Y)2 | X] = E[X2 | X] + 2E[XY | X] + E[Y2 | X] = X2 + 2X E[Y] + E[Y2]. For example, if X and Y are independent six-sided dice we have E[(X+Y)2 | X] = X2 + 7X + 91/6, so if you are rolling the dice one at a time and the first one comes up 5, you can expect on average to get a squared total of 25 + 35 + 91/6 = 75 1/6. But if the first one comes up 1, you only get 1 + 7 + 91/6 = 23 1/6 on average.
4.4. Markov's inequality
Knowing the expectation of a random variable gives you some information about it, but different random variables may have the same expectation but very different behavior: consider, for example, the random variable X that is 0 with probability 1/2 and 1 with probability 1/2 and the random variable Y that is 1/2 with probability 1. In some cases we don't care about the average value of a variable so much as its likelihood of reaching some extreme value: for example, if my feet are encased in cement blocks at the beach, knowing that the average high tide is only 1 meter is not as important as knowing whether it ever gets above 2 meters. Markov's inequality lets us bound the probability of unusually high values of non-negative random variables as a function of their expectation. It says that, for any a > 0,
Pr[X > aEX] < 1/a.
This can be proved easily using conditional expectations. We have:
EX = E[X|X > aEX] Pr[X > aEX] + E[X|X ≤ aEX] Pr[X > aEX].
Since X is non-negative, E[X|X ≤ aEX] ≥ 0, so subtracting out the last term on the right-hand side can only make it smaller. This gives:
EX ≥ E[X|X > aEX] Pr[X > aEX] > aEX Pr[X > aEX]
and dividing both side by aEX gives the desired result.
Another version of Markov's inequality replaces > with ≥:
- E[X ≥ aEX] ≤ 1/a.
The proof is essentially the same.
- Example
Suppose that that all you know about high tide is that EX = 1 meter and X ≥ 0. What can we say about the probability that X > 2 meters? Using Markov's inequality, we get Pr[X > 2 meters] = Pr[X > 2EX] < 1/2.
There is, of course, a conditional version of Markov's inequality:
Pr[X > aE[X|A] | A] < 1/a.
This version doesn't get anywhere near as much use as the unconditioned version, but it may be worth remembering that it exists.
5. The variance of a random variable
Expectation tells you the average value of a random variable but it doesn't tell you how far from the average the random variable typically gets: the random variable X = 0 and Y = ±1,000,000,000,000 with equal probability both have expectation 0, though their distributions are very different. Though it is impossible to summarize everything about the spread of a distribution in a single number, a useful approximation for many purposes is the variance Var(X) of a random variable X, which is defined as the expected square of the deviation from the expectation, or E((X-EX)2).
- Example
Let X = 0 or 1 with equal probability. Then EX = 1/2, and (X-EX)2 is always 1/4. So Var(X) = 1/4.
- Example
Let X be the value of a fair six-sided die. Then EX = 7/2, and E((E-EX)2) = (1/6)((1-7/2)2 + (2-7/2)2 + (3-7/2)2 + ... + (6 - 7/2)2) = 35/12.
Computing variance directly from the definition can be tedious. Often it is easier to compute it from E(X2) and EX:
Var[X] = E[(X-EX)2] = E[X2 - 2 X EX + (EX)2] = E[X2] - 2 EX EX + (EX)2 = E[X2] = (EX)2.
(The second-to-last step uses linearity of expectation and the fact that EX is a constant.)
- Example
For X = 0 or 1 with equal probability, we have E[X2] = 1/2 and (EX)2 = 1/4, so Var[X] = 1/4.
- Example
- Let's try the six-sided die again, except this time we'll use an n-sided die. We have
![\begin{eqnarray*}
Var[X] &=& E[X^2] - (E X)^2 \\
&=& \frac{1}{n} \sum_{i=1}^{n} i^2 - \left(\frac{n+1}{2}\right)^2 \\
&=& \frac{1}{n} \cdot \frac{n(n+1)(2n+1)}{6} - \frac{(n+1)^2}{4} \\
&=& \frac{(n+1)(2n+1)}{6} - \frac{(n+1)^2}{4}.
\end{eqnarray*} \begin{eqnarray*}
Var[X] &=& E[X^2] - (E X)^2 \\
&=& \frac{1}{n} \sum_{i=1}^{n} i^2 - \left(\frac{n+1}{2}\right)^2 \\
&=& \frac{1}{n} \cdot \frac{n(n+1)(2n+1)}{6} - \frac{(n+1)^2}{4} \\
&=& \frac{(n+1)(2n+1)}{6} - \frac{(n+1)^2}{4}.
\end{eqnarray*}](/pinewiki/RandomVariables?action=AttachFile&do=get&target=latex_cac108beeb5e4f57d75e79c841369a6aca06cb6d_p1.png)
When n = 6, this gives 7⋅13/6 - 49/4 = 35/12. (Ok, maybe it isn't always easier).
5.1. Multiplication by constants
Suppose we are asked to compute the variance of aX, where a is a constant. We have
- Var[aX] = E[(aX)²] - E[aX]² = a²E[X²] - (aE[X])² = a² Var[X].
So, for example, if X = 0 or 2 with equal probability, Var[X] = 4⋅(1/4) = 1. This is exactly what we expect given that X-EX is always ±1.
Another consequence is that Var[-X] = (-1)²Var[X] = Var[X]. So variance is not affected by negation.
5.2. The variance of a sum
What is Var[X+Y]? Write
![\begin{eqnarray*}
\mbox{Var}[X+Y]
&=& E[(X+Y)^2] - \left(E[X+Y]\right)^2 \\
&=& E[X^2] + 2E[XY] + E[Y^2] - (EX)^2 - 2 EX \cdot EY - (EY)^2 \\
&=& (E[X^2] - (EX)^2) + (E[Y^2] - (EY)^2) + 2(E[XY] - EX \cdot EY) \\
&=& \mbox{Var}[X] + \mbox{Var}[Y] + 2(E[XY] - EX \cdot EY).
\end{eqnarray*} \begin{eqnarray*}
\mbox{Var}[X+Y]
&=& E[(X+Y)^2] - \left(E[X+Y]\right)^2 \\
&=& E[X^2] + 2E[XY] + E[Y^2] - (EX)^2 - 2 EX \cdot EY - (EY)^2 \\
&=& (E[X^2] - (EX)^2) + (E[Y^2] - (EY)^2) + 2(E[XY] - EX \cdot EY) \\
&=& \mbox{Var}[X] + \mbox{Var}[Y] + 2(E[XY] - EX \cdot EY).
\end{eqnarray*}](/pinewiki/RandomVariables?action=AttachFile&do=get&target=latex_c13a46a60e870b4b39023df051c61695a62011ee_p1.png)
The quantity E[XY] - EX EY is called the covariance of X and Y and is written Cov(X,Y). So we have just shown that
- Var[X+Y] = Var[X] + Var[Y] + 2 Cov(X,Y).
When Cov(X,Y) = 0, or equivalently when E[XY] = EX EY, X and Y are said to be uncorrelated and their variances add. This occurs when X and Y are independent, but may also occur without X and Y being independent.
For larger sums the corresponding formula is
![\[\mbox{Var}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mbox{Var}[X_i] + \sum_{i\neq j} \mbox{Cov}(X,Y).\] \[\mbox{Var}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mbox{Var}[X_i] + \sum_{i\neq j} \mbox{Cov}(X,Y).\]](/pinewiki/RandomVariables?action=AttachFile&do=get&target=latex_517cd6d85211bcf8c4dd355e53d5f299da9ada4e_p1.png)
This simplifies to Var(∑Xi) = ∑Var(Xi) when the Xi are pairwise independent, so that each pair of distinct Xi and Xj are independent. (This is implied by independence but is not equivalent to it.)
- Example
What is the variance of the number of heads in n independent fair coin-flips? Let Xi be the indicator variable for the event that the i-th flip comes up heads and let X be the sum of the Xi. We have already seen that Var[Xi] = 1/4, so Var[X] = n Var[Xi] = n/4.
- Example
- Let a be a constant. What is the variance of X+a? We have Var[X+a] = Var[X] + Var[a] = Var[X], since (1) E[aX] = aE[X] = E[a]E[X] means that a (considered as a random variable) and X are uncorrelated, and (2) Var[a] = E[a-Ea] = E[a-a] = 0. So shifting a random variable up or down doesn't change its variance.
5.3. Chebyshev's inequality
Variance is an expectation, so we can use Markov's inequality on it. The result is Chebyshev's inequality:
Pr[|X - EX| > r) < Var[X] / r2.
Proof: The event |X - EX| > r is the same as the event (X-EX)2 > r2. By Markov's inequality, the probability that this occurs is at most E[(E-EX)2]/r2 = Var[X]/r2.
5.3.1. Application: showing that a random variable is close to its expectation
This is the usual statistical application.
- Example
Flip a fair coin n times, and let X be the number of heads. What is the probability that |X-n/2| > r? Recall that Var[X] = n/4, so Pr[|X-n/2| > r] < (n/4)/r2 = n/(4r2). So, for example, the chances of deviating from the average by more than 1000 after 1000000 coin-flips is less than 1/4.
- Example
Out of n voters in Saskaloosa County, m plan to vote for Smith for County Dogcatcher. A polling firm samples k voters (with replacement) and asks them who they plan to vote for. Suppose that m < n/2; compute a bound on the probability that the polling firm incorrectly polls a majority for Smith.
Solution: Let Xi be the indicator variable for a Smith vote when the i-th voter is polled and let X = ∑Xi be the total number of pollees who say they will vote for Smith. Let p = EXi = m/n. Then Var[Xi] = p - p2, EX = kp, and Var[X] = k(p-p2). To get a majority in the poll, we need X > k/2 or X-EX > k/2 - kp. Using Chebyshev's inequality, this event occurs with probability at most
![\begin{eqnarray*}
\frac{\mbox{Var}[X]}{(k/2-kp)^2}
&=& \frac{k(p - p^2)}{(k/2-kp)^2} \\
&=& \frac{1}{k} \cdot \frac{p-p^2}{(1/2 - p)^2}.
\end{eqnarray*} \begin{eqnarray*}
\frac{\mbox{Var}[X]}{(k/2-kp)^2}
&=& \frac{k(p - p^2)}{(k/2-kp)^2} \\
&=& \frac{1}{k} \cdot \frac{p-p^2}{(1/2 - p)^2}.
\end{eqnarray*}](/pinewiki/RandomVariables?action=AttachFile&do=get&target=latex_609ce5295a2c5aca50cc1e86fd31298dde7f0ede_p1.png)
Note that the bound decreases as k grows and (for fixed p) does not depend on n.
In practice, statisticians will use a stronger result called the Central_limit_theorem, which describes the shape of the distribution of the sum of many independent random variables much more accurately than the bound from Chebyshev's inequality.
5.3.2. Application: lower bounds on random variables
Unlike Markov's inequality, which can only show that a random variable can't be too big too often, Chebyshev's inequality can be used to show that a random variable can't be too small, by showing first that its expectation is high and then that its variance is low. For example, suppose that each of the 1030 oxygen molecules in the room is close enough to your mouth to inhale with pairwise independent probability 10-4 (it's a big room). Then the expected number of oxygen molecules near your mouth is a healthy 1030 * 10-4 = 1026. What is the probability that all 1026 of them escape your grasp?
Let Xi be the indicator variable for the event that the i-th molecule is close enough to inhale. We've effectively already used the fact that E[Xi] = 10-4. To use Chebyshev's inequality, we also need Var[Xi] = E[Xi2] - E[Xi]2 = 10-4 - 10-8 ~= 10-4. So the total variance is about 1030*10-4 = 1026 and Chebyshev's inequality says we have Pr[|X - EX| >= 1026] <= 1026/(1026)2 = 10-26. So death by failure of statistical mechanics is unlikely (and the real probability is much much smaller).
But wait! Even a mere 90% drop in O2 levels is going to be enough to cause problems. What is the probability that this happens? Again we can calculate Pr[90% drop] <= Pr[|X-EX| >= 0.9*1026] <= 1026/(0.9*1026)2 ~= 1.23*10-26. So even temporary asphyxiation by statistical mechanics is not something to worry about.
6. Probability generating functions
For a discrete random variable X taking on only values in ℕ, we can express its distribution using a probability generating function or pgf:
F(z) = ∑ Pr[X=n] zn.
These are essentially standard-issue GeneratingFunctions with the additional requirement that all coefficients are non-negative and F(1) = 1.
A trivial example is the pgf for a Bernoulli random variable (1 with probability p, 0 with probability q=1-p). Here the pgf is just q+pz.
A more complicated example is the pgf for a geometric random variable. Now we have ∑ qk p zk = p ∑ (qz)k^ = p/(1-qz).
6.1. Sums
A very useful property of pgf's is that the pgf of a sum of independent random variables is just the product of the pgf's of the individual random variables. The reason for this is essentially the same as for ordinary GeneratingFunctions: when we multiply together two terms (Pr[X=n] zn)(Pr[Y=m] zm), we get Pr[X=n ∧ Y=m] zn+m, and the sum over all the different ways of decomposing n+m gives all the different ways to get this sum.
So, for example, the pgf of a binomial random variable equal to the sum of n independent Bernoulli random variables is (q+pz)n (hence the name "binomial").
6.2. Expectation and variance
One nice thing about pgf's is that the can be used to quickly compute expectation and variance. For expectation, we have
F'(z) = ∑ n Pr[X=n] zn-1.
So
- F'(1) = ∑ n Pr[X=n] = E[X].
If we take the second derivative, we get
F''(z) = ∑ n(n-1) Pr[X=n] zn-1
or
F''(1) = ∑ n(n-1) Pr[X=n] = E[X(X-1)] = E[X²] - E[X].
So we can recover E[X²] as F''(1) + F'(1) and get Var[X] as F''(1) + F'(1) - (F'(1))².
- Example
If X is a Bernoulli random variable with pgf F = (q+pz), then F' = p and F'' = 0, giving E[X] = F'(1) = p and Var[X] = F''(1) + F'(1) - (F'(1))² = 0 + p - p² = p(1-p) = pq.
- Example
If X is a binomial random variable with pgf F = (q+pz)n, then F' = n(q+pz)n-1p and F'' = n(n-1)(q+pz)n-2p², giving E[X] = F'(1) = np and Var[X] = F''(1) + F'(1) - (F'(1))² = n(n-1)p² + np - n²p² = np - np² = npq. (These values would, of course, be a lot faster to compute using the formulas for sums of independent random variables, but it's nice to see that they work.)
- Example
If X is a geometric random variable with pgf p/(1-qz), then F' = pq/(1-qz)² and F'' = 2pq²/(1-qz)3. E[X] = F'(1) = pq/(1-q)² = pq/p² = q/p. Var[X] = F''(1) + F'(1) - (F'(1))² = 2pq²/(1-q)3 + q/p - q²/p² = 2q²/p² + q/p - q²/p² = q²/p² + q/p. This would probably be a pain to calculate by hand.
- Example
Let X be a Poisson random variable with rate λ. We claimed earlier that a Poisson random variable is the limit of a sequence of binomial random variables where p = λ/n and n goes to infinity, so (cheating quite a bit) we expect that X's pgf F = lim[n⇒∞] ((1-λ/n) + (λ/n)z)n = (1+(-λ+λz)/n)n = exp(-λ+λz) = exp(-λ) ∑ λnzn/n!. We can check that the total probability F(1) = exp(-λ+λ) = exp(0) = 1, that the expectation F'(1) = λ exp(-λ+λ) = λ, and that the variance F''(1) + F'(1) - (F'(1))² = λ²exp(-λ+λ) + λ - λ² = λ. These last two quantities are what we'd expect if we calculated the expectation and the variance directly as the limit of taking n Bernoulli random variables with expectation λ/n and variance (λ/n)(1-λ/n) each.
7. Summary: effects of operations on expectation and variance of random variables
Operation |
Effect on expectation |
Effect on variance |
Multiplication by a constant |
E[aX] = aE[X] |
Var[aX] = a2 Var[X] |
Addition |
E[X+Y] = E[X] + E[Y]. Does not require independence. |
Var[X+Y] = Var[X] + Var[Y] + 2 Cov(X,Y). Note that Cov(X,Y)=0 if X and Y are independent. |
Product |
E[XY] = E[X]E[Y] + Cov(X,Y). (Or just E[X]E[Y] if X and Y are independent.) |
(No simple formula.) |
The detail we are sweeping under the rug here is what makes a subset of the codomain measurable. The essential idea is that we also have a σ-algebra F' on the codomain, and elements of this codomain σ-algebra are the measurable subsets. The rules for simple random variables and real-valued random variables come from default choices of σ-algebra. (1)
PineWiki