Think of any experiment with chance outcomes – buying a lottery ticket, betting on a horse race, going on a blind date, undergoing some medical treatment. We use the word distribution to specify all the possible outcomes, along with their associated probabilities. (We slipped in that word when writing about Poisson’s analysis of how many rare events will happen, given a large number of opportunities.)
The ‘distribution’ is central to analysing the range of consequences from a chance experiment. Plainly, we need to be clear about the full extent of the possible outcomes. To give sensible values for their probabilities, we must spell out our assumptions, and hope that they are appropriate for the experiment we seek to investigate.
Discrete distributions
First, we look at circumstances where the possible outcomes can be written as a list, each outcome having its own probability. The phrase discrete distribution applies here.
The most straightforward case is when we can count the number of outcomes, and agree that they should all be taken as equally likely. The term uniform distribution is used, as the total probability is spread uniformly over the outcomes. Many experiments are expected to fit this bill – roulette, dice, hands of cards, selecting the winning numbers in a lottery, etc. Accurate counting generates the appropriate answer.
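For readers who like to check such counts, here is a minimal Python sketch; the 37-pocket European roulette wheel is the assumed example.

```python
from fractions import Fraction

# A European roulette wheel has 37 equally likely pockets (0 to 36).
# Under a uniform distribution, an event's probability is simply
# (number of favourable outcomes) / (total number of outcomes).
total_pockets = 37
first_dozen = range(1, 13)   # a bet on the numbers 1 to 12
p = Fraction(len(first_dozen), total_pockets)
print(p, "=", float(p))      # 12/37 = 0.3243...
```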
Recall the term ‘Bernoulli trials’ to mean a sequence of independent experiments with a constant probability of Success each time. With a fixed number of Bernoulli trials, there is a simple formula, called the binomial distribution, that gives the respective probabilities of exactly 0, 1, 2, . . . Successes. This formula depends only on the number of trials, and the Success probability. As you run through the outcomes in order, their probabilities initially increase up to a maximum value, then fall away towards zero. (Poisson distributions also follow this pattern.)
We expect a binomial distribution for the number of Sixes among twenty throws of a die; or the number of correct answers when a student guesses randomly among five choices at each of thirty questions on a multiple choice exam. But we do not expect it when asking how many Clubs a bridge player has among his thirteen cards: although each separate card has probability one quarter of being a Club, successive cards are not independent, as the chance of a Club on the next card is affected by all previous outcomes.
Always read the small print. A binomial distribution requires three conditions: a fixed number of trials, each independent of the rest, and with a constant chance of Success.
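As a sketch of the binomial formula in action, the snippet below tabulates the chances of 0, 1, 2, . . . Sixes among twenty throws of a fair die; the rise to a peak and fall away towards zero is plain to see. (The helper binomial_pmf is our own, not a library routine.)

```python
from math import comb

def binomial_pmf(k, n, p):
    """Chance of exactly k Successes in n Bernoulli trials,
    each with Success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Number of Sixes among twenty throws of a fair die: n=20, p=1/6.
probs = [binomial_pmf(k, 20, 1/6) for k in range(21)]
for k in range(7):
    print(k, round(probs[k], 4))   # rises to a peak at k=3, then falls
print(sum(probs))                  # the probabilities sum to 1.0
```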
In a sequence of Bernoulli trials, what is the chance it takes exactly five goes to achieve the first Success? The only way this happens is to begin with four Fails, then have a Success; and since all trials are independent, the answer comes from multiplying the respective probabilities of these outcomes together, giving a pleasingly simple expression, the so-called geometric distribution.
The probabilities of taking exactly 1, 2, 3, . . . trials for the first Success decrease steadily. Each time, the next probability comes from multiplying the present value by the chance of one more Fail, some fixed value less than unity. Thus, whatever the chance of Success, the single most likely number of trials to achieve the first Success is always unity!
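A short sketch makes the steady decrease visible; we assume a fair die, so the chance of Success (a Six, say) is 1/6, and the helper geometric_pmf is our own.

```python
def geometric_pmf(k, p):
    """Chance that the first Success arrives on trial k:
    k-1 Fails in a row, then one Success."""
    return (1 - p)**(k - 1) * p

p = 1/6   # waiting for a Six with a fair die
for k in range(1, 6):
    print(k, round(geometric_pmf(k, p), 4))
# 1 0.1667   <- always the single most likely value
# 2 0.1389      (each entry is 5/6 of the one before)
# 3 0.1157
# 4 0.0965
# 5 0.0804
```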
Make a leap of faith, and suppose that, in cricket, successive balls form Bernoulli trials. A bowler, who interprets ‘Success’ as meaning that he takes a wicket, can think optimistically: when he comes in to bowl, the single most likely time he will take his next wicket is with the next delivery. Conversely, a batsman who takes a similar view must fatalistically accept that the most likely duration of his innings is that he faces just one ball. (Even for the best batsmen, records confirm that their single most likely total score is usually zero!)
4. Some common discrete distributions
Figure 4 illustrates some of the common discrete distributions. For each possible value, the height of the vertical bar gives its probability, and the sum of all the heights is always, of course, unity.
Continuous distributions
How might we extend the classical ideas of probability to deal with the experiment of choosing a random point on a stick of length 80cm? Here there is a continuum of possible outcomes, not just a list.
‘At random’ means that all individual points have the same probability. But if that common value were to exceed zero, then, by taking sufficiently many points, their total probability would exceed unity, which is impossible. Each separate point must have probability zero, and we can no longer use pictures like Figure 4. Rather than associate probabilities with individual points, we need to associate probabilities with segments, or intervals.
To give equal treatment along the 80cm stick, all segments having the same length must have the same probability. Imagine chopping the stick into eight equal pieces: a ‘random’ point must, by definition, fall in each with the same probability, so, for example, the segment from 20cm to 30cm must have probability 1/8.
Figure 5a shows how to proceed, using the mantra ‘Area represents probability’. The height of the horizontal line labelled h is chosen so that the shaded area beneath that line is unity, representing the fact that it is 100% certain that the random point falls somewhere along the interval from 0 to 80. Then Figure 5b shows how to find the probability of falling in the segment from 32cm to 52cm, by calculating the corresponding shaded area. Plainly, this is 1/4.
To find the probability that a randomly selected point is within 10cm of either end of the stick, or within 10cm of the centre, we could use Figure 5c, and appeal to the Addition Law. The required probability is the sum of the three shaded areas, namely one half.
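Since the density is flat, every probability is just a total segment length divided by 80. A minimal check of the two calculations above, with uniform_prob as our own helper:

```python
def uniform_prob(segments, length=80.0):
    """Chance that a random point on a stick of the given length falls
    in one of the given non-overlapping segments: under the uniform
    density, probability is total segment length / stick length."""
    return sum(b - a for a, b in segments) / length

print(uniform_prob([(32, 52)]))                     # 0.25 (Figure 5b)
print(uniform_prob([(0, 10), (30, 50), (70, 80)]))  # 0.5  (Figure 5c)
```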
Figure 6 illustrates a similar path for other situations where the outcome takes continuous values, such as the time until the next accident on a particular stretch of motorway. We will argue below that the general shape of the curve shown is reasonable in this situation, but the main point is that the scale is chosen so that the total area above the line marked ‘Time’, but below the curve beginning at the point E, is unity, as it is 100% certain that the time to wait takes some non-negative value.
5a. The shaded area is unity
5b. The probability of falling between 32 and 52 is 1/4
5c. See text
6. A continuous distribution
The probability that the time is at least B, but no more than C, is the size of the area shaded. In a similar fashion, we can find the probability that the time to wait falls in any given interval, and then, using the Addition Law as above, the chance it falls in more complex regions.
A curve that generates probabilities in this manner is called a probability density. Now area is calculated as ‘length times breadth’, and the breadth of any line is zero. Hence the ‘area’ of either of the vertical lines at A or D in Figure 6 is zero, so both those individual points have probability zero, as before. But the density curve is higher at A than at D, so values near A are more likely than values near D. At a glance, the figure indicates the regions of relatively low or high probability. The term continuous distribution is used.
In all such experiments, since individual points have probability zero, we can be a little slipshod: whether an interval includes both endpoints, just one of them, or neither, the probability the outcome falls in it is the same.
To qualify as a probability density, a curve must have two properties: it cannot take negative values, and the total area underneath it must be unity. This ensures that all calculations of probabilities lead to sensible values.
Many probability density functions arise often enough for them to be given names. For the experiment of selecting a random point within a given interval, the density function will be completely flat over that interval, as in Figure 5: plainly, all segments of the same length do indeed have the same probability. Again, the term uniform distribution is used.
Suppose we are interested in the time to wait for some special event. For example, ²¹⁰Pb is an unstable isotope of lead, and the claim ‘Its half-life is 22 years’ appears in physics textbooks. The meaning is that, whenever we take a lump of this substance, only half of it is unchanged after 22 years, the rest having decayed into other substances through radioactive emission.
This lump consists of a gigantic number of atoms, all acting independently. Focus on one atom: at some random time, it decays by emitting a particle. We do not know when this will be, but since half the atoms in the lump decay in 22 years, the chance that this particular atom decays within that time period is 50%. Suppose it has not decayed after five years: at that time, it is just one atom in the residual lump of ²¹⁰Pb, so the chance it decays within a further 22 years is again 50%. And if it has not decayed in the next three years, the same applies, and so on.
It turns out that the only way this can happen is when the random time until a given atom decays has what is known as an exponential distribution, whose density has the general shape shown in Figure 6, the height of the curve falling away by the same proportion over each equal interval of time. A similar background applies to road accidents: if none has occurred in the past week, that seems unlikely to affect the chances of an accident in the future, so we expect the time to wait for a road accident also to follow an exponential distribution.
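A numerical sketch, assuming only the 22-year half-life quoted above: the decay rate follows from the half-life, and the memoryless property drops straight out of the exponential survival formula.

```python
from math import exp, log

half_life = 22.0            # years, for lead-210
lam = log(2) / half_life    # decay rate, so that P(T > t) = exp(-lam * t)

def survival(t):
    """Chance the atom is still intact after t years."""
    return exp(-lam * t)

print(1 - survival(22))                    # 0.5: decays within 22 years
# Given survival to year 5, the chance of decaying in the NEXT
# 22 years is still 50%: the exponential has no memory.
print(1 - survival(5 + 22) / survival(5))  # 0.5
```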
This distribution is intimately tied up with the Poisson distribution. Whenever things are occurring essentially at random – flashes of lightning in a storm, spontaneous mutations in reproduction, the arrivals of customers at a Post Office – the number of such events in a fixed time period tends to follow some Poisson distribution, while the time to wait between events has this exponential format.
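A simulation illustrates the tie-up; the rate of three events per unit time is chosen arbitrarily. Building the event stream from exponential gaps makes the counts per unit window come out Poisson.

```python
import random
from math import exp, factorial

random.seed(1)
rate = 3.0            # events per unit time, chosen arbitrarily
T = 10_000            # number of unit-time windows to simulate

# Lay down event times using exponential gaps, counting per window.
t, counts = 0.0, [0] * T
while True:
    t += random.expovariate(rate)
    if t >= T:
        break
    counts[int(t)] += 1

# Compare the observed frequencies with the Poisson formula.
for k in range(6):
    observed = counts.count(k) / T
    predicted = exp(-rate) * rate**k / factorial(k)
    print(k, round(observed, 3), round(predicted, 3))
```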
7. Gaussian distributions
The most important continuous distribution is the one we have already named the Gaussian distribution. As Figure 7 illustrates, members of this family are symmetrical around a single peak and fall away rapidly towards zero, while never actually attaining that value. Two numbers tell us to which member of this family any example belongs: one number picks out the peak, the other describes the spread – small values of the spread lead to tall and narrow graphs like Figure 7a, larger values give short, fat graphs like 7c. Any probability for a member of this family can be found by using these two numbers to relate it to Figure 7b, which has its peak at zero and its measure of spread standardized at unity. Suitable tables have been widely available since de Moivre first produced them.
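In code, ‘using these two numbers to relate it to Figure 7b’ is a one-line standardization. The helper gaussian_cdf below is our own, not a library routine, and the peak 100 and spread 15 are illustrative values only.

```python
from math import erf, sqrt

def gaussian_cdf(x, mean=0.0, sd=1.0):
    """P(X <= x) for a Gaussian with the given peak (mean) and spread (sd),
    found by relating X to the standard curve of Figure 7b."""
    z = (x - mean) / sd                  # standardize: peak 0, spread 1
    return 0.5 * (1 + erf(z / sqrt(2)))

# A Gaussian with its peak at 100 and spread 15 (illustrative values):
print(gaussian_cdf(115, 100, 15) - gaussian_cdf(85, 100, 15))  # about 0.683
```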
7. Continued
Resolving an issue
You may have noticed a problem. When the set of possible outcomes is finite, or an unending list like {1, 2, 3, . . .}, an outcome whose probability is zero will simply never happen. However, with a continuous distribution, although each separate point has probability zero, one of them will occur when the experiment is performed! We can no longer take ‘will not happen’ and ‘probability zero’ as meaning exactly the same.
To reconcile matters, think of choosing one marble at random from a box holding a million identical marbles. We would be very surprised if we correctly guessed the outcome in advance, as the chance of doing so is only one in a million. But whichever marble gets chosen, we do not then express surprise, even though an outcome, whose probability was as small as one in a million, has occurred.
Make the box bigger – a billion marbles, a trillion – and the corresponding chance of the actual outcome can be made as close to zero as we like – but it did happen. Choosing one point at random on a continuous line is not very different from this: for any point, its chance is zero, but one of them will occur.
We will show below that, in a repeatable experiment where the chance of guessing the outcome is one in six, we can expect to perform that experiment six times in order to be right just once. Similarly, if the chance is one in a million, we expect to take a million repetitions to guess right once. Reduce the chance of occurrence by a factor of another million, and the time we expect to wait for a correct guess gets multiplied by a million. Outcomes with really tiny probabilities do occur, but more and more rarely.
If the probability falls all the way down to zero, we can expect to wait longer than any finite time – it just won’t happen! It is rational to act as though any event of probability zero that is named in advance will not occur.
Mean values
Knowing the distribution of the outcomes from a chance experiment, we can calculate any probability we like. But sometimes, all this detail gets in the way: we can’t see the wood for the trees, so we want to pick out the main features of the distribution.
To illustrate, suppose the only outcomes possible are 2, 3, and 7, with respective probabilities 60%, 10%, and 30%. We expect that, over a hundred repetitions of this experiment, the value 2 will occur some sixty times, 3 about ten times, and 7 the remaining thirty times. The total of all these values is 120 + 30 + 210 = 360, so the average over all the one hundred outcomes is 360/100 = 3.6. This answer is just the weighted sum of the values 2, 3, and 7, the weights being their probabilities.
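The same weighted sum, in a few lines of Python:

```python
# Mean as a probability-weighted sum of the possible values.
values = [2, 3, 7]
probs = [0.6, 0.1, 0.3]
mean = sum(v * p for v, p in zip(values, probs))
print(mean)   # 3.6, matching the hundred-repetition argument above
```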
Whatever distribution we have, similar calculations lead to the average outcome over a large number of repetitions. ‘Average’ is a loose word; we prefer the term mean for the result of this calculation. There may be short cuts: if the values are uniformly distributed over some range, the mean is just midway between the two extremes; the mean number of Successes in a sequence of Bernoulli trials comes from multiplying the number of trials by the chance of Success.
When rolling a fair die, the chance of getting a Four is 1/6. So among 600 throws, we should see around 100 Fours: simple arithmetic then says that the mean gap between successive appearances of a Four is 6. It is plainly no coincidence that a chance of size 1/6 leads to a mean gap of 6. The length of any gap is just the time to wait for the next Success, so we have the pleasing result that, during a sequence of Bernoulli trials,
the mean time to wait for a Success is the reciprocal of the probability of Success.
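A quick simulation (assuming a fair die, so p = 1/6, with trials_until_success as our own helper) bears this out:

```python
import random

random.seed(2)

def trials_until_success(p):
    """Count Bernoulli trials up to and including the first Success."""
    n = 1
    while random.random() >= p:
        n += 1
    return n

p = 1/6   # waiting for a Four with a fair die
waits = [trials_until_success(p) for _ in range(100_000)]
print(sum(waits) / len(waits))   # close to 1/p = 6
```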
With continuous distributions, the idea is the same, but the weighted sum is found by using the mathematical technique known as integration. For Gaussian distributions, the mean is where the peak occurs. Exponential distributions arise as the time to wait for a random event, which occurs at some characteristic overall frequency: it should be no surprise that the mean time to wait is just the reciprocal of that frequency.
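For instance, the mean of an exponential waiting time can be checked by approximating the integral on a fine grid; the rate 0.5 below is an arbitrary choice.

```python
from math import exp

lam = 0.5     # arbitrary rate; the density is lam * exp(-lam * t)
dt = 0.001    # grid spacing for the numerical integral

# The 'weighted sum' is now an integral of t times the density.
mean = sum(i * dt * lam * exp(-lam * i * dt) * dt
           for i in range(int(60 / dt)))
print(mean)   # close to 1/lam = 2, the reciprocal of the frequency
```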
The terms ‘expectation’ and ‘expected value’ are also used instead of ‘mean’ and ‘mean value’. Tossing a fair coin a dozen times, the ‘expected’ number of Heads is six, and the ‘expected’ score when throwing an ordinary fair die is 3.5. Of course, just because the expected number of Tails on a single toss is 0.5, you don’t actually expect to get half a Tail! The English language has many quirks.
Means are very friendly animals: the mean of a sum is always the sum of the means, whether or not the different components arise independently. The Law of Large Numbers says that, in the long run, means dominate: if you spend £1 on a Lottery ticket, where half that sum goes into the prize fund, then, however the prize distribution is structured, your mean return is 50p and, in the (very) long run, that is what you will get.
Variability
It is also useful to have a succinct way of describing the variability of a distribution. We could calculate the difference between each value and the mean, and then find the (properly weighted) average value of these differences. But, as any trial calculation will show, this path is fruitless: the negative differences inevitably exactly cancel the positive ones, always giving a final answer of zero. But whether a difference is positive or negative, we will get a positive number when we square it. So we could use the weighted average of these squared values to assess the variability. This quantity is called the variance . If the distribution is concentrated near the mean, the variance will be small; it will be larger when there is a reasonable chance of getting values well away from the mean.
When considering income distributions, with data in dollars, the squared data are in ‘square dollars’, whatever that might mean. Taking the square root of the variance returns us to the original measurement units, giving what is called the standard deviation .
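Continuing the 2, 3, 7 example from above:

```python
from math import sqrt

values = [2, 3, 7]
probs = [0.6, 0.1, 0.3]
mean = sum(v * p for v, p in zip(values, probs))                 # 3.6

# Variance: the probability-weighted average of squared deviations.
variance = sum(p * (v - mean)**2 for v, p in zip(values, probs))
print(variance, sqrt(variance))   # 5.04 and about 2.245
```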
The mean and standard deviation together often give a swift and helpful way of picking out main features of a probability distribution. And in the Gaussian case, these two numbers suffice to find any probability at all! As useful touchstones, the outcome when the distribution is Gaussian will be within one standard deviation of the mean about 68% of the time, within two standard deviations over 95% of the time, and only one time in 400 will it be more than three standard deviations away.
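These touchstones can be recovered from the standard curve; after standardizing, the particular mean and spread drop out entirely. (The helper within is our own.)

```python
from math import erf, sqrt

def within(k):
    """Chance a Gaussian outcome lies within k standard deviations
    of its mean; standardization makes the mean and spread irrelevant."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973  <- misses about once in 370, the 'one in 400' touchstone
```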
These figures are the basis for the guidelines offered in Chapter 1 about how close an agreement we can reasonably expect between Success probability, and the actual frequency of Success: the key is the Central Limit Theorem, which says that quantities that arise as the sum of a large number of random components are expected to follow a distribution close to the Gaussian.
In Figure 7, showing three Gaussian density functions, the means of the graphs are at −2, 0, and 2, while their respective standard deviations are 1/2, 1, and 2.
But be warned: although the mean of a sum is always the sum of the means, the same is not generally true of either the variance or the standard deviation. If the components of the sum happen to be independent – say a casino’s profits over seven separate days in Las Vegas – then the variance of the sum will indeed be the sum of the individual variances, but otherwise it could be higher or lower.
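A small simulation makes the warning concrete; the two Gaussian components below are illustrative choices, not anything from the text.

```python
import random

random.seed(3)
N = 200_000

def var(xs):
    """Plain (unweighted) variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m)**2 for x in xs) / len(xs)

a = [random.gauss(0, 1) for _ in range(N)]   # variance 1
b = [random.gauss(0, 2) for _ in range(N)]   # variance 4, independent of a

print(var([x + y for x, y in zip(a, b)]))    # about 5 = 1 + 4: variances add
print(var([x + x for x in a]))               # about 4, not 1 + 1: dependence matters
```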
Adding standard deviations together seldom leads anywhere sensible.
Extreme-value distributions
In several applications of probability, interest centres on the largest or the smallest of a large number of random quantities. For example, the strength of a thread or a cable rests on the properties of the weakest fibre; flood defences take account of the maximum surge that might be expected over the next hundred years; the subject of survival analysis examines what fraction of a population remains after a given time. Extreme events may occur rarely, but when they happen, the consequences can be important.
The simplest plausible model assumes a sequence of independent random quantities, each following the same distribution; for example, the total claims made on an insurance company in each separate year. The company is interested in how big the largest of these yearly totals is likely to be over the next fifty years. There is a useful mathematical result that goes a long way to answering this question: however the claims vary over a single year, there are only three possible types of answer for the maximum claim over a large number of years. They are known as the extreme-value distributions, with the specific names of the Fréchet, the Gumbel, and the Weibull distributions. There is a sound mathematical principle that if there is some theorem about maxima, there is a corresponding result about minima. So if the item of interest is some minimum value, the same conclusion pertains.
To be able to limit the possibilities to these three families of distributions is very helpful. By estimating the mean and variance of an extreme value, and selecting whichever of the three families seems to fit the data best, sensible estimates of other probabilities – the chances of really extreme and devastating events – can be found.
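As a sketch of how such a model behaves, assume (purely for illustration) exponential yearly claims: the yearly maxima below settle into the Gumbel shape, whose mean and variance can then be matched to data.

```python
import random

random.seed(4)

def yearly_max(claims=1000, mean_claim=1.0):
    """Largest of many independent exponential claims in one year."""
    return max(random.expovariate(1 / mean_claim) for _ in range(claims))

maxima = [yearly_max() for _ in range(2000)]
m = sum(maxima) / len(maxima)
v = sum((x - m)**2 for x in maxima) / len(maxima)
print(round(m, 2), round(v, 2))
# Theory for exponential inputs: a Gumbel with mean log(1000) + 0.577
# (about 7.48) and variance pi**2 / 6 (about 1.64).
```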