Chapter 3 Historical sketch

Beginnings

A game popular in Florence around 1600 rested on the total score from three ordinary dice. The scores of Three (when all dice scored one) and Eighteen (when they all scored six) arose rarely, with most scores near the middle of the range. You should check that there are six different ways of scoring Nine (e.g. 6+2+1, 5+2+2, etc.), and also six ways of scoring Ten. It was commonly believed that this ‘ought’ to make totals of Nine or Ten equally frequent, but players noticed that, over a period of time, the total of Ten occurred appreciably more often than Nine. They asked Galileo for an explanation.

Galileo pointed out that their method of counting was flawed. Colour the dice as Red, Green, and Blue, and list the outcomes in that order. To score Nine from 3+3+3 requires all three dice to show the same value, and that can happen in one way only, (3,3,3). But the 5+2+2 combination could arise as (5,2,2), (2,5,2), or (2,2,5), so it will tend to arise three times as often as the former; and 6+2+1 arises via (6,2,1), (6,1,2), (2,6,1), (2,1,6), (1,6,2), and (1,2,6), so that combination has six ways to occur. A valid approach to counting how often the different totals can arise takes this factor into account, and does indeed lead to more ways of obtaining Ten than Nine. The Florentine gamblers learned a vital lesson in probability – you must learn to count properly.
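Galileo’s counting argument is easy to check by brute force. The sketch below (in Python, purely illustrative and not part of the original account) enumerates all 216 equally likely ordered outcomes of three coloured dice and counts how many give each total.

```python
from itertools import product
from collections import Counter

# Count, for every total, the number of equally likely ordered outcomes
# (Red, Green, Blue) of three fair dice.
totals = Counter(sum(dice) for dice in product(range(1, 7), repeat=3))

print(totals[9], "ordered ways to score Nine")   # 25
print(totals[10], "ordered ways to score Ten")   # 27
```

Running it shows 25 ordered ways of scoring Nine against 27 ways of scoring Ten – exactly the imbalance the gamblers had noticed at the table.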

In the summer of 1654, Pascal (in Paris) and Fermat (in Toulouse) had an exchange of letters on the problem of points. Suppose Smith and Jones agree to play a series of contests, the victor being the first to win three games; unfortunately, fate intervenes, and the contest must end when Smith leads Jones by 2-1. How should the prize be split?

Such questions had been aired for at least 150 years without a satisfactory answer, but Pascal and Fermat independently found a recipe that, for any target score, and any score when the contest was abandoned, would divide the prize fairly between them. They took different approaches, but reached the same conclusion, and each showered praise on the other for his brilliance. For the specific problem stated, the split should be in the ratio 3:1, with Smith getting 3/4 of the prize, Jones 1/4.

The essence of their solution was to suppose that both players were equally likely to win any future game. They counted how many of the possible outcomes of these hypothetical games would give overall victory to either player, and proposed dividing the prize in the ratio of these two numbers. In different language, the prize should be split as the ratio of the two probabilities of either player winning the series, assuming they were evenly matched in future games. The systematic study of probability had begun.
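Their recipe can be replayed in a few lines. The following sketch (a Python illustration, not Pascal’s or Fermat’s own notation) imagines the remaining games being played out, each equally likely to go to either player, and counts which futures give the series to Smith and which to Jones.

```python
from itertools import product

# Smith leads Jones 2-1 in a first-to-three contest.  Imagine the remaining
# games all played out, each equally likely to go either way, and count
# which equally likely futures hand the series to each player.
smith_score, jones_score, target = 2, 1, 3
games_left = (target - smith_score) + (target - jones_score) - 1   # at most 2 more games

smith_futures = jones_futures = 0
for future in product("SJ", repeat=games_left):     # 4 equally likely futures
    s, j = smith_score, jones_score
    for game in future:
        if s < target and j < target:               # play on until someone reaches the target
            s += game == "S"
            j += game == "J"
    if s >= target:
        smith_futures += 1
    else:
        jones_futures += 1

print(smith_futures, ":", jones_futures)            # 3 : 1, so Smith gets 3/4 of the prize
```

For Smith ahead 2-1 it prints 3 : 1, the split Pascal and Fermat recommended; changing the scores or the target reproduces their general recipe.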

This issue was settled via the objective approach to probability, but Pascal also thought more widely. He suggested a wager about the existence of God. ‘God is, or is not. Reason cannot answer. A game is on at the other end of an infinite distance, and Heads or Tails is going to turn up. Which way will you bet?’

He argued that if God exists, the difference between belief and unbelief is that between attaining infinite happiness in heaven, or eternal damnation in Hell. If God does not exist, belief or unbelief leads to only minor differences in earthly experience. Thus an agnostic should lean strongly to belief in God.

In this game, the values of the chances of ‘Heads’ or ‘Tails’ are personal choices, not derivable from symmetry or counting arguments. Thus Pascal was a pioneer in the subjective approach to probability too.

The Swiss Family Bernoulli

During the 17th and 18th centuries, members of the Bernoulli family from Basle made significant advances in mathematics, including probability. Rivalry was a spur: one of them would pose challenges, another would respond, the originator of the challenge would claim to find flaws in the supposed solution, and so on.

Games of chance inspired much of the early interest in the workings of probability. In these games, be it rolling dice, dealing cards, or tossing coins, some ‘experiment’ is carried out repeatedly under essentially the same conditions. The natural question, raised earlier, is: how does the observed frequency of an outcome relate to its objective probability?

Jacob Bernoulli gave an answer in his posthumously published The Art of Conjecturing (1713), nicely illustrated by his example. Suppose 60% of the balls in an urn are White, the rest are Black, and one ball is drawn at random. That ball is replaced, and the experiment repeated many times. Bernoulli showed that, so long as at least 25,550 drawings are made, for every time the proportion of White balls falls outside the range from 58% to 62%, it will fall inside that range at least one thousand times. Informally, the observed frequency of White balls is, in the long run, overwhelmingly likely to be close to its objective probability.
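Bernoulli’s conclusion was a rigorous mathematical bound, not an experiment, but a modern reader can watch the effect directly. Here is a minimal Python sketch (illustrative only) that draws from such an urn with replacement and tracks the running frequency of White:

```python
import random

# Draw repeatedly, with replacement, from an urn in which 60% of the
# balls are White, and watch the running frequency of White.
random.seed(1)                      # fixed seed, so the output is reproducible
p_white, draws, whites = 0.60, 25_550, 0
for n in range(1, draws + 1):
    whites += random.random() < p_white
    if n in (100, 1_000, 10_000, 25_550):
        print(f"after {n:>6} draws: frequency of White = {whites / n:.4f}")
```

The longer the run, the more tightly the running frequency clings to 0.60 – the behaviour Bernoulli quantified.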

A similar analysis applies to any experiment that can be repeated indefinitely under identical conditions, where the result of one experiment has no effect on the others. Each time, certain outcomes denote Success, and their objective probability is some fixed value p. (This notion now carries the label Bernoulli trials.) Take any interval, as small as you like, around the value p – plus or minus 2%, plus or minus 0.1%, it matters not. Also, say how much more often you want the running frequency of Successes to be inside this interval, rather than outside it – a hundred times as often, a million times, whatever. Bernoulli’s methods show that any such demand can always be met, provided the experiment is repeated often enough. The observed frequency will be as close to the objective probability as you like, given enough data. This assertion is known as the Law of Large Numbers. The family’s fame was honoured in 1975 by the name choice ‘The Bernoulli Society’ for an international society whose main purpose is to foster advances in the study of probability and mathematical statistics.

Abraham de Moivre

De Moivre settled in England as a Huguenot refugee, and made a living from chess and from his knowledge of probability. Isaac Newton, then over 50 years old and with many calls on his time, deflected enquiries about mathematics with the words ‘Go to Mr de Moivre, he knows these things better than I do.’ De Moivre’s Doctrine of Chances appeared in English in 1718, and its second edition, in 1738, contained a major advance on Bernoulli’s work. To appreciate what he did, consider something specific: if a fair die is rolled 1,000 times, how far from the average frequency can we reasonably expect the number of Sixes to be?

De Moivre developed a simple formula that was widely useful for questions of this nature. One of his superb insights was to realize that the deviation of the actual number of Sixes from the average expected was best described by comparing it to the square root of the number of rolls.

It is hard to overplay the significance of this discovery. When you hear that an opinion poll has put support for a political party at 40%, it is often accompanied by a reminder that this is only an estimate, but that the true value is ‘very likely’ to be in some range like 38% to 42%. The width of such a range tells you about the precision of the initial figure of 40%, and if you want higher precision, you need a larger sample: this square root factor means that to double the precision, the sample needs to be four times as large! We have a law of diminishing returns with a vengeance – to do twice as well, we must spend four times as much.
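A back-of-envelope version of that poll calculation can be sketched in Python. The two-standard-error rule of thumb used here is a standard modern convention rather than de Moivre’s own formula, and the sample sizes are invented purely for illustration.

```python
import math

# Rule-of-thumb margin of error (about two standard errors) for an
# estimated proportion p based on a simple random sample of size n.
def margin(p, n):
    return 2 * math.sqrt(p * (1 - p) / n)

for n in (600, 2_400):                      # quadruple the sample size ...
    print(n, f"about +/- {100 * margin(0.40, n):.0f} percentage points")
# 600 gives about +/- 4 points; 2,400 gives about +/- 2 points:
# four times the data buys only twice the precision.
```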

De Moivre’s approach can be illustrated by looking at how many Heads will occur in twenty throws of a fair coin. Taking all sequences of length 20 such as HHHTH . . . HTHT as equally likely, we can construct Figure 1, where the heights of the vertical bars show how many of the one million or so different sequences produce exactly 0, 1, 2, . . . , 19, 20 Heads. The respective objective probabilities are then proportional to these heights. De Moivre showed that the best-fitting continuous smooth curve through the tops of these bars is very close to a particular form, now often called the normal distribution .

1. Relative frequencies of Heads in 20 throws

A curve of this nature arises for any large number of coin throws, and also when the chance of Heads differs from one half. All these curves bear a simple relation to each other, so de Moivre could produce a single numerical table for just one basic curve, and use it everywhere. A good estimate of the proportion of times that the overall frequency of Successes would be within certain limits could now easily be found – all that was needed was the chance of Success, and the number of times the experiment was to be conducted. You’re going to roll a fair die 200 times and you want to know how likely it is that the number of Sixes will be between 30 and 40? Or how likely is it that a fair coin will fall Heads more than 60 times in 100 tosses? No problem – de Moivre had the solution.
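De Moivre worked from printed tables of his curve; the same approximation takes a few lines of Python today. This sketch (illustrative only; the continuity correction is a standard modern refinement rather than de Moivre’s wording) answers both questions just posed:

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of the standard normal curve."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def chance_between(n, p, low, high):
    """Normal approximation (with a continuity correction) to the chance
    that the number of Successes in n trials lies between low and high."""
    mean, sd = n * p, math.sqrt(n * p * (1 - p))
    return normal_cdf((high + 0.5 - mean) / sd) - normal_cdf((low - 0.5 - mean) / sd)

print(chance_between(200, 1/6, 30, 40))   # Sixes in 200 rolls: about 0.68
print(1 - normal_cdf((60.5 - 50) / 5))    # more than 60 Heads in 100 tosses: about 0.018
```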

Suppose we know the ages of death for a group of men, all of whom reached at least their fiftieth birthday. De Moivre’s work could answer the question: ‘If a man aged 50 is more likely than not to die before reaching 70, how likely is it that the figures observed for that group would arise?’ Useful though this was, it did not answer the key question posed by the nascent life insurance industry: ‘How sure can we be that a 50-year-old man is more likely than not to die before he reaches the age of 70?’

Inverse probability

The ideas of Thomas Bayes, a Presbyterian minister who dabbled in mathematics, are far better appreciated now than in his lifetime. His Essay towards solving a problem in the doctrine of chances, published in 1764, three years after he died, gives the beginnings of a general approach to subjective probability, and a way of addressing the actuaries’ problem about inferring probabilities from data. It also included an essential tool for working with probabilities, termed Bayes’ Rule.

To illustrate the latter, suppose we throw a fair die twice. Given that the score on the first throw is three, it is easy to find the chance that the total score is eight, as this happens precisely when five is scored on the second throw. With hardly a pause, we give the answer 1/6. But turn the problem round, and ask: given that the total score is eight, what is the chance that the first throw yielded three? The answer is far less obvious, but can be found by applying Bayes’ Rule. Under the standard model of dice throws, the chance turns out to be 1/5.
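The reversal is easy to verify. A short Python sketch (purely illustrative) lists the 36 equally likely throws, keeps those totalling eight, and also applies Bayes’ Rule directly:

```python
from itertools import product
from fractions import Fraction

# All 36 equally likely results of throwing a fair die twice.
outcomes = list(product(range(1, 7), repeat=2))
total_eight = [o for o in outcomes if sum(o) == 8]        # (2,6) (3,5) (4,4) (5,3) (6,2)
first_is_three = [o for o in total_eight if o[0] == 3]

print(Fraction(len(first_is_three), len(total_eight)))    # 1/5

# The same answer via Bayes' Rule:
# P(first = 3 | total = 8) = P(total = 8 | first = 3) * P(first = 3) / P(total = 8)
print((Fraction(1, 6) * Fraction(1, 6)) / Fraction(len(total_eight), 36))   # 1/5
```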

This notion of inverse probability is central to the way evidence should be considered in criminal trials. Suppose fingerprints found at a crime scene are identified as belonging to a known individual, Smith. The probability of finding this evidence, if Smith is innocent, is likely to be very low. But it is not ‘How likely is this evidence, given that Smith is innocent?’ that the Court passes judgment on: it is ‘How likely is Smith to be innocent, given this evidence?’ Bayes’ Rule is the only sound way to obtain an answer. We will see in later chapters how this Rule helps in making sensible decisions.

The insights shown by Bayes were overlooked for many years, but he did identify the central problem: if the chance of Success in a series of Bernoulli trials, like dice throws, is unknown, but the respective numbers of trials and Successes are known, how likely is it that this unknown chance falls between specified limits? Laplace, a far superior mathematician, was able to carry out the computations that had defeated Bayes.

From tentative beginnings in 1774 to a synthesis in 1812, Laplace steadily improved his analysis, and gave explicit formulae to answer Bayes’ question. For example, using data on the numbers of male and female births in Paris, he concluded that it was beyond doubt that the chance of a male birth exceeded that for a female – he put the probability this was false as about 10⁻⁴²!

Bayes is buried in the London cemetery of Bunhill Fields, near the Royal Statistical Society. The vault has been restored, and displays a tribute to Bayes paid for by statisticians worldwide.

The Central Limit Theorem

Write the list of outcomes of a collection of Bernoulli trials as a sequence of Successes and Failures, e.g. FFFSF FFSSF SFF . . . Now replace each S by the number one, and each F by zero, giving 00010 00110 100 . . . This indicates a cunning way to think about the total number of Successes in these trials: it is just the sum of these numbers (agreed?). De Moivre had given a good approximation that described how this sum would vary, using his so-called normal curve.

A vast array of quantities we might want to consider do arise as a sum of randomly varying individual values. For example, a local authority responsible for rubbish disposal is interested mainly in the total amount over the town, and not in the separate random amounts from each household. When a gardener sows runner beans, it is the total yield, not that in each pod, that concerns him. A casino judges its financial success on its overall winnings, irrespective of the fates of individual gamblers. Being able to regard an item of interest as the sum of a large number of random bits is often fruitful.

Laplace extended de Moivre’s work to cover cases like these. He established a Central Limit Theorem, which says that something that is the sum of a large number of random bits will, to a good approximation and in a wide range of circumstances, fit de Moivre’s normal distribution. We don’t need the details of how the individual components tend to vary: the way the total amount varies will closely follow this normal law.

To use this idea, we require just two numbers: first, the overall average amount, and second, a simple way of expressing its variability. Given those two numbers, any probability can be found from de Moivre’s tables.
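The point can be seen in a small simulation. The Python sketch below uses invented figures (500 hypothetical households, skewed individual amounts averaging 10 units, none of which come from the text) and checks that the total behaves just as the normal curve with that mean and variability predicts:

```python
import math
import random

# Invented example: total weekly rubbish from 500 hypothetical households,
# each contributing a skewed random amount (average 10 units).
random.seed(2)
def one_total():
    return sum(random.expovariate(1 / 10) for _ in range(500))

totals = [one_total() for _ in range(5_000)]
mean = sum(totals) / len(totals)
sd = math.sqrt(sum((t - mean) ** 2 for t in totals) / len(totals))

# De Moivre's curve predicts roughly 68% of totals within one sd of the mean.
share = sum(abs(t - mean) < sd for t in totals) / len(totals)
print(round(mean), round(sd), round(share, 3))    # roughly 5000, 224, 0.68
```

About 68% of the simulated totals fall within one standard deviation of the mean, as the normal curve predicts, even though the individual household amounts are far from normal.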

Enter Carl Friedrich Gauss (1777–1855), bracketed with Newton and Archimedes at the top of the mathematical tree of genius. He was investigating how to deal with errors in the observations of the positions of the stars and planets. He suggested that on average the error was zero – observations were just as likely to be wrong a bit to the left as a bit to the right – and its size followed this same normal distribution. He took this path for its mathematical simplicity, but when Laplace saw Gauss’s book, he linked it to his own work. He argued that because the total error in an observation arises as the agglomeration of many random factors, such an error ought to follow the normal law. Gauss’s lame excuse of ‘mathematical convenience’ was replaced by Laplace’s more persuasive ‘mathematics indicates that . . . ’.

The term ‘normal’, applied to this distribution, is unfortunate. It suggests that, in the first instance, we should expect any data we come across to follow its format, but this is far from the case. To avoid this implication, and to honour a great man, we will switch to the alternative term Gaussian . If you can persuade yourself that your item of interest can plausibly be regarded as the sum of a large number of smaller variable bits, having largely unrelated origins, this Central Limit Theorem says that the item can be expected to vary in a Gaussian manner.

Do observational errors really follow this law? According to Henri Poincaré, the last mathematician to feel comfortable across the whole existing mathematical spectrum, ‘Everybody believes in it, because the mathematicians imagine it is a fact of observation, and observers that it is a theorem of mathematics.’

Siméon Denis Poisson

Poisson is best known for a distribution – the way in which probabilities vary around an average – that carries his name. An example arose in the work of the physicist Ernest Rutherford and his colleagues, when they counted how many alpha particles were emitted from a radioactive source over intervals of length 7.5 seconds. This number varied from zero to a dozen or so, with an average just under four. Figure 2 illustrates two typical experiments, showing (in those cases) four and five emissions respectively. Rutherford expected the emissions to occur at random.

Chop the 7.5 seconds into a huge number of really tiny intervals, so small that we can neglect the possibility of more than one emission in them. All but a few intervals will have zero emissions; the rest will have just one. Within each tiny interval, regard an emission as a Success, so that the total number of particles emitted is just the number of Successes – Bernoulli trials again.

In a really tiny interval, the chance of a Success is effectively proportional to its length, so as this length shrinks, we have an increasing number of intervals, each having a decreasing chance of Success. Poisson worked out the exact chances for 0, 1, 2, . . . emissions altogether as the lengths of the tiny intervals reduce down to zero.
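That limiting process can be sketched numerically in Python. The average of 3.87 emissions per interval used here is an illustrative figure consistent with ‘just under four’, not a value quoted in the text:

```python
import math

# Slice the 7.5 seconds into n tiny intervals, each with a small chance
# lam/n of containing one emission, and compare the binomial chances of
# k emissions with Poisson's limiting formula e**(-lam) * lam**k / k!.
lam = 3.87                     # illustrative average, 'just under four'
n = 10_000                     # a huge number of tiny intervals

for k in range(5):
    binom = math.comb(n, k) * (lam / n) ** k * (1 - lam / n) ** (n - k)
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    print(k, round(binom, 4), round(poisson, 4))
```

The two columns agree to the displayed accuracy: with enough tiny intervals, the binomial chances have already settled onto Poisson’s formula.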

2. Times of emission of alpha particles

This Poisson distribution arises frequently, at least as an excellent approximation, whenever the things we count are happening ‘at random’. It was appropriate for Rutherford’s data; it fits the numbers of flying bombs that landed on different parts of south London in World War II; it seems useful as a model for the number of misprints in each block of 1,000 words in a book. If you simultaneously deal out two randomly shuffled decks of cards, face up, on average you will have exactly one match between them; but the actual number of matches will vary very much like a Poisson distribution. A gruesome example of this distribution, foisted on generations of students, is of twenty years of data for the numbers of officers in the different Prussian Cavalry Corps who were kicked to death by their horses.

All of those examples conform to the same pattern: a large number of opportunities, each with a tiny chance of coming off. Whenever the phenomenon you are studying fits that template, this Poisson model is likely to be useful.

The Russian School

A mathematical theorem takes the format: if certain assumptions are true, then a desired conclusion follows. The main interest is in applying the desired conclusion, so it is most useful when the required assumptions are not very onerous. Sometimes the desired conclusion can be demonstrated only under very restrictive assumptions, or with great difficulty: later workers may find easier ways to use the same assumptions, or reach the same conclusion under less restrictive conditions. Best of all is when the conclusion can be shown true under very mild assumptions, and with a short and elegant argument. The work of Pafnuty Chebychev (1821–94) gives a fine example of this ideal.

Chebychev helped to show how a Law of Large Numbers applied in wider circumstances. The original Law related to Bernoulli trials, describing how well the proportion of Successes in a sequence of trials could estimate the chance of Success. If we want to estimate the average height of soldiers joining an army, or the cost of feeding a family for a week, it seems obvious that we can do so by taking a suitable sample from the relevant population. But how good will that estimate be? Chebychev’s work gave a firm idea of the probability that the error would be small enough to make the estimate reliable.

Much of Statistics rests on the applications of these ideas.

Chebychev’s best-known student is Andrey Markov, whose teaching inspired a further generation of talented Russians. Markov applied his ideas to poetry and literature. By replacing the vowels and consonants in Pushkin’s Eugene Onegin by the respective letters v, c, he generated a sequence with just those two symbols. In the original Cyrillic alphabet, vowels formed about 43% of the text. After a vowel, another vowel occurred some 13% of the time, while after a consonant, vowels arose 66% of the time.

To predict whether the next symbol would be v or c, he discovered that, given the current symbol, he could effectively ignore all its predecessors, so little help did they give.
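Markov’s two figures are enough to recover the overall share of vowels he observed. The sketch below (a Python illustration using the percentages reported above; the iteration scheme is mine, not Markov’s) runs the two-state chain until its proportions settle down:

```python
# Markov's reported transition figures for Eugene Onegin: after a vowel,
# a vowel follows about 13% of the time; after a consonant, about 66%.
p_vowel_after_vowel, p_vowel_after_consonant = 0.13, 0.66

p_vowel = 0.5                          # arbitrary starting guess
for _ in range(50):                    # iterate the two-state chain
    p_vowel = (p_vowel * p_vowel_after_vowel
               + (1 - p_vowel) * p_vowel_after_consonant)

print(round(p_vowel, 3))               # about 0.431 -- close to the 43% observed
```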

This ‘forgetting’ property holds widely. Examples include: the successive values of a gambler’s fortune; the daily weather (Wet or Dry) in Tel Aviv; the lengths of many queues, counted as each customer leaves; the genetic compositions of consecutive generations; the diffusion of gases between two linked containers.

Whenever, in a randomly varying sequence, knowledge of the present value lets us ignore all earlier values when predicting future ones, the sequence is said to have the Markov property. The theory of such sequences is now well developed, and is the basis for many successful applications of probability ideas.

Markov, also active in politics, had a fine sense of mathematical history. In 1913 the Russian government organized celebrations to mark 300 years of Romanov rule, so Markov countered with events to commemorate the 200 years since Bernoulli’s discovery of the first Law of Large Numbers.

I digress to mention the work, early in the 20th century, of the Frenchman Émile Borel. Recall the Law of Large Numbers for Bernoulli trials: that after many trials, it is overwhelmingly likely that the actual frequency of Successes is very close to the probability of Success. This still leaves open the possibility that, during an indefinite number of trials, the actual Success frequency occasionally ventures outside any given tolerance band around that Success probability. But Borel’s work killed that notion stone dead. Given any such tolerance band, there will come a time (we can’t be sure when, but it will happen) after which the actual frequency of Successes stays inside the band permanently. This is known as the Strong Law of Large Numbers.

3. Illustration of the Strong Law of Large Numbers; p is the probability of Success, the dashed lines show a tolerance band. After trial T, the actual Success frequency stays permanently within the tolerance band

This Strong Law also extends to wider circumstances: we can sum up the message of the Laws of Large Numbers by the informal phrase:

in the long run, averages rule.

In 1924, Alexander Khinchin published the wonderfully named Law of the Iterated Logarithm. Like the earlier work of Bernoulli and Laplace, it applied to a random quantity arising as a sum, and gave even more precise information on how close that sum would be to its average value.

For some three hundred years, advances in the workings of probability came from a range of ad hoc methods. Then in 1933, the outstanding Russian scientist Andrey Kolmogorov used the recently developed ideas of measure theory to set the subject in a satisfactory logical framework. All the known theorems could be recast in Kolmogorov’s setting, giving a precision that was a catalyst for the developments that followed.

Kolmogorov, along with Khinchin and their student Boris Gnedenko, also greatly extended Laplace’s work on sums of random quantities. They were motivated by ways of increasing the reliability of machines used in textiles and other manufacturing industry, quality control on production lines, and the problems caused by congestion.

Kolmogorov was a superb researcher and teacher. When he died in 1987, the then Soviet president Mikhail Gorbachev rearranged his duties so as to be able to attend the funeral.

More modern times

War has frequently provoked scientific advances. The 1939–45 conflict boosted the development of operations research, with much of its success resting on sensible use of the ideas of probability. To maximize the probability that a supply ship would avoid being sunk by enemy submarines, a combination of data and calculation led to the conclusion that convoys were better than single ships, and large convoys better than smaller ones. When this conclusion was acted on, losses fell dramatically. The outline of the codebreaking work at Bletchley Park is now well known; however, the importance of the use of Bayes’ Rule to identify the most promising path to discover the settings of the reels on the Enigma machines is often overlooked.

In 1950, William Feller published his magnificent introductory book on probability, with further editions in 1957 and 1968. This book is my nomination for the best non-fiction book ever written. Directly and indirectly, with its mixture of intuition and rigorous argument, it led to a spectacular growth of interest in the subject. A little later, Joe Doob used the term martingale (which originally meant the gambling ‘system’ of doubling the stake after each loss) for a collection of random quantities where, loosely speaking, their average value at some future time is the same as the current value. He developed the main properties of martingales and closely related ideas: this work was widely useful, as it turned out that many collections of random quantities of practical interest fell within the scope of this theoretical investigation. Later we will illustrate how probability concepts have been usefully applied across a range of fields.

Many academic journals specializing in probability have been launched, some have spawned offspring, none report that they are short of material well worth publishing. The capabilities of modern computers have transformed the environment for calculating probabilities: their speed and storage capacity have greatly increased the range of soluble problems. Earlier, most work dealt with probabilities influenced by just one factor, say time or distance, and exact calculation by humans was often possible; now, complex problems where probabilities change with time, three dimensions of space, and other influences have been successfully attacked.

Even so, it may well be that the largest influence of computers on the development of probability is through ease of communication. The language TeX has become the standard framework in which mathematics and much of science is written up. Research workers post their thoughts and ideas on the internet, and scholarly articles can easily be accessed from home or office on the World Wide Web.