As the world knows, on July 4 it was announced that the Higgs boson, or a reasonable facsimile, has been seen by two independent experiments at CERN. The statistical significance reported was expressed as "5-sigma." Let's look at what this means.
When subatomic particles are smashed together at high energy, they create a complex mix of secondary particles. Before you can claim a new discovery in that mix, you have to show, among other things, that the effect upon which you base the claim is very unlikely to be simply a statistical artifact.
The effect reported from CERN is of a type that particle physicists have been exploring for over fifty years. Basically, they looked for evidence of a particle with such a short lifetime that it would not leave any measurable track in the detector. Instead, it decays into secondary particles after travelling only a few nuclear diameters.
When I was a graduate student at UCLA in the early 1960s, bubble chambers and other detectors at the Lawrence Radiation Laboratory in Berkeley (now Lawrence Berkeley Lab), Brookhaven National Laboratory on Long Island, and CERN in Geneva were finding signs of many such short-lived particles that had never before been seen or even anticipated. They were clearly not composed of the well-known particles such as protons, neutrons, and electrons but seemed elementary in their own right.
We experimenters did then just what the Higgs searchers are doing now: we measured the energies and momenta of all the outgoing particles produced in high-energy particle collisions. Using these data, we formed a quantity called the "invariant mass" for each of the various particle combinations. Accumulating a large number of collision events, we then looked at the statistical distribution of the various invariant masses. When an unexpected "bump" appeared above the expected background, we had a candidate for a new particle.
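For readers who want to see the arithmetic, here is a minimal sketch of the invariant-mass calculation. It assumes natural units (c = 1) and that each particle is given as a tuple (E, px, py, pz) in GeV; the function name and the example four-momenta are illustrative only, not taken from any experiment's software.

```python
import math

def invariant_mass(particles):
    """Invariant mass of a group of particles.

    particles: list of (E, px, py, pz) tuples in GeV, natural units (c = 1).
    Uses m^2 = (sum E)^2 - |sum p|^2.
    """
    total_e = sum(p[0] for p in particles)
    total_px = sum(p[1] for p in particles)
    total_py = sum(p[2] for p in particles)
    total_pz = sum(p[3] for p in particles)
    m_squared = total_e**2 - (total_px**2 + total_py**2 + total_pz**2)
    # Measurement errors can push m^2 slightly negative; clamp at zero.
    return math.sqrt(max(m_squared, 0.0))

# Hypothetical example: two back-to-back photons of 62.5 GeV each.
photons = [(62.5, 62.5, 0.0, 0.0), (62.5, -62.5, 0.0, 0.0)]
print(invariant_mass(photons))
```

Filling a histogram with this quantity over many events gives the distribution in which a bump is sought.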
Of course, everyone wanted to discover a new elementary particle, and we all got excited whenever even the smallest bump appeared. At first, it seemed that a bump of three standard deviations, that is, "3-sigma," above the background was sufficient for a discovery. At the time, most physicists were not experts in statistics (and still aren't), and this struck us as reasonable. If the statistical fluctuations were given by a normal distribution (bell curve), then only about one time in 740 that you look at such a distribution would you get a 3-sigma bump or larger from a statistical fluctuation alone. That is, what is called the "p-value" was 1/740 = 0.00135.
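The 3-sigma figure can be checked with a few lines of code. This sketch assumes a one-sided (upward) fluctuation of a normal distribution and uses only the complementary error function from the standard library; the function name is my own.

```python
import math

def upper_tail_probability(n_sigma):
    """Probability that a normal variable exceeds its mean by n_sigma or more."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))

p3 = upper_tail_probability(3.0)
print(p3)        # close to 0.00135
print(1.0 / p3)  # roughly one in 740
```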
We had a simple rule of thumb that drove statistics experts crazy, and still does. If the background under the bump, estimated by looking at either side or calculated from some model, contained N events, then sigma was set equal to the square root of N.
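As a rough sketch of how that rule of thumb turns counts into a number of sigmas, the snippet below takes the usual Poisson counting estimate, sigma = sqrt(N) for a background of N events, as its assumption. The function name and the event counts are hypothetical.

```python
import math

def bump_significance(observed, background):
    """Excess over background in units of sigma, with sigma = sqrt(background)."""
    if background <= 0:
        raise ValueError("background must be positive")
    return (observed - background) / math.sqrt(background)

# Hypothetical mass window: 130 events seen where 100 background events are expected.
print(bump_significance(130, 100))  # 3.0
```

The estimate is crude for small N, where the Poisson distribution is noticeably asymmetric, which is one reason it irritates statisticians.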
Now, here my fifty-year-old memory gets hazy, and I have not been able to dig up any documentation. (If anyone has any, I would greatly appreciate getting a copy.) As I recall, at first the journals were publishing 3-sigma results. But many were not being independently replicated. So, again according to what I remember, the primary physics journal for rapid publication, Physical Review Letters, asked Art Rosenfeld at Berkeley to come up with a criterion for publication. He used frequentist probability arguments, which advocates of Bayesian statistics despise but which have served us particle physicists well over time.
Art counted up all the experiments being done, all the plots being looked at, all the bins on the plots, all the combinations of particles for which invariant masses were being measured, and came up with a rule that has been at least informally in use since: the probability of the bump being a statistical fluctuation must be less than 1 in 10,000. For a normal distribution, only one in 31,574 times will you get an upward statistical fluctuation of 4-sigma or greater. The observed 5-sigma fluctuation for the Higgs, or a larger one, would result only once in 3.5 million trials.
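The reason for counting plots, bins, and combinations can be sketched numerically. If each of n independent looks has a chance p of producing a given bump by fluctuation alone, the chance of at least one such bump somewhere is 1 - (1 - p)^n. The figure of 1,000 looks below is a made-up number for illustration, not Art's actual count, and real looks are rarely fully independent.

```python
import math

def upper_tail_probability(n_sigma):
    """One-sided normal tail probability for an upward fluctuation of n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))

def global_p_value(local_p, n_looks):
    """Chance of at least one fluctuation in n_looks independent looks."""
    # expm1/log1p keep precision when local_p is tiny.
    return -math.expm1(n_looks * math.log1p(-local_p))

n_looks = 1000  # hypothetical number of independent places to look
for n_sigma in (3.0, 4.0, 5.0):
    local_p = upper_tail_probability(n_sigma)
    print(n_sigma, 1.0 / local_p, global_p_value(local_p, n_looks))
```

With these assumed numbers, a 3-sigma bump somewhere is more likely than not, while a 5-sigma bump remains very improbable.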
However, this method of analysis is open to question. Several observers have pointed out a flaw, which is known in the literature as "sampling to a foregone conclusion." That is, the experimenters keep collecting data until they reach the level, in this case 5-sigma, at which they can reject the null hypothesis. The proper method, according to the experts, is to decide ahead of time what criterion you will use and also how much data you will take before rejecting the null hypothesis. Since that is not generally done, it is technically illegitimate to interpret the result as a probability.
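A toy simulation shows the size of the effect. The sketch below generates background-only data (no signal at all), checks the running significance after every batch, and compares how often the threshold is crossed at any look with how often it is crossed at a single, pre-chosen final look. All parameters are assumptions chosen for speed; in particular it uses a 2-sigma threshold, since 5-sigma crossings are far too rare to simulate quickly.

```python
import math
import random

def false_positive_rates(n_experiments=2000, max_looks=100, threshold=2.0, seed=1):
    """Return (rate with peeking at every look, rate at the final look only)."""
    rng = random.Random(seed)
    crossed_any_look = 0
    crossed_final_look = 0
    for _ in range(n_experiments):
        total = 0.0
        crossed = False
        for n in range(1, max_looks + 1):
            # Each batch contributes a unit-variance fluctuation with zero mean.
            total += rng.gauss(0.0, 1.0)
            z = total / math.sqrt(n)  # running significance in sigmas
            if z >= threshold:
                crossed = True
        crossed_any_look += crossed
        crossed_final_look += z >= threshold
    return crossed_any_look / n_experiments, crossed_final_look / n_experiments

peeking, fixed = false_positive_rates()
print(peeking, fixed)
```

An experimenter who stops at the first crossing claims an effect far more often than the nominal tail probability suggests, which is the statisticians' complaint.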
But it's the method we have used in particle physics for half a century and, so far, it has not resulted in any major discovery claim being later proven to be in error. Furthermore, in my experience I saw many 3-sigma bumps go away as more data were accumulated. In any case, physicists no longer leave it just at that. They perform sophisticated Monte Carlo computer simulations of the experiment using their best available models and compare results with the assumed signal (signal plus background) and without it (background only). This was a major activity of mine when I was in research.
The assumption of a normal distribution of fluctuations may not be a good one. In today's experiments, the events are cut in so many different ways that biases away from normal statistics can occur. The Monte Carlo analyses can avoid this by calculating the relative probabilities for the data fitting to signal plus background and background alone.
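To give the flavor of such an analysis, here is a deliberately tiny sketch for a single counting bin. It generates background-only pseudo-experiments from a Poisson distribution, with no normal approximation, and also computes the log of the likelihood ratio between signal plus background and background alone. The event counts are invented, and real analyses simulate the full detector and fit many bins at once.

```python
import math
import random

def poisson_sample(rng, mean):
    """Draw a Poisson count by Knuth's multiplication method (modest means only)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def toy_p_value(observed, background, n_toys=20000, seed=7):
    """Fraction of background-only toys with at least `observed` events."""
    rng = random.Random(seed)
    hits = sum(poisson_sample(rng, background) >= observed for _ in range(n_toys))
    return hits / n_toys

def log_likelihood_ratio(observed, signal, background):
    """ln[ L(observed | signal + background) / L(observed | background) ]."""
    return observed * math.log((signal + background) / background) - signal

# Hypothetical bin: 40 events observed, 25 expected from background, 15 from signal.
print(toy_p_value(40, 25.0))
print(log_likelihood_ratio(40, 15.0, 25.0))  # positive: data favor signal plus background
```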
Of course, statistical significance is a major concern in all experimental sciences, and for a long time I have been critical of what I regard as unacceptably low publication standards used in some fields.
I still remember going to the World Skeptics Conference in Buffalo in 1996, which featured many prominent speakers including Stephen Jay Gould and Chris Carter, the creator of The X-Files. One speaker was Jessica Utts, a professor of statistics at UC Davis. She argued that the standard that was used for publication in medicine and psychology, p-value = 0.05, was adequate to show that ESP exists. She said that evidence for ESP was just as good as the evidence that aspirin helps avoid heart attacks.
I stood up and protested that this implied that one out of every twenty reports of some positive effect was a statistical fluctuation. Furthermore, since negative results are often not published, one can only wonder how many reports in these fields are trustworthy, if any.
Note the difference between an extraordinary claim (ESP) and an ordinary one (aspirin). In the case of aspirin, we can provide a simple explanation: aspirin thins the blood and makes arterial blockages less likely. We have no explanation for ESP within existing knowledge. Claims that it is supported by quantum mechanics are total nonsense.
And there's more. Here again I must rely on memory, since to my knowledge no documentation exists. When in the 1980s I was working on very high-energy gamma ray astronomy on Mt Hopkins in Arizona, using the atmospheric Cherenkov technique, a collaborator reported to the rest of us that he thought he saw a signal from a certain pulsar. We all rushed to a meeting at his university and spent the better part of a week going over the data. His original estimate of the probability that the observation was a statistical fluctuation was one in a billion. After we counted all the various combinations he had looked at, the probability dropped to one in a thousand. This would have been more than adequate for a parapsychology or other pseudoscientific journal, but not a physics or astronomy one. We didn't publish. No one else since has reported a gamma ray signal from that pulsar.
I have looked at the results just reported by the two experiments at CERN, ATLAS and CMS. Both show 5-sigma signals for a range of secondary particles at a mass of 125-126 GeV. The standard model of elementary particles and forces predicted a 4.6-sigma effect at that mass, although the value of the mass itself was not predicted. Not only is each individual result significant, rejecting the null hypothesis at a probability of one in over three million, but the fact that two independent experiments agree surely makes the case for a previously unknown particle at 125 GeV proven beyond a reasonable doubt. Whether it is the long-sought Higgs boson or a composite of known particles is yet to be definitively established.