Meaning of $5.1\sigma$ significance with regards to GW150914
I see where you are going with your question. Let me feed the flames.
The sigma value that is quoted is equivalent to a false alarm probability. It tells you how unlikely it is for your experiment, given your understanding (theoretical and empirical) of the noise characteristics, to have produced a signal that looked like GWs from a merging BH.
Personally, I prefer the statement in the text you quote. Such an event would have been seen (in both detectors) about once every 200,000 years. Given that the observations lasted about 16 days, the expectation is that there would be $2.2 \times 10^{-7}$ such events in the data, i.e. a one in 4.6 million chance.
The LIGO team have simply converted this number into a significance in numbers of sigma using an integral under one tail of the normal distribution. Using one of the readily available calculators, e.g. http://www.danielsoper.com/statcalc3/calc.aspx?id=20 , we see that 5.0-5.1$\sigma$ (known as z-scores) correspond to one-tailed p-values of $2.9\times 10^{-7}$ to $1.7\times10^{-7}$, bracketing the value found above.
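If you want to reproduce that arithmetic yourself, here is a minimal sketch in Python (assuming scipy is available; the 16-day observation time and the 1-per-200,000-years rate are the numbers quoted above):

```python
from scipy.stats import norm

# False-alarm rate quoted by LIGO: one such coincident event per 200,000 years.
far = 1.0 / 200_000.0              # events per year
t_obs = 16.0 / 365.25              # ~16 days of coincident observation, in years

# Expected number of background events this loud in the run (~2.2e-7);
# for such a tiny number this is also the probability of seeing one.
p_value = far * t_obs

# Convert the one-tailed p-value to a Gaussian significance (z-score).
z = norm.isf(p_value)              # inverse survival function, i.e. Phi^-1(1 - p)

print(f"p-value = {p_value:.2e}")  # ~2.2e-07, about 1 in 4.6 million
print(f"z       = {z:.2f} sigma")  # ~5.05 sigma, between 5.0 and 5.1
```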
However, this is not the confidence level that this is a gravitational wave or a merging black hole. There is always the possibility that some unanticipated source of error could have crept in that mimics a GW signal (though note that it would need to affect both detectors), or that some other astrophysical source could be capable of producing the signal. As far as I am aware, apart from the usual conspiracy theories (yawn), nobody has come up with a plausible alternative to GWs from a merging BH.
In all frequentist hypothesis testing, one finds a so-called $p$-value: the probability of obtaining observations at least as "extreme" (i.e. a test statistic at least as extreme) were the null hypothesis true.
The null hypothesis is rejected iff the $p$-value is less than a pre-specified significance level. Otherwise, the null is not accepted or confirmed - it is merely not rejected.
In this case, the null hypothesis is that

> the model of background noise correctly describes all input to the detectors

and it was rejected at high confidence.
The $p$-values are conventionally converted into one-tailed Gaussian significances, i.e. a number of standard deviations $Z$ such that an identical probability lies in the tail of a Gaussian distribution, $$ Z = \Phi^{-1}(1 - \text{$p$-value}) $$ where $\Phi^{-1}$ is the inverse of the standard Gaussian CDF. This convention is annoying, as the relation between $p$-value and significance isn't algebraic or easy to approximate. It would make more sense to simply report a $p$-value.
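As a quick illustration of the convention (a sketch in Python, assuming scipy), the conversion has to be done numerically in both directions:

```python
from scipy.stats import norm

p = 1.7e-7           # one-tailed p-value
print(norm.isf(p))   # Z = Phi^-1(1 - p), approximately 5.1

print(norm.sf(5.1))  # back again: upper-tail probability, approximately 1.7e-7
```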
In your comment, you allude to Bayes' theorem and a calculation of the probability or plausibility of the null hypothesis. The LIGO hypothesis testing is, however, strictly frequentist. Only the probability of data and pseudo-data is considered. Since the data appears to be so strong in this case, there shouldn't be any qualitative differences in the conclusions of Bayesian or frequentist methods.
You are correct, of course, that $$ P(\text{any signal-like features due to chance}|\text{data}) $$ is not equal to $$ P(\text{data}|\text{any signal-like features due to chance}). $$ They are related by Bayes' theorem. Frequentist methods, including the LIGO methodology, consider only the latter.
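Written out explicitly, with $H_0$ standing for "any signal-like features are due to chance", Bayes' theorem gives $$ P(H_0 \mid \text{data}) = \frac{P(\text{data} \mid H_0)\,P(H_0)}{P(\text{data})}, $$ so getting from the reported quantity to the plausibility of the null would additionally require a prior $P(H_0)$ and the evidence $P(\text{data})$, neither of which the frequentist analysis specifies.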
You may find arXiv:1609.01668 interesting, as it discusses differences between Bayesian and frequentist analyses of LIGO signals. Remarkably, even small significances could correspond to colossal Bayes-factors. The $5.1\sigma$ event had a Bayes-factor of $10^{125}$, which is the largest number I've seen in this context.
It's a p-value, written in terms of a z-score.
Any computation of a chance is predicated on a model; sometimes that is even enshrined in the name Null Hypothesis. For the first direct sighting of a gravitational wave, the Null Hypothesis could be that gravitational waves don't exist, but your detectors can still react to noise.
Now, the computation isn't as simple as the chance of getting any one particular set of data. You actually sort the possible data into those that look like the predicted wave data and those that don't, and then, within those that look like the predicted wave data, you order them by strength.
You then find the chance that the detector, from noise alone, reacts as if there were that strong a signal ... or stronger (and that "or stronger" part is what the last two paragraphs are all about). And that's your p-value. It really is about the risk of saying you saw a signal when actually data like that sometimes happens by chance ... given the null hypothesis.
Finally, you take the probability computed above and find the z-score cutoff that has that probability as its tail. And then you report that z-score in "units" of $\sigma.$
The point is that such a standard can decrease how often we announce discoveries to each other that were really just noise. And physicists have a pretty high standard (compared to p-values of 0.05 or 0.01).
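To put numbers on that (a small sketch, assuming scipy), here are the one-tailed p-values for a few conventional thresholds:

```python
from scipy.stats import norm

# One-tailed p-values for a few conventional thresholds.
for z in (1.64, 2.33, 3.0, 5.0):
    print(f"{z:.2f} sigma -> p = {norm.sf(z):.2e}")
# roughly 0.05, 0.01, 1.3e-3 and 2.9e-7 respectively
```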
> Can someone give the exact calculation of getting from a false alarm rate $= 1/200,000$ years to $5.1\sigma$?
Intuitively, you are looking at the theory to identify things called signals, and then looking at the detectors to find out how often they produce results that look like those signals just from noise. So it involves knowing what the signals look like and how the detectors react to noise. Both are things you should know if you are designing a detector, but neither is going to be a simple calculation. The theory required many very long and tedious calculations and hours of computer time. The noise is also hard to compute, since so many things were put in to reduce it. They literally adjusted how the arms work so that some regions have less noise than the zero-point energy produces naturally.
But once you have the set of signals and the model of how the detector reacts to noise, the rate (in time) at which the detector generates, from noise alone, results that look like the signals will depend on the lengths of the different signals. A short signal has many chances to appear in a 200,000-year period; a longer signal has fewer.
It isn't a simple calculation when you have lots of different signals of different lengths and shapes; you can't just look it up in a table. You can look up the p-value to z-score conversion in a table, but the conversion to a rate in time will depend on how often the machine is put into data collection mode and how long the possible signals take to collect when the machine is on.
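That said, once the pipeline has produced the false alarm rate, the last step the quoted question asks about is short. A rough sketch (in Python, assuming scipy, a 1-per-200,000-years rate, 16 days of coincident data, and Poisson-distributed background):

```python
import math
from scipy.stats import norm

far = 1.0 / 200_000.0    # false-alarm rate, background events per year
t_obs = 16.0 / 365.25    # coincident observation time, years
lam = far * t_obs        # expected number of background events

# Probability of at least one background event at least this loud;
# for such a small expectation this is essentially lam itself.
p = 1.0 - math.exp(-lam)

print(p)                 # ~2.2e-7
print(norm.isf(p))       # ~5.05 sigma, i.e. between 5.0 and 5.1
```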
To address the subtext: if you sit on your hands and don't report a 5$\sigma$ result, then you shouldn't have built your detector. That doesn't mean any particular alternative to the Null Hypothesis is correct. It means that's the agreed-upon standard for when to report your results.
It's designed to not have too many reports about things that are just noise.