How Bayes’ Rule Emerged Triumphant from Two Centuries of Controversy (mcgrayne.com)
152 points by evanb on June 26, 2016 | hide | past | favorite | 80 comments


Bayesian here. 'The theory that would not die' is a wonderful read, and notes how many scientists used Bayesian techniques (subjective probability) in a variety of contexts while the field was still unpopular (arguably, heretical) in the mainstream statistical community.

Bayes was nevertheless used to inform ballistics calculations, help crack the Enigma code, and plan search patterns for lost nuclear weapons.

I don't think we have seen the end of Bayes, either, as it is very useful for uncertainty quantification in the engineering sciences, machine learning techniques, and even the discovery of the Higgs boson.


As someone new to Bayesianism I'd be interested in hearing your experience applying it in day-to-day life. How useful do you think it is to the ordinary Joe?

From my brief experience, after learning Bayes, my intuition about things involving probabilities grew very different from that of the people around me. For example, my friends were planning a skydiving trip, and I googled the name of the skydiving business and found that they had fatal incidents in the past. My friends were convinced that, given those past incidents, either the probability of us having an incident with them was unchanged or, worse, that since an incident had already happened, it was now less likely to happen again. Wat.


The one place I've used Bayes (hopefully properly!) is in a spaced repetition flash card program. Usually spaced repetition algorithms wait a certain amount of time based on how many times you have seen and remembered a card. The more times you have remembered it, the longer you wait. It then creates a schedule for each day. You review the cards that have "expired" their wait time.

I wanted to turn this upside down. Instead I sorted the cards by how likely you would be to remember them (a function of the number of times you have already remembered it and the amount of time that has passed since you last saw it). I put the least likely to be remembered cards first and the most likely to be remembered cards last.

As you reviewed the cards you would either remember them or not remember them. I supposed that cards grouped together had a similar probability of being remembered (only valid if my sort algorithm was correct, but I had some confidence on that since I based it on somebody else's research ;-) ). I then used Bayes to estimate the probability of getting the card correct.

So instead of scheduling the cards, I simply had the user keep reviewing until I had a certain confidence that there was a 90% or better chance of getting the cards correct. At that point I left the rest for another day (the probability goes down over time). Interestingly, since the test is binary (remembered/not remembered), Bayes could be simplified down to getting a card right a certain number of times in a row. This was simpler for the user to understand without significantly reducing the accuracy of the estimate.
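The author's actual model isn't shown, but the "right N times in a row" simplification is consistent with, for example, a uniform prior on each card's recall probability; this toy sketch (my own assumption, not the program's code) shows how the streak length falls out of the rule of succession:

```python
# Toy model: put a uniform Beta(1, 1) prior on a card's recall
# probability p.  After k consecutive correct answers the posterior is
# Beta(k + 1, 1), whose predictive probability of the next success is
# (k + 1) / (k + 2) -- the rule of succession.  So "90% or better"
# reduces to a fixed streak length.

def predictive_recall(k: int) -> float:
    """P(next answer correct | k consecutive correct), uniform prior."""
    return (k + 1) / (k + 2)

def streak_for(target: float) -> int:
    """Smallest streak k whose predictive probability reaches target."""
    k = 0
    while predictive_recall(k) < target:
        k += 1
    return k

print(streak_for(0.90))  # 8 straight successes suffice under this prior
```

Under a different prior the required streak changes, but the shape of the simplification (confidence threshold becomes a fixed run length) stays the same.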

Gratifyingly the system worked incredibly well. I added an option to continue drilling the cards after you hit the 90% mark and I very rarely got into a situation where the estimate was wrong.


That sounds very interesting. Is your program available anywhere?


See my reply to mtrimpe below. I maintained this program for quite a long time, but realistically my choice of development platform was a poor one ;-) Also my code was pretty awful, as I was experimenting with several strange ideas and also writing Ruby code as if I had spent the last 20 years writing C++ (which... um... might have been true...)

You can likely get it to work for some definitions of "work" on a Linux box, but anything else would require serious effort ;-)

Link in case you don't see the other message: https://github.com/mikekchar/JLDrill


That sounds fascinating. Is there anywhere you could look at the result (or code ;) of this work?


Umm... The code is horrific as I was experimenting with a few different ideas. It is written in Ruby for GTK+ and is not at all idiomatic Ruby. It's also somewhere between slightly and completely broken at the moment... But with all that in mind: https://github.com/mikekchar/JLDrill

Probably more interesting is simply my description of the scheduling algorithm: https://github.com/mikekchar/JLDrill/blob/master/web/src/Str...

There is also one detail missing, which is forgetting. Because the items are sorted by the ratio of time waited to "ideal schedule", we can easily stick anything over a certain amount into a separate set (called the forgotten set). That way, if you don't study for a long time, you can "forget" those items and they are treated like a high-priority "new" set until the set is empty. Probably that makes no sense, but if you read my strategy document, you will hopefully be able to understand.


That basically seems to ignore the exponential Ebbinghausian forgetting curve. I doubt it results in better scheduling of the cards.


It's not exponential; it has a gamma distribution. The nice thing about the gamma distribution is that within a short part of the tail it is nearly linear. So while it's hard to explain the math in an HN posting, below a certain probability your odds of reaching a 90-95% confidence of a 90% recall rate (10% forgetting) are very, very low. So you will have to review all of those cards anyway. Once you get into the range where false positives are more likely (say above 80%), the curve starts to get more linear. This is especially true for cards with a very shallow curve (those that you have seen many times). So even when you get it wrong, the ones that are likely to be problematic are the ones you will review again quickly anyway.

As for whether it results in a better scheduling of cards, it's hard to say. It has the advantage of being self adjusting and containing far fewer magic numbers than something like SM(insert-any-version-here).


Did you find out the base statistics for fatalities in skydiving? That that particular business had fatalities may just indicate that they do a lot more volume, or more advanced types of skydiving, than other companies (in which case they may actually be safer than the norm). For me, one of the big differences Bayes' rule makes is seeing that information nearly always needs context to be understood correctly. You don't have the whole story unless you have a sensible prior.


Yeah, I thought about this and found a news article saying they have done around 5000 jumps a year since they started in 2001 (and they're located in Las Vegas, so I assume most of their customers are first-timers). So that is 75000 jumps. In the news I could find two fatalities and two serious injuries (but let's ignore the injuries for now). Apparently statisticians use the term "micromorts", meaning deaths per million instances of an activity, when talking about risks like this. So for this business the track record is 26.7 micromorts. Compared to the base rate in skydiving, which is 9, this means the business has about 3 times the expected risk. Am I getting this right?
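The arithmetic above can be checked with a quick script (all figures are the comment's own rough estimates, not vetted statistics):

```python
# Back-of-the-envelope micromort calculation, using the comment's
# 2016-era numbers (5000 jumps/year since 2001, 2 fatalities).

jumps_per_year = 5000
years = 15                                   # roughly 2001-2016
fatalities = 2

total_jumps = jumps_per_year * years         # 75,000 jumps
micromorts = fatalities / total_jumps * 1e6  # deaths per million jumps

baseline = 9                                 # quoted US skydiving rate
print(micromorts, micromorts / baseline)     # ~26.7, roughly 3x baseline
```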


Sounds like it. But like you said they're almost all probably first-timers, given the location. It's hard to tell if that completely explains the micromort difference or not.

Also, it is possible that due to those fatalities, the business implemented better safety protocols as a response and may actually be safer now than it was then.

It's my Jack in the Box theory. Back in the 90s there were a couple high profile food poisoning cases (salmonella I think) where customers had died after eating at Jack in the Box. It nearly ruined the company. But my feeling is that the safest time to eat there is a month or so after those events. When all the restaurants would be kicking their food-handling practices into high gear.

So now I eat at Chipotle more often than I did before. Shorter lines and probably the cleanest food. I hope. My "theory" may be bullshit.


You aren't applying Bayes correctly. As a general guide, you need to add the word "given" to your problem statement, assign two "situation A" and "situation B" variables, and then work the math.

For example, assigning variables:

A = You died

B = You're skydiving at that particular dropzone

Then: "What is the risk of mortality GIVEN that I am skydiving at this drop zone?" (P(A|B))

"What's the chance that I'm skydiving at this dropzone, given the fact that I died?" (P(B|A))

"What's the risk of my mortality while skydiving?" (P(A))

"What is my probability of skydiving at this dropzone?" (P(B))

P(A | B) = (P(B | A) * P(A)) / P(B)

So you calculated P(A | B) in a non-Bayesian way, without finding out the other information to calculate it using Bayes, and then stopped there, like most people do. This is why Bayes is often difficult for people to understand and apply correctly, and, honestly, it's probably not the equation you want for the situation you're looking at.

Another approach -- and this seems to be the one you want -- is to calculate a 95% confidence interval using a binomial distribution, to find out if their statistics are really anomalous. Death is a relatively rare event, and, even if they're distributed perfectly randomly, you'll find odd-looking clusters here and there.

To figure out if it's anomalous, many people would use the normal approximation to the binomial confidence interval, which would be wrong -- the probability of death is so tiny that it can't be approximated normally (the rule of thumb for using the normal approximation is P(A) x P(not A) x sample size > 5, which this fails), so we need to do an exact calculation. I've done this by hand before, but it's a pain in the butt. That's why we have calculators!

http://epitools.ausvet.com.au/content.php?page=CIProportion (If you don't trust it, you can use another one)

You enter your numbers:

sample size = 75000

number of deaths = 2

This gives the exact binomial confidence interval as: [3.23e-06, 9.633e-05]

This means that their actual death rate could be anything from .00000323 (that's about 3 deaths per million) to .0000963 (that's nearly 100 deaths per million)

Clearly, you cannot say with any meaningful level of confidence whether or not this drop zone is safer, or less safe, than average. Sorry!
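If you'd rather not trust an online calculator, the exact (Clopper-Pearson) interval quoted above can be reproduced with the standard library; this is my own sketch of the textbook method, found by bisecting the binomial tail probabilities:

```python
import math

# Clopper-Pearson exact binomial interval: the lower bound solves
# P(X >= k | p) = alpha/2 and the upper bound solves
# P(X <= k | p) = alpha/2, each found by bisection on p.

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p); cheap here because k is tiny."""
    log1mp = math.log1p(-p)
    return sum(math.comb(n, i) * p**i * math.exp((n - i) * log1mp)
               for i in range(k + 1))

def bisect(f, lo: float, hi: float, iters: int = 200) -> float:
    """Root of the increasing function f on (lo, hi)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return (lo + hi) / 2

def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    lower = 0.0 if k == 0 else bisect(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2, 0.0, 1.0)
    upper = 1.0 if k == n else bisect(
        lambda p: alpha / 2 - binom_cdf(k, n, p), 0.0, 1.0)
    return lower, upper

lo, hi = clopper_pearson(2, 75000)
print(lo, hi)   # roughly 3.2e-06 and 9.6e-05, matching the calculator
```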

Edit: In my Bayes example earlier, I realized that you could actually use it in an interesting-ish way (I guess?) to find out an unknown: "What are the odds that I was skydiving at this particular dropzone, given that I died skydiving?"

So: A = You're skydiving at that particular dropzone

B = You died :(

P(A | B) = (P(B | A) * P(A)) / P(B)

We know that: P(B | A) = 2/75000 = .0000267

P(A) = their skydives / all skydives = 75000 / (3,300,000 x 15) = 0.0015

Note: I found out that there were 3.3 million skydives in the US in 2012, so, let's extrapolate that out 15 years as a rough approximation, and limit us to just the US.

P(B) = .000009

Then:

P(A | B) = (.0000267 * .0015) / .000009 = 0.00445

So, if you died, then the probability that you were skydiving at that dropzone is 0.4%!
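Spelled out as a script (all inputs are the thread's rough estimates, not vetted statistics):

```python
# The Bayes calculation above: A = "at this dropzone", B = "died
# skydiving", so P(A|B) = P(B|A) * P(A) / P(B).

p_b_given_a = 2 / 75000               # died, given at this dropzone
p_a = 75000 / (3_300_000 * 15)        # jumped at this dropzone (US, 15 yrs)
p_b = 9e-6                            # died skydiving, per jump

p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)                    # ~0.0045, i.e. about 0.4%
```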

Note though, that there's some uncertainty in the calculation of P(B | A) (the probability of dying while skydiving at that dropzone) which you need to use confidence intervals, above, to actually figure out. Anyway, I certainly clarified some of my thoughts while writing this, and I hope that it helps you too, figuring out when to use confidence intervals, and when to use Bayes, and why it matters!


This is great. Thank you


I think your anecdote is good. Knowing the formula for Bayes theorem isn't necessary, and I've never needed to try to do the math in my head or anything. But understanding probability theory, and having an intuitive feel for what probability is, is really important.


I think your friends were right. People's lives are not the same as a coin flip. You could almost surely say that whatever caused the accidents in the past has been a point of focus for the agency, specifically so it never happens again. You can't really apply your statistical thinking in that scenario.


Maybe they were right, but they came to that conclusion "statistically", in the same sense that people think that after 10 heads in a row, the chance of tails on the next flip is higher (it's not).

>You can't really apply your statistical thinking in that scenario.

The main idea behind Bayes is that you can apply probabilistic reasoning to anything where you don't have complete knowledge (which, short of 1+1, is everything)


Weren't your friends right? Since the incident happened, it's quite likely they fired an incompetent instructor or tightened safety protocols to such an effect that future dives would be safer.


Maybe, but 3 years after the fatalities they had 2 serious injuries, which makes me give less credence to this. (Also see my reply to the sibling comment: my friends reasoned about this statistically while falling for the gambler's fallacy, sort of)


To your last paragraph: most of my fellow physicists do not care much about the distinction between Bayesian and frequentist approaches; they just use a prepackaged tool to process their data. But of the few that learn those tools intimately, cosmologists are always Bayesians (because we have only one Universe), and experimental particle physicists are always frequentists (because generating a ton of repeated test events is "cheap").


Particle physicists do sometimes make Bayesian plots and end up arguing over priors. See this, for example:

https://cds.cern.ch/record/1375842/files/ATL-PHYS-PUB-2011-0...

And I've heard more than one confused argument between particle physicists about whether Bayesianism or frequentism is "right"...


Given the high level of statistical certainty that particle physicists strive for from experimental data, how much does the choice of approach matter by the point where they are confident enough to endorse the theory?

For instance, if you are going for five sigma (like in the Higgs Boson discovery), how would the different techniques change when the data is considered sufficient, if at all?


The five-sigma threshold is a limit for making big public announcements, or claiming definitive discrete discoveries (like new particles) that non-particle physicists can take for granted. But there's lots of intermediate work done at accelerators that can't reach that threshold but still informs further work, and even influences the construction and design of the next multi-billion dollar machines. The biennial Review of Particle Physics is as thick as a phone book and filled with (usually) 95%-confidence intervals for measured values. For instance, here is the 89-page summary of what is known about the tau particle:

http://pdg.lbl.gov/2015/listings/rpp2015-list-tau.pdf


The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy

https://amzn.com/dp/B0050QB3EQ


As a Bayesian you can maybe answer my question here: https://news.ycombinator.com/item?id=11985312

I would appreciate it!


I could never quite understand the divide between Bayesian statistics and frequentist statistics. Both seem to be ultimately about counting the frequency with which something occurs and normalizing this frequency with respect to the number of all possible outcomes. Bayesian statistics is essentially concerned with the application of the Bayesian updating technique, by which one can iteratively improve a distribution over the values of a parameter toward the true distribution using Bayes' rule. One can prove that given sufficiently many updates, the initial distribution does not matter as long as it is non-zero for all possible values.

What I am not quite seeing is how this has philosophical implications that divide the field into frequentists and Bayesians. It rather seems that Bayesian statistics is just a frequentist method that is helpful for dealing with noisy measurements, as one can always recover from a bad distribution using more updates.


Bayesian probability cannot always be interpreted as a frequency. For example, one could assign a Bayesian probability to the extra-terrestrial origin of life. It wouldn't make much sense to think of it as a frequentist probability: one can easily imagine replaying the future several times, but it's not so easy when dealing with the past.

And statistics is not just probability. Frequentist inference is based on procedures that "behave well" in the long term, but may or may not make sense for the particular outcome at hand.

For example, a 95% confidence interval calculated using a procedure that guarantees that the interval contains the true value 95% of the time may yield an interval that cannot contain the true value (for example the interval covers only negative values and the true value is known to be positive). See http://learnbayes.org/papers/confidenceIntervalsFallacy/ for a discussion of confidence intervals.

Another issue is related to how the "possible outcomes" are defined. For example, a frequentist analysis of the fairness of a coin after getting four heads and then a tail will be different depending on whether we decided to throw the coin until getting a tail or we had fixed beforehand the number of trials. Look for "stopping rules" or "optional stopping."
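The stopping-rule point can be made concrete with a tiny calculation. This is my own rendering of the standard textbook example (one-sided p-values under a fair-coin null), not code from the thread:

```python
import math

# Same data ("4 heads, then a tail"), two sampling plans, two
# different frequentist p-values for H0: the coin is fair.

# Plan 1: n = 5 flips was fixed in advance.  One-sided p-value:
# P(at least 4 heads in 5 flips | p = 0.5).
p_fixed_n = (math.comb(5, 4) + math.comb(5, 5)) / 2**5

# Plan 2: flip until the first tail.  One-sided p-value:
# P(needing at least 5 flips | p = 0.5) = P(first 4 flips are heads).
p_until_tail = 0.5**4

print(p_fixed_n, p_until_tail)   # 0.1875 vs 0.0625
```

A Bayesian analysis, by contrast, depends on the data only through the likelihood, so both plans give the same posterior.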


What would you respond to my other comment downthread?

https://news.ycombinator.com/item?id=11985863


If you are able to think of the different realities that would lead to us having this discussion today (some with life originating on planet Earth, some with life coming from elsewhere), and you're able to reason about the relative frequency of these two kinds of realities, you definitely have more imagination than me.

And I don't know what you gain with that. Does your frequentist interpretation have a physical meaning? I could assign a Bayesian probability to panspermia, and you could assign a different probability. Is there, according to your equivalent definition in terms of frequencies, a "correct" probability?


It is already acknowledged that the initial guess can be wrong. We don't need to consider all possibilities to come up with a prior distribution, because that is basically the 'trick' of the Bayesian updating scheme: just start somewhere, and we'll get closer to the true distribution by collecting more data and using it in the most logical way, namely by solving for the posterior. No imagination needed. It is more efficient to distribute the initial probability mass according to our best guess using a lot of imagination, but that itself is basically Bayesian updating, because human reasoning is approximately Bayesian (or Bayesian with a noisy prior/bias).

A physical interpretation might be that all the other realities are realized in terms of the Many-worlds interpretation or in terms of Tegmark's level 4 multiverse. Without much information we cannot really nail down which reality we find ourselves in (cf. the sleeping beauty problem and Boltzmann brains). But we can use Bayesian updating to become more certain about what reality is about.

What do I gain from that? I am not entirely sure, but I find a probability more interpretable as a frequency or fraction than as a subjective quantity.


Bayesian and frequentist approaches ultimately have a different notion of probability.

In the frequentist approach, a probability of 10% means that if you repeat an experiment many times, roughly 1 out of 10 times you will observe an event.

In Bayesian statistics, a probability of 10% means that you are that certain about the event happening, so you would consider 9 to 1 against to be fair odds for the event. There doesn't have to be any repetition of the experiment for the probability to make sense. And as you can hopefully see, there is always a prior, that is, your belief about the event before doing any experiments.


Another way to look at it is this way:

P(H|D) = P(D|H) P(H) / P(D)

Bayesians are interested in the probability of a hypothesis H given some data D.

Frequentists calculate the probability of some data given a hypothesis (a p-value is not strictly the probability of the hypothesis, but it is ALWAYS a measure of how extreme the data are under the assumed hypothesis, which can be considered a relative probability).

Most interesting to me is that the Bayesian formula includes P(D|H) which is basically what the frequentists are calculating. In this sense, the question Bayesians answer is far closer to what we want to ask and far more powerful. In practice, the frequentist approach is often more than enough, though. The tradeoff is computability and simplicity.


Interesting, I've never thought about the likelihood as a confidence, but it makes sense. But sometimes the confidence also seems to reflect the opposite of extremity (e.g. for the null hypothesis).


But it seems to me one can still define this in terms of frequencies, using a more general definition of what is meant by repeating an experiment. For example, when you use a probability P as a degree of belief about whether a patient X with symptoms W has a particular disease Y, one would define the universe as containing all possible realities in which X has the same symptoms W but different underlying causal factors. An observation is a uniform sample from this universe, and the belief is that a fraction P of these realities has the cause Y.


Sorry, it took a while, but I wrote two Arbital nodes for you:

- https://arbital.com/p/subjective_probability/

- https://arbital.com/p/likelihood_vs_pvalue/


This is great! Thanks.


Have you ever taken a look at "Probability Theory: The Logic of Science"?

http://www.med.mcgill.ca/epidemiology/hanley/bios601/Gaussia...


The biggest historical obstacles to Bayesian stats were the small amounts of data available and the difficulty of computation. Frequentist stats is optimised around these. With 30 samples and very easy computations you can produce a reasonable frequentist confidence interval, sometimes even with less data. On the other hand, even the simplest Bayesian analysis of determining the probability of heads in a coin flip requires some interesting integration and the somewhat obscure Beta distribution.

Now that we have lots of data and lots of computing power, Bayesian stats can show its results, after getting rebranded as "machine learning".
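Once the conjugacy trick is known, though, the coin-flip analysis needs no integration at all; this is a generic textbook sketch of the Beta-Binomial update, not code from any comment here:

```python
# Conjugate Beta-Binomial update for a coin: with a Beta(a, b) prior
# on P(heads), observing h heads and t tails gives a Beta(a+h, b+t)
# posterior.  The integration has been done once, analytically.

def update(a: float, b: float, heads: int, tails: int) -> tuple[float, float]:
    """Return the posterior Beta parameters after the observed flips."""
    return a + heads, b + tails

def posterior_mean(a: float, b: float) -> float:
    return a / (a + b)

a, b = update(1, 1, heads=7, tails=3)   # uniform Beta(1, 1) prior
print(posterior_mean(a, b))             # 8/12 = 0.666...
```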


> On the other hand, even the simplest Bayesian analysis of determining the probability of heads in a coin flip requires some interesting integration and the somewhat obscure Beta distribution.

Isn't this kind of misleading? The end result of the Beta distribution, etc. is just the extremely simple-to-compute Rule of Succession [1].

[1] https://en.wikipedia.org/wiki/Rule_of_succession


A bit, but in general, computing posterior distributions tends to very quickly lead to more complicated integrals. It just so happens that the Beta distribution is somewhat nice and symmetrical, if a bit obscure.


More generally, this is why there's (still) a mild obsession with "conjugate" prior/likelihood distribution pairs - i.e. combinations of prior and likelihood that give analytically tractable posteriors[1] - despite the ability to easily get results with MCMC.

[1] Yes, everyone wants a nice posterior. You're very clever.


>Now that we have lots of data and lots of computing power, Bayesian stats can show its results, after getting rebranded as "machine learning".

Partly lots of computing power, and partly the Monte Carlo revolution that enabled us to replace intractable integrals with lots of cheap CPU time.


I found this talk extremely interesting despite being already somewhat familiar with the history of Bayesian methods; I heartily recommend watching it in its entirety regardless of prior familiarity with the subject.

The talk was also presented very well. It's almost novel nowadays for something like this not to be some sort of PowerPoint presentation, and I can't say it suffered for it.


Is there any (uncontroversial) theory that rigorously defines what a 50% probability for heads and tails means? It certainly doesn't mean that in the long run you will obtain the same number of heads and tails, because there is a (vanishing) chance that you will always get heads even though the coin is actually fair. And just saying that you will obtain the same or at least a similar number of heads and tails with high probability is a circular definition at best.

Does Bayesian thinking just deny the existence of intrinsic probabilities, or consider them out of scope? Assuming there is such a thing as true randomness, for example in quantum physics (and it is not due to our ignorance of hidden variables, as in Bohmian mechanics), could a Bayesian assign probabilities to the outcomes? Would a Bayesian be misled into assigning something other than 50/50 for spin up and spin down if he, by chance, only observes spin up, although he is really confident that this is just a statistical fluke? If he is misled, does that mean that uncertainty about the truth of his assumptions creeps in when the system momentarily deviates from the expected behavior?


> Is there any (uncontroversial) theory that rigorously defines what a 50 % probability for heads and tails means?

Yeah, Kolmogorov's axioms:

https://en.wikipedia.org/wiki/Probability_axioms

Interpreting these axioms here, a 50% probability means that the measure assigned to the event "heads" is one-half.

But "rigorous" doesn't have anything to do with the natural world. You can't make physics "rigorous", for example, but you can make the mathematics inspired by physics rigorous. Kolmogorov's axioms just give a mathematical description of probability in a formal sense, i.e. only discussing its form, not its meaning. Formalism is about saying, "whatever this means, this is how it should behave". It's a very 20th century notion of mathematics.


I know Kolmogorov's axioms, but I am really more interested in the part they avoid: what is the meaning of a probability of 0.5? It is surely nice that we can operate with probabilities in a (hopefully) self-consistent way, but it bugs me quite a bit that I don't really precisely understand what the result of a calculation implies for the real world.


The modern approach to mathematics is that there is no "meaning", just like "2" or "derivative" has no meaning. We just say how it behaves, or define it in terms of other things, which eventually bottoms out with undefined terms, such as sets and set membership. This is formalism.

How you apply mathematics to the world is not the business of formal mathematics. Whatever you want to do with it is "mere" philosophy. ;-)


That is what I meant - I am asking more for a solid philosophical interpretation than a mathematical theory; after all, frequentism, Bayesianism and all the other interpretations seem to carry quite a bit of philosophy.

On the other hand, in another comment it boiled down to the question of whether there is a measure that gives 1 for the set of all infinite binary sequences with 50/50 zeros and ones, and 0 for the set of all other sequences. So what I am interested in is not pure philosophy.


No, but let me quote Jaynes (Probability Theory: the Logic of Science, 10.10):

> ‘When I toss a coin, the probability for heads is one-half.’ [...] the issue is between the following two interpretations:

> (A) ‘The available information gives me no reason to expect heads rather than tails, or vice versa – I am completely unable to predict which it will be.’

> (B) ‘If I toss the coin a very large number of times, in the long run heads will occur about half the time – in other words, the frequency of heads will approach 1/2.’

These are not the same except for special circumstances like controlled experiments. Frequentism usually assumes or restricts itself to those special circumstances. The long run here means `for any ε > 0, the probability that the observed frequency n/N lies in the interval (1/2±ε) goes to 1 as N goes to infinity'.

There is no such thing as an intrinsic probability, a coin only has a chance of landing heads when tossed. If we know everything about the coin and how it is tossed we could calculate the result. Some ways of tossing a fair coin are biased. Some coins are biased when tossed in fair ways. A fair coin toss exaggerates factors that are difficult to know and control exactly, like the force that we apply on the coin with our finger.

Jaynes thinks that `true randomness' such as is postulated by conventional quantum physics is unscientific (10.7). In any case it doesn't matter for calculating, and I don't think many Bayesians lose sleep over whether it's unknown, unknowable or `true randomness'.

If the Bayesian has an informative prior, they won't be misled too much. If a Bayesian is 100% sure beforehand, with a δ(x - 0.5) prior, they won't be misled at all, but of course no one is ever 100% sure (Cromwell's rule). On the other hand, a frequentist might say p < 0.05 and be misled.


> Is there any (uncontroversial) theory that rigorously defines what a 50 % probability for heads and tails means? It certainly doesn't mean that in the long run you will obtain the same number of heads and tails because there is a (vanishing) chance that you will always get heads even though the coin is actually fair.

It doesn't really require a theory, outside of probability theory itself.

According to the law of large numbers, P(x) = 0.5 implies that, over N trials, the probability that the fraction of heads is close to 1/2 approaches 1 as N approaches infinity. And as you pointed out, the probability of all heads vanishes as N approaches infinity. Since P(not all heads) = 1 - P(all heads), and P(all heads) approaches 0, P(not all heads) must approach 1. The same exercise can be done with the relative frequencies of heads and tails over a large number of trials.

Each coin flip is random, but the relative frequencies converge to the mean: the fraction of heads approaches 1/2 with certainty (almost surely), not just high probability, as you make N big enough.
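A quick simulation illustrates the convergence (my own illustration, with an arbitrary fixed seed for reproducibility):

```python
import random

# Law of large numbers in action: the running fraction of heads drifts
# toward 0.5 as the number of flips grows, even though any particular
# finite streak of heads remains possible.

random.seed(0)
for n in (10, 1000, 100_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)   # the fraction approaches 0.5 as n grows
```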


That means it is totally impossible to get H H H H... or H T T H T T... ad infinitum? I really have no well-founded opinion on that; it just seems very counterintuitive that the coin is not allowed to yield any such sequence. They seem as good as any other sequence to me.


Not at all! It yields that sequence all the time, in small numbers. But, it is impossible to get only heads, if you could truly flip a coin infinitely many times, which of course you can't. The larger you make the sequence, the smaller you make the probability of all heads. In your example:

P(H,H,H,H) = 0.5^4 = 0.0625

P(H,H,H,H,H,H,H,H) = 0.5^8 = 0.00390625

As you can see, it's getting pretty small already. However, the probability will never actually reach 0. It's like the half-step concept: if you keep halving the distance between yourself and an object, you will never actually reach it, but the distance will approach 0 as the number of steps approaches infinity. 0.5 ^ infinity = 0

Note that P(H|H,H,H) = probability of heads, given 3 previous heads, is still just 0.5. Any independent coin flip is 0.5 chance of being heads. But, when you've done 0 coin flips, and you ask, what are the odds of all heads for the next, say, million flips, they are nearly 0.
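The geometric shrinkage above is easy to see numerically (plain arithmetic, nothing assumed beyond a fair coin):

```python
# The probability of an unbroken run of n heads shrinks geometrically
# with n: it never reaches zero for any finite n, it just falls below
# any fixed threshold you pick.
for n in (4, 8, 20, 100):
    print(n, 0.5 ** n)
```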


But now you are contradicting yourself, aren't you? We agree that for any finite sequence of tosses it is unlikely but possible to get all heads, and therefore possible not to converge to 0.5. The question is what difference it makes to go from a large but finite to an infinite number of tosses. Either it is impossible to get only heads an infinite number of times, in which case I have a problem understanding why that is, or all heads is still a possible outcome even in the infinite case, in which case the process does not necessarily yield 0.5 even in the limit of an infinite number of tosses.


> The question is what difference it makes going from a large but finite to an infinite number of tosses.

Convergence is only guaranteed as N -> ∞. The difference between large but finite and infinite is.. well, infinite :) So that's a pretty significant difference.

> Either it is impossible to get only heads an infinite number of times, then I have a problem understanding why that is, or all heads is still a possible outcome even in the infinite case, then the process does not necessarily yield 0.5 even in the limit of an infinite number of tosses.

It's the former. It is impossible to get only heads an infinite number of times. It is possible, but increasingly unlikely, to get only heads a REALLY LARGE, but finite, number of times.


Okay, assuming that is true, is there an intuitive way to understand that? And the fact that lim[n-> ∞] 0.5^n = 0 unfortunately won't do for me, that is true for any specific infinite sequence, even those containing 50/50 heads and tails. I think most of the sequences - hand-waving, most of an infinite set - are 50/50 heads and tails just because there are more possibilities - hand-waving again - to arrange 50/50 heads and tails versus 100/0 or 40/60 heads and tails. So can I sum over all the 50/50 sequences and get 1 and sum over the rest and get 0? I still would not really understand what forces my coin to show tails eventually, but if there were a measure showing that those two sets have measure 1 respectively 0 it would already be easier to swallow.
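The hand-waving counting argument above can be made concrete with binomial coefficients; a Python sketch (illustrative, not a proof) counting what fraction of all equally likely length-n sequences have a heads count within 1% of n/2:

```python
from math import comb

def frac_near_half(n, eps=0.01):
    """Fraction of the 2**n equally likely length-n sequences whose
    heads count lies within eps*n of n/2. Grows toward 1 as n grows."""
    lo, hi = round(n * (0.5 - eps)), round(n * (0.5 + eps))
    return sum(comb(n, k) for k in range(lo, hi + 1)) / 2 ** n

for n in (100, 10_000):
    print(n, frac_near_half(n))  # the near-50/50 sequences dominate
```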


"Okay, assuming that is true, is there an intuitive way to understand that?"

I mean this quite seriously and am not being sarcastic or dismissive: Probably not. I think the general concept of "intuitive" is that there is some experiential analog to the concept in the real world, with which we've had a lot of experience and can thus "intuit" the behavior. The real world does not include infinity.

You can develop mathematical intuitions, but I don't think that's what you were saying.

In the mathematical intuition sense, it's worth pointing out the probability of any given infinite series of coin flips is zero. The all-heads or all-tails series are not special that way. While this may not be the best way to intuit it, if a metaphysical you who would live forever sat down and started flipping coins, you will never at any point be done with flipping an infinite series of coins. You could sit there until you flip any finite series of coins, but you will never flip an infinite one, even with our unrealistic stipulations of life span. From that perspective, a probability of zero of flipping an infinite number of heads should seem reasonable; there is probability zero that you will ever have flipped an infinite number of coins.


> Okay, assuming that is true, is there an intuitive way to understand that? And the fact that lim[n-> ∞] 0.5^n = 0 unfortunately won't do for me, that is true for any specific infinite sequence, even those containing 50/50 heads and tails.

That's true, but only if you pick ONE specific sequence. Any precise sequence of heads/tails is exactly as unlikely as all heads. However, we are not fixing the ordering of outcomes, but simply saying that, taken to infinity, the distribution of outcomes will tend to 50/50.

> I still would not really understand what forces my coin to show tails eventually

Nothing forces it to show heads either.

> but if there were a measure showing that those two sets have measure 1 respectively 0 it would already be easier to swallow

Well, we've already proven it's impossible to have all heads infinitely. The same would be true of any non-uniform sequence. Any sequence that favours heads by any margin at all, taken to infinity, would necessarily contain an infinite sub-sequence that contains all heads, which we've proven is impossible by the limit to infinity.
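A simulation can illustrate (though of course not prove) that tendency toward 50/50; a Python sketch under the fair-coin assumption, with a fixed seed so it is repeatable:

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

def heads_fraction(n):
    """Fraction of heads in n simulated fair-coin flips."""
    return sum(random.random() < 0.5 for _ in range(n)) / n

for n in (100, 10_000, 1_000_000):
    print(n, heads_fraction(n))  # drifts toward 0.5 as n grows
```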


Well, we've already proven it's impossible to have all heads infinitely. The same would be true of any non-uniform sequence. Any sequence that favours heads by any margin at all, taken to infinity, would necessarily contain an infinite sub-sequence that contains all heads, which we've proven is impossible by the limit to infinity.

Have we? We only said that the probability of having an infinite sequence of all heads is zero but not that it is impossible to obtain that result. After all the probability of any infinite sequence is zero and therefore if we would equate probability zero with impossible to obtain, then we could obtain no infinite sequence at all. The second part is obviously wrong, H H T H H T... ad infinitum obviously favors heads over tails but does not contain an infinite subsequence of all heads.


When heads has probability p the probability for any n-length sequence of all heads is p^n. For p between 0 and 1 this vanishes as n goes to infinity.

The reason is that a conjunction of events is always at most as probable as the individual events themselves. This should make intuitive sense as the middle bit of a Venn diagram. Even more so for a conjunction of conjunctions of events. When taken to the infinite limit you tend to end up with the conjunction having probability 0 or 1.


I have no problem with that, all infinite sequences of coin tosses have probability zero. The question is whether an infinite sequence of all heads is an obtainable outcome, which would be a counterexample to the statement that the relative frequency converges to 0.5 in the limit.


> doesn't mean that in the long run you will obtain the same number of heads and tails

IANAM, but in my understanding that's exactly what it means, if you express "in the long run" as "the ratio tends to 0.5 as the number of tosses tends toward ∞"


IANAM either, but that seems to be wrong to me. For any finite number of tosses you may certainly not converge to 0.5 even if it becomes very unlikely pretty quickly not to do so. In order for this to be true, it would have to be literally impossible to always get heads on every toss no matter how often you try. But it seems counter-intuitive at the very least that it is inevitable to get tails eventually. What rules out the possibility to get heads over and over again, an infinite number of times? [1] But if it is possible to obtain an infinite number of heads and not a single tails, then it is not true that ratio always converges towards 0.5. In consequence one would have to modify the definition and say that it is only very likely to converge towards 0.5 but that obviously raises the problem that »very likely« sounds a lot like »with high probability« which is the thing we are trying to define, hence this would become a somewhat circular definition.

[1] Actually it must not only be impossible to only ever get heads or tails, it must be impossible to obtain any sequence of outcomes that has not exactly the same number of heads and tails. There is only one sequence of outcomes with only heads and only one with only tails but there is already an infinite number of sequences with all heads but one tails or vice versa. And there is an enormous number of possible sequences with 40 % heads and 60 % tails or any other ratio other than 50/50.


If a coin were flipped arbitrarily many times, and it landed on heads each time, and if we had no other information about the coin, a frequentist would say that the probability the coin lands on heads is one.

Now, you might protest and say, "But this is a fair coin that just happened to land on heads arbitrarily many times". But then you are relying on a prior definition of probability in order to justify your objection to the frequentist's claim, since presumably what you mean by "fair" is that the probability the coin lands on heads is 1/2.


I am not sure if that really addresses my issue. My problem is that I don't see what eventually forces convergence to 50/50. We can use two or even better many coins or, at least superficially equivalent, one coin and let several experimenters take turns. The results will converge towards 50/50 but only with high probability, or at least I don't see why they always would, i.e. why at least one sequence could not keep deviating indefinitely. And that seems to lead to an infinite tower of experiments.

When tossing a coin infinitely often the ratio of heads and tails will converge to 50/50 but only in that sense that you have to toss the coin an infinite number of times an infinite number of times. And you can't really stop here because you may still deviate arbitrarily far from the expected outcome. So you toss the coin an infinite number of times an infinite number of times an infinite number of times. And you have to keep nesting your experiments until you made an infinite tower of experiments.

And I have no idea if that would be sufficient to ensure that you can no longer deviate from the expected outcome, my intuition tends towards no, but then again this is certainly far outside the realm of things for which I would expect intuition to work. Likely I am really just confused and missing something obvious.


Your problem is that you don't seem to have sufficient understanding of the concept of limits.


> For any finite number of tosses you can certainly not converge to 0.5

I think you mean "may not" rather than "can not". But yes, that's why you need to resort to limits[0] to understand the problem.

[0] https://en.wikipedia.org/wiki/Limit_of_a_sequence


Thanks, fixed. I understand that it is the limit, but the question I am asking is whether it is possible to always get heads, an infinite number of times. If this is possible, then it is not true that the relative frequency always approaches 0.5 as the number of repetitions goes to infinity. I understand that the probability of obtaining heads an infinite number of times is zero, but does that mean it is impossible? After all, any specific infinite sequence of outcomes has probability zero.


> the question I am asking is whether it is possible to always get heads, an infinite number of times

No. By my definition (lay, may be wrong), the probability is the ratio of heads to tails to which we converge at the limit (infinity). The only way for it to converge to 0 is if the probability of heads was 0 to begin with, which would contradict the initial assertion that it was a fair coin.

By the way, from your previous comment it seems to me that you're struggling with getting an intuitive grasp on arguments that rely on infinity. To most people (me included for sure) it's initially difficult to accept that infinity is qualitatively different than a finite amount. Our brain seems to superficially accept the concept, but really it keeps trying to reason about it as a really large (but still finite) quantity.

Are you familiar with the argument of whether 0.999... equals 1? (Spoiler: it does) Some people have a really hard time coming to terms with it, and the fundamental problem for many comes down to the same difficulty of reasoning about infinity. Also, it's a fun way to troll people :)
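For the record, the standard geometric-series argument that 0.999... equals 1, sketched in LaTeX:

```latex
0.999\ldots
  = \sum_{k=1}^{\infty} \frac{9}{10^k}
  = 9 \cdot \frac{1/10}{1 - 1/10}
  = 9 \cdot \frac{1}{9}
  = 1
```

The middle step is just the closed form of a geometric series with ratio 1/10; the "difficulty" people have is exactly the one under discussion, that an infinite sum is a limit, not a very long finite sum.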


No. By my definition (lay, may be wrong), the probability is the ratio of heads to tails to which we converge at the limit (infinity). The only way for it to converge to 0 is if the probability of heads was 0 to begin with, which would contradict the initial assertion that it was a fair coin.

This seems problematic to me. You cannot perform an infinite number of coin tosses and know to what the relative frequency converges. So how do you conclude that a fair coin has probability 0.5 for heads then? I can't really put my finger on it, but that argument is somehow circular. The probability is what the relative frequency converges to, and I cannot get anything other than 0.5 because that would mean the probability was not 0.5 to begin with. I can really only say that I disagree, I just can't pin it down exactly.

Are you familiar with the argument of whether 0.999... equals 1? (Spoiler: it does) Some people have a really hard time coming to terms with it, and the fundamental problem for many comes down to the same difficulty of reasoning about infinity.

I am aware of that and it seems totally obvious to me. And I also certainly know that infinity is not just a really huge number, but I am not sure I internalized that well enough to not make any mistakes because of the difference. Actually I am pretty sure I make mistakes because of that.


There are different ways to "approach 0.5 as the number of repetitions goes to infinity": https://en.wikipedia.org/wiki/Convergence_of_random_variable...

I think in this case you have "convergence in probability" (but I've not read carefully the discussion). There is a stronger form, "almost sure convergence", where there is exact convergence with probability one (i.e. in some cases there is no convergence, but those happen with probability zero).
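A rough Monte Carlo illustration of convergence in probability (Python, illustrative only): for a fixed epsilon, the fraction of independent runs whose heads frequency deviates from 0.5 by more than epsilon shrinks as n grows.

```python
import random

random.seed(1)  # fixed seed so the illustration is repeatable

def deviation_rate(n, runs=2000, eps=0.05):
    """Fraction of independent n-flip runs whose heads frequency
    deviates from 0.5 by more than eps. Shrinks toward 0 as n grows
    (convergence in probability)."""
    bad = 0
    for _ in range(runs):
        heads = sum(random.random() < 0.5 for _ in range(n))
        if abs(heads / n - 0.5) > eps:
            bad += 1
    return bad / runs

for n in (10, 100, 1000):
    print(n, deviation_rate(n))  # the deviation rate drops with n
```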


That is exactly what I mean. We try to establish what a probability of 0.5 means, i.e. that the relative frequency converges to 0.5 if the number of trials goes to infinity, but that is not exactly true because there is a (possibly vanishingly small) set of outcomes where the relative frequency does not converge to 0.5. This in turn forces us to state that the relative frequency only converges with high probability, but now we have used a probability in our definition of probability, i.e. we made some kind of circular argument. I just skimmed the article you linked and I am not sure if I read it before so I will read it again, but from a first glance it does not look like any definition in there is suitable to avoid the problem.


There is no circularity. You're discussing the real-world interpretation of the probability of an event in terms of long-term frequencies. The probability in the definition of convergence is a well-defined mathematical concept that has nothing to do with frequencies.

Edit: thinking more about it, I agree that the passage from mathematical probability to real-world probability is not very satisfactory. But it's not really surprising, the very notion of "real-world" is troublesome.


"whether it is possible to always get heads, an infinite number of times" No if we actually make the experiment, for we cannot toss coin infinitely many times. Yes, if we are just considering and measuring all possible infinite sequences. Measure of such mathematically possible sequence is zero, like measure of a point somewhere in a circle is zero. The point exists and is a possible result of choosing a point, but its probability is zero. Thus zero probability does not mean thing is impossible; it is only very unlikely.


If an infinite sequence of all heads is a possible outcome, then it is not true that the relative frequency converges to 0.5 in the limit. It is obviously true for many infinite sequences but if all heads is a possibility then it is not true in general.

Just to clarify, I am talking about the frequentist's claim that in the long run the relative frequency becomes the probability, i.e. P(x) = lim[n -> ∞] nx / n where n is the number of trials and nx is the number of trials that yielded x. So again, if an infinite sequence of heads is a possible outcome, then that would mean P(heads) = 1 while the frequentist asserts that it necessarily must become 0.5.


Does the frequentist view really claim that infinite sequence nx/n always converges, for all possible runs? That is obviously false. The laws of large numbers only state that probability that the limit of nx/n is equal to 1/2 is 1. This is because all sequences are equally probable and there are overwhelmingly many more sequences that limit to 1/2 than sequences that do not. That is different from saying that the limit is always equal to 1/2, because there are these exceptional cases of low measure like HHHHH..., HTTHTTHTT...


If you are in London you can visit the tombstone of the good Reverend Bayes at the Bunhill Fields burial grounds, a short walk from the Old Street tube station, and pay your respects. There is also a bench facing a spacious lawn that you can sit on and contemplate Bayes in silence. He probably sat there at some point, come to think of it.


Brexit is the best example so far.

That painful dissonance between so called reality and these probabilistic models.


Probabilities make sense only with absolutely certain things like a fair coin or a die.

In cases where there is no absolute certainty about how many sides or dimensions your "dice" has, that it is not biased, and that there are no other forces or factors in play, probability ceases to make sense.

Probability of A, given B becomes meaningless when either A or B aren't precisely defined (like in the case of a "fair coin") and so is the relationship between the two.

Application of the Bayesian rule to "estimated" probabilities is just wrong and unscientific (in the face of ambiguity avoid the temptation to guess). Multiplying and dividing nonsense by nonsense yields nonsense.

The global financial crisis and the recent cock-sure consensus about the outcome of the Brexit referendum the day before voting are good evidence.


I disagree. There's a large grey area between 'completely unknown' and 'scientific certainty'. I prefer guessing and doing computations with my guesses than throwing my hands up in the air and calling it unknowable.

When presented with new evidence it's better to write down a number for your degree of belief in X, ask how much you should change that belief based on the new data, and update your probability estimate, than to just go with your feelings. You're not doing actual Bayesian computations, that's totally intractable for anything that's not a very well defined problem, but doing 'pseudo-Bayesian' updating is better than not.

I think of it as the Fermi approximation of probabilities. You won't get accurate numbers that way, but you'll get better numbers than if you just invent the answer.

EDIT to add: most of the time you should then throw out the number. Just like a Fermi estimate, you get a ballpark sense for the answer, not a precise answer.

In the superforecasting experiment by Tetlock the best forecasters did this. They were writing down probability estimates and methodically updating them based on new data (news articles, data, etc.). They were forecasting geopolitical events, not dice rolls, and it worked (better than the alternative, obviously no one can forecast geopolitical events with high certainty).
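A toy sketch of such an update in Python; the function is just Bayes' rule in its textbook form, and the numbers are invented for illustration:

```python
def bayes_update(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Posterior P(H | E) via Bayes' rule."""
    numer = p_evidence_given_h * prior
    denom = numer + p_evidence_given_not_h * (1 - prior)
    return numer / denom

belief = 0.25  # hypothetical prior: say, a 25% chance of some event
# A news item judged 3x more likely if the event is indeed coming:
belief = bayes_update(belief, 0.6, 0.2)
print(belief)  # prior odds 1:3, likelihood ratio 3:1 -> posterior 0.5
```

The odds form makes the arithmetic transparent: posterior odds = prior odds × likelihood ratio, here (1/3) × 3 = 1, i.e. 50%.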


That's wrong. With the Bayesian interpretation of probability, you can assign probability to any event. In fact all certainty about belief is just probability in disguise.

There were betting markets for the brexit vote. They assigned 25% probability to brexit. Of course markets aren't perfect. But anyone who really believes they know better should be able to get rich off them. And somehow that doesn't happen. So they are the best estimates of probability we have.



