Hacker News
The Central Limit Theorem Visualized with D3 (vctr.me)
179 points by vicapow on May 30, 2013 | 72 comments


This isn't the central limit theorem. This is a binomial distribution. Nice animation though.


It does illustrate a special case of the CLT, in the sense that as the number of bins increases, the distribution (the mean of a bunch of +1/-1 Bernoulli random variables) converges to a normal distribution.
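To make that concrete, here is a small Python sketch (my own, not part of the linked demo; the row/ball counts and seed are arbitrary choices) that drops many "balls" through a board of ±1 steps and checks that the landing positions have the spread the CLT predicts:

```python
import random
import statistics

random.seed(0)

def ball_position(n_rows):
    """Sum of n_rows independent +1/-1 steps (one 'ball' in the demo)."""
    return sum(random.choice((-1, 1)) for _ in range(n_rows))

# Drop many balls through a board with many rows; the positions should
# center on 0 with standard deviation sqrt(n_rows).
n_rows, n_balls = 100, 20000
positions = [ball_position(n_rows) for _ in range(n_balls)]

mean = statistics.fmean(positions)
std = statistics.pstdev(positions)
# Theoretical values: mean 0, variance n_rows (each step has variance 1).
print(mean, std)
```

With these numbers the empirical standard deviation comes out near sqrt(100) = 10, and a histogram of `positions` looks like a bell curve.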


yes but the number of bins isn't increasing.

EDIT: nm, I saw where you can increase the bins.


Why isn't it the central limit theorem? It is taking random results and building a distribution which looks normal.


Imagine you have a population that is NOT normally distributed. Take a large sample of it, and calculate the mean of your sample. Then take another large sample of that population and calculate its mean. Keep doing this. Think of each sample mean as a data point that gives you an estimate of the mean of the whole population. The Central Limit Theorem says (approximately, and there are conditions) that those sample-mean data points of yours will be normally distributed, even though they were obtained by sampling a non-normal distribution. This also works when the population IS normally distributed, but that's not a surprise. The surprise is that the sample means are normally distributed even when the population isn't. (And again, there are some qualifications I'm not mentioning.)

So, sample means (estimates of the population mean) tend to be normally distributed even when the population they are sampling is not.

The D3 demo here doesn't show multiple sample means, it shows that as you take more and more items from a normal population, the distribution of your sample of items gradually comes to resemble the distribution of the population as a whole.
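The sample-mean version described above is easy to simulate; here is a minimal Python sketch (my own; the exponential population, sample size, and seed are arbitrary choices) using a clearly non-normal population:

```python
import random
import statistics

random.seed(1)

# Population: exponential(1) -- clearly not normal (skewed, all positive).
sample_size, n_samples = 50, 5000
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(sample_size))
    for _ in range(n_samples)
]

# CLT: the means cluster around the population mean (1.0) with
# standard deviation sigma / sqrt(n) = 1 / sqrt(50) ~ 0.141.
m = statistics.fmean(sample_means)
s = statistics.pstdev(sample_means)
print(m, s)

# Crude normality check: roughly 68% of the means within one sd of center.
within = sum(abs(x - m) < s for x in sample_means) / n_samples
print(within)
```

The skewed population produces sample means that are nearly symmetric around 1.0, which is the surprise the parent comment describes.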


No, it does show multiple sample means. Each landing spot is the sum (scaled average) of n - 1 Bernoulli RVs, where n is the number of bins. The issue is that n is small.


You are correct in pointing out this error of the parent comment (by SiVal). Each ball is indeed the sum of "n-1" Bernoulli RVs, and the CLT does apply to these sums.

As btilly points out elsewhere, to actually obtain the correct limit, you have to normalize the sum correctly. Because of the way the scaling is done in this graphic, as you increase the number of levels, it's in effect normalizing the sum by dividing by "n". To get the right limit, you need to divide by sqrt(n).

In this sense, the CLT is a high-resolution version of the SLLN ("law of averages"). If you normalize the sums by 1/n, the resulting average converges to a number, the mean.

But if instead you subtract this mean, and normalize by 1/sqrt(n) rather than 1/n, the result converges in distribution, and looks like a normal random variable.

By less aggressive normalization, you get information about the fluctuations rather than just pounding it down to a number, as the SLLN does.

I used to TA a probability class for undergrads, and I found the demo to be perfectly reasonable.
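The difference between the two normalizations is easy to see numerically. A sketch (my own; trial counts and seed are arbitrary): dividing the sum by n makes the spread collapse (SLLN), while dividing by sqrt(n) keeps it fixed (CLT):

```python
import math
import random
import statistics

random.seed(2)

def spread(n, scale, trials=4000):
    """Empirical standard deviation of scale(n) * (X_1 + ... + X_n),
    where each X_i is an independent +1/-1 step."""
    vals = [
        scale(n) * sum(random.choice((-1, 1)) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.pstdev(vals)

sizes = (25, 100, 400)
lln = {n: spread(n, lambda m: 1 / m) for n in sizes}             # divide by n
clt = {n: spread(n, lambda m: 1 / math.sqrt(m)) for n in sizes}  # divide by sqrt(n)
print(lln)   # shrinks toward 0: the average collapses to the mean
print(clt)   # stays near 1: the fluctuations survive
```

This is the "less aggressive normalization" point above: 1/n pounds the randomness down to a number, 1/sqrt(n) preserves it.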


If we're going to pick nits, it would illustrate the "Weak" LLN, not the Strong LLN. The Strong LLN still holds here, but you'd need to plot the sample paths to illustrate it.


I was actually trying my best to not pick nits, because HN has somewhat too much of that.

The reason I posted is to defend the OP against some assertions that I felt were picking at details of what started out to be a simple and fun visualization. I wrote what I did about the LLN to supply some intuition to the person who created the visualization, in case they wanted to put these two results in perspective.


Thanks for this explanation. I understood most of it but could you explain why you should normalize using 1/sqrt(n) and why doing so makes the result converge in distribution?


For a sequence of independent random variables with the same variance, X_1, X_2,..., we have

  var( (1/sqrt(n)) * (X_1 + X_2 + ... + X_n) )
    = (1/n) * (var(X_1) + var(X_2) + ... + var(X_n))
    = (1/n) * n * var(X_1)
    = var(X_1)
This holds for any n, which means that, if you normalize by 1/sqrt(n) instead of 1/n, the "randomness" never vanishes even when n gets infinitely large. If you normalize by something bigger than 1/sqrt(n) the variance blows up, and if you normalize by something less than 1/sqrt(n), the variance collapses to zero so you get something concentrated at a single point.

The CLT tells us more than that, it actually tells us how the randomness is distributed when n gets very large, which is pretty remarkable when you think about it. (and it holds under much weaker conditions than what I mentioned above, it's just that those assumptions are probably the easiest to understand).
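That variance identity can be checked exactly for small n by enumerating every sign pattern; a Python sketch (my own, exact arithmetic with fractions):

```python
import itertools
from fractions import Fraction

# Exact check for small n: enumerate all 2**n equally likely patterns of
# +-1 steps and compute the variance of (1/sqrt(n)) * (X_1 + ... + X_n).
# Dividing the sum by sqrt(n) divides its variance by n, so the result
# should always equal var(X_1) = 1.
def scaled_sum_variance(n):
    outcomes = [sum(pattern) for pattern in itertools.product((-1, 1), repeat=n)]
    p = Fraction(1, 2 ** n)
    mean = sum(Fraction(s) * p for s in outcomes)                       # = 0
    var_of_sum = sum((Fraction(s) - mean) ** 2 * p for s in outcomes)   # = n
    return var_of_sum / n

for n in (1, 4, 8, 12):
    print(n, scaled_sum_variance(n))   # -> 1 every time
```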


Thanks a lot for the explanation.


It is the distribution of the average of n Bernoulli-type random variables (taking on values x±1)†, whose sum is binomial and which according to the CLT converges to a Gaussian as n→∞.

[†] x is where it is centered.


You only get the Gaussian if you take the limit correctly.

In the visualization as you increase the number of bins, the Gaussian approximation becomes more and more squeezed, and by the weak law of large numbers the limit is a Dirac delta.

In order to get the Gaussian you'd need to be looking at a window whose width is proportional to the square root of the number of bins.
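One way to see both limits at once, using the exact binomial probabilities (a sketch of my own; the window widths are arbitrary choices): the probability mass in a fixed window around 0 collapses (the Dirac delta of the weak law), while the mass in a window of width sqrt(n) stabilizes (the Gaussian):

```python
from math import comb, sqrt

def walk_pmf(n):
    """Exact pmf of S_n = sum of n independent +-1 steps:
    S_n = 2*K - n with K ~ Binomial(n, 1/2)."""
    return {2 * k - n: comb(n, k) / 2 ** n for k in range(n + 1)}

results = {}
for n in (16, 64, 256, 1024):
    pmf = walk_pmf(n)
    p_fixed = sum(p for s, p in pmf.items() if abs(s) <= 2)         # fixed window
    p_scaled = sum(p for s, p in pmf.items() if abs(s) <= sqrt(n))  # sqrt-width window
    results[n] = (p_fixed, p_scaled)
    print(n, round(p_fixed, 3), round(p_scaled, 3))
```

The fixed-window column shrinks toward 0, while the sqrt-window column hovers near P(|Z| ≤ 1) ≈ 0.68 for a standard normal Z.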


n = 4.


Well, to be accurate, it's four binomial distributions, not just one. You need the sum of ~30 random variables before you're getting into the CLT, but it demonstrates the gist.


[deleted]


Well, not really. The fact that CLT applies to a wide variety of distributions is a much stronger, mysterious and astonishing fact than what is demonstrated here.

It would be like demonstrating that you've cracked RSA with p = 7, q = 11.


Agreed, learned more, was wrong, deleted comment. Curious though—doesn't the CLT apply to this distribution too, technically? Isn't the binomial distribution being sampled and converging on a normal distribution, as per the CLT? Maybe some more explanation would enlighten us.


Yes, this is definitely a special case of CLT.

For instance, see: https://en.wikipedia.org/wiki/De_Moivre%E2%80%93Laplace_theo...


Sure, but the CLT holds in very great generality -- to variables that are not independent, and to stochastic processes. Any demonstration is bound to be a special case.


A great example of convergence in the natural world I once saw was drops of water falling off a gutter from about 10 feet in the air. There was very little wind, and the drops fell on a line right under the gutter. It was immediately clear that the drops landed in a roughly normal distribution with their random fall through the air, and the pattern of wet pavement they produced was a perfect little compressed bell curve. It was quite beautiful.


Well, it wasn't perfectly normal. But it was close. The Central Limit Theorem is about closeness, not perfection.


Video of a binomial distribution [1] physically visualized using a Galton box [2] Pachinko [3]: http://youtu.be/AjI_LcQOOs4

[1]: http://en.wikipedia.org/wiki/Binomial_distribution

[2]: http://en.wikipedia.org/wiki/Bean_machine

[3]: http://en.wikipedia.org/wiki/Pachinko


Here's a galton box. (http://www.youtube.com/watch?v=AUSKTk9ENzg)

It's a meatspace demonstration of the CLT.


This was one of the most "a-ha moment" videos I ever watched back in the day. Anyone with even a functional knowledge of statistics should watch this one.


I liked this but as other commenters have pointed out it's purely binomial at this stage. The great thing about the central limit theorem is that it is more general than just the limiting Binomial case.

So, there's this thing called the cumulant-generating function. It's pretty much defined for any random variable X. If you want to get technical it is the logarithm of the Fourier transform of a probability density function f(x). You're on HN so you probably know the first two jargon words, "logarithm" and "Fourier transform". A "probability density" just means that f(x) dx is the probability for X to be in the interval (x, x + dx). The Fourier transform puts us into a "frequency space" indexed by some variable k, so we can write the CGF as some function c(k), or in other words:

    c(k) = ln[ ∫ dx f(x) exp(i k x) ]
So for the "step left/right" variable which is -1 with probability 1/2 and +1 with probability 1/2, c(k) = ln[cos(k)]. (It can get a little messy when you ask what happens when cos(k) crosses 0 etc, but this function is infinitely-often differentiable on a disc centered at 0 which is all that we need.)

It also turns out that since the sum of all the probability ∫ dx f(x) = 1, you can just evaluate this for k=0 as c(0) = ln[ 1 ] = 0. The cumulants are derivatives evaluated at k=0:

    c'(0) = i E[X] = i μ               (where μ is the "expectation")
    c''(0) = - (E[X²] - E[X]²) = -σ²   (where σ² is the "variance")
    c'''(0) = -i E[(X - μ)³] = -i σ³ γ (where γ is the "skewness")
It is not very hard to prove that if I have a sum of two random variables X + Y = Z, then their CGFs add like cz(k) = cx(k) + cy(k). This is the Fundamental Awesomeness of Cumulants: cumulants are pseudo-linear; they're linear in independent random variables.

It is also not hard to prove that if I have a scalar multiplication U = X / n, then cu(k) = cx(k / n). Combining these together, the mean M = (X1 + X2 + ... + Xn) / n of n identical and independent random variables looks like:

    cm(k) = n c(k / n)
Now if you know calculus, you know the rest. Taylor-expand around k = 0 to find:

    cm(k) = i μ k − k²/2 * (σ²/n) − i k³/6 (σ³ γ / n²) + ...
We see a geometric attenuation in this series, we keep dividing new terms by n. If we drop terms from c'' on we get a constant, M = μ. That's boring, so we keep one more term, to find:

    cm(k) ~= i E[X] k − (Var[X] / n) k² / 2
This is extra-good if f(x) is symmetric about the mean (so that the skewness vanishes), but it is also pretty good even if the distribution is skewed.

We can now invert all of the steps to return back to a probability density: you exponentiate, so you get exp(i μ k) exp(- k² σ²/ 2 n), then you inverse-Fourier-transform, which turns the exp(i μ k) factor into a shift x → x − μ and which transforms the Gaussian into another Gaussian, so you get a Gaussian centered at x = μ.

In other words you get approximately a normal distribution with mean E[X] and variance Var[X] / n, with the error given by convolutions of higher-order terms, and the first correction disappearing when the distribution is symmetric about μ (or otherwise non-skew).
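The key additivity step above -- CGFs of independent variables add -- can be spot-checked numerically for the ±1 step. A small sketch of my own (the evaluation point k = 0.3 is arbitrary):

```python
import cmath

# Characteristic function E[exp(i k X)] of a discrete distribution
# given as {value: probability}.
def cf(dist, k):
    return sum(p * cmath.exp(1j * k * x) for x, p in dist.items())

step = {-1: 0.5, 1: 0.5}                 # one +-1 step: cf is cos(k)
two_steps = {-2: 0.25, 0: 0.5, 2: 0.25}  # sum of two independent steps

k = 0.3
c_one = cmath.log(cf(step, k))           # ln cos(0.3)
c_two = cmath.log(cf(two_steps, k))
print(c_one, c_two)                      # c_two = 2 * c_one
```

Since `two_steps` is the distribution of the sum of two independent steps, its CGF is exactly twice the CGF of one step, which is the "Fundamental Awesomeness of Cumulants" in action.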


Some of the explanatory power of the graphical demo seems to have been lost.


Maybe. But the proof also gives a lot of cute ideas for how to go further with this sort of demo. Since adding random variables amounts to doing convolutions with their distribution functions, if you had a Fourier transform package in JS you could probably show people some nice visualizations of adding together various random variables, and what it does to their probability distributions. This might give more insight into the generality of the CLT: you take a strange curve, convolve it with itself a bunch of times, and in the convolutions and rescaling it becomes first a skewed distribution, then a normal one.
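A pure-Python version of that idea (no FFT package, just direct convolution with a made-up skewed pmf of my own choosing) shows the skewness dying off as more independent copies are added:

```python
# Repeatedly convolve a deliberately lopsided pmf with itself and watch
# the skewness shrink like gamma / sqrt(n).
def convolve(p, q):
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def skewness(pmf):
    mean = sum(i * p for i, p in enumerate(pmf))
    var = sum((i - mean) ** 2 * p for i, p in enumerate(pmf))
    third = sum((i - mean) ** 3 * p for i, p in enumerate(pmf))
    return third / var ** 1.5

start = [0.7, 0.1, 0.2]   # index = value; a strange, skewed little distribution
pmf = start
skews = []
for _ in range(5):
    pmf = convolve(pmf, start)        # add one more independent copy
    skews.append(abs(skewness(pmf)))
print(skews)                          # steadily shrinking
```

Plotting `pmf` after each convolution gives exactly the visual described above: strange curve, then skewed bump, then bell.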


Wikipedia and MathWorld[1] define the cumulant-generating function to be the log of the moment generating function, i.e. a Laplace transform, rather than a Fourier transform. This is (essentially) just notational since the proof still holds, but it does mean that the i's and -'s disappear, e.g. with the Laplace-CGF, c'(0) = μ, c''(0) = σ², c'''(0) = σ³ γ.

This proof also works directly with moment generating functions and characteristic functions (which are the same as the two forms of CGF's without the logarithm), which avoids the slight issue of ln(cos(k)) when cos(k) = 0.

[1]: http://mathworld.wolfram.com/Moment-GeneratingFunction.html


That was in some sense intentional -- I prefer exp(i μ k) exp(-k² σ² / (2 n)) for the clarity of the expression in Fourier space.

That issue is not avoided -- it becomes ln(cosh(k)), which still only has a finite radius of convergence due to branch points at ± i π/2, and a complicated branch-sheet structure. It is not dissimilar from 1/(1 + x^2) -- a perfectly smooth ordinary real function on the real line, very pleasant to work with, but the (real!) Taylor series at 0, 1 - x^2 + x^4 - x^6 + ... does not converge outside of the interval (-1, 1) because it must go to infinity at ± i on the complex plane.


> In probability theory, the central limit theorem (CLT) states that, given certain conditions, the mean of a sufficiently large number of independent random variables, each with a well-defined mean and well-defined variance, will be approximately normally distributed (i.i.d.).

No, that's just wrong. The context is some positive integer n and the values of n real valued random variables that are independent and identically distributed (i.i.d.).

If for some fixed n you do what the quote says, you are not guaranteed to get a Gaussian.

Moreover, if you do what the quote says and take the limit as n goes to positive infinity, then, say, in the case where the random variables have finite expectation and finite variance, you will get just the common expectation of those random variables -- this is the law of large numbers, not the central limit theorem. That the distribution of the mean converges to the point at the expectation is the 'weak' law of large numbers. That the mean converges to the expectation with probability 1 is the strong law of large numbers.

For the central limit theorem, add up the n i.i.d. values, subtract n times the common expectation, and divide by the square root of n. Then under meager assumptions (e.g., i.i.d. with finite variance or, more generally, the Lindeberg-Feller conditions), as n goes to infinity, the distribution of the result will converge to a Gaussian ('normal') distribution. Authors with the details include J. Neveu, L. Breiman, M. Loeve, and K. Chung.

It's essential to divide by the square root of n, not just n.
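A sketch of exactly that recipe -- sum, subtract n times the expectation, divide by sqrt(n) times the standard deviation -- applied to a skewed population (exponential, chosen arbitrarily; trial counts and seed are my own):

```python
import math
import random
import statistics

random.seed(3)

# Sum n skewed exponential(1) draws, subtract n*mu, divide by sqrt(n)*sigma;
# the standardized sums should look approximately standard normal.
n, trials = 200, 4000
mu = sigma = 1.0   # exponential(1): mean 1, standard deviation 1

zs = []
for _ in range(trials):
    s = sum(random.expovariate(1.0) for _ in range(n))
    zs.append((s - n * mu) / (math.sqrt(n) * sigma))

mean_z = statistics.fmean(zs)
std_z = statistics.pstdev(zs)
frac_95 = sum(abs(z) < 1.96 for z in zs) / trials   # ~0.95 for N(0, 1)
print(mean_z, std_z, frac_95)
```

Dividing by n instead of sqrt(n) would squeeze all the `zs` toward 0, which is the law-of-large-numbers limit described above.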


Kind of a misleading animation, since n is the number of switches (4) and not the number of balls (which diverges to infinity).

Setting the bins to 500 froze the browser, unfortunately.

The physical machines that do this are undeniably cool, but especially when they have the curve painted on beforehand; this is the best I could find right now, but I remember seeing a pretty big one as part of a traveling mathematics museum exhibition:

http://www.youtube.com/watch?v=AUSKTk9ENzg


When I worked at DEKA, I met a guy who designed and built an invertible Galton Box. The ball-holder was circular and if you spun it one way, the balls would fall into the expected bell curve. But if you turned it the other way, the pins would shift very subtly - so little you wouldn't really notice - but the balls would fall in a perfect inverse bell curve, with most of the balls on the outside and very few in the middle. I think the company helped him get a patent on the mechanism :)


That's awesome. Did it have the inverse curve drawn on too? For some reason, the certitude expressed by "calling your shot" like that made more of an impression on me than the falling balls on their own.


what browser are you using? the bins <input> tag has a max value set to 25


I'm using Firefox 22.0 and I have no problem setting it higher than that.


the `max` attribute only applies to using the native spinner controls, you can still select and type larger values.


Firefox 21


I have to learn this Theorem for my exams (which are in a week). All I know is that

(X1+X2+ ... + Xn - n * mean)/sqrt(n * variance) --> N(0,1)

I don't see how that is a visualization of this theorem.


X1, X2, ..., XN are Bernoulli(0.5) variables. You can increase N to show that the distribution approaches a Gaussian distribution, as long as you normalize the variance correctly (otherwise you get either no distribution at all, or you get a delta distribution).



"The Bean Machine, also known as the quincunx or Galton box, is a device invented by Sir Francis Galton[1] to demonstrate the central limit theorem, in particular that the normal distribution is approximate to the binomial distribution." - http://en.wikipedia.org/wiki/Bean_machine


I don't understand, they take a distribution known to look like a normal distribution, then... yep it looks like a normal distribution, central limit theorem proved.


This is pretty great. I've just started using D3 and it's pretty powerful stuff.

For those interested in data visualization - also take a look at Vega (http://trifacta.github.io/vega/) - it makes creating the simpler charts easier than using D3.


Here's what's going on, what will illustrate the central limit theorem (CLT) and what will not:

For positive integer n and for i = 1, 2, ..., n, let real valued random variable X(i) be so that

P(X(i) = -1 = P(X(i) = 1 ) = 1/2

Assume that {X(1), ..., X(n)} is an independent set of random variables.

Easily the expectation E[X(i)] = 0.

Let real valued random variable

S(n) = X(1) + X(2) + ... + X(n)

Easily E[S(n)] = 0.

For n = 4, S(n) can take on only the even values -4, -2, 0, 2, 4 (the parity of S(n) matches the parity of n).

The animation in the OP should converge to the density of S(n) where n = 4.

By the CLT, as n approaches infinity, the density of

( 1/sqrt(n) ) S(n)

will be Gaussian with expectation 0.

By the strong law of large numbers, the probability that the random variable

(1/n) S(n)

converges to 0 as n approaches infinity is 1.

For increasing n, S(n) is a martingale and is a discrete approximation to Brownian motion. As n approaches infinity, the density of S(n) approaches 0 everywhere on the real line and in its limit is not a density.
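The exact distribution of S(4) is easy to tabulate; a small Python sketch (my own, exact arithmetic):

```python
from fractions import Fraction
from math import comb

# Exact distribution of S(n) for n = 4: S(n) = 2*K - n with
# K ~ Binomial(n, 1/2), so S(4) only lands on -4, -2, 0, 2, 4.
n = 4
pmf = {2 * k - n: Fraction(comb(n, k), 2 ** n) for k in range(n + 1)}
expectation = sum(s * p for s, p in pmf.items())
print(pmf)           # {-4: 1/16, -2: 1/4, 0: 3/8, 2: 1/4, 4: 1/16}
print(expectation)   # 0
```

These are the bin heights the animation should converge to for the default settings.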


Errata:

Change

P(X(i) = -1 = P(X(i) = 1 ) = 1/2

to

P(X(i) = -1 ) = P(X(i) = 1 ) = 1/2

For more, let v(i) be the variance of X(i). Then since E[X(i)] = 0

  v(i) = E[ (X(i) - E[X(i)])^2 ]

       =  E[ (X(i))^2 ]

       = (1/2) 1 + (1/2) 1

       = 1
Since in the i.i.d. case the variance of a sum is the sum of the variances, the variance of S(n) is n. Then the standard deviation of S(n) is sqrt(n) so that the standard deviation of

(1/sqrt(n)) S(n)

is 1.

So, for large n, the density of

(1/sqrt(n)) S(n)

converges to Gaussian with expectation 0 and standard deviation 1 and variance 1.


Errata:

Change

Since in the i.i.d. case the variance of a sum is the sum of the variances,

to

Since in the independent case, and, hence, also the i.i.d., case, the variance of a sum is the sum of the variances,


Very cool. Could I suggest that you display the total balls dropped somewhere as well?

Edit: just two niggles - (a) 13 bins would be better numbered 0-12 or 1-13. (b) I would round the percentages to nearest value (not down).


good idea! i'll try and add this when i have time.


I had to finish the number examples on that page to prove this to myself:

  000 -> 0    001 -> 1    010 -> 1    011 -> 2
  100 -> 1    101 -> 2    110 -> 2    111 -> 3

  sum 0 -> 1 way:  12.5%
  sum 1 -> 3 ways: 37.5%
  sum 2 -> 3 ways: 37.5%
  sum 3 -> 1 way:  12.5%

  0000 -> 0    0001 -> 1    0010 -> 1    0011 -> 2
  0100 -> 1    0101 -> 2    0110 -> 2    0111 -> 3
  1000 -> 1    1001 -> 2    1010 -> 2    1011 -> 3
  1100 -> 2    1101 -> 3    1110 -> 3    1111 -> 4

  sum 0 -> 1 way:  6.25%
  sum 1 -> 4 ways: 25%
  sum 2 -> 6 ways: 37.5%
  sum 3 -> 4 ways: 25%
  sum 4 -> 1 way:  6.25%
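The same enumeration, automated (a sketch of my own using the standard library):

```python
from collections import Counter
from itertools import product

# Count how many n-bit patterns land in each bin (number of 1 bits).
def bin_counts(n):
    return Counter(sum(bits) for bits in product((0, 1), repeat=n))

for n in (3, 4):
    counts = bin_counts(n)
    total = 2 ** n
    print(n, {k: f"{100 * v / total:g}%" for k, v in sorted(counts.items())})
```

The counts per bin are the binomial coefficients C(n, k), which is why the percentages match Pascal's triangle.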


Sorry hn killed the formatting :(


You may want to do a bit of proofreading. It's difficult for me to take the content of this article seriously with so many spelling mistakes; especially since you misspelled "probability", which is the focus of the entire thing.


thanks for pointing this out. I updated the document.


This is great -- I love math visualized. It's fun to set the settings to no delay and just watch the bar graph normalize really quickly.

http://i.imgur.com/E8yppGG.png


Turn that into something I can gamble on and I'd play it all day.


while i have the attention of HN, does anyone have suggestions on any other types of statistic visualizations?


You could do a Poisson limit graphic. The result is at

http://en.wikipedia.org/wiki/Poisson_limit_theorem

Basically, you need a whole lot of events which occur in an essentially independent way, and each one is rare. You add up the number of events that happened, and it is approximately Poisson. Something like tossing 1000 marbles into a 100x100 tile floor, and counting the number of marbles that landed in some 10x10 tile area.
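A sketch of exactly that marble experiment (my own; the repeat count and seed are arbitrary choices):

```python
import random
import statistics

random.seed(4)

# 1000 marbles land uniformly on a 100x100 tile floor; count the marbles
# in one fixed 10x10 patch (probability 0.01 per marble, so mean 10).
def marbles_in_patch():
    return sum(
        1
        for _ in range(1000)
        if random.randrange(100) < 10 and random.randrange(100) < 10
    )

counts = [marbles_in_patch() for _ in range(2000)]
mean = statistics.fmean(counts)
var = statistics.pvariance(counts)
print(mean, var)   # both near 10: variance ~ mean is the Poisson signature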


Brownian motion remains Brownian motion at all scales if you scale correctly.

More specifically, if you take a rectangle of width 1/nth of your visualization box, height 1/sqrt(n)th of your visualization box, centered on the point of the walk in the middle of the box then zoom in, you get another random walk with the same statistical properties as the original.
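A sketch of that rescaling check for the discrete walk (my own; the walk length, zoom factor, and seed are arbitrary choices):

```python
import math
import random
import statistics

random.seed(5)

def endpoint(steps):
    """Endpoint of a simple +-1 random walk."""
    return sum(random.choice((-1, 1)) for _ in range(steps))

# Zooming: take 1/k of the walk's length and stretch the height by
# sqrt(k); the spread should match the full-length walk (~sqrt(N) = 40).
N, k, trials = 1600, 16, 3000
sd_full = statistics.pstdev(endpoint(N) for _ in range(trials))
sd_zoom = statistics.pstdev(math.sqrt(k) * endpoint(N // k) for _ in range(trials))
print(sd_full, sd_zoom)
```

The two spreads agree, which is the 1/n-width, 1/sqrt(n)-height self-similarity described above, restricted to endpoints.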


There is a gif of this at

http://en.wikipedia.org/wiki/Wiener_process#Self-similarity

which is rather nice.

If you get some infrastructure for making Brownian motion sample paths, another good one is that it crosses zero infinitely often in the neighborhood of the origin. Keep zooming in, and see more and more crossings. This is another instance of the scaling rule you mention, but it is neat to see.


Something that shows bids and offers converging to a market price. Not sure how you'd do it. In the Swensen/Yale finance lecture series on iTunes U they play a game with participants buying/selling items and a pattern emerges in the prices as if by magic.


I'd like to hear the answer to this as well. Especially in light of some of Bret Victor's recent talks, I'm very eager to hear about tools and visualizations that enhance thinking and learning.


Some of the best advanced statistical visualizations I've seen are by Theory: http://theory.info/visuals.


MCMC


It does get a lot more visually interesting with a very high # of bins, and very small delay.


Also known as Plinko Theorem


This is not the Central Limit Theorem.


Really? I see a collection X1, X2, ..., XN of Bernoulli(0.5) variables and the demonstration that the distribution of their sum, normalized for variance, approaches the Gaussian distribution as N increases. Is that not a direct consequence of the CLT?


> the distribution of their sum

Exactly. But the page defines the CLT as referring to:

"the mean of a sufficiently large number of independent random variables ..."

So it talks about the mean and then illustrates the sum, just as you are, which is confusing. I spent several minutes puzzling about how the mean of variables between zero and 1 could be 5, for example.

Now, I suppose you will say, the distribution of the mean is just the distribution of the sum, normalized, and this ought to be obvious to anybody with half a brain, but if this page is meant to help people trying to learn this stuff understand it, it would be better if it was accurate and not directly confusing with its language.


No, it is a property of Bernoulli trials. The distribution is a Binomial distribution, not a Normal distribution. They do, however, look similar http://en.wikipedia.org/wiki/File:Binomial_Distribution.svg


what do you mean by "normalized for variance"?


nice!


thank you!



