Causality and what Bayesians are afraid to talk about.
Part 3: A primer on IID, exchangeability, and Bayesian causality.
Two posts ago, I pointed out a recent machine learning paper by several famous people (including Yoshua Bengio, who won a Turing Award for deep learning). The paper argues that machine learning models are too fragile and that we need "causal representation learning" to beat this fragility.
They argue that current machine learning attempts (and fails) to “engineer away” the artifacts of causality:
Machine learning often disregards information that animals use heavily: interventions in the world, domain shifts, temporal structure — by and large, we consider these factors a nuisance and try to engineer them away. In accordance with this, the majority of current successes of machine learning boil down to large scale pattern recognition on suitably collected independent and identically distributed (IID) data.
IID is a common simplifying assumption in statistical learning. The gist of the paper is that most of the time, unseen causal processes make the data decidedly un-IID. Rather than pretending the data is IID, modelers should account for those causal processes in their model.
It got me thinking about how Bayesian statisticians tend to approach this problem - they focus on an idea called exchangeability.
In short, combining exchangeability with a bit of Bayesianism lets you get conditional IID and thereby justifies the use of popular modeling approaches.
But if you Google "exchangeability," you get a bunch of hits laden with jargon from probability theory and statistics. The jargon dances around the fact that we rely on causal intuition for exchangeability to be of any practical use.
Exchangeability explained (again).
A few posts back, I summarized exchangeability, and I’m basically going to repeat that summary here.
Suppose I flip a coin three times. Before you see each outcome, I hide it under a cup.
Suppose I offer you a $10 bet that will pay you $100 if you get the sequence H-H-T.
Would you take the bet? It would be a sound bet because your expected return is > 0 ($2.50), which is literally what “the odds are in your favor” means.
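The arithmetic behind that $2.50 takes only a couple of lines of Python (the stake and payout are the numbers from the bet above):

```python
# Expected return of the $10 bet that pays $100 on the exact sequence H-H-T.
p_hht = 0.5 ** 3            # fair coin: P(H) * P(H) * P(T) = 1/8
stake, payout = 10, 100
expected_return = p_hht * payout - stake
print(expected_return)      # 2.5
```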
Now, what if, before you saw any of the outcomes, I covered each coin with a cup and randomly shuffled their order? Note that if the initial sequence was indeed H-H-T, it probably isn't now.
Should you change your bet?
Hopefully, your intuition says no. The word for that intuition is exchangeability; used in a sentence, you'd say, "this sequence of random variables is exchangeable."
Exchangeability means that swapping the positions of the variables in the sequence doesn't change the sequence's joint probability distribution.
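Here is a minimal sketch of that definition, assuming a hypothetical two-point prior over the coin's bias (0.7 or 0.3, each with probability 1/2). Because the marginal probability of a sequence depends only on its counts of heads and tails, every reordering of H-H-T comes out the same:

```python
from itertools import permutations

# Hypothetical prior over the coin's bias: (bias, weight) pairs.
PRIOR = [(0.7, 0.5), (0.3, 0.5)]

def seq_prob(seq):
    """Marginal probability of an H/T sequence, averaging over the unknown bias."""
    total = 0.0
    for bias, weight in PRIOR:
        p = 1.0
        for flip in seq:
            p *= bias if flip == "H" else 1 - bias
        total += weight * p
    return total

# Exchangeability: every ordering of two heads and one tail is equally likely.
probs = {round(seq_prob(perm), 12) for perm in set(permutations("HHT"))}
print(probs)  # {0.105}
```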
Why exchangeability matters in modeling.
The reason we care about exchangeability in statistical learning is an idea called de Finetti’s theorem. Here is a plain-English description in terms of coin flips:
Suppose I do a large number of exchangeable coin flips and cover each one in a cup.
But some of the coins are weighted in favor of heads, and some of them are weighted in favor of tails.
Then the flips are IID conditional on the biases of the coins.
The nice thing is that I don’t need to know the biases; I can simply model my uncertainty about them with a probability model.
This theorem generalizes beyond coin flips. Generally, if some random outcome is exchangeable, it becomes conditionally IID given some hidden variables (like the weights of coins). The theorem is useful because I don't have to know what those hidden variables are to get conditional IID; I just need to quantify my uncertainty about them with a probability model.
So if you can assume your data is exchangeable (a weaker assumption than IID), you get conditional IID for free, so long as you use a probability model that includes those hidden variables.
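A quick way to see "exchangeable but not IID" is to compute the covariance between two flips under the same kind of hypothetical two-point prior over the bias. Marginally the flips are correlated, since the first flip tells you something about the bias and hence about the second flip, but given the bias the covariance is zero:

```python
# Hypothetical prior: the coin's bias is 0.7 or 0.3, each with probability 1/2.
prior = [(0.7, 0.5), (0.3, 0.5)]

e_bias    = sum(w * p for p, w in prior)       # E[bias] = 0.5
e_bias_sq = sum(w * p * p for p, w in prior)   # E[bias^2] = 0.29

# Marginally: Cov(flip1, flip2) = E[bias^2] - E[bias]^2 = Var(bias) > 0.
marginal_cov = e_bias_sq - e_bias ** 2
# Conditionally on the bias, the flips are independent (covariance 0).
print(round(marginal_cov, 6))  # 0.04 -- exchangeable, but not IID
```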
Exchangeability and MNIST.
For example, in the previous post, I described how the famous MNIST handwritten digit dataset in machine learning was engineered from a dataset created by NIST, in which government employees wrote one part of the digits and high school students wrote the other.
Exchangeability and de Finetti’s theorem let you look at the NIST data and think as follows.
Hmm. I see a bunch of digits. Some are drawn by kids, some by bureaucrats. I don’t know which is which but who cares? I’m going to say the data is exchangeable because then I get to use my favorite model just as long as I include some “kid or bureaucrat” variable in the model. Nice.
Put another way, suppose we train this model on the NIST data. After training, the model encounters a new image of a handwritten digit.
Our model will then wonder about the probable label of this digit and what type of writer (high school student or bureaucrat) wrote the digit.
That probability, P(label, writer | image), is generally prohibitively hard to calculate directly. But if we combine the assumption of exchangeability with de Finetti’s theorem and Bayes’ rule, then that probability is proportional to the product of three probabilities: P(image | label, writer) × P(label) × P(writer).
“Proportional to” means that finding the values of “label” and “writer” that maximize this product of probabilities will also maximize the original probability. The product is much easier to learn. Suppose the label is “8” (duh) and the writer is a high school student.
These probabilities, P(label) and P(writer), are easy to model: we estimate them from the proportion of images in the data labeled 8 and the proportion of images written by high school students.
The machine learning algorithm would then quantify the remaining probability, P(image | label, writer), i.e., the probability of an image that looks like this given that the writer was a high schooler trying to write the number 8.
That’s easy for modern statistical machine learning to do, given enough data.
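To make the factorization concrete, here is a toy sketch in Python. The label/writer columns and the likelihood value are entirely made up for illustration; in practice, P(image | label, writer) would come from a trained model:

```python
# Made-up toy data: labels and writer types for six images.
labels  = ["8", "8", "3", "8", "3", "8"]
writers = ["student", "bureaucrat", "student", "student", "bureaucrat", "student"]

# P(label = 8) and P(writer = student), estimated as simple proportions.
p_label_8 = labels.count("8") / len(labels)
p_student = writers.count("student") / len(writers)

# Hypothetical likelihood P(image | label=8, writer=student) from some model.
likelihood = 0.9

# Unnormalized posterior: P(label, writer | image) is proportional to
# likelihood * P(label) * P(writer).
score = likelihood * p_label_8 * p_student
print(round(score, 3))  # 0.4
```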
The missing (causal) link
Here’s a trick: when an explanation of statistical theory feels overly dense or counterintuitive, there is often a causal explanation that makes things easier to understand. Simpson’s paradox (see causal explanation), types of missing data (causal explanation), and the Monty Hall problem (causal explanation) are prime examples.
Exchangeability is no exception.
I’ve found that people tend to talk about exchangeability as if it were a cause: “because I have exchangeability, I should include this latent variable in my model,” or “because I have exchangeability, my model is a good model.”
Exchangeability is better thought of as an effect. If you have exchangeable data, something made it so.
Why is the NIST data exchangeable? Because bureaucrats and high school students write differently.
But just because the identity of the writer is explicit in the description of the NIST data doesn't mean there aren't other, implicit causes. We need to consider those implicit causes too. (Failing to think beyond the data to how the data came about is a common pathology among applied data and machine learning scientists.)
Why not gender? According to the Internet, some handwriting styles are allegedly “feminine.”
Shouldn’t those styles affect how the number 8 looks in an image?
What about age? Cultural background? Left- or right-handedness? Which of these matter and which don’t?
Toward causal representation learning
Causal representation learning tries to learn representations of the causes that matter. "Representations" here generally means learning the essence of the important cause(s) from their effects, even when you don't know all the potential causes. If we can succeed at that, Bayes and de Finetti can carry us the rest of the way to more robust AI.
Help us find like-minded thinkers. If you liked this post, share it with one friend.