But bro, ever hear of *Bayes* Occam's Razor?
Yet another reason why Bayesian reasoning is so boss.
Model complexity and Occam's Razor
Occam's razor is an inductive bias where one assumes that the simplest consistent hypothesis for how the data was generated is the best hypothesis.
Here, consistent means that the hypothesis can't be so simple as to fail to explain the data. For example, a common theme in detective stories is for the police to blame the wrong guy because that guy has a dark past and a dark character. However, the detective knows the suspect is innocent because the hypothesis that the suspect did the crime is simply inconsistent with the facts of the case.
Occam's razor certainly applies in statistical modeling. The model's "hypothesis" is a mapping between the input data and the thing being inferred/predicted. Occam's razor says that the less complex the mapping, the better.
Complexity is closely related to the problem of overfitting. More complex models provide tighter fits to the data. However, if they are too complex, they overfit the data, meaning that the model won't generalize well to new data. Occam's razor avoids this by applying a famous (and possibly apocryphal) Einstein quote to model-building:
Everything should be made as simple as possible, but not simpler.
~ Albert Einstein
Complexity comes in two flavors: syntactic and semantic.
We need to define complexity. Let's first distinguish between the semantic aspects of a model and the syntactic aspects of a model. Model semantics has to do with the relationship between the modeling abstractions and the real-world phenomena the model is trying to represent. For example, words like fit and accuracy are semantic notions because they describe how well the model's predictions align with data we see in the real world.
In contrast, model syntax has to do with the rules and structural descriptions that constrain a model. Any element of the model that isn't directly related to how we interpret the model and its predictions in the real world is a syntactic element. Words like hyperparameter, neuron, activation function, knot vector, and kernel all name syntactic elements of various models.
In my experience, statisticians and machine learners tend to pursue a syntactic version of Occam's razor when they model. Specifically, they use structural rules and penalties to constrain complexity. For example, some use L1 regularization, dropout, and pruning to make the model's structure more sparse. Alternatively, some use L2 regularization or constraints to limit the values of parameters and weights during training.
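To make the syntactic flavor concrete, here is a minimal sketch of L1 and L2 penalties added to a training loss. The function names and the strength values are made up for illustration; they don't come from any particular library.

```python
def l1_penalty(weights, strength):
    """L1 penalty: proportional to the sum of absolute weights (pushes weights to exactly zero)."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """L2 penalty: proportional to the sum of squared weights (keeps weights small)."""
    return strength * sum(w * w for w in weights)


def penalized_loss(data_loss, weights, l1=0.0, l2=0.0):
    """Data loss plus structural penalties; l1 and l2 are hand-picked knobs."""
    return data_loss + l1_penalty(weights, l1) + l2_penalty(weights, l2)


# Hypothetical weights and penalty strengths, chosen only for the example.
weights = [1.0, -2.0, 0.0]
print(penalized_loss(1.0, weights, l1=0.5, l2=0.5))
```

Notice that nothing in the code says what l1 and l2 should be. The penalties act on the model's structure, and their strengths are typically tuned by trial and error.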
The trouble with the syntactic approach to Occam's razor is that it is difficult to justify on principled, non-arbitrary grounds. For example, of all the methods I mentioned for minimizing complexity in deep learning, we don't have general principles for how some of them work or when they work. We make arbitrary decisions about their application until they do work. Trial-and-error is certainly workable, just not ideal.
More importantly, our goal of finding a balance between complexity and fit suffers from this mismatch between the syntactic and semantic perspectives. When we see complexity in terms of the nuts and bolts of the model, while seeing fit in terms of how the model performs in the real world, we can only find arbitrary ways to balance the two.
Introducing Bayes Occam's Razor
Bayesian Occam's razor is a form of Occam's razor that shows up in Bayesian modeling, specifically Bayesian models with semantic abstractions that connect to concepts in the real world*.
A good fit means that most of the hypotheses generated by the model have a high posterior probability. A bad fit means that only a few hypotheses have a high probability.
Complexity for a Bayesian model means the model can generate a wide variety of hypotheses that could lead to realistic data. Simplicity means it generates a limited set of hypotheses that lead to realistic data. This definition is semantic because it is in terms of how the model generates the kinds of data we see in the real world.
For example, what kinds of models could explain a global pandemic? A simple model might generate simple hypotheses like how global warming and human encroachment on animal habitats are leading to unprecedented interactions between humans and microbes.
In contrast, a complex model sounds like an internet conspiracy theory generator. It generates elaborate hypotheses about how billionaires, extra-terrestrial reptiles, and the deep state conspire to cause a pandemic because they want to enforce mind-control through vaccinations and 5G. The model is complex because, for a given hypothesis, the generation of realistic data follows from interactions between many highly improbable events.
The elegance of Bayes Occam's razor is as follows: reducing model complexity (while staying consistent) will increase fit. There is no trade-off; fit and complexity balance perfectly.
The intuition is that improving "fit" means increasing the number of hypotheses that have a high probability of having generated the data. Hypotheses from complex models are long chains of probabilistic events, so the likelihood of a datapoint is the product of all the probabilities in that chain. When you multiply probabilities together in sequence, the result gets small quickly. So reducing complexity means reducing the number of probability multiplications you need for a hypothesis to explain some data, which then increases the likelihood of that data. As you know from Bayes' rule, a higher likelihood means a higher posterior probability for the hypothesis, all else being equal. The more high-probability hypotheses the model generates for a dataset, the better the fit.
So changing your model to reduce complexity increases fit.
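Here is a toy sketch of that multiplication argument. Everything in it is an assumption made for illustration: the step probabilities are invented, the events in each chain are treated as independent, and the two hypotheses get equal priors.

```python
import math


def chain_probability(step_probs):
    """Probability that every event in a chain happens, assuming independent events."""
    return math.prod(step_probs)


def posteriors(priors, likelihoods):
    """Bayes' rule over a finite set of hypotheses: prior times likelihood, normalized."""
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized]


# Invented numbers: a short chain of events versus a long chain of unlikely ones.
simple_chain = [0.5, 0.4]
elaborate_chain = [0.5, 0.4, 0.3, 0.2, 0.1, 0.1]

likelihoods = [chain_probability(simple_chain), chain_probability(elaborate_chain)]
post = posteriors([0.5, 0.5], likelihoods)

print("likelihoods:", likelihoods)
print("posteriors:", post)
```

Every extra event in the elaborate chain multiplies the likelihood by a number less than one, so with equal priors the shorter chain ends up with nearly all of the posterior probability.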
* Good examples include probabilistic latent variable models (e.g., HMMs), Bayesian hierarchical models, probabilistic graphical models, and Bayesian non-parametric models.
I have a few more inductive bias posts in the pipeline. If you liked this, help me grow this newsletter!