The following is an excerpt from my forthcoming book on Causal Machine Learning. An early-access version of the book is available online.
Cause and effect relationships drive data. But statistical analysis alone is insufficient to recover those cause-and-effect relationships from that data; as the adage says, correlation does not imply causation. For this, we need causal reasoning.
Causal reasoning is a crucial element of how humans understand, explain, and make decisions about the world. Causal AI means automating causal reasoning with machine learning. Today's learning machines have superhuman prediction ability but aren't particularly good at causal reasoning, even when we train them on massive amounts of data. In this book, you will learn how to write algorithms that capture causal reasoning in the context of machine learning and automated data science.
Though humans rely heavily on causal reasoning to navigate the world, our cognitive biases make our causal inferences highly error-prone. Improving our ability to answer causal questions has been the work of millennia of philosophers, centuries of scientists, and decades of statisticians. But now, a convergence of statistical and computational advances has shifted the focus from discourse to algorithms that we can train on data and deploy in software. It is a fascinating time to learn how to build causal AI.
Why should I or my team care about causal data science and AI?
I want to present some high-level reasons motivating the study of causal modeling. These reasons apply to researchers, individual contributors, and managers working on data science, machine learning, and other domains of data-driven automated decision-making.
Better data science
Organizations in big tech and tech-powered retail have realized the importance of causal inference and are paying top salaries to people with a causal inference skill set. The main reason is that the goal of data science, extracting actionable insights from data, is a causal task. For example, when a data scientist analyzes the statistical relationship between a social feed feature and engagement, they want to know if the feature causes more engagement. Causal modeling helps the data scientist achieve that goal in several ways.
Simulated experiments and causal effect inference
Causal effect inference, quantifying how much a cause (e.g., a promotion) influences an effect (e.g., sales), is the most common goal of applied data science. Randomized experiments, such as an A/B test, are the gold standard for causal effect inference. The concepts of causal inference explain why randomized experiments work so well: randomization eliminates non-causal sources of statistical correlation.
More importantly, causal inference enables data scientists to simulate experiments and estimate causal effects from observational data (data we observe in passing). Most data in the world is observational data because most data is not the result of an experiment. “Big data” is almost always observational data. When a tech company boasts of training a natural language model on petabytes of Internet data, it is observational data. When we can’t run a randomized experiment because of infeasibility, cost, or ethics, causal inference enables data scientists to turn to observational data to estimate causal effects.
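The contrast between experimental and observational data can be made concrete with a small simulation. This is an illustrative sketch with made-up numbers: a hypothetical "loyalty" confounder drives both promotion exposure and sales, so the naive observational comparison overstates the promotion's true effect, while randomized assignment recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical confounder: loyal customers see more promotions AND buy more anyway.
loyalty = rng.binomial(1, 0.5, n)

# Observational data: promotion exposure depends on loyalty (confounding).
promo_obs = rng.binomial(1, 0.2 + 0.6 * loyalty)
sales_obs = 10 + 5 * loyalty + 2 * promo_obs + rng.normal(0, 1, n)

# Randomized experiment: promotion assigned by coin flip, independent of loyalty.
promo_rct = rng.binomial(1, 0.5, n)
sales_rct = 10 + 5 * loyalty + 2 * promo_rct + rng.normal(0, 1, n)

# Difference in mean sales between exposed and unexposed groups.
naive = sales_obs[promo_obs == 1].mean() - sales_obs[promo_obs == 0].mean()
rct = sales_rct[promo_rct == 1].mean() - sales_rct[promo_rct == 0].mean()

print(f"true effect: 2.0 | naive observational: {naive:.2f} | randomized: {rct:.2f}")
```

The naive estimate is inflated because loyal customers are overrepresented in the promoted group; randomization breaks that link. Causal inference methods aim to make the observational estimate behave like the randomized one by adjusting for confounders such as loyalty.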
A common belief is that in the era of big data, it is easy to run virtually unlimited experiments. If you can run unlimited experiments, who needs causal inference? But even when you can run experiments at little actual cost, there are often opportunity costs to running experiments. You can build a causal model based on data from old experiments and use it to select new experiments by predicting the outcome of those new experiments.
For example, suppose a data scientist at an e-commerce company has a choice of one thousand ways to tweak search results to increase sales. They have one thousand experiments to run, and running each would take time. Further, running experiments with suboptimal changes to the search results would sacrifice some sales. Causal modeling allows that data scientist to use past experimental data to build a causal model and use that model to simulate the results for each of these one thousand potential experiments. If that causal model captures the causal relationship between search results and conversion rates, it could potentially predict which search result variants would lead to suboptimal outcomes even if those variants were never directly tested before. The data scientist could simulate all thousand experiments and then prioritize the experiments predicted to have the most impactful results in simulation. They avoid wasting clicks and sales on less insightful experiments. Further, the data scientist could use results from new experiments to update the causal model, improving causal predictions in each iteration.
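A minimal sketch of this prioritization loop, under invented assumptions: each search variant is described by two hypothetical tweak parameters, past A/B tests provide measured conversion lifts, and a simple least-squares response model (a stand-in for a full causal model) ranks the untested candidates.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: each search-result variant is described by two tweak
# parameters (e.g., ranking weight on price and on popularity).
n_past, n_candidates = 50, 1000
past_x = rng.uniform(-1, 1, (n_past, 2))        # variants already A/B tested
cand_x = rng.uniform(-1, 1, (n_candidates, 2))  # untested candidate variants

# Measured conversion-rate lift from past experiments (with noise).
true_w = np.array([0.8, -0.3])
past_lift = past_x @ true_w + rng.normal(0, 0.05, n_past)

# Fit a simple response model on past experimental data (least squares;
# a real causal model would encode the causal assumptions explicitly).
X = np.column_stack([np.ones(n_past), past_x])
w, *_ = np.linalg.lstsq(X, past_lift, rcond=None)

# Simulate every candidate experiment and rank by predicted lift.
pred = np.column_stack([np.ones(n_candidates), cand_x]) @ w
priority = np.argsort(pred)[::-1]
print("top 5 candidate variants to test next:", priority[:5])
```

Because the model was trained on experimental (randomized) data, its predictions are causal within the range of tested variants; each new experiment's results can be appended to `past_x` and `past_lift` to refit and sharpen the ranking.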
Counterfactual data science
Counterfactual questions have the form, “given what happened, what would have happened if things had been different?” Causal modeling provides a logical way to predict counterfactual outcomes. Data science that can infer counterfactuals can answer critical business questions more directly.
For example, the TV show The Office was the most popular series on Netflix. That posed a problem for Netflix because Netflix licenses The Office from Comcast, which competes with Netflix through its own streaming service. If Comcast decided to deny Netflix access to The Office, Netflix would be in danger of losing subscribers who don’t watch much else. Thus, Netflix had a strong incentive to find ways to encourage these subscribers to engage more with other Netflix content.
Suppose you are a data scientist at Netflix. The company introduces the show Space Force, which, like The Office, stars actor Steve Carell and has a similar flavor of dry humor. The hope is that Space Force would act as a "gateway drug" to greater engagement with other Netflix content.
The classical data science analysis divides heavy viewers of The Office into those who watched Space Force and those who didn't, controls for other variables, and looks for a statistically significant difference in their hours of engagement with other content (e.g., using a two-sample hypothesis test). This analysis would answer the question, "How much does Space Force drive engagement among The Office watchers?" That question is indeed relevant to the business problem of getting The Office watchers to engage with other content.
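The classical analysis might look like the following sketch, with entirely made-up engagement numbers, using Welch's two-sample t-test to compare the two groups of The Office viewers.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical weekly hours of non-Office engagement for heavy viewers
# of The Office, split by whether they watched Space Force.
watched = rng.normal(6.5, 2.0, 1000)   # watched Space Force
did_not = rng.normal(6.0, 2.0, 1000)   # did not watch Space Force

# Welch's two-sample t-test: is the difference in mean engagement significant?
t_stat, p_value = stats.ttest_ind(watched, did_not, equal_var=False)
print(f"difference in means: {watched.mean() - did_not.mean():.2f} hours, "
      f"p = {p_value:.4f}")
```

A small p-value supports the claim that Space Force watchers engage more, but, as the next paragraph shows, it cannot say what those viewers would have done in a world without Space Force.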
However, consider the additional counterfactual questions you could answer with data and a causal model. For example, consider those The Office watchers who watched Space Force. If Space Force had not been introduced, would they have spent that time watching The Office? Alternatively, had Space Force not been introduced, and if The Office were no longer available, would they have spent that time watching content from a Netflix competitor? These questions get more at the heart of the business problem. And experiments alone cannot answer these questions. They require a causal model.
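One standard recipe for answering such questions is the three-step counterfactual procedure from structural causal models: abduction (infer the individual's unobserved noise from what actually happened), action (intervene to set the variable of interest), and prediction (recompute the outcome). A toy sketch, with an invented linear model and made-up coefficients that are not Netflix's:

```python
# Toy structural causal model (illustrative only):
#   engagement = 2.0 * sf_watched + 0.5 * office_fandom + noise
beta_sf, beta_fan = 2.0, 0.5

# Observed: a heavy Office fan (hypothetical fandom score 8) who watched
# Space Force and logged 7.5 hours of other-content engagement.
fandom, sf_watched, engagement = 8.0, 1.0, 7.5

# Step 1 (abduction): recover this user's noise term from the observation.
noise = engagement - (beta_sf * sf_watched + beta_fan * fandom)

# Step 2 (action): intervene -- set "watched Space Force" to 0.
sf_counterfactual = 0.0

# Step 3 (prediction): what engagement would this same user have had?
engagement_cf = beta_sf * sf_counterfactual + beta_fan * fandom + noise
print(f"counterfactual engagement without Space Force: {engagement_cf:.1f} hours")
```

The key point is step 1: the counterfactual is about this particular user, so we carry their inferred idiosyncratic noise into the hypothetical world. No experiment can do this, because the same user cannot both watch and not watch Space Force.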
Better attribution, credit assignment, and root cause analysis
The “attribution problem” in marketing is perhaps best articulated by a quote attributed to entrepreneur and advertising pioneer John Wanamaker:
Half the money I spend on advertising is wasted; the trouble is I don’t know which half.
In other words, it is difficult to know what advertisement, promotion, or other action caused a specific customer behavior, sales number, or another key business outcome. Even in online marketing, where the data has gotten much richer and more granular than in Wanamaker’s time, attribution remains a challenge. For example, a user may have clicked after seeing an ad, but was it that single ad view that led to the click? Or were they going to click anyway? Or perhaps there was a cumulative effect of all the nudges to click they received over multiple channels? Causal modeling addresses the attribution problem by using formal causal logic to answer “why” questions, such as “why did this user click?”
Attribution goes by other names in other domains, such as "credit assignment" and "root cause analysis." The core meaning is the same: we want to understand why a particular outcome happened. We know what the causes are in general; we want to know how much a particular cause is to blame in a given instance.