Causal Inference Is Eating Machine Learning

Editor
21 Min Read


A health-tech company shipped a readmission-prediction model in early 2024. The case is a composite, drawn from deployment-failure patterns documented in the causal inference literature (including work published in Nature Machine Intelligence), but every detail maps to real failures.

Accuracy on the held-out test set: 94%. The operations team used it to decide which patients to prioritize for follow-up calls. They expected readmission rates to drop.

Rates went up.

The model had captured every correlation in the data: older patients, certain zip codes, specific discharge diagnoses. It performed exactly as designed. The test metrics were clean. The confusion matrix looked textbook.

But when the team acted on those predictions (calling patients flagged as high-risk, rearranging discharge protocols) the relationships in the data shifted beneath them. Patients who received extra follow-up calls didn’t improve. The ones who kept getting readmitted shared a different profile entirely: they couldn’t afford their medications, lacked reliable transportation to follow-up appointments, or lived alone without support for post-discharge care. The variables that predicted readmission were not the same variables that caused it.

The model never learned that distinction, because it was never designed to. It saw correlations and assumed they were handles you could pull. They weren’t. They were shadows cast by deeper causes the model couldn’t see.

A model that predicts readmission with 94% accuracy told the team exactly who would come back. It told them nothing about why, or what to do about it.

If you’ve built a model that predicts well but fails when turned into a decision, you’ve already felt this problem. You just didn’t have a name for it.

The name is confounding. The solution is causal inference. And in 2026, the tools to do it properly are finally mature enough for any data scientist to use.


The Question Your Model Can’t Answer

Machine Learning (ML) is built for one job: find patterns in data and predict outcomes. This is associational reasoning. It works brilliantly for spam filters, image classifiers, and recommendation engines. Pattern in, pattern out.

But business stakeholders rarely ask “what will happen next?” They ask “what should we do?” Should we raise the price? Should we change the treatment protocol? Should we offer this customer a discount?

These are causal questions. And answering them with associational models is like using a thermometer to set the thermostat. The thermometer tells you the temperature. It doesn’t tell you what would happen if you changed the dial.

Answering “what should we do?” with a tool designed for “what will happen?” is like using a thermometer to set the thermostat.

Judea Pearl, the computer scientist who won the 2011 Turing Award for his work on probabilistic and causal reasoning, organized this gap into what he calls the Ladder of Causation. The ladder has three rungs, and the distance between them explains why so many ML projects fail when they move from prediction to action.

Pearl’s three rungs of causal reasoning. The gap between Level 1 and Level 2 is where wrong decisions are made at scale. Image by the author.

Level 1: Association (“Seeing”). “Patients who take Drug X have better outcomes.” This is pure correlation. Every standard ML model operates here. It answers: what patterns exist in the data?

Level 2: Intervention (“Doing”). “If we give Drug X to this patient, will their outcome improve?” This requires understanding what happens when you change something. Pearl formalizes this with the do-operator: P(Y | do(X)). No amount of observational data, on its own, can answer this.

Level 3: Counterfactual (“Imagining”). “This patient took Drug X and recovered. Would they have recovered without it?” This requires reasoning about realities that never happened. It is the highest form of causal thinking.

Here’s what each level looks like in practice. A Level 1 model at an e-commerce company says: “Users who viewed product pages for running shoes also bought protein bars.” Useful for recommendations. A Level 2 question from the same company: “If we send a 20% discount on protein bars to users who viewed running shoes, will purchases increase?” That requires knowing whether the discount causes purchases or whether the same users would have bought anyway. A Level 3 question: “This user bought protein bars after receiving the discount. Would they have bought them without it?” That requires reasoning about a world that didn’t happen.

Most ML operates on Level 1. Most business decisions require Level 2 or 3. That gap is where wrong decisions are made at scale.
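The gap between Level 1 and Level 2 can be made concrete with arithmetic. The sketch below uses made-up numbers for a hypothetical binary treatment X, outcome Y, and confounder Z (severity), where doctors treat severe cases more often. Conditioning (Level 1) and the backdoor adjustment for the do-operator (Level 2) give opposite answers on the same distribution:

```python
# Hypothetical numbers: Z = severity (confounder), X = treatment, Y = recovery.
# Doctors give the treatment mostly to severe cases.
p_z = {0: 0.5, 1: 0.5}                      # P(Z)
p_x1_given_z = {0: 0.1, 1: 0.9}             # P(X=1 | Z)
p_y1_given_xz = {(0, 0): 0.8, (1, 0): 0.9,  # P(Y=1 | X, Z)
                 (0, 1): 0.4, (1, 1): 0.5}  # treatment adds +0.10 in every stratum

# Level 1 -- conditioning: P(Y=1 | X=x), via Bayes over the joint distribution
def p_y_given_x(x):
    joint = {z: (p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]) * p_z[z]
             for z in p_z}
    p_x = sum(joint.values())
    return sum(p_y1_given_xz[(x, z)] * joint[z] / p_x for z in p_z)

# Level 2 -- intervening: backdoor adjustment,
# P(Y=1 | do(X=x)) = sum_z P(Y=1 | X=x, Z=z) * P(Z=z)
def p_y_do_x(x):
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)

print(f"Association:  {p_y_given_x(1) - p_y_given_x(0):+.2f}")  # -0.22: looks harmful
print(f"Intervention: {p_y_do_x(1) - p_y_do_x(0):+.2f}")        # +0.10: actually helps
```

Same data, same model of the world: the observed association says the treatment hurts, because the treated group is dominated by severe cases, while the interventional quantity recovers the true +0.10 benefit present in every stratum.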


When Accuracy Lies

The gap between prediction and causation is not theoretical. It has a body count.

Consider the kidney stone study from 1986. Researchers compared two treatments for renal calculi. Treatment A outperformed Treatment B for small stones. Treatment A also outperformed Treatment B for large stones. But when the data was pooled across both groups, Treatment B appeared superior.

This is Simpson’s paradox. The lurking variable was stone severity. Doctors had prescribed Treatment A for harder cases. Pooling the data erased that context, flipping the apparent conclusion. A prediction model trained on the pooled data would confidently recommend Treatment B. It would be wrong.
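The flip is easy to reproduce. This short script uses the success counts as they are commonly reported from the 1986 study:

```python
# Success counts (successes, total) per treatment and stone size,
# as commonly reported from the 1986 kidney stone study
data = {
    ("A", "small"): (81, 87),    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),  ("B", "large"): (55, 80),
}

def rate(successes, total):
    return successes / total

# Within each stratum, Treatment A wins
for size in ("small", "large"):
    a, b = rate(*data[("A", size)]), rate(*data[("B", size)])
    print(f"{size:>5} stones: A {a:.0%} vs B {b:.0%}")  # A higher in both

# Pooled across stone sizes, the comparison reverses
a_all = rate(81 + 192, 87 + 263)   # 273/350 = 78%
b_all = rate(234 + 55, 270 + 80)   # 289/350 = 83%
print(f"   pooled: A {a_all:.0%} vs B {b_all:.0%}")     # B appears higher
```

Because Treatment A handled a much larger share of the hard (large-stone) cases, pooling weights its results toward the difficult stratum and reverses the conclusion.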

That’s a statistics textbook example. The hormone therapy case drew blood.

For decades, observational studies suggested that postmenopausal Hormone Replacement Therapy (HRT) reduced the risk of coronary heart disease. The evidence looked solid. Millions of women were prescribed HRT based on these findings. Then the Women’s Health Initiative, a large-scale randomized controlled trial published in 2002, revealed the opposite: HRT actually increased cardiovascular risk.

For decades, observational studies suggested hormone therapy protected hearts. A proper trial revealed it damaged them. Millions of prescriptions, one confound.

The confound was wealth. Healthier, wealthier women were more likely to both choose HRT and have lower heart disease rates. The observational models captured this correlation and mistook it for a treatment effect. A 2019 paper by Miguel Hernán in CHANCE used this exact case to argue that data science needs “a second chance to get causal inference right.”

How common is this mistake? A 2021 scoping review in the European Journal of Epidemiology examined observational studies and found that 26% of them conflated prediction with causal claims. That is one in four published papers, in medical journals, where people make life-and-death decisions based on the results.

The core structure behind both cases is the confounding fork: a hidden common cause (Z) that influences both the treatment (X) and the outcome (Y), creating a spurious association between them. Stone severity drove both treatment choice and outcomes. Wealth drove both HRT adoption and heart health. In each case, the correlation between X and Y was real in the data. But acting on it as if X caused Y produced the wrong intervention.

The confounding fork: Z causes both X and Y, creating a correlation between X and Y that disappears (or reverses) when you control for Z. Image by the author.
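The fork is also easy to simulate. In the sketch below (synthetic data, no real-world claim), X has no causal effect on Y at all, yet the two are strongly correlated until Z is controlled for:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Confounding fork: Z causes both X and Y; X has NO effect on Y.
z = rng.normal(size=n)
x = z + rng.normal(scale=0.5, size=n)
y = z + rng.normal(scale=0.5, size=n)

# The marginal correlation between X and Y is strong...
print(f"corr(X, Y) = {np.corrcoef(x, y)[0, 1]:.2f}")  # ~0.8

# ...but regressing Y on X while controlling for Z recovers
# the true causal coefficient of X: zero.
design = np.column_stack([np.ones(n), x, z])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(f"effect of X controlling for Z = {coef[1]:+.3f}")  # ~0.000
```

A prediction model trained on (X, Y) alone would happily use X as a feature, and happily mislead anyone who treated that feature as a lever.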

The lesson is uncomfortable: a model can have high accuracy, pass every validation check, and still give recommendations that make outcomes worse. Accuracy measures how well a model captures existing patterns. It says nothing about whether those patterns survive when you intervene.


The Toolkit Caught Up

For years, causal inference lived behind a wall of econometrics textbooks, custom R scripts, and a small circle of specialists. That wall has come down.

Microsoft Research built DoWhy, a Python library that reduces causal analysis to four explicit steps: model your assumptions, identify the causal estimand, estimate the effect, and refute your own result. That fourth step is what separates causal inference from “I ran a regression and it was significant.” DoWhy forces you to try to break your conclusion before you trust it.

Alongside DoWhy sits EconML, another Microsoft Research library that provides the estimation algorithms: Double Machine Learning (DML), causal forests, instrumental variable methods, and doubly robust estimators. Together, they form the PyWhy project, which is quickly becoming the standard causal analysis stack in Python.

DoWhy reduces causal analysis to four steps: model, identify, estimate, refute. That last step separates causal inference from “I ran a regression.”

The market signals align. Fortune Business Insights valued the global Causal Artificial Intelligence (AI) market at $81.4 billion in 2025, projecting $116 billion for 2026 (42.5% year-over-year growth). An additional 25% of organizations plan to adopt causal AI by 2026, which would bring total adoption among AI-driven organizations to nearly 70%.

Uber built CausalML for uplift modeling and treatment effect estimation. Netflix has published research on causal bandits for content recommendations. Amazon’s AWS team uses DoWhy for root cause analysis in microservice architectures, diagnosing why latency spikes happen rather than just predicting when they will. These aren’t academic experiments. They’re production systems running at scale.

The practical barrier used to be expertise. You needed to understand structural causal models, the backdoor criterion, and how to derive estimands by hand. DoWhy automates the identification step. You draw the DAG (encoding your domain knowledge), and the library determines which statistical estimand answers your causal question. That’s the part that used to take a PhD-level methods course to do manually.


Where Causal Methods Break Down

A fair objection: most ML applications work fine without causal reasoning. Recommendation systems, image classification, fraud detection, search ranking. Pattern in, pattern out. These problems genuinely don’t need causal structure, and adding it would be over-engineering.

Causal inference also carries a cost that prediction does not. It requires assumptions. You must specify a Directed Acyclic Graph (DAG), a diagram encoding which variables cause which. If your DAG is wrong (a missing confounder, a reversed arrow) your causal estimate can be worse than a naive correlation. The garbage-in-garbage-out problem doesn’t disappear; it moves from the data to the assumptions.

The argument here is not that causal inference should replace prediction. It is that causal inference must supplement prediction whenever you move from pattern recognition to decision-making. The failure mode is not “ML doesn’t work.” The failure mode is “ML works for prediction, then gets misapplied to a causal question.” Knowing which question you’re answering is the skill that separates a model builder from a decision scientist.


Does Your Problem Need Causal Inference?

The 5-Question Diagnostic

Before you pick a method, run your problem through these five questions. If you answer “yes” to two or more, you need causal inference. If you answer “yes” to question 1 alone, you need causal inference.

  1. Are you making a decision rather than just a prediction?
    Predicting who will churn = standard ML. Deciding which intervention prevents churn = causal inference.
  2. Would acting on your model change the underlying relationships?
    If your intervention alters the very patterns the model learned, your correlations will shift post-deployment. This is a causal problem.
  3. Could a confounding variable explain your result?
    If two variables (treatment and outcome) share a common cause, your observed association may vanish, reverse, or amplify once the confounder is controlled for. Think: the HRT case.
  4. Do you need to answer “what if?” or “why?”
    “What if we doubled the price?” is a Level 2 (intervention) question. “Why did this customer leave?” is a Level 3 (counterfactual) question. Both require causal reasoning.
  5. Is there selection bias in how treatments were assigned?
    If doctors prescribe Drug A to sicker patients, or if users self-select into a feature, comparing raw outcomes without adjustment is meaningless.
A simplified diagnostic flow. The full 5-question version is in the text above. Image by the author.

Which Causal Method Fits Your Problem?

Once you know you need causal inference, the next question is which method. This matrix maps common situations to the right tool.

Image by the author.

If you’re unsure where to start: begin with a DAG. Draw the causal relationships you believe exist between your treatment, outcome, and potential confounders. Even a rough DAG makes your assumptions explicit, which is the single most important step. You can refine the estimation method afterward.
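A rough DAG doesn't need special tooling. Here is a minimal sketch, using Python's standard-library graphlib and hypothetical variable names for a loyalty-program question, that encodes assumed edges and verifies they actually form a DAG (no cycles):

```python
from graphlib import TopologicalSorter, CycleError

# Assumed causal edges (parent -> children). These encode beliefs, not facts:
# income, prior purchases, and age drive signup; signup drives spending.
dag = {
    "income":          ["loyalty_program", "annual_spending"],
    "prior_purchases": ["loyalty_program", "annual_spending"],
    "age":             ["loyalty_program"],
    "loyalty_program": ["annual_spending"],
}

# graphlib expects node -> predecessors, so invert the edge map
preds = {}
for parent, children in dag.items():
    preds.setdefault(parent, set())
    for child in children:
        preds.setdefault(child, set()).add(parent)

try:
    order = list(TopologicalSorter(preds).static_order())
    print("DAG is acyclic; causal order:", order)
except CycleError as e:
    print("Not a DAG -- cycle found:", e.args[1])
```

Writing the edges down forces the conversation that matters: which arrows exist, which are missing, and which a domain expert would reverse.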

A DoWhy Workflow in Practice

Here’s a concrete example: measuring whether a customer loyalty program actually increases annual spending (as opposed to loyal customers who would spend more anyway self-selecting into the program).

# Install: pip install dowhy
import dowhy
from dowhy import CausalModel

# df: a pandas DataFrame with columns loyalty_program (binary treatment),
# annual_spending, income, prior_purchases, age

# Step 1: MODEL your causal assumptions as a DAG
# Income affects both loyalty signup AND spending (confounder)
model = CausalModel(
    data=df,
    treatment="loyalty_program",
    outcome="annual_spending",
    common_causes=["income", "prior_purchases", "age"],
)

# Step 2: IDENTIFY the causal estimand
# DoWhy determines what statistical quantity answers your question
identified = model.identify_effect()
# Target estimand: E[annual_spending | do(loyalty_program=1)]
#                - E[annual_spending | do(loyalty_program=0)]

# Step 3: ESTIMATE the causal effect
estimate = model.estimate_effect(
    identified,
    method_name="backdoor.propensity_score_matching"
)
print(f"Causal effect: ${estimate.value:.2f}/year")

# Step 4: REFUTE your own result
# Add a random variable that shouldn't affect the estimate
refutation = model.refute_estimate(
    identified, estimate,
    method_name="random_common_cause"
)
print(refutation)
# If the effect holds under random confounders, your result is robust

Four steps. Model your assumptions, identify the estimand, estimate the effect, then try to break your own result. The DoWhy documentation provides full tutorials on integrating EconML estimators for more advanced use cases (DML, causal forests, instrumental variables).

The refutation step deserves emphasis. In standard ML, you validate with held-out test sets. In causal inference, you validate by trying to destroy your own estimate: adding random confounders, using placebo treatments, running the analysis on data subsets. If the effect survives, you have something real. If it collapses, you’ve saved yourself from a costly wrong decision.
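The placebo idea can be sketched without any library: shuffle the treatment column and re-estimate. A real effect should vanish under the fake assignment. A minimal illustration on synthetic data (a true effect of +5, hypothetical numbers):

```python
import random
import statistics

random.seed(42)

# Synthetic data: the treatment truly adds +5 to the outcome
n = 2000
treated = [random.random() < 0.5 for _ in range(n)]
outcome = [random.gauss(50, 10) + (5 if t else 0) for t in treated]

def diff_in_means(treat, out):
    t = [o for tr, o in zip(treat, out) if tr]
    c = [o for tr, o in zip(treat, out) if not tr]
    return statistics.mean(t) - statistics.mean(c)

print(f"estimated effect: {diff_in_means(treated, outcome):+.2f}")  # ~ +5

# Placebo refutation: shuffle the treatment labels. If the original
# estimate were an artifact, it would survive; a real effect should not.
placebo = treated[:]
random.shuffle(placebo)
print(f"placebo effect:   {diff_in_means(placebo, outcome):+.2f}")  # ~ 0
```

DoWhy's refuters automate this pattern (and several others) and report whether your estimate moves more than chance would allow.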

If your model’s recommendations would change the relationships it learned from, you’ve left prediction territory. Welcome to causation.


What Changes Now

The convergence is already visible. Tech companies are hiring for causal reasoning: Microsoft built the entire PyWhy stack, Uber released CausalML, Netflix published research on causal inference in production. The skillset is no longer confined to economics PhD programs and epidemiology departments. It is entering production ML teams.

Universities are adapting. Hernán’s classification of data science tasks into Description, Prediction, and Causal Inference (published through the Harvard School of Public Health) is becoming a standard pedagogical framework. The question is no longer “should data scientists learn causal inference?” It is “how quickly can they?”

For the individual practitioner, the return on learning causal methods is asymmetric. The data scientist who can answer “what will happen?” is valuable. The one who can answer “what should we do?” (and demonstrate why the answer is robust) commands a different kind of trust in the room. That trust translates directly into influence over decisions, resource allocation, and strategy.

The learning curve is real but shorter than it looks. If you understand conditional probability and have built regression models, you already have 60% of the foundation. The remaining 40% is learning to think in graphs (DAGs), understanding the difference between conditioning and intervening, and knowing when to reach for which estimator. The PyWhy documentation, Brady Neal’s free online course on causal inference, and Pearl’s accessible The Book of Why cover that gap in weeks, not years.

Remember the health-tech company from the opening? After the readmission spike, they rebuilt their analysis using DoWhy. They drew a DAG, identified that socioeconomic factors were confounders (not causes) of readmission, and isolated the actual causal drivers: medication adherence and follow-up appointment access. They redesigned their intervention around those two levers.

Readmission rates dropped 18%.

The model’s accuracy didn’t change. What changed was the question it answered.

The next time a stakeholder asks “what should we do?”, you have two options: hand them a correlation and hope it survives contact with reality, or hand them a causal estimate with a refutation report showing exactly how hard you tried to break it. The tools exist. The math is settled. The code is four steps.

The only question left is whether you’ll keep predicting, or start causing.


References

  1. Pearl, J. & Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. Basic Books.
  2. Pearl, J. & Bareinboim, E. (2022). “On Pearl’s Hierarchy and the Foundations of Causal Inference.” Technical Report R-60, UCLA Cognitive Systems Laboratory.
  3. Hernán, M.A. (2019). “A Second Chance to Get Causal Inference Right: A Classification of Data Science Tasks.” CHANCE, 32(1), 42-49.
  4. Luijken, K. et al. (2021). “Prediction or causality? A scoping review of their conflation within current observational research.” European Journal of Epidemiology, 37, 35-46.
  5. Prosperi, M. et al. (2020). “Causal inference and counterfactual prediction in machine learning for actionable healthcare.” Nature Machine Intelligence, 2, 369-375.
  6. Sharma, A. & Kiciman, E. (2020). DoWhy: An End-to-End Library for Causal Inference. Microsoft Research / PyWhy.
  7. Battocchi, K. et al. (2019). EconML: A Python Package for ML-Based Heterogeneous Treatment Effect Estimation. Microsoft Research / ALICE.
  8. Fortune Business Insights. (2025). “Causal AI Market Size, Industry Share | Forecast, 2026-2034.”
  9. Charig, C.R. et al. (1986). “Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy.” BMJ, 292(6524), 879-882.
  10. PyWhy Contributors. (2024). “Tutorial on Causal Inference and its Connections to Machine Learning (Using DoWhy+EconML).” PyWhy Documentation.
