When Shapley Values Break: A Guide to Robust Model Explainability



Explainability in AI is essential for gaining trust in model predictions and for improving model robustness. Good explainability often acts as a debugging tool, revealing flaws in the model training process. While Shapley values have become the industry standard for this task, we must ask: do they always work? And critically, where do they fail?

To understand where Shapley values fail, the best approach is to control the ground truth. We will start with a simple linear model and then systematically break the explanation. By observing how Shapley values react to these controlled changes, we can identify exactly where they yield misleading results and how to fix them.

The Toy Model

We will start with a model with 100 uniform random variables.

import numpy as np
from sklearn.linear_model import LinearRegression
import shap

def get_shapley_values_linear_independent_variables(
    weights: np.ndarray, data: np.ndarray
) -> np.ndarray:
    return weights * data

# To compare the theoretical results with the shap package
def get_shap(weights: np.ndarray, data: np.ndarray):
    model = LinearRegression()
    model.coef_ = weights  # Inject your weights
    model.intercept_ = 0
    background = np.zeros((1, weights.shape[0]))
    explainer = shap.LinearExplainer(model, background)  # Assumes independence between all features
    results = explainer.shap_values(data)
    return results

DIM_SPACE = 100

np.random.seed(42)
# Generate random weights and data
weights = np.random.rand(DIM_SPACE)
data = np.random.rand(1, DIM_SPACE)

# Set specific values to test our intuition
# Feature 0: High weight (10), Feature 1: Zero weight
weights[0] = 10
weights[1] = 0
# Set maximal value for the first two features
data[0, 0:2] = 1

shap_res = get_shapley_values_linear_independent_variables(weights, data)
shap_res_package = get_shap(weights, data)

idx_max = shap_res.argmax()
idx_min = shap_res.argmin()
print(
    f"Expected: idx_max 0, idx_min 1\nActual: idx_max {idx_max}, idx_min: {idx_min}"
)
print(abs(shap_res_package - shap_res).max())  # No difference

In this straightforward example, where all variables are independent, the calculation simplifies dramatically. Recall that the Shapley formula is based on the marginal contribution of each feature: the difference in the model's output when a variable is added to a coalition of known features versus when it is absent.

\[ V(S \cup \{i\}) - V(S) \]

Since the variables are independent, the specific combination of pre-selected features (S) does not influence the contribution of feature i. The effects of pre-selected and non-selected features cancel out during the subtraction, having no impact on the attribution of feature i. Thus, the calculation reduces to measuring the marginal effect of feature i directly on the model output:

\[ W_i \cdot X_i \]

The result is both intuitive and works as expected. Because there is no interference from other features, the contribution depends solely on the feature's weight and its current value. Consequently, the feature with the largest product of weight and value is the most contributing feature. In our case, feature index 0 has a weight of 10 and a value of 1.
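Plugging in our numbers makes the gap explicit:

\[ W_0 \cdot X_0 = 10 \cdot 1 = 10, \]

while every other feature has both its weight and its value drawn from [0, 1), so its attribution stays below 1 (and feature 1, with zero weight, gets exactly 0).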

Let’s Break Things

Now, we will introduce dependencies to see where Shapley values start to fail.

In this scenario, we artificially induce perfect correlation by duplicating the most influential feature (index 0) 100 times. This results in a new model with 200 features, where 100 of them are identical copies of our original top contributor and independent of the other 99 features. To complete the setup, we assign a zero weight to all of the added duplicates, which ensures the model's predictions remain unchanged: we are only altering the structure of the input data, not the output. While this setup seems extreme, it mirrors a common real-world scenario: taking a known important signal and creating multiple derived features (such as rolling averages, lags, or mathematical transformations) to better capture its information.

However, because the original Feature 0 and its new copies are perfectly dependent, the Shapley calculation changes.

Based on the Symmetry Axiom: if two features contribute equally to the model (in this case, by carrying the same information), they must receive equal credit.
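Formally, for a value function V, if two features i and j satisfy

\[ V(S \cup \{i\}) = V(S \cup \{j\}) \quad \text{for every coalition } S \text{ with } i, j \notin S, \]

then their Shapley values must be equal. A perfect copy of a feature satisfies this condition by construction.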

Intuitively, knowing the value of any one clone reveals the full information of the group. As a result, the massive contribution we previously saw for the single feature is now split equally across it and its 100 clones. The “signal” gets diluted, making the primary driver of the model appear much less important than it actually is.
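Concretely, with DUPLICATE_FACTOR = 100 the block's total contribution of 10 is split over 101 identical columns, so each one receives

\[ \frac{W_0 \cdot X_0}{101} = \frac{10 \cdot 1}{101} \approx 0.099, \]

small enough for ordinary features, whose attributions can approach 1, to outrank the true driver.
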
Here is the corresponding code:

import numpy as np
from sklearn.linear_model import LinearRegression
import shap

def get_shapley_values_linear_correlated(
    weights: np.ndarray, data: np.ndarray
) -> np.ndarray:
    res = weights * data
    duplicated_indices = np.array(
        [0] + list(range(data.shape[1] - DUPLICATE_FACTOR, data.shape[1]))
    )
    # Sum the contributions of the correlated block, then split the total equally among its members
    full_contrib = np.sum(res[:, duplicated_indices], axis=1)
    duplicate_feature_factor = np.ones(data.shape[1])
    duplicate_feature_factor[duplicated_indices] = 1 / (DUPLICATE_FACTOR + 1)
    full_contrib = np.tile(full_contrib, (DUPLICATE_FACTOR+1, 1)).T
    res[:, duplicated_indices] = full_contrib
    res *= duplicate_feature_factor
    return res

def get_shap(weights: np.ndarray, data: np.ndarray):
    model = LinearRegression()
    model.coef_ = weights  # Inject your weights
    model.intercept_ = 0
    explainer = shap.LinearExplainer(model, data, feature_perturbation="correlation_dependent")    
    results = explainer.shap_values(data)
    return results

DIM_SPACE = 100
DUPLICATE_FACTOR = 100

np.random.seed(42)
weights = np.random.rand(DIM_SPACE)
weights[0] = 10
weights[1] = 0
data = np.random.rand(10000, DIM_SPACE)
data[0, 0:2] = 1

# Duplicate copy of feature 0, 100 times:
dup_data = np.tile(data[:, 0], (DUPLICATE_FACTOR, 1)).T
data = np.concatenate((data, dup_data), axis=1)
# We will put zero weight for all those added features:
weights = np.concatenate((weights, np.tile(0, (DUPLICATE_FACTOR))))


shap_res = get_shapley_values_linear_correlated(weights, data)

shap_res = shap_res[0, :] # Take First record to test results
idx_max = shap_res.argmax()
idx_min = shap_res.argmin()

print(f"Expected: idx_max 0, idx_min 1\nActual: idx_max {idx_max},  idx_min: {idx_min}")

This is clearly not what we intended and fails to provide a good explanation of the model's behavior. Ideally, we want the explanation to reflect the ground truth: Feature 0 is the primary driver (with a weight of 10), while the duplicated features (indices 100–199) are merely redundant copies with zero weight. Instead of diluting the signal across all copies, we would prefer an attribution that highlights the true source of the signal.
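Note that the dilution does not violate local accuracy: with the zero intercept and zero baseline used in our setup, the attributions still sum to each record's prediction. A quick sanity check on the full attribution matrix computed above:

full_attrib = get_shapley_values_linear_correlated(weights, data)
predictions = data @ weights
# Redistribution only moves credit around inside the correlated block,
# so the per-record totals are preserved.
print(np.allclose(full_attrib.sum(axis=1), predictions))  # True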

Note: If you run this using the Python shap package, you might notice the results are similar but not identical to our manual calculation. This is because computing exact Shapley values is computationally infeasible beyond a handful of features, so libraries like shap rely on approximation methods that introduce a small amount of variance.


Can We Fix This?

Since correlation and dependencies between features are extremely common, we cannot ignore this issue.

In a sense, Shapley values do account for these dependencies: a feature with a coefficient of 0 in a linear model, and therefore no direct effect on the output, still receives a non-zero contribution because it carries information shared with other features. However, this behavior, driven by the Symmetry Axiom, is not always what we want for practical explainability. While “fairly” splitting the credit among correlated features is mathematically sound, it often hides the true drivers of the model.

Several techniques can handle this, and we will explore them.

Grouping Features

This approach is particularly important for models with high-dimensional feature spaces, where feature correlation is inevitable. In these settings, attempting to attribute specific contributions to every single variable is often noisy and computationally unstable. Instead, we can aggregate similar features that represent the same concept into a single group. A helpful analogy comes from image classification: if we want to explain why a model predicts “cat” instead of “dog”, examining individual pixels is not meaningful. However, if we group pixels into patches (e.g., ears, tail), the explanation becomes immediately interpretable. Applying the same logic to tabular data, we can calculate the contribution of the group rather than splitting it arbitrarily among its components.

This can be achieved in two ways: by simply summing the Shapley values within each group or by directly calculating the group’s contribution. In the direct method, we treat the group as a single entity. Instead of toggling individual features, we treat the presence and absence of the group as simultaneous presence or absence of all features within it. This reduces the dimensionality of the problem, making the estimation faster, more accurate, and more stable.
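As a minimal sketch of the first variant (summing within groups), the snippet below aggregates a per-feature attribution matrix into per-group attributions; the group mapping and the function name here are illustrative, not taken from any particular library:

import numpy as np

def group_shap_by_sum(shap_values: np.ndarray, feature_groups: dict) -> dict:
    # shap_values: (n_samples, n_features) attribution matrix
    # feature_groups: {group_name: list of feature indices}
    return {
        name: shap_values[:, idx].sum(axis=1)
        for name, idx in feature_groups.items()
    }

# In the toy example, feature 0 and its 100 clones form a single concept,
# so the diluted credit is reassembled into one attribution:
feature_groups = {"signal_0": [0] + list(range(DIM_SPACE, DIM_SPACE + DUPLICATE_FACTOR))}
grouped = group_shap_by_sum(get_shapley_values_linear_correlated(weights, data), feature_groups)
print(grouped["signal_0"][0])  # ~10 for the first record: the full credit is restored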


The Winner Takes It All

While grouping is effective, it has limitations. It requires defining the groups beforehand and often ignores correlations between those groups.

This leads to “explanation redundancy”. Returning to our example, if the 101 correlated features are not pre-grouped, the output will list the same contribution 101 times, once per clone. This is overwhelming, repetitive, and functionally useless. Effective explainability should reduce redundancy and show the user something new each time.

To achieve this, we can create a greedy iterative process. Instead of calculating all values at once, we can select features step-by-step:

  1. Select the “Winner”: Identify the single feature (or group) with the highest individual contribution
  2. Condition the Next Step: Re-evaluate the remaining features, assuming the features selected in previous steps are already known, by including them in the pre-selected coalition S of every Shapley term.
  3. Repeat: Ask the model: “Given that the user already knows about Feature A, B, C, which remaining feature contributes the most information?”

By recalculating Shapley values (or marginal contributions) conditioned on the pre-selected features, we ensure that redundant features effectively drop to zero. If Feature A and Feature B are identical and Feature A is selected first, Feature B no longer provides new information. It is automatically filtered out, leaving a clean, concise list of distinct drivers.
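Below is a minimal sketch of this greedy, conditional selection for a linear model. It is a simplification rather than the medpython implementation: the conditional expectation of each remaining feature is estimated with a plain least-squares fit on the background data, and the function name is ours.

import numpy as np

def greedy_conditional_attribution(weights, background, x, n_steps=3):
    # Greedy "winner takes it all" sketch for a linear model.
    # At each step, pick the feature whose value, beyond what the already
    # selected features predict for it, moves the output the most.
    remaining = list(range(x.shape[0]))
    selected = []
    for step in range(n_steps):
        best_idx, best_contrib = None, 0.0
        for i in remaining:
            if selected:
                # Estimate E[x_i | x_selected] with a least-squares fit
                # on the background data (a linear approximation).
                A = np.column_stack([background[:, selected],
                                     np.ones(background.shape[0])])
                coef, *_ = np.linalg.lstsq(A, background[:, i], rcond=None)
                expected_i = np.append(x[selected], 1.0) @ coef
            else:
                expected_i = background[:, i].mean()
            contrib = weights[i] * (x[i] - expected_i)  # credit for new information only
            if abs(contrib) > abs(best_contrib):
                best_idx, best_contrib = i, contrib
        if best_idx is None:
            break
        print(f"step {step + 1}: feature {best_idx}, conditional contribution {best_contrib:.3f}")
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected

# On the duplicated-feature setup above, feature 0 wins the first step. Once it is
# selected, every clone is perfectly predicted from it, so its conditional
# contribution is exactly zero (regardless of its weight) and it never surfaces again.
greedy_conditional_attribution(weights, background=data, x=data[0], n_steps=3)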


Note: You can find an implementation of both the direct group calculation and the greedy iterative selection in our Python package medpython.
Full disclosure: I am a co-author of this open-source package.

Real World Validation

While this toy model demonstrates where the Shapley value method breaks down mathematically, how do these fixes hold up in real-life scenarios?

We applied grouped Shapley values together with the winner-takes-it-all selection, along with additional methods that are out of scope for this post (maybe next time), in complex clinical settings in healthcare. Our models use hundreds of strongly correlated features, grouped into dozens of concepts.

The approach was validated across several models in a blinded setting, where our clinicians did not know which attribution method they were inspecting, and it outperformed vanilla Shapley values in their rankings. In a multi-step experiment, each technique improved on the results of the previous one. Additionally, our team used these explainability enhancements as part of our submission to the CMS Health AI Challenge, where we were selected as award winners.


Conclusion

Shapley values are the gold standard for model explainability, providing a mathematically rigorous way to attribute credit.
However, as we have seen, mathematical “correctness” does not always translate into effective explainability.

When features are highly correlated, the signal might be diluted, hiding the true drivers of your model behind a wall of redundancy.

We explored two ways to fix this:

  1. Grouping: Aggregate correlated features that represent a single concept and attribute credit to the group.
  2. Iterative Selection: Condition on already-presented concepts to surface only new information, effectively stripping away redundancy.

By acknowledging these limitations, we can ensure our explanations are meaningful and helpful.

If you found this useful, let’s connect on LinkedIn
