ECCCos from the Black Box
by Patrick Altmeyer, February 2024



Faithful model explanations through energy-based conformal counterfactuals

Counterfactual explanations offer an intuitive and straightforward way to explain opaque machine learning (ML) models. They work under the premise of perturbing inputs to achieve a desired change in the predicted output.
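
To make that premise concrete, here is a minimal, purely illustrative sketch of gradient-based counterfactual search for a toy logistic-regression model. None of this is taken from the paper: the model, its weights and the hyperparameters are made up for the example.

```python
# Illustrative only: gradient-based counterfactual search for a toy
# logistic-regression classifier. Model and hyperparameters are made up.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted model: p(y = 1 | x) = sigmoid(w @ x + b)
w, b = np.array([1.5, -2.0]), 0.1

def counterfactual(x, target=1.0, lr=0.2, lam=0.1, steps=500):
    """Perturb x towards the target prediction while staying close to x."""
    x_cf = x.copy()
    for _ in range(steps):
        p = sigmoid(w @ x_cf + b)
        # Gradient of (p - target)^2 + lam * ||x_cf - x||^2 with respect to x_cf
        grad = 2 * (p - target) * p * (1 - p) * w + 2 * lam * (x_cf - x)
        x_cf -= lr * grad
    return x_cf

x = np.array([-1.0, 1.0])   # factual input, predicted as class 0
x_cf = counterfactual(x)    # nearby input pushed towards (and ideally across) the boundary
print(x_cf, sigmoid(w @ x_cf + b))
```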

If you have not heard about counterfactual explanations before, feel free to also check out my introductory posts: 1) Individual Recourse for Black Box Models and 2) A new tool for explainable AI.

There are typically many ways to achieve such a change: many different counterfactuals may yield the same desired outcome. A key challenge for researchers has therefore been, firstly, to define desirable characteristics of counterfactual explanations and, secondly, to come up with efficient ways of achieving them.

One of the most important and most studied characteristics of counterfactual explanations is ‘plausibility’: explanations should look realistic to humans. Plausibility is positively associated with actionability, robustness (Artelt et al. 2021) and causal validity (Mahajan, Tan, and Sharma 2020). To achieve plausibility, many existing approaches rely on surrogate models. This is straightforward, but it also convolutes things: it essentially reallocates the task of learning plausible explanations for the data from the model itself to the surrogate.

In our AAAI 2024 paper, Faithful Model Explanations through Energy-Based Conformal Counterfactuals (ECCCo), we propose that we should not only look for explanations that please us but rather focus on generating counterfactuals that faithfully explain model behavior. It turns out that we can achieve both faithfulness and plausibility by relying solely on the model itself, leveraging recent advances in energy-based modelling and conformal prediction. We support this claim through extensive empirical studies and believe that ECCCo opens avenues for researchers and practitioners seeking tools to better distinguish trustworthy from unreliable models.
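
For readers who want a rough sense of what this looks like in practice, the sketch below is my loose paraphrase of the flavour of the idea, not the paper's exact objective: a standard counterfactual loss is augmented with two illustrative penalties, one based on the model's own energy for the target class and one based on the size of a conformal prediction set. The `logits_fn`, the calibrated threshold `q_hat` and all weightings are assumptions made up for the example.

```python
# Loose, illustrative paraphrase of the flavour of an energy-based, conformal
# counterfactual objective; not the exact formulation from the paper.
# `logits_fn` is any model returning class logits; `q_hat` is a threshold
# assumed to have been calibrated via conformal prediction.
import numpy as np

def eccco_style_objective(x_cf, x, logits_fn, target, q_hat, lam=(0.1, 0.5, 0.5)):
    logits = logits_fn(x_cf)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    pred_loss = -np.log(probs[target])      # push the prediction towards the target class
    closeness = np.sum((x_cf - x) ** 2)     # stay close to the factual input
    energy = -logits[target]                # low model energy ~ plausible to the model itself
    set_size = np.sum(probs >= 1 - q_hat)   # small conformal set ~ low predictive uncertainty
    l1, l2, l3 = lam
    return pred_loss + l1 * closeness + l2 * energy + l3 * set_size

# Toy usage with a hypothetical two-class linear model:
W = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(eccco_style_objective(np.array([0.2, 0.8]), np.array([0.0, 1.0]),
                            lambda z: W @ z, target=0, q_hat=0.9))
```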

This is a companion post to our recent AAAI 2024 paper, co-authored with Mojtaba Farmanbar, Arie van Deursen and Cynthia C. S. Liem. The paper offers a more formal and detailed treatment of the topic and is available here. This post is intentionally light on technical details and maths; it is meant to provide a high-level overview of the paper.
