TE2Rules: Explaining “Why did my model say that?”

Editor


Taking model explainability beyond images and text

In the rapidly evolving landscape of artificial intelligence, recent advancements have propelled the field to astonishing heights, enabling models to mimic human-like capabilities in handling both images and text. From crafting images with an artist’s finesse to generating captivating captions, answering questions and composing entire essays, AI has become an indispensable tool in our digital arsenal.

However, despite these extraordinary feats, the full-scale adoption of this potent technology is not universal. The black-box nature of AI models raises significant concerns, particularly in industries where transparency is paramount. The lack of insight into “why did the model say that?” introduces risks, such as toxicity and unfair biases, particularly against marginalized groups. In high-stakes domains like healthcare and finance, where the consequences of erroneous decisions are costly, the need for explainability becomes crucial. This means that it’s not enough for the model to arrive at the correct decision, but it’s also equally important to explain the rationale behind those decisions.

While models that can ingest, understand and generate images or text have been the latest frenzy, many high-stakes domains make decisions from data compiled into tables: user profile information, posts the user has liked, purchase history, watch history, and so on.

Tabular data is no new phenomenon. It has been around for as long as the internet itself: a user's browser history of visited pages, click interactions, products viewed online, products bought online, and so on. This information is often used by advertisers to show you relevant ads.

Many critical use cases in high-stakes domains like finance, healthcare, and law also rely heavily on data organized in tabular format. Here are some examples:

  1. Consider a hospital trying to figure out the likelihood of a patient recovering well after a certain treatment. They might use tables of patient data, including factors like age, previous health issues, and treatment details. If the models used are too complex or “black-box,” doctors may have a hard time trusting or understanding the predictions.
  2. Similarly, in the financial world, banks analyze various factors in tables to decide if someone is eligible for a loan and what interest rate to offer. If the models they use are too complex, it becomes challenging to explain to customers why a decision was made, potentially leading to a lack of trust in the system.

In the real world, many critical decision-making tasks like diagnosing illnesses from medical tests, approving loans based on financial statements, optimizing investments according to risk profiles on robo-advisors, identifying fake profiles on social media, and targeting the right audience for tailored advertisements all involve making decisions from tabular data. While deep neural networks, such as convolutional neural networks and transformer models like GPT, excel at grasping unstructured inputs like images, text, and voice, tree ensemble models like XGBoost still remain the unmatched champions for handling tabular data. This might be surprising in the era of deep neural networks, but it is true! Deep models for tabular data, like TabTransformer and TabNet, only perform on par with XGBoost models, despite using far more parameters.

In this blog post, we take up explaining the binary classification decisions made by an XGBoost model. An intuitive approach to explain such models is by using human understandable rules. For instance, consider a model deciding whether a user account is that of a robot. If the model labels a user as “robot,” an interpretable explanation based on model features might be that the “number of connections with other robots ≥ 100 and number of API calls per day ≥ 10k”.
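A rule like this is simply a conjunction of threshold conditions on the model's input features. The sketch below expresses the hypothetical "robot" rule as a plain Python predicate (the feature names and thresholds are illustrative, taken from the example above, not from any real model):

```python
def is_robot(account: dict) -> bool:
    """Hypothetical rule explaining a 'robot' prediction:
    a conjunction of threshold conditions on input features."""
    return (
        account["num_robot_connections"] >= 100
        and account["api_calls_per_day"] >= 10_000
    )

# An account satisfying both conditions matches the rule.
print(is_robot({"num_robot_connections": 250, "api_calls_per_day": 12_000}))  # True
print(is_robot({"num_robot_connections": 5, "api_calls_per_day": 12_000}))    # False
```

Rules of this form are attractive precisely because a domain expert can read and audit each condition directly.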

TE2Rules is an algorithm designed exactly for this purpose. TE2Rules stands for Tree Ensembles to Rules, and its primary function is to explain any binary-classification tree ensemble model by generating rules derived from combinations of input features. The algorithm combines decision paths extracted from multiple trees within the XGBoost model, using a subset of unlabeled data. The data used for extracting rules need not be the same as the training data and does not require any ground-truth labels; the algorithm uses it to uncover implicit correlations present in the dataset. Notably, the rules extracted by TE2Rules exhibit high precision against the model predictions (95% by default). The algorithm systematically identifies all potential rules from the XGBoost model that explain the positive instances, and then condenses them into a concise set of rules that covers the majority of positive cases in the data. This condensed set of rules serves as a comprehensive global explainer for the model. Additionally, TE2Rules retains the longer list of all candidate rules, which can be used to explain specific instances with succinct rules.

TE2Rules has demonstrated its effectiveness in various medical domains by providing insights into the decision-making process of models.

In this section, we show how TE2Rules can be used to explain a model trained to predict whether an individual's income exceeds $50,000. The model is trained on the Adult Income Dataset from the UCI Repository. The Jupyter notebook used in this blog is available here: XGBoost-Model-Explanation-Demo. The dataset is covered by a CC BY 4.0 license, permitting both academic and commercial use.

Step 1: Train the XGBoost model
