When starting a credit scoring project, it is often tempting to jump straight to modeling. Yet the first, and most important, step is to understand the data.
In our previous post, we presented how the databases used to build credit scoring models are constructed. We also highlighted the importance of asking the right questions:
- Who are the customers?
- What types of loans are they granted?
- What characteristics appear to explain default risk?
In this article, we illustrate this foundational step using an open-source dataset available on Kaggle: the Credit Scoring Dataset. This dataset contains 32,581 observations and 12 variables describing loans issued by a bank to individual borrowers.
These loans cover a range of financing needs — medical, personal, educational, and professional — as well as debt consolidation operations. Loan amounts range from $500 to $35,000.
The variables capture two dimensions:
- contract characteristics (loan amount, interest rate, purpose of financing, credit grade, and time elapsed since loan origination),
- borrower characteristics (age, income, years of professional experience, and housing status).
The model’s target variable is default, which takes the value 1 if the customer is in default and 0 otherwise.
Today, many tools and an increasing number of AI agents are capable of automatically generating statistical descriptions of datasets. Nevertheless, performing this analysis manually remains an excellent exercise for beginners. It builds a deeper understanding of the data structure, helps highlight potential anomalies, and supports the identification of variables that may be predictive of risk.
In this article, we take a simple instructional approach to statistically describing each variable in the dataset.
- For categorical variables, we analyze the number of observations and the default rate for each category.
- For continuous variables, we discretize them into four intervals defined by the quartiles:
- ]min; Q1], ]Q1; Q2], ]Q2; Q3] and ]Q3; max]
We then apply the same descriptive analysis to these intervals as for categorical variables. This segmentation is arbitrary and could be replaced by other discretization methods. The goal is simply to get an initial read on how risk behaves across the different loan and borrower characteristics.
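The quartile-based segmentation described above can be sketched in a few lines of pandas. The snippet below uses hypothetical toy data (the column names `person_income` and `loan_status` mirror the dataset, but the values are randomly generated for illustration); `pd.qcut` is one convenient way to obtain the four ]min; Q1], ]Q1; Q2], ]Q2; Q3], ]Q3; max] intervals.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Kaggle dataset:
# a continuous feature and a 0/1 default flag.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "person_income": rng.integers(4_000, 120_000, size=1_000),
    "loan_status": rng.integers(0, 2, size=1_000),
})

# Discretize into four quartile-based intervals.
df["income_bin"] = pd.qcut(df["person_income"], q=4)

# Apply the same summary as for a categorical variable:
# number of observations and default rate per interval.
summary = (
    df.groupby("income_bin", observed=True)["loan_status"]
      .agg(count="size", default_rate="mean")
)
print(summary)
```

Any other discretization (equal-width bins, supervised binning) would slot into the same pattern; only the binning call changes.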
Descriptive Statistics of the Modeling Dataset
Distribution of the Target Variable (loan_status)
This variable indicates whether the loan granted to a counterparty has resulted in a repayment default. It takes two values: 0 if the customer is not in default, and 1 if the customer is in default.
Over 78% of customers have not defaulted. The dataset is imbalanced, and it is important to account for this imbalance during modeling.
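Checking this imbalance takes one line with `value_counts`. The series below is a hypothetical 100-observation sample built to match the proportions reported above, not the actual dataset.

```python
import pandas as pd

# Hypothetical mini-sample with the same 0/1 coding as loan_status,
# constructed to match the ~78% / ~22% split described in the text.
loan_status = pd.Series([0] * 78 + [1] * 22)

# Share of each class; an imbalanced target is visible immediately.
class_shares = loan_status.value_counts(normalize=True)
print(class_shares)
```

On the real data, the same call on the `loan_status` column gives the class shares directly.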
The next relevant variable to analyze would be a temporal one. It would allow us to study how the default rate evolves over time, verify its stationarity, and assess its stability and its predictability.
Unfortunately, the dataset contains no temporal information. We do not know when each observation was recorded, which makes it impossible to determine whether the loans were issued during a period of economic stability or during a downturn.
This information is nonetheless essential in credit risk modeling. Borrower behavior can vary significantly depending on the macroeconomic environment. For instance, during financial crises — such as the 2008 subprime crisis or the COVID-19 pandemic — default rates typically rise sharply compared to more favorable economic periods.
The absence of a temporal dimension in this dataset therefore limits the scope of our analysis. In particular, it prevents us from studying how risk dynamics evolve over time and from evaluating the potential robustness of a model against economic cycles.
We do, however, have access to the variable cb_person_cred_hist_length, which represents the length of a customer’s credit history, expressed in years.
Distribution by Credit History Length (cb_person_cred_hist_length)
This variable has 29 distinct values, ranging from 2 to 30 years. We will treat it as a continuous variable and discretize it using quantiles.

Several observations can be drawn from the table above. First, more than 56% of borrowers have a credit history of four years or less, indicating that a large proportion of clients in the dataset have relatively short credit histories.
Second, the default rate appears fairly stable across intervals, hovering around 21%. That said, borrowers with shorter credit histories tend to exhibit slightly riskier behavior than those with longer ones, as reflected in their higher default rates.
Distribution by Previous Default (cb_person_default_on_file)
This variable indicates whether the borrower has previously defaulted on a loan. It therefore provides valuable information about the past credit behavior of the client.
It has two possible values:
- Y: the borrower has defaulted in the past
- N: the borrower has never defaulted

In this dataset, more than 80% of borrowers have no history of default, suggesting that the majority of clients have maintained a satisfactory repayment record.
However, a clear difference in risk emerges between the two groups. Borrowers with a previous default history are significantly riskier, with a default rate of about 38%, compared with around 18% for borrowers who have never defaulted.
This result is consistent with what is commonly observed in credit risk modeling: past repayment behavior is often one of the strongest predictors of future default.
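A row-normalized crosstab is a compact way to compare default rates across the two groups. The data below is a hypothetical sample chosen to resemble the figures above (about 18% vs. 38%); on the real dataset the same call would be applied to the actual columns.

```python
import pandas as pd

# Hypothetical sample: prior-default flag (Y/N) and current default (0/1).
df = pd.DataFrame({
    "cb_person_default_on_file": ["N"] * 80 + ["Y"] * 20,
    "loan_status":               [1] * 14 + [0] * 66 + [1] * 8 + [0] * 12,
})

# Row-normalized crosstab: within each prior-default group,
# the share of non-defaulters (column 0) and defaulters (column 1).
rates = pd.crosstab(
    df["cb_person_default_on_file"],
    df["loan_status"],
    normalize="index",
)
print(rates)
```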
Distribution by Age
The presence of the age variable in this dataset indicates that the loans are granted to individual borrowers (retail clients) rather than corporate entities. To better analyze this variable, we group borrowers into age intervals based on quartiles.
The dataset includes borrowers across a wide range of ages. However, the distribution is strongly skewed toward younger individuals: more than 70% of borrowers are under 30 years old.

The analysis of default rates across the age groups reveals that the highest risk is concentrated in the first quartile, followed by the second quartile. In other words, younger borrowers appear to be the riskiest segment in this dataset.
Distribution by Annual Income
Borrowers’ annual income in this dataset ranges from $4,000 to $6,000,000. To analyze its relationship with default risk, we divide income into four intervals based on quartiles.

The results show that the highest default rates are concentrated among borrowers with the lowest incomes, particularly in the first quartile ($4,000–$38,500) and the second quartile ($38,500–$55,000).
As income increases, the default rate gradually decreases. Borrowers in the third quartile ($55,000–$79,200) and the fourth quartile ($79,200–$6,000,000) exhibit noticeably lower default rates.
Overall, this pattern suggests an inverse relationship between annual income and default risk, which is consistent with standard credit risk expectations: borrowers with higher incomes typically have greater repayment capacity and financial stability, making them less likely to default.
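One quick, scale-free way to quantify such an inverse relationship is a rank correlation between income and the default flag. The example below simulates hypothetical data in which higher income lowers the default probability, then checks that the Spearman coefficient comes out negative.

```python
import numpy as np
import pandas as pd

# Hypothetical data: default probability declines from 35% to 10%
# as income rises across its range.
rng = np.random.default_rng(42)
income = rng.uniform(4_000, 200_000, size=2_000)
p_default = 0.35 - 0.25 * (income - income.min()) / (income.max() - income.min())
default = rng.binomial(1, p_default)

df = pd.DataFrame({"person_income": income, "loan_status": default})

# A negative rank correlation reflects the inverse income/risk pattern.
rho = df["person_income"].corr(df["loan_status"], method="spearman")
print(round(rho, 3))
```

A rank correlation is only a first read; it says nothing about non-monotonic effects, which is why the quartile tables remain useful.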
Distribution by Home Ownership
This variable describes the borrower’s housing status. The categories include RENT (tenant), MORTGAGE (homeowner with a mortgage), OWN (homeowner without a mortgage), and OTHER (other housing arrangements).

In this dataset, approximately 50% of borrowers are renters, 40% are homeowners with a mortgage, 8% own their home outright, and about 2% fall into the “OTHER” category.
The analysis reveals that the highest default rates are observed among renters (RENT) and borrowers classified as “OTHER.” In contrast, homeowners without a mortgage (OWN) exhibit the lowest default rates, followed by borrowers with a mortgage (MORTGAGE).
Distribution by Employment Length (person_emp_length)
This variable measures the borrower’s employment length in years. To analyze its relationship with default risk, borrowers are grouped into four intervals based on quartiles: the first quartile (0–2 years), the second quartile (2–4 years), the third quartile (4–7 years), and the fourth quartile (7 years or more).

The analysis shows that the highest default rates are concentrated among borrowers with the shortest employment histories, particularly those in the first quartile (0–2 years) and the second quartile (2–4 years).
As employment length increases, the default rate tends to decline. Borrowers in the third quartile (4–7 years) and the fourth quartile (7 years or more) exhibit lower default rates.
Overall, this pattern suggests an inverse relationship between employment length and default risk, indicating that borrowers with longer employment histories may benefit from greater income stability and financial security, which reduces their likelihood of default.
Distribution by Loan Intent (loan_intent)
This categorical variable describes the purpose of the loan requested by the borrower. The categories include EDUCATION, MEDICAL, VENTURE (entrepreneurship), PERSONAL, DEBTCONSOLIDATION, and HOMEIMPROVEMENT.

The number of borrowers is fairly balanced across the different loan purposes, with a slightly higher share of loans used for education (EDUCATION) and medical expenses (MEDICAL).
However, the analysis reveals notable differences in risk across categories. Borrowers seeking loans for debt consolidation (DEBTCONSOLIDATION) and medical purposes (MEDICAL) exhibit higher default rates. In contrast, loans intended for education (EDUCATION) and entrepreneurial activities (VENTURE) are associated with lower default rates.
Overall, these results suggest that the purpose of the loan may be an important risk indicator, as different financing needs can reflect varying levels of financial stability and repayment capacity.
Distribution by Loan Grade (loan_grade)
This categorical variable represents the loan grade assigned to each borrower, typically based on an assessment of their credit risk profile. The grades range from A to G, where A corresponds to the lowest-risk loans and G to the highest-risk loans.

In this dataset, more than 80% of borrowers are assigned grades A, B, or C, indicating that the majority of loans are considered relatively low risk. In contrast, grades D, E, F, and G correspond to borrowers with higher credit risk, and these categories account for a much smaller share of the observations.
The distribution of default rates across the grades shows a clear pattern: the default rate increases as the loan grade deteriorates. In other words, borrowers with lower credit grades tend to exhibit higher probabilities of default.
This result is consistent with the purpose of the grading system itself, as loan grades are designed to summarize the borrower’s creditworthiness and associated risk level.
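Because the grades carry a natural order, the expected pattern can be checked programmatically: once the per-grade default rates are indexed from A to G, the series should be monotonically increasing. The rates below are hypothetical placeholders, not the dataset's actual values.

```python
import pandas as pd

# Hypothetical per-grade default rates, ordered A (safest) to G (riskiest).
rates = pd.Series(
    [0.06, 0.16, 0.21, 0.40, 0.52, 0.64, 0.80],
    index=list("ABCDEFG"),
    name="default_rate",
)

# A well-calibrated grading scheme should yield a default rate
# that rises as the grade deteriorates.
print(rates.is_monotonic_increasing)
```

On the real data, the same check applied to the observed rates confirms (or flags violations of) the ordering the grading system is meant to encode.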
Distribution by Loan Amount
This variable represents the loan amount requested by the borrower. In this dataset, loan amounts range from $500 to $35,000, which corresponds to relatively small consumer loans.

The analysis of default rates across the quartiles shows that the highest risk is concentrated among borrowers in the upper range of loan amounts, particularly in the fourth quartile ($20,000–$35,000), where default rates are higher.
Distribution by Loan Interest Rate (loan_int_rate)
This variable represents the interest rate applied to the loan granted to the borrower. In this dataset, interest rates range from 5% to 24%.

To analyze the relationship between interest rates and default risk, we group the observations into quartiles. The results show that the highest default rates are concentrated in the upper range of interest rates, particularly in the fourth quartile (approximately 13%–24%).
Distribution by Loan Percent Income (loan_percent_income)
This variable measures the percentage of a borrower’s annual income allocated to loan repayment. It indicates the financial burden associated with the loan relative to the borrower’s income.

The analysis shows that the highest default rates are concentrated in the upper quartile, where borrowers allocate between 20% and 100% of their income to loan repayment.
Conclusion
In this analysis, we have described each of the 12 variables in the dataset. This exploratory step allowed us to build a clear understanding of the data and quickly summarize its key characteristics in the introduction.
In the past, this type of analysis was often time-consuming and typically required the collaboration of several data scientists to perform the statistical exploration and produce the final reporting. While the interpretations of different variables may sometimes appear repetitive, such detailed documentation is often required in regulated environments, particularly in fields like credit risk modeling.
Today, however, the rise of artificial intelligence is transforming this workflow. Tasks that previously required several days of work can now be completed in less than 30 minutes, under the supervision of a statistician or data scientist. In this setting, the expert’s role shifts from manually performing the analysis to guiding the process, validating the results, and ensuring their reliability.
In practice, it is possible to design two specialized AI agents at this stage of the workflow. The first agent assists with data preparation and dataset construction, while the second performs the exploratory analysis and generates the descriptive reporting presented in this article.
Several years ago, it was already recommended to automate these tasks whenever possible. In this post, the tables used throughout the analysis were generated automatically using the Python functions presented at the end of this article.
In the next article, we will take the analysis a step further by exploring variable treatment, detecting and handling outliers, analyzing relationships between variables, and performing an initial feature selection.
Image Credits
All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and Excel, unless otherwise stated.
Data & Licensing
The dataset used in this article is distributed under the CC0: Public Domain dedication.
This allows anyone to share and adapt the dataset for any purpose, including commercial use.
For more details, see the official license text: CC0: Public Domain.
Disclaimer
Any remaining errors or inaccuracies are the author’s responsibility. Feedback and corrections are welcome.
Code
import pandas as pd
from typing import Optional


def build_default_summary(
    df: pd.DataFrame,
    category_col: str,
    default_col: str,
    category_label: Optional[str] = None,
    include_na: bool = False,
    sort_by: str = "count",
    ascending: bool = False,
) -> pd.DataFrame:
    """
    Build a summary table for a categorical variable.

    Parameters
    ----------
    df : pd.DataFrame
        Source DataFrame.
    category_col : str
        Name of the categorical variable.
    default_col : str
        Binary column flagging default (0/1 or boolean).
    category_label : str, optional
        Label displayed for the first column. Defaults to category_col.
    include_na : bool, default=False
        If True, keep missing values as their own category.
    sort_by : str, default="count"
        Logical sort column, one of
        {"count", "defaults", "prop", "default_rate", "category"}.
    ascending : bool, default=False
        Sort direction.

    Returns
    -------
    pd.DataFrame
        Table ready for export.
    """
    if category_col not in df.columns:
        raise KeyError(f"Categorical column '{category_col}' not found.")
    if default_col not in df.columns:
        raise KeyError(f"Default column '{default_col}' not found.")

    data = df[[category_col, default_col]].copy()

    # Minimal validation of the target:
    # convert bool -> int; otherwise assume a documented 0/1 coding.
    if pd.api.types.is_bool_dtype(data[default_col]):
        data[default_col] = data[default_col].astype(int)

    # Handle missing values in the categorical variable.
    if include_na:
        data[category_col] = data[category_col].astype("object").fillna("Missing")
    else:
        data = data[data[category_col].notna()].copy()

    grouped = (
        data.groupby(category_col, dropna=False)[default_col]
        .agg(count="size", defaults="sum")
        .reset_index()
    )
    total_obs = grouped["count"].sum()
    total_def = grouped["defaults"].sum()
    grouped["prop"] = grouped["count"] / total_obs if total_obs > 0 else 0.0
    grouped["default_rate"] = grouped["defaults"] / grouped["count"]

    sort_mapping = {
        "count": "count",
        "defaults": "defaults",
        "prop": "prop",
        "default_rate": "default_rate",
        "category": category_col,
    }
    if sort_by not in sort_mapping:
        raise ValueError(
            "sort_by must be one of "
            "{'count', 'defaults', 'prop', 'default_rate', 'category'}."
        )
    grouped = grouped.sort_values(
        sort_mapping[sort_by], ascending=ascending
    ).reset_index(drop=True)

    total_row = pd.DataFrame(
        {
            category_col: ["Total"],
            "count": [total_obs],
            "defaults": [total_def],
            "prop": [1.0 if total_obs > 0 else 0.0],
            "default_rate": [total_def / total_obs if total_obs > 0 else 0.0],
        }
    )
    summary = pd.concat([grouped, total_row], ignore_index=True)
    summary = summary.rename(
        columns={
            category_col: category_label or category_col,
            "count": "Nb of obs",
            "defaults": "Nb def",
            "prop": "Prop",
            "default_rate": "Default rate",
        }
    )
    summary = summary[
        [category_label or category_col, "Nb of obs", "Prop", "Nb def", "Default rate"]
    ]
    return summary
def export_summary_to_excel(
    summary: pd.DataFrame,
    output_path: str,
    sheet_name: str = "Summary",
    title: str = "All perimeters",
) -> None:
    """
    Export the summary table to a formatted Excel file.
    Requires the xlsxwriter engine.
    """
    with pd.ExcelWriter(output_path, engine="xlsxwriter") as writer:
        workbook = writer.book
        worksheet = workbook.add_worksheet(sheet_name)
        nrows, ncols = summary.shape
        # Layout (0-based xlsxwriter rows):
        # row 0: merged title
        # row 1: header
        # rows 2+: data
        # -------- Formats --------
        header_bg = "#D9EAF7"
        title_bg = "#CFE2F3"
        total_bg = "#D9D9D9"
        white_bg = "#FFFFFF"
        title_fmt = workbook.add_format({
            "bold": True,
            "align": "center",
            "valign": "vcenter",
            "font_size": 14,
            "border": 1,
            "bg_color": title_bg,
        })
        header_fmt = workbook.add_format({
            "bold": True,
            "align": "center",
            "valign": "vcenter",
            "border": 1,
            "bg_color": header_bg,
        })
        text_fmt = workbook.add_format({
            "border": 1,
            "align": "left",
            "valign": "vcenter",
            "bg_color": white_bg,
        })
        int_fmt = workbook.add_format({
            "border": 1,
            "align": "center",
            "valign": "vcenter",
            "num_format": "# ##0",
            "bg_color": white_bg,
        })
        pct_fmt = workbook.add_format({
            "border": 1,
            "align": "center",
            "valign": "vcenter",
            "num_format": "0.00%",
            "bg_color": white_bg,
        })
        total_text_fmt = workbook.add_format({
            "bold": True,
            "border": 1,
            "align": "center",
            "valign": "vcenter",
            "bg_color": total_bg,
        })
        total_int_fmt = workbook.add_format({
            "bold": True,
            "border": 1,
            "align": "center",
            "valign": "vcenter",
            "num_format": "# ##0",
            "bg_color": total_bg,
        })
        total_pct_fmt = workbook.add_format({
            "bold": True,
            "border": 1,
            "align": "center",
            "valign": "vcenter",
            "num_format": "0.00%",
            "bg_color": total_bg,
        })
        # -------- Merged title --------
        worksheet.merge_range(0, 0, 0, ncols - 1, title, title_fmt)
        # -------- Header --------
        worksheet.set_row(1, 28)
        for col_idx, col_name in enumerate(summary.columns):
            worksheet.write(1, col_idx, col_name, header_fmt)
        # -------- Column widths --------
        column_widths = {
            0: 24,  # category
            1: 14,  # Nb of obs
            2: 12,  # Prop
            3: 10,  # Nb def
            4: 14,  # Default rate
        }
        for col_idx in range(ncols):
            worksheet.set_column(col_idx, col_idx, column_widths.get(col_idx, 15))
        # -------- Cell-by-cell formatting --------
        last_row_idx = nrows - 1
        for row_idx in range(nrows):
            excel_row = 2 + row_idx  # data starts on row 2 (0-based xlsxwriter)
            is_total = row_idx == last_row_idx
            for col_idx, col_name in enumerate(summary.columns):
                value = summary.iloc[row_idx, col_idx]
                if col_idx == 0:
                    fmt = total_text_fmt if is_total else text_fmt
                elif col_name in ["Nb of obs", "Nb def"]:
                    fmt = total_int_fmt if is_total else int_fmt
                elif col_name in ["Prop", "Default rate"]:
                    fmt = total_pct_fmt if is_total else pct_fmt
                else:
                    fmt = total_text_fmt if is_total else text_fmt
                worksheet.write(excel_row, col_idx, value, fmt)
        # Optional: freeze the title and header rows
        # worksheet.freeze_panes(2, 1)
        worksheet.set_default_row(24)
def generate_categorical_report_excel(
    df: pd.DataFrame,
    category_col: str,
    default_col: str,
    output_path: str,
    sheet_name: str = "Summary",
    title: str = "All perimeters",
    category_label: Optional[str] = None,
    include_na: bool = False,
    sort_by: str = "count",
    ascending: bool = False,
) -> pd.DataFrame:
    """
    1. Build the summary table.
    2. Export it to Excel.
    3. Return the summary DataFrame as well.
    """
    summary = build_default_summary(
        df=df,
        category_col=category_col,
        default_col=default_col,
        category_label=category_label,
        include_na=include_na,
        sort_by=sort_by,
        ascending=ascending,
    )
    export_summary_to_excel(
        summary=summary,
        output_path=output_path,
        sheet_name=sheet_name,
        title=title,
    )
    return summary
def discretize_variable_by_quartiles(
    df: pd.DataFrame,
    variable: str,
    new_var: Optional[str] = None
) -> pd.DataFrame:
    """
    Discretize a continuous variable into four intervals based on its quartiles.

    The function computes Q1, Q2 (median), and Q3 of the selected variable and
    creates four bins corresponding to the following intervals:
    ]min ; Q1], ]Q1 ; Q2], ]Q2 ; Q3], ]Q3 ; max]

    Parameters
    ----------
    df : pd.DataFrame
        Input dataframe containing the variable to discretize.
    variable : str
        Name of the continuous variable to be discretized.
    new_var : str, optional
        Name of the new categorical variable created. If None,
        the name "<variable>_quartile" is used.

    Returns
    -------
    pd.DataFrame
        A copy of the dataframe with the new quartile-based categorical variable.
    """
    # Create a copy of the dataframe to avoid modifying the original dataset
    data = df.copy()
    # If no name is provided for the new variable, create one automatically
    if new_var is None:
        new_var = f"{variable}_quartile"
    # Compute the quartiles of the variable
    q1, q2, q3 = data[variable].quantile([0.25, 0.50, 0.75])
    # Retrieve the minimum and maximum values of the variable
    vmin = data[variable].min()
    vmax = data[variable].max()
    # Define the bin edges
    # A small epsilon is subtracted from the minimum value to ensure it is included
    bins = [vmin - 1e-9, q1, q2, q3, vmax]
    # Define human-readable labels for each interval
    labels = [
        f"]{vmin:.2f} ; {q1:.2f}]",
        f"]{q1:.2f} ; {q2:.2f}]",
        f"]{q2:.2f} ; {q3:.2f}]",
        f"]{q3:.2f} ; {vmax:.2f}]",
    ]
    # Use pandas.cut to assign each observation to a quartile-based interval
    data[new_var] = pd.cut(
        data[variable],
        bins=bins,
        labels=labels,
        include_lowest=True
    )
    # Return the dataframe with the new discretized variable
    return data
Example of application for a continuous variable

# Distribution by age (person_age)

# Discretize the variable into quartiles
df_with_age_bins = discretize_variable_by_quartiles(
    df,
    variable="person_age",
    new_var="age_quartile"
)

summary = generate_categorical_report_excel(
    df=df_with_age_bins,
    category_col="age_quartile",
    default_col="loan_status",
    output_path="age_quartile_report.xlsx",
    sheet_name="Age Quartiles",
    title="Distribution by Age (Quartiles)",
    category_label="Age Quartiles",
    sort_by="default_rate",
    ascending=False
)