How to Train a Scoring Model in the Age of Artificial Intelligence

All code used in this section is available on GitHub. The business logic and modeling functions are located in the src/selection directory, specifically in the following file:

src/selection/logit_model_selection.py

The corresponding analysis and results are documented in:

08_logistic_model_selection.qmd

With artificial intelligence, it has become easier to generate code, automate model training, compare metrics, and produce summary tables. A few well-structured prompts can now help a data scientist write Python scripts, estimate logistic regressions, compute AUC and Gini, generate plots, and document the results.

But this speed creates a risk.

A scoring model is not just an algorithm that runs successfully. It is not simply the model with the highest performance on the training sample. In a professional credit risk environment, a scoring model must be statistically sound, stable over time, interpretable, consistent with business expectations, and easy to monitor after deployment.

This article is part of a broader series on building robust, interpretable, and stable scoring models. In previous articles, we covered the main steps before modeling: building the datasets, performing exploratory data analysis, preparing variables, preselecting predictors, testing stability over time, comparing development and validation samples, and discretizing continuous variables.

We now turn to one of the most important stages: training candidate models and selecting the final model.

The goal of this article is to present a clear methodology for training several scoring models, comparing their performance, assessing their stability, and selecting a final model based on statistical, business, and operational criteria.

Tools such as ChatGPT, Codex, and GitHub Copilot can assist with generating code, automating modeling loops, running statistical tests, producing summary tables, and documenting results. In this work, we will specifically use Codex and assess its ability to carry out each of these tasks.

The article is organized into three parts. First, we present the datasets used in the modeling process. Second, we describe the methodology used to train and evaluate candidate models. Third, we explain how to analyze the results and select the final model.

The Datasets

In this article, we illustrate this foundational step using an open-source dataset available on Kaggle: the Credit Scoring Dataset. This dataset contains 32,581 observations and 12 variables describing loans issued by a bank to individual borrowers.

Throughout this series, we have applied a range of processing steps to these variables in order to pre-select the candidate variables for the final model selection, subject to both statistical and regulatory constraints.

In this application, the variables retained after the preselection steps are categorical. Most of them have two or three modalities. This is consistent with the previous stages of the methodology, where continuous variables were discretized to improve interpretability and make the final score easier to explain.

The retained variables are:

These explanatory variables are denoted by $X_1, \dots, X_q$. In this case, $q = 6$.

The target variable, denoted by Y, represents default status. In this case, it corresponds to the variable loan_status. It is defined as:

$Y = \begin{cases} 1 & \text{if the borrower is in default} \\ 0 & \text{otherwise} \end{cases}$

The objective is to estimate the probability of default conditional on the observed characteristics:

$P(Y = 1 \mid X_1 = x_1, X_2 = x_2, \dots, X_{6} = x_{6})$

The score is then constructed as a transformation of this estimated probability. In the case of logistic regression, this transformation is based on the logit function.

The data are split into three main samples.

The training sample is used to estimate the parameters of the candidate models. In our case, it is also divided into four folds to assess the robustness of the models across different subsamples.

The test sample is used to evaluate model performance on observations that were not directly used to estimate the coefficients. It helps determine whether the model generalizes well to a population similar to the development sample.

The out-of-time sample is used to assess temporal stability. This is especially important in credit scoring. A model should not only perform well at the time of development; it should also remain stable when applied to a different time period.

This distinction matters because a model can look strong on the training data but deteriorate significantly on the out-of-time sample. When that happens, the model may be overfitted or too dependent on the development period.
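As a rough sketch of this three-way split, the snippet below separates an out-of-time sample by date, divides the remaining observations into train and test, and assigns each training observation to one of four folds. The `period` field, the cutoff, the 30% test share, and the seed are assumptions made for illustration; the article does not specify how its samples were built, and a real split would usually also be stratified on the target.

```python
import random

def split_samples(records, oot_start, test_share=0.3, n_folds=4, seed=42):
    """Split records into train, test and out-of-time samples.

    Each record is a dict with a hypothetical 'period' key (e.g. '2023-06').
    Records dated on or after oot_start form the out-of-time sample; the
    rest are shuffled and divided into train and test. Each training record
    also receives a fold number in 0..n_folds-1.
    """
    oot = [r for r in records if r["period"] >= oot_start]
    dev = [r for r in records if r["period"] < oot_start]

    rng = random.Random(seed)
    rng.shuffle(dev)

    n_test = int(len(dev) * test_share)
    test, train = dev[:n_test], dev[n_test:]
    folds = [i % n_folds for i in range(len(train))]
    return train, folds, test, oot

# Toy data: 20 records spread over four periods.
periods = ["2023-01", "2023-02", "2023-03", "2023-04"]
records = [{"id": i, "period": periods[i % 4]} for i in range(20)]
train, folds, test, oot = split_samples(records, oot_start="2023-04")
print(len(train), len(test), len(oot), sorted(set(folds)))
```

Comparing period strings works here only because they are in ISO year-month format, which sorts chronologically.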

Reformulating the Scoring Problem

A scoring model estimates the relationship between a binary target variable $Y$ and a set of explanatory variables $X_1, X_2, \dots, X_{6}$.

For each individual i, the model produces a score based on the estimated probability of default:

$Score(x_i) = f \left(P(Y_i = 1 \mid X_{1,i}, X_{2,i}, \dots, X_{q,i})\right)$

In credit scoring, the score must rank borrowers by risk. A good model should assign higher-risk scores, on average, to borrowers who default and lower-risk scores to borrowers who do not.

This ranking ability is why discrimination metrics such as AUC and Gini are central in scoring. However, discrimination alone is not enough. A model can have good predictive power and still be unstable, difficult to interpret, or inconsistent with business logic.

That is why the final model must be selected using several criteria, not just one performance metric.

Why Logistic Regression Remains the Reference Model

Because the target variable is binary, logistic regression is a natural reference model. It models the log-odds of default as a linear combination of the explanatory variables:

$\log \left( \frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)} \right) = \beta_0 + \beta_1 X_1 + \dots + \beta_q X_q$
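To make the link between log-odds and probability concrete, here is a minimal sketch that converts a linear predictor into a probability of default and back. The intercept and coefficients are purely illustrative and are not estimates from the article's model.

```python
import math

def sigmoid(z: float) -> float:
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p: float) -> float:
    """Map a probability p in (0, 1) to its log-odds."""
    return math.log(p / (1.0 - p))

# Illustrative values only (not the article's estimates).
beta_0 = -2.0
betas = [0.8, 1.1]   # coefficients of two dummy variables
x = [1, 0]           # the borrower falls in the first risky modality only

log_odds = beta_0 + sum(b * xi for b, xi in zip(betas, x))
prob_default = sigmoid(log_odds)
print(round(log_odds, 3), round(prob_default, 4))
```

Applying `logit` to `prob_default` recovers `log_odds`, which is why the score can be expressed on either scale.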

Logistic regression has several advantages in a scoring context. It is designed for binary outcomes, produces interpretable coefficients, allows the analyst to verify the direction of risk, and is well understood by statistical, business, and IT teams. It is also relatively easy to implement in production.

In the age of artificial intelligence, it may be tempting to move directly to more complex models such as random forests, gradient boosting, or neural networks. These models can sometimes deliver better raw performance.

But in credit scoring, raw performance is not the only objective. The model must also be explainable, documented, stable, and aligned with business expectations. For this reason, logistic regression remains a strong benchmark and, in many cases, the preferred production model.

Artificial intelligence can accelerate the modeling process, but it does not change the core requirements of a professional scoring model.

Preparing Categorical Variables

Since the explanatory variables are categorical, they must be transformed before being used in logistic regression.

Each categorical variable is converted into dummy variables. If a variable has n modalities, it is represented by n – 1 indicators. One modality is kept as the reference category.

This avoids perfect multicollinearity between modalities. The estimated coefficients are then interpreted relative to the reference category.

For example, suppose a variable has three modalities: A, B, and C. If A is selected as the reference, the model estimates one coefficient for B and one coefficient for C. These coefficients measure the difference in risk between B and A, and between C and A.

In this methodology, the reference category is chosen as the least risky modality, meaning the modality with the lowest default rate in the training sample. This makes interpretation easier: positive coefficients indicate higher risk relative to the safest modality.
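The encoding rule can be sketched in plain Python as follows. The variable `grade`, its modalities, and the target values are toy data invented for the example; the helper names are hypothetical and are not taken from the article's repository.

```python
from collections import defaultdict

def least_risky_modality(values, target):
    """Return the modality with the lowest default rate."""
    counts, defaults = defaultdict(int), defaultdict(int)
    for v, y in zip(values, target):
        counts[v] += 1
        defaults[v] += y
    return min(counts, key=lambda m: defaults[m] / counts[m])

def dummy_encode(values, reference):
    """Build n - 1 indicator columns, dropping the reference modality."""
    modalities = sorted(set(values) - {reference})
    return {m: [1 if v == m else 0 for v in values] for m in modalities}

# Toy variable with three modalities and a binary target.
grade = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C"]
default = [0, 0, 0, 1, 0, 1, 0, 1, 1, 0]

ref = least_risky_modality(grade, default)
dummies = dummy_encode(grade, ref)
print(ref, list(dummies))
```

Here modality A has the lowest default rate (1 out of 4), so it becomes the reference and only the indicators for B and C enter the model.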

Training Candidate Models

After variable preselection, all relevant combinations of candidate variables are tested.

The objective is not simply to identify the model with the highest training performance. The goal is to retain a model that satisfies several requirements:

statistical validity;
business consistency;
sufficient discriminatory power;
stability across samples;
a reasonable number of variables;
limited multicollinearity;
clear interpretability.

For each combination of variables, a logistic regression is estimated on the training sample and evaluated across the validation folds.

Each candidate model is assessed using four families of criteria: statistical validation, predictive performance, stability, and interpretability.

This process can be largely automated with artificial intelligence. An AI coding assistant can help generate loops over variable combinations, estimate models, store coefficients, calculate metrics, and produce comparison tables.
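The enumeration step itself is short. The sketch below assumes that "all relevant combinations" means every non-empty subset of the six candidate variables, which is an interpretation rather than something the article states; the variable names are placeholders.

```python
from itertools import combinations

# Placeholder names: the article retains six variables, but this sketch
# does not assume which ones.
candidates = ["x1", "x2", "x3", "x4", "x5", "x6"]

def all_subsets(variables):
    """Yield every non-empty combination of the candidate variables."""
    for k in range(1, len(variables) + 1):
        yield from combinations(variables, k)

subsets = list(all_subsets(candidates))
by_size = {k: sum(1 for s in subsets if len(s) == k) for k in range(1, 7)}
print(len(subsets), by_size)
```

With six candidates this gives 63 models to estimate, which is small enough for an exhaustive search; the count doubles with each additional candidate variable.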

Statistical Validation Criteria

The first level of evaluation concerns statistical validity.

Global Significance

Global significance can be assessed using a likelihood ratio test. This test compares the full model with a null model that includes only the intercept.

The purpose is to verify whether the explanatory variables collectively add significant information in explaining the target variable.

A model that does not significantly improve on the null model should not be retained, even if some descriptive metrics appear acceptable.
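Written out, with $\ell_0$ the maximized log-likelihood of the intercept-only model and $\ell_1$ that of the full model (notation introduced here for illustration), the test statistic is:

$LR = 2 \left( \ell_1 - \ell_0 \right)$

Under the null hypothesis that all slope coefficients are zero, $LR$ asymptotically follows a chi-square distribution with $k$ degrees of freedom, where $k$ is the number of estimated coefficients excluding the intercept, that is, the number of dummy variables in the model. The model is considered globally significant at the 5% level when the associated p-value is below 0.05.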

Individual Significance

Individual significance is assessed by examining each coefficient with a statistical test, such as a Wald test or a likelihood ratio test, and its associated p-value.

In this methodology, selected variables must be significant at the 5% level. The modalities should also be reviewed to ensure that each retained variable contributes meaningfully to risk discrimination.

This step is important because a variable may appear useful overall while some of its modalities are weak, unstable, or difficult to interpret.

Direction of Risk

Statistical significance is not enough. The coefficients must also be consistent with business expectations.

If a modality is expected to represent higher risk, its coefficient should indicate an increase in the probability of default relative to the reference category.

A model can be statistically strong but difficult to justify if the direction of risk is inconsistent with economic or business logic. In professional scoring, this type of inconsistency must be carefully investigated before the model can be accepted.
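One way to automate a first screening of this criterion is sketched below. It assumes that the reference category is the least risky modality, so every estimated coefficient is expected to be positive; the dummy names and coefficient values are hypothetical.

```python
def risk_direction_issues(coefficients):
    """List the dummy variables whose coefficient is negative.

    With the least risky modality as reference, a negative coefficient
    means a modality is estimated to be safer than the reference, which
    this sketch treats as a signal to investigate.
    """
    return [name for name, beta in coefficients.items() if beta < 0]

# Illustrative coefficients keyed by hypothetical dummy names.
coefs = {"var_a[B]": 0.42, "var_a[C]": 1.10, "var_b[high]": -0.15}
print(risk_direction_issues(coefs))
```

A flagged coefficient is not automatically wrong, since the reference is chosen on univariate default rates while coefficients are estimated jointly, but it should be reviewed before the model is accepted.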

Multicollinearity

Multicollinearity can make coefficient estimates unstable and difficult to interpret. It is commonly assessed using the Variance Inflation Factor, or VIF.

In this methodology, retained models must satisfy:

VIF < 10

Because the variables are categorical, the VIF is calculated on the dummy variables, excluding the reference modalities. For each categorical variable, we return a simple status:

OK if all modalities satisfy the VIF constraint;
KO if at least one modality has VIF >= 10.

This rule helps eliminate models in which explanatory variables are too strongly redundant.
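For a dummy variable j, the VIF is 1 / (1 - R_j^2), where R_j^2 comes from regressing that dummy on all the others. Assuming the VIF values have already been computed (for instance with the `variance_inflation_factor` function from statsmodels), the OK/KO rule can be sketched as follows. The variable names and VIF values are invented for the example.

```python
VIF_LIMIT = 10.0

def vif_status(vif_by_variable, limit=VIF_LIMIT):
    """Return 'OK' or 'KO' per categorical variable.

    vif_by_variable maps a variable name to a dict {modality: VIF}, with
    reference modalities already excluded.
    """
    return {
        var: "OK" if all(v < limit for v in vifs.values()) else "KO"
        for var, vifs in vif_by_variable.items()
    }

# Illustrative VIF values (not computed from the article's data).
vifs = {
    "var_a": {"B": 1.8, "C": 2.3},
    "var_b": {"medium": 4.1, "high": 12.7},
}
print(vif_status(vifs))
```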

Goodness of Fit

Goodness of fit can be assessed using tests such as the Hosmer-Lemeshow test. This test compares predicted probabilities with observed default rates across risk groups.

It should not be interpreted in isolation, but it can provide useful information about calibration.

In this application, we do not use the Hosmer-Lemeshow test directly. Our Python workflow does not rely on a documented, built-in, one-call implementation of this test. It should therefore be coded manually, implemented with a validated external function, or handled in another statistical environment. A dedicated article will cover this topic separately.

Performance Metrics

Model performance is evaluated from two perspectives.

The first perspective measures discrimination: the model’s ability to distinguish borrowers who default from borrowers who do not. This is captured by the ROC curve, AUC, and Gini.

The second perspective focuses on class imbalance and the quality of positive-class prediction. This is captured by recall, precision, F1-score, and PR-AUC.

ROC Curve, AUC, and Gini

The ROC curve shows the relationship between the true positive rate and the false positive rate across different classification thresholds.

The true positive rate, also called recall, is defined as:

$TPR = \frac{TP}{TP + FN}$

It measures the proportion of actual defaults correctly identified by the model.

The false positive rate is defined as:

$FPR = \frac{FP}{FP + TN}$

It measures the proportion of non-defaulting borrowers incorrectly classified as defaults.

The AUC, or Area Under the Curve, summarizes the ROC curve. The closer the AUC is to 1, the better the model is at ranking risky and non-risky borrowers. An AUC close to 0.5 indicates performance close to random classification.

The Gini index is a common transformation of AUC in credit scoring:

$Gini = 2 \times AUC - 1$

A Gini of 0 corresponds to random performance. A higher Gini indicates stronger discriminatory power.
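The ranking interpretation of AUC can be illustrated with a small, dependency-free sketch: the AUC equals the share of (defaulter, non-defaulter) pairs in which the defaulter receives the higher score, with ties counted as one half. This pairwise version is meant for understanding only; on real samples a library routine such as scikit-learn's `roc_auc_score` would be the practical choice.

```python
def auc_score(y_true, scores):
    """AUC as the probability that a defaulter outscores a non-defaulter.

    Ties count for one half. Higher scores are assumed to mean higher risk.
    """
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

def gini_score(y_true, scores):
    """Gini index derived from AUC: 2 * AUC - 1."""
    return 2.0 * auc_score(y_true, scores) - 1.0

y = [0, 0, 1, 1]
p = [0.10, 0.40, 0.35, 0.80]
print(auc_score(y, p), gini_score(y, p))  # 0.75 0.5
```

In this toy example, three of the four defaulter/non-defaulter pairs are correctly ordered, hence an AUC of 0.75 and a Gini of 0.5.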

Recall, Precision, and F1-Score

When the target variable is imbalanced, it is useful to complement AUC and Gini with metrics focused on the default class.

Recall measures how many actual defaults are correctly detected:

$Recall = \frac{TP}{TP + FN}$

Precision measures how many predicted defaults are truly defaults:

$Precision = \frac{TP}{TP + FP}$

The F1-score combines precision and recall through a harmonic mean:

$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$

This metric is useful when we need to balance the ability to detect defaults with the need to limit false positives.
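These three metrics depend on a classification threshold, which the formulas above leave implicit. The sketch below uses a threshold of 0.5 purely as an assumption; with a default rate well below 50%, a lower threshold is often more appropriate and should be chosen deliberately.

```python
def classification_metrics(y_true, probs, threshold=0.5):
    """Precision, recall and F1 for the default class at a given threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(y_true, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 0)

    # Guard against empty denominators (no predicted or no actual defaults).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

y_obs = [1, 1, 0, 0, 1]
p_hat = [0.9, 0.4, 0.6, 0.2, 0.7]
metrics = classification_metrics(y_obs, p_hat)
print({k: round(v, 4) for k, v in metrics.items()})
```

In this toy case there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3.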

Precision-Recall AUC

The Precision-Recall curve plots precision against recall for different thresholds. It is particularly useful when the positive class is rare.

The PR-AUC should be interpreted relative to the default rate in the sample. A useful model should generally achieve a PR-AUC above the observed default rate.
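One common estimator of PR-AUC is average precision, which sums precision weighted by the increase in recall at each threshold. The article does not say which estimator it uses, so the sketch below is an assumption in that respect. Tied scores are grouped, which matters here because a model built on a few categorical variables produces many identical scores.

```python
def average_precision(y_true, scores):
    """Average precision: sum of (recall increment) x precision over thresholds.

    Observations sharing the same score are processed together, so the
    result does not depend on the input order of tied observations.
    """
    n_pos = sum(y_true)
    pairs = sorted(zip(scores, y_true), key=lambda t: -t[0])
    tp = fp = 0
    ap = 0.0
    i = 0
    while i < len(pairs):
        j = i
        tp_group = 0
        # Gather every observation tied at the current score.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tp_group += pairs[j][1]
            j += 1
        tp += tp_group
        fp += (j - i) - tp_group
        ap += (tp_group / n_pos) * (tp / (tp + fp))
        i = j
    return ap

y_lab = [0, 0, 1, 1]
s_hat = [0.10, 0.40, 0.35, 0.80]
baseline = sum(y_lab) / len(y_lab)   # observed default rate
print(round(average_precision(y_lab, s_hat), 4), baseline)
```

When every observation receives the same score, this function returns exactly the default rate, which illustrates why the default rate is the natural baseline for PR-AUC.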

Conditional Score Distributions

Numerical metrics should be complemented with graphical analysis.

The conditional distributions of scores for defaulting and non-defaulting borrowers help show whether the model separates the two populations effectively.

A good model should produce visibly different score distributions. If the distributions strongly overlap, the model has limited discriminatory power, even if some metrics appear acceptable.

Stability Criteria

A scoring model should not be selected based only on training performance. It must remain stable across different samples.

For this reason, performance is compared across:

the training sample;
the test sample;
the out-of-time sample;
the validation folds.

A model with a high training Gini but a strong deterioration on the test or out-of-time sample may be overfitted.

To account for stability, we use a penalized Gini criterion:

$\text{Gini}_{\text{penalized}} = \text{mean}(\text{Gini}_{\text{folds}}) - |\text{Gini}_{\text{train}} - \text{Gini}_{\text{test}}| - |\text{Gini}_{\text{train}} - \text{Gini}_{\text{OOT}}|$

This criterion rewards models that combine good average performance across folds with limited degradation between samples.
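The penalized criterion translates directly into code. The function name and the input values below are illustrative and are not the article's results.

```python
from statistics import mean

def penalized_gini(gini_folds, gini_train, gini_test, gini_oot):
    """Mean fold Gini minus the absolute train-test and train-OOT gaps."""
    return (
        mean(gini_folds)
        - abs(gini_train - gini_test)
        - abs(gini_train - gini_oot)
    )

# Illustrative values (not the article's results).
value = penalized_gini([0.58, 0.60, 0.59, 0.61], 0.60, 0.59, 0.57)
print(round(value, 4))
```

In this example the mean fold Gini of 0.595 is reduced by a 0.01 train-test gap and a 0.03 train-OOT gap, giving 0.555. Because the gaps enter in absolute value, a model that performs better out of time than in training is penalized as much as one that performs worse.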

The same logic can be applied to recall, precision, F1-score, and PR-AUC.

The key idea is simple: a good scoring model should perform well, but it should also perform consistently.

Selecting the Optimal Number of Variables

Once statistically acceptable models have been identified, performance is analyzed by the number of variables included.

The goal is to find the smallest model that delivers satisfactory performance and stability.

A more complex model is not always better. Adding variables may slightly improve Gini, but it can also reduce stability, increase the risk of overfitting, and make interpretation more difficult.

The final model should balance:

performance;
stability;
interpretability;
simplicity;
business consistency.

In scoring, this balance is often more important than maximizing a single metric.

A model with six stable, interpretable variables may be preferable to a model with ten variables and a slightly higher training Gini.
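A simple way to support this comparison is to keep, for each number of variables, the candidate with the best penalized Gini. The sketch below assumes each candidate is stored as a dictionary with hypothetical keys `variables` and `gini_penalized`; the values are invented.

```python
def best_by_size(models, metric="gini_penalized"):
    """Keep, for each number of variables, the model with the best metric."""
    best = {}
    for m in models:
        k = len(m["variables"])
        if k not in best or m[metric] > best[k][metric]:
            best[k] = m
    return dict(sorted(best.items()))

# Hypothetical candidates that already passed the statistical checks.
models = [
    {"variables": ("x1",), "gini_penalized": 0.41},
    {"variables": ("x2",), "gini_penalized": 0.38},
    {"variables": ("x1", "x2"), "gini_penalized": 0.50},
    {"variables": ("x1", "x3"), "gini_penalized": 0.52},
    {"variables": ("x1", "x2", "x3"), "gini_penalized": 0.53},
]
for k, m in best_by_size(models).items():
    print(k, m["variables"], m["gini_penalized"])
```

In this toy table, moving from two to three variables adds only one point of penalized Gini, the kind of marginal gain that the analyst must weigh against the loss of simplicity.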

The Role of Large Language Models

In this article, the training, comparison, and selection code is produced with the assistance of an artificial intelligence tool, specifically Codex with an advanced reasoning model.

The purpose is not to delegate statistical judgment to AI. The purpose is to use AI as an accelerator for repetitive and technical tasks.

AI can help generate data preparation scripts, automate variable combinations, estimate logistic regressions, compute performance metrics, check statistical constraints, compare train, test, and out-of-time results, produce summary tables, and document the workflow.

This makes AI a powerful methodological assistant.

However, the results must still be reviewed. Statistical tests must be interpreted correctly. Coefficients must be checked. Business consistency must be validated. Stability must be assessed. The final model must be selected by the analyst, not by the tool.

Presenting the Results

The results should follow the same logic as the model selection process.

First, present the number of candidate variables, the number of combinations tested, and the number of models eliminated at each stage. This makes the selection process transparent.

Second, present the statistically acceptable models. These are the models that satisfy the main validation criteria: global significance, variable significance, coherent direction of risk, acceptable VIF levels, and stable coefficients.

Third, compare the remaining models using performance and stability metrics:

average Gini across folds;
train Gini;
test Gini;
out-of-time Gini;
train-test gap;
train-out-of-time gap;
penalized Gini;
recall;
precision;
F1-score;
PR-AUC.

The best model for each number of variables — satisfying all statistical and stability constraints — is presented in the table below.

The choice of the final model depends on the objective. In this case, Model 4 is selected. The default rate on the training set is 22%, which sets the minimum PR-AUC benchmark at approximately 22%. A meaningful model must achieve a PR-AUC substantially above this threshold.

Model 5 achieves the best penalized PR-AUC, the best penalized recall, and the best penalized F1-score. If the primary objective is the operational detection of defaults using a classification threshold, Model 5 is a compelling option.

However, for a scoring model, the main criterion remains the ability to rank risk, that is, the Gini index, particularly on the test and out-of-time datasets and, in our case, the penalized Gini.

Model 4 offers the best overall trade-off for the following reasons:

It achieves the highest penalized Gini at 56.01%, reflecting strong and stable discriminatory power across datasets.
It improves marginally on Model 3 by incorporating the variable cb_person_default_on_file, which adds meaningful risk information.
Its penalized PR-AUC of 48.44% is well above the 22% default rate, confirming the model’s ability to identify defaulting borrowers.
With only 4 variables, it remains highly interpretable and easy to explain to business and governance teams.

For these reasons, Model 4 is selected as the final scoring model. The estimated coefficients of this model are presented in the table below:

Finally, the chart below summarizes the discrimination performance of the final model by presenting the Gini index across the training, test, and out-of-time datasets. The Gini values remain consistent across all three datasets, which indicates no material overfitting.

The model has been saved in Python using the pickle format for future use, for instance, to compute scores for the various counterparties within the portfolio perimeter.
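A minimal sketch of this save-and-reload step is shown below. A plain dictionary stands in for the fitted model, and the file name is hypothetical; the article's actual object and path are not shown here.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model: any picklable Python object is handled
# the same way.
model = {"intercept": -2.0, "coefficients": {"var_a[B]": 0.42}}

# Hypothetical file name, written to a temporary directory for this example.
path = os.path.join(tempfile.mkdtemp(), "final_scoring_model.pkl")

with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code. It is also prudent to record the library versions used at training time alongside the file.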

Conclusion

In this article, we presented the key steps involved in selecting the final model from a set of candidates, using logistic regression as the reference framework. This model will subsequently be used to build a score capable of discriminating between counterparties across a retail portfolio.

The results show that the four-variable model offers the best trade-off between discriminatory performance, predictive ability, and temporal stability. With a Gini of approximately 60% and a PR-AUC of approximately 49%, it demonstrates both strong risk-ranking capacity and a meaningful ability to identify defaulting borrowers — well above the 22% baseline set by the observed default rate.

Throughout this work, we used OpenAI’s Codex agent to assist with code writing and chart production. The outputs were generated by specifying the desired format, with no additional manual adjustments. The quality of the results was consistently high, confirming that this type of tool can serve as a reliable methodological assistant and is likely to meaningfully influence the way scoring models are developed in the future.

In the next installment, we will present how scores are computed for the various counterparties within the portfolio, along with the individual contributions of each variable to the final score.

Data & Licensing

The dataset used in this article is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This license allows anyone to share and adapt the dataset for any purpose, including commercial use, provided that proper attribution is given to the source.

For more details, see the official license text: CC0: Public Domain.

Disclaimer

Any remaining errors or inaccuracies are the author’s responsibility. Feedback and corrections are welcome.