The world is going through a deep transformation driven by technological progress. These changes affect all sectors, especially the banking industry. Data professionals must adapt quickly to remain efficient, productive, and competitive.
For experienced professionals with strong foundations in mathematics, statistics, and operational practice, this transition can be natural. However, it may be more challenging for beginners who have not yet fully mastered these fundamental skills.
In the field of credit risk, developing these skills requires a clear understanding of bank exposures and the mechanisms used to manage the associated risks.
My next articles will focus mainly on credit risk management within a regulatory framework. The European Central Bank (ECB) allows banks to use internal models to assess the credit risk of their different exposures. These exposures may include loans granted to companies to finance long-term projects or loans granted to households to finance real estate projects.
These models aim to estimate several key parameters:
- PD (Probability of Default): the probability that a borrower will be unable to meet its payment obligations.
- EAD (Exposure at Default): the exposure amount at the time of default.
- LGD (Loss Given Default): the severity of the loss in the event of default.
We can therefore distinguish between PD models, EAD models, and LGD models. In this series, I will mainly focus on PD models. These models are used to assign ratings to borrowers and to contribute to the calculation of regulatory capital requirements, which protect banks against unexpected losses.
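Taken together, the three parameters determine the expected loss on an exposure through the standard formula EL = PD × LGD × EAD. A minimal numerical sketch, with made-up parameter values:

```python
# Illustrative only: how the three risk parameters combine into expected loss.
# EL = PD * LGD * EAD (standard expected-loss formula); the numbers are made up.
pd_ = 0.02        # Probability of Default: 2% over a one-year horizon
lgd = 0.45        # Loss Given Default: 45% of the exposure is lost on default
ead = 1_000_000   # Exposure at Default, in euros

expected_loss = pd_ * lgd * ead
print(round(expected_loss, 2))  # 9000.0
```

Regulatory capital, by contrast, is meant to cover unexpected losses, i.e. losses beyond this expected level.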
In this first article, I will focus on defining and constructing the modeling scope.
Definition of Default
Building the modeling dataset requires a clear understanding of the modeling objective and a precise definition of default. Assessing the probability of default of a counterparty means observing the transition from a healthy (performing) state to a state of default over a given horizon h. In what follows, we will assume this horizon is set at one year (h = 1).
The definition of default was harmonized and brought under regulatory supervision following the 2008 financial crisis. The objective was to establish a standardized definition applicable to all banking institutions.
This definition is based on several criteria, including:
- a significant deterioration in the counterparty’s financial situation,
- the existence of past-due amounts,
- situations of forbearance,
- contagion effects within a group of exposures.
Historically, banks applied the old definition of default (ODOD), which gradually evolved into the new definition of default (NDOD) currently in force.
For example, a counterparty is considered in default when the debtor has payment arrears of more than 90 days on a material credit obligation.
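The 90-days-past-due criterion can be sketched as a simple check. The function below is a toy illustration: the field names and the materiality threshold are assumptions for the example, not the exact regulatory values.

```python
# Sketch of the 90-days-past-due default criterion.
# The materiality threshold below is illustrative, not the regulatory figure.
def is_in_default(days_past_due: int, past_due_amount: float,
                  materiality_threshold: float = 500.0) -> bool:
    """Flag a counterparty as defaulted when a material credit
    obligation is more than 90 days past due."""
    return days_past_due > 90 and past_due_amount >= materiality_threshold

print(is_in_default(120, 10_000.0))  # True
print(is_in_default(120, 50.0))      # False: arrears are not material
print(is_in_default(30, 10_000.0))   # False: fewer than 91 days past due
```

In practice the full NDOD assessment also covers the other criteria listed above (financial deterioration, forbearance, contagion), not just arrears.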
Once the definition of default has been clearly established, the institution can apply it to all of its clients. It may then face a potentially heterogeneous portfolio composed of large corporations, small and medium-sized enterprises (SMEs), retail clients, and sovereign entities.
To manage risk more effectively, it is essential to identify these different categories and create homogeneous sub-portfolios. This segmentation then allows each portfolio to be modeled in a more relevant and accurate way.
Definition of Filters
Defining filters makes it possible to determine the modeling scope and retain only homogeneous counterparties for analysis. Filters are variables used to delimit this scope.
These variables can be identified through statistical methods, such as clustering techniques, or defined by subject matter experts based on business knowledge.
For example, when focusing on large corporations, revenue can serve as a relevant size variable to establish a threshold. One may choose to include only counterparties with annual revenue above €30 million.
Additional variables can then be used to further characterize this segment, such as industry sector, geographic region, financial ratios, or ESG indicators.
Another modeling scope may focus exclusively on retail clients who have taken loans to finance personal projects. In this case, income can be used as a filtering variable, while other relevant characteristics may include employment status, type of collateral, and loan type.
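As a sketch of how such filters delimit a scope, the snippet below applies the €30 million revenue threshold to a toy pandas DataFrame; the column names and values are invented for the example.

```python
import pandas as pd

# Toy portfolio; column names and values are illustrative.
portfolio = pd.DataFrame({
    "counterparty_id": [1, 2, 3, 4],
    "segment": ["corporate", "corporate", "retail", "corporate"],
    "annual_revenue_eur": [45e6, 12e6, None, 80e6],
})

# Filter: keep only corporates with annual revenue above EUR 30 million.
large_corporates = portfolio[
    (portfolio["segment"] == "corporate")
    & (portfolio["annual_revenue_eur"] > 30e6)
]
print(large_corporates["counterparty_id"].tolist())  # [1, 4]
```

Additional filters (industry sector, region, financial ratios) would simply be further boolean conditions combined with `&`.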
Once the objective is clearly defined, the default definition is well specified, and the scope has been properly structured through appropriate filters, constructing the modeling dataset becomes a natural next step.
Construction of the Modeling Dataset
Since the objective is to predict the probability of default over a one-year horizon, for each year (N), we must retain all healthy counterparties, meaning those that did not default at any time during year (N) (from 01/01/N to 12/31/N).
On December 31, N, the characteristics of these healthy counterparties are observed and recorded. For example, if we focus on corporate entities, then as of 12/31/N, the values of the following variables for each counterparty are collected: turnover, industry sector, and financial ratios.
To construct the default variable for each of these counterparties, we then look at year (N+1). The variable takes the value 1 if the counterparty defaults at least once during the year (N+1), and 0 otherwise.
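The construction above can be sketched in pandas; all identifiers, dates, and column names are invented for illustration.

```python
import pandas as pd

# Illustrative sketch of the target construction for a fixed year N.
# 'defaults' lists the dates on which counterparties defaulted (if ever).
N = 2015
features = pd.DataFrame({
    "counterparty_id": [1, 2, 3],
    "turnover": [40e6, 55e6, 33e6],  # observed as of 12/31/N
})
defaults = pd.DataFrame({
    "counterparty_id": [2, 3],
    "default_date": pd.to_datetime(["2016-05-10", "2015-03-01"]),
})

# Keep only counterparties that were healthy throughout year N.
defaulted_in_N = defaults.loc[
    defaults["default_date"].dt.year == N, "counterparty_id"]
healthy = features[~features["counterparty_id"].isin(defaulted_in_N)].copy()

# Y = 1 if the counterparty defaults at least once during year N+1, else 0.
defaulted_in_N1 = defaults.loc[
    defaults["default_date"].dt.year == N + 1, "counterparty_id"]
healthy["Y"] = healthy["counterparty_id"].isin(defaulted_in_N1).astype(int)
print(healthy["Y"].tolist())  # [0, 1]: counterparty 2 defaults in 2016
```

Counterparty 3 is excluded because it already defaulted during year N, so it was not healthy as of 12/31/N.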
This variable, denoted Y or def, is the target variable of the model. The chart below illustrates the process described above.
In summary, for each fixed year (N), we obtain a rectangular dataset where:
- Each row corresponds to a counterparty that was healthy as of 12/31/N,
- The columns include all explanatory variables measured at that date, denoted (Xi) for counterparty (i),
- The final column corresponds to the target variable (Yi), which indicates whether counterparty (i) defaults at least once during the year (N+1) (1) or not (0).
For example, if (N = 2015), the explanatory variables are measured as of 12/31/2015, and the target variable is observed over the year 2016.
The regulator requires modeling datasets to be built using at least five years of historical data in order to capture different economic cycles. Since the models are calibrated over multiple periods, the regulator also requires regulatory models to be Through-the-Cycle (TTC), meaning they should be relatively insensitive to short-term macroeconomic fluctuations.
Suppose we have client data covering six years, from 01/01/2015 to 12/31/2020. By applying the procedure described above for each year (N) between 2015 and 2019, five successive datasets can be constructed.
The first dataset, corresponding to the year 2015, includes all counterparties that remained performing from 01/01/2015 to 12/31/2015. Their explanatory variables (X1, …, Xk) are measured as of 12/31/2015, while the default variable (Y) is observed over the year 2016. It takes the value 1 if the counterparty defaults at least once during 2016, and 0 otherwise.
The same process is repeated for the following years up to the 2019 dataset. This final dataset includes all counterparties that remained performing from 01/01/2019 to 12/31/2019. Their explanatory variables (X1, …, Xk) are measured as of 12/31/2019, and the default variable (Y) is observed in 2020. It takes the value 1 if the counterparty defaults at any point during 2020, and 0 otherwise.
The final modeling scope corresponds to the vertical concatenation of all datasets constructed as of 12/31/N. In our example, N ranges from 2015 to 2019. The resulting dataset can be illustrated by the rectangular table below.
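The vertical concatenation can be sketched with pandas, using the (ID x year) pair as the unique key; the yearly snapshots below are toy data.

```python
import pandas as pd

# Two illustrative yearly snapshots, each built as of 12/31/N.
snapshot_2015 = pd.DataFrame({"counterparty_id": [1, 2], "year": 2015,
                              "turnover": [40e6, 55e6], "Y": [0, 1]})
snapshot_2016 = pd.DataFrame({"counterparty_id": [1, 3], "year": 2016,
                              "turnover": [42e6, 35e6], "Y": [0, 0]})

# Stack the snapshots vertically into one modeling dataset,
# keyed by the (counterparty ID, observation year) pair.
dataset = pd.concat([snapshot_2015, snapshot_2016], ignore_index=True)
dataset = dataset.set_index(["counterparty_id", "year"])
print(len(dataset))             # 4 observations
print(dataset.index.is_unique)  # True: each (ID, year) pair appears once
```

Note that counterparty 1 appears in both 2015 and 2016: these are two distinct observations, as discussed below.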

Each statistical observation is identified by a pair consisting of the counterparty identifier and the year (ID x year) in which the explanatory variables are measured (as of 12/31/N). The number of rows equals the number of observations.
For example, the counterparty with identifier (ID = 1) may appear in both 2015 and 2018. These correspond to two distinct and independent observations in the dataset, identified respectively by the pairs (1 x 2015) and (1 x 2018).
This approach offers several advantages. In particular, it prevents temporal overlap among obligors and reduces implicit autocorrelation between observations, since each record is uniquely identified by the (id x year) pair.
In addition, it increases the likelihood of building a more robust and representative dataset. By pooling observations across multiple years, the number of default events becomes sufficiently large to support reliable model estimation. This is particularly important when analyzing portfolios of large corporations, where default events are often relatively rare.
Finally, the financial institution must implement appropriate organizational measures to ensure effective data management and security throughout the entire data lifecycle. To this end, the ECB requires financial entities to comply with common regulatory standards, such as the Digital Operational Resilience Act (DORA).
Institutions should establish a comprehensive strategic framework for information security management, as well as a dedicated data security framework specifically covering data used in internal models.
Moreover, human oversight must remain central to these processes. Procedures should therefore be thoroughly documented, and clear guidelines must be established to explain how and when human judgment should be applied.
Conclusion
Defining the model development and application scope, as well as properly documenting them, are essential steps in reducing model risk, not only at the design stage, but throughout the entire model lifecycle.
The key objective is to ensure that the development scope is representative of the intended portfolio and, when necessary, to clearly identify any extensions, restrictions, or approximations made when applying the model compared to its original design.
Preparing a standardized document that clearly defines the variables used to establish the scope is considered good practice. At a minimum, the following information should be easily identifiable: the technical name of the variable, its format, and its source.
In my next article, I will use a credit risk dataset to illustrate how to predict the probability of default for different counterparties. I will explain the steps required to properly understand the available dataset and, where possible, describe how to handle and process the different variables.
References
European Central Bank. (2025). Supervisory Guide: Guide to the SSM Supervisory Review and Evaluation Process (SREP). European Central Bank. https://www.bankingsupervision.europa.eu/ecb/pub/pdf/ssm.supervisory_guide202507.en.pdf
Image Credits
All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and Excel, unless otherwise stated.
Disclaimer
I write to learn, so mistakes are the norm, even though I try my best. Please let me know if you notice any. I am also open to any suggestions for new topics!