From NetCDF to Insights: A Practical Pipeline for City-Level Climate Risk Analysis

Editor


Climate research has largely become an exercise in handling large datasets. Large-scale Earth System Models (ESMs) and reanalysis products such as CMIP6 and ERA5 are no longer mere repositories of scientific data: they are massive, high-dimensional, petabyte-scale spatiotemporal datasets that demand extensive data engineering before they can be used for analysis.

From a machine learning and data architecture standpoint, the process of turning climate science into policy resembles a classical pipeline: raw data intake, feature engineering, deterministic modeling, and final product generation. Unlike conventional machine learning on tabular data, however, computational climatology raises far more complex issues: irregular spatiotemporal scales, non-linear climate-specific thresholds, and the imperative to retain physical interpretability.

This article presents a lightweight and practical pipeline that bridges the gap between raw climate data processing and applied impact modeling, transforming NetCDF datasets into interpretable, city-level risk insights.

The Problem: From Raw Tensors to Decision-Ready Insight

Although there has been an unprecedented global release of high-resolution climate data, turning it into location-specific, actionable insights remains non-trivial. Most of the time the problem is not a lack of data; it is the complexity of the data format.

Climate data are conventionally saved in the Network Common Data Form (NetCDF). These files:

  • Contain huge multidimensional arrays (tensors, typically of shape time × latitude × longitude × variables).
  • Require heavy spatial masking, temporal aggregation, and coordinate reference system (CRS) alignment before statistical analysis can even begin.
  • Do not map naturally onto the tabular structures (e.g., SQL databases or Pandas DataFrames) that urban planners and economists typically use.

This structural mismatch creates a translation gap: the raw physical data exist, but the socio-economic insights that should be derivable from them do not.

Foundational Data Sources

A hallmark of a solid pipeline is that it can integrate traditional baselines with forward-looking projections:

  • ERA5 Reanalysis: Provides historical climate data (1991–2020), such as temperature and humidity
  • CMIP6 Projections: Offers potential future climate scenarios based on various emission pathways

With these data sources one can perform localized anomaly detection instead of depending solely on global averages.
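As an illustrative sketch of how the two sources come together, the snippet below builds synthetic stand-ins for an ERA5-style baseline and a coarser CMIP6-style projection (all grids, values, and names here are invented, not real data; a real workflow would load the downloaded NetCDF files with xr.open_dataset), regrids one onto the other, and computes a localized anomaly:

```python
import numpy as np
import xarray as xr

# Synthetic stand-ins for an ERA5 baseline and a CMIP6 projection.
lat = np.linspace(24.0, 32.0, 9)
lon = np.linspace(64.0, 72.0, 9)
years_hist = np.arange(1991, 2021)
years_fut = np.arange(2030, 2051)

baseline = xr.DataArray(
    30 + 5 * np.random.rand(years_hist.size, lat.size, lon.size),
    coords={"year": years_hist, "lat": lat, "lon": lon},
    dims=("year", "lat", "lon"),
    name="tasmax",
)

# Projections often come on a coarser grid than reanalysis products.
coarse_lat, coarse_lon = lat[::2], lon[::2]
projection = xr.DataArray(
    32 + 5 * np.random.rand(years_fut.size, coarse_lat.size, coarse_lon.size),
    coords={"year": years_fut, "lat": coarse_lat, "lon": coarse_lon},
    dims=("year", "lat", "lon"),
    name="tasmax",
)

# Regrid the projection onto the baseline grid, then compute the local anomaly.
projection_regridded = projection.reindex(lat=lat, lon=lon, method="nearest")
anomaly = projection_regridded.mean("year") - baseline.mean("year")
```

The result is a per-cell anomaly map rather than a single global average, which is exactly what localized risk analysis needs.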

Location-Specific Baselines: Defining Extreme Heat

A critical issue in climate analysis is deciding how to define “extreme” conditions. A fixed global threshold (for example, 35°C) is not adequate since local adaptation varies greatly from one region to another.

Therefore, we characterize extreme heat by a percentile-based threshold obtained from the historical data:

import numpy as np
import xarray as xr

def compute_local_threshold(tmax_series: xr.DataArray, percentile: int = 95) -> float:
    """Return the local percentile-based extreme-heat threshold."""
    return float(np.percentile(tmax_series, percentile))

# Tmax_historical_baseline: daily maximum temperatures over the 1991-2020 window
T_threshold = compute_local_threshold(Tmax_historical_baseline)

This approach ensures that extreme events are defined relative to local climate conditions, making the analysis more context-aware and meaningful.
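A quick self-contained run makes the behaviour concrete (the series below is randomly generated for illustration, not real observations):

```python
import numpy as np
import xarray as xr

def compute_local_threshold(tmax_series: xr.DataArray, percentile: int = 95) -> float:
    return float(np.percentile(tmax_series, percentile))

# Synthetic 30-year daily-maximum temperature series centered on 34 degC.
rng = np.random.default_rng(42)
tmax = xr.DataArray(rng.normal(loc=34.0, scale=4.0, size=30 * 365), dims="time")

threshold = compute_local_threshold(tmax)  # local P95, roughly 40-41 degC here
```

For a city with a cooler baseline, the same call would yield a much lower threshold, which is precisely the point of a percentile-based definition.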

Thermodynamic Feature Engineering: Wet-Bulb Temperature

Temperature by itself is not enough to determine human heat stress accurately. Humidity, which influences the body’s cooling mechanism through evaporation, is also a major factor. The wet-bulb temperature (WBT), which is a combination of temperature and humidity, is a good indicator of physiological stress. Here is the formula we use based on the approximation by Stull (2011), which is simple and quick to compute:

import numpy as np

def compute_wet_bulb_temperature(T: float, RH: float) -> float:
    """Stull (2011) approximation: T in degrees Celsius, RH in percent."""
    wbt = (
        T * np.arctan(0.151977 * np.sqrt(RH + 8.313659))
        + np.arctan(T + RH)
        - np.arctan(RH - 1.676331)
        + 0.00391838 * RH**1.5 * np.arctan(0.023101 * RH)
        - 4.686035
    )
    return wbt

Sustained wet-bulb temperatures above 31–35°C approach the limits of human survivability, making this a critical feature in risk modeling.
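Because the formula uses only NumPy ufuncs, it works elementwise on whole arrays as well as on scalars. A sanity check at 35°C and 75% relative humidity (illustrative inputs, not observations) shows how hot, humid air approaches that danger zone:

```python
import numpy as np

def compute_wet_bulb_temperature(T, RH):
    # Stull (2011) approximation: T in degC, RH in percent.
    return (
        T * np.arctan(0.151977 * np.sqrt(RH + 8.313659))
        + np.arctan(T + RH)
        - np.arctan(RH - 1.676331)
        + 0.00391838 * RH**1.5 * np.arctan(0.023101 * RH)
        - 4.686035
    )

# Hot and humid conditions push WBT toward the survivability limit.
wbt = compute_wet_bulb_temperature(35.0, 75.0)  # roughly 31 degC
```

At fixed temperature, raising humidity raises the wet-bulb temperature, which is why humid heat is far more dangerous than dry heat at the same thermometer reading.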

Translating Climate Data into Human Impact

To move beyond physical variables, we translate climate exposure into human impact using a simplified epidemiological framework.

def estimate_heat_mortality(population, base_death_rate, exposure_days, AF):
    """Excess deaths from heat exposure; AF is the attributable fraction."""
    return population * base_death_rate * exposure_days * AF

In this case, mortality is modeled as a function of population, baseline death rate, exposure duration, and an attributable fraction representing risk.

While simplified, this formulation enables the translation of temperature anomalies into interpretable impact metrics such as estimated excess mortality.
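A hypothetical call (all numbers below are invented for illustration and are not taken from the case study) shows the units at work: with base_death_rate read as a daily per-person rate, the product yields excess deaths over the exposure window:

```python
def estimate_heat_mortality(population, base_death_rate, exposure_days, AF):
    # Excess deaths = people at risk x baseline daily death rate
    #                 x days of exposure x attributable fraction.
    return population * base_death_rate * exposure_days * AF

# Hypothetical inputs: 1M residents, ~20 baseline deaths/day,
# 30 extreme-heat days, 5% of deaths attributable to heat.
excess = estimate_heat_mortality(1_000_000, 2e-5, 30, 0.05)  # about 30 excess deaths
```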

Economic Impact Modeling

Climate change also affects economic productivity. Empirical studies suggest a non-linear relationship between temperature and economic output, with productivity declining at higher temperatures.
We approximate this using a simple polynomial function:

def compute_economic_loss(temp_anomaly):
    # Quadratic damage curve: losses grow with deviation from a 13 degC optimum.
    return 0.0127 * (temp_anomaly - 13)**2

Although simplified, this captures the key insight that economic losses accelerate as temperatures deviate from optimal conditions.
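Evaluating the curve at a few temperatures (interpreting the argument as an annual mean temperature with an assumed optimum of 13°C; the units of the output are illustrative, not calibrated) shows the accelerating-loss behaviour:

```python
def compute_economic_loss(temp_anomaly):
    # Quadratic damage curve with an assumed 13 degC optimum:
    # zero loss at the optimum, accelerating losses away from it.
    return 0.0127 * (temp_anomaly - 13) ** 2

# Losses at 13, 20, and 25 degC: zero at the optimum, then sharply rising.
losses = [compute_economic_loss(t) for t in (13, 20, 25)]
```

Moving from 20°C to 25°C roughly triples the modeled loss, capturing the convexity that empirical temperature-output studies report.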

Case Study: Contrasting Climate Contexts

To illustrate the pipeline, we consider two contrasting cities:

  • Jacobabad (Pakistan): A city with extreme baseline heat
  • Yakutsk (Russia): A city with a cold baseline climate

Figure: Localized P95 thresholds, highlighting how extreme heat is defined relative to regional temperature distributions rather than fixed global limits (Image by author).
City       Population   Baseline Deaths/Yr   Heat Risk (%)   Estimated Heat Deaths/Yr
Jacobabad  1.17M        ~8,200               0.5%            ~41
Yakutsk    0.36M        ~4,700               0.1%            ~5

Despite using the same pipeline, the outputs differ significantly due to local climate baselines. This highlights the importance of context-aware modeling.

Pipeline Architecture: From Data to Insight

The full pipeline follows a structured workflow:

import xarray as xr
import numpy as np

# 1. Ingest: load the raw NetCDF dataset.
ds = xr.open_dataset("cmip6_climate_data.nc")

# 2. Extract: select daily maximum temperature at the city's coordinates.
tmax = ds["tasmax"].sel(lat=28.27, lon=68.43, method="nearest")

# 3. Baseline: percentile threshold from the 1991-2020 historical window.
threshold = np.percentile(tmax.sel(time=slice("1991", "2020")), 95)

# 4. Detect: flag future days that exceed the local threshold.
future_tmax = tmax.sel(time=slice("2030", "2050"))
heat_days_mask = future_tmax > threshold

End-to-end workflow from raw NetCDF ingestion to impact modeling (Image by author).

This method can be divided into a series of steps that reflect a traditional data science workflow. It begins with data ingestion: loading raw NetCDF files into a computational environment. Next, spatial feature extraction pinpoints relevant variables, such as maximum temperature, at a given geographic coordinate. Baseline computation then uses historical data to determine the percentile-based threshold that designates extreme conditions.

Once the baseline is fixed, anomaly detection flags future time intervals in which temperatures exceed the threshold, quite literally identifying heat events. Finally, these detected events are passed to impact models that convert them into interpretable results such as mortality counts and economic damage.

When properly optimized, this sequence of operations allows large-scale climate datasets to be processed efficiently, transforming complex multi-dimensional data into structured and interpretable outputs.

Limitations and Assumptions

Like any analytical pipeline, this one depends on a set of simplifying assumptions that should be kept in mind when interpreting the results. The mortality estimates assume uniform population vulnerability, which ignores variation in age structure, social conditions, and access to infrastructure such as cooling systems. The economic impact assessment is likewise a rough sketch that overlooks sector-specific sensitivities and local adaptation strategies. The climate projections themselves carry intrinsic uncertainty, stemming from differences between climate models and from future emission scenarios. Finally, the spatial resolution of global datasets can smooth over local effects such as urban heat islands, potentially underestimating risk in densely populated urban environments.

Overall, these limitations point to the fact that the results of this pipeline should not be taken literally as precise forecasts but rather as exploratory estimates that can provide directional insight.

Key Insights

This pipeline illustrates several key lessons at the intersection of climate science and data science. First, the main difficulty in climate studies is not modeling complexity but the enormous data engineering effort needed to turn raw, high-dimensional datasets into usable formats. Second, the integration of multiple domain models, combining climate data with epidemiological and economic frameworks, frequently provides more practical value than improving any single component on its own. Finally, transparency and interpretability prove to be essential design principles: well-organized, traceable workflows enable validation, trust, and broader adoption among researchers and decision-makers.

Conclusion

Climate datasets are rich but complicated. Without structured pipelines, their value remains hidden from decision-makers.

By applying data engineering principles and incorporating domain-specific models, one can convert raw NetCDF data into actionable, city-level climate projections. The same approach illustrates how data science can help close the divide between climate scientists and decision-makers.

A simple implementation of this pipeline can be explored here for reference:
https://openplanet-ai.vercel.app/

References

  • [1] Gasparrini A., Temperature-related mortality (2017), Lancet Planetary Health
  • [2] Burke M., Temperature and economic production (2018), Nature
  • [3] Stull R., Wet-bulb temperature (2011), Journal of Applied Meteorology
  • [4] Hersbach H., ERA5 reanalysis (2020), ECMWF
