From Connections to Meaning: Why Heterogeneous Graph Transformers (HGT) Change Demand Forecasting



Forecasting errors are not caused by bad time-series models.

They are caused by ignoring structure.

SKUs do not behave independently. They interact through shared plants, product groups, warehouses, and storage locations. A demand shock to one SKU often propagates to others — yet most forecasting systems model each SKU in isolation.

In the previous article, we showed that explicitly modeling these connections matters. Using a real FMCG supply-chain graph, a simple Graph Neural Network (GraphSAGE) reduced SKU-level forecast error by over 27% compared to a strong naïve baseline, purely by allowing information to flow across related SKUs.

But GraphSAGE makes a simplifying assumption: all relationships are equal.

A shared plant is treated the same as a shared product group. Substitutes and complements are averaged into a single signal. This limits the model’s ability to anticipate real demand shifts.

This article explores what happens when the model is allowed not just to see the supply-chain network, but to understand the meaning of each relationship inside it.

We show how Heterogeneous Graph Transformers (HGT) introduce relationship-aware learning into demand forecasting, and why that seemingly small change produces more anticipatory forecasts, tighter error distributions, and materially better outcomes, even on intermittent, daily per-SKU demand. It turns connected forecasts into meaning-aware, operationally grounded predictions.

A brief recap: What GraphSAGE told us

In the previous article, we trained a spatio-temporal GraphSAGE model on a real FMCG supply-chain graph with:

  • 40 SKUs
  • 9 plants
  • 21 product groups
  • 36 subgroups
  • 13 storage locations

Each SKU was connected to others through shared plants, groups, and locations — creating a dense web of operational dependencies. The temporal characteristics displayed lumpy production and intermittent demand, a common scenario in FMCG.

GraphSAGE allowed each SKU to aggregate information from its neighbors. That produced a large jump in forecast quality.

Model            WAPE (SKU-daily)
Naïve baseline   0.86
GraphSAGE        ~0.62

At the hardest possible level — daily, per-SKU, intermittent demand — a WAPE of ~0.62 is already almost production-grade in FMCG.
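For reference, WAPE (weighted absolute percentage error) is the sum of absolute errors divided by total actual volume. A minimal sketch, with an illustrative intermittent series and a naïve "yesterday equals today" forecast:

```python
import numpy as np

def wape(actual, forecast):
    """Weighted Absolute Percentage Error: total absolute error
    divided by total actual volume."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.abs(actual - forecast).sum() / actual.sum()

# Illustrative intermittent daily demand for one SKU (many zeros, a few spikes)
actual = np.array([0, 0, 12, 0, 5, 0, 30, 0, 0, 8])
naive = np.roll(actual, 1)  # yesterday's demand used as today's forecast

print(wape(actual, naive))  # 2.0 -- naïve forecasts fare badly on spiky series
```

Because the denominator is total volume, WAPE weights high-volume days more heavily than a per-day percentage error would, which is why it is the metric of choice for intermittent demand.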

But the error plots showed something important:

  • The model followed trends well
  • It handled zeros well
  • But it smoothed away extreme spikes
  • And it reacted instead of anticipating

This is because GraphSAGE assumes that all relationships are equal. Treating every relation with the same weight means the model cannot learn that:

  • A demand spike in a complementary SKU in the same plant should increase my forecast
  • But a spike in a substitute SKU in the same product group should reduce it

Let’s see how Heterogeneous Graph Transformer (HGT) addresses the challenge.

What HGT adds: Relationship-aware learning

Heterogeneous Graph Transformers are built for graphs where:

  • There are multiple types of nodes (SKUs, plants, warehouses, groups) and/or
  • There are multiple types of edges (shared plant, shared product group, etc.)

In this case, while all nodes in the graph are SKUs, the relationships between them are heterogeneous. Here, HGT is not used to model multiple entity types, but to learn relation-aware message passing.

The model learns separate transformation and attention mechanisms for each type of SKU–SKU relationship, allowing demand signals to propagate differently depending on why two SKUs are connected.

It learns:

“How should information flow across each type of relationship?”

Formally, instead of one aggregation function, HGT learns:

\[
h_i = \sum_{r \in \{\text{plant}, \text{group}, \text{subgroup}, \text{storage}\}}
\sum_{j \in N_r(i)} \alpha_{r,i,j} W_r h_j
\]

where

  • r represents the type of operational relationship between SKUs (shared plant, product group, etc.)
  • Wᵣ allows the model to treat each relationship differently
  • αᵣ,ᵢ,ⱼ lets the model focus on the most influential neighbors
  • The set N_r(i) contains all SKUs directly connected to SKU i through relationship r.
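To make the formula concrete, here is a simplified numpy sketch of relation-specific aggregation. It is not the full HGT parameterization (real HGT uses separate key/query/value projections per type and multi-head attention); the embeddings, weight matrices, and neighbor sets below are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                     # embedding dimension (illustrative)
relations = ["plant", "group", "subgroup", "storage"]

h = rng.normal(size=(6, d))               # toy embeddings for 6 SKUs
W = {r: rng.normal(size=(d, d)) for r in relations}   # one W_r per relation

# Neighbor sets N_r(i) for SKU i = 0: which SKUs share which relation with it
neighbors = {"plant": [1, 2], "group": [3], "subgroup": [4], "storage": [5]}

def aggregate(i):
    """h_i = sum_r sum_{j in N_r(i)} alpha_{r,i,j} * W_r @ h_j"""
    out = np.zeros(d)
    for r in relations:
        idx = neighbors[r]
        # attention alpha_{r,i,j}: softmax over dot-product scores,
        # normalized within each relation type
        scores = np.array([h[i] @ W[r] @ h[j] for j in idx])
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()
        for a, j in zip(alpha, idx):
            out += a * (W[r] @ h[j])
    return out

print(aggregate(0).shape)  # (4,)
```

The key point is that each relation type gets its own W_r and its own attention normalization, so a "shared plant" neighbor and a "shared subgroup" neighbor contribute through different learned channels.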

This lets the model learn, for example:

  • Plant edges propagate capacity and production signals
  • Product-group edges propagate substitution and demand transfer
  • Warehouse edges propagate inventory buffering

The graph becomes economically meaningful, not just topologically connected.

Implementation (high-level)

Just like in the GraphSAGE model, we use the same SupplyGraph dataset, temporal features, log1p normalization, and 14-day sliding window.
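As a sketch of that preparation step (the demand series below is illustrative), the log1p normalization and 14-day sliding window might look like:

```python
import numpy as np

def make_windows(series, window=14):
    """Turn one SKU's daily demand into (14-day history, next-day target)
    pairs on the log1p scale. log1p handles the many zero-demand days."""
    x = np.log1p(np.asarray(series, dtype=float))
    X, y = [], []
    for t in range(len(x) - window):
        X.append(x[t:t + window])
        y.append(x[t + window])
    return np.array(X), np.array(y)

# Illustrative intermittent series: 17 days -> 3 training windows
demand = [0, 3, 0, 0, 7, 2, 0, 0, 0, 5, 1, 0, 4, 0, 6, 0, 2]
X, y = make_windows(demand)
print(X.shape, y.shape)  # (3, 14) (3,)
```

Predictions made on the log1p scale are mapped back to units with expm1 before computing business metrics such as WAPE.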

The difference is in the spatial encoder. The following is an overview of the architecture.

  1. Heterogeneous Graph Encoder
    • Nodes: SKUs
    • Edges: shared plant, shared group, shared sub-group and shared storage
    • HGT layers learn relation-specific message passing
  2. Temporal Encoder
    • A time-series encoder processes the last 14 days of embeddings
    • This captures how the graph evolves over time
  3. Output Head
    • A regressor predicts next-day log1p sales per SKU

Everything else — training, loss, evaluation — remains identical to GraphSAGE. So any difference in performance comes purely from better structural understanding.

The housing market analogy — now with meaning

In the previous article, we used a simple housing-market analogy to explain why graph-based forecasting works.

Let’s upgrade it.

GraphSAGE: structure without meaning

GraphSAGE is like predicting the price of your house by looking at:

  • The historical price of your house
  • The average price movement of nearby houses

This already improves over treating your house in isolation. But GraphSAGE makes a critical simplifying assumption:

All neighbors influence your house in the same way.

In practice, this means GraphSAGE treats all nearby entities as identical signals. A luxury villa, a school, a shopping mall, a highway, or a factory are all just “neighbors” whose price signals get averaged together.

The model learns that houses are connected — but not why they are connected.

HGT: structure with meaning

Now imagine a more realistic housing model.

Every data point is still a house — there are no different node types.
But houses are connected through different kinds of relationships:

  • Some share the same school district
  • Some share the same builder or construction quality
  • Some are near parks
  • Others are near highways or industrial zones

Each of these relationships affects prices differently.

  • Schools and parks tend to increase value
  • Highways and factories often reduce it
  • Luxury houses matter more than neglected ones

A Heterogeneous Graph Transformer (HGT) learns these distinctions explicitly. Instead of averaging all neighbor signals, HGT learns:

  • which type of relationship a neighbor represents, and
  • how strongly that relationship should influence the prediction.

That distinction is what turns a connected demand forecast into a meaning-aware, operationally grounded prediction.

Comparison of Results

Here is the WAPE comparison for HGT, GraphSAGE, and the naïve baseline:

Model            WAPE
Naïve baseline   0.86
GraphSAGE        0.62
HGT              0.58

At a daily per-SKU WAPE below 0.60, the Heterogeneous Graph Transformer (HGT) delivers a clear production-grade step change over both traditional forecasting and GraphSAGE. The results show a ~32% reduction in misallocated demand vs. traditional forecasting and a further 6–7% improvement over GraphSAGE.

The following scatter chart depicts actual vs. predicted sales on the log1p scale for both GraphSAGE (purple dots) and HGT (cyan dots). While both models perform well, the purple GraphSAGE dots show greater dispersion than the tightly clustered cyan HGT ones, consistent with the 6–7% improvement in WAPE.

Actual vs predicted (GraphSAGE vs HGT)

At the scale of this dataset (≈ 1.1 million units), that improvement translates into ~45,000 fewer units misallocated over the evaluation period.
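The arithmetic behind these figures can be checked directly from the WAPE table above, treating misallocated units as WAPE × total volume:

```python
total_units = 1_100_000                  # approximate dataset volume

wape_naive, wape_sage, wape_hgt = 0.86, 0.62, 0.58

# Units misallocated = WAPE x total volume, so the saving between two
# models is simply the WAPE gap times volume.
saved_vs_sage = (wape_sage - wape_hgt) * total_units
reduction_vs_naive = (wape_naive - wape_hgt) / wape_naive

print(f"{saved_vs_sage:,.0f}")      # 44,000 -> the "~45,000 fewer units"
print(f"{reduction_vs_naive:.1%}")  # 32.6% -> the "~32% reduction"
```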

Operationally, reducing misallocation by this magnitude leads to:

  • Fewer emergency production changes
  • Lower expediting and premium freight costs
  • More stable plant and warehouse operations
  • Better service levels on high-volume SKUs
  • Less inventory trapped in the wrong locations

Importantly, these improvements come without adding business rules, planner overrides, or manual tuning.

And the bias comparison is as follows:

Model        Mean Forecast (Units)   Bias (Units)   Bias %
Naïve        ~701                    0              0%
GraphSAGE    ~733                    +31            ~4.5%
HGT          ~710                    +8.4           ~1.2%
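Forecast bias is just the average gap between forecast and actual, expressed in units and as a share of mean demand. A minimal sketch on an illustrative series constructed to carry HGT's ~1.2% upward bias:

```python
import numpy as np

# Illustrative daily actuals; the forecast is built with a 1.2% upward bias
actual = np.array([700.0, 710.0, 695.0, 698.0])
forecast = actual * 1.012

bias_units = (forecast - actual).mean()  # mean signed error, in units
bias_pct = bias_units / actual.mean()    # bias as a share of mean demand

print(round(bias_pct * 100, 1))  # 1.2
```

Unlike WAPE, bias is signed: over- and under-forecasts cancel, which is why it is reported alongside WAPE rather than instead of it.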

HGT introduces a very small positive bias — roughly 1–2%.

This is well within production-safe limits and aligns with how FMCG planners operate in practice, where a slight upward bias is often preferred to avoid stock-outs. The following histogram shows a roughly Gaussian error distribution centered around zero, indicating unbiased performance on typical forecasting days.

Prediction error

The real difference between GraphSAGE and HGT is evident when we compare the forecasts for the top-4 SKUs by volume. Here is the GraphSAGE chart:

Forecast v Actual – Top 4 SKUs (GraphSAGE)

And the same for HGT :

Forecast v Actual – Top 4 SKUs (HGT)

The distinction is evident in the area highlighted in the first chart, and across all the other SKUs:

  • HGT is not reactive like GraphSAGE. It produces a stronger forecast, anticipating and tracking the peaks and troughs of actual demand rather than smoothing out the fluctuations.
  • This is a result of learning the structural relations between neighboring SKUs differently per relation type, which lets the model predict a change in demand confidently before it materializes.

And finally, the performance across SKUs with non-zero volumes clearly shows that all the high-volume SKUs have a WAPE < 0.60, which is desirable for a production forecast and is an improvement over GraphSAGE.

Performance across SKUs

Explainability

HGT makes it practical to add explainability to the forecasts — essential for planners to have confidence in the drivers behind each prediction. When the model predicts a dip and we can show it is because "Neighbor X in the same subgroup is trending down," planners can validate the signal against real-world logistics, turning an AI prediction into actionable business insight.

Let's look at the influence of different spatial and temporal features during the first 7 days and the last 7 days of the test period for the highest-volume SKU (SOS001L12P). Here is the comparison of the temporal features:

Evolution of temporal features

And the spatial features:

Evolution of spatial features

The charts show that different features and SKU/edges play a role during different time periods:

  • For the first 7 days, Sales Lag (7d) has the maximum influence (23%), which shifts to Rolling Mean (21%) for the last 7 days.
  • Similarly, during the initial 7 days there is heavy reliance on SOS005L04P, likely a primary storage node or precursor SKU that dictates immediate availability. By the end of the test duration, the influence redistributes: SOS005L04P shares the stage with SOS002L09P (~40% share each), both from the same subgroup as our target SKU. This suggests the model is now aggregating signals from a broader subgroup of related products to form a more holistic view.
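Influence shares like these come from normalizing the model's raw attention scores, e.g. with a softmax. In the sketch below the scores are made up, and the third neighbor (SOS011L18P) is hypothetical; only SOS005L04P and SOS002L09P appear in the actual charts:

```python
import numpy as np

def influence_shares(scores):
    """Softmax: turn raw attention scores into shares that sum to 1."""
    s = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return s / s.sum()

# Made-up raw attention scores for SOS001L12P's neighbors
# (SOS011L18P is a hypothetical third neighbor, for illustration only)
neighbors = ["SOS005L04P", "SOS002L09P", "SOS011L18P"]
shares = influence_shares(np.array([2.1, 2.0, 0.3]))

for name, share in zip(neighbors, shares):
    print(f"{name}: {share:.0%}")  # 48%, 44%, 8%
```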

This type of analysis is crucial for understanding and forecasting the impact of marketing campaigns, promotions, or external factors such as interest rates on specific SKUs. Such factors should be included in the spatial structure as additional nodes in the graph, with the affected SKUs linked to them.

Not All Supply Chains Are Created Equal

The use case here is relatively simple, with only SKUs as nodes. That works because in FMCG, plants and warehouses act largely as buffers: they smooth volatility but rarely hard-stop the system. HGT could therefore learn much of their effect purely from edge types like shared plant or shared warehouse, without modeling them as explicit nodes.

Supply chains can be far more complex. Automotive is a good example: a paint shop, engine line, or regional distribution center is a hard capacity bottleneck, and when it is constrained, demand for specific trims or colors collapses regardless of market demand. In that setting, HGT still benefits from typed relationships, but it also requires explicit plant and warehouse nodes with their own time-series signals (capacity, output, backlogs, delays) to model how supply-side physics interact with customer demand. In other words, FMCG needs structure-aware graphs; automotive needs causality-aware graphs.

Other factors are common across industries: promotions, marketing spend, seasonality, and external influences such as economic conditions (e.g., fuel prices) or competitor launches in a segment. These also affect SKUs in different ways. For example, a fuel-price increase or a new regulation may dampen sales of ICE vehicles and increase sales of electric ones. Such factors need to be included in the graph as nodes, with their relations to the SKUs captured in the spatial model, and their temporal features need to include the historical data from when the events occurred. This would enable HGT to learn the effects of these factors on demand in the weeks and months following the event.

Key Takeaways

  • Supply-chain demand is not just connected — it is structured. Treating all SKU relationships as equal leaves predictive power on the table.
  • GraphSAGE proves that networks matter: simply allowing SKUs to exchange information across shared plants, groups, and locations delivers a large accuracy jump over classical forecasting.
  • Heterogeneous Graph Transformers go one step further by learning why SKUs are connected. A shared plant, a shared subgroup, and a shared warehouse do not propagate demand in the same way — and HGT learns that distinction directly from data.
  • That structural awareness translates into real outcomes: lower WAPE, tighter forecast dispersion, better peak anticipation, and materially fewer misallocated units — without business rules, manual tuning, or planner overrides.
  • Explainability becomes operational, not cosmetic. Relation-aware attention allows planners to trace forecasts back to economically meaningful drivers, turning predictions into trusted decisions.
  • The broader lesson: as supply chains grow more interdependent, forecasting models must evolve from time-series-only to relationship-aware systems. In FMCG this means structure-aware graphs; in more constrained industries like automotive, it means causality-aware graphs with explicit bottlenecks.

In short: when the model understands the meaning of connections, forecasting stops being reactive — and starts becoming anticipatory.

What’s next? From Concepts to Code

Across this article and the previous one, we moved step by step through the evolution of demand forecasting — from isolated time-series models, to GraphSAGE, and finally to Heterogeneous Graph Transformers — showing how each shift progressively improves forecast quality by better reflecting how real supply chains operate.

The next logical step is to move from concepts to code.

In the next article, we will translate these ideas into an end-to-end, implementable workflow. Using focused code examples, we will walk through how to:

  • Construct the supply-chain graph and define relationship types
  • Engineer temporal features for intermittent, SKU-level demand
  • Design and train GraphSAGE and HGT models
  • Evaluate performance using production-grade metrics
  • Visualize forecasts, errors, and relation-aware attention
  • Add explainability so planners can understand why a forecast changed

The goal is not just to show how to train a model, but how to build a production-ready, interpretable graph-based forecasting system that practitioners can adapt to their own supply chains.

If this article explained why structure and meaning matter, the next one will show exactly how to make them work in code.

Connect with me and share your comments at www.linkedin.com/in/partha-sarkar-lets-talk-AI

Reference

Azmine Toushik Wasi, MD Shafikul Islam, Adipto Raihan Akib. SupplyGraph: A Benchmark Dataset for Supply Chain Planning using Graph Neural Networks.

Images used in this article are generated using Google Gemini. Charts and underlying code created by me.
