EmoNet: Speaker-Aware Transformers for Emotion Recognition — and What I’d Build Differently in 2026

Contents

– What ERC is, and why text-only is hard
– The 2024 landscape
– Three contributions, with intuition
    1. Global Speaker Identity
    2. Speaker Behaviour Module
    3. Weighted Cross-Entropy Loss
– Results: what worked, and what surprised me
– Reflection (2026): the field moved, and so should we
– Where this leaves me

In 2024, I submitted my MS thesis on Emotion Recognition in Conversation (ERC). The model, EmoNet, achieved a Weighted F1 of 39.18 on EmoryNLP. That was competitive with the public PapersWithCode leaderboard at the time, sitting between TUCORE-GCN_RoBERTa (39.24) and S+PAGE (39.14), and it improved over my chosen baseline, CoMPM, by +1.81 F1.

Two years later, I returned to look at where the field is now. The leaderboard is unrecognizable. The top entries are no longer encoder-only models with clever attention heads — they’re LLaMA-2–7B-based systems with LoRA fine-tuning and retrieval-augmented prompting: InstructERC, CKERC, BiosERC, LaERC-S. The methods are different. The compute is different. The mindset is different.

And yet — when I read these new papers carefully, the core ideas I proposed in EmoNet show up inside them, just implemented at a different layer of the stack. This is the story of what I built, where it placed, and what I’d build now if I were starting over.

What ERC is, and why text-only is hard

Emotion Recognition in Conversation is the task of assigning an emotion label to each utterance in a multi-turn dialogue. It’s distinct from sentiment analysis on isolated sentences in one important way: the emotion of an utterance is shaped by what came before it, and by who is speaking.

Consider this exchange from the EmoryNLP dataset (sourced from the TV show Friends):

Monica: Wendy, we had a deal! Yeah, you promised! Wendy! Wendy! Wendy!   [Mad]

Rachel: Who was that?   [Neutral]

Monica: Wendy bailed. I have no waitress.   [Mad]

In isolation, “Who was that?” is emotionally neutral. But the label Neutral is only meaningful in context: it sits between two angry utterances from a different speaker. ERC models must capture this conversational dynamic.
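To make the task format concrete, here is a minimal sketch of the exchange above as data. The `Utterance` class and the `context_for` helper are hypothetical names chosen for illustration; they are not part of the EmoryNLP release or the thesis code. The sketch also assumes the model sees every turn up to and including the target, which is one common setup rather than the only one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    speaker: str
    text: str
    label: str  # gold emotion label for this utterance


# The three-turn exchange quoted above.
dialogue = [
    Utterance("Monica", "Wendy, we had a deal! Yeah, you promised! Wendy! Wendy! Wendy!", "Mad"),
    Utterance("Rachel", "Who was that?", "Neutral"),
    Utterance("Monica", "Wendy bailed. I have no waitress.", "Mad"),
]


def context_for(dialogue, index):
    """Return the model input for one target utterance: every turn up to
    and including the target, each prefixed with its speaker."""
    return [f"{u.speaker}: {u.text}" for u in dialogue[: index + 1]]


# One prediction per utterance; the input grows as the dialogue unfolds.
for i, utt in enumerate(dialogue):
    print(len(context_for(dialogue, i)), "turn(s) of context ->", utt.label)
```

The point of the structure is that the label belongs to a single utterance, while the input is the conversation so far, speakers included.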

There’s a second wrinkle: multimodal information is missing. In real human conversation, tone of voice, facial expressions, and body language carry an enormous share of emotional signal. Text-only ERC strips all of that away. The same words — “Oh, great.” — can be sincere or sarcastic, and the text alone often can’t tell you which.

This information loss is the central challenge: a text-only model has to recover emotion from a far noisier signal than a human listener, who hears the tone and sees the face.

The 2024 landscape

When I started my thesis in late 2023, the EmoryNLP leaderboard was dominated by transformer-based architectures with various clever modifications. A quick tour:

– KET (Zhong et al., 2019) — knowledge-enriched transformer with affective graph attention, the first paper to bring transformers to ERC.

– DialogueGCN (Ghosal et al., 2019) — graph convolutional network that converted dialogues into node-classification problems.

– RGAT (Ishiwatari et al., 2020) — relation-aware graph attention with relational position encoding for speaker dependencies.

– DialogXL (Shen et al., 2020) — adapted XLNet with utterance recurrence and dialogue self-attention.

– HiTrans (Li et al., 2020) — hierarchical transformer with pairwise utterance speaker verification as an auxiliary task.

– TUCORE-GCN (Lee & Choi, 2021) — heterogeneous dialogue graph with speaker-aware BERT.

– CoMPM (Lee & Lee, 2021) — combined dialogue context with pre-trained memory tracking for the speaker.

I chose CoMPM as my base for two reasons. First, it explicitly modeled the speaker’s pre-trained memory as a separate module — which mapped to my intuition that who is speaking matters as much as what they’re saying. Second, its architecture was modular enough to extend without rewriting from scratch. The CoMPM paper showed that adding pre-trained memory to the context model gave a measurable boost — but their speaker identity was still local to each dialogue. The moment a new conversation began, everything the model had learned about a speaker was discarded.

That seemed like a problem worth solving.

Three contributions, with intuition

1. Global Speaker Identity

The problem. In CoMPM and most prior work, speaker IDs are scoped to a single dialogue. Speaker A in scene 1 has no relationship to Speaker A in scene 14, even when they’re the same person. Hence, every dialogue starts cold.

The intuition. People have characteristic emotional patterns. Monica gets angry about specific things; Phoebe is reliably cheerful; Ross has predictable bouts of insecurity. If a model can carry information about this specific speaker across dialogues, it should be able to make better-calibrated predictions when that speaker reappears.

The implementation. Each unique speaker in the entire dataset gets a stable, dataset-wide ID. The first time Monica Geller appears, she’s assigned an ID (say, ID 7) that stays with her. In every subsequent appearance, across episodes, seasons, and scenes, she remains ID 7. The model can now learn speaker-specific patterns that persist.
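The difference between dialogue-local and dataset-wide IDs is easy to see in a few lines. This is a minimal sketch, not the thesis code: `GlobalSpeakerRegistry` and `local_ids` are hypothetical names, the two "scenes" are toy speaker sequences, and it assumes speaker names are spelled consistently across the dataset.

```python
class GlobalSpeakerRegistry:
    """Assigns each speaker name one integer ID for the whole dataset."""

    def __init__(self):
        self._ids = {}

    def get_id(self, name):
        # First sighting: hand out the next unused ID. Later sightings reuse it.
        if name not in self._ids:
            self._ids[name] = len(self._ids)
        return self._ids[name]

    def __len__(self):
        return len(self._ids)


def local_ids(speakers):
    """Dialogue-local indexing, for contrast: IDs restart in every scene."""
    seen = {}
    return [seen.setdefault(s, len(seen)) for s in speakers]


# Toy speaker sequences standing in for two scenes far apart in the data.
scene_1 = ["Monica Geller", "Rachel Green", "Monica Geller"]
scene_14 = ["Ross Geller", "Monica Geller"]

registry = GlobalSpeakerRegistry()
ids_scene_1 = [registry.get_id(s) for s in scene_1]
ids_scene_14 = [registry.get_id(s) for s in scene_14]

print("global:", ids_scene_1, ids_scene_14)
print("local: ", local_ids(scene_1), local_ids(scene_14))
```

With local indexing, Monica is speaker 0 in one scene and speaker 1 in the other. With the registry she keeps the same ID in both, which is what lets anything keyed on that ID accumulate across dialogues.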

This sounds obvious in retrospect. In 2024 it was not how the leaderboard models worked.

2. Speaker Behaviour Module

The problem. Global Speaker Identity alone is just a label. To make it useful, the model needs to do something with the speaker’s accumulated history. How do you give a transformer access to “everything Monica has ever said in this dataset,” without blowing out the context window or making training intractable?

The intuition. Recurrence. A GRU is a natural fit for sequentially compressing a speaker’s historical utterances into a single fixed-size representation. Recent utterances contribute more; older ones gradually dilute. A configurable sliding window bounds the GRU’s input — say, the last N utterances by this speaker — keeping compute and memory predictable.
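The sliding window is the part that keeps this cheap, and it can be sketched without any neural network at all. The `SpeakerHistory` class below is a hypothetical illustration, not the thesis implementation; it stores raw text, whereas a real pipeline would more likely cache utterance embeddings.

```python
from collections import defaultdict, deque


class SpeakerHistory:
    """Keeps the last `window` utterances per global speaker ID."""

    def __init__(self, window):
        # deque(maxlen=N) silently drops the oldest item once it holds N.
        self._buffers = defaultdict(lambda: deque(maxlen=window))

    def add(self, speaker_id, utterance):
        self._buffers[speaker_id].append(utterance)

    def recent(self, speaker_id):
        # Oldest first, which is the order a recurrent encoder would read them in.
        return list(self._buffers[speaker_id])


history = SpeakerHistory(window=2)
for text in ["We had a deal!", "Wendy bailed.", "I have no waitress."]:
    history.add(7, text)  # 7 is an arbitrary example speaker ID

print(history.recent(7))
```

Because the buffer is keyed by the dataset-wide ID rather than by dialogue, it carries over when a new conversation starts, and its size never exceeds the window regardless of how often a speaker appears.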

The implementation. Each utterance is independently encoded by a pre-trained RoBERTa backbone. The resulting embeddings flow through a unidirectional GRU. The GRU’s final hidden state — call it `kt` — represents the speaker’s behavioral pattern at the current moment. This is projected into the same dimension as the dialogue context output and added in. The combined signal feeds the final classifier.
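Written out, the data flow in that paragraph looks roughly like this. The symbols are notation for this sketch only: e_i is the RoBERTa embedding of the speaker's i-th utterance in the window, c_t is the dialogue context output, and p_t and o_t are intermediate names. Assigning W_p to the projection and W_o to the classifier is an assumption: the figure caption names both matrices without saying which does what.

```latex
\begin{aligned}
k_t &= \mathrm{GRU}(e_1, e_2, \dots, e_N) && \text{final hidden state over the speaker's last } N \text{ utterances} \\
p_t &= W_p\, k_t && \text{project to the dialogue-context dimension} \\
o_t &= c_t + p_t && \text{add to the dialogue context output} \\
P(\text{emotion} \mid u_t) &= \mathrm{softmax}(W_o\, o_t) && \text{final classifier}
\end{aligned}
```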

The architecture is structurally similar to CoMPM’s pre-trained memory module, but with two key differences: the speaker-history pool is global (not local to the current dialogue), and the GRU explicitly models temporal decay.

Figure: EmoNet Architecture (Image by author). The model consists of two modules: a Dialogue context embedding module and a Speaker behaviour module. The figure shows an example of predicting the emotion of u6 from a 6-turn dialogue context. A, D, and Y refer to the participants in the conversation, where S_A = S_u1 = S_u4 = S_u6, S_D = S_u2, and S_Y = S_u3 = S_u5. W_o and W_p are linear matrices.

3. Weighted Cross-Entropy Loss

The problem. EmoryNLP is imbalanced — Neutral outnumbers Sad by roughly 4.5:1. Most papers handle this with data augmentation or under-sampling. But conversational data is sequential: dropping or duplicating utterances distorts the natural emotional flow, which is exactly the signal the model is trying to learn from.

The intuition. If you can’t safely change the data, change the loss. Weight rare classes higher so a single misclassification of Sad costs the model more than a single misclassification of Neutral.

The implementation. Cross-entropy with per-class weights derived from inverse class frequency, then normalized. Nothing exotic — but with the conversational-sequence argument as the explicit motivation, this becomes a principled choice rather than an arbitrary one.
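A dependency-free sketch of that loss follows. Two things here are assumptions rather than details from the thesis: "normalized" is taken to mean rescaling the weights so they sum to the number of classes, and the batch reduction is a weighted mean, which mirrors the default behaviour of `torch.nn.CrossEntropyLoss` when a `weight` tensor is passed. The class counts are made up purely to reproduce a 4.5:1 ratio.

```python
import math


def inverse_frequency_weights(class_counts):
    """Weight each class by 1/count, then rescale so the weights sum to
    the number of classes (average weight 1.0)."""
    inverse = {label: 1.0 / count for label, count in class_counts.items()}
    scale = len(inverse) / sum(inverse.values())
    return {label: value * scale for label, value in inverse.items()}


def log_softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def weighted_cross_entropy(batch_logits, batch_targets, class_weights):
    """Weighted mean of per-example losses: sum(w_y * nll) / sum(w_y)."""
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, target in zip(batch_logits, batch_targets):
        w = class_weights[target]
        weighted_sum += -w * log_softmax(logits)[target]
        weight_total += w
    return weighted_sum / weight_total


# Made-up counts for three classes, chosen only so that Neutral outnumbers
# Sad by 4.5:1. These are NOT the real EmoryNLP statistics.
labels = ["Neutral", "Joy", "Sad"]
toy_counts = {"Neutral": 450, "Joy": 300, "Sad": 100}
by_label = inverse_frequency_weights(toy_counts)
weights = [by_label[name] for name in labels]  # indexable by class index

# The model confidently predicts Neutral for both examples; the second is really Sad.
batch_logits = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
batch_targets = [0, 2]
print("weighted:  ", weighted_cross_entropy(batch_logits, batch_targets, weights))
print("unweighted:", weighted_cross_entropy(batch_logits, batch_targets, [1.0, 1.0, 1.0]))
```

In this toy batch the missed Sad example dominates the weighted loss, so the weighted value comes out higher than the unweighted one. That is the intended effect: mistakes on rare classes pull harder on the gradient without touching the order or content of the dialogue data.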

Results: what worked, and what surprised me

Here’s the ablation table from the thesis:

The result that surprised me — and that I think is the most honest part of this work — is the second row. Adding Global Speaker ID alone made the model substantially worse (F1 dropped from 37.85 to 29.43). That looked like a failure at first.

But it wasn’t. The Global Speaker Identity is a capability — it gives the model the ability to learn long-range speaker patterns. On its own, that capability creates a representational burden the rest of the model couldn’t absorb. Only once the Speaker Behaviour module was added — giving the model a structured way to use the global identities — did the contribution surface. By the final configuration, EmoNet had recovered and surpassed the CoMPM baseline by 1.81 F1.

This is the lesson I took from the ablation: a feature isn’t valuable in isolation; it’s valuable in combination with the machinery that consumes it. Research papers that report “this addition gave us +X%” often hide ablation rows where the addition alone made things worse. I chose to keep that row in.

The full model handled Neutral, Joy, and Scared well. Powerful remained the hardest class — partly because it’s rare, and partly because Powerful and Joy are nearly indistinguishable in textual conversation without acoustic cues. This is a multimodal problem masquerading as a text problem.

Reflection (2026): the field moved, and so should we

Two years on, the EmoryNLP leaderboard looks completely different. The leading systems now are:

– InstructERC (Lei et al., 2023) — reformulates ERC as a generative LLM task. It uses retrieval-augmented instruction templates and auxiliary tasks such as speaker identification and emotion prediction to better model dialogue roles and emotional dynamics.

– CKERC (Fu, 2024) — introduces commonsense-enhanced ERC. For each utterance, an LLM generates commonsense annotations about speaker intention and likely listener reaction, providing implicit social and emotional reasoning beyond explicit dialogue context.

– BiosERC (Xue et al., 2024) — injects LLM-derived speaker biographical information into the ERC process, allowing the model to reason not only over utterance context but also over speaker-specific traits.

– LaERC-S (Fu et al., 2025) — two-stage instruction tuning. Stage 1: equip the LLM with speaker-specific characteristics. Stage 2: use those characteristics during the ERC task itself.

Look at those last two carefully.

BiosERC’s speaker biographical information is, in spirit, Global Speaker Identity scaled up — instead of an integer ID, it’s a textual profile the LLM can attend to. LaERC-S’s speaker characteristics are, in spirit, the Speaker Behaviour module — historical speaker patterns made available to the model — but folded into instruction tuning rather than implemented as a separate GRU.

The architectural intuitions held up. The implementation layer changed.

This is the part I find genuinely interesting. When I was working on EmoNet in 2024, I was thinking inside the encoder-only-transformer paradigm: “how do I add another module to the architecture?” The 2023–2025 papers think inside the LLM paradigm: “how do I encode this idea into instruction tuning or retrieval context?” The ideas are similar; the leverage points are different.

If I were to rebuild EmoNet today, I would not start from RoBERTa-large. I would start from a small open-source LLM — LLaMA-3.2–3B, Qwen-2.5–3B, or Phi-3.5 — and use LoRA to fine-tune it on EmoryNLP, following the InstructERC family of approaches. The Global Speaker Identity becomes a textual speaker biography retrieved from a vector store. The Speaker Behaviour module becomes a few-shot prompt with the speaker’s most recent emotional history. The Weighted Loss survives almost unchanged — class imbalance doesn’t care what model you’re using.
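As a rough illustration of how the first two ideas might turn into prompt text, here is a sketch of a prompt builder. Everything in it is hypothetical: the function name `build_erc_prompt`, the template wording, and the placeholder biography are assumptions, not the format used by InstructERC or any of the papers above. Retrieval and LoRA fine-tuning are deliberately left out, and the label list contains only the labels named in this post, not the full EmoryNLP label set.

```python
def build_erc_prompt(speaker, biography, recent_emotions, context, target, label_set):
    """Assemble one instruction-style prompt for a single target utterance."""
    history = ", ".join(recent_emotions) if recent_emotions else "none recorded"
    lines = [
        "Task: label the emotion of the final utterance in the conversation.",
        f"Allowed labels: {', '.join(label_set)}",
        "",
        # Stands in for Global Speaker Identity: a retrieved textual profile.
        f"Speaker profile for {speaker}: {biography}",
        # Stands in for the Speaker Behaviour module: recent emotional history.
        f"{speaker}'s most recent emotions (oldest first): {history}",
        "",
        "Conversation:",
        *context,
        "",
        f"Final utterance: {speaker}: {target}",
        "Emotion:",
    ]
    return "\n".join(lines)


prompt = build_erc_prompt(
    speaker="Monica",
    biography="(placeholder: profile text that would come from the vector store)",
    recent_emotions=["Mad"],
    context=[
        "Monica: Wendy, we had a deal! Yeah, you promised! Wendy! Wendy! Wendy!",
        "Rachel: Who was that?",
    ],
    target="Wendy bailed. I have no waitress.",
    label_set=["Neutral", "Joy", "Scared", "Powerful", "Sad", "Mad"],
)
print(prompt)
```

The speaker profile and the recent-emotion line occupy the same roles the integer ID and the GRU state played in EmoNet; they are simply expressed as text the model can attend to.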

The architecture diagram would look completely different. The conceptual debt to the 2024 thesis would be visible if you knew where to look.

Looking back taught me that conceptual debt has a longer half-life than I expected: ideas survive paradigm shifts even when their implementations don’t.

Where this leaves me

EmoNet is now publicly archived under DOI 10.5281/zenodo.20048006 with the full thesis, defense slides, and PyTorch implementation on GitHub. I’m currently working on the modernized port — a LoRA-fine-tuned LLM with retrieval-based speaker context — as a follow-up project that I’ll write about soon.

If you’re working on conversational AI, applied NLP, or LLM fine-tuning, I’d be interested to hear what you’re building.