Machine translation has evolved significantly since the early days of Google Translate in 2007. However, NMT systems still hallucinate like any other model, especially in low-resource domains or when translating between rare language pairs.
When Google Translate returns a result, you see only the output text, not the probability distributions or uncertainty metrics behind each word or sentence. Even if end users never need this information, knowing where the model is confident and where it isn't can be very valuable internally: simple inputs can be routed to a fast, cheap model, while more resources are allocated to difficult ones.
But how can we assess and, most importantly, “calibrate” this uncertainty? The first thing that comes to mind is to evaluate the distribution of output probabilities for each token, for example, by calculating its entropy. This is computationally simple, universal across model architectures, and, as can be seen below, actually correlates with cases where the NMT model is uncertain.
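The per-token entropy signal mentioned above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: given the decoder's logits for each generated token, compute the entropy of the softmax distribution. High entropy means the model spread its probability mass widely.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """logits: [T, V] decoder outputs; returns [T] entropy in nats."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1)

# A peaked distribution yields low entropy; a flat one yields high entropy
peaked = torch.tensor([[10.0, 0.0, 0.0, 0.0]])
flat = torch.zeros(1, 4)
assert token_entropy(peaked).item() < token_entropy(flat).item()
```

This is the cheap, architecture-agnostic baseline that the rest of the article compares against.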
However, the limitations of this approach are obvious:
- First, the model may simply be choosing between several valid synonyms: uncertain from the token-selection perspective, yet perfectly correct semantically.
- Second, and more importantly, this is just a black-box method that explains nothing about the nature of the uncertainty. Perhaps the model really hasn’t seen anything similar during training. Or perhaps it simply hallucinated a non-existent word or a grammatical construction.
Existing approaches address this problem reasonably well, but all have their nuances:
- Semantic Entropy [1] clusters model outputs by semantic meaning, but requires generating 5–10 outputs for a single input, which is computationally expensive (and frankly, when I tried to reproduce this on my labelled dataset, the observed semantic similarity of words in these clusters was questionable).
- Metrics like xCOMET [2] achieve SOTA-level QE at the token level, but require fine-tuning 3.5 billion parameters of an XLM-R model on expensive quality-annotated data and, aside from that, function as a black box.
- Model introspection [3] through saliency analysis looks interesting but also has interpretation issues.
The method proposed below makes uncertainty computation cheap. Since most NMT setups already include two models, a forward model (language1 → language2) and a backward model (language2 → language1), we can leverage both to compute interpretable uncertainty signals.
After generating a translation with the forward model, we can “place” the inverted translation-source pair into the backward model using teacher forcing (as if the backward model had generated it itself), then extract the transposed cross-attention map and compare it with the corresponding map from the forward model. The results below show that, in most cases, this approach yields interpretable signals at the token level.
Additionally, there is no need to retrain a heavy NMT model. It is sufficient to train a lightweight classifier on features from the matrix comparison while keeping the main model’s weights frozen.
What Right and Wrong Look Like
Let’s start with a simple French → English translation example where everything is clear just from the visualization.
“Elle aime manger des pommes → She likes to eat apples”
Now let’s compare this with a broken NMT translation:
“La femme dont je t’ai parlé travaille à cette université” (correct translation: “The woman I told you about works at this university”)
What the model produced:
“The woman whose wife I told you about that university”
Where did this extra “wife” come from?

Bidirectional Cross-Check
Computation of bidirectional attention. Please note that the backward model uses teacher forcing. It receives the pre-generated English translation and checks whether it matches back to the original French source, while no new French sentence is generated. This is an alignment verification, not a round-trip translation.
```python
def get_bidirectional_attention(dual_model, src_tensor, tgt_tensor):
    """Extract forward/backward cross-attention and reciprocal map."""
    dual_model.eval()
    with torch.no_grad():
        fwd_attn, bwd_attn = dual_model.get_cross_attention(src_tensor, tgt_tensor)
    # Align to full target/source lengths for element-wise comparison
    B, T = tgt_tensor.shape
    S = src_tensor.shape[1]
    fwd_aligned = torch.zeros(B, T, S, device=src_tensor.device)
    bwd_aligned = torch.zeros(B, T, S, device=src_tensor.device)
    if T > 1:
        fwd_aligned[:, 1:T, :] = fwd_attn
    if S > 1:
        bwd_aligned[:, :, 1:S] = bwd_attn.transpose(1, 2)
    reciprocal = fwd_aligned * bwd_aligned
    return fwd_aligned, bwd_aligned, reciprocal
```
All reproducible code is accessible via the project’s GitHub repository.
At the beginning of my work on this topic, I tried direct round-trip translation. However, due to the poor performance of single-GPU-trained models and to translation ambiguity, comparing the source and the round-trip result at the token level was difficult, as sentences could completely lose their meaning. Moreover, comparing the attention matrices of the backward and forward models across three different sentences (the source, the translation, and the source reproduced from the round trip) would have been costly.
When Patterns Are Less Obvious: Chinese → English
For language pairs with similar structure (like French↔English), the “1-to-1 token pattern” is intuitive. But what about typologically distant languages?
Chinese → English involves:
- Flexible word order. Chinese is SVO like English, but allows topicalization and pro-drop.
- No spaces between words. Tokenizers must segment before subword splitting.
- Logographic writing system. Characters map to morphemes, not phonemes.
The attention maps become harder to interpret just by looking at the picture; however, the learned features still manage to capture alignment quality.
Let’s look at this example of a semantic inversion error:
这家公司的产品质量越来越差客户都很不满意 (correct translation: This company’s product quality is getting worse, customers are very dissatisfied)
The model output:
The quality of products of the company is increasingly satisfied with the customer.
The word “不满意” means “dissatisfied”, but the model produced exactly the opposite, to say nothing of the fact that the translation as a whole is nonsense.

Although the pattern is much less visually noticeable here, a trainable QE classifier is still able to capture it. This is precisely why we extract 75 attention-alignment features of various kinds, as explained in more detail below.
Experimental Setup
The NMT core is intentionally kept undertrained in this setup: a near-perfect translator produces too few errors to detect, while building a quality estimation system requires translations that sometimes (or even often) fail, with omissions, hallucinations, grammatical errors, and mistranslations.
An utterly broken model would make no sense either, simply because the classifier would have no reference for what is right. The model used in this setup (~0.25–0.4 BLEU on the validation part of the dataset) ensures a steady supply of diverse error types, creating a decent training signal for the QE classifier.
The architecture uses simple scaled dot-product attention instead of more advanced options (linear attention, GQA, etc.). This keeps the attention weights interpretable: each weight represents the probability mass assigned to a source position, without approximations or kernel tricks. Extending the method to more optimized attention structures would be worth considering, but it is out of scope for this experiment and left as a direction for future work.
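As a reminder of why plain scaled dot-product attention stays interpretable, here is a minimal sketch (not the repo's code): each row of the resulting weight matrix is a proper probability distribution over source positions, so "probability mass assigned to a source position" is literal.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q: [T, d], k/v: [S, d]. Returns output [T, d] and weights [T, S]."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v, weights

q, k, v = torch.randn(4, 8), torch.randn(6, 8), torch.randn(6, 8)
out, w = scaled_dot_product_attention(q, k, v)
# Every target position distributes exactly 1.0 of probability mass
assert torch.allclose(w.sum(dim=-1), torch.ones(4), atol=1e-5)
```

With linear attention or kernel approximations, this row-stochastic property (and hence the direct "alignment" reading) no longer holds exactly.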
Data and Annotation
| | ZH→EN | FR→EN |
|---|---|---|
| Typological distance | High | Low |
| Expected error types | Alignment & word order errors | Lexical & tense errors |
| Training pairs | 100k sentences | 100k sentences |
| QE annotation set | 15k translations | 15k translations |
Token-level binary quality labels were annotated via an “LLM-as-a-judge” approach using Gemini 2.5 Flash. The annotation prompt had clear and strict rules:
- BAD: mistranslations, wrong tense/form, hallucinated content, incorrect syntax, UNK tokens.
- OK: correct meaning, valid synonyms, natural paraphrasing.
Each translation was tokenized, and the judging model labeled every token; it also provided a reference translation and a minimal post-edit. In total, this gave approximately 150,000 labeled tokens with a 15–20% “BAD” rate.
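The exact judge response format is not shown here, so the sketch below assumes a simple per-token OK/BAD list; the helper `labels_to_targets` is hypothetical, meant only to illustrate how judge verdicts become binary training targets.

```python
def labels_to_targets(tokens, judge_labels):
    """Map a judge's per-token OK/BAD verdicts onto binary targets.
    judge_labels: list of 'OK'/'BAD' strings, one per token (assumed format).
    """
    assert len(tokens) == len(judge_labels), "judge must label every token"
    return [1 if lab == "BAD" else 0 for lab in judge_labels]

tokens = ["the", "woman", "whose", "wife", "i", "told", "you", "about"]
judged = ["OK", "OK", "BAD", "BAD", "OK", "OK", "OK", "OK"]
targets = labels_to_targets(tokens, judged)  # [0, 0, 1, 1, 0, 0, 0, 0]
```

In practice the judge's output also needs validation (length mismatches, malformed labels) before tokens enter the training set.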
Training Pipeline
Step 1: Train bidirectional NMTs. The forward (src→tgt) and backward (tgt→src) models were trained jointly on parallel data. Both share the same architecture but have separate parameters.
```python
class DualTransformerNMT(nn.Module):
    """Bidirectional translator used for QE feature extraction."""
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model,
                 n_heads, n_layers, d_ff, max_length):
        super().__init__()
        self.zh2en = TransformerNMT(src_vocab_size, tgt_vocab_size,
                                    d_model, n_heads, n_layers, d_ff, max_length)
        self.en2zh = TransformerNMT(tgt_vocab_size, src_vocab_size,
                                    d_model, n_heads, n_layers, d_ff, max_length)

class QEClassifier(nn.Module):
    """Token-level BAD probability head."""
    def __init__(self, input_dim=75, hidden_dim=128, dropout=0.2):
        super().__init__()
        self.input_projection = nn.Linear(input_dim, hidden_dim)
        self.hidden1 = nn.Linear(hidden_dim, hidden_dim)
        self.hidden2 = nn.Linear(hidden_dim, hidden_dim // 2)
        self.output = nn.Linear(hidden_dim // 2, 1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Standard MLP forward pass: ReLU activations with dropout
        x = self.dropout(torch.relu(self.input_projection(x)))
        x = self.dropout(torch.relu(self.hidden1(x)))
        x = self.dropout(torch.relu(self.hidden2(x)))
        return self.output(x)
```
Step 2: Generate translations. The forward model translated the QE annotation set. These translations (with their natural errors) became the training data for quality estimation.
Step 3: Extract attention features. For each translated sentence, 75-dimensional feature vectors were extracted per token position using the method described below.
```python
def extract_all_features(dual_model, src_tensor, tgt_tensor, attention_extractor):
    """Extract per-token QE features used in training/inference."""
    # Bidirectional cross-attention
    fwd_attn, bwd_attn = dual_model.get_cross_attention(src_tensor, tgt_tensor)
    # 75 attention features (25 base x context window [-1, 0, +1])
    attn_features = attention_extractor.extract(
        fwd_attn, bwd_attn, src_tensor, tgt_tensor
    )[0]  # [T, 75]
    # Optional entropy feature (top-k normalized output entropy)
    entropy = compute_output_entropy(dual_model.zh2en, src_tensor, tgt_tensor)[0]
    # Final combined vector used in the ablation: [T, 76]
    features = torch.cat([attn_features, entropy.unsqueeze(-1)], dim=-1)
    return features
```
Step 4: Train QE classifier. A small MLP classifier (128 → 128 → 64 → 1) was trained on the extracted features with the translator weights frozen.
```python
# Freeze translator weights, train the QE head only
dual_model.freeze_translation_models()
n_bad = max(int(y_train.sum()), 1)
n_ok = max(int(len(y_train) - y_train.sum()), 1)
# Up-weight the minority BAD class
pos_weight = torch.tensor([n_ok / n_bad], device=device, dtype=torch.float32)
classifier = QEClassifier(input_dim=input_dim, hidden_dim=128, dropout=0.2).to(device)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
for batch_x, batch_y in train_loader:
    optimizer.zero_grad()
    logits = classifier(batch_x).squeeze(-1)  # [B, 1] -> [B]
    loss = criterion(logits, batch_y)
    loss.backward()
    optimizer.step()
```
Alignment feature types
1. Focus (12 features) — Where is the model looking?

```python
def extract_focus_features(fwd_attn, bwd_attn, tgt_pos, src_content_mask):
    """
    Extract top-k alignment scores and their backward counterparts.

    Args:
        fwd_attn: [S] forward attention from target position to all sources
        bwd_attn: [S, T] backward attention matrix
        tgt_pos: current target position index
        src_content_mask: [S] boolean mask for content (non-special) tokens

    Returns:
        features: [12] focus feature vector
    """
    # Mask to only consider content source tokens
    fwd_scores = fwd_attn * src_content_mask
    # Get the top-3 attended source positions
    top_k = 3
    top_fwd_scores, top_src_indices = torch.topk(fwd_scores, top_k)
    features = torch.zeros(12)
    features[0:3] = top_fwd_scores  # Forward top-1, top-2, top-3
    # For each top source, check backward alignment strength
    for i, src_idx in enumerate(top_src_indices):
        bwd_from_src = bwd_attn[src_idx, :]  # [T] - how the source looks back
        top_bwd_scores, _ = torch.topk(bwd_from_src, 3)
        features[3 + i*3 : 3 + (i+1)*3] = top_bwd_scores
    return features
```
Hallucinated tokens often (though not always) have diffuse attention: the model fails to ground the target token in the source and tries to “look” everywhere. A confident translation usually focuses sharply on 1–2 source positions, or at least shows a distinct pattern.
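A toy illustration of what the top-k focus scores see, using hand-made attention rows rather than real model output:

```python
import torch

fwd_sharp = torch.tensor([0.02, 0.90, 0.05, 0.03])    # well-grounded token
fwd_diffuse = torch.tensor([0.26, 0.24, 0.25, 0.25])  # hallucination-like

top_sharp, _ = torch.topk(fwd_sharp, 3)
top_diffuse, _ = torch.topk(fwd_diffuse, 3)
# A sharp row concentrates most of its mass in the top-1 score ...
assert top_sharp[0] > 0.8
# ... while a diffuse row's top-1 barely exceeds uniform (1/S = 0.25)
assert top_diffuse[0] < 0.3
```

The classifier does not threshold these values by hand; it simply receives all top-k scores as features and learns which ranges correlate with BAD tokens.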
2. Reciprocity (2 features) — Does the alignment cycle back?

```python
def extract_reciprocity_features(fwd_attn, bwd_attn, tgt_pos, src_content_mask):
    """
    Check if attention alignment forms a closed cycle.

    Returns:
        hard_reciprocal: 1.0 if exact match, 0.0 otherwise
        soft_reciprocal: dot product overlap (continuous measure)
    """
    # Forward: find the best source position for this target
    fwd_scores = fwd_attn * src_content_mask
    best_src = fwd_scores.argmax()
    # Backward: does that source point back to us?
    bwd_from_best_src = bwd_attn[best_src, :]  # [T]
    best_tgt_from_src = bwd_from_best_src.argmax()
    # Hard reciprocity: exact position match
    hard_reciprocal = 1.0 if (best_tgt_from_src == tgt_pos) else 0.0
    # Soft reciprocity: attention distribution overlap
    # High value = forward and backward "agree" on the alignment
    fwd_normalized = fwd_scores / (fwd_scores.sum() + 1e-9)
    bwd_normalized = bwd_attn[:, tgt_pos] / (bwd_attn[:, tgt_pos].sum() + 1e-9)
    soft_reciprocal = (fwd_normalized * bwd_normalized).sum()
    return hard_reciprocal, soft_reciprocal
```
For example, if “wife” (from the example above) attends to position 3 in French, but position 3 doesn’t attend back to “wife,” the alignment is spurious.
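The broken cycle can be sketched with toy tensors (the numbers below are made up for illustration, not taken from the actual model):

```python
import torch

# Toy maps: S=3 source positions, T=4 target positions.
fwd = torch.tensor([0.1, 0.1, 0.8])          # fwd_attn for tgt_pos=1, over S=3
bwd = torch.tensor([[0.7, 0.1, 0.1, 0.1],
                    [0.1, 0.6, 0.2, 0.1],
                    [0.1, 0.1, 0.1, 0.7]])   # bwd_attn, shape [S=3, T=4]

tgt_pos = 1
best_src = int(fwd.argmax())                 # target 1 attends hardest to source 2
best_tgt_back = int(bwd[best_src].argmax())  # but source 2 points back to target 3
hard_reciprocal = 1.0 if best_tgt_back == tgt_pos else 0.0
assert hard_reciprocal == 0.0  # the cycle is broken: spurious alignment
```

Flipping `bwd[2]` so that its peak sits at index 1 would close the cycle and yield `hard_reciprocal == 1.0`.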
3. Sink (11 features)

When uncertain, transformers often dump attention onto “safe” special tokens (SOS, EOS, PAD):
```python
def extract_sink_features(fwd_attn, bwd_attn, src_tensor, tgt_tensor,
                          SOS=1, EOS=2, PAD=0):
    """
    Extract attention sink features - attention mass on special tokens.
    """
    # Identify special token positions in the source
    src_is_sos = (src_tensor == SOS).float()
    src_is_eos = (src_tensor == EOS).float()
    src_is_pad = (src_tensor == PAD).float()
    # Measure attention mass going to each special token type
    sink_sos = (fwd_attn * src_is_sos).sum()  # Attention to SOS
    sink_eos = (fwd_attn * src_is_eos).sum()  # Attention to EOS
    sink_pad = (fwd_attn * src_is_pad).sum()  # Attention to PAD
    sink_total = sink_sos + sink_eos + sink_pad
    # Backward sink: check if the best-aligned source also shows uncertainty
    best_src = fwd_attn.argmax()
    bwd_from_best = bwd_attn[best_src, :]
    tgt_is_special = ((tgt_tensor == SOS) | (tgt_tensor == EOS) |
                      (tgt_tensor == PAD)).float()
    bwd_sink = (bwd_from_best * tgt_is_special).sum()
    # Asymmetry: disagreement in uncertainty levels
    sink_asymmetry = abs(sink_total - bwd_sink)
    # Extended features: entropy-based measures
    content_mask = 1.0 - src_is_sos - src_is_eos - src_is_pad
    fwd_content = fwd_attn * content_mask
    fwd_content_norm = fwd_content / (fwd_content.sum() + 1e-9)
    max_content = fwd_content.max()  # Peak attention to content
    concentration = max_content / (fwd_content.sum() + 1e-9)  # How peaked?
    return {
        'sink_total': sink_total,
        'sink_sos': sink_sos,
        'sink_eos': sink_eos,
        'sink_pad': sink_pad,
        'bwd_sink': bwd_sink,
        'sink_asymmetry': sink_asymmetry,
        'max_content': max_content,
        'concentration': concentration,
        # ... plus entropy features
    }
```
Why Context Matters
Translation errors cascade: a dropped word affects its neighbors. By including features from positions t-1 and t+1, we allow the classifier to detect these ripple patterns. This is obviously not an overwhelming signal (especially when the real semantic “neighbor” may sit far away in the sentence), but it is already strong enough to add value. Combining it with “topological” token-linking methods could make these features even more meaningful.
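The context-window construction itself is mechanical. A minimal sketch (boundary handling by zero-padding is my assumption; the repo may do it differently): stacking each position's features with its left and right neighbors turns 25 base features into the 75-dimensional vectors used throughout.

```python
import torch

def add_context_window(base: torch.Tensor) -> torch.Tensor:
    """Stack features from positions t-1, t, t+1. base: [T, F] -> [T, 3F].
    Boundary positions are zero-padded (assumption for this sketch)."""
    prev = torch.zeros_like(base)
    nxt = torch.zeros_like(base)
    prev[1:] = base[:-1]
    nxt[:-1] = base[1:]
    return torch.cat([prev, base, nxt], dim=-1)

base = torch.randn(6, 25)          # 25 base features per token
windowed = add_context_window(base)
assert windowed.shape == (6, 75)   # 25 base x context window [-1, 0, +1]
```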
Token-Level Comparison: What Each Signal Sees
Now let’s take a more detailed look at whether attention-alignment signals can be matched with output-distribution entropy scores. Do they carry the same information, or could they augment each other?
The “Entropy” column below shows the normalized top-k output entropy (k=20) of the forward model’s softmax distribution, scaled to 0–1. A value near 0 means the model is confident in a single token; a value near 1 means probability is spread evenly across many candidates.
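The normalization can be sketched as follows: restrict to the top-k candidates, renormalize, and divide the entropy by log(k) so a uniform spread maps to 1.0. This is an illustration of the described metric, not the repository's exact code.

```python
import torch
import torch.nn.functional as F

def topk_normalized_entropy(logits: torch.Tensor, k: int = 20) -> torch.Tensor:
    """Entropy over the renormalized top-k candidates, scaled to [0, 1]
    by dividing by log(k). logits: [T, V] -> [T]."""
    k = min(k, logits.shape[-1])
    top_logits, _ = torch.topk(logits, k, dim=-1)
    p = F.softmax(top_logits, dim=-1)
    ent = -(p * (p + 1e-12).log()).sum(dim=-1)
    return ent / torch.log(torch.tensor(float(k)))

flat = torch.zeros(1, 50)  # uniform over the top-20 -> score near 1.0
assert topk_normalized_entropy(flat).item() > 0.99
```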
French: “wife” hallucination
Source: La femme dont je t’ai parlé… → MT: “the woman whose wife i told you about…”

Chinese: “satisfied” semantic inversion
Source: 这家公司的产品质量越来越差客户都很不满意 → MT: “the quality of products of the company is increasingly satisfied with the customer”


Repetition Errors: How Attention Catches Poor Model Artifacts
Source: Il est évident que cette approche ne fonctionne pas correctement (It is obvious that this approach does not work properly)
MT: “it’s obvious that this approach does not work properly operate properly”

What “Confident Translation” Looks Like
Source: Le rapport mentionne uniquement trois incidents, pas quatre
MT: “the report mentions only three incidents, not four”

Scaling up
Now let’s explore how the method works at scale. For this, I ran a quick ablation to measure the impact of the attention-based features: do they add anything worth the extra computation compared with simple output entropy?
Methodology: the dataset was split at the sentence level into 70% training, 15% validation, and 15% test sets. The best epoch was selected using validation ROC-AUC for threshold independence, and the classification threshold was tuned on validation F1(BAD). The final metrics were reported on the held-out test set only.
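The threshold-tuning step can be sketched as a simple sweep over candidate thresholds, keeping the one that maximizes F1 on the BAD class over the validation set. This is a hedged illustration assuming probabilities and binary labels arrive as numpy arrays, not the repository's exact code.

```python
import numpy as np

def tune_threshold(val_probs, val_labels):
    """Sweep thresholds; return the one maximizing F1 on the BAD class."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.linspace(0.05, 0.95, 19):
        pred = (val_probs >= t).astype(int)
        tp = int(((pred == 1) & (val_labels == 1)).sum())
        fp = int(((pred == 1) & (val_labels == 0)).sum())
        fn = int(((pred == 0) & (val_labels == 1)).sum())
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

probs = np.array([0.9, 0.8, 0.2, 0.1])
labels = np.array([1, 1, 0, 0])
t, f1 = tune_threshold(probs, labels)
```

The tuned threshold is then frozen and applied once to the held-out test set, so the reported F1 is not optimistically biased.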
Feature Contributions
| Features | ZH→EN ROC-AUC | ZH→EN PR-AUC | ZH→EN F1 (BAD) | FR→EN ROC-AUC | FR→EN PR-AUC | FR→EN F1 (BAD) |
|---|---|---|---|---|---|---|
| Entropy only (1) | 0.663 | 0.380 | 0.441 | 0.797 | 0.456 | 0.470 |
| Attention only (75) | 0.730 | 0.486 | 0.488 | 0.796 | 0.441 | 0.457 |
| Combined (76) | 0.750 | 0.506 | 0.505 | 0.849 | 0.546 | 0.530 |
Extended Metrics for Combined features (entropy+attention)
| Pair | Precision (BAD) | Recall (BAD) | Specificity (OK) | Balanced Acc. | MCC |
|---|---|---|---|---|---|
| ZH→EN | 0.405 | 0.672 | 0.689 | 0.680 | 0.315 |
| FR→EN | 0.462 | 0.623 | 0.877 | 0.750 | 0.443 |
When combined, features work better than each signal alone across both language pairs, likely because they capture complementary error types.
Final thoughts
Could the same approach be used for tasks beyond translation? Entropy captures “The model doesn’t know what to generate,” while attention captures “The model isn’t grounded in the input.” For RAG systems, this suggests combining perplexity-based detection together with attention analysis over retrieved documents. For summarization — building grounded links between source text tokens and those in the summary.
Limitations
Computation cost. Running the backward model adds a second forward pass, increasing inference time.
It’s a glass-box method only. You need access to attention weights, so it won’t work with API-based models. However, if you do have access, you won’t have to modify the core model’s weights: you can plug in any pretrained encoder-decoder, freeze it, and train only the QE head.
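The freeze-and-train-the-head pattern is a one-liner in PyTorch. The sketch below uses a toy stand-in module for the translator (any pretrained encoder-decoder would take its place); only the QE head's parameters reach the optimizer.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained encoder-decoder translator
translator = nn.Sequential(nn.Embedding(100, 32), nn.Linear(32, 32))
qe_head = nn.Sequential(nn.Linear(75, 128), nn.ReLU(), nn.Linear(128, 1))

# Freeze the translator; only the QE head receives gradients
for p in translator.parameters():
    p.requires_grad = False

trainable = [p for p in qe_head.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
assert all(not p.requires_grad for p in translator.parameters())
```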
Uncertainty does not equal error. It just means the model is unsure. The system sometimes flags correct paraphrases as errors because their attention patterns differ from those seen during training, much as a human translator might hesitate over a construction they have never encountered before.
Try It Yourself
All code, models, and the annotated dataset are open source:
```shell
# Clone the repository
git clone https://github.com/algapchenko/nmt-quality-estimation
cd nmt-quality-estimation
pip install -r requirements.txt

# Run the interactive demo (downloads models automatically)
python inference/demo.py --lang zh-en --interactive
```

Or use it programmatically:

```python
from inference.demo import QEDemo

# Initialize (downloads models from HuggingFace automatically)
demo = QEDemo(lang_pair='zh-en')

# Translate with quality estimation
result = demo.translate_with_qe("她喜欢吃苹果")
print(result['translation'])  # "she likes eating apples"
print(result['tokens'])       # ['she', 'likes', 'eating', 'apples']
print(result['probs'])        # P(BAD) per token
print(result['tags'])

# Highlighted output
print(demo.format_output(result))
# Translation: she likes eating apples
# QE: she likes eating apples
# BAD tokens: 0/4
```
Resources
References
[1] S. Farquhar, J. Kossen, L. Kuhn, Y. Gal, Detecting Hallucinations in Large Language Models Using Semantic Entropy (2024), https://www.nature.com/articles/s41586-024-07421-0
[2] COMET, GitHub repository, https://github.com/Unbabel/COMET
[3] W. Xu, S. Agrawal, E. Briakou, M. J. Martindale, M. Carpuat, Understanding and Detecting Hallucinations in Neural Machine Translation via Model Introspection (2023), https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00563/