Machine translation has evolved significantly since the early days of Google Translate in 2007. However, NMT systems still hallucinate like any other model, especially in low-resource domains or when translating between rare language pairs.
When Google Translate returns a result, you see only the output text, not the probability distributions or uncertainty metrics behind each word or sentence. Even if end users never need this information, knowing where the model is confident and where it isn't can be very valuable internally: simple inputs can be routed to a fast, cheap model, while more resources are allocated to difficult ones.
But how can we assess and, most importantly, “calibrate” this uncertainty? The first thing that comes to mind is to evaluate the distribution of output probabilities for each token, for example, by calculating its entropy. This is computationally simple, universal across model architectures, and, as can be seen below, actually correlates with cases where the NMT model is uncertain.
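The per-token entropy signal mentioned above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: given the decoder's logits for each generated token, compute the entropy of the softmax distribution. High entropy means the model spread its probability mass widely.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """logits: [T, V] decoder outputs; returns [T] entropy in nats."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1)

# A peaked distribution yields low entropy; a flat one yields high entropy
peaked = torch.tensor([[10.0, 0.0, 0.0, 0.0]])
flat = torch.zeros(1, 4)
assert token_entropy(peaked).item() < token_entropy(flat).item()
```

This is the cheap, architecture-agnostic baseline that the rest of the article compares against.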
However, the limitations of this approach are obvious:
- First, the model may simply be choosing between several valid synonyms: uncertain from the token-selection perspective, yet perfectly correct semantically.
- Second, and more importantly, this is just a black-box method that explains nothing about the nature of the uncertainty. Perhaps the model really hasn’t seen anything similar during training. Or perhaps it simply hallucinated a non-existent word or a grammatical construction.
Existing approaches address this problem reasonably well, but all have their nuances:
- Semantic Entropy [1] clusters model outputs by semantic meaning, but requires generating 5–10 outputs for a single input, which is computationally expensive (and frankly, when I tried to reproduce this on my labelled dataset, the observed semantic similarity of words in these clusters was questionable).
- Metrics like xCOMET [2] achieve SOTA-level QE at the token level, but require fine-tuning 3.5 billion parameters of an XLM-R model on expensive quality-annotated data and, aside from that, function as a black box.
- Model introspection [3] through saliency analysis looks interesting but also has interpretation issues.
The method proposed below makes uncertainty computation cheap. Since most NMT setups already include two models, a forward model (language1 → language2) and a backward model (language2 → language1), we can leverage both to compute interpretable uncertainty signals.
After generating a translation with the forward model, we can “place” the inverted translation-source pair into the backward model using teacher forcing (as if the backward model had generated it itself), then extract the transposed cross-attention map and compare it with the corresponding map from the forward model. The results below show that, in most cases, this approach yields interpretable signals at the token level.
Additionally, there is no need to retrain a heavy NMT model. It is sufficient to train a lightweight classifier on features from the matrix comparison while keeping the main model’s weights frozen.
What Right and Wrong Look Like
Let’s start with a simple French → English translation example where everything is clear just from the visualization.
“Elle aime manger des pommes → She likes to eat apples”
Now let’s compare this with a broken NMT translation:
“La femme dont je t’ai parlé travaille à cette université” (correct translation: “The woman I told you about works at this university”)
What the model produced:
“The woman whose wife I told you about that university”
Where did this extra “wife” come from?

Bidirectional Cross-Check
Computation of bidirectional attention. Please note that the backward model uses teacher forcing. It receives the pre-generated English translation and checks whether it matches back to the original French source, while no new French sentence is generated. This is an alignment verification, not a round-trip translation.
```python
def get_bidirectional_attention(dual_model, src_tensor, tgt_tensor):
    """Extract forward/backward cross-attention and reciprocal map."""
    dual_model.eval()
    with torch.no_grad():
        fwd_attn, bwd_attn = dual_model.get_cross_attention(src_tensor, tgt_tensor)
    # Align to full target/source lengths for element-wise comparison
    B, T = tgt_tensor.shape
    S = src_tensor.shape[1]
    fwd_aligned = torch.zeros(B, T, S, device=src_tensor.device)
    bwd_aligned = torch.zeros(B, T, S, device=src_tensor.device)
    if T > 1:
        fwd_aligned[:, 1:T, :] = fwd_attn
    if S > 1:
        bwd_aligned[:, :, 1:S] = bwd_attn.transpose(1, 2)
    reciprocal = fwd_aligned * bwd_aligned
    return fwd_aligned, bwd_aligned, reciprocal
```
All reproducible code is accessible via the project’s GitHub repository.
At the beginning of my work on this topic, I tried direct round-trip translation. However, due to the poor performance of single-GPU-trained models and to translation ambiguity, comparing the source and the round-trip result at the token level was difficult, as sentences could completely lose their meaning. Moreover, comparing the attention matrices of the backward and forward models across three different sentences (the source, the translation, and the source reproduced from the round trip) would have been costly.
When Patterns Are Less Obvious: Chinese → English
For language pairs with similar structure (like French↔English), the “1-to-1 token pattern” is intuitive. But what about typologically distant languages?
Chinese → English involves:
- Flexible word order. Chinese is SVO like English, but allows topicalization and pro-drop.
- No spaces between words. Tokenizers must segment before subword splitting.
- Logographic writing system. Characters map to morphemes, not phonemes.
The attention maps become harder to interpret just by looking at the picture; however, the learned features still manage to capture alignment quality.
Let’s look at this example of a semantic inversion error:
这家公司的产品质量越来越差客户都很不满意 (correct translation: This company’s product quality is getting worse, customers are very dissatisfied)
The model output:
The quality of products of the company is increasingly satisfied with the customer.
The word “不满意” means “dissatisfied”, but the model produced exactly the opposite, to say nothing of the fact that the translation as a whole is nonsense.

Although the pattern is much less visually noticeable here, a trainable QE classifier is still able to capture it. This is precisely why we extract 75 attention-alignment features of various kinds, as explained in more detail below.
Experimental Setup
The NMT core is intentionally kept undertrained in this setup: a near-perfect translator produces too few errors to detect, while building a quality estimation system requires translations that sometimes (or even often) fail, with omissions, hallucinations, grammatical errors, and mistranslations.
An utterly broken model would make no sense either, simply because the classifier would have no reference for what is right. The model used in this setup (~0.25–0.4 BLEU on the validation part of the dataset) ensures a steady supply of diverse error types, creating a decent training signal for the QE classifier.
The architecture uses simple scaled dot-product attention instead of more advanced options (linear attention, GQA, etc.). This keeps the attention weights interpretable: each weight represents the probability mass assigned to a source position, without approximations or kernel tricks. Extending the method to more optimized attention structures would be worth considering, but it is out of scope for this experiment and left as a direction for future work.
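As a reminder of why plain scaled dot-product attention stays interpretable, here is a minimal sketch (not the repo's code): each row of the resulting weight matrix is a proper probability distribution over source positions, so "probability mass assigned to a source position" is literal.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q: [T, d], k/v: [S, d]. Returns output [T, d] and weights [T, S]."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v, weights

q, k, v = torch.randn(4, 8), torch.randn(6, 8), torch.randn(6, 8)
out, w = scaled_dot_product_attention(q, k, v)
# Every target position distributes exactly 1.0 of probability mass
assert torch.allclose(w.sum(dim=-1), torch.ones(4), atol=1e-5)
```

With linear attention or kernel approximations, this row-stochastic property (and hence the direct "alignment" reading) no longer holds exactly.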
Data and Annotation
| | ZH→EN | FR→EN |
|---|---|---|
| Typological distance | High | Low |
| Expected error types | Alignment & word order errors | Lexical & tense errors |
| Training pairs | 100k sentences | 100k sentences |
| QE annotation set | 15k translations | 15k translations |
Token-level binary quality labels were annotated via an “LLM-as-a-judge” approach using Gemini 2.5 Flash. The annotation prompt had clear and strict rules:
- BAD: mistranslations, wrong tense/form, hallucinated content, incorrect syntax, UNK tokens.
- OK: correct meaning, valid synonyms, natural paraphrasing.
Each translation was tokenized, and the judging model labeled every token; it also provided a reference translation and a minimal post-edit. In total, this gave approximately 150,000 labeled tokens with a 15–20% “BAD” rate.
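The exact judge response format is not shown here, so the sketch below assumes a simple per-token OK/BAD list; the helper `labels_to_targets` is hypothetical, meant only to illustrate how judge verdicts become binary training targets.

```python
def labels_to_targets(tokens, judge_labels):
    """Map a judge's per-token OK/BAD verdicts onto binary targets.
    judge_labels: list of 'OK'/'BAD' strings, one per token (assumed format).
    """
    assert len(tokens) == len(judge_labels), "judge must label every token"
    return [1 if lab == "BAD" else 0 for lab in judge_labels]

tokens = ["the", "woman", "whose", "wife", "i", "told", "you", "about"]
judged = ["OK", "OK", "BAD", "BAD", "OK", "OK", "OK", "OK"]
targets = labels_to_targets(tokens, judged)  # [0, 0, 1, 1, 0, 0, 0, 0]
```

In practice the judge's output also needs validation (length mismatches, malformed labels) before tokens enter the training set.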
Training Pipeline
Step 1: Train bidirectional NMTs. The forward (src→tgt) and backward (tgt→src) models were trained jointly on parallel data. Both share the same architecture but have separate parameters.
```python
class DualTransformerNMT(nn.Module):
    """Bidirectional translator used for QE feature extraction."""
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model,
                 n_heads, n_layers, d_ff, max_length):
        super().__init__()
        self.zh2en = TransformerNMT(src_vocab_size, tgt_vocab_size,
                                    d_model, n_heads, n_layers, d_ff, max_length)
        self.en2zh = TransformerNMT(tgt_vocab_size, src_vocab_size,
                                    d_model, n_heads, n_layers, d_ff, max_length)

class QEClassifier(nn.Module):
    """Token-level BAD probability head."""
    def __init__(self, input_dim=75, hidden_dim=128, dropout=0.2):
        super().__init__()
        self.input_projection = nn.Linear(input_dim, hidden_dim)
        self.hidden1 = nn.Linear(hidden_dim, hidden_dim)
        self.hidden2 = nn.Linear(hidden_dim, hidden_dim // 2)
        self.output = nn.Linear(hidden_dim // 2, 1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Standard MLP forward pass: ReLU activations with dropout
        x = self.dropout(torch.relu(self.input_projection(x)))
        x = self.dropout(torch.relu(self.hidden1(x)))
        x = self.dropout(torch.relu(self.hidden2(x)))
        return self.output(x)
```
Step 2: Generate translations. The forward model translated the QE annotation set. These translations (with their natural errors) became the training data for quality estimation.
Step 3: Extract attention features. For each translated sentence, 75-dimensional feature vectors were extracted per token position using the method described below.
```python
def extract_all_features(dual_model, src_tensor, tgt_tensor, attention_extractor):
    """Extract per-token QE features used in training/inference."""
    # Bidirectional cross-attention
    fwd_attn, bwd_attn = dual_model.get_cross_attention(src_tensor, tgt_tensor)
    # 75 attention features (25 base x context window [-1, 0, +1])
    attn_features = attention_extractor.extract(
        fwd_attn, bwd_attn, src_tensor, tgt_tensor
    )[0]  # [T, 75]
    # Optional entropy feature (top-k normalized output entropy)
    entropy = compute_output_entropy(dual_model.zh2en, src_tensor, tgt_tensor)[0]
    # Final combined vector used in the ablation: [T, 76]
    features = torch.cat([attn_features, entropy.unsqueeze(-1)], dim=-1)
    return features
```
Step 4: Train QE classifier. A small MLP classifier (128 → 128 → 64 → 1) was trained on the extracted features with the translator weights frozen.
```python
# Freeze translator weights, train the QE head only
dual_model.freeze_translation_models()
n_bad = max(int(y_train.sum()), 1)
n_ok = max(int(len(y_train) - y_train.sum()), 1)
# Up-weight the minority BAD class
pos_weight = torch.tensor([n_ok / n_bad], device=device, dtype=torch.float32)
classifier = QEClassifier(input_dim=input_dim, hidden_dim=128, dropout=0.2).to(device)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
for batch_x, batch_y in train_loader:
    optimizer.zero_grad()
    logits = classifier(batch_x).squeeze(-1)  # [B, 1] -> [B]
    loss = criterion(logits, batch_y)
    loss.backward()
    optimizer.step()
```
Alignment feature types
1. Focus (12 features) — Where is the model looking?

```python
def extract_focus_features(fwd_attn, bwd_attn, tgt_pos, src_content_mask):
    """
    Extract top-k alignment scores and their backward counterparts.

    Args:
        fwd_attn: [S] forward attention from target position to all sources
        bwd_attn: [S, T] backward attention matrix
        tgt_pos: current target position index
        src_content_mask: [S] boolean mask for content (non-special) tokens

    Returns:
        features: [12] focus feature vector
    """
    # Mask to only consider content source tokens
    fwd_scores = fwd_attn * src_content_mask
    # Get the top-3 attended source positions
    top_k = 3
    top_fwd_scores, top_src_indices = torch.topk(fwd_scores, top_k)
    features = torch.zeros(12)
    features[0:3] = top_fwd_scores  # Forward top-1, top-2, top-3
    # For each top source, check backward alignment strength
    for i, src_idx in enumerate(top_src_indices):
        bwd_from_src = bwd_attn[src_idx, :]  # [T] - how the source looks back
        top_bwd_scores, _ = torch.topk(bwd_from_src, 3)
        features[3 + i*3 : 3 + (i+1)*3] = top_bwd_scores
    return features
```
Hallucinated tokens often (though not always) have diffuse attention: the model fails to ground the target token in the source and tries to “look” everywhere. A confident translation usually focuses sharply on 1–2 source positions, or at least shows a distinct pattern.
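A toy illustration of what the top-k focus scores see, using hand-made attention rows rather than real model output:

```python
import torch

fwd_sharp = torch.tensor([0.02, 0.90, 0.05, 0.03])    # well-grounded token
fwd_diffuse = torch.tensor([0.26, 0.24, 0.25, 0.25])  # hallucination-like

top_sharp, _ = torch.topk(fwd_sharp, 3)
top_diffuse, _ = torch.topk(fwd_diffuse, 3)
# A sharp row concentrates most of its mass in the top-1 score ...
assert top_sharp[0] > 0.8
# ... while a diffuse row's top-1 barely exceeds uniform (1/S = 0.25)
assert top_diffuse[0] < 0.3
```

The classifier does not threshold these values by hand; it simply receives all top-k scores as features and learns which ranges correlate with BAD tokens.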
2. Reciprocity (2 features) — Does the alignment cycle back?

```python
def extract_reciprocity_features(fwd_attn, bwd_attn, tgt_pos, src_content_mask):
    """
    Check if attention alignment forms a closed cycle.

    Returns:
        hard_reciprocal: 1.0 if exact match, 0.0 otherwise
        soft_reciprocal: dot product overlap (continuous measure)
    """
    # Forward: find the best source position for this target
    fwd_scores = fwd_attn * src_content_mask
    best_src = fwd_scores.argmax()
    # Backward: does that source point back to us?
    bwd_from_best_src = bwd_attn[best_src, :]  # [T]
    best_tgt_from_src = bwd_from_best_src.argmax()
    # Hard reciprocity: exact position match
    hard_reciprocal = 1.0 if (best_tgt_from_src == tgt_pos) else 0.0
    # Soft reciprocity: attention distribution overlap
    # High value = forward and backward "agree" on the alignment
    fwd_normalized = fwd_scores / (fwd_scores.sum() + 1e-9)
    bwd_normalized = bwd_attn[:, tgt_pos] / (bwd_attn[:, tgt_pos].sum() + 1e-9)
    soft_reciprocal = (fwd_normalized * bwd_normalized).sum()
    return hard_reciprocal, soft_reciprocal
```
For example, if “wife” (from the example above) attends to position 3 in French, but position 3 doesn’t attend back to “wife,” the alignment is spurious.
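The broken cycle can be sketched with toy tensors (the numbers below are made up for illustration, not taken from the actual model):

```python
import torch

# Toy maps: S=3 source positions, T=4 target positions.
fwd = torch.tensor([0.1, 0.1, 0.8])          # fwd_attn for tgt_pos=1, over S=3
bwd = torch.tensor([[0.7, 0.1, 0.1, 0.1],
                    [0.1, 0.6, 0.2, 0.1],
                    [0.1, 0.1, 0.1, 0.7]])   # bwd_attn, shape [S=3, T=4]

tgt_pos = 1
best_src = int(fwd.argmax())                 # target 1 attends hardest to source 2
best_tgt_back = int(bwd[best_src].argmax())  # but source 2 points back to target 3
hard_reciprocal = 1.0 if best_tgt_back == tgt_pos else 0.0
assert hard_reciprocal == 0.0  # the cycle is broken: spurious alignment
```

Flipping `bwd[2]` so that its peak sits at index 1 would close the cycle and yield `hard_reciprocal == 1.0`.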
3. Sink (11 features)

When uncertain, transformers often dump attention onto “safe” special tokens (SOS, EOS, PAD):
```python
def extract_sink_features(fwd_attn, bwd_attn, src_tensor, tgt_tensor,
                          SOS=1, EOS=2, PAD=0):
    """
    Extract attention sink features - attention mass on special tokens.
    """
    # Identify special token positions in the source
    src_is_sos = (src_tensor == SOS).float()
    src_is_eos = (src_tensor == EOS).float()
    src_is_pad = (src_tensor == PAD).float()
    # Measure attention mass going to each special token type
    sink_sos = (fwd_attn * src_is_sos).sum()  # Attention to SOS
    sink_eos = (fwd_attn * src_is_eos).sum()  # Attention to EOS
    sink_pad = (fwd_attn * src_is_pad).sum()  # Attention to PAD
    sink_total = sink_sos + sink_eos + sink_pad
    # Backward sink: check if the best-aligned source also shows uncertainty
    best_src = fwd_attn.argmax()
    bwd_from_best = bwd_attn[best_src, :]
    tgt_is_special = ((tgt_tensor == SOS) | (tgt_tensor == EOS) |
                      (tgt_tensor == PAD)).float()
    bwd_sink = (bwd_from_best * tgt_is_special).sum()
    # Asymmetry: disagreement in uncertainty levels
    sink_asymmetry = abs(sink_total - bwd_sink)
    # Extended features: entropy-based measures
    content_mask = 1.0 - src_is_sos - src_is_eos - src_is_pad
    fwd_content = fwd_attn * content_mask
    fwd_content_norm = fwd_content / (fwd_content.sum() + 1e-9)
    max_content = fwd_content.max()  # Peak attention to content
    concentration = max_content / (fwd_content.sum() + 1e-9)  # How peaked?
    return {
        'sink_total': sink_total,
        'sink_sos': sink_sos,
        'sink_eos': sink_eos,
        'sink_pad': sink_pad,
        'bwd_sink': bwd_sink,
        'sink_asymmetry': sink_asymmetry,
        'max_content': max_content,
        'concentration': concentration,
        # ... plus entropy features
    }
```
Why Context Matters
Translation errors cascade: a dropped word affects its neighbors. By including features from positions t-1 and t+1, we allow the classifier to detect these ripple patterns. This is obviously not an overwhelming signal (especially when the real semantic “neighbor” may sit far away in the sentence), but it is already strong enough to add value. Combining it with “topological” token-linking methods could make these features even more meaningful.
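The context-window construction itself is mechanical. A minimal sketch (boundary handling by zero-padding is my assumption; the repo may do it differently): stacking each position's features with its left and right neighbors turns 25 base features into the 75-dimensional vectors used throughout.

```python
import torch

def add_context_window(base: torch.Tensor) -> torch.Tensor:
    """Stack features from positions t-1, t, t+1. base: [T, F] -> [T, 3F].
    Boundary positions are zero-padded (assumption for this sketch)."""
    prev = torch.zeros_like(base)
    nxt = torch.zeros_like(base)
    prev[1:] = base[:-1]
    nxt[:-1] = base[1:]
    return torch.cat([prev, base, nxt], dim=-1)

base = torch.randn(6, 25)          # 25 base features per token
windowed = add_context_window(base)
assert windowed.shape == (6, 75)   # 25 base x context window [-1, 0, +1]
```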
Token-Level Comparison: What Each Signal Sees
Now let’s take a more detailed look at whether attention-alignment signals can be matched with output-distribution entropy scores. Do they carry the same information, or could they augment each other?
The “Entropy” column below shows the normalized top-k output entropy (k=20) of the forward model’s softmax distribution, scaled to 0–1. A value near 0 means the model is confident in a single token; a value near 1 means probability is spread evenly across many candidates.
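The normalization can be sketched as follows: restrict to the top-k candidates, renormalize, and divide the entropy by log(k) so a uniform spread maps to 1.0. This is an illustration of the described metric, not the repository's exact code.

```python
import torch
import torch.nn.functional as F

def topk_normalized_entropy(logits: torch.Tensor, k: int = 20) -> torch.Tensor:
    """Entropy over the renormalized top-k candidates, scaled to [0, 1]
    by dividing by log(k). logits: [T, V] -> [T]."""
    k = min(k, logits.shape[-1])
    top_logits, _ = torch.topk(logits, k, dim=-1)
    p = F.softmax(top_logits, dim=-1)
    ent = -(p * (p + 1e-12).log()).sum(dim=-1)
    return ent / torch.log(torch.tensor(float(k)))

flat = torch.zeros(1, 50)  # uniform over the top-20 -> score near 1.0
assert topk_normalized_entropy(flat).item() > 0.99
```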
French: “wife” hallucination
Source: La femme dont je t’ai parlé… → MT: “the woman whose wife i told you about…”

Chinese: “satisfied” semantic inversion
Source: 这家公司的产品质量越来越差客户都很不满意 → MT: “the quality of products of the company is increasingly satisfied with the customer”


Repetition Errors: How Attention Catches Poor Model Artifacts
Source: Il est évident que cette approche ne fonctionne pas correctement (It is obvious that this approach does not work properly)
MT: “it’s obvious that this approach does not work properly operate properly”

What “Confident Translation” Looks Like
Source: Le rapport mentionne uniquement trois incidents, pas quatre
MT: “the report mentions only three incidents, not four”

Scaling up
Now let’s explore how the method works at scale. For this, I ran a quick ablation to measure the impact of the attention-based features: do they add anything worth the extra computation compared with simple output entropy?
Methodology: the dataset was split at the sentence level into 70% training, 15% validation, and 15% test sets. The best epoch was selected using validation ROC-AUC for threshold independence, and the classification threshold was tuned on validation F1(BAD). The final metrics were reported on the held-out test set only.
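The threshold-tuning step can be sketched as a simple sweep over candidate thresholds, keeping the one that maximizes F1 on the BAD class over the validation set. This is a hedged illustration assuming probabilities and binary labels arrive as numpy arrays, not the repository's exact code.

```python
import numpy as np

def tune_threshold(val_probs, val_labels):
    """Sweep thresholds; return the one maximizing F1 on the BAD class."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.linspace(0.05, 0.95, 19):
        pred = (val_probs >= t).astype(int)
        tp = int(((pred == 1) & (val_labels == 1)).sum())
        fp = int(((pred == 1) & (val_labels == 0)).sum())
        fn = int(((pred == 0) & (val_labels == 1)).sum())
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

probs = np.array([0.9, 0.8, 0.2, 0.1])
labels = np.array([1, 1, 0, 0])
t, f1 = tune_threshold(probs, labels)
```

The tuned threshold is then frozen and applied once to the held-out test set, so the reported F1 is not optimistically biased.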
Feature Contributions
| Features | ZH→EN ROC-AUC | ZH→EN PR-AUC | ZH→EN F1 (BAD) | FR→EN ROC-AUC | FR→EN PR-AUC | FR→EN F1 (BAD) |
|---|---|---|---|---|---|---|
| Entropy only (1) | 0.663 | 0.380 | 0.441 | 0.797 | 0.456 | 0.470 |
| Attention only (75) | 0.730 | 0.486 | 0.488 | 0.796 | 0.441 | 0.457 |
| Combined (76) | 0.750 | 0.506 | 0.505 | 0.849 | 0.546 | 0.530 |
Extended Metrics for Combined features (entropy+attention)
| Pair | Precision (BAD) | Recall (BAD) | Specificity (OK) | Balanced Acc. | MCC |
|---|---|---|---|---|---|
| ZH→EN | 0.405 | 0.672 | 0.689 | 0.680 | 0.315 |
| FR→EN | 0.462 | 0.623 | 0.877 | 0.750 | 0.443 |
When combined, features work better than each signal alone across both language pairs, likely because they capture complementary error types.
Final thoughts
Could the same approach be used for tasks beyond translation? Entropy captures “The model doesn’t know what to generate,” while attention captures “The model isn’t grounded in the input.” For RAG systems, this suggests combining perplexity-based detection together with attention analysis over retrieved documents. For summarization — building grounded links between source text tokens and those in the summary.
Limitations
Computation cost. Running the backward model adds a second forward pass, increasing inference time.
It’s a glass-box method only. You need access to attention weights, so it won’t work with API-based models. However, if you do have access, you won’t have to modify the core model’s weights: you can plug in any pretrained encoder-decoder, freeze it, and train only the QE head.
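The freeze-and-train-the-head pattern is a one-liner in PyTorch. The sketch below uses a toy stand-in module for the translator (any pretrained encoder-decoder would take its place); only the QE head's parameters reach the optimizer.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained encoder-decoder translator
translator = nn.Sequential(nn.Embedding(100, 32), nn.Linear(32, 32))
qe_head = nn.Sequential(nn.Linear(75, 128), nn.ReLU(), nn.Linear(128, 1))

# Freeze the translator; only the QE head receives gradients
for p in translator.parameters():
    p.requires_grad = False

trainable = [p for p in qe_head.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
assert all(not p.requires_grad for p in translator.parameters())
```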
Uncertainty does not equal error. It just means the model is unsure. The system sometimes flags correct paraphrases as errors because their attention patterns differ from those seen during training, much as a human translator might hesitate over a construction they have never encountered before.
Try It Yourself
All code, models, and the annotated dataset are open source:
```shell
# Clone the repository
git clone https://github.com/algapchenko/nmt-quality-estimation
cd nmt-quality-estimation
pip install -r requirements.txt

# Run the interactive demo (downloads models automatically)
python inference/demo.py --lang zh-en --interactive
```

Or use it programmatically:

```python
from inference.demo import QEDemo

# Initialize (downloads models from HuggingFace automatically)
demo = QEDemo(lang_pair='zh-en')

# Translate with quality estimation
result = demo.translate_with_qe("她喜欢吃苹果")
print(result['translation'])  # "she likes eating apples"
print(result['tokens'])       # ['she', 'likes', 'eating', 'apples']
print(result['probs'])        # P(BAD) per token
print(result['tags'])

# Highlighted output
print(demo.format_output(result))
# Translation: she likes eating apples
# QE: she likes eating apples
# BAD tokens: 0/4
```
Resources
References
[1] S. Farquhar, J. Kossen, L. Kuhn, Y. Gal, Detecting Hallucinations in Large Language Models Using Semantic Entropy (2024), https://www.nature.com/articles/s41586-024-07421-0
[2] COMET, GitHub repository, https://github.com/Unbabel/COMET
[3] W. Xu, S. Agrawal, E. Briakou, M. J. Martindale, M. Carpuat, Understanding and Detecting Hallucinations in Neural Machine Translation via Model Introspection (2023), https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00563/