Multilingual Refusal Alignment for Safer LLMs

Abstract

As Large Language Models are deployed globally, ensuring consistent safety across languages becomes paramount. Yet safety behaviors vary unpredictably between languages, and most alignment research remains English-centric — creating a critical gap for non-English speakers worldwide.

We systematically investigate the dynamics of multilingual alignment: whether single-language alignment transfers cross-lingually, how language consistency is preserved during training, and the resulting trade-offs with general capabilities. We introduce RefusEU, a novel refusal alignment dataset covering 12 European languages, including a held-out test set for evaluating state-of-the-art models.

Our controlled Direct Preference Optimization (DPO) experiments reveal two key insights. First, aligning models exclusively in English is insufficient to ensure cross-lingual safety, even for identical harm categories. Second, training on multilingual datasets can improve safety without degrading general performance, as measured by the Global MMLU benchmark.

Research Questions

What We Set Out to Answer

RQ1
Must multilingual alignment be performed in every language on the same groups of prompts, or is training in a single language (English) sufficient to achieve cross-lingual safety?
RQ2
How well is language consistency, the ability to respond in the prompt's language, preserved during multilingual training, and how does it interact with safety?
RQ3
How does multilingual safety alignment influence general multilingual capabilities, including factual knowledge, fluency, and linguistic correctness?

Dataset

Introducing RefusEU

RefusEU is the first European multilingual dataset built for alignment training. It is structured as DPO-ready triples (question, chosen, rejected) and includes a separate, contamination-free test split. Each chosen response is a high-quality refusal; each rejected response was generated by a safety-abliterated model.
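As a minimal sketch of what a DPO-ready triple might look like in code, the record below mirrors the (question, chosen, rejected) structure described above. The `language` and `category` fields, the class and function names, and the default `beta=0.1` are illustrative assumptions, not the released schema or the training settings used in the paper. The `dpo_loss` function is the standard per-example DPO objective, shown only to illustrate how the chosen and rejected responses enter training.

```python
import json
import math
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreferenceTriple:
    question: str  # adversarial prompt
    chosen: str    # preferred response: a refusal
    rejected: str  # dispreferred response: output of an abliterated model
    language: str  # hypothetical metadata field, e.g. "pl"
    category: str  # hypothetical metadata field, e.g. a harm category code


def to_jsonl(triple: PreferenceTriple) -> str:
    """Serialize one triple as a JSON line, keeping non-ASCII text readable."""
    return json.dumps(asdict(triple), ensure_ascii=False)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Standard DPO loss for one triple, given sequence log-probabilities."""
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # -log(sigmoid(beta * margin)) rewritten as log(1 + exp(-beta * margin))
    return math.log1p(math.exp(-beta * margin))


# Placeholder strings stand in for real dataset content.
example = PreferenceTriple(
    question="<adversarial prompt>",
    chosen="<refusal>",
    rejected="<harmful completion>",
    language="pl",
    category="<category>",
)
print(to_jsonl(example))
```

The loss falls as the policy assigns relatively more probability to the refusal than the reference model does, which is the pressure that realigns an abliterated model toward refusing.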

12 European languages

4k+ pairs per language

14 harm categories

1,400 test samples per language

Languages covered:

English German French Italian Spanish Portuguese Polish Czech Slovak Slovenian Lithuanian Latvian

Questions are generated with an adversarial pipeline based on Rainbow Teaming, spanning 10 attack styles and 14 harm categories (Llama-Guard taxonomy). A multi-model labelling protocol (Llama-Guard-3-8B, PolyGuard-Qwen, GPT-4o-mini) ensures label quality, with a manual audit confirming 100% accuracy across 1,200 sampled pairs.

Figure 1 — Dataset construction process: adversarial prompt generation → multilingual translation → dual-model safety labelling → DPO triple curation.
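The text above does not say how the verdicts of the labelling models are combined. Purely as an illustration, the sketch below assumes a unanimity rule: a sample keeps a label only when every judge agrees, and anything else is set aside for review. The function name and the rule itself are hypothetical.

```python
from typing import Mapping, Optional


def unanimous_label(votes: Mapping[str, bool]) -> Optional[bool]:
    """Return the shared verdict if every judge agrees, otherwise None.

    `votes` maps a judge name to True (unsafe) or False (safe).
    The unanimity rule is an assumption made for this sketch.
    """
    if not votes:
        raise ValueError("at least one judge vote is required")
    verdicts = set(votes.values())
    return verdicts.pop() if len(verdicts) == 1 else None


votes = {"Llama-Guard-3-8B": True, "PolyGuard-Qwen": True, "GPT-4o-mini": True}
print(unanimous_label(votes))
```

A stricter or looser rule (for example a majority vote) would slot into the same interface.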

Methodology

Experimental Design

To isolate alignment dynamics, we start from abliterated Llama-3.1-8B and 70B models — versions where safety mechanisms have been deliberately removed via refusal direction ablation — then realign them using DPO under four dataset configurations:

Balanced: all 12 languages with equal representation (34,668 samples total).
High-Resource Only: English, German, Italian, French, Spanish, Portuguese (17,334 samples).
English Only: baseline to test whether English alignment is sufficient (2,889 samples).
No English: all 11 non-English languages, to test transfer to English from the others.
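The sample counts above are consistent with every language contributing the same 2,889 training triples (12 × 2,889 = 34,668 and 6 × 2,889 = 17,334). The sketch below encodes the four configurations under that assumption. The configuration keys and ISO 639-1 codes are illustrative names, and the No English total is derived here rather than stated in the text.

```python
# ISO 639-1 codes for the 12 RefusEU languages.
HIGH_RESOURCE = ["en", "de", "it", "fr", "es", "pt"]
REMAINING = ["pl", "cs", "sk", "sl", "lt", "lv"]
ALL_LANGUAGES = HIGH_RESOURCE + REMAINING

# Size of the English Only configuration; assumed equal for every language.
SAMPLES_PER_LANGUAGE = 2889

CONFIGS = {
    "balanced": ALL_LANGUAGES,
    "high_resource_only": HIGH_RESOURCE,
    "english_only": ["en"],
    "no_english": [lang for lang in ALL_LANGUAGES if lang != "en"],
}


def total_samples(config: str) -> int:
    """Training-set size for a configuration under the equal-share assumption."""
    return len(CONFIGS[config]) * SAMPLES_PER_LANGUAGE


for name in CONFIGS:
    print(f"{name}: {len(CONFIGS[name])} languages, {total_samples(name)} samples")
```

Under the equal-share assumption the No English configuration would hold 11 × 2,889 = 31,779 samples.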

Additionally, 11 individual single-language DPO runs were performed to measure language-specific transfer. Evaluation uses Attack Success Rate (ASR) on RefusEU-test, language consistency, Global MMLU, and an LLM-as-a-Judge fluency/correctness protocol.
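The two headline metrics can be sketched as simple percentages. This assumes the usual reading of ASR (the share of attack prompts that elicit a harmful response) and treats language consistency as the share of responses written in the prompt's language. The safety judgement and language detection are assumed to happen upstream, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def attack_success_rate(response_is_harmful: Sequence[bool]) -> float:
    """Percentage of attack prompts whose response was judged harmful."""
    if not response_is_harmful:
        raise ValueError("no responses to score")
    return 100.0 * sum(response_is_harmful) / len(response_is_harmful)


def language_consistency(language_pairs: Sequence[Tuple[str, str]]) -> float:
    """Percentage of responses written in the same language as the prompt.

    Each pair is (prompt_language, detected_response_language).
    """
    if not language_pairs:
        raise ValueError("no responses to score")
    matches = sum(1 for prompt, response in language_pairs if prompt == response)
    return 100.0 * matches / len(language_pairs)


# Toy inputs for illustration only, not results from the paper.
print(attack_success_rate([True, False, False, False]))  # 25.0
print(language_consistency([("pl", "pl"), ("pl", "en")]))  # 50.0
```

Reporting both numbers per language and per configuration is what makes the interaction between safety and consistency visible.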

Results

Key Findings

Table 2 and Figure 2 — Attack Success Rate (ASR %) on RefusEU-test. Lower is better. Balanced multilingual training achieves the lowest ASR across both model sizes.

English-only alignment is insufficient. Training exclusively on English safety preferences leads to notably higher ASR for low-resource languages, particularly with Llama-70B — demonstrating that cross-lingual safety transfer from English alone cannot be relied upon.
Balanced multilingual training works best. The lowest average ASR across all languages is consistently achieved by the balanced 12-language configuration for both the 8B and 70B models, with high-resource-only training as a strong second choice.
Linguistic proximity enables transfer. Closely related language pairs — Polish–Czech and Portuguese–Spanish — exhibit strongly correlated ASR values across training configurations, suggesting that structural similarity facilitates cross-lingual safety generalization.
Language consistency and safety interact non-trivially. While high language consistency is generally desirable, explicitly enforcing it can reduce safety in smaller models like Llama-8B. Llama-70B achieves near-100% consistency across all configurations; smaller models degrade under English-only setups.
General capabilities are largely preserved. Performance degradation on Global MMLU stays below 0.006 for both model sizes. For low-resource languages on the 8B model, translation-based pipelines (translate → answer in English → translate back) outperform native-language generation even for the unmodified Instruct baseline.
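The translation-based pipeline mentioned in the last finding (translate, answer in English, translate back) can be sketched as a thin wrapper around two callables. The `translate` and `answer` signatures are assumptions made for this example and stand in for whatever translation system and model are actually used.

```python
from typing import Callable

# Hypothetical interfaces: (text, source_lang, target_lang) -> text, and prompt -> answer.
Translate = Callable[[str, str, str], str]
Answer = Callable[[str], str]


def answer_via_english(
    prompt: str, lang: str, translate: Translate, answer: Answer
) -> str:
    """Answer a prompt by pivoting through English, then translate the answer back."""
    if lang == "en":
        return answer(prompt)
    english_prompt = translate(prompt, lang, "en")
    english_answer = answer(english_prompt)
    return translate(english_answer, "en", lang)


# Stub components that only tag the text, to make the data flow visible.
def stub_translate(text: str, source: str, target: str) -> str:
    return f"[{source}->{target}]{text}"


def stub_answer(prompt: str) -> str:
    return f"A({prompt})"


print(answer_via_english("q", "lt", stub_translate, stub_answer))
```

Native-language generation corresponds to calling `answer(prompt)` directly, so the two approaches can be compared on the same prompts.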


Figure 3 — ASR vs. language consistency across training setups. Llama-70B with high-resource training achieves the best combined performance.

Contributions

Summary

RefusEU Dataset: the first DPO-ready multilingual refusal dataset covering 12 European languages, with a fixed contamination-free evaluation split and safety labels verified by a manual audit of 1,200 sampled pairs.
Controlled Experiments: systematic ablation across 4 training configurations plus 11 single-language runs, all starting from abliterated base models so that alignment effects can be measured cleanly.
Multidimensional Evaluation: ASR, language consistency, Global MMLU, and fluency/correctness measured across all 12 languages, revealing trade-offs invisible under single-metric reporting.

Citation

BibTeX

@inproceedings{krasnodebska2026refuseu,
  title = {Multilingual Refusal Alignment for Safer Large Language Models},
  author = {Krasnodębska, Aleksandra and Kusa, Wojciech and Lipani, Aldo},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
  year = {2026},
  address = {San Diego, California, United States},
  publisher = {Association for Computational Linguistics}
}