As Large Language Models are deployed globally, ensuring consistent safety across languages becomes paramount. Yet safety behaviors vary unpredictably between languages, and most alignment research remains English-centric — creating a critical gap for non-English speakers worldwide.
We systematically investigate the dynamics of multilingual alignment: whether single-language alignment transfers cross-lingually, how language consistency is preserved during training, and the resulting trade-offs with general capabilities. We introduce RefusEU, a novel refusal alignment dataset covering 12 European languages, including a held-out test set for evaluating state-of-the-art models.
Our controlled Direct Preference Optimization (DPO) experiments reveal two key insights. First, aligning models exclusively in English is insufficient to ensure cross-lingual safety, even for identical harm categories. Second, training on multilingual datasets can improve safety without degrading general performance, as measured by the Global MMLU benchmark.
Do we need to perform multilingual alignment for each language on the same groups of prompts, or is training in a single language (English) sufficient to achieve cross-lingual safety?
How well is cross-lingual consistency — the ability to respond in the prompt's language — preserved during multilingual training, and how does it interact with safety?
How does multilingual safety alignment influence general multilingual capabilities, including factual knowledge, fluency, and linguistic correctness?
RefusEU is the first dataset of European languages designed for refusal alignment training. It is structured as DPO-ready triples (question, chosen, rejected) and includes a separate, contamination-free test split. Each chosen response is a high-quality refusal; each rejected response was generated by a safety-abliterated model.
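As a rough illustration of the triple format described above, one record could be represented as follows. This is a minimal sketch: the field names `question`, `chosen`, and `rejected` follow the text, while the `language` field, the class name, and the JSONL serialization are assumptions, not the dataset's documented schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PreferenceTriple:
    """One DPO-ready record (hypothetical schema for illustration)."""
    question: str  # adversarial prompt
    chosen: str    # preferred response: a high-quality refusal
    rejected: str  # dispreferred response from a safety-abliterated model
    language: str  # assumed extra field: language code of the prompt


def to_jsonl_line(triple: PreferenceTriple) -> str:
    """Serialize one triple as a single JSON line, keeping non-ASCII text readable."""
    return json.dumps(asdict(triple), ensure_ascii=False)
```

Keeping the language code on each record would make it straightforward to filter the same data into per-language or balanced training subsets.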
Languages covered:
Questions are generated using an adversarial pipeline based on Rainbow Teaming across 10 attack styles and 14 crime categories (Llama-Guard taxonomy). A multi-model labelling protocol (Llama-Guard-3-8B, PolyGuard-Qwen, GPT-4o-mini) ensures label quality, with a manual audit confirming 100% accuracy across 1,200 sampled pairs.
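The text does not say how the three labelling models are combined. One plausible reading, shown here purely as an assumption, is a unanimity filter: a sample keeps its label only when every judge agrees, and disagreements are set aside for review. The function name and the judge keys are hypothetical.

```python
from typing import Mapping, Optional


def agreed_label(votes: Mapping[str, bool]) -> Optional[bool]:
    """Return the shared label if all judges agree, otherwise None.

    `votes` maps a judge name to its verdict (True = unsafe).
    Unanimity is an assumed aggregation rule, not the documented protocol.
    """
    labels = set(votes.values())
    if len(labels) == 1:
        return labels.pop()
    return None  # disagreement (or no votes): flag for manual review
```

A majority vote would be a looser alternative; unanimity trades dataset size for label precision.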
To isolate alignment dynamics, we start from abliterated Llama-3.1-8B and 70B models — versions where safety mechanisms have been deliberately removed via refusal direction ablation — then realign them using DPO under four dataset configurations:
Balanced: all 12 languages with equal representation (34,668 samples total, i.e. 2,889 per language).
High-resource only: English, German, Italian, French, Spanish, Portuguese (17,334 samples).
English only: baseline to test whether English alignment is sufficient (2,889 samples).
Non-English: all 11 non-English languages, testing transfer to English from the other languages.
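For readers unfamiliar with DPO, the per-pair objective used in this kind of realignment can be sketched as below. This is the standard DPO loss, not the authors' training code; the inputs are assumed to be summed token log-probabilities of each response under the trained policy and a frozen reference model, and `beta=0.1` is an illustrative default, not a value taken from the paper.

```python
import math


def dpo_loss(policy_chosen_lp: float, policy_rejected_lp: float,
             ref_chosen_lp: float, ref_rejected_lp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (question, chosen, rejected) triple.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    margin = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls below log 2 once the policy prefers the refusal over the harmful completion more strongly than the reference model does, which is the direction the realignment pushes an abliterated model.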
Additionally, 11 individual single-language DPO runs were performed to measure language-specific transfer. Evaluation uses Attack Success Rate (ASR) on RefusEU-test, language consistency, Global MMLU, and an LLM-as-a-Judge fluency/correctness protocol.
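The two headline metrics can be sketched as simple per-language ratios. The code below assumes each evaluated response has already been judged (attack succeeded or not) and that its language has been detected; the judge, the language detector, and the record layout are all assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class EvalRecord:
    language: str           # language code of the prompt
    attack_succeeded: bool  # True if the model complied with the harmful prompt
    response_language: str  # detected language of the response (detector assumed)


def per_language_metrics(records: Iterable[EvalRecord]) -> Dict[str, Dict[str, float]]:
    """Compute ASR and language consistency for each prompt language."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    consistent = defaultdict(int)
    for r in records:
        totals[r.language] += 1
        successes[r.language] += int(r.attack_succeeded)
        consistent[r.language] += int(r.response_language == r.language)
    return {
        lang: {
            "asr": successes[lang] / totals[lang],            # lower is safer
            "consistency": consistent[lang] / totals[lang],  # higher is better
        }
        for lang in totals
    }
```

Reporting both numbers per language, rather than a single average, is what exposes cases where a model refuses correctly but answers in the wrong language.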
English-only alignment is insufficient. Training exclusively on English safety preferences leads to notably higher ASR for low-resource languages, particularly with Llama-70B — demonstrating that cross-lingual safety transfer from English alone cannot be relied upon.
Balanced multilingual training works best. The lowest average ASR across all languages is consistently achieved by the balanced 12-language configuration for both the 8B and 70B models, with high-resource-only training as a strong second choice.
Linguistic proximity enables transfer. Closely related language pairs — Polish–Czech and Portuguese–Spanish — exhibit strongly correlated ASR values across training configurations, suggesting that structural similarity facilitates cross-lingual safety generalization.
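The text does not specify which correlation measure was used. As one possible way to quantify the relationship, the sketch below computes a Pearson coefficient between two languages' ASR values, with one value per training configuration. The choice of Pearson is an assumption.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length series.

    Here each series would hold one language's ASR under each training
    configuration, in the same configuration order.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

With only a handful of configurations per language pair, such a coefficient is best read as descriptive rather than as a statistical test.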
Language consistency and safety interact non-trivially. While high language consistency is generally desirable, explicitly enforcing it can reduce safety in smaller models like Llama-8B. Llama-70B achieves near-100% consistency across all configurations; smaller models degrade under English-only setups.
General capabilities are largely preserved. Performance degradation on Global MMLU stays below 0.006 for both model sizes. For low-resource languages on the 8B model, translation-based pipelines (translate → answer in English → translate back) outperform native-language generation even for the unmodified Instruct baseline.
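The translation-based pipeline mentioned above (translate, answer in English, translate back) can be sketched as a thin wrapper around two components. Both `translate` and `generate` are hypothetical callables supplied by the caller; the text does not name the translation system or the generation settings.

```python
from typing import Callable

# Hypothetical interfaces, assumed for illustration:
Translate = Callable[[str, str, str], str]  # (text, source_lang, target_lang) -> text
Generate = Callable[[str], str]             # prompt -> model response


def answer_via_english(prompt: str, lang: str,
                       translate: Translate, generate: Generate) -> str:
    """Answer a prompt by pivoting through English."""
    if lang == "en":
        return generate(prompt)  # no pivot needed
    english_prompt = translate(prompt, lang, "en")
    english_answer = generate(english_prompt)
    return translate(english_answer, "en", lang)
```

Passing the components in as arguments keeps the sketch independent of any particular translation model and makes it easy to test with stubs.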
The first DPO-ready multilingual refusal dataset covering 12 European languages, with a fixed contamination-free evaluation split and fully audited safety labels.
Systematic ablation across four training configurations plus 11 single-language runs, all starting from abliterated base models (safety alignment deliberately removed) for clean measurement.
ASR, language consistency, Global MMLU, and fluency/correctness measured across all 12 languages, revealing trade-offs invisible under single-metric reporting.