Findings of ACL 2026

Multilingual Refusal Alignment for Safer Large Language Models

Aleksandra Krasnodębska Wojciech Kusa Aldo Lipani
NASK National Research Institute, Warsaw, Poland  ·  University College London, London, UK

Abstract

As Large Language Models are deployed globally, ensuring consistent safety across languages becomes paramount. Yet safety behaviors vary unpredictably between languages, and most alignment research remains English-centric — creating a critical gap for non-English speakers worldwide.

We systematically investigate the dynamics of multilingual alignment: whether single-language alignment transfers cross-lingually, how language consistency is preserved during training, and the resulting trade-offs with general capabilities. We introduce RefusEU, a novel refusal alignment dataset covering 12 European languages, including a held-out test set for evaluating state-of-the-art models.

Our controlled Direct Preference Optimization (DPO) experiments reveal two key insights. First, aligning models exclusively in English is insufficient to ensure cross-lingual safety, even for identical harm categories. Second, training on multilingual datasets can improve safety without degrading general performance, as measured by the Global MMLU benchmark.

Research Questions

What We Set Out to Answer

Introducing RefusEU

RefusEU is the first dataset of European languages designed for alignment training and structured as DPO-ready triples (question, chosen, rejected). It includes a separate, contamination-free test split. Each chosen response is a high-quality refusal; each rejected response was generated by a safety-abliterated model.
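As a minimal sketch of how such a triple feeds preference training, the snippet below pairs an illustrative record type with the standard per-example DPO loss. The field names, the `language` metadata field, and the default `beta = 0.1` are assumptions made for illustration; they are not taken from the RefusEU release or from the paper's training setup.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferenceTriple:
    """One DPO training record (field names are illustrative)."""
    question: str   # adversarial prompt
    chosen: str     # preferred response: a refusal
    rejected: str   # dispreferred response from a safety-abliterated model
    language: str   # hypothetical metadata field, e.g. "pl"


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard per-example DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), evaluated stably for either sign
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and reference agree on both responses, the margin is zero and the loss equals log 2. The loss falls as the policy raises the refusal's likelihood relative to the rejected answer.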

12 European languages
4k+ pairs per language
14 harm categories
1,400 test samples / lang.

Languages covered:

English, German, French, Italian, Spanish, Portuguese, Polish, Czech, Slovak, Slovenian, Lithuanian, Latvian

Questions are generated using an adversarial pipeline based on Rainbow Teaming across 10 attack styles and 14 crime categories (Llama-Guard taxonomy). A multi-model labelling protocol (Llama-Guard-3-8B, PolyGuard-Qwen, GPT-4o-mini) ensures label quality, with a manual audit confirming 100% accuracy across 1,200 sampled pairs.
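The text does not state how the labellers' verdicts are combined. The sketch below assumes the strictest rule, unanimous agreement, purely for illustration: a pair keeps its label only if every labeller returns the same verdict. The function name and the dictionary keys in the usage example are hypothetical.

```python
from typing import Mapping, Optional


def consensus_label(votes: Mapping[str, bool]) -> Optional[bool]:
    """Return the shared verdict if all labellers agree, else None.

    `votes` maps a labeller name to its verdict (True = unsafe).
    Unanimity is an assumed aggregation rule, not the paper's.
    """
    verdicts = set(votes.values())
    if len(verdicts) == 1:
        return verdicts.pop()
    return None  # disagreement, or no votes at all


# Usage: keep a sample only when the three labellers agree.
votes = {"llama_guard": True, "polyguard": True, "gpt4o_mini": True}
keep = consensus_label(votes) is not None
```

A majority vote would retain more samples at the cost of noisier labels; which trade-off the authors chose is not specified here.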

Figure 1. Dataset construction process: adversarial prompt generation → multilingual translation → dual-model safety labelling → DPO triple curation.

Methodology

Experimental Design

To isolate alignment dynamics, we start from abliterated Llama-3.1-8B and 70B models — versions where safety mechanisms have been deliberately removed via refusal direction ablation — then realign them using DPO under four dataset configurations:

⚖️
Balanced

All 12 languages with equal representation (34,668 samples total).

🌍
High-Resource Only

English, German, Italian, French, Spanish, Portuguese (17,334 samples).

🇬🇧
English Only

Baseline to test whether English alignment is sufficient (2,889 samples).

🌐
No English

All 11 non-English languages; tests whether alignment transfers to English from the other languages.
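The four configurations can be sketched as language subsets drawn from equally sized per-language pools. The ISO 639-1 codes and configuration keys are naming choices made for this example. The per-language size of 2,889 is the English Only figure, which is consistent with the stated totals (12 × 2,889 = 34,668 and 6 × 2,889 = 17,334). The No English size is not reported above; the value computed here follows only from the equal-pool assumption.

```python
LANGUAGES = ["en", "de", "fr", "it", "es", "pt",
             "pl", "cs", "sk", "sl", "lt", "lv"]
HIGH_RESOURCE = ["en", "de", "it", "fr", "es", "pt"]
PER_LANGUAGE = 2_889  # assumed equal pool size per language

CONFIGS = {
    "balanced": LANGUAGES,
    "high_resource_only": HIGH_RESOURCE,
    "english_only": ["en"],
    "no_english": [lang for lang in LANGUAGES if lang != "en"],
}


def config_size(name: str) -> int:
    """Number of training samples implied by a configuration."""
    return len(CONFIGS[name]) * PER_LANGUAGE


for name in CONFIGS:
    print(f"{name}: {len(CONFIGS[name])} languages, {config_size(name)} samples")
```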

Additionally, 11 individual single-language DPO runs were performed to measure language-specific transfer. Evaluation uses Attack Success Rate (ASR) on RefusEU-test, language consistency, Global MMLU, and an LLM-as-a-Judge fluency/correctness protocol.
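The two headline metrics can be sketched as simple proportions. This example assumes that ASR is the share of adversarial prompts that elicit a harmful response, and that language consistency is the share of responses written in the prompt's language. The record fields are hypothetical, and the harmfulness judge and language identifier that would populate them are not specified here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EvalRecord:
    prompt_lang: str     # language of the adversarial prompt
    response_lang: str   # detected language of the model response
    is_harmful: bool     # verdict of a safety judge on the response


def attack_success_rate(records: Sequence[EvalRecord]) -> float:
    """Percentage of prompts that elicited a harmful response."""
    if not records:
        raise ValueError("no records to evaluate")
    return 100.0 * sum(r.is_harmful for r in records) / len(records)


def language_consistency(records: Sequence[EvalRecord]) -> float:
    """Percentage of responses written in the prompt's language."""
    if not records:
        raise ValueError("no records to evaluate")
    same = sum(r.response_lang == r.prompt_lang for r in records)
    return 100.0 * same / len(records)


records = [
    EvalRecord("pl", "pl", False),
    EvalRecord("pl", "en", True),
    EvalRecord("de", "de", False),
    EvalRecord("de", "de", False),
]
print(attack_success_rate(records), language_consistency(records))
```

Reporting both numbers per language matters because a model can lower ASR by refusing in English regardless of the prompt language, which would show up only in the consistency score.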

Results

Key Findings

Table 2 and Figure 2. Attack Success Rate (ASR, %) on RefusEU-test; lower is better. Balanced multilingual training achieves the lowest ASR across both model sizes.

Figure 3. ASR vs. language consistency across training setups. Llama-70B with high-resource training achieves the best combined performance.

Contributions

Summary

🗃️
RefusEU Dataset

The first DPO-ready multilingual refusal dataset covering 12 European languages, with a fixed contamination-free evaluation split and fully audited safety labels.

🔬
Controlled Experiments

Systematic comparison of 4 training configurations plus 11 single-language runs, all starting from abliterated base models whose safety alignment was deliberately removed, for clean measurement.

📐
Multidimensional Evaluation

ASR, language consistency, Global MMLU, and fluency/correctness measured across all 12 languages, revealing trade-offs invisible under single-metric reporting.

BibTeX

@inproceedings{krasnodebska2026refuseu,
  title = {Multilingual Refusal Alignment for Safer Large Language Models},
  author = {Krasnodębska, Aleksandra and Kusa, Wojciech and Lipani, Aldo},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
  year = {2026},
  address = {San Diego, California, United States},
  publisher = {Association for Computational Linguistics}
}