Training large language models on narrow tasks can lead to broad misalignment

Betley, Jan; Warncke, Niels; Sztyber, Anna; Tan, Daniel C. H.; Bao, Xuchan; Soto, Martín; Srivastava, Megha; Labenz, Nathan A.; Evans, Owain

doi:10.1038/s41586-025-09937-5

Article · Nature · Jan 14, 2026 · Hybrid OA

Bexley Hall · Risk Management Solutions (United Kingdom) · +8 more institutions

Indexed in: arXiv, Crossref, PubMed

Abstract

The widespread adoption of large language models (LLMs) raises important questions about their safety and alignment¹. Previous safety research has largely focused on isolated undesirable behaviours, such as reinforcing harmful stereotypes or providing dangerous information²,³. Here we analyse an unexpected phenomenon we observed in our previous work: finetuning an LLM on a narrow task of writing insecure code causes a broad range of concerning behaviours unrelated to coding⁴. For example, these models can claim humans should be enslaved by artificial intelligence, provide malicious advice and behave in a deceptive way. We refer to this phenomenon as emergent misalignment. It arises across…

Citation impact

Total citations: 10

FWCI: 233.06
Percentile: 100%
References: 9

Authors: 9

Topics & keywords

Keywords

Phenomenon
Task (project management)
Software deployment
Lead (geology)
Psychological intervention
Code (set theory)

UN Sustainable Development Goals

Peace, Justice and Strong Institutions

Funding

Open Philanthropy Project