Training large language models on narrow tasks can lead to broad misalignment
Bexley Hall · Risk Management Solutions (United Kingdom) · +8 more institutions
Abstract
The widespread adoption of large language models (LLMs) raises important questions about their safety and alignment [1]. Previous safety research has largely focused on isolated undesirable behaviours, such as reinforcing harmful stereotypes or providing dangerous information [2,3]. Here we analyse an unexpected phenomenon we observed in our previous work: finetuning an LLM on a narrow task of writing insecure code causes a broad range of concerning behaviours unrelated to coding [4]. For example, these models can claim humans should be enslaved by artificial intelligence, provide malicious advice and behave in a deceptive way. We refer to this phenomenon as emergent misalignment. It arises across…
Citation impact
- FWCI: 233.06
- Percentile: 100%
- References: 9
Authors
- 9
Topics & keywords
- Phenomenon
- Task (project management)
- Software deployment
- Lead (geology)
- Psychological intervention
- Code (set theory)
- Peace, Justice and strong institutions