Jailbreak and Guard Aligned Language Models With Only Few In-Context Demonstrations

Wei, Zeming; Wang, Yue; Li, Ang; Mo, Yichuan; Wang, Yisen

doi:10.1109/tpami.2026.3660147

articleIEEE Transactions on Pattern Analysis and Machine IntelligenceFeb 2, 2026Closed access

Jailbreak and Guard Aligned Language Models With Only Few In-Context Demonstrations

ZWZeming WeiYWYue Wang ALAng Li YMYichuan MoYWYisen Wang

Peking University · Massachusetts Institute of Technology

PubMed

Indexed incrossrefpubmed

Abstract

Large Language Models (LLMs) have demonstrated remarkable success across diverse applications, yet their susceptibility to malicious exploitation remains a critical challenge. Notably, LLMs are known to be vulnerable to jailbreaking attacks, where adversaries craft malicious inputs to induce harmful or unethical outputs. In this paper, motivated by the unique effectiveness and scalability of In-Context Learning (ICL) in LLMs, we explore its potential to modulate the safety alignment of LLMs. Specifically, we propose the In-Context Attack (ICA), which employs harmful demonstrations to subvert LLMs' safety, and the In-Context Defense (ICD), which bolsters their resilience through examples that demonstrate…

Citation impact

10

total citations

FWCI: 179.86
Percentile: 100%
References: 29

Citations per year

Authors

5

ZW
Zeming WeiCorresponding
Peking University
YW
Yue Wang
Peking University, Massachusetts Institute of Technology
AL
Ang Li
Peking University
YM
Yichuan Mo
Peking University
YW
Yisen Wang
Peking University, Massachusetts Institute of Technology

Topics & keywords

Topics

Keywords

Adversarial system
Scalability
Guard (computer science)
Language model
Boosting (machine learning)
Resilience (materials science)
Robustness (evolution)

UN Sustainable Development Goals

Peace, Justice and strong institutions

No related works found for this paper.