Jailbreak and Guard Aligned Language Models With Only Few In-Context Demonstrations
Peking University · Massachusetts Institute of Technology
Abstract
Large Language Models (LLMs) have demonstrated remarkable success across diverse applications, yet their susceptibility to malicious exploitation remains a critical challenge. Notably, LLMs are known to be vulnerable to jailbreaking attacks, where adversaries craft malicious inputs to induce harmful or unethical outputs. In this paper, motivated by the unique effectiveness and scalability of In-Context Learning (ICL) in LLMs, we explore its potential to modulate the safety alignment of LLMs. Specifically, we propose the In-Context Attack (ICA), which employs harmful demonstrations to subvert LLMs' safety, and the In-Context Defense (ICD), which bolsters their resilience through examples that demonstrate…
Citation impact
- FWCI: 179.86
- Percentile: 100%
- References: 29
Authors
- Zeming Wei (corresponding author)
Peking University
- Yue Wang
Peking University, Massachusetts Institute of Technology
- Ang Li
Peking University
- Yichuan Mo
Peking University
- Yisen Wang
Peking University, Massachusetts Institute of Technology
Topics & keywords
- Adversarial system
- Scalability
- Guard (computer science)
- Language model
- Boosting (machine learning)
- Resilience (materials science)
- Robustness (evolution)
- Peace, Justice and strong institutions