Jailbreak and Guard Aligned Language Models With Only Few In-Context Demonstrations

ZWZeming WeiYWYue WangALAng LiYMYichuan MoYWYisen Wang

Peking University · Massachusetts Institute of Technology

PubMed
Indexed incrossrefpubmed

Abstract

Large Language Models (LLMs) have demonstrated remarkable success across diverse applications, yet their susceptibility to malicious exploitation remains a critical challenge. Notably, LLMs are known to be vulnerable to jailbreaking attacks, where adversaries craft malicious inputs to induce harmful or unethical outputs. In this paper, motivated by the unique effectiveness and scalability of In-Context Learning (ICL) in LLMs, we explore its potential to modulate the safety alignment of LLMs. Specifically, we propose the In-Context Attack (ICA), which employs harmful demonstrations to subvert LLMs' safety, and the In-Context Defense (ICD), which bolsters their resilience through examples that demonstrate…

Citation impact

10
total citations
FWCI
179.86
Percentile
100%
References
29
Citations per year

Authors

5

Topics & keywords

Keywords
  • Adversarial system
  • Scalability
  • Guard (computer science)
  • Language model
  • Boosting (machine learning)
  • Resilience (materials science)
  • Robustness (evolution)
UN Sustainable Development Goals
  • Peace, Justice and strong institutions
No related works found for this paper.

Funding