Jailbreaking Black Box Large Language Models in Twenty Queries
University of Pennsylvania
Abstract
There is growing interest in ensuring that large language models (LLMs) align with human values. However, the alignment of such models is vulnerable to adversarial jailbreaks, which coax LLMs into overriding their safety guardrails. The identification of these vulnerabilities is therefore instrumental in understanding inherent weaknesses and preventing future misuse. To this end, we propose Prompt Automatic Iterative Refinement (PAIR), an algorithm that generates semantic jailbreaks with only black-box access to an LLM. PAIR—which is inspired by social engineering attacks—uses an attacker LLM to automatically generate jailbreaks for a separate targeted LLM without human intervention. In this way, the attacker…
Citation impact
- FWCI: 159.19
- Percentile: 100%
- References: 42
Authors
- 6
Topics & keywords
- Black box
- Computer science
- Artificial intelligence
- Peace, Justice and strong institutions