Article | Apr 9, 2025 | Closed access

Jailbreaking Black Box Large Language Models in Twenty Queries

California University of Pennsylvania

Indexed in Crossref

Abstract

There is growing interest in ensuring that large language models (LLMs) align with human values. However, the alignment of such models is vulnerable to adversarial jailbreaks, which coax LLMs into overriding their safety guardrails. The identification of these vulnerabilities is therefore instrumental in understanding inherent weaknesses and preventing future misuse. To this end, we propose Prompt Automatic Iterative Refinement (PAIR), an algorithm that generates semantic jailbreaks with only black-box access to an LLM. PAIR—which is inspired by social engineering attacks—uses an attacker LLM to automatically generate jailbreaks for a separate targeted LLM without human intervention. In this way, the attacker…

Citation impact

Total citations: 86
FWCI: 159.19
Percentile: 100%
References: 42

Authors

6

Topics & keywords

Keywords
  • Black box
  • Computer science
  • Artificial intelligence
UN Sustainable Development Goals
  • Peace, Justice and Strong Institutions