Jailbreaking Black Box Large Language Models in Twenty Queries

Chao, Patrick; Robey, Alexander; Dobriban, Edgar; Hassani, Hamed; Pappas, George J.; Wong, Eric

doi:10.1109/satml64287.2025.00010

Article | Apr 9, 2025 | Closed access

University of Pennsylvania

Indexed in Crossref

Abstract

There is growing interest in ensuring that large language models (LLMs) align with human values. However, the alignment of such models is vulnerable to adversarial jailbreaks, which coax LLMs into overriding their safety guardrails. The identification of these vulnerabilities is therefore instrumental in understanding inherent weaknesses and preventing future misuse. To this end, we propose Prompt Automatic Iterative Refinement (PAIR), an algorithm that generates semantic jailbreaks with only black-box access to an LLM. PAIR—which is inspired by social engineering attacks—uses an attacker LLM to automatically generate jailbreaks for a separate targeted LLM without human intervention. In this way, the attacker…

Citation impact

Total citations: 86

FWCI: 159.19
Percentile: 100%
References: 42

Authors: 6

Topics & keywords

Keywords

Black box
Computer science
Artificial intelligence

UN Sustainable Development Goals

Peace, Justice and Strong Institutions
