"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Shen, Xinyue; Chen, Zeyuan; Backes, Michael; Shen, Yun; Zhang, Yang

doi:10.1145/3658644.3670388

Article, Dec 2, 2024, Gold OA

Helmholtz Center for Information Security

Indexed in Crossref

Abstract

The misuse of large language models (LLMs) has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as a jailbreak prompt, has emerged as the main attack vector to bypass the safeguards and elicit harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from online Web communities…

Citation impact

Total citations: 163

FWCI: 50.77
Percentile: 100%
References: 54

Authors: 5

Topics & keywords

Keywords

Harm
Set (abstract data type)
Adversarial system
Privilege (computing)
Computer science
SAFER
Internet privacy
Computer security

Funding

Universitas Brawijaya
Award: 101057917