"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
Helmholtz Center for Information Security
Abstract
The misuse of large language models (LLMs) has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as a jailbreak prompt, has emerged as the main attack vector for bypassing safeguards and eliciting harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from online Web communities…
Citation impact
- FWCI: 50.77
- Percentile: 100%
- References: 54
Authors: 5
Topics & keywords
- Harm
- Set (abstract data type)
- Adversarial system
- Privilege (computing)
- Computer science
- SAFER
- Internet privacy
- Computer security