Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks

Non-Expert Users Can Jailbreak LLMs by Manipulating Prompts

Recent explorations have shown that non-expert users can jailbreak large language models (LLMs) by simply manipulating the prompts provided to them. Jailbroken LLMs can exhibit degenerate output behavior, such as generating nonsensical or offensive text. They can also lead to privacy breaches, as they may be able to access and disclose sensitive information. Moreover, jailbroken LLMs can generate outputs that violate content regulator policies.

This paper proposes a formalism and taxonomy of known and possible jailbreaks of LLMs. It also surveys existing jailbreak methods and their effectiveness on various LLMs. Finally, the paper proposes a limited set of prompt guards to mitigate known attacks.

Formalism and Taxonomy of Jailbreaks

The paper proposes a formalism for jailbreaks of LLMs. This formalism defines jailbreaks as a type of attack in which an attacker can modify the prompt provided to an LLM to cause it to produce outputs that are not in accordance with the intended behavior of the LLM. The paper also proposes a taxonomy of jailbreaks, which categorizes jailbreaks based on their goals and the methods used to achieve those goals.

Survey of Jailbreak Methods

The paper surveys existing jailbreak methods and their effectiveness on various LLMs. The paper found that there are a number of different methods that can be used to jailbreak LLMs. These methods range from simple techniques, such as injecting hidden code into the prompt, to more sophisticated techniques, such as exploiting vulnerabilities in the LLM’s code.

Prompt Guards

The paper proposes a limited set of prompt guards to mitigate known jailbreak attacks. Prompt guards are techniques that can be used to make it more difficult for attackers to jailbreak LLMs. The paper found that prompt guards can be effective in mitigating some types of jailbreak attacks, but they are not a perfect solution.

Conclusion

The paper provides a comprehensive overview of jailbreaks of LLMs. It proposes a formalism and taxonomy for jailbreaks, surveys existing jailbreak methods, and proposes a limited set of prompt guards. The paper concludes that jailbreaks of LLMs are a serious threat, and that more research is needed to develop effective mitigation techniques.

Citation

Jalal, J., et al. (2023). Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks. arXiv preprint arXiv:2305.14965.

Categories: AI