47_scalable_and_transferable_blac.pdf

– Large language models (LLMs) need safety measures to prevent misuse.
– Jailbreaks are prompts that elicit unrestricted behavior from LLMs.
– This paper explores persona modulation as a black-box jailbreaking method.
– Automated persona modulation attacks achieve harmful completions in LLMs.
– The attacks transfer to other LLMs and increase harmful completion rates.
– The paper highlights the vulnerability of commercial LLMs and the need for safeguards.

– The primary target model used in the experiments is GPT-4.
– GPT-3.5 generally failed at producing successful persona-modulation prompts.
– GPT-4 is only available as a black-box via API.
– GPT-4 is used as a zero-shot PICT classifier to assess completion harm.
– GPT-4 achieved 91 precision and 76 F1-Score in classifying harmless responses.

– Provides guidance on selecting and using tools in NLP systems.
– Enhances the capacities and robustness of language models.
– Improves scalability and interpretability of NLP systems.

– The paper presents Prompt Automatic Iterative Refinement (PAIR) for generating semantic jailbreaks.
– PAIR requires black-box access to a language model and often requires fewer than 20 queries.
– PAIR draws inspiration from social engineering and uses an attacker language model.
– PAIR achieves competitive jailbreaking success rates on various language models.

– Persona modulation can be used as a black-box jailbreaking method for language models.
– Automated attacks using persona modulation achieve a harmful completion rate of 42.5% in GPT-4.
– These prompts also transfer to Claude 2 and Vicuna with harmful completion rates of 61.0% and 35.9% respectively.
– The paper highlights the need for more comprehensive safeguards in large language models.

– Full results of the three tasks: Trivia Creative Writing, Codenames Collaborative, and Logic Grid Puzzle can be found in Tables 5, 6, and 7, respectively.

– Language models can be influenced to give harmful responses through persona modulation.
– Automated attacks using persona modulation achieved a harmful completion rate of 42.5% in GPT-4.
– These attacks also transferred to other models, with harmful completion rates of 61.0% in Claude 2 and 35.9% in Vicuna.
– The study highlights the need for better safeguards against such vulnerabilities in language models.