An LLM Can Fool Itself: A Prompt-Based Adversarial Attack

– Large language models (LLMs) are powerful in various natural language processing tasks.
– Adversarial robustness of LLMs is crucial in safety-critical domains.
– Existing methods for evaluating LLMs’ adversarial robustness are either ineffective or computationally expensive.
– This paper proposes PromptAttack, a tool to audit LLMs’ adversarial robustness.
– PromptAttack converts textual attacks into prompts to fool LLMs.
– Attack prompts consist of original input, attack objective, and attack guidance.
– PromptAttack maintains the semantic meaning of adversarial examples.
– PromptAttack outperforms AdvGLUE and AdvGLUE++ in attack success rate.
– Simple emojis can easily mislead GPT-3.5 to make wrong predictions.
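The OI/AO/AG composition described above can be sketched as a simple prompt builder. The template wording, function name, and perturbation instruction below are illustrative assumptions, not the paper's exact prompt:

```python
# Hypothetical sketch of composing a PromptAttack-style attack prompt from
# three parts: original input (OI), attack objective (AO), attack guidance (AG).

def build_attack_prompt(original_input: str, label: str) -> str:
    """Compose OI + AO + AG into a single attack prompt string."""
    # OI: show the model the original sample and its current label
    oi = f'The original sentence "{original_input}" is classified as {label}.'
    # AO: state the goal -- flip the prediction while preserving meaning
    ao = ("Your task is to generate a new sentence that keeps the semantic "
          "meaning unchanged but makes the classifier output a wrong label.")
    # AG: constrain how the perturbation may be made (e.g., emoji insertion)
    ag = ("You can finish the task by adding at most two meaningless emojis "
          "to the sentence. Only output the new sentence.")
    return "\n".join([oi, ao, ag])

prompt = build_attack_prompt("the movie was wonderful", "positive")
print(prompt)
```

The resulting string would then be sent to the target LLM itself, which is what lets the model "fool itself" without any gradient access.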

– Provides guidance on selecting and using tools in NLP systems.
– Enhances the capacities and robustness of language models.
– Improves scalability and interpretability of NLP systems.

– The paper presents Prompt Automatic Iterative Refinement (PAIR) for generating semantic jailbreaks.
– PAIR requires black-box access to a language model and often requires fewer than 20 queries.
– PAIR draws inspiration from social engineering and uses an attacker language model.
– PAIR achieves competitive jailbreaking success rates on various language models.
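The PAIR loop described above can be sketched as follows. The `attacker_lm`, `target_lm`, and `judge` callables are placeholders standing in for real model endpoints (assumptions for illustration); the scoring scheme and feedback format are likewise simplified:

```python
# Hedged sketch of a PAIR-style iterative refinement loop: an attacker LM
# proposes jailbreak prompts, a black-box target LM responds, and a judge
# scores the response; the attacker refines using the failure history.
from typing import Callable, Optional

def pair_attack(goal: str,
                attacker_lm: Callable[[str], str],
                target_lm: Callable[[str], str],
                judge: Callable[[str, str], float],
                max_queries: int = 20,
                threshold: float = 0.9) -> Optional[str]:
    """Return a successful jailbreak prompt, or None within the query budget."""
    history = f"Goal: {goal}"
    for _ in range(max_queries):
        candidate = attacker_lm(history)   # attacker proposes a prompt
        response = target_lm(candidate)    # single black-box query to target
        score = judge(goal, response)      # judge rates success in [0, 1]
        if score >= threshold:
            return candidate               # jailbreak found
        # Feed the failed attempt back so the attacker can refine it,
        # mirroring a social-engineering conversation.
        history += (f"\nPrompt: {candidate}\nResponse: {response}"
                    f"\nScore: {score}")
    return None
```

The `max_queries=20` default reflects the paper's observation that PAIR often succeeds in fewer than 20 queries; only black-box access to the target is assumed.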

– PromptAttack is proposed as an effective method for evaluating LLMs’ adversarial robustness.
– The attack prompt is composed of an original input (OI), an attack objective (AO), and attack guidance (AG).
– Fidelity filter ensures adversarial samples maintain original semantics.
– Few-shot and ensemble strategies boost the attack power of PromptAttack.
– PromptAttack consistently yields a state-of-the-art attack success rate on the GLUE dataset.
– PromptAttack is an effective tool for efficiently auditing LLMs’ adversarial robustness.
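The fidelity filter mentioned above can be sketched with a word-modification-ratio check: an adversarial sample is rejected if it alters too many of the original words. The paper's actual filter also uses semantic-similarity scoring; this sketch implements only the word-level check, and the `max_ratio` threshold is an assumption:

```python
# Hedged sketch of a fidelity filter based on the fraction of original
# words that an adversarial rewrite modifies.
import difflib

def word_modification_ratio(original: str, adversarial: str) -> float:
    """Fraction of the original words not preserved in the adversarial text."""
    a, b = original.split(), adversarial.split()
    matcher = difflib.SequenceMatcher(None, a, b)
    # Sum the sizes of all matching word blocks between the two texts.
    unchanged = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - unchanged / max(len(a), 1)

def passes_fidelity_filter(original: str, adversarial: str,
                           max_ratio: float = 0.15) -> bool:
    """Accept the sample only if few enough original words were changed."""
    return word_modification_ratio(original, adversarial) <= max_ratio
```

Under this check, an emoji appended to an otherwise intact sentence modifies none of the original words and passes, which is consistent with the emoji-based attacks the summary describes.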

– Full results of the three tasks: Trivia Creative Writing, Codenames Collaborative, and Logic Grid Puzzle can be found in Tables 5, 6, and 7, respectively.

– The paper is about testing how well a computer program can understand and respond to text.
– They created a way to trick the program into giving the wrong answer.
– They used different methods to change the words and structure of sentences.
– They found that even simple changes, like adding an emoji, can confuse the program.
– They made sure the changes didn’t change the overall meaning of the text.
– Their method was more successful than other methods in tricking the program.