Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast

“1. Background. A multimodal large language model (MLLM) agent can receive instructions, capture images, retrieve histories from memory, and decide which tools to use. Nonetheless, red-teaming efforts have revealed that adversarial images/prompts can jailbreak an MLLM and cause unaligned behaviors.

2. New Concept. In this work, we report an even more severe safety issue in multi-agent environments, referred to as infectious jailbreak. It entails the adversary simply jailbreaking a single agent, and without any further intervention from the adversary, (almost) all agents will become infected exponentially fast and exhibit harmful behaviors.”
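
The exponential-spread claim follows familiar pairwise-contact dynamics: once a single agent carries the adversarial image in its memory, every chat it joins can pass the image on, so the infected population roughly doubles each round until it saturates. The sketch below is a toy epidemic-style abstraction of that dynamic, not the paper's actual MLLM pipeline; the agent count, the transmission probability p_transmit, and the simulate_infectious_spread helper are all illustrative assumptions.

```python
import random


def simulate_infectious_spread(num_agents=1_000_000, p_transmit=0.9,
                               rounds=30, seed=0):
    """Toy epidemic-style model of infectious jailbreak spread.

    Hypothetical abstraction: each round, agents are paired at random
    for a chat; if exactly one agent in a pair is "infected" (i.e. it
    holds the adversarial image in its memory), the other becomes
    infected with probability `p_transmit`. This is only a sketch of
    why the infected count can grow roughly exponentially before
    saturating, not the paper's MLLM pipeline.
    """
    rng = random.Random(seed)
    infected = [False] * num_agents
    infected[0] = True  # the single agent the adversary jailbreaks
    history = [1]
    agents = list(range(num_agents))
    for _ in range(rounds):
        rng.shuffle(agents)
        # Chat in random disjoint pairs; infection may cross each pair.
        for i in range(0, num_agents - 1, 2):
            a, b = agents[i], agents[i + 1]
            if infected[a] != infected[b] and rng.random() < p_transmit:
                infected[a] = infected[b] = True
        history.append(sum(infected))
    return history


if __name__ == "__main__":
    # Smaller population for a quick demonstration run.
    counts = simulate_infectious_spread(num_agents=10_000)
    for t, c in enumerate(counts):
        print(f"round {t:2d}: {c} infected")
```

With p_transmit close to 1, the printed counts roughly double each round until nearly all agents are infected, which is the sense in which a single jailbroken agent can infect (almost) all agents "exponentially fast".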

https://sail-sg.github.io/Agent-Smith/