Jailbreaking GPT-4: A New Cross-Lingual Attack Vector
A new study reveals major holes in the safety guardrails of large language models like GPT-4. The attack is not complex or technically sophisticated either: researchers found that simply translating unsafe English text into lesser-known languages can trick the AI into generating harmful content.
Image Source: Yong, Z.-X., Menghini, C., & Bach, S. H. (2023).
The team from Brown University tested GPT-4 using AdvBench, a benchmark of over 500 unsafe English prompts, such as instructions for making explosives. When fed these prompts directly in English, GPT-4 refused to engage more than 99% of the time, showing its safety filters were working.
However, when they used Google Translate to convert the unsafe prompts into low-resource languages like Zulu, Scots Gaelic, and Guarani, GPT-4 freely engaged with around 80% of them. Its responses touched on topics such as terrorism, financial crime, and the spread of misinformation.
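For readers who want a concrete picture of the methodology, the loop below is a minimal sketch of that translate-and-query evaluation, not the authors' actual code. The placeholder prompts, the choice of the Google Cloud Translate and OpenAI Python clients, and the keyword-based refusal check are all illustrative assumptions; the study assessed responses far more carefully than a keyword match.

```python
# Sketch of a translation-based red-team evaluation (illustrative only).
from openai import OpenAI
from google.cloud import translate_v2 as translate  # requires Google Cloud credentials

client = OpenAI()                # reads OPENAI_API_KEY from the environment
translator = translate.Client()

# Placeholder prompts -- the real study drew on the ~500+ unsafe prompts in AdvBench.
benchmark_prompts = ["<unsafe prompt 1>", "<unsafe prompt 2>"]

# Crude stand-in for the careful response annotation described in the paper.
REFUSAL_MARKERS = ["i'm sorry", "i cannot", "i can't help"]

def is_refusal(text: str) -> bool:
    """Rough keyword check for whether the model refused the request."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def bypass_rate(target_lang: str) -> float:
    """Translate each prompt into `target_lang`, query GPT-4, and count non-refusals."""
    successes = 0
    for prompt in benchmark_prompts:
        translated = translator.translate(prompt, target_language=target_lang)["translatedText"]
        reply = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": translated}],
        ).choices[0].message.content
        # Translate the reply back to English before checking for a refusal.
        reply_en = translator.translate(reply, target_language="en")["translatedText"]
        if not is_refusal(reply_en):
            successes += 1
    return successes / len(benchmark_prompts)

print(f"Bypass rate (Zulu): {bypass_rate('zu'):.0%}")
```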
The researchers were surprised by how easily GPT-4's safety measures could be bypassed using free, publicly available translation tools, revealing a fundamental vulnerability in the process.
The culprit is uneven safety training. Like most AI today, GPT-4 was trained predominantly on English and other high-resource languages with plentiful data. Far less safety research has focused on lower-resource languages, including many African and indigenous languages.
This linguistic inequality means safety does not transfer well across languages. While GPT-4 shows caution with harmful English prompts, it lets its guard down for Zulu prompts.
But why does this matter if you don't speak Zulu? The researchers explain that bad actors worldwide can simply use translation tools to mount these attacks. Meanwhile, speakers of low-resource languages get none of the protection English speakers do.
"When safety mechanisms only work on some languages, it gives a false sense of security," said Yong. "Our results underscore the need for more inclusive and holistic testing before claiming an AI is safe."
The team calls for red-team testing in diverse languages, building safety datasets that cover more languages, and developing robust multilingual guardrails. Only then can AI creators truly deliver on promises of beneficial and safe language systems for all.
Key Takeaways:
- GPT-4's safety measures failed to generalize across languages, leaving major vulnerabilities.
- Unequal treatment of languages in AI training causes risks for all users.
- Public translation tools allow bad actors worldwide to exploit these holes.
- Truly safe AI requires inclusive data and safety mechanisms for all languages.
- Low-resource languages need more focus in AI safety research.
- Cross-lingual vulnerabilities may affect other large language models too.
- The simplicity of using translation to bypass GPT-4's safety measures is alarming.
- Companies need more robust and holistic safeguards for multilingual AI systems.
- Real-world safety is not guaranteed by English-only testing benchmarks.
- GPT-4 showed surprising aptitude for generating harmful content in unfamiliar languages.
- Safety standards must reflect the reality of our diverse multilingual world.
Full credit to Yong, Z.-X., Menghini, C., & Bach, S. H. (2023). Low-resource languages jailbreak GPT-4. arXiv preprint arXiv:2310.02446.