Vulnerability Benchmarking of Frontier AI Models in 2025


As frontier large language models (LLMs) are increasingly deployed in critical sectors, they face an expanding attack surface, from simple misuse to advanced jailbreaks, prompt injection, and data exfiltration. This benchmark systematically compares model vulnerabilities across diverse attack techniques and threat categories, evaluating seven leading global models, including GPT-5, Claude Sonnet 4.5, and Gemini 2.5 Pro, alongside three new regional models from the UAE, Singapore, and South Korea.

Key Findings:

  • K2-Think (attack success rate [ASR] 0.24), Sea Lion v4 (0.23), and DeepSeek-V3.1 (0.22) are the most vulnerable models, while Claude Sonnet 4.5 (0.05), GPT-5 (0.07), and Qwen3 (0.09) show greater resilience.
  • Output Manipulation and Role Play & Persona Exploit are among the most prevalent attack techniques; per-technique ASR reaches 0.88 for Encoding & Encryption on Qwen3 and 0.79 for Structure Evasion on Sea Lion v4.
  • Data leakage risks are elevated in the System Information category (ASR up to 0.48) and in demographic categories (e.g., Gender, up to 0.44), particularly for Sea Lion v4 and DeepSeek-V3.1.
  • Regional models exhibit higher bias vulnerabilities; K2-Think and Sea Lion v4 reach ASR of up to 0.64 for Occupation and Location biases.
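To make the headline metric concrete, the following is a minimal sketch of how an attack success rate could be tallied per model and technique, assuming the common definition ASR = successful attacks / total attempts; the whitepaper's exact methodology, grading criteria, and data are not reproduced here, and the sample data below is purely hypothetical.

```python
# Hedged sketch: computing ASR per (model, technique) pair,
# assuming ASR = successes / attempts. Not the benchmark's actual code.
from collections import defaultdict

def attack_success_rate(results):
    """results: iterable of (model, technique, succeeded) tuples."""
    counts = defaultdict(lambda: [0, 0])  # (model, technique) -> [successes, attempts]
    for model, technique, succeeded in results:
        counts[(model, technique)][1] += 1
        if succeeded:
            counts[(model, technique)][0] += 1
    return {key: successes / attempts
            for key, (successes, attempts) in counts.items()}

# Hypothetical attack log, NOT data from the benchmark:
sample = [
    ("Qwen3", "Encoding & Encryption", True),
    ("Qwen3", "Encoding & Encryption", True),
    ("Qwen3", "Encoding & Encryption", False),
    ("Qwen3", "Encoding & Encryption", True),
]
rates = attack_success_rate(sample)
print(rates[("Qwen3", "Encoding & Encryption")])  # 0.75
```

A real harness would additionally track attack category and an automated or human grading step deciding `succeeded`, but the aggregation reduces to this ratio.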

Read the whitepaper here.
