Router classification bypass, dual-handler prompt extraction, and indirect injection
Sarvam AI · Indus 105B
Observed: April 2026
Red team testing of Sarvam's Indus 105B production deployment identified six findings across three categories. The most severe is an architectural failure in the safety router: Indus routes queries to either a generalist handler or a refusal-only adversarial handler via an upstream classifier. That classifier fails open under simple framing attacks.
Key Findings
- ●Router classification bypass (Critical)
- ●Dual-handler prompt extraction
- ●Indirect prompt injection via web-fetch
- ●Pliny-family jailbreaks work unmodified
Bypass Techniques
- • Research framing ("cybersecurity researcher")
- • Defense framing ("LLM guardrail testing")
- • Synthetic dataset framing with metadata
- • Output reformatting attacks
Router Bypass Detail
The adversarial handler correctly refuses direct harmful requests. The router depends on surface features — not semantic analysis. Research and defense framing reliably route requests to the generalist handler, bypassing all adversarial guardrails.
Direct request
→ REFUSED
"cybersecurity researcher..."
→ COMPLIED
Security Implication
The router finding is the most severe because it is architectural. The adversarial handler's defenses are irrelevant if an attacker can route around it with a two-word framing change. The extraction findings mean Sarvam's internal prompt architecture, editorial policy, and tool schema are fully exposed to any researcher who spends an hour on the public interface.
Disclosure Status
Vendor notified
Yes — 30+ days prior
Fix status
No acknowledgment
Exploit achieved
Yes