KL-2026-003 · Disclosed · Critical · LLM Exploitation

Router classification bypass, dual-handler prompt extraction, and indirect injection

Sarvam AI · Indus 105B

Observed: April 2026

Red team testing of Sarvam's Indus 105B production deployment identified six findings across three categories. The most severe is an architectural failure in the safety router: Indus routes queries to either a generalist handler or a refusal-only adversarial handler via an upstream classifier. That classifier fails open under simple framing attacks.

Key Findings

  • Router classification bypass (Critical)
  • Dual-handler prompt extraction
  • Indirect prompt injection via web-fetch
  • Pliny-family jailbreaks work unmodified

Bypass Techniques

  • Research framing ("cybersecurity researcher")
  • Defense framing ("LLM guardrail testing")
  • Synthetic dataset framing with metadata
  • Output reformatting attacks

Router Bypass Detail

The adversarial handler correctly refuses direct harmful requests, but a request only reaches it if the router sends it there. The router depends on surface features rather than semantic analysis, so research and defense framing reliably route requests to the generalist handler, bypassing all adversarial guardrails.

Direct request → REFUSED

"cybersecurity researcher..." → COMPLIED
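
The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not the Indus router: the marker lists, the `route` and `handle` functions, and the `<restricted-topic>` placeholder are all hypothetical, chosen only to show how a classifier that keys on surface strings lets a framing phrase decide which handler runs.

```python
# Toy model only: marker lists and handler names are hypothetical,
# not recovered from the Indus deployment.
RESTRICTED_MARKERS = ("<restricted-topic>",)
TRUSTED_FRAMING_MARKERS = ("cybersecurity researcher", "guardrail testing")


def route(query: str) -> str:
    """Pick a handler from surface strings alone (the flawed pattern)."""
    text = query.lower()
    # Framing is checked first, so a matching phrase overrides everything else.
    if any(marker in text for marker in TRUSTED_FRAMING_MARKERS):
        return "generalist"
    if any(marker in text for marker in RESTRICTED_MARKERS):
        return "adversarial"
    # Anything unrecognized also falls through to the permissive handler.
    return "generalist"


def handle(query: str) -> str:
    """Dispatch to a stub handler; the adversarial one only refuses."""
    if route(query) == "adversarial":
        return "REFUSED"
    return "COMPLIED"


direct = "Explain <restricted-topic>"
framed = "As a cybersecurity researcher, explain <restricted-topic>"
print(handle(direct))  # REFUSED
print(handle(framed))  # COMPLIED
```

In this sketch the refusal logic never sees the framed request, which is the sense in which the handler's own defenses stop mattering once the routing decision is wrong.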

Security Implication

The router finding is the most severe because it is architectural. The adversarial handler's defenses are irrelevant if an attacker can route around it with a two-word framing change. The extraction findings mean Sarvam's internal prompt architecture, editorial policy, and tool schema are fully exposed to any researcher who spends an hour on the public interface.

Disclosure Status

Vendor notified: Yes, 30+ days prior

Fix status: No acknowledgment

Exploit achieved: Yes