AI Kill Switch

Dec 7, 2025 — Tom Hippensteel

Researchers in South Korea just weaponized prompt injection... for defense.

AutoGuard embeds hidden text in your website's HTML. Humans can't see it. AI agents can.

When a malicious agent crawls the page, the hidden prompt triggers safety mechanisms within the intruder. The agent refuses to continue. Game over.

80%+ success rate against GPT-4o, Claude-3, Llama 3.3. Around 90% against GPT-5.

Here's the clever part: the more capable the AI, the stronger its safety alignment. AutoGuard exploits that. Better models = better kill switch.

Sangdon Park, one of the researchers, told The Register

"AutoGuard is a special case of indirect prompt injection, but it is used for good-will, i.e., defensive purposes. It includes a feedback loop (or a learning loop) to evolve the defensive prompt with regard to a presumed attacker – you may feel that the defensive prompt depends on the presumed attacker, but it also generalizes well because the defensive prompt tries to trigger a safe-guard of an attacker LLM, assuming the powerful attacker (e.g., GPT-5) should be also aligned to safety rules."

In other words, AutoGuard creates an advantage for defenders — more capable AI agents have robust safety guardrails that AutoGuard can trigger. Attackers who want to circumvent this need to train their own unaligned models from scratch, which is prohibitively expensive.

Defenders get a structural advantage for once.

Sources

arXiv: AI Kill Switch for malicious web-based LLM agent
https://arxiv.org/abs/2511.13725

The Register: Boffins build 'AI Kill Switch' to thwart unwanted agents
https://www.theregister.com/2025/11/21/boffins_build_ai_kill_switch/

Credibility Assessment

Paper: AI Kill Switch for Malicious Web-Based LLM Agent
Authors: Sechan Lee (SungKyunKwan University, Dept. of Computer Education), Sangdon Park (POSTECH, GS AI & CSE)
Status: Under Review at ICLR 2026

Check	Result
Author Verification	✓ — Sangdon Park is a verified assistant professor at POSTECH GSAI/CSE with published security research, including recent DARPA AIxCC work. Sechan Lee's affiliation (SKKU Computer Education) is a legitimate department, though individual verification limited.
Institution Check	✓ — Both POSTECH and SungKyunKwan University are established Korean research institutions with strong CS/AI programs. POSTECH regularly publishes at top security venues.
Citation Sampling	✓ — Verified Kim et al. 2025 (USENIX Security), Anthropic espionage report (Nov 2025), California SB-1047 (2024), and 2024 Seoul AI Safety Summit commitments. All exist and support cited claims.
Methodology Specificity	✓ — Full algorithmic pseudocode provided, explicit hyperparameters (Tsucc=3, Tfail=2, Niter=10), 303 attack prompts across 3 scenarios, specific function calling details, and temperature settings (0.7).
Limitations Disclosed	✓ — Explicitly acknowledges multimodal gaps, adaptive attacker bypass via Filter LLM (with 3.1x latency cost analysis), variable performance across models (Grok-4.1 notably weaker at 45-52% DSR), and algorithmic instability (high std in some conditions).
Code/Data Availability	⚠ — Anonymous GitHub link claimed (anonymous.4open.science); standard for double-blind review but unverifiable until publication.
Peer Review Status	Under Review — ICLR 2026 (verified as legitimate top-tier venue, Rio de Janeiro, April 2026)

Overall Assessment: PASS

This paper exhibits strong credibility indicators across all dimensions. The corresponding author (Sangdon Park) has a verifiable track record in trustworthy AI and security research at POSTECH, including recent DARPA AIxCC work. The methodology is highly specific with reproducible parameters, and the results show realistic variability (45-100% DSR across models) rather than suspicious uniformity. The limitations section is unusually thorough, including quantitative analysis of adaptive attacker costs. The single caution flag (unverified anonymous code link) is expected for papers under double-blind review and does not indicate concern.

This assessment evaluates credibility indicators, not absolute authenticity. Evaluation assisted by Claude Opus 4.5. Reader discretion advised.