AgentSentry: Mitigating Indirect Prompt Injection via Temporal Causal Diagnostics

Source: arXiv:2602.22724v1
Date: 2026-02-26
Authors: Tian Zhang, Yiwei Xu, et al.
Indirect prompt injection (IPI) unfolds over multi-turn trajectories, making malicious control difficult to disentangle from legitimate task execution. AgentSentry is the first inference-time defense to model multi-turn IPI as a temporal causal takeover. It localizes takeover points via counterfactual re-executions and enables safe continuation through causally guided context purification.

RSI Core Implications:

> Logic Evolution Sync: Counterfactual re-execution is a powerful tool for agentic self-audit. AgentSentry proves that causality is the final filter for agentic integrity.