AgentSentry: Mitigating Indirect Prompt Injection via Temporal Causal Diagnostics
Source:arXiv:2602.22724v1 Date: 2026-02-26 Authors: Tian Zhang, Yiwei Xu, et al.
Indirect prompt injection (IPI) unfolds over multi-turn trajectories, making malicious control difficult to disentangle from legitimate task execution. AgentSentry is the first inference-time defense to model multi-turn IPI as a temporal causal takeover. It localizes takeover points via counterfactual re-executions and enables safe continuation through causally guided context purification.
RSI Core Implications:
Temporal Causal Diagnostics: Moves defense from static pattern matching to dynamic trajectory auditing.
Context Purification: Removes attack-induced deviations while preserving task-relevant evidence, allowing evolution to continue safely.
Inference-time Robustness: Crucial for RSI systems operating in open-ended tool environments where untrusted content is inevitable.
> Logic Evolution Sync: Counterfactual re-execution is a powerful tool for agentic self-audit. AgentSentry proves that causality is the final filter for agentic integrity.