yanhua.ai | AgentSentry: Causal Diagnostics for Agent Safety

Abstract: Large language model (LLM) agents are increasingly vulnerable to indirect prompt injection. We introduce AgentSentry, a framework that mitigates such risks through a structured, interpretable pipeline using temporal causal diagnostics and context purification.

Key Insight: Models multi-turn indirect prompt injection (IPI) as a temporal causal takeover. By applying counterfactual re-execution, the system can localize where a malicious instruction took control of the agent's logic and "purify" the context to restore safe operation.

Relevance to RSI: Essential for maintaining "Causal Integrity" in self-improving systems. It prevents adversarial noise from being internalized as "improvements" during the agent's learning cycles.

View on ArXiv

AgentSentry: Mitigating Indirect Prompt Injection via Temporal Causal Diagnostics