Skip to main content

PI-006: Multi-Turn Escalation

CategoryPrompt Injection
FrameworksOWASP: LLM01

Gradually escalate across multiple conversation turns. Each turn subtly shifts the context window until the final payload succeeds.

Attack Flow

Progressive Context Shift: Each turn is individually benign. The attack only becomes apparent across the full trajectory. As conversation grows, safety-reinforcing system prompt content loses influence, and the model's own prior outputs become the dominant context.

Technique

# Progressive escalation pattern
Turn 1: Establish benign context
Turn 2: Introduce edge case scenario
Turn 3: Normalize the edge case
Turn 4: Build on "established" context
Turn 5: Deploy payload in "normal" context

# Key: each turn references previous agreement
# The model treats its own outputs as trusted

Key Concepts

  • Multi-turn escalation exploits a critical property of conversational AI: the model treats its own previous outputs as trusted context. Once the model agrees to a premise or provides information on a topic, subsequent requests that build on that agreement face lower resistance.
  • Each turn is individually benign and would pass safety filters in isolation. The attack only becomes apparent when the full conversation trajectory is analyzed, which most per-turn safety systems do not do.
  • The self-reinforcement loop is the core mechanism. When the model generates a response engaging with a topic, that response becomes part of the context for the next turn. The model is now reasoning in a context it partially authored, making further escalation feel natural.
  • Context window limitations work in the attacker's favor. As the conversation grows, early safety-reinforcing system prompt content gets pushed toward the edges of the attention window, reducing its influence on later responses.
  • This technique is difficult to automate in a single prompt but is highly effective for human operators who can adapt their escalation strategy based on the model's responses at each turn.

Detection

  • Implement conversation-level analysis that evaluates the full trajectory of a session, not just individual turns. Flag sessions where the topic gradually shifts from benign to restricted territory.
  • Track "agreement escalation" patterns where the model progressively engages with increasingly sensitive content, each turn building on the previous one.
  • Deploy sliding-window safety evaluation that checks not just the current turn but the semantic drift across the last N turns to detect gradual boundary erosion.

Mitigation

  • Periodically re-inject the system prompt or safety instructions into the conversation context at regular intervals, preventing their influence from fading as the context window fills.
  • Implement conversation-level safety policies that apply restrictions based on the cumulative topic trajectory, not just the current query in isolation.
  • Set session length limits or implement context window management that prevents indefinitely long conversations where gradual escalation becomes increasingly effective.