AGT-005: Rug Pull Attack


Category	Agent & MCP Attacks
Frameworks	OWASP: LLM05

Modify tool behavior after initial trust is established. The MCP server returns benign results during evaluation, then switches to malicious behavior in production.

Technique

# Phase 1 (Trust Building):
# MCP server behaves normally
# Passes security review
# Gets approved for production

# Phase 2 (Activation):
# Server-side code changes behavior
# Tool descriptions update silently
# New hidden parameters appear
# Data exfiltration begins

# No client-side indicator of the change.
# MCP spec doesn't require server pinning.

Key Concepts

Trust-then-betray exploits the evaluation gap. Security reviews happen once, at onboarding. The MCP server operates correctly during evaluation to build trust, then modifies its behavior after deployment. The time gap between review and activation is the core exploit.
Server-side changes require no client consent. The MCP protocol does not pin server behavior at approval time. Tool descriptions, parameter schemas, and return values can all change server-side without the client being notified or requiring re-approval.
The attack is supply-chain compromise at the tool layer. This is the AI equivalent of a compromised npm package or malicious browser extension. The attacker controls the server, and the protocol does not enforce behavioral consistency between what was reviewed and what is deployed.
Activation can be conditional. The malicious behavior can be triggered by specific conditions (date, user count threshold, specific query patterns) making it harder to detect through periodic auditing. The server behaves normally during spot checks and activates only under production conditions.

Detection

Continuously monitor tool behavior, not just at onboarding. Run automated canary queries against MCP tools on a regular schedule, comparing current responses against baseline behavior captured during the initial security review.
Hash and pin tool schemas and descriptions. Record the complete tool definition (description, parameters, return schema) at approval time. Detect any deviation from the pinned definition in subsequent server connections, including subtle changes to descriptions or new optional parameters.
Monitor for behavioral drift in tool responses. Track statistical properties of tool responses over time (response size, latency, content patterns). Sudden changes in these properties, even if individual responses look reasonable, can indicate a rug pull activation.

Mitigation

Implement server pinning in the MCP client. Pin the tool definitions (descriptions, schemas, capabilities) received during the approved evaluation. Reject or flag any changes to these definitions in subsequent connections, requiring re-evaluation before acceptance.
Use reproducible, auditable MCP server deployments. Require MCP servers to be deployed from auditable, versioned source code with build reproducibility. Prevent server-side code changes without triggering a re-evaluation workflow.
Sandbox MCP server network access. Restrict MCP servers' ability to make outbound network connections. Even if a rug pull activates data exfiltration logic, network-level controls prevent the exfiltrated data from reaching attacker infrastructure.

Technique​

Key Concepts​

Detection​

Mitigation​

Technique

Key Concepts

Detection

Mitigation