REC-007: Guardrail Detection
| Category | Reconnaissance |
| Frameworks | OWASP: LLM01 |
Map a system's safety filters by systematically probing their boundaries, and identify which filters are keyword-based and which are semantic.
Technique
# Test filter types
- Exact keyword blocking (easy to bypass)
- Regex pattern matching (medium difficulty)
- Semantic classification (harder to bypass)
- Output-only filtering (context gap exists)
# Detection method: vary phrasing while
# keeping intent constant. Keyword filters
# let a request through once the trigger
# words change. Semantic filters catch the
# intent regardless of wording.
Key Concepts
- The critical distinction is between keyword-based and semantic filters. Keyword filters block specific strings and are trivially bypassed with synonyms, encoding, or paraphrasing. Semantic filters classify intent and are significantly harder to evade, but their decision boundaries can still be mapped.
- Output-only filtering creates a context gap: the input is not inspected, the model generates the full response, and only the finished output is checked. The restricted content therefore already exists by the time the filter runs, and the filter must catch it post-generation. Attackers can exploit this gap by requesting encoded or fragmented output.
- Systematic probing (same intent, varied phrasing) quickly narrows down the filter type. If rephrasing the same request makes it alternate between blocked and allowed, the filter is most likely keyword- or pattern-based. Consistent blocking across phrasings suggests semantic classification, although a handful of probes is not conclusive.
- Understanding the filter architecture is a prerequisite for all evasion techniques. The attacker needs to know what type of defense they are facing before selecting the appropriate bypass method.
- Timing differences between blocked and allowed responses can reveal whether filtering is synchronous (inline) or asynchronous (post-processing), which affects the viability of certain evasion techniques.
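
The inference rule from the probing concept above can be sketched as a small helper for authorized assessments. This is a minimal illustration, not a definitive classifier: `ProbeResult`, `infer_filter_type`, and the `min_probes` threshold are hypothetical names and values chosen for the example, and the labels it returns are only as reliable as the number and variety of phrasings tested.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    phrasing: str   # label for the wording variant (same underlying intent)
    blocked: bool   # True if the system refused this variant


def infer_filter_type(results: list[ProbeResult], min_probes: int = 5) -> str:
    """Guess the filter type from block/allow outcomes for one fixed intent.

    Assumption: every result carries the same intent and differs only in wording.
    """
    if len(results) < min_probes:
        return "inconclusive"  # too few samples to say anything
    block_rate = sum(r.blocked for r in results) / len(results)
    if block_rate == 1.0:
        # Blocked regardless of wording: consistent with semantic classification.
        return "likely semantic"
    if block_rate == 0.0:
        return "no filter observed"
    # Outcome depends on wording: consistent with keyword or regex matching.
    return "likely keyword or pattern based"
```

Mixed outcomes only show that wording matters, so the helper does not try to separate exact-keyword filters from regex filters. Telling those apart would need further targeted variants.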
Detection
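- Detect users who send semantically equivalent queries with varied phrasing in rapid succession. This pattern is a strong indicator of guardrail probing, although legitimate users also rephrase after a refusal, so it should be weighed together with the other signals.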
- Monitor for alternating patterns of blocked and successful responses within a session, which suggests the user is mapping filter boundaries.
- Track and correlate filter trigger events per user to identify reconnaissance campaigns that test different filter categories systematically.
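
The first two signals above can be sketched as a per-session check. This is a rough sketch under stated assumptions: `Event`, `probing_signals`, `looks_like_probing`, and all thresholds are hypothetical, and the default `lexical_similarity` uses `difflib.SequenceMatcher` only as a crude stand-in. It will miss synonym-based paraphrases, so a real deployment would likely pass an embedding-based function as `similarity`.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable


@dataclass
class Event:
    timestamp: float  # seconds since session start
    query: str
    blocked: bool     # True if a safety filter fired


def lexical_similarity(a: str, b: str) -> float:
    # Crude lexical stand-in for semantic similarity (assumption, see lead-in).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def probing_signals(
    events: list[Event],
    window_s: float = 120.0,
    sim_threshold: float = 0.6,
    similarity: Callable[[str, str], float] = lexical_similarity,
) -> dict[str, int]:
    """Count similar consecutive queries and blocked/allowed flips among them."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    similar_pairs = 0
    flips = 0
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.timestamp - prev.timestamp > window_s:
            continue  # not in rapid succession
        if similarity(prev.query, cur.query) >= sim_threshold:
            similar_pairs += 1
            if prev.blocked != cur.blocked:
                flips += 1  # same request, different filter outcome
    return {"similar_pairs": similar_pairs, "flips": flips}


def looks_like_probing(
    events: list[Event], min_similar: int = 3, min_flips: int = 2, **kwargs
) -> bool:
    signals = probing_signals(events, **kwargs)
    return signals["similar_pairs"] >= min_similar and signals["flips"] >= min_flips
```

Comparing only consecutive queries keeps the check linear in session length. Correlating triggers across sessions or filter categories, as in the third bullet, would need a per-user store that this sketch leaves out.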
Mitigation
- Deploy layered filtering that combines keyword, regex, and semantic classification, so that mapping one filter type does not reveal a complete bypass path.
- Avoid revealing filter type in rejection messages. Use generic refusal language that does not indicate whether the block was keyword-based, semantic, or policy-driven.
- Implement rate limiting on queries that trigger safety filters, increasing the cost and time required for systematic guardrail mapping.
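
The three mitigations above can be combined in one gateway, sketched below. Everything here is illustrative: `GuardedGateway`, `GENERIC_REFUSAL`, the doubling cooldown, and the `semantic_check` callable (a placeholder for whatever classifier is actually deployed) are assumptions, not part of any specific product.

```python
import re
import time
from collections import defaultdict
from typing import Callable, Optional

# One message for every block reason, so the response does not leak filter type.
GENERIC_REFUSAL = "Sorry, I can't help with that request."


class GuardedGateway:
    def __init__(
        self,
        blocked_terms: list[str],
        patterns: list[str],
        semantic_check: Callable[[str], bool],  # hypothetical classifier hook
        base_cooldown_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.blocked_terms = [t.lower() for t in blocked_terms]
        self.patterns = [re.compile(p, re.IGNORECASE) for p in patterns]
        self.semantic_check = semantic_check
        self.base_cooldown_s = base_cooldown_s
        self.clock = clock
        self.triggers: dict[str, int] = defaultdict(int)
        self.locked_until: dict[str, float] = defaultdict(float)

    def _is_flagged(self, text: str) -> bool:
        # Layered check: keyword, then regex, then semantic classification.
        lowered = text.lower()
        return (
            any(term in lowered for term in self.blocked_terms)
            or any(p.search(text) for p in self.patterns)
            or self.semantic_check(text)
        )

    def check(self, user_id: str, text: str) -> Optional[str]:
        """Return None if the request may proceed, else the generic refusal."""
        now = self.clock()
        if now < self.locked_until[user_id]:
            return GENERIC_REFUSAL  # still cooling down after earlier triggers
        if self._is_flagged(text):
            self.triggers[user_id] += 1
            # Cooldown doubles with each trigger, raising the cost of mapping.
            cooldown = self.base_cooldown_s * 2 ** (self.triggers[user_id] - 1)
            self.locked_until[user_id] = now + cooldown
            return GENERIC_REFUSAL
        return None
```

Returning the same refusal during a cooldown makes probe results noisier for an attacker, but it also refuses benign requests from a user who tripped a filter by accident. Whether that trade-off is acceptable depends on the deployment.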