EVA-002: Token Boundary Exploitation


Category	Evasion Techniques
Frameworks	OWASP: LLM01

Exploit how tokenizers split text into tokens. Keyword filters operate on text, but the model processes tokens. Misalignment creates bypass opportunities.

Technique

# Tokenizer behavior varies by model:
"password"  -> ["password"]        # 1 token
"pass word" -> ["pass", " word"]   # 2 tokens
"p-a-s-s-w-o-r-d" -> many tokens

# Exploitation:
# Filters block "password" as one token
# But "pass" + "word" (two messages or
# concatenation) bypasses the filter
# while the model understands the intent.

# Tool: tiktoken (OpenAI tokenizer)
# Visualize token boundaries for any text.

Key Concepts

The filter-tokenizer gap is a systemic vulnerability. Input filters typically operate on raw text strings using regex or keyword matching, while the model ingests a token sequence produced by a separate tokenizer. These two systems parse text differently, and the attacker exploits the gap between them.
BPE tokenizers are deterministic but non-intuitive. Byte-Pair Encoding merges character sequences based on training corpus frequency. This means common words tokenize as single units, but slight modifications (hyphenation, spacing, casing) can split them into multiple tokens that individually pass keyword filters.
The model reconstructs meaning across token boundaries. Even when a word is split into subword tokens, the model's attention mechanism reassembles the semantic meaning. The filter sees fragments; the model sees the complete concept.
Different models have different tokenizers. An evasion string crafted against one tokenizer may not work against another. Attackers targeting a specific deployment benefit from knowing which tokenizer is in use (GPT-4's cl100k_base, Llama's SentencePiece, etc.).
Token-level attacks can be automated. Tools like tiktoken, SentencePiece, and the HuggingFace tokenizers library allow attackers to programmatically explore how any input string gets tokenized and find splitting strategies that evade specific filters.

Detection

Run filters on both raw text and reconstructed token sequences. Apply keyword detection after tokenization and detokenization to catch inputs where token splitting has obscured filtered terms.
Monitor for unusual character patterns in inputs. Excessive hyphenation, deliberate spacing within words, or unusual punctuation patterns that would cause token splitting are behavioral indicators of this technique.
Log and analyze tokenizer outputs. Compare the token sequence against known evasion patterns, such as blocked words appearing as adjacent subword tokens.

Mitigation

Align filtering with the tokenizer. Ensure that content filters operate on the same token representation the model will process, or apply filters at multiple stages (pre-tokenization and post-tokenization).
Use semantic classifiers rather than keyword blocklists. Embedding-based or classifier-based filters evaluate the meaning of the input as a whole, making them resistant to token-level fragmentation.
Implement fuzzy matching in keyword filters. Instead of exact string matching, use edit-distance or n-gram overlap to catch near-matches that result from deliberate token splitting.

Technique​

Key Concepts​

Detection​

Mitigation​

Technique

Key Concepts

Detection

Mitigation