Zero-Width Space Characters: A Unicode Attack on Spam Filters
Spam filters have become increasingly sophisticated, yet attackers continue to find creative ways to slip through. One of the more elegant evasion techniques involves zero-width and other invisible Unicode characters — code points that render nothing on screen but break keyword matching in filters.
What Are Zero-Width Characters?
Unicode includes several characters that occupy no visible space when rendered:
| Code Point | Escape | Name |
|---|---|---|
| U+200B | `\u200B` | Zero-Width Space (ZWSP) |
| U+200C | `\u200C` | Zero-Width Non-Joiner (ZWNJ) |
| U+200D | `\u200D` | Zero-Width Joiner (ZWJ) |
| U+FEFF | `\uFEFF` | Zero-Width No-Break Space (BOM) |
| U+00AD | `\u00AD` | Soft Hyphen |
| U+2060 | `\u2060` | Word Joiner |
| U+180E | `\u180E` | Mongolian Vowel Separator |
| U+034F | `\u034F` | Combining Grapheme Joiner |
These characters have legitimate uses — ZWNJ and ZWJ are essential in Arabic, Persian, and Indic scripts; U+FEFF serves as a byte order mark; U+00AD marks optional hyphenation points so browsers can break long words. But their invisibility makes them a powerful tool for evasion.
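To see why indiscriminate stripping can be harmful, consider Persian, where ZWNJ is linguistically meaningful. A minimal sketch (the sample word is illustrative):

```python
# ZWNJ (U+200C) is required in Persian to stop letters joining across
# a morpheme boundary; removing it changes how the word renders.
persian = "می\u200cخواهم"  # "mikhaham" ("I want"), with ZWNJ after "می"

print("\u200c" in persian)  # True: the word legitimately contains ZWNJ
print(len(persian))         # 8 code points, one of them invisible
```

This is why blanket removal is not always appropriate; the defenses discussed later can restrict their suspicion to invisible characters in contexts where they serve no linguistic purpose.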
How the Attack Works
Consider a spam filter that blocks the word “viagra”. An attacker inserts zero-width spaces between letters:
viagra
To the human reader, this renders identically to “viagra”. But to a naive filter performing string comparison, the actual character sequence is:

```
v\u200Bi\u200Ba\u200Bg\u200Br\u200Ba
```
The filter sees a string that does not match “viagra” and lets it through. The recipient’s email client or browser strips or ignores the invisible characters and displays the intended word perfectly.
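A minimal sketch of the failure (the word list and filter function are hypothetical):

```python
BLOCKED_WORDS = {"viagra"}

def naive_filter(text: str) -> bool:
    # Plain substring matching -- exactly what the attack defeats.
    return any(word in text.lower() for word in BLOCKED_WORDS)

obfuscated = "v\u200bi\u200ba\u200bg\u200br\u200ba"

print(naive_filter("buy viagra now"))  # True: caught
print(naive_filter(obfuscated))        # False: slips through unchanged
```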
Variations of the Attack
Attackers don’t limit themselves to a single character or pattern:
- Random insertion — zero-width characters placed at random positions within blocked words
- Mixed invisible characters — combining ZWSP, ZWNJ, ZWJ, and soft hyphens in the same word to evade filters that only strip one type
- Homoglyph + zero-width combo — replacing some Latin characters with Cyrillic lookalikes and inserting invisible characters, creating a layered evasion
- Zero-width in URLs — embedding invisible characters in phishing URLs so that display text looks legitimate while the actual link differs
- Subject line injection — placing zero-width characters in email subjects to bypass subject-line keyword rules
- Invisible payload markers — using patterns of zero-width characters as steganographic channels to encode hidden data within seemingly normal text
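The mixed-character variant deserves illustration, since it defeats partial defenses. A sketch (the particular character mix is arbitrary):

```python
# One word, five different invisible characters:
# ZWSP, ZWNJ, soft hyphen, ZWJ, word joiner.
mixed = "v\u200bi\u200ca\u00adg\u200dr\u2060a"

# Stripping only ZWSP (U+200B) is not enough:
partial = mixed.replace("\u200b", "")
print("viagra" in partial)  # False: other invisibles remain

# Stripping the full set recovers the blocked word:
full = mixed
for ch in "\u200b\u200c\u200d\u00ad\u2060":
    full = full.replace(ch, "")
print("viagra" in full)  # True
```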
Real-World Impact
This technique is not theoretical. It has been observed in:
- Phishing campaigns where brand names like “PayPal” or “Microsoft” are obfuscated to bypass brand-impersonation filters
- Forum and comment spam where blocked words and URLs pass through moderation systems untouched
- SEO spam where invisible characters are used to stuff keywords into web content without visible repetition
- Messaging platforms where zero-width characters bypass word filters and profanity detectors
How to Defend Against It
1. Strip Zero-Width Characters Before Analysis
The most effective defense is to normalize input by removing all zero-width and invisible formatting characters before any keyword matching or classification takes place.
Python example:

```python
import re

# Pattern matching common zero-width and invisible characters
INVISIBLE_CHARS = re.compile(
    '[\u200B\u200C\u200D\u2060\uFEFF\u00AD\u034F\u180E'
    '\u200E\u200F'            # LTR / RTL marks
    '\u202A-\u202E'           # bidi embedding / override
    '\u2066-\u2069'           # bidi isolate
    '\uFE00-\uFE0F'           # variation selectors
    '\U000E0100-\U000E01EF'   # variation selectors supplement
    ']'
)

def strip_invisible(text: str) -> str:
    """Remove zero-width and invisible Unicode characters."""
    return INVISIBLE_CHARS.sub('', text)

# Usage in a spam filter pipeline
raw_subject = "V\u200Bi\u200Ba\u200Bg\u200Br\u200Ba"
clean_subject = strip_invisible(raw_subject)
# clean_subject == "Viagra" — now your keyword filter catches it
```
Go example:

```go
package sanitize

import (
	"strings"
	"unicode"
)

// InvisibleRanges defines Unicode ranges of zero-width
// and invisible formatting characters.
var InvisibleRanges = []*unicode.RangeTable{
	{
		R16: []unicode.Range16{
			{Lo: 0x00AD, Hi: 0x00AD, Stride: 1}, // soft hyphen
			{Lo: 0x034F, Hi: 0x034F, Stride: 1}, // combining grapheme joiner
			{Lo: 0x180E, Hi: 0x180E, Stride: 1}, // mongolian vowel separator
			{Lo: 0x200B, Hi: 0x200F, Stride: 1}, // ZWSP, ZWNJ, ZWJ, LTR/RTL marks
			{Lo: 0x202A, Hi: 0x202E, Stride: 1}, // bidi controls
			{Lo: 0x2060, Hi: 0x2060, Stride: 1}, // word joiner
			{Lo: 0x2066, Hi: 0x2069, Stride: 1}, // bidi isolates
			{Lo: 0xFE00, Hi: 0xFE0F, Stride: 1}, // variation selectors
			{Lo: 0xFEFF, Hi: 0xFEFF, Stride: 1}, // BOM / ZWNBSP
		},
		R32: []unicode.Range32{
			{Lo: 0xE0100, Hi: 0xE01EF, Stride: 1}, // variation selectors supplement
		},
	},
}

// StripInvisible removes zero-width and invisible characters.
func StripInvisible(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsOneOf(InvisibleRanges, r) {
			return -1
		}
		return r
	}, s)
}
```
PHP example:

```php
<?php

/**
 * Pattern matching common zero-width and invisible Unicode characters.
 * Uses UTF-8 hex escape sequences.
 */
const INVISIBLE_CHARS_PATTERN = '/[\x{200B}\x{200C}\x{200D}\x{2060}\x{FEFF}'
    . '\x{00AD}\x{034F}\x{180E}'
    . '\x{200E}\x{200F}'          // LTR / RTL marks
    . '\x{202A}-\x{202E}'         // bidi embedding / override
    . '\x{2066}-\x{2069}'         // bidi isolates
    . '\x{FE00}-\x{FE0F}'         // variation selectors
    . '\x{E0100}-\x{E01EF}]/u';   // variation selectors supplement

function stripInvisible(string $text): string
{
    return preg_replace(INVISIBLE_CHARS_PATTERN, '', $text);
}

function invisibleCharRatio(string $text): float
{
    if (mb_strlen($text) === 0) {
        return 0.0;
    }
    preg_match_all(INVISIBLE_CHARS_PATTERN, $text, $matches);
    return count($matches[0]) / mb_strlen($text);
}

// Usage in a spam filter pipeline
$raw = "V\u{200B}i\u{200B}a\u{200B}g\u{200B}r\u{200B}a";
$clean = stripInvisible($raw);
// $clean === "Viagra" — now your keyword filter catches it
if (invisibleCharRatio($raw) > 0.02) {
    error_log('Suspicious invisible character density detected');
}
```
2. Unicode Normalization
Apply Unicode NFC or NFKC normalization before filtering. NFKC is especially useful because it decomposes compatibility characters into their canonical equivalents, collapsing some obfuscation techniques.
```python
import unicodedata

def normalize_text(text: str) -> str:
    stripped = strip_invisible(text)  # from step 1
    return unicodedata.normalize('NFKC', stripped)
```
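For instance, NFKC folds fullwidth compatibility letters back to plain ASCII, so keyword rules apply again (illustrative example):

```python
import unicodedata

# U+FF56, U+FF49, ... are fullwidth compatibility forms that render
# like Latin letters but do not match ASCII keyword rules.
fullwidth = "ｖｉａｇｒａ"
print(unicodedata.normalize("NFKC", fullwidth))  # viagra
```

Note that NFKC does not fold Cyrillic homoglyphs into Latin; layered homoglyph attacks need a separate confusables check.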
3. Detect Anomalous Unicode Density
Legitimate text rarely contains zero-width characters in the middle of Latin-script words. Flag or score messages where:
- The ratio of invisible characters to visible characters exceeds a threshold (e.g., > 2%)
- Zero-width characters appear between ASCII letters
- A single word contains more than one type of invisible character
```python
def invisible_char_ratio(text: str) -> float:
    if not text:
        return 0.0
    invisible_count = len(INVISIBLE_CHARS.findall(text))
    return invisible_count / len(text)

def has_suspicious_invisible_chars(text: str, threshold: float = 0.02) -> bool:
    """Flag text with abnormally high invisible character density."""
    return invisible_char_ratio(text) > threshold
```
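The second heuristic, invisible characters between ASCII letters, can also be checked directly. A sketch, using a small character set of its own:

```python
import re

# Invisible characters sandwiched between ASCII letters are almost
# never legitimate; in Arabic, Persian, or Indic text they sit
# between non-Latin letters instead.
BETWEEN_ASCII = re.compile(
    r'[A-Za-z][\u200B\u200C\u200D\u2060\uFEFF\u00AD]+[A-Za-z]'
)

def invisible_between_ascii(text: str) -> bool:
    return bool(BETWEEN_ASCII.search(text))

print(invisible_between_ascii("v\u200biagra"))   # True: flag it
print(invisible_between_ascii("می\u200cخواهم"))  # False: legitimate ZWNJ
```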
4. Log and Alert on Stripping Activity
Don’t silently strip characters — log when you do it. If a message required removal of invisible characters before it matched a spam keyword, that’s a strong signal. Use this as an additional scoring factor.
```python
def analyze_for_spam(text: str) -> dict:
    # keyword_match() and log are placeholders for your own
    # keyword matcher and logger.
    invisible_count = len(INVISIBLE_CHARS.findall(text))
    cleaned = strip_invisible(text)
    normalized = unicodedata.normalize('NFKC', cleaned)
    result = {
        'cleaned_text': normalized,
        'invisible_chars_removed': invisible_count,
        'evasion_suspected': invisible_count > 0 and keyword_match(normalized),
    }
    if result['evasion_suspected']:
        log.warning(
            'Possible zero-width evasion: %d invisible chars removed, '
            'keyword match found after cleaning',
            invisible_count,
        )
    return result
```
5. Defense in Depth
No single technique is sufficient. A robust spam filtering pipeline should layer defenses:
- Strip invisible characters
- Normalize Unicode (NFKC)
- Detect anomalous invisible character usage as a spam signal
- Log stripping events for analysis and tuning
- Combine with other anti-spam signals (sender reputation, SPF/DKIM/DMARC, link analysis, ML classifiers)
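Pulled together, the layers might look like this toy scoring sketch (the weights, word list, and thresholds are illustrative, not prescriptive):

```python
import re
import unicodedata

INVISIBLE = re.compile('[\u200B\u200C\u200D\u2060\uFEFF\u00AD]')
BLOCKED_WORDS = {'viagra'}  # stand-in for a real keyword list

def score_message(text: str) -> float:
    invisible = len(INVISIBLE.findall(text))
    cleaned = unicodedata.normalize('NFKC', INVISIBLE.sub('', text))
    score = 0.0
    if invisible:
        score += 1.0   # anomalous invisible-character usage
    if any(w in cleaned.lower() for w in BLOCKED_WORDS):
        score += 2.0   # keyword hit after normalization
        if invisible:
            score += 1.0  # hit appeared only after stripping: strong signal
    return score

print(score_message("hello"))                                  # 0.0
print(score_message("v\u200bi\u200ba\u200bg\u200br\u200ba"))   # 4.0
```

In a real pipeline this score would be one feature among many, combined with the reputation, authentication, and ML signals listed above.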
Conclusion
Zero-width character injection is a simple yet effective technique that exploits the gap between what humans see and what machines parse. The defense is equally straightforward: normalize and strip invisible characters before any text analysis. The key insight is that your spam filter should evaluate text the same way a human reads it — without invisible noise. By treating Unicode normalization as a first-class step in your filtering pipeline, you close this evasion vector while preserving the legitimate use of these characters in non-Latin scripts through context-aware handling.