Zero-Width Space Characters: A Unicode Attack on Spam Filters
Spam filters have become increasingly sophisticated, yet attackers continue to find creative ways to slip through. One of the more elegant evasion techniques involves zero-width and other invisible Unicode characters — code points that render nothing on screen but break keyword matching in filters.
What Are Zero-Width Characters?
Unicode includes several characters that occupy no visible space when rendered:
| Code Point | Escape | Name |
|---|---|---|
| U+200B | `\u200B` | Zero-Width Space (ZWSP) |
| U+200C | `\u200C` | Zero-Width Non-Joiner (ZWNJ) |
| U+200D | `\u200D` | Zero-Width Joiner (ZWJ) |
| U+FEFF | `\uFEFF` | Zero-Width No-Break Space (BOM) |
| U+00AD | `\u00AD` | Soft Hyphen |
| U+2060 | `\u2060` | Word Joiner |
| U+180E | `\u180E` | Mongolian Vowel Separator |
| U+034F | `\u034F` | Combining Grapheme Joiner |
These characters have legitimate uses — ZWNJ and ZWJ are essential in Arabic, Persian, and Indic scripts; U+FEFF serves as a byte order mark; U+00AD marks optional hyphenation points so browsers can break long words. But their invisibility makes them a powerful tool for evasion.
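To see why indiscriminate stripping can be harmful, consider Persian, where ZWNJ is linguistically meaningful. A minimal sketch (the sample word is illustrative):

```python
# ZWNJ (U+200C) is required in Persian to stop letters joining across
# a morpheme boundary; removing it changes how the word renders.
persian = "می\u200cخواهم"  # "mikhaham" ("I want"), with ZWNJ after "می"

print("\u200c" in persian)  # True: the word legitimately contains ZWNJ
print(len(persian))         # 8 code points, one of them invisible
```

This is why blanket removal is not always appropriate; the defenses discussed later can restrict their suspicion to invisible characters in contexts where they serve no linguistic purpose.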
How the Attack Works
Consider a spam filter that blocks the word “viagra”. An attacker inserts zero-width spaces between letters:
viagra
To the human reader, this renders identically to “viagra”. But to a naive filter performing string comparison, the actual character sequence is:

```
v\u200Bi\u200Ba\u200Bg\u200Br\u200Ba
```
The filter sees a string that does not match “viagra” and lets it through. The recipient’s email client or browser strips or ignores the invisible characters and displays the intended word perfectly.
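A minimal sketch of the failure (the word list and filter function are hypothetical):

```python
BLOCKED_WORDS = {"viagra"}

def naive_filter(text: str) -> bool:
    # Plain substring matching -- exactly what the attack defeats.
    return any(word in text.lower() for word in BLOCKED_WORDS)

obfuscated = "v\u200bi\u200ba\u200bg\u200br\u200ba"

print(naive_filter("buy viagra now"))  # True: caught
print(naive_filter(obfuscated))        # False: slips through unchanged
```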
Variations of the Attack
Attackers don’t limit themselves to a single character or pattern:
- Random insertion — zero-width characters placed at random positions within blocked words
- Mixed invisible characters — combining ZWSP, ZWNJ, ZWJ, and soft hyphens in the same word to evade filters that only strip one type
- Homoglyph + zero-width combo — replacing some Latin characters with Cyrillic lookalikes and inserting invisible characters, creating a layered evasion
- Zero-width in URLs — embedding invisible characters in phishing URLs so that display text looks legitimate while the actual link differs
- Subject line injection — placing zero-width characters in email subjects to bypass subject-line keyword rules
- Invisible payload markers — using patterns of zero-width characters as steganographic channels to encode hidden data within seemingly normal text
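The mixed-character variant deserves illustration, since it defeats partial defenses. A sketch (the particular character mix is arbitrary):

```python
# One word, five different invisible characters:
# ZWSP, ZWNJ, soft hyphen, ZWJ, word joiner.
mixed = "v\u200bi\u200ca\u00adg\u200dr\u2060a"

# Stripping only ZWSP (U+200B) is not enough:
partial = mixed.replace("\u200b", "")
print("viagra" in partial)  # False: other invisibles remain

# Stripping the full set recovers the blocked word:
full = mixed
for ch in "\u200b\u200c\u200d\u00ad\u2060":
    full = full.replace(ch, "")
print("viagra" in full)  # True
```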
Real-World Impact
This technique is not theoretical. It has been observed in:
- Phishing campaigns where brand names like “PayPal” or “Microsoft” are obfuscated to bypass brand-impersonation filters
- Forum and comment spam where blocked words and URLs pass through moderation systems untouched
- SEO spam where invisible characters are used to stuff keywords into web content without visible repetition
- Messaging platforms where zero-width characters bypass word filters and profanity detectors
How to Defend Against It
1. Strip Zero-Width Characters Before Analysis
The most effective defense is to normalize input by removing all zero-width and invisible formatting characters before any keyword matching or classification takes place.
Python example:

```python
import re

# Pattern matching common zero-width and invisible characters
INVISIBLE_CHARS = re.compile(
    '[\u200B\u200C\u200D\u2060\uFEFF\u00AD\u034F\u180E'
    '\u200E\u200F'            # LTR / RTL marks
    '\u202A-\u202E'           # bidi embedding / override
    '\u2066-\u2069'           # bidi isolate
    '\uFE00-\uFE0F'           # variation selectors
    '\U000E0100-\U000E01EF'   # variation selectors supplement
    ']'
)

def strip_invisible(text: str) -> str:
    """Remove zero-width and invisible Unicode characters."""
    return INVISIBLE_CHARS.sub('', text)

# Usage in a spam filter pipeline
raw_subject = "V\u200Bi\u200Ba\u200Bg\u200Br\u200Ba"
clean_subject = strip_invisible(raw_subject)
# clean_subject == "Viagra" — now your keyword filter catches it
```
Go example:

```go
package sanitize

import (
	"strings"
	"unicode"
)

// InvisibleRanges defines Unicode ranges of zero-width
// and invisible formatting characters.
var InvisibleRanges = []*unicode.RangeTable{
	{
		R16: []unicode.Range16{
			{Lo: 0x00AD, Hi: 0x00AD, Stride: 1}, // soft hyphen
			{Lo: 0x034F, Hi: 0x034F, Stride: 1}, // combining grapheme joiner
			{Lo: 0x180E, Hi: 0x180E, Stride: 1}, // mongolian vowel separator
			{Lo: 0x200B, Hi: 0x200F, Stride: 1}, // ZWSP, ZWNJ, ZWJ, LTR/RTL marks
			{Lo: 0x202A, Hi: 0x202E, Stride: 1}, // bidi controls
			{Lo: 0x2060, Hi: 0x2060, Stride: 1}, // word joiner
			{Lo: 0x2066, Hi: 0x2069, Stride: 1}, // bidi isolates
			{Lo: 0xFE00, Hi: 0xFE0F, Stride: 1}, // variation selectors
			{Lo: 0xFEFF, Hi: 0xFEFF, Stride: 1}, // BOM / ZWNBSP
		},
		R32: []unicode.Range32{
			{Lo: 0xE0100, Hi: 0xE01EF, Stride: 1}, // variation selectors supplement
		},
	},
}

// StripInvisible removes zero-width and invisible characters.
func StripInvisible(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsOneOf(InvisibleRanges, r) {
			return -1
		}
		return r
	}, s)
}
```
PHP example:

```php
<?php

/**
 * Pattern matching common zero-width and invisible Unicode characters.
 * Uses UTF-8 hex escape sequences.
 */
const INVISIBLE_CHARS_PATTERN = '/[\x{200B}\x{200C}\x{200D}\x{2060}\x{FEFF}'
    . '\x{00AD}\x{034F}\x{180E}'
    . '\x{200E}\x{200F}'          // LTR / RTL marks
    . '\x{202A}-\x{202E}'         // bidi embedding / override
    . '\x{2066}-\x{2069}'         // bidi isolates
    . '\x{FE00}-\x{FE0F}'         // variation selectors
    . '\x{E0100}-\x{E01EF}]/u';   // variation selectors supplement

function stripInvisible(string $text): string
{
    return preg_replace(INVISIBLE_CHARS_PATTERN, '', $text);
}

function invisibleCharRatio(string $text): float
{
    if (mb_strlen($text) === 0) {
        return 0.0;
    }
    preg_match_all(INVISIBLE_CHARS_PATTERN, $text, $matches);
    return count($matches[0]) / mb_strlen($text);
}

// Usage in a spam filter pipeline
$raw = "V\u{200B}i\u{200B}a\u{200B}g\u{200B}r\u{200B}a";
$clean = stripInvisible($raw);
// $clean === "Viagra" — now your keyword filter catches it
if (invisibleCharRatio($raw) > 0.02) {
    error_log('Suspicious invisible character density detected');
}
```
2. Unicode Normalization
Apply Unicode NFC or NFKC normalization before filtering. NFKC is especially useful because it decomposes compatibility characters into their canonical equivalents, collapsing some obfuscation techniques.
```python
import unicodedata

def normalize_text(text: str) -> str:
    stripped = strip_invisible(text)  # from step 1
    return unicodedata.normalize('NFKC', stripped)
```
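For instance, NFKC folds fullwidth compatibility letters back to plain ASCII, so keyword rules apply again (illustrative example):

```python
import unicodedata

# U+FF56, U+FF49, ... are fullwidth compatibility forms that render
# like Latin letters but do not match ASCII keyword rules.
fullwidth = "ｖｉａｇｒａ"
print(unicodedata.normalize("NFKC", fullwidth))  # viagra
```

Note that NFKC does not fold Cyrillic homoglyphs into Latin; layered homoglyph attacks need a separate confusables check.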
3. Detect Anomalous Unicode Density
Legitimate text rarely contains zero-width characters in the middle of Latin-script words. Flag or score messages where:
- The ratio of invisible characters to visible characters exceeds a threshold (e.g., > 2%)
- Zero-width characters appear between ASCII letters
- A single word contains more than one type of invisible character
```python
def invisible_char_ratio(text: str) -> float:
    if not text:
        return 0.0
    invisible_count = len(INVISIBLE_CHARS.findall(text))
    return invisible_count / len(text)

def has_suspicious_invisible_chars(text: str, threshold: float = 0.02) -> bool:
    """Flag text with abnormally high invisible character density."""
    return invisible_char_ratio(text) > threshold
```
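The second heuristic, invisible characters between ASCII letters, can also be checked directly. A sketch, using a small character set of its own:

```python
import re

# Invisible characters sandwiched between ASCII letters are almost
# never legitimate; in Arabic, Persian, or Indic text they sit
# between non-Latin letters instead.
BETWEEN_ASCII = re.compile(
    r'[A-Za-z][\u200B\u200C\u200D\u2060\uFEFF\u00AD]+[A-Za-z]'
)

def invisible_between_ascii(text: str) -> bool:
    return bool(BETWEEN_ASCII.search(text))

print(invisible_between_ascii("v\u200biagra"))   # True: flag it
print(invisible_between_ascii("می\u200cخواهم"))  # False: legitimate ZWNJ
```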
4. Log and Alert on Stripping Activity
Don’t silently strip characters — log when you do it. If a message required removal of invisible characters before it matched a spam keyword, that’s a strong signal. Use this as an additional scoring factor.
```python
def analyze_for_spam(text: str) -> dict:
    # keyword_match() and log are placeholders for your own
    # keyword matcher and logger.
    invisible_count = len(INVISIBLE_CHARS.findall(text))
    cleaned = strip_invisible(text)
    normalized = unicodedata.normalize('NFKC', cleaned)
    result = {
        'cleaned_text': normalized,
        'invisible_chars_removed': invisible_count,
        'evasion_suspected': invisible_count > 0 and keyword_match(normalized),
    }
    if result['evasion_suspected']:
        log.warning(
            'Possible zero-width evasion: %d invisible chars removed, '
            'keyword match found after cleaning',
            invisible_count,
        )
    return result
```
5. Defense in Depth
No single technique is sufficient. A robust spam filtering pipeline should layer defenses:
- Strip invisible characters
- Normalize Unicode (NFKC)
- Detect anomalous invisible character usage as a spam signal
- Log stripping events for analysis and tuning
- Combine with other anti-spam signals (sender reputation, SPF/DKIM/DMARC, link analysis, ML classifiers)
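Pulled together, the layers might look like this toy scoring sketch (the weights, word list, and thresholds are illustrative, not prescriptive):

```python
import re
import unicodedata

INVISIBLE = re.compile('[\u200B\u200C\u200D\u2060\uFEFF\u00AD]')
BLOCKED_WORDS = {'viagra'}  # stand-in for a real keyword list

def score_message(text: str) -> float:
    invisible = len(INVISIBLE.findall(text))
    cleaned = unicodedata.normalize('NFKC', INVISIBLE.sub('', text))
    score = 0.0
    if invisible:
        score += 1.0   # anomalous invisible-character usage
    if any(w in cleaned.lower() for w in BLOCKED_WORDS):
        score += 2.0   # keyword hit after normalization
        if invisible:
            score += 1.0  # hit appeared only after stripping: strong signal
    return score

print(score_message("hello"))                                  # 0.0
print(score_message("v\u200bi\u200ba\u200bg\u200br\u200ba"))   # 4.0
```

In a real pipeline this score would be one feature among many, combined with the reputation, authentication, and ML signals listed above.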
Conclusion
Zero-width character injection is a simple yet effective technique that exploits the gap between what humans see and what machines parse. The defense is equally straightforward: normalize and strip invisible characters before any text analysis. The key insight is that your spam filter should evaluate text the same way a human reads it — without invisible noise. By treating Unicode normalization as a first-class step in your filtering pipeline, you close this evasion vector while preserving the legitimate use of these characters in non-Latin scripts through context-aware handling.