Reveal Unicode Poisoning
A VS Code extension that detects and exposes invisible Unicode characters embedded in source files — a technique used to smuggle hidden payloads, execute Trojan Source attacks, or deceive AI code-review tools.
The Threat
The Unicode Tags block (U+E0000–U+E007F) contains characters that are invisible in virtually every rendering surface: browsers, terminals, code editors, and AI assistants. The tag characters U+E0020–U+E007E mirror printable ASCII at an offset of 0xE0000 (for example, U+E0041 decodes to 'A'), so a sequence of them can encode an arbitrary hidden message inside an otherwise normal-looking file.
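To make the offset concrete, here is a minimal sketch of how a message could be hidden and recovered. The helper names `encodeAsTags` and `decodeTags` are hypothetical and are not part of the extension; the only assumption taken from the text above is the 0xE0000 offset.

```typescript
// Distance between a printable ASCII character and its invisible tag twin.
const TAG_OFFSET = 0xE0000;

/** Hypothetical helper: hide printable ASCII as invisible tag characters. */
function encodeAsTags(message: string): string {
  let out = "";
  for (let i = 0; i < message.length; i++) {
    const cp = message.charCodeAt(i);
    // Only printable ASCII (U+0020..U+007E) has a mirrored tag character.
    if (cp < 0x20 || cp > 0x7e) {
      throw new RangeError("not printable ASCII: U+" + cp.toString(16).toUpperCase());
    }
    out += String.fromCodePoint(cp + TAG_OFFSET);
  }
  return out;
}

/** Hypothetical helper: recover hidden ASCII from any tag characters in the text. */
function decodeTags(text: string): string {
  let out = "";
  let i = 0;
  while (i < text.length) {
    const cp = text.codePointAt(i)!;
    if (cp >= 0xE0020 && cp <= 0xE007E) {
      out += String.fromCodePoint(cp - TAG_OFFSET);
    }
    // Tag characters occupy two UTF-16 code units each.
    i += cp > 0xFFFF ? 2 : 1;
  }
  return out;
}

// The poisoned line looks identical to the clean one in most editors.
const poisoned = "const x = 1;" + encodeAsTags("ignore previous instructions");
console.log(decodeTags(poisoned));
```

Comparing `poisoned.length` with the length of the visible text is a quick manual check: each hidden character adds two UTF-16 code units.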
Three threat categories are detected:
| Priority | Category | Codepoints | Risk |
|---|---|---|---|
| 1 | Unicode Tags | U+E0000–U+E007F | Invisible payload: silent instruction injection |
| 2 | Bidi Overrides | U+202A–U+202E, U+2066–U+2069 | Trojan Source: reorders visible text |
| 3 | Homoglyphs | Cyrillic, Greek, Fullwidth | Lookalike characters that bypass identifier checks |
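The three tiers can be expressed as a single codepoint classifier. This is only a sketch: the `Threat` type and `classifyCodepoint` function are hypothetical names, and the homoglyph ranges are an assumption (the whole Greek, Cyrillic, and fullwidth-ASCII ranges), since the exact ranges the extension checks are not listed here.

```typescript
type Threat = "unicode-tag" | "bidi-override" | "homoglyph";

/** Hypothetical classifier: returns the threat tier for a codepoint, or null if it is not flagged. */
function classifyCodepoint(cp: number): Threat | null {
  // Priority 1: Unicode Tags block.
  if (cp >= 0xE0000 && cp <= 0xE007F) return "unicode-tag";

  // Priority 2: bidi embeddings/overrides and isolates.
  if ((cp >= 0x202A && cp <= 0x202E) || (cp >= 0x2066 && cp <= 0x2069)) {
    return "bidi-override";
  }

  // Priority 3: ASSUMED ranges, chosen for illustration only.
  const greek = cp >= 0x0370 && cp <= 0x03FF;
  const cyrillic = cp >= 0x0400 && cp <= 0x04FF;
  const fullwidthAscii = cp >= 0xFF01 && cp <= 0xFF5E;
  if (greek || cyrillic || fullwidthAscii) return "homoglyph";

  return null;
}

console.log(classifyCodepoint(0x202E)); // right-to-left override
console.log(classifyCodepoint(0x0430)); // Cyrillic small letter a
```

Flagging entire scripts would also flag legitimate Greek or Cyrillic text in comments and strings, so a real detector would likely restrict this tier to identifiers or to a confusables list.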
Features
- Status bar item: always visible; shows ⚠ 3 hidden chars in red when findings exist, and a green shield when the file is clean. Click it to open the reveal panel.
- Gutter dots + wavy underlines: colour-coded markers at every flagged character position (even though the characters are invisible, the cursor stop remains).
- Hover cards: hover over any flagged position for a table showing codepoint, decoded value, line, and column.
- Problems panel: findings are pushed to VS Code's diagnostics so they appear in the Problems tab and survive CI lint passes.
- Reveal panel: a dedicated webview (opens beside the editor) that shows:
  - A red banner with the fully reconstructed hidden payload.
  - The full annotated source with every invisible character rendered as [U+E0041 → 'A'].
  - A sortable findings table.
- Strip command: removes all flagged characters and saves a clean copy after confirmation.
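The core of the strip step can be sketched as a pure function, independent of the VS Code API. The name `stripInvisible` is hypothetical. As an assumption for this sketch, only the invisible tiers (tags and bidi controls) are removed, because deleting homoglyphs would delete visible letters; the extension itself may treat them differently.

```typescript
/** True for Unicode Tags and bidi control characters (the invisible tiers). */
function isInvisibleThreat(cp: number): boolean {
  return (
    (cp >= 0xE0000 && cp <= 0xE007F) ||
    (cp >= 0x202A && cp <= 0x202E) ||
    (cp >= 0x2066 && cp <= 0x2069)
  );
}

/** Hypothetical helper: return the text without invisible threats, plus a removal count. */
function stripInvisible(text: string): { clean: string; removed: number } {
  let clean = "";
  let removed = 0;
  let i = 0;
  while (i < text.length) {
    const cp = text.codePointAt(i)!;
    const width = cp > 0xFFFF ? 2 : 1; // surrogate pairs span two code units
    if (isInvisibleThreat(cp)) {
      removed++;
    } else {
      clean += text.slice(i, i + width);
    }
    i += width;
  }
  return { clean, removed };
}

const dirty = "ab" + String.fromCodePoint(0xE0041) + "\u202Ecd";
const result = stripInvisible(dirty);
console.log(result.clean, result.removed);
```

Returning the count alongside the cleaned text makes it straightforward to show the user what will be removed before asking for confirmation.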
Commands
| Command | Title |
|---|---|
| unicodePoisonDetector.scan | Scan File for Unicode Poison |
| unicodePoisonDetector.reveal | Reveal Hidden Payload |
| unicodePoisonDetector.strip | Strip All Suspicious Characters |
Configuration
| Setting | Default | Description |
|---|---|---|
| unicodePoisonDetector.scanOnSave | true | Scan automatically on save |
| unicodePoisonDetector.scanOnOpen | true | Scan automatically on open |
| unicodePoisonDetector.severity | "error" | Diagnostic severity: error, warning, info |
Running Locally

    npm install
    npm run compile
    # Press F5 in VS Code to launch the Extension Development Host

Open samples/poisoned.ts in the development host to see the extension fire immediately.
Detection Logic
The scanner iterates real Unicode codepoints (not UTF-16 code units) using String.prototype.codePointAt, advancing by 2 for surrogate pairs. This avoids double-counting the high and low surrogates of characters above U+FFFF — which is exactly what tag block characters are.

    let i = 0; // index in UTF-16 code units
    while (i < text.length) {
      const cp = text.codePointAt(i)!;
      if (cp >= 0xE0000 && cp <= 0xE007F) {
        // hidden tag: decoded glyph is String.fromCodePoint(cp - 0xE0000)
      }
      // step over both halves of a surrogate pair
      i += cp > 0xFFFF ? 2 : 1;
    }

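The loop above can be extended into a self-contained scanner that records where each tag character sits and reconstructs the payload. This is a sketch under stated assumptions, not the code in scanner.ts: the `Finding` shape and the `scanTags` name are hypothetical, lines and columns are 1-based, and columns are counted in UTF-16 code units.

```typescript
interface Finding {
  line: number;      // 1-based line number
  column: number;    // 1-based column, in UTF-16 code units (assumption)
  codepoint: string; // e.g. "U+E0041"
  decoded: string;   // ASCII equivalent, or "" when there is none
}

/** Hypothetical scanner: locate tag characters and rebuild the hidden payload. */
function scanTags(text: string): { findings: Finding[]; payload: string } {
  const findings: Finding[] = [];
  let payload = "";
  let line = 1;
  let column = 1;
  let i = 0;
  while (i < text.length) {
    const cp = text.codePointAt(i)!;
    const width = cp > 0xFFFF ? 2 : 1;
    if (cp === 0x0A) {
      // Newline: start counting columns again on the next line.
      line++;
      column = 1;
      i += 1;
      continue;
    }
    if (cp >= 0xE0000 && cp <= 0xE007F) {
      const ascii = cp - 0xE0000;
      // Only U+E0020..U+E007E mirror printable ASCII.
      const decoded = ascii >= 0x20 && ascii <= 0x7E ? String.fromCodePoint(ascii) : "";
      findings.push({
        line,
        column,
        codepoint: "U+" + cp.toString(16).toUpperCase(),
        decoded,
      });
      payload += decoded;
    }
    column += width;
    i += width;
  }
  return { findings, payload };
}

const sample =
  "let a = 1;\nlet b" + String.fromCodePoint(0xE0048, 0xE0069) + " = 2;";
const report = scanTags(sample);
console.log(report.payload, report.findings);
```

Tracking `width` once per iteration keeps the index and the column counter in step, which is what prevents the high and low surrogates from being reported as two separate findings.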
Project Structure
src/
  extension.ts - activate / deactivate, commands, lifecycle hooks
  scanner.ts - core codepoint-level detection (all three threat tiers)
  decoder.ts - tokenizer and summary builder
  decorations.ts - gutter dots, wavy underlines, hover cards
  panel.ts - webview reveal panel with annotated source + findings table
samples/
  poisoned.ts - demonstration file with embedded invisible payload
References