Reveal Unicode Poisoning

A VS Code extension that detects and exposes invisible Unicode characters embedded in source files — a technique used to smuggle hidden payloads, execute Trojan Source attacks, or deceive AI code-review tools.

The Threat

The Unicode Tags block (U+E0000–U+E007F) contains characters that are invisible on virtually every rendering surface: browsers, terminals, code editors, and AI assistants. Each tag character in U+E0020–U+E007E mirrors a printable ASCII character at an offset of 0xE0000 (for example, U+E0041 corresponds to 'A', U+0041), so a sequence of them can encode an arbitrary hidden message inside an otherwise normal-looking file.
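To make the offset mapping concrete, here is a minimal sketch of how a message could be hidden in tag characters and recovered again. The names `encodeTags` and `decodeTags` are hypothetical and are not taken from the extension's source.

```typescript
const TAG_BASE = 0xe0000;

// Hide an ASCII message by shifting each character into the Tags block.
function encodeTags(message: string): string {
  let out = "";
  for (let i = 0; i < message.length; i++) {
    const code = message.charCodeAt(i);
    if (code > 0x7f) {
      throw new RangeError("only ASCII can be mapped into the Tags block");
    }
    out += String.fromCodePoint(TAG_BASE + code);
  }
  return out;
}

// Recover the hidden message: keep only tag characters and shift them back.
function decodeTags(text: string): string {
  let out = "";
  let i = 0;
  while (i < text.length) {
    const cp = text.codePointAt(i)!;
    if (cp >= TAG_BASE && cp <= TAG_BASE + 0x7f) {
      out += String.fromCodePoint(cp - TAG_BASE);
    }
    i += cp > 0xffff ? 2 : 1; // tag characters occupy two UTF-16 code units
  }
  return out;
}

// Renders like the bare statement in most editors, but carries a payload.
const poisonedLine = "const x = 1;" + encodeTags("ignore previous rules");
```

Calling `decodeTags(poisonedLine)` gives back the hidden sentence, while the visible part of the line is unchanged.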

Three threat categories are detected:

Priority	Category	Codepoints	Risk
1	Unicode Tags	`U+E0000–U+E007F`	Invisible payload — silent instruction injection
2	Bidi Overrides	`U+202A–U+202E`, `U+2066–U+2069`	Trojan Source — reorders visible text
3	Homoglyphs	Cyrillic, Greek, Fullwidth	Lookalike characters that bypass identifier checks
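The three tiers in the table can be sketched as a single classifier over codepoints. This is an illustration only: the function name `classify` and the homoglyph ranges (the whole Greek and Coptic, Cyrillic, and Fullwidth ASCII blocks) are assumptions, and a real detector would likely narrow them to confusable letters so that legitimate Greek or Russian text is not flagged.

```typescript
type Category = "tag" | "bidi" | "homoglyph";

// Lower number = higher priority, matching the table above.
const PRIORITY: Record<Category, number> = { tag: 1, bidi: 2, homoglyph: 3 };

function classify(cp: number): Category | null {
  // Unicode Tags block
  if (cp >= 0xe0000 && cp <= 0xe007f) return "tag";
  // Bidi embeddings/overrides and isolates
  if ((cp >= 0x202a && cp <= 0x202e) || (cp >= 0x2066 && cp <= 0x2069)) {
    return "bidi";
  }
  // Assumed ranges: Greek and Coptic, Cyrillic, Fullwidth ASCII variants
  if (
    (cp >= 0x0370 && cp <= 0x03ff) ||
    (cp >= 0x0400 && cp <= 0x04ff) ||
    (cp >= 0xff01 && cp <= 0xff5e)
  ) {
    return "homoglyph";
  }
  return null; // not flagged
}
```

For instance, `classify(0x0430)` (Cyrillic "а", which looks like Latin "a") falls into the homoglyph tier, while `classify(0x61)` returns `null`.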

Features

- Status bar item: always visible; shows ⚠ 3 hidden chars in red when findings exist, a green shield when clean. Click to open the reveal panel.
- Gutter dots + wavy underlines: colour-coded markers at every flagged character position (even though the characters are invisible, the cursor stop remains).
- Hover cards: hover any flagged position for a table showing codepoint, decoded value, line, and column.
- Problems panel: findings are pushed to VS Code's diagnostics so they appear in the Problems tab and survive CI lint passes.
- Reveal panel: a dedicated webview (opens beside the editor) that shows:
  - A red banner with the fully reconstructed hidden payload.
  - The full annotated source with every invisible character rendered as [U+E0041 → 'A'].
  - A sortable findings table.
- Strip command: removes all flagged characters and saves a clean copy after confirmation.
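The annotated-source view and the strip command can both be pictured as one pass over the text that either labels or drops each flagged character. The sketch below is an assumption about how that might look, not the extension's code: `rewrite`, `label`, and `isInvisibleThreat` are hypothetical names, it handles only the invisible tiers (tags and bidi controls), and the bare `[U+202E]` label for bidi characters is an invented format.

```typescript
// True for Unicode Tags and the bidi control ranges listed in this README.
function isInvisibleThreat(cp: number): boolean {
  return (
    (cp >= 0xe0000 && cp <= 0xe007f) ||
    (cp >= 0x202a && cp <= 0x202e) ||
    (cp >= 0x2066 && cp <= 0x2069)
  );
}

// Render one flagged codepoint, e.g. [U+E0041 → 'A'] for a tag character.
function label(cp: number): string {
  const hex = "U+" + cp.toString(16).toUpperCase();
  const ascii = cp - 0xe0000;
  // Only tag characters that mirror printable ASCII get a decoded glyph.
  if (ascii >= 0x20 && ascii <= 0x7e) {
    return "[" + hex + " → '" + String.fromCodePoint(ascii) + "']";
  }
  return "[" + hex + "]";
}

// Walk the text by codepoint; either annotate or drop each flagged character.
function rewrite(text: string, mode: "annotate" | "strip"): string {
  let out = "";
  let i = 0;
  while (i < text.length) {
    const cp = text.codePointAt(i)!;
    const width = cp > 0xffff ? 2 : 1;
    if (isInvisibleThreat(cp)) {
      if (mode === "annotate") out += label(cp);
    } else {
      out += text.slice(i, i + width);
    }
    i += width;
  }
  return out;
}
```

Homoglyphs are deliberately left out of this sketch, since deleting a lookalike letter would change the visible identifier rather than restore it.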

Commands

Command	Title
`unicodePoisonDetector.scan`	Scan File for Unicode Poison
`unicodePoisonDetector.reveal`	Reveal Hidden Payload
`unicodePoisonDetector.strip`	Strip All Suspicious Characters

Configuration

Setting	Default	Description
`unicodePoisonDetector.scanOnSave`	`true`	Scan automatically on save
`unicodePoisonDetector.scanOnOpen`	`true`	Scan automatically on open
`unicodePoisonDetector.severity`	`"error"`	Diagnostic severity: `error`, `warning`, `info`
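As an illustration of how the `severity` setting might be consumed, the sketch below maps the three documented strings to numeric levels and falls back to the default on anything else. `resolveSeverity` and the `Severity` object are hypothetical; the numbers mirror the values of `vscode.DiagnosticSeverity` so the example runs without the `vscode` module.

```typescript
// Numeric values mirror vscode.DiagnosticSeverity (Error = 0, Warning = 1, Information = 2).
const Severity = { Error: 0, Warning: 1, Information: 2 } as const;
type SeverityLevel = (typeof Severity)[keyof typeof Severity];

// Map the configured string to a level; unknown values use the documented default "error".
function resolveSeverity(setting: unknown): SeverityLevel {
  switch (setting) {
    case "warning":
      return Severity.Warning;
    case "info":
      return Severity.Information;
    case "error":
    default:
      return Severity.Error;
  }
}

// Inside an extension, the raw value would typically come from
// vscode.workspace.getConfiguration("unicodePoisonDetector").get("severity").
```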

Running Locally

npm install
npm run compile
# Press F5 in VS Code to launch the Extension Development Host

Open samples/poisoned.ts in the development host to see the extension fire immediately.

Detection Logic

The scanner iterates real Unicode codepoints (not UTF-16 code units) using String.prototype.codePointAt, advancing by 2 for surrogate pairs. This avoids double-counting the high and low surrogates of characters above U+FFFF — which is exactly what tag block characters are.

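let i = 0; // index in UTF-16 code units
while (i < text.length) {
  const cp = text.codePointAt(i)!; // full codepoint, even when it spans a surrogate pair
  if (cp >= 0xE0000 && cp <= 0xE007F) {
    // hidden tag: decoded glyph is String.fromCodePoint(cp - 0xE0000)
  }
  i += cp > 0xFFFF ? 2 : 1; // skip the low surrogate of a pair
}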
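Building on the loop above, a fuller sketch could also track line and column so each finding carries the fields shown in the hover cards. The `Finding` shape, the function names, and the choice of 1-based positions counted in UTF-16 code units are assumptions made for this example.

```typescript
interface Finding {
  codePoint: number;
  decoded: string; // ASCII equivalent of the tag character
  line: number;    // 1-based
  column: number;  // 1-based, in UTF-16 code units
}

// Collect every Unicode Tags character together with its position.
function scanTags(text: string): Finding[] {
  const findings: Finding[] = [];
  let line = 1;
  let column = 1;
  let i = 0;
  while (i < text.length) {
    const cp = text.codePointAt(i)!;
    const width = cp > 0xffff ? 2 : 1;
    if (cp >= 0xe0000 && cp <= 0xe007f) {
      findings.push({
        codePoint: cp,
        decoded: String.fromCodePoint(cp - 0xe0000),
        line,
        column,
      });
    }
    if (cp === 0x0a) {
      // newline: move to the start of the next line
      line += 1;
      column = 1;
    } else {
      column += width;
    }
    i += width;
  }
  return findings;
}

// Concatenate the decoded characters in document order.
function hiddenPayload(findings: Finding[]): string {
  return findings.map((f) => f.decoded).join("");
}
```

Counting columns in UTF-16 code units keeps positions aligned with how JavaScript strings are indexed, at the cost of each tag character advancing the column by two.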

Project Structure

src/
  extension.ts   — activate / deactivate, commands, lifecycle hooks
  scanner.ts     — core codepoint-level detection (all three threat tiers)
  decoder.ts     — tokenizer and summary builder
  decorations.ts — gutter dots, wavy underlines, hover cards
  panel.ts       — webview reveal panel with annotated source + findings table
samples/
  poisoned.ts    — demonstration file with embedded invisible payload

Resat Caner Bas
