Kokoro Speak (VS Code)

Read text aloud in any editor using the open-weight Kokoro 82M TTS model. A normalization step (local rules by default, optionally LLM-driven) first rewrites the text into a speech-friendly form.

See REQUIREMENTS.md for the full design.

How it works

Synthesis runs in the extension host via onnxruntime-node (device: "cpu") at roughly 2.5–3x realtime. There is no webview, so no CSP and no browser autoplay restrictions.
Text is split into sentences (tts.stream() plus an explicit TextSplitterStream that we close()). Each sentence plays as soon as it is synthesized; because host synthesis runs faster than realtime, later sentences are normally ready before playback reaches them.
Playback — native PCM streaming (src/pcmplayer.ts): on a host where the native audify addon loads (N-API → RtAudio → CoreAudio), audio is streamed as 16-bit PCM into one persistent output stream (Kokoro's 24 kHz resampled to the device rate). This is natively gapless and gives exact, instant pause/resume (rt.stop()/start() at the precise sample) and instant stop — no temp files. This is the macOS path.
Fallback — file players (src/player.ts): if audify can't load (or on a platform without it), each sentence is written to a temp WAV and played with the OS CLI player (afplay/paplay/aplay/ffplay). Those can't be fed a stream and have a fixed ~0.85 s startup, so the StreamPlayer hides it by spawning segment N+1 exactly duration(N) after segment N — the startup overlaps the previous tail and seams line up. (Verified: three 1.0 s segments play in ~3.9 s, one startup not three.) Here pause kills the player (instant) and resume replays the current sentence.
kokoroSpeak.bufferSeconds: 0 (default) streams per sentence for the lowest latency; a larger value groups more audio per segment. Playback is gapless either way.
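
The overlap trick used by the file-player fallback can be sketched as a small scheduling calculation. This is a minimal illustration, not the code in src/player.ts: planSpawns and SpawnPlan are hypothetical names, and the startup delay is assumed to be constant.

```typescript
// Hypothetical helper: decide when to spawn each file player so that its
// fixed startup delay overlaps the tail of the previous segment.
interface SpawnPlan {
  spawnAt: number[];    // seconds from t=0 at which each player process is spawned
  audibleAt: number[];  // seconds at which each segment starts to sound
  totalSeconds: number; // time at which the last segment finishes
}

function planSpawns(durations: number[], startupSeconds: number): SpawnPlan {
  const spawnAt: number[] = [];
  const audibleAt: number[] = [];
  let t = 0;
  let end = 0;
  for (const d of durations) {
    spawnAt.push(t);
    audibleAt.push(t + startupSeconds);
    end = t + startupSeconds + d;
    t += d; // the next spawn happens exactly duration(N) after this spawn
  }
  return { spawnAt, audibleAt, totalSeconds: end };
}
```

Because every segment is shifted by the same startup delay, segment N+1 becomes audible at the moment segment N ends, and the delay is paid only once at the start.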

kokoro-js is ESM-only and the host bundle is CommonJS, so it's loaded with a real dynamic import() (hidden from esbuild) to avoid ERR_REQUIRE_ESM on the Node that ships in VS Code.
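
One common way to keep an import() call out of a bundler's reach is to construct it at runtime. The sketch below shows that technique under the assumption that it runs in plain Node; it is not necessarily how this extension hides the call, and dynamicImport and loadEsmModule are hypothetical names.

```typescript
// The import() expression lives inside a string, so a bundler such as esbuild
// cannot see it and cannot rewrite it into require().
const dynamicImport = new Function(
  "specifier",
  "return import(specifier);"
) as (specifier: string) => Promise<unknown>;

// Thin typed wrapper; the caller states the module shape it expects.
async function loadEsmModule<T>(specifier: string): Promise<T> {
  return (await dynamicImport(specifier)) as T;
}

// Usage from CommonJS host code (assumes kokoro-js is installed):
//   const { KokoroTTS } = await loadEsmModule<typeof import("kokoro-js")>("kokoro-js");
```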

Features

Speak Selection (Cmd/Ctrl+Alt+S; empty selection → whole document), Speak Document, Speak Clipboard (Cmd/Ctrl+Alt+V — works from any webview panel such as the Claude Code panel or Markdown preview: copy, then press it), Pause/Resume (Cmd/Ctrl+Alt+P), Stop (Cmd/Ctrl+Alt+X), right-click Speak Selection, and a status-bar item that shows state (Loading / Synthesizing selection… / Speaking / Paused). Clicking the status item toggles Pause/Resume while playing (and speaks the clipboard when idle). The full controls — Speak clipboard · Pause/Resume · Stop · Config — are command links in the item's hover tooltip (status items support only one click action and no right-click, and VS Code can't pin a popup open, so the tooltip is the control surface).
Markdown preview is handled: the keybinding works there and the underlying document is spoken with Markdown syntax stripped (src/textprep.ts).
Settings: kokoroSpeak.voice (28 voices), kokoroSpeak.speed (0.5–2.0), kokoroSpeak.bufferSeconds, plus a Select Voice quick-pick.

Normalization

Before synthesis, text is rewritten into a speech-friendly form (src/normalize.ts, $10B → 10 billion dollars). Three modes via kokoroSpeak.normalization.mode:

off: none (Markdown is still stripped).
deterministic (default): built-in local rules + your replacements. Offline, instant, private. Expands currency, percentages, versions, quarters, abbreviations, and symbols, and leaves bare digits for Kokoro to vocalize.
ai: an LLM applies your natural-language kokoroSpeak.normalization.rules and falls back to the deterministic rules on any failure. Opt-in: it asks once before first use, since it sends the spoken text to a model. The auto backend detects a local claude/codex CLI (no API key) or uses the Claude API (Kokoro Speak: Set Normalization API Key). A local gate skips the model when there's nothing to normalize, and results are cached.
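
The three modes, together with the gate, cache, and fallback of the ai mode, could be wired roughly as follows. This is a sketch under stated assumptions rather than the contents of src/normalize.ts: the deterministic function handles only the $10B pattern, the gate heuristic is invented for illustration, and the LLM backend is passed in as a plain function.

```typescript
type Mode = "off" | "deterministic" | "ai";
type Rewriter = (text: string) => Promise<string>;

// Stand-in for the built-in rules; only "$<n>B" is handled here.
function deterministic(text: string): string {
  return text.replace(/\$(\d+(?:\.\d+)?)B\b/g, "$1 billion dollars");
}

// Hypothetical gate: digits or symbols suggest there is something to normalize.
function needsNormalization(text: string): boolean {
  return /[\d$%&@#]/.test(text);
}

const normalizationCache = new Map<string, string>();

async function normalize(text: string, mode: Mode, llm: Rewriter): Promise<string> {
  if (mode === "off") return text;
  if (mode === "deterministic") return deterministic(text);
  if (!needsNormalization(text)) return text; // gate: skip the model entirely
  const hit = normalizationCache.get(text);
  if (hit !== undefined) return hit;
  try {
    const out = await llm(text);
    normalizationCache.set(text, out);
    return out;
  } catch {
    return deterministic(text); // any backend failure falls back to local rules
  }
}
```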

Customize per user (User settings) or per repo (.vscode/settings.json): kokoroSpeak.normalization.rules (natural language) and .replacements (literal/regex), e.g. { "from": "K8s", "to": "Kubernetes" }. Design notes: docs/normalization-design.md.
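
As an illustration of how literal and regex replacements might be applied, here is a minimal sketch. The from/to fields come from the example above; the regex flag and the applyReplacements name are assumptions, since the real schema may mark regex entries differently.

```typescript
interface Replacement {
  from: string;
  to: string;
  regex?: boolean; // hypothetical flag, not confirmed by the settings schema
}

function applyReplacements(text: string, rules: Replacement[]): string {
  let out = text;
  for (const rule of rules) {
    if (rule.from === "") continue; // an empty pattern would match everywhere
    out = rule.regex
      ? out.replace(new RegExp(rule.from, "g"), rule.to) // "$1" etc. are honored
      : out.split(rule.from).join(rule.to);              // plain literal replace
  }
  return out;
}
```

Rules run in order, so a later rule sees the output of an earlier one.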

AI-driven config

The status-bar tooltip has a ⚙ Config link (also Kokoro Speak: Configure (AI)). It reads skills/normalization-config.md — an editable instruction file — and runs the same LLM backend. It configures the voice and speed as well as normalization, so plain-language requests like "use a British male voice" or "talk a bit slower" set kokoroSpeak.voice / kokoroSpeak.speed for you (the model is given the catalog of valid voice ids).

The AI chooses the UI on the fly: each turn it returns a control to render, and the extension shows it natively — a quick-pick (choose one), a multi-select quick-pick (choose several), or an input box (free text). It opens with a pick menu of common actions, asks follow-ups as needed, then proposes settings you confirm before they're applied (to workspace settings if a folder is open, else user settings). Update the .md to change the menu, the questions, or how requests map to settings — no code change. Falls back to the Settings UI if no backend is available.
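
The per-turn control could be modeled as a discriminated union that is validated before anything is rendered. The field names below (kind, title, options, prompt) are hypothetical; the actual wire format is defined by the extension and skills/normalization-config.md.

```typescript
type Control =
  | { kind: "pick"; title: string; options: string[] }      // choose one
  | { kind: "multiPick"; title: string; options: string[] } // choose several
  | { kind: "input"; prompt: string };                      // free text

// Returns undefined for anything that is not a well-formed control.
function parseControl(json: string): Control | undefined {
  let raw: unknown;
  try {
    raw = JSON.parse(json);
  } catch {
    return undefined;
  }
  if (typeof raw !== "object" || raw === null) return undefined;
  const o = raw as Record<string, unknown>;
  const kind = o["kind"];
  const title = o["title"];
  const options = o["options"];
  const prompt = o["prompt"];
  const isStrings = (v: unknown): v is string[] =>
    Array.isArray(v) && v.every((x) => typeof x === "string");
  if ((kind === "pick" || kind === "multiPick") && typeof title === "string" && isStrings(options)) {
    return { kind, title, options };
  }
  if (kind === "input" && typeof prompt === "string") {
    return { kind: "input", prompt };
  }
  return undefined;
}
```

In an extension, "pick" and "multiPick" would map naturally onto vscode.window.showQuickPick (with canPickMany set for the latter) and "input" onto vscode.window.showInputBox; an undefined result is where a fallback to the Settings UI could go.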

Voice for AI agents (MCP)

A standalone MCP server (kokoro-speak-mcp) lets an MCP-capable agent — Claude Code, the Codex VS Code extension, Claude Desktop — speak its replies aloud or join a spoken discussion by calling a speak tool. It reuses the same vscode-free core (src/speaker.ts) and keeps the model warm. Build with npm run build:mcp (→ dist/mcp.js), register it with your agent, and optionally add the speak-aloud skill so it speaks proactively. Full setup: docs/mcp-voice.md.
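
For orientation, MCP clients invoke tools with a JSON-RPC 2.0 tools/call request. The sketch below builds such a request for the speak tool; the text argument name is an assumption (the real input schema is whatever kokoro-speak-mcp declares), and speakRequest is a hypothetical helper.

```typescript
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Builds the message an agent would send to ask the server to speak.
function speakRequest(id: number, text: string): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "speak", arguments: { text } }, // "text" is an assumed parameter name
  };
}
```

In practice the agent's MCP client constructs this message itself; the shape is shown only to make clear what calling a speak tool means on the wire.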

Verification

npm test runs a partial end-to-end round-trip: text → Kokoro TTS → audio → local Whisper (whisper-base.en) → text, asserting word-level match — proof the speech is intelligible, not just non-empty. All local, no API key.

Finding captured as a test: both whisper base.en and small.en mishear the brand name "Kokoro" (out-of-vocab proper noun) — a verifier limit, not a synthesis defect — so ordinary words are asserted to survive while the brand name may not.
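
A word-level match with an allowance for the brand name might look like the following. It is a simplified sketch, not the assertion in test/roundtrip.test.mjs: words and missingWords are hypothetical helpers, and the comparison ignores word order and repetition.

```typescript
// Lowercase, drop punctuation, and split into words.
function words(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9'\s]/g, " ")
    .split(/\s+/)
    .filter((w) => w.length > 0);
}

// Expected words that the transcript does not contain, excluding words the
// verifier is known to mishear (for example an out-of-vocabulary brand name).
function missingWords(expected: string, heard: string, ignore: string[] = []): string[] {
  const heardSet = new Set(words(heard));
  const skip = new Set(ignore.map((w) => w.toLowerCase()));
  return words(expected).filter((w) => !skip.has(w) && !heardSet.has(w));
}
```

A test would then assert that missingWords(input, transcript, ["kokoro"]) is empty, so ordinary words must survive while the brand name is exempt.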

Host synthesis is verified in Node (src/synth.ts): real Kokoro stream concatenates sentences into one gapless WAV at ~2.5–3x realtime; the OS player (src/player.ts) is verified for finish/pause/stop and fallback.
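
Concatenating per-sentence audio into one WAV amounts to writing a single 44-byte header followed by all samples back to back. The sketch below assumes mono Float32 chunks at Kokoro's 24 kHz and encodes them as 16-bit PCM; encodeWav is a hypothetical helper, not the function in src/synth.ts.

```typescript
function encodeWav(chunks: Float32Array[], sampleRate = 24000): Uint8Array {
  const totalSamples = chunks.reduce((n, c) => n + c.length, 0);
  const dataBytes = totalSamples * 2; // 16-bit mono
  const view = new DataView(new ArrayBuffer(44 + dataBytes));
  const ascii = (offset: number, s: string): void => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };
  ascii(0, "RIFF");
  view.setUint32(4, 36 + dataBytes, true);  // RIFF chunk size
  ascii(8, "WAVE");
  ascii(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt chunk size
  view.setUint16(20, 1, true);              // format: PCM
  view.setUint16(22, 1, true);              // channels: mono (assumed)
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  ascii(36, "data");
  view.setUint32(40, dataBytes, true);
  let offset = 44;
  for (const chunk of chunks) {
    for (let i = 0; i < chunk.length; i++) {
      const s = Math.max(-1, Math.min(1, chunk[i])); // clamp to [-1, 1]
      view.setInt16(offset, Math.round(s < 0 ? s * 32768 : s * 32767), true);
      offset += 2;
    }
  }
  return new Uint8Array(view.buffer);
}
```

Since the chunks share one data section, there is no header or padding between sentences, which is what makes the result gapless.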

Layout

Path	What
`src/extension.ts`	Host: commands, status bar, streaming play queue.
`src/synth.ts`	Host Kokoro synthesis → concatenated WAV segments.
`src/player.ts`	OS audio player wrapper (play/pause/stop, fallback chain).
`src/textprep.ts`	Markdown stripping.
`src/normalize.ts`	Speech normalization: deterministic rules + LLM (CLI/API).
`skills/normalization-config.md`	Editable instructions the AI ⚙ Config flow follows.
`scripts/spike.mjs`	Standalone Node synthesis spike → `out/spike.wav`.
`test/roundtrip.test.mjs` + `test/lib/`	Round-trip TTS↔ASR verification (`npm test`).

Run it (F5)

npm install     # approve native postinstalls (onnxruntime-node, esbuild)
npm run build   # bundle the extension host (dist/extension.js)
# press F5 → "Run Kokoro Speak (Extension)", then select text → Cmd/Ctrl+Alt+S

The first Speak downloads ~80–90 MB (q8 ONNX Kokoro weights) from Hugging Face and caches them; subsequent runs are offline.

Packaging note

npm run package builds a VSIX, but because it bundles the native ML runtime (onnxruntime-node + @huggingface/transformers) the result is large (~300MB) and platform-specific. That's the trade for host-side synthesis (fast, gapless, no webview). For real distribution this needs per-platform targets and aggressive trimming; for local use, F5 is the intended path.

James Tan