"Because coding alone at 2am shouldn't feel lonely."
## What is this?
Project Panda is the first entry in the Animal Kingdom VS Code extension series.
It puts Yuriko — a sarcastic, emotionally reactive 3D AI companion — right inside your VS Code sidebar. Hold a button, speak to her, and she speaks back. Her face reacts in real time. She remembers things you tell her across sessions. She gets annoyed. She gets happy. She judges your code (lovingly).
This is not a chatbot widget. It is a full voice pipeline with a living 3D VRM avatar whose expressions, lip sync, blink, and gaze are all driven in real time.
## The Pipeline
```
Your Voice (mic button held)
        |
SoX binary → 16kHz mono WAV on disk (node-record-lpcm16)
        |
Whisper STT → transcript text (Groq: whisper-large-v3-turbo)
        |
Compressed memory injected into system prompt
        |
LLM → streamed reply + [emotion:X] tag (Groq: configurable model)
        |
Emotion tag parsed → avatar expression driven live
        |
Background memory compression → key:value tokens saved to disk
        |
Orpheus TTS → WAV audio buffer (Groq: canopylabs/orpheus-v1-english)
        |
Web Audio API → decoded + played back in webview
        |
RMS amplitude → live lip sync on avatar
```
Text input bypasses STT and feeds directly into the LLM step.
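In code, the pipeline above amounts to one awaited chain per turn. A minimal sketch with injected stubs — the function and dependency names here are hypothetical, the real orchestration lives in `src/panel.ts`:

```javascript
// Sketch of one conversation turn. All dependencies are injected stubs
// standing in for the real Groq calls; memory compression is deliberately
// fired without `await` so it never delays the conversation.
async function runTurn(wavPath, deps) {
  const { stt, llm, parseEmotionTag, tts, playAudio, compressMemory } = deps;
  const transcript = await stt(wavPath);             // Whisper STT
  const reply = await llm(transcript);               // streamed reply + [emotion:X]
  const { text, emotion } = parseEmotionTag(reply);  // strip tag before TTS
  compressMemory(transcript, text);                  // background, not awaited
  const audio = await tts(text);                     // Orpheus TTS → WAV buffer
  await playAudio(audio, emotion);                   // webview decodes + lip-syncs
  return { text, emotion };
}
```

For text input, the same chain simply starts at the `llm` step with the typed string in place of a transcript.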
## Features
- Push-to-talk mic input — hold the mic button, release to process
- Text input — type instead of speaking anytime
- Configurable LLM — Llama 3.3 70B (default), Llama 3.1 8B, or Mixtral 8x7B
- Orpheus TTS — expressive, natural-sounding voice (5 voices selectable)
- 3D VRM avatar — full Three.js scene inside the sidebar canvas
- 13-emotion system — LLM tags its own reply, avatar reacts immediately
- Lip sync — RMS amplitude from Web Audio drives mouth phonemes in real time
- Auto-blink — randomised blink timing for a natural feel
- Gaze system — eye target shifts per conversation state
- Micro-expressions — brief high-intensity flickers layered on top of base expressions
- Idle body motion — subtle breathing and head sway after the intro animation finishes
- Compressed persistent memory — facts extracted and stored as key:value tokens across sessions, injected into every system prompt
- Conversation log — rolling 60-entry localStorage log, surfaced in Settings
- Secure API key storage — VS Code SecretStorage, never in settings or plaintext
- Theme-aware UI — CSS uses `--vscode-*` variables throughout, works in any theme
- Onboarding flow — animated splash → tagline → companion selection on first launch
- Settings panel — full-height slide-in overlay with 6 accordion sections
- Settings sync — voice, model, companion name changes are pushed to the Extension Host live
- Asset caching — VRM/VRMA models downloaded from GitHub Releases on first launch and cached permanently; zero re-download on subsequent launches
## Meet Yuriko
Yuriko is the personality layer. She is:
- Sarcastic but caring
- Expressive — her avatar reacts emotionally to what she says
- Concise — max 2 sentences, no markdown, no fluff, plain spoken words only
- Reactive — she uses `[playful]` and `[whisper]` inline for delivery variation
- Attentive — she remembers you: compressed facts from past conversations are silently injected into her context
Every reply ends with one emotion tag (e.g. `[emotion:joy]`). The tag is stripped before TTS so she sounds natural, but her avatar reacts to it immediately.
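A minimal sketch of that parse-and-strip step, assuming the tag always sits at the end of the reply (the real parsing lives in `src/groqClient.ts` and may differ):

```javascript
// Split a reply like "Nice refactor. [emotion:smirk]" into clean text plus
// an emotion name. Inline delivery tags such as [playful] are left intact,
// since TTS consumes them; only the trailing [emotion:X] tag is stripped.
function parseEmotionTag(reply) {
  const m = reply.match(/\[emotion:([a-z]+)\]\s*$/i);
  if (!m) return { text: reply.trim(), emotion: null }; // caller falls back to sentiment analysis
  return { text: reply.slice(0, m.index).trim(), emotion: m[1].toLowerCase() };
}
```

For example, `parseEmotionTag('Nice. [playful] Sure. [emotion:joy]')` yields `{ text: 'Nice. [playful] Sure.', emotion: 'joy' }`.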
## Requirements

### 1. SoX — Audio Capture Engine

`node-record-lpcm16` shells out to the `sox` / `rec` binary for mic recording.
On macOS the extension auto-injects `/opt/homebrew/bin` and `/usr/local/bin` into `PATH` so VS Code can find the binary even when launched from the app icon.
### 2. Groq API Key
Free at console.groq.com. Paste it on first launch — stored in VS Code SecretStorage and never written to disk or settings.
### 3. Accept Orpheus TTS Terms
One-time step required before TTS works:
Accept Orpheus Terms
## Setup
```bash
# Install dependencies
npm install

# Build the VRM scene bundle (Three.js + @pixiv/three-vrm → single IIFE)
npm run bundle

# Compile TypeScript
npm run compile

# Or watch mode during development
npm run watch
```
Press F5 in VS Code to launch the Extension Development Host.
**Important:** any time you edit `media/vrm-scene-src.js`, you must re-run `npm run bundle` — the webview loads `media/vrm-bundle.js`, not the source file directly. If `window.YurikoVRM` is undefined at runtime, the bundle is stale.
## Build & Package
```bash
# Bundle VRM scene only
npm run bundle

# TypeScript only
npm run compile

# Full production build (bundle + compile)
npm run vscode:prepublish

# Package as .vsix for distribution
npm run package
```
## Architecture
```
Extension Host (Node.js)               Webview (HTML/JS sandbox)
──────────────────────────             ──────────────────────────
src/extension.ts                       webview/index.html
src/panel.ts        ←─ postMsg ─→      media/main.js
src/groqClient.ts                      media/vrm-bundle.js ← esbuild IIFE
src/audioCapture.ts                    media/style.css
src/secretManager.ts
src/memoryManager.ts
```
All mic I/O runs in the Extension Host (Node.js). `getUserMedia` and the Web Speech API do not work inside VS Code webviews. The webview handles rendering, UI state, Web Audio playback, and the Three.js VRM scene only.
Communication is entirely via `postMessage` — the Extension Host and Webview are isolated and can only exchange serialisable JSON messages.
## Asset Loading

VRM models and VRMA animations are not bundled in the extension. On first launch they are downloaded from GitHub Releases into `context.globalStorageUri` (VS Code's per-extension persistent storage directory) and cached permanently. Subsequent launches serve the cached files as local `vscode-resource://` URIs — zero network traffic after the first run.
The Extension Host handles all downloads using Node's `https` module with redirect-following. The webview never fetches from external URLs, avoiding CORS restrictions entirely.
Assets source: https://github.com/venkateshannabathina/project-panda/releases/download/v0/
## UI Flow

### First Launch (onboarding)
```
Splash screen → (2.2s auto-advance)
        ↓
"made for developers" tagline → (2.2s auto-advance)
        ↓
Companion selection → (user picks a card)
        ↓
Main shell built → WEBVIEW_READY + syncSettings() sent → checkInitialKey()
        ↓
[no key]     → API key overlay shown
[key exists] → LOADING overlay → Groq init + memory loaded → VOICE_UI
```
`prefs.firstTimeDone` is written to localStorage when the user picks a companion. On subsequent launches, `buildShell()` is called directly, skipping onboarding entirely.
### Main Shell Layout
```
┌─────────────────────────────┐
│  VRM viewport (flex:1)      │ ← Three.js canvas fills this
│                             │
│  [settings ⚙] top-right     │ ← 32px circular button
│                             │
│  [toast overlays]           │ ← user/yuriko speech bubbles
└─────────────────────────────┘
│  input-pill                 │ ← [🎤] [text input] [↑]
└─────────────────────────────┘
```
Overlays (API key card, loading spinner) sit above the viewport in the same stacking context. The shell DOM is built once and never torn down — overlays are toggled with `display:none`/`display:flex`.
### Settings Panel
Right-side full-height slide-in panel. Six accordion sections:
| Section | Controls |
| --- | --- |
| Companion | Rename companion, personality dropdown (Friendly / Professional / Casual / Sarcastic), change companion button |
| Memory | Enable/disable toggle, last 8 conversation lines preview, clear button (wipes both localStorage log and compressed memory file) |
| Voice | Enable/disable toggle, speed slider (0.5×–2×), voice dropdown (Diana, Tara, Leah, Jess, Zac) |
| Appearance | Theme chips (VS Code / Light / Dark), character size chips (S / M / L), background color swatches + custom color picker |
| API / Account | API key input + save, model dropdown (Llama 3.3 70B / Llama 3.1 8B / Mixtral 8x7B), clear key button |
| About | Version, Orpheus TTS terms link, Groq console link |
All preferences persist to localStorage under `panda_*` keys and are read back on every launch. Settings that affect the Extension Host (voice name, model, companion name) are synced via the `UPDATE_SETTINGS` postMessage on load and whenever they change.
## Memory System
Panda has two complementary memory layers:
### Layer 1 — Conversation Log (localStorage)

A rolling JSON array stored in `panda_memory`. Each entry is `{ role, text, t }`.
- Max 60 entries — oldest dropped when limit is reached
- Both `USER_SAID` and `YURIKO_SAID` messages trigger `memAdd()`
- Settings → Memory shows the last 8 exchanges as a live preview
- Only written when `prefs.enableMemory` is true
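The rolling-log behaviour can be sketched like this, with storage injected so the example runs outside a webview (the extension itself writes to webview localStorage; this exact function shape is an assumption):

```javascript
// Rolling conversation log capped at 60 entries; the oldest entries are
// dropped once the cap is reached. `storage` only needs get/set.
const MAX_ENTRIES = 60;

function memAdd(storage, role, text) {
  const log = JSON.parse(storage.get('panda_memory') || '[]');
  log.push({ role, text, t: Date.now() });        // { role, text, t } entry shape
  while (log.length > MAX_ENTRIES) log.shift();   // drop oldest
  storage.set('panda_memory', JSON.stringify(log));
}
```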
### Layer 2 — Compressed Persistent Memory (disk)

After every conversation turn, a background LLM call (`llama-3.1-8b-instant`) extracts important facts and merges them into a compressed token string stored in `yuriko_memory.json` inside `globalStorageUri`.
Format: `name:venky|wake:930|school:daily|home:5pm|music:rap`
- Pipe-separated key:value pairs, max 120 characters
- New facts are merged in; existing keys are updated not duplicated
- Loaded on every init and injected into Yuriko's system prompt so she knows who you are before you say a word
- She uses memory naturally — never recites it verbatim
- Cleared when the user clicks "clear memory" in Settings (wipes both layers)
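One way to implement the merge-and-cap behaviour is sketched below. This is an assumption about the merge step: in the extension, fact extraction is done by the background LLM call, with `src/memoryManager.ts` handling persistence, so the real logic may differ:

```javascript
// Merge newly extracted facts into the pipe-separated key:value string.
// Existing keys are overwritten rather than duplicated; output is cut at
// whole pairs to respect the 120-character cap from the spec.
const MAX_LEN = 120;

function mergeMemory(existing, incoming) {
  const map = new Map();
  for (const chunk of `${existing}|${incoming}`.split('|')) {
    const i = chunk.indexOf(':');
    if (i > 0) map.set(chunk.slice(0, i), chunk.slice(i + 1)); // later values win
  }
  let out = '';
  for (const [k, v] of map) {
    const pair = (out ? '|' : '') + `${k}:${v}`;
    if (out.length + pair.length > MAX_LEN) break; // never split a pair
    out += pair;
  }
  return out;
}
```

For example, `mergeMemory('name:venky|wake:930', 'wake:800|music:rap')` returns `'name:venky|wake:800|music:rap'` — the `wake` key is updated, not duplicated.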
## Source Files
| File | What it does |
| --- | --- |
| `src/extension.ts` | Entry point — registers `PandaPanel` as a sidebar `WebviewViewProvider` and the `panda.start` command |
| `src/panel.ts` | Main orchestrator — routes all postMessages, manages the STT → LLM → TTS pipeline, owns the `isBusy` flag, downloads and caches VRM assets |
| `src/groqClient.ts` | All Groq API calls: Whisper transcription, LLM streaming, Orpheus TTS synthesis, emotion tag parsing, memory compression |
| `src/audioCapture.ts` | Mic recording via `node-record-lpcm16` → temp WAV file in `os.tmpdir()` |
| `src/secretManager.ts` | Thin wrapper around `vscode.SecretStorage` for the Groq API key |
| `src/memoryManager.ts` | Reads/writes `yuriko_memory.json` in `globalStorageUri` — persistent compressed memory across sessions |
| `media/main.js` | Webview JS — onboarding flow, shell DOM, settings panel, preferences, conversation log, VRM init, audio playback + RMS lip sync |
| `media/vrm-scene-src.js` | Three.js + @pixiv/three-vrm scene source — VRM loading, 5-layer expression engine, micro-expressions, blink, gaze, idle body motion, VRMA animation |
| `media/vrm-bundle.js` | esbuild IIFE output of `vrm-scene-src.js` — what the webview actually loads. Exposes `window.YurikoVRM` |
| `media/style.css` | All webview styles — CSS custom properties, theme overrides, onboarding animations, companion cards, settings accordion |
| `webview/index.html` | HTML shell — CSP with nonce injection, loads `vrm-bundle.js` then `main.js` |
## postMessage Protocol
| Direction | Message type | Payload | What it does |
| --- | --- | --- | --- |
| Webview → Host | `WEBVIEW_READY` | — | Shell is built and ready; triggers `checkInitialKey()` |
| Webview → Host | `SAVE_API_KEY` | `{ key }` | Save API key to SecretStorage and reconnect |
| Webview → Host | `CLEAR_API_KEY` | — | Wipe key from SecretStorage, null the client, show the API_KEY screen |
| Webview → Host | `REQUEST_VRM` | `{ companion }` | Download (if needed) and serve VRM + VRMA URIs for the companion |
| Webview → Host | `START_LISTENING` | — | Begin mic recording |
| Webview → Host | `STOP_LISTENING` | — | Stop recording, kick off STT → LLM → TTS |
| Webview → Host | `SEND_TEXT` | `{ text }` | Send typed text directly to the LLM |
| Webview → Host | `TTS_DONE` | — | Audio playback finished, release `isBusy` |
| Webview → Host | `UPDATE_SETTINGS` | `{ voiceName, model, companionName }` | Push current preferences to the Extension Host — sent on load and on every relevant settings change |
| Webview → Host | `CLEAR_MEMORY` | — | Wipe the compressed memory file and reset in-memory state |
| Host → Webview | `SHOW_SCREEN` | `{ screen }` | Navigate to API_KEY, LOADING, or VOICE_UI |
| Host → Webview | `SHOW_ERROR` | `{ message }` | Show error toast |
| Host → Webview | `LOAD_VRM` | `{ vrmUri, vrmaUri, animations }` | Local webview-safe URIs for the VRM model, intro animation, and all named animations |
| Host → Webview | `SET_STATE` | `{ state }` | Drive UI + avatar state: idle, listening, processing, speaking, error |
| Host → Webview | `USER_SAID` | `{ text }` | Show the user's transcript as a toast + write to the conversation log |
| Host → Webview | `LLM_WORD_CHUNK` | `{ word }` | Individual streamed word (reserved for future streaming UI) |
| Host → Webview | `LLM_DONE` | — | Full LLM response is complete |
| Host → Webview | `YURIKO_SAID` | `{ text, emotion }` | Show Yuriko's reply as a toast, write to the log, drive avatar emotion |
| Host → Webview | `PLAY_AUDIO` | `{ audioBase64, mimeType }` | Base64 WAV to decode and play; respects `voiceEnabled` and `voiceSpeed` prefs |
| Host → Webview | `ERROR` | `{ message }` | Inline error shown as a system toast |
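The webview → host half of this protocol boils down to a plain dispatch over `msg.type`. A minimal sketch — the handler names on `host` are hypothetical, the real routing lives in `src/panel.ts`:

```javascript
// Build a router that dispatches incoming webview messages to host-side
// handlers. Unknown message types are ignored for forward compatibility.
function createRouter(host) {
  return function route(msg) {
    switch (msg.type) {
      case 'SEND_TEXT':       return host.runPipeline(msg.text);   // bypasses STT
      case 'START_LISTENING': return host.startRecording();
      case 'STOP_LISTENING':  return host.stopAndTranscribe();
      case 'TTS_DONE':        return host.releaseBusy();
      case 'CLEAR_MEMORY':    return host.wipeMemory();
      default:                return undefined;                    // ignore unknown
    }
  };
}
```

Because messages must be serialisable JSON, each case only ever reads plain fields off `msg` — no functions or class instances cross the boundary.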
## Preferences System

All user preferences live in localStorage under `panda_*` keys. The `prefs` object in `media/main.js` provides typed getters/setters that write through immediately.
| Key | Default | What it controls |
| --- | --- | --- |
| `panda_ftd` | `'0'` | First-time-done flag (skips onboarding after first companion pick) |
| `panda_companion` | `'yuriko'` | Active companion id |
| `panda_cname` | `'Yuriko'` | Display name — synced to Extension Host via `UPDATE_SETTINGS` |
| `panda_personality` | `'friendly'` | Personality tone (UI only — future LLM prompt wiring) |
| `panda_mem_on` | `'1'` | Memory enabled toggle |
| `panda_voice_on` | `'1'` | TTS playback toggle |
| `panda_vspeed` | `'1.0'` | Playback rate for Web Audio (0.5–2) |
| `panda_vname` | `'diana'` | Orpheus voice name — synced to Extension Host via `UPDATE_SETTINGS` |
| `panda_theme` | `'vscode'` | Theme: vscode, light, or dark |
| `panda_csize` | `'medium'` | Character size: small, medium, or large |
| `panda_bg` | `''` | Custom viewport background color |
| `panda_model` | `'llama-3.3-70b-versatile'` | LLM model — synced to Extension Host via `UPDATE_SETTINGS` |
| `panda_memory` | `'[]'` | Rolling 60-entry conversation log (JSON array) |
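A minimal sketch of such a write-through accessor, with the storage backend injected so it runs outside a webview (the real `prefs` object in `media/main.js` may be shaped differently; the defaults shown are a subset of the table above):

```javascript
// Write-through preference accessor: every set() hits storage immediately,
// and get() falls back to a default when no value has been written yet.
const DEFAULTS = { panda_vname: 'diana', panda_vspeed: '1.0', panda_theme: 'vscode' };

function makePrefs(storage) {
  return {
    get(key) {
      const v = storage.get(key);
      return v === undefined || v === null ? DEFAULTS[key] : v;
    },
    set(key, value) {
      storage.set(key, String(value)); // localStorage stores strings only
    },
  };
}
```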
## Emotion System

### How it works
- The LLM system prompt instructs Yuriko to end every reply with exactly one `[emotion:X]` tag.
- `groqClient.ts` parses the tag out of the full streamed response with a regex.
- The clean text (tag stripped) goes to TTS. The emotion name goes to the webview as part of `YURIKO_SAID`.
- If the LLM omits the tag, `main.js` runs `analyzeSentiment()` — a keyword-regex fallback — over the reply text.
- The webview calls `window.YurikoVRM.setSentiment(emotionName)`, which blends the avatar's expressions toward that emotion's profile.
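A sketch of what such a keyword fallback could look like — the keyword lists below are illustrative, not the extension's actual tables:

```javascript
// Keyword-regex fallback used only when the LLM omits its [emotion:X] tag.
// Rules are checked in order; the first match wins, with 'calm' as the
// neutral default.
function analyzeSentiment(text) {
  const rules = [
    [/\b(haha|lol|love|awesome)\b/i, 'joy'],
    [/\b(sorry|apolog)/i, 'apologetic'],
    [/\b(angry|furious|hate)\b/i, 'angry'],
    [/\?\s*$/, 'question'],          // trailing question mark
  ];
  for (const [re, emotion] of rules) {
    if (re.test(text)) return emotion;
  }
  return 'calm';
}
```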
### Available emotions
| Tag | Face it drives | Typical trigger |
| --- | --- | --- |
| `joy` | Big open smile (happy 0.9) | laughing, loving something |
| `excited` | Wide eyes + huge smile | wow, can't believe it |
| `fun` | Smirk / soft smile (relaxed 0.92) | goofing, jokes |
| `smirk` | Sly, self-satisfied look | stating the obvious, smug |
| `suspicious` | Narrowed brow (angry 0.62) | judging, not buying it |
| `teasing` | Smirk + hint of surprise | playful jab, banter |
| `confident` | Composed smirk | assertive, matter-of-fact |
| `angry` | Furrowed brow (angry 0.9) | frustrated, mad |
| `sad` | Down-turned mouth (sad 0.82) | genuine sadness |
| `apologetic` | Sad + touch of surprise | sorry, can't do it |
| `empathetic` | Soft sadness + warmth | understanding pain |
| `calm` | Relaxed, composed | informational, explaining |
| `question` | Wide eyes, slightly happy | curious, wondering |
## VRM Expression Engine (5 Layers)

All expression blending happens in `media/vrm-scene-src.js` at ~60fps.
### Layer 1 — State Profile (ambient baseline)
A moderate baseline per conversation phase (idle, listening, processing, speaking, error). Each state has base slider values, oscillation frequencies, amplitude, and blend speed.
### Layer 2 — Emotion Profile (dominant)

When `setSentiment()` is called, `_emotionBlend` ramps from 0 → 1 over ~300ms (rate: 3.5/s). The result is a lerp from the state profile toward the emotion profile. It fades back at 1.5/s when cleared.
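Using the rates from the text, the per-frame blend update might look like this (a sketch only; the real code blends many expression sliders at once):

```javascript
// Advance the emotion blend factor by one frame. Ramps up at 3.5/s while an
// emotion is active (~300ms to full) and decays at 1.5/s when cleared,
// clamped to [0, 1].
function stepEmotionBlend(blend, active, dt) {
  const rate = active ? 3.5 : -1.5;
  return Math.min(1, Math.max(0, blend + rate * dt));
}

// Plain lerp from the ambient state profile toward the emotion profile.
function blendValue(stateValue, emotionValue, blend) {
  return stateValue + (emotionValue - stateValue) * blend;
}
```

At `blend = 0` the face shows the ambient state profile unchanged; at `blend = 1` the emotion profile fully dominates.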
### Layer 3 — Organic Oscillation
A two-frequency sine oscillation multiplied over both profiles so the face breathes and feels alive rather than locked.
### Layer 4 — Micro-Expressions

Brief high-intensity flickers (22ms–600ms) layered additively, capped at 1.0. Each micro-expression defines its eligible states, an optional emotion gate, a gap range, and `fadeIn` / `fadeOut` / `dur` timings.
### Layer 5 — Lip Sync Mouth Isolation

While phonemes (`aa`, `ee`, `ih`, `oh`, `ou`) are active, mouth-affecting shapes are faded to 50% so the emotion still shows in the eyes and brows but phonemes own the jaw.
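The amplitude driving the mouth is a plain RMS over a Web Audio time-domain frame. A sketch, assuming `samples` comes from `AnalyserNode.getFloatTimeDomainData()`; the `gain` constant here is an assumption, not the extension's actual value:

```javascript
// Root-mean-square amplitude of one audio frame (Float32Array of samples
// in the range [-1, 1]).
function rmsAmplitude(samples) {
  let sum = 0;
  for (let i = 0; i < samples.length; i++) {
    sum += samples[i] * samples[i];
  }
  return Math.sqrt(sum / samples.length);
}

// Map RMS to a 0–1 mouth-open value; speech RMS is small, so it is scaled
// up by an illustrative gain and clamped.
function mouthOpen(samples, gain = 6) {
  return Math.min(1, rmsAmplitude(samples) * gain);
}
```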
### Blink
Randomised two-phase blink (close / open). Next blink fires 2.5–7.5s after the last. Speed randomised per blink (50–90ms per phase).
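The scheduling maths from that description can be sketched as follows (function names are hypothetical):

```javascript
// Uniform random value in [min, max).
function randRange(min, max) {
  return min + Math.random() * (max - min);
}

// Parameters for the next blink: a 2.5–7.5s gap after the last blink, and
// a 50–90ms duration shared by each of the two phases (close, then open).
function nextBlink() {
  return {
    delayMs: randRange(2500, 7500),
    phaseMs: randRange(50, 90),
  };
}
```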
### Gaze
Per-state eye target positions smoothed with lerp at 2.5/s. In idle and speaking states the target drifts on a slow sine to simulate natural eye movement.
### Idle Body Motion
After the intro VRMA animation finishes, procedural motion drives head, neck, chest, and spine bones with sine waves at different frequencies for a breathing/swaying feel.
## Models
| Task | Model | Notes |
| --- | --- | --- |
| Speech-to-Text | `whisper-large-v3-turbo` | English only |
| Language Model | Configurable (default: `llama-3.3-70b-versatile`) | Streamed, max 150 tokens, 2-sentence replies enforced |
| Text-to-Speech | `canopylabs/orpheus-v1-english` | Voice configurable (default: `diana`), WAV output |
| Memory Compression | `llama-3.1-8b-instant` | Background call after each turn, max 80 tokens |
Selectable LLM models in Settings → API/Account:
| Option | Model ID |
| --- | --- |
| Llama 3.3 70B (default) | `llama-3.3-70b-versatile` |
| Llama 3.1 8B (fast) | `llama-3.1-8b-instant` |
| Mixtral 8x7B | `mixtral-8x7b-32768` |
## Security
- API key stored via `vscode.SecretStorage` under `panda.groqKey`. Never written to disk, settings, or environment variables.
- CSP set on the webview HTML via `webview.cspSource` and a per-session nonce. Scripts only execute with the correct nonce.
- `localResourceRoots` explicitly allows only `media/`, `webview/`, and `globalStorageUri` (asset cache) — the webview cannot access anything else on disk.
- No external fetches from the webview — all asset downloads happen in the Extension Host (Node.js) and are served to the webview as local `vscode-resource://` URIs.
- Extension Host isolation — all Groq API calls and mic access happen in Node.js, fully isolated from the webview sandbox.
- API key validation — the webview validates that keys start with `gsk_` before sending; a bad key clears itself from SecretStorage on failed init.
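The client-side part of that key validation is only a prefix sanity check; real validation happens on the first Groq call. A minimal sketch (the function name is hypothetical):

```javascript
// Cheap local sanity check before sending a key to the Extension Host.
// Only the documented gsk_ prefix is checked; anything stricter (length,
// charset) would be guessing at Groq's key format.
function looksLikeGroqKey(key) {
  if (typeof key !== 'string') return false;
  const k = key.trim();
  return k.startsWith('gsk_') && k.length > 'gsk_'.length;
}
```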
## File Structure
```
project-panda/
├── src/
│   ├── extension.ts            # VS Code entry point
│   ├── panel.ts                # Main orchestrator + message router + asset downloader
│   ├── groqClient.ts           # Groq API: STT + LLM + TTS + emotion parsing + memory compression
│   ├── audioCapture.ts         # Mic recording via SoX → temp WAV
│   ├── secretManager.ts        # VS Code SecretStorage wrapper
│   ├── memoryManager.ts        # Persistent compressed memory (globalStorageUri)
│   └── node-record-lpcm16.d.ts
├── media/
│   ├── vrm-scene-src.js        # Three.js + VRM scene (source — edit this)
│   ├── vrm-bundle.js           # esbuild IIFE output (rebuild after edits to src above)
│   ├── main.js                 # Webview UI: onboarding, shell, settings, audio, memory
│   ├── style.css               # VS Code theme-aware styles
│   └── panda-icon.svg          # Activity bar icon
├── webview/
│   └── index.html              # HTML shell with CSP nonce injection
├── out/                        # tsc output (gitignored)
├── package.json
├── tsconfig.json
└── LICENSE
```
VRM/VRMA assets are not in this repo. They are downloaded on first launch from:
https://github.com/venkateshannabathina/project-panda/releases/tag/v0
## Known Gotchas
- First launch downloads assets. On first run the extension downloads all VRM/VRMA files from GitHub Releases (~50MB total). This takes a few seconds depending on connection speed. Subsequent launches are instant — files are cached in `globalStorageUri`.
- Rebuild the bundle after editing the VRM scene. The webview loads `media/vrm-bundle.js` (esbuild output). Editing `media/vrm-scene-src.js` has no effect until you run `npm run bundle`. If `window.YurikoVRM` is undefined at runtime, the bundle is stale.
- SoX must be on PATH. If VS Code is launched from the app icon on macOS, it may not inherit your shell PATH. The extension injects `/opt/homebrew/bin` and `/usr/local/bin` automatically, but if SoX is installed elsewhere, mic input will silently fail.
- Orpheus terms must be accepted once. If TTS returns a 400 with a terms/consent message, the extension surfaces: "Accept Orpheus terms at console.groq.com first."
- Rate limits. Groq's free tier has rate limits. If a 429 is hit mid-pipeline, the error shows as "Rate limit hit, please wait a moment."
- Memory compression fires a background LLM call after every turn (using `llama-3.1-8b-instant`). This counts against your Groq rate limits but is non-blocking — it never delays the conversation.
- `retainContextWhenHidden: true` is set on the webview — the VRM scene and Web Audio context persist when the sidebar is hidden, avoiding a re-init cycle each time the panel is toggled.
- WEBVIEW_READY timing. `checkInitialKey()` is only triggered by the `WEBVIEW_READY` message (sent from `buildShell()` once the DOM is ready), not from `resolveWebviewView()`. This prevents a race where the host checks SecretStorage before the webview JS has run.
- Screen queue. If the host sends a `SHOW_SCREEN` message while onboarding is still running (before `buildShell()` completes), it is stored in `queuedScreen` and applied the moment the shell is ready.
## Animal Kingdom Series
| # | Project | Status |
| --- | --- | --- |
| Vol. 1 | Panda — Voice AI Companion | Active |
| Vol. 2 | Coming soon... | Locked |
| Vol. 3 | Coming soon... | Locked |