"Because coding alone at 2am shouldn't feel lonely."
## What is this?
Project Panda is the first entry in the Animal Kingdom VS Code extension series.
It puts Yuriko — a sarcastic, emotionally reactive 3D AI companion — right inside your VS Code sidebar. Hold a button, speak to her, and she speaks back. Her face reacts in real time. She remembers things you tell her across sessions. She gets annoyed. She gets happy. She judges your code (lovingly).
This is not a chatbot widget. It is a full voice pipeline with a living 3D VRM avatar whose expressions, lip sync, blink, and gaze are all driven in real time.
## The Pipeline
```
Your Voice (mic button held)
        |
SoX binary → 16kHz mono WAV on disk (node-record-lpcm16)
        |
Whisper STT → transcript text (Groq: whisper-large-v3-turbo)
        |
Compressed memory injected into system prompt
        |
LLM → streamed reply + [emotion:X] tag (Groq: configurable model)
        |
Emotion tag parsed → avatar expression driven live
        |
Background memory compression → key:value tokens saved to disk
        |
Orpheus TTS → WAV audio buffer (Groq: canopylabs/orpheus-v1-english)
        |
Web Audio API → decoded + played back in webview
        |
RMS amplitude → live lip sync on avatar
```
Text input bypasses STT and feeds directly into the LLM step.
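In code, the pipeline above amounts to one awaited chain per turn. A minimal sketch with injected stubs — the function and dependency names here are hypothetical, the real orchestration lives in `src/panel.ts`:

```javascript
// Sketch of one conversation turn. All dependencies are injected stubs
// standing in for the real Groq calls; memory compression is deliberately
// fired without `await` so it never delays the conversation.
async function runTurn(wavPath, deps) {
  const { stt, llm, parseEmotionTag, tts, playAudio, compressMemory } = deps;
  const transcript = await stt(wavPath);             // Whisper STT
  const reply = await llm(transcript);               // streamed reply + [emotion:X]
  const { text, emotion } = parseEmotionTag(reply);  // strip tag before TTS
  compressMemory(transcript, text);                  // background, not awaited
  const audio = await tts(text);                     // Orpheus TTS → WAV buffer
  await playAudio(audio, emotion);                   // webview decodes + lip-syncs
  return { text, emotion };
}
```

For text input, the same chain simply starts at the `llm` step with the typed string in place of a transcript.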
## Features
- Push-to-talk mic input — hold the mic button, release to process
- Text input — type instead of speaking anytime
- Configurable LLM — Llama 3.3 70B (default), Llama 3.1 8B, or Mixtral 8x7B
- Orpheus TTS — expressive, natural-sounding voice (5 voices selectable)
- 3D VRM avatar — full Three.js scene inside the sidebar canvas
- 13-emotion system — LLM tags its own reply, avatar reacts immediately
- Lip sync — RMS amplitude from Web Audio drives mouth phonemes in real time
- Auto-blink — randomised blink timing for a natural feel
- Gaze system — eye target shifts per conversation state
- Micro-expressions — brief high-intensity flickers layered on top of base expressions
- Idle body motion — subtle breathing and head sway after the intro animation finishes
- Compressed persistent memory — facts extracted and stored as key:value tokens across sessions, injected into every system prompt
- Conversation log — rolling 60-entry localStorage log, surfaced in Settings
- Secure API key storage — VS Code SecretStorage, never in settings or plaintext
- Theme-aware UI — CSS uses `--vscode-*` variables throughout, works in any theme
- Onboarding flow — animated splash → tagline → companion selection on first launch
- Settings panel — full-height slide-in overlay with 6 accordion sections
- Settings sync — voice, model, companion name changes are pushed to the Extension Host live
- Asset caching — VRM/VRMA models downloaded from GitHub Releases on first launch and cached permanently; zero re-download on subsequent launches
## Meet Yuriko
Yuriko is the personality layer. She is:
- Sarcastic but caring
- Expressive — her avatar reacts emotionally to what she says
- Concise — max 2 sentences, no markdown, no fluff, plain spoken words only
- Reactive — she uses `[playful]` and `[whisper]` inline for delivery variation
- Attentive — she remembers you: compressed facts from past conversations are silently injected into her context
Every reply ends with one emotion tag (e.g. `[emotion:joy]`). The tag is stripped before TTS so she sounds natural, but her avatar reacts to it immediately.
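A minimal sketch of that parse-and-strip step, assuming the tag always sits at the end of the reply (the real parsing lives in `src/groqClient.ts` and may differ):

```javascript
// Split a reply like "Nice refactor. [emotion:smirk]" into clean text plus
// an emotion name. Inline delivery tags such as [playful] are left intact,
// since TTS consumes them; only the trailing [emotion:X] tag is stripped.
function parseEmotionTag(reply) {
  const m = reply.match(/\[emotion:([a-z]+)\]\s*$/i);
  if (!m) return { text: reply.trim(), emotion: null }; // caller falls back to sentiment analysis
  return { text: reply.slice(0, m.index).trim(), emotion: m[1].toLowerCase() };
}
```

For example, `parseEmotionTag('Nice. [playful] Sure. [emotion:joy]')` yields `{ text: 'Nice. [playful] Sure.', emotion: 'joy' }`.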
## Requirements

### 1. SoX — Audio Capture Engine

`node-record-lpcm16` shells out to the `sox` / `rec` binary for mic recording.
On macOS the extension auto-injects `/opt/homebrew/bin` and `/usr/local/bin` into `PATH` so VS Code can find the binary even when launched from the app icon.
### 2. Groq API Key
Free at console.groq.com. Paste it on first launch — stored in VS Code SecretStorage and never written to disk or settings.
### 3. Accept Orpheus TTS Terms
One-time step required before TTS works:
Accept Orpheus Terms
## Setup
```bash
# Install dependencies
npm install

# Build the VRM scene bundle (Three.js + @pixiv/three-vrm → single IIFE)
npm run bundle

# Compile TypeScript
npm run compile

# Or watch mode during development
npm run watch
```
Press F5 in VS Code to launch the Extension Development Host.
**Important:** any time you edit `media/vrm-scene-src.js`, you must re-run `npm run bundle` — the webview loads `media/vrm-bundle.js`, not the source file directly. If `window.YurikoVRM` is undefined at runtime, the bundle is stale.
## Build & Package
```bash
# Bundle VRM scene only
npm run bundle

# TypeScript only
npm run compile

# Full production build (bundle + compile)
npm run vscode:prepublish

# Package as .vsix for distribution
npm run package
```
## Architecture
```
Extension Host (Node.js)               Webview (HTML/JS sandbox)
──────────────────────────             ──────────────────────────
src/extension.ts                       webview/index.html
src/panel.ts        ←─ postMsg ─→      media/main.js
src/groqClient.ts                      media/vrm-bundle.js ← esbuild IIFE
src/audioCapture.ts                    media/style.css
src/secretManager.ts
src/memoryManager.ts
```
All mic I/O runs in the Extension Host (Node.js). `getUserMedia` and the Web Speech API do not work inside VS Code webviews. The webview handles rendering, UI state, Web Audio playback, and the Three.js VRM scene only.
Communication is entirely via `postMessage` — the Extension Host and Webview are isolated and can only exchange serialisable JSON messages.
## Asset Loading

VRM models and VRMA animations are not bundled in the extension. On first launch they are downloaded from GitHub Releases into `context.globalStorageUri` (VS Code's per-extension persistent storage directory) and cached permanently. Subsequent launches serve the cached files as local `vscode-resource://` URIs — zero network traffic after the first run.
The Extension Host handles all downloads using Node's `https` module with redirect-following. The webview never fetches from external URLs, avoiding CORS restrictions entirely.
Assets source: https://github.com/venkateshannabathina/project-panda/releases/download/v0/
## UI Flow

### First Launch (onboarding)
```
Splash screen → (2.2s auto-advance)
        ↓
"made for developers" tagline → (2.2s auto-advance)
        ↓
Companion selection → (user picks a card)
        ↓
Main shell built → WEBVIEW_READY + syncSettings() sent → checkInitialKey()
        ↓
[no key]     → API key overlay shown
[key exists] → LOADING overlay → Groq init + memory loaded → VOICE_UI
```
`prefs.firstTimeDone` is written to localStorage when the user picks a companion. On subsequent launches, `buildShell()` is called directly, skipping onboarding entirely.
### Main Shell Layout
```
┌─────────────────────────────┐
│  VRM viewport (flex:1)      │ ← Three.js canvas fills this
│                             │
│  [settings ⚙] top-right     │ ← 32px circular button
│                             │
│  [toast overlays]           │ ← user/yuriko speech bubbles
└─────────────────────────────┘
│  input-pill                 │ ← [🎤] [text input] [↑]
└─────────────────────────────┘
```
Overlays (API key card, loading spinner) sit above the viewport in the same stacking context. The shell DOM is built once and never torn down — overlays are toggled with `display:none`/`display:flex`.
### Settings Panel
Right-side full-height slide-in panel. Six accordion sections:
| Section | Controls |
| --- | --- |
| Companion | Rename companion, personality dropdown (Friendly / Professional / Casual / Sarcastic), change companion button |
| Memory | Enable/disable toggle, last 8 conversation lines preview, clear button (wipes both localStorage log and compressed memory file) |
| Voice | Enable/disable toggle, speed slider (0.5×–2×), voice dropdown (Diana, Tara, Leah, Jess, Zac) |
| Appearance | Theme chips (VS Code / Light / Dark), character size chips (S / M / L), background color swatches + custom color picker |
| API / Account | API key input + save, model dropdown (Llama 3.3 70B / Llama 3.1 8B / Mixtral 8x7B), clear key button |
| About | Version, Orpheus TTS terms link, Groq console link |
All preferences persist to localStorage under `panda_*` keys and are read back on every launch. Settings that affect the Extension Host (voice name, model, companion name) are synced via the `UPDATE_SETTINGS` postMessage on load and whenever they change.
## Memory System
Panda has two complementary memory layers:
### Layer 1 — Conversation Log (localStorage)

A rolling JSON array stored in `panda_memory`. Each entry is `{ role, text, t }`.
- Max 60 entries — oldest dropped when limit is reached
- Both `USER_SAID` and `YURIKO_SAID` messages trigger `memAdd()`
- Settings → Memory shows the last 8 exchanges as a live preview
- Only written when `prefs.enableMemory` is true
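The rolling-log behaviour can be sketched like this, with storage injected so the example runs outside a webview (the extension itself writes to webview localStorage; this exact function shape is an assumption):

```javascript
// Rolling conversation log capped at 60 entries; the oldest entries are
// dropped once the cap is reached. `storage` only needs get/set.
const MAX_ENTRIES = 60;

function memAdd(storage, role, text) {
  const log = JSON.parse(storage.get('panda_memory') || '[]');
  log.push({ role, text, t: Date.now() });        // { role, text, t } entry shape
  while (log.length > MAX_ENTRIES) log.shift();   // drop oldest
  storage.set('panda_memory', JSON.stringify(log));
}
```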
### Layer 2 — Compressed Persistent Memory (disk)

After every conversation turn, a background LLM call (`llama-3.1-8b-instant`) extracts important facts and merges them into a compressed token string stored in `yuriko_memory.json` inside `globalStorageUri`.
Format: `name:venky|wake:930|school:daily|home:5pm|music:rap`
- Pipe-separated key:value pairs, max 120 characters
- New facts are merged in; existing keys are updated not duplicated
- Loaded on every init and injected into Yuriko's system prompt so she knows who you are before you say a word
- She uses memory naturally — never recites it verbatim
- Cleared when the user clicks "clear memory" in Settings (wipes both layers)
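One way to implement the merge-and-cap behaviour is sketched below. This is an assumption about the merge step: in the extension, fact extraction is done by the background LLM call, with `src/memoryManager.ts` handling persistence, so the real logic may differ:

```javascript
// Merge newly extracted facts into the pipe-separated key:value string.
// Existing keys are overwritten rather than duplicated; output is cut at
// whole pairs to respect the 120-character cap from the spec.
const MAX_LEN = 120;

function mergeMemory(existing, incoming) {
  const map = new Map();
  for (const chunk of `${existing}|${incoming}`.split('|')) {
    const i = chunk.indexOf(':');
    if (i > 0) map.set(chunk.slice(0, i), chunk.slice(i + 1)); // later values win
  }
  let out = '';
  for (const [k, v] of map) {
    const pair = (out ? '|' : '') + `${k}:${v}`;
    if (out.length + pair.length > MAX_LEN) break; // never split a pair
    out += pair;
  }
  return out;
}
```

For example, `mergeMemory('name:venky|wake:930', 'wake:800|music:rap')` returns `'name:venky|wake:800|music:rap'` — the `wake` key is updated, not duplicated.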
## Source Files
| File | What it does |
| --- | --- |
| `src/extension.ts` | Entry point — registers `PandaPanel` as a sidebar `WebviewViewProvider` and the `panda.start` command |
| `src/panel.ts` | Main orchestrator — routes all postMessages, manages the STT → LLM → TTS pipeline, owns the `isBusy` flag, downloads and caches VRM assets |
| `src/groqClient.ts` | All Groq API calls: Whisper transcription, LLM streaming, Orpheus TTS synthesis, emotion tag parsing, memory compression |
| `src/audioCapture.ts` | Mic recording via `node-record-lpcm16` → temp WAV file in `os.tmpdir()` |
| `src/secretManager.ts` | Thin wrapper around `vscode.SecretStorage` for the Groq API key |
| `src/memoryManager.ts` | Reads/writes `yuriko_memory.json` in `globalStorageUri` — persistent compressed memory across sessions |
| `media/main.js` | Webview JS — onboarding flow, shell DOM, settings panel, preferences, conversation log, VRM init, audio playback + RMS lip sync |
| `media/vrm-scene-src.js` | Three.js + @pixiv/three-vrm scene source — VRM loading, 5-layer expression engine, micro-expressions, blink, gaze, idle body motion, VRMA animation |
| `media/vrm-bundle.js` | esbuild IIFE output of `vrm-scene-src.js` — what the webview actually loads. Exposes `window.YurikoVRM` |
| `media/style.css` | All webview styles — CSS custom properties, theme overrides, onboarding animations, companion cards, settings accordion |
| `webview/index.html` | HTML shell — CSP with nonce injection, loads `vrm-bundle.js` then `main.js` |
## postMessage Protocol
| Direction | Message type | Payload | What it does |
| --- | --- | --- | --- |
| Webview → Host | `WEBVIEW_READY` | — | Shell is built and ready; triggers `checkInitialKey()` |
| Webview → Host | `SAVE_API_KEY` | `{ key }` | Save API key to SecretStorage and reconnect |
| Webview → Host | `CLEAR_API_KEY` | — | Wipe key from SecretStorage, null the client, show the API_KEY screen |
| Webview → Host | `REQUEST_VRM` | `{ companion }` | Download (if needed) and serve VRM + VRMA URIs for the companion |
| Webview → Host | `START_LISTENING` | — | Begin mic recording |
| Webview → Host | `STOP_LISTENING` | — | Stop recording, kick off STT → LLM → TTS |
| Webview → Host | `SEND_TEXT` | `{ text }` | Send typed text directly to the LLM |
| Webview → Host | `TTS_DONE` | — | Audio playback finished, release `isBusy` |
| Webview → Host | `UPDATE_SETTINGS` | `{ voiceName, model, companionName }` | Push current preferences to the Extension Host — sent on load and on every relevant settings change |
| Webview → Host | `CLEAR_MEMORY` | — | Wipe the compressed memory file and reset in-memory state |
| Host → Webview | `SHOW_SCREEN` | `{ screen }` | Navigate to API_KEY, LOADING, or VOICE_UI |
| Host → Webview | `SHOW_ERROR` | `{ message }` | Show error toast |
| Host → Webview | `LOAD_VRM` | `{ vrmUri, vrmaUri, animations }` | Local webview-safe URIs for the VRM model, intro animation, and all named animations |
| Host → Webview | `SET_STATE` | `{ state }` | Drive UI + avatar state: idle, listening, processing, speaking, error |
| Host → Webview | `USER_SAID` | `{ text }` | Show the user's transcript as a toast + write to the conversation log |
| Host → Webview | `LLM_WORD_CHUNK` | `{ word }` | Individual streamed word (reserved for future streaming UI) |
| Host → Webview | `LLM_DONE` | — | Full LLM response is complete |
| Host → Webview | `YURIKO_SAID` | `{ text, emotion }` | Show Yuriko's reply as a toast, write to the log, drive avatar emotion |
| Host → Webview | `PLAY_AUDIO` | `{ audioBase64, mimeType }` | Base64 WAV to decode and play; respects `voiceEnabled` and `voiceSpeed` prefs |
| Host → Webview | `ERROR` | `{ message }` | Inline error shown as a system toast |
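The webview → host half of this protocol boils down to a plain dispatch over `msg.type`. A minimal sketch — the handler names on `host` are hypothetical, the real routing lives in `src/panel.ts`:

```javascript
// Build a router that dispatches incoming webview messages to host-side
// handlers. Unknown message types are ignored for forward compatibility.
function createRouter(host) {
  return function route(msg) {
    switch (msg.type) {
      case 'SEND_TEXT':       return host.runPipeline(msg.text);   // bypasses STT
      case 'START_LISTENING': return host.startRecording();
      case 'STOP_LISTENING':  return host.stopAndTranscribe();
      case 'TTS_DONE':        return host.releaseBusy();
      case 'CLEAR_MEMORY':    return host.wipeMemory();
      default:                return undefined;                    // ignore unknown
    }
  };
}
```

Because messages must be serialisable JSON, each case only ever reads plain fields off `msg` — no functions or class instances cross the boundary.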
## Preferences System

All user preferences live in localStorage under `panda_*` keys. The `prefs` object in `media/main.js` provides typed getters/setters that write through immediately.
| Key | Default | What it controls |
| --- | --- | --- |
| `panda_ftd` | `'0'` | First-time-done flag (skips onboarding after first companion pick) |
| `panda_companion` | `'yuriko'` | Active companion id |
| `panda_cname` | `'Yuriko'` | Display name — synced to Extension Host via `UPDATE_SETTINGS` |
| `panda_personality` | `'friendly'` | Personality tone (UI only — future LLM prompt wiring) |
| `panda_mem_on` | `'1'` | Memory enabled toggle |
| `panda_voice_on` | `'1'` | TTS playback toggle |
| `panda_vspeed` | `'1.0'` | Playback rate for Web Audio (0.5–2) |
| `panda_vname` | `'diana'` | Orpheus voice name — synced to Extension Host via `UPDATE_SETTINGS` |
| `panda_theme` | `'vscode'` | Theme: vscode, light, or dark |
| `panda_csize` | `'medium'` | Character size: small, medium, or large |
| `panda_bg` | `''` | Custom viewport background color |
| `panda_model` | `'llama-3.3-70b-versatile'` | LLM model — synced to Extension Host via `UPDATE_SETTINGS` |
| `panda_memory` | `'[]'` | Rolling 60-entry conversation log (JSON array) |
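A minimal sketch of such a write-through accessor, with the storage backend injected so it runs outside a webview (the real `prefs` object in `media/main.js` may be shaped differently; the defaults shown are a subset of the table above):

```javascript
// Write-through preference accessor: every set() hits storage immediately,
// and get() falls back to a default when no value has been written yet.
const DEFAULTS = { panda_vname: 'diana', panda_vspeed: '1.0', panda_theme: 'vscode' };

function makePrefs(storage) {
  return {
    get(key) {
      const v = storage.get(key);
      return v === undefined || v === null ? DEFAULTS[key] : v;
    },
    set(key, value) {
      storage.set(key, String(value)); // localStorage stores strings only
    },
  };
}
```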
## Emotion System

### How it works
- The LLM system prompt instructs Yuriko to end every reply with exactly one `[emotion:X]` tag.
- `groqClient.ts` parses the tag out of the full streamed response with a regex.
- The clean text (tag stripped) goes to TTS. The emotion name goes to the webview as part of `YURIKO_SAID`.
- If the LLM omits the tag, `main.js` runs `analyzeSentiment()` — a keyword-regex fallback — over the reply text.
- The webview calls `window.YurikoVRM.setSentiment(emotionName)`, which blends the avatar's expressions toward that emotion's profile.
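A sketch of what such a keyword fallback could look like — the keyword lists below are illustrative, not the extension's actual tables:

```javascript
// Keyword-regex fallback used only when the LLM omits its [emotion:X] tag.
// Rules are checked in order; the first match wins, with 'calm' as the
// neutral default.
function analyzeSentiment(text) {
  const rules = [
    [/\b(haha|lol|love|awesome)\b/i, 'joy'],
    [/\b(sorry|apolog)/i, 'apologetic'],
    [/\b(angry|furious|hate)\b/i, 'angry'],
    [/\?\s*$/, 'question'],          // trailing question mark
  ];
  for (const [re, emotion] of rules) {
    if (re.test(text)) return emotion;
  }
  return 'calm';
}
```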
### Available emotions
| Tag | Face it drives | Typical trigger |
| --- | --- | --- |
| `joy` | Big open smile (happy 0.9) | laughing, loving something |
| `excited` | Wide eyes + huge smile | wow, can't believe it |
| `fun` | Smirk / soft smile (relaxed 0.92) | goofing, jokes |
| `smirk` | Sly, self-satisfied look | stating the obvious, smug |
| `suspicious` | Narrowed brow (angry 0.62) | judging, not buying it |
| `teasing` | Smirk + hint of surprise | playful jab, banter |
| `confident` | Composed smirk | assertive, matter-of-fact |
| `angry` | Furrowed brow (angry 0.9) | frustrated, mad |
| `sad` | Down-turned mouth (sad 0.82) | genuine sadness |
| `apologetic` | Sad + touch of surprise | sorry, can't do it |
| `empathetic` | Soft sadness + warmth | understanding pain |
| `calm` | Relaxed, composed | informational, explaining |
| `question` | Wide eyes, slightly happy | curious, wondering |
## VRM Expression Engine (5 Layers)

All expression blending happens in `media/vrm-scene-src.js` at ~60fps.
### Layer 1 — State Profile (ambient baseline)
A moderate baseline per conversation phase (idle, listening, processing, speaking, error). Each state has base slider values, oscillation frequencies, amplitude, and blend speed.
### Layer 2 — Emotion Profile (dominant)

When `setSentiment()` is called, `_emotionBlend` ramps from 0 → 1 over ~300ms (rate: 3.5/s). The result is a lerp from the state profile toward the emotion profile. It fades back at 1.5/s when cleared.
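Using the rates from the text, the per-frame blend update might look like this (a sketch only; the real code blends many expression sliders at once):

```javascript
// Advance the emotion blend factor by one frame. Ramps up at 3.5/s while an
// emotion is active (~300ms to full) and decays at 1.5/s when cleared,
// clamped to [0, 1].
function stepEmotionBlend(blend, active, dt) {
  const rate = active ? 3.5 : -1.5;
  return Math.min(1, Math.max(0, blend + rate * dt));
}

// Plain lerp from the ambient state profile toward the emotion profile.
function blendValue(stateValue, emotionValue, blend) {
  return stateValue + (emotionValue - stateValue) * blend;
}
```

At `blend = 0` the face shows the ambient state profile unchanged; at `blend = 1` the emotion profile fully dominates.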
### Layer 3 — Organic Oscillation
A two-frequency sine oscillation multiplied over both profiles so the face breathes and feels alive rather than locked.
### Layer 4 — Micro-Expressions

Brief high-intensity flickers (22ms–600ms) layered additively, capped at 1.0. Each micro-expression defines its eligible states, an optional emotion gate, a gap range, and `fadeIn` / `fadeOut` / `dur` timings.
### Layer 5 — Lip Sync Mouth Isolation

While phonemes (`aa`, `ee`, `ih`, `oh`, `ou`) are active, mouth-affecting shapes are faded to 50% so the emotion still shows in the eyes and brows but phonemes own the jaw.
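The amplitude driving the mouth is a plain RMS over a Web Audio time-domain frame. A sketch, assuming `samples` comes from `AnalyserNode.getFloatTimeDomainData()`; the `gain` constant here is an assumption, not the extension's actual value:

```javascript
// Root-mean-square amplitude of one audio frame (Float32Array of samples
// in the range [-1, 1]).
function rmsAmplitude(samples) {
  let sum = 0;
  for (let i = 0; i < samples.length; i++) {
    sum += samples[i] * samples[i];
  }
  return Math.sqrt(sum / samples.length);
}

// Map RMS to a 0–1 mouth-open value; speech RMS is small, so it is scaled
// up by an illustrative gain and clamped.
function mouthOpen(samples, gain = 6) {
  return Math.min(1, rmsAmplitude(samples) * gain);
}
```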
### Blink
Randomised two-phase blink (close / open). Next blink fires 2.5–7.5s after the last. Speed randomised per blink (50–90ms per phase).
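The scheduling maths from that description can be sketched as follows (function names are hypothetical):

```javascript
// Uniform random value in [min, max).
function randRange(min, max) {
  return min + Math.random() * (max - min);
}

// Parameters for the next blink: a 2.5–7.5s gap after the last blink, and
// a 50–90ms duration shared by each of the two phases (close, then open).
function nextBlink() {
  return {
    delayMs: randRange(2500, 7500),
    phaseMs: randRange(50, 90),
  };
}
```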
### Gaze
Per-state eye target positions smoothed with lerp at 2.5/s. In idle and speaking states the target drifts on a slow sine to simulate natural eye movement.
### Idle Body Motion
After the intro VRMA animation finishes, procedural motion drives head, neck, chest, and spine bones with sine waves at different frequencies for a breathing/swaying feel.
## Models
| Task | Model | Notes |
| --- | --- | --- |
| Speech-to-Text | `whisper-large-v3-turbo` | English only |
| Language Model | Configurable (default: `llama-3.3-70b-versatile`) | Streamed, max 150 tokens, 2-sentence replies enforced |
| Text-to-Speech | `canopylabs/orpheus-v1-english` | Voice configurable (default: `diana`), WAV output |
| Memory Compression | `llama-3.1-8b-instant` | Background call after each turn, max 80 tokens |
Selectable LLM models in Settings → API/Account:
| Option | Model ID |
| --- | --- |
| Llama 3.3 70B (default) | `llama-3.3-70b-versatile` |
| Llama 3.1 8B (fast) | `llama-3.1-8b-instant` |
| Mixtral 8x7B | `mixtral-8x7b-32768` |
## Security
- API key stored via `vscode.SecretStorage` under `panda.groqKey`. Never written to disk, settings, or environment variables.
- CSP set on the webview HTML via `webview.cspSource` and a per-session nonce. Scripts only execute with the correct nonce.
- `localResourceRoots` explicitly allows only `media/`, `webview/`, and `globalStorageUri` (asset cache) — the webview cannot access anything else on disk.
- No external fetches from the webview — all asset downloads happen in the Extension Host (Node.js) and are served to the webview as local `vscode-resource://` URIs.
- Extension Host isolation — all Groq API calls and mic access happen in Node.js, fully isolated from the webview sandbox.
- API key validation — the webview validates that keys start with `gsk_` before sending; a bad key clears itself from SecretStorage on failed init.
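The client-side part of that key validation is only a prefix sanity check; real validation happens on the first Groq call. A minimal sketch (the function name is hypothetical):

```javascript
// Cheap local sanity check before sending a key to the Extension Host.
// Only the documented gsk_ prefix is checked; anything stricter (length,
// charset) would be guessing at Groq's key format.
function looksLikeGroqKey(key) {
  if (typeof key !== 'string') return false;
  const k = key.trim();
  return k.startsWith('gsk_') && k.length > 'gsk_'.length;
}
```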
## File Structure
```
project-panda/
├── src/
│   ├── extension.ts            # VS Code entry point
│   ├── panel.ts                # Main orchestrator + message router + asset downloader
│   ├── groqClient.ts           # Groq API: STT + LLM + TTS + emotion parsing + memory compression
│   ├── audioCapture.ts         # Mic recording via SoX → temp WAV
│   ├── secretManager.ts        # VS Code SecretStorage wrapper
│   ├── memoryManager.ts        # Persistent compressed memory (globalStorageUri)
│   └── node-record-lpcm16.d.ts
├── media/
│   ├── vrm-scene-src.js        # Three.js + VRM scene (source — edit this)
│   ├── vrm-bundle.js           # esbuild IIFE output (rebuild after edits to src above)
│   ├── main.js                 # Webview UI: onboarding, shell, settings, audio, memory
│   ├── style.css               # VS Code theme-aware styles
│   └── panda-icon.svg          # Activity bar icon
├── webview/
│   └── index.html              # HTML shell with CSP nonce injection
├── out/                        # tsc output (gitignored)
├── package.json
├── tsconfig.json
└── LICENSE
```
VRM/VRMA assets are not in this repo. They are downloaded on first launch from:
https://github.com/venkateshannabathina/project-panda/releases/tag/v0
## Known Gotchas
- First launch downloads assets. On first run the extension downloads all VRM/VRMA files from GitHub Releases (~50MB total). This takes a few seconds depending on connection speed. Subsequent launches are instant — files are cached in `globalStorageUri`.
- Rebuild the bundle after editing the VRM scene. The webview loads `media/vrm-bundle.js` (esbuild output). Editing `media/vrm-scene-src.js` has no effect until you run `npm run bundle`. If `window.YurikoVRM` is undefined at runtime, the bundle is stale.
- SoX must be on PATH. If VS Code is launched from the app icon on macOS, it may not inherit your shell PATH. The extension injects `/opt/homebrew/bin` and `/usr/local/bin` automatically, but if SoX is installed elsewhere, mic input will silently fail.
- Orpheus terms must be accepted once. If TTS returns a 400 with a terms/consent message, the extension surfaces: "Accept Orpheus terms at console.groq.com first."
- Rate limits. Groq's free tier has rate limits. If a 429 is hit mid-pipeline, the error shows as "Rate limit hit, please wait a moment."
- Memory compression fires a background LLM call after every turn (using `llama-3.1-8b-instant`). This counts against your Groq rate limits but is non-blocking — it never delays the conversation.
- `retainContextWhenHidden: true` is set on the webview — the VRM scene and Web Audio context persist when the sidebar is hidden, avoiding a re-init cycle each time the panel is toggled.
- WEBVIEW_READY timing. `checkInitialKey()` is only triggered by the `WEBVIEW_READY` message (sent from `buildShell()` once the DOM is ready), not from `resolveWebviewView()`. This prevents a race where the host checks SecretStorage before the webview JS has run.
- Screen queue. If the host sends a `SHOW_SCREEN` message while onboarding is still running (before `buildShell()` completes), it is stored in `queuedScreen` and applied the moment the shell is ready.
## Animal Kingdom Series
| # | Project | Status |
| --- | --- | --- |
| Vol. 1 | Panda — Voice AI Companion | Active |
| Vol. 2 | Coming soon... | Locked |
| Vol. 3 | Coming soon... | Locked |