Claude Steward — Reduce Claude API Costs by 60–90%
100% local. Zero data sent. No subscriptions. Just cheaper AI.
Claude Steward is a free VS Code extension that silently sits between your editor and the Anthropic API, using smart caching, compression, model routing, and context compaction to dramatically cut your Claude Code token costs — without changing how you work.
Target: $5/day for a full developer workload. No API key changes. No new accounts. Install and go.
Why Claude Steward?
Claude Code bills per token. Long conversations, repeated questions, and heavy context windows add up fast. Claude Steward intercepts every request before it reaches Anthropic and applies five layers of optimization:
- Cache answers to repeated or similar questions — pay once, reuse forever
- Compress prompts by removing filler and whitespace — 5–20% fewer tokens every request
- Route simple questions to Haiku (3× cheaper) instead of always using Sonnet
- Compact long conversations — summarize old turns so you're not re-sending your entire chat history on every message
- Mask PII before anything leaves your machine — emails, API keys, phone numbers replaced with safe placeholders
Everything runs on localhost. No data is stored, transmitted, or logged anywhere outside your own machine.
Quick Start
- Install Claude Steward from the VS Code Marketplace
- Open a new Claude Code chat window (existing windows pre-date the proxy — or click "Route all windows ↺" in the dashboard)
- Start chatting — the status bar shows 🟢 when the proxy is active
No configuration needed. Savings start immediately.
How it works
Claude Code → localhost:8787 (Claude Steward) → Anthropic API
Every request passes through a 5-stage local pipeline before reaching Anthropic:
| Stage |
What it does |
Typical saving |
| PII Filter |
Detects emails, phone numbers, credit cards, API keys, IBANs. Replaces them with safe tokens before sending. Reversed in the response. |
Privacy |
| Semantic Cache |
Embeds each question and compares against a local cache (cosine ≥ 0.92). Identical or near-identical questions get the cached response instantly — no API call. |
100% on hits |
| Prompt Compression |
Strips filler phrases ("Sure!", "Of course!"), collapses whitespace, trims oversized tool results and repeated content. |
5–20% per request |
| Context Compaction |
When conversations exceed ~20 turns, Haiku summarizes older turns into a compact block. Old turns dropped, recent turns kept. Shorter context = fewer tokens on every subsequent message. |
30–60% on long sessions |
| Model Router |
Classifies request complexity locally. Simple Q&A, explanations, git output → Haiku (3× cheaper). Code writing and debugging → Sonnet. Zero added latency for recognized patterns. |
Up to 10× on simple tasks |
Features
🔒 100% Local — No Data Leaves Your Machine
- All processing happens on
localhost
- The semantic cache, conversation summaries, and traces are stored only in VS Code workspace state
- The dashboard polls
localhost:8787 — no external calls
- PII masking runs before any data is sent to Anthropic
- No analytics, no telemetry, no cloud sync
⚡ Model Router — Stop Paying Sonnet Prices for Simple Questions
The router classifies requests locally (no Haiku API call needed for common patterns):
- Short messages under 80 characters with no code-write keywords → Haiku
- "Explain", "describe", "list", "show me", "what is", "summarize" questions → Haiku
- Git/bash/terminal one-liners → Haiku
- "Write", "implement", "fix", "refactor", "create" → Sonnet (stays there)
For ambiguous requests, a background Haiku classifier fires and caches the result for next time. Zero latency on the current request.
Modes: balanced (default) · aggressive (moderate tasks also → Haiku) · conservative (no routing)
💾 Semantic Cache — Ask Once, Pay Once
- Compares each question against cached answers using cosine similarity
- Threshold: 0.92 by default (configurable)
- Cache size: 500 entries with LRU eviction
- Works across VS Code sessions
✂️ Prompt Compression
- Removes conversational filler
- Collapses blank lines and redundant whitespace
- Trims oversized tool results and duplicate file contents
- Never modifies code blocks or technical content
📝 Context Compaction — Long Conversations, Short Bills
- Triggers automatically after ~20 message turns
- Haiku summarizes older turns, extracting: variables, file paths, decisions, task state, constraints, errors resolved
- Summary injected into system prompt; old turns dropped
- Background execution: current request unaffected; compaction applied from next request
- Cached by content hash — old turns are never summarized twice
🛡️ PII Filter
- Detects: email addresses, phone numbers, credit card numbers, IBANs, API keys, SSNs
- Replaces with readable tokens:
[EMAIL_1], [PHONE_1], [API_KEY_1]
- Reversed in the response so your output reads normally
- VS Code warning notification on detection (throttled to once per 30 seconds)
📊 Live Dashboard
- Open: status bar click, or
Ctrl+Shift+P → Claude Steward: Show Dashboard
- Shows: tokens saved, cost saved (estimated), cache hit rate, model routes, per-request trace table
- Real-time updates via SSE
- Route all windows ↺: restarts the VS Code extension host so existing Claude Code windows start routing through the proxy
- Export all traces to CSV
- Proxy status: 🟢 connected / 🔴 offline
Configuration
Ctrl+, → search Claude Steward
| Setting |
Default |
Description |
claudeSteward.proxyPort |
8787 |
Local proxy port |
claudeSteward.enableCache |
true |
Semantic response cache |
claudeSteward.cacheThreshold |
0.92 |
Similarity cutoff (0–1) |
claudeSteward.enableModelRouter |
true |
Route to cheaper models |
claudeSteward.routerMode |
balanced |
balanced / aggressive / conservative |
claudeSteward.enablePromptCompression |
true |
Strip filler and whitespace |
claudeSteward.enablePiiFilter |
true |
Detect and mask PII |
Commands
Ctrl+Shift+P → type Claude Steward
| Command |
Description |
Claude Steward: Show Dashboard |
Open the live savings dashboard |
Claude Steward: Toggle Proxy ON/OFF |
Pause or resume the proxy |
Claude Steward: Clear Cache |
Wipe the semantic cache |
Cost Estimates
Savings shown in the dashboard are estimates based on Anthropic's public per-token pricing and the token delta between the original request and what was actually sent. Verify against your Anthropic console.
Troubleshooting
Dashboard shows "Proxy offline"
Proxy failed to start. Output panel → Claude Steward for errors.
All requests still going to Sonnet
Open a new Claude Code chat window — existing windows opened before the extension installed don't route through the proxy. Or use "Route all windows ↺" in the dashboard.
No data in the dashboard
Send at least one message from a new Claude Code chat window after installing.
Technical Notes
- The extension patches
https.request, globalThis.fetch, and the undici dispatcher in the VS Code extension host at startup
- Internal calls (Haiku classifier, context summarizer) use an
x-steward-internal header so they bypass the optimization pipeline and don't loop
- The proxy captures the original
https.request before patching, so forwarding to Anthropic is always direct
Contributing
Issues and PRs welcome at github.com/prashantmalan/5DD.
- One thing per PR
- No speculative abstractions
- Open an issue before fixing a bug