Trimli — AI Token Optimizer

trimliai
Cut your AI bill by 40% — works with Claude Code, Cursor, Continue & Cline
Installation
Launch VS Code Quick Open (Ctrl+P), paste the following command, and press enter.

AI Token Optimizer

Cut your AI coding costs by an average of 40% — up to 60% on long sessions and agentic workflows. Works silently across Cursor, Continue, Cline, Claude Code, and any OpenAI-compatible tool.

One base-URL change in your AI tool. Your prompts are never visibly modified. Just lower bills.


The problem this solves

AI coding tools are expensive at scale. A typical developer sending 100 requests a day to GPT-4o spends $8–15/month just on input tokens — most of it wasted on repeated context, verbose history, and filler the model doesn't need.

tok-optimizer sits between your tool and the API. It strips the waste, keeps the signal, and forwards a leaner prompt. The model never knows. Your bill does.

Before:  2,840 tokens  →  $0.0071 per request
After:   1,690 tokens  →  $0.0042 per request
Saving:  1,150 tokens  →  $0.0029 saved  (40% reduction)

Across 100 requests a day that's about $8.70/month back in your pocket at the conservative estimate. On longer agentic sessions the number is higher.
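
A quick sanity check on the arithmetic above, assuming GPT-4o input pricing of $2.50 per million tokens (`monthly_saving` is an illustrative helper, not part of the extension):

```python
# GPT-4o input pricing assumption: $2.50 per million tokens.
PRICE_PER_TOKEN = 2.50 / 1_000_000

def monthly_saving(tokens_saved_per_request: int,
                   requests_per_day: int = 100,
                   days: int = 30) -> float:
    """Dollars saved per month from trimming input tokens."""
    return tokens_saved_per_request * PRICE_PER_TOKEN * requests_per_day * days

# 1,150 tokens saved per request at 100 requests/day comes to $8.625/month --
# the "about $8.70" figure when using the rounded $0.0029 per request.
savings = monthly_saving(1150)
```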


Setup in 60 seconds

1. Install the extension. A local proxy starts automatically on http://localhost:8765.

2. Point your AI tool at the proxy:

| Tool | Setting |
| --- | --- |
| Cursor | Settings → AI → Base URL → `http://localhost:8765` |
| Continue | `config.json` → `"apiBase": "http://localhost:8765"` |
| Cline | Settings → API Base URL → `http://localhost:8765` |
| Claude Code | Open terminal in VS Code → run `claude` — works automatically |
| Any OpenAI-compatible tool | `OPENAI_BASE_URL=http://localhost:8765` |
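
For tools that only read environment variables, the same redirection can be done from a shell profile or, as a minimal Python sketch, before launching a child process (the proxy must already be running for requests to succeed):

```python
import os

# Route any OpenAI- or Anthropic-compatible SDK/CLI launched from this
# process through the local proxy (assumes it is listening on :8765).
os.environ["OPENAI_BASE_URL"] = "http://localhost:8765"
os.environ["ANTHROPIC_BASE_URL"] = "http://localhost:8765"
```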

3. Code normally. The optimizer runs silently on every request.

Watch the status bar update in real time: ⚡ 1,240 tkns saved $0.004


How much will you actually save?

Savings depend on how you work. Here's what to expect across common workflows (response accuracy was verified separately by our 59-test benchmark suite):

| Workflow | Typical session | Long session / agentic |
| --- | --- | --- |
| Short single-turn queries | 5–15% | — |
| Multi-turn chat sessions | 25–45% | 45–55% |
| Code review with long context | 30–50% | 50–60% |
| Agentic sessions (tool calls) | 35–55% | 55–65% |
| Long debugging sessions | 40–55% | 55–65% |
| Average across all workflows | ~40% | ~60% |

The more context your session accumulates — conversation history, tool results, repeated code blocks — the more the optimizer saves. Short queries get modest savings. Long agentic sessions routinely hit 55–65%.

Accuracy guarantee: 59 tests across factual queries, code generation, multi-turn reasoning, and tool call round-trips — zero quality degradation detected at any compression level.


Real savings by model

At 100 requests/day using the conservative 40% estimate:

| Model | Without optimizer | With optimizer (40%) | With optimizer (60%) |
| --- | --- | --- | --- |
| GPT-4o ($2.50/M) | ~$15/mo | ~$9/mo ($6 saved) | ~$6/mo ($9 saved) |
| GPT-4.1 ($2.00/M) | ~$12/mo | ~$7.20/mo ($4.80 saved) | ~$4.80/mo ($7.20 saved) |
| Claude Sonnet ($3.00/M) | ~$18/mo | ~$10.80/mo ($7.20 saved) | ~$7.20/mo ($10.80 saved) |
| Claude Opus ($15.00/M) | ~$90/mo | ~$54/mo ($36 saved) | ~$36/mo ($54 saved) |

For a 10-person team on Claude Sonnet that's $72–$108/month saved; on Claude Opus, where each seat saves $36–$54/month, the optimizer pays for Pro many times over.


The six optimization strategies

Strategies run in cascade, cheapest first. Each one only runs if the previous ones haven't hit the token budget yet — so you always get the minimum compression needed, never more than necessary.
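
The cascade can be pictured as a simple loop. This is a hypothetical sketch, not the extension's actual code; `count_tokens` and the strategy callables are stand-ins:

```python
from typing import Callable

def cascade(prompt: str,
            budget_tokens: int,
            strategies: list[Callable[[str], str]],
            count_tokens: Callable[[str], int]) -> str:
    """Apply strategies cheapest-first, stopping once under budget."""
    for strategy in strategies:
        if count_tokens(prompt) <= budget_tokens:
            break  # budget hit -> remaining (more aggressive) strategies skipped
        prompt = strategy(prompt)
    return prompt
```

Ordering strategies from lossless to lossy means a prompt that is already under budget is never touched by the aggressive ones.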

1. Whitespace Normalize — 3–8% savings

Lossless. Always runs first.

Collapses redundant whitespace, repeated blank lines, trailing spaces, and excessive punctuation. No meaning is lost — just noise removed. On a typical 2,000-token prompt this recovers 60–160 tokens at zero risk.

Before: "Hello   world\n\n\n\nPlease   help!!!"
After:  "Hello world\n\nPlease help!"
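
The rules above can be approximated with a few regular expressions. A sketch of the idea, not the extension's exact rule set:

```python
import re

def normalize_whitespace(text: str) -> str:
    """Lossless cleanup sketch: collapse space runs, cap blank lines,
    strip trailing spaces, squash repeated punctuation."""
    text = re.sub(r"[ \t]+", " ", text)           # runs of spaces/tabs -> one space
    text = re.sub(r" +$", "", text, flags=re.M)   # trailing spaces on each line
    text = re.sub(r"\n{3,}", "\n\n", text)        # 3+ newlines -> one blank line
    text = re.sub(r"([!?.])\1{1,}", r"\1", text)  # "!!!" -> "!", "..." -> "."
    return text

# normalize_whitespace("Hello   world\n\n\n\nPlease   help!!!")
# -> "Hello world\n\nPlease help!"
```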

2. Deduplicate — 5–20% savings

Removes repeated sentences across the conversation.

In multi-turn sessions, the same context often appears verbatim in multiple messages — requirements restated, errors repeated, instructions echoed back. This strategy fingerprints every sentence and removes duplicates after the first occurrence. System messages are never touched.

Before: [turn 3 repeats the same requirement from turn 1 verbatim]
After:  [duplicate removed — the model already has it from turn 1]
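
A minimal sketch of the fingerprinting idea (illustrative only; the extension's real sentence splitting and hashing rules may differ):

```python
import hashlib
import re

def dedupe_sentences(messages: list[dict]) -> list[dict]:
    """Drop sentences already seen earlier in the conversation.
    System messages pass through untouched."""
    seen: set[str] = set()
    out = []
    for msg in messages:
        if msg["role"] == "system":
            out.append(msg)
            continue
        kept = []
        for sentence in re.split(r"(?<=[.!?])\s+", msg["content"]):
            # Normalised hash so trivial case/spacing changes still match.
            fingerprint = hashlib.sha1(sentence.strip().lower().encode()).hexdigest()
            if fingerprint not in seen:
                seen.add(fingerprint)
                kept.append(sentence)
        out.append({**msg, "content": " ".join(kept)})
    return out
```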

3. Intent Distill — 10–30% savings

Strips filler from user queries.

Developers often write conversationally: "I was wondering if you could please help me understand..." The model doesn't need the preamble — it needs the intent. This strategy removes opener hedges, politeness wrappers, sign-offs, and meta-commentary while preserving the actual request.

Before: "Hi! I was wondering if you could please help me fix this
         bug. I've been struggling with it for a while. Thanks!"
After:  "Fix this bug."
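
The filler-stripping idea can be sketched with a small pattern list. These patterns are illustrative; the real strategy's lists are certainly more extensive:

```python
import re

# Toy filler patterns: greetings, hedged openers, politeness, sign-offs.
FILLER_PATTERNS = [
    r"^(hi|hey|hello)[,!. ]+",
    r"\bi was wondering if you could\s+",
    r"\bplease\s+",
    r"\bthanks[!. ]*$",
]

def distill(query: str) -> str:
    for pattern in FILLER_PATTERNS:
        query = re.sub(pattern, "", query, flags=re.IGNORECASE)
    return query.strip()

# distill("Hi! I was wondering if you could please help me fix this bug. Thanks!")
# -> "help me fix this bug."
```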

4. Reference Substitute — 10–25% savings

Aliases repeated long strings.

Long file paths, URLs, repeated code snippets, and verbose type names that appear multiple times across a session are replaced with short aliases (§REF1, §REF2). An alias legend is prepended to the system message so the model retains full context.

Before: "/workspace/src/components/auth/AuthenticationProvider.tsx"
        appearing 8 times across the conversation
After:  "§REF1" × 8 + legend prepended once
Net:    7 × 52 chars = 364 chars ≈ 91 tokens saved
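
A toy version of the aliasing pass (the `substitute_refs` helper and its thresholds are assumptions for illustration):

```python
import re
from collections import Counter

def substitute_refs(text: str, min_len: int = 20, min_count: int = 3):
    """Alias long, repeated whitespace-delimited tokens (paths, URLs) with
    §REFn; return (rewritten_text, legend) so the legend can be prepended
    to the system message."""
    candidates = [t for t in re.findall(r"\S+", text) if len(t) >= min_len]
    legend: dict[str, str] = {}
    for token, count in Counter(candidates).items():
        if count >= min_count:
            alias = f"§REF{len(legend) + 1}"
            legend[alias] = token
            text = text.replace(token, alias)
    return text, legend
```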

5. History Summarize — 30–50% savings

Compresses old conversation turns into a bullet-point summary.

After ~10 turns, early messages are low-signal relative to the current context. This strategy replaces old turns with a compact summary — extracting decisions made, requirements stated, and tech choices confirmed — while keeping the last 6 messages verbatim. No LLM call required: extraction runs locally using pattern matching.

Before: 14 full conversation turns = 4,200 tokens
After:  Summary block + last 6 turns verbatim = 1,800 tokens
Saving: 2,400 tokens (57%)
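
The keep-the-tail-verbatim shape can be sketched as follows; the decision/requirement patterns here are toy stand-ins for the real local extraction rules:

```python
import re

def summarize_history(messages: list[dict], keep_last: int = 6) -> list[dict]:
    """Replace old turns with a bullet summary; keep the tail verbatim."""
    if len(messages) <= keep_last:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    # Toy signal: sentences that look like decisions or requirements.
    signal = re.compile(r"\b(must|should|use|decided|require)\b", re.I)
    bullets = []
    for msg in old:
        for sentence in re.split(r"(?<=[.!?])\s+", msg["content"]):
            if signal.search(sentence):
                bullets.append(f"- {sentence.strip()}")
    summary = {"role": "system",
               "content": "Earlier context:\n" + "\n".join(bullets)}
    return [summary] + recent
```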

6. Context Prune — 20–40% savings

Drops low-relevance messages when still over budget.

When the conversation is still over the token budget after all other strategies, every message is scored against the current query using keyword overlap and recency weighting. Low-scoring messages are dropped. System messages and the last 4 messages are always kept regardless of score.

Token budget: 8,000
Before pruning: 11,200 tokens
After pruning:  7,800 tokens
Dropped: 6 low-relevance messages from earlier in the session
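
A sketch of scoring by keyword overlap plus recency weighting (the weights and helper names are illustrative assumptions, not the extension's actual values):

```python
def prune_context(messages, query, budget, count_tokens, keep_last=4):
    """Drop the lowest-scoring droppable messages until under budget.
    System messages and the last `keep_last` messages are always kept."""
    protected = {i for i, m in enumerate(messages)
                 if m["role"] == "system" or i >= len(messages) - keep_last}
    qwords = set(query.lower().split())

    def score(i):
        words = set(messages[i]["content"].lower().split())
        overlap = len(qwords & words) / max(len(qwords), 1)
        recency = i / max(len(messages) - 1, 1)   # later messages score higher
        return overlap + 0.5 * recency

    droppable = sorted((i for i in range(len(messages)) if i not in protected),
                       key=score)                 # lowest relevance first
    kept = set(range(len(messages)))
    for i in droppable:
        if sum(count_tokens(messages[j]["content"]) for j in kept) <= budget:
            break
        kept.remove(i)
    return [messages[i] for i in sorted(kept)]
```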

Bonus: LLMLingua (Pro) — additional 20–40% on top of heuristics

Pro tier unlocks Microsoft's LLMLingua-2 semantic compression model. Unlike the heuristic strategies above, LLMLingua understands meaning — it compresses at the word and phrase level, keeping semantically important tokens and dropping redundant ones. This layer runs after all six heuristics.

Heuristics alone:             average 40%,  up to 55%
Heuristics + LLMLingua (Pro): average 55%,  up to 65%

Safety guards — what the optimizer will never do

  • System messages are never compressed — they contain critical instructions
  • Messages under 200 characters are skipped — too short to compress safely
  • Over-compression protection — any result removing more than 70% of characters is rejected and the original is used instead
  • API keys are forwarded transparently and never stored
  • Responses are never modified — only the input prompt is compressed
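
The over-compression guard in particular is easy to picture — a hedged sketch assuming the 70% threshold described above:

```python
def guarded(original: str, compressed: str, max_removal: float = 0.70) -> str:
    """Reject over-aggressive results: if compression removed more than
    70% of the characters, fall back to the original."""
    if not original:
        return original
    removed = 1 - len(compressed) / len(original)
    return original if removed > max_removal else compressed
```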

Dashboard

Click the ⚡ status bar item to open the savings dashboard:

  • Lifetime stats — total tokens saved, estimated cost saved, average compression ratio
  • Per-request history — every request with raw vs optimized tokens, strategies applied, cost delta, and LLMLingua (Python service) savings
  • Python service status — green dot when LLMLingua is active (Pro)

Web dashboard: log in at tok-optimizer-enterprise.vercel.app with your licence key to see full analytics, 30-day charts, and strategy breakdown.


Commands

| Command | What it does |
| --- | --- |
| AI Token Optimizer: Show Dashboard | Open the savings dashboard panel |
| AI Token Optimizer: Toggle On/Off | Pause or resume optimization |
| AI Token Optimizer: Toggle Forward Proxy | Auto-inject proxy into VS Code terminal sessions (Claude Code) |
| AI Token Optimizer: Optimize Current Context | Manually optimize selected text |
| AI Token Optimizer: Reset Session | Clear session stats and start fresh |

Settings

| Setting | Default | Description |
| --- | --- | --- |
| `tokOptimizer.enabled` | `true` | Enable/disable optimization globally |
| `tokOptimizer.model` | `gpt-4o` | Primary model (for token counting and cost estimates) |
| `tokOptimizer.tokenBudget` | `8000` | Max input tokens before context pruning activates |
| `tokOptimizer.aggressiveness` | `0.5` | Compression aggressiveness: 0 = conservative, 1 = maximum |
| `tokOptimizer.strategies` | all 6 | Which strategies to apply (remove any to disable) |
| `tokOptimizer.pythonServiceUrl` | `http://localhost:8766` | LLMLingua service URL (Pro — local Docker or hosted) |
| `tokOptimizer.licenceKey` | (empty) | Your licence key (auto-generated on first install) |

Tiers

| | Free | Pro ($15/mo) | Enterprise ($30/seat/mo) |
| --- | --- | --- | --- |
| Strategies | All 6 | All 6 | All 6 + org shared pools |
| LLMLingua compression | ✓ Included | ✓ Included | ✓ Hosted + self-hostable |
| Average savings | ~40% | ~40% | ~40–60% |
| Daily savings cap | 200K tokens | Unlimited | Unlimited |
| Analytics | VS Code only | Full web portal | Org-level + audit logs + CFO report |
| SSO (Okta / Azure AD) | — | — | ✓ |
| On-premise deployment | — | — | ✓ Docker + Helm |
| Support | Community | Email | Priority + SLA |

No account required on the free tier. A licence key is created automatically when you install. Upgrade at tok-optimizer-enterprise.vercel.app.


FAQ

Does it store my prompts? No. The proxy optimizes in-flight and immediately discards the messages. Nothing is logged, cached, or sent anywhere except directly to the upstream AI API. Your API key is forwarded transparently and never stored.

Will it change the quality of AI responses? No. Tested across 59 accuracy benchmarks covering factual queries, code generation, multi-turn reasoning, and tool calls — zero quality degradation detected. Three safety guards prevent over-compression: system messages are never touched, short messages under 200 characters are skipped, and any result removing more than 70% of characters is rejected and the original used instead.

Does it work with streaming? Yes. Streaming responses (stream: true) pass through unchanged. Only the input prompt is compressed.

Does it work offline? Yes. The proxy and all six heuristic strategies run entirely on your machine. The optional LLMLingua service (Pro) also runs locally via docker compose up in the tok-optimizer-python/ directory.

Why does it save more on longer sessions? The most powerful strategies — History Summarize and Context Prune — only activate when sessions accumulate enough history. A 3-turn conversation gets whitespace and deduplication only (~10–15%). A 20-turn agentic session triggers all six strategies plus LLMLingua, routinely hitting 55–65%.

What if I use multiple AI tools? Point all of them at http://localhost:8765. The proxy handles OpenAI and Anthropic API formats simultaneously and routes each request to the correct upstream endpoint.

Does it work with Claude Code? Yes — enable the forward proxy (Command Palette → AI Token Optimizer: Toggle Forward Proxy), then launch Claude Code from the VS Code terminal. It picks up ANTHROPIC_BASE_URL automatically.

How do I verify it's working?

curl http://localhost:8765/health   # → {"status":"ok","version":"0.1.1"}
curl http://localhost:8765/stats    # → lifetime savings totals

Or just watch the ⚡ status bar — it updates live after every request.


License

Business Source License 1.1 — free to use for individuals and teams. You may not offer a competing token optimization SaaS. Converts to Apache 2.0 on 2030-04-11.
