# AI Token Optimizer
No config changes to your AI tools. No prompts modified visibly. Just lower bills.

## The problem this solves

AI coding tools are expensive at scale. A typical developer sending 100 requests a day to GPT-4o spends $8–15/month just on input tokens — most of it wasted on repeated context, verbose history, and filler the model doesn't need. tok-optimizer sits between your tool and the API. It strips the waste, keeps the signal, and forwards a leaner prompt. The model never knows. Your bill does.
Across 100 requests a day that's $87/month back in your pocket at the conservative estimate. On longer agentic sessions the number is higher.

## Setup in 60 seconds

1. Install the extension. A local proxy starts automatically on
2. Point your AI tool at the proxy:
3. Code normally. The optimizer runs silently on every request. Watch the status bar update in real time:

## How much will you actually save?

Savings depend on how you work. Here's what to expect across common workflows, based on our 59-test accuracy benchmark suite:
The more context your session accumulates — conversation history, tool results, repeated code blocks — the more the optimizer saves. Short queries get modest savings. Long agentic sessions routinely hit 55–65%.

Accuracy guarantee: 59 tests across factual queries, code generation, multi-turn reasoning, and tool call round-trips — zero quality degradation detected at any compression level.

## Real savings by model

At 100 requests/day using the conservative 40% estimate:
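The arithmetic behind estimates like these is straightforward. Here's a minimal sketch of the calculation — the token count and per-million-token price below are illustrative assumptions for the example, not figures published by the extension:

```python
def monthly_savings(requests_per_day: float, avg_input_tokens: float,
                    price_per_mtok: float, compression: float = 0.40) -> float:
    """Estimated monthly dollar savings on input tokens alone."""
    monthly_tokens = requests_per_day * 30 * avg_input_tokens
    return monthly_tokens / 1_000_000 * price_per_mtok * compression

# Illustrative: 100 requests/day, 6,000 input tokens per request,
# an assumed $3.00 per million input tokens, 40% compression.
print(round(monthly_savings(100, 6_000, 3.00), 2))  # → 21.6 (dollars/month)
```

Plug in your own request volume and your model's actual input-token price to estimate your savings.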
For a 10-person team on Claude Sonnet: $72–$108/month saved. The optimizer pays for itself many times over on Pro.

## The six optimization strategies

Strategies run in cascade, cheapest first. Each one only runs if the previous ones haven't hit the token budget yet — so you always get the minimum compression needed, never more than necessary.

### 1. Whitespace Normalize — 3–8% savings

Lossless. Always runs first. Collapses redundant whitespace, repeated blank lines, trailing spaces, and excessive punctuation. No meaning is lost — just noise removed. On a typical 2,000-token prompt this recovers 60–160 tokens at zero risk.
### 2. Deduplicate — 5–20% savings

Removes repeated sentences across the conversation. In multi-turn sessions, the same context often appears verbatim in multiple messages — requirements restated, errors repeated, instructions echoed back. This strategy fingerprints every sentence and removes duplicates after the first occurrence. System messages are never touched.
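The fingerprint-and-drop idea can be sketched as follows — the sentence splitter and fingerprint choice here are illustrative assumptions, not the extension's actual implementation:

```python
import hashlib
import re

def dedupe_sentences(messages: list[dict]) -> list[dict]:
    """Drop sentences already seen earlier in the conversation (sketch).
    System messages pass through untouched, mirroring the guard above."""
    seen: set[str] = set()
    out = []
    for msg in messages:
        if msg["role"] == "system":
            out.append(msg)
            continue
        kept = []
        # Naive sentence split; a real implementation would be more careful.
        for sentence in re.split(r"(?<=[.!?])\s+", msg["content"]):
            # Fingerprint on lowercased, whitespace-normalized text.
            norm = " ".join(sentence.lower().split())
            fp = hashlib.sha1(norm.encode()).hexdigest()
            if fp not in seen:
                seen.add(fp)
                kept.append(sentence)
        out.append({**msg, "content": " ".join(kept)})
    return out
```

Fingerprinting on normalized text means "The build fails." and "the  build fails." count as the same sentence.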
### 3. Intent Distill — 10–30% savings

Strips filler from user queries. Developers often write conversationally: "I was wondering if you could please help me understand..." The model doesn't need the preamble — it needs the intent. This strategy removes opener hedges, politeness wrappers, sign-offs, and meta-commentary while preserving the actual request.
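In sketch form, this is a list of filler patterns applied to the start and end of a query. The patterns below are hypothetical examples — the extension's real list is certainly longer and more careful:

```python
import re

# Hypothetical filler patterns; illustrative only.
_FILLER = [
    r"^(hi|hello|hey)[,!. ]+",
    r"^i was wondering if you could (please )?",
    r"^(could|can|would) you (please )?",
    r"^please ",
    r"\bthanks( in advance)?[.! ]*$",
]

def distill_intent(query: str) -> str:
    """Strip conversational filler while keeping the request itself (sketch)."""
    q = query.strip()
    for pattern in _FILLER:
        q = re.sub(pattern, "", q, flags=re.IGNORECASE)
    return q.strip()
```

The point of anchoring patterns to the start and end of the query is that the middle — the actual request — is never rewritten.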
### 4. Reference Substitute — 10–25% savings

Aliases repeated long strings. Long file paths, URLs, repeated code snippets, and verbose type names that appear multiple times across a session are replaced with short aliases (
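A minimal sketch of the aliasing idea — the thresholds, alias format (`@R0`), and legend placement here are illustrative assumptions, not the extension's actual scheme:

```python
import re
from collections import Counter

def substitute_references(text: str, min_len: int = 30, min_count: int = 3) -> str:
    """Alias long repeated strings, prepending a small legend (sketch)."""
    # Candidate tokens: long unbroken runs such as paths, URLs, type names.
    candidates = Counter(re.findall(rf"\S{{{min_len},}}", text))
    repeated = [t for t, c in candidates.items() if c >= min_count]
    legend = {}
    for i, token in enumerate(repeated):
        alias = f"@R{i}"            # hypothetical alias format
        legend[alias] = token
        text = text.replace(token, alias)
    if legend:
        header = "\n".join(f"{a} = {t}" for a, t in legend.items())
        text = header + "\n" + text
    return text
```

The substitution only pays off when a string is both long and repeated: the legend line is paid once, and every later occurrence shrinks to a few characters.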
### 5. History Summarize — 30–50% savings

Compresses old conversation turns into a bullet-point summary. After ~10 turns, early messages are low-signal relative to the current context. This strategy replaces old turns with a compact summary — extracting decisions made, requirements stated, and tech choices confirmed — while keeping the last 6 messages verbatim. No LLM call required: extraction runs locally using pattern matching.
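The local pattern-matching extraction might look like this in sketch form — the cue patterns and summary framing are illustrative assumptions; only "keep the last 6 verbatim, summarize the rest, no LLM call" comes from the description above:

```python
import re

# Hypothetical cue patterns for decisions, requirements, and tech choices.
_CUES = re.compile(r"(we (?:will|should) use|must|decided to|requirement)", re.I)

def summarize_history(messages: list[dict], keep_last: int = 6) -> list[dict]:
    """Replace old turns with a bullet summary; keep recent turns verbatim (sketch)."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= keep_last:
        return messages
    old, recent = rest[:-keep_last], rest[-keep_last:]
    bullets = []
    for msg in old:
        for sentence in re.split(r"(?<=[.!?])\s+", msg["content"]):
            if _CUES.search(sentence):  # keep only high-signal sentences
                bullets.append("- " + sentence.strip())
    summary = {"role": "user",
               "content": "Earlier context (summary):\n" + "\n".join(bullets)}
    return system + [summary] + recent
```

Because extraction is pure pattern matching, it adds no latency and no extra API cost — the trade-off is that it only catches sentences matching its cue list.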
### 6. Context Prune — 20–40% savings

Drops low-relevance messages when still over budget. When the conversation is still over the token budget after all other strategies, every message is scored against the current query using keyword overlap and recency weighting. Low-scoring messages are dropped. System messages and the last 4 messages are always kept regardless of score.
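Scoring by keyword overlap plus recency can be sketched as below — the scoring weights and the message-count budget (the real budget is in tokens) are illustrative assumptions; the protected set mirrors the guard stated above:

```python
import re

def prune_context(messages: list[dict], query: str,
                  budget_msgs: int, keep_last: int = 4) -> list[dict]:
    """Score messages by query-keyword overlap plus recency, then drop the
    lowest scorers until under budget (sketch). System messages and the
    last `keep_last` messages are always kept."""
    def words(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    q = words(query)
    n = len(messages)
    protected = {i for i, m in enumerate(messages) if m["role"] == "system"}
    protected |= set(range(max(0, n - keep_last), n))
    scored = []
    for i, m in enumerate(messages):
        if i in protected:
            continue
        overlap = len(q & words(m["content"])) / (len(q) or 1)
        recency = i / n                     # later messages score higher
        scored.append((overlap + 0.5 * recency, i))
    keep = set(protected)
    # Add the best-scoring prunable messages until the budget is filled.
    for _, i in sorted(scored, reverse=True):
        if len(keep) >= budget_msgs:
            break
        keep.add(i)
    return [m for i, m in enumerate(messages) if i in keep]
```

Protecting system messages and the most recent turns up front is what makes aggressive pruning safe: only the middle of a long session is ever eligible to drop.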
### Bonus: LLMLingua (Pro) — additional 20–40% on top of heuristics

Pro tier unlocks Microsoft's LLMLingua-2 semantic compression model. Unlike the heuristic strategies above, LLMLingua understands meaning — it compresses at the word and phrase level, keeping semantically important tokens and dropping redundant ones. This layer runs after all six heuristics.
## Safety guards — what the optimizer will never do
## Dashboard

Click the
Web dashboard: log in at tok-optimizer-enterprise.vercel.app with your licence key to see full analytics, 30-day charts, and strategy breakdown.

## Commands
## Settings
## Tiers
No account required on the free tier. A licence key is created automatically when you install. Upgrade at tok-optimizer-enterprise.vercel.app.

## FAQ

**Does it store my prompts?** No. The proxy optimizes in-flight and immediately discards the messages. Nothing is logged, cached, or sent anywhere except directly to the upstream AI API. Your API key is forwarded transparently and never stored.

**Will it change the quality of AI responses?** No. Tested across 59 accuracy benchmarks covering factual queries, code generation, multi-turn reasoning, and tool calls — zero quality degradation detected. Three safety guards prevent over-compression: system messages are never touched, short messages under 200 characters are skipped, and any result removing more than 70% of characters is rejected and the original used instead.

**Does it work with streaming?**
Yes. Streaming responses (

**Does it work offline?**
Yes. The proxy and all six heuristic strategies run entirely on your machine. The optional LLMLingua service (Pro) also runs locally via

**Why does it save more on longer sessions?** The most powerful strategies — History Summarize and Context Prune — only activate when sessions accumulate enough history. A 3-turn conversation gets whitespace and deduplication only (~10–15%). A 20-turn agentic session triggers all six strategies plus LLMLingua, routinely hitting 55–65%.

**What if I use multiple AI tools?**
Point all of them at

**Does it work with Claude Code?**
Yes — enable the forward proxy (Command Palette →

**How do I verify it's working?**
Or just watch the

## License

Business Source License 1.1 — free to use for individuals and teams. You may not offer a competing token optimization SaaS. Converts to Apache 2.0 on 2030-04-11.