# ⚡ Ayu-Gravity — Token Compressor

Automatically compresses your prompts before sending, cutting token costs by 30–50% without losing meaning.

## 🧠 The Problem
Every token you send to an AI model costs money. Long, verbose prompts with filler words, redundancy, and conversational fluff can double your API bill without adding any value to the response.
Most developers don't realize that up to 40% of their prompt tokens are wasted on words the model doesn't need to understand your intent.
Ayu-Gravity solves this. It sits between you and the AI — compressing your prompts on-the-fly using a fast, cheap model before forwarding them to your primary expensive model. You type naturally. The AI receives a lean, optimized prompt. You save tokens. You save money.
## ✨ Features

| Feature | Description |
| --- | --- |
| 🗜️ Auto Compression | Prompts are compressed before sending — completely transparent to you |
| 🤖 5 AI Providers | Anthropic, OpenAI, Gemini, Ollama, and GitHub Copilot built-in |
| 🎛️ Model Selector | Choose your exact model per provider — from the cheapest to the most powerful |
| ⚡ Streaming Responses | Real-time token-by-token streaming with Markdown rendering |
| 🔒 Encrypted API Keys | All keys stored in VS Code's native SecretStorage — never in plaintext |
| 💾 Export Chat | Save any conversation as a Markdown file |
| 🔄 Auto-Retry | Exponential backoff on 429 rate limits — no crashes |
| 🎨 Native VS Code UI | Matches your theme perfectly — dark mode, light mode, everything |
| 📜 Chat History | Conversations persist across panel open/close cycles |
| 🆓 Free Options | Use GitHub Copilot (zero config) or Ollama (local, free) |
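The auto-retry behavior from the table can be sketched as follows. This is a minimal illustration of exponential backoff on 429 responses; `withRetry` and its parameters are illustrative names, not the extension's actual API:

```typescript
// Retry a call when the provider rate-limits (HTTP 429), doubling the
// wait between attempts. Illustrative sketch, not the extension's code.
async function withRetry<T>(
  call: () => Promise<T>,
  isRateLimit: (err: unknown) => boolean,
  maxAttempts = 4,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Give up on non-rate-limit errors or once attempts are exhausted.
      if (!isRateLimit(err) || attempt + 1 >= maxAttempts) throw err;
      // Backoff schedule: 500 ms, 1 s, 2 s, ...
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Because failures surface only after the retry budget is spent, a temporary rate limit never crashes the chat; the request simply completes a little later.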
## 🔌 Supported Providers

| Provider | Setup Required | Cost |
| --- | --- | --- |
| GitHub Copilot | None — auto-detected if installed | Included with Copilot subscription |
| Ollama | Run Ollama locally on port 11434 | Free (local inference) |
| Anthropic | Set API key via command palette | Pay per token |
| OpenAI | Set API key via command palette | Pay per token |
| Google Gemini | Set API key via command palette | Pay per token / free tier available |
## 🎯 Supported Models

### Anthropic

| Model | Best For |
| --- | --- |
| `claude-opus-4-20250514` | Complex reasoning, long-form analysis |
| `claude-sonnet-4-20250514` | Balanced speed and quality (default) |
| `claude-3-5-haiku-20241022` | Fast responses, cost-efficient |
### OpenAI

| Model | Best For |
| --- | --- |
| `gpt-4o` | Most capable, multimodal (default) |
| `gpt-4o-mini` | Fast and affordable |
| `o1-mini` | Reasoning-focused tasks |
### Google Gemini

| Model | Best For |
| --- | --- |
| `gemini-2.0-flash` | Ultra-fast, latest generation (default) |
| `gemini-1.5-pro` | Long context, complex tasks |
| `gemini-1.5-flash` | Speed-optimized |
### Ollama

Models are fetched dynamically from your local Ollama installation. Whatever you have pulled locally — Llama, Mistral, Phi, Qwen — appears automatically in the dropdown.
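Dynamic model discovery against a local Ollama server can be sketched like this. The `/api/tags` route is Ollama's real model-listing endpoint; `listOllamaModels` and the injectable `fetchFn` parameter are illustrative, not the extension's internals:

```typescript
// Shape of the relevant part of Ollama's GET /api/tags response.
interface OllamaTagsResponse {
  models: { name: string }[];
}

// List locally pulled model names from an Ollama server.
// fetchFn is injectable so the logic can be exercised without a live server.
async function listOllamaModels(
  baseUrl = "http://localhost:11434",
  fetchFn: typeof fetch = fetch,
): Promise<string[]> {
  const res = await fetchFn(`${baseUrl}/api/tags`);
  if (!res.ok) throw new Error(`Ollama server returned ${res.status}`);
  const body = (await res.json()) as OllamaTagsResponse;
  return body.models.map((m) => m.name);
}
```

Calling this at panel-open time is enough to keep the dropdown in sync with whatever you have pulled via `ollama pull`.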
### GitHub Copilot

Model selection is automatic via the VS Code Language Model API. No configuration needed.
## 🔧 How It Works

```
┌──────────────┐     ┌──────────────────┐     ┌──────────────┐
│  You type a  │────▶│   Ayu-Gravity    │────▶│   AI Model   │
│    prompt    │     │  compresses it   │     │   responds   │
│ (100 tokens) │     │   (65 tokens)    │     │  (streamed)  │
└──────────────┘     └──────────────────┘     └──────────────┘
```

1. **You type** your prompt naturally in the sidebar chat
2. **Token estimation** — if your prompt is ≥ 20 estimated tokens, compression kicks in
3. **Compression** — a fast, cheap model (Haiku / gpt-4o-mini / Flash) strips filler words, redundancy, and verbose phrasing in under 10 seconds
4. **Original + compressed** — you see both in the UI with a toggle
5. **Forwarding** — the compressed prompt is sent to your selected model
6. **Streaming** — the response streams back token by token with live Markdown rendering
7. **Savings** — you save 30–50% of input tokens on most prompts
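The gating step above can be sketched with a rough character-based token estimate (about 4 characters per token is a common heuristic for English text). The names here are illustrative, not the extension's internals:

```typescript
// Prompts below this estimated size are sent as-is (see the gating step above).
const MIN_TOKENS_TO_COMPRESS = 20;

// Rough token estimate: ~4 characters per token is a common heuristic.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Decide whether a prompt is long enough to be worth compressing.
function shouldCompress(prompt: string): boolean {
  return estimateTokens(prompt) >= MIN_TOKENS_TO_COMPRESS;
}
```

The estimate only needs to be accurate enough for gating; the providers' own tokenizers determine actual billing.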
## 📦 Installation

### From VS Code Marketplace

1. Open VS Code
2. Go to Extensions (`Ctrl+Shift+X`)
3. Search for "Ayu-Gravity"
4. Click Install

### From VSIX File

```shell
code --install-extension ayu-gravity-2.1.0.vsix
```
## 🔑 Setting Up API Keys

1. Open the Command Palette (`Ctrl+Shift+P`)
2. Run "Antigravity: Set API Key"
3. Select your provider (Anthropic / OpenAI / Gemini)
4. Paste your API key (input is hidden)
5. Done! The provider now appears in your dropdown

> **Note:** API keys are stored in VS Code's native SecretStorage — encrypted at rest and never written to disk in plaintext.

### No API Key Needed For:

- GitHub Copilot — works automatically if you have a Copilot subscription
- Ollama — runs locally; just have the Ollama server running on `localhost:11434`
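For the key-based providers, storage might look like the sketch below. `vscode.SecretStorage` is the real VS Code API behind this; the minimal `SecretStore` interface here mirrors its `store`/`get` methods so the logic can be shown outside the extension host, and the secret names are assumptions:

```typescript
// Minimal mirror of the vscode.SecretStorage methods used here,
// so the logic is runnable outside an extension host.
interface SecretStore {
  store(key: string, value: string): Promise<void>;
  get(key: string): Promise<string | undefined>;
}

// One namespaced secret per provider, e.g. "ayuGravity.apiKey.anthropic"
// (the naming scheme is illustrative, not the extension's actual keys).
async function saveApiKey(secrets: SecretStore, provider: string, apiKey: string): Promise<void> {
  await secrets.store(`ayuGravity.apiKey.${provider}`, apiKey);
}

async function loadApiKey(secrets: SecretStore, provider: string): Promise<string | undefined> {
  return secrets.get(`ayuGravity.apiKey.${provider}`);
}
```

In a real extension, `context.secrets` (provided on activation) would be passed where `SecretStore` is expected, and the OS keychain handles encryption at rest.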
## 🗜️ Compression Modes

| Mode | Reduction Target | Best For |
| --- | --- | --- |
| ⚡ Balanced | ~35% | Everyday use — good savings, full meaning preserved |
| 🔥 Aggressive | ~50% | Maximum savings — strips everything non-essential |
| 💡 Light | ~20% | Minimal compression — when precision matters |

> **Tip:** Compression is only applied to prompts with 20+ estimated tokens. Short prompts are sent as-is.
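One plausible way a mode could translate into the instruction given to the compressor model is sketched below. The reduction targets come from the table above; the type name, mapping, and prompt wording are assumptions, not the extension's actual prompt:

```typescript
// Modes from the table above (names lowercased for use as identifiers).
type CompressionMode = "light" | "balanced" | "aggressive";

// Approximate reduction targets per mode, as fractions.
const REDUCTION_TARGET: Record<CompressionMode, number> = {
  light: 0.2,
  balanced: 0.35,
  aggressive: 0.5,
};

// Build the instruction sent to the cheap compressor model.
// The wording here is a hypothetical example, not the shipped prompt.
function compressionInstruction(mode: CompressionMode): string {
  const pct = Math.round(REDUCTION_TARGET[mode] * 100);
  return `Rewrite the user's prompt about ${pct}% shorter. ` +
    `Remove filler and redundancy; preserve all meaning, names, and constraints.`;
}
```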
## 📸 Screenshots

Coming soon — screenshots of the sidebar panel, model selector, and compression in action.
## ❓ FAQ

**Q: Does compression affect response quality?**
A: No. Compression removes filler words and redundancy — not meaning. LLMs generally perform as well or better with concise, clear prompts.

**Q: What model is used for compression?**
A: The cheapest, fastest model for each provider: `claude-3-5-haiku` for Anthropic, `gpt-4o-mini` for OpenAI, `gemini-1.5-flash` for Gemini. Compression cost is negligible.

**Q: Can I see what was compressed?**
A: Yes! Every compressed message has a "View compressed prompt" toggle in the UI that shows you exactly what was sent.

**Q: Does it work offline?**
A: Only with Ollama (local models). All other providers require internet access.

**Q: Are my API keys safe?**
A: Yes. All keys are stored in VS Code's native encrypted SecretStorage. They are never logged, never written to plaintext files, and never sent anywhere except the provider's own API.

**Q: What happens if compression fails?**
A: The original prompt is sent as-is. Compression has a 10-second timeout and a graceful fallback — it never blocks your message.

**Q: Can I choose which model to chat with?**
A: Yes! The model selector dropdown lets you pick any available model for your selected provider.
## 📄 License

MIT License — free for personal and commercial use.