A VS Code extension that connects your editor to self‑hosted or local LLMs via any OpenAI‑compatible server (vLLM, Ollama, TGI, llama.cpp, LocalAI, etc.). Keep source code on your infrastructure while using AI for coding, refactoring, analysis, and more.
## ✨ Highlights
- Works with any OpenAI Chat Completions–compatible endpoint
- Function calling tools with optional parallel execution
- Safe token budgeting based on the model's context window
- Built‑in retries with exponential backoff and detailed logging
- Model list caching for fewer network calls
- API keys stored securely in VS Code SecretStorage
- Status bar health monitor with quick actions
- Server presets for quick switching between endpoints
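The retry behavior can be pictured with a minimal sketch (a hypothetical helper, not the extension's actual code): a failed request is retried after a delay that doubles with each attempt.

```typescript
// Hypothetical illustration of retry with exponential backoff.
// Not the extension's real implementation; names are illustrative.
async function withRetries<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Delay doubles each attempt: 500 ms, 1000 ms, 2000 ms, ...
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```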
### vLLM example

Useful `vllm serve` flags:

- `--enable-prefix-caching`: enable prefix/KV caching for repeated prompts
- `--async-scheduling`: schedule requests asynchronously for better throughput
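These flags can be passed when launching a vLLM server. A sketch (the model ID is a placeholder; substitute your own):

```shell
# Hypothetical launch command; replace Qwen/Qwen3-8B with your model.
vllm serve Qwen/Qwen3-8B \
  --enable-prefix-caching \
  --async-scheduling
```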
### Ollama example

```shell
ollama run qwen3:8b
```
### Configure the extension

Open VS Code Settings and search for “Local Model Provider”.

- Required: set `local.model.provider.serverUrl` (e.g. `http://localhost:8000`)
- Optional: run “Local Model Provider: Set API Key (Secure)” to store a key in SecretStorage
### Use your models

Open the model manager and enable models from the “Local Model Provider”.
## 🖼️ Screenshots
- Model configuration
- Model selection
- Test execution
- Feature menu
- Server preset
## ⚙️ Configuration
All settings are under the `local.model.provider.*` namespace.
### Server Configuration

- `serverUrl` (string): base URL, e.g. `http://localhost:8000`
- `serverPresets` (array): saved server configurations for quick switching
- `defaultModel` (string): default model ID to use (leave empty for auto-select)
- `requestTimeout` (number, ms): request timeout; default `60000` (60 s)
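Put together, a `settings.json` fragment for these options might look like this (all values are illustrative):

```json
{
  "local.model.provider.serverUrl": "http://localhost:8000",
  "local.model.provider.requestTimeout": 60000,
  "local.model.provider.defaultModel": ""
}
```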
### Token & Context Settings

- `defaultMaxTokens` (number): estimated context window (default `32768`). If your model/server supports a larger context, consider increasing this for better continuity (e.g. 65k–128k).
- `defaultMaxOutputTokens` (number): maximum generation tokens (default `4096`). Increase when you need longer answers; ensure input + output stays within the model's context window.
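To see how these two settings interact, here is a hypothetical helper (not the extension's actual code) that clamps the requested output tokens so prompt + completion fit within the context window:

```typescript
// Hypothetical token-budgeting sketch: the output budget is whatever
// remains of the context window after the prompt, capped at the
// requested maximum and never negative.
function clampOutputTokens(
  promptTokens: number,
  requestedOutput: number, // e.g. defaultMaxOutputTokens = 4096
  contextWindow: number,   // e.g. defaultMaxTokens = 32768
): number {
  const available = contextWindow - promptTokens;
  return Math.max(0, Math.min(requestedOutput, available));
}
```

For example, with a 30,000-token prompt in a 32,768-token window, only 2,768 output tokens remain even if 4,096 were requested.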
### Function Calling

- `enableToolCalling` (boolean): enable function calling (default `true`)
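With function calling enabled, an OpenAI-compatible server accepts tool definitions in the standard Chat Completions `tools` format. A sketch of such a request body (the `get_weather` tool and model name are hypothetical illustrations):

```json
{
  "model": "qwen3:8b",
  "messages": [
    { "role": "user", "content": "What's the weather in Paris?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {
            "city": { "type": "string" }
          },
          "required": ["city"]
        }
      }
    }
  ]
}
```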