LLM Sidecar
VS Code extension that runs a local OpenAI-compatible proxy with bind-and-return tool orchestration. Use it with Copilot/Cursor BYOK when you want strong upstream reasoning models but need tool calls to stay local — the upstream never receives structured tool calls; only prose reasoning crosses the network.
How it works
- Reason phase — upstream model plans in plain prose only (no tool-call JSON/XML/special blocks; tools stripped from the request; tool results folded into prose context). Upstream reasoning is buffered on-device before anything is shown to the user.
- Bind phase — local llama-server is the sole determiner of whether, which, and how to call tools. It emits grammar-constrained JSON action lists and per-tool arguments.
- Return to editor — when tools are needed, intermediate reasoning prose is hidden and only synthesized
tool_calls are returned; when no tools are needed, the final answer is returned. Copilot executes tools with HITL approval.
- Context gathering — workspace file map, key manifests (
package.json, Cargo.toml, etc.), file-type summary, ripgrep hits, open files, and diagnostics are folded into the reason prompt on-device.
Any request with tools automatically uses this path, regardless of endpoint adapter. Plain chat (no tools) uses the endpoint's configured adapter.
sequenceDiagram
participant Copilot
participant Sidecar
participant Upstream as Upstream thinking model
participant Llama as Local Llama bind
Copilot->>Sidecar: chat request with tools
Sidecar->>Upstream: tools stripped prose only
Upstream-->>Sidecar: buffered prose plan
Sidecar->>Llama: plan plus candidate tools
Llama-->>Sidecar: action list and arguments
Sidecar-->>Copilot: tool_calls or final answer
Copilot->>Copilot: HITL approve and execute tool
Copilot->>Sidecar: next request with tool result
Quick start
Contributors
pnpm install
pnpm run setup:dev # proxy + llama-server + default model + compile
pnpm run verify:assets
Press F5 to launch the Extension Development Host.
End users
Install from one of:
After install, download runtime assets (not bundled in the VSIX):
- Run LLM Sidecar: Download Llama Server (auto-detects CPU/CUDA/Vulkan/Metal).
- Run LLM Sidecar: Download Bind Model — choose Llama 3.2 3B (default) or Phi-4 mini (US-compliant catalog only).
- Run LLM Sidecar: Add First Endpoint → choose Corporate LLM (bind-and-return).
- Set API key, Sync Language Models, reload window, pick LLM Sidecar in chat.
For air-gapped installs, set llmSidecar.orchestrator.modelPath and llmSidecar.orchestrator.llamaServerBinaryPath, or use the offline bundle from GitHub Releases.
Build from source
pnpm install
pnpm run build # sidecar-proxy + extension (no model download)
pnpm test
Full contributor bootstrap: pnpm run setup:dev (see docs/CONTRIBUTING.md).
Settings (highlights)
| Setting |
Purpose |
llmSidecar.orchestrator |
Local bind model (llama.cpp) settings |
llmSidecar.endpoints |
Upstream reasoning endpoints; tool requests always use bind-and-return |
llmSidecar.orchestrator.selectedModelId |
Local bind model (llama-3.2-3b-instruct-ud-q4 or phi-4-mini-instruct-q4) |
llmSidecar.orchestrator.llamaServerVariant |
auto, cpu, cuda12, cuda13, vulkan, or metal |
llmSidecar.orchestrator.modelPath |
Explicit GGUF path (air-gapped) |
llmSidecar.orchestrator.modelMirrorUrl |
Corporate mirror for model download |
llmSidecar.enforceHumanInTheLoop |
Disable YOLO / force per-tool approval |
llmSidecar.orchestrator.localOnly |
Block all upstream egress |
llmSidecar.orchestrator.ctxSize |
llama-server context window for bind requests (raise on machines with more RAM; model supports up to ~128K) |
llmSidecar.orchestrator.maxCandidateTools |
Max tools considered per bind turn (default 12; smaller sets improve 3B accuracy) |
llmSidecar.orchestrator.maxToolCallsPerTurn |
Max parallel tool calls per turn (default 3; each extra call adds one bind round-trip) |
Note: upstream reasoning is always buffered before bind. When tools are needed, intermediate prose is hidden from the user; final answers appear after upstream + bind complete (not token-streamed live from the upstream model).
Enterprise HITL policies
Lock tool auto-approval fleet-wide with VS Code enterprise policies:
ChatToolsAutoApprove → chat.tools.global.autoApprove
ChatToolsEligibleForAutoApproval → chat.tools.eligibleForAutoApproval
ChatToolsTerminalEnableAutoApprove → chat.tools.terminal.enableAutoApprove
See SECURITY.md for audit logging and DLP behavior.
Packaging note
GGUF models and platform llama-server binaries are large. The Marketplace VSIX ships sidecar-proxy and the runtime manifest; full binaries ship via GitHub Releases or on-demand download commands. US-compliant bind-model catalog: Meta Llama 3.2 3B and Microsoft Phi-4 mini only.
See docs/PUBLISHING.md.