LLM Sidecar

Jo Hemphill

Bind-and-return tool orchestration for Copilot BYOK: local llama.cpp sidecar synthesizes tool calls so upstream models only reason in prose; Copilot executes tools with Human-in-the-Loop approval.
VS Code extension that runs a local OpenAI-compatible proxy with bind-and-return tool orchestration. Use it with Copilot/Cursor BYOK when you want strong upstream reasoning models but need tool calls to stay local — the upstream never receives structured tool calls; only prose reasoning crosses the network.

How it works

  1. Reason phase — upstream model plans in plain prose only (no tool-call JSON/XML/special blocks; tools stripped from the request; tool results folded into prose context). Upstream reasoning is buffered on-device before anything is shown to the user.
  2. Bind phase — local llama-server is the sole determiner of whether, which, and how to call tools. It emits grammar-constrained JSON action lists and per-tool arguments.
  3. Return to editor — when tools are needed, intermediate reasoning prose is hidden and only synthesized tool_calls are returned; when no tools are needed, the final answer is returned. Copilot executes tools with HITL approval.
  4. Context gathering — workspace file map, key manifests (package.json, Cargo.toml, etc.), file-type summary, ripgrep hits, open files, and diagnostics are folded into the reason prompt on-device.

Any request with tools automatically uses this path, regardless of endpoint adapter. Plain chat (no tools) uses the endpoint's configured adapter.
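The routing rule and the reason-phase preparation described above can be sketched as follows. This is a minimal illustration, not the extension's actual code: the `ChatRequest` and `ChatMessage` shapes, the function names `chooseRoute` and `toReasonRequest`, and the wording used to fold tool results into prose are all assumptions.

```typescript
// Hypothetical shapes, loosely modeled on OpenAI-style chat requests.
type Role = "system" | "user" | "assistant" | "tool";

interface ChatMessage {
  role: Role;
  content: string;
  name?: string; // tool name on tool-result messages (assumed field)
}

interface ChatRequest {
  messages: ChatMessage[];
  tools?: { name: string; description: string }[];
}

type Route = "bind-and-return" | "endpoint-adapter";

// Any request that carries tools takes the bind-and-return path;
// plain chat falls through to the endpoint's configured adapter.
function chooseRoute(req: ChatRequest): Route {
  return req.tools !== undefined && req.tools.length > 0
    ? "bind-and-return"
    : "endpoint-adapter";
}

// Build the prose-only request for the upstream model: drop the tool
// definitions and rewrite each tool result as an ordinary prose message.
function toReasonRequest(req: ChatRequest): ChatRequest {
  const messages = req.messages.map((m): ChatMessage =>
    m.role === "tool"
      ? {
          role: "user",
          content: `Result of tool ${m.name ?? "unknown"}:\n${m.content}`,
        }
      : { role: m.role, content: m.content }
  );
  return { messages }; // no `tools` field is sent upstream
}
```

Keeping both steps as pure functions makes it easy to check, in isolation, that nothing tool-shaped is left in the request before it leaves the machine.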

sequenceDiagram
    participant Copilot
    participant Sidecar
    participant Upstream as Upstream thinking model
    participant Llama as Local Llama bind
    Copilot->>Sidecar: chat request with tools
    Sidecar->>Upstream: tools stripped prose only
    Upstream-->>Sidecar: buffered prose plan
    Sidecar->>Llama: plan plus candidate tools
    Llama-->>Sidecar: action list and arguments
    Sidecar-->>Copilot: tool_calls or final answer
    Copilot->>Copilot: HITL approve and execute tool
    Copilot->>Sidecar: next request with tool result
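To make the "action list and arguments" step in the diagram concrete, here is a hedged sketch of turning a bind-model action list into OpenAI-style `tool_calls`. The `BindAction` shape, the `toToolCalls` name, and the `call_N` id scheme are hypothetical; the real grammar and output format may differ.

```typescript
// Hypothetical shape of one entry in the bind model's JSON action list.
interface BindAction {
  tool: string;
  arguments: Record<string, unknown>;
}

// OpenAI-style tool call, as it would be returned to the editor.
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string };
}

// Convert the bind output into tool calls, keeping only tools that were
// offered as candidates and honoring the per-turn cap (default 3).
function toToolCalls(
  bindOutput: string,
  candidateTools: readonly string[],
  maxToolCallsPerTurn = 3
): ToolCall[] {
  const parsed: unknown = JSON.parse(bindOutput);
  if (!Array.isArray(parsed)) {
    throw new Error("bind output must be a JSON array of actions");
  }
  const allowed = new Set(candidateTools);
  const calls: ToolCall[] = [];
  for (const item of parsed as Partial<BindAction>[]) {
    if (calls.length >= maxToolCallsPerTurn) break;
    const tool = item?.tool;
    if (typeof tool !== "string" || !allowed.has(tool)) continue;
    calls.push({
      id: `call_${calls.length}`, // illustrative id scheme
      type: "function",
      function: {
        name: tool,
        arguments: JSON.stringify(item.arguments ?? {}),
      },
    });
  }
  return calls;
}
```

In this sketch an empty result means the bind model decided no tools are needed, which corresponds to the "final answer" branch of the diagram.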

Quick start

Contributors

pnpm install
pnpm run setup:dev      # proxy + llama-server + default model + compile
pnpm run verify:assets

Press F5 to launch the Extension Development Host.

End users

Install from one of:

  • Visual Studio Marketplace (VS Code)
  • Open VSX (VSCodium and compatible editors)
  • GitHub Releases VSIX or offline bundle (air-gapped)

After install, download runtime assets (not bundled in the VSIX):

  1. Run LLM Sidecar: Download Llama Server (auto-detects CPU/CUDA/Vulkan/Metal).
  2. Run LLM Sidecar: Download Bind Model and choose Llama 3.2 3B (default) or Phi-4 mini. The US-compliant catalog offers only these two models.
  3. Run LLM Sidecar: Add First Endpoint → choose Corporate LLM (bind-and-return).
  4. Set the API key, run Sync Language Models, reload the window, then pick LLM Sidecar in chat.

For air-gapped installs, set llmSidecar.orchestrator.modelPath and llmSidecar.orchestrator.llamaServerBinaryPath, or use the offline bundle from GitHub Releases.

Build from source

pnpm install
pnpm run build          # sidecar-proxy + extension (no model download)
pnpm test

Full contributor bootstrap: pnpm run setup:dev (see docs/CONTRIBUTING.md).

Settings (highlights)

| Setting | Purpose |
| --- | --- |
| llmSidecar.orchestrator | Local bind model (llama.cpp) settings |
| llmSidecar.endpoints | Upstream reasoning endpoints; tool requests always use bind-and-return |
| llmSidecar.orchestrator.selectedModelId | Local bind model (llama-3.2-3b-instruct-ud-q4 or phi-4-mini-instruct-q4) |
| llmSidecar.orchestrator.llamaServerVariant | auto, cpu, cuda12, cuda13, vulkan, or metal |
| llmSidecar.orchestrator.modelPath | Explicit GGUF path (air-gapped) |
| llmSidecar.orchestrator.modelMirrorUrl | Corporate mirror for model download |
| llmSidecar.enforceHumanInTheLoop | Disable YOLO / force per-tool approval |
| llmSidecar.orchestrator.localOnly | Block all upstream egress |
| llmSidecar.orchestrator.ctxSize | llama-server context window for bind requests (raise on machines with more RAM; model supports up to ~128K) |
| llmSidecar.orchestrator.maxCandidateTools | Max tools considered per bind turn (default 12; smaller sets improve 3B accuracy) |
| llmSidecar.orchestrator.maxToolCallsPerTurn | Max parallel tool calls per turn (default 3; each extra call adds one bind round-trip) |
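As a rough illustration of how the orchestrator settings fit together, the sketch below merges user overrides onto defaults and validates them. The defaults for `selectedModelId`, `maxCandidateTools`, and `maxToolCallsPerTurn` come from this README; the defaults shown for `llamaServerVariant` and `localOnly`, the `OrchestratorSettings` type, and the `resolveSettings` helper are assumptions made for the example.

```typescript
const VARIANTS = ["auto", "cpu", "cuda12", "cuda13", "vulkan", "metal"] as const;
type LlamaServerVariant = (typeof VARIANTS)[number];

// Hypothetical typed view of a subset of llmSidecar.orchestrator.*
interface OrchestratorSettings {
  selectedModelId: string;
  llamaServerVariant: LlamaServerVariant;
  maxCandidateTools: number;
  maxToolCallsPerTurn: number;
  localOnly: boolean;
  modelPath?: string; // explicit GGUF path for air-gapped installs
}

const DEFAULTS: OrchestratorSettings = {
  selectedModelId: "llama-3.2-3b-instruct-ud-q4", // documented default model
  llamaServerVariant: "auto", // assumed default
  maxCandidateTools: 12, // documented default
  maxToolCallsPerTurn: 3, // documented default
  localOnly: false, // assumed default
};

// Merge user overrides onto the defaults and reject invalid values.
function resolveSettings(
  overrides: Partial<OrchestratorSettings>
): OrchestratorSettings {
  const merged: OrchestratorSettings = { ...DEFAULTS, ...overrides };
  if (VARIANTS.indexOf(merged.llamaServerVariant) === -1) {
    throw new Error(`unknown llamaServerVariant: ${merged.llamaServerVariant}`);
  }
  for (const key of ["maxCandidateTools", "maxToolCallsPerTurn"] as const) {
    if (!Number.isInteger(merged[key]) || merged[key] < 1) {
      throw new Error(`${key} must be a positive integer`);
    }
  }
  return merged;
}
```

For example, `resolveSettings({ maxCandidateTools: 8 })` would narrow the candidate set for a 3B bind model while leaving the other values at their defaults.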

Note: upstream reasoning is always buffered before bind. When tools are needed, intermediate prose is hidden from the user; final answers appear after upstream + bind complete (not token-streamed live from the upstream model).

Enterprise HITL policies

Lock tool auto-approval fleet-wide with VS Code enterprise policies:

  • ChatToolsAutoApprove → chat.tools.global.autoApprove
  • ChatToolsEligibleForAutoApproval → chat.tools.eligibleForAutoApproval
  • ChatToolsTerminalEnableAutoApprove → chat.tools.terminal.enableAutoApprove

See SECURITY.md for audit logging and DLP behavior.

Packaging note

GGUF models and platform llama-server binaries are large. The Marketplace VSIX ships sidecar-proxy and the runtime manifest; full binaries ship via GitHub Releases or on-demand download commands. US-compliant bind-model catalog: Meta Llama 3.2 3B and Microsoft Phi-4 mini only.

See docs/PUBLISHING.md.
