Skip to content
| Marketplace
Sign in
Visual Studio Code>AI>Forge LLMNew to Visual Studio Code? Get it now.
Forge LLM

Forge LLM

Efso.o

|
4 installs
| (0) | Free
Local-LLM coding assistant — direct llama.cpp, strict tool schemas, single execute-style workflow
Installation
Launch VS Code Quick Open (Ctrl+P), paste the following command, and press enter.
Copied to clipboard
More Info

Forge

Local-first AI coding assistant for VS Code powered by llama.cpp, zero cloud, zero telemetry.

Forge runs GGUF models directly on your machine via llama-server. No API key, no subscription, no data leaves your box.


Features

  • Single execute-style workflow: no mode switching — one conversation, full tool access
  • Multi-tab concurrent streaming: run multiple independent chats in parallel; switching tabs never cancels a running stream
  • Inline diff viewer: every file write shows a collapsible red/green diff block in the chat immediately after the tool runs
  • Direct llama.cpp control: Forge spawns and manages llama-server
  • Ollama support: local Ollama models or Ollama cloud routing — auth via ollama auth login, not Forge
  • Isolated backend pools: llama.cpp and Ollama backends are tracked separately — switching to an Ollama model never stops a running llama-server
  • VRAM management: closing a tab with a local model prompts to unload it from VRAM immediately
  • Hot model swap: switch between GGUF or Ollama models without restarting VS Code
  • Per-action confirmation gate: approve or deny each tool call before it runs
  • Per-turn checkpoints: Undo or Keep after any turn that writes files
  • Token budget bar: live used/max context token estimate shown in the header
  • Reasoning token display: streamed thinking output shown inline when enabled
  • Runtime capability checks: inspect llama.cpp metadata and warn on mismatched tool/thinking features
  • Thinking-channel stripping: optionally hide <think> and related channel markup
  • Strict tool schemas: typed JSON Schema for every tool — no free-form string blobs
  • Slash commands in chat: type / to open built-in chat actions
  • FORGE.md workspace instructions: drop a FORGE.md in any project root to give the agent persistent navigation rules, stack context, and hard stops — auto-injected into every prompt
  • Optional web search: Tavily or Brave via user-supplied API key
  • Bridge mode: connect to any already-running OpenAI-compatible server

Requirements

  • VS Code 1.90 or later
  • llama-server (for direct GGUF mode)
  • One or more local GGUF files, or a running Ollama daemon

Quick Start

1. Install the extension

Install Forge from the VS Code Marketplace or via the Extensions panel.

2. Build llama-server (direct GGUF mode)

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)

Skip this step if you are using Ollama only.

3. Create config.yaml

Forge looks for a config file at:

<your-project>/.forge/config.yaml

If the file is missing, open the sidebar and the setup wizard will generate it, or create it manually:

active_model: my-model

llama_server:
  binary: /path/to/llama-server
  host: 127.0.0.1
  port: 8080
  n_gpu_layers: -1
  default_num_ctx: 8192
  n_batch: 512
  n_parallel: 1
  type_k: q8_0
  type_v: q8_0
  flash_attn_default: true

models:
  my-model:
    gguf_path: /path/to/model.gguf
    n_gpu_layers: -1
    num_ctx: 8192
    flash_attn: true
    think: false
    strip_thinking_channels: true
    sampling:
      temperature: 0.6
      top_p: 0.95
      top_k: 64
      max_tokens: 8192

4. Open the Forge sidebar

Click the Forge icon in the Activity Bar. The backend starts on the first prompt.


Ollama

Add Ollama models alongside GGUF models in the same config. Auth is handled by ollama auth login on your machine — Forge sends no credentials.

models:
  gemma4:26b:
    provider: ollama
    endpoint: http://127.0.0.1:11434
    num_ctx: 262144
    think: true
    reasoning_effort: medium       # low | medium | high

  deepseek-v4-flash:cloud:
    provider: ollama
    endpoint: http://127.0.0.1:11434
    num_ctx: 1000000
    think: true
    reasoning_effort: medium

Forge merges GGUF and Ollama entries in a single model picker. Ollama backends are tracked in an isolated pool — selecting an Ollama model never stops a running llama-server, and vice versa.


VRAM Management

Forge keeps models loaded between prompts so the first message in a follow-up turn is instant. To free VRAM explicitly:

  • Close a tab — if the model is local (llama.cpp, or Ollama on a local endpoint) and no other open tab uses it, Forge shows a notification: "[model] is still loaded in VRAM. Unload it to free memory?" with an Unload Now button.
  • /unloadModel — stops all backends immediately and releases everything.
  • /restartBackend — stops then restarts the active llama-server.

Note: closing a tab while another tab uses the same model will not trigger the unload prompt — the model is still in use.


Multi-Tab Concurrent Streaming

Each conversation tab runs its own independent agent loop with a dedicated abort controller. You can:

  • Send a prompt in tab A and switch to tab B while it's still generating — tab A keeps streaming in the background
  • Cancel a specific tab without affecting others
  • See a live indicator on tabs that are currently generating

Switching between tabs never cancels a running stream. Only closing a tab or pressing Cancel within it stops that conversation.

max_simultaneous_models (default 1) controls how many llama-server processes stay alive at once. If you open more conversations than the pool limit, the least-recently-used llama.cpp server is evicted when a new model needs to start. Ollama models are not subject to this limit.


Inline Diff Viewer

After every file write (write_file, replace_in_file, delete_file), Forge renders a collapsible diff block directly in the chat:

  • Badge: new / modified / deleted
  • Path: file path relative to the workspace root
  • Hunks: unified diff with 3 lines of context, green + for additions, red − for removals
  • Files over 500 lines fall back to a "file too large to diff inline" notice

The diff uses the per-turn checkpoint snapshot as the before-state, so it always reflects exactly what the agent changed.


Model Behavior

Forge inspects llama.cpp runtime metadata (via /props) before sending requests:

  • whether the active model exposes a usable chat template
  • whether the template likely supports tool calling
  • whether the template supports thinking toggles (enable_thinking, preserve_thinking)

When Forge detects a mismatch it warns in the UI and narrows the request rather than sending incompatible fields. These checks are advisory — GGUF metadata and community templates can still be incomplete.


Checkpoints (Undo / Keep)

After every turn that writes files, Forge shows an Undo / Keep bar in the editor via CodeLens.

  • Undo restores all files modified in that turn to their state before the agent ran
  • Keep commits the checkpoint and clears the bar

You can also use /undo and /keep from the chat input, or the command palette (Forge: Undo Last Turn, Forge: Keep Changes).


Thinking Output

models:
  my-model:
    gguf_path: /path/to/model.gguf
    think: true
    strip_thinking_channels: false   # show reasoning inline

When think: true, reasoning tokens stream into a collapsible block in the UI. When strip_thinking_channels: true and think: false, Forge strips <think>...</think> markers from visible output.

For Ollama models, use reasoning_effort: low | medium | high to control the reasoning budget.


Bridge Mode

If you already run your own llama-server or any OpenAI-compatible server:

bridge_mode: true

llama_server:
  host: 127.0.0.1
  port: 8080

In bridge mode Forge connects to the existing process but does not own it — releasing a model from memory is the bridge's responsibility.


Web Search

Forge supports Tavily and Brave Search as search providers. The API key is stored securely in VS Code's SecretStorage (OS keychain) — never in config files or git.

Setup (two steps)

1. Run the command

Ctrl+Shift+P → Forge: Set Search API Key

  • If no search provider is configured yet, Forge will ask you to pick Tavily or Brave, then automatically add the search: block to your config.yaml.
  • If a provider is already configured, it will just prompt for the key.

2. Reload the window

Ctrl+Shift+P → Developer: Reload Window — required only if the search: block was just added to config.yaml for the first time.

That's it. The web_search tool becomes available to the agent on the next prompt.

Manual config (optional)

If you prefer to configure the search block yourself before running the command:

search:
  provider: tavily        # or brave
  secret_key_name: forge.tavily.apiKey
  max_results: 5

FORGE.md — Workspace Instructions

Drop a FORGE.md file in any project's root folder to give the Forge agent persistent context about that workspace: where things live, what the stack is, and what operations require confirmation.

Forge injects it into the system prompt on every turn — the agent reads it automatically without you needing to re-explain the project.

Generate it automatically

Type /initForge in the chat input. Forge will scan your workspace (directory layout, package.json, config files) and ask the active model to generate a FORGE.md tailored to that project. The file appears in your editor immediately after.

You can also copy FORGE.md.example from the extension directory as a starting template and fill it in manually.

What to put in it

## Stack
TypeScript + Node.js. esbuild for bundling. Vitest for tests.

## Workspace Layout
src/        — all source code
src/api/    — Express route handlers
src/db/     — database models and migrations
config/     — environment config files

## Key Files
- src/index.ts      — entry point
- src/config.ts     — app-wide config
- prisma/schema.prisma — database schema

## Navigation Rules
- All API routes live in src/api/ — never add routes elsewhere
- Grep before creating — check for existing helpers first

## Hard Stops
- Never run database migrations without explicit user confirmation
- Never delete files in /data — these are production assets

The file is watched for changes — edits take effect on the next message with no restart required.


Slash Commands

Type / in the chat input to open the command list.

Command Description
/initForge Scan workspace and generate a FORGE.md agent instructions file
/newChat Start a new conversation tab
/clearChat Clear the current conversation
/undo Restore files from the last write turn
/keep Commit the current checkpoint
/compact Summarize and compress conversation history
/review Run a code review on the current file or selection
/restartBackend Restart the managed llama-server process
/unloadModel Stop all backends and release models from memory
/reloadWindow Reload the VS Code window

Commands

Command Description
Forge: Open Sidebar Open the Forge panel
Forge: Restart Backend Restart llama-server
Forge: New Chat Open a new conversation tab
Forge: Undo Last Turn Restore files from the last write turn
Forge: Keep Changes Commit the current checkpoint
Forge: Send Selection to Chat Prefill the prompt with the active editor selection
Forge: Set Search API Key Store a Tavily or Brave API key

Settings

Setting Default Description
forge.logLevel info Log verbosity (debug, info, warn, error)
forge.sidebar.retainContextWhenHidden true Keep webview state when the panel is hidden

Privacy

Forge makes no outbound network calls except to:

  • llama-server on your configured host
  • The local Ollama daemon (localhost:11434) when Ollama models are selected
  • Your configured search provider when search is enabled and you send a query

There is no telemetry, no analytics, and no auto-update pinging.


License

Apache 2.0. See LICENSE.

  • Contact us
  • Jobs
  • Privacy
  • Manage cookies
  • Terms of use
  • Trademarks
© 2026 Microsoft