Forge
Local-first AI coding assistant for VS Code, powered by llama.cpp. Zero cloud, zero telemetry.
Forge runs GGUF models directly on your machine via llama-server. No API key, no subscription, no data leaves your box.
Features
- Single execute-style workflow: no mode switching — one conversation, full tool access
- Multi-tab concurrent streaming: run multiple independent chats in parallel; switching tabs never cancels a running stream
- Inline diff viewer: every file write shows a collapsible red/green diff block in the chat immediately after the tool runs
- Direct llama.cpp control: Forge spawns and manages llama-server
- Ollama support: local Ollama models or Ollama cloud routing; auth via ollama auth login, not Forge
- Isolated backend pools: llama.cpp and Ollama backends are tracked separately — switching to an Ollama model never stops a running llama-server
- VRAM management: closing a tab with a local model prompts to unload it from VRAM immediately
- Hot model swap: switch between GGUF or Ollama models without restarting VS Code
- Per-action confirmation gate: approve or deny each tool call before it runs
- Per-turn checkpoints: Undo or Keep after any turn that writes files
- Token budget bar: live used/max context token estimate shown in the header
- Reasoning token display: streamed thinking output shown inline when enabled
- Runtime capability checks: inspect llama.cpp metadata and warn on mismatched tool/thinking features
- Thinking-channel stripping: optionally hide <think> and related channel markup
- Strict tool schemas: typed JSON Schema for every tool — no free-form string blobs
- Slash commands in chat: type / to open built-in chat actions
- FORGE.md workspace instructions: drop a FORGE.md in any project root to give the agent persistent navigation rules, stack context, and hard stops; it is auto-injected into every prompt
- Optional web search: Tavily or Brave via user-supplied API key
- Bridge mode: connect to any already-running OpenAI-compatible server
Requirements
- VS Code 1.90 or later
- llama-server (for direct GGUF mode)
- One or more local GGUF files, or a running Ollama daemon
Quick Start
1. Install the extension
Install Forge from the VS Code Marketplace or via the Extensions panel.
2. Build llama-server (direct GGUF mode)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
Skip this step if you are using Ollama only.
3. Create config.yaml
Forge looks for a config file at:
<your-project>/.forge/config.yaml
If the file is missing, open the sidebar and the setup wizard will generate it, or create it manually:
active_model: my-model
llama_server:
  binary: /path/to/llama-server
  host: 127.0.0.1
  port: 8080
  n_gpu_layers: -1
  default_num_ctx: 8192
  n_batch: 512
  n_parallel: 1
  type_k: q8_0
  type_v: q8_0
  flash_attn_default: true
models:
  my-model:
    gguf_path: /path/to/model.gguf
    n_gpu_layers: -1
    num_ctx: 8192
    flash_attn: true
    think: false
    strip_thinking_channels: true
    sampling:
      temperature: 0.6
      top_p: 0.95
      top_k: 64
      max_tokens: 8192
Click the Forge icon in the Activity Bar. The backend starts on the first prompt.
Ollama
Add Ollama models alongside GGUF models in the same config. Auth is handled by ollama auth login on your machine — Forge sends no credentials.
models:
  gemma4:26b:
    provider: ollama
    endpoint: http://127.0.0.1:11434
    num_ctx: 262144
    think: true
    reasoning_effort: medium # low | medium | high
  deepseek-v4-flash:cloud:
    provider: ollama
    endpoint: http://127.0.0.1:11434
    num_ctx: 1000000
    think: true
    reasoning_effort: medium
Forge merges GGUF and Ollama entries in a single model picker. Ollama backends are tracked in an isolated pool — selecting an Ollama model never stops a running llama-server, and vice versa.
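As a rough illustration of the merged picker, the sketch below tags each configured model with the backend pool it would belong to. The `ModelEntry` and `PickerItem` types and the `buildPicker` function are hypothetical names, not Forge's internals, and the rule "an entry with `provider: ollama` is Ollama, anything else is GGUF" is an assumption drawn from the config examples above.

```typescript
// Hypothetical shape of one entry under `models:`; only fields shown in this README.
interface ModelEntry {
  provider?: string; // "ollama" for Ollama entries, absent for GGUF entries
  gguf_path?: string;
  endpoint?: string;
}

interface PickerItem {
  name: string;
  backend: "llama.cpp" | "ollama";
}

// Assumption: `provider: ollama` marks an Ollama model; everything else is GGUF.
function buildPicker(models: { [name: string]: ModelEntry }): PickerItem[] {
  return Object.keys(models).map(function (name): PickerItem {
    const backend = models[name].provider === "ollama" ? "ollama" : "llama.cpp";
    return { name: name, backend: backend };
  });
}

const picker = buildPicker({
  "my-model": { gguf_path: "/path/to/model.gguf" },
  "gemma4:26b": { provider: "ollama", endpoint: "http://127.0.0.1:11434" },
});
```

Keeping the backend tag on each picker item is one simple way to route a selection to the right pool without touching the other one.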
VRAM Management
Forge keeps models loaded between prompts so follow-up turns start without reloading the model. To free VRAM explicitly:
- Close a tab — if the model is local (llama.cpp, or Ollama on a local endpoint) and no other open tab uses it, Forge shows a notification: "[model] is still loaded in VRAM. Unload it to free memory?" with an Unload Now button.
- /unloadModel: stops all backends immediately and releases everything.
- /restartBackend: stops then restarts the active llama-server.
Note: closing a tab while another tab uses the same model will not trigger the unload prompt — the model is still in use.
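The unload-prompt rule described above can be sketched as a small predicate. The `Tab` type and `shouldPromptUnload` are illustrative names only; how Forge actually tracks tabs is not documented here.

```typescript
// Hypothetical tab record; field names are illustrative.
interface Tab {
  id: string;
  model: string;
  isLocal: boolean; // llama.cpp, or Ollama on a local endpoint
}

// True when closing `closed` should show the "Unload it to free memory?" prompt:
// the model is local and no remaining open tab still uses it.
function shouldPromptUnload(closed: Tab, remaining: Tab[]): boolean {
  if (!closed.isLocal) return false;
  return !remaining.some(function (t) { return t.model === closed.model; });
}

const tabA: Tab = { id: "A", model: "my-model", isLocal: true };
const tabB: Tab = { id: "B", model: "my-model", isLocal: true };
const tabC: Tab = { id: "C", model: "deepseek-v4-flash:cloud", isLocal: false };
```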
Multi-Tab Concurrent Streaming
Each conversation tab runs its own independent agent loop with a dedicated abort controller. You can:
- Send a prompt in tab A and switch to tab B while it's still generating — tab A keeps streaming in the background
- Cancel a specific tab without affecting others
- See a live indicator on tabs that are currently generating
Switching between tabs never cancels a running stream. Only closing a tab or pressing Cancel within it stops that conversation.
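A minimal sketch of the per-tab cancellation idea, assuming one standard `AbortController` per tab. The `TabStreams` class and its method names are hypothetical; the point is only that switching the active tab never touches another tab's controller.

```typescript
// Hypothetical registry: one AbortController per streaming tab.
class TabStreams {
  private controllers: { [tabId: string]: AbortController } = {};
  private activeTab: string | null = null;

  // Begin a stream for a tab and return the signal its agent loop would observe.
  start(tabId: string): AbortSignal {
    const controller = new AbortController();
    this.controllers[tabId] = controller;
    return controller.signal;
  }

  // Switching tabs only changes focus; no controller is aborted.
  activate(tabId: string): void {
    this.activeTab = tabId;
  }

  active(): string | null {
    return this.activeTab;
  }

  // Cancel (or close) affects exactly one tab.
  cancel(tabId: string): void {
    const controller = this.controllers[tabId];
    if (controller) {
      controller.abort();
      delete this.controllers[tabId];
    }
  }

  isStreaming(tabId: string): boolean {
    return Object.prototype.hasOwnProperty.call(this.controllers, tabId);
  }
}

const streams = new TabStreams();
const signalA = streams.start("A");
const signalB = streams.start("B");
streams.activate("B"); // tab A keeps streaming in the background
```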
max_simultaneous_models (default 1) controls how many llama-server processes stay alive at once. When a new model needs to start and the pool is already at that limit, the least-recently-used llama.cpp server is evicted. Ollama models are not subject to this limit.
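The least-recently-used eviction described above could look roughly like this. `ServerPool` is a hypothetical name and the sketch only tracks model names; stopping the actual llama-server process is left out.

```typescript
// Hypothetical LRU pool of llama.cpp models; Ollama models would bypass it.
class ServerPool {
  private order: string[] = []; // least-recently-used first
  private maxSimultaneous: number;

  constructor(maxSimultaneous: number) {
    this.maxSimultaneous = maxSimultaneous;
  }

  // Mark `model` as most recently used and return any models evicted to make room.
  acquire(model: string): string[] {
    const idx = this.order.indexOf(model);
    if (idx !== -1) this.order.splice(idx, 1);
    this.order.push(model);
    const evicted: string[] = [];
    while (this.order.length > this.maxSimultaneous) {
      evicted.push(this.order.shift() as string);
    }
    return evicted;
  }

  loaded(): string[] {
    return this.order.slice();
  }
}

const pool = new ServerPool(2);
pool.acquire("model-a");
pool.acquire("model-b");
pool.acquire("model-a"); // model-a becomes most recently used
const evictedModels = pool.acquire("model-c"); // pool is full, so model-b is evicted
```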
Inline Diff Viewer
After every file write (write_file, replace_in_file, delete_file), Forge renders a collapsible diff block directly in the chat:
- Badge: new / modified / deleted
- Path: file path relative to the workspace root
- Hunks: unified diff with 3 lines of context, green + for additions, red − for removals
- Files over 500 lines fall back to a "file too large to diff inline" notice
The diff uses the per-turn checkpoint snapshot as the before-state, so it always reflects exactly what the agent changed.
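The badge choice and the size fallback can be illustrated with a simplified line diff. This is a sketch under stated assumptions: `renderDiff`, `badgeFor`, and `diffLines` are hypothetical names, the 500-line limit is applied to whichever side is longer, and the output is a flat line-by-line diff based on a longest-common-subsequence table rather than real unified hunks with 3 lines of context.

```typescript
type Badge = "new" | "modified" | "deleted";

const MAX_INLINE_LINES = 500; // threshold from the list above

// null means "file does not exist" on that side of the change.
function badgeFor(before: string | null, after: string | null): Badge {
  if (before === null) return "new";
  if (after === null) return "deleted";
  return "modified";
}

// Line diff via a longest-common-subsequence table.
function diffLines(before: string[], after: string[]): string[] {
  const n = before.length;
  const m = after.length;
  const lcs: number[][] = [];
  for (let r = 0; r <= n; r++) {
    const row: number[] = [];
    for (let c = 0; c <= m; c++) row.push(0);
    lcs.push(row);
  }
  for (let r = n - 1; r >= 0; r--) {
    for (let c = m - 1; c >= 0; c--) {
      lcs[r][c] = before[r] === after[c]
        ? lcs[r + 1][c + 1] + 1
        : Math.max(lcs[r + 1][c], lcs[r][c + 1]);
    }
  }
  const out: string[] = [];
  let i = 0;
  let j = 0;
  while (i < n && j < m) {
    if (before[i] === after[j]) { out.push("  " + before[i]); i++; j++; }
    else if (lcs[i + 1][j] >= lcs[i][j + 1]) { out.push("- " + before[i]); i++; }
    else { out.push("+ " + after[j]); j++; }
  }
  while (i < n) out.push("- " + before[i++]);
  while (j < m) out.push("+ " + after[j++]);
  return out;
}

function renderDiff(before: string | null, after: string | null): { badge: Badge; body: string[] } {
  const a = before === null ? [] : before.split("\n");
  const b = after === null ? [] : after.split("\n");
  const badge = badgeFor(before, after);
  if (Math.max(a.length, b.length) > MAX_INLINE_LINES) {
    return { badge: badge, body: ["file too large to diff inline"] };
  }
  return { badge: badge, body: diffLines(a, b) };
}

const smallDiff = renderDiff("a\nb\nc", "a\nx\nc");
```

Here `before` plays the role of the checkpoint snapshot and `after` the file content once the tool has run.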
Model Behavior
Forge inspects llama.cpp runtime metadata (via /props) before sending requests:
- whether the active model exposes a usable chat template
- whether the template likely supports tool calling
- whether the template supports thinking toggles (enable_thinking, preserve_thinking)
When Forge detects a mismatch it warns in the UI and narrows the request rather than sending incompatible fields. These checks are advisory — GGUF metadata and community templates can still be incomplete.
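One way to picture "warn and narrow" is the sketch below. Everything in it is an assumption for illustration: the substring heuristics in `inspectTemplate`, the `ChatRequest` shape, and the `enableThinking` field name are hypothetical and are not taken from Forge or from the llama-server API.

```typescript
interface Capabilities {
  hasChatTemplate: boolean;
  likelyTools: boolean;
  thinkingToggle: boolean;
}

// Heuristic (assumed): look for tell-tale identifiers in the chat template text.
function inspectTemplate(chatTemplate: string | undefined): Capabilities {
  const t = chatTemplate || "";
  return {
    hasChatTemplate: t.length > 0,
    likelyTools: t.indexOf("tools") !== -1,
    thinkingToggle: t.indexOf("enable_thinking") !== -1,
  };
}

// Hypothetical request shape used only for this example.
interface ChatRequest {
  messages: { role: string; content: string }[];
  tools?: { name: string }[];
  enableThinking?: boolean;
}

// Drop fields the template does not appear to support and collect UI warnings.
function narrowRequest(req: ChatRequest, caps: Capabilities): { request: ChatRequest; warnings: string[] } {
  const request: ChatRequest = { messages: req.messages };
  const warnings: string[] = [];
  if (req.tools !== undefined) {
    if (caps.likelyTools) request.tools = req.tools;
    else warnings.push("Template does not appear to support tool calling; tools omitted.");
  }
  if (req.enableThinking !== undefined) {
    if (caps.thinkingToggle) request.enableThinking = req.enableThinking;
    else warnings.push("Template does not appear to support thinking toggles; flag omitted.");
  }
  return { request: request, warnings: warnings };
}

const caps = inspectTemplate("{% if tools %}...{% endif %}");
const narrowed = narrowRequest(
  { messages: [{ role: "user", content: "hi" }], tools: [{ name: "read_file" }], enableThinking: true },
  caps
);
```

Because the checks are advisory, a sketch like this errs toward omitting a field and warning rather than failing the request.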
Checkpoints (Undo / Keep)
After every turn that writes files, Forge shows an Undo / Keep bar in the editor via CodeLens.
- Undo restores all files modified in that turn to their state before the agent ran
- Keep commits the checkpoint and clears the bar
You can also use /undo and /keep from the chat input, or the command palette (Forge: Undo Last Turn, Forge: Keep Changes).
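The Undo / Keep semantics can be sketched with an in-memory stand-in for the workspace. `Checkpoint` and its methods are hypothetical names; a real implementation would snapshot files on disk instead of entries in an object.

```typescript
type Files = { [path: string]: string };

function has(obj: object, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(obj, key);
}

// Hypothetical per-turn checkpoint: remembers each file's state before its first write.
class Checkpoint {
  private files: Files;
  private snapshot: { [path: string]: string | null } = {}; // null = file did not exist

  constructor(files: Files) {
    this.files = files;
  }

  private remember(path: string): void {
    if (!has(this.snapshot, path)) {
      this.snapshot[path] = has(this.files, path) ? this.files[path] : null;
    }
  }

  write(path: string, content: string): void {
    this.remember(path);
    this.files[path] = content;
  }

  remove(path: string): void {
    this.remember(path);
    delete this.files[path];
  }

  // Undo: restore every touched file to its pre-turn state.
  undo(): void {
    for (const path of Object.keys(this.snapshot)) {
      const before = this.snapshot[path];
      if (before === null) delete this.files[path];
      else this.files[path] = before;
    }
    this.snapshot = {};
  }

  // Keep: accept the changes and discard the snapshot.
  keep(): void {
    this.snapshot = {};
  }
}

const workspace: Files = { "a.ts": "old", "c.ts": "to be deleted" };
const turn = new Checkpoint(workspace);
turn.write("a.ts", "new");
turn.write("b.ts", "created");
turn.remove("c.ts");
turn.undo();
```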
Thinking Output
models:
  my-model:
    gguf_path: /path/to/model.gguf
    think: true
    strip_thinking_channels: false # show reasoning inline
When think: true, reasoning tokens stream into a collapsible block in the UI. When strip_thinking_channels: true and think: false, Forge strips <think>...</think> markers from visible output.
For Ollama models, use reasoning_effort: low | medium | high to control the reasoning budget.
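A minimal sketch of the stripping step, assuming only the literal `<think>...</think>` tags. `stripThinking` is a hypothetical helper; the "related channel markup" mentioned in the feature list is not covered here.

```typescript
// Remove <think> blocks from text that is about to be displayed.
function stripThinking(text: string): string {
  // Complete blocks, including newlines inside them and trailing whitespace.
  let out = text.replace(/<think>[\s\S]*?<\/think>\s*/g, "");
  // During streaming a block may be open but not yet closed; hide the open tail too.
  const open = out.indexOf("<think>");
  if (open !== -1) out = out.slice(0, open);
  return out;
}

const visible = stripThinking("<think>plan the edit</think>\nHello");
const partial = stripThinking("Hi <think>still reasoning");
```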
Bridge Mode
If you already run your own llama-server or any OpenAI-compatible server:
bridge_mode: true
llama_server:
  host: 127.0.0.1
  port: 8080
In bridge mode Forge connects to the existing process but does not own it — releasing a model from memory is the bridge's responsibility.
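For orientation, a request to an OpenAI-compatible server might be assembled as below. The `/v1/chat/completions` path is the standard OpenAI-style chat endpoint; `BridgeConfig`, `buildChatRequest`, and the choice to always stream are assumptions for this sketch, not Forge's actual client code.

```typescript
interface BridgeConfig {
  host: string;
  port: number;
}

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface PreparedRequest {
  url: string;
  init: { method: string; headers: { [name: string]: string }; body: string };
}

// Build (but do not send) a streaming chat completion request for the bridge server.
function buildChatRequest(cfg: BridgeConfig, model: string, messages: ChatMessage[]): PreparedRequest {
  return {
    url: "http://" + cfg.host + ":" + cfg.port + "/v1/chat/completions",
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model: model, messages: messages, stream: true }),
    },
  };
}

const prepared = buildChatRequest(
  { host: "127.0.0.1", port: 8080 },
  "my-model",
  [{ role: "user", content: "Explain this function." }]
);
```

The result can be passed to `fetch(prepared.url, prepared.init)`; nothing in it starts or stops the server, which matches the bridge-mode ownership rule above.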
Web Search
Forge supports Tavily and Brave Search as search providers. The API key is stored securely in VS Code's SecretStorage (OS keychain) — never in config files or git.
Setup (two steps)
1. Run the command
Ctrl+Shift+P → Forge: Set Search API Key
- If no search provider is configured yet, Forge will ask you to pick Tavily or Brave, then automatically add the search: block to your config.yaml.
- If a provider is already configured, it will just prompt for the key.
2. Reload the window
Ctrl+Shift+P → Developer: Reload Window — required only if the search: block was just added to config.yaml for the first time.
That's it. The web_search tool becomes available to the agent on the next prompt.
Manual config (optional)
If you prefer to configure the search block yourself before running the command:
search:
  provider: tavily # or brave
  secret_key_name: forge.tavily.apiKey
  max_results: 5
FORGE.md — Workspace Instructions
Drop a FORGE.md file in any project's root folder to give the Forge agent persistent context about that workspace: where things live, what the stack is, and what operations require confirmation.
Forge injects it into the system prompt on every turn — the agent reads it automatically without you needing to re-explain the project.
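Conceptually, the injection amounts to appending the file's contents to the base system prompt each turn. The sketch below is illustrative only: `buildSystemPrompt` and the section header text are hypothetical, and Forge's real prompt layout is not documented here.

```typescript
// Append FORGE.md (if present and non-empty) to a base system prompt.
function buildSystemPrompt(base: string, forgeMd: string | null): string {
  if (forgeMd === null || forgeMd.trim() === "") return base;
  return base + "\n\nWorkspace instructions (FORGE.md):\n" + forgeMd.trim();
}

const withRules = buildSystemPrompt("You are a coding agent.", "## Hard Stops\n- Never delete files in /data\n");
const withoutRules = buildSystemPrompt("You are a coding agent.", null);
```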
Generate it automatically
Type /initForge in the chat input. Forge will scan your workspace (directory layout, package.json, config files) and ask the active model to generate a FORGE.md tailored to that project. The file appears in your editor immediately after.
You can also copy FORGE.md.example from the extension directory as a starting template and fill it in manually.
What to put in it
## Stack
TypeScript + Node.js. esbuild for bundling. Vitest for tests.
## Workspace Layout
src/ — all source code
src/api/ — Express route handlers
src/db/ — database models and migrations
config/ — environment config files
## Key Files
- src/index.ts — entry point
- src/config.ts — app-wide config
- prisma/schema.prisma — database schema
## Navigation Rules
- All API routes live in src/api/ — never add routes elsewhere
- Grep before creating — check for existing helpers first
## Hard Stops
- Never run database migrations without explicit user confirmation
- Never delete files in /data — these are production assets
The file is watched for changes — edits take effect on the next message with no restart required.
Slash Commands
Type / in the chat input to open the command list.
| Command | Description |
| --- | --- |
| /initForge | Scan workspace and generate a FORGE.md agent instructions file |
| /newChat | Start a new conversation tab |
| /clearChat | Clear the current conversation |
| /undo | Restore files from the last write turn |
| /keep | Commit the current checkpoint |
| /compact | Summarize and compress conversation history |
| /review | Run a code review on the current file or selection |
| /restartBackend | Restart the managed llama-server process |
| /unloadModel | Stop all backends and release models from memory |
| /reloadWindow | Reload the VS Code window |
Commands
| Command | Description |
| --- | --- |
| Forge: Open Sidebar | Open the Forge panel |
| Forge: Restart Backend | Restart llama-server |
| Forge: New Chat | Open a new conversation tab |
| Forge: Undo Last Turn | Restore files from the last write turn |
| Forge: Keep Changes | Commit the current checkpoint |
| Forge: Send Selection to Chat | Prefill the prompt with the active editor selection |
| Forge: Set Search API Key | Store a Tavily or Brave API key |
Settings
| Setting | Default | Description |
| --- | --- | --- |
| forge.logLevel | info | Log verbosity (debug, info, warn, error) |
| forge.sidebar.retainContextWhenHidden | true | Keep webview state when the panel is hidden |
Privacy
Forge makes no outbound network calls except to:
- llama-server on your configured host
- The local Ollama daemon (localhost:11434) when Ollama models are selected
- Your configured search provider when search is enabled and you send a query
There is no telemetry, no analytics, and no auto-update pinging.
License
Apache 2.0. See LICENSE.