Internal Ollama
Run a private, bundled Ollama engine inside VS Code — no separate Ollama desktop install required. Use local models in GitHub Copilot Chat (BYOK), the built-in Ollama Chat panel, and inline tab completions.
All inference stays on your machine. The extension sets OLLAMA_NO_CLOUD=true and does not phone home to ollama.com.
Features
- Auto-bootstrap Ollama: downloads the official portable runtime on first run (~1.4 GB, one time); dev/F5 uses local bin/ollama.exe when present
- Copilot Chat (BYOK): auto-configures github.copilot.chat.byok.ollamaEndpoint to point at the bundled server
- Large-model support: OLLAMA_CONTEXT_LENGTH=262144 (same default as the Ollama desktop app); Ollama fits layers/KV cache to your GPU
- Ollama Chat panel: local chat with workspace context (.github/copilot-instructions.md, agents, prompts)
- Inline completions: optional tab-complete powered by your default model
- Model management: pull models, create optional aliases, unload models from RAM, set a custom model storage path
Requirements
| Requirement | Notes |
| --- | --- |
| VS Code | ^1.120.0 (Copilot Chat + BYOK) |
| GitHub Copilot | For Copilot Chat integration |
| Windows | Primary target (amd64 or arm64 portable zip) |
| GPU | NVIDIA recommended for large models (e.g. Gemma 4 26B) |
| Disk | ~1.4 GB for Ollama runtime (cached in extension global storage) + models under %USERPROFILE%\OllamaModels (13+ GB per large model) |
| Network | First launch only; downloads Ollama from GitHub releases |
Quick start
- Install the extension from the Marketplace and reload VS Code.
- On first run, wait for the Internal Ollama notification to finish downloading and extracting the runtime (~1.4 GB).
- Open the Internal Ollama output channel and confirm that it shows OLLAMA_CONTEXT_LENGTH=262144 and the Copilot BYOK endpoint.
- Run Ollama: Install Local Model (e.g. gemma4:26b-a4b-it-qat).
- In Copilot Chat → Manage Language Models, enable your model under the Ollama provider (not “Internal Ollama”).
- Start a new Copilot chat and select gemma4:26b-a4b-it-qat.
First load of a large model can take a minute while Ollama fits weights to VRAM.
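For reference, the pull performed by Ollama: Install Local Model can also be driven through Ollama's HTTP API (POST /api/pull), which streams one JSON object per line while downloading. The sketch below is not the extension's code; it only illustrates how such progress lines could be turned into a readable status. The helper names buildPullRequest and describePullLine are hypothetical.

```typescript
// Shape of one streamed line from POST /api/pull (subset of the documented fields).
interface PullProgress {
  status: string;
  digest?: string;
  total?: number;
  completed?: number;
}

// Request body for POST /api/pull; stream: true asks for line-by-line progress.
function buildPullRequest(model: string): { model: string; stream: boolean } {
  return { model, stream: true };
}

// Parse one JSON line and return a short progress string.
// Hypothetical helper for illustration only.
function describePullLine(line: string): string {
  const p = JSON.parse(line) as PullProgress;
  if (typeof p.total === "number" && p.total > 0 && typeof p.completed === "number") {
    const percent = Math.floor((p.completed / p.total) * 100);
    return `${p.status}: ${percent}%`;
  }
  // Lines such as "pulling manifest" or "success" carry no byte counts.
  return p.status;
}
```

Lines without total and completed fall through to the plain status text, so the same helper covers both the manifest phase and the layer downloads.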
Copilot integration
Default mode is internalOllama.copilotIntegration: "byok" — Copilot talks to the bundled server via /v1/chat/completions, the same path as a system Ollama install.
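To make the BYOK path concrete, here is a minimal sketch of the kind of OpenAI-style request that reaches /v1/chat/completions on the bundled server. It assumes the default port 11434 and a non-streaming call; the helper names are illustrative, and Copilot's actual request will carry more fields than shown.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  url: string;
  body: { model: string; messages: ChatMessage[]; stream: boolean };
}

// Build a chat request for the bundled server.
// The default port mirrors the internalOllama.port default (11434).
function buildChatRequest(model: string, prompt: string, port: number = 11434): ChatRequest {
  return {
    url: `http://127.0.0.1:${port}/v1/chat/completions`,
    body: {
      model,
      messages: [{ role: "user", content: prompt }],
      stream: false,
    },
  };
}

// Extract the assistant text from a non-streaming response, if present.
function firstChoiceText(response: unknown): string | undefined {
  if (typeof response !== "object" || response === null) return undefined;
  const choices = (response as { choices?: Array<{ message?: { content?: string } }> }).choices;
  return choices?.[0]?.message?.content;
}
```

To try it by hand, POST JSON.stringify(body) to url with a Content-Type: application/json header and pass the parsed JSON response to firstChoiceText.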
| Provider in Copilot | Use when |
| --- | --- |
| Ollama | Default; pick gemma4:26b-a4b-it-qat etc. |
| Internal Ollama | Only if you set copilotIntegration to "provider" (legacy) |
Optional copilot-* aliases are duplicate tags (FROM base-model). They appear under Ollama after Ollama: Create Optional Model Alias and a model-list refresh. You do not need an alias for Copilot to work.
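As a rough illustration of what "duplicate tag (FROM base-model)" means, an alias is a one-line Modelfile that points at the base model. The sketch below generates that text. The naming rule in aliasNameFor is purely hypothetical; this README does not document how the extension derives alias names.

```typescript
// Modelfile text for an alias: a single FROM line pointing at the base model.
function aliasModelfile(baseModel: string): string {
  return `FROM ${baseModel}\n`;
}

// Hypothetical naming rule (an assumption, not the extension's documented behavior):
// prefix "copilot-" and replace anything outside [a-z0-9._-] with "-".
function aliasNameFor(baseModel: string): string {
  return "copilot-" + baseModel.toLowerCase().replace(/[^a-z0-9._-]+/g, "-");
}
```

Because the alias shares the base model's weights, creating one should not download or duplicate the model data.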
Commands
| Command | Description |
| --- | --- |
| Ollama: Check Engine Status | Port, model count, RAM usage |
| Ollama: Open Chat | Local chat panel with workspace context |
| Ollama: Install Local Model | Pull from Ollama registry |
| Ollama: Create Optional Model Alias | Duplicate tag with a shorter name |
| Ollama: Set Model Storage Directory Path | Change OLLAMA_MODELS location |
| Ollama: Stop All Running Models (Free RAM) | Unload models from memory |
| Ollama: Refresh Copilot Model List | Re-apply BYOK endpoint / refresh provider |
| Ollama: Reinstall Runtime | Re-download Ollama if bootstrap failed or runtime is corrupt |
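For readers curious how a command like Stop All Running Models (Free RAM) can work against the Ollama HTTP API: GET /api/ps lists loaded models, and a generate request with keep_alive set to 0 asks Ollama to unload a model immediately. The sketch below builds those requests from a /api/ps response. It is an assumption about one possible approach, not the extension's actual implementation, and the function names are illustrative.

```typescript
// Subset of the GET /api/ps response (loaded models).
interface RunningModel {
  name: string;
  size: number;
  size_vram?: number;
}

interface PsResponse {
  models?: RunningModel[];
}

interface UnloadRequest {
  url: string;
  body: { model: string; keep_alive: number };
}

// One unload request per running model; keep_alive: 0 requests immediate eviction.
function buildUnloadRequests(ps: PsResponse, baseUrl: string = "http://127.0.0.1:11434"): UnloadRequest[] {
  return (ps.models ?? []).map((m) => ({
    url: `${baseUrl}/api/generate`,
    body: { model: m.name, keep_alive: 0 },
  }));
}

// Total bytes held by loaded models, the kind of figure a status command could report.
function totalLoadedBytes(ps: PsResponse): number {
  return (ps.models ?? []).reduce((sum, m) => sum + m.size, 0);
}
```

Each request would be sent as a POST with a JSON body; an empty models list simply yields no requests.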
Settings
| Setting | Default | Description |
| --- | --- | --- |
| internalOllama.port | 11434 | Bundled Ollama HTTP port |
| internalOllama.copilotIntegration | byok | byok (Copilot native) or provider (legacy LM provider) |
| internalOllama.contextLength | 0 | Override OLLAMA_CONTEXT_LENGTH; 0 = 262144 |
| internalOllama.defaultModel | "" | Default for inline completion and Open Chat |
| internalOllama.enableInlineCompletion | true | Tab completions via Ollama |
| internalOllama.ollamaVersion | 0.30.7 | Ollama version to download on first run |
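The relationship between these settings and the server's environment can be sketched as follows. This is a minimal illustration, assuming the extension passes the values to the Ollama process as environment variables; OLLAMA_CONTEXT_LENGTH, OLLAMA_NO_CLOUD, and OLLAMA_MODELS are named in this README, while the use of OLLAMA_HOST for the port and the EngineSettings shape are assumptions.

```typescript
const DEFAULT_CONTEXT_LENGTH = 262144;

// 0 (the setting's default) means "use 262144"; a positive integer overrides it.
function resolveContextLength(setting: number): number {
  return Number.isInteger(setting) && setting > 0 ? setting : DEFAULT_CONTEXT_LENGTH;
}

// Hypothetical settings shape for this sketch.
interface EngineSettings {
  port: number;
  contextLength: number;
  modelsDir?: string;
}

// Environment for the Ollama server process.
// Binding via OLLAMA_HOST is an assumption about how the port is applied.
function buildOllamaEnv(
  settings: EngineSettings,
  baseEnv: Record<string, string | undefined> = {},
): Record<string, string | undefined> {
  const env: Record<string, string | undefined> = {
    ...baseEnv,
    OLLAMA_HOST: `127.0.0.1:${settings.port}`,
    OLLAMA_CONTEXT_LENGTH: String(resolveContextLength(settings.contextLength)),
    OLLAMA_NO_CLOUD: "true",
  };
  // Only override the model directory when a custom path is configured.
  if (settings.modelsDir) {
    env["OLLAMA_MODELS"] = settings.modelsDir;
  }
  return env;
}
```

In a Node-based extension host, the result would typically be merged over process.env and passed as the env option when spawning the server.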
Workspace context (Open Chat)
The chat panel reads the same .github/ layout Copilot uses:
.github/copilot-instructions.md
.github/instructions/*.instructions.md
.github/prompts/*.prompt.md (slash commands)
.github/agents/*.agent.md
Compact versions of the README, package.json, and file tree, plus the active editor file, are appended to user messages so that this context survives small context windows.
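The layout above can be recognized with a few path patterns. The sketch below classifies a workspace-relative path and derives a slash command from a prompt file's name. It assumes slash commands are named after the file (as VS Code prompt files are); the function names and the ContextKind labels are illustrative, not the extension's internals.

```typescript
type ContextKind = "instructions" | "scoped-instructions" | "prompt" | "agent";

// Classify a workspace-relative path by the .github/ layout.
// Backslashes are normalized so Windows-style paths also match.
function classifyContextFile(relPath: string): ContextKind | undefined {
  const p = relPath.replace(/\\/g, "/");
  if (p === ".github/copilot-instructions.md") return "instructions";
  if (/^\.github\/instructions\/[^/]+\.instructions\.md$/.test(p)) return "scoped-instructions";
  if (/^\.github\/prompts\/[^/]+\.prompt\.md$/.test(p)) return "prompt";
  if (/^\.github\/agents\/[^/]+\.agent\.md$/.test(p)) return "agent";
  return undefined;
}

// Slash command for a prompt file, e.g. ".github/prompts/review.prompt.md" gives "/review".
function slashCommandFor(relPath: string): string | undefined {
  const m = /^\.github\/prompts\/([^/]+)\.prompt\.md$/.exec(relPath.replace(/\\/g, "/"));
  return m ? `/${m[1]}` : undefined;
}
```

Files outside these four patterns return undefined and would be ignored by a loader built on this sketch.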
Troubleshooting
“Model context is full (4096 tokens)”
Reload the window. Confirm the output channel shows OLLAMA_CONTEXT_LENGTH=262144. Use the Ollama provider in Copilot, not Internal Ollama. Start a new chat.
Copilot shows no models
Run Ollama: Refresh Copilot Model List. Check Settings → GitHub Copilot Chat → BYOK → Ollama Endpoint is http://127.0.0.1:11434.
Port 11434 in use
Quit any other Ollama instance or change internalOllama.port and refresh BYOK.
Alias “created” but not listed
This was fixed in 0.1.0: update to that version or later, then reload the window. Aliases are optional; you can always use the base model name under Ollama.
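Two of the checks above (the BYOK endpoint value and whether any models are installed) can be scripted. The sketch below compares a configured endpoint against the expected http://127.0.0.1:<port> form and extracts model names from a GET /api/tags response. The helper names are hypothetical, and the comparison rule (ignore surrounding whitespace, trailing slashes, and letter case) is an assumption chosen for convenience.

```typescript
// Subset of the GET /api/tags response (installed models).
interface TagsResponse {
  models?: Array<{ name: string }>;
}

// Names of installed models; an empty list suggests nothing has been pulled yet.
function installedModelNames(tags: TagsResponse): string[] {
  return (tags.models ?? []).map((m) => m.name);
}

// Expected BYOK endpoint for a given internalOllama.port value.
function expectedEndpoint(port: number): string {
  return `http://127.0.0.1:${port}`;
}

// True when the configured endpoint points at the bundled server.
function endpointMatches(configured: string, port: number): boolean {
  const normalized = configured.trim().replace(/\/+$/, "").toLowerCase();
  return normalized === expectedEndpoint(port);
}
```

If endpointMatches returns false after changing internalOllama.port, running Ollama: Refresh Copilot Model List re-applies the endpoint.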
Build from source
npm install
npm run compile
Press F5 in VS Code to launch the Extension Development Host.
Package a .vsix (≈1.4 GB — includes bundled Ollama runtime):
npm run vsix
Publish to the Marketplace (requires a publisher account):
npx @vscode/vsce@2.32.0 publish -p <YOUR_PAT>
Note: Latest vsce may fail secret-scanning on multi-GB bin/ files (ERR_STRING_TOO_LONG). The vsix script pins vsce@2.32.0 until that is fixed upstream.
Update publisher, repository, and bugs in package.json if your Marketplace ID or GitHub URL differs.
Third-party software
This extension bundles the Ollama runtime (bin/ollama.exe). Ollama is licensed under the MIT License. See NOTICES.md.
Model weights pulled via Ollama are subject to each model’s license (e.g. Gemma terms from Google).
License
Extension source code: MIT — Copyright (c) 2026 Internal Ollama contributors.