Internal Ollama

Run a private, bundled Ollama engine inside VS Code — no separate Ollama desktop install required. Use local models in GitHub Copilot Chat (BYOK), the built-in Ollama Chat panel, and inline tab completions.

All inference stays on your machine. The extension sets OLLAMA_NO_CLOUD=true and does not phone home to ollama.com.

Features

Auto-bootstrap Ollama: downloads the official portable runtime on first run (~1.4 GB, one time); development (F5) launches use a local bin/ollama.exe when present
Copilot Chat (BYOK): auto-configures github.copilot.chat.byok.ollamaEndpoint to point at the bundled server
Large-model support: sets OLLAMA_CONTEXT_LENGTH=262144 (same default as the Ollama desktop app); Ollama fits layers and the KV cache to your GPU
Ollama Chat panel: local chat with workspace context (.github/copilot-instructions.md, agents, prompts)
Inline completions: optional tab completion powered by your default model
Model management: pull models, create optional aliases, unload models from RAM, set a custom model storage path
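
As a rough illustration of the environment variables named above, the sketch below assembles the environment a bundled server process might be started with. The helper name `buildServerEnv` and the `ServerOptions` shape are hypothetical and not taken from the extension's source; `OLLAMA_HOST` and `OLLAMA_MODELS` are standard Ollama variables, and how the extension actually wires them is an assumption.

```typescript
// Hypothetical options; field names are illustrative, not the extension's own.
interface ServerOptions {
  port: number;          // value of internalOllama.port
  contextLength: number; // tokens, already resolved (262144 by default)
  modelsDir?: string;    // custom model storage path, if one is set
}

// Builds the environment for a private `ollama serve` process.
function buildServerEnv(opts: ServerOptions): Record<string, string> {
  const serverEnv: Record<string, string> = {
    // Bind to loopback only so the server is not reachable from the network.
    OLLAMA_HOST: `127.0.0.1:${opts.port}`,
    // Named in this README: keeps inference local.
    OLLAMA_NO_CLOUD: "true",
    OLLAMA_CONTEXT_LENGTH: String(opts.contextLength),
  };
  if (opts.modelsDir) {
    serverEnv.OLLAMA_MODELS = opts.modelsDir;
  }
  return serverEnv;
}

const defaultServerEnv = buildServerEnv({ port: 11434, contextLength: 262144 });
```

The resulting object could be merged over the parent environment when spawning the runtime; that spawning step is left out because it depends on the extension's internals.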

Requirements

Requirement	Notes
VS Code	`^1.120.0` (Copilot Chat + BYOK)
GitHub Copilot	For Copilot Chat integration
Windows	Primary target (amd64 or arm64 portable zip)
GPU	NVIDIA recommended for large models (e.g. Gemma 4 26B)
Disk	~1.4 GB for Ollama runtime (cached in extension global storage) + models under `%USERPROFILE%\OllamaModels` (13+ GB per large model)
Network	First launch only — downloads Ollama from GitHub releases
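
The first-launch download can be pictured as picking a release asset by CPU architecture. The sketch below is an assumption about how that might look: the function names are hypothetical, and the URL pattern (`ollama-windows-<arch>.zip` under a `v<version>` tag in the ollama/ollama repository) is assumed from the asset names mentioned in the table, not confirmed by the extension's source.

```typescript
type WindowsArch = "amd64" | "arm64";

// Maps Node-style architecture names ("x64", "arm64") to Ollama's asset naming.
function toOllamaArch(nodeArch: string): WindowsArch {
  if (nodeArch === "x64") return "amd64";
  if (nodeArch === "arm64") return "arm64";
  throw new Error(`Unsupported architecture: ${nodeArch}`);
}

// Assumed URL pattern for the portable Windows zip on GitHub releases.
function runtimeDownloadUrl(version: string, nodeArch: string): string {
  const arch = toOllamaArch(nodeArch);
  return `https://github.com/ollama/ollama/releases/download/v${version}/ollama-windows-${arch}.zip`;
}

const defaultRuntimeUrl = runtimeDownloadUrl("0.30.7", "x64");
```

Failing fast on an unknown architecture is a choice made for this sketch; it avoids downloading a 1.4 GB archive that cannot run.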

Quick start

Install the extension from the Marketplace and reload VS Code.
On first run, wait for the Internal Ollama notification to finish downloading and extracting the runtime (~1.4 GB).
Open the Internal Ollama output channel and confirm that it reports OLLAMA_CONTEXT_LENGTH=262144 and the Copilot BYOK endpoint.
Run Ollama: Install Local Model (e.g. gemma4:26b-a4b-it-qat).
In Copilot Chat → Manage Language Models → enable your model under the Ollama provider (not “Internal Ollama”).
Start a new Copilot chat and select gemma4:26b-a4b-it-qat.

First load of a large model can take a minute while Ollama fits weights to VRAM.

Copilot integration

Default mode is internalOllama.copilotIntegration: "byok" — Copilot talks to the bundled server via /v1/chat/completions, the same path as a system Ollama install.
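
To make the /v1/chat/completions path concrete, here is a minimal sketch of the kind of OpenAI-compatible request a client sends to the bundled server. `buildChatRequest` and the interfaces are hypothetical names for illustration; Copilot constructs its own requests, and the message content is made up.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Describes a non-streaming chat completion call against the local server.
function buildChatRequest(port: number, model: string, messages: ChatMessage[]): ChatRequest {
  return {
    url: `http://127.0.0.1:${port}/v1/chat/completions`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages, stream: false }),
  };
}

const chatRequest = buildChatRequest(11434, "gemma4:26b-a4b-it-qat", [
  { role: "user", content: "Explain this function." },
]);
// On Node 18+ this could be sent with the built-in fetch, for example:
// fetch(chatRequest.url, { method: chatRequest.method, headers: chatRequest.headers, body: chatRequest.body })
```

Sending a request like this by hand is a quick way to check that the server answers on the configured port before looking at Copilot itself.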

Provider in Copilot	Use when
Ollama	Default — pick `gemma4:26b-a4b-it-qat` etc.
Internal Ollama	Only if you set `copilotIntegration` to `"provider"` (legacy)

Optional copilot-* aliases are duplicate tags (FROM base-model). They appear under the Ollama provider after you run Ollama: Create Optional Model Alias and refresh the model list. You do not need an alias for Copilot to work.
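
An alias of this kind amounts to a one-line Modelfile. The sketch below shows one way to derive an alias name and the matching Modelfile text. Both helpers are hypothetical, and the naming rule (drop the tag and any namespace, then add the `copilot-` prefix) is an assumption for illustration, not the extension's documented behavior.

```typescript
// Hypothetical naming rule: "gemma4:26b-a4b-it-qat" -> "copilot-gemma4".
function aliasName(baseModel: string): string {
  const colon = baseModel.indexOf(":");
  const withoutTag = colon === -1 ? baseModel : baseModel.slice(0, colon);
  // Keep only the last path segment, dropping a namespace such as "library/".
  const shortName = withoutTag.slice(withoutTag.lastIndexOf("/") + 1);
  return `copilot-${shortName}`;
}

// A duplicate tag needs nothing more than a FROM line pointing at the base model.
function aliasModelfile(baseModel: string): string {
  return `FROM ${baseModel}\n`;
}

const exampleAlias = aliasName("gemma4:26b-a4b-it-qat");
const exampleModelfile = aliasModelfile("gemma4:26b-a4b-it-qat");
```

With a system Ollama install, the equivalent manual step is `ollama create <alias> -f Modelfile`. Because the alias shares the base model's weights, it should not need a second copy of them on disk.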

Commands

Command	Description
`Ollama: Check Engine Status`	Port, model count, RAM usage
`Ollama: Open Chat`	Local chat panel with workspace context
`Ollama: Install Local Model`	Pull from Ollama registry
`Ollama: Create Optional Model Alias`	Duplicate tag with a shorter name
`Ollama: Set Model Storage Directory Path`	Change `OLLAMA_MODELS` location
`Ollama: Stop All Running Models (Free RAM)`	Unload models from memory
`Ollama: Refresh Copilot Model List`	Re-apply BYOK endpoint / refresh provider
`Ollama: Reinstall Runtime`	Re-download Ollama if bootstrap failed or runtime is corrupt
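
For readers curious how freeing RAM can work against an Ollama server, the sketch below turns a running-model listing into unload requests. It relies on two parts of Ollama's HTTP API: `GET /api/ps` lists loaded models, and a generate request with `keep_alive: 0` asks the server to unload one. Whether the extension uses exactly these calls is an assumption, and the helper and interface names are hypothetical.

```typescript
// Minimal shape of the /api/ps response; only the field used here is declared.
interface RunningModels {
  models: { name: string }[];
}

interface UnloadRequest {
  url: string;
  body: string;
}

// One POST per loaded model; keep_alive: 0 requests an immediate unload.
function buildUnloadRequests(port: number, running: RunningModels): UnloadRequest[] {
  return running.models.map((m) => ({
    url: `http://127.0.0.1:${port}/api/generate`,
    body: JSON.stringify({ model: m.name, keep_alive: 0 }),
  }));
}

const exampleUnloads = buildUnloadRequests(11434, {
  models: [{ name: "gemma4:26b-a4b-it-qat" }],
});
```

An empty `models` array yields no requests, so running the command with nothing loaded does no work.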

Settings

Setting	Default	Description
`internalOllama.port`	`11434`	Bundled Ollama HTTP port
`internalOllama.copilotIntegration`	`byok`	`byok` (Copilot native) or `provider` (legacy LM provider)
`internalOllama.contextLength`	`0`	Override `OLLAMA_CONTEXT_LENGTH`; `0` = 262144
`internalOllama.defaultModel`	`""`	Default for inline completion and Open Chat
`internalOllama.enableInlineCompletion`	`true`	Tab completions via Ollama
`internalOllama.ollamaVersion`	`0.30.7`	Ollama version to download on first run
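
The `0` = 262144 rule for `internalOllama.contextLength` can be written as a small resolver. This is a sketch only: `effectiveContextLength` is a hypothetical name, and rejecting negative or fractional values is a choice made here, not documented behavior.

```typescript
const DEFAULT_CONTEXT_LENGTH = 262144;

// Maps the setting to the value exported as OLLAMA_CONTEXT_LENGTH.
function effectiveContextLength(setting: number): number {
  const isWholeNumber = isFinite(setting) && Math.floor(setting) === setting;
  if (!isWholeNumber || setting < 0) {
    throw new RangeError(`contextLength must be a non-negative integer, got ${setting}`);
  }
  // 0 means "no override": fall back to the default.
  return setting === 0 ? DEFAULT_CONTEXT_LENGTH : setting;
}

const resolvedDefault = effectiveContextLength(0);      // 262144
const resolvedOverride = effectiveContextLength(32768); // 32768
```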

Workspace context (Open Chat)

The chat panel reads the same .github/ layout Copilot uses:

.github/copilot-instructions.md
.github/instructions/*.instructions.md
.github/prompts/*.prompt.md (slash commands)
.github/agents/*.agent.md

Compact README, package.json, file tree, and the active editor file are appended to user messages so context survives small context windows.
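
The .github/ layout listed above can be matched with a few path patterns. The sketch below classifies a workspace-relative path and derives a slash command from a prompt file name. Both function names are hypothetical, and deriving the command from the file name (review.prompt.md becomes /review) mirrors how prompt files are invoked in Copilot Chat; that this extension does the same is an assumption.

```typescript
type ContextKind = "instructions" | "scoped-instructions" | "prompt" | "agent";

// Returns the kind of context file, or undefined for any other path.
function classifyContextFile(relativePath: string): ContextKind | undefined {
  const p = relativePath.replace(/\\/g, "/"); // normalize Windows separators
  if (p === ".github/copilot-instructions.md") return "instructions";
  if (/^\.github\/instructions\/[^/]+\.instructions\.md$/.test(p)) return "scoped-instructions";
  if (/^\.github\/prompts\/[^/]+\.prompt\.md$/.test(p)) return "prompt";
  if (/^\.github\/agents\/[^/]+\.agent\.md$/.test(p)) return "agent";
  return undefined;
}

// ".github/prompts/review.prompt.md" -> "/review"
function slashCommandFor(promptPath: string): string | undefined {
  const match = /([^/\\]+)\.prompt\.md$/.exec(promptPath);
  return match ? `/${match[1]}` : undefined;
}

const exampleKind = classifyContextFile(".github/agents/reviewer.agent.md");
const exampleCommand = slashCommandFor(".github/prompts/review.prompt.md");
```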

Troubleshooting

“Model context is full (4096 tokens)”
Reload the window. Confirm the output channel shows OLLAMA_CONTEXT_LENGTH=262144. Use the Ollama provider in Copilot, not Internal Ollama. Start a new chat.

Copilot shows no models
Run Ollama: Refresh Copilot Model List. Check that Settings → GitHub Copilot Chat → BYOK → Ollama Endpoint is set to http://127.0.0.1:11434.
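
When Copilot lists no models, it can help to check what the server itself reports. Ollama's `GET /api/tags` returns the installed models, and the sketch below builds that URL from the BYOK endpoint and extracts the names from a response. The helper names are hypothetical, and the response interface declares only the field used here.

```typescript
// Minimal shape of the /api/tags response.
interface TagsResponse {
  models?: { name: string }[];
}

// Builds the model-list URL, tolerating trailing slashes on the endpoint.
function tagsUrl(endpoint: string): string {
  return endpoint.replace(/\/+$/, "") + "/api/tags";
}

// Returns installed model names in alphabetical order.
function installedModelNames(response: TagsResponse): string[] {
  return (response.models ?? []).map((m) => m.name).sort();
}

const exampleTagsUrl = tagsUrl("http://127.0.0.1:11434/");
```

If the list that comes back is empty, no model has been pulled yet, and Ollama: Install Local Model is the next step rather than a Copilot setting.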

Port 11434 in use
Quit any other Ollama instance, or change internalOllama.port and then run Ollama: Refresh Copilot Model List to re-apply the BYOK endpoint.

Alias “created” but not listed
Update to version 0.1.0 or later, which includes the fix for this, and reload the window. Aliases are optional; you can always select the base model name under Ollama.

Build from source

npm install
npm run compile

Press F5 in VS Code to launch the Extension Development Host.

Package a .vsix (≈1.4 GB — includes bundled Ollama runtime):

npm run vsix

Publish to the Marketplace (requires a publisher account):

npx @vscode/vsce@2.32.0 publish -p <YOUR_PAT>

Note: Recent vsce releases may fail during secret scanning on multi-GB bin/ files (ERR_STRING_TOO_LONG). The vsix script pins vsce@2.32.0 until this is fixed upstream.

Update publisher, repository, and bugs in package.json if your Marketplace ID or GitHub URL differs.

Third-party software

This extension bundles the Ollama runtime (bin/ollama.exe). Ollama is licensed under the MIT License. See NOTICES.md.

Model weights pulled via Ollama are subject to each model’s license (e.g. Gemma terms from Google).
