llama.vscode
Local LLM-assisted text completion extension for VS Code


Features
- Auto-suggest on input
- Accept a suggestion with Tab
- Accept the first line of a suggestion with Shift + Tab
- Accept the next word with Ctrl/Cmd + Right
- Toggle the suggestion manually by pressing Ctrl + L
- Control max text generation time
- Configure scope of context around the cursor
- Ring context with chunks from open and edited files and yanked text
- Supports very large contexts even on low-end hardware via smart context reuse
- Display performance stats
Installation
VS Code extension setup
Install the llama-vscode extension from the VS Code extension marketplace:

Note: also available at Open VSX
llama.cpp setup
The plugin requires a llama.cpp server instance to be running at the configured endpoint:
Mac OS
brew install llama.cpp
Any other OS
Either use the latest binaries or build llama.cpp from source. For more information on how to run the llama.cpp server, please refer to the Wiki.
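As one example, a source build typically looks like the sketch below. This is a rough outline, not the only supported path; consult the Wiki for platform-specific options such as GPU backends:
# clone and build llama.cpp with CMake (a typical source build; adjust for your platform)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# the server binary is then typically found at build/bin/llama-server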
llama.cpp settings
Here are recommended settings, depending on the amount of VRAM that you have:
More than 16GB VRAM:
llama-server \
-hf ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF \
--port 8012 -ngl 99 -fa -ub 1024 -b 1024 \
--ctx-size 0 --cache-reuse 256
Less than 16GB VRAM:
llama-server \
-hf ggml-org/Qwen2.5-Coder-3B-Q8_0-GGUF \
--port 8012 -ngl 99 -fa -ub 1024 -b 1024 \
--ctx-size 0 --cache-reuse 256
Less than 8GB VRAM:
llama-server \
-hf ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF \
--port 8012 -ngl 99 -fa -ub 1024 -b 1024 \
--ctx-size 0 --cache-reuse 256
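If the recommended model for your VRAM class still does not fit, one option is to offload only part of the model to the GPU instead of switching to CPU-only mode. A sketch reusing the 3B model from above; the right -ngl value (99 means all layers) depends on your GPU, so lower it until the server starts without running out of memory:
llama-server \
-hf ggml-org/Qwen2.5-Coder-3B-Q8_0-GGUF \
--port 8012 -ngl 20 -fa -ub 1024 -b 1024 \
--ctx-size 0 --cache-reuse 256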
CPU-only configs
These are llama-server settings for CPU-only hardware. Note that the quality will be significantly lower:
llama-server \
-hf ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF \
--port 8012 -ub 512 -b 512 --ctx-size 0 --cache-reuse 256
llama-server \
-hf ggml-org/Qwen2.5-Coder-0.5B-Q8_0-GGUF \
--port 8012 -ub 1024 -b 1024 --ctx-size 0 --cache-reuse 256
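Whichever configuration you pick, once llama-server is running you can verify that the endpoint is reachable before returning to VS Code. A minimal check, assuming the default port 8012 from the examples above:
# should report an "ok" status once the model has finished loading
curl http://127.0.0.1:8012/health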
You can use any other FIM-compatible model that your system can handle. By default, the models downloaded with the -hf flag are stored in:
- Mac OS: ~/Library/Caches/llama.cpp/
- Linux: ~/.cache/llama.cpp
- Windows: LOCALAPPDATA
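To see which models have already been downloaded, you can inspect the cache directory listed above directly. The LLAMA_CACHE override in the second command is an assumption about current llama.cpp behavior; skip it if your build does not honor that variable:
# list downloaded models on Linux (use the corresponding path on Mac OS or Windows)
ls -lh ~/.cache/llama.cpp
# optionally store models in a custom directory (assumed environment variable)
LLAMA_CACHE=/path/to/model-cache llama-server \
-hf ggml-org/Qwen2.5-Coder-3B-Q8_0-GGUF \
--port 8012 -ngl 99 -fa -ub 1024 -b 1024 --ctx-size 0 --cache-reuse 256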
Recommended LLMs
The plugin requires FIM-compatible models: HF collection
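If you already have a FIM-compatible GGUF on disk, you can point llama-server at it directly with -m instead of downloading via -hf. The file path below is a placeholder; the other flags mirror the recommended settings above:
llama-server \
-m /path/to/your-fim-model.gguf \
--port 8012 -ngl 99 -fa -ub 1024 -b 1024 \
--ctx-size 0 --cache-reuse 256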
Examples
Speculative FIMs running locally on an M2 Studio:
https://github.com/user-attachments/assets/cab99b93-4712-40b4-9c8d-cf86e98d4482
Implementation details
The extension aims to be simple and lightweight while providing high-quality, performant local FIM completions, even on consumer-grade hardware.
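Under the hood, the extension sends fill-in-the-middle requests to the llama.cpp server over HTTP. If you want to see what such a request looks like independently of VS Code, you can issue one by hand; the field names below follow the llama.cpp server's infill endpoint documentation, and the snippet content is just a placeholder:
# a hand-rolled FIM request against the local server (placeholder prefix/suffix)
curl http://127.0.0.1:8012/infill \
-H "Content-Type: application/json" \
-d '{"input_prefix": "def add(a, b):\n    return ", "input_suffix": "\n", "n_predict": 16}'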
Other IDEs