llama.vscode

Local LLM-assisted text completion extension for VS Code

(demo video: llama vscode-swift0)

Features

  • Auto-suggest on input
  • Accept a suggestion with Tab
  • Accept the first line of a suggestion with Shift + Tab
  • Accept the next word with Ctrl/Cmd + Right
  • Toggle the suggestion manually by pressing Ctrl + L
  • Control max text generation time
  • Configure scope of context around the cursor
  • Ring context with chunks from open and edited files and yanked text
  • Supports very large contexts even on low-end hardware via smart context reuse
  • Display performance stats

Installation

VS Code extension setup

Install the llama-vscode extension from the VS Code extension marketplace:

Note: also available at Open VSX
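
If you prefer the command line, the extension can also be installed with the VS Code CLI. A minimal sketch, assuming the Marketplace identifier is ggml-org.llama-vscode (check the extension's Marketplace page for the exact ID):

# install the extension via the VS Code CLI
code --install-extension ggml-org.llama-vscode

# confirm it shows up in the installed extensions list
code --list-extensions | grep -i llama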

llama.cpp setup

The plugin requires a llama.cpp server instance to be running at the configured endpoint:

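To confirm that a server is reachable at the configured endpoint before relying on completions, you can query its health endpoint. A minimal check, assuming the endpoint uses the same port as the server configs later in this README (http://127.0.0.1:8012):

# should report an OK status once the model has finished loading
curl http://127.0.0.1:8012/health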

macOS

brew install llama.cpp

Any other OS

Either download the latest release binaries or build llama.cpp from source. For more information on how to run the llama.cpp server, please refer to the llama.cpp Wiki.
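
For reference, building from source follows the standard llama.cpp CMake flow; a rough sketch (add the backend flags for your GPU as needed):

# clone the repository and build just the server target in Release mode
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release --target llama-server

# the resulting binary is placed under build/bin/
./build/bin/llama-server --help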

llama.cpp settings

Here are recommended settings, depending on the amount of VRAM that you have:

  • More than 16GB VRAM:

    llama-server --fim-qwen-7b-default
    
  • Less than 16GB VRAM:

    llama-server --fim-qwen-3b-default
    
  • Less than 8GB VRAM:

    llama-server --fim-qwen-1.5b-default
    
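The --fim-qwen-*-default flags are convenience presets that pick a FIM model from Hugging Face and apply server defaults suited for completion. As a rough illustration only (the model repo and flag values below are an approximation, not the preset's literal definition; see llama-server --help for the presets your build provides), the 7B preset corresponds to something like:

# approximate hand-written equivalent of --fim-qwen-7b-default
llama-server \
    -hf ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF \
    --port 8012 --ctx-size 0 --cache-reuse 256
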
CPU-only configs

These are llama-server settings for CPU-only hardware. Note that the quality will be significantly lower:

llama-server \
    -hf ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF \
    --port 8012 -ub 512 -b 512 --ctx-size 0 --cache-reuse 256

llama-server \
    -hf ggml-org/Qwen2.5-Coder-0.5B-Q8_0-GGUF \
    --port 8012 -ub 1024 -b 1024 --ctx-size 0 --cache-reuse 256

You can use any other FIM-compatible model that your system can handle. By default, the models downloaded with the -hf flag are stored in:

  • macOS: ~/Library/Caches/llama.cpp/
  • Linux: ~/.cache/llama.cpp
  • Windows: %LOCALAPPDATA%\llama.cpp\
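
If you want to see which models have already been downloaded, or to free disk space, you can inspect that cache directly. A small sketch using the Linux path above (substitute the macOS or Windows path as appropriate):

# list the downloaded GGUF files and their sizes
ls -lh ~/.cache/llama.cpp/

# remove a model you no longer need (replace the placeholder with an actual file name)
rm ~/.cache/llama.cpp/<model-file>.gguf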

Recommended LLMs

The plugin requires FIM-compatible models: HF collection

Examples

Speculative FIMs running locally on an M2 Studio:

https://github.com/user-attachments/assets/cab99b93-4712-40b4-9c8d-cf86e98d4482

Implementation details

The extension aims to be simple and lightweight while still providing high-quality, performant local FIM completions, even on consumer-grade hardware.

  • The initial implementation was done by Ivaylo Gardev @igardev using the llama.vim plugin as a reference
  • Technical description: https://github.com/ggerganov/llama.cpp/pull/9787

Other IDEs

  • Vim/Neovim: https://github.com/ggml-org/llama.vim