Copilot for llama-server LLMs

A VS Code extension that integrates llama-server LLMs as language model chat providers, enabling local AI-powered coding assistance directly in VS Code.

Note: This extension has no affiliation with llama.cpp or its maintainers. It is an independent third-party extension that provides integration with llama-server.

Before using this extension, you need to install and run llama-server from the llama.cpp project. Follow the quick start guide to get started.

Installing llama.cpp

You can install llama.cpp in several ways:

  • Using package managers: Install using brew, nix, or winget
  • Docker: Run with Docker - see the Docker documentation
  • Pre-built binaries: Download from the releases page
  • Build from source: Clone the repository and build - check out the build guide

Once installed, you'll need a model to work with. Head to the Obtaining and quantizing models section to learn more.
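
For example, on macOS with Homebrew a minimal setup can look like the sketch below. The -hf model reference is only an illustration (it reuses a model from the configuration example later in this document); pick a model that fits your hardware, and check the llama.cpp documentation for your platform's install method.

#!/bin/bash
# Sketch: install llama.cpp (which provides the llama-server binary) and run a
# quick smoke test. The -hf flag asks llama-server to download a GGUF model
# from Hugging Face; the repository shown here is only an example.
brew install llama.cpp

llama-server -hf unsloth/Qwen3-4B-128K-GGUF:Q8_K_XL --port 8013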

Starting llama-server

After installing llama.cpp, you need to start llama-server with your models configured. Here's an example startup script (see examples/start-llms):

#!/bin/bash

llama-server --port 8013 --models-preset ./models.ini

Key Flags

  • --port: Specifies the port on which the server will listen (default: 8080)
  • --models-preset: Path to your models configuration file (INI format)

The server will start and load models according to your configuration. Make sure the server is running before configuring the VS Code extension.
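
Before configuring the extension, it helps to confirm the server is reachable. A minimal check, assuming llama-server's standard /health endpoint and the port used above (while a model is still loading, the endpoint typically reports a loading status rather than ok):

# Sanity check against a running llama-server instance; adjust the port to
# match your --port flag.
curl -s http://localhost:8013/health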

Model Configuration

Models are configured using an INI file format. See examples/models.ini for a complete example. Here's an example for a MacBook with 128GB RAM:

[nemotron-3-nano-30b]
jinja = true
ctx-size = 256000
temp = 1.0
top-p = 1.00
fit = on
hf = unsloth/Nemotron-3-Nano-30B-A3B-GGUF:BF16

[qwen3-4b]
jinja = true
ctx-size = 32768
temp = 0.6
min-p = 0.0
top-p = 0.95
top-k = 20
hf = unsloth/Qwen3-4B-128K-GGUF:Q8_K_XL

[glm-4-7-flash]
jinja = true
ctx-size = 202752
temp = 0.7
top-p = 1.0
min-p = 0.01
repeat-penalty = 1.0
hf = unsloth/GLM-4.7-Flash-GGUF:BF16

Configuration Options

Consult the model card or modelfile that accompanies your model for its recommended settings. All available settings are documented in the llama-server readme.

Memory Considerations

Pick models and context sizes that fit within your machine's available memory; a rough estimation sketch follows the lists below.

  • Context size: Larger context sizes require more RAM

    • 1,000,000 tokens ≈ 133GB RAM
    • 256,000 tokens ≈ 33GB RAM
    • 128,000 tokens ≈ 17GB RAM
    • 32,768 tokens ≈ 4GB RAM
  • Quantization: Smaller quantizations use less RAM

    • BF16: ≈2 bytes per parameter (a 30B model needs roughly 60GB)
    • Q8_0: ≈1 byte per parameter (roughly 30GB)
    • Q4_0: ≈0.5 bytes per parameter (roughly 15GB)
    • Q1_0: ≈0.125 bytes per parameter (roughly 4GB)
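
To turn these rules of thumb into a quick estimate, the sketch below combines the two lists: weights scale with bytes per parameter, and context scales at roughly 133GB per million tokens. The numbers are the rough approximations from above and vary with model architecture and KV-cache settings.

#!/bin/bash
# Back-of-the-envelope RAM estimate using the approximate figures above.
PARAMS_B=30          # model size in billions of parameters
BYTES_PER_PARAM=1    # BF16 ≈ 2, Q8_0 ≈ 1, Q4_0 ≈ 0.5
CTX_TOKENS=32768     # requested context size

WEIGHTS_GB=$(echo "$PARAMS_B * $BYTES_PER_PARAM" | bc)
CTX_GB=$(echo "$CTX_TOKENS * 133 / 1000000" | bc)   # ~133GB per 1M tokens
TOTAL_GB=$(echo "$WEIGHTS_GB + $CTX_GB" | bc)

echo "Estimated RAM: ~${TOTAL_GB}GB (weights ~${WEIGHTS_GB}GB + context ~${CTX_GB}GB)"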

VS Code Extension Configuration

Configure the extension by adding endpoint settings to your VS Code settings (File → Preferences → Settings, or edit settings.json directly).

Basic Configuration

{
  "llamaCopilot.endpoints": {
    "local": {
      "url": "http://localhost:8013"
    }
  }
}

Endpoint Identifiers

Each endpoint has an identifier (e.g., "local"). Models from that endpoint will be displayed with the suffix @identifier (e.g., my-model@local). This allows you to:

  • Connect to multiple llama-server instances
  • Distinguish between models from different endpoints
  • Configure different settings per endpoint

Multiple Endpoints

You can configure multiple endpoints:

{
  "llamaCopilot.endpoints": {
    "local": {
      "url": "http://localhost:8013"
    },
    "remote": {
      "url": "http://192.168.1.100:8080",
      "apiToken": "your-api-token-here"
    }
  }
}

Parameter Overrides

You can override generation parameters (temperature, top_p, etc.) at both the endpoint and model level. These overrides are merged into the request body sent to llama-server.

Endpoint-Level Overrides

Apply parameters to all models on an endpoint:

{
  "llamaCopilot.endpoints": {
    "local": {
      "url": "http://localhost:8013",
      "requestBody": {
        "temperature": 0.7,
        "top_p": 0.95,
        "top_k": 40,
        "min_p": 0.01,
        "repeat_penalty": 1.1,
        "max_tokens": 2048
      }
    }
  }
}

Model-Level Overrides

Override parameters for specific models. Model-level requestBody properties override endpoint-level properties:

{
  "llamaCopilot.endpoints": {
    "local": {
      "url": "http://localhost:8013",
      "requestBody": {
        "temperature": 0.7,
        "top_p": 0.95
      },
      "models": {
        "my-model": {
          "requestBody": {
            "temperature": 0.6,
            "top_p": 0.9,
            "top_k": 40
          }
        }
      }
    }
  }
}

In this example, my-model will use temperature: 0.6, top_p: 0.9, and top_k: 40, while other models on the local endpoint will use temperature: 0.7 and top_p: 0.95.
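
As a rough illustration of what this merge produces, the curl call below approximates the request the extension would send for my-model to llama-server's OpenAI-compatible chat endpoint. The message content is a placeholder; the extension constructs the real request internally.

# Illustrative only: approximate request body for my-model after merging the
# endpoint-level and model-level requestBody settings shown above.
curl -s http://localhost:8013/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.6,
    "top_p": 0.9,
    "top_k": 40
  }'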

Common Parameters

  • temperature (number): Controls randomness (0.0 = deterministic, 2.0 = very creative)
  • top_p (number): Nucleus sampling threshold (0.0 to 1.0)
  • top_k (number): Top-k sampling (number of tokens to consider)
  • min_p (number): Minimum probability threshold
  • repeat_penalty (number): Penalty for repeating tokens (1.0 = no penalty, >1.0 = penalty)
  • max_tokens (number): Maximum number of tokens to generate

Advanced Configuration

Headers

Add custom headers for authentication or other purposes:

{
  "llamaCopilot.endpoints": {
    "local": {
      "url": "http://localhost:8013",
      "headers": {
        "X-Custom-Header": "value"
      },
      "models": {
        "my-model": {
          "headers": {
            "X-Model-Specific": "value"
          }
        }
      }
    }
  }
}

Model-level headers override endpoint-level headers.

API Token Authentication

For authenticated endpoints:

{
  "llamaCopilot.endpoints": {
    "secure": {
      "url": "https://api.example.com",
      "apiToken": "your-bearer-token-here"
    }
  }
}

The token will be sent as Authorization: Bearer <token> in all requests.
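
For reference, here is a curl equivalent of an authenticated request, using the placeholder URL and token from the example above:

# The extension adds this header automatically when apiToken is configured.
curl -s https://api.example.com/v1/models \
  -H "Authorization: Bearer your-bearer-token-here"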

Context Size Overrides

Override the context size for a specific model:

{
  "llamaCopilot.endpoints": {
    "local": {
      "url": "http://localhost:8013",
      "models": {
        "large-model": {
          "contextSize": 256000
        }
      }
    }
  }
}

Max Output Tokens Overrides

Override the maximum output tokens:

{
  "llamaCopilot.endpoints": {
    "local": {
      "url": "http://localhost:8013",
      "models": {
        "my-model": {
          "maxOutputTokens": 4096
        }
      }
    }
  }
}

Capabilities Configuration

Configure model capabilities:

{
  "llamaCopilot.endpoints": {
    "local": {
      "url": "http://localhost:8013",
      "models": {
        "multimodal-model": {
          "capabilities": {
            "imageInput": true,
            "toolCalling": true
          }
        },
        "tool-model": {
          "capabilities": {
            "toolCalling": 10
          }
        }
      }
    }
  }
}
  • imageInput (boolean): Whether the model supports image input
  • toolCalling (boolean | number): Whether the model supports tool calling. Can be a boolean or a number (maximum number of tools)

Cursor Rules Integration

The extension includes a built-in tool that gives the LLM access to your project's cursor rules from .cursor/rules/. This allows the model to access project-specific guidelines, coding standards, and best practices automatically.

How It Works

  1. Rule Discovery: The extension automatically reads all .md and .mdc files from .cursor/rules/ in your workspace
  2. Glob Matching: Rules with glob patterns in their frontmatter are matched against:
    • File attachments (e.g., @src/logger.ts:32)
    • User messages
    • Assistant messages
    • Tool call parameters (first 1024 bytes)
  3. Session Scoping: Available rules are tracked per chat session. When a glob matches, that rule becomes available for that conversation
  4. Tool Exposure: When rules are available, a get-project-rule tool is automatically exposed to the LLM

Rule File Format

Rules can be simple markdown files (.md) or markdown files with frontmatter (.mdc):

Simple rule (.md):

# Coding Guidelines

Always use TypeScript strict mode.
Prefer async/await over promises.

Rule with frontmatter (.mdc):

---
description: "TypeScript coding standards"
globs: ["**/*.ts", "**/*.tsx"]
alwaysApply: false
---

# TypeScript Guidelines

- Use strict mode
- Prefer interfaces over types for object shapes
- Use const assertions where appropriate

Glob Pattern Matching

Glob patterns are converted to regex patterns (a worked example follows this list):

  • * matches any characters except path separators: [a-zA-Z0-9.~@+=_|-]
  • ** matches any characters including path separators: [a-zA-Z0-9.~@+=_|\/-]
  • Both Windows (\) and Unix (/) path separators are supported
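
For instance, under these conversion rules a pattern such as **/*.tsx behaves roughly like the hand-converted regex below. This is only an illustration of the matching behaviour; the exact regex the extension builds internally may differ.

# Illustrative: testing paths against a hand-converted form of "**/*.tsx",
# using the character classes described above.
PATTERN='^[a-zA-Z0-9.~@+=_|/-]*/[a-zA-Z0-9.~@+=_|-]*\.tsx$'

echo "src/components/Button.tsx" | grep -Eq "$PATTERN" && echo "matches"
echo "src/logger.ts"             | grep -Eq "$PATTERN" || echo "does not match"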

Tool Usage

When rules are available, the LLM can call the get-project-rule tool:

get-project-rule(rule: "coding-guidelines.md,style/markdown.mdc")

The tool supports:

  • Comma-separated rule names
  • Optional rule: prefix (e.g., rule:style.md or just style.md)
  • Fuzzy matching: If a rule isn't found exactly, the closest match (within Levenshtein distance 8) is used
  • Returns <empty file> if no matching rule is found

Configuration

Enable or disable the cursor rules feature in settings:

{
  "llamaCopilot.enableCursorRules": true
}

When disabled:

  • Rules are not parsed
  • The tool is not exposed to the LLM
  • No performance overhead from rule matching

Example

  1. Create a rule file .cursor/rules/typescript.md:

# TypeScript Rules

Always use explicit return types for functions.
Prefer `interface` over `type` for object shapes.

  2. Create a rule with glob matching .cursor/rules/react-components.mdc:

---
description: "React component guidelines"
globs: ["**/*.tsx", "src/components/**"]
---

# React Components

- Use functional components with hooks
- Extract complex logic into custom hooks
- Use React.memo for expensive components

  3. When you mention a file matching the glob (e.g., @src/components/Button.tsx), the rule becomes available to the LLM automatically.

Usage

Selecting Models

  1. Open the Command Palette (Ctrl+Shift+P / Cmd+Shift+P)
  2. Type "Chat: Start Session" or use the chat interface
  3. Select a model from the list (models appear as model-name@endpoint-id)

Using the Chat Interface

  • Start a chat session with a selected model
  • The extension supports tool calling if the model supports it
  • Models with image input capability can process images

Opening Settings

Use the command "Open Endpoint Settings" to quickly access the configuration, or navigate to Settings and search for "llamaCopilot".

Troubleshooting

Server Not Found

  • Ensure llama-server is running
  • Check that the URL in your configuration matches the server's address and port
  • Verify the server is accessible (try opening the URL in a browser)

Models Not Appearing

  • Check that models are loaded in llama-server (visit the /models endpoint, or try the curl check shown after this list)
  • Ensure models don't have "/" in their ID (these are filtered out)
  • Verify the endpoint URL is correct
  • Check the VS Code output panel for error messages (View → Output → "LLaMA Server API")
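
You can also query the server directly from a terminal; a quick sketch assuming llama-server's OpenAI-compatible /v1/models endpoint:

# List the model IDs the server currently exposes. IDs containing "/" are
# filtered out by the extension and will not appear in VS Code.
curl -s http://localhost:8013/v1/models | python3 -m json.tool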

Configuration Errors

  • Validate your JSON syntax in settings.json
  • Ensure required fields (url) are present
  • Check that endpoint identifiers don't contain special characters

Parameter Overrides Not Working

  • Verify the parameter names match llama-server's API (check llama-server documentation)
  • Remember that model-level requestBody overrides endpoint-level requestBody
  • Check the VS Code output panel for API request/response logs

Links

  • llama.cpp GitHub
  • llama.cpp Quick Start
  • llama-server Documentation