La Llama Chat
La Llama Chat is a VS Code extension that brings AI-powered code intelligence directly into your editor. It runs entirely on your local machine, with no data leaving your environment. Chat with your codebase, explore execution flows, and retrieve semantically relevant context using llama.cpp as the LLM backend and ChromaDB as the vector store.
✨ Features
🔧 Prerequisites
1. llama.cpp server
You will also need a compatible GGUF model.
2. ChromaDB
Or via pip:
3. VS Code
Minimum version:
📦 Installation
From the VS Code Marketplace
From a
| Setting | Default | Description |
|---|---|---|
| laLlamaChat.llamaCpp.host | 127.0.0.1 | llama.cpp server host |
| laLlamaChat.llamaCpp.port | 8033 | llama.cpp server port |
| laLlamaChat.llamaCpp.executablePath | ./build/bin/llama-server | Path to llama-server binary |
| laLlamaChat.llamaCpp.modelPath | ./models/qwen2.5-coder-7b-instruct-q4_k_m.gguf | Path to GGUF model |
| laLlamaChat.llamaCpp.gpuLayers | 99 | GPU layers to offload |
| laLlamaChat.llamaCpp.contextSize | 16384 | Context window in tokens |
| laLlamaChat.llamaCpp.flashAttention | true | Enable flash attention |
| laLlamaChat.llamaCpp.chatCompletionsPath | /v1/chat/completions | Chat endpoint path |
| laLlamaChat.llamaCpp.jinja | true | Enable --jinja flag |
| laLlamaChat.llamaCpp.tools | all | Value passed to --tools |
| laLlamaChat.chromaDb.url | http://127.0.0.1 | ChromaDB base URL |
| laLlamaChat.chromaDb.port | 8000 | ChromaDB port |
| laLlamaChat.chromaDb.excludeDirs | see defaults | Folders skipped during indexing |
| laLlamaChat.chromaDb.excludeFileGlobs | ["**/*.bin", ...] | File patterns skipped during indexing |
| laLlamaChat.chromaDb.maxFileSizeKb | 512 | Max file size to index (KB) |
| laLlamaChat.chromaDb.maxIndexedFiles | 2000 | Max chunks/files per index run |
| laLlamaChat.chromaDb.chunkSizeChars | 2000 | Chunk size in characters |
| laLlamaChat.chromaDb.chunkOverlapChars | 300 | Chunk overlap in characters |
| laLlamaChat.chromaDb.vectorCandidatePool | 50 | Candidate pool for semantic retrieval |
| laLlamaChat.chromaDb.maxQueryResults | 12 | Max results returned per query |
| laLlamaChat.chromaDb.minCosineSimilarity | 0.2 | Minimum cosine similarity threshold |
| laLlamaChat.chat.temperature | 0.2 | Generation temperature |
| laLlamaChat.chat.maxTokens | 2048 | Max tokens per response |
| laLlamaChat.chat.debug | false | Enable verbose logs |
| laLlamaChat.chat.maxAttachedFileSizeKb | 256 | Max size for manually attached files |
| laLlamaChat.memory.contextWindowSize | 8192 | Total context window token budget |
| laLlamaChat.memory.safetyThreshold | 6500 | Token threshold that triggers pruning |
| laLlamaChat.memory.preserveSystemPrompt | true | Keep system prompt during pruning |
| laLlamaChat.memory.preserveRecentMessagesCount | 2 | Recent messages always preserved during pruning |
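The memory settings describe a pruning policy: once the conversation exceeds safetyThreshold tokens, older messages are dropped while the system prompt and the most recent messages are kept. The following is a minimal sketch of one way such a policy could work, not the extension's actual implementation. The ChatMessage and MemorySettings shapes, the function names, and the 4-characters-per-token estimate are all assumptions made for illustration.

```typescript
// Hypothetical shapes; the extension's real types may differ.
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

interface MemorySettings {
  safetyThreshold: number;             // laLlamaChat.memory.safetyThreshold
  preserveSystemPrompt: boolean;       // laLlamaChat.memory.preserveSystemPrompt
  preserveRecentMessagesCount: number; // laLlamaChat.memory.preserveRecentMessagesCount
}

// Crude stand-in for a tokenizer: roughly 4 characters per token (assumption).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function totalTokens(messages: ChatMessage[]): number {
  return messages.reduce((sum, m) => sum + estimateTokens(m.content), 0);
}

// Drops the oldest unprotected message repeatedly until the estimate is at or
// below the threshold, or until only protected messages remain.
function pruneHistory(messages: ChatMessage[], settings: MemorySettings): ChatMessage[] {
  const kept = [...messages];
  while (totalTokens(kept) > settings.safetyThreshold) {
    // Indices at or beyond this bound belong to the protected "recent" tail.
    const removableEnd = kept.length - settings.preserveRecentMessagesCount;
    const index = kept.findIndex(
      (m, i) => i < removableEnd && !(settings.preserveSystemPrompt && m.role === "system")
    );
    if (index === -1) break;
    kept.splice(index, 1);
  }
  return kept;
}
```

If the protected messages alone exceed the threshold, this sketch returns them unchanged rather than truncating their content.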
🚀 Usage
Step 1 — Start the llama.cpp server
Click Start Server in the extension panel, or run manually:
./build/bin/llama-server \
--model ./models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
--host 127.0.0.1 \
--port 8033 \
--ctx-size 16384 \
--n-gpu-layers 99 \
--jinja
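The manual command maps one-to-one onto the laLlamaChat.llamaCpp.* settings. As a rough sketch of that mapping (the LlamaCppSettings interface and both function names are hypothetical, and only flags that appear in the command above are emitted):

```typescript
// Hypothetical mirror of the laLlamaChat.llamaCpp.* settings used by the command above.
interface LlamaCppSettings {
  executablePath: string;
  modelPath: string;
  host: string;
  port: number;
  contextSize: number;
  gpuLayers: number;
  jinja: boolean;
  chatCompletionsPath: string;
}

// Values copied from the documented defaults.
const defaultSettings: LlamaCppSettings = {
  executablePath: "./build/bin/llama-server",
  modelPath: "./models/qwen2.5-coder-7b-instruct-q4_k_m.gguf",
  host: "127.0.0.1",
  port: 8033,
  contextSize: 16384,
  gpuLayers: 99,
  jinja: true,
  chatCompletionsPath: "/v1/chat/completions",
};

// Builds the argument list in the same order as the manual command.
function buildServerArgs(s: LlamaCppSettings): string[] {
  const args = [
    "--model", s.modelPath,
    "--host", s.host,
    "--port", String(s.port),
    "--ctx-size", String(s.contextSize),
    "--n-gpu-layers", String(s.gpuLayers),
  ];
  if (s.jinja) args.push("--jinja");
  return args;
}

// Builds the chat endpoint URL from host, port and chatCompletionsPath.
function buildChatUrl(s: LlamaCppSettings): string {
  return `http://${s.host}:${s.port}${s.chatCompletionsPath}`;
}
```

Keeping the arguments as an array rather than a single string avoids shell quoting problems when paths contain spaces.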
Step 2 — Index your workspace into ChromaDB
Open the La Llama Chat panel in the Activity Bar and click Index Workspace. The extension will:
- Walk your project files (respecting excludeDirs and excludeFileGlobs).
- Chunk file contents into overlapping segments.
- Compute 384-dimensional vector embeddings using Xenova/all-MiniLM-L6-v2 (runs locally via @huggingface/transformers, ~22 MB cached).
- Store everything in ChromaDB under a workspace-specific collection.
Re-index after significant refactors to keep context fresh.
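To make the chunking step concrete, here is a small sketch of fixed-size character chunking with overlap. It uses the documented defaults for chunkSizeChars (2000) and chunkOverlapChars (300), but the function name and the exact boundary handling are assumptions; the extension may split differently.

```typescript
// Splits text into chunks of at most `chunkSize` characters, where each chunk
// starts `chunkSize - overlap` characters after the previous one.
function chunkText(text: string, chunkSize = 2000, overlap = 300): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("expected chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a chunk has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

With the defaults, a 5000-character file yields three chunks, and the last 300 characters of each chunk repeat at the start of the next, so a symbol that straddles a boundary still appears whole in at least one chunk.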
Step 3 — Ask questions about your code
Where is the payment processing flow initiated?
Which services depend on UserRepository?
Explain the authentication middleware chain.
The extension selects a conversation flow automatically:
| Condition | Flow | Behaviour |
|---|---|---|
| No files attached, RAG enabled | GLOBAL_REACT_AGENT | ReAct loop iteratively searches ChromaDB |
| Files attached, RAG enabled | DEEP_REACT_AGENT | ReAct loop starts from attached files, expands dependencies via ChromaDB |
| Files attached, RAG disabled | LOCAL_RAG | Isolated analysis of attached code, no retrieval |
| No files, RAG disabled | DIRECT_LLM | Plain chat with model knowledge only |
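The table reduces to a two-input decision. A minimal sketch, assuming the two conditions are available as booleans (the function and parameter names are hypothetical):

```typescript
type ConversationFlow =
  | "GLOBAL_REACT_AGENT"
  | "DEEP_REACT_AGENT"
  | "LOCAL_RAG"
  | "DIRECT_LLM";

// Mirrors the flow table: RAG on/off selects the family, attachments select the variant.
function resolveFlow(hasAttachedFiles: boolean, ragEnabled: boolean): ConversationFlow {
  if (ragEnabled) {
    return hasAttachedFiles ? "DEEP_REACT_AGENT" : "GLOBAL_REACT_AGENT";
  }
  return hasAttachedFiles ? "LOCAL_RAG" : "DIRECT_LLM";
}
```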
Step 4 — Attach specific files
Click the Attach button (📎) to add individual files to the context:
[config.yml attached] What environment variables does this service require?
Keyboard Shortcuts
| Action | Shortcut |
|---|---|
| Send message | Enter |
| New line in input | Shift+Enter |
🖥️ Interface
La Llama Chat adds a sidebar panel with three tabs:
| Tab | Description |
|---|---|
| Chat | Main conversational interface with streamed Markdown responses and live token counter. |
| Settings | Quick-access panel for server and ChromaDB configuration. |
| About | Extension version and diagnostic information. |
🛠️ Conversation Flow Roles
Each conversation mode assigns a specific role to the LLM by default. Here's what each role does:
| Conversation Flow | Role | System Prompt Defines | Customization Key |
|---|---|---|---|
| DIRECT_LLM | General Software Engineer | Answers development Q&A using pre-trained knowledge; no RAG or file context | laLlamaChat.chat.directLlmTemplate |
| LOCAL_RAG | Deep Code Analyst | Analyzes ONLY attached files in isolation; warns about external dependencies | laLlamaChat.chat.localRagTemplate |
| GLOBAL_REACT_AGENT | Code Navigator & Architect | Iteratively searches the entire codebase using ChromaDB; builds comprehensive understanding of architecture | laLlamaChat.chat.globalReactTemplate |
| DEEP_REACT_AGENT | Cross-File Dependency Expert | Analyzes attached files, then expands to dependencies using ChromaDB search; resolves external references | laLlamaChat.chat.deepReactTemplate |
ReAct flows use the Thought/Action/Observation format:
- Each Thought reasons about what to search next
- Each Action calls lalamachat_agent_search(terms) to query ChromaDB
- Observation shows the search results
- Loop continues until sufficient context is gathered, then emits Final Answer
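To illustrate how one turn of such a loop could be interpreted, the sketch below classifies a model response as either a search request or a final answer. The exact text layout ("Thought:", "Action:", "Final Answer:" prefixes, optional quotes around the terms) is an assumption made for this example; the real prompts and parser may use a different syntax.

```typescript
type ReActStep =
  | { kind: "search"; thought: string; terms: string }
  | { kind: "final"; answer: string }
  | { kind: "unknown" };

// Classifies a single model response. A final answer takes priority over an action.
function parseReActStep(output: string): ReActStep {
  const finalMatch = output.match(/Final Answer:\s*([\s\S]*)$/);
  if (finalMatch) {
    return { kind: "final", answer: (finalMatch[1] ?? "").trim() };
  }
  const actionMatch = output.match(/Action:\s*lalamachat_agent_search\(([^)]*)\)/);
  if (actionMatch) {
    const thoughtMatch = output.match(/Thought:\s*(.*)/);
    const rawTerms = (actionMatch[1] ?? "").trim();
    return {
      kind: "search",
      thought: thoughtMatch ? (thoughtMatch[1] ?? "").trim() : "",
      // Strip one pair of surrounding quotes if the model added them.
      terms: rawTerms.replace(/^["']|["']$/g, ""),
    };
  }
  return { kind: "unknown" };
}
```

A driver would call this after each response: on "search" it would query ChromaDB and append the results as an Observation, and on "final" it would stop and show the answer.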
🛠️ Customizing Prompt Templates
You can override the system prompt and user prompt for each conversation flow to customize the LLM's behavior. Edit any of the following settings in your settings.json:
{
"laLlamaChat.chat.directLlmTemplate": {
"systemPrompt": "Your custom system prompt here for DIRECT_LLM mode",
"userPrompt": "Your custom user prompt template with {{user_query}} placeholder"
},
"laLlamaChat.chat.globalReactTemplate": {
"systemPrompt": "Your custom system prompt here for GLOBAL_REACT_AGENT mode",
"userPrompt": "Your custom user prompt template"
},
"laLlamaChat.chat.localRagTemplate": {
"systemPrompt": "Your custom system prompt here for LOCAL_RAG mode",
"userPrompt": "Your custom user prompt template with {{target_files}} placeholder"
},
"laLlamaChat.chat.deepReactTemplate": {
"systemPrompt": "Your custom system prompt here for DEEP_REACT_AGENT mode",
"userPrompt": "Your custom user prompt template with {{target_files}} placeholder"
}
}
Available placeholders:
- {{user_query}}: replaced with the user's message
- {{target_files}}: replaced with attached file contents (LOCAL_RAG and DEEP_REACT_AGENT only)
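Placeholder substitution of this kind can be sketched in a few lines. The function name and the choice to leave unknown placeholders untouched are assumptions; the extension's own interpolation may behave differently.

```typescript
// Replaces {{name}} placeholders with values from `values`.
// Placeholders without a matching key are left as-is so typos stay visible.
function interpolate(template: string, values: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match: string, key: string) => {
    const value = Object.prototype.hasOwnProperty.call(values, key) ? values[key] : undefined;
    return value ?? match;
  });
}

// Example: fill a LOCAL_RAG-style user prompt.
const filledPrompt = interpolate(
  "Files:\n{{target_files}}\n\nQuestion: {{user_query}}",
  { target_files: "File: a.ts", user_query: "What does this do?" }
);
```

Using a replacer function instead of a replacement string means characters such as `$` in file contents are inserted literally.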
Legacy RAG Mode & Specific Files Mode Templates
These templates control the formatting of retrieved context and target files display (used internally for non-ReAct flows):
RAG mode (laLlamaChat.chat.ragModeTemplate):
{
"executionMode": {
"header": "<execution_mode>",
"scope": "SCOPE: Global Project Analysis (RAG).",
"instruction": "Synthesize the retrieved fragments to answer the query."
},
"retrievedContext": {
"header": "<retrieved_context>",
"footer": "</retrieved_context>",
"fragmentFormat": "Fragment {index} | Source: {path}{distance}\n```\n{content}\n```"
},
"query": {
"label": "User Query: {prompt}"
}
}
Specific files mode (laLlamaChat.chat.specificFilesModeTemplate):
{
"executionMode": {
"header": "<execution_mode>",
"scope": "SCOPE: Selected Specific Files.",
"instruction": "Answer based only on the code inside <target_files>."
},
"targetFiles": {
"header": "<target_files>",
"footer": "</target_files>",
"fileFormat": "File: {name}\nType: {type}\nExtension: {extension}\n```\n{content}\n```"
},
"query": {
"label": "User Query: {prompt}"
}
}
Default Prompt Content
For detailed information about the default system prompts used in each conversation flow (Principal Software Engineer roles, ReAct format requirements, etc.), see ARCHITECTURE.md §6.3bis and §6.7.
Legacy Spanish Keys
Legacy Spanish keys (modoEjecucion, archivosObjetivo, contextoRecuperado, consulta) are accepted for backward compatibility in RAG mode and specific files mode templates.
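One way to support both spellings is to normalize a template object before use. The sketch below assumes each Spanish key maps to the English key with the same meaning (modoEjecucion to executionMode, archivosObjetivo to targetFiles, contextoRecuperado to retrievedContext, consulta to query); that pairing is inferred from the key names rather than confirmed, and the function name is hypothetical.

```typescript
// Assumed legacy-to-current key mapping, inferred from the key names.
const LEGACY_KEYS = new Map<string, string>([
  ["modoEjecucion", "executionMode"],
  ["archivosObjetivo", "targetFiles"],
  ["contextoRecuperado", "retrievedContext"],
  ["consulta", "query"],
]);

// Returns a copy of the template with legacy top-level keys renamed.
function normalizeTemplateKeys(template: Record<string, unknown>): Record<string, unknown> {
  const normalized: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(template)) {
    const target = LEGACY_KEYS.get(key) ?? key;
    // If both spellings are present, the English key wins (a policy assumed here).
    if (target !== key && target in template) continue;
    normalized[target] = value;
  }
  return normalized;
}
```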
Donate
If this extension helps your workflow and you want to support its growth, you can donate here:
I want to bring new capabilities to this plugin, including:
- Content replacement directly in files
- Sub-agent management
- Graphical/UI improvements
- New chat interaction flows
🤝 Contributing
- Fork the repository and create a feature branch.
- Install dependencies: npm install.
- Run the build in watch mode: npm run watch.
- Run tests: npm test.
- Validate the build: npm run compile must exit with code 0.
- Open a Pull Request against main.
Code conventions
- TypeScript strict mode is enforced.
- All new logic must include unit tests under src/test/ mirroring the source structure.
- ESLint rules must pass.
- No comments in source code.
Data structures
Session storage (globalState key: laLlamaChatSessions)
[
{
"id": "1718615000000",
"title": "How does streaming work?",
"createdAt": 1718615000000,
"messages": [
{
"role": "user",
"content": {
"text": "How does streaming work?",
"filesMetadata": [
{
"name": "stream.ts:8-10",
"content": "const reader = body.getReader();",
"isAutomatic": true
}
]
}
},
{
"role": "assistant",
"content": {
"text": "Streaming works by reading chunks from the response body...",
"time": "1.42",
"tokens": 128
}
}
]
}
]
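The stored structure can be described with TypeScript types. The interfaces below are inferred from the example alone, so optionality and the unit of time (taken here to be seconds) are assumptions, and the summarizeSession helper is purely illustrative.

```typescript
interface AttachedFileMetadata {
  name: string;
  content: string;
  isAutomatic: boolean;
}

interface UserContent {
  text: string;
  filesMetadata?: AttachedFileMetadata[];
}

interface AssistantContent {
  text: string;
  time: string;   // stored as a string, e.g. "1.42"; assumed to be seconds
  tokens: number;
}

type SessionMessage =
  | { role: "user"; content: UserContent }
  | { role: "assistant"; content: AssistantContent };

interface ChatSession {
  id: string;
  title: string;
  createdAt: number; // epoch milliseconds in the example
  messages: SessionMessage[];
}

// Sums assistant token counts and response times for one session.
function summarizeSession(session: ChatSession): { tokens: number; seconds: number } {
  let tokens = 0;
  let seconds = 0;
  for (const message of session.messages) {
    if (message.role === "assistant") {
      tokens += message.content.tokens;
      seconds += Number.parseFloat(message.content.time) || 0;
    }
  }
  return { tokens, seconds };
}
```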
Request sent to llama.cpp
{
"model": "local",
"messages": [
{ "role": "system", "content": "You are a Principal Software Engineer..." },
{
"role": "user",
"content": "--- ATTACHED FILE: stream.ts:8-10 ---\nconst reader = body.getReader();\n--- END FILE ---\n\nUser instruction:\nHow does streaming work?"
},
{ "role": "assistant", "content": "Streaming works by reading chunks..." },
{
"role": "user",
"content": "--- ATTACHED FILE: stream.ts ---\nfull file content here\n--- END FILE ---\n\nUser instruction:\nCan you explain the buffer logic?"
}
],
"temperature": 0.2,
"max_tokens": 2048,
"stream": true
}
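The user messages in the request follow a visible pattern: each attached file is wrapped in ATTACHED FILE / END FILE markers, followed by the instruction. A sketch of a formatter that reproduces the single-file case shown above (the function name, the blank line between multiple files, and the behaviour with no files are assumptions):

```typescript
interface AttachedFile {
  name: string;
  content: string;
}

// Wraps each file in marker lines and appends the user's instruction.
function formatUserContent(text: string, files: AttachedFile[]): string {
  const blocks = files.map(
    (file) => `--- ATTACHED FILE: ${file.name} ---\n${file.content}\n--- END FILE ---`
  );
  return [...blocks, `User instruction:\n${text}`].join("\n\n");
}

const firstUserMessage = formatUserContent("How does streaming work?", [
  { name: "stream.ts:8-10", content: "const reader = body.getReader();" },
]);
```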
llama.cpp server props (GET /props)
{
"model_path": "/models/qwen2.5-coder-7b-instruct-q8_0.gguf",
"n_ctx": 32768,
"n_ctx_train": 131072,
"n_embd": 4096
}
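Because this response comes from an external process, it is worth validating before use. The following sketch reads the four fields shown above defensively; the ServerProps shape and the function name are hypothetical, and the real LlamaAdapter may extract more or different fields.

```typescript
interface ServerProps {
  modelPath: string | undefined;
  contextSize: number | undefined;
  trainedContextSize: number | undefined;
  embeddingSize: number | undefined;
}

// Extracts known fields from a parsed /props response, ignoring wrong types.
function extractServerProps(raw: unknown): ServerProps {
  const record: Record<string, unknown> =
    typeof raw === "object" && raw !== null ? (raw as Record<string, unknown>) : {};
  const asString = (v: unknown): string | undefined => (typeof v === "string" ? v : undefined);
  const asNumber = (v: unknown): number | undefined =>
    typeof v === "number" && Number.isFinite(v) ? v : undefined;
  return {
    modelPath: asString(record["model_path"]),
    contextSize: asNumber(record["n_ctx"]),
    trainedContextSize: asNumber(record["n_ctx_train"]),
    embeddingSize: asNumber(record["n_embd"]),
  };
}
```

A caller could compare contextSize with laLlamaChat.memory.contextWindowSize to warn when the configured budget exceeds what the running server actually provides.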
Development
npm run compile # typecheck + lint + build
npm run watch # watch mode
npm run test # run unit tests
Tests
Unit tests cover: session timing/state persistence, editor context labels, payload deduplication, conversation flow resolution, prompt template normalization/interpolation, token counting and memory pruning thresholds, ChromaDB config defaults/overrides, llamaServerConfig command/url generation, and LlamaAdapter server props extraction.
📄 License
Distributed under the GNU GPL v3.0 License. See LICENSE for full text.
Third-Party Notice
- This project includes DOMPurify (media/purify.min.js) under the Apache-2.0 / MPL-2.0 dual license.
Built with ❤️ for developers who value privacy, performance and full control over their toolchain.