Z.AI for GitHub Copilot Chat
What Is This?
Z.AI for GitHub Copilot Chat is a VS Code extension that registers Z.AI GLM series models (including GLM-5.2, GLM-5.1, GLM-5, and GLM-4.7) in GitHub Copilot Chat via the official VS Code Language Model Chat Provider API. This lets you pick and use Z.AI GLM models directly from the Copilot Chat model picker, just like selecting GPT-4 or Claude, with no extra Copilot Pro/Enterprise subscription required. Simply enter your Z.AI API key.
✨ Features
🔬 Deep Research
The extension registers Z.AI's remote MCP servers for Web Search and Web Reader, making them available natively to Copilot Agent.
How it works
Z.AI's Web Search and Web Reader are MCP servers (not REST endpoints). They are billed against the GLM Coding Plan's shared monthly MCP quota, not the general API balance, so no top-up is needed.
| Command | Description |
|---|---|
| Z.AI: Manage Provider | Manage API key, refresh models, or test connection |
| Z.AI: Set API Key | Store or update your Z.AI API key |
| Z.AI: Setup MCP Servers | Write the user's `mcp.json` with Z.AI Web Search + Web Reader (one-time, for `@z-research`) |
| Z.AI: Show Quota | Open a detailed markdown report of all quota windows |
| Z.AI: Toggle Quota View | Switch the status bar between 5-hour and weekly display |
| Z.AI: Diagnostics | Show a markdown report of all registered Z.AI models |
Note: The native BYOK flow via Language Models (gear icon ⚙) is recommended.
Coding Plan quota
When your API key belongs to a Z.AI Coding Plan subscription, the extension shows a quota indicator $(graph) Z · NN% on the right side of the status bar:
- Hover the indicator to see a graphical SVG donut chart with two concentric rings — the outer ring for the weekly quota, the inner ring for the rolling 5-hour quota. Each ring is colour-coded: blue (normal), yellow (≥80%), red (≥95%). Below the chart: usage percentages and reset countdowns.
- Click the indicator to toggle the status-bar text between the 5-hour and weekly view.
- The indicator background turns yellow at 80% usage and red at 95%.
- Z.AI: Manage Provider → Show Quota opens a detailed markdown report with all quota windows.
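The colour thresholds above can be sketched as a small pure function. This is a minimal illustration, not the extension's actual code: the names `quotaSeverity` and `quotaLabel` are hypothetical, and clamping the percentage to 0–100 is an assumption.

```typescript
// Thresholds come from the list above: yellow at >= 80%, red at >= 95%.
type QuotaSeverity = "normal" | "warning" | "critical";

function quotaSeverity(percentUsed: number): QuotaSeverity {
  if (percentUsed >= 95) return "critical"; // red
  if (percentUsed >= 80) return "warning"; // yellow
  return "normal"; // blue
}

// Hypothetical helper that formats the status-bar text `$(graph) Z · NN%`.
// Rounding and clamping to 0-100 are assumptions made for this sketch.
function quotaLabel(percentUsed: number): string {
  const clamped = Math.min(100, Math.max(0, Math.round(percentUsed)));
  return `$(graph) Z · ${clamped}%`;
}
```

Keeping the threshold logic free of VS Code imports makes it easy to unit-test separately from the status bar item.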
The quota is fetched from https://api.z.ai/api/monitor/usage/quota/limit and auto-refreshes every 5 minutes (configurable via zai.quotaRefreshInterval).
If quota data is unavailable (e.g. no API key set, or the key doesn't belong to a Coding Plan), the status bar shows a persistent `$(graph) Z.AI quota` item with a tooltip linking to Z.AI: Set API Key.
Settings
| Setting | Type | Default | Description |
|---|---|---|---|
| `zai.temperature` | number | 0.2 | Sampling temperature for chat completions (0–2) |
| `zai.maxTokens` | number | 0 | Max output token override; 0 uses the per-model bundled maximum |
| `zai.maxInputTokens` | number | 0 | Context window override; 0 uses the per-model bundled context size |
| `zai.debugReasoning` | boolean | false | Write provider reasoning_content to Output → Z.AI for debugging |
| `zai.requestTimeout` | number | 180000 | Connection timeout in ms. Auto-scaled 1.5× for 200K flagship models (glm-5.1/5/4.7) and capped at 300000ms. Inactivity timer scales the same way (90–180s window). |
| `zai.maxRetries` | number | 2 | Automatic retries on transient network errors (fetch failed, timeout, 5xx, 429) with exponential backoff (1s → 2s → max 10s + jitter). |
| `zai.defaultModel` | string | "" | Model id to mark as the default selection in the Copilot Chat model picker (e.g. glm-5.2). Leave empty to not mark any model as the default; users can still pick any model manually. |
| `zai.showUsageStatusBar` | boolean | true | Show the latest Z.AI usage summary (prompt→output tokens) in the VS Code status bar after each response. |
| `zai.showQuotaStatusBar` | boolean | true | Show the Z.AI Coding Plan quota (5-hour / weekly) in the VS Code status bar. Hover for a graphical SVG donut chart; click to toggle between windows. |
| `zai.quotaRefreshInterval` | number | 5 | How often (in minutes) to refresh the Z.AI Coding Plan quota. 0 disables automatic refresh. |
| `zai.experimentalContextIndicator` | boolean | false | Experimental: attempt to fill the Copilot Chat context indicator with real Z.AI token usage. Depends on VS Code internals. |
| `zai.research.maxSources` | number | 100 | Max sources fetched during a @z-research run when deep mode is triggered. Lower to reduce cost/latency. |
| `zai.research.maxIterations` | number | 5 | Max query-expansion iterations before synthesis (1–10). |
| `zai.research.concurrency` | number | 3 | Parallel MCP calls during search + read phases. Higher is faster but may hit the Z.AI MCP rate limit (~3-5 req/s safe). |
| `zai.research.cacheTTL` | number | 3600 | Cache TTL in seconds for Z.AI search + read results. 0 disables caching. |
| `zai.research.synthesisModel` | string | glm-5.2 | Z.AI model used for planning queries and synthesising the final report. Use a high-context model (e.g. glm-5.2 with 1M context) for deep research. |
| `zai.research.webSearchToolName` | string | web_search_prime | VS Code tool name for the Z.AI Web Search MCP server. The default matches the snake_case form VS Code exposes (e.g. mcp_mcp-web-searc_web_search_prime). Override if VS Code's MCP tool name format changes. |
| `zai.research.webReaderToolName` | string | webReader | VS Code tool name for the Z.AI Web Reader MCP server. Default matches the camelCase form VS Code exposes. Override if VS Code's MCP tool name format changes. |
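The retry behaviour described for `zai.maxRetries` in the table above (1s → 2s → max 10s + jitter) could look roughly like the sketch below. The function names, the 250 ms jitter range, and the injectable `sleep` and `isTransient` parameters are all assumptions made for illustration; the table only states the delay pattern and the error classes.

```typescript
// Delay before retry number `attempt` (0-based): 1s, 2s, 4s, 8s, then capped at 10s.
// The jitter range (0-249 ms) is an assumption; the settings table only says "+ jitter".
function backoffDelayMs(attempt: number, random: () => number = Math.random): number {
  const base = Math.min(1000 * Math.pow(2, attempt), 10000);
  const jitter = Math.floor(random() * 250);
  return base + jitter;
}

// HTTP statuses the settings table lists as transient: 429 and any 5xx.
function isTransientStatus(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// Runs `run` once, then retries up to `maxRetries` more times while
// `isTransient` accepts the thrown error. Defaults are placeholders.
async function withRetries<T>(
  run: () => Promise<T>,
  maxRetries: number = 2,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => {
      setTimeout(resolve, ms);
    }),
  isTransient: (err: unknown) => boolean = () => true,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await run();
    } catch (err) {
      if (attempt >= maxRetries || !isTransient(err)) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

With the default of 2 retries, a request is attempted at most three times before the error is surfaced.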
Troubleshooting
"Request timed out for glm-5.1" / "Connection timed out after …"
Flagship 200K-context models (glm-5.1, glm-5, glm-5-turbo, glm-4.7) have noticeably higher cold-start latency than smaller models — they can take 60–120s to send the first token on long or busy sessions.
The extension already mitigates this automatically:
- `zai.requestTimeout` defaults to 180000ms (3 min), up from 120000ms in 0.1.x
- The effective connection timeout is auto-scaled to 1.5× for 200K flagship models (so 180s base → 270s)
- The inactivity timer auto-scales the same way, with a 90s minimum floor (was 30s)
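The connection-timeout scaling above works out as follows. This is a sketch under stated assumptions: the flagship set is copied from the list in this section, the names are hypothetical, and applying the 300000ms cap to every model (not only flagship ones) is a guess.

```typescript
// Flagship list taken from this section; treating it as a fixed set is an assumption.
const FLAGSHIP_MODELS = new Set<string>(["glm-5.1", "glm-5", "glm-5-turbo", "glm-4.7"]);
const FLAGSHIP_MULTIPLIER = 1.5;
const MAX_TIMEOUT_MS = 300000;

// `baseMs` plays the role of zai.requestTimeout (default 180000).
function effectiveConnectionTimeoutMs(model: string, baseMs: number = 180000): number {
  const multiplier = FLAGSHIP_MODELS.has(model) ? FLAGSHIP_MULTIPLIER : 1;
  return Math.min(Math.round(baseMs * multiplier), MAX_TIMEOUT_MS);
}
```

For example, `effectiveConnectionTimeoutMs("glm-5.1")` gives 270000, matching the 180s → 270s figure above, while raising the base to 300000 leaves no further headroom because of the cap.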
If you still hit timeouts:
- Retry — Z.AI servers sometimes spike under load; the same prompt may succeed in a few seconds
- Increase `zai.requestTimeout` in Settings (e.g. 300000 = 5 min max)
- Try a faster model like `glm-4.5-flash` or `glm-4.7-flash` for code-completion / quick-edit tasks
- Clear chat history to reduce input token count; large prefill is the main driver of cold-start latency
- Check the Z.AI Output channel: every request logs `[Timeout config: model=X flagship=Y multiplier=Z× connectionTimeout=…]` so you can confirm which budget was applied
If the issue persists with `zai.requestTimeout` = 300000 and a small context, the Z.AI API itself is most likely the bottleneck. Try a different Z.AI region/plan or contact Z.AI support.
"Z.AI model not selectable in the model picker" / "Can't pin Z.AI model"
The Z.AI extension only sends the official LanguageModelChatInformation fields to VS Code. Non-API fields like category and isUserSelectable are not declared in the public VS Code API and can cause the picker to misbehave or even crash (see doc/vscode-126-chatmodel-picker-crash.md for the original incident).
If the model picker doesn't show your Z.AI models or they can't be pinned:
- Make sure Z.AI models are enabled in the picker. Open the picker, search for "Z.AI", and click the eye (👁) icon to enable visibility. The eye icon is in the Language Models view (gear icon ⚙ → Z.AI) and toggles whether the model is listed in the picker.
- Pin a model as default. Set `zai.defaultModel` in your user settings (e.g. `glm-5.2`). The extension marks that model as `isDefault: true` so VS Code highlights it in the picker and seeds new chat sessions with it.
- Reload the window after changing `zai.defaultModel` (the model list is cached per-window).
If the picker still misbehaves:
- Open Developer Tools (`Cmd+Shift+P`, or `Ctrl+Shift+P` on Windows/Linux, → "Developer: Toggle Developer Tools") and look for console errors when you open the picker.
- Check Output → Z.AI for any error logs from the model provider.
- File an issue with the console error and your VS Code version.
"@z-research: MCP servers are not connected yet"
The @z-research participant needs Z.AI's Web Search + Web Reader MCP servers registered with VS Code. To fix:
- Run setup once: open the Command Palette and run Z.AI: Setup MCP Servers. This writes the servers to your user `mcp.json` (macOS: `~/Library/Application Support/Code/User/mcp.json`, Windows: `%APPDATA%\Code\User\mcp.json`, Linux: `~/.config/Code/User/mcp.json`).
- Click Reload in the prompt to restart VS Code.
- Verify in the MCP view (Activity Bar → MCP). Both `zai-web-search-prime` and `zai-web-reader` should show as Running.
- Re-run `@z-research <topic>`.
The participant will display a clear error with the currently-available tool names if MCP isn't connected yet.
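To make the setup step more concrete, here is a rough sketch of locating the user `mcp.json` and merging the two Z.AI entries into it. The function names are hypothetical, the top-level `servers` key is assumed from VS Code's `mcp.json` format, and the server entry contents (URLs, headers) are deliberately left to the caller because this README does not list them.

```typescript
type Platform = "darwin" | "win32" | "linux";

// Paths mirror the three locations listed above. `home` and `appData` are
// passed in so the function stays free of Node and VS Code imports.
function userMcpJsonPath(platform: Platform, home: string, appData: string = ""): string {
  switch (platform) {
    case "darwin":
      return home + "/Library/Application Support/Code/User/mcp.json";
    case "win32":
      return appData + "\\Code\\User\\mcp.json";
    case "linux":
      return home + "/.config/Code/User/mcp.json";
  }
}

// Assumed file shape: a top-level `servers` map keyed by server name.
interface McpConfig {
  servers: Record<string, unknown>;
}

// Adds or replaces the two Z.AI entries while leaving other servers untouched.
// The entry contents are supplied by the caller; this sketch does not guess
// the server URLs or auth headers.
function mergeZaiServers(existing: McpConfig, webSearch: unknown, webReader: unknown): McpConfig {
  return {
    ...existing,
    servers: {
      ...existing.servers,
      "zai-web-search-prime": webSearch,
      "zai-web-reader": webReader,
    },
  };
}
```

Merging rather than overwriting matters here because the user-level `mcp.json` may already contain servers registered by other tools.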
"@z-research: No usable sources were found"
Several possible causes:
- Search queries returned empty results — the topic may be too niche. Try rephrasing with concrete keywords.
- All top URLs were filtered as junk — if every returned URL was social-media or an asset CDN, the filter dropped them all. Try a more specific topic with named entities (e.g. "World Archery 2024 registration rules" instead of "archery registration").
- Every read timed out: check the Output channel for `[mcp-tools] Timeout (30000ms)` entries. If there are many, your Z.AI MCP servers may be rate-limited or unreachable.
The diagnostic log is in Output → Z.AI Research and includes every search query, every read, and the parsed result count.
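The "filtered as junk" case is easier to reason about with a concrete filter in mind. The sketch below is illustrative only: the host list and the file-extension check are hypothetical stand-ins, since the extension's real social-media and asset-CDN rules are not shown in this README.

```typescript
// Hypothetical deny-lists; the extension's real lists are not shown in this README.
const SOCIAL_HOSTS: string[] = ["facebook.com", "instagram.com", "tiktok.com", "x.com"];
const ASSET_EXTENSIONS: string[] = [".png", ".jpg", ".jpeg", ".gif", ".svg", ".css", ".js", ".woff2"];

function isJunkUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return true; // unparseable URLs cannot be read, so drop them
  }
  const host = url.hostname.toLowerCase();
  const isSocial = SOCIAL_HOSTS.some((h) => host === h || host.endsWith("." + h));
  const path = url.pathname.toLowerCase();
  const isAsset = ASSET_EXTENSIONS.some((ext) => path.endsWith(ext));
  return isSocial || isAsset;
}

// Keeps only URLs worth sending to the reader.
function filterSources(urls: string[]): string[] {
  return urls.filter((u) => !isJunkUrl(u));
}
```

If every search hit fails a check like this, the filtered list is empty and the run has nothing to read, which is the situation the error message describes.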
"@z-research takes 4+ minutes"
That's the normal end-to-end time for a deep-mode run. The wall-clock time is bounded by:
- Search phase: bounded by `zai.research.concurrency` (default 3) and the 30s per-call timeout.
- Read phase: up to ~25 source reads in parallel, again 30s timeout each.
- Synth phase: 3–5 LLM calls (1 reduce + N chunk summaries) on the synthesis model. With `glm-5.2` (1M context), this is fast.
To shorten: use quick mode (omit deep / thorough / menyeluruh keywords from your prompt), lower zai.research.maxSources, or pick a smaller synthesis model.
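The effect of `zai.research.concurrency` on wall-clock time can be pictured with a generic bounded-parallelism helper. This is a minimal sketch of the general technique, assuming a worker-lane design; `mapWithConcurrency` is a hypothetical name and not necessarily how the extension schedules its MCP calls.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight at once.
// Results are returned in input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  let next = 0;

  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claim the next item; safe because JS runs this synchronously
      results[index] = await worker(items[index], index);
    }
  }

  const lanes = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: lanes }, () => runLane()));
  return results;
}
```

With a limit of 3 and calls that each take up to 30s, 25 reads need roughly nine rounds in the worst case, which is why lowering `zai.research.maxSources` shortens a run more reliably than raising concurrency.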
"@z-research hits MCP rate limit (-429)"
The participant automatically retries with exponential backoff (1s / 2s / 4s) on rate-limit responses. If you see persistent 429s in the log:
- Lower `zai.research.concurrency` to `2` (default is 3).
- The Z.AI Coding Plan has a monthly MCP quota (Lite=100, Pro=1K, Max=4K calls). Check your usage in the Z.AI dashboard.
"@z-research search query stuck/hangs"
Each MCP call has a 30s per-call timeout. A hung call is logged as [mcp-tools] Timeout (30000ms) for search "..." — giving up on this query and the orchestrator continues. If you see many timeouts:
- Z.AI's MCP server may be having a transient issue — try again in a minute.
- The query itself may be problematic. If a particular query consistently times out, consider rewording it in your prompt.
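The per-call timeout and "give up on this query, keep going" behaviour described above can be sketched like this. All names are hypothetical, and the sketch does not cancel the underlying call; it only stops waiting for it.

```typescript
// Builds a recognisable timeout error without relying on Error subclassing.
function timeoutError(ms: number, label: string): Error {
  const err = new Error(`Timeout (${ms}ms) for ${label}`);
  err.name = "McpTimeoutError";
  return err;
}

function isTimeoutError(err: unknown): boolean {
  return err instanceof Error && err.name === "McpTimeoutError";
}

// Rejects if `call` has not settled within `ms`. The underlying call is not
// cancelled here; a real implementation might also pass a cancellation token.
function withTimeout<T>(call: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(timeoutError(ms, label)), ms);
    call.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

// Skips queries that time out so one hung call does not stall the whole run.
async function searchAll(
  queries: string[],
  search: (query: string) => Promise<string[]>,
  timeoutMs: number = 30000,
): Promise<string[]> {
  const found: string[] = [];
  for (const query of queries) {
    try {
      const hits = await withTimeout(search(query), timeoutMs, `search "${query}"`);
      found.push(...hits);
    } catch (err) {
      if (!isTimeoutError(err)) throw err;
      // give up on this query and continue with the next one
    }
  }
  return found;
}
```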
Models
The extension fetches the live model list from:
https://api.z.ai/api/coding/paas/v4/models
Because the Z.AI API returns model IDs only, a bundled metadata table provides context window and max output tokens per model. If the live fetch fails, the bundled list is used as a fallback.
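A rough sketch of the live-fetch-with-fallback flow follows. Several parts are assumptions: the response shape `{ data: [{ id }] }` (borrowed from OpenAI-style `/models` endpoints), the `Authorization: Bearer` header, and the `FetchJson` callback, which stands in for whatever HTTP client the extension uses. The bundled IDs are a subset taken from the table in this README.

```typescript
// A few IDs from the bundled limits table; the real bundled list is longer.
const BUNDLED_MODEL_IDS: string[] = ["glm-5.1", "glm-5", "glm-4.7", "glm-4.5-flash"];

const MODELS_URL = "https://api.z.ai/api/coding/paas/v4/models";

// Minimal fetch-like callback so the sketch does not depend on DOM or Node typings.
type FetchJson = (url: string, headers: Record<string, string>) => Promise<unknown>;

// Assumption: the response follows the OpenAI-style shape { data: [{ id: string }] }.
function parseModelIds(body: unknown): string[] {
  const data = (body as { data?: unknown }).data;
  if (!Array.isArray(data)) throw new Error("unexpected /models response");
  return data
    .map((entry) => (entry as { id?: unknown }).id)
    .filter((id): id is string => typeof id === "string");
}

async function listModelIds(fetchJson: FetchJson, apiKey: string): Promise<string[]> {
  try {
    const body = await fetchJson(MODELS_URL, { Authorization: `Bearer ${apiKey}` });
    return parseModelIds(body);
  } catch {
    return BUNDLED_MODEL_IDS.slice(); // live fetch failed: fall back to the bundled list
  }
}
```

Because the endpoint returns IDs only, each ID still has to be joined against the bundled metadata to get a context window and output limit.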
VS Code and Copilot read separate input/output metadata fields for UI display. GLM models can have very large output limits, so the extension advertises a small response reserve to keep the Language Models table, model picker tooltip, and chat context indicator consistent while still sending each model's full bundled max output limit to the Z.AI API.
Bundled model limits
| Model | Context window | Max output tokens | Vision |
|---|---|---|---|
| `glm-4.7` | 200K (204,800) | 128K (131,072) | ❌ |
| `glm-5` | 200K (204,800) | 128K (131,072) | ❌ |
| `glm-5.1` | 200K (204,800) | 128K (131,072) | ❌ |
| `glm-4.5-air` | 128K (131,072) | 96K (98,304) | ❌ |
| `glm-4.5-flash` | 128K (131,072) | 96K (98,304) | ❌ |
| `glm-5v-turbo` | 200K (204,800) | 128K (131,072) | ✅ |
| `glm-4.6v` | 128K (131,072) | 32K (32,768) | ✅ |
| `glm-4.6v-flash` | 128K (131,072) | 32K (32,768) | ✅ |
Set zai.maxInputTokens or zai.maxTokens to a non-zero value to override the bundled defaults globally.
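The override rule (0 means "use the bundled value") can be sketched as below. The three rows are copied from the table above; `resolveLimits` and the interface names are hypothetical.

```typescript
interface ModelLimits {
  contextWindow: number;
  maxOutputTokens: number;
}

// Three rows copied from the bundled limits table above.
const BUNDLED_LIMITS: Record<string, ModelLimits> = {
  "glm-5.1": { contextWindow: 204800, maxOutputTokens: 131072 },
  "glm-4.5-air": { contextWindow: 131072, maxOutputTokens: 98304 },
  "glm-4.6v": { contextWindow: 131072, maxOutputTokens: 32768 },
};

// Mirrors zai.maxInputTokens and zai.maxTokens; 0 means "no override".
interface LimitOverrides {
  maxInputTokens: number;
  maxTokens: number;
}

function resolveLimits(model: string, overrides: LimitOverrides): ModelLimits | undefined {
  const bundled = BUNDLED_LIMITS[model];
  if (!bundled) return undefined; // unknown model: no bundled metadata to fall back on
  return {
    contextWindow: overrides.maxInputTokens > 0 ? overrides.maxInputTokens : bundled.contextWindow,
    maxOutputTokens: overrides.maxTokens > 0 ? overrides.maxTokens : bundled.maxOutputTokens,
  };
}
```

Note that the override is global: a non-zero `zai.maxTokens` of 8192 would apply to `glm-5.1` and `glm-4.6v` alike.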
All models use the OpenAI-compatible chat completions endpoint:
https://api.z.ai/api/coding/paas/v4/chat/completions
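For orientation, a request to this endpoint might be assembled as in the sketch below. It assumes the usual OpenAI-style body fields (`model`, `messages`, `temperature`, `max_tokens`) and an `Authorization: Bearer` header; `buildChatRequest` is a hypothetical helper, and `maxTokens` is expected to be already resolved from the bundled limits or `zai.maxTokens`.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

const CHAT_URL = "https://api.z.ai/api/coding/paas/v4/chat/completions";

// Returns plain data so the caller can hand it to any HTTP client.
function buildChatRequest(
  apiKey: string,
  model: string,
  messages: ChatMessage[],
  maxTokens: number,
  temperature: number = 0.2, // same default as zai.temperature
) {
  const body = { model, messages, temperature, max_tokens: maxTokens };
  return {
    url: CHAT_URL,
    method: "POST" as const,
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`, // assumed auth scheme
    },
    body: JSON.stringify(body),
  };
}
```

A caller could pass the result to `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`.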
Development
```bash
# Install dependencies
npm install

# Compile TypeScript
npm run compile

# Watch mode
npm run watch
```
Press F5 in VS Code to launch an Extension Development Host with the extension loaded.
To package a .vsix for local install:
```bash
npm run package
```
Tests
Two test suites, both using Node's built-in test runner:
```bash
# Quota module
npx tsx --test src/test/quota.test.ts

# Deep research modules
npm test
```
The npm test runner covers 9 vscode-free pure modules: mcpInputBuilders (5), mcpResponseParser (15), mcpRateLimit (6), mcpTimeout (5), mcpToolNameResolver (9), junkUrlFilter (10), ranker (5), budget (5), cache (5), plus URL dedup (5). 75 tests, 100% pass rate, runs in ~600ms.
Contributing
Issues and pull requests are welcome. Please open an issue first for significant changes so we can discuss the approach.
Contributors
- @nik13513513 (Alex Kor) — Z.AI Coding Plan quota tracking, graphical SVG donut tooltip, quota auth error handling, and test suite ([PR #2](https://github.com/ltmoerdani/zai-copilot-chat/pull/2), [PR #3](https://github.com/ltmoerdani/zai-copilot-chat/pull/3))
License
MIT — see LICENSE for details.