interact-mcpMCP server for browser interaction and desktop window analysis. Gives agents the ability to navigate, click, type, scroll, and drag in a headless browser — plus capture and analyze any desktop window — with optional vision analysis. Instead of screenshot → analyze → act loops, each tool returns a text summary of what changed. Vision analysis is optional and controlled by the caller via Install
Or add to a project:
Playwright browsers are auto-installed on first run. Desktop window analysis requires X11 +
ConfigurationAll settings are environment variables with the
API key resolutionAPI keys are resolved automatically from standard provider environment variables based on the model prefix. No interact-mcp-specific key variables are needed.
Vision model examples
MCP client setupClaude Desktop
VS Code / CopilotCreate OpenAI only:
Gemini only:
Multi-provider (OpenAI images + Gemini video):
Tools (6)The browser session is persistent across all tool calls. Navigate once, then interact — each call picks up where the last left off.
|
| Type | Fields | Optional |
|---|---|---|
click |
selector OR x+y |
wait |
type_text |
selector, text |
clear_first (default: true), wait |
scroll |
— | direction (default: down), amount (default: 3), wait |
drag |
from_x, from_y, to_x, to_y |
wait |
navigate |
url |
wait |
evaluate_js |
script |
wait |
Observations (read current state, no diff):
| Type | Fields | Optional |
|---|---|---|
screenshot |
— | scope, query |
wait_for |
selector |
state (visible/hidden/attached/detached), timeout (ms) |
Single action:
run_actions(actions=[{"type": "click", "selector": "button[type=submit]", "wait": "networkidle"}])
Multi-step with mixed actions and observations:
run_actions(actions=[
{"type": "navigate", "url": "http://localhost:8000/login"},
{"type": "type_text", "selector": "#email", "text": "user@example.com"},
{"type": "type_text", "selector": "#password", "text": "secret"},
{"type": "click", "selector": "button[type=submit]", "wait": "networkidle"},
{"type": "wait_for", "selector": ".dashboard"},
{"type": "screenshot", "scope": ".welcome-banner", "query": "What does it say?"}
], query="Is the user logged in?")
screenshot(query?, scope?)
Capture the current page or a specific element. With query, returns vision analysis.
screenshot()
screenshot(scope="#hero-table", query="Are there alignment issues?")
get_page_state(scope?)
Get the current page URL, title, accessibility tree, focused element, and visible text. Use scope to focus on a specific element.
get_page_state()
get_page_state(scope=".sidebar")
Desktop window tools
These work on X11 desktops (Linux). They capture real desktop windows — not just the headless browser.
list_desktop_windows()
List all visible desktop windows with their names and dimensions.
list_desktop_windows()
analyze_window(title, query?)
Capture a desktop window by title substring and analyze it with vision (if configured).
analyze_window(title="Chrome")
analyze_window(title="Visual Studio Code", query="What file is currently open?")
analyze_window(title="Slack", query="Are there any unread messages?")
How tracking works
The browser session is a single persistent Playwright page. Each tool call operates on the same page state:
navigate("http://localhost:8000")→ reads page title + contentrun_actions(actions=[{"type": "click", "selector": "#sign-in"}])→ reads what changedrun_actions(actions=[{"type": "type_text", "selector": "#email", "text": "user@example.com"}])→ reads what changedscreenshot(query="Is the form filled correctly?")→ gets visual confirmationrun_actions(actions=[{"type": "click", "selector": "button[type=submit]", "wait": "networkidle"}])→ reads what changed
Or combine steps 2-5 into one call:
run_actions(actions=[
{"type": "click", "selector": "#sign-in", "wait": "networkidle"},
{"type": "type_text", "selector": "#email", "text": "user@example.com"},
{"type": "type_text", "selector": "#password", "text": "secret"},
{"type": "click", "selector": "button[type=submit]", "wait": "networkidle"},
{"type": "screenshot", "scope": ".dashboard"}
], query="Did login succeed?")
Typical workflow: interact with a local server
navigate("http://localhost:8000", wait="networkidle")
run_actions(actions=[{"type": "annotate"}])
run_actions(actions=[
{"type": "click", "selector": "nav a[href='/settings']", "wait": "networkidle"},
{"type": "type_text", "selector": "#api-key", "text": "sk-test-123"},
{"type": "click", "selector": "button[type=submit]", "wait": "networkidle"},
{"type": "screenshot", "scope": ".settings-form"}
], query="Did the settings save?")
Scoped inspection
navigate("http://localhost:8000")
screenshot(scope="#hero-table", query="Are there any alignment issues in the table?")
get_page_state(scope=".sidebar")
Batch workflow: login + navigate
run_actions(actions=[
{"type": "navigate", "url": "http://localhost:8000/login"},
{"type": "type_text", "selector": "#username", "text": "admin"},
{"type": "type_text", "selector": "#password", "text": "secret"},
{"type": "click", "selector": "button[type=submit]"}
], query="Did login succeed? What page are we on now?")
Screenshot dumping
Set INTERACT_MCP_SCREENSHOT_DUMP_DIR to a folder path and every PageState capture will save a timestamped PNG there. Filenames are {timestamp}_{url_host}.png. Screenshots are still consumed and analyzed normally — dumping is additive.
INTERACT_MCP_SCREENSHOT_DUMP_DIR=./debug-screenshots uvx interact-mcp
No system prompt
Vision calls do not include a system prompt. The query parameter you pass to any tool becomes the user-facing prompt the model receives alongside the page data. You control all framing.