Server data from the Official MCP Registry
Windows 11 desktop automation MCP server. Win32, UIA, CDP/Chrome, 56 tools. Windows-only.
Windows 11 desktop automation MCP server. Win32, UIA, CDP/Chrome, 56 tools. Windows-only.
Valid MCP server (1 strong, 1 medium validity signals). 3 known CVEs in dependencies (0 critical, 3 high severity) Package registry verified. Imported from the Official MCP Registry.
3 files analyzed · 4 issues found
Security scores are indicators to help you make informed decisions, not guarantees. Always review permissions before connecting any MCP server.
This plugin requests these system permissions. Most are normal for its category.
Set these up before or after installing:
Environment variable: DESKTOP_TOUCH_AUTO_GUARD
Environment variable: DESKTOP_TOUCH_PERCEPTION_RESOURCES
Add this to your MCP configuration file:
{
"mcpServers": {
"io-github-harusame64-desktop-touch-mcp": {
"env": {
"DESKTOP_TOUCH_AUTO_GUARD": "your-desktop-touch-auto-guard-here",
"DESKTOP_TOUCH_PERCEPTION_RESOURCES": "your-desktop-touch-perception-resources-here"
},
"args": [
"-y",
"@harusame64/desktop-touch-mcp"
],
"command": "npx"
}
}
}From the project's GitHub README.
Stop pasting screenshots. Let Claude see and control your desktop directly.
An MCP server that gives Claude eyes and hands on Windows — 56 tools covering screenshots, mouse, keyboard, Windows UI Automation, Chrome DevTools Protocol, clipboard, desktop notifications, SmartScroll, and a Reactive Perception Graph for safe multi-step automation, designed from the ground up for LLM efficiency.
Applies MPEG P-frame diffing to window capture: only changed windows are sent after the first frame, cutting token usage by ~60–80% in typical automation loops.
run_macro batches multiple operations into a single API call; diffMode sends only the windows that changed since the last frame. Minimal tokens, minimal round-trips.lensId for a window or browser tab, pass it to action tools, and get guard-checked post.perception feedback after each action. It reduces repeated screenshot / get_context calls and prevents wrong-window typing or stale-coordinate clicks.GetWindowTextW for window titles, avoiding nut-js garbling. IME bypass input supported for Japanese/Chinese/Korean environments.detail="image" (~443 tok) / detail="text" (~100–300 tok) / diffMode=true (~160 tok). Send pixels only when you actually need to see them.dotByDot=true captures at native resolution (WebP). Image pixel = screen coordinate — no scale math needed. With origin+scale passed to mouse_click, the server converts coords for you — eliminating off-by-one / scale bugs.grayscale=true (~50% size), dotByDotMaxDimension=1280 (auto-scaled with coord preservation), and windowTitle + region sub-crops help exclude browser chrome and other irrelevant pixels. Typical reduction for heavy captures: 50–70%.detail="text" on Chrome/Edge/Brave auto-skips UIA (prohibitively slow there) and runs Windows OCR. hints.chromiumGuard + hints.ocrFallbackFired flag the path taken.detail="text" returns button names and clickAt coords as JSON. Claude can click the right element without ever looking at a screenshot.dock_window snaps any window to a screen corner with always-on-top. Set DESKTOP_TOUCH_DOCK_TITLE='@parent' to auto-dock the terminal hosting Claude on MCP startup — the process-tree walker finds the right window regardless of title.| OS | Windows 10 / 11 (64-bit) |
| Node.js | v20+ recommended (tested on v22+) |
| PowerShell | 5.1+ (bundled with Windows) |
| Claude CLI | claude command must be available |
Note: nut-js native bindings require the Visual C++ Redistributable. Download from Microsoft if not already installed.
npx -y @harusame64/desktop-touch-mcp
The npm launcher downloads the latest desktop-touch-mcp-windows.zip from GitHub Releases on first run and caches it under %USERPROFILE%\.desktop-touch-mcp. Later runs reuse the cached release unless a newer GitHub Release is available.
Add to ~/.claude.json under mcpServers:
{
"mcpServers": {
"desktop-touch": {
"type": "stdio",
"command": "npx",
"args": ["-y", "@harusame64/desktop-touch-mcp"]
}
}
}
No system prompt needed. The command reference is automatically injected into Claude via the MCP initialize response's instructions field.
Clients that require an HTTP endpoint (GPT Desktop, VS Code Copilot, Cursor, etc.) can use the built-in Streamable HTTP transport:
npx -y @harusame64/desktop-touch-mcp --http
# or with a custom port:
npx -y @harusame64/desktop-touch-mcp --http --port 8080
The server starts at http://127.0.0.1:23847/mcp (localhost only). Register the URL in your MCP client settings. A health check is available at http://127.0.0.1:<port>/health.
In HTTP mode the system tray icon shows the active URL and provides quick-copy and open-in-browser shortcuts.
git clone https://github.com/Harusame64/desktop-touch-mcp.git
cd desktop-touch-mcp
npm install
npm install runs the prepare script, which compiles TypeScript to dist/. No separate build step is required.
For a local checkout, register the built server directly:
{
"mcpServers": {
"desktop-touch": {
"type": "stdio",
"command": "node",
"args": ["D:/path/to/desktop-touch-mcp/dist/index.js"]
}
}
}
Note: Replace
D:/path/to/desktop-touch-mcpwith the actual path where you cloned this repository.
📖 Full command reference:
docs/system-overview.md— every tool's parameters, response shape, coordinate math, layer-buffer strategy, and engineering notes in one place.
| Tool | Description |
|---|---|
screenshot | Main capture. Supports detail, dotByDot, dotByDotMaxDimension, grayscale, region sub-crop, diffMode |
screenshot_background | Capture a background window without focusing it (PrintWindow API) |
screenshot_ocr | Windows.Media.Ocr on a window; returns word-level text + screen clickAt coords |
get_screen_info | Monitor layout, DPI, cursor position |
scroll_capture | Full-page stitch by scrolling (MAE overlap detection + 10% fallback) |
| Tool | Description |
|---|---|
get_windows | List all windows in Z-order |
get_active_window | Info about the focused window |
focus_window | Bring a window to foreground by partial title match |
dock_window | Snap a window to a screen corner at a small size + always-on-top (for keeping CLI visible) |
| Tool | Description |
|---|---|
mouse_move / mouse_click / mouse_drag | Move, click, drag. doubleClick / tripleClick (line-select). Accept speed and homing parameters |
scroll | Scroll in any direction. Accepts speed and homing parameters |
get_cursor_position | Current cursor coordinates |
| Tool | Description |
|---|---|
keyboard_type | Type text. use_clipboard=true bypasses IME (required for em-dash / smart quotes). replaceAll=true sends Ctrl+A before typing. Non-ASCII symbols trigger clipboard mode automatically (opt-out: forceKeystrokes=true) |
keyboard_press | Key combos (ctrl+c, alt+f4, etc.) |
| Tool | Description |
|---|---|
get_ui_elements | Full UIA element tree for a window |
click_element | Click a button by name or automationId — no coordinates needed |
set_element_value | Write directly to a text field |
scope_element | High-res zoom crop of an element + its child tree |
| Tool | Description |
|---|---|
browser_launch | Launch Chrome/Edge/Brave with --remote-debugging-port and wait for the CDP endpoint (idempotent) |
browser_connect | Connect to Chrome/Edge via CDP; lists open tabs with active:true/false |
browser_find_element | CSS selector → exact physical screen coords |
browser_click_element | Find DOM element + click in one step |
browser_eval | Evaluate JS expression in the browser tab |
browser_fill_input | Fill React/Vue/Svelte controlled inputs via CDP — works where browser_eval value assignment doesn't update framework state |
browser_get_dom | Get outerHTML of element or document.body |
browser_get_interactive | Enumerate links / buttons / inputs + ARIA toggles with state.{checked,pressed,selected,expanded}; each element includes viewportPosition |
browser_get_app_state | SPA state extractor — one CDP call that scans __NEXT_DATA__, __NUXT_DATA__, __REMIX_CONTEXT__, __APOLLO_STATE__, GitHub react-app embeddedData, JSON-LD, window.__INITIAL_STATE__ |
browser_search | Grep DOM by text / regex / role / ariaLabel / selector with confidence ranking |
browser_navigate | Navigate via CDP Page.navigate; waitForLoad:true (default) returns once readyState==='complete' |
browser_disconnect | Close cached CDP WebSocket sessions |
All browser_* tools that touch the DOM accept includeContext:false to omit the trailing activeTab: / readyState: lines (saves ~150 tok/call on chained invocations). Within a 500 ms window, consecutive calls reuse one tab-context fetch automatically.
| Tool | Description |
|---|---|
workspace_snapshot | All windows: thumbnails + UI summaries in one call |
workspace_launch | Launch an app and auto-detect the new window |
| Tool | Description |
|---|---|
pin_window / unpin_window | Always-on-top toggle |
run_macro | Execute up to 50 steps sequentially in one MCP call |
| Tool | Description |
|---|---|
clipboard_read | Read the current Windows clipboard text (non-text payloads return empty string) |
clipboard_write | Write text to the Windows clipboard; full Unicode / emoji / CJK support |
| Tool | Description |
|---|---|
notification_show | Show a Windows system tray balloon notification — useful to alert the user when a long-running task finishes |
| Tool | Description |
|---|---|
scroll_to_element | Scroll a named element into the viewport without computing scroll amounts. Chrome path: selector + block alignment. Native path: name + windowTitle via UIA ScrollItemPattern |
smart_scroll | SmartScroll — unified scroll dispatcher: CDP → UIA → image binary-search fallback. Handles nested containers, virtualised lists (TanStack/React Virtualized), sticky-header occlusion, and image-only environments. Returns pageRatio, ancestors[], and hash-verified scrolled |
For web automation, connect Chrome or Edge with the remote debugging port enabled — no Selenium or Playwright needed.
# Launch Chrome in CDP mode
chrome.exe --remote-debugging-port=9222 --user-data-dir=C:\tmp\cdp
browser_launch() → launch Chrome/Edge/Brave in debug mode (idempotent)
browser_connect() → list open tabs + get tabIds
browser_find_element("#submit") → CSS selector → physical screen coords
browser_click_element("#submit") → find + click in one step (auto-focuses browser)
browser_eval("document.title") → evaluate JS, returns result
browser_fill_input("#email", "user@example.com") → fill React/Vue/Svelte controlled input (state-safe)
browser_get_dom("#main", maxLength=5000)→ outerHTML, truncated to maxLength chars
browser_get_interactive() → links/buttons/inputs + ARIA toggles + viewportPosition per element
browser_get_app_state() → one-shot SPA state (Next/Nuxt/Remix/Apollo/GitHub react-app/Redux SSR)
browser_search(by="text", pattern="...")→ grep DOM with confidence ranking
browser_navigate("https://example.com") → navigate via CDP (no address bar interaction)
browser_disconnect() → clean up WebSocket sessions
For chained calls in the same tab, pass includeContext:false to omit the activeTab/readyState annotation (~150 tok/call saved). Boolean / object params accept the LLM-friendly string spellings ("true", "{}").
Coordinates returned by browser_find_element account for the browser chrome (tab strip + address bar height) and devicePixelRatio, so they can be passed directly to mouse_click without any scaling.
Recommended web workflow:
browser_connect() → browser_get_dom() → browser_find_element(selector) → browser_click_element(selector)
Keep Claude CLI visible while operating other apps full-screen. Set env vars in your MCP config and the docked window auto-snaps into place every MCP startup.
{
"mcpServers": {
"desktop-touch": {
"type": "stdio",
"command": "npx",
"args": ["-y", "@harusame64/desktop-touch-mcp"],
"env": {
"DESKTOP_TOUCH_DOCK_TITLE": "@parent",
"DESKTOP_TOUCH_DOCK_CORNER": "bottom-right",
"DESKTOP_TOUCH_DOCK_WIDTH": "480",
"DESKTOP_TOUCH_DOCK_HEIGHT": "360",
"DESKTOP_TOUCH_DOCK_PIN": "true"
}
}
}
}
| Env var | Default | Notes |
|---|---|---|
DESKTOP_TOUCH_DOCK_TITLE | (unset = off) | @parent walks the MCP process tree to find the hosting terminal — immune to title / branch / project changes. Or use a literal substring. |
DESKTOP_TOUCH_DOCK_CORNER | bottom-right | top-left / top-right / bottom-left / bottom-right |
DESKTOP_TOUCH_DOCK_WIDTH / HEIGHT | 480 / 360 | px ("480") or ratio of work area ("25%") — 4K/8K auto-adapts |
DESKTOP_TOUCH_DOCK_PIN | true | Always-on-top toggle |
DESKTOP_TOUCH_DOCK_MONITOR | primary | Monitor id from get_screen_info |
DESKTOP_TOUCH_DOCK_SCALE_DPI | false | If true, multiply px values by dpi / 96 (opt-in per-monitor scaling) |
DESKTOP_TOUCH_DOCK_MARGIN | 8 | Screen-edge padding (px) |
DESKTOP_TOUCH_DOCK_TIMEOUT_MS | 5000 | Max wait for the target window to appear |
Input routing gotcha: when a pinned window is active (e.g. Claude CLI),
keyboard_type/keyboard_presssend keys to it, not the app you wanted to type into. Always callfocus_window(title=...)before keyboard operations, then verifyisActive=trueviascreenshot(detail='meta').
| Tool | Description |
|---|---|
perception_register | Register a live perception lens on a window or browser tab. Returns a lensId to pass to action tools |
perception_read | Force-refresh the lens and return a full perception envelope when attention is dirty/stale/blocked |
perception_forget | Release a lens when the workflow ends or the target was replaced |
perception_list | List active lenses so Claude can reuse or clean up existing tracking |
Reactive Perception Graph is desktop-touch's low-cost situational awareness layer. It keeps the target identity, focus, rect, readiness, and guard state alive across actions so Claude does not need to re-check everything with a screenshot after every small move.
# Register a lens on the target window or browser tab
perception_register({name:"editor", target:{kind:"window", match:{titleIncludes:"Notepad"}}})
→ {lensId:"perc-1", ...}
# Pass lensId to action tools. Guards run before the action;
# compact feedback arrives in post.perception after the action.
keyboard_type({text:"hello", windowTitle:"Notepad", lensId:"perc-1"})
→ post.perception: {attention:"ok", guards:{...}, latest:{target:{title, rect, foreground}}}
# If the app restarts or focus moves away, guards fail closed before unsafe input:
keyboard_type({text:"x", lensId:"perc-1"})
→ {ok:false, code:"GuardFailed", suggest:["Re-register lens for the new process instance"]}
lensId is opt-in on all action tools (keyboard_type, keyboard_press, mouse_click, mouse_drag, click_element, set_element_value, browser_click_element, browser_navigate, browser_eval). Omitting lensId preserves existing behavior exactly.
When Claude calls screenshot(detail='text') to read coordinates and then mouse_click seconds later, the target window may have moved. The homing system corrects this automatically.
| Tier | How to enable | Latency | What it does |
|---|---|---|---|
| 1 | Always-on (if cache exists) | <1ms | Applies (dx, dy) offset when window moved |
| 2 | Pass windowTitle hint | ~100ms | Auto-focuses window if it went behind another |
| 3 | Pass elementName/elementId + windowTitle | 1–3s | UIA re-query for fresh coords on resize |
# Tier 1 only (automatic)
mouse_click(x=500, y=300)
# Tier 1 + 2: also bring window to front if hidden
mouse_click(x=500, y=300, windowTitle="Notepad")
# Tier 1 + 2 + 3: also re-query UIA if window resized
mouse_click(x=500, y=300, windowTitle="Notepad", elementName="Save")
# Traction control OFF — no correction
mouse_click(x=500, y=300, homing=false)
The homing parameter is available on mouse_click, mouse_move, mouse_drag, and scroll. The cache is updated automatically on every screenshot(), get_windows(), focus_window(), and workspace_snapshot() call.
mouse_click image-local coords (origin + scale)When you take a dotByDot screenshot with dotByDotMaxDimension, the response prints the origin and scale values. Instead of computing screen coords manually, copy them into mouse_click:
# Screenshot response:
# origin: (0, 120) | scale: 0.6667
# To click image pixel (ix, iy): mouse_click(x=ix, y=iy, origin={x:0, y:120}, scale=0.6667)
mouse_click(x=640, y=300, origin={x:0, y:120}, scale=0.6667, windowTitle="Chrome")
# Server converts: screen = (0 + 640/0.6667, 120 + 300/0.6667) = (960, 570)
This eliminates a whole class of off-by-one and scale bugs. Without origin/scale, x/y remain absolute screen pixels (unchanged behavior).
screenshot key parametersdetail="image" — PNG/WebP pixels (default)
detail="text" — UIA element JSON + clickAt coords (no image, ~100–300 tok)
detail="meta" — Title + region only (cheapest, ~20 tok/window)
dotByDot=true — 1:1 WebP; image_px + origin = screen_px
dotByDotMaxDimension=N — cap longest edge (response includes scale for coord math)
grayscale=true — ~50% smaller for text-heavy captures (code/AWS console)
region={x,y,w,h} — with windowTitle: window-local coords (exclude browser chrome)
without: virtual screen coords
diffMode=true — I-frame first call, P-frame (changed windows only) after (~160 tok)
ocrFallback="auto" — detail='text' auto-fires Windows OCR on uiaSparse or empty
Recommended Chrome combo (50–70% data reduction):
screenshot(windowTitle="Chrome",
dotByDot=true, dotByDotMaxDimension=1280, grayscale=true,
region={x:0, y:120, width:1920, height:900}) # skip browser chrome
Recommended workflow:
workspace_snapshot() → full orientation (resets diff buffer)
screenshot(detail="text", windowTitle=X) → get actionable[].clickAt coords
mouse_click(x, y) → click directly, no math needed
screenshot(diffMode=true) → check only what changed (~160 tok)
Move the mouse to the top-left corner of the screen (within 10px of 0,0) to immediately terminate the MCP server.
checkFailsafe() runs before every tool handlerworkspace_launch blocklist:
cmd.exe, powershell.exe, pwsh.exe, wscript.exe, cscript.exe, mshta.exe, regsvr32.exe, rundll32.exe, msiexec.exe, bash.exe, wsl.exe are blocked.
Script extensions (.bat, .ps1, .vbs, etc.) are rejected. Arguments containing ;, &, |, `, $(, ${ are also rejected.
keyboard_press blocklist:
Win+R (Run dialog), Win+X (admin menu), Win+S (search), Win+L (lock screen) are blocked.
All -like patterns in the UIA bridge are sanitized with escapeLike(), which escapes wildcard characters (*, ?, [, ]) before they reach PowerShell.
workspace_launchShell interpreters are blocked by default. To allow specific executables, create an allowlist file:
File locations (searched in order):
DESKTOP_TOUCH_ALLOWLIST environment variable~/.claude/desktop-touch-allowlist.jsondesktop-touch-allowlist.json in the server's working directoryFormat:
{
"allowedExecutables": [
"pwsh.exe",
"C:\\Tools\\myapp.exe"
]
}
Changes take effect immediately — no restart needed.
All mouse tools (mouse_move, mouse_click, mouse_drag, scroll) accept an optional speed parameter:
| Value | Behavior |
|---|---|
| Omitted | Uses the configured default (see below) |
0 | Instant teleport — setPosition(), no animation |
1–N | Animated movement at N px/sec |
Default speed is 1500 px/sec. Change it permanently via the DESKTOP_TOUCH_MOUSE_SPEED environment variable:
{
"mcpServers": {
"desktop-touch": {
"type": "stdio",
"command": "npx",
"args": ["-y", "@harusame64/desktop-touch-mcp"],
"env": {
"DESKTOP_TOUCH_MOUSE_SPEED": "3000"
}
}
}
}
Common values: 0 = teleport, 1500 = default gentle, 3000 = fast, 5000 = very fast.
Windows foreground-stealing protection can prevent SetForegroundWindow from succeeding when another window (such as a pinned Claude CLI) is in the foreground. This causes subsequent keystrokes or clicks to land in the wrong window — a silent failure.
mouse_click, keyboard_type, keyboard_press, and terminal_send all accept a forceFocus parameter that bypasses this protection using AttachThreadInput:
{
"name": "mouse_click",
"arguments": {
"x": 500,
"y": 300,
"windowTitle": "Google Chrome",
"forceFocus": true
}
}
If the force attempt is refused despite AttachThreadInput, the response includes hints.warnings: ["ForceFocusRefused"].
Global default via environment variable:
{
"mcpServers": {
"desktop-touch": {
"env": {
"DESKTOP_TOUCH_FORCE_FOCUS": "1"
}
}
}
}
Setting DESKTOP_TOUCH_FORCE_FOCUS=1 makes forceFocus: true the default for all four tools without changing each call.
Known tradeoffs:
AttachThreadInput window, key state and mouse capture are shared between the two threads. In rapid macro sequences this can cause a race condition (rare in practice).forceFocus (or unset the env var) when the user is manually operating another app to avoid unexpected focus shifts.Action tools (mouse_click, mouse_drag, keyboard_type, keyboard_press, click_element, set_element_value, browser_click_element, browser_navigate) automatically guard each action when you pass windowTitle / tabId:
post.perception.status on every response — including failures — so the LLM can recover without a screenshotDisabling auto guard — set DESKTOP_TOUCH_AUTO_GUARD=0 to restore v0.11.12 behavior (no auto guard):
{
"mcpServers": {
"desktop-touch": {
"type": "stdio",
"command": "npx",
"args": ["-y", "@harusame64/desktop-touch-mcp"],
"env": {
"DESKTOP_TOUCH_AUTO_GUARD": "0"
}
}
}
}
When auto guard is enabled (default), post.perception.status will be one of:
| Status | Meaning |
|---|---|
ok | Guard passed — target verified |
unguarded | windowTitle not provided; action ran without guard |
target_not_found | No window matched the given title |
ambiguous_target | Multiple windows matched; use a more specific title |
identity_changed | Window was replaced (process restart / HWND change) |
unsafe_coordinates | Click coordinates are outside the target window rect |
needs_escalation | Use browser_click_element or specify windowTitle |
When unsafe_coordinates or identity_changed is returned, the response may include a suggestedFix.fixId. Pass that fixId to the relevant tool call to approve the recovery:
{ "name": "mouse_click", "arguments": { "fixId": "fix-..." } }
{ "name": "keyboard_type", "arguments": { "fixId": "fix-...", "text": "hello" } }
{ "name": "click_element", "arguments": { "fixId": "fix-..." } }
{ "name": "browser_click_element", "arguments": { "fixId": "fix-..." } }
The fix is one-shot and expires in 15 seconds. The server revalidates the target process identity before executing.
The server tracks a semantic timeline of what happened to each target window/tab. Recent events are included in:
get_history → recentTargetKeys: array of 3 most recently active target keys (compact, no event bodies)perception_read(lensId) → recentEvents: up to 10 events for that lens's target, each with tsMs, semantic, summaryEnable the MCP resources below to browse timelines:
{ "env": { "DESKTOP_TOUCH_PERCEPTION_RESOURCES": "1" } }
MCP resources available when enabled:
| URI | Content |
|---|---|
perception://target/{targetKey}/timeline | Semantic event timeline for a target |
perception://targets/recent | Most recently active target keys |
perception://lens/{lensId}/summary | Lens attention/guard state |
Manual lenses (created via perception_register) are now evicted by least-recently-used instead of insertion order. Using perception_read, evaluatePreToolGuards, or buildEnvelopeFor on a lens promotes it. The hard limit of 16 active lenses is unchanged.
Pass withPerception: true to receive a structured JSON response with post.perception instead of raw text:
{ "name": "browser_eval", "arguments": { "expression": "document.title", "withPerception": true } }
Returns { ok: true, result: "...", post: { perception: { status: "ok", ... } } }.
mouse_drag now guards both start and end coordinates. Drags that cross window boundaries (or reach the desktop wallpaper) are blocked by default. To allow intentional cross-window or range-selection drags:
{ "name": "mouse_drag", "arguments": { "startX": 100, "startY": 100, "endX": 900, "endY": 900, "allowCrossWindowDrag": true } }
| Limitation | Detail | Workaround |
|---|---|---|
| Games / video players may return black or hang in background capture | DirectX fullscreen apps may not work even with PW_RENDERFULLCONTENT | Retry with screenshot_background(fullContent=false); if still black, use foreground screenshot |
| UIA call overhead | ~300ms per call via PowerShell; workspace_snapshot uses a 2s timeout internally | Batch with workspace_snapshot upfront, then use diffMode for incremental checks |
| Chrome / WinUI3 UIA elements are empty | Chromium exposes only limited UIA | screenshot(detail='text') auto-detects Chromium and falls back to Windows OCR (hints.chromiumGuard=true). For richer DOM access use browser_connect + browser_find_element |
Chromium title-regex misses when sites rewrite document.title | Guard relies on the - Google Chrome suffix being present; some sites push it off the end of a long title | Title is treated as plain Chrome (UIA runs). OCR path is still reachable via ocrFallback='always' or when UIA returns <5 elements (uiaSparse) |
browser_* CDP tools need Chrome launched with --remote-debugging-port | If Chrome is already running on the default profile without the flag, browser_launch / browser_connect fail. The CDP E2E suite (tests/e2e/browser-cdp.test.ts) will also fail in that state | Close Chrome first, then browser_launch will relaunch it in debug mode, or start Chrome manually with --remote-debugging-port=9222 --user-data-dir=C:\tmp\cdp |
| Layer buffer TTL | Buffer auto-clears after 90s of inactivity → next diffMode becomes an I-frame | After long waits, call workspace_snapshot to explicitly reset the buffer |
keyboard_type / keyboard_press follow focus | When dock_window(pin=true) keeps another window on top (e.g. Claude CLI), keystrokes may be absorbed by that window | Call focus_window(title=...) first and verify isActive=true via screenshot(detail='meta') before sending keys |
keyboard_type em-dash / smart quotes in Chrome/Edge | Non-ASCII punctuation (em-dash —, en-dash –, smart quotes "" '') can be intercepted as keyboard accelerators, shifting focus to the address bar | Always use use_clipboard=true when the text contains such characters |
browser_eval on React / Vue / Svelte inputs | Setting element.value = ... or dispatching synthetic events does not update the framework's internal state | Use browser_fill_input(selector, value) — it uses native prototype setter + InputEvent which does update React/Vue/Svelte state |
| Mode | Tokens | Use case |
|---|---|---|
screenshot (768px PNG) | ~443 tok | General visual check |
screenshot(dotByDot=true) window | ~800 tok | Precise clicking (no coordinate math) |
screenshot(diffMode=true) | ~160 tok | Post-action diff |
screenshot(detail="text") | ~100–300 tok | UI interaction (no image) |
workspace_snapshot | ~2000 tok | Full session orientation |
MIT
Be the first to review this server!
by Modelcontextprotocol · Developer Tools
Web content fetching and conversion for efficient LLM usage
by Modelcontextprotocol · Developer Tools
Read, search, and manipulate Git repositories programmatically
by Toleno · Developer Tools
Toleno Network MCP Server — Manage your Toleno mining account with Claude AI using natural language.
by mcp-marketplace · Developer Tools
Create, build, and publish Python MCP servers to PyPI — conversationally.