@clawhub-marmutapp-fbd5b6b695
---
name: superbased
description: Eyes AND hands for OpenClaw — capture, AI vision, OCR, recording, voice dictation, and full GUI automation via 72 MCP tools. Use when the agent needs to see the user's screen OR drive their desktop (click / type / scroll / drag / form-fill / sequence) AND when the user mentions screenshots, screen recording, visual regression, OCR, voice transcription, or asks to automate a UI workflow.
---
SuperBased gives OpenClaw agents both eyes (screen capture, AI vision, OCR) and hands (full GUI automation with humanization v2 + CAPTCHA-solving guidance) on the user's desktop. The actual capabilities are exposed through 72 MCP tools served by the SuperBased MCP server (`superbased mcp`); this skill bundle teaches the agent **when** to reach for which tool.
## Two-step install (run once)
```bash
# Prerequisite: the SuperBased CLI must be on PATH
npm install -g superbased
# 1. Install this skills bundle from ClawHub
openclaw skills install superbased
# 2. Register the SuperBased MCP server
openclaw mcp set superbased '{"command":"superbased","args":["mcp"]}'
```
Optional: install the SuperBased desktop app from [superbased.app](https://superbased.app) for a GUI to browse captures, configure providers, and manage the gallery. When the desktop app is running, `superbased mcp` auto-bridges to it via a PID file at `~/.superbased/`, so OpenClaw and the desktop share state.
## When to use SuperBased
Trigger SuperBased when the user's request involves any of:
- **Seeing what's on screen** — "look at this", "what's on my screen", "describe what I'm seeing", "read this dialog"
- **Verifying a UI change** — "did the button update?", "is the error gone?"
- **Reading content that's hidden behind scroll** — "what are all the settings?", "walk me through the sidebar"
- **Visual regression testing** — "record a baseline of the login flow", "did anything change visually?"
- **Watching for issues during long-running processes** — "monitor my deploy for errors", "let me know if anything fails"
- **Extracting text from images / screen** — "OCR this", "extract the text from this region"
- **Voice input** — "transcribe what I'm about to say", "type via dictation"
- **Compressing large text into images** — "send this 5K-token block as one image"
- **Annotating / redacting screenshots** — "highlight the broken thing", "redact the API key before sharing"
- **Driving the desktop UI** — "click that button", "type into the email field", "fill out this form", "press Cmd+S"
- **Multi-step workflow automation** — "open File menu, pick Open, type the path, press Enter, screenshot the result"
- **Solving in-flow CAPTCHA challenges** — "this drag puzzle is blocking me", "select all squares with traffic lights"
- **Fighting bot detection** — when an automation flow on a hardened site needs cursor-trajectory humanization
## Sub-skills (use these as the agent's working knowledge)
The 11 SKILL.md files in this bundle each cover one trigger category. Read the relevant one first when the user request matches its description:
| File | Use when |
|---|---|
| [skills/screenshot/SKILL.md](./skills/screenshot/SKILL.md) | Capturing the screen at the right resolution / window / region |
| [skills/visual-qa/SKILL.md](./skills/visual-qa/SKILL.md) | Record-baseline → make-changes → record-again → diff workflow |
| [skills/monitor/SKILL.md](./skills/monitor/SKILL.md) | Proactive screen watching during deploys, tests, builds |
| [skills/walkthrough/SKILL.md](./skills/walkthrough/SKILL.md) | Reading a scrollable section end-to-end via `superbased_scroll_capture` |
| [skills/compress/SKILL.md](./skills/compress/SKILL.md) | Converting large text to token-efficient images |
| [skills/redact/SKILL.md](./skills/redact/SKILL.md) | Auto-redacting secrets / PII before sharing |
| [skills/dictation/SKILL.md](./skills/dictation/SKILL.md) | Voice input, audio transcription, speech-to-text |
| [skills/annotate/SKILL.md](./skills/annotate/SKILL.md) | Highlighting areas, marking regressions, drawing on captures |
| [skills/gui-automation/SKILL.md](./skills/gui-automation/SKILL.md) | Click / type / scroll / drag / form-fill / sequence — driving the desktop |
| [skills/captcha-solving/SKILL.md](./skills/captcha-solving/SKILL.md) | reCAPTCHA / Cloudflare Turnstile / drag puzzles / rotation puzzles / image grids |
| [skills/humanization/SKILL.md](./skills/humanization/SKILL.md) | Picking the right `humanize` profile (off / light / human / paranoid) per call |
## The 72 MCP tools at a glance
Capture & View (5): `superbased_screenshot`, `_capture_image`, `_capture`, `_gallery_image`, `_window_list`
AI & OCR (8): `superbased_ai`, `_ai_usage`, `_ocr`, `_transcribe`, `_compress_text`, `_project`, `_workspace_sync`, `_stt_status`
Gallery (2): `superbased_gallery`, `_gallery_update`
Privacy & Annotations (2): `superbased_redact`, `_annotate`
Dictation & Voice (2): `superbased_dictate`, `_dictation_history`
Recording & Visual QA (7): `superbased_recording`, `_sessions`, `_describe_frames`, `_narrate`, `_diff`, `_baseline`, `_export`
Settings, Auth & System (6): `superbased_settings`, `_presets`, `_auth`, `_license`, `_health`, `_clipboard`
GUI Automation (40): `superbased_ui_dump`, `_scroll_capture`, `_scroll_to`, `_sequence`, `_click`, `_type`, `_hotkey`, `_scroll`, `_drag`, `_drag_file`, `_hover`, `_context_menu_select`, `_form_fill`, `_dialog_handle`, `_open_url`, `_find_in_page`, `_tab_management`, `_tray_click`, `_virtual_desktop`, `_window_state`, `_resize_window`, `_focus_window`, `_window_bounds`, `_find_title_bar_drag_region`, `_display_list`, `_launch_app`, `_find_image`, `_capture_template`, `_pixel_color`, `_ax_invoke`, `_accessibility_tree`, `_locate`, `_wait`, `_wait_for`, `_mouse_position`, `_dry_run`, `_replay`, `_doctor_gui_automation`, `_undo_last`, `_tools`
## Safety rails (for the GUI automation surface)
Before any state-modifying GUI action (click, type, drag, sequence, form_fill, etc.):
1. The master toggle (Settings > GUI Automation > Enabled) must be on. Run `superbased_doctor_gui_automation` to verify.
2. Per-action toggles (click, type, hotkey, scroll, drag, hover) must each be enabled.
3. Every state-modifying call must pass `confirm: true` — the server refuses without it.
4. Protected-apps blocklist + NDJSON audit log are server-side; users can audit every action you took.
## When to bump humanization
Default `humanize: 'light'` is enough for most consumer sites. Bump to `'human'` for sites with active bot detection (Cloudflare-fronted, reCAPTCHA-gated). Bump to `'paranoid'` for hardened targets (banking, ticketing, social media bot crackdowns). See `skills/humanization/SKILL.md` for the full picker.
## Links
- [SuperBased](https://superbased.app) — Desktop app + npm CLI
- [npm: superbased](https://www.npmjs.com/package/superbased) — The CLI providing the MCP server
- [Source-of-truth Claude Code plugin](https://github.com/marmutapp/superbased-claude-code-plugin) — Where shared content (skills) is mastered
- [OpenClaw](https://openclaw.ai) + [ClawHub](https://clawhub.ai) — The runtime + registry
FILE:README.md
# SuperBased — Eyes AND Hands for OpenClaw
Screenshot capture, AI vision, OCR, screen recording, voice dictation, **and full GUI automation with humanization v2** — all via 72 MCP tools, directly inside [OpenClaw](https://openclaw.ai).
This is a [ClawHub](https://clawhub.ai) skill bundle that ships proactive guidance for SuperBased's MCP toolkit. The actual 72 tools come from the SuperBased MCP server — the skill bundle tells the OpenClaw agent **when** to use them and **how**.
## Install (two steps)
### 1. Install the skill bundle from ClawHub
```bash
openclaw skills install superbased
```
This downloads the 11 SKILL.md files into your OpenClaw workspace's `skills/` directory.
### 2. Register the SuperBased MCP server
```bash
openclaw mcp set superbased '{"command":"superbased","args":["mcp"]}'
```
This points OpenClaw at the SuperBased CLI for the actual tool calls.
### Prerequisite (one-time): install the SuperBased CLI
```bash
npm install -g superbased
```
This is the binary that the MCP server invocation runs.
### Optional: install the SuperBased desktop app
The desktop app for Windows/macOS gives you a GUI for browsing captures, configuring providers, and managing the gallery. When the desktop app is running, `superbased mcp` auto-detects it via the PID file at `~/.superbased/` and acts as a stdio↔HTTP bridge — so OpenClaw and the desktop share the same gallery / sessions / settings.
Download from [superbased.app](https://superbased.app).
## Skills (11)
| Skill | When OpenClaw Uses It |
|-------|-----------------------|
| **screenshot** | OpenClaw needs to see what's on the user's screen |
| **visual-qa** | Visual regression testing: record baseline → make changes → record again → diff |
| **monitor** | Proactive screen watching during deploys, tests, builds |
| **compress** | Large text content (>500 tokens) that would be cheaper as an image |
| **redact** | Screenshots that may contain API keys, tokens, or PII before sharing |
| **dictation** | Voice input, audio transcription, or speech-to-text |
| **annotate** | Highlighting areas, marking regressions, creating annotated screenshots |
| **walkthrough** | Multi-frame product walkthrough: capture, narrate, export |
| **gui-automation** | "Click that", "type into this", "fill the form" — drives the desktop with click/type/hotkey/scroll/drag/form-fill/sequence |
| **captcha-solving** | reCAPTCHA / Cloudflare Turnstile / drag puzzles / rotation puzzles / image grids |
| **humanization** | Sites with bot detection — picks the right humanization profile (off/light/human/paranoid) |
## Humanization v2
GUI automation actions (`click`, `type`, `drag`, `hover`) ship with a humanization layer to reduce the bot-detection signal: sin-shaped velocity envelope on cursor walks, gaussian click-target jitter, gamma-distributed pre-click settle dwell, 50–110 ms click hold variation, 45–95 ms key hold, built-in typo simulation using QWERTY same-row neighbors, pre-click tremor on the target element, occasional 2–4× micro-pauses, per-process cross-session salt mixed into seeds, inter-action catch-up pause, and opt-in idle cursor drift.
Four profiles selectable per call: `humanize: 'off' | 'light' | 'human' | 'paranoid'`. Default `light`. Bump to `human` or `paranoid` for sites with active bot detection — see the **humanization** skill.
## CAPTCHA solving
The bundle ships proactive guidance for four CAPTCHA classes: image grids (vision identifies the tiles, then one batched click sequence), drag puzzles (one-motion drag with `humanize: 'light'`), rotation puzzles (calibrate-then-execute), and checkbox-only Turnstile. It also lists what humanization can't defeat (server-side fingerprinting, audio CAPTCHAs, hCaptcha enterprise mode). See the **captcha-solving** skill.
## MCP Tools (72)
The 72 tools come from the SuperBased MCP server. Categories: Capture & View (5), AI & OCR (8), Gallery (2), Privacy & Annotations (2), Dictation & Voice (2), Recording & Visual QA (7), Settings/Auth/System (6), and **GUI Automation (40)**.
See [the source-of-truth Claude Code plugin README](https://github.com/marmutapp/superbased-claude-code-plugin#mcp-tools-72) for the full categorized list with collapsibles.
## Why two install steps?
ClawHub registers **skills** (when/how to use a tool) and **plugins** (TypeScript code), but does NOT register MCP servers directly. The clean split:
- **Skills bundle (this package)** — published to ClawHub, installable via `openclaw skills install superbased`. Tells the agent when to reach for SuperBased and what's possible.
- **MCP server (`superbased mcp`)** — registered separately via `openclaw mcp set superbased '...'`. Provides the actual 72 tools.
There is no built-in OpenClaw plugin from us. ClawHub doesn't list MCP servers via the plugin wrapper, so a wrapper plugin would just add ceremony without unlocking discoverability. If we ship one later, it'll be for OpenClaw-specific lifecycle hooks (e.g. auto-capture on chat-app message events).
## Verifying the install
```bash
openclaw mcp list # superbased should be listed
openclaw mcp show superbased # see the registered config
openclaw skills list # superbased skills should appear
openclaw skills check # validates the local skill environment
superbased --version # confirms CLI is on PATH
```
## Links
- [SuperBased](https://superbased.app) — Desktop app + npm CLI
- [npm: superbased](https://www.npmjs.com/package/superbased) — The CLI providing the MCP server
- [Source-of-truth Claude Code plugin](https://github.com/marmutapp/superbased-claude-code-plugin) — Where shared content (skills) is mastered
- [OpenClaw](https://openclaw.ai) + [ClawHub](https://clawhub.ai) — The runtime + registry
FILE:mcp-config-snippet.json
{
"command": "superbased",
"args": ["mcp"]
}
FILE:skill-bundle.json
{
"slug": "superbased",
"name": "SuperBased",
"version": "2.0.2",
"description": "Screen capture, AI vision, OCR, recording, dictation, and full GUI automation (72 MCP tools) with humanization v2 + CAPTCHA-solving guidance — gives OpenClaw agents eyes on the screen and hands on the desktop.",
"homepage": "https://superbased.app",
"repository": "https://github.com/marmutapp/superbased-openclaw-plugin",
"license": "MIT",
"author": {
"name": "Gaja AI",
"email": "[email protected]"
},
"tags": [
"screenshot",
"screen-capture",
"ocr",
"ai-vision",
"screen-recording",
"visual-testing",
"token-compression",
"dictation",
"monitor",
"gui-automation",
"click-automation",
"form-fill",
"captcha",
"humanization",
"mcp"
],
"skills": [
"screenshot",
"visual-qa",
"monitor",
"compress",
"redact",
"dictation",
"annotate",
"walkthrough",
"gui-automation",
"captcha-solving",
"humanization"
],
"requirements": {
"mcpServer": {
"name": "superbased",
"transport": "stdio",
"command": "superbased",
"args": [
"mcp"
],
"installHint": "npm install -g superbased"
},
"platforms": [
"windows",
"macos",
"linux"
],
"node": ">=20"
}
}
FILE:skills/annotate/SKILL.md
---
name: annotate
description: Add visual annotations to screenshots for bug reports, code reviews, and documentation. Use when highlighting areas of interest, marking regressions, or creating annotated screenshots.
---
Use `superbased_annotate` to add programmatic annotations to captures.
**Annotation types:**
- `rectangle`: bounding box outline around an area
- `highlight`: semi-transparent colored overlay
- `blur`: pixelate a region (useful for redacting manually)
- `text`: text label at a position
- `arrow`: arrow pointing from one location to another
**Workflow:**
1. Capture a screenshot or get a `captureId` from gallery
2. Call `superbased_annotate` with the `captureId` and an `annotations` array
3. Set `save: true` to modify the original, or `false` (default) to return annotated image without changing the original
**Example for bug reporting:**
```
superbased_annotate captureId=42 annotations=[
  {type:"rectangle", x:100, y:200, width:300, height:50, color:"#ff0000"},
  {type:"text", x:100, y:190, text:"Bug: misaligned button", color:"#ff0000"},
  {type:"arrow", x:250, y:180, toX:250, toY:200, color:"#ff0000"}
]
```
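The same payload can be assembled programmatically. This is a minimal Python sketch that assumes only the annotation fields shown above (`type`, `x`, `y`, `width`, `height`, `text`, `toX`, `toY`, `color`). The helper name `bug_report_annotations` and the 10 px / 20 px offsets are illustrative choices, not part of SuperBased.

```python
import json


def bug_report_annotations(x, y, width, height, label, color="#ff0000"):
    """Build a rectangle, a text label, and an arrow for one problem area.

    Hypothetical helper: only the field names come from the example above.
    """
    mid_x = x + width // 2
    return [
        {"type": "rectangle", "x": x, "y": y, "width": width, "height": height, "color": color},
        # The label sits 10 px above the box.
        {"type": "text", "x": x, "y": y - 10, "text": label, "color": color},
        # The arrow runs from 20 px above the box down to its top edge.
        {"type": "arrow", "x": mid_x, "y": y - 20, "toX": mid_x, "toY": y, "color": color},
    ]


args = {
    "captureId": 42,
    "annotations": bug_report_annotations(100, 200, 300, 50, "Bug: misaligned button"),
    "save": False,  # return the annotated image, leave the original untouched
}
print(json.dumps(args, indent=2))
```

With these inputs the helper reproduces the three annotations from the bug-report example.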
FILE:skills/captcha-solving/SKILL.md
---
name: captcha-solving
description: Solve in-flow CAPTCHA challenges (reCAPTCHA, Cloudflare Turnstile, drag-puzzles, rotation puzzles, image grids) using vision + GUI automation. Use when an automation flow gets blocked by a verification challenge.
---
When a GUI automation flow hits a CAPTCHA, don't give up — most consumer-facing CAPTCHAs are solvable with the SuperBased vision + automation toolkit. Pick the pattern that matches the challenge type.
## Pattern 1: Image-grid challenges (reCAPTCHA, hCaptcha)
"Select all squares with traffic lights / crosswalks / buses".
1. `superbased_screenshot` — capture the challenge.
2. `superbased_ai` — ask the model "Which tiles contain a `<X>`? Return a list of 1-indexed grid positions." (the underlying vision call identifies tiles reliably for typical 3×3 / 4×4 grids).
3. `superbased_ui_dump` on the challenge widget — get the per-tile `center.{x,y}`.
4. `superbased_sequence` — one batched click sequence selecting the matched tiles, ending with a click on Verify and a final `screenshot` step.
Critical: do NOT click tiles one-by-one with separate approvals; batch into a single sequence so the click cadence matches what a human does (reCAPTCHA scores cadence).
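Steps 2 to 4 can be sketched as one pure function that turns the model's 1-indexed answer plus the tile centers into a single batched payload. This Python sketch makes several assumptions: that sequence click steps accept `coords: {x, y}` (the gui-automation skill lists `coords` as a targeting option, but the exact step schema is not confirmed here), that the Verify control can be targeted by `role` + `name`, and that `tile_centers` is in row-major order. The function name is hypothetical.

```python
def build_grid_sequence(matched_positions, tile_centers, verify_name="Verify"):
    """Turn 1-indexed grid positions into one batched click sequence.

    tile_centers: row-major list of (x, y) tile centers, for example read
    from the `center.{x,y}` values of a `superbased_ui_dump` result.
    The step shape (`coords`, `role`, `name`) is an assumption.
    """
    steps = []
    # Deduplicate so a repeated answer never clicks (and deselects) a tile twice.
    for pos in sorted(set(matched_positions)):
        if not 1 <= pos <= len(tile_centers):
            raise ValueError(f"grid position {pos} is outside the grid")
        x, y = tile_centers[pos - 1]
        steps.append({"type": "click", "coords": {"x": x, "y": y}})
    steps.append({"type": "click", "role": "button", "name": verify_name})
    # Final verification frame, per the sequence rule.
    steps.append({"type": "screenshot", "resolution": "half"})
    return {"confirm": True, "steps": steps}


# Example: a 3x3 grid where the model answered tiles 1 and 5.
centers = [(100 + 110 * col, 200 + 110 * row) for row in range(3) for col in range(3)]
sequence = build_grid_sequence([1, 5], centers)
```

Building the whole payload before sending it keeps the tile clicks, the Verify click, and the final screenshot in one approval.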
## Pattern 2: Drag puzzles (slider-to-fit, "drag piece to gap")
1. `superbased_screenshot` — capture the puzzle.
2. `superbased_ai` — "What is the horizontal pixel offset from the puzzle piece's current position to the gap?" Vision returns a delta.
3. `superbased_drag` with `humanize: 'light'` — drag the slider/piece by that delta in **one motion**. Never split into multiple sub-drags — drop velocity is the main bot signal here. The `light` profile gives a sin-shaped velocity envelope that reads as human; `'off'` will fail almost every time.
4. End with `superbased_screenshot` to confirm the puzzle accepted the drop.
## Pattern 3: Rotation puzzles ("rotate the image upright")
This is the **calibrate-then-execute** pattern:
1. `superbased_screenshot` — capture the puzzle.
2. `superbased_ai` — "What's the rotation angle in degrees needed to make this image upright?" Vision returns an angle.
3. Look up the puzzle widget's geometry via `superbased_ui_dump` — you need the rotation handle's start position and the rotation axis.
4. `superbased_drag` from the handle's start to the calibrated end position, **in one motion**, `humanize: 'light'`.
5. `superbased_screenshot` to verify acceptance.
The "calibrate first, then execute in one drag" pattern is what makes this work — re-calibrating mid-drag (multiple sub-drags) signals automation.
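The calibration step reduces to a little geometry: rotate the handle's start point around the rotation axis by the angle the vision call returned. This Python sketch assumes a rotary handle that orbits the axis (slider-driven rotation puzzles need a different mapping). The `from` / `to` argument names in the usage lines are hypothetical; `confirm` and `humanize` are the only drag parameters taken from this document.

```python
import math


def rotation_drag_end(handle, axis, angle_deg):
    """Rotate the handle's start point around the axis by angle_deg.

    Positive angles are clockwise on screen, because y grows downward.
    Assumes a rotary handle that orbits the rotation axis.
    """
    theta = math.radians(angle_deg)
    dx, dy = handle[0] - axis[0], handle[1] - axis[1]
    end_x = axis[0] + dx * math.cos(theta) - dy * math.sin(theta)
    end_y = axis[1] + dx * math.sin(theta) + dy * math.cos(theta)
    return round(end_x), round(end_y)


# Calibrate once, then issue a single drag.
start = (600, 400)
end = rotation_drag_end(start, axis=(500, 400), angle_deg=90)
drag_args = {
    "from": {"x": start[0], "y": start[1]},  # hypothetical argument name
    "to": {"x": end[0], "y": end[1]},        # hypothetical argument name
    "confirm": True,
    "humanize": "light",
}
```

Computing the end point up front is what allows the drag to happen in one motion.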
## Pattern 4: Checkbox-only Turnstile / "I'm not a robot"
Often passes on the first click if the cursor approach was sufficiently human:
1. `superbased_click` on the checkbox with `humanize: 'human'` (bump from default `light`).
2. `superbased_wait_for` with a predicate matching the success state (checkmark visible) or escalation (image grid appears — fall back to Pattern 1).
## What humanization CANNOT defeat (be honest with the user)
- **Server-side device fingerprinting** — cookies, IP reputation, TLS fingerprint, browser fingerprint. SuperBased operates the user's real browser, so it inherits the user's reputation. If the user's IP is on a residential proxy / VPN that's been flagged, no humanization will help.
- **Audio CAPTCHAs** — SuperBased can detect the audio button but the audio decoding pipeline is not built into the toolkit. The user has to solve those manually.
- **hCaptcha enterprise mode** — uses behavioral signals SuperBased can't fully mimic (mouse trajectory variance, typing rhythm, focus events). May work; may not. Try Pattern 1 with `humanize: 'paranoid'` once; if it loops, escalate to the user.
- **Phone / SMS / email verification** — not a CAPTCHA, but the same place users hit a wall. Surface to the user and ask them to complete the step.
## Important: don't loop blindly
If a CAPTCHA fails, capture the result, summarize what you tried (which pattern, which humanize profile, what the model said), and ask the user before retrying. Repeated automated attempts look more like a bot than a single failed one.
FILE:skills/compress/SKILL.md
---
name: compress
description: Compress large text into token-efficient images using the Token Compression Engine
---
When dealing with large text content (logs, code, documents) that exceeds ~500 tokens, use `superbased_compress_text` to convert it into optimized images that cost fewer tokens.
**Theme selection:**
- `"terminal"` -- CLI output, build logs, server logs
- `"dark"` -- source code, config files
- `"paper"` -- documentation, articles, prose
- `"light"` -- general text, mixed content
- `"high-contrast"` -- accessibility, presentations
**Parameters:**
- `preset: "auto"` -- let the engine pick optimal resolution
- `columns: "auto"` -- auto-detect best column layout
- `render_style: "auto"` -- auto-detect code vs document vs terminal
**When to use:**
- Pasting large file contents into context
- Sharing build/test output
- Including documentation in conversations
- Any text block where image tokens < text tokens (typically >500 tokens)
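The theme and threshold guidance above can be captured in a small lookup. This is a Python sketch: the content-kind labels (`"log"`, `"code"`, and so on) and the function names are illustrative, and the 500-token threshold is the rough figure quoted above rather than a hard limit. Only the option names and values (`theme`, `preset`, `columns`, `render_style`) come from this skill.

```python
# Illustrative content kinds mapped to the themes listed above.
THEME_BY_KIND = {
    "cli": "terminal",
    "log": "terminal",
    "code": "dark",
    "config": "dark",
    "docs": "paper",
    "prose": "paper",
    "presentation": "high-contrast",
}


def pick_theme(kind):
    """Fall back to the general-purpose 'light' theme for unknown kinds."""
    return THEME_BY_KIND.get(kind, "light")


def compress_options(kind, token_count, threshold=500):
    """Return options for superbased_compress_text, or None if the text is small."""
    if token_count <= threshold:
        return None  # cheaper to send as plain text
    return {
        "theme": pick_theme(kind),
        "preset": "auto",
        "columns": "auto",
        "render_style": "auto",
    }
```

For example, `compress_options("log", 5000)` selects the `terminal` theme, while a 200-token snippet returns `None` and stays as text.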
FILE:skills/dictation/SKILL.md
---
name: dictation
description: Voice input and audio transcription. Use when the user wants to speak instead of type, transcribe audio files, or work with voice recordings.
---
**Live microphone recording:**
Use `superbased_dictate` with `mic: true` and `duration` in seconds (default 10). Set `cleanup: true` to remove filler words and duplicates.
**Audio file transcription:**
- With cleanup: `superbased_dictate` with `audioPath`
- Raw Whisper output: `superbased_transcribe` with `audioPath`
**Transcription history:**
Use `superbased_dictation_history` to query past transcriptions (default limit 20).
**When to use dictate vs transcribe:**
- `superbased_dictate`: adds filler word removal, deduplication, and supports mic recording
- `superbased_transcribe`: raw Whisper output only, for when exact transcription is needed
Supported audio formats: wav, mp3, webm.
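The choice between the two tools can be sketched as a small dispatcher. In this Python sketch the function name is hypothetical, and passing `cleanup` alongside `audioPath` is an assumption (the skill only documents `cleanup` for microphone recording). The tool names, `mic`, `duration`, `audioPath`, and the supported formats come from the text above.

```python
from pathlib import Path

SUPPORTED_FORMATS = {".wav", ".mp3", ".webm"}


def transcription_call(audio_path=None, cleanup=True, duration=10):
    """Return (tool_name, args) for a mic recording or an audio file."""
    if audio_path is None:
        # No file given: record from the microphone.
        return "superbased_dictate", {"mic": True, "duration": duration, "cleanup": cleanup}
    if Path(audio_path).suffix.lower() not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported audio format: {audio_path}")
    if cleanup:
        # Assumption: cleanup can be passed together with audioPath.
        return "superbased_dictate", {"audioPath": audio_path, "cleanup": True}
    # Exact wording needed: raw Whisper output.
    return "superbased_transcribe", {"audioPath": audio_path}
```

Rejecting unsupported extensions before the call avoids spending a tool invocation on a file that cannot be decoded.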
FILE:skills/gui-automation/SKILL.md
---
name: gui-automation
description: Drive the user's desktop with click / type / hotkey / scroll / drag / form-fill primitives. Use when the user asks you to "click X", "type into Y", "fill out this form", "automate this workflow", or any task that needs you to operate their actual UI rather than just describe it.
---
When the user wants you to ACT on their screen — click a button, type into a field, fill a form, drag a slider, send a hotkey — use the SuperBased GUI automation tools instead of just describing what they should do.
## Safety rails (verify before first action in a session)
1. **Master toggle**: Settings > GUI Automation > Enabled. If `superbased_doctor_gui_automation` reports `enabled: false`, surface that to the user and ask them to flip it on. Do not attempt to bypass.
2. **Always pass `confirm: true`** on any tool that modifies UI state (click / type / hotkey / scroll / drag / sequence / form_fill / dialog_handle / context_menu_select / ax_invoke / tab_management / virtual_desktop / tray_click). Tools refuse to fire without it.
3. **Per-action toggles**: each action class (click, type, hotkey, scroll, drag, hover) has its own enable/disable. The doctor tool reports which are off.
4. **Protected apps blocklist** + **NDJSON audit log** are server-side; you don't manage them, but they're why the user can audit what you did.
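Rule 2 is easy to enforce on the agent side before a call goes out. This is a minimal Python sketch, assuming tool arguments are handled as plain dicts. The set of state-modifying tools is copied from the list in rule 2; the function name is hypothetical, and the server-side check remains the authority.

```python
# Tool suffixes listed in rule 2 as state-modifying.
STATE_MODIFYING = frozenset({
    "click", "type", "hotkey", "scroll", "drag", "sequence", "form_fill",
    "dialog_handle", "context_menu_select", "ax_invoke", "tab_management",
    "virtual_desktop", "tray_click",
})


def require_confirm(tool, args):
    """Raise if a state-modifying call is about to go out without confirm: true."""
    prefix = "superbased_"
    action = tool[len(prefix):] if tool.startswith(prefix) else tool
    if action in STATE_MODIFYING and args.get("confirm") is not True:
        raise ValueError(f"{tool} modifies UI state and needs confirm: true")
    return args
```

Read-only tools such as `superbased_ui_dump` pass through unchanged, so the guard can wrap every call.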
## The reliability pyramid (use the most reliable target you can find)
Order matters — always pick the first option that works for the target element:
1. **`automationId`** (Windows: AutomationId / macOS: AXIdentifier) — set by the app developer, never changes between layout shifts. Most reliable.
2. **`role` + `name`** — e.g. `role: "button", name: "Submit"`. Survives layout changes; can break if app re-labels.
3. **Visible label / OCR text** — what the user sees. Brittle on minor wording tweaks but works on apps with no AX surface.
4. **`coords: { x, y }`** — last resort. Survives nothing; only use when AX has no entry for the element (canvas widgets, custom-rendered controls, web-embedded iframes the AX layer skips).
Always call `superbased_ui_dump` first on a new app — it returns the AX tree with `automationId` / `role` / `name` / `center.{x,y}` so you can pick the right targeting strategy without guessing.
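The pyramid translates into a simple fallback chain over one node of the dump. This Python sketch assumes each node is a dict carrying the fields named above (`automationId`, `role`, `name`, `center`). Level 3 (visible label / OCR text) is left out because this document does not name the field for it, and the function name is hypothetical.

```python
def pick_target(node):
    """Choose the most reliable targeting fields for one ui_dump node."""
    if node.get("automationId"):
        return {"automationId": node["automationId"]}
    if node.get("role") and node.get("name"):
        return {"role": node["role"], "name": node["name"]}
    center = node.get("center")
    if center:
        # Last resort: raw coordinates.
        return {"coords": {"x": center["x"], "y": center["y"]}}
    raise ValueError("node has no usable targeting information")
```

The returned dict can be merged into the arguments of the action being prepared, so the targeting decision is made once per element.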
## The "always end with screenshot" rule for `superbased_sequence`
When you compose multiple steps via `superbased_sequence`, the last step **must** be a `screenshot` step. Why: without a final screenshot you have no proof the sequence reached the intended end-state, and the audit log shows N actions with no visible outcome. The screenshot step is cheap (one extra capture) and gives you/the user the verification frame.
```json
{
  "confirm": true,
  "steps": [
    { "type": "click", "name": "File", "role": "menuitem" },
    { "type": "click", "name": "Open...", "role": "menuitem" },
    { "type": "type", "text": "/path/to/file.txt" },
    { "type": "hotkey", "keys": "Enter" },
    { "type": "screenshot", "resolution": "half" }
  ]
}
```
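One way to make the rule hard to forget is to normalize every payload before sending it. This Python sketch assumes the payload shape shown above (`steps` as a list of dicts with a `type` field); the helper name is hypothetical.

```python
def ensure_final_screenshot(sequence, resolution="half"):
    """Return a copy of the payload whose last step is a screenshot."""
    steps = list(sequence.get("steps", []))
    if not steps or steps[-1].get("type") != "screenshot":
        steps.append({"type": "screenshot", "resolution": resolution})
    # The input payload is left unmodified.
    return {**sequence, "steps": steps}


payload = ensure_final_screenshot({
    "confirm": True,
    "steps": [
        {"type": "click", "name": "File", "role": "menuitem"},
        {"type": "hotkey", "keys": "Enter"},
    ],
})
```

A payload that already ends with a screenshot step is returned with the same steps, so the helper is safe to apply unconditionally.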
## Decision guide (which tool for which task)
| User asks for... | Use |
|---|---|
| "Click that button" | `superbased_click` (look up via `_ui_dump` first) |
| "Type into that field" | `superbased_click` to focus, then `superbased_type` |
| "Fill out this form" | `superbased_form_fill` with `{label: value}` map |
| "Press Cmd+S / Ctrl+Tab" | `superbased_hotkey` |
| "Scroll down to the End User License" | `superbased_scroll_to` with the target text |
| "Walk me through the Settings page" | `superbased_scroll_capture` (one approval, all frames) |
| "Drag the slider to the gap" | `superbased_drag` (set `humanize: 'light'` for puzzles) |
| "Right-click and pick Inspect" | `superbased_context_menu_select` |
| "Confirm/dismiss this dialog" | `superbased_dialog_handle` |
| "Open https://..." | `superbased_open_url` |
| "Multi-step workflow" | `superbased_sequence` with screenshot last step |
| "Click that thing in the system tray" | `superbased_tray_click` |
| "Switch to virtual desktop 2" | `superbased_virtual_desktop` |
| Element has no AX entry, only a visible icon | `superbased_find_image` (template match) |
## When something fails
- If a click misses, run `superbased_ui_dump` again and inspect the actual `center.{x,y}` and `role` — the app may have re-rendered.
- If `superbased_doctor_gui_automation` reports a per-action toggle off, don't retry; tell the user to enable it.
- If the action needs a wait for UI to settle, use `superbased_wait_for` (waits for a predicate) over `superbased_wait` (blind sleep).
- For "I want to see what would happen but not actually do it", `superbased_dry_run` simulates without firing input events.
## Sites with bot detection
If the target is a webapp with active bot detection (e.g. Cloudflare-fronted, reCAPTCHA-gated), see the **humanization** skill for picking the right `humanize` profile. Default `light` is enough for most consumer sites; bump to `human` or `paranoid` for hardened targets.
For CAPTCHA challenges that block your flow, see the **captcha-solving** skill.
FILE:skills/humanization/SKILL.md
---
name: humanization
description: Pick the right humanize profile for GUI automation on sites with bot detection. Use when actions on a real webapp need to evade automation fingerprinting (Cloudflare-fronted sites, social media, banking, ticketing).
---
GUI automation tools (`click`, `type`, `drag`, `hover`, `sequence`) accept a `humanize` parameter with four profiles. Pick the right one for the target — over-humanizing is slow, under-humanizing gets you flagged.
## The 4 profiles
| Profile | When to use |
|---|---|
| `'off'` | **Internal tools, dev environments, your own app under test.** No humanization — fastest, deterministic. Will fail any consumer site with bot detection. |
| `'light'` | **Default.** Most consumer sites without aggressive detection. Sin-shaped velocity envelope, basic Gaussian click jitter, gamma-distributed pre-click dwell, 50–110 ms click hold variation. Adds ~50–200 ms per action. |
| `'human'` | **Sites with active bot detection** — anything Cloudflare-fronted, reCAPTCHA-gated, or hCaptcha-protected. Adds pre-click tremor on the target element + occasional 2–4× micro-pauses + per-process cross-session salt mixed into seeds (so two runs don't have identical inter-arrival times). Adds ~200–500 ms per action. |
| `'paranoid'` | **Hardened targets** — banking, ticketing (Ticketmaster), social media bot crackdowns. Everything in `human` plus rare "distraction" pauses (1–3 s gaps that look like the user got distracted), wider Gaussian jitter, slower velocity envelope. Adds ~500 ms–2 s per action. |
## How to choose
```
Internal/dev tool → 'off'
Consumer webapp without obvious bot detection → 'light' (default — don't override)
Site behind Cloudflare / has CAPTCHA gates / is in the "obviously cares about bots" category → 'human'
Banking / ticketing / known-hardened anti-bot target → 'paranoid'
```
When in doubt, start at the default `'light'` and bump up only if you get blocked. Going straight to `'paranoid'` when `'light'` would have worked just makes the automation slow.
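The picker and the "bump only when blocked" advice can be sketched in a few lines of Python. The three boolean inputs are a simplification of the categories above, and both function names are hypothetical; only the four profile names come from this skill.

```python
PROFILES = ["off", "light", "human", "paranoid"]


def pick_humanize(internal=False, bot_detection=False, hardened=False):
    """Map the target's category to a starting profile."""
    if internal:
        return "off"
    if hardened:
        return "paranoid"
    if bot_detection:
        return "human"
    return "light"


def bump(profile):
    """Next stronger profile after a block, or None if already at 'paranoid'."""
    index = PROFILES.index(profile)
    return PROFILES[index + 1] if index + 1 < len(PROFILES) else None
```

A `None` result from `bump` is the signal to stop escalating and look at IP reputation or browser fingerprint instead.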
## Per-call override
You don't pick one profile for the whole session. Set it per call:
```json
{ "tool": "superbased_click", "name": "Submit", "confirm": true, "humanize": "human" }
```
For `superbased_sequence`, set it on each step that needs it (or set on the sequence root and override per-step).
## Other humanization knobs (for advanced cases)
- **`typoProb`** on `superbased_type` — probability of a typo per character (then immediately corrects with backspace + correct char). QWERTY same-row neighbors. Default 0; bump to 0.01–0.03 for sites that score typing perfection as bot-like.
- **`humanInputIdleDrift`** in app settings — opt-in cursor drift while the agent is idle (for example, between sequence steps when it isn't actively moving the cursor). Off by default; enable for long-running sessions on hardened targets.
- **`humanize: 'light'` is required for CAPTCHA drops** — see the **captcha-solving** skill. Drag puzzles fail with `'off'` because drop velocity is the main signal.
## What humanization is for (and what it's not)
**Is for:** lowering the input-side bot signal — cursor trajectories, click timing, keystroke cadence, drag drop velocity. These are the things behavioral fingerprinting watches.
**Is NOT for:** server-side fingerprinting (TLS, cookies, IP reputation, browser fingerprint). SuperBased drives the user's real browser, so the network-side reputation is the user's reputation. If their IP is on a flagged subnet, no humanization profile will help.
If the user complains "still getting flagged with `paranoid`", the next step is investigating their IP / browser fingerprint / account history, not bumping the profile higher (there is no higher).
FILE:skills/monitor/SKILL.md
---
name: monitor
description: Proactive screen monitoring with AI analysis. Use when watching for errors during deploys, long-running tests, build processes, or any scenario where the screen should be watched for issues over time.
---
When you need to watch the user's screen for changes and automatically flag issues, use monitor mode:
1. Start: `superbased_recording` with `action: "start"`, `mode: "monitor"`
2. Configure the AI analysis with `analysisPrompt` tailored to the scenario:
   - Deploy watching: "Flag any errors, failed health checks, or red status indicators"
   - Test runs: "Flag test failures, assertion errors, or unexpected output"
   - Build processes: "Flag compilation errors, warnings, or build failures"
   - General: "Flag any errors, warnings, or unexpected states"
**Token budget controls:**
- `analyzeEvery: 5` -- AI analyzes every N significant frames (increase to reduce cost)
- `analyzeInterval: 30` -- minimum seconds between AI calls
- `analysisDetail: "low"` -- 512px images (use "high" for 1024px when reading small text)
**Retrieving results:**
- AI alerts arrive as MCP notifications during the session
- `superbased_recording` with `action: "get_analysis"` and `sessionId` retrieves all AI findings
- Optionally pass `since: N` to only get analyses after frame N
Stop the session with `superbased_recording` `action: "stop"` when done.
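The start and retrieval calls can be assembled from the parameters above. This Python sketch uses only parameter names and prompts given in this skill. Sending the analysis settings in the same `start` call is an assumption, and the scenario keys and function names are illustrative.

```python
ANALYSIS_PROMPTS = {
    "deploy": "Flag any errors, failed health checks, or red status indicators",
    "tests": "Flag test failures, assertion errors, or unexpected output",
    "build": "Flag compilation errors, warnings, or build failures",
    "general": "Flag any errors, warnings, or unexpected states",
}


def monitor_start_args(scenario="general", analyze_every=5, analyze_interval=30, detail="low"):
    """Arguments for superbased_recording to start a monitor session."""
    return {
        "action": "start",
        "mode": "monitor",
        "analysisPrompt": ANALYSIS_PROMPTS.get(scenario, ANALYSIS_PROMPTS["general"]),
        "analyzeEvery": analyze_every,        # raise to reduce cost
        "analyzeInterval": analyze_interval,  # minimum seconds between AI calls
        "analysisDetail": detail,             # "high" when reading small text
    }


def get_analysis_args(session_id, since=None):
    """Arguments for retrieving findings, optionally only after frame `since`."""
    args = {"action": "get_analysis", "sessionId": session_id}
    if since is not None:
        args["since"] = since
    return args
```

Passing the last frame number already seen as `since` keeps each follow-up retrieval limited to newer findings.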
FILE:skills/redact/SKILL.md
---
name: redact
description: Auto-redact sensitive information from screenshots before sharing
---
Before sharing screenshots that may contain sensitive data, use `superbased_redact` to automatically detect and blur secrets.
**Parameters:**
- `secrets: true` -- redacts API keys, tokens, passwords, connection strings (default on)
- `pii: true` -- redacts emails, phone numbers, names, addresses (default off, enable when needed)
- Provide either `captureId` (from a gallery capture) or `screenshotPath` (file on disk)
**When to use:**
- Before sharing terminal screenshots that may show environment variables or API keys
- Before sharing browser screenshots with logged-in sessions
- When the user asks to share a screenshot externally
- Any capture of settings, config files, or dashboards
The tool returns the redacted image with sensitive regions blurred and a count of regions found.
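A minimal sketch of assembling the arguments for `superbased_redact`, assuming "either" means exactly one of `captureId` or `screenshotPath`. The `redact_args` helper and the example path are hypothetical; only the parameter names and defaults come from the list above.

```python
def redact_args(capture_id=None, screenshot_path=None, secrets=True, pii=False):
    """Build the argument dict for a superbased_redact call.

    Assumption: exactly one image source should be supplied.
    """
    if (capture_id is None) == (screenshot_path is None):
        raise ValueError("provide exactly one of capture_id or screenshot_path")
    args = {"secrets": secrets, "pii": pii}
    if capture_id is not None:
        args["captureId"] = capture_id
    else:
        args["screenshotPath"] = screenshot_path
    return args


# Sharing a terminal screenshot externally: keep secrets on, enable PII too.
print(redact_args(screenshot_path="terminal.png", pii=True))
```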
FILE:skills/screenshot/SKILL.md
---
name: screenshot
description: Auto-capture the screen when the agent needs to see what the user sees. Use when debugging visual issues, verifying UI changes, reading on-screen content, or answering questions about what's visible.
---
When you need to see the user's screen to answer a question, debug a visual issue, or verify a UI change, use the `superbased_capture_image` tool.
**Default parameters:**
- `mode: "fullscreen"`
- `resolution: "half"` (roughly a quarter of the tokens of full resolution)
**Resolution guide:**
- General overview, layout checks: `resolution: "half"` (~691 tokens for 1080p)
- Reading small text or fine details: `resolution: "high"`
- Pixel-perfect comparisons: `resolution: "full"` (~2,765 tokens for 1080p)
- Just checking presence/layout: `resolution: "quarter"` (~173 tokens)
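The token figures above scale with pixel area, so halving each dimension cuts the cost to about a quarter. The sketch below reproduces them for a 1920x1080 screen. The divisor of 750 pixels per token is an assumption chosen because it matches the quoted numbers, not something this skill documents, and `"high"` is left out because its scale factor is not stated here.

```python
def estimate_tokens(width, height, scale=1.0, pixels_per_token=750):
    """Estimate image tokens from pixel area (assumed area / 750 rule)."""
    w = round(width * scale)
    h = round(height * scale)
    return round(w * h / pixels_per_token)


for name, scale in [("full", 1.0), ("half", 0.5), ("quarter", 0.25)]:
    print(name, estimate_tokens(1920, 1080, scale))
# full 2765
# half 691
# quarter 173
```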
**Targeting a specific window:**
1. Call `superbased_window_list` to see all open windows
2. Call `superbased_capture_image` with `window` set to a substring of the target window's title. SuperBased activates the window (restoring it if minimized), captures it, then restores focus.
**Reading clipboard images:**
If the user says "look at what I copied", use `superbased_clipboard` with `action: "readImage"`.
After capturing, describe what you observe and relate it to the user's question.
FILE:skills/visual-qa/SKILL.md
---
name: visual-qa
description: Visual regression testing workflow using recording sessions and diff comparison
---
Use this workflow to detect visual regressions:
**Baseline phase:**
1. `superbased_recording` with `action: "start"`, `name: "<workflow-name>-baseline"`, `profile: "automated_test"`
2. Walk through the UI flow, capturing key states with `superbased_recording` `action: "capture"` at each step
3. `superbased_recording` with `action: "stop"` -- save the session ID
4. `superbased_baseline` with `action: "set"`, `workflowName`, and `sessionId`
**After changes:**
1. `superbased_recording` with `action: "start"`, `name: "<workflow-name>-current"`, `profile: "automated_test"`
2. Repeat the same UI flow, capturing at the same steps
3. `superbased_recording` with `action: "stop"` -- save the session ID
**Comparison:**
1. `superbased_baseline` with `action: "get"` and `workflowName` to retrieve the baseline session ID
2. `superbased_diff` with `baselineSessionId` and `currentSessionId`
3. Report: frames that changed, what changed, overall similarity, and whether the changes are expected
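The three phases can be laid out as an ordered call plan. This is only a sketch: tool and argument names follow the steps above, while the `regression_plan` helper, the step count, and the placeholder session IDs are hypothetical. In a real run the session IDs come from the recording tool's responses.

```python
def recording_phase(name, steps):
    """Calls for one recording pass: start, one capture per step, stop."""
    calls = [("superbased_recording",
              {"action": "start", "name": name, "profile": "automated_test"})]
    calls += [("superbased_recording", {"action": "capture"})] * steps
    calls.append(("superbased_recording", {"action": "stop"}))
    return calls


def regression_plan(workflow, steps,
                    baseline_id="<baseline-session-id>",
                    current_id="<current-session-id>"):
    """Ordered (tool, arguments) pairs for baseline, re-run, and comparison."""
    plan = recording_phase(f"{workflow}-baseline", steps)
    plan.append(("superbased_baseline",
                 {"action": "set", "workflowName": workflow, "sessionId": baseline_id}))
    plan += recording_phase(f"{workflow}-current", steps)
    plan.append(("superbased_baseline", {"action": "get", "workflowName": workflow}))
    plan.append(("superbased_diff",
                 {"baselineSessionId": baseline_id, "currentSessionId": current_id}))
    return plan


for tool, args in regression_plan("login", steps=3):
    print(tool, args)
```

Keeping the step count identical in both passes matters: the diff compares sessions frame against frame, so a missing or extra capture would misalign everything after it.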
FILE:skills/walkthrough/SKILL.md
---
name: walkthrough
description: Walk through a scrollable section (settings panel, long list, long page) and report what's there. Use when the user asks "what's in Settings", "show me the whole page", "walk me through X", "capture the whole sidebar", or any task that requires reading content that extends below the viewport.
---
When the user wants you to READ a scrollable section end-to-end (settings panels, long forms, long lists, or other long scrollable content), use the `superbased_scroll_capture` MCP tool. **Do NOT chain `superbased_scroll` + screenshot calls in a loop:** each step in that chain costs its own user approval, so a 6-page section means at least 6 approval prompts. `superbased_scroll_capture` is ONE approval that returns all N frames to you inline.
**When to use this skill:**
- "What are all the settings in the GUI Automation page?"
- "Walk me through the Dictation settings"
- "Show me the whole Settings page"
- "What options are in the sidebar / menu / panel?"
- Any task where the answer requires scrolling through a section you can't see all of at once
**The one-call invocation:**
```
superbased_scroll_capture
anchorX=<x inside the scroll container>
anchorY=<y inside the scroll container>
processName="<target app>" # or hwnd=..., or window="<title>"
maxPages=8 # default 8; bump to 12-16 for very long sections
confirm=true
```
**Getting the anchor coords:**
1. First: `superbased_ui_dump processName="<target>"` — returns `textElements` with screen-space `center.{x,y}`.
2. Pick a `textElement` whose `center.{x,y}` sits INSIDE the scrollable container (NOT in a sidebar / header / toolbar — those scroll separately). An element in the middle-right of the content panel is usually safe.
3. Pass those coords as `anchorX` / `anchorY`.
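One way to apply the "middle-right of the content panel" rule is sketched below. It assumes the `superbased_ui_dump` result has been parsed into a list of dicts with a `center` of `x` and `y`, as described above. The `panel` rectangle is a hypothetical input the agent would estimate itself (for example from a prior screenshot); the dump is not documented here as providing it.

```python
def pick_anchor(text_elements, panel):
    """Pick the text element nearest the middle-right of the content panel.

    panel is (left, top, right, bottom) in screen pixels (hypothetical input).
    Returns (anchorX, anchorY), or None if no element lies inside the panel.
    """
    left, top, right, bottom = panel
    target_x = left + 0.75 * (right - left)   # three quarters across
    target_y = top + 0.5 * (bottom - top)     # vertically centered
    inside = [
        e["center"] for e in text_elements
        if left < e["center"]["x"] < right and top < e["center"]["y"] < bottom
    ]
    if not inside:
        return None
    best = min(inside, key=lambda c: (c["x"] - target_x) ** 2 + (c["y"] - target_y) ** 2)
    return best["x"], best["y"]


elements = [
    {"text": "General", "center": {"x": 100, "y": 300}},      # sidebar
    {"text": "Settings", "center": {"x": 600, "y": 20}},      # header
    {"text": "Enable OCR", "center": {"x": 900, "y": 400}},   # content panel
]
print(pick_anchor(elements, panel=(300, 80, 1200, 800)))  # (900, 400)
```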
**What you get back:**
- N image content blocks, one per captured viewport (usually 3–6 for a settings page).
- Text metadata: `framesCaptured`, `pagesScrolled`, `atEnd`, `atEndReason` (`'no_movement'` / `'max_pages'` / `'error'`), `calibration` (pixels-per-tick, score, cache hit), per-frame `scrolledPx`.
**Behavior characteristics:**
- **Inline calibration** — the first scroll IS the measurement. No visible "scroll down then scroll back up" rewind. User sees natural forward scrolling only.
- **Heuristic fallback** — if measurement is noisy (solid-background anchor, sparse content), the tool falls back to 40 px/tick with `calibration.lowConfidence: true` and keeps returning frames. Never hard-fails on calibration alone.
- **atEnd detection** — stops when consecutive frames differ by < 3 px (viewport didn't move = end of content). Pixel-offset measurement, not byte similarity.
- **Windows only** — uses Win32 `mouse_event` + PrintWindow. Cross-platform coming later.
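The stop conditions can be modeled in a few lines. This is an illustration of the rule described above, not the tool's implementation: the `at_end_reason` helper is hypothetical, and it assumes one measured pixel offset per scroll attempt, a 3 px movement threshold, and the default `maxPages` of 8.

```python
def at_end_reason(scrolled_px, max_pages=8, min_move_px=3):
    """Return (pages_scrolled, reason) for a series of measured scroll offsets.

    reason is 'no_movement' when a scroll moved the viewport by less than
    min_move_px, 'max_pages' when the page budget ran out first, and None
    if the offsets ended before either condition was met.
    """
    for page, moved in enumerate(scrolled_px, start=1):
        if moved < min_move_px:
            return page, "no_movement"
        if page >= max_pages:
            return page, "max_pages"
    return len(scrolled_px), None


print(at_end_reason([640, 640, 212, 0]))  # (4, 'no_movement')
print(at_end_reason([640] * 10))          # (8, 'max_pages')
```

When the returned metadata reports `atEndReason` as `'max_pages'`, the section was probably cut short, and re-running with a larger `maxPages` is the natural follow-up.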
**When NOT to use:**
- Page fits in the viewport — `superbased_ui_dump` is cheaper (no scrolling needed).
- You know a specific label to find — `superbased_scroll_to` stops at the match.
- The content isn't scrollable (modal dialogs, single-page forms).
**Anti-pattern — the loop this skill replaces:**
Do NOT do this:
```
superbased_click ... # navigate to settings
superbased_scroll amount=1 unit=page # approval #1
superbased_screenshot ... # approval #2
superbased_scroll amount=1 unit=page # approval #3
superbased_screenshot ... # approval #4
... (etc, 5-10+ approvals)
```
Instead:
```
superbased_click ... # navigate to settings (1 approval)
superbased_ui_dump ... # get anchor coords (1 approval)
superbased_scroll_capture anchorX=... anchorY=... # the whole walkthrough (1 approval, N frames back)
```