@clawhub-scrapelesshq-8442a3a69b
Cloud browser automation CLI for AI agents powered by Scrapeless. Use when the user needs to interact with websites using cloud browsers, including navigatin...
---
name: scrapeless-scraping-browser
description: Cloud browser automation CLI for AI agents powered by Scrapeless. Use when the user needs to interact with websites using cloud browsers, including navigating pages, filling forms, clicking buttons, taking screenshots, extracting data, testing web apps, or automating any browser task with residential proxies and anti-detection features. Triggers include requests to "open a website", "fill out a form", "click a button", "take a screenshot", "scrape data from a page", "test this web app", "use a proxy", "bypass detection", or any task requiring cloud browser automation.
allowed-tools: Bash(npx scrapeless-scraping-browser-skills scrapeless-scraping-browser:*), Bash(scrapeless-scraping-browser:*)
---

# Cloud Browser Automation with scrapeless-browser

## Important: Session Management with --session-id

**All browser operation commands support the `--session-id` parameter to specify which Scrapeless session to use.**

### Recommended Workflow

```bash
# Step 1: Create a session and save the session ID
SESSION_ID=$(scrapeless-scraping-browser new-session --name "workflow" --ttl 1800 --json | jq -r '.taskId')

# Step 2: Use the session ID for all operations
scrapeless-scraping-browser --session-id $SESSION_ID open https://example.com
scrapeless-scraping-browser --session-id $SESSION_ID snapshot -i
scrapeless-scraping-browser --session-id $SESSION_ID click @e1

# Step 3: Close when done
scrapeless-scraping-browser --session-id $SESSION_ID close
```

### Automatic Session Management

If you don't specify `--session-id`:

1. The CLI will query for running sessions
2. If a running session exists, it will use the latest one
3. If no running session exists, it will create a new one automatically

**For production workflows, always use `--session-id` to ensure consistency.**

## Authentication Setup

Before using scrapeless-browser, you MUST set up authentication:

```bash
# Method 1: Config file (recommended, persistent)
scrapeless-scraping-browser config set apiKey your_api_token_here

# Method 2: Environment variable
export SCRAPELESS_API_KEY=your_api_token_here

# Verify it's set
scrapeless-scraping-browser config get apiKey
```

Get your API token from https://app.scrapeless.com

## Session Management Behavior

The CLI manages Scrapeless sessions with the following behavior:

- **Session Creation**: First command creates a new Scrapeless session
- **Session Persistence**: Sessions remain active only while connection is maintained
- **Session Termination**: Sessions automatically terminate when connection closes
- **Reconnection Limitation**: Cannot reconnect to terminated sessions

**Important**: For multi-step workflows, consider using the TypeScript API to maintain persistent connections.

## Core Workflow

Every browser automation follows this pattern:

1. **Create Session**: Create a session and save the session ID
2. **Navigate**: Use `--session-id` to navigate to URL
3. **Snapshot**: Get element refs with `--session-id`
4. **Interact**: Use refs to click, fill, select with `--session-id`
5. **Re-snapshot**: After navigation or DOM changes, get fresh refs

```bash
# Set API token first
scrapeless-scraping-browser config set apiKey your_token

# Create session
SESSION_ID=$(scrapeless-scraping-browser new-session --name "form-fill" --ttl 600 --json | jq -r '.taskId')

# Start automation with session ID
scrapeless-scraping-browser --session-id $SESSION_ID open https://example.com/form
scrapeless-scraping-browser --session-id $SESSION_ID snapshot -i
# Output: @e1 [input type="email"], @e2 [input type="password"], @e3 [button] "Submit"

scrapeless-scraping-browser --session-id $SESSION_ID fill @e1 "[email protected]"
scrapeless-scraping-browser --session-id $SESSION_ID fill @e2 "password123"
scrapeless-scraping-browser --session-id $SESSION_ID click @e3
scrapeless-scraping-browser --session-id $SESSION_ID wait --load networkidle
scrapeless-scraping-browser --session-id $SESSION_ID snapshot -i  # Check result
```

## Command Chaining

Commands can be chained with `&&` in a single shell invocation:

```bash
# Chain open + wait + snapshot
scrapeless-scraping-browser open https://example.com && scrapeless-scraping-browser wait --load networkidle && scrapeless-scraping-browser snapshot -i

# Chain multiple interactions
scrapeless-scraping-browser fill @e1 "[email protected]" && scrapeless-scraping-browser fill @e2 "password123" && scrapeless-scraping-browser click @e3
```

**When to chain:** Use `&&` when you don't need to read intermediate output. Run commands separately when you need to parse output first (e.g., snapshot to discover refs, then interact).

## Essential Commands

**Note**: All commands below support the optional `--session-id <id>` parameter.
```bash
# Navigation & Session
scrapeless-scraping-browser new-session [options]  # Create new browser session
scrapeless-scraping-browser [--session-id <id>] open <url>  # Navigate to URL
scrapeless-scraping-browser [--session-id <id>] close  # Close browser session
scrapeless-scraping-browser sessions  # List running sessions
scrapeless-scraping-browser stop <taskId>  # Stop specific session
scrapeless-scraping-browser stop-all  # Stop all sessions
```

### Session Creation with Advanced Options

The `new-session` command supports extensive customization options:

```bash
# Basic session creation
scrapeless-scraping-browser new-session --name "my-session" --ttl 1800

# Session with proxy settings
scrapeless-scraping-browser new-session \
  --name "proxy-session" \
  --proxy-country US \
  --proxy-state CA \
  --proxy-city "Los Angeles" \
  --ttl 3600

# Session with custom browser configuration
scrapeless-scraping-browser new-session \
  --name "mobile-session" \
  --platform iOS \
  --screen-width 375 \
  --screen-height 812 \
  --user-agent "Mozilla/5.0 (iPhone; CPU iPhone OS 15_0 like Mac OS X)" \
  --timezone "America/Los_Angeles" \
  --languages "en,es"

# Session with recording enabled
scrapeless-scraping-browser new-session \
  --name "recorded-session" \
  --recording true \
  --ttl 7200
```

**Available Options:**

- `--name <name>`: Session name for identification
- `--ttl <seconds>`: Session timeout in seconds (default: 180)
- `--recording <true|false>`: Enable session recording
- `--proxy-country <code>`: Proxy country code (e.g., AU, US, GB, CN, JP)
- `--proxy-state <state>`: Proxy state/region (e.g., NSW, CA, NY, TX)
- `--proxy-city <city>`: Proxy city (e.g., sydney, newyork, london, tokyo)
- `--user-agent <ua>`: Custom user agent string
- `--platform <platform>`: Platform (Windows, macOS, Linux, iOS, Android)
- `--screen-width <px>`: Screen width in pixels (default: 1920)
- `--screen-height <px>`: Screen height in pixels (default: 1080)
- `--timezone <tz>`: Timezone (default: America/New_York)
- `--languages <langs>`: Comma-separated language codes (default: en)

```bash
# Snapshot
scrapeless-scraping-browser [--session-id <id>] snapshot -i  # Interactive elements with refs (recommended)
scrapeless-scraping-browser [--session-id <id>] snapshot -i -C  # Include cursor-interactive elements
scrapeless-scraping-browser [--session-id <id>] snapshot -s "#selector"  # Scope to CSS selector

# Interaction (use @refs from snapshot)
scrapeless-scraping-browser [--session-id <id>] click @e1  # Click element
scrapeless-scraping-browser [--session-id <id>] fill @e2 "text"  # Clear and type text
scrapeless-scraping-browser [--session-id <id>] type @e2 "text"  # Type without clearing
scrapeless-scraping-browser [--session-id <id>] press Enter  # Press key
scrapeless-scraping-browser [--session-id <id>] scroll down 500  # Scroll page
scrapeless-scraping-browser [--session-id <id>] scroll down 500 --selector "div.content"  # Scroll within element

# Get information
scrapeless-scraping-browser [--session-id <id>] get text @e1  # Get element text
scrapeless-scraping-browser [--session-id <id>] get url  # Get current URL
scrapeless-scraping-browser [--session-id <id>] get title  # Get page title
scrapeless-scraping-browser [--session-id <id>] screenshot  # Take screenshot
scrapeless-scraping-browser [--session-id <id>] screenshot --full  # Full page screenshot

# Wait
scrapeless-scraping-browser [--session-id <id>] wait @e1  # Wait for element
scrapeless-scraping-browser [--session-id <id>] wait --load networkidle  # Wait for network idle
scrapeless-scraping-browser [--session-id <id>] wait --url "**/page"  # Wait for URL pattern
scrapeless-scraping-browser [--session-id <id>] wait 2000  # Wait milliseconds

# Cookies & Storage
scrapeless-scraping-browser [--session-id <id>] cookies  # Get all cookies
scrapeless-scraping-browser [--session-id <id>] cookies set <name> <val>  # Set cookie
scrapeless-scraping-browser [--session-id <id>] cookies clear  # Clear cookies
scrapeless-scraping-browser [--session-id <id>] storage local  # Get localStorage
scrapeless-scraping-browser [--session-id <id>] storage local set <k> <v>  # Set localStorage

# Multi-page
scrapeless-scraping-browser [--session-id <id>] pages  # List all pages/tabs
scrapeless-scraping-browser [--session-id <id>] page <pageId>  # Switch to page
scrapeless-scraping-browser [--session-id <id>] tab new [url]  # Open new tab
scrapeless-scraping-browser [--session-id <id>] tab close [n]  # Close tab

# Live preview
scrapeless-scraping-browser live [taskId]  # Get live preview URL
```

## Common Patterns

### Form Submission

```bash
scrapeless-scraping-browser config set apiKey your_token

scrapeless-scraping-browser open https://example.com/signup
scrapeless-scraping-browser snapshot -i
scrapeless-scraping-browser fill @e1 "Jane Doe"
scrapeless-scraping-browser fill @e2 "[email protected]"
scrapeless-scraping-browser click @e3
scrapeless-scraping-browser wait --load networkidle
```

### Data Extraction

```bash
scrapeless-scraping-browser config set apiKey your_token

scrapeless-scraping-browser open https://example.com/products
scrapeless-scraping-browser snapshot -i --json
scrapeless-scraping-browser get text @e5 --json
```

### Common Session Configuration Scenarios

#### Mobile Device Simulation

```bash
# Simulate iPhone for mobile-specific content
SESSION_ID=$(scrapeless-scraping-browser new-session \
  --name "mobile-test" \
  --platform iOS \
  --screen-width 375 \
  --screen-height 812 \
  --user-agent "Mozilla/5.0 (iPhone; CPU iPhone OS 15_0 like Mac OS X)" \
  --json | jq -r '.taskId')

scrapeless-scraping-browser --session-id $SESSION_ID open https://m.example.com
```

#### Geographic Content Testing

```bash
# Access content from different regions
SESSION_ID=$(scrapeless-scraping-browser new-session \
  --name "geo-test" \
  --proxy-country AU \
  --proxy-city sydney \
  --timezone "Australia/Sydney" \
  --languages "en-AU,en" \
  --json | jq -r '.taskId')

scrapeless-scraping-browser --session-id $SESSION_ID open https://example.com
```

#### High-Resolution Desktop Testing

```bash
# Test on high-resolution displays
SESSION_ID=$(scrapeless-scraping-browser new-session \
  --name "desktop-4k" \
  --platform macOS \
  --screen-width 3840 \
  --screen-height 2160 \
  --json | jq -r '.taskId')

scrapeless-scraping-browser --session-id $SESSION_ID open https://example.com
```

#### Session Recording for Debugging

```bash
# Enable recording for troubleshooting
SESSION_ID=$(scrapeless-scraping-browser new-session \
  --name "debug-session" \
  --recording true \
  --ttl 7200 \
  --json | jq -r '.taskId')

scrapeless-scraping-browser --session-id $SESSION_ID open https://example.com
# Session recording will be available for review
```

### Session Persistence

**Important**: Scrapeless sessions terminate when connections close.
For persistent workflows, use the TypeScript API:

```bash
scrapeless-scraping-browser config set apiKey your_token

# Create a session for login
scrapeless-scraping-browser create --name "login-session" --ttl 1800
scrapeless-scraping-browser open https://app.example.com/login
scrapeless-scraping-browser snapshot -i
scrapeless-scraping-browser fill @e1 "username"
scrapeless-scraping-browser fill @e2 "password"
scrapeless-scraping-browser click @e3
scrapeless-scraping-browser wait --url "**/dashboard"

# For subsequent operations, create a new session
# (Cannot reuse previous session due to connection termination)
scrapeless-scraping-browser create --name "dashboard-session" --ttl 1800
scrapeless-scraping-browser open https://app.example.com/dashboard
```

**Better Alternative**: Use TypeScript API for multi-step workflows:

```typescript
import { BrowserManager } from './dist/browser.js';

const manager = new BrowserManager();
await manager.launch({ id: 'persistent-workflow', action: 'launch' });
const page = manager.getPage();

await page.goto('https://app.example.com/login');
// Login persists throughout the session
await page.fill('#username', 'user');
await page.fill('#password', 'pass');
await page.click('#login');
await page.waitForURL('**/dashboard');

await page.goto('https://app.example.com/profile');
// Session and cookies maintained

await manager.close();
```

### Using Proxies

```bash
scrapeless-scraping-browser config set apiKey your_token

# Use residential proxy from specific country
scrapeless-scraping-browser config set proxyCountry US
scrapeless-scraping-browser open https://example.com

# Use custom proxy
scrapeless-scraping-browser config set proxyUrl "http://user:[email protected]:8080"
scrapeless-scraping-browser open https://example.com

# Use proxy with state and city (v2 API)
scrapeless-scraping-browser config set proxyCountry US
scrapeless-scraping-browser config set proxyState CA
scrapeless-scraping-browser config set proxyCity "Los Angeles"
scrapeless-scraping-browser open https://example.com
```

### Browser Fingerprinting

```bash
scrapeless-scraping-browser config set apiKey your_token

# Set browser fingerprint to avoid detection
scrapeless-scraping-browser config set fingerprint chrome
scrapeless-scraping-browser open https://example.com

# Customize browser fingerprint details
scrapeless-scraping-browser config set userAgent "Mozilla/5.0 (iPhone; CPU iPhone OS 15_0 like Mac OS X)"
scrapeless-scraping-browser config set platform iOS
scrapeless-scraping-browser config set screenWidth 375
scrapeless-scraping-browser config set screenHeight 812
scrapeless-scraping-browser config set timezone "Asia/Shanghai"
scrapeless-scraping-browser config set languages "zh-CN,en"
scrapeless-scraping-browser open https://example.com
```

### Session Recording

```bash
scrapeless-scraping-browser config set apiKey your_token

# Enable session recording for debugging
scrapeless-scraping-browser config set sessionRecording true
scrapeless-scraping-browser open https://example.com
# ... perform actions ...
scrapeless-scraping-browser close
# Recording will be available in Scrapeless dashboard
```

### Multiple Sessions

**Note**: Due to session termination behavior, the `--session-id` parameter has limitations. For reliable multi-session workflows, create separate sessions for each task:

```bash
scrapeless-scraping-browser config set apiKey your_token

# Create first session for task A
scrapeless-scraping-browser create --name "task-a" --ttl 1800
scrapeless-scraping-browser open https://site-a.com
# Complete task A operations...

# Create second session for task B
scrapeless-scraping-browser create --name "task-b" --ttl 1800
scrapeless-scraping-browser open https://site-b.com
# Complete task B operations...

# List all running sessions
scrapeless-scraping-browser sessions

# Stop specific session
scrapeless-scraping-browser stop <taskId>

# Stop all sessions
scrapeless-scraping-browser stop-all
```

**Alternative**: For complex multi-session workflows, use the TypeScript API which supports persistent connections.

## Configuration File

Configuration is managed via the `config` command. All settings are stored in `~/.scrapeless/config.json`.

**Priority**: Config file > Environment variable (only `SCRAPELESS_API_KEY` supports env var)

Available configuration options:

- `apiKey` - Your Scrapeless API token (required)
- `apiVersion` - API version (v1 or v2, default: v2)
- `sessionTtl` - Session timeout in seconds
- `sessionName` - Session name for identification
- `sessionRecording` - Enable session recording (true/false)
- `proxyUrl` - Custom proxy URL
- `proxyCountry` - Proxy country code
- `proxyState` - Proxy state/province
- `proxyCity` - Proxy city
- `fingerprint` - Browser fingerprint
- `debug` - Enable debug logging

## Agent Mode (JSON Output)

Use `--json` for machine-readable output:

```bash
scrapeless-scraping-browser snapshot -i --json
scrapeless-scraping-browser get text @e1 --json
scrapeless-scraping-browser is visible @e2 --json
```

## Ref Lifecycle (Important)

Refs (`@e1`, `@e2`, etc.) are invalidated when the page changes.
Always re-snapshot after:

- Clicking links or buttons that navigate
- Form submissions
- Dynamic content loading

```bash
scrapeless-scraping-browser click @e5  # Navigates to new page
scrapeless-scraping-browser snapshot -i  # MUST re-snapshot
scrapeless-scraping-browser click @e1  # Use new refs
```

## Session Management

### Important Session Behavior

**Critical**: Scrapeless sessions have specific connection requirements:

- ✅ **Sessions work perfectly with persistent connections**
- ❌ **Sessions automatically terminate when the connection is closed**
- ❌ **Reconnecting to a terminated session will fail**

### Recommended Usage Patterns

#### For Single Operations

```bash
# Create and use a session for a single task
scrapeless-scraping-browser create --name "single-task" --ttl 600
scrapeless-scraping-browser open https://example.com
scrapeless-scraping-browser screenshot
```

#### For Multi-Step Operations

For complex workflows requiring multiple steps, use the TypeScript API instead of CLI:

```typescript
import { BrowserManager } from './dist/browser.js';

const manager = new BrowserManager();
await manager.launch({ id: 'workflow', action: 'launch' });
const page = manager.getPage();

await page.goto('https://example.com');
await page.screenshot({ path: 'step1.png' });

await page.goto('https://another-site.com');
await page.screenshot({ path: 'step2.png' });

await manager.close();
```

### Session ID Parameter Limitations

The `--session-id` parameter has limitations due to Scrapeless session behavior:

```bash
# This will fail if the session connection was previously closed
scrapeless-scraping-browser --session-id <session-id> open https://example.com
# Error: Session has been terminated and cannot be reconnected
```

**Workaround**: Create new sessions for each workflow instead of reusing session IDs.
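
The workaround can be scripted. The following is a minimal Python sketch, not part of the CLI, that builds the `new-session` command and reads the session ID from its `--json` output. It assumes the output is a JSON object with a top-level `taskId` field, as the `jq -r '.taskId'` examples in this document imply; the helper names `new_session_argv`, `parse_task_id`, and `create_session` are hypothetical.

```python
import json
import subprocess


def new_session_argv(name: str, ttl: int) -> list[str]:
    """Build the argv for `new-session` using only flags documented above."""
    return [
        "scrapeless-scraping-browser", "new-session",
        "--name", name, "--ttl", str(ttl), "--json",
    ]


def parse_task_id(json_text: str) -> str:
    """Extract the session ID. Assumption: a top-level `taskId` field."""
    task_id = json.loads(json_text).get("taskId")
    if not task_id:
        raise ValueError("no taskId in new-session output")
    return task_id


def create_session(name: str, ttl: int = 600) -> str:
    """Create a fresh session for one workflow and return its ID."""
    result = subprocess.run(
        new_session_argv(name, ttl), capture_output=True, text=True, check=True
    )
    return parse_task_id(result.stdout)
```

Calling `create_session("task-a")` at the start of each workflow, rather than caching an ID across runs, avoids reconnecting to a session that may already have terminated.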
### Session Management Commands

Always close sessions when done to avoid leaked resources:

```bash
scrapeless-scraping-browser close  # Close current session
scrapeless-scraping-browser stop <taskId>  # Stop specific session by ID
scrapeless-scraping-browser stop-all  # Stop all sessions
```

Check running sessions:

```bash
scrapeless-scraping-browser sessions
# Returns: sessionId, createdAt, status, sessionName
```

## Live Preview

Get a real-time preview of your browser session via WebSocket:

```bash
# Get live preview URL for current session
scrapeless-scraping-browser live

# Get live preview URL for specific session
scrapeless-scraping-browser live abc123def456

# Returns live preview URL for browser viewing
# Open this URL in your browser to view the live session
```

## Timeouts and Slow Pages

For slow websites, use explicit waits:

```bash
# Wait for network activity to settle
scrapeless-scraping-browser wait --load networkidle

# Wait for specific element
scrapeless-scraping-browser wait @e1

# Wait for URL pattern
scrapeless-scraping-browser wait --url "**/dashboard"

# Wait fixed duration (milliseconds)
scrapeless-scraping-browser wait 5000
```

## Error Handling

Common errors and solutions:

**Authentication Error:**

```bash
# Make sure API token is set
scrapeless-scraping-browser config set apiKey your_token
# Or use environment variable
export SCRAPELESS_API_KEY=your_token
```

**Session Not Found:**

```bash
# Session may have expired, create new one
scrapeless-scraping-browser open https://example.com
```

**Element Not Found:**

```bash
# Re-snapshot to get fresh refs
scrapeless-scraping-browser snapshot -i
```

**Timeout:**

```bash
# Increase session timeout (in seconds)
scrapeless-scraping-browser config set sessionTtl 600
scrapeless-scraping-browser open https://example.com
```

## Debugging

Enable debug mode for detailed logs:

```bash
scrapeless-scraping-browser config set debug true
scrapeless-scraping-browser open https://example.com
```

Or use `--debug` flag:
```bash
scrapeless-scraping-browser --debug open https://example.com
```

## Configuration Options

| Configuration | Description |
|---------------|-------------|
| `apiKey` | Your API token (required) |
| `apiVersion` | API version (v1 or v2, default: v2) |
| `sessionTtl` | Session timeout in seconds |
| `sessionName` | Session name for identification |
| `sessionRecording` | Enable session recording (true/false) |
| `proxyUrl` | Custom proxy URL |
| `proxyCountry` | Proxy country code |
| `proxyState` | Proxy state/province |
| `proxyCity` | Proxy city |
| `fingerprint` | Browser fingerprint |
| `userAgent` | Custom user agent string |
| `platform` | Platform type (Windows, Linux, macOS, iOS, Android) |
| `screenWidth` | Screen width in pixels |
| `screenHeight` | Screen height in pixels |
| `timezone` | Timezone (e.g., America/New_York, Asia/Shanghai) |
| `languages` | Comma-separated language codes (e.g., en,zh-CN) |
| `debug` | Enable debug logging |

Set configuration using:

```bash
scrapeless-scraping-browser config set <key> <value>
```

Or use environment variable for API key only:

```bash
export SCRAPELESS_API_KEY=your_token
```

## Key Differences from Local Browsers

1. **Cloud-based**: Runs on Scrapeless infrastructure, not locally
2. **Residential Proxies**: Built-in support for residential proxy rotation
3. **Anti-detection**: Automatic browser fingerprinting and stealth features
4. **Session Recording**: Optional recording of browser sessions
5. **No Installation**: No need to install Chrome/Chromium locally
6. **Scalable**: Run multiple sessions in parallel

## Limitations

- Profile management is not currently supported
- Browser extensions are not currently supported
- Requires active internet connection
- Requires valid Scrapeless API token

## Best Practices

1. **Always set API token** before running commands (via config or env var)
2. **Let automatic session management work** - the CLI will reuse sessions intelligently
3. **Use --session-id** only when you need parallel workflows
4. **Close sessions** when done to avoid charges
5. **Use config file** for persistent settings instead of environment variables
6. **Enable recording** for debugging complex flows
7. **Re-snapshot** after page changes
8. **Use --json** for programmatic access
9. **Set session timeout** appropriately for your use case (in seconds)

## Updates

Check for and install updates:

```bash
# Check version
scrapeless-scraping-browser version

# Update via npm
npm update -g scrapeless-scraping-browser-skills
```

## Support

- Documentation: https://docs.scrapeless.com
- API Reference: https://api.scrapeless.com/docs
- GitHub Issues: https://github.com/scrapeless-ai/scraping-browser-skill/issues

FILE:skill.json
{
  "name": "scraping-browser-skill",
  "description": "Cloud browser automation CLI for AI agents powered by Scrapeless. Use when the user needs to interact with websites using cloud browsers, including navigating pages, filling forms, clicking buttons, taking screenshots, extracting data, testing web apps, or automating any browser task with residential proxies and anti-detection features.",
  "version": "0.1.0",
  "author": "Scrapeless AI",
  "license": "Apache-2.0",
  "repository": "https://github.com/scrapeless-ai/scraping-browser-skill",
  "keywords": [
    "browser",
    "automation",
    "scraping",
    "ai",
    "agent",
    "cloud",
    "scrapeless",
    "cli"
  ],
  "triggers": [
    "open a website",
    "fill out a form",
    "click a button",
    "take a screenshot",
    "scrape data from a page",
    "test this web app",
    "use a proxy",
    "bypass detection",
    "browser automation",
    "web scraping",
    "cloud browser"
  ],
  "allowedTools": [
    "Bash(npx scrapeless-scraping-browser:*)",
    "Bash(scrapeless-scraping-browser:*)"
  ],
  "requirements": {
    "node": ">=18.0.0"
  },
  "installation": {
    "npm": "npm install -g scrapeless-scraping-browser"
  }
}

FILE:SECURITY.md
# Security Policy

## Supported Versions

We actively support the following versions of scraping-browser-skill:

| Version | Supported          |
| ------- | ------------------ |
| 0.1.x   | :white_check_mark: |

## Reporting a Vulnerability

If you discover a security vulnerability in scraping-browser-skill, please report it responsibly:

### How to Report

1. **Email**: Send details to [email protected]
2. **Include**:
   - Description of the vulnerability
   - Steps to reproduce
   - Potential impact
   - Suggested fix (if any)

### What to Expect

- **Acknowledgment**: Within 48 hours
- **Initial Assessment**: Within 5 business days
- **Regular Updates**: Every 7 days until resolved
- **Resolution Timeline**: Critical issues within 30 days

### Security Best Practices

When using scraping-browser-skill:

1. **API Key Security**:
   - Store API keys securely using config files with restricted permissions (0600)
   - Never commit API keys to version control
   - Use environment variables only when necessary

2. **Network Security**:
   - All communications with Scrapeless API use HTTPS
   - Proxy configurations should use secure protocols when possible

3. **Data Handling**:
   - Screenshots and extracted data are handled locally
   - No sensitive data is transmitted to third parties except Scrapeless API
   - Session recordings (if enabled) are stored on Scrapeless servers

4. **Access Control**:
   - Configuration files are created with user-only permissions
   - Session management prevents unauthorized access to browser instances

### Scope

This security policy covers:

- The scraping-browser-skill CLI tool
- Integration with Scrapeless API
- Local configuration and data handling
- Session management and authentication

### Out of Scope

- Scrapeless cloud infrastructure (report to Scrapeless directly)
- Third-party dependencies (report to respective maintainers)
- User-generated content or scripts

## Security Features

- **Secure Configuration**: Config files created with 0600 permissions
- **API Authentication**: Header-based authentication with Scrapeless API
- **Session Isolation**: Each session is isolated and automatically expires
- **No Local Browser**: Eliminates local browser security risks
- **Encrypted Communication**: All API communication over HTTPS

## Responsible Disclosure

We appreciate security researchers who:

- Give us reasonable time to fix issues before public disclosure
- Avoid accessing or modifying user data
- Don't perform testing on production systems without permission
- Follow coordinated disclosure practices

Thank you for helping keep scraping-browser-skill secure!

FILE:references/authentication.md
# Authentication

## API Key Setup

Get your API key from the [Scrapeless Dashboard](https://app.scrapeless.com).

### Method 1: Config File (Recommended)

```bash
scrapeless-scraping-browser config set apiKey your_api_key_here
```

This stores the key securely in `~/.scrapeless/config.json` with restricted permissions (0600).
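
To confirm the restricted permissions on your own machine, a small check like the one below may help. It is a sketch in Python that assumes a POSIX system; the function name `is_user_only` is hypothetical and not part of the CLI.

```python
import os
import stat
from pathlib import Path


def is_user_only(path: Path) -> bool:
    """Return True if neither group nor others have any permission bits set."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


if __name__ == "__main__":
    # Path taken from this document; the file exists only after `config set`.
    config_path = Path.home() / ".scrapeless" / "config.json"
    if config_path.exists():
        print(config_path, "user-only:", is_user_only(config_path))
    else:
        print(config_path, "not found")
```

A mode of 0600 passes this check, while a world-readable mode such as 0644 does not.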
### Method 2: Environment Variable

```bash
export SCRAPELESS_API_KEY=your_api_key_here
```

For persistence, add to your shell profile:

```bash
echo 'export SCRAPELESS_API_KEY=your_api_key_here' >> ~/.zshrc
source ~/.zshrc
```

## Configuration Priority

Config file > Environment variable

## Verify Authentication

```bash
# Check if API key is set
scrapeless-scraping-browser config get apiKey

# Test with a simple command
scrapeless-scraping-browser sessions
```

## Security Best Practices

1. Never commit API keys to version control
2. Use config file method for persistent storage
3. Config files are automatically created with user-only permissions
4. Rotate API keys regularly from the Scrapeless Dashboard
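
The configuration priority described above (config file first, then environment variable) can be illustrated with a short Python sketch. It is not the CLI's actual lookup code. It assumes the config file is a JSON object with a top-level `apiKey` field, and `resolve_api_key` is a hypothetical name.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional

CONFIG_PATH = Path.home() / ".scrapeless" / "config.json"


def resolve_api_key(
    config_path: Path = CONFIG_PATH,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the API key, preferring the config file over the environment."""
    try:
        config = json.loads(config_path.read_text())
    except (OSError, ValueError):
        # Missing, unreadable, or malformed file: fall through to the env var.
        config = {}
    if not isinstance(config, dict):
        config = {}
    # Assumption: the key is stored as a top-level "apiKey" field.
    return config.get("apiKey") or env.get("SCRAPELESS_API_KEY")
```

If both sources are set, the value from the config file wins, which matches the stated priority.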
Bypass website blocks and scrape web content using Scrapeless Universal Scraping API.
---
name: webunlocker
description: Bypass website blocks and scrape web content using Scrapeless Universal Scraping API.
homepage: https://www.scrapeless.com
credentials:
- X_API_TOKEN
env:
required:
- X_API_TOKEN
---
# WebUnlocker OpenClaw Skill
Use this skill to bypass website blocks and scrape web content using the Scrapeless Universal Scraping API. It supports JavaScript rendering, CAPTCHA solving, IP rotation, and intelligent request retries.
**Authentication:** Set `X_API_TOKEN` in your environment or in a `.env` file in the repo root.
**Errors:** On failure, the script writes a JSON error to stderr and exits with code 1.
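
A caller can rely on this contract when wrapping the script. The Python sketch below runs a command, returns the parsed JSON on success, and raises with the parsed stderr payload on failure. It assumes the success output is JSON on stdout; the shape of the error object is not documented here, so it is passed through untouched. `run_and_parse` is a hypothetical helper name.

```python
import json
import subprocess
import sys


def run_and_parse(argv: list[str]) -> dict:
    """Run a command; return stdout parsed as JSON, or raise with the stderr error."""
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        try:
            error = json.loads(result.stderr)
        except json.JSONDecodeError:
            # Fall back to the raw text if stderr is not valid JSON.
            error = {"raw": result.stderr.strip()}
        raise RuntimeError(error)
    return json.loads(result.stdout)


def scrape(url: str) -> dict:
    """Invoke the script with the documented --url argument."""
    return run_and_parse([sys.executable, "scripts/webunlocker.py", "--url", url])
```

The error payload is available as `exc.args[0]` on the raised `RuntimeError`.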
---
## Usage
**Command:**
```bash
python3 scripts/webunlocker.py --url "https://example.com"
```
**Examples:**
```bash
# Scrape HTML content
python3 scripts/webunlocker.py --url "https://httpbin.io/get"
# Scrape plain text
python3 scripts/webunlocker.py --url "https://example.com" --response-type plaintext
# Scrape as Markdown
python3 scripts/webunlocker.py --url "https://example.com" --response-type markdown
# Take a screenshot
python3 scripts/webunlocker.py --url "https://example.com" --response-type png
# Capture network requests
python3 scripts/webunlocker.py --url "https://example.com" --response-type network
# Extract specific content types
python3 scripts/webunlocker.py --url "https://example.com" --response-type content --content-types emails,links,images
# Use a specific country proxy
python3 scripts/webunlocker.py --url "https://example.com" --country US
# Use POST method
python3 scripts/webunlocker.py --url "https://httpbin.org/post" --method POST --data '{"key": "value"}'
# Add custom headers
python3 scripts/webunlocker.py --url "https://example.com" --headers '{"User-Agent": "Mozilla/5.0"}'
# Use custom proxy
python3 scripts/webunlocker.py --url "https://example.com" --proxy-url "http://your-proxy-url:port"
# Enable JavaScript rendering
python3 scripts/webunlocker.py --url "https://example.com" --js-render
# Enable JavaScript rendering with headless mode
python3 scripts/webunlocker.py --url "https://example.com" --js-render --headless
# Enable JavaScript rendering and wait for specific element
python3 scripts/webunlocker.py --url "https://example.com" --js-render --wait-selector "body > div > p:nth-child(3) > a"
# Bypass Cloudflare protection with JavaScript rendering
python3 scripts/webunlocker.py --url "https://example.com" --js-render
# Bypass Cloudflare Turnstile challenge
python3 scripts/webunlocker.py --url "https://2captcha.com/demo/cloudflare-turnstile-challenge" --js-render --headless --response-type markdown
```
---
## Summary
| Argument | Description | Default |
|----------|-------------|---------|
| `--url` | Target URL | Required |
| `--method` | HTTP method | GET |
| `--redirect` | Allow redirects | False |
| `--headers` | Custom headers as JSON string | None |
| `--data` | Request data as JSON string | None |
| `--response-type` | Response type (html, plaintext, markdown, png, jpeg, network, content) | html |
| `--content-types` | Content types to extract (comma-separated) | None |
| `--country` | Country code for proxy | ANY |
| `--proxy-url` | Custom proxy URL | None |
| `--js-render` | Enable JavaScript rendering | False |
| `--headless` | Run browser in headless mode | False |
| `--wait-selector` | Wait for element with this selector to appear | None |
**Output:** All commands return JSON objects with the scraped content or Cloudflare bypass results.
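The arguments above can also be driven from Python. The following is a minimal sketch, assuming the script lives at `scripts/webunlocker.py` relative to the working directory; `build_command`, `interpret`, and `run_webunlocker` are hypothetical helper names, not part of the skill. It reads the JSON from stdout when present and otherwise from stderr, falling back to the raw text if the output is not valid JSON (for example when log lines are mixed in).

```python
import json
import subprocess
import sys
from typing import Any, Dict, List


def build_command(url: str, **options: Any) -> List[str]:
    """Turn keyword options into CLI flags, e.g. response_type -> --response-type."""
    cmd = [sys.executable, "scripts/webunlocker.py", "--url", url]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            # Boolean switches such as --js-render take no value.
            cmd.append(flag)
        elif value is not None and value is not False:
            cmd.extend([flag, str(value)])
    return cmd


def interpret(returncode: int, stdout: str, stderr: str) -> Dict[str, Any]:
    """Parse the JSON the script printed; keep the raw text if it is not JSON."""
    raw = stdout.strip() or stderr.strip()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"success": returncode == 0, "raw": raw}


def run_webunlocker(url: str, **options: Any) -> Dict[str, Any]:
    """Run the script once and return its parsed result."""
    proc = subprocess.run(build_command(url, **options), capture_output=True, text=True)
    return interpret(proc.returncode, proc.stdout, proc.stderr)
```

A call such as `run_webunlocker("https://example.com", response_type="markdown", js_render=True)` would then correspond to the `--response-type markdown --js-render` example above.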
---
## Response Types
### HTML
Returns the HTML content of the page as an escaped string.
### Plaintext
Returns the plain text content of the page, removing all HTML tags.
### Markdown
Returns the page content formatted as Markdown for better readability.
### PNG/JPEG
Returns a base64 encoded string of the page screenshot.
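To turn the base64 string into an image file, something like the sketch below may help. It assumes you have already pulled the base64 string out of the JSON result (where exactly it sits in the response is not specified here); `save_screenshot` is a hypothetical helper name.

```python
import base64
from pathlib import Path


def save_screenshot(b64_image: str, path: str) -> int:
    """Decode a base64 screenshot string and write it to disk; returns bytes written."""
    # Tolerate an optional data-URI prefix such as "data:image/png;base64,".
    if b64_image.startswith("data:") and "," in b64_image:
        b64_image = b64_image.split(",", 1)[1]
    image_bytes = base64.b64decode(b64_image)
    Path(path).write_bytes(image_bytes)
    return len(image_bytes)
```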
### Network
Returns all network requests made during page load, including URLs, methods, status codes, and headers.
### Content
Returns specific content types extracted from the page, such as emails, phone numbers, headings, images, audios, videos, links, hashtags, metadata, tables, and favicon.
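A small validation step can catch typos before a request is sent. This sketch builds a `--content-types` value from a list; the set of names comes from the sentence above, but the exact spellings the API accepts (for example `phone_numbers`) are assumptions, and `content_types_arg` is a hypothetical helper.

```python
from typing import Iterable

# Names taken from the description above; the exact spellings accepted by
# the API (e.g. "phone_numbers") are assumptions.
KNOWN_CONTENT_TYPES = frozenset({
    "emails", "phone_numbers", "headings", "images", "audios", "videos",
    "links", "hashtags", "metadata", "tables", "favicon",
})


def content_types_arg(requested: Iterable[str]) -> str:
    """Validate content types and join them into a --content-types value."""
    cleaned = [name.strip().lower() for name in requested if name.strip()]
    unknown = sorted(set(cleaned) - KNOWN_CONTENT_TYPES)
    if unknown:
        raise ValueError(f"Unknown content types: {', '.join(unknown)}")
    # dict.fromkeys drops duplicates while keeping the caller's order.
    return ",".join(dict.fromkeys(cleaned))
```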
---
## Notes
⚠️ **Timeout Policy:**
- Page load timeout: 30 seconds
- Global execution timeout: 180 seconds
⚠️ **Supported CAPTCHAs:**
- reCAPTCHA v2
- Cloudflare Turnstile
- Cloudflare Challenge
⚠️ **Rate Limits:**
- 429 errors indicate rate limit exceeded. Reduce request frequency or upgrade plan.
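One way to reduce request frequency is to retry with exponential backoff. The sketch below assumes the result dictionary produced by this skill's script, where a 429 surfaces as `"success": false` with a message containing "Rate limit"; `call_with_backoff` and its parameters are hypothetical, and the delays are illustrative rather than recommended values.

```python
import random
import time
from typing import Any, Callable, Dict


def call_with_backoff(
    request_fn: Callable[[], Dict[str, Any]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Retry request_fn while it reports a rate limit, doubling the delay each time."""
    result: Dict[str, Any] = {}
    for attempt in range(max_attempts):
        result = request_fn()
        rate_limited = (not result.get("success")) and "Rate limit" in str(result.get("message", ""))
        if not rate_limited:
            return result
        if attempt < max_attempts - 1:
            # Exponential backoff with a little jitter: about 1s, 2s, 4s, ...
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    # Still rate limited after the last attempt: hand back the last result.
    return result
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.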
⚠️ **Billing:**
- Charges are applied on a per-request basis
- Only successful requests will be billed
FILE:requirements.txt
requests
python-dotenv
FILE:README.md
[<img width="1200" height="629" alt="20260318-100141" src="https://github.com/user-attachments/assets/3a0f7070-d6ad-4ebe-ab5e-07de5356a46a" />](https://www.scrapeless.com/en/product/universal-scraping-api)
</p>
<p align="center">
<strong>OpenClaw skill of Scrapeless Web Unlocker for Web Scraping, Cloudflare Solving, and AI Data Collection.</strong><br/>
</p>
<p align="center">
<a href="https://www.youtube.com/@Scrapeless" target="_blank">
<img src="https://img.shields.io/badge/Follow%20on%20YouTuBe-FF0033?style=for-the-badge&logo=youtube&logoColor=white" alt="Follow on YouTube" />
</a>
<a href="https://discord.com/invite/xBcTfGPjCQ" target="_blank">
<img src="https://img.shields.io/badge/Join%20our%20Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white" alt="Join our Discord" />
</a>
<a href="https://x.com/Scrapelessteam" target="_blank">
<img src="https://img.shields.io/badge/Follow%20us%20on%20X-000000?style=for-the-badge&logo=x&logoColor=white" alt="Follow us on X" />
</a>
<a href="https://www.reddit.com/r/Scrapeless" target="_blank">
<img src="https://img.shields.io/badge/Join%20us%20on%20Reddit-FF4500?style=for-the-badge&logo=reddit&logoColor=white" alt="Join us on Reddit" />
</a>
<a href="https://app.scrapeless.com/passport/register?utm_source=official&utm_term=githubopen" target="_blank">
<img src="https://img.shields.io/badge/Official%20Website-12A594?style=for-the-badge&logo=google-chrome&logoColor=white" alt="Official Website"/>
</a>
</p>
---
# 🤖 Scrapeless OpenClaw WebUnlocker Skill
A skill for the Scrapeless platform that enables you to solve website blocks and scrape web content using the Scrapeless Universal Scraping API. It supports JavaScript rendering, CAPTCHA solving, IP rotation, and intelligent request retries.
## Overview
The **Web Unlocker Skill** allows developers and AI agents to **access and extract data from websites that normally block automated traffic**.
Built on top of the **Scrapeless Universal Scraping API**, this skill automatically handles common bot protections such as **Cloudflare, CAPTCHA challenges, IP blocking, and JavaScript rendering**, making it easy to retrieve clean web data from difficult targets.
Instead of managing proxy pools, headless browsers, and bypass logic yourself, Web Unlocker provides a **simple API interface to reliably fetch web pages at scale**.
This makes it ideal for **web scraping, data pipelines, AI training datasets, market intelligence, and automation workflows**.
## ❓ Why Use Web Unlocker
Modern websites deploy increasingly sophisticated bot detection systems such as:
- Cloudflare protection
- CAPTCHA challenges
- Browser fingerprint detection
- IP reputation blocking
- JavaScript-rendered content
Traditional scraping tools or headless browsers often fail against these protections.
**Web Unlocker solves this by combining:**
- Stealth browser infrastructure
- Proxy rotation
- CAPTCHA solving
- Intelligent retry mechanisms
👉 Developers only need to send a request — the platform handles the rest.
---
## ✨ Key Features
**🤖 Automatic CAPTCHA Solving**
- Supports reCAPTCHA, Cloudflare Turnstile and Cloudflare challenge pages.
**🌐 JavaScript Rendering**
- Execute full browser rendering for modern frameworks such as **React, Next.js, and Vue**.
**🌍 Global Proxy Infrastructure**
- Built-in proxy rotation and country selection for higher success rates and geo-targeted scraping.
**📦 Multiple Response Formats**
- Retrieve data in various formats:
- HTML
- Plain text
- Markdown
- Screenshots (PNG / JPEG)
- Network requests
- Structured extracted content
**🔁 Intelligent Retry System**
- Automatically retries failed requests using optimized routing.
---
## 🧩 Use Cases
**📊 Web Scraping & Data Extraction**
- Collect structured data from e-commerce, search engines, directories, and public websites.
**🤖 AI Training Data Collection**
- Gather high-quality datasets for LLM training, AI evaluation, or synthetic data generation.
**📈 Market Intelligence**
- Monitor competitors, pricing data, product catalogs, and industry signals.
**🔍 SEO & AI Search Monitoring**
- Track how websites appear across search engines and AI-powered search platforms.
**⚙️ Automation & AI Agents**
- Integrate web data directly into **AI agents, workflows, or automation platforms**.
---
## Installation
1. Clone the repository:
```bash
git clone https://github.com/scrapeless-ai/webunlocker-skill.git
```
2. Install dependencies for WebUnlocker:
```bash
cd webunlocker-skill
pip install -r requirements.txt
```
## ⚙️ Environment Configuration
1. **Manual installation**: Place the skill in OpenClaw’s `.openclaw/skills` directory.
2. Create a `.env` file in the root directory based on the `.env.example` file:
```bash
cp .env.example .env
```
3. Add your Scrapeless API token to the `.env` file:
```
X_API_TOKEN=your_api_token_here
```
You can obtain an API token from the [Scrapeless website](https://www.scrapeless.com).
## Usage Examples
```bash
# Scrape HTML content
python3 scripts/webunlocker.py --url "https://httpbin.io/get"
# Scrape as Markdown
python3 scripts/webunlocker.py --url "https://example.com" --response-type markdown
# Take a screenshot
python3 scripts/webunlocker.py --url "https://example.com" --response-type png
# Extract specific content types
python3 scripts/webunlocker.py --url "https://example.com" --response-type content --content-types emails,links,images
# Use a US proxy
python3 scripts/webunlocker.py --url "https://example.com" --country US
# Use POST method
python3 scripts/webunlocker.py --url "https://httpbin.org/post" --method POST --data '{"key": "value"}'
# Add custom headers
python3 scripts/webunlocker.py --url "https://example.com" --headers '{"User-Agent": "Mozilla/5.0"}'
# Use custom proxy
python3 scripts/webunlocker.py --url "https://example.com" --proxy-url "http://your-proxy-url:port"
# Enable JavaScript rendering
python3 scripts/webunlocker.py --url "https://example.com" --js-render
# Bypass Cloudflare Turnstile challenge
python3 scripts/webunlocker.py --url "https://2captcha.com/demo/cloudflare-turnstile-challenge" --js-render --headless --response-type markdown
```
## Output Structure
Web Unlocker supports multiple response formats depending on your needs.
| Response Type | Description |
|--------------|------------|
| HTML | Full rendered page HTML |
| Plaintext | Clean text without HTML tags |
| Markdown | Structured Markdown content |
| PNG / JPEG | Page screenshots |
| Network | All network requests during page load |
| Content | Extract specific data types such as emails, links, or images |
## Common Issues
### Rate Limits
If you encounter 429 errors, you've exceeded the rate limit. Reduce request frequency or upgrade your Scrapeless plan.
### Timeouts
- Page load timeout: 30 seconds
- Global execution timeout: 180 seconds
### CAPTCHA Solving
WebUnlocker automatically handles reCAPTCHA v2, Cloudflare Turnstile, and Cloudflare Challenge, but complex CAPTCHAs may occasionally fail.
### Billing
- Charges are applied on a per-request basis
- Only successful requests will be billed
## 🔗 Related resources
- [Scrapeless LLM Scraper](https://docs.scrapeless.com/en/llm-chat-scraper/quickstart/introduction/)
- [Scrapeless Universal Scraping API](https://docs.scrapeless.com/en/universal-scraping-api/)
## 📬 Contact Us
For questions, suggestions, or collaboration inquiries, feel free to contact us via:
- Email/Slack: [email protected]
- Official Website: https://www.scrapeless.com
- Community Forum: [Browser Labs Discord](https://discord.com/invite/xBcTfGPjCQ)
FILE:scripts/webunlocker.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
WebUnlocker Tool for OpenClaw
Bypass website blocks and scrape web content using Scrapeless Universal Scraping API.
"""
import os
import sys
import time
import json
import argparse
import logging
from typing import Optional, Dict, Any, List
from dotenv import load_dotenv
# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
# Load environment variables from .env file
load_dotenv()
try:
    import requests
except ImportError:
    print(json.dumps({
        'error': 'Missing dependency',
        'message': 'requests library not found. Please install it: pip install requests'
    }), file=sys.stderr)
    sys.exit(1)
class WebUnlocker:
    """WebUnlocker for Scrapeless Universal Scraping API"""
    def __init__(self, api_token: Optional[str] = None):
        self.api_token = api_token or os.getenv('X_API_TOKEN')
        if not self.api_token:
            raise ValueError(
                "API token not found. Please set X_API_TOKEN in .env file or environment variable."
            )
        self.api_base_url = 'https://api.scrapeless.com'
        self.endpoint = '/api/v2/unlocker/request'
        self.headers = {
            'Content-Type': 'application/json',
            'x-api-token': self.api_token
        }
        logger.info(f"WebUnlocker initialized with API base URL: {self.api_base_url}")
    def _build_request_payload(
        self,
        actor: str,
        url: str,
        method: str = 'GET',
        redirect: bool = False,
        headers: Optional[Dict[str, str]] = None,
        data: Optional[Dict[str, Any]] = None,
        response_type: Optional[str] = None,
        content_types: Optional[List[str]] = None,
        country: str = 'ANY',
        proxy_url: Optional[str] = None,
        js_render: Optional[Dict[str, Any]] = None
    ) -> Dict[str, Any]:
        input_params = {
            'url': url,
            'method': method,
            'redirect': redirect
        }
        if headers:
            input_params['headers'] = headers
        if data:
            input_params['data'] = data
        # Set response_type in top-level for non-jsRender requests
        if response_type and not js_render:
            input_params['response_type'] = response_type
        # content_types is only used with response_type='content'
        if content_types and response_type == 'content':
            input_params['content_types'] = content_types
        if js_render:
            input_params['jsRender'] = js_render
        payload = {
            'actor': actor,
            'input': input_params
        }
        # Build proxy configuration
        proxy_config = {}
        if country:
            proxy_config['country'] = country
        if proxy_url:
            proxy_config['url'] = proxy_url
        if proxy_config:
            payload['proxy'] = proxy_config
        return payload
    def execute(
        self,
        actor: str,
        url: str,
        method: str = 'GET',
        redirect: bool = False,
        headers: Optional[Dict[str, str]] = None,
        data: Optional[Dict[str, Any]] = None,
        response_type: Optional[str] = None,
        content_types: Optional[List[str]] = None,
        country: str = 'ANY',
        proxy_url: Optional[str] = None,
        js_render: Optional[Dict[str, Any]] = None
    ) -> Dict[str, Any]:
        try:
            logger.info(f"Sending request to {actor} for URL: {url}")
            payload = self._build_request_payload(
                actor=actor,
                url=url,
                method=method,
                redirect=redirect,
                headers=headers,
                data=data,
                response_type=response_type,
                content_types=content_types,
                country=country,
                proxy_url=proxy_url,
                js_render=js_render
            )
            response = requests.post(
                f"{self.api_base_url}{self.endpoint}",
                headers=self.headers,
                json=payload,
                timeout=180  # Global execution timeout as per documentation
            )
            if response.status_code == 200:
                result = response.json()
                logger.info("Request completed successfully")
                return {
                    'success': True,
                    'data': result,
                    'status': 'success'
                }
            elif response.status_code == 400:
                raise ValueError(f"Invalid request: {response.json()}")
            elif response.status_code == 429:
                raise RuntimeError("Rate limit exceeded")
            else:
                raise RuntimeError(f"Unexpected status code {response.status_code}: {response.text}")
        except Exception as e:
            return {
                'success': False,
                'error': type(e).__name__,
                'message': str(e)
            }
def main():
    parser = argparse.ArgumentParser(
        description='WebUnlocker Tool',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python3 webunlocker.py --url "https://httpbin.io/get"
  python3 webunlocker.py --url "https://example.com" --response-type plaintext
  python3 webunlocker.py --url "https://example.com" --response-type png
  python3 webunlocker.py --url "https://httpbin.org/post" --method POST --data '{"key": "value"}'
  python3 webunlocker.py --url "https://example.com" --headers '{"User-Agent": "Mozilla/5.0"}'
        """
    )
    parser.add_argument('--url', required=True, help='Target URL')
    parser.add_argument('--method', default='GET', choices=['GET', 'POST', 'PUT', 'DELETE'], help='HTTP method (default: GET)')
    parser.add_argument('--redirect', action='store_true', help='Allow redirects')
    parser.add_argument('--headers', type=str, help='Custom headers as JSON string')
    parser.add_argument('--data', type=str, help='Request data as JSON string')
    parser.add_argument('--response-type', default='html', choices=['html', 'plaintext', 'markdown', 'png', 'jpeg', 'network', 'content'], help='Response type (default: html)')
    parser.add_argument('--content-types', type=str, help='Content types to extract (comma-separated)')
    parser.add_argument('--country', default='ANY', help='Country code for proxy (default: ANY)')
    parser.add_argument('--proxy-url', type=str, help='Custom proxy URL')
    parser.add_argument('--js-render', action='store_true', help='Enable JavaScript rendering')
    parser.add_argument('--headless', action='store_true', help='Run browser in headless mode')
    parser.add_argument('--wait-selector', type=str, help='Wait for element with this selector to appear')
    args = parser.parse_args()
    try:
        unlocker = WebUnlocker()
        # Parse headers and data if provided
        headers = None
        if args.headers:
            headers = json.loads(args.headers)
        data = None
        if args.data:
            data = json.loads(args.data)
        # Parse content types if provided
        content_types = None
        if args.content_types:
            content_types = args.content_types.split(',')
        # Build jsRender configuration
        js_render = None
        if args.js_render:
            js_render = {
                'enabled': True,
                'headless': args.headless,
                'waitUntil': 'load',
                'instructions': [
                    {
                        'wait': 5000,  # Wait 5 seconds for page to load
                        'waitFor': {
                            '0': args.wait_selector if args.wait_selector else 'body',
                            '1': 30000  # 30 seconds timeout for selector
                        }
                    }
                ],
                'response': {
                    'type': args.response_type
                }
            }
            # Add response options if needed
            if args.wait_selector:
                js_render['response']['options'] = {
                    'selector': args.wait_selector
                }
            # Add content_types if response type is 'content'
            if args.content_types and args.response_type == 'content':
                if 'options' not in js_render['response']:
                    js_render['response']['options'] = {}
                js_render['response']['options']['outputs'] = args.content_types
        result = unlocker.execute(
            actor='unlocker.webunlocker',
            url=args.url,
            method=args.method,
            redirect=args.redirect,
            headers=headers,
            data=data,
            response_type=args.response_type,
            content_types=content_types,
            country=args.country,
            proxy_url=args.proxy_url,
            js_render=js_render
        )
        # Output result as JSON
        print(json.dumps(result, indent=2, ensure_ascii=False))
        # Exit with error code if failed
        if not result.get('success'):
            sys.exit(1)
    except Exception as e:
        print(json.dumps({
            'success': False,
            'error': type(e).__name__,
            'message': str(e)
        }, indent=2), file=sys.stderr)
        sys.exit(1)
def webunlocker_scrape(
    url: str,
    method: str = 'GET',
    redirect: bool = False,
    headers: Optional[Dict[str, str]] = None,
    data: Optional[Dict[str, Any]] = None,
    response_type: str = 'html',
    content_types: Optional[List[str]] = None,
    country: str = 'ANY',
    proxy_url: Optional[str] = None,
    js_render: Optional[Dict[str, Any]] = None,
    api_token: Optional[str] = None
) -> Dict[str, Any]:
    """Scrape web content using WebUnlocker
    Args:
        url: Target URL
        method: HTTP method (default: GET)
        redirect: Whether to allow redirects (default: False)
        headers: Custom headers
        data: Request data
        response_type: Response type (html, plaintext, markdown, png, jpeg, network, content)
        content_types: Content types to extract (e.g., emails,links,images)
        country: Country code for proxy (default: ANY)
        proxy_url: Custom proxy URL
        js_render: JavaScript rendering configuration
        api_token: API token for Scrapeless API
    Returns:
        Dict with success status and result data
    """
    unlocker = WebUnlocker(api_token=api_token)
    return unlocker.execute(
        actor='unlocker.webunlocker',
        url=url,
        method=method,
        redirect=redirect,
        headers=headers,
        data=data,
        response_type=response_type,
        content_types=content_types,
        country=country,
        proxy_url=proxy_url,
        js_render=js_render
    )
def cloudflare_bypass(
    url: str,
    country: str = 'ANY',
    proxy_url: Optional[str] = None,
    js_render: Optional[Dict[str, Any]] = None,
    api_token: Optional[str] = None
) -> Dict[str, Any]:
    """Bypass Cloudflare protection
    Args:
        url: Target URL
        country: Country code for proxy (default: ANY)
        proxy_url: Custom proxy URL
        js_render: JavaScript rendering configuration
        api_token: API token for Scrapeless API
    Returns:
        Dict with success status and result data
    """
    unlocker = WebUnlocker(api_token=api_token)
    return unlocker.execute(
        actor='unlocker.webunlocker',
        url=url,
        country=country,
        proxy_url=proxy_url,
        js_render=js_render
    )
if __name__ == '__main__':
    main()
Scrape AI chat conversations from ChatGPT, Gemini, Perplexity, Copilot, Google AI Mode, and Grok.
---
name: llm-chat-scraper
description: Scrape AI chat conversations from ChatGPT, Gemini, Perplexity, Copilot, Google AI Mode, and Grok.
homepage: https://www.scrapeless.com
credentials:
- X_API_TOKEN
env:
required:
- X_API_TOKEN
---
# LLM Chat Scraper OpenClaw Skill
Use this skill to scrape AI chat conversations from various LLM models via the Scrapeless API. The skill supports ChatGPT, Gemini, Perplexity, Copilot, Google AI Mode, and Grok.
**Authentication:** Set `X_API_TOKEN` in your environment or in a `.env` file in the repo root.
**Errors:** On failure the script writes a JSON error to stderr and exits with code 1.
---
## Tools
### 1. ChatGPT Scraper
Scrape ChatGPT responses with optional web search enrichment. Returns JSON object with `result_text`, `model`, `links`, `citations`, and more.
**Command:**
```bash
python3 scripts/llm_chat_scraper.py chatgpt --query "your prompt"
```
**Examples:**
```bash
python3 scripts/llm_chat_scraper.py chatgpt --query "Most reliable proxy service for data extraction"
python3 scripts/llm_chat_scraper.py chatgpt --query "AI trends in 2024" --web-search
python3 scripts/llm_chat_scraper.py chatgpt --query "Best programming languages" --country GB
```
Optional: `--country` for location, `--web-search` to enable web search.
---
### 2. Gemini Scraper
Scrape Google Gemini responses. Returns JSON object with `result_text`, `citations`, and more.
**Command:**
```bash
python3 scripts/llm_chat_scraper.py gemini --query "your prompt"
```
**Examples:**
```bash
python3 scripts/llm_chat_scraper.py gemini --query "Recommended attractions in New York"
python3 scripts/llm_chat_scraper.py gemini --query "Best restaurants in London" --country GB
```
Optional: `--country` for location (JP and TW not supported).
---
### 3. Perplexity Scraper
Scrape Perplexity AI responses with optional web search. Returns JSON object with `result_text`, `related_prompt`, `web_results`, `media_items`.
**Command:**
```bash
python3 scripts/llm_chat_scraper.py perplexity --query "your prompt"
```
**Examples:**
```bash
python3 scripts/llm_chat_scraper.py perplexity --query "Latest AI developments"
python3 scripts/llm_chat_scraper.py perplexity --query "Quantum computing explained" --web-search
```
Optional: `--country` for location, `--web-search` to enable web search.
---
### 4. Copilot Scraper
Scrape Microsoft Copilot responses across different modes (search, smart, chat, reasoning, study). Returns JSON object with `result_text`, `mode`, `links`, `citations`.
**Command:**
```bash
python3 scripts/llm_chat_scraper.py copilot --query "your prompt"
```
**Examples:**
```bash
python3 scripts/llm_chat_scraper.py copilot --query "What is machine learning?"
python3 scripts/llm_chat_scraper.py copilot --query "Explain blockchain" --mode reasoning
python3 scripts/llm_chat_scraper.py copilot --query "Best laptop 2024" --mode search
```
Optional: `--country` for location (JP and TW not supported), `--mode` for operation mode.
---
### 5. Google AI Mode Scraper
Scrape Google AI Mode responses. Returns JSON object with `result_text`, `result_md`, `result_html`, `citations`, `raw_url`.
**Command:**
```bash
python3 scripts/llm_chat_scraper.py aimode --query "your prompt"
```
**Examples:**
```bash
python3 scripts/llm_chat_scraper.py aimode --query "Best programming languages to learn"
python3 scripts/llm_chat_scraper.py aimode --query "Climate change solutions" --country GB
```
Optional: `--country` for location (JP and TW not supported).
---
### 6. Grok Scraper
Scrape xAI Grok responses with different modes (FAST, EXPERT, AUTO). Returns JSON object with `full_response`, `user_model`, `follow_up_suggestions`, `web_search_results`.
**Command:**
```bash
python3 scripts/llm_chat_scraper.py grok --query "your prompt"
```
**Examples:**
```bash
python3 scripts/llm_chat_scraper.py grok --query "Explain quantum entanglement"
python3 scripts/llm_chat_scraper.py grok --query "What's happening in AI" --mode MODEL_MODE_EXPERT
python3 scripts/llm_chat_scraper.py grok --query "Latest tech news" --mode MODEL_MODE_FAST
```
Optional: `--country` for location (JP and TW not supported), `--mode` for operation mode.
---
## Summary
| Action | Command | Argument | Example |
|--------|---------|----------|---------|
| ChatGPT | `chatgpt` | `--query` | `python3 scripts/llm_chat_scraper.py chatgpt --query "AI trends"` |
| Gemini | `gemini` | `--query` | `python3 scripts/llm_chat_scraper.py gemini --query "Best restaurants"` |
| Perplexity | `perplexity` | `--query` | `python3 scripts/llm_chat_scraper.py perplexity --query "Latest news"` |
| Copilot | `copilot` | `--query` | `python3 scripts/llm_chat_scraper.py copilot --query "Explain ML"` |
| Google AI Mode | `aimode` | `--query` | `python3 scripts/llm_chat_scraper.py aimode --query "Programming"` |
| Grok | `grok` | `--query` | `python3 scripts/llm_chat_scraper.py grok --query "Quantum physics"` |
**Output:** All commands return JSON objects with model-specific fields (see tool descriptions above).
---
## Response Fields by Model
### ChatGPT
- `result_text`: Markdown response
- `model`: Model identifier (e.g., gpt-4)
- `web_search`: Boolean indicating if search ran
- `links`: Array of supplementary links
- `citations`: Array of content references
### Gemini
- `result_text`: Markdown response
- `citations`: Array with favicon, highlights, snippet, title, url, website_name
### Perplexity
- `result_text`: Markdown response
- `related_prompt`: Array of related questions
- `web_results`: Array with name, url, snippet
- `media_items`: Array of media references
### Copilot
- `result_text`: Markdown response
- `mode`: Mode used (search/smart/chat/reasoning/study)
- `links`: Array of outbound links
- `citations`: Array with title, url
### Google AI Mode
- `result_text`: Answer body
- `result_md`: Markdown version
- `result_html`: HTML version
- `raw_url`: Original URL
- `citations`: Array with snippet, thumbnail, title, url, website_name, favicon
### Grok
- `full_response`: Response content
- `user_model`: Model used
- `follow_up_suggestions`: Array of suggested questions
- `web_search_results`: Array with preview, title, url
- `conversation`: Object with conversation metadata
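Because the answer text and the source list sit under different keys per model, downstream code may want a common shape. The sketch below maps the field names listed above onto one dictionary; it assumes `result` is the object that carries these fields directly (any outer wrapper would need unwrapping first), and `normalize_result` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Field names copied from the lists above; Grok uses different keys
# from the other models.
ANSWER_FIELD = {
    "chatgpt": "result_text",
    "gemini": "result_text",
    "perplexity": "result_text",
    "copilot": "result_text",
    "aimode": "result_text",
    "grok": "full_response",
}
SOURCE_FIELD = {
    "chatgpt": "citations",
    "gemini": "citations",
    "perplexity": "web_results",
    "copilot": "citations",
    "aimode": "citations",
    "grok": "web_search_results",
}


def normalize_result(model: str, result: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a model-specific result to answer text plus a list of source URLs."""
    sources = result.get(SOURCE_FIELD[model]) or []
    # Skip entries without a "url" key rather than failing on them.
    urls: List[str] = [s["url"] for s in sources if isinstance(s, dict) and s.get("url")]
    return {
        "model": model,
        "answer": result.get(ANSWER_FIELD[model], ""),
        "source_urls": urls,
    }
```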
---
## Notes
⚠️ **Regional Restrictions:**
- Gemini, Copilot, Google AI Mode, and Grok do not support Japan (JP) and Taiwan (TW)
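A pre-flight check can reject an unsupported `--country` before a request is made. This is a minimal sketch built only from the restriction stated above; the keys are the CLI subcommand names, and `check_country` is a hypothetical helper.

```python
from typing import Optional

# Restrictions as stated in the note above; ChatGPT and Perplexity are
# not listed there, so they have no entry here.
UNSUPPORTED_COUNTRIES = {
    "gemini": {"JP", "TW"},
    "copilot": {"JP", "TW"},
    "aimode": {"JP", "TW"},
    "grok": {"JP", "TW"},
}


def check_country(model: str, country: Optional[str]) -> None:
    """Raise ValueError if the model does not support the requested country."""
    if country is None:
        return
    code = country.upper()
    if code in UNSUPPORTED_COUNTRIES.get(model, set()):
        raise ValueError(f"{model} does not support country {code}")
```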
⚠️ **Result Expiry:**
- Task results are available for 12 hours
⚠️ **Rate Limits:**
- 429 errors indicate rate limit exceeded. Reduce request frequency or upgrade plan.
FILE:requirements.txt
requests>=2.31.0
python-dotenv>=1.0.0
FILE:README.md
[<img width="1200" height="629" alt="img_v3_02vs_05ae6cc6-fae6-4a1e-956f-f2cdc12b043g" src="https://github.com/user-attachments/assets/47d09e83-911d-4c15-b339-8ac635b68936" />](https://docs.scrapeless.com/en/llm-chat-scraper/scrapers/chatgpt/)
</p>
<p align="center">
<strong>Scrapeless OpenClaw skill for scraping ChatGPT, Gemini, Perplexity, Copilot, Google AI Mode, and Grok responses.</strong><br/>
</p>
<p align="center">
<a href="https://www.youtube.com/@Scrapeless" target="_blank">
<img src="https://img.shields.io/badge/Follow%20on%20YouTuBe-FF0033?style=for-the-badge&logo=youtube&logoColor=white" alt="Follow on YouTube" />
</a>
<a href="https://discord.com/invite/xBcTfGPjCQ" target="_blank">
<img src="https://img.shields.io/badge/Join%20our%20Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white" alt="Join our Discord" />
</a>
<a href="https://x.com/Scrapelessteam" target="_blank">
<img src="https://img.shields.io/badge/Follow%20us%20on%20X-000000?style=for-the-badge&logo=x&logoColor=white" alt="Follow us on X" />
</a>
<a href="https://www.reddit.com/r/Scrapeless" target="_blank">
<img src="https://img.shields.io/badge/Join%20us%20on%20Reddit-FF4500?style=for-the-badge&logo=reddit&logoColor=white" alt="Join us on Reddit" />
</a>
<a href="https://app.scrapeless.com/passport/register?utm_source=official&utm_term=githubopen" target="_blank">
<img src="https://img.shields.io/badge/Official%20Website-12A594?style=for-the-badge&logo=google-chrome&logoColor=white" alt="Official Website"/>
</a>
</p>
---
# 🤖 Scrapeless LLM Scraper OpenClaw Skill
A skill for the Scrapeless platform that allows you to scrape AI chat conversations from various LLM models via the Scrapeless API. It supports ChatGPT, Gemini, Perplexity, Copilot, Google AI Mode, and Grok.
## 🚀 Overview
This OpenClaw skill integrates [Scrapeless LLM Scrapers](https://docs.scrapeless.com/en/llm-chat-scraper/quickstart/introduction/) into any OpenClaw-compatible AI agent or LLM pipeline. It collects structured AI chat responses from Gemini, Perplexity, Google AI Mode, ChatGPT, Grok and Copilot.
Built for **GEO / AI SEO, AI search monitoring, LLM benchmarking, brand & market intelligence, AI content platforms, and automation agents**, this skill helps developers, AI labs, and enterprise teams gather **high-quality structured LLM outputs** for research, analytics, or workflow automation.
⭐ If you find this project useful, please give it a star!
## Features
- **Major LLM support**: ChatGPT, Gemini, Perplexity, Copilot, Google AI Mode, and Grok.
- **High concurrency & reliability:** Supports hundreds of concurrent tasks with error rates <10%.
- **Adaptive scraping:** Automatically adjusts strategies to minimize blocking and maximize success.
- **Seamless workflow integration:** Works with OpenClaw, Cursor, Trae and other AI agent platforms.
- **Structured output:** JSON responses with model info, citations, links, and optional web search enrichment.
- **Trial-ready:** Works with the Scrapeless API; an `X_API_TOKEN` (get it from https://www.scrapeless.com/) is required for authentication.
- **Real-time scraping**
- **Support for web search enrichment**
- **Country-based proxy selection**
## Use Cases
This OpenClaw skill is designed for teams working on:
| Use Case | Description |
|------|-----|
| AI SEO/GEO (Generative Engine Optimization) | Monitor how brands appear in AI-generated answers. |
| AI Search Monitoring | Track responses from ChatGPT / Perplexity / Gemini. |
| LLM Response Analysis | Analyze answer structure, citations and model behavior. |
| Competitive Intelligence | Compare how different LLM platforms respond to the same query. |
## Installation
1. Clone the repository:
```bash
git clone https://github.com/scrapeless-ai/llm-chat-scraper-skill.git
```
2. Install dependencies for the LLM Chat Scraper:
```bash
cd llm-chat-scraper-skill
pip install -r requirements.txt
```
## Environment Configuration
1. Manual installation: Place the skill in OpenClaw’s `.openclaw/skills` directory.
2. Create a `.env` file in the root directory based on the `.env.example` file:
```bash
cp .env.example .env
```
3. Add your Scrapeless API token to the `.env` file:
```
X_API_TOKEN=your_api_token_here
```
You can obtain an API token from the [Scrapeless website](https://www.scrapeless.com).
## Supported Models
**ChatGPT**
- Returns markdown responses with model info, web search results, links, and citations
- Supports web search enrichment
**Gemini**
- Returns markdown responses with citations
- Note: Japan (JP) and Taiwan (TW) are not supported
**Perplexity**
- Returns markdown responses with related prompts, web results, and media items
- Supports web search enrichment
**Copilot**
- Returns markdown responses with mode information, links, and citations
- Supports different modes: search, smart, chat, reasoning, study
- Note: Japan (JP) and Taiwan (TW) are not supported
**Google AI Mode**
- Returns answer body in multiple formats with citations
- Note: Japan (JP) and Taiwan (TW) are not supported
**Grok**
- Returns full responses with model info, follow-up suggestions, and web search results
- Supports different modes: FAST, EXPERT, AUTO
- Note: Japan (JP) and Taiwan (TW) are not supported
## Usage Examples
```bash
# Scrape ChatGPT with web search
python3 scripts/llm_chat_scraper.py chatgpt --query "AI trends in 2024" --web-search
# Scrape Gemini with UK proxy
python3 scripts/llm_chat_scraper.py gemini --query "Best restaurants in London" --country GB
# Scrape Perplexity
python3 scripts/llm_chat_scraper.py perplexity --query "Latest tech news"
# Scrape Copilot in reasoning mode
python3 scripts/llm_chat_scraper.py copilot --query "Explain quantum computing" --mode reasoning
# Scrape Google AI Mode
python3 scripts/llm_chat_scraper.py aimode --query "Climate change solutions"
# Scrape Grok in expert mode
python3 scripts/llm_chat_scraper.py grok --query "What's happening in AI" --mode MODEL_MODE_EXPERT
```
### ⚙️ Optional Parameters
- `--country`: proxy location
- `--web-search`: enable web search (ChatGPT, Perplexity)
- `--mode`: operation mode (Copilot, Grok)
## Output Structure
| Model | Key Fields | Description |
| -------------- | ------------------------- | ---------------------- |
| ChatGPT | `result_text`, `model`, `web_search`, `links`, `citations` | Returns markdown responses with model info, web search results, links, and citations. |
| Gemini | `result_text`, `citations` | Returns markdown responses with integrated citations. |
| Perplexity | `result_text`, `related_prompt`, `web_results`, `media_items` | Enriched responses with related prompts, web results, and media items. |
| Copilot | `result_text`, `mode`, `links`, `citations` | Supports search/smart/chat/reasoning/study modes with links and citations. |
| Google AI Mode | `result_text`, `result_md`, `result_html`, `citations`, `raw_url` | Returns full structured answer body in multiple formats with citations. |
| Grok | `full_response`, `user_model`, `follow_up_suggestions`, `web_search_results` | Full responses for FAST/EXPERT/AUTO modes with follow-up suggestions and search results. |
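The script prints a JSON envelope with `success`, `data`, `status`, and `task_id` keys, or `success`, `error`, and `message` on failure. A minimal sketch of reading the main answer out of that envelope follows. It assumes the fields from the table sit at the top level of `data`, which the table does not confirm. The function name `extract_answer` and the sample payload are hypothetical.

```python
import json
from typing import Any, Dict


def extract_answer(cli_output: str) -> Dict[str, Any]:
    """Pull the answer text and citations out of the script's JSON output."""
    payload = json.loads(cli_output)
    if not payload.get("success"):
        # Failure envelopes carry the exception type and message.
        raise RuntimeError(f"{payload.get('error')}: {payload.get('message')}")
    data = payload.get("data") or {}
    # Grok reports its text as `full_response`; the other models use `result_text`.
    # Assumption: these fields are top-level keys of `data`.
    text = data.get("result_text", data.get("full_response"))
    return {
        "task_id": payload.get("task_id"),
        "text": text,
        "citations": data.get("citations", []),
    }


# Hypothetical sample payload, shaped like the script's success envelope.
SAMPLE_OUTPUT = json.dumps({
    "success": True,
    "status": "success",
    "task_id": "example-task-id",
    "data": {"result_text": "Hello", "citations": ["https://example.com"]},
})

if __name__ == "__main__":
    print(extract_answer(SAMPLE_OUTPUT))
```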
## Common Issues
**Rate Limits**
If you encounter 429 errors, you've exceeded the rate limit. Reduce request frequency or upgrade your Scrapeless plan.
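One way to reduce request frequency is to retry with exponential backoff. The sketch below is not part of the bundled script. It assumes the wrapped call raises `RuntimeError("Rate limit exceeded")` on a 429, as `LLMChatScraper.create_task` does. The helper name `with_backoff` and its default delays are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on rate-limit errors, doubling the delay after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RuntimeError as exc:
            is_rate_limit = "Rate limit" in str(exc)
            is_last_attempt = attempt == max_attempts - 1
            # Re-raise anything that is not a rate limit, or the final failure.
            if not is_rate_limit or is_last_attempt:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

For example, `with_backoff(lambda: scraper.create_task(actor='scraper.chatgpt', prompt='...', country='US'))` would retry task creation only. Note that `execute` catches exceptions and returns an error dict instead, so it would need a different check.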
**Regional Restrictions**
Gemini, Copilot, Google AI Mode, and Grok do not support Japan (JP) or Taiwan (TW). Use a different `--country` value for these models.
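A request can be checked against this restriction before it is sent. The following is a minimal sketch. The function name `is_country_supported` and the constants are hypothetical, and the model names are the script's subcommand names.

```python
# Models that do not accept JP or TW, per the restriction described above.
RESTRICTED_MODELS = {"gemini", "copilot", "aimode", "grok"}
UNSUPPORTED_COUNTRIES = {"JP", "TW"}


def is_country_supported(model: str, country: str) -> bool:
    """Return False only for a restricted model paired with JP or TW."""
    restricted = model.lower() in RESTRICTED_MODELS
    blocked = country.upper() in UNSUPPORTED_COUNTRIES
    return not (restricted and blocked)


if __name__ == "__main__":
    for model, country in [("gemini", "JP"), ("chatgpt", "JP"), ("grok", "US")]:
        print(model, country, is_country_supported(model, country))
```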
**Result Expiry**
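Task results are available for 12 hours. After that, the bundled script raises `Task result has expired` when the result endpoint returns HTTP 410.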
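If task IDs are stored for later retrieval, the 12-hour window can be checked locally before polling. The sketch below assumes the window starts when the task is created, which the documentation does not state. The name `result_expired` is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Retention window for task results, per the note above.
RESULT_TTL = timedelta(hours=12)


def result_expired(created_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if more than 12 hours have passed since `created_at`.

    Assumption: the retention window is measured from task creation.
    Both datetimes must be timezone-aware.
    """
    current = now or datetime.now(timezone.utc)
    return current - created_at > RESULT_TTL


if __name__ == "__main__":
    created = datetime.now(timezone.utc) - timedelta(hours=13)
    print(result_expired(created))
```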
## Related Resources
- [Scrapeless LLM Scraper](https://docs.scrapeless.com/en/llm-chat-scraper/quickstart/introduction/)
- [Scrapeless Universal Scraping API](https://docs.scrapeless.com/en/universal-scraping-api/)
## Contact Us
For questions, suggestions, or collaboration inquiries, feel free to contact us via:
- Email/Slack: [email protected]
- Official Website: https://www.scrapeless.com
- Community Forum: [Browser Labs Discord](https://discord.com/invite/xBcTfGPjCQ)
FILE:scripts/llm_chat_scraper.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
LLM Chat Scraper Tool for OpenClaw
Scrape AI chat conversations from ChatGPT, Gemini, Perplexity, Copilot, Google AI Mode, and Grok.
"""

import os
import sys
import time
import json
import argparse
import logging
from typing import Optional, Dict, Any

from dotenv import load_dotenv

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Load environment variables from .env file
load_dotenv()

try:
    import requests
except ImportError:
    print(json.dumps({
        'error': 'Missing dependency',
        'message': 'requests library not found. Please install it: pip install requests'
    }), file=sys.stderr)
    sys.exit(1)
class LLMChatScraper:
    """LLM Chat Scraper for Scrapeless API"""

    SUPPORTED_ACTORS = [
        'scraper.chatgpt',
        'scraper.gemini',
        'scraper.perplexity',
        'scraper.copilot',
        'scraper.aimode',
        'scraper.grok'
    ]

    def __init__(self, api_token: Optional[str] = None):
        self.api_token = api_token or os.getenv('X_API_TOKEN')
        if not self.api_token:
            raise ValueError(
                "API token not found. Please set X_API_TOKEN in .env file or environment variable."
            )
        self.api_base_url = 'https://api.scrapeless.com'
        self.request_endpoint = '/api/v2/scraper/request'
        self.result_endpoint = '/api/v2/scraper/result'
        self.headers = {
            'Content-Type': 'application/json',
            'x-api-token': self.api_token
        }
        logger.info(f"LLMChatScraper initialized with API base URL: {self.api_base_url}")

    def _validate_actor(self, actor: str) -> bool:
        return actor in self.SUPPORTED_ACTORS
    def _build_request_payload(
        self,
        actor: str,
        prompt: str,
        country: str,
        web_search: bool = False,
        mode: Optional[str] = None
    ) -> Dict[str, Any]:
        input_params = {
            'prompt': prompt,
            'country': country
        }
        # web_search applies to ChatGPT and Perplexity only
        if actor in ['scraper.chatgpt', 'scraper.perplexity']:
            input_params['web_search'] = web_search
        # mode applies to Copilot and Grok, each with its own default
        if actor == 'scraper.copilot':
            input_params['mode'] = mode or 'search'
        if actor == 'scraper.grok':
            input_params['mode'] = mode or 'MODEL_MODE_AUTO'
        return {
            'actor': actor,
            'input': input_params,
            'webhook': {'url': ''}
        }
    def create_task(
        self,
        actor: str,
        prompt: str,
        country: str,
        web_search: bool = False,
        mode: Optional[str] = None
    ) -> str:
        if not self._validate_actor(actor):
            raise ValueError(f"Unsupported actor: {actor}. Supported: {', '.join(self.SUPPORTED_ACTORS)}")
        logger.info(f"Creating task for {actor} with prompt: {prompt[:50]}...")
        payload = self._build_request_payload(
            actor=actor,
            prompt=prompt,
            country=country,
            web_search=web_search,
            mode=mode
        )
        response = requests.post(
            f"{self.api_base_url}{self.request_endpoint}",
            headers=self.headers,
            json=payload,
            timeout=30
        )
        if response.status_code == 201:
            result = response.json()
            task_id = result.get('task_id')
            if not task_id:
                raise RuntimeError("No task_id in response")
            logger.info(f"Task created successfully with ID: {task_id}")
            return task_id
        elif response.status_code == 400:
            raise ValueError(f"Invalid request: {response.json()}")
        elif response.status_code == 429:
            raise RuntimeError("Rate limit exceeded")
        else:
            raise RuntimeError(f"Unexpected status code {response.status_code}: {response.text}")
    def get_task_result(self, task_id: str) -> Optional[Dict[str, Any]]:
        response = requests.get(
            f"{self.api_base_url}{self.result_endpoint}/{task_id}",
            headers=self.headers,
            timeout=30
        )
        if response.status_code == 200:
            result = response.json()
            status = result.get('status')
            if status == 'success':
                return result
            elif status in ['pending', 'running']:
                return None
            elif status == 'failed':
                raise RuntimeError(f"Task failed: {result.get('message', 'Unknown error')}")
        elif response.status_code == 202:
            # Task accepted but not finished yet
            return None
        elif response.status_code == 410:
            raise RuntimeError("Task result has expired")
        elif response.status_code == 404:
            raise RuntimeError(f"Task not found: {task_id}")
        else:
            raise RuntimeError(f"Unexpected status code {response.status_code}")
        # 200 response with an unrecognized status: treat as not ready
        return None
    def poll_for_result(
        self,
        task_id: str,
        interval: int = 10,
        max_retries: int = 30
    ) -> Dict[str, Any]:
        logger.info(f"Polling for task {task_id} with {max_retries} max retries")
        for attempt in range(1, max_retries + 1):
            logger.debug(f"Attempt {attempt}/{max_retries} for task {task_id}")
            result = self.get_task_result(task_id)
            if result is not None:
                logger.info(f"Task {task_id} completed successfully")
                return {
                    'success': True,
                    'data': result.get('task_result', {}),
                    'status': result.get('status', 'success'),
                    'task_id': task_id
                }
            if attempt < max_retries:
                logger.debug(f"Task {task_id} still pending, waiting {interval} seconds")
                time.sleep(interval)
        raise TimeoutError(
            f"Task did not complete after {max_retries} attempts. "
            f"You can manually retrieve the result using:\n"
            f"curl --request GET '{self.api_base_url}{self.result_endpoint}/{task_id}' \\\n"
            f"  --header 'Content-Type: application/json' \\\n"
            f"  --header 'x-api-token: REDACTED'"
        )
    def execute(
        self,
        actor: str,
        prompt: str,
        country: str = 'US',
        web_search: bool = False,
        mode: Optional[str] = None,
        poll_interval: int = 10,
        max_retries: int = 30
    ) -> Dict[str, Any]:
        try:
            task_id = self.create_task(
                actor=actor,
                prompt=prompt,
                country=country,
                web_search=web_search,
                mode=mode
            )
            return self.poll_for_result(
                task_id=task_id,
                interval=poll_interval,
                max_retries=max_retries
            )
        except Exception as e:
            # Report failures as a result dict instead of raising
            return {
                'success': False,
                'error': type(e).__name__,
                'message': str(e)
            }
def main():
    parser = argparse.ArgumentParser(
        description='LLM Chat Scraper Tool',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
python3 llm_chat_scraper.py chatgpt --query "AI trends"
python3 llm_chat_scraper.py gemini --query "Best restaurants" --country US
python3 llm_chat_scraper.py perplexity --query "Latest news" --web-search
python3 llm_chat_scraper.py copilot --query "Explain ML" --mode reasoning
python3 llm_chat_scraper.py aimode --query "Programming tips"
python3 llm_chat_scraper.py grok --query "Quantum physics" --mode MODEL_MODE_EXPERT
"""
    )
    subparsers = parser.add_subparsers(dest='command', help='Available commands')

    # ChatGPT command
    chatgpt_parser = subparsers.add_parser('chatgpt', help='Scrape ChatGPT responses')
    chatgpt_parser.add_argument('--query', required=True, help='Prompt to send to ChatGPT')
    chatgpt_parser.add_argument('--country', default='US', help='Country code (default: US)')
    chatgpt_parser.add_argument('--web-search', action='store_true', help='Enable web search')
    chatgpt_parser.add_argument('--poll-interval', type=int, default=10, help='Polling interval in seconds')
    chatgpt_parser.add_argument('--max-retries', type=int, default=30, help='Maximum retries')

    # Gemini command
    gemini_parser = subparsers.add_parser('gemini', help='Scrape Gemini responses')
    gemini_parser.add_argument('--query', required=True, help='Prompt to send to Gemini')
    gemini_parser.add_argument('--country', default='US', help='Country code (default: US)')
    gemini_parser.add_argument('--poll-interval', type=int, default=10, help='Polling interval in seconds')
    gemini_parser.add_argument('--max-retries', type=int, default=30, help='Maximum retries')

    # Perplexity command
    perplexity_parser = subparsers.add_parser('perplexity', help='Scrape Perplexity responses')
    perplexity_parser.add_argument('--query', required=True, help='Prompt to send to Perplexity')
    perplexity_parser.add_argument('--country', default='US', help='Country code (default: US)')
    perplexity_parser.add_argument('--web-search', action='store_true', help='Enable web search')
    perplexity_parser.add_argument('--poll-interval', type=int, default=10, help='Polling interval in seconds')
    perplexity_parser.add_argument('--max-retries', type=int, default=30, help='Maximum retries')

    # Copilot command
    copilot_parser = subparsers.add_parser('copilot', help='Scrape Copilot responses')
    copilot_parser.add_argument('--query', required=True, help='Prompt to send to Copilot')
    copilot_parser.add_argument('--country', default='US', help='Country code (default: US)')
    copilot_parser.add_argument('--mode', default='search', choices=['search', 'smart', 'chat', 'reasoning', 'study'], help='Mode (default: search)')
    copilot_parser.add_argument('--poll-interval', type=int, default=10, help='Polling interval in seconds')
    copilot_parser.add_argument('--max-retries', type=int, default=30, help='Maximum retries')

    # Google AI Mode command
    aimode_parser = subparsers.add_parser('aimode', help='Scrape Google AI Mode responses')
    aimode_parser.add_argument('--query', required=True, help='Prompt to send to Google AI Mode')
    aimode_parser.add_argument('--country', default='US', help='Country code (default: US)')
    aimode_parser.add_argument('--poll-interval', type=int, default=10, help='Polling interval in seconds')
    aimode_parser.add_argument('--max-retries', type=int, default=30, help='Maximum retries')

    # Grok command
    grok_parser = subparsers.add_parser('grok', help='Scrape Grok responses')
    grok_parser.add_argument('--query', required=True, help='Prompt to send to Grok')
    grok_parser.add_argument('--country', default='US', help='Country code (default: US)')
    grok_parser.add_argument('--mode', default='MODEL_MODE_AUTO', choices=['MODEL_MODE_FAST', 'MODEL_MODE_EXPERT', 'MODEL_MODE_AUTO'], help='Mode (default: MODEL_MODE_AUTO)')
    grok_parser.add_argument('--poll-interval', type=int, default=10, help='Polling interval in seconds')
    grok_parser.add_argument('--max-retries', type=int, default=30, help='Maximum retries')

    args = parser.parse_args()
    if not args.command:
        parser.print_help()
        sys.exit(1)

    try:
        scraper = LLMChatScraper()
        if args.command == 'chatgpt':
            result = scraper.execute(
                actor='scraper.chatgpt',
                prompt=args.query,
                country=args.country,
                web_search=args.web_search,
                poll_interval=args.poll_interval,
                max_retries=args.max_retries
            )
        elif args.command == 'gemini':
            result = scraper.execute(
                actor='scraper.gemini',
                prompt=args.query,
                country=args.country,
                poll_interval=args.poll_interval,
                max_retries=args.max_retries
            )
        elif args.command == 'perplexity':
            result = scraper.execute(
                actor='scraper.perplexity',
                prompt=args.query,
                country=args.country,
                web_search=args.web_search,
                poll_interval=args.poll_interval,
                max_retries=args.max_retries
            )
        elif args.command == 'copilot':
            result = scraper.execute(
                actor='scraper.copilot',
                prompt=args.query,
                country=args.country,
                mode=args.mode,
                poll_interval=args.poll_interval,
                max_retries=args.max_retries
            )
        elif args.command == 'aimode':
            result = scraper.execute(
                actor='scraper.aimode',
                prompt=args.query,
                country=args.country,
                poll_interval=args.poll_interval,
                max_retries=args.max_retries
            )
        elif args.command == 'grok':
            result = scraper.execute(
                actor='scraper.grok',
                prompt=args.query,
                country=args.country,
                mode=args.mode,
                poll_interval=args.poll_interval,
                max_retries=args.max_retries
            )

        # Output result as JSON
        print(json.dumps(result, indent=2, ensure_ascii=False))

        # Exit with error code if failed
        if not result.get('success'):
            sys.exit(1)
    except Exception as e:
        print(json.dumps({
            'success': False,
            'error': type(e).__name__,
            'message': str(e)
        }, indent=2), file=sys.stderr)
        sys.exit(1)
def scrape_llm_chat(
    prompt: str,
    actor: str = 'scraper.chatgpt',
    country: str = 'US',
    web_search: bool = False,
    mode: Optional[str] = None,
    poll_interval: int = 10,
    max_retries: int = 30,
    api_token: Optional[str] = None
) -> Dict[str, Any]:
    """Scrape LLM chat responses from various AI models

    Args:
        prompt: Prompt to send to the AI
        actor: Scraper to use (e.g., 'scraper.gemini')
        country: Country code (default: US)
        web_search: Whether to enable web search (default: False)
        mode: Special mode for certain scrapers (e.g., 'search' for copilot)
        poll_interval: Polling interval in seconds (default: 10)
        max_retries: Maximum retries (default: 30)
        api_token: API token for Scrapeless API

    Returns:
        Dict with success status and result data
    """
    scraper = LLMChatScraper(api_token=api_token)
    return scraper.execute(
        actor=actor,
        prompt=prompt,
        country=country,
        web_search=web_search,
        mode=mode,
        poll_interval=poll_interval,
        max_retries=max_retries
    )


if __name__ == '__main__':
    main()