ucsdzehualiu

@clawhub-ucsdzehualiu-001da531f9

free-web-search-js

Skill

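A Playwright-driven web search tool that automatically fetches the content of the top three result pages. No API key is required; it searches Bing from mainland China and DDG from overseas.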

# SKILL.md

---
name: free-web-search-js
description: Web search via Playwright with automatic content fetching, no API key
version: 28.0.0
trigger_keywords:
  - 搜索
  - 查一下
  - 找一下
  - 最新消息
  - 新闻
  - 教程
  - 是什么
  - search
  - find
tools:
  - name: search
    description: Search plus automatic fetching; Bing via Playwright in mainland China, DDG via HTTP overseas
    script: scripts/search.js
    parameters:
      query:
        type: string
        description: "Search keywords"
        required: true
      max:
        type: integer
        description: "Maximum number of results; default 10, capped at 30"
        required: false
      region:
        type: string
        description: "Region: auto/cn/intl; default auto, detected from the IP"
        required: false
  - name: fetch
    description: Fetch the main text of the given URLs; tries HTTP first and falls back to a headed browser on failure
    script: scripts/fetch.js
    parameters:
      urls:
        type: string
        description: "URLs to fetch; separate multiple URLs with spaces"
        required: true
      max-len:
        type: integer
        description: "Maximum characters per page; default 12000"
        required: false
---

# free-web-search-js

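One step: **search** → Playwright search → automatic content fetch → return

## Architecture

```
Mainland China:
  Playwright opens Bing → gets cookies from the home page → submits via the search box
  → automatically fetches the content of the top 3 pages
  Latency: 3~6s on first run (browser startup), faster on later reuse

Overseas:
  Plain HTTP → DDG HTML parsing
  → automatically fetches the content of the top 3 pages
  Latency: a few hundred ms to 1s
```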

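## Search engines

| Engine | Protocol | Region | Notes |
|------|------|------|------|
| Bing CN | Playwright search-box submission | Mainland China | Visits the home page first to obtain cookies, then types the query into the search box and submits it |
| Sogou | Plain HTTP | Mainland China | Optional via `--engine=sogou`; ⚠ without cookies it is easily blocked by anti-scraping measures, so results are unstable |
| DDG HTML Lite | Plain HTTP | Overseas | html.duckduckgo.com |

### Strategy

| Region | Search | Fetch |
|------|------|------|
| Mainland China | Bing CN (Playwright) | Top 3 results fetched automatically |
| Overseas | DDG HTML | Top 3 results fetched automatically |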

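### How the IP region is detected

The region is detected automatically on every search. The three probe rounds run in parallel, and whichever succeeds first is used:

| Round | Probe service | Logic |
|------|---------|------|
| Round 1 | `myip.ipip.net` / `cip.cc` | Preferred; reachable from mainland China |
| Round 2 | `ipinfo.io` / `ipapi.co` | International probes |
| Round 3 | Trial connection to `cn.bing.com` | If reachable, most likely in mainland China |
| Fallback | - | Defaults to mainland China |

Detection may be wrong when the egress IP goes through a proxy; set the region manually with `--region=cn` or `--region=intl`.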
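The "run in parallel, first success wins, default to mainland China" rule can be sketched with `Promise.any`. This is only an illustration under stated assumptions: `withTimeout`, `detectRegion`, the 3000 ms timeout, and the shape of the probe functions (each returns a promise resolving to `'cn'` or `'intl'`) are hypothetical and are not taken from `search.js`.

```javascript
// Hypothetical sketch: race several region probes; the first success wins.
// The probes passed in are placeholders, not the skill's real probes.

/** Wrap a probe so that it rejects if it has not settled within `ms`. */
function withTimeout(probe, ms) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error('probe timed out')), ms);
    probe().then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

/**
 * Run all probes in parallel and return the first region that resolves.
 * If every probe fails, fall back to 'cn', matching the documented default.
 */
async function detectRegion(probes, timeoutMs = 3000) {
  try {
    return await Promise.any(probes.map((p) => withTimeout(p, timeoutMs)));
  } catch {
    return 'cn';
  }
}
```

`Promise.any` ignores rejected probes as long as at least one fulfils, so a single blocked service does not delay the result.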

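## Deduplication

Smart deduplication keys on the domain plus the path stem, ignoring `www`/`m` subdomains, tracking parameters, trailing slashes, and the `.html` suffix.

Bing redirect URLs (`bing.com/ck/`) are decoded into direct links automatically.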
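A minimal sketch of the deduplication key described above, built on the standard `URL` API. The helper names `dedupKey` and `dedupe`, the list of tracking parameters, and the choice to keep non-tracking query parameters in the key are all assumptions made for illustration; the skill's actual rules may differ.

```javascript
// Hypothetical sketch of a dedup key: host (minus www./m.) + path stem.
// Assumed tracking-parameter list; the real list is not shown in the source.
const TRACKING_PARAM = /^(utm_\w+|spm|fbclid|gclid)$/i;

function dedupKey(rawUrl) {
  const u = new URL(rawUrl);
  const host = u.hostname.toLowerCase().replace(/^(www|m)\./, '');
  const path = u.pathname
    .replace(/\/+$/, '')       // drop trailing slash(es)
    .replace(/\.html?$/i, ''); // drop .html / .htm suffix
  // Keep only non-tracking query parameters, sorted for a stable key.
  const params = [...u.searchParams.entries()]
    .filter(([k]) => !TRACKING_PARAM.test(k))
    .sort(([a], [b]) => a.localeCompare(b))
    .map(([k, v]) => `${k}=${v}`)
    .join('&');
  return host + path + (params ? `?${params}` : '');
}

/** Keep the first result for each key; drop results whose URL cannot be parsed. */
function dedupe(results) {
  const seen = new Set();
  return results.filter((r) => {
    let key;
    try { key = dedupKey(r.url); } catch { return false; }
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

The scheme is deliberately left out of the key, so `http://` and `https://` variants of the same page collapse into one entry.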

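## Fetch modes

After a search, the content of the top N URLs is fetched automatically (3 by default).

| Tier | Method | Speed | Notes |
|------|------|------|------|
| Tier 1 | Lightweight HTTP + cheerio | ⚡ Near-instant | Does not launch a browser |
| Tier 2 | Playwright headed | 🟡 Slow | Full browser with JS rendering |

Tier 1 enhancements:
- **JSON API responses**: detects the Content-Type automatically and extracts structured content
- **JSON-LD**: extracts articleBody/description from `<script type="application/ld+json">`
- **`__NEXT_DATA__`**: extracts data embedded by Next.js
- **meta tags**: falls back to og:description / description
- **GBK encoding**: detected and converted automatically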
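The two-tier order (lightweight HTTP first, headed browser only when that fails) can be sketched as a generic tier runner. `fetchWithFallback` and the `{ name, run }` tier shape are hypothetical names for illustration; the `{ status, content, error }` result shape mirrors what `fetchLightweight` in `scripts/fetch.js` returns, but how the script actually wires its tiers together is an assumption here.

```javascript
// Hypothetical sketch: try each fetch tier in order and stop at the first
// one that returns usable content.
async function fetchWithFallback(url, tiers, minLen = 1) {
  const errors = [];
  for (const { name, run } of tiers) {
    try {
      const res = await run(url);
      if (res && typeof res.content === 'string' && res.content.length >= minLen) {
        return { ...res, tier: name }; // record which tier succeeded
      }
      errors.push(`${name}: ${res?.error || 'empty content'}`);
    } catch (e) {
      errors.push(`${name}: ${e.message}`);
    }
  }
  // Every tier failed: report all reasons in one error string.
  return { status: 0, content: '', error: errors.join('; ') };
}
```

A caller might pass `[{ name: 'http', run: (u) => fetchLightweight(u, 12000) }, { name: 'headed', run: (u) => fetchHeaded(u, 12000) }]`, so the browser is only started for pages the HTTP tier cannot read.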

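## Installation

**Prerequisites (all required):**

| Dependency | Purpose | Size / time |
|------|------|----------|
| Node.js >= 18 | Runtime | - |
| cheerio | HTML parsing | Small, installs in seconds |
| commander | CLI argument parsing | Small, installs in seconds |
| iconv-lite | GBK encoding conversion | Small, installs in seconds |
| playwright | Browser automation (Bing search + fetch fallback) | ~50MB |
| Chromium | Browser used by Playwright | **~150MB, takes a few minutes to download** |

The setup script detects the network region automatically and uses mirror registries in mainland China for faster downloads:

```bash
# Windows
powershell -File scripts/setup.ps1

# Linux/macOS
bash scripts/setup.sh
```

Mirrors for mainland China:
- npm: `https://registry.npmmirror.com`
- Playwright/Chromium: `https://npmmirror.com/mirrors/playwright`

Manual installation:
```bash
cd skills/free-web-search-js
npm install
npx playwright install chromium    # ~150MB, takes a few minutes
```

Verify the environment: `node scripts/check-env.js`

Uninstall: `node scripts/uninstall.js`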

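## Performance: browser daemon

Search and fetch can reuse a browser daemon, which makes them **about 70% faster**:

```bash
node scripts/browser-daemon.js &       # start
node scripts/browser-daemon.js --status # status
node scripts/browser-daemon.js --stop   # stop
```

The daemon exits automatically after 10 idle minutes.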
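One common way to implement an idle shutdown like this is a resettable timer. The sketch below is an assumption-laden illustration, not code from `browser-daemon.js`: `createIdleTimer`, `touch`, and `cancel` are hypothetical names, and how the daemon would detect client activity is left open.

```javascript
// Hypothetical sketch of an idle shutdown timer (not taken from browser-daemon.js).
function createIdleTimer(idleMs, onIdle) {
  let timer = null;
  // Call touch() whenever there is activity; it restarts the countdown.
  const touch = () => {
    if (timer) clearTimeout(timer);
    timer = setTimeout(onIdle, idleMs);
  };
  // Call cancel() to stop the countdown entirely (e.g. on manual shutdown).
  const cancel = () => {
    if (timer) clearTimeout(timer);
    timer = null;
  };
  touch(); // start counting as soon as the timer is created
  return { touch, cancel };
}

// Possible use inside a daemon (names assumed):
// const idle = createIdleTimer(10 * 60 * 1000, async () => {
//   await server.close();
//   process.exit(0);
// });
```

The countdown only restarts on `touch()`, so the daemon would need some activity signal, for example each client connection, to call it.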

## Usage

```bash
# Search (search + automatically fetch the content of the top 3 results)
node scripts/search.js "白银价格"               # query means "silver price"
node scripts/search.js "how to deploy docker" --max=5
node scripts/search.js "xxx" --region=cn
node scripts/search.js "xxx" --fetch=5          # fetch the top 5
node scripts/search.js "xxx" --no-fetch         # search only, no fetching

# Standalone fetch (given URLs)
node scripts/fetch.js "https://example.com/page1" "https://example.com/page2"
```

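## Known limitations

- **The first search in mainland China is slow**: Chromium has to start (3~6s); later searches reuse it and are faster
- **Bing CN instant answers are not returned**: instant cards such as weather and calculator are not rendered as `li.b_algo`, so such queries return 0 results
- **Sogou over HTTP is unstable**: cookie-less plain requests are easily blocked by anti-scraping measures, so results may be empty (use `--engine=sogou` with caution)
- **Some sites cannot be fetched over HTTP**: pages that need JS rendering; when HTTP fails, the fetch is retried automatically in a headed browser
- **Some sites are unreachable from overseas**: sites that serve only mainland China may time out when accessed from abroad
- **Proxies interfere with IP detection**: the region may be misdetected when the egress IP goes through a proxy; set it manually with `--region=cn/intl`
- **Overseas engines are unreachable from mainland China**: DDG is blocked there, so the mainland China strategy does not use it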

FILE:package.json
{
  "name": "free-web-search-js",
  "version": "28.0.0",
  "type": "module",
  "description": "Web search via Playwright; Bing/Sogou in mainland China, DDG overseas; automatic fetching, no API key",
  "scripts": {
    "search": "node scripts/search.js",
    "fetch": "node scripts/fetch.js"
  },
  "dependencies": {
    "cheerio": "^1.0.0",
    "commander": "^12.0.0",
    "iconv-lite": "^0.6.3",
    "playwright": "^1.52.0"
  }
}

FILE:package-lock.json
{
  "name": "free-web-search",
  "version": "15.0.0",
  "lockfileVersion": 3,
  "requires": true,
  "packages": {
    "": {
      "name": "free-web-search",
      "version": "15.0.0",
      "dependencies": {
        "cheerio": "^1.0.0",
        "commander": "^12.0.0",
        "playwright": "^1.59.1"
      },
      "optionalDependencies": {
        "playwright": "^1.59.1"
      }
    },
    "node_modules/boolbase": {
      "version": "1.0.0",
      "resolved": "https://registry.npmmirror.com/boolbase/-/boolbase-1.0.0.tgz",
      "integrity": "sha512-JZOSA7Mo9sNGB8+UjSgzdLtokWAky1zbztM3WRLCbZ70/3cTANmQmOdR7y2g+J0e2WXywy1yS468tY+IruqEww==",
      "license": "ISC"
    },
    "node_modules/cheerio": {
      "version": "1.2.0",
      "resolved": "https://registry.npmmirror.com/cheerio/-/cheerio-1.2.0.tgz",
      "integrity": "sha512-WDrybc/gKFpTYQutKIK6UvfcuxijIZfMfXaYm8NMsPQxSYvf+13fXUJ4rztGGbJcBQ/GF55gvrZ0Bc0bj/mqvg==",
      "license": "MIT",
      "dependencies": {
        "cheerio-select": "^2.1.0",
        "dom-serializer": "^2.0.0",
        "domhandler": "^5.0.3",
        "domutils": "^3.2.2",
        "encoding-sniffer": "^0.2.1",
        "htmlparser2": "^10.1.0",
        "parse5": "^7.3.0",
        "parse5-htmlparser2-tree-adapter": "^7.1.0",
        "parse5-parser-stream": "^7.1.2",
        "undici": "^7.19.0",
        "whatwg-mimetype": "^4.0.0"
      },
      "engines": {
        "node": ">=20.18.1"
      },
      "funding": {
        "url": "https://github.com/cheeriojs/cheerio?sponsor=1"
      }
    },
    "node_modules/cheerio-select": {
      "version": "2.1.0",
      "resolved": "https://registry.npmmirror.com/cheerio-select/-/cheerio-select-2.1.0.tgz",
      "integrity": "sha512-9v9kG0LvzrlcungtnJtpGNxY+fzECQKhK4EGJX2vByejiMX84MFNQw4UxPJl3bFbTMw+Dfs37XaIkCwTZfLh4g==",
      "license": "BSD-2-Clause",
      "dependencies": {
        "boolbase": "^1.0.0",
        "css-select": "^5.1.0",
        "css-what": "^6.1.0",
        "domelementtype": "^2.3.0",
        "domhandler": "^5.0.3",
        "domutils": "^3.0.1"
      },
      "funding": {
        "url": "https://github.com/sponsors/fb55"
      }
    },
    "node_modules/commander": {
      "version": "12.1.0",
      "resolved": "https://registry.npmmirror.com/commander/-/commander-12.1.0.tgz",
      "integrity": "sha512-Vw8qHK3bZM9y/P10u3Vib8o/DdkvA2OtPtZvD871QKjy74Wj1WSKFILMPRPSdUSx5RFK1arlJzEtA4PkFgnbuA==",
      "license": "MIT",
      "engines": {
        "node": ">=18"
      }
    },
    "node_modules/css-select": {
      "version": "5.2.2",
      "resolved": "https://registry.npmmirror.com/css-select/-/css-select-5.2.2.tgz",
      "integrity": "sha512-TizTzUddG/xYLA3NXodFM0fSbNizXjOKhqiQQwvhlspadZokn1KDy0NZFS0wuEubIYAV5/c1/lAr0TaaFXEXzw==",
      "license": "BSD-2-Clause",
      "dependencies": {
        "boolbase": "^1.0.0",
        "css-what": "^6.1.0",
        "domhandler": "^5.0.2",
        "domutils": "^3.0.1",
        "nth-check": "^2.0.1"
      },
      "funding": {
        "url": "https://github.com/sponsors/fb55"
      }
    },
    "node_modules/css-what": {
      "version": "6.2.2",
      "resolved": "https://registry.npmmirror.com/css-what/-/css-what-6.2.2.tgz",
      "integrity": "sha512-u/O3vwbptzhMs3L1fQE82ZSLHQQfto5gyZzwteVIEyeaY5Fc7R4dapF/BvRoSYFeqfBk4m0V1Vafq5Pjv25wvA==",
      "license": "BSD-2-Clause",
      "engines": {
        "node": ">= 6"
      },
      "funding": {
        "url": "https://github.com/sponsors/fb55"
      }
    },
    "node_modules/dom-serializer": {
      "version": "2.0.0",
      "resolved": "https://registry.npmmirror.com/dom-serializer/-/dom-serializer-2.0.0.tgz",
      "integrity": "sha512-wIkAryiqt/nV5EQKqQpo3SToSOV9J0DnbJqwK7Wv/Trc92zIAYZ4FlMu+JPFW1DfGFt81ZTCGgDEabffXeLyJg==",
      "license": "MIT",
      "dependencies": {
        "domelementtype": "^2.3.0",
        "domhandler": "^5.0.2",
        "entities": "^4.2.0"
      },
      "funding": {
        "url": "https://github.com/cheeriojs/dom-serializer?sponsor=1"
      }
    },
    "node_modules/domelementtype": {
      "version": "2.3.0",
      "resolved": "https://registry.npmmirror.com/domelementtype/-/domelementtype-2.3.0.tgz",
      "integrity": "sha512-OLETBj6w0OsagBwdXnPdN0cnMfF9opN69co+7ZrbfPGrdpPVNBUj02spi6B1N7wChLQiPn4CSH/zJvXw56gmHw==",
      "funding": [
        {
          "type": "github",
          "url": "https://github.com/sponsors/fb55"
        }
      ],
      "license": "BSD-2-Clause"
    },
    "node_modules/domhandler": {
      "version": "5.0.3",
      "resolved": "https://registry.npmmirror.com/domhandler/-/domhandler-5.0.3.tgz",
      "integrity": "sha512-cgwlv/1iFQiFnU96XXgROh8xTeetsnJiDsTc7TYCLFd9+/WNkIqPTxiM/8pSd8VIrhXGTf1Ny1q1hquVqDJB5w==",
      "license": "BSD-2-Clause",
      "dependencies": {
        "domelementtype": "^2.3.0"
      },
      "engines": {
        "node": ">= 4"
      },
      "funding": {
        "url": "https://github.com/fb55/domhandler?sponsor=1"
      }
    },
    "node_modules/domutils": {
      "version": "3.2.2",
      "resolved": "https://registry.npmmirror.com/domutils/-/domutils-3.2.2.tgz",
      "integrity": "sha512-6kZKyUajlDuqlHKVX1w7gyslj9MPIXzIFiz/rGu35uC1wMi+kMhQwGhl4lt9unC9Vb9INnY9Z3/ZA3+FhASLaw==",
      "license": "BSD-2-Clause",
      "dependencies": {
        "dom-serializer": "^2.0.0",
        "domelementtype": "^2.3.0",
        "domhandler": "^5.0.3"
      },
      "funding": {
        "url": "https://github.com/fb55/domutils?sponsor=1"
      }
    },
    "node_modules/encoding-sniffer": {
      "version": "0.2.1",
      "resolved": "https://registry.npmmirror.com/encoding-sniffer/-/encoding-sniffer-0.2.1.tgz",
      "integrity": "sha512-5gvq20T6vfpekVtqrYQsSCFZ1wEg5+wW0/QaZMWkFr6BqD3NfKs0rLCx4rrVlSWJeZb5NBJgVLswK/w2MWU+Gw==",
      "license": "MIT",
      "dependencies": {
        "iconv-lite": "^0.6.3",
        "whatwg-encoding": "^3.1.1"
      },
      "funding": {
        "url": "https://github.com/fb55/encoding-sniffer?sponsor=1"
      }
    },
    "node_modules/entities": {
      "version": "4.5.0",
      "resolved": "https://registry.npmmirror.com/entities/-/entities-4.5.0.tgz",
      "integrity": "sha512-V0hjH4dGPh9Ao5p0MoRY6BVqtwCjhz6vI5LT8AJ55H+4g9/4vbHx1I54fS0XuclLhDHArPQCiMjDxjaL8fPxhw==",
      "license": "BSD-2-Clause",
      "engines": {
        "node": ">=0.12"
      },
      "funding": {
        "url": "https://github.com/fb55/entities?sponsor=1"
      }
    },
    "node_modules/fsevents": {
      "version": "2.3.2",
      "resolved": "https://registry.npmmirror.com/fsevents/-/fsevents-2.3.2.tgz",
      "integrity": "sha512-xiqMQR4xAeHTuB9uWm+fFRcIOgKBMiOBP+eXiyT7jsgVCq1bkVygt00oASowB7EdtpOHaaPgKt812P9ab+DDKA==",
      "hasInstallScript": true,
      "license": "MIT",
      "optional": true,
      "os": [
        "darwin"
      ],
      "engines": {
        "node": "^8.16.0 || ^10.6.0 || >=11.0.0"
      }
    },
    "node_modules/htmlparser2": {
      "version": "10.1.0",
      "resolved": "https://registry.npmmirror.com/htmlparser2/-/htmlparser2-10.1.0.tgz",
      "integrity": "sha512-VTZkM9GWRAtEpveh7MSF6SjjrpNVNNVJfFup7xTY3UpFtm67foy9HDVXneLtFVt4pMz5kZtgNcvCniNFb1hlEQ==",
      "funding": [
        "https://github.com/fb55/htmlparser2?sponsor=1",
        {
          "type": "github",
          "url": "https://github.com/sponsors/fb55"
        }
      ],
      "license": "MIT",
      "dependencies": {
        "domelementtype": "^2.3.0",
        "domhandler": "^5.0.3",
        "domutils": "^3.2.2",
        "entities": "^7.0.1"
      }
    },
    "node_modules/htmlparser2/node_modules/entities": {
      "version": "7.0.1",
      "resolved": "https://registry.npmmirror.com/entities/-/entities-7.0.1.tgz",
      "integrity": "sha512-TWrgLOFUQTH994YUyl1yT4uyavY5nNB5muff+RtWaqNVCAK408b5ZnnbNAUEWLTCpum9w6arT70i1XdQ4UeOPA==",
      "license": "BSD-2-Clause",
      "engines": {
        "node": ">=0.12"
      },
      "funding": {
        "url": "https://github.com/fb55/entities?sponsor=1"
      }
    },
    "node_modules/iconv-lite": {
      "version": "0.6.3",
      "resolved": "https://registry.npmmirror.com/iconv-lite/-/iconv-lite-0.6.3.tgz",
      "integrity": "sha512-4fCk79wshMdzMp2rH06qWrJE4iolqLhCUH+OiuIgU++RB0+94NlDL81atO7GX55uUKueo0txHNtvEyI6D7WdMw==",
      "license": "MIT",
      "dependencies": {
        "safer-buffer": ">= 2.1.2 < 3.0.0"
      },
      "engines": {
        "node": ">=0.10.0"
      }
    },
    "node_modules/nth-check": {
      "version": "2.1.1",
      "resolved": "https://registry.npmmirror.com/nth-check/-/nth-check-2.1.1.tgz",
      "integrity": "sha512-lqjrjmaOoAnWfMmBPL+XNnynZh2+swxiX3WUE0s4yEHI6m+AwrK2UZOimIRl3X/4QctVqS8AiZjFqyOGrMXb/w==",
      "license": "BSD-2-Clause",
      "dependencies": {
        "boolbase": "^1.0.0"
      },
      "funding": {
        "url": "https://github.com/fb55/nth-check?sponsor=1"
      }
    },
    "node_modules/parse5": {
      "version": "7.3.0",
      "resolved": "https://registry.npmmirror.com/parse5/-/parse5-7.3.0.tgz",
      "integrity": "sha512-IInvU7fabl34qmi9gY8XOVxhYyMyuH2xUNpb2q8/Y+7552KlejkRvqvD19nMoUW/uQGGbqNpA6Tufu5FL5BZgw==",
      "license": "MIT",
      "dependencies": {
        "entities": "^6.0.0"
      },
      "funding": {
        "url": "https://github.com/inikulin/parse5?sponsor=1"
      }
    },
    "node_modules/parse5-htmlparser2-tree-adapter": {
      "version": "7.1.0",
      "resolved": "https://registry.npmmirror.com/parse5-htmlparser2-tree-adapter/-/parse5-htmlparser2-tree-adapter-7.1.0.tgz",
      "integrity": "sha512-ruw5xyKs6lrpo9x9rCZqZZnIUntICjQAd0Wsmp396Ul9lN/h+ifgVV1x1gZHi8euej6wTfpqX8j+BFQxF0NS/g==",
      "license": "MIT",
      "dependencies": {
        "domhandler": "^5.0.3",
        "parse5": "^7.0.0"
      },
      "funding": {
        "url": "https://github.com/inikulin/parse5?sponsor=1"
      }
    },
    "node_modules/parse5-parser-stream": {
      "version": "7.1.2",
      "resolved": "https://registry.npmmirror.com/parse5-parser-stream/-/parse5-parser-stream-7.1.2.tgz",
      "integrity": "sha512-JyeQc9iwFLn5TbvvqACIF/VXG6abODeB3Fwmv/TGdLk2LfbWkaySGY72at4+Ty7EkPZj854u4CrICqNk2qIbow==",
      "license": "MIT",
      "dependencies": {
        "parse5": "^7.0.0"
      },
      "funding": {
        "url": "https://github.com/inikulin/parse5?sponsor=1"
      }
    },
    "node_modules/parse5/node_modules/entities": {
      "version": "6.0.1",
      "resolved": "https://registry.npmmirror.com/entities/-/entities-6.0.1.tgz",
      "integrity": "sha512-aN97NXWF6AWBTahfVOIrB/NShkzi5H7F9r1s9mD3cDj4Ko5f2qhhVoYMibXF7GlLveb/D2ioWay8lxI97Ven3g==",
      "license": "BSD-2-Clause",
      "engines": {
        "node": ">=0.12"
      },
      "funding": {
        "url": "https://github.com/fb55/entities?sponsor=1"
      }
    },
    "node_modules/playwright": {
      "version": "1.59.1",
      "resolved": "https://registry.npmmirror.com/playwright/-/playwright-1.59.1.tgz",
      "integrity": "sha512-C8oWjPR3F81yljW9o5OxcWzfh6avkVwDD2VYdwIGqTkl+OGFISgypqzfu7dOe4QNLL2aqcWBmI3PMtLIK233lw==",
      "license": "Apache-2.0",
      "optional": true,
      "dependencies": {
        "playwright-core": "1.59.1"
      },
      "bin": {
        "playwright": "cli.js"
      },
      "engines": {
        "node": ">=18"
      },
      "optionalDependencies": {
        "fsevents": "2.3.2"
      }
    },
    "node_modules/playwright-core": {
      "version": "1.59.1",
      "resolved": "https://registry.npmmirror.com/playwright-core/-/playwright-core-1.59.1.tgz",
      "integrity": "sha512-HBV/RJg81z5BiiZ9yPzIiClYV/QMsDCKUyogwH9p3MCP6IYjUFu/MActgYAvK0oWyV9NlwM3GLBjADyWgydVyg==",
      "license": "Apache-2.0",
      "optional": true,
      "bin": {
        "playwright-core": "cli.js"
      },
      "engines": {
        "node": ">=18"
      }
    },
    "node_modules/safer-buffer": {
      "version": "2.1.2",
      "resolved": "https://registry.npmmirror.com/safer-buffer/-/safer-buffer-2.1.2.tgz",
      "integrity": "sha512-YZo3K82SD7Riyi0E1EQPojLz7kpepnSQI9IyPbHHg1XXXevb5dJI7tpyN2ADxGcQbHG7vcyRHk0cbwqcQriUtg==",
      "license": "MIT"
    },
    "node_modules/undici": {
      "version": "7.25.0",
      "resolved": "https://registry.npmmirror.com/undici/-/undici-7.25.0.tgz",
      "integrity": "sha512-xXnp4kTyor2Zq+J1FfPI6Eq3ew5h6Vl0F/8d9XU5zZQf1tX9s2Su1/3PiMmUANFULpmksxkClamIZcaUqryHsQ==",
      "license": "MIT",
      "engines": {
        "node": ">=20.18.1"
      }
    },
    "node_modules/whatwg-encoding": {
      "version": "3.1.1",
      "resolved": "https://registry.npmmirror.com/whatwg-encoding/-/whatwg-encoding-3.1.1.tgz",
      "integrity": "sha512-6qN4hJdMwfYBtE3YBTTHhoeuUrDBPZmbQaxWAqSALV/MeEnR5z1xd8UKud2RAkFoPkmB+hli1TZSnyi84xz1vQ==",
      "deprecated": "Use @exodus/bytes instead for a more spec-conformant and faster implementation",
      "license": "MIT",
      "dependencies": {
        "iconv-lite": "0.6.3"
      },
      "engines": {
        "node": ">=18"
      }
    },
    "node_modules/whatwg-mimetype": {
      "version": "4.0.0",
      "resolved": "https://registry.npmmirror.com/whatwg-mimetype/-/whatwg-mimetype-4.0.0.tgz",
      "integrity": "sha512-QaKxh0eNIi2mE9p2vEdzfagOKHCcj1pJ56EEHGQOVxp8r9/iszLUUV7v89x9O1p/T+NlTM5W7jW6+cz4Fq1YVg==",
      "license": "MIT",
      "engines": {
        "node": ">=18"
      }
    }
  }
}

FILE:scripts/browser-daemon.js
#!/usr/bin/env node
/**
 * browser-daemon.js: persistent Chromium daemon
 *
 * Starts a long-lived browser with Playwright launchServer();
 * search.js / fetch.js connect to its WebSocket endpoint and reuse it,
 * saving the 1.5s+ launch overhead on every call.
 *
 * Usage:
 *   Start:  node scripts/browser-daemon.js          (runs in the background)
 *   Stop:   node scripts/browser-daemon.js --stop
 *   Status: node scripts/browser-daemon.js --status
 */
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';

const __dirname = path.dirname(fileURLToPath(import.meta.url));
const skillRoot = path.resolve(__dirname, '..');
const ENDPOINT_FILE = path.join(skillRoot, '.browser-endpoint');

function readInfo() {
  try { return JSON.parse(fs.readFileSync(ENDPOINT_FILE, 'utf-8')); } catch { return null; }
}

function isAlive() {
  const info = readInfo();
  if (!info) return false;
  try { process.kill(info.pid, 0); return true; } catch {
    try { fs.unlinkSync(ENDPOINT_FILE); } catch {}
    return false;
  }
}

async function startDaemon() {
  if (isAlive()) {
    const info = readInfo();
    const uptime = ((Date.now() - info.startedAt) / 1000).toFixed(0);
    console.log(`[daemon] Already running  PID: ${info.pid}  Uptime: ${uptime}s`);
    console.log(`  WS: ${info.wsEndpoint}`);
    return;
  }

  const { chromium } = await import('playwright');
  const server = await chromium.launchServer({
    headless: false,
    args: [
      '--disable-blink-features=AutomationControlled',
      '--disable-gpu',
    ],
  });

  const wsEndpoint = server.wsEndpoint();
  const info = {
    pid: process.pid,  // PID of the daemon process (used by the isAlive check)
    wsEndpoint,
    startedAt: Date.now(),
  };

  fs.writeFileSync(ENDPOINT_FILE, JSON.stringify(info, null, 2));
  console.log(`[daemon] Chromium started  PID: ${info.pid}`);
  console.log(`[daemon] WS: ${wsEndpoint}`);
  console.log('[daemon] Running... (Ctrl+C or --stop to quit)');

  // Keep process alive
  process.on('SIGINT', async () => {
    console.log('[daemon] Stopping...');
    await server.close();
    try { fs.unlinkSync(ENDPOINT_FILE); } catch {}
    process.exit(0);
  });
  process.on('SIGTERM', async () => {
    await server.close();
    try { fs.unlinkSync(ENDPOINT_FILE); } catch {}
    process.exit(0);
  });
}

function stopDaemon() {
  const info = readInfo();
  if (!info) { console.log('[daemon] Not running'); return; }
  try {
    process.kill(info.pid, 'SIGTERM');
    console.log(`[daemon] Stopped  PID: ${info.pid}`);
  } catch {
    console.log('[daemon] Process already exited');
  }
  try { fs.unlinkSync(ENDPOINT_FILE); } catch {}
}

function showStatus() {
  if (!isAlive()) { console.log('[daemon] Not running'); return; }
  const info = readInfo();
  const uptime = ((Date.now() - info.startedAt) / 1000).toFixed(0);
  console.log(`[daemon] Running  PID: ${info.pid}  Uptime: ${uptime}s`);
  console.log(`  WS: ${info.wsEndpoint}`);
}

const arg = process.argv[2];
if (arg === '--stop') stopDaemon();
else if (arg === '--status') showStatus();
else startDaemon();

FILE:scripts/check-env.js
#!/usr/bin/env node
/**
 * free-web-search-js environment check v28
 */

import { execSync } from 'child_process';
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';

const __dirname = path.dirname(fileURLToPath(import.meta.url));
const skillRoot = path.resolve(__dirname, '..');

function main() {
  const lines = [];

  // Node.js
  let nodeOk = false;
  try {
    const v = execSync('node --version', { encoding: 'utf-8', timeout: 5000 }).trim();
    const major = parseInt(v.replace('v', '').split('.')[0]);
    nodeOk = major >= 18;
    if (nodeOk) {
      lines.push(`[OK] Node.js ${v} (>= 18)`);
    } else {
      lines.push(`[X] Node.js >= 18 required (current: ${v})`);
      lines.push(`   -> https://nodejs.org`);
    }
  } catch {
    lines.push(`[X] Node.js not found`);
    lines.push(`   -> https://nodejs.org`);
  }

  // npm dependencies (all required)
  const nm = path.join(skillRoot, 'node_modules');
  const requiredDeps = ['cheerio', 'commander', 'iconv-lite', 'playwright'];
  let depsOk = true;

  if (!fs.existsSync(nm)) {
    lines.push(`[X] node_modules not found`);
    lines.push(`   -> cd ${skillRoot} && npm install`);
    depsOk = false;
  } else {
    const missing = requiredDeps.filter(dep => !fs.existsSync(path.join(nm, dep)));
    if (missing.length > 0) {
      lines.push(`[X] Missing npm packages: ${missing.join(', ')}`);
      lines.push(`   -> cd ${skillRoot} && npm install`);
      depsOk = false;
    } else {
      lines.push(`[OK] npm packages: cheerio, commander, iconv-lite, playwright`);
    }
  }

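  // Playwright Chromium browser (required)
  let browserOk = false;
  try {
    const browserPaths = [
      process.env.LOCALAPPDATA && path.join(process.env.LOCALAPPDATA, 'ms-playwright'),
      process.env.HOME && path.join(process.env.HOME, '.cache', 'ms-playwright'),
      // macOS keeps the Playwright browser cache under ~/Library/Caches
      process.env.HOME && path.join(process.env.HOME, 'Library', 'Caches', 'ms-playwright'),
    ].filter(Boolean);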
    browserOk = browserPaths.some(p => fs.existsSync(p) && fs.readdirSync(p).length > 0);
    if (browserOk) {
      lines.push(`[OK] Playwright Chromium browser installed`);
    } else {
      lines.push(`[X] Playwright Chromium browser not installed`);
      lines.push(`   -> npx playwright install chromium`);
      depsOk = false;
    }
  } catch {
    lines.push(`[X] Playwright Chromium browser check failed`);
    lines.push(`   -> npx playwright install chromium`);
    depsOk = false;
  }

  const allOk = nodeOk && depsOk;
  lines.push('');
  if (allOk) {
    lines.push(`[OK] Environment ready`);
  } else {
    lines.push(`[X] Environment not ready, follow the -> hints above`);
  }

  console.log(lines.join('\n'));
  process.exit(allOk ? 0 : 1);
}

main();

FILE:scripts/fetch.js
#!/usr/bin/env node
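/**
 * free-web-search-js fetch.js v23.0
 *
 * Two-tier fallback with enhancements:
 *   1. Lightweight HTTP + cheerio (fast, does not launch a browser)
 *      - Supports JSON API responses
 *      - Extracts embedded data such as JSON-LD / __NEXT_DATA__
 *      - Falls back to meta tags (og:description etc.)
 *   2. Playwright headed (full browser with JS rendering)
 * Multiple URLs are fetched in parallel; URLs that fail to open are skipped
 */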
import process from 'process';
import child_process from 'child_process';
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';

const __dirname = path.dirname(fileURLToPath(import.meta.url));
const ENDPOINT_FILE = path.resolve(__dirname, '..', '.browser-endpoint');

const TIMEOUT = 35000;
const DEFAULT_MAX_LEN = 12000;

const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36';

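// ==================== Browser reuse ====================
async function getBrowser() {
  try {
    const info = JSON.parse(fs.readFileSync(ENDPOINT_FILE, 'utf-8'));
    process.kill(info.pid, 0); // throws if the daemon process is gone
    const { chromium } = await import('playwright');
    // launchServer() exposes a Playwright-protocol endpoint, which is
    // consumed with connect() rather than connectOverCDP().
    const browser = await chromium.connect(info.wsEndpoint);
    return { browser, shared: true };
  } catch {}
  // No usable daemon: launch a private browser instead.
  const { chromium } = await import('playwright');
  const browser = await chromium.launch({
    headless: false,
    args: ['--disable-blink-features=AutomationControlled'],
  });
  return { browser, shared: false };
}

function releaseBrowser(browser, shared) {
  // Playwright's Browser has no disconnect(); for a connected (shared) browser,
  // close() only drops this client's connection, while for a locally launched
  // one it shuts the browser down.
  return browser.close();
}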

const PAGE_COMPAT_INIT = () => {
  Object.defineProperty(navigator, 'webdriver', { get: () => false });
  window.chrome = { runtime: {} };
  const origQuery = window.navigator.permissions?.query;
  if (origQuery) {
    window.navigator.permissions.query = (params) => (
      params.name === 'notifications'
        ? Promise.resolve({ state: Notification.permission })
        : origQuery(params)
    );
  }
};

async function ensureDeps() {
  try { await import('cheerio'); } catch {
    child_process.execSync('npm install cheerio --silent', { stdio: 'inherit' });
  }
  try { await import('commander'); } catch {
    child_process.execSync('npm install commander --silent', { stdio: 'inherit' });
  }
  try { await import('iconv-lite'); } catch {
    child_process.execSync('npm install iconv-lite --silent', { stdio: 'inherit' });
  }
  try { await import('playwright'); } catch {
    console.error('[WARN] playwright is not installed; the headed fallback is unavailable');
  }
}

// ==================== Encoding handling ====================
async function decodeBuffer(buf, contentTypeHeader) {
  // Prefer the charset declared in Content-Type
  let charset = 'utf-8';
  if (contentTypeHeader) {
    const m = contentTypeHeader.match(/charset=([^\s;]+)/i);
    if (m) charset = m[1].toLowerCase();
  }

  if (charset === 'utf-8' || charset === 'utf8') {
    return buf.toString('utf-8');
  }
  if (charset === 'gbk' || charset === 'gb2312' || charset === 'gb18030') {
    try {
      const iconv = await import('iconv-lite');
      return iconv.default.decode(buf, 'gbk');
    } catch {
      try { return new TextDecoder('gbk').decode(buf); } catch {}
    }
  }
  // Fallback: try utf-8 first; if there are many replacement characters, try gbk
  let text = buf.toString('utf-8');
  if ((text.match(/\ufffd/g) || []).length > 20) {
    try {
      const iconv = await import('iconv-lite');
      text = iconv.default.decode(buf, 'gbk');
    } catch {
      try { text = new TextDecoder('gbk').decode(buf); } catch {}
    }
  }
  return text;
}

// ==================== JSON content extraction ====================
function extractJsonContent(data, maxLen) {
  /** Extract meaningful text from a JSON API response */
  const texts = [];

  function walk(obj, depth = 0) {
    if (depth > 8 || texts.join(' ').length > maxLen) return;
    if (typeof obj === 'string' && obj.length > 20) {
      texts.push(obj);
    } else if (Array.isArray(obj)) {
      for (const item of obj) walk(item, depth + 1);
    } else if (obj && typeof obj === 'object') {
      // Extract common content fields first
      for (const key of ['content', 'text', 'body', 'description', 'summary',
        'message', 'value', 'title', 'name', 'answer', 'result']) {
        if (obj[key] && typeof obj[key] === 'string' && obj[key].length > 20) {
          texts.push(obj[key]);
        }
      }
      for (const [k, v] of Object.entries(obj)) {
        if (typeof v === 'object' && v !== null) walk(v, depth + 1);
      }
    }
  }

  walk(data);
  return texts.join(' ').replace(/\s+/g, ' ').trim().slice(0, maxLen);
}

// ==================== Embedded data extraction ====================
function extractEmbeddedData($, maxLen) {
  /** Extract structured data embedded in the HTML: JSON-LD, __NEXT_DATA__, meta tags, etc. */
  const parts = [];

  // JSON-LD
  $('script[type="application/ld+json"]').each((_, el) => {
    try {
      const data = JSON.parse($(el).text());
      if (data.description) parts.push(String(data.description));
      if (data.articleBody) parts.push(String(data.articleBody));
      if (data.text) parts.push(String(data.text));
      // Walk @graph
      if (Array.isArray(data['@graph'])) {
        for (const item of data['@graph']) {
          if (item.description) parts.push(String(item.description));
          if (item.articleBody) parts.push(String(item.articleBody));
        }
      }
    } catch {}
  });

  // __NEXT_DATA__ (Next.js)
  $('script#__NEXT_DATA__').each((_, el) => {
    try {
      const data = JSON.parse($(el).text());
      const text = extractJsonContent(data, maxLen);
      if (text.length > 100) parts.push(text);
    } catch {}
  });

  // meta tag fallback
  const metaSelectors = [
    'meta[property="og:description"]',
    'meta[name="description"]',
    'meta[property="og:title"]',
    'meta[name="twitter:description"]',
  ];
  for (const sel of metaSelectors) {
    const content = $(sel).attr('content');
    if (content && content.length > 20) parts.push(content);
  }

  return parts.join(' ').replace(/\s+/g, ' ').trim().slice(0, maxLen);
}

// ==================== Tier 1: lightweight HTTP ====================
async function fetchLightweight(url, maxLen) {
  console.error(`[fetch:http] ${url}`);
  const ac = new AbortController();
  const t = setTimeout(() => ac.abort(), 15000);
  try {
    const r = await fetch(url, {
      headers: {
        'User-Agent': UA,
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,application/json;q=0.8,*/*;q=0.5',
        'Accept-Language': 'zh-CN,zh;q=0.9,en-US,en;q=0.8',
      },
      redirect: 'follow', signal: ac.signal,
    });
    clearTimeout(t);
    if (!r.ok) return { status: r.status, content: '', error: `HTTP ${r.status}` };

    const contentType = r.headers.get('content-type') || '';
    const buf = Buffer.from(await r.arrayBuffer());

    // JSON response: parse directly
    if (/application\/json/i.test(contentType) || (/^[\[{]/.test(buf.toString('utf-8', 0, 100)))) {
      try {
        const data = JSON.parse(buf.toString('utf-8'));
        const text = extractJsonContent(data, maxLen);
        if (text.length > 50) return { status: 200, content: text };
      } catch {}
    }

    // HTML response
    const html = await decodeBuffer(buf, contentType);
    const { load } = await import('cheerio');
    const $ = load(html);

    // Extract embedded data (JSON-LD etc.) first, as a supplement
    const embedded = extractEmbeddedData($, maxLen);

    // Remove noise
    $('script,style,nav,header,footer,aside,iframe,noscript,.ad,.sidebar,.comment,.social,.share,.related,.breadcrumb,.pagination,.cookie,.popup').remove();

    // Main-content containers
    for (const sel of ['article','.article-content','.post-content','.entry-content',
      '#article_content','.markdown-body','.news-content','.detail-body',
      '.content','.main-content','main','#content','table']) {
      const el = $(sel).first();
      if (el.length) {
        const text = el.text().replace(/\s+/g, ' ').trim();
        if (text.length > 200) {
          // Append the embedded data when it adds extra information
          let result = text;
          if (embedded && !text.includes(embedded.slice(0, 50))) {
            result = text + '\n\n[Structured data] ' + embedded;
          }
          return { status: 200, content: result.slice(0, maxLen) };
        }
      }
    }

    // Heuristic: pick the block with the highest text density
    const candidates = [];
    for (const el of $('div, section, main, article').toArray()) {
      const $el = $(el);
      if ($el.children().length > 50) continue;
      const text = $el.text().replace(/\s+/g, ' ').trim();
      if (text.length > 300) {
        const linkRatio = $el.find('a').length / (text.length / 100);
        if (linkRatio < 5) candidates.push({ text, len: text.length });
      }
    }
    candidates.sort((a, b) => b.len - a.len);
    if (candidates.length > 0 && candidates[0].len > 200) {
      let result = candidates[0].text;
      if (embedded && !result.includes(embedded.slice(0, 50))) {
        result = result + '\n\n[结构化数据] ' + embedded;
      }
      return { status: 200, content: result.slice(0, maxLen) };
    }

    // 嵌入数据兜底（正文提取失败但有 JSON-LD 等）
    if (embedded.length > 100) return { status: 200, content: embedded.slice(0, maxLen) };

    const body = $('body').text().replace(/\s+/g, ' ').trim();
    if (body.length > 200) return { status: 200, content: body.slice(0, maxLen) };

    return { status: r.status, content: '', error: `内容太短(${body.length}字)` };
  } catch (e) {
    clearTimeout(t);
    return { status: 0, content: '', error: e.message.split('\n')[0] };
  }
}

// ==================== Layer 2: Playwright headed ====================
async function fetchHeaded(url, maxLen) {
  console.error(`[fetch:headed] ${url}`);

  let browser, shared;
  try {
    ({ browser, shared } = await getBrowser());
    const page = await browser.newPage();
    await page.addInitScript(PAGE_COMPAT_INIT);
    await page.setExtraHTTPHeaders({ 'Accept-Language': 'zh-CN,zh;q=0.9,en-US,en;q=0.8' });

    const resp = await page.goto(url, { waitUntil: 'domcontentloaded', timeout: TIMEOUT });
    const httpStatus = resp?.status() || 0;
    await page.waitForTimeout(4000);
    try { await page.evaluate(() => window.scrollTo(0, 300)); await page.waitForTimeout(800); } catch {}

    let content = '';
    try {
      content = await page.evaluate((max) => {
        // 提取 JSON-LD
        const ldParts = [];
        document.querySelectorAll('script[type="application/ld+json"]').forEach(el => {
          try {
            const d = JSON.parse(el.textContent);
            if (d.description) ldParts.push(String(d.description));
            if (d.articleBody) ldParts.push(String(d.articleBody));
          } catch {}
        });

        // 去噪音
        for (const sel of ['script','style','nav','header','footer','aside','iframe','noscript',
          '.ad','.ads','.sidebar','.comment','.social','.share','.related',
          '.breadcrumb','.pagination','.cookie','.popup','[role="navigation"]','[role="banner"]']) {
          document.querySelectorAll(sel).forEach(el => el.remove());
        }

        // 正文提取
        for (const sel of ['article','.article-content','.post-content','.entry-content',
          '#article_content','.markdown-body','.news-content','.detail-body',
          '.content','.main-content','main','#content','table']) {
          const el = document.querySelector(sel);
          if (el) { const text = el.innerText.replace(/\s+/g, ' ').trim(); if (text.length > 200) return text.slice(0, max); }
        }
        const candidates = [];
        for (const el of document.querySelectorAll('div, section, main, article')) {
          if (el.children.length > 50) continue;
          const text = el.innerText?.replace(/\s+/g, ' ').trim() || '';
          if (text.length > 300) { const links = el.querySelectorAll('a'); if (links.length / (text.length / 100) < 5) candidates.push({ el, len: text.length }); }
        }
        candidates.sort((a, b) => b.len - a.len);
        if (candidates.length > 0) { const text = candidates[0].el.innerText.replace(/\s+/g, ' ').trim(); if (text.length > 200) return text.slice(0, max); }
        return document.body?.innerText?.replace(/\s+/g, ' ').trim().slice(0, max) || '';
      }, maxLen);
    } catch {
      try { await page.waitForTimeout(2000); content = await page.evaluate((max) => document.body?.innerText?.replace(/\s+/g, ' ').trim().slice(0, max) || '', maxLen); } catch {}
    }

    await page.close();
    if (content.length < 50) return { status: httpStatus, content: '', error: content ? `内容太短(${content.length}字)` : `HTTP ${httpStatus}` };
    return { status: httpStatus, content };
  } catch (e) {
    return { status: 0, content: '', error: e.message.split('\n')[0] };
  } finally {
    if (browser) await releaseBrowser(browser, shared).catch(() => {});
  }
}

// ==================== Single URL: two-layer fallback ====================
async function fetchUrl(url, maxLen) {
  // Layer 1: lightweight HTTP
  let result = await fetchLightweight(url, maxLen);
  if (result.content) return { url, ...result };
  console.error(`[fetch:http] 失败: ${result.error}`);

  // Layer 2: Playwright headed
  result = await fetchHeaded(url, maxLen);
  return { url, ...result };
}

// ==================== main ====================
async function main() {
  await ensureDeps();
  const { program } = await import('commander');
  program
    .argument('<urls...>', '要抓取的 URL，多个并行')
    .option('--max-len <n>', '单页最大字符数', v => parseInt(v, 10), DEFAULT_MAX_LEN)
    .option('--http-only', '只用轻量 HTTP，不启动浏览器')
    .option('--headed', '跳过 HTTP，直接 headed')
    .parse(process.argv);

  const opts = program.opts();
  const maxLen = Math.max(1000, Math.min(50000, opts.maxLen || DEFAULT_MAX_LEN));
  const urls = program.args.filter(a => a.startsWith('http'));
  if (!urls.length) { console.log(JSON.stringify({ error: '未传入有效 URL' })); process.exit(1); }

  const tasks = urls.map(async (url) => {
    if (opts.httpOnly) {
      const r = await fetchLightweight(url, maxLen);
      if (r.error) console.error(`[fetch] 跳过: ${r.error}`);
      return { url, ...r };
    }
    if (opts.headed) {
      const r = await fetchHeaded(url, maxLen);
      if (r.error) console.error(`[fetch] 跳过: ${r.error}`);
      return { url, ...r };
    }
    const r = await fetchUrl(url, maxLen);
    if (r.error) console.error(`[fetch] 跳过: ${r.error}`);
    return r;
  });

  const settled = await Promise.allSettled(tasks);
  const results = settled.map(r => r.status === 'fulfilled' ? r.value : { url: '?', status: 0, content: '', error: String(r.reason) });
  console.log(JSON.stringify(results, null, 2));
}

main().catch(e => { console.error('[ERROR]', e.message); process.exit(1); });

FILE:scripts/search.js
#!/usr/bin/env node
/**
 * free-web-search-js search.js v28.0
 *
 * China: Bing CN (Playwright, submitted through the search box)
 * Overseas: DDG HTML (plain HTTP)
 * After searching, automatically fetches the content of the top N results
 */
import process from 'process';
import child_process from 'child_process';
import querystring from 'querystring';
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';

const __dirname = path.dirname(fileURLToPath(import.meta.url));
const SKILL_ROOT = path.resolve(__dirname, '..');
const ENDPOINT_FILE = path.resolve(SKILL_ROOT, '.browser-endpoint');

const DEFAULT_MAX = 10;
const DEFAULT_FETCH = 3;
const HTTP_TIMEOUT = 10000;
const PW_TIMEOUT = 25000;

const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36';

function clean(s) { return String(s || '').replace(/\s+/g, ' ').trim(); }

// ==================== Dependencies ====================
async function ensureDeps() {
  try { await import('cheerio'); } catch {
    child_process.execSync('npm install cheerio --silent', { stdio: 'inherit' });
  }
  try { await import('commander'); } catch {
    child_process.execSync('npm install commander --silent', { stdio: 'inherit' });
  }
}

// ==================== IP detection ====================
let _inChinaCache = null;
async function detectInChina() {
  if (_inChinaCache !== null) return _inChinaCache;

  const probes = [
    (async () => {
      for (const url of ['https://myip.ipip.net', 'https://cip.cc']) {
        try {
          const r = await fetch(url, { headers: { 'User-Agent': UA }, signal: AbortSignal.timeout(3000) });
          if (!r.ok) continue;
          const text = await r.text();
          if (/中国|CN/i.test(text)) {
            const ip = text.match(/(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/)?.[1] ?? '?';
            return { inChina: true, label: `${ip} → CN` };
          }
        } catch {}
      }
      throw new Error('cn probe failed');
    })(),
    (async () => {
      for (const url of ['https://ipinfo.io/json', 'https://ipapi.co/json/']) {
        try {
          const r = await fetch(url, { headers: { 'User-Agent': UA }, signal: AbortSignal.timeout(3000) });
          if (!r.ok) continue;
          const d = await r.json();
          const cc = String(d.country || d.country_code || '').toUpperCase();
          if (!cc) continue;
          return { inChina: cc === 'CN', label: `${d.ip ?? '?'} → ${cc}` };
        } catch {}
      }
      throw new Error('intl probe failed');
    })(),
    (async () => {
      const r = await fetch('https://cn.bing.com', { headers: { 'User-Agent': UA }, signal: AbortSignal.timeout(3000), redirect: 'manual' });
      return { inChina: r.status === 200 || r.status === 302, label: `cn.bing.com → ${r.status}` };
    })(),
  ];

  try {
    const winner = await Promise.any(probes);
    console.error(`[地理] ${winner.label} → ${winner.inChina ? '国内' : '国外'}`);
    _inChinaCache = winner.inChina;
    return winner.inChina;
  } catch {
    console.error('[地理] 检测失败，默认国内');
    _inChinaCache = true;
    return true;
  }
}

// ==================== URL handling ====================
function decodeBingUrl(url) {
  if (!url?.includes('bing.com/ck/')) return url;
  try {
    const u = new URL(url).searchParams.get('u');
    if (!u) return url;
    const stripped = u.replace(/^a[0-9]/, '');
    const b64 = stripped + '='.repeat((4 - stripped.length % 4) % 4);
    const dec = Buffer.from(b64, 'base64').toString('utf-8');
    return dec.startsWith('http') ? dec : url;
  } catch { return url; }
}

function normalizeUrl(raw) {
  let url = clean(raw);
  if (!url) return url;
  url = decodeBingUrl(url);
  try {
    const u = new URL(url);
    u.hash = '';
    for (const k of ['utm_source','utm_medium','utm_campaign','gclid','fbclid','msclkid','spm','from','ref','src']) {
      u.searchParams.delete(k);
    }
    return u.toString();
  } catch { return url; }
}

async function resolveRedirectUrl(url, timeout = 6000) {
  if (!url) return url;
  if (!/sogou\.com\/link/i.test(url)) return url;
  try {
    const r = await fetch(url, {
      method: 'GET', headers: { 'User-Agent': UA },
      redirect: 'follow', signal: AbortSignal.timeout(timeout),
    });
    if (r.url && r.url.startsWith('http') && !/sogou\.com\/link/i.test(r.url)) {
      return r.url;
    }
    const text = await r.text();
    const jsMatch = text.match(/window\.location\.replace\s*\(\s*["']([^"']+)["']/);
    if (jsMatch) return jsMatch[1];
    const metaMatch = text.match(/URL\s*=\s*['"]([^'"]+)['"]/i);
    if (metaMatch) return metaMatch[1];
  } catch {}
  return url;
}

// ==================== Playwright browser management ====================
const PAGE_COMPAT_INIT = () => {
  Object.defineProperty(navigator, 'webdriver', { get: () => false });
  window.chrome = { runtime: {} };
  const origQuery = window.navigator.permissions?.query;
  if (origQuery) {
    window.navigator.permissions.query = (params) => (
      params.name === 'notifications' ? Promise.resolve({ state: Notification.permission }) : origQuery(params)
    );
  }
};

let _browserInstance = null;

async function getBrowser() {
  if (_browserInstance) return _browserInstance;
  try {
    const info = JSON.parse(fs.readFileSync(ENDPOINT_FILE, 'utf-8'));
    process.kill(info.pid, 0);
    const { chromium } = await import('playwright');
    const browser = await chromium.connectOverCDP(info.wsEndpoint);
    _browserInstance = { browser, shared: true };
    return _browserInstance;
  } catch {}
  const { chromium } = await import('playwright');
  const browser = await chromium.launch({
    headless: false,
    args: ['--disable-blink-features=AutomationControlled'],
  });
  _browserInstance = { browser, shared: false };
  return _browserInstance;
}

async function closeBrowser() {
  if (!_browserInstance) return;
  try {
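    // Playwright's Browser has no disconnect(); for a browser attached via connectOverCDP,
    // close() disconnects from it, while for a launched browser it shuts the browser down.
    await _browserInstance.browser.close();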
  } catch {}
  _browserInstance = null;
}

// ==================== Search engines ====================

async function searchBingPW(query, max) {
  console.error(`[Bing:pw] ${query}`);
  const out = [], seen = new Set();
  const base = 'https://cn.bing.com';
  let context;
  try {
    const { browser } = await getBrowser();
    context = await browser.newContext({
      userAgent: UA,
      locale: 'zh-CN',
      viewport: { width: 1920, height: 1080 },
      extraHTTPHeaders: { 'Accept-Language': 'zh-CN,zh;q=0.9' },
    });
    await context.addInitScript(PAGE_COMPAT_INIT);

    const page = await context.newPage();

    // Visit the home page first to obtain cookies
    await page.goto(base + '/', { waitUntil: 'domcontentloaded', timeout: 15000 });
    await page.waitForTimeout(1500);

    // Submit the query through the search box
    try {
      const searchBox = await page.$('#sb_form_q');
      if (searchBox) {
        await searchBox.click();
        await searchBox.fill(query);
        await page.waitForTimeout(300);
        await Promise.all([
          page.waitForLoadState('domcontentloaded', { timeout: PW_TIMEOUT }),
          page.keyboard.press('Enter'),
        ]);
        await page.waitForTimeout(2000);
      } else {
        await page.goto(base + '/search?' + querystring.stringify({ q: query }), {
          waitUntil: 'domcontentloaded', timeout: PW_TIMEOUT,
        });
        await page.waitForTimeout(1500);
      }
    } catch {
      await page.goto(base + '/search?' + querystring.stringify({ q: query }), {
        waitUntil: 'domcontentloaded', timeout: PW_TIMEOUT,
      });
      await page.waitForTimeout(1500);
    }

    const results = await page.evaluate(() => {
      const items = [];
      const seen = new Set();
      const add = (title, url, snippet) => {
        if (title && url && url.startsWith('http') && !seen.has(url)) {
          seen.add(url);
          items.push({ title, url, snippet });
        }
      };

      // 1) 主结果：li.b_algo
      document.querySelectorAll('li.b_algo').forEach(el => {
        const a = el.querySelector('h2 a');
        if (!a) return;
        add(a.textContent.trim(), a.href, el.querySelector('.b_caption p')?.textContent?.trim() || '');
      });

      // 2) 答案卡片/知识面板里的链接（li.b_ans, li.b_vList, li.b_entityTP）
      if (items.length === 0) {
        document.querySelectorAll('li.b_ans, li.b_vList, li.b_entityTP, li.b_mop').forEach(el => {
          el.querySelectorAll('a[href]').forEach(a => {
            const href = a.href;
            // 跳过 Bing 内部链接
            if (!href || href.includes('bing.com') || href.includes('microsoft.com') || href.startsWith('javascript:')) return;
            add(a.textContent.trim().slice(0, 120), href, '');
          });
        });
      }

      return items;
    });
    for (const item of results) {
      const url = normalizeUrl(item.url);
      const title = clean(item.title);
      const snippet = clean(item.snippet);
      if (title && url && url.startsWith('http') && !seen.has(url.toLowerCase())) {
        seen.add(url.toLowerCase());
        out.push({ title, url, snippet });
      }
    }

    // 3) On zero results, retry with a suffix appended (nudges Bing toward web results rather than instant-answer cards)
    if (out.length === 0) {
      const suffixes = [' 网站', ' 详情', ' 介绍'];
      for (const suffix of suffixes) {
        const retryQuery = query + suffix;
        console.error(`[Bing:pw] 0条，补词重试: "${retryQuery}"`);
        try {
          await page.goto(base + '/search?' + querystring.stringify({ q: retryQuery }), {
            waitUntil: 'domcontentloaded', timeout: PW_TIMEOUT,
          });
          await page.waitForTimeout(1500);

          const retryResults = await page.evaluate(() => {
            const items = [];
            document.querySelectorAll('li.b_algo').forEach(el => {
              const a = el.querySelector('h2 a');
              if (!a) return;
              items.push({
                title: a.textContent.trim(),
                url: a.href || '',
                snippet: el.querySelector('.b_caption p')?.textContent?.trim() || '',
              });
            });
            return items;
          });
          for (const item of retryResults) {
            const url = normalizeUrl(item.url);
            const title = clean(item.title);
            const snippet = clean(item.snippet);
            if (title && url && url.startsWith('http') && !seen.has(url.toLowerCase())) {
              seen.add(url.toLowerCase());
              out.push({ title, url, snippet });
            }
          }
          if (out.length > 0) break;
        } catch {}
      }
    }

    console.error(`[Bing:pw] ${out.length} 条`);
  } catch (e) {
    console.error(`[Bing:pw] 错误: ${e.message.split('\n')[0]}`);
  } finally {
    if (context) await context.close().catch(() => {});
  }
  return out.slice(0, max);
}

async function searchSogouHttp(query, max) {
  console.error(`[搜狗:http] ${query}`);
  const out = [], seen = new Set();
  try {
    const url = 'https://www.sogou.com/web?' + querystring.stringify({ query });
    const r = await fetch(url, {
      headers: {
        'User-Agent': UA,
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.9',
      },
      signal: AbortSignal.timeout(HTTP_TIMEOUT), redirect: 'follow',
    });
    if (!r.ok) { console.error(`[搜狗:http] HTTP ${r.status}`); return out; }

    const html = await r.text();
    const { load } = await import('cheerio');
    const $ = load(html);

    const rawItems = [];
    $('.vrwrap, .rb').each((_, el) => {
      const $el = $(el);
      const $a = $el.find('h3 a').first();
      if (!$a.length) return;
      const title = clean($a.text());
      let href = $a.attr('href') || '';
      if (href.startsWith('/link?')) href = 'https://www.sogou.com' + href;
      const snippet = clean($el.find('.str-text-info, .str_info').text());
      if (title && href) rawItems.push({ title, href, snippet });
    });

    const resolved = await Promise.all(rawItems.map(async (item) => ({ ...item, url: normalizeUrl(await resolveRedirectUrl(item.href)) })));
    for (const item of resolved) {
      if (item.url && item.url.startsWith('http') && !seen.has(item.url.toLowerCase())) {
        seen.add(item.url.toLowerCase());
        out.push({ title: item.title, url: item.url, snippet: item.snippet });
      }
    }
    console.error(`[搜狗:http] ${out.length} 条`);
  } catch (e) {
    console.error(`[搜狗:http] 错误: ${e.message.split('\n')[0]}`);
  }
  return out.slice(0, max);
}

async function searchDDGHtml(query, max) {
  console.error(`[DDG:html] ${query}`);
  const out = [], seen = new Set();
  try {
    const r = await fetch('https://html.duckduckgo.com/html/?q=' + encodeURIComponent(query), {
      headers: { 'User-Agent': UA, 'Accept-Language': 'en-US,en;q=0.9' },
      signal: AbortSignal.timeout(HTTP_TIMEOUT), redirect: 'follow',
    });
    if (!r.ok) { console.error(`[DDG:html] HTTP ${r.status}`); return out; }

    const html = await r.text();
    const { load } = await import('cheerio');
    const $ = load(html);

    $('.result, .web-result').each((_, el) => {
      const $el = $(el);
      const $a = $el.find('.result__title a, .result__a, h2 a').first();
      if (!$a.length) return;
      const title = clean($a.text());
      let href = $a.attr('href') || '';
      try {
        const uddg = new URL(href, 'https://duckduckgo.com').searchParams.get('uddg');
        if (uddg) href = uddg;
      } catch {}
      const snippet = clean($el.find('.result__snippet, .result__body').text());
      const url = normalizeUrl(href);
      if (title && url && url.startsWith('http') && !seen.has(url.toLowerCase())) {
        seen.add(url.toLowerCase());
        out.push({ title, url, snippet });
      }
    });
    console.error(`[DDG:html] ${out.length} 条`);
  } catch (e) {
    console.error(`[DDG:html] 错误: ${e.message.split('\n')[0]}`);
  }
  return out.slice(0, max);
}

// ==================== Auto-fetch ====================
async function autoFetchUrls(results, fetchCount, maxLen) {
  if (fetchCount <= 0 || results.length === 0) return;
  const urls = results.slice(0, Math.min(fetchCount, results.length)).map(r => r.url);
  console.error(`[fetch] 自动抓取 ${urls.length} 条...`);

  try {
    // execFileSync bypasses the shell, so characters such as & or ? in result URLs are passed through verbatim
    const fetchArgs = [path.resolve(__dirname, 'fetch.js'), ...urls, `--max-len=${maxLen}`, '--headed'];
    const raw = child_process.execFileSync(process.execPath, fetchArgs, {
      encoding: 'utf8', timeout: 60000,
      stdio: ['pipe', 'pipe', 'pipe'],
    });
    try {
      const fetched = JSON.parse(raw);
      for (let i = 0; i < Math.min(fetchCount, fetched.length); i++) {
        if (fetched[i] && fetched[i].content) {
          results[i].content = fetched[i].content.slice(0, maxLen);
        }
      }
      console.error(`[fetch] 抓取完成`);
    } catch (e) {
      console.error(`[fetch] 解析失败: ${e.message.split('\n')[0]}`);
    }
  } catch (e) {
    console.error(`[fetch] 抓取失败: ${e.message.split('\n')[0]}`);
  }
}

// ==================== main ====================
async function main() {
  const startTime = Date.now();
  await ensureDeps();
  const { program } = await import('commander');
  program
    .argument('[query...]', '搜索关键词')
    .option('--max <n>', '结果数 (1-30)', v => parseInt(v, 10), DEFAULT_MAX)
    .option('--region <r>', '区域: auto/cn/intl', 'auto')
    .option('--engine <e>', '引擎: auto/bing/sogou/ddg', 'auto')
    .option('--fetch <n>', '自动抓前N条URL内容 (0=不抓)', v => parseInt(v, 10), DEFAULT_FETCH)
    .option('--max-len <n>', '单页最大字符数', v => parseInt(v, 10), 6000)
    .option('--no-fetch', '禁用自动抓取')
    .parse(process.argv);

  const opts = program.opts();
  const query = clean(program.args.join(' '));
  if (!query) { console.log(JSON.stringify({ error: '未传入搜索关键词' })); process.exit(1); }

  const max = Math.max(1, Math.min(30, opts.max));
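  // commander stores --no-fetch as opts.fetch === false (it never sets opts.noFetch)
  const fetchCount = opts.fetch === false ? 0 : (Number.isInteger(opts.fetch) ? opts.fetch : DEFAULT_FETCH);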

  let inChina;
  if (opts.region === 'cn') inChina = true;
  else if (opts.region === 'intl') inChina = false;
  else inChina = await detectInChina();

  const out = [], seen = new Set();

  function dedupKey(url) {
    try {
      const u = new URL(url);
      let host = u.hostname.replace(/^(www|m|mobile)\./, '');
      let p = u.pathname.replace(/\/+$/, '').replace(/\.(html?|php|aspx?)$/, '');
      return `${host}${p}`.toLowerCase();
    } catch { return url.toLowerCase(); }
  }

  const add = (items) => {
    for (const item of items) {
      const key = dedupKey(item.url);
      if (!seen.has(key)) { seen.add(key); out.push(item); }
    }
  };

  if (inChina) {
    // China: pick the engine according to --engine
    const engine = opts.engine === 'auto' ? 'bing' : opts.engine;
    if (engine === 'sogou') {
      console.error('[策略] 国内 → 搜狗 HTTP (⚠ 无cookie易被反爬拦截，结果可能为空)');
      add(await searchSogouHttp(query, max));
    } else {
      console.error('[策略] 国内 → Bing PW');
      add(await searchBingPW(query, max));
    }
  } else {
    console.error('[策略] 海外 → DDG HTML');
    add(await searchDDGHtml(query, max));
  }

  const results = out.slice(0, max);

  // Auto-fetch page content for the top results
  await autoFetchUrls(results, fetchCount, opts.maxLen || 6000);

  console.log(JSON.stringify(results, null, 2));
  console.error(`[耗时] ${((Date.now() - startTime) / 1000).toFixed(1)}s | ${results.length}条结果`);
  await closeBrowser();
}

main().catch(e => { console.error('[ERROR]', e.message); process.exit(1); });

FILE:scripts/setup.sh
#!/bin/bash
# free-web-search-js setup (Linux/macOS)
# v28

set -e
SKILL_ROOT="$(cd "$(dirname "$0")/.." && pwd)"

echo ""
echo "=== free-web-search-js Setup ==="
echo ""
echo "Dependencies:"
echo "  - Node.js >= 18"
echo "  - npm packages: cheerio, commander, iconv-lite, playwright"
echo "  - Playwright Chromium browser (~150MB, takes a few minutes)"
echo ""

# Node.js
if ! command -v node &>/dev/null; then
    echo "[X] Node.js not found"
    echo "   -> https://nodejs.org"
    exit 1
fi
NODE_VERSION=$(node --version)
MAJOR=$(echo "$NODE_VERSION" | sed 's/^v//' | cut -d. -f1)
if [ "$MAJOR" -lt 18 ]; then
    echo "[X] Node.js >= 18 required (current: $NODE_VERSION)"
    exit 1
fi
echo "[OK] Node.js $NODE_VERSION"

# Detect a mainland-China network → choose mirror sources
IN_CHINA=false
echo ""
echo "Detecting network region..."
for url in "https://myip.ipip.net" "https://cip.cc"; do
    if resp=$(curl -sS --max-time 3 "$url" 2>/dev/null); then
        if echo "$resp" | grep -qi "中国\|CN"; then
            IN_CHINA=true
            break
        fi
    fi
done

if [ "$IN_CHINA" = true ]; then
    echo "[OK] Mainland China network detected, using mirror sources"
    export PLAYWRIGHT_DOWNLOAD_HOST="https://npmmirror.com/mirrors/playwright"
    NPM_REGISTRY="--registry=https://registry.npmmirror.com"
else
    echo "[OK] Overseas network detected, using official sources"
    NPM_REGISTRY=""
fi

# npm install
echo ""
echo "Installing npm packages (cheerio, commander, iconv-lite, playwright)..."
cd "$SKILL_ROOT"
if [ -n "$NPM_REGISTRY" ]; then
    if ! npm install $NPM_REGISTRY; then
        echo "[X] npm install failed"
        exit 1
    fi
else
    if ! npm install; then
        echo "[X] npm install failed"
        exit 1
    fi
fi
echo "[OK] npm packages installed"

# Playwright Chromium
echo ""
echo "Installing Playwright Chromium browser (~150MB, this may take a few minutes)..."
if ! npx playwright install chromium; then
    echo "[X] Playwright Chromium install failed"
    echo "   Try manually: npx playwright install chromium"
    exit 1
fi
echo "[OK] Playwright Chromium installed"

echo ""
echo "[OK] Setup complete!"
echo "   Verify: node scripts/check-env.js"

FILE:scripts/_batch_test.js
#!/usr/bin/env node
/**
 * Batch test: runs several queries and records elapsed time, result count, and deduplicated count
 */
import { execSync } from 'child_process';

const queries = [
  '今日黄金价格',
  '俄乌冲突最新消息',
  '怎么做红烧肉',
  '上海明天天气',
  '感冒吃什么药',
  '量子计算',
  '北京',
  '今日铜价',
];

console.log('Query'.padEnd(30) + 'Results  Time    Engines');
console.log('-'.repeat(65));

for (const q of queries) {
  const t = Date.now();
  try {
    const raw = execSync(`node scripts/search.js "${q}" --max=10`, {
      encoding: 'utf8',
      timeout: 120000,
      stdio: ['pipe', 'pipe', 'pipe'],
    });
    const elapsed = ((Date.now() - t) / 1000).toFixed(1);
    const results = JSON.parse(raw);
    
    // Engine info would have to be read from stderr; simplified here to the result count only
    console.log(q.padEnd(30) + `${results.length}`.padEnd(9) + `${elapsed}s`.padEnd(8));
  } catch (e) {
    const elapsed = ((Date.now() - t) / 1000).toFixed(1);
    console.log(q.padEnd(30) + 'FAIL'.padEnd(9) + `${elapsed}s`.padEnd(8) + e.message.split('\n')[0].slice(0, 30));
  }
}

FILE:scripts/_batch_test2.js
#!/usr/bin/env node
/**
 * Batch test: runs scripts/search.js in a child process per query and reads per-engine counts from stderr
 */

const queries = [
  '今日黄金价格',
  '俄乌冲突最新消息',
  '怎么做红烧肉',
  '上海明天天气',
  '感冒吃什么药',
  '量子计算',
  '北京',
  '今日铜价',
];

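// Importing search.js functions directly is awkward, so each query runs in a timed child process
import path from 'path';
import { spawn } from 'child_process';
import { fileURLToPath } from 'url';

// The child is invoked as scripts/search.js, so it must run from the skill root, not from scripts/
const SKILL_ROOT = path.resolve(path.dirname(fileURLToPath(import.meta.url)), '..');

async function runOne(q) {
  return new Promise((resolve) => {
    const t = Date.now();
    const p = spawn('node', ['scripts/search.js', q, '--max=10'], {
      cwd: SKILL_ROOT,
    });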
    let stdout = '', stderr = '';
    p.stdout.on('data', d => stdout += d);
    p.stderr.on('data', d => stderr += d);
    p.on('close', (code) => {
      const elapsed = ((Date.now() - t) / 1000).toFixed(1);
      if (code !== 0) {
        resolve({ q, ok: false, elapsed, error: `exit ${code}` });
        return;
      }
      try {
        const results = JSON.parse(stdout);
        const bingMatch = stderr.match(/\[Bing:pw\] (\d+) 条/);
        const baiduMatch = stderr.match(/\[百度:pw\] (\d+) 条/);
        resolve({
          q, ok: true, elapsed,
          count: results.length,
          bing: bingMatch ? parseInt(bingMatch[1]) : 0,
          baidu: baiduMatch ? parseInt(baiduMatch[1]) : 0,
        });
      } catch (e) {
        resolve({ q, ok: false, elapsed, error: 'parse error' });
      }
    });
    p.on('error', e => {
      const elapsed = ((Date.now() - t) / 1000).toFixed(1);
      resolve({ q, ok: false, elapsed, error: e.message.slice(0, 30) });
    });
  });
}

console.log('Query'.padEnd(24) + 'Results  Bing  Baidu  Time');
console.log('-'.repeat(60));

const allResults = [];
for (const q of queries) {
  const r = await runOne(q);
  allResults.push(r);
  if (r.ok) {
    console.log(r.q.padEnd(24) + `${r.count}`.padEnd(9) + `${r.bing}`.padEnd(6) + `${r.baidu}`.padEnd(7) + `${r.elapsed}s`);
  } else {
    console.log(r.q.padEnd(24) + 'FAIL'.padEnd(9) + ''.padEnd(6) + ''.padEnd(7) + `${r.elapsed}s ` + r.error);
  }
}

// 汇总
const okResults = allResults.filter(r => r.ok);
const avgTime = okResults.reduce((s, r) => s + parseFloat(r.elapsed), 0) / okResults.length;
const avgCount = okResults.reduce((s, r) => s + r.count, 0) / okResults.length;
console.log('-'.repeat(60));
console.log(`平均: ${avgCount.toFixed(1)}条  ${avgTime.toFixed(1)}s  (${okResults.length}/${allResults.length} 成功)`);

FILE:scripts/_bench.js
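import { execSync } from 'child_process';
import path from 'path';
import { fileURLToPath } from 'url';

// Run from the skill root so that the relative path scripts/search.js resolves
const SKILL_ROOT = path.resolve(path.dirname(fileURLToPath(import.meta.url)), '..');
const t = Date.now();
const p = execSync('node scripts/search.js "今日黄金价格" --max=8', {
  encoding: 'utf8',
  stdio: ['pipe', 'pipe', 'pipe'],
  timeout: 60000,
  cwd: SKILL_ROOT,
});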
console.log('耗时:', ((Date.now() - t) / 1000).toFixed(1), '秒');
console.log('结果数:', JSON.parse(p).length);

FILE:scripts/_bench2.js
const start = Date.now();
process.argv = ['node', 'scripts/search.js', '今日黄金价格', '--max=8'];
import('./search.js').catch(() => {}).finally(() => {
  // search.js may call process.exit itself, so this callback is not guaranteed to run
});

FILE:scripts/_debug_baidu_box.js
#!/usr/bin/env node
/**
 * Debug: inspect the search-box selectors on the Baidu home page
 */
const { chromium } = await import('playwright');
const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36';

const browser = await chromium.launch({ headless: false, args: ['--disable-blink-features=AutomationControlled'] });
const context = await browser.newContext({ userAgent: UA, locale: 'zh-CN', viewport: { width: 1920, height: 1080 } });
await context.addInitScript(() => {
  Object.defineProperty(navigator, 'webdriver', { get: () => false });
  window.chrome = { runtime: {} };
});

const page = await context.newPage();
await page.goto('https://www.baidu.com/', { waitUntil: 'domcontentloaded', timeout: 15000 });
await page.waitForTimeout(2000);

// 列出所有input
const inputs = await page.evaluate(() => {
  return Array.from(document.querySelectorAll('input')).map(el => ({
    id: el.id,
    name: el.name,
    type: el.type,
    className: el.className,
    placeholder: el.placeholder,
  }));
});
console.log('Inputs:', JSON.stringify(inputs, null, 2));

// 试搜索
const query = '今日黄金价格';
const searchBox = await page.$('#kw') || await page.$('input[name="wd"]');
if (searchBox) {
  console.log('找到搜索框:', await searchBox.evaluate(el => ({ id: el.id, name: el.name })));
  await searchBox.fill(query);
  await page.waitForTimeout(300);
  await page.keyboard.press('Enter');
  await page.waitForLoadState('domcontentloaded', { timeout: 15000 });
  await page.waitForTimeout(2000);
  
  const results = await page.evaluate(() => {
    const items = [];
    document.querySelectorAll('.result h3 a, .c-container h3 a').forEach(a => {
      items.push(a.textContent.trim().slice(0, 50));
    });
    return items;
  });
  console.log('\n百度搜索结果前5条:');
  results.slice(0, 5).forEach((t, i) => console.log(`  ${i + 1}. ${t}`));
} else {
  console.log('未找到搜索框');
}

await browser.close();

FILE:scripts/_debug_baidu_pw.js
#!/usr/bin/env node
/**
 * Search Baidu with Playwright and inspect the results
 */
const { chromium } = await import('playwright');

const browser = await chromium.launch({ headless: false, args: ['--disable-blink-features=AutomationControlled'] });
const page = await browser.newPage();
await page.addInitScript(() => {
  Object.defineProperty(navigator, 'webdriver', { get: () => false });
  window.chrome = { runtime: {} };
});

// 先访问百度首页
await page.goto('https://www.baidu.com', { waitUntil: 'domcontentloaded', timeout: 10000 });
await page.waitForTimeout(1000);

// 搜索
const query = '今日黄金价格';
console.log('Baidu search:', query);
await page.goto('https://www.baidu.com/s?wd=' + encodeURIComponent(query), {
  waitUntil: 'domcontentloaded', timeout: 15000,
});
await page.waitForTimeout(2000);

const results = await page.evaluate(() => {
  const items = [];
  document.querySelectorAll('.result h3 a, .c-container h3 a').forEach(a => {
    items.push({
      title: a.textContent.trim().slice(0, 60),
      href: a.href,
    });
  });
  return items;
});

const html = await page.content();
console.log('\nHas cngold:', html.includes('cngold'));
console.log('Has finance.sina:', html.includes('finance.sina'));
console.log('Has 16fan:', html.includes('16fan'));
console.log('Has kekegold:', html.includes('kekegold'));

console.log('\nTop 10:');
results.slice(0, 10).forEach((r, i) => {
  console.log(`  ${i + 1}. ${r.title}`);
  console.log(`     ${r.href.slice(0, 80)}`);
});

await browser.close();

FILE:scripts/_debug_bing.js
#!/usr/bin/env node
/**
 * Debug: inspect the raw search results returned by Bing CN
 */
const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36';

const query = '今日黄金价格';
console.log('Query:', query);

const url = 'https://cn.bing.com/search?' + new URLSearchParams({ q: query });
console.log('URL:', url);

const r = await fetch(url, {
  headers: {
    'User-Agent': UA,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
  },
  redirect: 'follow',
});

console.log('Status:', r.status);
const html = await r.text();
console.log('HTML length:', html.length);

// Extract results
const { load } = await import('cheerio');
const $ = load(html);

const results = [];
$('li.b_algo').each((i, el) => {
  const $el = $(el);
  const $a = $el.find('h2 a');
  if (!$a.length) return;
  const title = $a.text().trim();
  const href = $a.attr('href') || '';
  const snippet = $el.find('.b_caption p').text().trim();
  
  results.push({
    index: i + 1,
    title: title.slice(0, 60),
    href: href.slice(0, 80),
    snippet: snippet.slice(0, 60)
  });
});

console.log('\n=== Bing CN Results ===');
results.slice(0, 10).forEach(r => {
  console.log(`${r.index}. ${r.title}`);
  console.log(`   href: ${r.href}`);
  console.log(`   snippet: ${r.snippet}`);
  console.log('');
});

// Check whether the first page mentions 金投网 (cngold.org)
const hasCngold = html.includes('cngold.org') || html.includes('金投网');
const hasSina = html.includes('finance.sina') || html.includes('新浪财经');
console.log('HTML contains cngold.org/金投网:', hasCngold);
console.log('HTML contains finance.sina/新浪财经:', hasSina);

FILE:scripts/_debug_bing2.js
#!/usr/bin/env node
/**
 * Step-by-step investigation of why Bing CN search results differ
 * Compares results under different request header / cookie combinations
 */
const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36';
const query = '今日黄金价格';

async function testBing(label, url, headers) {
  try {
    const r = await fetch(url, { headers, redirect: 'follow', signal: AbortSignal.timeout(10000) });
    const html = await r.text();
    const { load } = await import('cheerio');
    const $ = load(html);
    
    const results = [];
    $('li.b_algo').each((i, el) => {
      const $a = $(el).find('h2 a');
      if ($a.length) results.push($a.text().trim().slice(0, 50));
    });
    
    const hasCngold = html.includes('cngold');
    const hasSina = html.includes('finance.sina');
    const has16fan = html.includes('16fan');
    
    console.log(`\n=== ${label} ===`);
    console.log(`Status: ${r.status}, HTML: ${html.length} bytes`);
    console.log(`Has cngold: ${hasCngold}, has finance.sina: ${hasSina}, has 16fan: ${has16fan}`);
    console.log(`Top 3:`);
    results.slice(0, 3).forEach((t, i) => console.log(`  ${i + 1}. ${t}`));
  } catch (e) {
    console.log(`\n=== ${label} === FAILED: ${e.message}`);
  }
}

// Test 1: what the skill currently does (minimal headers)
await testBing('1. Current skill approach (minimal headers)',
  'https://cn.bing.com/search?q=' + encodeURIComponent(query),
  {
    'User-Agent': UA,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
  }
);

// Test 2: add more standard browser headers
await testBing('2. Full browser headers',
  'https://cn.bing.com/search?q=' + encodeURIComponent(query),
  {
    'User-Agent': UA,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7',
    'Accept-Encoding': 'gzip, deflate, br',
    'Cache-Control': 'max-age=0',
    'Sec-Ch-Ua': '"Chromium";v="136", "Google Chrome";v="136", "Not-A.Brand";v="99"',
    'Sec-Ch-Ua-Mobile': '?0',
    'Sec-Ch-Ua-Platform': '"Windows"',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
  }
);

// Test 3: use www.bing.com instead of cn.bing.com
await testBing('3. www.bing.com + zh-CN',
  'https://www.bing.com/search?q=' + encodeURIComponent(query) + '&setlang=zh-CN&cc=cn',
  {
    'User-Agent': UA,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
  }
);

// Test 4: cn.bing.com + FORM=R5FD1 (standard Bing CN parameter)
await testBing('4. cn.bing.com + FORM=R5FD1',
  'https://cn.bing.com/search?q=' + encodeURIComponent(query) + '&FORM=R5FD1',
  {
    'User-Agent': UA,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
  }
);

// Test 5: visit the cn.bing.com home page first to pick up cookies, then search
console.log('\n=== 5. Fetch cookies, then search ===');
try {
  // Visit the home page first
  const homeR = await fetch('https://cn.bing.com/', {
    headers: { 'User-Agent': UA, 'Accept': 'text/html' },
    redirect: 'follow', signal: AbortSignal.timeout(5000),
  });
  const homeHtml = await homeR.text();
  console.log('Home page status:', homeR.status, 'size:', homeHtml.length);
  
  // Inspect Set-Cookie
  // Note: Node.js fetch doesn't expose Set-Cookie easily, but let's check
  console.log('Home page headers:', Object.fromEntries(homeR.headers.entries()));
  
  // Then search. No Cookie header is sent below, so this request does not actually reuse the cookies.
  await testBing('5a. Search after fetching cookies',
    'https://cn.bing.com/search?q=' + encodeURIComponent(query) + '&FORM=R5FD1',
    {
      'User-Agent': UA,
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'zh-CN,zh;q=0.9',
    }
  );
} catch (e) {
  console.log('Cookie test failed:', e.message);
}

FILE:scripts/_debug_bing3.js
#!/usr/bin/env node
/**
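 * Test Bing CN search with a cookie jar through undici
 * and see whether results differ once cookies are sent.
 * Caveat: as far as we can tell, undici does not export a `CookieJar` class, and its
 * fetch expects the dispatcher inside the options object rather than as a third argument,
 * so this script may fail as written; _debug_bing_cookie.js tracks cookies by hand instead.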
 */
import pkg from 'undici';
const { CookieJar, fetch: undiciFetch } = pkg;

const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36';
const query = '今日黄金价格';

const jar = new CookieJar();

// Step 1: visit the Bing home page so the cookie jar collects cookies
console.log('Step 1: visiting the cn.bing.com home page...');
const homeR = await undiciFetch('https://cn.bing.com/', {
  headers: { 'User-Agent': UA, 'Accept': 'text/html' },
  redirect: 'follow',
  signal: AbortSignal.timeout(5000),
}, { dispatcher: jar });
console.log('Home page status:', homeR.status);

// Inspect what is in the cookie jar
const cookies = await jar.getCookies('https://cn.bing.com');
console.log('Cookie count:', cookies.length);
cookies.forEach(c => console.log(`  ${c.key}=${String(c.value).slice(0, 30)}...`));

// Step 2: search with cookies
console.log('\nStep 2: searching with cookies...');
const searchR = await undiciFetch('https://cn.bing.com/search?q=' + encodeURIComponent(query), {
  headers: {
    'User-Agent': UA,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
  },
  redirect: 'follow',
  signal: AbortSignal.timeout(10000),
}, { dispatcher: jar });

const html = await searchR.text();
console.log('Search status:', searchR.status, 'HTML:', html.length, 'bytes');

const { load } = await import('cheerio');
const $ = load(html);

const results = [];
$('li.b_algo').each((i, el) => {
  const $a = $(el).find('h2 a');
  if ($a.length) results.push($a.text().trim().slice(0, 60));
});

console.log('\nHas cngold:', html.includes('cngold'));
console.log('Has finance.sina:', html.includes('finance.sina'));
console.log('Has 16fan:', html.includes('16fan'));
console.log('\nTop 5:');
results.slice(0, 5).forEach((t, i) => console.log(`  ${i + 1}. ${t}`));

FILE:scripts/_debug_bing_cookie.js
#!/usr/bin/env node
/**
 * Test Bing CN using undici's Agent plus hand-rolled cookie handling
 * Node.js 24 bundles undici; setGlobalDispatcher installs the dispatcher, and cookies are tracked manually below
 */
import { Agent, setGlobalDispatcher, fetch } from 'undici';

// Install an Agent as the global dispatcher (the Agent itself does not store cookies)
const agent = new Agent({ connect: { rejectUnauthorized: true } });
setGlobalDispatcher(agent);

const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36';
const query = '今日黄金价格';

// Manage cookies manually
const cookies = new Map();

function extractCookies(response, url) {
  const setCookie = response.headers.getSetCookie?.() || [];
  for (const c of setCookie) {
    const [kv] = c.split(';');
    const [k, ...v] = kv.split('=');
    cookies.set(k.trim(), v.join('='));
  }
}

function cookieHeader(url) {
  if (cookies.size === 0) return '';
  return Array.from(cookies.entries()).map(([k, v]) => `${k}=${v}`).join('; ');
}

// Step 1: visit the Bing home page to pick up cookies
console.log('Step 1: visiting the cn.bing.com home page...');
const homeR = await fetch('https://cn.bing.com/', {
  headers: {
    'User-Agent': UA,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
  },
  redirect: 'follow',
  signal: AbortSignal.timeout(5000),
});
const homeHtml = await homeR.text();
extractCookies(homeR, 'https://cn.bing.com');
console.log('Home page status:', homeR.status);
console.log('Cookie:', cookieHeader('https://cn.bing.com').slice(0, 100));

// Step 2: search with cookies
console.log('\nStep 2: searching with cookies...');
const searchR = await fetch('https://cn.bing.com/search?q=' + encodeURIComponent(query), {
  headers: {
    'User-Agent': UA,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cookie': cookieHeader('https://cn.bing.com'),
  },
  redirect: 'follow',
  signal: AbortSignal.timeout(10000),
});
const html = await searchR.text();
extractCookies(searchR, 'https://cn.bing.com');

const { load } = await import('cheerio');
const $ = load(html);

const results = [];
$('li.b_algo').each((i, el) => {
  const $a = $(el).find('h2 a');
  if ($a.length) results.push({ title: $a.text().trim().slice(0, 60), url: $a.attr('href') });
});

console.log('\nHas cngold:', html.includes('cngold'));
console.log('Has finance.sina:', html.includes('finance.sina'));
console.log('Has 16fan:', html.includes('16fan'));
console.log('\nTop 5:');
results.slice(0, 5).forEach((r, i) => console.log(`  ${i + 1}. ${r.title}\n     ${r.url?.slice(0, 80)}`));

FILE:scripts/_debug_bing_full.js
#!/usr/bin/env node
/**
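 * Investigate why Bing CN results differ:
 * 1. Encoding (URL-encoded vs raw UTF-8)
 * 2. Cookies (visit the home page first to pick up cookies)
 * 3. Anti-bot measures (Playwright with stronger stealth)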
 */

const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36';
const query = '今日黄金价格';

// ===== Test 1: encoding =====
console.log('=== Test 1: encoding comparison ===');
const url1 = 'https://cn.bing.com/search?q=' + encodeURIComponent(query);
const url2 = 'https://cn.bing.com/search?q=' + query;  // left unencoded; fetch handles it
console.log('encodeURIComponent:', url1);
console.log('raw UTF-8:', url2);
console.log('');

// ===== Test 2: Playwright with stronger stealth =====
console.log('=== Test 2: Playwright with stronger stealth ===');
const { chromium } = await import('playwright');

const browser = await chromium.launch({
  headless: false,
  args: [
    '--disable-blink-features=AutomationControlled',
    '--disable-features=IsolateOrigins,site-per-process',
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage',
    '--disable-web-security',
  ],
});

const context = await browser.newContext({
  userAgent: UA,
  locale: 'zh-CN',
  viewport: { width: 1920, height: 1080 },
  // Mimic a real browser environment
  extraHTTPHeaders: {
    'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7',
  },
});

// Inject an anti-detection script
await context.addInitScript(() => {
  // Hide webdriver
  Object.defineProperty(navigator, 'webdriver', { get: () => false });
  // Add a chrome object
  window.chrome = { runtime: {}, loadTimes: function(){}, csi: function(){} };
  // Patch permissions
  const origQuery = window.navigator.permissions?.query;
  if (origQuery) {
    // Call the original with its receiver; a detached native method throws "Illegal invocation"
    window.navigator.permissions.query = (params) => (
      params.name === 'notifications' ? Promise.resolve({ state: Notification.permission }) : origQuery.call(window.navigator.permissions, params)
    );
  }
  // Patch plugins
  Object.defineProperty(navigator, 'plugins', {
    get: () => [1, 2, 3, 4, 5],
  });
  // Patch languages
  Object.defineProperty(navigator, 'languages', {
    get: () => ['zh-CN', 'zh', 'en-US', 'en'],
  });
});

const page = await context.newPage();

// Visit the Bing home page first so the browser picks up cookies naturally
console.log('Step 1: visiting the cn.bing.com home page...');
await page.goto('https://cn.bing.com/', { waitUntil: 'domcontentloaded', timeout: 15000 });
await page.waitForTimeout(2000);

// Inspect cookies
const cookies = await context.cookies('https://cn.bing.com');
console.log('Cookie count:', cookies.length);
cookies.forEach(c => console.log(`  ${c.name}=${c.value.slice(0, 30)}...`));

// Step 2: type the query into the home page search box (mimics real user behavior)
console.log('\nStep 2: typing the query into the search box...');
try {
  const searchBox = await page.$('#sb_form_q');
  if (searchBox) {
    await searchBox.click();
    await searchBox.fill(query);
    await page.waitForTimeout(500);
    // Press Enter to search
    await page.keyboard.press('Enter');
    await page.waitForLoadState('domcontentloaded', { timeout: 15000 });
    await page.waitForTimeout(3000);
    console.log('Search via the search box succeeded');
  } else {
    console.log('Search box not found; searching by URL directly');
    await page.goto('https://cn.bing.com/search?q=' + encodeURIComponent(query), {
      waitUntil: 'domcontentloaded', timeout: 15000,
    });
    await page.waitForTimeout(3000);
  }
} catch (e) {
  console.log('Search box search failed, falling back to URL:', e.message.slice(0, 50));
  await page.goto('https://cn.bing.com/search?q=' + encodeURIComponent(query), {
    waitUntil: 'domcontentloaded', timeout: 15000,
  });
  await page.waitForTimeout(3000);
}

// Extract results
const results = await page.evaluate(() => {
  const items = [];
  document.querySelectorAll('li.b_algo').forEach(el => {
    const a = el.querySelector('h2 a');
    if (a) items.push({
      title: a.textContent.trim().slice(0, 60),
      url: a.href,
      snippet: el.querySelector('.b_caption p')?.textContent?.trim().slice(0, 60) || '',
    });
  });
  return items;
});

const html = await page.content();
console.log('\nHas cngold:', html.includes('cngold'));
console.log('Has finance.sina:', html.includes('finance.sina'));
console.log('Has 16fan:', html.includes('16fan'));
console.log('Has huilvbiao:', html.includes('huilvbiao'));
console.log('Has jinjia:', html.includes('jinjia') || html.includes('94723'));
console.log('Has kekegold:', html.includes('kekegold'));

console.log('\nTop 10:');
results.slice(0, 10).forEach((r, i) => {
  console.log(`  ${i + 1}. ${r.title}`);
  console.log(`     ${r.url?.slice(0, 80)}`);
});

// Check the current URL
console.log('\nCurrent page URL:', page.url());

await browser.close();

FILE:scripts/_debug_bing_pw.js
#!/usr/bin/env node
/**
 * Search Bing CN with a real Playwright browser and see whether the results differ
 */
const { chromium } = await import('playwright');

const browser = await chromium.launch({ headless: false, args: ['--disable-blink-features=AutomationControlled'] });
const page = await browser.newPage();
await page.addInitScript(() => {
  Object.defineProperty(navigator, 'webdriver', { get: () => false });
  window.chrome = { runtime: {} };
});

// Open the Bing CN search page
const query = '今日黄金价格';
console.log('Navigating to Bing CN...');
await page.goto('https://cn.bing.com/search?q=' + encodeURIComponent(query), {
  waitUntil: 'domcontentloaded', timeout: 15000,
});
await page.waitForTimeout(2000);

const results = await page.evaluate(() => {
  const items = [];
  document.querySelectorAll('li.b_algo').forEach(el => {
    const a = el.querySelector('h2 a');
    if (a) items.push({
      title: a.textContent.trim().slice(0, 60),
      url: a.href,
      snippet: el.querySelector('.b_caption p')?.textContent?.trim().slice(0, 60) || '',
    });
  });
  return items;
});

const html = await page.content();
console.log('\nHas cngold:', html.includes('cngold'));
console.log('Has finance.sina:', html.includes('finance.sina'));
console.log('Has 16fan:', html.includes('16fan'));

console.log('\nTop 10:');
results.slice(0, 10).forEach((r, i) => {
  console.log(`  ${i + 1}. ${r.title}`);
  console.log(`     ${r.url}`);
});

await browser.close();

FILE:scripts/_debug_cookie_combos.js
#!/usr/bin/env node
/**
 * Test: get cookies via Playwright, then search Bing via fetch with cookies + the form=QBLH parameter
 */
const { chromium } = await import('playwright');

const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36';
const query = '今日黄金价格';

// Step 1: get cookies via Playwright
console.log('Step 1: getting cookies via Playwright...');
const browser = await chromium.launch({ headless: false, args: ['--disable-blink-features=AutomationControlled'] });
const context = await browser.newContext({
  userAgent: UA, locale: 'zh-CN', viewport: { width: 1920, height: 1080 },
});
await context.addInitScript(() => {
  Object.defineProperty(navigator, 'webdriver', { get: () => false });
  window.chrome = { runtime: {} };
});
const page = await context.newPage();
await page.goto('https://cn.bing.com/', { waitUntil: 'domcontentloaded', timeout: 15000 });
await page.waitForTimeout(2000);
const cookies = await context.cookies('https://cn.bing.com');
const cookieStr = cookies.map(c => `${c.name}=${c.value}`).join('; ');
console.log(`Got ${cookies.length} cookies`);
await browser.close();

// Step 2: fetch with cookies across different URL parameter combinations
const tests = [
  ['cookie + form=QBLH', `https://cn.bing.com/search?q=${encodeURIComponent(query)}&form=QBLH`],
  ['cookie + form=QBLH + cvid', `https://cn.bing.com/search?q=${encodeURIComponent(query)}&form=QBLH&sp=-1&lq=0&pq=&sc=12-0&qs=n&sk=&cvid=${crypto.randomUUID().replace(/-/g,'').slice(0,32)}`],
  ['cookie + FORM=R5FD1', `https://cn.bing.com/search?q=${encodeURIComponent(query)}&FORM=R5FD1`],
  ['cookie only', `https://cn.bing.com/search?q=${encodeURIComponent(query)}`],
  ['no cookie + form=QBLH', `https://cn.bing.com/search?q=${encodeURIComponent(query)}&form=QBLH`],
];

for (const [label, url] of tests) {
  const useCookie = !label.startsWith('no cookie');
  try {
    const headers = {
      'User-Agent': UA,
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'zh-CN,zh;q=0.9',
    };
    if (useCookie) headers['Cookie'] = cookieStr;
    
    const r = await fetch(url, { headers, redirect: 'follow', signal: AbortSignal.timeout(8000) });
    const html = await r.text();
    const { load } = await import('cheerio');
    const $ = load(html);
    const results = [];
    $('li.b_algo').each((i, el) => {
      const $a = $(el).find('h2 a');
      if ($a.length) results.push($a.text().trim().slice(0, 50));
    });
    
    console.log(`\n=== ${label} ===`);
    console.log(`Has cngold: ${html.includes('cngold')}, has 16fan: ${html.includes('16fan')}`);
    console.log(`Top 3: ${results.slice(0, 3).join(' | ')}`);
  } catch (e) {
    console.log(`\n=== ${label} === FAILED: ${e.message}`);
  }
}

FILE:scripts/_debug_cookie_fetch.js
#!/usr/bin/env node
/**
 * Test: get cookies via Playwright, then search Bing via fetch with those cookies
 */
const { chromium } = await import('playwright');

const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36';
const query = '今日黄金价格';

// Step 1: get cookies via Playwright
console.log('Step 1: getting cookies via Playwright...');
const browser = await chromium.launch({ headless: false, args: ['--disable-blink-features=AutomationControlled'] });
const context = await browser.newContext({
  userAgent: UA,
  locale: 'zh-CN',
  viewport: { width: 1920, height: 1080 },
});
await context.addInitScript(() => {
  Object.defineProperty(navigator, 'webdriver', { get: () => false });
  window.chrome = { runtime: {} };
});

const page = await context.newPage();
await page.goto('https://cn.bing.com/', { waitUntil: 'domcontentloaded', timeout: 15000 });
await page.waitForTimeout(2000);

const cookies = await context.cookies('https://cn.bing.com');
const cookieStr = cookies.map(c => `${c.name}=${c.value}`).join('; ');
console.log(`Got ${cookies.length} cookies, total length ${cookieStr.length}`);

await browser.close();

// Step 2: search Bing via fetch with cookies
console.log('\nStep 2: searching via fetch with cookies...');
const r = await fetch('https://cn.bing.com/search?q=' + encodeURIComponent(query), {
  headers: {
    'User-Agent': UA,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cookie': cookieStr,
  },
  redirect: 'follow',
  signal: AbortSignal.timeout(10000),
});

const html = await r.text();
console.log('Status:', r.status, 'HTML:', html.length);

const { load } = await import('cheerio');
const $ = load(html);

const results = [];
$('li.b_algo').each((i, el) => {
  const $a = $(el).find('h2 a');
  if ($a.length) results.push($a.text().trim().slice(0, 60));
});

console.log('\nHas cngold:', html.includes('cngold'));
console.log('Has finance.sina:', html.includes('finance.sina'));
console.log('Has 16fan:', html.includes('16fan'));
console.log('\nTop 5:');
results.slice(0, 5).forEach((t, i) => console.log(`  ${i + 1}. ${t}`));

FILE:scripts/_debug_ddg.js
#!/usr/bin/env node
/**
 * Debug DDG HTML Lite
 */
const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36';
const query = 'python tutorial';

console.log('Test 1: DDG HTML Lite POST');
try {
  const r = await fetch('https://html.duckduckgo.com/html/', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/x-www-form-urlencoded',
      'User-Agent': UA,
    },
    body: 'q=' + encodeURIComponent(query),
    signal: AbortSignal.timeout(10000),
  });
  console.log('Status:', r.status, 'Size:', (await r.clone().text()).length);
  const html = await r.text();
  const { load } = await import('cheerio');
  const $ = load(html);
  const results = [];
  $('.result, .web-result').each((i, el) => {
    const $a = $(el).find('.result__title a, .result__a, h2 a').first();
    if ($a.length) results.push($a.text().trim().slice(0, 50));
  });
  console.log('Results:', results.length);
  results.slice(0, 5).forEach((t, i) => console.log(`  ${i + 1}. ${t}`));
} catch (e) {
  console.log('Failed:', e.message);
}

console.log('\nTest 2: DDG HTML Lite GET');
try {
  const r = await fetch('https://html.duckduckgo.com/html/?q=' + encodeURIComponent(query), {
    headers: { 'User-Agent': UA },
    signal: AbortSignal.timeout(10000),
  });
  console.log('Status:', r.status, 'Size:', (await r.clone().text()).length);
  const html = await r.text();
  const { load } = await import('cheerio');
  const $ = load(html);
  const results = [];
  $('.result, .web-result').each((i, el) => {
    const $a = $(el).find('.result__title a, .result__a, h2 a').first();
    if ($a.length) results.push($a.text().trim().slice(0, 50));
  });
  console.log('Results:', results.length);
  results.slice(0, 5).forEach((t, i) => console.log(`  ${i + 1}. ${t}`));
} catch (e) {
  console.log('Failed:', e.message);
}

console.log('\nTest 3: Bing International');
try {
  const r = await fetch('https://www.bing.com/search?q=' + encodeURIComponent(query), {
    headers: {
      'User-Agent': UA,
      'Accept-Language': 'en-US,en;q=0.9',
    },
    signal: AbortSignal.timeout(10000),
    redirect: 'follow',
  });
  console.log('Status:', r.status, 'Size:', (await r.clone().text()).length);
  const html = await r.text();
  const { load } = await import('cheerio');
  const $ = load(html);
  const results = [];
  $('li.b_algo').each((i, el) => {
    const $a = $(el).find('h2 a');
    if ($a.length) results.push($a.text().trim().slice(0, 50));
  });
  console.log('Results:', results.length);
  results.slice(0, 5).forEach((t, i) => console.log(`  ${i + 1}. ${t}`));
} catch (e) {
  console.log('Failed:', e.message);
}

FILE:scripts/_debug_searchbox.js
#!/usr/bin/env node
/**
 * Debug: what Bing actually searches for after Chinese text is typed into the search box
 */
const { chromium } = await import('playwright');
const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36';

const browser = await chromium.launch({ headless: false, args: ['--disable-blink-features=AutomationControlled'] });
const context = await browser.newContext({ userAgent: UA, locale: 'zh-CN', viewport: { width: 1920, height: 1080 } });
await context.addInitScript(() => {
  Object.defineProperty(navigator, 'webdriver', { get: () => false });
  window.chrome = { runtime: {} };
});

const page = await context.newPage();
await page.goto('https://cn.bing.com/', { waitUntil: 'domcontentloaded', timeout: 15000 });
await page.waitForTimeout(2000);

// Type the query
const query = '怎么做红烧肉';
const searchBox = await page.$('#sb_form_q');
await searchBox.click();
await searchBox.fill(query);
await page.waitForTimeout(500);

// Inspect the search box value
const inputValue = await page.evaluate(() => document.getElementById('sb_form_q').value);
console.log('Search box value:', inputValue);

// Press Enter
await page.keyboard.press('Enter');
await page.waitForLoadState('domcontentloaded', { timeout: 15000 });
await page.waitForTimeout(2000);

// Inspect the final URL
console.log('Final URL:', page.url());

// Inspect the results
const results = await page.evaluate(() => {
  const items = [];
  document.querySelectorAll('li.b_algo').forEach(el => {
    const a = el.querySelector('h2 a');
    if (a) items.push(a.textContent.trim().slice(0, 50));
  });
  return items;
});

console.log('\nTop 5:');
results.slice(0, 5).forEach((t, i) => console.log(`  ${i + 1}. ${t}`));

await browser.close();

FILE:scripts/_time_one.js
#!/usr/bin/env node
/**
 * Run a single query and print the elapsed time plus result count
 */
const q = process.argv[2] || '今日黄金价格';
const max = process.argv[3] || '10';

// Overwrite process.argv so that search.js runs with these arguments
process.argv = [process.argv[0], 'scripts/search.js', q, '--max=' + max];

const t = Date.now();
try {
  await import('./search.js');
} catch {}
// search.js normally calls process.exit; this line runs only if it did not:
console.error('\nTotal time:', ((Date.now() - t) / 1000).toFixed(1), 's');

ClawHub Frontend Backend+2


prompt-optimizer-en

Skill

Iterative prompt optimizer for complex tasks. Strictly implements ACON's two-stage iterative optimization + APE automatic prompt engineering. Only triggers w...

---
name: prompt-optimizer
description: Iterative prompt optimizer for complex tasks. Strictly implements ACON's two-stage iterative optimization + APE automatic prompt engineering. Only triggers when user explicitly requests it, actively collects feedback after optimization, supports multi-round iteration until satisfied.
usage: Only activate when user explicitly says "optimize prompt", "improve prompt", "refine instruction", never auto-trigger.
author: Based on arXiv:2510.00615 (ACON), arXiv:2211.01910 (APE)
license: MIT
tags:
  - prompt-optimization
  - acon
  - ape
  - iterative
  - complex-tasks
---

## Atomic Optimization Methodology

### 🔬 Stage 1: Input Parsing & Critical Signal Extraction (ACON Paper §3.1)
**Input**: User's original prompt
**Operations**:
1. Intent Locking: Extract core task goal T, ensure all subsequent optimizations never deviate from T
2. Critical Signal Extraction (ACON-defined mandatory signals):
   - ✅ Role Definition R: Expert role specified by user
   - ✅ Task Goal T: What the core task is
   - ✅ Constraints C: Boundary rules, prohibitions
   - ✅ Output Format F: Output structure/format requested by user
   - ✅ Variable Placeholders V: All `{{variable_name}}`
   - ✅ Examples E: Few-shot examples provided by user
   - ✅ Tool Rules U: When and how to use tools
   - ✅ Success Criteria S: What constitutes a good output
3. Baseline Measurement: Record original prompt token length L₀
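
The placeholder signal V can be checked mechanically. The following is a minimal sketch, assuming placeholders use the `{{variable_name}}` form listed above; the helper names `extractPlaceholders` and `missingPlaceholders` are hypothetical and not part of the skill.

```javascript
// Hypothetical helper: collect every {{name}} placeholder found in a prompt.
function extractPlaceholders(prompt) {
  const found = new Set();
  // Matches {{ name }} with optional inner whitespace; the name may not contain braces.
  const pattern = /\{\{\s*([^{}]+?)\s*\}\}/g;
  for (const match of prompt.matchAll(pattern)) {
    found.add(match[1]);
  }
  return found;
}

// Hypothetical helper: placeholders present in the original but absent from the rewrite.
function missingPlaceholders(original, optimized) {
  const kept = extractPlaceholders(optimized);
  return [...extractPlaceholders(original)].filter(name => !kept.has(name));
}

const original = 'Translate {{text}} into {{language}} in a formal tone.';
const optimized = 'You are a translator. Render {{text}} formally.';
console.log(missingPlaceholders(original, optimized)); // [ 'language' ]
```

An empty result means every placeholder survived; a non-empty one lists the names a later validation step would need to restore.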

---

### 🚀 Stage 2: APE Utility Enhancement (arXiv:2211.01910 Automatic Prompt Engineering)
**Goal**: Turn vague prompts into expert-level instructions, improve utility
**Operations (Strict Order)**:
1. Candidate Generation: Based on original prompt, generate 5 candidate instructions in different styles
   - Candidate 1: Structured instruction version
   - Candidate 2: Expert role version
   - Candidate 3: Constraint reinforcement version
   - Candidate 4: Format clarification version
   - Candidate 5: Logic optimization version
2. Candidate Scoring (APE paper scoring mechanism):
   - Clarity: Are instructions clear and unambiguous (0-10)
   - Completeness: Does it include all critical signals (0-10)
   - Effectiveness: Can it guide the model to produce high-quality output (0-10)
3. Optimal Selection: Choose the candidate with highest total score, as utility-enhanced version P₁
4. Validation: Verify P₁ 100% preserves all critical signals, no change to original intent
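
The scoring and selection steps above can be sketched as a simple sum-and-pick. This is an illustration only: it assumes the three criteria are weighted equally (the text says only "highest total score"), and the candidate object shape and the scores themselves are made up for the example.

```javascript
// Hypothetical candidate shape: { name, scores: { clarity, completeness, effectiveness } }
function totalScore(candidate) {
  const { clarity, completeness, effectiveness } = candidate.scores;
  return clarity + completeness + effectiveness; // each criterion is 0-10, so the total is 0-30
}

// Pick the candidate with the highest total; on a tie the earliest candidate wins.
function selectBest(candidates) {
  if (candidates.length === 0) throw new Error('no candidates to score');
  return candidates.reduce((best, c) => (totalScore(c) > totalScore(best) ? c : best));
}

// Illustrative scores, not measured data.
const candidates = [
  { name: 'structured', scores: { clarity: 8, completeness: 7, effectiveness: 7 } },
  { name: 'expert-role', scores: { clarity: 7, completeness: 9, effectiveness: 8 } },
  { name: 'constraint-reinforced', scores: { clarity: 6, completeness: 9, effectiveness: 7 } },
];
const p1 = selectBest(candidates);
console.log(p1.name, totalScore(p1)); // expert-role 24
```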

---

### 📦 Stage 3: ACON Compression Optimization (ACON Paper §3.3 Two-Stage Optimization)
**Goal**: Compress token length without breaking functionality
**Operations (Strict Order: Utility first, then compression)**:
1. Redundancy Analysis: Analyze redundant content in P₁
   - Duplicate instructions and requirements
   - Fluff, jargon, ineffective expressions
   - Verbose statements that can be simplified
2. Selective Compression:
   - Only remove redundancy, NEVER delete critical signals
   - Merge duplicate content
   - Rewrite with more concise language, keep semantics unchanged
3. Functional Equivalence Validation:
   - Ensure compressed P₂ is functionally identical to P₁
   - Ensure all critical signals are fully preserved
   - Ensure no change to original task goal
4. Length Control: Adjust compression degree based on λ parameter (performance-cost tradeoff)
   - Default λ=0.5: Balanced mode
   - If the user reports "too long", automatically raise λ to 0.8 for more compression
   - If the user reports "not effective enough", automatically lower λ to 0.2 to reduce compression
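
The λ rules above can be sketched as a small lookup. The three λ values come from the text; the feedback labels and the `targetLength` formula (λ as the fraction of redundant tokens to remove) are assumptions for illustration, since the skill does not define how λ maps to a length.

```javascript
// λ values are taken from the rules above; the feedback labels are hypothetical.
const DEFAULT_LAMBDA = 0.5;

function adjustLambda(currentLambda, feedback) {
  switch (feedback) {
    case 'too_long':
      return 0.8; // compress more
    case 'not_effective':
      return 0.2; // compress less
    default:
      return currentLambda; // unrelated feedback leaves λ unchanged
  }
}

// Assumed interpretation: λ scales how many of the redundant tokens get cut.
function targetLength(p1Tokens, redundantTokens, lambda) {
  return p1Tokens - Math.round(redundantTokens * lambda);
}

let lambda = DEFAULT_LAMBDA;
console.log(targetLength(400, 100, lambda)); // 350
lambda = adjustLambda(lambda, 'too_long');
console.log(targetLength(400, 100, lambda)); // 320
```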

---

### 📤 Stage 4: Output & Feedback Collection
**Operations**:
1. Output optimized prompt P₂, wrapped in code block for easy copying
2. Actively ask for user feedback:
   ```
   Optimization complete. Does this version meet your needs?
   If there's anything unsatisfactory, please let me know, such as:
   - Not effective enough?
   - Still too long?
   - Some constraints/formats not preserved?
   - Other issues?
   I'll continue iterating based on your feedback.
   ```

---

### 🔄 Stage 5: Iterative Optimization (ACON Paper's R-round Iteration Mechanism)
**When user provides feedback, execute the following**:
1. Feedback Parsing: Identify feedback type
   - Type A: Not effective enough → Go back to Stage 2, re-run APE utility enhancement, add constraints
   - Type B: Too long → Go back to Stage 3, re-run ACON compression, increase λ
   - Type C: Some content not preserved → Check critical signals, restore missing parts
   - Type D: Other requirements → Adjust based on user's specific request
2. Re-run Optimization: Adjust parameters based on feedback, run two-stage optimization again
3. Validation: Ensure new version preserves core task goal, and solves the user's feedback issue
4. Output new optimized version, ask for feedback again
5. Repeat until user indicates satisfaction
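
The feedback routing in step 1 can be pictured as a dispatch table. In the sketch below the four types follow the list above, while the keyword patterns and the `routeFeedback` name are illustrative guesses; in practice the classification would be done by reading the feedback rather than by keyword matching.

```javascript
// Types A-C follow the list above; the keyword patterns are illustrative guesses.
const ROUTES = [
  { type: 'A', pattern: /not effective|doesn't work|poor quality/i, action: 'rerun Stage 2 (APE), add constraints' },
  { type: 'B', pattern: /too long|shorter|verbose/i, action: 'rerun Stage 3 (ACON), increase λ' },
  { type: 'C', pattern: /missing|not preserved|dropped/i, action: 'restore missing critical signals' },
];

function routeFeedback(feedback) {
  const hit = ROUTES.find(route => route.pattern.test(feedback));
  // Anything unmatched falls through to Type D.
  return hit
    ? { type: hit.type, action: hit.action }
    : { type: 'D', action: 'adjust per the specific request' };
}

console.log(routeFeedback('Still too long').type);              // B
console.log(routeFeedback('The JSON format was dropped').type); // C
console.log(routeFeedback('Use British spelling').type);        // D
```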

---

## Strict Rules (Guarantee Effectiveness)
- ✅ Every step is validated so the original functionality is never damaged
- ✅ Critical signals are NEVER deleted, 100% preserved
- ✅ Strictly follow "utility first, then compression" order, never reverse
- ✅ Every iteration is re-validated so each round improves on the last
- ✅ For complex tasks, prioritize functional integrity; compression is optional
- ❌ Never auto-trigger, only work when user explicitly requests
- ❌ No comparisons or analysis, only output optimized results
- ❌ No extra explanations unless explicitly requested

ClawHub Coding Product+2


prompt-optimizer-cn

Skill

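Iterative prompt optimizer dedicated to complex tasks. Strictly implements the ACON paper's two-stage iterative optimization plus APE automatic prompt engineering. Triggers only when the user explicitly asks, actively collects feedback after optimizing, and supports multi-round iteration until the user is satisfied.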

---
name: prompt优化器
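description: Iterative prompt optimizer dedicated to complex tasks. Strictly implements the ACON paper's two-stage iterative optimization plus APE automatic prompt engineering. Triggers only when the user explicitly asks, actively collects feedback after optimizing, and supports multi-round iteration until the user is satisfied.
usage: Triggers only when the user explicitly says "优化提示词" (optimize the prompt), "改进prompt" (improve the prompt), or "精炼指令" (refine the instruction); never auto-triggers.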
license: MIT
tags:
  - prompt-optimization
  - acon
  - ape
  - 迭代优化
  - 复杂任务
---


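### Stage 1: Input Parsing and Critical Signal Extraction (ACON Paper 3)
**Input**: The user's original prompt
**Operations**:
1. Intent locking: extract the core task goal T and make sure no later optimization deviates from T
2. Critical signal extraction (signals the ACON paper defines as must-keep):
   - ✅ Role definition R: the expert role specified by the user
   - ✅ Task goal T: what the core task is
   - ✅ Constraints C: boundary rules and prohibitions
   - ✅ Output format F: the output structure and format the user requires
   - ✅ Variable placeholders V: every `{{variable_name}}`
   - ✅ Examples E: few-shot examples provided by the user
   - ✅ Tool rules U: when and how tools are called
   - ✅ Success criteria S: what counts as a good output
3. Baseline measurement: record the original prompt's token length L₀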

---

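### Stage 2: APE Utility Enhancement
**Goal**: turn a vague prompt into an expert-level instruction and raise its utility
**Operations (strict order)**:
1. Candidate generation: from the original prompt, generate 5 candidate instructions in different styles
   - Candidate 1: structured-instruction version
   - Candidate 2: expert-role version
   - Candidate 3: constraint-reinforced version
   - Candidate 4: explicit-format version
   - Candidate 5: logic-optimized version
2. Candidate scoring (the APE paper's scoring mechanism):
   - Clarity: is the instruction explicit and unambiguous (0-10 points)
   - Completeness: does it contain every critical signal (0-10 points)
   - Effectiveness: can it guide the model to produce high-quality output (0-10 points)
3. Best selection: pick the candidate with the highest total score as the utility-enhanced version P₁
4. Validation: check that P₁ preserves 100% of the critical signals and does not change the original intent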
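
The scoring and selection steps above amount to summing three 0-10 scores and taking the maximum. The sketch below shows only that arithmetic; the `Candidate` class, its field names, and the sample scores are hypothetical, and how each score is produced is left to the model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    style: str
    clarity: int        # 0-10
    completeness: int   # 0-10
    effectiveness: int  # 0-10

    @property
    def total(self) -> int:
        """Sum of the three rubric scores (0-30)."""
        return self.clarity + self.completeness + self.effectiveness


def pick_best(candidates: List[Candidate]) -> Candidate:
    """Return the candidate with the highest total; ties go to the earliest one."""
    return max(candidates, key=lambda c: c.total)


# Made-up scores, for illustration only.
pool = [
    Candidate("structured", 8, 7, 7),
    Candidate("expert-role", 7, 8, 8),
    Candidate("constraint-reinforced", 9, 9, 8),
]
best = pick_best(pool)
```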

---

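### Stage 3: ACON Compression (ACON paper, Section 3.3: two-stage optimization)
**Goal**: shrink the token length without breaking functionality
**Operations (strict order, utility first, then compression)**:
1. Redundancy analysis: identify redundant content in P₁
   - Repeated instructions and requirements
   - Filler, boilerplate phrases, and empty statements
   - Wordy expressions that can be tightened
2. Selective compression:
   - Delete only redundancy; never delete critical signals
   - Merge repeated content
   - Rewrite in more concise language while keeping the semantics unchanged
3. Functional equivalence validation:
   - Make sure the compressed P₂ behaves exactly like P₁
   - Make sure every critical signal is fully preserved
   - Make sure the original task goal has not changed
4. Length control: adjust the degree of compression according to the current λ parameter (the performance-cost trade-off)
   - Default λ=0.5: balanced mode
   - If the user says "too long", automatically raise λ to 0.8 and compress further
   - If the user says "not effective", automatically lower λ to 0.2 and compress less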
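
The λ adjustment rule above is simple enough to write down directly. In this sketch the feedback labels `"too_long"` and `"not_effective"` and the function name are assumptions; only the three λ values (0.5, 0.8, 0.2) come from the text.

```python
DEFAULT_LAMBDA = 0.5  # balanced mode


def next_lambda(feedback: str, current: float = DEFAULT_LAMBDA) -> float:
    """Pick the compression weight for the next round from user feedback.

    A higher lambda means more aggressive compression.
    """
    if feedback == "too_long":
        return 0.8  # compress further
    if feedback == "not_effective":
        return 0.2  # compress less, protect utility
    return current  # any other feedback leaves lambda unchanged
```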

---

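### Stage 4: Output and Feedback Collection
**Operations**:
1. Output the optimized prompt P₂ wrapped in a code block so the user can copy it easily
2. Proactively ask the user for feedback:
   ```
   Optimization complete. Does this version meet your needs?
   If anything is unsatisfactory, please tell me, for example:
   - Not effective enough?
   - Still too long?
   - Some constraints/formats not preserved?
   - Other issues?
   I will keep iterating based on your feedback.
   ```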

---

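### Stage 5: Iterative Optimization (the ACON paper's R-round iteration mechanism)
**When the user gives feedback, do the following**:
1. Feedback parsing: identify the type of feedback
   - Type A: not effective enough → go back to Stage 2, re-run APE utility enhancement, add constraints
   - Type B: too long → go back to Stage 3, re-run ACON compression, increase λ
   - Type C: some content not preserved → check the critical signals and restore the missing parts
   - Type D: other requirements → adjust according to the user's specific request
2. Re-run optimization: adjust parameters based on the feedback and run the two-stage optimization again
3. Validation: make sure the new version preserves the core task goal and resolves the issue the user raised
4. Output the new optimized version and ask for feedback again
5. Repeat until the user indicates satisfaction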

---

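## Strict Rules (Guarantee Effectiveness)
- ✅ Validate every step so the original functionality is never damaged
- ✅ Never delete critical signals; preserve 100% of them
- ✅ Strictly follow the "utility first, then compression" order; never reverse it
- ✅ Re-validate on every iteration so each round improves on the last
- ✅ For complex tasks, prioritize functional integrity; compression is optional
- ❌ Never auto-trigger; work only when the user explicitly requests it
- ❌ No comparisons or analysis; output only the optimized result
- ❌ No extra explanations unless the user asks for them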


cloud-product-compare

Skill

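Acting as a senior cloud computing product manager, read the official Alibaba Cloud and Huawei Cloud documentation in depth and produce an evidence-based, differentiated competitive analysis.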

---
name: aliyun-huaweicloud-fullstack-product-competitive-analysis
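description: Acting as a senior cloud computing product manager, read the official Alibaba Cloud and Huawei Cloud documentation in depth and produce an evidence-based, differentiated competitive analysis
author: Senior cloud computing product manager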
version: "4.2.0"
license: Apache-2.0
allowed-tools: web_fetch
---

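# Alibaba Cloud & Huawei Cloud Product Competitive Analysis

## Role
You are a senior B2B cloud computing product manager with more than 10 years of experience. You excel at distilling real product differences from official documentation rather than speaking in generalities. Your analysis directly serves product planning, technology selection, and market decisions.

## Quick Start

### Option 1: Use the scraper script (recommended, fully automated)

The script lives at `scripts/cloud_doc_scraper.py`. It parses the documentation table of contents, fetches the core pages, and outputs markdown.

> **Dependencies must be installed manually**: `pip install playwright httpx beautifulsoup4 && playwright install chromium`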

```bash
python scripts/cloud_doc_scraper.py --product ecs
python scripts/cloud_doc_scraper.py --product oss --output oss_docs.md
python scripts/cloud_doc_scraper.py --product rds --max-pages 15
python scripts/cloud_doc_scraper.py --product ecs --stealth   # Optional: enable stealth mode to handle JS-rendering compatibility issues
python scripts/cloud_doc_scraper.py --list   # List all supported products
```

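**Supported products**: ecs, oss, rds, redis, ack, fc, slb, maxcompute, pai, bailian, cdn, nas, flink, elasticsearch, dws

**Output**: a markdown file containing the original official documentation text from Alibaba Cloud and Huawei Cloud plus their changelogs, ready to paste into an AI for competitive analysis.

**How it works**:
1. Check that dependencies are ready; if any are missing, print the install command and exit
2. Open the documentation home page with Playwright and parse the left-side TOC navigation
3. Select core pages by priority (product introduction > specifications > billing > use cases)
4. Fetch the pages concurrently and output denoised plain text
5. HTTP fallback is supported (when a Playwright fetch fails, httpx+BS4 is used automatically)
6. deep_links are supported (when TOC parsing fails, preconfigured page URLs are used)
7. Stealth mode (`--stealth`) is off by default and handles JS-rendering compatibility issues only when explicitly enabled

### Option 2: Fetch page by page manually with web_fetch

Refer to the "Official Documentation Entry Points" tables below and fetch the documentation page by page with the `web_fetch` tool.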

---

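## Official Documentation Entry Points

### Notes on Documentation URLs

**Huawei Cloud docs**:
- Legacy docs (productdesc-*): some products already return 404 (OBS, RDS, DCS, CDN, SFS, etc.); the script ships with correct deep_links built in
- New docs (SPA index.html): require JS rendering, so web_fetch only gets an empty shell; use the script instead
- Changelog: `https://support.huaweicloud.cn/wtsnew-{product}/index.html`

**Alibaba Cloud docs**:
- Main entry: `https://help.aliyun.com/zh/{product}`
- Changelog: `https://help.aliyun.com/zh/{product}/product-overview/release-notes`
- URLs may change; the script automatically discovers links from the TOC navigation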
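
The URL patterns above can be expanded with plain string templates. This is a minimal sketch: the function name `doc_urls` and the dictionary keys are hypothetical, and a few products in the tables (for example Flink and PAI on the Alibaba Cloud side) use changelog paths that differ from the default pattern, so the tables remain the source of truth.

```python
ALIYUN_DOC = "https://help.aliyun.com/zh/{product}"
ALIYUN_CHANGELOG = "https://help.aliyun.com/zh/{product}/product-overview/release-notes"
HUAWEI_DOC = "https://support.huaweicloud.cn/{product}/index.html"
HUAWEI_CHANGELOG = "https://support.huaweicloud.cn/wtsnew-{product}/index.html"


def doc_urls(aliyun_product: str, huawei_product: str) -> dict:
    """Build the default doc and changelog URLs for a pair of counterpart products.

    The two vendors use different product slugs (e.g. oss vs obs),
    so each side takes its own slug.
    """
    return {
        "aliyun_doc": ALIYUN_DOC.format(product=aliyun_product),
        "aliyun_changelog": ALIYUN_CHANGELOG.format(product=aliyun_product),
        "huawei_doc": HUAWEI_DOC.format(product=huawei_product),
        "huawei_changelog": HUAWEI_CHANGELOG.format(product=huawei_product),
    }


urls = doc_urls("oss", "obs")
```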

### Alibaba Cloud
| Category | Product | Docs | Changelog |
|------|------|------|----------|
| Compute | Cloud server ECS | https://help.aliyun.com/zh/ecs | https://help.aliyun.com/zh/ecs/product-overview/release-notes |
| Compute | Function compute FC | https://help.aliyun.com/zh/fc | https://help.aliyun.com/zh/fc/product-overview/release-notes |
| Storage | Object storage OSS | https://help.aliyun.com/zh/oss | https://help.aliyun.com/zh/oss/product-overview/release-notes |
| Storage | File storage NAS | https://help.aliyun.com/zh/nas | https://help.aliyun.com/zh/nas/product-overview/release-notes |
| Database | Cloud database RDS | https://help.aliyun.com/zh/rds | https://help.aliyun.com/zh/rds/product-overview/release-notes |
| Database | Cloud database Redis | https://help.aliyun.com/zh/redis | https://help.aliyun.com/zh/redis/product-overview/release-notes |
| Database | AnalyticDB PG | https://help.aliyun.com/zh/analyticdb-for-postgresql | https://help.aliyun.com/zh/analyticdb-for-postgresql/product-overview/release-notes |
| Containers | Container service ACK | https://help.aliyun.com/zh/ack | https://help.aliyun.com/zh/ack/product-overview/release-notes |
| Networking | Load balancer SLB | https://help.aliyun.com/zh/slb | https://help.aliyun.com/zh/slb/product-overview/release-notes |
| Networking | CDN | https://help.aliyun.com/zh/cdn | https://help.aliyun.com/zh/cdn/product-overview/release-notes |
| Big data | MaxCompute | https://help.aliyun.com/zh/maxcompute | https://help.aliyun.com/zh/maxcompute/product-overview/Release-notes |
| Big data | Real-time compute Flink | https://help.aliyun.com/zh/flink | https://help.aliyun.com/zh/flink/product-overview/release-note |
| Big data | Elasticsearch | https://help.aliyun.com/zh/elasticsearch | https://help.aliyun.com/zh/elasticsearch/product-overview/release-notes |
| AI | AI platform PAI | https://help.aliyun.com/zh/pai | https://help.aliyun.com/zh/pai/user-guide/api-aiworkspace-2021-02-04-changeset |
| AI | Bailian platform | https://help.aliyun.com/zh/bailian | https://help.aliyun.com/zh/bailian/release-notes |

### Huawei Cloud
| Category | Product | Docs | Changelog |
|------|------|------|----------|
| Compute | Elastic cloud server ECS | https://support.huaweicloud.cn/ecs/index.html | https://support.huaweicloud.cn/wtsnew-ecs/index.html |
| Compute | Function workflow FunctionGraph | https://support.huaweicloud.cn/functiongraph/index.html | https://support.huaweicloud.cn/wtsnew-functiongraph/index.html |
| Storage | Object storage OBS | https://support.huaweicloud.cn/obs/index.html | https://support.huaweicloud.cn/wtsnew-obs/index.html |
| Storage | File storage SFS | https://support.huaweicloud.cn/sfs/index.html | https://support.huaweicloud.cn/wtsnew-sfs/index.html |
| Database | Cloud database RDS | https://support.huaweicloud.cn/rds/index.html | https://support.huaweicloud.cn/wtsnew-rds/index.html |
| Database | Distributed cache DCS | https://support.huaweicloud.cn/dcs/index.html | https://support.huaweicloud.cn/wtsnew-dcs/index.html |
| Database | Data warehouse GaussDB(DWS) | https://support.huaweicloud.cn/dws/index.html | https://support.huaweicloud.cn/wtsnew-dws/index.html |
| Containers | Cloud container engine CCE | https://support.huaweicloud.cn/cce/index.html | https://support.huaweicloud.cn/wtsnew-cce/index.html |
| Networking | Elastic load balancer ELB | https://support.huaweicloud.cn/elb/index.html | https://support.huaweicloud.cn/wtsnew-elb/index.html |
| Networking | CDN | https://support.huaweicloud.cn/cdn/index.html | https://support.huaweicloud.cn/wtsnew-cdn/index.html |
| Big data | MapReduce service MRS | https://support.huaweicloud.cn/mrs/index.html | https://support.huaweicloud.cn/wtsnew-mrs/index.html |
| Big data | Data Lake Insight DLI | https://support.huaweicloud.cn/dli/index.html | https://support.huaweicloud.cn/wtsnew-dli/index.html |
| Search | Cloud search service CSS | https://support.huaweicloud.cn/css/index.html | https://support.huaweicloud.cn/wtsnew-css/index.html |
| AI | AI development platform ModelArts | https://support.huaweicloud.cn/modelarts/index.html | https://support.huaweicloud.cn/wtsnew-modelarts/index.html |
| AI | Pangu large model platform | https://support.huaweicloud.cn/pangu/index.html | https://support.huaweicloud.cn/wtsnew-pangu/index.html |

---

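## Execution

Once the user names the target product, run the following steps:

**Step 1: Lock the counterpart products**
Look up both vendors' counterpart products in the tables above. If the preset list has no matching product, tell the user explicitly and offer any known alternative entry points.

**Step 2: Run the scraper script**
```bash
python scripts/cloud_doc_scraper.py --product {product_key} --output {product_key}_docs.md
```
The script automatically handles: dependency check → TOC parsing → core page selection → concurrent fetching → markdown output. It does not install anything; missing dependencies must be installed manually.

If the script is unavailable, fall back to fetching page by page with web_fetch (see the steps below).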

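**Step 3: Read the docs in depth**
Fetch documentation content in the following priority order:

**Fetch priority**:
1. **Product overview/introduction pages** (understand product positioning and core value)
2. **Component version tables** (mandatory for big data products, e.g. EMR component versions, MRS component versions)
3. **Core features/function description pages** (understand capability boundaries)
4. **Specifications/performance metrics pages** (understand performance ceilings)
5. **Kernel enhancement pages** (understand in-house capabilities)
6. **Best practices/use case pages** (understand applicable scenarios)
7. **Changelog** (understand the iteration direction over the last 12 months)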

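**Step 4: Assess differences in product form**
Determine whether the two products share the same form:
- If the forms are similar (e.g. both are managed databases): compare features, performance, and price directly
- If the forms differ substantially (e.g. one is a managed service and the other is a PaaS platform):
  - First explain the difference in form and each product's positioning
  - Then compare the dimensions that are comparable (e.g. core capabilities, applicable scenarios)
  - State clearly which dimensions cannot be compared directly

**Step 5: Find the real differences**
Differences must come from the documentation, not from impressions. Focus on:
- Numeric gaps in key metrics (performance ceilings, specification ranges, SLA values, etc.)
- Core capabilities that one side has and the other lacks
- Places where the same feature differs clearly in implementation approach or maturity
- Divergence in recent iteration direction, which reflects each vendor's strategic intent

Skip dimensions with no difference or only a negligible one; do not pad the analysis.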

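**Step 6: Write the analysis**
The format is free; what matters is conveying the judgment clearly. The analysis must answer three questions:
1. Where the two products truly differ, and what each one's strengths and weaknesses are
2. Which direction each vendor has been pushing recently, and what the strategic intent is
3. Which customers and scenarios should choose which product

Every conclusion must be backed by the documentation. Cite sources naturally in the running text; no separate references section is needed.

**Step 7: Save and present the result**
1. Save the full analysis report as a markdown file (e.g. `{product_key}_competitive_analysis.md`) in the workspace
2. **The core content of the report must be shown to the user directly.** Do not just say "saved to file"; output the key conclusions, comparison tables, and selection advice in the conversation so the user sees the result at a glance
3. Append the file path at the end so the user can reference it later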

---

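## Comparison Dimensions Reference

Depending on the product type, prioritize the following dimensions:

**Baseline dimensions (required)**:
- Supported open-source component versions (e.g. Elasticsearch 7.10/8.x, OpenSearch 2.x)
- Kernel enhancements (in-house kernel, performance optimization, stability improvements)
- Completeness of core features (vector search, storage-compute separation, intelligent operations)
- Specification and performance ceilings (per-node storage, shard count, QPS)

**Extended dimensions (choose by product type)**:
- Search: vector search algorithms, quantization methods, vector dimensions, AI search capabilities
- Database: high-availability architecture, backup and restore, disaster recovery
- Big data: compute engines, storage formats, data source integration
- AI: model management, inference capabilities, RAG support

**Iteration dimensions (required)**:
- Feature releases over the last 12 months
- Strategic direction (inferred from the iteration focus)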

---

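## Constraints
- All information must come from official documentation; third-party information and impressions from training data are prohibited
- If fetching one side's documentation fails, state the reason for the failure explicitly; never fill the gap silently or guess
- If the preset list has no matching product, tell the user, then either provide known entry points or stop
- If the product forms differ too much, explain the difference before analyzing; do not force a comparison of unrelated features
- Comparisons must be fact-based; avoid subjective judgments and let the data speak
- Key information such as version numbers, performance data, and specifications must cite the source document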

---

## Script Technical Details

### Dependencies (install manually)
- Python 3.10+
- playwright + chromium (browser automation)
- httpx + beautifulsoup4 (HTTP fallback)

Install commands:
```bash
pip install playwright httpx beautifulsoup4
playwright install chromium
```

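### Known Limitations
- For some Huawei Cloud products (OBS, RDS, DCS, CDN, SFS) the legacy productdesc pages return 404; the script falls back to deep_links
- Some pages on the Huawei Cloud docs site require JS rendering, so web_fetch only gets an empty shell; use the script instead
- Some Huawei Cloud changelog pages (wtsnew-*) return 404 on the .cn domain; the script automatically falls back to .com
- Some Alibaba Cloud subpages may return empty content; the script then tries the HTTP fallback
- Individual deep_links URLs may break as the documentation is updated and need periodic maintenance
- Stealth mode (`--stealth`) is off by default and takes effect only when the user explicitly enables it, to handle JS-rendering compatibility issues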
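
The .cn to .com fallback mentioned above can be illustrated as an ordered list of URLs to try. This is a sketch under assumptions: `changelog_candidates` is a hypothetical helper that only builds the list, and the actual fetching and 404 detection are left to the script.

```python
def changelog_candidates(url: str) -> list:
    """Return the URLs to try in order: the original first, then the .com mirror.

    Only Huawei Cloud support URLs on the .cn domain get a second candidate.
    """
    candidates = [url]
    if "support.huaweicloud.cn" in url:
        candidates.append(
            url.replace("support.huaweicloud.cn", "support.huaweicloud.com")
        )
    return candidates


tries = changelog_candidates("https://support.huaweicloud.cn/wtsnew-obs/index.html")
```

A caller would walk the list and stop at the first URL that returns real content rather than a 404 page.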

FILE:scripts/cloud_doc_scraper.py
"""
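Alibaba Cloud & Huawei Cloud product documentation deep scraper v4.1
Flow: dependency check → parse left-side TOC → select core pages by priority → fetch concurrently → output markdown for AI analysis

v4.1 changes:
  - Dependency check replaces auto-install: when something is missing, prompt the user to install it manually; never silently download packages or browsers
  - Stealth mode is now optional (--stealth): off by default, active only when the user explicitly enables it
  - HTTP fallback (httpx+BS4) is kept as a complement to Playwright

v4 changes (compared with ClawHub v1.0.3):
  - Windows GBK encoding fix: stdout/stderr re-encoded as UTF-8
  - Wait strategy switched to domcontentloaded: avoids networkidle timeouts on the Huawei Cloud SPA
  - Retry logic: automatic retry on network errors/timeouts (up to 2 times)
  - deep_links config: preconfigured deep-link pages are used automatically when TOC parsing fails or yields too few pages
  - Per-vendor content selectors: Alibaba Cloud and Huawei Cloud use different body selectors to reduce noise
  - Content denoising: navigation/footer noise text is filtered automatically
  - HTTP fallback: when Playwright is blocked, fetch with httpx+BS4 automatically
  - Security-check detection: recognize block pages and fall back
  - 404 detection: recognize Huawei Cloud 404 pages
  - Huawei Cloud domain strategy: .cn falls back to .com automatically

Dependencies (install manually):
    pip install playwright httpx beautifulsoup4
    playwright install chromium

Usage:
    python cloud_doc_scraper.py --product ecs
    python cloud_doc_scraper.py --product oss --output oss_docs.md
    python cloud_doc_scraper.py --product rds --max-pages 15
    python cloud_doc_scraper.py --product ecs --stealth   # enable stealth mode (use with caution)
    python cloud_doc_scraper.py --list   # list all supported products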
"""

# ─── Windows console UTF-8 fix (before any other import) ──────────────────────
import subprocess, sys, os, importlib.util

if sys.platform == "win32":
    os.environ.setdefault("PYTHONIOENCODING", "utf-8")
    try:
        sys.stdout.reconfigure(encoding="utf-8", errors="replace")
        sys.stderr.reconfigure(encoding="utf-8", errors="replace")
    except Exception:
        pass

# ─── Auto-install helpers (legacy from v4; v4.1 calls _check_deps() below instead) ───
def _ensure_dep(package: str, pip_name: str | None = None):
    if importlib.util.find_spec(package) is None:
        print(f"[INSTALL] {package} ...")
        r = subprocess.run([sys.executable, "-m", "pip", "install", "--quiet", pip_name or package],
                           capture_output=True, text=True)
        if r.returncode != 0:
            print(f"[FAIL] {package}: {r.stderr}"); sys.exit(1)
        print(f"[OK] {package}")
    else:
        print(f"[OK] {package} (cached)")

def _ensure_playwright():
    _ensure_dep("playwright")
    try:
        from playwright.sync_api import sync_playwright
        with sync_playwright() as p:
            if not os.path.exists(p.chromium.executable_path):
                raise FileNotFoundError
        print("[OK] Chromium (cached)")
    except Exception:
        print("[INSTALL] Chromium (first download ~150MB) ...")
        r = subprocess.run([sys.executable, "-m", "playwright", "install", "chromium"],
                           capture_output=False, text=True)
        if r.returncode != 0:
            print("[FAIL] Chromium install"); sys.exit(1)
        print("[OK] Chromium")

# ─── Dependency check (no auto-install; prompt the user to install manually) ───
def _check_dep(package: str, pip_name: str | None = None):
    if importlib.util.find_spec(package) is None:
        print(f"[MISSING] {package} — please run: pip install {pip_name or package}")
        return False
    return True

def _check_deps():
    ok = True
    if not _check_dep("playwright"):
        ok = False
    else:
        try:
            from playwright.sync_api import sync_playwright
            with sync_playwright() as p:
                if not os.path.exists(p.chromium.executable_path):
                    print("[MISSING] Chromium — please run: playwright install chromium")
                    ok = False
        except Exception:
            print("[MISSING] Chromium — please run: playwright install chromium")
            ok = False
    if not _check_dep("httpx"):
        ok = False
    if not _check_dep("bs4", "beautifulsoup4"):
        ok = False
    if not ok:
        print("\n[ERROR] Missing dependencies. Install with:\n"
              "  pip install playwright httpx beautifulsoup4\n"
              "  playwright install chromium\n"
              "Or use a venv: python -m venv .venv && .venv/bin/pip install playwright httpx beautifulsoup4 && .venv/bin/playwright install chromium")
        sys.exit(1)
    print("[OK] All dependencies satisfied")

_check_deps()

# ─── Main imports ─────────────────────────────────────────────────────────────
import asyncio, argparse
from pathlib import Path
from datetime import datetime
from urllib.parse import urljoin, urlparse
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeout

try:
    import httpx
    from bs4 import BeautifulSoup
    HAS_HTTPX = True
except ImportError:
    HAS_HTTPX = False

# ─── Product configuration ────────────────────────────────────────────────────
PRODUCTS = {
    "ecs": {
        "name": "云服务器 ECS / ECS",
        "aliyun": {"doc": "https://help.aliyun.com/zh/ecs", "changelog": "https://help.aliyun.com/zh/ecs/product-overview/release-notes"},
        "huawei": {"doc": "https://support.huaweicloud.cn/ecs/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-ecs/index.html",
                   "deep_links": [
                       {"text": "什么是ECS", "url": "https://support.huaweicloud.com/productdesc-ecs/zh-cn_topic_0013771112.html"},
                       {"text": "产品优势", "url": "https://support.huaweicloud.com/productdesc-ecs/ecs_01_0002.html"},
                       {"text": "应用场景", "url": "https://support.huaweicloud.com/productdesc-ecs/ecs_01_0003.html"},
                       {"text": "实例规格", "url": "https://support.huaweicloud.com/productdesc-ecs/ecs_01_0014.html"},
                       {"text": "产品功能", "url": "https://support.huaweicloud.com/productdesc-ecs/ecs_01_0005.html"},
                   ]},
    },
    "oss": {
        "name": "对象存储 OSS / OBS",
        "aliyun": {"doc": "https://help.aliyun.com/zh/oss", "changelog": "https://help.aliyun.com/zh/oss/product-overview/release-notes"},
        "huawei": {"doc": "https://support.huaweicloud.cn/obs/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-obs/index.html",
                   "deep_links": [
                       {"text": "什么是OBS", "url": "https://support.huaweicloud.com/productdesc-obs/zh-cn_topic_0045829060.html"},
                       {"text": "产品优势", "url": "https://support.huaweicloud.com/productdesc-obs/obs_03_0201.html"},
                       {"text": "应用场景", "url": "https://support.huaweicloud.com/productdesc-obs/obs_03_0202.html"},
                       {"text": "产品功能", "url": "https://support.huaweicloud.com/productdesc-obs/obs_03_0151.html"},
                       {"text": "约束与限制", "url": "https://support.huaweicloud.com/productdesc-obs/obs_03_0360.html"},
                   ]},
    },
    "rds": {
        "name": "云数据库 RDS",
        "aliyun": {"doc": "https://help.aliyun.com/zh/rds", "changelog": "https://help.aliyun.com/zh/rds/product-overview/release-notes"},
        "huawei": {"doc": "https://support.huaweicloud.cn/rds/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-rds/index.html",
                   "deep_links": [
                       {"text": "产品介绍", "url": "https://support.huaweicloud.com/productdesc-rds/zh-cn_topic_dashboard.html"},
                       {"text": "计费说明", "url": "https://support.huaweicloud.com/price-rds/rds_00_0006.html"},
                       {"text": "快速入门", "url": "https://support.huaweicloud.com/qs-rds/rds_02_0148.html"},
                       {"text": "性能白皮书", "url": "https://support.huaweicloud.com/pwp-rds/pwp_0000.html"},
                       {"text": "最佳实践", "url": "https://support.huaweicloud.com/bestpractice-rds/practice_0000.html"},
                   ]},
    },
    "redis": {
        "name": "云数据库 Redis / DCS",
        "aliyun": {"doc": "https://help.aliyun.com/zh/redis", "changelog": "https://help.aliyun.com/zh/redis/product-overview/release-notes"},
        "huawei": {"doc": "https://support.huaweicloud.cn/dcs/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-dcs/index.html",
                   "deep_links": [
                       {"text": "什么是DCS", "url": "https://support.huaweicloud.com/productdesc-dcs/dcs-pd-200713001.html"},
                       {"text": "典型应用场景", "url": "https://support.huaweicloud.com/productdesc-dcs/dcs-pd-200713002.html"},
                       {"text": "产品功能", "url": "https://support.huaweicloud.com/productdesc-dcs/dcs_01_0006.html"},
                       {"text": "DCS产品选型参考", "url": "https://support.huaweicloud.com/productdesc-dcs/dcs_01_0002.html"},
                       {"text": "Redis实例类型差异", "url": "https://support.huaweicloud.com/productdesc-dcs/dcs-pd-191224001.html"},
                   ]},
    },
    "ack": {
        "name": "容器服务 ACK / CCE",
        "aliyun": {"doc": "https://help.aliyun.com/zh/ack", "changelog": "https://help.aliyun.com/zh/ack/product-overview/release-notes"},
        "huawei": {"doc": "https://support.huaweicloud.cn/cce/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-cce/index.html",
                   "deep_links": [
                       {"text": "什么是CCE", "url": "https://support.huaweicloud.com/productdesc-cce/cce_productdesc_0001.html"},
                       {"text": "产品功能", "url": "https://support.huaweicloud.com/productdesc-cce/cce_productdesc_0002.html"},
                       {"text": "版本说明", "url": "https://support.huaweicloud.com/productdesc-cce/cce_productdesc_0003.html"},
                       {"text": "应用场景", "url": "https://support.huaweicloud.com/productdesc-cce/cce_productdesc_0005.html"},
                   ]},
    },
    "fc": {
        "name": "函数计算 FC / FunctionGraph",
        "aliyun": {"doc": "https://help.aliyun.com/zh/fc", "changelog": "https://help.aliyun.com/zh/fc/product-overview/release-notes"},
        "huawei": {"doc": "https://support.huaweicloud.cn/functiongraph/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-functiongraph/index.html",
                   "deep_links": [
                       {"text": "什么是FunctionGraph", "url": "https://support.huaweicloud.com/productdesc-functiongraph/functiongraph_01_0100.html"},
                       {"text": "功能特性", "url": "https://support.huaweicloud.com/productdesc-functiongraph/functiongraph_01_0200.html"},
                       {"text": "应用场景", "url": "https://support.huaweicloud.com/productdesc-functiongraph/functiongraph_01_0300.html"},
                   ]},
    },
    "slb": {
        "name": "负载均衡 SLB / ELB",
        "aliyun": {"doc": "https://help.aliyun.com/zh/slb", "changelog": "https://help.aliyun.com/zh/slb/product-overview/release-notes"},
        "huawei": {"doc": "https://support.huaweicloud.cn/elb/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-elb/index.html",
                   "deep_links": [
                       {"text": "什么是ELB", "url": "https://support.huaweicloud.com/productdesc-elb/elb_pro_0001.html"},
                       {"text": "功能概述", "url": "https://support.huaweicloud.com/productdesc-elb/elb_pro_0003.html"},
                       {"text": "应用场景", "url": "https://support.huaweicloud.com/productdesc-elb/elb_pro_0004.html"},
                       {"text": "规格", "url": "https://support.huaweicloud.com/productdesc-elb/elb_pro_0010.html"},
                   ]},
    },
    "maxcompute": {
        "name": "大数据 MaxCompute / MRS",
        "aliyun": {"doc": "https://help.aliyun.com/zh/maxcompute", "changelog": "https://help.aliyun.com/zh/maxcompute/product-overview/Release-notes"},
        "huawei": {"doc": "https://support.huaweicloud.cn/mrs/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-mrs/index.html",
                   "deep_links": [
                       {"text": "什么是MRS", "url": "https://support.huaweicloud.com/productdesc-mrs/mrs_08_0001.html"},
                       {"text": "组件版本", "url": "https://support.huaweicloud.com/productdesc-mrs/mrs_08_0005.html"},
                       {"text": "产品功能", "url": "https://support.huaweicloud.com/productdesc-mrs/mrs_08_0002.html"},
                   ]},
    },
    "pai": {
        "name": "AI 平台 PAI / ModelArts",
        "aliyun": {"doc": "https://help.aliyun.com/zh/pai", "changelog": "https://help.aliyun.com/zh/pai/user-guide/api-aiworkspace-2021-02-04-changeset"},
        "huawei": {"doc": "https://support.huaweicloud.cn/modelarts/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-modelarts/index.html",
                   "deep_links": [
                       {"text": "什么是ModelArts", "url": "https://support.huaweicloud.com/productdesc-modelarts/modelarts_product_0001.html"},
                       {"text": "功能特性", "url": "https://support.huaweicloud.com/productdesc-modelarts/modelarts_product_0002.html"},
                       {"text": "应用场景", "url": "https://support.huaweicloud.com/productdesc-modelarts/modelarts_product_0003.html"},
                   ]},
    },
    "bailian": {
        "name": "大模型平台 百炼 / 盘古",
        "aliyun": {"doc": "https://help.aliyun.com/zh/bailian", "changelog": "https://help.aliyun.com/zh/bailian/release-notes"},
        "huawei": {"doc": "https://support.huaweicloud.cn/pangu/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-pangu/index.html"},
    },
    "cdn": {
        "name": "CDN",
        "aliyun": {"doc": "https://help.aliyun.com/zh/cdn", "changelog": "https://help.aliyun.com/zh/cdn/product-overview/release-notes"},
        "huawei": {"doc": "https://support.huaweicloud.cn/cdn/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-cdn/index.html",
                   "deep_links": [
                       {"text": "什么是华为云CDN", "url": "https://support.huaweicloud.com/productdesc-cdn/zh-cn_topic_0064907747.html"},
                       {"text": "产品优势", "url": "https://support.huaweicloud.com/productdesc-cdn/zh-cn_topic_0064907763.html"},
                       {"text": "应用场景", "url": "https://support.huaweicloud.com/productdesc-cdn/cdn_01_0067.html"},
                       {"text": "产品功能", "url": "https://support.huaweicloud.com/productdesc-cdn/cdn_01_0369.html"},
                       {"text": "约束与限制", "url": "https://support.huaweicloud.com/productdesc-cdn/cdn_01_0068.html"},
                   ]},
    },
    "nas": {
        "name": "文件存储 NAS / SFS",
        "aliyun": {"doc": "https://help.aliyun.com/zh/nas", "changelog": "https://help.aliyun.com/zh/nas/product-overview/release-notes"},
        "huawei": {"doc": "https://support.huaweicloud.cn/sfs/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-sfs/index.html",
                   "deep_links": [
                       {"text": "什么是SFS", "url": "https://support.huaweicloud.com/productdesc-sfs/zh-cn_topic_0034428718.html"},
                       {"text": "应用场景", "url": "https://support.huaweicloud.com/productdesc-sfs/sfs_01_0004.html"},
                       {"text": "产品功能", "url": "https://support.huaweicloud.com/productdesc-sfs/sfs_01_0110.html"},
                       {"text": "约束与限制", "url": "https://support.huaweicloud.com/productdesc-sfs/sfs_01_0011.html"},
                       {"text": "计费说明", "url": "https://support.huaweicloud.com/productdesc-sfs/sfs_01_0108.html"},
                   ]},
    },
    "flink": {
        "name": "实时计算 Flink",
        "aliyun": {"doc": "https://help.aliyun.com/zh/flink", "changelog": "https://help.aliyun.com/zh/flink/product-overview/release-note"},
        "huawei": {"doc": "https://support.huaweicloud.cn/dli/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-dli/index.html",
                   "deep_links": [
                       {"text": "什么是DLI", "url": "https://support.huaweicloud.com/productdesc-dli/dli_01_0001.html"},
                       {"text": "功能特性", "url": "https://support.huaweicloud.com/productdesc-dli/dli_01_0002.html"},
                       {"text": "应用场景", "url": "https://support.huaweicloud.com/productdesc-dli/dli_01_0003.html"},
                   ]},
    },
    "elasticsearch": {
        "name": "搜索 Elasticsearch / CSS",
        "aliyun": {"doc": "https://help.aliyun.com/zh/elasticsearch", "changelog": "https://help.aliyun.com/zh/elasticsearch/product-overview/release-notes"},
        "huawei": {"doc": "https://support.huaweicloud.cn/css/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-css/index.html",
                   "deep_links": [
                       {"text": "什么是云搜索服务", "url": "https://support.huaweicloud.com/productdesc-css/css_04_0001.html"},
                       {"text": "产品优势", "url": "https://support.huaweicloud.com/productdesc-css/css_04_0010.html"},
                       {"text": "应用场景", "url": "https://support.huaweicloud.com/productdesc-css/css_04_0002.html"},
                       {"text": "产品功能", "url": "https://support.huaweicloud.com/productdesc-css/css_04_0003.html"},
                       {"text": "约束与限制", "url": "https://support.huaweicloud.com/productdesc-css/css_04_0005.html"},
                   ]},
    },
    "dws": {
        "name": "数据仓库 AnalyticDB / GaussDB(DWS)",
        "aliyun": {"doc": "https://help.aliyun.com/zh/analyticdb-for-postgresql", "changelog": "https://help.aliyun.com/zh/analyticdb-for-postgresql/product-overview/release-notes",
                   "deep_links": [
                       {"text": "什么是AnalyticDB PG", "url": "https://help.aliyun.com/zh/analyticdb-for-postgresql/product-overview/what-is-analyticdb-for-postgresql"},
                       {"text": "功能特性", "url": "https://help.aliyun.com/zh/analyticdb-for-postgresql/product-overview/features"},
                       {"text": "产品优势", "url": "https://help.aliyun.com/zh/analyticdb-for-postgresql/product-overview/benefits"},
                       {"text": "产品系列", "url": "https://help.aliyun.com/zh/analyticdb-for-postgresql/product-overview/editions"},
                       {"text": "应用场景", "url": "https://help.aliyun.com/zh/analyticdb-for-postgresql/product-overview/scenarios"},
                       {"text": "约束与限制", "url": "https://help.aliyun.com/zh/analyticdb-for-postgresql/product-overview/limits-and-restrictions"},
                   ]},
        "huawei": {"doc": "https://support.huaweicloud.cn/dws/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-dws/index.html",
                   "deep_links": [
                       {"text": "什么是DWS", "url": "https://support.huaweicloud.com/productdesc-dws/dws_01_0002.html"},
                       {"text": "数据仓库类型", "url": "https://support.huaweicloud.com/productdesc-dws/dws_01_00017.html"},
                       {"text": "产品功能", "url": "https://support.huaweicloud.com/productdesc-dws/dws_01_0004.html"},
                       {"text": "应用场景", "url": "https://support.huaweicloud.com/productdesc-dws/dws_01_0006.html"},
                       {"text": "基本概念", "url": "https://support.huaweicloud.com/productdesc-dws/dws_01_0007.html"},
                   ]},
    },
}

# ─── Priority keywords (weight, keywords); matched against Chinese and English link text ───
PRIORITY_KEYWORDS = [
    (3, ["产品简介", "产品概述", "什么是", "功能特性", "核心功能", "产品功能", "product overview", "what is", "features"]),
    (2, ["规格", "实例规格", "配置", "限制", "约束", "性能", "指标", "参数", "specification", "limits", "performance"]),
    (2, ["计费", "定价", "费用", "价格", "版本对比", "版本说明", "pricing", "billing", "edition"]),
    (2, ["组件版本", "版本支持", "引擎版本", "内核版本"]),
    (1, ["应用场景", "使用场景", "适用场景", "最佳实践", "use case", "scenario", "best practice"]),
    (-1, ["常见问题", "faq", "故障排除", "sdk", "api参考", "错误码", "迁移指南"]),
]

def score_link(text: str, href: str) -> int:
    combined = (text + " " + href).lower()
    return sum(w for w, kws in PRIORITY_KEYWORDS if any(k in combined for k in kws))

# ─── TOC parsing ──────────────────────────────────────────────────────────────
ALIYUN_NAV_SELECTORS = [
    ".toc-menu a", ".sidebar-menu a", ".helpcenter-menu a",
    "nav a", ".left-menu a",
    "[class*='nav'] a", "[class*='sidebar'] a", "[class*='toc'] a", "[class*='menu'] a",
]
HUAWEI_NAV_SELECTORS = [
    "[class*=nav-item] a",  # new Huawei Cloud SPA
    ".book-left-menu a", ".toc a", ".sidebar a", ".tree-menu a",
    "[class*='catalog'] a", "[class*='tree'] a", "[class*='nav'] a", "[class*='menu'] a",
    ".left-nav a", ".doc-nav a", ".doc-sidebar a",
    "aside a", "[role='navigation'] a",
]

async def parse_toc(page, base_url: str, nav_selectors: list, label: str) -> list:
    base_domain = f"{urlparse(base_url).scheme}://{urlparse(base_url).netloc}"
    links = []
    for selector in nav_selectors:
        try:
            els = await page.query_selector_all(selector)
            if len(els) <= 3:
                continue
            for el in els:
                href = await el.get_attribute("href") or ""
                text = (await el.inner_text()).strip()
                if not href or not text or href == "#" or href.startswith("javascript"):
                    continue
                full_url = urljoin(base_domain, href) if href.startswith("/") else href
                if urlparse(base_domain).netloc not in urlparse(full_url).netloc:
                    continue
                links.append({"url": full_url, "text": text, "score": score_link(text, href)})
            if links:
                print(f"    [{label}] nav selector '{selector}' => {len(links)} links")
                return links
        except Exception:
            continue
    print(f"    [{label}] [WARN] all selectors missed, TOC parse failed")
    return []

# ─── Body text extraction ─────────────────────────────────────────────────────
ALIYUN_CONTENT_SELECTORS = [
    ".help-detail-content", ".article-content", ".doc-body",
    "#docContent", ".markdown-body", "article",
]
HUAWEI_CONTENT_SELECTORS = [
    ".book-desc", ".content-block", "#content", "article", ".markdown-body",
]
FALLBACK_CONTENT_SELECTORS = ["main", ".main-content", ".content"]

NOISE_PATTERNS = [
    "为什么选择阿里云", "什么是云计算", "全球基础设施", "法律声明",
    "Cookies政策", "廉正举报", "安全举报", "联系我们", "加入我们",
    "阿里巴巴集团", "淘宝网", "天猫", "速卖通",
    "关注阿里云", "阿里云公众号", "随时随地运维管控",
    "售前咨询", "售后在线", "我要建议", "我要投诉",
    "登录阿里云", "管理云资源", "状态一览",
    "Protected by Tencent", "正在验证连接安全性",
    "华为云App", "950808", "售前咨询热线",
    "云商店咨询", "备案服务", "增值电信业务",
    "黔ICP备", "苏B2-", "贵公网安备",
]

def denoise_text(text: str) -> str:
    lines = text.split("\n")
    clean = []
    for line in lines:
        s = line.strip()
        if not s:
            clean.append(""); continue
        if not any(p in s for p in NOISE_PATTERNS):
            clean.append(s)
    result, prev_empty = [], False
    for line in clean:
        if not line:
            if not prev_empty: result.append("")
            prev_empty = True
        else:
            result.append(line); prev_empty = False
    return "\n".join(result)

async def extract_text(page, content_selectors: list | None = None) -> str:
    selectors = content_selectors or (ALIYUN_CONTENT_SELECTORS + HUAWEI_CONTENT_SELECTORS + FALLBACK_CONTENT_SELECTORS)
    for sel in selectors:
        try:
            el = await page.query_selector(sel)
            if el:
                text = await el.inner_text()
                if len(text.strip()) > 300:
                    return denoise_text(text.strip())
        except Exception:
            continue
    body = await page.query_selector("body")
    if body:
        return denoise_text((await body.inner_text()).strip())
    return ""

# ─── Security-check & 404 detection ───────────────────────────────────────────
def is_security_block(text: str) -> bool:
    return any(p in text for p in ["正在验证连接安全性", "Protected by Tencent Cloud EdgeOne", "Security Verification"])

def is_404_page(text: str) -> bool:
    return "很抱歉，没发现您要的页面" in text

def huawei_cn_to_com(url: str) -> str:
    return url.replace("support.huaweicloud.cn", "support.huaweicloud.com")

# ─── HTTP fallback ────────────────────────────────────────────────────────────
async def fetch_via_http(url: str) -> str:
    if not HAS_HTTPX:
        return ""
    try:
        async with httpx.AsyncClient(follow_redirects=True, timeout=30,
                                      headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}) as client:
            resp = await client.get(url)
            if resp.status_code != 200:
                return ""
            soup = BeautifulSoup(resp.text, "html.parser")
            for tag in soup.find_all(["script", "style", "nav", "footer", "header", "noscript"]):
                tag.decompose()
            for sel in [".book-desc", ".content-block", "#content", "article", ".markdown-body", "main"]:
                for el in soup.select(sel):
                    text = el.get_text(separator="\n", strip=True)
                    if len(text) > 300:
                        return denoise_text(text)
            body = soup.find("body")
            if body:
                return denoise_text(body.get_text(separator="\n", strip=True))
            return ""
    except Exception:
        return ""

# ─── Page fetch with retries ──────────────────────────────────────────────────
MAX_RETRIES = 2

async def fetch_content(context, url: str, content_selectors: list | None = None,
                        label: str = "", is_huawei: bool = False) -> str:
    for attempt in range(MAX_RETRIES + 1):
        page = await context.new_page()
        try:
            await page.goto(url, wait_until="domcontentloaded", timeout=45000)
            for selector in (content_selectors or ALIYUN_CONTENT_SELECTORS + HUAWEI_CONTENT_SELECTORS + FALLBACK_CONTENT_SELECTORS):
                try:
                    await page.wait_for_selector(selector, timeout=3000)
                    break
                except PlaywrightTimeout:
                    continue
            text = await extract_text(page, content_selectors)

            if is_security_block(text):
                if HAS_HTTPX:
                    http_text = await fetch_via_http(url)
                    if len(http_text.strip()) > 200:
                        print(f"  [{label}] [HTTP-FALLBACK] bypassed security check")
                        return http_text
                return f"[Security block, HTTP fallback failed, visit manually: {url}]"

            if is_404_page(text):
                if is_huawei and "huaweicloud.cn" in url:
                    com_url = huawei_cn_to_com(url)
                    print(f"  [{label}] [404] .cn 404, trying .com")
                    com_text = await fetch_via_http(com_url)
                    if len(com_text.strip()) > 200:
                        return com_text
                return f"[404 page, visit manually: {url}]"

            if len(text.strip()) < 100:
                if HAS_HTTPX:
                    http_text = await fetch_via_http(url)
                    if len(http_text.strip()) > 200:
                        print(f"  [{label}] [HTTP-FALLBACK] Playwright empty, HTTP ok")
                        return http_text
                return f"[Empty page, visit manually: {url}]"

            return text
        except PlaywrightTimeout:
            if attempt < MAX_RETRIES:
                print(f"  [{label}] [RETRY] timeout, attempt {attempt+1}...")
                continue
            return f"[Timeout after {MAX_RETRIES} retries, visit manually: {url}]"
        except Exception as e:
            if attempt < MAX_RETRIES:
                err = str(e)
                if "ERR_NETWORK" in err or "ERR_CONNECTION" in err or "Timeout" in err:
                    print(f"  [{label}] [RETRY] network error, attempt {attempt+1}...")
                    continue
            return f"[Failed: {e}]"
        finally:
            await page.close()
    return f"[Retries exhausted, visit manually: {url}]"

# ─── Full scrape of one side ──────────────────────────────────────────────────
async def scrape_side(context, label: str, doc_url: str, changelog_url: str,
                      nav_selectors: list, max_pages: int,
                      deep_links: list | None = None,
                      content_selectors: list | None = None,
                      is_huawei: bool = False) -> dict:
    result = {"label": label, "doc_url": doc_url, "changelog_url": changelog_url,
              "pages": [], "changelog": "", "toc_total": 0}

    # Step 1: open the docs homepage and parse the TOC
    print(f"\n  [{label}] Parsing TOC...")
    index_page = await context.new_page()
    toc = []
    try:
        await index_page.goto(doc_url, wait_until="networkidle", timeout=60000)
        await asyncio.sleep(2)  # extra buffer for SPA render
        toc = await parse_toc(index_page, doc_url, nav_selectors, label)
        result["toc_total"] = len(toc)

        if not toc or len(toc) < 3:
            if deep_links and len(deep_links) >= 3:
                print(f"  [{label}] TOC links insufficient({len(toc)}), using {len(deep_links)} deep_links")
                toc = [{"url": dl["url"], "text": dl["text"], "score": 3} for dl in deep_links]
                result["toc_total"] = len(toc)
            else:
                print(f"  [{label}] Fallback: only fetch homepage")
                result["pages"].append(("Homepage", doc_url, await extract_text(index_page, content_selectors)))
        elif deep_links and len(deep_links) >= len(toc):
            print(f"  [{label}] TOC {len(toc)} links, deep_links {len(deep_links)}, prefer deep_links")
            toc = [{"url": dl["url"], "text": dl["text"], "score": 3} for dl in deep_links]
            result["toc_total"] = len(toc)
    except PlaywrightTimeout:
        print(f"  [{label}] [WARN] Homepage timeout, trying deep_links or HTTP fallback")
        if deep_links:
            toc = [{"url": dl["url"], "text": dl["text"], "score": 3} for dl in deep_links]
            result["toc_total"] = len(toc)
        elif HAS_HTTPX:
            http_text = await fetch_via_http(doc_url)
            if len(http_text.strip()) > 200:
                result["pages"].append(("Homepage(HTTP)", doc_url, http_text))
    finally:
        await index_page.close()

    # Step 2: dedupe → filter → sort → take top-N
    if toc:
        seen, unique = set(), []
        for lk in toc:
            if lk["url"] not in seen:
                seen.add(lk["url"])
                unique.append(lk)
        candidates = sorted([lk for lk in unique if lk["score"] >= 0],
                             key=lambda x: x["score"], reverse=True)[:max_pages]
        print(f"  [{label}] Selected {len(candidates)}/{len(unique)} core pages:")
        for i, lk in enumerate(candidates):
            print(f"    {i+1:2d}. [{lk['score']:+d}] {lk['text'][:45]}")

        # Step 3: fetch pages concurrently
        sem = asyncio.Semaphore(4)
        async def fetch_one(lk):
            async with sem:
                content = await fetch_content(context, lk["url"], content_selectors, label, is_huawei)
                ok = "[OK]" if not content.startswith("[") else "[FAIL]"
                print(f"  [{label}] {ok} {lk['text'][:40]}")
                return (lk["text"], lk["url"], content)
        result["pages"] = list(await asyncio.gather(*[fetch_one(lk) for lk in candidates]))

    # Step 4: changelog
    print(f"  [{label}] Fetching changelog...")
    result["changelog"] = await fetch_content(context, changelog_url, content_selectors, label, is_huawei)
    ok = "[OK]" if not result["changelog"].startswith("[") else "[FAIL]"
    print(f"  [{label}] {ok} Changelog done")

    return result

# ─── Markdown assembly ────────────────────────────────────────────────────────
def build_markdown(product_name: str, aliyun: dict, huawei: dict) -> str:
    now = datetime.now().strftime("%Y-%m-%d")
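    lines = [
        f"# Competitive analysis source material: {product_name}",
        f"> Fetched on: {now}",
        f"> Aliyun: {aliyun['toc_total']} pages in TOC, {len(aliyun['pages'])} fetched in this run",
        f"> Huawei Cloud: {huawei['toc_total']} pages in TOC, {len(huawei['pages'])} fetched in this run",
        "", "**Instructions for the AI:** Based on the official documentation excerpts below, analyze the real differences between the two products.",
        "Focus on: numeric gaps in key metrics, capabilities one side has and the other lacks, maturity differences in equivalent features, and diverging directions in recent releases.",
        "Skip dimensions with no difference or no clear difference; do not pad the analysis.", "", "---", "",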
    ]
    for side in [aliyun, huawei]:
        lines += [f"# {side['label']}", ""]
        if not side["pages"]:
            lines.append(f"> [WARN] Fetch failed, visit manually: {side['doc_url']}")
        else:
            for title, url, content in side["pages"]:
                lines += [f"## {side['label']} · {title}", f"> Source: {url}", ""]
                body = content[:8000]
                if len(content) > 8000:
                    body += "\n\n[...truncated, see source link...]"
                lines += [body, "", "---", ""]
        lines += [f"## {side['label']} · Changelog", f"> Source: {side['changelog_url']}", ""]
        cl = side["changelog"][:6000]
        if len(side["changelog"]) > 6000:
            cl += "\n\n[...truncated...]"
        lines += [cl or f"> [WARN] Fetch failed, visit manually: {side['changelog_url']}", "", "---", ""]
    return "\n".join(lines)

# ─── Entry point ──────────────────────────────────────────────────────────────
async def main_async(product_key: str, max_pages: int, output: str, args):
    if product_key not in PRODUCTS:
        print(f"[FAIL] Unknown product '{product_key}', available: {', '.join(PRODUCTS.keys())}")
        sys.exit(1)
    cfg = PRODUCTS[product_key]
    print(f"\n{'='*60}\n  {cfg['name']}  (max {max_pages} core pages per side)\n{'='*60}")

    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=["--disable-blink-features=AutomationControlled", "--no-sandbox"],
        )
        # Aliyun: plain context (no stealth needed)
        ctx_aliyun = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            locale="zh-CN",
        )
        # Huawei Cloud: optional stealth context, enabled with --stealth, for sites with JS-rendering compatibility issues
        ctx_huawei = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36" if args.stealth else None,
            viewport={"width": 1920, "height": 1080},
            locale="zh-CN",
        )
        if args.stealth:
            await ctx_huawei.add_init_script("Object.defineProperty(navigator, 'webdriver', { get: () => undefined });")
        try:
            aliyun = await scrape_side(
                ctx_aliyun, "Aliyun",
                cfg["aliyun"]["doc"], cfg["aliyun"]["changelog"],
                ALIYUN_NAV_SELECTORS, max_pages,
                deep_links=cfg["aliyun"].get("deep_links"),
                content_selectors=ALIYUN_CONTENT_SELECTORS + FALLBACK_CONTENT_SELECTORS,
                is_huawei=False,
            )
            huawei = await scrape_side(
                ctx_huawei, "Huawei",
                cfg["huawei"]["doc"], cfg["huawei"]["changelog"],
                HUAWEI_NAV_SELECTORS, max_pages,
                deep_links=cfg["huawei"].get("deep_links"),
                content_selectors=HUAWEI_CONTENT_SELECTORS + FALLBACK_CONTENT_SELECTORS,
                is_huawei=True,
            )
        finally:
            await ctx_aliyun.close()
            await ctx_huawei.close()
            await browser.close()

    md = build_markdown(cfg["name"], aliyun, huawei)
    if output:
        Path(output).write_text(md, encoding="utf-8")
        print(f"\n[OK] Done! Saved to {output} ({len(md.encode())//1024} KB)")
        print("   Paste the file content to AI to start competitive analysis.")
    else:
        print("\n" + "=" * 60 + "\n" + md)

def main():
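    # RawTextHelpFormatter keeps the newlines in the --product help text intact.
    parser = argparse.ArgumentParser(description="Aliyun & Huawei doc scraper v4",
                                     formatter_class=argparse.RawTextHelpFormatter)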
    parser.add_argument("--product", default="",
        help="Product key. Available:\n" + "\n".join(f"  {k:<15} {v['name']}" for k, v in PRODUCTS.items()))
    parser.add_argument("--list", action="store_true", help="List all products")
    parser.add_argument("--output", default="", help="Output file path")
    parser.add_argument("--max-pages", type=int, default=12, help="Max core pages per side (default 12)")
    parser.add_argument("--stealth", action="store_true", help="Enable stealth mode for sites with JS rendering compatibility issues (use with caution)")
    args = parser.parse_args()

    if args.list:
        print("\nAvailable products:\n")
        for k, v in PRODUCTS.items():
            print(f"  {k:<15} {v['name']}")
        print(); return
    if not args.product:
        parser.print_help(); sys.exit(1)
    asyncio.run(main_async(args.product, args.max_pages, args.output, args))

if __name__ == "__main__":
    main()


free-web-search

Skill

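A stable web search tool built on the China edition of Bing, tuned for Chinese-language queries, with full-text page fetching, workarounds for common anti-scraping restrictions, and structured search results.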

---
name: free-web-search
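description: A stable web search tool built on the China edition of Bing, tuned for Chinese-language queries, with full-text page fetching, workarounds for common anti-scraping restrictions, and structured search results.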
version: 7
author: free-web-search
trigger_keywords:
  - 搜索
  - 查一下
  - 找一下
  - 最新消息
  - 新闻
  - 最新动态
  - 官网
  - 教程
  - 是什么
tools:
  - name: web_search
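    description: Searches the web and returns structured results; tuned for Chinese-language queries, with optional full-text fetching
    script: scripts/web_search.py
    parameters:
      query:
        type: string
        description: "[Required] Search keywords or a short phrase. Must be concise and precise and follow the query guidelines below; no long sentences or rhetorical questions"
        required: true
      max:
        type: integer
        description: Maximum number of search results to return. Default 10, capped at 20
        required: false
      full:
        type: integer
        description: Fetch the full page text of the first N results. Default 0 (no fetching), capped at 5
        required: false
      engine:
        type: string
        description: Search engine, one of bing/duckduckgo/auto (default bing)
        required: false
      filter:
        type: boolean
        description: Filter out low-quality domains (e.g. Zhihu). Default false (no filtering)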
        required: false
---

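# free-web-search v14 web search tool

A stable search tool built on a Playwright-driven browser, with **intent detection**, **request throttling**, **result quality scoring**, and the **CSS-preservation fix**.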

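## What's new in v14

- ✅ **[Critical fix] CSS is preserved**: blocking CSS requests previously caused Bing result titles to lose their text
- ✅ **Intent detection + query rewriting**: when search quality is poor, the query is rewritten automatically (city sightseeing → attraction recommendations, today's price → live quotes, and so on)
- ✅ **Rewriting triggers only on poor quality**: the original query is searched first and left alone if the results are good, so a rewrite cannot spoil a query that already works
- ✅ **Request throttling**: consecutive Bing requests are spaced at least 3 s apart to avoid rate limiting
- ✅ **Rate-limit detection + backoff**: on 0 results the script retries with increasing waits, and stops if an exclusion retry also returns 0 results
- ✅ **`--filter` fallback**: if filtering leaves nothing, the unfiltered results are returned instead
- ✅ **Single-domain exclusion retry**: at most 2 rounds; results are replaced only when the retry is better
- ✅ **DuckDuckGo fails fast inside China**: one attempt with a 10 s timeout
- ✅ **`--no-rewrite`**: disables query rewriting (for debugging)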
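
The "rewrite only on poor quality" rule can be sketched roughly as below. This is an illustration under stated assumptions, not the script's actual control flow: `search_fn`, `score_fn`, and `rewrite_fn` are hypothetical callables supplied by the caller, the 0.45 threshold mirrors `QUALITY_THRESHOLD` in `scripts/web_search.py`, and keeping the rewritten results only when they score higher is an assumption.

```python
from typing import Callable

QUALITY_THRESHOLD = 0.45  # same value as QUALITY_THRESHOLD in scripts/web_search.py


def search_with_fallback_rewrite(
    query: str,
    search_fn: Callable[[str], list],
    score_fn: Callable[[list], float],
    rewrite_fn: Callable[[str], tuple],
) -> tuple:
    """Search the original query first; rewrite only if quality is poor.

    Returns (results, query_used). The three callables are hypothetical
    stand-ins for the real search, scoring, and rewriting functions.
    """
    results = search_fn(query)
    if score_fn(results) >= QUALITY_THRESHOLD:
        return results, query  # good enough: leave the query untouched

    rewritten, intent = rewrite_fn(query)
    if intent is None or rewritten == query:
        return results, query  # no rule matched, keep what we have

    retry = search_fn(rewritten)
    # Assumption: the rewritten results win only when they score higher.
    if score_fn(retry) > score_fn(results):
        return retry, rewritten
    return results, query
```

Passing the functions in as arguments keeps the decision logic testable without a browser.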

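## Core capabilities

- ✅ **Tuned for Chinese-language queries**: forces Bing to return Chinese results
- ✅ **Anti-bot detection evasion**: several layers of anti-detection measures (stealth.js)
- ✅ **Full-text fetching**: fetches the complete body text of target pages on demand
- ✅ **Headless mode**: runs on servers with no display attached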

---

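## [Must read] Search query guidelines

**Search quality depends 90% on whether the query is well formed.** Follow these rules strictly when generating search terms:

### 1. Golden rules
1.  **Concise and precise**: keep only the core keywords, combining 2-5 of them; no long sentences, rhetorical questions, or conversational phrasing
2.  **Explicit qualifiers**: when the content must be time-, domain-, or region-specific, add the matching qualifier
3.  **Correct format**: use Chinese keywords plus English or numeric qualifiers; no special symbols or filler particles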

### 2. Good vs. bad examples
| Scenario | Good query (recommended) | Bad query (not allowed) |
|----------|--------------------|--------------------|
| Time-sensitive news | 2026年04月 美伊局势 最新 | 你能帮我查一下最近美国和伊朗之间发生了什么事吗 |
| Technical tutorial | Python 异步编程 最佳实践 2026 | 我想学习一下Python的异步编程，有没有好的教程 |
| General knowledge | 中国大型邮轮 花城号 出坞 最新消息 | 中国的那个大型邮轮花城号现在怎么样了 |
| Local content | 广东东莞 今日天气 | 我现在在东莞，今天天气怎么样啊 |
| Official information | 华为云 ModelArts 官方文档 | 华为云的那个ModelArts的官网在哪里，文档怎么看 |
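
As a rough illustration of the rules above, a caller could lint a query before sending it. The helper below is hypothetical and not part of the skill; the word list and the 2-5 keyword check are assumptions drawn from the guidelines and the example table.

```python
import re

# Hypothetical heuristics derived from the guidelines; not part of the skill.
CONVERSATIONAL = re.compile(r"(你能|帮我|我想|有没有|怎么样|吗|啊|呢|[?？，。！])")


def query_problems(query: str) -> list:
    """Return a list of rule violations; an empty list means the query looks fine."""
    problems = []
    terms = query.split()
    if not 2 <= len(terms) <= 5:
        problems.append(f"expected 2-5 space-separated keywords, got {len(terms)}")
    if CONVERSATIONAL.search(query):
        problems.append("contains conversational words or punctuation")
    return problems
```

For instance, `query_problems("广东东莞 今日天气")` reports nothing, while the conversational variants in the right-hand column trip both checks.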

---

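## Parameters
| Parameter | Type | Description | Default | Allowed values |
|--------|------|------|--------|----------|
| `query` | string | [Required] Search keywords | - | must not be empty |
| `max` | integer | Maximum number of search results returned | 10 | 1-20 |
| `full` | integer | Fetch the full page text of the first N results | 0 | 0-5 |
| `engine` | string | Search engine | bing | bing/duckduckgo/auto |
| `filter` | boolean | Filter out low-quality domains | false | - |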
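
Out-of-range values for `max` and `full` are clamped into the allowed range rather than rejected, as `parse_args` in `scripts/web_search.py` does. A minimal sketch of that behavior; the helper names `clamp` and `normalize_params` are made up for this example:

```python
def clamp(value: int, low: int, high: int) -> int:
    """Force value into the inclusive range [low, high]."""
    return max(low, min(high, value))


def normalize_params(max_results: int = 10, full: int = 0) -> dict:
    """Apply the limits from the table: max in 1-20, full in 0-5."""
    return {"max": clamp(max_results, 1, 20), "full": clamp(full, 0, 5)}
```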

---

## Usage examples

```bash
# Basic search
python scripts/web_search.py "经济新闻 今日" --max=10

# Fetch the full text of the first 3 results
python scripts/web_search.py "经济新闻 最新" --full=3

# auto mode (switches to DuckDuckGo when Bing returns too few results)
python scripts/web_search.py "技术教程" --engine=auto

# Filter out low-quality domains such as Zhihu
python scripts/web_search.py "某个话题" --filter
```

---

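## FAQ

### Search returns no results
1. Check the network connection (a VPN can interfere with the China edition of Bing)
2. Try `--engine=duckduckgo` to query DuckDuckGo directly
3. Check whether the query is too long or too conversational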
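
An empty result set can also mean Bing is rate limiting the client, which is why the script waits and retries before giving up. A minimal sketch of that retry-with-growing-wait pattern, assuming a hypothetical `fetch` callable that performs one search attempt. The 5 s step and 3 attempts mirror `scripts/web_search.py`; `sleep` is injectable only so the example can run without real delays.

```python
import time
from typing import Callable


def search_with_backoff(
    fetch: Callable[[], list],
    max_retries: int = 3,
    base_wait: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Call fetch() up to max_retries times, waiting longer after each empty result."""
    for attempt in range(max_retries):
        results = fetch()
        if results:
            return results
        # Wait 5 s, then 10 s, ...; skip the wait after the final attempt.
        if attempt < max_retries - 1:
            sleep(base_wait * (attempt + 1))
    return []
```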

### Browser fails to launch
```bash
pip install playwright && playwright install chromium
```

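### Full-text fetching fails
- Some sites enforce strict anti-scraping measures
- Domains listed in `BLOCK_DOMAINS` are skipped automatically during full-text fetching; that list is empty in `scripts/web_search.py`, so Zhihu pages are attempted but may still be blocked by anti-scraping

### Results concentrated on a single domain
- The script detects this automatically and prints a `[WARN]` line naming the dominant domain
- **Fix**: switch to more specific keywords and avoid ambiguous terms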
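
The detection amounts to counting hostnames and flagging one that holds more than half of the results. A small standalone sketch of the idea; `dominant_domain` is a name invented for this example, and the 0.5 threshold follows `get_dominant_domain` in `scripts/web_search.py`:

```python
from collections import Counter
from urllib.parse import urlparse


def dominant_domain(urls: list, threshold: float = 0.5):
    """Return the hostname that holds more than `threshold` of the URLs, else None."""
    if not urls:
        return None
    # Treat "www.example.com" and "example.com" as the same site.
    counts = Counter(urlparse(u).netloc.removeprefix("www.") for u in urls)
    domain, n = counts.most_common(1)[0]
    return domain if n > len(urls) * threshold else None
```

`str.removeprefix` needs Python 3.9 or later, which fits the Python 3.10+ requirement stated in `scripts/setup.sh`.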

### Keywords to avoid
| ❌ Avoid | ✅ Prefer |
|---------|---------|
| `民生新闻` | `住房 医疗 就业` or `社会政策 百姓生活` |
| `经济新闻` | `财经政策 GDP` or `A股 沪指` |
| `长护险` | `长期护理保险 养老服务` |

### Server environments
- The script always launches with `headless=True`, so no display is needed
- Server-compatible browser flags are already included

FILE:scripts/setup.sh
#!/usr/bin/env bash
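# free-web-search dependency installer
# Usage: bash scripts/setup.sh
# Option: --mirror <url>  use a specific pip index mirror

set -e

# Empty by default so that no -i/--trusted-host flags are passed without --mirror.
MIRROR=""
if [ "$1" = "--mirror" ] && [ -n "$2" ]; then
    MIRROR="$2"
fi

PIP_ARGS=""
if [ -n "$MIRROR" ]; then
    # Strip the scheme and path to get the bare host for --trusted-host.
    MIRROR_HOST="$(echo "$MIRROR" | sed -E 's|^https?://||' | cut -d/ -f1)"
    PIP_ARGS="-i $MIRROR --trusted-host $MIRROR_HOST"
fi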

echo "=== Installing free-web-search dependencies ==="

# Locate pip
if command -v pip &>/dev/null; then
    PIP=pip
elif command -v pip3 &>/dev/null; then
    PIP=pip3
else
    echo "[ERROR] pip not found; install Python 3.10+ first"
    exit 1
fi

echo "[1/3] Installing Python packages..."
$PIP install httpx beautifulsoup4 playwright $PIP_ARGS

echo "[2/3] Installing the Chromium browser..."
playwright install chromium

echo "[3/3] Verifying the installation..."
python3 -c "
import httpx; print('  httpx OK')
from bs4 import BeautifulSoup; print('  beautifulsoup4 OK')
from playwright.sync_api import sync_playwright; print('  playwright OK')
pw = sync_playwright().start()
b = pw.chromium.launch(headless=True)
b.close()
pw.stop()
print('  chromium OK')
"

echo "=== Installation complete ==="

FILE:scripts/web_search.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
free-web-search v7
Intent detection + query rewriting + request throttling + result quality scoring
+ single-domain exclusion retry + CSS preservation
"""

import sys
import json
import time
import re
import argparse
import subprocess
from urllib.parse import urlencode, quote, urlparse
from datetime import datetime

# ==================== Force UTF-8 ====================
sys.stdout.reconfigure(encoding='utf-8')
sys.stderr.reconfigure(encoding='utf-8')

# ==================== Configuration ====================
DEFAULT_MAX  = 10
DEFAULT_FULL = 0
TIMEOUT      = 30000
FETCH_TIMEOUT= 15000
DDG_TIMEOUT  = 10000
MAX_RETRIES  = 3
DDG_RETRIES  = 1
WAIT_TIME    = 2000

QUALITY_THRESHOLD = 0.45

MIN_REQUEST_INTERVAL = 3.0
_last_request_time = 0.0

UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/125.0.0.0 Safari/537.36"
)

BLOCK_DOMAINS = [
    # Zhihu results can be fetched in full (anti-scraping may kick in, but it is worth trying)
]

LOW_QUALITY_DOMAINS = [
    "jingyan.baidu.com",
    "zhidao.baidu.com",
    "tieba.baidu.com",
    "baike.baidu.com",
    "wenku.baidu.com",
    "bbs.16fan.com",
    "zhihu.com",
    "zhuanlan.zhihu.com",
]

AUTHORITY_HINTS = [
    ".gov.", "gov.cn", ".org.",
    "kitco.com", "sge.com.cn", "cngold.org", "gold.org.cn",
    "kekegold.com", "cngoldprice.com", "ip138.com",
    "finance.sina", "finance.eastmoney", "10jqka.com.cn",
    "jujindata.com", "huilvbiao.com", "jinjia.com.cn",
    "mafengwo.cn", "ctrip.com", "damai.cn",
    "visitshenzhen",
]

FUZZY_TIME_WORDS = re.compile(r'(今日|今天|最新|最近|当前|目前|当下|现在)')

# ==================== Intent detection + query rewriting ====================
CITIES = "深圳|广州|北京|上海|杭州|成都|武汉|南京|重庆|西安|长沙|苏州|厦门|青岛|大连|天津|昆明|珠海|东莞|佛山|惠州|中山"

# Intent rules: (regex, rewrite function, description)
# The rewrite function receives the match object and the original query and returns the full rewritten query
# Principle: only trim or substitute, never add words. Bing CN works best with concise queries
INTENT_RULES = [
    # city + "what is fun / where to go" → trimmed to "<city> 景点" (attractions)
    (re.compile(rf'({CITIES})\s*(有什么好玩的|哪里好玩|好玩的地方|去哪玩|周末.*去哪|好去处|逛|玩什么)'),
     lambda m, q: f'{m.group(1)} 景点', '城市游玩→景点'),

    # city + event → trimmed
    (re.compile(rf'({CITIES})\s*(活动|展览|演出|市集|音乐会|演唱会)'),
     lambda m, q: f'{m.group(1)} {m.group(2)}', '城市活动→精简'),

    # "今日金价" → "金价" (on Bing CN, "今日" tends to match Baidu Jingyan pages, so dropping it works better)
    (re.compile(r'今日(金价|银价|油价|铜价|铂金价)'),
     lambda m, q: f'{m.group(1)}', '今日价格→去掉今日'),

    # "xxx是什么" (what is xxx) → "xxx 介绍" (introduction)
    (re.compile(r'(.+?)是什么(?:意思)?$', re.IGNORECASE),
     lambda m, q: f'{m.group(1)} 介绍', '是什么→介绍'),

    # "xxx怎么样" (how is xxx) → "xxx 评价" (reviews)
    (re.compile(r'(.+?)怎么样$', re.IGNORECASE),
     lambda m, q: f'{m.group(1)} 评价', '怎么样→评价'),

    # "怎么xxx" (how to xxx) → "xxx 方法" (method)
    (re.compile(r'^怎么(.+)', re.IGNORECASE),
     lambda m, q: f'{m.group(1)} 方法', '怎么→方法'),

    # "xxx和yyy哪个好" (which is better, xxx or yyy) → "xxx yyy 对比" (comparison)
    (re.compile(r'(.+?)和(.+?)(哪个好|哪个更好|选哪个)'),
     lambda m, q: f'{m.group(1)} {m.group(2)} 对比', '哪个好→对比'),
]


def rewrite_query(query: str) -> tuple:
    """Intent detection + query rewriting. Only the first matching rule is applied. Returns (rewritten query, intent description or None)."""
    for pattern, rewrite_fn, desc in INTENT_RULES:
        m = pattern.search(query)
        if m:
            rewritten = rewrite_fn(m, query)
            rewritten = re.sub(r'\s+', ' ', rewritten).strip()
            return rewritten, desc
    return query, None


# Anti-detection init script
STEALTH_JS = """
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
Object.defineProperty(navigator, 'plugins', {
    get: () => [
        {name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer'},
        {name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpafafjmlifpcpbgpcj'},
        {name: 'Native Client Executable', filename: 'internal-nacl-plugin'}
    ]
});
Object.defineProperty(navigator, 'languages', {get: () => ['zh-CN', 'zh', 'en']});
Object.defineProperty(navigator, 'platform', {get: () => 'Win32'});
Object.defineProperty(navigator, 'hardwareConcurrency', {get: () => 8});
Object.defineProperty(navigator, 'deviceMemory', {get: () => 8});
window.chrome = {runtime: {}};
const originalQuery = window.navigator.permissions.query;
window.navigator.permissions.query = (parameters) => (
    parameters.name === 'notifications' ?
        Promise.resolve({ state: Notification.permission }) :
        originalQuery(parameters)
);
const originalToDataURL = HTMLCanvasElement.prototype.toDataURL;
HTMLCanvasElement.prototype.toDataURL = function(type) {
    if (type === 'image/png' && this.width === 220 && this.height === 30) {
        return 'data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAANwAAAAeCAYAAABwJ3rwAAAABGdBTUEAALGPC/xhBQAAACBjSFJNAAB6JgAAgIQAAPoAAACA6AAAdTAAAOpgAAA6mAAAF3CculE8AAAABmJLR0QA/wD/AP+gvaeTAAAABmJLR0QA/wD/AP+gvaeTAAAABmJLR0QA/wD/AP+gvaeT';
    }
    return originalToDataURL.apply(this, arguments);
};
const getParameter = WebGLRenderingContext.prototype.getParameter;
WebGLRenderingContext.prototype.getParameter = function(parameter) {
    if (parameter === 37445) return 'Intel Inc.';
    if (parameter === 37446) return 'Intel Iris OpenGL Engine';
    return getParameter.apply(this, arguments);
};
"""

BROWSER_ARGS = [
    '--no-sandbox', '--disable-gpu', '--disable-dev-shm-usage',
    '--disable-blink-features=AutomationControlled', '--disable-infobars',
    '--disable-extensions', '--disable-background-networking', '--disable-sync',
    '--metrics-recording-only', '--disable-default-apps', '--no-first-run',
    '--disable-component-extensions-with-background-pages',
    '--disable-features=IsolateOrigins,site-per-process',
    '--disable-site-isolation-trials', '--disable-web-security',
    '--allow-running-insecure-content',
]

ROUTE_PATTERN = "**/*.{png,jpg,jpeg,gif,svg,woff,woff2,ttf,mp4,ico,webp,js.map}"

_browser = None
_playwright = None


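def ensure_playwright():
    """Install playwright and Chromium on first run, then re-exec this script."""
    try:
        import playwright  # noqa: F401
        return
    except ImportError:
        pass
    print("[INFO] Installing playwright...", file=sys.stderr)
    for cmd in [
        [sys.executable, "-m", "pip", "install", "-q", "playwright", "--break-system-packages"],
        [sys.executable, "-m", "pip", "install", "-q", "playwright"],
    ]:
        if subprocess.run(cmd, capture_output=True, text=True).returncode == 0:
            break
    else:
        # Both pip invocations failed; exit instead of re-executing in an endless loop.
        print("[ERROR] Could not install playwright; run scripts/setup.sh manually", file=sys.stderr)
        sys.exit(1)
    subprocess.run([sys.executable, "-m", "playwright", "install", "chromium"],
                   capture_output=True, text=True)
    import os
    os.execv(sys.executable, [sys.executable] + sys.argv)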


def parse_args():
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("query", nargs="*")
    parser.add_argument("--max", type=int, default=DEFAULT_MAX, dest="max_results")
    parser.add_argument("--full", type=int, default=DEFAULT_FULL)
    parser.add_argument("--engine", type=str, default="bing", choices=["bing", "duckduckgo", "auto"])
    parser.add_argument("--filter", action="store_true", help="filter out low-quality domains")
    parser.add_argument("--no-rewrite", action="store_true", help="disable query rewriting")
    args = parser.parse_args()
    query = " ".join(args.query).strip()
    args.max_results = max(1, min(20, args.max_results))
    args.full = max(0, min(5, args.full))
    return query, args.max_results, args.full, args.engine, args.filter, args.no_rewrite


def build_bing_url(query, count):
    return "https://cn.bing.com/search?" + urlencode({
        "q": query, "mkt": "zh-CN", "setlang": "zh-CN", "cc": "CN", "count": str(count + 2)
    })

def build_duckduckgo_url(query):
    return "https://duckduckgo.com/?q=" + quote(query) + "&ia=web"


def init_browser():
    global _browser, _playwright
    if _browser: return
    from playwright.sync_api import sync_playwright
    print("[DEBUG] Launching Chromium...", file=sys.stderr)
    _playwright = sync_playwright().start()
    _browser = _playwright.chromium.launch(headless=True, args=BROWSER_ARGS)
    print("[DEBUG] Chromium ready", file=sys.stderr)

def close_browser():
    try:
        if _browser: _browser.close()
        if _playwright: _playwright.stop()
    except: pass

def create_context():
    ctx = _browser.new_context(
        locale="zh-CN", user_agent=UA,
        viewport={"width":1920,"height":1080}, screen={"width":1920,"height":1080},
        device_scale_factor=1, timezone_id="Asia/Shanghai",
        has_touch=False, is_mobile=False, java_script_enabled=True,
    )
    ctx.add_init_script(STEALTH_JS)
    return ctx

def throttle():
    global _last_request_time
    gap = MIN_REQUEST_INTERVAL - (time.time() - _last_request_time)
    if gap > 0:
        time.sleep(gap)
    _last_request_time = time.time()

def is_blocked_domain(url): return any(d in url for d in BLOCK_DOMAINS)
def is_low_quality_domain(url): return any(d in url for d in LOW_QUALITY_DOMAINS)

def score_result(r):
    s = 0.5
    url, snippet = r.get("url",""), r.get("snippet","")
    if is_low_quality_domain(url): s -= 0.3
    if re.search(r'\d{2,}', snippet): s += 0.15
    if len(snippet) < 20: s -= 0.1
    for h in AUTHORITY_HINTS:
        if h in url: s += 0.2; break
    return max(0.0, min(1.0, s))

def score_results(results):
    return sum(score_result(r) for r in results) / len(results) if results else 0.0

def get_dominant_domain(results):
    if not results: return (None, 0, 0)
    domains = {}
    for r in results:
        # Strip only a leading "www."; str.replace would also remove it mid-hostname.
        d = urlparse(r["url"]).netloc.removeprefix("www.")
        domains[d] = domains.get(d, 0) + 1
    top = max(domains, key=domains.get)
    if domains[top] > len(results) * 0.5:
        return (top, domains[top], len(results))
    return (None, 0, len(results))

def merge_results(primary, secondary, max_results):
    seen, merged = set(), []
    for r in primary + secondary:
        if r["url"] not in seen: seen.add(r["url"]); merged.append(r)
    return merged[:max_results]

def apply_filter(results, do_filter):
    if do_filter and results:
        filtered = [r for r in results if not is_blocked_domain(r["url"]) and not is_low_quality_domain(r["url"])]
        if filtered: return filtered
        print("[WARN] --filter removed every result, falling back to unfiltered results", file=sys.stderr)
    return results


# ==================== Bing search ====================
def search_bing(query, max_results, do_filter=False):
    start = time.time()
    url = build_bing_url(query, max_results + 5)
    print(f"[DEBUG] Bing: {query} | max={max_results}", file=sys.stderr)
    init_browser()
    results = []
    for attempt in range(MAX_RETRIES):
        throttle()
        ctx, page = create_context(), None
        try:
            page = ctx.new_page()
            page.route(ROUTE_PATTERN, lambda r: r.abort())
            page.goto(url, timeout=TIMEOUT, wait_until="domcontentloaded")
            page.wait_for_timeout(WAIT_TIME)
            raw = page.evaluate("""() => {
                const items = [];
                document.querySelectorAll('li.b_algo').forEach(el => {
                    try {
                        const a = el.querySelector('h2 a');
                        const p = el.querySelector('.b_caption p, .b_algoSlug');
                        if (a && a.href && a.href.startsWith('http'))
                            items.push({title:(a.innerText||a.textContent||'').trim(), url:a.href.trim(), snippet:p?(p.innerText||p.textContent||'').trim():''});
                    } catch(e) {}
                });
                return items;
            }""")
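            for r in raw:
                # Retries load the same URL, so skip results already collected on an earlier attempt.
                if any(x["url"] == r["url"] for x in results):
                    continue
                if r["title"] and r["url"] and len(r["title"]) > 3:
                    results.append({"title":r["title"],"url":r["url"],"snippet":r["snippet"],"content":""})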
            if len(results) >= max_results: break
            if len(raw) == 0 and attempt < MAX_RETRIES - 1:
                w = 5 * (attempt + 1)
                print(f"[WARN] 0 results, possibly rate limited; waiting {w}s", file=sys.stderr)
                time.sleep(w)
        except Exception as e:
            print(f"[WARN] Bing attempt {attempt+1} failed: {e}", file=sys.stderr)
        finally:
            try: ctx.close()
            except: pass
    results = apply_filter(results, do_filter)[:max_results]
    dom, dc, tot = get_dominant_domain(results)
    if dom: print(f"[WARN] Single-domain concentration: {dom} ({dc}/{tot})", file=sys.stderr)
    q = score_results(results)
    print(f"[DEBUG] quality: {q:.2f} | count: {len(results)} | elapsed: {time.time()-start:.1f}s", file=sys.stderr)
    return results, dom


# ==================== DuckDuckGo ====================
def search_duckduckgo(query, max_results, do_filter=False):
    start = time.time()
    url = build_duckduckgo_url(query)
    print(f"[DEBUG] DDG: {query}", file=sys.stderr)
    init_browser()
    results = []
    for _ in range(DDG_RETRIES):
        ctx, page = create_context(), None
        try:
            page = ctx.new_page()
            page.route(ROUTE_PATTERN, lambda r: r.abort())
            page.goto(url, timeout=DDG_TIMEOUT, wait_until="domcontentloaded")
            page.wait_for_timeout(WAIT_TIME + 1000)
            raw = page.evaluate("""() => {
                const items = [];
                for (const sel of ['article[data-testid="result"]','.result','[data-testid="result"]','li[data-layout="organic"]']) {
                    document.querySelectorAll(sel).forEach(el => {
                        try {
                            const a=el.querySelector('a[href^="http"]'),t=el.querySelector('h2,.result__a,[data-testid="result-title"] span'),s=el.querySelector('[data-testid="result-snippet"],.result__snippet');
                            if(a&&a.href&&t) items.push({title:(t.innerText||t.textContent||'').trim(),url:a.href.trim(),snippet:s?(s.innerText||'').trim():''});
                        } catch(e) {}
                    });
                    if(items.length>0) break;
                }
                return items;
            }""")
            for r in raw:
                if r["title"] and r["url"] and len(r["title"]) > 3:
                    results.append({"title":r["title"],"url":r["url"],"snippet":r["snippet"],"content":""})
            if len(results) >= max_results: break
        except Exception as e:
            print(f"[WARN] DDG failed: {e}", file=sys.stderr)
        finally:
            try: ctx.close()
            except: pass
    results = apply_filter(results, do_filter)[:max_results]
    print(f"[DEBUG] DDG: {len(results)} results | {time.time()-start:.1f}s", file=sys.stderr)
    return results, None


# ==================== Full-text fetch ====================
def fetch_full(url):
    start = time.time()
    print(f"[DEBUG] fetching full text: {url}", file=sys.stderr)
    if is_blocked_domain(url): return "黑名单域名，跳过"
    ctx, page = create_context(), None
    text = ""
    try:
        page = ctx.new_page()
        page.route(ROUTE_PATTERN, lambda r: r.abort())
        page.goto(url, timeout=FETCH_TIMEOUT, wait_until="domcontentloaded")
        page.wait_for_timeout(800)
        # Best effort: keep going if the page never reaches network idle.
        try: page.wait_for_load_state("networkidle", timeout=5000)
        except Exception: pass
        text = page.evaluate("""() => {
            document.querySelectorAll('script,style,nav,header,footer,.ad,.ads,[class*="banner"],[id*="banner"],.sidebar,.comment,.popup,.modal,.cookie').forEach(e=>e.remove());
            for (const sel of ['article','main','.content','.post','.article','#content','#main','.entry-content','.post-content','[itemprop="articleBody"]']) {
                const m=document.querySelector(sel); if(m&&m.innerText.length>200) return m.innerText;
            }
            return document.body?document.body.innerText:'';
        }""")
    except Exception as e:
        print(f"[ERROR] full-text fetch failed: {e}", file=sys.stderr)
    finally:
        try: ctx.close()
        except Exception: pass
    result = (text or "").strip()[:8000]
    print(f"[DEBUG] full text: {len(result)} chars | {time.time()-start:.1f}s", file=sys.stderr)
    return result or "抓取失败"


# ==================== Main ====================
def main():
    ensure_playwright()
    query, max_results, full, engine, do_filter, no_rewrite = parse_args()
    if not query:
        print(json.dumps({"error": "no query"}, ensure_ascii=False)); sys.exit(1)

    # Intent detection and query rewriting run only when result quality is poor; the query is not rewritten up front.
    original_query = query

    results = []

    if engine == "duckduckgo":
        results, _ = search_duckduckgo(query, max_results, do_filter)

    else:
        # Step 1: search with the original query
        results, dominant = search_bing(query, max_results, do_filter)
        quality = score_results(results)

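        # Step 2: results dominated by one domain, or led by a low-quality domain → exclude it and retry
        # (at most 2 rounds; stop when a retry returns nothing, e.g. when rate-limited)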
        if len(results) > 0:
            excluded = set()
            for _ in range(2):
                target = None
                if dominant and dominant not in excluded:
                    target = dominant
                elif results and is_low_quality_domain(results[0]["url"]):
                    d = urlparse(results[0]["url"]).netloc.replace("www.", "")
                    if d not in excluded: target = d
                if not target: break
                excluded.add(target)
                rq = query + " " + " ".join(f"-site:{d}" for d in excluded)
                print(f"[INFO] excluding ({target}) and retrying", file=sys.stderr)
                rr, _ = search_bing(rq, max_results, do_filter)
                if not rr: break
                rq_score = score_results(rr)
                if rq_score > quality:
                    results = merge_results(rr, results, max_results)
                    quality = score_results(results)
                    dominant, _, _ = get_dominant_domain(results)
                else: break

        # Step 3: quality still low → retry with a rewritten query
        # (only when rewriting is enabled and the rewrite differs from the original query)
        if not no_rewrite and quality < QUALITY_THRESHOLD and len(results) > 0:
            rewritten, intent = rewrite_query(query)
            if intent and rewritten != query:
                print(f"[INFO] low quality ({quality:.2f}), intent rewrite ({intent}): {query} → {rewritten}", file=sys.stderr)
                rr, _ = search_bing(rewritten, max_results, do_filter)
                if rr and score_results(rr) > quality:
                    results = rr
                    quality = score_results(results)

        # Step 4: auto mode, quality still low → simplify the query (strip fuzzy time words)
        if engine == "auto" and quality < QUALITY_THRESHOLD and len(results) > 0:
            simplified = re.sub(FUZZY_TIME_WORDS, '', query).strip()
            simplified = re.sub(r'\s+', ' ', simplified).strip()
            if simplified != query and len(simplified) >= 2:
                print(f"[INFO] retrying with simplified query: {simplified}", file=sys.stderr)
                rr, _ = search_bing(simplified, max_results, do_filter)
                if rr and score_results(rr) > quality:
                    results = rr

        # Step 5: auto mode, Bing returned nothing at all → fall back to DDG
        if engine == "auto" and not results:
            print("[INFO] Bing returned no results, falling back to DDG...", file=sys.stderr)
            results, _ = search_duckduckgo(query, max_results, do_filter)

    # Full-text fetch for the top `full` results
    if full > 0 and results:
        for i in range(min(full, len(results))):
            results[i]["content"] = fetch_full(results[i]["url"])

    print(json.dumps(results, ensure_ascii=False, indent=2))


if __name__ == "__main__":
    try: main()
    finally: close_browser()
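
The domain-exclusion retry in step 2 of `main()` can be tried on its own. The sketch below is a minimal stand-alone illustration rather than the script's own helpers: `normalize_domain` and `build_exclusion_query` are hypothetical names introduced here, and it assumes the search engine honours the `-site:` operator the way the script expects.

```python
from urllib.parse import urlparse


def normalize_domain(url):
    """Return the host of `url` without a leading "www." (hypothetical helper)."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def build_exclusion_query(query, excluded):
    """Append one -site: operator per excluded domain (sorted for stable output)."""
    if not excluded:
        return query
    return query + " " + " ".join(f"-site:{d}" for d in sorted(excluded))


if __name__ == "__main__":
    excluded = {normalize_domain("https://www.example.com/a?x=1")}
    # prints: playwright tutorial -site:example.com
    print(build_exclusion_query("playwright tutorial", excluded))
```

Sorting the excluded domains is a choice made for this sketch so that the same set always yields the same query string; the script itself iterates the set directly.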
