@clawhub-ucsdzehualiu-001da531f9
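Playwright-driven web search tool that automatically fetches the content of the top three result pages. No API key is required; it uses Bing inside China and DuckDuckGo (DDG) elsewhere.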
# SKILL.md
---
name: free-web-search-js
description: Playwright-driven web search with automatic content fetching, no API key required
version: 28.0.0
trigger_keywords:
  - 搜索
  - 查一下
  - 找一下
  - 最新消息
  - 新闻
  - 教程
  - 是什么
  - search
  - find
tools:
  - name: search
    description: Search plus automatic fetching; Bing via Playwright in China, DDG over plain HTTP elsewhere
    script: scripts/search.js
    parameters:
      query:
        type: string
        description: "Search keywords"
        required: true
      max:
        type: integer
        description: "Maximum number of results; default 10, capped at 30"
        required: false
      region:
        type: string
        description: "Region: auto/cn/intl; default auto (detected from IP)"
        required: false
  - name: fetch
    description: Fetch the main text of the given URLs; tries HTTP first and falls back to a headed browser on failure
    script: scripts/fetch.js
    parameters:
      urls:
        type: string
        description: "URLs to fetch, separated by spaces"
        required: true
      max-len:
        type: integer
        description: "Maximum characters per page; default 12000"
        required: false
---
# free-web-search-js
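One step: **search** → run the query (Playwright for Bing CN, plain HTTP for DDG) → fetch page content automatically → return the results
## Architecture
```
China:
Playwright opens Bing → loads the home page to obtain cookies → submits the query through the search box
→ automatically fetches the content of the top 3 pages
Latency: 3 to 6 s on the first run (browser startup); faster once the browser is reused
International:
Plain HTTP → parse the DDG HTML
→ automatically fetches the content of the top 3 pages
Latency: a few hundred ms to 1 s
```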
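## Search engines
| Engine | Protocol | Region | Notes |
|------|------|------|------|
| Bing CN | Playwright, search-box submission | China | Visits the home page first to obtain cookies, then types the query into the search box and submits it |
| Sogou | Plain HTTP | China | Optional via `--engine=sogou`; ⚠ without cookies it is easily blocked by anti-scraping measures, so results are unstable |
| DDG HTML Lite | Plain HTTP | International | html.duckduckgo.com |
### Strategy
| Region | Search | Fetch |
|------|------|------|
| China | Bing CN (Playwright) | Automatically fetches the top 3 results |
| International | DDG HTML | Automatically fetches the top 3 results |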
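The strategy table can be read as a small lookup keyed on the resolved region. The sketch below is illustrative only: `STRATEGIES` and `pickStrategy` are hypothetical names, not taken from `search.js`, and the fallback to `cn` is an assumption modeled on the documented default region.

```javascript
// Hypothetical lookup mirroring the strategy table; not the actual search.js code.
const STRATEGIES = {
  cn: { engine: 'Bing CN', transport: 'playwright', autoFetch: 3 },
  intl: { engine: 'DDG HTML', transport: 'http', autoFetch: 3 },
};

// Expects an already resolved region ('cn' or 'intl').
// Anything else falls back to 'cn' (assumed, following the documented default).
function pickStrategy(region) {
  const known = Object.prototype.hasOwnProperty.call(STRATEGIES, region);
  return known ? STRATEGIES[region] : STRATEGIES.cn;
}

console.log(pickStrategy('intl').engine); // DDG HTML
```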
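### How the IP region is determined
The region is detected automatically on every search. Three probe rounds run in parallel, and the first one to succeed wins:
| Round | Probe service | Logic |
|------|---------|------|
| Round 1 | `myip.ipip.net` / `cip.cc` | Services reachable from China are preferred |
| Round 2 | `ipinfo.io` / `ipapi.co` | International probes |
| Round 3 | Test connection to `cn.bing.com` | If it connects, the client is most likely in China |
| Fallback | - | Defaults to China |
Detection can be wrong when the egress IP goes through a proxy; pass `--region=cn` or `--region=intl` to set the region manually.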
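The "run in parallel, first success wins, otherwise default to China" logic maps naturally onto `Promise.any`. This is a minimal sketch under stated assumptions: `detectRegion` is a hypothetical name, and each probe is assumed to be an async function that resolves to `'cn'` or `'intl'` and rejects when its service is unreachable. How the real probes interpret each service's response is not shown here.

```javascript
// Sketch only: probe internals are assumed, not taken from search.js.
async function detectRegion(probes, fallback = 'cn') {
  try {
    // Promise.any resolves with the first probe that fulfills and
    // rejects (AggregateError) only when every probe has failed.
    return await Promise.any(probes.map((probe) => probe()));
  } catch {
    return fallback; // all probes failed: use the documented default
  }
}

// Usage with stand-in probes (no network access):
const delay = (ms, value) => new Promise((resolve) => setTimeout(() => resolve(value), ms));
detectRegion([
  () => Promise.reject(new Error('unreachable')),
  () => delay(20, 'intl'),
  () => delay(50, 'cn'),
]).then((region) => console.log(region)); // logs "intl"
```

Because the fastest successful probe decides the result, a proxy that makes an international service answer first will flip the outcome, which is why the manual `--region` override exists.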
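## Deduplication
Smart deduplication keys each result on its domain plus the path stem, ignoring `www`/`m` subdomains, tracking parameters, trailing slashes, and the `.html` suffix.
Bing redirect URLs (`bing.com/ck/`) are automatically decoded into direct links.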
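One way to build such a key is sketched below. `dedupKey`, `dedupe`, and the `TRACKING_PARAMS` list are hypothetical and chosen for illustration; the actual script may ignore a different set of parameters or normalize paths differently.

```javascript
// Assumed list of tracking parameters; the real script may use another set.
const TRACKING_PARAMS = /^(utm_\w+|spm|fbclid|gclid)$/i;

// Builds a key from the host (without www./m.) and a normalized path.
function dedupKey(rawUrl) {
  const u = new URL(rawUrl);
  const host = u.hostname.toLowerCase().replace(/^(www|m)\./, '');
  const path = u.pathname
    .replace(/\/+$/, '') // drop trailing slashes
    .replace(/\.html$/i, ''); // drop the .html suffix
  const params = [...u.searchParams]
    .filter(([name]) => !TRACKING_PARAMS.test(name))
    .sort(([a], [b]) => a.localeCompare(b))
    .map(([name, value]) => `${name}=${value}`)
    .join('&');
  return host + path + (params ? `?${params}` : '');
}

// Keeps the first result for each key and drops later duplicates.
function dedupe(results) {
  const seen = new Set();
  return results.filter((r) => {
    const key = dedupKey(r.url);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

With this key, `https://www.example.com/a/post.html?utm_source=x` and `http://m.example.com/a/post/` collapse to the same entry, while non-tracking query parameters still distinguish pages.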
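## Fetch modes
After a search, the content of the top N URLs is fetched automatically (3 by default).
| Tier | Method | Speed | Notes |
|------|------|------|------|
| Tier 1 | Lightweight HTTP + cheerio | ⚡ Near-instant | Does not start a browser |
| Tier 2 | Playwright headed | 🟡 Slow | Full browser, supports JS rendering |
Tier 1 enhancements:
- **JSON API responses**: detects the Content-Type automatically and extracts structured content
- **JSON-LD**: extracts articleBody/description from `<script type="application/ld+json">`
- **__NEXT_DATA__**: extracts embedded Next.js data
- **meta tags**: og:description / description as a fallback
- **GBK encoding**: detected and converted automatically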
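## Installation
**Prerequisites (all required):**
| Dependency | Description | Size / time |
|------|------|----------|
| Node.js >= 18 | Runtime (note: the bundled package-lock.json resolves cheerio 1.2.0 and undici 7.25.0, both of which declare `node >= 20.18.1`) | - |
| cheerio | HTML parsing | Small, installs in seconds |
| commander | CLI argument parsing | Small, installs in seconds |
| iconv-lite | GBK encoding conversion | Small, installs in seconds |
| playwright | Browser automation (Bing search + fetch fallback) | ~50MB |
| Chromium | Browser build used by Playwright | **~150MB, takes a few minutes to download** |
The setup scripts detect the network region automatically and use mirror registries inside China for faster downloads: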
```bash
# Windows
powershell -File scripts/setup.ps1
# Linux/macOS
bash scripts/setup.sh
```
China mirrors:
- npm: `https://registry.npmmirror.com`
- Playwright/Chromium: `https://npmmirror.com/mirrors/playwright`
Manual installation:
```bash
cd skills/free-web-search-js
npm install
npx playwright install chromium # ~150MB, takes a few minutes
```
Verify the environment: `node scripts/check-env.js`
Uninstall: `node scripts/uninstall.js`
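## Performance: browser daemon
Search and fetch can reuse a browser daemon, which is **about 70% faster**:
```bash
node scripts/browser-daemon.js & # start
node scripts/browser-daemon.js --status # status
node scripts/browser-daemon.js --stop # stop
```
The daemon exits automatically after 10 minutes of idle time.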
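How the idle timeout is wired is not described here. As a minimal sketch of one possible approach, a resettable timer can trigger shutdown once no activity has been seen for the limit. `createIdleWatchdog` and `IDLE_LIMIT_MS` are hypothetical names, and what counts as "activity" (for example, a client connecting) is an assumption left to the caller.

```javascript
// Hypothetical idle watchdog; names and wiring are illustrative only.
const IDLE_LIMIT_MS = 10 * 60 * 1000; // 10 minutes, as documented

function createIdleWatchdog(limitMs, onIdle) {
  let timer = null;
  // Call touch() on every sign of activity to restart the countdown.
  const touch = () => {
    clearTimeout(timer);
    timer = setTimeout(onIdle, limitMs);
  };
  // Call cancel() during shutdown so the timer does not keep the process alive.
  const cancel = () => clearTimeout(timer);
  touch(); // start counting from creation
  return { touch, cancel };
}

// Possible wiring inside a daemon (not executed here):
// const watchdog = createIdleWatchdog(IDLE_LIMIT_MS, () => process.kill(process.pid, 'SIGTERM'));
```

Sending `SIGTERM` to the daemon's own PID would reuse whatever cleanup its existing signal handler performs instead of duplicating it.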
## Usage
```bash
# Search (search, then automatically fetch the content of the top 3 results)
node scripts/search.js "白银价格"
node scripts/search.js "how to deploy docker" --max=5
node scripts/search.js "xxx" --region=cn
node scripts/search.js "xxx" --fetch=5 # fetch the top 5 results
node scripts/search.js "xxx" --no-fetch # search only, no fetching
# Fetch only (given URLs)
node scripts/fetch.js "https://example.com/page1" "https://example.com/page2"
```
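`fetch.js` prints a JSON array to stdout with one object per URL, carrying `url`, `status`, `content`, and, on failure, `error`. A caller might split that output as sketched below. `splitFetchResults` is a hypothetical helper and not part of the package; the sample payload is hand-written in the same shape.

```javascript
// Hypothetical helper: parses the JSON array printed by scripts/fetch.js
// and separates pages that returned content from pages that failed.
function splitFetchResults(stdoutText) {
  const results = JSON.parse(stdoutText);
  const ok = results.filter((r) => r.content);
  const failed = results
    .filter((r) => !r.content)
    .map((r) => ({ url: r.url, reason: r.error || `HTTP ${r.status}` }));
  return { ok, failed };
}

// Example with a hand-written payload:
const sample = JSON.stringify([
  { url: 'https://example.com/page1', status: 200, content: 'Hello world' },
  { url: 'https://example.com/page2', status: 0, content: '', error: 'fetch failed' },
]);
const { ok, failed } = splitFetchResults(sample);
console.log(ok.length, failed[0].reason); // 1 fetch failed
```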
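## Known limitations
- **The first search in China is slow**: Chromium has to start (3 to 6 s); later searches reuse it and are faster
- **Bing CN instant answers are not returned**: instant cards such as weather or calculator do not use `li.b_algo`, so such queries yield 0 search results
- **Sogou over HTTP is unstable**: cookie-less plain requests are easily blocked by anti-scraping measures and may return nothing (use `--engine=sogou` with caution)
- **Some sites cannot be fetched over HTTP**: pages that need JS rendering fail over HTTP and are retried automatically in headed mode
- **Some sites are unreachable from outside China**: China-only sites may time out when accessed from abroad
- **Proxies interfere with IP detection**: the region may be misdetected when the egress IP goes through a proxy; set `--region=cn` or `--region=intl` manually
- **The international engine is unreachable from China**: DDG is blocked in China, so the China strategy does not use it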
FILE:package.json
{
"name": "free-web-search-js",
"version": "28.0.0",
"type": "module",
"description": "Playwright web search: Bing/Sogou in China, DDG internationally, automatic fetching, no API key",
"scripts": {
"search": "node scripts/search.js",
"fetch": "node scripts/fetch.js"
},
"dependencies": {
"cheerio": "^1.0.0",
"commander": "^12.0.0",
"iconv-lite": "^0.6.3",
"playwright": "^1.52.0"
}
}
FILE:package-lock.json
{
"name": "free-web-search",
"version": "15.0.0",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "free-web-search",
"version": "15.0.0",
"dependencies": {
"cheerio": "^1.0.0",
"commander": "^12.0.0",
"playwright": "^1.59.1"
},
"optionalDependencies": {
"playwright": "^1.59.1"
}
},
"node_modules/boolbase": {
"version": "1.0.0",
"resolved": "https://registry.npmmirror.com/boolbase/-/boolbase-1.0.0.tgz",
"integrity": "sha512-JZOSA7Mo9sNGB8+UjSgzdLtokWAky1zbztM3WRLCbZ70/3cTANmQmOdR7y2g+J0e2WXywy1yS468tY+IruqEww==",
"license": "ISC"
},
"node_modules/cheerio": {
"version": "1.2.0",
"resolved": "https://registry.npmmirror.com/cheerio/-/cheerio-1.2.0.tgz",
"integrity": "sha512-WDrybc/gKFpTYQutKIK6UvfcuxijIZfMfXaYm8NMsPQxSYvf+13fXUJ4rztGGbJcBQ/GF55gvrZ0Bc0bj/mqvg==",
"license": "MIT",
"dependencies": {
"cheerio-select": "^2.1.0",
"dom-serializer": "^2.0.0",
"domhandler": "^5.0.3",
"domutils": "^3.2.2",
"encoding-sniffer": "^0.2.1",
"htmlparser2": "^10.1.0",
"parse5": "^7.3.0",
"parse5-htmlparser2-tree-adapter": "^7.1.0",
"parse5-parser-stream": "^7.1.2",
"undici": "^7.19.0",
"whatwg-mimetype": "^4.0.0"
},
"engines": {
"node": ">=20.18.1"
},
"funding": {
"url": "https://github.com/cheeriojs/cheerio?sponsor=1"
}
},
"node_modules/cheerio-select": {
"version": "2.1.0",
"resolved": "https://registry.npmmirror.com/cheerio-select/-/cheerio-select-2.1.0.tgz",
"integrity": "sha512-9v9kG0LvzrlcungtnJtpGNxY+fzECQKhK4EGJX2vByejiMX84MFNQw4UxPJl3bFbTMw+Dfs37XaIkCwTZfLh4g==",
"license": "BSD-2-Clause",
"dependencies": {
"boolbase": "^1.0.0",
"css-select": "^5.1.0",
"css-what": "^6.1.0",
"domelementtype": "^2.3.0",
"domhandler": "^5.0.3",
"domutils": "^3.0.1"
},
"funding": {
"url": "https://github.com/sponsors/fb55"
}
},
"node_modules/commander": {
"version": "12.1.0",
"resolved": "https://registry.npmmirror.com/commander/-/commander-12.1.0.tgz",
"integrity": "sha512-Vw8qHK3bZM9y/P10u3Vib8o/DdkvA2OtPtZvD871QKjy74Wj1WSKFILMPRPSdUSx5RFK1arlJzEtA4PkFgnbuA==",
"license": "MIT",
"engines": {
"node": ">=18"
}
},
"node_modules/css-select": {
"version": "5.2.2",
"resolved": "https://registry.npmmirror.com/css-select/-/css-select-5.2.2.tgz",
"integrity": "sha512-TizTzUddG/xYLA3NXodFM0fSbNizXjOKhqiQQwvhlspadZokn1KDy0NZFS0wuEubIYAV5/c1/lAr0TaaFXEXzw==",
"license": "BSD-2-Clause",
"dependencies": {
"boolbase": "^1.0.0",
"css-what": "^6.1.0",
"domhandler": "^5.0.2",
"domutils": "^3.0.1",
"nth-check": "^2.0.1"
},
"funding": {
"url": "https://github.com/sponsors/fb55"
}
},
"node_modules/css-what": {
"version": "6.2.2",
"resolved": "https://registry.npmmirror.com/css-what/-/css-what-6.2.2.tgz",
"integrity": "sha512-u/O3vwbptzhMs3L1fQE82ZSLHQQfto5gyZzwteVIEyeaY5Fc7R4dapF/BvRoSYFeqfBk4m0V1Vafq5Pjv25wvA==",
"license": "BSD-2-Clause",
"engines": {
"node": ">= 6"
},
"funding": {
"url": "https://github.com/sponsors/fb55"
}
},
"node_modules/dom-serializer": {
"version": "2.0.0",
"resolved": "https://registry.npmmirror.com/dom-serializer/-/dom-serializer-2.0.0.tgz",
"integrity": "sha512-wIkAryiqt/nV5EQKqQpo3SToSOV9J0DnbJqwK7Wv/Trc92zIAYZ4FlMu+JPFW1DfGFt81ZTCGgDEabffXeLyJg==",
"license": "MIT",
"dependencies": {
"domelementtype": "^2.3.0",
"domhandler": "^5.0.2",
"entities": "^4.2.0"
},
"funding": {
"url": "https://github.com/cheeriojs/dom-serializer?sponsor=1"
}
},
"node_modules/domelementtype": {
"version": "2.3.0",
"resolved": "https://registry.npmmirror.com/domelementtype/-/domelementtype-2.3.0.tgz",
"integrity": "sha512-OLETBj6w0OsagBwdXnPdN0cnMfF9opN69co+7ZrbfPGrdpPVNBUj02spi6B1N7wChLQiPn4CSH/zJvXw56gmHw==",
"funding": [
{
"type": "github",
"url": "https://github.com/sponsors/fb55"
}
],
"license": "BSD-2-Clause"
},
"node_modules/domhandler": {
"version": "5.0.3",
"resolved": "https://registry.npmmirror.com/domhandler/-/domhandler-5.0.3.tgz",
"integrity": "sha512-cgwlv/1iFQiFnU96XXgROh8xTeetsnJiDsTc7TYCLFd9+/WNkIqPTxiM/8pSd8VIrhXGTf1Ny1q1hquVqDJB5w==",
"license": "BSD-2-Clause",
"dependencies": {
"domelementtype": "^2.3.0"
},
"engines": {
"node": ">= 4"
},
"funding": {
"url": "https://github.com/fb55/domhandler?sponsor=1"
}
},
"node_modules/domutils": {
"version": "3.2.2",
"resolved": "https://registry.npmmirror.com/domutils/-/domutils-3.2.2.tgz",
"integrity": "sha512-6kZKyUajlDuqlHKVX1w7gyslj9MPIXzIFiz/rGu35uC1wMi+kMhQwGhl4lt9unC9Vb9INnY9Z3/ZA3+FhASLaw==",
"license": "BSD-2-Clause",
"dependencies": {
"dom-serializer": "^2.0.0",
"domelementtype": "^2.3.0",
"domhandler": "^5.0.3"
},
"funding": {
"url": "https://github.com/fb55/domutils?sponsor=1"
}
},
"node_modules/encoding-sniffer": {
"version": "0.2.1",
"resolved": "https://registry.npmmirror.com/encoding-sniffer/-/encoding-sniffer-0.2.1.tgz",
"integrity": "sha512-5gvq20T6vfpekVtqrYQsSCFZ1wEg5+wW0/QaZMWkFr6BqD3NfKs0rLCx4rrVlSWJeZb5NBJgVLswK/w2MWU+Gw==",
"license": "MIT",
"dependencies": {
"iconv-lite": "^0.6.3",
"whatwg-encoding": "^3.1.1"
},
"funding": {
"url": "https://github.com/fb55/encoding-sniffer?sponsor=1"
}
},
"node_modules/entities": {
"version": "4.5.0",
"resolved": "https://registry.npmmirror.com/entities/-/entities-4.5.0.tgz",
"integrity": "sha512-V0hjH4dGPh9Ao5p0MoRY6BVqtwCjhz6vI5LT8AJ55H+4g9/4vbHx1I54fS0XuclLhDHArPQCiMjDxjaL8fPxhw==",
"license": "BSD-2-Clause",
"engines": {
"node": ">=0.12"
},
"funding": {
"url": "https://github.com/fb55/entities?sponsor=1"
}
},
"node_modules/fsevents": {
"version": "2.3.2",
"resolved": "https://registry.npmmirror.com/fsevents/-/fsevents-2.3.2.tgz",
"integrity": "sha512-xiqMQR4xAeHTuB9uWm+fFRcIOgKBMiOBP+eXiyT7jsgVCq1bkVygt00oASowB7EdtpOHaaPgKt812P9ab+DDKA==",
"hasInstallScript": true,
"license": "MIT",
"optional": true,
"os": [
"darwin"
],
"engines": {
"node": "^8.16.0 || ^10.6.0 || >=11.0.0"
}
},
"node_modules/htmlparser2": {
"version": "10.1.0",
"resolved": "https://registry.npmmirror.com/htmlparser2/-/htmlparser2-10.1.0.tgz",
"integrity": "sha512-VTZkM9GWRAtEpveh7MSF6SjjrpNVNNVJfFup7xTY3UpFtm67foy9HDVXneLtFVt4pMz5kZtgNcvCniNFb1hlEQ==",
"funding": [
"https://github.com/fb55/htmlparser2?sponsor=1",
{
"type": "github",
"url": "https://github.com/sponsors/fb55"
}
],
"license": "MIT",
"dependencies": {
"domelementtype": "^2.3.0",
"domhandler": "^5.0.3",
"domutils": "^3.2.2",
"entities": "^7.0.1"
}
},
"node_modules/htmlparser2/node_modules/entities": {
"version": "7.0.1",
"resolved": "https://registry.npmmirror.com/entities/-/entities-7.0.1.tgz",
"integrity": "sha512-TWrgLOFUQTH994YUyl1yT4uyavY5nNB5muff+RtWaqNVCAK408b5ZnnbNAUEWLTCpum9w6arT70i1XdQ4UeOPA==",
"license": "BSD-2-Clause",
"engines": {
"node": ">=0.12"
},
"funding": {
"url": "https://github.com/fb55/entities?sponsor=1"
}
},
"node_modules/iconv-lite": {
"version": "0.6.3",
"resolved": "https://registry.npmmirror.com/iconv-lite/-/iconv-lite-0.6.3.tgz",
"integrity": "sha512-4fCk79wshMdzMp2rH06qWrJE4iolqLhCUH+OiuIgU++RB0+94NlDL81atO7GX55uUKueo0txHNtvEyI6D7WdMw==",
"license": "MIT",
"dependencies": {
"safer-buffer": ">= 2.1.2 < 3.0.0"
},
"engines": {
"node": ">=0.10.0"
}
},
"node_modules/nth-check": {
"version": "2.1.1",
"resolved": "https://registry.npmmirror.com/nth-check/-/nth-check-2.1.1.tgz",
"integrity": "sha512-lqjrjmaOoAnWfMmBPL+XNnynZh2+swxiX3WUE0s4yEHI6m+AwrK2UZOimIRl3X/4QctVqS8AiZjFqyOGrMXb/w==",
"license": "BSD-2-Clause",
"dependencies": {
"boolbase": "^1.0.0"
},
"funding": {
"url": "https://github.com/fb55/nth-check?sponsor=1"
}
},
"node_modules/parse5": {
"version": "7.3.0",
"resolved": "https://registry.npmmirror.com/parse5/-/parse5-7.3.0.tgz",
"integrity": "sha512-IInvU7fabl34qmi9gY8XOVxhYyMyuH2xUNpb2q8/Y+7552KlejkRvqvD19nMoUW/uQGGbqNpA6Tufu5FL5BZgw==",
"license": "MIT",
"dependencies": {
"entities": "^6.0.0"
},
"funding": {
"url": "https://github.com/inikulin/parse5?sponsor=1"
}
},
"node_modules/parse5-htmlparser2-tree-adapter": {
"version": "7.1.0",
"resolved": "https://registry.npmmirror.com/parse5-htmlparser2-tree-adapter/-/parse5-htmlparser2-tree-adapter-7.1.0.tgz",
"integrity": "sha512-ruw5xyKs6lrpo9x9rCZqZZnIUntICjQAd0Wsmp396Ul9lN/h+ifgVV1x1gZHi8euej6wTfpqX8j+BFQxF0NS/g==",
"license": "MIT",
"dependencies": {
"domhandler": "^5.0.3",
"parse5": "^7.0.0"
},
"funding": {
"url": "https://github.com/inikulin/parse5?sponsor=1"
}
},
"node_modules/parse5-parser-stream": {
"version": "7.1.2",
"resolved": "https://registry.npmmirror.com/parse5-parser-stream/-/parse5-parser-stream-7.1.2.tgz",
"integrity": "sha512-JyeQc9iwFLn5TbvvqACIF/VXG6abODeB3Fwmv/TGdLk2LfbWkaySGY72at4+Ty7EkPZj854u4CrICqNk2qIbow==",
"license": "MIT",
"dependencies": {
"parse5": "^7.0.0"
},
"funding": {
"url": "https://github.com/inikulin/parse5?sponsor=1"
}
},
"node_modules/parse5/node_modules/entities": {
"version": "6.0.1",
"resolved": "https://registry.npmmirror.com/entities/-/entities-6.0.1.tgz",
"integrity": "sha512-aN97NXWF6AWBTahfVOIrB/NShkzi5H7F9r1s9mD3cDj4Ko5f2qhhVoYMibXF7GlLveb/D2ioWay8lxI97Ven3g==",
"license": "BSD-2-Clause",
"engines": {
"node": ">=0.12"
},
"funding": {
"url": "https://github.com/fb55/entities?sponsor=1"
}
},
"node_modules/playwright": {
"version": "1.59.1",
"resolved": "https://registry.npmmirror.com/playwright/-/playwright-1.59.1.tgz",
"integrity": "sha512-C8oWjPR3F81yljW9o5OxcWzfh6avkVwDD2VYdwIGqTkl+OGFISgypqzfu7dOe4QNLL2aqcWBmI3PMtLIK233lw==",
"license": "Apache-2.0",
"optional": true,
"dependencies": {
"playwright-core": "1.59.1"
},
"bin": {
"playwright": "cli.js"
},
"engines": {
"node": ">=18"
},
"optionalDependencies": {
"fsevents": "2.3.2"
}
},
"node_modules/playwright-core": {
"version": "1.59.1",
"resolved": "https://registry.npmmirror.com/playwright-core/-/playwright-core-1.59.1.tgz",
"integrity": "sha512-HBV/RJg81z5BiiZ9yPzIiClYV/QMsDCKUyogwH9p3MCP6IYjUFu/MActgYAvK0oWyV9NlwM3GLBjADyWgydVyg==",
"license": "Apache-2.0",
"optional": true,
"bin": {
"playwright-core": "cli.js"
},
"engines": {
"node": ">=18"
}
},
"node_modules/safer-buffer": {
"version": "2.1.2",
"resolved": "https://registry.npmmirror.com/safer-buffer/-/safer-buffer-2.1.2.tgz",
"integrity": "sha512-YZo3K82SD7Riyi0E1EQPojLz7kpepnSQI9IyPbHHg1XXXevb5dJI7tpyN2ADxGcQbHG7vcyRHk0cbwqcQriUtg==",
"license": "MIT"
},
"node_modules/undici": {
"version": "7.25.0",
"resolved": "https://registry.npmmirror.com/undici/-/undici-7.25.0.tgz",
"integrity": "sha512-xXnp4kTyor2Zq+J1FfPI6Eq3ew5h6Vl0F/8d9XU5zZQf1tX9s2Su1/3PiMmUANFULpmksxkClamIZcaUqryHsQ==",
"license": "MIT",
"engines": {
"node": ">=20.18.1"
}
},
"node_modules/whatwg-encoding": {
"version": "3.1.1",
"resolved": "https://registry.npmmirror.com/whatwg-encoding/-/whatwg-encoding-3.1.1.tgz",
"integrity": "sha512-6qN4hJdMwfYBtE3YBTTHhoeuUrDBPZmbQaxWAqSALV/MeEnR5z1xd8UKud2RAkFoPkmB+hli1TZSnyi84xz1vQ==",
"deprecated": "Use @exodus/bytes instead for a more spec-conformant and faster implementation",
"license": "MIT",
"dependencies": {
"iconv-lite": "0.6.3"
},
"engines": {
"node": ">=18"
}
},
"node_modules/whatwg-mimetype": {
"version": "4.0.0",
"resolved": "https://registry.npmmirror.com/whatwg-mimetype/-/whatwg-mimetype-4.0.0.tgz",
"integrity": "sha512-QaKxh0eNIi2mE9p2vEdzfagOKHCcj1pJ56EEHGQOVxp8r9/iszLUUV7v89x9O1p/T+NlTM5W7jW6+cz4Fq1YVg==",
"license": "MIT",
"engines": {
"node": ">=18"
}
}
}
}
FILE:scripts/browser-daemon.js
#!/usr/bin/env node
/**
* browser-daemon.js: persistent Chromium daemon
*
* Starts a long-lived browser with Playwright launchServer();
* search.js / fetch.js reuse it through the saved WebSocket endpoint, avoiding the 1.5 s+ launch cost on every run.
*
* Usage:
* Start: node scripts/browser-daemon.js (run in the background)
* Stop: node scripts/browser-daemon.js --stop
* Status: node scripts/browser-daemon.js --status
*/
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';
const __dirname = path.dirname(fileURLToPath(import.meta.url));
const skillRoot = path.resolve(__dirname, '..');
const ENDPOINT_FILE = path.join(skillRoot, '.browser-endpoint');
function readInfo() {
try { return JSON.parse(fs.readFileSync(ENDPOINT_FILE, 'utf-8')); } catch { return null; }
}
function isAlive() {
const info = readInfo();
if (!info) return false;
try { process.kill(info.pid, 0); return true; } catch {
try { fs.unlinkSync(ENDPOINT_FILE); } catch {}
return false;
}
}
async function startDaemon() {
if (isAlive()) {
const info = readInfo();
const uptime = ((Date.now() - info.startedAt) / 1000).toFixed(0);
console.log(`[daemon] Already running PID: ${info.pid} Uptime: ${uptime}s`);
console.log(` WS: ${info.wsEndpoint}`);
return;
}
const { chromium } = await import('playwright');
const server = await chromium.launchServer({
headless: false,
args: [
'--disable-blink-features=AutomationControlled',
'--disable-gpu',
],
});
const wsEndpoint = server.wsEndpoint();
const info = {
pid: process.pid, // PID of the daemon process (used by the isAlive check)
wsEndpoint,
startedAt: Date.now(),
};
fs.writeFileSync(ENDPOINT_FILE, JSON.stringify(info, null, 2));
console.log(`[daemon] Chromium started PID: ${info.pid}`);
console.log(`[daemon] WS: ${wsEndpoint}`);
console.log('[daemon] Running... (Ctrl+C or --stop to quit)');
// Keep process alive
process.on('SIGINT', async () => {
console.log('[daemon] Stopping...');
await server.close();
try { fs.unlinkSync(ENDPOINT_FILE); } catch {}
process.exit(0);
});
process.on('SIGTERM', async () => {
await server.close();
try { fs.unlinkSync(ENDPOINT_FILE); } catch {}
process.exit(0);
});
}
function stopDaemon() {
const info = readInfo();
if (!info) { console.log('[daemon] Not running'); return; }
try {
process.kill(info.pid, 'SIGTERM');
console.log(`[daemon] Stopped PID: ${info.pid}`);
} catch {
console.log('[daemon] Process already exited');
}
try { fs.unlinkSync(ENDPOINT_FILE); } catch {}
}
function showStatus() {
if (!isAlive()) { console.log('[daemon] Not running'); return; }
const info = readInfo();
const uptime = ((Date.now() - info.startedAt) / 1000).toFixed(0);
console.log(`[daemon] Running PID: ${info.pid} Uptime: ${uptime}s`);
console.log(` WS: ${info.wsEndpoint}`);
}
const arg = process.argv[2];
if (arg === '--stop') stopDaemon();
else if (arg === '--status') showStatus();
else startDaemon();
FILE:scripts/check-env.js
#!/usr/bin/env node
/**
* free-web-search-js environment check v28
*/
import { execSync } from 'child_process';
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';
const __dirname = path.dirname(fileURLToPath(import.meta.url));
const skillRoot = path.resolve(__dirname, '..');
function main() {
const lines = [];
// Node.js
let nodeOk = false;
try {
const v = execSync('node --version', { encoding: 'utf-8', timeout: 5000 }).trim();
const major = parseInt(v.replace('v', '').split('.')[0]);
nodeOk = major >= 18;
if (nodeOk) {
lines.push(`[OK] Node.js ${v} (>= 18)`);
} else {
lines.push(`[X] Node.js >= 18 required (current: ${v})`);
lines.push(` -> https://nodejs.org`);
}
} catch {
lines.push(`[X] Node.js not found`);
lines.push(` -> https://nodejs.org`);
}
// npm dependencies (all required)
const nm = path.join(skillRoot, 'node_modules');
const requiredDeps = ['cheerio', 'commander', 'iconv-lite', 'playwright'];
let depsOk = true;
if (!fs.existsSync(nm)) {
lines.push(`[X] node_modules not found`);
lines.push(` -> cd ${skillRoot} && npm install`);
depsOk = false;
} else {
const missing = requiredDeps.filter(dep => !fs.existsSync(path.join(nm, dep)));
if (missing.length > 0) {
lines.push(`[X] Missing npm packages: ${missing.join(', ')}`);
lines.push(` -> cd ${skillRoot} && npm install`);
depsOk = false;
} else {
lines.push(`[OK] npm packages: cheerio, commander, iconv-lite, playwright`);
}
}
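// Playwright Chromium browser (required)
let browserOk = false;
try {
const browserPaths = [
process.env.LOCALAPPDATA && path.join(process.env.LOCALAPPDATA, 'ms-playwright'),
process.env.HOME && path.join(process.env.HOME, '.cache', 'ms-playwright'),
// macOS keeps Playwright browsers under ~/Library/Caches
process.env.HOME && path.join(process.env.HOME, 'Library', 'Caches', 'ms-playwright'),
].filter(Boolean);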
browserOk = browserPaths.some(p => fs.existsSync(p) && fs.readdirSync(p).length > 0);
if (browserOk) {
lines.push(`[OK] Playwright Chromium browser installed`);
} else {
lines.push(`[X] Playwright Chromium browser not installed`);
lines.push(` -> npx playwright install chromium`);
depsOk = false;
}
} catch {
lines.push(`[X] Playwright Chromium browser check failed`);
lines.push(` -> npx playwright install chromium`);
depsOk = false;
}
const allOk = nodeOk && depsOk;
lines.push('');
if (allOk) {
lines.push(`[OK] Environment ready`);
} else {
lines.push(`[X] Environment not ready, follow the -> hints above`);
}
console.log(lines.join('\n'));
process.exit(allOk ? 0 : 1);
}
main();
FILE:scripts/fetch.js
#!/usr/bin/env node
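/**
* free-web-search-js fetch.js v23.0
*
* Two-tier fallback with enhancements:
* 1. Lightweight HTTP + cheerio (fast, does not start a browser)
* - Supports JSON API responses
* - Extracts embedded data such as JSON-LD / __NEXT_DATA__
* - Falls back to meta tags (og:description etc.)
* 2. Playwright headed (full browser, supports JS rendering)
* Multiple URLs are fetched in parallel; URLs that fail to load are skipped
*/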
import process from 'process';
import child_process from 'child_process';
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';
const __dirname = path.dirname(fileURLToPath(import.meta.url));
const ENDPOINT_FILE = path.resolve(__dirname, '..', '.browser-endpoint');
const TIMEOUT = 35000;
const DEFAULT_MAX_LEN = 12000;
const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36';
// ==================== Browser reuse ====================
async function getBrowser() {
try {
const info = JSON.parse(fs.readFileSync(ENDPOINT_FILE, 'utf-8'));
process.kill(info.pid, 0); // throws if the daemon process is gone
const { chromium } = await import('playwright');
// launchServer() exposes a Playwright WebSocket endpoint, so use connect(), not connectOverCDP()
const browser = await chromium.connect(info.wsEndpoint);
return { browser, shared: true };
} catch {}
const { chromium } = await import('playwright');
const browser = await chromium.launch({
headless: false,
args: ['--disable-blink-features=AutomationControlled'],
});
return { browser, shared: false };
}
function releaseBrowser(browser, shared) {
// Playwright's Browser has no disconnect(); on a connected browser, close()
// clears this client's contexts and disconnects, leaving the daemon running
return browser.close();
}
const PAGE_COMPAT_INIT = () => {
Object.defineProperty(navigator, 'webdriver', { get: () => false });
window.chrome = { runtime: {} };
const origQuery = window.navigator.permissions?.query;
if (origQuery) {
window.navigator.permissions.query = (params) => (
params.name === 'notifications'
? Promise.resolve({ state: Notification.permission })
: origQuery(params)
);
}
};
async function ensureDeps() {
try { await import('cheerio'); } catch {
child_process.execSync('npm install cheerio --silent', { stdio: 'inherit' });
}
try { await import('commander'); } catch {
child_process.execSync('npm install commander --silent', { stdio: 'inherit' });
}
try { await import('iconv-lite'); } catch {
child_process.execSync('npm install iconv-lite --silent', { stdio: 'inherit' });
}
try { await import('playwright'); } catch {
console.error('[WARN] playwright is not installed; the headed fallback is unavailable');
}
}
// ==================== Encoding handling ====================
async function decodeBuffer(buf, contentTypeHeader) {
// Prefer the charset declared in Content-Type
let charset = 'utf-8';
if (contentTypeHeader) {
const m = contentTypeHeader.match(/charset=([^\s;]+)/i);
if (m) charset = m[1].toLowerCase();
}
if (charset === 'utf-8' || charset === 'utf8') {
return buf.toString('utf-8');
}
if (charset === 'gbk' || charset === 'gb2312' || charset === 'gb18030') {
try {
const iconv = await import('iconv-lite');
return iconv.default.decode(buf, 'gbk');
} catch {
try { return new TextDecoder('gbk').decode(buf); } catch {}
}
}
// fallback: try utf-8, then gbk if there are many replacement characters
let text = buf.toString('utf-8');
if ((text.match(/\ufffd/g) || []).length > 20) {
try {
const iconv = await import('iconv-lite');
text = iconv.default.decode(buf, 'gbk');
} catch {
try { text = new TextDecoder('gbk').decode(buf); } catch {}
}
}
return text;
}
// ==================== JSON content extraction ====================
function extractJsonContent(data, maxLen) {
/** Extract meaningful text from a JSON API response */
const texts = [];
function walk(obj, depth = 0) {
if (depth > 8 || texts.join(' ').length > maxLen) return;
if (typeof obj === 'string' && obj.length > 20) {
texts.push(obj);
} else if (Array.isArray(obj)) {
for (const item of obj) walk(item, depth + 1);
} else if (obj && typeof obj === 'object') {
// Extract common content fields first
for (const key of ['content', 'text', 'body', 'description', 'summary',
'message', 'value', 'title', 'name', 'answer', 'result']) {
if (obj[key] && typeof obj[key] === 'string' && obj[key].length > 20) {
texts.push(obj[key]);
}
}
for (const [k, v] of Object.entries(obj)) {
if (typeof v === 'object' && v !== null) walk(v, depth + 1);
}
}
}
walk(data);
return texts.join(' ').replace(/\s+/g, ' ').trim().slice(0, maxLen);
}
// ==================== Embedded data extraction ====================
function extractEmbeddedData($, maxLen) {
/** Extract structured data embedded in the HTML: JSON-LD, __NEXT_DATA__, meta tags, etc. */
const parts = [];
// JSON-LD
$('script[type="application/ld+json"]').each((_, el) => {
try {
const data = JSON.parse($(el).text());
if (data.description) parts.push(String(data.description));
if (data.articleBody) parts.push(String(data.articleBody));
if (data.text) parts.push(String(data.text));
// Walk @graph
if (Array.isArray(data['@graph'])) {
for (const item of data['@graph']) {
if (item.description) parts.push(String(item.description));
if (item.articleBody) parts.push(String(item.articleBody));
}
}
} catch {}
});
// __NEXT_DATA__ (Next.js)
$('script#__NEXT_DATA__').each((_, el) => {
try {
const data = JSON.parse($(el).text());
const text = extractJsonContent(data, maxLen);
if (text.length > 100) parts.push(text);
} catch {}
});
// meta tag fallback
const metaSelectors = [
'meta[property="og:description"]',
'meta[name="description"]',
'meta[property="og:title"]',
'meta[name="twitter:description"]',
];
for (const sel of metaSelectors) {
const content = $(sel).attr('content');
if (content && content.length > 20) parts.push(content);
}
return parts.join(' ').replace(/\s+/g, ' ').trim().slice(0, maxLen);
}
// ==================== Tier 1: lightweight HTTP ====================
async function fetchLightweight(url, maxLen) {
console.error(`[fetch:http] ${url}`);
const ac = new AbortController();
const t = setTimeout(() => ac.abort(), 15000);
try {
const r = await fetch(url, {
headers: {
'User-Agent': UA,
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,application/json;q=0.8,*/*;q=0.5',
'Accept-Language': 'zh-CN,zh;q=0.9,en-US,en;q=0.8',
},
redirect: 'follow', signal: ac.signal,
});
clearTimeout(t);
if (!r.ok) return { status: r.status, content: '', error: `HTTP ${r.status}` };
const contentType = r.headers.get('content-type') || '';
const buf = Buffer.from(await r.arrayBuffer());
// JSON response: parse directly
if (/application\/json/i.test(contentType) || (/^[\[{]/.test(buf.toString('utf-8', 0, 100)))) {
try {
const data = JSON.parse(buf.toString('utf-8'));
const text = extractJsonContent(data, maxLen);
if (text.length > 50) return { status: 200, content: text };
} catch {}
}
// HTML response
const html = await decodeBuffer(buf, contentType);
const { load } = await import('cheerio');
const $ = load(html);
// Extract embedded data (JSON-LD etc.) first, as a supplement
const embedded = extractEmbeddedData($, maxLen);
// Strip noise
$('script,style,nav,header,footer,aside,iframe,noscript,.ad,.sidebar,.comment,.social,.share,.related,.breadcrumb,.pagination,.cookie,.popup').remove();
// Main-content containers
for (const sel of ['article','.article-content','.post-content','.entry-content',
'#article_content','.markdown-body','.news-content','.detail-body',
'.content','.main-content','main','#content','table']) {
const el = $(sel).first();
if (el.length) {
const text = el.text().replace(/\s+/g, ' ').trim();
if (text.length > 200) {
// Append the embedded data when it adds information
let result = text;
if (embedded && !text.includes(embedded.slice(0, 50))) {
result = text + '\n\n[Structured data] ' + embedded;
}
return { status: 200, content: result.slice(0, maxLen) };
}
}
}
// Heuristic: pick the longest text block with a low link ratio
const candidates = [];
for (const el of $('div, section, main, article').toArray()) {
const $el = $(el);
if ($el.children().length > 50) continue;
const text = $el.text().replace(/\s+/g, ' ').trim();
if (text.length > 300) {
const linkRatio = $el.find('a').length / (text.length / 100);
if (linkRatio < 5) candidates.push({ text, len: text.length });
}
}
candidates.sort((a, b) => b.len - a.len);
if (candidates.length > 0 && candidates[0].len > 200) {
let result = candidates[0].text;
if (embedded && !result.includes(embedded.slice(0, 50))) {
result = result + '\n\n[Structured data] ' + embedded;
}
return { status: 200, content: result.slice(0, maxLen) };
}
// Fall back to embedded data (main-text extraction failed but JSON-LD etc. exists)
if (embedded.length > 100) return { status: 200, content: embedded.slice(0, maxLen) };
const body = $('body').text().replace(/\s+/g, ' ').trim();
if (body.length > 200) return { status: 200, content: body.slice(0, maxLen) };
return { status: r.status, content: '', error: `Content too short (${body.length} chars)` };
} catch (e) {
clearTimeout(t);
return { status: 0, content: '', error: e.message.split('\n')[0] };
}
}
// ==================== Tier 2: Playwright headed ====================
async function fetchHeaded(url, maxLen) {
console.error(`[fetch:headed] ${url}`);
let browser, shared;
try {
({ browser, shared } = await getBrowser());
const page = await browser.newPage();
await page.addInitScript(PAGE_COMPAT_INIT);
await page.setExtraHTTPHeaders({ 'Accept-Language': 'zh-CN,zh;q=0.9,en-US,en;q=0.8' });
const resp = await page.goto(url, { waitUntil: 'domcontentloaded', timeout: TIMEOUT });
const httpStatus = resp?.status() || 0;
await page.waitForTimeout(4000);
try { await page.evaluate(() => window.scrollTo(0, 300)); await page.waitForTimeout(800); } catch {}
let content = '';
try {
content = await page.evaluate((max) => {
// Extract JSON-LD
const ldParts = [];
document.querySelectorAll('script[type="application/ld+json"]').forEach(el => {
try {
const d = JSON.parse(el.textContent);
if (d.description) ldParts.push(String(d.description));
if (d.articleBody) ldParts.push(String(d.articleBody));
} catch {}
});
// Strip noise
for (const sel of ['script','style','nav','header','footer','aside','iframe','noscript',
'.ad','.ads','.sidebar','.comment','.social','.share','.related',
'.breadcrumb','.pagination','.cookie','.popup','[role="navigation"]','[role="banner"]']) {
document.querySelectorAll(sel).forEach(el => el.remove());
}
// Main-text extraction
for (const sel of ['article','.article-content','.post-content','.entry-content',
'#article_content','.markdown-body','.news-content','.detail-body',
'.content','.main-content','main','#content','table']) {
const el = document.querySelector(sel);
if (el) { const text = el.innerText.replace(/\s+/g, ' ').trim(); if (text.length > 200) return text.slice(0, max); }
}
const candidates = [];
for (const el of document.querySelectorAll('div, section, main, article')) {
if (el.children.length > 50) continue;
const text = el.innerText?.replace(/\s+/g, ' ').trim() || '';
if (text.length > 300) { const links = el.querySelectorAll('a'); if (links.length / (text.length / 100) < 5) candidates.push({ el, len: text.length }); }
}
candidates.sort((a, b) => b.len - a.len);
if (candidates.length > 0) { const text = candidates[0].el.innerText.replace(/\s+/g, ' ').trim(); if (text.length > 200) return text.slice(0, max); }
return document.body?.innerText?.replace(/\s+/g, ' ').trim().slice(0, max) || '';
}, maxLen);
} catch {
try { await page.waitForTimeout(2000); content = await page.evaluate((max) => document.body?.innerText?.replace(/\s+/g, ' ').trim().slice(0, max) || '', maxLen); } catch {}
}
await page.close();
if (content.length < 50) return { status: httpStatus, content: '', error: content ? `Content too short (${content.length} chars)` : `HTTP ${httpStatus}` };
return { status: httpStatus, content };
} catch (e) {
return { status: 0, content: '', error: e.message.split('\n')[0] };
} finally {
if (browser) await releaseBrowser(browser, shared).catch(() => {});
}
}
// ==================== Single URL: two-tier fallback ====================
async function fetchUrl(url, maxLen) {
// Tier 1: lightweight HTTP
let result = await fetchLightweight(url, maxLen);
if (result.content) return { url, ...result };
console.error(`[fetch:http] failed: ${result.error}`);
// Tier 2: Playwright headed
result = await fetchHeaded(url, maxLen);
return { url, ...result };
}
// ==================== main ====================
async function main() {
await ensureDeps();
const { program } = await import('commander');
program
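.argument('<urls...>', 'URLs to fetch; multiple URLs are fetched in parallel')
.option('--max-len <n>', 'Maximum characters per page', v => parseInt(v, 10), DEFAULT_MAX_LEN)
.option('--http-only', 'Use lightweight HTTP only; do not start a browser')
.option('--headed', 'Skip HTTP and go straight to headed mode')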
.parse(process.argv);
const opts = program.opts();
const maxLen = Math.max(1000, Math.min(50000, opts.maxLen || DEFAULT_MAX_LEN));
const urls = program.args.filter(a => a.startsWith('http'));
if (!urls.length) { console.log(JSON.stringify({ error: '未传入有效 URL' })); process.exit(1); }
const tasks = urls.map(async (url) => {
if (opts.httpOnly) {
const r = await fetchLightweight(url, maxLen);
if (r.error) console.error(`[fetch] 跳过: ${r.error}`);
return { url, ...r };
}
if (opts.headed) {
const r = await fetchHeaded(url, maxLen);
if (r.error) console.error(`[fetch] 跳过: ${r.error}`);
return { url, ...r };
}
const r = await fetchUrl(url, maxLen);
if (r.error) console.error(`[fetch] 跳过: ${r.error}`);
return r;
});
const settled = await Promise.allSettled(tasks);
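// allSettled keeps input order, so a rejected task can still be reported against its own URL
const results = settled.map((r, i) => r.status === 'fulfilled' ? r.value : { url: urls[i], status: 0, content: '', error: String(r.reason) });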
console.log(JSON.stringify(results, null, 2));
}
main().catch(e => { console.error('[ERROR]', e.message); process.exit(1); });
FILE:scripts/search.js
#!/usr/bin/env node
/**
* free-web-search-js search.js v28.0
*
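* China: Bing CN (Playwright, submitted through the search box)
* International: DDG HTML (plain HTTP)
* After searching, automatically fetches the content of the top N results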
*/
import process from 'process';
import child_process from 'child_process';
import querystring from 'querystring';
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';
const __dirname = path.dirname(fileURLToPath(import.meta.url));
const SKILL_ROOT = path.resolve(__dirname, '..');
const ENDPOINT_FILE = path.resolve(SKILL_ROOT, '.browser-endpoint');
const DEFAULT_MAX = 10;
const DEFAULT_FETCH = 3;
const HTTP_TIMEOUT = 10000;
const PW_TIMEOUT = 25000;
const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36';
function clean(s) { return String(s || '').replace(/\s+/g, ' ').trim(); }
// ==================== Dependencies ====================
async function ensureDeps() {
// Install into the skill root so the packages resolve from scripts/ regardless of the caller's cwd
try { await import('cheerio'); } catch {
child_process.execSync('npm install cheerio --silent', { cwd: SKILL_ROOT, stdio: 'inherit' });
}
try { await import('commander'); } catch {
child_process.execSync('npm install commander --silent', { cwd: SKILL_ROOT, stdio: 'inherit' });
}
}
// ==================== IP detection ====================
let _inChinaCache = null;
async function detectInChina() {
if (_inChinaCache !== null) return _inChinaCache;
const probes = [
(async () => {
for (const url of ['https://myip.ipip.net', 'https://cip.cc']) {
try {
const r = await fetch(url, { headers: { 'User-Agent': UA }, signal: AbortSignal.timeout(3000) });
if (!r.ok) continue;
const text = await r.text();
if (/中国|CN/i.test(text)) {
const ip = text.match(/(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/)?.[1] ?? '?';
return { inChina: true, label: `${ip} → CN` };
}
} catch {}
}
throw new Error('cn probe failed');
})(),
(async () => {
for (const url of ['https://ipinfo.io/json', 'https://ipapi.co/json/']) {
try {
const r = await fetch(url, { headers: { 'User-Agent': UA }, signal: AbortSignal.timeout(3000) });
if (!r.ok) continue;
const d = await r.json();
const cc = String(d.country || d.country_code || '').toUpperCase();
if (!cc) continue;
return { inChina: cc === 'CN', label: `${d.ip ?? '?'} → ${cc}` };
} catch {}
}
throw new Error('intl probe failed');
})(),
(async () => {
const r = await fetch('https://cn.bing.com', { headers: { 'User-Agent': UA }, signal: AbortSignal.timeout(3000), redirect: 'manual' });
return { inChina: r.status === 200 || r.status === 302, label: `cn.bing.com → ${r.status}` };
})(),
];
try {
const winner = await Promise.any(probes);
console.error(`[地理] ${winner.label} → ${winner.inChina ? '国内' : '国外'}`);
_inChinaCache = winner.inChina;
return winner.inChina;
} catch {
console.error('[地理] 检测失败,默认国内');
_inChinaCache = true;
return true;
}
}
// ==================== URL handling ====================
function decodeBingUrl(url) {
if (!url?.includes('bing.com/ck/')) return url;
try {
const u = new URL(url).searchParams.get('u');
if (!u) return url;
const stripped = u.replace(/^a[0-9]/, '');
const b64 = stripped + '='.repeat((4 - stripped.length % 4) % 4);
const dec = Buffer.from(b64, 'base64').toString('utf-8');
return dec.startsWith('http') ? dec : url;
} catch { return url; }
}
function normalizeUrl(raw) {
let url = clean(raw);
if (!url) return url;
url = decodeBingUrl(url);
try {
const u = new URL(url);
u.hash = '';
for (const k of ['utm_source','utm_medium','utm_campaign','gclid','fbclid','msclkid','spm','from','ref','src']) {
u.searchParams.delete(k);
}
return u.toString();
} catch { return url; }
}
async function resolveRedirectUrl(url, timeout = 6000) {
if (!url) return url;
if (!/sogou\.com\/link/i.test(url)) return url;
try {
const r = await fetch(url, {
method: 'GET', headers: { 'User-Agent': UA },
redirect: 'follow', signal: AbortSignal.timeout(timeout),
});
if (r.url && r.url.startsWith('http') && !/sogou\.com\/link/i.test(r.url)) {
return r.url;
}
const text = await r.text();
const jsMatch = text.match(/window\.location\.replace\s*\(\s*["']([^"']+)["']/);
if (jsMatch) return jsMatch[1];
const metaMatch = text.match(/URL\s*=\s*['"]([^'"]+)['"]/i);
if (metaMatch) return metaMatch[1];
} catch {}
return url;
}
// ==================== Playwright browser management ====================
const PAGE_COMPAT_INIT = () => {
Object.defineProperty(navigator, 'webdriver', { get: () => false });
window.chrome = { runtime: {} };
const origQuery = window.navigator.permissions?.query;
if (origQuery) {
window.navigator.permissions.query = (params) => (
params.name === 'notifications' ? Promise.resolve({ state: Notification.permission }) : origQuery(params)
);
}
};
let _browserInstance = null;
async function getBrowser() {
if (_browserInstance) return _browserInstance;
try {
const info = JSON.parse(fs.readFileSync(ENDPOINT_FILE, 'utf-8'));
process.kill(info.pid, 0);
const { chromium } = await import('playwright');
const browser = await chromium.connectOverCDP(info.wsEndpoint);
_browserInstance = { browser, shared: true };
return _browserInstance;
} catch {}
const { chromium } = await import('playwright');
const browser = await chromium.launch({
headless: false,
args: ['--disable-blink-features=AutomationControlled'],
});
_browserInstance = { browser, shared: false };
return _browserInstance;
}
async function closeBrowser() {
if (!_browserInstance) return;
try {
// Playwright's Browser has no disconnect(); close() on a CDP-connected browser only disconnects from it
await _browserInstance.browser.close();
} catch {}
_browserInstance = null;
}
// ==================== Search engines ====================
async function searchBingPW(query, max) {
console.error(`[Bing:pw] ${query}`);
const out = [], seen = new Set();
const base = 'https://cn.bing.com';
let context;
try {
const { browser } = await getBrowser();
context = await browser.newContext({
userAgent: UA,
locale: 'zh-CN',
viewport: { width: 1920, height: 1080 },
extraHTTPHeaders: { 'Accept-Language': 'zh-CN,zh;q=0.9' },
});
await context.addInitScript(PAGE_COMPAT_INIT);
const page = await context.newPage();
// Visit the home page first to pick up cookies
await page.goto(base + '/', { waitUntil: 'domcontentloaded', timeout: 15000 });
await page.waitForTimeout(1500);
// Submit the query through the search box
try {
const searchBox = await page.$('#sb_form_q');
if (searchBox) {
await searchBox.click();
await searchBox.fill(query);
await page.waitForTimeout(300);
await Promise.all([
page.waitForLoadState('domcontentloaded', { timeout: PW_TIMEOUT }),
page.keyboard.press('Enter'),
]);
await page.waitForTimeout(2000);
} else {
await page.goto(base + '/search?' + querystring.stringify({ q: query }), {
waitUntil: 'domcontentloaded', timeout: PW_TIMEOUT,
});
await page.waitForTimeout(1500);
}
} catch {
await page.goto(base + '/search?' + querystring.stringify({ q: query }), {
waitUntil: 'domcontentloaded', timeout: PW_TIMEOUT,
});
await page.waitForTimeout(1500);
}
const results = await page.evaluate(() => {
const items = [];
const seen = new Set();
const add = (title, url, snippet) => {
if (title && url && url.startsWith('http') && !seen.has(url)) {
seen.add(url);
items.push({ title, url, snippet });
}
};
// 1) Main results: li.b_algo
document.querySelectorAll('li.b_algo').forEach(el => {
const a = el.querySelector('h2 a');
if (!a) return;
add(a.textContent.trim(), a.href, el.querySelector('.b_caption p')?.textContent?.trim() || '');
});
// 2) Links inside answer cards / knowledge panels (li.b_ans, li.b_vList, li.b_entityTP, li.b_mop)
if (items.length === 0) {
document.querySelectorAll('li.b_ans, li.b_vList, li.b_entityTP, li.b_mop').forEach(el => {
el.querySelectorAll('a[href]').forEach(a => {
const href = a.href;
// Skip Bing-internal links
if (!href || href.includes('bing.com') || href.includes('microsoft.com') || href.startsWith('javascript:')) return;
add(a.textContent.trim().slice(0, 120), href, '');
});
});
}
return items;
});
for (const item of results) {
const url = normalizeUrl(item.url);
const title = clean(item.title);
const snippet = clean(item.snippet);
if (title && url && url.startsWith('http') && !seen.has(url.toLowerCase())) {
seen.add(url.toLowerCase());
out.push({ title, url, snippet });
}
}
// 3) On zero results, retry with an appended word (forces web results instead of instant-answer cards)
if (out.length === 0) {
const suffixes = [' 网站', ' 详情', ' 介绍'];
for (const suffix of suffixes) {
const retryQuery = query + suffix;
console.error(`[Bing:pw] 0条,补词重试: "${retryQuery}"`);
try {
await page.goto(base + '/search?' + querystring.stringify({ q: retryQuery }), {
waitUntil: 'domcontentloaded', timeout: PW_TIMEOUT,
});
await page.waitForTimeout(1500);
const retryResults = await page.evaluate(() => {
const items = [];
document.querySelectorAll('li.b_algo').forEach(el => {
const a = el.querySelector('h2 a');
if (!a) return;
items.push({
title: a.textContent.trim(),
url: a.href || '',
snippet: el.querySelector('.b_caption p')?.textContent?.trim() || '',
});
});
return items;
});
for (const item of retryResults) {
const url = normalizeUrl(item.url);
const title = clean(item.title);
const snippet = clean(item.snippet);
if (title && url && url.startsWith('http') && !seen.has(url.toLowerCase())) {
seen.add(url.toLowerCase());
out.push({ title, url, snippet });
}
}
if (out.length > 0) break;
} catch {}
}
}
console.error(`[Bing:pw] ${out.length} 条`);
} catch (e) {
console.error(`[Bing:pw] 错误: ${e.message.split('\n')[0]}`);
} finally {
if (context) await context.close().catch(() => {});
}
return out.slice(0, max);
}
async function searchSogouHttp(query, max) {
console.error(`[搜狗:http] ${query}`);
const out = [], seen = new Set();
try {
const url = 'https://www.sogou.com/web?' + querystring.stringify({ query });
const r = await fetch(url, {
headers: {
'User-Agent': UA,
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.9',
},
signal: AbortSignal.timeout(HTTP_TIMEOUT), redirect: 'follow',
});
if (!r.ok) { console.error(`[搜狗:http] HTTP ${r.status}`); return out; }
const html = await r.text();
const { load } = await import('cheerio');
const $ = load(html);
const rawItems = [];
$('.vrwrap, .rb').each((_, el) => {
const $el = $(el);
const $a = $el.find('h3 a').first();
if (!$a.length) return;
const title = clean($a.text());
let href = $a.attr('href') || '';
if (href.startsWith('/link?')) href = 'https://www.sogou.com' + href;
const snippet = clean($el.find('.str-text-info, .str_info').text());
if (title && href) rawItems.push({ title, href, snippet });
});
const resolved = await Promise.all(rawItems.map(async (item) => ({ ...item, url: normalizeUrl(await resolveRedirectUrl(item.href)) })));
for (const item of resolved) {
if (item.url && item.url.startsWith('http') && !seen.has(item.url.toLowerCase())) {
seen.add(item.url.toLowerCase());
out.push({ title: item.title, url: item.url, snippet: item.snippet });
}
}
console.error(`[搜狗:http] ${out.length} 条`);
} catch (e) {
console.error(`[搜狗:http] 错误: ${e.message.split('\n')[0]}`);
}
return out.slice(0, max);
}
async function searchDDGHtml(query, max) {
console.error(`[DDG:html] ${query}`);
const out = [], seen = new Set();
try {
const r = await fetch('https://html.duckduckgo.com/html/?q=' + encodeURIComponent(query), {
headers: { 'User-Agent': UA, 'Accept-Language': 'en-US,en;q=0.9' },
signal: AbortSignal.timeout(HTTP_TIMEOUT), redirect: 'follow',
});
if (!r.ok) { console.error(`[DDG:html] HTTP ${r.status}`); return out; }
const html = await r.text();
const { load } = await import('cheerio');
const $ = load(html);
$('.result, .web-result').each((_, el) => {
const $el = $(el);
const $a = $el.find('.result__title a, .result__a, h2 a').first();
if (!$a.length) return;
const title = clean($a.text());
let href = $a.attr('href') || '';
try {
const uddg = new URL(href, 'https://duckduckgo.com').searchParams.get('uddg');
if (uddg) href = uddg;
} catch {}
const snippet = clean($el.find('.result__snippet, .result__body').text());
const url = normalizeUrl(href);
if (title && url && url.startsWith('http') && !seen.has(url.toLowerCase())) {
seen.add(url.toLowerCase());
out.push({ title, url, snippet });
}
});
console.error(`[DDG:html] ${out.length} 条`);
} catch (e) {
console.error(`[DDG:html] 错误: ${e.message.split('\n')[0]}`);
}
return out.slice(0, max);
}
// ==================== Auto-fetch ====================
async function autoFetchUrls(results, fetchCount, maxLen) {
if (fetchCount <= 0 || results.length === 0) return;
const urls = results.slice(0, Math.min(fetchCount, results.length)).map(r => r.url);
console.error(`[fetch] 自动抓取 ${urls.length} 条...`);
try {
// Pass arguments as an array (no shell) so URLs containing &, ?, quotes or spaces are not mangled
const fetchArgs = [path.resolve(__dirname, 'fetch.js'), ...urls, `--max-len=${maxLen}`, '--headed'];
const raw = child_process.execFileSync(process.execPath, fetchArgs, {
encoding: 'utf8', timeout: 60000,
stdio: ['pipe', 'pipe', 'pipe'],
});
try {
const fetched = JSON.parse(raw);
for (let i = 0; i < Math.min(fetchCount, fetched.length); i++) {
if (fetched[i] && fetched[i].content) {
results[i].content = fetched[i].content.slice(0, maxLen);
}
}
console.error(`[fetch] 抓取完成`);
} catch (e) {
console.error(`[fetch] 解析失败: ${e.message.split('\n')[0]}`);
}
} catch (e) {
console.error(`[fetch] 抓取失败: ${e.message.split('\n')[0]}`);
}
}
// ==================== main ====================
async function main() {
const startTime = Date.now();
await ensureDeps();
const { program } = await import('commander');
program
.argument('[query...]', '搜索关键词')
.option('--max <n>', '结果数 (1-30)', v => parseInt(v, 10), DEFAULT_MAX)
.option('--region <r>', '区域: auto/cn/intl', 'auto')
.option('--engine <e>', '引擎: auto/bing/sogou/ddg', 'auto')
.option('--fetch <n>', '自动抓前N条URL内容 (0=不抓)', v => parseInt(v, 10), DEFAULT_FETCH)
.option('--max-len <n>', '单页最大字符数', v => parseInt(v, 10), 6000)
.option('--no-fetch', '禁用自动抓取')
.parse(process.argv);
const opts = program.opts();
const query = clean(program.args.join(' '));
if (!query) { console.log(JSON.stringify({ error: '未传入搜索关键词' })); process.exit(1); }
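const max = Math.max(1, Math.min(30, opts.max || DEFAULT_MAX));
// --no-fetch makes commander set opts.fetch to false (there is no opts.noFetch); a non-numeric --fetch falls back to the default
const fetchCount = opts.fetch === false ? 0 : (Number.isInteger(opts.fetch) ? opts.fetch : DEFAULT_FETCH);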
let inChina;
if (opts.region === 'cn') inChina = true;
else if (opts.region === 'intl') inChina = false;
else inChina = await detectInChina();
const out = [], seen = new Set();
function dedupKey(url) {
try {
const u = new URL(url);
let host = u.hostname.replace(/^(www|m|mobile)\./, '');
let p = u.pathname.replace(/\/+$/, '').replace(/\.(html?|php|aspx?)$/, '');
return `${host}${p}`.toLowerCase();
} catch { return url.toLowerCase(); }
}
const add = (items) => {
for (const item of items) {
const key = dedupKey(item.url);
if (!seen.has(key)) { seen.add(key); out.push(item); }
}
};
if (inChina) {
// China: choose the engine according to --engine
const engine = opts.engine === 'auto' ? 'bing' : opts.engine;
if (engine === 'sogou') {
console.error('[策略] 国内 → 搜狗 HTTP (⚠ 无cookie易被反爬拦截,结果可能为空)');
add(await searchSogouHttp(query, max));
} else {
console.error('[策略] 国内 → Bing PW');
add(await searchBingPW(query, max));
}
} else {
console.error('[策略] 海外 → DDG HTML');
add(await searchDDGHtml(query, max));
}
const results = out.slice(0, max);
// Auto-fetch
await autoFetchUrls(results, fetchCount, opts.maxLen || 6000);
console.log(JSON.stringify(results, null, 2));
console.error(`[耗时] ${((Date.now() - startTime) / 1000).toFixed(1)}s | ${results.length}条结果`);
await closeBrowser();
}
main().catch(e => { console.error('[ERROR]', e.message); process.exit(1); });
FILE:scripts/setup.sh
#!/bin/bash
# free-web-search-js setup (Linux/macOS)
# v28
set -e
SKILL_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
echo ""
echo "=== free-web-search-js Setup ==="
echo ""
echo "Dependencies:"
echo " - Node.js >= 18"
echo " - npm packages: cheerio, commander, iconv-lite, playwright"
echo " - Playwright Chromium browser (~150MB, takes a few minutes)"
echo ""
# Node.js
if ! command -v node &>/dev/null; then
echo "[X] Node.js not found"
echo " -> https://nodejs.org"
exit 1
fi
NODE_VERSION=$(node --version)
MAJOR=$(echo "$NODE_VERSION" | sed 's/^v//' | cut -d. -f1)
if [ "$MAJOR" -lt 18 ]; then
echo "[X] Node.js >= 18 required (current: $NODE_VERSION)"
exit 1
fi
echo "[OK] Node.js $NODE_VERSION"
# Detect a mainland-China network and pick mirror sources accordingly
IN_CHINA=false
echo ""
echo "Detecting network region..."
for url in "https://myip.ipip.net" "https://cip.cc"; do
if resp=$(curl -sS --max-time 3 "$url" 2>/dev/null); then
if echo "$resp" | grep -qiE "中国|CN"; then
IN_CHINA=true
break
fi
fi
done
if [ "$IN_CHINA" = true ]; then
echo "[OK] China network detected, using mirror sources"
export PLAYWRIGHT_DOWNLOAD_HOST="https://npmmirror.com/mirrors/playwright"
NPM_REGISTRY="--registry=https://registry.npmmirror.com"
else
echo "[OK] International network, using official sources"
NPM_REGISTRY=""
fi
# npm install
echo ""
echo "Installing npm packages (cheerio, commander, iconv-lite, playwright)..."
cd "$SKILL_ROOT"
if [ -n "$NPM_REGISTRY" ]; then
if ! npm install $NPM_REGISTRY; then
echo "[X] npm install failed"
exit 1
fi
else
if ! npm install; then
echo "[X] npm install failed"
exit 1
fi
fi
echo "[OK] npm packages installed"
# Playwright Chromium
echo ""
echo "Installing Playwright Chromium browser (~150MB, this may take a few minutes)..."
if ! npx playwright install chromium; then
echo "[X] Playwright Chromium install failed"
echo " Try manually: npx playwright install chromium"
exit 1
fi
echo "[OK] Playwright Chromium installed"
echo ""
echo "[OK] Setup complete!"
echo " Verify: node scripts/check-env.js"
FILE:scripts/_batch_test.js
#!/usr/bin/env node
/**
* Batch test: run several queries and record elapsed time, result count, and post-dedup count
*/
import { execSync } from 'child_process';
const queries = [
'今日黄金价格',
'俄乌冲突最新消息',
'怎么做红烧肉',
'上海明天天气',
'感冒吃什么药',
'量子计算',
'北京',
'今日铜价',
];
console.log('Query'.padEnd(30) + 'Results Time Engines');
console.log('-'.repeat(65));
for (const q of queries) {
const t = Date.now();
try {
const raw = execSync(`node scripts/search.js "${q}" --max=10`, {
encoding: 'utf8',
timeout: 120000,
stdio: ['pipe', 'pipe', 'pipe'],
});
const elapsed = ((Date.now() - t) / 1000).toFixed(1);
const results = JSON.parse(raw);
// Engine info would have to be parsed from stderr (simplified here: only the result count is shown)
console.log(q.padEnd(30) + `${results.length}`.padEnd(9) + `${elapsed}s`.padEnd(8));
} catch (e) {
const elapsed = ((Date.now() - t) / 1000).toFixed(1);
console.log(q.padEnd(30) + 'FAIL'.padEnd(9) + `${elapsed}s`.padEnd(8) + e.message.split('\n')[0].slice(0, 30));
}
}
FILE:scripts/_batch_test2.js
#!/usr/bin/env node
/**
* Batch test: spawn search.js once per query and read per-engine counts from its stderr
*/
import querystring from 'querystring';
const queries = [
'今日黄金价格',
'俄乌冲突最新消息',
'怎么做红烧肉',
'上海明天天气',
'感冒吃什么药',
'量子计算',
'北京',
'今日铜价',
];
// Importing search.js's functions directly is too involved, so spawn it and time each run
import { spawn } from 'child_process';
async function runOne(q) {
return new Promise((resolve) => {
const t = Date.now();
// Run from the skill root (the parent of scripts/) so 'scripts/search.js' resolves
const p = spawn('node', ['scripts/search.js', q, '--max=10'], {
cwd: new URL('..', import.meta.url),
});
let stdout = '', stderr = '';
p.stdout.on('data', d => stdout += d);
p.stderr.on('data', d => stderr += d);
p.on('close', (code) => {
const elapsed = ((Date.now() - t) / 1000).toFixed(1);
if (code !== 0) {
resolve({ q, ok: false, elapsed, error: `exit ${code}` });
return;
}
try {
const results = JSON.parse(stdout);
const bingMatch = stderr.match(/\[Bing:pw\] (\d+) 条/);
const baiduMatch = stderr.match(/\[百度:pw\] (\d+) 条/);
resolve({
q, ok: true, elapsed,
count: results.length,
bing: bingMatch ? parseInt(bingMatch[1]) : 0,
baidu: baiduMatch ? parseInt(baiduMatch[1]) : 0,
});
} catch (e) {
resolve({ q, ok: false, elapsed, error: 'parse error' });
}
});
p.on('error', e => {
const elapsed = ((Date.now() - t) / 1000).toFixed(1);
resolve({ q, ok: false, elapsed, error: e.message.slice(0, 30) });
});
});
}
console.log('Query'.padEnd(24) + 'Results Bing Baidu Time');
console.log('-'.repeat(60));
const allResults = [];
for (const q of queries) {
const r = await runOne(q);
allResults.push(r);
if (r.ok) {
console.log(r.q.padEnd(24) + `${r.count}`.padEnd(9) + `${r.bing}`.padEnd(6) + `${r.baidu}`.padEnd(7) + `${r.elapsed}s`);
} else {
console.log(r.q.padEnd(24) + 'FAIL'.padEnd(9) + ''.padEnd(6) + ''.padEnd(7) + `${r.elapsed}s ` + r.error);
}
}
// Summary
const okResults = allResults.filter(r => r.ok);
const n = okResults.length || 1; // avoid dividing by zero when every query failed
const avgTime = okResults.reduce((s, r) => s + parseFloat(r.elapsed), 0) / n;
const avgCount = okResults.reduce((s, r) => s + r.count, 0) / n;
console.log('-'.repeat(60));
console.log(`平均: ${avgCount.toFixed(1)}条 ${avgTime.toFixed(1)}s (${okResults.length}/${allResults.length} 成功)`);
FILE:scripts/_bench.js
import { execSync } from 'child_process';
const t = Date.now();
const p = execSync('node scripts/search.js "今日黄金价格" --max=8', {
encoding: 'utf8',
stdio: ['pipe', 'pipe', 'pipe'],
timeout: 60000,
// Run from the skill root (the parent of scripts/) so 'scripts/search.js' resolves
cwd: new URL('..', import.meta.url),
});
console.log('耗时:', ((Date.now() - t) / 1000).toFixed(1), '秒');
console.log('结果数:', JSON.parse(p).length);
FILE:scripts/_bench2.js
const start = Date.now();
process.argv = ['node', 'scripts/search.js', '今日黄金价格', '--max=8'];
import('./search.js').catch(() => {}).finally(() => {
// search.js calls process.exit itself, so this callback may never run
});
FILE:scripts/_debug_baidu_box.js
#!/usr/bin/env node
/**
* Debug: inspect the search-box selectors on the Baidu home page
*/
const { chromium } = await import('playwright');
const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36';
const browser = await chromium.launch({ headless: false, args: ['--disable-blink-features=AutomationControlled'] });
const context = await browser.newContext({ userAgent: UA, locale: 'zh-CN', viewport: { width: 1920, height: 1080 } });
await context.addInitScript(() => {
Object.defineProperty(navigator, 'webdriver', { get: () => false });
window.chrome = { runtime: {} };
});
const page = await context.newPage();
await page.goto('https://www.baidu.com/', { waitUntil: 'domcontentloaded', timeout: 15000 });
await page.waitForTimeout(2000);
// List every input element
const inputs = await page.evaluate(() => {
return Array.from(document.querySelectorAll('input')).map(el => ({
id: el.id,
name: el.name,
type: el.type,
className: el.className,
placeholder: el.placeholder,
}));
});
console.log('Inputs:', JSON.stringify(inputs, null, 2));
// Try a search
const query = '今日黄金价格';
const searchBox = await page.$('#kw') || await page.$('input[name="wd"]');
if (searchBox) {
console.log('找到搜索框:', await searchBox.evaluate(el => ({ id: el.id, name: el.name })));
await searchBox.fill(query);
await page.waitForTimeout(300);
await page.keyboard.press('Enter');
await page.waitForLoadState('domcontentloaded', { timeout: 15000 });
await page.waitForTimeout(2000);
const results = await page.evaluate(() => {
const items = [];
document.querySelectorAll('.result h3 a, .c-container h3 a').forEach(a => {
items.push(a.textContent.trim().slice(0, 50));
});
return items;
});
console.log('\n百度搜索结果前5条:');
results.slice(0, 5).forEach((t, i) => console.log(`  ${i + 1}. ${t}`));
} else {
console.log('未找到搜索框');
}
await browser.close();
FILE:scripts/_debug_baidu_pw.js
#!/usr/bin/env node
/**
* Search Baidu with Playwright and inspect the results
*/
const { chromium } = await import('playwright');
const browser = await chromium.launch({ headless: false, args: ['--disable-blink-features=AutomationControlled'] });
const page = await browser.newPage();
await page.addInitScript(() => {
Object.defineProperty(navigator, 'webdriver', { get: () => false });
window.chrome = { runtime: {} };
});
// Visit the Baidu home page first
await page.goto('https://www.baidu.com', { waitUntil: 'domcontentloaded', timeout: 10000 });
await page.waitForTimeout(1000);
// Search
const query = '今日黄金价格';
console.log('Baidu search:', query);
await page.goto('https://www.baidu.com/s?wd=' + encodeURIComponent(query), {
waitUntil: 'domcontentloaded', timeout: 15000,
});
await page.waitForTimeout(2000);
const results = await page.evaluate(() => {
const items = [];
document.querySelectorAll('.result h3 a, .c-container h3 a').forEach(a => {
items.push({
title: a.textContent.trim().slice(0, 60),
href: a.href,
});
});
return items;
});
const html = await page.content();
console.log('\n含金投网:', html.includes('cngold'));
console.log('含新浪:', html.includes('finance.sina'));
console.log('含十六番:', html.includes('16fan'));
console.log('含kekegold:', html.includes('kekegold'));
console.log('\n前10条:');
results.slice(0, 10).forEach((r, i) => {
console.log(`  ${i + 1}. ${r.title}`);
console.log(`     ${r.href.slice(0, 80)}`);
});
await browser.close();
FILE:scripts/_debug_bing.js
#!/usr/bin/env node
/**
* Debug: inspect the raw search results returned by Bing CN
*/
const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36';
const query = '今日黄金价格';
console.log('Query:', query);
const url = 'https://cn.bing.com/search?' + new URLSearchParams({ q: query });
console.log('URL:', url);
const r = await fetch(url, {
headers: {
'User-Agent': UA,
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.9',
},
redirect: 'follow',
});
console.log('Status:', r.status);
const html = await r.text();
console.log('HTML length:', html.length);
// Extract results
const { load } = await import('cheerio');
const $ = load(html);
const results = [];
$('li.b_algo').each((i, el) => {
const $el = $(el);
const $a = $el.find('h2 a');
if (!$a.length) return;
const title = $a.text().trim();
const href = $a.attr('href') || '';
const snippet = $el.find('.b_caption p').text().trim();
results.push({
index: i + 1,
title: title.slice(0, 60),
href: href.slice(0, 80),
snippet: snippet.slice(0, 60)
});
});
console.log('\n=== Bing CN Results ===');
results.slice(0, 10).forEach(r => {
console.log(`${r.index}. ${r.title}`);
console.log(`   href: ${r.href}`);
console.log(`   snippet: ${r.snippet}`);
console.log('');
});
// Check whether the first page mentions cngold.org (金投网) or Sina Finance
const hasCngold = html.includes('cngold.org') || html.includes('金投网');
const hasSina = html.includes('finance.sina') || html.includes('新浪财经');
console.log('HTML contains cngold.org/金投网:', hasCngold);
console.log('HTML contains finance.sina/新浪财经:', hasSina);
FILE:scripts/_debug_bing2.js
#!/usr/bin/env node
/**
* Step-by-step investigation of why Bing CN search results differ
* Compares results under different request-header / cookie combinations
*/
const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36';
const query = '今日黄金价格';
async function testBing(label, url, headers) {
try {
const r = await fetch(url, { headers, redirect: 'follow', signal: AbortSignal.timeout(10000) });
const html = await r.text();
const { load } = await import('cheerio');
const $ = load(html);
const results = [];
$('li.b_algo').each((i, el) => {
const $a = $(el).find('h2 a');
if ($a.length) results.push($a.text().trim().slice(0, 50));
});
const hasCngold = html.includes('cngold');
const hasSina = html.includes('finance.sina');
const has16fan = html.includes('16fan');
console.log(`\n=== ${label} ===`);
console.log(`Status: ${r.status}, HTML: ${html.length} bytes`);
console.log(`含金投网: ${hasCngold}, 含新浪: ${hasSina}, 含十六番: ${has16fan}`);
console.log(`前3条:`);
results.slice(0, 3).forEach((t, i) => console.log(`  ${i + 1}. ${t}`));
} catch (e) {
console.log(`\n=== ${label} === FAILED: ${e.message}`);
}
}
// Test 1: the skill's current approach (minimal headers)
await testBing('1. 当前skill方式(简header)',
'https://cn.bing.com/search?q=' + encodeURIComponent(query),
{
'User-Agent': UA,
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.9',
}
);
// Test 2: add more standard browser headers
await testBing('2. 完整浏览器header',
'https://cn.bing.com/search?q=' + encodeURIComponent(query),
{
'User-Agent': UA,
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7',
'Accept-Encoding': 'gzip, deflate, br',
'Cache-Control': 'max-age=0',
'Sec-Ch-Ua': '"Chromium";v="136", "Google Chrome";v="136", "Not-A.Brand";v="99"',
'Sec-Ch-Ua-Mobile': '?0',
'Sec-Ch-Ua-Platform': '"Windows"',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-User': '?1',
'Upgrade-Insecure-Requests': '1',
}
);
// Test 3: use www.bing.com instead of cn.bing.com
await testBing('3. www.bing.com + zh-CN',
'https://www.bing.com/search?q=' + encodeURIComponent(query) + '&setlang=zh-CN&cc=cn',
{
'User-Agent': UA,
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.9',
}
);
// Test 4: cn.bing.com + FORM=R5FD1 (standard Bing CN parameter)
await testBing('4. cn.bing.com + FORM=R5FD1',
'https://cn.bing.com/search?q=' + encodeURIComponent(query) + '&FORM=R5FD1',
{
'User-Agent': UA,
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.9',
}
);
// Test 5: visit the cn.bing.com home page first to get cookies, then search
console.log('\n=== 5. 先拿cookie再搜索 ===');
try {
// Visit the home page first
const homeR = await fetch('https://cn.bing.com/', {
headers: { 'User-Agent': UA, 'Accept': 'text/html' },
redirect: 'follow', signal: AbortSignal.timeout(5000),
});
const homeHtml = await homeR.text();
console.log('首页 status:', homeR.status, 'size:', homeHtml.length);
// Inspect the response headers (including any Set-Cookie)
// Note: no cookie is forwarded to the search below, because testBing() only sends the headers it is given
console.log('首页 headers:', Object.fromEntries(homeR.headers.entries()));
// Then search
await testBing('5a. 拿cookie后搜索',
'https://cn.bing.com/search?q=' + encodeURIComponent(query) + '&FORM=R5FD1',
{
'User-Agent': UA,
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.9',
}
);
} catch (e) {
console.log('Cookie test failed:', e.message);
}
FILE:scripts/_debug_bing3.js
#!/usr/bin/env node
/**
* Test Bing CN search with cookies carried over from the home page,
* to see whether the results differ once cookies are sent.
* undici does not export a CookieJar, so cookies are collected by hand from
* Set-Cookie headers using the built-in fetch (needs Headers.getSetCookie()).
*/
const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36';
const query = '今日黄金价格';
const jar = new Map();
function collectCookies(response) {
for (const c of response.headers.getSetCookie?.() || []) {
const [kv] = c.split(';');
const [k, ...v] = kv.split('=');
jar.set(k.trim(), v.join('='));
}
}
const cookieHeader = () => Array.from(jar, ([k, v]) => `${k}=${v}`).join('; ');
// Step 1: visit the Bing home page and collect its cookies
console.log('Step 1: 访问 cn.bing.com 首页...');
const homeR = await fetch('https://cn.bing.com/', {
headers: { 'User-Agent': UA, 'Accept': 'text/html' },
redirect: 'follow',
signal: AbortSignal.timeout(5000),
});
collectCookies(homeR);
console.log('首页 status:', homeR.status);
// Show what was collected
console.log('Cookie数量:', jar.size);
jar.forEach((v, k) => console.log(`  ${k}=${String(v).slice(0, 30)}...`));
// Step 2: search with the collected cookies attached
console.log('\nStep 2: 带cookie搜索...');
const searchR = await fetch('https://cn.bing.com/search?q=' + encodeURIComponent(query), {
headers: {
'User-Agent': UA,
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Cookie': cookieHeader(),
},
redirect: 'follow',
signal: AbortSignal.timeout(10000),
});
const html = await searchR.text();
console.log('搜索 status:', searchR.status, 'HTML:', html.length, 'bytes');
const { load } = await import('cheerio');
const $ = load(html);
const results = [];
$('li.b_algo').each((i, el) => {
const $a = $(el).find('h2 a');
if ($a.length) results.push($a.text().trim().slice(0, 60));
});
console.log('\n含金投网:', html.includes('cngold'));
console.log('含新浪:', html.includes('finance.sina'));
console.log('含十六番:', html.includes('16fan'));
console.log('\n前5条:');
results.slice(0, 5).forEach((t, i) => console.log(`  ${i + 1}. ${t}`));
FILE:scripts/_debug_bing_cookie.js
#!/usr/bin/env node
/**
* 用undici的Agent + cookie支持测试Bing CN
* Node.js 24 内置undici,可以用setGlobalDispatcher带cookie
*/
import { Agent, setGlobalDispatcher, fetch } from 'undici';
// Install the global dispatcher (cookies themselves are managed manually below)
const agent = new Agent({ connect: { rejectUnauthorized: true } });
setGlobalDispatcher(agent);
const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36';
const query = '今日黄金价格';
// Manual cookie management
const cookies = new Map();
function extractCookies(response, url) {
const setCookie = response.headers.getSetCookie?.() || [];
for (const c of setCookie) {
const [kv] = c.split(';');
const [k, ...v] = kv.split('=');
cookies.set(k.trim(), v.join('='));
}
}
function cookieHeader(url) {
if (cookies.size === 0) return '';
return Array.from(cookies.entries()).map(([k, v]) => `${k}=${v}`).join('; ');
}
// Step 1: visit the Bing home page to obtain cookies
console.log('Step 1: visiting the cn.bing.com home page...');
const homeR = await fetch('https://cn.bing.com/', {
headers: {
'User-Agent': UA,
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.9',
},
redirect: 'follow',
signal: AbortSignal.timeout(5000),
});
const homeHtml = await homeR.text();
extractCookies(homeR, 'https://cn.bing.com');
console.log('Home page status:', homeR.status);
console.log('Cookie:', cookieHeader('https://cn.bing.com').slice(0, 100));
// Step 2: search with cookies attached
console.log('\nStep 2: searching with cookies...');
const searchR = await fetch('https://cn.bing.com/search?q=' + encodeURIComponent(query), {
headers: {
'User-Agent': UA,
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Cookie': cookieHeader('https://cn.bing.com'),
},
redirect: 'follow',
signal: AbortSignal.timeout(10000),
});
const html = await searchR.text();
extractCookies(searchR, 'https://cn.bing.com');
const { load } = await import('cheerio');
const $ = load(html);
const results = [];
$('li.b_algo').each((i, el) => {
const $a = $(el).find('h2 a');
if ($a.length) results.push({ title: $a.text().trim().slice(0, 60), url: $a.attr('href') });
});
console.log('\nContains cngold (金投网):', html.includes('cngold'));
console.log('Contains Sina Finance:', html.includes('finance.sina'));
console.log('Contains 16fan (十六番):', html.includes('16fan'));
console.log('\nTop 5:');
results.slice(0, 5).forEach((r, i) => console.log(` ${i + 1}. ${r.title}\n ${r.url?.slice(0, 80)}`));
FILE:scripts/_debug_bing_full.js
#!/usr/bin/env node
/**
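* Investigate why Bing CN results differ:
* 1. Encoding (URL-encoded vs raw UTF-8)
* 2. Cookies (visit the home page first to obtain cookies)
* 3. Anti-bot measures (Playwright with extra stealth)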
*/
const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36';
const query = '今日黄金价格';
// ===== Test 1: encoding =====
console.log('=== Test 1: encoding comparison ===');
const url1 = 'https://cn.bing.com/search?q=' + encodeURIComponent(query);
const url2 = 'https://cn.bing.com/search?q=' + query; // left unencoded so fetch handles it
console.log('encodeURIComponent:', url1);
console.log('raw UTF-8:', url2);
console.log('');
// ===== Test 2: Playwright with extra stealth =====
console.log('=== Test 2: Playwright with extra stealth ===');
const { chromium } = await import('playwright');
const browser = await chromium.launch({
headless: false,
args: [
'--disable-blink-features=AutomationControlled',
'--disable-features=IsolateOrigins,site-per-process',
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
'--disable-web-security',
],
});
const context = await browser.newContext({
userAgent: UA,
locale: 'zh-CN',
viewport: { width: 1920, height: 1080 },
// Mimic a real browser environment
extraHTTPHeaders: {
'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7',
},
});
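// Inject the anti-detection script
await context.addInitScript(() => {
// Hide navigator.webdriver
Object.defineProperty(navigator, 'webdriver', { get: () => false });
// Add a window.chrome object
window.chrome = { runtime: {}, loadTimes: function(){}, csi: function(){} };
// Patch permissions.query (bound, so the fallback call keeps its `this`)
const origQuery = window.navigator.permissions?.query?.bind(window.navigator.permissions);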
if (origQuery) {
window.navigator.permissions.query = (params) => (
params.name === 'notifications' ? Promise.resolve({ state: Notification.permission }) : origQuery(params)
);
}
// Override navigator.plugins
Object.defineProperty(navigator, 'plugins', {
get: () => [1, 2, 3, 4, 5],
});
// Override navigator.languages
Object.defineProperty(navigator, 'languages', {
get: () => ['zh-CN', 'zh', 'en-US', 'en'],
});
});
const page = await context.newPage();
// Visit the Bing home page first so the browser picks up cookies naturally
console.log('Step 1: visiting the cn.bing.com home page...');
await page.goto('https://cn.bing.com/', { waitUntil: 'domcontentloaded', timeout: 15000 });
await page.waitForTimeout(2000);
// Inspect the cookies
const cookies = await context.cookies('https://cn.bing.com');
console.log('Cookie count:', cookies.length);
cookies.forEach(c => console.log(` ${c.name}=${c.value.slice(0, 30)}...`));
// Step 2: type the query into the home-page search box (mimics real user behavior)
console.log('\nStep 2: typing the query into the search box...');
try {
const searchBox = await page.$('#sb_form_q');
if (searchBox) {
await searchBox.click();
await searchBox.fill(query);
await page.waitForTimeout(500);
// Press Enter to search
await page.keyboard.press('Enter');
await page.waitForLoadState('domcontentloaded', { timeout: 15000 });
await page.waitForTimeout(3000);
console.log('Search via the search box succeeded');
} else {
console.log('Search box not found; searching by URL directly');
await page.goto('https://cn.bing.com/search?q=' + encodeURIComponent(query), {
waitUntil: 'domcontentloaded', timeout: 15000,
});
await page.waitForTimeout(3000);
}
} catch (e) {
console.log('Search-box search failed; falling back to URL:', e.message.slice(0, 50));
await page.goto('https://cn.bing.com/search?q=' + encodeURIComponent(query), {
waitUntil: 'domcontentloaded', timeout: 15000,
});
await page.waitForTimeout(3000);
}
// Extract the results
const results = await page.evaluate(() => {
const items = [];
document.querySelectorAll('li.b_algo').forEach(el => {
const a = el.querySelector('h2 a');
if (a) items.push({
title: a.textContent.trim().slice(0, 60),
url: a.href,
snippet: el.querySelector('.b_caption p')?.textContent?.trim().slice(0, 60) || '',
});
});
return items;
});
const html = await page.content();
console.log('\nContains cngold (金投网):', html.includes('cngold'));
console.log('Contains Sina Finance:', html.includes('finance.sina'));
console.log('Contains 16fan (十六番):', html.includes('16fan'));
console.log('Contains huilvbiao (汇率表):', html.includes('huilvbiao'));
console.log('Contains jinjia (金价网):', html.includes('jinjia') || html.includes('94723'));
console.log('Contains kekegold:', html.includes('kekegold'));
console.log('\nTop 10:');
results.slice(0, 10).forEach((r, i) => {
console.log(` ${i + 1}. ${r.title}`);
console.log(` ${r.url?.slice(0, 80)}`);
});
// Check the current URL
console.log('\nCurrent page URL:', page.url());
await browser.close();
FILE:scripts/_debug_bing_pw.js
#!/usr/bin/env node
/**
* Search Bing CN in a real Playwright browser to see whether results differ
*/
const { chromium } = await import('playwright');
const browser = await chromium.launch({ headless: false, args: ['--disable-blink-features=AutomationControlled'] });
const page = await browser.newPage();
await page.addInitScript(() => {
Object.defineProperty(navigator, 'webdriver', { get: () => false });
window.chrome = { runtime: {} };
});
// Open the Bing CN search page
const query = '今日黄金价格';
console.log('Navigating to Bing CN...');
await page.goto('https://cn.bing.com/search?q=' + encodeURIComponent(query), {
waitUntil: 'domcontentloaded', timeout: 15000,
});
await page.waitForTimeout(2000);
const results = await page.evaluate(() => {
const items = [];
document.querySelectorAll('li.b_algo').forEach(el => {
const a = el.querySelector('h2 a');
if (a) items.push({
title: a.textContent.trim().slice(0, 60),
url: a.href,
snippet: el.querySelector('.b_caption p')?.textContent?.trim().slice(0, 60) || '',
});
});
return items;
});
const html = await page.content();
console.log('\nContains cngold (金投网):', html.includes('cngold'));
console.log('Contains Sina Finance:', html.includes('finance.sina'));
console.log('Contains 16fan (十六番):', html.includes('16fan'));
console.log('\nTop 10:');
results.slice(0, 10).forEach((r, i) => {
console.log(` ${i + 1}. ${r.title}`);
console.log(` ${r.url}`);
});
await browser.close();
FILE:scripts/_debug_cookie_combos.js
#!/usr/bin/env node
/**
* Test: Playwright obtains cookies → fetch searches Bing with those cookies plus the form=QBLH parameter
*/
const { chromium } = await import('playwright');
const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36';
const query = '今日黄金价格';
// Step 1: obtain cookies with Playwright
console.log('Step 1: obtaining cookies with Playwright...');
const browser = await chromium.launch({ headless: false, args: ['--disable-blink-features=AutomationControlled'] });
const context = await browser.newContext({
userAgent: UA, locale: 'zh-CN', viewport: { width: 1920, height: 1080 },
});
await context.addInitScript(() => {
Object.defineProperty(navigator, 'webdriver', { get: () => false });
window.chrome = { runtime: {} };
});
const page = await context.newPage();
await page.goto('https://cn.bing.com/', { waitUntil: 'domcontentloaded', timeout: 15000 });
await page.waitForTimeout(2000);
const cookies = await context.cookies('https://cn.bing.com');
const cookieStr = cookies.map(c => `${c.name}=${c.value}`).join('; ');
console.log(`Got ${cookies.length} cookies`);
await browser.close();
// Step 2: fetch with cookies plus different URL parameter combinations
const tests = [
['cookie + form=QBLH', `https://cn.bing.com/search?q=${encodeURIComponent(query)}&form=QBLH`],
['cookie + form=QBLH + cvid', `https://cn.bing.com/search?q=${encodeURIComponent(query)}&form=QBLH&sp=-1&lq=0&pq=&sc=12-0&qs=n&sk=&cvid=${crypto.randomUUID().replace(/-/g,'').slice(0,32)}`],
['cookie + FORM=R5FD1', `https://cn.bing.com/search?q=${encodeURIComponent(query)}&FORM=R5FD1`],
['cookie only', `https://cn.bing.com/search?q=${encodeURIComponent(query)}`],
['no cookie + form=QBLH', `https://cn.bing.com/search?q=${encodeURIComponent(query)}&form=QBLH`],
];
for (const [label, url] of tests) {
const useCookie = !label.startsWith('no cookie');
try {
const headers = {
'User-Agent': UA,
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.9',
};
if (useCookie) headers['Cookie'] = cookieStr;
const r = await fetch(url, { headers, redirect: 'follow', signal: AbortSignal.timeout(8000) });
const html = await r.text();
const { load } = await import('cheerio');
const $ = load(html);
const results = [];
$('li.b_algo').each((i, el) => {
const $a = $(el).find('h2 a');
if ($a.length) results.push($a.text().trim().slice(0, 50));
});
console.log(`\n=== ${label} ===`);
console.log(`Contains cngold (金投网): ${html.includes('cngold')}, contains 16fan (十六番): ${html.includes('16fan')}`);
console.log(`Top 3: ${results.slice(0, 3).join(' | ')}`);
} catch (e) {
console.log(`\n=== ${label} === FAILED: ${e.message}`);
}
}
FILE:scripts/_debug_cookie_fetch.js
#!/usr/bin/env node
/**
* Test: Playwright obtains cookies → fetch searches Bing with those cookies
*/
const { chromium } = await import('playwright');
const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36';
const query = '今日黄金价格';
// Step 1: obtain cookies with Playwright
console.log('Step 1: obtaining cookies with Playwright...');
const browser = await chromium.launch({ headless: false, args: ['--disable-blink-features=AutomationControlled'] });
const context = await browser.newContext({
userAgent: UA,
locale: 'zh-CN',
viewport: { width: 1920, height: 1080 },
});
await context.addInitScript(() => {
Object.defineProperty(navigator, 'webdriver', { get: () => false });
window.chrome = { runtime: {} };
});
const page = await context.newPage();
await page.goto('https://cn.bing.com/', { waitUntil: 'domcontentloaded', timeout: 15000 });
await page.waitForTimeout(2000);
const cookies = await context.cookies('https://cn.bing.com');
const cookieStr = cookies.map(c => `${c.name}=${c.value}`).join('; ');
console.log(`Got ${cookies.length} cookies, total length ${cookieStr.length}`);
await browser.close();
// Step 2: search Bing via fetch with the cookies attached
console.log('\nStep 2: searching via fetch with cookies...');
const r = await fetch('https://cn.bing.com/search?q=' + encodeURIComponent(query), {
headers: {
'User-Agent': UA,
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Cookie': cookieStr,
},
redirect: 'follow',
signal: AbortSignal.timeout(10000),
});
const html = await r.text();
console.log('Status:', r.status, 'HTML:', html.length);
const { load } = await import('cheerio');
const $ = load(html);
const results = [];
$('li.b_algo').each((i, el) => {
const $a = $(el).find('h2 a');
if ($a.length) results.push($a.text().trim().slice(0, 60));
});
console.log('\nContains cngold (金投网):', html.includes('cngold'));
console.log('Contains Sina Finance:', html.includes('finance.sina'));
console.log('Contains 16fan (十六番):', html.includes('16fan'));
console.log('\nTop 5:');
results.slice(0, 5).forEach((t, i) => console.log(` ${i + 1}. ${t}`));
FILE:scripts/_debug_ddg.js
#!/usr/bin/env node
/**
* Debug DDG HTML Lite
*/
const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36';
const query = 'python tutorial';
console.log('Test 1: DDG HTML Lite POST');
try {
const r = await fetch('https://html.duckduckgo.com/html/', {
method: 'POST',
headers: {
'Content-Type': 'application/x-www-form-urlencoded',
'User-Agent': UA,
},
body: 'q=' + encodeURIComponent(query),
signal: AbortSignal.timeout(10000),
});
console.log('Status:', r.status, 'Size:', (await r.clone().text()).length);
const html = await r.text();
const { load } = await import('cheerio');
const $ = load(html);
const results = [];
$('.result, .web-result').each((i, el) => {
const $a = $(el).find('.result__title a, .result__a, h2 a').first();
if ($a.length) results.push($a.text().trim().slice(0, 50));
});
console.log('Results:', results.length);
results.slice(0, 5).forEach((t, i) => console.log(` ${i + 1}. ${t}`));
} catch (e) {
console.log('Failed:', e.message);
}
console.log('\nTest 2: DDG HTML Lite GET');
try {
const r = await fetch('https://html.duckduckgo.com/html/?q=' + encodeURIComponent(query), {
headers: { 'User-Agent': UA },
signal: AbortSignal.timeout(10000),
});
console.log('Status:', r.status, 'Size:', (await r.clone().text()).length);
const html = await r.text();
const { load } = await import('cheerio');
const $ = load(html);
const results = [];
$('.result, .web-result').each((i, el) => {
const $a = $(el).find('.result__title a, .result__a, h2 a').first();
if ($a.length) results.push($a.text().trim().slice(0, 50));
});
console.log('Results:', results.length);
results.slice(0, 5).forEach((t, i) => console.log(` ${i + 1}. ${t}`));
} catch (e) {
console.log('Failed:', e.message);
}
console.log('\nTest 3: Bing International');
try {
const r = await fetch('https://www.bing.com/search?q=' + encodeURIComponent(query), {
headers: {
'User-Agent': UA,
'Accept-Language': 'en-US,en;q=0.9',
},
signal: AbortSignal.timeout(10000),
redirect: 'follow',
});
console.log('Status:', r.status, 'Size:', (await r.clone().text()).length);
const html = await r.text();
const { load } = await import('cheerio');
const $ = load(html);
const results = [];
$('li.b_algo').each((i, el) => {
const $a = $(el).find('h2 a');
if ($a.length) results.push($a.text().trim().slice(0, 50));
});
console.log('Results:', results.length);
results.slice(0, 5).forEach((t, i) => console.log(` ${i + 1}. ${t}`));
} catch (e) {
console.log('Failed:', e.message);
}
FILE:scripts/_debug_searchbox.js
#!/usr/bin/env node
/**
* Debug: what Bing actually searches for after Chinese text is typed into the search box
*/
const { chromium } = await import('playwright');
const UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36';
const browser = await chromium.launch({ headless: false, args: ['--disable-blink-features=AutomationControlled'] });
const context = await browser.newContext({ userAgent: UA, locale: 'zh-CN', viewport: { width: 1920, height: 1080 } });
await context.addInitScript(() => {
Object.defineProperty(navigator, 'webdriver', { get: () => false });
window.chrome = { runtime: {} };
});
const page = await context.newPage();
await page.goto('https://cn.bing.com/', { waitUntil: 'domcontentloaded', timeout: 15000 });
await page.waitForTimeout(2000);
// Type the query
const query = '怎么做红烧肉';
const searchBox = await page.$('#sb_form_q');
await searchBox.click();
await searchBox.fill(query);
await page.waitForTimeout(500);
// Inspect the search box value
const inputValue = await page.evaluate(() => document.getElementById('sb_form_q').value);
console.log('Search box value:', inputValue);
// Press Enter
await page.keyboard.press('Enter');
await page.waitForLoadState('domcontentloaded', { timeout: 15000 });
await page.waitForTimeout(2000);
// Inspect the final URL
console.log('Final URL:', page.url());
// Inspect the results
const results = await page.evaluate(() => {
const items = [];
document.querySelectorAll('li.b_algo').forEach(el => {
const a = el.querySelector('h2 a');
if (a) items.push(a.textContent.trim().slice(0, 50));
});
return items;
});
console.log('\nTop 5:');
results.slice(0, 5).forEach((t, i) => console.log(` ${i + 1}. ${t}`));
await browser.close();
FILE:scripts/_time_one.js
#!/usr/bin/env node
/**
* Single-query test: prints elapsed time and the number of results
*/
const q = process.argv[2] || '今日黄金价格';
const max = process.argv[3] || '10';
// Rewrite process.argv so that search.js runs with these arguments
process.argv = [process.argv[0], 'scripts/search.js', q, '--max=' + max];
const t = Date.now();
try {
await import('./search.js');
} catch {}
// search.js normally calls process.exit; this line runs only if it did not exit:
console.error('\nTotal time:', ((Date.now() - t) / 1000).toFixed(1), 's');
Iterative prompt optimizer for complex tasks. Strictly implements ACON's two-stage iterative optimization + APE automatic prompt engineering. Only triggers w...
---
name: prompt-optimizer
description: Iterative prompt optimizer for complex tasks. Strictly implements ACON's two-stage iterative optimization + APE automatic prompt engineering. Only triggers when user explicitly requests it, actively collects feedback after optimization, supports multi-round iteration until satisfied.
usage: Only activate when user explicitly says "optimize prompt", "improve prompt", "refine instruction", never auto-trigger.
author: Based on arXiv:2510.00615 (ACON), arXiv:2211.01910 (APE)
license: MIT
tags:
- prompt-optimization
- acon
- ape
- iterative
- complex-tasks
---
## Atomic Optimization Methodology
### 🔬 Stage 1: Input Parsing & Critical Signal Extraction (ACON Paper §3.1)
**Input**: User's original prompt
**Operations**:
1. Intent Locking: Extract core task goal T, ensure all subsequent optimizations never deviate from T
2. Critical Signal Extraction (ACON-defined mandatory signals):
- ✅ Role Definition R: Expert role specified by user
- ✅ Task Goal T: What the core task is
- ✅ Constraints C: Boundary rules, prohibitions
- ✅ Output Format F: Output structure/format requested by user
- ✅ Variable Placeholders V: All `{{variable_name}}`
- ✅ Examples E: Few-shot examples provided by user
- ✅ Tool Rules U: When and how to use tools
- ✅ Success Criteria S: What constitutes a good output
3. Baseline Measurement: Record original prompt token length L₀
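One part of Stage 1 that lends itself to a mechanical check is the placeholder signal V. The sketch below is illustrative only: the helper names `extract_placeholders` and `missing_placeholders` are hypothetical and not part of this skill; only the `{{variable_name}}` syntax is taken from the list above.

```python
import re

# Matches {{name}} and tolerates inner whitespace, e.g. {{ name }}.
PLACEHOLDER = re.compile(r"\{\{\s*([^{}]+?)\s*\}\}")


def extract_placeholders(prompt: str) -> set[str]:
    """Return the names of all {{...}} placeholders found in a prompt."""
    return {match.group(1) for match in PLACEHOLDER.finditer(prompt)}


def missing_placeholders(original: str, optimized: str) -> set[str]:
    """Placeholders present in the original prompt but lost in the optimized one."""
    return extract_placeholders(original) - extract_placeholders(optimized)
```

An empty result from `missing_placeholders` would indicate that signal V survived the rewrite; the other signals (R, T, C, F, E, U, S) still need a semantic check.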
---
### 🚀 Stage 2: APE Utility Enhancement (arXiv:2211.01910 Automatic Prompt Engineering)
**Goal**: Turn vague prompts into expert-level instructions, improve utility
**Operations (Strict Order)**:
1. Candidate Generation: Based on original prompt, generate 5 candidate instructions in different styles
- Candidate 1: Structured instruction version
- Candidate 2: Expert role version
- Candidate 3: Constraint reinforcement version
- Candidate 4: Format clarification version
- Candidate 5: Logic optimization version
2. Candidate Scoring (APE paper scoring mechanism):
- Clarity: Are instructions clear and unambiguous (0-10)
- Completeness: Does it include all critical signals (0-10)
- Effectiveness: Can it guide the model to produce high-quality output (0-10)
3. Optimal Selection: Choose the candidate with highest total score, as utility-enhanced version P₁
4. Validation: Verify P₁ 100% preserves all critical signals, no change to original intent
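The scoring and selection steps can be pictured with a small sketch. It assumes the three 0-10 scores have already been assigned by whoever rates the candidates (the skill does not say how they are computed), and the names `Candidate` and `select_best` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    style: str          # e.g. "structured", "expert role"
    text: str           # the candidate instruction itself
    clarity: int        # 0-10
    completeness: int   # 0-10
    effectiveness: int  # 0-10

    def __post_init__(self) -> None:
        # Reject scores outside the 0-10 scale used in step 2.
        for name in ("clarity", "completeness", "effectiveness"):
            score = getattr(self, name)
            if not 0 <= score <= 10:
                raise ValueError(f"{name} must be in 0..10, got {score}")

    @property
    def total(self) -> int:
        return self.clarity + self.completeness + self.effectiveness


def select_best(candidates: list[Candidate]) -> Candidate:
    """Pick the candidate with the highest total score (first one wins ties)."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    return max(candidates, key=lambda c: c.total)
```

The selected candidate would play the role of P₁; the step 4 validation against the critical signals still has to happen afterwards.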
---
### 📦 Stage 3: ACON Compression Optimization (ACON Paper §3.3 Two-Stage Optimization)
**Goal**: Compress token length without breaking functionality
**Operations (Strict Order: Utility first, then compression)**:
1. Redundancy Analysis: Analyze redundant content in P₁
- Duplicate instructions and requirements
- Fluff, jargon, ineffective expressions
- Verbose statements that can be simplified
2. Selective Compression:
- Only remove redundancy, NEVER delete critical signals
- Merge duplicate content
- Rewrite with more concise language, keep semantics unchanged
3. Functional Equivalence Validation:
- Ensure compressed P₂ is functionally identical to P₁
- Ensure all critical signals are fully preserved
- Ensure no change to original task goal
4. Length Control: Adjust the degree of compression according to the λ parameter (performance-cost tradeoff; a higher λ means more compression)
- Default λ=0.5: Balanced mode
- If the user says the result is "too long", automatically raise λ to 0.8 for more compression
- If the user says it is "not effective enough", automatically lower λ to 0.2 to reduce compression
---
### 📤 Stage 4: Output & Feedback Collection
**Operations**:
1. Output optimized prompt P₂, wrapped in code block for easy copying
2. Actively ask for user feedback:
```
Optimization complete. Does this version meet your needs?
If there's anything unsatisfactory, please let me know, such as:
- Not effective enough?
- Still too long?
- Some constraints/formats not preserved?
- Other issues?
I'll continue iterating based on your feedback.
```
---
### 🔄 Stage 5: Iterative Optimization (ACON Paper's R-round Iteration Mechanism)
**When user provides feedback, execute the following**:
1. Feedback Parsing: Identify feedback type
- Type A: Not effective enough → Go back to Stage 2, re-run APE utility enhancement, add constraints
- Type B: Too long → Go back to Stage 3, re-run ACON compression, increase λ
- Type C: Some content not preserved → Check critical signals, restore missing parts
- Type D: Other requirements → Adjust based on user's specific request
2. Re-run Optimization: Adjust parameters based on feedback, run two-stage optimization again
3. Validation: Ensure the new version preserves the core task goal and resolves the issue the user raised
4. Output new optimized version, ask for feedback again
5. Repeat until user indicates satisfaction
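The routing in step 1 can be written down as a tiny dispatch function. This is a sketch under assumptions: the `Feedback` enum and `plan_next_round` are hypothetical names, and the λ values 0.8 and 0.2 are the ones given under Stage 3's length control.

```python
from enum import Enum


class Feedback(Enum):
    NOT_EFFECTIVE = "A"    # Type A
    TOO_LONG = "B"         # Type B
    MISSING_CONTENT = "C"  # Type C
    OTHER = "D"            # Type D


def plan_next_round(feedback: Feedback, lam: float = 0.5) -> tuple[str, float]:
    """Return (next action, λ to use) for one round of feedback."""
    if feedback is Feedback.NOT_EFFECTIVE:
        # Back to Stage 2; compress less so utility comes first.
        return "stage 2: APE utility enhancement", 0.2
    if feedback is Feedback.TOO_LONG:
        # Back to Stage 3; compress more.
        return "stage 3: ACON compression", 0.8
    if feedback is Feedback.MISSING_CONTENT:
        return "restore missing critical signals", lam
    return "adjust per the user's request", lam
```

Classifying free-text feedback into one of the four types is left out here; that part depends on the model's judgment.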
---
## Strict Rules (Guarantee Effectiveness)
- ✅ Every step has validation, ensure no damage to original functionality
- ✅ Critical signals are NEVER deleted, 100% preserved
- ✅ Strictly follow "utility first, then compression" order, never reverse
- ✅ Each iteration re-validates, ensure it gets better with each round
- ✅ For complex tasks, prioritize functional integrity, compression is optional
- ❌ Never auto-trigger, only work when user explicitly requests
- ❌ No comparisons or analysis, only output optimized results
- ❌ No extra explanations unless explicitly requested
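Iterative prompt optimizer built for complex tasks. Strictly implements the ACON paper's two-stage iterative optimization plus APE automatic prompt engineering. Triggers only when the user explicitly requests it, actively collects feedback after optimizing, and supports multi-round iteration until the user is satisfied.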
---
name: prompt优化器
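description: Iterative prompt optimizer built for complex tasks. Strictly implements the ACON paper's two-stage iterative optimization plus APE automatic prompt engineering. Triggers only when the user explicitly requests it, actively collects feedback after optimizing, and supports multi-round iteration until the user is satisfied.
usage: Trigger only when the user explicitly says "优化提示词" (optimize prompt), "改进prompt" (improve prompt), or "精炼指令" (refine instruction); never auto-trigger.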
license: MIT
tags:
- prompt-optimization
- acon
- ape
- 迭代优化
- 复杂任务
---
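### Stage 1: Input Parsing & Critical Signal Extraction (ACON paper §3)
**Input**: The user's original prompt
**Operations**:
1. Intent Locking: Extract the core task goal T and ensure no later optimization deviates from T
2. Critical Signal Extraction (signals the ACON paper defines as mandatory to preserve):
- ✅ Role Definition R: Expert role specified by the user
- ✅ Task Goal T: What the core task is
- ✅ Constraints C: Boundary rules, prohibitions
- ✅ Output Format F: Output structure and format requested by the user
- ✅ Variable Placeholders V: All `{{变量名}}` (`{{variable_name}}`) placeholders
- ✅ Examples E: Few-shot examples provided by the user
- ✅ Tool Rules U: When and how to call tools
- ✅ Success Criteria S: What counts as a good output
3. Baseline Measurement: Record the original prompt's token length L₀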
---
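### Stage 2: APE Utility Enhancement
**Goal**: Turn a vague prompt into expert-level instructions and improve utility
**Operations (strict order)**:
1. Candidate Generation: Based on the original prompt, generate 5 candidate instructions in different styles
- Candidate 1: Structured instruction version
- Candidate 2: Expert role version
- Candidate 3: Constraint reinforcement version
- Candidate 4: Format clarification version
- Candidate 5: Logic optimization version
2. Candidate Scoring (the APE paper's scoring mechanism):
- Clarity: Are the instructions clear and unambiguous (0-10)
- Completeness: Does it include all critical signals (0-10)
- Effectiveness: Can it guide the model to produce high-quality output (0-10)
3. Optimal Selection: Choose the candidate with the highest total score as the utility-enhanced version P₁
4. Validation: Check that P₁ preserves 100% of the critical signals and does not change the original intent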
---
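### Stage 3: ACON Compression Optimization (ACON paper §3.3, two-stage optimization)
**Goal**: Compress token length without breaking functionality
**Operations (strict order: utility first, then compression)**:
1. Redundancy Analysis: Analyze redundant content in P₁
- Duplicate instructions and requirements
- Filler, boilerplate, ineffective wording
- Verbose statements that can be simplified
2. Selective Compression:
- Only remove redundancy; never delete critical signals
- Merge duplicate content
- Rewrite in more concise language while keeping the semantics unchanged
3. Functional Equivalence Validation:
- Ensure the compressed P₂ is functionally identical to P₁
- Ensure all critical signals are fully preserved
- Ensure the original task goal is unchanged
4. Length Control: Adjust the degree of compression according to the current λ parameter (performance-cost tradeoff)
- Default λ=0.5: Balanced mode
- If the user says it is "太长了" (too long), automatically raise λ to 0.8 for further compression
- If the user says "效果不好" (not effective), automatically lower λ to 0.2 to reduce compression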
---
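### Stage 4: Output & Feedback Collection
**Operations**:
1. Output the optimized prompt P₂, wrapped in a code block for easy copying
2. Actively ask for user feedback: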
```
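Optimization complete. Does this version meet your needs?
If anything is unsatisfactory, please tell me, for example:
- Not effective enough?
- Still too long?
- Some constraints/formats not preserved?
- Other issues?
I will keep iterating based on your feedback.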
```
---
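### Stage 5: Iterative Optimization (the ACON paper's R-round iteration mechanism)
**When the user gives feedback, do the following**:
1. Feedback Parsing: Identify the type of feedback
- Type A: Not effective enough → Go back to Stage 2, re-run APE utility enhancement, add constraints
- Type B: Too long → Go back to Stage 3, re-run ACON compression, increase λ
- Type C: Some content not preserved → Check the critical signals and restore the missing parts
- Type D: Other requirements → Adjust according to the user's specific request
2. Re-run Optimization: Adjust parameters based on the feedback and run the two-stage optimization again
3. Validation: Ensure the new version preserves the core task goal and resolves the issue the user raised
4. Output the new optimized version and ask for feedback again
5. Repeat until the user indicates satisfaction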
---
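## Strict Rules (Guarantee Effectiveness)
- ✅ Every step is validated to ensure the original functionality is not damaged
- ✅ Critical signals are never deleted; 100% preserved
- ✅ Strictly follow the "utility first, then compression" order; never reverse it
- ✅ Every iteration round re-validates, so each round improves on the last
- ✅ For complex tasks, prioritize functional integrity; compression is optional
- ❌ Never auto-trigger; work only when the user explicitly requests it
- ❌ No comparisons or analysis; output only the optimized result
- ❌ No extra explanations unless the user asks for them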
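Acting as a senior cloud computing product manager, read the official Alibaba Cloud and Huawei Cloud documentation in depth and produce a differentiated competitive analysis grounded in real evidence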
---
name: aliyun-huaweicloud-fullstack-product-competitive-analysis
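description: Acting as a senior cloud computing product manager, read the official Alibaba Cloud and Huawei Cloud documentation in depth and produce a differentiated competitive analysis grounded in real evidence
author: Senior cloud computing product manager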
version: "4.2.0"
license: Apache-2.0
allowed-tools: web_fetch
---
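# Alibaba Cloud & Huawei Cloud Product Competitive Analysis
## Role
You are a senior B2B cloud computing product manager with more than 10 years of experience, skilled at distilling real product differences from official documentation rather than speaking in generalities. Your analysis directly serves product planning, technology selection, and market decisions.
## Quick Start
### Option 1: Use the scraper script (recommended, fully automated)
The script lives at `scripts/cloud_doc_scraper.py`. It parses the documentation table of contents, fetches the core pages, and outputs markdown.
> **Dependencies must be installed manually**: `pip install playwright httpx beautifulsoup4 && playwright install chromium`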
```bash
python scripts/cloud_doc_scraper.py --product ecs
python scripts/cloud_doc_scraper.py --product oss --output oss_docs.md
python scripts/cloud_doc_scraper.py --product rds --max-pages 15
python scripts/cloud_doc_scraper.py --product ecs --stealth # optional: enable stealth mode to handle JS-rendering compatibility issues
python scripts/cloud_doc_scraper.py --list # list all supported products
```
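**Supported products**: ecs, oss, rds, redis, ack, fc, slb, maxcompute, pai, bailian, cdn, nas, flink, elasticsearch, dws
**Output**: a markdown file containing the original official documentation text from Alibaba Cloud and Huawei Cloud plus their changelogs, ready to paste into an AI for competitive analysis.
**How it works**:
1. Checks whether dependencies are ready; if any are missing, prints the install command and exits
2. Opens the documentation home page with Playwright and parses the left-hand table-of-contents navigation
3. Filters core pages by priority (product introduction > specifications > billing > use cases)
4. Fetches the pages concurrently and outputs denoised plain text
5. Supports HTTP fallback (automatically uses httpx+BS4 when Playwright fetching fails)
6. Supports deep_links (uses preconfigured page URLs when TOC parsing fails)
7. Stealth mode (`--stealth`) is off by default and handles JS-rendering compatibility issues only when explicitly enabled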
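The priority filter in step 3 might look roughly like the sketch below. It is not the scraper's actual code: `PRIORITY_KEYWORDS`, `page_rank`, and `rank_pages` are hypothetical names, and the keyword strings are assumptions chosen to mirror the stated order (product introduction > specifications > billing > use cases) as it would appear in Chinese TOC titles.

```python
# Assumed title keywords, highest priority first.
PRIORITY_KEYWORDS = ["产品介绍", "规格", "计费", "应用场景"]


def page_rank(title: str) -> int:
    """Lower is better; titles matching no keyword rank last."""
    for rank, keyword in enumerate(PRIORITY_KEYWORDS):
        if keyword in title:
            return rank
    return len(PRIORITY_KEYWORDS)


def rank_pages(toc: list[tuple[str, str]], max_pages: int = 10) -> list[tuple[str, str]]:
    """Keep TOC entries (title, url) that match a keyword, best first, capped at max_pages."""
    core = [entry for entry in toc if page_rank(entry[0]) < len(PRIORITY_KEYWORDS)]
    core.sort(key=lambda entry: page_rank(entry[0]))  # stable: TOC order kept within a rank
    return core[:max_pages]
```

The `max_pages` cap plays the same role as the `--max-pages` flag shown in the commands above, though how the real script applies that flag is not documented here.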
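### Option 2: Fetch page by page manually with web_fetch
Refer to the "Official Documentation Entry Points" tables below and use the `web_fetch` tool to fetch the documentation page by page.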
---
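## Official Documentation Entry Points
### About the documentation URLs
**Huawei Cloud docs**:
- Legacy docs (productdesc-*): already 404 for some products (OBS, RDS, DCS, CDN, SFS, etc.); the script ships with the correct deep_links
- New docs (SPA index.html): require JS rendering, so web_fetch only gets an empty shell; using the script is recommended
- Changelog: `https://support.huaweicloud.cn/wtsnew-{product}/index.html`
**Alibaba Cloud docs**:
- Main entry: `https://help.aliyun.com/zh/{product}`
- Changelog: `https://help.aliyun.com/zh/{product}/product-overview/release-notes`
- URLs may change; the script discovers links automatically from the TOC navigation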
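For manual fetching, the URL patterns above can be expanded with a small helper. This is a sketch: `URL_PATTERNS` and `default_urls` are hypothetical names, and the Huawei Cloud docs pattern (`/{product}/index.html`) is inferred from the table rows in this document rather than stated as a rule. Some products deviate from the default changelog pattern (the MaxCompute, Flink, PAI, and Bailian rows, for example), so the tables remain the source of truth.

```python
URL_PATTERNS = {
    "aliyun": {
        "docs": "https://help.aliyun.com/zh/{product}",
        "changelog": "https://help.aliyun.com/zh/{product}/product-overview/release-notes",
    },
    "huawei": {
        "docs": "https://support.huaweicloud.cn/{product}/index.html",  # inferred from the table
        "changelog": "https://support.huaweicloud.cn/wtsnew-{product}/index.html",
    },
}


def default_urls(vendor: str, product: str) -> dict[str, str]:
    """Expand the default URL patterns for one vendor/product pair."""
    patterns = URL_PATTERNS[vendor]  # KeyError for an unknown vendor
    return {kind: template.format(product=product) for kind, template in patterns.items()}
```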
### Alibaba Cloud
| Category | Product | Docs | Changelog |
|------|------|------|----------|
| Compute | Elastic Compute Service (ECS) | https://help.aliyun.com/zh/ecs | https://help.aliyun.com/zh/ecs/product-overview/release-notes |
| Compute | Function Compute (FC) | https://help.aliyun.com/zh/fc | https://help.aliyun.com/zh/fc/product-overview/release-notes |
| Storage | Object Storage Service (OSS) | https://help.aliyun.com/zh/oss | https://help.aliyun.com/zh/oss/product-overview/release-notes |
| Storage | File Storage NAS | https://help.aliyun.com/zh/nas | https://help.aliyun.com/zh/nas/product-overview/release-notes |
| Database | ApsaraDB RDS | https://help.aliyun.com/zh/rds | https://help.aliyun.com/zh/rds/product-overview/release-notes |
| Database | ApsaraDB for Redis | https://help.aliyun.com/zh/redis | https://help.aliyun.com/zh/redis/product-overview/release-notes |
| Database | AnalyticDB for PostgreSQL | https://help.aliyun.com/zh/analyticdb-for-postgresql | https://help.aliyun.com/zh/analyticdb-for-postgresql/product-overview/release-notes |
| Containers | Container Service for Kubernetes (ACK) | https://help.aliyun.com/zh/ack | https://help.aliyun.com/zh/ack/product-overview/release-notes |
| Networking | Server Load Balancer (SLB) | https://help.aliyun.com/zh/slb | https://help.aliyun.com/zh/slb/product-overview/release-notes |
| Networking | CDN | https://help.aliyun.com/zh/cdn | https://help.aliyun.com/zh/cdn/product-overview/release-notes |
| Big data | MaxCompute | https://help.aliyun.com/zh/maxcompute | https://help.aliyun.com/zh/maxcompute/product-overview/Release-notes |
| Big data | Realtime Compute for Apache Flink | https://help.aliyun.com/zh/flink | https://help.aliyun.com/zh/flink/product-overview/release-note |
| Big data | Elasticsearch | https://help.aliyun.com/zh/elasticsearch | https://help.aliyun.com/zh/elasticsearch/product-overview/release-notes |
| AI | Platform for AI (PAI) | https://help.aliyun.com/zh/pai | https://help.aliyun.com/zh/pai/user-guide/api-aiworkspace-2021-02-04-changeset |
| AI | Bailian (百炼) platform | https://help.aliyun.com/zh/bailian | https://help.aliyun.com/zh/bailian/release-notes |
### Huawei Cloud
| Category | Product | Docs | Changelog |
|------|------|------|----------|
| Compute | Elastic Cloud Server (ECS) | https://support.huaweicloud.cn/ecs/index.html | https://support.huaweicloud.cn/wtsnew-ecs/index.html |
| Compute | FunctionGraph | https://support.huaweicloud.cn/functiongraph/index.html | https://support.huaweicloud.cn/wtsnew-functiongraph/index.html |
| Storage | Object Storage Service (OBS) | https://support.huaweicloud.cn/obs/index.html | https://support.huaweicloud.cn/wtsnew-obs/index.html |
| Storage | Scalable File Service (SFS) | https://support.huaweicloud.cn/sfs/index.html | https://support.huaweicloud.cn/wtsnew-sfs/index.html |
| Database | Relational Database Service (RDS) | https://support.huaweicloud.cn/rds/index.html | https://support.huaweicloud.cn/wtsnew-rds/index.html |
| Database | Distributed Cache Service (DCS) | https://support.huaweicloud.cn/dcs/index.html | https://support.huaweicloud.cn/wtsnew-dcs/index.html |
| Database | GaussDB(DWS) data warehouse | https://support.huaweicloud.cn/dws/index.html | https://support.huaweicloud.cn/wtsnew-dws/index.html |
| Containers | Cloud Container Engine (CCE) | https://support.huaweicloud.cn/cce/index.html | https://support.huaweicloud.cn/wtsnew-cce/index.html |
| Networking | Elastic Load Balance (ELB) | https://support.huaweicloud.cn/elb/index.html | https://support.huaweicloud.cn/wtsnew-elb/index.html |
| Networking | CDN | https://support.huaweicloud.cn/cdn/index.html | https://support.huaweicloud.cn/wtsnew-cdn/index.html |
| Big data | MapReduce Service (MRS) | https://support.huaweicloud.cn/mrs/index.html | https://support.huaweicloud.cn/wtsnew-mrs/index.html |
| Big data | Data Lake Insight (DLI) | https://support.huaweicloud.cn/dli/index.html | https://support.huaweicloud.cn/wtsnew-dli/index.html |
| Search | Cloud Search Service (CSS) | https://support.huaweicloud.cn/css/index.html | https://support.huaweicloud.cn/wtsnew-css/index.html |
| AI | ModelArts AI development platform | https://support.huaweicloud.cn/modelarts/index.html | https://support.huaweicloud.cn/wtsnew-modelarts/index.html |
| AI | Pangu large model platform | https://support.huaweicloud.cn/pangu/index.html | https://support.huaweicloud.cn/wtsnew-pangu/index.html |
---
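## Execution
Once the user provides the target product, perform the following steps:
**Step 1: Pin down the matching products**
Look up both vendors' matching products in the tables above. If the preset list has no corresponding product, tell the user explicitly and offer any known alternative entry point.
**Step 2: Run the scraper script**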
```bash
python scripts/cloud_doc_scraper.py --product {product_key} --output {product_key}_docs.md
```
The script automatically handles: dependency check → TOC parsing → core page filtering → concurrent fetching → markdown output.
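If the script is unavailable, fall back to fetching page by page manually with web_fetch (see the steps below).
**Step 3: Read the docs in depth**
Fetch documentation content in the following priority order:
**Documentation fetch priority**:
1. **Product overview/introduction pages** (understand product positioning and core value)
2. **Component version tables** (mandatory for big-data products, e.g. EMR component versions, MRS component versions)
3. **Core feature/function description pages** (understand capability boundaries)
4. **Specification/performance metric pages** (understand performance ceilings)
5. **Kernel enhancement pages** (understand in-house capabilities)
6. **Best practice/use case pages** (understand applicable scenarios)
7. **Changelog** (understand the iteration direction over the past 12 months)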
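**Step 4: Judge differences in product form**
Analyze whether the two products belong to the same product form:
- If the forms are similar (e.g. both are managed databases): compare features, performance, and pricing directly
- If the forms differ substantially (e.g. one is a managed service and the other a PaaS platform):
- First explain the difference in form and each product's positioning
- Then compare the dimensions that can be compared (e.g. core capabilities, applicable scenarios)
- State clearly which dimensions cannot be compared directly
**Step 5: Find the real differences**
Differences must come from the documentation, not from impressions. Focus on:
- Numeric gaps in key metrics (performance ceilings, specification ranges, SLA figures, etc.)
- Core capabilities that one side has and the other lacks
- Places where the same feature has a clearly different implementation path or maturity
- Divergence in recent iteration direction, which reflects each vendor's strategic intent
Skip dimensions with no difference or no clear difference; do not pad the word count.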
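**Step 6: Write the analysis**
The format is free, as long as it conveys the judgment clearly. The analysis must answer three questions:
1. Where the two products truly differ, and what each one's strengths and weaknesses are
2. Which direction each has been pushing in recently, and what the strategic intent is
3. Which customers and scenarios should choose which product
Every conclusion must be backed by the documentation. Cite sources naturally in the text; a separate references section is not needed.
**Step 7: Save and present the results**
1. Save the full analysis report as a markdown file (e.g. `{product_key}_competitive_analysis.md`) in the workspace
2. **The core content of the report must be shown to the user directly.** Do not just say "saved to file"; output the key conclusions, comparison tables, and selection advice in the conversation so the user sees the results at a glance
3. Append the file path at the end so the user can reference it later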
---
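## Comparison Dimensions Reference
Depending on the product type, prioritize the following dimensions:
**Basic dimensions (required)**:
- Open-source component version support (e.g. Elasticsearch 7.10/8.x, OpenSearch 2.x)
- Kernel enhancements (in-house kernel, performance optimization, stability hardening)
- Completeness of core features (vector search, storage-compute separation, intelligent operations)
- Specification and performance ceilings (per-node storage, shard count, QPS)
**Extended dimensions (choose by product type)**:
- Search: vector search algorithms, quantization methods, vector dimensions, AI search capabilities
- Database: high-availability architecture, backup and restore, disaster recovery
- Big data: compute engines, storage formats, data source integration
- AI: model management, inference capabilities, RAG support
**Iteration dimensions (required)**:
- Feature release history over the past 12 months
- Strategic direction (inferred from the iteration focus)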
---
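## Constraints
- All information must come from official documentation; third-party information and impressions from training data are prohibited
- If fetching one side's documentation fails, state the failure reason explicitly; do not silently fill in or speculate
- If the preset list has no corresponding product, tell the user and then either provide a known entry point or stop
- If the product forms differ too much, explain the difference before analyzing; do not force a comparison of unrelated features
- Comparisons must be fact-based; avoid subjective judgments and let the data speak
- Key information such as version numbers, performance data, and specifications must cite the source document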
---
## Script Technical Details
### Dependencies (install manually)
- Python 3.10+
- playwright + chromium (browser automation)
- httpx + beautifulsoup4 (HTTP fallback)
Install commands:
```bash
pip install playwright httpx beautifulsoup4
playwright install chromium
```
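The "check dependencies, print the install command, exit" behavior described for the script could be sketched as follows. The names `REQUIRED`, `missing_packages`, and `install_hint` are hypothetical, and this is not the script's own code; it only relies on the standard-library `importlib.util.find_spec` and on the fact that the `beautifulsoup4` package is imported as `bs4`.

```python
import importlib.util

# import name -> pip package name
REQUIRED = {"playwright": "playwright", "httpx": "httpx", "bs4": "beautifulsoup4"}


def missing_packages(required: dict[str, str] = REQUIRED) -> list[str]:
    """Return pip names of required packages that cannot be imported."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]


def install_hint(missing: list[str]) -> str:
    """Build the manual install command for the missing packages ('' if none)."""
    if not missing:
        return ""
    return "pip install " + " ".join(missing)


if __name__ == "__main__":
    hint = install_hint(missing_packages())
    if hint:
        # Nothing is installed automatically; the user runs the command.
        raise SystemExit(f"Missing dependencies. Run: {hint}")
```

This check does not cover the Chromium browser itself, which still needs `playwright install chromium`.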
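### Known limitations
- For some Huawei Cloud products (OBS, RDS, DCS, CDN, SFS) the old productdesc pages now return 404; the script falls back to deep_links
- Some pages on the Huawei Cloud documentation site need JS rendering, so web_fetch only gets an empty shell; use the script instead
- Some Huawei Cloud changelog pages (wtsnew-*) return 404 on the .cn domain; the script falls back to .com automatically
- Some Aliyun subpages may return empty content; the script then tries the HTTP fallback
- Individual deep_links URLs may break as the documentation changes and need periodic maintenance
- Stealth mode (`--stealth`) is off by default and takes effect only when the user enables it explicitly; it exists to work around JS-rendering compatibility issues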
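
The .cn to .com fallback mentioned above amounts to a host substitution. A minimal sketch, assuming the two hosts mirror each other's paths; `huawei_url_candidates` is a hypothetical helper (the script itself performs the substitution in `huawei_cn_to_com`).

```python
def huawei_url_candidates(url: str) -> list[str]:
    """Return the URLs to try in order: the original, then its .com mirror."""
    candidates = [url]
    mirror = url.replace("support.huaweicloud.cn", "support.huaweicloud.com")
    if mirror != url:
        candidates.append(mirror)
    return candidates

urls = huawei_url_candidates("https://support.huaweicloud.cn/wtsnew-ecs/index.html")
```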
FILE:scripts/cloud_doc_scraper.py
"""
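Aliyun & Huawei Cloud product documentation deep scraper v4.1
Flow: dependency check → parse the left-hand TOC → select core pages by priority → fetch concurrently → emit markdown for AI analysis
Changes in v4.1:
- Dependency check replaces auto-install: missing packages are reported for manual installation; no packages or browsers are downloaded silently
- Stealth mode is optional (--stealth): off by default, active only when the user enables it explicitly
- HTTP fallback (httpx+BS4) is kept as a complement to Playwright
Changes in v4 (compared with ClawHub v1.0.3):
- Windows GBK encoding fix: stdout/stderr re-encoded as UTF-8
- Content pages wait for domcontentloaded: avoids networkidle timeouts on the Huawei Cloud SPA
- Retry logic: network errors and timeouts are retried automatically (at most 2 times)
- deep_links config: preconfigured deep links are used when TOC parsing fails or yields too few links
- Per-side content selectors: Aliyun and Huawei Cloud use different body selectors to reduce noise
- Content denoising: navigation, footer, and similar noise text is filtered out
- HTTP fallback: when Playwright is blocked, pages are fetched with httpx+BS4
- Security-check detection: interception pages are recognized and trigger the fallback
- 404 detection: Huawei Cloud 404 pages are recognized
- Huawei Cloud domain strategy: .cn falls back to .com automatically
Dependencies (manual installation required):
pip install playwright httpx beautifulsoup4
playwright install chromium
Usage:
python cloud_doc_scraper.py --product ecs
python cloud_doc_scraper.py --product oss --output oss_docs.md
python cloud_doc_scraper.py --product rds --max-pages 15
python cloud_doc_scraper.py --product ecs --stealth # enable stealth mode (use with caution)
python cloud_doc_scraper.py --list # list all supported products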
"""
# ─── Windows console UTF-8 fix (runs before the third-party imports) ──────────
import subprocess, sys, os, importlib.util
if sys.platform == "win32":
os.environ.setdefault("PYTHONIOENCODING", "utf-8")
try:
sys.stdout.reconfigure(encoding="utf-8", errors="replace")
sys.stderr.reconfigure(encoding="utf-8", errors="replace")
except Exception:
pass
# ─── Legacy auto-install helpers (not called since v4.1; see _check_deps) ─────
def _ensure_dep(package: str, pip_name: str | None = None):
if importlib.util.find_spec(package) is None:
print(f"[INSTALL] {package} ...")
r = subprocess.run([sys.executable, "-m", "pip", "install", "--quiet", pip_name or package],
capture_output=True, text=True)
if r.returncode != 0:
print(f"[FAIL] {package}: {r.stderr}"); sys.exit(1)
print(f"[OK] {package}")
else:
print(f"[OK] {package} (cached)")
def _ensure_playwright():
_ensure_dep("playwright")
try:
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
if not os.path.exists(p.chromium.executable_path):
raise FileNotFoundError
print("[OK] Chromium (cached)")
except Exception:
print("[INSTALL] Chromium (first download ~150MB) ...")
r = subprocess.run([sys.executable, "-m", "playwright", "install", "chromium"],
capture_output=False, text=True)
if r.returncode != 0:
print("[FAIL] Chromium install"); sys.exit(1)
print("[OK] Chromium")
# ─── Dependency check (no auto-install; prompts for manual installation) ──────
def _check_dep(package: str, pip_name: str | None = None):
if importlib.util.find_spec(package) is None:
print(f"[MISSING] {package} — please run: pip install {pip_name or package}")
return False
return True
def _check_deps():
ok = True
if not _check_dep("playwright"):
ok = False
else:
try:
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
if not os.path.exists(p.chromium.executable_path):
print("[MISSING] Chromium — please run: playwright install chromium")
ok = False
except Exception:
print("[MISSING] Chromium — please run: playwright install chromium")
ok = False
if not _check_dep("httpx"):
ok = False
if not _check_dep("bs4", "beautifulsoup4"):
ok = False
if not ok:
print("\n[ERROR] Missing dependencies. Install with:\n"
" pip install playwright httpx beautifulsoup4\n"
" playwright install chromium\n"
"Or use a venv: python -m venv .venv && .venv/bin/pip install playwright httpx beautifulsoup4 && .venv/bin/playwright install chromium")
sys.exit(1)
print("[OK] All dependencies satisfied")
_check_deps()
# ─── Main imports ─────────────────────────────────────────────────────────────
import asyncio, argparse
from pathlib import Path
from datetime import datetime
from urllib.parse import urljoin, urlparse
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeout
try:
import httpx
from bs4 import BeautifulSoup
HAS_HTTPX = True
except ImportError:
HAS_HTTPX = False
# ─── Product configuration ────────────────────────────────────────────────────
PRODUCTS = {
"ecs": {
"name": "云服务器 ECS / ECS",
"aliyun": {"doc": "https://help.aliyun.com/zh/ecs", "changelog": "https://help.aliyun.com/zh/ecs/product-overview/release-notes"},
"huawei": {"doc": "https://support.huaweicloud.cn/ecs/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-ecs/index.html",
"deep_links": [
{"text": "什么是ECS", "url": "https://support.huaweicloud.com/productdesc-ecs/zh-cn_topic_0013771112.html"},
{"text": "产品优势", "url": "https://support.huaweicloud.com/productdesc-ecs/ecs_01_0002.html"},
{"text": "应用场景", "url": "https://support.huaweicloud.com/productdesc-ecs/ecs_01_0003.html"},
{"text": "实例规格", "url": "https://support.huaweicloud.com/productdesc-ecs/ecs_01_0014.html"},
{"text": "产品功能", "url": "https://support.huaweicloud.com/productdesc-ecs/ecs_01_0005.html"},
]},
},
"oss": {
"name": "对象存储 OSS / OBS",
"aliyun": {"doc": "https://help.aliyun.com/zh/oss", "changelog": "https://help.aliyun.com/zh/oss/product-overview/release-notes"},
"huawei": {"doc": "https://support.huaweicloud.cn/obs/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-obs/index.html",
"deep_links": [
{"text": "什么是OBS", "url": "https://support.huaweicloud.com/productdesc-obs/zh-cn_topic_0045829060.html"},
{"text": "产品优势", "url": "https://support.huaweicloud.com/productdesc-obs/obs_03_0201.html"},
{"text": "应用场景", "url": "https://support.huaweicloud.com/productdesc-obs/obs_03_0202.html"},
{"text": "产品功能", "url": "https://support.huaweicloud.com/productdesc-obs/obs_03_0151.html"},
{"text": "约束与限制", "url": "https://support.huaweicloud.com/productdesc-obs/obs_03_0360.html"},
]},
},
"rds": {
"name": "云数据库 RDS",
"aliyun": {"doc": "https://help.aliyun.com/zh/rds", "changelog": "https://help.aliyun.com/zh/rds/product-overview/release-notes"},
"huawei": {"doc": "https://support.huaweicloud.cn/rds/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-rds/index.html",
"deep_links": [
{"text": "产品介绍", "url": "https://support.huaweicloud.com/productdesc-rds/zh-cn_topic_dashboard.html"},
{"text": "计费说明", "url": "https://support.huaweicloud.com/price-rds/rds_00_0006.html"},
{"text": "快速入门", "url": "https://support.huaweicloud.com/qs-rds/rds_02_0148.html"},
{"text": "性能白皮书", "url": "https://support.huaweicloud.com/pwp-rds/pwp_0000.html"},
{"text": "最佳实践", "url": "https://support.huaweicloud.com/bestpractice-rds/practice_0000.html"},
]},
},
"redis": {
"name": "云数据库 Redis / DCS",
"aliyun": {"doc": "https://help.aliyun.com/zh/redis", "changelog": "https://help.aliyun.com/zh/redis/product-overview/release-notes"},
"huawei": {"doc": "https://support.huaweicloud.cn/dcs/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-dcs/index.html",
"deep_links": [
{"text": "什么是DCS", "url": "https://support.huaweicloud.com/productdesc-dcs/dcs-pd-200713001.html"},
{"text": "典型应用场景", "url": "https://support.huaweicloud.com/productdesc-dcs/dcs-pd-200713002.html"},
{"text": "产品功能", "url": "https://support.huaweicloud.com/productdesc-dcs/dcs_01_0006.html"},
{"text": "DCS产品选型参考", "url": "https://support.huaweicloud.com/productdesc-dcs/dcs_01_0002.html"},
{"text": "Redis实例类型差异", "url": "https://support.huaweicloud.com/productdesc-dcs/dcs-pd-191224001.html"},
]},
},
"ack": {
"name": "容器服务 ACK / CCE",
"aliyun": {"doc": "https://help.aliyun.com/zh/ack", "changelog": "https://help.aliyun.com/zh/ack/product-overview/release-notes"},
"huawei": {"doc": "https://support.huaweicloud.cn/cce/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-cce/index.html",
"deep_links": [
{"text": "什么是CCE", "url": "https://support.huaweicloud.com/productdesc-cce/cce_productdesc_0001.html"},
{"text": "产品功能", "url": "https://support.huaweicloud.com/productdesc-cce/cce_productdesc_0002.html"},
{"text": "版本说明", "url": "https://support.huaweicloud.com/productdesc-cce/cce_productdesc_0003.html"},
{"text": "应用场景", "url": "https://support.huaweicloud.com/productdesc-cce/cce_productdesc_0005.html"},
]},
},
"fc": {
"name": "函数计算 FC / FunctionGraph",
"aliyun": {"doc": "https://help.aliyun.com/zh/fc", "changelog": "https://help.aliyun.com/zh/fc/product-overview/release-notes"},
"huawei": {"doc": "https://support.huaweicloud.cn/functiongraph/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-functiongraph/index.html",
"deep_links": [
{"text": "什么是FunctionGraph", "url": "https://support.huaweicloud.com/productdesc-functiongraph/functiongraph_01_0100.html"},
{"text": "功能特性", "url": "https://support.huaweicloud.com/productdesc-functiongraph/functiongraph_01_0200.html"},
{"text": "应用场景", "url": "https://support.huaweicloud.com/productdesc-functiongraph/functiongraph_01_0300.html"},
]},
},
"slb": {
"name": "负载均衡 SLB / ELB",
"aliyun": {"doc": "https://help.aliyun.com/zh/slb", "changelog": "https://help.aliyun.com/zh/slb/product-overview/release-notes"},
"huawei": {"doc": "https://support.huaweicloud.cn/elb/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-elb/index.html",
"deep_links": [
{"text": "什么是ELB", "url": "https://support.huaweicloud.com/productdesc-elb/elb_pro_0001.html"},
{"text": "功能概述", "url": "https://support.huaweicloud.com/productdesc-elb/elb_pro_0003.html"},
{"text": "应用场景", "url": "https://support.huaweicloud.com/productdesc-elb/elb_pro_0004.html"},
{"text": "规格", "url": "https://support.huaweicloud.com/productdesc-elb/elb_pro_0010.html"},
]},
},
"maxcompute": {
"name": "大数据 MaxCompute / MRS",
"aliyun": {"doc": "https://help.aliyun.com/zh/maxcompute", "changelog": "https://help.aliyun.com/zh/maxcompute/product-overview/Release-notes"},
"huawei": {"doc": "https://support.huaweicloud.cn/mrs/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-mrs/index.html",
"deep_links": [
{"text": "什么是MRS", "url": "https://support.huaweicloud.com/productdesc-mrs/mrs_08_0001.html"},
{"text": "组件版本", "url": "https://support.huaweicloud.com/productdesc-mrs/mrs_08_0005.html"},
{"text": "产品功能", "url": "https://support.huaweicloud.com/productdesc-mrs/mrs_08_0002.html"},
]},
},
"pai": {
"name": "AI 平台 PAI / ModelArts",
"aliyun": {"doc": "https://help.aliyun.com/zh/pai", "changelog": "https://help.aliyun.com/zh/pai/user-guide/api-aiworkspace-2021-02-04-changeset"},
"huawei": {"doc": "https://support.huaweicloud.cn/modelarts/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-modelarts/index.html",
"deep_links": [
{"text": "什么是ModelArts", "url": "https://support.huaweicloud.com/productdesc-modelarts/modelarts_product_0001.html"},
{"text": "功能特性", "url": "https://support.huaweicloud.com/productdesc-modelarts/modelarts_product_0002.html"},
{"text": "应用场景", "url": "https://support.huaweicloud.com/productdesc-modelarts/modelarts_product_0003.html"},
]},
},
"bailian": {
"name": "大模型平台 百炼 / 盘古",
"aliyun": {"doc": "https://help.aliyun.com/zh/bailian", "changelog": "https://help.aliyun.com/zh/bailian/release-notes"},
"huawei": {"doc": "https://support.huaweicloud.cn/pangu/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-pangu/index.html"},
},
"cdn": {
"name": "CDN",
"aliyun": {"doc": "https://help.aliyun.com/zh/cdn", "changelog": "https://help.aliyun.com/zh/cdn/product-overview/release-notes"},
"huawei": {"doc": "https://support.huaweicloud.cn/cdn/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-cdn/index.html",
"deep_links": [
{"text": "什么是华为云CDN", "url": "https://support.huaweicloud.com/productdesc-cdn/zh-cn_topic_0064907747.html"},
{"text": "产品优势", "url": "https://support.huaweicloud.com/productdesc-cdn/zh-cn_topic_0064907763.html"},
{"text": "应用场景", "url": "https://support.huaweicloud.com/productdesc-cdn/cdn_01_0067.html"},
{"text": "产品功能", "url": "https://support.huaweicloud.com/productdesc-cdn/cdn_01_0369.html"},
{"text": "约束与限制", "url": "https://support.huaweicloud.com/productdesc-cdn/cdn_01_0068.html"},
]},
},
"nas": {
"name": "文件存储 NAS / SFS",
"aliyun": {"doc": "https://help.aliyun.com/zh/nas", "changelog": "https://help.aliyun.com/zh/nas/product-overview/release-notes"},
"huawei": {"doc": "https://support.huaweicloud.cn/sfs/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-sfs/index.html",
"deep_links": [
{"text": "什么是SFS", "url": "https://support.huaweicloud.com/productdesc-sfs/zh-cn_topic_0034428718.html"},
{"text": "应用场景", "url": "https://support.huaweicloud.com/productdesc-sfs/sfs_01_0004.html"},
{"text": "产品功能", "url": "https://support.huaweicloud.com/productdesc-sfs/sfs_01_0110.html"},
{"text": "约束与限制", "url": "https://support.huaweicloud.com/productdesc-sfs/sfs_01_0011.html"},
{"text": "计费说明", "url": "https://support.huaweicloud.com/productdesc-sfs/sfs_01_0108.html"},
]},
},
"flink": {
"name": "实时计算 Flink",
"aliyun": {"doc": "https://help.aliyun.com/zh/flink", "changelog": "https://help.aliyun.com/zh/flink/product-overview/release-note"},
"huawei": {"doc": "https://support.huaweicloud.cn/dli/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-dli/index.html",
"deep_links": [
{"text": "什么是DLI", "url": "https://support.huaweicloud.com/productdesc-dli/dli_01_0001.html"},
{"text": "功能特性", "url": "https://support.huaweicloud.com/productdesc-dli/dli_01_0002.html"},
{"text": "应用场景", "url": "https://support.huaweicloud.com/productdesc-dli/dli_01_0003.html"},
]},
},
"elasticsearch": {
"name": "搜索 Elasticsearch / CSS",
"aliyun": {"doc": "https://help.aliyun.com/zh/elasticsearch", "changelog": "https://help.aliyun.com/zh/elasticsearch/product-overview/release-notes"},
"huawei": {"doc": "https://support.huaweicloud.cn/css/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-css/index.html",
"deep_links": [
{"text": "什么是云搜索服务", "url": "https://support.huaweicloud.com/productdesc-css/css_04_0001.html"},
{"text": "产品优势", "url": "https://support.huaweicloud.com/productdesc-css/css_04_0010.html"},
{"text": "应用场景", "url": "https://support.huaweicloud.com/productdesc-css/css_04_0002.html"},
{"text": "产品功能", "url": "https://support.huaweicloud.com/productdesc-css/css_04_0003.html"},
{"text": "约束与限制", "url": "https://support.huaweicloud.com/productdesc-css/css_04_0005.html"},
]},
},
"dws": {
"name": "数据仓库 AnalyticDB / GaussDB(DWS)",
"aliyun": {"doc": "https://help.aliyun.com/zh/analyticdb-for-postgresql", "changelog": "https://help.aliyun.com/zh/analyticdb-for-postgresql/product-overview/release-notes",
"deep_links": [
{"text": "什么是AnalyticDB PG", "url": "https://help.aliyun.com/zh/analyticdb-for-postgresql/product-overview/what-is-analyticdb-for-postgresql"},
{"text": "功能特性", "url": "https://help.aliyun.com/zh/analyticdb-for-postgresql/product-overview/features"},
{"text": "产品优势", "url": "https://help.aliyun.com/zh/analyticdb-for-postgresql/product-overview/benefits"},
{"text": "产品系列", "url": "https://help.aliyun.com/zh/analyticdb-for-postgresql/product-overview/editions"},
{"text": "应用场景", "url": "https://help.aliyun.com/zh/analyticdb-for-postgresql/product-overview/scenarios"},
{"text": "约束与限制", "url": "https://help.aliyun.com/zh/analyticdb-for-postgresql/product-overview/limits-and-restrictions"},
]},
"huawei": {"doc": "https://support.huaweicloud.cn/dws/index.html", "changelog": "https://support.huaweicloud.cn/wtsnew-dws/index.html",
"deep_links": [
{"text": "什么是DWS", "url": "https://support.huaweicloud.com/productdesc-dws/dws_01_0002.html"},
{"text": "数据仓库类型", "url": "https://support.huaweicloud.com/productdesc-dws/dws_01_00017.html"},
{"text": "产品功能", "url": "https://support.huaweicloud.com/productdesc-dws/dws_01_0004.html"},
{"text": "应用场景", "url": "https://support.huaweicloud.com/productdesc-dws/dws_01_0006.html"},
{"text": "基本概念", "url": "https://support.huaweicloud.com/productdesc-dws/dws_01_0007.html"},
]},
},
}
# ─── Priority keywords ────────────────────────────────────────────────────────
PRIORITY_KEYWORDS = [
(3, ["产品简介", "产品概述", "什么是", "功能特性", "核心功能", "产品功能", "product overview", "what is", "features"]),
(2, ["规格", "实例规格", "配置", "限制", "约束", "性能", "指标", "参数", "specification", "limits", "performance"]),
(2, ["计费", "定价", "费用", "价格", "版本对比", "版本说明", "pricing", "billing", "edition"]),
(2, ["组件版本", "版本支持", "引擎版本", "内核版本"]),
(1, ["应用场景", "使用场景", "适用场景", "最佳实践", "use case", "scenario", "best practice"]),
(-1, ["常见问题", "faq", "故障排除", "sdk", "api参考", "错误码", "迁移指南"]),
]
def score_link(text: str, href: str) -> int:
combined = (text + " " + href).lower()
return sum(w for w, kws in PRIORITY_KEYWORDS if any(k in combined for k in kws))
# ─── TOC parsing ──────────────────────────────────────────────────────────────
ALIYUN_NAV_SELECTORS = [
".toc-menu a", ".sidebar-menu a", ".helpcenter-menu a",
"nav a", ".left-menu a",
"[class*='nav'] a", "[class*='sidebar'] a", "[class*='toc'] a", "[class*='menu'] a",
]
HUAWEI_NAV_SELECTORS = [
"[class*=nav-item] a", # Huawei Cloud's new SPA
".book-left-menu a", ".toc a", ".sidebar a", ".tree-menu a",
"[class*='catalog'] a", "[class*='tree'] a", "[class*='nav'] a", "[class*='menu'] a",
".left-nav a", ".doc-nav a", ".doc-sidebar a",
"aside a", "[role='navigation'] a",
]
async def parse_toc(page, base_url: str, nav_selectors: list, label: str) -> list:
base_domain = f"{urlparse(base_url).scheme}://{urlparse(base_url).netloc}"
links = []
for selector in nav_selectors:
try:
els = await page.query_selector_all(selector)
if len(els) <= 3:
continue
for el in els:
href = await el.get_attribute("href") or ""
text = (await el.inner_text()).strip()
if not href or not text or href.startswith("#") or href.startswith("javascript"):
continue
# urljoin resolves root-relative and page-relative hrefs against the page URL
full_url = urljoin(base_url, href)
if urlparse(base_domain).netloc not in urlparse(full_url).netloc:
continue
links.append({"url": full_url, "text": text, "score": score_link(text, href)})
if links:
print(f" [{label}] nav selector '{selector}' => {len(links)} links")
return links
except Exception:
continue
print(f" [{label}] [WARN] all selectors missed, TOC parse failed")
return []
# ─── Content extraction ───────────────────────────────────────────────────────
ALIYUN_CONTENT_SELECTORS = [
".help-detail-content", ".article-content", ".doc-body",
"#docContent", ".markdown-body", "article",
]
HUAWEI_CONTENT_SELECTORS = [
".book-desc", ".content-block", "#content", "article", ".markdown-body",
]
FALLBACK_CONTENT_SELECTORS = ["main", ".main-content", ".content"]
NOISE_PATTERNS = [
"为什么选择阿里云", "什么是云计算", "全球基础设施", "法律声明",
"Cookies政策", "廉正举报", "安全举报", "联系我们", "加入我们",
"阿里巴巴集团", "淘宝网", "天猫", "速卖通",
"关注阿里云", "阿里云公众号", "随时随地运维管控",
"售前咨询", "售后在线", "我要建议", "我要投诉",
"登录阿里云", "管理云资源", "状态一览",
"Protected by Tencent", "正在验证连接安全性",
"华为云App", "950808", "售前咨询热线",
"云商店咨询", "备案服务", "增值电信业务",
"黔ICP备", "苏B2-", "贵公网安备",
]
def denoise_text(text: str) -> str:
lines = text.split("\n")
clean = []
for line in lines:
s = line.strip()
if not s:
clean.append(""); continue
if not any(p in s for p in NOISE_PATTERNS):
clean.append(s)
result, prev_empty = [], False
for line in clean:
if not line:
if not prev_empty: result.append("")
prev_empty = True
else:
result.append(line); prev_empty = False
return "\n".join(result)
async def extract_text(page, content_selectors: list | None = None) -> str:
selectors = content_selectors or (ALIYUN_CONTENT_SELECTORS + HUAWEI_CONTENT_SELECTORS + FALLBACK_CONTENT_SELECTORS)
for sel in selectors:
try:
el = await page.query_selector(sel)
if el:
text = await el.inner_text()
if len(text.strip()) > 300:
return denoise_text(text.strip())
except Exception:
continue
body = await page.query_selector("body")
if body:
return denoise_text((await body.inner_text()).strip())
return ""
# ─── Security-check & 404 detection ───────────────────────────────────────────
def is_security_block(text: str) -> bool:
return any(p in text for p in ["正在验证连接安全性", "Protected by Tencent Cloud EdgeOne", "Security Verification"])
def is_404_page(text: str) -> bool:
return "很抱歉,没发现您要的页面" in text
def huawei_cn_to_com(url: str) -> str:
return url.replace("support.huaweicloud.cn", "support.huaweicloud.com")
# ─── HTTP fallback ────────────────────────────────────────────────────────────
async def fetch_via_http(url: str) -> str:
if not HAS_HTTPX:
return ""
try:
async with httpx.AsyncClient(follow_redirects=True, timeout=30,
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}) as client:
resp = await client.get(url)
if resp.status_code != 200:
return ""
soup = BeautifulSoup(resp.text, "html.parser")
for tag in soup.find_all(["script", "style", "nav", "footer", "header", "noscript"]):
tag.decompose()
for sel in [".book-desc", ".content-block", "#content", "article", ".markdown-body", "main"]:
for el in soup.select(sel):
text = el.get_text(separator="\n", strip=True)
if len(text) > 300:
return denoise_text(text)
body = soup.find("body")
if body:
return denoise_text(body.get_text(separator="\n", strip=True))
return ""
except Exception:
return ""
# ─── Page fetch with retries ──────────────────────────────────────────────────
MAX_RETRIES = 2
async def fetch_content(context, url: str, content_selectors: list | None = None,
label: str = "", is_huawei: bool = False) -> str:
for attempt in range(MAX_RETRIES + 1):
page = await context.new_page()
try:
await page.goto(url, wait_until="domcontentloaded", timeout=45000)
for selector in (content_selectors or ALIYUN_CONTENT_SELECTORS + HUAWEI_CONTENT_SELECTORS + FALLBACK_CONTENT_SELECTORS):
try:
await page.wait_for_selector(selector, timeout=3000)
break
except PlaywrightTimeout:
continue
text = await extract_text(page, content_selectors)
if is_security_block(text):
if HAS_HTTPX:
http_text = await fetch_via_http(url)
if len(http_text.strip()) > 200:
print(f" [{label}] [HTTP-FALLBACK] bypassed security check")
return http_text
return f"[Security block, HTTP fallback failed, visit manually: {url}]"
if is_404_page(text):
if is_huawei and "huaweicloud.cn" in url:
com_url = huawei_cn_to_com(url)
print(f" [{label}] [404] .cn 404, trying .com")
com_text = await fetch_via_http(com_url)
if len(com_text.strip()) > 200:
return com_text
return f"[404 page, visit manually: {url}]"
if len(text.strip()) < 100:
if HAS_HTTPX:
http_text = await fetch_via_http(url)
if len(http_text.strip()) > 200:
print(f" [{label}] [HTTP-FALLBACK] Playwright empty, HTTP ok")
return http_text
return f"[Empty page, visit manually: {url}]"
return text
except PlaywrightTimeout:
if attempt < MAX_RETRIES:
print(f" [{label}] [RETRY] timeout, attempt {attempt+1}...")
continue
return f"[Timeout after {MAX_RETRIES} retries, visit manually: {url}]"
except Exception as e:
if attempt < MAX_RETRIES:
err = str(e)
if "ERR_NETWORK" in err or "ERR_CONNECTION" in err or "Timeout" in err:
print(f" [{label}] [RETRY] network error, attempt {attempt+1}...")
continue
return f"[Failed: {e}]"
finally:
await page.close()
return f"[Retries exhausted, visit manually: {url}]"
# ─── Full scrape of one side ──────────────────────────────────────────────────
async def scrape_side(context, label: str, doc_url: str, changelog_url: str,
nav_selectors: list, max_pages: int,
deep_links: list | None = None,
content_selectors: list | None = None,
is_huawei: bool = False) -> dict:
result = {"label": label, "doc_url": doc_url, "changelog_url": changelog_url,
"pages": [], "changelog": "", "toc_total": 0}
# Step 1: open the homepage and parse the TOC
print(f"\n [{label}] Parsing TOC...")
index_page = await context.new_page()
toc = []
try:
await index_page.goto(doc_url, wait_until="networkidle", timeout=60000)
await asyncio.sleep(2) # extra buffer for SPA render
toc = await parse_toc(index_page, doc_url, nav_selectors, label)
result["toc_total"] = len(toc)
if not toc or len(toc) < 3:
if deep_links and len(deep_links) >= 3:
print(f" [{label}] TOC links insufficient({len(toc)}), using {len(deep_links)} deep_links")
toc = [{"url": dl["url"], "text": dl["text"], "score": 3} for dl in deep_links]
result["toc_total"] = len(toc)
else:
print(f" [{label}] Fallback: only fetch homepage")
result["pages"].append(("Homepage", doc_url, await extract_text(index_page, content_selectors)))
elif deep_links and len(deep_links) >= len(toc):
print(f" [{label}] TOC {len(toc)} links, deep_links {len(deep_links)}, prefer deep_links")
toc = [{"url": dl["url"], "text": dl["text"], "score": 3} for dl in deep_links]
result["toc_total"] = len(toc)
except PlaywrightTimeout:
print(f" [{label}] [WARN] Homepage timeout, trying deep_links or HTTP fallback")
if deep_links:
toc = [{"url": dl["url"], "text": dl["text"], "score": 3} for dl in deep_links]
result["toc_total"] = len(toc)
elif HAS_HTTPX:
http_text = await fetch_via_http(doc_url)
if len(http_text.strip()) > 200:
result["pages"].append(("Homepage(HTTP)", doc_url, http_text))
finally:
await index_page.close()
# Step 2: dedupe → filter → sort → take top N
if toc:
seen, unique = set(), []
for lk in toc:
if lk["url"] not in seen:
seen.add(lk["url"])
unique.append(lk)
candidates = sorted([lk for lk in unique if lk["score"] >= 0],
key=lambda x: x["score"], reverse=True)[:max_pages]
print(f" [{label}] Selected {len(candidates)}/{len(unique)} core pages:")
for i, lk in enumerate(candidates):
print(f" {i+1:2d}. [{lk['score']:+d}] {lk['text'][:45]}")
# Step 3: concurrent fetch
sem = asyncio.Semaphore(4)
async def fetch_one(lk):
async with sem:
content = await fetch_content(context, lk["url"], content_selectors, label, is_huawei)
ok = "[OK]" if not content.startswith("[") else "[FAIL]"
print(f" [{label}] {ok} {lk['text'][:40]}")
return (lk["text"], lk["url"], content)
result["pages"] = list(await asyncio.gather(*[fetch_one(lk) for lk in candidates]))
# Step 4: changelog
print(f" [{label}] Fetching changelog...")
result["changelog"] = await fetch_content(context, changelog_url, content_selectors, label, is_huawei)
ok = "[OK]" if not result["changelog"].startswith("[") else "[FAIL]"
print(f" [{label}] {ok} Changelog done")
return result
# ─── Markdown assembly ────────────────────────────────────────────────────────
def build_markdown(product_name: str, aliyun: dict, huawei: dict) -> str:
now = datetime.now().strftime("%Y-%m-%d")
lines = [
f"# 竞品分析原始资料:{product_name}",
f"> 抓取时间:{now}",
f"> 阿里云:目录 {aliyun['toc_total']} 页,本次抓取 {len(aliyun['pages'])} 页",
f"> 华为云:目录 {huawei['toc_total']} 页,本次抓取 {len(huawei['pages'])} 页",
"", "**给 AI 的指令:** 基于以下官方文档原文,分析两款产品的真实差异。",
"重点挖掘:关键指标的数字差距、一方有而另一方没有的能力、相同功能的成熟度差异、近期迭代方向的分歧。",
"无差异或差异不明显的维度直接略过,不要凑字数。", "", "---", "",
]
for side in [aliyun, huawei]:
lines += [f"# {side['label']}", ""]
if not side["pages"]:
lines.append(f"> [WARN] Fetch failed, visit manually: {side['doc_url']}")
else:
for title, url, content in side["pages"]:
lines += [f"## {side['label']} · {title}", f"> Source: {url}", ""]
body = content[:8000]
if len(content) > 8000:
body += "\n\n[...truncated, see source link...]"
lines += [body, "", "---", ""]
lines += [f"## {side['label']} · Changelog", f"> Source: {side['changelog_url']}", ""]
cl = side["changelog"][:6000]
if len(side["changelog"]) > 6000:
cl += "\n\n[...truncated...]"
lines += [cl or f"> [WARN] Fetch failed, visit manually: {side['changelog_url']}", "", "---", ""]
return "\n".join(lines)
# ─── Entry point ──────────────────────────────────────────────────────────────
async def main_async(product_key: str, max_pages: int, output: str, args):
if product_key not in PRODUCTS:
print(f"[FAIL] Unknown product '{product_key}', available: {', '.join(PRODUCTS.keys())}")
sys.exit(1)
cfg = PRODUCTS[product_key]
print(f"\n{'='*60}\n {cfg['name']} (max {max_pages} core pages per side)\n{'='*60}")
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=True,
args=["--disable-blink-features=AutomationControlled", "--no-sandbox"],
)
# Aliyun: plain context (no stealth needed)
ctx_aliyun = await browser.new_context(
viewport={"width": 1920, "height": 1080},
locale="zh-CN",
)
# Huawei Cloud: stealth context (optional, enabled with --stealth, for sites with JS-rendering compatibility issues)
ctx_huawei = await browser.new_context(
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36" if args.stealth else None,
viewport={"width": 1920, "height": 1080},
locale="zh-CN",
)
if args.stealth:
await ctx_huawei.add_init_script("Object.defineProperty(navigator, 'webdriver', { get: () => undefined });")
try:
aliyun = await scrape_side(
ctx_aliyun, "Aliyun",
cfg["aliyun"]["doc"], cfg["aliyun"]["changelog"],
ALIYUN_NAV_SELECTORS, max_pages,
deep_links=cfg["aliyun"].get("deep_links"),
content_selectors=ALIYUN_CONTENT_SELECTORS + FALLBACK_CONTENT_SELECTORS,
is_huawei=False,
)
huawei = await scrape_side(
ctx_huawei, "Huawei",
cfg["huawei"]["doc"], cfg["huawei"]["changelog"],
HUAWEI_NAV_SELECTORS, max_pages,
deep_links=cfg["huawei"].get("deep_links"),
content_selectors=HUAWEI_CONTENT_SELECTORS + FALLBACK_CONTENT_SELECTORS,
is_huawei=True,
)
finally:
await ctx_aliyun.close()
await ctx_huawei.close()
await browser.close()
md = build_markdown(cfg["name"], aliyun, huawei)
if output:
Path(output).write_text(md, encoding="utf-8")
print(f"\n[OK] Done! Saved to {output} ({len(md.encode())//1024} KB)")
print(" Paste the file content to AI to start competitive analysis.")
else:
print("\n" + "=" * 60 + "\n" + md)
def main():
parser = argparse.ArgumentParser(description="Aliyun & Huawei doc scraper v4.1",
formatter_class=argparse.RawTextHelpFormatter) # keep the newlines in the --product help text
parser.add_argument("--product", default="",
help="Product key. Available:\n" + "\n".join(f" {k:<15} {v['name']}" for k, v in PRODUCTS.items()))
parser.add_argument("--list", action="store_true", help="List all products")
parser.add_argument("--output", default="", help="Output file path")
parser.add_argument("--max-pages", type=int, default=12, help="Max core pages per side (default 12)")
parser.add_argument("--stealth", action="store_true", help="Enable stealth mode for sites with JS rendering compatibility issues (use with caution)")
args = parser.parse_args()
if args.list:
print("\nAvailable products:\n")
for k, v in PRODUCTS.items():
print(f" {k:<15} {v['name']}")
print(); return
if not args.product:
parser.print_help(); sys.exit(1)
asyncio.run(main_async(args.product, args.max_pages, args.output, args))
if __name__ == "__main__":
main()
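A stable web search tool built on the China edition of Bing, tuned for Chinese-language queries, with full-text page fetching; it bypasses common anti-scraping restrictions and returns structured search results.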
---
name: free-web-search
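description: A stable web search tool built on the China edition of Bing, tuned for Chinese-language queries, with full-text page fetching; it bypasses common anti-scraping restrictions and returns structured search results.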
version: 7
author: free-web-search
trigger_keywords:
- 搜索
- 查一下
- 找一下
- 最新消息
- 新闻
- 最新动态
- 官网
- 教程
- 是什么
tools:
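- name: web_search
description: Web search returning structured results, tuned for Chinese-language queries, with optional full-text fetching
script: scripts/web_search.py
parameters:
query:
type: string
description: "[Required] Search keywords or a short phrase. Must be concise and precise and follow the query optimization rules below. No long sentences or rhetorical questions"
required: true
max:
type: integer
description: Maximum number of search results to return, default 10, at most 20
required: false
full:
type: integer
description: Fetch the full page text of the first N results, default 0 (no fetching), at most 5
required: false
engine:
type: string
description: Search engine to use, one of bing/duckduckgo/auto (default bing)
required: false
filter:
type: boolean
description: Filter out low-quality domains (e.g. Zhihu), default false (no filtering)
required: false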
---
# free-web-search v14 web search tool
A stable search tool built on a Playwright-driven browser: **intent detection** + **request throttling** + **result quality scoring** + **the CSS-blocking fix**.
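## What's new in v14
- ✅ **[Critical fix] CSS is no longer blocked**: blocking CSS previously caused Bing result titles to lose their text
- ✅ **Intent detection + query rewriting**: when result quality is poor, the query is rewritten automatically (city sightseeing → attraction recommendations, today's price → real-time quotes, etc.)
- ✅ **Rewriting only when quality is poor**: the original query is searched first and kept if its results are good, so rewriting cannot spoil a query that already works
- ✅ **Request throttling**: at least 3 s between two Bing requests, to avoid triggering rate limits
- ✅ **Rate-limit detection + backoff**: on 0 results, retry with increasing waits; stop if the exclusion retry also returns 0 results
- ✅ **`--filter` fallback**: if filtering leaves nothing, fall back to the unfiltered results
- ✅ **Single-domain exclusion retry**: at most 2 rounds; results are replaced only if the new ones are better
- ✅ **DuckDuckGo fails fast inside China**: 10 s timeout × 1 attempt
- ✅ **`--no-rewrite`**: disables query rewriting (for debugging)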
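
The throttling and backoff items above can be sketched as follows. This is an illustration, not the code in `web_search.py`: the `Throttle` class and `backoff_delays` function are hypothetical, and the linear backoff schedule is an assumption (the notes only say the waits increase).

```python
import time

class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval: float = 3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first

    def wait(self) -> float:
        """Sleep if the previous request was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay:
            self._sleep(delay)
        self._last = self._clock()
        return delay

def backoff_delays(retries: int, base: float = 3.0) -> list[float]:
    """Assumed linear schedule for retries after an empty result: base, 2*base, ..."""
    return [base * (i + 1) for i in range(retries)]
```

Calling `wait()` before each search request keeps requests at least `min_interval` seconds apart; the injectable `clock` and `sleep` make the behavior testable without real delays.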
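## Core capabilities
- ✅ **Tuned for Chinese-language queries**: forces Bing to return Chinese results
- ✅ **Anti-scraping detection bypass**: several layers of anti-detection measures (stealth.js)
- ✅ **Full-text fetching**: fetches the complete body text of target pages on demand
- ✅ **Headless mode**: runs on servers, no display required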
---
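## [Must read] Search query optimization rules
**Search quality depends mostly on whether the query is well formed (roughly 90% by the author's estimate).** Follow these rules strictly when generating search terms:
### 1. Golden rules
1. **Concise and precise**: keep only the core keywords, combining 2-5 of them; no long sentences, rhetorical questions, or conversational phrasing
2. **Explicit qualifiers**: when the content depends on time, domain, or region, add the matching qualifier
3. **Correct format**: use Chinese keywords plus English or numeric qualifiers; no special symbols or meaningless particles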
| 搜索场景 | 正确Query(推荐) | 错误Query(禁止) |
|----------|--------------------|--------------------|
| 时效性新闻 | 2026年04月 美伊局势 最新 | 你能帮我查一下最近美国和伊朗之间发生了什么事吗 |
| 技术教程 | Python 异步编程 最佳实践 2026 | 我想学习一下Python的异步编程,有没有好的教程 |
| 知识科普 | 中国大型邮轮 花城号 出坞 最新消息 | 中国的那个大型邮轮花城号现在怎么样了 |
| 本地内容 | 广东东莞 今日天气 | 我现在在东莞,今天天气怎么样啊 |
| 官方信息 | 华为云 ModelArts 官方文档 | 华为云的那个ModelArts的官网在哪里,文档怎么看 |
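
The rules and examples above can be turned into a rough pre-flight check. The sketch below is a heuristic under assumptions: `check_query` and the `QUESTION_MARKERS` list are hypothetical and not part of `web_search.py`, and keywords are assumed to be separated by spaces as in the good examples.

```python
# Assumed markers of questions or conversational phrasing (illustrative, not exhaustive).
QUESTION_MARKERS = ("?", "?", "吗", "怎么样", "有没有", "帮我")

def check_query(query: str) -> list[str]:
    """Return a list of rule violations; an empty list means the query looks fine."""
    problems = []
    terms = query.split()
    if not 2 <= len(terms) <= 5:
        problems.append("use 2-5 space-separated keywords")
    if any(marker in query for marker in QUESTION_MARKERS):
        problems.append("avoid questions and conversational phrasing")
    return problems
```

For example, `check_query("广东东莞 今日天气")` returns an empty list, while the conversational version of the same request trips both rules.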
---
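## Parameters
| Parameter | Type | Description | Default | Allowed values |
|--------|------|------|--------|----------|
| `query` | string | [Required] Search keywords | - | must not be empty |
| `max` | integer | Maximum number of search results to return | 10 | 1-20 |
| `full` | integer | Fetch the full page text of the first N results | 0 | 0-5 |
| `engine` | string | Search engine to use | bing | bing/duckduckgo/auto |
| `filter` | boolean | Filter out low-quality domains | false | - |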
---
## Usage Examples
```bash
# Basic search
python scripts/web_search.py "经济新闻 今日" --max=10
# Fetch the full text of the first 3 results
python scripts/web_search.py "经济新闻 最新" --full=3
# auto mode (falls back to DuckDuckGo when Bing returns no results)
python scripts/web_search.py "技术教程" --engine=auto
# Filter out low-quality domains such as Zhihu
python scripts/web_search.py "某个话题" --filter
```
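
The script prints a JSON array of results (`title`, `url`, `snippet`, `content`) on stdout and writes its logs to stderr, so it can be driven from another Python program. The sketch below assumes it is run from the skill's root directory; `build_command`, `parse_output`, and `run_search` are hypothetical helper names, not part of the tool.

```python
import json
import subprocess
import sys


def build_command(query, max_results=10, full=0, engine="bing", do_filter=False):
    """Build the argv list for scripts/web_search.py."""
    cmd = [
        sys.executable, "scripts/web_search.py", query,
        f"--max={max_results}", f"--full={full}", f"--engine={engine}",
    ]
    if do_filter:
        cmd.append("--filter")
    return cmd


def parse_output(stdout):
    """Parse the JSON the script prints on stdout."""
    return json.loads(stdout)


def run_search(query, **kwargs):
    """Run the search script and return its parsed results."""
    proc = subprocess.run(
        build_command(query, **kwargs),
        capture_output=True, text=True, encoding="utf-8",
    )
    return parse_output(proc.stdout)
```

A call such as `run_search("经济新闻 今日", max_results=5)` would then return a list of result dictionaries; when the query is empty the script prints `{"error": "no query"}` instead of a list.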
---
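## FAQ
### The search returns no results
1. Check the network connection (a VPN can interfere with Bing's China edition)
2. Try `--engine=duckduckgo` to query DuckDuckGo directly
3. Check whether the query is too long or too conversational
### The browser fails to start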
```bash
pip install playwright && playwright install chromium
```
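### Full-text fetching fails
- Some sites have strict anti-scraping protection
- Domains listed in `BLOCK_DOMAINS` are skipped during full-text fetching (the list is empty by default, so Zhihu pages are attempted)
### Results are concentrated on a single domain
- The script detects this automatically and prints a `[WARN]` line naming the dominant domain to stderr
- **Fix**: use more specific keywords and avoid ambiguous terms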
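
The concentration check can be sketched as below. It is a standalone illustration modeled on the script's `get_dominant_domain`: a domain counts as dominant when it supplies more than half of the results. The function name `dominant_domain` and the `threshold` parameter are illustrative choices for this sketch.

```python
from collections import Counter
from urllib.parse import urlparse


def dominant_domain(urls, threshold=0.5):
    """Return the host that supplies more than `threshold` of the URLs, else None."""
    if not urls:
        return None
    # Treat www.example.com and example.com as the same host.
    hosts = [urlparse(u).netloc.removeprefix("www.") for u in urls]
    host, count = Counter(hosts).most_common(1)[0]
    return host if count > len(urls) * threshold else None
```

When a dominant host is found, the script retries the search with `-site:<host>` appended and keeps the new results only if they score higher.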
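### Keyword Pitfalls
| ❌ Avoid | ✅ Prefer |
|---------|---------|
| `民生新闻` | `住房 医疗 就业` or `社会政策 百姓生活` |
| `经济新闻` | `财经政策 GDP` or `A股 沪指` |
| `长护险` | `长期护理保险 养老服务` |
### Server Environments
- The script forces `headless=True`, so no display is needed
- Server-compatible browser launch flags are already included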
FILE:scripts/setup.sh
#!/usr/bin/env bash
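# free-web-search dependency installer
# Usage: bash scripts/setup.sh
# Option: --mirror <url> selects a pip mirror
set -e
# Empty by default so that no -i/--trusted-host flags are passed to pip
MIRROR=""
if [ "$1" = "--mirror" ] && [ -n "$2" ]; then
  MIRROR="$2"
fi
PIP_ARGS=""
if [ -n "$MIRROR" ]; then
  # Strip the scheme and path to get the host for --trusted-host
  PIP_ARGS="-i $MIRROR --trusted-host $(echo "$MIRROR" | sed -E 's|^https?://||' | cut -d/ -f1)"
fi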
echo "=== free-web-search dependency installation ==="
# Locate pip
if command -v pip &>/dev/null; then
  PIP=pip
elif command -v pip3 &>/dev/null; then
  PIP=pip3
else
  echo "[ERROR] pip not found; install Python 3.10+ first"
  exit 1
fi
echo "[1/3] Installing Python packages..."
$PIP install httpx beautifulsoup4 playwright $PIP_ARGS
echo "[2/3] Installing the Chromium browser..."
playwright install chromium
echo "[3/3] Verifying the installation..."
python3 -c "
import httpx; print(' httpx OK')
from bs4 import BeautifulSoup; print(' beautifulsoup4 OK')
from playwright.sync_api import sync_playwright; print(' playwright OK')
pw = sync_playwright().start()
b = pw.chromium.launch(headless=True)
b.close()
pw.stop()
print(' chromium OK')
"
echo "=== Installation complete ==="
FILE:scripts/web_search.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
free-web-search v14
Intent detection + query rewriting + request throttling + result-quality scoring + single-domain exclusion retry + CSS preserved."""
import sys
import json
import time
import re
import argparse
import subprocess
from urllib.parse import urlencode, quote, urlparse
from datetime import datetime
# ==================== Force UTF-8 ====================
sys.stdout.reconfigure(encoding='utf-8')
sys.stderr.reconfigure(encoding='utf-8')
# ==================== Configuration ====================
DEFAULT_MAX = 10
DEFAULT_FULL = 0
TIMEOUT = 30000
FETCH_TIMEOUT= 15000
DDG_TIMEOUT = 10000
MAX_RETRIES = 3
DDG_RETRIES = 1
WAIT_TIME = 2000
QUALITY_THRESHOLD = 0.45
MIN_REQUEST_INTERVAL = 3.0
_last_request_time = 0.0
UA = (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/125.0.0.0 Safari/537.36"
)
BLOCK_DOMAINS = [
    # Zhihu results may be fetched in full (anti-scraping may kick in, but it is worth trying)
]
LOW_QUALITY_DOMAINS = [
"jingyan.baidu.com",
"zhidao.baidu.com",
"tieba.baidu.com",
"baike.baidu.com",
"wenku.baidu.com",
"bbs.16fan.com",
"zhihu.com",
"zhuanlan.zhihu.com",
]
AUTHORITY_HINTS = [
".gov.", "gov.cn", ".org.",
"kitco.com", "sge.com.cn", "cngold.org", "gold.org.cn",
"kekegold.com", "cngoldprice.com", "ip138.com",
"finance.sina", "finance.eastmoney", "10jqka.com.cn",
"jujindata.com", "huilvbiao.com", "jinjia.com.cn",
"mafengwo.cn", "ctrip.com", "damai.cn",
"visitshenzhen",
]
FUZZY_TIME_WORDS = re.compile(r'(今日|今天|最新|最近|当前|目前|当下|现在)')
# ==================== Intent detection + query rewriting ====================
CITIES = "深圳|广州|北京|上海|杭州|成都|武汉|南京|重庆|西安|长沙|苏州|厦门|青岛|大连|天津|昆明|珠海|东莞|佛山|惠州|中山"
# Intent rules: (regex, rewrite function, description)
# The rewrite function receives the match object and the original query and returns the full rewritten query.
# Principle: only trim or substitute, never pad. Bing CN works best with concise queries.
INTENT_RULES = [
    # city + "what is fun / where to go" → shortened to "<city> 景点"
    (re.compile(rf'({CITIES})\s*(有什么好玩的|哪里好玩|好玩的地方|去哪玩|周末.*去哪|好去处|逛|玩什么)'),
     lambda m, q: f'{m.group(1)} 景点', 'city sightseeing → attractions'),
    # city + event → shortened
    (re.compile(rf'({CITIES})\s*(活动|展览|演出|市集|音乐会|演唱会)'),
     lambda m, q: f'{m.group(1)} {m.group(2)}', 'city events → shortened'),
    # "今日金价" → "金价" (on Bing CN "今日" tends to match Baidu Jingyan pages, so dropping it helps)
    (re.compile(r'今日(金价|银价|油价|铜价|铂金价)'),
     lambda m, q: f'{m.group(1)}', 'price today → drop 今日'),
    # "xxx是什么" → "xxx 介绍"
    (re.compile(r'(.+?)是什么(?:意思)?$', re.IGNORECASE),
     lambda m, q: f'{m.group(1)} 介绍', 'what is → introduction'),
    # "xxx怎么样" → "xxx 评价"
    (re.compile(r'(.+?)怎么样$', re.IGNORECASE),
     lambda m, q: f'{m.group(1)} 评价', 'how is it → reviews'),
    # "怎么xxx" → "xxx 方法"
    (re.compile(r'^怎么(.+)', re.IGNORECASE),
     lambda m, q: f'{m.group(1)} 方法', 'how to → method'),
    # "xxx和yyy哪个好" → "xxx yyy 对比"
    (re.compile(r'(.+?)和(.+?)(哪个好|哪个更好|选哪个)'),
     lambda m, q: f'{m.group(1)} {m.group(2)} 对比', 'which is better → comparison'),
]
def rewrite_query(query: str) -> tuple:
    """Detect intent and rewrite the query. Only the first matching rule is applied.

    Returns (rewritten query, intent description or None).
    """
    for pattern, rewrite_fn, desc in INTENT_RULES:
        m = pattern.search(query)
        if m:
            rewritten = rewrite_fn(m, query)
            rewritten = re.sub(r'\s+', ' ', rewritten).strip()
            return rewritten, desc
    return query, None
# Anti-detection init script
STEALTH_JS = """
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
Object.defineProperty(navigator, 'plugins', {
get: () => [
{name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer'},
{name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpafafjmlifpcpbgpcj'},
{name: 'Native Client Executable', filename: 'internal-nacl-plugin'}
]
});
Object.defineProperty(navigator, 'languages', {get: () => ['zh-CN', 'zh', 'en']});
Object.defineProperty(navigator, 'platform', {get: () => 'Win32'});
Object.defineProperty(navigator, 'hardwareConcurrency', {get: () => 8});
Object.defineProperty(navigator, 'deviceMemory', {get: () => 8});
window.chrome = {runtime: {}};
const originalQuery = window.navigator.permissions.query;
window.navigator.permissions.query = (parameters) => (
parameters.name === 'notifications' ?
Promise.resolve({ state: Notification.permission }) :
originalQuery(parameters)
);
const originalToDataURL = HTMLCanvasElement.prototype.toDataURL;
HTMLCanvasElement.prototype.toDataURL = function(type) {
if (type === 'image/png' && this.width === 220 && this.height === 30) {
return 'data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAANwAAAAeCAYAAABwJ3rwAAAABGdBTUEAALGPC/xhBQAAACBjSFJNAAB6JgAAgIQAAPoAAACA6AAAdTAAAOpgAAA6mAAAF3CculE8AAAABmJLR0QA/wD/AP+gvaeTAAAABmJLR0QA/wD/AP+gvaeTAAAABmJLR0QA/wD/AP+gvaeT';
}
return originalToDataURL.apply(this, arguments);
};
const getParameter = WebGLRenderingContext.prototype.getParameter;
WebGLRenderingContext.prototype.getParameter = function(parameter) {
if (parameter === 37445) return 'Intel Inc.';
if (parameter === 37446) return 'Intel Iris OpenGL Engine';
return getParameter.apply(this, arguments);
};
"""
BROWSER_ARGS = [
'--no-sandbox', '--disable-gpu', '--disable-dev-shm-usage',
'--disable-blink-features=AutomationControlled', '--disable-infobars',
'--disable-extensions', '--disable-background-networking', '--disable-sync',
'--metrics-recording-only', '--disable-default-apps', '--no-first-run',
'--disable-component-extensions-with-background-pages',
'--disable-features=IsolateOrigins,site-per-process',
'--disable-site-isolation-trials', '--disable-web-security',
'--allow-running-insecure-content',
]
ROUTE_PATTERN = "**/*.{png,jpg,jpeg,gif,svg,woff,woff2,ttf,mp4,ico,webp,js.map}"
_browser = None
_playwright = None
def ensure_playwright():
    """Install playwright and Chromium on first run, then re-exec the script."""
    try:
        import playwright  # noqa: F401
        return
    except ImportError:
        pass
    print("[INFO] Installing playwright...", file=sys.stderr)
    for cmd in [
        [sys.executable, "-m", "pip", "install", "-q", "playwright", "--break-system-packages"],
        [sys.executable, "-m", "pip", "install", "-q", "playwright"],
    ]:
        if subprocess.run(cmd, capture_output=True, text=True).returncode == 0:
            break
    else:
        # Every install attempt failed: stop here instead of re-executing forever.
        print("[ERROR] Could not install playwright", file=sys.stderr)
        sys.exit(1)
    subprocess.run([sys.executable, "-m", "playwright", "install", "chromium"],
                   capture_output=True, text=True)
    import os
    os.execv(sys.executable, [sys.executable] + sys.argv)
def parse_args():
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("query", nargs="*")
    parser.add_argument("--max", type=int, default=DEFAULT_MAX, dest="max_results")
    parser.add_argument("--full", type=int, default=DEFAULT_FULL)
    parser.add_argument("--engine", type=str, default="bing", choices=["bing", "duckduckgo", "auto"])
    parser.add_argument("--filter", action="store_true", help="filter out low-quality domains")
    parser.add_argument("--no-rewrite", action="store_true", help="disable query rewriting")
    args = parser.parse_args()
    query = " ".join(args.query).strip()
    # Clamp to the documented ranges: max 1-20, full 0-5
    args.max_results = max(1, min(20, args.max_results))
    args.full = max(0, min(5, args.full))
    return query, args.max_results, args.full, args.engine, args.filter, args.no_rewrite
def build_bing_url(query, count):
    return "https://cn.bing.com/search?" + urlencode({
        "q": query, "mkt": "zh-CN", "setlang": "zh-CN", "cc": "CN", "count": str(count + 2)
    })
def build_duckduckgo_url(query):
    return "https://duckduckgo.com/?q=" + quote(query) + "&ia=web"
def init_browser():
    """Launch a single shared headless Chromium instance (no-op if already running)."""
    global _browser, _playwright
    if _browser:
        return
    from playwright.sync_api import sync_playwright
    print("[DEBUG] Launching Chromium...", file=sys.stderr)
    _playwright = sync_playwright().start()
    _browser = _playwright.chromium.launch(headless=True, args=BROWSER_ARGS)
    print("[DEBUG] Chromium ready", file=sys.stderr)
def close_browser():
    try:
        if _browser:
            _browser.close()
        if _playwright:
            _playwright.stop()
    except Exception:
        pass
def create_context():
    """Create a fresh browser context with a Chinese locale and the stealth script injected."""
    ctx = _browser.new_context(
        locale="zh-CN", user_agent=UA,
        viewport={"width": 1920, "height": 1080}, screen={"width": 1920, "height": 1080},
        device_scale_factor=1, timezone_id="Asia/Shanghai",
        has_touch=False, is_mobile=False, java_script_enabled=True,
    )
    ctx.add_init_script(STEALTH_JS)
    return ctx
def throttle():
    """Sleep so that consecutive requests are at least MIN_REQUEST_INTERVAL seconds apart."""
    global _last_request_time
    gap = MIN_REQUEST_INTERVAL - (time.time() - _last_request_time)
    if gap > 0:
        time.sleep(gap)
    _last_request_time = time.time()
def is_blocked_domain(url): return any(d in url for d in BLOCK_DOMAINS)
def is_low_quality_domain(url): return any(d in url for d in LOW_QUALITY_DOMAINS)
def score_result(r):
    """Score one result in [0, 1]: start at 0.5, then adjust by domain and snippet heuristics."""
    s = 0.5
    url, snippet = r.get("url", ""), r.get("snippet", "")
    if is_low_quality_domain(url):
        s -= 0.3
    if re.search(r'\d{2,}', snippet):
        s += 0.15
    if len(snippet) < 20:
        s -= 0.1
    for h in AUTHORITY_HINTS:
        if h in url:
            s += 0.2
            break
    return max(0.0, min(1.0, s))
def score_results(results):
    """Average score of a result list; 0.0 for an empty list."""
    return sum(score_result(r) for r in results) / len(results) if results else 0.0
def get_dominant_domain(results):
    """Return (domain, count, total) when one domain supplies more than half of the results."""
    if not results:
        return (None, 0, 0)
    domains = {}
    for r in results:
        d = urlparse(r["url"]).netloc.replace("www.", "")
        domains[d] = domains.get(d, 0) + 1
    top = max(domains, key=domains.get)
    if domains[top] > len(results) * 0.5:
        return (top, domains[top], len(results))
    return (None, 0, len(results))
def merge_results(primary, secondary, max_results):
    """Concatenate two result lists, dropping duplicate URLs and keeping primary first."""
    seen, merged = set(), []
    for r in primary + secondary:
        if r["url"] not in seen:
            seen.add(r["url"])
            merged.append(r)
    return merged[:max_results]
def apply_filter(results, do_filter):
    if do_filter and results:
        filtered = [r for r in results if not is_blocked_domain(r["url"]) and not is_low_quality_domain(r["url"])]
        if filtered:
            return filtered
        print("[WARN] --filter left no results; falling back to unfiltered", file=sys.stderr)
    return results
# ==================== Bing search ====================
def search_bing(query, max_results, do_filter=False):
    start = time.time()
    url = build_bing_url(query, max_results + 5)
    print(f"[DEBUG] Bing: {query} | max={max_results}", file=sys.stderr)
    init_browser()
    results = []
    for attempt in range(MAX_RETRIES):
        throttle()
        ctx, page = create_context(), None
        try:
            page = ctx.new_page()
            # Block images, fonts and media, but keep CSS (blocking it loses result titles)
            page.route(ROUTE_PATTERN, lambda r: r.abort())
            page.goto(url, timeout=TIMEOUT, wait_until="domcontentloaded")
            page.wait_for_timeout(WAIT_TIME)
            raw = page.evaluate("""() => {
                const items = [];
                document.querySelectorAll('li.b_algo').forEach(el => {
                    try {
                        const a = el.querySelector('h2 a');
                        const p = el.querySelector('.b_caption p, .b_algoSlug');
                        if (a && a.href && a.href.startsWith('http'))
                            items.push({title:(a.innerText||a.textContent||'').trim(), url:a.href.trim(), snippet:p?(p.innerText||p.textContent||'').trim():''});
                    } catch(e) {}
                });
                return items;
            }""")
            for r in raw:
                if r["title"] and r["url"] and len(r["title"]) > 3:
                    results.append({"title": r["title"], "url": r["url"], "snippet": r["snippet"], "content": ""})
                    if len(results) >= max_results:
                        break
            if results:
                # Success: stop retrying, otherwise later attempts would append duplicates
                break
            if len(raw) == 0 and attempt < MAX_RETRIES - 1:
                w = 5 * (attempt + 1)
                print(f"[WARN] 0 results, possibly rate-limited; waiting {w}s", file=sys.stderr)
                time.sleep(w)
        except Exception as e:
            print(f"[WARN] Bing attempt {attempt+1} failed: {e}", file=sys.stderr)
        finally:
            try:
                ctx.close()
            except Exception:
                pass
    results = apply_filter(results, do_filter)[:max_results]
    dom, dc, tot = get_dominant_domain(results)
    if dom:
        print(f"[WARN] Results concentrated on one domain: {dom} ({dc}/{tot})", file=sys.stderr)
    q = score_results(results)
    print(f"[DEBUG] quality: {q:.2f} | count: {len(results)} | elapsed: {time.time()-start:.1f}s", file=sys.stderr)
    return results, dom
# ==================== DuckDuckGo ====================
def search_duckduckgo(query, max_results, do_filter=False):
    start = time.time()
    url = build_duckduckgo_url(query)
    print(f"[DEBUG] DDG: {query}", file=sys.stderr)
    init_browser()
    results = []
    for _ in range(DDG_RETRIES):
        ctx, page = create_context(), None
        try:
            page = ctx.new_page()
            page.route(ROUTE_PATTERN, lambda r: r.abort())
            page.goto(url, timeout=DDG_TIMEOUT, wait_until="domcontentloaded")
            page.wait_for_timeout(WAIT_TIME + 1000)
            # Try several result selectors and stop at the first one that yields items
            raw = page.evaluate("""() => {
                const items = [];
                for (const sel of ['article[data-testid="result"]','.result','[data-testid="result"]','li[data-layout="organic"]']) {
                    document.querySelectorAll(sel).forEach(el => {
                        try {
                            const a=el.querySelector('a[href^="http"]'),t=el.querySelector('h2,.result__a,[data-testid="result-title"] span'),s=el.querySelector('[data-testid="result-snippet"],.result__snippet');
                            if(a&&a.href&&t) items.push({title:(t.innerText||t.textContent||'').trim(),url:a.href.trim(),snippet:s?(s.innerText||'').trim():''});
                        } catch(e) {}
                    });
                    if(items.length>0) break;
                }
                return items;
            }""")
            for r in raw:
                if r["title"] and r["url"] and len(r["title"]) > 3:
                    results.append({"title": r["title"], "url": r["url"], "snippet": r["snippet"], "content": ""})
                    if len(results) >= max_results:
                        break
            if results:
                break
        except Exception as e:
            print(f"[WARN] DDG failed: {e}", file=sys.stderr)
        finally:
            try:
                ctx.close()
            except Exception:
                pass
    results = apply_filter(results, do_filter)[:max_results]
    print(f"[DEBUG] DDG: {len(results)} results | {time.time()-start:.1f}s", file=sys.stderr)
    return results, None
# ==================== Full-text fetching ====================
def fetch_full(url):
    start = time.time()
    print(f"[DEBUG] Fetching full text: {url}", file=sys.stderr)
    if is_blocked_domain(url):
        return "skipped: blocklisted domain"
    ctx, page = create_context(), None
    text = ""
    try:
        page = ctx.new_page()
        page.route(ROUTE_PATTERN, lambda r: r.abort())
        page.goto(url, timeout=FETCH_TIMEOUT, wait_until="domcontentloaded")
        page.wait_for_timeout(800)
        try:
            page.wait_for_load_state("networkidle", timeout=5000)
        except Exception:
            pass
        # Strip page chrome, then return the first main-content container with over 200 characters
        text = page.evaluate("""() => {
            document.querySelectorAll('script,style,nav,header,footer,.ad,.ads,[class*="banner"],[id*="banner"],.sidebar,.comment,.popup,.modal,.cookie').forEach(e=>e.remove());
            for (const sel of ['article','main','.content','.post','.article','#content','#main','.entry-content','.post-content','[itemprop="articleBody"]']) {
                const m=document.querySelector(sel); if(m&&m.innerText.length>200) return m.innerText;
            }
            return document.body?document.body.innerText:'';
        }""")
    except Exception as e:
        print(f"[ERROR] Full-text fetch failed: {e}", file=sys.stderr)
    finally:
        try:
            ctx.close()
        except Exception:
            pass
    result = (text or "").strip()[:8000]
    print(f"[DEBUG] Full text: {len(result)} chars | {time.time()-start:.1f}s", file=sys.stderr)
    return result or "fetch failed"
# ==================== Main ====================
def main():
    ensure_playwright()
    query, max_results, full, engine, do_filter, no_rewrite = parse_args()
    if not query:
        print(json.dumps({"error": "no query"}, ensure_ascii=False))
        sys.exit(1)
    # Intent detection + query rewriting runs only when quality is poor, never up front
    original_query = query
    results = []
    if engine == "duckduckgo":
        results, _ = search_duckduckgo(query, max_results, do_filter)
    else:
        # Step 1: search with the original query
        results, dominant = search_bing(query, max_results, do_filter)
        quality = score_results(results)
        # Step 2: single-domain concentration or low-quality top hit → exclusion retry
        # (at most 2 rounds; stops when a retry returns nothing, e.g. under rate limiting)
        if len(results) > 0:
            excluded = set()
            for _ in range(2):
                target = None
                if dominant and dominant not in excluded:
                    target = dominant
                elif results and is_low_quality_domain(results[0]["url"]):
                    d = urlparse(results[0]["url"]).netloc.replace("www.", "")
                    if d not in excluded:
                        target = d
                if not target:
                    break
                excluded.add(target)
                rq = query + " " + " ".join(f"-site:{d}" for d in excluded)
                print(f"[INFO] Retrying with {target} excluded", file=sys.stderr)
                rr, _ = search_bing(rq, max_results, do_filter)
                if not rr:
                    break
                rq_score = score_results(rr)
                if rq_score > quality:
                    results = merge_results(rr, results, max_results)
                    quality = score_results(results)
                    dominant, _, _ = get_dominant_domain(results)
                else:
                    break
        # Step 3: poor quality → retry with a rewritten query (only when rewriting is enabled)
        if not no_rewrite and quality < QUALITY_THRESHOLD and len(results) > 0:
            rewritten, intent = rewrite_query(query)
            if intent and rewritten != query:
                print(f"[INFO] Low quality ({quality:.2f}), rewriting by intent ({intent}): {query} → {rewritten}", file=sys.stderr)
                rr, _ = search_bing(rewritten, max_results, do_filter)
                if rr and score_results(rr) > quality:
                    results = rr
                    quality = score_results(results)
        # Step 4 (auto mode): quality still poor → simplify the query by dropping fuzzy time words
        if engine == "auto" and quality < QUALITY_THRESHOLD and len(results) > 0:
            simplified = FUZZY_TIME_WORDS.sub('', query).strip()
            simplified = re.sub(r'\s+', ' ', simplified).strip()
            if simplified != query and len(simplified) >= 2:
                print(f"[INFO] Retrying with simplified query: {simplified}", file=sys.stderr)
                rr, _ = search_bing(simplified, max_results, do_filter)
                if rr and score_results(rr) > quality:
                    results = rr
        # Step 5 (auto mode): Bing returned nothing at all → fall back to DDG
        if engine == "auto" and not results:
            print("[INFO] No Bing results, falling back to DDG...", file=sys.stderr)
            results, _ = search_duckduckgo(query, max_results, do_filter)
    # Full-text fetching
    if full > 0 and results:
        for i in range(min(full, len(results))):
            results[i]["content"] = fetch_full(results[i]["url"])
    print(json.dumps(results, ensure_ascii=False, indent=2))
if __name__ == "__main__":
    try:
        main()
    finally:
        close_browser()