@clawhub-nowhitestar-9cdfdd2dad
Personal knowledge base that captures web content (articles, tweets/threads, videos, podcasts, images, PDFs) and makes it retrievable for future conversation...
---
name: link-library
version: "1.0.0"
description: >
  Personal knowledge base that captures web content (articles, tweets/threads, videos, podcasts, images, PDFs)
  and makes it retrievable for future conversations and writing.
  Use when: (1) User shares a URL with ANY interest signal — asking to summarize, commenting positively,
  saying "有意思/不错/interesting/值得看/学到了", or requesting it be saved,
  (2) User asks to find previously saved content ("我之前存的那篇...", "find that article about..."),
  (3) User needs reference material for writing or discussion,
  (4) User shares Twitter/X, WeChat, YouTube, Bilibili, or any web link and engages with it.
  Interest signals that trigger save: "帮我总结一下", "这篇不错", "有意思", "记一下", "留着以后用",
  "这个观点很好", "学到了", "值得保存", "放进知识库", sharing link + any commentary or opinion,
  asking follow-up questions about link content. Do NOT require literal "save"/"bookmark" keywords.
---
# Link Library — Personal Content Knowledge Base
Save web content with full original text, generate summaries and tags, retrieve semantically.
## Core Rules
1. **Always save original full text** — summaries are for retrieval, originals are for re-reading
2. **Detect interest, don't demand commands** — if user engages with a link, offer to save
3. **Twitter/X is first-class** — tweets, threads, and articles are fully supported
## Interest Detection
When user shares a link, evaluate interest signals:
**Auto-save (no confirmation needed):**
- User explicitly says save/bookmark/记一下/放进知识库
- User asks "帮我总结一下" (summarize implies save-worthy)
**Offer to save (ask once):**
- User shares link + positive commentary ("这篇不错", "有意思", "学到了")
- User asks follow-up questions about link content
- User discusses link content substantively
**Don't save:**
- User shares link just for quick reference in conversation
- User says "不用保存" or similar
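The three tiers above could be approximated with simple phrase matching. The sketch below is only an illustration: `SaveAction` and `classify_interest` are hypothetical names, the English decline phrases are assumptions, and substring matching cannot detect follow-up questions or substantive discussion, which still need the agent's judgment.

```python
import re
from enum import Enum

URL_RE = re.compile(r"https?://\S+")


class SaveAction(Enum):
    AUTO_SAVE = "auto_save"  # save without asking
    OFFER = "offer"          # ask once
    SKIP = "skip"            # do not save


# Phrases taken from the rules above; the English decline phrases are assumed.
DECLINE_PHRASES = ("不用保存", "don't save", "no need to save")
AUTO_SAVE_PHRASES = ("save", "bookmark", "记一下", "放进知识库", "帮我总结一下")
OFFER_PHRASES = ("这篇不错", "有意思", "学到了", "interesting")


def classify_interest(message: str) -> SaveAction:
    """Map a user message to a save decision using keyword signals only."""
    if not URL_RE.search(message):
        return SaveAction.SKIP
    # Remove the URL itself so words inside the link do not count as signals.
    text = URL_RE.sub(" ", message).lower()
    if any(phrase in text for phrase in DECLINE_PHRASES):
        return SaveAction.SKIP
    if any(phrase in text for phrase in AUTO_SAVE_PHRASES):
        return SaveAction.AUTO_SAVE
    if any(phrase in text for phrase in OFFER_PHRASES):
        return SaveAction.OFFER
    return SaveAction.SKIP
```

Declines are checked first so that "don't save" is not misread as a save request.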
## Data Location
All entries in `~/.openclaw/workspace-main/library/`:
```
library/
├── articles/ # Web articles, blog posts, WeChat, Zhihu
├── tweets/ # Twitter/X posts and threads
├── videos/ # YouTube, Bilibili
├── podcasts/ # Podcast episodes
├── papers/ # Academic papers, PDFs
├── images/ # Infographics, visual content
└── misc/ # Everything else
```
## Content Types & Fetch Methods
| Type | URL Patterns | Fetch Method | Template |
|------|-------------|--------------|----------|
| article | Generic web, blog, /post/ | `web_fetch` or `curl -s "https://r.jina.ai/URL"` | `article.md` |
| wechat | mp.weixin.qq.com | `cd ~/.agent-reach/tools/wechat-article-for-ai && python3 main.py "URL"` | `article.md` |
| tweet | x.com, twitter.com /status/ | `xreach tweet URL --json` | `tweet.md` |
| thread | x.com, twitter.com (thread) | `xreach thread URL --json` | `tweet.md` |
| video | youtube.com, youtu.be | `yt-dlp --dump-json "URL"` + subtitle extraction | `video.md` |
| bilibili | bilibili.com | `yt-dlp --dump-json "URL"` + subtitle extraction | `video.md` |
| paper | arxiv.org, .pdf links | `web_fetch` or browser | `paper.md` |
| podcast | Podcast platforms | `web_fetch` metadata | `podcast.md` |
| image | Image URLs | Download + describe | `image.md` |
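As a rough illustration of the URL-pattern column, the sketch below maps a URL to one of the type names in the table. `identify_type` and the image extension list are assumptions, podcast platforms are not covered because the table does not name them, and a thread cannot be told apart from a single tweet by URL alone.

```python
from urllib.parse import urlparse

# Assumed list of image extensions; the table only says "Image URLs".
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp")


def _host_is(host: str, domain: str) -> bool:
    """True for the domain itself or any subdomain of it."""
    return host == domain or host.endswith("." + domain)


def identify_type(url: str) -> str:
    """Return a content type name from the table, defaulting to 'article'."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.lower()
    if _host_is(host, "mp.weixin.qq.com"):
        return "wechat"
    if (_host_is(host, "x.com") or _host_is(host, "twitter.com")) and "/status/" in path:
        return "tweet"
    if _host_is(host, "youtube.com") or _host_is(host, "youtu.be"):
        return "video"
    if _host_is(host, "bilibili.com"):
        return "bilibili"
    if _host_is(host, "arxiv.org") or path.endswith(".pdf"):
        return "paper"
    if path.endswith(IMAGE_EXTENSIONS):
        return "image"
    return "article"
```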
### Twitter/X Fetch Details
```bash
# Single tweet
xreach tweet URL_OR_ID --json
# Full thread
xreach thread URL_OR_ID --json
# User timeline (for context)
xreach tweets @username -n 20 --json
```
Extract from JSON: `full_text`, `user.screen_name`, `created_at`, `entities`, media URLs.
For threads: concatenate all tweets in order as full content.
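A minimal sketch of that extraction step follows. It assumes the `--json` output is either one tweet object or a list of tweet objects already in thread order; the real output shape of `xreach` is not documented here and may differ. `thread_to_entry_fields` is a hypothetical helper name.

```python
import json
from typing import Any, Dict, List


def thread_to_entry_fields(raw_json: str) -> Dict[str, Any]:
    """Pull author, date, and concatenated text out of tweet JSON (assumed shape)."""
    data = json.loads(raw_json)
    tweets: List[Dict[str, Any]] = data if isinstance(data, list) else [data]
    if not tweets:
        raise ValueError("no tweets in JSON output")
    first = tweets[0]
    screen_name = first.get("user", {}).get("screen_name", "")
    return {
        "author": f"@{screen_name}" if screen_name else "",
        "date_published": first.get("created_at", ""),
        "is_thread": len(tweets) > 1,
        # Keep every tweet in order, separated by a blank line.
        "content": "\n\n".join(t.get("full_text", "") for t in tweets),
    }
```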
### Video Subtitle Extraction
```bash
# Download subtitles
yt-dlp --write-sub --write-auto-sub --sub-lang "zh-Hans,zh,en" \
--convert-subs vtt --skip-download -o "/tmp/%(id)s" "URL"
# Then read the .vtt file as transcript
```
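Reading the `.vtt` file as a transcript usually means dropping the header, cue timings, and inline tags. The sketch below shows one simple way to do that; `vtt_to_transcript` is a hypothetical helper, and it assumes cue identifiers are purely numeric and that `NOTE` blocks fit on one line.

```python
import re

# Cue timing lines such as "00:00:01.000 --> 00:00:03.000" or "00:01.000 --> 00:03.000"
TIMING_RE = re.compile(r"^\d{2,}:\d{2}(?::\d{2})?\.\d{3}\s+-->\s+")
TAG_RE = re.compile(r"<[^>]+>")  # inline tags such as <c> or <00:00:01.500>
HEADER_PREFIXES = ("WEBVTT", "NOTE", "Kind:", "Language:")


def vtt_to_transcript(vtt_text: str) -> str:
    """Convert WebVTT subtitle text to plain transcript lines."""
    lines = []
    for raw in vtt_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(HEADER_PREFIXES):
            continue
        if TIMING_RE.match(line) or line.isdigit():
            continue
        line = TAG_RE.sub("", line).strip()
        # Auto-generated captions often repeat the previous line; keep one copy.
        if line and (not lines or lines[-1] != line):
            lines.append(line)
    return "\n".join(lines)
```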
## Entry Structure
Every entry has two parts:
### 1. YAML Frontmatter (structured metadata)
```yaml
title: "..."
source: "..." # Platform/domain
url: "..." # Original URL
author: "..." # Author or @handle
date_published: "..." # When content was created
date_saved: "..." # When we saved it
last_updated: "..." # Last modification
type: article|tweet|video|podcast|paper|image
tags: [tag1, tag2, ...]
status: unread|read|reviewed
priority: low|normal|high
related: [] # Paths to related entries
```
### 2. Markdown Body (content)
```markdown
# {title}
## Summary
2-3 sentence summary.
## Key Points
- Point 1
- Point 2
## Original Content
THE FULL ORIGINAL TEXT — not truncated, not summarized.
This is the authoritative source for re-reading and quoting.
## Quotes
> Notable quotes worth highlighting
## Notes
Personal observations, connections, action items.
## Related
- [[library/tweets/related-tweet]]
- [[library/articles/related-article]]
```
**⚠️ MANDATORY: Always save original full text in "Original Content" section.**
Summaries and key points are for quick retrieval. The original text is for accurate re-reading and quoting. Never skip saving the full content.
## Filename Convention
`<slugified-title>-<YYYY-MM-DD>.md`
Examples:
- `library/articles/yc-why-not-work-and-startup-2026-03-12.md`
- `library/tweets/garry-tan-on-yc-advice-2026-03-13.md`
- `library/videos/how-to-build-agents-2026-03-13.md`
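One possible way to build such a filename is sketched below. `entry_filename` and the 80-character slug cap are assumptions; non-ASCII letters such as Chinese characters are kept rather than dropped.

```python
import re
from datetime import date


def entry_filename(title: str, saved_on: date, max_slug_length: int = 80) -> str:
    """Build '<slugified-title>-<YYYY-MM-DD>.md' from a title and a date."""
    # Collapse every run of non-word characters (and underscores) into one hyphen.
    slug = re.sub(r"[\W_]+", "-", title.lower()).strip("-")
    slug = slug[:max_slug_length].rstrip("-")
    return f"{slug or 'untitled'}-{saved_on.isoformat()}.md"
```

For example, `entry_filename("How to Build Agents", date(2026, 3, 13))` yields the third filename in the list above.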
## Save Workflow
1. **Detect URL** — Parse link from user message
2. **Identify type** — Match URL pattern to content type
3. **Check dedup** — `memory_search("URL or title")` to avoid duplicates
4. **Fetch content** — Use appropriate method from table above
5. **Generate metadata** — Title, summary, key points, tags (3-7)
6. **Write entry** — Use template, fill frontmatter + full original text
7. **Confirm** — Tell user: title, tags, and where it's saved
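Steps 5 and 6 can be pictured as a small rendering function. The sketch below is an assumption about how an entry might be assembled, not the skill's actual implementation: `render_entry` is a hypothetical helper, it writes only the sections shown, and it quotes values with JSON string syntax, which is valid YAML for typical strings.

```python
import json
from datetime import date
from typing import List


def render_entry(
    title: str,
    url: str,
    source: str,
    author: str,
    entry_type: str,
    tags: List[str],
    summary: str,
    key_points: List[str],
    original_text: str,
    saved_on: date,
    date_published: str = "",
) -> str:
    """Render one library entry: YAML frontmatter followed by the Markdown body."""
    if not original_text.strip():
        raise ValueError("original_text is mandatory and must not be empty")
    if not 3 <= len(tags) <= 7:
        raise ValueError("expected 3-7 tags")

    def quoted(value: str) -> str:
        return json.dumps(value, ensure_ascii=False)

    today = saved_on.isoformat()
    frontmatter = [
        "---",
        f"title: {quoted(title)}",
        f"source: {quoted(source)}",
        f"url: {quoted(url)}",
        f"author: {quoted(author)}",
        f"date_published: {quoted(date_published)}",
        f"date_saved: {quoted(today)}",
        f"last_updated: {quoted(today)}",
        f"type: {entry_type}",
        f"tags: [{', '.join(quoted(tag) for tag in tags)}]",
        "status: unread",
        "priority: normal",
        "related: []",
        "---",
    ]
    body = [
        f"# {title}",
        "## Summary",
        summary,
        "## Key Points",
        *[f"- {point}" for point in key_points],
        "## Original Content",
        original_text,  # the full text, never truncated
    ]
    return "\n".join(frontmatter + body) + "\n"
```

Rejecting empty original text up front enforces the mandatory full-text rule before anything is written to disk.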
## Search & Retrieval
```python
# Semantic search
memory_search("创业方法论")
memory_search("Garry Tan 的推文")
memory_search("AI agent 视频教程")
# Read specific entry
memory_get("library/tweets/garry-tan-on-yc-advice-2026-03-13.md")
```
When returning search results, show:
- Title + source + date
- Summary (2 lines max)
- Tags
- Offer to show full original text
## Writing Reference Mode
When user asks to write something using saved content:
1. Search library for relevant entries
2. Read full original text of top matches
3. Synthesize insights, cite sources inline
4. Format citations as `[[library/type/entry-name]]`
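The citation format in step 4 can be derived from an entry's file path. A small sketch, with `citation_for` as a hypothetical helper name:

```python
from pathlib import PurePosixPath


def citation_for(entry_path: str) -> str:
    """Turn an entry path into a '[[library/type/entry-name]]' citation."""
    path = PurePosixPath(entry_path)
    if path.suffix == ".md":
        path = path.with_suffix("")
    parts = path.parts
    if "library" not in parts:
        raise ValueError(f"not a library entry: {entry_path}")
    # Keep everything from the 'library' directory onward.
    start = parts.index("library")
    return "[[" + "/".join(parts[start:]) + "]]"
```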
## Templates
Located in `templates/`:
- `article.md` — Web articles, blog posts, newsletters
- `tweet.md` — Twitter/X posts and threads
- `video.md` — Videos with transcript
- `podcast.md` — Podcast episodes
- `paper.md` — Academic papers
- `image.md` — Visual content
## Best Practices
- **Save originals religiously** — summaries lose nuance
- **Tag consistently** — reuse existing tags, keep vocabulary tight
- **Link related entries** — build a knowledge graph over time
- **Don't over-ask** — if interest is clear, just save and confirm
FILE:templates/article.md
---
title: ""
source: ""
url: ""
author: ""
date_published: ""
date_saved: ""
last_updated: ""
type: article
tags: []
status: unread
priority: normal
related: []
---
# {title}
## Summary
2-3 sentence summary of the article's main points.
## Key Points
- Main takeaway 1
- Main takeaway 2
- Main takeaway 3
## Original Content
**⚠️ MANDATORY: Paste the full original article text below. Never truncate.**
---
{full article text here}
---
## Quotes
> "Notable quote from the article"
> — Author
## Notes
Personal notes, insights, connections to other ideas.
## Related
- [[library/articles/related-entry]]
FILE:templates/image.md
---
title: ""
source: ""
url: ""
creator: ""
date_published: ""
date_saved: ""
type: image
tags: []
---
# {title}
## Description
What this image shows, its purpose, key elements.
## Visual Analysis
- **Type**: Infographic / Diagram / Screenshot / Photo
- **Key elements**: What's depicted
- **Style**: Minimalist / Detailed / Technical
## Content Summary
Text content, data, or information presented in the image.
## Insights
What can be learned from this visual.
## Use Cases
When to reference this image.
## Notes
Personal observations, why this was saved.
## Related
- [[library/images/related-image]]
- [[library/articles/context-article]]
## Attachments
| File | Description |
|------|-------------|
| original.png | Full resolution image |
| caption.txt | Alt text / caption |
FILE:templates/paper.md
---
title: ""
source: ""
url: ""
authors: []
date_published: ""
date_saved: ""
type: paper
tags: []
status: unread # unread | reading | read | reviewed
---
# {title}
## Abstract
Paper abstract or concise summary.
## Key Findings
- Finding 1
- Finding 2
- Finding 3
## Methodology
Brief description of approach/methods.
## Implications
Why this matters, real-world applications.
## Criticism / Limitations
Noted limitations or potential issues.
## Notes
Personal notes, connections to other work.
## Related Work
- [[library/papers/related-paper]]
- Cites: Author et al. (Year)
## Citation
```bibtex
@article{key,
title={},
author={},
journal={},
year={}
}
```
## Metadata
- **Journal/Conference**:
- **DOI**:
- **Citations**: X
- **Peer reviewed**: yes/no
FILE:templates/podcast.md
---
title: ""
source: ""
url: ""
episode: ""
host: ""
guests: []
release_date: ""
date_saved: ""
type: podcast
tags: []
status: unlistened # unlistened | listening | listened | reviewed
---
# {title}
## Summary
Episode summary and main discussion points.
## Key Insights
- Insight 1
- Insight 2
- Insight 3
## Show Notes / Chapters
### [00:00] Opening
Content...
### [10:00] Main Discussion
Key conversation points...
## Quotes
> "Memorable quote from the episode"
> — Speaker
## People Mentioned
- [[library/people/person-name]]
## Resources Mentioned
- Book: Title by Author
- Article: [Title](URL)
- Tool: [Name](URL)
## Notes
Personal reflections, follow-up items.
## Related
- [[library/podcasts/related-episode]]
## Metadata
- **Podcast**: Name
- **Episode**: #X
- **Duration**: HH:MM:SS
FILE:templates/tweet.md
---
title: ""
source: "Twitter/X"
url: ""
author: ""
tweet_id: ""
date_published: ""
date_saved: ""
last_updated: ""
type: tweet
tags: []
status: read
priority: normal
is_thread: false
related: []
---
# {title}
## Summary
2-3 sentence summary of the tweet/thread's core message.
## Key Points
- Point 1
- Point 2
- Point 3
## Original Content
**⚠️ MANDATORY: Paste the full original tweet text (or all tweets in thread) below.**
---
{full tweet text here, preserve line breaks and formatting}
---
## Media
Description of any images, videos, or links embedded in the tweet.
## Notes
Personal observations, why this was interesting, connections to other ideas.
## Related
- [[library/tweets/related-tweet]]
- [[library/articles/related-article]]
FILE:templates/video.md
---
title: ""
source: ""
url: ""
channel: ""
upload_date: ""
date_saved: ""
type: video
tags: []
status: unwatched # unwatched | watching | watched | reviewed
duration: ""
---
# {title}
## Summary
Brief summary of video content and key messages.
## Key Points
- Main point 1
- Main point 2
- Main point 3
## Transcript
Video transcript or key segments.
### [00:00] Introduction
Transcript segment...
### [05:30] Main Topic
Key segment transcript...
## Visual Notes
Description of important visuals, diagrams, demonstrations.
## Notes
Personal observations, insights, action items.
## Related
- [[library/videos/related-video]]
- [[library/articles/related-article]]
## Metadata
- **Platform**: YouTube / Bilibili / etc
- **Duration**: HH:MM:SS
- **Views**: X
- **Upload date**: YYYY-MM-DD
Advanced web crawling and content extraction tool with multiple extraction modes
---
name: web-crawl
version: "1.0.0"
description: Advanced web crawling and content extraction tool with multiple extraction modes
activation:
  keywords: ["crawl", "抓取", "提取网页", "研究", "深度研究", "research", "analyze website"]
  tags: ["web", "research", "crawling"]
---
# Web Crawl Skill
Advanced web content extraction with multiple modes and intelligent content detection.
## When to Use
Use this skill when:
- User asks to "研究" / "深度研究" a topic
- User wants to "抓取" / "提取" content from websites
- Need to analyze multiple web pages systematically
- Current `web_fetch` output is insufficient
## Extraction Modes
| Mode | Use Case |
|------|----------|
| `text` | Clean plain text |
| `markdown` | Formatted Markdown (recommended) |
| `links` | Extract all links |
| `structured` | JSON metadata + content |
| `full` | Markdown + links combined |
## Tools Available
- `web_crawl` - Extract content from a single URL
- `parallel_crawl` - Extract from multiple URLs in parallel
- `research_topic` - Multi-step research with search + crawl
## Example Usage
```
User: "研究一下 OpenManus-Max 项目"
→ Use research_topic tool with query="OpenManus-Max GitHub features"
```
FILE:EXAMPLES.md
# Web Crawl Skill - Usage Examples
## Basic Usage
### 1. Crawl Single URL
```python
# In Python
from web_crawl import crawl_url
# crawl_url is a regular (synchronous) function that returns a formatted string
result = crawl_url(
    url="https://example.com",
    mode="markdown",
    max_length=10000,
)
print(result)
```
### 2. Parallel Crawl Multiple URLs
```python
from web_crawl import parallel_crawl
urls = [
    "https://site1.com/article",
    "https://site2.com/guide",
    "https://site3.com/docs",
]
# parallel_crawl is synchronous; it runs the requests in a thread pool internally
result = parallel_crawl(
    urls=urls,
    mode="markdown",
    max_length=8000,
)
print(result)
```
## Deep Research Workflow
### Step 1: Search for Sources
```
web_search:0 {
"query": "OpenManus-Max features comparison",
"count": 8
}
```
### Step 2: Crawl Top Results
```
exec:1 {
"command": "cd ~/.openclaw/workspace-main/skills/web-crawl && python3 -c \"from web_crawl import parallel_crawl; print(parallel_crawl(['https://url1.com', 'https://url2.com'], mode='markdown', max_length=8000))\""
}
```
### Step 3: Analyze and Synthesize
Use the crawled content to:
- Extract key findings
- Compare sources
- Identify unique insights
- Cite sources
## Extraction Modes Explained
| Mode | Best For | Output Format |
|------|----------|---------------|
| `text` | Quick reading | Plain text |
| `markdown` | Content creation | Formatted Markdown |
| `links` | Discovery | List of links |
| `structured` | Data extraction | JSON metadata |
| `full` | Complete analysis | Markdown + Links |
## Advanced: CSS Selector Targeting
```python
from web_crawl import crawl_url
# Extract only the article content
result = crawl_url(
    url="https://example.com/page",
    mode="markdown",
    selector="article.main-content",  # CSS selector
)
```
## Research Templates
### Product Research
```python
# Research a product
queries = [
"ProductName review 2024",
"ProductName vs competitor",
"ProductName pricing features",
]
# 1. Search each query
# 2. Crawl top 3 results per query
# 3. Synthesize pros/cons, pricing, features
```
### Company Research
```python
# Research a company
queries = [
"CompanyName funding valuation",
"CompanyName leadership team",
"CompanyName recent news",
]
# 1. Search for company info
# 2. Crawl official site + news sources
# 3. Extract funding, team, products
```
### Technology Research
```python
# Research a technology
queries = [
"TechnologyName documentation",
"TechnologyName getting started",
"TechnologyName github examples",
]
# 1. Search for docs and tutorials
# 2. Crawl official docs + GitHub
# 3. Extract key concepts and examples
```
## Integration with OpenClaw
When user asks for research:
1. **Understand intent**: What type of research? (product, company, tech, topic)
2. **Generate search queries**: Use research.py templates
3. **Execute search**: Use web_search tool
4. **Crawl sources**: Use web_crawl tool
5. **Synthesize**: Analyze and present findings
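Between steps 3 and 4, the search results from several queries have to be narrowed to a crawl list. The sketch below shows one way to do it, assuming each search result is a dict with a `url` key (the actual `web_search` result shape is not specified here); `select_urls` is a hypothetical helper.

```python
from typing import Dict, List


def select_urls(
    results_by_query: Dict[str, List[Dict[str, str]]],
    per_query: int = 3,
) -> List[str]:
    """Pick up to `per_query` new URLs from each query's results, without duplicates."""
    seen = set()
    selected: List[str] = []
    for results in results_by_query.values():
        taken = 0
        for item in results:
            if taken >= per_query:
                break
            url = item.get("url", "").strip()
            key = url.rstrip("/")  # treat "https://a.com/" and "https://a.com" as equal
            if not url or key in seen:
                continue
            seen.add(key)
            selected.append(url)
            taken += 1
    return selected
```

The returned list can be passed straight to `parallel_crawl(urls=...)`.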
### Example Session
**User**: "研究一下 OpenManus-Max"
**Assistant**:
```
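I'll research OpenManus-Max in depth for you. Searching for relevant information first...
[Use web_search for multiple queries]
Found these key sources:
1. Official GitHub repository
2. Technical blog analysis
3. Community discussions
Now crawling the detailed content...
[Use parallel_crawl on top URLs]
## Research Findings
### Core Features
- ...
### Architecture
- ...
### Comparison with OpenManus
- ...
### Strengths and Limitations
- ...
---
Sources:
- [1] GitHub: OpenDemon/OpenManus-Max
- [2] Technical blog analysis
- [3] Community discussions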
```
FILE:README.md
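# 🕷️ Web Crawl Skill - Deep Research Enhancement Tool
## Implemented Features
✅ **Multi-mode content extraction**
- `text` - Plain text extraction
- `markdown` - Full Markdown conversion (formatting preserved)
- `links` - Link extraction
- `structured` - Structured JSON data
- `full` - Combined extraction
✅ **Smart content detection**
- Automatically detects main content regions such as `main`, `article`, and `content`
- Automatically strips scripts, styles, navigation, headers, footers, iframes, and similar non-content elements
✅ **Full HTML → Markdown conversion**
- Headings, paragraphs, links, images, lists, tables
- Code blocks, blockquotes, horizontal rules
✅ **Parallel crawling**
- Crawls multiple URLs at the same time
- Configurable concurrency (`max_workers`)
## Usage
### Method 1: Call the Python functions directly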
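```python
import os
import sys
# The directory name "web-crawl" contains a hyphen, so it cannot appear in a
# dotted import path; add the skill directory to sys.path instead.
sys.path.insert(0, os.path.expanduser("~/.openclaw/workspace-main/skills/web-crawl"))
from web_crawl import crawl_url, parallel_crawl
# Crawl a single page
result = crawl_url(
    url="https://example.com",
    mode="markdown",
    max_length=10000,
)
print(result)
# Crawl multiple pages in parallel
result = parallel_crawl(
    urls=["https://site1.com", "https://site2.com"],
    mode="markdown",
    max_length=8000,
)
print(result)
```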
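### Method 2: Command line
```bash
cd ~/.openclaw/workspace-main/skills/web-crawl
# Crawl a single URL
python3 web_crawl.py https://example.com markdown 5000
# Available modes: text, markdown, links, structured, full
```
### Method 3: Use as an agent tool
In conversation, when you need deep research, I will use this tool automatically:
**You**: "研究一下 OpenManus-Max" ("Research OpenManus-Max")
**Me**:
1. Use `web_search` to find relevant sources
2. Use `parallel_crawl` to crawl the top 5 results
3. Analyze, consolidate, and output a research report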
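## Deep Research Workflow
```
User question
↓
[web_search] Run several related queries
↓
[parallel_crawl] Crawl the top URLs in parallel
↓
[Analysis] Extract key information, compare, consolidate
↓
[Output] Structured research report + source citations
```
## Comparison with web_fetch
| Feature | web_fetch (old) | web_crawl (new) |
|------|----------------|----------------|
| Extraction modes | 2 (text/markdown) | 5 |
| CSS selectors | ❌ | ✅ |
| Smart main-content detection | Basic | Advanced |
| Link extraction | ❌ | ✅ |
| Structured data | ❌ | JSON output |
| Parallel crawling | ❌ | ✅ |
| Table conversion | ⚠️ | ✅ Full support |
## Sample Output
### Markdown Mode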
```markdown
# Page Title
## Section Heading
This is **bold** and *italic* text.
- List item 1
- List item 2
[Link text](https://example.com)
| Column 1 | Column 2 |
|----------|----------|
| Data 1 | Data 2 |
```
### Structured Mode
```json
{
"url": "https://example.com",
"title": "Page Title",
"description": "Meta description",
"headings": [
{"level": "h1", "text": "Main Title"},
{"level": "h2", "text": "Section"}
],
"main_text": "Content preview...",
"links_count": 42,
"images_count": 5
}
```
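Structured output like the sample above is plain JSON, so it can be post-processed directly. The sketch below builds an indented outline from the `headings` list; `outline_from_structured` is a hypothetical helper. It expects the raw `content` string from `WebCrawler.crawl(...)`, not the `crawl_url` return value, which carries a status line in front. Because `web_crawl.py` cuts the JSON string at `max_length`, long pages can yield invalid JSON, which the sketch reports as `None`.

```python
import json
from typing import Optional


def outline_from_structured(content: str) -> Optional[str]:
    """Build a nested heading outline from structured-mode JSON, or None if unparsable."""
    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        return None  # likely truncated at max_length
    lines = [data.get("title", "")]
    for heading in data.get("headings", []):
        depth = int(heading["level"][1]) - 1  # "h1" -> 0, "h2" -> 1, "h3" -> 2
        lines.append("  " * depth + "- " + heading["text"])
    return "\n".join(lines)
```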
## Install Dependencies
```bash
pip3 install requests beautifulsoup4
```
## File Structure
```
skills/web-crawl/
├── SKILL.md # Skill definition
├── web_crawl.py # Core crawler implementation
├── research.py # Deep research tool
├── EXAMPLES.md # Usage examples
└── README.md # This document
```
---
**You can now ask me to run deep research!** 🚀
FILE:research.py
#!/usr/bin/env python3
"""
Deep Research Tool - Multi-step research with search + crawl
This tool combines Brave search with advanced crawling for comprehensive research.
"""
import asyncio
import json
from typing import Any, Dict, List, Optional
# Import the crawler
from web_crawl import WebCrawler, parallel_crawl
async def brave_search(query: str, count: int = 8) -> List[Dict[str, str]]:
    """
    Perform Brave search using OpenClaw's web_search tool
    Returns list of results with title, url, snippet
    """
    # This is a wrapper that will be called via OpenClaw's tool system
    # For now, return mock structure
    return []
async def research_topic(
    query: str,
    max_sources: int = 5,
    extract_mode: str = "markdown",
    max_content_length: int = 8000,
) -> str:
    """
    Perform deep research on a topic
    Workflow:
        1. Search for relevant URLs
        2. Crawl top results in parallel
        3. Synthesize findings
    Args:
        query: Research topic/question
        max_sources: Maximum number of sources to analyze
        extract_mode: Content extraction mode
        max_content_length: Max content length per source
    Returns:
        Step-by-step research instructions (the search itself runs via OpenClaw's web_search tool)
    """
    # Step 1: Search (will be done via OpenClaw's web_search tool)
    # For now, we return instructions on how to use this
    instructions = f"""# Deep Research Instructions
To perform deep research on "{query}", follow these steps:
## Step 1: Search
Use OpenClaw's web_search tool to find relevant sources:
```
web_search:0 {{
"query": "{query}",
"count": {max_sources + 3}
}}
```
## Step 2: Crawl Sources
Use the parallel_crawl function on the top {max_sources} URLs:
```python
parallel_crawl(urls=[url1, url2, ...], mode="{extract_mode}", max_length={max_content_length})
```
## Step 3: Synthesize
Analyze the crawled content and provide:
- Key findings summary
- Important quotes/data points
- Source citations
- Any contradictions or gaps
---
**Tip**: For academic research, use mode="structured" to get metadata.
For content creation, use mode="markdown" for clean text.
"""
    return instructions
async def analyze_website(
    url: str,
    depth: int = 1,
    max_pages: int = 10,
) -> str:
    """
    Comprehensive website analysis
    Analyzes a website's structure, content, and linked pages
    """
    crawler = WebCrawler()
    # First, get the main page and extract links.
    # WebCrawler.crawl is synchronous, so it must not be awaited.
    result = crawler.crawl(url, mode="structured", max_length=50000)
    if not result["success"]:
        return f"❌ Failed to analyze {url}: {result.get('error')}"
    # Parse structured data to get links
    try:
        data = json.loads(result["content"])
        links_count = data.get("links_count", 0)
        title = data.get("title", "")
    except (json.JSONDecodeError, TypeError):
        # The JSON may have been truncated at max_length and no longer parse
        links_count = 0
        title = ""
    report = f"""# Website Analysis: {url}
## Overview
- **Title**: {title}
- **Total Links**: {links_count}
## Main Page Content
{result["content"][:3000]}
---
**Note**: For deeper analysis (following linked pages), use parallel_crawl on specific URLs.
"""
    return report
# Research planning templates
RESEARCH_TEMPLATES = {
    "product": {
        "description": "Research a product or service",
        "search_queries": [
            "{topic} review",
            "{topic} features",
            "{topic} pricing",
            "{topic} vs competitors",
        ],
    },
    "company": {
        "description": "Research a company or organization",
        "search_queries": [
            "{topic} company profile",
            "{topic} funding",
            "{topic} leadership",
            "{topic} news",
        ],
    },
    "technology": {
        "description": "Research a technology or framework",
        "search_queries": [
            "{topic} documentation",
            "{topic} tutorial",
            "{topic} github",
            "{topic} best practices",
        ],
    },
    "person": {
        "description": "Research a person",
        "search_queries": [
            "{topic} biography",
            "{topic} achievements",
            "{topic} interviews",
        ],
    },
    "topic": {
        "description": "General topic research",
        "search_queries": [
            "{topic} guide",
            "{topic} explained",
            "{topic} latest developments",
        ],
    },
}
def get_research_plan(topic: str, research_type: str = "topic") -> str:
    """
    Generate a research plan for a given topic
    Args:
        topic: Research topic
        research_type: Type of research (product, company, technology, person, topic)
    Returns:
        Research plan with suggested search queries
    """
    template = RESEARCH_TEMPLATES.get(research_type, RESEARCH_TEMPLATES["topic"])
    queries = [q.format(topic=topic) for q in template["search_queries"]]
    plan = f"""# Research Plan: {topic}
**Type**: {template['description']}
## Suggested Search Queries
"""
    for i, query in enumerate(queries, 1):
        plan += f"{i}. `{query}`\n"
    plan += f"""
## Execution Steps
1. **Search Phase**: Run each query, collect top 3-5 URLs per query
2. **Crawl Phase**: Use `parallel_crawl` to extract content from all URLs
3. **Analysis Phase**: Synthesize findings, identify key insights
4. **Verification Phase**: Cross-reference important claims across sources
## Extraction Mode Recommendations
- **For data/numbers**: Use `structured` mode
- **For content/ideas**: Use `markdown` mode
- **For link discovery**: Use `links` mode
---
Ready to start research? Run the search queries above!
"""
    return plan
if __name__ == "__main__":
    import sys
    # Every command takes at least one argument after the command name
    if len(sys.argv) < 3:
        print("Usage: python research.py <command> [args]")
        print("Commands:")
        print(" plan <topic> [type] - Generate research plan")
        print(" crawl <url> [mode] - Crawl single URL")
        print(" analyze <url> - Analyze website")
        sys.exit(1)
    command = sys.argv[1]
    if command == "plan":
        topic = sys.argv[2]
        rtype = sys.argv[3] if len(sys.argv) > 3 else "topic"
        print(get_research_plan(topic, rtype))
    elif command == "crawl":
        # crawl_url is synchronous and is not among the module-level imports
        from web_crawl import crawl_url
        url = sys.argv[2]
        mode = sys.argv[3] if len(sys.argv) > 3 else "markdown"
        result = crawl_url(url, mode)
        print(result)
    elif command == "analyze":
        url = sys.argv[2]
        result = asyncio.run(analyze_website(url))
        print(result)
    else:
        print(f"Unknown command: {command}")
        sys.exit(1)
FILE:web_crawl.py
#!/usr/bin/env python3
"""
OpenClaw Web Crawl Tool - Advanced Content Extraction
Enhanced version of web_fetch with multiple extraction modes
Features:
- Multiple extraction modes: text, markdown, links, structured, full
- CSS selector support for targeted extraction
- Intelligent main content detection
- Full HTML to Markdown conversion
- Parallel crawling capability
"""
import copy
import re
import json
import concurrent.futures
from typing import Any, Dict, List, Optional
from urllib.parse import urljoin, urlparse
try:
    import requests
    from bs4 import BeautifulSoup, NavigableString, Tag
except ImportError:
    raise ImportError("Required packages: requests, beautifulsoup4. Install: pip3 install requests beautifulsoup4")
class WebCrawler:
    """Advanced web content crawler and extractor"""
    def __init__(self, timeout: float = 30.0):
        self.timeout = timeout
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8",
            # "br" is left out: requests only decodes Brotli when an optional
            # brotli package is installed, and the response is unreadable otherwise
            "Accept-Encoding": "gzip, deflate",
            "DNT": "1",
            "Connection": "keep-alive",
        }
    def crawl(
        self,
        url: str,
        mode: str = "markdown",
        max_length: int = 10000,
        selector: str = "",
    ) -> Dict[str, Any]:
        """
        Crawl and extract content from a URL
        Args:
            url: Target URL
            mode: Extraction mode (text, markdown, links, structured, full)
            max_length: Maximum content length
            selector: Optional CSS selector for targeted extraction
        Returns:
            Dict with success status and extracted content
        """
        try:
            resp = requests.get(url, headers=self.headers, timeout=self.timeout, allow_redirects=True)
            resp.raise_for_status()
            html = resp.text
            soup = BeautifulSoup(html, "html.parser")
            # Apply CSS selector if provided
            if selector:
                elements = soup.select(selector)
                if not elements:
                    return {
                        "success": False,
                        "error": f"No elements found for selector: {selector}",
                    }
                # Create new soup with selected elements
                new_soup = BeautifulSoup("<div></div>", "html.parser")
                container = new_soup.div
                for el in elements:
                    # Tag has no .copy() method (el.copy looks up a <copy> child
                    # tag and returns None), so use copy.copy instead
                    container.append(copy.copy(el))
                soup = new_soup
            # Extract based on mode
            if mode == "text":
                result = self._extract_text(soup, max_length)
            elif mode == "markdown":
                result = self._extract_markdown(soup, url, max_length)
            elif mode == "links":
                result = self._extract_links(soup, url, max_length)
            elif mode == "structured":
                result = self._extract_structured(soup, url, max_length)
            elif mode == "full":
                result = self._extract_full(soup, url, max_length)
            else:
                # Unknown modes fall back to markdown
                result = self._extract_markdown(soup, url, max_length)
            return {
                "success": True,
                "url": url,
                "mode": mode,
                "content": result,
            }
        except requests.HTTPError as e:
            return {
                "success": False,
                "error": f"HTTP {e.response.status_code}: {url}",
            }
        except Exception as e:
            return {
                "success": False,
                "error": f"Crawl failed: {str(e)}",
            }
    def _clean_soup(self, soup: BeautifulSoup) -> BeautifulSoup:
        """Remove useless elements like scripts, styles, ads"""
        for tag in soup.find_all([
            "script", "style", "nav", "footer", "header",
            "aside", "noscript", "iframe", "svg", "canvas"
        ]):
            tag.decompose()
        return soup
    def _extract_text(self, soup: BeautifulSoup, max_length: int) -> str:
        """Extract clean plain text"""
        soup = self._clean_soup(soup)
        text = soup.get_text(separator="\n", strip=True)
        # Clean excessive newlines
        text = re.sub(r"\n{3,}", "\n\n", text)
        return text[:max_length]
    def _extract_markdown(self, soup: BeautifulSoup, base_url: str, max_length: int) -> str:
        """Extract content as formatted Markdown"""
        soup = self._clean_soup(soup)
        lines = []
        # Extract title
        title = soup.find("title")
        if title:
            title_text = title.get_text(strip=True)
            if title_text:
                lines.append(f"# {title_text}\n")
        # Find main content area (smart detection)
        main = (
            soup.find("main")
            or soup.find("article")
            or soup.find(attrs={"role": "main"})
            or soup.find("div", class_=re.compile(r"content|article|post|entry|body", re.I))
            or soup.body
            or soup
        )
        if main:
            lines.append(self._tag_to_markdown(main, base_url))
        result = "\n".join(lines)
        result = re.sub(r"\n{3,}", "\n\n", result)
        return result[:max_length]
    def _tag_to_markdown(self, tag: Tag, base_url: str) -> str:
        """Recursively convert HTML tags to Markdown"""
        parts = []
        for child in tag.children:
            if isinstance(child, NavigableString):
                text = str(child).strip()
                if text:
                    parts.append(text)
            elif isinstance(child, Tag):
                name = child.name
                if name in ("h1", "h2", "h3", "h4", "h5", "h6"):
                    level = int(name[1])
                    text = child.get_text(strip=True)
                    if text:
                        parts.append(f"\n{'#' * level} {text}\n")
                elif name == "p":
                    text = child.get_text(strip=True)
                    if text:
                        parts.append(f"\n{text}\n")
                elif name == "a":
                    text = child.get_text(strip=True)
                    href = child.get("href", "")
                    if href and text and not href.startswith(("#", "javascript:")):
                        full_url = urljoin(base_url, href)
                        parts.append(f"[{text}]({full_url})")
                    elif text:
                        parts.append(text)
                elif name == "img":
                    alt = child.get("alt", "image")
                    src = child.get("src", "")
                    if src:
                        full_url = urljoin(base_url, src)
                        # Markdown image syntax: ![alt](url)
                        parts.append(f"![{alt}]({full_url})")
                elif name in ("ul", "ol"):
                    items = child.find_all("li", recursive=False)
                    for i, li in enumerate(items):
                        prefix = f"{i+1}." if name == "ol" else "-"
                        text = li.get_text(strip=True)
                        if text:
                            parts.append(f"\n{prefix} {text}")
                    parts.append("")
                elif name in ("pre", "code"):
                    code = child.get_text()
                    if "\n" in code:
                        parts.append(f"\n```\n{code}\n```\n")
                    else:
                        parts.append(f"`{code.strip()}`")
                elif name == "blockquote":
                    text = child.get_text(strip=True)
                    if text:
                        parts.append(f"\n> {text}\n")
                elif name == "table":
                    parts.append(self._table_to_markdown(child))
                elif name in ("strong", "b"):
                    text = child.get_text(strip=True)
                    if text:
                        parts.append(f"**{text}**")
                elif name in ("em", "i"):
                    text = child.get_text(strip=True)
                    if text:
                        parts.append(f"*{text}*")
                elif name == "br":
                    parts.append("\n")
                elif name == "hr":
                    parts.append("\n---\n")
                else:
                    # Recursively process other tags
                    inner = self._tag_to_markdown(child, base_url)
                    if inner.strip():
                        parts.append(inner)
        return " ".join(parts)
    def _table_to_markdown(self, table: Tag) -> str:
        """Convert HTML table to Markdown table"""
        rows = table.find_all("tr")
        if not rows:
            return ""
        md_rows = []
        num_cols = 0
        for row in rows:
            cells = row.find_all(["th", "td"])
            md_cells = [cell.get_text(strip=True).replace("|", "\\|") for cell in cells]
            if md_cells:
                if not md_rows:
                    # Take the column count from the first row's cells; counting
                    # "|" characters would also count escaped pipes in cell text
                    num_cols = len(md_cells)
                md_rows.append("| " + " | ".join(md_cells) + " |")
        if md_rows:
            # Add separator row
            separator = "| " + " | ".join(["---"] * max(num_cols, 1)) + " |"
            md_rows.insert(1, separator)
        return "\n" + "\n".join(md_rows) + "\n"
    def _extract_links(self, soup: BeautifulSoup, base_url: str, max_length: int) -> str:
        """Extract all links from the page"""
        links = []
        for a in soup.find_all("a", href=True):
            href = a["href"]
            text = a.get_text(strip=True)
            if href.startswith(("#", "javascript:", "mailto:", "tel:")):
                continue
            full_url = urljoin(base_url, href)
            links.append(f"- [{text or 'link'}]({full_url})")
        result = f"Found {len(links)} links:\n\n" + "\n".join(links)
        return result[:max_length]
    def _extract_structured(self, soup: BeautifulSoup, base_url: str, max_length: int) -> str:
        """Extract structured data as JSON"""
        data = {
            "url": base_url,
            "title": "",
            "description": "",
            "headings": [],
            "main_text": "",
            "links_count": 0,
            "images_count": 0,
        }
        # Title
        title = soup.find("title")
        if title:
            data["title"] = title.get_text(strip=True)
        # Meta description
        meta_desc = soup.find("meta", attrs={"name": "description"})
        if meta_desc:
            data["description"] = meta_desc.get("content", "")
        # Headings
        for h in soup.find_all(["h1", "h2", "h3"]):
            text = h.get_text(strip=True)
            if text:
                data["headings"].append({"level": h.name, "text": text})
        # Counts
        data["links_count"] = len(soup.find_all("a", href=True))
        data["images_count"] = len(soup.find_all("img"))
        # Main text preview
        soup_clean = self._clean_soup(soup)
        data["main_text"] = soup_clean.get_text(separator=" ", strip=True)[:3000]
        return json.dumps(data, indent=2, ensure_ascii=False)[:max_length]
    def _extract_full(self, soup: BeautifulSoup, base_url: str, max_length: int) -> str:
        """Full extraction: Markdown + Links"""
        md = self._extract_markdown(soup, base_url, max_length // 2)
        links = self._extract_links(soup, base_url, max_length // 4)
        return f"{md}\n\n---\n\n{links}"[:max_length]
def crawl_url(
    url: str,
    mode: str = "markdown",
    max_length: int = 10000,
    selector: str = "",
) -> str:
    """
    Crawl a single URL and return formatted result
    Use this for extracting content from web pages.
    """
    crawler = WebCrawler()
    result = crawler.crawl(url, mode, max_length, selector)
    if result["success"]:
        return f"✅ Crawled: {result['url']}\n\n{result['content']}"
    else:
        return f"❌ Failed: {result['error']}"
def parallel_crawl(
    urls: List[str],
    mode: str = "markdown",
    max_length: int = 10000,
    max_workers: int = 5,
) -> str:
    """
    Crawl multiple URLs in parallel
    Use this for researching multiple sources at once.
    """
    crawler = WebCrawler()
    def crawl_one(url: str) -> Dict[str, Any]:
        result = crawler.crawl(url, mode, max_length)
        # Failure results from crawl() carry no "url" key; add it so the
        # report can name the source that failed
        result.setdefault("url", url)
        return result
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(crawl_one, url): url for url in urls}
        # Results are collected in completion order, not input order
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results.append(future.result())
            except Exception as e:
                results.append({"success": False, "error": str(e), "url": url})
    # Format results
    output = [f"# Parallel Crawl Results ({len(urls)} URLs)\n"]
    for i, result in enumerate(results, 1):
        url = result.get("url", "Unknown")
        output.append(f"\n---\n\n## Source {i}: {url}\n")
        if result.get("success"):
            content = result.get("content", "")
            # Truncate if too long
            if len(content) > max_length:
                content = content[:max_length] + "\n\n[Content truncated...]"
            output.append(content)
        else:
            output.append(f"❌ Failed: {result.get('error', 'Unknown error')}")
    return "\n".join(output)
# CLI interface
if __name__ == "__main__":
    import sys
    if len(sys.argv) < 2:
        print("Usage: python web_crawl.py <url> [mode] [max_length]")
        print("Modes: text, markdown, links, structured, full")
        sys.exit(1)
    url = sys.argv[1]
    mode = sys.argv[2] if len(sys.argv) > 2 else "markdown"
    max_length = int(sys.argv[3]) if len(sys.argv) > 3 else 10000
    result = crawl_url(url, mode, max_length)
    print(result)