不白

@clawhub-nowhitestar-9cdfdd2dad



Link Library

Skill

Personal knowledge base that captures web content (articles, tweets/threads, videos, podcasts, images, PDFs) and makes it retrievable for future conversation...

---
name: link-library
version: "1.0.0"
description: >
  Personal knowledge base that captures web content (articles, tweets/threads, videos, podcasts, images, PDFs)
  and makes it retrievable for future conversations and writing.
  Use when: (1) User shares a URL with ANY interest signal — asking to summarize, commenting positively,
  saying "有意思/不错/interesting/值得看/学到了", or requesting it be saved,
  (2) User asks to find previously saved content ("我之前存的那篇...", "find that article about..."),
  (3) User needs reference material for writing or discussion,
  (4) User shares Twitter/X, WeChat, YouTube, Bilibili, or any web link and engages with it.
  Interest signals that trigger save: "帮我总结一下", "这篇不错", "有意思", "记一下", "留着以后用",
  "这个观点很好", "学到了", "值得保存", "放进知识库", sharing link + any commentary or opinion,
  asking follow-up questions about link content. Do NOT require literal "save"/"bookmark" keywords.
---

# Link Library — Personal Content Knowledge Base

Save web content with full original text, generate summaries and tags, retrieve semantically.

## Core Rules

1. **Always save original full text** — summaries are for retrieval, originals are for re-reading
2. **Detect interest, don't demand commands** — if user engages with a link, offer to save
3. **Twitter/X is first-class** — tweets, threads, and articles are fully supported

## Interest Detection

When user shares a link, evaluate interest signals:

**Auto-save (no confirmation needed):**
- User explicitly says save/bookmark/记一下/放进知识库
- User asks "帮我总结一下" (summarize implies save-worthy)

**Offer to save (ask once):**
- User shares link + positive commentary ("这篇不错", "有意思", "学到了")
- User asks follow-up questions about link content
- User discusses link content substantively

**Don't save:**
- User shares link just for quick reference in conversation
- User says "不用保存" or similar
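The three tiers above can be sketched as a small classifier. This is only an illustration: the phrase lists come from the rules above, but plain substring matching, the `SaveAction` enum, and the `classify_interest` name are assumptions of this sketch. Follow-up questions and substantive discussion need conversational context that a keyword check cannot capture.

```python
from enum import Enum


class SaveAction(Enum):
    AUTO_SAVE = "auto_save"  # save without asking
    OFFER = "offer"          # ask once whether to save
    SKIP = "skip"            # do not save


# Phrases taken from the rules above. Substring matching is an assumption
# made for this sketch, not the skill's actual detection logic.
DECLINE_PHRASES = ("不用保存",)
EXPLICIT_PHRASES = ("save", "bookmark", "记一下", "放进知识库", "帮我总结一下")
POSITIVE_PHRASES = ("这篇不错", "有意思", "学到了")


def classify_interest(message: str) -> SaveAction:
    """Map a user message that accompanies a link to a save action."""
    text = message.lower()
    # An explicit refusal wins over every other signal
    if any(p in text for p in DECLINE_PHRASES):
        return SaveAction.SKIP
    if any(p in text for p in EXPLICIT_PHRASES):
        return SaveAction.AUTO_SAVE
    if any(p in text for p in POSITIVE_PHRASES):
        return SaveAction.OFFER
    # A bare link with no commentary is treated as quick reference
    return SaveAction.SKIP
```

Checking refusals first keeps a message such as "不用保存" from being caught by a later, broader rule.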

## Data Location

All entries in `~/.openclaw/workspace-main/library/`:

```
library/
├── articles/     # Web articles, blog posts, WeChat, Zhihu
├── tweets/       # Twitter/X posts and threads
├── videos/       # YouTube, Bilibili
├── podcasts/     # Podcast episodes
├── papers/       # Academic papers, PDFs
├── images/       # Infographics, visual content
└── misc/         # Everything else
```

## Content Types & Fetch Methods

| Type | URL Patterns | Fetch Method | Template |
|------|-------------|--------------|----------|
| article | Generic web, blog, /post/ | `web_fetch` or `curl -s "https://r.jina.ai/URL"` | `article.md` |
| wechat | mp.weixin.qq.com | `cd ~/.agent-reach/tools/wechat-article-for-ai && python3 main.py "URL"` | `article.md` |
| tweet | x.com, twitter.com /status/ | `xreach tweet URL --json` | `tweet.md` |
| thread | x.com, twitter.com (thread) | `xreach thread URL --json` | `tweet.md` |
| video | youtube.com, youtu.be | `yt-dlp --dump-json "URL"` + subtitle extraction | `video.md` |
| bilibili | bilibili.com | `yt-dlp --dump-json "URL"` + subtitle extraction | `video.md` |
| paper | arxiv.org, .pdf links | `web_fetch` or browser | `paper.md` |
| podcast | Podcast platforms | `web_fetch` metadata | `podcast.md` |
| image | Image URLs | Download + describe | `image.md` |
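As a rough illustration of the URL-pattern column, the matching could look like the sketch below. The `identify_type` name and the list of image extensions are assumptions of this sketch; podcast platforms are left out because the table does not name any, and telling a thread from a single tweet needs the fetched data rather than the URL.

```python
from urllib.parse import urlparse


def identify_type(url: str) -> str:
    """Return a content type from the table above for a URL (sketch)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.lower()

    def host_is(*domains: str) -> bool:
        # Match the domain itself or any subdomain (www., m., mobile., ...)
        return any(host == d or host.endswith("." + d) for d in domains)

    if host_is("mp.weixin.qq.com"):
        return "wechat"
    if host_is("x.com", "twitter.com") and "/status/" in path:
        return "tweet"
    if host_is("youtube.com", "youtu.be"):
        return "video"
    if host_is("bilibili.com"):
        return "bilibili"
    if host_is("arxiv.org") or path.endswith(".pdf"):
        return "paper"
    # Assumed extension list; the table only says "Image URLs"
    if path.endswith((".png", ".jpg", ".jpeg", ".gif", ".webp")):
        return "image"
    # Everything else falls back to the generic article fetcher
    return "article"
```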

### Twitter/X Fetch Details

```bash
# Single tweet
xreach tweet URL_OR_ID --json

# Full thread
xreach thread URL_OR_ID --json

# User timeline (for context)
xreach tweets @username -n 20 --json
```

Extract from JSON: `full_text`, `user.screen_name`, `created_at`, `entities`, media URLs.
For threads: concatenate all tweets in order as full content.
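A minimal sketch of that extraction step follows. The field names `full_text`, `user.screen_name`, and `created_at` come from the line above; the overall JSON shape (one object for a tweet, an ordered array for a thread) and the `thread_to_text` name are assumptions, so check them against real `xreach` output before relying on this.

```python
import json


def thread_to_text(raw_json: str) -> dict:
    """Join the tweets of a thread into one text block (sketch)."""
    data = json.loads(raw_json)
    # Assumption: a thread is a JSON array of tweets already in posting
    # order, and a single tweet is one JSON object.
    tweets = data if isinstance(data, list) else [data]
    if not tweets:
        raise ValueError("no tweets in input")
    first = tweets[0]
    return {
        "author": "@" + first["user"]["screen_name"],
        "date_published": first["created_at"],
        "is_thread": len(tweets) > 1,
        # Blank line between tweets keeps the original boundaries visible
        "content": "\n\n".join(t["full_text"] for t in tweets),
    }
```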

### Video Subtitle Extraction

```bash
# Download subtitles (manual and auto-generated) without the video itself
yt-dlp --write-subs --write-auto-subs --sub-langs "zh-Hans,zh,en" \
  --convert-subs vtt --skip-download -o "/tmp/%(id)s" "URL"
# Then read the resulting .vtt file as the transcript
```
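Reading the `.vtt` file "as transcript" usually means stripping the header, cue timings, and inline tags first. The sketch below shows one way to do that; `vtt_to_transcript` is a hypothetical helper, and it ignores rarer WebVTT features such as `NOTE` and `STYLE` blocks.

```python
import re

# Cue timing lines such as "00:00:01.000 --> 00:00:04.000 align:start"
TIMING = re.compile(r"^(\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->")
# Inline tags such as <c> or <00:00:01.500> that auto-captions often carry
INLINE_TAG = re.compile(r"<[^>]+>")
# Header lines found at the top of downloaded subtitle files
HEADER_PREFIXES = ("WEBVTT", "Kind:", "Language:")


def vtt_to_transcript(vtt_text: str) -> str:
    """Reduce WebVTT text to plain transcript lines (sketch)."""
    lines = []
    for raw in vtt_text.splitlines():
        line = INLINE_TAG.sub("", raw).strip()
        if not line or line.startswith(HEADER_PREFIXES):
            continue
        if TIMING.match(line):
            continue
        # Rolling auto-captions repeat the previous line in the next cue
        if lines and lines[-1] == line:
            continue
        lines.append(line)
    return "\n".join(lines)
```

The result can go straight into the `## Transcript` section of `video.md`.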

## Entry Structure

Every entry has two parts:

### 1. YAML Frontmatter (structured metadata)
```yaml
title: "..."
source: "..."           # Platform/domain
url: "..."              # Original URL
author: "..."           # Author or @handle
date_published: "..."   # When content was created
date_saved: "..."       # When we saved it
last_updated: "..."     # Last modification
type: article|tweet|video|podcast|paper|image
tags: [tag1, tag2, ...]
status: unread|read|reviewed
priority: low|normal|high
related: []             # Paths to related entries
```

### 2. Markdown Body (content)
```markdown
# {title}

## Summary
2-3 sentence summary.

## Key Points
- Point 1
- Point 2

## Original Content
THE FULL ORIGINAL TEXT — not truncated, not summarized.
This is the authoritative source for re-reading and quoting.

## Quotes
> Notable quotes worth highlighting

## Notes
Personal observations, connections, action items.

## Related
- [[library/tweets/related-tweet]]
- [[library/articles/related-article]]
```

**⚠️ MANDATORY: Always save original full text in "Original Content" section.**
Summaries and key points are for quick retrieval. The original text is for accurate re-reading and quoting. Never skip saving the full content.

## Filename Convention

`<slugified-title>-<YYYY-MM-DD>.md`

Examples:
- `library/articles/yc-why-not-work-and-startup-2026-03-12.md`
- `library/tweets/garry-tan-on-yc-advice-2026-03-13.md`
- `library/videos/how-to-build-agents-2026-03-13.md`
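The convention does not spell out the slug rules, so the sketch below assumes the simplest one that reproduces the examples: lowercase ASCII letters and digits joined by hyphens. The `entry_filename` name and the `untitled` fallback are placeholders chosen for this sketch.

```python
import re
from datetime import date


def entry_filename(title: str, saved_on: date) -> str:
    """Build `<slugified-title>-<YYYY-MM-DD>.md` (sketch)."""
    # Lowercase, then collapse every run of characters other than ASCII
    # letters and digits into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    # Titles with no ASCII letters or digits (for example all-Chinese
    # titles) leave an empty slug, so fall back to a placeholder.
    slug = slug or "untitled"
    return f"{slug}-{saved_on.isoformat()}.md"
```

For Chinese titles a real implementation would probably transliterate or use a short English gloss rather than the placeholder.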

## Save Workflow

1. **Detect URL** — Parse link from user message
2. **Identify type** — Match URL pattern to content type
3. **Check dedup** — `memory_search("URL or title")` to avoid duplicates
4. **Fetch content** — Use appropriate method from table above
5. **Generate metadata** — Title, summary, key points, tags (3-7)
6. **Write entry** — Use template, fill frontmatter + full original text
7. **Confirm** — Tell user: title, tags, and where it's saved
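Step 6 can be sketched as a pure function that turns metadata and text into the entry file's contents. `render_entry` is a hypothetical helper, not part of the skill. It serializes frontmatter values with `json.dumps`, relying on YAML 1.2 being designed as a superset of JSON, which avoids a YAML dependency.

```python
import json


def render_entry(meta: dict, summary: str, key_points: list, original: str) -> str:
    """Render frontmatter plus the core body sections (sketch)."""
    front = ["---"]
    for key, value in meta.items():
        # JSON strings and lists are accepted as YAML scalars and flow
        # sequences, so json.dumps gives safe quoting for free.
        front.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    front.append("---")
    points = "\n".join(f"- {p}" for p in key_points)
    body = (
        f"# {meta['title']}\n\n"
        f"## Summary\n{summary}\n\n"
        f"## Key Points\n{points}\n\n"
        # The full original text goes in untouched, never truncated
        f"## Original Content\n{original}\n"
    )
    return "\n".join(front) + "\n\n" + body
```

Writing the returned string to `library/<type>/<filename>` completes the step.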

## Search & Retrieval

```python
# Semantic search
memory_search("创业方法论")
memory_search("Garry Tan 的推文")
memory_search("AI agent 视频教程")

# Read specific entry
memory_get("library/tweets/garry-tan-on-yc-advice-2026-03-13.md")
```

When returning search results, show:
- Title + source + date
- Summary (2 lines max)
- Tags
- Offer to show full original text

## Writing Reference Mode

When user asks to write something using saved content:

1. Search library for relevant entries
2. Read full original text of top matches
3. Synthesize insights, cite sources inline
4. Format citations as `[[library/type/entry-name]]`

## Templates

Located in `templates/`:
- `article.md` — Web articles, blog posts, newsletters
- `tweet.md` — Twitter/X posts and threads
- `video.md` — Videos with transcript
- `podcast.md` — Podcast episodes
- `paper.md` — Academic papers
- `image.md` — Visual content

## Best Practices

- **Save originals religiously** — summaries lose nuance
- **Tag consistently** — reuse existing tags, keep vocabulary tight
- **Link related entries** — build a knowledge graph over time
- **Don't over-ask** — if interest is clear, just save and confirm

FILE:templates/article.md
---
title: ""
source: ""
url: ""
author: ""
date_published: ""
date_saved: ""
last_updated: ""
type: article
tags: []
status: unread
priority: normal
related: []
---

# {title}

## Summary

2-3 sentence summary of the article's main points.

## Key Points

- Main takeaway 1
- Main takeaway 2
- Main takeaway 3

## Original Content

**⚠️ MANDATORY: Paste the full original article text below. Never truncate.**

---

{full article text here}

---

## Quotes

> "Notable quote from the article"
> — Author

## Notes

Personal notes, insights, connections to other ideas.

## Related

- [[library/articles/related-entry]]

FILE:templates/image.md
---
title: ""
source: ""
url: ""
creator: ""
date_published: ""
date_saved: ""
type: image
tags: []
---

# {title}

## Description

What this image shows, its purpose, key elements.

## Visual Analysis

- **Type**: Infographic / Diagram / Screenshot / Photo
- **Key elements**: What's depicted
- **Style**: Minimalist / Detailed / Technical

## Content Summary

Text content, data, or information presented in the image.

## Insights

What can be learned from this visual.

## Use Cases

When to reference this image.

## Notes

Personal observations, why this was saved.

## Related

- [[library/images/related-image]]
- [[library/articles/context-article]]

## Attachments

| File | Description |
|------|-------------|
| original.png | Full resolution image |
| caption.txt | Alt text / caption |

FILE:templates/paper.md
---
title: ""
source: ""
url: ""
authors: []
date_published: ""
date_saved: ""
type: paper
tags: []
status: unread  # unread | reading | read | reviewed
---

# {title}

## Abstract

Paper abstract or concise summary.

## Key Findings

- Finding 1
- Finding 2
- Finding 3

## Methodology

Brief description of approach/methods.

## Implications

Why this matters, real-world applications.

## Criticism / Limitations

Noted limitations or potential issues.

## Notes

Personal notes, connections to other work.

## Related Work

- [[library/papers/related-paper]]
- Cites: Author et al. (Year)

## Citation

```bibtex
@article{key,
  title={},
  author={},
  journal={},
  year={}
}
```

## Metadata

- **Journal/Conference**: 
- **DOI**: 
- **Citations**: X
- **Peer reviewed**: yes/no

FILE:templates/podcast.md
---
title: ""
source: ""
url: ""
episode: ""
host: ""
guests: []
release_date: ""
date_saved: ""
type: podcast
tags: []
status: unlistened  # unlistened | listening | listened | reviewed
---

# {title}

## Summary

Episode summary and main discussion points.

## Key Insights

- Insight 1
- Insight 2
- Insight 3

## Show Notes / Chapters

### [00:00] Opening

Content...

### [10:00] Main Discussion

Key conversation points...

## Quotes

> "Memorable quote from the episode"
> — Speaker

## People Mentioned

- [[library/people/person-name]]

## Resources Mentioned

- Book: Title by Author
- Article: [Title](URL)
- Tool: [Name](URL)

## Notes

Personal reflections, follow-up items.

## Related

- [[library/podcasts/related-episode]]

## Metadata

- **Podcast**: Name
- **Episode**: #X
- **Duration**: HH:MM:SS

FILE:templates/tweet.md
---
title: ""
source: "Twitter/X"
url: ""
author: ""
tweet_id: ""
date_published: ""
date_saved: ""
last_updated: ""
type: tweet
tags: []
status: read
priority: normal
is_thread: false
related: []
---

# {title}

## Summary

2-3 sentence summary of the tweet/thread's core message.

## Key Points

- Point 1
- Point 2
- Point 3

## Original Content

**⚠️ MANDATORY: Paste the full original tweet text (or all tweets in thread) below.**

---

{full tweet text here, preserve line breaks and formatting}

---

## Media

Description of any images, videos, or links embedded in the tweet.

## Notes

Personal observations, why this was interesting, connections to other ideas.

## Related

- [[library/tweets/related-tweet]]
- [[library/articles/related-article]]

FILE:templates/video.md
---
title: ""
source: ""
url: ""
channel: ""
upload_date: ""
date_saved: ""
type: video
tags: []
status: unwatched  # unwatched | watching | watched | reviewed
duration: ""
---

# {title}

## Summary

Brief summary of video content and key messages.

## Key Points

- Main point 1
- Main point 2
- Main point 3

## Transcript

Video transcript or key segments.

### [00:00] Introduction

Transcript segment...

### [05:30] Main Topic

Key segment transcript...

## Visual Notes

Description of important visuals, diagrams, demonstrations.

## Notes

Personal observations, insights, action items.

## Related

- [[library/videos/related-video]]
- [[library/articles/related-article]]

## Metadata

- **Platform**: YouTube / Bilibili / etc
- **Duration**: HH:MM:SS
- **Views**: X
- **Upload date**: YYYY-MM-DD


Web Crawl

Skill

Advanced web crawling and content extraction tool with multiple extraction modes

---
name: web-crawl
version: "1.0.0"
description: Advanced web crawling and content extraction tool with multiple extraction modes
activation:
  keywords: ["crawl", "抓取", "提取网页", "研究", "深度研究", "research", "analyze website"]
  tags: ["web", "research", "crawling"]
---

# Web Crawl Skill

Advanced web content extraction with multiple modes and intelligent content detection.

## When to Use

Use this skill when:
- User asks to "研究" / "深度研究" a topic
- User wants to "抓取" / "提取" content from websites
- Need to analyze multiple web pages systematically
- Current `web_fetch` output is insufficient

## Extraction Modes

| Mode | Use Case |
|------|----------|
| `text` | Clean plain text |
| `markdown` | Formatted Markdown (recommended) |
| `links` | Extract all links |
| `structured` | JSON metadata + content |
| `full` | Markdown + links combined |

## Tools Available

- `web_crawl` - Extract content from a single URL
- `parallel_crawl` - Extract from multiple URLs in parallel
- `research_topic` - Multi-step research with search + crawl
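The skill's own `parallel_crawl` is not reproduced in this part of the page. The sketch below only illustrates the general idea of fanning a fetch function out over a thread pool. `crawl_many` and its `fetch` parameter are hypothetical names, and the error dictionary mirrors the `success`/`error` shape that `WebCrawler.crawl` returns.

```python
import concurrent.futures
from typing import Callable, Dict, List


def crawl_many(
    urls: List[str],
    fetch: Callable[[str], Dict],
    max_workers: int = 5,
) -> List[Dict]:
    """Run `fetch` over `urls` in a thread pool, keeping input order (sketch)."""
    results: List[Dict] = [{} for _ in urls]
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        future_to_index = {pool.submit(fetch, u): i for i, u in enumerate(urls)}
        for future in concurrent.futures.as_completed(future_to_index):
            i = future_to_index[future]
            try:
                results[i] = future.result()
            except Exception as exc:
                # One failed URL should not sink the whole batch
                results[i] = {"success": False, "url": urls[i], "error": str(exc)}
    return results
```

Threads suit this workload because each fetch spends most of its time waiting on the network.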

## Example Usage

```
User: "研究一下 OpenManus-Max 项目"
→ Use research_topic tool with query="OpenManus-Max GitHub features"
```

FILE:EXAMPLES.md
# Web Crawl Skill - Usage Examples

## Basic Usage

### 1. Crawl Single URL

```python
# In Python
import asyncio
from web_crawl import crawl_url

result = asyncio.run(crawl_url(
    url="https://example.com",
    mode="markdown",
    max_length=10000
))
print(result)
```

### 2. Parallel Crawl Multiple URLs

```python
import asyncio

from web_crawl import parallel_crawl

urls = [
    "https://site1.com/article",
    "https://site2.com/guide",
    "https://site3.com/docs",
]

result = asyncio.run(parallel_crawl(
    urls=urls,
    mode="markdown",
    max_length=8000
))
print(result)
```

## Deep Research Workflow

### Step 1: Search for Sources

```
web_search:0 {
  "query": "OpenManus-Max features comparison",
  "count": 8
}
```

### Step 2: Crawl Top Results

```
exec:1 {
  "command": "cd ~/.openclaw/workspace-main/skills/web-crawl && python3 -c \"import asyncio; from web_crawl import parallel_crawl; print(asyncio.run(parallel_crawl(['https://url1.com', 'https://url2.com'], mode='markdown', max_length=8000)))\""
}
```

### Step 3: Analyze and Synthesize

Use the crawled content to:
- Extract key findings
- Compare sources
- Identify unique insights
- Cite sources

## Extraction Modes Explained

| Mode | Best For | Output Format |
|------|----------|---------------|
| `text` | Quick reading | Plain text |
| `markdown` | Content creation | Formatted Markdown |
| `links` | Discovery | List of links |
| `structured` | Data extraction | JSON metadata |
| `full` | Complete analysis | Markdown + Links |

## Advanced: CSS Selector Targeting

```python
# Extract only the article content
result = asyncio.run(crawl_url(
    url="https://example.com/page",
    mode="markdown",
    selector="article.main-content"  # CSS selector
))
```

## Research Templates

### Product Research

```python
# Research a product
queries = [
    "ProductName review 2024",
    "ProductName vs competitor",
    "ProductName pricing features",
]

# 1. Search each query
# 2. Crawl top 3 results per query
# 3. Synthesize pros/cons, pricing, features
```

### Company Research

```python
# Research a company
queries = [
    "CompanyName funding valuation",
    "CompanyName leadership team",
    "CompanyName recent news",
]

# 1. Search for company info
# 2. Crawl official site + news sources
# 3. Extract funding, team, products
```

### Technology Research

```python
# Research a technology
queries = [
    "TechnologyName documentation",
    "TechnologyName getting started",
    "TechnologyName github examples",
]

# 1. Search for docs and tutorials
# 2. Crawl official docs + GitHub
# 3. Extract key concepts and examples
```

## Integration with OpenClaw

When user asks for research:

1. **Understand intent**: What type of research? (product, company, tech, topic)
2. **Generate search queries**: Use research.py templates
3. **Execute search**: Use web_search tool
4. **Crawl sources**: Use web_crawl tool
5. **Synthesize**: Analyze and present findings
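Between steps 3 and 4, results from several queries usually overlap, so the URLs need merging before they are crawled. A minimal sketch follows, assuming each search result is a dict with a `url` key (the shape described in the `brave_search` docstring of `research.py`); `collect_urls` is a hypothetical helper.

```python
from typing import Dict, List


def collect_urls(
    results_by_query: Dict[str, List[Dict[str, str]]],
    per_query: int = 3,
) -> List[str]:
    """Take the top results of each query and drop duplicate URLs (sketch)."""
    seen = set()
    urls: List[str] = []
    for results in results_by_query.values():
        for item in results[:per_query]:
            # Treat a trailing slash as insignificant when comparing
            key = item["url"].rstrip("/")
            if key not in seen:
                seen.add(key)
                urls.append(item["url"])
    return urls
```

The returned list keeps first-seen order, so sources from earlier queries are crawled first.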

### Example Session

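**User**: "研究一下 OpenManus-Max" ("Research OpenManus-Max")

**Assistant**:
```
I'll do a deep dive on OpenManus-Max for you. Searching for relevant information first...

[Use web_search for multiple queries]

Found the following key sources:
1. Official GitHub repository
2. Technical blog analysis
3. Community discussions

Now crawling the detailed content...

[Use parallel_crawl on top URLs]

## Research Findings

### Core Features
- ...

### Architecture
- ...

### Comparison with OpenManus
- ...

### Strengths and Limitations
- ...

---
Sources:
- [1] GitHub: OpenDemon/OpenManus-Max
- [2] Technical blog analysis
- [3] Community discussions
```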

FILE:README.md
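# 🕷️ Web Crawl Skill - Deep Research Enhancement Tool

## Implemented Features

✅ **Multi-mode content extraction**
- `text` - plain text extraction
- `markdown` - full Markdown conversion (formatting preserved)
- `links` - link extraction
- `structured` - structured JSON data
- `full` - combined extraction

✅ **Intelligent content detection**
- Automatically detects main content areas such as `main`, `article`, and `content`
- Automatically strips scripts, styles, ads, and similar elements

✅ **Full HTML → Markdown conversion**
- Headings, paragraphs, links, images, lists, tables
- Code blocks, blockquotes, horizontal rules

✅ **Parallel crawling**
- Crawls multiple URLs at the same time
- Configurable concurrency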

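## Usage

### Method 1: Call the Python functions directly

```python
import sys
from pathlib import Path

# The directory name contains a hyphen, so it cannot be imported as a
# package; add it to sys.path and import the module directly.
sys.path.insert(0, str(Path.home() / ".openclaw/workspace-main/skills/web-crawl"))
from web_crawl import crawl_url, parallel_crawl

# Crawl a single page
result = crawl_url(
    url="https://example.com",
    mode="markdown",
    max_length=10000
)
print(result)

# Crawl multiple pages in parallel
result = parallel_crawl(
    urls=["https://site1.com", "https://site2.com"],
    mode="markdown",
    max_length=8000
)
print(result)
```

### Method 2: Command line

```bash
cd ~/.openclaw/workspace-main/skills/web-crawl

# Crawl a single URL
python3 web_crawl.py https://example.com markdown 5000

# Available modes: text, markdown, links, structured, full
```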

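### Method 3: Use as an Agent tool

In conversation, when you need deep research, I will use this tool automatically:

**You**: "研究一下 OpenManus-Max" ("Research OpenManus-Max")

**Me**:
1. Use `web_search` to find relevant sources
2. Use `parallel_crawl` to crawl the top 5 results
3. Analyze, consolidate, and output a research report

## Deep Research Workflow

```
User question
    ↓
[web_search] Search multiple related queries
    ↓
[parallel_crawl] Crawl top URLs in parallel
    ↓
[Analysis] Extract key information, compare, consolidate
    ↓
[Output] Structured research report + source citations
```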

## Comparison with web_fetch

| Feature | web_fetch (old) | web_crawl (new) |
|------|----------------|----------------|
| Extraction modes | 2 (text/markdown) | 5 |
| CSS selectors | ❌ | ✅ |
| Main-content detection | Basic | Advanced |
| Link extraction | ❌ | ✅ |
| Structured data | ❌ | JSON output |
| Parallel crawling | ❌ | ✅ |
| Table conversion | ⚠️ | ✅ Full support |

## Sample Output

### Markdown Mode

```markdown
# Page Title

## Section Heading

This is **bold** and *italic* text.

- List item 1
- List item 2

[Link text](https://example.com)

| Column 1 | Column 2 |
|----------|----------|
| Data 1   | Data 2   |
```

### Structured Mode

```json
{
  "url": "https://example.com",
  "title": "Page Title",
  "description": "Meta description",
  "headings": [
    {"level": "h1", "text": "Main Title"},
    {"level": "h2", "text": "Section"}
  ],
  "main_text": "Content preview...",
  "links_count": 42,
  "images_count": 5
}
```

## Installing Dependencies

```bash
pip3 install requests beautifulsoup4
```

## File Structure

```
skills/web-crawl/
├── SKILL.md          # Skill definition file
├── web_crawl.py      # Core crawler implementation
├── research.py       # Deep research tool
├── EXAMPLES.md       # Usage examples
└── README.md         # This document
```

---

**You can now ask me to do deep research!** 🚀

FILE:research.py
#!/usr/bin/env python3
"""
Deep Research Tool - Multi-step research with search + crawl

This tool combines Brave search with advanced crawling for comprehensive research.
"""

import asyncio
import json
from typing import Any, Dict, List, Optional

# Import the crawler
from web_crawl import WebCrawler, parallel_crawl


async def brave_search(query: str, count: int = 8) -> List[Dict[str, str]]:
    """
    Perform Brave search using OpenClaw's web_search tool
    
    Returns list of results with title, url, snippet
    """
    # This is a wrapper that will be called via OpenClaw's tool system
    # For now, return mock structure
    return []


async def research_topic(
    query: str,
    max_sources: int = 5,
    extract_mode: str = "markdown",
    max_content_length: int = 8000,
) -> str:
    """
    Perform deep research on a topic
    
    Workflow:
    1. Search for relevant URLs
    2. Crawl top results in parallel
    3. Synthesize findings
    
    Args:
        query: Research topic/question
        max_sources: Maximum number of sources to analyze
        extract_mode: Content extraction mode
        max_content_length: Max content length per source
    
    Returns:
        Comprehensive research report
    """
    # Step 1: Search (will be done via OpenClaw's web_search tool)
    # For now, we return instructions on how to use this
    
    instructions = f"""# Deep Research Instructions

To perform deep research on "{query}", follow these steps:

## Step 1: Search
Use OpenClaw's web_search tool to find relevant sources:
```
web_search:0 {{
  "query": "{query}",
  "count": {max_sources + 3}
}}
```

## Step 2: Crawl Sources
Use the parallel_crawl function on the top {max_sources} URLs:
```python
parallel_crawl(urls=[url1, url2, ...], mode="{extract_mode}", max_length={max_content_length})
```

## Step 3: Synthesize
Analyze the crawled content and provide:
- Key findings summary
- Important quotes/data points
- Source citations
- Any contradictions or gaps

---

**Tip**: For academic research, use mode="structured" to get metadata.
For content creation, use mode="markdown" for clean text.
"""
    
    return instructions


async def analyze_website(
    url: str,
    depth: int = 1,
    max_pages: int = 10,
) -> str:
    """
    Comprehensive website analysis
    
    Analyzes a website's structure, content, and linked pages
    """
    crawler = WebCrawler()
    
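    # First, get the main page and extract links.
    # WebCrawler.crawl is a synchronous method, so run it in a worker thread
    # instead of awaiting its return value directly.
    result = await asyncio.to_thread(crawler.crawl, url, mode="structured", max_length=50000)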
    
    if not result["success"]:
        return f"❌ Failed to analyze {url}: {result.get('error')}"
    
    # Parse structured data to get links
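    try:
        data = json.loads(result["content"])
        links_count = data.get("links_count", 0)
        title = data.get("title", "")
    except (ValueError, TypeError, AttributeError):
        # Content was not a JSON object; fall back to empty metadata
        links_count = 0
        title = ""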
    
    report = f"""# Website Analysis: {url}

## Overview
- **Title**: {title}
- **Total Links**: {links_count}

## Main Page Content
{result["content"][:3000]}

---

**Note**: For deeper analysis (following linked pages), use parallel_crawl on specific URLs.
"""
    
    return report


# Research planning templates
RESEARCH_TEMPLATES = {
    "product": {
        "description": "Research a product or service",
        "search_queries": [
            "{topic} review",
            "{topic} features",
            "{topic} pricing",
            "{topic} vs competitors",
        ],
    },
    "company": {
        "description": "Research a company or organization",
        "search_queries": [
            "{topic} company profile",
            "{topic} funding",
            "{topic} leadership",
            "{topic} news",
        ],
    },
    "technology": {
        "description": "Research a technology or framework",
        "search_queries": [
            "{topic} documentation",
            "{topic} tutorial",
            "{topic} github",
            "{topic} best practices",
        ],
    },
    "person": {
        "description": "Research a person",
        "search_queries": [
            "{topic} biography",
            "{topic} achievements",
            "{topic} interviews",
        ],
    },
    "topic": {
        "description": "General topic research",
        "search_queries": [
            "{topic} guide",
            "{topic} explained",
            "{topic} latest developments",
        ],
    },
}


def get_research_plan(topic: str, research_type: str = "topic") -> str:
    """
    Generate a research plan for a given topic
    
    Args:
        topic: Research topic
        research_type: Type of research (product, company, technology, person, topic)
    
    Returns:
        Research plan with suggested search queries
    """
    template = RESEARCH_TEMPLATES.get(research_type, RESEARCH_TEMPLATES["topic"])
    
    queries = [q.format(topic=topic) for q in template["search_queries"]]
    
    plan = f"""# Research Plan: {topic}

**Type**: {template['description']}

## Suggested Search Queries
"""
    for i, query in enumerate(queries, 1):
        plan += f"{i}. `{query}`\n"
    
    plan += f"""
## Execution Steps

1. **Search Phase**: Run each query, collect top 3-5 URLs per query
2. **Crawl Phase**: Use `parallel_crawl` to extract content from all URLs
3. **Analysis Phase**: Synthesize findings, identify key insights
4. **Verification Phase**: Cross-reference important claims across sources

## Extraction Mode Recommendations
- **For data/numbers**: Use `structured` mode
- **For content/ideas**: Use `markdown` mode  
- **For link discovery**: Use `links` mode

---

Ready to start research? Run the search queries above!
"""
    
    return plan


if __name__ == "__main__":
    import sys
    
    if len(sys.argv) < 2:
        print("Usage: python research.py <command> [args]")
        print("Commands:")
        print("  plan <topic> [type]  - Generate research plan")
        print("  crawl <url> [mode]   - Crawl single URL")
        print("  analyze <url>        - Analyze website")
        sys.exit(1)
    
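    command = sys.argv[1]

    # Every command takes a required second argument (topic or URL)
    if len(sys.argv) < 3:
        print(f"Error: '{command}' requires an argument")
        sys.exit(1)

    if command == "plan":
        topic = sys.argv[2]
        rtype = sys.argv[3] if len(sys.argv) > 3 else "topic"
        print(get_research_plan(topic, rtype))

    elif command == "crawl":
        url = sys.argv[2]
        mode = sys.argv[3] if len(sys.argv) > 3 else "markdown"
        # crawl_url is not imported in this module; use the synchronous
        # WebCrawler.crawl method, which is.
        result = WebCrawler().crawl(url, mode=mode)
        print(result)

    elif command == "analyze":
        url = sys.argv[2]
        result = asyncio.run(analyze_website(url))
        print(result)

    else:
        print(f"Unknown command: {command}")
        sys.exit(1)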

FILE:web_crawl.py
#!/usr/bin/env python3
"""
OpenClaw Web Crawl Tool - Advanced Content Extraction
Enhanced version of web_fetch with multiple extraction modes

Features:
- Multiple extraction modes: text, markdown, links, structured, full
- CSS selector support for targeted extraction
- Intelligent main content detection
- Full HTML to Markdown conversion
- Parallel crawling capability
"""

import re
import json
import concurrent.futures
from typing import Any, Dict, List, Optional
from urllib.parse import urljoin, urlparse

try:
    import requests
    from bs4 import BeautifulSoup, NavigableString, Tag
except ImportError:
    raise ImportError("Required packages: requests, beautifulsoup4. Install: pip3 install requests beautifulsoup4")


class WebCrawler:
    """Advanced web content crawler and extractor"""
    
    def __init__(self, timeout: float = 30.0):
        self.timeout = timeout
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
                         "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8",
            # Advertise only encodings that requests decodes out of the box;
            # "br" needs an optional Brotli package to be installed.
            "Accept-Encoding": "gzip, deflate",
            "DNT": "1",
            "Connection": "keep-alive",
        }
    
    def crawl(
        self,
        url: str,
        mode: str = "markdown",
        max_length: int = 10000,
        selector: str = "",
    ) -> Dict[str, Any]:
        """
        Crawl and extract content from a URL
        
        Args:
            url: Target URL
            mode: Extraction mode (text, markdown, links, structured, full)
            max_length: Maximum content length
            selector: Optional CSS selector for targeted extraction
        
        Returns:
            Dict with success status and extracted content
        """
        try:
            resp = requests.get(url, headers=self.headers, timeout=self.timeout, allow_redirects=True)
            resp.raise_for_status()
            html = resp.text
            soup = BeautifulSoup(html, "html.parser")
            
            # Apply CSS selector if provided
            if selector:
                elements = soup.select(selector)
                if not elements:
                    return {
                        "success": False,
                        "error": f"No elements found for selector: {selector}",
                    }
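                # Create new soup with copies of the selected elements.
                # bs4 Tags have no .copy() method (attribute lookup falls
                # through to find("copy")), so use copy.copy(), which bs4
                # supports via Tag.__copy__.
                import copy
                new_soup = BeautifulSoup("<div></div>", "html.parser")
                container = new_soup.div
                for el in elements:
                    container.append(copy.copy(el))
                soup = new_soup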
            
            # Extract based on mode
            if mode == "text":
                result = self._extract_text(soup, max_length)
            elif mode == "markdown":
                result = self._extract_markdown(soup, url, max_length)
            elif mode == "links":
                result = self._extract_links(soup, url, max_length)
            elif mode == "structured":
                result = self._extract_structured(soup, url, max_length)
            elif mode == "full":
                result = self._extract_full(soup, url, max_length)
            else:
                result = self._extract_markdown(soup, url, max_length)
            
            return {
                "success": True,
                "url": url,
                "mode": mode,
                "content": result,
            }
                
        except requests.HTTPError as e:
            return {
                "success": False,
                "error": f"HTTP {e.response.status_code}: {url}",
            }
        except Exception as e:
            return {
                "success": False,
                "error": f"Crawl failed: {str(e)}",
            }
    
    def _clean_soup(self, soup: BeautifulSoup) -> BeautifulSoup:
        """Remove useless elements like scripts, styles, ads"""
        for tag in soup.find_all([
            "script", "style", "nav", "footer", "header", 
            "aside", "noscript", "iframe", "svg", "canvas"
        ]):
            tag.decompose()
        return soup
    
    def _extract_text(self, soup: BeautifulSoup, max_length: int) -> str:
        """Extract clean plain text"""
        soup = self._clean_soup(soup)
        text = soup.get_text(separator="\n", strip=True)
        # Clean excessive newlines
        text = re.sub(r"\n{3,}", "\n\n", text)
        return text[:max_length]
    
    def _extract_markdown(self, soup: BeautifulSoup, base_url: str, max_length: int) -> str:
        """Extract content as formatted Markdown"""
        soup = self._clean_soup(soup)
        lines = []
        
        # Extract title
        title = soup.find("title")
        if title:
            title_text = title.get_text(strip=True)
            if title_text:
                lines.append(f"# {title_text}\n")
        
        # Find main content area (smart detection)
        main = (
            soup.find("main")
            or soup.find("article")
            or soup.find(attrs={"role": "main"})
            or soup.find("div", class_=re.compile(r"content|article|post|entry|body", re.I))
            or soup.body
            or soup
        )
        
        if main:
            lines.append(self._tag_to_markdown(main, base_url))
        
        result = "\n".join(lines)
        result = re.sub(r"\n{3,}", "\n\n", result)
        return result[:max_length]
    
    def _tag_to_markdown(self, tag: Tag, base_url: str) -> str:
        """Recursively convert HTML tags to Markdown"""
        parts = []
        
        for child in tag.children:
            if isinstance(child, NavigableString):
                text = str(child).strip()
                if text:
                    parts.append(text)
                    
            elif isinstance(child, Tag):
                name = child.name
                
                if name in ("h1", "h2", "h3", "h4", "h5", "h6"):
                    level = int(name[1])
                    text = child.get_text(strip=True)
                    if text:
                        parts.append(f"\n{'#' * level} {text}\n")
                        
                elif name == "p":
                    text = child.get_text(strip=True)
                    if text:
                        parts.append(f"\n{text}\n")
                        
                elif name == "a":
                    text = child.get_text(strip=True)
                    href = child.get("href", "")
                    if href and text and not href.startswith(("#", "javascript:")):
                        full_url = urljoin(base_url, href)
                        parts.append(f"[{text}]({full_url})")
                    elif text:
                        parts.append(text)
                        
                elif name == "img":
                    alt = child.get("alt", "image")
                    src = child.get("src", "")
                    if src:
                        full_url = urljoin(base_url, src)
                        parts.append(f"![{alt}]({full_url})")
                        
                elif name in ("ul", "ol"):
                    items = child.find_all("li", recursive=False)
                    for i, li in enumerate(items):
                        prefix = f"{i+1}." if name == "ol" else "-"
                        text = li.get_text(strip=True)
                        if text:
                            parts.append(f"\n{prefix} {text}")
                    parts.append("")
                    
                elif name in ("pre", "code"):
                    code = child.get_text()
                    if "\n" in code:
                        parts.append(f"\n```\n{code}\n```\n")
                    else:
                        parts.append(f"`{code.strip()}`")
                        
                elif name == "blockquote":
                    text = child.get_text(strip=True)
                    if text:
                        parts.append(f"\n> {text}\n")
                        
                elif name == "table":
                    parts.append(self._table_to_markdown(child))
                    
                elif name in ("strong", "b"):
                    text = child.get_text(strip=True)
                    if text:
                        parts.append(f"**{text}**")
                        
                elif name in ("em", "i"):
                    text = child.get_text(strip=True)
                    if text:
                        parts.append(f"*{text}*")
                        
                elif name == "br":
                    parts.append("\n")
                    
                elif name == "hr":
                    parts.append("\n---\n")
                    
                else:
                    # Recursively process other tags
                    inner = self._tag_to_markdown(child, base_url)
                    if inner.strip():
                        parts.append(inner)
        
        return " ".join(parts)
    
    def _table_to_markdown(self, table: Tag) -> str:
        """Convert HTML table to Markdown table"""
        rows = table.find_all("tr")
        if not rows:
            return ""
        
        md_rows = []
        num_cols = 0
        for row in rows:
            cells = row.find_all(["th", "td"])
            md_cells = [cell.get_text(strip=True).replace("|", "\\|") for cell in cells]
            if md_cells:
                if not md_rows:
                    # Take the column count from the first non-empty row's cells;
                    # counting "|" in the rendered row would also count escaped pipes
                    num_cols = len(md_cells)
                md_rows.append("| " + " | ".join(md_cells) + " |")
        
        if not md_rows:
            return ""
        
        # Add separator row after the header row
        separator = "| " + " | ".join(["---"] * num_cols) + " |"
        md_rows.insert(1, separator)
        
        return "\n" + "\n".join(md_rows) + "\n"
    
    def _extract_links(self, soup: BeautifulSoup, base_url: str, max_length: int) -> str:
        """Extract all links from the page"""
        links = []
        
        for a in soup.find_all("a", href=True):
            href = a["href"]
            text = a.get_text(strip=True)
            
            if href.startswith(("#", "javascript:", "mailto:", "tel:")):
                continue
                
            full_url = urljoin(base_url, href)
            links.append(f"- [{text or 'link'}]({full_url})")
        
        result = f"Found {len(links)} links:\n\n" + "\n".join(links)
        return result[:max_length]
    
    def _extract_structured(self, soup: BeautifulSoup, base_url: str, max_length: int) -> str:
        """Extract structured data as JSON"""
        data = {
            "url": base_url,
            "title": "",
            "description": "",
            "headings": [],
            "main_text": "",
            "links_count": 0,
            "images_count": 0,
        }
        
        # Title
        title = soup.find("title")
        if title:
            data["title"] = title.get_text(strip=True)
        
        # Meta description
        meta_desc = soup.find("meta", attrs={"name": "description"})
        if meta_desc:
            data["description"] = meta_desc.get("content", "")
        
        # Headings
        for h in soup.find_all(["h1", "h2", "h3"]):
            text = h.get_text(strip=True)
            if text:
                data["headings"].append({"level": h.name, "text": text})
        
        # Counts
        data["links_count"] = len(soup.find_all("a", href=True))
        data["images_count"] = len(soup.find_all("img"))
        
        # Main text preview
        soup_clean = self._clean_soup(soup)
        data["main_text"] = soup_clean.get_text(separator=" ", strip=True)[:3000]
        
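        payload = json.dumps(data, indent=2, ensure_ascii=False)
        if len(payload) > max_length:
            # Slicing the serialized string would yield invalid JSON, so shorten
            # the text preview by the overflow and serialize again. The result can
            # still exceed max_length if the other fields alone are larger.
            keep = max(0, len(data["main_text"]) - (len(payload) - max_length))
            data["main_text"] = data["main_text"][:keep]
            payload = json.dumps(data, indent=2, ensure_ascii=False)
        return payload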
    
    def _extract_full(self, soup: BeautifulSoup, base_url: str, max_length: int) -> str:
        """Full extraction: Markdown + Links"""
        md = self._extract_markdown(soup, base_url, max_length // 2)
        links = self._extract_links(soup, base_url, max_length // 4)
        return f"{md}\n\n---\n\n{links}"[:max_length]


def crawl_url(
    url: str,
    mode: str = "markdown",
    max_length: int = 10000,
    selector: str = "",
) -> str:
    """
    Crawl a single URL and return formatted result
    
    Use this for extracting content from web pages.
    """
    crawler = WebCrawler()
    result = crawler.crawl(url, mode, max_length, selector)
    
    if result["success"]:
        return f"✅ Crawled: {result['url']}\n\n{result['content']}"
    else:
        return f"❌ Failed: {result['error']}"


def parallel_crawl(
    urls: List[str],
    mode: str = "markdown",
    max_length: int = 10000,
    max_workers: int = 5,
) -> str:
    """
    Crawl multiple URLs in parallel
    
    Use this for researching multiple sources at once.
    """
    crawler = WebCrawler()
    
    def crawl_one(url: str) -> Dict[str, Any]:
        return crawler.crawl(url, mode, max_length)
    
    # Store results by input position so "Source N" follows the order of `urls`
    # rather than the order in which the workers happen to finish
    results = [None] * len(urls)
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_index = {executor.submit(crawl_one, url): i for i, url in enumerate(urls)}
        for future in concurrent.futures.as_completed(future_to_index):
            i = future_to_index[future]
            try:
                results[i] = future.result()
            except Exception as e:
                results[i] = {"success": False, "error": str(e), "url": urls[i]}
    
    # Format results
    output = [f"# Parallel Crawl Results ({len(urls)} URLs)\n"]
    
    for i, result in enumerate(results, 1):
        url = result.get("url", "Unknown")
        output.append(f"\n---\n\n## Source {i}: {url}\n")
        
        if result.get("success"):
            content = result.get("content", "")
            # Truncate if too long
            if len(content) > max_length:
                content = content[:max_length] + "\n\n[Content truncated...]"
            output.append(content)
        else:
            output.append(f"❌ Failed: {result.get('error', 'Unknown error')}")
    
    return "\n".join(output)


# CLI interface
if __name__ == "__main__":
    import sys
    
    if len(sys.argv) < 2:
        print("Usage: python web_crawl.py <url> [mode] [max_length]")
        print("Modes: text, markdown, links, structured, full")
        sys.exit(1)
    
    url = sys.argv[1]
    mode = sys.argv[2] if len(sys.argv) > 2 else "markdown"
    max_length = int(sys.argv[3]) if len(sys.argv) > 3 else 10000
    
    result = crawl_url(url, mode, max_length)
    print(result)
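
The thread-pool fan-out used by `parallel_crawl` can be tried without network access. The sketch below is an illustration only and is not part of `web_crawl.py`: `fake_crawl` and `fan_out` are hypothetical names, and `fake_crawl` merely stands in for `WebCrawler.crawl`. It assumes the same result shape (`success`, `url`, `content` or `error`) and stores each result at its URL's input position, so the output order does not depend on which worker finishes first.

```python
import concurrent.futures
from typing import Any, Callable, Dict, List


def fake_crawl(url: str) -> Dict[str, Any]:
    """Hypothetical stand-in for WebCrawler.crawl; performs no network I/O."""
    if "bad" in url:
        raise ValueError(f"cannot fetch {url}")
    return {"success": True, "url": url, "content": f"content of {url}"}


def fan_out(
    urls: List[str],
    fetch: Callable[[str], Dict[str, Any]],
    max_workers: int = 5,
) -> List[Dict[str, Any]]:
    """Run fetch over urls in a thread pool; results follow the order of urls."""
    results: List[Dict[str, Any]] = [{} for _ in urls]
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the index of the URL it was submitted for
        future_to_index = {executor.submit(fetch, url): i for i, url in enumerate(urls)}
        for future in concurrent.futures.as_completed(future_to_index):
            i = future_to_index[future]
            try:
                results[i] = future.result()
            except Exception as e:
                # A failure in one worker becomes an error record, not a crash
                results[i] = {"success": False, "error": str(e), "url": urls[i]}
    return results
```

Calling `fan_out(["https://a.example", "https://bad.example"], fake_crawl)` returns one record per URL in input order, with the second record marked as failed. Swapping `fake_crawl` for a real fetch function keeps the same error handling.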
