Yeziwnl

@clawhub-lilw-yezi-1505a4903f
---
name: webpage-export
description: Export webpages into clean local TXT, DOCX, and PDF files with source metadata, fallback extraction logic, and browser-assisted recovery for difficult pages. Useful for archiving articles, policy pages, WeChat posts, official notices, and other webpages before downstream analysis or sharing.
---

# Webpage Export

Use this skill to turn a webpage URL into local files that downstream agents can archive, send, or reference.

## Core workflow

1. Run `scripts/export_webpage.py <url>` to create a TXT snapshot first.
2. Treat TXT as the baseline extracted record.
3. Add `--docx` when the user wants a Word document.
4. Add `--pdf` when Chrome/Chromium is available and the user wants a PDF.
5. Keep the generated JSON metadata file; it records extraction quality, paths, warnings, and partial-failure status for downstream agents.
6. Save outputs to an explicit `--outdir` when the user provides one; otherwise the script writes to its default folder, `webpage-exports/temp/` under the current working directory.
7. For accuracy-sensitive work, keep original title, original URL, and extracted source metadata.
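
A downstream agent can drive the workflow above from Python. The following is a minimal sketch, not part of the skill: `build_command` and `parse_export_output` are hypothetical helper names, and the parser assumes the `KEY=value` lines that `scripts/export_webpage.py` prints to stdout (for example `TXT=`, `STATUS=`, and `WARNINGS=` joined by ` | `).

```python
from typing import Dict, List, Optional


def build_command(url: str, docx: bool = False, pdf: bool = False,
                  outdir: Optional[str] = None) -> List[str]:
    """Assemble the argv for scripts/export_webpage.py using its documented flags."""
    cmd = ["python3", "scripts/export_webpage.py", url]
    if docx:
        cmd.append("--docx")
    if pdf:
        cmd.append("--pdf")
    if outdir:
        cmd += ["--outdir", outdir]
    return cmd


def parse_export_output(stdout: str) -> Dict[str, object]:
    """Parse the KEY=value lines printed by the script into a dict."""
    result: Dict[str, object] = {}
    for line in stdout.splitlines():
        # Split on the first '=' only, so values such as titles may contain '='.
        key, sep, value = line.partition("=")
        if not sep:
            continue  # ignore lines that are not KEY=value
        if key == "WARNINGS":
            result[key] = [w.strip() for w in value.split(" | ") if w.strip()]
        else:
            result[key] = value
    return result
```

Passing the list from `build_command` to `subprocess.run(..., capture_output=True, text=True)` and feeding its `stdout` to `parse_export_output` gives the agent the output paths and status without re-reading the JSON file.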

## Commands

### TXT only

```bash
python3 scripts/export_webpage.py "<url>"
```

### TXT + DOCX

```bash
python3 scripts/export_webpage.py "<url>" --docx
```

### TXT + PDF

```bash
python3 scripts/export_webpage.py "<url>" --pdf
```

### TXT + DOCX + PDF with explicit output folder

```bash
python3 scripts/export_webpage.py "<url>" --docx --pdf --outdir ./exports/temp
```

## Runtime requirements

- Requires `python3`.
- Requires `curl` for baseline webpage fetching.
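- PDF export requires Chrome or Chromium; the script looks for them only at the standard macOS application paths.
- Browser-assisted fallback requires Chrome or Chromium; its second stage also requires `node` and the `playwright` package.
- DOCX export requires `textutil`, which ships with macOS.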
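
Before running an export, the command-line requirements above can be checked up front. This is a small sketch under stated assumptions: `check_requirements` and `missing_for` are hypothetical helpers, they only look for executables on `PATH`, and they do not verify the Chrome application path or whether the `playwright` package is installed.

```python
import shutil
from typing import Callable, Dict, List, Optional


def check_requirements(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, bool]:
    """Report which of the command-line tools listed above are on PATH."""
    tools = ["python3", "curl", "node", "textutil"]
    return {tool: which(tool) is not None for tool in tools}


def missing_for(found: Dict[str, bool], docx: bool, browser_fallback: bool) -> List[str]:
    """List the tools that are needed for the requested features but not found."""
    needed = ["python3", "curl"]
    if docx:
        needed.append("textutil")
    if browser_fallback:
        needed.append("node")
    return [tool for tool in needed if not found.get(tool, False)]
```

An empty list from `missing_for(check_requirements(), docx=True, browser_fallback=True)` means the PATH-level prerequisites are in place.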

## Safety and execution notes

- This skill fetches arbitrary URLs and may use a headless browser for difficult pages.
- Browser-assisted fallback executes page JavaScript and should be used only when needed.
- Prefer explicit `--outdir` values for production or shared environments.

## What the script does

- Fetch the page with `curl`
- Extract title/source/publish-time when available
- Try multiple body candidates before falling back to a full-page text snapshot
- Score extraction quality and emit warnings for suspicious/partial results
- Strip HTML into readable text for a TXT snapshot
- Convert TXT to DOCX using `textutil` on macOS
- Render webpage to PDF using Chrome/Chromium headless printing when available
- Emit a JSON metadata file with status, paths, word count (measured in characters), quality, and warnings

## Format choice

- Prefer **TXT** as the baseline extracted record.
- Prefer **DOCX** when the user wants an editable or shareable document.
- Prefer **PDF** when the user wants page-like rendering or easier direct viewing.
- For important work, do not treat PDF as the only source of truth.

## Chrome/Chromium PDF path

When the user wants PDF, prefer Chrome/Chromium headless printing because it preserves Chinese text and webpage layout better than ad-hoc PDF generation.

Read `references/chrome-pdf-guide.md` when:
- you need the exact Chrome PDF logic
- PDF output is incomplete or suspicious
- Chrome emits warnings and you need to judge whether the result is still usable
- you need fallback decisions

## Accuracy and fallbacks

Read `references/accuracy-and-fallbacks.md` when:
- source accuracy matters
- webpage metadata is incomplete
- a field cannot be extracted cleanly
- you need fallback behavior after a partial extraction

## Delivery decisions

Read `references/delivery-rules.md` when:
- deciding whether to deliver TXT, DOCX, PDF, or a combination
- preparing files for downstream agents or user delivery
- choosing archive placement under the local workspace

## Limitations

- Some highly dynamic or anti-bot pages may extract only partially.
- PDF depends on Chrome/Chromium being installed.
- DOCX depends on macOS `textutil`.
- If a page is blocked in lightweight fetch mode, use this skill's curl-based extraction path before giving up.

## Accuracy rule

Accuracy is the top standard. Keep original title, original URL, and extracted source metadata. If any field is uncertain, mark it as missing instead of guessing.
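
One way to apply this rule when building a record for downstream use is sketched below. The `MISSING` marker and the `build_record` helper are assumptions for illustration; the script itself stores an empty string for fields it could not extract.

```python
from typing import Dict, Optional

MISSING = "missing"  # hypothetical marker; the script uses '' for unextracted fields


def build_record(title: str, url: str, source: Optional[str] = None,
                 publish: Optional[str] = None) -> Dict[str, str]:
    """Keep extracted values as they are and mark empty fields explicitly."""

    def keep_or_missing(value: Optional[str]) -> str:
        value = (value or "").strip()
        return value if value else MISSING

    return {
        "title": keep_or_missing(title),
        "url": url,  # the original URL is always preserved unchanged
        "source": keep_or_missing(source),
        "publish": keep_or_missing(publish),
    }
```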

FILE:references/accuracy-and-fallbacks.md
# Accuracy and Fallbacks

## Accuracy rules

- Preserve the original title whenever possible; do not rewrite it for storage.
- Preserve the original URL.
- Keep source/account/publisher separate from your own summary.
- If publish time is not directly extractable, mark it as missing instead of guessing.
- Treat PDF generation as a rendering task, not a content-understanding task.

## Export order

1. Export TXT first. This is the ground-truth extracted text snapshot.
2. Export DOCX next when the user wants an editable/shareable document.
3. Export PDF with Chrome/Chromium when the user wants page-like fidelity.

## Fallbacks

- If PDF fails, still keep TXT and DOCX.
- If webpage extraction is partial, keep the URL and note that extraction needs browser/manual review.
- If the page is dynamic or anti-bot protected, use browser-assisted inspection before claiming the page is unavailable.
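- Treat `STATUS=partial` as a usable TXT baseline with a failed DOCX or PDF step: keep the TXT and metadata JSON, and check `steps` and `warnings` in the JSON to see what failed. Incomplete extraction is reported separately through `content_quality` and the `needs_browser_review` warning; use those to decide whether a browser-assisted retry is worth it.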
- Use the JSON metadata file as the authoritative machine-readable summary of success, partial success, warnings, and output paths.
- Prefer browser-rendered DOM fallback when static HTML extraction is too short or obviously shell-like; record whether fallback was used and do not hide the difference between static and browser-rendered text.
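
The fallback rules above can be turned into a decision on the metadata JSON. The sketch below assumes the keys the script writes (`status`, `steps`, `content_quality.needs_browser_review`); the function names and the returned action labels are hypothetical.

```python
import json
import pathlib
from typing import Dict


def load_metadata(json_path: str) -> Dict:
    """Read the metadata JSON written next to the TXT snapshot."""
    return json.loads(pathlib.Path(json_path).read_text(encoding="utf-8"))


def next_action(metadata: Dict) -> str:
    """Map the metadata to one of four illustrative follow-up actions."""
    steps = metadata.get("steps", {})
    quality = metadata.get("content_quality", {})
    if metadata.get("status") == "failed" or steps.get("txt") != "success":
        return "report_failure"
    if quality.get("needs_browser_review"):
        # Extraction looks short or shell-like: keep the URL and flag it for review.
        return "browser_or_manual_review"
    if any(state == "failed" for state in steps.values()):
        # TXT is fine but a requested DOCX/PDF step failed.
        return "deliver_txt_and_note_failed_formats"
    return "deliver"
```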

## Storage

Default archive root:
a local `webpage-exports/` folder under the current working directory, unless the task explicitly sets another output path.

Recommended subdirectories:
- `raw/` for original downloads
- `processed/` for cleaned outputs
- `temp/` for testing or intermediate exports
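
The recommended layout can be created ahead of an export. This is a minimal sketch; `ensure_archive_layout` is a hypothetical helper, and the script itself only creates the directory passed as `--outdir`.

```python
import pathlib
from typing import Dict, Union


def ensure_archive_layout(root: Union[str, pathlib.Path]) -> Dict[str, pathlib.Path]:
    """Create raw/, processed/ and temp/ under the archive root if they are missing."""
    root = pathlib.Path(root)
    layout = {}
    for name in ("raw", "processed", "temp"):
        sub = root / name
        sub.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
        layout[name] = sub
    return layout
```

For example, `ensure_archive_layout("webpage-exports")["temp"]` returns a path that can be passed to `--outdir`.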

FILE:references/chrome-pdf-guide.md
# Chrome PDF Guide

## Why prefer Chrome/Chromium for PDF

Use Chrome/Chromium headless printing as the preferred PDF path when the goal is to preserve webpage layout, with better Chinese rendering and fewer garbled-character (mojibake) issues than ad-hoc PDF generation.

Compared with plain text PDF generation, Chrome PDF is better for:
- WeChat articles
- Official policy pages
- Long-form web articles
- Pages where the user expects visual page fidelity

## Standard command pattern

The script wraps this pattern internally:

```bash
'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome' \
  --headless=new \
  --disable-gpu \
  --no-first-run \
  --virtual-time-budget=15000 \
  --print-to-pdf=/path/to/output.pdf \
  '<url>'
```

## When to use Chrome PDF

Use Chrome PDF when:
- the user explicitly asks for PDF
- the page contains Chinese text and plain PDF generation has produced garbled characters before
- the page is primarily a normal article/detail page
- layout fidelity matters more than editability

## When NOT to rely on Chrome PDF alone

Do not rely on Chrome PDF alone when:
- you still need structured text extraction for summarization or database fields
- the page is heavily dynamic and content may not finish loading in time
- the page is access-controlled and requires login/cookies not available in headless mode
- the page contains attachments that should be downloaded separately instead of merely rendered

In those cases:
- still keep TXT as the extraction baseline
- optionally keep DOCX for editable delivery
- treat PDF as one output, not the only source of truth

## Required companion outputs

For accuracy-sensitive work, keep at least:
- TXT snapshot
- original URL
- title/source metadata

For user delivery, optionally add:
- DOCX
- PDF

## Common issues

### 1. PDF generated successfully but content is incomplete

Likely cause:
- page did not finish rendering before print

Action:
- increase `--virtual-time-budget`
- re-run once
- if still incomplete, inspect the page with browser-assisted workflow
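
The "increase the budget, re-run once" action can be sketched as follows. `print_with_retry` is a hypothetical helper, and `print_pdf` stands for any callable that runs the Chrome print with the given budget and returns `True` when the result is judged complete; doubling the budget is an assumption, not a value taken from the script.

```python
from typing import Callable, Optional


def print_with_retry(print_pdf: Callable[[int], bool],
                     budget_ms: int = 15000, factor: int = 2) -> Optional[int]:
    """Print once at budget_ms; if incomplete, re-run exactly once with a larger budget.

    Returns the budget that produced a complete PDF, or None when both attempts
    were incomplete and the page needs the browser-assisted workflow.
    """
    for budget in (budget_ms, budget_ms * factor):
        if print_pdf(budget):
            return budget
    return None
```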

### 2. Chrome prints the shell page but not the article body

Likely cause:
- dynamic rendering, anti-bot behavior, or blocked resources

Action:
- keep TXT extraction result
- record that browser/manual review is needed
- do not pretend the PDF is complete if the article body is missing

### 3. Chrome path not found

Check these common locations:
- `/Applications/Google Chrome.app/Contents/MacOS/Google Chrome`
- `/Applications/Chromium.app/Contents/MacOS/Chromium`

If neither exists:
- install Chrome/Chromium
- fall back to TXT/DOCX until PDF path is restored

### 4. Console warnings appear during headless print

Some Chrome/macOS warnings do not block PDF generation.
Judge by outcome:
- if PDF file is generated and content is correct, treat as success
- if file is missing or content is broken, treat as failure and fall back

## Success criteria

A Chrome-generated PDF is only considered valid when:
- the file exists
- Chinese text renders correctly
- the main body is present
- title/source can still be matched against the original page

If any of these fail, keep TXT/DOCX and mark PDF as failed or incomplete.
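
Only part of these criteria can be checked automatically. The sketch below covers the mechanical part: the file exists, it is larger than the 1024-byte floor the script uses, and it starts with the `%PDF-` header. Whether Chinese text renders correctly and the main body is present still needs a visual or text-level review. `pdf_passes_basic_checks` is a hypothetical helper.

```python
import pathlib
from typing import Union


def pdf_passes_basic_checks(path: Union[str, pathlib.Path], min_bytes: int = 1024) -> bool:
    """Return True when the file exists, exceeds min_bytes, and has a PDF header."""
    p = pathlib.Path(path)
    if not p.is_file() or p.stat().st_size <= min_bytes:
        return False
    with p.open("rb") as fh:
        return fh.read(5) == b"%PDF-"
```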

FILE:references/delivery-rules.md
# Delivery Rules

## Output priority

Choose outputs based on the real request, but follow this priority logic:

1. TXT = baseline extraction record
2. DOCX = editable/shareable output
3. PDF = layout-preserving output

For accuracy-sensitive work, do not deliver only PDF without keeping TXT or source metadata.

## Recommended combinations

### If the user says "extract webpage content"

Deliver:
- TXT
- extracted metadata summary in chat if useful

### If the user says "make it a Word file"

Deliver:
- TXT
- DOCX

### If the user says "make it a PDF"

Deliver:
- TXT
- PDF

### If the user says "archive this webpage"

Deliver/store:
- TXT
- original URL
- metadata
- optionally DOCX/PDF
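
The combinations above can be expressed as a lookup that always keeps TXT as the baseline. The request labels (`extract`, `word`, `pdf`, `archive`) and the helper names are assumptions made for this sketch; only the `--docx` and `--pdf` flags come from the script.

```python
from typing import Iterable, List

# TXT is always first; DOCX/PDF are added only when the request calls for them.
BASE_FORMATS = {
    "extract": ["txt"],
    "word": ["txt", "docx"],
    "pdf": ["txt", "pdf"],
    "archive": ["txt"],  # DOCX/PDF are optional extras for archiving
}


def formats_for(request_kind: str, optional: Iterable[str] = ()) -> List[str]:
    """Return the formats to deliver for a request, plus any optional extras."""
    if request_kind not in BASE_FORMATS:
        raise ValueError(f"unknown request kind: {request_kind}")
    formats = list(BASE_FORMATS[request_kind])
    for fmt in optional:
        if fmt in ("docx", "pdf") and fmt not in formats:
            formats.append(fmt)
    return formats


def cli_flags(formats: Iterable[str]) -> List[str]:
    """Translate formats into script flags; TXT needs no flag."""
    return [f"--{fmt}" for fmt in formats if fmt != "txt"]
```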

## Accuracy first

Never optimize for format over correctness.

If PDF looks pretty but content is incomplete, prefer the TXT/DOCX as the trustworthy output and explicitly note the PDF issue.

## Storage guidance

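Recommended root:
`./exports/` or another explicit folder chosen for the task. Without `--outdir`, the script writes to `webpage-exports/temp/` under the current working directory.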

Typical usage:
- `temp/` for tests and intermediate exports
- `raw/` for original download artifacts
- `processed/` for cleaned and finalized deliverables

FILE:scripts/export_webpage.py
#!/usr/bin/env python3
import argparse
import html
import json
import os
import pathlib
import re
import subprocess
import sys
from shutil import which
from typing import Optional

UA = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36'
DEFAULT_ARCHIVE = pathlib.Path.cwd() / 'webpage-exports'


def run(cmd):
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


def fetch_html(url: str) -> str:
    # -sS hides curl's progress meter but still reports errors on stderr.
    result = run(['curl', '-sS', '-L', '--max-time', '30', '-A', UA, url])
    return result.stdout


def pick(pattern: str, text: str) -> str:
    m = re.search(pattern, text, re.S | re.I)
    return m.group(1).strip() if m else ''


def html_to_text(fragment: str) -> str:
    fragment = re.sub(r'<script.*?</script>', '', fragment, flags=re.S | re.I)
    fragment = re.sub(r'<style.*?</style>', '', fragment, flags=re.S | re.I)
    fragment = re.sub(r'<noscript.*?</noscript>', '', fragment, flags=re.S | re.I)
    fragment = re.sub(r'<br\s*/?>', '\n', fragment, flags=re.I)
    fragment = re.sub(r'</p>|</section>|</div>|</li>|</h\d>|</tr>|</article>|</main>', '\n', fragment, flags=re.I)
    fragment = re.sub(r'<li[^>]*>', '• ', fragment, flags=re.I)
    fragment = re.sub(r'<[^>]+>', '', fragment)
    fragment = html.unescape(fragment)
    fragment = fragment.replace('\xa0', ' ')
    fragment = re.sub(r'\n{3,}', '\n\n', fragment)
    fragment = re.sub(r'[ \t]+', ' ', fragment)
    return fragment.strip()


def extract_title(raw: str) -> str:
    return (
        pick(r'<meta property="og:title" content="(.*?)"', raw)
        or pick(r'<meta name="twitter:title" content="(.*?)"', raw)
        or pick(r'<title[^>]*>(.*?)</title>', raw)
        or '网页内容提取'  # default title; Chinese for "webpage content extraction"
    )


def extract_source(raw: str) -> str:
    return (
        pick(r'id="js_name">\s*(.*?)\s*</a>', raw)
        or pick(r'<meta name="author" content="(.*?)"', raw)
        or pick(r'<meta property="og:article:author" content="(.*?)"', raw)
        or pick(r'<meta property="article:author" content="(.*?)"', raw)
        or ''
    )


def extract_publish(raw: str) -> str:
    return (
        pick(r'id="publish_time"[^>]*>\s*(.*?)\s*</', raw)
        or pick(r'<meta property="article:published_time" content="(.*?)"', raw)
        or pick(r'<time[^>]*datetime="(.*?)"', raw)
        or ''
    )


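def body_candidates():
    # Ordered from most to least specific. Each pattern is non-greedy, so a
    # '(.*?)</div>' capture ends at the first closing </div>; containers with
    # nested <div> elements are therefore captured only up to that point.
    return [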
        ('wechat-js-content', r'<div class="rich_media_content js_underline_content[^"]*"\s+id="js_content"[^>]*>(.*?)</div>'),
        ('article', r'<article[^>]*>(.*?)</article>'),
        ('main', r'<main[^>]*>(.*?)</main>'),
        ('post-content', r'<div[^>]+class="[^"]*(?:post-content|article-content|content-detail|content-main|entry-content|content-wrapper|detail-content)[^"]*"[^>]*>(.*?)</div>'),
        ('body', r'<body[^>]*>(.*?)</body>'),
    ]


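# Boilerplate phrases common on Chinese pages; several hits on a short page
# suggest a shell or navigation-heavy page rather than an article body.
NOISE_PATTERNS = [
    r'点击上方.*关注',  # "tap above ... to follow"
    r'上一篇',  # "previous article"
    r'下一篇',  # "next article"
    r'责任编辑',  # "editor in charge"
    r'版权所有',  # "all rights reserved"
    r'ICP备',  # ICP filing number
]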


def assess_content(text: str, title: str):
    warnings = []
    score = 0
    word_count = len(text)  # character count, not whitespace-delimited words (suits Chinese text)
    lowered = text.lower()

    if word_count >= 800:
        score += 3
    elif word_count >= 300:
        score += 2
    elif word_count >= 120:
        score += 1
    else:
        warnings.append('content_too_short')

    paragraphs = [p.strip() for p in text.split('\n') if p.strip()]
    long_paragraphs = [p for p in paragraphs if len(p) >= 40]
    if len(long_paragraphs) >= 3:
        score += 2
    elif len(long_paragraphs) >= 1:
        score += 1
    else:
        warnings.append('few_long_paragraphs')

    title_tokens = [t for t in re.split(r'[\s\-_:：，,（）()]+', title) if len(t) >= 2]
    if title_tokens and any(tok in text for tok in title_tokens[:4]):
        score += 1
    else:
        warnings.append('title_tokens_not_found_in_text')

    noise_hits = sum(1 for pat in NOISE_PATTERNS if re.search(pat, text, re.I))
    if noise_hits >= 3 and word_count < 400:
        warnings.append('possible_shell_or_noise_heavy_page')

    if 'javascript' in lowered and word_count < 120:
        warnings.append('possible_dynamic_page')

    if word_count < 120 or 'possible_shell_or_noise_heavy_page' in warnings:
        quality = 'low'
    elif word_count < 300 or len(warnings) >= 2:
        quality = 'medium'
    else:
        quality = 'high'

    needs_browser_review = quality == 'low' or 'possible_dynamic_page' in warnings
    return {
        'word_count': word_count,
        'quality': quality,
        'warnings': warnings,
        'needs_browser_review': needs_browser_review,
        'paragraph_count': len(paragraphs),
        'long_paragraph_count': len(long_paragraphs),
        'score': score,
    }


def extract_best_content(raw: str, title: str):
    best = {
        'candidate': '',
        'content': '',
        'quality': {
            'word_count': 0,
            'score': -1,
            'quality': 'low',
            'warnings': ['no_content'],
            'needs_browser_review': True,
            'paragraph_count': 0,
            'long_paragraph_count': 0,
        },
    }
    for candidate_name, pat in body_candidates():
        body = pick(pat, raw)
        if not body:
            continue
        body_text = html_to_text(body)
        quality = assess_content(body_text, title)
        if quality['score'] > best['quality']['score']:
            best = {
                'candidate': candidate_name,
                'content': body_text,
                'quality': quality,
            }

    if not best['content']:
        fallback_text = html_to_text(raw)
        quality = assess_content(fallback_text, title)
        best = {
            'candidate': 'full-html-fallback',
            'content': fallback_text,
            'quality': quality,
        }
    return best


def chrome_path() -> str:
    candidates = [
        '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
        '/Applications/Chromium.app/Contents/MacOS/Chromium',
    ]
    for c in candidates:
        if pathlib.Path(c).exists():
            return c
    return ''


def fetch_rendered_dom(url: str, virtual_time_budget: int) -> str:
    chrome = chrome_path()
    if not chrome:
        raise RuntimeError('Chrome/Chromium not found; cannot fetch the browser-rendered DOM')
    result = run([
        chrome,
        '--headless=new',
        '--disable-gpu',
        '--no-first-run',
        f'--virtual-time-budget={virtual_time_budget}',
        '--dump-dom',
        url,
    ])
    return result.stdout


def fetch_browser_visible_text(url: str, out_path: pathlib.Path, virtual_time_budget: int):
    chrome = chrome_path()
    if not chrome:
        raise RuntimeError('Chrome/Chromium not found; cannot run browser text extraction')
    script = """
const fs = require('fs');
(async () => {
  const url = process.argv[1];
  const out = process.argv[2];
  const chrome = process.env.CHROME_BIN;
  const { chromium } = require('playwright');
  const browser = await chromium.launch({ headless: true, executablePath: chrome });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle', timeout: 45000 });
  await page.evaluate(async () => {
    await new Promise(r => setTimeout(r, 2000));
    window.scrollTo(0, document.body.scrollHeight);
    await new Promise(r => setTimeout(r, 1500));
    window.scrollTo(0, 0);
  });
  const payload = await page.evaluate(() => {
    const title = document.title || '';
    const bodyText = (document.body && document.body.innerText) ? document.body.innerText : '';
    const metas = {};
    for (const el of document.querySelectorAll('meta')) {
      const k = el.getAttribute('property') || el.getAttribute('name');
      const v = el.getAttribute('content');
      if (k && v) metas[k] = v;
    }
    return { title, bodyText, metas };
  });
  fs.writeFileSync(out, JSON.stringify(payload), 'utf8');
  await browser.close();
})().catch(err => { console.error(String(err)); process.exit(1); });
"""
    child_env = {
        'PATH': os.environ.get('PATH', ''),
        'HOME': os.environ.get('HOME', ''),
        'CHROME_BIN': chrome,
    }
    subprocess.run([
        'node',
        '-e',
        script,
        url,
        str(out_path),
    ], check=True, env=child_env)


def extract(url: str, raw: str, enable_browser_fallback: bool, virtual_time_budget: int, browser_text_temp: Optional[pathlib.Path] = None):
    title = extract_title(raw)
    source = extract_source(raw)
    publish = extract_publish(raw)
    best = extract_best_content(raw, title)
    text_source = 'static_html'
    browser_fallback_used = False
    browser_fallback_error = ''

    if enable_browser_fallback and best['quality']['needs_browser_review']:
        try:
            rendered = fetch_rendered_dom(url, virtual_time_budget)
            rendered_title = extract_title(rendered) or title
            browser_best = extract_best_content(rendered, rendered_title)
            if browser_best['quality']['score'] > best['quality']['score']:
                title = rendered_title or title
                source = extract_source(rendered) or source
                publish = extract_publish(rendered) or publish
                best = browser_best
                text_source = 'browser_rendered_dom'
                browser_fallback_used = True
        except Exception as e:
            browser_fallback_error = str(e)

    if enable_browser_fallback and best['quality']['needs_browser_review'] and browser_text_temp is not None:
        try:
            fetch_browser_visible_text(url, browser_text_temp, virtual_time_budget)
            browser_payload = json.loads(browser_text_temp.read_text(encoding='utf-8'))
            browser_text = (browser_payload.get('bodyText') or '').strip()
            browser_title = (browser_payload.get('title') or '').strip() or title
            browser_quality = assess_content(browser_text, browser_title)
            if browser_quality['score'] > best['quality']['score']:
                title = browser_title
                source = browser_payload.get('metas', {}).get('author', '') or source
                publish = browser_payload.get('metas', {}).get('article:published_time', '') or publish
                best = {
                    'candidate': 'browser-visible-text',
                    'content': browser_text,
                    'quality': browser_quality,
                }
                text_source = 'browser_visible_text'
                browser_fallback_used = True
        except Exception as e:
            if browser_fallback_error:
                browser_fallback_error += ' || '
            browser_fallback_error += f'browser_visible_text:{e}'

    return {
        'title': title,
        'source': source,
        'publish': publish,
        'url': url,
        'content': best['content'],
        'content_candidate': best['candidate'],
        'content_quality': best['quality'],
        'text_source': text_source,
        'browser_fallback_used': browser_fallback_used,
        'browser_fallback_error': browser_fallback_error,
    }


def sanitize_filename(name: str) -> str:
    name = re.sub(r'[\\/:*?"<>|\n\r]+', '_', name)
    name = re.sub(r'\s+', ' ', name).strip()
    return name[:80] or 'webpage-export'


def build_text(doc: dict) -> str:
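    # Labels, in order: title, source, publish time, original URL, text source,
    # body candidate, extraction quality, character count, warnings, body.
    # '无' = none; '未提取到' / '未直接提取到' = not extracted / not directly extracted.
    warnings = ', '.join(doc['content_quality']['warnings']) if doc['content_quality']['warnings'] else '无'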
    return (
        f"标题：{doc['title']}\n"
        f"来源：{doc['source'] or '未提取到'}\n"
        f"发布时间：{doc['publish'] or '未直接提取到'}\n"
        f"原始链接：{doc['url']}\n"
        f"文本来源：{doc['text_source']}\n"
        f"正文候选：{doc['content_candidate']}\n"
        f"提取质量：{doc['content_quality']['quality']}\n"
        f"字数：{doc['content_quality']['word_count']}\n"
        f"告警：{warnings}\n\n"
        f"正文：\n\n{doc['content']}\n"
    )


def write_txt(text: str, path: pathlib.Path):
    path.write_text(text, encoding='utf-8')


def write_json(payload: dict, path: pathlib.Path):
    path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding='utf-8')


def write_docx(txt_path: pathlib.Path, docx_path: pathlib.Path):
    if which('textutil') is None:
        raise RuntimeError('textutil is unavailable; cannot generate DOCX')
    subprocess.run(['textutil', '-convert', 'docx', str(txt_path), '-output', str(docx_path)], check=True)


def write_pdf(url: str, pdf_path: pathlib.Path, virtual_time_budget: int):
    chrome = chrome_path()
    if not chrome:
        raise RuntimeError('Chrome/Chromium not found; cannot generate PDF')
    subprocess.run([
        chrome,
        '--headless=new',
        '--disable-gpu',
        '--no-first-run',
        f'--virtual-time-budget={virtual_time_budget}',
        f'--print-to-pdf={pdf_path}',
        url,
    ], check=True)


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument('url')
    ap.add_argument('--outdir', default=str(DEFAULT_ARCHIVE / 'temp'))
    ap.add_argument('--basename', default='')
    ap.add_argument('--docx', action='store_true')
    ap.add_argument('--pdf', action='store_true')
    ap.add_argument('--json', action='store_true', help='always emit metadata json (default true)')
    ap.add_argument('--no-json', dest='json', action='store_false')
    ap.add_argument('--virtual-time-budget', type=int, default=15000)
    ap.add_argument('--disable-browser-fallback', action='store_true')
    ap.set_defaults(json=True)
    args = ap.parse_args()

    outdir = pathlib.Path(args.outdir)
    outdir.mkdir(parents=True, exist_ok=True)

    metadata = {
        'url': args.url,
        'status': 'failed',
        'title': '',
        'source': '',
        'publish': '',
        'content_candidate': '',
        'content_quality': {},
        'text_source': '',
        'browser_fallback_used': False,
        'browser_fallback_error': '',
        'paths': {},
        'warnings': [],
        'steps': {
            'txt': 'pending',
            'docx': 'skipped',
            'pdf': 'skipped',
        },
    }

    raw = fetch_html(args.url)
    browser_text_temp = outdir / '.browser_text_tmp.json'
    doc = extract(args.url, raw, not args.disable_browser_fallback, args.virtual_time_budget, browser_text_temp=browser_text_temp)
    basename = args.basename or sanitize_filename(doc['title'])
    txt_path = outdir / f'{basename}.txt'
    json_path = outdir / f'{basename}.json'

    metadata.update({
        'title': doc['title'],
        'source': doc['source'],
        'publish': doc['publish'],
        'content_candidate': doc['content_candidate'],
        'content_quality': doc['content_quality'],
        'text_source': doc['text_source'],
        'browser_fallback_used': doc['browser_fallback_used'],
        'browser_fallback_error': doc['browser_fallback_error'],
        'paths': {'txt': str(txt_path)},
    })
    metadata['warnings'].extend(doc['content_quality']['warnings'])
    if doc['content_quality']['needs_browser_review']:
        metadata['warnings'].append('needs_browser_review')
    if doc['browser_fallback_used']:
        metadata['warnings'].append('browser_fallback_used')
    if doc['browser_fallback_error']:
        metadata['warnings'].append(f'browser_fallback_error:{doc["browser_fallback_error"]}')

    text = build_text(doc)
    write_txt(text, txt_path)
    metadata['steps']['txt'] = 'success'

    if args.docx:
        docx_path = outdir / f'{basename}.docx'
        metadata['paths']['docx'] = str(docx_path)
        try:
            write_docx(txt_path, docx_path)
            metadata['steps']['docx'] = 'success'
        except Exception as e:
            metadata['steps']['docx'] = 'failed'
            metadata['warnings'].append(f'docx_failed:{e}')

    if args.pdf:
        pdf_path = outdir / f'{basename}.pdf'
        metadata['paths']['pdf'] = str(pdf_path)
        try:
            write_pdf(args.url, pdf_path, args.virtual_time_budget)
            if pdf_path.exists() and pdf_path.stat().st_size > 1024:
                metadata['steps']['pdf'] = 'success'
            else:
                metadata['steps']['pdf'] = 'failed'
                metadata['warnings'].append('pdf_failed:missing_or_too_small')
        except Exception as e:
            metadata['steps']['pdf'] = 'failed'
            metadata['warnings'].append(f'pdf_failed:{e}')

    # Any requested step that failed downgrades the run to 'partial', even when
    # another optional format succeeded (e.g. DOCX ok but PDF failed).
    failed_steps = [k for k, v in metadata['steps'].items() if v == 'failed']
    if metadata['steps']['txt'] != 'success':
        metadata['status'] = 'failed'
    elif failed_steps:
        metadata['status'] = 'partial'
    else:
        metadata['status'] = 'success'

    if args.json:
        # Record the JSON path before writing so the file lists itself in 'paths'.
        metadata['paths']['json'] = str(json_path)
        write_json(metadata, json_path)

    print(f'TXT={txt_path}')
    print(f'TITLE={doc["title"]}')
    print(f'SOURCE={doc["source"]}')
    print(f'TEXT_SOURCE={doc["text_source"]}')
    print(f'QUALITY={doc["content_quality"]["quality"]}')
    print(f'WORDS={doc["content_quality"]["word_count"]}')
    print(f'STATUS={metadata["status"]}')
    if args.json:
        print(f'JSON={json_path}')

    if args.docx and 'docx' in metadata['paths'] and metadata['steps']['docx'] == 'success':
        print(f'DOCX={metadata["paths"]["docx"]}')

    if args.pdf and 'pdf' in metadata['paths'] and metadata['steps']['pdf'] == 'success':
        print(f'PDF={metadata["paths"]["pdf"]}')

    if metadata['warnings']:
        print('WARNINGS=' + ' | '.join(metadata['warnings']))


if __name__ == '__main__':
    try:
        main()
    except subprocess.CalledProcessError as e:
        sys.stderr.write(e.stderr or str(e))
        sys.exit(e.returncode or 1)
    except Exception as e:
        sys.stderr.write(str(e) + '\n')
        sys.exit(1)