@clawhub-lilw-yezi-1505a4903f
Export webpages into clean local TXT, DOCX, and PDF files with source metadata, fallback extraction logic, and browser-assisted recovery for difficult pages....
---
name: webpage-export
description: Export webpages into clean local TXT, DOCX, and PDF files with source metadata, fallback extraction logic, and browser-assisted recovery for difficult pages. Useful for archiving articles, policy pages, WeChat posts, official notices, and other webpages before downstream analysis or sharing.
---
# Webpage Export
Use this skill to turn a webpage URL into local files that downstream agents can archive, send, or reference.
## Core workflow
1. Run `scripts/export_webpage.py <url>` to create a TXT snapshot first.
2. Treat TXT as the baseline extracted record.
3. Add `--docx` when the user wants a Word document.
4. Add `--pdf` when Chrome/Chromium is available and the user wants a PDF.
5. Keep the generated JSON metadata file; it records extraction quality, paths, warnings, and partial-failure status for downstream agents.
6. Save outputs to an explicit `--outdir` when the user provides one; otherwise let the script use its local default export folder under the current working directory.
7. For accuracy-sensitive work, keep original title, original URL, and extracted source metadata.
## Commands
### TXT only
```bash
python3 scripts/export_webpage.py "<url>"
```
### TXT + DOCX
```bash
python3 scripts/export_webpage.py "<url>" --docx
```
### TXT + PDF
```bash
python3 scripts/export_webpage.py "<url>" --pdf
```
### TXT + DOCX + PDF with explicit output folder
```bash
python3 scripts/export_webpage.py "<url>" --docx --pdf --outdir ./exports/temp
```
## Runtime requirements
- Requires `python3`.
- Requires `curl` for baseline webpage fetching.
- PDF export requires Chrome or Chromium.
- Browser-assisted fallback requires `node` and the `playwright` package.
- DOCX export on macOS requires `textutil`.
## Safety and execution notes
- This skill fetches arbitrary URLs and may use a headless browser for difficult pages.
- Browser-assisted fallback executes page JavaScript and should be used only when needed.
- Prefer explicit `--outdir` values for production or shared environments.
## What the script does
- Fetch the page with `curl`
- Extract title/source/publish-time when available
- Try multiple body candidates before falling back to a full-page text snapshot
- Score extraction quality and emit warnings for suspicious/partial results
- Strip HTML into readable text for a TXT snapshot
- Convert TXT to DOCX using `textutil` on macOS
- Render webpage to PDF using Chrome/Chromium headless printing when available
- Emit a JSON metadata file with status, paths, word count, quality, and warnings
## Format choice
- Prefer **TXT** as the baseline extracted record.
- Prefer **DOCX** when the user wants an editable or shareable document.
- Prefer **PDF** when the user wants page-like rendering or easier direct viewing.
- For important work, do not treat PDF as the only source of truth.
## Chrome/Chromium PDF path
When the user wants PDF, prefer Chrome/Chromium headless printing because it preserves Chinese text and webpage layout better than ad-hoc PDF generation.
Read `references/chrome-pdf-guide.md` when:
- you need the exact Chrome PDF logic
- PDF output is incomplete or suspicious
- Chrome emits warnings and you need to judge whether the result is still usable
- you need fallback decisions
## Accuracy and fallbacks
Read `references/accuracy-and-fallbacks.md` when:
- source accuracy matters
- webpage metadata is incomplete
- a field cannot be extracted cleanly
- you need fallback behavior after a partial extraction
## Delivery decisions
Read `references/delivery-rules.md` when:
- deciding whether to deliver TXT, DOCX, PDF, or a combination
- preparing files for downstream agents or user delivery
- choosing archive placement under the local workspace
## Limitations
- Some highly dynamic or anti-bot pages may extract only partially.
- PDF depends on Chrome/Chromium being installed.
- DOCX depends on macOS `textutil`.
- If a page is blocked in lightweight fetch mode, use this skill's curl-based extraction path before giving up.
## Accuracy rule
Accuracy is the top standard. Keep original title, original URL, and extracted source metadata. If any field is uncertain, mark it as missing instead of guessing.
FILE:references/accuracy-and-fallbacks.md
# Accuracy and Fallbacks
## Accuracy rules
- Preserve the original title whenever possible; do not rewrite it for storage.
- Preserve the original URL.
- Keep source/account/publisher separate from your own summary.
- If publish time is not directly extractable, mark it as missing instead of guessing.
- Treat PDF generation as a rendering task, not a content-understanding task.
## Export order
1. Export TXT first. This is the ground-truth extracted text snapshot.
2. Export DOCX next when the user wants an editable/shareable document.
3. Export PDF with Chrome/Chromium when the user wants page-like fidelity.
## Fallbacks
- If PDF fails, still keep TXT and DOCX.
- If webpage extraction is partial, keep the URL and note that extraction needs browser/manual review.
- If the page is dynamic or anti-bot protected, use browser-assisted inspection before claiming the page is unavailable.
- Treat `STATUS=partial` as usable-but-incomplete output: keep the TXT and metadata JSON, then decide whether a browser-assisted retry is worth it.
- Use the JSON metadata file as the authoritative machine-readable summary of success, partial success, warnings, and output paths.
- Prefer browser-rendered DOM fallback when static HTML extraction is too short or obviously shell-like; record whether fallback was used and do not hide the difference between static and browser-rendered text.
## Storage
Default archive root:
a local `webpage-exports/` folder under the current working directory, unless the task explicitly sets another output path.
Recommended subdirectories:
- `raw/` for original downloads
- `processed/` for cleaned outputs
- `temp/` for testing or intermediate exports
FILE:references/chrome-pdf-guide.md
# Chrome PDF Guide
## Why prefer Chrome/Chromium for PDF
Use Chrome/Chromium headless printing as the preferred PDF path when the goal is to preserve webpage layout with better Chinese rendering and fewer乱码 issues than ad-hoc PDF generation.
Compared with plain text PDF generation, Chrome PDF is better for:
- WeChat articles
- Official policy pages
- Long-form web articles
- Pages where the user expects visual page fidelity
## Standard command pattern
The script wraps this pattern internally:
```bash
'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome' \
--headless=new \
--disable-gpu \
--no-first-run \
--virtual-time-budget=15000 \
--print-to-pdf=/path/to/output.pdf \
'<url>'
```
## When to use Chrome PDF
Use Chrome PDF when:
- the user explicitly asks for PDF
- the page contains Chinese text and plain PDF generation caused乱码 before
- the page is primarily a normal article/detail page
- layout fidelity matters more than editability
## When NOT to rely on Chrome PDF alone
Do not rely on Chrome PDF alone when:
- you still need structured text extraction for summarization or database fields
- the page is heavily dynamic and content may not finish loading in time
- the page is access-controlled and requires login/cookies not available in headless mode
- the page contains attachments that should be downloaded separately instead of merely rendered
In those cases:
- still keep TXT as the extraction baseline
- optionally keep DOCX for editable delivery
- treat PDF as one output, not the only source of truth
## Required companion outputs
For accuracy-sensitive work, keep at least:
- TXT snapshot
- original URL
- title/source metadata
For user delivery, optionally add:
- DOCX
- PDF
## Common issues
### 1. PDF generated successfully but content is incomplete
Likely cause:
- page did not finish rendering before print
Action:
- increase `--virtual-time-budget`
- re-run once
- if still incomplete, inspect the page with browser-assisted workflow
### 2. Chrome prints the shell page but not the article body
Likely cause:
- dynamic rendering, anti-bot behavior, or blocked resources
Action:
- keep TXT extraction result
- record that browser/manual review is needed
- do not pretend the PDF is complete if the article body is missing
### 3. Chrome path not found
Check these common locations:
- `/Applications/Google Chrome.app/Contents/MacOS/Google Chrome`
- `/Applications/Chromium.app/Contents/MacOS/Chromium`
If neither exists:
- install Chrome/Chromium
- fall back to TXT/DOCX until PDF path is restored
### 4. Console warnings appear during headless print
Some Chrome/macOS warnings do not block PDF generation.
Judge by outcome:
- if PDF file is generated and content is correct, treat as success
- if file is missing or content is broken, treat as failure and fall back
## Success criteria
A Chrome-generated PDF is only considered valid when:
- the file exists
- Chinese text renders correctly
- the main body is present
- title/source can still be matched against the original page
If any of these fail, keep TXT/DOCX and mark PDF as failed or incomplete.
FILE:references/delivery-rules.md
# Delivery Rules
## Output priority
Choose outputs based on the real request, but follow this priority logic:
1. TXT = baseline extraction record
2. DOCX = editable/shareable output
3. PDF = layout-preserving output
For accuracy-sensitive work, do not deliver only PDF without keeping TXT or source metadata.
## Recommended combinations
### If the user says "extract webpage content"
Deliver:
- TXT
- extracted metadata summary in chat if useful
### If the user says "make it a Word file"
Deliver:
- TXT
- DOCX
### If the user says "make it a PDF"
Deliver:
- TXT
- PDF
### If the user says "archive this webpage"
Deliver/store:
- TXT
- original URL
- metadata
- optionally DOCX/PDF
## Accuracy first
Never optimize for format over correctness.
If PDF looks pretty but content is incomplete, prefer the TXT/DOCX as the trustworthy output and explicitly note the PDF issue.
## Storage guidance
Recommended root:
`./exports/` or another explicit folder chosen for the task.
Typical usage:
- `temp/` for tests and intermediate exports
- `raw/` for original download artifacts
- `processed/` for cleaned and finalized deliverables
FILE:scripts/export_webpage.py
#!/usr/bin/env python3
import argparse
import html
import json
import os
import pathlib
import re
import subprocess
import sys
from shutil import which
from typing import Optional
UA = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36'
DEFAULT_ARCHIVE = pathlib.Path.cwd() / 'webpage-exports'
def run(cmd):
return subprocess.run(cmd, check=True, capture_output=True, text=True)
def fetch_html(url: str) -> str:
result = run(['curl', '-L', '--max-time', '30', '-A', UA, url])
return result.stdout
def pick(pattern: str, text: str) -> str:
m = re.search(pattern, text, re.S | re.I)
return m.group(1).strip() if m else ''
def html_to_text(fragment: str) -> str:
fragment = re.sub(r'<script.*?</script>', '', fragment, flags=re.S | re.I)
fragment = re.sub(r'<style.*?</style>', '', fragment, flags=re.S | re.I)
fragment = re.sub(r'<noscript.*?</noscript>', '', fragment, flags=re.S | re.I)
fragment = re.sub(r'<br\s*/?>', '\n', fragment, flags=re.I)
fragment = re.sub(r'</p>|</section>|</div>|</li>|</h\d>|</tr>|</article>|</main>', '\n', fragment, flags=re.I)
fragment = re.sub(r'<li[^>]*>', '• ', fragment, flags=re.I)
fragment = re.sub(r'<[^>]+>', '', fragment)
fragment = html.unescape(fragment)
fragment = fragment.replace('\xa0', ' ')
fragment = re.sub(r'\n{3,}', '\n\n', fragment)
fragment = re.sub(r'[ \t]+', ' ', fragment)
return fragment.strip()
def extract_title(raw: str) -> str:
return (
pick(r'<meta property="og:title" content="(.*?)"', raw)
or pick(r'<meta name="twitter:title" content="(.*?)"', raw)
or pick(r'<title[^>]*>(.*?)</title>', raw)
or '网页内容提取'
)
def extract_source(raw: str) -> str:
return (
pick(r'id="js_name">\s*(.*?)\s*</a>', raw)
or pick(r'<meta name="author" content="(.*?)"', raw)
or pick(r'<meta property="og:article:author" content="(.*?)"', raw)
or pick(r'<meta property="article:author" content="(.*?)"', raw)
or ''
)
def extract_publish(raw: str) -> str:
return (
pick(r'id="publish_time"[^>]*>\s*(.*?)\s*</', raw)
or pick(r'<meta property="article:published_time" content="(.*?)"', raw)
or pick(r'<time[^>]*datetime="(.*?)"', raw)
or ''
)
def body_candidates():
return [
('wechat-js-content', r'<div class="rich_media_content js_underline_content[^"]*"\s+id="js_content"[^>]*>(.*?)</div>'),
('article', r'<article[^>]*>(.*?)</article>'),
('main', r'<main[^>]*>(.*?)</main>'),
('post-content', r'<div[^>]+class="[^"]*(?:post-content|article-content|content-detail|content-main|entry-content|content-wrapper|detail-content)[^"]*"[^>]*>(.*?)</div>'),
('body', r'<body[^>]*>(.*?)</body>'),
]
NOISE_PATTERNS = [
r'点击上方.*关注',
r'上一篇',
r'下一篇',
r'责任编辑',
r'版权所有',
r'ICP备',
]
def assess_content(text: str, title: str):
warnings = []
score = 0
word_count = len(text)
lowered = text.lower()
if word_count >= 800:
score += 3
elif word_count >= 300:
score += 2
elif word_count >= 120:
score += 1
else:
warnings.append('content_too_short')
paragraphs = [p.strip() for p in text.split('\n') if p.strip()]
long_paragraphs = [p for p in paragraphs if len(p) >= 40]
if len(long_paragraphs) >= 3:
score += 2
elif len(long_paragraphs) >= 1:
score += 1
else:
warnings.append('few_long_paragraphs')
title_tokens = [t for t in re.split(r'[\s\-_::,,()()]+', title) if len(t) >= 2]
if title_tokens and any(tok in text for tok in title_tokens[:4]):
score += 1
else:
warnings.append('title_tokens_not_found_in_text')
noise_hits = sum(1 for pat in NOISE_PATTERNS if re.search(pat, text, re.I))
if noise_hits >= 3 and word_count < 400:
warnings.append('possible_shell_or_noise_heavy_page')
if 'javascript' in lowered and word_count < 120:
warnings.append('possible_dynamic_page')
if word_count < 120 or 'possible_shell_or_noise_heavy_page' in warnings:
quality = 'low'
elif word_count < 300 or len(warnings) >= 2:
quality = 'medium'
else:
quality = 'high'
needs_browser_review = quality == 'low' or 'possible_dynamic_page' in warnings
return {
'word_count': word_count,
'quality': quality,
'warnings': warnings,
'needs_browser_review': needs_browser_review,
'paragraph_count': len(paragraphs),
'long_paragraph_count': len(long_paragraphs),
'score': score,
}
def extract_best_content(raw: str, title: str):
best = {
'candidate': '',
'content': '',
'quality': {
'word_count': 0,
'score': -1,
'quality': 'low',
'warnings': ['no_content'],
'needs_browser_review': True,
'paragraph_count': 0,
'long_paragraph_count': 0,
},
}
for candidate_name, pat in body_candidates():
body = pick(pat, raw)
if not body:
continue
body_text = html_to_text(body)
quality = assess_content(body_text, title)
if quality['score'] > best['quality']['score']:
best = {
'candidate': candidate_name,
'content': body_text,
'quality': quality,
}
if not best['content']:
fallback_text = html_to_text(raw)
quality = assess_content(fallback_text, title)
best = {
'candidate': 'full-html-fallback',
'content': fallback_text,
'quality': quality,
}
return best
def chrome_path() -> str:
candidates = [
'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
'/Applications/Chromium.app/Contents/MacOS/Chromium',
]
for c in candidates:
if pathlib.Path(c).exists():
return c
return ''
def fetch_rendered_dom(url: str, virtual_time_budget: int) -> str:
chrome = chrome_path()
if not chrome:
raise RuntimeError('未找到 Chrome/Chromium,无法获取浏览器渲染后的 DOM')
result = run([
chrome,
'--headless=new',
'--disable-gpu',
'--no-first-run',
f'--virtual-time-budget={virtual_time_budget}',
'--dump-dom',
url,
])
return result.stdout
def fetch_browser_visible_text(url: str, out_path: pathlib.Path, virtual_time_budget: int):
chrome = chrome_path()
if not chrome:
raise RuntimeError('未找到 Chrome/Chromium,无法执行浏览器文本提取')
script = """
const fs = require('fs');
(async () => {
const url = process.argv[1];
const out = process.argv[2];
const chrome = process.env.CHROME_BIN;
const { chromium } = require('playwright');
const browser = await chromium.launch({ headless: true, executablePath: chrome });
const page = await browser.newPage();
await page.goto(url, { waitUntil: 'networkidle', timeout: 45000 });
await page.evaluate(async () => {
await new Promise(r => setTimeout(r, 2000));
window.scrollTo(0, document.body.scrollHeight);
await new Promise(r => setTimeout(r, 1500));
window.scrollTo(0, 0);
});
const payload = await page.evaluate(() => {
const title = document.title || '';
const bodyText = (document.body && document.body.innerText) ? document.body.innerText : '';
const metas = {};
for (const el of document.querySelectorAll('meta')) {
const k = el.getAttribute('property') || el.getAttribute('name');
const v = el.getAttribute('content');
if (k && v) metas[k] = v;
}
return { title, bodyText, metas };
});
fs.writeFileSync(out, JSON.stringify(payload), 'utf8');
await browser.close();
})().catch(err => { console.error(String(err)); process.exit(1); });
"""
child_env = {
'PATH': os.environ.get('PATH', ''),
'HOME': os.environ.get('HOME', ''),
'CHROME_BIN': chrome,
}
subprocess.run([
'node',
'-e',
script,
url,
str(out_path),
], check=True, env=child_env)
def extract(url: str, raw: str, enable_browser_fallback: bool, virtual_time_budget: int, browser_text_temp: Optional[pathlib.Path] = None):
title = extract_title(raw)
source = extract_source(raw)
publish = extract_publish(raw)
best = extract_best_content(raw, title)
text_source = 'static_html'
browser_fallback_used = False
browser_fallback_error = ''
if enable_browser_fallback and best['quality']['needs_browser_review']:
try:
rendered = fetch_rendered_dom(url, virtual_time_budget)
rendered_title = extract_title(rendered) or title
browser_best = extract_best_content(rendered, rendered_title)
if browser_best['quality']['score'] > best['quality']['score']:
title = rendered_title or title
source = extract_source(rendered) or source
publish = extract_publish(rendered) or publish
best = browser_best
text_source = 'browser_rendered_dom'
browser_fallback_used = True
except Exception as e:
browser_fallback_error = str(e)
if enable_browser_fallback and best['quality']['needs_browser_review'] and browser_text_temp is not None:
try:
fetch_browser_visible_text(url, browser_text_temp, virtual_time_budget)
browser_payload = json.loads(browser_text_temp.read_text(encoding='utf-8'))
browser_text = (browser_payload.get('bodyText') or '').strip()
browser_title = (browser_payload.get('title') or '').strip() or title
browser_quality = assess_content(browser_text, browser_title)
if browser_quality['score'] > best['quality']['score']:
title = browser_title
source = browser_payload.get('metas', {}).get('author', '') or source
publish = browser_payload.get('metas', {}).get('article:published_time', '') or publish
best = {
'candidate': 'browser-visible-text',
'content': browser_text,
'quality': browser_quality,
}
text_source = 'browser_visible_text'
browser_fallback_used = True
except Exception as e:
if browser_fallback_error:
browser_fallback_error += ' || '
browser_fallback_error += f'browser_visible_text:{e}'
return {
'title': title,
'source': source,
'publish': publish,
'url': url,
'content': best['content'],
'content_candidate': best['candidate'],
'content_quality': best['quality'],
'text_source': text_source,
'browser_fallback_used': browser_fallback_used,
'browser_fallback_error': browser_fallback_error,
}
def sanitize_filename(name: str) -> str:
name = re.sub(r'[\\/:*?"<>|\n\r]+', '_', name)
name = re.sub(r'\s+', ' ', name).strip()
return name[:80] or 'webpage-export'
def build_text(doc: dict) -> str:
warnings = ', '.join(doc['content_quality']['warnings']) if doc['content_quality']['warnings'] else '无'
return (
f"标题:{doc['title']}\n"
f"来源:{doc['source'] or '未提取到'}\n"
f"发布时间:{doc['publish'] or '未直接提取到'}\n"
f"原始链接:{doc['url']}\n"
f"文本来源:{doc['text_source']}\n"
f"正文候选:{doc['content_candidate']}\n"
f"提取质量:{doc['content_quality']['quality']}\n"
f"字数:{doc['content_quality']['word_count']}\n"
f"告警:{warnings}\n\n"
f"正文:\n\n{doc['content']}\n"
)
def write_txt(text: str, path: pathlib.Path):
path.write_text(text, encoding='utf-8')
def write_json(payload: dict, path: pathlib.Path):
path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding='utf-8')
def write_docx(txt_path: pathlib.Path, docx_path: pathlib.Path):
if which('textutil') is None:
raise RuntimeError('textutil 不可用,无法生成 docx')
subprocess.run(['textutil', '-convert', 'docx', str(txt_path), '-output', str(docx_path)], check=True)
def write_pdf(url: str, pdf_path: pathlib.Path, virtual_time_budget: int):
chrome = chrome_path()
if not chrome:
raise RuntimeError('未找到 Chrome/Chromium,无法生成 PDF')
subprocess.run([
chrome,
'--headless=new',
'--disable-gpu',
'--no-first-run',
f'--virtual-time-budget={virtual_time_budget}',
f'--print-to-pdf={pdf_path}',
url,
], check=True)
def main():
ap = argparse.ArgumentParser()
ap.add_argument('url')
ap.add_argument('--outdir', default=str(DEFAULT_ARCHIVE / 'temp'))
ap.add_argument('--basename', default='')
ap.add_argument('--docx', action='store_true')
ap.add_argument('--pdf', action='store_true')
ap.add_argument('--json', action='store_true', help='always emit metadata json (default true)')
ap.add_argument('--no-json', dest='json', action='store_false')
ap.add_argument('--virtual-time-budget', type=int, default=15000)
ap.add_argument('--disable-browser-fallback', action='store_true')
ap.set_defaults(json=True)
args = ap.parse_args()
outdir = pathlib.Path(args.outdir)
outdir.mkdir(parents=True, exist_ok=True)
metadata = {
'url': args.url,
'status': 'failed',
'title': '',
'source': '',
'publish': '',
'content_candidate': '',
'content_quality': {},
'text_source': '',
'browser_fallback_used': False,
'browser_fallback_error': '',
'paths': {},
'warnings': [],
'steps': {
'txt': 'pending',
'docx': 'skipped',
'pdf': 'skipped',
},
}
raw = fetch_html(args.url)
browser_text_temp = outdir / '.browser_text_tmp.json'
doc = extract(args.url, raw, not args.disable_browser_fallback, args.virtual_time_budget, browser_text_temp=browser_text_temp)
basename = args.basename or sanitize_filename(doc['title'])
txt_path = outdir / f'{basename}.txt'
json_path = outdir / f'{basename}.json'
metadata.update({
'title': doc['title'],
'source': doc['source'],
'publish': doc['publish'],
'content_candidate': doc['content_candidate'],
'content_quality': doc['content_quality'],
'text_source': doc['text_source'],
'browser_fallback_used': doc['browser_fallback_used'],
'browser_fallback_error': doc['browser_fallback_error'],
'paths': {'txt': str(txt_path)},
})
metadata['warnings'].extend(doc['content_quality']['warnings'])
if doc['content_quality']['needs_browser_review']:
metadata['warnings'].append('needs_browser_review')
if doc['browser_fallback_used']:
metadata['warnings'].append('browser_fallback_used')
if doc['browser_fallback_error']:
metadata['warnings'].append(f'browser_fallback_error:{doc["browser_fallback_error"]}')
text = build_text(doc)
write_txt(text, txt_path)
metadata['steps']['txt'] = 'success'
if args.docx:
docx_path = outdir / f'{basename}.docx'
metadata['paths']['docx'] = str(docx_path)
try:
write_docx(txt_path, docx_path)
metadata['steps']['docx'] = 'success'
except Exception as e:
metadata['steps']['docx'] = 'failed'
metadata['warnings'].append(f'docx_failed:{e}')
if args.pdf:
pdf_path = outdir / f'{basename}.pdf'
metadata['paths']['pdf'] = str(pdf_path)
try:
write_pdf(args.url, pdf_path, args.virtual_time_budget)
if pdf_path.exists() and pdf_path.stat().st_size > 1024:
metadata['steps']['pdf'] = 'success'
else:
metadata['steps']['pdf'] = 'failed'
metadata['warnings'].append('pdf_failed:missing_or_too_small')
except Exception as e:
metadata['steps']['pdf'] = 'failed'
metadata['warnings'].append(f'pdf_failed:{e}')
succeeded = [k for k, v in metadata['steps'].items() if v == 'success']
if metadata['steps']['txt'] == 'success' and len(succeeded) == 1 and (args.docx or args.pdf):
metadata['status'] = 'partial'
elif metadata['steps']['txt'] == 'success':
metadata['status'] = 'success'
else:
metadata['status'] = 'failed'
if args.json:
write_json(metadata, json_path)
metadata['paths']['json'] = str(json_path)
print(f'TXT={txt_path}')
print(f'TITLE={doc["title"]}')
print(f'SOURCE={doc["source"]}')
print(f'TEXT_SOURCE={doc["text_source"]}')
print(f'QUALITY={doc["content_quality"]["quality"]}')
print(f'WORDS={doc["content_quality"]["word_count"]}')
print(f'STATUS={metadata["status"]}')
if args.json:
print(f'JSON={json_path}')
if args.docx and 'docx' in metadata['paths'] and metadata['steps']['docx'] == 'success':
print(f'DOCX={metadata["paths"]["docx"]}')
if args.pdf and 'pdf' in metadata['paths'] and metadata['steps']['pdf'] == 'success':
print(f'PDF={metadata["paths"]["pdf"]}')
if metadata['warnings']:
print('WARNINGS=' + ' | '.join(metadata['warnings']))
if __name__ == '__main__':
try:
main()
except subprocess.CalledProcessError as e:
sys.stderr.write(e.stderr or str(e))
sys.exit(e.returncode or 1)
except Exception as e:
sys.stderr.write(str(e) + '\n')
sys.exit(1)