@clawhub-huang888596-fadeb0acf8
---
name: paper-cluster-survey-v2-2
description: Extract structured paper records from one or more local PDFs, arXiv links, DOI links, or general paper URLs, then classify the papers and write an academic survey review. Use when the task involves mixed paper sources, URL-first literature collection, PDF-based review drafting, taxonomy building, or producing a formal literature review from a paper set. By default, provide one classification table and one integrated review for the full corpus; only write separate reviews for each category when the user explicitly asks for per-category reviews.
---
# Paper Cluster Survey V2.2
## Overview
Turn raw paper URLs and PDFs into usable review inputs. Extract structured metadata and text evidence first, then classify the papers, produce a classification table, and write a review that follows common academic survey conventions instead of a rigid fill-in-the-blanks template.
## Workflow
### 1. Normalize the source set
- Accept multiple local PDF paths, arXiv URLs, DOI URLs, and general paper URLs.
- Use `scripts/normalize-sources.mjs` when the source set is mixed or should be stored as a reusable manifest.
- Preserve the original source string for traceability.
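As a rough illustration of this step, the sketch below builds one manifest entry per input while keeping the untouched input string. `toManifestEntry` is a hypothetical helper written for this example, not the bundled script, and the entry shape is an assumption.

```javascript
// Sketch only: `toManifestEntry` is a hypothetical helper, not the bundled script.
function toManifestEntry(value, index) {
  const entry = {
    source_id: `paper-${String(index + 1).padStart(3, "0")}`,
    original: value, // untouched input, kept for traceability
  };
  let url = null;
  try {
    url = new URL(value);
  } catch {
    url = null; // not parseable as a URL, so treat it as a local path
  }
  if (url && (url.protocol === "http:" || url.protocol === "https:")) {
    // Point arXiv PDF links at the abstract page and drop any fragment.
    if (url.hostname === "arxiv.org" && url.pathname.startsWith("/pdf/")) {
      url.pathname = url.pathname.replace(/^\/pdf\//, "/abs/").replace(/\.pdf$/i, "");
    }
    url.hash = "";
    entry.kind = "url";
    entry.url = url.toString();
  } else {
    entry.kind = value.toLowerCase().endsWith(".pdf") ? "pdf" : "path";
    entry.path = value;
  }
  return entry;
}

const entries = [
  "https://arxiv.org/pdf/1234.5678.pdf#page=2",
  "paper1.pdf",
].map(toManifestEntry);
console.log(entries);
```

The canonical `url` can change, but `original` always records what the user actually supplied.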
### 2. Extract paper records before reasoning
- Use `scripts/extract-paper-records.mjs` to turn PDFs and URLs into structured records before classification.
- The extraction pass should gather as much of the following as possible:
  - `title`
  - `authors`
  - `year`
  - `venue`
  - `abstract`
  - `task`
  - `method`
  - `datasets`
  - `metrics`
  - `main_contribution`
  - `limitations`
  - `source`
  - `extraction_notes`
- Treat extracted records as the primary context for classification and survey drafting.
- If important fields are missing, fall back to direct source reading only for those specific details.
Read [extraction-pipeline.md](./references/extraction-pipeline.md) when deciding how much to trust the extracted fields and when to re-open the raw source.
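One way to decide which details need a targeted re-read is to list the fields that are still empty. The sketch below is a minimal illustration: `missingFields` and `isEmpty` are hypothetical helpers, and treating blank strings and empty arrays as "missing" is an assumption. `extraction_notes` is left out on purpose, since an empty notes list is not a gap.

```javascript
// Field names follow the list above; the emptiness rule is an assumption.
const RECORD_FIELDS = [
  "title", "authors", "year", "venue", "abstract", "task", "method",
  "datasets", "metrics", "main_contribution", "limitations", "source",
];

function isEmpty(value) {
  if (value === null || value === undefined) return true;
  if (typeof value === "string") return value.trim() === "";
  if (Array.isArray(value)) return value.length === 0;
  return false;
}

// Returns the names of fields that still need a targeted look at the raw source.
function missingFields(record) {
  return RECORD_FIELDS.filter((field) => isEmpty(record[field]));
}

const record = {
  title: "Example Paper",
  authors: ["A. Author"],
  year: 2021,
  abstract: "",
  datasets: [],
  source: "paper1.pdf",
};
console.log(missingFields(record));
```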
### 3. Verify evidence quality
- Do not classify from titles alone when abstract or body text is available.
- Prefer the abstract, introduction, and method sections as evidence.
- Mark uncertain inferences explicitly.
- If the extractor had to fall back to weak methods, keep claims conservative.
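A possible way to turn the last rule into a check is to map each record's `extraction_method` to a coarse trust level. The method strings below are the ones the bundled extractor writes; the mapping itself and the helper names are assumptions for illustration, and unknown or error methods are treated as having no trust.

```javascript
// Assumed mapping from `extraction_method` values to a coarse trust level.
const TRUST_BY_METHOD = new Map([
  ["pdftotext", "high"],
  ["mutool", "high"],
  ["python3+pypdf", "high"],
  ["html-metadata", "medium"],
  ["local-html", "medium"],
  ["local-text", "low"],
  ["strings", "low"],
]);

function trustLevel(record) {
  // Error markers such as "http-error" or "missing-file" fall through to "none".
  return TRUST_BY_METHOD.get(record.extraction_method) ?? "none";
}

// True when claims about this paper should stay conservative.
function needsConservativeClaims(record) {
  const level = trustLevel(record);
  return level === "low" || level === "none";
}

console.log(trustLevel({ extraction_method: "pdftotext" }));
console.log(needsConservativeClaims({ extraction_method: "strings" }));
```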
### 4. Design the classification scheme
- Produce a classification scheme before writing the review.
- Prefer task-based categories first.
- If tasks are too similar, classify by method family.
- Use application-domain categories only when they best explain the corpus.
- Keep the taxonomy shallow unless the corpus is large.
- Assign one primary category to each paper unless the user explicitly wants multi-label grouping.
Read [taxonomy-guidelines.md](./references/taxonomy-guidelines.md) when the category design is ambiguous.
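The priority order can be approximated in code. In the sketch below, "tasks are too similar" is reduced to "fewer than two distinct `task` values", which is a crude assumption; `chooseBasis` and `distinctCount` are hypothetical helpers, and the final fallback should be confirmed by hand.

```javascript
// Counts distinct, non-empty values of one field across the records.
function distinctCount(records, field) {
  const values = records
    .map((record) => String(record[field] || "").trim().toLowerCase())
    .filter(Boolean);
  return new Set(values).size;
}

// Hypothetical heuristic: task first, then method family, then application domain.
function chooseBasis(records) {
  if (distinctCount(records, "task") >= 2) return "task";
  if (distinctCount(records, "method") >= 2) return "method";
  return "application"; // last resort; confirm that it really explains the corpus
}

const sample = [
  { task: "object detection", method: "transformer" },
  { task: "Object Detection", method: "CNN" },
];
console.log(chooseBasis(sample));
```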
### 5. Output the classification table
- Always provide one classification table before the review.
- Include:
  - paper
  - year
  - category
  - rationale
  - evidence used
- Keep rationales brief and evidence-based.
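A table with these columns could be rendered along the following lines. This is a sketch with hypothetical names (`renderClassificationTable`, `escapeCell`); the row shape (`paper`, `year`, `category`, `rationale`, `evidence`) is assumed from the column list above.

```javascript
// Escape pipes and collapse line breaks so a cell cannot break the table layout.
function escapeCell(value) {
  return String(value ?? "").replace(/\|/g, "\\|").replace(/\s*\n\s*/g, " ").trim();
}

// Hypothetical renderer; column order follows the list above.
function renderClassificationTable(rows) {
  const header = ["Paper", "Year", "Category", "Rationale", "Evidence"];
  const lines = [
    `| ${header.join(" | ")} |`,
    `| ${header.map(() => "---").join(" | ")} |`,
  ];
  for (const row of rows) {
    const cells = [row.paper, row.year, row.category, row.rationale, (row.evidence || []).join("; ")];
    lines.push(`| ${cells.map((cell) => escapeCell(cell)).join(" | ")} |`);
  }
  return lines.join("\n");
}

const table = renderClassificationTable([
  { paper: "A|B", year: 2020, category: "Detection", rationale: "Task stated in abstract", evidence: ["abstract"] },
]);
console.log(table);
```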
### 6. Decide the review shape
Default rule:
- Write one integrated literature review for the entire corpus after the classification table.
Exception:
- If the user explicitly asks for "each category write a survey", "分别写综述", "per-category review", or equivalent, write separate review sections for each category.
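The default-versus-exception rule can be sketched as a small check on the request text. The phrases come from the rule above; the regular expressions and the `reviewShape` name are assumptions, and pattern matching will not catch every equivalent phrasing, so treat a miss as "ask or use judgment" rather than proof.

```javascript
// Phrases taken from the rule above; the patterns are illustrative only.
const PER_CATEGORY_PATTERNS = [
  /each category write a survey/i,
  /per[- ]category review/i,
  /分别写综述/,
];

// Returns "per-category" only on an explicit match; otherwise the integrated default.
function reviewShape(userRequest) {
  const text = String(userRequest || "");
  return PER_CATEGORY_PATTERNS.some((pattern) => pattern.test(text))
    ? "per-category"
    : "integrated";
}

console.log(reviewShape("请对每一类分别写综述"));
console.log(reviewShape("Summarize these papers"));
```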
### 7. Write the review in academic survey style
The review must read like a normal survey paper, not a bullet summary dump.
- Use a concise academic title.
- Include an abstract when the output is formal enough to justify it.
- Include keywords when they help position the review.
- Use an introduction that explains background, significance, scope, source selection, and the organizing logic of the review.
- Organize the main body by the most defensible basis for the corpus:
  - chronology
  - research themes
  - method families
  - viewpoints or schools
- End with a conclusion or concluding discussion.
- Add future directions, outlook, or open problems when the corpus supports them.
- List references in GB/T 7714 style when bibliographic data is available.
Typical sections in a strong review are:
- title
- abstract
- keywords
- introduction
- themed main sections
- discussion, conclusion, or both
- future directions or open problems when useful
- references
Not every output needs every section. Match the structure to the user's request, the corpus size, and the field while staying recognizably review-like.
Read [review-paper-style.md](./references/review-paper-style.md) when drafting the prose review or choosing section structure.
### 8. Keep the prose review-like
- Prefer connected academic prose over bullet dumps.
- Use tables only to support comparison, not to replace the review.
- Do not invent datasets, metrics, or reference details.
- If extracted metadata is incomplete, keep partial references and state what is missing.
## Output Contract
Return results in this order unless the user asks otherwise:
1. Corpus summary
2. Classification scheme
3. Classification table
4. Formal review article
5. References
If the user wants structured output, read [output-schema.md](./references/output-schema.md).
## Bundled Scripts
### `scripts/normalize-sources.mjs`
- Normalize mixed PDF and URL inputs into a JSON manifest.
- Use when the source set is large, mixed, or should be reused.
### `scripts/extract-paper-records.mjs`
- Fetch URLs, resolve likely paper metadata, and extract paper text evidence from URLs or PDFs.
- Prefer this script before asking the model to reason over a large source set.
- Use its output as the main context object for classification and review drafting.
### `scripts/render-formal-review-template.mjs`
- Render a flexible academic-review scaffold from structured paper records.
- Default to one integrated review.
- Use `--per-category` only when the user explicitly asks for separate category reviews.
## Quality Bar
- Run extraction before classification unless the user already gave structured paper records.
- Keep classification and review consistent with extracted evidence.
- Use raw source re-reading only to fill important gaps.
- If the extractor had to rely on weak fallbacks, say so.
FILE:agents/openai.yaml
interface:
  display_name: "Paper Cluster Survey 2.2"
  short_description: "Extract, classify, and review papers from PDFs or URLs"
  default_prompt: "Use $paper-cluster-survey-v2-2 to extract structured paper records from the provided PDFs or URLs, build a classification table, and write an academic survey review."
policy:
  allow_implicit_invocation: true
FILE:README.md
# Paper Cluster Survey
Extract, classify, and review academic papers from PDFs or URLs.
## Overview
Paper Cluster Survey is an OpenClaw skill that transforms raw paper sources (local PDFs, arXiv links, DOI links, or paper URLs) into structured metadata records, classifies them, and produces an academic survey review following common scholarly conventions.
## Features
- **Multi-source support**: Local PDFs, arXiv URLs, DOI links, and general paper URLs
- **Structured extraction**: Extracts title, authors, year, venue, abstract, methods, datasets, and metrics
- **Intelligent classification**: Task-based, method-based, or application-based categorization
- **Academic output**: Generates review articles in standard academic survey format
- **Evidence tracking**: Every classification includes source evidence
## Workflow
1. **Normalize sources** - Convert mixed inputs into a reusable JSON manifest
2. **Extract records** - Pull structured metadata from PDFs and URLs
3. **Verify quality** - Check extraction confidence levels
4. **Design taxonomy** - Create classification scheme based on research tasks or methods
5. **Generate review** - Write academic survey following scholarly conventions
## Usage
### Normalize sources
```bash
node scripts/normalize-sources.mjs --out sources.json paper1.pdf https://arxiv.org/abs/1234.5678
```
### Extract paper records
```bash
node scripts/extract-paper-records.mjs --manifest sources.json --out papers.json
```
### Render formal review
```bash
node scripts/render-formal-review-template.mjs --in papers.json --out review.md
```
For per-category reviews:
```bash
node scripts/render-formal-review-template.mjs --in papers.json --out review.md --per-category
```
## Output Structure
The skill produces:
1. **Corpus summary** - Overview of the paper collection
2. **Classification scheme** - Taxonomy design rationale
3. **Classification table** - Papers with categories and evidence
4. **Formal review** - Academic survey article
5. **References** - GB/T 7714 formatted bibliography
## Extraction Trust Levels
| Level | Source | Fields |
|-------|--------|--------|
| High | PDF text (pdftotext/mutool/pypdf) | Full metadata |
| Medium | HTML metadata tags | Title, abstract, authors |
| Low | Text fallback (strings) | Partial data only |
## Requirements
- Node.js 18+ (ES Modules)
- Optional: `pdftotext`, `mutool`, or `python3 + pypdf` for PDF extraction
## Project Structure
```
paper-cluster-survey-v2-2/
├── agents/
│   └── openai.yaml            # AI Agent configuration
├── references/
│   ├── extraction-pipeline.md
│   ├── output-schema.md
│   ├── review-paper-style.md
│   └── taxonomy-guidelines.md
├── scripts/
│   ├── extract-paper-records.mjs
│   ├── normalize-sources.mjs
│   └── render-formal-review-template.mjs
├── SKILL.md
└── README.md
```
## License
MIT License
## References
See the `references/` directory for detailed guidelines on extraction pipelines, output schemas, review paper style, and taxonomy design.
FILE:references/extraction-pipeline.md
# Extraction Pipeline
The bundled extractor is designed to reduce context load before classification and review drafting.
## What the extractor does
- Normalize local paths and URLs into source records
- Fetch HTML pages and capture paper-oriented metadata when present
- Follow likely PDF URLs for direct paper downloads
- Extract text from PDFs using the best available local tool
- Build structured paper records with provenance notes
## Extraction trust levels
Treat fields differently depending on how they were obtained:
- High trust:
  - citation meta tags from the source page
  - PDF text extracted by `pdftotext`, `mutool`, or `python3 + pypdf`
- Medium trust:
  - HTML title and meta description
  - year inferred from citation metadata or DOI landing pages
- Low trust:
  - title or abstract inferred from noisy text fallback
  - PDF text recovered only via `strings`
## When to reopen the raw source
Re-open the raw source when:
- title and abstract are missing or obviously wrong
- classification depends on fine-grained method differences
- the review needs exact datasets, metrics, or limitations
- the extractor reports fallback warnings
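These conditions can be collected into a list of reasons per record. The sketch below is illustrative: `reasonsToReopen` and the `needs` flags are hypothetical, "obviously wrong" values cannot be detected mechanically and are not covered, and treating any non-empty `extraction_notes` as a fallback warning is an assumption.

```javascript
// Hypothetical check; `needs` describes what the review requires from this paper.
function reasonsToReopen(record, needs = {}) {
  const reasons = [];
  if (!record.title || !record.abstract) {
    reasons.push("title or abstract missing");
  }
  if (needs.fineGrainedMethod) {
    reasons.push("classification depends on fine-grained method differences");
  }
  if (needs.exactDetails) {
    reasons.push("review needs exact datasets, metrics, or limitations");
  }
  if ((record.extraction_notes || []).length > 0) {
    reasons.push("extractor reported fallback warnings");
  }
  return reasons;
}

const reasons = reasonsToReopen(
  { title: "Example Paper", abstract: "", extraction_notes: ["Fell back to strings-based extraction; text quality may be poor."] },
  { exactDetails: true },
);
console.log(reasons);
```

An empty result means the extracted record is probably sufficient on its own.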
## Common extractor outputs
The extractor tries to populate:
- `title`
- `authors`
- `year`
- `venue`
- `abstract`
- `text_excerpt`
- `source`
- `extraction_method`
- `extraction_notes`
Use the notes field as part of quality control. If the extractor had to degrade, say so in the final analysis when it matters.
FILE:references/output-schema.md
# Output Schema
Use this when the user asks for a strict structure or when the review will be reused downstream.
## Default deliverable order
1. Corpus summary
2. Classification scheme
3. Classification table
4. Formal literature review
5. References
## Classification table
| Paper | Year | Category | Rationale | Evidence |
| --- | --- | --- | --- | --- |
## Typical review shape
This is a common review-paper shape, not a mandatory template. Adapt section names and ordering when another structure is more natural for the corpus.
```md
# <Title>
## 摘要
## 关键词
## 引言
## <按时间、主题、方法或观点组织的主体部分>
### ...
### ...
## 讨论 / 结论
## 展望 / 未来研究方向
## 参考文献
[1] ...
```
## JSON shape
```json
{
  "corpus_summary": {
    "total_papers": 0,
    "time_span": {"start": null, "end": null},
    "dominant_topics": [],
    "notes": []
  },
  "classification_scheme": {
    "basis": "task|method|application",
    "rationale": "",
    "categories": [
      {
        "name": "",
        "definition": ""
      }
    ]
  },
  "classification_table": [
    {
      "paper": "",
      "year": null,
      "category": "",
      "rationale": "",
      "evidence": []
    }
  ],
  "papers": [
    {
      "source_id": "paper-001",
      "title": "",
      "authors": [],
      "year": null,
      "venue": "",
      "abstract": "",
      "task": "",
      "method": "",
      "datasets": [],
      "metrics": [],
      "main_contribution": "",
      "limitations": "",
      "source": "",
      "extraction_notes": []
    }
  ],
  "review": {
    "title": "",
    "abstract": "",
    "keywords": [],
    "introduction": "",
    "body_sections": [
      {
        "heading": "",
        "content": ""
      }
    ],
    "discussion": "",
    "conclusion": "",
    "future_directions": "",
    "references": []
  }
}
```
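Before passing structured output downstream, a light sanity check can catch the most common slips. The sketch below is a hypothetical validator, not part of the bundle: it only checks that the top-level keys exist and that every row in `classification_table` uses a category defined in `classification_scheme`.

```javascript
// Hypothetical validator: checks only the top-level keys and category consistency.
const REQUIRED_KEYS = ["corpus_summary", "classification_scheme", "classification_table", "papers", "review"];

function validateOutput(output) {
  const problems = [];
  for (const key of REQUIRED_KEYS) {
    if (!(key in output)) problems.push(`missing key: ${key}`);
  }
  const scheme = output.classification_scheme || {};
  const known = new Set((scheme.categories || []).map((category) => category.name));
  for (const row of output.classification_table || []) {
    if (!known.has(row.category)) {
      problems.push(`unknown category "${row.category}" for paper "${row.paper}"`);
    }
  }
  return problems;
}

const problems = validateOutput({
  corpus_summary: {},
  classification_scheme: { basis: "task", rationale: "", categories: [{ name: "Detection", definition: "" }] },
  classification_table: [
    { paper: "Paper A", year: 2020, category: "Detection", rationale: "", evidence: [] },
    { paper: "Paper B", year: 2021, category: "Tracking", rationale: "", evidence: [] },
  ],
  papers: [],
});
console.log(problems);
```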
FILE:references/review-paper-style.md
# Review Paper Style
The review should read like a survey article, not like a notebook or meeting memo.
## Common structure
Most academic reviews share these elements, but the exact section names and order vary by field, journal, and language:
- title
- abstract
- keywords
- introduction
- several main sections organized around a clear logic
- discussion, conclusion, or both
- future directions or open problems when useful
- references
Treat this as a style guide, not a rigid template.
## Section guidance
### Title
- Summarize the topic, scope, and object of study.
- Keep it concise and academic.
### Abstract
- Use when the output is intended to resemble a paper or formal report.
- Briefly cover background, purpose, scope, main findings, and outlook.
### Keywords
- Usually provide 3-6 keywords.
- Prefer field terms, method terms, and task terms.
### Introduction
Usually cover:
- research background and significance
- scope of the review
- time span and source selection if relevant
- organizing logic of the review
### Main body
Choose the organizing principle that best fits the corpus:
- chronological development
- research themes
- methods
- viewpoints or schools
The body should usually include:
- research status
- important results, theories, and methods
- common ground across studies
- disagreements and divergences
- weaknesses and gaps
### Discussion and conclusion
- Use one combined section or two separate sections depending on the length and formality of the review.
- Summarize the overall research status.
- Point out the main unresolved problems, contradictions, and gaps in the literature.
### Future directions
- Include when the corpus supports forward-looking synthesis.
- Keep this grounded in the literature, not generic optimism.
### References
- Format references in GB/T 7714 style when metadata is sufficient.
- If fields are missing, keep the partial record and note the missing parts rather than inventing them.
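The "keep the partial record and note the missing parts" rule might look like this in code. This is a simplified sketch of a GB/T 7714-style entry, not a complete implementation: it ignores volume, issue, pages, and author-name formatting, and the `docType` marker (for example `J` for a journal article) is an assumed parameter because the paper records do not carry a document type. `formatReference` is a hypothetical name.

```javascript
// Simplified GB/T 7714-style entry that reports missing fields instead of inventing them.
function formatReference(record, index, docType = "J") {
  const missing = [];
  const authors = record.authors || [];
  let authorText = "";
  if (authors.length === 0) {
    missing.push("authors");
  } else {
    // List at most three authors, then abbreviate the rest.
    authorText = authors.slice(0, 3).join(", ") + (authors.length > 3 ? ", et al" : "");
  }
  if (!record.title) missing.push("title");
  if (!record.venue) missing.push("venue");
  if (!record.year) missing.push("year");

  const titleText = record.title ? `${record.title}[${docType}]` : "";
  const head = [authorText, titleText].filter(Boolean).join(". ");
  const tail = [record.venue, record.year].filter(Boolean).join(", ");
  const text = `[${index}] ${[head, tail].filter(Boolean).join(". ")}.`;
  return { text, missing };
}

const reference = formatReference(
  { authors: ["Li M", "Wang H", "Zhao Q", "Chen Y"], title: "Graph Methods for X", venue: "", year: 2022 },
  1,
);
console.log(reference.text);
console.log(reference.missing);
```

The `missing` list can be printed next to the entry so readers see which parts of the reference are incomplete.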
## Style rules
- Prefer complete paragraphs over bullet dumps.
- Use transition sentences between sections.
- Use tables only to support the prose review, not to replace it.
- Keep claims calibrated to the available evidence.
FILE:references/taxonomy-guidelines.md
# Taxonomy Guidelines
Use these rules to keep the classification defensible and useful for the review.
## Priority order
Prefer classification by:
1. Research task
2. Method family
3. Application domain
Only prioritize application domain when the user explicitly wants an application-focused survey.
## Design constraints
- Use one consistent basis across the whole corpus.
- Keep category names short and academically natural.
- Avoid mixed-level labels such as one category by task and another by dataset.
- Avoid vague buckets such as "other" or "miscellaneous" unless the corpus truly requires one residual bucket.
## Corpus size guidance
- 1-3 papers: keep the taxonomy shallow and note that the grouping is provisional.
- 4-12 papers: 2-5 categories are usually enough.
- Larger corpora: add one subcategory layer only when it clearly improves the review.
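The size thresholds can be encoded directly. In the sketch below the thresholds come from the guidance above, while the function name and return shape are hypothetical; `categoryRange` is `null` where the guidance gives no explicit range.

```javascript
// Thresholds come from the guidance above; the return shape is hypothetical.
function taxonomyGuidance(paperCount) {
  if (paperCount <= 3) {
    // Very small corpus: keep it shallow and flag the grouping as provisional.
    return { categoryRange: null, provisional: true, allowSubcategories: false };
  }
  if (paperCount <= 12) {
    return { categoryRange: [2, 5], provisional: false, allowSubcategories: false };
  }
  // Larger corpus: one subcategory layer is allowed when it clearly helps.
  return { categoryRange: null, provisional: false, allowSubcategories: true };
}

console.log(taxonomyGuidance(2));
console.log(taxonomyGuidance(8));
console.log(taxonomyGuidance(30));
```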
## Evidence order
Use these sources in order of trust:
1. Extracted abstract
2. Extracted introduction and method snippets
3. Extracted evaluation snippets
4. Title and metadata
If extraction quality is weak, reopen the raw source selectively instead of pretending the evidence is stronger than it is.
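The evidence order can be applied by walking the fields from most to least trusted and stopping at the first non-empty one. This is a sketch: `intro_method_snippets` and `evaluation_snippets` are hypothetical field names (the bundled extractor does not emit them), and `bestEvidence` is an illustrative helper.

```javascript
// Ordered from most to least trusted, following the list above.
const EVIDENCE_ORDER = ["abstract", "intro_method_snippets", "evaluation_snippets", "title"];

// Returns the strongest available evidence, or null when nothing usable exists.
function bestEvidence(record) {
  for (const field of EVIDENCE_ORDER) {
    const value = record[field];
    const text = Array.isArray(value) ? value.join(" ") : String(value || "");
    if (text.trim()) {
      // `titleOnly` flags classifications that rest on the weakest evidence.
      return { field, text: text.trim(), titleOnly: field === "title" };
    }
  }
  return null;
}

console.log(bestEvidence({ title: "Example Paper", abstract: "" }));
console.log(bestEvidence({ title: "Example Paper", abstract: "We study X." }));
```

A `titleOnly` result is a signal to reopen the raw source before committing to a category.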
FILE:scripts/extract-paper-records.mjs
#!/usr/bin/env node
import fs from "node:fs";
import os from "node:os";
import path from "node:path";
import process from "node:process";
import { spawnSync } from "node:child_process";
function printHelp() {
console.log(`Usage:
node scripts/extract-paper-records.mjs [--manifest FILE] [--out FILE] [source...]
node scripts/extract-paper-records.mjs --stdin [--out FILE]
Extract structured paper records from local PDFs, paper URLs, and HTML pages.
`);
}
function isHttpUrl(value) {
try {
const url = new URL(value);
return url.protocol === "http:" || url.protocol === "https:";
} catch {
return false;
}
}
function normalizeUrl(value) {
const url = new URL(value);
if (url.hostname === "arxiv.org" && url.pathname.startsWith("/pdf/")) {
url.pathname = url.pathname.replace(/^\/pdf\//, "/abs/").replace(/\.pdf$/i, "");
}
url.hash = "";
return url.toString();
}
function inferKind(value) {
if (isHttpUrl(value)) {
const lower = value.toLowerCase();
if (lower.endsWith(".pdf") || lower.includes("arxiv.org")) {
return "paper_url";
}
return "url";
}
if (value.toLowerCase().endsWith(".pdf")) {
return "pdf";
}
return "path";
}
function titleHintFromPath(value) {
return path.basename(value, path.extname(value))
.replace(/[_-]+/g, " ")
.replace(/\s+/g, " ")
.trim();
}
function normalizeSource(value, index) {
const kind = inferKind(value);
const record = {
source_id: `paper-${String(index + 1).padStart(3, "0")}`,
original: value,
kind,
};
if (kind === "pdf" || kind === "path") {
record.path = path.resolve(value);
record.exists = fs.existsSync(record.path);
record.title_hint = titleHintFromPath(value);
return record;
}
record.url = normalizeUrl(value);
record.title_hint = "";
return record;
}
function parseArgs(argv) {
const args = [];
let manifestFile = "";
let outFile = "";
let readStdin = false;
for (let i = 0; i < argv.length; i += 1) {
const arg = argv[i];
if (arg === "--help") {
printHelp();
process.exit(0);
}
if (arg === "--manifest") {
manifestFile = argv[i + 1] || "";
i += 1;
continue;
}
if (arg === "--out") {
outFile = argv[i + 1] || "";
i += 1;
continue;
}
if (arg === "--stdin") {
readStdin = true;
continue;
}
args.push(arg);
}
return { args, manifestFile, outFile, readStdin };
}
async function readStdinLines() {
const chunks = [];
for await (const chunk of process.stdin) {
chunks.push(chunk);
}
return Buffer.concat(chunks).toString("utf8")
.split(/\r?\n/)
.map((line) => line.trim())
.filter(Boolean);
}
function loadManifest(manifestFile) {
const raw = fs.readFileSync(manifestFile, "utf8");
const parsed = JSON.parse(raw);
if (Array.isArray(parsed)) {
return parsed;
}
if (Array.isArray(parsed.sources)) {
return parsed.sources;
}
throw new Error("Manifest JSON must be an array or an object with a sources array.");
}
function decodeEntities(value) {
// Decode "&amp;" last so that input such as "&amp;lt;" is not decoded twice.
return value
.replace(/&lt;/g, "<")
.replace(/&gt;/g, ">")
.replace(/&quot;/g, "\"")
.replace(/&#39;/g, "'")
.replace(/&amp;/g, "&");
}
function stripHtml(value) {
return decodeEntities(
value
.replace(/<script[\s\S]*?<\/script>/gi, " ")
.replace(/<style[\s\S]*?<\/style>/gi, " ")
.replace(/<[^>]+>/g, " ")
.replace(/\s+/g, " ")
.trim(),
);
}
function extractMetaTag(html, attrName, attrValue) {
const regex = new RegExp(`<meta[^>]*${attrName}=["']${attrValue}["'][^>]*content=["']([^"']+)["'][^>]*>`, "i");
const match = html.match(regex);
return match ? decodeEntities(match[1].trim()) : "";
}
function extractRepeatedMetaTag(html, attrName, attrValue) {
const regex = new RegExp(`<meta[^>]*${attrName}=["']${attrValue}["'][^>]*content=["']([^"']+)["'][^>]*>`, "ig");
const values = [];
let match;
while ((match = regex.exec(html)) !== null) {
values.push(decodeEntities(match[1].trim()));
}
return values;
}
function extractTitleFromHtml(html) {
const candidates = [
extractMetaTag(html, "name", "citation_title"),
extractMetaTag(html, "property", "og:title"),
].filter(Boolean);
if (candidates.length > 0) {
return candidates[0];
}
const titleMatch = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
return titleMatch ? stripHtml(titleMatch[1]) : "";
}
function extractAbstractFromHtml(html) {
const candidates = [
extractMetaTag(html, "name", "citation_abstract"),
extractMetaTag(html, "name", "description"),
extractMetaTag(html, "property", "og:description"),
].filter(Boolean);
return candidates[0] || "";
}
function extractYear(value) {
const match = String(value || "").match(/\b(19|20)\d{2}\b/);
return match ? Number(match[0]) : null;
}
function inferTitleFromText(text, titleHint = "") {
const lines = text.split(/\r?\n/).map((line) => line.trim()).filter(Boolean);
for (const line of lines.slice(0, 12)) {
if (line.length >= 12 && line.length <= 220 && !/^abstract$/i.test(line) && !/^(introduction|摘要)$/i.test(line)) {
return line;
}
}
return titleHint || "";
}
function extractAbstractFromText(text) {
const compact = text.replace(/\r/g, "");
const match = compact.match(/(?:^|\n)\s*(abstract|摘要)\s*[:：]?\s*\n?([\s\S]{40,2500}?)(?:\n\s*(keywords|index terms|1\.?\s+introduction|introduction|关键词)\b|$)/i);
if (!match) {
return "";
}
return match[2].replace(/\s+/g, " ").trim();
}
function extractAuthorsFromText(text) {
const lines = text.split(/\r?\n/).map((line) => line.trim()).filter(Boolean);
const candidates = lines.slice(1, 6);
for (const line of candidates) {
if (line.length < 8 || line.length > 180) {
continue;
}
if (/@/.test(line) || /\b(university|institute|school|college|laboratory|department)\b/i.test(line)) {
continue;
}
if (/,/.test(line) || /\band\b/i.test(line)) {
return line.split(/,|\band\b/).map((part) => part.trim()).filter(Boolean);
}
}
return [];
}
function summarizeExtraction(record, updates) {
return {
source_id: record.source_id,
source: record.url || record.path || record.original,
kind: record.kind,
title: updates.title || record.title_hint || "",
authors: updates.authors || [],
year: updates.year ?? null,
venue: updates.venue || "",
abstract: updates.abstract || "",
text_excerpt: updates.text_excerpt || "",
extraction_method: updates.extraction_method || "",
extraction_notes: updates.extraction_notes || [],
pdf_url: updates.pdf_url || "",
};
}
function runCommand(command, args) {
return spawnSync(command, args, {
encoding: "utf8",
maxBuffer: 10 * 1024 * 1024,
});
}
function extractPdfText(filePath, notes) {
const toolChecks = [
{ name: "pdftotext", args: [filePath, "-"] },
{ name: "mutool", args: ["draw", "-F", "txt", filePath] },
];
for (const tool of toolChecks) {
const exists = runCommand("sh", ["-lc", `command -v ${tool.name}`]);
if (exists.status !== 0) {
continue;
}
const result = runCommand(tool.name, tool.args);
if (result.status === 0 && result.stdout.trim()) {
return { method: tool.name, text: result.stdout };
}
notes.push(`${tool.name} was available but did not extract readable text.`);
}
const pythonCheck = runCommand("sh", ["-lc", "python3 -c \"import pypdf\""]);
if (pythonCheck.status === 0) {
const script = [
"from pypdf import PdfReader",
"import sys",
"reader = PdfReader(sys.argv[1])",
"chunks = []",
"for page in reader.pages[:5]:",
"    text = page.extract_text() or ''",
"    if text:",
"        chunks.append(text)",
"print('\\n'.join(chunks))",
].join("\n");
const result = runCommand("python3", ["-c", script, filePath]);
if (result.status === 0 && result.stdout.trim()) {
return { method: "python3+pypdf", text: result.stdout };
}
notes.push("python3+pypdf was available but did not extract readable text.");
}
const stringsResult = runCommand("strings", [filePath]);
if (stringsResult.status === 0 && stringsResult.stdout.trim()) {
notes.push("Fell back to strings-based extraction; text quality may be poor.");
return { method: "strings", text: stringsResult.stdout };
}
notes.push("No PDF text extractor succeeded.");
return { method: "unavailable", text: "" };
}
async function fetchSource(url) {
const response = await fetch(url, {
redirect: "follow",
headers: {
"user-agent": "paper-cluster-survey-v2-2/1.0",
accept: "text/html,application/pdf;q=0.9,*/*;q=0.8",
},
});
const contentType = response.headers.get("content-type") || "";
const buffer = Buffer.from(await response.arrayBuffer());
return {
url: response.url,
ok: response.ok,
status: response.status,
contentType,
buffer,
};
}
function writeTempPdf(buffer) {
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "paper-survey-"));
const filePath = path.join(tmpDir, "source.pdf");
fs.writeFileSync(filePath, buffer);
return { tmpDir, filePath };
}
async function extractFromPdfPath(record) {
const notes = [];
if (!record.exists) {
notes.push("Local file does not exist.");
return summarizeExtraction(record, {
title: record.title_hint,
extraction_method: "missing-file",
extraction_notes: notes,
});
}
const extracted = extractPdfText(record.path, notes);
const text = extracted.text;
return summarizeExtraction(record, {
title: inferTitleFromText(text, record.title_hint),
authors: extractAuthorsFromText(text),
year: extractYear(text),
abstract: extractAbstractFromText(text),
text_excerpt: text.slice(0, 4000).replace(/\s+/g, " ").trim(),
extraction_method: extracted.method,
extraction_notes: notes,
});
}
async function extractFromUrl(record) {
const notes = [];
try {
const fetched = await fetchSource(record.url);
if (!fetched.ok) {
notes.push(`HTTP ${fetched.status} while fetching source.`);
return summarizeExtraction(record, {
extraction_method: "http-error",
extraction_notes: notes,
});
}
if (/application\/pdf/i.test(fetched.contentType) || /\.pdf(?:$|\?)/i.test(fetched.url)) {
const { tmpDir, filePath } = writeTempPdf(fetched.buffer);
try {
const pdfRecord = await extractFromPdfPath({
...record,
path: filePath,
exists: true,
});
pdfRecord.source = fetched.url;
pdfRecord.pdf_url = fetched.url;
pdfRecord.extraction_notes = [...pdfRecord.extraction_notes, ...notes];
return pdfRecord;
} finally {
fs.rmSync(tmpDir, { recursive: true, force: true });
}
}
const html = fetched.buffer.toString("utf8");
const title = extractTitleFromHtml(html);
const abstract = extractAbstractFromHtml(html);
const authors = extractRepeatedMetaTag(html, "name", "citation_author");
const venue = extractMetaTag(html, "name", "citation_journal_title") || extractMetaTag(html, "name", "citation_conference_title");
const year = extractYear(
extractMetaTag(html, "name", "citation_publication_date")
|| extractMetaTag(html, "name", "citation_date")
|| html,
);
const pdfUrl = extractMetaTag(html, "name", "citation_pdf_url");
const textExcerpt = stripHtml(html).slice(0, 4000);
if (!title) {
notes.push("HTML title metadata was missing or weak.");
}
if (!abstract) {
notes.push("Abstract metadata was missing; only general page text was available.");
}
return summarizeExtraction(record, {
title,
authors,
year,
venue,
abstract,
text_excerpt: textExcerpt,
extraction_method: "html-metadata",
extraction_notes: notes,
pdf_url: pdfUrl,
});
} catch (error) {
notes.push(error instanceof Error ? error.message : String(error));
return summarizeExtraction(record, {
extraction_method: "fetch-error",
extraction_notes: notes,
});
}
}
async function extractRecord(record) {
if (record.kind === "pdf") {
return extractFromPdfPath(record);
}
if (record.kind === "path") {
const lower = (record.path || "").toLowerCase();
if (lower.endsWith(".html") || lower.endsWith(".htm")) {
const html = fs.readFileSync(record.path, "utf8");
return summarizeExtraction(record, {
title: extractTitleFromHtml(html) || record.title_hint,
authors: extractRepeatedMetaTag(html, "name", "citation_author"),
year: extractYear(html),
venue: extractMetaTag(html, "name", "citation_journal_title") || extractMetaTag(html, "name", "citation_conference_title"),
abstract: extractAbstractFromHtml(html),
text_excerpt: stripHtml(html).slice(0, 4000),
extraction_method: "local-html",
extraction_notes: [],
pdf_url: extractMetaTag(html, "name", "citation_pdf_url"),
});
}
const text = fs.existsSync(record.path) ? fs.readFileSync(record.path, "utf8") : "";
const notes = [];
if (!text) {
notes.push("Local path could not be read as text.");
}
return summarizeExtraction(record, {
title: inferTitleFromText(text, record.title_hint),
authors: extractAuthorsFromText(text),
year: extractYear(text),
abstract: extractAbstractFromText(text),
text_excerpt: text.slice(0, 4000).replace(/\s+/g, " ").trim(),
extraction_method: "local-text",
extraction_notes: notes,
});
}
return extractFromUrl(record);
}
async function main() {
const { args, manifestFile, outFile, readStdin } = parseArgs(process.argv.slice(2));
const stdinValues = readStdin ? await readStdinLines() : [];
const rawSources = [...args, ...stdinValues];
const sources = manifestFile ? loadManifest(manifestFile) : rawSources.map(normalizeSource);
if (sources.length === 0) {
printHelp();
process.exit(1);
}
const papers = [];
for (const source of sources) {
papers.push(await extractRecord(source));
}
const output = {
generated_at: new Date().toISOString(),
total_papers: papers.length,
papers,
};
const payload = `${JSON.stringify(output, null, 2)}\n`;
if (outFile) {
fs.writeFileSync(outFile, payload, "utf8");
} else {
process.stdout.write(payload);
}
}
main().catch((error) => {
console.error(error instanceof Error ? error.message : String(error));
process.exit(1);
});
FILE:scripts/normalize-sources.mjs
#!/usr/bin/env node
import fs from "node:fs";
import path from "node:path";
import process from "node:process";
function printHelp() {
console.log(`Usage:
node scripts/normalize-sources.mjs [--stdin] [--out FILE] <source...>
Normalize local paper PDFs and paper URLs into a JSON manifest.
`);
}
function isHttpUrl(value) {
try {
const url = new URL(value);
return url.protocol === "http:" || url.protocol === "https:";
} catch {
return false;
}
}
function normalizeUrl(value) {
const url = new URL(value);
if (url.hostname === "arxiv.org" && url.pathname.startsWith("/pdf/")) {
url.pathname = url.pathname.replace(/^\/pdf\//, "/abs/").replace(/\.pdf$/i, "");
}
url.hash = "";
return url.toString();
}
function inferKind(value) {
if (isHttpUrl(value)) {
const lower = value.toLowerCase();
if (lower.includes("arxiv.org") || lower.endsWith(".pdf")) {
return "paper_url";
}
return "url";
}
if (value.toLowerCase().endsWith(".pdf")) {
return "pdf";
}
return "path";
}
function titleHintFromPath(value) {
return path.basename(value, path.extname(value))
.replace(/[_-]+/g, " ")
.replace(/\s+/g, " ")
.trim();
}
function normalizeSource(value, index) {
const kind = inferKind(value);
const record = {
source_id: `paper-${String(index + 1).padStart(3, "0")}`,
original: value,
kind,
};
if (kind === "pdf" || kind === "path") {
record.path = path.resolve(value);
record.exists = fs.existsSync(record.path);
record.title_hint = titleHintFromPath(value);
return record;
}
record.url = normalizeUrl(value);
record.title_hint = "";
return record;
}
async function readStdinLines() {
const chunks = [];
for await (const chunk of process.stdin) {
chunks.push(chunk);
}
return Buffer.concat(chunks).toString("utf8")
.split(/\r?\n/)
.map((line) => line.trim())
.filter(Boolean);
}
function parseArgs(argv) {
const args = [];
let readStdin = false;
let outFile = "";
for (let i = 0; i < argv.length; i += 1) {
const arg = argv[i];
if (arg === "--help") {
printHelp();
process.exit(0);
}
if (arg === "--stdin") {
readStdin = true;
continue;
}
if (arg === "--out") {
outFile = argv[i + 1] || "";
i += 1;
continue;
}
args.push(arg);
}
return { args, readStdin, outFile };
}
async function main() {
const { args, readStdin, outFile } = parseArgs(process.argv.slice(2));
const stdinValues = readStdin ? await readStdinLines() : [];
const inputs = [...args, ...stdinValues];
if (inputs.length === 0) {
printHelp();
process.exit(1);
}
const manifest = {
generated_at: new Date().toISOString(),
total_sources: inputs.length,
sources: inputs.map(normalizeSource),
};
const payload = `${JSON.stringify(manifest, null, 2)}\n`;
if (outFile) {
fs.writeFileSync(outFile, payload, "utf8");
} else {
process.stdout.write(payload);
}
}
main().catch((error) => {
console.error(error instanceof Error ? error.message : String(error));
process.exit(1);
});
FILE:scripts/render-formal-review-template.mjs
#!/usr/bin/env node
import fs from "node:fs";
import process from "node:process";
function printHelp() {
console.log(`Usage:
node scripts/render-formal-review-template.mjs --in FILE [--out FILE] [--per-category]
Render a flexible academic-review scaffold from structured paper records.
Expected input: a JSON array of paper records or an object with a "papers" array.
`);
}
function parseArgs(argv) {
let inputFile = "";
let outFile = "";
let perCategory = false;
for (let i = 0; i < argv.length; i += 1) {
const arg = argv[i];
if (arg === "--help") {
printHelp();
process.exit(0);
}
if (arg === "--in") {
inputFile = argv[i + 1] || "";
i += 1;
continue;
}
if (arg === "--out") {
outFile = argv[i + 1] || "";
i += 1;
continue;
}
if (arg === "--per-category") {
perCategory = true;
}
}
if (!inputFile) {
printHelp();
process.exit(1);
}
return { inputFile, outFile, perCategory };
}
function readRecords(inputFile) {
const raw = fs.readFileSync(inputFile, "utf8");
const parsed = JSON.parse(raw);
if (Array.isArray(parsed)) {
return parsed;
}
if (Array.isArray(parsed.papers)) {
return parsed.papers;
}
throw new Error("Input JSON must be an array or an object with a papers array.");
}
function groupByCategory(records) {
const grouped = new Map();
for (const record of records) {
const category = record.category || "Uncategorized";
if (!grouped.has(category)) {
grouped.set(category, []);
}
grouped.get(category).push(record);
}
return grouped;
}
function renderClassificationTable(records) {
const lines = [];
lines.push("## 分类表");
lines.push("| 论文 | 年份 | 分类 | 分类理由 | 依据 |");
lines.push("| --- | --- | --- | --- | --- |");
for (const record of records) {
lines.push(`| ${record.title || "Untitled"} | ${record.year || ""} | ${record.category || "Uncategorized"} | ${record.classification_rationale || record.rationale || ""} | ${(record.evidence || record.extraction_notes || []).join(";")} |`);
}
lines.push("");
return lines;
}
function renderReviewSkeleton(title) {
return [
`# ${title}`,
"",
"## 摘要",
"",
"## 关键词",
"关键词1;关键词2;关键词3",
"",
"## 引言",
"",
"## 主体部分",
"### 研究现状与问题脉络",
"",
"### 代表性方法、理论与成果",
"",
"### 共同点、分歧与不足",
"",
"## 讨论与结论",
"",
"## 未来方向",
"",
"## 参考文献",
"[1] ",
"",
];
}
function renderIntegratedReview(records) {
const lines = [];
lines.push("# Corpus Summary");
lines.push(`- Total papers: ${records.length}`);
lines.push(`- Categories: ${groupByCategory(records).size}`);
lines.push("");
lines.push(...renderClassificationTable(records));
lines.push(...renderReviewSkeleton("基于给定文献语料的综述"));
return `${lines.join("\n")}\n`;
}
function renderPerCategoryReviews(records) {
const lines = [];
const grouped = groupByCategory(records);
lines.push("# Corpus Summary");
lines.push(`- Total papers: ${records.length}`);
lines.push(`- Categories: ${grouped.size}`);
lines.push("");
lines.push(...renderClassificationTable(records));
for (const [category, items] of grouped.entries()) {
lines.push(...renderReviewSkeleton(`${category}研究综述`));
lines.push("### 附:本类论文");
for (const item of items) {
lines.push(`- ${item.title || "Untitled"}${item.year ? ` (${item.year})` : ""}`);
}
lines.push("");
}
return `${lines.join("\n")}\n`;
}
function main() {
const { inputFile, outFile, perCategory } = parseArgs(process.argv.slice(2));
const records = readRecords(inputFile);
const markdown = perCategory ? renderPerCategoryReviews(records) : renderIntegratedReview(records);
if (outFile) {
fs.writeFileSync(outFile, markdown, "utf8");
} else {
process.stdout.write(markdown);
}
}
try {
main();
} catch (error) {
console.error(error instanceof Error ? error.message : String(error));
process.exit(1);
}