slight-leaves

@clawhub-slight-leaves-fb49268ec4
Paper Repro Triage
Skill
---
name: paper-repro-triage
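description: Chinese-language workflow for executing paper reproduction. Use when the user uploads or provides a PDF, arXiv link, paper homepage, project page, title and abstract, or source-code clues for a deep learning, machine learning, LLM, CV, NLP, multimodal, dataset, benchmark, prompt engineering, or agent paper, and asks to assess reproducibility, search for official code, inspect local source code, trace the source code of dataset papers, locate data processing code, clone repositories automatically, or, when no online or local source exists but the conditions for reproduction are met, generate a reproduction project that follows common PyTorch open-source conventions. Detailed results are written to a Markdown report; the chat returns only a minimal summary in Chinese.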
---

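# Paper Reproduction Triage, Source Tracing, and Reproduction Project Generation

## General Principles

This skill upgrades paper analysis from chat-style advice to a reproduction-oriented execution workflow. Responses must be in Chinese. Unless tool permissions, the network, approvals, or the user's environment prevent it, do not merely tell the user which commands to run; use the available tools to carry out every executable action.

After every trigger, the first line of the chat reply must be: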

```text
[paper-repro-triage active]
```

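Detailed results must be written to a Markdown file; the chat returns only a minimal summary.

## Mandatory Behaviors

1. **Write detailed results to Markdown**: by default, write to `paper-repro-workspace/<paper-slug>/repro-report.md` under the current agent workspace.
2. **Keep chat output minimal**: output only the report path, main-paper source status, dataset source status, reproduction project status, whether reproduction is needed, whether reproduction is possible, and the core reason.
3. **Do not end the workflow with "next-step suggestions"**: if the current workflow can continue, continue it; the chat summary and the end of the report list only "unfinished items / items needing manual confirmation".
4. **Find source code before discussing reproduction**: follow the order "official online code → local main-paper source → dataset-paper source → reproduction project when no main-paper source exists". Do not start reproducing from scratch just because a GitHub search found nothing.
5. **Dataset or baseline source is no substitute for main-paper source**: if only dataset-related source, baseline code, third-party implementations, or code for older methods is found, you must still decide whether to generate a main-paper reproduction project.
6. **Clone code repositories automatically when found**: official main-paper repositories, official dataset-paper repositories, and project-page repositories should all be cloned into the workspace; before cloning, always check whether related source already exists locally.
7. **Skip the clone for duplicate directories**: if a source folder with the same name already exists at the clone target path, do not clone again, do not run `git pull` automatically, do not overwrite, and do not fall back to a timestamped directory; inspect the existing directory read-only and state `已存在，跳过 clone` in both the report and the chat summary.
8. **Trace source code for every dataset**: do not download dataset-paper PDFs or the datasets themselves; only search the original dataset paper, project page, arXiv abstract page, Papers with Code, and GitHub/GitLab/Hugging Face leads to determine whether official source, processing scripts, or benchmark code exists.
9. **Locate the data processing code**: for main-paper source, baseline source, and dataset-related source alike, locate the code for data loading, preprocessing, splitting, feature extraction, annotation parsing, and benchmark construction, and record the locations in the report.
10. **Attempt to generate a reproduction project when no main-paper source exists**: when there is no official source online, no main-paper source locally, and the paper evidence supports "directly reproducible" or "partially reproducible", you must generate a PyTorch reproduction project; writing only a plan is not enough.
11. **Generated projects follow common open-source conventions**: default to a concise PyTorch layout of "root-level entry points + four code directories + one reproduction docs directory": keep `main.py`, `config.py`, and `run.py` at the root; put code in `data/`, `models/`, `engine/`, and `utils/`; put `requirements.txt`, `paper-spec.yaml`, `evidence-map.md`, and `repro-notes.md` in `repro-docs/`. This layout is the minimum baseline and may be extended as the paper requires, but do not generate by default a multi-YAML `configs/` directory, a separate `losses/` directory, `scripts/` training scripts, or `.sh` files.
12. **Do not fabricate reproduction results**: generating code and smoke checks is fine, but never claim that the paper's results have been reproduced. Hyperparameters, modules, or processing steps that the paper does not give must be marked `ASSUMPTION` or `TODO`.
13. **Stop at the code walkthrough when main-paper source exists**: if official or highly credible main-paper source has been found, cloned, skipped as already cloned, or already exists locally, the endpoint of this skill's workflow is "repository walkthrough + data processing code location + report written + minimal summary". Do not go on to fix source code, configure data directories, install dependencies, download data, run training, run evaluation, or run inference.
14. **"Reproduce" means reproduction analysis and preparation by default**: when the user only says "reproduce this paper" (“复现这篇论文”), "run it again" (“重新跑一遍”), or "process this paper" (“处理这篇论文”), that does not authorize training; enter run-type tasks only when the user explicitly says “运行训练/开始训练/跑通训练/执行评测/下载数据/修复代码并运行” (run training / start training / get training working / run evaluation / download data / fix the code and run it).
15. **Run-type tasks are outside this skill's automatic stage**: even when exec permission is full/ask=off, this skill must not automatically install dependencies, download data, modify official code, or run training.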
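
Rule 14 can be illustrated with a minimal sketch. The function name `is_run_request` and the plain substring matching are assumptions made for illustration; only the trigger phrases come from the rule itself, and a real agent would likely judge intent more carefully than this.

```python
# Explicit run-type phrases listed in rule 14. Anything else, including a bare
# "复现这篇论文", is treated as analysis and preparation only.
RUN_PHRASES = ("运行训练", "开始训练", "跑通训练", "执行评测", "下载数据", "修复代码并运行")


def is_run_request(message):
    """Return True only when the message contains an explicit run-type phrase."""
    return any(phrase in message for phrase in RUN_PHRASES)
```

Even when this returns `True`, rule 15 still applies: the request becomes a separate run task rather than part of this skill's automatic stage.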

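## Input

Accepted inputs: paper PDFs, arXiv links, paper homepage links, project page links, paper titles/abstracts/body excerpts, GitHub/GitLab links, and requests such as "judge whether it is worth reproducing", "find the code", "clone automatically", "read the repository", "organize the experiment configuration", "check the dataset paper's source code", "generate a reproduction project", and "write a Markdown report".

## Tools That Must Be Preferred

Use whichever of the following are available in the current OpenClaw environment:

1. Use PDF tools or file-reading capabilities to extract the paper's body text, appendices, footnotes, tables, figure captions, and references.
2. Use web/search/fetch tools to read arXiv pages, project pages, external links that appear in the paper, original dataset paper pages, and public code pages.
3. Use exec/shell tools to run repository and file commands, for example `git clone`, `python scripts/bootstrap_repo.py`, `python scripts/find_local_code.py`, `python scripts/inspect_repo_data_processing.py`, `python scripts/build_paper_spec.py`, `python scripts/scaffold_repro_project.py`, `python scripts/inspect_repro_project.py`, `dir`, `find`, and writing `.md` files.
4. In Windows cmd environments, prefer Python scripts: `python ...`; if `python` is unavailable, try `py ...`.
5. Do not use `.sh` as the default path; this skill does not generate `.sh` training scripts.
6. If exec is unavailable or denied, the network fails, or Python is unavailable, the report must state the failure reason and the fallback path.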
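
The `python` to `py` fallback in item 4 can be sketched as follows. The helper names `pick_python_launcher` and `build_command` are hypothetical; `shutil.which` is the standard-library call for looking up an executable on `PATH`.

```python
import shutil


def pick_python_launcher(candidates=("python", "py")):
    """Return the first launcher found on PATH, or None if none is available."""
    for name in candidates:
        if shutil.which(name):
            return name
    return None


def build_command(script, *args, launcher=None):
    """Build an argv list for a skill script; raise if no interpreter exists."""
    launcher = launcher or pick_python_launcher()
    if launcher is None:
        # Per item 6, the failure and the fallback path belong in the report.
        raise RuntimeError("no Python launcher found on PATH")
    return [launcher, script, *args]
```

Returning an argv list rather than a shell string keeps the command usable with `subprocess.run` without shell quoting.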

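## Workspace Conventions

1. Prefer creating `paper-repro-workspace/` under the current agent workspace.
2. Create a safe directory name for each main paper: `paper-repro-workspace/<paper-slug>/`.
3. Detailed report: `paper-repro-workspace/<paper-slug>/repro-report.md`.
4. Main-paper code: `paper-repro-workspace/<paper-slug>/main-code/<repo-name>/`.
5. Dataset-paper or dataset-project code: `paper-repro-workspace/<paper-slug>/dataset-code/<dataset-slug>/<repo-name>/`.
6. Manually placed local source may live in: `paper-repro-workspace/<paper-slug>/local-code/`.
7. The directory for a project generated without code must not be fixed as `repro-implementation`. Name it after the paper's framework, method, model, or task: `paper-repro-workspace/<paper-slug>/<framework-or-method-slug>-reproduction/`. If only a baseline can be built, the directory name must contain `baseline`.
8. Do not clone or generate code in arbitrary directories on the user's system, and do not overwrite existing directories.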
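
The layout above can be sketched as a small path builder. The function names, the dictionary keys, and the `<method-slug>-baseline` naming for baseline-only projects are illustrative assumptions; the conventions only require that a baseline directory name contain `baseline`.

```python
import re
from pathlib import Path


def slugify(value, default="paper"):
    """Lowercase the value and collapse non-alphanumeric runs into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")
    return slug or default


def workspace_layout(paper_title, method_name, baseline_only=False, root="."):
    """Return the conventional workspace paths for one paper (nothing is created)."""
    base = Path(root) / "paper-repro-workspace" / slugify(paper_title)
    method_slug = slugify(method_name, "method")
    # Assumed naming: "<method>-baseline" when only a baseline can be built.
    generated = f"{method_slug}-baseline" if baseline_only else f"{method_slug}-reproduction"
    return {
        "report": base / "repro-report.md",
        "main_code": base / "main-code",
        "dataset_code": base / "dataset-code",
        "local_code": base / "local-code",
        "generated": base / generated,
    }
```

The function only computes paths, so a caller can check for existing directories before creating or cloning anything.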

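## Execution Boundaries and Stop Conditions

- **Stop at the code walkthrough once main-paper source exists**: when official or highly credible main-paper source has been cloned, already exists, or has been found locally, only do the repository walkthrough, entry-point location, data processing code location, report writing, and minimal summary.
- **No automatic run stage**: when main-paper source exists, do not install dependencies, download data, fix source paths, set real data directories, run training/evaluation/inference, or generate a new `<method-slug>-reproduction/` project.
- **Generate a reproduction project only when there is no main-paper source**: generate `<method-slug>-reproduction/` only when no main-paper source exists online or locally and the paper is directly or partially reproducible.
- **Dataset and baseline source are no substitute for main-paper source**: they serve only as reference evidence for data processing or implementation; if the main paper has no source, you must still assess and generate the main-paper reproduction project.
- **Short follow-ups do not start training**: after the report is produced, when the user only says “复现/继续/重新跑一遍” (reproduce / continue / run it again), re-run this skill's workflow by default and do not start training on your own; only an explicit request for training counts as a new run task.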
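
These stop conditions reduce to a small decision function. The sketch below is one way to express them; the function name and the returned stage names are hypothetical, while the two label strings are the rubric labels used in this skill.

```python
# Rubric labels that allow generating a reproduction project.
REPRODUCIBLE_LABELS = {"可以直接复现", "部分可复现"}


def decide_route(online_main_code, local_main_code, label):
    """Map the source-search outcome and the rubric label to a workflow stage."""
    if online_main_code or local_main_code:
        # Main-paper source exists: stop at the code walkthrough, never train.
        return "code-walkthrough"
    if label in REPRODUCIBLE_LABELS:
        return "generate-reproduction-project"
    return "report-only"
```

Dataset or baseline source is deliberately not an input here, since it never substitutes for main-paper source.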

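## Overall Workflow

### Step 1: Read the Paper Evidence

Extract from the paper PDF, arXiv page, or user-provided text: title, authors, year, conference or journal, abstract, core contributions, method, experiments, appendices, footnotes, code availability statement, datasets, metrics, baselines, training details, figure and table titles and captions, and any explicit GitHub/GitLab/project page/Hugging Face/dataset links.

If the PDF or attachment cannot be read, state the missing tool or input first; do not invent paper content.

### Step 2: Classify the Paper Type

Assign one primary type and, where needed, a secondary type. Available types: survey paper, method paper, prompt engineering paper, benchmark evaluation paper, resource paper, theory paper, systems paper.

### Step 3: Judge Reproducibility

Use `references/reproducibility-rubric.md`. Output exactly one of the following four labels: directly reproducible (`可以直接复现`), partially reproducible (`部分可复现`), not practically reproducible (`不具备实际可复现性`), not a reproduction target (`不是复现目标`).

Distinguish "can it be reproduced" from "does it need to be reproduced". Do not mistake "the paper describes it" for "directly reproducible".

### Step 4: Search for Main-Paper Code Leads

Actively search the paper evidence for code leads: PDF URLs, footnotes, appendices, author notes, the arXiv abstract page, the project page, supplementary material, the OpenReview page, and phrases such as `code is available`, `source code`, `implementation`, `official repository`, `github`, and `project page`.

If several repositories are found, first determine which one is the authors' official repository. If that cannot be confirmed, mark it "official status unverified" (“官方性未验证”).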
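
As a rough illustration of the link search in this step, the sketch below pulls repository-like URLs out of extracted paper text. The regular expression and the function name are assumptions. It only finds candidates: every hit still needs the officialness check described above, and URLs broken across PDF lines will be missed.

```python
import re

# Two path segments after the host, e.g. https://github.com/<owner>/<repo>.
REPO_URL_RE = re.compile(
    r"https?://(?:www\.)?(?:github\.com|gitlab\.com|huggingface\.co)/[\w.-]+/[\w.-]+",
    re.I,
)


def extract_repo_links(text):
    """Return unique repository-like URLs in order of first appearance."""
    seen, links = set(), []
    for match in REPO_URL_RE.finditer(text):
        # Sentence punctuation often trails a URL in running text.
        url = match.group(0).rstrip(".,;)")
        key = url.lower()
        if key not in seen:
            seen.add(key)
            links.append(url)
    return links
```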

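### Step 5: Check for Local Main-Paper Source

Before entering the no-code reproduction path, check whether main-paper source already exists locally. Prefer: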

```text
python scripts/find_local_code.py --paper-slug <paper-slug> --name <paper-title-or-method> --workspace .
```

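The check covers: `paper-repro-workspace/<paper-slug>/main-code/`, `paper-repro-workspace/<paper-slug>/local-code/`, the current agent workspace, and the environment variable `PAPER_REPRO_LOCAL_CODE_ROOTS`. Dataset code directories may serve as supporting evidence but must not be treated directly as main-paper source.

If highly credible main-paper source is found locally, do not enter the no-code reproduction path; enter the "local code path" instead: read the README, dependencies, training entry points, evaluation entry points, configs, models, and data processing code, and write them into the report.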
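
One way to act on the script's `--json` output is sketched below. The payload keys (`results`, `path`, `score`, `origin`) match what `scripts/find_local_code.py` prints; the function name and the `min_score=20` threshold are assumptions chosen for illustration, not values defined by this skill.

```python
import json


def pick_main_code_candidates(payload_json, repo_url="", min_score=20):
    """Filter find_local_code.py --json output down to likely main-paper source."""
    payload = json.loads(payload_json)
    picked = []
    for item in payload.get("results", []):
        path = item["path"].replace("\\", "/")
        if "/dataset-code/" in f"/{path}/":
            # Dataset code is supporting evidence only, never main-paper source.
            continue
        origin = item.get("origin", "").rstrip("/").lower()
        origin_match = bool(repo_url) and origin == repo_url.rstrip("/").lower()
        if origin_match or item["score"] >= min_score:
            picked.append(item["path"])
    return picked
```

A score alone does not prove the code belongs to the paper, so candidates picked this way still need a README and entry-point read before they count as highly credible.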

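### Step 6: Trace Dataset Papers and Dataset Source

This step is mandatory whenever the main paper uses or releases a dataset, benchmark, or annotation resource. See `references/dataset-source-tracing.md` for the detailed procedure.

For each key dataset, you must:

1. Extract the dataset name, abbreviation, citation number, dataset paper title, project page, data download page, and footnotes.
2. Search the original dataset paper, project page, Papers with Code, and GitHub/GitLab/Hugging Face leads.
3. Before cloning, check whether related source already exists locally.
4. Once official or probably official source is found, clone it or skip the clone.
5. Use `scripts/inspect_repo_data_processing.py` or an equivalent read-only check to locate the data processing code.
6. Report the data processing code files, entry commands, key functions/classes, README evidence, and the impact on reproducing the main paper.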

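### Step 7: When Main-Paper Code Exists, Run Automatically and Walk Through It

If official or highly credible main-paper code is found, you must:

1. Record "main-paper code repository detected; entering the automatic repository path".
2. Before cloning, check whether a source folder with the same name already exists at the target path; if it does, skip the clone and inspect it read-only.
3. On Windows, run first: `python scripts/bootstrap_repo.py <repo-url> <paper-slug> main-code`; if `python` is unavailable, try `py scripts/bootstrap_repo.py ...`.
4. After a successful clone, or after finding an existing directory, continue with the repository walkthrough; do not stop at "cloned" or "already exists".
5. Use `scripts/inspect_repo_data_processing.py <repo-path>` to locate the data processing code.
6. Report the local path, clone status, duplicate-directory notice, dependency files, candidate install commands, training/inference/evaluation entry points, dataset preparation method, config files, model files, training files, evaluation files, and data processing files.
7. After completing item 6, write the report and end this skill's workflow; do not go on to install dependencies, fix source code, set real data paths, download data, run training, run evaluation, or run inference.
8. "Directly reproducible" means only that the conditions for reproduction are met, not that training starts now.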

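### Step 8: Generate a Reproduction Project When No Main-Paper Source Exists

When all of the following conditions hold, you must generate a reproduction project rather than only giving advice:

- no official or credible main-paper source exists online;
- no main-paper source exists locally;
- the paper is not a survey, a purely theoretical paper, or otherwise not a reproduction target;
- the paper evidence supports "directly reproducible" or "partially reproducible";
- the dataset, model architecture, training loop, loss, and metrics are sufficient to build at least a minimal viable version.

If dataset-related source or baseline source is found, feed it into the reproduction project as data processing and baseline evidence, but it must not stop generation of the main-paper reproduction project.

See `references/no-code-reproduction.md` for the detailed rules.

Write `paper-spec.yaml` before generating the project. You can use: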

```text
python scripts/build_paper_spec.py <evidence-md> --out paper-repro-workspace/<paper-slug>/paper-spec.yaml
```

Then generate the project:

```text
python scripts/scaffold_repro_project.py paper-repro-workspace/<paper-slug>/paper-spec.yaml --out paper-repro-workspace/<paper-slug>/<framework-or-method-slug>-reproduction
```

After generation, run the static check:

```text
python scripts/inspect_repro_project.py paper-repro-workspace/<paper-slug>/<framework-or-method-slug>-reproduction
```

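Do not install dependencies, download large datasets, or run training automatically. Lightweight `py_compile` and file-completeness checks may run automatically.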
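
A lightweight `py_compile` pass of the kind allowed here might look like the sketch below. It is not the implementation in `scripts/inspect_repro_project.py`; the function name and return shape are assumptions. Compiled output goes to a scratch directory so that the check leaves no `__pycache__` in the generated project.

```python
import py_compile
import tempfile
from pathlib import Path


def compile_check(project_root):
    """Byte-compile every .py file under project_root; return (ok, failures)."""
    root = Path(project_root)
    ok, failures = [], []
    with tempfile.TemporaryDirectory() as scratch:
        for index, path in enumerate(sorted(root.rglob("*.py"))):
            rel = path.relative_to(root).as_posix()
            try:
                # doraise=True turns syntax errors into PyCompileError.
                py_compile.compile(str(path), cfile=str(Path(scratch) / f"{index}.pyc"), doraise=True)
                ok.append(rel)
            except py_compile.PyCompileError as exc:
                failures.append((rel, exc.msg))
    return ok, failures
```

This only verifies that files parse; it imports nothing and runs no training code, which keeps it inside the static-check boundary.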

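### Step 9: Write the Markdown Report

The detailed content must finally be written to: `paper-repro-workspace/<paper-slug>/repro-report.md`.

See `references/output-template.md` for the report template. The report must record: paper information, classification, reproducibility, code search, main-paper source, local source, dataset source, data processing code locations, reproduction project generation result, commands executed, reasons reproduction is not possible, and unfinished items / items needing manual confirmation.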
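
The required sections can be turned into a report skeleton with a few lines of code. This is a sketch only: the real template lives in `references/output-template.md`, which may order or name sections differently, and the function names and the `未记录` placeholder are assumptions.

```python
from pathlib import Path

# Section names taken from the list of items the report must record.
REPORT_SECTIONS = [
    "论文信息", "分类", "可复现性", "代码搜索", "主论文源码", "本地源码",
    "数据集源码", "数据处理代码位置", "复现工程生成结果", "执行过的命令",
    "不能复现原因", "未完成项/人工确认项",
]


def render_report(title, sections):
    """Render every required section, using a placeholder for missing ones."""
    lines = [f"# {title}", ""]
    for name in REPORT_SECTIONS:
        lines += [f"## {name}", "", sections.get(name, "未记录"), ""]
    return "\n".join(lines)


def write_report(paper_slug, title, sections, root="."):
    """Write repro-report.md under the paper's workspace and return its path."""
    path = Path(root) / "paper-repro-workspace" / paper_slug / "repro-report.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_report(title, sections), encoding="utf-8")
    return path
```

Rendering every section, even when empty, makes gaps visible in the report instead of silently omitting them.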

## Chat Output Format

Do not output a long report in chat. The chat reply contains only:

```markdown
[paper-repro-triage active]

- 报告文件：`paper-repro-workspace/<paper-slug>/repro-report.md`
- 主论文源码：已 clone / 已存在，跳过 clone / 本地已存在 / 未找到 / 等待审批 / clone 失败
- 数据集源码：已 clone N 个 / 已存在，跳过 clone N 个 / 本地已存在 N 个 / 未找到 / 部分找到 / 未检索
- 数据处理代码：已定位 N 处 / 未定位 / 不适用
- 复现工程：已生成 / 仅生成 skeleton / 未生成，路径：`paper-repro-workspace/<paper-slug>/<implementation-slug>/`
- 是否需要复现：需要 / 不需要 / 建议只做部分复现
- 是否能复现：可以直接复现 / 部分可复现 / 不具备实际可复现性 / 不是复现目标
- 核心原因：一句话说明；如果能复现则写“无核心阻碍”
- 执行边界：未运行训练 / 未安装依赖 / 未下载数据；如已存在主论文源码，写“已停在代码导读阶段”
```

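## Safety and Honesty Rules

- Do not claim to have executed commands that were not executed.
- Do not fabricate repository file names.
- Do not claim a Markdown file was written when it was not.
- Do not claim exact reproduction unless the code, data, configuration, and evaluation protocol are all sufficient.
- Do not treat a third-party reproduction repository as the official repository.
- Do not automatically install unknown dependencies, download large datasets, fix official source paths, set real data directories, or run training/evaluation/inference scripts; cloning, skipping duplicate clones, read-only repository inspection, generating a reproduction project, and static checks may run automatically.
- Every hyperparameter, path, model dimension, loss weight, or data processing detail that the paper does not state explicitly must be marked `ASSUMPTION`.
- If only a baseline can be generated, it must be named a baseline, not a paper reproduction.
- If the generated code contains `TODO` or `NotImplementedError`, the report must list them.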
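
The last rule implies a scan of the generated project for placeholder markers. A minimal sketch, with a hypothetical function name and an assumed `(path, line, marker)` tuple shape, could be:

```python
import re
from pathlib import Path

MARKER_RE = re.compile(r"\b(TODO|ASSUMPTION|NotImplementedError)\b")


def scan_markers(project_root):
    """List (relative_path, line_number, marker) for each marker in .py files."""
    root = Path(project_root)
    hits = []
    for path in sorted(root.rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            for match in MARKER_RE.finditer(line):
                hits.append((path.relative_to(root).as_posix(), number, match.group(1)))
    return hits
```

The resulting list can be copied into the report's unfinished-items section so that no placeholder is presented as finished work.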

## Resources

- Reproducibility rubric: `references/reproducibility-rubric.md`
- Markdown report template: `references/output-template.md`
- Dataset-paper source tracing procedure: `references/dataset-source-tracing.md`
- No-code reproduction project procedure: `references/no-code-reproduction.md`
- Repository bootstrap script: `scripts/bootstrap_repo.py`
- Local source finder: `scripts/find_local_code.py`
- Data processing code locator: `scripts/inspect_repo_data_processing.py`
- Paper spec draft: `scripts/build_paper_spec.py`
- Reproduction project generator: `scripts/scaffold_repro_project.py`
- Reproduction project inspector: `scripts/inspect_repro_project.py`

FILE:scripts/bootstrap_repo.py
#!/usr/bin/env python3
"""Clone or inspect a GitHub/GitLab repository for paper reproduction workspaces.

Usage:
  python scripts/bootstrap_repo.py <repo-url> [paper-slug] [bucket]

The script is intentionally read-only after clone: it does not install dependencies,
download data, run training, or perform git pull on existing directories.
"""
from __future__ import annotations

import argparse
import os
import re
import subprocess
import sys
from pathlib import Path
from typing import Iterable

ALLOWED_PREFIXES = (
    "https://github.com/",
    "git@github.com:",
    "https://gitlab.com/",
    "git@gitlab.com:",
)

DEPENDENCY_PATTERNS = (
    "requirements", "environment", "pyproject.toml", "setup.py", "setup.cfg",
    "Pipfile", "Dockerfile", "conda", "poetry.lock", "package.json",
)
ENTRY_RE = re.compile(r"^(train|main|run|eval|test|infer|demo).*\.(py|ipynb|sh|cmd|ps1)$", re.I)


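def safe_component(value: str, default: str = "paper") -> str:
    value = value.lower().replace("\\", "/")
    value = re.sub(r"[^a-z0-9._/-]+", "-", value)
    value = re.sub(r"/{2,}", "/", value).strip("/")
    value = re.sub(r"(^|/)-+", r"\1", value)
    value = re.sub(r"-+(/|$)", r"\1", value)
    # Drop "." and ".." segments so the result cannot escape the workspace.
    value = "/".join(part for part in value.split("/") if part not in {"", ".", ".."})
    return value or default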


def repo_name_from_url(url: str) -> str:
    name = url.rstrip("/").split("/")[-1]
    if ":" in name and url.startswith("git@"):
        name = name.split(":")[-1]
    if name.endswith(".git"):
        name = name[:-4]
    return re.sub(r"[^A-Za-z0-9._-]+", "-", name) or "repo"


def run(cmd: list[str], cwd: Path | None = None) -> tuple[int, str]:
    try:
        proc = subprocess.run(cmd, cwd=str(cwd) if cwd else None, text=True,
                              stdout=subprocess.PIPE, stderr=subprocess.STDOUT, timeout=300)
        return proc.returncode, proc.stdout
    except FileNotFoundError as exc:
        return 127, str(exc)
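    except subprocess.TimeoutExpired as exc:
        # exc.stdout may be bytes even with text=True, so decode before concatenating.
        partial = exc.stdout or ""
        if isinstance(partial, bytes):
            partial = partial.decode("utf-8", errors="replace")
        return 124, partial + "\n[timeout] command timed out"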


def rel_files(root: Path, max_depth: int = 2) -> list[str]:
    out: list[str] = []
    if not root.exists():
        return out
    for path in root.rglob("*"):
        try:
            rel = path.relative_to(root)
        except ValueError:
            continue
        if len(rel.parts) > max_depth:
            continue
        if any(part in {".git", "__pycache__", ".venv", "node_modules"} for part in rel.parts):
            continue
        out.append(str(rel).replace("\\", "/") + ("/" if path.is_dir() else ""))
    return sorted(out)


def find_files(root: Path, max_depth: int, predicate) -> list[str]:
    matches: list[str] = []
    if not root.exists():
        return matches
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        try:
            rel = path.relative_to(root)
        except ValueError:
            continue
        if len(rel.parts) > max_depth:
            continue
        if any(part in {".git", "__pycache__", ".venv", "node_modules"} for part in rel.parts):
            continue
        if predicate(path):
            matches.append(str(rel).replace("\\", "/"))
    return sorted(matches)


def find_readme(root: Path) -> Path | None:
    for name in ("README.md", "README.rst", "README.txt", "readme.md"):
        candidate = root / name
        if candidate.exists():
            return candidate
    for path in root.rglob("README*"):
        try:
            if len(path.relative_to(root).parts) <= 2 and path.is_file():
                return path
        except ValueError:
            pass
    return None


def read_head(path: Path, max_lines: int = 160) -> str:
    try:
        with path.open("r", encoding="utf-8", errors="replace") as f:
            return "".join(line for _, line in zip(range(max_lines), f))
    except OSError as exc:
        return f"[read failed] {exc}"


def main(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser(description="Clone or inspect repo in paper-repro-workspace")
    parser.add_argument("repo_url")
    parser.add_argument("paper_slug", nargs="?", default="paper")
    parser.add_argument("bucket", nargs="?", default="main-code")
    args = parser.parse_args(argv)

    repo_url = args.repo_url.strip()
    if not repo_url.startswith(ALLOWED_PREFIXES):
        print(f"错误：仓库地址看起来不是 GitHub/GitLab URL：{repo_url}", file=sys.stderr)
        return 2

    safe_slug = safe_component(args.paper_slug, "paper")
    safe_bucket = safe_component(args.bucket, "main-code")
    root_dir = Path("paper-repro-workspace") / safe_slug / safe_bucket
    root_dir.mkdir(parents=True, exist_ok=True)

    repo_name = repo_name_from_url(repo_url)
    target_dir = root_dir / repo_name
    clone_status = "已 clone"
    clone_note = "新克隆仓库。"
    remote_url = ""
    command_summary = ""

    if target_dir.exists():
        clone_status = "已存在，跳过 clone"
        command_summary = "未执行 git clone，只做现有目录检查。"
        if (target_dir / ".git").exists():
            code, origin = run(["git", "-C", str(target_dir), "remote", "get-url", "origin"])
            remote_url = origin.strip() if code == 0 else ""
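            # Ignore trailing slashes and letter case when comparing remotes.
            if remote_url.rstrip("/").lower() == repo_url.rstrip("/").lower():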
                clone_note = "目标目录已存在且 origin 与目标仓库一致；按规则不重复 clone，也不自动 git pull。"
            elif remote_url:
                clone_note = f"目标目录已存在且是 git 仓库，但 origin 与目标仓库不一致：{remote_url}；按规则不覆盖、不重复 clone。"
            else:
                clone_note = "目标目录已存在且是 git 仓库，但没有读取到 origin；按规则不重复 clone。"
        else:
            clone_note = "目标目录已存在但不是 git 仓库；按规则不覆盖、不重复 clone。"
    else:
        command_summary = f"git clone {repo_url} {target_dir}"
        code, output = run(["git", "clone", repo_url, str(target_dir)])
        if code != 0:
            print("=== 执行脚本 ===")
            print("脚本：bootstrap_repo.py")
            print(f"命令：{command_summary}")
            print(output)
            return code

    print("=== 执行脚本 ===")
    print("脚本：bootstrap_repo.py")
    print(f"命令：{command_summary}")
    print("")
    print("=== 克隆结果 ===")
    print(f"仓库状态：{clone_status}")
    print(f"重复目录提醒：{clone_note}")
    print(f"本地路径：{target_dir}")
    print(f"远程地址：{repo_url}")
    if remote_url:
        print(f"现有 origin：{remote_url}")

    print("\n=== 顶层结构 ===")
    for item in rel_files(target_dir, 2)[:120]:
        print(item)

    print("\n=== 常见依赖文件 ===")
    dep_files = find_files(target_dir, 3, lambda p: any(token.lower() in p.name.lower() for token in DEPENDENCY_PATTERNS))
    for item in dep_files:
        print(item)

    print("\n=== 常见入口候选 ===")
    entry_files = find_files(target_dir, 4, lambda p: bool(ENTRY_RE.match(p.name)))
    for item in entry_files[:80]:
        print(item)

    print("\n=== README 摘要候选 ===")
    readme = find_readme(target_dir)
    if readme:
        print(f"README 文件：{readme}")
        print(read_head(readme, 160))
    else:
        print("未在前两层目录找到 README。")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

FILE:scripts/build_paper_spec.py
#!/usr/bin/env python3
"""Build a paper-spec.yaml draft from extracted evidence text.

This script intentionally creates a conservative draft. The model should edit the
YAML using paper evidence before scaffolding implementation code.
"""
from __future__ import annotations

import argparse
import re
from pathlib import Path


def slugify(value: str, default: str = "paper") -> str:
    value = re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")
    return value[:80] or default


def read_text(path: Path) -> str:
    return path.read_text(encoding="utf-8", errors="replace")


def infer_title(text: str) -> str:
    for line in text.splitlines()[:80]:
        clean = line.strip().strip("# ")
        if 8 <= len(clean) <= 180 and not clean.lower().startswith(("abstract", "introduction", "arxiv")):
            return clean
    return "UNKNOWN"


def infer_method(title: str) -> str:
    words = re.findall(r"[A-Za-z0-9]+", title)
    if not words:
        return "paper"
    # Prefer acronym-like tokens (all caps with at least one letter, so a bare
    # year such as "2024" does not qualify); otherwise use the first three words.
    acronyms = [w for w in words if len(w) >= 3 and w.upper() == w and any(c.isalpha() for c in w)]
    if acronyms:
        return acronyms[0]
    return "-".join(words[:3])


def main() -> int:
    parser = argparse.ArgumentParser(description="Create paper-spec.yaml draft")
    parser.add_argument("evidence_md")
    parser.add_argument("--out", required=True)
    args = parser.parse_args()

    evidence_path = Path(args.evidence_md)
    text = read_text(evidence_path) if evidence_path.exists() else ""
    title = infer_title(text)
    method = infer_method(title)
    paper_slug = slugify(title)
    method_slug = slugify(method, paper_slug)
    implementation_slug = f"{method_slug}-reproduction"

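    # Escape backslashes and double quotes so the title stays a valid YAML double-quoted scalar.
    title_yaml = title.replace("\\", "\\\\").replace('"', '\\"')

    yaml = f"""# Generated by build_paper_spec.py. Edit with paper evidence before scaffolding.
paper:
  title: "{title_yaml}"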
  year: "UNKNOWN"
  venue: "UNKNOWN"
  task: "TODO"
  modality: "TODO"
  method_name: "{method}"
  paper_slug: "{paper_slug}"
  implementation_slug: "{implementation_slug}"

architecture:
  type: "TODO"
  modules:
    - "TODO"
  inputs:
    - "TODO"
  outputs:
    - "TODO"
  loss_terms:
    - "TODO"

datasets:
  - name: "TODO"
    role: "train/eval"
    source_paper: "TODO"
    access: "TODO"
    preprocessing: "TODO"
    local_source_code: "TODO"
    data_processing_files:
      - "TODO"

training:
  seed: 42 # ASSUMPTION: debug default unless paper specifies.
  optimizer: "TODO"
  learning_rate: "TODO"
  batch_size: "TODO"
  epochs_or_steps: "TODO"
  scheduler: "TODO"
  augmentations:
    - "TODO"
  hardware: "TODO"

evaluation:
  metrics:
    - "TODO"
  protocol: "TODO"
  baselines:
    - "TODO"

evidence_status:
  code_found: false
  local_code_found: false
  reproducibility: "TODO"
  missing_fields:
    - "TODO"
  assumptions:
    - "ASSUMPTION: fields marked TODO must be filled from paper evidence before running training."
"""
    out = Path(args.out)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(yaml, encoding="utf-8")
    print("=== 执行脚本 ===")
    print("脚本：build_paper_spec.py")
    print(f"输入证据：{evidence_path}")
    print(f"输出文件：{out}")
    print(f"建议工程目录名：{implementation_slug}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

FILE:scripts/find_local_code.py
#!/usr/bin/env python3
"""Find local source repositories before cloning or creating from-scratch implementations."""
from __future__ import annotations

import argparse
import json
import os
import re
import subprocess
from pathlib import Path

INDICATORS = {
    "git": 8,
    "readme": 3,
    "requirements": 2,
    "pyproject": 2,
    "setup": 2,
    "train": 3,
    "eval": 2,
    "model": 2,
    "dataset": 2,
    "config": 2,
}


def norm(value: str) -> str:
    return re.sub(r"[^a-z0-9]+", " ", value.lower()).strip()


def safe_slug(value: str) -> str:
    return re.sub(r"[^a-z0-9._-]+", "-", value.lower()).strip("-") or "paper"


def split_terms(value: str) -> list[str]:
    return [t for t in norm(value).split() if len(t) >= 2]


def run_origin(path: Path) -> str:
    if not (path / ".git").exists():
        return ""
    try:
        proc = subprocess.run(["git", "-C", str(path), "remote", "get-url", "origin"], text=True,
                              stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, timeout=10)
        return proc.stdout.strip() if proc.returncode == 0 else ""
    except Exception:
        return ""


def is_repo_like(path: Path) -> bool:
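    if not path.is_dir():
        return False
    try:
        names = {p.name.lower() for p in path.iterdir() if p.exists()}
    except OSError:
        # Unreadable directories (for example, permission denied) are not candidates.
        return False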
    if ".git" in names:
        return True
    if any(name.startswith("readme") for name in names):
        return True
    if any(name in names for name in ["requirements.txt", "pyproject.toml", "setup.py", "environment.yml", "environment.yaml"]):
        return True
    if any(name in names for name in ["src", "models", "model", "datasets", "data", "configs", "config"]):
        return True
    return False


def score_repo(path: Path, terms: list[str], repo_url: str = "") -> tuple[int, list[str], str]:
    score = 0
    reasons: list[str] = []
    name_norm = norm(path.name)
    for term in terms:
        if term in name_norm:
            score += 5
            reasons.append(f"目录名匹配：{term}")
    origin = run_origin(path)
    if repo_url and origin and origin.rstrip("/").lower() == repo_url.rstrip("/").lower():
        score += 30
        reasons.append("git origin 与目标 URL 一致")
    if (path / ".git").exists():
        score += INDICATORS["git"]
        reasons.append("包含 .git")
    for child in path.iterdir() if path.exists() else []:
        lower = child.name.lower()
        if lower.startswith("readme"):
            score += INDICATORS["readme"]
            reasons.append(f"包含 {child.name}")
        if lower.startswith("requirements") or lower.startswith("environment"):
            score += INDICATORS["requirements"]
            reasons.append(f"包含依赖文件 {child.name}")
        if lower == "pyproject.toml":
            score += INDICATORS["pyproject"]
            reasons.append("包含 pyproject.toml")
        if lower in {"src", "models", "model"}:
            score += INDICATORS["model"]
            reasons.append(f"包含模型/源码目录 {child.name}")
        if lower in {"datasets", "dataset", "data"}:
            score += INDICATORS["dataset"]
            reasons.append(f"包含数据目录 {child.name}")
        if lower in {"configs", "config"}:
            score += INDICATORS["config"]
            reasons.append(f"包含配置目录 {child.name}")
        if re.match(r"^(train|main|run|eval|test).*\.(py|ipynb|sh|cmd)$", lower):
            score += INDICATORS["train"]
            reasons.append(f"包含入口候选 {child.name}")
    return score, reasons, origin


def gather_roots(workspace: Path, paper_slug: str, extra_roots: list[str]) -> list[Path]:
    roots = [
        workspace / "paper-repro-workspace" / paper_slug / "main-code",
        workspace / "paper-repro-workspace" / paper_slug / "dataset-code",
        workspace / "paper-repro-workspace" / paper_slug / "local-code",
        workspace,
    ]
    env_roots = os.environ.get("PAPER_REPRO_LOCAL_CODE_ROOTS", "")
    # Split on the platform path separator so Windows drive letters (C:\...) stay intact.
    for item in env_roots.split(os.pathsep):
        if item.strip():
            roots.append(Path(item.strip()))
    roots.extend(Path(r) for r in extra_roots)
    unique: list[Path] = []
    seen: set[str] = set()
    for root in roots:
        try:
            key = str(root.resolve())
        except Exception:
            key = str(root)
        if key not in seen:
            seen.add(key)
            unique.append(root)
    return unique


def walk_candidates(root: Path, max_depth: int) -> list[Path]:
    out: list[Path] = []
    if not root.exists() or not root.is_dir():
        return out
    root = root.resolve()
    stack = [(root, 0)]
    while stack:
        path, depth = stack.pop()
        if path.name in {".git", "__pycache__", ".venv", "node_modules"}:
            continue
        if is_repo_like(path):
            out.append(path)
            if (path / ".git").exists() and path != root:
                continue
        if depth < max_depth:
            try:
                children = [p for p in path.iterdir() if p.is_dir()]
            except OSError:
                children = []
            stack.extend((child, depth + 1) for child in children)
    return out


def main() -> int:
    parser = argparse.ArgumentParser(description="Find local source code candidates")
    parser.add_argument("--name", action="append", default=[], help="paper, method, dataset, or repo name")
    parser.add_argument("--repo-url", default="")
    parser.add_argument("--paper-slug", default="paper")
    parser.add_argument("--workspace", default=".")
    parser.add_argument("--root", action="append", default=[])
    parser.add_argument("--max-depth", type=int, default=4)
    parser.add_argument("--json", action="store_true")
    args = parser.parse_args()

    paper_slug = safe_slug(args.paper_slug)
    terms: list[str] = []
    for name in args.name:
        terms.extend(split_terms(name))
    if args.repo_url:
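        repo_tail = args.repo_url.rstrip("/").split("/")[-1]
        # Strip only a trailing ".git"; a plain replace would also mangle names like "x.github.io".
        if repo_tail.endswith(".git"):
            repo_tail = repo_tail[:-4]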
        terms.extend(split_terms(repo_tail))
    terms = sorted(set(terms))

    roots = gather_roots(Path(args.workspace), paper_slug, args.root)
    results = []
    seen: set[str] = set()
    for root in roots:
        for cand in walk_candidates(root, args.max_depth):
            key = str(cand.resolve())
            if key in seen:
                continue
            seen.add(key)
            score, reasons, origin = score_repo(cand, terms, args.repo_url)
            if score <= 0 and terms:
                continue
            results.append({
                "path": str(cand),
                "score": score,
                "origin": origin,
                "reasons": reasons,
            })
    results.sort(key=lambda r: r["score"], reverse=True)

    payload = {"script": "find_local_code.py", "terms": terms, "roots": [str(r) for r in roots], "results": results[:20]}
    if args.json:
        print(json.dumps(payload, ensure_ascii=False, indent=2))
    else:
        print("=== 执行脚本 ===")
        print("脚本：find_local_code.py")
        print(f"检索词：{', '.join(terms) if terms else '(无)'}")
        print("=== 本地源码候选 ===")
        if not results:
            print("未找到高相关本地源码候选。")
        for item in results[:20]:
            print(f"score={item['score']} path={item['path']}")
            if item["origin"]:
                print(f"  origin={item['origin']}")
            for reason in item["reasons"][:8]:
                print(f"  - {reason}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

FILE:scripts/inspect_repo_data_processing.py
#!/usr/bin/env python3
"""Inspect repository files and identify likely data processing code."""
from __future__ import annotations

import argparse
import json
import re
from pathlib import Path

KEYWORDS = [
    "dataset", "dataloader", "data_loader", "datamodule", "preprocess", "prepare",
    "process", "processing", "transform", "augment", "crop", "resize", "split",
    "annotation", "label", "metadata", "extract", "feature", "frames", "tokenize",
    "download", "convert", "benchmark", "loader", "sampler",
]
CODE_EXTS = {".py", ".ipynb", ".sh", ".cmd", ".ps1", ".yaml", ".yml", ".json", ".txt", ".md"}
SKIP_DIRS = {".git", "__pycache__", ".venv", "venv", "node_modules", "dist", "build"}
DEF_RE = re.compile(r"^\s*(class|def)\s+([A-Za-z_][A-Za-z0-9_]*)")
SECTION_RE = re.compile(r"^#{1,4}\s+(.+)")


def rel(path: Path, root: Path) -> str:
    try:
        return str(path.relative_to(root)).replace("\\", "/")
    except ValueError:
        return str(path)


def read_text(path: Path, limit: int = 200_000) -> str:
    # Read at most `limit` characters instead of loading the whole file first.
    try:
        with path.open("r", encoding="utf-8", errors="replace") as f:
            return f.read(limit)
    except OSError:
        return ""


def score_path(path: Path, text: str) -> tuple[int, list[str]]:
    lower_path = str(path).lower().replace("\\", "/")
    lower_text = text.lower()
    score = 0
    reasons: list[str] = []
    for kw in KEYWORDS:
        if kw in lower_path:
            score += 5
            reasons.append(f"路径包含 {kw}")
        count = lower_text.count(kw)
        if count:
            score += min(count, 5)
    if "torch.utils.data" in lower_text:
        score += 12
        reasons.append("包含 torch.utils.data")
    if "class " in text and "dataset" in lower_text:
        score += 8
        reasons.append("可能定义 Dataset 类")
    if "if __name__" in lower_text and any(k in lower_text for k in ["preprocess", "prepare", "dataset", "data"]):
        score += 5
        reasons.append("可能是可执行数据处理脚本")
    return score, reasons


def extract_symbols(text: str, limit: int = 20) -> list[str]:
    out: list[str] = []
    for line in text.splitlines():
        m = DEF_RE.match(line)
        if m:
            out.append(f"{m.group(1)} {m.group(2)}")
        if len(out) >= limit:
            break
    return out


def extract_readme_sections(path: Path, text: str) -> list[str]:
    lines = text.splitlines()
    sections: list[str] = []
    capture = False
    buf: list[str] = []
    title = ""
    for line in lines:
        m = SECTION_RE.match(line)
        if m:
            if capture and buf:
                sections.append(title + "\n" + "\n".join(buf[:20]))
            title = line
            capture = any(k in m.group(1).lower() for k in ["data", "dataset", "preprocess", "prepare", "download", "training", "benchmark"])
            buf = []
        elif capture:
            buf.append(line)
    if capture and buf:
        sections.append(title + "\n" + "\n".join(buf[:20]))
    return sections[:6]


def main() -> int:
    parser = argparse.ArgumentParser(description="Locate data processing code in a repository")
    parser.add_argument("repo_path")
    parser.add_argument("--json", action="store_true")
    args = parser.parse_args()

    root = Path(args.repo_path)
    if not root.exists():
        print(f"错误：路径不存在：{root}")
        return 2

    candidates = []
    readme_sections = []
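    for path in root.rglob("*"):
        # Match skip dirs and keywords against the repo-relative path so that parent
        # directories such as ".../dataset-code/" or ".../build/" do not skew results.
        rel_path = path.relative_to(root)
        if any(part in SKIP_DIRS for part in rel_path.parts):
            continue
        if not path.is_file() or path.suffix.lower() not in CODE_EXTS:
            continue
        text = read_text(path)
        score, reasons = score_path(rel_path, text)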
        if path.name.lower().startswith("readme"):
            readme_sections.extend({"file": rel(path, root), "section": s} for s in extract_readme_sections(path, text))
        if score >= 6:
            candidates.append({
                "path": rel(path, root),
                "score": score,
                "reasons": reasons[:8],
                "symbols": extract_symbols(text),
            })
    candidates.sort(key=lambda x: x["score"], reverse=True)
    payload = {"script": "inspect_repo_data_processing.py", "repo_path": str(root), "candidates": candidates[:50], "readme_sections": readme_sections[:10]}

    if args.json:
        print(json.dumps(payload, ensure_ascii=False, indent=2))
    else:
        print("=== 执行脚本 ===")
        print("脚本：inspect_repo_data_processing.py")
        print(f"仓库路径：{root}")
        print("\n=== 数据处理代码候选 ===")
        if not candidates:
            print("未定位到明显数据处理代码候选。")
        for item in candidates[:30]:
            print(f"score={item['score']} file={item['path']}")
            for r in item["reasons"]:
                print(f"  - {r}")
            if item["symbols"]:
                print(f"  symbols: {', '.join(item['symbols'][:12])}")
        print("\n=== README 数据相关章节候选 ===")
        if not readme_sections:
            print("未定位到 README 数据相关章节。")
        for sec in readme_sections[:6]:
            print(f"--- {sec['file']} ---")
            print(sec["section"][:1200])
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

FILE:scripts/inspect_repro_project.py
#!/usr/bin/env python3
"""Inspect a generated reproduction project without installing dependencies or training."""
from __future__ import annotations

import argparse
import py_compile
from pathlib import Path

REQUIRED = [
    "README.md",
    "repro-docs/repro-notes.md",
    "repro-docs/evidence-map.md",
    "repro-docs/paper-spec.yaml",
    "repro-docs/requirements.txt",
    "config.py",
    "main.py",
    "run.py",
    "data/__init__.py",
    "data/dataset.py",
    "data/preprocess.py",
    "models/__init__.py",
    "models/model.py",
    "engine/__init__.py",
    "engine/train.py",
    "engine/evaluate.py",
    "utils/__init__.py",
    "utils/common.py",
    "utils/metrics.py",
]

FORBIDDEN_DEFAULTS = [
    "configs/default.yaml",
    "configs/debug.yaml",
    "configs/ablation.yaml",
    "losses/paper_loss.py",
    "loss.py",
    "scripts/train.sh",
    "scripts/eval.sh",
    "scripts/train.cmd",
    "scripts/eval.cmd",
]


def read(path: Path) -> str:
    try:
        return path.read_text(encoding="utf-8", errors="replace")
    except Exception:
        return ""


def main() -> int:
    parser = argparse.ArgumentParser(description="Inspect generated concise repro project")
    parser.add_argument("project_path")
    args = parser.parse_args()

    root = Path(args.project_path)
    print("=== 执行脚本 ===")
    print("脚本：inspect_repro_project.py")
    print(f"工程路径：{root}")

    if not root.exists():
        print("错误：工程路径不存在。")
        return 2

    print("\n=== 必需文件检查 ===")
    missing = []
    for rel in REQUIRED:
        path = root / rel
        if path.exists():
            print(f"OK {rel}")
        else:
            print(f"MISSING {rel}")
            missing.append(rel)

    print("\n=== 不应默认生成的旧结构检查 ===")
    forbidden_found = []
    for rel in FORBIDDEN_DEFAULTS:
        path = root / rel
        if path.exists():
            print(f"FOUND_OLD_STRUCTURE {rel}")
            forbidden_found.append(rel)
        else:
            print(f"OK_ABSENT {rel}")

    print("\n=== Python 静态编译检查 ===")
    compile_errors = []
    for path in sorted(root.rglob("*.py")):
        try:
            py_compile.compile(str(path), doraise=True)
            print(f"OK {path.relative_to(root)}")
        except Exception as exc:
            print(f"FAIL {path.relative_to(root)}: {exc}")
            compile_errors.append(str(path.relative_to(root)))

    print("\n=== TODO / ASSUMPTION / NotImplementedError 统计 ===")
    markers = {"TODO": 0, "ASSUMPTION": 0, "NotImplementedError": 0}
    marker_files = []
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in {".py", ".md", ".yaml", ".yml", ".txt"}:
            text = read(path)
            counts = {k: text.count(k) for k in markers}
            if any(counts.values()):
                marker_files.append((str(path.relative_to(root)).replace("\\", "/"), counts))
                for k, v in counts.items():
                    markers[k] += v
    for k, v in markers.items():
        print(f"{k}: {v}")
    for file, counts in marker_files[:50]:
        parts = ", ".join(f"{k}={v}" for k, v in counts.items() if v)
        print(f"- {file}: {parts}")

    print("\n=== 结论 ===")
    if missing or compile_errors or forbidden_found:
        print("状态：部分通过")
        if missing:
            print("缺失文件：" + ", ".join(missing))
        if compile_errors:
            print("编译失败：" + ", ".join(compile_errors))
        if forbidden_found:
            print("发现旧结构：" + ", ".join(forbidden_found))
        return 1
    print("状态：通过静态检查。未安装依赖，未下载数据，未运行训练。")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

FILE:scripts/scaffold_repro_project.py
#!/usr/bin/env python3
"""Generate a concise, structured PyTorch reproduction project from paper-spec.yaml.

The generated layout is intentionally close to many small/medium research repos:
main.py + config.py + run.py at the root, with data/, models/, engine/ and utils/
subpackages. It avoids multiple YAML configs, shell scripts, and a separate losses
package by default.
"""
from __future__ import annotations

import argparse
import re
import shutil
from pathlib import Path


def parse_simple_yaml(path: Path) -> dict[str, str]:
    """Tiny YAML-ish parser for scalar values used by this skill.

    This avoids external dependencies. It is not a general YAML parser.
    """
    data: dict[str, str] = {}
    stack: list[str] = []
    for raw in path.read_text(encoding="utf-8", errors="replace").splitlines():
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        line = raw.strip()
        if ":" not in line or line.startswith("-"):
            continue
        key, value = line.split(":", 1)
        key = key.strip()
        value = value.strip().strip('"\'')
        level = indent // 2
        stack = stack[:level]
        stack.append(key)
        if value:
            data[".".join(stack)] = value.split(" # ")[0].strip().strip('"\'')
    return data


def slugify(value: str, default: str = "paper") -> str:
    value = re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")
    return value[:80] or default


def class_name(value: str, default: str = "PaperModel") -> str:
    words = re.findall(r"[a-zA-Z0-9]+", value)
    if not words:
        return default
    name = "".join(w[:1].upper() + w[1:] for w in words)
    if name and name[0].isdigit():
        name = "Paper" + name
    return name or default


def write(path: Path, text: str) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")


def main() -> int:
    parser = argparse.ArgumentParser(description="Scaffold concise structured PyTorch reproduction project")
    parser.add_argument("paper_spec")
    parser.add_argument("--out", default="")
    parser.add_argument("--force", action="store_true")
    args = parser.parse_args()

    spec_path = Path(args.paper_spec)
    if not spec_path.exists():
        print(f"错误：paper spec 不存在：{spec_path}")
        return 2

    spec = parse_simple_yaml(spec_path)
    title = spec.get("paper.title", "UNKNOWN")
    method = spec.get("paper.method_name") or spec.get("paper.task") or "paper"
    implementation_slug = spec.get("paper.implementation_slug") or f"{slugify(method, 'paper')}-reproduction"
    out = Path(args.out) if args.out else (spec_path.parent / implementation_slug)
    model_cls = class_name(method)

    if out.exists() and any(out.iterdir()) and not args.force:
        print(f"错误：输出目录已存在且非空：{out}")
        print("为避免覆盖，不会生成。请指定新目录或手动清理。")
        return 3
    out.mkdir(parents=True, exist_ok=True)
    (out / "repro-docs").mkdir(parents=True, exist_ok=True)

    shutil.copyfile(spec_path, out / "repro-docs" / "paper-spec.yaml")

    for pkg in ["data", "models", "engine", "utils"]:
        write(out / pkg / "__init__.py", "")

    write(out / "repro-docs" / "requirements.txt", """torch
numpy
tqdm
""")

    write(out / "README.md", f"""# {title} - Reproduction Scaffold

This project was generated by `paper-repro-triage` because no official/local main-paper source code was found.
It is a conservative PyTorch scaffold, not a claim of successful reproduction.

## Layout

- `main.py`: command-line entrypoint.
- `config.py`: argparse defaults and hyperparameters.
- `run.py`: dispatches preprocess/train/eval/inference by mode.
- `data/`: dataset loading and preprocessing.
- `models/`: paper model definition.
- `engine/`: training and evaluation loops.
- `utils/`: metrics, seed, path and JSON helpers.

## repro-docs

- `requirements.txt`: minimal dependency list.
- `paper-spec.yaml`: structured paper evidence used to generate this scaffold; it is not the training config.
- `evidence-map.md`: mapping from generated code files to paper evidence, dataset-code evidence or explicit assumptions.
- `repro-notes.md`: limitations, missing details and manual checks before real training.

## Commands

```cmd
python -m pip install -r repro-docs/requirements.txt
python main.py --mode preprocess --dataset paper --data_root ./data
python main.py --mode train --dataset paper --data_root ./data
python main.py --mode eval --checkpoint outputs/best.pt
```

## Safety notes

- This scaffold does not download datasets automatically.
- This scaffold does not install dependencies automatically.
- Paper-missing details are marked as `ASSUMPTION` or `TODO`.
- Run small smoke tests before real training.
""")

    write(out / "repro-docs" / "repro-notes.md", f"""# Reproduction Notes

This file records what is still uncertain or unverified. It is the place for reproduction limitations, assumptions, missing data, and manual checks. Do not treat this scaffold as a completed paper reproduction until these notes are resolved.

## Status

- Source code found: false
- Local main-paper source code found: false
- Generated implementation directory: `{out.name}`

## Important limitations

- This code is a scaffold generated from paper evidence and explicit assumptions.
- Do not report paper-level reproduction results until real data, dependencies, training and evaluation have been confirmed.

## Assumptions to verify

- Model dimensions and module details marked TODO.
- Dataset file layout and preprocessing steps marked TODO.
- Loss weights and scheduler details marked TODO unless paper-spec.yaml states them.
""")

    write(out / "repro-docs" / "evidence-map.md", """# Evidence Map

This file links generated code to paper evidence, dataset-code evidence, baseline-code evidence, or explicit assumptions. If a code choice is not supported by the paper, mark it as `ASSUMPTION` or `TODO`.

| Code file | Purpose | Paper evidence | Assumption / TODO |
|---|---|---|---|
| config.py | Command-line args and hyperparameters | paper-spec.yaml training/evaluation sections | TODO fields require paper evidence |
| main.py | CLI entrypoint | common research-code pattern | no paper-specific assumption |
| run.py | mode dispatch | paper task flow | no paper-specific assumption |
| data/dataset.py | Dataset loader | datasets/preprocessing section | TODO: actual file layout |
| data/preprocess.py | Data processing hooks | dataset-source-tracing results | TODO: adapt to actual dataset |
| models/model.py | Model definition | architecture section | ASSUMPTION if architecture details missing |
| engine/train.py | Training loop and loss/objective | training + loss_terms section | ASSUMPTION for missing hyperparameters |
| engine/evaluate.py | Evaluation loop | evaluation section | TODO: exact metrics if missing |
| utils/metrics.py | Metrics | evaluation metrics section | TODO if metric undefined |
""")

    write(out / "config.py", f'''import argparse


def build_parser():
    parser = argparse.ArgumentParser(description={(title + " reproduction scaffold")!r})
    parser.add_argument('--dataset', default='paper', help='dataset name or alias')
    parser.add_argument('--mode', default='train', choices=['preprocess', 'train', 'eval', 'inference'])
    parser.add_argument('--data_root', default='./data')
    parser.add_argument('--output_dir', default='./outputs')
    parser.add_argument('--checkpoint', default='', help='checkpoint path for eval/inference')
    parser.add_argument('--epochs', type=int, default=1, help='ASSUMPTION: debug default; replace with paper value')
    parser.add_argument('--batch_size', type=int, default=32, help='ASSUMPTION unless paper specifies')
    parser.add_argument('--lr', type=float, default=1e-3, help='ASSUMPTION unless paper specifies')
    parser.add_argument('--weight_decay', type=float, default=0.0)
    parser.add_argument('--num_workers', type=int, default=0)
    parser.add_argument('--seed', type=int, default=42)
    parser.add_argument('--gpu', default='0')
    parser.add_argument('--hidden_dim', type=int, default=256, help='TODO/ASSUMPTION: replace with paper value')
    parser.add_argument('--input_dim', type=int, default=128, help='TODO/ASSUMPTION: replace with dataset feature dimension')
    parser.add_argument('--output_dim', type=int, default=2, help='TODO/ASSUMPTION: replace with task output size')
    parser.add_argument('--alpha', type=float, default=1.0, help='optional paper-specific loss weight')
    parser.add_argument('--beta', type=float, default=1.0, help='optional paper-specific loss weight')
    return parser


def parse_args(argv=None):
    return build_parser().parse_args(argv)
''')

    write(out / "main.py", '''import os

from config import parse_args
from run import Run
from utils.common import set_seed


if __name__ == '__main__':
    args = parse_args()

    # Set visible GPU before creating CUDA contexts.
    os.environ['CUDA_VISIBLE_DEVICES'] = str(args.gpu)

    set_seed(args.seed)
    print(args)
    Run(args).main()
''')

    write(out / "run.py", '''from data.preprocess import preprocess
from engine.train import train
from engine.evaluate import evaluate


class Run:
    def __init__(self, args):
        self.args = args

    def main(self):
        if self.args.mode == 'preprocess':
            return preprocess(self.args)
        if self.args.mode == 'train':
            return train(self.args)
        if self.args.mode in {'eval', 'inference'}:
            return evaluate(self.args)
        raise ValueError(f'Unsupported mode: {self.args.mode}')
''')

    write(out / "utils" / "common.py", '''import json
import random
from pathlib import Path

import numpy as np
import torch


def set_seed(seed: int):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True


def get_device():
    return torch.device('cuda' if torch.cuda.is_available() else 'cpu')


def ensure_dir(path):
    Path(path).mkdir(parents=True, exist_ok=True)


def save_json(obj, path):
    path = Path(path)
    ensure_dir(path.parent)
    path.write_text(json.dumps(obj, ensure_ascii=False, indent=2), encoding='utf-8')
''')

    write(out / "data" / "dataset.py", '''# Data processing evidence:
# - Fill this section with dataset-source-tracing results.
# - If source repo processing files were found, list them here.
# TODO: Replace synthetic fallback with actual paper dataset parsing.

from pathlib import Path

import torch
from torch.utils.data import DataLoader, Dataset


class PaperDataset(Dataset):
    def __init__(self, root, split='train', input_dim=128, num_classes=2):
        self.root = Path(root)
        self.split = split
        self.input_dim = input_dim
        self.num_classes = num_classes
        # TODO: Replace synthetic fallback with actual annotation/index loading.
        self.items = list(range(8))

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        # ASSUMPTION: vector input fallback for smoke checking. The feature size
        # follows --input_dim so it always matches the first model layer.
        x = torch.randn(self.input_dim)
        y = torch.tensor(index % self.num_classes, dtype=torch.long)
        return {'input': x, 'label': y}


def build_dataloader(args, split='train'):
    dataset = PaperDataset(
        args.data_root,
        split=split,
        input_dim=args.input_dim,
        num_classes=args.output_dim,
    )
    shuffle = split == 'train'
    return DataLoader(
        dataset,
        batch_size=args.batch_size,
        shuffle=shuffle,
        num_workers=args.num_workers,
    )
''')

    write(out / "data" / "preprocess.py", '''from pathlib import Path


# TODO: Implement dataset-specific preprocessing based on paper evidence and
# dataset-source-tracing results. Do not download restricted datasets here.

def preprocess(args):
    data_root = Path(args.data_root)
    data_root.mkdir(parents=True, exist_ok=True)
    print(f'Preprocess placeholder. Data root: {data_root}')
    print('TODO: add annotation conversion, tokenizer/vocab construction, frame extraction, or feature extraction.')
''')

    write(out / "models" / "model.py", f'''# Model evidence:
# - Method name from paper-spec: {method}
# - TODO: Replace fallback MLP with paper-specific architecture.

from torch import nn


class {model_cls}(nn.Module):
    def __init__(self, args):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(args.input_dim, args.hidden_dim),
            nn.ReLU(),
            nn.Linear(args.hidden_dim, args.output_dim),
        )

    def forward(self, batch):
        x = batch['input']
        return self.net(x)


def build_model(args):
    return {model_cls}(args)
''')

    write(out / "utils" / "metrics.py", '''import torch


def accuracy(logits, labels):
    # TODO: Replace or extend with exact paper metrics.
    preds = torch.argmax(logits, dim=-1)
    return (preds == labels).float().mean().item()
''')

    write(out / "engine" / "train.py", '''import torch
from torch import nn
from torch.optim import Adam
from tqdm import tqdm

from data.dataset import build_dataloader
from models.model import build_model
from utils.common import ensure_dir, get_device
from utils.metrics import accuracy


def build_criterion(args):
    # TODO: Replace with paper-specific objective if different.
    # Keep loss here unless the paper has a complex reusable loss module.
    return nn.CrossEntropyLoss()


def train(args):
    device = get_device()
    ensure_dir(args.output_dir)

    train_loader = build_dataloader(args, split='train')
    model = build_model(args).to(device)
    criterion = build_criterion(args)
    optimizer = Adam(model.parameters(), lr=args.lr, weight_decay=args.weight_decay)

    # No validation split exists in this scaffold, so "best" means the lowest
    # mean training loss so far. TODO: select on a validation metric instead.
    best_path = f'{args.output_dir}/best.pt'
    best_loss = float('inf')
    for epoch in range(args.epochs):
        model.train()
        total_loss = 0.0
        total_acc = 0.0
        steps = 0
        for batch in tqdm(train_loader, desc=f'epoch {epoch + 1}/{args.epochs}'):
            batch = {k: v.to(device) if hasattr(v, 'to') else v for k, v in batch.items()}
            logits = model(batch)
            loss = criterion(logits, batch['label'])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            total_acc += accuracy(logits.detach(), batch['label'])
            steps += 1

        epoch_loss = total_loss / max(steps, 1)
        print({'epoch': epoch + 1, 'loss': epoch_loss, 'accuracy': total_acc / max(steps, 1)})
        if epoch_loss < best_loss:
            best_loss = epoch_loss
            torch.save({'model': model.state_dict(), 'args': vars(args)}, best_path)
            print(f'Saved checkpoint: {best_path}')
''')

    write(out / "engine" / "evaluate.py", '''import torch

from data.dataset import build_dataloader
from models.model import build_model
from utils.common import get_device, save_json
from utils.metrics import accuracy


def evaluate(args):
    device = get_device()
    loader = build_dataloader(args, split='test')
    model = build_model(args).to(device)

    if args.checkpoint:
        ckpt = torch.load(args.checkpoint, map_location=device)
        state = ckpt.get('model', ckpt)
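        incompatible = model.load_state_dict(state, strict=False)
        if incompatible.missing_keys or incompatible.unexpected_keys:
            # strict=False hides mismatches; surface them so a partially loaded model is not evaluated silently.
            print(f'WARNING: checkpoint/model key mismatch: {incompatible}')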
    else:
        print('WARNING: no checkpoint provided; evaluating randomly initialized model.')

    model.eval()
    correct = 0.0
    total = 0
    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(device) if hasattr(v, 'to') else v for k, v in batch.items()}
            logits = model(batch)
            labels = batch['label']
            # Weight by batch size so a smaller final batch does not skew the mean.
            correct += accuracy(logits, labels) * labels.numel()
            total += labels.numel()

    result = {'accuracy_placeholder': correct / max(total, 1)}
    save_json(result, f'{args.output_dir}/eval.json')
    print(result)
''')

    print("=== 执行脚本 ===")
    print("脚本：scaffold_repro_project.py")
    print(f"论文：{title}")
    print(f"方法：{method}")
    print(f"工程路径：{out}")
    print("结构：base layout (repro-docs/, main.py, config.py, run.py, data/, models/, engine/, utils/); can be extended when the paper requires")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

FILE:references/dataset-source-tracing.md
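# Dataset Paper and Dataset Source-Code Tracing Workflow

Goal: identify the key datasets in the main paper, then run a lightweight search to see whether each dataset's original paper or project page has official source code, processing code, or a benchmark repository. When a trustworthy repository is found, first check whether the relevant source already exists locally; clone automatically only if it does not. Do not download paper PDFs or the datasets themselves.

## Trigger conditions

Trigger this workflow when the main paper shows any of the following:

- It runs experiments on multiple public datasets.
- It releases a new dataset, benchmark, or annotation resource.
- A dataset is critical to the reproducibility verdict.
- The user explicitly asks to "crawl the related dataset papers", "find the dataset paper's source code", or "clone the dataset code".

## Dataset priority

If there are many datasets, process at most 5 by default, in this order:

1. Datasets that appear repeatedly in the main experiment tables.
2. Datasets newly released by the paper.
3. Datasets directly tied to the core conclusions.
4. Datasets that need special preprocessing or annotation pipelines.
5. Datasets that the benchmark evaluation protocol depends on.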
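
The ordering above can be sketched as a small ranking helper. This is only an illustration: the `DatasetRef` record, its field names, and the tuple sort key are assumptions, since the skill does not prescribe a data structure for this step.

```python
from dataclasses import dataclass

MAX_DATASETS = 5  # default cap stated above


@dataclass
class DatasetRef:
    # Hypothetical record; field names are illustrative only.
    name: str
    main_table_hits: int = 0        # appearances in main experiment tables
    newly_released: bool = False    # introduced by the paper itself
    core_claim: bool = False        # tied directly to the core conclusions
    special_preprocessing: bool = False
    benchmark_protocol: bool = False


def pick_datasets(refs, limit=MAX_DATASETS):
    """Order datasets by the five priorities above and keep at most `limit`."""
    def key(ref):
        # Earlier tuple positions dominate later ones, mirroring the list order.
        return (
            ref.main_table_hits > 1,
            ref.newly_released,
            ref.core_claim,
            ref.special_preprocessing,
            ref.benchmark_protocol,
        )
    return sorted(refs, key=key, reverse=True)[:limit]
```

Datasets with identical flags keep the order in which they were found, because `sorted` is stable.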

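## Search strategy

For each dataset, check these first:

- The title of the dataset's original paper in the main paper's reference list.
- Dataset name + `github`.
- Dataset name + `official code`.
- Dataset name + `project page`.
- Dataset paper title + `code`.
- The Papers with Code dataset or task page.
- Comments/code links on the arXiv abstract page.

## Local-first rule

After finding a dataset-related repository URL, check for existing local source code before cloning: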

```text
python scripts/find_local_code.py --paper-slug <paper-slug> --name <dataset-name-or-repo-name> --workspace .
```

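Search scope:

- `paper-repro-workspace/<paper-slug>/dataset-code/`
- `paper-repro-workspace/<paper-slug>/local-code/`
- The current agent workspace
- The `PAPER_REPRO_LOCAL_CODE_ROOTS` environment variable

If same-named or highly relevant source code exists locally, record it as `本地已存在，跳过 clone` and go straight to the source walkthrough and data-processing code location.

## Judging repository trust

- Official (`官方`): explicitly linked by the paper's authors, the project organization, or the dataset's official site.
- Possibly official (`可能官方`): linked from an author homepage or institution page, but without an explicit "official" label.
- Third-party (`第三方`): not maintained by the authors; a reproduction or community implementation.
- Unverified (`未验证`): inferred only from search results, with no direct evidence.

Automatically clone only "official" or "possibly official" repositories. For third-party repositories, record the URL only by default and do not clone automatically unless the user explicitly agrees.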
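
The clone rule above reduces to a small gate. The sketch below is a minimal illustration; the function name and the `user_approved` flag are hypothetical, and treating unverified repositories as never auto-cloned is an assumption, because the text only addresses third-party repositories explicitly.

```python
# Trust levels that may be cloned without asking (official / possibly official).
AUTO_CLONE_LEVELS = {"官方", "可能官方"}


def should_auto_clone(trust_level: str, user_approved: bool = False) -> bool:
    """Return True when a repository may be cloned at this trust level."""
    if trust_level in AUTO_CLONE_LEVELS:
        return True
    if trust_level == "第三方":
        # Third-party repos are cloned only with explicit user approval.
        return user_approved
    # ASSUMPTION: unverified or unknown levels are recorded but never cloned.
    return False
```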

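## Locating data-processing code

For every dataset source tree that was found online or already exists locally, locate the data-processing code. Prefer:

```text
python scripts/inspect_repo_data_processing.py <repo-path>
```

Check and record:

- Dataset classes: `Dataset`, `DataModule`, `DataLoader`, `torch.utils.data.Dataset`.
- Data-loading files: `dataset.py`, `datasets/*.py`, `data/*.py`, `loader.py`, `dataloader.py`.
- Preprocessing scripts: `preprocess.py`, `prepare_data.py`, `process_*.py`, `extract_*.py`, `convert_*.py`.
- Annotation parsing: `annotation`, `label`, `split`, `metadata`, `json/csv/txt` parsing.
- Feature extraction: `feature`, `extract_frames`, `tokenize`, `crop`, `resize`, `augment`.
- Data preparation, preprocess, and dataset setup commands in the README / docs.

If the data-processing entry point cannot be determined, report "未定位到明确数据处理入口" (no clear data-processing entry point located) and list the candidate files that were found.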
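
As a rough illustration of the file-name part of this checklist, the sketch below matches repo-relative paths against the patterns listed above. It is not the scoring used by `inspect_repo_data_processing.py`; the category names and the rule of matching slash patterns against the last two path parts are assumptions.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Patterns copied from the checklist above; category names are illustrative.
PATTERNS = {
    "loader": ["dataset.py", "datasets/*.py", "data/*.py", "loader.py", "dataloader.py"],
    "preprocess": ["preprocess.py", "prepare_data.py", "process_*.py", "extract_*.py", "convert_*.py"],
}


def classify(rel_path: str):
    """Return the checklist categories whose patterns match a repo-relative path."""
    path = PurePosixPath(rel_path)
    hits = []
    for category, patterns in PATTERNS.items():
        for pattern in patterns:
            # Patterns with a slash are matched against "<parent>/<file>";
            # all others are matched against the file name alone.
            target = "/".join(path.parts[-2:]) if "/" in pattern else path.name
            if fnmatch(target, pattern):
                hits.append(category)
                break
    return hits
```

A file-name match is only a candidate; the class and function evidence in the checklist still has to be confirmed by reading the file.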

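## Report fields

Record at least the following for each dataset:

- Dataset name.
- Whether the main experiments depend on it.
- Original paper or project page.
- Whether source code was found.
- Repository URL.
- Repository trust level.
- Clone / local status.
- Local path.
- Location of the data-processing code.
- Data-processing entry command or function.
- README / file evidence.
- Data access restrictions.
- Impact on reproducing the main paper.

## Prohibited actions

- Do not download the datasets themselves.
- Do not bulk-download paper PDFs.
- Do not bypass logins, application forms, CAPTCHAs, paywalls, or license restrictions.
- Do not pass off a third-party reproduction repository as official.
- Do not claim a dataset is available unless a clear download or application path was found.

## Handling duplicate directories

Before cloning a dataset source repository into `paper-repro-workspace/<paper-slug>/dataset-code/<dataset-slug>/<repo-name>/`, check whether a same-named source folder already exists at the target path.

- If the target path does not exist: `git clone` may run automatically.
- If the target path already exists: do not clone again, do not run `git pull` automatically, do not overwrite, and do not switch to a new timestamped directory; read the existing directory as-is and mark it `已存在，跳过 clone` in the report.
- If the existing directory is a git repository: record its `origin` and state whether it matches the target repository.
- If the existing directory is not a git repository: record the directory conflict and ask the user to resolve it manually or specify a new directory.

The minimal chat summary must also reflect the dataset source status, for example: `数据集源码：本地已存在 1 个，已存在，跳过 clone 2 个`.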
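
One way to read the rules above is as a three-way decision made before any git command runs. The sketch below is a hypothetical helper, not part of the skill's scripts; the function name and the `目录冲突` return label are assumptions, while `已存在，跳过 clone` is the status string required by the text.

```python
from pathlib import Path


def clone_decision(target: Path) -> str:
    """Decide what to do with a clone target without touching its contents."""
    if not target.exists():
        return "clone"
    if (target / ".git").exists():
        # Existing checkout: never re-clone, pull, or overwrite.
        return "已存在，跳过 clone"
    # Same-named folder that is not a git repository: report it, do not guess.
    return "目录冲突"
```

For an existing checkout, the `origin` can then be read with `git -C <path> remote get-url origin` and compared against the target URL.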

FILE:references/no-code-reproduction.md
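# Reproduction Project Generation Workflow When No Main-Paper Source Exists

When the main paper has no official source code, no trustworthy main repository exists online, and no main-paper source exists locally, this skill enters the no-main-paper-source reproduction path. The goal is not to fabricate complete results but, when the evidence is sufficient, to generate a minimal reproduction project that matches the conventions of most PyTorch paper repositories, can be reviewed, and can be developed further.

## Hard prerequisites for entering this workflow

Enter this workflow only when there is no main-paper source. If official or highly trustworthy main-paper source has been cloned, already exists, or was found locally, stop at the repository walkthrough and report stage; do not generate a new reproduction project and do not run training. Dataset source, baseline source, and related-paper source do not count as main-paper source and do not block this workflow.

## Generation conditions

A reproduction project must be generated when all of the following hold:

- The paper is a method paper, a system paper, an executable benchmark method, or the benchmark usage flow of a resource paper.
- The reproducibility verdict is `可以直接复现` (directly reproducible) or `部分可复现` (partially reproducible).
- The paper provides the information a minimum viable implementation needs: inputs, outputs, model or pipeline modules, loss/objective, datasets or substitute data, and evaluation metrics.
- Missing information can be filled with explicit assumptions without changing the paper's core method.

If only dataset-related source, baseline source, or source for an older method was found, still generate the main-paper reproduction project. Such code serves only as data-processing, baseline, or implementation reference evidence; it cannot replace main-paper source.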
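
The prerequisite and the four conditions combine into a single gate, sketched below. The `Triage` record, its field names, and the paper-type labels are hypothetical; only the two verdict strings come from the skill itself.

```python
from dataclasses import dataclass

# Illustrative labels for the paper types listed above (not defined by the skill).
ELIGIBLE_TYPES = {"method", "system", "benchmark-method", "resource-benchmark-flow"}
ELIGIBLE_VERDICTS = {"可以直接复现", "部分可复现"}


@dataclass
class Triage:
    paper_type: str
    verdict: str
    has_minimum_spec: bool        # inputs, outputs, modules, loss, data, metrics
    gaps_fillable: bool           # gaps can be closed by explicit assumptions
    main_source_found: bool = False
    dataset_or_baseline_source_found: bool = False  # never blocks generation


def must_generate(t: Triage) -> bool:
    """True when the workflow requires generating a reproduction project."""
    if t.main_source_found:
        # Main-paper source exists: stop at walkthrough and report.
        return False
    return (
        t.paper_type in ELIGIBLE_TYPES
        and t.verdict in ELIGIBLE_VERDICTS
        and t.has_minimum_spec
        and t.gaps_fillable
    )
```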

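## When not to generate a paper reproduction

In the following cases, do not generate a paper reproduction; generate only a baseline or an experiment-design record:

- The paper is mainly a survey, position piece, or theoretical analysis.
- Key data, weights, systems, or closed-source APIs are unavailable and no reasonable scaled-down version exists.
- The model architecture, loss, and evaluation protocol are all unclear.
- The paper's goal is to define a benchmark rather than to present a method to reproduce.

If a reasonable baseline can be built but the paper's method cannot, the directory name must contain `baseline`, and the report must state that it is not a reproduction of the paper's original method.

## Directory naming

The project directory must not be hard-coded as `repro-implementation`. Derive it from the paper's framework, method, model, or task name:

```text
paper-repro-workspace/<paper-slug>/<framework-or-method-slug>-reproduction/
```

Examples:

- SGAN: `sgan-reproduction/`
- MI2LaTeX: `mi2latex-reproduction/`
- Diffusion Transformer: `dit-reproduction/`
- VideoMAE: `videomae-reproduction/`
- Method name cannot be determined: `<paper-slug>-reproduction/`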
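
A minimal sketch of this naming rule follows, using the same slug regex as `slugify` in `scaffold_repro_project.py`. The `reproduction_dir` helper is hypothetical. Note that a purely mechanical slug cannot produce an abbreviation such as `dit` for "Diffusion Transformer"; that choice has to be made when the method name is extracted.

```python
import re


def reproduction_dir(paper_slug: str, method_name: str = "") -> str:
    """Build the project path; fall back to the paper slug when no method name is known."""
    slug = re.sub(r"[^a-z0-9]+", "-", method_name.lower()).strip("-")
    return f"paper-repro-workspace/{paper_slug}/{slug or paper_slug}-reproduction/"
```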

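## Minimum baseline project structure

By default, generate a PyTorch project that is "structured but not heavy". The baseline consists of root-level entry files, four code directories with clear responsibilities, and one reproduction-docs directory. This is the minimum recommended structure, not a rigid template; if the paper needs a tokenizer, decoder, retrieval, beam search, multi-stage training, or special evaluation, add directories or files on top of it.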

```text
<method-slug>-reproduction/
├── README.md
├── repro-docs/
│   ├── requirements.txt
│   ├── paper-spec.yaml
│   ├── evidence-map.md
│   └── repro-notes.md
├── config.py
├── main.py
├── run.py
├── data/
│   ├── __init__.py
│   ├── dataset.py
│   └── preprocess.py
├── models/
│   ├── __init__.py
│   └── model.py
├── engine/
│   ├── __init__.py
│   ├── train.py
│   └── evaluate.py
└── utils/
    ├── __init__.py
    ├── common.py
    └── metrics.py
```

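`repro-docs/` holds only reproduction documents and the dependency list, not training entry code. The root keeps `main.py`, `config.py`, and `run.py` so users see at a glance how to run the project; the actual implementation lives in `data/`, `models/`, `engine/`, and `utils/`.

## What the four `repro-docs/` files do

### `repro-docs/requirements.txt`

Purpose: record the minimal dependency list so users can create an environment. List only what the scaffold and a minimal reproduction need, such as `torch`, `numpy`, and `tqdm`. Do not add unverified heavy dependencies, system dependencies, or data-download tools; if a special dependency is needed, note it in `repro-notes.md` as pending confirmation.

### `repro-docs/paper-spec.yaml`

Purpose: record the structured specification extracted from the paper. It is the evidence input for code generation, not a training config file. It should cover the task, model modules, inputs and outputs, datasets, loss, training hyperparameters, evaluation metrics, missing items, and assumptions. Command-line arguments for training are still managed by `config.py`.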
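
To make the spec format concrete, the sketch below flattens a tiny spec into dotted keys, in the spirit of `parse_simple_yaml` in `scaffold_repro_project.py`. The keys `paper.title`, `paper.method_name`, and `paper.task` are the ones that script reads; the sample values and the `training.lr` key are placeholders, and `flatten` is a simplified stand-in that assumes two-space indentation and scalar values only.

```python
SPEC = """\
paper:
  title: "Example Paper"      # placeholder value
  method_name: ExampleNet
  task: classification
training:
  lr: 0.001
"""


def flatten(text: str) -> dict:
    """Flatten two-space-indented `key: value` lines into dotted keys."""
    data, stack = {}, []
    for raw in text.splitlines():
        line = raw.split(" # ")[0].rstrip()   # drop trailing inline comments
        if not line.strip() or line.lstrip().startswith(("#", "-")) or ":" not in line:
            continue
        level = (len(line) - len(line.lstrip(" "))) // 2
        key, value = line.strip().split(":", 1)
        stack = stack[:level] + [key.strip()]
        value = value.strip().strip("\"'")
        if value:  # parent keys such as `paper:` carry no value of their own
            data[".".join(stack)] = value
    return data
```

Every value comes back as a string, so numeric hyperparameters still need an explicit conversion before they become `config.py` defaults.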

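### `repro-docs/evidence-map.md`

Purpose: record the mapping "code file ↔ paper evidence / dataset-source evidence / explicit assumption". Every generated core file must be traceable to a basis; for example, `models/model.py` comes from the method section or architecture figure, and `data/preprocess.py` comes from the dataset paper's source or the main paper's preprocessing description. The file exists to keep guesses from being presented as paper facts.

### `repro-docs/repro-notes.md`

Purpose: record reproduction status, limitations, missing information, and items needing manual confirmation. This includes data not yet downloaded, dependencies not installed, hyperparameters the paper omits, why only a skeleton could be produced, data access restrictions, and what must be confirmed before real training.

## Structures not generated by default

Do not generate the following by default unless the paper or the user clearly needs them:

- `configs/default.yaml`, `configs/debug.yaml`, `configs/ablation.yaml`: use only `config.py` + argparse by default. Add config files only for complex projects.
- `losses/` or `loss.py`: by default, put the loss in `build_criterion()` in `engine/train.py`; split it out only when there are several reusable, complex losses.
- `scripts/train.sh`, `scripts/eval.sh`, `.cmd`: use only the Python command-line entry point by default.
- `tests/`: not generated by default; generate `tests/` only if the user asks for project tests. Static checks are handled by the skill's own `inspect_repro_project.py`.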

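## Code file responsibilities

### `config.py`

The single configuration entry point. Use `argparse` to define common command-line arguments, default hyperparameters, and paths. Do not generate multiple YAML config files by default.

Must include:

- `--mode train/eval/inference/preprocess`
- `--dataset`
- `--data_root`
- `--output_dir`
- `--epochs`
- `--batch_size`
- `--lr`
- `--seed`
- `--gpu`
- `--checkpoint`
- Paper-specific arguments, such as `--alpha`, `--beta`, `--beam_size`, `--max_len`

Defaults the paper does not state must be annotated `ASSUMPTION`.

### `main.py`

The single command-line entry point. It parses arguments, sets the random seed and GPU, then hands the configuration to `Run(args).main()`.

Recommended commands: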

```cmd
python main.py --mode preprocess --dataset <dataset> --data_root <path>
python main.py --mode train --dataset <dataset> --data_root <path>
python main.py --mode eval --checkpoint outputs/best.pt
```

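### `run.py`

The central dispatcher. Based on `args.mode`, it calls `data.preprocess`, `engine.train`, `engine.evaluate`, or the inference logic.

### `data/dataset.py`

Dataset reading and basic transforms. When the data-processing evidence comes from dataset source tracing, the file header must state the relevant repository, processing scripts, README evidence, and reusable parts.

### `data/preprocess.py`

Data preparation, annotation conversion, tokenizer/vocab construction, image/video/text preprocessing, data split construction, and so on. Write only the code and entry point; do not download the dataset itself automatically.

### `models/model.py`

Model definition file. It contains the paper's core model class, such as `PaperModel` or a class named after the method. If some modules lack paper details, a TODO or baseline module is acceptable, but the file header must mark it `ASSUMPTION`.

### `engine/train.py`

Training loop. The loss normally lives in `build_criterion()` or the training step; do not generate a separate `loss.py` or `losses/` directory unless the paper's loss is extremely complex or has several reusable components.

Must include: dataloader, model, optimizer/scheduler, loss/objective, epoch/step loop, checkpoint saving, and logging.

### `engine/evaluate.py`

Evaluation logic. It covers checkpoint loading, the test dataloader, metric computation, and saving `outputs/eval.json`. Metrics must come from the paper; mark TODO where the paper does not give them.

### `utils/metrics.py`

Metric functions. Holds evaluation metrics only, not training losses.

### `utils/common.py`

Shared helpers: random seed, path creation, JSON saving, device selection, simple logging, and similar.

## Static checks

After generation, run: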

```text
python scripts/inspect_repro_project.py <implementation-path>
```

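Allowed to run automatically: file completeness checks, `py_compile`, and TODO/ASSUMPTION/NotImplementedError counts.

Not allowed to run automatically: installing dependencies, downloading large datasets, running training, or evaluating on full datasets.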
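
The allowed checks can also be run entirely in memory. The sketch below is a simplified variant, not the code of `inspect_repro_project.py`: it swaps `py_compile` for the built-in `compile()`, which parses the source without writing `__pycache__` files into the generated project, and it counts markers in `.py` files only. The `static_check` name is hypothetical.

```python
from pathlib import Path

MARKERS = ("TODO", "ASSUMPTION", "NotImplementedError")


def static_check(root: Path):
    """Syntax-check every .py file in memory and count review markers."""
    errors = []
    counts = dict.fromkeys(MARKERS, 0)
    for path in sorted(root.rglob("*.py")):
        source = path.read_text(encoding="utf-8", errors="replace")
        try:
            # compile() parses the text; it does not import or execute the module.
            compile(source, str(path), "exec")
        except (SyntaxError, ValueError) as exc:
            errors.append(f"{path.relative_to(root).as_posix()}: {exc}")
        for marker in MARKERS:
            counts[marker] += source.count(marker)
    return errors, counts
```

Because nothing is imported, a missing `torch` installation does not affect the result, which matches the rule that dependencies are never installed automatically.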

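## Report requirements

The report must state:

- Whether a reproduction project was generated.
- Project path.
- File list.
- Purpose of the four `repro-docs/` files.
- Role of each code file.
- Paper evidence behind each code file.
- How dataset-related source shaped `data/dataset.py` and `data/preprocess.py`.
- Which hyperparameters come from the paper and which are ASSUMPTION.
- Unfinished items / items needing manual confirmation.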

FILE:references/output-template.md
# Markdown 报告模板

本文件用于指导详细报告写入 `paper-repro-workspace/<paper-slug>/repro-report.md`。聊天回复只保留极简摘要，不要把本报告全文贴到聊天里，除非文件写入失败。

# 论文复现报告：<论文标题>

生成时间：<时间>  
主论文输入：<PDF / arXiv / URL / 用户文本>  
报告目录：`paper-repro-workspace/<paper-slug>/`

## 1. 结论摘要

| 项目 | 结论 |
|---|---|
| 论文类型 | 综述 / 方法 / 提示词工程 / 基准评测 / 资源 / 理论 / 系统 |
| 是否需要复现 | 需要 / 不需要 / 建议只做部分复现 |
| 是否能复现 | 可以直接复现 / 部分可复现 / 不具备实际可复现性 / 不是复现目标 |
| 主论文源码 | 已 clone / 已存在，跳过 clone / 本地已存在 / 未找到 / 等待审批 / clone 失败 |
| 数据集源码 | 已 clone N 个 / 已存在，跳过 clone N 个 / 本地已存在 N 个 / 未找到 / 部分找到 / 未检索 |
| 数据处理代码 | 已定位 N 处 / 未定位 / 不适用 |
| 复现工程 | 已生成 / 仅生成 skeleton / 未生成 |
| 核心阻碍 | 一句话说明 |
| 执行边界 | 未运行训练 / 未安装依赖 / 未下载数据 / 已停在代码导读阶段 |
| 报告文件 | `paper-repro-workspace/<paper-slug>/repro-report.md` |

## 2. 论文基础信息

- 标题：
- 作者：
- 年份：
- 会议/期刊/arXiv：
- 论文链接：
- 项目页：
- 主仓库：
- 相关数据集：

## 3. 论文分类

- 主类型：
- 次类型：
- 判断依据：

## 4. 可复现性结论

- 结论：可以直接复现 / 部分可复现 / 不具备实际可复现性 / 不是复现目标
- 是否建议复现：需要 / 不需要 / 建议只做部分复现
- 原因：

## 5. 证据摘要

| 维度 | 结论 | 证据 |
|---|---|---|
| 主论文官方代码 |  |  |
| 本地主论文源码 |  |  |
| 数据集源码 |  |  |
| 数据处理代码 |  |  |
| 训练配置 |  |  |
| 评测协议 |  |  |
| 硬件需求 |  |  |
| 主要阻碍 |  |  |

## 6. 主论文代码与自动执行结果

- 是否找到主论文代码：
- 仓库可信度：官方 / 可能官方 / 第三方 / baseline / 相关代码 / 未验证
- 仓库 URL：
- 自动执行状态：已 clone / 已存在，跳过 clone / 本地已存在 / 等待审批 / 执行失败 / 无代码可执行
- 本地路径：
- 重复目录提醒：
- 执行过的命令：

### 6.1 重复目录与跳过 clone 记录

- 是否出现同名源码文件夹：是 / 否
- 跳过 clone 的仓库：
- 使用的现有本地路径：
- 现有目录是否为 git 仓库：是 / 否 / 未知
- 现有 origin：
- 是否继续完成只读仓库检查：是 / 否

### 6.2 仓库导读

- README 关键信息：
- 依赖文件：
- 配置方式：
- 命令行入口：
- 训练入口：
- 评测入口：
- 推理入口：
- 数据集准备：
- 模型实现：
- 训练逻辑：
- 论文与代码差异：

### 6.3 主论文源码存在时的停止记录

- 是否停止在代码导读阶段：是 / 否
- 是否安装依赖：否
- 是否下载数据：否
- 是否修改官方源码：否
- 是否运行训练/评估/推理：否
- 停止原因：主论文源码已存在，本技能只完成复现准备、仓库导读和报告写入；运行训练属于新的显式运行任务。

## 7. 数据集论文与数据集源码溯源

| 数据集 | 是否主实验依赖 | 原论文/项目页 | 是否找到源码 | 仓库 URL | clone 状态 | 本地路径 | 数据处理代码位置 | 数据处理入口/命令 | 对主论文复现的影响 |
|---|---|---|---|---|---|---|---|---|---|
|  |  |  |  |  | 已 clone / 已存在，跳过 clone / 本地已存在 / 未 clone |  |  |  |  |

### 7.1 数据处理代码定位明细

| 来源仓库 | 文件 | 类型 | 关键函数/类 | 证据 | 可复用方式 | 风险 |
|---|---|---|---|---|---|---|
|  |  | dataset / preprocess / tokenizer / split / feature / benchmark |  | README / 文件名 / 代码片段 |  |  |

### 7.2 数据集访问限制

- 需要申请的数据集：
- 闭源或私有数据：
- 只提供数据下载但无处理代码：
- 对复现的影响：

## 8. 架构或流程解读

- 图示类型：标准模型架构 / prompt 或 agent 流程 / 系统架构 / 不确定
- 模型或流程类型：
- 关键模块：
- 输入输出：
- loss / objective：
- 是否可按代码实现：

## 9. 实验配置清单

| 项目 | 论文给出的信息 | 源码/数据集代码中的信息 | 缺失或需要确认 |
|---|---|---|---|
| 数据集 |  |  |  |
| 预处理 |  |  |  |
| 模型 |  |  |  |
| loss / objective |  |  |  |
| optimizer |  |  |  |
| learning rate |  |  |  |
| batch size |  |  |  |
| epoch / steps |  |  |  |
| GPU / 显存 |  |  |  |
| 指标 |  |  |  |

## 10. 无主论文源码复现工程生成结果

仅在无主论文源码时填写。即使找到数据集源码或 baseline 源码，只要主论文源码不存在且部分可复现，也必须填写本节。

| 项目 | 结论 |
|---|---|
| 是否生成复现工程 | 已生成 / 仅生成 skeleton / 未生成 |
| 工程路径 | `paper-repro-workspace/<paper-slug>/<method-slug>-reproduction/` |
| 生成依据 | 论文证据 / 数据集源码证据 / baseline 证据 / 明确假设 |
| 是否通过静态检查 | 通过 / 部分通过 / 未运行 |
| 是否运行训练 | 未运行，需用户确认 |

### 10.1 生成工程结构

以下为最低基本盘结构，不是不可更改的硬性目录。生成工程至少应包含这些职责清晰的模块；如果论文需要额外模块，可以在此基础上增加目录或文件。

```text
<method-slug>-reproduction/
├── README.md
├── repro-docs/
│   ├── requirements.txt
│   ├── paper-spec.yaml
│   ├── evidence-map.md
│   └── repro-notes.md
├── config.py
├── main.py
├── run.py
├── data/
│   ├── __init__.py
│   ├── dataset.py
│   └── preprocess.py
├── models/
│   ├── __init__.py
│   └── model.py
├── engine/
│   ├── __init__.py
│   ├── train.py
│   └── evaluate.py
└── utils/
    ├── __init__.py
    ├── common.py
    └── metrics.py
```

### 10.2 `repro-docs/` 文件说明

| 文件 | 主要用途 | 注意事项 |
|---|---|---|
| `repro-docs/requirements.txt` | 最小依赖清单，用于创建复现环境 | 只写已确认或最小必要依赖；重型或未确认依赖写入 `repro-notes.md` |
| `repro-docs/paper-spec.yaml` | 论文证据规格，记录任务、模型、数据集、loss、训练和评测信息 | 不是训练配置；训练参数入口仍是 `config.py` |
| `repro-docs/evidence-map.md` | 映射每个代码文件对应的论文证据、数据集源码证据或假设 | 必须区分论文事实、源码证据和 ASSUMPTION |
| `repro-docs/repro-notes.md` | 记录复现限制、缺失信息、人工确认项和运行前注意事项 | 不要把未验证内容写成已完成结果 |

### 10.3 生成代码文件清单

| 文件 | 作用 | 依据 | 是否含假设 |
|---|---|---|---|
| `README.md` | 工程说明和运行命令 |  |  |
| `main.py` | 命令行入口，解析参数后交给 `Run(args).main()` |  |  |
| `config.py` | argparse 参数和超参数默认值 |  |  |
| `run.py` | 按 mode 调度 preprocess/train/eval/inference |  |  |
| `data/dataset.py` | 数据读取与 transform |  |  |
| `data/preprocess.py` | 数据处理脚本 |  |  |
| `models/model.py` | 模型定义 |  |  |
| `engine/train.py` | 训练循环与 loss/objective |  |  |
| `engine/evaluate.py` | 评测循环 |  |  |
| `utils/metrics.py` | 指标函数 |  |  |
| `utils/common.py` | seed、路径、日志等工具 |  |  |

### 10.4 config 参数

| 参数 | 默认值 | 来源 | 备注 |
|---|---|---|---|
|  |  | paper / dataset-code / baseline-code / assumption / todo |  |

### 10.5 model 定义

- 文件：`models/model.py`
- 类名：
- 输入：
- 输出：
- 关键模块：
- 论文依据：
- 缺失/假设：

### 10.6 Training definition

- File: `engine/train.py`
- loss / objective:
- optimizer:
- scheduler:
- checkpoint:
- logging:
- Paper evidence:
- Missing / assumed:

### 10.7 Evaluation definition

- File: `engine/evaluate.py`
- Metrics:
- protocol:
- checkpoint:
- Output files:
- Paper evidence:
- Missing / assumed:

### 10.8 Data processing implementation

| Dataset | Data processing file | Entry command | Source evidence | Risk |
|---|---|---|---|---|
|  | `data/preprocess.py` / `data/dataset.py` | `python main.py --mode preprocess ...` |  |  |

### 10.9 Executable commands

```cmd
python -m pip install -r repro-docs/requirements.txt
python main.py --mode preprocess --dataset <dataset> --data_root <path>
python main.py --mode train --dataset <dataset> --data_root <path>
python main.py --mode eval --checkpoint outputs/best.pt
```

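### 10.10 Unfinished items / items needing manual confirmation

- Data not downloaded:
- Dependencies not installed:
- Missing from the paper:
- Assumptions to confirm: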

## 11. Reasons the paper cannot be reproduced, or cannot be reproduced exactly

- Reason 1:
- Reason 2:
- Reason 3:

## 12. Execution log

| Time | Command / tool | Result | Notes |
|---|---|---|---|
|  |  |  |  |
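
Rows for this log can be produced by a small formatter so that every executed command is recorded in the same shape. The helper below is a hypothetical sketch (`log_row` is not part of the skill); it assumes a `YYYY-MM-DD HH:MM:SS` timestamp format, which the template does not prescribe, and escapes pipe characters so that shell pipelines do not break the table.

```python
from datetime import datetime
from typing import Optional


def log_row(tool: str, result: str, note: str = "",
            when: Optional[datetime] = None) -> str:
    """Format one execution-log entry as a Markdown table row."""
    when = when or datetime.now()
    cells = [when.strftime("%Y-%m-%d %H:%M:%S"), tool, result, note]
    # Escape pipes and flatten newlines so each value stays inside one cell.
    cleaned = [c.replace("|", "\\|").replace("\n", " ") for c in cells]
    return "| " + " | ".join(cleaned) + " |"
```

Appending the returned string to the report after each tool call keeps the log chronological without re-reading the file.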

# Minimal chat summary template

```markdown
[paper-repro-triage active]

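- Report file: `paper-repro-workspace/<paper-slug>/repro-report.md`
- Main paper source code: cloned / already exists, clone skipped / found locally / not found / awaiting approval / clone failed
- Dataset source code: cloned N / already exist, clone skipped for N / found locally N / not found / partially found / not searched
- Data processing code: located in N places / not located / not applicable
- Reproduction project: generated / skeleton only / not generated; path: `paper-repro-workspace/<paper-slug>/<method-slug>-reproduction/`
- Reproduction needed: needed / not needed / partial reproduction recommended
- Reproducible: directly reproducible / partially reproducible / not practically reproducible / not a reproduction target
- Core reason: one sentence; if the paper can be reproduced, write "no core blocker"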
```
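
A summary in this shape can be assembled from ordered label/value pairs. The sketch below is an assumption about how one might do it (`render_summary` is a hypothetical name); it deliberately takes the labels as input rather than hard-coding them, puts the activation marker on the first line, and refuses empty values so that no field is silently dropped.

```python
MARKER = "[paper-repro-triage active]"


def render_summary(fields) -> str:
    """Render ordered (label, value) pairs as the minimal chat summary."""
    lines = [MARKER, ""]
    for label, value in fields:
        if not value.strip():
            raise ValueError(f"summary field {label!r} is empty")
        lines.append(f"- {label}: {value}")
    return "\n".join(lines)
```

For example, `render_summary([("Report file", "`repro-report.md`"), ("Core reason", "no core blocker")])` yields the marker line, a blank line, and one bullet per field.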

FILE:references/reproducibility-rubric.md
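# Reproducibility criteria

Use exactly one of the following four conclusions:

1. Directly reproducible
2. Partially reproducible
3. Not practically reproducible
4. Not a reproduction target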
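
Restricting the result to these four conclusions can be enforced with an enumeration. This is a minimal sketch, not part of the skill: `Verdict` and `parse_verdict` are hypothetical names, and the English value strings are this sketch's rendering of the four conclusions.

```python
from enum import Enum


class Verdict(Enum):
    DIRECT = "directly reproducible"
    PARTIAL = "partially reproducible"
    NOT_PRACTICAL = "not practically reproducible"
    NOT_A_TARGET = "not a reproduction target"


def parse_verdict(text: str) -> Verdict:
    """Map free text to one of the four allowed conclusions, or fail loudly."""
    normalized = text.strip().lower()
    try:
        return Verdict(normalized)
    except ValueError:
        allowed = ", ".join(v.value for v in Verdict)
        raise ValueError(f"{text!r} is not one of: {allowed}") from None
```

Any wording outside the list, such as "probably reproducible", raises an error instead of reaching the report.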

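## Directly reproducible

Most of the following conditions must hold:

- There is an official author repository or a highly trustworthy code repository, or the paper provides enough information to build a minimal runnable implementation consistent with its method.
- The dataset is publicly available, or the paper provides a clear application process.
- Training and evaluation scripts, protocols, or a sufficiently clear procedure are available.
- Key hyperparameters, model configuration, preprocessing, and metrics are specified clearly enough.
- It does not depend on unavailable private models, private data, or closed-source APIs.
- Hardware requirements are within what the user can accept, or a small-scale verifiable path exists.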

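## Partially reproducible

Any one of the following applies:

- Code exists, but the dataset requires an application, is partly missing, or has incomplete preprocessing.
- Data exists, but the code is incomplete or lacks training scripts.
- There is no official source code, but the core method, loss, data pipeline, and evaluation metrics are sufficient to build a minimal viable implementation.
- The core method can be implemented, but hyperparameters, ablations, or evaluation details are insufficient.
- It depends on large-scale compute, but a scaled-down version or the core modules can be reproduced first.
- The approach of a prompt/agent pipeline can be reproduced, but the behavior of closed-source models cannot be reproduced exactly.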

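## Not practically reproducible

Any one of the following applies:

- There is no source code, no key training details, and no obtainable data, and no minimal viable implementation can reasonably be built.
- It depends on private data, private weights, internal logs, or inaccessible systems.
- It depends on uncontrollable behavior of a closed-source API, and the prompts, temperature, version, or toolchain are missing.
- The experimental protocol and metric definitions are unclear, so no reliable comparison can be built.
- The required hardware, data scale, or manual annotation process far exceeds ordinary reproduction capacity, and no scaled-down path exists.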

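## Not a reproduction target

Any one of the following applies:

- The paper is a survey, review, commentary, or opinion piece.
- The main contribution is a conceptual framework or theoretical analysis, with no executable method.
- It is a resource paper that only introduces a dataset or platform, and the user's goal is not to reproduce the data construction process.
- It is a benchmark paper that only defines evaluation tasks; the user is better served by using its benchmark than by "reproducing the paper's method".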

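## Independent judgment on whether reproduction is needed

Judge "can it be reproduced" and "does it need to be reproduced" separately:

- If the paper is a method paper strongly related to the target task and the code/data are sufficient, the answer is usually "needed".
- If it is only a survey or background material, the answer is usually "not needed".
- If code exists but compute or data is limited, the answer is usually "partial reproduction recommended".
- If there is no code but the method is clear, the answer is usually "generate a minimal reproduction project first, then let the user confirm whether to install dependencies, download data, and train".
- If it is a dataset paper, usually prioritize reproducing data loading, preprocessing, and the benchmark usage workflow over rebuilding the entire dataset.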

FILE:agents/openai.yaml
interface:
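  display_name: "Paper Reproduction Executor"
  short_description: "Analyzes papers, searches for dataset-paper source code, automatically clones repositories, and writes a Markdown report"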