@clawhub-willoscar-4fd3446e5a
---
name: citation-injector
description: 'Apply a `citation-diversifier` budget report by injecting *in-scope* citations into an existing draft (NO NEW
FACTS), so the run passes the global unique-citation gate without citation dumps.
**Trigger**: citation injector, apply citation budget, inject citations, add citations safely, 引用注入, 按预算加引用, 引用增密.
**Use when**: `output/CITATION_BUDGET_REPORT.md` exists and you need to raise *global* unique citations (or reduce over-reuse)
before `draft-polisher` / `pipeline-auditor`.
**Skip if**: you need more papers/citations upstream (fix C1/C2 mapping first), or `citations/ref.bib` is missing.
**Network**: none.
**Guardrail**: NO NEW FACTS; do not invent citations; only inject keys present in `citations/ref.bib`; keep injected citations
within each H3’s allowed scope (via the budget report); avoid citation-dump paragraphs (embed cites per work).'
version: 0.1.0
metadata:
  openclaw:
    requires:
      anyBins:
        - python3
        - python
---
# Citation Injector (deterministic baseline edits; budget-as-constraints)
Purpose: make the pipeline converge when the draft is:
- locally citation-dense but **globally under-cited** (too few unique keys), or
- overly reusing the same citations across many subsections.
This skill is intentionally **conservative and scriptable**:
- the script edits `output/DRAFT.md` directly using the budget report as constraints
- injections stay evidence-neutral (NO NEW FACTS) and use only in-scope keys already listed for each H3
## Inputs
- `output/DRAFT.md`
- `output/CITATION_BUDGET_REPORT.md` (from `citation-diversifier`)
- `outline/outline.yml` (H3 id/title mapping)
- `citations/ref.bib` (must contain every injected key)
## Outputs
- `output/DRAFT.md` (updated in place)
- `output/CITATION_INJECTION_REPORT.md` (PASS/FAIL + what you changed)
## Non-negotiables (NO NEW FACTS)
- Only inject keys listed for that H3 in the budget report.
- Do not introduce new numbers, new benchmarks, or superiority claims.
- Do not add narration templates (`This subsection ...`, `Next, we ...`).
- Do not produce cite dumps like `[@a; @b; @c]` as the only citations in a paragraph.
## Paper-voice injection patterns (safe sentence shapes)
Use these as *sentence intentions* (paraphrase; do not copy verbatim).
1) Axis-anchored exemplars (preferred)
- `Systems such as X [@a] and Y [@b] instantiate <axis/design point>, whereas Z [@c] explores a contrasting point under a different protocol.`
2) Parenthetical grounding (short, low-risk)
- `... (e.g., X [@a], Y [@b], Z [@c]).`
3) Cluster pointer + contrast hint
- `Representative implementations span both <cluster A> (X [@a], Y [@b]) and <cluster B> (Z [@c]), suggesting that the trade-off hinges on <lens>.`
4) Decision-lens pointer
- `For builders choosing between <A> and <B>, prior systems provide concrete instantiations on both sides (X [@a]; Y [@b]; Z [@c]).`
5) Evaluation-lens pointer (still evidence-neutral)
- `Across commonly used agent evaluations, systems such as X [@a] and Y [@b] illustrate how <lens> is operationalized, while Z [@c] highlights a different constraint.`
6) Contrast without list voice
- `While many works operationalize <topic> via <mechanism> (X [@a]; Y [@b]), others treat it as <alternative> (Z [@c]), which changes the failure modes discussed later.`
## Anti-patterns (high-signal “budget dump” voice)
Avoid these stems (they read like automated injection):
- `A few representative references include ...`
- `Notable lines of work include ...`
- `Concrete examples include ...`
If your draft contains these, rewrite them immediately using the patterns above (keep citation keys unchanged).
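The stem list above lends itself to a quick scan. The following is a minimal sketch, not the bundled script: the function name `find_dump_stems` and the case-insensitive matching are assumptions made for illustration.

```python
# Stems copied from the anti-pattern list above.
DUMP_STEMS = (
    "A few representative references include",
    "Notable lines of work include",
    "Concrete examples include",
)


def find_dump_stems(draft_text: str) -> list[tuple[int, str]]:
    """Return (line_number, stem) pairs for every line containing a dump stem.

    Hypothetical helper: matching is case-insensitive and line-based.
    """
    hits = []
    for lineno, line in enumerate(draft_text.splitlines(), start=1):
        lowered = line.lower()
        for stem in DUMP_STEMS:
            if stem.lower() in lowered:
                hits.append((lineno, stem))
    return hits


sample = "Agents plan in loops.\nNotable lines of work include X [@a] and Y [@b].\n"
print(find_dump_stems(sample))
```

Any hit marks a sentence to rewrite with one of the safe sentence shapes, keeping its citation keys unchanged.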
## Placement guidance
- Prefer inserting citations where the subsection already states a concrete contrast or decision lens.
- If you must add a new sentence/mini-paragraph, place it early (often after paragraph 1) so it reads as positioning, not as an afterthought.
- Keep injections subsection-specific: mention the subsection lens (H3 title / `contrast_hook`) so the same sentence cannot be copy-pasted into every H3.
## Workflow
1) Read the budget report (`output/CITATION_BUDGET_REPORT.md`)
- Treat `Global target (policy; blocking)` as the PASS line for the pipeline gate (derived from `queries.md:citation_target`; A150++ default: `recommended`).
- If `Gap: 0`, do nothing: write a short PASS report and move on.
- Otherwise, for each H3 with suggested keys, pick enough keys to close the gap to target:
  - small gaps: 3-6 keys / H3
  - A150++ gaps: often 6-12 keys / H3
  Prefer keys that are unused globally and avoid repeating the same new keys across many H3s.
2) Inject in the right subsection
- Use `outline/outline.yml` to confirm H3 ordering and ensure the injected sentence lands inside the correct `###` subsection.
3) Inject with paper voice
- Prefer one short, axis-anchored sentence over a long enumerator sentence.
- Keep injections evidence-neutral (NO NEW FACTS) and avoid new numbers.
- Before you commit an injected key, confirm it exists in `citations/ref.bib`.
4) Write `output/CITATION_INJECTION_REPORT.md`
- Record which H3s you touched and which keys were added.
- Mark `- Status: PASS` only when the global target is met.
5) Verify
- Rerun the validator script (below) to recheck the global target.
- Then run `draft-polisher` to smooth any residual injection voice (citation keys must remain unchanged).
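Steps 3 and 5 can be approximated by hand with a small check. This is a sketch under stated assumptions, not the behavior of `scripts/run.py`: it assumes citations use the `[@key]` / `[@a; @b]` shape shown in this document and that `ref.bib` entries start with `@type{key,`. The function names and regexes are hypothetical.

```python
import re

CITE_BLOCK = re.compile(r"\[(@[^\]]+)\]")          # e.g. [@a; @b]
CITE_KEY = re.compile(r"@([A-Za-z0-9_:.\-]+)")     # assumed key alphabet
BIB_ENTRY = re.compile(r"^@\w+\{\s*([^,\s]+)\s*,", re.MULTILINE)


def draft_keys(draft_text: str) -> set[str]:
    """Collect the unique citation keys used anywhere in the draft."""
    keys: set[str] = set()
    for block in CITE_BLOCK.findall(draft_text):
        keys.update(CITE_KEY.findall(block))
    return keys


def bib_keys(bib_text: str) -> set[str]:
    """Collect the entry keys defined in a BibTeX file."""
    return set(BIB_ENTRY.findall(bib_text))


def check(draft_text: str, bib_text: str, target: int) -> dict:
    """Report the unique-key count, keys missing from the bib, and a pass flag."""
    used = draft_keys(draft_text)
    missing = sorted(used - bib_keys(bib_text))
    return {
        "unique": len(used),
        "missing": missing,
        "passes": not missing and len(used) >= target,
    }


draft = "Systems such as X [@smith2020] and Y [@lee2021; @smith2020] differ."
bib = "@article{smith2020,\n  title={A}\n}\n@inproceedings{lee2021,\n  title={B}\n}\n"
print(check(draft, bib, target=2))
```

The `target` argument stands in for the `Global target (policy; blocking)` value read from the budget report.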
## Done criteria
- `output/CITATION_INJECTION_REPORT.md` exists and is `- Status: PASS`.
- `pipeline-auditor` no longer FAILs on “unique citations too low”.
## Script (optional; deterministic injector + validator)
You usually do not run this manually; it exists so a pipeline runner can deterministically apply a baseline injection and validate the target.
### Quick Start
- `python scripts/run.py --workspace workspaces/<ws>`
### All Options
- `--workspace <dir>`
- `--unit-id <U###>` (optional; for logs)
- `--inputs <semicolon-separated>` (rare override; prefer defaults)
- `--outputs <semicolon-separated>` (rare override; default validates `output/CITATION_INJECTION_REPORT.md`)
- `--checkpoint <C#>` (optional)
### Examples
- After you generate the budget report and want the script to apply the baseline injection:
  - `python scripts/run.py --workspace workspaces/<ws>`
FILE:AGENTS.md
# AGENTS.md
This exported skill uses `AGENTS.md` only as a local repo-root marker for bundled helper scripts.
Source skill: `citation-injector` from `research-units-pipeline-skills`.
Export slug: `citation-injector`.
FILE:pipelines/arxiv-survey-latex.pipeline.md
---
name: arxiv-survey-latex
version: 3.8
variant_of: arxiv-survey
variant_overrides:
routing_hints: [latex, pdf, tex, 可编译, 编译]
routing_default: false
routing_priority: 20
units_template: templates/UNITS.arxiv-survey-latex.csv
target_artifacts:
__append__:
- latex/main.tex
- latex/main.pdf
- output/LATEX_BUILD_REPORT.md
stages:
C5:
title: Draft + PDF
required_skills:
__append__:
- latex-scaffold
- latex-compile-qa
optional_skills:
__remove__:
- latex-scaffold
- latex-compile-qa
produces:
__append__:
- latex/main.tex
- latex/main.pdf
- output/LATEX_BUILD_REPORT.md
---
# Pipeline: arXiv survey / review (MD-first + LaTeX/PDF)
Variant of `arxiv-survey`.
Use this pipeline only when the default deliverable must include:
- `latex/main.tex`
- `latex/main.pdf`
- `output/LATEX_BUILD_REPORT.md`
All non-PDF survey behavior, defaults, checkpoints, and earlier stages inherit from `arxiv-survey`.
FILE:pipelines/arxiv-survey.pipeline.md
---
name: arxiv-survey
version: 3.8
profile: arxiv-survey
routing_hints: [survey, review, 综述, 调研, literature review]
routing_default: true
routing_priority: 10
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
- outline/transitions.md
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/claim_evidence_matrix.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- outline/writer_context_packs.jsonl
- citations/ref.bib
- citations/verified.jsonl
- sections/sections_manifest.jsonl
- sections/h3_bodies.refined.ok
- sections/paragraphs_curated.refined.ok
- sections/style_harmonized.refined.ok
- sections/opener_varied.refined.ok
- sections/abstract.md
- sections/S1.md
- sections/S2.md
- sections/discussion.md
- sections/conclusion.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/SCHEMA_NORMALIZATION_REPORT.md
- output/EVIDENCE_SELFLOOP_TODO.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/PARAGRAPH_CURATION_REPORT.md
- output/FRONT_MATTER_REPORT.md
- output/CHAPTER_LEADS_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/GLOBAL_REVIEW.md
- output/DRAFT.md
- output/MERGE_REPORT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.arxiv-survey.csv
contract_model: pipeline.frontmatter/v1
structure_mode: section_first
pre_retrieval_shell:
enabled: true
approval_surface: false
allowed_h2: [Introduction, Related Work, Core Chapters, Discussion, Conclusion]
binding_layers: [chapter_skeleton, section_bindings, section_briefs, subsection_mapping]
core_chapter_h3_target: 3
query_defaults:
  max_results: 1800
  core_size: 300
  per_subsection: 28
  global_citation_min_subsections: 4
  draft_profile: survey
  citation_target: recommended
  evidence_mode: abstract
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- per_subsection
- global_citation_min_subsections
- draft_profile
- citation_target
- enrich_metadata
- evidence_mode
- fulltext_max_papers
- fulltext_max_pages
- fulltext_min_chars
- time_window.from
- time_window.to
quality_contract:
  citation_policy:
    unique_hard_floor: 150
    unique_recommended: 165
  structure_policy:
    max_final_h2_by_profile:
      survey: 8
      deep: 9
    max_h3_by_profile:
      survey: 10
      deep: 12
  front_matter_policy:
    survey:
      introduction:
        min_cites: 35
        min_paras: 5
        min_chars: 2600
      related_work:
        min_cites: 50
        min_paras: 6
        min_chars: 3200
    deep:
      introduction:
        min_cites: 40
        min_paras: 6
        min_chars: 3000
      related_work:
        min_cites: 55
        min_paras: 7
        min_chars: 3600
  subsection_policy:
    survey:
      min_unique_citations: 12
      min_chars: 4200
    deep:
      min_unique_citations: 14
      min_chars: 5200
loop_policy:
stage_retry_budget:
C1: 2
C2: 2
C3: 1
C4: 1
max_reroutes: 4
require_human_on_retry_after_approval: true
stages:
C0:
title: Init
mode: no_prose
required_skills: [workspace-init, pipeline-router]
optional_skills: []
produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
C1:
title: Retrieval & core set
mode: no_prose
required_skills: [literature-engineer, dedupe-rank]
optional_skills: [keyword-expansion, survey-seed-harvest]
produces: [papers/papers_raw.jsonl, papers/retrieval_report.md, papers/papers_dedup.jsonl, papers/core_set.csv]
C2:
title: Structure
mode: no_prose
required_skills: [taxonomy-builder, chapter-skeleton, section-bindings, section-briefs, outline-builder, section-mapper, outline-refiner, pipeline-router, human-checkpoint]
optional_skills: [outline-budgeter]
produces: [outline/taxonomy.yml, outline/chapter_skeleton.yml, outline/section_bindings.jsonl, outline/section_binding_report.md, outline/section_briefs.jsonl, outline/outline.yml, outline/mapping.tsv, outline/coverage_report.md, outline/outline_state.jsonl, output/REROUTE_STATE.json, DECISIONS.md]
human_checkpoint:
approve: scope + section skeleton + outline
write_to: DECISIONS.md
C3:
title: Evidence
mode: no_prose
required_skills: [pdf-text-extractor, paper-notes, subsection-briefs, chapter-briefs]
optional_skills: []
produces: [papers/fulltext_index.jsonl, papers/paper_notes.jsonl, papers/evidence_bank.jsonl, outline/subsection_briefs.jsonl, outline/chapter_briefs.jsonl]
C4:
title: Citations + evidence packs
mode: no_prose
required_skills: [citation-verifier, evidence-binder, evidence-draft, table-schema, anchor-sheet, table-filler, appendix-table-writer, schema-normalizer, writer-context-pack, evidence-selfloop, claim-matrix-rewriter]
optional_skills: [survey-visuals]
produces: [citations/ref.bib, citations/verified.jsonl, outline/evidence_bindings.jsonl, outline/evidence_binding_report.md, outline/table_schema.md, outline/tables_index.md, outline/tables_appendix.md, output/TABLES_APPENDIX_REPORT.md, outline/evidence_drafts.jsonl, outline/anchor_sheet.jsonl, output/SCHEMA_NORMALIZATION_REPORT.md, outline/writer_context_packs.jsonl, output/EVIDENCE_SELFLOOP_TODO.md, outline/claim_evidence_matrix.md]
C5:
title: Draft
mode: prose_allowed
required_skills: [front-matter-writer, chapter-lead-writer, subsection-writer, writer-selfloop, section-logic-polisher, argument-selfloop, paragraph-curator, style-harmonizer, opener-variator, evaluation-anchor-checker, transition-weaver, section-merger, post-merge-voice-gate, citation-diversifier, citation-injector, draft-polisher, global-reviewer, pipeline-auditor, artifact-contract-auditor]
optional_skills: [prose-writer, subsection-polisher, redundancy-pruner, terminology-normalizer, limitation-weaver, latex-scaffold, latex-compile-qa]
produces: [outline/transitions.md, sections/sections_manifest.jsonl, sections/h3_bodies.refined.ok, sections/paragraphs_curated.refined.ok, sections/style_harmonized.refined.ok, sections/opener_varied.refined.ok, sections/abstract.md, sections/S1.md, sections/S2.md, sections/discussion.md, sections/conclusion.md, output/WRITER_SELFLOOP_TODO.md, output/EVAL_ANCHOR_REPORT.md, output/ARGUMENT_SELFLOOP_TODO.md, output/SECTION_ARGUMENT_SUMMARIES.jsonl, output/ARGUMENT_SKELETON.md, output/PARAGRAPH_CURATION_REPORT.md, output/FRONT_MATTER_REPORT.md, output/CHAPTER_LEADS_REPORT.md, output/SECTION_LOGIC_REPORT.md, output/MERGE_REPORT.md, output/DRAFT.md, output/POST_MERGE_VOICE_REPORT.md, output/CITATION_BUDGET_REPORT.md, output/CITATION_INJECTION_REPORT.md, output/GLOBAL_REVIEW.md, output/AUDIT_REPORT.md, output/CONTRACT_REPORT.md]
---
# Pipeline: arXiv survey / review (MD-first)
Default contract (survey-grade, A150++):
- `queries.md` defaults are set for a *survey deliverable* (no silent downgrade): `core_size=300`, `per_subsection=28`, global unique citations hard floor `>=150` (recommended `>=165` when `core_size=300`; default `citation_target=recommended`).
- `draft_profile` controls **writing strictness** (`survey` vs `deep`), not “speed mode”.
- `evidence_mode` controls **evidence strength** (`abstract` default; `fulltext` optional and heavier).
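How `citation_target` selects the blocking threshold can be illustrated in a few lines. This is a sketch only: the numbers come from the `citation_policy` defaults in this file, while `blocking_target` and `gap` are hypothetical names, not functions of the pipeline.

```python
# Defaults taken from quality_contract.citation_policy in this pipeline file.
CITATION_POLICY = {"unique_hard_floor": 150, "unique_recommended": 165}


def blocking_target(citation_target: str, policy: dict = CITATION_POLICY) -> int:
    """Map queries.md:citation_target to the unique-citation count that blocks."""
    if citation_target == "hard":
        return policy["unique_hard_floor"]
    if citation_target == "recommended":
        return policy["unique_recommended"]
    raise ValueError(f"unknown citation_target: {citation_target!r}")


def gap(unique_in_draft: int, citation_target: str = "recommended") -> int:
    """Number of additional unique citations needed to reach the blocking target."""
    return max(0, blocking_target(citation_target) - unique_in_draft)


print(gap(158))          # 7 short of the recommended target
print(gap(158, "hard"))  # 0: the hard floor is already met
```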
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- survey-seed-harvest
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- `queries.md` may specify `max_results` and a year window (`time_window.from` / `time_window.to`); `arxiv-search` will paginate and attach arXiv metadata (categories, arxiv_id, etc.) when online.
- If you import an offline export but later have network, you can set `enrich_metadata: true` in `queries.md` (or run `arxiv-search --enrich-metadata`) to backfill missing abstracts/authors/categories via arXiv `id_list`.
- Evidence-first expectation (A150++): aim for a large dedup pool (target >=1200, not ~200) and a stable, verifiable core set (`core_size=300`) so later stages can bind wide in-scope citation pools without forcing out-of-scope drift.
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- chapter-skeleton
- section-bindings
- section-briefs
- outline-builder
- section-mapper
- outline-refiner
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
human_checkpoint:
- approve: scope + section skeleton + outline
- write_to: DECISIONS.md
Notes:
- `chapter-skeleton` is the first retrieval-informed chapter contract; it stays chapter-level and does not emit stable H3 ids.
- `section-bindings` measures chapter saturation before H3 decomposition and writes PASS/BLOCKED signals into `outline/section_binding_report.md`.
- `section-briefs` turns the chapter layer into decomposition guidance plus subsection seeds; `outline-builder` then derives the first stable `outline/outline.yml` from that section layer.
- `outline-refiner` now writes both `outline/outline_state.jsonl` and `output/REROUTE_STATE.json`; section-first reroute/block information should be read from those artifacts instead of inferred from prose notes.
- Evidence-first expectation: each subsection should be written as a *question to answer* (RQ) plus *evidence needs* (what kind of citations/results are required), not just generic scaffold bullets.
- Coverage default: `section-mapper` uses `queries.md:per_subsection` as the per-H3 mapping contract (A150++ default: 28) so later evidence binding and writing have enough in-scope citations to choose from.
- Diversity expectation: mapping should not over-reuse a few papers across unrelated H3s; reserve “global” works for genuinely cross-cutting citations (controlled by `global_citation_min_subsections`).
- Budget policy (paper-like): avoid H3 explosion; the outline gate uses `queries.md:draft_profile` to set max H3 (survey<=10, deep<=12).
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent H3s into fewer, thicker units, then rerun `section-mapper` → `outline-refiner` before `Approve C2`.
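The H3 budget rule in the notes above reduces to a count against a per-profile cap. A minimal sketch, assuming the outline has already been loaded into a plain list of H3 titles; `h3_over_budget` is a hypothetical helper and the real outline gate may check more than this.

```python
# Caps taken from the budget policy above (survey<=10, deep<=12).
MAX_H3_BY_PROFILE = {"survey": 10, "deep": 12}


def h3_over_budget(h3_titles: list[str], draft_profile: str = "survey") -> int:
    """Return how many H3s exceed the cap for the profile (0 means within budget)."""
    return max(0, len(h3_titles) - MAX_H3_BY_PROFILE[draft_profile])


titles = [f"Subsection {i}" for i in range(1, 12)]  # 11 placeholder H3 titles
print(h3_over_budget(titles, "survey"))
print(h3_over_budget(titles, "deep"))
```

A positive result is the signal to run `outline-budgeter` and merge adjacent H3s before approval.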
## Stage 3 - Evidence (C3) [NO PROSE]
required_skills:
- pdf-text-extractor
- paper-notes
- subsection-briefs
- chapter-briefs
produces:
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
Notes:
- `queries.md` can set `evidence_mode: "abstract"|"fulltext"` (A150++ default: `abstract`).
- `queries.md` can set `draft_profile: "survey"|"deep"` to control writing gate strictness (A150++ default: `survey`).
- If `evidence_mode: "fulltext"`, `pdf-text-extractor` can be tuned via `fulltext_max_papers`, `fulltext_max_pages`, `fulltext_min_chars`.
- `subsection-briefs` converts each H3 into a verifiable writing card (scope_rule/rq/axes/clusters/paragraph_plan) so writing does not copy outline scaffolds.
- Optional refinement markers (recommended): treat briefs as *contracts*, not scaffolds. If you manually refine them and want to prevent regeneration, create:
  - `outline/subsection_briefs.refined.ok`
  - `outline/chapter_briefs.refined.ok`
  These markers are used as explicit “reviewed/refined” signals and as a freeze switch (scripts won’t overwrite refined briefs).
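The freeze-switch idea can be sketched as a guard around writes. This illustrates the convention only and is not the pipeline's own code: it assumes the marker sits next to the artifact and replaces the artifact's extension with `.refined.ok`, as in the two marker paths listed above. `is_frozen` and `write_unless_frozen` are hypothetical names.

```python
from pathlib import Path


def is_frozen(artifact: Path) -> bool:
    """True if a sibling marker exists, e.g.
    outline/subsection_briefs.jsonl -> outline/subsection_briefs.refined.ok
    """
    marker = artifact.with_name(artifact.stem + ".refined.ok")
    return marker.exists()


def write_unless_frozen(artifact: Path, content: str) -> bool:
    """Write the artifact unless it is frozen; return whether a write happened."""
    if is_frozen(artifact):
        return False
    artifact.parent.mkdir(parents=True, exist_ok=True)
    artifact.write_text(content, encoding="utf-8")
    return True
```

Returning a boolean lets a caller log a skipped regeneration instead of silently overwriting refined briefs.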
## Stage 4 - Citations + evidence packs (C4) [NO PROSE]
required_skills:
- citation-verifier
- evidence-binder
- evidence-draft
- table-schema
- anchor-sheet
- table-filler
- appendix-table-writer
- schema-normalizer
- writer-context-pack
- evidence-selfloop
- claim-matrix-rewriter
optional_skills:
- survey-visuals
produces:
- citations/ref.bib
- citations/verified.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- output/SCHEMA_NORMALIZATION_REPORT.md
- outline/writer_context_packs.jsonl
- output/EVIDENCE_SELFLOOP_TODO.md
- outline/claim_evidence_matrix.md
Notes:
- `evidence-draft` turns paper notes into per-subsection evidence packs (claim candidates + concrete comparisons + eval protocol + limitations) that the writer must follow.
- `claim-matrix-rewriter` makes `outline/claim_evidence_matrix.md` a projection/index of evidence packs (not an outline expansion), so writer guidance stays evidence-first.
- `writer-context-pack` builds a deterministic per-H3 drafting pack (briefs + evidence + anchors + allowed cites), reducing hollow writing and making C5 more debuggable.
- Tables are part of the default survey deliverable, but split into two layers:
- `outline/tables_index.md` (internal index; produced by `table-filler`; useful for planning/debugging; NOT inserted into the paper)
- `outline/tables_appendix.md` (reader-facing; produced by `appendix-table-writer`; clean/publishable; inserted into the draft as an Appendix block by `section-merger`)
- Optional: `survey-visuals` can still produce timeline/figure specs as intermediate artifacts.
- Optional refinement markers (recommended): after you spot-check/refine C4 artifacts and want to freeze them, create:
- `outline/evidence_bindings.refined.ok`
- `outline/evidence_drafts.refined.ok`
- `outline/anchor_sheet.refined.ok`
- `outline/writer_context_packs.refined.ok`
These markers make “reviewed/refined” explicit and prevent accidental regeneration/overwrite; strict mode relies on content checks (placeholders/blocking_missing/scope), not marker presence.
## Stage 5 - Draft (C5) [PROSE AFTER C2]
required_skills:
- front-matter-writer
- chapter-lead-writer
- subsection-writer
- writer-selfloop
- section-logic-polisher
- argument-selfloop
- paragraph-curator
- style-harmonizer
- opener-variator
- evaluation-anchor-checker
- transition-weaver
- section-merger
- post-merge-voice-gate
- citation-diversifier
- citation-injector
- draft-polisher
- global-reviewer
- pipeline-auditor
- artifact-contract-auditor
optional_skills:
- prose-writer
- subsection-polisher
- redundancy-pruner
- terminology-normalizer
- limitation-weaver
- latex-scaffold
- latex-compile-qa
produces:
- sections/sections_manifest.jsonl
- sections/abstract.md
- sections/discussion.md
- sections/conclusion.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/MERGE_REPORT.md
- output/DRAFT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/GLOBAL_REVIEW.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
Notes:
- C5 writing system (semantic + minimal artifacts; no extra machinery):
  - **Unit of work**: `sections/*.md` (front matter, H2 leads, H3 bodies). Avoid editing `output/DRAFT.md` directly until after merge.
  - **Single source of truth (locked conventions)**: `output/ARGUMENT_SKELETON.md` → `## Consistency Contract` (terminology, scope boundary, evaluation protocol fields, baseline naming).
  - **Write → check → fix (five gates)**:
    1) `writer-selfloop` → `output/WRITER_SELFLOOP_TODO.md`: file existence, depth, citation scope, paper voice.
    2) `section-logic-polisher` → `output/SECTION_LOGIC_REPORT.md`: paragraph linkage (no jump cuts / “paragraph islands”).
    3) `argument-selfloop` → `output/ARGUMENT_SELFLOOP_TODO.md` + `output/ARGUMENT_SKELETON.md` + `output/SECTION_ARGUMENT_SUMMARIES.jsonl`: section-level closure + premise/definition stability.
    4) `paragraph-curator` → `output/PARAGRAPH_CURATION_REPORT.md`: **select → evaluate → subset → fuse** so sections converge (reduce redundancy, strengthen synthesis) without changing citation keys.
    5) `evaluation-anchor-checker` → `output/EVAL_ANCHOR_REPORT.md`: final section-level numeric hygiene sweep so surviving numeric claims keep same-sentence task/metric/constraint context before merge.
  - **Openers-last**: draft the middle first; rewrite paragraph 1 last so it reflects real content (front matter + H3).
- Writing self-loop gate: `subsection-writer` ensures the full `sections/` file set exists (and emits `sections/sections_manifest.jsonl`); `writer-selfloop` blocks until depth/citation-scope/paper-voice checks pass, writing `output/WRITER_SELFLOOP_TODO.md` (PASS/FAIL).
- Argument self-loop gate: `argument-selfloop` blocks “smooth but hollow” sections by making the argument chain explicit (per-section paragraph moves + a global dependency skeleton). Its ledgers are intermediate artifacts and must never be merged into the paper.
- Style hygiene (C5 hard gate for `survey`/`deep`): treat `output/WRITER_SELFLOOP_TODO.md` Style Smells as mandatory fixes. Run `style-harmonizer` + `opener-variator` on flagged files, then rerun `writer-selfloop` before merge.
- Numeric hygiene gate: run `evaluation-anchor-checker` after the last section-level rewrites (`paragraph-curator` + `style-harmonizer` + `opener-variator`) and before merge. If later section rewrites touch the same H3 files, rerun it; do not wait for `pipeline-auditor` to discover underspecified numbers in the merged draft.
- Micro-fix routing (preferred over broad rewrites): if Style Smells are specific, use targeted micro-skills before a general harmonize pass:
  - opener cadence / “overview” narration → `opener-variator`
  - count-based limitation slots (“Two limitations…”) → `limitation-weaver`
  - underspecified numeric/performance claims (missing task/metric/budget) → `evaluation-anchor-checker`
- Triage rule (prevents “patching holes with prose”): if `writer-selfloop` FAILs because a subsection cannot meet `must_use` *in-scope* (thin packs / missing anchors / out-of-scope citation pressure), stop and rerun the evidence loop (`evidence-selfloop` + upstream C2/C3/C4) instead of padding prose.
- WebWeaver-style “planner vs writer” split (single agent, two passes):
  - Planner pass: for each section/subsection, pick the exact citation IDs to use from the evidence bank (`outline/evidence_drafts.jsonl`) and keep scope consistent with the outline.
  - Writer pass: write that section using only those citation IDs; avoid dumping the whole notes set into context.
- Treat this stage as an iteration loop: draft per H3 → logic-polish (thesis + connectors) → weave transitions → merge → de-template/cohere → global review → (if gaps) back to C3/C4 → regenerate.
- Post-merge voice gate: `post-merge-voice-gate` treats `outline/transitions.md` as a high-frequency injection source. If it FAILs, fix the *source* (usually transitions via `transition-weaver`, or the owning `sections/*.md`) and re-merge; do not “patch around it” in `draft-polisher`.
- Depth target (profile-aware): each H3 should be “few but thick” (avoid stubs). Use `queries.md:draft_profile` as the contract:
  - `survey`: >=10 paragraphs + >=12 unique cites
  - `deep`: >=11 paragraphs + >=14 unique cites
  In all profiles, require >=2 concrete contrasts + evaluation anchoring + a cross-paper synthesis paragraph + an explicit limitation.
- Profile semantics: `survey` is the default deliverable contract; `deep` is stricter (and typically pairs well with `evidence_mode: fulltext`).
- Coherence target (paper-like): for every H2 chapter with H3 subsections, write a short **chapter lead** block (`sections/S<sec_id>_lead.md`) that previews the comparison axes and how the H3s connect (no new headings; avoid generic glue).
- Anti-template style contract (paper-like, not “outline narration”):
- Avoid meta openers like “This subsection surveys/argues …” and slide-like navigation (“Next, we move from … / We now turn to …”).
- Keep signposting light: avoid repeating a literal opener label across many subsections (e.g., `Key takeaway:`); vary opener phrasing and cadence.
- Tone target: calm, academic, understated; delete hype words (`clearly`, `obviously`) and “PPT speaker notes”.
- Keep evidence-policy disclaimers **once** in front matter (not repeated across H3s).
- If you cite numbers, include minimal evaluation context (task + metric + constraint/budget/cost) in the same paragraph.
- Citation shape must be reader-facing: no adjacent citation blocks (e.g., `[@a] [@b]`), no duplicate keys in one block (e.g., `[@a; @a]`), and avoid tail-only citation style by keeping mid-sentence citations in each H3.
- `section-merger` merges `sections/*.md` plus `outline/transitions.md` (within-chapter H3→H3 by default). Between-H2 transition insertion is optional: create `outline/transitions.insert_h2.ok` in the workspace if you want narrator-style handoffs included.
- Tables are part of the default deliverable: `outline/tables_appendix.md` is inserted into the draft by `section-merger` as a single Appendix block (index tables in `outline/tables_index.md` remain intermediate) unless `outline/tables.insert.off` exists. Other visuals (`outline/timeline.md`, `outline/figures.md`) remain intermediate by default.
- Citation scope policy: citations are subsection-first (from `outline/evidence_bindings.jsonl`), with limited reuse allowed within the same H2 chapter to reduce brittleness; avoid cross-chapter “free cite” drift.
- Controlled flexibility: bibkeys mapped to >= `queries.md:global_citation_min_subsections` subsections (A150++ default: 4) are treated as cross-cutting/global; see `allowed_bibkeys_global` in writer packs / `sections_manifest.jsonl`.
- If global unique citations are low, run `citation-diversifier` → `citation-injector` *before* `draft-polisher` (the polisher treats citation keys as immutable).
- `queries.md` can set `citation_target: recommended|hard` to control whether the recommended target is enforced as blocking (default: `recommended` for A150++).
- If you intentionally add/remove citations after an earlier polish run, reset the citation-anchoring baseline before rerunning `draft-polisher`:
  - delete `output/citation_anchors.prepolish.jsonl` (workspace-local), then rerun `draft-polisher`.
- Recommended skills (toolkit, not a rigid one-shot chain):
- Modular drafting: `subsection-writer` → `writer-selfloop` → `section-logic-polisher` → `argument-selfloop` → `paragraph-curator` → `style-harmonizer` → `opener-variator` → `evaluation-anchor-checker` → `transition-weaver` → `section-merger` → `draft-polisher` → `global-reviewer` → `pipeline-auditor`.
- Legacy one-shot drafting: `prose-writer` (kept for quick experiments; less debuggable).
- If the draft reads like “paragraph islands”, run `section-logic-polisher` and patch only failing `sections/S*.md` until PASS, then merge.
- Add `pipeline-auditor` after `global-reviewer` as a regression test (blocks on ellipsis, repeated boilerplate, and citation hygiene).
- If you also need a PDF deliverable, use `latex-scaffold` + `latex-compile-qa` (see `arxiv-survey-latex`).
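The “controlled flexibility” rule in the notes above (bibkeys mapped to at least `global_citation_min_subsections` subsections count as cross-cutting) can be sketched as a simple count. The in-memory mapping shape used here, subsection id to list of bibkeys, is an assumption for illustration; the on-disk format of the mapping and writer packs is not shown in this document, and `global_bibkeys` is a hypothetical name.

```python
from collections import Counter


def global_bibkeys(mapping: dict[str, list[str]], min_subsections: int = 4) -> list[str]:
    """Return bibkeys mapped to at least `min_subsections` distinct subsections."""
    counts: Counter[str] = Counter()
    for keys in mapping.values():
        counts.update(set(keys))  # count each key once per subsection
    return sorted(key for key, n in counts.items() if n >= min_subsections)


# Placeholder subsection ids and keys, invented for the example.
mapping = {
    "3.1": ["survey2023", "niche2021"],
    "3.2": ["survey2023", "survey2023"],
    "4.1": ["survey2023", "other2022"],
    "4.2": ["survey2023", "other2022"],
}
print(global_bibkeys(mapping))
```

Keys below the threshold stay subsection-scoped, which is what keeps cross-chapter “free cite” drift in check.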
## Quality gates (strict mode)
- Citation coverage: expect a large, verifiable bibliography (A150++ default: `core_size=300` → `ref.bib` ~300) and high cite density:
- Per-H3: `survey` profile expects >=12 unique citations per H3 (and deeper profiles may require more).
- Front matter: `survey` profile expects Introduction>=35 and Related Work>=50 unique citations (dense positioning; no cite dumps).
- Global: `pipeline-auditor` gates on **global unique citations across the full draft**. A150++ defaults: hard `>=150`; recommended `>=165` (when bib=300). `queries.md:citation_target` controls which is blocking (default: `recommended`). If it fails, prefer `citation-diversifier` → `citation-injector` (in-scope, NO NEW FACTS) using each H3’s `allowed_bibkeys_selected` / `allowed_bibkeys_mapped` from `outline/writer_context_packs.jsonl`.
- Anti-template: drafts containing ellipsis placeholders (`…`) or leaked scaffold instructions (e.g., "enumerate 2-4 ...") should block and be regenerated from improved outline/mapping/evidence artifacts.
- Final polish hard gates (`survey`/`deep`): block on narration-template openers (e.g., `This subsection ...`), slide navigation phrasing, repeated opener stems/verbal tics, adjacent citation blocks (`[@a] [@b]`), duplicate keys in one block (`[@a; @a]`), and low H3 mid-sentence citation ratio (<30%).
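The citation-hygiene gates above can be approximated with a few regular expressions. The following is a minimal sketch, not the `pipeline-auditor` implementation: the function names are hypothetical, the key pattern is an assumption about what a citation key looks like, and the default threshold is the hard A150++ value of 150 quoted above.

```python
import re

# Assumption: citations use Pandoc-style blocks such as [@key] or [@a; @b].
CITE_BLOCK = re.compile(r"\[@([^\]]+)\]")
ADJACENT_BLOCKS = re.compile(r"\[@[^\]]+\]\s*\[@[^\]]+\]")
KEY = re.compile(r"[A-Za-z0-9:_-]+")


def unique_cite_keys(text: str) -> set[str]:
    """Return every distinct citation key found in [@...] blocks."""
    keys: set[str] = set()
    for block in CITE_BLOCK.finditer(text):
        keys.update(KEY.findall(block.group(1)))
    return keys


def passes_global_gate(text: str, minimum: int = 150) -> bool:
    """Hypothetical check: does the draft cite at least `minimum` unique keys?"""
    return len(unique_cite_keys(text)) >= minimum


def hygiene_issues(text: str) -> list[str]:
    """Flag adjacent citation blocks and duplicate keys inside one block."""
    issues: list[str] = []
    if ADJACENT_BLOCKS.search(text):
        issues.append("adjacent citation blocks")
    for block in CITE_BLOCK.finditer(text):
        keys = KEY.findall(block.group(1))
        if len(keys) != len(set(keys)):
            issues.append(f"duplicate key in block: {block.group(0)}")
    return issues
```

Running `hygiene_issues` on a draft before `pipeline-auditor` gives a quick local signal; the mid-sentence citation ratio and opener checks are deliberately left out of this sketch.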
FILE:pipelines/graduate-paper-pipeline.md
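# Pipeline: Chinese graduate thesis reconstruction and finalization (research-stage draft)
> Status: research-stage draft
> Current positioning: study the workflow and the skills it needs first; **do not bind UNITS yet, and do not force it into an executable contract**
> Intended for: Chinese graduate thesis writing that starts from an existing university template, prior papers, Overleaf sources, PDFs, BibTeX, experiment figures and tables, and intermediate working documents
> Core goal: reconstruct the "existing materials" into a Chinese graduate thesis with a unified main line, smooth structure, complete evidence, consistent terminology, and a compilable deliverable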
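## 1. The nature of this Pipeline
This is neither a linear "write a thesis from scratch" workflow nor an assembly workflow that "stitches several papers together". It is a thesis-engineering Pipeline that is **driven by a question list, centered on a Markdown intermediate layer, and closed out in a TeX delivery layer**.
Its actual main loop is:
> `question list -> semantic localization -> Markdown reconstruction -> TeX write-back -> compile and review -> write issues back -> enter the next round`
The primary goal of this Pipeline is therefore not to "rush out a first draft of the body text", but to get the following right first:
1. Recover the existing materials first, instead of editing blindly in `tex`.
2. Reconstruct chapter roles around the thesis main line first, instead of copying the original papers' narrative.
3. Think through structure, evidence, terminology, and figures in the Markdown intermediate layer first, then write back to TeX.
4. Resolve structure and evidence problems first, then style and AI-sounding phrasing.
5. The real convergence point is not "one full pass written" but "the question list is cleared round by round, and compilation and review pass stably".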
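## 2. The fundamental difference between this Pipeline and the survey Pipeline
It differs substantially from `arxiv-survey` / `arxiv-survey-latex` and must be modeled separately:
- the survey pipeline starts from "retrieving literature by topic and converging the structure layer by layer";
- the graduate-paper pipeline starts from "inventorying and reconstructing existing materials";
- the survey pipeline's core intermediate artifacts are `outline/evidence_drafts/...`;
- the graduate-paper pipeline's core intermediate artifacts are the outline, question list, chapter intermediate drafts, figure plan, and review records under `codex_md/`;
- the survey pipeline's main goal is to "produce a survey";
- the graduate-paper pipeline's main goal is to "reconstruct existing research work into a finalized thesis that meets Chinese degree-thesis conventions".
In one sentence:
> survey is closer to a "retrieval-driven evidence-writing workflow";
> graduate-paper is closer to an "existing-material-driven thesis-engineering reconstruction workflow".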
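## 3. Workspace layering and key artifacts
This Pipeline needs a clear separation between the **thinking layer** and the **delivery layer**.
### 3.1 Directory responsibilities
- `main.tex`: the thesis entry point; assembles the cover, abstract, table of contents, body, appendices, acknowledgements, achievements, and references.
- `chapters/`: final delivery version of the body `.tex`.
- `abstract/`, `preface/`, `acknowledgement/`, `achievements/`: formal deliverables outside the body text.
- `references/`: bibliography database and style files.
- `pdf/`: PDFs of published or submitted papers.
- `Overleaf_ref/`: existing source manuscripts, revision drafts, and supplementary materials.
- `codex_md/`: the intermediate working layer; this is the **thesis reconstruction workspace**.
- `claude_md/`: review checklists and final-draft verification records.
- `tmp_layout/`, `tmp_layout2/`: figure/table trial layouts and page-layout rehearsals.
The most important distinction is:
- `codex_md/` is the **thinking / reconstruction area**
- `chapters/` is the **delivery / finalization area**
Do not mix the two.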
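### 3.2 Recommended fixed intermediate artifacts
It is recommended to keep the following artifacts fixed over the long term in this Pipeline:
- `codex_md/material_index.md`: inventory and index of existing materials
- `codex_md/material_readiness.md`: material readiness and gap notes
- `codex_md/missing_info.md`: list of missing information
- `codex_md/question_list.md`: this round's question list, priorities, and acceptance criteria
- `codex_md/00_thesis_outline.md`: thesis main line, outline, and chapter roles
- `codex_md/chapter_role_map.md`: source material -> chapter -> role mapping table
- `codex_md/chapter_rewrite_rules.md`: chapter reconstruction rules table
- `codex_md/terminology_glossary.md`: unified terminology table
- `codex_md/symbol_metric_table.md`: unified symbol / metric table
- `codex_md/figure_plan.md`: figure/table and layout plan
- `claude_md/review_checklist.md`: review checklist for compilation, typesetting, data, and template items
- `output/THESIS_BUILD_REPORT.md`: compilation and final-draft check report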
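Because these artifacts are meant to stay fixed, a small presence check can show which ones a workspace still lacks. The sketch below is illustrative only: the function name `missing_artifacts` is hypothetical, and it assumes the paths are resolved relative to the thesis workspace root.

```python
from pathlib import Path

# Paths copied from the fixed-artifact list in section 3.2.
FIXED_ARTIFACTS = (
    "codex_md/material_index.md",
    "codex_md/material_readiness.md",
    "codex_md/missing_info.md",
    "codex_md/question_list.md",
    "codex_md/00_thesis_outline.md",
    "codex_md/chapter_role_map.md",
    "codex_md/chapter_rewrite_rules.md",
    "codex_md/terminology_glossary.md",
    "codex_md/symbol_metric_table.md",
    "codex_md/figure_plan.md",
    "claude_md/review_checklist.md",
    "output/THESIS_BUILD_REPORT.md",
)


def missing_artifacts(workspace: Path, expected: tuple[str, ...] = FIXED_ARTIFACTS) -> list[str]:
    """Return the expected artifacts that are not regular files under `workspace`."""
    return [rel for rel in expected if not (workspace / rel).is_file()]


if __name__ == "__main__":
    # Assumption: the script is run from the thesis workspace root.
    for rel in missing_artifacts(Path(".")):
        print(f"missing: {rel}")
```

An empty result only shows that the files exist; it says nothing about whether their content is up to date.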
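## 4. The skills this Pipeline actually needs
For now, setting aside reuse and generalization and splitting purely by what this graduate-thesis Pipeline itself needs, the first version should use the following skills.
### 4.1 Main-chain skills
1. `thesis-workspace-init`
Responsibility: initialize the thesis project, remind the user to place materials, set up the workspace, check whether `main.tex` compiles, and generate the material inventory.
2. `thesis-question-list`
Responsibility: create and maintain `question_list.md`, fixing each round's modification goals, issue priorities, boundaries, and acceptance criteria.
This is the **control-plane skill** of the whole Pipeline.
3. `thesis-source-role-mapper`
Responsibility: map existing papers / templates / source manuscripts / PDFs / figure materials to "thesis roles", not just `paper -> chapter`.
4. `thesis-chapter-reconstructor`
Responsibility: reconstruct each chapter's goal, weight, hand-offs, and narrative approach around the thesis main line.
This is the **core reconstruction skill** of the whole Pipeline.
5. `thesis-markdown-aligner`
Responsibility: unify the main line, terminology, symbols, metrics, and figure/table conventions, so that the chapters converge in the Markdown intermediate layer into one thesis rather than a set of papers.
6. `thesis-tex-writeback`
Responsibility: write the content already straightened out in the Markdown layer back to `chapters/*.tex`, and synchronize figures, equations, cross-references, and chapter hand-offs.
7. `thesis-compile-review`
Responsibility: run compilation, warning triage, template-mode checks, data-consistency verification, and issue write-back.
It forms the final quality closed loop.
8. `thesis-style-polisher`
Responsibility: once structure, evidence, and data are stable, polish the text in Chinese degree-thesis style and remove AI-sounding phrasing.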
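### 4.2 Parallel side-branch skills
9. `thesis-visual-layout-planner`
Responsibility: figure/table planning, Mermaid sketches, temporary layout mock-ups, and checking the rhythm between figures and text.
10. `thesis-frontmatter-sync`
Responsibility: synchronize the abstract, cover, appendices, achievements, acknowledgements, lists, and other non-body parts, so they are not left to be filled in at the very end.
11. `thesis-citation-enhance-review`
Responsibility: locate sentences that must be supported by citations, expand candidate references, verify that citations match the claims, and write back to `references/*.bib` and the in-text citations.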
### 4.3 How these 11 skills relate
The main chain is:
`thesis-workspace-init -> thesis-question-list -> thesis-source-role-mapper -> thesis-chapter-reconstructor -> thesis-markdown-aligner -> thesis-tex-writeback -> thesis-compile-review -> thesis-style-polisher`
The parallel side branches are:
- `thesis-visual-layout-planner`
- `thesis-frontmatter-sync`
- `thesis-citation-enhance-review`
Among them:
- `thesis-question-list` is the control plane for the whole workflow
- `thesis-chapter-reconstructor` is the core reconstruction plane
- `thesis-compile-review` is the quality closed-loop plane
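## 5. Pipeline overview (by stage)
| Stage | Goal | Core skills | Main outputs |
|---|---|---|---|
| Stage 0 | Project initialization and material intake | `thesis-workspace-init` | Directory skeleton, initial compilation, material index, material readiness notes |
| Stage 1 | Recover existing materials into the Markdown intermediate layer | `thesis-workspace-init` | Initial Markdown materials, missing-information list, material gap notes |
| Stage 1.5 | Lock this round's issues and boundaries | `thesis-question-list` | `question_list.md` |
| Stage 2 | Map source materials to thesis roles | `thesis-source-role-mapper` | Chapter role map, materials grouped by chapter |
| Stage 2.5 | Reconstruct chapters around the main line | `thesis-chapter-reconstructor` | Reconstructed chapter Markdown |
| Stage 3 | Align, merge, and unify the whole-thesis structure | `thesis-markdown-aligner` | Stable outline, terminology table, symbol table, evidence-gap list |
| Stage 3.5 | Figure and layout rehearsal | `thesis-visual-layout-planner` | Figure plan, figure-text mapping, temporary layout results |
| Stage 4 | Write back to the TeX delivery layer | `thesis-tex-writeback` | Compilable chapter `.tex`, first complete `main.pdf` |
| Stage 4.5 | Synchronize non-body parts | `thesis-frontmatter-sync` | Synchronized versions of the abstract, appendices, cover, achievements, etc. |
| Stage 5 | Citation enhancement and verification | `thesis-citation-enhance-review` | Citation reinforcement results, verification records |
| Stage 6 | Compilation, typesetting, and final-draft review | `thesis-compile-review` | `THESIS_BUILD_REPORT`, review checklist |
| Stage 7 | Final polish and removal of AI-sounding phrasing | `thesis-style-polisher` | A more natural Chinese final draft |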
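## 6. Stage-by-stage details
### Stage 0: Project initialization and material intake
**Goal**
Turn the thesis into a maintainable project first, rather than a pile of scattered files.
**Main inputs**
- University template
- Existing repository / source manuscripts
- Basic metadata such as student ID, year, and Chinese and English titles
- Existing `main.tex`
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Basic directory skeleton
- Initial `main.tex` / initial `main.pdf`
- `codex_md/material_index.md`
- `codex_md/material_readiness.md`
**Hand-off**
- If `main.tex` does not compile yet, do not enter the body reconstruction stages
- If the materials are not yet in place, do not start the question list or chapter reconstruction
**Execution focus**
- Make clear which areas are for materials, which are for working, and which are for delivery
- Make sure it "compiles" first, then worry about whether it "reads well"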
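### Stage 1: Recover existing materials into the Markdown intermediate layer
**Goal**
Recover the reusable content from the existing `template / tex / Overleaf / PDF / bib` into a Markdown intermediate layer, instead of hard-editing directly in the TeX layer.
**Main inputs**
- University template
- Existing `.tex`
- Overleaf source manuscripts
- PDFs
- Bibliography database
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Initial Markdown chapter materials
- `codex_md/material_readiness.md`
- `codex_md/missing_info.md`
- Material index and list of information still to be supplied
**Hand-off**
- The outputs of Stage 1 feed directly into `thesis-question-list` and `thesis-source-role-mapper`
**Execution focus**
- The focus is "inventorying material assets", not "writing polished sentences first"
- Explicitly mark the experiment details, figure captions, citations, and numeric checks that still need to be supplied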
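### Stage 1.5: Lock the question list and this round's goals
**Goal**
Establish this round's control plane and make clear what this round fixes and what it does not.
**Main inputs**
- Initial Markdown materials
- Review feedback from the previous round
- Current compilation / structure / style issues
**Core skill**
- `thesis-question-list`
**Main outputs**
- `codex_md/question_list.md`
**Hand-off**
- This step is the starting point for all later reconstruction and write-back
- Any structural dispute should first come back here to be re-prioritized
**Execution focus**
- Every issue must state clearly: what it is, why it matters, how to fix it, and what counts as accepted
- The question list is not a memo; it is the entry point for iteration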
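### Stage 2: Map source materials to thesis roles
**Goal**
Map existing materials to chapter roles in the thesis, rather than making a mechanical "paper to chapter" assignment.
**Main inputs**
- Existing papers, PDFs, source manuscripts, figures and tables
- Initial Markdown materials
- `question_list.md`
**Core skill**
- `thesis-source-role-mapper`
**Main outputs**
- `codex_md/chapter_role_map.md`
- Markdown materials grouped by chapter
- Content boundaries for each chapter
**Hand-off**
- This is the direct input to `thesis-chapter-reconstructor`
**Execution focus**
- Distinguish: method core / background support / validation evidence / system implementation / material for limitations and future work
- If this is set wrong, the whole thesis goes off course afterwards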
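### Stage 2.5: Reconstruct chapters around the main line
**Goal**
Transform the original paper-style narrative into a thesis-style narrative.
**Main inputs**
- Chapter role map
- `question_list.md`
- Markdown drafts of each chapter
**Core skill**
- `thesis-chapter-reconstructor`
**Main outputs**
- Reconstructed chapter Markdown
- `codex_md/chapter_rewrite_rules.md`
**Hand-off**
- This is the core stage of the whole Pipeline
- All later alignment, write-back, and compilation build on the reconstruction produced here
**Execution focus**
- Make clear which question each chapter must answer
- Decide which content to strengthen, weaken, move up, or push down
- Rewrite the hand-offs to the preceding and following chapters
- Change the "selling-point narrative" into a "thesis main-line narrative"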
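### Stage 3: Markdown alignment, merging, and unification
**Goal**
Make the whole thesis become "one thesis" at the intermediate layer first, rather than a splice of several papers.
**Main inputs**
- Reconstructed chapter Markdown
- `question_list.md`
**Core skill**
- `thesis-markdown-aligner`
**Main outputs**
- Stable `codex_md/00_thesis_outline.md`
- `codex_md/terminology_glossary.md`
- `codex_md/symbol_metric_table.md`
- Evidence-gap list
**Hand-off**
- While the structure, terminology, or metrics are still unstable, do not enter `tex-writeback`
**Execution focus**
- Unify the research main line, terminology, abbreviations, symbols, metrics, and figure/table conventions
- Decide which content is explained once in Chapter 2, which is developed in Chapters 3-5, and which goes to the appendices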
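### Stage 3.5: Figure and layout rehearsal
**Goal**
Plan figures, tables, and layout before the body text is finalized, so that figures do not become a last-minute patch job.
**Main inputs**
- Stable outline
- Chapter Markdown
- Existing figures, tables, and experiment results
**Core skill**
- `thesis-visual-layout-planner`
**Main outputs**
- `codex_md/figure_plan.md`
- `mermaid/` diagram drafts
- `tmp_layout/`, `tmp_layout2/` trial layout results
**Hand-off**
- The figure planning results directly affect the figure-text structure in `tex-writeback`
**Execution focus**
- Make clear which figures and tables each chapter needs to support the main line
- Sketch first, then do a temporary layout, then decide final placement and captions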
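### Stage 4: Write back to the TeX delivery layer
**Goal**
Write the content already straightened out in the Markdown layer back to `chapters/*.tex` and enter the delivery layer.
**Main inputs**
- Stable Markdown chapters
- Figure plan
- Unified terminology / symbol / metric tables
**Core skill**
- `thesis-tex-writeback`
**Main outputs**
- `chapters/*.tex`
- First complete `main.pdf`
**Hand-off**
- If structural problems surface at this stage, the default is to go back and fix them in the Markdown layer, not to reinvent the structure in the TeX layer
**Execution focus**
- Synchronize figures, equations, cross-references, chapter-opening introductions, and chapter-end summaries
- The TeX layer is responsible for delivery, not for rethinking the structure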
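### Stage 4.5: Synchronize non-body parts
**Goal**
Maintain the abstract, appendices, cover, achievements, acknowledgements, and other non-body parts in parallel, so they are not left to the end and filled in at low quality.
**Main inputs**
- Stable version of the body text
- University template items
- Chinese and English titles and abstract information
**Core skill**
- `thesis-frontmatter-sync`
**Main outputs**
- `abstract/abstract.tex`
- `abstract/abstract-en.tex`
- `preface/...`
- `appendix/...`
- `acknowledgement/...`
- `achievements/...`
**Hand-off**
- The final-draft review in Stage 6 checks these parts together with the body text
**Execution focus**
- Keep the Chinese and English titles and abstracts consistent
- Keep the numbers, metrics, and method names in the abstract consistent with the body text
- Put only supplementary content in the appendices; do not repeat the main line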
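### Stage 5: Literature enhancement and citation verification
**Goal**
Systematically fill in the "sentences that should be supported by citations", and make sure each citation matches its claim.
**Main inputs**
- Current Markdown / TeX version
- `references/*.bib`
- Existing PDFs / reference works
**Core skill**
- `thesis-citation-enhance-review`
**Main outputs**
- Enhanced bibliography database
- Citation reinforcement records
- Citation verification records
**Hand-off**
- Literature enhancement can run in parallel with Stages 3 and 4, but it must be closed out before the final polish
**Execution focus**
- Prioritize sentences that must have citation support: factual descriptions, method taxonomies, classic conclusions, metric definitions, and data sources
- Get citations right first, then add more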
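### Stage 6: Compilation, typesetting, and final-draft review
**Goal**
Form a quality closed loop: not just "it compiles", but "it is fit to submit".
**Main inputs**
- Complete TeX project
- Synchronized non-body parts
- Review checklist
**Core skill**
- `thesis-compile-review`
**Main outputs**
- `output/THESIS_BUILD_REPORT.md`
- `claude_md/review_checklist.md`
- List of closed / still-open issues
**Hand-off**
- The issues found in Stage 6 determine which level to return to for the fix: Stage 1.5 / 2.5 / 3 / 4
**Execution focus**
- A successful compile is only the minimum requirement
- Warnings need tiered handling: blocks submission / must fix / record and monitor
- Also check: data consistency, cite key validity, correctness of template parameters, and consistency of terminology and abbreviations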
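### Stage 7: Final polish and removal of AI-sounding phrasing
**Goal**
Polish the text in Chinese degree-thesis style only after structure, evidence, and data are stable.
**Main inputs**
- A thesis version that has passed compilation and review
- `中文写作要求.md`
- `GPT口癖与高频用词调研.md`
**Core skill**
- `thesis-style-polisher`
**Main outputs**
- A more natural Chinese final draft
- De-templated introductions, chapter summaries, and conclusion and outlook
**Hand-off**
- Stage 7 must come last
- If polishing exposes structure or evidence problems, fall back to an earlier stage rather than polishing over them
**Execution focus**
- Prioritize chapter-opening introductions, chapter-end summaries, contribution statements, and the conclusion and outlook
- Tone down template phrasing, promotional phrasing, and common AI verbal tics
- The goal is "reads more like a degree thesis", not "reads more like an AI-optimized paper"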
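## 7. The four fixed loops of this Pipeline
To keep it from degrading into a "linear SOP" again, the following four loops should be fixed:
### 7.1 Structure loop
`outline -> a chapter's md -> chapter imbalance found -> back to outline / question_list`
### 7.2 Content loop
`paper / Overleaf / PDF -> extract content -> reconstruct narrative -> still reads like the original paper -> reconstruct again`
### 7.3 Typesetting loop
`tex -> compile -> warning / figure placement / cross-reference issues found -> back to tex or figure planning`
### 7.4 Style loop
`body text stable -> AI polishing -> template tone / AI-sounding phrasing found -> back to writing guidelines and wording control`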
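## 8. Execution priorities for this Pipeline
If resources are limited in an actual run, the priority order should be:
1. `thesis-question-list`
2. `thesis-source-role-mapper`
3. `thesis-chapter-reconstructor`
4. `thesis-markdown-aligner`
5. `thesis-tex-writeback`
6. `thesis-compile-review`
In other words:
- What truly cannot be skipped is "question list + material role mapping + chapter reconstruction + compile review"
- What truly should not be done earliest is "polishing and removing AI-sounding phrasing"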
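## 9. Current conclusion
The first principle of this graduate-paper pipeline should be:
> **Reconstruct first, then write; converge in the intermediate layer first, then deliver in TeX; get evidence and structure right first, then style and polish.**
Therefore, if this Pipeline is later made formally executable, the first thing to implement is not "one all-encompassing runner" but making the following skills solid:
- `thesis-workspace-init`
- `thesis-question-list`
- `thesis-source-role-mapper`
- `thesis-chapter-reconstructor`
- `thesis-compile-review`
Only once these five skills are solid does this Chinese graduate-thesis Pipeline have real practical value.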
FILE:pipelines/idea-brainstorm.pipeline.md
---
name: idea-brainstorm
version: 4.0
profile: idea-brainstorm
routing_hints: [idea, ideation, brainstorm, 点子, 选题, 找方向, 找 idea]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.idea-brainstorm.csv
contract_model: pipeline.frontmatter/v1
query_defaults:
  draft_profile: idea_brainstorm
  max_results: 1800
  core_size: 100
  evidence_mode: abstract
  direction_pool_min: 12
  direction_pool_max: 24
  idea_screen_top_n: 10
  idea_shortlist_size: 5
  report_top_n: 3
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- evidence_mode
- direction_pool_min
- direction_pool_max
- idea_screen_top_n
- idea_shortlist_size
- report_top_n
- time_window.from
- time_window.to
quality_contract:
  memo_bundle:
    required_outputs: [output/REPORT.md, output/APPENDIX.md, output/REPORT.json]
  signal_policy:
    min_rows: 10
  direction_policy:
    shortlist_min: 3
    shortlist_max: 5
    keep_min: 3
    cluster_diversity_min: 2
    lead_diversity_target: 3
    lead_diversity_axes: [cluster, direction_type, program_kind]
  screening_policy:
    keep_rank_max: 7
    maybe_rank_max: 12
    score_weights:
      discussion_worthiness: 0.24
      academic_value: 0.22
      evidence_grounding: 0.18
      direction_distinctness: 0.16
      first_probe_clarity: 0.10
      thesis_potential: 0.10
loop_policy:
  stage_retry_budget:
    C1: 2
    C3: 1
    C4: 1
  max_reroutes: 4
  require_human_on_retry_after_approval: true
stages:
  C0:
    title: Init + idea brief
    mode: no_prose
    required_skills: [workspace-init, pipeline-router, idea-brief, human-checkpoint]
    optional_skills: []
    produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/trace/IDEA_BRIEF.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
    human_checkpoint:
      approve: brainstorm brief
      write_to: DECISIONS.md
  C1:
    title: Retrieval + core set
    mode: no_prose
    required_skills: [literature-engineer, dedupe-rank]
    optional_skills: []
    produces: [papers/papers_raw.jsonl, papers/papers_dedup.jsonl, papers/core_set.csv, papers/retrieval_report.md]
  C2:
    title: Idea landscape / focus
    mode: no_prose
    required_skills: [taxonomy-builder, pipeline-router, human-checkpoint]
    optional_skills: []
    produces: [outline/taxonomy.yml, DECISIONS.md]
    human_checkpoint:
      approve: focus clusters / lenses + exclusions
      write_to: DECISIONS.md
  C3:
    title: Evidence signals
    mode: no_prose
    required_skills: [paper-notes, idea-signal-mapper]
    optional_skills: []
    produces: [papers/paper_notes.jsonl, papers/evidence_bank.jsonl, output/trace/IDEA_SIGNAL_TABLE.md, output/trace/IDEA_SIGNAL_TABLE.jsonl]
  C4:
    title: Direction pool + screening
    mode: short_prose_ok
    required_skills: [idea-direction-generator, idea-screener]
    optional_skills: []
    produces: [output/trace/IDEA_DIRECTION_POOL.md, output/trace/IDEA_DIRECTION_POOL.jsonl, output/trace/IDEA_SCREENING_TABLE.md, output/trace/IDEA_SCREENING_TABLE.jsonl]
  C5:
    title: Shortlist + memo synthesis + self-loop
    mode: prose_allowed
    required_skills: [idea-shortlist-curator, idea-memo-writer, deliverable-selfloop, artifact-contract-auditor]
    optional_skills: []
    produces: [output/trace/IDEA_SHORTLIST.md, output/trace/IDEA_SHORTLIST.jsonl, output/REPORT.md, output/APPENDIX.md, output/REPORT.json, output/DELIVERABLE_SELFLOOP_TODO.md, output/CONTRACT_REPORT.md]
---
# Pipeline: research idea brainstorm (signals -> directions -> memo)
Goal: produce a **discussion-ready research-idea memo** for PI / PhD readers.
This pipeline is for “找 research idea / brainstorm / 选题 / 找方向”.
It is **not** for writing a survey draft and **not** for generating execution-grade project specs.
The terminal deliverable is a single default entrypoint:
- `output/REPORT.md`
That memo should feel like a mature brainstorm artifact:
- grounded in literature,
- small enough to discuss,
- clear about what is promising,
- honest about uncertainty,
- and rich enough to guide the next discussion round.
Artifact policy:
- Reader-facing terminal artifacts:
- `output/REPORT.md`
- `output/APPENDIX.md`
- `output/REPORT.json`
- Trace artifacts stay under `output/trace/`.
- The run is only shareable when both layers exist and pass contract audit.
Default profile:
- Retrieval route: `literature-engineer`
- `core_size=100`
- Evidence mode: `abstract`
- Signal table: 10-20 rows
- Direction pool: 12-24
- Shortlist: 3-5
- Final memo lead directions: 3
## Stage 0 - Init + idea brief (C0)
required_skills:
- workspace-init
- pipeline-router
- idea-brief
- human-checkpoint
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
Notes:
- The brief is the single source of truth for topic, audience, constraints, exclusions, targets, and query buckets.
- Default behavior: block once at C0 for human approval of the brief before retrieval.
## Stage 1 - Retrieval + core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
Notes:
- Use multi-query buckets from `queries.md`.
- If recall is too low/noisy, fix it upstream and rerun C1.
## Stage 2 - Idea landscape / focus (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- pipeline-router
- human-checkpoint
produces:
- outline/taxonomy.yml
- DECISIONS.md
human_checkpoint:
- approve: focus clusters / lenses + exclusions
- write_to: DECISIONS.md
Notes:
- Taxonomy is an idea landscape, not a paper outline.
- Default behavior: pause at C2 so the human can choose a few promising focus lenses.
## Stage 3 - Evidence signals (C3) [table-first]
required_skills:
- paper-notes
- idea-signal-mapper
produces:
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
Notes:
- The signal table should capture tensions, missing pieces, and academically meaningful axes rather than proposal-ready wedges.
- Keep this stage compact and table-first; no long prose.
## Stage 4 - Direction pool + screening (C4)
required_skills:
- idea-direction-generator
- idea-screener
produces:
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
Notes:
- Expansion should generate discussion-worthy research directions, not an operator cartesian product.
- Screening should score discussion value, distinctness, evidence grounding, and thesis potential.
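The screening step can be pictured as a weighted sum followed by a rank cut. The sketch below reuses the `score_weights`, `keep_rank_max`, and `maybe_rank_max` values from this pipeline's frontmatter, but it is not the `idea-screener` implementation: the function names, the per-criterion score scale, and the reading of the two rank limits (ranks up to 7 are kept, ranks up to 12 are "maybe", the rest are dropped) are all assumptions.

```python
# Values copied from the pipeline frontmatter.
SCORE_WEIGHTS = {
    "discussion_worthiness": 0.24,
    "academic_value": 0.22,
    "evidence_grounding": 0.18,
    "direction_distinctness": 0.16,
    "first_probe_clarity": 0.10,
    "thesis_potential": 0.10,
}
KEEP_RANK_MAX = 7
MAYBE_RANK_MAX = 12


def weighted_score(scores: dict[str, float]) -> float:
    """Weighted sum over the six criteria; a missing criterion counts as 0."""
    return sum(weight * scores.get(name, 0.0) for name, weight in SCORE_WEIGHTS.items())


def screen(directions: dict[str, dict[str, float]]) -> dict[str, str]:
    """Rank directions by weighted score and label them keep / maybe / drop."""
    ranked = sorted(directions, key=lambda name: weighted_score(directions[name]), reverse=True)
    verdicts: dict[str, str] = {}
    for rank, name in enumerate(ranked, start=1):
        if rank <= KEEP_RANK_MAX:
            verdicts[name] = "keep"
        elif rank <= MAYBE_RANK_MAX:
            verdicts[name] = "maybe"
        else:
            verdicts[name] = "drop"
    return verdicts
```

Because the weights sum to 1.0, a direction scored uniformly at some value keeps that value as its weighted score, which makes the screening table easier to read.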
## Stage 5 - Shortlist + memo synthesis + self-loop (C5)
required_skills:
- idea-shortlist-curator
- idea-memo-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- `output/trace/IDEA_SHORTLIST.md` is an internal convergence layer, not the user-facing final answer.
- `output/REPORT.md` is the terminal deliverable and should read like a discussion-ready research idea brainstorm memo.
- `deliverable-selfloop` should evaluate the full memo bundle plus its trace chain.
FILE:pipelines/lit-snapshot.pipeline.md
---
name: lit-snapshot
version: 2.3
profile: lit-snapshot
routing_hints: [snapshot, 快照, one-page, one page, 48h]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- outline/taxonomy.yml
- outline/outline.yml
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.lit-snapshot.csv
---
# Pipeline: literature snapshot (24-48h) [bullets-first]
Goal: a compact, reader-facing snapshot (`output/SNAPSHOT.md`) with a small but usable paper set and a paper-like structure. This is intentionally lighter than `arxiv-survey*` (no evidence packs, no BibTeX, no LaTeX).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- arxiv-search
- dedupe-rank
optional_skills:
- keyword-expansion
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Snapshot default: aim for a smaller but diverse set (e.g., core_set ~20-40) rather than a survey-scale pool.
- Practical retrieval loop (multi-query, then refine):
- Start broad with multiple query buckets (synonyms/acronyms/subtopics) and merge the results.
- If too few: add buckets + widen time window; if too noisy: rewrite keywords + add exclusions; rerun C1.
- If the snapshot feels generic, the fix is almost always upstream: broaden `queries.md` (more synonyms, fewer excludes, wider time window) and rerun C1.
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- outline-builder
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/outline.yml
human_checkpoint:
- approve: scope + outline (snapshot ToC)
- write_to: DECISIONS.md
Notes:
- Paper-like default: prefer fewer, thicker sections. For snapshots, avoid H3 explosion; treat H2 as the main organizing unit.
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent nodes and keep the ToC readable before writing.
## Stage 3 - Snapshot (C3) [SHORT PROSE OK]
required_skills:
- snapshot-writer
- deliverable-selfloop
- artifact-contract-auditor
optional_skills:
- prose-writer
produces:
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- The deliverable is bullets-first and pointer-heavy: every non-trivial claim should attach 1-2 concrete paper pointers from `papers/core_set.csv`.
- Avoid outline narration (e.g., `This section surveys ...`); write content claims + why-it-matters + pointers.
FILE:pipelines/peer-review.pipeline.md
---
name: peer-review
version: 2.3
profile: peer-review
routing_hints: [peer review, review report, referee, 审稿]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PAPER.md
- output/CLAIMS.md
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.peer-review.csv
---
# Pipeline: peer review / referee report
Goal: produce an actionable referee report (`output/REVIEW.md`) that is traceable to the submitted paper’s claims and evidence (no free-floating advice, no invented comparisons).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
## Stage 1 - Claims (C1)
required_skills:
- manuscript-ingest
- claims-extractor
produces:
- output/PAPER.md
- output/CLAIMS.md
Notes:
- Input expectation: you must have the manuscript text available as `output/PAPER.md` (or equivalent extracted text) before running this stage.
- Contract: every claim must include a source pointer (section/page/quote) so later critique is auditable.
## Stage 2 - Evidence audit (C2)
required_skills:
- evidence-auditor
- novelty-matrix
produces:
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
Notes:
- Evidence-auditor should write gaps/risks/verification steps, not “fix the paper”.
- Novelty-matrix should be conservative: compare against cited/known related work; if you lack sources, record that limitation explicitly.
## Stage 3 - Rubric write-up (C3)
required_skills:
- rubric-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Prefer concrete, minimal fixes (what experiment/ablation/analysis would resolve the concern) over generic “needs more experiments”.
FILE:pipelines/systematic-review.pipeline.md
---
name: systematic-review
version: 2.3
profile: systematic-review
routing_hints: [systematic review, systematic, prisma, 系统综述]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PROTOCOL.md
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/screening_log.csv
- papers/extraction_table.csv
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.systematic-review.csv
---
# Pipeline: systematic review (PRISMA-style)
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Protocol (C1)
required_skills:
- protocol-writer
produces:
- output/PROTOCOL.md
human_checkpoint:
- approve: protocol locked (query, inclusion/exclusion, databases, time window)
- write_to: DECISIONS.md
## Stage 2 - Retrieval & candidate pool (C2)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- arxiv-search
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Systematic-review contract: keep the candidate pool auditable. Prefer dedupe (good) over aggressive ranking/filtering (dangerous).
- `dedupe-rank` default: for this pipeline, it keeps the full deduped pool in `papers/core_set.csv` unless you explicitly set `queries.md:core_size` (screening should not silently drop papers).
## Stage 3 - Screening (C3)
required_skills:
- screening-manager
produces:
- papers/screening_log.csv
Notes:
- Use `papers/papers_dedup.jsonl` (or `papers/core_set.csv`) as the candidate list; `papers/screening_log.csv` must include protocol-grounded reasons for every decision.
- Practical auditability: number protocol clauses (`I1..` / `E1..`) and require screening rows to cite them (e.g., `reason_codes=E3`).
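The clause-citation rule above lends itself to a mechanical check. The following is a minimal sketch under stated assumptions: only the `reason_codes` column and the `I1..` / `E1..` code shape come from the notes above, while the other column names, the `;` or whitespace separator between multiple codes, and the function name are hypothetical.

```python
import csv
import io
import re

# A protocol clause code: I (inclusion) or E (exclusion) followed by a number.
CLAUSE = re.compile(r"^[IE]\d+$")


def rows_missing_reason_codes(csv_text: str) -> list[int]:
    """Return 1-based data-row numbers whose reason_codes are empty or malformed."""
    bad: list[int] = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_no, row in enumerate(reader, start=1):
        raw = row.get("reason_codes") or ""
        # Assumption: several codes in one cell are separated by ';' or spaces.
        codes = [code for code in re.split(r"[;\s]+", raw.strip()) if code]
        if not codes or not all(CLAUSE.match(code) for code in codes):
            bad.append(row_no)
    return bad


if __name__ == "__main__":
    sample = (
        "paper_id,decision,reason_codes\n"
        "p1,exclude,E3\n"
        "p2,include,I1;I2\n"
        "p3,exclude,\n"
    )
    print(rows_missing_reason_codes(sample))
```

This only checks the shape of the codes; whether a cited clause actually exists in `output/PROTOCOL.md` would need a separate lookup.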
## Stage 4 - Extraction (C4)
required_skills:
- extraction-form
- bias-assessor
produces:
- papers/extraction_table.csv
Notes:
- Extraction is schema-driven: do not write narrative synthesis here; keep all fields consistent and fill bias fields (low/unclear/high + notes).
## Stage 5 - Synthesis (C5) [PROSE ALLOWED]
required_skills:
- synthesis-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
FILE:pipelines/tutorial.pipeline.md
---
name: tutorial
version: 2.3
profile: tutorial
routing_hints: [tutorial, 教程]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/TUTORIAL_SPEC.md
- outline/concept_graph.yml
- outline/module_plan.yml
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.tutorial.csv
---
# Pipeline: tutorial (teaching loop)
Goal: a tutorial deliverable (`output/TUTORIAL.md`) that has a consistent running example and a real teaching loop (objectives -> steps -> exercises -> verification), not just a blog-style explanation.
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Spec (C1)
required_skills:
- tutorial-spec
produces:
- output/TUTORIAL_SPEC.md
Notes:
- The spec is the scope contract: audience, prerequisites, learning objectives, and a running example (if applicable).
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- concept-graph
- module-planner
- exercise-builder
produces:
- outline/concept_graph.yml
- outline/module_plan.yml
human_checkpoint:
- approve: target audience + scope + running example
- write_to: DECISIONS.md
Notes:
- Treat `outline/module_plan.yml` as the execution contract for writing: modules must have concrete outputs and at least one verifiable exercise each.
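The contract stated above (concrete outputs plus at least one verifiable exercise per module) can be checked before writing starts. This is only a sketch: the schema of `outline/module_plan.yml` is not shown here, so the field names `id`, `outputs`, `exercises`, and `verification` are hypothetical, and the function works on already-parsed data rather than reading the YAML file itself.

```python
def module_plan_problems(modules: list[dict]) -> list[str]:
    """List contract violations for a parsed module plan (hypothetical schema)."""
    problems: list[str] = []
    for index, module in enumerate(modules, start=1):
        name = module.get("id") or f"module #{index}"
        # A module needs at least one concrete output.
        if not module.get("outputs"):
            problems.append(f"{name}: no concrete outputs")
        # A module needs at least one exercise that says how to verify it.
        exercises = module.get("exercises") or []
        verifiable = [
            ex for ex in exercises
            if isinstance(ex, dict) and ex.get("verification")
        ]
        if not verifiable:
            problems.append(f"{name}: no verifiable exercise")
    return problems


if __name__ == "__main__":
    plan = [
        {"id": "M1", "outputs": ["a working tokenizer"],
         "exercises": [{"prompt": "tokenize a sentence", "verification": "compare with expected tokens"}]},
        {"id": "M2", "outputs": [], "exercises": [{"prompt": "read the docs"}]},
    ]
    for problem in module_plan_problems(plan):
        print(problem)
```

An empty result means every module meets the two conditions under this assumed schema; it does not judge whether the exercises are good ones.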
## Stage 3 - Writing (C3) [PROSE ALLOWED]
required_skills:
- tutorial-module-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Write only within the approved scope; if you discover missing prerequisites, record them as a follow-up instead of expanding scope silently.
FILE:scripts/run.py
from __future__ import annotations
import argparse
import hashlib
import json
import re
import sys
from pathlib import Path
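# Collect citation keys from every [@...] block, in first-seen order and without duplicates.
def _extract_cites(text: str) -> list[str]:
    keys: list[str] = []
    for m in re.finditer(r"\[@([^\]]+)\]", text or ""):
        inside = (m.group(1) or "").strip()
        for k in re.findall(r"[A-Za-z0-9:_-]+", inside):
            if k and k not in keys:
                keys.append(k)
    return keys
# Lowercase a title and collapse every run of non-alphanumeric characters into one space.
def _norm_title(text: str) -> str:
    return re.sub(r'[^a-z0-9]+', ' ', str(text or '').lower()).strip()
# Strip a leading section number (e.g. "3.2 ") and outer punctuation, collapse whitespace, and lowercase.
def _surface_focus(text: str) -> str:
    cleaned = re.sub(r'^\s*\d+(?:\.\d+)*\s+', '', str(text or '').strip())
    cleaned = re.sub(r'\s+', ' ', cleaned).strip(' .,:;')
    return cleaned.lower()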
# Parse the budget report: global target, gap, and suggested keys per H3,
# indexed both by subsection id and by normalized title.
def _parse_budget_report(md: str) -> tuple[int, int, dict[str, list[str]], dict[str, list[str]]]:
    target = 0
    gap = 0
    suggestions: dict[str, list[str]] = {}
    suggestions_by_title: dict[str, list[str]] = {}
    m = re.search(r"(?im)^\s*-\s*Global\s+(?:target|hard\s+minimum).*>=\s*(\d+)\b", md or "")
    if m:
        target = int(m.group(1))
    m = re.search(r"(?im)^\s*-\s*Gap(?:\s*\([^)]*\))?\s*:\s*(\d+)\b", md or "")
    if m:
        gap = int(m.group(1))
    for raw in (md or "").splitlines():
        line = raw.strip()
        # Only table body rows are of interest; skip prose, the separator row, and the header row.
        if not line.startswith("|") or line.startswith("|---") or "| suggested keys" in line.lower():
            continue
        cols = [c.strip() for c in line.strip("|").split("|")]
        if len(cols) < 6:
            continue
        sub_id = cols[0].strip()
        title = cols[1].strip() if len(cols) > 1 else ''
        sug = cols[5].strip()
        # Suggested keys are the backtick-quoted tokens in the sixth column.
        keys = [k.strip() for k in re.findall(r"`([^`]+)`", sug) if k.strip()]
        if sub_id and keys:
            suggestions[sub_id] = keys
        if title and keys:
            suggestions_by_title[_norm_title(title)] = keys
    return target, gap, suggestions, suggestions_by_title
# Split the draft into (title, lines) blocks: one block per H3, and blocks with
# title None for H2 headings and any text that sits outside an H3.
def _split_h3(md: str) -> list[tuple[str | None, list[str]]]:
    blocks: list[tuple[str | None, list[str]]] = []
    current_title: str | None = None
    current_lines: list[str] = []
    for raw in (md or '').splitlines():
        if raw.startswith('### '):
            # A new H3 closes the previous H3 block, if any.
            if current_title is not None:
                blocks.append((current_title, current_lines))
            current_title = raw[4:].strip()
            current_lines = [raw]
            continue
        if raw.startswith('## '):
            # An H2 closes the open H3 block and starts an untitled block.
            if current_title is not None:
                blocks.append((current_title, current_lines))
            current_title = None
            current_lines = []
            blocks.append((None, [raw]))
            continue
        if current_title is not None:
            current_lines.append(raw)
        else:
            if not blocks:
                blocks.append((None, [raw]))
            else:
                blocks[-1][1].append(raw)
    if current_title is not None:
        blocks.append((current_title, current_lines))
    return blocks
# Return the leading section number of a title (e.g. "3.2"), or "" when there is none.
def _section_id_from_title(title: str) -> str:
    m = re.match(r'^(\d+(?:\.\d+)?)\b', str(title or '').strip())
    return m.group(1) if m else ''
# Drop empty and repeated items while preserving order.
def _uniq(items: list[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        v = str(item or '').strip()
        if not v or v in seen:
            continue
        seen.add(v)
        out.append(v)
    return out
# Deterministically pick one option by hashing the seed (SHA-1), so reruns give the same choice.
def _pick_variant(*, seed: str, options: list[str]) -> str:
    if not options:
        return ""
    h = int(hashlib.sha1(str(seed or "").encode("utf-8", errors="ignore")).hexdigest()[:8], 16)
    return options[h % len(options)]
# Join de-duplicated keys into prose such as "[@a], [@b], and [@c]".
def _format_cite_text(keys: list[str]) -> str:
    phrases = [f'[@{k}]' for k in _uniq(keys)]
    if not phrases:
        return ''
    if len(phrases) == 1:
        return phrases[0]
    if len(phrases) == 2:
        return f'{phrases[0]} and {phrases[1]}'
    return ', '.join(phrases[:-1]) + f', and {phrases[-1]}'
# Load per-subsection context (contrast hook, axes, cluster labels) from
# outline/writer_context_packs.jsonl, keyed by normalized title; malformed lines are skipped.
def _load_writer_pack_context(workspace: Path) -> dict[str, dict[str, object]]:
    contexts: dict[str, dict[str, object]] = {}
    packs_path = workspace / "outline" / "writer_context_packs.jsonl"
    if not packs_path.exists() or packs_path.stat().st_size <= 0:
        return contexts
    for raw in packs_path.read_text(encoding='utf-8', errors='ignore').splitlines():
        raw = raw.strip()
        if not raw:
            continue
        try:
            rec = json.loads(raw)
        except Exception:
            continue
        if not isinstance(rec, dict):
            continue
        title = _norm_title(rec.get('title') or '')
        if not title:
            continue
        clusters: list[str] = []
        for item in (rec.get('clusters') or []):
            if not isinstance(item, dict):
                continue
            label = re.sub(r'\s+', ' ', str(item.get('label') or '').strip()).strip(' .,:;')
            if label and label not in clusters:
                clusters.append(label)
        axes = [
            re.sub(r'\s+', ' ', str(axis or '').strip()).strip(' .,:;')
            for axis in (rec.get('axes') or [])
            if str(axis or '').strip()
        ]
        contexts[title] = {
            'contrast_hook': re.sub(r'\s+', ' ', str(rec.get('contrast_hook') or '').strip()).strip(' .,:;'),
            'axes': axes,
            'clusters': clusters,
        }
    return contexts
def _in_scope_sentences(keys: list[str], *, title: str = "", context: dict[str, object] | None = None) -> list[str]:
keys = _uniq(keys)
if not keys:
return []
focus = _surface_focus(title) or 'the same question'
ctx = context or {}
axes = _uniq([str(x or '').strip() for x in (ctx.get('axes') or []) if str(x or '').strip()])
hint = axes[0] if axes else re.sub(r'\s+', ' ', str(ctx.get('contrast_hook') or '').strip()).strip(' .,:;')
clusters = _uniq([str(x or '').strip() for x in (ctx.get('clusters') or []) if str(x or '').strip()])
templates: list[str] = []
if len(clusters) >= 2:
templates.extend([
"Across {focus}, adjacent studies extend both {cluster_a} and {cluster_b} {cite_text}.",
"Comparable cases in {focus} appear on both sides of the comparison {cite_text}.",
])
if hint:
templates.extend([
"Additional work in {focus} keeps returning to {hint} as a comparison pressure {cite_text}.",
"Related studies in {focus} also sharpen the record around {hint} {cite_text}.",
])
templates.extend([
"Additional studies in {focus} help bound the comparison {cite_text}.",
"Neighboring work in {focus} broadens the evidence base {cite_text}.",
])
out: list[str] = []
for idx in range(0, len(keys), 6):
chunk = keys[idx: idx + 6]
cite_text = _format_cite_text(chunk)
template = _pick_variant(seed=f"{focus}:{idx}:{'|'.join(chunk)}", options=templates)
out.append(
template.format(
focus=focus,
hint=hint,
cluster_a=clusters[0] if clusters else 'one line of work',
cluster_b=clusters[1] if len(clusters) > 1 else 'another line of work',
cite_text=cite_text,
)
)
return out
def _global_top_up_sentences(keys: list[str], *, title: str = "the broader survey") -> list[str]:
    """Build background sentences that cite leftover keys, at most eight keys per sentence."""
    keys = _uniq(keys)
    if not keys:
        return []
    templates = [
        "Appendix-level background references also include {tail}.",
        "Broader survey context is also visible in {tail}.",
        "Supplementary cross-chapter references include {tail}.",
    ]
    out: list[str] = []
    for idx in range(0, len(keys), 8):
        chunk = keys[idx: idx + 8]
        tail = _format_cite_text(chunk)
        # `title` only seeds the deterministic template choice; no template interpolates it.
        template = _pick_variant(seed=f"global:{title}:{idx}:{'|'.join(chunk)}", options=templates)
        out.append(template.format(tail=tail))
    return out
def _inject_sentences_into_block(block: str, sentences: list[str]) -> str:
    """Insert the sentences as one new paragraph after the first body paragraph of an H3 block."""
    sentence_block = ' '.join(re.sub(r'\s+', ' ', s.strip()) for s in sentences if str(s or '').strip()).strip()
    if not sentence_block:
        return block
    lines = block.splitlines()
    if not lines or not lines[0].startswith('### '):
        # Not an H3 block: append the sentences to the end of the text instead.
        return (block.rstrip() + ' ' + sentence_block).strip()
    heading = lines[0].rstrip()
    body = '\n'.join(lines[1:]).strip()
    if not body:
        return heading + '\n\n' + sentence_block
    paragraphs = [part.strip() for part in re.split(r'\n\s*\n', body) if part.strip()]
    if not paragraphs:
        return heading + '\n\n' + sentence_block
    # With two or more paragraphs, insert after the first; otherwise append after the only one.
    insert_at = 1 if len(paragraphs) >= 2 else len(paragraphs)
    paragraphs.insert(insert_at, sentence_block)
    return heading + '\n\n' + '\n\n'.join(paragraphs)
def _load_writer_pack_pool(workspace: Path) -> list[str]:
pool: list[str] = []
packs_path = workspace / "outline" / "writer_context_packs.jsonl"
if not packs_path.exists() or packs_path.stat().st_size <= 0:
return pool
for raw in packs_path.read_text(encoding='utf-8', errors='ignore').splitlines():
raw = raw.strip()
if not raw:
continue
try:
rec = json.loads(raw)
except Exception:
continue
if not isinstance(rec, dict):
continue
for field in ("allowed_bibkeys_selected", "allowed_bibkeys_mapped", "allowed_bibkeys_chapter", "allowed_bibkeys_global"):
for key in (rec.get(field) or []):
cleaned = str(key or '').strip()
if cleaned and cleaned not in pool:
pool.append(cleaned)
return pool
def main() -> int:
parser = argparse.ArgumentParser()
parser.add_argument('--workspace', required=True)
parser.add_argument('--unit-id', default='')
parser.add_argument('--inputs', default='')
parser.add_argument('--outputs', default='')
parser.add_argument('--checkpoint', default='')
args = parser.parse_args()
repo_root = Path(__file__).resolve()
for _ in range(10):
if (repo_root / "AGENTS.md").exists():
break
parent = repo_root.parent
if parent == repo_root:
break
repo_root = parent
sys.path.insert(0, str(repo_root))
from tooling.common import atomic_write_text, ensure_dir, parse_semicolon_list
from tooling.quality_gate import QualityIssue, _draft_profile, check_unit_outputs, write_quality_report
workspace = Path(args.workspace).resolve()
unit_id = str(args.unit_id or 'U1045').strip() or 'U1045'
inputs = parse_semicolon_list(args.inputs) or ['output/DRAFT.md', 'output/CITATION_BUDGET_REPORT.md']
outputs = parse_semicolon_list(args.outputs) or ['output/DRAFT.md', 'output/CITATION_INJECTION_REPORT.md']
draft_rel = next((p for p in outputs if p.endswith('DRAFT.md')), 'output/DRAFT.md')
report_rel = next((p for p in outputs if p.endswith('CITATION_INJECTION_REPORT.md')), 'output/CITATION_INJECTION_REPORT.md')
budget_rel = next((p for p in inputs if p.endswith('CITATION_BUDGET_REPORT.md')), 'output/CITATION_BUDGET_REPORT.md')
draft_path = workspace / draft_rel
report_path = workspace / report_rel
budget_path = workspace / budget_rel
ensure_dir(report_path.parent)
missing_inputs: list[str] = []
if not draft_path.exists() or draft_path.stat().st_size <= 0:
missing_inputs.append(draft_rel)
if not budget_path.exists() or budget_path.stat().st_size <= 0:
missing_inputs.append(budget_rel)
if missing_inputs:
msg = 'Missing inputs: ' + ', '.join(missing_inputs)
atomic_write_text(report_path, '# Citation injection report\n\n- Status: FAIL\n- Reason: ' + msg + '\n')
write_quality_report(workspace=workspace, unit_id=unit_id, skill='citation-injector', issues=[QualityIssue(code='missing_inputs', message=msg)])
return 2
draft = draft_path.read_text(encoding='utf-8', errors='ignore')
budget = budget_path.read_text(encoding='utf-8', errors='ignore')
target, gap_from_budget, suggestions, suggestions_by_title = _parse_budget_report(budget)
draft_profile = _draft_profile(workspace)
local_floor = 0
pack_context = _load_writer_pack_context(workspace)
current_unique = len(set(_extract_cites(draft)))
modified_blocks = 0
if target > 0 and suggestions:
new_parts: list[str] = []
seen_after = set(_extract_cites(draft))
for title, lines in _split_h3(draft):
block = '\n'.join(lines).rstrip()
if title is None:
new_parts.append(block)
continue
sid = _section_id_from_title(title)
suggested = suggestions.get(sid) or suggestions_by_title.get(_norm_title(title)) or []
if suggested:
existing = set(_extract_cites(block))
local_need = max(0, int(local_floor) - len(existing))
needed_budget = max(0, int(target) - len(seen_after))
desired = needed_budget if needed_budget > 0 else local_need
if desired <= 0:
new_parts.append(block)
continue
preferred_new = [k for k in suggested if k not in existing and k not in seen_after]
reusable_local = [k for k in suggested if k not in existing and k in seen_after]
needed = preferred_new[:desired]
if len(needed) < desired:
needed.extend(reusable_local[: max(0, desired - len(needed))])
if needed:
inserted = needed[:4]
sentences = _in_scope_sentences(inserted, title=title, context=pack_context.get(_norm_title(title)))
if sentences:
block = _inject_sentences_into_block(block, sentences)
modified_blocks += 1
seen_after.update(inserted)
new_parts.append(block)
draft = '\n'.join(part for part in new_parts if part is not None).rstrip() + '\n'
atomic_write_text(draft_path, draft)
unique_after_h3 = len(set(_extract_cites(draft)))
if target > 0 and unique_after_h3 < target:
pool: list[str] = []
for keys in suggestions.values():
for key in keys:
cleaned = str(key or "").strip()
if cleaned and cleaned not in pool:
pool.append(cleaned)
for key in _load_writer_pack_pool(workspace):
if key not in pool:
pool.append(key)
remaining = [k for k in pool if k not in set(_extract_cites(draft))]
if remaining:
redistributed: list[str] = []
seen_after = set(_extract_cites(draft))
new_parts = []
for title, lines in _split_h3(draft):
block = '\n'.join(lines).rstrip()
if title is None:
new_parts.append(block)
continue
if not remaining or len(seen_after) >= int(target):
new_parts.append(block)
continue
take = min(3, max(1, int(target) - len(seen_after)), len(remaining))
chunk = remaining[:take]
remaining = remaining[take:]
sentences = _in_scope_sentences(chunk, title=title, context=pack_context.get(_norm_title(title)))
if sentences:
block = _inject_sentences_into_block(block, sentences)
redistributed.extend(chunk)
seen_after.update(chunk)
modified_blocks += 1
new_parts.append(block)
if redistributed:
draft = '\n'.join(part for part in new_parts if part is not None).rstrip() + '\n'
atomic_write_text(draft_path, draft)
unique_after_h3 = len(set(_extract_cites(draft)))
if target > 0 and unique_after_h3 < target:
remaining = [k for k in pool if k not in set(_extract_cites(draft))]
needed = max(0, int(target) - int(unique_after_h3))
global_lines = _global_top_up_sentences(remaining[: max(needed, 4)], title="the broader survey")
if global_lines:
appendix_markers = ['\n**Appendix', '\n## Appendix']
insert_at = len(draft)
found_appendix = False
for marker in appendix_markers:
idx = draft.find(marker)
if idx != -1:
line_end = draft.find('\n', idx + 1)
insert_at = min(insert_at, (line_end + 1) if line_end != -1 else len(draft))
found_appendix = True
extra = '\n\n'.join(global_lines)
if insert_at < len(draft):
draft = draft[:insert_at].rstrip() + '\n\n' + extra + '\n\n' + draft[insert_at:].lstrip()
else:
draft = draft.rstrip() + '\n\n' + extra + '\n'
atomic_write_text(draft_path, draft)
unique = len(set(_extract_cites(draft)))
gap_current = max(0, int(target) - int(unique)) if target > 0 else 0
status = 'PASS' if (target <= 0 or unique >= target) else 'FAIL'
lines = [
'# Citation injection report',
'',
f'- Status: {status}',
f'- Draft: `{draft_rel}`',
f'- Budget: `{budget_rel}`',
f'- Unique citations (current): {unique}',
f'- Global target (from budget): {target}',
f'- Gap (current, to target): {gap_current}',
f'- Gap (from budget, at report time): {gap_from_budget}',
f'- H3 blocks modified: {modified_blocks}',
'',
]
if status == 'PASS':
summary = '- Citation target satisfied after in-scope injection.'
lines.extend(['## Summary', '', summary, ''])
else:
lines.extend(['## Summary', '', '- Citation target is still unmet after automatic in-scope injection.', ''])
if suggestions:
lines.extend(['## Remaining suggestions', ''])
for sid, keys in list(suggestions.items())[:10]:
lines.append(f"- `{sid}`: " + ', '.join(f'`{k}`' for k in keys[:8]))
lines.append('')
atomic_write_text(report_path, '\n'.join(lines).rstrip() + '\n')
issues = check_unit_outputs(skill='citation-injector', workspace=workspace, outputs=[report_rel])
if issues:
write_quality_report(workspace=workspace, unit_id=unit_id, skill='citation-injector', issues=issues)
return 2
return 0
if __name__ == '__main__':
raise SystemExit(main())
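
The injection step above can be illustrated in isolation. The following is a minimal sketch, not the script's own code: `format_cite_text` and `inject_after_first_paragraph` are simplified, hypothetical stand-ins for `_format_cite_text` and `_inject_sentences_into_block`, the citation keys and the sample block are made up, and the non-H3 fallback branch is omitted.

```python
import re


def format_cite_text(keys: list[str]) -> str:
    """Join unique keys as `[@key]` cites with commas and a final 'and'."""
    seen: set[str] = set()
    phrases: list[str] = []
    for key in keys:
        key = key.strip()
        if key and key not in seen:
            seen.add(key)
            phrases.append(f"[@{key}]")
    if len(phrases) <= 1:
        return "".join(phrases)
    if len(phrases) == 2:
        return f"{phrases[0]} and {phrases[1]}"
    return ", ".join(phrases[:-1]) + f", and {phrases[-1]}"


def inject_after_first_paragraph(block: str, sentence: str) -> str:
    """Insert `sentence` as its own paragraph after the first body paragraph."""
    heading, _, body = block.partition("\n")
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    # Two or more paragraphs: slot in after the first; otherwise append.
    insert_at = 1 if len(paragraphs) >= 2 else len(paragraphs)
    paragraphs.insert(insert_at, sentence)
    return heading.rstrip() + "\n\n" + "\n\n".join(paragraphs)


# Hypothetical H3 block and citation keys, for illustration only.
demo_block = "### 3.1 Tool use\n\nFirst paragraph.\n\nSecond paragraph."
demo_cites = format_cite_text(["Smith2023", "Lee2024", "Smith2023"])
demo_sentence = f"Neighboring work broadens the evidence base {demo_cites}."
demo_result = inject_after_first_paragraph(demo_block, demo_sentence)
print(demo_result)
```

Placing the new sentence after the first paragraph keeps the subsection's opening and closing paragraphs intact, which appears to be the intent of the `insert_at` rule in the script.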
FILE:tooling/__init__.py
FILE:tooling/common.py
from __future__ import annotations
import csv
import json
import os
import re
import shutil
import tempfile
from dataclasses import dataclass
from datetime import date, datetime
from pathlib import Path
from typing import Any, Iterable
import yaml
def today_iso() -> str:
    return date.today().isoformat()


def now_iso_seconds() -> str:
    return datetime.now().replace(microsecond=0).isoformat()


def ensure_dir(path: Path) -> None:
    path.mkdir(parents=True, exist_ok=True)


def atomic_write_text(path: Path, content: str) -> None:
    ensure_dir(path.parent)
    fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
        os.replace(tmp_path, path)
    except Exception:
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
def backup_existing(path: Path) -> Path:
    """Rename an existing file to a timestamped `.bak.*` sibling and return the backup path.

    If `path` does not exist, nothing is renamed and `path` itself is returned.
    """
    if not path.exists():
        return path
    stamp = datetime.now().replace(microsecond=0).isoformat().replace("-", "").replace(":", "")
    backup = path.with_name(f"{path.name}.bak.{stamp}")
    counter = 1
    while backup.exists():
        backup = path.with_name(f"{path.name}.bak.{stamp}.{counter}")
        counter += 1
    path.replace(backup)
    return backup
def parse_semicolon_list(value: str | None) -> list[str]:
if not value:
return []
return [item.strip() for item in value.split(";") if item.strip()]
def normalize_title_for_dedupe(title: str) -> str:
title = title.lower()
title = re.sub(r"[^a-z0-9]+", " ", title)
title = re.sub(r"\s+", " ", title).strip()
return title
def normalize_axis_label(text: str) -> str:
text = re.sub(r"\s+", " ", (text or "").strip().lower())
text = text.rstrip(" .;:,;。")
text = re.sub(r"\s*/\s*", " ", text)
text = re.sub(r"[^a-z0-9]+", " ", text)
return text.strip()
def subsection_brief_generic_axis_norms() -> set[str]:
"""Axis labels that strict survey gates treat as scaffold-level defaults.
Keep this aligned across brief generation and quality-gate checks so the
generator does not promote an axis that the gate later classifies as generic.
"""
axes = {
"core mechanism and system architecture",
"training and data setup",
"evaluation protocol",
"evaluation protocol (benchmarks / metrics / human)",
"evaluation protocol (datasets / metrics / human)",
"evaluation protocol (datasets, metrics, human evaluation)",
"compute and efficiency",
"compute and latency constraints",
"efficiency and compute",
"tool interface contract (schemas / protocols)",
"tool selection / routing policy",
"sandboxing / permissions / observability",
"failure modes and limitations",
}
return {normalize_axis_label(axis) for axis in axes}
def read_jsonl(path: Path) -> list[dict[str, Any]]:
records: list[dict[str, Any]] = []
if not path.exists():
return records
with path.open("r", encoding="utf-8") as handle:
for line in handle:
line = line.strip()
if not line:
continue
records.append(json.loads(line))
return records
def write_jsonl(path: Path, records: Iterable[dict[str, Any]]) -> None:
ensure_dir(path.parent)
lines = [json.dumps(record, ensure_ascii=False) for record in records]
atomic_write_text(path, "\n".join(lines) + ("\n" if lines else ""))
def read_tsv(path: Path) -> list[dict[str, str]]:
if not path.exists():
return []
with path.open("r", encoding="utf-8", newline="") as handle:
reader = csv.DictReader(handle, delimiter="\t")
return [dict(row) for row in reader]
def write_tsv(path: Path, rows: list[dict[str, Any]], fieldnames: list[str]) -> None:
ensure_dir(path.parent)
fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
try:
with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
writer.writeheader()
for row in rows:
writer.writerow({key: row.get(key, "") for key in fieldnames})
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except OSError:
pass
raise
@dataclass(frozen=True)
class UnitsTable:
fieldnames: list[str]
rows: list[dict[str, str]]
@staticmethod
def load(path: Path) -> "UnitsTable":
with path.open("r", encoding="utf-8", newline="") as handle:
reader = csv.DictReader(handle)
fieldnames = list(reader.fieldnames or [])
rows = [dict(row) for row in reader]
return UnitsTable(fieldnames=fieldnames, rows=rows)
def save(self, path: Path) -> None:
ensure_dir(path.parent)
fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
try:
with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
writer = csv.DictWriter(handle, fieldnames=self.fieldnames)
writer.writeheader()
for row in self.rows:
writer.writerow({key: row.get(key, "") for key in self.fieldnames})
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except OSError:
pass
raise
def load_yaml(path: Path) -> Any:
with path.open("r", encoding="utf-8") as handle:
return yaml.safe_load(handle)
def dump_yaml(path: Path, data: Any) -> None:
ensure_dir(path.parent)
text = yaml.safe_dump(
data,
allow_unicode=True,
sort_keys=False,
default_flow_style=False,
width=120,
)
atomic_write_text(path, text)
def copy_tree(src_dir: Path, dst_dir: Path, *, overwrite: bool) -> None:
if not src_dir.is_dir():
raise ValueError(f"Template directory not found: {src_dir}")
ensure_dir(dst_dir)
for src_path in src_dir.rglob("*"):
rel = src_path.relative_to(src_dir)
dst_path = dst_dir / rel
if src_path.is_dir():
ensure_dir(dst_path)
continue
ensure_dir(dst_path.parent)
if dst_path.exists() and not overwrite:
continue
shutil.copy2(src_path, dst_path)
def tokenize(text: str) -> list[str]:
text = text.lower()
text = re.sub(r"[^a-z0-9]+", " ", text)
return [token for token in text.split() if token]
_EN_STOPWORDS = {
    "a",
    "an",
    "and",
    "are",
    "as",
    "at",
    "be",
    "by",
    "for",
    "from",
    "in",
    "is",
    "it",
    "of",
    "on",
    "or",
    "that",
    "the",
    "this",
    "to",
    "with",
    "we",
    "our",
    "via",
    "towards",
    "toward",
    "using",
    "use",
    "based",
    "new",
    "into",
    "over",
    "under",
    "between",
    "within",
    "without",
    "beyond",
}
_GENERIC_PAPER_WORDS = {
"survey",
"review",
"tutorial",
"paper",
"approach",
"method",
"methods",
"model",
"models",
"framework",
"frameworks",
"system",
"systems",
"learning",
"deep",
"neural",
"network",
"networks",
"analysis",
"benchmark",
"benchmarks",
"dataset",
"datasets",
"evaluation",
"evaluating",
"towards",
"using",
"based",
"study",
"studies",
}
def candidate_keywords(titles: Iterable[str], *, top_k: int, min_freq: int) -> list[str]:
freq: dict[str, int] = {}
for title in titles:
for token in tokenize(title):
if token in _EN_STOPWORDS or token in _GENERIC_PAPER_WORDS:
continue
if len(token) < 3:
continue
freq[token] = freq.get(token, 0) + 1
candidates = [t for t, c in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])) if c >= min_freq]
return candidates[:top_k]
def update_status_log(status_path: Path, line: str) -> None:
ensure_dir(status_path.parent)
if status_path.exists():
existing = status_path.read_text(encoding="utf-8")
else:
existing = "# Status\n"
if "## Run log" not in existing:
existing = existing.rstrip() + "\n\n## Run log\n"
updated = existing.rstrip() + f"\n- {line}\n"
atomic_write_text(status_path, updated)
def update_status_field(status_path: Path, heading: str, value: str) -> None:
heading_line = f"## {heading}".strip()
bullet_line = f"- `{value}`"
if status_path.exists():
lines = status_path.read_text(encoding="utf-8").splitlines()
else:
lines = ["# Status"]
out: list[str] = []
i = 0
updated = False
while i < len(lines):
line = lines[i]
out.append(line)
if line.strip() == heading_line:
if i + 1 < len(lines) and lines[i + 1].lstrip().startswith("-"):
out.append(bullet_line)
i += 2
updated = True
continue
out.append(bullet_line)
updated = True
i += 1
if not updated:
out.extend(["", heading_line, bullet_line])
atomic_write_text(status_path, "\n".join(out).rstrip() + "\n")
def decisions_has_approval(decisions_path: Path, checkpoint: str) -> bool:
if not checkpoint:
return False
if not decisions_path.exists():
return False
text = decisions_path.read_text(encoding="utf-8")
pattern = rf"^\s*-\s*\[[xX]\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b"
return re.search(pattern, text, flags=re.MULTILINE) is not None
def ensure_decisions_approval_checklist(decisions_path: Path) -> None:
if decisions_path.exists():
text = decisions_path.read_text(encoding="utf-8")
else:
text = "# Decisions log\n"
if re.search(r"^##\s+Approvals\b", text, flags=re.MULTILINE):
return
workspace = decisions_path.parent
checkpoints = _human_checkpoints_from_units(workspace)
if not checkpoints:
return
checklist_lines = ["## Approvals (check to unblock)"]
for checkpoint in checkpoints:
hint = _approval_hint(checkpoint)
suffix = f" ({hint})" if hint else ""
checklist_lines.append(f"- [ ] Approve {checkpoint}{suffix}")
checklist_lines.append("")
checklist = "\n".join(checklist_lines)
lines = text.splitlines()
if lines and lines[0].startswith("#"):
new_text = "\n".join([lines[0], "", checklist] + lines[1:]).rstrip() + "\n"
else:
new_text = (checklist + "\n" + text).rstrip() + "\n"
atomic_write_text(decisions_path, new_text)
def set_decisions_approval(decisions_path: Path, checkpoint: str, *, approved: bool) -> None:
checkpoint = checkpoint.strip()
if not checkpoint:
raise ValueError("checkpoint must be non-empty")
ensure_decisions_approval_checklist(decisions_path)
text = decisions_path.read_text(encoding="utf-8")
lines = text.splitlines()
pattern = re.compile(rf"^\s*-\s*\[\s*[xX ]\s*\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b")
updated = False
for idx, line in enumerate(lines):
if pattern.search(line):
lines[idx] = re.sub(
r"\[\s*[xX ]\s*\]",
"[x]" if approved else "[ ]",
line,
count=1,
)
updated = True
break
if not updated:
insert_at = None
for idx, line in enumerate(lines):
if line.strip().startswith("## Approvals"):
insert_at = idx + 1
break
if insert_at is None:
lines.append("")
lines.append("## Approvals (check to unblock)")
insert_at = len(lines)
lines.insert(insert_at, f"- [{'x' if approved else ' '}] Approve {checkpoint}")
atomic_write_text(decisions_path, "\n".join(lines).rstrip() + "\n")
def _human_checkpoints_from_units(workspace: Path) -> list[str]:
units_path = workspace / "UNITS.csv"
if not units_path.exists():
return []
try:
table = UnitsTable.load(units_path)
except Exception:
return []
seen: set[str] = set()
out: list[str] = []
for row in table.rows:
owner = (row.get("owner") or "").strip().upper()
if owner != "HUMAN":
continue
checkpoint = (row.get("checkpoint") or "").strip()
if checkpoint and checkpoint not in seen:
seen.add(checkpoint)
out.append(checkpoint)
return out
def _approval_hint(checkpoint: str) -> str:
hints = {
"C0": "kickoff: scope/sources/time window/constraints",
"C1": "retrieval + core set",
"C2": "scope + outline",
"C3": "evidence ready",
"C4": "citations verified",
"C5": "allow prose writing",
}
return hints.get(checkpoint, "")
def upsert_checkpoint_block(decisions_path: Path, checkpoint: str, markdown_block: str) -> None:
    begin = f"<!-- BEGIN CHECKPOINT:{checkpoint} -->"
    end = f"<!-- END CHECKPOINT:{checkpoint} -->"
    block = "\n".join([begin, markdown_block.rstrip(), end, ""]).rstrip() + "\n"
    # The checklist helper may create or rewrite the file, but it is a no-op when
    # no HUMAN checkpoints are found, so fall back to a fresh log if the file is still absent.
    ensure_decisions_approval_checklist(decisions_path)
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n\n"
    pattern = re.compile(
        rf"{re.escape(begin)}.*?{re.escape(end)}\n?",
        flags=re.DOTALL,
    )
    if pattern.search(text):
        # Use a callable so backslashes in the block are inserted literally
        # rather than parsed as regex replacement escapes.
        new_text = pattern.sub(lambda _match: block, text)
    else:
        new_text = text.rstrip() + "\n\n" + block
    atomic_write_text(decisions_path, new_text)
def seed_queries_from_topic(queries_path: Path, topic: str) -> None:
topic = topic.strip()
if not topic:
return
if queries_path.exists():
lines = queries_path.read_text(encoding="utf-8").splitlines()
else:
lines = [
"# Queries",
"",
"## Primary query",
"- keywords:",
" - \"\"",
"- exclude:",
" - \"\"",
"- max_results: \"\"",
"- core_size: \"\"",
"- time window:",
" - from: \"\"",
" - to: \"\"",
"",
"## Notes",
"-",
]
def _has_nonempty_values(token: str) -> bool:
in_block = False
for raw in lines:
stripped = raw.strip()
if stripped.startswith(f"- {token}:"):
in_block = True
continue
if not in_block:
continue
if raw.startswith(" - "):
value = stripped[2:].strip().strip('"').strip("'")
if value:
return True
continue
if stripped.startswith("- "):
break
return False
def _has_nonempty_scalar(token: str) -> bool:
for raw in lines:
stripped = raw.strip()
if not stripped.startswith(f"- {token}:"):
continue
value = stripped.split(":", 1)[1].split("#", 1)[0].strip().strip('"').strip("'")
return bool(value)
return False
def _has_nonempty_time_field(field: str) -> bool:
# Looks for lines like: ' - from: "2022"'
for raw in lines:
stripped = raw.strip()
if not stripped.startswith(f"- {field}:"):
continue
value = stripped.split(":", 1)[1].strip().strip('"').strip("'")
return bool(value)
return False
has_keywords = _has_nonempty_values("keywords")
has_excludes = _has_nonempty_values("exclude")
has_time_from = _has_nonempty_time_field("from")
has_time_to = _has_nonempty_time_field("to")
has_max_results = _has_nonempty_scalar("max_results")
has_core_size = _has_nonempty_scalar("core_size")
workspace = queries_path.parent
profile = pipeline_profile(workspace)
query_defaults = pipeline_query_defaults(workspace)
raw_tlow = topic.lower()
topic_for_queries = _sanitize_topic_for_query_seed(topic)
keyword_suggestions = [topic_for_queries]
tlow = topic_for_queries.lower()
is_agent = any(t in tlow for t in ("agent", "agents", "agentic"))
is_embodied = any(
t in tlow
for t in (
"embodied ai",
"embodied intelligence",
"embodied agent",
"embodied robotics",
"robot foundation model",
"robot learning",
"robot manipulation",
"vision-language-action",
"vla",
"generalist robot",
)
)
is_text_to_image = any(t in tlow for t in ("text-to-image", "text to image", "t2i"))
is_text_to_video = any(t in tlow for t in ("text-to-video", "text to video", "t2v"))
is_diffusion = "diffusion" in tlow
is_generative = is_text_to_image or is_text_to_video or is_diffusion or ("image generation" in tlow) or ("generative" in tlow)
if is_agent:
keyword_suggestions.extend(
[
"LLM agent",
"language model agent",
"tool use",
"function calling",
"tool-using agent",
"planning",
"memory",
"multi-agent",
"benchmark",
"safety",
]
)
exclude_suggestions: list[str] = []
if is_agent:
exclude_suggestions.append("agent-based modeling")
exclude_suggestions.extend(["react hooks", "perovskite", "banach", "coxeter"])
if is_embodied:
keyword_suggestions.extend(
[
"embodied AI survey",
"embodied AI review",
"embodied intelligence survey",
"embodied agent survey",
"robot foundation model survey",
"robot learning survey",
"robot manipulation survey",
"embodied robotics survey",
"vision-language-action survey",
"vision-language-action model",
"robot foundation model",
"generalist robot policy",
"world model robot",
]
)
stripped_output_terms = topic_for_queries.lower().strip() != raw_tlow.strip()
if stripped_output_terms and any(t in raw_tlow for t in ("latex", "pdf", "markdown", "typesetting")):
exclude_suggestions.extend(["latex", "pdf", "typesetting", "document layout"])
if is_generative:
keyword_suggestions.extend(
[
"text-to-image generation",
"text-guided image generation",
"diffusion model",
"denoising diffusion probabilistic model",
"latent diffusion",
"stable diffusion",
"classifier-free guidance",
"diffusion transformer",
"DiT",
"masked generative transformer",
"MaskGIT",
"autoregressive image generation",
"VQGAN",
"VQ-VAE",
"ControlNet",
"DreamBooth",
"textual inversion",
"LoRA fine-tuning",
]
)
default_max_results = query_defaults.get("max_results")
max_results_suggestion = str(default_max_results) if str(default_max_results or "").strip() else (
1800 if profile == "arxiv-survey" else (800 if (is_agent or is_generative or is_embodied) else 300)
)
time_from_suggestion = (
"2018" if is_embodied
else ("2022" if (is_agent and ("llm" in tlow or "language model" in tlow)) else ("2020" if is_generative else ""))
)
core_size_suggestion = str(query_defaults.get("core_size") or "").strip() or ("300" if profile == "arxiv-survey" else "")
out: list[str] = []
i = 0
while i < len(lines):
line = lines[i]
stripped = line.strip()
if stripped.startswith("- keywords:") and not has_keywords:
out.append(line)
i += 1
while i < len(lines) and lines[i].startswith(" - "):
i += 1
for kw in _dedupe_preserve_order(keyword_suggestions)[:14]:
out.append(f" - \"{kw}\"")
continue
if stripped.startswith("- exclude:") and not has_excludes:
out.append(line)
i += 1
while i < len(lines) and lines[i].startswith(" - "):
i += 1
for ex in _dedupe_preserve_order(exclude_suggestions)[:10]:
out.append(f" - \"{ex}\"")
continue
if stripped.startswith("- max_results:") and not has_max_results and max_results_suggestion:
out.append(f"- max_results: \"{max_results_suggestion}\"")
i += 1
continue
if stripped.startswith("- core_size:") and not has_core_size and core_size_suggestion:
out.append(f"- core_size: \"{core_size_suggestion}\"")
i += 1
continue
if stripped.startswith("- time window:") and not (has_time_from or has_time_to) and time_from_suggestion:
out.append(line)
i += 1
# Skip existing from/to lines if present.
while i < len(lines) and lines[i].startswith(" -"):
i += 1
out.append(f" - from: \"{time_from_suggestion}\"")
out.append(" - to: \"\"")
continue
out.append(line)
i += 1
out = _materialize_missing_query_defaults(out, query_defaults, allowed_fields=pipeline_overridable_query_fields(workspace))
atomic_write_text(queries_path, "\n".join(out).rstrip() + "\n")
def _dedupe_preserve_order(items: list[str]) -> list[str]:
seen: set[str] = set()
out: list[str] = []
for item in items:
item = item.strip()
if not item or item in seen:
continue
seen.add(item)
out.append(item)
return out
def _sanitize_topic_for_query_seed(topic: str) -> str:
text = str(topic or "").strip()
if not text:
return ""
patterns = [
r"(?i)\bwith\s+latex\s*/\s*pdf\s+output\b",
r"(?i)\bwith\s+latex\s+output\b",
r"(?i)\bwith\s+pdf\s+output\b",
r"(?i)\bwith\s+markdown\s+output\b",
r"(?i)\blatex\s*/\s*pdf\s+output\b",
r"(?i)\bpdf\s+output\b",
r"(?i)\blatex\s+output\b",
r"(?i)\bmarkdown\s+output\b",
r"(?i)\bfor\s+latex\s*/\s*pdf\b",
]
for pattern in patterns:
text = re.sub(pattern, "", text)
text = re.sub(r"\s+", " ", text).strip(" ,;:-")
return text or topic
_LEGACY_PIPELINE_ALIASES = {
"idea-finder": "idea-brainstorm",
"idea-finder.pipeline.md": "idea-brainstorm",
"pipelines/idea-finder.pipeline.md": "idea-brainstorm",
}
def find_repo_root(start: Path | None = None) -> Path:
    """Walk up from *start* (default: this file) looking for AGENTS.md."""
    candidate = (start or Path(__file__)).resolve()
    for _ in range(10):
        if (candidate / "AGENTS.md").exists():
            return candidate
        parent = candidate.parent
        if parent == candidate:
            break
        candidate = parent
    raise FileNotFoundError("Could not find repo root (AGENTS.md marker)")
def _normalize_pipeline_lock_value(value: str) -> str:
raw = str(value or "").strip()
return _LEGACY_PIPELINE_ALIASES.get(raw, raw)
def resolve_pipeline_spec_path(*, repo_root: Path, pipeline_value: str) -> Path | None:
value = _normalize_pipeline_lock_value(pipeline_value)
if not value:
return None
candidate = Path(value)
if candidate.is_absolute() and candidate.exists():
return candidate.resolve()
rel_candidate = repo_root / value
if rel_candidate.exists():
return rel_candidate.resolve()
filename = Path(value).name
if filename:
direct = repo_root / "pipelines" / filename
if direct.exists():
return direct.resolve()
stem = filename
if stem.endswith(".pipeline.md"):
stem = stem[: -len(".pipeline.md")]
if stem:
direct = repo_root / "pipelines" / f"{stem}.pipeline.md"
if direct.exists():
return direct.resolve()
return None
def load_workspace_pipeline_spec(workspace: Path):
from tooling.pipeline_spec import PipelineSpec
try:
repo_root = find_repo_root(workspace)
except FileNotFoundError:
repo_root = Path(__file__).resolve().parents[1]
lock_path = workspace / "PIPELINE.lock.md"
if not lock_path.exists():
return None
pipeline_name = ""
try:
for raw in lock_path.read_text(encoding="utf-8", errors="ignore").splitlines():
line = raw.strip()
if line.startswith("pipeline:"):
pipeline_name = line.split(":", 1)[1].strip()
break
except Exception:
return None
if not pipeline_name:
return None
spec_path = resolve_pipeline_spec_path(repo_root=repo_root, pipeline_value=pipeline_name)
if spec_path is None:
return None
try:
return PipelineSpec.load(spec_path)
except Exception:
return None
def pipeline_query_defaults(workspace: Path) -> dict[str, Any]:
    spec = load_workspace_pipeline_spec(workspace)
    return dict(spec.query_defaults) if spec is not None else {}
def pipeline_quality_contract(workspace: Path) -> dict[str, Any]:
    spec = load_workspace_pipeline_spec(workspace)
    return dict(spec.quality_contract) if spec is not None else {}
def pipeline_quality_contract_value(workspace: Path, *keys: str, default: Any = None) -> Any:
    # Walk nested dict keys; fall back to `default` on any missing or non-dict level.
    current: Any = pipeline_quality_contract(workspace)
    for key in keys:
        if not isinstance(current, dict):
            return default
        current = current.get(str(key))
        if current is None:
            return default
    return current
def pipeline_query_default(workspace: Path, key: str, default: Any = None) -> Any:
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None:
        return default
    return spec.query_default(key, default)
def pipeline_overridable_query_fields(workspace: Path) -> set[str]:
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None:
        return set()
    return set(spec.overridable_query_fields)
def pipeline_profile(workspace: Path) -> str:
    """Return the pipeline profile for a workspace.

    Reads PIPELINE.lock.md to get the pipeline name, then loads the
    pipeline spec file and reads its ``profile`` frontmatter field.
    Falls back to ``"default"`` if anything is missing.
    """
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None:
        return "default"
    return str(spec.profile or "default").strip() or "default"
def latest_outline_state(workspace: Path) -> dict[str, Any]:
    path = Path(workspace).resolve() / "outline" / "outline_state.jsonl"
    records = [rec for rec in read_jsonl(path) if isinstance(rec, dict)]
    return dict(records[-1]) if records else {}
def _materialize_missing_query_defaults(lines: list[str], query_defaults: dict[str, Any], *, allowed_fields: set[str] | None = None) -> list[str]:
    if not query_defaults:
        return lines
    # Collect the keys already present as `- key: value` bullets.
    existing_keys: set[str] = set()
    for raw in lines:
        stripped = raw.strip()
        if not stripped.startswith("- ") or ":" not in stripped:
            continue
        key = stripped[2:].split(":", 1)[0].strip().lower().replace(" ", "_").replace("-", "_")
        if key:
            existing_keys.add(key)
    # Render only the defaults that are missing, allowed, and scalar.
    additions: list[str] = []
    for key, value in query_defaults.items():
        norm_key = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
        if not norm_key or norm_key in existing_keys:
            continue
        if allowed_fields and norm_key not in allowed_fields:
            continue
        rendered = _render_query_scalar(value)
        if rendered is None:
            continue
        additions.append(f'- {norm_key}: "{rendered}"')
    if not additions:
        return lines
    out = list(lines)
    if out and out[-1].strip():
        out.append("")
    out.extend(additions)
    return out
def _render_query_scalar(value: Any) -> str | None:
    if value is None:
        return None
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        text = value.strip()
        return text or None
    return None
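The key normalization and scalar rendering used by `_materialize_missing_query_defaults` can be illustrated in isolation. The following is a minimal standalone sketch, not the module's code: `normalize_key`, `render_scalar`, and `missing_defaults` are hypothetical names introduced for illustration, and the `allowed_fields` filter and the trailing blank-line handling are deliberately left out.

```python
from __future__ import annotations

from typing import Any


def normalize_key(key: str) -> str:
    """Lower-case a key and map spaces and hyphens to underscores."""
    return str(key or "").strip().lower().replace(" ", "_").replace("-", "_")


def render_scalar(value: Any) -> str | None:
    """Render bool/int/float/str scalars; return None for anything else."""
    if value is None:
        return None
    if isinstance(value, bool):  # bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        return value.strip() or None
    return None


def missing_defaults(lines: list[str], defaults: dict[str, Any]) -> list[str]:
    """Return `- key: "value"` bullets for defaults that `lines` does not define."""
    existing = {
        normalize_key(line.strip()[2:].split(":", 1)[0])
        for line in lines
        if line.strip().startswith("- ") and ":" in line
    }
    additions: list[str] = []
    for key, value in defaults.items():
        norm = normalize_key(key)
        rendered = render_scalar(value)
        if norm and norm not in existing and rendered is not None:
            additions.append(f'- {norm}: "{rendered}"')
    return additions
```

Under these assumptions, a bullet such as `- Max Results: 50` already covers a default keyed `max-results`, because both normalize to `max_results`; non-scalar defaults (lists, dicts) are skipped rather than serialized.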
FILE:tooling/executor.py
from __future__ import annotations
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path
from tooling.common import (
UnitsTable,
atomic_write_text,
decisions_has_approval,
ensure_dir,
latest_outline_state,
load_workspace_pipeline_spec,
now_iso_seconds,
parse_semicolon_list,
set_decisions_approval,
update_status_field,
update_status_log,
)
@dataclass(frozen=True)
class RunResult:
    unit_id: str | None
    status: str
    message: str
def _section_first_cutover_block_message(*, workspace: Path, outputs: list[str]) -> str | None:
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None or str(spec.structure_mode or "").strip().lower() != "section_first":
        return None
    if "outline/outline_state.jsonl" not in outputs:
        return None
    latest = latest_outline_state(workspace)
    if not latest:
        return "Section-first cutover is not actionable yet: `outline/outline_state.jsonl` is missing or empty. Rerun `outline-refiner` after fixing the chapter/section layer."
    structure_phase = str(latest.get("structure_phase") or "").strip()
    h3_status = str(latest.get("h3_status") or "").strip().lower()
    reroute_target = str(latest.get("reroute_target") or "").strip()
    retry_budget = str(latest.get("retry_budget_remaining") or "").strip()
    reroute_reason = str(latest.get("reroute_reason") or "").strip()
    # The only state that lets the cutover advance.
    if structure_phase.lower() == "decomposed" and h3_status == "stable":
        return None
    missing: list[str] = []
    for key in ("structure_phase", "h3_status", "approval_status", "reroute_target", "retry_budget_remaining"):
        if key not in latest:
            missing.append(key)
            continue
        if key in {"structure_phase", "h3_status"} and not str(latest.get(key) or "").strip():
            missing.append(key)
    if missing:
        return (
            "Section-first cutover cannot advance because the latest `outline_state.jsonl` record is missing "
            f"required fields: {', '.join(missing)}."
        )
    reroute_label = reroute_target or "unknown"
    retry_label = retry_budget or "unknown"
    reason_suffix = f" Reason: {reroute_reason}." if reroute_reason else ""
    return (
        "Section-first cutover is still blocked; "
        f"latest outline state has structure_phase={structure_phase or 'missing'}, "
        f"h3_status={h3_status or 'missing'}, reroute_target={reroute_label}, "
        f"retry_budget_remaining={retry_label}.{reason_suffix} Fix the reroute target explicitly, then rerun this unit."
    )
def _append_run_error(*, workspace: Path, unit_id: str, skill: str, kind: str, message: str, log_rel: str | None) -> None:
    """Append a short failure record to `output/RUN_ERRORS.md` (workspace-local).

    This is a human-facing error sink that survives reruns and makes BLOCKED states debuggable.
    """
    try:
        ensure_dir(workspace / "output")
        out_path = workspace / "output" / "RUN_ERRORS.md"
        stamp = now_iso_seconds()
        log_hint = f" (log: `{log_rel}`)" if log_rel else ""
        line = f"- {stamp} `{unit_id}` `{skill}` `{kind}`: {message}{log_hint}"
        if out_path.exists() and out_path.stat().st_size > 0:
            prev = out_path.read_text(encoding="utf-8", errors="ignore").rstrip() + "\n"
        else:
            prev = "# Run errors\n\n"
        atomic_write_text(out_path, prev + line + "\n")
    except Exception:
        # Never let the runner crash while trying to log an error.
        return
def run_one_unit(
    *,
    workspace: Path,
    repo_root: Path,
    strict: bool = False,
    auto_approve: set[str] | None = None,
) -> RunResult:
    units_path = workspace / "UNITS.csv"
    status_path = workspace / "STATUS.md"
    if not units_path.exists():
        return RunResult(unit_id=None, status="ERROR", message=f"Missing {units_path}")
    table = UnitsTable.load(units_path)
    runnable_idx = _find_first_runnable(table)
    if runnable_idx is None:
        return RunResult(unit_id=None, status="IDLE", message="No runnable unit found")
    row = table.rows[runnable_idx]
    unit_id = row.get("unit_id", "").strip()
    skill = row.get("skill", "").strip()
    owner = row.get("owner", "").strip().upper()
    row["status"] = "DOING"
    table.save(units_path)
    update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DOING {skill}")
    auto_approve_set = {str(x or "").strip().upper() for x in (auto_approve or set()) if str(x or "").strip()}
    # HUMAN-owned units never run a script: they are approved, auto-approved, or blocked.
    if owner == "HUMAN":
        checkpoint = row.get("checkpoint", "").strip()
        if checkpoint and decisions_has_approval(workspace / "DECISIONS.md", checkpoint):
            row["status"] = "DONE"
            table.save(units_path)
            update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (HUMAN approved {checkpoint})")
            _refresh_status_checkpoint(status_path, table)
            return RunResult(unit_id=unit_id, status="DONE", message=f"HUMAN approved {checkpoint}")
        if checkpoint and checkpoint.upper() in auto_approve_set:
            set_decisions_approval(workspace / "DECISIONS.md", checkpoint, approved=True)
            row["status"] = "DONE"
            table.save(units_path)
            update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (AUTO approved {checkpoint})")
            _refresh_status_checkpoint(status_path, table)
            return RunResult(unit_id=unit_id, status="DONE", message=f"AUTO approved {checkpoint}")
        row["status"] = "BLOCKED"
        table.save(units_path)
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (await HUMAN approval {checkpoint})")
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Await HUMAN approval {checkpoint} in DECISIONS.md")
    script_path = repo_root / "scripts" / "run.py"
    if not script_path.exists():
        row["status"] = "BLOCKED"
        table.save(units_path)
        skill_md = "SKILL.md"
        update_status_log(
            status_path,
            (
                f"{now_iso_seconds()} {unit_id} BLOCKED "
                f"(no script for {skill}; run manually per {skill_md} then mark DONE)"
            ),
        )
        _refresh_status_checkpoint(status_path, table)
        return RunResult(
            unit_id=unit_id,
            status="BLOCKED",
            message=(
                f"No executable script for skill '{skill}'. "
                f"Run it manually by following `{skill_md}`, write the required outputs, "
                f"then mark the unit DONE (e.g., `python scripts/pipeline.py mark --workspace {workspace} --unit-id {unit_id} --status DONE`)."
            ),
        )
    inputs = parse_semicolon_list(row.get("inputs"))
    raw_outputs = parse_semicolon_list(row.get("outputs"))
    # A leading "?" marks an output as optional; it is stripped from the path
    # but the output is excluded from the required set.
    outputs = [_strip_optional_marker(rel) for rel in raw_outputs]
    required_outputs = [outputs[i] for i, rel in enumerate(raw_outputs) if not rel.strip().startswith("?")]
    checkpoint = row.get("checkpoint", "").strip()
    cmd = [
        sys.executable,
        str(script_path),
        "--workspace",
        str(workspace),
        "--unit-id",
        unit_id,
        "--inputs",
        ";".join(inputs),
        "--outputs",
        ";".join(outputs),
        "--checkpoint",
        checkpoint,
    ]
    log_rel = f"output/unit_logs/{unit_id}.{skill}.log"
    log_path = workspace / log_rel
    try:
        completed = subprocess.run(cmd, check=False, capture_output=True, text=True)
        # Write a per-unit log only when there is output or a non-zero exit code.
        if completed.stdout or completed.stderr or completed.returncode != 0:
            ensure_dir(log_path.parent)
            body = [
                "# Unit log\n",
                f"- unit_id: {unit_id}\n",
                f"- skill: {skill}\n",
                f"- exit: {completed.returncode}\n",
                f"- cmd: {' '.join(cmd)}\n",
                "\n## stdout\n\n",
                (completed.stdout or "(empty)") + "\n",
                "\n## stderr\n\n",
                (completed.stderr or "(empty)") + "\n",
            ]
            atomic_write_text(log_path, "".join(body))
    except Exception as exc:  # pragma: no cover
        row["status"] = "BLOCKED"
        table.save(units_path)
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (exec error)")
        _append_run_error(
            workspace=workspace,
            unit_id=unit_id,
            skill=skill,
            kind="exec_error",
            message=f"{type(exc).__name__}: {exc}",
            log_rel=None,
        )
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="BLOCKED", message=str(exc))
    missing = [rel for rel in required_outputs if rel and not (workspace / rel).exists()]
    if completed.returncode == 0 and not missing:
        if strict:
            from tooling.quality_gate import check_unit_outputs, write_quality_report
            try:
                issues = check_unit_outputs(skill=skill, workspace=workspace, outputs=outputs)
            except Exception as exc:  # pragma: no cover
                from tooling.quality_gate import QualityIssue
                issues = [
                    QualityIssue(
                        code="quality_gate_exception",
                        message=f"Quality gate crashed: {type(exc).__name__}: {exc}",
                    )
                ]
            # Avoid confusing stale QUALITY_GATE.md after a successful run.
            report_path = workspace / "output" / "QUALITY_GATE.md"
            if issues or report_path.exists():
                write_quality_report(workspace=workspace, unit_id=unit_id, skill=skill, issues=issues)
            if issues:
                row["status"] = "BLOCKED"
                table.save(units_path)
                rel_report = str((workspace / "output" / "QUALITY_GATE.md").relative_to(workspace))
                update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (quality gate: {rel_report})")
                _refresh_status_checkpoint(status_path, table)
                reroute_hint = _reroute_hint(workspace)
                return RunResult(
                    unit_id=unit_id,
                    status="BLOCKED",
                    message=f"Quality gate failed; see {rel_report}" + (f"; {reroute_hint}" if reroute_hint else ""),
                )
        cutover_block = _section_first_cutover_block_message(workspace=workspace, outputs=outputs)
        if cutover_block:
            row["status"] = "BLOCKED"
            table.save(units_path)
            update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (section-first cutover)")
            _append_run_error(
                workspace=workspace,
                unit_id=unit_id,
                skill=skill,
                kind="section_first_cutover",
                message=cutover_block,
                log_rel=log_rel if log_path.exists() else None,
            )
            _refresh_status_checkpoint(status_path, table)
            return RunResult(unit_id=unit_id, status="BLOCKED", message=cutover_block)
        row["status"] = "DONE"
        table.save(units_path)
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE {skill}")
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="DONE", message="OK")
    row["status"] = "BLOCKED"
    table.save(units_path)
    if missing:
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (missing outputs: {', '.join(missing)})")
        _append_run_error(
            workspace=workspace,
            unit_id=unit_id,
            skill=skill,
            kind="missing_outputs",
            message=f"Missing outputs: {', '.join(missing)}",
            log_rel=log_rel if log_path.exists() else None,
        )
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Missing outputs: {', '.join(missing)}" + (f"; see {log_rel}" if log_path.exists() else ""))
    update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (script failed)")
    _append_run_error(
        workspace=workspace,
        unit_id=unit_id,
        skill=skill,
        kind="script_failed",
        message=f"Skill script failed (exit {completed.returncode})",
        log_rel=log_rel if log_path.exists() else None,
    )
    _refresh_status_checkpoint(status_path, table)
    return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Skill script failed (exit {completed.returncode})" + (f"; see {log_rel}" if log_path.exists() else ""))
def _find_first_runnable(table: UnitsTable) -> int | None:
    status_ok = {"DONE", "SKIP"}
    unit_by_id = {row.get("unit_id", ""): row for row in table.rows}
    for idx, row in enumerate(table.rows):
        if row.get("status", "").strip().upper() not in {"TODO", "BLOCKED"}:
            continue
        deps = parse_semicolon_list(row.get("depends_on"))
        if not deps:
            return idx
        # A unit is runnable only if every dependency exists and is DONE or SKIP.
        deps_done = True
        for dep_id in deps:
            dep = unit_by_id.get(dep_id)
            if not dep:
                deps_done = False
                break
            if dep.get("status", "").strip().upper() not in status_ok:
                deps_done = False
                break
        if deps_done:
            return idx
    return None
def _refresh_status_checkpoint(status_path: Path, table: UnitsTable) -> None:
    checkpoint = _compute_current_checkpoint(table)
    update_status_field(status_path, "Current checkpoint", checkpoint)
def _compute_current_checkpoint(table: UnitsTable) -> str:
    # The current checkpoint is that of the first unit not yet DONE or SKIP.
    for row in table.rows:
        if row.get("status", "").strip().upper() not in {"DONE", "SKIP"}:
            return (row.get("checkpoint") or "").strip() or "C0"
    return "DONE"
def invalidate_downstream_units(table: UnitsTable, *, root_unit_id: str) -> list[str]:
    """Reset all transitive downstream dependents of `root_unit_id` to TODO.

    This is used when a previously satisfied upstream unit is reopened for rerun.
    Keeping downstream units as DONE would otherwise leave stale artifacts in place
    and make later `run` invocations stop too early.
    """
    root = str(root_unit_id or "").strip()
    if not root:
        return []
    # Map each unit id to the rows that list it in `depends_on`.
    direct_children: dict[str, list[dict[str, str]]] = {}
    for row in table.rows:
        for dep in parse_semicolon_list(row.get("depends_on")):
            direct_children.setdefault(dep, []).append(row)
    # Depth-first walk over dependents; `seen` guards against cycles and repeats.
    affected: list[str] = []
    seen: set[str] = set()
    stack = [root]
    while stack:
        current = stack.pop()
        for child in direct_children.get(current, []):
            child_id = str(child.get("unit_id") or "").strip()
            if not child_id or child_id in seen:
                continue
            seen.add(child_id)
            if str(child.get("status") or "").strip().upper() != "TODO":
                child["status"] = "TODO"
                affected.append(child_id)
            stack.append(child_id)
    return affected
def _strip_optional_marker(relpath: str) -> str:
    relpath = (relpath or "").strip()
    if relpath.startswith("?"):
        return relpath[1:].strip()
    return relpath
def _reroute_hint(workspace: Path) -> str:
    path = workspace / "output" / "REROUTE_STATE.json"
    if not path.exists() or path.stat().st_size <= 0:
        return ""
    try:
        import json
        data = json.loads(path.read_text(encoding="utf-8", errors="ignore") or "{}")
    except Exception:
        return ""
    if not isinstance(data, dict):
        return ""
    target = str(data.get("reroute_target") or "").strip()
    status = str(data.get("status") or "").strip()
    phase = str(data.get("structure_phase") or "").strip()
    h3 = str(data.get("h3_status") or "").strip()
    reason = str(data.get("reroute_reason") or "").strip()
    if not any([target, status, phase, h3]):
        return ""
    parts = []
    if status:
        parts.append(f"reroute_status={status}")
    if target:
        parts.append(f"reroute_target={target}")
    if phase:
        parts.append(f"structure_phase={phase}")
    if h3:
        parts.append(f"h3_status={h3}")
    if reason:
        parts.append(f"reason={reason}")
    return ", ".join(parts)
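The scheduling rules in `_find_first_runnable` and `invalidate_downstream_units` can be sketched on plain dictionaries. This is an illustrative sketch under stated assumptions, not the executor's code: rows are plain dicts rather than a `UnitsTable`, and the hypothetical helper `split_deps` stands in for `parse_semicolon_list`, whose exact behavior lives in `tooling.common` and is only assumed here.

```python
from __future__ import annotations


def split_deps(raw: str | None) -> list[str]:
    """Assumed stand-in for parse_semicolon_list: split on ';' and drop blanks."""
    return [part.strip() for part in (raw or "").split(";") if part.strip()]


def first_runnable(rows: list[dict[str, str]]) -> int | None:
    """Index of the first TODO/BLOCKED row whose dependencies are all DONE or SKIP."""
    by_id = {row.get("unit_id", ""): row for row in rows}
    for idx, row in enumerate(rows):
        if row.get("status", "").strip().upper() not in {"TODO", "BLOCKED"}:
            continue
        deps = split_deps(row.get("depends_on"))
        # An unknown dependency id counts as unsatisfied.
        if all(
            dep in by_id and by_id[dep].get("status", "").strip().upper() in {"DONE", "SKIP"}
            for dep in deps
        ):
            return idx
    return None


def invalidate_downstream(rows: list[dict[str, str]], root_id: str) -> list[str]:
    """Reset every transitive dependent of `root_id` to TODO; return the ids changed."""
    children: dict[str, list[dict[str, str]]] = {}
    for row in rows:
        for dep in split_deps(row.get("depends_on")):
            children.setdefault(dep, []).append(row)
    changed: list[str] = []
    seen: set[str] = set()
    stack = [root_id]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            child_id = child.get("unit_id", "").strip()
            if not child_id or child_id in seen:
                continue
            seen.add(child_id)
            if child.get("status", "").strip().upper() != "TODO":
                child["status"] = "TODO"
                changed.append(child_id)
            stack.append(child_id)
    return changed
```

Note that the root unit itself is not reset by the invalidation walk; only its dependents are. Reopening the root is left to the caller, which matches the docstring's framing of an upstream unit that has already been reopened for rerun.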
FILE:tooling/ideation.py
from __future__ import annotations
import json
import re
from dataclasses import asdict, dataclass, is_dataclass
from pathlib import Path
from typing import Any, Iterable
from tooling.common import (
atomic_write_text,
ensure_dir,
load_workspace_pipeline_spec,
pipeline_overridable_query_fields,
pipeline_query_default,
read_jsonl,
)
DEFAULT_IDEA_RUBRIC: list[tuple[str, float, str]] = [
("discussion_worthiness", 0.24, "is this direction worth a serious PI/PhD discussion right now?"),
("academic_value", 0.22, "could it open a meaningful paper/thesis-worthy research angle?"),
("evidence_grounding", 0.18, "can we point to concrete literature tensions rather than pure speculation?"),
("direction_distinctness", 0.16, "is it genuinely different from neighboring directions, not just a wording variant?"),
("first_probe_clarity", 0.10, "is there a plausible low-cost first probe that would teach us something?"),
("thesis_potential", 0.10, "does it plausibly grow beyond a one-off note into a stronger line of work?"),
]
STOPWORDS = {
"a", "an", "and", "the", "of", "to", "for", "in", "on", "with", "via", "by", "from", "or",
"work", "works", "study", "studies", "survey", "review", "benchmarks", "benchmark", "evaluation",
"agents", "agent", "model", "models", "llm", "large", "language", "using", "based",
}
CLUSTER_AXIS_HINTS: list[tuple[list[str], list[str], str, str]] = [
(["agent loop", "action"], ["observability granularity", "action-space design", "tool/environment boundary"], "mechanism", "Could sharpen how we think about the basic agent loop rather than adding yet another full stack."),
(["tool", "orchestration", "interface"], ["tool routing", "permission model", "API reliability assumptions"], "systems", "Could clarify whether interface design is the real bottleneck behind seemingly agent-level gains."),
(["planning", "reasoning"], ["search depth", "verification loop", "partial-observability handling"], "mechanism", "Could turn vague talk about reasoning quality into a sharper research question about what actually drives robust planning."),
(["memory", "retrieval", "rag"], ["retrieval policy", "state summarization cadence", "memory horizon"], "mechanism", "Could reveal when memory is a genuine capability lever versus a reporting convenience."),
(["self-improvement", "adaptation"], ["feedback type", "adaptation horizon", "evaluation signal quality"], "adaptation", "Could distinguish genuine improvement from evaluation-sensitive self-optimization."),
(["multi-agent", "coordination"], ["role specialization", "communication budget", "aggregation rule"], "coordination", "Could separate coordination benefits from simple redundancy or verification effects."),
(["benchmark", "evaluation"], ["metric choice", "task slice design", "failure-analysis lens"], "evaluation", "Could make evaluation choices themselves a research object instead of hidden background assumptions."),
(["safety", "security", "governance"], ["threat model", "monitoring granularity", "permission boundary"], "governance", "Could connect technical agent behaviors to deployment-relevant risk questions in a more principled way."),
]
GENERIC_LIMITATION_PATTERNS = [
"abstract-level evidence only",
"validate assumptions",
"full paper",
"before relying on this as key evidence",
]
BENCHMARK_HINTS = [
"HotpotQA", "FEVER", "ALFWorld", "WebShop", "HumanEval", "WebArena", "AgentBench",
"GSM8K", "MATH", "Wikipedia", "wiki", "API", "question answering", "fact verification",
]
AXIS_INSIGHT_LIBRARY: dict[str, dict[str, str]] = {
"observability granularity": {
"question": "whether several agent-loop gains are really planner gains, or whether they mostly come from changing what the planner gets to observe and when",
"confound": "planner quality and broader agent competence",
"thesis": "Several agent-loop gains remain hard to interpret because papers often improve observation access at the same time they improve the planner; before crediting planner depth, we should ask what the system was allowed to see.",
"insight": "A convincing result would not just move aggregate score. It would show which published gains survive once observation access is fixed, ideally producing a regime map of when observability—not planner depth—changes the failure story.",
"contribution_shape": "Could yield a causal-attribution result plus a reporting rule for agent-loop papers: claims about planning quality should specify and control observation access.",
"demotion": "This direction weakens sharply if the strongest prior work already holds observation access fixed while planner quality changes and still reports the same gain pattern.",
"kill_signal": "an anchor paper already fixes observation access while varying planner quality and the main conclusion still survives",
"missing_piece": "What is missing is a fixed-interface, fixed-budget comparison that varies only observation access and tracks whether the failure taxonomy—not just average score—changes.",
"program_kind": "causal attribution",
"time_to_clarity": "fast",
"priority_note": "Fastest to falsify on public tasks, and the answer would immediately change how several agent-loop results are interpreted.",
"reading_extract": "what the agent sees at each step, which ablations vary planner depth, and whether the action interface stays fixed",
"title": "Observability granularity vs planner depth",
},
"action-space design": {
"question": "whether robustness comes from better reasoning or from giving the agent a cleaner action interface",
"confound": "the shape of the action vocabulary and interface design",
"thesis": "Some agent-loop gains may be interface gains in disguise: when action vocabularies become cleaner or narrower, papers can attribute robustness to reasoning improvements that partly come from action-space design.",
"insight": "A useful result would show whether the same nominal planner still behaves very differently once the action space is widened, normalized, or made less ergonomic.",
"contribution_shape": "Could produce an action-space normalization protocol and a clearer account of which agent claims survive once interface ergonomics are controlled.",
"demotion": "This direction weakens if current papers already compare equivalent action spaces and still obtain the same ranking.",
"kill_signal": "the key anchor papers already normalize action vocabularies or API surfaces and still see the same ordering",
"missing_piece": "What is missing is an action-space-normalized comparison that keeps planner prompts fixed while changing only interface granularity and affordances.",
"program_kind": "interface normalization",
"time_to_clarity": "medium",
"priority_note": "Interesting and still thesis-relevant, but less immediate than observability because interface normalization usually requires more careful task redesign.",
"reading_extract": "how many actions are available, how semantically aligned the actions are, and whether interface cleanup happens together with reasoning changes",
"title": "Action-space design or agent competence?",
},
"tool/environment boundary": {
"question": "whether reliability depends more on boundary design between tool and environment than on the nominal reasoning loop",
"confound": "tool abstraction and environment modeling",
"thesis": "A number of agent results may really be about boundary design: where the system stops reasoning internally and starts delegating to tools can change reliability without changing the nominal planner much.",
"insight": "A strong result would show that several agent-level conclusions are actually boundary-design conclusions once tool abstraction is normalized.",
"contribution_shape": "Could yield a boundary-design taxonomy and a cleaner experimental recipe for separating planner quality from tool/interface scaffolding.",
"demotion": "This direction weakens if changing the boundary carefully barely affects the failure story.",
"kill_signal": "careful tool-boundary changes leave both the metric and the qualitative failure modes essentially unchanged",
"missing_piece": "What is missing is a comparison that keeps task semantics fixed while moving the tool/environment boundary in a controlled way.",
"program_kind": "systems boundary",
"time_to_clarity": "medium",
"priority_note": "Valuable as a systems-facing thesis line, but slower to turn into a decisive first readout than the lead mechanism questions.",
"reading_extract": "where tool calls begin, what state is exposed to the planner, and whether environment modeling changes together with the tool boundary",
"title": "Where the tool boundary really matters",
},
"search depth": {
"question": "whether apparent planning gains reflect deeper search or simply more inference-time budget to recover from weak initial choices",
"confound": "inference-time compute budget",
"thesis": "Many planning results still bundle depth with budget: deeper search often spends more tokens, branches, or retries, so it remains unclear whether depth changes reasoning quality or just buys more recovery opportunities.",
"insight": "A convincing result would show whether depth changes the nature of planning failures under a matched budget, rather than merely delaying failure by spending more compute.",
"contribution_shape": "Could produce a compute-normalized planning benchmark slice and a regime map for when search depth matters beyond extra inference budget.",
"demotion": "This direction weakens if prior work already equalizes budget and still finds depth-specific gains.",
"kill_signal": "the anchor papers already normalize token or wall-clock budget and the depth advantage remains intact",
"missing_piece": "What is missing is a compute-normalized study that holds token or wall-clock budget fixed while varying depth or branching, then checks whether failure modes actually change.",
"program_kind": "budget-normalized mechanism",
"time_to_clarity": "medium",
"priority_note": "Still thesis-sized, but it ranks behind observability because compute normalization is harder to defend and easier for prior work to have addressed already.",
"reading_extract": "the reported depth or branching settings, the effective compute budget, and whether shallow baselines were budget-matched",
"title": "Search depth or compute budget?",
},
"verification loop": {
"question": "whether verification contributes a distinct reasoning mechanism or mostly acts as expensive redundancy",
"confound": "extra compute and repeated checking",
"thesis": "Verification-heavy pipelines may look stronger because they retry, re-check, or filter more often, not necessarily because they add a distinct reasoning mechanism.",
"insight": "A useful result would separate verification as a mechanism from verification as a compute-expensive retry policy.",
"contribution_shape": "Could produce a cleaner protocol for comparing verification loops under matched retry or compute budgets.",
"demotion": "This direction weakens if verification can be replaced by simple repetition with little change in behavior.",
"kill_signal": "repeat-sampling or retry baselines already match the reported verification gain",
"missing_piece": "What is missing is a matched-budget comparison between explicit verification and simple repetition or reranking baselines.",
"program_kind": "verification audit",
"time_to_clarity": "medium",
"priority_note": "Decision-relevant when the group suspects redundancy masquerading as reasoning, though the probe is less clean than the top two directions.",
"reading_extract": "how many extra passes are spent on checking, whether retry baselines exist, and which errors verification actually fixes",
"title": "Verification or just expensive redundancy?",
},
"partial-observability handling": {
"question": "whether planning systems fail because they reason badly or because they are brittle under missing state information",
"confound": "state visibility",
"thesis": "Some planning failures may be epistemic before they are algorithmic: systems can look like weak planners when they are actually brittle under missing state information.",
"insight": "A strong result would show whether improved visibility changes the failure taxonomy more than planner changes do.",
"contribution_shape": "Could produce a planning-under-partial-observability benchmark slice and a clearer division between epistemic and algorithmic failure modes.",
"demotion": "This direction weakens if stronger visibility barely changes the failure pattern.",
"kill_signal": "improved visibility leaves both success rates and error types almost unchanged",
"missing_piece": "What is missing is a task slice that varies only state visibility while keeping planner scaffolding steady.",
"program_kind": "epistemic failure analysis",
"time_to_clarity": "medium",
"priority_note": "Useful if the group wants a deeper failure-analysis program, but it is currently less concrete than the lead directions.",
"reading_extract": "what information is hidden, which observations are restored by tools, and whether planner quality is tested separately from visibility",
"title": "Is planning failure really an observability problem?",
},
"retrieval policy": {
"question": "whether memory gains come from better stored knowledge or from better decisions about when and how retrieval is triggered",
"confound": "memory content versus retrieval timing",
"thesis": "Reported memory gains are often hard to interpret because memory content and retrieval triggers move together; some improvements may be protocol effects about when retrieval happens, not capability effects about what memory stores.",
"insight": "A convincing result would separate memory-content effects from retrieval-trigger effects and show whether the interpretation of current RAG-style gains changes once trigger policy is normalized.",
"contribution_shape": "Could yield a cleaner memory-evaluation protocol and a reusable result on when retrieval policy, rather than memory content, is the true hidden variable.",
"demotion": "This direction weakens if current work already varies retrieval policy independently of memory content and sees the same story.",
"kill_signal": "the main memory papers already keep the memory store fixed while varying retrieval triggers or timing and the conclusion does not move",
"missing_piece": "What is missing is a fixed-memory comparison that varies retrieval trigger and timing only, then checks whether the gain survives and which errors it actually removes.",
"program_kind": "protocol sensitivity",
"time_to_clarity": "medium",
"priority_note": "Different enough from the planning directions to stay in the lead set, but it ranks lower because the current evidence is thinner and more protocol-sensitive.",
"reading_extract": "what is stored in memory, when retrieval fires, what retrieval budget is allowed, and whether trigger policy is ever ablated independently",
"title": "Retrieval policy or memory content?",
},
"state summarization cadence": {
"question": "whether summarization helps because it improves reasoning or because it selectively compresses away troublesome state information",
"confound": "what gets forgotten versus what gets highlighted",
"thesis": "Summarization may help less because the agent reasons better and more because the system filters state in a favorable way; cadence choices can quietly redefine what information survives.",
"insight": "A useful result would show whether different summarization cadences preserve the same conclusions and failure taxonomy once memory content is held steady.",
"contribution_shape": "Could produce a summarization-control protocol for long-horizon agents and a clearer account of how state compression alters conclusions.",
"demotion": "This direction weakens if different summarization cadences preserve the same conclusions and failure taxonomy.",
"kill_signal": "changing summarization cadence leaves both retrieval behavior and downstream errors largely unchanged",
"missing_piece": "What is missing is a fixed-memory, fixed-task comparison that varies only summarization cadence and records what state is lost or amplified.",
"program_kind": "state compression",
"time_to_clarity": "medium",
"priority_note": "Promising but currently less mature than retrieval-policy framing because the concrete prior-work hooks are thinner.",
"reading_extract": "what gets summarized away, how often summaries are rewritten, and whether failure cases correlate with state compression choices",
"title": "What summarization cadence is really changing",
},
"memory horizon": {
"question": "whether longer memory helps because more context is useful or because evaluations reward persistence over selectivity",
"confound": "context length and evaluation design",
"thesis": "Longer memory can look like better capability even when the benchmark mainly rewards persistence or repeated access to earlier state, so horizon effects may partly be evaluation-design effects.",
"insight": "A convincing result would show whether extending horizon changes task framing and error patterns, not just aggregate score.",
"contribution_shape": "Could produce a regime map for when horizon length matters, versus when selective retrieval or better task design explains the gain.",
"demotion": "This direction weakens if memory horizon has little effect once retrieval policy is controlled.",
"kill_signal": "memory-length changes stop mattering once retrieval triggers or evaluation design are normalized",
"missing_piece": "What is missing is a horizon study that matches retrieval policy and benchmark framing while varying how much state is retained.",
"program_kind": "regime mapping",
"time_to_clarity": "medium",
"priority_note": "Conceptually useful, but currently less decisive than retrieval-policy framing for a first discussion round.",
"reading_extract": "whether longer context actually changes retrieval choices, and whether the benchmark rewards persistence more than selective recall",
"title": "Does longer memory really help?",
},
}
@dataclass(frozen=True)
class IdeaSignal:
    signal_id: str
    cluster: str
    direction_type: str
    theme: str
    claim_or_observation: str
    tension: str
    missing_piece: str
    possible_axis: str
    academic_value: str
    evidence_confidence: str
    paper_ids: list[str]


@dataclass(frozen=True)
class DirectionCard:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    focus_axis: str
    main_confound: str
    program_kind: str
    contribution_shape: str
    time_to_clarity: str
    one_line_thesis: str
    why_interesting: str
    literature_suggests: list[str]
    closest_prior_gap: list[str]
    missing_piece: str
    possible_variants: list[str]
    academic_value: str
    first_probes: list[str]
    what_counts_as_insight: str
    weakness_conditions: list[str]
    kill_criteria: list[str]
    what_would_change_mind: list[str]
    best_fit: str
    why_this_ranks_here: str
    evidence_confidence: str
    paper_ids: list[str]
    signal_ids: list[str]
    anchor_reading_notes: list[dict[str, str]]


@dataclass(frozen=True)
class ScreenedDirection:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    total_score: float
    discussion_worthiness: int
    academic_value_score: int
    evidence_grounding: int
    direction_distinctness: int
    first_probe_clarity: int
    thesis_potential: int
    recommendation: str
    rationale: str
def read_core_set(path: Path) -> list[dict[str, str]]:
    """Read the core-set CSV into a list of row dicts; a missing file yields an empty list."""
    import csv

    if not path.exists():
        return []
    with path.open("r", encoding="utf-8", newline="") as handle:
        return [dict(row) for row in csv.DictReader(handle)]


def write_jsonl(path: Path, rows: Iterable[dict[str, Any] | Any]) -> None:
    """Write one JSON object per line; dataclass rows are converted with asdict()."""
    ensure_dir(path.parent)
    encoded: list[str] = []
    for row in rows:
        value = asdict(row) if is_dataclass(row) else row
        encoded.append(json.dumps(value, ensure_ascii=False))
    atomic_write_text(path, "\n".join(encoded).rstrip() + ("\n" if encoded else ""))


def write_json(path: Path, data: Any) -> None:
    ensure_dir(path.parent)
    atomic_write_text(path, json.dumps(data, ensure_ascii=False, indent=2).rstrip() + "\n")


def uniq_keep_order(items: Iterable[str]) -> list[str]:
    """Drop empty and duplicate items (after stripping) while keeping first-seen order."""
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or "").strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out


def slugify(text: str) -> str:
    """Lowercase `text` and collapse non-alphanumeric runs to "-"; fall back to "x" if nothing is left."""
    raw = re.sub(r"[^A-Za-z0-9]+", "-", str(text or "").strip().lower())
    return raw.strip("-") or "x"
def clean_text(text: str, *, limit: int = 220) -> str:
    """Normalize whitespace, replace table pipes, strip quotes, and clip at a word boundary."""
    s = str(text or "").strip()
    s = s.replace("\n", " ")
    s = re.sub(r"\s+", " ", s)
    # Pipes would break Markdown table cells, so replace them with commas.
    s = s.replace("|", ", ")
    s = s.strip(" \"'`")
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()


def clean_sentence(text: str, *, limit: int = 180) -> str:
    """Clip `text` to roughly `limit` characters, preferring sentence boundaries."""
    s = clean_text(text, limit=max(limit * 2, limit))
    if len(s) <= limit:
        return s
    # Look slightly past the limit so a sentence ending just after it can still be kept whole.
    window = s[:limit + 40]
    pieces = re.split(r'(?<=[.!?;])\s+', window)
    acc: list[str] = []
    total = 0
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue
        nxt = total + len(piece) + (1 if acc else 0)
        if nxt > limit and acc:
            break
        acc.append(piece)
        total = nxt
        # Stop once the kept sentences fill about two thirds of the budget.
        if total >= int(limit * 0.65):
            break
    if acc:
        return " ".join(acc).strip()
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()
def markdown_table(headers: list[str], rows: list[list[str]]) -> str:
    head = "| " + " | ".join(headers) + " |"
    sep = "| " + " | ".join(["---"] * len(headers)) + " |"
    body = ["| " + " | ".join(str(cell).replace("\n", " ").strip() for cell in row) + " |" for row in rows]
    return "\n".join([head, sep] + body)


def write_markdown(path: Path, text: str) -> None:
    ensure_dir(path.parent)
    atomic_write_text(path, text.rstrip() + "\n")


def extract_goal_from_goal_md(path: Path) -> str:
    if not path.exists():
        return "research ideas"
    lines = [ln.strip() for ln in path.read_text(encoding="utf-8", errors="ignore").splitlines() if ln.strip() and not ln.startswith("#")]
    return lines[0] if lines else "research ideas"


def _idea_workspace_from_brief(path: Path) -> Path | None:
    try:
        return path.resolve().parents[2]
    except Exception:
        return None


def _require_positive_float(value: Any, *, field_name: str) -> float:
    try:
        parsed = float(value)
    except Exception as exc:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
    if parsed <= 0:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return parsed
def _query_int_override(workspace: Path, key: str, default: int) -> int:
    """Return the positive-integer override for `key` from queries.md, or `default` if none applies."""
    queries_path = workspace / "queries.md"
    normalized = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
    if not normalized:
        return int(default)
    if normalized not in pipeline_overridable_query_fields(workspace):
        return int(default)
    if not queries_path.exists():
        return int(default)
    for raw in queries_path.read_text(encoding="utf-8", errors="ignore").splitlines():
        line = raw.strip()
        if not line.startswith("- ") or ":" not in line:
            continue
        # Use a separate name so the `key` parameter is not shadowed inside the loop.
        line_key, value = line[2:].split(":", 1)
        line_key = line_key.strip().lower().replace(" ", "_").replace("-", "_")
        if line_key != normalized:
            continue
        # Drop trailing `# comments` and surrounding quotes before parsing.
        cleaned = value.split("#", 1)[0].strip().strip('"').strip("'")
        if not cleaned:
            raise ValueError(f"Invalid ideation override: `{normalized}` is empty in queries.md.")
        try:
            parsed = int(cleaned)
        except Exception as exc:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be an integer in queries.md.") from exc
        if parsed <= 0:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be a positive integer in queries.md.")
        return parsed
    return int(default)
def _validate_score_weights(value: Any, *, field_name: str) -> dict[str, float]:
    """Require every score-weight key to be present and positive, then normalize the weights to sum to 1."""
    # Only the keys of this mapping are used below; the contract must supply each value.
    required = {
        "discussion_worthiness": 0.24,
        "academic_value": 0.22,
        "evidence_grounding": 0.18,
        "direction_distinctness": 0.16,
        "first_probe_clarity": 0.10,
        "thesis_potential": 0.10,
    }
    if not isinstance(value, dict):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    weights: dict[str, float] = {}
    for key in required:
        if key not in value:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{key}")
        weights[key] = _require_positive_float(value.get(key), field_name=f"{field_name}.{key}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return {key: round(weight / total, 6) for key, weight in weights.items()}


def _validate_diversity_axes(value: Any, *, field_name: str) -> list[str]:
    """Return the de-duplicated, normalized diversity axes; reject unknown axes and empty lists."""
    allowed = {"cluster", "direction_type", "program_kind"}
    if not isinstance(value, list):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    axes: list[str] = []
    seen: set[str] = set()
    for item in value:
        axis = str(item or "").strip().lower().replace("-", "_")
        if not axis:
            continue
        if axis not in allowed:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{axis}")
        if axis in seen:
            continue
        seen.add(axis)
        axes.append(axis)
    if not axes:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return axes
def resolve_idea_contract(workspace: Path) -> dict[str, Any]:
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None:
        raise ValueError("Missing active pipeline contract.")
    query_defaults = dict(spec.query_defaults)
    quality_contract = dict(spec.quality_contract)

    def _require_positive_int(value: Any, *, field_name: str) -> int:
        try:
            parsed = int(value)
        except Exception as exc:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
        if parsed <= 0:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
        return parsed

    signal_policy = quality_contract.get("signal_policy") or {}
    direction_policy = quality_contract.get("direction_policy") or {}
    screening_policy = quality_contract.get("screening_policy") or {}
    brief = parse_idea_brief(workspace / "output" / "trace" / "IDEA_BRIEF.md")
    focus_clusters = [str(x).strip() for x in (brief.get("focus_clusters") or []) if str(x).strip()]
    direction_pool_min_default = _require_positive_int(query_defaults.get("direction_pool_min"), field_name="query_defaults.direction_pool_min")
    direction_pool_max_default = _require_positive_int(query_defaults.get("direction_pool_max"), field_name="query_defaults.direction_pool_max")
    shortlist_size_default = _require_positive_int(query_defaults.get("idea_shortlist_size"), field_name="query_defaults.idea_shortlist_size")
    report_top_n_default = _require_positive_int(query_defaults.get("report_top_n"), field_name="query_defaults.report_top_n")
    idea_screen_top_n_default = _require_positive_int(query_defaults.get("idea_screen_top_n"), field_name="query_defaults.idea_screen_top_n")
    shortlist_min = _require_positive_int(direction_policy.get("shortlist_min"), field_name="quality_contract.direction_policy.shortlist_min")
    shortlist_max = _require_positive_int(direction_policy.get("shortlist_max"), field_name="quality_contract.direction_policy.shortlist_max")
    keep_min = _require_positive_int(direction_policy.get("keep_min"), field_name="quality_contract.direction_policy.keep_min")
    cluster_diversity_min = _require_positive_int(direction_policy.get("cluster_diversity_min"), field_name="quality_contract.direction_policy.cluster_diversity_min")
    lead_diversity_target = _require_positive_int(direction_policy.get("lead_diversity_target"), field_name="quality_contract.direction_policy.lead_diversity_target")
    lead_diversity_axes = _validate_diversity_axes(
        direction_policy.get("lead_diversity_axes"),
        field_name="quality_contract.direction_policy.lead_diversity_axes",
    )
    signal_table_min = _require_positive_int(signal_policy.get("min_rows"), field_name="quality_contract.signal_policy.min_rows")
    keep_rank_max = _require_positive_int(screening_policy.get("keep_rank_max"), field_name="quality_contract.screening_policy.keep_rank_max")
    maybe_rank_max = _require_positive_int(screening_policy.get("maybe_rank_max"), field_name="quality_contract.screening_policy.maybe_rank_max")
    score_weights = _validate_score_weights(
        screening_policy.get("score_weights"),
        field_name="quality_contract.screening_policy.score_weights",
    )
    direction_pool_min = _query_int_override(workspace, "direction_pool_min", direction_pool_min_default)
    direction_pool_max = _query_int_override(workspace, "direction_pool_max", direction_pool_max_default)
    shortlist_size = _query_int_override(workspace, "idea_shortlist_size", shortlist_size_default)
    report_top_n = _query_int_override(workspace, "report_top_n", report_top_n_default)
    idea_screen_top_n = _query_int_override(workspace, "idea_screen_top_n", idea_screen_top_n_default)
    if direction_pool_min > direction_pool_max:
        raise ValueError(f"Invalid ideation contract: direction_pool_min ({direction_pool_min}) exceeds direction_pool_max ({direction_pool_max}).")
    if shortlist_size < shortlist_min or shortlist_size > shortlist_max:
        raise ValueError(
            f"Invalid ideation contract: idea_shortlist_size ({shortlist_size}) must stay within "
            f"[{shortlist_min}, {shortlist_max}]."
        )
    if report_top_n > shortlist_size:
        raise ValueError(f"Invalid ideation contract: report_top_n ({report_top_n}) exceeds idea_shortlist_size ({shortlist_size}).")
    if maybe_rank_max < keep_rank_max:
        raise ValueError(
            f"Invalid ideation contract: maybe_rank_max ({maybe_rank_max}) must be >= keep_rank_max ({keep_rank_max})."
        )
    if idea_screen_top_n > direction_pool_max:
        raise ValueError(f"Invalid ideation contract: idea_screen_top_n ({idea_screen_top_n}) exceeds direction_pool_max ({direction_pool_max}).")
    return {
        "focus_clusters": focus_clusters,
        "direction_pool_min": direction_pool_min,
        "direction_pool_max": direction_pool_max,
        "idea_screen_top_n": idea_screen_top_n,
        "shortlist_size": shortlist_size,
        "report_top_n": report_top_n,
        "shortlist_min": shortlist_min,
        "shortlist_max": shortlist_max,
        "signal_table_min": signal_table_min,
        "keep_min": keep_min,
        "cluster_diversity_min": cluster_diversity_min,
        "lead_diversity_target": lead_diversity_target,
        "lead_diversity_axes": lead_diversity_axes,
        "keep_rank_max": keep_rank_max,
        "maybe_rank_max": maybe_rank_max,
        "score_weights": score_weights,
    }
def parse_idea_brief(path: Path) -> dict[str, Any]:
    text = path.read_text(encoding="utf-8", errors="ignore") if path.exists() else ""
    out: dict[str, Any] = {
        "goal": extract_goal_from_goal_md(path),
        "focus_clusters": [],
        "query_buckets": [],
        "exclusions": [],
        "constraints": [],
        "targets": {},
    }
    cur = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith("## "):
            cur = line[3:].strip().lower()
            continue
        if cur == "goal" and line.startswith("- Topic:"):
            out["goal"] = line.split(":", 1)[1].strip()
        if cur in {"focus after c2", "focus lenses after c2"} and line.startswith("- Focus clusters:"):
            clusters = [x.strip() for x in line.split(":", 1)[1].split(";") if x.strip()]
            cleaned: list[str] = []
            for item in clusters:
                low = item.lower()
                if "to be filled" in low or "fill after c2" in low or "placeholder" in low:
                    continue
                cleaned.append(item)
            out["focus_clusters"] = cleaned
        if cur == "query buckets" and re.match(r"^\d+\.\s+", line):
            out["query_buckets"].append(re.sub(r"^\d+\.\s+", "", line).strip())
        if cur in {"exclude terms", "exclusions"} and line.startswith("- "):
            value = line[2:].strip()
            if value and not value.lower().startswith("none"):
                out["exclusions"].append(value)
        if cur == "constraints" and line.startswith("- "):
            out["constraints"].append(line[2:].strip())
        if cur == "targets" and line.startswith("- ") and ":" in line:
            key, value = line[2:].split(":", 1)
            out["targets"][key.strip().lower().replace(" ", "_")] = value.strip()
    return out
def collect_note_index(path: Path) -> dict[str, dict[str, Any]]:
    """Index note records by `paper_id`; a later record with the same id replaces the earlier one."""
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        pid = str(rec.get("paper_id") or "").strip()
        if pid:
            out[pid] = rec
    return out


def keywords_from_cluster(name: str) -> list[str]:
    toks = [t for t in re.findall(r"[A-Za-z0-9]+", str(name or "").lower()) if t not in STOPWORDS and len(t) > 2]
    return uniq_keep_order(toks)


def score_note_to_cluster(cluster_name: str, note: dict[str, Any]) -> int:
    """Count how many cluster keywords occur as substrings of the note's text fields."""
    keys = keywords_from_cluster(cluster_name)
    blob_parts = [str(note.get("title") or "")]
    for field in ["summary_bullets", "limitations", "key_results", "method", "abstract"]:
        val = note.get(field)
        if isinstance(val, list):
            blob_parts.extend([str(x) for x in val])
        elif isinstance(val, str):
            blob_parts.append(val)
    blob = " ".join(blob_parts).lower()
    return sum(1 for key in keys if key in blob)


def map_notes_to_clusters(taxonomy_path: Path, notes_path: Path) -> dict[str, list[dict[str, Any]]]:
    """Map each second-level taxonomy node to its top 8 keyword-matching notes."""
    import yaml

    taxonomy = yaml.safe_load(taxonomy_path.read_text(encoding="utf-8", errors="ignore")) if taxonomy_path.exists() else []
    notes = [r for r in read_jsonl(notes_path) if isinstance(r, dict)]
    out: dict[str, list[dict[str, Any]]] = {}
    for top in taxonomy or []:
        if not isinstance(top, dict):
            continue
        for child in top.get("children") or []:
            if not isinstance(child, dict):
                continue
            name = str(child.get("name") or "").strip()
            if not name:
                continue
            scored: list[tuple[int, dict[str, Any]]] = []
            for note in notes:
                s = score_note_to_cluster(name, note)
                if s > 0:
                    scored.append((s, note))
            # Highest score first; ties are broken by paper_id so the order is deterministic.
            scored.sort(key=lambda x: (-x[0], str(x[1].get("paper_id") or "")))
            out[name] = [n for _, n in scored[:8]]
    return out
def _cluster_profile(cluster: str) -> tuple[str, list[str], str]:
    low = cluster.lower()
    for keys, axes, direction_type, academic_value in CLUSTER_AXIS_HINTS:
        if any(key in low for key in keys):
            return direction_type, axes, academic_value
    return "research", ["assumption sensitivity", "failure analysis", "scope boundary"], "Could sharpen the way this sub-area is framed and compared."


def _note_bullets(note: dict[str, Any], field: str) -> list[str]:
    val = note.get(field)
    if isinstance(val, list):
        return [clean_text(x, limit=220) for x in val if clean_text(x, limit=220)]
    if isinstance(val, str) and clean_text(val, limit=220):
        return [clean_text(val, limit=220)]
    return []


def _specific_limitations(note: dict[str, Any]) -> list[str]:
    items = _note_bullets(note, "limitations")
    specific = []
    for item in items:
        low = item.lower()
        if any(pat in low for pat in GENERIC_LIMITATION_PATTERNS):
            continue
        specific.append(item)
    return specific or items


def _evidence_confidence(notes: list[dict[str, Any]]) -> str:
    if not notes:
        return "low"
    if any(str(n.get("evidence_level") or "").strip() == "fulltext" for n in notes):
        return "medium-high"
    specific = sum(
        1
        for n in notes
        if _specific_limitations(n) and not any(p in (_specific_limitations(n)[0].lower()) for p in GENERIC_LIMITATION_PATTERNS)
    )
    if len(notes) >= 3 and specific >= 2:
        return "medium"
    return "low-medium"
def _axis_profile(axis: str, cluster: str) -> dict[str, str]:
    base = AXIS_INSIGHT_LIBRARY.get(axis, {})
    confound = base.get("confound", "nearby design choices and evaluation framing")
    return {
        "question": base.get("question", f"whether {axis} is doing more conceptual work in {cluster.lower()} than current papers make explicit"),
        "confound": confound,
        "thesis": base.get("thesis", f"Current results in {cluster.lower()} may be hard to interpret because {axis} still moves together with {confound}."),
        "insight": base.get("insight", f"A meaningful result would show whether {axis} changes how we interpret the current literature rather than merely how we narrate it."),
        "contribution_shape": base.get("contribution_shape", f"Could turn {axis} into a cleaner explanatory variable for {cluster.lower()} rather than a background convenience."),
        "demotion": base.get("demotion", f"This direction would weaken if the strongest prior work already isolates {axis} cleanly."),
        "kill_signal": base.get("kill_signal", f"the strongest prior work already isolates {axis} against {confound}"),
        "missing_piece": base.get("missing_piece", f"What is missing is a cleaner comparison that varies {axis} while keeping {confound} fixed enough to change interpretation, not just presentation."),
        "program_kind": base.get("program_kind", "mechanism clarification"),
        "time_to_clarity": base.get("time_to_clarity", "medium"),
        "priority_note": base.get("priority_note", f"Worth discussion because it could turn {axis} into a sharper explanatory wedge for {cluster.lower()}."),
        "reading_extract": base.get("reading_extract", f"which variables change together with {axis}, what comparator is used, and whether any ablation already fixes {confound}"),
        "title": base.get("title", f"What {axis} is really doing"),
    }
def _result_fact(note: dict[str, Any]) -> str:
    candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
    scored: list[str] = []
    for cand in candidates:
        low = cand.lower()
        if any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "outperform", "achiev", "%"]) or any(ch.isdigit() for ch in cand):
            scored.append(cand)
    chosen = scored[0] if scored else _claim_text(note)
    chosen = re.sub(r"^(for instance|for example|e\.g\.)[:,]?\s*", "", chosen, flags=re.IGNORECASE)
    low = chosen.lower()
    tasks = _extract_task_mentions(note)
    task_text = "/".join(tasks[:2]) if tasks else "reported setting"
    percents = re.findall(r"\d+(?:\.\d+)?%", chosen)
    if "pass@1" in low and percents:
        if len(percents) >= 2:
            return clean_text(f"{task_text}: {percents[0]} pass@1 vs {percents[1]} in the reported comparison.", limit=95)
        return clean_text(f"{task_text}: {percents[0]} pass@1 in the reported comparison.", limit=95)
    if "success rate" in low and len(percents) >= 2:
        baseline = " over imitation/RL baselines" if "imitation" in low or "reinforcement learning" in low else " in the reported comparison"
        return clean_text(f"{task_text}: {percents[0]}/{percents[1]} success-rate gains{baseline}.", limit=95)
    if len(percents) >= 2:
        return clean_text(f"{task_text}: {percents[0]} vs {percents[1]} in the reported comparison.", limit=95)
    if len(percents) == 1:
        return clean_text(f"{task_text}: reported result reaches {percents[0]} in the abstract.", limit=95)
    return clean_text(chosen, limit=95)


def _metric_phrase(notes: list[dict[str, Any]]) -> str:
    blob = " ".join(_result_fact(note) for note in notes)
    low = blob.lower()
    if "pass@1" in low:
        return "pass@1 plus failure-type shifts"
    if "success rate" in low:
        return "success rate plus failure-type shifts"
    if "accuracy" in low:
        return "accuracy plus failure-type shifts"
    if "exact match" in low:
        return "exact match plus failure-type shifts"
    if "f1" in low:
        return "F1 plus failure-type shifts"
    if "%" in blob:
        return "the reported benchmark metric plus failure-type shifts"
    return "task success plus failure-type shifts"


def _sentence_has_concrete_hook(text: str) -> bool:
    low = str(text or "").lower()
    if any(ch.isdigit() for ch in str(text or "")):
        return True
    if any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "alfworld", "webshop", "hotpotqa", "fever", "humaneval", "webarena", "agentbench"]):
        return True
    return False
def _extract_task_mentions(note: dict[str, Any]) -> list[str]:
    blob_parts = []
    for field in ["key_results", "summary_bullets", "method", "abstract", "title"]:
        val = note.get(field)
        if isinstance(val, list):
            blob_parts.extend([str(x) for x in val])
        elif isinstance(val, str):
            blob_parts.append(val)
    blob = " ".join(blob_parts)
    low = blob.lower()
    positions: list[tuple[int, str]] = []
    for hint in BENCHMARK_HINTS:
        pos = low.find(hint.lower())
        if pos >= 0:
            positions.append((pos, hint))
    for match in re.finditer(r"\b(?:HumanEval|ALFWorld|WebShop|HotpotQA|FEVER|WebArena|AgentBench|GSM8K|MATH)\b", blob):
        positions.append((match.start(), match.group(0)))
    positions.sort(key=lambda item: (item[0], item[1]))
    return uniq_keep_order(name for _, name in positions)


def _task_phrase(notes: list[dict[str, Any]]) -> str:
    tasks: list[str] = []
    for note in notes:
        tasks.extend(_extract_task_mentions(note))
    uniq = uniq_keep_order(tasks)
    if not uniq:
        return "a small public task slice"
    if len(uniq) == 1:
        return uniq[0]
    return "/".join(uniq[:2])


def _claim_text(note: dict[str, Any]) -> str:
    candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
    for cand in candidates:
        if len(cand.split()) >= 8:
            return cand
    return clean_text(note.get("method") or note.get("title") or "", limit=220)


def _limitation_text(note: dict[str, Any], axis: str, cluster: str) -> str:
    limitations = _specific_limitations(note)
    if limitations and not any(pat in limitations[0].lower() for pat in GENERIC_LIMITATION_PATTERNS):
        return limitations[0]
    profile = _axis_profile(axis, cluster)
    return f"The paper still leaves unclear whether {axis} is genuinely responsible for the reported story in {cluster.lower()}, or whether that story is really driven by {profile['confound']}."


def _anchor_notes(note_index: dict[str, dict[str, Any]], paper_ids: list[str]) -> list[dict[str, Any]]:
    return [note_index[pid] for pid in paper_ids if pid in note_index]


def _paper_annotation(note: dict[str, Any], axis: str, cluster: str) -> str:
    title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
    profile = _axis_profile(axis, cluster)
    result = _result_fact(note)
    return clean_sentence(
        f"{title}: {result} Open gap: {axis} still moves with {profile['confound']}.",
        limit=300,
    )


def _synthesis_annotation(notes: list[dict[str, Any]], axis: str, cluster: str) -> str:
    profile = _axis_profile(axis, cluster)
    task_phrase = _task_phrase(notes)
    return clean_sentence(
        f"Across the anchor papers, the live question is whether {axis} changes interpretation itself or merely rides along with {profile['confound']}, especially on {task_phrase}.",
        limit=260,
    )
def build_signal_rows(*, cluster: str, notes: list[dict[str, Any]]) -> list[IdeaSignal]:
    if not notes:
        return []
    direction_type, axes, academic_value = _cluster_profile(cluster)
    evidence_confidence = _evidence_confidence(notes)
    anchors = notes[:3]
    paper_ids = uniq_keep_order(str(n.get("paper_id") or "").strip() for n in anchors)
    signals: list[IdeaSignal] = []
    for idx, axis in enumerate(axes[:3], start=1):
        anchor = anchors[(idx - 1) % len(anchors)]
        claim = _claim_text(anchor)
        tension = _limitation_text(anchor, axis, cluster)
        profile = _axis_profile(axis, cluster)
        missing_piece = clean_sentence(
            f"What is still missing is a direction that isolates whether {axis} is the real explanatory variable in {cluster.lower()}, rather than leaving it entangled with {profile['confound']}.",
            limit=240,
        )
        signals.append(IdeaSignal(
            signal_id=f"SIG-{slugify(cluster)[:16]}-{idx}",
            cluster=cluster,
            direction_type=direction_type,
            theme=f"{profile['title']} in {cluster}",
            claim_or_observation=claim,
            tension=clean_text(tension, limit=220),
            missing_piece=missing_piece,
            possible_axis=axis,
            academic_value=academic_value,
            evidence_confidence=evidence_confidence,
            paper_ids=paper_ids,
        ))
    return signals


def signal_table_markdown(rows: list[IdeaSignal]) -> str:
    table_rows = []
    for row in rows:
        table_rows.append([
            row.signal_id,
            row.cluster,
            row.theme,
            row.claim_or_observation,
            row.tension,
            row.missing_piece,
            row.possible_axis,
            row.academic_value,
            row.evidence_confidence,
            ", ".join(row.paper_ids),
        ])
    return "\n".join([
        "# IDEA_SIGNAL_TABLE",
        "",
        "## Research signals",
        "",
        markdown_table(["Signal ID", "Cluster", "Theme", "Claim / observation", "Tension", "Missing piece", "Possible axis", "Academic value", "Confidence", "Paper IDs"], table_rows),
        "",
    ])


def _title_from_signal(signal: IdeaSignal) -> str:
    profile = _axis_profile(signal.possible_axis, signal.cluster)
    return clean_sentence(profile["title"], limit=72)


def _best_fit(direction_type: str, axis: str) -> str:
    if direction_type == "governance":
        return "Best fit when the group wants a deployment-relevant reading of agent behavior rather than another internal benchmark story."
    if direction_type == "evaluation":
        return "Best fit when the discussion is really about what counts as convincing evidence, not just how to raise scores."
    if axis in {"search depth", "verification loop", "retrieval policy"}:
        return "Best fit for a PhD discussion that wants one sharper mechanism/confound question rather than a broad systems buildout."
    return "Best fit for PI/PhD discussions that want a thesis-worthy mechanism question rather than a generic benchmark wrapper."
def signals_to_direction_cards(signals: list[IdeaSignal], *, note_index: dict[str, dict[str, Any]], focus_clusters: list[str], pool_min: int, pool_max: int) -> list[DirectionCard]:
    focus = {x.strip() for x in focus_clusters if str(x).strip()}
    cards: list[DirectionCard] = []
    for idx, signal in enumerate(signals, start=1):
        profile = _axis_profile(signal.possible_axis, signal.cluster)
        anchors = _anchor_notes(note_index, signal.paper_ids)
        task_phrase = _task_phrase(anchors)
        metric_phrase = _metric_phrase(anchors)
        nearest = anchors[0] if anchors else {}
        nearest_title = clean_text((nearest.get("title") if isinstance(nearest, dict) else "") or (signal.paper_ids[0] if signal.paper_ids else signal.cluster), limit=90)
        nearest_task_phrase = _task_phrase([nearest]) if nearest else task_phrase
        literature_suggests = [_paper_annotation(note, signal.possible_axis, signal.cluster) for note in anchors[:2]]
        literature_suggests.append(_synthesis_annotation(anchors, signal.possible_axis, signal.cluster))
        closest_prior_gap = [
            clean_sentence(
                f"{nearest_title} is the closest prior anchor because it already reports concrete behavior on {nearest_task_phrase}. The unresolved point is whether those gains survive once {profile['confound']} is held fixed while {signal.possible_axis} is varied.",
                limit=220,
            ),
            clean_sentence(
                "The novelty test here is narrow, not rhetorical: if a strong anchor paper already runs that single-variable control, this direction should collapse quickly rather than stay alive as a vague confound story.",
                limit=220,
            ),
        ]
        one_line_thesis = clean_sentence(profile["thesis"], limit=230)
        why_interesting = clean_sentence(
            f"This is not just another benchmark wedge. It opens a {profile['program_kind']} line around {signal.possible_axis}: {profile['question']}.",
            limit=220,
        )
        missing_piece = clean_sentence(profile["missing_piece"], limit=230)
        possible_variants = [
            clean_sentence(f"Single-variable control on {task_phrase}: vary {signal.possible_axis} while holding {profile['confound']} fixed as far as the setup allows.", limit=210),
            clean_sentence("Replace score-only reporting with a failure-type comparison after the same intervention, so the result says more than whether the average went up.", limit=210),
            clean_sentence("Check whether the conclusion survives on one simple public task slice and one more tool- or environment-heavy setting before treating it as a general claim.", limit=210),
        ]
        first_probes = [
            clean_sentence(
                f"Intervention: vary {signal.possible_axis} while holding {profile['confound']} as fixed as possible on {task_phrase}. Readout: {metric_phrase}. Decisive if the interpretation changes even after the control.",
                limit=220,
            ),
            clean_sentence(
                f"Prior-work audit: inspect {nearest_title} for any ablation that already fixes {profile['confound']}, and if the conclusion survives, demote this direction.",
                limit=180,
            ),
        ]
        what_counts_as_insight = clean_sentence(profile['insight'], limit=190)
        kill_criteria = [
            clean_sentence(f"Kill quickly if {profile['kill_signal']}.", limit=170),
            clean_sentence(f"Kill if the first controlled probe leaves both {metric_phrase} and the failure taxonomy essentially unchanged.", limit=170),
        ]
        weakness_conditions = [
            clean_sentence(f"This direction is weaker if changing {signal.possible_axis} mostly rescales the aggregate metric without changing which failure modes appear or disappear.", limit=175),
            clean_sentence(profile['demotion'], limit=170),
        ]
        anchor_reading_notes: list[dict[str, str]] = []
        for note in anchors[:3]:
            title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
            result = _result_fact(note)
            anchor_reading_notes.append({
                "paper_title": title,
                "why_read": clean_sentence(f"Closest {signal.cluster.lower()} anchor with a concrete result hook on {task_phrase}: {result}", limit=180),
                "what_to_extract": clean_sentence(f"Extract the metric and comparator, then check whether {profile['reading_extract']}.", limit=180),
                "current_hook": clean_sentence(result, limit=180),
                "kill_signal": clean_sentence(f"Weaken this direction if the paper already shows {profile['kill_signal']}.", limit=180),
            })
        why_this_ranks_here = clean_sentence(profile['priority_note'], limit=170)
        cards.append(DirectionCard(
            direction_id=f"DIR-{idx:03d}",
            cluster=signal.cluster,
            direction_type=signal.direction_type,
            title=_title_from_signal(signal),
            focus_axis=signal.possible_axis,
            main_confound=profile['confound'],
            program_kind=profile['program_kind'],
            contribution_shape=profile['contribution_shape'],
            time_to_clarity=profile['time_to_clarity'],
            one_line_thesis=one_line_thesis,
            why_interesting=why_interesting,
            literature_suggests=literature_suggests,
            closest_prior_gap=closest_prior_gap,
            missing_piece=missing_piece,
            possible_variants=possible_variants,
            academic_value=profile['contribution_shape'],
            first_probes=first_probes,
            what_counts_as_insight=what_counts_as_insight,
            weakness_conditions=weakness_conditions,
            kill_criteria=kill_criteria,
            what_would_change_mind=kill_criteria,
            best_fit=_best_fit(signal.direction_type, signal.possible_axis),
            why_this_ranks_here=why_this_ranks_here,
            evidence_confidence=signal.evidence_confidence,
            paper_ids=signal.paper_ids,
            signal_ids=[signal.signal_id],
            anchor_reading_notes=anchor_reading_notes,
        ))
    cards.sort(key=lambda c: (0 if c.cluster in focus else 1, c.cluster, c.program_kind, c.title))
    target = max(pool_min, min(pool_max, len(cards)))
    return cards[:target]


def direction_pool_markdown(cards: list[DirectionCard]) -> str:
    rows: list[list[str]] = []
    for card in cards:
        rows.append([
            card.direction_id,
            card.cluster,
            card.direction_type,
            card.program_kind,
            card.title,
            clean_sentence(card.one_line_thesis, limit=100),
            clean_sentence(card.why_interesting, limit=100),
            clean_sentence(card.missing_piece, limit=100),
            clean_sentence(" / ".join(card.possible_variants), limit=120),
            clean_sentence(card.academic_value, limit=90),
            clean_sentence(" / ".join(card.first_probes), limit=120),
            card.evidence_confidence,
            ", ".join(card.paper_ids),
        ])
    return "\n".join([
        "# IDEA_DIRECTION_POOL",
        "",
        "## Candidate research directions",
        "",
        markdown_table(["Direction ID", "Cluster", "Type", "Program", "Title", "One-line thesis", "Why interesting", "Missing piece", "Possible variants", "Academic value", "First probes", "Confidence", "Paper IDs"], rows),
        "",
    ])


def _thesis_potential(direction_type: str, cluster: str) -> int:
    if direction_type in {"mechanism", "coordination", "adaptation"}:
        return 5
    if direction_type in {"systems", "governance"}:
        return 4
    if "benchmark" in cluster.lower() or direction_type == "evaluation":
        return 3
    return 4


def score_direction_cards(
    cards: list[DirectionCard],
    *,
    focus_clusters: list[str],
    keep_rank_max: int,
    maybe_rank_max: int,
    score_weights: dict[str, float] | None = None,
) -> list[ScreenedDirection]:
    focus = {x.strip() for x in focus_clusters if str(x).strip()}
    keep_rank_max = int(keep_rank_max)
    maybe_rank_max = int(maybe_rank_max)
    if keep_rank_max <= 0:
        raise ValueError("Invalid ideation scoring contract: keep_rank_max must be positive.")
    if maybe_rank_max < keep_rank_max:
        raise ValueError("Invalid ideation scoring contract: maybe_rank_max must be >= keep_rank_max.")
    weights = _validate_score_weights(score_weights or {}, field_name="score_direction_cards.score_weights")
    cluster_counts: dict[str, int] = {}
    program_counts: dict[str, int] = {}
    for card in cards:
        cluster_counts[card.cluster] = cluster_counts.get(card.cluster, 0) + 1
        program_counts[card.program_kind] = program_counts.get(card.program_kind, 0) + 1
    provisional: list[tuple[DirectionCard, float, int, int, int, int, int, int]] = []
    for card in cards:
        discussion_worthiness = 5 if card.cluster in focus or card.time_to_clarity == "fast" else 4
        academic_value_score = 5 if any(token in card.contribution_shape.lower() for token in ["rule", "protocol", "regime map", "causal"]) else 4
        concrete_hooks = sum(1 for item in card.literature_suggests if _sentence_has_concrete_hook(item))
        evidence_grounding = 5 if len(card.paper_ids) >= 3 and concrete_hooks >= 2 else 4 if len(card.paper_ids) >= 2 and concrete_hooks >= 1 else 3
        if program_counts.get(card.program_kind, 0) == 1 and cluster_counts.get(card.cluster, 0) == 1:
            direction_distinctness = 5
        elif program_counts.get(card.program_kind, 0) == 1 or cluster_counts.get(card.cluster, 0) == 1:
            direction_distinctness = 4
        else:
            direction_distinctness = 3
        probe_blob = " ".join(card.first_probes).lower()
        first_probe_clarity = 5 if all(token in probe_blob for token in ["intervention:", "readout:", "decisive if"]) else 4 if "intervention:" in probe_blob else 3
        thesis_potential = _thesis_potential(card.direction_type, card.cluster)
        total = round(
            discussion_worthiness * weights["discussion_worthiness"]
            + academic_value_score * weights["academic_value"]
            + evidence_grounding * weights["evidence_grounding"]
            + direction_distinctness * weights["direction_distinctness"]
            + first_probe_clarity * weights["first_probe_clarity"]
            + thesis_potential * weights["thesis_potential"],
            2,
        )
        provisional.append((card, total, discussion_worthiness, academic_value_score, evidence_grounding, direction_distinctness, first_probe_clarity, thesis_potential))
    provisional.sort(key=lambda row: (-row[1], 0 if row[0].time_to_clarity == "fast" else 1, row[0].cluster, row[0].title))
    rows: list[ScreenedDirection] = []
    for idx, (card, total, dw, av, eg, dd, fp, tp) in enumerate(provisional, start=1):
        recommendation = "keep" if idx <= keep_rank_max else "maybe" if idx <= maybe_rank_max else "drop"
        strengths: list[str] = []
        if eg >= 5:
            strengths.append("concrete anchor evidence")
        if dd >= 5:
            strengths.append(f"distinct {card.program_kind} wedge")
        if fp >= 5:
            strengths.append("clean first probe")
        if av >= 5:
            strengths.append("thesis-sized payoff")
        if not strengths:
            strengths.append("discussion value")
        risk = "still abstract-first" if str(card.evidence_confidence).startswith("low") else f"main risk is controlling {card.main_confound} cleanly"
        rationale = clean_sentence(f"Strongest on {' + '.join(strengths[:2])}; {risk}.", limit=160)
        rows.append(ScreenedDirection(
            direction_id=card.direction_id,
            cluster=card.cluster,
            direction_type=card.direction_type,
            title=card.title,
            total_score=total,
            discussion_worthiness=dw,
            academic_value_score=av,
            evidence_grounding=eg,
            direction_distinctness=dd,
            first_probe_clarity=fp,
            thesis_potential=tp,
            recommendation=recommendation,
            rationale=rationale,
        ))
    return rows


def screening_table_markdown(rows: list[ScreenedDirection]) -> str:
    data: list[list[str]] = []
    for row in rows:
        data.append([
            row.direction_id,
            row.cluster,
            row.direction_type,
            row.title,
            f"{row.total_score:.2f}",
            str(row.discussion_worthiness),
            str(row.academic_value_score),
            str(row.evidence_grounding),
            str(row.direction_distinctness),
            str(row.first_probe_clarity),
            str(row.thesis_potential),
            row.recommendation,
            row.rationale,
        ])
    return "\n".join([
        "# IDEA_SCREENING_TABLE",
        "",
        "## Discussion-first screening",
        "",
        markdown_table(["Direction ID", "Cluster", "Type", "Title", "Total", "Discussion", "Academic value", "Evidence", "Distinctness", "First probe", "Thesis potential", "Decision", "Rationale"], data),
        "",
    ])


def shortlist_markdown(records: list[dict[str, Any]]) -> str:
    lines = ["# IDEA_SHORTLIST", "", "## Prioritized directions", ""]
    for record in records:
        lines.extend([
            f"### Direction {record['rank']}. {record['title']}",
            f"- Cluster: {record['cluster']}",
            f"- Type: {record['direction_type']}",
            f"- Focus axis: {record.get('focus_axis')}",
            f"- Program kind: {record.get('program_kind')}",
            f"- Main confound: {record.get('main_confound')}",
            f"- Time to clarity: {record.get('time_to_clarity')}",
            f"- One-line thesis: {record['one_line_thesis']}",
            f"- Why this is interesting: {record['why_interesting']}",
            f"- Why this ranks here: {record['why_this_ranks_here']}",
            f"- Contribution shape: {record.get('contribution_shape')}",
            "- What the literature already suggests:",
        ])
        for item in record.get("literature_suggests") or []:
            lines.append(f" - {item}")
        lines.append("- Closest prior work and why it does not settle the question:")
        for item in record.get("closest_prior_gap") or []:
            lines.append(f" - {item}")
        lines.extend([
            f"- What is still missing: {record['missing_piece']}",
            "- Possible variants:",
        ])
        for item in record.get("possible_variants") or []:
            lines.append(f" - {item}")
        lines.extend([
            f"- Why this could matter academically: {record['academic_value']}",
            "- First probes:",
        ])
        for item in record.get("first_probes") or []:
            lines.append(f" - {item}")
        lines.extend([
            f"- What would count as actual insight: {record['what_counts_as_insight']}",
            "- What would make this weak or unconvincing:",
        ])
        for item in record.get("weakness_conditions") or []:
            lines.append(f" - {item}")
        lines.append("- Quick kill criteria:")
        for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
            lines.append(f" - {item}")
        lines.extend([
            f"- Best fit: {record['best_fit']}",
            f"- Evidence confidence: {record['evidence_confidence']}",
            "- Anchor papers: " + ", ".join(f"`{pid}`" for pid in record.get("paper_ids") or []),
            f"- Why prioritized now: {record['why_prioritized']}",
            "",
        ])
    return "\n".join(lines)


def shortlist_snapshot_table(records: list[dict[str, Any]]) -> str:
    rows = []
    for rec in records:
        rows.append([
            str(rec.get("rank") or ""),
            rec.get("title", ""),
            clean_sentence(rec.get("why_this_ranks_here", ""), limit=52),
            clean_sentence(rec.get("contribution_shape", rec.get("academic_value", "")), limit=56),
            clean_sentence((rec.get("kill_criteria") or rec.get("what_would_change_mind") or [""])[0], limit=54),
        ])
    return markdown_table(["Rank", "Direction", "Why now", "If it survives", "Fast kill signal"], rows)


def build_report_payload(*, topic: str, shortlist: list[dict[str, Any]], deferred: list[dict[str, Any]], trace_paths: dict[str, str]) -> dict[str, Any]:
    top = list(shortlist)
    takeaways: list[str] = []
    if top:
        takeaways.append("The strongest directions are the ones most likely to change how existing results are interpreted, not just add another benchmark win.")
        takeaways.append("The current rank order reflects a tradeoff between time-to-clarity, thesis-sized payoff, and how concrete the nearest prior-work gap already looks.")
        program_kinds = uniq_keep_order(str(rec.get("program_kind") or "").strip() for rec in top)
        if len(program_kinds) >= 2:
            takeaways.append(f"The lead set is intentionally not one confound template repeated three times: it spans {', '.join(program_kinds[:3])} rather than a single explanatory mold.")
        else:
            takeaways.append("The lead set still sits in one tight neighborhood, so the next reading pass should stress-test whether those directions are genuinely distinct thesis lines.")
        if any(str(rec.get("evidence_confidence") or "").startswith("low") for rec in top):
            takeaways.append("Most anchors are still abstract-first, so the memo is best used to decide the next reading and falsification pass rather than to lock a project immediately.")
        else:
            takeaways.append("The evidence is already concrete enough for a serious PI/PhD discussion, though one deeper paper pass could still reshuffle the exact order.")
    lead = top[0] if top else {}
    lead_anchor = (lead.get("anchor_reading_notes") or [{}])[0] if top else {}
    lead_anchor_title = lead_anchor.get("paper_title") if isinstance(lead_anchor, dict) else "the first anchor paper"
    discussion_questions = [
        "Which direction still survives once we ask for a single-variable control rather than a suggestive confound story?",
        "Which lead direction would still matter if the nearest prior work already addressed the obvious control we are worried about?",
        "Which candidate could plausibly turn into a thesis line with a reusable method, protocol, or regime map rather than a one-off empirical note?",
        "Which ranking would change most after one full-paper reading pass on the anchor set?",
    ]
    uncertainties = [
        "The main remaining risk is novelty risk, not idea scarcity: closer reading may reveal that one lead direction is already settled by an existing control or ablation.",
        "Evidence confidence is still bounded by abstract-first notes for part of the lead set, so paper-specific details could either strengthen or kill a direction quickly.",
        "The memo is more reliable as a discussion and triage artifact than as a final commitment on exact project order.",
    ]
    next_steps = []
    if top:
        next_steps.append(clean_sentence(f"Start with {lead.get('title')}: read {lead_anchor_title or 'the first anchor paper'} looking specifically for whether {lead.get('focus_axis')} is already isolated against {lead.get('main_confound')}.", limit=220))
        first_probe = (lead.get("first_probes") or [""])[0]
        if first_probe:
            next_steps.append(clean_sentence(first_probe, limit=220))
        next_steps.append(clean_sentence(f"Re-rank only after checking the quick kill criteria for {lead.get('title')} and at least one competing direction in the lead set.", limit=220))
    else:
        next_steps.extend([
            "Read the anchor papers for the current top direction in full and test whether the memo's stated hidden variable still looks unresolved.",
            "In the next PI/PhD discussion, use the ranking arguments and kill criteria to decide whether any direction deserves promotion into a sharper thesis candidate.",
        ])
    return {
        "topic": topic,
        "takeaways": takeaways,
        "top_directions": top,
        "deferred_directions": deferred,
        "discussion_questions": discussion_questions,
        "uncertainties": uncertainties,
        "next_steps": next_steps,
        "trace_artifacts": trace_paths,
    }


def report_markdown(payload: dict[str, Any]) -> str:
    top_dirs = payload.get("top_directions") or []
    # Sections 0-2 are fixed; each top direction takes one section starting at 3.
    deferred_idx = 3 + len(top_dirs)
    discussion_idx = deferred_idx + 1
    uncertainty_idx = deferred_idx + 2
    next_idx = deferred_idx + 3
    appendix_idx = deferred_idx + 4
    lines = [
        "# Research Idea Brainstorm Memo",
        "",
        "## 0. Scope and framing",
        "",
        f"- Topic: {payload.get('topic') or 'research ideas'}",
        "- Intended readers: PI / PhD",
        "- Goal: surface a small number of discussion-worthy research directions rather than force a final project choice.",
        "- Current evidence basis: abstract-first notes unless otherwise stated.",
        "- What this memo is not: not a final project spec, not a survey draft, and not a symmetric top-3 proposal pack.",
        "",
        "## 1. Big-picture takeaways",
        "",
    ]
    for item in payload.get("takeaways") or []:
        lines.append(f"- {item}")
    lines.extend([
        "",
        "## 2. Top directions at a glance",
        "",
        shortlist_snapshot_table(payload.get("top_directions") or []),
        "",
    ])
    for idx, record in enumerate(top_dirs, start=1):
        lines.extend([
            f"## {idx + 2}. Direction {idx} — {record.get('title')}",
            "",
            "### One-line thesis",
            f"- {record.get('one_line_thesis')}",
            "",
            "### Why it belongs in the lead set",
            f"- {record.get('why_this_ranks_here')}",
            f"- Time to clarity: {record.get('time_to_clarity')}",
            "",
            "### What the current literature actually shows",
        ])
        for item in record.get("literature_suggests") or []:
            lines.append(f"- {item}")
        lines.extend([
            "",
            "### Closest prior work and remaining gap",
        ])
        for item in record.get("closest_prior_gap") or []:
            lines.append(f"- {item}")
        lines.extend([
            f"- Missing piece: {record.get('missing_piece')}",
            "",
            "### If this direction is right, what contribution emerges",
            f"- {record.get('contribution_shape') or record.get('academic_value')}",
            f"- {record.get('what_counts_as_insight')}",
            "",
            "### Smallest decisive probe",
        ])
        for item in record.get("first_probes") or []:
            lines.append(f"- {item}")
        lines.extend([
            "",
            "### Quick kill criteria",
        ])
        for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
            lines.append(f"- {item}")
        lines.append("")
    lines.extend([
        f"## {deferred_idx}. Other promising but not prioritized directions",
        "",
    ])
    deferred = payload.get("deferred_directions") or []
    if deferred:
        for record in deferred:
            lines.append(f"- **{record.get('title')}** — {record.get('one_line_thesis')} (why not prioritized now: {record.get('why_not_prioritized')})")
    else:
        lines.append("- No additional deferred directions were retained in the current memo.")
    lines.extend([
        "",
        f"## {discussion_idx}. Cross-cutting discussion questions",
        "",
    ])
    for item in payload.get("discussion_questions") or []:
        lines.append(f"- {item}")
    lines.extend([
        "",
        f"## {uncertainty_idx}. Uncertainty and disagreement",
        "",
    ])
    for item in payload.get("uncertainties") or []:
        lines.append(f"- {item}")
    lines.extend([
        "",
        f"## {next_idx}. Suggested next reading / next discussion step",
        "",
    ])
    for item in payload.get("next_steps") or []:
        lines.append(f"- {item}")
    lines.extend([
        "",
        f"## {appendix_idx}. Appendix guide",
        "",
        "- `output/APPENDIX.md` for anchor-paper reading notes and deferred directions.",
        "- `output/REPORT.json` for the structured version of this memo.",
    ])
    return "\n".join(lines)


def appendix_markdown(payload: dict[str, Any], *, core_titles: dict[str, str]) -> str:
    lines = [
        "# Appendix to the Research Idea Brainstorm Memo",
        "",
        "## A. Deferred but still promising directions",
        "",
    ]
    deferred = payload.get("deferred_directions") or []
    if deferred:
        for rec in deferred:
            lines.extend([
                f"### {rec.get('title')}",
                f"- One-line thesis: {rec.get('one_line_thesis')}",
                f"- Program kind: {rec.get('program_kind')}",
                f"- Why interesting: {rec.get('why_interesting')}",
                f"- Why not prioritized now: {rec.get('why_not_prioritized')}",
                "",
            ])
    else:
        lines.append("- (none)")
    lines.extend([
        "## B. Anchor papers and what to extract",
        "",
    ])
    for rec in payload.get("top_directions") or []:
        lines.append(f"### {rec.get('title')}")
        lines.append(f"- Program kind: {rec.get('program_kind')}")
        lines.append(f"- Lead-set reason: {rec.get('why_this_ranks_here')}")
        notes = rec.get("anchor_reading_notes") or []
        if notes:
            rows: list[list[str]] = []
            for item in notes:
                if not isinstance(item, dict):
                    continue
                rows.append([
                    item.get("paper_title", "paper"),
                    item.get("why_read", ""),
                    item.get("what_to_extract", ""),
                    item.get("current_hook", ""),
                    item.get("kill_signal", ""),
                ])
            if rows:
                lines.append(markdown_table(["Anchor paper", "Why read now", "What to extract", "Current evidence hook", "Kill signal"], rows))
            else:
                # Notes existed but none were usable mappings: fall back to plain titles.
                for pid in rec.get("paper_ids") or []:
                    title = core_titles.get(pid, pid)
                    lines.append(f"- {title}")
        else:
            for pid in rec.get("paper_ids") or []:
                title = core_titles.get(pid, pid)
                lines.append(f"- {title}")
        lines.append("")
    return "\n".join(lines)
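
The screening step in `score_direction_cards` reduces to two small rules: a weighted sum of 1-5 criterion scores rounded to two decimals, and a rank cutoff that maps each sorted position to `keep`, `maybe`, or `drop`. The sketch below isolates those two rules under stated assumptions. The helper names `weighted_total` and `recommend`, and the sample scores and weights, are hypothetical and not part of the module.

```python
def weighted_total(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted sum of criterion scores, rounded to two decimals."""
    return round(sum(scores[name] * weights[name] for name in weights), 2)


def recommend(rank: int, *, keep_rank_max: int, maybe_rank_max: int) -> str:
    """Map a 1-based rank to a decision using the two cutoffs."""
    if keep_rank_max <= 0:
        raise ValueError("keep_rank_max must be positive")
    if maybe_rank_max < keep_rank_max:
        raise ValueError("maybe_rank_max must be >= keep_rank_max")
    if rank <= keep_rank_max:
        return "keep"
    return "maybe" if rank <= maybe_rank_max else "drop"


# Hypothetical scores and weights, for illustration only.
scores = {"evidence_grounding": 5, "first_probe_clarity": 4}
weights = {"evidence_grounding": 0.5, "first_probe_clarity": 0.5}
total = weighted_total(scores, weights)
decisions = [recommend(r, keep_rank_max=2, maybe_rank_max=4) for r in range(1, 6)]
print(total, decisions)
```

Because the decision depends only on rank, not on the absolute score, a weak pool still yields `keep_rank_max` kept directions; callers that want a score floor would need to add one.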
FILE:tooling/pipeline_spec.py
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import Any
import yaml


@dataclass(frozen=True)
class PipelineStage:
    id: str
    title: str
    checkpoint: str
    mode: str
    required_skills: tuple[str, ...]
    optional_skills: tuple[str, ...]
    produces: tuple[str, ...]
    human_checkpoint: dict[str, Any]

    @staticmethod
    def from_mapping(stage_id: str, data: Any, *, path: Path) -> "PipelineStage":
        if not isinstance(data, dict):
            raise ValueError(f"`stages.{stage_id}` must be a mapping in {path}")
        return PipelineStage(
            id=str(stage_id or "").strip(),
            title=str(data.get("title") or stage_id).strip() or str(stage_id),
            checkpoint=str(data.get("checkpoint") or stage_id).strip() or str(stage_id),
            mode=str(data.get("mode") or "").strip(),
            required_skills=_string_tuple(data.get("required_skills"), field_name=f"stages.{stage_id}.required_skills", path=path),
            optional_skills=_string_tuple(data.get("optional_skills"), field_name=f"stages.{stage_id}.optional_skills", path=path),
            produces=_string_tuple(data.get("produces"), field_name=f"stages.{stage_id}.produces", path=path),
            human_checkpoint=_mapping(data.get("human_checkpoint"), field_name=f"stages.{stage_id}.human_checkpoint", path=path, required=False),
        )


@dataclass(frozen=True)
class PipelineSpec:
    path: Path
    name: str
    version: str
    units_template: str
    default_checkpoints: tuple[str, ...]
    profile: str
    routing_hints: tuple[str, ...]
    routing_default: bool
    routing_priority: int
    target_artifacts: tuple[str, ...]
    contract_model: str
    structure_mode: str
    pre_retrieval_shell: dict[str, Any]
    binding_layers: tuple[str, ...]
    core_chapter_h3_target: int
    query_defaults: dict[str, Any]
    overridable_query_fields: tuple[str, ...]
    quality_contract: dict[str, Any]
    loop_policy: dict[str, Any]
    stages: dict[str, PipelineStage]
    variant_of: str
    variant_overrides: dict[str, Any]

    @staticmethod
    def load(path: Path) -> "PipelineSpec":
        resolved = path.resolve()
        # `raw_frontmatter` is the file as written; `frontmatter` is the effective
        # mapping after any `variant_of` base has been merged in.
        raw_frontmatter, frontmatter = _load_variant_aware_frontmatter(resolved)
        name = str(frontmatter.get("name") or path.stem.replace(".pipeline", ""))
        units_template = str(frontmatter.get("units_template") or "")
        version = str(frontmatter.get("version") or "")
        default_checkpoints = _string_tuple(frontmatter.get("default_checkpoints"), field_name="default_checkpoints", path=resolved)
        profile = str(frontmatter.get("profile") or "default").strip() or "default"
        routing_hints = _string_tuple(frontmatter.get("routing_hints"), field_name="routing_hints", path=resolved)
        routing_default = bool(frontmatter.get("routing_default"))
        # Non-numeric values fall back to 0 instead of failing the load.
        try:
            routing_priority = int(frontmatter.get("routing_priority") or 0)
        except (TypeError, ValueError, OverflowError):
            routing_priority = 0
        target_artifacts = _string_tuple(frontmatter.get("target_artifacts"), field_name="target_artifacts", path=resolved)
        contract_model = str(frontmatter.get("contract_model") or "").strip()
        structure_mode = str(frontmatter.get("structure_mode") or "").strip()
        pre_retrieval_shell = _mapping(frontmatter.get("pre_retrieval_shell"), field_name="pre_retrieval_shell", path=resolved, required=False)
        binding_layers = _string_tuple(frontmatter.get("binding_layers"), field_name="binding_layers", path=resolved)
        try:
            core_chapter_h3_target = int(frontmatter.get("core_chapter_h3_target") or 0)
        except (TypeError, ValueError, OverflowError):
            core_chapter_h3_target = 0
        query_defaults = _mapping(frontmatter.get("query_defaults"), field_name="query_defaults", path=resolved, required=False)
        overridable_query_fields = _string_tuple(
            frontmatter.get("overridable_query_fields"),
            field_name="overridable_query_fields",
            path=resolved,
        )
        quality_contract = _mapping(frontmatter.get("quality_contract"), field_name="quality_contract", path=resolved, required=False)
        loop_policy = _mapping(frontmatter.get("loop_policy"), field_name="loop_policy", path=resolved, required=False)
        stages = _parse_stages(frontmatter.get("stages"), path=resolved)
        variant_of = str(raw_frontmatter.get("variant_of") or "").strip()
        variant_overrides = _mapping(raw_frontmatter.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False)
        if not units_template:
            raise ValueError(f"Missing units_template in pipeline front matter: {resolved}")
        return PipelineSpec(
            path=resolved,
            name=name,
            version=version,
            units_template=units_template,
            default_checkpoints=default_checkpoints,
            profile=profile,
            routing_hints=routing_hints,
            routing_default=routing_default,
            routing_priority=routing_priority,
            target_artifacts=target_artifacts,
            contract_model=contract_model,
            structure_mode=structure_mode,
            pre_retrieval_shell=pre_retrieval_shell,
            binding_layers=binding_layers,
            core_chapter_h3_target=core_chapter_h3_target,
            query_defaults=query_defaults,
            overridable_query_fields=overridable_query_fields,
            quality_contract=quality_contract,
            loop_policy=loop_policy,
            stages=stages,
            variant_of=variant_of,
            variant_overrides=variant_overrides,
        )

    def query_default(self, key: str, default: Any = None) -> Any:
        return self.query_defaults.get(str(key or "").strip(), default)

    def allows_query_override(self, key: str) -> bool:
        return str(key or "").strip() in set(self.overridable_query_fields)


def _parse_frontmatter(text: str) -> dict[str, Any]:
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("Pipeline file must start with YAML front matter '---'")
    end_idx = None
    for idx in range(1, len(lines)):
        if lines[idx].strip() == "---":
            end_idx = idx
            break
    if end_idx is None:
        raise ValueError("Unterminated YAML front matter (missing closing '---')")
    raw = "\n".join(lines[1:end_idx])
    data = yaml.safe_load(raw) or {}
    if not isinstance(data, dict):
        raise ValueError("Pipeline YAML front matter must be a mapping")
    return data


def _load_variant_aware_frontmatter(path: Path, seen: set[Path] | None = None) -> tuple[dict[str, Any], dict[str, Any]]:
    resolved = path.resolve()
    active = set(seen or set())
    if resolved in active:
        chain = " -> ".join(str(p) for p in [*active, resolved])
        raise ValueError(f"Cyclic `variant_of` chain detected: {chain}")
    active.add(resolved)
    text = resolved.read_text(encoding="utf-8")
    raw = _parse_frontmatter(text)
    variant_of = str(raw.get("variant_of") or "").strip()
    if not variant_of:
        return raw, dict(raw)
    _validate_raw_variant_frontmatter(raw, path=resolved)
    base_path = _resolve_pipeline_reference(resolved, variant_of)
    _, base_effective = _load_variant_aware_frontmatter(base_path, seen=active)
    current = {key: raw[key] for key in ("name", "version") if key in raw}
    merged = _deep_merge(base_effective, current)
    merged = _deep_merge(
        merged,
        _mapping(raw.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False),
    )
    merged["variant_of"] = variant_of
    merged["variant_overrides"] = raw.get("variant_overrides") or {}
    return raw, merged


def _validate_raw_variant_frontmatter(raw: dict[str, Any], *, path: Path) -> None:
    allowed_raw_keys = {"name", "version", "variant_of", "variant_overrides"}
    extra_keys = sorted(str(key) for key in raw.keys() if str(key) not in allowed_raw_keys)
    if extra_keys:
        raise ValueError(
            "Variant pipeline files may only keep top-level keys "
            "`name`, `version`, `variant_of`, `variant_overrides`; "
            f"move the rest under `variant_overrides` in {path}: {', '.join(extra_keys)}"
        )


def _resolve_pipeline_reference(path: Path, ref: str) -> Path:
    value = str(ref or "").strip()
    if not value:
        raise ValueError(f"Empty `variant_of` reference in {path}")
    candidate = Path(value)
    if candidate.is_absolute() and candidate.exists():
        return candidate.resolve()
    repo_root = path.parent.parent
    for base in (path.parent, repo_root, repo_root / "pipelines"):
        direct = (base / value).resolve()
        if direct.exists():
            return direct
    stem = Path(value).name
    if stem.endswith(".pipeline.md"):
        stem = stem[: -len(".pipeline.md")]
    if stem:
        for base in (path.parent, repo_root / "pipelines"):
            direct = (base / f"{stem}.pipeline.md").resolve()
            if direct.exists():
                return direct
    raise ValueError(f"Could not resolve `variant_of: {value}` from {path}")


def _deep_merge(base: Any, override: Any) -> Any:
    if isinstance(base, list) and isinstance(override, dict) and any(str(key).startswith("__") for key in override.keys()):
        return _apply_list_patch(base, override)
    if isinstance(base, dict) and isinstance(override, dict):
        merged: dict[str, Any] = {str(k): v for k, v in base.items()}
        for key, value in override.items():
            if key in merged:
                merged[key] = _deep_merge(merged[key], value)
            else:
                merged[key] = value
        return merged
    return override


def _apply_list_patch(base: list[Any], override: dict[str, Any]) -> list[Any]:
    allowed_keys = {"__append__", "__prepend__", "__remove__", "__replace__"}
    bad_keys = sorted(str(key) for key in override.keys() if str(key) not in allowed_keys)
    if bad_keys:
        raise ValueError(
            "List patch overrides only support "
            "`__append__`, `__prepend__`, `__remove__`, `__replace__`; "
            f"got: {', '.join(bad_keys)}"
        )
    if "__replace__" in override:
        replacement = override.get("__replace__")
        if not isinstance(replacement, list):
            raise ValueError("List patch `__replace__` must be a YAML list")
        return list(replacement)
    current = list(base)
    remove_values = override.get("__remove__", [])
    prepend_values = override.get("__prepend__", [])
    append_values = override.get("__append__", [])
    for field_name, value in (
        ("__remove__", remove_values),
        ("__prepend__", prepend_values),
        ("__append__", append_values),
    ):
        if not isinstance(value, list):
            raise ValueError(f"List patch `{field_name}` must be a YAML list")
    if remove_values:
        current = [item for item in current if item not in remove_values]
    if prepend_values:
        current = list(prepend_values) + current
    if append_values:
        current = current + list(append_values)
    return current
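

def _mapping(value: Any, *, field_name: str, path: Path, required: bool) -> dict[str, Any]:
    if value is None:
        # Honor `required`: a missing value is only acceptable for optional fields.
        if required:
            raise ValueError(f"`{field_name}` is required in {path}")
        return {}
    if not isinstance(value, dict):
        raise ValueError(f"`{field_name}` must be a mapping in {path}")
    return {str(key): item for key, item in value.items()}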


def _string_tuple(value: Any, *, field_name: str, path: Path) -> tuple[str, ...]:
    if value is None:
        return ()
    if not isinstance(value, list):
        raise ValueError(f"`{field_name}` must be a YAML list in {path}")
    out: list[str] = []
    for item in value:
        text = str(item or "").strip()
        if text:
            out.append(text)
    return tuple(out)


def _parse_stages(value: Any, *, path: Path) -> dict[str, PipelineStage]:
    if value is None:
        return {}
    items: list[tuple[str, Any]] = []
    if isinstance(value, dict):
        items = [(str(stage_id or "").strip(), data) for stage_id, data in value.items()]
    elif isinstance(value, list):
        for idx, item in enumerate(value):
            if not isinstance(item, dict):
                raise ValueError(f"`stages[{idx}]` must be a mapping in {path}")
            stage_id = str(item.get("id") or item.get("stage") or "").strip()
            if not stage_id:
                raise ValueError(f"`stages[{idx}]` is missing `id` in {path}")
            items.append((stage_id, item))
    else:
        raise ValueError(f"`stages` must be a mapping or list in {path}")
    stages: dict[str, PipelineStage] = {}
    for stage_id, data in items:
        stage = PipelineStage.from_mapping(stage_id, data, path=path)
        if not stage.id:
            raise ValueError(f"`stages` contains an empty stage id in {path}")
        stages[stage.id] = stage
    return stages
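
The list-patch rules used when a variant pipeline overrides a list from its base can be hard to read off the validation code alone. Below is a minimal standalone sketch of the same ordering (`__replace__` wins outright; otherwise remove, then prepend, then append). The function name `apply_list_patch` and the sample skill names are hypothetical, and the sketch omits the type and key validation that the module performs.

```python
from typing import Any


def apply_list_patch(base: list[Any], patch: dict[str, list[Any]]) -> list[Any]:
    """Apply a `__replace__` / `__remove__` / `__prepend__` / `__append__` patch."""
    if "__replace__" in patch:
        # A replacement discards the base list entirely.
        return list(patch["__replace__"])
    remove = patch.get("__remove__", [])
    current = [item for item in base if item not in remove]
    return list(patch.get("__prepend__", [])) + current + list(patch.get("__append__", []))


# Hypothetical base list and variant override, for illustration only.
base_skills = ["skill-a", "skill-b", "skill-c"]
patched = apply_list_patch(
    base_skills,
    {"__remove__": ["skill-b"], "__prepend__": ["skill-z"], "__append__": ["skill-d"]},
)
replaced = apply_list_patch(base_skills, {"__replace__": ["skill-x"]})
print(patched, replaced)
```

Removal runs before the additions, so a value listed in both `__remove__` and `__append__` ends up present once, at the end of the list.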
FILE:tooling/pipeline_text.py
from __future__ import annotations
import json
import re
from pathlib import Path
from typing import Any
from tooling.common import load_yaml, read_jsonl


def slug_unit_id(unit_id: str) -> str:
    raw = str(unit_id or '').strip()
    out: list[str] = []
    for ch in raw:
        out.append(ch if ch.isalnum() else '_')
    safe = ''.join(out).strip('_')
    return f'S{safe}' if safe else 'S'


def load_outline_sections(path: Path) -> list[dict[str, Any]]:
    outline = load_yaml(path) if path.exists() else []
    if not isinstance(outline, list):
        return []
    out: list[dict[str, Any]] = []
    for sec in outline:
        if not isinstance(sec, dict):
            continue
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        subsections: list[dict[str, str]] = []
        for sub in sec.get('subsections') or []:
            if not isinstance(sub, dict):
                continue
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                subsections.append({'id': sub_id, 'title': sub_title})
        if sec_id and sec_title:
            out.append({'id': sec_id, 'title': sec_title, 'subsections': subsections})
    return out


def iter_h3_units(path: Path) -> list[dict[str, str]]:
    units: list[dict[str, str]] = []
    for sec in load_outline_sections(path):
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        for sub in sec.get('subsections') or []:
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                units.append(
                    {
                        'section_id': sec_id,
                        'section_title': sec_title,
                        'sub_id': sub_id,
                        'title': sub_title,
                    }
                )
    return units


def read_jsonl_map(path: Path, key: str) -> dict[str, dict[str, Any]]:
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        val = str(rec.get(key) or '').strip()
        if val:
            out[val] = rec
    return out


def read_bib_keys(path: Path) -> set[str]:
    if not path.exists() or path.stat().st_size <= 0:
        return set()
    text = path.read_text(encoding='utf-8', errors='ignore')
    return set(re.findall(r'(?im)^@\w+\s*\{\s*([^,\s]+)\s*,', text))


def uniq_keep_order(items: list[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or '').strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out


def clean_excerpt(text: str, *, limit: int = 220) -> str:
    s = str(text or '').strip()
    s = s.replace('\n', ' ')
    s = re.sub(r'\s+', ' ', s)
    s = s.strip(' "\'`')
    s = s.replace('|', ', ')
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(' ', 1)[0].strip()
    return clipped if clipped else s[:limit].strip()
def citation_list(keys: list[str], *, max_keys: int = 3) -> str:
items = uniq_keep_order(keys)[:max_keys]
if not items:
return ''
cites = [f'[@{k}]' for k in items]
if len(cites) == 1:
return cites[0]
if len(cites) == 2:
return f'{cites[0]} and {cites[1]}'
return ', '.join(cites[:-1]) + f', and {cites[-1]}'
def inline_evidence_phrase(keys: list[str], *, max_keys: int = 3) -> str:
items = uniq_keep_order(keys)[:max_keys]
if not items:
return ''
cites = [f'in [@{k}]' for k in items]
if len(cites) == 1:
return cites[0]
if len(cites) == 2:
return f'{cites[0]} and {cites[1]}'
return ', '.join(cites[:-1]) + f', and {cites[-1]}'
def heading_blocks(md: str) -> list[tuple[str, str]]:
    """Split Markdown into (H3 title, body) pairs; an H2 heading closes the current H3 block."""
    blocks: list[tuple[str, str]] = []
    current = ''
    lines: list[str] = []
    for raw in (md or '').splitlines():
        if raw.startswith('### '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = raw[4:].strip()
            lines = []
            continue
        if raw.startswith('## '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = ''
            lines = []
            continue
        # Text before the first H3 (or between an H2 and its first H3) is ignored.
        if current:
            lines.append(raw)
    if current:
        blocks.append((current, '\n'.join(lines).strip()))
    return blocks


def dump_jsonl_lines(records: list[dict[str, Any]]) -> str:
    """Serialize records as JSONL with a trailing newline (empty string for no records)."""
    return '\n'.join(json.dumps(r, ensure_ascii=False) for r in records).rstrip() + ('\n' if records else '')
Raise citation diversity/density (NO NEW FACTS): generate an in-scope “citation budget” plan per H3 so drafts stop failing the global unique-citation gate an...
---
name: citation-diversifier
description: 'Raise citation diversity/density (NO NEW FACTS): generate an in-scope “citation budget” plan per H3 so drafts
stop failing the global unique-citation gate and stop looking under-cited.
**Trigger**: cite boost, citation budget, unique citations too low, add more citations, improve reference density, 引用太少,
增加引用, 引用密度.
**Use when**: `pipeline-auditor` FAILs due to low unique citations, or you want to increase cite density without changing
claims.
**Skip if**: you need new papers (fix C1/C2 mapping first), or `citations/ref.bib` / `outline/writer_context_packs.jsonl`
is missing.
**Network**: none.
**Guardrail**: NO NEW FACTS; do not invent citations; only use keys already present in `citations/ref.bib`; keep citations
within each H3’s allowed scope (`outline/writer_context_packs.jsonl` / `outline/evidence_bindings.jsonl`).'
version: 0.1.0
metadata:
  openclaw:
    requires:
      anyBins:
        - python3
        - python
---
# Citation Diversifier (budget-as-constraints) [NO NEW FACTS]
Purpose: fix a common survey failure mode:
- the draft reads under-cited (or reuses the same few citations everywhere)
- the pipeline fails the **global unique-citation** gate
This skill does **not** change prose by itself.
It produces a constraint sheet: `output/CITATION_BUDGET_REPORT.md`.
## Inputs
- `output/DRAFT.md`
- `outline/outline.yml` (H3 ids/titles; used to allocate budgets per subsection)
- `outline/writer_context_packs.jsonl` (source of `allowed_bibkeys_{selected,mapped,chapter,global}` per H3)
- `citations/ref.bib`
## Output
- `output/CITATION_BUDGET_REPORT.md`
## Non-negotiables (NO NEW FACTS)
- Only propose citation keys that exist in `citations/ref.bib`.
- Only propose keys that are **in-scope** for the target H3 (prefer subsection-first scope; use chapter/global only when truly cross-cutting).
- Do not propose “padding citations” that would require adding new claims or new numbers.
## What a good budget report looks like (contract)
The report should feel like a *constraint sheet*, not a random list:
- It states the **blocking policy target** and the **gap-to-target** (how many unique keys are missing; policy default is `recommended`).
- For each H3, it proposes a scope-safe budget sized to actually close the gap:
- small gaps: 3-6 keys / H3 is often enough
- A150++ gaps: plan for ~6-12 keys / H3 (and avoid duplicates across H3 budgets)
- It gives placement guidance (where in the subsection those keys can be embedded without adding new facts).
Canonical (parseable) lines required (downstream validators depend on these); the blocking target is derived from `queries.md:citation_target` (`recommended` by default for A150++):
- `- Global target (policy; blocking): >= <N> ...`
- `- Gap: <K>` (gap-to-target; if `0`, injection can be a no-op PASS)
Additional lines (always reported; whether they are blocking depends on `citation_target`):
- `- Global recommended target: >= <N> ...`
- `- Gap to recommended: <K>`
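A downstream validator could read the two canonical lines with a pair of regular expressions. The sketch below is only an illustration under stated assumptions: `parse_budget_header` and both patterns are hypothetical names, and the sample report text is made up to match the line shapes listed above.

```python
import re

# Hypothetical patterns that mirror the canonical line shapes in the contract.
TARGET_RE = re.compile(r'(?m)^- Global target \(policy; blocking\): >= (\d+)')
GAP_RE = re.compile(r'(?m)^- Gap: (\d+)\s*$')


def parse_budget_header(report_md: str) -> dict[str, int]:
    """Extract the blocking target and the gap; raise if either canonical line is missing."""
    target = TARGET_RE.search(report_md)
    gap = GAP_RE.search(report_md)
    if target is None or gap is None:
        raise ValueError('budget report is missing a canonical line')
    return {'target': int(target.group(1)), 'gap': int(gap.group(1))}


# Made-up sample report header (illustrative values only).
sample = (
    '- Status: FAIL\n'
    '- Global target (policy; blocking): >= 165 unique keys\n'
    '- Gap: 12\n'
    '- Gap to recommended: 12\n'
)
header = parse_budget_header(sample)
```

The `- Gap to recommended:` line does not match `GAP_RE`, so the optional lines cannot be mistaken for the blocking gap.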
Recommended prioritization (scope-safe):
- `allowed_bibkeys_selected` → `allowed_bibkeys_mapped` → `allowed_bibkeys_chapter`
- Use `allowed_bibkeys_global` only for:
- benchmarks/protocol papers
- widely-used datasets/suites
- cross-cutting surveys/method papers referenced across chapters
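The prioritization above can be sketched as a walk over the scopes from narrowest to widest. This is a minimal sketch, not the helper script's logic: `pick_budget_keys` is a hypothetical name, and it assumes each `allowed_bibkeys_*` field in a writer context pack holds a list of key strings. `allowed_bibkeys_global` is deliberately left out, since the text reserves it for cross-cutting works.

```python
def pick_budget_keys(pack: dict, used: set[str], bib_keys: set[str], budget: int) -> list[str]:
    """Pick up to `budget` unused keys, preferring selected, then mapped, then chapter scope."""
    picked: list[str] = []
    for scope in ('allowed_bibkeys_selected', 'allowed_bibkeys_mapped', 'allowed_bibkeys_chapter'):
        for key in pack.get(scope) or []:
            if len(picked) >= budget:
                return picked
            # Only keys that exist in ref.bib, are not already cited, and are not yet picked.
            if key in bib_keys and key not in used and key not in picked:
                picked.append(key)
    return picked


# Made-up pack and key sets for illustration.
pack = {
    'allowed_bibkeys_selected': ['a2023', 'b2024'],
    'allowed_bibkeys_mapped': ['b2024', 'c2022', 'ghost'],
    'allowed_bibkeys_chapter': ['d2021'],
}
bib = {'a2023', 'b2024', 'c2022', 'd2021'}
picked = pick_budget_keys(pack, used={'a2023'}, bib_keys=bib, budget=3)
```

Here `a2023` is skipped because the draft already cites it, and `ghost` is skipped because it is not in the bibliography.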
## How this connects to writing (LLM-first)
After you generate the budget report:
- Apply it using `citation-injector` (edits `output/DRAFT.md` in place, NO NEW FACTS).
- Then run `draft-polisher` to remove any “budget dump voice” while keeping citation keys unchanged.
Important: `citation-injector` is conservative and scriptable. Its script edits `output/DRAFT.md` directly, using the budget report as constraints.
## Workflow
1) Diagnose the global situation
- Read `output/DRAFT.md` and estimate the “unique-key gap” (or use `pipeline-auditor`’s FAIL reason).
2) Allocate budgets per H3 (scope-first)
- Use `outline/outline.yml` to enumerate H3s in paper order.
- For each H3, read its allowed key sets from `outline/writer_context_packs.jsonl`.
- Pick a small set of *unused* keys that strengthen positioning without requiring new claims.
3) Write `output/CITATION_BUDGET_REPORT.md`
Required structure:
- `- Status: PASS|FAIL`
- `- Global target (policy; blocking): >= <N> ...`
- `- Gap: <K>`
- `## Summary` (gap + strategy)
- `## Per-subsection budgets` (H3 id/title → suggested keys → placement hint)
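The required structure can be emitted by a small renderer. The sketch below is an assumption-laden illustration: `render_budget_report` is a hypothetical name, the per-subsection bullet format and the summary sentence are invented for the example, and only the three header lines and two headings come from the structure listed above.

```python
def render_budget_report(status: str, target: int, gap: int, budgets: list[dict]) -> str:
    """Render a budget report with the canonical header lines first, then summary and budgets."""
    lines = [
        f'- Status: {status}',
        f'- Global target (policy; blocking): >= {target}',
        f'- Gap: {gap}',
        '',
        '## Summary',
        f'Missing {gap} unique keys against a target of {target}.',
        '',
        '## Per-subsection budgets',
    ]
    for b in budgets:
        keys = ', '.join(f'`{k}`' for k in b['keys'])
        # Hypothetical line shape: H3 id/title, suggested keys, placement hint.
        lines.append(f"- {b['sub_id']} {b['title']}: {keys} ({b['hint']})")
    return '\n'.join(lines) + '\n'


# Made-up budget entry for illustration.
report = render_budget_report(
    'FAIL',
    165,
    12,
    [{'sub_id': '3.1', 'title': 'Planning', 'keys': ['b2024', 'c2022'], 'hint': 'comparison paragraph'}],
)
```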
## Script (optional; deterministic report generator)
If you want a deterministic first-pass budget report, run the helper script. Treat it as a baseline and refine the plan as needed.
### Quick Start
- `python scripts/run.py --help`
- `python scripts/run.py --workspace workspaces/<ws>`
### All Options
- `--workspace <dir>`
- `--unit-id <U###>` (optional)
- `--inputs <semicolon-separated>` (rare override; prefer defaults)
- `--outputs <semicolon-separated>` (rare override; default writes `output/CITATION_BUDGET_REPORT.md`)
- `--checkpoint <C#>` (optional)
### Examples
- Default IO:
- `python scripts/run.py --workspace workspaces/<ws>`
## Done criteria
- `output/CITATION_BUDGET_REPORT.md` exists and has actionable, in-scope budgets.
- After applying the plan via `citation-injector`, `pipeline-auditor` no longer FAILs on global unique citations.
FILE:AGENTS.md
# AGENTS.md
This exported skill uses `AGENTS.md` only as a local repo-root marker for bundled helper scripts.
Source skill: `citation-diversifier` from `research-units-pipeline-skills`.
Export slug: `citation-diversifier`.
FILE:pipelines/arxiv-survey-latex.pipeline.md
---
name: arxiv-survey-latex
version: 3.8
variant_of: arxiv-survey
variant_overrides:
  routing_hints: [latex, pdf, tex, 可编译, 编译]
  routing_default: false
  routing_priority: 20
  units_template: templates/UNITS.arxiv-survey-latex.csv
  target_artifacts:
    __append__:
      - latex/main.tex
      - latex/main.pdf
      - output/LATEX_BUILD_REPORT.md
  stages:
    C5:
      title: Draft + PDF
      required_skills:
        __append__:
          - latex-scaffold
          - latex-compile-qa
      optional_skills:
        __remove__:
          - latex-scaffold
          - latex-compile-qa
      produces:
        __append__:
          - latex/main.tex
          - latex/main.pdf
          - output/LATEX_BUILD_REPORT.md
---
# Pipeline: arXiv survey / review (MD-first + LaTeX/PDF)
Variant of `arxiv-survey`.
Use this pipeline only when the default deliverable must include:
- `latex/main.tex`
- `latex/main.pdf`
- `output/LATEX_BUILD_REPORT.md`
All non-PDF survey behavior, defaults, checkpoints, and earlier stages inherit from `arxiv-survey`.
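The `__append__` / `__remove__` markers in the front matter above read as list-level overrides on the base `arxiv-survey` contract. How the loader actually resolves them is not shown here, so the sketch below is an assumption: `apply_list_override` is a hypothetical function that removes listed items first, then appends new ones without duplicating existing entries.

```python
def apply_list_override(base: list[str], override: dict[str, list[str]]) -> list[str]:
    """Apply `__remove__` then `__append__` to a base list, keeping order and avoiding duplicates."""
    removed = set(override.get('__remove__') or [])
    merged = [item for item in base if item not in removed]
    for item in override.get('__append__') or []:
        if item not in merged:
            merged.append(item)
    return merged


# Shortened, made-up base lists; the real C5 lists are much longer.
base_optional = ['prose-writer', 'latex-scaffold', 'latex-compile-qa']
base_required = ['draft-polisher', 'pipeline-auditor']

optional = apply_list_override(base_optional, {'__remove__': ['latex-scaffold', 'latex-compile-qa']})
required = apply_list_override(base_required, {'__append__': ['latex-scaffold', 'latex-compile-qa']})
```

Under this reading, the variant moves the two LaTeX skills from optional to required for stage C5.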
FILE:pipelines/arxiv-survey.pipeline.md
---
name: arxiv-survey
version: 3.8
profile: arxiv-survey
routing_hints: [survey, review, 综述, 调研, literature review]
routing_default: true
routing_priority: 10
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
- outline/transitions.md
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/claim_evidence_matrix.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- outline/writer_context_packs.jsonl
- citations/ref.bib
- citations/verified.jsonl
- sections/sections_manifest.jsonl
- sections/h3_bodies.refined.ok
- sections/paragraphs_curated.refined.ok
- sections/style_harmonized.refined.ok
- sections/opener_varied.refined.ok
- sections/abstract.md
- sections/S1.md
- sections/S2.md
- sections/discussion.md
- sections/conclusion.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/SCHEMA_NORMALIZATION_REPORT.md
- output/EVIDENCE_SELFLOOP_TODO.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/PARAGRAPH_CURATION_REPORT.md
- output/FRONT_MATTER_REPORT.md
- output/CHAPTER_LEADS_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/GLOBAL_REVIEW.md
- output/DRAFT.md
- output/MERGE_REPORT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.arxiv-survey.csv
contract_model: pipeline.frontmatter/v1
structure_mode: section_first
pre_retrieval_shell:
  enabled: true
  approval_surface: false
  allowed_h2: [Introduction, Related Work, Core Chapters, Discussion, Conclusion]
  binding_layers: [chapter_skeleton, section_bindings, section_briefs, subsection_mapping]
  core_chapter_h3_target: 3
query_defaults:
  max_results: 1800
  core_size: 300
  per_subsection: 28
  global_citation_min_subsections: 4
  draft_profile: survey
  citation_target: recommended
  evidence_mode: abstract
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- per_subsection
- global_citation_min_subsections
- draft_profile
- citation_target
- enrich_metadata
- evidence_mode
- fulltext_max_papers
- fulltext_max_pages
- fulltext_min_chars
- time_window.from
- time_window.to
quality_contract:
  citation_policy:
    unique_hard_floor: 150
    unique_recommended: 165
  structure_policy:
    max_final_h2_by_profile:
      survey: 8
      deep: 9
    max_h3_by_profile:
      survey: 10
      deep: 12
  front_matter_policy:
    survey:
      introduction:
        min_cites: 35
        min_paras: 5
        min_chars: 2600
      related_work:
        min_cites: 50
        min_paras: 6
        min_chars: 3200
    deep:
      introduction:
        min_cites: 40
        min_paras: 6
        min_chars: 3000
      related_work:
        min_cites: 55
        min_paras: 7
        min_chars: 3600
  subsection_policy:
    survey:
      min_unique_citations: 12
      min_chars: 4200
    deep:
      min_unique_citations: 14
      min_chars: 5200
loop_policy:
  stage_retry_budget:
    C1: 2
    C2: 2
    C3: 1
    C4: 1
  max_reroutes: 4
  require_human_on_retry_after_approval: true
stages:
  C0:
    title: Init
    mode: no_prose
    required_skills: [workspace-init, pipeline-router]
    optional_skills: []
    produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
  C1:
    title: Retrieval & core set
    mode: no_prose
    required_skills: [literature-engineer, dedupe-rank]
    optional_skills: [keyword-expansion, survey-seed-harvest]
    produces: [papers/papers_raw.jsonl, papers/retrieval_report.md, papers/papers_dedup.jsonl, papers/core_set.csv]
  C2:
    title: Structure
    mode: no_prose
    required_skills: [taxonomy-builder, chapter-skeleton, section-bindings, section-briefs, outline-builder, section-mapper, outline-refiner, pipeline-router, human-checkpoint]
    optional_skills: [outline-budgeter]
    produces: [outline/taxonomy.yml, outline/chapter_skeleton.yml, outline/section_bindings.jsonl, outline/section_binding_report.md, outline/section_briefs.jsonl, outline/outline.yml, outline/mapping.tsv, outline/coverage_report.md, outline/outline_state.jsonl, output/REROUTE_STATE.json, DECISIONS.md]
    human_checkpoint:
      approve: scope + section skeleton + outline
      write_to: DECISIONS.md
  C3:
    title: Evidence
    mode: no_prose
    required_skills: [pdf-text-extractor, paper-notes, subsection-briefs, chapter-briefs]
    optional_skills: []
    produces: [papers/fulltext_index.jsonl, papers/paper_notes.jsonl, papers/evidence_bank.jsonl, outline/subsection_briefs.jsonl, outline/chapter_briefs.jsonl]
  C4:
    title: Citations + evidence packs
    mode: no_prose
    required_skills: [citation-verifier, evidence-binder, evidence-draft, table-schema, anchor-sheet, table-filler, appendix-table-writer, schema-normalizer, writer-context-pack, evidence-selfloop, claim-matrix-rewriter]
    optional_skills: [survey-visuals]
    produces: [citations/ref.bib, citations/verified.jsonl, outline/evidence_bindings.jsonl, outline/evidence_binding_report.md, outline/table_schema.md, outline/tables_index.md, outline/tables_appendix.md, output/TABLES_APPENDIX_REPORT.md, outline/evidence_drafts.jsonl, outline/anchor_sheet.jsonl, output/SCHEMA_NORMALIZATION_REPORT.md, outline/writer_context_packs.jsonl, output/EVIDENCE_SELFLOOP_TODO.md, outline/claim_evidence_matrix.md]
  C5:
    title: Draft
    mode: prose_allowed
    required_skills: [front-matter-writer, chapter-lead-writer, subsection-writer, writer-selfloop, section-logic-polisher, argument-selfloop, paragraph-curator, style-harmonizer, opener-variator, evaluation-anchor-checker, transition-weaver, section-merger, post-merge-voice-gate, citation-diversifier, citation-injector, draft-polisher, global-reviewer, pipeline-auditor, artifact-contract-auditor]
    optional_skills: [prose-writer, subsection-polisher, redundancy-pruner, terminology-normalizer, limitation-weaver, latex-scaffold, latex-compile-qa]
    produces: [outline/transitions.md, sections/sections_manifest.jsonl, sections/h3_bodies.refined.ok, sections/paragraphs_curated.refined.ok, sections/style_harmonized.refined.ok, sections/opener_varied.refined.ok, sections/abstract.md, sections/S1.md, sections/S2.md, sections/discussion.md, sections/conclusion.md, output/WRITER_SELFLOOP_TODO.md, output/EVAL_ANCHOR_REPORT.md, output/ARGUMENT_SELFLOOP_TODO.md, output/SECTION_ARGUMENT_SUMMARIES.jsonl, output/ARGUMENT_SKELETON.md, output/PARAGRAPH_CURATION_REPORT.md, output/FRONT_MATTER_REPORT.md, output/CHAPTER_LEADS_REPORT.md, output/SECTION_LOGIC_REPORT.md, output/MERGE_REPORT.md, output/DRAFT.md, output/POST_MERGE_VOICE_REPORT.md, output/CITATION_BUDGET_REPORT.md, output/CITATION_INJECTION_REPORT.md, output/GLOBAL_REVIEW.md, output/AUDIT_REPORT.md, output/CONTRACT_REPORT.md]
---
# Pipeline: arXiv survey / review (MD-first)
Default contract (survey-grade, A150++):
- `queries.md` defaults are set for a *survey deliverable* (no silent downgrade): `core_size=300`, `per_subsection=28`, global unique citations hard floor `>=150` (recommended `>=165` when `core_size=300`; default `citation_target=recommended`).
- `draft_profile` controls **writing strictness** (`survey` vs `deep`), not “speed mode”.
- `evidence_mode` controls **evidence strength** (`abstract` default; `fulltext` optional and heavier).
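The relation between `citation_target` and the two citation thresholds can be sketched as a small lookup. This is an illustrative reading of the contract, not the auditor's code: `blocking_target` and `unique_gap` are hypothetical names, and the mapping of `hard` to the hard floor and `recommended` to the recommended value is an assumption drawn from the defaults stated above.

```python
# Defaults copied from the contract: hard floor 150, recommended 165.
CITATION_POLICY = {'unique_hard_floor': 150, 'unique_recommended': 165}


def blocking_target(citation_target: str, policy: dict[str, int] = CITATION_POLICY) -> int:
    """Return the unique-citation count that blocks the run for the given `citation_target`."""
    if citation_target == 'recommended':
        return policy['unique_recommended']
    if citation_target == 'hard':
        return policy['unique_hard_floor']
    raise ValueError(f'unknown citation_target: {citation_target!r}')


def unique_gap(unique_count: int, citation_target: str = 'recommended') -> int:
    """Number of unique keys still missing; 0 means the global gate would pass."""
    return max(0, blocking_target(citation_target) - unique_count)


# A draft with 158 unique keys (made-up number) under each policy.
gap_recommended = unique_gap(158, 'recommended')
gap_hard = unique_gap(158, 'hard')
```

With these defaults, a draft that clears the hard floor can still fail the default `recommended` policy.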
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- survey-seed-harvest
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- `queries.md` may specify `max_results` and a year range (`time_window.from` / `time_window.to`); `arxiv-search` will paginate and attach arXiv metadata (categories, arxiv_id, etc.) when online.
- If you import an offline export but later have network, you can set `enrich_metadata: true` in `queries.md` (or run `arxiv-search --enrich-metadata`) to backfill missing abstracts/authors/categories via arXiv `id_list`.
- Evidence-first expectation (A150++): aim for a large dedup pool (target >=1200, not ~200) and a stable, verifiable core set (`core_size=300`) so later stages can bind wide in-scope citation pools without forcing out-of-scope drift.
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- chapter-skeleton
- section-bindings
- section-briefs
- outline-builder
- section-mapper
- outline-refiner
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
human_checkpoint:
- approve: scope + section skeleton + outline
- write_to: DECISIONS.md
Notes:
- `chapter-skeleton` is the first retrieval-informed chapter contract; it stays chapter-level and does not emit stable H3 ids.
- `section-bindings` measures chapter saturation before H3 decomposition and writes PASS/BLOCKED signals into `outline/section_binding_report.md`.
- `section-briefs` turns the chapter layer into decomposition guidance plus subsection seeds; `outline-builder` then derives the first stable `outline/outline.yml` from that section layer.
- `outline-refiner` writes both `outline/outline_state.jsonl` and `output/REROUTE_STATE.json`; read section-first reroute/block information from those artifacts rather than inferring it from prose notes.
- Evidence-first expectation: each subsection should be written as a *question to answer* (RQ) plus *evidence needs* (what kind of citations/results are required), not just generic scaffold bullets.
- Coverage default: `section-mapper` uses `queries.md:per_subsection` as the per-H3 mapping contract (A150++ default: 28) so later evidence binding and writing have enough in-scope citations to choose from.
- Diversity expectation: mapping should not over-reuse a few papers across unrelated H3s; reserve “global” works for genuinely cross-cutting citations (controlled by `global_citation_min_subsections`).
- Budget policy (paper-like): avoid H3 explosion; the outline gate uses `queries.md:draft_profile` to set max H3 (survey<=10, deep<=12).
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent H3s into fewer, thicker units, then rerun `section-mapper` → `outline-refiner` before `Approve C2`.
## Stage 3 - Evidence (C3) [NO PROSE]
required_skills:
- pdf-text-extractor
- paper-notes
- subsection-briefs
- chapter-briefs
produces:
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
Notes:
- `queries.md` can set `evidence_mode: "abstract"|"fulltext"` (A150++ default: `abstract`).
- `queries.md` can set `draft_profile: "survey"|"deep"` to control writing gate strictness (A150++ default: `survey`).
- If `evidence_mode: "fulltext"`, `pdf-text-extractor` can be tuned via `fulltext_max_papers`, `fulltext_max_pages`, `fulltext_min_chars`.
- `subsection-briefs` converts each H3 into a verifiable writing card (scope_rule/rq/axes/clusters/paragraph_plan) so writing does not copy outline scaffolds.
- Optional refinement markers (recommended): treat briefs as *contracts*, not scaffolds. If you manually refine them and want to prevent regeneration, create:
- `outline/subsection_briefs.refined.ok`
- `outline/chapter_briefs.refined.ok`
These markers are used as explicit “reviewed/refined” signals and as a freeze switch (scripts won’t overwrite refined briefs).
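The freeze switch described above can be sketched as a path convention plus an existence check. This is a minimal sketch under assumptions: `refined_marker` and `may_regenerate` are hypothetical helpers, and the rule "replace the artifact's extension with `.refined.ok`" is inferred from the two marker names listed.

```python
import tempfile
from pathlib import Path


def refined_marker(artifact: Path) -> Path:
    """Map `outline/subsection_briefs.jsonl` to `outline/subsection_briefs.refined.ok`."""
    return artifact.with_name(artifact.stem + '.refined.ok')


def may_regenerate(artifact: Path) -> bool:
    """A generator may overwrite the artifact only while no freeze marker exists."""
    return not refined_marker(artifact).exists()


# Demo in a throwaway directory so no real workspace is touched.
with tempfile.TemporaryDirectory() as tmp:
    briefs = Path(tmp) / 'outline' / 'subsection_briefs.jsonl'
    briefs.parent.mkdir(parents=True)
    briefs.write_text('{}\n', encoding='utf-8')
    before = may_regenerate(briefs)   # no marker yet
    refined_marker(briefs).touch()    # the reviewer freezes the briefs
    after = may_regenerate(briefs)
```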
## Stage 4 - Citations + evidence packs (C4) [NO PROSE]
required_skills:
- citation-verifier
- evidence-binder
- evidence-draft
- table-schema
- anchor-sheet
- table-filler
- appendix-table-writer
- schema-normalizer
- writer-context-pack
- evidence-selfloop
- claim-matrix-rewriter
optional_skills:
- survey-visuals
produces:
- citations/ref.bib
- citations/verified.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- output/SCHEMA_NORMALIZATION_REPORT.md
- outline/writer_context_packs.jsonl
- output/EVIDENCE_SELFLOOP_TODO.md
- outline/claim_evidence_matrix.md
Notes:
- `evidence-draft` turns paper notes into per-subsection evidence packs (claim candidates + concrete comparisons + eval protocol + limitations) that the writer must follow.
- `claim-matrix-rewriter` makes `outline/claim_evidence_matrix.md` a projection/index of evidence packs (not an outline expansion), so writer guidance stays evidence-first.
- `writer-context-pack` builds a deterministic per-H3 drafting pack (briefs + evidence + anchors + allowed cites), reducing hollow writing and making C5 more debuggable.
- Tables are part of the default survey deliverable, but split into two layers:
- `outline/tables_index.md` (internal index; produced by `table-filler`; useful for planning/debugging; NOT inserted into the paper)
- `outline/tables_appendix.md` (reader-facing; produced by `appendix-table-writer`; clean/publishable; inserted into the draft as an Appendix block by `section-merger`)
- Optional: `survey-visuals` can still produce timeline/figure specs as intermediate artifacts.
- Optional refinement markers (recommended): after you spot-check/refine C4 artifacts and want to freeze them, create:
- `outline/evidence_bindings.refined.ok`
- `outline/evidence_drafts.refined.ok`
- `outline/anchor_sheet.refined.ok`
- `outline/writer_context_packs.refined.ok`
These markers make “reviewed/refined” explicit and prevent accidental regeneration/overwrite; strict mode relies on content checks (placeholders/blocking_missing/scope), not marker presence.
## Stage 5 - Draft (C5) [PROSE AFTER C2]
required_skills:
- front-matter-writer
- chapter-lead-writer
- subsection-writer
- writer-selfloop
- section-logic-polisher
- argument-selfloop
- paragraph-curator
- style-harmonizer
- opener-variator
- evaluation-anchor-checker
- transition-weaver
- section-merger
- post-merge-voice-gate
- citation-diversifier
- citation-injector
- draft-polisher
- global-reviewer
- pipeline-auditor
- artifact-contract-auditor
optional_skills:
- prose-writer
- subsection-polisher
- redundancy-pruner
- terminology-normalizer
- limitation-weaver
- latex-scaffold
- latex-compile-qa
produces:
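- outline/transitions.md
- sections/sections_manifest.jsonl
- sections/h3_bodies.refined.ok
- sections/paragraphs_curated.refined.ok
- sections/style_harmonized.refined.ok
- sections/opener_varied.refined.ok
- sections/abstract.md
- sections/S1.md
- sections/S2.md
- sections/discussion.md
- sections/conclusion.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/PARAGRAPH_CURATION_REPORT.md
- output/FRONT_MATTER_REPORT.md
- output/CHAPTER_LEADS_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/MERGE_REPORT.md
- output/DRAFT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/GLOBAL_REVIEW.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md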
Notes:
- C5 writing system (semantic + minimal artifacts; no extra machinery):
- **Unit of work**: `sections/*.md` (front matter, H2 leads, H3 bodies). Avoid editing `output/DRAFT.md` directly until after merge.
- **Single source of truth (locked conventions)**: `output/ARGUMENT_SKELETON.md` → `## Consistency Contract` (terminology, scope boundary, evaluation protocol fields, baseline naming).
- **Write → check → fix (five gates)**:
1) `writer-selfloop` → `output/WRITER_SELFLOOP_TODO.md`: file existence, depth, citation scope, paper voice.
2) `section-logic-polisher` → `output/SECTION_LOGIC_REPORT.md`: paragraph linkage (no jump cuts / “paragraph islands”).
3) `argument-selfloop` → `output/ARGUMENT_SELFLOOP_TODO.md` + `output/ARGUMENT_SKELETON.md` + `output/SECTION_ARGUMENT_SUMMARIES.jsonl`: section-level closure + premise/definition stability.
4) `paragraph-curator` → `output/PARAGRAPH_CURATION_REPORT.md`: **select → evaluate → subset → fuse** so sections converge (reduce redundancy, strengthen synthesis) without changing citation keys.
5) `evaluation-anchor-checker` → `output/EVAL_ANCHOR_REPORT.md`: final section-level numeric hygiene sweep so surviving numeric claims keep same-sentence task/metric/constraint context before merge.
- **Openers-last**: draft the middle first; rewrite paragraph 1 last so it reflects real content (front matter + H3).
- Writing self-loop gate: `subsection-writer` ensures the full `sections/` file set exists (and emits `sections/sections_manifest.jsonl`); `writer-selfloop` blocks until depth/citation-scope/paper-voice checks pass, writing `output/WRITER_SELFLOOP_TODO.md` (PASS/FAIL).
- Argument self-loop gate: `argument-selfloop` blocks “smooth but hollow” sections by making the argument chain explicit (per-section paragraph moves + a global dependency skeleton). Its ledgers are intermediate artifacts and must never be merged into the paper.
- Style hygiene (C5 hard gate for `survey`/`deep`): treat `output/WRITER_SELFLOOP_TODO.md` Style Smells as mandatory fixes. Run `style-harmonizer` + `opener-variator` on flagged files, then rerun `writer-selfloop` before merge.
- Numeric hygiene gate: run `evaluation-anchor-checker` after the last section-level rewrites (`paragraph-curator` + `style-harmonizer` + `opener-variator`) and before merge. If later section rewrites touch the same H3 files, rerun it; do not wait for `pipeline-auditor` to discover underspecified numbers in the merged draft.
- Micro-fix routing (preferred over broad rewrites): if Style Smells are specific, use targeted micro-skills before a general harmonize pass:
- opener cadence / “overview” narration → `opener-variator`
- count-based limitation slots (“Two limitations…”) → `limitation-weaver`
- underspecified numeric/performance claims (missing task/metric/budget) → `evaluation-anchor-checker`
- Triage rule (prevents “patching gaps with prose”): if `writer-selfloop` FAILs because a subsection cannot meet `must_use` *in-scope* (thin packs / missing anchors / out-of-scope citation pressure), stop and rerun the evidence loop (`evidence-selfloop` + upstream C2/C3/C4) instead of padding prose.
- WebWeaver-style “planner vs writer” split (single agent, two passes):
- Planner pass: for each section/subsection, pick the exact citation IDs to use from the evidence bank (`outline/evidence_drafts.jsonl`) and keep scope consistent with the outline.
- Writer pass: write that section using only those citation IDs; avoid dumping the whole notes set into context.
- Treat this stage as an iteration loop: draft per H3 → logic-polish (thesis + connectors) → weave transitions → merge → de-template/cohere → global review → (if gaps) back to C3/C4 → regenerate.
- Post-merge voice gate: `post-merge-voice-gate` treats `outline/transitions.md` as a high-frequency injection source. If it FAILs, fix the *source* (usually transitions via `transition-weaver`, or the owning `sections/*.md`) and re-merge; do not “patch around it” in `draft-polisher`.
- Depth target (profile-aware): each H3 should be “few but thick” (avoid stubs). Use `queries.md:draft_profile` as the contract:
- `survey`: >=10 paragraphs + >=12 unique cites
- `deep`: >=11 paragraphs + >=14 unique cites
In all profiles, require >=2 concrete contrasts + evaluation anchoring + a cross-paper synthesis paragraph + an explicit limitation.
- Profile semantics: `survey` is the default deliverable contract; `deep` is stricter (and typically pairs well with `evidence_mode: fulltext`).
- Coherence target (paper-like): for every H2 chapter with H3 subsections, write a short **chapter lead** block (`sections/S<sec_id>_lead.md`) that previews the comparison axes and how the H3s connect (no new headings; avoid generic glue).
- Anti-template style contract (paper-like, not “outline narration”):
- Avoid meta openers like “This subsection surveys/argues …” and slide-like navigation (“Next, we move from … / We now turn to …”).
- Keep signposting light: avoid repeating a literal opener label across many subsections (e.g., `Key takeaway:`); vary opener phrasing and cadence.
- Tone target: calm, academic, understated; delete hype words (`clearly`, `obviously`) and “PPT speaker notes”.
- Keep evidence-policy disclaimers **once** in front matter (not repeated across H3s).
- If you cite numbers, include minimal evaluation context (task + metric + constraint/budget/cost) in the same paragraph.
- Citation shape must be reader-facing: no adjacent citation blocks (e.g., `[@a] [@b]`), no duplicate keys in one block (e.g., `[@a; @a]`), and avoid tail-only citation style by keeping mid-sentence citations in each H3.
- `section-merger` merges `sections/*.md` plus `outline/transitions.md` (within-chapter H3→H3 by default). Between-H2 transition insertion is optional: create `outline/transitions.insert_h2.ok` in the workspace if you want narrator-style handoffs included.
- Tables are part of the default deliverable: `outline/tables_appendix.md` is inserted into the draft by `section-merger` as a single Appendix block (index tables in `outline/tables_index.md` remain intermediate) unless `outline/tables.insert.off` exists. Other visuals (`outline/timeline.md`, `outline/figures.md`) remain intermediate by default.
- Citation scope policy: citations are subsection-first (from `outline/evidence_bindings.jsonl`), with limited reuse allowed within the same H2 chapter to reduce brittleness; avoid cross-chapter “free cite” drift.
- Controlled flexibility: bibkeys mapped to >= `queries.md:global_citation_min_subsections` subsections (A150++ default: 4) are treated as cross-cutting/global; see `allowed_bibkeys_global` in writer packs / `sections_manifest.jsonl`.
- If global unique citations are low, run `citation-diversifier` → `citation-injector` *before* `draft-polisher` (the polisher treats citation keys as immutable).
- `queries.md` can set `citation_target: recommended|hard` to control whether the recommended target is enforced as blocking (default: `recommended` for A150++).
- If you intentionally add/remove citations after an earlier polish run, reset the citation-anchoring baseline before rerunning `draft-polisher`:
- delete `output/citation_anchors.prepolish.jsonl` (workspace-local), then rerun `draft-polisher`.
- Recommended skills (toolkit, not a rigid one-shot chain):
- Modular drafting: `subsection-writer` → `writer-selfloop` → `section-logic-polisher` → `argument-selfloop` → `paragraph-curator` → `style-harmonizer` → `opener-variator` → `evaluation-anchor-checker` → `transition-weaver` → `section-merger` → `draft-polisher` → `global-reviewer` → `pipeline-auditor`.
- Legacy one-shot drafting: `prose-writer` (kept for quick experiments; less debuggable).
- If the draft reads like “paragraph islands”, run `section-logic-polisher` and patch only failing `sections/S*.md` until PASS, then merge.
- Add `pipeline-auditor` after `global-reviewer` as a regression test (blocks on ellipsis, repeated boilerplate, and citation hygiene).
- If you also need a PDF deliverable, use `latex-scaffold` + `latex-compile-qa` (see `arxiv-survey-latex`).
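The “controlled flexibility” rule in the notes above (a bibkey mapped to at least `global_citation_min_subsections` subsections counts as cross-cutting) can be sketched as a count over the subsection mapping. This is only an illustration: `global_bibkeys` is a hypothetical name, and the input is assumed to be `(sub_id, bibkey)` pairs rather than the real layout of `outline/mapping.tsv`.

```python
from collections import defaultdict


def global_bibkeys(mapping: list[tuple[str, str]], min_subsections: int = 4) -> set[str]:
    """Return bibkeys mapped to at least `min_subsections` distinct subsections."""
    subs_by_key: dict[str, set[str]] = defaultdict(set)
    for sub_id, bibkey in mapping:
        subs_by_key[bibkey].add(sub_id)
    return {key for key, subs in subs_by_key.items() if len(subs) >= min_subsections}


# Made-up mapping: one survey cited in four H3s, one niche paper cited in two.
mapping = [
    ('3.1', 'survey2023'), ('3.2', 'survey2023'), ('4.1', 'survey2023'), ('4.2', 'survey2023'),
    ('3.1', 'niche2024'), ('3.2', 'niche2024'),
    ('3.1', 'survey2023'),  # repeated pair; counted once because subsections are stored in a set
]
cross_cutting = global_bibkeys(mapping)
```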
## Quality gates (strict mode)
- Citation coverage: expect a large, verifiable bibliography (A150++ default: `core_size=300` → `ref.bib` ~300) and high cite density:
- Per-H3: `survey` profile expects >=12 unique citations per H3 (`deep` expects >=14).
- Front matter: `survey` profile expects Introduction>=35 and Related Work>=50 unique citations; `deep` expects Introduction>=40 and Related Work>=55 (dense positioning; no cite dumps).
- Global: `pipeline-auditor` gates on **global unique citations across the full draft**. A150++ defaults: hard `>=150`; recommended `>=165` (when bib=300). `queries.md:citation_target` controls which is blocking (default: `recommended`). If it fails, prefer `citation-diversifier` → `citation-injector` (in-scope, NO NEW FACTS) using each H3’s `allowed_bibkeys_selected` / `allowed_bibkeys_mapped` from `outline/writer_context_packs.jsonl`.
- Anti-template: drafts containing ellipsis placeholders (`…`) or leaked scaffold instructions (e.g., "enumerate 2-4 ...") should block and be regenerated from improved outline/mapping/evidence artifacts.
- Final polish hard gates (`survey`/`deep`): block on narration-template openers (e.g., `This subsection ...`), slide navigation phrasing, repeated opener stems or verbal tics, adjacent citation blocks (`[@a] [@b]`), duplicate keys in one block (`[@a; @a]`), and low H3 mid-sentence citation ratio (<30%).
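The citation-shape gates above (adjacent blocks, duplicate keys in one block, mid-sentence ratio) can be approximated with a few regular expressions. The sketch below is an assumption-laden illustration, not the `pipeline-auditor` code: the helper names are hypothetical, and "mid-sentence" is taken here to mean that the block is not immediately followed by `.`, `!`, `?`, a newline, or end of text.

```python
import re

CITE_BLOCK = re.compile(r"\[@([^\]]+)\]")
ADJACENT_BLOCKS = re.compile(r"\[@[^\]]+\]\s*\[@[^\]]+\]")


def duplicate_key_blocks(text: str) -> list[str]:
    """Return citation blocks that repeat a key, e.g. `[@a; @a]`."""
    flagged = []
    for m in CITE_BLOCK.finditer(text):
        keys = re.findall(r"[A-Za-z0-9:_-]+", m.group(1))
        if len(keys) != len(set(keys)):
            flagged.append(m.group(0))
    return flagged


def adjacent_blocks(text: str) -> list[str]:
    """Return back-to-back citation blocks, e.g. `[@a] [@b]`."""
    return [m.group(0) for m in ADJACENT_BLOCKS.finditer(text)]


def mid_sentence_ratio(text: str) -> float:
    """Share of citation blocks not directly followed by sentence-ending punctuation."""
    blocks = list(CITE_BLOCK.finditer(text))
    if not blocks:
        return 0.0
    mid = 0
    for m in blocks:
        tail = text[m.end():].lstrip(" ")
        if tail and tail[0] not in ".!?\n":
            mid += 1
    return mid / len(blocks)


sample = "Agents plan [@a; @a] with tools. Prior work [@b] [@c] differs [@d]."
print(duplicate_key_blocks(sample), adjacent_blocks(sample), mid_sentence_ratio(sample))
```

A gate built on these helpers would block when either list is non-empty or when the ratio for an H3 falls below 0.30.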
FILE:pipelines/graduate-paper-pipeline.md
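# Pipeline: restructuring and finalizing a Chinese graduation thesis (research-stage draft)
> Status: research-stage draft
> Current positioning: study the workflow and the skills it needs first; **do not bind UNITS yet, and do not force it into an executable contract**
> Target use case: writing a Chinese graduation thesis when a university template, prior papers, Overleaf sources, PDFs, BibTeX, experiment figures/tables, and intermediate working documents already exist
> Core goal: restructure the "existing materials" into a Chinese graduation thesis with a unified storyline, smooth structure, complete evidence, consistent terminology, and a compilable deliverable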
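## 1. What this pipeline really is
This is neither a linear "write a thesis from scratch" workflow nor an assembly workflow that "stitches a few papers together". It is a thesis-engineering pipeline that is **driven by a question list, centered on a Markdown intermediate layer, and closed out in a TeX delivery layer**.
Its actual main loop is:
> `question list -> semantic localization -> Markdown restructuring -> TeX write-back -> compile and review -> write issues back -> enter the next round`
The first goal of this pipeline is therefore not to "rush out a first version of the body text", but to get the following right first:
1. Recover the existing materials first, rather than editing blindly in `tex`.
2. Restructure chapter roles around the thesis storyline first, rather than copying the original papers' narrative.
3. Think through structure, evidence, terminology, and figures/tables in the Markdown intermediate layer first, then write back to TeX.
4. Resolve structure and evidence problems first, then style and AI-flavor removal.
5. The real convergence point is not "one full pass written", but "the question list is emptied round by round, and compilation and review pass stably".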
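## 2. How this pipeline fundamentally differs from the survey pipeline
It differs substantially from `arxiv-survey` / `arxiv-survey-latex` and must be modeled separately:
- the survey pipeline starts from "retrieving literature for a topic and converging the structure layer by layer";
- the graduate-paper pipeline starts from "inventorying and restructuring existing materials";
- the survey pipeline's core intermediate artifacts are `outline/evidence_drafts/...`;
- the graduate-paper pipeline's core intermediate artifacts are the outline, question list, chapter intermediate drafts, figure plan, and review records under `codex_md/`;
- the survey pipeline's main goal is to "produce a survey";
- the graduate-paper pipeline's main goal is to "restructure existing research work into a finalized thesis that conforms to Chinese degree-thesis conventions".
In one sentence:
> survey is closer to a "retrieval-driven evidence-writing workflow";
> graduate-paper is closer to an "existing-material-driven thesis-engineering restructuring workflow".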
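## 3. Workspace layering and key artifacts
This pipeline needs to separate the **thinking layer** from the **delivery layer** explicitly.
### 3.1 Directory responsibilities
- `main.tex`: the thesis entry point; assembles the cover, abstract, table of contents, body, appendices, acknowledgements, achievements, and references.
- `chapters/`: final delivery version of the body `.tex`.
- `abstract/`, `preface/`, `acknowledgement/`, `achievements/`: formal non-body deliverables.
- `references/`: bibliography database and style files.
- `pdf/`: PDFs of published or submitted papers.
- `Overleaf_ref/`: existing sources, revised manuscripts, supplementary materials.
- `codex_md/`: intermediate working layer; this is the **thesis restructuring workspace**.
- `claude_md/`: review checklists and final-draft verification records.
- `tmp_layout/`, `tmp_layout2/`: figure/table trial layouts and page-layout rehearsals.
The most important distinction is:
- `codex_md/` is the **thinking / restructuring area**
- `chapters/` is the **delivery / final-draft area**
Do not mix the two.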
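### 3.2 Intermediate artifacts worth fixing in place
It is recommended to keep the following artifacts fixed over the long term in this pipeline:
- `codex_md/material_index.md`: inventory and index of existing materials
- `codex_md/material_readiness.md`: material readiness and gap hints
- `codex_md/missing_info.md`: list of missing information
- `codex_md/question_list.md`: this round's question list, priorities, and acceptance criteria
- `codex_md/00_thesis_outline.md`: thesis storyline, outline, and chapter roles
- `codex_md/chapter_role_map.md`: source material -> chapter -> role mapping table
- `codex_md/chapter_rewrite_rules.md`: chapter restructuring rules
- `codex_md/terminology_glossary.md`: unified terminology table
- `codex_md/symbol_metric_table.md`: unified symbol / metric table
- `codex_md/figure_plan.md`: figure/table and layout plan
- `claude_md/review_checklist.md`: review checklist for compilation, typesetting, data, and template items
- `output/THESIS_BUILD_REPORT.md`: compilation and final-draft check report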
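## 4. Skills this pipeline actually needs
Setting aside reuse and generalization for now, and splitting by what this thesis pipeline itself needs, the first version should use the following skills.
### 4.1 Main-chain skills
1. `thesis-workspace-init`
   Responsibility: initialize the thesis project, remind the user to place materials, set up the workspace, check whether `main.tex` compiles, and generate the material inventory.
2. `thesis-question-list`
   Responsibility: create and maintain `question_list.md`, fixing each round's revision goals, issue priorities, boundaries, and acceptance criteria.
   This is the **control-plane skill** of the whole pipeline.
3. `thesis-source-role-mapper`
   Responsibility: map existing papers / templates / sources / PDFs / figure materials to "thesis roles", not just `paper -> chapter`.
4. `thesis-chapter-reconstructor`
   Responsibility: restructure each chapter's goal, weight, transitions, and narrative approach around the thesis storyline.
   This is the **core restructuring skill** of the whole pipeline.
5. `thesis-markdown-aligner`
   Responsibility: unify the storyline, terminology, symbols, metrics, and figure/table conventions, so the chapters converge in the Markdown intermediate layer into one thesis rather than a set of papers.
6. `thesis-tex-writeback`
   Responsibility: write content already straightened out in the Markdown layer back to `chapters/*.tex`, and sync figures/tables, equations, cross-references, and chapter transitions.
7. `thesis-compile-review`
   Responsibility: run compilation, warning triage, template-mode checks, data-consistency verification, and issue write-back.
   It forms the final quality loop.
8. `thesis-style-polisher`
   Responsibility: once structure, evidence, and data are stable, polish toward Chinese degree-thesis style and remove AI flavor.
### 4.2 Parallel side-branch skills
9. `thesis-visual-layout-planner`
   Responsibility: figure/table planning, Mermaid sketches, temporary page composition, and text-figure rhythm checks.
10. `thesis-frontmatter-sync`
    Responsibility: sync the abstract, cover, appendices, achievements, acknowledgements, name lists, and other non-body parts, so they are not left to be filled in at the very end.
11. `thesis-citation-enhance-review`
    Responsibility: locate sentences that must be backed by citations, expand candidate references, verify that citations match claims, and write back to `references/*.bib` and the in-text citations.
### 4.3 How these 11 skills relate
The main chain is:
`thesis-workspace-init -> thesis-question-list -> thesis-source-role-mapper -> thesis-chapter-reconstructor -> thesis-markdown-aligner -> thesis-tex-writeback -> thesis-compile-review -> thesis-style-polisher`
The parallel side branches are:
- `thesis-visual-layout-planner`
- `thesis-frontmatter-sync`
- `thesis-citation-enhance-review`
Among them:
- `thesis-question-list` is the control plane for the whole workflow
- `thesis-chapter-reconstructor` is the core restructuring plane
- `thesis-compile-review` is the quality-loop plane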
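## 5. Pipeline overview (by stage)
| Stage | Goal | Core skills | Main outputs |
|---|---|---|---|
| Stage 0 | Project initialization and material intake | `thesis-workspace-init` | Directory skeleton, initial compile, material index, material-readiness hints |
| Stage 1 | Recover existing materials into the Markdown intermediate layer | `thesis-workspace-init` | Initial Markdown materials, missing-information list, material-gap hints |
| Stage 1.5 | Lock this round's issues and boundaries | `thesis-question-list` | `question_list.md` |
| Stage 2 | Map source materials to thesis roles | `thesis-source-role-mapper` | Chapter role map, materials grouped by chapter |
| Stage 2.5 | Restructure chapters around the storyline | `thesis-chapter-reconstructor` | Restructured chapter Markdown |
| Stage 3 | Align, merge, and unify the full-thesis structure | `thesis-markdown-aligner` | Stable outline, terminology table, symbol table, evidence-gap list |
| Stage 3.5 | Figure/table and layout rehearsal | `thesis-visual-layout-planner` | Figure plan, figure-text mapping, temporary layout results |
| Stage 4 | Write back to the TeX delivery layer | `thesis-tex-writeback` | Compilable chapter `.tex`, first complete `main.pdf` |
| Stage 4.5 | Sync non-body parts | `thesis-frontmatter-sync` | Synced versions of the abstract, appendices, cover, achievements, etc. |
| Stage 5 | Citation enhancement and verification | `thesis-citation-enhance-review` | Citation reinforcement results, verification records |
| Stage 6 | Compilation, typesetting, and final-draft review | `thesis-compile-review` | `THESIS_BUILD_REPORT`, review checklist |
| Stage 7 | Final polish and AI-flavor removal | `thesis-style-polisher` | A more natural Chinese final draft |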
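## 6. Stage-by-stage details
### Stage 0: project initialization and material intake
**Goal**
Turn the thesis into a maintainable project first, rather than a pile of scattered files.
**Main inputs**
- University template
- Existing repository / sources
- Basic metadata such as student ID, year, and Chinese and English titles
- Existing `main.tex`
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Basic directory skeleton
- Initial `main.tex` / initial `main.pdf`
- `codex_md/material_index.md`
- `codex_md/material_readiness.md`
**Handoffs**
- If `main.tex` does not compile yet, do not enter the body-restructuring stage
- If materials are not yet in place, do not start the question list or chapter restructuring
**Execution focus**
- Make explicit which areas are material areas, which are workspace, and which are delivery areas
- Ensure "it compiles" first, then worry about "it reads well"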
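### Stage 1: recover existing materials into the Markdown intermediate layer
**Goal**
Recover reusable content from the existing `template / tex / Overleaf / PDF / bib` into a Markdown intermediate layer, rather than hard-editing directly in the TeX layer.
**Main inputs**
- University template
- Existing `.tex`
- Overleaf sources
- PDFs
- Bibliography database
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Initial Markdown chapter materials
- `codex_md/material_readiness.md`
- `codex_md/missing_info.md`
- Material index and list of information still to be supplied
**Handoffs**
- Stage 1 outputs feed directly into `thesis-question-list` and `thesis-source-role-mapper`
**Execution focus**
- The point is a "material asset inventory", not "writing pretty sentences first"
- Explicitly flag experiment details, figure captions, citations, and numeric checks that still need to be supplied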
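### Stage 1.5: lock the question list and this round's goals
**Goal**
Establish this round's control plane and make explicit what this round fixes and what it does not.
**Main inputs**
- Initial Markdown materials
- Feedback from the previous review round
- Current compilation / structure / style issues
**Core skill**
- `thesis-question-list`
**Main outputs**
- `codex_md/question_list.md`
**Handoffs**
- This step is the starting point for all later restructuring and write-back
- Any structural dispute should come back here first to be re-prioritized
**Execution focus**
- Every issue must state clearly: what it is, why it matters, how to fix it, and what level of acceptance is required
- The question list is not a memo; it is the entry point for iteration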
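### Stage 2: map source materials to thesis roles
**Goal**
Map existing materials to chapter roles in the thesis, rather than mechanically assigning "paper to chapter".
**Main inputs**
- Existing papers, PDFs, sources, figures/tables
- Initial Markdown materials
- `question_list.md`
**Core skill**
- `thesis-source-role-mapper`
**Main outputs**
- `codex_md/chapter_role_map.md`
- Markdown materials grouped by chapter
- Content boundaries for each chapter
**Handoffs**
- This is the direct input to `thesis-chapter-reconstructor`
**Execution focus**
- Distinguish: method core / background support / validation evidence / system implementation / material for limitations and future work
- If this is set wrong, the whole thesis will be skewed afterwards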
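### Stage 2.5: restructure chapters around the storyline
**Goal**
Convert the original paper-style narrative into a thesis-style narrative.
**Main inputs**
- Chapter role map
- `question_list.md`
- Markdown drafts for each chapter
**Core skill**
- `thesis-chapter-reconstructor`
**Main outputs**
- Restructured chapter Markdown
- `codex_md/chapter_rewrite_rules.md`
**Handoffs**
- This is the core stage of the whole pipeline
- All later alignment, write-back, and compilation build on the restructuring results produced here
**Execution focus**
- Make explicit what question each chapter must answer
- Decide which content to strengthen, weaken, move up, or push down
- Rewrite the transitions to the preceding and following chapters
- Change the "selling-point narrative" into a "thesis storyline narrative"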
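### Stage 3: Markdown alignment, merging, and unification
**Goal**
Make the whole thesis become "one thesis" in the intermediate layer first, rather than a concatenation of several papers.
**Main inputs**
- Restructured chapter Markdown
- `question_list.md`
**Core skill**
- `thesis-markdown-aligner`
**Main outputs**
- Stable `codex_md/00_thesis_outline.md`
- `codex_md/terminology_glossary.md`
- `codex_md/symbol_metric_table.md`
- Evidence-gap list
**Handoffs**
- While the structure, terminology, or metrics are unstable, do not enter `tex-writeback`
**Execution focus**
- Unify the research storyline, terminology, abbreviations, symbols, metrics, and figure/table conventions
- Decide which content is explained once in Chapter 2, which is developed in Chapters 3-5, and which goes to the appendices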
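### Stage 3.5: figure/table and layout rehearsal
**Goal**
Plan figures/tables and layout before the body is finalized, so that figures do not become a last-minute patch job.
**Main inputs**
- Stable outline
- Chapter Markdown
- Existing figures/tables and experiment results
**Core skill**
- `thesis-visual-layout-planner`
**Main outputs**
- `codex_md/figure_plan.md`
- `mermaid/` diagram sketches
- `tmp_layout/`, `tmp_layout2/` trial layout results
**Handoffs**
- Figure planning results directly shape the figure-text structure in `tex-writeback`
**Execution focus**
- Make explicit which figures/tables each chapter needs to support the storyline
- Sketch first, then do a temporary page composition, then decide final placement and captions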
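### Stage 4: write back to the TeX delivery layer
**Goal**
Write content already straightened out in the Markdown layer back to `chapters/*.tex` and enter the delivery layer.
**Main inputs**
- Stable Markdown chapters
- Figure plan
- Unified terminology / symbol / metric tables
**Core skill**
- `thesis-tex-writeback`
**Main outputs**
- `chapters/*.tex`
- First complete `main.pdf`
**Handoffs**
- If structural problems surface at this stage, fix them in the Markdown layer first as a rule; do not reinvent the structure in the TeX layer
**Execution focus**
- Sync figures/tables, equations, cross-references, chapter-opening introductions, and chapter-end summaries
- The TeX layer is for delivery, not for rethinking the structure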
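### Stage 4.5: sync non-body parts
**Goal**
Maintain the abstract, appendices, cover, achievements, acknowledgements, and other non-body parts in parallel, so they are not postponed to the end and filled in at low quality.
**Main inputs**
- Stable body version
- University template items
- Chinese and English titles and abstract information
**Core skill**
- `thesis-frontmatter-sync`
**Main outputs**
- `abstract/abstract.tex`
- `abstract/abstract-en.tex`
- `preface/...`
- `appendix/...`
- `acknowledgement/...`
- `achievements/...`
**Handoffs**
- The Stage 6 final-draft review checks these parts together with the body
**Execution focus**
- Chinese and English titles and abstracts are consistent
- Numbers, metrics, and method names in the abstract match the body
- Appendices hold supplementary content only and do not repeat the storyline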
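### Stage 5: literature enhancement and citation verification
**Goal**
Systematically fill in "sentences that should be backed by citations" and ensure that citations match the claims they support.
**Main inputs**
- Current Markdown / TeX version
- `references/*.bib`
- Existing PDFs / reference works
**Core skill**
- `thesis-citation-enhance-review`
**Main outputs**
- Enhanced bibliography database
- Citation reinforcement records
- Citation verification records
**Handoffs**
- Literature enhancement can run in parallel with Stages 3 and 4, but must be closed out before the final polish
**Execution focus**
- Prioritize sentences that must have citation support: factual descriptions, method taxonomies, classic conclusions, metric definitions, and data sources
- Get citations right first, then add more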
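### Stage 6: compilation, typesetting, and final-draft review
**Goal**
Close the quality loop: not just "it compiles", but "it is fit to submit".
**Main inputs**
- Complete TeX project
- Synced non-body versions
- review checklist
**Core skill**
- `thesis-compile-review`
**Main outputs**
- `output/THESIS_BUILD_REPORT.md`
- `claude_md/review_checklist.md`
- Closed / to-be-closed issue list
**Handoffs**
- Issues found in Stage 6 determine which level to return to for the fix: Stage 1.5 / 2.5 / 3 / 4
**Execution focus**
- Passing compilation is only the minimum requirement
- Warnings must be triaged: blocks submission / must fix / may be logged and observed
- Also check: data consistency, cite-key validity, correct template parameters, and consistent terminology and abbreviations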
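The three-level warning triage named in the Stage 6 focus list can be sketched as a small log classifier. This is only an illustration under stated assumptions: the rule table, the level names, and which warning lands in which level are hypothetical choices, not something this pipeline prescribes; the matched strings follow the usual wording of LaTeX log messages.

```python
import re

# Hypothetical triage rules as (pattern, level); the mapping is an assumption.
RULES = [
    (re.compile(r"LaTeX Warning: Citation .* undefined"), "blocking"),
    (re.compile(r"LaTeX Warning: Reference .* undefined"), "blocking"),
    (re.compile(r"Overfull \\hbox"), "must-fix"),
    (re.compile(r"Underfull \\[hv]box"), "observe"),
]


def triage(log_text: str) -> dict[str, list[str]]:
    """Bucket log lines into blocking / must-fix / observe by first matching rule."""
    buckets: dict[str, list[str]] = {"blocking": [], "must-fix": [], "observe": []}
    for line in log_text.splitlines():
        for pattern, level in RULES:
            if pattern.search(line):
                buckets[level].append(line.strip())
                break
    return buckets


sample_log = r"""
LaTeX Warning: Citation `key1' on page 3 undefined on input line 42.
Overfull \hbox (12.3pt too wide) in paragraph at lines 10--12
Underfull \vbox (badness 10000) has occurred while \output is active
"""
for level, lines in triage(sample_log).items():
    print(level, len(lines))
```

Keeping the rules in one table makes the triage policy easy to review alongside `claude_md/review_checklist.md`.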
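### Stage 7: final polish and AI-flavor removal
**Goal**
Only after structure, evidence, and data are stable, polish toward Chinese degree-thesis style.
**Main inputs**
- Thesis version that passed compilation and review
- `中文写作要求.md`
- `GPT口癖与高频用词调研.md`
**Core skill**
- `thesis-style-polisher`
**Main outputs**
- A more natural Chinese final draft
- De-templated introductions, summaries, and conclusion-and-outlook sections
**Handoffs**
- Stage 7 must come last
- If polishing exposes structure or evidence problems, go back to an earlier stage instead of polishing over them
**Execution focus**
- Prioritize chapter-opening introductions, chapter-end summaries, contribution statements, and the conclusion and outlook
- Tone down template voice, promotional voice, and common AI verbal tics
- The goal is "reads more like a degree thesis", not "reads more like an AI-optimized paper"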
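## 7. The four fixed loops of this pipeline
To keep it from degrading into a "linear SOP" again, fix the following four loops in place:
### 7.1 Structure loop
`outline -> a chapter's md -> chapter imbalance found -> back to outline / question_list`
### 7.2 Content loop
`paper / Overleaf / PDF -> extract content -> restructure narrative -> still reads like the original paper -> restructure again`
### 7.3 Typesetting loop
`tex -> compile -> warning / figure placement / cross-reference problems found -> back to tex or figure planning`
### 7.4 Style loop
`body stable -> AI polish -> template voice / AI flavor found -> back to writing guidelines and wording control`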
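## 8. Execution priorities for this pipeline
If resources are limited in practice, the priority order should be:
1. `thesis-question-list`
2. `thesis-source-role-mapper`
3. `thesis-chapter-reconstructor`
4. `thesis-markdown-aligner`
5. `thesis-tex-writeback`
6. `thesis-compile-review`
In other words:
- What truly cannot be skipped is "question list + material role mapping + chapter restructuring + compile review"
- What truly should not be done first is "polishing and AI-flavor removal"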
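## 9. Current conclusion
The first principle of this graduate-paper pipeline should be:
> **Restructure first, then write; converge in the intermediate layer first, then deliver in TeX; get evidence and structure right first, then style and polish.**
If this pipeline is later formalized for execution, the top priority is therefore not "one big all-in-one runner", but making the following skills solid first:
- `thesis-workspace-init`
- `thesis-question-list`
- `thesis-source-role-mapper`
- `thesis-chapter-reconstructor`
- `thesis-compile-review`
Once these five skills are solid, this Chinese graduation-thesis pipeline has real practical value.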
FILE:pipelines/idea-brainstorm.pipeline.md
---
name: idea-brainstorm
version: 4.0
profile: idea-brainstorm
routing_hints: [idea, ideation, brainstorm, 点子, 选题, 找方向, 找 idea]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.idea-brainstorm.csv
contract_model: pipeline.frontmatter/v1
query_defaults:
  draft_profile: idea_brainstorm
  max_results: 1800
  core_size: 100
  evidence_mode: abstract
  direction_pool_min: 12
  direction_pool_max: 24
  idea_screen_top_n: 10
  idea_shortlist_size: 5
  report_top_n: 3
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- evidence_mode
- direction_pool_min
- direction_pool_max
- idea_screen_top_n
- idea_shortlist_size
- report_top_n
- time_window.from
- time_window.to
quality_contract:
  memo_bundle:
    required_outputs: [output/REPORT.md, output/APPENDIX.md, output/REPORT.json]
  signal_policy:
    min_rows: 10
  direction_policy:
    shortlist_min: 3
    shortlist_max: 5
    keep_min: 3
    cluster_diversity_min: 2
    lead_diversity_target: 3
    lead_diversity_axes: [cluster, direction_type, program_kind]
  screening_policy:
    keep_rank_max: 7
    maybe_rank_max: 12
    score_weights:
      discussion_worthiness: 0.24
      academic_value: 0.22
      evidence_grounding: 0.18
      direction_distinctness: 0.16
      first_probe_clarity: 0.10
      thesis_potential: 0.10
loop_policy:
  stage_retry_budget:
    C1: 2
    C3: 1
    C4: 1
  max_reroutes: 4
  require_human_on_retry_after_approval: true
stages:
  C0:
    title: Init + idea brief
    mode: no_prose
    required_skills: [workspace-init, pipeline-router, idea-brief, human-checkpoint]
    optional_skills: []
    produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/trace/IDEA_BRIEF.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
    human_checkpoint:
      approve: brainstorm brief
      write_to: DECISIONS.md
  C1:
    title: Retrieval + core set
    mode: no_prose
    required_skills: [literature-engineer, dedupe-rank]
    optional_skills: []
    produces: [papers/papers_raw.jsonl, papers/papers_dedup.jsonl, papers/core_set.csv, papers/retrieval_report.md]
  C2:
    title: Idea landscape / focus
    mode: no_prose
    required_skills: [taxonomy-builder, pipeline-router, human-checkpoint]
    optional_skills: []
    produces: [outline/taxonomy.yml, DECISIONS.md]
    human_checkpoint:
      approve: focus clusters / lenses + exclusions
      write_to: DECISIONS.md
  C3:
    title: Evidence signals
    mode: no_prose
    required_skills: [paper-notes, idea-signal-mapper]
    optional_skills: []
    produces: [papers/paper_notes.jsonl, papers/evidence_bank.jsonl, output/trace/IDEA_SIGNAL_TABLE.md, output/trace/IDEA_SIGNAL_TABLE.jsonl]
  C4:
    title: Direction pool + screening
    mode: short_prose_ok
    required_skills: [idea-direction-generator, idea-screener]
    optional_skills: []
    produces: [output/trace/IDEA_DIRECTION_POOL.md, output/trace/IDEA_DIRECTION_POOL.jsonl, output/trace/IDEA_SCREENING_TABLE.md, output/trace/IDEA_SCREENING_TABLE.jsonl]
  C5:
    title: Shortlist + memo synthesis + self-loop
    mode: prose_allowed
    required_skills: [idea-shortlist-curator, idea-memo-writer, deliverable-selfloop, artifact-contract-auditor]
    optional_skills: []
    produces: [output/trace/IDEA_SHORTLIST.md, output/trace/IDEA_SHORTLIST.jsonl, output/REPORT.md, output/APPENDIX.md, output/REPORT.json, output/DELIVERABLE_SELFLOOP_TODO.md, output/CONTRACT_REPORT.md]
---
# Pipeline: research idea brainstorm (signals -> directions -> memo)
Goal: produce a **discussion-ready research-idea memo** for PI / PhD readers.
This pipeline is for requests such as “找 research idea / brainstorm / 选题 / 找方向” (find a research idea, brainstorm, pick a topic, find a direction).
It is **not** for writing a survey draft and **not** for generating execution-grade project specs.
The terminal deliverable is a single default entrypoint:
- `output/REPORT.md`
That memo should feel like a mature brainstorm artifact:
- grounded in literature,
- small enough to discuss,
- clear about what is promising,
- honest about uncertainty,
- and rich enough to guide the next discussion round.
Artifact policy:
- Reader-facing terminal artifacts:
  - `output/REPORT.md`
  - `output/APPENDIX.md`
  - `output/REPORT.json`
- Trace artifacts stay under `output/trace/`.
- The run is only shareable when both layers exist and pass contract audit.
Default profile:
- Retrieval route: `literature-engineer`
- `core_size=100`
- Evidence mode: `abstract`
- Signal table: 10-20 rows
- Direction pool: 12-24
- Shortlist: 3-5
- Final memo lead directions: 3
## Stage 0 - Init + idea brief (C0)
required_skills:
- workspace-init
- pipeline-router
- idea-brief
- human-checkpoint
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
Notes:
- The brief is the single source of truth for topic, audience, constraints, exclusions, targets, and query buckets.
- Default behavior: block once at C0 for human approval of the brief before retrieval.
## Stage 1 - Retrieval + core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
Notes:
- Use multi-query buckets from `queries.md`.
- If recall is too low/noisy, fix it upstream and rerun C1.
## Stage 2 - Idea landscape / focus (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- pipeline-router
- human-checkpoint
produces:
- outline/taxonomy.yml
- DECISIONS.md
human_checkpoint:
- approve: focus clusters / lenses + exclusions
- write_to: DECISIONS.md
Notes:
- Taxonomy is an idea landscape, not a paper outline.
- Default behavior: pause at C2 so the human can choose a few promising focus lenses.
## Stage 3 - Evidence signals (C3) [table-first]
required_skills:
- paper-notes
- idea-signal-mapper
produces:
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
Notes:
- The signal table should capture tensions, missing pieces, and academically meaningful axes rather than proposal-ready wedges.
- Keep this stage compact and table-first; no long prose.
## Stage 4 - Direction pool + screening (C4)
required_skills:
- idea-direction-generator
- idea-screener
produces:
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
Notes:
- Expansion should generate discussion-worthy research directions, not an operator cartesian product.
- Screening should score discussion value, distinctness, evidence grounding, and thesis potential.
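The screening step can be pictured as a weighted sum over the `score_weights` listed in this pipeline's frontmatter. The sketch below is an assumption-based illustration, not the `idea-screener` implementation: the 0-1 rating scale, the `screening_score` and `tier` helpers, and the reading of `keep_rank_max` / `maybe_rank_max` as rank cut-offs for keep / maybe / drop are all hypothetical.

```python
from __future__ import annotations

# Weights copied from the frontmatter's score_weights block.
SCORE_WEIGHTS = {
    "discussion_worthiness": 0.24,
    "academic_value": 0.22,
    "evidence_grounding": 0.18,
    "direction_distinctness": 0.16,
    "first_probe_clarity": 0.10,
    "thesis_potential": 0.10,
}


def screening_score(ratings: dict[str, float]) -> float:
    """Weighted sum of per-axis ratings (assumed to lie in [0, 1])."""
    missing = set(SCORE_WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    return sum(weight * ratings[axis] for axis, weight in SCORE_WEIGHTS.items())


def tier(rank: int, keep_rank_max: int = 7, maybe_rank_max: int = 12) -> str:
    """Map a 1-based rank to keep / maybe / drop (assumed interpretation)."""
    if rank <= keep_rank_max:
        return "keep"
    if rank <= maybe_rank_max:
        return "maybe"
    return "drop"


example = {
    "discussion_worthiness": 0.8,
    "academic_value": 0.7,
    "evidence_grounding": 0.6,
    "direction_distinctness": 0.5,
    "first_probe_clarity": 0.9,
    "thesis_potential": 0.4,
}
print(round(screening_score(example), 3), tier(8))
```

Because the six weights sum to 1.0, the score stays on the same scale as the individual ratings, which keeps rows in the screening table comparable.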
## Stage 5 - Shortlist + memo synthesis + self-loop (C5)
required_skills:
- idea-shortlist-curator
- idea-memo-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- `output/trace/IDEA_SHORTLIST.md` is an internal convergence layer, not the user-facing final answer.
- `output/REPORT.md` is the terminal deliverable and should read like a discussion-ready research idea brainstorm memo.
- `deliverable-selfloop` should evaluate the full memo bundle plus its trace chain.
FILE:pipelines/lit-snapshot.pipeline.md
---
name: lit-snapshot
version: 2.3
profile: lit-snapshot
routing_hints: [snapshot, 快照, one-page, one page, 48h]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- outline/taxonomy.yml
- outline/outline.yml
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.lit-snapshot.csv
---
# Pipeline: literature snapshot (24-48h) [bullets-first]
Goal: a compact, reader-facing snapshot (`output/SNAPSHOT.md`) with a small but usable paper set and a paper-like structure. This is intentionally lighter than `arxiv-survey*` (no evidence packs, no BibTeX, no LaTeX).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- arxiv-search
- dedupe-rank
optional_skills:
- keyword-expansion
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Snapshot default: aim for a smaller but diverse set (e.g., core_set ~20-40) rather than a survey-scale pool.
- Practical retrieval loop (multi-query, then refine):
  - Start broad with multiple query buckets (synonyms/acronyms/subtopics) and merge the results.
  - If too few: add buckets + widen time window; if too noisy: rewrite keywords + add exclusions; rerun C1.
- If the snapshot feels generic, the fix is almost always upstream: broaden `queries.md` (more synonyms, fewer excludes, wider time window) and rerun C1.
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- outline-builder
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/outline.yml
human_checkpoint:
- approve: scope + outline (snapshot ToC)
- write_to: DECISIONS.md
Notes:
- Paper-like default: prefer fewer, thicker sections. For snapshots, avoid H3 explosion; treat H2 as the main organizing unit.
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent nodes and keep the ToC readable before writing.
## Stage 3 - Snapshot (C3) [SHORT PROSE OK]
required_skills:
- snapshot-writer
- deliverable-selfloop
- artifact-contract-auditor
optional_skills:
- prose-writer
produces:
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- The deliverable is bullets-first and pointer-heavy: every non-trivial claim should attach 1-2 concrete paper pointers from `papers/core_set.csv`.
- Avoid outline narration (e.g., `This section surveys ...`); write content claims + why-it-matters + pointers.
FILE:pipelines/peer-review.pipeline.md
---
name: peer-review
version: 2.3
profile: peer-review
routing_hints: [peer review, review report, referee, 审稿]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PAPER.md
- output/CLAIMS.md
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.peer-review.csv
---
# Pipeline: peer review / referee report
Goal: produce an actionable referee report (`output/REVIEW.md`) that is traceable to the submitted paper’s claims and evidence (no free-floating advice, no invented comparisons).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
## Stage 1 - Claims (C1)
required_skills:
- manuscript-ingest
- claims-extractor
produces:
- output/PAPER.md
- output/CLAIMS.md
Notes:
- Input expectation: you must have the manuscript text available as `output/PAPER.md` (or equivalent extracted text) before running this stage.
- Contract: every claim must include a source pointer (section/page/quote) so later critique is auditable.
## Stage 2 - Evidence audit (C2)
required_skills:
- evidence-auditor
- novelty-matrix
produces:
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
Notes:
- Evidence-auditor should write gaps/risks/verification steps, not “fix the paper”.
- Novelty-matrix should be conservative: compare against cited/known related work; if you lack sources, record that limitation explicitly.
## Stage 3 - Rubric write-up (C3)
required_skills:
- rubric-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Prefer concrete, minimal fixes (what experiment/ablation/analysis would resolve the concern) over generic “needs more experiments”.
FILE:pipelines/systematic-review.pipeline.md
---
name: systematic-review
version: 2.3
profile: systematic-review
routing_hints: [systematic review, systematic, prisma, 系统综述]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PROTOCOL.md
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/screening_log.csv
- papers/extraction_table.csv
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.systematic-review.csv
---
# Pipeline: systematic review (PRISMA-style)
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Protocol (C1)
required_skills:
- protocol-writer
produces:
- output/PROTOCOL.md
human_checkpoint:
- approve: protocol locked (query, inclusion/exclusion, databases, time window)
- write_to: DECISIONS.md
## Stage 2 - Retrieval & candidate pool (C2)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- arxiv-search
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Systematic-review contract: keep the candidate pool auditable. Prefer dedupe (good) over aggressive ranking/filtering (dangerous).
- `dedupe-rank` default: for this pipeline, it keeps the full deduped pool in `papers/core_set.csv` unless you explicitly set `queries.md:core_size` (screening should not silently drop papers).
## Stage 3 - Screening (C3)
required_skills:
- screening-manager
produces:
- papers/screening_log.csv
Notes:
- Use `papers/papers_dedup.jsonl` (or `papers/core_set.csv`) as the candidate list; `papers/screening_log.csv` must include protocol-grounded reasons for every decision.
- Practical auditability: number protocol clauses (`I1..` / `E1..`) and require screening rows to cite them (e.g., `reason_codes=E3`).
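The auditability rule above (every screening row cites numbered protocol clauses) lends itself to a mechanical check. The following is a minimal sketch under assumptions: only the `reason_codes` column name comes from the note above, while the `paper_id` column, the `;` separator for multiple codes, and the helper name are hypothetical.

```python
from __future__ import annotations

import csv
import io


def rows_missing_protocol_reason(csv_text: str, protocol_codes: set[str]) -> list[str]:
    """Return paper ids whose reason_codes are empty or cite an unknown clause."""
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row.get("reason_codes") or ""
        codes = [c.strip() for c in raw.split(";") if c.strip()]
        if not codes or any(code not in protocol_codes for code in codes):
            flagged.append(row.get("paper_id") or "")
    return flagged


# Illustrative data only: clause ids and rows are made up for the example.
protocol = {"I1", "I2", "E1", "E2", "E3"}
log = (
    "paper_id,decision,reason_codes\n"
    "P1,exclude,E3\n"
    "P2,include,I1;I2\n"
    "P3,exclude,\n"
    "P4,exclude,E9\n"
)
print(rows_missing_protocol_reason(log, protocol))
```

Running a check like this before C4 would surface decisions that cannot be traced back to `output/PROTOCOL.md`.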
## Stage 4 - Extraction (C4)
required_skills:
- extraction-form
- bias-assessor
produces:
- papers/extraction_table.csv
Notes:
- Extraction is schema-driven: do not write narrative synthesis here; keep all fields consistent and fill bias fields (low/unclear/high + notes).
## Stage 5 - Synthesis (C5) [PROSE ALLOWED]
required_skills:
- synthesis-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
FILE:pipelines/tutorial.pipeline.md
---
name: tutorial
version: 2.3
profile: tutorial
routing_hints: [tutorial, 教程]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/TUTORIAL_SPEC.md
- outline/concept_graph.yml
- outline/module_plan.yml
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.tutorial.csv
---
# Pipeline: tutorial (teaching loop)
Goal: a tutorial deliverable (`output/TUTORIAL.md`) that has a consistent running example and a real teaching loop (objectives -> steps -> exercises -> verification), not just a blog-style explanation.
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Spec (C1)
required_skills:
- tutorial-spec
produces:
- output/TUTORIAL_SPEC.md
Notes:
- The spec is the scope contract: audience, prerequisites, learning objectives, and a running example (if applicable).
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- concept-graph
- module-planner
- exercise-builder
produces:
- outline/concept_graph.yml
- outline/module_plan.yml
human_checkpoint:
- approve: target audience + scope + running example
- write_to: DECISIONS.md
Notes:
- Treat `outline/module_plan.yml` as the execution contract for writing: modules must have concrete outputs and at least one verifiable exercise each.
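The contract in the note above (each module has concrete outputs and at least one verifiable exercise) can be checked once the plan is loaded. This sketch assumes a hypothetical schema: a list of module dicts with `id`, `outputs`, and `exercises` keys; the actual layout of `outline/module_plan.yml` is not specified here.

```python
from __future__ import annotations


def module_plan_problems(modules: list[dict]) -> list[str]:
    """List contract violations for an already-parsed module plan (assumed schema)."""
    problems = []
    for mod in modules:
        mod_id = str(mod.get("id") or "?")
        if not mod.get("outputs"):
            problems.append(f"{mod_id}: no concrete outputs")
        if not mod.get("exercises"):
            problems.append(f"{mod_id}: no verifiable exercise")
    return problems


# Illustrative plan; module ids and fields are placeholders.
plan = [
    {"id": "M1", "outputs": ["working example"], "exercises": ["extend the example"]},
    {"id": "M2", "outputs": [], "exercises": ["quiz"]},
    {"id": "M3", "outputs": ["notes"]},
]
for problem in module_plan_problems(plan):
    print(problem)
```

An empty result would mean the plan satisfies this part of the contract and writing can start.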
## Stage 3 - Writing (C3) [PROSE ALLOWED]
required_skills:
- tutorial-module-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Write only within the approved scope; if you discover missing prerequisites, record them as a follow-up instead of expanding scope silently.
FILE:scripts/run.py
from __future__ import annotations

import argparse
import math
import re
import sys
from pathlib import Path
from typing import Any


def _norm_title(x: str) -> str:
    x = re.sub(r"\s+", " ", (x or "").strip()).lower()
    x = re.sub(r"[^a-z0-9]+", " ", x).strip()
    return x


def _extract_cites(text: str) -> list[str]:
    keys: list[str] = []
    for m in re.finditer(r"\[@([^\]]+)\]", text or ""):
        inside = (m.group(1) or "").strip()
        for k in re.findall(r"[A-Za-z0-9:_-]+", inside):
            if k and k not in keys:
                keys.append(k)
    return keys


def _split_h3_chunks(md: str) -> list[tuple[str, str]]:
    """Parse H3 chunks from a merged draft.

    Treat new H2 headings as boundaries so later global sections/tables don't get
    attributed to the previous H3.
    """
    chunks: list[tuple[str, str]] = []
    cur_title = ""
    cur_lines: list[str] = []
    for raw in (md or "").splitlines():
        if raw.startswith("### "):
            if cur_title:
                chunks.append((cur_title, "\n".join(cur_lines).strip()))
            cur_title = raw[4:].strip()
            cur_lines = []
            continue
        if raw.startswith("## "):
            if cur_title:
                chunks.append((cur_title, "\n".join(cur_lines).strip()))
            cur_title = ""
            cur_lines = []
            continue
        if cur_title:
            cur_lines.append(raw)
    if cur_title:
        chunks.append((cur_title, "\n".join(cur_lines).strip()))
    return chunks


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--workspace", required=True)
    parser.add_argument("--unit-id", default="")
    parser.add_argument("--inputs", default="")
    parser.add_argument("--outputs", default="")
    parser.add_argument("--checkpoint", default="")
    args = parser.parse_args()
    repo_root = Path(__file__).resolve()
    for _ in range(10):
        if (repo_root / "AGENTS.md").exists():
            break
        parent = repo_root.parent
        if parent == repo_root:
            break
        repo_root = parent
    sys.path.insert(0, str(repo_root))
    from tooling.common import atomic_write_text, load_yaml, parse_semicolon_list, read_jsonl
    from tooling.quality_gate import _citation_target, _draft_profile, _pipeline_profile
    workspace = Path(args.workspace).resolve()
    inputs = parse_semicolon_list(args.inputs) or [
        "output/DRAFT.md",
        "outline/outline.yml",
        "outline/writer_context_packs.jsonl",
        "citations/ref.bib",
    ]
    outputs = parse_semicolon_list(args.outputs) or ["output/CITATION_BUDGET_REPORT.md"]
    draft_rel = next((p for p in inputs if p.endswith("DRAFT.md")), "output/DRAFT.md")
    outline_rel = next((p for p in inputs if p.endswith("outline.yml")), "outline/outline.yml")
    packs_rel = next((p for p in inputs if p.endswith("writer_context_packs.jsonl")), "outline/writer_context_packs.jsonl")
    bib_rel = next((p for p in inputs if p.endswith("ref.bib")), "citations/ref.bib")
    out_rel = outputs[0] if outputs else "output/CITATION_BUDGET_REPORT.md"
    out_path = workspace / out_rel
    draft_path = workspace / draft_rel
    outline_path = workspace / outline_rel
    packs_path = workspace / packs_rel
    bib_path = workspace / bib_rel
    problems: list[str] = []
    if not draft_path.exists() or draft_path.stat().st_size == 0:
        problems.append(f"missing `{draft_rel}`")
    if not outline_path.exists() or outline_path.stat().st_size == 0:
        problems.append(f"missing `{outline_rel}`")
    if not packs_path.exists() or packs_path.stat().st_size == 0:
        problems.append(f"missing `{packs_rel}`")
    if not bib_path.exists() or bib_path.stat().st_size == 0:
        problems.append(f"missing `{bib_rel}`")
    if problems:
        report = "\n".join(
            [
                "# Citation budget report",
                "",
                "- Status: FAIL",
                "- Reason:",
                *[f" - {p}" for p in problems],
                "",
            ]
        )
        atomic_write_text(out_path, report)
        return 2
draft = draft_path.read_text(encoding="utf-8", errors="ignore")
outline = load_yaml(outline_path) if outline_path.exists() else []
packs = [r for r in read_jsonl(packs_path) if isinstance(r, dict)]
bib_text = bib_path.read_text(encoding="utf-8", errors="ignore")
bib_keys = set(re.findall(r"(?im)^@\w+\s*\{\s*([^,\s]+)\s*,", bib_text))
packs_by_sub: dict[str, dict[str, Any]] = {}
for rec in packs:
sid = str(rec.get("sub_id") or "").strip()
if sid:
packs_by_sub[sid] = rec
# Outline order + title mapping.
outline_order: list[str] = []
title_to_sid: dict[str, str] = {}
sid_to_title: dict[str, str] = {}
if isinstance(outline, list):
for sec in outline:
if not isinstance(sec, dict):
continue
for sub in sec.get("subsections") or []:
if not isinstance(sub, dict):
continue
sid = str(sub.get("id") or "").strip()
title = str(sub.get("title") or "").strip()
if not sid or not title:
continue
outline_order.append(sid)
title_to_sid[_norm_title(title)] = sid
sid_to_title[sid] = title
# Citations used in draft by H3.
found: dict[str, set[str]] = {}
unknown_h3: list[str] = []
for h3_title, body in _split_h3_chunks(draft):
sid = title_to_sid.get(_norm_title(h3_title))
if not sid:
unknown_h3.append(h3_title)
continue
found[sid] = set(_extract_cites(body))
used_global = set(_extract_cites(draft))
profile = _pipeline_profile(workspace)
draft_profile = _draft_profile(workspace)
citation_target = _citation_target(workspace)
# Global unique-citation targets: hard minimum vs recommended target.
# - Hard floor (A150++ survey): >=150 unique citations.
# - Recommended target: encourage using a larger fraction of the bib (e.g., ~55% when bib is large).
min_unique_hard = 0
min_unique_rec = 0
min_unique_struct = 0
min_unique_frac = 0
if profile == "arxiv-survey" and outline_order and bib_keys:
h3_n = len(set(outline_order))
floor = 0
if draft_profile == "deep":
per_h3 = 16
base = 40
frac = 0.60
floor = 165
else:
per_h3 = 14
base = 35
# Hard fraction: 50% of the bib, with a floor of 150 unique citations; the ~55% recommended target is applied separately below.
frac = 0.50
floor = 150
min_unique_struct = base + per_h3 * h3_n
min_unique_frac = int(len(bib_keys) * frac)
min_unique_hard = max(min_unique_struct, min_unique_frac, floor)
min_unique_hard = min(min_unique_hard, len(bib_keys))
min_unique_rec = min_unique_hard
if draft_profile != "deep" and bib_keys:
min_unique_rec = max(min_unique_rec, int(len(bib_keys) * 0.55))
min_unique_rec = min(min_unique_rec, len(bib_keys))
gap_hard = max(0, int(min_unique_hard) - len(used_global)) if min_unique_hard else 0
gap_rec = max(0, int(min_unique_rec) - len(used_global)) if min_unique_rec else 0
target_policy = min_unique_hard
gap_policy = gap_hard
if citation_target == "recommended" and min_unique_rec:
target_policy = min_unique_rec
gap_policy = gap_rec
# Size the per-H3 budget against the *policy target* so the report can be used
# as an execution contract (not just a diagnostic).
desired_gap = gap_policy
# Dynamic per-H3 suggestion size so the budget can actually close the gap.
h3_n = max(1, len(set(outline_order))) if outline_order else 1
budget_per_h3 = 0
if desired_gap > 0:
budget_per_h3 = int(math.ceil(desired_gap / h3_n))
budget_per_h3 = max(6, min(12, budget_per_h3))
# Suggest per-H3 unused keys (prefer selected -> mapped -> chapter).
rows: list[dict[str, Any]] = []
suggested_anywhere: set[str] = set()
for sid in outline_order:
pack = packs_by_sub.get(sid) or {}
title = str(pack.get("title") or sid_to_title.get(sid) or sid).strip()
used_h3 = found.get(sid, set())
sel = [str(k).strip() for k in (pack.get("allowed_bibkeys_selected") or []) if str(k).strip()]
mapped = [str(k).strip() for k in (pack.get("allowed_bibkeys_mapped") or []) if str(k).strip()]
chapter = [str(k).strip() for k in (pack.get("allowed_bibkeys_chapter") or []) if str(k).strip()]
glob = [str(k).strip() for k in (pack.get("allowed_bibkeys_global") or []) if str(k).strip()]
def _cands(keys: list[str]) -> list[str]:
out: list[str] = []
for k in keys:
if bib_keys and k not in bib_keys:
continue
if k in used_global:
continue
if k in out:
continue
out.append(k)
return out
cand_sel = _cands(sel)
cand_map = _cands([k for k in mapped if k not in cand_sel])
cand_ch = _cands([k for k in chapter if k not in cand_sel and k not in cand_map])
cand_glob = _cands([k for k in glob if k not in cand_sel and k not in cand_map and k not in cand_ch])
suggest: list[str] = []
want = budget_per_h3 or 6
# Prefer globally-unused AND budget-unique keys first (maximize global unique lift).
for pool in (cand_sel, cand_map, cand_ch, cand_glob):
for k in pool:
if k in suggested_anywhere:
continue
suggest.append(k)
suggested_anywhere.add(k)
if len(suggest) >= want:
break
if len(suggest) >= want:
break
# If the local pool is too small, allow repeats across subsections as a last resort
# (still globally-unused in the draft).
if len(suggest) < want:
for pool in (cand_sel, cand_map, cand_ch, cand_glob):
for k in pool:
if k in suggest:
continue
suggest.append(k)
if len(suggest) >= want:
break
if len(suggest) >= want:
break
rows.append(
{
"sub_id": sid,
"title": title,
"unique_cites_in_h3": len(used_h3),
"unused_global_in_selected": len(cand_sel),
"unused_global_in_mapped": len(cand_map),
"suggest_keys": suggest,
}
)
status = "PASS" if gap_policy <= 0 else "FAIL"
lines: list[str] = [
"# Citation budget report",
"",
f"- Status: {status}",
f"- Draft: `{draft_rel}`",
f"- Bib entries: {len(bib_keys)}",
f"- Draft unique citations: {len(used_global)}",
f"- Draft profile: `{draft_profile}`",
f"- Citation target policy: `{citation_target}`",
"",
]
# Canonical, parseable lines for downstream validators (citation-injector):
# - `Global target ... >= N` (policy target; blocking)
# - `Gap: X` (gap-to-target; blocking)
if min_unique_hard or min_unique_rec:
hard_details = f"(struct={min_unique_struct}, hard_frac={min_unique_frac}, bib={len(bib_keys)})"
lines.append(f"- Global target (policy; blocking): >= {target_policy} {hard_details}")
lines.append(f"- Gap: {gap_policy}")
# Preserve both numbers for transparency/debugging.
lines.append(f"- Global hard minimum: >= {min_unique_hard} {hard_details}")
if min_unique_rec and min_unique_rec >= min_unique_hard:
rec_frac = int(len(bib_keys) * 0.55)
lines.append(f"- Global recommended target: >= {min_unique_rec} (rec_frac={rec_frac}, bib={len(bib_keys)})")
lines.append(f"- Gap to recommended: {gap_rec}")
if budget_per_h3:
lines.append(f"- Suggested keys per H3 (sizing): {budget_per_h3}")
lines.append("")
if unknown_h3:
lines.append("## Unmatched H3 headings")
lines.extend([f"- {t}" for t in unknown_h3[:12]])
lines.append("")
lines.append("## Per-H3 suggestions (unused global keys, in-scope)")
lines.append("")
add_hint = "add 3-6" if not budget_per_h3 else f"add {max(3, min(6, budget_per_h3))}-{budget_per_h3}"
lines.append(f"| H3 | title | unique cites | unused in selected | unused in mapped | suggested keys ({add_hint}) |")
lines.append("|---|---|---:|---:|---:|---|")
for r in rows:
sug = ", ".join([f"`{k}`" for k in (r["suggest_keys"] or [])]) or "(none)"
lines.append(
f"| {r['sub_id']} | {r['title']} | {r['unique_cites_in_h3']} | {r['unused_global_in_selected']} | {r['unused_global_in_mapped']} | {sug} |"
)
lines.extend(
[
"",
"## How to apply (NO NEW FACTS)",
"",
"- Prefer cite-embedding edits that do not change claims (paraphrase; avoid repeated stems):",
" - Axis-anchored exemplars: `... as seen in X [@a] and Y [@b] ...; Z [@c] illustrates a contrasting design point.`",
" - Parenthetical grounding (low risk): `... (e.g., X [@a], Y [@b], Z [@c]).`",
" - Contrast pointer: `While some systems emphasize <A> (X [@a]; Y [@b]), others emphasize <B> (Z [@c]).`",
"- Avoid budget-dump voice (high-signal automation tells): `Representative systems include ...`, `Notable lines of work include ...`.",
"- Keep additions inside the same H3 (no cross-subsection citation drift).",
"- Apply via `citation-injector` (LLM-first) and then rerun: `draft-polisher` → `global-reviewer` → `pipeline-auditor`.",
]
)
atomic_write_text(out_path, "\n".join(lines).rstrip() + "\n")
return 0
if __name__ == "__main__":
raise SystemExit(main())
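The target arithmetic in `main()` above can be read in isolation. The following is a minimal sketch, assuming the `arxiv-survey` profile with a non-empty outline and bib; `citation_targets` and its parameter names are hypothetical and not part of the script, and the constants are copied from the branches above.

```python
import math


def citation_targets(
    bib_n: int,
    h3_n: int,
    used_n: int,
    *,
    deep: bool = False,
    policy: str = "hard",
) -> dict[str, int]:
    """Hypothetical helper mirroring the arxiv-survey target arithmetic in main()."""
    # Profile constants copied from the script: (per_h3, base, frac, floor).
    per_h3, base, frac, floor = (16, 40, 0.60, 165) if deep else (14, 35, 0.50, 150)
    struct = base + per_h3 * h3_n
    # Hard minimum: the largest of the three bounds, capped by the bib size.
    hard = min(max(struct, int(bib_n * frac), floor), bib_n)
    # Recommended target: only non-deep drafts are raised to ~55% of the bib.
    rec = hard if deep else max(hard, int(bib_n * 0.55))
    rec = min(rec, bib_n)
    gap_hard = max(0, hard - used_n)
    gap_rec = max(0, rec - used_n)
    gap_policy = gap_rec if policy == "recommended" else gap_hard
    # Per-H3 suggestion size is clamped to 6..12 whenever a gap exists.
    budget = max(6, min(12, math.ceil(gap_policy / max(1, h3_n)))) if gap_policy > 0 else 0
    return {
        "hard": hard,
        "recommended": rec,
        "gap_hard": gap_hard,
        "gap_recommended": gap_rec,
        "budget_per_h3": budget,
    }


# A 300-entry bib, 8 H3 subsections, and 110 unique keys already cited.
targets = citation_targets(bib_n=300, h3_n=8, used_n=110)
print(targets)
```

Because the per-H3 size is clamped to at most 12, a large gap spread over few subsections may not close in a single pass; the report's `Gap` line remains the blocking number.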
FILE:tooling/__init__.py
FILE:tooling/common.py
from __future__ import annotations
import csv
import json
import os
import re
import shutil
import tempfile
from dataclasses import dataclass
from datetime import date, datetime
from pathlib import Path
from typing import Any, Iterable
import yaml
def today_iso() -> str:
return date.today().isoformat()
def now_iso_seconds() -> str:
return datetime.now().replace(microsecond=0).isoformat()
def ensure_dir(path: Path) -> None:
path.mkdir(parents=True, exist_ok=True)
def atomic_write_text(path: Path, content: str) -> None:
ensure_dir(path.parent)
fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
try:
with os.fdopen(fd, "w", encoding="utf-8") as handle:
handle.write(content)
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except OSError:
pass
raise
def backup_existing(path: Path) -> Path:
"""Rename an existing file to a timestamped `.bak.*` sibling and return the backup path."""
if not path.exists():
return path
stamp = datetime.now().replace(microsecond=0).isoformat().replace("-", "").replace(":", "")
backup = path.with_name(f"{path.name}.bak.{stamp}")
counter = 1
while backup.exists():
backup = path.with_name(f"{path.name}.bak.{stamp}.{counter}")
counter += 1
path.replace(backup)
return backup
def parse_semicolon_list(value: str | None) -> list[str]:
if not value:
return []
return [item.strip() for item in value.split(";") if item.strip()]
def normalize_title_for_dedupe(title: str) -> str:
title = title.lower()
title = re.sub(r"[^a-z0-9]+", " ", title)
title = re.sub(r"\s+", " ", title).strip()
return title
def normalize_axis_label(text: str) -> str:
text = re.sub(r"\s+", " ", (text or "").strip().lower())
text = text.rstrip(" .;:,;。")
text = re.sub(r"\s*/\s*", " ", text)
text = re.sub(r"[^a-z0-9]+", " ", text)
return text.strip()
def subsection_brief_generic_axis_norms() -> set[str]:
"""Axis labels that strict survey gates treat as scaffold-level defaults.
Keep this aligned across brief generation and quality-gate checks so the
generator does not promote an axis that the gate later classifies as generic.
"""
axes = {
"core mechanism and system architecture",
"training and data setup",
"evaluation protocol",
"evaluation protocol (benchmarks / metrics / human)",
"evaluation protocol (datasets / metrics / human)",
"evaluation protocol (datasets, metrics, human evaluation)",
"compute and efficiency",
"compute and latency constraints",
"efficiency and compute",
"tool interface contract (schemas / protocols)",
"tool selection / routing policy",
"sandboxing / permissions / observability",
"failure modes and limitations",
}
return {normalize_axis_label(axis) for axis in axes}
def read_jsonl(path: Path) -> list[dict[str, Any]]:
records: list[dict[str, Any]] = []
if not path.exists():
return records
with path.open("r", encoding="utf-8") as handle:
for line in handle:
line = line.strip()
if not line:
continue
records.append(json.loads(line))
return records
def write_jsonl(path: Path, records: Iterable[dict[str, Any]]) -> None:
ensure_dir(path.parent)
lines = [json.dumps(record, ensure_ascii=False) for record in records]
atomic_write_text(path, "\n".join(lines) + ("\n" if lines else ""))
def read_tsv(path: Path) -> list[dict[str, str]]:
if not path.exists():
return []
with path.open("r", encoding="utf-8", newline="") as handle:
reader = csv.DictReader(handle, delimiter="\t")
return [dict(row) for row in reader]
def write_tsv(path: Path, rows: list[dict[str, Any]], fieldnames: list[str]) -> None:
ensure_dir(path.parent)
fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
try:
with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
writer.writeheader()
for row in rows:
writer.writerow({key: row.get(key, "") for key in fieldnames})
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except OSError:
pass
raise
@dataclass(frozen=True)
class UnitsTable:
fieldnames: list[str]
rows: list[dict[str, str]]
@staticmethod
def load(path: Path) -> "UnitsTable":
with path.open("r", encoding="utf-8", newline="") as handle:
reader = csv.DictReader(handle)
fieldnames = list(reader.fieldnames or [])
rows = [dict(row) for row in reader]
return UnitsTable(fieldnames=fieldnames, rows=rows)
def save(self, path: Path) -> None:
ensure_dir(path.parent)
fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
try:
with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
writer = csv.DictWriter(handle, fieldnames=self.fieldnames)
writer.writeheader()
for row in self.rows:
writer.writerow({key: row.get(key, "") for key in self.fieldnames})
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except OSError:
pass
raise
def load_yaml(path: Path) -> Any:
with path.open("r", encoding="utf-8") as handle:
return yaml.safe_load(handle)
def dump_yaml(path: Path, data: Any) -> None:
ensure_dir(path.parent)
text = yaml.safe_dump(
data,
allow_unicode=True,
sort_keys=False,
default_flow_style=False,
width=120,
)
atomic_write_text(path, text)
def copy_tree(src_dir: Path, dst_dir: Path, *, overwrite: bool) -> None:
if not src_dir.is_dir():
raise ValueError(f"Template directory not found: {src_dir}")
ensure_dir(dst_dir)
for src_path in src_dir.rglob("*"):
rel = src_path.relative_to(src_dir)
dst_path = dst_dir / rel
if src_path.is_dir():
ensure_dir(dst_path)
continue
ensure_dir(dst_path.parent)
if dst_path.exists() and not overwrite:
continue
shutil.copy2(src_path, dst_path)
def tokenize(text: str) -> list[str]:
text = text.lower()
text = re.sub(r"[^a-z0-9]+", " ", text)
return [token for token in text.split() if token]
_EN_STOPWORDS = {
"a",
"an",
"and",
"are",
"as",
"at",
"be",
"by",
"for",
"from",
"in",
"is",
"it",
"of",
"on",
"or",
"that",
"the",
"this",
"to",
"with",
"we",
"our",
"via",
"towards",
"toward",
"using",
"use",
"based",
"new",
"towards",
"into",
"over",
"under",
"between",
"within",
"without",
"beyond",
}
_GENERIC_PAPER_WORDS = {
"survey",
"review",
"tutorial",
"paper",
"approach",
"method",
"methods",
"model",
"models",
"framework",
"frameworks",
"system",
"systems",
"learning",
"deep",
"neural",
"network",
"networks",
"analysis",
"benchmark",
"benchmarks",
"dataset",
"datasets",
"evaluation",
"evaluating",
"towards",
"using",
"based",
"study",
"studies",
}
def candidate_keywords(titles: Iterable[str], *, top_k: int, min_freq: int) -> list[str]:
freq: dict[str, int] = {}
for title in titles:
for token in tokenize(title):
if token in _EN_STOPWORDS or token in _GENERIC_PAPER_WORDS:
continue
if len(token) < 3:
continue
freq[token] = freq.get(token, 0) + 1
candidates = [t for t, c in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])) if c >= min_freq]
return candidates[:top_k]
def update_status_log(status_path: Path, line: str) -> None:
ensure_dir(status_path.parent)
if status_path.exists():
existing = status_path.read_text(encoding="utf-8")
else:
existing = "# Status\n"
if "## Run log" not in existing:
existing = existing.rstrip() + "\n\n## Run log\n"
updated = existing.rstrip() + f"\n- {line}\n"
atomic_write_text(status_path, updated)
def update_status_field(status_path: Path, heading: str, value: str) -> None:
heading_line = f"## {heading}".strip()
bullet_line = f"- `{value}`"
if status_path.exists():
lines = status_path.read_text(encoding="utf-8").splitlines()
else:
lines = ["# Status"]
out: list[str] = []
i = 0
updated = False
while i < len(lines):
line = lines[i]
out.append(line)
if line.strip() == heading_line:
if i + 1 < len(lines) and lines[i + 1].lstrip().startswith("-"):
out.append(bullet_line)
i += 2
updated = True
continue
out.append(bullet_line)
updated = True
i += 1
if not updated:
out.extend(["", heading_line, bullet_line])
atomic_write_text(status_path, "\n".join(out).rstrip() + "\n")
def decisions_has_approval(decisions_path: Path, checkpoint: str) -> bool:
if not checkpoint:
return False
if not decisions_path.exists():
return False
text = decisions_path.read_text(encoding="utf-8")
pattern = rf"^\s*-\s*\[[xX]\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b"
return re.search(pattern, text, flags=re.MULTILINE) is not None
def ensure_decisions_approval_checklist(decisions_path: Path) -> None:
if decisions_path.exists():
text = decisions_path.read_text(encoding="utf-8")
else:
text = "# Decisions log\n"
if re.search(r"^##\s+Approvals\b", text, flags=re.MULTILINE):
return
workspace = decisions_path.parent
checkpoints = _human_checkpoints_from_units(workspace)
if not checkpoints:
return
checklist_lines = ["## Approvals (check to unblock)"]
for checkpoint in checkpoints:
hint = _approval_hint(checkpoint)
suffix = f" ({hint})" if hint else ""
checklist_lines.append(f"- [ ] Approve {checkpoint}{suffix}")
checklist_lines.append("")
checklist = "\n".join(checklist_lines)
lines = text.splitlines()
if lines and lines[0].startswith("#"):
new_text = "\n".join([lines[0], "", checklist] + lines[1:]).rstrip() + "\n"
else:
new_text = (checklist + "\n" + text).rstrip() + "\n"
atomic_write_text(decisions_path, new_text)
def set_decisions_approval(decisions_path: Path, checkpoint: str, *, approved: bool) -> None:
    checkpoint = checkpoint.strip()
    if not checkpoint:
        raise ValueError("checkpoint must be non-empty")
    ensure_decisions_approval_checklist(decisions_path)
    # The checklist helper returns without writing when UNITS.csv lists no HUMAN
    # checkpoints, so the decisions file may still be missing at this point.
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n"
    lines = text.splitlines()
    pattern = re.compile(rf"^\s*-\s*\[\s*[xX ]\s*\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b")
    updated = False
    for idx, line in enumerate(lines):
        if pattern.search(line):
            # Flip only the first checkbox on the matching line.
            lines[idx] = re.sub(
                r"\[\s*[xX ]\s*\]",
                "[x]" if approved else "[ ]",
                line,
                count=1,
            )
            updated = True
            break
    if not updated:
        # No existing entry: append one under the Approvals heading, creating it if needed.
        insert_at = None
        for idx, line in enumerate(lines):
            if line.strip().startswith("## Approvals"):
                insert_at = idx + 1
                break
        if insert_at is None:
            lines.append("")
            lines.append("## Approvals (check to unblock)")
            insert_at = len(lines)
        lines.insert(insert_at, f"- [{'x' if approved else ' '}] Approve {checkpoint}")
    atomic_write_text(decisions_path, "\n".join(lines).rstrip() + "\n")
def _human_checkpoints_from_units(workspace: Path) -> list[str]:
units_path = workspace / "UNITS.csv"
if not units_path.exists():
return []
try:
table = UnitsTable.load(units_path)
except Exception:
return []
seen: set[str] = set()
out: list[str] = []
for row in table.rows:
owner = (row.get("owner") or "").strip().upper()
if owner != "HUMAN":
continue
checkpoint = (row.get("checkpoint") or "").strip()
if checkpoint and checkpoint not in seen:
seen.add(checkpoint)
out.append(checkpoint)
return out
def _approval_hint(checkpoint: str) -> str:
hints = {
"C0": "kickoff: scope/sources/time window/constraints",
"C1": "retrieval + core set",
"C2": "scope + outline",
"C3": "evidence ready",
"C4": "citations verified",
"C5": "allow prose writing",
}
return hints.get(checkpoint, "")
def upsert_checkpoint_block(decisions_path: Path, checkpoint: str, markdown_block: str) -> None:
    begin = f"<!-- BEGIN CHECKPOINT:{checkpoint} -->"
    end = f"<!-- END CHECKPOINT:{checkpoint} -->"
    block = "\n".join([begin, markdown_block.rstrip(), end, ""]).rstrip() + "\n"
    ensure_decisions_approval_checklist(decisions_path)
    # The checklist helper may return without creating the file (no HUMAN checkpoints),
    # so read it only if it exists and otherwise start from a fresh log.
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n\n"
    pattern = re.compile(
        rf"{re.escape(begin)}.*?{re.escape(end)}\n?",
        flags=re.DOTALL,
    )
    if pattern.search(text):
        # Use a callable replacement so backslashes in the block are inserted literally.
        new_text = pattern.sub(lambda _match: block, text)
    else:
        new_text = text.rstrip() + "\n\n" + block
    atomic_write_text(decisions_path, new_text)
def seed_queries_from_topic(queries_path: Path, topic: str) -> None:
topic = topic.strip()
if not topic:
return
if queries_path.exists():
lines = queries_path.read_text(encoding="utf-8").splitlines()
else:
lines = [
"# Queries",
"",
"## Primary query",
"- keywords:",
" - \"\"",
"- exclude:",
" - \"\"",
"- max_results: \"\"",
"- core_size: \"\"",
"- time window:",
" - from: \"\"",
" - to: \"\"",
"",
"## Notes",
"-",
]
def _has_nonempty_values(token: str) -> bool:
in_block = False
for raw in lines:
stripped = raw.strip()
if stripped.startswith(f"- {token}:"):
in_block = True
continue
if not in_block:
continue
if raw.startswith(" - "):
value = stripped[2:].strip().strip('"').strip("'")
if value:
return True
continue
if stripped.startswith("- "):
break
return False
def _has_nonempty_scalar(token: str) -> bool:
for raw in lines:
stripped = raw.strip()
if not stripped.startswith(f"- {token}:"):
continue
value = stripped.split(":", 1)[1].split("#", 1)[0].strip().strip('"').strip("'")
return bool(value)
return False
def _has_nonempty_time_field(field: str) -> bool:
# Looks for lines like: ' - from: "2022"'
for raw in lines:
stripped = raw.strip()
if not stripped.startswith(f"- {field}:"):
continue
value = stripped.split(":", 1)[1].strip().strip('"').strip("'")
return bool(value)
return False
has_keywords = _has_nonempty_values("keywords")
has_excludes = _has_nonempty_values("exclude")
has_time_from = _has_nonempty_time_field("from")
has_time_to = _has_nonempty_time_field("to")
has_max_results = _has_nonempty_scalar("max_results")
has_core_size = _has_nonempty_scalar("core_size")
workspace = queries_path.parent
profile = pipeline_profile(workspace)
query_defaults = pipeline_query_defaults(workspace)
raw_tlow = topic.lower()
topic_for_queries = _sanitize_topic_for_query_seed(topic)
keyword_suggestions = [topic_for_queries]
tlow = topic_for_queries.lower()
is_agent = any(t in tlow for t in ("agent", "agents", "agentic"))
is_embodied = any(
t in tlow
for t in (
"embodied ai",
"embodied intelligence",
"embodied agent",
"embodied robotics",
"robot foundation model",
"robot learning",
"robot manipulation",
"vision-language-action",
"vla",
"generalist robot",
)
)
is_text_to_image = any(t in tlow for t in ("text-to-image", "text to image", "t2i"))
is_text_to_video = any(t in tlow for t in ("text-to-video", "text to video", "t2v"))
is_diffusion = "diffusion" in tlow
is_generative = is_text_to_image or is_text_to_video or is_diffusion or ("image generation" in tlow) or ("generative" in tlow)
if is_agent:
keyword_suggestions.extend(
[
"LLM agent",
"language model agent",
"tool use",
"function calling",
"tool-using agent",
"planning",
"memory",
"multi-agent",
"benchmark",
"safety",
]
)
exclude_suggestions: list[str] = []
if is_agent:
exclude_suggestions.append("agent-based modeling")
exclude_suggestions.extend(["react hooks", "perovskite", "banach", "coxeter"])
if is_embodied:
keyword_suggestions.extend(
[
"embodied AI survey",
"embodied AI review",
"embodied intelligence survey",
"embodied agent survey",
"robot foundation model survey",
"robot learning survey",
"robot manipulation survey",
"embodied robotics survey",
"vision-language-action survey",
"vision-language-action model",
"robot foundation model",
"generalist robot policy",
"world model robot",
]
)
stripped_output_terms = topic_for_queries.lower().strip() != raw_tlow.strip()
if stripped_output_terms and any(t in raw_tlow for t in ("latex", "pdf", "markdown", "typesetting")):
exclude_suggestions.extend(["latex", "pdf", "typesetting", "document layout"])
if is_generative:
keyword_suggestions.extend(
[
"text-to-image generation",
"text-guided image generation",
"diffusion model",
"denoising diffusion probabilistic model",
"latent diffusion",
"stable diffusion",
"classifier-free guidance",
"diffusion transformer",
"DiT",
"masked generative transformer",
"MaskGIT",
"autoregressive image generation",
"VQGAN",
"VQ-VAE",
"ControlNet",
"DreamBooth",
"textual inversion",
"LoRA fine-tuning",
]
)
default_max_results = query_defaults.get("max_results")
# Keep the suggestion a string in every branch (it is rendered inside quotes below).
max_results_suggestion = str(default_max_results or "").strip() or (
"1800" if profile == "arxiv-survey" else ("800" if (is_agent or is_generative or is_embodied) else "300")
)
time_from_suggestion = (
"2018" if is_embodied
else ("2022" if (is_agent and ("llm" in tlow or "language model" in tlow)) else ("2020" if is_generative else ""))
)
core_size_suggestion = str(query_defaults.get("core_size") or "").strip() or ("300" if profile == "arxiv-survey" else "")
out: list[str] = []
i = 0
while i < len(lines):
line = lines[i]
stripped = line.strip()
if stripped.startswith("- keywords:") and not has_keywords:
out.append(line)
i += 1
while i < len(lines) and lines[i].startswith(" - "):
i += 1
for kw in _dedupe_preserve_order(keyword_suggestions)[:14]:
out.append(f" - \"{kw}\"")
continue
if stripped.startswith("- exclude:") and not has_excludes:
out.append(line)
i += 1
while i < len(lines) and lines[i].startswith(" - "):
i += 1
for ex in _dedupe_preserve_order(exclude_suggestions)[:10]:
out.append(f" - \"{ex}\"")
continue
if stripped.startswith("- max_results:") and not has_max_results and max_results_suggestion:
out.append(f"- max_results: \"{max_results_suggestion}\"")
i += 1
continue
if stripped.startswith("- core_size:") and not has_core_size and core_size_suggestion:
out.append(f"- core_size: \"{core_size_suggestion}\"")
i += 1
continue
if stripped.startswith("- time window:") and not (has_time_from or has_time_to) and time_from_suggestion:
out.append(line)
i += 1
# Skip existing from/to lines if present.
while i < len(lines) and lines[i].startswith(" -"):
i += 1
out.append(f" - from: \"{time_from_suggestion}\"")
out.append(" - to: \"\"")
continue
out.append(line)
i += 1
out = _materialize_missing_query_defaults(out, query_defaults, allowed_fields=pipeline_overridable_query_fields(workspace))
atomic_write_text(queries_path, "\n".join(out).rstrip() + "\n")
def _dedupe_preserve_order(items: list[str]) -> list[str]:
seen: set[str] = set()
out: list[str] = []
for item in items:
item = item.strip()
if not item or item in seen:
continue
seen.add(item)
out.append(item)
return out
def _sanitize_topic_for_query_seed(topic: str) -> str:
text = str(topic or "").strip()
if not text:
return ""
patterns = [
r"(?i)\bwith\s+latex\s*/\s*pdf\s+output\b",
r"(?i)\bwith\s+latex\s+output\b",
r"(?i)\bwith\s+pdf\s+output\b",
r"(?i)\bwith\s+markdown\s+output\b",
r"(?i)\blatex\s*/\s*pdf\s+output\b",
r"(?i)\bpdf\s+output\b",
r"(?i)\blatex\s+output\b",
r"(?i)\bmarkdown\s+output\b",
r"(?i)\bfor\s+latex\s*/\s*pdf\b",
]
for pattern in patterns:
text = re.sub(pattern, "", text)
text = re.sub(r"\s+", " ", text).strip(" ,;:-")
return text or topic
_LEGACY_PIPELINE_ALIASES = {
"idea-finder": "idea-brainstorm",
"idea-finder.pipeline.md": "idea-brainstorm",
"pipelines/idea-finder.pipeline.md": "idea-brainstorm",
}
def find_repo_root(start: Path | None = None) -> Path:
"""Walk up from *start* (default: this file) looking for AGENTS.md."""
candidate = (start or Path(__file__)).resolve()
for _ in range(10):
if (candidate / "AGENTS.md").exists():
return candidate
parent = candidate.parent
if parent == candidate:
break
candidate = parent
raise FileNotFoundError("Could not find repo root (AGENTS.md marker)")
def _normalize_pipeline_lock_value(value: str) -> str:
raw = str(value or "").strip()
return _LEGACY_PIPELINE_ALIASES.get(raw, raw)
def resolve_pipeline_spec_path(*, repo_root: Path, pipeline_value: str) -> Path | None:
value = _normalize_pipeline_lock_value(pipeline_value)
if not value:
return None
candidate = Path(value)
if candidate.is_absolute() and candidate.exists():
return candidate.resolve()
rel_candidate = repo_root / value
if rel_candidate.exists():
return rel_candidate.resolve()
filename = Path(value).name
if filename:
direct = repo_root / "pipelines" / filename
if direct.exists():
return direct.resolve()
stem = filename
if stem.endswith(".pipeline.md"):
stem = stem[: -len(".pipeline.md")]
if stem:
direct = repo_root / "pipelines" / f"{stem}.pipeline.md"
if direct.exists():
return direct.resolve()
return None
def load_workspace_pipeline_spec(workspace: Path):
from tooling.pipeline_spec import PipelineSpec
try:
repo_root = find_repo_root(workspace)
except FileNotFoundError:
repo_root = Path(__file__).resolve().parents[1]
lock_path = workspace / "PIPELINE.lock.md"
if not lock_path.exists():
return None
pipeline_name = ""
try:
for raw in lock_path.read_text(encoding="utf-8", errors="ignore").splitlines():
line = raw.strip()
if line.startswith("pipeline:"):
pipeline_name = line.split(":", 1)[1].strip()
break
except Exception:
return None
if not pipeline_name:
return None
spec_path = resolve_pipeline_spec_path(repo_root=repo_root, pipeline_value=pipeline_name)
if spec_path is None:
return None
try:
return PipelineSpec.load(spec_path)
except Exception:
return None
def pipeline_query_defaults(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
return dict(spec.query_defaults) if spec is not None else {}
def pipeline_quality_contract(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
return dict(spec.quality_contract) if spec is not None else {}
def pipeline_quality_contract_value(workspace: Path, *keys: str, default: Any = None) -> Any:
current: Any = pipeline_quality_contract(workspace)
for key in keys:
if not isinstance(current, dict):
return default
current = current.get(str(key))
if current is None:
return default
return current
def pipeline_query_default(workspace: Path, key: str, default: Any = None) -> Any:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return default
return spec.query_default(key, default)
def pipeline_overridable_query_fields(workspace: Path) -> set[str]:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return set()
return set(spec.overridable_query_fields)
def pipeline_profile(workspace: Path) -> str:
"""Return the pipeline profile for a workspace.
Reads PIPELINE.lock.md to get the pipeline name, then loads the
pipeline spec file and reads its ``profile`` frontmatter field.
Falls back to ``"default"`` if anything is missing.
"""
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return "default"
return str(spec.profile or "default").strip() or "default"
def latest_outline_state(workspace: Path) -> dict[str, Any]:
path = Path(workspace).resolve() / "outline" / "outline_state.jsonl"
records = [rec for rec in read_jsonl(path) if isinstance(rec, dict)]
return dict(records[-1]) if records else {}
def _materialize_missing_query_defaults(lines: list[str], query_defaults: dict[str, Any], *, allowed_fields: set[str] | None = None) -> list[str]:
if not query_defaults:
return lines
existing_keys: set[str] = set()
for raw in lines:
stripped = raw.strip()
if not stripped.startswith("- ") or ":" not in stripped:
continue
key = stripped[2:].split(":", 1)[0].strip().lower().replace(" ", "_").replace("-", "_")
if key:
existing_keys.add(key)
additions: list[str] = []
for key, value in query_defaults.items():
norm_key = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
if not norm_key or norm_key in existing_keys:
continue
if allowed_fields and norm_key not in allowed_fields:
continue
rendered = _render_query_scalar(value)
if rendered is None:
continue
additions.append(f'- {norm_key}: "{rendered}"')
if not additions:
return lines
out = list(lines)
if out and out[-1].strip():
out.append("")
out.extend(additions)
return out
def _render_query_scalar(value: Any) -> str | None:
if value is None:
return None
if isinstance(value, bool):
return "true" if value else "false"
if isinstance(value, (int, float)):
return str(value)
if isinstance(value, str):
text = value.strip()
return text or None
return None
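The approval gate used by `decisions_has_approval` above comes down to a single multiline regex over the decisions log. The following is a minimal, self-contained sketch of that check; `has_approval` and the sample log are hypothetical stand-ins for illustration, while the pattern itself is the one used above.

```python
import re


def has_approval(text: str, checkpoint: str) -> bool:
    """Return True if the checkpoint has a ticked `- [x]` checklist entry."""
    if not checkpoint:
        return False
    # Same pattern as decisions_has_approval: a ticked box, optional "Approve",
    # then the checkpoint id followed by a word boundary.
    pattern = rf"^\s*-\s*\[[xX]\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b"
    return re.search(pattern, text, flags=re.MULTILINE) is not None


sample_log = (
    "# Decisions log\n"
    "\n"
    "## Approvals (check to unblock)\n"
    "- [x] Approve C2 (scope + outline)\n"
    "- [ ] Approve C5 (allow prose writing)\n"
)

print(has_approval(sample_log, "C2"))
print(has_approval(sample_log, "C5"))
```

The trailing `\b` matters when checkpoint ids share a prefix: a ticked `C20` entry does not count as approval for `C2`.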
FILE:tooling/executor.py
from __future__ import annotations
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path
from tooling.common import (
UnitsTable,
atomic_write_text,
decisions_has_approval,
ensure_dir,
latest_outline_state,
load_workspace_pipeline_spec,
now_iso_seconds,
parse_semicolon_list,
set_decisions_approval,
update_status_field,
update_status_log,
)
@dataclass(frozen=True)
class RunResult:
    unit_id: str | None
    status: str
    message: str
def _section_first_cutover_block_message(*, workspace: Path, outputs: list[str]) -> str | None:
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None or str(spec.structure_mode or "").strip().lower() != "section_first":
        return None
    if "outline/outline_state.jsonl" not in outputs:
        return None
    latest = latest_outline_state(workspace)
    if not latest:
        return "Section-first cutover is not actionable yet: `outline/outline_state.jsonl` is missing or empty. Rerun `outline-refiner` after fixing the chapter/section layer."
    structure_phase = str(latest.get("structure_phase") or "").strip()
    h3_status = str(latest.get("h3_status") or "").strip().lower()
    reroute_target = str(latest.get("reroute_target") or "").strip()
    # Keep a literal 0 visible instead of collapsing it to "unknown".
    retry_raw = latest.get("retry_budget_remaining")
    retry_budget = "" if retry_raw is None else str(retry_raw).strip()
    reroute_reason = str(latest.get("reroute_reason") or "").strip()
    if structure_phase.lower() == "decomposed" and h3_status == "stable":
        return None
    missing: list[str] = []
    for key in ("structure_phase", "h3_status", "approval_status", "reroute_target", "retry_budget_remaining"):
        if key not in latest:
            missing.append(key)
            continue
        if key in {"structure_phase", "h3_status"} and not str(latest.get(key) or "").strip():
            missing.append(key)
    if missing:
        return (
            "Section-first cutover cannot advance because the latest `outline_state.jsonl` record is missing "
            f"required fields: {', '.join(missing)}."
        )
    reroute_label = reroute_target or "unknown"
    retry_label = retry_budget or "unknown"
    reason_suffix = f" Reason: {reroute_reason}." if reroute_reason else ""
    return (
        "Section-first cutover is still blocked; "
        f"latest outline state has structure_phase={structure_phase or 'missing'}, "
        f"h3_status={h3_status or 'missing'}, reroute_target={reroute_label}, "
        f"retry_budget_remaining={retry_label}.{reason_suffix} Fix the reroute target explicitly, then rerun this unit."
    )
def _append_run_error(*, workspace: Path, unit_id: str, skill: str, kind: str, message: str, log_rel: str | None) -> None:
    """Append a short failure record to `output/RUN_ERRORS.md` (workspace-local).

    This is a human-facing error sink that survives reruns and makes BLOCKED states debuggable.
    """
    try:
        ensure_dir(workspace / "output")
        out_path = workspace / "output" / "RUN_ERRORS.md"
        stamp = now_iso_seconds()
        log_hint = f" (log: `{log_rel}`)" if log_rel else ""
        line = f"- {stamp} `{unit_id}` `{skill}` `{kind}`: {message}{log_hint}"
        if out_path.exists() and out_path.stat().st_size > 0:
            prev = out_path.read_text(encoding="utf-8", errors="ignore").rstrip() + "\n"
        else:
            prev = "# Run errors\n\n"
        atomic_write_text(out_path, prev + line + "\n")
    except Exception:
        # Never let the runner crash while trying to log an error.
        return
def run_one_unit(
    *,
    workspace: Path,
    repo_root: Path,
    strict: bool = False,
    auto_approve: set[str] | None = None,
) -> RunResult:
    units_path = workspace / "UNITS.csv"
    status_path = workspace / "STATUS.md"
    if not units_path.exists():
        return RunResult(unit_id=None, status="ERROR", message=f"Missing {units_path}")
    table = UnitsTable.load(units_path)
    runnable_idx = _find_first_runnable(table)
    if runnable_idx is None:
        return RunResult(unit_id=None, status="IDLE", message="No runnable unit found")
    row = table.rows[runnable_idx]
    unit_id = row.get("unit_id", "").strip()
    skill = row.get("skill", "").strip()
    owner = row.get("owner", "").strip().upper()
    row["status"] = "DOING"
    table.save(units_path)
    update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DOING {skill}")
    auto_approve_set = {str(x or "").strip().upper() for x in (auto_approve or set()) if str(x or "").strip()}
    # HUMAN-owned units are approval gates: they never execute a script.
    if owner == "HUMAN":
        checkpoint = row.get("checkpoint", "").strip()
        if checkpoint and decisions_has_approval(workspace / "DECISIONS.md", checkpoint):
            row["status"] = "DONE"
            table.save(units_path)
            update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (HUMAN approved {checkpoint})")
            _refresh_status_checkpoint(status_path, table)
            return RunResult(unit_id=unit_id, status="DONE", message=f"HUMAN approved {checkpoint}")
        if checkpoint and checkpoint.upper() in auto_approve_set:
            set_decisions_approval(workspace / "DECISIONS.md", checkpoint, approved=True)
            row["status"] = "DONE"
            table.save(units_path)
            update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (AUTO approved {checkpoint})")
            _refresh_status_checkpoint(status_path, table)
            return RunResult(unit_id=unit_id, status="DONE", message=f"AUTO approved {checkpoint}")
        row["status"] = "BLOCKED"
        table.save(units_path)
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (await HUMAN approval {checkpoint})")
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Await HUMAN approval {checkpoint} in DECISIONS.md")
    script_path = repo_root / "scripts" / "run.py"
    if not script_path.exists():
        row["status"] = "BLOCKED"
        table.save(units_path)
        skill_md = "SKILL.md"
        update_status_log(
            status_path,
            (
                f"{now_iso_seconds()} {unit_id} BLOCKED "
                f"(no script for {skill}; run manually per {skill_md} then mark DONE)"
            ),
        )
        _refresh_status_checkpoint(status_path, table)
        return RunResult(
            unit_id=unit_id,
            status="BLOCKED",
            message=(
                f"No executable script for skill '{skill}'. "
                f"Run it manually by following `{skill_md}`, write the required outputs, "
                f"then mark the unit DONE (e.g., `python scripts/pipeline.py mark --workspace {workspace} --unit-id {unit_id} --status DONE`)."
            ),
        )
    inputs = parse_semicolon_list(row.get("inputs"))
    raw_outputs = parse_semicolon_list(row.get("outputs"))
    outputs = [_strip_optional_marker(rel) for rel in raw_outputs]
    # Outputs prefixed with "?" are optional and are not checked for existence.
    required_outputs = [outputs[i] for i, rel in enumerate(raw_outputs) if not rel.strip().startswith("?")]
    checkpoint = row.get("checkpoint", "").strip()
    cmd = [
        sys.executable,
        str(script_path),
        "--workspace",
        str(workspace),
        "--unit-id",
        unit_id,
        "--inputs",
        ";".join(inputs),
        "--outputs",
        ";".join(outputs),
        "--checkpoint",
        checkpoint,
    ]
    log_rel = f"output/unit_logs/{unit_id}.{skill}.log"
    log_path = workspace / log_rel
    try:
        completed = subprocess.run(cmd, check=False, capture_output=True, text=True)
        # Write a log only when the script produced output or failed.
        if completed.stdout or completed.stderr or completed.returncode != 0:
            ensure_dir(log_path.parent)
            body = [
                "# Unit log\n",
                f"- unit_id: {unit_id}\n",
                f"- skill: {skill}\n",
                f"- exit: {completed.returncode}\n",
                f"- cmd: {' '.join(cmd)}\n",
                "\n## stdout\n\n",
                (completed.stdout or "(empty)") + "\n",
                "\n## stderr\n\n",
                (completed.stderr or "(empty)") + "\n",
            ]
            atomic_write_text(log_path, "".join(body))
    except Exception as exc:  # pragma: no cover
        row["status"] = "BLOCKED"
        table.save(units_path)
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (exec error)")
        _append_run_error(
            workspace=workspace,
            unit_id=unit_id,
            skill=skill,
            kind="exec_error",
            message=f"{type(exc).__name__}: {exc}",
            log_rel=None,
        )
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="BLOCKED", message=str(exc))
    missing = [rel for rel in required_outputs if rel and not (workspace / rel).exists()]
    if completed.returncode == 0 and not missing:
        if strict:
            from tooling.quality_gate import check_unit_outputs, write_quality_report
            try:
                issues = check_unit_outputs(skill=skill, workspace=workspace, outputs=outputs)
            except Exception as exc:  # pragma: no cover
                from tooling.quality_gate import QualityIssue
                issues = [
                    QualityIssue(
                        code="quality_gate_exception",
                        message=f"Quality gate crashed: {type(exc).__name__}: {exc}",
                    )
                ]
            # Avoid confusing stale QUALITY_GATE.md after a successful run.
            report_path = workspace / "output" / "QUALITY_GATE.md"
            if issues or report_path.exists():
                write_quality_report(workspace=workspace, unit_id=unit_id, skill=skill, issues=issues)
            if issues:
                row["status"] = "BLOCKED"
                table.save(units_path)
                rel_report = str((workspace / "output" / "QUALITY_GATE.md").relative_to(workspace))
                update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (quality gate: {rel_report})")
                _refresh_status_checkpoint(status_path, table)
                reroute_hint = _reroute_hint(workspace)
                return RunResult(
                    unit_id=unit_id,
                    status="BLOCKED",
                    message=f"Quality gate failed; see {rel_report}" + (f"; {reroute_hint}" if reroute_hint else ""),
                )
        cutover_block = _section_first_cutover_block_message(workspace=workspace, outputs=outputs)
        if cutover_block:
            row["status"] = "BLOCKED"
            table.save(units_path)
            update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (section-first cutover)")
            _append_run_error(
                workspace=workspace,
                unit_id=unit_id,
                skill=skill,
                kind="section_first_cutover",
                message=cutover_block,
                log_rel=log_rel if log_path.exists() else None,
            )
            _refresh_status_checkpoint(status_path, table)
            return RunResult(unit_id=unit_id, status="BLOCKED", message=cutover_block)
        row["status"] = "DONE"
        table.save(units_path)
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE {skill}")
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="DONE", message="OK")
    # From here on the unit failed: either outputs are missing or the script exited non-zero.
    row["status"] = "BLOCKED"
    table.save(units_path)
    if missing:
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (missing outputs: {', '.join(missing)})")
        _append_run_error(
            workspace=workspace,
            unit_id=unit_id,
            skill=skill,
            kind="missing_outputs",
            message=f"Missing outputs: {', '.join(missing)}",
            log_rel=log_rel if log_path.exists() else None,
        )
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Missing outputs: {', '.join(missing)}" + (f"; see {log_rel}" if log_path.exists() else ""))
    update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (script failed)")
    _append_run_error(
        workspace=workspace,
        unit_id=unit_id,
        skill=skill,
        kind="script_failed",
        message=f"Skill script failed (exit {completed.returncode})",
        log_rel=log_rel if log_path.exists() else None,
    )
    _refresh_status_checkpoint(status_path, table)
    return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Skill script failed (exit {completed.returncode})" + (f"; see {log_rel}" if log_path.exists() else ""))
def _find_first_runnable(table: UnitsTable) -> int | None:
    status_ok = {"DONE", "SKIP"}
    unit_by_id = {row.get("unit_id", ""): row for row in table.rows}
    for idx, row in enumerate(table.rows):
        # BLOCKED units are retried on the next run, so they count as candidates.
        if row.get("status", "").strip().upper() not in {"TODO", "BLOCKED"}:
            continue
        deps = parse_semicolon_list(row.get("depends_on"))
        if not deps:
            return idx
        deps_done = True
        for dep_id in deps:
            dep = unit_by_id.get(dep_id)
            if not dep:
                deps_done = False
                break
            if dep.get("status", "").strip().upper() not in status_ok:
                deps_done = False
                break
        if deps_done:
            return idx
    return None
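The selection rule in `_find_first_runnable` can be exercised without `UnitsTable`. The sketch below works on plain dicts; `find_first_runnable` and the sample rows are hypothetical, and `parse_semicolon_list` is re-declared with its assumed behavior (split on `;`, drop blanks), since its real implementation is not shown here.

```python
def parse_semicolon_list(value):
    # Assumed behavior of the imported helper: split on ';' and drop blank parts.
    return [part.strip() for part in (value or "").split(";") if part.strip()]


def find_first_runnable(rows):
    status_ok = {"DONE", "SKIP"}
    by_id = {row.get("unit_id", ""): row for row in rows}
    for idx, row in enumerate(rows):
        if row.get("status", "").strip().upper() not in {"TODO", "BLOCKED"}:
            continue
        deps = parse_semicolon_list(row.get("depends_on"))
        # Every dependency must exist and be DONE or SKIP; no deps means runnable.
        if all(
            dep in by_id and by_id[dep].get("status", "").strip().upper() in status_ok
            for dep in deps
        ):
            return idx
    return None


# Hypothetical unit rows, for illustration only.
rows = [
    {"unit_id": "U001", "status": "DONE", "depends_on": ""},
    {"unit_id": "U002", "status": "BLOCKED", "depends_on": "U001"},
    {"unit_id": "U003", "status": "TODO", "depends_on": "U002"},
    {"unit_id": "U004", "status": "TODO", "depends_on": "U999"},
]
first = find_first_runnable(rows)
```

`U002` is picked even though it is BLOCKED, which is how a rerun retries a previously failed unit. `U004` never becomes runnable because its dependency `U999` does not exist in the table.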
def _refresh_status_checkpoint(status_path: Path, table: UnitsTable) -> None:
    checkpoint = _compute_current_checkpoint(table)
    update_status_field(status_path, "Current checkpoint", checkpoint)
def _compute_current_checkpoint(table: UnitsTable) -> str:
    # The current checkpoint is that of the first unit not yet DONE or SKIP.
    for row in table.rows:
        if row.get("status", "").strip().upper() not in {"DONE", "SKIP"}:
            return (row.get("checkpoint") or "").strip() or "C0"
    return "DONE"
def invalidate_downstream_units(table: UnitsTable, *, root_unit_id: str) -> list[str]:
    """Reset all transitive downstream dependents of `root_unit_id` to TODO.

    This is used when a previously satisfied upstream unit is reopened for rerun.
    Keeping downstream units as DONE would otherwise leave stale artifacts in place
    and make later `run` invocations stop too early.
    """
    root = str(root_unit_id or "").strip()
    if not root:
        return []
    # Map each unit id to the rows that depend on it directly.
    direct_children: dict[str, list[dict[str, str]]] = {}
    for row in table.rows:
        for dep in parse_semicolon_list(row.get("depends_on")):
            direct_children.setdefault(dep, []).append(row)
    affected: list[str] = []
    seen: set[str] = set()
    stack = [root]
    # Depth-first walk; `seen` guards against revisiting shared dependents and cycles.
    while stack:
        current = stack.pop()
        for child in direct_children.get(current, []):
            child_id = str(child.get("unit_id") or "").strip()
            if not child_id or child_id in seen:
                continue
            seen.add(child_id)
            if str(child.get("status") or "").strip().upper() != "TODO":
                child["status"] = "TODO"
                affected.append(child_id)
            stack.append(child_id)
    return affected
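To illustrate the docstring above, here is a small sketch of the transitive reset on plain dicts. `invalidate_downstream` is a hypothetical stand-in for `invalidate_downstream_units`, and the unit ids and dependency graph are invented for the example.

```python
def invalidate_downstream(rows, root_unit_id):
    # Build parent -> direct children from each row's semicolon-separated deps.
    children = {}
    for row in rows:
        deps = [d.strip() for d in (row.get("depends_on") or "").split(";") if d.strip()]
        for dep in deps:
            children.setdefault(dep, []).append(row)
    affected, seen, stack = [], set(), [root_unit_id]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            child_id = child["unit_id"]
            if child_id in seen:
                continue
            seen.add(child_id)
            # Only units that actually change status are reported as affected.
            if child["status"].upper() != "TODO":
                child["status"] = "TODO"
                affected.append(child_id)
            stack.append(child_id)
    return affected


# Hypothetical dependency graph: U010 -> U020 -> U030 -> U040, plus U010 -> U040.
rows = [
    {"unit_id": "U010", "status": "TODO", "depends_on": ""},
    {"unit_id": "U020", "status": "DONE", "depends_on": "U010"},
    {"unit_id": "U030", "status": "TODO", "depends_on": "U020"},
    {"unit_id": "U040", "status": "DONE", "depends_on": "U030;U010"},
    {"unit_id": "U050", "status": "DONE", "depends_on": ""},
]
affected = invalidate_downstream(rows, "U010")
```

The root itself is left untouched, units that were already TODO are traversed but not reported, and the unrelated `U050` stays DONE.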
def _strip_optional_marker(relpath: str) -> str:
    relpath = (relpath or "").strip()
    if relpath.startswith("?"):
        return relpath[1:].strip()
    return relpath
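A short sketch of how the `?` prefix separates optional from required outputs, mirroring the list comprehensions in `run_one_unit`. `split_outputs` is a hypothetical helper, the semicolon splitting stands in for `parse_semicolon_list` (behavior assumed), and `notes/scratch.md` is an invented path.

```python
def strip_optional_marker(relpath):
    relpath = (relpath or "").strip()
    if relpath.startswith("?"):
        return relpath[1:].strip()
    return relpath


def split_outputs(field):
    # Assumed parsing: split the UNITS.csv field on ';' and drop blank parts.
    raw = [part.strip() for part in (field or "").split(";") if part.strip()]
    outputs = [strip_optional_marker(rel) for rel in raw]
    # A path is required unless its raw form starts with "?".
    required = [outputs[i] for i, rel in enumerate(raw) if not rel.startswith("?")]
    return outputs, required


outputs, required = split_outputs(
    "output/DRAFT.md; ?output/QUALITY_GATE.md;? notes/scratch.md"
)
```

All three paths are passed to the skill script via `--outputs`, but only `output/DRAFT.md` would be checked for existence after the run.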
def _reroute_hint(workspace: Path) -> str:
    path = workspace / "output" / "REROUTE_STATE.json"
    if not path.exists() or path.stat().st_size <= 0:
        return ""
    try:
        import json
        data = json.loads(path.read_text(encoding="utf-8", errors="ignore") or "{}")
    except Exception:
        # An unreadable or malformed reroute state simply yields no hint.
        return ""
    if not isinstance(data, dict):
        return ""
    target = str(data.get("reroute_target") or "").strip()
    status = str(data.get("status") or "").strip()
    phase = str(data.get("structure_phase") or "").strip()
    h3 = str(data.get("h3_status") or "").strip()
    reason = str(data.get("reroute_reason") or "").strip()
    if not any([target, status, phase, h3]):
        return ""
    parts = []
    if status:
        parts.append(f"reroute_status={status}")
    if target:
        parts.append(f"reroute_target={target}")
    if phase:
        parts.append(f"structure_phase={phase}")
    if h3:
        parts.append(f"h3_status={h3}")
    if reason:
        parts.append(f"reason={reason}")
    return ", ".join(parts)
FILE:tooling/ideation.py
from __future__ import annotations
import json
import re
from dataclasses import asdict, dataclass, is_dataclass
from pathlib import Path
from typing import Any, Iterable
from tooling.common import (
atomic_write_text,
ensure_dir,
load_workspace_pipeline_spec,
pipeline_overridable_query_fields,
pipeline_query_default,
read_jsonl,
)
DEFAULT_IDEA_RUBRIC: list[tuple[str, float, str]] = [
("discussion_worthiness", 0.24, "is this direction worth a serious PI/PhD discussion right now?"),
("academic_value", 0.22, "could it open a meaningful paper/thesis-worthy research angle?"),
("evidence_grounding", 0.18, "can we point to concrete literature tensions rather than pure speculation?"),
("direction_distinctness", 0.16, "is it genuinely different from neighboring directions, not just a wording variant?"),
("first_probe_clarity", 0.10, "is there a plausible low-cost first probe that would teach us something?"),
("thesis_potential", 0.10, "does it plausibly grow beyond a one-off note into a stronger line of work?"),
]
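The six rubric weights sum to 1.0, which suggests a weighted average over per-criterion scores. How `total_score` is actually computed is not shown in this section, so the following is only one plausible sketch: `weighted_total` is a hypothetical helper, and the 1 to 5 integer scale is an assumption.

```python
# Weights copied from DEFAULT_IDEA_RUBRIC (descriptions omitted).
RUBRIC_WEIGHTS = [
    ("discussion_worthiness", 0.24),
    ("academic_value", 0.22),
    ("evidence_grounding", 0.18),
    ("direction_distinctness", 0.16),
    ("first_probe_clarity", 0.10),
    ("thesis_potential", 0.10),
]


def weighted_total(scores):
    # Assumption: each criterion is an integer score on a 1-5 scale.
    total = sum(weight * scores[name] for name, weight in RUBRIC_WEIGHTS)
    return round(total, 3)


weight_sum = sum(weight for _, weight in RUBRIC_WEIGHTS)
uniform = weighted_total({name: 4 for name, _ in RUBRIC_WEIGHTS})
```

Because the weights sum to 1.0, a direction scored uniformly keeps that score as its total, so totals stay on the same scale as the individual criteria.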
STOPWORDS = {
"a", "an", "and", "the", "of", "to", "for", "in", "on", "with", "via", "by", "from", "or",
"work", "works", "study", "studies", "survey", "review", "benchmarks", "benchmark", "evaluation",
"agents", "agent", "model", "models", "llm", "large", "language", "using", "based",
}
CLUSTER_AXIS_HINTS: list[tuple[list[str], list[str], str, str]] = [
(["agent loop", "action"], ["observability granularity", "action-space design", "tool/environment boundary"], "mechanism", "Could sharpen how we think about the basic agent loop rather than adding yet another full stack."),
(["tool", "orchestration", "interface"], ["tool routing", "permission model", "API reliability assumptions"], "systems", "Could clarify whether interface design is the real bottleneck behind seemingly agent-level gains."),
(["planning", "reasoning"], ["search depth", "verification loop", "partial-observability handling"], "mechanism", "Could turn vague talk about reasoning quality into a sharper research question about what actually drives robust planning."),
(["memory", "retrieval", "rag"], ["retrieval policy", "state summarization cadence", "memory horizon"], "mechanism", "Could reveal when memory is a genuine capability lever versus a reporting convenience."),
(["self-improvement", "adaptation"], ["feedback type", "adaptation horizon", "evaluation signal quality"], "adaptation", "Could distinguish genuine improvement from evaluation-sensitive self-optimization."),
(["multi-agent", "coordination"], ["role specialization", "communication budget", "aggregation rule"], "coordination", "Could separate coordination benefits from simple redundancy or verification effects."),
(["benchmark", "evaluation"], ["metric choice", "task slice design", "failure-analysis lens"], "evaluation", "Could make evaluation choices themselves a research object instead of hidden background assumptions."),
(["safety", "security", "governance"], ["threat model", "monitoring granularity", "permission boundary"], "governance", "Could connect technical agent behaviors to deployment-relevant risk questions in a more principled way."),
]
GENERIC_LIMITATION_PATTERNS = [
"abstract-level evidence only",
"validate assumptions",
"full paper",
"before relying on this as key evidence",
]
BENCHMARK_HINTS = [
"HotpotQA", "FEVER", "ALFWorld", "WebShop", "HumanEval", "WebArena", "AgentBench",
"GSM8K", "MATH", "Wikipedia", "wiki", "API", "question answering", "fact verification",
]
AXIS_INSIGHT_LIBRARY: dict[str, dict[str, str]] = {
"observability granularity": {
"question": "whether several agent-loop gains are really planner gains, or whether they mostly come from changing what the planner gets to observe and when",
"confound": "planner quality and broader agent competence",
"thesis": "Several agent-loop gains remain hard to interpret because papers often improve observation access at the same time they improve the planner; before crediting planner depth, we should ask what the system was allowed to see.",
"insight": "A convincing result would not just move aggregate score. It would show which published gains survive once observation access is fixed, ideally producing a regime map of when observability—not planner depth—changes the failure story.",
"contribution_shape": "Could yield a causal-attribution result plus a reporting rule for agent-loop papers: claims about planning quality should specify and control observation access.",
"demotion": "This direction weakens sharply if the strongest prior work already holds observation access fixed while planner quality changes and still reports the same gain pattern.",
"kill_signal": "an anchor paper already fixes observation access while varying planner quality and the main conclusion still survives",
"missing_piece": "What is missing is a fixed-interface, fixed-budget comparison that varies only observation access and tracks whether the failure taxonomy—not just average score—changes.",
"program_kind": "causal attribution",
"time_to_clarity": "fast",
"priority_note": "Fastest to falsify on public tasks, and the answer would immediately change how several agent-loop results are interpreted.",
"reading_extract": "what the agent sees at each step, which ablations vary planner depth, and whether the action interface stays fixed",
"title": "Observability granularity vs planner depth",
},
"action-space design": {
"question": "whether robustness comes from better reasoning or from giving the agent a cleaner action interface",
"confound": "the shape of the action vocabulary and interface design",
"thesis": "Some agent-loop gains may be interface gains in disguise: when action vocabularies become cleaner or narrower, papers can attribute robustness to reasoning improvements that partly come from action-space design.",
"insight": "A useful result would show whether the same nominal planner still behaves very differently once the action space is widened, normalized, or made less ergonomic.",
"contribution_shape": "Could produce an action-space normalization protocol and a clearer account of which agent claims survive once interface ergonomics are controlled.",
"demotion": "This direction weakens if current papers already compare equivalent action spaces and still obtain the same ranking.",
"kill_signal": "the key anchor papers already normalize action vocabularies or API surfaces and still see the same ordering",
"missing_piece": "What is missing is an action-space-normalized comparison that keeps planner prompts fixed while changing only interface granularity and affordances.",
"program_kind": "interface normalization",
"time_to_clarity": "medium",
"priority_note": "Interesting and still thesis-relevant, but less immediate than observability because interface normalization usually requires more careful task redesign.",
"reading_extract": "how many actions are available, how semantically aligned the actions are, and whether interface cleanup happens together with reasoning changes",
"title": "Action-space design or agent competence?",
},
"tool/environment boundary": {
"question": "whether reliability depends more on boundary design between tool and environment than on the nominal reasoning loop",
"confound": "tool abstraction and environment modeling",
"thesis": "A number of agent results may really be about boundary design: where the system stops reasoning internally and starts delegating to tools can change reliability without changing the nominal planner much.",
"insight": "A strong result would show that several agent-level conclusions are actually boundary-design conclusions once tool abstraction is normalized.",
"contribution_shape": "Could yield a boundary-design taxonomy and a cleaner experimental recipe for separating planner quality from tool/interface scaffolding.",
"demotion": "This direction weakens if changing the boundary carefully barely affects the failure story.",
"kill_signal": "careful tool-boundary changes leave both the metric and the qualitative failure modes essentially unchanged",
"missing_piece": "What is missing is a comparison that keeps task semantics fixed while moving the tool/environment boundary in a controlled way.",
"program_kind": "systems boundary",
"time_to_clarity": "medium",
"priority_note": "Valuable as a systems-facing thesis line, but slower to turn into a decisive first readout than the lead mechanism questions.",
"reading_extract": "where tool calls begin, what state is exposed to the planner, and whether environment modeling changes together with the tool boundary",
"title": "Where the tool boundary really matters",
},
"search depth": {
"question": "whether apparent planning gains reflect deeper search or simply more inference-time budget to recover from weak initial choices",
"confound": "inference-time compute budget",
"thesis": "Many planning results still bundle depth with budget: deeper search often spends more tokens, branches, or retries, so it remains unclear whether depth changes reasoning quality or just buys more recovery opportunities.",
"insight": "A convincing result would show whether depth changes the nature of planning failures under a matched budget, rather than merely delaying failure by spending more compute.",
"contribution_shape": "Could produce a compute-normalized planning benchmark slice and a regime map for when search depth matters beyond extra inference budget.",
"demotion": "This direction weakens if prior work already equalizes budget and still finds depth-specific gains.",
"kill_signal": "the anchor papers already normalize token or wall-clock budget and the depth advantage remains intact",
"missing_piece": "What is missing is a compute-normalized study that holds token or wall-clock budget fixed while varying depth or branching, then checks whether failure modes actually change.",
"program_kind": "budget-normalized mechanism",
"time_to_clarity": "medium",
"priority_note": "Still thesis-sized, but it ranks behind observability because compute normalization is harder to defend and easier for prior work to have addressed already.",
"reading_extract": "the reported depth or branching settings, the effective compute budget, and whether shallow baselines were budget-matched",
"title": "Search depth or compute budget?",
},
"verification loop": {
"question": "whether verification contributes a distinct reasoning mechanism or mostly acts as expensive redundancy",
"confound": "extra compute and repeated checking",
"thesis": "Verification-heavy pipelines may look stronger because they retry, re-check, or filter more often, not necessarily because they add a distinct reasoning mechanism.",
"insight": "A useful result would separate verification as a mechanism from verification as a compute-expensive retry policy.",
"contribution_shape": "Could produce a cleaner protocol for comparing verification loops under matched retry or compute budgets.",
"demotion": "This direction weakens if verification can be replaced by simple repetition with little change in behavior.",
"kill_signal": "repeat-sampling or retry baselines already match the reported verification gain",
"missing_piece": "What is missing is a matched-budget comparison between explicit verification and simple repetition or reranking baselines.",
"program_kind": "verification audit",
"time_to_clarity": "medium",
"priority_note": "Decision-relevant when the group suspects redundancy masquerading as reasoning, though the probe is less clean than the top two directions.",
"reading_extract": "how many extra passes are spent on checking, whether retry baselines exist, and which errors verification actually fixes",
"title": "Verification or just expensive redundancy?",
},
"partial-observability handling": {
"question": "whether planning systems fail because they reason badly or because they are brittle under missing state information",
"confound": "state visibility",
"thesis": "Some planning failures may be epistemic before they are algorithmic: systems can look like weak planners when they are actually brittle under missing state information.",
"insight": "A strong result would show whether improved visibility changes the failure taxonomy more than planner changes do.",
"contribution_shape": "Could produce a planning-under-partial-observability benchmark slice and a clearer division between epistemic and algorithmic failure modes.",
"demotion": "This direction weakens if stronger visibility barely changes the failure pattern.",
"kill_signal": "improved visibility leaves both success rates and error types almost unchanged",
"missing_piece": "What is missing is a task slice that varies only state visibility while keeping planner scaffolding steady.",
"program_kind": "epistemic failure analysis",
"time_to_clarity": "medium",
"priority_note": "Useful if the group wants a deeper failure-analysis program, but it is currently less concrete than the lead directions.",
"reading_extract": "what information is hidden, which observations are restored by tools, and whether planner quality is tested separately from visibility",
"title": "Is planning failure really an observability problem?",
},
"retrieval policy": {
"question": "whether memory gains come from better stored knowledge or from better decisions about when and how retrieval is triggered",
"confound": "memory content versus retrieval timing",
"thesis": "Reported memory gains are often hard to interpret because memory content and retrieval triggers move together; some improvements may be protocol effects about when retrieval happens, not capability effects about what memory stores.",
"insight": "A convincing result would separate memory-content effects from retrieval-trigger effects and show whether the interpretation of current RAG-style gains changes once trigger policy is normalized.",
"contribution_shape": "Could yield a cleaner memory-evaluation protocol and a reusable result on when retrieval policy, rather than memory content, is the true hidden variable.",
"demotion": "This direction weakens if current work already varies retrieval policy independently of memory content and sees the same story.",
"kill_signal": "the main memory papers already keep the memory store fixed while varying retrieval triggers or timing and the conclusion does not move",
"missing_piece": "What is missing is a fixed-memory comparison that varies retrieval trigger and timing only, then checks whether the gain survives and which errors it actually removes.",
"program_kind": "protocol sensitivity",
"time_to_clarity": "medium",
"priority_note": "Different enough from the planning directions to stay in the lead set, but it ranks lower because the current evidence is thinner and more protocol-sensitive.",
"reading_extract": "what is stored in memory, when retrieval fires, what retrieval budget is allowed, and whether trigger policy is ever ablated independently",
"title": "Retrieval policy or memory content?",
},
"state summarization cadence": {
"question": "whether summarization helps because it improves reasoning or because it selectively compresses away troublesome state information",
"confound": "what gets forgotten versus what gets highlighted",
"thesis": "Summarization may help less because the agent reasons better and more because the system filters state in a favorable way; cadence choices can quietly redefine what information survives.",
"insight": "A useful result would show whether different summarization cadences preserve the same conclusions and failure taxonomy once memory content is held steady.",
"contribution_shape": "Could produce a summarization-control protocol for long-horizon agents and a clearer account of how state compression alters conclusions.",
"demotion": "This direction weakens if different summarization cadences preserve the same conclusions and failure taxonomy.",
"kill_signal": "changing summarization cadence leaves both retrieval behavior and downstream errors largely unchanged",
"missing_piece": "What is missing is a fixed-memory, fixed-task comparison that varies only summarization cadence and records what state is lost or amplified.",
"program_kind": "state compression",
"time_to_clarity": "medium",
"priority_note": "Promising but currently less mature than retrieval-policy framing because the concrete prior-work hooks are thinner.",
"reading_extract": "what gets summarized away, how often summaries are rewritten, and whether failure cases correlate with state compression choices",
"title": "What summarization cadence is really changing",
},
"memory horizon": {
"question": "whether longer memory helps because more context is useful or because evaluations reward persistence over selectivity",
"confound": "context length and evaluation design",
"thesis": "Longer memory can look like better capability even when the benchmark mainly rewards persistence or repeated access to earlier state, so horizon effects may partly be evaluation-design effects.",
"insight": "A convincing result would show whether extending horizon changes task framing and error patterns, not just aggregate score.",
"contribution_shape": "Could produce a regime map for when horizon length matters, versus when selective retrieval or better task design explains the gain.",
"demotion": "This direction weakens if memory horizon has little effect once retrieval policy is controlled.",
"kill_signal": "memory-length changes stop mattering once retrieval triggers or evaluation design are normalized",
"missing_piece": "What is missing is a horizon study that matches retrieval policy and benchmark framing while varying how much state is retained.",
"program_kind": "regime mapping",
"time_to_clarity": "medium",
"priority_note": "Conceptually useful, but currently less decisive than retrieval-policy framing for a first discussion round.",
"reading_extract": "whether longer context actually changes retrieval choices, and whether the benchmark rewards persistence more than selective recall",
"title": "Does longer memory really help?",
},
}
@dataclass(frozen=True)
class IdeaSignal:
    signal_id: str
    cluster: str
    direction_type: str
    theme: str
    claim_or_observation: str
    tension: str
    missing_piece: str
    possible_axis: str
    academic_value: str
    evidence_confidence: str
    paper_ids: list[str]
@dataclass(frozen=True)
class DirectionCard:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    focus_axis: str
    main_confound: str
    program_kind: str
    contribution_shape: str
    time_to_clarity: str
    one_line_thesis: str
    why_interesting: str
    literature_suggests: list[str]
    closest_prior_gap: list[str]
    missing_piece: str
    possible_variants: list[str]
    academic_value: str
    first_probes: list[str]
    what_counts_as_insight: str
    weakness_conditions: list[str]
    kill_criteria: list[str]
    what_would_change_mind: list[str]
    best_fit: str
    why_this_ranks_here: str
    evidence_confidence: str
    paper_ids: list[str]
    signal_ids: list[str]
    anchor_reading_notes: list[dict[str, str]]
@dataclass(frozen=True)
class ScreenedDirection:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    total_score: float
    discussion_worthiness: int
    academic_value_score: int
    evidence_grounding: int
    direction_distinctness: int
    first_probe_clarity: int
    thesis_potential: int
    recommendation: str
    rationale: str
def read_core_set(path: Path) -> list[dict[str, str]]:
    """Read a CSV file into a list of row dicts; return [] if the file is missing."""
    import csv

    if not path.exists():
        return []
    with path.open("r", encoding="utf-8", newline="") as handle:
        return [dict(row) for row in csv.DictReader(handle)]
def write_jsonl(path: Path, rows: Iterable[dict[str, Any] | Any]) -> None:
ensure_dir(path.parent)
encoded: list[str] = []
for row in rows:
value = asdict(row) if is_dataclass(row) else row
encoded.append(json.dumps(value, ensure_ascii=False))
atomic_write_text(path, "\n".join(encoded).rstrip() + ("\n" if encoded else ""))
def write_json(path: Path, data: Any) -> None:
ensure_dir(path.parent)
atomic_write_text(path, json.dumps(data, ensure_ascii=False, indent=2).rstrip() + "\n")
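def uniq_keep_order(items: Iterable[str]) -> list[str]:
    """Return stripped, non-empty string forms of `items`, deduplicated in first-seen order."""
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or "").strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out


def slugify(text: str) -> str:
    """Lowercase `text` and join alphanumeric runs with hyphens; fall back to "x" if empty."""
    raw = re.sub(r"[^A-Za-z0-9]+", "-", str(text or "").strip().lower())
    return raw.strip("-") or "x"


def clean_text(text: str, *, limit: int = 220) -> str:
    """Normalize `text` to one line and clip it to `limit` characters.

    Whitespace is collapsed, table pipes become commas, and surrounding
    quotes/backticks are stripped. Overlong text is cut at the last space
    before `limit` when one exists.
    """
    s = str(text or "").strip()
    s = s.replace("\n", " ")
    s = re.sub(r"\s+", " ", s)
    s = s.replace("|", ", ")
    s = s.strip(" \"'`")
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()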
def clean_sentence(text: str, *, limit: int = 180) -> str:
s = clean_text(text, limit=max(limit * 2, limit))
if len(s) <= limit:
return s
window = s[:limit + 40]
pieces = re.split(r'(?<=[.!?;])\s+', window)
acc: list[str] = []
total = 0
for piece in pieces:
piece = piece.strip()
if not piece:
continue
nxt = total + len(piece) + (1 if acc else 0)
if nxt > limit and acc:
break
acc.append(piece)
total = nxt
if total >= int(limit * 0.65):
break
if acc:
return " ".join(acc).strip()
clipped = s[:limit].rsplit(" ", 1)[0].strip()
return clipped if clipped else s[:limit].strip()
def markdown_table(headers: list[str], rows: list[list[str]]) -> str:
head = "| " + " | ".join(headers) + " |"
sep = "| " + " | ".join(["---"] * len(headers)) + " |"
body = ["| " + " | ".join(str(cell).replace("\n", " ").strip() for cell in row) + " |" for row in rows]
return "\n".join([head, sep] + body)
def write_markdown(path: Path, text: str) -> None:
ensure_dir(path.parent)
atomic_write_text(path, text.rstrip() + "\n")
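def extract_goal_from_goal_md(path: Path) -> str:
    """Return the first non-empty, non-heading line of `path`, or a generic fallback."""
    if not path.exists():
        return "research ideas"
    # Strip before the heading check so indented `#` lines are skipped as well.
    stripped = (ln.strip() for ln in path.read_text(encoding="utf-8", errors="ignore").splitlines())
    lines = [ln for ln in stripped if ln and not ln.startswith("#")]
    return lines[0] if lines else "research ideas"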
def _idea_workspace_from_brief(path: Path) -> Path | None:
try:
return path.resolve().parents[2]
except Exception:
return None
def _require_positive_float(value: Any, *, field_name: str) -> float:
try:
parsed = float(value)
except Exception as exc:
raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
if parsed <= 0:
raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
return parsed
def _query_int_override(workspace: Path, key: str, default: int) -> int:
    """Return a positive integer override for `key` from `queries.md`, else `default`.

    Only `- key: value` bullet lines are considered, and only for keys the active
    pipeline marks as overridable. The first matching line wins; an empty,
    non-integer, or non-positive value raises ValueError rather than falling back.
    """
    queries_path = workspace / "queries.md"
    normalized = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
    if not normalized:
        return int(default)
    if normalized not in pipeline_overridable_query_fields(workspace):
        return int(default)
    if not queries_path.exists():
        return int(default)
    for raw in queries_path.read_text(encoding="utf-8", errors="ignore").splitlines():
        line = raw.strip()
        if not line.startswith("- ") or ":" not in line:
            continue
        # Use a distinct name so the `key` parameter is not shadowed inside the loop.
        line_key, value = line[2:].split(":", 1)
        line_key = line_key.strip().lower().replace(" ", "_").replace("-", "_")
        if line_key != normalized:
            continue
        # Drop trailing `# comment` text and surrounding quotes before parsing.
        cleaned = value.split("#", 1)[0].strip().strip('"').strip("'")
        if not cleaned:
            raise ValueError(f"Invalid ideation override: `{normalized}` is empty in queries.md.")
        try:
            parsed = int(cleaned)
        except Exception as exc:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be an integer in queries.md.") from exc
        if parsed <= 0:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be a positive integer in queries.md.")
        return parsed
    return int(default)
def _validate_score_weights(value: Any, *, field_name: str) -> dict[str, float]:
    """Check that every scoring dimension has a positive weight, then normalize.

    The returned weights sum to 1 (each rounded to 6 decimals).
    """
    # Only the keys are consulted below; the numeric values are not read by this function.
    required = {
        "discussion_worthiness": 0.24,
        "academic_value": 0.22,
        "evidence_grounding": 0.18,
        "direction_distinctness": 0.16,
        "first_probe_clarity": 0.10,
        "thesis_potential": 0.10,
    }
    if not isinstance(value, dict):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    weights: dict[str, float] = {}
    for key in required:
        if key not in value:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{key}")
        weights[key] = _require_positive_float(value.get(key), field_name=f"{field_name}.{key}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return {key: round(weight / total, 6) for key, weight in weights.items()}
def _validate_diversity_axes(value: Any, *, field_name: str) -> list[str]:
allowed = {"cluster", "direction_type", "program_kind"}
if not isinstance(value, list):
raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
axes: list[str] = []
seen: set[str] = set()
for item in value:
axis = str(item or "").strip().lower().replace("-", "_")
if not axis:
continue
if axis not in allowed:
raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{axis}")
if axis in seen:
continue
seen.add(axis)
axes.append(axis)
if not axes:
raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
return axes
def resolve_idea_contract(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
raise ValueError("Missing active pipeline contract.")
query_defaults = dict(spec.query_defaults)
quality_contract = dict(spec.quality_contract)
def _require_positive_int(value: Any, *, field_name: str) -> int:
try:
parsed = int(value)
except Exception as exc:
raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
if parsed <= 0:
raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
return parsed
signal_policy = quality_contract.get("signal_policy") or {}
direction_policy = quality_contract.get("direction_policy") or {}
screening_policy = quality_contract.get("screening_policy") or {}
brief = parse_idea_brief(workspace / "output" / "trace" / "IDEA_BRIEF.md")
focus_clusters = [str(x).strip() for x in (brief.get("focus_clusters") or []) if str(x).strip()]
direction_pool_min_default = _require_positive_int(query_defaults.get("direction_pool_min"), field_name="query_defaults.direction_pool_min")
direction_pool_max_default = _require_positive_int(query_defaults.get("direction_pool_max"), field_name="query_defaults.direction_pool_max")
shortlist_size_default = _require_positive_int(query_defaults.get("idea_shortlist_size"), field_name="query_defaults.idea_shortlist_size")
report_top_n_default = _require_positive_int(query_defaults.get("report_top_n"), field_name="query_defaults.report_top_n")
idea_screen_top_n_default = _require_positive_int(query_defaults.get("idea_screen_top_n"), field_name="query_defaults.idea_screen_top_n")
shortlist_min = _require_positive_int(direction_policy.get("shortlist_min"), field_name="quality_contract.direction_policy.shortlist_min")
shortlist_max = _require_positive_int(direction_policy.get("shortlist_max"), field_name="quality_contract.direction_policy.shortlist_max")
keep_min = _require_positive_int(direction_policy.get("keep_min"), field_name="quality_contract.direction_policy.keep_min")
cluster_diversity_min = _require_positive_int(direction_policy.get("cluster_diversity_min"), field_name="quality_contract.direction_policy.cluster_diversity_min")
lead_diversity_target = _require_positive_int(direction_policy.get("lead_diversity_target"), field_name="quality_contract.direction_policy.lead_diversity_target")
lead_diversity_axes = _validate_diversity_axes(
direction_policy.get("lead_diversity_axes"),
field_name="quality_contract.direction_policy.lead_diversity_axes",
)
signal_table_min = _require_positive_int(signal_policy.get("min_rows"), field_name="quality_contract.signal_policy.min_rows")
keep_rank_max = _require_positive_int(screening_policy.get("keep_rank_max"), field_name="quality_contract.screening_policy.keep_rank_max")
maybe_rank_max = _require_positive_int(screening_policy.get("maybe_rank_max"), field_name="quality_contract.screening_policy.maybe_rank_max")
score_weights = _validate_score_weights(
screening_policy.get("score_weights"),
field_name="quality_contract.screening_policy.score_weights",
)
direction_pool_min = _query_int_override(workspace, "direction_pool_min", direction_pool_min_default)
direction_pool_max = _query_int_override(workspace, "direction_pool_max", direction_pool_max_default)
shortlist_size = _query_int_override(workspace, "idea_shortlist_size", shortlist_size_default)
report_top_n = _query_int_override(workspace, "report_top_n", report_top_n_default)
idea_screen_top_n = _query_int_override(workspace, "idea_screen_top_n", idea_screen_top_n_default)
if direction_pool_min > direction_pool_max:
raise ValueError(f"Invalid ideation contract: direction_pool_min ({direction_pool_min}) exceeds direction_pool_max ({direction_pool_max}).")
if shortlist_size < shortlist_min or shortlist_size > shortlist_max:
raise ValueError(
f"Invalid ideation contract: idea_shortlist_size ({shortlist_size}) must stay within "
f"[{shortlist_min}, {shortlist_max}]."
)
if report_top_n > shortlist_size:
raise ValueError(f"Invalid ideation contract: report_top_n ({report_top_n}) exceeds idea_shortlist_size ({shortlist_size}).")
if maybe_rank_max < keep_rank_max:
raise ValueError(
f"Invalid ideation contract: maybe_rank_max ({maybe_rank_max}) must be >= keep_rank_max ({keep_rank_max})."
)
if idea_screen_top_n > direction_pool_max:
raise ValueError(f"Invalid ideation contract: idea_screen_top_n ({idea_screen_top_n}) exceeds direction_pool_max ({direction_pool_max}).")
return {
"focus_clusters": focus_clusters,
"direction_pool_min": direction_pool_min,
"direction_pool_max": direction_pool_max,
"idea_screen_top_n": idea_screen_top_n,
"shortlist_size": shortlist_size,
"report_top_n": report_top_n,
"shortlist_min": shortlist_min,
"shortlist_max": shortlist_max,
"signal_table_min": signal_table_min,
"keep_min": keep_min,
"cluster_diversity_min": cluster_diversity_min,
"lead_diversity_target": lead_diversity_target,
"lead_diversity_axes": lead_diversity_axes,
"keep_rank_max": keep_rank_max,
"maybe_rank_max": maybe_rank_max,
"score_weights": score_weights,
}
def parse_idea_brief(path: Path) -> dict[str, Any]:
text = path.read_text(encoding="utf-8", errors="ignore") if path.exists() else ""
out: dict[str, Any] = {
"goal": extract_goal_from_goal_md(path),
"focus_clusters": [],
"query_buckets": [],
"exclusions": [],
"constraints": [],
"targets": {},
}
cur = None
for raw in text.splitlines():
line = raw.rstrip()
if line.startswith("## "):
cur = line[3:].strip().lower()
continue
if cur == "goal" and line.startswith("- Topic:"):
out["goal"] = line.split(":", 1)[1].strip()
if cur in {"focus after c2", "focus lenses after c2"} and line.startswith("- Focus clusters:"):
clusters = [x.strip() for x in line.split(":", 1)[1].split(";") if x.strip()]
cleaned: list[str] = []
for item in clusters:
low = item.lower()
if "to be filled" in low or "fill after c2" in low or "placeholder" in low:
continue
cleaned.append(item)
out["focus_clusters"] = cleaned
if cur == "query buckets" and re.match(r"^\d+\.\s+", line):
out["query_buckets"].append(re.sub(r"^\d+\.\s+", "", line).strip())
if cur in {"exclude terms", "exclusions"} and line.startswith("- "):
value = line[2:].strip()
if value and not value.lower().startswith("none"):
out["exclusions"].append(value)
if cur == "constraints" and line.startswith("- "):
out["constraints"].append(line[2:].strip())
if cur == "targets" and line.startswith("- ") and ":" in line:
key, value = line[2:].split(":", 1)
out["targets"][key.strip().lower().replace(" ", "_")] = value.strip()
return out
def collect_note_index(path: Path) -> dict[str, dict[str, Any]]:
out: dict[str, dict[str, Any]] = {}
for rec in read_jsonl(path):
if not isinstance(rec, dict):
continue
pid = str(rec.get("paper_id") or "").strip()
if pid:
out[pid] = rec
return out
def keywords_from_cluster(name: str) -> list[str]:
toks = [t for t in re.findall(r"[A-Za-z0-9]+", str(name or "").lower()) if t not in STOPWORDS and len(t) > 2]
return uniq_keep_order(toks)
def score_note_to_cluster(cluster_name: str, note: dict[str, Any]) -> int:
keys = keywords_from_cluster(cluster_name)
blob_parts = [str(note.get("title") or "")]
for field in ["summary_bullets", "limitations", "key_results", "method", "abstract"]:
val = note.get(field)
if isinstance(val, list):
blob_parts.extend([str(x) for x in val])
elif isinstance(val, str):
blob_parts.append(val)
blob = " ".join(blob_parts).lower()
return sum(1 for key in keys if key in blob)
def map_notes_to_clusters(taxonomy_path: Path, notes_path: Path) -> dict[str, list[dict[str, Any]]]:
import yaml
taxonomy = yaml.safe_load(taxonomy_path.read_text(encoding="utf-8", errors="ignore")) if taxonomy_path.exists() else []
notes = [r for r in read_jsonl(notes_path) if isinstance(r, dict)]
out: dict[str, list[dict[str, Any]]] = {}
for top in taxonomy or []:
if not isinstance(top, dict):
continue
for child in top.get("children") or []:
if not isinstance(child, dict):
continue
name = str(child.get("name") or "").strip()
if not name:
continue
scored: list[tuple[int, dict[str, Any]]] = []
for note in notes:
s = score_note_to_cluster(name, note)
if s > 0:
scored.append((s, note))
scored.sort(key=lambda x: (-x[0], str(x[1].get("paper_id") or "")))
out[name] = [n for _, n in scored[:8]]
return out
def _cluster_profile(cluster: str) -> tuple[str, list[str], str]:
low = cluster.lower()
for keys, axes, direction_type, academic_value in CLUSTER_AXIS_HINTS:
if any(key in low for key in keys):
return direction_type, axes, academic_value
return "research", ["assumption sensitivity", "failure analysis", "scope boundary"], "Could sharpen the way this sub-area is framed and compared."
def _note_bullets(note: dict[str, Any], field: str) -> list[str]:
    val = note.get(field)
    if isinstance(val, list):
        # Clean each entry once, then drop the ones that end up empty.
        cleaned = [clean_text(x, limit=220) for x in val]
        return [item for item in cleaned if item]
    if isinstance(val, str):
        item = clean_text(val, limit=220)
        return [item] if item else []
    return []
def _specific_limitations(note: dict[str, Any]) -> list[str]:
items = _note_bullets(note, "limitations")
specific = []
for item in items:
low = item.lower()
if any(pat in low for pat in GENERIC_LIMITATION_PATTERNS):
continue
specific.append(item)
return specific or items
def _evidence_confidence(notes: list[dict[str, Any]]) -> str:
    if not notes:
        return "low"
    if any(str(n.get("evidence_level") or "").strip() == "fulltext" for n in notes):
        return "medium-high"
    # Count notes whose leading limitation is specific rather than generic boilerplate.
    specific = 0
    for n in notes:
        limitations = _specific_limitations(n)
        if limitations and not any(p in limitations[0].lower() for p in GENERIC_LIMITATION_PATTERNS):
            specific += 1
    if len(notes) >= 3 and specific >= 2:
        return "medium"
    return "low-medium"
def _axis_profile(axis: str, cluster: str) -> dict[str, str]:
base = AXIS_INSIGHT_LIBRARY.get(axis, {})
confound = base.get("confound", "nearby design choices and evaluation framing")
return {
"question": base.get("question", f"whether {axis} is doing more conceptual work in {cluster.lower()} than current papers make explicit"),
"confound": confound,
"thesis": base.get("thesis", f"Current results in {cluster.lower()} may be hard to interpret because {axis} still moves together with {confound}."),
"insight": base.get("insight", f"A meaningful result would show whether {axis} changes how we interpret the current literature rather than merely how we narrate it."),
"contribution_shape": base.get("contribution_shape", f"Could turn {axis} into a cleaner explanatory variable for {cluster.lower()} rather than a background convenience."),
"demotion": base.get("demotion", f"This direction would weaken if the strongest prior work already isolates {axis} cleanly."),
"kill_signal": base.get("kill_signal", f"the strongest prior work already isolates {axis} against {confound}"),
"missing_piece": base.get("missing_piece", f"What is missing is a cleaner comparison that varies {axis} while keeping {confound} fixed enough to change interpretation, not just presentation."),
"program_kind": base.get("program_kind", "mechanism clarification"),
"time_to_clarity": base.get("time_to_clarity", "medium"),
"priority_note": base.get("priority_note", f"Worth discussion because it could turn {axis} into a sharper explanatory wedge for {cluster.lower()}."),
"reading_extract": base.get("reading_extract", f"which variables change together with {axis}, what comparator is used, and whether any ablation already fixes {confound}"),
"title": base.get("title", f"What {axis} is really doing"),
}
def _result_fact(note: dict[str, Any]) -> str:
candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
scored: list[str] = []
for cand in candidates:
low = cand.lower()
if any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "outperform", "achiev", "%"]) or any(ch.isdigit() for ch in cand):
scored.append(cand)
chosen = scored[0] if scored else _claim_text(note)
chosen = re.sub(r"^(for instance|for example|e\.g\.)[:,]?\s*", "", chosen, flags=re.IGNORECASE)
low = chosen.lower()
tasks = _extract_task_mentions(note)
task_text = "/".join(tasks[:2]) if tasks else "reported setting"
percents = re.findall(r"\d+(?:\.\d+)?%", chosen)
if "pass@1" in low and percents:
if len(percents) >= 2:
return clean_text(f"{task_text}: {percents[0]} pass@1 vs {percents[1]} in the reported comparison.", limit=95)
return clean_text(f"{task_text}: {percents[0]} pass@1 in the reported comparison.", limit=95)
if "success rate" in low and len(percents) >= 2:
baseline = " over imitation/RL baselines" if "imitation" in low or "reinforcement learning" in low else " in the reported comparison"
return clean_text(f"{task_text}: {percents[0]}/{percents[1]} success-rate gains{baseline}.", limit=95)
if len(percents) >= 2:
return clean_text(f"{task_text}: {percents[0]} vs {percents[1]} in the reported comparison.", limit=95)
if len(percents) == 1:
return clean_text(f"{task_text}: reported result reaches {percents[0]} in the abstract.", limit=95)
return clean_text(chosen, limit=95)
def _metric_phrase(notes: list[dict[str, Any]]) -> str:
blob = " ".join(_result_fact(note) for note in notes)
low = blob.lower()
if "pass@1" in low:
return "pass@1 plus failure-type shifts"
if "success rate" in low:
return "success rate plus failure-type shifts"
if "accuracy" in low:
return "accuracy plus failure-type shifts"
if "exact match" in low:
return "exact match plus failure-type shifts"
if "f1" in low:
return "F1 plus failure-type shifts"
if "%" in blob:
return "the reported benchmark metric plus failure-type shifts"
return "task success plus failure-type shifts"
def _sentence_has_concrete_hook(text: str) -> bool:
    """Return True if `text` contains a digit or names a known metric or benchmark."""
    raw = str(text or "")
    if any(ch.isdigit() for ch in raw):
        return True
    low = raw.lower()
    return any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "alfworld", "webshop", "hotpotqa", "fever", "humaneval", "webarena", "agentbench"])
def _extract_task_mentions(note: dict[str, Any]) -> list[str]:
blob_parts = []
for field in ["key_results", "summary_bullets", "method", "abstract", "title"]:
val = note.get(field)
if isinstance(val, list):
blob_parts.extend([str(x) for x in val])
elif isinstance(val, str):
blob_parts.append(val)
blob = " ".join(blob_parts)
low = blob.lower()
positions: list[tuple[int, str]] = []
for hint in BENCHMARK_HINTS:
pos = low.find(hint.lower())
if pos >= 0:
positions.append((pos, hint))
for match in re.finditer(r"\b(?:HumanEval|ALFWorld|WebShop|HotpotQA|FEVER|WebArena|AgentBench|GSM8K|MATH)\b", blob):
positions.append((match.start(), match.group(0)))
positions.sort(key=lambda item: (item[0], item[1]))
return uniq_keep_order(name for _, name in positions)
def _task_phrase(notes: list[dict[str, Any]]) -> str:
tasks: list[str] = []
for note in notes:
tasks.extend(_extract_task_mentions(note))
uniq = uniq_keep_order(tasks)
if not uniq:
return "a small public task slice"
if len(uniq) == 1:
return uniq[0]
return "/".join(uniq[:2])
def _claim_text(note: dict[str, Any]) -> str:
candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
for cand in candidates:
if len(cand.split()) >= 8:
return cand
return clean_text(note.get("method") or note.get("title") or "", limit=220)
def _limitation_text(note: dict[str, Any], axis: str, cluster: str) -> str:
limitations = _specific_limitations(note)
if limitations and not any(pat in limitations[0].lower() for pat in GENERIC_LIMITATION_PATTERNS):
return limitations[0]
profile = _axis_profile(axis, cluster)
return f"The paper still leaves unclear whether {axis} is genuinely responsible for the reported story in {cluster.lower()}, or whether that story is really driven by {profile['confound']}."
def _anchor_notes(note_index: dict[str, dict[str, Any]], paper_ids: list[str]) -> list[dict[str, Any]]:
return [note_index[pid] for pid in paper_ids if pid in note_index]
def _paper_annotation(note: dict[str, Any], axis: str, cluster: str) -> str:
    title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
    profile = _axis_profile(axis, cluster)
    result = _result_fact(note)
    # Close the result with a period so it does not run into "Open gap:".
    if result and result[-1] not in ".!?;":
        result += "."
    return clean_sentence(
        f"{title}: {result} Open gap: {axis} still moves with {profile['confound']}.",
        limit=300,
    )
def _synthesis_annotation(notes: list[dict[str, Any]], axis: str, cluster: str) -> str:
profile = _axis_profile(axis, cluster)
task_phrase = _task_phrase(notes)
return clean_sentence(
f"Across the anchor papers, the live question is whether {axis} changes interpretation itself or merely rides along with {profile['confound']}, especially on {task_phrase}.",
limit=260,
)
def build_signal_rows(*, cluster: str, notes: list[dict[str, Any]]) -> list[IdeaSignal]:
if not notes:
return []
direction_type, axes, academic_value = _cluster_profile(cluster)
evidence_confidence = _evidence_confidence(notes)
anchors = notes[:3]
paper_ids = uniq_keep_order(str(n.get("paper_id") or "").strip() for n in anchors)
signals: list[IdeaSignal] = []
for idx, axis in enumerate(axes[:3], start=1):
anchor = anchors[(idx - 1) % len(anchors)]
claim = _claim_text(anchor)
tension = _limitation_text(anchor, axis, cluster)
profile = _axis_profile(axis, cluster)
missing_piece = clean_sentence(
f"What is still missing is a direction that isolates whether {axis} is the real explanatory variable in {cluster.lower()}, rather than leaving it entangled with {profile['confound']}.",
limit=240,
)
signals.append(IdeaSignal(
signal_id=f"SIG-{slugify(cluster)[:16]}-{idx}",
cluster=cluster,
direction_type=direction_type,
theme=f"{profile['title']} in {cluster}",
claim_or_observation=claim,
tension=clean_text(tension, limit=220),
missing_piece=missing_piece,
possible_axis=axis,
academic_value=academic_value,
evidence_confidence=evidence_confidence,
paper_ids=paper_ids,
))
return signals
def signal_table_markdown(rows: list[IdeaSignal]) -> str:
table_rows = []
for row in rows:
table_rows.append([
row.signal_id,
row.cluster,
row.theme,
row.claim_or_observation,
row.tension,
row.missing_piece,
row.possible_axis,
row.academic_value,
row.evidence_confidence,
", ".join(row.paper_ids),
])
return "\n".join([
"# IDEA_SIGNAL_TABLE",
"",
"## Research signals",
"",
markdown_table(["Signal ID", "Cluster", "Theme", "Claim / observation", "Tension", "Missing piece", "Possible axis", "Academic value", "Confidence", "Paper IDs"], table_rows),
"",
])
def _title_from_signal(signal: IdeaSignal) -> str:
profile = _axis_profile(signal.possible_axis, signal.cluster)
return clean_sentence(profile["title"], limit=72)
def _best_fit(direction_type: str, axis: str) -> str:
if direction_type == "governance":
return "Best fit when the group wants a deployment-relevant reading of agent behavior rather than another internal benchmark story."
if direction_type == "evaluation":
return "Best fit when the discussion is really about what counts as convincing evidence, not just how to raise scores."
if axis in {"search depth", "verification loop", "retrieval policy"}:
return "Best fit for a PhD discussion that wants one sharper mechanism/confound question rather than a broad systems buildout."
return "Best fit for PI/PhD discussions that want a thesis-worthy mechanism question rather than a generic benchmark wrapper."
def signals_to_direction_cards(signals: list[IdeaSignal], *, note_index: dict[str, dict[str, Any]], focus_clusters: list[str], pool_min: int, pool_max: int) -> list[DirectionCard]:
focus = {x.strip() for x in focus_clusters if str(x).strip()}
cards: list[DirectionCard] = []
for idx, signal in enumerate(signals, start=1):
profile = _axis_profile(signal.possible_axis, signal.cluster)
anchors = _anchor_notes(note_index, signal.paper_ids)
task_phrase = _task_phrase(anchors)
metric_phrase = _metric_phrase(anchors)
nearest = anchors[0] if anchors else {}
nearest_title = clean_text((nearest.get("title") if isinstance(nearest, dict) else "") or (signal.paper_ids[0] if signal.paper_ids else signal.cluster), limit=90)
nearest_task_phrase = _task_phrase([nearest]) if nearest else task_phrase
literature_suggests = [_paper_annotation(note, signal.possible_axis, signal.cluster) for note in anchors[:2]]
literature_suggests.append(_synthesis_annotation(anchors, signal.possible_axis, signal.cluster))
closest_prior_gap = [
clean_sentence(
f"{nearest_title} is the closest prior anchor because it already reports concrete behavior on {nearest_task_phrase}. The unresolved point is whether those gains survive once {profile['confound']} is held fixed while {signal.possible_axis} is varied.",
limit=220,
),
clean_sentence(
"The novelty test here is narrow, not rhetorical: if a strong anchor paper already runs that single-variable control, this direction should collapse quickly rather than stay alive as a vague confound story.",
limit=220,
),
]
one_line_thesis = clean_sentence(profile["thesis"], limit=230)
why_interesting = clean_sentence(
f"This is not just another benchmark wedge. It opens a {profile['program_kind']} line around {signal.possible_axis}: {profile['question']}.",
limit=220,
)
missing_piece = clean_sentence(profile["missing_piece"], limit=230)
possible_variants = [
clean_sentence(f"Single-variable control on {task_phrase}: vary {signal.possible_axis} while holding {profile['confound']} fixed as far as the setup allows.", limit=210),
clean_sentence("Replace score-only reporting with a failure-type comparison after the same intervention, so the result says more than whether the average went up.", limit=210),
clean_sentence("Check whether the conclusion survives on one simple public task slice and one more tool- or environment-heavy setting before treating it as a general claim.", limit=210),
]
first_probes = [
clean_sentence(
f"Intervention: vary {signal.possible_axis} while holding {profile['confound']} as fixed as possible on {task_phrase}. Readout: {metric_phrase}. Decisive if the interpretation changes even after the control.",
limit=220,
),
clean_sentence(
f"Prior-work audit: inspect {nearest_title} for any ablation that already fixes {profile['confound']}, and if the conclusion survives, demote this direction.",
limit=180,
),
]
what_counts_as_insight = clean_sentence(profile['insight'], limit=190)
kill_criteria = [
clean_sentence(f"Kill quickly if {profile['kill_signal']}.", limit=170),
clean_sentence(f"Kill if the first controlled probe leaves both {metric_phrase} and the failure taxonomy essentially unchanged.", limit=170),
]
weakness_conditions = [
clean_sentence(f"This direction is weaker if changing {signal.possible_axis} mostly rescales the aggregate metric without changing which failure modes appear or disappear.", limit=175),
clean_sentence(profile['demotion'], limit=170),
]
anchor_reading_notes: list[dict[str, str]] = []
for note in anchors[:3]:
title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
result = _result_fact(note)
anchor_reading_notes.append({
"paper_title": title,
"why_read": clean_sentence(f"Closest {signal.cluster.lower()} anchor with a concrete result hook on {task_phrase}: {result}", limit=180),
"what_to_extract": clean_sentence(f"Extract the metric and comparator, then check whether {profile['reading_extract']}.", limit=180),
"current_hook": clean_sentence(result, limit=180),
"kill_signal": clean_sentence(f"Weaken this direction if the paper already shows {profile['kill_signal']}.", limit=180),
})
why_this_ranks_here = clean_sentence(profile['priority_note'], limit=170)
cards.append(DirectionCard(
direction_id=f"DIR-{idx:03d}",
cluster=signal.cluster,
direction_type=signal.direction_type,
title=_title_from_signal(signal),
focus_axis=signal.possible_axis,
main_confound=profile['confound'],
program_kind=profile['program_kind'],
contribution_shape=profile['contribution_shape'],
time_to_clarity=profile['time_to_clarity'],
one_line_thesis=one_line_thesis,
why_interesting=why_interesting,
literature_suggests=literature_suggests,
closest_prior_gap=closest_prior_gap,
missing_piece=missing_piece,
possible_variants=possible_variants,
academic_value=profile['contribution_shape'],
first_probes=first_probes,
what_counts_as_insight=what_counts_as_insight,
weakness_conditions=weakness_conditions,
kill_criteria=kill_criteria,
what_would_change_mind=kill_criteria,
best_fit=_best_fit(signal.direction_type, signal.possible_axis),
why_this_ranks_here=why_this_ranks_here,
evidence_confidence=signal.evidence_confidence,
paper_ids=signal.paper_ids,
signal_ids=[signal.signal_id],
anchor_reading_notes=anchor_reading_notes,
))
cards.sort(key=lambda c: (0 if c.cluster in focus else 1, c.cluster, c.program_kind, c.title))
target = max(pool_min, min(pool_max, len(cards)))
return cards[:target]
def direction_pool_markdown(cards: list[DirectionCard]) -> str:
rows: list[list[str]] = []
for card in cards:
rows.append([
card.direction_id,
card.cluster,
card.direction_type,
card.program_kind,
card.title,
clean_sentence(card.one_line_thesis, limit=100),
clean_sentence(card.why_interesting, limit=100),
clean_sentence(card.missing_piece, limit=100),
clean_sentence(" / ".join(card.possible_variants), limit=120),
clean_sentence(card.academic_value, limit=90),
clean_sentence(" / ".join(card.first_probes), limit=120),
card.evidence_confidence,
", ".join(card.paper_ids),
])
return "\n".join([
"# IDEA_DIRECTION_POOL",
"",
"## Candidate research directions",
"",
markdown_table(["Direction ID", "Cluster", "Type", "Program", "Title", "One-line thesis", "Why interesting", "Missing piece", "Possible variants", "Academic value", "First probes", "Confidence", "Paper IDs"], rows),
"",
])
def _thesis_potential(direction_type: str, cluster: str) -> int:
if direction_type in {"mechanism", "coordination", "adaptation"}:
return 5
if direction_type in {"systems", "governance"}:
return 4
if "benchmark" in cluster.lower() or direction_type == "evaluation":
return 3
return 4


def score_direction_cards(
    cards: list[DirectionCard],
    *,
    focus_clusters: list[str],
    keep_rank_max: int,
    maybe_rank_max: int,
    score_weights: dict[str, float] | None = None,
) -> list[ScreenedDirection]:
    # Coerce to str before stripping so the filter and the stored value agree.
    focus = {str(x).strip() for x in focus_clusters if str(x).strip()}
    keep_rank_max = int(keep_rank_max)
    maybe_rank_max = int(maybe_rank_max)
    if keep_rank_max <= 0:
        raise ValueError("Invalid ideation scoring contract: keep_rank_max must be positive.")
    if maybe_rank_max < keep_rank_max:
        raise ValueError("Invalid ideation scoring contract: maybe_rank_max must be >= keep_rank_max.")
    weights = _validate_score_weights(score_weights or {}, field_name="score_direction_cards.score_weights")
    # Count how many cards share each cluster / program kind (used for distinctness).
    cluster_counts: dict[str, int] = {}
    program_counts: dict[str, int] = {}
    for card in cards:
        cluster_counts[card.cluster] = cluster_counts.get(card.cluster, 0) + 1
        program_counts[card.program_kind] = program_counts.get(card.program_kind, 0) + 1
    provisional: list[tuple[DirectionCard, float, int, int, int, int, int, int]] = []
    for card in cards:
        discussion_worthiness = 5 if card.cluster in focus or card.time_to_clarity == "fast" else 4
        academic_value_score = 5 if any(token in card.contribution_shape.lower() for token in ["rule", "protocol", "regime map", "causal"]) else 4
        concrete_hooks = sum(1 for item in card.literature_suggests if _sentence_has_concrete_hook(item))
        evidence_grounding = 5 if len(card.paper_ids) >= 3 and concrete_hooks >= 2 else 4 if len(card.paper_ids) >= 2 and concrete_hooks >= 1 else 3
        if program_counts.get(card.program_kind, 0) == 1 and cluster_counts.get(card.cluster, 0) == 1:
            direction_distinctness = 5
        elif program_counts.get(card.program_kind, 0) == 1 or cluster_counts.get(card.cluster, 0) == 1:
            direction_distinctness = 4
        else:
            direction_distinctness = 3
        probe_blob = " ".join(card.first_probes).lower()
        first_probe_clarity = 5 if all(token in probe_blob for token in ["intervention:", "readout:", "decisive if"]) else 4 if "intervention:" in probe_blob else 3
        thesis_potential = _thesis_potential(card.direction_type, card.cluster)
        total = round(
            discussion_worthiness * weights["discussion_worthiness"]
            + academic_value_score * weights["academic_value"]
            + evidence_grounding * weights["evidence_grounding"]
            + direction_distinctness * weights["direction_distinctness"]
            + first_probe_clarity * weights["first_probe_clarity"]
            + thesis_potential * weights["thesis_potential"],
            2,
        )
        provisional.append((card, total, discussion_worthiness, academic_value_score, evidence_grounding, direction_distinctness, first_probe_clarity, thesis_potential))
    # Highest total first; ties prefer fast time-to-clarity, then cluster and title.
    provisional.sort(key=lambda row: (-row[1], 0 if row[0].time_to_clarity == "fast" else 1, row[0].cluster, row[0].title))
    rows: list[ScreenedDirection] = []
    for idx, (card, total, dw, av, eg, dd, fp, tp) in enumerate(provisional, start=1):
        recommendation = "keep" if idx <= keep_rank_max else "maybe" if idx <= maybe_rank_max else "drop"
        strengths: list[str] = []
        if eg >= 5:
            strengths.append("concrete anchor evidence")
        if dd >= 5:
            strengths.append(f"distinct {card.program_kind} wedge")
        if fp >= 5:
            strengths.append("clean first probe")
        if av >= 5:
            strengths.append("thesis-sized payoff")
        if not strengths:
            strengths.append("discussion value")
        risk = "still abstract-first" if str(card.evidence_confidence).startswith("low") else f"main risk is controlling {card.main_confound} cleanly"
        rationale = clean_sentence(f"Strongest on {' + '.join(strengths[:2])}; {risk}.", limit=160)
        rows.append(ScreenedDirection(
            direction_id=card.direction_id,
            cluster=card.cluster,
            direction_type=card.direction_type,
            title=card.title,
            total_score=total,
            discussion_worthiness=dw,
            academic_value_score=av,
            evidence_grounding=eg,
            direction_distinctness=dd,
            first_probe_clarity=fp,
            thesis_potential=tp,
            recommendation=recommendation,
            rationale=rationale,
        ))
    return rows


def screening_table_markdown(rows: list[ScreenedDirection]) -> str:
    data: list[list[str]] = []
    for row in rows:
        data.append([
            row.direction_id,
            row.cluster,
            row.direction_type,
            row.title,
            f"{row.total_score:.2f}",
            str(row.discussion_worthiness),
            str(row.academic_value_score),
            str(row.evidence_grounding),
            str(row.direction_distinctness),
            str(row.first_probe_clarity),
            str(row.thesis_potential),
            row.recommendation,
            row.rationale,
        ])
    return "\n".join([
        "# IDEA_SCREENING_TABLE",
        "",
        "## Discussion-first screening",
        "",
        markdown_table(["Direction ID", "Cluster", "Type", "Title", "Total", "Discussion", "Academic value", "Evidence", "Distinctness", "First probe", "Thesis potential", "Decision", "Rationale"], data),
        "",
    ])
def shortlist_markdown(records: list[dict[str, Any]]) -> str:
lines = ["# IDEA_SHORTLIST", "", "## Prioritized directions", ""]
for record in records:
lines.extend([
f"### Direction {record['rank']}. {record['title']}",
f"- Cluster: {record['cluster']}",
f"- Type: {record['direction_type']}",
f"- Focus axis: {record.get('focus_axis')}",
f"- Program kind: {record.get('program_kind')}",
f"- Main confound: {record.get('main_confound')}",
f"- Time to clarity: {record.get('time_to_clarity')}",
f"- One-line thesis: {record['one_line_thesis']}",
f"- Why this is interesting: {record['why_interesting']}",
f"- Why this ranks here: {record['why_this_ranks_here']}",
f"- Contribution shape: {record.get('contribution_shape')}",
"- What the literature already suggests:",
])
for item in record.get("literature_suggests") or []:
lines.append(f" - {item}")
lines.extend([
"- Closest prior work and why it does not settle the question:",
])
for item in record.get("closest_prior_gap") or []:
lines.append(f" - {item}")
lines.extend([
f"- What is still missing: {record['missing_piece']}",
"- Possible variants:",
])
for item in record.get("possible_variants") or []:
lines.append(f" - {item}")
lines.extend([
f"- Why this could matter academically: {record['academic_value']}",
"- First probes:",
])
for item in record.get("first_probes") or []:
lines.append(f" - {item}")
lines.extend([
f"- What would count as actual insight: {record['what_counts_as_insight']}",
"- What would make this weak or unconvincing:",
])
for item in record.get("weakness_conditions") or []:
lines.append(f" - {item}")
lines.extend([
"- Quick kill criteria:",
])
for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
lines.append(f" - {item}")
lines.extend([
f"- Best fit: {record['best_fit']}",
f"- Evidence confidence: {record['evidence_confidence']}",
"- Anchor papers: " + ", ".join(f"`{pid}`" for pid in record.get("paper_ids") or []),
f"- Why prioritized now: {record['why_prioritized']}",
"",
])
return "\n".join(lines)


def shortlist_snapshot_table(records: list[dict[str, Any]]) -> str:
    rows = []
    for rec in records:
        rows.append([
            str(rec.get("rank") or ""),
            rec.get("title", ""),
            clean_sentence(rec.get("why_this_ranks_here", ""), limit=52),
            clean_sentence(rec.get("contribution_shape", rec.get("academic_value", "")), limit=56),
            clean_sentence((rec.get("kill_criteria") or rec.get("what_would_change_mind") or [""])[0], limit=54),
        ])
    return markdown_table(["Rank", "Direction", "Why now", "If it survives", "Fast kill signal"], rows)
def build_report_payload(*, topic: str, shortlist: list[dict[str, Any]], deferred: list[dict[str, Any]], trace_paths: dict[str, str]) -> dict[str, Any]:
top = list(shortlist)
takeaways: list[str] = []
if top:
takeaways.append("The strongest directions are the ones most likely to change how existing results are interpreted, not just add another benchmark win.")
takeaways.append("The current rank order reflects a tradeoff between time-to-clarity, thesis-sized payoff, and how concrete the nearest prior-work gap already looks.")
program_kinds = uniq_keep_order(str(rec.get("program_kind") or "").strip() for rec in top)
if len(program_kinds) >= 2:
takeaways.append(f"The lead set is intentionally not one confound template repeated three times: it spans {', '.join(program_kinds[:3])} rather than a single explanatory mold.")
else:
takeaways.append("The lead set still sits in one tight neighborhood, so the next reading pass should stress-test whether those directions are genuinely distinct thesis lines.")
if any(str(rec.get("evidence_confidence") or "").startswith("low") for rec in top):
takeaways.append("Most anchors are still abstract-first, so the memo is best used to decide the next reading and falsification pass rather than to lock a project immediately.")
else:
takeaways.append("The evidence is already concrete enough for a serious PI/PhD discussion, though one deeper paper pass could still reshuffle the exact order.")
lead = top[0] if top else {}
lead_anchor = (lead.get("anchor_reading_notes") or [{}])[0] if top else {}
lead_anchor_title = lead_anchor.get("paper_title") if isinstance(lead_anchor, dict) else "the first anchor paper"
discussion_questions = [
"Which direction still survives once we ask for a single-variable control rather than a suggestive confound story?",
"Which lead direction would still matter if the nearest prior work already addressed the obvious control we are worried about?",
"Which candidate could plausibly turn into a thesis line with a reusable method, protocol, or regime map rather than a one-off empirical note?",
"Which ranking would change most after one full-paper reading pass on the anchor set?",
]
uncertainties = [
"The main remaining risk is novelty risk, not idea scarcity: closer reading may reveal that one lead direction is already settled by an existing control or ablation.",
"Evidence confidence is still bounded by abstract-first notes for part of the lead set, so paper-specific details could either strengthen or kill a direction quickly.",
"The memo is more reliable as a discussion and triage artifact than as a final commitment on exact project order.",
]
next_steps = []
if top:
next_steps.append(clean_sentence(f"Start with {lead.get('title')}: read {lead_anchor_title or 'the first anchor paper'} looking specifically for whether {lead.get('focus_axis')} is already isolated against {lead.get('main_confound')}.", limit=220))
first_probe = (lead.get("first_probes") or [""])[0]
if first_probe:
next_steps.append(clean_sentence(first_probe, limit=220))
next_steps.append(clean_sentence(f"Re-rank only after checking the quick kill criteria for {lead.get('title')} and at least one competing direction in the lead set.", limit=220))
else:
next_steps.extend([
"Read the anchor papers for the current top direction in full and test whether the memo's stated hidden variable still looks unresolved.",
"In the next PI/PhD discussion, use the ranking arguments and kill criteria to decide whether any direction deserves promotion into a sharper thesis candidate.",
])
return {
"topic": topic,
"takeaways": takeaways,
"top_directions": top,
"deferred_directions": deferred,
"discussion_questions": discussion_questions,
"uncertainties": uncertainties,
"next_steps": next_steps,
"trace_artifacts": trace_paths,
}
def report_markdown(payload: dict[str, Any]) -> str:
top_dirs = payload.get("top_directions") or []
deferred_idx = 3 + len(top_dirs)
discussion_idx = deferred_idx + 1
uncertainty_idx = deferred_idx + 2
next_idx = deferred_idx + 3
appendix_idx = deferred_idx + 4
lines = [
"# Research Idea Brainstorm Memo",
"",
"## 0. Scope and framing",
"",
f"- Topic: {payload.get('topic') or 'research ideas'}",
"- Intended readers: PI / PhD",
"- Goal: surface a small number of discussion-worthy research directions rather than force a final project choice.",
"- Current evidence basis: abstract-first notes unless otherwise stated.",
"- What this memo is not: not a final project spec, not a survey draft, and not a symmetric top-3 proposal pack.",
"",
"## 1. Big-picture takeaways",
"",
]
for item in payload.get("takeaways") or []:
lines.append(f"- {item}")
lines.extend([
"",
"## 2. Top directions at a glance",
"",
shortlist_snapshot_table(payload.get("top_directions") or []),
"",
])
for idx, record in enumerate(top_dirs, start=1):
lines.extend([
f"## {idx + 2}. Direction {idx} — {record.get('title')}",
"",
"### One-line thesis",
f"- {record.get('one_line_thesis')}",
"",
"### Why it belongs in the lead set",
f"- {record.get('why_this_ranks_here')}",
f"- Time to clarity: {record.get('time_to_clarity')}",
"",
"### What the current literature actually shows",
])
for item in record.get("literature_suggests") or []:
lines.append(f"- {item}")
lines.extend([
"",
"### Closest prior work and remaining gap",
])
for item in record.get("closest_prior_gap") or []:
lines.append(f"- {item}")
lines.extend([
f"- Missing piece: {record.get('missing_piece')}",
"",
"### If this direction is right, what contribution emerges",
f"- {record.get('contribution_shape') or record.get('academic_value')}",
f"- {record.get('what_counts_as_insight')}",
"",
"### Smallest decisive probe",
])
for item in record.get("first_probes") or []:
lines.append(f"- {item}")
lines.extend([
"",
"### Quick kill criteria",
])
for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
lines.append(f"- {item}")
lines.extend(["",])
lines.extend([
f"## {deferred_idx}. Other promising but not prioritized directions",
"",
])
deferred = payload.get("deferred_directions") or []
if deferred:
for record in deferred:
lines.append(f"- **{record.get('title')}** — {record.get('one_line_thesis')} (why not prioritized now: {record.get('why_not_prioritized')})")
else:
lines.append("- No additional deferred directions were retained in the current memo.")
lines.extend([
"",
f"## {discussion_idx}. Cross-cutting discussion questions",
"",
])
for item in payload.get("discussion_questions") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {uncertainty_idx}. Uncertainty and disagreement",
"",
])
for item in payload.get("uncertainties") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {next_idx}. Suggested next reading / next discussion step",
"",
])
for item in payload.get("next_steps") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {appendix_idx}. Appendix guide",
"",
"- `output/APPENDIX.md` for anchor-paper reading notes and deferred directions.",
"- `output/REPORT.json` for the structured version of this memo.",
])
return "\n".join(lines)
def appendix_markdown(payload: dict[str, Any], *, core_titles: dict[str, str]) -> str:
lines = [
"# Appendix to the Research Idea Brainstorm Memo",
"",
"## A. Deferred but still promising directions",
"",
]
deferred = payload.get("deferred_directions") or []
if deferred:
for rec in deferred:
lines.extend([
f"### {rec.get('title')}",
f"- One-line thesis: {rec.get('one_line_thesis')}",
f"- Program kind: {rec.get('program_kind')}",
f"- Why interesting: {rec.get('why_interesting')}",
f"- Why not prioritized now: {rec.get('why_not_prioritized')}",
"",
])
else:
lines.append("- (none)")
lines.extend([
"## B. Anchor papers and what to extract",
"",
])
for rec in payload.get("top_directions") or []:
lines.append(f"### {rec.get('title')}")
lines.append(f"- Program kind: {rec.get('program_kind')}")
lines.append(f"- Lead-set reason: {rec.get('why_this_ranks_here')}")
notes = rec.get("anchor_reading_notes") or []
if notes:
rows: list[list[str]] = []
for item in notes:
if not isinstance(item, dict):
continue
rows.append([
item.get("paper_title", "paper"),
item.get("why_read", ""),
item.get("what_to_extract", ""),
item.get("current_hook", ""),
item.get("kill_signal", ""),
])
if rows:
lines.append(markdown_table(["Anchor paper", "Why read now", "What to extract", "Current evidence hook", "Kill signal"], rows))
else:
for pid in rec.get("paper_ids") or []:
title = core_titles.get(pid, pid)
lines.append(f"- {title}")
else:
for pid in rec.get("paper_ids") or []:
title = core_titles.get(pid, pid)
lines.append(f"- {title}")
lines.append("")
return "\n".join(lines)
FILE:tooling/pipeline_spec.py
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Any

import yaml


@dataclass(frozen=True)
class PipelineStage:
    id: str
    title: str
    checkpoint: str
    mode: str
    required_skills: tuple[str, ...]
    optional_skills: tuple[str, ...]
    produces: tuple[str, ...]
    human_checkpoint: dict[str, Any]

    @staticmethod
    def from_mapping(stage_id: str, data: Any, *, path: Path) -> "PipelineStage":
        if not isinstance(data, dict):
            raise ValueError(f"`stages.{stage_id}` must be a mapping in {path}")
        return PipelineStage(
            id=str(stage_id or "").strip(),
            title=str(data.get("title") or stage_id).strip() or str(stage_id),
            checkpoint=str(data.get("checkpoint") or stage_id).strip() or str(stage_id),
            mode=str(data.get("mode") or "").strip(),
            required_skills=_string_tuple(data.get("required_skills"), field_name=f"stages.{stage_id}.required_skills", path=path),
            optional_skills=_string_tuple(data.get("optional_skills"), field_name=f"stages.{stage_id}.optional_skills", path=path),
            produces=_string_tuple(data.get("produces"), field_name=f"stages.{stage_id}.produces", path=path),
            human_checkpoint=_mapping(data.get("human_checkpoint"), field_name=f"stages.{stage_id}.human_checkpoint", path=path, required=False),
        )


@dataclass(frozen=True)
class PipelineSpec:
    path: Path
    name: str
    version: str
    units_template: str
    default_checkpoints: tuple[str, ...]
    profile: str
    routing_hints: tuple[str, ...]
    routing_default: bool
    routing_priority: int
    target_artifacts: tuple[str, ...]
    contract_model: str
    structure_mode: str
    pre_retrieval_shell: dict[str, Any]
    binding_layers: tuple[str, ...]
    core_chapter_h3_target: int
    query_defaults: dict[str, Any]
    overridable_query_fields: tuple[str, ...]
    quality_contract: dict[str, Any]
    loop_policy: dict[str, Any]
    stages: dict[str, PipelineStage]
    variant_of: str
    variant_overrides: dict[str, Any]

    @staticmethod
    def load(path: Path) -> "PipelineSpec":
        resolved = path.resolve()
        raw_frontmatter, frontmatter = _load_variant_aware_frontmatter(resolved)
        name = str(frontmatter.get("name") or path.stem.replace(".pipeline", ""))
        units_template = str(frontmatter.get("units_template") or "")
        version = str(frontmatter.get("version") or "")
        default_checkpoints = _string_tuple(frontmatter.get("default_checkpoints"), field_name="default_checkpoints", path=resolved)
        profile = str(frontmatter.get("profile") or "default").strip() or "default"
        routing_hints = _string_tuple(frontmatter.get("routing_hints"), field_name="routing_hints", path=resolved)
        routing_default = bool(frontmatter.get("routing_default"))
        try:
            routing_priority = int(frontmatter.get("routing_priority") or 0)
        except Exception:
            routing_priority = 0
        target_artifacts = _string_tuple(frontmatter.get("target_artifacts"), field_name="target_artifacts", path=resolved)
        contract_model = str(frontmatter.get("contract_model") or "").strip()
        structure_mode = str(frontmatter.get("structure_mode") or "").strip()
        pre_retrieval_shell = _mapping(frontmatter.get("pre_retrieval_shell"), field_name="pre_retrieval_shell", path=resolved, required=False)
        binding_layers = _string_tuple(frontmatter.get("binding_layers"), field_name="binding_layers", path=resolved)
        try:
            core_chapter_h3_target = int(frontmatter.get("core_chapter_h3_target") or 0)
        except Exception:
            core_chapter_h3_target = 0
        query_defaults = _mapping(frontmatter.get("query_defaults"), field_name="query_defaults", path=resolved, required=False)
        overridable_query_fields = _string_tuple(
            frontmatter.get("overridable_query_fields"),
            field_name="overridable_query_fields",
            path=resolved,
        )
        quality_contract = _mapping(frontmatter.get("quality_contract"), field_name="quality_contract", path=resolved, required=False)
        loop_policy = _mapping(frontmatter.get("loop_policy"), field_name="loop_policy", path=resolved, required=False)
        stages = _parse_stages(frontmatter.get("stages"), path=resolved)
        variant_of = str(raw_frontmatter.get("variant_of") or "").strip()
        variant_overrides = _mapping(raw_frontmatter.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False)
        if not units_template:
            raise ValueError(f"Missing units_template in pipeline front matter: {resolved}")
        return PipelineSpec(
            path=resolved,
            name=name,
            version=version,
            units_template=units_template,
            default_checkpoints=default_checkpoints,
            profile=profile,
            routing_hints=routing_hints,
            routing_default=routing_default,
            routing_priority=routing_priority,
            target_artifacts=target_artifacts,
            contract_model=contract_model,
            structure_mode=structure_mode,
            pre_retrieval_shell=pre_retrieval_shell,
            binding_layers=binding_layers,
            core_chapter_h3_target=core_chapter_h3_target,
            query_defaults=query_defaults,
            overridable_query_fields=overridable_query_fields,
            quality_contract=quality_contract,
            loop_policy=loop_policy,
            stages=stages,
            variant_of=variant_of,
            variant_overrides=variant_overrides,
        )

    def query_default(self, key: str, default: Any = None) -> Any:
        return self.query_defaults.get(str(key or "").strip(), default)

    def allows_query_override(self, key: str) -> bool:
        return str(key or "").strip() in set(self.overridable_query_fields)


def _parse_frontmatter(text: str) -> dict[str, Any]:
    """Parse the YAML front matter delimited by the first two `---` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("Pipeline file must start with YAML front matter '---'")
    end_idx = None
    for idx in range(1, len(lines)):
        if lines[idx].strip() == "---":
            end_idx = idx
            break
    if end_idx is None:
        raise ValueError("Unterminated YAML front matter (missing closing '---')")
    raw = "\n".join(lines[1:end_idx])
    data = yaml.safe_load(raw) or {}
    if not isinstance(data, dict):
        raise ValueError("Pipeline YAML front matter must be a mapping")
    return data


def _load_variant_aware_frontmatter(path: Path, seen: set[Path] | None = None) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return (raw front matter, effective front matter after resolving `variant_of`)."""
    resolved = path.resolve()
    active = set(seen or set())
    if resolved in active:
        # `active` is a set, so the order of paths shown in this message is not guaranteed.
        chain = " -> ".join(str(p) for p in [*active, resolved])
        raise ValueError(f"Cyclic `variant_of` chain detected: {chain}")
    active.add(resolved)
    text = resolved.read_text(encoding="utf-8")
    raw = _parse_frontmatter(text)
    variant_of = str(raw.get("variant_of") or "").strip()
    if not variant_of:
        return raw, dict(raw)
    _validate_raw_variant_frontmatter(raw, path=resolved)
    base_path = _resolve_pipeline_reference(resolved, variant_of)
    _, base_effective = _load_variant_aware_frontmatter(base_path, seen=active)
    # Merge order: base pipeline, then this file's name/version, then its variant_overrides.
    current = {key: raw[key] for key in ("name", "version") if key in raw}
    merged = _deep_merge(base_effective, current)
    merged = _deep_merge(
        merged,
        _mapping(raw.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False),
    )
    merged["variant_of"] = variant_of
    merged["variant_overrides"] = raw.get("variant_overrides") or {}
    return raw, merged


def _validate_raw_variant_frontmatter(raw: dict[str, Any], *, path: Path) -> None:
    allowed_raw_keys = {"name", "version", "variant_of", "variant_overrides"}
    extra_keys = sorted(str(key) for key in raw.keys() if str(key) not in allowed_raw_keys)
    if extra_keys:
        raise ValueError(
            "Variant pipeline files may only keep top-level keys "
            "`name`, `version`, `variant_of`, `variant_overrides`; "
            f"move the rest under `variant_overrides` in {path}: {', '.join(extra_keys)}"
        )


def _resolve_pipeline_reference(path: Path, ref: str) -> Path:
    value = str(ref or "").strip()
    if not value:
        raise ValueError(f"Empty `variant_of` reference in {path}")
    candidate = Path(value)
    if candidate.is_absolute() and candidate.exists():
        return candidate.resolve()
    repo_root = path.parent.parent
    # First try the reference as a literal relative path.
    for base in (path.parent, repo_root, repo_root / "pipelines"):
        direct = (base / value).resolve()
        if direct.exists():
            return direct
    # Then treat it as a bare pipeline name and look for `<name>.pipeline.md`.
    stem = Path(value).name
    if stem.endswith(".pipeline.md"):
        stem = stem[: -len(".pipeline.md")]
    if stem:
        for base in (path.parent, repo_root / "pipelines"):
            direct = (base / f"{stem}.pipeline.md").resolve()
            if direct.exists():
                return direct
    raise ValueError(f"Could not resolve `variant_of: {value}` from {path}")


def _deep_merge(base: Any, override: Any) -> Any:
    if isinstance(base, list) and isinstance(override, dict) and any(str(key).startswith("__") for key in override.keys()):
        return _apply_list_patch(base, override)
    if isinstance(base, dict) and isinstance(override, dict):
        merged: dict[str, Any] = {str(k): v for k, v in base.items()}
        for key, value in override.items():
            if key in merged:
                merged[key] = _deep_merge(merged[key], value)
            else:
                merged[key] = value
        return merged
    return override


def _apply_list_patch(base: list[Any], override: dict[str, Any]) -> list[Any]:
    allowed_keys = {"__append__", "__prepend__", "__remove__", "__replace__"}
    bad_keys = sorted(str(key) for key in override.keys() if str(key) not in allowed_keys)
    if bad_keys:
        raise ValueError(
            "List patch overrides only support "
            "`__append__`, `__prepend__`, `__remove__`, `__replace__`; "
            f"got: {', '.join(bad_keys)}"
        )
    if "__replace__" in override:
        replacement = override.get("__replace__")
        if not isinstance(replacement, list):
            raise ValueError("List patch `__replace__` must be a YAML list")
        return list(replacement)
    current = list(base)
    remove_values = override.get("__remove__", [])
    prepend_values = override.get("__prepend__", [])
    append_values = override.get("__append__", [])
    for field_name, value in (
        ("__remove__", remove_values),
        ("__prepend__", prepend_values),
        ("__append__", append_values),
    ):
        if not isinstance(value, list):
            raise ValueError(f"List patch `{field_name}` must be a YAML list")
    if remove_values:
        current = [item for item in current if item not in remove_values]
    if prepend_values:
        current = list(prepend_values) + current
    if append_values:
        current = current + list(append_values)
    return current


def _mapping(value: Any, *, field_name: str, path: Path, required: bool) -> dict[str, Any]:
    # NOTE: `required` is accepted but not enforced here; a missing value always yields {}.
    if value is None:
        return {}
    if not isinstance(value, dict):
        raise ValueError(f"`{field_name}` must be a mapping in {path}")
    return {str(key): item for key, item in value.items()}


def _string_tuple(value: Any, *, field_name: str, path: Path) -> tuple[str, ...]:
    if value is None:
        return ()
    if not isinstance(value, list):
        raise ValueError(f"`{field_name}` must be a YAML list in {path}")
    out: list[str] = []
    for item in value:
        text = str(item or "").strip()
        if text:
            out.append(text)
    return tuple(out)


def _parse_stages(value: Any, *, path: Path) -> dict[str, PipelineStage]:
    if value is None:
        return {}
    items: list[tuple[str, Any]] = []
    if isinstance(value, dict):
        items = [(str(stage_id or "").strip(), data) for stage_id, data in value.items()]
    elif isinstance(value, list):
        for idx, item in enumerate(value):
            if not isinstance(item, dict):
                raise ValueError(f"`stages[{idx}]` must be a mapping in {path}")
            stage_id = str(item.get("id") or item.get("stage") or "").strip()
            if not stage_id:
                raise ValueError(f"`stages[{idx}]` is missing `id` in {path}")
            items.append((stage_id, item))
    else:
        raise ValueError(f"`stages` must be a mapping or list in {path}")
    stages: dict[str, PipelineStage] = {}
    for stage_id, data in items:
        stage = PipelineStage.from_mapping(stage_id, data, path=path)
        if not stage.id:
            raise ValueError(f"`stages` contains an empty stage id in {path}")
        stages[stage.id] = stage
    return stages
FILE:tooling/pipeline_text.py
from __future__ import annotations

import json
import re
from pathlib import Path
from typing import Any

from tooling.common import load_yaml, read_jsonl


def slug_unit_id(unit_id: str) -> str:
    raw = str(unit_id or '').strip()
    out: list[str] = []
    for ch in raw:
        out.append(ch if ch.isalnum() else '_')
    safe = ''.join(out).strip('_')
    return f'S{safe}' if safe else 'S'


def load_outline_sections(path: Path) -> list[dict[str, Any]]:
    outline = load_yaml(path) if path.exists() else []
    if not isinstance(outline, list):
        return []
    out: list[dict[str, Any]] = []
    for sec in outline:
        if not isinstance(sec, dict):
            continue
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        subsections: list[dict[str, str]] = []
        for sub in sec.get('subsections') or []:
            if not isinstance(sub, dict):
                continue
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                subsections.append({'id': sub_id, 'title': sub_title})
        if sec_id and sec_title:
            out.append({'id': sec_id, 'title': sec_title, 'subsections': subsections})
    return out


def iter_h3_units(path: Path) -> list[dict[str, str]]:
    units: list[dict[str, str]] = []
    for sec in load_outline_sections(path):
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        for sub in sec.get('subsections') or []:
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                units.append(
                    {
                        'section_id': sec_id,
                        'section_title': sec_title,
                        'sub_id': sub_id,
                        'title': sub_title,
                    }
                )
    return units


def read_jsonl_map(path: Path, key: str) -> dict[str, dict[str, Any]]:
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        val = str(rec.get(key) or '').strip()
        if val:
            out[val] = rec
    return out


def read_bib_keys(path: Path) -> set[str]:
    if not path.exists() or path.stat().st_size <= 0:
        return set()
    text = path.read_text(encoding='utf-8', errors='ignore')
    return set(re.findall(r'(?im)^@\w+\s*\{\s*([^,\s]+)\s*,', text))


def uniq_keep_order(items: list[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or '').strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out


def clean_excerpt(text: str, *, limit: int = 220) -> str:
    s = str(text or '').strip()
    s = s.replace('\n', ' ')
    s = re.sub(r'\s+', ' ', s)
    s = s.strip(' "\'`')
    s = s.replace('|', ', ')
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(' ', 1)[0].strip()
    return clipped if clipped else s[:limit].strip()


def citation_list(keys: list[str], *, max_keys: int = 3) -> str:
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'[@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'


def inline_evidence_phrase(keys: list[str], *, max_keys: int = 3) -> str:
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'in [@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'


def heading_blocks(md: str) -> list[tuple[str, str]]:
    blocks: list[tuple[str, str]] = []
    current = ''
    lines: list[str] = []
    for raw in (md or '').splitlines():
        if raw.startswith('### '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = raw[4:].strip()
            lines = []
            continue
        if raw.startswith('## '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = ''
            lines = []
            continue
        if current:
            lines.append(raw)
    if current:
        blocks.append((current, '\n'.join(lines).strip()))
    return blocks


def dump_jsonl_lines(records: list[dict[str, Any]]) -> str:
    return '\n'.join(json.dumps(r, ensure_ascii=False) for r in records).rstrip() + ('\n' if records else '')
Regression-check citation anchoring (citations stay in the same subsection) to prevent “polish drift” that breaks claim→evidence alignment. **Trigger**: cita...
---
name: citation-anchoring
description: 'Regression-check citation anchoring (citations stay in the same subsection) to prevent “polish drift” that breaks
claim→evidence alignment.
**Trigger**: citation anchoring, citation drift, regression, cite stability, 引用锚定, 引用漂移.
**Use when**: after editing/polishing, you want to confirm citations did not migrate across `###` subsections.
**Skip if**: you do not have a baseline anchor file yet.
**Network**: none.
**Guardrail**: analysis-only; do not edit content.'
version: 0.1.0
metadata:
  openclaw:
    requires:
      anyBins:
        - python3
        - python
---
# Citation Anchoring (regression)
Purpose: catch a common failure mode in which polishing rewrites text and accidentally moves citation markers into a different `###` subsection, breaking claim→evidence alignment.
## Inputs
- `output/DRAFT.md`
- `output/citation_anchors.prepolish.jsonl` (baseline; created by `draft-polisher` on first run)
## Outputs
- `output/CITATION_ANCHORING_REPORT.md` (PASS/FAIL + drift examples)
## Baseline policy
- `draft-polisher` captures a baseline once per run: `output/citation_anchors.prepolish.jsonl`.
- Subsequent polish runs should keep per-H3 citation sets stable.
## Workflow (analysis-only)
Role:
- **Auditor**: only checks and reports; does not edit.
Steps:
1) Load the baseline anchors.
2) Parse the current `output/DRAFT.md` into `###` subsections and extract citation keys per subsection.
3) Compare current sets to baseline sets:
- keys added/removed within a subsection
- keys that migrated across subsections
4) Write `output/CITATION_ANCHORING_REPORT.md`:
- `- Status: PASS` only if no drift is detected
- otherwise, `- Status: FAIL` with a short diff table + examples
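The comparison in steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the bundled script: the function names (`keys_by_h3`, `diff_anchors`, `migrated_keys`) are hypothetical, citations are assumed to use the `[@key; @key2]` form, and the baseline is assumed to have already been loaded into a `{h3_title: set_of_keys}` mapping (the on-disk schema of `citation_anchors.prepolish.jsonl` is not specified here).

```python
import re

# Assumption: citations look like [@a] or [@a; @b].
CITE_BLOCK = re.compile(r"\[([^\[\]]*@[^\[\]]*)\]")
CITE_KEY = re.compile(r"@([A-Za-z0-9_:.\-]+)")


def keys_by_h3(md: str) -> dict[str, set[str]]:
    """Map each `###` heading title to the citation keys found in its body."""
    out: dict[str, set[str]] = {}
    current = ""
    for raw in md.splitlines():
        if raw.startswith("### "):
            current = raw[4:].strip()
            out.setdefault(current, set())
        elif raw.startswith("## "):
            current = ""  # an H2 heading closes the open H3
        elif current:
            for block in CITE_BLOCK.findall(raw):
                out[current].update(CITE_KEY.findall(block))
    return out


def diff_anchors(baseline: dict[str, set[str]], current: dict[str, set[str]]) -> dict[str, dict[str, list[str]]]:
    """Return per-subsection added/removed keys; an empty dict means no drift."""
    drift: dict[str, dict[str, list[str]]] = {}
    for title in sorted(set(baseline) | set(current)):
        old = baseline.get(title, set())
        new = current.get(title, set())
        if old != new:
            drift[title] = {"added": sorted(new - old), "removed": sorted(old - new)}
    return drift


def migrated_keys(drift: dict[str, dict[str, list[str]]]) -> list[str]:
    """Keys removed from one subsection and added to another."""
    added = {k for d in drift.values() for k in d["added"]}
    removed = {k for d in drift.values() for k in d["removed"]}
    return sorted(added & removed)
```

Under these assumptions, the report would be `PASS` exactly when `diff_anchors(...)` returns an empty mapping.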
## Notes
If you intentionally restructure across subsections:
- delete `output/citation_anchors.prepolish.jsonl` and regenerate a new baseline (then treat that as the new regression anchor).
## Troubleshooting
### Issue: baseline anchor file is missing
Fix:
- Run `draft-polisher` once to generate `output/citation_anchors.prepolish.jsonl`, then rerun the anchoring check.
### Issue: citations intentionally moved across subsections
Fix:
- Delete `output/citation_anchors.prepolish.jsonl` and regenerate a new baseline (then treat that as the new regression anchor).
FILE:AGENTS.md
# AGENTS.md
This exported skill uses `AGENTS.md` only as a local repo-root marker for bundled helper scripts.
Source skill: `citation-anchoring` from `research-units-pipeline-skills`.
Export slug: `citation-anchoring`.
FILE:pipelines/arxiv-survey-latex.pipeline.md
---
name: arxiv-survey-latex
version: 3.8
variant_of: arxiv-survey
variant_overrides:
routing_hints: [latex, pdf, tex, 可编译, 编译]
routing_default: false
routing_priority: 20
units_template: templates/UNITS.arxiv-survey-latex.csv
target_artifacts:
__append__:
- latex/main.tex
- latex/main.pdf
- output/LATEX_BUILD_REPORT.md
stages:
C5:
title: Draft + PDF
required_skills:
__append__:
- latex-scaffold
- latex-compile-qa
optional_skills:
__remove__:
- latex-scaffold
- latex-compile-qa
produces:
__append__:
- latex/main.tex
- latex/main.pdf
- output/LATEX_BUILD_REPORT.md
---
# Pipeline: arXiv survey / review (MD-first + LaTeX/PDF)
Variant of `arxiv-survey`.
Use this pipeline only when the default deliverable must include:
- `latex/main.tex`
- `latex/main.pdf`
- `output/LATEX_BUILD_REPORT.md`
All non-PDF survey behavior, defaults, checkpoints, and earlier stages inherit from `arxiv-survey`.
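One way to read the `__append__` / `__remove__` markers in the frontmatter above is as list-merge operators applied on top of the inherited `arxiv-survey` values. The sketch below illustrates that reading only; `apply_overrides` is a hypothetical helper, and both the merge semantics and the exact nesting of the override keys are assumptions rather than the pipeline loader's confirmed behavior.

```python
from copy import deepcopy
from typing import Any

LIST_OPS = {"__append__", "__remove__"}


def apply_overrides(base: Any, override: Any) -> Any:
    """Merge an override tree onto a base tree (assumed semantics)."""
    # A dict holding only list operators edits the inherited list.
    if isinstance(override, dict) and override and set(override) <= LIST_OPS:
        removed = set(override.get("__remove__", []))
        items = [x for x in (base or []) if x not in removed]
        for x in override.get("__append__", []):
            if x not in items:
                items.append(x)
        return items
    # Nested mappings are merged key by key.
    if isinstance(override, dict) and isinstance(base, dict):
        merged = deepcopy(base)
        for key, value in override.items():
            merged[key] = apply_overrides(base.get(key), value)
        return merged
    # Anything else (scalars, plain lists) replaces the inherited value.
    return deepcopy(override)
```

With this reading, the C5 override would move `latex-scaffold` and `latex-compile-qa` from `optional_skills` to `required_skills` and extend `produces` with the three LaTeX artifacts.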
FILE:pipelines/arxiv-survey.pipeline.md
---
name: arxiv-survey
version: 3.8
profile: arxiv-survey
routing_hints: [survey, review, 综述, 调研, literature review]
routing_default: true
routing_priority: 10
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
- outline/transitions.md
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/claim_evidence_matrix.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- outline/writer_context_packs.jsonl
- citations/ref.bib
- citations/verified.jsonl
- sections/sections_manifest.jsonl
- sections/h3_bodies.refined.ok
- sections/paragraphs_curated.refined.ok
- sections/style_harmonized.refined.ok
- sections/opener_varied.refined.ok
- sections/abstract.md
- sections/S1.md
- sections/S2.md
- sections/discussion.md
- sections/conclusion.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/SCHEMA_NORMALIZATION_REPORT.md
- output/EVIDENCE_SELFLOOP_TODO.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/PARAGRAPH_CURATION_REPORT.md
- output/FRONT_MATTER_REPORT.md
- output/CHAPTER_LEADS_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/GLOBAL_REVIEW.md
- output/DRAFT.md
- output/MERGE_REPORT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.arxiv-survey.csv
contract_model: pipeline.frontmatter/v1
structure_mode: section_first
pre_retrieval_shell:
enabled: true
approval_surface: false
allowed_h2: [Introduction, Related Work, Core Chapters, Discussion, Conclusion]
binding_layers: [chapter_skeleton, section_bindings, section_briefs, subsection_mapping]
core_chapter_h3_target: 3
query_defaults:
max_results: 1800
core_size: 300
per_subsection: 28
global_citation_min_subsections: 4
draft_profile: survey
citation_target: recommended
evidence_mode: abstract
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- per_subsection
- global_citation_min_subsections
- draft_profile
- citation_target
- enrich_metadata
- evidence_mode
- fulltext_max_papers
- fulltext_max_pages
- fulltext_min_chars
- time_window.from
- time_window.to
quality_contract:
citation_policy:
unique_hard_floor: 150
unique_recommended: 165
structure_policy:
max_final_h2_by_profile:
survey: 8
deep: 9
max_h3_by_profile:
survey: 10
deep: 12
front_matter_policy:
survey:
introduction:
min_cites: 35
min_paras: 5
min_chars: 2600
related_work:
min_cites: 50
min_paras: 6
min_chars: 3200
deep:
introduction:
min_cites: 40
min_paras: 6
min_chars: 3000
related_work:
min_cites: 55
min_paras: 7
min_chars: 3600
subsection_policy:
survey:
min_unique_citations: 12
min_chars: 4200
deep:
min_unique_citations: 14
min_chars: 5200
loop_policy:
stage_retry_budget:
C1: 2
C2: 2
C3: 1
C4: 1
max_reroutes: 4
require_human_on_retry_after_approval: true
stages:
C0:
title: Init
mode: no_prose
required_skills: [workspace-init, pipeline-router]
optional_skills: []
produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
C1:
title: Retrieval & core set
mode: no_prose
required_skills: [literature-engineer, dedupe-rank]
optional_skills: [keyword-expansion, survey-seed-harvest]
produces: [papers/papers_raw.jsonl, papers/retrieval_report.md, papers/papers_dedup.jsonl, papers/core_set.csv]
C2:
title: Structure
mode: no_prose
required_skills: [taxonomy-builder, chapter-skeleton, section-bindings, section-briefs, outline-builder, section-mapper, outline-refiner, pipeline-router, human-checkpoint]
optional_skills: [outline-budgeter]
produces: [outline/taxonomy.yml, outline/chapter_skeleton.yml, outline/section_bindings.jsonl, outline/section_binding_report.md, outline/section_briefs.jsonl, outline/outline.yml, outline/mapping.tsv, outline/coverage_report.md, outline/outline_state.jsonl, output/REROUTE_STATE.json, DECISIONS.md]
human_checkpoint:
approve: scope + section skeleton + outline
write_to: DECISIONS.md
C3:
title: Evidence
mode: no_prose
required_skills: [pdf-text-extractor, paper-notes, subsection-briefs, chapter-briefs]
optional_skills: []
produces: [papers/fulltext_index.jsonl, papers/paper_notes.jsonl, papers/evidence_bank.jsonl, outline/subsection_briefs.jsonl, outline/chapter_briefs.jsonl]
C4:
title: Citations + evidence packs
mode: no_prose
required_skills: [citation-verifier, evidence-binder, evidence-draft, table-schema, anchor-sheet, table-filler, appendix-table-writer, schema-normalizer, writer-context-pack, evidence-selfloop, claim-matrix-rewriter]
optional_skills: [survey-visuals]
produces: [citations/ref.bib, citations/verified.jsonl, outline/evidence_bindings.jsonl, outline/evidence_binding_report.md, outline/table_schema.md, outline/tables_index.md, outline/tables_appendix.md, output/TABLES_APPENDIX_REPORT.md, outline/evidence_drafts.jsonl, outline/anchor_sheet.jsonl, output/SCHEMA_NORMALIZATION_REPORT.md, outline/writer_context_packs.jsonl, output/EVIDENCE_SELFLOOP_TODO.md, outline/claim_evidence_matrix.md]
C5:
title: Draft
mode: prose_allowed
required_skills: [front-matter-writer, chapter-lead-writer, subsection-writer, writer-selfloop, section-logic-polisher, argument-selfloop, paragraph-curator, style-harmonizer, opener-variator, evaluation-anchor-checker, transition-weaver, section-merger, post-merge-voice-gate, citation-diversifier, citation-injector, draft-polisher, global-reviewer, pipeline-auditor, artifact-contract-auditor]
optional_skills: [prose-writer, subsection-polisher, redundancy-pruner, terminology-normalizer, limitation-weaver, latex-scaffold, latex-compile-qa]
produces: [outline/transitions.md, sections/sections_manifest.jsonl, sections/h3_bodies.refined.ok, sections/paragraphs_curated.refined.ok, sections/style_harmonized.refined.ok, sections/opener_varied.refined.ok, sections/abstract.md, sections/S1.md, sections/S2.md, sections/discussion.md, sections/conclusion.md, output/WRITER_SELFLOOP_TODO.md, output/EVAL_ANCHOR_REPORT.md, output/ARGUMENT_SELFLOOP_TODO.md, output/SECTION_ARGUMENT_SUMMARIES.jsonl, output/ARGUMENT_SKELETON.md, output/PARAGRAPH_CURATION_REPORT.md, output/FRONT_MATTER_REPORT.md, output/CHAPTER_LEADS_REPORT.md, output/SECTION_LOGIC_REPORT.md, output/MERGE_REPORT.md, output/DRAFT.md, output/POST_MERGE_VOICE_REPORT.md, output/CITATION_BUDGET_REPORT.md, output/CITATION_INJECTION_REPORT.md, output/GLOBAL_REVIEW.md, output/AUDIT_REPORT.md, output/CONTRACT_REPORT.md]
---
# Pipeline: arXiv survey / review (MD-first)
Default contract (survey-grade, A150++):
- `queries.md` defaults are set for a *survey deliverable* (no silent downgrade): `core_size=300`, `per_subsection=28`, global unique citations hard floor `>=150` (recommended `>=165` when `core_size=300`; default `citation_target=recommended`).
- `draft_profile` controls **writing strictness** (`survey` vs `deep`), not “speed mode”.
- `evidence_mode` controls **evidence strength** (`abstract` default; `fulltext` optional and heavier).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- survey-seed-harvest
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- `queries.md` may specify `max_results` and a year-based `time_window` (`time_window.from` / `time_window.to`); `arxiv-search` will paginate and attach arXiv metadata (categories, arxiv_id, etc.) when online.
- If you import an offline export but later have network, you can set `enrich_metadata: true` in `queries.md` (or run `arxiv-search --enrich-metadata`) to backfill missing abstracts/authors/categories via arXiv `id_list`.
- Evidence-first expectation (A150++): aim for a large dedup pool (target >=1200, not ~200) and a stable, verifiable core set (`core_size=300`) so later stages can bind wide in-scope citation pools without forcing out-of-scope drift.
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- chapter-skeleton
- section-bindings
- section-briefs
- outline-builder
- section-mapper
- outline-refiner
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
human_checkpoint:
- approve: scope + section skeleton + outline
- write_to: DECISIONS.md
Notes:
- `chapter-skeleton` is the first retrieval-informed chapter contract; it stays chapter-level and does not emit stable H3 ids.
- `section-bindings` measures chapter saturation before H3 decomposition and writes PASS/BLOCKED signals into `outline/section_binding_report.md`.
- `section-briefs` turns the chapter layer into decomposition guidance plus subsection seeds; `outline-builder` then derives the first stable `outline/outline.yml` from that section layer.
- `outline-refiner` now writes both `outline/outline_state.jsonl` and `output/REROUTE_STATE.json`; section-first reroute/block information should be read from those artifacts instead of inferred from prose notes.
- Evidence-first expectation: each subsection should be written as a *question to answer* (RQ) plus *evidence needs* (what kind of citations/results are required), not just generic scaffold bullets.
- Coverage default: `section-mapper` uses `queries.md:per_subsection` as the per-H3 mapping contract (A150++ default: 28) so later evidence binding and writing have enough in-scope citations to choose from.
- Diversity expectation: mapping should not over-reuse a few papers across unrelated H3s; reserve “global” works for genuinely cross-cutting citations (controlled by `global_citation_min_subsections`).
- Budget policy (paper-like): avoid H3 explosion; the outline gate uses `queries.md:draft_profile` to set max H3 (survey<=10, deep<=12).
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent H3s into fewer, thicker units, then rerun `section-mapper` → `outline-refiner` before `Approve C2`.
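The diversity expectation above depends on knowing which bibkeys count as cross-cutting. A minimal sketch of that classification follows, assuming the mapping has already been loaded into a `{h3_id: [bibkey, ...]}` dictionary (the column layout of `outline/mapping.tsv` is not described here, and `global_bibkeys` is a hypothetical name).

```python
from collections import defaultdict


def global_bibkeys(mapping: dict[str, list[str]], min_subsections: int = 4) -> set[str]:
    """Return bibkeys mapped to at least `min_subsections` distinct H3s.

    `min_subsections` mirrors `global_citation_min_subsections`
    (A150++ default: 4).
    """
    seen: dict[str, set[str]] = defaultdict(set)
    for h3_id, keys in mapping.items():
        for key in keys:
            seen[key].add(h3_id)  # a key repeated inside one H3 counts once
    return {key for key, subs in seen.items() if len(subs) >= min_subsections}
```

Keys outside the returned set would then be treated as subsection-local when checking for over-reuse across unrelated H3s.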
## Stage 3 - Evidence (C3) [NO PROSE]
required_skills:
- pdf-text-extractor
- paper-notes
- subsection-briefs
- chapter-briefs
produces:
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
Notes:
- `queries.md` can set `evidence_mode: "abstract"|"fulltext"` (A150++ default: `abstract`).
- `queries.md` can set `draft_profile: "survey"|"deep"` to control writing gate strictness (A150++ default: `survey`).
- If `evidence_mode: "fulltext"`, `pdf-text-extractor` can be tuned via `fulltext_max_papers`, `fulltext_max_pages`, `fulltext_min_chars`.
- `subsection-briefs` converts each H3 into a verifiable writing card (scope_rule/rq/axes/clusters/paragraph_plan) so writing does not copy outline scaffolds.
- Optional refinement markers (recommended): treat briefs as *contracts*, not scaffolds. If you manually refine them and want to prevent regeneration, create:
- `outline/subsection_briefs.refined.ok`
- `outline/chapter_briefs.refined.ok`
These markers are used as explicit “reviewed/refined” signals and as a freeze switch (scripts won’t overwrite refined briefs).
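The freeze-switch behavior of the `*.refined.ok` markers could look roughly like this in a generator script. It is a sketch only: `write_unless_refined` is a hypothetical helper, and the marker name is derived from the artifact stem to match the two markers listed above.

```python
from pathlib import Path


def write_unless_refined(path: Path, content: str) -> bool:
    """Write `content` to `path` unless a sibling `<stem>.refined.ok` exists.

    Returns True when the file was written, False when the marker froze it.
    """
    marker = path.with_name(path.stem + ".refined.ok")
    if marker.exists():
        return False  # reviewed/refined by hand: do not regenerate
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return True
```

For `outline/subsection_briefs.jsonl` the helper checks `outline/subsection_briefs.refined.ok`, so creating that empty file is enough to stop later runs from overwriting the briefs.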
## Stage 4 - Citations + evidence packs (C4) [NO PROSE]
required_skills:
- citation-verifier
- evidence-binder
- evidence-draft
- table-schema
- anchor-sheet
- table-filler
- appendix-table-writer
- schema-normalizer
- writer-context-pack
- evidence-selfloop
- claim-matrix-rewriter
optional_skills:
- survey-visuals
produces:
- citations/ref.bib
- citations/verified.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- output/SCHEMA_NORMALIZATION_REPORT.md
- outline/writer_context_packs.jsonl
- output/EVIDENCE_SELFLOOP_TODO.md
- outline/claim_evidence_matrix.md
Notes:
- `evidence-draft` turns paper notes into per-subsection evidence packs (claim candidates + concrete comparisons + eval protocol + limitations) that the writer must follow.
- `claim-matrix-rewriter` makes `outline/claim_evidence_matrix.md` a projection/index of evidence packs (not an outline expansion), so writer guidance stays evidence-first.
- `writer-context-pack` builds a deterministic per-H3 drafting pack (briefs + evidence + anchors + allowed cites), reducing hollow writing and making C5 more debuggable.
- Tables are part of the default survey deliverable, but split into two layers:
- `outline/tables_index.md` (internal index; produced by `table-filler`; useful for planning/debugging; NOT inserted into the paper)
- `outline/tables_appendix.md` (reader-facing; produced by `appendix-table-writer`; clean/publishable; inserted into the draft as an Appendix block by `section-merger`)
- Optional: `survey-visuals` can still produce timeline/figure specs as intermediate artifacts.
- Optional refinement markers (recommended): after you spot-check/refine C4 artifacts and want to freeze them, create:
- `outline/evidence_bindings.refined.ok`
- `outline/evidence_drafts.refined.ok`
- `outline/anchor_sheet.refined.ok`
- `outline/writer_context_packs.refined.ok`
These markers make “reviewed/refined” explicit and prevent accidental regeneration/overwrite; strict mode relies on content checks (placeholders/blocking_missing/scope), not marker presence.
## Stage 5 - Draft (C5) [PROSE AFTER C2]
required_skills:
- front-matter-writer
- chapter-lead-writer
- subsection-writer
- writer-selfloop
- section-logic-polisher
- argument-selfloop
- paragraph-curator
- style-harmonizer
- opener-variator
- evaluation-anchor-checker
- transition-weaver
- section-merger
- post-merge-voice-gate
- citation-diversifier
- citation-injector
- draft-polisher
- global-reviewer
- pipeline-auditor
- artifact-contract-auditor
optional_skills:
- prose-writer
- subsection-polisher
- redundancy-pruner
- terminology-normalizer
- limitation-weaver
- latex-scaffold
- latex-compile-qa
produces:
- sections/sections_manifest.jsonl
- sections/abstract.md
- sections/discussion.md
- sections/conclusion.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/MERGE_REPORT.md
- output/DRAFT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/GLOBAL_REVIEW.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
Notes:
- C5 writing system (semantic + minimal artifacts; no extra machinery):
- **Unit of work**: `sections/*.md` (front matter, H2 leads, H3 bodies). Avoid editing `output/DRAFT.md` directly until after merge.
- **Single source of truth (locked definitions)**: `output/ARGUMENT_SKELETON.md` → `## Consistency Contract` (terminology, scope boundary, evaluation protocol fields, baseline naming).
- **Write → check → fix (five gates)**:
1) `writer-selfloop` → `output/WRITER_SELFLOOP_TODO.md`: file existence, depth, citation scope, paper voice.
2) `section-logic-polisher` → `output/SECTION_LOGIC_REPORT.md`: paragraph linkage (no jump cuts / “paragraph islands”).
3) `argument-selfloop` → `output/ARGUMENT_SELFLOOP_TODO.md` + `output/ARGUMENT_SKELETON.md` + `output/SECTION_ARGUMENT_SUMMARIES.jsonl`: section-level closure + premise/definition stability.
4) `paragraph-curator` → `output/PARAGRAPH_CURATION_REPORT.md`: **select → evaluate → subset → fuse** so sections converge (reduce redundancy, strengthen synthesis) without changing citation keys.
5) `evaluation-anchor-checker` → `output/EVAL_ANCHOR_REPORT.md`: final section-level numeric hygiene sweep so surviving numeric claims keep same-sentence task/metric/constraint context before merge.
- **Openers-last**: draft the middle first; rewrite paragraph 1 last so it reflects real content (front matter + H3).
- Writing self-loop gate: `subsection-writer` ensures the full `sections/` file set exists (and emits `sections/sections_manifest.jsonl`); `writer-selfloop` blocks until depth/citation-scope/paper-voice checks pass, writing `output/WRITER_SELFLOOP_TODO.md` (PASS/FAIL).
- Argument self-loop gate: `argument-selfloop` blocks “smooth but hollow” sections by making the argument chain explicit (per-section paragraph moves + a global dependency skeleton). Its ledgers are intermediate artifacts and must never be merged into the paper.
- Style hygiene (C5 hard gate for `survey`/`deep`): treat `output/WRITER_SELFLOOP_TODO.md` Style Smells as mandatory fixes. Run `style-harmonizer` + `opener-variator` on flagged files, then rerun `writer-selfloop` before merge.
- Numeric hygiene gate: run `evaluation-anchor-checker` after the last section-level rewrites (`paragraph-curator` + `style-harmonizer` + `opener-variator`) and before merge. If later section rewrites touch the same H3 files, rerun it; do not wait for `pipeline-auditor` to discover underspecified numbers in the merged draft.
- Micro-fix routing (preferred over broad rewrites): if Style Smells are specific, use targeted micro-skills before a general harmonize pass:
- opener cadence / “overview” narration → `opener-variator`
- count-based limitation slots (“Two limitations…”) → `limitation-weaver`
- underspecified numeric/performance claims (missing task/metric/budget) → `evaluation-anchor-checker`
- Triage rule (prevents “patching evidence holes with prose”): if `writer-selfloop` FAILs because a subsection cannot meet `must_use` *in-scope* (thin packs / missing anchors / out-of-scope citation pressure), stop and rerun the evidence loop (`evidence-selfloop` + upstream C2/C3/C4) instead of padding prose.
- WebWeaver-style “planner vs writer” split (single agent, two passes):
- Planner pass: for each section/subsection, pick the exact citation IDs to use from the evidence bank (`outline/evidence_drafts.jsonl`) and keep scope consistent with the outline.
- Writer pass: write that section using only those citation IDs; avoid dumping the whole notes set into context.
- Treat this stage as an iteration loop: draft per H3 → logic-polish (thesis + connectors) → weave transitions → merge → de-template/cohere → global review → (if gaps) back to C3/C4 → regenerate.
- Post-merge voice gate: `post-merge-voice-gate` treats `outline/transitions.md` as a high-frequency injection source. If it FAILs, fix the *source* (usually transitions via `transition-weaver`, or the owning `sections/*.md`) and re-merge; do not “patch around it” in `draft-polisher`.
- Depth target (profile-aware): each H3 should be “few but thick” (avoid stubs). Use `queries.md:draft_profile` as the contract:
- `survey`: >=10 paragraphs + >=12 unique cites
- `deep`: >=11 paragraphs + >=14 unique cites
In all profiles, require >=2 concrete contrasts + evaluation anchoring + a cross-paper synthesis paragraph + an explicit limitation.
- Profile semantics: `survey` is the default deliverable contract; `deep` is stricter (and typically pairs well with `evidence_mode: fulltext`).
- Coherence target (paper-like): for every H2 chapter with H3 subsections, write a short **chapter lead** block (`sections/S<sec_id>_lead.md`) that previews the comparison axes and how the H3s connect (no new headings; avoid generic glue).
- Anti-template style contract (paper-like, not “outline narration”):
- Avoid meta openers like “This subsection surveys/argues …” and slide-like navigation (“Next, we move from … / We now turn to …”).
- Keep signposting light: avoid repeating a literal opener label across many subsections (e.g., `Key takeaway:`); vary opener phrasing and cadence.
- Tone target: calm, academic, understated; delete hype words (`clearly`, `obviously`) and “PPT speaker notes”.
- Keep evidence-policy disclaimers **once** in front matter (not repeated across H3s).
- If you cite numbers, include minimal evaluation context (task + metric + constraint/budget/cost) in the same paragraph.
- Citation shape must be reader-facing: no adjacent citation blocks (e.g., `[@a] [@b]`), no duplicate keys in one block (e.g., `[@a; @a]`), and avoid tail-only citation style by keeping mid-sentence citations in each H3.
- `section-merger` merges `sections/*.md` plus `outline/transitions.md` (within-chapter H3→H3 by default). Between-H2 transition insertion is optional: create `outline/transitions.insert_h2.ok` in the workspace if you want narrator-style handoffs included.
- Tables are part of the default deliverable: `outline/tables_appendix.md` is inserted into the draft by `section-merger` as a single Appendix block (index tables in `outline/tables_index.md` remain intermediate) unless `outline/tables.insert.off` exists. Other visuals (`outline/timeline.md`, `outline/figures.md`) remain intermediate by default.
- Citation scope policy: citations are subsection-first (from `outline/evidence_bindings.jsonl`), with limited reuse allowed within the same H2 chapter to reduce brittleness; avoid cross-chapter “free cite” drift.
- Controlled flexibility: bibkeys mapped to >= `queries.md:global_citation_min_subsections` subsections (A150++ default: 4) are treated as cross-cutting/global; see `allowed_bibkeys_global` in writer packs / `sections_manifest.jsonl`.
- If global unique citations are low, run `citation-diversifier` → `citation-injector` *before* `draft-polisher` (the polisher treats citation keys as immutable).
- `queries.md` can set `citation_target: recommended|hard` to control whether the recommended target is enforced as blocking (default: `recommended` for A150++).
- If you intentionally add/remove citations after an earlier polish run, reset the citation-anchoring baseline before rerunning `draft-polisher`:
- delete `output/citation_anchors.prepolish.jsonl` (workspace-local), then rerun `draft-polisher`.
- Recommended skills (toolkit, not a rigid one-shot chain):
- Modular drafting: `subsection-writer` → `writer-selfloop` → `section-logic-polisher` → `argument-selfloop` → `paragraph-curator` → `style-harmonizer` → `opener-variator` → `evaluation-anchor-checker` → `transition-weaver` → `section-merger` → `draft-polisher` → `global-reviewer` → `pipeline-auditor`.
- Legacy one-shot drafting: `prose-writer` (kept for quick experiments; less debuggable).
- If the draft reads like “paragraph islands”, run `section-logic-polisher` and patch only failing `sections/S*.md` until PASS, then merge.
- Add `pipeline-auditor` after `global-reviewer` as a regression test (blocks on ellipsis, repeated boilerplate, and citation hygiene).
- If you also need a PDF deliverable, use `latex-scaffold` + `latex-compile-qa` (see `arxiv-survey-latex`).
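The citation-shape rules in the style contract above (no adjacent citation blocks, no duplicate keys within one block) are mechanical enough to lint. The sketch below is one possible check, not the auditor's actual implementation; `citation_shape_issues` is a hypothetical name and the `[@a; @b]` citation syntax is assumed.

```python
import re

CITE_BLOCK = re.compile(r"\[(@[^\[\]]+)\]")
ADJACENT = re.compile(r"\[@[^\[\]]+\]\s*\[@[^\[\]]+\]")


def citation_shape_issues(text: str) -> list[str]:
    """Report adjacent citation blocks and duplicate keys inside one block."""
    issues: list[str] = []
    for match in ADJACENT.finditer(text):
        issues.append(f"adjacent citation blocks: {match.group(0)}")
    for match in CITE_BLOCK.finditer(text):
        keys = [k.strip().lstrip("@") for k in match.group(1).split(";") if k.strip()]
        dupes = sorted({k for k in keys if keys.count(k) > 1})
        if dupes:
            issues.append(f"duplicate keys in {match.group(0)}: {', '.join(dupes)}")
    return issues
```

An empty list means the text satisfies both rules; merging `[@a] [@b]` into `[@a; @b]` clears the first kind of issue.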
## Quality gates (strict mode)
- Citation coverage: expect a large, verifiable bibliography (A150++ default: `core_size=300` → `ref.bib` ~300) and high cite density:
- Per-H3: `survey` profile expects >=12 unique citations per H3 (`deep` expects >=14).
- Front matter: `survey` profile expects Introduction>=35 and Related Work>=50 unique citations (dense positioning; no cite dumps).
- Global: `pipeline-auditor` gates on **global unique citations across the full draft**. A150++ defaults: hard `>=150`; recommended `>=165` (when bib=300). `queries.md:citation_target` controls which is blocking (default: `recommended`). If it fails, prefer `citation-diversifier` → `citation-injector` (in-scope, NO NEW FACTS) using each H3’s `allowed_bibkeys_selected` / `allowed_bibkeys_mapped` from `outline/writer_context_packs.jsonl`.
- Anti-template: drafts containing ellipsis placeholders (`…`) or leaked scaffold instructions (e.g., "enumerate 2-4 ...") should block and be regenerated from improved outline/mapping/evidence artifacts.
- Final polish hard gates (`survey`/`deep`): block on narration-template openers (e.g., `This subsection ...`), slide navigation phrasing, repeated opener stems/verbal tics, adjacent citation blocks (`[@a] [@b]`), duplicate keys in one block (`[@a; @a]`), and low H3 mid-sentence citation ratio (<30%).
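The global unique-citation gate described above can be approximated as follows. This is a rough sketch under stated assumptions: `unique_citation_keys` and `citation_gate` are hypothetical names, citations are assumed to use `[@key]` syntax, and the thresholds default to the A150++ values (hard `150`, recommended `165`), with `citation_target` selecting which one blocks.

```python
import re


def unique_citation_keys(md: str) -> set[str]:
    """Collect the distinct citation keys used anywhere in the draft."""
    keys: set[str] = set()
    for block in re.findall(r"\[([^\[\]]*@[^\[\]]*)\]", md):
        keys.update(re.findall(r"@([A-Za-z0-9_:.\-]+)", block))
    return keys


def citation_gate(md: str, citation_target: str = "recommended",
                  hard_floor: int = 150, recommended: int = 165) -> dict:
    """Return PASS/FAIL for the global unique-citation threshold."""
    required = recommended if citation_target == "recommended" else hard_floor
    count = len(unique_citation_keys(md))
    return {
        "status": "PASS" if count >= required else "FAIL",
        "unique": count,
        "required": required,
    }
```

The gap `required - unique` is the number of additional in-scope keys that `citation-diversifier` → `citation-injector` would need to place.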
FILE:pipelines/graduate-paper-pipeline.md
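# Pipeline: Chinese graduate thesis restructuring and finalization (research-stage draft)
> Status: research-stage draft
> Current positioning: study the workflow and the skills it needs first; **do not bind UNITS yet, and do not force it into an executable contract**
> Audience: Chinese graduate thesis writing that starts from an existing university template, prior papers, Overleaf sources, PDFs, BibTeX, experiment figures and tables, and intermediate working documents
> Core goal: restructure the “existing materials” into a Chinese graduate thesis with a unified main line, smooth structure, complete evidence, consistent terminology, and a compilable deliverable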
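## 1. What this Pipeline really is
This is neither a linear “write a thesis from scratch” workflow nor an assembly workflow that “stitches several papers together”. It is a thesis-engineering Pipeline that is **driven by a question list, centered on a Markdown intermediate layer, and closed out in the TeX delivery layer**.
Its actual main loop is:
> `question list -> semantic localization -> Markdown restructuring -> TeX write-back -> compile review -> question write-back -> next round`
Therefore, the first goal of this Pipeline is not to “quickly produce a version of the body text”, but to get the following right first:
1. Recover the existing materials first, rather than editing blindly in `tex`.
2. Restructure chapter roles around the thesis main line first, rather than copying the narrative of the original papers.
3. Work out structure, evidence, terminology, and figures/tables in the Markdown intermediate layer first, then write back to TeX.
4. Resolve structure and evidence problems first, then style and “AI flavor” removal.
5. The real convergence point is not “finishing one pass”, but “the question list is emptied round by round, and compilation and review pass stably”.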
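## 2. The fundamental difference between this Pipeline and the survey Pipeline
It differs substantially from `arxiv-survey` / `arxiv-survey-latex` and must be modeled separately:
- the survey pipeline starts from “retrieving literature by topic and converging the structure layer by layer”;
- the graduate-paper pipeline starts from “inventorying and restructuring existing materials”;
- the core intermediate artifacts of the survey pipeline are `outline/evidence_drafts/...`;
- the core intermediate artifacts of the graduate-paper pipeline are the outline, question list, chapter intermediate drafts, figure/table plan, and review records under `codex_md/`;
- the main goal of the survey pipeline is to “generate a survey”;
- the main goal of the graduate-paper pipeline is to “restructure existing research work into a finalized thesis that conforms to Chinese degree-thesis conventions”.
In one sentence:
> survey is more like an “evidence-writing workflow driven by retrieval”;
> graduate-paper is more like a “thesis-engineering restructuring workflow driven by existing materials”.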
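## 3. Workspace layering and key artifacts
This Pipeline needs a clear separation between the **thinking layer** and the **delivery layer**.
### 3.1 Directory responsibilities
- `main.tex`: the thesis entry point; assembles the cover, abstract, table of contents, body, appendices, acknowledgements, achievements, and references.
- `chapters/`: final deliverable body `.tex` files.
- `abstract/`, `preface/`, `acknowledgement/`, `achievements/`: formal deliverable parts outside the body.
- `references/`: bibliography database and style files.
- `pdf/`: PDFs of published or submitted papers.
- `Overleaf_ref/`: existing source drafts, revision drafts, and supplementary materials.
- `codex_md/`: the intermediate working layer; this is the **thesis restructuring workspace**.
- `claude_md/`: review checklists and final-draft verification records.
- `tmp_layout/`, `tmp_layout2/`: figure/table trial layouts and page-layout previews.
The most important distinction is:
- `codex_md/` is the **thinking area / restructuring area**
- `chapters/` is the **delivery area / finalization area**
Do not mix the two.
### 3.2 Intermediate artifacts recommended to keep fixed
It is recommended to keep the following artifacts fixed for the long term in this Pipeline:
- `codex_md/material_index.md`: inventory and index of existing materials
- `codex_md/material_readiness.md`: material readiness and gap hints
- `codex_md/missing_info.md`: list of missing information
- `codex_md/question_list.md`: this round's question list, priorities, and acceptance criteria
- `codex_md/00_thesis_outline.md`: thesis main line, outline, and chapter roles
- `codex_md/chapter_role_map.md`: mapping table of source material -> chapter -> role
- `codex_md/chapter_rewrite_rules.md`: table of chapter restructuring rules
- `codex_md/terminology_glossary.md`: unified terminology table
- `codex_md/symbol_metric_table.md`: unified symbol / metric table
- `codex_md/figure_plan.md`: figure/table and layout plan
- `claude_md/review_checklist.md`: review checklist for compilation, typesetting, data, and template items
- `output/THESIS_BUILD_REPORT.md`: compilation and final-draft check report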
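## 4. The skills this Pipeline actually needs
For now, reuse and generalization are set aside; the skills are split according to what this thesis Pipeline itself needs. The first version recommends the following skills.
### 4.1 Main-chain skills
1. `thesis-workspace-init`
Responsibility: initialize the thesis project, remind the user to place materials, set up the workspace, check whether `main.tex` compiles, and generate the material inventory.
2. `thesis-question-list`
Responsibility: create and maintain `question_list.md`, fixing each round's revision goals, question priorities, boundaries, and acceptance criteria.
This is the **control-plane skill** of the whole Pipeline.
3. `thesis-source-role-mapper`
Responsibility: map existing papers / templates / source drafts / PDFs / figure and table materials to “thesis roles”, rather than only doing `paper -> chapter`.
4. `thesis-chapter-reconstructor`
Responsibility: restructure each chapter's goal, weight, transitions, and narrative approach around the thesis main line.
This is the **core restructuring skill** of the whole Pipeline.
5. `thesis-markdown-aligner`
Responsibility: unify the main line, terminology, symbols, metrics, and figure/table conventions, so that the chapters converge in the Markdown intermediate layer into one thesis rather than a set of papers.
6. `thesis-tex-writeback`
Responsibility: write content that has already been straightened out in the Markdown layer back to `chapters/*.tex`, and synchronize figures/tables, formulas, cross-references, and chapter transitions.
7. `thesis-compile-review`
Responsibility: run compilation, warning triage, template-mode checks, data-convention verification, and question write-back.
It forms the final quality loop.
8. `thesis-style-polisher`
Responsibility: once structure, evidence, and data are stable, polish toward Chinese degree-thesis style and remove “AI flavor”.
### 4.2 Parallel side-branch skills
9. `thesis-visual-layout-planner`
Responsibility: figure/table planning, Mermaid sketches, temporary page composition, and text-figure rhythm checks.
10. `thesis-frontmatter-sync`
Responsibility: synchronize the abstract, cover, appendices, achievements, acknowledgements, lists, and other non-body parts, so that they are not left until the very end.
11. `thesis-citation-enhance-review`
Responsibility: locate sentences that must be supported by citations, expand candidate references, verify that citations match claims, and write back to `references/*.bib` and in-text citations.
### 4.3 How these 11 skills relate
The main chain is:
`thesis-workspace-init -> thesis-question-list -> thesis-source-role-mapper -> thesis-chapter-reconstructor -> thesis-markdown-aligner -> thesis-tex-writeback -> thesis-compile-review -> thesis-style-polisher`
The parallel side branches are:
- `thesis-visual-layout-planner`
- `thesis-frontmatter-sync`
- `thesis-citation-enhance-review`
Among them:
- `thesis-question-list` is the control plane for the whole workflow
- `thesis-chapter-reconstructor` is the core restructuring plane
- `thesis-compile-review` is the quality-loop plane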
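## 5. Pipeline overview (by stage)
| Stage | Goal | Core skills | Main outputs |
|---|---|---|---|
| Stage 0 | Project initialization and material intake | `thesis-workspace-init` | Directory skeleton, initial compile, material index, material readiness hints |
| Stage 1 | Recover existing materials into the Markdown intermediate layer | `thesis-workspace-init` | Initial Markdown materials, missing-information list, material gap hints |
| Stage 1.5 | Lock this round's questions and boundaries | `thesis-question-list` | `question_list.md` |
| Stage 2 | Map source materials to thesis roles | `thesis-source-role-mapper` | Chapter role map, materials grouped by chapter |
| Stage 2.5 | Restructure chapters around the main line | `thesis-chapter-reconstructor` | Restructured chapter Markdown |
| Stage 3 | Align, merge, and unify the whole-thesis structure | `thesis-markdown-aligner` | Stable outline, terminology table, symbol table, evidence-gap list |
| Stage 3.5 | Figure/table and layout preview | `thesis-visual-layout-planner` | Figure/table plan, figure-text mapping, temporary layout results |
| Stage 4 | Write back to the TeX delivery layer | `thesis-tex-writeback` | Compilable chapter `.tex`, first complete `main.pdf` |
| Stage 4.5 | Synchronize non-body parts | `thesis-frontmatter-sync` | Synchronized versions of the abstract, appendices, cover, achievements, etc. |
| Stage 5 | Citation enhancement and verification | `thesis-citation-enhance-review` | Citation reinforcement results, verification records |
| Stage 6 | Compilation, typesetting, and final-draft review | `thesis-compile-review` | `THESIS_BUILD_REPORT`, review checklist |
| Stage 7 | Final polish and “AI flavor” removal | `thesis-style-polisher` | A more natural Chinese final draft |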
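## 6. Stage-by-stage details
### Stage 0: Project initialization and material intake
**Goal**
First turn the thesis into a maintainable project rather than a pile of scattered files.
**Main inputs**
- University template
- Existing repository / source manuscript
- Basic metadata such as student ID, year, and Chinese and English titles
- Existing `main.tex`
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Base directory skeleton
- Initial `main.tex` / initial `main.pdf`
- `codex_md/material_index.md`
- `codex_md/material_readiness.md`
**Handoff**
- If `main.tex` does not compile yet, do not enter the body-restructuring stages
- If the material is not yet filed in place, do not start the question list or chapter restructuring
**Execution focus**
- Make clear which areas hold material, which are working areas, and which are delivery areas
- Ensure "it compiles" first, then worry about "it reads well"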
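### Stage 1: Restore existing material into the Markdown intermediate layer
**Goal**
Recover the reusable content from the existing `template / tex / Overleaf / PDF / bib` sources into a Markdown intermediate layer, instead of forcing edits directly in the TeX layer.
**Main inputs**
- University template
- Existing `.tex`
- Overleaf source manuscript
- PDF
- Reference library
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Initial Markdown chapter material
- `codex_md/material_readiness.md`
- `codex_md/missing_info.md`
- Material index and list of information still to be supplied
**Handoff**
- Stage 1 outputs feed directly into `thesis-question-list` and `thesis-source-role-mapper`
**Execution focus**
- The point is a "material asset inventory", not "writing pretty sentences first"
- Explicitly mark pending experiment details, pending figure captions, pending citations, and pending number checks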
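### Stage 1.5: Lock the question list and this round's goals
**Goal**
Set up this round's control plane: make explicit what this round fixes and what it does not.
**Main inputs**
- Initial Markdown material
- Review feedback from the previous round
- Current compile / structure / style issues
**Core skill**
- `thesis-question-list`
**Main outputs**
- `codex_md/question_list.md`
**Handoff**
- This step is the starting point for all later restructuring and write-back
- Any structural dispute should first come back here to be re-prioritized
**Execution focus**
- Every issue must state clearly: what it is, why it matters, how to fix it, and what level of acceptance is required
- The question list is not a memo; it is the entry point for iteration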
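The four required parts of each issue (what, why, how, acceptance) can be sketched as a small record with a completeness check. This is a minimal sketch: the class name, field names, and helper are hypothetical, since the source does not define a schema for `question_list.md`.

```python
from dataclasses import dataclass, fields


@dataclass
class QuestionItem:
    """One question-list entry (hypothetical schema, not taken from the source)."""

    what: str        # what the issue is
    why: str         # why it matters
    how: str         # how to fix it
    acceptance: str  # what level of acceptance is required


def missing_fields(item: QuestionItem) -> list[str]:
    """Return the names of fields that are empty or whitespace-only."""
    return [f.name for f in fields(item) if not getattr(item, f.name).strip()]
```

An entry with any missing field would be sent back for completion before it is allowed to drive a round of restructuring.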
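### Stage 2: Map source material to thesis roles
**Goal**
Map the existing material onto chapter roles in the thesis, rather than mechanically assigning "paper to chapter".
**Main inputs**
- Existing papers, PDFs, source manuscripts, figures and tables
- Initial Markdown material
- `question_list.md`
**Core skill**
- `thesis-source-role-mapper`
**Main outputs**
- `codex_md/chapter_role_map.md`
- Markdown material grouped by chapter
- Content boundaries for each chapter
**Handoff**
- This is the direct input to `thesis-chapter-reconstructor`
**Execution focus**
- Distinguish: method core / background support / validation evidence / system implementation / material for limitations and future work
- If this mapping is wrong, the whole thesis will be skewed afterwards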
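### Stage 2.5: Restructure chapters around the main line
**Goal**
Transform the original paper-style narrative into a thesis-style narrative.
**Main inputs**
- Chapter role map
- `question_list.md`
- Markdown drafts of each chapter
**Core skill**
- `thesis-chapter-reconstructor`
**Main outputs**
- Restructured chapter Markdown
- `codex_md/chapter_rewrite_rules.md`
**Handoff**
- This is the core stage of the whole Pipeline
- All later alignment, write-back, and compilation build on the restructuring done here
**Execution focus**
- Make clear which question each chapter must answer
- Decide which content to strengthen, weaken, move up, or push down
- Rewrite the transitions to the preceding and following chapters
- Turn the "selling-point narrative" into a "thesis main-line narrative"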
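### Stage 3: Markdown alignment, merging, and unification
**Goal**
Make the whole thesis become "one thesis" in the intermediate layer first, rather than several papers stitched together.
**Main inputs**
- Restructured chapter Markdown
- `question_list.md`
**Core skill**
- `thesis-markdown-aligner`
**Main outputs**
- Stable `codex_md/00_thesis_outline.md`
- `codex_md/terminology_glossary.md`
- `codex_md/symbol_metric_table.md`
- Evidence-gap list
**Handoff**
- While the structure, terminology, or metrics are unstable, do not enter `tex-writeback`
**Execution focus**
- Unify the research main line, terminology, abbreviations, notation, metrics, and figure/table conventions
- Decide which content is explained once in Chapter 2, which is developed in Chapters 3-5, and which goes into the appendices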
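### Stage 3.5: Figure/table and layout rehearsal
**Goal**
Plan figures, tables, and layout before the body is finalized, so that figures do not become a last-minute patch job.
**Main inputs**
- Stable outline
- Chapter Markdown
- Existing figures, tables, and experiment results
**Core skill**
- `thesis-visual-layout-planner`
**Main outputs**
- `codex_md/figure_plan.md`
- `mermaid/` diagram drafts
- `tmp_layout/`, `tmp_layout2/` trial layout results
**Handoff**
- The figure-planning results directly shape the figure-text structure in `tex-writeback`
**Execution focus**
- Make clear which figures and tables each chapter needs to support the main line
- Sketch first, then do a temporary layout, then decide final placement and captions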
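### Stage 4: Write back to the TeX delivery layer
**Goal**
Write the content already straightened out in the Markdown layer back into `chapters/*.tex` and move into the delivery layer.
**Main inputs**
- Stable Markdown chapters
- Figure plan
- Unified terminology / symbol / metric tables
**Core skill**
- `thesis-tex-writeback`
**Main outputs**
- `chapters/*.tex`
- First complete `main.pdf`
**Handoff**
- If a structural problem surfaces at this stage, as a rule fix it in the Markdown layer first; do not reinvent the structure in the TeX layer
**Execution focus**
- Sync figures, tables, equations, cross-references, chapter-opening introductions, and chapter-end summaries
- The TeX layer is responsible for delivery, not for rethinking structure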
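### Stage 4.5: Sync non-body parts
**Goal**
Maintain the abstract, appendices, cover, achievements, acknowledgements, and other non-body parts in parallel, so they are not left to the end and turned into low-quality filler.
**Main inputs**
- Stable body version
- University template items
- Chinese and English title and abstract information
**Core skill**
- `thesis-frontmatter-sync`
**Main outputs**
- `abstract/abstract.tex`
- `abstract/abstract-en.tex`
- `preface/...`
- `appendix/...`
- `acknowledgement/...`
- `achievements/...`
**Handoff**
- The final review in Stage 6 checks these parts together with the body
**Execution focus**
- Chinese and English titles and abstracts are consistent with each other
- Numbers, metrics, and method names in the abstract match the body
- Appendices hold only supplementary content and do not repeat the main line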
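### Stage 5: Literature enhancement and citation verification
**Goal**
Systematically fill in "sentences that should be backed by a citation", and make sure each citation matches its claim.
**Main inputs**
- Current Markdown / TeX version
- `references/*.bib`
- Existing PDFs / reference works
**Core skill**
- `thesis-citation-enhance-review`
**Main outputs**
- Enhanced reference library
- Citation reinforcement records
- Citation verification records
**Handoff**
- Literature enhancement can run in parallel with Stages 3 and 4, but it must be closed out before the final polish
**Execution focus**
- Prioritize sentences that must be backed by citations: factual descriptions, method taxonomies, classic conclusions, metric definitions, and data sources
- Get the citations right first, then add more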
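### Stage 6: Compilation, typesetting, and final review
**Goal**
Close the quality loop: not just "it compiles", but "it is fit to submit".
**Main inputs**
- Complete TeX project
- Synced non-body parts
- review checklist
**Core skill**
- `thesis-compile-review`
**Main outputs**
- `output/THESIS_BUILD_REPORT.md`
- `claude_md/review_checklist.md`
- List of closed / still-open issues
**Handoff**
- The issues found in Stage 6 determine which layer to return to for the fix: Stage 1.5 / 2.5 / 3 / 4
**Execution focus**
- A passing compile is only the minimum requirement
- Warnings must be triaged: blocks submission / must fix / may be logged and watched
- Also check: data consistency, cite-key validity, correct template parameters, and consistent terminology and abbreviations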
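The three-level warning triage can be sketched as a small rule table applied to lines of the LaTeX log. The message patterns below are common LaTeX log messages, but which message lands in which tier is an assumption made for illustration; the source names only the three tiers, and the tier labels and function name here are hypothetical.

```python
import re

# Illustrative rules only: the tier assigned to each message is an assumption,
# not something the source specifies. Tiers mirror the three levels above.
TRIAGE_RULES = [
    ("block", re.compile(r"Citation .* undefined|Reference .* undefined")),
    ("must_fix", re.compile(r"multiply defined|Overfull \\hbox")),
]


def triage(log_line: str) -> str:
    """Return 'block', 'must_fix', or 'observe' for one log line."""
    for tier, pattern in TRIAGE_RULES:
        if pattern.search(log_line):
            return tier
    # Anything unmatched is only logged and watched.
    return "observe"
```

Keeping the rules in a plain list makes it easy to move a message between tiers as the project's tolerance changes.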
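### Stage 7: Final polish and AI-flavor removal
**Goal**
Only after structure, evidence, and data are stable, polish the text in Chinese degree-thesis style.
**Main inputs**
- The thesis version that has passed compilation and review
- `中文写作要求.md`
- `GPT口癖与高频用词调研.md`
**Core skill**
- `thesis-style-polisher`
**Main outputs**
- A more natural Chinese final draft
- De-templated introductions, summaries, and conclusion and outlook
**Handoff**
- Stage 7 must come last
- If polishing exposes structural or evidence problems, fall back to an earlier stage instead of polishing over them
**Execution focus**
- Handle chapter-opening introductions, chapter-end summaries, contribution statements, and the conclusion and outlook first
- Tone down template-speak, promotional tone, and common AI verbal tics
- The goal is "reads more like a degree thesis", not "reads more like an AI-optimized paper"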
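## 7. The four fixed loops of this Pipeline
To keep it from degrading into a "linear SOP" again, fix the following four loops in place:
### 7.1 Structure loop
`outline -> a chapter's md -> chapter imbalance found -> back to outline / question_list`
### 7.2 Content loop
`paper / Overleaf / PDF -> extract content -> restructure narrative -> still reads like the original paper -> restructure again`
### 7.3 Typesetting loop
`tex -> compile -> warning / figure placement / cross-reference problems found -> back to tex or figure planning`
### 7.4 Style loop
`body stable -> AI polish -> template-speak / AI flavor found -> back to writing guidelines and wording control`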
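## 8. Execution priority of this Pipeline
If resources are limited in an actual run, the priority should be:
1. `thesis-question-list`
2. `thesis-source-role-mapper`
3. `thesis-chapter-reconstructor`
4. `thesis-markdown-aligner`
5. `thesis-tex-writeback`
6. `thesis-compile-review`
In other words:
- What truly cannot be skipped is "question list + source role mapping + chapter restructuring + compile review"
- What truly should not be done earliest is "polishing and AI-flavor removal"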
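## 9. Current conclusion
The first principle of this graduate-paper pipeline should be:
> **Restructure first, then write; converge in the intermediate layer first, then deliver in TeX; get evidence and structure right first, then style and polish.**
So if this Pipeline is later made formally executable, the thing most worth implementing first is not "one big all-in-one runner", but getting the following skills solid:
- `thesis-workspace-init`
- `thesis-question-list`
- `thesis-source-role-mapper`
- `thesis-chapter-reconstructor`
- `thesis-compile-review`
Once these five skills are solid, this Chinese graduation-thesis Pipeline has real practical value.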
FILE:pipelines/idea-brainstorm.pipeline.md
---
name: idea-brainstorm
version: 4.0
profile: idea-brainstorm
routing_hints: [idea, ideation, brainstorm, 点子, 选题, 找方向, 找 idea]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.idea-brainstorm.csv
contract_model: pipeline.frontmatter/v1
query_defaults:
  draft_profile: idea_brainstorm
  max_results: 1800
  core_size: 100
  evidence_mode: abstract
  direction_pool_min: 12
  direction_pool_max: 24
  idea_screen_top_n: 10
  idea_shortlist_size: 5
  report_top_n: 3
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- evidence_mode
- direction_pool_min
- direction_pool_max
- idea_screen_top_n
- idea_shortlist_size
- report_top_n
- time_window.from
- time_window.to
quality_contract:
  memo_bundle:
    required_outputs: [output/REPORT.md, output/APPENDIX.md, output/REPORT.json]
  signal_policy:
    min_rows: 10
  direction_policy:
    shortlist_min: 3
    shortlist_max: 5
    keep_min: 3
    cluster_diversity_min: 2
    lead_diversity_target: 3
    lead_diversity_axes: [cluster, direction_type, program_kind]
  screening_policy:
    keep_rank_max: 7
    maybe_rank_max: 12
    score_weights:
      discussion_worthiness: 0.24
      academic_value: 0.22
      evidence_grounding: 0.18
      direction_distinctness: 0.16
      first_probe_clarity: 0.10
      thesis_potential: 0.10
loop_policy:
  stage_retry_budget:
    C1: 2
    C3: 1
    C4: 1
  max_reroutes: 4
  require_human_on_retry_after_approval: true
stages:
  C0:
    title: Init + idea brief
    mode: no_prose
    required_skills: [workspace-init, pipeline-router, idea-brief, human-checkpoint]
    optional_skills: []
    produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/trace/IDEA_BRIEF.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
    human_checkpoint:
      approve: brainstorm brief
      write_to: DECISIONS.md
  C1:
    title: Retrieval + core set
    mode: no_prose
    required_skills: [literature-engineer, dedupe-rank]
    optional_skills: []
    produces: [papers/papers_raw.jsonl, papers/papers_dedup.jsonl, papers/core_set.csv, papers/retrieval_report.md]
  C2:
    title: Idea landscape / focus
    mode: no_prose
    required_skills: [taxonomy-builder, pipeline-router, human-checkpoint]
    optional_skills: []
    produces: [outline/taxonomy.yml, DECISIONS.md]
    human_checkpoint:
      approve: focus clusters / lenses + exclusions
      write_to: DECISIONS.md
  C3:
    title: Evidence signals
    mode: no_prose
    required_skills: [paper-notes, idea-signal-mapper]
    optional_skills: []
    produces: [papers/paper_notes.jsonl, papers/evidence_bank.jsonl, output/trace/IDEA_SIGNAL_TABLE.md, output/trace/IDEA_SIGNAL_TABLE.jsonl]
  C4:
    title: Direction pool + screening
    mode: short_prose_ok
    required_skills: [idea-direction-generator, idea-screener]
    optional_skills: []
    produces: [output/trace/IDEA_DIRECTION_POOL.md, output/trace/IDEA_DIRECTION_POOL.jsonl, output/trace/IDEA_SCREENING_TABLE.md, output/trace/IDEA_SCREENING_TABLE.jsonl]
  C5:
    title: Shortlist + memo synthesis + self-loop
    mode: prose_allowed
    required_skills: [idea-shortlist-curator, idea-memo-writer, deliverable-selfloop, artifact-contract-auditor]
    optional_skills: []
    produces: [output/trace/IDEA_SHORTLIST.md, output/trace/IDEA_SHORTLIST.jsonl, output/REPORT.md, output/APPENDIX.md, output/REPORT.json, output/DELIVERABLE_SELFLOOP_TODO.md, output/CONTRACT_REPORT.md]
---
# Pipeline: research idea brainstorm (signals -> directions -> memo)
Goal: produce a **discussion-ready research-idea memo** for PI / PhD readers.
This pipeline is for “找 research idea / brainstorm / 选题 / 找方向”.
It is **not** for writing a survey draft and **not** for generating execution-grade project specs.
The terminal deliverable is a single default entrypoint:
- `output/REPORT.md`
That memo should feel like a mature brainstorm artifact:
- grounded in literature,
- small enough to discuss,
- clear about what is promising,
- honest about uncertainty,
- and rich enough to guide the next discussion round.
Artifact policy:
- Reader-facing terminal artifacts:
- `output/REPORT.md`
- `output/APPENDIX.md`
- `output/REPORT.json`
- Trace artifacts stay under `output/trace/`.
- The run is only shareable when both layers exist and pass contract audit.
Default profile:
- Retrieval route: `literature-engineer`
- `core_size=100`
- Evidence mode: `abstract`
- Signal table: 10-20 rows
- Direction pool: 12-24
- Shortlist: 3-5
- Final memo lead directions: 3
## Stage 0 - Init + idea brief (C0)
required_skills:
- workspace-init
- pipeline-router
- idea-brief
- human-checkpoint
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
Notes:
- The brief is the single source of truth for topic, audience, constraints, exclusions, targets, and query buckets.
- Default behavior: block once at C0 for human approval of the brief before retrieval.
## Stage 1 - Retrieval + core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
Notes:
- Use multi-query buckets from `queries.md`.
- If recall is too low or the results are too noisy, fix it upstream (in `queries.md`) and rerun C1.
## Stage 2 - Idea landscape / focus (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- pipeline-router
- human-checkpoint
produces:
- outline/taxonomy.yml
- DECISIONS.md
human_checkpoint:
- approve: focus clusters / lenses + exclusions
- write_to: DECISIONS.md
Notes:
- Taxonomy is an idea landscape, not a paper outline.
- Default behavior: pause at C2 so the human can choose a few promising focus lenses.
## Stage 3 - Evidence signals (C3) [table-first]
required_skills:
- paper-notes
- idea-signal-mapper
produces:
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
Notes:
- The signal table should capture tensions, missing pieces, and academically meaningful axes rather than proposal-ready wedges.
- Keep this stage compact and table-first; no long prose.
## Stage 4 - Direction pool + screening (C4)
required_skills:
- idea-direction-generator
- idea-screener
produces:
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
Notes:
- Expansion should generate discussion-worthy research directions, not an operator cartesian product.
- Screening should score discussion value, distinctness, evidence grounding, and thesis potential.
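The screening step can be sketched as a weighted sum followed by rank cutoffs. The weights and the `keep_rank_max` / `maybe_rank_max` values are copied from this pipeline's frontmatter; the 0-1 per-criterion scale, the function names, and the `drop` label are assumptions, since the source does not describe how `idea-screener` applies them.

```python
from __future__ import annotations

# Weights and cutoffs copied from the pipeline frontmatter.
SCORE_WEIGHTS = {
    "discussion_worthiness": 0.24,
    "academic_value": 0.22,
    "evidence_grounding": 0.18,
    "direction_distinctness": 0.16,
    "first_probe_clarity": 0.10,
    "thesis_potential": 0.10,
}
KEEP_RANK_MAX = 7
MAYBE_RANK_MAX = 12


def weighted_score(scores: dict[str, float]) -> float:
    """Weighted sum over the criteria; a missing criterion counts as 0."""
    return sum(w * scores.get(name, 0.0) for name, w in SCORE_WEIGHTS.items())


def screen(directions: dict[str, dict[str, float]]) -> dict[str, str]:
    """Label each direction keep / maybe / drop by its rank (ties broken by name)."""
    ranked = sorted(directions, key=lambda d: (-weighted_score(directions[d]), d))
    labels: dict[str, str] = {}
    for rank, name in enumerate(ranked, start=1):
        if rank <= KEEP_RANK_MAX:
            labels[name] = "keep"
        elif rank <= MAYBE_RANK_MAX:
            labels[name] = "maybe"
        else:
            labels[name] = "drop"  # label assumed; the source names only keep/maybe cutoffs
    return labels
```

Because the weights sum to 1, a direction scored 1.0 on every criterion gets a total of 1.0, which keeps totals comparable across runs.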
## Stage 5 - Shortlist + memo synthesis + self-loop (C5)
required_skills:
- idea-shortlist-curator
- idea-memo-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- `output/trace/IDEA_SHORTLIST.md` is an internal convergence layer, not the user-facing final answer.
- `output/REPORT.md` is the terminal deliverable and should read like a discussion-ready research idea brainstorm memo.
- `deliverable-selfloop` should evaluate the full memo bundle plus its trace chain.
FILE:pipelines/lit-snapshot.pipeline.md
---
name: lit-snapshot
version: 2.3
profile: lit-snapshot
routing_hints: [snapshot, 快照, one-page, one page, 48h]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- outline/taxonomy.yml
- outline/outline.yml
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.lit-snapshot.csv
---
# Pipeline: literature snapshot (24-48h) [bullets-first]
Goal: a compact, reader-facing snapshot (`output/SNAPSHOT.md`) with a small but usable paper set and a paper-like structure. This is intentionally lighter than `arxiv-survey*` (no evidence packs, no BibTeX, no LaTeX).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- arxiv-search
- dedupe-rank
optional_skills:
- keyword-expansion
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Snapshot default: aim for a smaller but diverse set (e.g., core_set ~20-40) rather than a survey-scale pool.
- Practical retrieval loop (multi-query, then refine):
- Start broad with multiple query buckets (synonyms/acronyms/subtopics) and merge the results.
- If too few: add buckets + widen time window; if too noisy: rewrite keywords + add exclusions; rerun C1.
- If the snapshot feels generic, the fix is almost always upstream: broaden `queries.md` (more synonyms, fewer excludes, wider time window) and rerun C1.
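The "multiple query buckets, then merge" step in the notes above can be sketched as a merge that dedupes on a normalized title. The normalization follows the same idea as `normalize_title_for_dedupe` in `tooling/common.py`; the record shape (a `title` key), the added `bucket` key, and the function names are assumptions for illustration.

```python
from __future__ import annotations

import re


def normalize_title(title: str) -> str:
    """Lowercase and collapse every non-alphanumeric run to a single space."""
    title = re.sub(r"[^a-z0-9]+", " ", title.lower())
    return re.sub(r"\s+", " ", title).strip()


def merge_buckets(buckets: dict[str, list[dict]]) -> list[dict]:
    """Merge per-bucket results, keeping the first record seen for each title."""
    seen: set[str] = set()
    merged: list[dict] = []
    for bucket_name, records in buckets.items():
        for record in records:
            key = normalize_title(record.get("title", ""))
            if not key or key in seen:
                continue
            seen.add(key)
            # Remember which bucket first surfaced the paper.
            merged.append({**record, "bucket": bucket_name})
    return merged
```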
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- outline-builder
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/outline.yml
human_checkpoint:
- approve: scope + outline (snapshot ToC)
- write_to: DECISIONS.md
Notes:
- Paper-like default: prefer fewer, thicker sections. For snapshots, avoid H3 explosion; treat H2 as the main organizing unit.
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent nodes and keep the ToC readable before writing.
## Stage 3 - Snapshot (C3) [SHORT PROSE OK]
required_skills:
- snapshot-writer
- deliverable-selfloop
- artifact-contract-auditor
optional_skills:
- prose-writer
produces:
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- The deliverable is bullets-first and pointer-heavy: every non-trivial claim should attach 1-2 concrete paper pointers from `papers/core_set.csv`.
- Avoid outline narration (e.g., `This section surveys ...`); write content claims + why-it-matters + pointers.
FILE:pipelines/peer-review.pipeline.md
---
name: peer-review
version: 2.3
profile: peer-review
routing_hints: [peer review, review report, referee, 审稿]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PAPER.md
- output/CLAIMS.md
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.peer-review.csv
---
# Pipeline: peer review / referee report
Goal: produce an actionable referee report (`output/REVIEW.md`) that is traceable to the submitted paper’s claims and evidence (no free-floating advice, no invented comparisons).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
## Stage 1 - Claims (C1)
required_skills:
- manuscript-ingest
- claims-extractor
produces:
- output/PAPER.md
- output/CLAIMS.md
Notes:
- Input expectation: the manuscript text must be available as `output/PAPER.md` (or equivalent extracted text) before claims are extracted.
- Contract: every claim must include a source pointer (section/page/quote) so later critique is auditable.
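The source-pointer contract can be sketched as a check over extracted claims. The claim record shape (`id`, `section`, `page`, `quote`) and the function name are hypothetical; the source only says a pointer may be a section, page, or quote.

```python
from __future__ import annotations

# Any one of these fields counts as a source pointer (field names assumed).
POINTER_FIELDS = ("section", "page", "quote")


def claims_without_pointer(claims: list[dict]) -> list[str]:
    """Return the ids of claims that carry no section/page/quote pointer."""
    missing: list[str] = []
    for claim in claims:
        has_pointer = any(str(claim.get(f) or "").strip() for f in POINTER_FIELDS)
        if not has_pointer:
            missing.append(str(claim.get("id", "<no id>")))
    return missing
```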
## Stage 2 - Evidence audit (C2)
required_skills:
- evidence-auditor
- novelty-matrix
produces:
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
Notes:
- Evidence-auditor should write gaps/risks/verification steps, not “fix the paper”.
- Novelty-matrix should be conservative: compare against cited/known related work; if you lack sources, record that limitation explicitly.
## Stage 3 - Rubric write-up (C3)
required_skills:
- rubric-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Prefer concrete, minimal fixes (what experiment/ablation/analysis would resolve the concern) over generic “needs more experiments”.
FILE:pipelines/systematic-review.pipeline.md
---
name: systematic-review
version: 2.3
profile: systematic-review
routing_hints: [systematic review, systematic, prisma, 系统综述]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PROTOCOL.md
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/screening_log.csv
- papers/extraction_table.csv
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.systematic-review.csv
---
# Pipeline: systematic review (PRISMA-style)
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Protocol (C1)
required_skills:
- protocol-writer
produces:
- output/PROTOCOL.md
human_checkpoint:
- approve: protocol locked (query, inclusion/exclusion, databases, time window)
- write_to: DECISIONS.md
## Stage 2 - Retrieval & candidate pool (C2)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- arxiv-search
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Systematic-review contract: keep the candidate pool auditable. Prefer dedupe (good) over aggressive ranking/filtering (dangerous).
- `dedupe-rank` default: for this pipeline, it keeps the full deduped pool in `papers/core_set.csv` unless you explicitly set `queries.md:core_size` (screening should not silently drop papers).
## Stage 3 - Screening (C3)
required_skills:
- screening-manager
produces:
- papers/screening_log.csv
Notes:
- Use `papers/papers_dedup.jsonl` (or `papers/core_set.csv`) as the candidate list; `papers/screening_log.csv` must include protocol-grounded reasons for every decision.
- Practical auditability: number protocol clauses (`I1..` / `E1..`) and require screening rows to cite them (e.g., `reason_codes=E3`).
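The clause-citation rule can be sketched as a per-row check on `reason_codes`. The `I<n>` / `E<n>` code shape and the `reason_codes` column come from the note above; the semicolon separator, the `<missing>` marker, and the function name are assumptions.

```python
from __future__ import annotations

import re

CODE_RE = re.compile(r"^[IE]\d+$")  # I1.. for inclusion, E1.. for exclusion


def invalid_reason_codes(row: dict[str, str], protocol_codes: set[str]) -> list[str]:
    """Return codes that are malformed or absent from the protocol."""
    raw = row.get("reason_codes") or ""
    codes = [c.strip() for c in raw.split(";") if c.strip()]
    if not codes:
        return ["<missing>"]  # every decision needs at least one clause
    return [c for c in codes if not CODE_RE.match(c) or c not in protocol_codes]
```

An empty result means the row is auditable against the locked protocol.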
## Stage 4 - Extraction (C4)
required_skills:
- extraction-form
- bias-assessor
produces:
- papers/extraction_table.csv
Notes:
- Extraction is schema-driven: do not write narrative synthesis here; keep all fields consistent and fill bias fields (low/unclear/high + notes).
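The bias-field rule can be sketched as a check that every bias column holds one of the three allowed values. The allowed values come from the note above; the column names in the usage example and the function name are hypothetical, since the source does not list the extraction schema.

```python
from __future__ import annotations

ALLOWED_BIAS = {"low", "unclear", "high"}


def bias_problems(row: dict[str, str], bias_fields: list[str]) -> list[str]:
    """Return the bias columns whose value is not low/unclear/high."""
    return [
        name
        for name in bias_fields
        if (row.get(name) or "").strip().lower() not in ALLOWED_BIAS
    ]
```

For example, a row with an empty `bias_reporting` cell would be flagged rather than silently accepted.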
## Stage 5 - Synthesis (C5) [PROSE ALLOWED]
required_skills:
- synthesis-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
FILE:pipelines/tutorial.pipeline.md
---
name: tutorial
version: 2.3
profile: tutorial
routing_hints: [tutorial, 教程]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/TUTORIAL_SPEC.md
- outline/concept_graph.yml
- outline/module_plan.yml
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.tutorial.csv
---
# Pipeline: tutorial (teaching loop)
Goal: a tutorial deliverable (`output/TUTORIAL.md`) that has a consistent running example and a real teaching loop (objectives -> steps -> exercises -> verification), not just a blog-style explanation.
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Spec (C1)
required_skills:
- tutorial-spec
produces:
- output/TUTORIAL_SPEC.md
Notes:
- The spec is the scope contract: audience, prerequisites, learning objectives, and a running example (if applicable).
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- concept-graph
- module-planner
- exercise-builder
produces:
- outline/concept_graph.yml
- outline/module_plan.yml
human_checkpoint:
- approve: target audience + scope + running example
- write_to: DECISIONS.md
Notes:
- Treat `outline/module_plan.yml` as the execution contract for writing: modules must have concrete outputs and at least one verifiable exercise each.
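The module-plan contract can be sketched as a validator over the loaded plan. The keys used here (`modules`, `id`, `outputs`, `exercises`, `verification`) and the function name are assumptions; the source states the rule but not the schema of `outline/module_plan.yml`.

```python
from __future__ import annotations


def module_plan_problems(plan: dict) -> list[str]:
    """List modules lacking concrete outputs or a verifiable exercise."""
    problems: list[str] = []
    for module in plan.get("modules", []):
        mid = module.get("id", "<no id>")
        if not module.get("outputs"):
            problems.append(f"{mid}: no concrete outputs")
        exercises = module.get("exercises") or []
        # An exercise counts as verifiable only if it says how to check it.
        if not any(ex.get("verification") for ex in exercises):
            problems.append(f"{mid}: no verifiable exercise")
    return problems
```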
## Stage 3 - Writing (C3) [PROSE ALLOWED]
required_skills:
- tutorial-module-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Write only within the approved scope; if you discover missing prerequisites, record them as a follow-up instead of expanding scope silently.
FILE:tooling/__init__.py
FILE:tooling/common.py
from __future__ import annotations

import csv
import json
import os
import re
import shutil
import tempfile
from dataclasses import dataclass
from datetime import date, datetime
from pathlib import Path
from typing import Any, Iterable

import yaml


def today_iso() -> str:
    return date.today().isoformat()


def now_iso_seconds() -> str:
    return datetime.now().replace(microsecond=0).isoformat()


def ensure_dir(path: Path) -> None:
    path.mkdir(parents=True, exist_ok=True)


def atomic_write_text(path: Path, content: str) -> None:
    ensure_dir(path.parent)
    fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
        os.replace(tmp_path, path)
    except Exception:
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
def backup_existing(path: Path) -> Path:
    """Rename an existing file to a timestamped `.bak.*` sibling and return the backup path."""
    if not path.exists():
        return path
    stamp = datetime.now().replace(microsecond=0).isoformat().replace("-", "").replace(":", "")
    backup = path.with_name(f"{path.name}.bak.{stamp}")
    counter = 1
    while backup.exists():
        backup = path.with_name(f"{path.name}.bak.{stamp}.{counter}")
        counter += 1
    path.replace(backup)
    return backup


def parse_semicolon_list(value: str | None) -> list[str]:
    if not value:
        return []
    return [item.strip() for item in value.split(";") if item.strip()]


def normalize_title_for_dedupe(title: str) -> str:
    title = title.lower()
    title = re.sub(r"[^a-z0-9]+", " ", title)
    title = re.sub(r"\s+", " ", title).strip()
    return title


def normalize_axis_label(text: str) -> str:
    text = re.sub(r"\s+", " ", (text or "").strip().lower())
    text = text.rstrip(" .;:,;。")
    text = re.sub(r"\s*/\s*", " ", text)
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()
def subsection_brief_generic_axis_norms() -> set[str]:
    """Axis labels that strict survey gates treat as scaffold-level defaults.

    Keep this aligned across brief generation and quality-gate checks so the
    generator does not promote an axis that the gate later classifies as generic.
    """
    axes = {
        "core mechanism and system architecture",
        "training and data setup",
        "evaluation protocol",
        "evaluation protocol (benchmarks / metrics / human)",
        "evaluation protocol (datasets / metrics / human)",
        "evaluation protocol (datasets, metrics, human evaluation)",
        "compute and efficiency",
        "compute and latency constraints",
        "efficiency and compute",
        "tool interface contract (schemas / protocols)",
        "tool selection / routing policy",
        "sandboxing / permissions / observability",
        "failure modes and limitations",
    }
    return {normalize_axis_label(axis) for axis in axes}
def read_jsonl(path: Path) -> list[dict[str, Any]]:
    records: list[dict[str, Any]] = []
    if not path.exists():
        return records
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            records.append(json.loads(line))
    return records


def write_jsonl(path: Path, records: Iterable[dict[str, Any]]) -> None:
    ensure_dir(path.parent)
    lines = [json.dumps(record, ensure_ascii=False) for record in records]
    atomic_write_text(path, "\n".join(lines) + ("\n" if lines else ""))


def read_tsv(path: Path) -> list[dict[str, str]]:
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [dict(row) for row in reader]


def write_tsv(path: Path, rows: list[dict[str, Any]], fieldnames: list[str]) -> None:
    ensure_dir(path.parent)
    fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
            writer.writeheader()
            for row in rows:
                writer.writerow({key: row.get(key, "") for key in fieldnames})
        os.replace(tmp_path, path)
    except Exception:
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
@dataclass(frozen=True)
class UnitsTable:
    fieldnames: list[str]
    rows: list[dict[str, str]]

    @staticmethod
    def load(path: Path) -> "UnitsTable":
        with path.open("r", encoding="utf-8", newline="") as handle:
            reader = csv.DictReader(handle)
            fieldnames = list(reader.fieldnames or [])
            rows = [dict(row) for row in reader]
        return UnitsTable(fieldnames=fieldnames, rows=rows)

    def save(self, path: Path) -> None:
        ensure_dir(path.parent)
        fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
        try:
            with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
                writer = csv.DictWriter(handle, fieldnames=self.fieldnames)
                writer.writeheader()
                for row in self.rows:
                    writer.writerow({key: row.get(key, "") for key in self.fieldnames})
            os.replace(tmp_path, path)
        except Exception:
            try:
                os.unlink(tmp_path)
            except OSError:
                pass
            raise
def load_yaml(path: Path) -> Any:
    with path.open("r", encoding="utf-8") as handle:
        return yaml.safe_load(handle)


def dump_yaml(path: Path, data: Any) -> None:
    ensure_dir(path.parent)
    text = yaml.safe_dump(
        data,
        allow_unicode=True,
        sort_keys=False,
        default_flow_style=False,
        width=120,
    )
    atomic_write_text(path, text)


def copy_tree(src_dir: Path, dst_dir: Path, *, overwrite: bool) -> None:
    if not src_dir.is_dir():
        raise ValueError(f"Template directory not found: {src_dir}")
    ensure_dir(dst_dir)
    for src_path in src_dir.rglob("*"):
        rel = src_path.relative_to(src_dir)
        dst_path = dst_dir / rel
        if src_path.is_dir():
            ensure_dir(dst_path)
            continue
        ensure_dir(dst_path.parent)
        if dst_path.exists() and not overwrite:
            continue
        shutil.copy2(src_path, dst_path)


def tokenize(text: str) -> list[str]:
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return [token for token in text.split() if token]
_EN_STOPWORDS = {
    "a",
    "an",
    "and",
    "are",
    "as",
    "at",
    "be",
    "by",
    "for",
    "from",
    "in",
    "is",
    "it",
    "of",
    "on",
    "or",
    "that",
    "the",
    "this",
    "to",
    "with",
    "we",
    "our",
    "via",
    "towards",
    "toward",
    "using",
    "use",
    "based",
    "new",
    "into",
    "over",
    "under",
    "between",
    "within",
    "without",
    "beyond",
}
_GENERIC_PAPER_WORDS = {
"survey",
"review",
"tutorial",
"paper",
"approach",
"method",
"methods",
"model",
"models",
"framework",
"frameworks",
"system",
"systems",
"learning",
"deep",
"neural",
"network",
"networks",
"analysis",
"benchmark",
"benchmarks",
"dataset",
"datasets",
"evaluation",
"evaluating",
"towards",
"using",
"based",
"study",
"studies",
}
def candidate_keywords(titles: Iterable[str], *, top_k: int, min_freq: int) -> list[str]:
    freq: dict[str, int] = {}
    for title in titles:
        for token in tokenize(title):
            if token in _EN_STOPWORDS or token in _GENERIC_PAPER_WORDS:
                continue
            if len(token) < 3:
                continue
            freq[token] = freq.get(token, 0) + 1
    candidates = [t for t, c in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])) if c >= min_freq]
    return candidates[:top_k]
def update_status_log(status_path: Path, line: str) -> None:
    ensure_dir(status_path.parent)
    if status_path.exists():
        existing = status_path.read_text(encoding="utf-8")
    else:
        existing = "# Status\n"
    if "## Run log" not in existing:
        existing = existing.rstrip() + "\n\n## Run log\n"
    updated = existing.rstrip() + f"\n- {line}\n"
    atomic_write_text(status_path, updated)


def update_status_field(status_path: Path, heading: str, value: str) -> None:
    heading_line = f"## {heading}".strip()
    bullet_line = f"- `{value}`"
    if status_path.exists():
        lines = status_path.read_text(encoding="utf-8").splitlines()
    else:
        lines = ["# Status"]
    out: list[str] = []
    i = 0
    updated = False
    while i < len(lines):
        line = lines[i]
        out.append(line)
        if line.strip() == heading_line:
            if i + 1 < len(lines) and lines[i + 1].lstrip().startswith("-"):
                out.append(bullet_line)
                i += 2
                updated = True
                continue
            out.append(bullet_line)
            updated = True
        i += 1
    if not updated:
        out.extend(["", heading_line, bullet_line])
    atomic_write_text(status_path, "\n".join(out).rstrip() + "\n")
def decisions_has_approval(decisions_path: Path, checkpoint: str) -> bool:
    if not checkpoint:
        return False
    if not decisions_path.exists():
        return False
    text = decisions_path.read_text(encoding="utf-8")
    pattern = rf"^\s*-\s*\[[xX]\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b"
    return re.search(pattern, text, flags=re.MULTILINE) is not None


def ensure_decisions_approval_checklist(decisions_path: Path) -> None:
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n"
    if re.search(r"^##\s+Approvals\b", text, flags=re.MULTILINE):
        return
    workspace = decisions_path.parent
    checkpoints = _human_checkpoints_from_units(workspace)
    if not checkpoints:
        return
    checklist_lines = ["## Approvals (check to unblock)"]
    for checkpoint in checkpoints:
        hint = _approval_hint(checkpoint)
        suffix = f" ({hint})" if hint else ""
        checklist_lines.append(f"- [ ] Approve {checkpoint}{suffix}")
    checklist_lines.append("")
    checklist = "\n".join(checklist_lines)
    lines = text.splitlines()
    if lines and lines[0].startswith("#"):
        new_text = "\n".join([lines[0], "", checklist] + lines[1:]).rstrip() + "\n"
    else:
        new_text = (checklist + "\n" + text).rstrip() + "\n"
    atomic_write_text(decisions_path, new_text)
def set_decisions_approval(decisions_path: Path, checkpoint: str, *, approved: bool) -> None:
    checkpoint = checkpoint.strip()
    if not checkpoint:
        raise ValueError("checkpoint must be non-empty")
    ensure_decisions_approval_checklist(decisions_path)
    text = decisions_path.read_text(encoding="utf-8")
    lines = text.splitlines()
    pattern = re.compile(rf"^\s*-\s*\[\s*[xX ]\s*\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b")
    updated = False
    for idx, line in enumerate(lines):
        if pattern.search(line):
            lines[idx] = re.sub(
                r"\[\s*[xX ]\s*\]",
                "[x]" if approved else "[ ]",
                line,
                count=1,
            )
            updated = True
            break
    if not updated:
        insert_at = None
        for idx, line in enumerate(lines):
            if line.strip().startswith("## Approvals"):
                insert_at = idx + 1
                break
        if insert_at is None:
            lines.append("")
            lines.append("## Approvals (check to unblock)")
            insert_at = len(lines)
        lines.insert(insert_at, f"- [{'x' if approved else ' '}] Approve {checkpoint}")
    atomic_write_text(decisions_path, "\n".join(lines).rstrip() + "\n")
def _human_checkpoints_from_units(workspace: Path) -> list[str]:
units_path = workspace / "UNITS.csv"
if not units_path.exists():
return []
try:
table = UnitsTable.load(units_path)
except Exception:
return []
seen: set[str] = set()
out: list[str] = []
for row in table.rows:
owner = (row.get("owner") or "").strip().upper()
if owner != "HUMAN":
continue
checkpoint = (row.get("checkpoint") or "").strip()
if checkpoint and checkpoint not in seen:
seen.add(checkpoint)
out.append(checkpoint)
return out
def _approval_hint(checkpoint: str) -> str:
hints = {
"C0": "kickoff: scope/sources/time window/constraints",
"C1": "retrieval + core set",
"C2": "scope + outline",
"C3": "evidence ready",
"C4": "citations verified",
"C5": "allow prose writing",
}
return hints.get(checkpoint, "")
def upsert_checkpoint_block(decisions_path: Path, checkpoint: str, markdown_block: str) -> None:
    begin = f"<!-- BEGIN CHECKPOINT:{checkpoint} -->"
    end = f"<!-- END CHECKPOINT:{checkpoint} -->"
    block = "\n".join([begin, markdown_block.rstrip(), end, ""]).rstrip() + "\n"
    # Seed the approvals checklist first; it may return without creating the file.
    ensure_decisions_approval_checklist(decisions_path)
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n\n"
    pattern = re.compile(
        rf"{re.escape(begin)}.*?{re.escape(end)}\n?",
        flags=re.DOTALL,
    )
    if pattern.search(text):
        # A callable replacement inserts the block literally, so backslashes
        # in the markdown are not parsed as regex escapes or group references.
        new_text = pattern.sub(lambda _match: block, text)
    else:
        new_text = text.rstrip() + "\n\n" + block
    atomic_write_text(decisions_path, new_text)
def seed_queries_from_topic(queries_path: Path, topic: str) -> None:
topic = topic.strip()
if not topic:
return
if queries_path.exists():
lines = queries_path.read_text(encoding="utf-8").splitlines()
else:
lines = [
"# Queries",
"",
"## Primary query",
"- keywords:",
" - \"\"",
"- exclude:",
" - \"\"",
"- max_results: \"\"",
"- core_size: \"\"",
"- time window:",
" - from: \"\"",
" - to: \"\"",
"",
"## Notes",
"-",
]
def _has_nonempty_values(token: str) -> bool:
in_block = False
for raw in lines:
stripped = raw.strip()
if stripped.startswith(f"- {token}:"):
in_block = True
continue
if not in_block:
continue
if raw.startswith(" - "):
value = stripped[2:].strip().strip('"').strip("'")
if value:
return True
continue
if stripped.startswith("- "):
break
return False
def _has_nonempty_scalar(token: str) -> bool:
for raw in lines:
stripped = raw.strip()
if not stripped.startswith(f"- {token}:"):
continue
value = stripped.split(":", 1)[1].split("#", 1)[0].strip().strip('"').strip("'")
return bool(value)
return False
def _has_nonempty_time_field(field: str) -> bool:
# Looks for lines like: ' - from: "2022"'
for raw in lines:
stripped = raw.strip()
if not stripped.startswith(f"- {field}:"):
continue
value = stripped.split(":", 1)[1].strip().strip('"').strip("'")
return bool(value)
return False
has_keywords = _has_nonempty_values("keywords")
has_excludes = _has_nonempty_values("exclude")
has_time_from = _has_nonempty_time_field("from")
has_time_to = _has_nonempty_time_field("to")
has_max_results = _has_nonempty_scalar("max_results")
has_core_size = _has_nonempty_scalar("core_size")
workspace = queries_path.parent
profile = pipeline_profile(workspace)
query_defaults = pipeline_query_defaults(workspace)
raw_tlow = topic.lower()
topic_for_queries = _sanitize_topic_for_query_seed(topic)
keyword_suggestions = [topic_for_queries]
tlow = topic_for_queries.lower()
is_agent = any(t in tlow for t in ("agent", "agents", "agentic"))
is_embodied = any(
t in tlow
for t in (
"embodied ai",
"embodied intelligence",
"embodied agent",
"embodied robotics",
"robot foundation model",
"robot learning",
"robot manipulation",
"vision-language-action",
"vla",
"generalist robot",
)
)
is_text_to_image = any(t in tlow for t in ("text-to-image", "text to image", "t2i"))
is_text_to_video = any(t in tlow for t in ("text-to-video", "text to video", "t2v"))
is_diffusion = "diffusion" in tlow
is_generative = is_text_to_image or is_text_to_video or is_diffusion or ("image generation" in tlow) or ("generative" in tlow)
if is_agent:
keyword_suggestions.extend(
[
"LLM agent",
"language model agent",
"tool use",
"function calling",
"tool-using agent",
"planning",
"memory",
"multi-agent",
"benchmark",
"safety",
]
)
exclude_suggestions: list[str] = []
if is_agent:
exclude_suggestions.append("agent-based modeling")
exclude_suggestions.extend(["react hooks", "perovskite", "banach", "coxeter"])
if is_embodied:
keyword_suggestions.extend(
[
"embodied AI survey",
"embodied AI review",
"embodied intelligence survey",
"embodied agent survey",
"robot foundation model survey",
"robot learning survey",
"robot manipulation survey",
"embodied robotics survey",
"vision-language-action survey",
"vision-language-action model",
"robot foundation model",
"generalist robot policy",
"world model robot",
]
)
stripped_output_terms = topic_for_queries.lower().strip() != raw_tlow.strip()
if stripped_output_terms and any(t in raw_tlow for t in ("latex", "pdf", "markdown", "typesetting")):
exclude_suggestions.extend(["latex", "pdf", "typesetting", "document layout"])
if is_generative:
keyword_suggestions.extend(
[
"text-to-image generation",
"text-guided image generation",
"diffusion model",
"denoising diffusion probabilistic model",
"latent diffusion",
"stable diffusion",
"classifier-free guidance",
"diffusion transformer",
"DiT",
"masked generative transformer",
"MaskGIT",
"autoregressive image generation",
"VQGAN",
"VQ-VAE",
"ControlNet",
"DreamBooth",
"textual inversion",
"LoRA fine-tuning",
]
)
default_max_results = query_defaults.get("max_results")
# Always a string, matching core_size_suggestion below.
max_results_suggestion = str(default_max_results or "").strip() or (
"1800" if profile == "arxiv-survey" else ("800" if (is_agent or is_generative or is_embodied) else "300")
)
time_from_suggestion = (
"2018" if is_embodied
else ("2022" if (is_agent and ("llm" in tlow or "language model" in tlow)) else ("2020" if is_generative else ""))
)
core_size_suggestion = str(query_defaults.get("core_size") or "").strip() or ("300" if profile == "arxiv-survey" else "")
out: list[str] = []
i = 0
while i < len(lines):
line = lines[i]
stripped = line.strip()
if stripped.startswith("- keywords:") and not has_keywords:
out.append(line)
i += 1
while i < len(lines) and lines[i].startswith(" - "):
i += 1
for kw in _dedupe_preserve_order(keyword_suggestions)[:14]:
out.append(f" - \"{kw}\"")
continue
if stripped.startswith("- exclude:") and not has_excludes:
out.append(line)
i += 1
while i < len(lines) and lines[i].startswith(" - "):
i += 1
for ex in _dedupe_preserve_order(exclude_suggestions)[:10]:
out.append(f" - \"{ex}\"")
continue
if stripped.startswith("- max_results:") and not has_max_results and max_results_suggestion:
out.append(f"- max_results: \"{max_results_suggestion}\"")
i += 1
continue
if stripped.startswith("- core_size:") and not has_core_size and core_size_suggestion:
out.append(f"- core_size: \"{core_size_suggestion}\"")
i += 1
continue
if stripped.startswith("- time window:") and not (has_time_from or has_time_to) and time_from_suggestion:
out.append(line)
i += 1
# Skip existing from/to lines if present.
while i < len(lines) and lines[i].startswith(" -"):
i += 1
out.append(f" - from: \"{time_from_suggestion}\"")
out.append(" - to: \"\"")
continue
out.append(line)
i += 1
out = _materialize_missing_query_defaults(out, query_defaults, allowed_fields=pipeline_overridable_query_fields(workspace))
atomic_write_text(queries_path, "\n".join(out).rstrip() + "\n")
def _dedupe_preserve_order(items: list[str]) -> list[str]:
seen: set[str] = set()
out: list[str] = []
for item in items:
item = item.strip()
if not item or item in seen:
continue
seen.add(item)
out.append(item)
return out
def _sanitize_topic_for_query_seed(topic: str) -> str:
text = str(topic or "").strip()
if not text:
return ""
patterns = [
r"(?i)\bwith\s+latex\s*/\s*pdf\s+output\b",
r"(?i)\bwith\s+latex\s+output\b",
r"(?i)\bwith\s+pdf\s+output\b",
r"(?i)\bwith\s+markdown\s+output\b",
r"(?i)\blatex\s*/\s*pdf\s+output\b",
r"(?i)\bpdf\s+output\b",
r"(?i)\blatex\s+output\b",
r"(?i)\bmarkdown\s+output\b",
r"(?i)\bfor\s+latex\s*/\s*pdf\b",
]
for pattern in patterns:
text = re.sub(pattern, "", text)
text = re.sub(r"\s+", " ", text).strip(" ,;:-")
return text or topic
_LEGACY_PIPELINE_ALIASES = {
"idea-finder": "idea-brainstorm",
"idea-finder.pipeline.md": "idea-brainstorm",
"pipelines/idea-finder.pipeline.md": "idea-brainstorm",
}
def find_repo_root(start: Path | None = None) -> Path:
"""Walk up from *start* (default: this file) looking for AGENTS.md."""
candidate = (start or Path(__file__)).resolve()
for _ in range(10):
if (candidate / "AGENTS.md").exists():
return candidate
parent = candidate.parent
if parent == candidate:
break
candidate = parent
raise FileNotFoundError("Could not find repo root (AGENTS.md marker)")
def _normalize_pipeline_lock_value(value: str) -> str:
raw = str(value or "").strip()
return _LEGACY_PIPELINE_ALIASES.get(raw, raw)
def resolve_pipeline_spec_path(*, repo_root: Path, pipeline_value: str) -> Path | None:
value = _normalize_pipeline_lock_value(pipeline_value)
if not value:
return None
candidate = Path(value)
if candidate.is_absolute() and candidate.exists():
return candidate.resolve()
rel_candidate = repo_root / value
if rel_candidate.exists():
return rel_candidate.resolve()
filename = Path(value).name
if filename:
direct = repo_root / "pipelines" / filename
if direct.exists():
return direct.resolve()
stem = filename
if stem.endswith(".pipeline.md"):
stem = stem[: -len(".pipeline.md")]
if stem:
direct = repo_root / "pipelines" / f"{stem}.pipeline.md"
if direct.exists():
return direct.resolve()
return None
def load_workspace_pipeline_spec(workspace: Path):
from tooling.pipeline_spec import PipelineSpec
try:
repo_root = find_repo_root(workspace)
except FileNotFoundError:
repo_root = Path(__file__).resolve().parents[1]
lock_path = workspace / "PIPELINE.lock.md"
if not lock_path.exists():
return None
pipeline_name = ""
try:
for raw in lock_path.read_text(encoding="utf-8", errors="ignore").splitlines():
line = raw.strip()
if line.startswith("pipeline:"):
pipeline_name = line.split(":", 1)[1].strip()
break
except Exception:
return None
if not pipeline_name:
return None
spec_path = resolve_pipeline_spec_path(repo_root=repo_root, pipeline_value=pipeline_name)
if spec_path is None:
return None
try:
return PipelineSpec.load(spec_path)
except Exception:
return None
def pipeline_query_defaults(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
return dict(spec.query_defaults) if spec is not None else {}
def pipeline_quality_contract(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
return dict(spec.quality_contract) if spec is not None else {}
def pipeline_quality_contract_value(workspace: Path, *keys: str, default: Any = None) -> Any:
current: Any = pipeline_quality_contract(workspace)
for key in keys:
if not isinstance(current, dict):
return default
current = current.get(str(key))
if current is None:
return default
return current
def pipeline_query_default(workspace: Path, key: str, default: Any = None) -> Any:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return default
return spec.query_default(key, default)
def pipeline_overridable_query_fields(workspace: Path) -> set[str]:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return set()
return set(spec.overridable_query_fields)
def pipeline_profile(workspace: Path) -> str:
"""Return the pipeline profile for a workspace.
Reads PIPELINE.lock.md to get the pipeline name, then loads the
pipeline spec file and reads its ``profile`` frontmatter field.
Falls back to ``"default"`` if anything is missing.
"""
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return "default"
return str(spec.profile or "default").strip() or "default"
def latest_outline_state(workspace: Path) -> dict[str, Any]:
path = Path(workspace).resolve() / "outline" / "outline_state.jsonl"
records = [rec for rec in read_jsonl(path) if isinstance(rec, dict)]
return dict(records[-1]) if records else {}
def _materialize_missing_query_defaults(lines: list[str], query_defaults: dict[str, Any], *, allowed_fields: set[str] | None = None) -> list[str]:
if not query_defaults:
return lines
existing_keys: set[str] = set()
for raw in lines:
stripped = raw.strip()
if not stripped.startswith("- ") or ":" not in stripped:
continue
key = stripped[2:].split(":", 1)[0].strip().lower().replace(" ", "_").replace("-", "_")
if key:
existing_keys.add(key)
additions: list[str] = []
for key, value in query_defaults.items():
norm_key = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
if not norm_key or norm_key in existing_keys:
continue
if allowed_fields and norm_key not in allowed_fields:
continue
rendered = _render_query_scalar(value)
if rendered is None:
continue
additions.append(f'- {norm_key}: "{rendered}"')
if not additions:
return lines
out = list(lines)
if out and out[-1].strip():
out.append("")
out.extend(additions)
return out
def _render_query_scalar(value: Any) -> str | None:
if value is None:
return None
if isinstance(value, bool):
return "true" if value else "false"
if isinstance(value, (int, float)):
return str(value)
if isinstance(value, str):
text = value.strip()
return text or None
return None
FILE:tooling/executor.py
from __future__ import annotations
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path
from tooling.common import (
UnitsTable,
atomic_write_text,
decisions_has_approval,
ensure_dir,
latest_outline_state,
load_workspace_pipeline_spec,
now_iso_seconds,
parse_semicolon_list,
set_decisions_approval,
update_status_field,
update_status_log,
)
@dataclass(frozen=True)
class RunResult:
unit_id: str | None
status: str
message: str
def _section_first_cutover_block_message(*, workspace: Path, outputs: list[str]) -> str | None:
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None or str(spec.structure_mode or "").strip().lower() != "section_first":
        return None
    if "outline/outline_state.jsonl" not in outputs:
        return None
    latest = latest_outline_state(workspace)
    if not latest:
        return "Section-first cutover is not actionable yet: `outline/outline_state.jsonl` is missing or empty. Rerun `outline-refiner` after fixing the chapter/section layer."
    structure_phase = str(latest.get("structure_phase") or "").strip()
    h3_status = str(latest.get("h3_status") or "").strip().lower()
    reroute_target = str(latest.get("reroute_target") or "").strip()
    # Keep an exhausted budget of 0 visible instead of reporting it as "unknown".
    retry_raw = latest.get("retry_budget_remaining")
    retry_budget = "" if retry_raw is None else str(retry_raw).strip()
    reroute_reason = str(latest.get("reroute_reason") or "").strip()
    if structure_phase.lower() == "decomposed" and h3_status == "stable":
        return None
    missing: list[str] = []
    for key in ("structure_phase", "h3_status", "approval_status", "reroute_target", "retry_budget_remaining"):
        if key not in latest:
            missing.append(key)
            continue
        if key in {"structure_phase", "h3_status"} and not str(latest.get(key) or "").strip():
            missing.append(key)
    if missing:
        return (
            "Section-first cutover cannot advance because the latest `outline_state.jsonl` record is missing "
            f"required fields: {', '.join(missing)}."
        )
    reroute_label = reroute_target or "unknown"
    retry_label = retry_budget or "unknown"
    reason_suffix = f" Reason: {reroute_reason}." if reroute_reason else ""
    return (
        "Section-first cutover is still blocked; "
        f"latest outline state has structure_phase={structure_phase or 'missing'}, "
        f"h3_status={h3_status or 'missing'}, reroute_target={reroute_label}, "
        f"retry_budget_remaining={retry_label}.{reason_suffix} Fix the reroute target explicitly, then rerun this unit."
    )
def _append_run_error(*, workspace: Path, unit_id: str, skill: str, kind: str, message: str, log_rel: str | None) -> None:
"""Append a short failure record to `output/RUN_ERRORS.md` (workspace-local).
This is a human-facing error sink that survives reruns and makes BLOCKED states debuggable.
"""
try:
ensure_dir(workspace / "output")
out_path = workspace / "output" / "RUN_ERRORS.md"
stamp = now_iso_seconds()
log_hint = f" (log: `{log_rel}`)" if log_rel else ""
line = f"- {stamp} `{unit_id}` `{skill}` `{kind}`: {message}{log_hint}"
if out_path.exists() and out_path.stat().st_size > 0:
prev = out_path.read_text(encoding="utf-8", errors="ignore").rstrip() + "\n"
else:
prev = "# Run errors\n\n"
atomic_write_text(out_path, prev + line + "\n")
except Exception:
# Never let the runner crash while trying to log an error.
return
def run_one_unit(
*,
workspace: Path,
repo_root: Path,
strict: bool = False,
auto_approve: set[str] | None = None,
) -> RunResult:
units_path = workspace / "UNITS.csv"
status_path = workspace / "STATUS.md"
if not units_path.exists():
return RunResult(unit_id=None, status="ERROR", message=f"Missing {units_path}")
table = UnitsTable.load(units_path)
runnable_idx = _find_first_runnable(table)
if runnable_idx is None:
return RunResult(unit_id=None, status="IDLE", message="No runnable unit found")
row = table.rows[runnable_idx]
unit_id = row.get("unit_id", "").strip()
skill = row.get("skill", "").strip()
owner = row.get("owner", "").strip().upper()
row["status"] = "DOING"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DOING {skill}")
auto_approve_set = {str(x or "").strip().upper() for x in (auto_approve or set()) if str(x or "").strip()}
if owner == "HUMAN":
checkpoint = row.get("checkpoint", "").strip()
if checkpoint and decisions_has_approval(workspace / "DECISIONS.md", checkpoint):
row["status"] = "DONE"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (HUMAN approved {checkpoint})")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="DONE", message=f"HUMAN approved {checkpoint}")
if checkpoint and checkpoint.upper() in auto_approve_set:
set_decisions_approval(workspace / "DECISIONS.md", checkpoint, approved=True)
row["status"] = "DONE"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (AUTO approved {checkpoint})")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="DONE", message=f"AUTO approved {checkpoint}")
row["status"] = "BLOCKED"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (await HUMAN approval {checkpoint})")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Await HUMAN approval {checkpoint} in DECISIONS.md")
script_path = repo_root / "scripts" / "run.py"
if not script_path.exists():
row["status"] = "BLOCKED"
table.save(units_path)
skill_md = "SKILL.md"
update_status_log(
status_path,
(
f"{now_iso_seconds()} {unit_id} BLOCKED "
f"(no script for {skill}; run manually per {skill_md} then mark DONE)"
),
)
_refresh_status_checkpoint(status_path, table)
return RunResult(
unit_id=unit_id,
status="BLOCKED",
message=(
f"No executable script for skill '{skill}'. "
f"Run it manually by following `{skill_md}`, write the required outputs, "
f"then mark the unit DONE (e.g., `python scripts/pipeline.py mark --workspace {workspace} --unit-id {unit_id} --status DONE`)."
),
)
inputs = parse_semicolon_list(row.get("inputs"))
raw_outputs = parse_semicolon_list(row.get("outputs"))
outputs = [_strip_optional_marker(rel) for rel in raw_outputs]
required_outputs = [outputs[i] for i, rel in enumerate(raw_outputs) if not rel.strip().startswith("?")]
checkpoint = row.get("checkpoint", "").strip()
cmd = [
sys.executable,
str(script_path),
"--workspace",
str(workspace),
"--unit-id",
unit_id,
"--inputs",
";".join(inputs),
"--outputs",
";".join(outputs),
"--checkpoint",
checkpoint,
]
log_rel = f"output/unit_logs/{unit_id}.{skill}.log"
log_path = workspace / log_rel
try:
completed = subprocess.run(cmd, check=False, capture_output=True, text=True)
if completed.stdout or completed.stderr or completed.returncode != 0:
ensure_dir(log_path.parent)
body = [
"# Unit log\n",
f"- unit_id: {unit_id}\n",
f"- skill: {skill}\n",
f"- exit: {completed.returncode}\n",
f"- cmd: {' '.join(cmd)}\n",
"\n## stdout\n\n",
(completed.stdout or "(empty)") + "\n",
"\n## stderr\n\n",
(completed.stderr or "(empty)") + "\n",
]
atomic_write_text(log_path, "".join(body))
except Exception as exc: # pragma: no cover
row["status"] = "BLOCKED"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (exec error)")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="exec_error",
message=f"{type(exc).__name__}: {exc}",
log_rel=None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=str(exc))
missing = [rel for rel in required_outputs if rel and not (workspace / rel).exists()]
if completed.returncode == 0 and not missing:
if strict:
from tooling.quality_gate import check_unit_outputs, write_quality_report
try:
issues = check_unit_outputs(skill=skill, workspace=workspace, outputs=outputs)
except Exception as exc: # pragma: no cover
from tooling.quality_gate import QualityIssue
issues = [
QualityIssue(
code="quality_gate_exception",
message=f"Quality gate crashed: {type(exc).__name__}: {exc}",
)
]
# Avoid confusing stale QUALITY_GATE.md after a successful run.
report_path = workspace / "output" / "QUALITY_GATE.md"
if issues or report_path.exists():
write_quality_report(workspace=workspace, unit_id=unit_id, skill=skill, issues=issues)
if issues:
row["status"] = "BLOCKED"
table.save(units_path)
rel_report = str(report_path.relative_to(workspace))
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (quality gate: {rel_report})")
_refresh_status_checkpoint(status_path, table)
reroute_hint = _reroute_hint(workspace)
return RunResult(
unit_id=unit_id,
status="BLOCKED",
message=f"Quality gate failed; see {rel_report}" + (f"; {reroute_hint}" if reroute_hint else ""),
)
cutover_block = _section_first_cutover_block_message(workspace=workspace, outputs=outputs)
if cutover_block:
row["status"] = "BLOCKED"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (section-first cutover)")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="section_first_cutover",
message=cutover_block,
log_rel=log_rel if log_path.exists() else None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=cutover_block)
row["status"] = "DONE"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE {skill}")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="DONE", message="OK")
row["status"] = "BLOCKED"
table.save(units_path)
if missing:
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (missing outputs: {', '.join(missing)})")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="missing_outputs",
message=f"Missing outputs: {', '.join(missing)}",
log_rel=log_rel if log_path.exists() else None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Missing outputs: {', '.join(missing)}" + (f"; see {log_rel}" if log_path.exists() else ""))
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (script failed)")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="script_failed",
message=f"Skill script failed (exit {completed.returncode})",
log_rel=log_rel if log_path.exists() else None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Skill script failed (exit {completed.returncode})" + (f"; see {log_rel}" if log_path.exists() else ""))
def _find_first_runnable(table: UnitsTable) -> int | None:
status_ok = {"DONE", "SKIP"}
unit_by_id = {row.get("unit_id", ""): row for row in table.rows}
for idx, row in enumerate(table.rows):
if row.get("status", "").strip().upper() not in {"TODO", "BLOCKED"}:
continue
deps = parse_semicolon_list(row.get("depends_on"))
if not deps:
return idx
deps_done = True
for dep_id in deps:
dep = unit_by_id.get(dep_id)
if not dep:
deps_done = False
break
if dep.get("status", "").strip().upper() not in status_ok:
deps_done = False
break
if deps_done:
return idx
return None
def _refresh_status_checkpoint(status_path: Path, table: UnitsTable) -> None:
checkpoint = _compute_current_checkpoint(table)
update_status_field(status_path, "Current checkpoint", checkpoint)
def _compute_current_checkpoint(table: UnitsTable) -> str:
for row in table.rows:
if row.get("status", "").strip().upper() not in {"DONE", "SKIP"}:
return (row.get("checkpoint") or "").strip() or "C0"
return "DONE"
def invalidate_downstream_units(table: UnitsTable, *, root_unit_id: str) -> list[str]:
    """Reset all transitive downstream dependents of `root_unit_id` to TODO.

    This is used when a previously satisfied upstream unit is reopened for rerun.
    Keeping downstream units as DONE would otherwise leave stale artifacts in place
    and make later `run` invocations stop too early.
    """
    root = str(root_unit_id or "").strip()
    if not root:
        return []
    # Map each unit id to the rows that list it in `depends_on`.
    direct_children: dict[str, list[dict[str, str]]] = {}
    for row in table.rows:
        for dep in parse_semicolon_list(row.get("depends_on")):
            direct_children.setdefault(dep, []).append(row)
    affected: list[str] = []
    seen: set[str] = set()
    stack = [root]
    # Depth-first walk over dependents; `seen` guards against revisits and cycles.
    while stack:
        current = stack.pop()
        for child in direct_children.get(current, []):
            child_id = str(child.get("unit_id") or "").strip()
            if not child_id or child_id in seen:
                continue
            seen.add(child_id)
            if str(child.get("status") or "").strip().upper() != "TODO":
                child["status"] = "TODO"
                affected.append(child_id)
            stack.append(child_id)
    return affected
def _strip_optional_marker(relpath: str) -> str:
relpath = (relpath or "").strip()
if relpath.startswith("?"):
return relpath[1:].strip()
return relpath
def _reroute_hint(workspace: Path) -> str:
path = workspace / "output" / "REROUTE_STATE.json"
if not path.exists() or path.stat().st_size <= 0:
return ""
try:
import json
data = json.loads(path.read_text(encoding="utf-8", errors="ignore") or "{}")
except Exception:
return ""
if not isinstance(data, dict):
return ""
target = str(data.get("reroute_target") or "").strip()
status = str(data.get("status") or "").strip()
phase = str(data.get("structure_phase") or "").strip()
h3 = str(data.get("h3_status") or "").strip()
reason = str(data.get("reroute_reason") or "").strip()
if not any([target, status, phase, h3]):
return ""
parts = []
if status:
parts.append(f"reroute_status={status}")
if target:
parts.append(f"reroute_target={target}")
if phase:
parts.append(f"structure_phase={phase}")
if h3:
parts.append(f"h3_status={h3}")
if reason:
parts.append(f"reason={reason}")
return ", ".join(parts)
FILE:tooling/ideation.py
from __future__ import annotations
import json
import re
from dataclasses import asdict, dataclass, is_dataclass
from pathlib import Path
from typing import Any, Iterable
from tooling.common import (
atomic_write_text,
ensure_dir,
load_workspace_pipeline_spec,
pipeline_overridable_query_fields,
pipeline_query_default,
read_jsonl,
)
DEFAULT_IDEA_RUBRIC: list[tuple[str, float, str]] = [
("discussion_worthiness", 0.24, "is this direction worth a serious PI/PhD discussion right now?"),
("academic_value", 0.22, "could it open a meaningful paper/thesis-worthy research angle?"),
("evidence_grounding", 0.18, "can we point to concrete literature tensions rather than pure speculation?"),
("direction_distinctness", 0.16, "is it genuinely different from neighboring directions, not just a wording variant?"),
("first_probe_clarity", 0.10, "is there a plausible low-cost first probe that would teach us something?"),
("thesis_potential", 0.10, "does it plausibly grow beyond a one-off note into a stronger line of work?"),
]
STOPWORDS = {
"a", "an", "and", "the", "of", "to", "for", "in", "on", "with", "via", "by", "from", "or",
"work", "works", "study", "studies", "survey", "review", "benchmarks", "benchmark", "evaluation",
"agents", "agent", "model", "models", "llm", "large", "language", "using", "based",
}
CLUSTER_AXIS_HINTS: list[tuple[list[str], list[str], str, str]] = [
(["agent loop", "action"], ["observability granularity", "action-space design", "tool/environment boundary"], "mechanism", "Could sharpen how we think about the basic agent loop rather than adding yet another full stack."),
(["tool", "orchestration", "interface"], ["tool routing", "permission model", "API reliability assumptions"], "systems", "Could clarify whether interface design is the real bottleneck behind seemingly agent-level gains."),
(["planning", "reasoning"], ["search depth", "verification loop", "partial-observability handling"], "mechanism", "Could turn vague talk about reasoning quality into a sharper research question about what actually drives robust planning."),
(["memory", "retrieval", "rag"], ["retrieval policy", "state summarization cadence", "memory horizon"], "mechanism", "Could reveal when memory is a genuine capability lever versus a reporting convenience."),
(["self-improvement", "adaptation"], ["feedback type", "adaptation horizon", "evaluation signal quality"], "adaptation", "Could distinguish genuine improvement from evaluation-sensitive self-optimization."),
(["multi-agent", "coordination"], ["role specialization", "communication budget", "aggregation rule"], "coordination", "Could separate coordination benefits from simple redundancy or verification effects."),
(["benchmark", "evaluation"], ["metric choice", "task slice design", "failure-analysis lens"], "evaluation", "Could make evaluation choices themselves a research object instead of hidden background assumptions."),
(["safety", "security", "governance"], ["threat model", "monitoring granularity", "permission boundary"], "governance", "Could connect technical agent behaviors to deployment-relevant risk questions in a more principled way."),
]
GENERIC_LIMITATION_PATTERNS = [
"abstract-level evidence only",
"validate assumptions",
"full paper",
"before relying on this as key evidence",
]
BENCHMARK_HINTS = [
"HotpotQA", "FEVER", "ALFWorld", "WebShop", "HumanEval", "WebArena", "AgentBench",
"GSM8K", "MATH", "Wikipedia", "wiki", "API", "question answering", "fact verification",
]
AXIS_INSIGHT_LIBRARY: dict[str, dict[str, str]] = {
"observability granularity": {
"question": "whether several agent-loop gains are really planner gains, or whether they mostly come from changing what the planner gets to observe and when",
"confound": "planner quality and broader agent competence",
"thesis": "Several agent-loop gains remain hard to interpret because papers often improve observation access at the same time they improve the planner; before crediting planner depth, we should ask what the system was allowed to see.",
"insight": "A convincing result would not just move aggregate score. It would show which published gains survive once observation access is fixed, ideally producing a regime map of when observability—not planner depth—changes the failure story.",
"contribution_shape": "Could yield a causal-attribution result plus a reporting rule for agent-loop papers: claims about planning quality should specify and control observation access.",
"demotion": "This direction weakens sharply if the strongest prior work already holds observation access fixed while planner quality changes and still reports the same gain pattern.",
"kill_signal": "an anchor paper already fixes observation access while varying planner quality and the main conclusion still survives",
"missing_piece": "What is missing is a fixed-interface, fixed-budget comparison that varies only observation access and tracks whether the failure taxonomy—not just average score—changes.",
"program_kind": "causal attribution",
"time_to_clarity": "fast",
"priority_note": "Fastest to falsify on public tasks, and the answer would immediately change how several agent-loop results are interpreted.",
"reading_extract": "what the agent sees at each step, which ablations vary planner depth, and whether the action interface stays fixed",
"title": "Observability granularity vs planner depth",
},
"action-space design": {
"question": "whether robustness comes from better reasoning or from giving the agent a cleaner action interface",
"confound": "the shape of the action vocabulary and interface design",
"thesis": "Some agent-loop gains may be interface gains in disguise: when action vocabularies become cleaner or narrower, papers can attribute robustness to reasoning improvements that partly come from action-space design.",
"insight": "A useful result would show whether the same nominal planner still behaves very differently once the action space is widened, normalized, or made less ergonomic.",
"contribution_shape": "Could produce an action-space normalization protocol and a clearer account of which agent claims survive once interface ergonomics are controlled.",
"demotion": "This direction weakens if current papers already compare equivalent action spaces and still obtain the same ranking.",
"kill_signal": "the key anchor papers already normalize action vocabularies or API surfaces and still see the same ordering",
"missing_piece": "What is missing is an action-space-normalized comparison that keeps planner prompts fixed while changing only interface granularity and affordances.",
"program_kind": "interface normalization",
"time_to_clarity": "medium",
"priority_note": "Interesting and still thesis-relevant, but less immediate than observability because interface normalization usually requires more careful task redesign.",
"reading_extract": "how many actions are available, how semantically aligned the actions are, and whether interface cleanup happens together with reasoning changes",
"title": "Action-space design or agent competence?",
},
"tool/environment boundary": {
"question": "whether reliability depends more on boundary design between tool and environment than on the nominal reasoning loop",
"confound": "tool abstraction and environment modeling",
"thesis": "A number of agent results may really be about boundary design: where the system stops reasoning internally and starts delegating to tools can change reliability without changing the nominal planner much.",
"insight": "A strong result would show that several agent-level conclusions are actually boundary-design conclusions once tool abstraction is normalized.",
"contribution_shape": "Could yield a boundary-design taxonomy and a cleaner experimental recipe for separating planner quality from tool/interface scaffolding.",
"demotion": "This direction weakens if changing the boundary carefully barely affects the failure story.",
"kill_signal": "careful tool-boundary changes leave both the metric and the qualitative failure modes essentially unchanged",
"missing_piece": "What is missing is a comparison that keeps task semantics fixed while moving the tool/environment boundary in a controlled way.",
"program_kind": "systems boundary",
"time_to_clarity": "medium",
"priority_note": "Valuable as a systems-facing thesis line, but slower to turn into a decisive first readout than the lead mechanism questions.",
"reading_extract": "where tool calls begin, what state is exposed to the planner, and whether environment modeling changes together with the tool boundary",
"title": "Where the tool boundary really matters",
},
"search depth": {
"question": "whether apparent planning gains reflect deeper search or simply more inference-time budget to recover from weak initial choices",
"confound": "inference-time compute budget",
"thesis": "Many planning results still bundle depth with budget: deeper search often spends more tokens, branches, or retries, so it remains unclear whether depth changes reasoning quality or just buys more recovery opportunities.",
"insight": "A convincing result would show whether depth changes the nature of planning failures under a matched budget, rather than merely delaying failure by spending more compute.",
"contribution_shape": "Could produce a compute-normalized planning benchmark slice and a regime map for when search depth matters beyond extra inference budget.",
"demotion": "This direction weakens if prior work already equalizes budget and still finds depth-specific gains.",
"kill_signal": "the anchor papers already normalize token or wall-clock budget and the depth advantage remains intact",
"missing_piece": "What is missing is a compute-normalized study that holds token or wall-clock budget fixed while varying depth or branching, then checks whether failure modes actually change.",
"program_kind": "budget-normalized mechanism",
"time_to_clarity": "medium",
"priority_note": "Still thesis-sized, but it ranks behind observability because compute normalization is harder to defend and easier for prior work to have addressed already.",
"reading_extract": "the reported depth or branching settings, the effective compute budget, and whether shallow baselines were budget-matched",
"title": "Search depth or compute budget?",
},
"verification loop": {
"question": "whether verification contributes a distinct reasoning mechanism or mostly acts as expensive redundancy",
"confound": "extra compute and repeated checking",
"thesis": "Verification-heavy pipelines may look stronger because they retry, re-check, or filter more often, not necessarily because they add a distinct reasoning mechanism.",
"insight": "A useful result would separate verification as a mechanism from verification as a compute-expensive retry policy.",
"contribution_shape": "Could produce a cleaner protocol for comparing verification loops under matched retry or compute budgets.",
"demotion": "This direction weakens if verification can be replaced by simple repetition with little change in behavior.",
"kill_signal": "repeat-sampling or retry baselines already match the reported verification gain",
"missing_piece": "What is missing is a matched-budget comparison between explicit verification and simple repetition or reranking baselines.",
"program_kind": "verification audit",
"time_to_clarity": "medium",
"priority_note": "Decision-relevant when the group suspects redundancy masquerading as reasoning, though the probe is less clean than the top two directions.",
"reading_extract": "how many extra passes are spent on checking, whether retry baselines exist, and which errors verification actually fixes",
"title": "Verification or just expensive redundancy?",
},
"partial-observability handling": {
"question": "whether planning systems fail because they reason badly or because they are brittle under missing state information",
"confound": "state visibility",
"thesis": "Some planning failures may be epistemic before they are algorithmic: systems can look like weak planners when they are actually brittle under missing state information.",
"insight": "A strong result would show whether improved visibility changes the failure taxonomy more than planner changes do.",
"contribution_shape": "Could produce a planning-under-partial-observability benchmark slice and a clearer division between epistemic and algorithmic failure modes.",
"demotion": "This direction weakens if stronger visibility barely changes the failure pattern.",
"kill_signal": "improved visibility leaves both success rates and error types almost unchanged",
"missing_piece": "What is missing is a task slice that varies only state visibility while keeping planner scaffolding steady.",
"program_kind": "epistemic failure analysis",
"time_to_clarity": "medium",
"priority_note": "Useful if the group wants a deeper failure-analysis program, but it is currently less concrete than the lead directions.",
"reading_extract": "what information is hidden, which observations are restored by tools, and whether planner quality is tested separately from visibility",
"title": "Is planning failure really an observability problem?",
},
"retrieval policy": {
"question": "whether memory gains come from better stored knowledge or from better decisions about when and how retrieval is triggered",
"confound": "memory content versus retrieval timing",
"thesis": "Reported memory gains are often hard to interpret because memory content and retrieval triggers move together; some improvements may be protocol effects about when retrieval happens, not capability effects about what memory stores.",
"insight": "A convincing result would separate memory-content effects from retrieval-trigger effects and show whether the interpretation of current RAG-style gains changes once trigger policy is normalized.",
"contribution_shape": "Could yield a cleaner memory-evaluation protocol and a reusable result on when retrieval policy, rather than memory content, is the true hidden variable.",
"demotion": "This direction weakens if current work already varies retrieval policy independently of memory content and sees the same story.",
"kill_signal": "the main memory papers already keep the memory store fixed while varying retrieval triggers or timing and the conclusion does not move",
"missing_piece": "What is missing is a fixed-memory comparison that varies retrieval trigger and timing only, then checks whether the gain survives and which errors it actually removes.",
"program_kind": "protocol sensitivity",
"time_to_clarity": "medium",
"priority_note": "Different enough from the planning directions to stay in the lead set, but it ranks lower because the current evidence is thinner and more protocol-sensitive.",
"reading_extract": "what is stored in memory, when retrieval fires, what retrieval budget is allowed, and whether trigger policy is ever ablated independently",
"title": "Retrieval policy or memory content?",
},
"state summarization cadence": {
"question": "whether summarization helps because it improves reasoning or because it selectively compresses away troublesome state information",
"confound": "what gets forgotten versus what gets highlighted",
"thesis": "Summarization may help less because the agent reasons better and more because the system filters state in a favorable way; cadence choices can quietly redefine what information survives.",
"insight": "A useful result would show whether different summarization cadences preserve the same conclusions and failure taxonomy once memory content is held steady.",
"contribution_shape": "Could produce a summarization-control protocol for long-horizon agents and a clearer account of how state compression alters conclusions.",
"demotion": "This direction weakens if different summarization cadences preserve the same conclusions and failure taxonomy.",
"kill_signal": "changing summarization cadence leaves both retrieval behavior and downstream errors largely unchanged",
"missing_piece": "What is missing is a fixed-memory, fixed-task comparison that varies only summarization cadence and records what state is lost or amplified.",
"program_kind": "state compression",
"time_to_clarity": "medium",
"priority_note": "Promising but currently less mature than retrieval-policy framing because the concrete prior-work hooks are thinner.",
"reading_extract": "what gets summarized away, how often summaries are rewritten, and whether failure cases correlate with state compression choices",
"title": "What summarization cadence is really changing",
},
"memory horizon": {
"question": "whether longer memory helps because more context is useful or because evaluations reward persistence over selectivity",
"confound": "context length and evaluation design",
"thesis": "Longer memory can look like better capability even when the benchmark mainly rewards persistence or repeated access to earlier state, so horizon effects may partly be evaluation-design effects.",
"insight": "A convincing result would show whether extending horizon changes task framing and error patterns, not just aggregate score.",
"contribution_shape": "Could produce a regime map for when horizon length matters, versus when selective retrieval or better task design explains the gain.",
"demotion": "This direction weakens if memory horizon has little effect once retrieval policy is controlled.",
"kill_signal": "memory-length changes stop mattering once retrieval triggers or evaluation design are normalized",
"missing_piece": "What is missing is a horizon study that matches retrieval policy and benchmark framing while varying how much state is retained.",
"program_kind": "regime mapping",
"time_to_clarity": "medium",
"priority_note": "Conceptually useful, but currently less decisive than retrieval-policy framing for a first discussion round.",
"reading_extract": "whether longer context actually changes retrieval choices, and whether the benchmark rewards persistence more than selective recall",
"title": "Does longer memory really help?",
},
}


@dataclass(frozen=True)
class IdeaSignal:
    signal_id: str
    cluster: str
    direction_type: str
    theme: str
    claim_or_observation: str
    tension: str
    missing_piece: str
    possible_axis: str
    academic_value: str
    evidence_confidence: str
    paper_ids: list[str]


@dataclass(frozen=True)
class DirectionCard:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    focus_axis: str
    main_confound: str
    program_kind: str
    contribution_shape: str
    time_to_clarity: str
    one_line_thesis: str
    why_interesting: str
    literature_suggests: list[str]
    closest_prior_gap: list[str]
    missing_piece: str
    possible_variants: list[str]
    academic_value: str
    first_probes: list[str]
    what_counts_as_insight: str
    weakness_conditions: list[str]
    kill_criteria: list[str]
    what_would_change_mind: list[str]
    best_fit: str
    why_this_ranks_here: str
    evidence_confidence: str
    paper_ids: list[str]
    signal_ids: list[str]
    anchor_reading_notes: list[dict[str, str]]


@dataclass(frozen=True)
class ScreenedDirection:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    total_score: float
    discussion_worthiness: int
    academic_value_score: int
    evidence_grounding: int
    direction_distinctness: int
    first_probe_clarity: int
    thesis_potential: int
    recommendation: str
    rationale: str


def read_core_set(path: Path) -> list[dict[str, str]]:
    """Read the core-set CSV into a list of row dicts; a missing file yields []."""
    import csv

    if not path.exists():
        return []
    with path.open("r", encoding="utf-8", newline="") as handle:
        return [dict(row) for row in csv.DictReader(handle)]


def write_jsonl(path: Path, rows: Iterable[dict[str, Any] | Any]) -> None:
    """Write one JSON object per line; dataclass rows are converted with asdict()."""
    ensure_dir(path.parent)
    encoded: list[str] = []
    for row in rows:
        value = asdict(row) if is_dataclass(row) else row
        encoded.append(json.dumps(value, ensure_ascii=False))
    atomic_write_text(path, "\n".join(encoded).rstrip() + ("\n" if encoded else ""))


def write_json(path: Path, data: Any) -> None:
    ensure_dir(path.parent)
    atomic_write_text(path, json.dumps(data, ensure_ascii=False, indent=2).rstrip() + "\n")


def uniq_keep_order(items: Iterable[str]) -> list[str]:
    """Strip each item and drop empties and duplicates, keeping first-seen order."""
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or "").strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out
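

def slugify(text: str) -> str:
    """Lowercase `text` and join its alphanumeric runs with hyphens; empty input gives "x"."""
    raw = re.sub(r"[^A-Za-z0-9]+", "-", str(text or "").strip().lower())
    return raw.strip("-") or "x"


def clean_text(text: str, *, limit: int = 220) -> str:
    """Collapse whitespace, replace pipes, trim quotes, and clip to `limit` at a word boundary."""
    s = str(text or "").strip()
    s = s.replace("\n", " ")
    s = re.sub(r"\s+", " ", s)
    # Pipes would break Markdown table cells, so turn them into commas.
    s = s.replace("|", ", ")
    s = s.strip(" \"'`")
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()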


def clean_sentence(text: str, *, limit: int = 180) -> str:
    """Clip `text` to about `limit` characters, preferring whole sentences."""
    s = clean_text(text, limit=max(limit * 2, limit))
    if len(s) <= limit:
        return s
    # Look slightly past the limit so a sentence ending just beyond it is still seen.
    window = s[:limit + 40]
    pieces = re.split(r'(?<=[.!?;])\s+', window)
    acc: list[str] = []
    total = 0
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue
        nxt = total + len(piece) + (1 if acc else 0)
        if nxt > limit and acc:
            break
        acc.append(piece)
        total = nxt
        # Stop once the kept sentences fill at least 65% of the limit.
        if total >= int(limit * 0.65):
            break
    if acc:
        return " ".join(acc).strip()
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()


def markdown_table(headers: list[str], rows: list[list[str]]) -> str:
    head = "| " + " | ".join(headers) + " |"
    sep = "| " + " | ".join(["---"] * len(headers)) + " |"
    body = [
        "| " + " | ".join(str(cell).replace("\n", " ").strip() for cell in row) + " |"
        for row in rows
    ]
    return "\n".join([head, sep] + body)


def write_markdown(path: Path, text: str) -> None:
    ensure_dir(path.parent)
    atomic_write_text(path, text.rstrip() + "\n")
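

def extract_goal_from_goal_md(path: Path) -> str:
    """Return the first non-empty, non-heading line of the file, or "research ideas"."""
    if not path.exists():
        return "research ideas"
    # Test the stripped line so that indented `#` headings are skipped as well.
    lines = [
        ln.strip()
        for ln in path.read_text(encoding="utf-8", errors="ignore").splitlines()
        if ln.strip() and not ln.strip().startswith("#")
    ]
    return lines[0] if lines else "research ideas"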


def _idea_workspace_from_brief(path: Path) -> Path | None:
    try:
        return path.resolve().parents[2]
    except Exception:
        return None


def _require_positive_float(value: Any, *, field_name: str) -> float:
    try:
        parsed = float(value)
    except Exception as exc:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
    if parsed <= 0:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return parsed


def _query_int_override(workspace: Path, key: str, default: int) -> int:
    """Return a positive-integer override for `key` from queries.md, else `default`.

    Only keys listed by pipeline_overridable_query_fields() are honored. A matching
    `- key: value` line whose value is empty, non-integer, or non-positive raises ValueError.
    """
    queries_path = workspace / "queries.md"
    normalized = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
    if not normalized:
        return int(default)
    if normalized not in pipeline_overridable_query_fields(workspace):
        return int(default)
    if not queries_path.exists():
        return int(default)
    for raw in queries_path.read_text(encoding="utf-8", errors="ignore").splitlines():
        line = raw.strip()
        if not line.startswith("- ") or ":" not in line:
            continue
        # Use a separate name so the `key` parameter is not shadowed.
        line_key, value = line[2:].split(":", 1)
        line_key = line_key.strip().lower().replace(" ", "_").replace("-", "_")
        if line_key != normalized:
            continue
        # Drop a trailing `# comment` and surrounding quotes before parsing.
        cleaned = value.split("#", 1)[0].strip().strip('"').strip("'")
        if not cleaned:
            raise ValueError(f"Invalid ideation override: `{normalized}` is empty in queries.md.")
        try:
            parsed = int(cleaned)
        except Exception as exc:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be an integer in queries.md.") from exc
        if parsed <= 0:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be a positive integer in queries.md.")
        return parsed
    return int(default)
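

# Illustrative sketch, not part of the original module: the `- key: value  # comment`
# line format that _query_int_override accepts, isolated as a pure function.
# `_demo_parse_override_line` is a hypothetical name introduced only for illustration.
from typing import Optional


def _demo_parse_override_line(line: str) -> Optional[tuple[str, str]]:
    """Return (normalized_key, cleaned_value) for a `- key: value` line, else None."""
    stripped = line.strip()
    if not stripped.startswith("- ") or ":" not in stripped:
        return None
    key, value = stripped[2:].split(":", 1)
    # Keys are compared case-insensitively, with spaces and hyphens folded to underscores.
    key = key.strip().lower().replace(" ", "_").replace("-", "_")
    # Cut a trailing `# comment`, then strip surrounding quotes.
    value = value.split("#", 1)[0].strip().strip('"').strip("'")
    return key, value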


def _validate_score_weights(value: Any, *, field_name: str) -> dict[str, float]:
    """Require a positive weight for every scoring criterion and normalize them to sum to 1."""
    # Only the keys of `required` are enforced; the numbers here are never read
    # and serve as a reference weighting.
    required = {
        "discussion_worthiness": 0.24,
        "academic_value": 0.22,
        "evidence_grounding": 0.18,
        "direction_distinctness": 0.16,
        "first_probe_clarity": 0.10,
        "thesis_potential": 0.10,
    }
    if not isinstance(value, dict):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    weights: dict[str, float] = {}
    for key in required:
        if key not in value:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{key}")
        weights[key] = _require_positive_float(value.get(key), field_name=f"{field_name}.{key}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return {key: round(weight / total, 6) for key, weight in weights.items()}
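

# Illustrative sketch, not part of the original module: one way normalized weights such
# as those returned above could be combined with per-criterion scores into one total.
# Assumptions: `_demo_weighted_total` is a hypothetical helper, and the score keys are
# taken to match the weight keys exactly; the module's real scoring code may differ.
def _demo_weighted_total(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of `scores` under `weights` (weights are re-normalized to sum to 1)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return round(sum(scores[key] * weight / total_weight for key, weight in weights.items()), 3)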


def _validate_diversity_axes(value: Any, *, field_name: str) -> list[str]:
    allowed = {"cluster", "direction_type", "program_kind"}
    if not isinstance(value, list):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    axes: list[str] = []
    seen: set[str] = set()
    for item in value:
        axis = str(item or "").strip().lower().replace("-", "_")
        if not axis:
            continue
        if axis not in allowed:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{axis}")
        if axis in seen:
            continue
        seen.add(axis)
        axes.append(axis)
    if not axes:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return axes


def resolve_idea_contract(workspace: Path) -> dict[str, Any]:
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None:
        raise ValueError("Missing active pipeline contract.")
    query_defaults = dict(spec.query_defaults)
    quality_contract = dict(spec.quality_contract)

    def _require_positive_int(value: Any, *, field_name: str) -> int:
        try:
            parsed = int(value)
        except Exception as exc:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
        if parsed <= 0:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
        return parsed

    signal_policy = quality_contract.get("signal_policy") or {}
    direction_policy = quality_contract.get("direction_policy") or {}
    screening_policy = quality_contract.get("screening_policy") or {}
    brief = parse_idea_brief(workspace / "output" / "trace" / "IDEA_BRIEF.md")
    focus_clusters = [str(x).strip() for x in (brief.get("focus_clusters") or []) if str(x).strip()]

    # Defaults from the pipeline spec.
    direction_pool_min_default = _require_positive_int(query_defaults.get("direction_pool_min"), field_name="query_defaults.direction_pool_min")
    direction_pool_max_default = _require_positive_int(query_defaults.get("direction_pool_max"), field_name="query_defaults.direction_pool_max")
    shortlist_size_default = _require_positive_int(query_defaults.get("idea_shortlist_size"), field_name="query_defaults.idea_shortlist_size")
    report_top_n_default = _require_positive_int(query_defaults.get("report_top_n"), field_name="query_defaults.report_top_n")
    idea_screen_top_n_default = _require_positive_int(query_defaults.get("idea_screen_top_n"), field_name="query_defaults.idea_screen_top_n")

    # Quality-contract policies.
    shortlist_min = _require_positive_int(direction_policy.get("shortlist_min"), field_name="quality_contract.direction_policy.shortlist_min")
    shortlist_max = _require_positive_int(direction_policy.get("shortlist_max"), field_name="quality_contract.direction_policy.shortlist_max")
    keep_min = _require_positive_int(direction_policy.get("keep_min"), field_name="quality_contract.direction_policy.keep_min")
    cluster_diversity_min = _require_positive_int(direction_policy.get("cluster_diversity_min"), field_name="quality_contract.direction_policy.cluster_diversity_min")
    lead_diversity_target = _require_positive_int(direction_policy.get("lead_diversity_target"), field_name="quality_contract.direction_policy.lead_diversity_target")
    lead_diversity_axes = _validate_diversity_axes(
        direction_policy.get("lead_diversity_axes"),
        field_name="quality_contract.direction_policy.lead_diversity_axes",
    )
    signal_table_min = _require_positive_int(signal_policy.get("min_rows"), field_name="quality_contract.signal_policy.min_rows")
    keep_rank_max = _require_positive_int(screening_policy.get("keep_rank_max"), field_name="quality_contract.screening_policy.keep_rank_max")
    maybe_rank_max = _require_positive_int(screening_policy.get("maybe_rank_max"), field_name="quality_contract.screening_policy.maybe_rank_max")
    score_weights = _validate_score_weights(
        screening_policy.get("score_weights"),
        field_name="quality_contract.screening_policy.score_weights",
    )

    # Per-workspace overrides from queries.md.
    direction_pool_min = _query_int_override(workspace, "direction_pool_min", direction_pool_min_default)
    direction_pool_max = _query_int_override(workspace, "direction_pool_max", direction_pool_max_default)
    shortlist_size = _query_int_override(workspace, "idea_shortlist_size", shortlist_size_default)
    report_top_n = _query_int_override(workspace, "report_top_n", report_top_n_default)
    idea_screen_top_n = _query_int_override(workspace, "idea_screen_top_n", idea_screen_top_n_default)

    # Cross-field consistency checks.
    if direction_pool_min > direction_pool_max:
        raise ValueError(f"Invalid ideation contract: direction_pool_min ({direction_pool_min}) exceeds direction_pool_max ({direction_pool_max}).")
    if shortlist_size < shortlist_min or shortlist_size > shortlist_max:
        raise ValueError(
            f"Invalid ideation contract: idea_shortlist_size ({shortlist_size}) must stay within "
            f"[{shortlist_min}, {shortlist_max}]."
        )
    if report_top_n > shortlist_size:
        raise ValueError(f"Invalid ideation contract: report_top_n ({report_top_n}) exceeds idea_shortlist_size ({shortlist_size}).")
    if maybe_rank_max < keep_rank_max:
        raise ValueError(
            f"Invalid ideation contract: maybe_rank_max ({maybe_rank_max}) must be >= keep_rank_max ({keep_rank_max})."
        )
    if idea_screen_top_n > direction_pool_max:
        raise ValueError(f"Invalid ideation contract: idea_screen_top_n ({idea_screen_top_n}) exceeds direction_pool_max ({direction_pool_max}).")
    return {
        "focus_clusters": focus_clusters,
        "direction_pool_min": direction_pool_min,
        "direction_pool_max": direction_pool_max,
        "idea_screen_top_n": idea_screen_top_n,
        "shortlist_size": shortlist_size,
        "report_top_n": report_top_n,
        "shortlist_min": shortlist_min,
        "shortlist_max": shortlist_max,
        "signal_table_min": signal_table_min,
        "keep_min": keep_min,
        "cluster_diversity_min": cluster_diversity_min,
        "lead_diversity_target": lead_diversity_target,
        "lead_diversity_axes": lead_diversity_axes,
        "keep_rank_max": keep_rank_max,
        "maybe_rank_max": maybe_rank_max,
        "score_weights": score_weights,
    }


def parse_idea_brief(path: Path) -> dict[str, Any]:
    text = path.read_text(encoding="utf-8", errors="ignore") if path.exists() else ""
    out: dict[str, Any] = {
        "goal": extract_goal_from_goal_md(path),
        "focus_clusters": [],
        "query_buckets": [],
        "exclusions": [],
        "constraints": [],
        "targets": {},
    }
    cur = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith("## "):
            cur = line[3:].strip().lower()
            continue
        if cur == "goal" and line.startswith("- Topic:"):
            out["goal"] = line.split(":", 1)[1].strip()
        if cur in {"focus after c2", "focus lenses after c2"} and line.startswith("- Focus clusters:"):
            clusters = [x.strip() for x in line.split(":", 1)[1].split(";") if x.strip()]
            cleaned: list[str] = []
            for item in clusters:
                low = item.lower()
                if "to be filled" in low or "fill after c2" in low or "placeholder" in low:
                    continue
                cleaned.append(item)
            out["focus_clusters"] = cleaned
        if cur == "query buckets" and re.match(r"^\d+\.\s+", line):
            out["query_buckets"].append(re.sub(r"^\d+\.\s+", "", line).strip())
        if cur in {"exclude terms", "exclusions"} and line.startswith("- "):
            value = line[2:].strip()
            if value and not value.lower().startswith("none"):
                out["exclusions"].append(value)
        if cur == "constraints" and line.startswith("- "):
            out["constraints"].append(line[2:].strip())
        if cur == "targets" and line.startswith("- ") and ":" in line:
            key, value = line[2:].split(":", 1)
            out["targets"][key.strip().lower().replace(" ", "_")] = value.strip()
    return out


def collect_note_index(path: Path) -> dict[str, dict[str, Any]]:
    """Index JSONL note records by `paper_id`; later records overwrite earlier ones."""
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        pid = str(rec.get("paper_id") or "").strip()
        if pid:
            out[pid] = rec
    return out


def keywords_from_cluster(name: str) -> list[str]:
    toks = [t for t in re.findall(r"[A-Za-z0-9]+", str(name or "").lower()) if t not in STOPWORDS and len(t) > 2]
    return uniq_keep_order(toks)


def score_note_to_cluster(cluster_name: str, note: dict[str, Any]) -> int:
    """Count the cluster keywords that occur as substrings of the note's text fields."""
    keys = keywords_from_cluster(cluster_name)
    blob_parts = [str(note.get("title") or "")]
    for field in ["summary_bullets", "limitations", "key_results", "method", "abstract"]:
        val = note.get(field)
        if isinstance(val, list):
            blob_parts.extend([str(x) for x in val])
        elif isinstance(val, str):
            blob_parts.append(val)
    blob = " ".join(blob_parts).lower()
    return sum(1 for key in keys if key in blob)


def map_notes_to_clusters(taxonomy_path: Path, notes_path: Path) -> dict[str, list[dict[str, Any]]]:
    """Map each second-level taxonomy node to its (at most 8) best-matching notes."""
    import yaml

    taxonomy = yaml.safe_load(taxonomy_path.read_text(encoding="utf-8", errors="ignore")) if taxonomy_path.exists() else []
    notes = [r for r in read_jsonl(notes_path) if isinstance(r, dict)]
    out: dict[str, list[dict[str, Any]]] = {}
    for top in taxonomy or []:
        if not isinstance(top, dict):
            continue
        for child in top.get("children") or []:
            if not isinstance(child, dict):
                continue
            name = str(child.get("name") or "").strip()
            if not name:
                continue
            scored: list[tuple[int, dict[str, Any]]] = []
            for note in notes:
                s = score_note_to_cluster(name, note)
                if s > 0:
                    scored.append((s, note))
            # Highest score first; ties are broken by paper_id for deterministic output.
            scored.sort(key=lambda x: (-x[0], str(x[1].get("paper_id") or "")))
            out[name] = [n for _, n in scored[:8]]
    return out
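

# Illustrative sketch, not part of the original module: the keyword-overlap score used
# by score_note_to_cluster is a substring test, so a short cluster keyword can match
# inside a longer, unrelated word. `_demo_keyword_overlap` and its tiny stopword set are
# hypothetical; the module's own STOPWORDS constant is defined elsewhere.
import re


def _demo_keyword_overlap(
    cluster_name: str,
    text: str,
    stopwords: frozenset[str] = frozenset({"and", "the", "for"}),
) -> int:
    """Count distinct cluster keywords (length > 2, not stopwords) found in `text`."""
    tokens = [t for t in re.findall(r"[A-Za-z0-9]+", cluster_name.lower()) if t not in stopwords and len(t) > 2]
    blob = text.lower()
    # dict.fromkeys() de-duplicates while keeping first-seen order.
    return sum(1 for t in dict.fromkeys(tokens) if t in blob)


# Under these assumptions, "Memory and retrieval" scores 2 against a text mentioning both
# words, while the keyword "rag" also scores 1 against "storage layer" (a false positive).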


def _cluster_profile(cluster: str) -> tuple[str, list[str], str]:
    low = cluster.lower()
    for keys, axes, direction_type, academic_value in CLUSTER_AXIS_HINTS:
        if any(key in low for key in keys):
            return direction_type, axes, academic_value
    return "research", ["assumption sensitivity", "failure analysis", "scope boundary"], "Could sharpen the way this sub-area is framed and compared."


def _note_bullets(note: dict[str, Any], field: str) -> list[str]:
    val = note.get(field)
    if isinstance(val, list):
        return [clean_text(x, limit=220) for x in val if clean_text(x, limit=220)]
    if isinstance(val, str) and clean_text(val, limit=220):
        return [clean_text(val, limit=220)]
    return []


def _specific_limitations(note: dict[str, Any]) -> list[str]:
    items = _note_bullets(note, "limitations")
    specific = []
    for item in items:
        low = item.lower()
        if any(pat in low for pat in GENERIC_LIMITATION_PATTERNS):
            continue
        specific.append(item)
    return specific or items
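

def _evidence_confidence(notes: list[dict[str, Any]]) -> str:
    """Grade evidence from "low" to "medium-high" based on note coverage."""
    if not notes:
        return "low"
    if any(str(n.get("evidence_level") or "").strip() == "fulltext" for n in notes):
        return "medium-high"
    # _specific_limitations() falls back to the generic items when nothing specific
    # remains, so re-check the first item before counting the note as specific.
    specific = 0
    for n in notes:
        limitations = _specific_limitations(n)
        if limitations and not any(p in limitations[0].lower() for p in GENERIC_LIMITATION_PATTERNS):
            specific += 1
    if len(notes) >= 3 and specific >= 2:
        return "medium"
    return "low-medium"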
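

# Illustrative sketch, not part of the original module: the tier ladder applied by
# _evidence_confidence, restated as a pure function of three summary values.
# `_demo_confidence_tier` and its parameters are hypothetical names for illustration.
def _demo_confidence_tier(n_notes: int, has_fulltext: bool, n_specific: int) -> str:
    """Map note counts to a confidence tier, checking the strongest condition first."""
    if n_notes == 0:
        return "low"
    if has_fulltext:
        # A single full-text note outranks any number of abstract-only notes.
        return "medium-high"
    if n_notes >= 3 and n_specific >= 2:
        return "medium"
    return "low-medium"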


def _axis_profile(axis: str, cluster: str) -> dict[str, str]:
    base = AXIS_INSIGHT_LIBRARY.get(axis, {})
    confound = base.get("confound", "nearby design choices and evaluation framing")
    return {
        "question": base.get("question", f"whether {axis} is doing more conceptual work in {cluster.lower()} than current papers make explicit"),
        "confound": confound,
        "thesis": base.get("thesis", f"Current results in {cluster.lower()} may be hard to interpret because {axis} still moves together with {confound}."),
        "insight": base.get("insight", f"A meaningful result would show whether {axis} changes how we interpret the current literature rather than merely how we narrate it."),
        "contribution_shape": base.get("contribution_shape", f"Could turn {axis} into a cleaner explanatory variable for {cluster.lower()} rather than a background convenience."),
        "demotion": base.get("demotion", f"This direction would weaken if the strongest prior work already isolates {axis} cleanly."),
        "kill_signal": base.get("kill_signal", f"the strongest prior work already isolates {axis} against {confound}"),
        "missing_piece": base.get("missing_piece", f"What is missing is a cleaner comparison that varies {axis} while keeping {confound} fixed enough to change interpretation, not just presentation."),
        "program_kind": base.get("program_kind", "mechanism clarification"),
        "time_to_clarity": base.get("time_to_clarity", "medium"),
        "priority_note": base.get("priority_note", f"Worth discussion because it could turn {axis} into a sharper explanatory wedge for {cluster.lower()}."),
        "reading_extract": base.get("reading_extract", f"which variables change together with {axis}, what comparator is used, and whether any ablation already fixes {confound}"),
        "title": base.get("title", f"What {axis} is really doing"),
    }
def _result_fact(note: dict[str, Any]) -> str:
    candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
    # Prefer bullets that mention a metric keyword or contain any digit.
    scored: list[str] = []
    for cand in candidates:
        low = cand.lower()
        if any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "outperform", "achiev", "%"]) or any(ch.isdigit() for ch in cand):
            scored.append(cand)
    chosen = scored[0] if scored else _claim_text(note)
    # Drop a leading "for example" / "e.g." so the fact reads as a statement.
    chosen = re.sub(r"^(for instance|for example|e\.g\.)[:,]?\s*", "", chosen, flags=re.IGNORECASE)
    low = chosen.lower()
    tasks = _extract_task_mentions(note)
    task_text = "/".join(tasks[:2]) if tasks else "reported setting"
    percents = re.findall(r"\d+(?:\.\d+)?%", chosen)
    if "pass@1" in low and percents:
        if len(percents) >= 2:
            return clean_text(f"{task_text}: {percents[0]} pass@1 vs {percents[1]} in the reported comparison.", limit=95)
        return clean_text(f"{task_text}: {percents[0]} pass@1 in the reported comparison.", limit=95)
    if "success rate" in low and len(percents) >= 2:
        baseline = " over imitation/RL baselines" if "imitation" in low or "reinforcement learning" in low else " in the reported comparison"
        return clean_text(f"{task_text}: {percents[0]}/{percents[1]} success-rate gains{baseline}.", limit=95)
    if len(percents) >= 2:
        return clean_text(f"{task_text}: {percents[0]} vs {percents[1]} in the reported comparison.", limit=95)
    if len(percents) == 1:
        return clean_text(f"{task_text}: reported result reaches {percents[0]} in the abstract.", limit=95)
    # No percentage found: fall back to the (truncated) sentence itself.
    return clean_text(chosen, limit=95)
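The percentage handling in `_result_fact` can be tried in isolation. The following is a minimal sketch, not the module's code: `summarize_percents` and `PERCENT_RE` are hypothetical names, the `pass@1` and success-rate branches are left out, and `clean_text` truncation is not applied.

```python
import re

# Same pattern as in _result_fact: an integer or decimal followed by "%".
PERCENT_RE = re.compile(r"\d+(?:\.\d+)?%")


def summarize_percents(sentence: str, task_text: str = "reported setting") -> str:
    """Hypothetical helper: compress a result sentence to its first one or two percentages."""
    percents = PERCENT_RE.findall(sentence)
    if len(percents) >= 2:
        # Assumes the first two percentages are the two sides of one comparison.
        return f"{task_text}: {percents[0]} vs {percents[1]} in the reported comparison."
    if len(percents) == 1:
        return f"{task_text}: reported result reaches {percents[0]}."
    # Nothing to compress: return the sentence unchanged.
    return sentence


if __name__ == "__main__":
    print(summarize_percents("Reaches 71.2% versus 65% for the baseline", "HumanEval"))
```

Note that the summary depends purely on the order in which percentages appear in the sentence, so a bullet that lists the baseline first would have the two numbers swapped.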
def _metric_phrase(notes: list[dict[str, Any]]) -> str:
    blob = " ".join(_result_fact(note) for note in notes)
    low = blob.lower()
    # The first matching metric wins, so the order of these checks matters.
    if "pass@1" in low:
        return "pass@1 plus failure-type shifts"
    if "success rate" in low:
        return "success rate plus failure-type shifts"
    if "accuracy" in low:
        return "accuracy plus failure-type shifts"
    if "exact match" in low:
        return "exact match plus failure-type shifts"
    if "f1" in low:
        return "F1 plus failure-type shifts"
    if "%" in blob:
        return "the reported benchmark metric plus failure-type shifts"
    return "task success plus failure-type shifts"
def _sentence_has_concrete_hook(text: str) -> bool:
    raw = str(text or "")
    low = raw.lower()
    # Any digit counts as a concrete hook.
    if any(ch.isdigit() for ch in raw):
        return True
    # Otherwise look for a known metric or benchmark name.
    return any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "alfworld", "webshop", "hotpotqa", "fever", "humaneval", "webarena", "agentbench"])
def _extract_task_mentions(note: dict[str, Any]) -> list[str]:
    # Gather every text field that may name a benchmark.
    blob_parts: list[str] = []
    for field in ["key_results", "summary_bullets", "method", "abstract", "title"]:
        val = note.get(field)
        if isinstance(val, list):
            blob_parts.extend([str(x) for x in val])
        elif isinstance(val, str):
            blob_parts.append(val)
    blob = " ".join(blob_parts)
    low = blob.lower()
    positions: list[tuple[int, str]] = []
    # Case-insensitive hits from the configured hint list (first occurrence only).
    for hint in BENCHMARK_HINTS:
        pos = low.find(hint.lower())
        if pos >= 0:
            positions.append((pos, hint))
    # Case-sensitive hits for well-known benchmark names.
    for match in re.finditer(r"\b(?:HumanEval|ALFWorld|WebShop|HotpotQA|FEVER|WebArena|AgentBench|GSM8K|MATH)\b", blob):
        positions.append((match.start(), match.group(0)))
    # Return names in order of first appearance in the text.
    positions.sort(key=lambda item: (item[0], item[1]))
    return uniq_keep_order(name for _, name in positions)
def _task_phrase(notes: list[dict[str, Any]]) -> str:
    tasks: list[str] = []
    for note in notes:
        tasks.extend(_extract_task_mentions(note))
    uniq = uniq_keep_order(tasks)
    if not uniq:
        return "a small public task slice"
    if len(uniq) == 1:
        return uniq[0]
    return "/".join(uniq[:2])
def _claim_text(note: dict[str, Any]) -> str:
    candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
    # Use the first bullet with at least eight words as the claim.
    for cand in candidates:
        if len(cand.split()) >= 8:
            return cand
    return clean_text(note.get("method") or note.get("title") or "", limit=220)
def _limitation_text(note: dict[str, Any], axis: str, cluster: str) -> str:
    limitations = _specific_limitations(note)
    # Use the paper's own first limitation unless it matches a generic pattern.
    if limitations and not any(pat in limitations[0].lower() for pat in GENERIC_LIMITATION_PATTERNS):
        return limitations[0]
    profile = _axis_profile(axis, cluster)
    return f"The paper still leaves unclear whether {axis} is genuinely responsible for the reported story in {cluster.lower()}, or whether that story is really driven by {profile['confound']}."
def _anchor_notes(note_index: dict[str, dict[str, Any]], paper_ids: list[str]) -> list[dict[str, Any]]:
    # Skip paper ids that have no note rather than raising KeyError.
    return [note_index[pid] for pid in paper_ids if pid in note_index]
def _paper_annotation(note: dict[str, Any], axis: str, cluster: str) -> str:
title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
profile = _axis_profile(axis, cluster)
result = _result_fact(note)
return clean_sentence(
f"{title}: {result} Open gap: {axis} still moves with {profile['confound']}.",
limit=300,
)
def _synthesis_annotation(notes: list[dict[str, Any]], axis: str, cluster: str) -> str:
profile = _axis_profile(axis, cluster)
task_phrase = _task_phrase(notes)
return clean_sentence(
f"Across the anchor papers, the live question is whether {axis} changes interpretation itself or merely rides along with {profile['confound']}, especially on {task_phrase}.",
limit=260,
)
def build_signal_rows(*, cluster: str, notes: list[dict[str, Any]]) -> list[IdeaSignal]:
if not notes:
return []
direction_type, axes, academic_value = _cluster_profile(cluster)
evidence_confidence = _evidence_confidence(notes)
anchors = notes[:3]
paper_ids = uniq_keep_order(str(n.get("paper_id") or "").strip() for n in anchors)
signals: list[IdeaSignal] = []
for idx, axis in enumerate(axes[:3], start=1):
anchor = anchors[(idx - 1) % len(anchors)]
claim = _claim_text(anchor)
tension = _limitation_text(anchor, axis, cluster)
profile = _axis_profile(axis, cluster)
missing_piece = clean_sentence(
f"What is still missing is a direction that isolates whether {axis} is the real explanatory variable in {cluster.lower()}, rather than leaving it entangled with {profile['confound']}.",
limit=240,
)
signals.append(IdeaSignal(
signal_id=f"SIG-{slugify(cluster)[:16]}-{idx}",
cluster=cluster,
direction_type=direction_type,
theme=f"{profile['title']} in {cluster}",
claim_or_observation=claim,
tension=clean_text(tension, limit=220),
missing_piece=missing_piece,
possible_axis=axis,
academic_value=academic_value,
evidence_confidence=evidence_confidence,
paper_ids=paper_ids,
))
return signals
def signal_table_markdown(rows: list[IdeaSignal]) -> str:
table_rows = []
for row in rows:
table_rows.append([
row.signal_id,
row.cluster,
row.theme,
row.claim_or_observation,
row.tension,
row.missing_piece,
row.possible_axis,
row.academic_value,
row.evidence_confidence,
", ".join(row.paper_ids),
])
return "\n".join([
"# IDEA_SIGNAL_TABLE",
"",
"## Research signals",
"",
markdown_table(["Signal ID", "Cluster", "Theme", "Claim / observation", "Tension", "Missing piece", "Possible axis", "Academic value", "Confidence", "Paper IDs"], table_rows),
"",
])
def _title_from_signal(signal: IdeaSignal) -> str:
    profile = _axis_profile(signal.possible_axis, signal.cluster)
    return clean_sentence(profile["title"], limit=72)

def _best_fit(direction_type: str, axis: str) -> str:
    if direction_type == "governance":
        return "Best fit when the group wants a deployment-relevant reading of agent behavior rather than another internal benchmark story."
    if direction_type == "evaluation":
        return "Best fit when the discussion is really about what counts as convincing evidence, not just how to raise scores."
    if axis in {"search depth", "verification loop", "retrieval policy"}:
        return "Best fit for a PhD discussion that wants one sharper mechanism/confound question rather than a broad systems buildout."
    return "Best fit for PI/PhD discussions that want a thesis-worthy mechanism question rather than a generic benchmark wrapper."
def signals_to_direction_cards(signals: list[IdeaSignal], *, note_index: dict[str, dict[str, Any]], focus_clusters: list[str], pool_min: int, pool_max: int) -> list[DirectionCard]:
focus = {str(x).strip() for x in focus_clusters if str(x).strip()}
cards: list[DirectionCard] = []
for idx, signal in enumerate(signals, start=1):
profile = _axis_profile(signal.possible_axis, signal.cluster)
anchors = _anchor_notes(note_index, signal.paper_ids)
task_phrase = _task_phrase(anchors)
metric_phrase = _metric_phrase(anchors)
nearest = anchors[0] if anchors else {}
nearest_title = clean_text((nearest.get("title") if isinstance(nearest, dict) else "") or (signal.paper_ids[0] if signal.paper_ids else signal.cluster), limit=90)
nearest_task_phrase = _task_phrase([nearest]) if nearest else task_phrase
literature_suggests = [_paper_annotation(note, signal.possible_axis, signal.cluster) for note in anchors[:2]]
literature_suggests.append(_synthesis_annotation(anchors, signal.possible_axis, signal.cluster))
closest_prior_gap = [
clean_sentence(
f"{nearest_title} is the closest prior anchor because it already reports concrete behavior on {nearest_task_phrase}. The unresolved point is whether those gains survive once {profile['confound']} is held fixed while {signal.possible_axis} is varied.",
limit=220,
),
clean_sentence(
"The novelty test here is narrow, not rhetorical: if a strong anchor paper already runs that single-variable control, this direction should collapse quickly rather than stay alive as a vague confound story.",
limit=220,
),
]
one_line_thesis = clean_sentence(profile["thesis"], limit=230)
why_interesting = clean_sentence(
f"This is not just another benchmark wedge. It opens a {profile['program_kind']} line around {signal.possible_axis}: {profile['question']}.",
limit=220,
)
missing_piece = clean_sentence(profile["missing_piece"], limit=230)
possible_variants = [
clean_sentence(f"Single-variable control on {task_phrase}: vary {signal.possible_axis} while holding {profile['confound']} fixed as far as the setup allows.", limit=210),
clean_sentence("Replace score-only reporting with a failure-type comparison after the same intervention, so the result says more than whether the average went up.", limit=210),
clean_sentence("Check whether the conclusion survives on one simple public task slice and one more tool- or environment-heavy setting before treating it as a general claim.", limit=210),
]
first_probes = [
clean_sentence(
f"Intervention: vary {signal.possible_axis} while holding {profile['confound']} as fixed as possible on {task_phrase}. Readout: {metric_phrase}. Decisive if the interpretation changes even after the control.",
limit=220,
),
clean_sentence(
f"Prior-work audit: inspect {nearest_title} for any ablation that already fixes {profile['confound']}, and if the conclusion survives, demote this direction.",
limit=180,
),
]
what_counts_as_insight = clean_sentence(profile['insight'], limit=190)
kill_criteria = [
clean_sentence(f"Kill quickly if {profile['kill_signal']}.", limit=170),
clean_sentence(f"Kill if the first controlled probe leaves both {metric_phrase} and the failure taxonomy essentially unchanged.", limit=170),
]
weakness_conditions = [
clean_sentence(f"This direction is weaker if changing {signal.possible_axis} mostly rescales the aggregate metric without changing which failure modes appear or disappear.", limit=175),
clean_sentence(profile['demotion'], limit=170),
]
anchor_reading_notes: list[dict[str, str]] = []
for note in anchors[:3]:
title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
result = _result_fact(note)
anchor_reading_notes.append({
"paper_title": title,
"why_read": clean_sentence(f"Closest {signal.cluster.lower()} anchor with a concrete result hook on {task_phrase}: {result}", limit=180),
"what_to_extract": clean_sentence(f"Extract the metric and comparator, then check whether {profile['reading_extract']}.", limit=180),
"current_hook": clean_sentence(result, limit=180),
"kill_signal": clean_sentence(f"Weaken this direction if the paper already shows {profile['kill_signal']}.", limit=180),
})
why_this_ranks_here = clean_sentence(profile['priority_note'], limit=170)
cards.append(DirectionCard(
direction_id=f"DIR-{idx:03d}",
cluster=signal.cluster,
direction_type=signal.direction_type,
title=_title_from_signal(signal),
focus_axis=signal.possible_axis,
main_confound=profile['confound'],
program_kind=profile['program_kind'],
contribution_shape=profile['contribution_shape'],
time_to_clarity=profile['time_to_clarity'],
one_line_thesis=one_line_thesis,
why_interesting=why_interesting,
literature_suggests=literature_suggests,
closest_prior_gap=closest_prior_gap,
missing_piece=missing_piece,
possible_variants=possible_variants,
academic_value=profile['contribution_shape'],
first_probes=first_probes,
what_counts_as_insight=what_counts_as_insight,
weakness_conditions=weakness_conditions,
kill_criteria=kill_criteria,
what_would_change_mind=list(kill_criteria),
best_fit=_best_fit(signal.direction_type, signal.possible_axis),
why_this_ranks_here=why_this_ranks_here,
evidence_confidence=signal.evidence_confidence,
paper_ids=signal.paper_ids,
signal_ids=[signal.signal_id],
anchor_reading_notes=anchor_reading_notes,
))
cards.sort(key=lambda c: (0 if c.cluster in focus else 1, c.cluster, c.program_kind, c.title))
target = max(pool_min, min(pool_max, len(cards)))
return cards[:target]
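The ordering and pool-size clamp at the end of `signals_to_direction_cards` can be sketched on plain tuples. This is a simplified illustration under stated assumptions: `pool_target` and `select_pool` are hypothetical names, cards are reduced to `(cluster, title)` pairs, and the `program_kind` sort key is omitted.

```python
def pool_target(n_cards: int, pool_min: int, pool_max: int) -> int:
    """Same clamp expression as above: cap at pool_max, floor at pool_min."""
    return max(pool_min, min(pool_max, n_cards))


def select_pool(
    cards: list[tuple[str, str]],
    focus: set[str],
    pool_min: int,
    pool_max: int,
) -> list[tuple[str, str]]:
    """Hypothetical helper: focus clusters first, then alphabetical, then slice."""
    ordered = sorted(cards, key=lambda c: (0 if c[0] in focus else 1, c[0], c[1]))
    # Slicing past the end is safe, so the result never exceeds len(cards).
    return ordered[: pool_target(len(ordered), pool_min, pool_max)]


if __name__ == "__main__":
    cards = [("b", "x"), ("a", "y"), ("c", "z")]
    print(select_pool(cards, {"c"}, pool_min=1, pool_max=2))
    print(len(select_pool(cards, set(), pool_min=5, pool_max=8)))
```

One consequence worth noting: when fewer than `pool_min` cards exist, the slice simply returns all of them, so `pool_min` acts as a target rather than a guarantee.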
def direction_pool_markdown(cards: list[DirectionCard]) -> str:
rows: list[list[str]] = []
for card in cards:
rows.append([
card.direction_id,
card.cluster,
card.direction_type,
card.program_kind,
card.title,
clean_sentence(card.one_line_thesis, limit=100),
clean_sentence(card.why_interesting, limit=100),
clean_sentence(card.missing_piece, limit=100),
clean_sentence(" / ".join(card.possible_variants), limit=120),
clean_sentence(card.academic_value, limit=90),
clean_sentence(" / ".join(card.first_probes), limit=120),
card.evidence_confidence,
", ".join(card.paper_ids),
])
return "\n".join([
"# IDEA_DIRECTION_POOL",
"",
"## Candidate research directions",
"",
markdown_table(["Direction ID", "Cluster", "Type", "Program", "Title", "One-line thesis", "Why interesting", "Missing piece", "Possible variants", "Academic value", "First probes", "Confidence", "Paper IDs"], rows),
"",
])
def _thesis_potential(direction_type: str, cluster: str) -> int:
    if direction_type in {"mechanism", "coordination", "adaptation"}:
        return 5
    if direction_type in {"systems", "governance"}:
        return 4
    if "benchmark" in cluster.lower() or direction_type == "evaluation":
        return 3
    return 4
def score_direction_cards(
cards: list[DirectionCard],
*,
focus_clusters: list[str],
keep_rank_max: int,
maybe_rank_max: int,
score_weights: dict[str, float] | None = None,
) -> list[ScreenedDirection]:
focus = {str(x).strip() for x in focus_clusters if str(x).strip()}
keep_rank_max = int(keep_rank_max)
maybe_rank_max = int(maybe_rank_max)
if keep_rank_max <= 0:
raise ValueError("Invalid ideation scoring contract: keep_rank_max must be positive.")
if maybe_rank_max < keep_rank_max:
raise ValueError("Invalid ideation scoring contract: maybe_rank_max must be >= keep_rank_max.")
weights = _validate_score_weights(score_weights or {}, field_name="score_direction_cards.score_weights")
cluster_counts: dict[str, int] = {}
program_counts: dict[str, int] = {}
for card in cards:
cluster_counts[card.cluster] = cluster_counts.get(card.cluster, 0) + 1
program_counts[card.program_kind] = program_counts.get(card.program_kind, 0) + 1
provisional: list[tuple[DirectionCard, float, int, int, int, int, int, int]] = []
for card in cards:
discussion_worthiness = 5 if card.cluster in focus or card.time_to_clarity == "fast" else 4
academic_value_score = 5 if any(token in card.contribution_shape.lower() for token in ["rule", "protocol", "regime map", "causal"]) else 4
concrete_hooks = sum(1 for item in card.literature_suggests if _sentence_has_concrete_hook(item))
evidence_grounding = 5 if (len(card.paper_ids) >= 3 and concrete_hooks >= 2) else (4 if (len(card.paper_ids) >= 2 and concrete_hooks >= 1) else 3)
if program_counts.get(card.program_kind, 0) == 1 and cluster_counts.get(card.cluster, 0) == 1:
direction_distinctness = 5
elif program_counts.get(card.program_kind, 0) == 1 or cluster_counts.get(card.cluster, 0) == 1:
direction_distinctness = 4
else:
direction_distinctness = 3
probe_blob = " ".join(card.first_probes).lower()
first_probe_clarity = 5 if all(token in probe_blob for token in ["intervention:", "readout:", "decisive if"]) else 4 if "intervention:" in probe_blob else 3
thesis_potential = _thesis_potential(card.direction_type, card.cluster)
total = round(
discussion_worthiness * weights["discussion_worthiness"]
+ academic_value_score * weights["academic_value"]
+ evidence_grounding * weights["evidence_grounding"]
+ direction_distinctness * weights["direction_distinctness"]
+ first_probe_clarity * weights["first_probe_clarity"]
+ thesis_potential * weights["thesis_potential"],
2,
)
provisional.append((card, total, discussion_worthiness, academic_value_score, evidence_grounding, direction_distinctness, first_probe_clarity, thesis_potential))
provisional.sort(key=lambda row: (-row[1], 0 if row[0].time_to_clarity == "fast" else 1, row[0].cluster, row[0].title))
rows: list[ScreenedDirection] = []
for idx, (card, total, dw, av, eg, dd, fp, tp) in enumerate(provisional, start=1):
recommendation = "keep" if idx <= keep_rank_max else "maybe" if idx <= maybe_rank_max else "drop"
strengths: list[str] = []
if eg >= 5:
strengths.append("concrete anchor evidence")
if dd >= 5:
strengths.append(f"distinct {card.program_kind} wedge")
if fp >= 5:
strengths.append("clean first probe")
if av >= 5:
strengths.append("thesis-sized payoff")
if not strengths:
strengths.append("discussion value")
risk = "still abstract-first" if str(card.evidence_confidence).startswith("low") else f"main risk is controlling {card.main_confound} cleanly"
rationale = clean_sentence(f"Strongest on {' + '.join(strengths[:2])}; {risk}.", limit=160)
rows.append(ScreenedDirection(
direction_id=card.direction_id,
cluster=card.cluster,
direction_type=card.direction_type,
title=card.title,
total_score=total,
discussion_worthiness=dw,
academic_value_score=av,
evidence_grounding=eg,
direction_distinctness=dd,
first_probe_clarity=fp,
thesis_potential=tp,
recommendation=recommendation,
rationale=rationale,
))
return rows
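The scoring contract in `score_direction_cards` comes down to a rounded weighted sum followed by rank thresholds. A minimal sketch follows; `EXAMPLE_WEIGHTS` holds made-up values (the real weights are validated by `_validate_score_weights`, which is not shown here), and `weighted_total` and `recommend` are hypothetical helper names.

```python
# Hypothetical weights for illustration only; the real contract supplies its own.
EXAMPLE_WEIGHTS: dict[str, float] = {
    "discussion_worthiness": 0.20,
    "academic_value": 0.20,
    "evidence_grounding": 0.20,
    "direction_distinctness": 0.15,
    "first_probe_clarity": 0.15,
    "thesis_potential": 0.10,
}


def weighted_total(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted sum of the six sub-scores, rounded to two decimals."""
    return round(sum(scores[name] * weight for name, weight in weights.items()), 2)


def recommend(rank: int, keep_rank_max: int, maybe_rank_max: int) -> str:
    """Map a 1-based rank to keep / maybe / drop using the two rank cut-offs."""
    if keep_rank_max <= 0 or maybe_rank_max < keep_rank_max:
        raise ValueError("keep_rank_max must be positive and <= maybe_rank_max")
    if rank <= keep_rank_max:
        return "keep"
    if rank <= maybe_rank_max:
        return "maybe"
    return "drop"


if __name__ == "__main__":
    scores = {
        "discussion_worthiness": 5,
        "academic_value": 4,
        "evidence_grounding": 3,
        "direction_distinctness": 4,
        "first_probe_clarity": 5,
        "thesis_potential": 5,
    }
    print(weighted_total(scores, EXAMPLE_WEIGHTS))
    print([recommend(rank, 2, 4) for rank in range(1, 6)])
```

Because the recommendation depends only on rank, two cards with identical totals can land on opposite sides of a cut-off; the tie-break order in the sort key (time to clarity, then cluster, then title) decides which one is kept.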
def screening_table_markdown(rows: list[ScreenedDirection]) -> str:
data: list[list[str]] = []
for row in rows:
data.append([
row.direction_id,
row.cluster,
row.direction_type,
row.title,
f"{row.total_score:.2f}",
str(row.discussion_worthiness),
str(row.academic_value_score),
str(row.evidence_grounding),
str(row.direction_distinctness),
str(row.first_probe_clarity),
str(row.thesis_potential),
row.recommendation,
row.rationale,
])
return "\n".join([
"# IDEA_SCREENING_TABLE",
"",
"## Discussion-first screening",
"",
markdown_table(["Direction ID", "Cluster", "Type", "Title", "Total", "Discussion", "Academic value", "Evidence", "Distinctness", "First probe", "Thesis potential", "Decision", "Rationale"], data),
"",
])
def shortlist_markdown(records: list[dict[str, Any]]) -> str:
lines = ["# IDEA_SHORTLIST", "", "## Prioritized directions", ""]
for record in records:
lines.extend([
f"### Direction {record['rank']}. {record['title']}",
f"- Cluster: {record['cluster']}",
f"- Type: {record['direction_type']}",
f"- Focus axis: {record.get('focus_axis')}",
f"- Program kind: {record.get('program_kind')}",
f"- Main confound: {record.get('main_confound')}",
f"- Time to clarity: {record.get('time_to_clarity')}",
f"- One-line thesis: {record['one_line_thesis']}",
f"- Why this is interesting: {record['why_interesting']}",
f"- Why this ranks here: {record['why_this_ranks_here']}",
f"- Contribution shape: {record.get('contribution_shape')}",
"- What the literature already suggests:",
])
for item in record.get("literature_suggests") or []:
lines.append(f" - {item}")
lines.extend([
"- Closest prior work and why it does not settle the question:",
])
for item in record.get("closest_prior_gap") or []:
lines.append(f" - {item}")
lines.extend([
f"- What is still missing: {record['missing_piece']}",
"- Possible variants:",
])
for item in record.get("possible_variants") or []:
lines.append(f" - {item}")
lines.extend([
f"- Why this could matter academically: {record['academic_value']}",
"- First probes:",
])
for item in record.get("first_probes") or []:
lines.append(f" - {item}")
lines.extend([
f"- What would count as actual insight: {record['what_counts_as_insight']}",
"- What would make this weak or unconvincing:",
])
for item in record.get("weakness_conditions") or []:
lines.append(f" - {item}")
lines.extend([
"- Quick kill criteria:",
])
for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
lines.append(f" - {item}")
lines.extend([
f"- Best fit: {record['best_fit']}",
f"- Evidence confidence: {record['evidence_confidence']}",
"- Anchor papers: " + ", ".join(f"`{pid}`" for pid in record.get("paper_ids") or []),
f"- Why prioritized now: {record['why_prioritized']}",
"",
])
return "\n".join(lines)
def shortlist_snapshot_table(records: list[dict[str, Any]]) -> str:
rows = []
for rec in records:
rows.append([
str(rec.get("rank") or ""),
rec.get("title", ""),
clean_sentence(rec.get("why_this_ranks_here", ""), limit=52),
clean_sentence(rec.get("contribution_shape", rec.get("academic_value", "")), limit=56),
clean_sentence((rec.get("kill_criteria") or rec.get("what_would_change_mind") or [""])[0], limit=54),
])
return markdown_table(["Rank", "Direction", "Why now", "If it survives", "Fast kill signal"], rows)
def build_report_payload(*, topic: str, shortlist: list[dict[str, Any]], deferred: list[dict[str, Any]], trace_paths: dict[str, str]) -> dict[str, Any]:
top = list(shortlist)
takeaways: list[str] = []
if top:
takeaways.append("The strongest directions are the ones most likely to change how existing results are interpreted, not just add another benchmark win.")
takeaways.append("The current rank order reflects a tradeoff between time-to-clarity, thesis-sized payoff, and how concrete the nearest prior-work gap already looks.")
program_kinds = uniq_keep_order(str(rec.get("program_kind") or "").strip() for rec in top)
if len(program_kinds) >= 2:
takeaways.append(f"The lead set is intentionally not one confound template repeated three times: it spans {', '.join(program_kinds[:3])} rather than a single explanatory mold.")
else:
takeaways.append("The lead set still sits in one tight neighborhood, so the next reading pass should stress-test whether those directions are genuinely distinct thesis lines.")
if any(str(rec.get("evidence_confidence") or "").startswith("low") for rec in top):
takeaways.append("Most anchors are still abstract-first, so the memo is best used to decide the next reading and falsification pass rather than to lock a project immediately.")
else:
takeaways.append("The evidence is already concrete enough for a serious PI/PhD discussion, though one deeper paper pass could still reshuffle the exact order.")
lead = top[0] if top else {}
lead_anchor = (lead.get("anchor_reading_notes") or [{}])[0] if top else {}
lead_anchor_title = lead_anchor.get("paper_title") if isinstance(lead_anchor, dict) else "the first anchor paper"
discussion_questions = [
"Which direction still survives once we ask for a single-variable control rather than a suggestive confound story?",
"Which lead direction would still matter if the nearest prior work already addressed the obvious control we are worried about?",
"Which candidate could plausibly turn into a thesis line with a reusable method, protocol, or regime map rather than a one-off empirical note?",
"Which ranking would change most after one full-paper reading pass on the anchor set?",
]
uncertainties = [
"The main remaining risk is novelty risk, not idea scarcity: closer reading may reveal that one lead direction is already settled by an existing control or ablation.",
"Evidence confidence is still bounded by abstract-first notes for part of the lead set, so paper-specific details could either strengthen or kill a direction quickly.",
"The memo is more reliable as a discussion and triage artifact than as a final commitment on exact project order.",
]
next_steps = []
if top:
next_steps.append(clean_sentence(f"Start with {lead.get('title')}: read {lead_anchor_title or 'the first anchor paper'} looking specifically for whether {lead.get('focus_axis')} is already isolated against {lead.get('main_confound')}.", limit=220))
first_probe = (lead.get("first_probes") or [""])[0]
if first_probe:
next_steps.append(clean_sentence(first_probe, limit=220))
next_steps.append(clean_sentence(f"Re-rank only after checking the quick kill criteria for {lead.get('title')} and at least one competing direction in the lead set.", limit=220))
else:
next_steps.extend([
"Read the anchor papers for the current top direction in full and test whether the memo's stated hidden variable still looks unresolved.",
"In the next PI/PhD discussion, use the ranking arguments and kill criteria to decide whether any direction deserves promotion into a sharper thesis candidate.",
])
return {
"topic": topic,
"takeaways": takeaways,
"top_directions": top,
"deferred_directions": deferred,
"discussion_questions": discussion_questions,
"uncertainties": uncertainties,
"next_steps": next_steps,
"trace_artifacts": trace_paths,
}
def report_markdown(payload: dict[str, Any]) -> str:
top_dirs = payload.get("top_directions") or []
deferred_idx = 3 + len(top_dirs)
discussion_idx = deferred_idx + 1
uncertainty_idx = deferred_idx + 2
next_idx = deferred_idx + 3
appendix_idx = deferred_idx + 4
lines = [
"# Research Idea Brainstorm Memo",
"",
"## 0. Scope and framing",
"",
f"- Topic: {payload.get('topic') or 'research ideas'}",
"- Intended readers: PI / PhD",
"- Goal: surface a small number of discussion-worthy research directions rather than force a final project choice.",
"- Current evidence basis: abstract-first notes unless otherwise stated.",
"- What this memo is not: not a final project spec, not a survey draft, and not a symmetric top-3 proposal pack.",
"",
"## 1. Big-picture takeaways",
"",
]
for item in payload.get("takeaways") or []:
lines.append(f"- {item}")
lines.extend([
"",
"## 2. Top directions at a glance",
"",
shortlist_snapshot_table(payload.get("top_directions") or []),
"",
])
for idx, record in enumerate(top_dirs, start=1):
lines.extend([
f"## {idx + 2}. Direction {idx} — {record.get('title')}",
"",
"### One-line thesis",
f"- {record.get('one_line_thesis')}",
"",
"### Why it belongs in the lead set",
f"- {record.get('why_this_ranks_here')}",
f"- Time to clarity: {record.get('time_to_clarity')}",
"",
"### What the current literature actually shows",
])
for item in record.get("literature_suggests") or []:
lines.append(f"- {item}")
lines.extend([
"",
"### Closest prior work and remaining gap",
])
for item in record.get("closest_prior_gap") or []:
lines.append(f"- {item}")
lines.extend([
f"- Missing piece: {record.get('missing_piece')}",
"",
"### If this direction is right, what contribution emerges",
f"- {record.get('contribution_shape') or record.get('academic_value')}",
f"- {record.get('what_counts_as_insight')}",
"",
"### Smallest decisive probe",
])
for item in record.get("first_probes") or []:
lines.append(f"- {item}")
lines.extend([
"",
"### Quick kill criteria",
])
for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
lines.append(f"- {item}")
lines.append("")
lines.extend([
f"## {deferred_idx}. Other promising but not prioritized directions",
"",
])
deferred = payload.get("deferred_directions") or []
if deferred:
for record in deferred:
lines.append(f"- **{record.get('title')}** — {record.get('one_line_thesis')} (why not prioritized now: {record.get('why_not_prioritized')})")
else:
lines.append("- No additional deferred directions were retained in the current memo.")
lines.extend([
"",
f"## {discussion_idx}. Cross-cutting discussion questions",
"",
])
for item in payload.get("discussion_questions") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {uncertainty_idx}. Uncertainty and disagreement",
"",
])
for item in payload.get("uncertainties") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {next_idx}. Suggested next reading / next discussion step",
"",
])
for item in payload.get("next_steps") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {appendix_idx}. Appendix guide",
"",
"- `output/APPENDIX.md` for anchor-paper reading notes and deferred directions.",
"- `output/REPORT.json` for the structured version of this memo.",
])
return "\n".join(lines)
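The section numbering in `report_markdown` shifts with the number of top directions. The arithmetic can be checked with a small sketch; `section_numbers` is a hypothetical helper that mirrors the index variables above and is not part of the module.

```python
def section_numbers(n_top: int) -> dict[str, object]:
    """Mirror of the memo's heading arithmetic: sections 0-2 are fixed,
    each top direction takes one number, and the trailing sections follow."""
    deferred = 3 + n_top
    return {
        "directions": [idx + 2 for idx in range(1, n_top + 1)],
        "deferred": deferred,
        "discussion": deferred + 1,
        "uncertainty": deferred + 2,
        "next_steps": deferred + 3,
        "appendix": deferred + 4,
    }


if __name__ == "__main__":
    print(section_numbers(3))
```

With three top directions the direction sections are numbered 3 to 5 and the appendix guide becomes section 10; with none, the deferred section directly follows section 2 as section 3.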
def appendix_markdown(payload: dict[str, Any], *, core_titles: dict[str, str]) -> str:
lines = [
"# Appendix to the Research Idea Brainstorm Memo",
"",
"## A. Deferred but still promising directions",
"",
]
deferred = payload.get("deferred_directions") or []
if deferred:
for rec in deferred:
lines.extend([
f"### {rec.get('title')}",
f"- One-line thesis: {rec.get('one_line_thesis')}",
f"- Program kind: {rec.get('program_kind')}",
f"- Why interesting: {rec.get('why_interesting')}",
f"- Why not prioritized now: {rec.get('why_not_prioritized')}",
"",
])
else:
lines.append("- (none)")
lines.extend([
"## B. Anchor papers and what to extract",
"",
])
for rec in payload.get("top_directions") or []:
lines.append(f"### {rec.get('title')}")
lines.append(f"- Program kind: {rec.get('program_kind')}")
lines.append(f"- Lead-set reason: {rec.get('why_this_ranks_here')}")
notes = rec.get("anchor_reading_notes") or []
if notes:
rows: list[list[str]] = []
for item in notes:
if not isinstance(item, dict):
continue
rows.append([
item.get("paper_title", "paper"),
item.get("why_read", ""),
item.get("what_to_extract", ""),
item.get("current_hook", ""),
item.get("kill_signal", ""),
])
if rows:
lines.append(markdown_table(["Anchor paper", "Why read now", "What to extract", "Current evidence hook", "Kill signal"], rows))
else:
for pid in rec.get("paper_ids") or []:
title = core_titles.get(pid, pid)
lines.append(f"- {title}")
else:
for pid in rec.get("paper_ids") or []:
title = core_titles.get(pid, pid)
lines.append(f"- {title}")
lines.append("")
return "\n".join(lines)
FILE:tooling/pipeline_spec.py
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import Any
import yaml
@dataclass(frozen=True)
class PipelineStage:
    id: str
    title: str
    checkpoint: str
    mode: str
    required_skills: tuple[str, ...]
    optional_skills: tuple[str, ...]
    produces: tuple[str, ...]
    human_checkpoint: dict[str, Any]

    @staticmethod
    def from_mapping(stage_id: str, data: Any, *, path: Path) -> "PipelineStage":
        if not isinstance(data, dict):
            raise ValueError(f"`stages.{stage_id}` must be a mapping in {path}")
        return PipelineStage(
            id=str(stage_id or "").strip(),
            # Title and checkpoint fall back to the stage id when missing or blank.
            title=str(data.get("title") or stage_id).strip() or str(stage_id),
            checkpoint=str(data.get("checkpoint") or stage_id).strip() or str(stage_id),
            mode=str(data.get("mode") or "").strip(),
            required_skills=_string_tuple(data.get("required_skills"), field_name=f"stages.{stage_id}.required_skills", path=path),
            optional_skills=_string_tuple(data.get("optional_skills"), field_name=f"stages.{stage_id}.optional_skills", path=path),
            produces=_string_tuple(data.get("produces"), field_name=f"stages.{stage_id}.produces", path=path),
            human_checkpoint=_mapping(data.get("human_checkpoint"), field_name=f"stages.{stage_id}.human_checkpoint", path=path, required=False),
        )
@dataclass(frozen=True)
class PipelineSpec:
    path: Path
    name: str
    version: str
    units_template: str
    default_checkpoints: tuple[str, ...]
    profile: str
    routing_hints: tuple[str, ...]
    routing_default: bool
    routing_priority: int
    target_artifacts: tuple[str, ...]
    contract_model: str
    structure_mode: str
    pre_retrieval_shell: dict[str, Any]
    binding_layers: tuple[str, ...]
    core_chapter_h3_target: int
    query_defaults: dict[str, Any]
    overridable_query_fields: tuple[str, ...]
    quality_contract: dict[str, Any]
    loop_policy: dict[str, Any]
    stages: dict[str, PipelineStage]
    variant_of: str
    variant_overrides: dict[str, Any]

    @staticmethod
    def load(path: Path) -> "PipelineSpec":
        resolved = path.resolve()
        raw_frontmatter, frontmatter = _load_variant_aware_frontmatter(resolved)
        name = str(frontmatter.get("name") or path.stem.replace(".pipeline", ""))
        units_template = str(frontmatter.get("units_template") or "")
        version = str(frontmatter.get("version") or "")
        default_checkpoints = _string_tuple(frontmatter.get("default_checkpoints"), field_name="default_checkpoints", path=resolved)
        profile = str(frontmatter.get("profile") or "default").strip() or "default"
        routing_hints = _string_tuple(frontmatter.get("routing_hints"), field_name="routing_hints", path=resolved)
        routing_default = bool(frontmatter.get("routing_default"))
        try:
            routing_priority = int(frontmatter.get("routing_priority") or 0)
        except Exception:
            routing_priority = 0
        target_artifacts = _string_tuple(frontmatter.get("target_artifacts"), field_name="target_artifacts", path=resolved)
        contract_model = str(frontmatter.get("contract_model") or "").strip()
        structure_mode = str(frontmatter.get("structure_mode") or "").strip()
        pre_retrieval_shell = _mapping(frontmatter.get("pre_retrieval_shell"), field_name="pre_retrieval_shell", path=resolved, required=False)
        binding_layers = _string_tuple(frontmatter.get("binding_layers"), field_name="binding_layers", path=resolved)
        try:
            core_chapter_h3_target = int(frontmatter.get("core_chapter_h3_target") or 0)
        except Exception:
            core_chapter_h3_target = 0
        query_defaults = _mapping(frontmatter.get("query_defaults"), field_name="query_defaults", path=resolved, required=False)
        overridable_query_fields = _string_tuple(
            frontmatter.get("overridable_query_fields"),
            field_name="overridable_query_fields",
            path=resolved,
        )
        quality_contract = _mapping(frontmatter.get("quality_contract"), field_name="quality_contract", path=resolved, required=False)
        loop_policy = _mapping(frontmatter.get("loop_policy"), field_name="loop_policy", path=resolved, required=False)
        stages = _parse_stages(frontmatter.get("stages"), path=resolved)
        variant_of = str(raw_frontmatter.get("variant_of") or "").strip()
        variant_overrides = _mapping(raw_frontmatter.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False)
        if not units_template:
            raise ValueError(f"Missing units_template in pipeline front matter: {resolved}")
        return PipelineSpec(
            path=resolved,
            name=name,
            version=version,
            units_template=units_template,
            default_checkpoints=default_checkpoints,
            profile=profile,
            routing_hints=routing_hints,
            routing_default=routing_default,
            routing_priority=routing_priority,
            target_artifacts=target_artifacts,
            contract_model=contract_model,
            structure_mode=structure_mode,
            pre_retrieval_shell=pre_retrieval_shell,
            binding_layers=binding_layers,
            core_chapter_h3_target=core_chapter_h3_target,
            query_defaults=query_defaults,
            overridable_query_fields=overridable_query_fields,
            quality_contract=quality_contract,
            loop_policy=loop_policy,
            stages=stages,
            variant_of=variant_of,
            variant_overrides=variant_overrides,
        )

    def query_default(self, key: str, default: Any = None) -> Any:
        return self.query_defaults.get(str(key or "").strip(), default)

    def allows_query_override(self, key: str) -> bool:
        return str(key or "").strip() in set(self.overridable_query_fields)
def _parse_frontmatter(text: str) -> dict[str, Any]:
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("Pipeline file must start with YAML front matter '---'")
    end_idx = None
    for idx in range(1, len(lines)):
        if lines[idx].strip() == "---":
            end_idx = idx
            break
    if end_idx is None:
        raise ValueError("Unterminated YAML front matter (missing closing '---')")
    raw = "\n".join(lines[1:end_idx])
    data = yaml.safe_load(raw) or {}
    if not isinstance(data, dict):
        raise ValueError("Pipeline YAML front matter must be a mapping")
    return data
def _load_variant_aware_frontmatter(path: Path, seen: set[Path] | None = None) -> tuple[dict[str, Any], dict[str, Any]]:
    resolved = path.resolve()
    active = set(seen or set())
    if resolved in active:
        chain = " -> ".join(str(p) for p in [*active, resolved])
        raise ValueError(f"Cyclic `variant_of` chain detected: {chain}")
    active.add(resolved)
    text = resolved.read_text(encoding="utf-8")
    raw = _parse_frontmatter(text)
    variant_of = str(raw.get("variant_of") or "").strip()
    if not variant_of:
        return raw, dict(raw)
    _validate_raw_variant_frontmatter(raw, path=resolved)
    base_path = _resolve_pipeline_reference(resolved, variant_of)
    _, base_effective = _load_variant_aware_frontmatter(base_path, seen=active)
    current = {key: raw[key] for key in ("name", "version") if key in raw}
    merged = _deep_merge(base_effective, current)
    merged = _deep_merge(
        merged,
        _mapping(raw.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False),
    )
    merged["variant_of"] = variant_of
    merged["variant_overrides"] = raw.get("variant_overrides") or {}
    return raw, merged
def _validate_raw_variant_frontmatter(raw: dict[str, Any], *, path: Path) -> None:
    allowed_raw_keys = {"name", "version", "variant_of", "variant_overrides"}
    extra_keys = sorted(str(key) for key in raw.keys() if str(key) not in allowed_raw_keys)
    if extra_keys:
        raise ValueError(
            "Variant pipeline files may only keep top-level keys "
            "`name`, `version`, `variant_of`, `variant_overrides`; "
            f"move the rest under `variant_overrides` in {path}: {', '.join(extra_keys)}"
        )
def _resolve_pipeline_reference(path: Path, ref: str) -> Path:
    value = str(ref or "").strip()
    if not value:
        raise ValueError(f"Empty `variant_of` reference in {path}")
    candidate = Path(value)
    if candidate.is_absolute() and candidate.exists():
        return candidate.resolve()
    repo_root = path.parent.parent
    for base in (path.parent, repo_root, repo_root / "pipelines"):
        direct = (base / value).resolve()
        if direct.exists():
            return direct
    stem = Path(value).name
    if stem.endswith(".pipeline.md"):
        stem = stem[: -len(".pipeline.md")]
    if stem:
        for base in (path.parent, repo_root / "pipelines"):
            direct = (base / f"{stem}.pipeline.md").resolve()
            if direct.exists():
                return direct
    raise ValueError(f"Could not resolve `variant_of: {value}` from {path}")
def _deep_merge(base: Any, override: Any) -> Any:
    if isinstance(base, list) and isinstance(override, dict) and any(str(key).startswith("__") for key in override.keys()):
        return _apply_list_patch(base, override)
    if isinstance(base, dict) and isinstance(override, dict):
        merged: dict[str, Any] = {str(k): v for k, v in base.items()}
        for key, value in override.items():
            if key in merged:
                merged[key] = _deep_merge(merged[key], value)
            else:
                merged[key] = value
        return merged
    return override
def _apply_list_patch(base: list[Any], override: dict[str, Any]) -> list[Any]:
    allowed_keys = {"__append__", "__prepend__", "__remove__", "__replace__"}
    bad_keys = sorted(str(key) for key in override.keys() if str(key) not in allowed_keys)
    if bad_keys:
        raise ValueError(
            "List patch overrides only support "
            "`__append__`, `__prepend__`, `__remove__`, `__replace__`; "
            f"got: {', '.join(bad_keys)}"
        )
    if "__replace__" in override:
        replacement = override.get("__replace__")
        if not isinstance(replacement, list):
            raise ValueError("List patch `__replace__` must be a YAML list")
        return list(replacement)
    current = list(base)
    remove_values = override.get("__remove__", [])
    prepend_values = override.get("__prepend__", [])
    append_values = override.get("__append__", [])
    for field_name, value in (
        ("__remove__", remove_values),
        ("__prepend__", prepend_values),
        ("__append__", append_values),
    ):
        if not isinstance(value, list):
            raise ValueError(f"List patch `{field_name}` must be a YAML list")
    if remove_values:
        current = [item for item in current if item not in remove_values]
    if prepend_values:
        current = list(prepend_values) + current
    if append_values:
        current = current + list(append_values)
    return current
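def _mapping(value: Any, *, field_name: str, path: Path, required: bool) -> dict[str, Any]:
    if value is None:
        # Honor `required`: a missing optional mapping becomes {}, a missing required one is an error.
        if required:
            raise ValueError(f"Missing required mapping `{field_name}` in {path}")
        return {}
    if not isinstance(value, dict):
        raise ValueError(f"`{field_name}` must be a mapping in {path}")
    return {str(key): item for key, item in value.items()}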
def _string_tuple(value: Any, *, field_name: str, path: Path) -> tuple[str, ...]:
    if value is None:
        return ()
    if not isinstance(value, list):
        raise ValueError(f"`{field_name}` must be a YAML list in {path}")
    out: list[str] = []
    for item in value:
        text = str(item or "").strip()
        if text:
            out.append(text)
    return tuple(out)
def _parse_stages(value: Any, *, path: Path) -> dict[str, PipelineStage]:
    if value is None:
        return {}
    items: list[tuple[str, Any]] = []
    if isinstance(value, dict):
        items = [(str(stage_id or "").strip(), data) for stage_id, data in value.items()]
    elif isinstance(value, list):
        for idx, item in enumerate(value):
            if not isinstance(item, dict):
                raise ValueError(f"`stages[{idx}]` must be a mapping in {path}")
            stage_id = str(item.get("id") or item.get("stage") or "").strip()
            if not stage_id:
                raise ValueError(f"`stages[{idx}]` is missing `id` in {path}")
            items.append((stage_id, item))
    else:
        raise ValueError(f"`stages` must be a mapping or list in {path}")
    stages: dict[str, PipelineStage] = {}
    for stage_id, data in items:
        stage = PipelineStage.from_mapping(stage_id, data, path=path)
        if not stage.id:
            raise ValueError(f"`stages` contains an empty stage id in {path}")
        stages[stage.id] = stage
    return stages
FILE:tooling/pipeline_text.py
from __future__ import annotations
import json
import re
from pathlib import Path
from typing import Any
from tooling.common import load_yaml, read_jsonl
def slug_unit_id(unit_id: str) -> str:
    raw = str(unit_id or '').strip()
    out: list[str] = []
    for ch in raw:
        out.append(ch if ch.isalnum() else '_')
    safe = ''.join(out).strip('_')
    return f'S{safe}' if safe else 'S'
def load_outline_sections(path: Path) -> list[dict[str, Any]]:
    outline = load_yaml(path) if path.exists() else []
    if not isinstance(outline, list):
        return []
    out: list[dict[str, Any]] = []
    for sec in outline:
        if not isinstance(sec, dict):
            continue
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        subsections: list[dict[str, str]] = []
        for sub in sec.get('subsections') or []:
            if not isinstance(sub, dict):
                continue
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                subsections.append({'id': sub_id, 'title': sub_title})
        if sec_id and sec_title:
            out.append({'id': sec_id, 'title': sec_title, 'subsections': subsections})
    return out
def iter_h3_units(path: Path) -> list[dict[str, str]]:
    units: list[dict[str, str]] = []
    for sec in load_outline_sections(path):
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        for sub in sec.get('subsections') or []:
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                units.append(
                    {
                        'section_id': sec_id,
                        'section_title': sec_title,
                        'sub_id': sub_id,
                        'title': sub_title,
                    }
                )
    return units
def read_jsonl_map(path: Path, key: str) -> dict[str, dict[str, Any]]:
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        val = str(rec.get(key) or '').strip()
        if val:
            out[val] = rec
    return out
def read_bib_keys(path: Path) -> set[str]:
    if not path.exists() or path.stat().st_size <= 0:
        return set()
    text = path.read_text(encoding='utf-8', errors='ignore')
    return set(re.findall(r'(?im)^@\w+\s*\{\s*([^,\s]+)\s*,', text))
def uniq_keep_order(items: list[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or '').strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out
def clean_excerpt(text: str, *, limit: int = 220) -> str:
    s = str(text or '').strip()
    s = s.replace('\n', ' ')
    s = re.sub(r'\s+', ' ', s)
    s = s.strip(' "\'`')
    s = s.replace('|', ', ')
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(' ', 1)[0].strip()
    return clipped if clipped else s[:limit].strip()
def citation_list(keys: list[str], *, max_keys: int = 3) -> str:
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'[@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'
def inline_evidence_phrase(keys: list[str], *, max_keys: int = 3) -> str:
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'in [@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'
def heading_blocks(md: str) -> list[tuple[str, str]]:
    blocks: list[tuple[str, str]] = []
    current = ''
    lines: list[str] = []
    for raw in (md or '').splitlines():
        if raw.startswith('### '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = raw[4:].strip()
            lines = []
            continue
        if raw.startswith('## '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = ''
            lines = []
            continue
        if current:
            lines.append(raw)
    if current:
        blocks.append((current, '\n'.join(lines).strip()))
    return blocks
def dump_jsonl_lines(records: list[dict[str, Any]]) -> str:
    return '\n'.join(json.dumps(r, ensure_ascii=False) for r in records).rstrip() + ('\n' if records else '')
Build a retrieval-informed chapter skeleton (`outline/chapter_skeleton.yml`) from taxonomy/core scope before stable H3 decomposition. **Trigger**: chapter sk...
---
name: chapter-skeleton
description: 'Build a retrieval-informed chapter skeleton (`outline/chapter_skeleton.yml`) from taxonomy/core scope before
stable H3 decomposition.
**Trigger**: chapter skeleton, chapter-level outline, H2 skeleton, section-first survey, 章节骨架, 章级骨架.
**Use when**: survey structure should stabilize chapter-level intent before subsection mapping and writing cards.
**Skip if**: `outline/chapter_skeleton.yml` already exists and is refined.
**Network**: none.
**Guardrail**: NO PROSE; do not invent papers; keep output chapter-level only.'
version: 0.1.0
metadata:
  openclaw:
    requires:
      anyBins:
        - python3
        - python
---
# Chapter Skeleton
## Load Order
Always read:
- `references/overview.md`
Use `scripts/run.py` only for deterministic materialization:
- load taxonomy / scope hints
- emit `outline/chapter_skeleton.yml`
- preserve existing non-placeholder user work
## Inputs
- `outline/taxonomy.yml`
- Optional: `GOAL.md`
## Outputs
- `outline/chapter_skeleton.yml`
## Asset contract
- `assets/output_contract.json`
## Script
### Quick Start
- `python scripts/run.py --workspace <workspace_dir>`
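
A minimal sketch of how the emitted skeleton could be checked against `assets/output_contract.json`. It assumes the skeleton loads as a list of chapter mappings; that shape, the `missing_fields` helper, and the sample chapters are hypothetical illustrations, not part of the bundled script. Only the field names come from the contract file.

```python
import json

# Field names copied from assets/output_contract.json; everything else is illustrative.
CONTRACT_JSON = """
{
  "artifact": "outline/chapter_skeleton.yml",
  "required_fields": ["id", "title", "rationale", "seed_topics", "target_h3_count"]
}
"""


def missing_fields(chapters, contract):
    """Return {chapter index: [missing field names]} for chapters that violate the contract."""
    required = contract["required_fields"]
    problems = {}
    for idx, chapter in enumerate(chapters):
        missing = [name for name in required if name not in chapter]
        if missing:
            problems[idx] = missing
    return problems


contract = json.loads(CONTRACT_JSON)
# Hypothetical in-memory stand-in for a parsed outline/chapter_skeleton.yml.
chapters = [
    {"id": "2", "title": "Example chapter", "rationale": "why it exists",
     "seed_topics": ["topic-a"], "target_h3_count": 3},
    {"id": "3", "title": "Incomplete chapter"},
]
report = missing_fields(chapters, contract)
```

An empty `report` would mean every chapter carries all required fields; here the second chapter is flagged.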
FILE:AGENTS.md
# AGENTS.md
This exported skill uses `AGENTS.md` only as a local repo-root marker for bundled helper scripts.
Source skill: `chapter-skeleton` from `research-units-pipeline-skills`.
Export slug: `chapter-skeleton`.
FILE:assets/output_contract.json
{
  "artifact": "outline/chapter_skeleton.yml",
  "required_fields": [
    "id",
    "title",
    "rationale",
    "seed_topics",
    "target_h3_count"
  ]
}
FILE:pipelines/arxiv-survey-latex.pipeline.md
---
name: arxiv-survey-latex
version: 3.8
variant_of: arxiv-survey
variant_overrides:
  routing_hints: [latex, pdf, tex, 可编译, 编译]
  routing_default: false
  routing_priority: 20
  units_template: templates/UNITS.arxiv-survey-latex.csv
  target_artifacts:
    __append__:
      - latex/main.tex
      - latex/main.pdf
      - output/LATEX_BUILD_REPORT.md
  stages:
    C5:
      title: Draft + PDF
      required_skills:
        __append__:
          - latex-scaffold
          - latex-compile-qa
      optional_skills:
        __remove__:
          - latex-scaffold
          - latex-compile-qa
      produces:
        __append__:
          - latex/main.tex
          - latex/main.pdf
          - output/LATEX_BUILD_REPORT.md
---
# Pipeline: arXiv survey / review (MD-first + LaTeX/PDF)
Variant of `arxiv-survey`.
Use this pipeline only when the default deliverable must include:
- `latex/main.tex`
- `latex/main.pdf`
- `output/LATEX_BUILD_REPORT.md`
All non-PDF survey behavior, defaults, checkpoints, and earlier stages inherit from `arxiv-survey`.
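
The `__append__` / `__remove__` keys in the front matter above are list patches applied on top of the base pipeline's lists. The sketch below is a simplified stand-in for that step (remove, then prepend, then append), written for illustration only; `apply_list_patch` and the shortened base lists are assumptions, and the real merge lives in `tooling/pipeline_spec.py`.

```python
def apply_list_patch(base, patch):
    """Simplified list patch: drop `__remove__` items, then add `__prepend__` and `__append__`."""
    removed = patch.get("__remove__", [])
    current = [item for item in base if item not in removed]
    return list(patch.get("__prepend__", [])) + current + list(patch.get("__append__", []))


# Shortened, illustrative excerpts of the base C5 skill lists.
base_required = ["citation-injector", "draft-polisher"]
base_optional = ["prose-writer", "subsection-polisher", "latex-scaffold", "latex-compile-qa"]

# The variant promotes the two LaTeX skills from optional to required.
required = apply_list_patch(base_required, {"__append__": ["latex-scaffold", "latex-compile-qa"]})
optional = apply_list_patch(base_optional, {"__remove__": ["latex-scaffold", "latex-compile-qa"]})
```

Patching rather than restating the full lists keeps the variant file small and lets it follow later changes to the base pipeline.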
FILE:pipelines/arxiv-survey.pipeline.md
---
name: arxiv-survey
version: 3.8
profile: arxiv-survey
routing_hints: [survey, review, 综述, 调研, literature review]
routing_default: true
routing_priority: 10
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
- outline/transitions.md
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/claim_evidence_matrix.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- outline/writer_context_packs.jsonl
- citations/ref.bib
- citations/verified.jsonl
- sections/sections_manifest.jsonl
- sections/h3_bodies.refined.ok
- sections/paragraphs_curated.refined.ok
- sections/style_harmonized.refined.ok
- sections/opener_varied.refined.ok
- sections/abstract.md
- sections/S1.md
- sections/S2.md
- sections/discussion.md
- sections/conclusion.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/SCHEMA_NORMALIZATION_REPORT.md
- output/EVIDENCE_SELFLOOP_TODO.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/PARAGRAPH_CURATION_REPORT.md
- output/FRONT_MATTER_REPORT.md
- output/CHAPTER_LEADS_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/GLOBAL_REVIEW.md
- output/DRAFT.md
- output/MERGE_REPORT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.arxiv-survey.csv
contract_model: pipeline.frontmatter/v1
structure_mode: section_first
pre_retrieval_shell:
  enabled: true
  approval_surface: false
  allowed_h2: [Introduction, Related Work, Core Chapters, Discussion, Conclusion]
binding_layers: [chapter_skeleton, section_bindings, section_briefs, subsection_mapping]
core_chapter_h3_target: 3
query_defaults:
  max_results: 1800
  core_size: 300
  per_subsection: 28
  global_citation_min_subsections: 4
  draft_profile: survey
  citation_target: recommended
  evidence_mode: abstract
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- per_subsection
- global_citation_min_subsections
- draft_profile
- citation_target
- enrich_metadata
- evidence_mode
- fulltext_max_papers
- fulltext_max_pages
- fulltext_min_chars
- time_window.from
- time_window.to
quality_contract:
  citation_policy:
    unique_hard_floor: 150
    unique_recommended: 165
  structure_policy:
    max_final_h2_by_profile:
      survey: 8
      deep: 9
    max_h3_by_profile:
      survey: 10
      deep: 12
  front_matter_policy:
    survey:
      introduction:
        min_cites: 35
        min_paras: 5
        min_chars: 2600
      related_work:
        min_cites: 50
        min_paras: 6
        min_chars: 3200
    deep:
      introduction:
        min_cites: 40
        min_paras: 6
        min_chars: 3000
      related_work:
        min_cites: 55
        min_paras: 7
        min_chars: 3600
  subsection_policy:
    survey:
      min_unique_citations: 12
      min_chars: 4200
    deep:
      min_unique_citations: 14
      min_chars: 5200
loop_policy:
  stage_retry_budget:
    C1: 2
    C2: 2
    C3: 1
    C4: 1
  max_reroutes: 4
  require_human_on_retry_after_approval: true
stages:
  C0:
    title: Init
    mode: no_prose
    required_skills: [workspace-init, pipeline-router]
    optional_skills: []
    produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
  C1:
    title: Retrieval & core set
    mode: no_prose
    required_skills: [literature-engineer, dedupe-rank]
    optional_skills: [keyword-expansion, survey-seed-harvest]
    produces: [papers/papers_raw.jsonl, papers/retrieval_report.md, papers/papers_dedup.jsonl, papers/core_set.csv]
  C2:
    title: Structure
    mode: no_prose
    required_skills: [taxonomy-builder, chapter-skeleton, section-bindings, section-briefs, outline-builder, section-mapper, outline-refiner, pipeline-router, human-checkpoint]
    optional_skills: [outline-budgeter]
    produces: [outline/taxonomy.yml, outline/chapter_skeleton.yml, outline/section_bindings.jsonl, outline/section_binding_report.md, outline/section_briefs.jsonl, outline/outline.yml, outline/mapping.tsv, outline/coverage_report.md, outline/outline_state.jsonl, output/REROUTE_STATE.json, DECISIONS.md]
    human_checkpoint:
      approve: scope + section skeleton + outline
      write_to: DECISIONS.md
  C3:
    title: Evidence
    mode: no_prose
    required_skills: [pdf-text-extractor, paper-notes, subsection-briefs, chapter-briefs]
    optional_skills: []
    produces: [papers/fulltext_index.jsonl, papers/paper_notes.jsonl, papers/evidence_bank.jsonl, outline/subsection_briefs.jsonl, outline/chapter_briefs.jsonl]
  C4:
    title: Citations + evidence packs
    mode: no_prose
    required_skills: [citation-verifier, evidence-binder, evidence-draft, table-schema, anchor-sheet, table-filler, appendix-table-writer, schema-normalizer, writer-context-pack, evidence-selfloop, claim-matrix-rewriter]
    optional_skills: [survey-visuals]
    produces: [citations/ref.bib, citations/verified.jsonl, outline/evidence_bindings.jsonl, outline/evidence_binding_report.md, outline/table_schema.md, outline/tables_index.md, outline/tables_appendix.md, output/TABLES_APPENDIX_REPORT.md, outline/evidence_drafts.jsonl, outline/anchor_sheet.jsonl, output/SCHEMA_NORMALIZATION_REPORT.md, outline/writer_context_packs.jsonl, output/EVIDENCE_SELFLOOP_TODO.md, outline/claim_evidence_matrix.md]
  C5:
    title: Draft
    mode: prose_allowed
    required_skills: [front-matter-writer, chapter-lead-writer, subsection-writer, writer-selfloop, section-logic-polisher, argument-selfloop, paragraph-curator, style-harmonizer, opener-variator, evaluation-anchor-checker, transition-weaver, section-merger, post-merge-voice-gate, citation-diversifier, citation-injector, draft-polisher, global-reviewer, pipeline-auditor, artifact-contract-auditor]
    optional_skills: [prose-writer, subsection-polisher, redundancy-pruner, terminology-normalizer, limitation-weaver, latex-scaffold, latex-compile-qa]
    produces: [outline/transitions.md, sections/sections_manifest.jsonl, sections/h3_bodies.refined.ok, sections/paragraphs_curated.refined.ok, sections/style_harmonized.refined.ok, sections/opener_varied.refined.ok, sections/abstract.md, sections/S1.md, sections/S2.md, sections/discussion.md, sections/conclusion.md, output/WRITER_SELFLOOP_TODO.md, output/EVAL_ANCHOR_REPORT.md, output/ARGUMENT_SELFLOOP_TODO.md, output/SECTION_ARGUMENT_SUMMARIES.jsonl, output/ARGUMENT_SKELETON.md, output/PARAGRAPH_CURATION_REPORT.md, output/FRONT_MATTER_REPORT.md, output/CHAPTER_LEADS_REPORT.md, output/SECTION_LOGIC_REPORT.md, output/MERGE_REPORT.md, output/DRAFT.md, output/POST_MERGE_VOICE_REPORT.md, output/CITATION_BUDGET_REPORT.md, output/CITATION_INJECTION_REPORT.md, output/GLOBAL_REVIEW.md, output/AUDIT_REPORT.md, output/CONTRACT_REPORT.md]
---
# Pipeline: arXiv survey / review (MD-first)
Default contract (survey-grade, A150++):
- `queries.md` defaults are set for a *survey deliverable* (no silent downgrade): `core_size=300`, `per_subsection=28`, global unique citations hard floor `>=150` (recommended `>=165` when `core_size=300`; default `citation_target=recommended`).
- `draft_profile` controls **writing strictness** (`survey` vs `deep`), not “speed mode”.
- `evidence_mode` controls **evidence strength** (`abstract` default; `fulltext` optional and heavier).
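
As a rough illustration of the citation contract above, the sketch below classifies a draft's unique-citation count against the hard floor (150) and the recommended target (165). The `gate_status` helper, its status labels, and the treatment of any non-`recommended` `citation_target` as "hard floor only" are assumptions made for this example, not the auditor's actual logic.

```python
# Thresholds taken from the front matter's citation_policy.
CITATION_POLICY = {"unique_hard_floor": 150, "unique_recommended": 165}


def gate_status(unique_citations, citation_target="recommended", policy=CITATION_POLICY):
    """Classify a unique-citation count (hypothetical labels: FAIL / BELOW_TARGET / PASS)."""
    floor = policy["unique_hard_floor"]
    # Assumption: only `recommended` raises the bar above the hard floor.
    target = policy["unique_recommended"] if citation_target == "recommended" else floor
    if unique_citations < floor:
        return "FAIL"
    if unique_citations < target:
        return "BELOW_TARGET"
    return "PASS"


statuses = [gate_status(n) for n in (149, 160, 165)]
```

With the default `citation_target`, a draft with 160 unique keys clears the hard floor but still falls short of the recommended target.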
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- survey-seed-harvest
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- `queries.md` may specify `max_results` and a year-based `time_window`; `arxiv-search` will paginate and attach arXiv metadata (categories, arxiv_id, etc.) when online.
- If you import an offline export but later have network, you can set `enrich_metadata: true` in `queries.md` (or run `arxiv-search --enrich-metadata`) to backfill missing abstracts/authors/categories via arXiv `id_list`.
- Evidence-first expectation (A150++): aim for a large dedup pool (target >=1200, not ~200) and a stable, verifiable core set (`core_size=300`) so later stages can bind wide in-scope citation pools without forcing out-of-scope drift.
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- chapter-skeleton
- section-bindings
- section-briefs
- outline-builder
- section-mapper
- outline-refiner
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
human_checkpoint:
- approve: scope + section skeleton + outline
- write_to: DECISIONS.md
Notes:
- `chapter-skeleton` is the first retrieval-informed chapter contract; it stays chapter-level and does not emit stable H3 ids.
- `section-bindings` measures chapter saturation before H3 decomposition and writes PASS/BLOCKED signals into `outline/section_binding_report.md`.
- `section-briefs` turns the chapter layer into decomposition guidance plus subsection seeds; `outline-builder` then derives the first stable `outline/outline.yml` from that section layer.
- `outline-refiner` now writes both `outline/outline_state.jsonl` and `output/REROUTE_STATE.json`; section-first reroute/block information should be read from those artifacts instead of inferred from prose notes.
- Evidence-first expectation: each subsection should be written as a *question to answer* (RQ) plus *evidence needs* (what kind of citations/results are required), not just generic scaffold bullets.
- Coverage default: `section-mapper` uses `queries.md:per_subsection` as the per-H3 mapping contract (A150++ default: 28) so later evidence binding and writing have enough in-scope citations to choose from.
- Diversity expectation: mapping should not over-reuse a few papers across unrelated H3s; reserve “global” works for genuinely cross-cutting citations (controlled by `global_citation_min_subsections`).
- Budget policy (paper-like): avoid H3 explosion; the outline gate uses `queries.md:draft_profile` to set max H3 (survey<=10, deep<=12).
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent H3s into fewer, thicker units, then rerun `section-mapper` → `outline-refiner` before `Approve C2`.
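
The H3 budget in the notes above can be pictured as a simple count against the per-profile cap. This is a sketch under assumptions: the outline is taken to be a list of sections, each with a `subsections` list, and `h3_budget_overflow` is a hypothetical helper rather than the outline gate itself.

```python
# Caps taken from structure_policy.max_h3_by_profile in the front matter.
MAX_H3_BY_PROFILE = {"survey": 10, "deep": 12}


def h3_budget_overflow(outline, draft_profile="survey"):
    """Return how many H3 subsections exceed the profile's cap (0 means within budget)."""
    total = sum(len(section.get("subsections") or []) for section in outline)
    return max(0, total - MAX_H3_BY_PROFILE[draft_profile])


# Hypothetical outline: 4 sections with 3 subsections each, so 12 H3 units.
outline = [
    {"id": str(i), "title": f"Section {i}",
     "subsections": [{"id": f"{i}.{j}", "title": f"Sub {i}.{j}"} for j in range(1, 4)]}
    for i in range(1, 5)
]
survey_overflow = h3_budget_overflow(outline, "survey")
deep_overflow = h3_budget_overflow(outline, "deep")
```

A positive overflow under `survey` is the situation where merging adjacent H3s with `outline-budgeter` would apply.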
## Stage 3 - Evidence (C3) [NO PROSE]
required_skills:
- pdf-text-extractor
- paper-notes
- subsection-briefs
- chapter-briefs
produces:
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
Notes:
- `queries.md` can set `evidence_mode: "abstract"|"fulltext"` (A150++ default: `abstract`).
- `queries.md` can set `draft_profile: "survey"|"deep"` to control writing gate strictness (A150++ default: `survey`).
- If `evidence_mode: "fulltext"`, `pdf-text-extractor` can be tuned via `fulltext_max_papers`, `fulltext_max_pages`, `fulltext_min_chars`.
- `subsection-briefs` converts each H3 into a verifiable writing card (scope_rule/rq/axes/clusters/paragraph_plan) so writing does not copy outline scaffolds.
- Optional refinement markers (recommended): treat briefs as *contracts*, not scaffolds. If you manually refine them and want to prevent regeneration, create:
  - `outline/subsection_briefs.refined.ok`
  - `outline/chapter_briefs.refined.ok`

  These markers are used as explicit “reviewed/refined” signals and as a freeze switch (scripts won’t overwrite refined briefs).
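
A minimal sketch of the freeze-switch idea: before regenerating an artifact, look for its sibling `.refined.ok` marker. The helper names and the rule for deriving the marker path from the artifact's stem are assumptions for illustration; the bundled scripts may resolve markers differently.

```python
import tempfile
from pathlib import Path


def refined_marker(artifact: Path) -> Path:
    """Map e.g. outline/subsection_briefs.jsonl to outline/subsection_briefs.refined.ok."""
    return artifact.with_name(artifact.stem + ".refined.ok")


def should_regenerate(artifact: Path) -> bool:
    """Regenerate only when no refinement marker freezes the artifact."""
    return not refined_marker(artifact).exists()


with tempfile.TemporaryDirectory() as tmp:
    briefs = Path(tmp) / "outline" / "subsection_briefs.jsonl"
    briefs.parent.mkdir(parents=True)
    briefs.write_text("{}\n", encoding="utf-8")
    before = should_regenerate(briefs)   # no marker yet
    refined_marker(briefs).touch()       # reviewer freezes the briefs
    after = should_regenerate(briefs)    # marker present
```

Keeping the marker next to the artifact makes the reviewed state visible in a plain directory listing.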
## Stage 4 - Citations + evidence packs (C4) [NO PROSE]
required_skills:
- citation-verifier
- evidence-binder
- evidence-draft
- table-schema
- anchor-sheet
- table-filler
- appendix-table-writer
- schema-normalizer
- writer-context-pack
- evidence-selfloop
- claim-matrix-rewriter
optional_skills:
- survey-visuals
produces:
- citations/ref.bib
- citations/verified.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- output/SCHEMA_NORMALIZATION_REPORT.md
- outline/writer_context_packs.jsonl
- output/EVIDENCE_SELFLOOP_TODO.md
- outline/claim_evidence_matrix.md
Notes:
- `evidence-draft` turns paper notes into per-subsection evidence packs (claim candidates + concrete comparisons + eval protocol + limitations) that the writer must follow.
- `claim-matrix-rewriter` makes `outline/claim_evidence_matrix.md` a projection/index of evidence packs (not an outline expansion), so writer guidance stays evidence-first.
- `writer-context-pack` builds a deterministic per-H3 drafting pack (briefs + evidence + anchors + allowed cites), reducing hollow writing and making C5 more debuggable.
- Tables are part of the default survey deliverable, but split into two layers:
  - `outline/tables_index.md` (internal index; produced by `table-filler`; useful for planning/debugging; NOT inserted into the paper)
  - `outline/tables_appendix.md` (reader-facing; produced by `appendix-table-writer`; clean/publishable; inserted into the draft as an Appendix block by `section-merger`)
- Optional: `survey-visuals` can still produce timeline/figure specs as intermediate artifacts.
- Optional refinement markers (recommended): after you spot-check/refine C4 artifacts and want to freeze them, create:
  - `outline/evidence_bindings.refined.ok`
  - `outline/evidence_drafts.refined.ok`
  - `outline/anchor_sheet.refined.ok`
  - `outline/writer_context_packs.refined.ok`

  These markers make “reviewed/refined” explicit and prevent accidental regeneration/overwrite; strict mode relies on content checks (placeholders/blocking_missing/scope), not marker presence.
## Stage 5 - Draft (C5) [PROSE AFTER C2]
required_skills:
- front-matter-writer
- chapter-lead-writer
- subsection-writer
- writer-selfloop
- section-logic-polisher
- argument-selfloop
- paragraph-curator
- style-harmonizer
- opener-variator
- evaluation-anchor-checker
- transition-weaver
- section-merger
- post-merge-voice-gate
- citation-diversifier
- citation-injector
- draft-polisher
- global-reviewer
- pipeline-auditor
- artifact-contract-auditor
optional_skills:
- prose-writer
- subsection-polisher
- redundancy-pruner
- terminology-normalizer
- limitation-weaver
- latex-scaffold
- latex-compile-qa
produces:
- sections/sections_manifest.jsonl
- sections/abstract.md
- sections/discussion.md
- sections/conclusion.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/MERGE_REPORT.md
- output/DRAFT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/GLOBAL_REVIEW.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
Notes:
- C5 writing system (semantic + minimal artifacts; no extra machinery):
  - **Unit of work**: `sections/*.md` (front matter, H2 leads, H3 bodies). Avoid editing `output/DRAFT.md` directly until after merge.
  - **Single source of truth (locked conventions)**: `output/ARGUMENT_SKELETON.md` → `## Consistency Contract` (terminology, scope boundary, evaluation protocol fields, baseline naming).
  - **Write → check → fix (five gates)**:
    1) `writer-selfloop` → `output/WRITER_SELFLOOP_TODO.md`: file existence, depth, citation scope, paper voice.
    2) `section-logic-polisher` → `output/SECTION_LOGIC_REPORT.md`: paragraph linkage (no jump cuts / “paragraph islands”).
    3) `argument-selfloop` → `output/ARGUMENT_SELFLOOP_TODO.md` + `output/ARGUMENT_SKELETON.md` + `output/SECTION_ARGUMENT_SUMMARIES.jsonl`: section-level closure + premise/definition stability.
    4) `paragraph-curator` → `output/PARAGRAPH_CURATION_REPORT.md`: **select → evaluate → subset → fuse** so sections converge (reduce redundancy, strengthen synthesis) without changing citation keys.
    5) `evaluation-anchor-checker` → `output/EVAL_ANCHOR_REPORT.md`: final section-level numeric hygiene sweep so surviving numeric claims keep same-sentence task/metric/constraint context before merge.
  - **Openers-last**: draft the middle first; rewrite paragraph 1 last so it reflects real content (front matter + H3).
- Writing self-loop gate: `subsection-writer` ensures the full `sections/` file set exists (and emits `sections/sections_manifest.jsonl`); `writer-selfloop` blocks until depth/citation-scope/paper-voice checks pass, writing `output/WRITER_SELFLOOP_TODO.md` (PASS/FAIL).
- Argument self-loop gate: `argument-selfloop` blocks “smooth but hollow” sections by making the argument chain explicit (per-section paragraph moves + a global dependency skeleton). Its ledgers are intermediate artifacts and must never be merged into the paper.
- Style hygiene (C5 hard gate for `survey`/`deep`): treat `output/WRITER_SELFLOOP_TODO.md` Style Smells as mandatory fixes. Run `style-harmonizer` + `opener-variator` on flagged files, then rerun `writer-selfloop` before merge.
- Numeric hygiene gate: run `evaluation-anchor-checker` after the last section-level rewrites (`paragraph-curator` + `style-harmonizer` + `opener-variator`) and before merge. If later section rewrites touch the same H3 files, rerun it; do not wait for `pipeline-auditor` to discover underspecified numbers in the merged draft.
- Micro-fix routing (preferred over broad rewrites): if Style Smells are specific, use targeted micro-skills before a general harmonize pass:
- opener cadence / “overview” narration → `opener-variator`
- count-based limitation slots (“Two limitations…”) → `limitation-weaver`
- underspecified numeric/performance claims (missing task/metric/budget) → `evaluation-anchor-checker`
- Triage rule (prevents “patching evidence gaps with prose”): if `writer-selfloop` FAILs because a subsection cannot meet `must_use` *in-scope* (thin packs / missing anchors / out-of-scope citation pressure), stop and rerun the evidence loop (`evidence-selfloop` + upstream C2/C3/C4) instead of padding prose.
- WebWeaver-style “planner vs writer” split (single agent, two passes):
- Planner pass: for each section/subsection, pick the exact citation IDs to use from the evidence bank (`outline/evidence_drafts.jsonl`) and keep scope consistent with the outline.
- Writer pass: write that section using only those citation IDs; avoid dumping the whole notes set into context.
- Treat this stage as an iteration loop: draft per H3 → logic-polish (thesis + connectors) → weave transitions → merge → de-template/cohere → global review → (if gaps) back to C3/C4 → regenerate.
- Post-merge voice gate: `post-merge-voice-gate` treats `outline/transitions.md` as a high-frequency injection source. If it FAILs, fix the *source* (usually transitions via `transition-weaver`, or the owning `sections/*.md`) and re-merge; do not “patch around it” in `draft-polisher`.
- Depth target (profile-aware): each H3 should be “few but thick” (avoid stubs). Use `queries.md:draft_profile` as the contract:
- `survey`: >=10 paragraphs + >=12 unique cites
- `deep`: >=11 paragraphs + >=14 unique cites
In all profiles, require >=2 concrete contrasts + evaluation anchoring + a cross-paper synthesis paragraph + an explicit limitation.
- Profile semantics: `survey` is the default deliverable contract; `deep` is stricter (and typically pairs well with `evidence_mode: fulltext`).
- Coherence target (paper-like): for every H2 chapter with H3 subsections, write a short **chapter lead** block (`sections/S<sec_id>_lead.md`) that previews the comparison axes and how the H3s connect (no new headings; avoid generic glue).
- Anti-template style contract (paper-like, not “outline narration”):
- Avoid meta openers like “This subsection surveys/argues …” and slide-like navigation (“Next, we move from … / We now turn to …”).
- Keep signposting light: avoid repeating a literal opener label across many subsections (e.g., `Key takeaway:`); vary opener phrasing and cadence.
- Tone target: calm, academic, understated; delete hype words (`clearly`, `obviously`) and “PPT speaker notes”.
- Keep evidence-policy disclaimers **once** in front matter (not repeated across H3s).
- If you cite numbers, include minimal evaluation context (task + metric + constraint/budget/cost) in the same paragraph.
- Citation shape must be reader-facing: no adjacent citation blocks (e.g., `[@a] [@b]`), no duplicate keys in one block (e.g., `[@a; @a]`), and avoid tail-only citation style by keeping mid-sentence citations in each H3.
- `section-merger` merges `sections/*.md` plus `outline/transitions.md` (within-chapter H3→H3 by default). Between-H2 transition insertion is optional: create `outline/transitions.insert_h2.ok` in the workspace if you want narrator-style handoffs included.
- Tables are part of the default deliverable: `outline/tables_appendix.md` is inserted into the draft by `section-merger` as a single Appendix block (index tables in `outline/tables_index.md` remain intermediate) unless `outline/tables.insert.off` exists. Other visuals (`outline/timeline.md`, `outline/figures.md`) remain intermediate by default.
- Citation scope policy: citations are subsection-first (from `outline/evidence_bindings.jsonl`), with limited reuse allowed within the same H2 chapter to reduce brittleness; avoid cross-chapter “free cite” drift.
- Controlled flexibility: bibkeys mapped to >= `queries.md:global_citation_min_subsections` subsections (A150++ default: 4) are treated as cross-cutting/global; see `allowed_bibkeys_global` in writer packs / `sections_manifest.jsonl`.
- If global unique citations are low, run `citation-diversifier` → `citation-injector` *before* `draft-polisher` (the polisher treats citation keys as immutable).
- `queries.md` can set `citation_target: recommended|hard` to control whether the recommended target is enforced as blocking (default: `recommended` for A150++).
- If you intentionally add/remove citations after an earlier polish run, reset the citation-anchoring baseline before rerunning `draft-polisher`:
- delete `output/citation_anchors.prepolish.jsonl` (workspace-local), then rerun `draft-polisher`.
- Recommended skills (toolkit, not a rigid one-shot chain):
- Modular drafting: `subsection-writer` → `writer-selfloop` → `section-logic-polisher` → `argument-selfloop` → `paragraph-curator` → `style-harmonizer` → `opener-variator` → `evaluation-anchor-checker` → `transition-weaver` → `section-merger` → `draft-polisher` → `global-reviewer` → `pipeline-auditor`.
- Legacy one-shot drafting: `prose-writer` (kept for quick experiments; less debuggable).
- If the draft reads like “paragraph islands”, run `section-logic-polisher` and patch only failing `sections/S*.md` until PASS, then merge.
- Add `pipeline-auditor` after `global-reviewer` as a regression test (blocks on ellipsis, repeated boilerplate, and citation hygiene).
- If you also need a PDF deliverable, use `latex-scaffold` + `latex-compile-qa` (see `arxiv-survey-latex`).
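The citation-shape rule in the notes above (no adjacent blocks such as `[@a] [@b]`, no duplicate keys such as `[@a; @a]`, and some mid-sentence citations in each H3) can be sketched as a small checker. This is a minimal illustration, not the pipeline's own implementation: the function names are hypothetical, and "mid-sentence" is assumed here to mean "the citation block is not immediately followed by sentence-final punctuation".

```python
import re

# One pandoc-style citation block such as [@a] or [@a; @b].
CITE_BLOCK = re.compile(r"\[@[^\]]+\]")
# Two blocks separated only by whitespace, e.g. "[@a] [@b]".
ADJACENT_BLOCKS = re.compile(r"\[@[^\]]+\]\s*\[@[^\]]+\]")


def keys_in_block(block: str) -> list[str]:
    """Return the citation keys inside one block, in order of appearance."""
    return re.findall(r"@([^;\]\s,]+)", block)


def citation_shape_issues(text: str) -> list[str]:
    """List shape problems: adjacent blocks and duplicate keys within a block."""
    issues = []
    if ADJACENT_BLOCKS.search(text):
        issues.append("adjacent citation blocks")
    for block in CITE_BLOCK.findall(text):
        keys = keys_in_block(block)
        if len(keys) != len(set(keys)):
            issues.append(f"duplicate keys in {block}")
    return issues


def mid_sentence_ratio(text: str) -> float:
    """Share of citation blocks NOT directly followed by '.', '!' or '?'.

    Assumption: this is only a rough proxy for "mid-sentence" placement.
    """
    blocks = list(CITE_BLOCK.finditer(text))
    if not blocks:
        return 0.0
    mid = 0
    for match in blocks:
        rest = text[match.end():].lstrip()
        if rest and rest[0] not in ".!?":
            mid += 1
    return mid / len(blocks)
```

A check like this could run per `sections/S*.md` file before merge, so shape problems are fixed at the source rather than in the merged draft.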
## Quality gates (strict mode)
- Citation coverage: expect a large, verifiable bibliography (A150++ default: `core_size=300` → `ref.bib` ~300) and high cite density:
- Per-H3: `survey` profile expects >=12 unique citations per H3 (and deeper profiles may require more).
- Front matter: `survey` profile expects Introduction>=35 and Related Work>=50 unique citations (dense positioning; no cite dumps).
- Global: `pipeline-auditor` gates on **global unique citations across the full draft**. A150++ defaults: hard `>=150`; recommended `>=165` (when bib=300). `queries.md:citation_target` controls which is blocking (default: `recommended`). If it fails, prefer `citation-diversifier` → `citation-injector` (in-scope, NO NEW FACTS) using each H3’s `allowed_bibkeys_selected` / `allowed_bibkeys_mapped` from `outline/writer_context_packs.jsonl`.
- Anti-template: drafts containing ellipsis placeholders (`…`) or leaked scaffold instructions (e.g., "enumerate 2-4 ...") should block and be regenerated from improved outline/mapping/evidence artifacts.
- Final polish hard gates (`survey`/`deep`): block on narration-template openers (e.g., `This subsection ...`), slide navigation phrasing, repeated opener stems and verbal tics, adjacent citation blocks (`[@a] [@b]`), duplicate keys in one block (`[@a; @a]`), and low H3 mid-sentence citation ratio (<30%).
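The global unique-citation gate described above can be sketched as follows. This is an assumption-laden illustration, not `pipeline-auditor` itself: the function names are hypothetical, the thresholds are the A150++ defaults quoted in the text (hard `>=150`, recommended `>=165`), and the reading that `citation_target: hard` makes only the hard minimum blocking is an interpretation of the note above.

```python
import re


def unique_citation_keys(draft: str) -> set[str]:
    """Collect every distinct key cited in pandoc-style blocks like [@a; @b]."""
    keys: set[str] = set()
    for block in re.findall(r"\[@[^\]]+\]", draft):
        keys.update(re.findall(r"@([^;\]\s,]+)", block))
    return keys


def global_citation_gate(
    draft: str,
    citation_target: str = "recommended",  # mirrors queries.md:citation_target
    hard_min: int = 150,                    # A150++ hard default from the text
    recommended_min: int = 165,             # A150++ recommended default (bib=300)
) -> dict:
    """Return a PASS/FAIL summary for the global unique-citation count."""
    unique = len(unique_citation_keys(draft))
    # Assumption: "recommended" enforces the recommended target as blocking;
    # any other value falls back to the hard minimum.
    blocking_min = recommended_min if citation_target == "recommended" else hard_min
    return {
        "unique": unique,
        "blocking_min": blocking_min,
        "status": "PASS" if unique >= blocking_min else "FAIL",
        "gap": max(0, blocking_min - unique),
    }
```

The `gap` field is the number of additional unique keys needed, which is the kind of figure a budget report would have to cover with in-scope keys.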
FILE:pipelines/graduate-paper-pipeline.md
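# Pipeline: Chinese graduate thesis restructuring and finalization (research-stage draft)
> Status: research-stage draft
> Current positioning: study the workflow and the skills it needs first; **do not bind UNITS yet, and do not force it into an executable contract**
> Audience: Chinese graduate thesis writing that starts from an existing university template, prior papers, Overleaf sources, PDFs, BibTeX, experiment figures/tables, and intermediate working documents
> Core goal: restructure the “existing materials” into a Chinese graduate thesis with a unified main line, smooth structure, complete evidence, consistent terminology, and a compilable deliverable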
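## 1. What this Pipeline really is
This is neither a linear “write a thesis from scratch” workflow nor an assembly job that “stitches a few papers together”. It is a thesis-engineering Pipeline that is **driven by a question list, centered on a Markdown intermediate layer, and closed out in a TeX delivery layer**.
Its actual main loop is:
> `question list -> semantic localization -> Markdown restructuring -> TeX write-back -> compile review -> write issues back -> enter the next round`
So the first goal of this Pipeline is not to “rush out a first version of the body text”, but to get the following right first:
1. Recover the existing materials first, instead of editing blindly inside `tex`.
2. Rebuild chapter roles around the thesis main line first, instead of copying the original papers' narrative.
3. Think through structure, evidence, terminology, and figures/tables in the Markdown intermediate layer first, then write back to TeX.
4. Resolve structure and evidence problems first, then style and “AI flavor” problems.
5. The real convergence point is not “one full pass written”, but “the question list is emptied round by round, and compilation and review pass stably”.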
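## 2. The fundamental difference between this Pipeline and the survey Pipeline
It differs substantially from `arxiv-survey` / `arxiv-survey-latex` and must be modeled separately:
- the survey pipeline starts from “retrieve literature for a topic and converge the structure layer by layer”;
- the graduate-paper pipeline starts from “inventory and restructure existing materials”;
- the survey pipeline's core intermediate artifacts are `outline/evidence_drafts/...`;
- the graduate-paper pipeline's core intermediate artifacts are the outline, question list, intermediate chapter drafts, figure/table plan, and review records in `codex_md/`;
- the survey pipeline's main goal is “produce a survey”;
- the graduate-paper pipeline's main goal is “restructure existing research work into a finalized thesis that meets Chinese degree-thesis conventions”.
In one sentence:
> survey is more like a “retrieval-driven evidence-writing workflow”;
> graduate-paper is more like an “existing-material-driven thesis-engineering restructuring workflow”.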
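## 3. Workspace layers and key artifacts
This Pipeline needs a clear separation between the **thinking layer** and the **delivery layer**.
### 3.1 Directory responsibilities
- `main.tex`: the thesis entry point; assembles the cover, abstract, table of contents, body, appendices, acknowledgements, achievements, and references.
- `chapters/`: final delivery-version body `.tex` files.
- `abstract/`, `preface/`, `acknowledgement/`, `achievements/`: formal non-body deliverables.
- `references/`: bibliography database and style files.
- `pdf/`: PDFs of published or submitted papers.
- `Overleaf_ref/`: existing source drafts, revision drafts, and supplementary material.
- `codex_md/`: intermediate working layer; this is the **thesis restructuring workspace**.
- `claude_md/`: review checklists and final-draft verification records.
- `tmp_layout/`, `tmp_layout2/`: figure/table trial layouts and page-layout rehearsals.
The most important distinction is:
- `codex_md/` is the **thinking / restructuring area**
- `chapters/` is the **delivery / finalization area**
Do not mix the two.
### 3.2 Recommended fixed intermediate artifacts
The following artifacts should be kept fixed for the long term in this Pipeline:
- `codex_md/material_index.md`: inventory and index of existing materials
- `codex_md/material_readiness.md`: material readiness and gap notes
- `codex_md/missing_info.md`: list of missing information
- `codex_md/question_list.md`: this round's issue list, priorities, and acceptance criteria
- `codex_md/00_thesis_outline.md`: thesis main line, outline, and chapter roles
- `codex_md/chapter_role_map.md`: mapping table of source material -> chapter -> role
- `codex_md/chapter_rewrite_rules.md`: chapter restructuring rules
- `codex_md/terminology_glossary.md`: unified terminology table
- `codex_md/symbol_metric_table.md`: unified symbol / metric table
- `codex_md/figure_plan.md`: figure/table and layout plan
- `claude_md/review_checklist.md`: review checklist for compilation, typesetting, data, and template items
- `output/THESIS_BUILD_REPORT.md`: compile and final-draft check report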
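## 4. The skills this Pipeline actually needs
For now, setting aside reuse and generalization and splitting purely by what this graduate thesis Pipeline itself needs, the first version should use the following skills.
### 4.1 Main-chain skills
1. `thesis-workspace-init`
Responsibility: initialize the thesis project, remind the user where to place materials, set up the workspace, check whether `main.tex` compiles, and generate the material inventory.
2. `thesis-question-list`
Responsibility: create and maintain `question_list.md`, pinning down each round's revision goals, issue priorities, boundaries, and acceptance criteria.
This is the **control-plane skill** of the whole Pipeline.
3. `thesis-source-role-mapper`
Responsibility: map existing papers / templates / source drafts / PDFs / figure materials to “thesis roles”, rather than doing only `paper -> chapter`.
4. `thesis-chapter-reconstructor`
Responsibility: around the thesis main line, rebuild each chapter's goal, weight, handoffs, and narrative approach.
This is the **core restructuring skill** of the whole Pipeline.
5. `thesis-markdown-aligner`
Responsibility: unify the main line, terminology, symbols, metrics, and figure/table conventions, so that the chapters converge in the Markdown intermediate layer into one thesis rather than a set of papers.
6. `thesis-tex-writeback`
Responsibility: write the content already sorted out in the Markdown layer back to `chapters/*.tex`, and sync figures/tables, equations, cross-references, and chapter handoffs.
7. `thesis-compile-review`
Responsibility: run compilation, warning triage, template-mode checks, data-consistency verification, and issue write-back.
It forms the final quality loop.
8. `thesis-style-polisher`
Responsibility: once structure, evidence, and data are stable, polish toward Chinese degree-thesis style and remove “AI flavor”.
### 4.2 Parallel side-branch skills
9. `thesis-visual-layout-planner`
Responsibility: figure/table planning, Mermaid sketches, temporary page composition, and text-figure rhythm checks.
10. `thesis-frontmatter-sync`
Responsibility: sync the abstract, cover, appendices, achievements, acknowledgements, lists, and other non-body parts, so they are not left to be filled in at the very end.
11. `thesis-citation-enhance-review`
Responsibility: locate sentences that must be backed by citations, expand the candidate literature, verify that citations match the claims, and write back to `references/*.bib` and the in-text citations.
### 4.3 How these 11 skills relate
The main chain is:
`thesis-workspace-init -> thesis-question-list -> thesis-source-role-mapper -> thesis-chapter-reconstructor -> thesis-markdown-aligner -> thesis-tex-writeback -> thesis-compile-review -> thesis-style-polisher`
The parallel side branches are:
- `thesis-visual-layout-planner`
- `thesis-frontmatter-sync`
- `thesis-citation-enhance-review`
Among them:
- `thesis-question-list` is the control plane for the whole workflow
- `thesis-chapter-reconstructor` is the core restructuring plane
- `thesis-compile-review` is the quality-loop plane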
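## 5. Pipeline overview (by stage)
| Stage | Goal | Core skills | Main outputs |
|---|---|---|---|
| Stage 0 | Project initialization and material intake | `thesis-workspace-init` | Directory skeleton, initial compile, material index, material readiness notes |
| Stage 1 | Recover existing materials into the Markdown intermediate layer | `thesis-workspace-init` | Initial Markdown materials, missing-information list, material gap notes |
| Stage 1.5 | Lock this round's issues and boundaries | `thesis-question-list` | `question_list.md` |
| Stage 2 | Map source materials to thesis roles | `thesis-source-role-mapper` | Chapter role map, materials grouped by chapter |
| Stage 2.5 | Restructure chapters around the main line | `thesis-chapter-reconstructor` | Restructured chapter Markdown |
| Stage 3 | Align, merge, and unify the whole-thesis structure | `thesis-markdown-aligner` | Stable outline, terminology table, symbol table, evidence-gap list |
| Stage 3.5 | Figure/table and layout rehearsal | `thesis-visual-layout-planner` | Figure plan, text-figure mapping, temporary layout results |
| Stage 4 | Write back to the TeX delivery layer | `thesis-tex-writeback` | Compilable chapter `.tex`, first complete `main.pdf` |
| Stage 4.5 | Sync non-body parts | `thesis-frontmatter-sync` | Synced versions of abstract, appendices, cover, achievements, etc. |
| Stage 5 | Citation enhancement and verification | `thesis-citation-enhance-review` | Citation reinforcement results, verification records |
| Stage 6 | Compilation, typesetting, and final-draft review | `thesis-compile-review` | `THESIS_BUILD_REPORT`, review checklist |
| Stage 7 | Final polish and “AI flavor” removal | `thesis-style-polisher` | A more natural Chinese final draft |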
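## 6. Stage-by-stage details
### Stage 0: Project initialization and material intake
**Goal**
First turn the thesis into a maintainable project, rather than a pile of scattered files.
**Main inputs**
- University template
- Existing repository / source drafts
- Basic metadata such as student ID, year, and Chinese and English titles
- Existing `main.tex`
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Basic directory skeleton
- Initial `main.tex` / initial `main.pdf`
- `codex_md/material_index.md`
- `codex_md/material_readiness.md`
**Handoffs**
- If `main.tex` does not compile yet, do not enter the body restructuring stages
- If materials are not yet in place, do not start the question list or chapter restructuring
**Execution focus**
- Make clear which areas are material areas, which are working areas, and which are delivery areas
- First guarantee “it compiles”, then talk about “it is well written”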
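### Stage 1: Recover existing materials into the Markdown intermediate layer
**Goal**
Recover the reusable content from the existing `template / tex / Overleaf / PDF / bib` materials into a Markdown intermediate layer, rather than forcing edits directly in the TeX layer.
**Main inputs**
- University template
- Existing `.tex`
- Overleaf source drafts
- PDFs
- Bibliography database
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Initial Markdown chapter materials
- `codex_md/material_readiness.md`
- `codex_md/missing_info.md`
- Material index and list of information still to be supplied
**Handoffs**
- Stage 1 outputs feed directly into `thesis-question-list` and `thesis-source-role-mapper`
**Execution focus**
- The point is a “material asset inventory”, not “writing pretty sentences first”
- Explicitly mark experiment details, figure captions, citations, and numeric checks that still need to be supplied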
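### Stage 1.5: Lock the question list and this round's goals
**Goal**
Establish this round's control plane, and make clear what this round fixes and what it does not.
**Main inputs**
- Initial Markdown materials
- Review feedback from the previous round
- Current compilation / structure / style issues
**Core skill**
- `thesis-question-list`
**Main outputs**
- `codex_md/question_list.md`
**Handoffs**
- This step is the starting point for all later restructuring and write-back
- Any structural dispute should come back here first to be re-prioritized
**Execution focus**
- Every issue must state clearly: what it is, why it matters, how to change it, and what level counts as accepted
- The question list is not a memo; it is the entry point for iteration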
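### Stage 2: Map source materials to thesis roles
**Goal**
Map the existing materials to chapter roles in the thesis, rather than making a mechanical “paper to chapter” assignment.
**Main inputs**
- Existing papers, PDFs, source drafts, figures/tables
- Initial Markdown materials
- `question_list.md`
**Core skill**
- `thesis-source-role-mapper`
**Main outputs**
- `codex_md/chapter_role_map.md`
- Markdown materials grouped by chapter
- Content boundaries for each chapter
**Handoffs**
- This is the direct input to `thesis-chapter-reconstructor`
**Execution focus**
- Distinguish: method core / background support / validation evidence / system implementation / limitation and future-work material
- If this is set wrong, the whole thesis will be skewed afterwards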
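### Stage 2.5: Restructure chapters around the main line
**Goal**
Transform the original paper-style narrative into a thesis-style narrative.
**Main inputs**
- Chapter role map
- `question_list.md`
- Markdown drafts of each chapter
**Core skill**
- `thesis-chapter-reconstructor`
**Main outputs**
- Restructured chapter Markdown
- `codex_md/chapter_rewrite_rules.md`
**Handoffs**
- This is the core stage of the whole Pipeline
- All later alignment, write-back, and compilation build on the restructuring results produced here
**Execution focus**
- Make clear which question each chapter has to answer
- Decide which content to strengthen, weaken, move up, or push down
- Rewrite the handoffs to the preceding and following chapters
- Change the “selling-point narrative” into a “thesis main-line narrative”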
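### Stage 3: Markdown alignment, merging, and unification
**Goal**
Make the whole thesis become “one thesis” in the intermediate layer first, rather than a concatenation of several papers.
**Main inputs**
- Restructured chapter Markdown
- `question_list.md`
**Core skill**
- `thesis-markdown-aligner`
**Main outputs**
- Stable version of `codex_md/00_thesis_outline.md`
- `codex_md/terminology_glossary.md`
- `codex_md/symbol_metric_table.md`
- Evidence-gap list
**Handoffs**
- While structure, terminology, or metrics are unstable, do not enter `tex-writeback`
**Execution focus**
- Unify the research main line, terminology, abbreviations, symbols, metrics, and figure/table conventions
- Decide which content is explained once in Chapter 2, which is developed in Chapters 3-5, and which goes to the appendix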
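### Stage 3.5: Figure/table and layout rehearsal
**Goal**
Plan figures/tables and layout before the body is finalized; do not let figures and tables turn into a last-minute patch job.
**Main inputs**
- Stable outline
- Chapter Markdown
- Existing figures/tables and experiment results
**Core skill**
- `thesis-visual-layout-planner`
**Main outputs**
- `codex_md/figure_plan.md`
- `mermaid/` diagram drafts
- `tmp_layout/`, `tmp_layout2/` trial layout results
**Handoffs**
- The figure/table plan directly affects the text-figure structure in `tex-writeback`
**Execution focus**
- Make clear which figures/tables each chapter needs to support the main line
- Sketch first, then do temporary page composition, then decide final placement and captions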
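### Stage 4: Write back to the TeX delivery layer
**Goal**
Write the content already sorted out in the Markdown layer back to `chapters/*.tex`, and enter the delivery layer.
**Main inputs**
- Stable Markdown chapters
- Figure/table plan
- Unified terminology / symbol / metric tables
**Core skill**
- `thesis-tex-writeback`
**Main outputs**
- `chapters/*.tex`
- First complete `main.pdf`
**Handoffs**
- If a structural problem is found at this stage, the rule is to go back and fix it in the Markdown layer first, not to reinvent the structure in the TeX layer
**Execution focus**
- Sync figures/tables, equations, cross-references, chapter-opening introductions, and chapter-end summaries
- The TeX layer is responsible for delivery, not for rethinking structure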
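### Stage 4.5: Sync non-body parts
**Goal**
Maintain the abstract, appendices, cover, achievements, acknowledgements, and other non-body parts in parallel, so they are not left to the end and filled in at low quality.
**Main inputs**
- Stable version of the body
- University template items
- Chinese and English titles and abstract information
**Core skill**
- `thesis-frontmatter-sync`
**Main outputs**
- `abstract/abstract.tex`
- `abstract/abstract-en.tex`
- `preface/...`
- `appendix/...`
- `acknowledgement/...`
- `achievements/...`
**Handoffs**
- The final-draft review in Stage 6 checks these parts together with the body
**Execution focus**
- Chinese and English titles and abstracts use consistent wording and scope
- Numbers, metrics, and method names in the abstract match the body
- Appendices hold only supplementary content and do not repeat the main line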
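### Stage 5: Literature enhancement and citation verification
**Goal**
Systematically fill in the “sentences that should be backed by citations”, and make sure citations match the claims.
**Main inputs**
- Current Markdown / TeX version
- `references/*.bib`
- Existing PDFs / reference works
**Core skill**
- `thesis-citation-enhance-review`
**Main outputs**
- Enhanced bibliography database
- Citation reinforcement records
- Citation verification records
**Handoffs**
- Literature enhancement can run in parallel with Stages 3 and 4, but it must be closed out before the final polish
**Execution focus**
- Prioritize sentences that must have citation support: factual descriptions, method taxonomies, classic conclusions, metric definitions, and data sources
- Get citations right first, then add more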
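### Stage 6: Compilation, typesetting, and final-draft review
**Goal**
Form a quality loop: not just “it compiles”, but “it is fit to submit”.
**Main inputs**
- Complete TeX project
- Synced non-body parts
- Review checklist
**Core skill**
- `thesis-compile-review`
**Main outputs**
- `output/THESIS_BUILD_REPORT.md`
- `claude_md/review_checklist.md`
- List of closed / still-open issues
**Handoffs**
- Issues found in Stage 6 determine which level to return to for the fix: Stage 1.5 / 2.5 / 3 / 4
**Execution focus**
- A successful compile is only the minimum requirement
- Warnings need to be triaged: blocks submission / must fix / record and observe
- Also check: data consistency, cite key validity, whether template parameters are correct, and whether terminology and abbreviations are unified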
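The cite-key validity check in the Stage 6 list above can be sketched as a small script. This is a minimal illustration under stated assumptions, not the behavior of `thesis-compile-review`: the function names are hypothetical, the regexes cover only common `\cite`-family commands and plain `@type{key,` BibTeX entries, and reading the actual `chapters/*.tex` and `references/*.bib` files is left to the caller.

```python
import re

# \cite, \citep, \citet, ... with optional star and optional [..] arguments.
CITE_CMD = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# BibTeX entry head such as "@article{key," (the key ends at the first comma).
BIB_ENTRY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def cited_keys(tex: str) -> set[str]:
    """Collect every key used in a \\cite-style command in the TeX source."""
    keys: set[str] = set()
    for group in CITE_CMD.findall(tex):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys


def bib_keys(bib: str) -> set[str]:
    """Collect every entry key defined in the BibTeX source."""
    return set(BIB_ENTRY.findall(bib))


def missing_cite_keys(tex: str, bib: str) -> list[str]:
    """Return cited keys that have no matching BibTeX entry, sorted."""
    return sorted(cited_keys(tex) - bib_keys(bib))
```

A non-empty result would be a natural candidate for the “must fix” tier, since an undefined key typically surfaces as a citation warning in the compiled PDF.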
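### Stage 7: Final polish and “AI flavor” removal
**Goal**
Only after structure, evidence, and data are stable, polish toward Chinese degree-thesis style.
**Main inputs**
- The thesis version that has passed compilation and review
- `中文写作要求.md`
- `GPT口癖与高频用词调研.md`
**Core skill**
- `thesis-style-polisher`
**Main outputs**
- A more natural Chinese final draft
- De-templated introductions, summaries, and conclusion-and-outlook sections
**Handoffs**
- Stage 7 must come last
- If polishing exposes structure or evidence problems, fall back to the earlier stages rather than covering them up with polish
**Execution focus**
- Prioritize chapter-opening introductions, chapter-end summaries, contribution statements, and the conclusion and outlook
- Tone down template voice, promotional voice, and common AI verbal tics
- The goal is “reads more like a degree thesis”, not “reads more like an AI-optimized paper”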
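## 7. The four fixed loops of this Pipeline
To keep it from degrading back into a “linear SOP”, the following four loops should be fixed:
### 7.1 Structure loop
`outline -> a chapter's md -> chapter imbalance found -> back to outline / question_list`
### 7.2 Content loop
`paper / Overleaf / PDF -> extract content -> restructure narrative -> still reads like the original paper -> restructure again`
### 7.3 Typesetting loop
`tex -> compile -> warning / figure placement / cross-reference problems found -> back to tex or figure planning`
### 7.4 Style loop
`body stable -> AI polish -> template voice / AI flavor found -> back to writing rules and wording control`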
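## 8. Execution priority of this Pipeline
If resources are limited at run time, the priority should be:
1. `thesis-question-list`
2. `thesis-source-role-mapper`
3. `thesis-chapter-reconstructor`
4. `thesis-markdown-aligner`
5. `thesis-tex-writeback`
6. `thesis-compile-review`
In other words:
- What truly cannot be skipped is “question list + material role mapping + chapter restructuring + compile review”
- What truly should not be done first is “polishing and AI flavor removal”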
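## 9. Current conclusion
The first principle of this graduate-paper pipeline should be:
> **Restructure first, then write; converge in the intermediate layer first, then deliver in TeX; get evidence and structure right first, then style and polish.**
So if this Pipeline is later made formally executable, the thing most worth implementing first is not “one large all-in-one runner”, but making the following skills solid:
- `thesis-workspace-init`
- `thesis-question-list`
- `thesis-source-role-mapper`
- `thesis-chapter-reconstructor`
- `thesis-compile-review`
Only once these five skills are solid does this Chinese graduate thesis Pipeline have real practical value.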
FILE:pipelines/idea-brainstorm.pipeline.md
---
name: idea-brainstorm
version: 4.0
profile: idea-brainstorm
routing_hints: [idea, ideation, brainstorm, 点子, 选题, 找方向, 找 idea]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.idea-brainstorm.csv
contract_model: pipeline.frontmatter/v1
query_defaults:
draft_profile: idea_brainstorm
max_results: 1800
core_size: 100
evidence_mode: abstract
direction_pool_min: 12
direction_pool_max: 24
idea_screen_top_n: 10
idea_shortlist_size: 5
report_top_n: 3
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- evidence_mode
- direction_pool_min
- direction_pool_max
- idea_screen_top_n
- idea_shortlist_size
- report_top_n
- time_window.from
- time_window.to
quality_contract:
memo_bundle:
required_outputs: [output/REPORT.md, output/APPENDIX.md, output/REPORT.json]
signal_policy:
min_rows: 10
direction_policy:
shortlist_min: 3
shortlist_max: 5
keep_min: 3
cluster_diversity_min: 2
lead_diversity_target: 3
lead_diversity_axes: [cluster, direction_type, program_kind]
screening_policy:
keep_rank_max: 7
maybe_rank_max: 12
score_weights:
discussion_worthiness: 0.24
academic_value: 0.22
evidence_grounding: 0.18
direction_distinctness: 0.16
first_probe_clarity: 0.10
thesis_potential: 0.10
loop_policy:
stage_retry_budget:
C1: 2
C3: 1
C4: 1
max_reroutes: 4
require_human_on_retry_after_approval: true
stages:
C0:
title: Init + idea brief
mode: no_prose
required_skills: [workspace-init, pipeline-router, idea-brief, human-checkpoint]
optional_skills: []
produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/trace/IDEA_BRIEF.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
human_checkpoint:
approve: brainstorm brief
write_to: DECISIONS.md
C1:
title: Retrieval + core set
mode: no_prose
required_skills: [literature-engineer, dedupe-rank]
optional_skills: []
produces: [papers/papers_raw.jsonl, papers/papers_dedup.jsonl, papers/core_set.csv, papers/retrieval_report.md]
C2:
title: Idea landscape / focus
mode: no_prose
required_skills: [taxonomy-builder, pipeline-router, human-checkpoint]
optional_skills: []
produces: [outline/taxonomy.yml, DECISIONS.md]
human_checkpoint:
approve: focus clusters / lenses + exclusions
write_to: DECISIONS.md
C3:
title: Evidence signals
mode: no_prose
required_skills: [paper-notes, idea-signal-mapper]
optional_skills: []
produces: [papers/paper_notes.jsonl, papers/evidence_bank.jsonl, output/trace/IDEA_SIGNAL_TABLE.md, output/trace/IDEA_SIGNAL_TABLE.jsonl]
C4:
title: Direction pool + screening
mode: short_prose_ok
required_skills: [idea-direction-generator, idea-screener]
optional_skills: []
produces: [output/trace/IDEA_DIRECTION_POOL.md, output/trace/IDEA_DIRECTION_POOL.jsonl, output/trace/IDEA_SCREENING_TABLE.md, output/trace/IDEA_SCREENING_TABLE.jsonl]
C5:
title: Shortlist + memo synthesis + self-loop
mode: prose_allowed
required_skills: [idea-shortlist-curator, idea-memo-writer, deliverable-selfloop, artifact-contract-auditor]
optional_skills: []
produces: [output/trace/IDEA_SHORTLIST.md, output/trace/IDEA_SHORTLIST.jsonl, output/REPORT.md, output/APPENDIX.md, output/REPORT.json, output/DELIVERABLE_SELFLOOP_TODO.md, output/CONTRACT_REPORT.md]
---
# Pipeline: research idea brainstorm (signals -> directions -> memo)
Goal: produce a **discussion-ready research-idea memo** for PI / PhD readers.
This pipeline is for “找 research idea / brainstorm / 选题 / 找方向” requests (finding a research idea, brainstorming, choosing a topic, finding a direction).
It is **not** for writing a survey draft and **not** for generating execution-grade project specs.
The terminal deliverable is a single default entrypoint:
- `output/REPORT.md`
That memo should feel like a mature brainstorm artifact:
- grounded in literature,
- small enough to discuss,
- clear about what is promising,
- honest about uncertainty,
- and rich enough to guide the next discussion round.
Artifact policy:
- Reader-facing terminal artifacts:
- `output/REPORT.md`
- `output/APPENDIX.md`
- `output/REPORT.json`
- Trace artifacts stay under `output/trace/`.
- The run is only shareable when both layers exist and pass contract audit.
Default profile:
- Retrieval route: `literature-engineer`
- `core_size=100`
- Evidence mode: `abstract`
- Signal table: 10-20 rows
- Direction pool: 12-24
- Shortlist: 3-5
- Final memo lead directions: 3
## Stage 0 - Init + idea brief (C0)
required_skills:
- workspace-init
- pipeline-router
- idea-brief
- human-checkpoint
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
Notes:
- The brief is the single source of truth for topic, audience, constraints, exclusions, targets, and query buckets.
- Default behavior: block once at C0 for human approval of the brief before retrieval.
## Stage 1 - Retrieval + core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
Notes:
- Use multi-query buckets from `queries.md`.
- If recall is too low/noisy, fix it upstream and rerun C1.
## Stage 2 - Idea landscape / focus (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- pipeline-router
- human-checkpoint
produces:
- outline/taxonomy.yml
- DECISIONS.md
human_checkpoint:
- approve: focus clusters / lenses + exclusions
- write_to: DECISIONS.md
Notes:
- Taxonomy is an idea landscape, not a paper outline.
- Default behavior: pause at C2 so the human can choose a few promising focus lenses.
## Stage 3 - Evidence signals (C3) [table-first]
required_skills:
- paper-notes
- idea-signal-mapper
produces:
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
Notes:
- The signal table should capture tensions, missing pieces, and academically meaningful axes rather than proposal-ready wedges.
- Keep this stage compact and table-first; no long prose.
## Stage 4 - Direction pool + screening (C4)
required_skills:
- idea-direction-generator
- idea-screener
produces:
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
Notes:
- Expansion should generate discussion-worthy research directions, not an operator cartesian product.
- Screening should score discussion value, distinctness, evidence grounding, and thesis potential.
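The screening step can be sketched with the `score_weights` and rank cutoffs declared in this pipeline's front matter. This is a hedged illustration, not the `idea-screener` implementation: the function names are hypothetical, per-criterion scores are assumed to lie on a common scale (e.g., 0-1), `keep_rank_max` / `maybe_rank_max` are read as rank cutoffs after sorting by weighted score, and the `drop` label is an assumed name for everything below the `maybe` band.

```python
# Weights copied from quality_contract.screening_policy.score_weights.
SCORE_WEIGHTS = {
    "discussion_worthiness": 0.24,
    "academic_value": 0.22,
    "evidence_grounding": 0.18,
    "direction_distinctness": 0.16,
    "first_probe_clarity": 0.10,
    "thesis_potential": 0.10,
}
KEEP_RANK_MAX = 7    # screening_policy.keep_rank_max
MAYBE_RANK_MAX = 12  # screening_policy.maybe_rank_max


def weighted_score(scores: dict[str, float]) -> float:
    """Weighted sum over the six screening criteria."""
    return sum(weight * scores[name] for name, weight in SCORE_WEIGHTS.items())


def screen(directions: dict[str, dict[str, float]]) -> list[tuple[str, float, str]]:
    """Rank directions by weighted score and label them by rank band."""
    ranked = sorted(
        directions.items(), key=lambda item: weighted_score(item[1]), reverse=True
    )
    result = []
    for rank, (name, scores) in enumerate(ranked, start=1):
        if rank <= KEEP_RANK_MAX:
            label = "keep"
        elif rank <= MAYBE_RANK_MAX:
            label = "maybe"
        else:
            label = "drop"  # assumed label; not named in the source
        result.append((name, round(weighted_score(scores), 4), label))
    return result
```

Because the weights sum to 1.0, the weighted score stays on the same scale as the per-criterion scores, which keeps rows in the screening table comparable at a glance.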
## Stage 5 - Shortlist + memo synthesis + self-loop (C5)
required_skills:
- idea-shortlist-curator
- idea-memo-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- `output/trace/IDEA_SHORTLIST.md` is an internal convergence layer, not the user-facing final answer.
- `output/REPORT.md` is the terminal deliverable and should read like a discussion-ready research idea brainstorm memo.
- `deliverable-selfloop` should evaluate the full memo bundle plus its trace chain.
FILE:pipelines/lit-snapshot.pipeline.md
---
name: lit-snapshot
version: 2.3
profile: lit-snapshot
routing_hints: [snapshot, 快照, one-page, one page, 48h]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- outline/taxonomy.yml
- outline/outline.yml
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.lit-snapshot.csv
---
# Pipeline: literature snapshot (24-48h) [bullets-first]
Goal: a compact, reader-facing snapshot (`output/SNAPSHOT.md`) with a small but usable paper set and a paper-like structure. This is intentionally lighter than `arxiv-survey*` (no evidence packs, no BibTeX, no LaTeX).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- arxiv-search
- dedupe-rank
optional_skills:
- keyword-expansion
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Snapshot default: aim for a smaller but diverse set (e.g., core_set ~20-40) rather than a survey-scale pool.
- Practical retrieval loop (multi-query, then refine):
- Start broad with multiple query buckets (synonyms/acronyms/subtopics) and merge the results.
- If too few: add buckets + widen time window; if too noisy: rewrite keywords + add exclusions; rerun C1.
- If the snapshot feels generic, the fix is almost always upstream: broaden `queries.md` (more synonyms, fewer excludes, wider time window) and rerun C1.
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- outline-builder
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/outline.yml
human_checkpoint:
- approve: scope + outline (snapshot ToC)
- write_to: DECISIONS.md
Notes:
- Paper-like default: prefer fewer, thicker sections. For snapshots, avoid H3 explosion; treat H2 as the main organizing unit.
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent nodes and keep the ToC readable before writing.
## Stage 3 - Snapshot (C3) [SHORT PROSE OK]
required_skills:
- snapshot-writer
- deliverable-selfloop
- artifact-contract-auditor
optional_skills:
- prose-writer
produces:
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- The deliverable is bullets-first and pointer-heavy: every non-trivial claim should attach 1-2 concrete paper pointers from `papers/core_set.csv`.
- Avoid outline narration (e.g., `This section surveys ...`); write content claims + why-it-matters + pointers.
FILE:pipelines/peer-review.pipeline.md
---
name: peer-review
version: 2.3
profile: peer-review
routing_hints: [peer review, review report, referee, 审稿]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PAPER.md
- output/CLAIMS.md
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.peer-review.csv
---
# Pipeline: peer review / referee report
Goal: produce an actionable referee report (`output/REVIEW.md`) that is traceable to the submitted paper’s claims and evidence (no free-floating advice, no invented comparisons).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
## Stage 1 - Claims (C1)
required_skills:
- manuscript-ingest
- claims-extractor
produces:
- output/PAPER.md
- output/CLAIMS.md
Notes:
- Input expectation: you must have the manuscript text available as `output/PAPER.md` (or equivalent extracted text) before running this stage.
- Contract: every claim must include a source pointer (section/page/quote) so later critique is auditable.
## Stage 2 - Evidence audit (C2)
required_skills:
- evidence-auditor
- novelty-matrix
produces:
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
Notes:
- Evidence-auditor should write gaps/risks/verification steps, not “fix the paper”.
- Novelty-matrix should be conservative: compare against cited/known related work; if you lack sources, record that limitation explicitly.
## Stage 3 - Rubric write-up (C3)
required_skills:
- rubric-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Prefer concrete, minimal fixes (what experiment/ablation/analysis would resolve the concern) over generic “needs more experiments”.
FILE:pipelines/systematic-review.pipeline.md
---
name: systematic-review
version: 2.3
profile: systematic-review
routing_hints: [systematic review, systematic, prisma, 系统综述]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PROTOCOL.md
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/screening_log.csv
- papers/extraction_table.csv
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.systematic-review.csv
---
# Pipeline: systematic review (PRISMA-style)
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Protocol (C1)
required_skills:
- protocol-writer
produces:
- output/PROTOCOL.md
human_checkpoint:
- approve: protocol locked (query, inclusion/exclusion, databases, time window)
- write_to: DECISIONS.md
## Stage 2 - Retrieval & candidate pool (C2)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- arxiv-search
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Systematic-review contract: keep the candidate pool auditable. Prefer deduplication, which keeps the pool intact, over aggressive ranking/filtering, which can silently drop eligible papers.
- `dedupe-rank` default: for this pipeline, it keeps the full deduped pool in `papers/core_set.csv` unless you explicitly set `queries.md:core_size` (screening should not silently drop papers).
## Stage 3 - Screening (C3)
required_skills:
- screening-manager
produces:
- papers/screening_log.csv
Notes:
- Use `papers/papers_dedup.jsonl` (or `papers/core_set.csv`) as the candidate list; `papers/screening_log.csv` must include protocol-grounded reasons for every decision.
- Practical auditability: number protocol clauses (`I1..` / `E1..`) and require screening rows to cite them (e.g., `reason_codes=E3`).
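The clause-citation rule above lends itself to an automated check. This is a minimal sketch, assuming screening rows are loaded as dictionaries (for example via `csv.DictReader`) and that multiple codes in `reason_codes` are separated by semicolons; the separator and the helper name `invalid_screening_rows` are assumptions, not part of the pipeline.

```python
from __future__ import annotations

import re

# Clause ids follow the I1.. / E1.. numbering described above.
CLAUSE_RE = re.compile(r"^[IE]\d+$")


def invalid_screening_rows(rows: list[dict[str, str]], known_clauses: set[str]) -> list[tuple[int, str]]:
    """Return (row_index, problem) pairs for rows with missing or unknown reason codes."""
    problems: list[tuple[int, str]] = []
    for idx, row in enumerate(rows):
        raw = (row.get("reason_codes") or "").strip()
        # Assumption: several codes are joined with ';' in one cell.
        codes = [code.strip() for code in raw.split(";") if code.strip()]
        if not codes:
            problems.append((idx, "no reason_codes"))
            continue
        bad = [code for code in codes if not CLAUSE_RE.match(code) or code not in known_clauses]
        if bad:
            problems.append((idx, "unknown codes: " + ", ".join(bad)))
    return problems
```

Passing the set of clause ids taken from the locked protocol as `known_clauses` keeps the check tied to what was actually approved.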
## Stage 4 - Extraction (C4)
required_skills:
- extraction-form
- bias-assessor
produces:
- papers/extraction_table.csv
Notes:
- Extraction is schema-driven: do not write narrative synthesis here; keep all fields consistent and fill bias fields (low/unclear/high + notes).
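The bias-field rule above (low/unclear/high) could be enforced with a small validator. This is a sketch under assumptions: rows are dictionaries read from `papers/extraction_table.csv`, and the bias column names are passed in by the caller because the actual column names are not stated here.

```python
from __future__ import annotations

# Allowed ratings, as listed in the note above.
ALLOWED_BIAS = {"low", "unclear", "high"}


def bias_field_errors(rows: list[dict[str, str]], bias_columns: list[str]) -> list[tuple[int, str, str]]:
    """Return (row_index, column, value) for every bias cell outside the allowed ratings."""
    errors: list[tuple[int, str, str]] = []
    for idx, row in enumerate(rows):
        for column in bias_columns:
            value = (row.get(column) or "").strip().lower()
            if value not in ALLOWED_BIAS:
                errors.append((idx, column, value))
    return errors
```

Empty cells are reported too, since the note asks for the bias fields to be filled rather than left blank.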
## Stage 5 - Synthesis (C5) [PROSE ALLOWED]
required_skills:
- synthesis-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
FILE:pipelines/tutorial.pipeline.md
---
name: tutorial
version: 2.3
profile: tutorial
routing_hints: [tutorial, 教程]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/TUTORIAL_SPEC.md
- outline/concept_graph.yml
- outline/module_plan.yml
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.tutorial.csv
---
# Pipeline: tutorial (teaching loop)
Goal: a tutorial deliverable (`output/TUTORIAL.md`) that has a consistent running example and a real teaching loop (objectives -> steps -> exercises -> verification), not just a blog-style explanation.
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Spec (C1)
required_skills:
- tutorial-spec
produces:
- output/TUTORIAL_SPEC.md
Notes:
- The spec is the scope contract: audience, prerequisites, learning objectives, and a running example (if applicable).
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- concept-graph
- module-planner
- exercise-builder
produces:
- outline/concept_graph.yml
- outline/module_plan.yml
human_checkpoint:
- approve: target audience + scope + running example
- write_to: DECISIONS.md
Notes:
- Treat `outline/module_plan.yml` as the execution contract for writing: modules must have concrete outputs and at least one verifiable exercise each.
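The execution contract above can be expressed as a simple pre-write check. The sketch below assumes each module in `outline/module_plan.yml` loads as a dictionary with `id`, `outputs`, and `exercises` keys; these key names are hypothetical and should be adapted to the real plan schema.

```python
from __future__ import annotations

from typing import Any


def modules_violating_contract(modules: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Map module id to the contract rules it breaks (empty dict means the plan passes)."""
    violations: dict[str, list[str]] = {}
    for module in modules:
        issues: list[str] = []
        if not module.get("outputs"):
            issues.append("no concrete outputs")
        if not module.get("exercises"):
            issues.append("no verifiable exercise")
        if issues:
            violations[str(module.get("id") or "?")] = issues
    return violations
```

Running this before Stage 3 would surface modules that cannot be written against the contract, instead of discovering the gap during prose writing.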
## Stage 3 - Writing (C3) [PROSE ALLOWED]
required_skills:
- tutorial-module-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Write only within the approved scope; if you discover missing prerequisites, record them as a follow-up instead of expanding scope silently.
FILE:references/overview.md
# Chapter Skeleton — Overview
Goal:
- stabilize chapter-level structure before the run commits to H3 ids
Output expectations:
- one record per core chapter
- chapter ids and titles are stable
- include a short rationale and a small seed topic list
- include a target H3 count, but do not emit H3 ids yet
Guardrails:
- chapter-level only
- no reader-facing prose
- do not encode full subsection writing plans here
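To make the output expectations concrete, here is a minimal sketch of one chapter record. The keys mirror those written by the accompanying `scripts/run.py` (`id`, `title`, `rationale`, `seed_topics`, `target_h3_count`); the helper name `chapter_record` and the example values are illustrative only.

```python
from __future__ import annotations

from typing import Any


def chapter_record(section_no: int, title: str, rationale: str, seed_topics: list[str], target_h3: int) -> dict[str, Any]:
    """Build one chapter-level record: a target H3 count is kept, but no H3 ids are emitted."""
    return {
        "id": str(section_no),
        "title": title,
        "rationale": rationale,
        # Keep the seed list small, capped the same way the script caps it.
        "seed_topics": list(seed_topics)[: max(3, target_h3)],
        "target_h3_count": target_h3,
    }
```

For example, `chapter_record(3, "Planning", "Planning within the survey goal", ["search", "decomposition", "replanning", "critique"], 3)` keeps only the first three seed topics and records `target_h3_count` as 3.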
FILE:scripts/run.py
from __future__ import annotations
import argparse
import sys
from pathlib import Path
from typing import Any
def _is_placeholder(text: str) -> bool:
low = (text or "").strip().lower()
if not low:
return True
return "(placeholder)" in low or "<!-- scaffold" in low or "todo" in low
def main() -> int:
parser = argparse.ArgumentParser()
parser.add_argument("--workspace", required=True)
parser.add_argument("--unit-id", default="")
parser.add_argument("--inputs", default="")
parser.add_argument("--outputs", default="")
parser.add_argument("--checkpoint", default="")
args = parser.parse_args()
repo_root = Path(__file__).resolve()
for _ in range(10):
if (repo_root / "AGENTS.md").exists():
break
parent = repo_root.parent
if parent == repo_root:
break
repo_root = parent
sys.path.insert(0, str(repo_root))
from tooling.common import dump_yaml, load_workspace_pipeline_spec, load_yaml, parse_semicolon_list
workspace = Path(args.workspace).resolve()
inputs = parse_semicolon_list(args.inputs) or ["outline/taxonomy.yml", "GOAL.md"]
outputs = parse_semicolon_list(args.outputs) or ["outline/chapter_skeleton.yml"]
taxonomy_path = workspace / inputs[0]
goal_path = workspace / inputs[1] if len(inputs) > 1 else workspace / "GOAL.md"
out_path = workspace / outputs[0]
if out_path.exists() and out_path.stat().st_size > 0:
if not _is_placeholder(out_path.read_text(encoding="utf-8", errors="ignore")):
return 0
taxonomy = load_yaml(taxonomy_path) if taxonomy_path.exists() else None
if not isinstance(taxonomy, list) or not taxonomy:
raise SystemExit(f"Invalid taxonomy in {taxonomy_path}")
spec = load_workspace_pipeline_spec(workspace)
target_h3 = int(getattr(spec, "core_chapter_h3_target", 0) or 3)
goal_line = ""
if goal_path.exists():
for raw in goal_path.read_text(encoding="utf-8", errors="ignore").splitlines():
line = raw.strip()
if line and not line.startswith("#"):
goal_line = line
break
skeleton: list[dict[str, Any]] = []
section_no = 3
for topic in taxonomy:
if not isinstance(topic, dict):
continue
title = str(topic.get("name") or "").strip()
if not title:
continue
desc = str(topic.get("description") or "").strip()
children = topic.get("children") or []
seed_topics = [
str(child.get("name") or "").strip()
for child in children
if isinstance(child, dict) and str(child.get("name") or "").strip()
][: max(3, target_h3)]
rationale = desc or (f"retrieval-informed chapter for {title}" if not goal_line else f"{title} within {goal_line}")
skeleton.append(
{
"id": str(section_no),
"title": title,
"rationale": rationale,
"seed_topics": seed_topics,
"target_h3_count": target_h3,
}
)
section_no += 1
dump_yaml(out_path, skeleton)
return 0
if __name__ == "__main__":
raise SystemExit(main())
FILE:tooling/__init__.py
FILE:tooling/common.py
from __future__ import annotations
import csv
import json
import os
import re
import shutil
import tempfile
from dataclasses import dataclass
from datetime import date, datetime
from pathlib import Path
from typing import Any, Iterable
import yaml
def today_iso() -> str:
return date.today().isoformat()
def now_iso_seconds() -> str:
return datetime.now().replace(microsecond=0).isoformat()
def ensure_dir(path: Path) -> None:
path.mkdir(parents=True, exist_ok=True)
def atomic_write_text(path: Path, content: str) -> None:
ensure_dir(path.parent)
fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
try:
with os.fdopen(fd, "w", encoding="utf-8") as handle:
handle.write(content)
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except OSError:
pass
raise
def backup_existing(path: Path) -> Path:
"""Rename an existing file to a timestamped `.bak.*` sibling and return the backup path."""
if not path.exists():
return path
stamp = datetime.now().replace(microsecond=0).isoformat().replace("-", "").replace(":", "")
backup = path.with_name(f"{path.name}.bak.{stamp}")
counter = 1
while backup.exists():
backup = path.with_name(f"{path.name}.bak.{stamp}.{counter}")
counter += 1
path.replace(backup)
return backup
def parse_semicolon_list(value: str | None) -> list[str]:
if not value:
return []
return [item.strip() for item in value.split(";") if item.strip()]
def normalize_title_for_dedupe(title: str) -> str:
title = title.lower()
title = re.sub(r"[^a-z0-9]+", " ", title)
title = re.sub(r"\s+", " ", title).strip()
return title
def normalize_axis_label(text: str) -> str:
text = re.sub(r"\s+", " ", (text or "").strip().lower())
text = text.rstrip(" .;:,;。")
text = re.sub(r"\s*/\s*", " ", text)
text = re.sub(r"[^a-z0-9]+", " ", text)
return text.strip()
def subsection_brief_generic_axis_norms() -> set[str]:
"""Axis labels that strict survey gates treat as scaffold-level defaults.
Keep this aligned across brief generation and quality-gate checks so the
generator does not promote an axis that the gate later classifies as generic.
"""
axes = {
"core mechanism and system architecture",
"training and data setup",
"evaluation protocol",
"evaluation protocol (benchmarks / metrics / human)",
"evaluation protocol (datasets / metrics / human)",
"evaluation protocol (datasets, metrics, human evaluation)",
"compute and efficiency",
"compute and latency constraints",
"efficiency and compute",
"tool interface contract (schemas / protocols)",
"tool selection / routing policy",
"sandboxing / permissions / observability",
"failure modes and limitations",
}
return {normalize_axis_label(axis) for axis in axes}
def read_jsonl(path: Path) -> list[dict[str, Any]]:
records: list[dict[str, Any]] = []
if not path.exists():
return records
with path.open("r", encoding="utf-8") as handle:
for line in handle:
line = line.strip()
if not line:
continue
records.append(json.loads(line))
return records
def write_jsonl(path: Path, records: Iterable[dict[str, Any]]) -> None:
ensure_dir(path.parent)
lines = [json.dumps(record, ensure_ascii=False) for record in records]
atomic_write_text(path, "\n".join(lines) + ("\n" if lines else ""))
def read_tsv(path: Path) -> list[dict[str, str]]:
if not path.exists():
return []
with path.open("r", encoding="utf-8", newline="") as handle:
reader = csv.DictReader(handle, delimiter="\t")
return [dict(row) for row in reader]
def write_tsv(path: Path, rows: list[dict[str, Any]], fieldnames: list[str]) -> None:
ensure_dir(path.parent)
fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
try:
with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
writer.writeheader()
for row in rows:
writer.writerow({key: row.get(key, "") for key in fieldnames})
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except OSError:
pass
raise
@dataclass(frozen=True)
class UnitsTable:
fieldnames: list[str]
rows: list[dict[str, str]]
@staticmethod
def load(path: Path) -> "UnitsTable":
with path.open("r", encoding="utf-8", newline="") as handle:
reader = csv.DictReader(handle)
fieldnames = list(reader.fieldnames or [])
rows = [dict(row) for row in reader]
return UnitsTable(fieldnames=fieldnames, rows=rows)
def save(self, path: Path) -> None:
ensure_dir(path.parent)
fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
try:
with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
writer = csv.DictWriter(handle, fieldnames=self.fieldnames)
writer.writeheader()
for row in self.rows:
writer.writerow({key: row.get(key, "") for key in self.fieldnames})
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except OSError:
pass
raise
def load_yaml(path: Path) -> Any:
with path.open("r", encoding="utf-8") as handle:
return yaml.safe_load(handle)
def dump_yaml(path: Path, data: Any) -> None:
ensure_dir(path.parent)
text = yaml.safe_dump(
data,
allow_unicode=True,
sort_keys=False,
default_flow_style=False,
width=120,
)
atomic_write_text(path, text)
def copy_tree(src_dir: Path, dst_dir: Path, *, overwrite: bool) -> None:
if not src_dir.is_dir():
raise ValueError(f"Template directory not found: {src_dir}")
ensure_dir(dst_dir)
for src_path in src_dir.rglob("*"):
rel = src_path.relative_to(src_dir)
dst_path = dst_dir / rel
if src_path.is_dir():
ensure_dir(dst_path)
continue
ensure_dir(dst_path.parent)
if dst_path.exists() and not overwrite:
continue
shutil.copy2(src_path, dst_path)
def tokenize(text: str) -> list[str]:
text = text.lower()
text = re.sub(r"[^a-z0-9]+", " ", text)
return [token for token in text.split() if token]
_EN_STOPWORDS = {
"a",
"an",
"and",
"are",
"as",
"at",
"be",
"by",
"for",
"from",
"in",
"is",
"it",
"of",
"on",
"or",
"that",
"the",
"this",
"to",
"with",
"we",
"our",
"via",
"towards",
"toward",
"using",
"use",
"based",
"new",
"towards",
"into",
"over",
"under",
"between",
"within",
"without",
"beyond",
}
_GENERIC_PAPER_WORDS = {
"survey",
"review",
"tutorial",
"paper",
"approach",
"method",
"methods",
"model",
"models",
"framework",
"frameworks",
"system",
"systems",
"learning",
"deep",
"neural",
"network",
"networks",
"analysis",
"benchmark",
"benchmarks",
"dataset",
"datasets",
"evaluation",
"evaluating",
"towards",
"using",
"based",
"study",
"studies",
}
def candidate_keywords(titles: Iterable[str], *, top_k: int, min_freq: int) -> list[str]:
freq: dict[str, int] = {}
for title in titles:
for token in tokenize(title):
if token in _EN_STOPWORDS or token in _GENERIC_PAPER_WORDS:
continue
if len(token) < 3:
continue
freq[token] = freq.get(token, 0) + 1
candidates = [t for t, c in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])) if c >= min_freq]
return candidates[:top_k]
def update_status_log(status_path: Path, line: str) -> None:
ensure_dir(status_path.parent)
if status_path.exists():
existing = status_path.read_text(encoding="utf-8")
else:
existing = "# Status\n"
if "## Run log" not in existing:
existing = existing.rstrip() + "\n\n## Run log\n"
updated = existing.rstrip() + f"\n- {line}\n"
atomic_write_text(status_path, updated)
def update_status_field(status_path: Path, heading: str, value: str) -> None:
heading_line = f"## {heading}".strip()
bullet_line = f"- `{value}`"
if status_path.exists():
lines = status_path.read_text(encoding="utf-8").splitlines()
else:
lines = ["# Status"]
out: list[str] = []
i = 0
updated = False
while i < len(lines):
line = lines[i]
out.append(line)
if line.strip() == heading_line:
if i + 1 < len(lines) and lines[i + 1].lstrip().startswith("-"):
out.append(bullet_line)
i += 2
updated = True
continue
out.append(bullet_line)
updated = True
i += 1
if not updated:
out.extend(["", heading_line, bullet_line])
atomic_write_text(status_path, "\n".join(out).rstrip() + "\n")
def decisions_has_approval(decisions_path: Path, checkpoint: str) -> bool:
if not checkpoint:
return False
if not decisions_path.exists():
return False
text = decisions_path.read_text(encoding="utf-8")
pattern = rf"^\s*-\s*\[[xX]\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b"
return re.search(pattern, text, flags=re.MULTILINE) is not None
def ensure_decisions_approval_checklist(decisions_path: Path) -> None:
if decisions_path.exists():
text = decisions_path.read_text(encoding="utf-8")
else:
text = "# Decisions log\n"
if re.search(r"^##\s+Approvals\b", text, flags=re.MULTILINE):
return
workspace = decisions_path.parent
checkpoints = _human_checkpoints_from_units(workspace)
if not checkpoints:
return
checklist_lines = ["## Approvals (check to unblock)"]
for checkpoint in checkpoints:
hint = _approval_hint(checkpoint)
suffix = f" ({hint})" if hint else ""
checklist_lines.append(f"- [ ] Approve {checkpoint}{suffix}")
checklist_lines.append("")
checklist = "\n".join(checklist_lines)
lines = text.splitlines()
if lines and lines[0].startswith("#"):
new_text = "\n".join([lines[0], "", checklist] + lines[1:]).rstrip() + "\n"
else:
new_text = (checklist + "\n" + text).rstrip() + "\n"
atomic_write_text(decisions_path, new_text)
def set_decisions_approval(decisions_path: Path, checkpoint: str, *, approved: bool) -> None:
checkpoint = checkpoint.strip()
if not checkpoint:
raise ValueError("checkpoint must be non-empty")
ensure_decisions_approval_checklist(decisions_path)
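# The checklist helper returns without writing when no HUMAN checkpoints exist, so the file may still be missing.
text = decisions_path.read_text(encoding="utf-8") if decisions_path.exists() else "# Decisions log\n"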
lines = text.splitlines()
pattern = re.compile(rf"^\s*-\s*\[\s*[xX ]\s*\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b")
updated = False
for idx, line in enumerate(lines):
if pattern.search(line):
lines[idx] = re.sub(
r"\[\s*[xX ]\s*\]",
"[x]" if approved else "[ ]",
line,
count=1,
)
updated = True
break
if not updated:
insert_at = None
for idx, line in enumerate(lines):
if line.strip().startswith("## Approvals"):
insert_at = idx + 1
break
if insert_at is None:
lines.append("")
lines.append("## Approvals (check to unblock)")
insert_at = len(lines)
lines.insert(insert_at, f"- [{'x' if approved else ' '}] Approve {checkpoint}")
atomic_write_text(decisions_path, "\n".join(lines).rstrip() + "\n")
def _human_checkpoints_from_units(workspace: Path) -> list[str]:
units_path = workspace / "UNITS.csv"
if not units_path.exists():
return []
try:
table = UnitsTable.load(units_path)
except Exception:
return []
seen: set[str] = set()
out: list[str] = []
for row in table.rows:
owner = (row.get("owner") or "").strip().upper()
if owner != "HUMAN":
continue
checkpoint = (row.get("checkpoint") or "").strip()
if checkpoint and checkpoint not in seen:
seen.add(checkpoint)
out.append(checkpoint)
return out
def _approval_hint(checkpoint: str) -> str:
hints = {
"C0": "kickoff: scope/sources/time window/constraints",
"C1": "retrieval + core set",
"C2": "scope + outline",
"C3": "evidence ready",
"C4": "citations verified",
"C5": "allow prose writing",
}
return hints.get(checkpoint, "")
def upsert_checkpoint_block(decisions_path: Path, checkpoint: str, markdown_block: str) -> None:
begin = f"<!-- BEGIN CHECKPOINT:{checkpoint} -->"
end = f"<!-- END CHECKPOINT:{checkpoint} -->"
block = "\n".join([begin, markdown_block.rstrip(), end, ""]).rstrip() + "\n"
if decisions_path.exists():
text = decisions_path.read_text(encoding="utf-8")
else:
text = "# Decisions log\n\n"
ensure_decisions_approval_checklist(decisions_path)
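# Re-read only if the file exists; the checklist helper may not have created it, so keep the fallback text otherwise.
text = decisions_path.read_text(encoding="utf-8") if decisions_path.exists() else text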
pattern = re.compile(
rf"{re.escape(begin)}.*?{re.escape(end)}\n?",
flags=re.DOTALL,
)
if pattern.search(text):
new_text = pattern.sub(block, text)
else:
new_text = text.rstrip() + "\n\n" + block
atomic_write_text(decisions_path, new_text)
def seed_queries_from_topic(queries_path: Path, topic: str) -> None:
topic = topic.strip()
if not topic:
return
if queries_path.exists():
lines = queries_path.read_text(encoding="utf-8").splitlines()
else:
lines = [
"# Queries",
"",
"## Primary query",
"- keywords:",
" - \"\"",
"- exclude:",
" - \"\"",
"- max_results: \"\"",
"- core_size: \"\"",
"- time window:",
" - from: \"\"",
" - to: \"\"",
"",
"## Notes",
"-",
]
def _has_nonempty_values(token: str) -> bool:
in_block = False
for raw in lines:
stripped = raw.strip()
if stripped.startswith(f"- {token}:"):
in_block = True
continue
if not in_block:
continue
if raw.startswith(" - "):
value = stripped[2:].strip().strip('"').strip("'")
if value:
return True
continue
if stripped.startswith("- "):
break
return False
def _has_nonempty_scalar(token: str) -> bool:
for raw in lines:
stripped = raw.strip()
if not stripped.startswith(f"- {token}:"):
continue
value = stripped.split(":", 1)[1].split("#", 1)[0].strip().strip('"').strip("'")
return bool(value)
return False
def _has_nonempty_time_field(field: str) -> bool:
# Looks for lines like: ' - from: "2022"'
for raw in lines:
stripped = raw.strip()
if not stripped.startswith(f"- {field}:"):
continue
value = stripped.split(":", 1)[1].strip().strip('"').strip("'")
return bool(value)
return False
has_keywords = _has_nonempty_values("keywords")
has_excludes = _has_nonempty_values("exclude")
has_time_from = _has_nonempty_time_field("from")
has_time_to = _has_nonempty_time_field("to")
has_max_results = _has_nonempty_scalar("max_results")
has_core_size = _has_nonempty_scalar("core_size")
workspace = queries_path.parent
profile = pipeline_profile(workspace)
query_defaults = pipeline_query_defaults(workspace)
raw_tlow = topic.lower()
topic_for_queries = _sanitize_topic_for_query_seed(topic)
keyword_suggestions = [topic_for_queries]
tlow = topic_for_queries.lower()
is_agent = any(t in tlow for t in ("agent", "agents", "agentic"))
is_embodied = any(
t in tlow
for t in (
"embodied ai",
"embodied intelligence",
"embodied agent",
"embodied robotics",
"robot foundation model",
"robot learning",
"robot manipulation",
"vision-language-action",
"vla",
"generalist robot",
)
)
is_text_to_image = any(t in tlow for t in ("text-to-image", "text to image", "t2i"))
is_text_to_video = any(t in tlow for t in ("text-to-video", "text to video", "t2v"))
is_diffusion = "diffusion" in tlow
is_generative = is_text_to_image or is_text_to_video or is_diffusion or ("image generation" in tlow) or ("generative" in tlow)
if is_agent:
keyword_suggestions.extend(
[
"LLM agent",
"language model agent",
"tool use",
"function calling",
"tool-using agent",
"planning",
"memory",
"multi-agent",
"benchmark",
"safety",
]
)
exclude_suggestions: list[str] = []
if is_agent:
exclude_suggestions.append("agent-based modeling")
exclude_suggestions.extend(["react hooks", "perovskite", "banach", "coxeter"])
if is_embodied:
keyword_suggestions.extend(
[
"embodied AI survey",
"embodied AI review",
"embodied intelligence survey",
"embodied agent survey",
"robot foundation model survey",
"robot learning survey",
"robot manipulation survey",
"embodied robotics survey",
"vision-language-action survey",
"vision-language-action model",
"robot foundation model",
"generalist robot policy",
"world model robot",
]
)
stripped_output_terms = topic_for_queries.lower().strip() != raw_tlow.strip()
if stripped_output_terms and any(t in raw_tlow for t in ("latex", "pdf", "markdown", "typesetting")):
exclude_suggestions.extend(["latex", "pdf", "typesetting", "document layout"])
if is_generative:
keyword_suggestions.extend(
[
"text-to-image generation",
"text-guided image generation",
"diffusion model",
"denoising diffusion probabilistic model",
"latent diffusion",
"stable diffusion",
"classifier-free guidance",
"diffusion transformer",
"DiT",
"masked generative transformer",
"MaskGIT",
"autoregressive image generation",
"VQGAN",
"VQ-VAE",
"ControlNet",
"DreamBooth",
"textual inversion",
"LoRA fine-tuning",
]
)
default_max_results = query_defaults.get("max_results")
# Always produce a string so both branches have the same type.
max_results_suggestion = str(default_max_results) if str(default_max_results or "").strip() else str(
1800 if profile == "arxiv-survey" else (800 if (is_agent or is_generative or is_embodied) else 300)
)
time_from_suggestion = (
"2018" if is_embodied
else ("2022" if (is_agent and ("llm" in tlow or "language model" in tlow)) else ("2020" if is_generative else ""))
)
core_size_suggestion = str(query_defaults.get("core_size") or "").strip() or ("300" if profile == "arxiv-survey" else "")
out: list[str] = []
i = 0
while i < len(lines):
line = lines[i]
stripped = line.strip()
if stripped.startswith("- keywords:") and not has_keywords:
out.append(line)
i += 1
while i < len(lines) and lines[i].startswith(" - "):
i += 1
for kw in _dedupe_preserve_order(keyword_suggestions)[:14]:
out.append(f" - \"{kw}\"")
continue
if stripped.startswith("- exclude:") and not has_excludes:
out.append(line)
i += 1
while i < len(lines) and lines[i].startswith(" - "):
i += 1
for ex in _dedupe_preserve_order(exclude_suggestions)[:10]:
out.append(f" - \"{ex}\"")
continue
if stripped.startswith("- max_results:") and not has_max_results and max_results_suggestion:
out.append(f"- max_results: \"{max_results_suggestion}\"")
i += 1
continue
if stripped.startswith("- core_size:") and not has_core_size and core_size_suggestion:
out.append(f"- core_size: \"{core_size_suggestion}\"")
i += 1
continue
if stripped.startswith("- time window:") and not (has_time_from or has_time_to) and time_from_suggestion:
out.append(line)
i += 1
# Skip existing from/to lines if present.
while i < len(lines) and lines[i].startswith(" -"):
i += 1
out.append(f" - from: \"{time_from_suggestion}\"")
out.append(" - to: \"\"")
continue
out.append(line)
i += 1
out = _materialize_missing_query_defaults(out, query_defaults, allowed_fields=pipeline_overridable_query_fields(workspace))
atomic_write_text(queries_path, "\n".join(out).rstrip() + "\n")
def _dedupe_preserve_order(items: list[str]) -> list[str]:
seen: set[str] = set()
out: list[str] = []
for item in items:
item = item.strip()
if not item or item in seen:
continue
seen.add(item)
out.append(item)
return out
def _sanitize_topic_for_query_seed(topic: str) -> str:
text = str(topic or "").strip()
if not text:
return ""
patterns = [
r"(?i)\bwith\s+latex\s*/\s*pdf\s+output\b",
r"(?i)\bwith\s+latex\s+output\b",
r"(?i)\bwith\s+pdf\s+output\b",
r"(?i)\bwith\s+markdown\s+output\b",
r"(?i)\blatex\s*/\s*pdf\s+output\b",
r"(?i)\bpdf\s+output\b",
r"(?i)\blatex\s+output\b",
r"(?i)\bmarkdown\s+output\b",
r"(?i)\bfor\s+latex\s*/\s*pdf\b",
]
for pattern in patterns:
text = re.sub(pattern, "", text)
text = re.sub(r"\s+", " ", text).strip(" ,;:-")
return text or topic
_LEGACY_PIPELINE_ALIASES = {
"idea-finder": "idea-brainstorm",
"idea-finder.pipeline.md": "idea-brainstorm",
"pipelines/idea-finder.pipeline.md": "idea-brainstorm",
}
def find_repo_root(start: Path | None = None) -> Path:
"""Walk up from *start* (default: this file) looking for AGENTS.md."""
candidate = (start or Path(__file__)).resolve()
for _ in range(10):
if (candidate / "AGENTS.md").exists():
return candidate
parent = candidate.parent
if parent == candidate:
break
candidate = parent
raise FileNotFoundError("Could not find repo root (AGENTS.md marker)")
def _normalize_pipeline_lock_value(value: str) -> str:
raw = str(value or "").strip()
return _LEGACY_PIPELINE_ALIASES.get(raw, raw)
def resolve_pipeline_spec_path(*, repo_root: Path, pipeline_value: str) -> Path | None:
value = _normalize_pipeline_lock_value(pipeline_value)
if not value:
return None
candidate = Path(value)
if candidate.is_absolute() and candidate.exists():
return candidate.resolve()
rel_candidate = repo_root / value
if rel_candidate.exists():
return rel_candidate.resolve()
filename = Path(value).name
if filename:
direct = repo_root / "pipelines" / filename
if direct.exists():
return direct.resolve()
stem = filename
if stem.endswith(".pipeline.md"):
stem = stem[: -len(".pipeline.md")]
if stem:
direct = repo_root / "pipelines" / f"{stem}.pipeline.md"
if direct.exists():
return direct.resolve()
return None
def load_workspace_pipeline_spec(workspace: Path):
from tooling.pipeline_spec import PipelineSpec
try:
repo_root = find_repo_root(workspace)
except FileNotFoundError:
repo_root = Path(__file__).resolve().parents[1]
lock_path = workspace / "PIPELINE.lock.md"
if not lock_path.exists():
return None
pipeline_name = ""
try:
for raw in lock_path.read_text(encoding="utf-8", errors="ignore").splitlines():
line = raw.strip()
if line.startswith("pipeline:"):
pipeline_name = line.split(":", 1)[1].strip()
break
except Exception:
return None
if not pipeline_name:
return None
spec_path = resolve_pipeline_spec_path(repo_root=repo_root, pipeline_value=pipeline_name)
if spec_path is None:
return None
try:
return PipelineSpec.load(spec_path)
except Exception:
return None
def pipeline_query_defaults(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
return dict(spec.query_defaults) if spec is not None else {}
def pipeline_quality_contract(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
return dict(spec.quality_contract) if spec is not None else {}
def pipeline_quality_contract_value(workspace: Path, *keys: str, default: Any = None) -> Any:
current: Any = pipeline_quality_contract(workspace)
for key in keys:
if not isinstance(current, dict):
return default
current = current.get(str(key))
if current is None:
return default
return current
def pipeline_query_default(workspace: Path, key: str, default: Any = None) -> Any:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return default
return spec.query_default(key, default)
def pipeline_overridable_query_fields(workspace: Path) -> set[str]:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return set()
return set(spec.overridable_query_fields)
def pipeline_profile(workspace: Path) -> str:
"""Return the pipeline profile for a workspace.
Reads PIPELINE.lock.md to get the pipeline name, then loads the
pipeline spec file and reads its ``profile`` frontmatter field.
Falls back to ``"default"`` if anything is missing.
"""
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return "default"
return str(spec.profile or "default").strip() or "default"
def latest_outline_state(workspace: Path) -> dict[str, Any]:
path = Path(workspace).resolve() / "outline" / "outline_state.jsonl"
records = [rec for rec in read_jsonl(path) if isinstance(rec, dict)]
return dict(records[-1]) if records else {}
def _materialize_missing_query_defaults(lines: list[str], query_defaults: dict[str, Any], *, allowed_fields: set[str] | None = None) -> list[str]:
if not query_defaults:
return lines
existing_keys: set[str] = set()
for raw in lines:
stripped = raw.strip()
if not stripped.startswith("- ") or ":" not in stripped:
continue
key = stripped[2:].split(":", 1)[0].strip().lower().replace(" ", "_").replace("-", "_")
if key:
existing_keys.add(key)
additions: list[str] = []
for key, value in query_defaults.items():
norm_key = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
if not norm_key or norm_key in existing_keys:
continue
if allowed_fields and norm_key not in allowed_fields:
continue
rendered = _render_query_scalar(value)
if rendered is None:
continue
additions.append(f'- {norm_key}: "{rendered}"')
if not additions:
return lines
out = list(lines)
if out and out[-1].strip():
out.append("")
out.extend(additions)
return out
def _render_query_scalar(value: Any) -> str | None:
if value is None:
return None
if isinstance(value, bool):
return "true" if value else "false"
if isinstance(value, (int, float)):
return str(value)
if isinstance(value, str):
text = value.strip()
return text or None
return None
FILE:tooling/executor.py
from __future__ import annotations
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path
from tooling.common import (
UnitsTable,
atomic_write_text,
decisions_has_approval,
ensure_dir,
latest_outline_state,
load_workspace_pipeline_spec,
now_iso_seconds,
parse_semicolon_list,
set_decisions_approval,
update_status_field,
update_status_log,
)
@dataclass(frozen=True)
class RunResult:
unit_id: str | None
status: str
message: str
def _section_first_cutover_block_message(*, workspace: Path, outputs: list[str]) -> str | None:
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None or str(spec.structure_mode or "").strip().lower() != "section_first":
        return None
    if "outline/outline_state.jsonl" not in outputs:
        return None
    latest = latest_outline_state(workspace)
    if not latest:
        return "Section-first cutover is not actionable yet: `outline/outline_state.jsonl` is missing or empty. Rerun `outline-refiner` after fixing the chapter/section layer."
    structure_phase = str(latest.get("structure_phase") or "").strip()
    h3_status = str(latest.get("h3_status") or "").strip().lower()
    reroute_target = str(latest.get("reroute_target") or "").strip()
    # Keep a literal 0 budget visible instead of letting `or ""` collapse it to "unknown".
    retry_raw = latest.get("retry_budget_remaining")
    retry_budget = "" if retry_raw is None else str(retry_raw).strip()
    reroute_reason = str(latest.get("reroute_reason") or "").strip()
    if structure_phase.lower() == "decomposed" and h3_status == "stable":
        return None
    missing: list[str] = []
    for key in ("structure_phase", "h3_status", "approval_status", "reroute_target", "retry_budget_remaining"):
        if key not in latest:
            missing.append(key)
            continue
        if key in {"structure_phase", "h3_status"} and not str(latest.get(key) or "").strip():
            missing.append(key)
    if missing:
        return (
            "Section-first cutover cannot advance because the latest `outline_state.jsonl` record is missing "
            f"required fields: {', '.join(missing)}."
        )
    reroute_label = reroute_target or "unknown"
    retry_label = retry_budget or "unknown"
    reason_suffix = f" Reason: {reroute_reason}." if reroute_reason else ""
    return (
        "Section-first cutover is still blocked; "
        f"latest outline state has structure_phase={structure_phase or 'missing'}, "
        f"h3_status={h3_status or 'missing'}, reroute_target={reroute_label}, "
        f"retry_budget_remaining={retry_label}.{reason_suffix} Fix the reroute target explicitly, then rerun this unit."
    )
def _append_run_error(*, workspace: Path, unit_id: str, skill: str, kind: str, message: str, log_rel: str | None) -> None:
    """Append a short failure record to `output/RUN_ERRORS.md` (workspace-local).

    This is a human-facing error sink that survives reruns and makes BLOCKED states debuggable.
    """
    try:
        ensure_dir(workspace / "output")
        out_path = workspace / "output" / "RUN_ERRORS.md"
        stamp = now_iso_seconds()
        log_hint = f" (log: `{log_rel}`)" if log_rel else ""
        line = f"- {stamp} `{unit_id}` `{skill}` `{kind}`: {message}{log_hint}"
        if out_path.exists() and out_path.stat().st_size > 0:
            prev = out_path.read_text(encoding="utf-8", errors="ignore").rstrip() + "\n"
        else:
            prev = "# Run errors\n\n"
        atomic_write_text(out_path, prev + line + "\n")
    except Exception:
        # Never let the runner crash while trying to log an error.
        return
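The error sink above follows a read, extend, rewrite pattern rather than opening the file in append mode. Below is a rough standalone sketch of that pattern. `atomic_write_text` lives in `tooling.common` and is not shown in this file, so the version here is an assumed implementation (temp file plus `os.replace`), and `append_error_line`, `demo`, and the sample records are made up for illustration.

```python
from __future__ import annotations

import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Assumed behavior: write to a sibling temp file, then rename it over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.replace(tmp_name, path)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def append_error_line(path: Path, line: str) -> None:
    """Rewrite the whole file with one more bullet, creating the header on first use."""
    if path.exists() and path.stat().st_size > 0:
        prev = path.read_text(encoding="utf-8", errors="ignore").rstrip() + "\n"
    else:
        prev = "# Run errors\n\n"
    atomic_write_text(path, prev + line + "\n")


def demo() -> str:
    """Append two made-up records in a throwaway directory and return the file text."""
    with tempfile.TemporaryDirectory() as tmp:
        sink = Path(tmp) / "output" / "RUN_ERRORS.md"
        append_error_line(sink, "- 2025-01-01T00:00:00 `U010` `demo-skill` `script_failed`: exit 1")
        append_error_line(sink, "- 2025-01-01T00:05:00 `U010` `demo-skill` `missing_outputs`: output/X.md")
        return sink.read_text(encoding="utf-8")


print(demo())
```

Rewriting through a temp file means a reader never sees a half-written log, at the cost of re-reading the file on every append, which is acceptable for a short human-facing error list.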
def run_one_unit(
    *,
    workspace: Path,
    repo_root: Path,
    strict: bool = False,
    auto_approve: set[str] | None = None,
) -> RunResult:
    units_path = workspace / "UNITS.csv"
    status_path = workspace / "STATUS.md"
    if not units_path.exists():
        return RunResult(unit_id=None, status="ERROR", message=f"Missing {units_path}")
    table = UnitsTable.load(units_path)
    runnable_idx = _find_first_runnable(table)
    if runnable_idx is None:
        return RunResult(unit_id=None, status="IDLE", message="No runnable unit found")
    row = table.rows[runnable_idx]
    unit_id = row.get("unit_id", "").strip()
    skill = row.get("skill", "").strip()
    owner = row.get("owner", "").strip().upper()
    row["status"] = "DOING"
    table.save(units_path)
    update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DOING {skill}")
    auto_approve_set = {str(x or "").strip().upper() for x in (auto_approve or set()) if str(x or "").strip()}
    if owner == "HUMAN":
        checkpoint = row.get("checkpoint", "").strip()
        if checkpoint and decisions_has_approval(workspace / "DECISIONS.md", checkpoint):
            row["status"] = "DONE"
            table.save(units_path)
            update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (HUMAN approved {checkpoint})")
            _refresh_status_checkpoint(status_path, table)
            return RunResult(unit_id=unit_id, status="DONE", message=f"HUMAN approved {checkpoint}")
        if checkpoint and checkpoint.upper() in auto_approve_set:
            set_decisions_approval(workspace / "DECISIONS.md", checkpoint, approved=True)
            row["status"] = "DONE"
            table.save(units_path)
            update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (AUTO approved {checkpoint})")
            _refresh_status_checkpoint(status_path, table)
            return RunResult(unit_id=unit_id, status="DONE", message=f"AUTO approved {checkpoint}")
        row["status"] = "BLOCKED"
        table.save(units_path)
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (await HUMAN approval {checkpoint})")
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Await HUMAN approval {checkpoint} in DECISIONS.md")
    script_path = repo_root / "scripts" / "run.py"
    if not script_path.exists():
        row["status"] = "BLOCKED"
        table.save(units_path)
        skill_md = "SKILL.md"
        update_status_log(
            status_path,
            (
                f"{now_iso_seconds()} {unit_id} BLOCKED "
                f"(no script for {skill}; run manually per {skill_md} then mark DONE)"
            ),
        )
        _refresh_status_checkpoint(status_path, table)
        return RunResult(
            unit_id=unit_id,
            status="BLOCKED",
            message=(
                f"No executable script for skill '{skill}'. "
                f"Run it manually by following `{skill_md}`, write the required outputs, "
                f"then mark the unit DONE (e.g., `python scripts/pipeline.py mark --workspace {workspace} --unit-id {unit_id} --status DONE`)."
            ),
        )
    inputs = parse_semicolon_list(row.get("inputs"))
    raw_outputs = parse_semicolon_list(row.get("outputs"))
    outputs = [_strip_optional_marker(rel) for rel in raw_outputs]
    required_outputs = [outputs[i] for i, rel in enumerate(raw_outputs) if not rel.strip().startswith("?")]
    checkpoint = row.get("checkpoint", "").strip()
    cmd = [
        sys.executable,
        str(script_path),
        "--workspace",
        str(workspace),
        "--unit-id",
        unit_id,
        "--inputs",
        ";".join(inputs),
        "--outputs",
        ";".join(outputs),
        "--checkpoint",
        checkpoint,
    ]
    log_rel = f"output/unit_logs/{unit_id}.{skill}.log"
    log_path = workspace / log_rel
    try:
        completed = subprocess.run(cmd, check=False, capture_output=True, text=True)
        if completed.stdout or completed.stderr or completed.returncode != 0:
            ensure_dir(log_path.parent)
            body = [
                f"# Unit log\n",
                f"- unit_id: {unit_id}\n",
                f"- skill: {skill}\n",
                f"- exit: {completed.returncode}\n",
                f"- cmd: {' '.join(cmd)}\n",
                "\n## stdout\n\n",
                (completed.stdout or "(empty)") + "\n",
                "\n## stderr\n\n",
                (completed.stderr or "(empty)") + "\n",
            ]
            atomic_write_text(log_path, "".join(body))
    except Exception as exc:  # pragma: no cover
        row["status"] = "BLOCKED"
        table.save(units_path)
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (exec error)")
        _append_run_error(
            workspace=workspace,
            unit_id=unit_id,
            skill=skill,
            kind="exec_error",
            message=f"{type(exc).__name__}: {exc}",
            log_rel=None,
        )
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="BLOCKED", message=str(exc))
    missing = [rel for rel in required_outputs if rel and not (workspace / rel).exists()]
    if completed.returncode == 0 and not missing:
        if strict:
            from tooling.quality_gate import check_unit_outputs, write_quality_report
            try:
                issues = check_unit_outputs(skill=skill, workspace=workspace, outputs=outputs)
            except Exception as exc:  # pragma: no cover
                from tooling.quality_gate import QualityIssue
                issues = [
                    QualityIssue(
                        code="quality_gate_exception",
                        message=f"Quality gate crashed: {type(exc).__name__}: {exc}",
                    )
                ]
            # Avoid confusing stale QUALITY_GATE.md after a successful run.
            report_path = workspace / "output" / "QUALITY_GATE.md"
            if issues or report_path.exists():
                write_quality_report(workspace=workspace, unit_id=unit_id, skill=skill, issues=issues)
            if issues:
                row["status"] = "BLOCKED"
                table.save(units_path)
                rel_report = str((workspace / "output" / "QUALITY_GATE.md").relative_to(workspace))
                update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (quality gate: {rel_report})")
                _refresh_status_checkpoint(status_path, table)
                reroute_hint = _reroute_hint(workspace)
                return RunResult(
                    unit_id=unit_id,
                    status="BLOCKED",
                    message=f"Quality gate failed; see {rel_report}" + (f"; {reroute_hint}" if reroute_hint else ""),
                )
        cutover_block = _section_first_cutover_block_message(workspace=workspace, outputs=outputs)
        if cutover_block:
            row["status"] = "BLOCKED"
            table.save(units_path)
            update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (section-first cutover)")
            _append_run_error(
                workspace=workspace,
                unit_id=unit_id,
                skill=skill,
                kind="section_first_cutover",
                message=cutover_block,
                log_rel=log_rel if log_path.exists() else None,
            )
            _refresh_status_checkpoint(status_path, table)
            return RunResult(unit_id=unit_id, status="BLOCKED", message=cutover_block)
        row["status"] = "DONE"
        table.save(units_path)
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE {skill}")
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="DONE", message="OK")
    row["status"] = "BLOCKED"
    table.save(units_path)
    if missing:
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (missing outputs: {', '.join(missing)})")
        _append_run_error(
            workspace=workspace,
            unit_id=unit_id,
            skill=skill,
            kind="missing_outputs",
            message=f"Missing outputs: {', '.join(missing)}",
            log_rel=log_rel if log_path.exists() else None,
        )
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Missing outputs: {', '.join(missing)}" + (f"; see {log_rel}" if log_path.exists() else ""))
    update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (script failed)")
    _append_run_error(
        workspace=workspace,
        unit_id=unit_id,
        skill=skill,
        kind="script_failed",
        message=f"Skill script failed (exit {completed.returncode})",
        log_rel=log_rel if log_path.exists() else None,
    )
    _refresh_status_checkpoint(status_path, table)
    return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Skill script failed (exit {completed.returncode})" + (f"; see {log_rel}" if log_path.exists() else ""))
def _find_first_runnable(table: UnitsTable) -> int | None:
    status_ok = {"DONE", "SKIP"}
    unit_by_id = {row.get("unit_id", ""): row for row in table.rows}
    for idx, row in enumerate(table.rows):
        if row.get("status", "").strip().upper() not in {"TODO", "BLOCKED"}:
            continue
        deps = parse_semicolon_list(row.get("depends_on"))
        if not deps:
            return idx
        deps_done = True
        for dep_id in deps:
            dep = unit_by_id.get(dep_id)
            if not dep:
                deps_done = False
                break
            if dep.get("status", "").strip().upper() not in status_ok:
                deps_done = False
                break
        if deps_done:
            return idx
    return None
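The selection rule in `_find_first_runnable` can be tried out on plain dict rows. The sketch below restates it without `UnitsTable`. `split_semicolons` is a hypothetical stand-in for `parse_semicolon_list` (whose exact behavior is not shown here), and the unit ids are invented. Note that BLOCKED units are eligible again on the next run, and that a dependency id that does not exist keeps the unit from running.

```python
from __future__ import annotations


def split_semicolons(raw: str | None) -> list[str]:
    """Hypothetical stand-in for `parse_semicolon_list`: split on ';' and drop blanks."""
    return [part.strip() for part in (raw or "").split(";") if part.strip()]


def find_first_runnable(rows: list[dict[str, str]]) -> int | None:
    """Return the index of the first TODO/BLOCKED row whose dependencies are all DONE or SKIP."""
    by_id = {row.get("unit_id", ""): row for row in rows}
    for idx, row in enumerate(rows):
        if row.get("status", "").strip().upper() not in {"TODO", "BLOCKED"}:
            continue
        deps = split_semicolons(row.get("depends_on"))
        # all() over an empty list is True, so a unit without dependencies is runnable.
        if all(
            dep in by_id and by_id[dep].get("status", "").strip().upper() in {"DONE", "SKIP"}
            for dep in deps
        ):
            return idx
    return None


rows = [
    {"unit_id": "U001", "status": "DONE", "depends_on": ""},
    {"unit_id": "U002", "status": "TODO", "depends_on": "U001;U099"},  # U099 does not exist
    {"unit_id": "U003", "status": "BLOCKED", "depends_on": "U001"},
    {"unit_id": "U004", "status": "TODO", "depends_on": "U003"},
]
print(find_first_runnable(rows))  # index of U003
```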
def _refresh_status_checkpoint(status_path: Path, table: UnitsTable) -> None:
    checkpoint = _compute_current_checkpoint(table)
    update_status_field(status_path, "Current checkpoint", checkpoint)
def _compute_current_checkpoint(table: UnitsTable) -> str:
    for row in table.rows:
        if row.get("status", "").strip().upper() not in {"DONE", "SKIP"}:
            return (row.get("checkpoint") or "").strip() or "C0"
    return "DONE"
def invalidate_downstream_units(table: UnitsTable, *, root_unit_id: str) -> list[str]:
    """Reset all transitive downstream dependents of `root_unit_id` to TODO.

    This is used when a previously satisfied upstream unit is reopened for rerun.
    Keeping downstream units as DONE would otherwise leave stale artifacts in place
    and make later `run` invocations stop too early.
    """
    root = str(root_unit_id or "").strip()
    if not root:
        return []
    # Map each unit id to the rows that list it in `depends_on`.
    direct_children: dict[str, list[dict[str, str]]] = {}
    for row in table.rows:
        for dep in parse_semicolon_list(row.get("depends_on")):
            direct_children.setdefault(dep, []).append(row)
    affected: list[str] = []
    seen: set[str] = set()
    stack = [root]
    while stack:
        current = stack.pop()
        for child in direct_children.get(current, []):
            child_id = str(child.get("unit_id") or "").strip()
            if not child_id or child_id in seen:
                continue
            seen.add(child_id)
            if str(child.get("status") or "").strip().upper() != "TODO":
                child["status"] = "TODO"
                affected.append(child_id)
            # Keep walking even when the child was already TODO; its dependents may not be.
            stack.append(child_id)
    return affected
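As a small worked example of the transitive reset described in the docstring above, the sketch below runs the same traversal over plain dict rows. `invalidate_downstream` is an illustrative restatement rather than the module function, and the unit ids are invented. The root itself is not reset, and a child that is already TODO is still traversed so its own dependents get reopened.

```python
from __future__ import annotations


def invalidate_downstream(rows: list[dict[str, str]], root_unit_id: str) -> list[str]:
    """Reset every transitive dependent of `root_unit_id` to TODO; return the ids that changed."""
    children: dict[str, list[dict[str, str]]] = {}
    for row in rows:
        for dep in (d.strip() for d in (row.get("depends_on") or "").split(";")):
            if dep:
                children.setdefault(dep, []).append(row)
    affected: list[str] = []
    seen: set[str] = set()
    stack = [root_unit_id.strip()]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            child_id = (child.get("unit_id") or "").strip()
            if not child_id or child_id in seen:
                continue
            seen.add(child_id)
            if (child.get("status") or "").strip().upper() != "TODO":
                child["status"] = "TODO"
                affected.append(child_id)
            stack.append(child_id)  # keep walking even if this child was already TODO
    return affected


rows = [
    {"unit_id": "A", "status": "DONE", "depends_on": ""},
    {"unit_id": "B", "status": "DONE", "depends_on": "A"},
    {"unit_id": "C", "status": "TODO", "depends_on": "B"},
    {"unit_id": "D", "status": "DONE", "depends_on": "C;A"},
    {"unit_id": "E", "status": "DONE", "depends_on": ""},
]
changed = invalidate_downstream(rows, "A")
print(sorted(changed))
```

Here B and D are reopened, C is left as it was, and the unrelated unit E stays DONE.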
def _strip_optional_marker(relpath: str) -> str:
    relpath = (relpath or "").strip()
    if relpath.startswith("?"):
        return relpath[1:].strip()
    return relpath
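The `?` prefix marks an output as optional: `run_one_unit` passes every path to the script but only checks that the unmarked ones exist afterwards. A minimal sketch of that split, assuming the outputs cell is a `;`-separated string. `split_outputs` and the sample paths are illustrative, not part of the module.

```python
from __future__ import annotations


def split_outputs(raw: str) -> tuple[list[str], list[str]]:
    """Return (all_outputs, required_outputs) from a ';'-separated outputs cell."""
    entries = [part.strip() for part in raw.split(";") if part.strip()]
    all_outputs = [e[1:].strip() if e.startswith("?") else e for e in entries]
    required = [out for e, out in zip(entries, all_outputs) if not e.startswith("?")]
    return all_outputs, required


all_outputs, required = split_outputs("output/DRAFT.md; ?output/NOTES.md")
print(all_outputs)
print(required)
```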
def _reroute_hint(workspace: Path) -> str:
    path = workspace / "output" / "REROUTE_STATE.json"
    if not path.exists() or path.stat().st_size <= 0:
        return ""
    try:
        import json
        data = json.loads(path.read_text(encoding="utf-8", errors="ignore") or "{}")
    except Exception:
        return ""
    if not isinstance(data, dict):
        return ""
    target = str(data.get("reroute_target") or "").strip()
    status = str(data.get("status") or "").strip()
    phase = str(data.get("structure_phase") or "").strip()
    h3 = str(data.get("h3_status") or "").strip()
    reason = str(data.get("reroute_reason") or "").strip()
    if not any([target, status, phase, h3]):
        return ""
    parts = []
    if status:
        parts.append(f"reroute_status={status}")
    if target:
        parts.append(f"reroute_target={target}")
    if phase:
        parts.append(f"structure_phase={phase}")
    if h3:
        parts.append(f"h3_status={h3}")
    if reason:
        parts.append(f"reason={reason}")
    return ", ".join(parts)
FILE:tooling/ideation.py
from __future__ import annotations
import json
import re
from dataclasses import asdict, dataclass, is_dataclass
from pathlib import Path
from typing import Any, Iterable
from tooling.common import (
    atomic_write_text,
    ensure_dir,
    load_workspace_pipeline_spec,
    pipeline_overridable_query_fields,
    pipeline_query_default,
    read_jsonl,
)
DEFAULT_IDEA_RUBRIC: list[tuple[str, float, str]] = [
("discussion_worthiness", 0.24, "is this direction worth a serious PI/PhD discussion right now?"),
("academic_value", 0.22, "could it open a meaningful paper/thesis-worthy research angle?"),
("evidence_grounding", 0.18, "can we point to concrete literature tensions rather than pure speculation?"),
("direction_distinctness", 0.16, "is it genuinely different from neighboring directions, not just a wording variant?"),
("first_probe_clarity", 0.10, "is there a plausible low-cost first probe that would teach us something?"),
("thesis_potential", 0.10, "does it plausibly grow beyond a one-off note into a stronger line of work?"),
]
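The six rubric weights sum to 1.0, which suggests a weighted-sum aggregate. The sketch below is one way such a total could be computed; it is an assumption, since the scoring code is not shown in this part of the file. In particular, the 1 to 5 integer scale, the rounding, and the `weighted_total` name are illustrative only, and the sample scores are made up.

```python
from __future__ import annotations

# Weights copied from DEFAULT_IDEA_RUBRIC (descriptions omitted).
RUBRIC_WEIGHTS: list[tuple[str, float]] = [
    ("discussion_worthiness", 0.24),
    ("academic_value", 0.22),
    ("evidence_grounding", 0.18),
    ("direction_distinctness", 0.16),
    ("first_probe_clarity", 0.10),
    ("thesis_potential", 0.10),
]


def weighted_total(scores: dict[str, int]) -> float:
    """Weighted sum of per-criterion scores; missing criteria count as 0 (assumed convention)."""
    return round(sum(weight * scores.get(name, 0) for name, weight in RUBRIC_WEIGHTS), 2)


weight_sum = sum(weight for _, weight in RUBRIC_WEIGHTS)
example = {
    "discussion_worthiness": 5,
    "academic_value": 4,
    "evidence_grounding": 3,
    "direction_distinctness": 4,
    "first_probe_clarity": 3,
    "thesis_potential": 2,
}
print(round(weight_sum, 6), weighted_total(example))
```

Because the weights are normalized, the total stays on the same scale as the individual scores, so a direction scoring 4 on every criterion totals 4.0.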
STOPWORDS = {
"a", "an", "and", "the", "of", "to", "for", "in", "on", "with", "via", "by", "from", "or",
"work", "works", "study", "studies", "survey", "review", "benchmarks", "benchmark", "evaluation",
"agents", "agent", "model", "models", "llm", "large", "language", "using", "based",
}
CLUSTER_AXIS_HINTS: list[tuple[list[str], list[str], str, str]] = [
(["agent loop", "action"], ["observability granularity", "action-space design", "tool/environment boundary"], "mechanism", "Could sharpen how we think about the basic agent loop rather than adding yet another full stack."),
(["tool", "orchestration", "interface"], ["tool routing", "permission model", "API reliability assumptions"], "systems", "Could clarify whether interface design is the real bottleneck behind seemingly agent-level gains."),
(["planning", "reasoning"], ["search depth", "verification loop", "partial-observability handling"], "mechanism", "Could turn vague talk about reasoning quality into a sharper research question about what actually drives robust planning."),
(["memory", "retrieval", "rag"], ["retrieval policy", "state summarization cadence", "memory horizon"], "mechanism", "Could reveal when memory is a genuine capability lever versus a reporting convenience."),
(["self-improvement", "adaptation"], ["feedback type", "adaptation horizon", "evaluation signal quality"], "adaptation", "Could distinguish genuine improvement from evaluation-sensitive self-optimization."),
(["multi-agent", "coordination"], ["role specialization", "communication budget", "aggregation rule"], "coordination", "Could separate coordination benefits from simple redundancy or verification effects."),
(["benchmark", "evaluation"], ["metric choice", "task slice design", "failure-analysis lens"], "evaluation", "Could make evaluation choices themselves a research object instead of hidden background assumptions."),
(["safety", "security", "governance"], ["threat model", "monitoring granularity", "permission boundary"], "governance", "Could connect technical agent behaviors to deployment-relevant risk questions in a more principled way."),
]
GENERIC_LIMITATION_PATTERNS = [
"abstract-level evidence only",
"validate assumptions",
"full paper",
"before relying on this as key evidence",
]
BENCHMARK_HINTS = [
"HotpotQA", "FEVER", "ALFWorld", "WebShop", "HumanEval", "WebArena", "AgentBench",
"GSM8K", "MATH", "Wikipedia", "wiki", "API", "question answering", "fact verification",
]
AXIS_INSIGHT_LIBRARY: dict[str, dict[str, str]] = {
"observability granularity": {
"question": "whether several agent-loop gains are really planner gains, or whether they mostly come from changing what the planner gets to observe and when",
"confound": "planner quality and broader agent competence",
"thesis": "Several agent-loop gains remain hard to interpret because papers often improve observation access at the same time they improve the planner; before crediting planner depth, we should ask what the system was allowed to see.",
"insight": "A convincing result would not just move aggregate score. It would show which published gains survive once observation access is fixed, ideally producing a regime map of when observability—not planner depth—changes the failure story.",
"contribution_shape": "Could yield a causal-attribution result plus a reporting rule for agent-loop papers: claims about planning quality should specify and control observation access.",
"demotion": "This direction weakens sharply if the strongest prior work already holds observation access fixed while planner quality changes and still reports the same gain pattern.",
"kill_signal": "an anchor paper already fixes observation access while varying planner quality and the main conclusion still survives",
"missing_piece": "What is missing is a fixed-interface, fixed-budget comparison that varies only observation access and tracks whether the failure taxonomy—not just average score—changes.",
"program_kind": "causal attribution",
"time_to_clarity": "fast",
"priority_note": "Fastest to falsify on public tasks, and the answer would immediately change how several agent-loop results are interpreted.",
"reading_extract": "what the agent sees at each step, which ablations vary planner depth, and whether the action interface stays fixed",
"title": "Observability granularity vs planner depth",
},
"action-space design": {
"question": "whether robustness comes from better reasoning or from giving the agent a cleaner action interface",
"confound": "the shape of the action vocabulary and interface design",
"thesis": "Some agent-loop gains may be interface gains in disguise: when action vocabularies become cleaner or narrower, papers can attribute robustness to reasoning improvements that partly come from action-space design.",
"insight": "A useful result would show whether the same nominal planner still behaves very differently once the action space is widened, normalized, or made less ergonomic.",
"contribution_shape": "Could produce an action-space normalization protocol and a clearer account of which agent claims survive once interface ergonomics are controlled.",
"demotion": "This direction weakens if current papers already compare equivalent action spaces and still obtain the same ranking.",
"kill_signal": "the key anchor papers already normalize action vocabularies or API surfaces and still see the same ordering",
"missing_piece": "What is missing is an action-space-normalized comparison that keeps planner prompts fixed while changing only interface granularity and affordances.",
"program_kind": "interface normalization",
"time_to_clarity": "medium",
"priority_note": "Interesting and still thesis-relevant, but less immediate than observability because interface normalization usually requires more careful task redesign.",
"reading_extract": "how many actions are available, how semantically aligned the actions are, and whether interface cleanup happens together with reasoning changes",
"title": "Action-space design or agent competence?",
},
"tool/environment boundary": {
"question": "whether reliability depends more on boundary design between tool and environment than on the nominal reasoning loop",
"confound": "tool abstraction and environment modeling",
"thesis": "A number of agent results may really be about boundary design: where the system stops reasoning internally and starts delegating to tools can change reliability without changing the nominal planner much.",
"insight": "A strong result would show that several agent-level conclusions are actually boundary-design conclusions once tool abstraction is normalized.",
"contribution_shape": "Could yield a boundary-design taxonomy and a cleaner experimental recipe for separating planner quality from tool/interface scaffolding.",
"demotion": "This direction weakens if changing the boundary carefully barely affects the failure story.",
"kill_signal": "careful tool-boundary changes leave both the metric and the qualitative failure modes essentially unchanged",
"missing_piece": "What is missing is a comparison that keeps task semantics fixed while moving the tool/environment boundary in a controlled way.",
"program_kind": "systems boundary",
"time_to_clarity": "medium",
"priority_note": "Valuable as a systems-facing thesis line, but slower to turn into a decisive first readout than the lead mechanism questions.",
"reading_extract": "where tool calls begin, what state is exposed to the planner, and whether environment modeling changes together with the tool boundary",
"title": "Where the tool boundary really matters",
},
"search depth": {
"question": "whether apparent planning gains reflect deeper search or simply more inference-time budget to recover from weak initial choices",
"confound": "inference-time compute budget",
"thesis": "Many planning results still bundle depth with budget: deeper search often spends more tokens, branches, or retries, so it remains unclear whether depth changes reasoning quality or just buys more recovery opportunities.",
"insight": "A convincing result would show whether depth changes the nature of planning failures under a matched budget, rather than merely delaying failure by spending more compute.",
"contribution_shape": "Could produce a compute-normalized planning benchmark slice and a regime map for when search depth matters beyond extra inference budget.",
"demotion": "This direction weakens if prior work already equalizes budget and still finds depth-specific gains.",
"kill_signal": "the anchor papers already normalize token or wall-clock budget and the depth advantage remains intact",
"missing_piece": "What is missing is a compute-normalized study that holds token or wall-clock budget fixed while varying depth or branching, then checks whether failure modes actually change.",
"program_kind": "budget-normalized mechanism",
"time_to_clarity": "medium",
"priority_note": "Still thesis-sized, but it ranks behind observability because compute normalization is harder to defend and easier for prior work to have addressed already.",
"reading_extract": "the reported depth or branching settings, the effective compute budget, and whether shallow baselines were budget-matched",
"title": "Search depth or compute budget?",
},
"verification loop": {
"question": "whether verification contributes a distinct reasoning mechanism or mostly acts as expensive redundancy",
"confound": "extra compute and repeated checking",
"thesis": "Verification-heavy pipelines may look stronger because they retry, re-check, or filter more often, not necessarily because they add a distinct reasoning mechanism.",
"insight": "A useful result would separate verification as a mechanism from verification as a compute-expensive retry policy.",
"contribution_shape": "Could produce a cleaner protocol for comparing verification loops under matched retry or compute budgets.",
"demotion": "This direction weakens if verification can be replaced by simple repetition with little change in behavior.",
"kill_signal": "repeat-sampling or retry baselines already match the reported verification gain",
"missing_piece": "What is missing is a matched-budget comparison between explicit verification and simple repetition or reranking baselines.",
"program_kind": "verification audit",
"time_to_clarity": "medium",
"priority_note": "Decision-relevant when the group suspects redundancy masquerading as reasoning, though the probe is less clean than the top two directions.",
"reading_extract": "how many extra passes are spent on checking, whether retry baselines exist, and which errors verification actually fixes",
"title": "Verification or just expensive redundancy?",
},
"partial-observability handling": {
"question": "whether planning systems fail because they reason badly or because they are brittle under missing state information",
"confound": "state visibility",
"thesis": "Some planning failures may be epistemic before they are algorithmic: systems can look like weak planners when they are actually brittle under missing state information.",
"insight": "A strong result would show whether improved visibility changes the failure taxonomy more than planner changes do.",
"contribution_shape": "Could produce a planning-under-partial-observability benchmark slice and a clearer division between epistemic and algorithmic failure modes.",
"demotion": "This direction weakens if stronger visibility barely changes the failure pattern.",
"kill_signal": "improved visibility leaves both success rates and error types almost unchanged",
"missing_piece": "What is missing is a task slice that varies only state visibility while keeping planner scaffolding steady.",
"program_kind": "epistemic failure analysis",
"time_to_clarity": "medium",
"priority_note": "Useful if the group wants a deeper failure-analysis program, but it is currently less concrete than the lead directions.",
"reading_extract": "what information is hidden, which observations are restored by tools, and whether planner quality is tested separately from visibility",
"title": "Is planning failure really an observability problem?",
},
"retrieval policy": {
"question": "whether memory gains come from better stored knowledge or from better decisions about when and how retrieval is triggered",
"confound": "memory content versus retrieval timing",
"thesis": "Reported memory gains are often hard to interpret because memory content and retrieval triggers move together; some improvements may be protocol effects about when retrieval happens, not capability effects about what memory stores.",
"insight": "A convincing result would separate memory-content effects from retrieval-trigger effects and show whether the interpretation of current RAG-style gains changes once trigger policy is normalized.",
"contribution_shape": "Could yield a cleaner memory-evaluation protocol and a reusable result on when retrieval policy, rather than memory content, is the true hidden variable.",
"demotion": "This direction weakens if current work already varies retrieval policy independently of memory content and sees the same story.",
"kill_signal": "the main memory papers already keep the memory store fixed while varying retrieval triggers or timing and the conclusion does not move",
"missing_piece": "What is missing is a fixed-memory comparison that varies retrieval trigger and timing only, then checks whether the gain survives and which errors it actually removes.",
"program_kind": "protocol sensitivity",
"time_to_clarity": "medium",
"priority_note": "Different enough from the planning directions to stay in the lead set, but it ranks lower because the current evidence is thinner and more protocol-sensitive.",
"reading_extract": "what is stored in memory, when retrieval fires, what retrieval budget is allowed, and whether trigger policy is ever ablated independently",
"title": "Retrieval policy or memory content?",
},
"state summarization cadence": {
"question": "whether summarization helps because it improves reasoning or because it selectively compresses away troublesome state information",
"confound": "what gets forgotten versus what gets highlighted",
"thesis": "Summarization may help less because the agent reasons better and more because the system filters state in a favorable way; cadence choices can quietly redefine what information survives.",
"insight": "A useful result would show whether different summarization cadences preserve the same conclusions and failure taxonomy once memory content is held steady.",
"contribution_shape": "Could produce a summarization-control protocol for long-horizon agents and a clearer account of how state compression alters conclusions.",
"demotion": "This direction weakens if different summarization cadences preserve the same conclusions and failure taxonomy.",
"kill_signal": "changing summarization cadence leaves both retrieval behavior and downstream errors largely unchanged",
"missing_piece": "What is missing is a fixed-memory, fixed-task comparison that varies only summarization cadence and records what state is lost or amplified.",
"program_kind": "state compression",
"time_to_clarity": "medium",
"priority_note": "Promising but currently less mature than retrieval-policy framing because the concrete prior-work hooks are thinner.",
"reading_extract": "what gets summarized away, how often summaries are rewritten, and whether failure cases correlate with state compression choices",
"title": "What summarization cadence is really changing",
},
"memory horizon": {
"question": "whether longer memory helps because more context is useful or because evaluations reward persistence over selectivity",
"confound": "context length and evaluation design",
"thesis": "Longer memory can look like better capability even when the benchmark mainly rewards persistence or repeated access to earlier state, so horizon effects may partly be evaluation-design effects.",
"insight": "A convincing result would show whether extending horizon changes task framing and error patterns, not just aggregate score.",
"contribution_shape": "Could produce a regime map for when horizon length matters, versus when selective retrieval or better task design explains the gain.",
"demotion": "This direction weakens if memory horizon has little effect once retrieval policy is controlled.",
"kill_signal": "memory-length changes stop mattering once retrieval triggers or evaluation design are normalized",
"missing_piece": "What is missing is a horizon study that matches retrieval policy and benchmark framing while varying how much state is retained.",
"program_kind": "regime mapping",
"time_to_clarity": "medium",
"priority_note": "Conceptually useful, but currently less decisive than retrieval-policy framing for a first discussion round.",
"reading_extract": "whether longer context actually changes retrieval choices, and whether the benchmark rewards persistence more than selective recall",
"title": "Does longer memory really help?",
},
}
@dataclass(frozen=True)
class IdeaSignal:
    signal_id: str
    cluster: str
    direction_type: str
    theme: str
    claim_or_observation: str
    tension: str
    missing_piece: str
    possible_axis: str
    academic_value: str
    evidence_confidence: str
    paper_ids: list[str]
@dataclass(frozen=True)
class DirectionCard:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    focus_axis: str
    main_confound: str
    program_kind: str
    contribution_shape: str
    time_to_clarity: str
    one_line_thesis: str
    why_interesting: str
    literature_suggests: list[str]
    closest_prior_gap: list[str]
    missing_piece: str
    possible_variants: list[str]
    academic_value: str
    first_probes: list[str]
    what_counts_as_insight: str
    weakness_conditions: list[str]
    kill_criteria: list[str]
    what_would_change_mind: list[str]
    best_fit: str
    why_this_ranks_here: str
    evidence_confidence: str
    paper_ids: list[str]
    signal_ids: list[str]
    anchor_reading_notes: list[dict[str, str]]
@dataclass(frozen=True)
class ScreenedDirection:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    total_score: float
    discussion_worthiness: int
    academic_value_score: int
    evidence_grounding: int
    direction_distinctness: int
    first_probe_clarity: int
    thesis_potential: int
    recommendation: str
    rationale: str
def read_core_set(path: Path) -> list[dict[str, str]]:
    import csv
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8", newline="") as handle:
        return [dict(row) for row in csv.DictReader(handle)]
def write_jsonl(path: Path, rows: Iterable[dict[str, Any] | Any]) -> None:
    ensure_dir(path.parent)
    encoded: list[str] = []
    for row in rows:
        value = asdict(row) if is_dataclass(row) else row
        encoded.append(json.dumps(value, ensure_ascii=False))
    atomic_write_text(path, "\n".join(encoded).rstrip() + ("\n" if encoded else ""))
def write_json(path: Path, data: Any) -> None:
    ensure_dir(path.parent)
    atomic_write_text(path, json.dumps(data, ensure_ascii=False, indent=2).rstrip() + "\n")
def uniq_keep_order(items: Iterable[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or "").strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out


def slugify(text: str) -> str:
    raw = re.sub(r"[^A-Za-z0-9]+", "-", str(text or "").strip().lower())
    return raw.strip("-") or "x"


def clean_text(text: str, *, limit: int = 220) -> str:
    s = str(text or "").strip()
    s = s.replace("\n", " ")
    s = re.sub(r"\s+", " ", s)
    s = s.replace("|", ", ")
    s = s.strip(" \"'`")
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()
def clean_sentence(text: str, *, limit: int = 180) -> str:
    # Clean with a generous budget first, then trim back to whole sentences.
    s = clean_text(text, limit=max(limit * 2, limit))
    if len(s) <= limit:
        return s
    window = s[:limit + 40]
    pieces = re.split(r'(?<=[.!?;])\s+', window)
    acc: list[str] = []
    total = 0
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue
        nxt = total + len(piece) + (1 if acc else 0)
        if nxt > limit and acc:
            break
        acc.append(piece)
        total = nxt
        # Stop once the kept sentences fill about 65% of the budget.
        if total >= int(limit * 0.65):
            break
    if acc:
        return " ".join(acc).strip()
    # Unreachable in practice (the window is non-empty), kept as a word-boundary fallback.
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()
def markdown_table(headers: list[str], rows: list[list[str]]) -> str:
    head = "| " + " | ".join(headers) + " |"
    sep = "| " + " | ".join(["---"] * len(headers)) + " |"
    body = ["| " + " | ".join(str(cell).replace("\n", " ").strip() for cell in row) + " |" for row in rows]
    return "\n".join([head, sep] + body)


def write_markdown(path: Path, text: str) -> None:
    ensure_dir(path.parent)
    atomic_write_text(path, text.rstrip() + "\n")
def extract_goal_from_goal_md(path: Path) -> str:
    if not path.exists():
        return "research ideas"
    text = path.read_text(encoding="utf-8", errors="ignore")
    # Strip before testing for "#" so indented headings are skipped as well.
    lines = [ln for ln in (raw.strip() for raw in text.splitlines()) if ln and not ln.startswith("#")]
    return lines[0] if lines else "research ideas"
def _idea_workspace_from_brief(path: Path) -> Path | None:
    try:
        return path.resolve().parents[2]
    except Exception:
        return None


def _require_positive_float(value: Any, *, field_name: str) -> float:
    try:
        parsed = float(value)
    except Exception as exc:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
    if parsed <= 0:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return parsed
def _query_int_override(workspace: Path, key: str, default: int) -> int:
    queries_path = workspace / "queries.md"
    normalized = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
    if not normalized:
        return int(default)
    if normalized not in pipeline_overridable_query_fields(workspace):
        return int(default)
    if not queries_path.exists():
        return int(default)
    for raw in queries_path.read_text(encoding="utf-8", errors="ignore").splitlines():
        line = raw.strip()
        if not line.startswith("- ") or ":" not in line:
            continue
        # Distinct local names keep the `key` parameter from being shadowed.
        raw_key, raw_value = line[2:].split(":", 1)
        line_key = raw_key.strip().lower().replace(" ", "_").replace("-", "_")
        if line_key != normalized:
            continue
        cleaned = raw_value.split("#", 1)[0].strip().strip('"').strip("'")
        if not cleaned:
            raise ValueError(f"Invalid ideation override: `{normalized}` is empty in queries.md.")
        try:
            parsed = int(cleaned)
        except Exception as exc:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be an integer in queries.md.") from exc
        if parsed <= 0:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be a positive integer in queries.md.")
        return parsed
    return int(default)
def _validate_score_weights(value: Any, *, field_name: str) -> dict[str, float]:
    # Only the keys below are enforced; the numeric values are never used as fallbacks,
    # so a missing key always raises.
    required = {
        "discussion_worthiness": 0.24,
        "academic_value": 0.22,
        "evidence_grounding": 0.18,
        "direction_distinctness": 0.16,
        "first_probe_clarity": 0.10,
        "thesis_potential": 0.10,
    }
    if not isinstance(value, dict):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    weights: dict[str, float] = {}
    for key in required:
        if key not in value:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{key}")
        weights[key] = _require_positive_float(value.get(key), field_name=f"{field_name}.{key}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    # Normalize so the returned weights sum to 1 (up to rounding).
    return {key: round(weight / total, 6) for key, weight in weights.items()}
def _validate_diversity_axes(value: Any, *, field_name: str) -> list[str]:
    allowed = {"cluster", "direction_type", "program_kind"}
    if not isinstance(value, list):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    axes: list[str] = []
    seen: set[str] = set()
    for item in value:
        axis = str(item or "").strip().lower().replace("-", "_")
        if not axis:
            continue
        if axis not in allowed:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{axis}")
        if axis in seen:
            continue
        seen.add(axis)
        axes.append(axis)
    if not axes:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return axes
def resolve_idea_contract(workspace: Path) -> dict[str, Any]:
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None:
        raise ValueError("Missing active pipeline contract.")
    query_defaults = dict(spec.query_defaults)
    quality_contract = dict(spec.quality_contract)

    def _require_positive_int(value: Any, *, field_name: str) -> int:
        try:
            parsed = int(value)
        except Exception as exc:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
        if parsed <= 0:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
        return parsed

    signal_policy = quality_contract.get("signal_policy") or {}
    direction_policy = quality_contract.get("direction_policy") or {}
    screening_policy = quality_contract.get("screening_policy") or {}
    brief = parse_idea_brief(workspace / "output" / "trace" / "IDEA_BRIEF.md")
    focus_clusters = [str(x).strip() for x in (brief.get("focus_clusters") or []) if str(x).strip()]
    direction_pool_min_default = _require_positive_int(query_defaults.get("direction_pool_min"), field_name="query_defaults.direction_pool_min")
    direction_pool_max_default = _require_positive_int(query_defaults.get("direction_pool_max"), field_name="query_defaults.direction_pool_max")
    shortlist_size_default = _require_positive_int(query_defaults.get("idea_shortlist_size"), field_name="query_defaults.idea_shortlist_size")
    report_top_n_default = _require_positive_int(query_defaults.get("report_top_n"), field_name="query_defaults.report_top_n")
    idea_screen_top_n_default = _require_positive_int(query_defaults.get("idea_screen_top_n"), field_name="query_defaults.idea_screen_top_n")
    shortlist_min = _require_positive_int(direction_policy.get("shortlist_min"), field_name="quality_contract.direction_policy.shortlist_min")
    shortlist_max = _require_positive_int(direction_policy.get("shortlist_max"), field_name="quality_contract.direction_policy.shortlist_max")
    keep_min = _require_positive_int(direction_policy.get("keep_min"), field_name="quality_contract.direction_policy.keep_min")
    cluster_diversity_min = _require_positive_int(direction_policy.get("cluster_diversity_min"), field_name="quality_contract.direction_policy.cluster_diversity_min")
    lead_diversity_target = _require_positive_int(direction_policy.get("lead_diversity_target"), field_name="quality_contract.direction_policy.lead_diversity_target")
    lead_diversity_axes = _validate_diversity_axes(
        direction_policy.get("lead_diversity_axes"),
        field_name="quality_contract.direction_policy.lead_diversity_axes",
    )
    signal_table_min = _require_positive_int(signal_policy.get("min_rows"), field_name="quality_contract.signal_policy.min_rows")
    keep_rank_max = _require_positive_int(screening_policy.get("keep_rank_max"), field_name="quality_contract.screening_policy.keep_rank_max")
    maybe_rank_max = _require_positive_int(screening_policy.get("maybe_rank_max"), field_name="quality_contract.screening_policy.maybe_rank_max")
    score_weights = _validate_score_weights(
        screening_policy.get("score_weights"),
        field_name="quality_contract.screening_policy.score_weights",
    )
    direction_pool_min = _query_int_override(workspace, "direction_pool_min", direction_pool_min_default)
    direction_pool_max = _query_int_override(workspace, "direction_pool_max", direction_pool_max_default)
    shortlist_size = _query_int_override(workspace, "idea_shortlist_size", shortlist_size_default)
    report_top_n = _query_int_override(workspace, "report_top_n", report_top_n_default)
    idea_screen_top_n = _query_int_override(workspace, "idea_screen_top_n", idea_screen_top_n_default)
    if direction_pool_min > direction_pool_max:
        raise ValueError(f"Invalid ideation contract: direction_pool_min ({direction_pool_min}) exceeds direction_pool_max ({direction_pool_max}).")
    if shortlist_size < shortlist_min or shortlist_size > shortlist_max:
        raise ValueError(
            f"Invalid ideation contract: idea_shortlist_size ({shortlist_size}) must stay within "
            f"[{shortlist_min}, {shortlist_max}]."
        )
    if report_top_n > shortlist_size:
        raise ValueError(f"Invalid ideation contract: report_top_n ({report_top_n}) exceeds idea_shortlist_size ({shortlist_size}).")
    if maybe_rank_max < keep_rank_max:
        raise ValueError(
            f"Invalid ideation contract: maybe_rank_max ({maybe_rank_max}) must be >= keep_rank_max ({keep_rank_max})."
        )
    if idea_screen_top_n > direction_pool_max:
        raise ValueError(f"Invalid ideation contract: idea_screen_top_n ({idea_screen_top_n}) exceeds direction_pool_max ({direction_pool_max}).")
    return {
        "focus_clusters": focus_clusters,
        "direction_pool_min": direction_pool_min,
        "direction_pool_max": direction_pool_max,
        "idea_screen_top_n": idea_screen_top_n,
        "shortlist_size": shortlist_size,
        "report_top_n": report_top_n,
        "shortlist_min": shortlist_min,
        "shortlist_max": shortlist_max,
        "signal_table_min": signal_table_min,
        "keep_min": keep_min,
        "cluster_diversity_min": cluster_diversity_min,
        "lead_diversity_target": lead_diversity_target,
        "lead_diversity_axes": lead_diversity_axes,
        "keep_rank_max": keep_rank_max,
        "maybe_rank_max": maybe_rank_max,
        "score_weights": score_weights,
    }
def parse_idea_brief(path: Path) -> dict[str, Any]:
    text = path.read_text(encoding="utf-8", errors="ignore") if path.exists() else ""
    out: dict[str, Any] = {
        "goal": extract_goal_from_goal_md(path),
        "focus_clusters": [],
        "query_buckets": [],
        "exclusions": [],
        "constraints": [],
        "targets": {},
    }
    cur = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith("## "):
            cur = line[3:].strip().lower()
            continue
        if cur == "goal" and line.startswith("- Topic:"):
            out["goal"] = line.split(":", 1)[1].strip()
        if cur in {"focus after c2", "focus lenses after c2"} and line.startswith("- Focus clusters:"):
            clusters = [x.strip() for x in line.split(":", 1)[1].split(";") if x.strip()]
            cleaned: list[str] = []
            for item in clusters:
                low = item.lower()
                if "to be filled" in low or "fill after c2" in low or "placeholder" in low:
                    continue
                cleaned.append(item)
            out["focus_clusters"] = cleaned
        if cur == "query buckets" and re.match(r"^\d+\.\s+", line):
            out["query_buckets"].append(re.sub(r"^\d+\.\s+", "", line).strip())
        if cur in {"exclude terms", "exclusions"} and line.startswith("- "):
            value = line[2:].strip()
            if value and not value.lower().startswith("none"):
                out["exclusions"].append(value)
        if cur == "constraints" and line.startswith("- "):
            out["constraints"].append(line[2:].strip())
        if cur == "targets" and line.startswith("- ") and ":" in line:
            key, value = line[2:].split(":", 1)
            out["targets"][key.strip().lower().replace(" ", "_")] = value.strip()
    return out
def collect_note_index(path: Path) -> dict[str, dict[str, Any]]:
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        pid = str(rec.get("paper_id") or "").strip()
        if pid:
            out[pid] = rec
    return out


def keywords_from_cluster(name: str) -> list[str]:
    toks = [t for t in re.findall(r"[A-Za-z0-9]+", str(name or "").lower()) if t not in STOPWORDS and len(t) > 2]
    return uniq_keep_order(toks)


def score_note_to_cluster(cluster_name: str, note: dict[str, Any]) -> int:
    keys = keywords_from_cluster(cluster_name)
    blob_parts = [str(note.get("title") or "")]
    for field in ["summary_bullets", "limitations", "key_results", "method", "abstract"]:
        val = note.get(field)
        if isinstance(val, list):
            blob_parts.extend([str(x) for x in val])
        elif isinstance(val, str):
            blob_parts.append(val)
    blob = " ".join(blob_parts).lower()
    return sum(1 for key in keys if key in blob)
def map_notes_to_clusters(taxonomy_path: Path, notes_path: Path) -> dict[str, list[dict[str, Any]]]:
    import yaml

    taxonomy = yaml.safe_load(taxonomy_path.read_text(encoding="utf-8", errors="ignore")) if taxonomy_path.exists() else []
    notes = [r for r in read_jsonl(notes_path) if isinstance(r, dict)]
    out: dict[str, list[dict[str, Any]]] = {}
    for top in taxonomy or []:
        if not isinstance(top, dict):
            continue
        for child in top.get("children") or []:
            if not isinstance(child, dict):
                continue
            name = str(child.get("name") or "").strip()
            if not name:
                continue
            scored: list[tuple[int, dict[str, Any]]] = []
            for note in notes:
                s = score_note_to_cluster(name, note)
                if s > 0:
                    scored.append((s, note))
            scored.sort(key=lambda x: (-x[0], str(x[1].get("paper_id") or "")))
            out[name] = [n for _, n in scored[:8]]
    return out
def _cluster_profile(cluster: str) -> tuple[str, list[str], str]:
    low = cluster.lower()
    for keys, axes, direction_type, academic_value in CLUSTER_AXIS_HINTS:
        if any(key in low for key in keys):
            return direction_type, axes, academic_value
    return "research", ["assumption sensitivity", "failure analysis", "scope boundary"], "Could sharpen the way this sub-area is framed and compared."


def _note_bullets(note: dict[str, Any], field: str) -> list[str]:
    val = note.get(field)
    if isinstance(val, list):
        # Clean each item once and keep only the non-empty results.
        return [cleaned for x in val if (cleaned := clean_text(x, limit=220))]
    if isinstance(val, str):
        cleaned = clean_text(val, limit=220)
        return [cleaned] if cleaned else []
    return []


def _specific_limitations(note: dict[str, Any]) -> list[str]:
    items = _note_bullets(note, "limitations")
    specific = []
    for item in items:
        low = item.lower()
        if any(pat in low for pat in GENERIC_LIMITATION_PATTERNS):
            continue
        specific.append(item)
    return specific or items
def _evidence_confidence(notes: list[dict[str, Any]]) -> str:
    if not notes:
        return "low"
    if any(str(n.get("evidence_level") or "").strip() == "fulltext" for n in notes):
        return "medium-high"
    specific = sum(
        1
        for n in notes
        if _specific_limitations(n) and not any(p in (_specific_limitations(n)[0].lower()) for p in GENERIC_LIMITATION_PATTERNS)
    )
    if len(notes) >= 3 and specific >= 2:
        return "medium"
    return "low-medium"
def _axis_profile(axis: str, cluster: str) -> dict[str, str]:
    base = AXIS_INSIGHT_LIBRARY.get(axis, {})
    confound = base.get("confound", "nearby design choices and evaluation framing")
    return {
        "question": base.get("question", f"whether {axis} is doing more conceptual work in {cluster.lower()} than current papers make explicit"),
        "confound": confound,
        "thesis": base.get("thesis", f"Current results in {cluster.lower()} may be hard to interpret because {axis} still moves together with {confound}."),
        "insight": base.get("insight", f"A meaningful result would show whether {axis} changes how we interpret the current literature rather than merely how we narrate it."),
        "contribution_shape": base.get("contribution_shape", f"Could turn {axis} into a cleaner explanatory variable for {cluster.lower()} rather than a background convenience."),
        "demotion": base.get("demotion", f"This direction would weaken if the strongest prior work already isolates {axis} cleanly."),
        "kill_signal": base.get("kill_signal", f"the strongest prior work already isolates {axis} against {confound}"),
        "missing_piece": base.get("missing_piece", f"What is missing is a cleaner comparison that varies {axis} while keeping {confound} fixed enough to change interpretation, not just presentation."),
        "program_kind": base.get("program_kind", "mechanism clarification"),
        "time_to_clarity": base.get("time_to_clarity", "medium"),
        "priority_note": base.get("priority_note", f"Worth discussion because it could turn {axis} into a sharper explanatory wedge for {cluster.lower()}."),
        "reading_extract": base.get("reading_extract", f"which variables change together with {axis}, what comparator is used, and whether any ablation already fixes {confound}"),
        "title": base.get("title", f"What {axis} is really doing"),
    }
def _result_fact(note: dict[str, Any]) -> str:
    candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
    scored: list[str] = []
    for cand in candidates:
        low = cand.lower()
        if any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "outperform", "achiev", "%"]) or any(ch.isdigit() for ch in cand):
            scored.append(cand)
    chosen = scored[0] if scored else _claim_text(note)
    chosen = re.sub(r"^(for instance|for example|e\.g\.)[:,]?\s*", "", chosen, flags=re.IGNORECASE)
    low = chosen.lower()
    tasks = _extract_task_mentions(note)
    task_text = "/".join(tasks[:2]) if tasks else "reported setting"
    percents = re.findall(r"\d+(?:\.\d+)?%", chosen)
    if "pass@1" in low and percents:
        if len(percents) >= 2:
            return clean_text(f"{task_text}: {percents[0]} pass@1 vs {percents[1]} in the reported comparison.", limit=95)
        return clean_text(f"{task_text}: {percents[0]} pass@1 in the reported comparison.", limit=95)
    if "success rate" in low and len(percents) >= 2:
        baseline = " over imitation/RL baselines" if "imitation" in low or "reinforcement learning" in low else " in the reported comparison"
        return clean_text(f"{task_text}: {percents[0]}/{percents[1]} success-rate gains{baseline}.", limit=95)
    if len(percents) >= 2:
        return clean_text(f"{task_text}: {percents[0]} vs {percents[1]} in the reported comparison.", limit=95)
    if len(percents) == 1:
        return clean_text(f"{task_text}: reported result reaches {percents[0]} in the abstract.", limit=95)
    return clean_text(chosen, limit=95)
def _metric_phrase(notes: list[dict[str, Any]]) -> str:
    blob = " ".join(_result_fact(note) for note in notes)
    low = blob.lower()
    if "pass@1" in low:
        return "pass@1 plus failure-type shifts"
    if "success rate" in low:
        return "success rate plus failure-type shifts"
    if "accuracy" in low:
        return "accuracy plus failure-type shifts"
    if "exact match" in low:
        return "exact match plus failure-type shifts"
    if "f1" in low:
        return "F1 plus failure-type shifts"
    if "%" in blob:
        return "the reported benchmark metric plus failure-type shifts"
    return "task success plus failure-type shifts"


def _sentence_has_concrete_hook(text: str) -> bool:
    low = str(text or "").lower()
    if any(ch.isdigit() for ch in str(text or "")):
        return True
    if any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "alfworld", "webshop", "hotpotqa", "fever", "humaneval", "webarena", "agentbench"]):
        return True
    return False
def _extract_task_mentions(note: dict[str, Any]) -> list[str]:
    blob_parts = []
    for field in ["key_results", "summary_bullets", "method", "abstract", "title"]:
        val = note.get(field)
        if isinstance(val, list):
            blob_parts.extend([str(x) for x in val])
        elif isinstance(val, str):
            blob_parts.append(val)
    blob = " ".join(blob_parts)
    low = blob.lower()
    positions: list[tuple[int, str]] = []
    for hint in BENCHMARK_HINTS:
        pos = low.find(hint.lower())
        if pos >= 0:
            positions.append((pos, hint))
    for match in re.finditer(r"\b(?:HumanEval|ALFWorld|WebShop|HotpotQA|FEVER|WebArena|AgentBench|GSM8K|MATH)\b", blob):
        positions.append((match.start(), match.group(0)))
    positions.sort(key=lambda item: (item[0], item[1]))
    return uniq_keep_order(name for _, name in positions)


def _task_phrase(notes: list[dict[str, Any]]) -> str:
    tasks: list[str] = []
    for note in notes:
        tasks.extend(_extract_task_mentions(note))
    uniq = uniq_keep_order(tasks)
    if not uniq:
        return "a small public task slice"
    if len(uniq) == 1:
        return uniq[0]
    return "/".join(uniq[:2])


def _claim_text(note: dict[str, Any]) -> str:
    candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
    for cand in candidates:
        if len(cand.split()) >= 8:
            return cand
    return clean_text(note.get("method") or note.get("title") or "", limit=220)
def _limitation_text(note: dict[str, Any], axis: str, cluster: str) -> str:
    limitations = _specific_limitations(note)
    if limitations and not any(pat in limitations[0].lower() for pat in GENERIC_LIMITATION_PATTERNS):
        return limitations[0]
    profile = _axis_profile(axis, cluster)
    return f"The paper still leaves unclear whether {axis} is genuinely responsible for the reported story in {cluster.lower()}, or whether that story is really driven by {profile['confound']}."


def _anchor_notes(note_index: dict[str, dict[str, Any]], paper_ids: list[str]) -> list[dict[str, Any]]:
    return [note_index[pid] for pid in paper_ids if pid in note_index]


def _paper_annotation(note: dict[str, Any], axis: str, cluster: str) -> str:
    title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
    profile = _axis_profile(axis, cluster)
    result = _result_fact(note)
    return clean_sentence(
        f"{title}: {result} Open gap: {axis} still moves with {profile['confound']}.",
        limit=300,
    )


def _synthesis_annotation(notes: list[dict[str, Any]], axis: str, cluster: str) -> str:
    profile = _axis_profile(axis, cluster)
    task_phrase = _task_phrase(notes)
    return clean_sentence(
        f"Across the anchor papers, the live question is whether {axis} changes interpretation itself or merely rides along with {profile['confound']}, especially on {task_phrase}.",
        limit=260,
    )
def build_signal_rows(*, cluster: str, notes: list[dict[str, Any]]) -> list[IdeaSignal]:
    if not notes:
        return []
    direction_type, axes, academic_value = _cluster_profile(cluster)
    evidence_confidence = _evidence_confidence(notes)
    anchors = notes[:3]
    paper_ids = uniq_keep_order(str(n.get("paper_id") or "").strip() for n in anchors)
    signals: list[IdeaSignal] = []
    for idx, axis in enumerate(axes[:3], start=1):
        anchor = anchors[(idx - 1) % len(anchors)]
        claim = _claim_text(anchor)
        tension = _limitation_text(anchor, axis, cluster)
        profile = _axis_profile(axis, cluster)
        missing_piece = clean_sentence(
            f"What is still missing is a direction that isolates whether {axis} is the real explanatory variable in {cluster.lower()}, rather than leaving it entangled with {profile['confound']}.",
            limit=240,
        )
        signals.append(IdeaSignal(
            signal_id=f"SIG-{slugify(cluster)[:16]}-{idx}",
            cluster=cluster,
            direction_type=direction_type,
            theme=f"{profile['title']} in {cluster}",
            claim_or_observation=claim,
            tension=clean_text(tension, limit=220),
            missing_piece=missing_piece,
            possible_axis=axis,
            academic_value=academic_value,
            evidence_confidence=evidence_confidence,
            paper_ids=paper_ids,
        ))
    return signals
def signal_table_markdown(rows: list[IdeaSignal]) -> str:
    table_rows = []
    for row in rows:
        table_rows.append([
            row.signal_id,
            row.cluster,
            row.theme,
            row.claim_or_observation,
            row.tension,
            row.missing_piece,
            row.possible_axis,
            row.academic_value,
            row.evidence_confidence,
            ", ".join(row.paper_ids),
        ])
    return "\n".join([
        "# IDEA_SIGNAL_TABLE",
        "",
        "## Research signals",
        "",
        markdown_table(["Signal ID", "Cluster", "Theme", "Claim / observation", "Tension", "Missing piece", "Possible axis", "Academic value", "Confidence", "Paper IDs"], table_rows),
        "",
    ])
def _title_from_signal(signal: IdeaSignal) -> str:
    profile = _axis_profile(signal.possible_axis, signal.cluster)
    return clean_sentence(profile["title"], limit=72)


def _best_fit(direction_type: str, axis: str) -> str:
    if direction_type == "governance":
        return "Best fit when the group wants a deployment-relevant reading of agent behavior rather than another internal benchmark story."
    if direction_type == "evaluation":
        return "Best fit when the discussion is really about what counts as convincing evidence, not just how to raise scores."
    if axis in {"search depth", "verification loop", "retrieval policy"}:
        return "Best fit for a PhD discussion that wants one sharper mechanism/confound question rather than a broad systems buildout."
    return "Best fit for PI/PhD discussions that want a thesis-worthy mechanism question rather than a generic benchmark wrapper."
def signals_to_direction_cards(signals: list[IdeaSignal], *, note_index: dict[str, dict[str, Any]], focus_clusters: list[str], pool_min: int, pool_max: int) -> list[DirectionCard]:
    focus = {x.strip() for x in focus_clusters if str(x).strip()}
    cards: list[DirectionCard] = []
    for idx, signal in enumerate(signals, start=1):
        profile = _axis_profile(signal.possible_axis, signal.cluster)
        anchors = _anchor_notes(note_index, signal.paper_ids)
        task_phrase = _task_phrase(anchors)
        metric_phrase = _metric_phrase(anchors)
        nearest = anchors[0] if anchors else {}
        nearest_title = clean_text((nearest.get("title") if isinstance(nearest, dict) else "") or (signal.paper_ids[0] if signal.paper_ids else signal.cluster), limit=90)
        nearest_task_phrase = _task_phrase([nearest]) if nearest else task_phrase
        literature_suggests = [_paper_annotation(note, signal.possible_axis, signal.cluster) for note in anchors[:2]]
        literature_suggests.append(_synthesis_annotation(anchors, signal.possible_axis, signal.cluster))
        closest_prior_gap = [
            clean_sentence(
                f"{nearest_title} is the closest prior anchor because it already reports concrete behavior on {nearest_task_phrase}. The unresolved point is whether those gains survive once {profile['confound']} is held fixed while {signal.possible_axis} is varied.",
                limit=220,
            ),
            clean_sentence(
                "The novelty test here is narrow, not rhetorical: if a strong anchor paper already runs that single-variable control, this direction should collapse quickly rather than stay alive as a vague confound story.",
                limit=220,
            ),
        ]
        one_line_thesis = clean_sentence(profile["thesis"], limit=230)
        why_interesting = clean_sentence(
            f"This is not just another benchmark wedge. It opens a {profile['program_kind']} line around {signal.possible_axis}: {profile['question']}.",
            limit=220,
        )
        missing_piece = clean_sentence(profile["missing_piece"], limit=230)
        possible_variants = [
            clean_sentence(f"Single-variable control on {task_phrase}: vary {signal.possible_axis} while holding {profile['confound']} fixed as far as the setup allows.", limit=210),
            clean_sentence("Replace score-only reporting with a failure-type comparison after the same intervention, so the result says more than whether the average went up.", limit=210),
            clean_sentence("Check whether the conclusion survives on one simple public task slice and one more tool- or environment-heavy setting before treating it as a general claim.", limit=210),
        ]
        first_probes = [
            clean_sentence(
                f"Intervention: vary {signal.possible_axis} while holding {profile['confound']} as fixed as possible on {task_phrase}. Readout: {metric_phrase}. Decisive if the interpretation changes even after the control.",
                limit=220,
            ),
            clean_sentence(
                f"Prior-work audit: inspect {nearest_title} for any ablation that already fixes {profile['confound']}, and if the conclusion survives, demote this direction.",
                limit=180,
            ),
        ]
        what_counts_as_insight = clean_sentence(profile['insight'], limit=190)
        kill_criteria = [
            clean_sentence(f"Kill quickly if {profile['kill_signal']}.", limit=170),
            clean_sentence(f"Kill if the first controlled probe leaves both {metric_phrase} and the failure taxonomy essentially unchanged.", limit=170),
        ]
        weakness_conditions = [
            clean_sentence(f"This direction is weaker if changing {signal.possible_axis} mostly rescales the aggregate metric without changing which failure modes appear or disappear.", limit=175),
            clean_sentence(profile['demotion'], limit=170),
        ]
        anchor_reading_notes: list[dict[str, str]] = []
        for note in anchors[:3]:
            title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
            result = _result_fact(note)
            anchor_reading_notes.append({
                "paper_title": title,
                "why_read": clean_sentence(f"Closest {signal.cluster.lower()} anchor with a concrete result hook on {task_phrase}: {result}", limit=180),
                "what_to_extract": clean_sentence(f"Extract the metric and comparator, then check whether {profile['reading_extract']}.", limit=180),
                "current_hook": clean_sentence(result, limit=180),
                "kill_signal": clean_sentence(f"Weaken this direction if the paper already shows {profile['kill_signal']}.", limit=180),
            })
        why_this_ranks_here = clean_sentence(profile['priority_note'], limit=170)
        cards.append(DirectionCard(
            direction_id=f"DIR-{idx:03d}",
            cluster=signal.cluster,
            direction_type=signal.direction_type,
            title=_title_from_signal(signal),
            focus_axis=signal.possible_axis,
            main_confound=profile['confound'],
            program_kind=profile['program_kind'],
            contribution_shape=profile['contribution_shape'],
            time_to_clarity=profile['time_to_clarity'],
            one_line_thesis=one_line_thesis,
            why_interesting=why_interesting,
            literature_suggests=literature_suggests,
            closest_prior_gap=closest_prior_gap,
            missing_piece=missing_piece,
            possible_variants=possible_variants,
            academic_value=profile['contribution_shape'],
            first_probes=first_probes,
            what_counts_as_insight=what_counts_as_insight,
            weakness_conditions=weakness_conditions,
            kill_criteria=kill_criteria,
            what_would_change_mind=kill_criteria,
            best_fit=_best_fit(signal.direction_type, signal.possible_axis),
            why_this_ranks_here=why_this_ranks_here,
            evidence_confidence=signal.evidence_confidence,
            paper_ids=signal.paper_ids,
            signal_ids=[signal.signal_id],
            anchor_reading_notes=anchor_reading_notes,
        ))
    cards.sort(key=lambda c: (0 if c.cluster in focus else 1, c.cluster, c.program_kind, c.title))
    target = max(pool_min, min(pool_max, len(cards)))
    return cards[:target]
def direction_pool_markdown(cards: list[DirectionCard]) -> str:
    rows: list[list[str]] = []
    for card in cards:
        rows.append([
            card.direction_id,
            card.cluster,
            card.direction_type,
            card.program_kind,
            card.title,
            clean_sentence(card.one_line_thesis, limit=100),
            clean_sentence(card.why_interesting, limit=100),
            clean_sentence(card.missing_piece, limit=100),
            clean_sentence(" / ".join(card.possible_variants), limit=120),
            clean_sentence(card.academic_value, limit=90),
            clean_sentence(" / ".join(card.first_probes), limit=120),
            card.evidence_confidence,
            ", ".join(card.paper_ids),
        ])
    return "\n".join([
        "# IDEA_DIRECTION_POOL",
        "",
        "## Candidate research directions",
        "",
        markdown_table(["Direction ID", "Cluster", "Type", "Program", "Title", "One-line thesis", "Why interesting", "Missing piece", "Possible variants", "Academic value", "First probes", "Confidence", "Paper IDs"], rows),
        "",
    ])


def _thesis_potential(direction_type: str, cluster: str) -> int:
    if direction_type in {"mechanism", "coordination", "adaptation"}:
        return 5
    if direction_type in {"systems", "governance"}:
        return 4
    if "benchmark" in cluster.lower() or direction_type == "evaluation":
        return 3
    return 4
def score_direction_cards(
    cards: list[DirectionCard],
    *,
    focus_clusters: list[str],
    keep_rank_max: int,
    maybe_rank_max: int,
    score_weights: dict[str, float] | None = None,
) -> list[ScreenedDirection]:
    focus = {x.strip() for x in focus_clusters if str(x).strip()}
    keep_rank_max = int(keep_rank_max)
    maybe_rank_max = int(maybe_rank_max)
    if keep_rank_max <= 0:
        raise ValueError("Invalid ideation scoring contract: keep_rank_max must be positive.")
    if maybe_rank_max < keep_rank_max:
        raise ValueError("Invalid ideation scoring contract: maybe_rank_max must be >= keep_rank_max.")
    weights = _validate_score_weights(score_weights or {}, field_name="score_direction_cards.score_weights")
    cluster_counts: dict[str, int] = {}
    program_counts: dict[str, int] = {}
    for card in cards:
        cluster_counts[card.cluster] = cluster_counts.get(card.cluster, 0) + 1
        program_counts[card.program_kind] = program_counts.get(card.program_kind, 0) + 1
    provisional: list[tuple[DirectionCard, float, int, int, int, int, int, int]] = []
    for card in cards:
        discussion_worthiness = 5 if card.cluster in focus or card.time_to_clarity == "fast" else 4
        academic_value_score = 5 if any(token in card.contribution_shape.lower() for token in ["rule", "protocol", "regime map", "causal"]) else 4
        concrete_hooks = sum(1 for item in card.literature_suggests if _sentence_has_concrete_hook(item))
        evidence_grounding = 5 if len(card.paper_ids) >= 3 and concrete_hooks >= 2 else 4 if len(card.paper_ids) >= 2 and concrete_hooks >= 1 else 3
        if program_counts.get(card.program_kind, 0) == 1 and cluster_counts.get(card.cluster, 0) == 1:
            direction_distinctness = 5
        elif program_counts.get(card.program_kind, 0) == 1 or cluster_counts.get(card.cluster, 0) == 1:
            direction_distinctness = 4
        else:
            direction_distinctness = 3
        probe_blob = " ".join(card.first_probes).lower()
        first_probe_clarity = 5 if all(token in probe_blob for token in ["intervention:", "readout:", "decisive if"]) else 4 if "intervention:" in probe_blob else 3
        thesis_potential = _thesis_potential(card.direction_type, card.cluster)
        total = round(
            discussion_worthiness * weights["discussion_worthiness"]
            + academic_value_score * weights["academic_value"]
            + evidence_grounding * weights["evidence_grounding"]
            + direction_distinctness * weights["direction_distinctness"]
            + first_probe_clarity * weights["first_probe_clarity"]
            + thesis_potential * weights["thesis_potential"],
            2,
        )
        provisional.append((card, total, discussion_worthiness, academic_value_score, evidence_grounding, direction_distinctness, first_probe_clarity, thesis_potential))
    provisional.sort(key=lambda row: (-row[1], 0 if row[0].time_to_clarity == "fast" else 1, row[0].cluster, row[0].title))
    rows: list[ScreenedDirection] = []
    for idx, (card, total, dw, av, eg, dd, fp, tp) in enumerate(provisional, start=1):
        recommendation = "keep" if idx <= keep_rank_max else "maybe" if idx <= maybe_rank_max else "drop"
strengths: list[str] = []
if eg >= 5:
strengths.append("concrete anchor evidence")
if dd >= 5:
strengths.append(f"distinct {card.program_kind} wedge")
if fp >= 5:
strengths.append("clean first probe")
if av >= 5:
strengths.append("thesis-sized payoff")
if not strengths:
strengths.append("discussion value")
risk = "still abstract-first" if str(card.evidence_confidence).startswith("low") else f"main risk is controlling {card.main_confound} cleanly"
rationale = clean_sentence(f"Strongest on {' + '.join(strengths[:2])}; {risk}.", limit=160)
rows.append(ScreenedDirection(
direction_id=card.direction_id,
cluster=card.cluster,
direction_type=card.direction_type,
title=card.title,
total_score=total,
discussion_worthiness=dw,
academic_value_score=av,
evidence_grounding=eg,
direction_distinctness=dd,
first_probe_clarity=fp,
thesis_potential=tp,
recommendation=recommendation,
rationale=rationale,
))
return rows
def screening_table_markdown(rows: list[ScreenedDirection]) -> str:
    data: list[list[str]] = []
    for row in rows:
        data.append([
            row.direction_id,
            row.cluster,
            row.direction_type,
            row.title,
            f"{row.total_score:.2f}",
            str(row.discussion_worthiness),
            str(row.academic_value_score),
            str(row.evidence_grounding),
            str(row.direction_distinctness),
            str(row.first_probe_clarity),
            str(row.thesis_potential),
            row.recommendation,
            row.rationale,
        ])
    return "\n".join([
        "# IDEA_SCREENING_TABLE",
        "",
        "## Discussion-first screening",
        "",
        markdown_table(["Direction ID", "Cluster", "Type", "Title", "Total", "Discussion", "Academic value", "Evidence", "Distinctness", "First probe", "Thesis potential", "Decision", "Rationale"], data),
        "",
    ])
def shortlist_markdown(records: list[dict[str, Any]]) -> str:
lines = ["# IDEA_SHORTLIST", "", "## Prioritized directions", ""]
for record in records:
lines.extend([
f"### Direction {record['rank']}. {record['title']}",
f"- Cluster: {record['cluster']}",
f"- Type: {record['direction_type']}",
f"- Focus axis: {record.get('focus_axis')}",
f"- Program kind: {record.get('program_kind')}",
f"- Main confound: {record.get('main_confound')}",
f"- Time to clarity: {record.get('time_to_clarity')}",
f"- One-line thesis: {record['one_line_thesis']}",
f"- Why this is interesting: {record['why_interesting']}",
f"- Why this ranks here: {record['why_this_ranks_here']}",
f"- Contribution shape: {record.get('contribution_shape')}",
"- What the literature already suggests:",
])
for item in record.get("literature_suggests") or []:
lines.append(f" - {item}")
lines.extend([
"- Closest prior work and why it does not settle the question:",
])
for item in record.get("closest_prior_gap") or []:
lines.append(f" - {item}")
lines.extend([
f"- What is still missing: {record['missing_piece']}",
"- Possible variants:",
])
for item in record.get("possible_variants") or []:
lines.append(f" - {item}")
lines.extend([
f"- Why this could matter academically: {record['academic_value']}",
"- First probes:",
])
for item in record.get("first_probes") or []:
lines.append(f" - {item}")
lines.extend([
f"- What would count as actual insight: {record['what_counts_as_insight']}",
"- What would make this weak or unconvincing:",
])
for item in record.get("weakness_conditions") or []:
lines.append(f" - {item}")
lines.extend([
"- Quick kill criteria:",
])
for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
lines.append(f" - {item}")
lines.extend([
f"- Best fit: {record['best_fit']}",
f"- Evidence confidence: {record['evidence_confidence']}",
"- Anchor papers: " + ", ".join(f"`{pid}`" for pid in record.get("paper_ids") or []),
f"- Why prioritized now: {record['why_prioritized']}",
"",
])
return "\n".join(lines)
def shortlist_snapshot_table(records: list[dict[str, Any]]) -> str:
    rows = []
    for rec in records:
        rows.append([
            str(rec.get("rank") or ""),
            rec.get("title", ""),
            clean_sentence(rec.get("why_this_ranks_here", ""), limit=52),
            clean_sentence(rec.get("contribution_shape", rec.get("academic_value", "")), limit=56),
            clean_sentence((rec.get("kill_criteria") or rec.get("what_would_change_mind") or [""])[0], limit=54),
        ])
    return markdown_table(["Rank", "Direction", "Why now", "If it survives", "Fast kill signal"], rows)
def build_report_payload(*, topic: str, shortlist: list[dict[str, Any]], deferred: list[dict[str, Any]], trace_paths: dict[str, str]) -> dict[str, Any]:
top = list(shortlist)
takeaways: list[str] = []
if top:
takeaways.append("The strongest directions are the ones most likely to change how existing results are interpreted, not just add another benchmark win.")
takeaways.append("The current rank order reflects a tradeoff between time-to-clarity, thesis-sized payoff, and how concrete the nearest prior-work gap already looks.")
program_kinds = uniq_keep_order(str(rec.get("program_kind") or "").strip() for rec in top)
if len(program_kinds) >= 2:
takeaways.append(f"The lead set is intentionally not one confound template repeated three times: it spans {', '.join(program_kinds[:3])} rather than a single explanatory mold.")
else:
takeaways.append("The lead set still sits in one tight neighborhood, so the next reading pass should stress-test whether those directions are genuinely distinct thesis lines.")
if any(str(rec.get("evidence_confidence") or "").startswith("low") for rec in top):
takeaways.append("Most anchors are still abstract-first, so the memo is best used to decide the next reading and falsification pass rather than to lock a project immediately.")
else:
takeaways.append("The evidence is already concrete enough for a serious PI/PhD discussion, though one deeper paper pass could still reshuffle the exact order.")
lead = top[0] if top else {}
lead_anchor = (lead.get("anchor_reading_notes") or [{}])[0] if top else {}
lead_anchor_title = lead_anchor.get("paper_title") if isinstance(lead_anchor, dict) else "the first anchor paper"
discussion_questions = [
"Which direction still survives once we ask for a single-variable control rather than a suggestive confound story?",
"Which lead direction would still matter if the nearest prior work already addressed the obvious control we are worried about?",
"Which candidate could plausibly turn into a thesis line with a reusable method, protocol, or regime map rather than a one-off empirical note?",
"Which ranking would change most after one full-paper reading pass on the anchor set?",
]
uncertainties = [
"The main remaining risk is novelty risk, not idea scarcity: closer reading may reveal that one lead direction is already settled by an existing control or ablation.",
"Evidence confidence is still bounded by abstract-first notes for part of the lead set, so paper-specific details could either strengthen or kill a direction quickly.",
"The memo is more reliable as a discussion and triage artifact than as a final commitment on exact project order.",
]
next_steps = []
if top:
next_steps.append(clean_sentence(f"Start with {lead.get('title')}: read {lead_anchor_title or 'the first anchor paper'} looking specifically for whether {lead.get('focus_axis')} is already isolated against {lead.get('main_confound')}.", limit=220))
first_probe = (lead.get("first_probes") or [""])[0]
if first_probe:
next_steps.append(clean_sentence(first_probe, limit=220))
next_steps.append(clean_sentence(f"Re-rank only after checking the quick kill criteria for {lead.get('title')} and at least one competing direction in the lead set.", limit=220))
else:
next_steps.extend([
"Read the anchor papers for the current top direction in full and test whether the memo's stated hidden variable still looks unresolved.",
"In the next PI/PhD discussion, use the ranking arguments and kill criteria to decide whether any direction deserves promotion into a sharper thesis candidate.",
])
return {
"topic": topic,
"takeaways": takeaways,
"top_directions": top,
"deferred_directions": deferred,
"discussion_questions": discussion_questions,
"uncertainties": uncertainties,
"next_steps": next_steps,
"trace_artifacts": trace_paths,
}
def report_markdown(payload: dict[str, Any]) -> str:
top_dirs = payload.get("top_directions") or []
deferred_idx = 3 + len(top_dirs)
discussion_idx = deferred_idx + 1
uncertainty_idx = deferred_idx + 2
next_idx = deferred_idx + 3
appendix_idx = deferred_idx + 4
lines = [
"# Research Idea Brainstorm Memo",
"",
"## 0. Scope and framing",
"",
f"- Topic: {payload.get('topic') or 'research ideas'}",
"- Intended readers: PI / PhD",
"- Goal: surface a small number of discussion-worthy research directions rather than force a final project choice.",
"- Current evidence basis: abstract-first notes unless otherwise stated.",
"- What this memo is not: not a final project spec, not a survey draft, and not a symmetric top-3 proposal pack.",
"",
"## 1. Big-picture takeaways",
"",
]
for item in payload.get("takeaways") or []:
lines.append(f"- {item}")
lines.extend([
"",
"## 2. Top directions at a glance",
"",
shortlist_snapshot_table(payload.get("top_directions") or []),
"",
])
for idx, record in enumerate(top_dirs, start=1):
lines.extend([
f"## {idx + 2}. Direction {idx} — {record.get('title')}",
"",
"### One-line thesis",
f"- {record.get('one_line_thesis')}",
"",
"### Why it belongs in the lead set",
f"- {record.get('why_this_ranks_here')}",
f"- Time to clarity: {record.get('time_to_clarity')}",
"",
"### What the current literature actually shows",
])
for item in record.get("literature_suggests") or []:
lines.append(f"- {item}")
lines.extend([
"",
"### Closest prior work and remaining gap",
])
for item in record.get("closest_prior_gap") or []:
lines.append(f"- {item}")
lines.extend([
f"- Missing piece: {record.get('missing_piece')}",
"",
"### If this direction is right, what contribution emerges",
f"- {record.get('contribution_shape') or record.get('academic_value')}",
f"- {record.get('what_counts_as_insight')}",
"",
"### Smallest decisive probe",
])
for item in record.get("first_probes") or []:
lines.append(f"- {item}")
lines.extend([
"",
"### Quick kill criteria",
])
for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
lines.append(f"- {item}")
lines.extend(["",])
lines.extend([
f"## {deferred_idx}. Other promising but not prioritized directions",
"",
])
deferred = payload.get("deferred_directions") or []
if deferred:
for record in deferred:
lines.append(f"- **{record.get('title')}** — {record.get('one_line_thesis')} (why not prioritized now: {record.get('why_not_prioritized')})")
else:
lines.append("- No additional deferred directions were retained in the current memo.")
lines.extend([
"",
f"## {discussion_idx}. Cross-cutting discussion questions",
"",
])
for item in payload.get("discussion_questions") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {uncertainty_idx}. Uncertainty and disagreement",
"",
])
for item in payload.get("uncertainties") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {next_idx}. Suggested next reading / next discussion step",
"",
])
for item in payload.get("next_steps") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {appendix_idx}. Appendix guide",
"",
"- `output/APPENDIX.md` for anchor-paper reading notes and deferred directions.",
"- `output/REPORT.json` for the structured version of this memo.",
])
return "\n".join(lines)
def appendix_markdown(payload: dict[str, Any], *, core_titles: dict[str, str]) -> str:
lines = [
"# Appendix to the Research Idea Brainstorm Memo",
"",
"## A. Deferred but still promising directions",
"",
]
deferred = payload.get("deferred_directions") or []
if deferred:
for rec in deferred:
lines.extend([
f"### {rec.get('title')}",
f"- One-line thesis: {rec.get('one_line_thesis')}",
f"- Program kind: {rec.get('program_kind')}",
f"- Why interesting: {rec.get('why_interesting')}",
f"- Why not prioritized now: {rec.get('why_not_prioritized')}",
"",
])
else:
lines.append("- (none)")
lines.extend([
"## B. Anchor papers and what to extract",
"",
])
for rec in payload.get("top_directions") or []:
lines.append(f"### {rec.get('title')}")
lines.append(f"- Program kind: {rec.get('program_kind')}")
lines.append(f"- Lead-set reason: {rec.get('why_this_ranks_here')}")
notes = rec.get("anchor_reading_notes") or []
if notes:
rows: list[list[str]] = []
for item in notes:
if not isinstance(item, dict):
continue
rows.append([
item.get("paper_title", "paper"),
item.get("why_read", ""),
item.get("what_to_extract", ""),
item.get("current_hook", ""),
item.get("kill_signal", ""),
])
if rows:
lines.append(markdown_table(["Anchor paper", "Why read now", "What to extract", "Current evidence hook", "Kill signal"], rows))
else:
for pid in rec.get("paper_ids") or []:
title = core_titles.get(pid, pid)
lines.append(f"- {title}")
else:
for pid in rec.get("paper_ids") or []:
title = core_titles.get(pid, pid)
lines.append(f"- {title}")
lines.append("")
return "\n".join(lines)
FILE:tooling/pipeline_spec.py
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import Any
import yaml
@dataclass(frozen=True)
class PipelineStage:
    id: str
    title: str
    checkpoint: str
    mode: str
    required_skills: tuple[str, ...]
    optional_skills: tuple[str, ...]
    produces: tuple[str, ...]
    human_checkpoint: dict[str, Any]

    @staticmethod
    def from_mapping(stage_id: str, data: Any, *, path: Path) -> "PipelineStage":
        if not isinstance(data, dict):
            raise ValueError(f"`stages.{stage_id}` must be a mapping in {path}")
        return PipelineStage(
            id=str(stage_id or "").strip(),
            title=str(data.get("title") or stage_id).strip() or str(stage_id),
            checkpoint=str(data.get("checkpoint") or stage_id).strip() or str(stage_id),
            mode=str(data.get("mode") or "").strip(),
            required_skills=_string_tuple(data.get("required_skills"), field_name=f"stages.{stage_id}.required_skills", path=path),
            optional_skills=_string_tuple(data.get("optional_skills"), field_name=f"stages.{stage_id}.optional_skills", path=path),
            produces=_string_tuple(data.get("produces"), field_name=f"stages.{stage_id}.produces", path=path),
            human_checkpoint=_mapping(data.get("human_checkpoint"), field_name=f"stages.{stage_id}.human_checkpoint", path=path, required=False),
        )
@dataclass(frozen=True)
class PipelineSpec:
    path: Path
    name: str
    version: str
    units_template: str
    default_checkpoints: tuple[str, ...]
    profile: str
    routing_hints: tuple[str, ...]
    routing_default: bool
    routing_priority: int
    target_artifacts: tuple[str, ...]
    contract_model: str
    structure_mode: str
    pre_retrieval_shell: dict[str, Any]
    binding_layers: tuple[str, ...]
    core_chapter_h3_target: int
    query_defaults: dict[str, Any]
    overridable_query_fields: tuple[str, ...]
    quality_contract: dict[str, Any]
    loop_policy: dict[str, Any]
    stages: dict[str, PipelineStage]
    variant_of: str
    variant_overrides: dict[str, Any]

    @staticmethod
    def load(path: Path) -> "PipelineSpec":
        resolved = path.resolve()
        raw_frontmatter, frontmatter = _load_variant_aware_frontmatter(resolved)
        name = str(frontmatter.get("name") or path.stem.replace(".pipeline", ""))
        units_template = str(frontmatter.get("units_template") or "")
        version = str(frontmatter.get("version") or "")
        default_checkpoints = _string_tuple(frontmatter.get("default_checkpoints"), field_name="default_checkpoints", path=resolved)
        profile = str(frontmatter.get("profile") or "default").strip() or "default"
        routing_hints = _string_tuple(frontmatter.get("routing_hints"), field_name="routing_hints", path=resolved)
        routing_default = bool(frontmatter.get("routing_default"))
        try:
            routing_priority = int(frontmatter.get("routing_priority") or 0)
        except Exception:
            routing_priority = 0
        target_artifacts = _string_tuple(frontmatter.get("target_artifacts"), field_name="target_artifacts", path=resolved)
        contract_model = str(frontmatter.get("contract_model") or "").strip()
        structure_mode = str(frontmatter.get("structure_mode") or "").strip()
        pre_retrieval_shell = _mapping(frontmatter.get("pre_retrieval_shell"), field_name="pre_retrieval_shell", path=resolved, required=False)
        binding_layers = _string_tuple(frontmatter.get("binding_layers"), field_name="binding_layers", path=resolved)
        try:
            core_chapter_h3_target = int(frontmatter.get("core_chapter_h3_target") or 0)
        except Exception:
            core_chapter_h3_target = 0
        query_defaults = _mapping(frontmatter.get("query_defaults"), field_name="query_defaults", path=resolved, required=False)
        overridable_query_fields = _string_tuple(
            frontmatter.get("overridable_query_fields"),
            field_name="overridable_query_fields",
            path=resolved,
        )
        quality_contract = _mapping(frontmatter.get("quality_contract"), field_name="quality_contract", path=resolved, required=False)
        loop_policy = _mapping(frontmatter.get("loop_policy"), field_name="loop_policy", path=resolved, required=False)
        stages = _parse_stages(frontmatter.get("stages"), path=resolved)
        variant_of = str(raw_frontmatter.get("variant_of") or "").strip()
        variant_overrides = _mapping(raw_frontmatter.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False)
        if not units_template:
            raise ValueError(f"Missing units_template in pipeline front matter: {resolved}")
        return PipelineSpec(
            path=resolved,
            name=name,
            version=version,
            units_template=units_template,
            default_checkpoints=default_checkpoints,
            profile=profile,
            routing_hints=routing_hints,
            routing_default=routing_default,
            routing_priority=routing_priority,
            target_artifacts=target_artifacts,
            contract_model=contract_model,
            structure_mode=structure_mode,
            pre_retrieval_shell=pre_retrieval_shell,
            binding_layers=binding_layers,
            core_chapter_h3_target=core_chapter_h3_target,
            query_defaults=query_defaults,
            overridable_query_fields=overridable_query_fields,
            quality_contract=quality_contract,
            loop_policy=loop_policy,
            stages=stages,
            variant_of=variant_of,
            variant_overrides=variant_overrides,
        )

    def query_default(self, key: str, default: Any = None) -> Any:
        return self.query_defaults.get(str(key or "").strip(), default)

    def allows_query_override(self, key: str) -> bool:
        return str(key or "").strip() in set(self.overridable_query_fields)
def _parse_frontmatter(text: str) -> dict[str, Any]:
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("Pipeline file must start with YAML front matter '---'")
    end_idx = None
    for idx in range(1, len(lines)):
        if lines[idx].strip() == "---":
            end_idx = idx
            break
    if end_idx is None:
        raise ValueError("Unterminated YAML front matter (missing closing '---')")
    raw = "\n".join(lines[1:end_idx])
    data = yaml.safe_load(raw) or {}
    if not isinstance(data, dict):
        raise ValueError("Pipeline YAML front matter must be a mapping")
    return data
def _load_variant_aware_frontmatter(path: Path, seen: set[Path] | None = None) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return `(raw, effective)` front matter, resolving `variant_of` chains recursively."""
    resolved = path.resolve()
    active = set(seen or set())
    if resolved in active:
        # `active` is a set, so the printed chain is not guaranteed to follow traversal order.
        chain = " -> ".join(str(p) for p in [*active, resolved])
        raise ValueError(f"Cyclic `variant_of` chain detected: {chain}")
    active.add(resolved)
    text = resolved.read_text(encoding="utf-8")
    raw = _parse_frontmatter(text)
    variant_of = str(raw.get("variant_of") or "").strip()
    if not variant_of:
        return raw, dict(raw)
    _validate_raw_variant_frontmatter(raw, path=resolved)
    base_path = _resolve_pipeline_reference(resolved, variant_of)
    _, base_effective = _load_variant_aware_frontmatter(base_path, seen=active)
    # Merge order: base pipeline, then this file's name/version, then its variant_overrides.
    current = {key: raw[key] for key in ("name", "version") if key in raw}
    merged = _deep_merge(base_effective, current)
    merged = _deep_merge(
        merged,
        _mapping(raw.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False),
    )
    merged["variant_of"] = variant_of
    merged["variant_overrides"] = raw.get("variant_overrides") or {}
    return raw, merged
def _validate_raw_variant_frontmatter(raw: dict[str, Any], *, path: Path) -> None:
    allowed_raw_keys = {"name", "version", "variant_of", "variant_overrides"}
    extra_keys = sorted(str(key) for key in raw.keys() if str(key) not in allowed_raw_keys)
    if extra_keys:
        raise ValueError(
            "Variant pipeline files may only keep top-level keys "
            "`name`, `version`, `variant_of`, `variant_overrides`; "
            f"move the rest under `variant_overrides` in {path}: {', '.join(extra_keys)}"
        )
def _resolve_pipeline_reference(path: Path, ref: str) -> Path:
    value = str(ref or "").strip()
    if not value:
        raise ValueError(f"Empty `variant_of` reference in {path}")
    candidate = Path(value)
    if candidate.is_absolute() and candidate.exists():
        return candidate.resolve()
    repo_root = path.parent.parent
    for base in (path.parent, repo_root, repo_root / "pipelines"):
        direct = (base / value).resolve()
        if direct.exists():
            return direct
    stem = Path(value).name
    if stem.endswith(".pipeline.md"):
        stem = stem[: -len(".pipeline.md")]
    if stem:
        for base in (path.parent, repo_root / "pipelines"):
            direct = (base / f"{stem}.pipeline.md").resolve()
            if direct.exists():
                return direct
    raise ValueError(f"Could not resolve `variant_of: {value}` from {path}")
def _deep_merge(base: Any, override: Any) -> Any:
    """Merge `override` into `base`: dicts merge recursively, `__op__` dicts patch lists, anything else replaces."""
    if isinstance(base, list) and isinstance(override, dict) and any(str(key).startswith("__") for key in override.keys()):
        return _apply_list_patch(base, override)
    if isinstance(base, dict) and isinstance(override, dict):
        merged: dict[str, Any] = {str(k): v for k, v in base.items()}
        for key, value in override.items():
            if key in merged:
                merged[key] = _deep_merge(merged[key], value)
            else:
                merged[key] = value
        return merged
    return override
def _apply_list_patch(base: list[Any], override: dict[str, Any]) -> list[Any]:
    allowed_keys = {"__append__", "__prepend__", "__remove__", "__replace__"}
    bad_keys = sorted(str(key) for key in override.keys() if str(key) not in allowed_keys)
    if bad_keys:
        raise ValueError(
            "List patch overrides only support "
            "`__append__`, `__prepend__`, `__remove__`, `__replace__`; "
            f"got: {', '.join(bad_keys)}"
        )
    if "__replace__" in override:
        replacement = override.get("__replace__")
        if not isinstance(replacement, list):
            raise ValueError("List patch `__replace__` must be a YAML list")
        return list(replacement)
    current = list(base)
    remove_values = override.get("__remove__", [])
    prepend_values = override.get("__prepend__", [])
    append_values = override.get("__append__", [])
    for field_name, value in (
        ("__remove__", remove_values),
        ("__prepend__", prepend_values),
        ("__append__", append_values),
    ):
        if not isinstance(value, list):
            raise ValueError(f"List patch `{field_name}` must be a YAML list")
    if remove_values:
        current = [item for item in current if item not in remove_values]
    if prepend_values:
        current = list(prepend_values) + current
    if append_values:
        current = current + list(append_values)
    return current
def _mapping(value: Any, *, field_name: str, path: Path, required: bool) -> dict[str, Any]:
    # Note: `required` is accepted but not enforced here; a missing value always yields {}.
    if value is None:
        return {}
    if not isinstance(value, dict):
        raise ValueError(f"`{field_name}` must be a mapping in {path}")
    return {str(key): item for key, item in value.items()}
def _string_tuple(value: Any, *, field_name: str, path: Path) -> tuple[str, ...]:
    if value is None:
        return ()
    if not isinstance(value, list):
        raise ValueError(f"`{field_name}` must be a YAML list in {path}")
    out: list[str] = []
    for item in value:
        text = str(item or "").strip()
        if text:
            out.append(text)
    return tuple(out)
def _parse_stages(value: Any, *, path: Path) -> dict[str, PipelineStage]:
    if value is None:
        return {}
    items: list[tuple[str, Any]] = []
    if isinstance(value, dict):
        items = [(str(stage_id or "").strip(), data) for stage_id, data in value.items()]
    elif isinstance(value, list):
        for idx, item in enumerate(value):
            if not isinstance(item, dict):
                raise ValueError(f"`stages[{idx}]` must be a mapping in {path}")
            stage_id = str(item.get("id") or item.get("stage") or "").strip()
            if not stage_id:
                raise ValueError(f"`stages[{idx}]` is missing `id` in {path}")
            items.append((stage_id, item))
    else:
        raise ValueError(f"`stages` must be a mapping or list in {path}")
    stages: dict[str, PipelineStage] = {}
    for stage_id, data in items:
        stage = PipelineStage.from_mapping(stage_id, data, path=path)
        if not stage.id:
            raise ValueError(f"`stages` contains an empty stage id in {path}")
        stages[stage.id] = stage
    return stages
FILE:tooling/pipeline_text.py
from __future__ import annotations
import json
import re
from pathlib import Path
from typing import Any
from tooling.common import load_yaml, read_jsonl
def slug_unit_id(unit_id: str) -> str:
    """Turn a unit id such as '3.1' into a filename-safe slug such as 'S3_1'."""
    raw = str(unit_id or '').strip()
    out: list[str] = []
    for ch in raw:
        out.append(ch if ch.isalnum() else '_')
    safe = ''.join(out).strip('_')
    return f'S{safe}' if safe else 'S'
def load_outline_sections(path: Path) -> list[dict[str, Any]]:
    outline = load_yaml(path) if path.exists() else []
    if not isinstance(outline, list):
        return []
    out: list[dict[str, Any]] = []
    for sec in outline:
        if not isinstance(sec, dict):
            continue
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        subsections: list[dict[str, str]] = []
        for sub in sec.get('subsections') or []:
            if not isinstance(sub, dict):
                continue
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                subsections.append({'id': sub_id, 'title': sub_title})
        if sec_id and sec_title:
            out.append({'id': sec_id, 'title': sec_title, 'subsections': subsections})
    return out
def iter_h3_units(path: Path) -> list[dict[str, str]]:
    units: list[dict[str, str]] = []
    for sec in load_outline_sections(path):
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        for sub in sec.get('subsections') or []:
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                units.append(
                    {
                        'section_id': sec_id,
                        'section_title': sec_title,
                        'sub_id': sub_id,
                        'title': sub_title,
                    }
                )
    return units
def read_jsonl_map(path: Path, key: str) -> dict[str, dict[str, Any]]:
    """Index JSONL records by `key`; when a key repeats, the last record wins."""
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        val = str(rec.get(key) or '').strip()
        if val:
            out[val] = rec
    return out
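def read_bib_keys(path: Path) -> set[str]:
    """Return the entry keys in a BibTeX file (empty set if the file is missing or empty)."""
    if not path.exists() or path.stat().st_size <= 0:
        return set()
    text = path.read_text(encoding='utf-8', errors='ignore')
    # Matches lines such as `@article{Key2024,` and captures `Key2024`.
    return set(re.findall(r'(?im)^@\w+\s*\{\s*([^,\s]+)\s*,', text))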
def uniq_keep_order(items: list[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or '').strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out
def clean_excerpt(text: str, *, limit: int = 220) -> str:
    """Collapse whitespace, strip quotes, replace table pipes, and clip at a word boundary."""
    s = str(text or '').strip()
    s = s.replace('\n', ' ')
    s = re.sub(r'\s+', ' ', s)
    s = s.strip(' "\'`')
    s = s.replace('|', ', ')
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(' ', 1)[0].strip()
    return clipped if clipped else s[:limit].strip()
def citation_list(keys: list[str], *, max_keys: int = 3) -> str:
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'[@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'
def inline_evidence_phrase(keys: list[str], *, max_keys: int = 3) -> str:
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'in [@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'
def heading_blocks(md: str) -> list[tuple[str, str]]:
    """Split Markdown into `(H3 title, body)` pairs; an H2 heading closes the current H3 block."""
    blocks: list[tuple[str, str]] = []
    current = ''
    lines: list[str] = []
    for raw in (md or '').splitlines():
        if raw.startswith('### '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = raw[4:].strip()
            lines = []
            continue
        if raw.startswith('## '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = ''
            lines = []
            continue
        if current:
            lines.append(raw)
    if current:
        blocks.append((current, '\n'.join(lines).strip()))
    return blocks
def dump_jsonl_lines(records: list[dict[str, Any]]) -> str:
    return '\n'.join(json.dumps(r, ensure_ascii=False) for r in records).rstrip() + ('\n' if records else '')
Write H2 chapter lead blocks (`sections/S<sec_id>_lead.md`) that preview the chapter's comparison lens and connect its H3 subsections, without adding new fac...
---
name: chapter-lead-writer
description: 'Write H2 chapter lead blocks (`sections/S<sec_id>_lead.md`) that preview the chapter''s comparison lens and
connect its H3 subsections, without adding new facts.
**Trigger**: chapter lead writer, section lead writer, H2 lead, lead paragraph, 章节导读, 章节导语.
**Use when**: you have H2 chapters with multiple H3 subsections and the draft reads like paragraph islands across subsections.
**Skip if**: the outline has no H3 subsections, or `outline/chapter_briefs.jsonl` is missing.
**Network**: none.
**Guardrail**: no new facts/citations; no headings; no narration templates; use only citation keys present in `citations/ref.bib`.'
version: 0.1.0
metadata:
  openclaw:
    requires:
      anyBins:
        - python3
        - python
---
# Chapter Lead Writer
## Purpose
This skill writes the body-only lead block that sits under an H2 heading and makes a chapter with multiple H3 subsections read like one argument.
This `SKILL.md` is now the **package router**, not the full method manual.
## Migration status
This package is in **P0 compatibility-preserving migration**:
- `references/` and `assets/` now hold the intended knowledge and contract layers.
- `scripts/run.py` remains in **compatibility mode** for active generation.
- a later script-thinning pass should move more judgment and exemplars out of Python and leave the script with deterministic execution and validation only.
For now, preserve the existing output contract and treat `scripts/run.py` as the execution source of truth.
## Inputs
Required:
- `outline/outline.yml`
- `outline/chapter_briefs.jsonl`
- `citations/ref.bib`
Optional:
- `outline/writer_context_packs.jsonl`
## Outputs
For each H2 section with H3 subsections:
- `sections/S<sec_id>_lead.md`
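
As an illustration of the `S<sec_id>` naming, here is a minimal sketch. It assumes the section id is slugged the same way as `slug_unit_id` in `tooling/pipeline_text.py` (non-alphanumeric characters become `_`). `lead_path` is a hypothetical helper introduced only for this example and is not part of the package.

```python
from pathlib import Path


def slug_unit_id(unit_id: str) -> str:
    # Mirrors tooling/pipeline_text.py: keep alphanumerics, map everything else to '_'.
    raw = str(unit_id or "").strip()
    safe = "".join(ch if ch.isalnum() else "_" for ch in raw).strip("_")
    return f"S{safe}" if safe else "S"


def lead_path(sec_id: str, *, root: Path = Path("sections")) -> Path:
    # Hypothetical helper: assumes lead files reuse the slugged section id.
    return root / f"{slug_unit_id(sec_id)}_lead.md"


print(lead_path("3").as_posix())    # sections/S3_lead.md
print(lead_path("4.2").as_posix())  # sections/S4_2_lead.md
```

Whether `scripts/run.py` builds the path exactly this way is not confirmed here; treat the script as the source of truth.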
## Output contract
Keep these file-shape rules stable:
- each lead file is body-only and contains no headings
- each lead file previews the chapter lens and connects multiple H3s as one argument
- each lead file stays within the chapter's existing citation scope
- each lead file adds no new facts that are not supported later in the chapter
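The file-shape rules above can be sketched as a small check. This is a minimal illustration, not the logic in `scripts/run.py`: the function name `check_lead_shape` is hypothetical, and the default limit of 3 paragraphs is taken from the `max_paragraphs` maximum in `assets/lead_block_contract.json`.

```python
import re

# ATX heading at the start of a line (up to 3 leading spaces, 1-6 '#', then whitespace).
HEADING_RE = re.compile(r"^\s{0,3}#{1,6}\s")


def check_lead_shape(text: str, max_paragraphs: int = 3) -> list[str]:
    """Return a list of shape problems; an empty list means the lead looks body-only.

    Hypothetical helper: the real validation may apply different or additional rules.
    """
    problems = []
    if any(HEADING_RE.match(line) for line in text.splitlines()):
        problems.append("contains a heading")
    # Paragraphs are runs of non-blank lines separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]
    if not paragraphs:
        problems.append("empty lead")
    elif len(paragraphs) > max_paragraphs:
        problems.append(f"{len(paragraphs)} paragraphs exceeds limit {max_paragraphs}")
    return problems
```

The check covers only shape. Whether a lead adds new facts or stays in citation scope still needs the chapter brief and `citations/ref.bib`.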
## Load Order
Always read:
- `references/overview.md`
- `references/lead_block_archetypes.md`
Read by task:
- `references/throughline_patterns.md` — when chapter briefs are thin or hard to convert into a throughline
- `references/bridge_examples.md` — when the lead needs stronger H3 transitions without slide narration
- `references/bad_narration_examples.md` — when removing table-of-contents narration, planner talk, count-based openers
Machine-readable assets:
- `assets/lead_block_contract.json` — stable package contract for lead-block shape
- `assets/lead_block_compatibility_defaults.json` — fallback phrasing, item limits, joiners, sentence cadence
## Routing rules
Use this skill in the following order:
1. Confirm the chapter is eligible
- identify H2 sections with H3 subsections from `outline/outline.yml`
- locate the corresponding chapter brief in `outline/chapter_briefs.jsonl`
2. Load the method
- read `references/overview.md`
- read `references/lead_block_archetypes.md`
- load the other reference files only if the chapter brief or current prose needs them
3. Check citation scope
- if `outline/writer_context_packs.jsonl` exists, use it for cross-cutting chapter citations
- keep any citations inside the existing chapter scope and validate keys against `citations/ref.bib`
4. Execute
- **current phase**: use `scripts/run.py` in compatibility mode to preserve active behavior and output shape
- **future phase**: keep `scripts/run.py` for deterministic execution only, with the writing method and anti-pattern inventory living in `references/`
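Step 3 (validating keys against `citations/ref.bib`) might look like the sketch below. It assumes Pandoc-style citation blocks such as `[@a; @b]` and ordinary BibTeX entries of the form `@type{key,`. The helper names are hypothetical and the regular expressions are deliberately simple, not a full BibTeX parser.

```python
import re

# "@article{Smith2020," -> "Smith2020" (simplified; ignores @string/@comment subtleties).
BIB_ENTRY_RE = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")
# A bracketed block that contains at least one "@", e.g. "[@a; @b]".
CITE_BLOCK_RE = re.compile(r"\[([^\[\]]*@[^\[\]]*)\]")
CITE_KEY_RE = re.compile(r"@([\w:-]+)")


def bib_keys(bib_text: str) -> set[str]:
    """Collect entry keys from the text of a .bib file."""
    return set(BIB_ENTRY_RE.findall(bib_text))


def cited_keys(markdown: str) -> set[str]:
    """Collect citation keys used inside bracketed citation blocks."""
    keys = set()
    for block in CITE_BLOCK_RE.findall(markdown):
        keys.update(CITE_KEY_RE.findall(block))
    return keys


def unknown_keys(markdown: str, bib_text: str) -> list[str]:
    """Keys cited in the lead that do not exist in the bibliography."""
    return sorted(cited_keys(markdown) - bib_keys(bib_text))
```

A non-empty result from `unknown_keys` would be a reason to stop before writing the lead file, in line with the guardrail that only keys present in `citations/ref.bib` may be used.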
## Compatibility mode note
`scripts/run.py` still contains active lead-generation logic.
That is temporary. For now:
- do not treat the current script wording as the target architecture
- do treat `assets/lead_block_compatibility_defaults.json` as the primary compatibility-mode wording source
- do not copy large prose instructions back into `SKILL.md`
- do preserve the current output contract while reducing obvious narration stems in the active path
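To make the role of the compatibility asset concrete, here is a minimal sketch of filling one template with fallbacks. The fallback strings are copied from `assets/lead_block_compatibility_defaults.json`. The `render` function, its signature, and the choice to put a single space before the citation are assumptions for illustration, not the behavior of `scripts/run.py`.

```python
# Fallback phrasing as listed under "fallbacks" in the compatibility asset.
FALLBACKS = {
    "section_title": "chapter {section_id}",
    "comparison_problem": "the comparison pressure that the subsections keep reopening",
    "recurring_contrasts": "protocol assumptions, resource constraints, and evaluation scope",
    "subsection_preview": "the chapter subsections",
}


def render(template: str, fields: dict, section_id: str, citation: str = "") -> str:
    """Fill a lead template, using the fallback wording for any missing field.

    Hypothetical helper: `fields` maps placeholder names to brief-derived text.
    """
    values = {}
    for name, fallback in FALLBACKS.items():
        values[name] = fields.get(name) or fallback.format(section_id=section_id)
    # Assumption: a citation, when present, is preceded by one space.
    values["citation"] = f" {citation}" if citation else ""
    return template.format(**values)
```

Keeping wording in the asset and only substitution in code is one way to read the note above: a narrated-sounding lead is then fixed by editing JSON rather than Python.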
## What this skill should guarantee
Regardless of where the detailed method lives, this skill should produce chapter leads that:
- state the chapter's comparison lens rather than narrating the outline
- connect the H3 subsections as one argument, not as isolated stops on a tour
- introduce recurring contrasts without slash-list jargon
- keep the evaluation or calibration lens visible at a high level
- avoid slide narration, planner talk, and repeated stock openers
- choose from multiple candidate lead frames when possible (lens-first / sequence-first / comparison-first) and keep the least narrated option instead of reusing one stock cadence everywhere
## Block conditions
Stop and route upstream if any of these are true:
- `outline/chapter_briefs.jsonl` is missing
- the target H2 section has no H3 subsections
- the chapter brief is too incomplete to infer a throughline safely
- the requested lead would require new facts or out-of-scope citations
## Script role
`scripts/run.py` should currently be treated as a **compatibility executor**.
Its long-term role after script thinning is narrower:
- chapter discovery
- brief loading and normalization
- contract validation
- deterministic report writing
It is not the long-term home for lead archetypes, bridge examples, or narration anti-patterns.
## Script
### Quick Start
- `python scripts/run.py --workspace <workspace_dir>`
### All Options
- `--workspace <dir>`
- `--unit-id <id>`
- `--inputs <a;b;...>`
- `--outputs <a;b;...>`
- `--checkpoint <C*>`
### Examples
- `python scripts/run.py --workspace workspaces/<ws>`
## Troubleshooting
- If `outline/chapter_briefs.jsonl` is missing or too thin, rebuild chapter briefs first.
- If `outline/writer_context_packs.jsonl` is missing, the script will still run but with a thinner citation pool.
- If a generated lead sounds narrated, patch the compatibility asset and references before changing Python.
FILE:AGENTS.md
# AGENTS.md
This exported skill uses `AGENTS.md` only as a local repo-root marker for bundled helper scripts.
Source skill: `chapter-lead-writer` from `research-units-pipeline-skills`.
Export slug: `chapter-lead-writer`.
FILE:assets/lead_block_compatibility_defaults.json
{
"compatibility_mode": true,
"default_archetype": "lens-first",
"limits": {
"throughline_items": 2,
"contrast_items": 3,
"subsection_preview_items": 3
},
"fallbacks": {
"section_title": "chapter {section_id}",
"comparison_problem": "the comparison pressure that the subsections keep reopening",
"recurring_contrasts": "protocol assumptions, resource constraints, and evaluation scope",
"subsection_preview": "the chapter subsections"
},
"templates": {
"lens-first": {
"paragraph_1": [
"In {section_title}, the central question is how {comparison_problem} change the meaning of reported progress{citation}.",
"{section_title} is where differences in {comparison_problem} stop looking like implementation detail and start changing what the evidence can support{citation}.",
"The core issue in {section_title} is not breadth alone, but how {comparison_problem} reshape what counts as a fair comparison{citation}."
],
"paragraph_2": [
"{subsection_preview} each expose that pressure from a different angle, showing why {recurring_contrasts} cannot be treated as interchangeable background choices{citation}.",
"Across {subsection_preview}, the same problem keeps returning: {recurring_contrasts} alter the interpretation of otherwise similar results{citation}.",
"Taken together, {subsection_preview} show where {recurring_contrasts} become substantive constraints rather than secondary implementation detail{citation}."
]
}
}
}
FILE:assets/lead_block_contract.json
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://example.local/chapter-lead-writer/lead-block-contract.schema.json",
"title": "Chapter Lead Block Contract",
"type": "object",
"additionalProperties": false,
"required": [
"section_id",
"section_title",
"output_path",
"body_only",
"max_paragraphs",
"must_not_have_headings",
"may_add_new_facts",
"citation_scope",
"compatibility_mode"
],
"properties": {
"section_id": {
"type": "string",
"minLength": 1
},
"section_title": {
"type": "string",
"minLength": 1
},
"output_path": {
"type": "string",
"pattern": "^sections/S[^/]+_lead\\.md$"
},
"body_only": {
"type": "boolean",
"const": true
},
"max_paragraphs": {
"type": "integer",
"minimum": 1,
"maximum": 3
},
"must_not_have_headings": {
"type": "boolean",
"const": true
},
"may_add_new_facts": {
"type": "boolean",
"const": false
},
"citation_scope": {
"type": "string",
"enum": [
"chapter-only",
"chapter-or-selected-crosscutting"
]
},
"compatibility_mode": {
"type": "boolean"
}
}
}
FILE:pipelines/arxiv-survey-latex.pipeline.md
---
name: arxiv-survey-latex
version: 3.8
variant_of: arxiv-survey
variant_overrides:
routing_hints: [latex, pdf, tex, 可编译, 编译]
routing_default: false
routing_priority: 20
units_template: templates/UNITS.arxiv-survey-latex.csv
target_artifacts:
__append__:
- latex/main.tex
- latex/main.pdf
- output/LATEX_BUILD_REPORT.md
stages:
C5:
title: Draft + PDF
required_skills:
__append__:
- latex-scaffold
- latex-compile-qa
optional_skills:
__remove__:
- latex-scaffold
- latex-compile-qa
produces:
__append__:
- latex/main.tex
- latex/main.pdf
- output/LATEX_BUILD_REPORT.md
---
# Pipeline: arXiv survey / review (MD-first + LaTeX/PDF)
Variant of `arxiv-survey`.
Use this pipeline only when the default deliverable must include:
- `latex/main.tex`
- `latex/main.pdf`
- `output/LATEX_BUILD_REPORT.md`
All non-PDF survey behavior, defaults, checkpoints, and earlier stages inherit from `arxiv-survey`.
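The `variant_overrides` block uses `__append__` and `__remove__` markers. The sketch below shows one plausible reading of them. The merge semantics (remove first, then append without duplicates, recurse into nested mappings, replace scalars) and the function name `apply_overrides` are assumptions, since the pipeline loader itself is not shown here.

```python
import copy


def apply_overrides(base: dict, overrides: dict) -> dict:
    """Return a new mapping with variant overrides applied to a base pipeline spec.

    Assumed semantics, for illustration only:
    - {"__append__": [...]} adds items to the base list (skipping duplicates)
    - {"__remove__": [...]} drops items from the base list
    - nested mappings are merged recursively; other values replace the base value
    """
    result = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and (value.keys() & {"__append__", "__remove__"}):
            removed = set(value.get("__remove__", []))
            items = [item for item in result.get(key, []) if item not in removed]
            for item in value.get("__append__", []):
                if item not in items:
                    items.append(item)
            result[key] = items
        elif isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = apply_overrides(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result
```

Under this reading, the variant moves `latex-scaffold` and `latex-compile-qa` from optional to required in C5 and extends the artifact list, while every other field is inherited from `arxiv-survey` unchanged.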
FILE:pipelines/arxiv-survey.pipeline.md
---
name: arxiv-survey
version: 3.8
profile: arxiv-survey
routing_hints: [survey, review, 综述, 调研, literature review]
routing_default: true
routing_priority: 10
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
- outline/transitions.md
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/claim_evidence_matrix.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- outline/writer_context_packs.jsonl
- citations/ref.bib
- citations/verified.jsonl
- sections/sections_manifest.jsonl
- sections/h3_bodies.refined.ok
- sections/paragraphs_curated.refined.ok
- sections/style_harmonized.refined.ok
- sections/opener_varied.refined.ok
- sections/abstract.md
- sections/S1.md
- sections/S2.md
- sections/discussion.md
- sections/conclusion.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/SCHEMA_NORMALIZATION_REPORT.md
- output/EVIDENCE_SELFLOOP_TODO.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/PARAGRAPH_CURATION_REPORT.md
- output/FRONT_MATTER_REPORT.md
- output/CHAPTER_LEADS_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/GLOBAL_REVIEW.md
- output/DRAFT.md
- output/MERGE_REPORT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.arxiv-survey.csv
contract_model: pipeline.frontmatter/v1
structure_mode: section_first
pre_retrieval_shell:
enabled: true
approval_surface: false
allowed_h2: [Introduction, Related Work, Core Chapters, Discussion, Conclusion]
binding_layers: [chapter_skeleton, section_bindings, section_briefs, subsection_mapping]
core_chapter_h3_target: 3
query_defaults:
max_results: 1800
core_size: 300
per_subsection: 28
global_citation_min_subsections: 4
draft_profile: survey
citation_target: recommended
evidence_mode: abstract
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- per_subsection
- global_citation_min_subsections
- draft_profile
- citation_target
- enrich_metadata
- evidence_mode
- fulltext_max_papers
- fulltext_max_pages
- fulltext_min_chars
- time_window.from
- time_window.to
quality_contract:
citation_policy:
unique_hard_floor: 150
unique_recommended: 165
structure_policy:
max_final_h2_by_profile:
survey: 8
deep: 9
max_h3_by_profile:
survey: 10
deep: 12
front_matter_policy:
survey:
introduction:
min_cites: 35
min_paras: 5
min_chars: 2600
related_work:
min_cites: 50
min_paras: 6
min_chars: 3200
deep:
introduction:
min_cites: 40
min_paras: 6
min_chars: 3000
related_work:
min_cites: 55
min_paras: 7
min_chars: 3600
subsection_policy:
survey:
min_unique_citations: 12
min_chars: 4200
deep:
min_unique_citations: 14
min_chars: 5200
loop_policy:
stage_retry_budget:
C1: 2
C2: 2
C3: 1
C4: 1
max_reroutes: 4
require_human_on_retry_after_approval: true
stages:
C0:
title: Init
mode: no_prose
required_skills: [workspace-init, pipeline-router]
optional_skills: []
produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
C1:
title: Retrieval & core set
mode: no_prose
required_skills: [literature-engineer, dedupe-rank]
optional_skills: [keyword-expansion, survey-seed-harvest]
produces: [papers/papers_raw.jsonl, papers/retrieval_report.md, papers/papers_dedup.jsonl, papers/core_set.csv]
C2:
title: Structure
mode: no_prose
required_skills: [taxonomy-builder, chapter-skeleton, section-bindings, section-briefs, outline-builder, section-mapper, outline-refiner, pipeline-router, human-checkpoint]
optional_skills: [outline-budgeter]
produces: [outline/taxonomy.yml, outline/chapter_skeleton.yml, outline/section_bindings.jsonl, outline/section_binding_report.md, outline/section_briefs.jsonl, outline/outline.yml, outline/mapping.tsv, outline/coverage_report.md, outline/outline_state.jsonl, output/REROUTE_STATE.json, DECISIONS.md]
human_checkpoint:
approve: scope + section skeleton + outline
write_to: DECISIONS.md
C3:
title: Evidence
mode: no_prose
required_skills: [pdf-text-extractor, paper-notes, subsection-briefs, chapter-briefs]
optional_skills: []
produces: [papers/fulltext_index.jsonl, papers/paper_notes.jsonl, papers/evidence_bank.jsonl, outline/subsection_briefs.jsonl, outline/chapter_briefs.jsonl]
C4:
title: Citations + evidence packs
mode: no_prose
required_skills: [citation-verifier, evidence-binder, evidence-draft, table-schema, anchor-sheet, table-filler, appendix-table-writer, schema-normalizer, writer-context-pack, evidence-selfloop, claim-matrix-rewriter]
optional_skills: [survey-visuals]
produces: [citations/ref.bib, citations/verified.jsonl, outline/evidence_bindings.jsonl, outline/evidence_binding_report.md, outline/table_schema.md, outline/tables_index.md, outline/tables_appendix.md, output/TABLES_APPENDIX_REPORT.md, outline/evidence_drafts.jsonl, outline/anchor_sheet.jsonl, output/SCHEMA_NORMALIZATION_REPORT.md, outline/writer_context_packs.jsonl, output/EVIDENCE_SELFLOOP_TODO.md, outline/claim_evidence_matrix.md]
C5:
title: Draft
mode: prose_allowed
required_skills: [front-matter-writer, chapter-lead-writer, subsection-writer, writer-selfloop, section-logic-polisher, argument-selfloop, paragraph-curator, style-harmonizer, opener-variator, evaluation-anchor-checker, transition-weaver, section-merger, post-merge-voice-gate, citation-diversifier, citation-injector, draft-polisher, global-reviewer, pipeline-auditor, artifact-contract-auditor]
optional_skills: [prose-writer, subsection-polisher, redundancy-pruner, terminology-normalizer, limitation-weaver, latex-scaffold, latex-compile-qa]
produces: [outline/transitions.md, sections/sections_manifest.jsonl, sections/h3_bodies.refined.ok, sections/paragraphs_curated.refined.ok, sections/style_harmonized.refined.ok, sections/opener_varied.refined.ok, sections/abstract.md, sections/S1.md, sections/S2.md, sections/discussion.md, sections/conclusion.md, output/WRITER_SELFLOOP_TODO.md, output/EVAL_ANCHOR_REPORT.md, output/ARGUMENT_SELFLOOP_TODO.md, output/SECTION_ARGUMENT_SUMMARIES.jsonl, output/ARGUMENT_SKELETON.md, output/PARAGRAPH_CURATION_REPORT.md, output/FRONT_MATTER_REPORT.md, output/CHAPTER_LEADS_REPORT.md, output/SECTION_LOGIC_REPORT.md, output/MERGE_REPORT.md, output/DRAFT.md, output/POST_MERGE_VOICE_REPORT.md, output/CITATION_BUDGET_REPORT.md, output/CITATION_INJECTION_REPORT.md, output/GLOBAL_REVIEW.md, output/AUDIT_REPORT.md, output/CONTRACT_REPORT.md]
---
# Pipeline: arXiv survey / review (MD-first)
Default contract (survey-grade, A150++):
- `queries.md` defaults are set for a *survey deliverable* (no silent downgrade): `core_size=300`, `per_subsection=28`, global unique citations hard floor `>=150` (recommended `>=165` when `core_size=300`; default `citation_target=recommended`).
- `draft_profile` controls **writing strictness** (`survey` vs `deep`), not “speed mode”.
- `evidence_mode` controls **evidence strength** (`abstract` default; `fulltext` optional and heavier).
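The interaction between the hard floor, the recommended target, and `citation_target` can be sketched as follows. This is an illustrative approximation, not the `pipeline-auditor` implementation: the function names are hypothetical, and citations are assumed to appear as Pandoc-style blocks such as `[@a; @b]`.

```python
import re

CITE_BLOCK_RE = re.compile(r"\[([^\[\]]*@[^\[\]]*)\]")
CITE_KEY_RE = re.compile(r"@([\w:-]+)")


def unique_citations(draft: str) -> set[str]:
    """Unique citation keys across the whole draft."""
    keys = set()
    for block in CITE_BLOCK_RE.findall(draft):
        keys.update(CITE_KEY_RE.findall(block))
    return keys


def citation_gate(draft: str, citation_target: str = "recommended",
                  hard_floor: int = 150, recommended: int = 165) -> dict:
    """Decide PASS/FAIL for the global unique-citation gate (assumed logic).

    `citation_target` selects which threshold is blocking; defaults mirror
    `unique_hard_floor` and `unique_recommended` in the quality contract.
    """
    if citation_target not in ("recommended", "hard"):
        raise ValueError(f"unknown citation_target: {citation_target!r}")
    unique = len(unique_citations(draft))
    blocking = recommended if citation_target == "recommended" else hard_floor
    return {
        "unique": unique,
        "blocking_threshold": blocking,
        "meets_hard_floor": unique >= hard_floor,
        "status": "PASS" if unique >= blocking else "FAIL",
    }
```

With the defaults, a draft citing 155 unique keys would clear the hard floor but still fail, because `citation_target=recommended` makes 165 the blocking threshold.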
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- survey-seed-harvest
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- `queries.md` may specify `max_results` and a year `time_window` (`time_window.from` / `time_window.to`); `arxiv-search` will paginate and attach arXiv metadata (categories, arxiv_id, etc.) when online.
- If you import an offline export but later have network, you can set `enrich_metadata: true` in `queries.md` (or run `arxiv-search --enrich-metadata`) to backfill missing abstracts/authors/categories via arXiv `id_list`.
- Evidence-first expectation (A150++): aim for a large dedup pool (target >=1200, not ~200) and a stable, verifiable core set (`core_size=300`) so later stages can bind wide in-scope citation pools without forcing out-of-scope drift.
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- chapter-skeleton
- section-bindings
- section-briefs
- outline-builder
- section-mapper
- outline-refiner
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
human_checkpoint:
- approve: scope + section skeleton + outline
- write_to: DECISIONS.md
Notes:
- `chapter-skeleton` is the first retrieval-informed chapter contract; it stays chapter-level and does not emit stable H3 ids.
- `section-bindings` measures chapter saturation before H3 decomposition and writes PASS/BLOCKED signals into `outline/section_binding_report.md`.
- `section-briefs` turns the chapter layer into decomposition guidance plus subsection seeds; `outline-builder` then derives the first stable `outline/outline.yml` from that section layer.
- `outline-refiner` now writes both `outline/outline_state.jsonl` and `output/REROUTE_STATE.json`; section-first reroute/block information should be read from those artifacts instead of inferred from prose notes.
- Evidence-first expectation: each subsection should be written as a *question to answer* (RQ) plus *evidence needs* (what kind of citations/results are required), not just generic scaffold bullets.
- Coverage default: `section-mapper` uses `queries.md:per_subsection` as the per-H3 mapping contract (A150++ default: 28) so later evidence binding and writing have enough in-scope citations to choose from.
- Diversity expectation: mapping should not over-reuse a few papers across unrelated H3s; reserve “global” works for genuinely cross-cutting citations (controlled by `global_citation_min_subsections`).
- Budget policy (paper-like): avoid H3 explosion; the outline gate uses `queries.md:draft_profile` to set max H3 (survey<=10, deep<=12).
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent H3s into fewer, thicker units, then rerun `section-mapper` → `outline-refiner` before `Approve C2`.
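The diversity note above treats a paper as cross-cutting only when it is mapped to enough subsections. A minimal sketch of that rule, assuming the mapping has already been loaded into a dictionary of subsection id to bibkeys (the real layout of `outline/mapping.tsv` is not shown here, and `global_bibkeys` is a hypothetical name):

```python
from collections import Counter


def global_bibkeys(mapping: dict[str, list[str]], min_subsections: int = 4) -> set[str]:
    """Bibkeys mapped to at least `min_subsections` distinct subsections.

    The default of 4 mirrors `global_citation_min_subsections` in `query_defaults`.
    """
    counts = Counter()
    for keys in mapping.values():
        counts.update(set(keys))  # count each subsection at most once per key
    return {key for key, n in counts.items() if n >= min_subsections}
```

Keys below the threshold stay subsection-scoped, which is what keeps a few popular papers from being reused across unrelated H3s.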
## Stage 3 - Evidence (C3) [NO PROSE]
required_skills:
- pdf-text-extractor
- paper-notes
- subsection-briefs
- chapter-briefs
produces:
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
Notes:
- `queries.md` can set `evidence_mode: "abstract"|"fulltext"` (A150++ default: `abstract`).
- `queries.md` can set `draft_profile: "survey"|"deep"` to control writing gate strictness (A150++ default: `survey`).
- If `evidence_mode: "fulltext"`, `pdf-text-extractor` can be tuned via `fulltext_max_papers`, `fulltext_max_pages`, `fulltext_min_chars`.
- `subsection-briefs` converts each H3 into a verifiable writing card (scope_rule/rq/axes/clusters/paragraph_plan) so writing does not copy outline scaffolds.
- Optional refinement markers (recommended): treat briefs as *contracts*, not scaffolds. If you manually refine them and want to prevent regeneration, create:
- `outline/subsection_briefs.refined.ok`
- `outline/chapter_briefs.refined.ok`
These markers are used as explicit “reviewed/refined” signals and as a freeze switch (scripts won’t overwrite refined briefs).
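The freeze switch can be pictured as a check that a generator runs before writing. The sketch assumes the marker sits next to the artifact with its extension replaced by `.refined.ok`, as in the two examples above. The helper names are hypothetical and the actual scripts may resolve markers differently.

```python
from pathlib import Path


def marker_for(artifact: Path) -> Path:
    """Marker path for an artifact, e.g. subsection_briefs.jsonl -> subsection_briefs.refined.ok."""
    return artifact.with_suffix(".refined.ok")


def may_overwrite(artifact: Path) -> bool:
    """A generator should skip regeneration when the refinement marker exists."""
    return not marker_for(artifact).exists()
```

Creating the marker is then just `marker_for(path).touch()` after a manual review, and deleting it re-enables regeneration.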
## Stage 4 - Citations + evidence packs (C4) [NO PROSE]
required_skills:
- citation-verifier
- evidence-binder
- evidence-draft
- table-schema
- anchor-sheet
- table-filler
- appendix-table-writer
- schema-normalizer
- writer-context-pack
- evidence-selfloop
- claim-matrix-rewriter
optional_skills:
- survey-visuals
produces:
- citations/ref.bib
- citations/verified.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- output/SCHEMA_NORMALIZATION_REPORT.md
- outline/writer_context_packs.jsonl
- output/EVIDENCE_SELFLOOP_TODO.md
- outline/claim_evidence_matrix.md
Notes:
- `evidence-draft` turns paper notes into per-subsection evidence packs (claim candidates + concrete comparisons + eval protocol + limitations) that the writer must follow.
- `claim-matrix-rewriter` makes `outline/claim_evidence_matrix.md` a projection/index of evidence packs (not an outline expansion), so writer guidance stays evidence-first.
- `writer-context-pack` builds a deterministic per-H3 drafting pack (briefs + evidence + anchors + allowed cites), reducing hollow writing and making C5 more debuggable.
- Tables are part of the default survey deliverable, but split into two layers:
- `outline/tables_index.md` (internal index; produced by `table-filler`; useful for planning/debugging; NOT inserted into the paper)
- `outline/tables_appendix.md` (reader-facing; produced by `appendix-table-writer`; clean/publishable; inserted into the draft as an Appendix block by `section-merger`)
- Optional: `survey-visuals` can still produce timeline/figure specs as intermediate artifacts.
- Optional refinement markers (recommended): after you spot-check/refine C4 artifacts and want to freeze them, create:
- `outline/evidence_bindings.refined.ok`
- `outline/evidence_drafts.refined.ok`
- `outline/anchor_sheet.refined.ok`
- `outline/writer_context_packs.refined.ok`
These markers make “reviewed/refined” explicit and prevent accidental regeneration/overwrite; strict mode relies on content checks (placeholders/blocking_missing/scope), not marker presence.
## Stage 5 - Draft (C5) [PROSE AFTER C2]
required_skills:
- front-matter-writer
- chapter-lead-writer
- subsection-writer
- writer-selfloop
- section-logic-polisher
- argument-selfloop
- paragraph-curator
- style-harmonizer
- opener-variator
- evaluation-anchor-checker
- transition-weaver
- section-merger
- post-merge-voice-gate
- citation-diversifier
- citation-injector
- draft-polisher
- global-reviewer
- pipeline-auditor
- artifact-contract-auditor
optional_skills:
- prose-writer
- subsection-polisher
- redundancy-pruner
- terminology-normalizer
- limitation-weaver
- latex-scaffold
- latex-compile-qa
produces:
- sections/sections_manifest.jsonl
- sections/abstract.md
- sections/discussion.md
- sections/conclusion.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/MERGE_REPORT.md
- output/DRAFT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/GLOBAL_REVIEW.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
Notes:
- C5 writing system (semantic + minimal artifacts; no extra machinery):
- **Unit of work**: `sections/*.md` (front matter, H2 leads, H3 bodies). Avoid editing `output/DRAFT.md` directly until after merge.
- **Single source of truth (locked conventions)**: `output/ARGUMENT_SKELETON.md` → `## Consistency Contract` (terminology, scope boundary, evaluation protocol fields, baseline naming).
- **Write → check → fix (five gates)**:
1) `writer-selfloop` → `output/WRITER_SELFLOOP_TODO.md`: file existence, depth, citation scope, paper voice.
2) `section-logic-polisher` → `output/SECTION_LOGIC_REPORT.md`: paragraph linkage (no jump cuts / “paragraph islands”).
3) `argument-selfloop` → `output/ARGUMENT_SELFLOOP_TODO.md` + `output/ARGUMENT_SKELETON.md` + `output/SECTION_ARGUMENT_SUMMARIES.jsonl`: section-level closure + premise/definition stability.
4) `paragraph-curator` → `output/PARAGRAPH_CURATION_REPORT.md`: **select → evaluate → subset → fuse** so sections converge (reduce redundancy, strengthen synthesis) without changing citation keys.
5) `evaluation-anchor-checker` → `output/EVAL_ANCHOR_REPORT.md`: final section-level numeric hygiene sweep so surviving numeric claims keep same-sentence task/metric/constraint context before merge.
- **Openers-last**: draft the middle first; rewrite paragraph 1 last so it reflects real content (front matter + H3).
- Writing self-loop gate: `subsection-writer` ensures the full `sections/` file set exists (and emits `sections/sections_manifest.jsonl`); `writer-selfloop` blocks until depth/citation-scope/paper-voice checks pass, writing `output/WRITER_SELFLOOP_TODO.md` (PASS/FAIL).
- Argument self-loop gate: `argument-selfloop` blocks “smooth but hollow” sections by making the argument chain explicit (per-section paragraph moves + a global dependency skeleton). Its ledgers are intermediate artifacts and must never be merged into the paper.
- Style hygiene (C5 hard gate for `survey`/`deep`): treat `output/WRITER_SELFLOOP_TODO.md` Style Smells as mandatory fixes. Run `style-harmonizer` + `opener-variator` on flagged files, then rerun `writer-selfloop` before merge.
- Numeric hygiene gate: run `evaluation-anchor-checker` after the last section-level rewrites (`paragraph-curator` + `style-harmonizer` + `opener-variator`) and before merge. If later section rewrites touch the same H3 files, rerun it; do not wait for `pipeline-auditor` to discover underspecified numbers in the merged draft.
- Micro-fix routing (preferred over broad rewrites): if Style Smells are specific, use targeted micro-skills before a general harmonize pass:
- opener cadence / “overview” narration → `opener-variator`
- count-based limitation slots (“Two limitations…”) → `limitation-weaver`
- underspecified numeric/performance claims (missing task/metric/budget) → `evaluation-anchor-checker`
- Triage rule (prevents “patching evidence gaps with prose”): if `writer-selfloop` FAILs because a subsection cannot meet `must_use` *in-scope* (thin packs / missing anchors / out-of-scope citation pressure), stop and rerun the evidence loop (`evidence-selfloop` + upstream C2/C3/C4) instead of padding prose.
- WebWeaver-style “planner vs writer” split (single agent, two passes):
- Planner pass: for each section/subsection, pick the exact citation IDs to use from the evidence packs (`outline/evidence_drafts.jsonl`) and keep scope consistent with the outline.
- Writer pass: write that section using only those citation IDs; avoid dumping the whole notes set into context.
- Treat this stage as an iteration loop: draft per H3 → logic-polish (thesis + connectors) → weave transitions → merge → de-template/cohere → global review → (if gaps) back to C3/C4 → regenerate.
- Post-merge voice gate: `post-merge-voice-gate` treats `outline/transitions.md` as a high-frequency injection source. If it FAILs, fix the *source* (usually transitions via `transition-weaver`, or the owning `sections/*.md`) and re-merge; do not “patch around it” in `draft-polisher`.
- Depth target (profile-aware): each H3 should be “few but thick” (avoid stubs). Use `queries.md:draft_profile` as the contract:
- `survey`: >=10 paragraphs + >=12 unique cites
- `deep`: >=11 paragraphs + >=14 unique cites
In all profiles, require >=2 concrete contrasts + evaluation anchoring + a cross-paper synthesis paragraph + an explicit limitation.
- Profile semantics: `survey` is the default deliverable contract; `deep` is stricter (and typically pairs well with `evidence_mode: fulltext`).
- Coherence target (paper-like): for every H2 chapter with H3 subsections, write a short **chapter lead** block (`sections/S<sec_id>_lead.md`) that previews the comparison axes and how the H3s connect (no new headings; avoid generic glue).
- Anti-template style contract (paper-like, not “outline narration”):
- Avoid meta openers like “This subsection surveys/argues …” and slide-like navigation (“Next, we move from … / We now turn to …”).
- Keep signposting light: avoid repeating a literal opener label across many subsections (e.g., `Key takeaway:`); vary opener phrasing and cadence.
- Tone target: calm, academic, understated; delete hype words (`clearly`, `obviously`) and “PPT speaker notes”.
- Keep evidence-policy disclaimers **once** in front matter (not repeated across H3s).
- If you cite numbers, include minimal evaluation context (task + metric + constraint/budget/cost) in the same paragraph.
- Citation shape must be reader-facing: no adjacent citation blocks (e.g., `[@a] [@b]`), no duplicate keys in one block (e.g., `[@a; @a]`), and avoid tail-only citation style by keeping mid-sentence citations in each H3.
- `section-merger` merges `sections/*.md` plus `outline/transitions.md` (within-chapter H3→H3 by default). Between-H2 transition insertion is optional: create `outline/transitions.insert_h2.ok` in the workspace if you want narrator-style handoffs included.
- Tables are part of the default deliverable: `outline/tables_appendix.md` is inserted into the draft by `section-merger` as a single Appendix block (index tables in `outline/tables_index.md` remain intermediate) unless `outline/tables.insert.off` exists. Other visuals (`outline/timeline.md`, `outline/figures.md`) remain intermediate by default.
- Citation scope policy: citations are subsection-first (from `outline/evidence_bindings.jsonl`), with limited reuse allowed within the same H2 chapter to reduce brittleness; avoid cross-chapter “free cite” drift.
- Controlled flexibility: bibkeys mapped to >= `queries.md:global_citation_min_subsections` subsections (A150++ default: 4) are treated as cross-cutting/global; see `allowed_bibkeys_global` in writer packs / `sections_manifest.jsonl`.
- If global unique citations are low, run `citation-diversifier` → `citation-injector` *before* `draft-polisher` (the polisher treats citation keys as immutable).
- `queries.md` can set `citation_target: recommended|hard` to control whether the recommended target is enforced as blocking (default: `recommended` for A150++).
- If you intentionally add/remove citations after an earlier polish run, reset the citation-anchoring baseline before rerunning `draft-polisher`:
- delete `output/citation_anchors.prepolish.jsonl` (workspace-local), then rerun `draft-polisher`.
- Recommended skills (toolkit, not a rigid one-shot chain):
- Modular drafting: `subsection-writer` → `writer-selfloop` → `section-logic-polisher` → `argument-selfloop` → `paragraph-curator` → `style-harmonizer` → `opener-variator` → `evaluation-anchor-checker` → `transition-weaver` → `section-merger` → `draft-polisher` → `global-reviewer` → `pipeline-auditor`.
- Legacy one-shot drafting: `prose-writer` (kept for quick experiments; less debuggable).
- If the draft reads like “paragraph islands”, run `section-logic-polisher` and patch only failing `sections/S*.md` until PASS, then merge.
- Add `pipeline-auditor` after `global-reviewer` as a regression test (blocks on ellipsis, repeated boilerplate, and citation hygiene).
- If you also need a PDF deliverable, use `latex-scaffold` + `latex-compile-qa` (see `arxiv-survey-latex`).
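The citation-shape rules in the anti-template contract (no adjacent citation blocks, no duplicate keys in one block) lend themselves to a mechanical check. The sketch below is one hedged way to detect them. It is not the gate used by `draft-polisher` or `pipeline-auditor`, and the function name and return format are hypothetical.

```python
import re

CITE_BLOCK = r"\[[^\[\]]*@[^\[\]]*\]"
# Two citation blocks separated only by whitespace, e.g. "[@a] [@b]".
ADJACENT_RE = re.compile(CITE_BLOCK + r"\s*" + CITE_BLOCK)
BLOCK_RE = re.compile(r"\[([^\[\]]*@[^\[\]]*)\]")
KEY_RE = re.compile(r"@([\w:-]+)")


def citation_shape_issues(text: str) -> list[tuple[str, str]]:
    """Return (issue_kind, offending_text) pairs for reader-facing citation problems."""
    issues = []
    for match in ADJACENT_RE.finditer(text):
        issues.append(("adjacent_blocks", match.group(0)))
    for match in BLOCK_RE.finditer(text):
        keys = KEY_RE.findall(match.group(1))
        if len(keys) != len(set(keys)):
            issues.append(("duplicate_keys", match.group(0)))
    return issues
```

The usual fix for an adjacent pair is to merge it into one block (`[@a; @b]`) or to move one citation next to the specific work it supports.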
## Quality gates (strict mode)
- Citation coverage: expect a large, verifiable bibliography (A150++ default: `core_size=300` → `ref.bib` ~300) and high cite density:
- Per-H3: `survey` profile expects >=12 unique citations per H3 (`deep` expects >=14).
- Front matter: `survey` profile expects Introduction>=35 and Related Work>=50 unique citations (dense positioning; no cite dumps).
- Global: `pipeline-auditor` gates on **global unique citations across the full draft**. A150++ defaults: hard `>=150`; recommended `>=165` (when bib=300). `queries.md:citation_target` controls which is blocking (default: `recommended`). If it fails, prefer `citation-diversifier` → `citation-injector` (in-scope, NO NEW FACTS) using each H3’s `allowed_bibkeys_selected` / `allowed_bibkeys_mapped` from `outline/writer_context_packs.jsonl`.
- Anti-template: drafts containing ellipsis placeholders (`…`) or leaked scaffold instructions (e.g., "enumerate 2-4 ...") should block and be regenerated from improved outline/mapping/evidence artifacts.
- Final polish hard gates (`survey`/`deep`): block on narration-template openers (e.g., `This subsection ...`), slide navigation phrasing, repeated opener stems / verbal tics, adjacent citation blocks (`[@a] [@b]`), duplicate keys in one block (`[@a; @a]`), and low H3 mid-sentence citation ratio (<30%).
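The two citation-hygiene patterns and the global unique-citation gate listed above can be sketched as follows. This is a minimal illustration, not the `pipeline-auditor` implementation: the function names are hypothetical, it assumes Pandoc-style `[@key]` blocks with plain keys separated by `;` (no locators), and it only reuses the A150++ thresholds quoted above as default arguments.

```python
import re

# One citation block such as [@a] or [@a; @b] (assumed Pandoc-style syntax).
CITE_BLOCK = re.compile(r"\[(@[^\[\]]+)\]")
# Two citation blocks separated only by whitespace, e.g. "[@a] [@b]".
ADJACENT_BLOCKS = re.compile(r"\[@[^\[\]]+\]\s*\[@[^\[\]]+\]")


def keys_in_block(block_body: str) -> list[str]:
    """Return the citation keys inside one block body like '@a; @b'."""
    return [part.strip().lstrip("@") for part in block_body.split(";") if part.strip()]


def citation_report(text: str, hard_min: int = 150, recommended_min: int = 165,
                    citation_target: str = "recommended") -> dict:
    """Collect unique keys and flag adjacent blocks and in-block duplicates."""
    unique_keys = set()
    duplicate_blocks = []
    for match in CITE_BLOCK.finditer(text):
        keys = keys_in_block(match.group(1))
        unique_keys.update(keys)
        if len(keys) != len(set(keys)):
            duplicate_blocks.append(match.group(0))
    # Hypothetical reading of `citation_target`: it selects the blocking threshold.
    threshold = hard_min if citation_target == "hard" else recommended_min
    return {
        "unique_count": len(unique_keys),
        "adjacent_blocks": ADJACENT_BLOCKS.findall(text),
        "duplicate_blocks": duplicate_blocks,
        "passes_global_gate": len(unique_keys) >= threshold,
    }
```

Running `citation_report(open("output/DRAFT.md").read())` before `draft-polisher` would give a quick local signal; the real gate remains whatever `pipeline-auditor` enforces.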
FILE:pipelines/graduate-paper-pipeline.md
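# Pipeline: Chinese graduate thesis restructuring and finalization (research-stage draft)
> Status: research-stage draft
> Current positioning: study the workflow and the skills it needs first; **do not bind UNITS yet, and do not force it into an executable contract**
> Audience: Chinese graduate thesis writing that starts from an existing university template, prior papers, Overleaf sources, PDFs, BibTeX, experiment figures and tables, and intermediate working documents
> Core goal: restructure the "existing materials" into a Chinese graduate thesis with a unified throughline, smooth structure, complete evidence, consistent terminology, and a compilable deliverable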
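## 1. The nature of this Pipeline
This is neither a linear "write a thesis from scratch" flow nor an assembly flow that "stitches a few papers together". It is a thesis-engineering Pipeline that is **driven by a question list, centered on a Markdown intermediate layer, and closed out in a TeX delivery layer**.
Its real main loop is:
> `question list -> semantic localization -> Markdown restructuring -> TeX write-back -> compile and review -> write issues back -> enter the next round`
The first goal of this Pipeline is therefore not to "get a draft of the body text out quickly", but to get the following right first:
1. Recover the existing materials first, rather than editing blindly in `tex`.
2. Restructure chapter roles around the thesis throughline first, rather than copying the narrative of the original papers.
3. Work out structure, evidence, terminology, and figures in the Markdown intermediate layer first, then write back to TeX.
4. Resolve structure and evidence problems first, then style and AI-tone removal.
5. The real convergence point is not "one full pass is written" but "the question list is cleared round by round, and compilation and review pass consistently".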
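## 2. The fundamental difference between this Pipeline and the survey Pipeline
It differs substantially from `arxiv-survey` / `arxiv-survey-latex` and must be modeled separately:
- the survey pipeline starts from "retrieve literature by topic and converge the structure layer by layer";
- the graduate-paper pipeline starts from "inventory and restructure existing materials";
- the core intermediate artifacts of the survey pipeline are `outline/evidence_drafts/...`;
- the core intermediate artifacts of the graduate-paper pipeline are the outline, question list, chapter working drafts, figure plan, and review records in `codex_md/`;
- the main goal of the survey pipeline is to "generate a survey";
- the main goal of the graduate-paper pipeline is to "restructure existing research work into a finalized thesis that meets Chinese degree-thesis conventions".
In one sentence:
> survey is closer to a "retrieval-driven evidence-writing flow";
> graduate-paper is closer to an "existing-material-driven thesis-engineering restructuring flow".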
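## 3. Workspace layers and key artifacts
This Pipeline needs a clear separation between the **thinking layer** and the **delivery layer**.
### 3.1 Directory responsibilities
- `main.tex`: thesis entry point; assembles the cover, abstract, table of contents, body, appendices, acknowledgements, achievements, and references.
- `chapters/`: final delivery version of the body `.tex`.
- `abstract/`, `preface/`, `acknowledgement/`, `achievements/`: formally delivered non-body parts.
- `references/`: bibliography database and style files.
- `pdf/`: PDFs of published or submitted papers.
- `Overleaf_ref/`: existing sources, revised manuscripts, supplementary materials.
- `codex_md/`: intermediate working layer; this is the **thesis restructuring workspace**.
- `claude_md/`: review checklists and final-draft verification records.
- `tmp_layout/`, `tmp_layout2/`: trial figure layouts and page-layout rehearsals.
The most important distinction is:
- `codex_md/` is the **thinking / restructuring area**
- `chapters/` is the **delivery / finalization area**
Do not mix the two.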
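### 3.2 Recommended fixed intermediate artifacts
It is recommended to keep the following artifacts fixed for the long term in this Pipeline:
- `codex_md/material_index.md`: inventory and index of existing materials
- `codex_md/material_readiness.md`: material readiness and gap notes
- `codex_md/missing_info.md`: list of missing information
- `codex_md/question_list.md`: this round's question list, priorities, and acceptance criteria
- `codex_md/00_thesis_outline.md`: thesis throughline, outline, and chapter roles
- `codex_md/chapter_role_map.md`: mapping table of source material -> chapter -> role
- `codex_md/chapter_rewrite_rules.md`: chapter restructuring rules table
- `codex_md/terminology_glossary.md`: unified terminology table
- `codex_md/symbol_metric_table.md`: unified symbol / metric table
- `codex_md/figure_plan.md`: figure and layout plan
- `claude_md/review_checklist.md`: review checklist for compilation, typesetting, data, and template items
- `output/THESIS_BUILD_REPORT.md`: compilation and final-draft check report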
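## 4. The skills this Pipeline actually needs
Setting reuse and generalization aside for now, and splitting by what this thesis Pipeline itself needs, the first version recommends the following skills.
### 4.1 Main-chain skills
1. `thesis-workspace-init`
Responsibility: initialize the thesis project, remind the user where to place materials, set up the workspace, check whether `main.tex` compiles, and generate the material inventory.
2. `thesis-question-list`
Responsibility: create and maintain `question_list.md`, fixing each round's revision goals, issue priorities, boundaries, and acceptance criteria.
This is the **control-plane skill** of the whole Pipeline.
3. `thesis-source-role-mapper`
Responsibility: map existing papers / templates / sources / PDFs / figure materials to "thesis roles", not just `paper -> chapter`.
4. `thesis-chapter-reconstructor`
Responsibility: restructure each chapter's goal, weight, handoffs, and narrative approach around the thesis throughline.
This is the **core restructuring skill** of the whole Pipeline.
5. `thesis-markdown-aligner`
Responsibility: unify the throughline, terminology, symbols, metrics, and figure conventions, so that the chapters converge in the Markdown intermediate layer into one thesis rather than a set of papers.
6. `thesis-tex-writeback`
Responsibility: write content that has been sorted out in the Markdown layer back to `chapters/*.tex`, and synchronize figures, equations, cross-references, and chapter handoffs.
7. `thesis-compile-review`
Responsibility: run compilation, warning triage, template-mode checks, data-consistency verification, and issue write-back.
It forms the final quality loop.
8. `thesis-style-polisher`
Responsibility: once structure, evidence, and data are stable, polish toward Chinese degree-thesis style and remove AI tone.
### 4.2 Parallel side-branch skills
9. `thesis-visual-layout-planner`
Responsibility: figure planning, Mermaid sketches, temporary page composition, and text-figure rhythm checks.
10. `thesis-frontmatter-sync`
Responsibility: synchronize the abstract, cover, appendices, achievements, acknowledgements, lists, and other non-body parts, so they are not left to be filled in at the very end.
11. `thesis-citation-enhance-review`
Responsibility: locate sentences that must be supported by citations, expand candidate references, verify that citations match claims, and write back to `references/*.bib` and in-text citations.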
### 4.3 How these 11 skills relate
The main chain is:
`thesis-workspace-init -> thesis-question-list -> thesis-source-role-mapper -> thesis-chapter-reconstructor -> thesis-markdown-aligner -> thesis-tex-writeback -> thesis-compile-review -> thesis-style-polisher`
The parallel side branches are:
- `thesis-visual-layout-planner`
- `thesis-frontmatter-sync`
- `thesis-citation-enhance-review`
Among these:
- `thesis-question-list` is the control plane for the whole flow
- `thesis-chapter-reconstructor` is the core restructuring plane
- `thesis-compile-review` is the quality-loop plane
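## 5. Pipeline overview (by stage)
| Stage | Goal | Core skills | Main outputs |
|---|---|---|---|
| Stage 0 | Project initialization and material intake | `thesis-workspace-init` | Directory skeleton, initial compile, material index, material readiness notes |
| Stage 1 | Recover existing materials into the Markdown intermediate layer | `thesis-workspace-init` | Initial Markdown materials, missing-information list, material gap notes |
| Stage 1.5 | Lock this round's questions and boundaries | `thesis-question-list` | `question_list.md` |
| Stage 2 | Map source materials to thesis roles | `thesis-source-role-mapper` | Chapter role map, materials grouped by chapter |
| Stage 2.5 | Restructure chapters around the throughline | `thesis-chapter-reconstructor` | Restructured chapter Markdown |
| Stage 3 | Align, merge, and unify the whole-thesis structure | `thesis-markdown-aligner` | Stable outline, terminology table, symbol table, evidence-gap list |
| Stage 3.5 | Figure and layout rehearsal | `thesis-visual-layout-planner` | Figure plan, text-figure mapping, temporary layout results |
| Stage 4 | Write back to the TeX delivery layer | `thesis-tex-writeback` | Compilable chapter `.tex`, first complete `main.pdf` |
| Stage 4.5 | Synchronize non-body parts | `thesis-frontmatter-sync` | Synchronized abstract, appendices, cover, achievements, etc. |
| Stage 5 | Citation enhancement and verification | `thesis-citation-enhance-review` | Citation reinforcement results, verification records |
| Stage 6 | Compilation, typesetting, and final-draft review | `thesis-compile-review` | `THESIS_BUILD_REPORT`, review checklist |
| Stage 7 | Final polish and AI-tone removal | `thesis-style-polisher` | More natural Chinese final draft |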
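## 6. Stage-by-stage details
### Stage 0: Project initialization and material intake
**Goal**
Turn the thesis into a maintainable project first, rather than a pile of scattered files.
**Main inputs**
- University template
- Existing repository / sources
- Basic metadata such as student ID, year, and Chinese and English titles
- Existing `main.tex`
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Basic directory skeleton
- Initial `main.tex` / initial `main.pdf`
- `codex_md/material_index.md`
- `codex_md/material_readiness.md`
**Handoffs**
- If `main.tex` does not compile yet, do not enter the body restructuring stage
- If materials are not yet in place, do not start the question list or chapter restructuring
**Execution focus**
- Make clear which areas are for materials, which are for work, and which are for delivery
- Ensure it "compiles" first, then worry about "well written"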
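### Stage 1: Recover existing materials into the Markdown intermediate layer
**Goal**
Recover the reusable content from the existing `template / tex / Overleaf / PDF / bib` into a Markdown intermediate layer, rather than forcing edits directly in the TeX layer.
**Main inputs**
- University template
- Existing `.tex`
- Overleaf sources
- PDFs
- Bibliography database
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Initial Markdown chapter materials
- `codex_md/material_readiness.md`
- `codex_md/missing_info.md`
- Material index and list of information still to be supplied
**Handoffs**
- Stage 1 outputs feed directly into `thesis-question-list` and `thesis-source-role-mapper`
**Execution focus**
- The focus is "inventorying material assets", not "writing polished sentences first"
- Explicitly mark experiment details, figure captions, citations, and number checks that still need to be supplied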
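### Stage 1.5: Lock the question list and this round's goals
**Goal**
Establish this round's control plane and make clear what this round fixes and what it does not.
**Main inputs**
- Initial Markdown materials
- Feedback from the previous review round
- Current compilation / structure / style issues
**Core skill**
- `thesis-question-list`
**Main outputs**
- `codex_md/question_list.md`
**Handoffs**
- This step is the starting point for all later restructuring and write-back
- Any structural dispute should come back here first to be re-prioritized
**Execution focus**
- Each question must state clearly: what it is, why it matters, how to change it, and what level of acceptance is required
- The question list is not a memo; it is the entry point for iteration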
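### Stage 2: Map source materials to thesis roles
**Goal**
Map existing materials to chapter roles in the thesis, rather than mechanically assigning "paper to chapter".
**Main inputs**
- Existing papers, PDFs, sources, figures
- Initial Markdown materials
- `question_list.md`
**Core skill**
- `thesis-source-role-mapper`
**Main outputs**
- `codex_md/chapter_role_map.md`
- Markdown materials grouped by chapter
- Content boundaries for each chapter
**Handoffs**
- This is the direct input to `thesis-chapter-reconstructor`
**Execution focus**
- Distinguish: method core / background support / validation evidence / system implementation / material for limitations and future work
- If this is set wrong, the whole thesis will be skewed afterward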
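### Stage 2.5: Restructure chapters around the throughline
**Goal**
Convert the narrative of the original papers into a thesis-style narrative.
**Main inputs**
- Chapter role map
- `question_list.md`
- Markdown drafts of each chapter
**Core skill**
- `thesis-chapter-reconstructor`
**Main outputs**
- Restructured chapter Markdown
- `codex_md/chapter_rewrite_rules.md`
**Handoffs**
- This is the core stage of the whole Pipeline
- All subsequent alignment, write-back, and compilation build on the restructuring done here
**Execution focus**
- Make clear what question each chapter must answer
- Decide which content to strengthen, weaken, move up, or push down
- Rewrite the handoffs to the preceding and following chapters
- Change the "selling-point narrative" into a "thesis throughline narrative"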
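### Stage 3: Markdown alignment, merging, and unification
**Goal**
Make the whole thesis become "one thesis" at the intermediate layer first, rather than a concatenation of several papers.
**Main inputs**
- Restructured chapter Markdown
- `question_list.md`
**Core skill**
- `thesis-markdown-aligner`
**Main outputs**
- Stable `codex_md/00_thesis_outline.md`
- `codex_md/terminology_glossary.md`
- `codex_md/symbol_metric_table.md`
- Evidence-gap list
**Handoffs**
- While structure, terminology, or metrics are unstable, do not enter `tex-writeback`
**Execution focus**
- Unify the research throughline, terminology, abbreviations, symbols, metrics, and figure conventions
- Decide which content is explained once in Chapter 2, which is developed in Chapters 3-5, and which goes to the appendices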
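### Stage 3.5: Figure and layout rehearsal
**Goal**
Plan figures and layout before the body is finalized; do not let figures turn into last-minute patchwork.
**Main inputs**
- Stable outline
- Chapter Markdown
- Existing figures and experimental results
**Core skill**
- `thesis-visual-layout-planner`
**Main outputs**
- `codex_md/figure_plan.md`
- `mermaid/` diagram drafts
- `tmp_layout/`, `tmp_layout2/` trial layout results
**Handoffs**
- The figure planning results directly affect the text-figure structure of `tex-writeback`
**Execution focus**
- Make clear which figures and tables each chapter needs to support the throughline
- Sketch first, then do temporary page composition, then decide final placement and captions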
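### Stage 4: Write back to the TeX delivery layer
**Goal**
Write content that has been sorted out in the Markdown layer back to `chapters/*.tex` and move into the delivery layer.
**Main inputs**
- Stable Markdown chapters
- Figure plan
- Unified terminology / symbol / metric tables
**Core skill**
- `thesis-tex-writeback`
**Main outputs**
- `chapters/*.tex`
- First complete `main.pdf`
**Handoffs**
- If structural problems are found at this stage, in principle go back and fix them in the Markdown layer first; do not reinvent structure in the TeX layer
**Execution focus**
- Synchronize figures, equations, cross-references, chapter introductions, and chapter summaries
- The TeX layer is responsible for delivery, not for rethinking structure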
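### Stage 4.5: Synchronize non-body parts
**Goal**
Maintain the abstract, appendices, cover, achievements, acknowledgements, and other non-body parts in parallel, so they are not pushed to the end and filled in at low quality.
**Main inputs**
- Stable body version
- University template items
- Chinese and English titles and abstract information
**Core skill**
- `thesis-frontmatter-sync`
**Main outputs**
- `abstract/abstract.tex`
- `abstract/abstract-en.tex`
- `preface/...`
- `appendix/...`
- `acknowledgement/...`
- `achievements/...`
**Handoffs**
- The final-draft review in Stage 6 checks these parts together with the body
**Execution focus**
- Chinese and English titles and abstracts are consistent with each other
- Numbers, metrics, and method names in the abstract match the body
- Appendices hold supplementary content only and do not repeat the throughline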
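### Stage 5: Literature enhancement and citation verification
**Goal**
Systematically fill in the "sentences that should be supported by citations" and ensure each citation matches its claim.
**Main inputs**
- Current Markdown / TeX version
- `references/*.bib`
- Existing PDFs / reference works
**Core skill**
- `thesis-citation-enhance-review`
**Main outputs**
- Enhanced bibliography database
- Citation reinforcement records
- Citation verification records
**Handoffs**
- Literature enhancement can run in parallel with Stages 3 and 4, but it must be closed out before the final polish
**Execution focus**
- Prioritize sentences that must have citation support: factual descriptions, method taxonomies, classic results, metric definitions, and data sources
- Get citations right first, then add more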
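### Stage 6: Compilation, typesetting, and final-draft review
**Goal**
Close the quality loop: not just "it compiles" but "it is fit for submission".
**Main inputs**
- Complete TeX project
- Synchronized non-body parts
- review checklist
**Core skill**
- `thesis-compile-review`
**Main outputs**
- `output/THESIS_BUILD_REPORT.md`
- `claude_md/review_checklist.md`
- List of closed / still-open issues
**Handoffs**
- Issues found in Stage 6 determine which level to return to for the fix: Stage 1.5 / 2.5 / 3 / 4
**Execution focus**
- A successful compile is only the minimum requirement
- Warnings must be triaged: blocks submission / must fix / record and observe
- Also check: data consistency, cite key validity, whether template parameters are correct, and whether terminology and abbreviations are unified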
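One of the Stage 6 checks, cite key validity, lends itself to a small script. The following is a rough sketch under stated assumptions, not part of `thesis-compile-review`: the function names are hypothetical, the regular expressions only cover common `\cite`-family commands and ordinary `@type{key,` BibTeX entries, and real projects may use citation macros or bibliography formats it does not recognize.

```python
import re

# \cite, \citep, \citet*, ... with optional [..] arguments, then {key1, key2}.
CITE_CMD = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# BibTeX entry header such as "@article{smith2020," (key ends at the first comma).
BIB_ENTRY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def cited_keys(tex_source: str) -> set[str]:
    """Collect every key used in a \\cite-style command in the TeX source."""
    keys = set()
    for match in CITE_CMD.finditer(tex_source):
        keys.update(k.strip() for k in match.group(1).split(",") if k.strip())
    return keys


def bib_keys(bib_source: str) -> set[str]:
    """Collect every entry key declared in the BibTeX source."""
    return set(BIB_ENTRY.findall(bib_source))


def missing_cite_keys(tex_source: str, bib_source: str) -> list[str]:
    """Keys cited in TeX that have no matching BibTeX entry, sorted."""
    return sorted(cited_keys(tex_source) - bib_keys(bib_source))
```

In practice the inputs would be the concatenated text of `chapters/*.tex` and `references/*.bib`; a non-empty result is a candidate for the "must fix" bucket before submission.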
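### Stage 7: Final polish and AI-tone removal
**Goal**
Polish toward Chinese degree-thesis style only after structure, evidence, and data are stable.
**Main inputs**
- Thesis version that has passed compilation and review
- `中文写作要求.md` (Chinese writing requirements)
- `GPT口癖与高频用词调研.md` (survey of GPT verbal tics and high-frequency wording)
**Core skill**
- `thesis-style-polisher`
**Main outputs**
- More natural Chinese final draft
- De-templated introductions, chapter summaries, and conclusion and outlook
**Handoffs**
- Stage 7 must come last
- If polishing exposes structure or evidence problems, fall back to an earlier stage rather than polishing over them
**Execution focus**
- Prioritize chapter introductions, chapter summaries, contribution statements, and the conclusion and outlook
- Tone down template voice, promotional voice, and common AI verbal tics
- The goal is "reads more like a degree thesis", not "reads more like an AI-optimized paper"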
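## 7. The four fixed loops of this Pipeline
To keep it from degrading into a "linear SOP" again, it is recommended to fix the following four loops:
### 7.1 Structure loop
`outline -> a chapter md -> chapter imbalance found -> back to outline / question_list`
### 7.2 Content loop
`paper / Overleaf / PDF -> extract content -> restructure narrative -> still reads like the original paper -> restructure again`
### 7.3 Typesetting loop
`tex -> compile -> warning / figure placement / cross-reference issues found -> back to tex or figure planning`
### 7.4 Style loop
`body stable -> AI polish -> template voice / AI tone found -> back to writing guidelines and wording control`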
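## 8. Execution priorities for this Pipeline
If resources are limited in an actual run, the priority order should be:
1. `thesis-question-list`
2. `thesis-source-role-mapper`
3. `thesis-chapter-reconstructor`
4. `thesis-markdown-aligner`
5. `thesis-tex-writeback`
6. `thesis-compile-review`
In other words:
- What truly cannot be skipped is "question list + material role mapping + chapter restructuring + compile review"
- What truly should not be done earliest is "polish and AI-tone removal"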
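## 9. Current conclusion
The first principle of this graduate-paper pipeline should be:
> **Restructure first, then write; converge in the intermediate layer first, then deliver in TeX; get evidence and structure right first, then style and polish.**
Therefore, if this Pipeline is later made formally executable, the highest priority is not "one large all-in-one runner" but getting the following skills stable first:
- `thesis-workspace-init`
- `thesis-question-list`
- `thesis-source-role-mapper`
- `thesis-chapter-reconstructor`
- `thesis-compile-review`
Once these five skills are stable, this Chinese graduate thesis Pipeline will have real practical value.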
FILE:pipelines/idea-brainstorm.pipeline.md
---
name: idea-brainstorm
version: 4.0
profile: idea-brainstorm
routing_hints: [idea, ideation, brainstorm, 点子, 选题, 找方向, 找 idea]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.idea-brainstorm.csv
contract_model: pipeline.frontmatter/v1
query_defaults:
draft_profile: idea_brainstorm
max_results: 1800
core_size: 100
evidence_mode: abstract
direction_pool_min: 12
direction_pool_max: 24
idea_screen_top_n: 10
idea_shortlist_size: 5
report_top_n: 3
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- evidence_mode
- direction_pool_min
- direction_pool_max
- idea_screen_top_n
- idea_shortlist_size
- report_top_n
- time_window.from
- time_window.to
quality_contract:
memo_bundle:
required_outputs: [output/REPORT.md, output/APPENDIX.md, output/REPORT.json]
signal_policy:
min_rows: 10
direction_policy:
shortlist_min: 3
shortlist_max: 5
keep_min: 3
cluster_diversity_min: 2
lead_diversity_target: 3
lead_diversity_axes: [cluster, direction_type, program_kind]
screening_policy:
keep_rank_max: 7
maybe_rank_max: 12
score_weights:
discussion_worthiness: 0.24
academic_value: 0.22
evidence_grounding: 0.18
direction_distinctness: 0.16
first_probe_clarity: 0.10
thesis_potential: 0.10
loop_policy:
stage_retry_budget:
C1: 2
C3: 1
C4: 1
max_reroutes: 4
require_human_on_retry_after_approval: true
stages:
C0:
title: Init + idea brief
mode: no_prose
required_skills: [workspace-init, pipeline-router, idea-brief, human-checkpoint]
optional_skills: []
produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/trace/IDEA_BRIEF.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
human_checkpoint:
approve: brainstorm brief
write_to: DECISIONS.md
C1:
title: Retrieval + core set
mode: no_prose
required_skills: [literature-engineer, dedupe-rank]
optional_skills: []
produces: [papers/papers_raw.jsonl, papers/papers_dedup.jsonl, papers/core_set.csv, papers/retrieval_report.md]
C2:
title: Idea landscape / focus
mode: no_prose
required_skills: [taxonomy-builder, pipeline-router, human-checkpoint]
optional_skills: []
produces: [outline/taxonomy.yml, DECISIONS.md]
human_checkpoint:
approve: focus clusters / lenses + exclusions
write_to: DECISIONS.md
C3:
title: Evidence signals
mode: no_prose
required_skills: [paper-notes, idea-signal-mapper]
optional_skills: []
produces: [papers/paper_notes.jsonl, papers/evidence_bank.jsonl, output/trace/IDEA_SIGNAL_TABLE.md, output/trace/IDEA_SIGNAL_TABLE.jsonl]
C4:
title: Direction pool + screening
mode: short_prose_ok
required_skills: [idea-direction-generator, idea-screener]
optional_skills: []
produces: [output/trace/IDEA_DIRECTION_POOL.md, output/trace/IDEA_DIRECTION_POOL.jsonl, output/trace/IDEA_SCREENING_TABLE.md, output/trace/IDEA_SCREENING_TABLE.jsonl]
C5:
title: Shortlist + memo synthesis + self-loop
mode: prose_allowed
required_skills: [idea-shortlist-curator, idea-memo-writer, deliverable-selfloop, artifact-contract-auditor]
optional_skills: []
produces: [output/trace/IDEA_SHORTLIST.md, output/trace/IDEA_SHORTLIST.jsonl, output/REPORT.md, output/APPENDIX.md, output/REPORT.json, output/DELIVERABLE_SELFLOOP_TODO.md, output/CONTRACT_REPORT.md]
---
# Pipeline: research idea brainstorm (signals -> directions -> memo)
Goal: produce a **discussion-ready research-idea memo** for PI / PhD readers.
This pipeline is for “找 research idea / brainstorm / 选题 / 找方向”.
It is **not** for writing a survey draft and **not** for generating execution-grade project specs.
The terminal deliverable is a single default entrypoint:
- `output/REPORT.md`
That memo should feel like a mature brainstorm artifact:
- grounded in literature,
- small enough to discuss,
- clear about what is promising,
- honest about uncertainty,
- and rich enough to guide the next discussion round.
Artifact policy:
- Reader-facing terminal artifacts:
- `output/REPORT.md`
- `output/APPENDIX.md`
- `output/REPORT.json`
- Trace artifacts stay under `output/trace/`.
- The run is only shareable when both layers exist and pass contract audit.
Default profile:
- Retrieval route: `literature-engineer`
- `core_size=100`
- Evidence mode: `abstract`
- Signal table: 10-20 rows
- Direction pool: 12-24
- Shortlist: 3-5
- Final memo lead directions: 3
## Stage 0 - Init + idea brief (C0)
required_skills:
- workspace-init
- pipeline-router
- idea-brief
- human-checkpoint
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
Notes:
- The brief is the single source of truth for topic, audience, constraints, exclusions, targets, and query buckets.
- Default behavior: block once at C0 for human approval of the brief before retrieval.
## Stage 1 - Retrieval + core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
Notes:
- Use multi-query buckets from `queries.md`.
- If recall is too low or the results are too noisy, fix it upstream and rerun C1.
## Stage 2 - Idea landscape / focus (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- pipeline-router
- human-checkpoint
produces:
- outline/taxonomy.yml
- DECISIONS.md
human_checkpoint:
- approve: focus clusters / lenses + exclusions
- write_to: DECISIONS.md
Notes:
- Taxonomy is an idea landscape, not a paper outline.
- Default behavior: pause at C2 so the human can choose a few promising focus lenses.
## Stage 3 - Evidence signals (C3) [table-first]
required_skills:
- paper-notes
- idea-signal-mapper
produces:
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
Notes:
- The signal table should capture tensions, missing pieces, and academically meaningful axes rather than proposal-ready wedges.
- Keep this stage compact and table-first; no long prose.
## Stage 4 - Direction pool + screening (C4)
required_skills:
- idea-direction-generator
- idea-screener
produces:
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
Notes:
- Expansion should generate discussion-worthy research directions, not an operator cartesian product.
- Screening should score discussion value, distinctness, evidence grounding, and thesis potential.
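The screening step can be read as a weighted sum over the axes listed under `screening_policy.score_weights` in the front matter. The sketch below is only one possible reading, not the `idea-screener` implementation: the function names are hypothetical, ratings are assumed to be numbers on a shared scale (for example 0 to 1), and the weights are copied from the front matter above.

```python
# Weights copied from `quality_contract.screening_policy.score_weights`.
SCORE_WEIGHTS = {
    "discussion_worthiness": 0.24,
    "academic_value": 0.22,
    "evidence_grounding": 0.18,
    "direction_distinctness": 0.16,
    "first_probe_clarity": 0.10,
    "thesis_potential": 0.10,
}


def screening_score(ratings: dict[str, float]) -> float:
    """Weighted sum of per-axis ratings; every axis in SCORE_WEIGHTS is required."""
    missing = sorted(set(SCORE_WEIGHTS) - set(ratings))
    if missing:
        raise ValueError(f"missing ratings: {missing}")
    return sum(SCORE_WEIGHTS[axis] * ratings[axis] for axis in SCORE_WEIGHTS)


def rank_directions(pool: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Score every direction in the pool and sort from highest to lowest score."""
    scored = [(name, round(screening_score(r), 4)) for name, r in pool.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Because the weights sum to 1.0, the score stays on the same scale as the ratings, which keeps rows of the screening table easy to compare. How ranks map to keep / maybe decisions (`keep_rank_max`, `maybe_rank_max`) is left out here, since the source does not spell out that rule.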
## Stage 5 - Shortlist + memo synthesis + self-loop (C5)
required_skills:
- idea-shortlist-curator
- idea-memo-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- `output/trace/IDEA_SHORTLIST.md` is an internal convergence layer, not the user-facing final answer.
- `output/REPORT.md` is the terminal deliverable and should read like a discussion-ready research idea brainstorm memo.
- `deliverable-selfloop` should evaluate the full memo bundle plus its trace chain.
FILE:pipelines/lit-snapshot.pipeline.md
---
name: lit-snapshot
version: 2.3
profile: lit-snapshot
routing_hints: [snapshot, 快照, one-page, one page, 48h]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- outline/taxonomy.yml
- outline/outline.yml
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.lit-snapshot.csv
---
# Pipeline: literature snapshot (24-48h) [bullets-first]
Goal: a compact, reader-facing snapshot (`output/SNAPSHOT.md`) with a small but usable paper set and a paper-like structure. This is intentionally lighter than `arxiv-survey*` (no evidence packs, no BibTeX, no LaTeX).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- arxiv-search
- dedupe-rank
optional_skills:
- keyword-expansion
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Snapshot default: aim for a smaller but diverse set (e.g., core_set ~20-40) rather than a survey-scale pool.
- Practical retrieval loop (multi-query, then refine):
- Start broad with multiple query buckets (synonyms/acronyms/subtopics) and merge the results.
- If too few: add buckets + widen time window; if too noisy: rewrite keywords + add exclusions; rerun C1.
- If the snapshot feels generic, the fix is almost always upstream: broaden `queries.md` (more synonyms, fewer excludes, wider time window) and rerun C1.
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- outline-builder
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/outline.yml
human_checkpoint:
- approve: scope + outline (snapshot ToC)
- write_to: DECISIONS.md
Notes:
- Paper-like default: prefer fewer, thicker sections. For snapshots, avoid H3 explosion; treat H2 as the main organizing unit.
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent nodes and keep the ToC readable before writing.
## Stage 3 - Snapshot (C3) [SHORT PROSE OK]
required_skills:
- snapshot-writer
- deliverable-selfloop
- artifact-contract-auditor
optional_skills:
- prose-writer
produces:
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- The deliverable is bullets-first and pointer-heavy: every non-trivial claim should attach 1-2 concrete paper pointers from `papers/core_set.csv`.
- Avoid outline narration (e.g., `This section surveys ...`); write content claims + why-it-matters + pointers.
FILE:pipelines/peer-review.pipeline.md
---
name: peer-review
version: 2.3
profile: peer-review
routing_hints: [peer review, review report, referee, 审稿]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PAPER.md
- output/CLAIMS.md
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.peer-review.csv
---
# Pipeline: peer review / referee report
Goal: produce an actionable referee report (`output/REVIEW.md`) that is traceable to the submitted paper’s claims and evidence (no free-floating advice, no invented comparisons).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
## Stage 1 - Claims (C1)
required_skills:
- manuscript-ingest
- claims-extractor
produces:
- output/PAPER.md
- output/CLAIMS.md
Notes:
- Input expectation: you must have the manuscript text available as `output/PAPER.md` (or equivalent extracted text) before running this stage.
- Contract: every claim must include a source pointer (section/page/quote) so later critique is auditable.
## Stage 2 - Evidence audit (C2)
required_skills:
- evidence-auditor
- novelty-matrix
produces:
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
Notes:
- Evidence-auditor should write gaps/risks/verification steps, not “fix the paper”.
- Novelty-matrix should be conservative: compare against cited/known related work; if you lack sources, record that limitation explicitly.
## Stage 3 - Rubric write-up (C3)
required_skills:
- rubric-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Prefer concrete, minimal fixes (what experiment/ablation/analysis would resolve the concern) over generic “needs more experiments”.
FILE:pipelines/systematic-review.pipeline.md
---
name: systematic-review
version: 2.3
profile: systematic-review
routing_hints: [systematic review, systematic, prisma, 系统综述]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PROTOCOL.md
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/screening_log.csv
- papers/extraction_table.csv
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.systematic-review.csv
---
# Pipeline: systematic review (PRISMA-style)
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Protocol (C1)
required_skills:
- protocol-writer
produces:
- output/PROTOCOL.md
human_checkpoint:
- approve: protocol locked (query, inclusion/exclusion, databases, time window)
- write_to: DECISIONS.md
## Stage 2 - Retrieval & candidate pool (C2)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- arxiv-search
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Systematic-review contract: keep the candidate pool auditable. Prefer dedupe (good) over aggressive ranking/filtering (dangerous).
- `dedupe-rank` default: for this pipeline, it keeps the full deduped pool in `papers/core_set.csv` unless you explicitly set `queries.md:core_size` (screening should not silently drop papers).
## Stage 3 - Screening (C3)
required_skills:
- screening-manager
produces:
- papers/screening_log.csv
Notes:
- Use `papers/papers_dedup.jsonl` (or `papers/core_set.csv`) as the candidate list; `papers/screening_log.csv` must include protocol-grounded reasons for every decision.
- Practical auditability: number protocol clauses (`I1..` / `E1..`) and require screening rows to cite them (e.g., `reason_codes=E3`).
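The auditability rule above (every screening row cites a numbered protocol clause) is easy to check mechanically. The following is a minimal sketch under assumptions: only the `reason_codes` column name comes from the note above, while the function name, the other column names in the usage example, and the `;` separator for multiple codes are hypothetical.

```python
import csv
import io
import re

# Clause ids are assumed to look like I1, I2, ... or E1, E2, ...
CLAUSE_ID = re.compile(r"^[IE]\d+$")


def invalid_screening_rows(log_csv: str, protocol_clauses: set[str]) -> list[tuple[int, str]]:
    """Return (row_number, problem) pairs for rows lacking a valid clause code."""
    problems = []
    reader = csv.DictReader(io.StringIO(log_csv))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        raw = row.get("reason_codes") or ""
        codes = [code.strip() for code in raw.split(";") if code.strip()]
        if not codes:
            problems.append((row_number, "no reason code"))
            continue
        for code in codes:
            if not CLAUSE_ID.match(code):
                problems.append((row_number, f"malformed code {code!r}"))
            elif code not in protocol_clauses:
                problems.append((row_number, f"unknown clause {code}"))
    return problems
```

For example, passing the text of `papers/screening_log.csv` together with the clause ids extracted from `output/PROTOCOL.md` would list every decision that is not grounded in the locked protocol.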
## Stage 4 - Extraction (C4)
required_skills:
- extraction-form
- bias-assessor
produces:
- papers/extraction_table.csv
Notes:
- Extraction is schema-driven: do not write narrative synthesis here; keep all fields consistent and fill bias fields (low/unclear/high + notes).
## Stage 5 - Synthesis (C5) [PROSE ALLOWED]
required_skills:
- synthesis-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
FILE:pipelines/tutorial.pipeline.md
---
name: tutorial
version: 2.3
profile: tutorial
routing_hints: [tutorial, 教程]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/TUTORIAL_SPEC.md
- outline/concept_graph.yml
- outline/module_plan.yml
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.tutorial.csv
---
# Pipeline: tutorial (teaching loop)
Goal: a tutorial deliverable (`output/TUTORIAL.md`) that has a consistent running example and a real teaching loop (objectives -> steps -> exercises -> verification), not just a blog-style explanation.
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Spec (C1)
required_skills:
- tutorial-spec
produces:
- output/TUTORIAL_SPEC.md
Notes:
- The spec is the scope contract: audience, prerequisites, learning objectives, and a running example (if applicable).
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- concept-graph
- module-planner
- exercise-builder
produces:
- outline/concept_graph.yml
- outline/module_plan.yml
human_checkpoint:
- approve: target audience + scope + running example
- write_to: DECISIONS.md
Notes:
- Treat `outline/module_plan.yml` as the execution contract for writing: modules must have concrete outputs and at least one verifiable exercise each.
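The execution-contract rule above can be checked before writing starts. This sketch assumes the plan has already been parsed into a Python dict; the schema (`modules`, `id`, `outputs`, `exercises`, `verification`) is hypothetical, since the source does not define the fields of `outline/module_plan.yml`.

```python
def module_plan_problems(plan: dict) -> list[str]:
    """List modules that lack concrete outputs or a verifiable exercise."""
    problems = []
    for module in plan.get("modules", []):
        module_id = module.get("id", "<unnamed>")
        # A module needs at least one concrete output.
        if not module.get("outputs"):
            problems.append(f"{module_id}: no concrete outputs")
        # At least one exercise must state how its result is verified.
        exercises = module.get("exercises") or []
        if not any(exercise.get("verification") for exercise in exercises):
            problems.append(f"{module_id}: no verifiable exercise")
    return problems
```

An empty result means the plan satisfies the two stated conditions; anything else is worth resolving at the C2 checkpoint rather than during writing.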
## Stage 3 - Writing (C3) [PROSE ALLOWED]
required_skills:
- tutorial-module-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Write only within the approved scope; if you discover missing prerequisites, record them as a follow-up instead of expanding scope silently.
FILE:references/bad_narration_examples.md
# Bad Narration Examples
Use this file to remove high-signal generator voice from chapter leads.
## Table-of-contents narration
Bad:
- `In this section, we discuss planning, memory, and adaptation.`
- `This chapter surveys several design choices and then reviews evaluation.`
Why it fails:
- it narrates structure instead of stating the chapter's comparison problem
## Slide navigation
Bad:
- `Next, we move from interfaces to planning.`
- `We now turn to evaluation.`
Why it fails:
- it sounds like a presentation script rather than paper prose
## Planner talk
Bad:
- `To keep the chapter coherent, we first focus on representation and then expand to evaluation.`
- `The remaining uncertainty is resolved by looking at feedback mechanisms.`
Why it fails:
- it exposes internal planning language instead of the literature-facing argument
## Count-based opener slots
Bad:
- `Two key points organize this chapter.`
- `Three takeaways define the next comparison.`
Why it fails:
- it creates a reusable shell that quickly feels generated when repeated across chapters
## Title narration
Bad:
- `From representation to adaptation, the chapter covers several themes.`
Why it fails:
- it restates heading movement instead of making a claim about the chapter lens
FILE:references/bridge_examples.md
# Bridge Examples
Use these examples to connect H3 subsections without sounding like slide navigation.
Paraphrase them; do not copy them verbatim.
## Good bridge: from interface to decision policy
- `Once the interface fixes what actions are executable, the next comparison is how different systems choose among those actions under limited feedback and bounded cost.`
Why it works:
- it makes the next subsection feel necessary
- it names the hinge of the comparison instead of narrating movement
## Good bridge: from representation to evaluation
- `Richer internal state only matters if the evaluator can still tell what evidence a decision relied on, which is why the chapter shifts from representation choices to comparison under explicit protocol assumptions.`
Why it works:
- it links a design choice to the chapter's calibration lens
## Good bridge: from local mechanism to chapter-level contrast
- `These mechanisms differ in implementation, but they are comparable because each changes the same trade-off between runtime flexibility and predictable behavior under constraint.`
Why it works:
- it connects local variation back to one recurring contrast
## Good bridge: from one H3 to the next without ToC talk
- `That bottleneck narrows the comparison to what the system can revise after acting, which is why the next subsection focuses on feedback and update structure rather than on architecture labels alone.`
Why it works:
- it points forward through argument, not through narration
FILE:references/lead_block_archetypes.md
# Lead Block Archetypes
Use these archetypes as drafting shapes.
Paraphrase them; do not paste them verbatim.
Compatibility note:
- the active compatibility-mode sentence defaults now live in `assets/lead_block_compatibility_defaults.json`
- update that asset when the issue is fallback cadence or reusable sentence wording, and update this file when the issue is the underlying chapter-lead method
## 1. Lens-first
Use when:
- the chapter brief already has a strong throughline
- the reader mainly needs the chapter's comparison lens up front
Move pattern:
- sentence 1: state the shared comparison problem
- sentence 2: name the recurring contrasts
- optional sentence 3: calibrate how evidence or protocol assumptions affect comparison
Example:
- `What matters most in this chapter is not whether the systems share a surface architecture, but which assumptions make their decisions comparable under the same evaluation setting.`
## 2. Contrast-first
Use when:
- the chapter is held together by 2-3 recurring trade-offs
- the H3 subsections are easy to over-read as isolated methods
Move pattern:
- sentence 1: introduce the recurring trade-off
- sentence 2: show how different subsections expose different sides of it
- sentence 3: state why that contrast matters for interpretation
Example:
- `The central contrast in this chapter is between flexibility and control: each subsection changes how much structure the system fixes in advance and how much it leaves to runtime adaptation.`
## 3. Calibration-first
Use when:
- the comparison only makes sense if the reader keeps protocol or evidence assumptions in view
- the chapter includes evaluation, reproducibility, or deployment constraints
Move pattern:
- sentence 1: state the high-level calibration rule
- sentence 2: connect that rule to the chapter's subsections
- sentence 3: preview the design choices that the chapter will compare under that rule
Example:
- `The comparisons in this chapter only hold if protocol assumptions stay visible, because the same design choice looks very different once budget, access, or observability changes.`
## 4. Problem-first
Use when:
- the chapter brief defines a shared bottleneck or failure surface
- the subsections are best read as different answers to the same question
Move pattern:
- sentence 1: name the bottleneck
- sentence 2: show how each subsection addresses it differently
- sentence 3: explain the comparison lens that keeps those answers on the same scale
Example:
- `This chapter is unified by a single bottleneck: once the system must act under limited feedback, design choices that seem local become coupled to evidence quality, cost, and stability.`
FILE:references/overview.md
# Overview
## Purpose
This package writes the short H2 lead block that turns several H3 subsections into one chapter-level argument.
Read this file before applying the skill or editing the package.
## Role split
### `SKILL.md`
Use `SKILL.md` as the router:
- when the skill triggers
- required inputs and outputs
- when to read which reference file
- what the active script still does in compatibility mode
### `references/`
Use `references/` for the judgment layer:
- lead-block archetypes
- throughline construction patterns
- bridge examples
- narration anti-patterns
### `assets/`
Use `assets/` for machine-readable constraints that later validation can consume.
In compatibility mode, use `assets/lead_block_compatibility_defaults.json` as the editable source for fallback lead phrasing, joiners, and sentence cadence.
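
As a rough illustration of what that asset might hold, the sketch below builds a dictionary with the top-level keys the active script looks up (`default_archetype`, `limits`, `fallbacks`, `templates`) and serializes it to JSON. The fallback strings and the two template sentences mirror the in-code fallbacks in `scripts/run.py`. Treat the limit values and the overall layout as assumptions for illustration, not as the shipped defaults.

```python
import json

# Hypothetical contents for assets/lead_block_compatibility_defaults.json.
# Key names follow what scripts/run.py reads; values are illustrative only.
example_defaults = {
    "default_archetype": "lens-first",
    "limits": {
        "throughline_items": 2,
        "contrast_items": 3,
        "subsection_preview_items": 3,
    },
    "fallbacks": {
        "section_title": "chapter {section_id}",
        "comparison_problem": "the comparison pressure that the subsections keep reopening",
        "recurring_contrasts": "protocol assumptions, resource constraints, and evaluation scope",
        "subsection_preview": "the chapter subsections",
    },
    "templates": {
        "lens-first": {
            # A list lets the script rotate between phrasings per section.
            "paragraph_1": [
                "What binds {section_title} together is a shared comparison problem: {comparison_problem}{citation}.",
            ],
            "paragraph_2": [
                "Taken together, {subsection_preview} show why {recurring_contrasts} cannot be treated as background implementation detail{citation}.",
            ],
        }
    },
}

asset_text = json.dumps(example_defaults, indent=2, ensure_ascii=False)
print(asset_text)
```

Keeping phrasing in a data file like this means a wording change is a JSON edit that can be reviewed without touching Python.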
### `scripts/`
Use the script only for deterministic work and compatibility-mode execution.
It should not remain the long-term home for voice rules or reusable prose patterns.
If a fixed fallback phrase or lead cadence needs tuning, change the compatibility asset before editing Python.
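
To make that division of labor concrete, here is a minimal sketch of how a script could pick one phrasing from an asset-provided list deterministically, so that wording changes stay in the asset. It follows the same hash-seeded idea as the template rotation in `scripts/run.py`. The function name `pick_phrase` and the sample phrasings are hypothetical.

```python
import hashlib


def pick_phrase(seed: str, options: list[str], fallback: str) -> str:
    """Pick one option deterministically from a seed (hypothetical helper name)."""
    cleaned = [o.strip() for o in options if o and o.strip()]
    if not cleaned:
        return fallback
    # Hash the seed so the same section always maps to the same option.
    digest = hashlib.sha1(seed.encode("utf-8")).hexdigest()
    return cleaned[int(digest[:12], 16) % len(cleaned)]


options = ["Lead phrasing A: {x}.", "Lead phrasing B: {x}."]  # illustrative only
first = pick_phrase("lead:p1:3:Planning", options, "Fallback: {x}.")
again = pick_phrase("lead:p1:3:Planning", options, "Fallback: {x}.")
print(first == again)
```

Because the choice depends only on the seed and the option list, reruns produce the same lead wording unless the asset itself changes.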
## What a chapter lead must do
A good chapter lead does four jobs:
- state the chapter's comparison lens
- connect multiple H3 subsections as one argument
- preview 2-3 recurring contrasts without turning into an axis dump
- calibrate how the chapter should be read at a high level, especially when protocol or evidence assumptions affect comparison
## What it must not do
Avoid:
- table-of-contents narration
- slide-navigation phrasing
- planner-talk transitions
- count-based openers (for example, announcing how many subsections follow) reused as a fixed shell
- title narration that sounds like a generated heading bridge
- new claims that the downstream H3s do not support
## Recommended read order
1. `lead_block_archetypes.md`
   - choose the shape of the lead block
2. `throughline_patterns.md`
   - map chapter brief fields into a chapter-level argument
3. `bridge_examples.md`
   - strengthen local transitions between subsections without narrating the outline
4. `bad_narration_examples.md`
   - remove weak stems and generator voice before finalizing the block
## Compatibility mode boundary
In this migration phase, the script still writes the active output.
Use the references to guide edits and future refactors, but preserve:
- body-only output
- existing file naming
- the report file
- the same high-level contract of a short lead block per eligible H2 chapter
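
A minimal sketch of how the output side of that contract could be spot-checked after an edit. The function name, the reading of "body-only" as "no heading lines", and the paragraph cap are all assumptions made for illustration. The package does not define these checks.

```python
def check_lead_block(text: str, max_paragraphs: int = 3) -> list[str]:
    """Return a list of contract problems for one lead block (hypothetical checks)."""
    problems: list[str] = []
    paragraphs = [p for p in text.strip().split("\n\n") if p.strip()]
    if not paragraphs:
        problems.append("empty lead block")
    # Assumption: "body-only" means the block carries no Markdown heading lines.
    if any(line.lstrip().startswith("#") for line in text.splitlines()):
        problems.append("contains a heading line")
    # Assumption: "short" is read here as a small paragraph count.
    if len(paragraphs) > max_paragraphs:
        problems.append(f"more than {max_paragraphs} paragraphs")
    return problems


ok_block = "First paragraph of the lead.\n\nSecond paragraph of the lead.\n"
bad_block = "## Planning\n\nFirst paragraph.\n"
print(check_lead_block(ok_block))
print(check_lead_block(bad_block))
```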
FILE:references/throughline_patterns.md
# Throughline Patterns
Use this file when `outline/chapter_briefs.jsonl` is hard to convert into one lead block.
## Field-to-move mapping
### `throughline`
Use it to answer:
- what shared question this chapter is really asking
- why the H3 subsections belong together
Good transformation:
- convert a topic list into one comparison problem
Example transformation:
- weak input: `state management`, `deliberation`, `feedback loops`
- stronger throughline: `The chapter compares how systems preserve decision quality once intermediate state, delayed feedback, and multi-step control must all be managed together.`
### `key_contrasts`
Use it to choose 2-3 recurring contrasts that can appear in the lead.
Good contrasts are:
- comparative
- reusable across several H3 subsections
- readable as prose rather than labels
Prefer:
- `fixed structure versus runtime adaptation`
- `richer state versus easier verification`
- `local gains versus transfer under tighter constraints`
Avoid:
- copied slash lists
- internal planner labels
- contrast inventories longer than the lead can hold naturally
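
A minimal sketch of how these preferences could be applied mechanically: it rewrites slash lists into prose, drops duplicates, caps the list at three, and joins the result as a natural series. The helper names and the cap parameter are assumptions. The series join follows the same shape as `_series` in `scripts/run.py`.

```python
def clean_contrasts(raw: list[str], limit: int = 3) -> list[str]:
    """Turn raw contrast labels into at most `limit` prose phrases (hypothetical helper)."""
    out: list[str] = []
    for item in raw:
        # Replace slash lists with "and", then collapse whitespace.
        phrase = " ".join(str(item or "").replace("/", " and ").split()).strip(" .;:,")
        if phrase and phrase not in out:
            out.append(phrase)
    return out[:limit]


def join_series(items: list[str]) -> str:
    """Join phrases as 'a', 'a and b', or 'a, b, and c'."""
    if len(items) <= 1:
        return items[0] if items else ""
    if len(items) == 2:
        return f"{items[0]} and {items[1]}"
    return f"{', '.join(items[:-1])}, and {items[-1]}"


contrasts = clean_contrasts([
    "fixed structure versus runtime adaptation",
    "richer state versus easier verification",
    "richer state versus easier verification",
    "local gains versus transfer under tighter constraints",
    "cost / latency",
])
print(join_series(contrasts))
```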
### `lead_paragraph_plan`
Use it to decide progression.
Good use:
- interpret the plan as movement in the argument, not as a miniature table of contents
Example:
- weak use: `first cover setup, then methods, then evaluation`
- stronger use: `move from the source of the bottleneck to the design choices it induces, then to the conditions under which those choices remain comparable`
## Compression rule
If a brief contains many bullets, compress them into:
- one chapter-level question
- two or three recurring contrasts
- one progression sentence if needed
If the brief cannot be compressed that way, the lead should stay conservative and avoid over-claiming coherence.
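
The compression rule can be sketched as a small function. This is a hedged illustration, not the package's implementation: `compress_brief` is a hypothetical name, the field names come from the mapping above, and the threshold for "cannot be compressed" (no question, or fewer than two contrasts) is an assumption.

```python
from typing import Any


def _as_list(value: Any) -> list[str]:
    """Normalize a brief field to a list of non-empty strings."""
    if isinstance(value, str):
        value = [value]
    return [str(x).strip() for x in (value or []) if str(x).strip()]


def compress_brief(brief: dict[str, Any]) -> dict[str, Any]:
    """Reduce a chapter brief to one question, 2-3 contrasts, and an optional progression."""
    throughline = _as_list(brief.get("throughline"))
    contrasts = _as_list(brief.get("key_contrasts"))
    plan = _as_list(brief.get("lead_paragraph_plan"))
    compressed: dict[str, Any] = {
        "question": throughline[0] if throughline else "",
        "contrasts": contrasts[:3],
        "progression": plan[0] if plan else "",
    }
    # Assumed threshold: without a question and at least two contrasts,
    # treat the brief as not compressible and keep the lead conservative.
    compressed["conservative"] = not compressed["question"] or len(compressed["contrasts"]) < 2
    return compressed


example = compress_brief({
    "throughline": ["how systems preserve decision quality under delayed feedback"],
    "key_contrasts": [
        "fixed structure versus runtime adaptation",
        "richer state versus easier verification",
        "local gains versus transfer under tighter constraints",
        "one contrast too many",
    ],
})
print(example["contrasts"], example["conservative"])
```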
FILE:scripts/run.py
from __future__ import annotations
import argparse
import hashlib
import json
import re
import sys
from pathlib import Path
from typing import Any
def _load_jsonl(path: Path) -> list[dict[str, Any]]:
    """Read a JSONL file, skipping blank lines, malformed lines, and non-dict records."""
    out: list[dict[str, Any]] = []
    if not path.exists() or path.stat().st_size <= 0:
        return out
    for raw in path.read_text(encoding='utf-8', errors='ignore').splitlines():
        raw = raw.strip()
        if not raw:
            continue
        try:
            rec = json.loads(raw)
        except Exception:
            continue
        if isinstance(rec, dict):
            out.append(rec)
    return out
def _load_json_asset(path: Path) -> dict[str, Any]:
    if not path.exists() or path.stat().st_size <= 0:
        return {}
    try:
        data = json.loads(path.read_text(encoding='utf-8', errors='ignore'))
    except Exception:
        return {}
    return data if isinstance(data, dict) else {}
def _uniq(items: list[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        v = str(item or '').strip()
        if not v or v in seen:
            continue
        seen.add(v)
        out.append(v)
    return out
def _cite(keys: list[str], n: int = 3) -> str:
    vals = _uniq(keys)[:n]
    return f"[@{'; '.join(vals)}]" if vals else ""
def _clean_phrase(text: str) -> str:
    s = str(text or '').strip().replace('/', ' and ')
    s = ' '.join(s.split())
    return s.rstrip(' .;:,;。')
def _natural_phrase(text: str) -> str:
    s = _clean_phrase(text).replace('&', 'and')
    s = re.sub(r'\([^)]*\)', '', s)
    s = re.sub(r'\s+', ' ', s).strip(' ,;:')
    low = s.lower()
    blocked = (
        'core mechanism and architecture',
        'evaluation protocol',
        'datasets, metrics, human eval',
    )
    if any(token in low for token in blocked):
        return ''
    return low if low else ''
def _content_phrase(text: str) -> str:
    s = _clean_phrase(text)
    low = s.lower()
    prefixes = [
        'pin scope to goal:',
        'compare approaches along:',
        'para 1:',
        'para 2:',
        'para 3:',
        'para 3 (optional):',
    ]
    for prefix in prefixes:
        if low.startswith(prefix):
            s = s.split(':', 1)[1].strip() if ':' in s else ''
            low = s.lower()
    meta_starts = (
        'state the chapter',
        'preview the h3',
        'highlight evaluation anchors',
    )
    if any(low.startswith(prefix) for prefix in meta_starts):
        return ''
    return s
def _skill_root() -> Path:
    return Path(__file__).resolve().parents[1]
def _compatibility_defaults() -> dict[str, Any]:
    asset_path = _skill_root() / 'assets' / 'lead_block_compatibility_defaults.json'
    return _load_json_asset(asset_path)
def _limit(value: Any, *, fallback: int) -> int:
    try:
        num = int(value)
    except Exception:
        return fallback
    return num if num > 0 else fallback
def _series(items: list[str], *, limit: int, fallback: str) -> str:
    vals = [str(item).strip() for item in items if str(item).strip()][:limit]
    if not vals:
        return fallback
    if len(vals) == 1:
        return vals[0]
    if len(vals) == 2:
        return f'{vals[0]} and {vals[1]}'
    return f"{', '.join(vals[:-1])}, and {vals[-1]}"
def _citation_suffix(keys: list[str], n: int = 3) -> str:
    cite = _cite(keys, n)
    return f' {cite}' if cite else ''
def _render(template: str, **kwargs: str) -> str:
    text = template.format(**kwargs)
    text = ' '.join(text.split())
    text = text.replace(' .', '.').replace(' ,', ',').replace(' ;', ';').replace(' :', ':')
    return text
def _ordered_options(seed: str, value: Any) -> list[str]:
    if not isinstance(value, list):
        single = str(value or '').strip()
        return [single] if single else []
    options = [str(item or '').strip() for item in value if str(item or '').strip()]
    if not options:
        return []
    digest = hashlib.sha1(str(seed or '').encode('utf-8', errors='ignore')).hexdigest()
    idx = int(digest[:12], 16) % len(options)
    return options[idx:] + options[:idx]
def _pick_template(seed: str, value: Any, fallback: str) -> str:
    options = _ordered_options(seed, value)
    return options[0] if options else fallback
def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument('--workspace', required=True)
    parser.add_argument('--unit-id', default='')
    parser.add_argument('--inputs', default='')
    parser.add_argument('--outputs', default='')
    parser.add_argument('--checkpoint', default='')
    args = parser.parse_args()
    repo_root = Path(__file__).resolve()
    for _ in range(10):
        if (repo_root / "AGENTS.md").exists():
            break
        parent = repo_root.parent
        if parent == repo_root:
            break
        repo_root = parent
    sys.path.insert(0, str(repo_root))
    from tooling.common import atomic_write_text, ensure_dir, load_yaml, now_iso_seconds
    from tooling.pipeline_text import slug_unit_id
    workspace = Path(args.workspace).resolve()
    compat = _compatibility_defaults()
    limits = compat.get('limits') if isinstance(compat.get('limits'), dict) else {}
    fallbacks = compat.get('fallbacks') if isinstance(compat.get('fallbacks'), dict) else {}
    templates = compat.get('templates') if isinstance(compat.get('templates'), dict) else {}
    archetype = str(compat.get('default_archetype') or 'lens-first').strip() or 'lens-first'
    template_pack = templates.get(archetype) if isinstance(templates.get(archetype), dict) else {}
    sections_dir = workspace / 'sections'
    ensure_dir(sections_dir)
    report_path = workspace / 'output' / 'CHAPTER_LEADS_REPORT.md'
    ensure_dir(report_path.parent)
    outline = load_yaml(workspace / 'outline' / 'outline.yml') if (workspace / 'outline' / 'outline.yml').exists() else []
    briefs = {str(r.get('section_id') or '').strip(): r for r in _load_jsonl(workspace / 'outline' / 'chapter_briefs.jsonl') if str(r.get('section_id') or '').strip()}
    packs = _load_jsonl(workspace / 'outline' / 'writer_context_packs.jsonl')
    cites_by_sec: dict[str, list[str]] = {}
    for rec in packs:
        sec_id = str(rec.get('section_id') or '').strip()
        if not sec_id:
            continue
        bucket = cites_by_sec.setdefault(sec_id, [])
        for field in ['allowed_bibkeys_selected', 'allowed_bibkeys_chapter']:
            for key in rec.get(field) or []:
                bucket.append(str(key).strip())
    written: list[str] = []
    if isinstance(outline, list):
        for sec in outline:
            if not isinstance(sec, dict):
                continue
            subs = [x for x in (sec.get('subsections') or []) if isinstance(x, dict)]
            if not subs:
                continue
            sec_id = str(sec.get('id') or '').strip()
            sec_title = str(sec.get('title') or '').strip() or str(fallbacks.get('section_title') or 'chapter {section_id}').format(section_id=sec_id)
            if not sec_id:
                continue
            brief = briefs.get(sec_id) or {}
            throughline = [_natural_phrase(_content_phrase(x)) for x in (brief.get('throughline') or []) if _natural_phrase(_content_phrase(x))]
            contrasts = [_natural_phrase(x) for x in (brief.get('key_contrasts') or []) if _natural_phrase(x)]
            sub_titles = [str(x.get('title') or '').strip() for x in subs if str(x.get('title') or '').strip()]
            cites = cites_by_sec.get(sec_id) or []
            comparison_problem = _series(
                contrasts or throughline,
                limit=_limit(limits.get('throughline_items'), fallback=2),
                fallback=str(fallbacks.get('comparison_problem') or 'the comparison pressure that the subsections keep reopening'),
            )
            recurring_contrasts = _series(
                contrasts,
                limit=_limit(limits.get('contrast_items'), fallback=3),
                fallback=str(fallbacks.get('recurring_contrasts') or 'protocol assumptions, resource constraints, and evaluation scope'),
            )
            subsection_preview = _series(
                sub_titles,
                limit=_limit(limits.get('subsection_preview_items'), fallback=3),
                fallback=str(fallbacks.get('subsection_preview') or 'the chapter subsections'),
            )
            paragraph_1 = _render(
                _pick_template(
                    f'lead:p1:{sec_id}:{sec_title}',
                    template_pack.get('paragraph_1'),
                    'What binds {section_title} together is a shared comparison problem: {comparison_problem}{citation}.',
                ),
                section_title=sec_title,
                comparison_problem=comparison_problem,
                citation=_citation_suffix(cites, 4),
            )
            paragraph_2 = _render(
                _pick_template(
                    f'lead:p2:{sec_id}:{subsection_preview}',
                    template_pack.get('paragraph_2'),
                    'Taken together, {subsection_preview} show why {recurring_contrasts} cannot be treated as background implementation detail{citation}.',
                ),
                subsection_preview=subsection_preview,
                recurring_contrasts=recurring_contrasts,
                citation=_citation_suffix(cites[4:], 4) or _citation_suffix(cites, 4),
            )
            paragraphs = [paragraph_1, paragraph_2]
            path = sections_dir / f"{slug_unit_id(sec_id)}_lead.md"
            atomic_write_text(path, '\n\n'.join(paragraphs).rstrip() + '\n')
            written.append(str(path.relative_to(workspace)))
    report = '\n'.join([
        '# Chapter leads report',
        '',
        '- Status: PASS',
        f'- Generated at: `{now_iso_seconds()}`',
    ] + [f'- `{p}`' for p in written]) + '\n'
    atomic_write_text(report_path, report)
    return 0
if __name__ == '__main__':
    raise SystemExit(main())
FILE:tooling/__init__.py
FILE:tooling/common.py
from __future__ import annotations
import csv
import json
import os
import re
import shutil
import tempfile
from dataclasses import dataclass
from datetime import date, datetime
from pathlib import Path
from typing import Any, Iterable
import yaml
def today_iso() -> str:
    return date.today().isoformat()
def now_iso_seconds() -> str:
    return datetime.now().replace(microsecond=0).isoformat()
def ensure_dir(path: Path) -> None:
    path.mkdir(parents=True, exist_ok=True)
def atomic_write_text(path: Path, content: str) -> None:
    ensure_dir(path.parent)
    fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
        os.replace(tmp_path, path)
    except Exception:
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
def backup_existing(path: Path) -> Path:
    """Rename an existing file to a timestamped `.bak.*` sibling and return the backup path."""
    if not path.exists():
        return path
    stamp = datetime.now().replace(microsecond=0).isoformat().replace("-", "").replace(":", "")
    backup = path.with_name(f"{path.name}.bak.{stamp}")
    counter = 1
    while backup.exists():
        backup = path.with_name(f"{path.name}.bak.{stamp}.{counter}")
        counter += 1
    path.replace(backup)
    return backup
def parse_semicolon_list(value: str | None) -> list[str]:
    if not value:
        return []
    return [item.strip() for item in value.split(";") if item.strip()]
def normalize_title_for_dedupe(title: str) -> str:
    title = title.lower()
    title = re.sub(r"[^a-z0-9]+", " ", title)
    title = re.sub(r"\s+", " ", title).strip()
    return title
def normalize_axis_label(text: str) -> str:
    text = re.sub(r"\s+", " ", (text or "").strip().lower())
    text = text.rstrip(" .;:,;。")
    text = re.sub(r"\s*/\s*", " ", text)
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()
def subsection_brief_generic_axis_norms() -> set[str]:
    """Axis labels that strict survey gates treat as scaffold-level defaults.

    Keep this aligned across brief generation and quality-gate checks so the
    generator does not promote an axis that the gate later classifies as generic.
    """
    axes = {
        "core mechanism and system architecture",
        "training and data setup",
        "evaluation protocol",
        "evaluation protocol (benchmarks / metrics / human)",
        "evaluation protocol (datasets / metrics / human)",
        "evaluation protocol (datasets, metrics, human evaluation)",
        "compute and efficiency",
        "compute and latency constraints",
        "efficiency and compute",
        "tool interface contract (schemas / protocols)",
        "tool selection / routing policy",
        "sandboxing / permissions / observability",
        "failure modes and limitations",
    }
    return {normalize_axis_label(axis) for axis in axes}
def read_jsonl(path: Path) -> list[dict[str, Any]]:
    records: list[dict[str, Any]] = []
    if not path.exists():
        return records
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            records.append(json.loads(line))
    return records
def write_jsonl(path: Path, records: Iterable[dict[str, Any]]) -> None:
    ensure_dir(path.parent)
    lines = [json.dumps(record, ensure_ascii=False) for record in records]
    atomic_write_text(path, "\n".join(lines) + ("\n" if lines else ""))
def read_tsv(path: Path) -> list[dict[str, str]]:
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [dict(row) for row in reader]
def write_tsv(path: Path, rows: list[dict[str, Any]], fieldnames: list[str]) -> None:
    ensure_dir(path.parent)
    fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
            writer.writeheader()
            for row in rows:
                writer.writerow({key: row.get(key, "") for key in fieldnames})
        os.replace(tmp_path, path)
    except Exception:
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
@dataclass(frozen=True)
class UnitsTable:
    fieldnames: list[str]
    rows: list[dict[str, str]]
    @staticmethod
    def load(path: Path) -> "UnitsTable":
        with path.open("r", encoding="utf-8", newline="") as handle:
            reader = csv.DictReader(handle)
            fieldnames = list(reader.fieldnames or [])
            rows = [dict(row) for row in reader]
        return UnitsTable(fieldnames=fieldnames, rows=rows)
    def save(self, path: Path) -> None:
        ensure_dir(path.parent)
        fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
        try:
            with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
                writer = csv.DictWriter(handle, fieldnames=self.fieldnames)
                writer.writeheader()
                for row in self.rows:
                    writer.writerow({key: row.get(key, "") for key in self.fieldnames})
            os.replace(tmp_path, path)
        except Exception:
            try:
                os.unlink(tmp_path)
            except OSError:
                pass
            raise
def load_yaml(path: Path) -> Any:
    with path.open("r", encoding="utf-8") as handle:
        return yaml.safe_load(handle)
def dump_yaml(path: Path, data: Any) -> None:
    ensure_dir(path.parent)
    text = yaml.safe_dump(
        data,
        allow_unicode=True,
        sort_keys=False,
        default_flow_style=False,
        width=120,
    )
    atomic_write_text(path, text)
def copy_tree(src_dir: Path, dst_dir: Path, *, overwrite: bool) -> None:
    if not src_dir.is_dir():
        raise ValueError(f"Template directory not found: {src_dir}")
    ensure_dir(dst_dir)
    for src_path in src_dir.rglob("*"):
        rel = src_path.relative_to(src_dir)
        dst_path = dst_dir / rel
        if src_path.is_dir():
            ensure_dir(dst_path)
            continue
        ensure_dir(dst_path.parent)
        if dst_path.exists() and not overwrite:
            continue
        shutil.copy2(src_path, dst_path)
def tokenize(text: str) -> list[str]:
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return [token for token in text.split() if token]
_EN_STOPWORDS = {
"a",
"an",
"and",
"are",
"as",
"at",
"be",
"by",
"for",
"from",
"in",
"is",
"it",
"of",
"on",
"or",
"that",
"the",
"this",
"to",
"with",
"we",
"our",
"via",
"towards",
"toward",
"using",
"use",
"based",
"new",
"towards",
"into",
"over",
"under",
"between",
"within",
"without",
"beyond",
}
_GENERIC_PAPER_WORDS = {
"survey",
"review",
"tutorial",
"paper",
"approach",
"method",
"methods",
"model",
"models",
"framework",
"frameworks",
"system",
"systems",
"learning",
"deep",
"neural",
"network",
"networks",
"analysis",
"benchmark",
"benchmarks",
"dataset",
"datasets",
"evaluation",
"evaluating",
"towards",
"using",
"based",
"study",
"studies",
}
def candidate_keywords(titles: Iterable[str], *, top_k: int, min_freq: int) -> list[str]:
    freq: dict[str, int] = {}
    for title in titles:
        for token in tokenize(title):
            if token in _EN_STOPWORDS or token in _GENERIC_PAPER_WORDS:
                continue
            if len(token) < 3:
                continue
            freq[token] = freq.get(token, 0) + 1
    candidates = [t for t, c in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])) if c >= min_freq]
    return candidates[:top_k]
def update_status_log(status_path: Path, line: str) -> None:
    ensure_dir(status_path.parent)
    if status_path.exists():
        existing = status_path.read_text(encoding="utf-8")
    else:
        existing = "# Status\n"
    if "## Run log" not in existing:
        existing = existing.rstrip() + "\n\n## Run log\n"
    updated = existing.rstrip() + f"\n- {line}\n"
    atomic_write_text(status_path, updated)
def update_status_field(status_path: Path, heading: str, value: str) -> None:
    heading_line = f"## {heading}".strip()
    bullet_line = f"- `{value}`"
    if status_path.exists():
        lines = status_path.read_text(encoding="utf-8").splitlines()
    else:
        lines = ["# Status"]
    out: list[str] = []
    i = 0
    updated = False
    while i < len(lines):
        line = lines[i]
        out.append(line)
        if line.strip() == heading_line:
            if i + 1 < len(lines) and lines[i + 1].lstrip().startswith("-"):
                out.append(bullet_line)
                i += 2
                updated = True
                continue
            out.append(bullet_line)
            updated = True
        i += 1
    if not updated:
        out.extend(["", heading_line, bullet_line])
    atomic_write_text(status_path, "\n".join(out).rstrip() + "\n")
def decisions_has_approval(decisions_path: Path, checkpoint: str) -> bool:
    if not checkpoint:
        return False
    if not decisions_path.exists():
        return False
    text = decisions_path.read_text(encoding="utf-8")
    pattern = rf"^\s*-\s*\[[xX]\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b"
    return re.search(pattern, text, flags=re.MULTILINE) is not None
def ensure_decisions_approval_checklist(decisions_path: Path) -> None:
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n"
    if re.search(r"^##\s+Approvals\b", text, flags=re.MULTILINE):
        return
    workspace = decisions_path.parent
    checkpoints = _human_checkpoints_from_units(workspace)
    if not checkpoints:
        return
    checklist_lines = ["## Approvals (check to unblock)"]
    for checkpoint in checkpoints:
        hint = _approval_hint(checkpoint)
        suffix = f" ({hint})" if hint else ""
        checklist_lines.append(f"- [ ] Approve {checkpoint}{suffix}")
    checklist_lines.append("")
    checklist = "\n".join(checklist_lines)
    lines = text.splitlines()
    if lines and lines[0].startswith("#"):
        new_text = "\n".join([lines[0], "", checklist] + lines[1:]).rstrip() + "\n"
    else:
        new_text = (checklist + "\n" + text).rstrip() + "\n"
    atomic_write_text(decisions_path, new_text)
def set_decisions_approval(decisions_path: Path, checkpoint: str, *, approved: bool) -> None:
    checkpoint = checkpoint.strip()
    if not checkpoint:
        raise ValueError("checkpoint must be non-empty")
    ensure_decisions_approval_checklist(decisions_path)
    # The checklist helper returns without writing when no HUMAN checkpoints exist,
    # so the file may still be missing here; fall back to an empty log in that case.
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n"
    lines = text.splitlines()
    pattern = re.compile(rf"^\s*-\s*\[\s*[xX ]\s*\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b")
    updated = False
    for idx, line in enumerate(lines):
        if pattern.search(line):
            lines[idx] = re.sub(
                r"\[\s*[xX ]\s*\]",
                "[x]" if approved else "[ ]",
                line,
                count=1,
            )
            updated = True
            break
    if not updated:
        insert_at = None
        for idx, line in enumerate(lines):
            if line.strip().startswith("## Approvals"):
                insert_at = idx + 1
                break
        if insert_at is None:
            lines.append("")
            lines.append("## Approvals (check to unblock)")
            insert_at = len(lines)
        lines.insert(insert_at, f"- [{'x' if approved else ' '}] Approve {checkpoint}")
    atomic_write_text(decisions_path, "\n".join(lines).rstrip() + "\n")
def _human_checkpoints_from_units(workspace: Path) -> list[str]:
    units_path = workspace / "UNITS.csv"
    if not units_path.exists():
        return []
    try:
        table = UnitsTable.load(units_path)
    except Exception:
        return []
    seen: set[str] = set()
    out: list[str] = []
    for row in table.rows:
        owner = (row.get("owner") or "").strip().upper()
        if owner != "HUMAN":
            continue
        checkpoint = (row.get("checkpoint") or "").strip()
        if checkpoint and checkpoint not in seen:
            seen.add(checkpoint)
            out.append(checkpoint)
    return out
def _approval_hint(checkpoint: str) -> str:
    hints = {
        "C0": "kickoff: scope/sources/time window/constraints",
        "C1": "retrieval + core set",
        "C2": "scope + outline",
        "C3": "evidence ready",
        "C4": "citations verified",
        "C5": "allow prose writing",
    }
    return hints.get(checkpoint, "")
def upsert_checkpoint_block(decisions_path: Path, checkpoint: str, markdown_block: str) -> None:
    begin = f"<!-- BEGIN CHECKPOINT:{checkpoint} -->"
    end = f"<!-- END CHECKPOINT:{checkpoint} -->"
    block = "\n".join([begin, markdown_block.rstrip(), end, ""]).rstrip() + "\n"
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n\n"
    ensure_decisions_approval_checklist(decisions_path)
    # Re-read only if the file exists; the checklist helper may not have created it.
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    pattern = re.compile(
        rf"{re.escape(begin)}.*?{re.escape(end)}\n?",
        flags=re.DOTALL,
    )
    if pattern.search(text):
        # Use a callable so backslashes in the Markdown block are inserted literally
        # instead of being parsed as regex replacement escapes.
        new_text = pattern.sub(lambda _match: block, text)
    else:
        new_text = text.rstrip() + "\n\n" + block
    atomic_write_text(decisions_path, new_text)
def seed_queries_from_topic(queries_path: Path, topic: str) -> None:
    topic = topic.strip()
    if not topic:
        return
    if queries_path.exists():
        lines = queries_path.read_text(encoding="utf-8").splitlines()
    else:
        lines = [
            "# Queries",
            "",
            "## Primary query",
            "- keywords:",
            " - \"\"",
            "- exclude:",
            " - \"\"",
            "- max_results: \"\"",
            "- core_size: \"\"",
            "- time window:",
            " - from: \"\"",
            " - to: \"\"",
            "",
            "## Notes",
            "-",
        ]
    def _has_nonempty_values(token: str) -> bool:
        in_block = False
        for raw in lines:
            stripped = raw.strip()
            if stripped.startswith(f"- {token}:"):
                in_block = True
                continue
            if not in_block:
                continue
            if raw.startswith(" - "):
                value = stripped[2:].strip().strip('"').strip("'")
                if value:
                    return True
                continue
            if stripped.startswith("- "):
                break
        return False
    def _has_nonempty_scalar(token: str) -> bool:
        for raw in lines:
            stripped = raw.strip()
            if not stripped.startswith(f"- {token}:"):
                continue
            value = stripped.split(":", 1)[1].split("#", 1)[0].strip().strip('"').strip("'")
            return bool(value)
        return False
    def _has_nonempty_time_field(field: str) -> bool:
        # Looks for lines like: ' - from: "2022"'
        for raw in lines:
            stripped = raw.strip()
            if not stripped.startswith(f"- {field}:"):
                continue
            value = stripped.split(":", 1)[1].strip().strip('"').strip("'")
            return bool(value)
        return False
    has_keywords = _has_nonempty_values("keywords")
    has_excludes = _has_nonempty_values("exclude")
    has_time_from = _has_nonempty_time_field("from")
    has_time_to = _has_nonempty_time_field("to")
    has_max_results = _has_nonempty_scalar("max_results")
    has_core_size = _has_nonempty_scalar("core_size")
    workspace = queries_path.parent
    profile = pipeline_profile(workspace)
    query_defaults = pipeline_query_defaults(workspace)
    raw_tlow = topic.lower()
    topic_for_queries = _sanitize_topic_for_query_seed(topic)
    keyword_suggestions = [topic_for_queries]
    tlow = topic_for_queries.lower()
    is_agent = any(t in tlow for t in ("agent", "agents", "agentic"))
    is_embodied = any(
        t in tlow
        for t in (
            "embodied ai",
            "embodied intelligence",
            "embodied agent",
            "embodied robotics",
            "robot foundation model",
            "robot learning",
            "robot manipulation",
            "vision-language-action",
            "vla",
            "generalist robot",
        )
    )
    is_text_to_image = any(t in tlow for t in ("text-to-image", "text to image", "t2i"))
    is_text_to_video = any(t in tlow for t in ("text-to-video", "text to video", "t2v"))
    is_diffusion = "diffusion" in tlow
    is_generative = is_text_to_image or is_text_to_video or is_diffusion or ("image generation" in tlow) or ("generative" in tlow)
    if is_agent:
        keyword_suggestions.extend(
            [
                "LLM agent",
                "language model agent",
                "tool use",
                "function calling",
                "tool-using agent",
                "planning",
                "memory",
                "multi-agent",
                "benchmark",
                "safety",
            ]
        )
    exclude_suggestions: list[str] = []
    if is_agent:
        exclude_suggestions.append("agent-based modeling")
        exclude_suggestions.extend(["react hooks", "perovskite", "banach", "coxeter"])
    if is_embodied:
        keyword_suggestions.extend(
            [
                "embodied AI survey",
                "embodied AI review",
                "embodied intelligence survey",
                "embodied agent survey",
                "robot foundation model survey",
                "robot learning survey",
                "robot manipulation survey",
                "embodied robotics survey",
                "vision-language-action survey",
                "vision-language-action model",
                "robot foundation model",
                "generalist robot policy",
                "world model robot",
            ]
        )
    stripped_output_terms = topic_for_queries.lower().strip() != raw_tlow.strip()
    if stripped_output_terms and any(t in raw_tlow for t in ("latex", "pdf", "markdown", "typesetting")):
        exclude_suggestions.extend(["latex", "pdf", "typesetting", "document layout"])
    if is_generative:
        keyword_suggestions.extend(
            [
                "text-to-image generation",
                "text-guided image generation",
                "diffusion model",
                "denoising diffusion probabilistic model",
                "latent diffusion",
                "stable diffusion",
                "classifier-free guidance",
                "diffusion transformer",
                "DiT",
                "masked generative transformer",
                "MaskGIT",
                "autoregressive image generation",
                "VQGAN",
                "VQ-VAE",
                "ControlNet",
                "DreamBooth",
                "textual inversion",
                "LoRA fine-tuning",
            ]
        )
    default_max_results = query_defaults.get("max_results")
    max_results_suggestion = str(default_max_results) if str(default_max_results or "").strip() else (
        1800 if profile == "arxiv-survey" else (800 if (is_agent or is_generative or is_embodied) else 300)
    )
    time_from_suggestion = (
        "2018" if is_embodied
        else ("2022" if (is_agent and ("llm" in tlow or "language model" in tlow)) else ("2020" if is_generative else ""))
    )
    core_size_suggestion = str(query_defaults.get("core_size") or "").strip() or ("300" if profile == "arxiv-survey" else "")
    out: list[str] = []
    i = 0
    while i < len(lines):
        line = lines[i]
        stripped = line.strip()
        if stripped.startswith("- keywords:") and not has_keywords:
            out.append(line)
            i += 1
            while i < len(lines) and lines[i].startswith(" - "):
                i += 1
            for kw in _dedupe_preserve_order(keyword_suggestions)[:14]:
                out.append(f" - \"{kw}\"")
            continue
        if stripped.startswith("- exclude:") and not has_excludes:
            out.append(line)
            i += 1
            while i < len(lines) and lines[i].startswith(" - "):
                i += 1
            for ex in _dedupe_preserve_order(exclude_suggestions)[:10]:
                out.append(f" - \"{ex}\"")
            continue
        if stripped.startswith("- max_results:") and not has_max_results and max_results_suggestion:
            out.append(f"- max_results: \"{max_results_suggestion}\"")
            i += 1
            continue
        if stripped.startswith("- core_size:") and not has_core_size and core_size_suggestion:
            out.append(f"- core_size: \"{core_size_suggestion}\"")
            i += 1
            continue
        if stripped.startswith("- time window:") and not (has_time_from or has_time_to) and time_from_suggestion:
            out.append(line)
            i += 1
            # Skip existing from/to lines if present.
            while i < len(lines) and lines[i].startswith(" -"):
                i += 1
            out.append(f" - from: \"{time_from_suggestion}\"")
            out.append(" - to: \"\"")
            continue
        out.append(line)
        i += 1
    out = _materialize_missing_query_defaults(out, query_defaults, allowed_fields=pipeline_overridable_query_fields(workspace))
    atomic_write_text(queries_path, "\n".join(out).rstrip() + "\n")
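def _dedupe_preserve_order(items: list[str]) -> list[str]:
"""Return items stripped of surrounding whitespace, dropping empties and duplicates while keeping first-seen order."""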
seen: set[str] = set()
out: list[str] = []
for item in items:
item = item.strip()
if not item or item in seen:
continue
seen.add(item)
out.append(item)
return out
def _sanitize_topic_for_query_seed(topic: str) -> str:
text = str(topic or "").strip()
if not text:
return ""
patterns = [
r"(?i)\bwith\s+latex\s*/\s*pdf\s+output\b",
r"(?i)\bwith\s+latex\s+output\b",
r"(?i)\bwith\s+pdf\s+output\b",
r"(?i)\bwith\s+markdown\s+output\b",
r"(?i)\blatex\s*/\s*pdf\s+output\b",
r"(?i)\bpdf\s+output\b",
r"(?i)\blatex\s+output\b",
r"(?i)\bmarkdown\s+output\b",
r"(?i)\bfor\s+latex\s*/\s*pdf\b",
]
for pattern in patterns:
text = re.sub(pattern, "", text)
text = re.sub(r"\s+", " ", text).strip(" ,;:-")
return text or topic
_LEGACY_PIPELINE_ALIASES = {
"idea-finder": "idea-brainstorm",
"idea-finder.pipeline.md": "idea-brainstorm",
"pipelines/idea-finder.pipeline.md": "idea-brainstorm",
}
def find_repo_root(start: Path | None = None) -> Path:
"""Walk up from *start* (default: this file) looking for AGENTS.md.
Checks at most 10 path levels and raises FileNotFoundError if the marker is not found.
"""
candidate = (start or Path(__file__)).resolve()
for _ in range(10):
if (candidate / "AGENTS.md").exists():
return candidate
parent = candidate.parent
if parent == candidate:
break
candidate = parent
raise FileNotFoundError("Could not find repo root (AGENTS.md marker)")
def _normalize_pipeline_lock_value(value: str) -> str:
raw = str(value or "").strip()
return _LEGACY_PIPELINE_ALIASES.get(raw, raw)
def resolve_pipeline_spec_path(*, repo_root: Path, pipeline_value: str) -> Path | None:
value = _normalize_pipeline_lock_value(pipeline_value)
if not value:
return None
candidate = Path(value)
if candidate.is_absolute() and candidate.exists():
return candidate.resolve()
rel_candidate = repo_root / value
if rel_candidate.exists():
return rel_candidate.resolve()
filename = Path(value).name
if filename:
direct = repo_root / "pipelines" / filename
if direct.exists():
return direct.resolve()
stem = filename
if stem.endswith(".pipeline.md"):
stem = stem[: -len(".pipeline.md")]
if stem:
direct = repo_root / "pipelines" / f"{stem}.pipeline.md"
if direct.exists():
return direct.resolve()
return None
def load_workspace_pipeline_spec(workspace: Path):
from tooling.pipeline_spec import PipelineSpec
try:
repo_root = find_repo_root(workspace)
except FileNotFoundError:
repo_root = Path(__file__).resolve().parents[1]
lock_path = workspace / "PIPELINE.lock.md"
if not lock_path.exists():
return None
pipeline_name = ""
try:
for raw in lock_path.read_text(encoding="utf-8", errors="ignore").splitlines():
line = raw.strip()
if line.startswith("pipeline:"):
pipeline_name = line.split(":", 1)[1].strip()
break
except Exception:
return None
if not pipeline_name:
return None
spec_path = resolve_pipeline_spec_path(repo_root=repo_root, pipeline_value=pipeline_name)
if spec_path is None:
return None
try:
return PipelineSpec.load(spec_path)
except Exception:
return None
def pipeline_query_defaults(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
return dict(spec.query_defaults) if spec is not None else {}
def pipeline_quality_contract(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
return dict(spec.quality_contract) if spec is not None else {}
def pipeline_quality_contract_value(workspace: Path, *keys: str, default: Any = None) -> Any:
current: Any = pipeline_quality_contract(workspace)
for key in keys:
if not isinstance(current, dict):
return default
current = current.get(str(key))
if current is None:
return default
return current
def pipeline_query_default(workspace: Path, key: str, default: Any = None) -> Any:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return default
return spec.query_default(key, default)
def pipeline_overridable_query_fields(workspace: Path) -> set[str]:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return set()
return set(spec.overridable_query_fields)
def pipeline_profile(workspace: Path) -> str:
"""Return the pipeline profile for a workspace.
Reads PIPELINE.lock.md to get the pipeline name, then loads the
pipeline spec file and reads its ``profile`` frontmatter field.
Falls back to ``"default"`` if anything is missing.
"""
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return "default"
return str(spec.profile or "default").strip() or "default"
def latest_outline_state(workspace: Path) -> dict[str, Any]:
path = Path(workspace).resolve() / "outline" / "outline_state.jsonl"
records = [rec for rec in read_jsonl(path) if isinstance(rec, dict)]
return dict(records[-1]) if records else {}
def _materialize_missing_query_defaults(lines: list[str], query_defaults: dict[str, Any], *, allowed_fields: set[str] | None = None) -> list[str]:
if not query_defaults:
return lines
existing_keys: set[str] = set()
for raw in lines:
stripped = raw.strip()
if not stripped.startswith("- ") or ":" not in stripped:
continue
key = stripped[2:].split(":", 1)[0].strip().lower().replace(" ", "_").replace("-", "_")
if key:
existing_keys.add(key)
additions: list[str] = []
for key, value in query_defaults.items():
norm_key = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
if not norm_key or norm_key in existing_keys:
continue
if allowed_fields and norm_key not in allowed_fields:
continue
rendered = _render_query_scalar(value)
if rendered is None:
continue
additions.append(f'- {norm_key}: "{rendered}"')
if not additions:
return lines
out = list(lines)
if out and out[-1].strip():
out.append("")
out.extend(additions)
return out
def _render_query_scalar(value: Any) -> str | None:
if value is None:
return None
# Check bool before int: bool is a subclass of int, so True would otherwise render as "True".
if isinstance(value, bool):
return "true" if value else "false"
if isinstance(value, (int, float)):
return str(value)
if isinstance(value, str):
text = value.strip()
return text or None
return None
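The default-materialization helpers at the end of this module can be exercised in isolation. The following is a minimal sketch, assuming a plain list of query lines and a plain dict of defaults. The names `render_query_scalar`, `normalize_key`, and `materialize_missing` are illustrative stand-ins that mirror the logic of `_render_query_scalar` and `_materialize_missing_query_defaults`; they are not part of the module, and the sample keys and values are made up.

```python
from __future__ import annotations

from typing import Any


def render_query_scalar(value: Any) -> str | None:
    """Render a scalar as text; None means 'do not emit this default'."""
    if value is None:
        return None
    # bool is a subclass of int, so it has to be tested first.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        return value.strip() or None
    return None  # lists, dicts, etc. are not scalars


def normalize_key(key: Any) -> str:
    """Lowercase the key and map spaces/hyphens to underscores."""
    return str(key or "").strip().lower().replace(" ", "_").replace("-", "_")


def materialize_missing(
    lines: list[str],
    defaults: dict[str, Any],
    allowed: set[str] | None = None,
) -> list[str]:
    """Append `- key: "value"` lines for defaults not already present."""
    existing: set[str] = set()
    for raw in lines:
        stripped = raw.strip()
        if stripped.startswith("- ") and ":" in stripped:
            existing.add(normalize_key(stripped[2:].split(":", 1)[0]))
    additions: list[str] = []
    for key, value in defaults.items():
        norm = normalize_key(key)
        if not norm or norm in existing:
            continue
        # An empty or missing allow-list disables filtering, as in the original.
        if allowed and norm not in allowed:
            continue
        rendered = render_query_scalar(value)
        if rendered is not None:
            additions.append(f'- {norm}: "{rendered}"')
    if not additions:
        return lines
    out = list(lines)
    if out and out[-1].strip():
        out.append("")  # blank separator before the appended defaults
    return out + additions


# Hypothetical sample input, not taken from a real workspace.
lines = ["- keywords:", '  - "diffusion"', '- max_results: "300"']
defaults = {"max-results": 1800, "core_size": 300, "enrich_metadata": True, "notes": ["x"]}
result = materialize_missing(lines, defaults)
print("\n".join(result))
```

In this sketch `max-results` is skipped because it normalizes to the existing `max_results` key, and `notes` is skipped because a list is not a renderable scalar. Only `core_size` and `enrich_metadata` are appended.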
FILE:tooling/executor.py
from __future__ import annotations
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path
from tooling.common import (
UnitsTable,
atomic_write_text,
decisions_has_approval,
ensure_dir,
latest_outline_state,
load_workspace_pipeline_spec,
now_iso_seconds,
parse_semicolon_list,
set_decisions_approval,
update_status_field,
update_status_log,
)
@dataclass(frozen=True)
class RunResult:
unit_id: str | None
status: str
message: str
def _section_first_cutover_block_message(*, workspace: Path, outputs: list[str]) -> str | None:
spec = load_workspace_pipeline_spec(workspace)
if spec is None or str(spec.structure_mode or "").strip().lower() != "section_first":
return None
if "outline/outline_state.jsonl" not in outputs:
return None
latest = latest_outline_state(workspace)
if not latest:
return "Section-first cutover is not actionable yet: `outline/outline_state.jsonl` is missing or empty. Rerun `outline-refiner` after fixing the chapter/section layer."
structure_phase = str(latest.get("structure_phase") or "").strip()
h3_status = str(latest.get("h3_status") or "").strip().lower()
reroute_target = str(latest.get("reroute_target") or "").strip()
retry_budget = str(latest.get("retry_budget_remaining") or "").strip()
reroute_reason = str(latest.get("reroute_reason") or "").strip()
if structure_phase.lower() == "decomposed" and h3_status == "stable":
return None
missing: list[str] = []
for key in ("structure_phase", "h3_status", "approval_status", "reroute_target", "retry_budget_remaining"):
if key not in latest:
missing.append(key)
continue
if key in {"structure_phase", "h3_status"} and not str(latest.get(key) or "").strip():
missing.append(key)
if missing:
return (
"Section-first cutover cannot advance because the latest `outline_state.jsonl` record is missing "
f"required fields: {', '.join(missing)}."
)
reroute_label = reroute_target or "unknown"
retry_label = retry_budget or "unknown"
reason_suffix = f" Reason: {reroute_reason}." if reroute_reason else ""
return (
"Section-first cutover is still blocked; "
f"latest outline state has structure_phase={structure_phase or 'missing'}, "
f"h3_status={h3_status or 'missing'}, reroute_target={reroute_label}, "
f"retry_budget_remaining={retry_label}.{reason_suffix} Fix the reroute target explicitly, then rerun this unit."
)
def _append_run_error(*, workspace: Path, unit_id: str, skill: str, kind: str, message: str, log_rel: str | None) -> None:
"""Append a short failure record to `output/RUN_ERRORS.md` (workspace-local).
This is a human-facing error sink that survives reruns and makes BLOCKED states debuggable.
"""
try:
ensure_dir(workspace / "output")
out_path = workspace / "output" / "RUN_ERRORS.md"
stamp = now_iso_seconds()
log_hint = f" (log: `{log_rel}`)" if log_rel else ""
line = f"- {stamp} `{unit_id}` `{skill}` `{kind}`: {message}{log_hint}"
if out_path.exists() and out_path.stat().st_size > 0:
prev = out_path.read_text(encoding="utf-8", errors="ignore").rstrip() + "\n"
else:
prev = "# Run errors\n\n"
atomic_write_text(out_path, prev + line + "\n")
except Exception:
# Never let the runner crash while trying to log an error.
return
def run_one_unit(
*,
workspace: Path,
repo_root: Path,
strict: bool = False,
auto_approve: set[str] | None = None,
) -> RunResult:
units_path = workspace / "UNITS.csv"
status_path = workspace / "STATUS.md"
if not units_path.exists():
return RunResult(unit_id=None, status="ERROR", message=f"Missing {units_path}")
table = UnitsTable.load(units_path)
runnable_idx = _find_first_runnable(table)
if runnable_idx is None:
return RunResult(unit_id=None, status="IDLE", message="No runnable unit found")
row = table.rows[runnable_idx]
unit_id = row.get("unit_id", "").strip()
skill = row.get("skill", "").strip()
owner = row.get("owner", "").strip().upper()
row["status"] = "DOING"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DOING {skill}")
auto_approve_set = {str(x or "").strip().upper() for x in (auto_approve or set()) if str(x or "").strip()}
if owner == "HUMAN":
checkpoint = row.get("checkpoint", "").strip()
if checkpoint and decisions_has_approval(workspace / "DECISIONS.md", checkpoint):
row["status"] = "DONE"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (HUMAN approved {checkpoint})")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="DONE", message=f"HUMAN approved {checkpoint}")
if checkpoint and checkpoint.upper() in auto_approve_set:
set_decisions_approval(workspace / "DECISIONS.md", checkpoint, approved=True)
row["status"] = "DONE"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (AUTO approved {checkpoint})")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="DONE", message=f"AUTO approved {checkpoint}")
row["status"] = "BLOCKED"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (await HUMAN approval {checkpoint})")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Await HUMAN approval {checkpoint} in DECISIONS.md")
script_path = repo_root / "scripts" / "run.py"
if not script_path.exists():
row["status"] = "BLOCKED"
table.save(units_path)
skill_md = "SKILL.md"
update_status_log(
status_path,
(
f"{now_iso_seconds()} {unit_id} BLOCKED "
f"(no script for {skill}; run manually per {skill_md} then mark DONE)"
),
)
_refresh_status_checkpoint(status_path, table)
return RunResult(
unit_id=unit_id,
status="BLOCKED",
message=(
f"No executable script for skill '{skill}'. "
f"Run it manually by following `{skill_md}`, write the required outputs, "
f"then mark the unit DONE (e.g., `python scripts/pipeline.py mark --workspace {workspace} --unit-id {unit_id} --status DONE`)."
),
)
inputs = parse_semicolon_list(row.get("inputs"))
raw_outputs = parse_semicolon_list(row.get("outputs"))
outputs = [_strip_optional_marker(rel) for rel in raw_outputs]
required_outputs = [outputs[i] for i, rel in enumerate(raw_outputs) if not rel.strip().startswith("?")]
checkpoint = row.get("checkpoint", "").strip()
cmd = [
sys.executable,
str(script_path),
"--workspace",
str(workspace),
"--unit-id",
unit_id,
"--inputs",
";".join(inputs),
"--outputs",
";".join(outputs),
"--checkpoint",
checkpoint,
]
log_rel = f"output/unit_logs/{unit_id}.{skill}.log"
log_path = workspace / log_rel
try:
completed = subprocess.run(cmd, check=False, capture_output=True, text=True)
if completed.stdout or completed.stderr or completed.returncode != 0:
ensure_dir(log_path.parent)
body = [
"# Unit log\n",
f"- unit_id: {unit_id}\n",
f"- skill: {skill}\n",
f"- exit: {completed.returncode}\n",
f"- cmd: {' '.join(cmd)}\n",
"\n## stdout\n\n",
(completed.stdout or "(empty)") + "\n",
"\n## stderr\n\n",
(completed.stderr or "(empty)") + "\n",
]
atomic_write_text(log_path, "".join(body))
except Exception as exc: # pragma: no cover
row["status"] = "BLOCKED"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (exec error)")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="exec_error",
message=f"{type(exc).__name__}: {exc}",
log_rel=None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=str(exc))
missing = [rel for rel in required_outputs if rel and not (workspace / rel).exists()]
if completed.returncode == 0 and not missing:
if strict:
from tooling.quality_gate import check_unit_outputs, write_quality_report
try:
issues = check_unit_outputs(skill=skill, workspace=workspace, outputs=outputs)
except Exception as exc: # pragma: no cover
from tooling.quality_gate import QualityIssue
issues = [
QualityIssue(
code="quality_gate_exception",
message=f"Quality gate crashed: {type(exc).__name__}: {exc}",
)
]
# Rewrite the report when one already exists, so a stale QUALITY_GATE.md does not linger after a passing run.
report_path = workspace / "output" / "QUALITY_GATE.md"
if issues or report_path.exists():
write_quality_report(workspace=workspace, unit_id=unit_id, skill=skill, issues=issues)
if issues:
row["status"] = "BLOCKED"
table.save(units_path)
rel_report = str(report_path.relative_to(workspace))
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (quality gate: {rel_report})")
_refresh_status_checkpoint(status_path, table)
reroute_hint = _reroute_hint(workspace)
return RunResult(
unit_id=unit_id,
status="BLOCKED",
message=f"Quality gate failed; see {rel_report}" + (f"; {reroute_hint}" if reroute_hint else ""),
)
cutover_block = _section_first_cutover_block_message(workspace=workspace, outputs=outputs)
if cutover_block:
row["status"] = "BLOCKED"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (section-first cutover)")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="section_first_cutover",
message=cutover_block,
log_rel=log_rel if log_path.exists() else None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=cutover_block)
row["status"] = "DONE"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE {skill}")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="DONE", message="OK")
row["status"] = "BLOCKED"
table.save(units_path)
if missing:
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (missing outputs: {', '.join(missing)})")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="missing_outputs",
message=f"Missing outputs: {', '.join(missing)}",
log_rel=log_rel if log_path.exists() else None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Missing outputs: {', '.join(missing)}" + (f"; see {log_rel}" if log_path.exists() else ""))
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (script failed)")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="script_failed",
message=f"Skill script failed (exit {completed.returncode})",
log_rel=log_rel if log_path.exists() else None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Skill script failed (exit {completed.returncode})" + (f"; see {log_rel}" if log_path.exists() else ""))
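def _find_first_runnable(table: UnitsTable) -> int | None:
"""Return the index of the first TODO/BLOCKED unit whose dependencies are all DONE or SKIP, else None."""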
status_ok = {"DONE", "SKIP"}
unit_by_id = {row.get("unit_id", ""): row for row in table.rows}
for idx, row in enumerate(table.rows):
if row.get("status", "").strip().upper() not in {"TODO", "BLOCKED"}:
continue
deps = parse_semicolon_list(row.get("depends_on"))
if not deps:
return idx
deps_done = True
for dep_id in deps:
dep = unit_by_id.get(dep_id)
if not dep:
deps_done = False
break
if dep.get("status", "").strip().upper() not in status_ok:
deps_done = False
break
if deps_done:
return idx
return None
def _refresh_status_checkpoint(status_path: Path, table: UnitsTable) -> None:
checkpoint = _compute_current_checkpoint(table)
update_status_field(status_path, "Current checkpoint", checkpoint)
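def _compute_current_checkpoint(table: UnitsTable) -> str:
"""Return the checkpoint of the first unit that is not DONE/SKIP (default "C0"), or "DONE" when all units are finished."""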
for row in table.rows:
if row.get("status", "").strip().upper() not in {"DONE", "SKIP"}:
return (row.get("checkpoint") or "").strip() or "C0"
return "DONE"
def invalidate_downstream_units(table: UnitsTable, *, root_unit_id: str) -> list[str]:
"""Reset all transitive downstream dependents of `root_unit_id` to TODO.
This is used when a previously satisfied upstream unit is reopened for rerun.
Keeping downstream units as DONE would otherwise leave stale artifacts in place
and make later `run` invocations stop too early.
"""
root = str(root_unit_id or "").strip()
if not root:
return []
direct_children: dict[str, list[dict[str, str]]] = {}
for row in table.rows:
for dep in parse_semicolon_list(row.get("depends_on")):
direct_children.setdefault(dep, []).append(row)
affected: list[str] = []
seen: set[str] = set()
stack = [root]
while stack:
current = stack.pop()
for child in direct_children.get(current, []):
child_id = str(child.get("unit_id") or "").strip()
if not child_id or child_id in seen:
continue
seen.add(child_id)
if str(child.get("status") or "").strip().upper() != "TODO":
child["status"] = "TODO"
affected.append(child_id)
stack.append(child_id)
return affected
def _strip_optional_marker(relpath: str) -> str:
"""Strip the leading "?" that marks an output path as optional."""
relpath = (relpath or "").strip()
if relpath.startswith("?"):
return relpath[1:].strip()
return relpath
def _reroute_hint(workspace: Path) -> str:
path = workspace / "output" / "REROUTE_STATE.json"
if not path.exists() or path.stat().st_size <= 0:
return ""
try:
import json
data = json.loads(path.read_text(encoding="utf-8", errors="ignore") or "{}")
except Exception:
return ""
if not isinstance(data, dict):
return ""
target = str(data.get("reroute_target") or "").strip()
status = str(data.get("status") or "").strip()
phase = str(data.get("structure_phase") or "").strip()
h3 = str(data.get("h3_status") or "").strip()
reason = str(data.get("reroute_reason") or "").strip()
if not any([target, status, phase, h3]):
return ""
parts = []
if status:
parts.append(f"reroute_status={status}")
if target:
parts.append(f"reroute_target={target}")
if phase:
parts.append(f"structure_phase={phase}")
if h3:
parts.append(f"h3_status={h3}")
if reason:
parts.append(f"reason={reason}")
return ", ".join(parts)
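The scheduling rules in this file (pick the first TODO/BLOCKED unit whose dependencies are DONE or SKIP, and reset all transitive dependents when an upstream unit is reopened) can be illustrated without `UnitsTable`. The following is a minimal sketch, assuming rows are plain dicts. `split_deps` is a hypothetical stand-in for `parse_semicolon_list`, whose implementation is not shown here, and the unit ids are invented.

```python
from __future__ import annotations


def split_deps(value: str | None) -> list[str]:
    """Hypothetical stand-in for parse_semicolon_list: split on ';' and trim."""
    return [part.strip() for part in (value or "").split(";") if part.strip()]


def find_first_runnable(rows: list[dict[str, str]]) -> int | None:
    """Index of the first TODO/BLOCKED row whose deps are all DONE/SKIP."""
    finished = {"DONE", "SKIP"}
    by_id = {row.get("unit_id", ""): row for row in rows}
    for idx, row in enumerate(rows):
        if row.get("status", "").strip().upper() not in {"TODO", "BLOCKED"}:
            continue
        deps = split_deps(row.get("depends_on"))
        # An unknown dependency id counts as unfinished.
        if all(by_id.get(d, {}).get("status", "").strip().upper() in finished for d in deps):
            return idx
    return None


def invalidate_downstream(rows: list[dict[str, str]], root_unit_id: str) -> list[str]:
    """Reset every transitive dependent of root_unit_id to TODO."""
    root = str(root_unit_id or "").strip()
    if not root:
        return []
    children: dict[str, list[dict[str, str]]] = {}
    for row in rows:
        for dep in split_deps(row.get("depends_on")):
            children.setdefault(dep, []).append(row)
    affected: list[str] = []
    seen: set[str] = set()  # guards against revisiting shared or cyclic deps
    stack = [root]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            child_id = child.get("unit_id", "").strip()
            if not child_id or child_id in seen:
                continue
            seen.add(child_id)
            if child.get("status", "").strip().upper() != "TODO":
                child["status"] = "TODO"
                affected.append(child_id)
            stack.append(child_id)
    return affected


# Hypothetical unit table: U004 waits on both U001 and U003.
rows = [
    {"unit_id": "U001", "status": "DONE", "depends_on": ""},
    {"unit_id": "U002", "status": "DONE", "depends_on": "U001"},
    {"unit_id": "U003", "status": "DONE", "depends_on": "U002"},
    {"unit_id": "U004", "status": "TODO", "depends_on": "U001;U003"},
]
first_before = find_first_runnable(rows)
affected = invalidate_downstream(rows, "U001")
first_after = find_first_runnable(rows)
print(first_before, affected, first_after)
```

Before invalidation the only runnable unit is `U004`. After reopening the dependents of `U001`, the units `U002` and `U003` return to TODO, so the next run resumes at `U002` rather than at `U004`, which would otherwise consume stale artifacts.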
FILE:tooling/ideation.py
from __future__ import annotations
import json
import re
from dataclasses import asdict, dataclass, is_dataclass
from pathlib import Path
from typing import Any, Iterable
from tooling.common import (
atomic_write_text,
ensure_dir,
load_workspace_pipeline_spec,
pipeline_overridable_query_fields,
pipeline_query_default,
read_jsonl,
)
DEFAULT_IDEA_RUBRIC: list[tuple[str, float, str]] = [
("discussion_worthiness", 0.24, "is this direction worth a serious PI/PhD discussion right now?"),
("academic_value", 0.22, "could it open a meaningful paper/thesis-worthy research angle?"),
("evidence_grounding", 0.18, "can we point to concrete literature tensions rather than pure speculation?"),
("direction_distinctness", 0.16, "is it genuinely different from neighboring directions, not just a wording variant?"),
("first_probe_clarity", 0.10, "is there a plausible low-cost first probe that would teach us something?"),
("thesis_potential", 0.10, "does it plausibly grow beyond a one-off note into a stronger line of work?"),
]
STOPWORDS = {
"a", "an", "and", "the", "of", "to", "for", "in", "on", "with", "via", "by", "from", "or",
"work", "works", "study", "studies", "survey", "review", "benchmarks", "benchmark", "evaluation",
"agents", "agent", "model", "models", "llm", "large", "language", "using", "based",
}
CLUSTER_AXIS_HINTS: list[tuple[list[str], list[str], str, str]] = [
(["agent loop", "action"], ["observability granularity", "action-space design", "tool/environment boundary"], "mechanism", "Could sharpen how we think about the basic agent loop rather than adding yet another full stack."),
(["tool", "orchestration", "interface"], ["tool routing", "permission model", "API reliability assumptions"], "systems", "Could clarify whether interface design is the real bottleneck behind seemingly agent-level gains."),
(["planning", "reasoning"], ["search depth", "verification loop", "partial-observability handling"], "mechanism", "Could turn vague talk about reasoning quality into a sharper research question about what actually drives robust planning."),
(["memory", "retrieval", "rag"], ["retrieval policy", "state summarization cadence", "memory horizon"], "mechanism", "Could reveal when memory is a genuine capability lever versus a reporting convenience."),
(["self-improvement", "adaptation"], ["feedback type", "adaptation horizon", "evaluation signal quality"], "adaptation", "Could distinguish genuine improvement from evaluation-sensitive self-optimization."),
(["multi-agent", "coordination"], ["role specialization", "communication budget", "aggregation rule"], "coordination", "Could separate coordination benefits from simple redundancy or verification effects."),
(["benchmark", "evaluation"], ["metric choice", "task slice design", "failure-analysis lens"], "evaluation", "Could make evaluation choices themselves a research object instead of hidden background assumptions."),
(["safety", "security", "governance"], ["threat model", "monitoring granularity", "permission boundary"], "governance", "Could connect technical agent behaviors to deployment-relevant risk questions in a more principled way."),
]
GENERIC_LIMITATION_PATTERNS = [
"abstract-level evidence only",
"validate assumptions",
"full paper",
"before relying on this as key evidence",
]
BENCHMARK_HINTS = [
"HotpotQA", "FEVER", "ALFWorld", "WebShop", "HumanEval", "WebArena", "AgentBench",
"GSM8K", "MATH", "Wikipedia", "wiki", "API", "question answering", "fact verification",
]
AXIS_INSIGHT_LIBRARY: dict[str, dict[str, str]] = {
"observability granularity": {
"question": "whether several agent-loop gains are really planner gains, or whether they mostly come from changing what the planner gets to observe and when",
"confound": "planner quality and broader agent competence",
"thesis": "Several agent-loop gains remain hard to interpret because papers often improve observation access at the same time they improve the planner; before crediting planner depth, we should ask what the system was allowed to see.",
"insight": "A convincing result would not just move aggregate score. It would show which published gains survive once observation access is fixed, ideally producing a regime map of when observability—not planner depth—changes the failure story.",
"contribution_shape": "Could yield a causal-attribution result plus a reporting rule for agent-loop papers: claims about planning quality should specify and control observation access.",
"demotion": "This direction weakens sharply if the strongest prior work already holds observation access fixed while planner quality changes and still reports the same gain pattern.",
"kill_signal": "an anchor paper already fixes observation access while varying planner quality and the main conclusion still survives",
"missing_piece": "What is missing is a fixed-interface, fixed-budget comparison that varies only observation access and tracks whether the failure taxonomy—not just average score—changes.",
"program_kind": "causal attribution",
"time_to_clarity": "fast",
"priority_note": "Fastest to falsify on public tasks, and the answer would immediately change how several agent-loop results are interpreted.",
"reading_extract": "what the agent sees at each step, which ablations vary planner depth, and whether the action interface stays fixed",
"title": "Observability granularity vs planner depth",
},
"action-space design": {
"question": "whether robustness comes from better reasoning or from giving the agent a cleaner action interface",
"confound": "the shape of the action vocabulary and interface design",
"thesis": "Some agent-loop gains may be interface gains in disguise: when action vocabularies become cleaner or narrower, papers can attribute robustness to reasoning improvements that partly come from action-space design.",
"insight": "A useful result would show whether the same nominal planner still behaves very differently once the action space is widened, normalized, or made less ergonomic.",
"contribution_shape": "Could produce an action-space normalization protocol and a clearer account of which agent claims survive once interface ergonomics are controlled.",
"demotion": "This direction weakens if current papers already compare equivalent action spaces and still obtain the same ranking.",
"kill_signal": "the key anchor papers already normalize action vocabularies or API surfaces and still see the same ordering",
"missing_piece": "What is missing is an action-space-normalized comparison that keeps planner prompts fixed while changing only interface granularity and affordances.",
"program_kind": "interface normalization",
"time_to_clarity": "medium",
"priority_note": "Interesting and still thesis-relevant, but less immediate than observability because interface normalization usually requires more careful task redesign.",
"reading_extract": "how many actions are available, how semantically aligned the actions are, and whether interface cleanup happens together with reasoning changes",
"title": "Action-space design or agent competence?",
},
"tool/environment boundary": {
"question": "whether reliability depends more on boundary design between tool and environment than on the nominal reasoning loop",
"confound": "tool abstraction and environment modeling",
"thesis": "A number of agent results may really be about boundary design: where the system stops reasoning internally and starts delegating to tools can change reliability without changing the nominal planner much.",
"insight": "A strong result would show that several agent-level conclusions are actually boundary-design conclusions once tool abstraction is normalized.",
"contribution_shape": "Could yield a boundary-design taxonomy and a cleaner experimental recipe for separating planner quality from tool/interface scaffolding.",
"demotion": "This direction weakens if changing the boundary carefully barely affects the failure story.",
"kill_signal": "careful tool-boundary changes leave both the metric and the qualitative failure modes essentially unchanged",
"missing_piece": "What is missing is a comparison that keeps task semantics fixed while moving the tool/environment boundary in a controlled way.",
"program_kind": "systems boundary",
"time_to_clarity": "medium",
"priority_note": "Valuable as a systems-facing thesis line, but slower to turn into a decisive first readout than the lead mechanism questions.",
"reading_extract": "where tool calls begin, what state is exposed to the planner, and whether environment modeling changes together with the tool boundary",
"title": "Where the tool boundary really matters",
},
"search depth": {
"question": "whether apparent planning gains reflect deeper search or simply more inference-time budget to recover from weak initial choices",
"confound": "inference-time compute budget",
"thesis": "Many planning results still bundle depth with budget: deeper search often spends more tokens, branches, or retries, so it remains unclear whether depth changes reasoning quality or just buys more recovery opportunities.",
"insight": "A convincing result would show whether depth changes the nature of planning failures under a matched budget, rather than merely delaying failure by spending more compute.",
"contribution_shape": "Could produce a compute-normalized planning benchmark slice and a regime map for when search depth matters beyond extra inference budget.",
"demotion": "This direction weakens if prior work already equalizes budget and still finds depth-specific gains.",
"kill_signal": "the anchor papers already normalize token or wall-clock budget and the depth advantage remains intact",
"missing_piece": "What is missing is a compute-normalized study that holds token or wall-clock budget fixed while varying depth or branching, then checks whether failure modes actually change.",
"program_kind": "budget-normalized mechanism",
"time_to_clarity": "medium",
"priority_note": "Still thesis-sized, but it ranks behind observability because compute normalization is harder to defend and easier for prior work to have addressed already.",
"reading_extract": "the reported depth or branching settings, the effective compute budget, and whether shallow baselines were budget-matched",
"title": "Search depth or compute budget?",
},
"verification loop": {
"question": "whether verification contributes a distinct reasoning mechanism or mostly acts as expensive redundancy",
"confound": "extra compute and repeated checking",
"thesis": "Verification-heavy pipelines may look stronger because they retry, re-check, or filter more often, not necessarily because they add a distinct reasoning mechanism.",
"insight": "A useful result would separate verification as a mechanism from verification as a compute-expensive retry policy.",
"contribution_shape": "Could produce a cleaner protocol for comparing verification loops under matched retry or compute budgets.",
"demotion": "This direction weakens if verification can be replaced by simple repetition with little change in behavior.",
"kill_signal": "repeat-sampling or retry baselines already match the reported verification gain",
"missing_piece": "What is missing is a matched-budget comparison between explicit verification and simple repetition or reranking baselines.",
"program_kind": "verification audit",
"time_to_clarity": "medium",
"priority_note": "Decision-relevant when the group suspects redundancy masquerading as reasoning, though the probe is less clean than the top two directions.",
"reading_extract": "how many extra passes are spent on checking, whether retry baselines exist, and which errors verification actually fixes",
"title": "Verification or just expensive redundancy?",
},
"partial-observability handling": {
"question": "whether planning systems fail because they reason badly or because they are brittle under missing state information",
"confound": "state visibility",
"thesis": "Some planning failures may be epistemic before they are algorithmic: systems can look like weak planners when they are actually brittle under missing state information.",
"insight": "A strong result would show whether improved visibility changes the failure taxonomy more than planner changes do.",
"contribution_shape": "Could produce a planning-under-partial-observability benchmark slice and a clearer division between epistemic and algorithmic failure modes.",
"demotion": "This direction weakens if stronger visibility barely changes the failure pattern.",
"kill_signal": "improved visibility leaves both success rates and error types almost unchanged",
"missing_piece": "What is missing is a task slice that varies only state visibility while keeping planner scaffolding steady.",
"program_kind": "epistemic failure analysis",
"time_to_clarity": "medium",
"priority_note": "Useful if the group wants a deeper failure-analysis program, but it is currently less concrete than the lead directions.",
"reading_extract": "what information is hidden, which observations are restored by tools, and whether planner quality is tested separately from visibility",
"title": "Is planning failure really an observability problem?",
},
"retrieval policy": {
"question": "whether memory gains come from better stored knowledge or from better decisions about when and how retrieval is triggered",
"confound": "memory content versus retrieval timing",
"thesis": "Reported memory gains are often hard to interpret because memory content and retrieval triggers move together; some improvements may be protocol effects about when retrieval happens, not capability effects about what memory stores.",
"insight": "A convincing result would separate memory-content effects from retrieval-trigger effects and show whether the interpretation of current RAG-style gains changes once trigger policy is normalized.",
"contribution_shape": "Could yield a cleaner memory-evaluation protocol and a reusable result on when retrieval policy, rather than memory content, is the true hidden variable.",
"demotion": "This direction weakens if current work already varies retrieval policy independently of memory content and sees the same story.",
"kill_signal": "the main memory papers already keep the memory store fixed while varying retrieval triggers or timing and the conclusion does not move",
"missing_piece": "What is missing is a fixed-memory comparison that varies retrieval trigger and timing only, then checks whether the gain survives and which errors it actually removes.",
"program_kind": "protocol sensitivity",
"time_to_clarity": "medium",
"priority_note": "Different enough from the planning directions to stay in the lead set, but it ranks lower because the current evidence is thinner and more protocol-sensitive.",
"reading_extract": "what is stored in memory, when retrieval fires, what retrieval budget is allowed, and whether trigger policy is ever ablated independently",
"title": "Retrieval policy or memory content?",
},
"state summarization cadence": {
"question": "whether summarization helps because it improves reasoning or because it selectively compresses away troublesome state information",
"confound": "what gets forgotten versus what gets highlighted",
"thesis": "Summarization may help less because the agent reasons better and more because the system filters state in a favorable way; cadence choices can quietly redefine what information survives.",
"insight": "A useful result would show whether different summarization cadences preserve the same conclusions and failure taxonomy once memory content is held steady.",
"contribution_shape": "Could produce a summarization-control protocol for long-horizon agents and a clearer account of how state compression alters conclusions.",
"demotion": "This direction weakens if different summarization cadences preserve the same conclusions and failure taxonomy.",
"kill_signal": "changing summarization cadence leaves both retrieval behavior and downstream errors largely unchanged",
"missing_piece": "What is missing is a fixed-memory, fixed-task comparison that varies only summarization cadence and records what state is lost or amplified.",
"program_kind": "state compression",
"time_to_clarity": "medium",
"priority_note": "Promising but currently less mature than retrieval-policy framing because the concrete prior-work hooks are thinner.",
"reading_extract": "what gets summarized away, how often summaries are rewritten, and whether failure cases correlate with state compression choices",
"title": "What summarization cadence is really changing",
},
"memory horizon": {
"question": "whether longer memory helps because more context is useful or because evaluations reward persistence over selectivity",
"confound": "context length and evaluation design",
"thesis": "Longer memory can look like better capability even when the benchmark mainly rewards persistence or repeated access to earlier state, so horizon effects may partly be evaluation-design effects.",
"insight": "A convincing result would show whether extending horizon changes task framing and error patterns, not just aggregate score.",
"contribution_shape": "Could produce a regime map for when horizon length matters, versus when selective retrieval or better task design explains the gain.",
"demotion": "This direction weakens if memory horizon has little effect once retrieval policy is controlled.",
"kill_signal": "memory-length changes stop mattering once retrieval triggers or evaluation design are normalized",
"missing_piece": "What is missing is a horizon study that matches retrieval policy and benchmark framing while varying how much state is retained.",
"program_kind": "regime mapping",
"time_to_clarity": "medium",
"priority_note": "Conceptually useful, but currently less decisive than retrieval-policy framing for a first discussion round.",
"reading_extract": "whether longer context actually changes retrieval choices, and whether the benchmark rewards persistence more than selective recall",
"title": "Does longer memory really help?",
},
}
@dataclass(frozen=True)
class IdeaSignal:
    signal_id: str
    cluster: str
    direction_type: str
    theme: str
    claim_or_observation: str
    tension: str
    missing_piece: str
    possible_axis: str
    academic_value: str
    evidence_confidence: str
    paper_ids: list[str]


@dataclass(frozen=True)
class DirectionCard:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    focus_axis: str
    main_confound: str
    program_kind: str
    contribution_shape: str
    time_to_clarity: str
    one_line_thesis: str
    why_interesting: str
    literature_suggests: list[str]
    closest_prior_gap: list[str]
    missing_piece: str
    possible_variants: list[str]
    academic_value: str
    first_probes: list[str]
    what_counts_as_insight: str
    weakness_conditions: list[str]
    kill_criteria: list[str]
    what_would_change_mind: list[str]
    best_fit: str
    why_this_ranks_here: str
    evidence_confidence: str
    paper_ids: list[str]
    signal_ids: list[str]
    anchor_reading_notes: list[dict[str, str]]


@dataclass(frozen=True)
class ScreenedDirection:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    total_score: float
    discussion_worthiness: int
    academic_value_score: int
    evidence_grounding: int
    direction_distinctness: int
    first_probe_clarity: int
    thesis_potential: int
    recommendation: str
    rationale: str
def read_core_set(path: Path) -> list[dict[str, str]]:
    import csv

    if not path.exists():
        return []
    with path.open("r", encoding="utf-8", newline="") as handle:
        return [dict(row) for row in csv.DictReader(handle)]


def write_jsonl(path: Path, rows: Iterable[dict[str, Any] | Any]) -> None:
    ensure_dir(path.parent)
    encoded: list[str] = []
    for row in rows:
        value = asdict(row) if is_dataclass(row) else row
        encoded.append(json.dumps(value, ensure_ascii=False))
    atomic_write_text(path, "\n".join(encoded).rstrip() + ("\n" if encoded else ""))


def write_json(path: Path, data: Any) -> None:
    ensure_dir(path.parent)
    atomic_write_text(path, json.dumps(data, ensure_ascii=False, indent=2).rstrip() + "\n")


def uniq_keep_order(items: Iterable[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or "").strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out


def slugify(text: str) -> str:
    raw = re.sub(r"[^A-Za-z0-9]+", "-", str(text or "").strip().lower())
    return raw.strip("-") or "x"
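def clean_text(text: str, *, limit: int = 220) -> str:
    """Collapse whitespace, neutralize table pipes, and clip to `limit` chars at a word boundary."""
    s = str(text or "").strip()
    # Pipes would break Markdown table cells, so turn them into commas first;
    # collapsing whitespace afterwards avoids the double space that ", " can leave behind.
    s = s.replace("|", ", ")
    s = re.sub(r"\s+", " ", s)
    s = s.strip(" \"'`")
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()


def clean_sentence(text: str, *, limit: int = 180) -> str:
    """Clip to roughly `limit` chars, preferring whole sentences over a mid-sentence cut."""
    s = clean_text(text, limit=max(limit * 2, limit))
    if len(s) <= limit:
        return s
    # Look slightly past the limit so a sentence ending just beyond it is still seen.
    window = s[:limit + 40]
    pieces = re.split(r'(?<=[.!?;])\s+', window)
    acc: list[str] = []
    total = 0
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue
        nxt = total + len(piece) + (1 if acc else 0)
        # Stop before overflowing, but always keep the first piece even if it exceeds the limit.
        if nxt > limit and acc:
            break
        acc.append(piece)
        total = nxt
        # Once about 65% of the budget is used, stop instead of adding another sentence.
        if total >= int(limit * 0.65):
            break
    if acc:
        return " ".join(acc).strip()
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()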
def markdown_table(headers: list[str], rows: list[list[str]]) -> str:
head = "| " + " | ".join(headers) + " |"
sep = "| " + " | ".join(["---"] * len(headers)) + " |"
body = ["| " + " | ".join(str(cell).replace("\n", " ").strip() for cell in row) + " |" for row in rows]
return "\n".join([head, sep] + body)
def write_markdown(path: Path, text: str) -> None:
ensure_dir(path.parent)
atomic_write_text(path, text.rstrip() + "\n")
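def extract_goal_from_goal_md(path: Path) -> str:
    """Return the first non-empty, non-heading line of the file, or a generic fallback."""
    if not path.exists():
        return "research ideas"
    text = path.read_text(encoding="utf-8", errors="ignore")
    # Strip before the heading check so indented `#` headings are skipped too.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip() and not ln.strip().startswith("#")]
    return lines[0] if lines else "research ideas"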
def _idea_workspace_from_brief(path: Path) -> Path | None:
try:
return path.resolve().parents[2]
except Exception:
return None
def _require_positive_float(value: Any, *, field_name: str) -> float:
try:
parsed = float(value)
except Exception as exc:
raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
if parsed <= 0:
raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
return parsed
def _query_int_override(workspace: Path, key: str, default: int) -> int:
    """Return a positive integer override for `key` from queries.md, else `default`."""
    queries_path = workspace / "queries.md"
    normalized = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
    if not normalized:
        return int(default)
    if normalized not in pipeline_overridable_query_fields(workspace):
        return int(default)
    if not queries_path.exists():
        return int(default)
    for raw in queries_path.read_text(encoding="utf-8", errors="ignore").splitlines():
        line = raw.strip()
        if not line.startswith("- ") or ":" not in line:
            continue
        # Use a distinct name so the `key` parameter is not shadowed.
        line_key, value = line[2:].split(":", 1)
        line_key = line_key.strip().lower().replace(" ", "_").replace("-", "_")
        if line_key != normalized:
            continue
        # Drop any trailing `# comment` and surrounding quotes before parsing.
        cleaned = value.split("#", 1)[0].strip().strip('"').strip("'")
        if not cleaned:
            raise ValueError(f"Invalid ideation override: `{normalized}` is empty in queries.md.")
        try:
            parsed = int(cleaned)
        except Exception as exc:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be an integer in queries.md.") from exc
        if parsed <= 0:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be a positive integer in queries.md.")
        return parsed
    return int(default)
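# Illustrative sketch (an assumption, not part of the original module): the shape
# of a `queries.md` line that the override lookup above accepts, for example
# `- report_top_n: 5  # optional comment`. The helper name
# `_example_parse_override_line` is hypothetical and only shows the key and
# value normalization steps in isolation; it does not validate the integer.
from typing import Optional


def _example_parse_override_line(line: str) -> Optional[tuple[str, str]]:
    """Return (normalized_key, cleaned_value) for a `- key: value` line, else None."""
    stripped = line.strip()
    if not stripped.startswith("- ") or ":" not in stripped:
        return None
    key, value = stripped[2:].split(":", 1)
    # Keys are case-insensitive; spaces and hyphens are treated as underscores.
    key = key.strip().lower().replace(" ", "_").replace("-", "_")
    # Anything after `#` is a comment; surrounding quotes are dropped.
    value = value.split("#", 1)[0].strip().strip('"').strip("'")
    return key, value


# Usage: _example_parse_override_line("- Report Top-N: 5  # note") returns ("report_top_n", "5").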
def _validate_score_weights(value: Any, *, field_name: str) -> dict[str, float]:
    """Require every scoring weight to be present and positive, then normalize them to sum to 1 (up to rounding)."""
    # Only the keys are enforced here; the values below are not read by this function.
    required = {
        "discussion_worthiness": 0.24,
        "academic_value": 0.22,
        "evidence_grounding": 0.18,
        "direction_distinctness": 0.16,
        "first_probe_clarity": 0.10,
        "thesis_potential": 0.10,
    }
    if not isinstance(value, dict):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    weights: dict[str, float] = {}
    for key in required:
        if key not in value:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{key}")
        weights[key] = _require_positive_float(value.get(key), field_name=f"{field_name}.{key}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return {key: round(weight / total, 6) for key, weight in weights.items()}
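# Illustrative sketch (an assumption, not part of the original module): the
# normalization step above, shown on its own. Weights supplied by a contract do
# not need to sum to 1; each is divided by the total and rounded to 6 places.
# The helper name `_example_normalize_weights` and the sample values are
# hypothetical.
def _example_normalize_weights(raw: dict[str, float]) -> dict[str, float]:
    """Scale positive weights so they sum to 1, up to rounding."""
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("weights must have a positive total")
    return {key: round(weight / total, 6) for key, weight in raw.items()}


# Usage: _example_normalize_weights({"a": 2.0, "b": 6.0}) returns {"a": 0.25, "b": 0.75}.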
def _validate_diversity_axes(value: Any, *, field_name: str) -> list[str]:
allowed = {"cluster", "direction_type", "program_kind"}
if not isinstance(value, list):
raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
axes: list[str] = []
seen: set[str] = set()
for item in value:
axis = str(item or "").strip().lower().replace("-", "_")
if not axis:
continue
if axis not in allowed:
raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{axis}")
if axis in seen:
continue
seen.add(axis)
axes.append(axis)
if not axes:
raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
return axes
def resolve_idea_contract(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
raise ValueError("Missing active pipeline contract.")
query_defaults = dict(spec.query_defaults)
quality_contract = dict(spec.quality_contract)
def _require_positive_int(value: Any, *, field_name: str) -> int:
try:
parsed = int(value)
except Exception as exc:
raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
if parsed <= 0:
raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
return parsed
signal_policy = quality_contract.get("signal_policy") or {}
direction_policy = quality_contract.get("direction_policy") or {}
screening_policy = quality_contract.get("screening_policy") or {}
brief = parse_idea_brief(workspace / "output" / "trace" / "IDEA_BRIEF.md")
focus_clusters = [str(x).strip() for x in (brief.get("focus_clusters") or []) if str(x).strip()]
direction_pool_min_default = _require_positive_int(query_defaults.get("direction_pool_min"), field_name="query_defaults.direction_pool_min")
direction_pool_max_default = _require_positive_int(query_defaults.get("direction_pool_max"), field_name="query_defaults.direction_pool_max")
shortlist_size_default = _require_positive_int(query_defaults.get("idea_shortlist_size"), field_name="query_defaults.idea_shortlist_size")
report_top_n_default = _require_positive_int(query_defaults.get("report_top_n"), field_name="query_defaults.report_top_n")
idea_screen_top_n_default = _require_positive_int(query_defaults.get("idea_screen_top_n"), field_name="query_defaults.idea_screen_top_n")
shortlist_min = _require_positive_int(direction_policy.get("shortlist_min"), field_name="quality_contract.direction_policy.shortlist_min")
shortlist_max = _require_positive_int(direction_policy.get("shortlist_max"), field_name="quality_contract.direction_policy.shortlist_max")
keep_min = _require_positive_int(direction_policy.get("keep_min"), field_name="quality_contract.direction_policy.keep_min")
cluster_diversity_min = _require_positive_int(direction_policy.get("cluster_diversity_min"), field_name="quality_contract.direction_policy.cluster_diversity_min")
lead_diversity_target = _require_positive_int(direction_policy.get("lead_diversity_target"), field_name="quality_contract.direction_policy.lead_diversity_target")
lead_diversity_axes = _validate_diversity_axes(
direction_policy.get("lead_diversity_axes"),
field_name="quality_contract.direction_policy.lead_diversity_axes",
)
signal_table_min = _require_positive_int(signal_policy.get("min_rows"), field_name="quality_contract.signal_policy.min_rows")
keep_rank_max = _require_positive_int(screening_policy.get("keep_rank_max"), field_name="quality_contract.screening_policy.keep_rank_max")
maybe_rank_max = _require_positive_int(screening_policy.get("maybe_rank_max"), field_name="quality_contract.screening_policy.maybe_rank_max")
score_weights = _validate_score_weights(
screening_policy.get("score_weights"),
field_name="quality_contract.screening_policy.score_weights",
)
direction_pool_min = _query_int_override(workspace, "direction_pool_min", direction_pool_min_default)
direction_pool_max = _query_int_override(workspace, "direction_pool_max", direction_pool_max_default)
shortlist_size = _query_int_override(workspace, "idea_shortlist_size", shortlist_size_default)
report_top_n = _query_int_override(workspace, "report_top_n", report_top_n_default)
idea_screen_top_n = _query_int_override(workspace, "idea_screen_top_n", idea_screen_top_n_default)
if direction_pool_min > direction_pool_max:
raise ValueError(f"Invalid ideation contract: direction_pool_min ({direction_pool_min}) exceeds direction_pool_max ({direction_pool_max}).")
if shortlist_size < shortlist_min or shortlist_size > shortlist_max:
raise ValueError(
f"Invalid ideation contract: idea_shortlist_size ({shortlist_size}) must stay within "
f"[{shortlist_min}, {shortlist_max}]."
)
if report_top_n > shortlist_size:
raise ValueError(f"Invalid ideation contract: report_top_n ({report_top_n}) exceeds idea_shortlist_size ({shortlist_size}).")
if maybe_rank_max < keep_rank_max:
raise ValueError(
f"Invalid ideation contract: maybe_rank_max ({maybe_rank_max}) must be >= keep_rank_max ({keep_rank_max})."
)
if idea_screen_top_n > direction_pool_max:
raise ValueError(f"Invalid ideation contract: idea_screen_top_n ({idea_screen_top_n}) exceeds direction_pool_max ({direction_pool_max}).")
return {
"focus_clusters": focus_clusters,
"direction_pool_min": direction_pool_min,
"direction_pool_max": direction_pool_max,
"idea_screen_top_n": idea_screen_top_n,
"shortlist_size": shortlist_size,
"report_top_n": report_top_n,
"shortlist_min": shortlist_min,
"shortlist_max": shortlist_max,
"signal_table_min": signal_table_min,
"keep_min": keep_min,
"cluster_diversity_min": cluster_diversity_min,
"lead_diversity_target": lead_diversity_target,
"lead_diversity_axes": lead_diversity_axes,
"keep_rank_max": keep_rank_max,
"maybe_rank_max": maybe_rank_max,
"score_weights": score_weights,
}
def parse_idea_brief(path: Path) -> dict[str, Any]:
text = path.read_text(encoding="utf-8", errors="ignore") if path.exists() else ""
out: dict[str, Any] = {
"goal": extract_goal_from_goal_md(path),
"focus_clusters": [],
"query_buckets": [],
"exclusions": [],
"constraints": [],
"targets": {},
}
cur = None
for raw in text.splitlines():
line = raw.rstrip()
if line.startswith("## "):
cur = line[3:].strip().lower()
continue
if cur == "goal" and line.startswith("- Topic:"):
out["goal"] = line.split(":", 1)[1].strip()
if cur in {"focus after c2", "focus lenses after c2"} and line.startswith("- Focus clusters:"):
clusters = [x.strip() for x in line.split(":", 1)[1].split(";") if x.strip()]
cleaned: list[str] = []
for item in clusters:
low = item.lower()
if "to be filled" in low or "fill after c2" in low or "placeholder" in low:
continue
cleaned.append(item)
out["focus_clusters"] = cleaned
if cur == "query buckets" and re.match(r"^\d+\.\s+", line):
out["query_buckets"].append(re.sub(r"^\d+\.\s+", "", line).strip())
if cur in {"exclude terms", "exclusions"} and line.startswith("- "):
value = line[2:].strip()
if value and not value.lower().startswith("none"):
out["exclusions"].append(value)
if cur == "constraints" and line.startswith("- "):
out["constraints"].append(line[2:].strip())
if cur == "targets" and line.startswith("- ") and ":" in line:
key, value = line[2:].split(":", 1)
out["targets"][key.strip().lower().replace(" ", "_")] = value.strip()
return out
def collect_note_index(path: Path) -> dict[str, dict[str, Any]]:
out: dict[str, dict[str, Any]] = {}
for rec in read_jsonl(path):
if not isinstance(rec, dict):
continue
pid = str(rec.get("paper_id") or "").strip()
if pid:
out[pid] = rec
return out
def keywords_from_cluster(name: str) -> list[str]:
toks = [t for t in re.findall(r"[A-Za-z0-9]+", str(name or "").lower()) if t not in STOPWORDS and len(t) > 2]
return uniq_keep_order(toks)
def score_note_to_cluster(cluster_name: str, note: dict[str, Any]) -> int:
keys = keywords_from_cluster(cluster_name)
blob_parts = [str(note.get("title") or "")]
for field in ["summary_bullets", "limitations", "key_results", "method", "abstract"]:
val = note.get(field)
if isinstance(val, list):
blob_parts.extend([str(x) for x in val])
elif isinstance(val, str):
blob_parts.append(val)
blob = " ".join(blob_parts).lower()
return sum(1 for key in keys if key in blob)
def map_notes_to_clusters(taxonomy_path: Path, notes_path: Path) -> dict[str, list[dict[str, Any]]]:
import yaml
taxonomy = yaml.safe_load(taxonomy_path.read_text(encoding="utf-8", errors="ignore")) if taxonomy_path.exists() else []
notes = [r for r in read_jsonl(notes_path) if isinstance(r, dict)]
out: dict[str, list[dict[str, Any]]] = {}
for top in taxonomy or []:
if not isinstance(top, dict):
continue
for child in top.get("children") or []:
if not isinstance(child, dict):
continue
name = str(child.get("name") or "").strip()
if not name:
continue
scored: list[tuple[int, dict[str, Any]]] = []
for note in notes:
s = score_note_to_cluster(name, note)
if s > 0:
scored.append((s, note))
scored.sort(key=lambda x: (-x[0], str(x[1].get("paper_id") or "")))
out[name] = [n for _, n in scored[:8]]
return out
def _cluster_profile(cluster: str) -> tuple[str, list[str], str]:
low = cluster.lower()
for keys, axes, direction_type, academic_value in CLUSTER_AXIS_HINTS:
if any(key in low for key in keys):
return direction_type, axes, academic_value
return "research", ["assumption sensitivity", "failure analysis", "scope boundary"], "Could sharpen the way this sub-area is framed and compared."
def _note_bullets(note: dict[str, Any], field: str) -> list[str]:
    val = note.get(field)
    if isinstance(val, list):
        cleaned = (clean_text(x, limit=220) for x in val)
        return [item for item in cleaned if item]
    if isinstance(val, str):
        item = clean_text(val, limit=220)
        return [item] if item else []
    return []
def _specific_limitations(note: dict[str, Any]) -> list[str]:
items = _note_bullets(note, "limitations")
specific = []
for item in items:
low = item.lower()
if any(pat in low for pat in GENERIC_LIMITATION_PATTERNS):
continue
specific.append(item)
return specific or items
def _evidence_confidence(notes: list[dict[str, Any]]) -> str:
if not notes:
return "low"
if any(str(n.get("evidence_level") or "").strip() == "fulltext" for n in notes):
return "medium-high"
specific = sum(
1
for n in notes
if _specific_limitations(n) and not any(p in (_specific_limitations(n)[0].lower()) for p in GENERIC_LIMITATION_PATTERNS)
)
if len(notes) >= 3 and specific >= 2:
return "medium"
return "low-medium"
def _axis_profile(axis: str, cluster: str) -> dict[str, str]:
base = AXIS_INSIGHT_LIBRARY.get(axis, {})
confound = base.get("confound", "nearby design choices and evaluation framing")
return {
"question": base.get("question", f"whether {axis} is doing more conceptual work in {cluster.lower()} than current papers make explicit"),
"confound": confound,
"thesis": base.get("thesis", f"Current results in {cluster.lower()} may be hard to interpret because {axis} still moves together with {confound}."),
"insight": base.get("insight", f"A meaningful result would show whether {axis} changes how we interpret the current literature rather than merely how we narrate it."),
"contribution_shape": base.get("contribution_shape", f"Could turn {axis} into a cleaner explanatory variable for {cluster.lower()} rather than a background convenience."),
"demotion": base.get("demotion", f"This direction would weaken if the strongest prior work already isolates {axis} cleanly."),
"kill_signal": base.get("kill_signal", f"the strongest prior work already isolates {axis} against {confound}"),
"missing_piece": base.get("missing_piece", f"What is missing is a cleaner comparison that varies {axis} while keeping {confound} fixed enough to change interpretation, not just presentation."),
"program_kind": base.get("program_kind", "mechanism clarification"),
"time_to_clarity": base.get("time_to_clarity", "medium"),
"priority_note": base.get("priority_note", f"Worth discussion because it could turn {axis} into a sharper explanatory wedge for {cluster.lower()}."),
"reading_extract": base.get("reading_extract", f"which variables change together with {axis}, what comparator is used, and whether any ablation already fixes {confound}"),
"title": base.get("title", f"What {axis} is really doing"),
}
def _result_fact(note: dict[str, Any]) -> str:
candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
scored: list[str] = []
for cand in candidates:
low = cand.lower()
if any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "outperform", "achiev", "%"]) or any(ch.isdigit() for ch in cand):
scored.append(cand)
chosen = scored[0] if scored else _claim_text(note)
chosen = re.sub(r"^(for instance|for example|e\.g\.)[:,]?\s*", "", chosen, flags=re.IGNORECASE)
low = chosen.lower()
tasks = _extract_task_mentions(note)
task_text = "/".join(tasks[:2]) if tasks else "reported setting"
percents = re.findall(r"\d+(?:\.\d+)?%", chosen)
if "pass@1" in low and percents:
if len(percents) >= 2:
return clean_text(f"{task_text}: {percents[0]} pass@1 vs {percents[1]} in the reported comparison.", limit=95)
return clean_text(f"{task_text}: {percents[0]} pass@1 in the reported comparison.", limit=95)
if "success rate" in low and len(percents) >= 2:
baseline = " over imitation/RL baselines" if "imitation" in low or "reinforcement learning" in low else " in the reported comparison"
return clean_text(f"{task_text}: {percents[0]}/{percents[1]} success-rate gains{baseline}.", limit=95)
if len(percents) >= 2:
return clean_text(f"{task_text}: {percents[0]} vs {percents[1]} in the reported comparison.", limit=95)
if len(percents) == 1:
return clean_text(f"{task_text}: reported result reaches {percents[0]} in the abstract.", limit=95)
return clean_text(chosen, limit=95)
def _metric_phrase(notes: list[dict[str, Any]]) -> str:
blob = " ".join(_result_fact(note) for note in notes)
low = blob.lower()
if "pass@1" in low:
return "pass@1 plus failure-type shifts"
if "success rate" in low:
return "success rate plus failure-type shifts"
if "accuracy" in low:
return "accuracy plus failure-type shifts"
if "exact match" in low:
return "exact match plus failure-type shifts"
if "f1" in low:
return "F1 plus failure-type shifts"
if "%" in blob:
return "the reported benchmark metric plus failure-type shifts"
return "task success plus failure-type shifts"
def _sentence_has_concrete_hook(text: str) -> bool:
low = str(text or "").lower()
if any(ch.isdigit() for ch in str(text or "")):
return True
if any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "alfworld", "webshop", "hotpotqa", "fever", "humaneval", "webarena", "agentbench"]):
return True
return False
def _extract_task_mentions(note: dict[str, Any]) -> list[str]:
blob_parts = []
for field in ["key_results", "summary_bullets", "method", "abstract", "title"]:
val = note.get(field)
if isinstance(val, list):
blob_parts.extend([str(x) for x in val])
elif isinstance(val, str):
blob_parts.append(val)
blob = " ".join(blob_parts)
low = blob.lower()
positions: list[tuple[int, str]] = []
for hint in BENCHMARK_HINTS:
pos = low.find(hint.lower())
if pos >= 0:
positions.append((pos, hint))
for match in re.finditer(r"\b(?:HumanEval|ALFWorld|WebShop|HotpotQA|FEVER|WebArena|AgentBench|GSM8K|MATH)\b", blob):
positions.append((match.start(), match.group(0)))
positions.sort(key=lambda item: (item[0], item[1]))
return uniq_keep_order(name for _, name in positions)
def _task_phrase(notes: list[dict[str, Any]]) -> str:
tasks: list[str] = []
for note in notes:
tasks.extend(_extract_task_mentions(note))
uniq = uniq_keep_order(tasks)
if not uniq:
return "a small public task slice"
if len(uniq) == 1:
return uniq[0]
return "/".join(uniq[:2])
def _claim_text(note: dict[str, Any]) -> str:
candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
for cand in candidates:
if len(cand.split()) >= 8:
return cand
return clean_text(note.get("method") or note.get("title") or "", limit=220)
def _limitation_text(note: dict[str, Any], axis: str, cluster: str) -> str:
limitations = _specific_limitations(note)
if limitations and not any(pat in limitations[0].lower() for pat in GENERIC_LIMITATION_PATTERNS):
return limitations[0]
profile = _axis_profile(axis, cluster)
return f"The paper still leaves unclear whether {axis} is genuinely responsible for the reported story in {cluster.lower()}, or whether that story is really driven by {profile['confound']}."
def _anchor_notes(note_index: dict[str, dict[str, Any]], paper_ids: list[str]) -> list[dict[str, Any]]:
return [note_index[pid] for pid in paper_ids if pid in note_index]
def _paper_annotation(note: dict[str, Any], axis: str, cluster: str) -> str:
title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
profile = _axis_profile(axis, cluster)
result = _result_fact(note)
return clean_sentence(
f"{title}: {result} Open gap: {axis} still moves with {profile['confound']}.",
limit=300,
)
def _synthesis_annotation(notes: list[dict[str, Any]], axis: str, cluster: str) -> str:
profile = _axis_profile(axis, cluster)
task_phrase = _task_phrase(notes)
return clean_sentence(
f"Across the anchor papers, the live question is whether {axis} changes interpretation itself or merely rides along with {profile['confound']}, especially on {task_phrase}.",
limit=260,
)
def build_signal_rows(*, cluster: str, notes: list[dict[str, Any]]) -> list[IdeaSignal]:
if not notes:
return []
direction_type, axes, academic_value = _cluster_profile(cluster)
evidence_confidence = _evidence_confidence(notes)
anchors = notes[:3]
paper_ids = uniq_keep_order(str(n.get("paper_id") or "").strip() for n in anchors)
signals: list[IdeaSignal] = []
for idx, axis in enumerate(axes[:3], start=1):
anchor = anchors[(idx - 1) % len(anchors)]
claim = _claim_text(anchor)
tension = _limitation_text(anchor, axis, cluster)
profile = _axis_profile(axis, cluster)
missing_piece = clean_sentence(
f"What is still missing is a direction that isolates whether {axis} is the real explanatory variable in {cluster.lower()}, rather than leaving it entangled with {profile['confound']}.",
limit=240,
)
signals.append(IdeaSignal(
signal_id=f"SIG-{slugify(cluster)[:16]}-{idx}",
cluster=cluster,
direction_type=direction_type,
theme=f"{profile['title']} in {cluster}",
claim_or_observation=claim,
tension=clean_text(tension, limit=220),
missing_piece=missing_piece,
possible_axis=axis,
academic_value=academic_value,
evidence_confidence=evidence_confidence,
paper_ids=paper_ids,
))
return signals
def signal_table_markdown(rows: list[IdeaSignal]) -> str:
table_rows = []
for row in rows:
table_rows.append([
row.signal_id,
row.cluster,
row.theme,
row.claim_or_observation,
row.tension,
row.missing_piece,
row.possible_axis,
row.academic_value,
row.evidence_confidence,
", ".join(row.paper_ids),
])
return "\n".join([
"# IDEA_SIGNAL_TABLE",
"",
"## Research signals",
"",
markdown_table(["Signal ID", "Cluster", "Theme", "Claim / observation", "Tension", "Missing piece", "Possible axis", "Academic value", "Confidence", "Paper IDs"], table_rows),
"",
])
def _title_from_signal(signal: IdeaSignal) -> str:
profile = _axis_profile(signal.possible_axis, signal.cluster)
return clean_sentence(profile["title"], limit=72)
def _best_fit(direction_type: str, axis: str) -> str:
if direction_type == "governance":
return "Best fit when the group wants a deployment-relevant reading of agent behavior rather than another internal benchmark story."
if direction_type == "evaluation":
return "Best fit when the discussion is really about what counts as convincing evidence, not just how to raise scores."
if axis in {"search depth", "verification loop", "retrieval policy"}:
return "Best fit for a PhD discussion that wants one sharper mechanism/confound question rather than a broad systems buildout."
return "Best fit for PI/PhD discussions that want a thesis-worthy mechanism question rather than a generic benchmark wrapper."
def signals_to_direction_cards(signals: list[IdeaSignal], *, note_index: dict[str, dict[str, Any]], focus_clusters: list[str], pool_min: int, pool_max: int) -> list[DirectionCard]:
focus = {str(x).strip() for x in focus_clusters if str(x).strip()}
cards: list[DirectionCard] = []
for idx, signal in enumerate(signals, start=1):
profile = _axis_profile(signal.possible_axis, signal.cluster)
anchors = _anchor_notes(note_index, signal.paper_ids)
task_phrase = _task_phrase(anchors)
metric_phrase = _metric_phrase(anchors)
nearest = anchors[0] if anchors else {}
nearest_title = clean_text((nearest.get("title") if isinstance(nearest, dict) else "") or (signal.paper_ids[0] if signal.paper_ids else signal.cluster), limit=90)
nearest_task_phrase = _task_phrase([nearest]) if nearest else task_phrase
literature_suggests = [_paper_annotation(note, signal.possible_axis, signal.cluster) for note in anchors[:2]]
literature_suggests.append(_synthesis_annotation(anchors, signal.possible_axis, signal.cluster))
closest_prior_gap = [
clean_sentence(
f"{nearest_title} is the closest prior anchor because it already reports concrete behavior on {nearest_task_phrase}. The unresolved point is whether those gains survive once {profile['confound']} is held fixed while {signal.possible_axis} is varied.",
limit=220,
),
clean_sentence(
"The novelty test here is narrow, not rhetorical: if a strong anchor paper already runs that single-variable control, this direction should collapse quickly rather than stay alive as a vague confound story.",
limit=220,
),
]
one_line_thesis = clean_sentence(profile["thesis"], limit=230)
why_interesting = clean_sentence(
f"This is not just another benchmark wedge. It opens a {profile['program_kind']} line around {signal.possible_axis}: {profile['question']}.",
limit=220,
)
missing_piece = clean_sentence(profile["missing_piece"], limit=230)
possible_variants = [
clean_sentence(f"Single-variable control on {task_phrase}: vary {signal.possible_axis} while holding {profile['confound']} fixed as far as the setup allows.", limit=210),
clean_sentence("Replace score-only reporting with a failure-type comparison after the same intervention, so the result says more than whether the average went up.", limit=210),
clean_sentence("Check whether the conclusion survives on one simple public task slice and one more tool- or environment-heavy setting before treating it as a general claim.", limit=210),
]
first_probes = [
clean_sentence(
f"Intervention: vary {signal.possible_axis} while holding {profile['confound']} as fixed as possible on {task_phrase}. Readout: {metric_phrase}. Decisive if the interpretation changes even after the control.",
limit=220,
),
clean_sentence(
f"Prior-work audit: inspect {nearest_title} for any ablation that already fixes {profile['confound']}, and if the conclusion survives, demote this direction.",
limit=180,
),
]
what_counts_as_insight = clean_sentence(profile['insight'], limit=190)
kill_criteria = [
clean_sentence(f"Kill quickly if {profile['kill_signal']}.", limit=170),
clean_sentence(f"Kill if the first controlled probe leaves both {metric_phrase} and the failure taxonomy essentially unchanged.", limit=170),
]
weakness_conditions = [
clean_sentence(f"This direction is weaker if changing {signal.possible_axis} mostly rescales the aggregate metric without changing which failure modes appear or disappear.", limit=175),
clean_sentence(profile['demotion'], limit=170),
]
anchor_reading_notes: list[dict[str, str]] = []
for note in anchors[:3]:
title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
result = _result_fact(note)
anchor_reading_notes.append({
"paper_title": title,
"why_read": clean_sentence(f"Closest {signal.cluster.lower()} anchor with a concrete result hook on {task_phrase}: {result}", limit=180),
"what_to_extract": clean_sentence(f"Extract the metric and comparator, then check whether {profile['reading_extract']}.", limit=180),
"current_hook": clean_sentence(result, limit=180),
"kill_signal": clean_sentence(f"Weaken this direction if the paper already shows {profile['kill_signal']}.", limit=180),
})
why_this_ranks_here = clean_sentence(profile['priority_note'], limit=170)
cards.append(DirectionCard(
direction_id=f"DIR-{idx:03d}",
cluster=signal.cluster,
direction_type=signal.direction_type,
title=_title_from_signal(signal),
focus_axis=signal.possible_axis,
main_confound=profile['confound'],
program_kind=profile['program_kind'],
contribution_shape=profile['contribution_shape'],
time_to_clarity=profile['time_to_clarity'],
one_line_thesis=one_line_thesis,
why_interesting=why_interesting,
literature_suggests=literature_suggests,
closest_prior_gap=closest_prior_gap,
missing_piece=missing_piece,
possible_variants=possible_variants,
academic_value=profile['contribution_shape'],
first_probes=first_probes,
what_counts_as_insight=what_counts_as_insight,
weakness_conditions=weakness_conditions,
kill_criteria=kill_criteria,
what_would_change_mind=kill_criteria,
best_fit=_best_fit(signal.direction_type, signal.possible_axis),
why_this_ranks_here=why_this_ranks_here,
evidence_confidence=signal.evidence_confidence,
paper_ids=signal.paper_ids,
signal_ids=[signal.signal_id],
anchor_reading_notes=anchor_reading_notes,
))
cards.sort(key=lambda c: (0 if c.cluster in focus else 1, c.cluster, c.program_kind, c.title))
target = max(pool_min, min(pool_max, len(cards)))
return cards[:target]
def direction_pool_markdown(cards: list[DirectionCard]) -> str:
rows: list[list[str]] = []
for card in cards:
rows.append([
card.direction_id,
card.cluster,
card.direction_type,
card.program_kind,
card.title,
clean_sentence(card.one_line_thesis, limit=100),
clean_sentence(card.why_interesting, limit=100),
clean_sentence(card.missing_piece, limit=100),
clean_sentence(" / ".join(card.possible_variants), limit=120),
clean_sentence(card.academic_value, limit=90),
clean_sentence(" / ".join(card.first_probes), limit=120),
card.evidence_confidence,
", ".join(card.paper_ids),
])
return "\n".join([
"# IDEA_DIRECTION_POOL",
"",
"## Candidate research directions",
"",
markdown_table(["Direction ID", "Cluster", "Type", "Program", "Title", "One-line thesis", "Why interesting", "Missing piece", "Possible variants", "Academic value", "First probes", "Confidence", "Paper IDs"], rows),
"",
])
def _thesis_potential(direction_type: str, cluster: str) -> int:
    if direction_type in {"mechanism", "coordination", "adaptation"}:
        return 5
    if direction_type in {"systems", "governance"}:
        return 4
    if "benchmark" in cluster.lower() or direction_type == "evaluation":
        return 3
    return 4
def score_direction_cards(
cards: list[DirectionCard],
*,
focus_clusters: list[str],
keep_rank_max: int,
maybe_rank_max: int,
score_weights: dict[str, float] | None = None,
) -> list[ScreenedDirection]:
focus = {x.strip() for x in focus_clusters if str(x).strip()}
keep_rank_max = int(keep_rank_max)
maybe_rank_max = int(maybe_rank_max)
if keep_rank_max <= 0:
raise ValueError("Invalid ideation scoring contract: keep_rank_max must be positive.")
if maybe_rank_max < keep_rank_max:
raise ValueError("Invalid ideation scoring contract: maybe_rank_max must be >= keep_rank_max.")
weights = _validate_score_weights(score_weights or {}, field_name="score_direction_cards.score_weights")
cluster_counts: dict[str, int] = {}
program_counts: dict[str, int] = {}
for card in cards:
cluster_counts[card.cluster] = cluster_counts.get(card.cluster, 0) + 1
program_counts[card.program_kind] = program_counts.get(card.program_kind, 0) + 1
provisional: list[tuple[DirectionCard, float, int, int, int, int, int, int]] = []
for card in cards:
discussion_worthiness = 5 if card.cluster in focus or card.time_to_clarity == "fast" else 4
academic_value_score = 5 if any(token in card.contribution_shape.lower() for token in ["rule", "protocol", "regime map", "causal"]) else 4
concrete_hooks = sum(1 for item in card.literature_suggests if _sentence_has_concrete_hook(item))
evidence_grounding = 5 if (len(card.paper_ids) >= 3 and concrete_hooks >= 2) else (4 if (len(card.paper_ids) >= 2 and concrete_hooks >= 1) else 3)
if program_counts.get(card.program_kind, 0) == 1 and cluster_counts.get(card.cluster, 0) == 1:
direction_distinctness = 5
elif program_counts.get(card.program_kind, 0) == 1 or cluster_counts.get(card.cluster, 0) == 1:
direction_distinctness = 4
else:
direction_distinctness = 3
probe_blob = " ".join(card.first_probes).lower()
first_probe_clarity = 5 if all(token in probe_blob for token in ["intervention:", "readout:", "decisive if"]) else 4 if "intervention:" in probe_blob else 3
thesis_potential = _thesis_potential(card.direction_type, card.cluster)
total = round(
discussion_worthiness * weights["discussion_worthiness"]
+ academic_value_score * weights["academic_value"]
+ evidence_grounding * weights["evidence_grounding"]
+ direction_distinctness * weights["direction_distinctness"]
+ first_probe_clarity * weights["first_probe_clarity"]
+ thesis_potential * weights["thesis_potential"],
2,
)
provisional.append((card, total, discussion_worthiness, academic_value_score, evidence_grounding, direction_distinctness, first_probe_clarity, thesis_potential))
provisional.sort(key=lambda row: (-row[1], 0 if row[0].time_to_clarity == "fast" else 1, row[0].cluster, row[0].title))
rows: list[ScreenedDirection] = []
for idx, (card, total, dw, av, eg, dd, fp, tp) in enumerate(provisional, start=1):
recommendation = "keep" if idx <= keep_rank_max else "maybe" if idx <= maybe_rank_max else "drop"
strengths: list[str] = []
if eg >= 5:
strengths.append("concrete anchor evidence")
if dd >= 5:
strengths.append(f"distinct {card.program_kind} wedge")
if fp >= 5:
strengths.append("clean first probe")
if av >= 5:
strengths.append("thesis-sized payoff")
if not strengths:
strengths.append("discussion value")
risk = "still abstract-first" if str(card.evidence_confidence).startswith("low") else f"main risk is controlling {card.main_confound} cleanly"
rationale = clean_sentence(f"Strongest on {' + '.join(strengths[:2])}; {risk}.", limit=160)
rows.append(ScreenedDirection(
direction_id=card.direction_id,
cluster=card.cluster,
direction_type=card.direction_type,
title=card.title,
total_score=total,
discussion_worthiness=dw,
academic_value_score=av,
evidence_grounding=eg,
direction_distinctness=dd,
first_probe_clarity=fp,
thesis_potential=tp,
recommendation=recommendation,
rationale=rationale,
))
return rows
def screening_table_markdown(rows: list[ScreenedDirection]) -> str:
data: list[list[str]] = []
for row in rows:
data.append([
row.direction_id,
row.cluster,
row.direction_type,
row.title,
f"{row.total_score:.2f}",
str(row.discussion_worthiness),
str(row.academic_value_score),
str(row.evidence_grounding),
str(row.direction_distinctness),
str(row.first_probe_clarity),
str(row.thesis_potential),
row.recommendation,
row.rationale,
])
return "\n".join([
"# IDEA_SCREENING_TABLE",
"",
"## Discussion-first screening",
"",
markdown_table(["Direction ID", "Cluster", "Type", "Title", "Total", "Discussion", "Academic value", "Evidence", "Distinctness", "First probe", "Thesis potential", "Decision", "Rationale"], data),
"",
])
def shortlist_markdown(records: list[dict[str, Any]]) -> str:
lines = ["# IDEA_SHORTLIST", "", "## Prioritized directions", ""]
for record in records:
lines.extend([
f"### Direction {record['rank']}. {record['title']}",
f"- Cluster: {record['cluster']}",
f"- Type: {record['direction_type']}",
f"- Focus axis: {record.get('focus_axis')}",
f"- Program kind: {record.get('program_kind')}",
f"- Main confound: {record.get('main_confound')}",
f"- Time to clarity: {record.get('time_to_clarity')}",
f"- One-line thesis: {record['one_line_thesis']}",
f"- Why this is interesting: {record['why_interesting']}",
f"- Why this ranks here: {record['why_this_ranks_here']}",
f"- Contribution shape: {record.get('contribution_shape')}",
"- What the literature already suggests:",
])
for item in record.get("literature_suggests") or []:
lines.append(f" - {item}")
lines.extend([
"- Closest prior work and why it does not settle the question:",
])
for item in record.get("closest_prior_gap") or []:
lines.append(f" - {item}")
lines.extend([
f"- What is still missing: {record['missing_piece']}",
"- Possible variants:",
])
for item in record.get("possible_variants") or []:
lines.append(f" - {item}")
lines.extend([
f"- Why this could matter academically: {record['academic_value']}",
"- First probes:",
])
for item in record.get("first_probes") or []:
lines.append(f" - {item}")
lines.extend([
f"- What would count as actual insight: {record['what_counts_as_insight']}",
"- What would make this weak or unconvincing:",
])
for item in record.get("weakness_conditions") or []:
lines.append(f" - {item}")
lines.extend([
"- Quick kill criteria:",
])
for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
lines.append(f" - {item}")
lines.extend([
f"- Best fit: {record['best_fit']}",
f"- Evidence confidence: {record['evidence_confidence']}",
"- Anchor papers: " + ", ".join(f"`{pid}`" for pid in record.get("paper_ids") or []),
f"- Why prioritized now: {record['why_prioritized']}",
"",
])
return "\n".join(lines)
def shortlist_snapshot_table(records: list[dict[str, Any]]) -> str:
    rows = []
    for rec in records:
        rows.append([
            str(rec.get("rank") or ""),
            rec.get("title", ""),
            clean_sentence(rec.get("why_this_ranks_here", ""), limit=52),
            clean_sentence(rec.get("contribution_shape", rec.get("academic_value", "")), limit=56),
            clean_sentence((rec.get("kill_criteria") or rec.get("what_would_change_mind") or [""])[0], limit=54),
        ])
    return markdown_table(["Rank", "Direction", "Why now", "If it survives", "Fast kill signal"], rows)
def build_report_payload(*, topic: str, shortlist: list[dict[str, Any]], deferred: list[dict[str, Any]], trace_paths: dict[str, str]) -> dict[str, Any]:
top = list(shortlist)
takeaways: list[str] = []
if top:
takeaways.append("The strongest directions are the ones most likely to change how existing results are interpreted, not just add another benchmark win.")
takeaways.append("The current rank order reflects a tradeoff between time-to-clarity, thesis-sized payoff, and how concrete the nearest prior-work gap already looks.")
program_kinds = uniq_keep_order(str(rec.get("program_kind") or "").strip() for rec in top)
if len(program_kinds) >= 2:
takeaways.append(f"The lead set is intentionally not one confound template repeated three times: it spans {', '.join(program_kinds[:3])} rather than a single explanatory mold.")
else:
takeaways.append("The lead set still sits in one tight neighborhood, so the next reading pass should stress-test whether those directions are genuinely distinct thesis lines.")
if any(str(rec.get("evidence_confidence") or "").startswith("low") for rec in top):
takeaways.append("Most anchors are still abstract-first, so the memo is best used to decide the next reading and falsification pass rather than to lock a project immediately.")
else:
takeaways.append("The evidence is already concrete enough for a serious PI/PhD discussion, though one deeper paper pass could still reshuffle the exact order.")
lead = top[0] if top else {}
lead_anchor = (lead.get("anchor_reading_notes") or [{}])[0] if top else {}
lead_anchor_title = lead_anchor.get("paper_title") if isinstance(lead_anchor, dict) else "the first anchor paper"
discussion_questions = [
"Which direction still survives once we ask for a single-variable control rather than a suggestive confound story?",
"Which lead direction would still matter if the nearest prior work already addressed the obvious control we are worried about?",
"Which candidate could plausibly turn into a thesis line with a reusable method, protocol, or regime map rather than a one-off empirical note?",
"Which ranking would change most after one full-paper reading pass on the anchor set?",
]
uncertainties = [
"The main remaining risk is novelty risk, not idea scarcity: closer reading may reveal that one lead direction is already settled by an existing control or ablation.",
"Evidence confidence is still bounded by abstract-first notes for part of the lead set, so paper-specific details could either strengthen or kill a direction quickly.",
"The memo is more reliable as a discussion and triage artifact than as a final commitment on exact project order.",
]
next_steps = []
if top:
next_steps.append(clean_sentence(f"Start with {lead.get('title')}: read {lead_anchor_title or 'the first anchor paper'} looking specifically for whether {lead.get('focus_axis')} is already isolated against {lead.get('main_confound')}.", limit=220))
first_probe = (lead.get("first_probes") or [""])[0]
if first_probe:
next_steps.append(clean_sentence(first_probe, limit=220))
next_steps.append(clean_sentence(f"Re-rank only after checking the quick kill criteria for {lead.get('title')} and at least one competing direction in the lead set.", limit=220))
else:
next_steps.extend([
"Read the anchor papers for the current top direction in full and test whether the memo's stated hidden variable still looks unresolved.",
"In the next PI/PhD discussion, use the ranking arguments and kill criteria to decide whether any direction deserves promotion into a sharper thesis candidate.",
])
return {
"topic": topic,
"takeaways": takeaways,
"top_directions": top,
"deferred_directions": deferred,
"discussion_questions": discussion_questions,
"uncertainties": uncertainties,
"next_steps": next_steps,
"trace_artifacts": trace_paths,
}
def report_markdown(payload: dict[str, Any]) -> str:
top_dirs = payload.get("top_directions") or []
deferred_idx = 3 + len(top_dirs)
discussion_idx = deferred_idx + 1
uncertainty_idx = deferred_idx + 2
next_idx = deferred_idx + 3
appendix_idx = deferred_idx + 4
lines = [
"# Research Idea Brainstorm Memo",
"",
"## 0. Scope and framing",
"",
f"- Topic: {payload.get('topic') or 'research ideas'}",
"- Intended readers: PI / PhD",
"- Goal: surface a small number of discussion-worthy research directions rather than force a final project choice.",
"- Current evidence basis: abstract-first notes unless otherwise stated.",
"- What this memo is not: not a final project spec, not a survey draft, and not a symmetric top-3 proposal pack.",
"",
"## 1. Big-picture takeaways",
"",
]
for item in payload.get("takeaways") or []:
lines.append(f"- {item}")
lines.extend([
"",
"## 2. Top directions at a glance",
"",
shortlist_snapshot_table(payload.get("top_directions") or []),
"",
])
for idx, record in enumerate(top_dirs, start=1):
lines.extend([
f"## {idx + 2}. Direction {idx} — {record.get('title')}",
"",
"### One-line thesis",
f"- {record.get('one_line_thesis')}",
"",
"### Why it belongs in the lead set",
f"- {record.get('why_this_ranks_here')}",
f"- Time to clarity: {record.get('time_to_clarity')}",
"",
"### What the current literature actually shows",
])
for item in record.get("literature_suggests") or []:
lines.append(f"- {item}")
lines.extend([
"",
"### Closest prior work and remaining gap",
])
for item in record.get("closest_prior_gap") or []:
lines.append(f"- {item}")
lines.extend([
f"- Missing piece: {record.get('missing_piece')}",
"",
"### If this direction is right, what contribution emerges",
f"- {record.get('contribution_shape') or record.get('academic_value')}",
f"- {record.get('what_counts_as_insight')}",
"",
"### Smallest decisive probe",
])
for item in record.get("first_probes") or []:
lines.append(f"- {item}")
lines.extend([
"",
"### Quick kill criteria",
])
for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
lines.append(f"- {item}")
lines.append("")
lines.extend([
f"## {deferred_idx}. Other promising but not prioritized directions",
"",
])
deferred = payload.get("deferred_directions") or []
if deferred:
for record in deferred:
lines.append(f"- **{record.get('title')}** — {record.get('one_line_thesis')} (why not prioritized now: {record.get('why_not_prioritized')})")
else:
lines.append("- No additional deferred directions were retained in the current memo.")
lines.extend([
"",
f"## {discussion_idx}. Cross-cutting discussion questions",
"",
])
for item in payload.get("discussion_questions") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {uncertainty_idx}. Uncertainty and disagreement",
"",
])
for item in payload.get("uncertainties") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {next_idx}. Suggested next reading / next discussion step",
"",
])
for item in payload.get("next_steps") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {appendix_idx}. Appendix guide",
"",
"- `output/APPENDIX.md` for anchor-paper reading notes and deferred directions.",
"- `output/REPORT.json` for the structured version of this memo.",
])
return "\n".join(lines)
def appendix_markdown(payload: dict[str, Any], *, core_titles: dict[str, str]) -> str:
lines = [
"# Appendix to the Research Idea Brainstorm Memo",
"",
"## A. Deferred but still promising directions",
"",
]
deferred = payload.get("deferred_directions") or []
if deferred:
for rec in deferred:
lines.extend([
f"### {rec.get('title')}",
f"- One-line thesis: {rec.get('one_line_thesis')}",
f"- Program kind: {rec.get('program_kind')}",
f"- Why interesting: {rec.get('why_interesting')}",
f"- Why not prioritized now: {rec.get('why_not_prioritized')}",
"",
])
else:
lines.append("- (none)")
lines.extend([
"## B. Anchor papers and what to extract",
"",
])
for rec in payload.get("top_directions") or []:
lines.append(f"### {rec.get('title')}")
lines.append(f"- Program kind: {rec.get('program_kind')}")
lines.append(f"- Lead-set reason: {rec.get('why_this_ranks_here')}")
notes = rec.get("anchor_reading_notes") or []
if notes:
rows: list[list[str]] = []
for item in notes:
if not isinstance(item, dict):
continue
rows.append([
item.get("paper_title", "paper"),
item.get("why_read", ""),
item.get("what_to_extract", ""),
item.get("current_hook", ""),
item.get("kill_signal", ""),
])
if rows:
lines.append(markdown_table(["Anchor paper", "Why read now", "What to extract", "Current evidence hook", "Kill signal"], rows))
else:
for pid in rec.get("paper_ids") or []:
title = core_titles.get(pid, pid)
lines.append(f"- {title}")
else:
for pid in rec.get("paper_ids") or []:
title = core_titles.get(pid, pid)
lines.append(f"- {title}")
lines.append("")
return "\n".join(lines)
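The screening step in `score_direction_cards` reduces to two ideas: a weighted sum over six 3-to-5 criterion scores, and a purely rank-based keep/maybe/drop decision. The following is a minimal sketch of that pattern only, not the module's implementation. The names `weighted_total`, `recommend`, and `screen` are hypothetical, the tie-break here is simplified to the candidate name, and the weights are assumed to be supplied by the caller (the real ones come from `_validate_score_weights`, which is not shown in this file).

```python
from __future__ import annotations

# Criterion names mirror the weight keys used in score_direction_cards.
CRITERIA = (
    "discussion_worthiness",
    "academic_value",
    "evidence_grounding",
    "direction_distinctness",
    "first_probe_clarity",
    "thesis_potential",
)


def weighted_total(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted sum over all criteria, rounded to two decimals."""
    return round(sum(scores[name] * weights[name] for name in CRITERIA), 2)


def recommend(rank: int, keep_rank_max: int, maybe_rank_max: int) -> str:
    """Decision depends only on the 1-based rank, never on the raw score."""
    if rank <= keep_rank_max:
        return "keep"
    if rank <= maybe_rank_max:
        return "maybe"
    return "drop"


def screen(
    candidates: dict[str, dict[str, int]],
    weights: dict[str, float],
    *,
    keep_rank_max: int,
    maybe_rank_max: int,
) -> list[tuple[str, float, str]]:
    """Rank candidates by descending total (name as a simplified tie-break)."""
    ranked = sorted(
        ((name, weighted_total(scores, weights)) for name, scores in candidates.items()),
        key=lambda row: (-row[1], row[0]),
    )
    return [
        (name, total, recommend(idx, keep_rank_max, maybe_rank_max))
        for idx, (name, total) in enumerate(ranked, start=1)
    ]


# Hypothetical demo data: uniform weights and three made-up candidates.
demo_weights = {name: 1.0 for name in CRITERIA}
demo_candidates = {
    "A": {name: 5 for name in CRITERIA},
    "B": {name: 4 for name in CRITERIA},
    "C": {name: 3 for name in CRITERIA},
}
demo_result = screen(demo_candidates, demo_weights, keep_rank_max=1, maybe_rank_max=2)
```

Because the decision is rank-based, a candidate with a high absolute score can still be dropped when enough others outrank it. This is why `keep_rank_max` and `maybe_rank_max` are validated before any scoring happens.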
FILE:tooling/pipeline_spec.py
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import Any
import yaml
@dataclass(frozen=True)
class PipelineStage:
    id: str
    title: str
    checkpoint: str
    mode: str
    required_skills: tuple[str, ...]
    optional_skills: tuple[str, ...]
    produces: tuple[str, ...]
    human_checkpoint: dict[str, Any]
    @staticmethod
    def from_mapping(stage_id: str, data: Any, *, path: Path) -> "PipelineStage":
        if not isinstance(data, dict):
            raise ValueError(f"`stages.{stage_id}` must be a mapping in {path}")
        return PipelineStage(
            id=str(stage_id or "").strip(),
            title=str(data.get("title") or stage_id).strip() or str(stage_id),
            checkpoint=str(data.get("checkpoint") or stage_id).strip() or str(stage_id),
            mode=str(data.get("mode") or "").strip(),
            required_skills=_string_tuple(data.get("required_skills"), field_name=f"stages.{stage_id}.required_skills", path=path),
            optional_skills=_string_tuple(data.get("optional_skills"), field_name=f"stages.{stage_id}.optional_skills", path=path),
            produces=_string_tuple(data.get("produces"), field_name=f"stages.{stage_id}.produces", path=path),
            human_checkpoint=_mapping(data.get("human_checkpoint"), field_name=f"stages.{stage_id}.human_checkpoint", path=path, required=False),
        )
@dataclass(frozen=True)
class PipelineSpec:
    path: Path
    name: str
    version: str
    units_template: str
    default_checkpoints: tuple[str, ...]
    profile: str
    routing_hints: tuple[str, ...]
    routing_default: bool
    routing_priority: int
    target_artifacts: tuple[str, ...]
    contract_model: str
    structure_mode: str
    pre_retrieval_shell: dict[str, Any]
    binding_layers: tuple[str, ...]
    core_chapter_h3_target: int
    query_defaults: dict[str, Any]
    overridable_query_fields: tuple[str, ...]
    quality_contract: dict[str, Any]
    loop_policy: dict[str, Any]
    stages: dict[str, PipelineStage]
    variant_of: str
    variant_overrides: dict[str, Any]
    @staticmethod
    def load(path: Path) -> "PipelineSpec":
        resolved = path.resolve()
        # `raw_frontmatter` is this file's own front matter; `frontmatter` is the
        # effective mapping after any `variant_of` base has been merged in.
        raw_frontmatter, frontmatter = _load_variant_aware_frontmatter(resolved)
        name = str(frontmatter.get("name") or path.stem.replace(".pipeline", ""))
        units_template = str(frontmatter.get("units_template") or "")
        version = str(frontmatter.get("version") or "")
        default_checkpoints = _string_tuple(frontmatter.get("default_checkpoints"), field_name="default_checkpoints", path=resolved)
        profile = str(frontmatter.get("profile") or "default").strip() or "default"
        routing_hints = _string_tuple(frontmatter.get("routing_hints"), field_name="routing_hints", path=resolved)
        routing_default = bool(frontmatter.get("routing_default"))
        # Fall back to 0 when the value cannot be coerced to an int.
        try:
            routing_priority = int(frontmatter.get("routing_priority") or 0)
        except (TypeError, ValueError):
            routing_priority = 0
        target_artifacts = _string_tuple(frontmatter.get("target_artifacts"), field_name="target_artifacts", path=resolved)
        contract_model = str(frontmatter.get("contract_model") or "").strip()
        structure_mode = str(frontmatter.get("structure_mode") or "").strip()
        pre_retrieval_shell = _mapping(frontmatter.get("pre_retrieval_shell"), field_name="pre_retrieval_shell", path=resolved, required=False)
        binding_layers = _string_tuple(frontmatter.get("binding_layers"), field_name="binding_layers", path=resolved)
        # Same fallback as `routing_priority`.
        try:
            core_chapter_h3_target = int(frontmatter.get("core_chapter_h3_target") or 0)
        except (TypeError, ValueError):
            core_chapter_h3_target = 0
        query_defaults = _mapping(frontmatter.get("query_defaults"), field_name="query_defaults", path=resolved, required=False)
        overridable_query_fields = _string_tuple(
            frontmatter.get("overridable_query_fields"),
            field_name="overridable_query_fields",
            path=resolved,
        )
        quality_contract = _mapping(frontmatter.get("quality_contract"), field_name="quality_contract", path=resolved, required=False)
        loop_policy = _mapping(frontmatter.get("loop_policy"), field_name="loop_policy", path=resolved, required=False)
        stages = _parse_stages(frontmatter.get("stages"), path=resolved)
        variant_of = str(raw_frontmatter.get("variant_of") or "").strip()
        variant_overrides = _mapping(raw_frontmatter.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False)
        if not units_template:
            raise ValueError(f"Missing units_template in pipeline front matter: {resolved}")
        return PipelineSpec(
            path=resolved,
            name=name,
            version=version,
            units_template=units_template,
            default_checkpoints=default_checkpoints,
            profile=profile,
            routing_hints=routing_hints,
            routing_default=routing_default,
            routing_priority=routing_priority,
            target_artifacts=target_artifacts,
            contract_model=contract_model,
            structure_mode=structure_mode,
            pre_retrieval_shell=pre_retrieval_shell,
            binding_layers=binding_layers,
            core_chapter_h3_target=core_chapter_h3_target,
            query_defaults=query_defaults,
            overridable_query_fields=overridable_query_fields,
            quality_contract=quality_contract,
            loop_policy=loop_policy,
            stages=stages,
            variant_of=variant_of,
            variant_overrides=variant_overrides,
        )
    def query_default(self, key: str, default: Any = None) -> Any:
        return self.query_defaults.get(str(key or "").strip(), default)
    def allows_query_override(self, key: str) -> bool:
        return str(key or "").strip() in set(self.overridable_query_fields)
def _parse_frontmatter(text: str) -> dict[str, Any]:
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("Pipeline file must start with YAML front matter '---'")
    end_idx = None
    for idx in range(1, len(lines)):
        if lines[idx].strip() == "---":
            end_idx = idx
            break
    if end_idx is None:
        raise ValueError("Unterminated YAML front matter (missing closing '---')")
    raw = "\n".join(lines[1:end_idx])
    data = yaml.safe_load(raw) or {}
    if not isinstance(data, dict):
        raise ValueError("Pipeline YAML front matter must be a mapping")
    return data
def _load_variant_aware_frontmatter(path: Path, seen: set[Path] | None = None) -> tuple[dict[str, Any], dict[str, Any]]:
    # Returns (raw front matter of this file, effective front matter after merging its base).
    resolved = path.resolve()
    active = set(seen or set())
    if resolved in active:
        # `active` is a set, so the files in the message are not guaranteed to
        # appear in the order the chain was traversed.
        chain = " -> ".join(str(p) for p in [*active, resolved])
        raise ValueError(f"Cyclic `variant_of` chain detected: {chain}")
    active.add(resolved)
    text = resolved.read_text(encoding="utf-8")
    raw = _parse_frontmatter(text)
    variant_of = str(raw.get("variant_of") or "").strip()
    if not variant_of:
        return raw, dict(raw)
    _validate_raw_variant_frontmatter(raw, path=resolved)
    base_path = _resolve_pipeline_reference(resolved, variant_of)
    _, base_effective = _load_variant_aware_frontmatter(base_path, seen=active)
    # Merge order: base first, then this file's `name`/`version`, then `variant_overrides`.
    current = {key: raw[key] for key in ("name", "version") if key in raw}
    merged = _deep_merge(base_effective, current)
    merged = _deep_merge(
        merged,
        _mapping(raw.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False),
    )
    merged["variant_of"] = variant_of
    merged["variant_overrides"] = raw.get("variant_overrides") or {}
    return raw, merged
def _validate_raw_variant_frontmatter(raw: dict[str, Any], *, path: Path) -> None:
    allowed_raw_keys = {"name", "version", "variant_of", "variant_overrides"}
    extra_keys = sorted(str(key) for key in raw.keys() if str(key) not in allowed_raw_keys)
    if extra_keys:
        raise ValueError(
            "Variant pipeline files may only keep top-level keys "
            "`name`, `version`, `variant_of`, `variant_overrides`; "
            f"move the rest under `variant_overrides` in {path}: {', '.join(extra_keys)}"
        )
def _resolve_pipeline_reference(path: Path, ref: str) -> Path:
    value = str(ref or "").strip()
    if not value:
        raise ValueError(f"Empty `variant_of` reference in {path}")
    candidate = Path(value)
    if candidate.is_absolute() and candidate.exists():
        return candidate.resolve()
    # First try the reference as a literal relative path.
    repo_root = path.parent.parent
    for base in (path.parent, repo_root, repo_root / "pipelines"):
        direct = (base / value).resolve()
        if direct.exists():
            return direct
    # Then treat it as a bare pipeline name and look for `<name>.pipeline.md`.
    stem = Path(value).name
    if stem.endswith(".pipeline.md"):
        stem = stem[: -len(".pipeline.md")]
    if stem:
        for base in (path.parent, repo_root / "pipelines"):
            direct = (base / f"{stem}.pipeline.md").resolve()
            if direct.exists():
                return direct
    raise ValueError(f"Could not resolve `variant_of: {value}` from {path}")
def _deep_merge(base: Any, override: Any) -> Any:
    if isinstance(base, list) and isinstance(override, dict) and any(str(key).startswith("__") for key in override.keys()):
        return _apply_list_patch(base, override)
    if isinstance(base, dict) and isinstance(override, dict):
        merged: dict[str, Any] = {str(k): v for k, v in base.items()}
        for key, value in override.items():
            if key in merged:
                merged[key] = _deep_merge(merged[key], value)
            else:
                merged[key] = value
        return merged
    return override
def _apply_list_patch(base: list[Any], override: dict[str, Any]) -> list[Any]:
    allowed_keys = {"__append__", "__prepend__", "__remove__", "__replace__"}
    bad_keys = sorted(str(key) for key in override.keys() if str(key) not in allowed_keys)
    if bad_keys:
        raise ValueError(
            "List patch overrides only support "
            "`__append__`, `__prepend__`, `__remove__`, `__replace__`; "
            f"got: {', '.join(bad_keys)}"
        )
    if "__replace__" in override:
        replacement = override.get("__replace__")
        if not isinstance(replacement, list):
            raise ValueError("List patch `__replace__` must be a YAML list")
        return list(replacement)
    current = list(base)
    remove_values = override.get("__remove__", [])
    prepend_values = override.get("__prepend__", [])
    append_values = override.get("__append__", [])
    for field_name, value in (
        ("__remove__", remove_values),
        ("__prepend__", prepend_values),
        ("__append__", append_values),
    ):
        if not isinstance(value, list):
            raise ValueError(f"List patch `{field_name}` must be a YAML list")
    if remove_values:
        current = [item for item in current if item not in remove_values]
    if prepend_values:
        current = list(prepend_values) + current
    if append_values:
        current = current + list(append_values)
    return current
def _mapping(value: Any, *, field_name: str, path: Path, required: bool) -> dict[str, Any]:
    # NOTE: `required` is accepted but not enforced here; a missing value always yields {}.
    if value is None:
        return {}
    if not isinstance(value, dict):
        raise ValueError(f"`{field_name}` must be a mapping in {path}")
    return {str(key): item for key, item in value.items()}
def _string_tuple(value: Any, *, field_name: str, path: Path) -> tuple[str, ...]:
    # Coerces each item to a stripped string and silently drops empty entries.
    if value is None:
        return ()
    if not isinstance(value, list):
        raise ValueError(f"`{field_name}` must be a YAML list in {path}")
    out: list[str] = []
    for item in value:
        text = str(item or "").strip()
        if text:
            out.append(text)
    return tuple(out)
def _parse_stages(value: Any, *, path: Path) -> dict[str, PipelineStage]:
    if value is None:
        return {}
    items: list[tuple[str, Any]] = []
    if isinstance(value, dict):
        items = [(str(stage_id or "").strip(), data) for stage_id, data in value.items()]
    elif isinstance(value, list):
        for idx, item in enumerate(value):
            if not isinstance(item, dict):
                raise ValueError(f"`stages[{idx}]` must be a mapping in {path}")
            stage_id = str(item.get("id") or item.get("stage") or "").strip()
            if not stage_id:
                raise ValueError(f"`stages[{idx}]` is missing `id` in {path}")
            items.append((stage_id, item))
    else:
        raise ValueError(f"`stages` must be a mapping or list in {path}")
    stages: dict[str, PipelineStage] = {}
    for stage_id, data in items:
        stage = PipelineStage.from_mapping(stage_id, data, path=path)
        if not stage.id:
            raise ValueError(f"`stages` contains an empty stage id in {path}")
        stages[stage.id] = stage
    return stages
FILE:tooling/pipeline_text.py
from __future__ import annotations
import json
import re
from pathlib import Path
from typing import Any
from tooling.common import load_yaml, read_jsonl
def slug_unit_id(unit_id: str) -> str:
    raw = str(unit_id or '').strip()
    out: list[str] = []
    for ch in raw:
        out.append(ch if ch.isalnum() else '_')
    safe = ''.join(out).strip('_')
    return f'S{safe}' if safe else 'S'
def load_outline_sections(path: Path) -> list[dict[str, Any]]:
    outline = load_yaml(path) if path.exists() else []
    if not isinstance(outline, list):
        return []
    out: list[dict[str, Any]] = []
    for sec in outline:
        if not isinstance(sec, dict):
            continue
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        subsections: list[dict[str, str]] = []
        for sub in sec.get('subsections') or []:
            if not isinstance(sub, dict):
                continue
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                subsections.append({'id': sub_id, 'title': sub_title})
        if sec_id and sec_title:
            out.append({'id': sec_id, 'title': sec_title, 'subsections': subsections})
    return out
def iter_h3_units(path: Path) -> list[dict[str, str]]:
    units: list[dict[str, str]] = []
    for sec in load_outline_sections(path):
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        for sub in sec.get('subsections') or []:
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                units.append(
                    {
                        'section_id': sec_id,
                        'section_title': sec_title,
                        'sub_id': sub_id,
                        'title': sub_title,
                    }
                )
    return units
def read_jsonl_map(path: Path, key: str) -> dict[str, dict[str, Any]]:
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        val = str(rec.get(key) or '').strip()
        if val:
            out[val] = rec
    return out
def read_bib_keys(path: Path) -> set[str]:
    if not path.exists() or path.stat().st_size <= 0:
        return set()
    text = path.read_text(encoding='utf-8', errors='ignore')
    # Capture the key of each BibTeX entry, e.g. `@article{Smith2024,` -> `Smith2024`.
    return set(re.findall(r'(?im)^@\w+\s*\{\s*([^,\s]+)\s*,', text))
def uniq_keep_order(items: list[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or '').strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out
def clean_excerpt(text: str, *, limit: int = 220) -> str:
    s = str(text or '').strip()
    s = s.replace('\n', ' ')
    s = re.sub(r'\s+', ' ', s)
    s = s.strip(' "\'`')
    s = s.replace('|', ', ')
    if len(s) <= limit:
        return s
    # Clip at the last word boundary before `limit`; fall back to a hard cut.
    clipped = s[:limit].rsplit(' ', 1)[0].strip()
    return clipped if clipped else s[:limit].strip()
def citation_list(keys: list[str], *, max_keys: int = 3) -> str:
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'[@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'
def inline_evidence_phrase(keys: list[str], *, max_keys: int = 3) -> str:
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'in [@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'
def heading_blocks(md: str) -> list[tuple[str, str]]:
    blocks: list[tuple[str, str]] = []
    current = ''
    lines: list[str] = []
    for raw in (md or '').splitlines():
        if raw.startswith('### '):
            # A new H3 closes the previous H3 block and opens a new one.
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = raw[4:].strip()
            lines = []
            continue
        if raw.startswith('## '):
            # An H2 closes the open H3 block; text under the H2 itself is skipped.
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = ''
            lines = []
            continue
        if current:
            lines.append(raw)
    if current:
        blocks.append((current, '\n'.join(lines).strip()))
    return blocks
def dump_jsonl_lines(records: list[dict[str, Any]]) -> str:
    return '\n'.join(json.dumps(r, ensure_ascii=False) for r in records).rstrip() + ('\n' if records else '')
Build per-chapter (H2) writing briefs (NO PROSE) so the final survey reads like a paper (chapter leads + cross-H3 coherence) without inflating the ToC. **Tri...
---
name: chapter-briefs
description: 'Build per-chapter (H2) writing briefs (NO PROSE) so the final survey reads like a paper (chapter leads + cross-H3
coherence) without inflating the ToC.
**Trigger**: chapter briefs, H2 briefs, chapter lead plan, section intent, 章节意图, 章节导读, H2 卡片.
**Use when**: `outline/outline.yml` + `outline/subsection_briefs.jsonl` exist and you want thicker chapters (fewer headings,
more logic).
**Skip if**: the outline is still changing heavily (fix outline/mapping first).
**Network**: none.
**Guardrail**: NO PROSE; do not invent papers; only reference subsection ids and already-mapped papers.'
version: 0.1.0
metadata:
  openclaw:
    requires:
      anyBins:
        - python3
        - python
---
# Chapter Briefs (H2 writing cards) [NO PROSE]
Purpose: turn each **H2 chapter that contains H3 subsections** into a chapter-level writing card so the writer can:
- add a chapter lead paragraph block (coherence)
- keep a consistent comparison axis across the chapter
- avoid “8 small islands” where every H3 restarts from scratch
This artifact is **internal intent**, not reader-facing prose.
Why this matters for writing quality:
- Chapter briefs prevent the "paragraph island" failure mode: without a throughline, each H3 restarts and repeats openers.
- Treat `throughline` and `lead_paragraph_plan` as decision constraints, not copyable sentences.
## Inputs
- `outline/outline.yml`
- `outline/subsection_briefs.jsonl`
- Optional: `GOAL.md`
## Outputs
- `outline/chapter_briefs.jsonl`
## Output format (`outline/chapter_briefs.jsonl`)
JSONL (one object per H2 chapter that has H3 subsections).
Required fields:
- `section_id`, `section_title`
- `subsections` (list of `{sub_id,title}` in outline order)
- `synthesis_mode` (one of: `clusters`, `timeline`, `tradeoff_matrix`, `case_study`, `tension_resolution`)
- `synthesis_preview` (1–2 bullets; how the chapter will synthesize across H3 without template-y “Taken together…”)
- `throughline` (3–6 bullets)
- `key_contrasts` (2–6 bullets; pull from each H3 `contrast_hook` when available)
- `lead_paragraph_plan` (2–3 bullets; plan only, not prose)
  - Each bullet should be chapter-specific and mention concrete handles (axes / contrast hooks / evaluation lens).
  - Avoid generic glue like "Para 1: introduce the chapter" without naming what is being compared.
- `bridge_terms` (5–12 tokens; union of H3 bridge terms)
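A minimal sketch of how one record could be checked against the required fields listed above. The field names, the allowed `synthesis_mode` values, and the bullet-count ranges come from that list; the function name `check_chapter_brief` and the wording of the problem messages are hypothetical and are not part of the bundled script.

```python
from typing import Any

# Allowed values and bullet-count ranges, as stated in the "Required fields" list.
SYNTHESIS_MODES = {"clusters", "timeline", "tradeoff_matrix", "case_study", "tension_resolution"}
LIST_RANGES = {
    "synthesis_preview": (1, 2),
    "throughline": (3, 6),
    "key_contrasts": (2, 6),
    "lead_paragraph_plan": (2, 3),
    "bridge_terms": (5, 12),
}


def check_chapter_brief(record: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the record looks well-formed.

    Hypothetical helper: it only checks shape, not whether the bullets are chapter-specific.
    """
    problems: list[str] = []
    for field in ("section_id", "section_title"):
        if not str(record.get(field) or "").strip():
            problems.append(f"missing {field}")
    subs = record.get("subsections")
    if not isinstance(subs, list) or not subs:
        problems.append("subsections must be a non-empty list")
    else:
        for i, sub in enumerate(subs):
            if not isinstance(sub, dict) or not sub.get("sub_id") or not sub.get("title"):
                problems.append(f"subsections[{i}] needs sub_id and title")
    if record.get("synthesis_mode") not in SYNTHESIS_MODES:
        problems.append("synthesis_mode is not one of the allowed values")
    for field, (lo, hi) in LIST_RANGES.items():
        value = record.get(field)
        if not isinstance(value, list) or not lo <= len(value) <= hi:
            problems.append(f"{field} should have {lo}-{hi} items")
    return problems
```

A shape check like this cannot judge whether a `throughline` is generic; that still needs the manual review described in the quality checklist.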
## How C5 uses this (chapter lead contract)
The writer uses `outline/chapter_briefs.jsonl` to draft `sections/S<sec_id>_lead.md` (body-only; no headings).
Contract (paper-like, no new facts):
- Preview the chapter’s comparison axes (2–3) and how the H3s connect; do not restate the table of contents.
- Reuse `key_contrasts` / `bridge_terms` as *handles* (not templates) so the chapter reads coherent without repeating "Taken together" everywhere.
- Keep it grounded (>=2 citations later in C5; do not invent new papers here).
## Workflow
0. (Optional) Read `GOAL.md` to pin scope/audience, and inject that constraint into the chapter throughline.
1. Read `outline/outline.yml` and list H2 chapters that have H3 subsections.
2. Read `outline/subsection_briefs.jsonl` and group briefs by `section_id`.
3. For each chapter, produce:
   - a **throughline**: what the whole chapter is trying to compare/explain
   - **key contrasts**: 2–6 contrasts that span multiple H3s
   - a **synthesis_mode**: enforce synthesis diversity across chapters (avoid repeating the same closing paragraph shape)
   - a **lead paragraph plan**: 2–3 paragraph objectives (what the chapter lead must do)
   - a **bridge_terms** set to keep terminology stable across H3s
4. Write `outline/chapter_briefs.jsonl`.
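Steps 2 and 3 can be sketched roughly as follows. This is an illustration under assumptions, not the bundled `scripts/run.py`: it assumes each subsection brief carries `section_id`, an optional `contrast_hook`, and an optional `bridge_terms` list, and that outline sections look like `{id, title, subsections: [{id, title}]}`. The helper names `group_briefs_by_section` and `seed_chapter_card` are hypothetical. Only the mechanical fields are seeded; `throughline`, `synthesis_mode`, and `lead_paragraph_plan` still have to be written per chapter.

```python
import json
from collections import defaultdict
from typing import Any


def group_briefs_by_section(jsonl_text: str) -> dict[str, list[dict[str, Any]]]:
    """Group subsection briefs (one JSON object per line) by their `section_id`."""
    grouped: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        sec_id = str(rec.get("section_id") or "").strip()
        if sec_id:
            grouped[sec_id].append(rec)
    return dict(grouped)


def seed_chapter_card(section: dict[str, Any], briefs: list[dict[str, Any]]) -> dict[str, Any]:
    """Seed the mechanical parts of a chapter card from its H3 briefs (hypothetical helper)."""
    hooks: list[str] = []
    terms: list[str] = []
    for brief in briefs:
        hook = str(brief.get("contrast_hook") or "").strip()
        if hook and hook not in hooks:
            hooks.append(hook)
        for term in brief.get("bridge_terms") or []:
            if term not in terms:
                terms.append(term)
    return {
        "section_id": section["id"],
        "section_title": section["title"],
        "subsections": [{"sub_id": s["id"], "title": s["title"]} for s in section["subsections"]],
        "key_contrasts": hooks[:6],   # upper bound from the output format
        "bridge_terms": terms[:12],   # upper bound from the output format
    }
```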
## Quality checklist
- [ ] One record per H2-with-H3 chapter.
- [ ] No placeholders (`TODO`/`…`/`(placeholder)`/template instructions).
- [ ] `throughline` and `key_contrasts` are chapter-specific (not copy/paste generic).
- [ ] `lead_paragraph_plan` bullets explicitly preview 2–3 comparison axes and how the H3 subsections partition them (no generic chapter-intro boilerplate).
## Script
### Quick Start
- `python scripts/run.py --help`
- `python scripts/run.py --workspace workspaces/<ws>`
### All Options
- `--workspace <dir>`
- `--unit-id <U###>`
- `--inputs <semicolon-separated>`
- `--outputs <semicolon-separated>`
- `--checkpoint <C#>`
### Examples
- Default IO:
  - `python scripts/run.py --workspace workspaces/<ws>`
- Explicit IO:
  - `python scripts/run.py --workspace workspaces/<ws> --inputs "outline/outline.yml;outline/subsection_briefs.jsonl;GOAL.md" --outputs "outline/chapter_briefs.jsonl"`
### Refinement marker (recommended; prevents churn)
When you are satisfied with chapter briefs, create:
- `outline/chapter_briefs.refined.ok`
This is an explicit "I reviewed/refined this" signal:
- prevents scripts from regenerating and undoing your work
- (in strict runs) can be used as a completion signal to avoid silently accepting a bootstrap scaffold
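The freeze behaviour described above could be implemented along these lines. This is a sketch under assumptions: the marker is taken to sit next to the artifact and to be named `<stem>.refined.ok`, which matches `outline/chapter_briefs.refined.ok`; the function name `should_regenerate` is hypothetical and the bundled script may decide differently.

```python
from pathlib import Path


def should_regenerate(output_path: Path) -> bool:
    """Return False when a `<stem>.refined.ok` marker sits next to the artifact.

    Hypothetical helper: e.g. `outline/chapter_briefs.jsonl` is guarded by
    `outline/chapter_briefs.refined.ok`.
    """
    marker = output_path.with_name(output_path.stem + ".refined.ok")
    return not marker.exists()
```

A script would call this before writing and skip the write when it returns `False`, so manual refinements are not overwritten.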
### Notes
- This helper is a bootstrap; refine manually if needed.
FILE:AGENTS.md
# AGENTS.md
This exported skill uses `AGENTS.md` only as a local repo-root marker for bundled helper scripts.
Source skill: `chapter-briefs` from `research-units-pipeline-skills`.
Export slug: `chapter-briefs`.
FILE:pipelines/arxiv-survey-latex.pipeline.md
---
name: arxiv-survey-latex
version: 3.8
variant_of: arxiv-survey
variant_overrides:
  routing_hints: [latex, pdf, tex, 可编译, 编译]
  routing_default: false
  routing_priority: 20
  units_template: templates/UNITS.arxiv-survey-latex.csv
  target_artifacts:
    __append__:
      - latex/main.tex
      - latex/main.pdf
      - output/LATEX_BUILD_REPORT.md
  stages:
    C5:
      title: Draft + PDF
      required_skills:
        __append__:
          - latex-scaffold
          - latex-compile-qa
      optional_skills:
        __remove__:
          - latex-scaffold
          - latex-compile-qa
      produces:
        __append__:
          - latex/main.tex
          - latex/main.pdf
          - output/LATEX_BUILD_REPORT.md
---
# Pipeline: arXiv survey / review (MD-first + LaTeX/PDF)
Variant of `arxiv-survey`.
Use this pipeline only when the default deliverable must include:
- `latex/main.tex`
- `latex/main.pdf`
- `output/LATEX_BUILD_REPORT.md`
All non-PDF survey behavior, defaults, checkpoints, and earlier stages inherit from `arxiv-survey`.
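The `__append__` / `__remove__` directives in the front matter are list patches applied on top of the base `arxiv-survey` lists. A simplified sketch of that semantics is shown below, assuming the same order as the bundled loader (remove, then prepend, then append, with `__replace__` taking precedence). The function `apply_list_patch` is an illustrative reimplementation without the loader's validation, and the `base_*` lists are shortened examples rather than the full base pipeline.

```python
from typing import Any


def apply_list_patch(base: list[Any], patch: dict[str, Any]) -> list[Any]:
    """Apply a list patch: `__replace__` wins; otherwise remove, prepend, append."""
    if "__replace__" in patch:
        return list(patch["__replace__"])
    removed = patch.get("__remove__", [])
    current = [item for item in base if item not in removed]
    return list(patch.get("__prepend__", [])) + current + list(patch.get("__append__", []))


# Shortened example lists (assumption: the real base lists are longer).
base_required = ["section-merger", "draft-polisher"]
base_optional = ["prose-writer", "latex-scaffold", "latex-compile-qa"]

# In the LaTeX variant the two LaTeX skills move from optional to required for C5.
variant_required = apply_list_patch(base_required, {"__append__": ["latex-scaffold", "latex-compile-qa"]})
variant_optional = apply_list_patch(base_optional, {"__remove__": ["latex-scaffold", "latex-compile-qa"]})
```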
FILE:pipelines/arxiv-survey.pipeline.md
---
name: arxiv-survey
version: 3.8
profile: arxiv-survey
routing_hints: [survey, review, 综述, 调研, literature review]
routing_default: true
routing_priority: 10
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
- outline/transitions.md
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/claim_evidence_matrix.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- outline/writer_context_packs.jsonl
- citations/ref.bib
- citations/verified.jsonl
- sections/sections_manifest.jsonl
- sections/h3_bodies.refined.ok
- sections/paragraphs_curated.refined.ok
- sections/style_harmonized.refined.ok
- sections/opener_varied.refined.ok
- sections/abstract.md
- sections/S1.md
- sections/S2.md
- sections/discussion.md
- sections/conclusion.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/SCHEMA_NORMALIZATION_REPORT.md
- output/EVIDENCE_SELFLOOP_TODO.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/PARAGRAPH_CURATION_REPORT.md
- output/FRONT_MATTER_REPORT.md
- output/CHAPTER_LEADS_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/GLOBAL_REVIEW.md
- output/DRAFT.md
- output/MERGE_REPORT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.arxiv-survey.csv
contract_model: pipeline.frontmatter/v1
structure_mode: section_first
pre_retrieval_shell:
  enabled: true
  approval_surface: false
  allowed_h2: [Introduction, Related Work, Core Chapters, Discussion, Conclusion]
  binding_layers: [chapter_skeleton, section_bindings, section_briefs, subsection_mapping]
  core_chapter_h3_target: 3
query_defaults:
  max_results: 1800
  core_size: 300
  per_subsection: 28
  global_citation_min_subsections: 4
  draft_profile: survey
  citation_target: recommended
  evidence_mode: abstract
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- per_subsection
- global_citation_min_subsections
- draft_profile
- citation_target
- enrich_metadata
- evidence_mode
- fulltext_max_papers
- fulltext_max_pages
- fulltext_min_chars
- time_window.from
- time_window.to
quality_contract:
  citation_policy:
    unique_hard_floor: 150
    unique_recommended: 165
  structure_policy:
    max_final_h2_by_profile:
      survey: 8
      deep: 9
    max_h3_by_profile:
      survey: 10
      deep: 12
  front_matter_policy:
    survey:
      introduction:
        min_cites: 35
        min_paras: 5
        min_chars: 2600
      related_work:
        min_cites: 50
        min_paras: 6
        min_chars: 3200
    deep:
      introduction:
        min_cites: 40
        min_paras: 6
        min_chars: 3000
      related_work:
        min_cites: 55
        min_paras: 7
        min_chars: 3600
  subsection_policy:
    survey:
      min_unique_citations: 12
      min_chars: 4200
    deep:
      min_unique_citations: 14
      min_chars: 5200
loop_policy:
  stage_retry_budget:
    C1: 2
    C2: 2
    C3: 1
    C4: 1
  max_reroutes: 4
  require_human_on_retry_after_approval: true
stages:
  C0:
    title: Init
    mode: no_prose
    required_skills: [workspace-init, pipeline-router]
    optional_skills: []
    produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
  C1:
    title: Retrieval & core set
    mode: no_prose
    required_skills: [literature-engineer, dedupe-rank]
    optional_skills: [keyword-expansion, survey-seed-harvest]
    produces: [papers/papers_raw.jsonl, papers/retrieval_report.md, papers/papers_dedup.jsonl, papers/core_set.csv]
  C2:
    title: Structure
    mode: no_prose
    required_skills: [taxonomy-builder, chapter-skeleton, section-bindings, section-briefs, outline-builder, section-mapper, outline-refiner, pipeline-router, human-checkpoint]
    optional_skills: [outline-budgeter]
    produces: [outline/taxonomy.yml, outline/chapter_skeleton.yml, outline/section_bindings.jsonl, outline/section_binding_report.md, outline/section_briefs.jsonl, outline/outline.yml, outline/mapping.tsv, outline/coverage_report.md, outline/outline_state.jsonl, output/REROUTE_STATE.json, DECISIONS.md]
    human_checkpoint:
      approve: scope + section skeleton + outline
      write_to: DECISIONS.md
  C3:
    title: Evidence
    mode: no_prose
    required_skills: [pdf-text-extractor, paper-notes, subsection-briefs, chapter-briefs]
    optional_skills: []
    produces: [papers/fulltext_index.jsonl, papers/paper_notes.jsonl, papers/evidence_bank.jsonl, outline/subsection_briefs.jsonl, outline/chapter_briefs.jsonl]
  C4:
    title: Citations + evidence packs
    mode: no_prose
    required_skills: [citation-verifier, evidence-binder, evidence-draft, table-schema, anchor-sheet, table-filler, appendix-table-writer, schema-normalizer, writer-context-pack, evidence-selfloop, claim-matrix-rewriter]
    optional_skills: [survey-visuals]
    produces: [citations/ref.bib, citations/verified.jsonl, outline/evidence_bindings.jsonl, outline/evidence_binding_report.md, outline/table_schema.md, outline/tables_index.md, outline/tables_appendix.md, output/TABLES_APPENDIX_REPORT.md, outline/evidence_drafts.jsonl, outline/anchor_sheet.jsonl, output/SCHEMA_NORMALIZATION_REPORT.md, outline/writer_context_packs.jsonl, output/EVIDENCE_SELFLOOP_TODO.md, outline/claim_evidence_matrix.md]
  C5:
    title: Draft
    mode: prose_allowed
    required_skills: [front-matter-writer, chapter-lead-writer, subsection-writer, writer-selfloop, section-logic-polisher, argument-selfloop, paragraph-curator, style-harmonizer, opener-variator, evaluation-anchor-checker, transition-weaver, section-merger, post-merge-voice-gate, citation-diversifier, citation-injector, draft-polisher, global-reviewer, pipeline-auditor, artifact-contract-auditor]
    optional_skills: [prose-writer, subsection-polisher, redundancy-pruner, terminology-normalizer, limitation-weaver, latex-scaffold, latex-compile-qa]
    produces: [outline/transitions.md, sections/sections_manifest.jsonl, sections/h3_bodies.refined.ok, sections/paragraphs_curated.refined.ok, sections/style_harmonized.refined.ok, sections/opener_varied.refined.ok, sections/abstract.md, sections/S1.md, sections/S2.md, sections/discussion.md, sections/conclusion.md, output/WRITER_SELFLOOP_TODO.md, output/EVAL_ANCHOR_REPORT.md, output/ARGUMENT_SELFLOOP_TODO.md, output/SECTION_ARGUMENT_SUMMARIES.jsonl, output/ARGUMENT_SKELETON.md, output/PARAGRAPH_CURATION_REPORT.md, output/FRONT_MATTER_REPORT.md, output/CHAPTER_LEADS_REPORT.md, output/SECTION_LOGIC_REPORT.md, output/MERGE_REPORT.md, output/DRAFT.md, output/POST_MERGE_VOICE_REPORT.md, output/CITATION_BUDGET_REPORT.md, output/CITATION_INJECTION_REPORT.md, output/GLOBAL_REVIEW.md, output/AUDIT_REPORT.md, output/CONTRACT_REPORT.md]
---
# Pipeline: arXiv survey / review (MD-first)
Default contract (survey-grade, A150++):
- `queries.md` defaults are set for a *survey deliverable* (no silent downgrade): `core_size=300`, `per_subsection=28`, global unique citations hard floor `>=150` (recommended `>=165` when `core_size=300`; default `citation_target=recommended`).
- `draft_profile` controls **writing strictness** (`survey` vs `deep`), not “speed mode”.
- `evidence_mode` controls **evidence strength** (`abstract` default; `fulltext` optional and heavier).
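To make the citation floor concrete, here is a rough sketch of a global unique-citation check. It assumes Pandoc-style citations such as `[@key]` and `[@a; @b]` in the merged draft and treats `citation_target=recommended` as making the recommended threshold the blocking one. The helper names and the exact counting rule are assumptions; the actual auditor may count differently.

```python
import re


def unique_citation_keys(markdown: str) -> set[str]:
    """Collect distinct keys from bracketed citations like `[@a]` or `[@a; @b]`."""
    keys: set[str] = set()
    for block in re.findall(r"\[(@[^\]]+)\]", markdown):
        keys.update(re.findall(r"@([^\s;,\]]+)", block))
    return keys


def citation_gate(
    markdown: str,
    *,
    hard_floor: int = 150,
    recommended: int = 165,
    citation_target: str = "recommended",
) -> dict[str, object]:
    """Compare the unique-key count with the threshold selected by `citation_target`."""
    unique = len(unique_citation_keys(markdown))
    required = recommended if citation_target == "recommended" else hard_floor
    return {"unique": unique, "required": required, "passed": unique >= required}
```

A draft with 158 unique keys would pass under `citation_target="hard"` and fail under the default `recommended` target, which is the situation where `citation-diversifier` and `citation-injector` are meant to run.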
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- survey-seed-harvest
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- `queries.md` may specify `max_results` and a year-based time window (`time_window.from` / `time_window.to`); `arxiv-search` will paginate and attach arXiv metadata (categories, arxiv_id, etc.) when online.
- If you import an offline export but later have network, you can set `enrich_metadata: true` in `queries.md` (or run `arxiv-search --enrich-metadata`) to backfill missing abstracts/authors/categories via arXiv `id_list`.
- Evidence-first expectation (A150++): aim for a large dedup pool (target >=1200, not ~200) and a stable, verifiable core set (`core_size=300`) so later stages can bind wide in-scope citation pools without forcing out-of-scope drift.
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- chapter-skeleton
- section-bindings
- section-briefs
- outline-builder
- section-mapper
- outline-refiner
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
human_checkpoint:
- approve: scope + section skeleton + outline
- write_to: DECISIONS.md
Notes:
- `chapter-skeleton` is the first retrieval-informed chapter contract; it stays chapter-level and does not emit stable H3 ids.
- `section-bindings` measures chapter saturation before H3 decomposition and writes PASS/BLOCKED signals into `outline/section_binding_report.md`.
- `section-briefs` turns the chapter layer into decomposition guidance plus subsection seeds; `outline-builder` then derives the first stable `outline/outline.yml` from that section layer.
- `outline-refiner` now writes both `outline/outline_state.jsonl` and `output/REROUTE_STATE.json`; section-first reroute/block information should be read from those artifacts instead of inferred from prose notes.
- Evidence-first expectation: each subsection should be written as a *question to answer* (RQ) plus *evidence needs* (what kind of citations/results are required), not just generic scaffold bullets.
- Coverage default: `section-mapper` uses `queries.md:per_subsection` as the per-H3 mapping contract (A150++ default: 28) so later evidence binding and writing have enough in-scope citations to choose from.
- Diversity expectation: mapping should not over-reuse a few papers across unrelated H3s; reserve “global” works for genuinely cross-cutting citations (controlled by `global_citation_min_subsections`).
- Budget policy (paper-like): avoid H3 explosion; the outline gate uses `queries.md:draft_profile` to set max H3 (survey<=10, deep<=12).
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent H3s into fewer, thicker units, then rerun `section-mapper` → `outline-refiner` before `Approve C2`.
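Two of the rules above are simple enough to sketch: the per-profile H3 budget and the "global" classification of bibkeys mapped to many subsections. The thresholds (survey 10, deep 12, and 4 subsections) are the defaults stated in this pipeline. The input shape, a list of `(sub_id, bibkey)` pairs, is an assumption, since the column layout of `outline/mapping.tsv` is not shown here, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Iterable

MAX_H3_BY_PROFILE = {"survey": 10, "deep": 12}


def h3_over_budget(h3_count: int, draft_profile: str = "survey") -> bool:
    """True when the outline has more H3 subsections than the profile allows."""
    return h3_count > MAX_H3_BY_PROFILE[draft_profile]


def global_bibkeys(pairs: Iterable[tuple[str, str]], min_subsections: int = 4) -> set[str]:
    """Return bibkeys mapped to at least `min_subsections` distinct subsections.

    `pairs` is an assumed shape: one (sub_id, bibkey) tuple per mapping row.
    """
    subs_by_key: dict[str, set[str]] = defaultdict(set)
    for sub_id, key in pairs:
        subs_by_key[key].add(sub_id)
    return {key for key, subs in subs_by_key.items() if len(subs) >= min_subsections}
```

If `h3_over_budget` is true, that is the case where `outline-budgeter` would be used to merge adjacent H3s before rerunning `section-mapper` and `outline-refiner`.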
## Stage 3 - Evidence (C3) [NO PROSE]
required_skills:
- pdf-text-extractor
- paper-notes
- subsection-briefs
- chapter-briefs
produces:
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
Notes:
- `queries.md` can set `evidence_mode: "abstract"|"fulltext"` (A150++ default: `abstract`).
- `queries.md` can set `draft_profile: "survey"|"deep"` to control writing gate strictness (A150++ default: `survey`).
- If `evidence_mode: "fulltext"`, `pdf-text-extractor` can be tuned via `fulltext_max_papers`, `fulltext_max_pages`, `fulltext_min_chars`.
- `subsection-briefs` converts each H3 into a verifiable writing card (scope_rule/rq/axes/clusters/paragraph_plan) so writing does not copy outline scaffolds.
- Optional refinement markers (recommended): treat briefs as *contracts*, not scaffolds. If you manually refine them and want to prevent regeneration, create:
  - `outline/subsection_briefs.refined.ok`
  - `outline/chapter_briefs.refined.ok`
  These markers are used as explicit “reviewed/refined” signals and as a freeze switch (scripts won’t overwrite refined briefs).
## Stage 4 - Citations + evidence packs (C4) [NO PROSE]
required_skills:
- citation-verifier
- evidence-binder
- evidence-draft
- table-schema
- anchor-sheet
- table-filler
- appendix-table-writer
- schema-normalizer
- writer-context-pack
- evidence-selfloop
- claim-matrix-rewriter
optional_skills:
- survey-visuals
produces:
- citations/ref.bib
- citations/verified.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- output/SCHEMA_NORMALIZATION_REPORT.md
- outline/writer_context_packs.jsonl
- output/EVIDENCE_SELFLOOP_TODO.md
- outline/claim_evidence_matrix.md
Notes:
- `evidence-draft` turns paper notes into per-subsection evidence packs (claim candidates + concrete comparisons + eval protocol + limitations) that the writer must follow.
- `claim-matrix-rewriter` makes `outline/claim_evidence_matrix.md` a projection/index of evidence packs (not an outline expansion), so writer guidance stays evidence-first.
- `writer-context-pack` builds a deterministic per-H3 drafting pack (briefs + evidence + anchors + allowed cites), reducing hollow writing and making C5 more debuggable.
- Tables are part of the default survey deliverable, but split into two layers:
  - `outline/tables_index.md` (internal index; produced by `table-filler`; useful for planning/debugging; NOT inserted into the paper)
  - `outline/tables_appendix.md` (reader-facing; produced by `appendix-table-writer`; clean/publishable; inserted into the draft as an Appendix block by `section-merger`)
- Optional: `survey-visuals` can still produce timeline/figure specs as intermediate artifacts.
- Optional refinement markers (recommended): after you spot-check/refine C4 artifacts and want to freeze them, create:
  - `outline/evidence_bindings.refined.ok`
  - `outline/evidence_drafts.refined.ok`
  - `outline/anchor_sheet.refined.ok`
  - `outline/writer_context_packs.refined.ok`
  These markers make “reviewed/refined” explicit and prevent accidental regeneration/overwrite; strict mode relies on content checks (placeholders/blocking_missing/scope), not marker presence.
## Stage 5 - Draft (C5) [PROSE AFTER C2]
required_skills:
- front-matter-writer
- chapter-lead-writer
- subsection-writer
- writer-selfloop
- section-logic-polisher
- argument-selfloop
- paragraph-curator
- style-harmonizer
- opener-variator
- evaluation-anchor-checker
- transition-weaver
- section-merger
- post-merge-voice-gate
- citation-diversifier
- citation-injector
- draft-polisher
- global-reviewer
- pipeline-auditor
- artifact-contract-auditor
optional_skills:
- prose-writer
- subsection-polisher
- redundancy-pruner
- terminology-normalizer
- limitation-weaver
- latex-scaffold
- latex-compile-qa
produces:
- sections/sections_manifest.jsonl
- sections/abstract.md
- sections/discussion.md
- sections/conclusion.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/MERGE_REPORT.md
- output/DRAFT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/GLOBAL_REVIEW.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
Notes:
- C5 writing system (semantic + minimal artifacts; no extra machinery):
- **Unit of work**: `sections/*.md` (front matter, H2 leads, H3 bodies). Avoid editing `output/DRAFT.md` directly until after merge.
- **Single source of truth (locked definitions)**: `output/ARGUMENT_SKELETON.md` → `## Consistency Contract` (terminology, scope boundary, evaluation protocol fields, baseline naming).
- **Write → check → fix (five gates)**:
1) `writer-selfloop` → `output/WRITER_SELFLOOP_TODO.md`: file existence, depth, citation scope, paper voice.
2) `section-logic-polisher` → `output/SECTION_LOGIC_REPORT.md`: paragraph linkage (no jump cuts / “paragraph islands”).
3) `argument-selfloop` → `output/ARGUMENT_SELFLOOP_TODO.md` + `output/ARGUMENT_SKELETON.md` + `output/SECTION_ARGUMENT_SUMMARIES.jsonl`: section-level closure + premise/definition stability.
4) `paragraph-curator` → `output/PARAGRAPH_CURATION_REPORT.md`: **select → evaluate → subset → fuse** so sections converge (reduce redundancy, strengthen synthesis) without changing citation keys.
5) `evaluation-anchor-checker` → `output/EVAL_ANCHOR_REPORT.md`: final section-level numeric hygiene sweep so surviving numeric claims keep same-sentence task/metric/constraint context before merge.
- **Openers-last**: draft the middle first; rewrite paragraph 1 last so it reflects real content (front matter + H3).
- Writing self-loop gate: `subsection-writer` ensures the full `sections/` file set exists (and emits `sections/sections_manifest.jsonl`); `writer-selfloop` blocks until depth/citation-scope/paper-voice checks pass, writing `output/WRITER_SELFLOOP_TODO.md` (PASS/FAIL).
- Argument self-loop gate: `argument-selfloop` blocks “smooth but hollow” sections by making the argument chain explicit (per-section paragraph moves + a global dependency skeleton). Its ledgers are intermediate artifacts and must never be merged into the paper.
- Style hygiene (C5 hard gate for `survey`/`deep`): treat `output/WRITER_SELFLOOP_TODO.md` Style Smells as mandatory fixes. Run `style-harmonizer` + `opener-variator` on flagged files, then rerun `writer-selfloop` before merge.
- Numeric hygiene gate: run `evaluation-anchor-checker` after the last section-level rewrites (`paragraph-curator` + `style-harmonizer` + `opener-variator`) and before merge. If later section rewrites touch the same H3 files, rerun it; do not wait for `pipeline-auditor` to discover underspecified numbers in the merged draft.
- Micro-fix routing (preferred over broad rewrites): if Style Smells are specific, use targeted micro-skills before a general harmonize pass:
- opener cadence / “overview” narration → `opener-variator`
- count-based limitation slots (“Two limitations…”) → `limitation-weaver`
- underspecified numeric/performance claims (missing task/metric/budget) → `evaluation-anchor-checker`
- Triage rule (prevents patching evidence gaps with prose): if `writer-selfloop` FAILs because a subsection cannot meet `must_use` *in-scope* (thin packs / missing anchors / out-of-scope citation pressure), stop and rerun the evidence loop (`evidence-selfloop` + upstream C2/C3/C4) instead of padding prose.
- WebWeaver-style “planner vs writer” split (single agent, two passes):
- Planner pass: for each section/subsection, pick the exact citation IDs to use from the evidence bank (`outline/evidence_drafts.jsonl`) and keep scope consistent with the outline.
- Writer pass: write that section using only those citation IDs; avoid dumping the whole notes set into context.
- Treat this stage as an iteration loop: draft per H3 → logic-polish (thesis + connectors) → weave transitions → merge → de-template/cohere → global review → (if gaps) back to C3/C4 → regenerate.
- Post-merge voice gate: `post-merge-voice-gate` treats `outline/transitions.md` as a high-frequency injection source. If it FAILs, fix the *source* (usually transitions via `transition-weaver`, or the owning `sections/*.md`) and re-merge; do not “patch around it” in `draft-polisher`.
- Depth target (profile-aware): each H3 should be few but thick (avoid stubs). Use `queries.md:draft_profile` as the contract:
  - `survey`: >=10 paragraphs + >=12 unique cites
  - `deep`: >=11 paragraphs + >=14 unique cites
  In all profiles, require >=2 concrete contrasts + evaluation anchoring + a cross-paper synthesis paragraph + an explicit limitation.
- Profile semantics: `survey` is the default deliverable contract; `deep` is stricter (and typically pairs well with `evidence_mode: fulltext`).
- Coherence target (paper-like): for every H2 chapter with H3 subsections, write a short **chapter lead** block (`sections/S<sec_id>_lead.md`) that previews the comparison axes and how the H3s connect (no new headings; avoid generic glue).
- Anti-template style contract (paper-like, not “outline narration”):
- Avoid meta openers like “This subsection surveys/argues …” and slide-like navigation (“Next, we move from … / We now turn to …”).
- Keep signposting light: avoid repeating a literal opener label across many subsections (e.g., `Key takeaway:`); vary opener phrasing and cadence.
- Tone target: calm, academic, understated; delete hype words (`clearly`, `obviously`) and “PPT speaker notes”.
- Keep evidence-policy disclaimers **once** in front matter (not repeated across H3s).
- If you cite numbers, include minimal evaluation context (task + metric + constraint/budget/cost) in the same paragraph.
- Citation shape must be reader-facing: no adjacent citation blocks (e.g., `[@a] [@b]`), no duplicate keys in one block (e.g., `[@a; @a]`), and avoid tail-only citation style by keeping mid-sentence citations in each H3.
- `section-merger` merges `sections/*.md` plus `outline/transitions.md` (within-chapter H3→H3 by default). Between-H2 transition insertion is optional: create `outline/transitions.insert_h2.ok` in the workspace if you want narrator-style handoffs included.
- Tables are part of the default deliverable: `outline/tables_appendix.md` is inserted into the draft by `section-merger` as a single Appendix block (index tables in `outline/tables_index.md` remain intermediate) unless `outline/tables.insert.off` exists. Other visuals (`outline/timeline.md`, `outline/figures.md`) remain intermediate by default.
- Citation scope policy: citations are subsection-first (from `outline/evidence_bindings.jsonl`), with limited reuse allowed within the same H2 chapter to reduce brittleness; avoid cross-chapter “free cite” drift.
- Controlled flexibility: bibkeys mapped to >= `queries.md:global_citation_min_subsections` subsections (A150++ default: 4) are treated as cross-cutting/global; see `allowed_bibkeys_global` in writer packs / `sections_manifest.jsonl`.
- If global unique citations are low, run `citation-diversifier` → `citation-injector` *before* `draft-polisher` (the polisher treats citation keys as immutable).
- `queries.md` can set `citation_target: recommended|hard` to control whether the recommended target is enforced as blocking (default: `recommended` for A150++).
- If you intentionally add/remove citations after an earlier polish run, reset the citation-anchoring baseline before rerunning `draft-polisher`:
  - delete `output/citation_anchors.prepolish.jsonl` (workspace-local), then rerun `draft-polisher`.
- Recommended skills (toolkit, not a rigid one-shot chain):
  - Modular drafting: `subsection-writer` → `writer-selfloop` → `section-logic-polisher` → `argument-selfloop` → `paragraph-curator` → `style-harmonizer` → `opener-variator` → `evaluation-anchor-checker` → `transition-weaver` → `section-merger` → `draft-polisher` → `global-reviewer` → `pipeline-auditor`.
  - Legacy one-shot drafting: `prose-writer` (kept for quick experiments; less debuggable).
- If the draft reads like “paragraph islands”, run `section-logic-polisher` and patch only failing `sections/S*.md` until PASS, then merge.
- Add `pipeline-auditor` after `global-reviewer` as a regression test (blocks on ellipsis, repeated boilerplate, and citation hygiene).
- If you also need a PDF deliverable, use `latex-scaffold` + `latex-compile-qa` (see `arxiv-survey-latex`).
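
The controlled-flexibility rule above (bibkeys mapped to at least `global_citation_min_subsections` subsections count as cross-cutting) can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes the rows of `outline/evidence_bindings.jsonl` have already been loaded into dicts, and the field names `sub_id` and `bibkeys`, the function name, and the sample keys are all hypothetical.

```python
from __future__ import annotations

from collections import defaultdict


def global_bibkeys(bindings: list[dict], min_subsections: int = 4) -> set[str]:
    """Return bibkeys mapped to at least `min_subsections` distinct subsections.

    `bindings` rows use hypothetical fields: `sub_id` (subsection id) and
    `bibkeys` (list of citation keys bound to that subsection).
    """
    subsections_by_key: dict[str, set[str]] = defaultdict(set)
    for row in bindings:
        for key in row.get("bibkeys", []):
            subsections_by_key[key].add(row["sub_id"])
    return {
        key
        for key, subsections in subsections_by_key.items()
        if len(subsections) >= min_subsections
    }


# Made-up sample data, for illustration only.
bindings = [
    {"sub_id": "3.1", "bibkeys": ["Smith2023", "Lee2024"]},
    {"sub_id": "3.2", "bibkeys": ["Smith2023"]},
    {"sub_id": "4.1", "bibkeys": ["Smith2023", "Lee2024"]},
    {"sub_id": "4.2", "bibkeys": ["Smith2023"]},
]
print(sorted(global_bibkeys(bindings)))
```

Counting distinct subsections (a set per key) rather than raw occurrences keeps a key that is cited several times inside one subsection from being promoted to global.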
## Quality gates (strict mode)
- Citation coverage: expect a large, verifiable bibliography (A150++ default: `core_size=300` → `ref.bib` ~300) and high cite density:
  - Per-H3: `survey` profile expects >=12 unique citations per H3 (and deeper profiles may require more).
  - Front matter: `survey` profile expects Introduction>=35 and Related Work>=50 unique citations (dense positioning; no cite dumps).
  - Global: `pipeline-auditor` gates on **global unique citations across the full draft**. A150++ defaults: hard `>=150`; recommended `>=165` (when bib=300). `queries.md:citation_target` controls which is blocking (default: `recommended`). If it fails, prefer `citation-diversifier` → `citation-injector` (in-scope, NO NEW FACTS) using each H3’s `allowed_bibkeys_selected` / `allowed_bibkeys_mapped` from `outline/writer_context_packs.jsonl`.
- Anti-template: drafts containing ellipsis placeholders (`…`) or leaked scaffold instructions (e.g., "enumerate 2-4 ...") should block and be regenerated from improved outline/mapping/evidence artifacts.
- Final polish hard gates (`survey`/`deep`): block on narration-template openers (e.g., `This subsection ...`), slide navigation phrasing, repeated opener stems/verbal tics, adjacent citation blocks (`[@a] [@b]`), duplicate keys in one block (`[@a; @a]`), and low H3 mid-sentence citation ratio (<30%).
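
The citation-shape gates above (adjacent blocks, duplicate keys in one block, mid-sentence ratio) can be approximated with a few regular expressions. This is a rough sketch, not the auditor's real implementation: the function names are hypothetical, and the definition of a "tail" citation (a block followed only by whitespace and sentence-ending punctuation) is an assumption about how the ratio might be measured.

```python
import re

CITE_BLOCK = re.compile(r"\[@[^\]]+\]")
ADJACENT = re.compile(r"\[@[^\]]+\]\s*\[@[^\]]+\]")


def block_keys(block: str) -> list[str]:
    """Split a block such as '[@a; @b]' into its keys ['a', 'b']."""
    return [part.strip().lstrip("@") for part in block[1:-1].split(";") if part.strip()]


def citation_shape_issues(text: str) -> dict:
    """Report adjacent blocks, duplicate-key blocks, and the mid-sentence ratio."""
    blocks = list(CITE_BLOCK.finditer(text))
    duplicates = []
    for match in blocks:
        keys = block_keys(match.group(0))
        if len(keys) != len(set(keys)):
            duplicates.append(match.group(0))
    # Assumption: a block is "tail" when only whitespace separates it from
    # sentence-ending punctuation (or the end of the text); otherwise it is mid-sentence.
    mid = sum(1 for m in blocks if not re.match(r"\s*([.!?]|$)", text[m.end():]))
    ratio = mid / len(blocks) if blocks else 0.0
    return {
        "adjacent_blocks": bool(ADJACENT.search(text)),
        "duplicate_key_blocks": duplicates,
        "mid_sentence_ratio": ratio,
    }


clean = "Planner agents [@a] split tasks, while critics verify outputs [@b; @c]."
messy = "Tools help [@a] [@b]. Memory helps [@a; @a]."
print(citation_shape_issues(clean))
print(citation_shape_issues(messy))
```

A check like this would run per H3 body so that the `<30%` threshold is applied to each subsection rather than to the whole draft.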
FILE:pipelines/graduate-paper-pipeline.md
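# Pipeline: Restructuring and Finalizing a Chinese Graduate Thesis (research-stage draft)
> Status: research-stage draft
> Current positioning: study the workflow and the skills it needs first; **do not bind UNITS yet, and do not force it into an executable contract**
> Intended for: Chinese graduate thesis writing that starts from an existing university template, prior papers, Overleaf sources, PDFs, BibTeX, experiment figures and tables, and intermediate working documents
> Core goal: restructure the “existing materials” into a Chinese graduate thesis with a unified main line, smooth structure, complete evidence, consistent terminology, and a compilable deliverable
## 1. What this Pipeline really is
This is not a linear “write a thesis from scratch” workflow, nor an assembly workflow that “stitches a few papers together”. It is a thesis-engineering Pipeline that is **driven by a question list, centered on a Markdown intermediate layer, and closed out in a TeX delivery layer**.
Its actual main loop is:
> `question list -> semantic localization -> Markdown restructuring -> TeX write-back -> compile review -> issue write-back -> next round`
The first goal of this Pipeline is therefore not to “rush out a draft of the body”, but to get the following right first:
1. Recover the existing materials first, rather than editing blindly in `tex`.
2. Restructure chapter roles around the thesis main line first, rather than copying the narrative of the original papers.
3. Work out structure, evidence, terminology, and figures in the Markdown intermediate layer first, then write back to TeX.
4. Resolve structure and evidence problems first, then style and “AI flavor” problems.
5. The real convergence point is not “finishing one pass”, but “the question list is cleared round by round, and compilation and review pass stably”.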
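## 2. The fundamental difference between this Pipeline and the survey Pipeline
It differs substantially from `arxiv-survey` / `arxiv-survey-latex` and must be modeled separately:
- the survey pipeline starts from “retrieving literature by topic and converging the structure layer by layer”;
- the graduate-paper pipeline starts from “taking inventory of existing materials and restructuring them”;
- the core intermediate artifact of the survey pipeline is `outline/evidence_drafts/...`;
- the core intermediate artifacts of the graduate-paper pipeline are the outline, question list, intermediate chapter drafts, figure plan, and review records in `codex_md/`;
- the main goal of the survey pipeline is to “produce a survey”;
- the main goal of the graduate-paper pipeline is to “restructure existing research work into a finalized thesis that conforms to Chinese degree-thesis conventions”.
In one sentence:
> survey is more like a “retrieval-driven evidence-writing workflow”;
> graduate-paper is more like an “existing-material-driven thesis-engineering restructuring workflow”.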
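## 3. Workspace layers and key artifacts
This Pipeline needs a clear separation between the **thinking layer** and the **delivery layer**.
### 3.1 Directory responsibilities
- `main.tex`: the thesis entry point; assembles the cover, abstract, table of contents, body, appendices, acknowledgements, achievements, and references.
- `chapters/`: final deliverable body `.tex` files.
- `abstract/`, `preface/`, `acknowledgement/`, `achievements/`: formal deliverable parts outside the body.
- `references/`: bibliography database and style files.
- `pdf/`: PDFs of published or submitted papers.
- `Overleaf_ref/`: existing source drafts, revision drafts, and supplementary materials.
- `codex_md/`: the intermediate working layer; this is the **thesis restructuring workspace**.
- `claude_md/`: review checklists and final-draft verification records.
- `tmp_layout/`, `tmp_layout2/`: figure/table trial layouts and page-layout previews.
The most important distinction is:
- `codex_md/` is the **thinking / restructuring area**
- `chapters/` is the **delivery / finalization area**
Do not mix the two.
### 3.2 Intermediate artifacts worth fixing in place
The following artifacts should be kept stable over the long term in this Pipeline:
- `codex_md/material_index.md`: inventory and index of existing materials
- `codex_md/material_readiness.md`: material readiness and gap notes
- `codex_md/missing_info.md`: list of missing information
- `codex_md/question_list.md`: this round's issue list, priorities, and acceptance criteria
- `codex_md/00_thesis_outline.md`: thesis main line, outline, and chapter roles
- `codex_md/chapter_role_map.md`: source material -> chapter -> role mapping table
- `codex_md/chapter_rewrite_rules.md`: chapter restructuring rules table
- `codex_md/terminology_glossary.md`: unified terminology table
- `codex_md/symbol_metric_table.md`: unified symbol / metric table
- `codex_md/figure_plan.md`: figure/table and layout plan
- `claude_md/review_checklist.md`: review checklist for compilation, typesetting, data, and template items
- `output/THESIS_BUILD_REPORT.md`: compilation and final-draft check report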
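## 4. The skills this Pipeline actually needs
For now, reuse and generalization are set aside; the split follows what this graduate-thesis Pipeline itself needs. The first version should use the following skills.
### 4.1 Main-chain skills
1. `thesis-workspace-init`
Responsibility: initialize the thesis project, remind the user where to place materials, set up the workspace, check whether `main.tex` compiles, and generate the material inventory.
2. `thesis-question-list`
Responsibility: create and maintain `question_list.md`, pinning down each round's revision goals, issue priorities, boundaries, and acceptance criteria.
This is the **control-plane skill** of the whole Pipeline.
3. `thesis-source-role-mapper`
Responsibility: map existing papers / templates / source drafts / PDFs / figure materials to “thesis roles”, rather than only doing `paper -> chapter`.
4. `thesis-chapter-reconstructor`
Responsibility: restructure each chapter's goal, weight, handoffs, and narrative approach around the thesis main line.
This is the **core restructuring skill** of the whole Pipeline.
5. `thesis-markdown-aligner`
Responsibility: unify the main line, terminology, symbols, metrics, and figure/table conventions, so that the chapters converge in the Markdown intermediate layer into one thesis rather than a set of papers.
6. `thesis-tex-writeback`
Responsibility: write content that has already been straightened out in the Markdown layer back into `chapters/*.tex`, and synchronize figures, tables, equations, cross-references, and chapter handoffs.
7. `thesis-compile-review`
Responsibility: run compilation, warning triage, template-mode checks, data-consistency verification, and issue write-back.
It is responsible for forming the final quality loop.
8. `thesis-style-polisher`
Responsibility: once structure, evidence, and data are stable, polish toward Chinese degree-thesis style and remove AI flavor.
### 4.2 Parallel side-branch skills
9. `thesis-visual-layout-planner`
Responsibility: figure/table planning, Mermaid sketches, temporary page composition, and text-figure rhythm checks.
10. `thesis-frontmatter-sync`
Responsibility: synchronize the abstract, cover, appendices, achievements, acknowledgements, name lists, and other non-body parts, so they are not left to be filled in at the very end.
11. `thesis-citation-enhance-review`
Responsibility: locate sentences that must be supported by citations, expand the candidate literature, verify that citations match the claims, and write back to `references/*.bib` and the in-text citations.
### 4.3 How these 11 skills relate
The main chain is:
`thesis-workspace-init -> thesis-question-list -> thesis-source-role-mapper -> thesis-chapter-reconstructor -> thesis-markdown-aligner -> thesis-tex-writeback -> thesis-compile-review -> thesis-style-polisher`
The parallel side branches are:
- `thesis-visual-layout-planner`
- `thesis-frontmatter-sync`
- `thesis-citation-enhance-review`
Among these:
- `thesis-question-list` is the control plane for the whole workflow
- `thesis-chapter-reconstructor` is the core restructuring plane
- `thesis-compile-review` is the quality-loop plane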
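## 5. Pipeline overview (by stage)
| Stage | Goal | Core skills | Main outputs |
|---|---|---|---|
| Stage 0 | Project initialization and material intake | `thesis-workspace-init` | Directory skeleton, initial compilation, material index, material readiness notes |
| Stage 1 | Recover existing materials into the Markdown intermediate layer | `thesis-workspace-init` | Initial Markdown materials, missing-information list, material gap notes |
| Stage 1.5 | Lock this round's issues and boundaries | `thesis-question-list` | `question_list.md` |
| Stage 2 | Map source materials to thesis roles | `thesis-source-role-mapper` | Chapter role map, materials grouped by chapter |
| Stage 2.5 | Restructure chapters around the main line | `thesis-chapter-reconstructor` | Restructured chapter Markdown |
| Stage 3 | Align, merge, and unify the full-thesis structure | `thesis-markdown-aligner` | Stable outline, terminology table, symbol table, evidence-gap list |
| Stage 3.5 | Figure/table and layout preview | `thesis-visual-layout-planner` | Figure plan, text-figure mapping, temporary layout results |
| Stage 4 | Write back to the TeX delivery layer | `thesis-tex-writeback` | Compilable chapter `.tex`, first complete `main.pdf` |
| Stage 4.5 | Synchronize non-body parts | `thesis-frontmatter-sync` | Synchronized versions of the abstract, appendices, cover, achievements, etc. |
| Stage 5 | Citation enhancement and verification | `thesis-citation-enhance-review` | Citation reinforcement results, verification records |
| Stage 6 | Compilation, typesetting, and final-draft review | `thesis-compile-review` | `THESIS_BUILD_REPORT`, review checklist |
| Stage 7 | Final polish and AI-flavor removal | `thesis-style-polisher` | More natural final Chinese text |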
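## 6. Stage-by-stage details
### Stage 0: Project initialization and material intake
**Goal**
Turn the thesis into a maintainable project first, rather than a pile of scattered files.
**Main inputs**
- University template
- Existing repository / source drafts
- Basic metadata such as student ID, year, and Chinese and English titles
- Existing `main.tex`
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Basic directory skeleton
- Initial `main.tex` / initial `main.pdf`
- `codex_md/material_index.md`
- `codex_md/material_readiness.md`
**Handoffs**
- If `main.tex` does not compile yet, do not enter the body restructuring stages
- If the materials are not yet in place, do not start the question list or chapter restructuring
**Execution focus**
- Make clear which areas are material areas, which are working areas, and which are delivery areas
- Ensure “it compiles” first, then worry about “it reads well”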
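### Stage 1: Recover existing materials into the Markdown intermediate layer
**Goal**
Recover the reusable content from the existing `template / tex / Overleaf / PDF / bib` into a Markdown intermediate layer, rather than forcing edits directly in the TeX layer.
**Main inputs**
- University template
- Existing `.tex`
- Overleaf source drafts
- PDF
- Bibliography database
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Initial Markdown chapter materials
- `codex_md/material_readiness.md`
- `codex_md/missing_info.md`
- Material index and list of information still to be supplied
**Handoffs**
- The outputs of Stage 1 feed directly into `thesis-question-list` and `thesis-source-role-mapper`
**Execution focus**
- The focus is “taking inventory of material assets”, not “writing pretty sentences first”
- Explicitly mark missing experiment details, missing figure captions, missing citations, and numbers that still need verification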
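### Stage 1.5: Lock the question list and this round's goals
**Goal**
Establish the control plane for this round and make clear what this round fixes and what it does not.
**Main inputs**
- Initial Markdown materials
- Review feedback from the previous round
- Current compilation / structure / style issues
**Core skill**
- `thesis-question-list`
**Main outputs**
- `codex_md/question_list.md`
**Handoffs**
- This step is the starting point for all later restructuring and write-back
- Any structural dispute should first come back here to be re-prioritized
**Execution focus**
- Every issue must state clearly: what it is, why it matters, how to fix it, and what level of acceptance is required
- The question list is not a memo; it is the entry point of each iteration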
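### Stage 2: Map source materials to thesis roles
**Goal**
Map the existing materials to chapter roles in the thesis, rather than mechanically assigning “paper to chapter”.
**Main inputs**
- Existing papers, PDFs, source drafts, figures and tables
- Initial Markdown materials
- `question_list.md`
**Core skill**
- `thesis-source-role-mapper`
**Main outputs**
- `codex_md/chapter_role_map.md`
- Markdown materials grouped by chapter
- Content boundaries for each chapter
**Handoffs**
- This is the direct input to `thesis-chapter-reconstructor`
**Execution focus**
- Distinguish: method core / background support / validation evidence / system implementation / material for limitations and future work
- If this is decided wrongly, the whole thesis will go off course afterwards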
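### Stage 2.5: Restructure chapters around the main line
**Goal**
Transform the paper-style narrative of the source materials into a thesis-style narrative.
**Main inputs**
- Chapter role map
- `question_list.md`
- Markdown drafts of each chapter
**Core skill**
- `thesis-chapter-reconstructor`
**Main outputs**
- Restructured chapter Markdown
- `codex_md/chapter_rewrite_rules.md`
**Handoffs**
- This is the core stage of the whole Pipeline
- All later alignment, write-back, and compilation build on the restructuring results produced here
**Execution focus**
- Make clear which question each chapter has to answer
- Decide which content to strengthen, weaken, move up, or move down
- Rewrite the handoffs to the preceding and following chapters
- Turn the “selling-point narrative” into a “thesis main-line narrative”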
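### Stage 3: Markdown alignment, merging, and unification
**Goal**
Make the whole thesis become “one thesis” in the intermediate layer first, rather than a concatenation of several papers.
**Main inputs**
- Restructured chapter Markdown
- `question_list.md`
**Core skill**
- `thesis-markdown-aligner`
**Main outputs**
- Stable `codex_md/00_thesis_outline.md`
- `codex_md/terminology_glossary.md`
- `codex_md/symbol_metric_table.md`
- Evidence-gap list
**Handoffs**
- While the structure, terminology, or metrics are still unstable, do not enter `tex-writeback`
**Execution focus**
- Unify the research main line, terminology, abbreviations, symbols, metrics, and figure/table conventions
- Decide which content is explained once in Chapter 2, which is developed in Chapters 3-5, and which goes into the appendices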
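### Stage 3.5: Figure/table and layout preview
**Goal**
Plan figures, tables, and layout before the body is finalized, so that figures do not turn into a last-minute patch job.
**Main inputs**
- Stable outline
- Chapter Markdown
- Existing figures, tables, and experiment results
**Core skill**
- `thesis-visual-layout-planner`
**Main outputs**
- `codex_md/figure_plan.md`
- `mermaid/` diagram drafts
- `tmp_layout/`, `tmp_layout2/` trial layout results
**Handoffs**
- The figure planning results directly affect the text-figure structure of `tex-writeback`
**Execution focus**
- Make clear which figures and tables each chapter needs to support the main line
- Sketch first, then do a temporary page composition, then decide final placement and captions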
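### Stage 4: Write back to the TeX delivery layer
**Goal**
Write content that has already been straightened out in the Markdown layer back into `chapters/*.tex`, and enter the delivery layer.
**Main inputs**
- Stable Markdown chapters
- Figure plan
- Unified terminology / symbol / metric tables
**Core skill**
- `thesis-tex-writeback`
**Main outputs**
- `chapters/*.tex`
- First complete `main.pdf`
**Handoffs**
- If structural problems are found at this stage, the default is to go back and fix them in the Markdown layer, not to reinvent the structure in the TeX layer
**Execution focus**
- Synchronize figures, tables, equations, cross-references, chapter introductions, and chapter summaries
- The TeX layer is responsible for delivery, not for rethinking the structure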
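### Stage 4.5: Synchronize non-body parts
**Goal**
Maintain the abstract, appendices, cover, achievements, acknowledgements, and other non-body parts in parallel, so they are not pushed to the end and written up hastily at low quality.
**Main inputs**
- Stable version of the body
- University template items
- Chinese and English title and abstract information
**Core skill**
- `thesis-frontmatter-sync`
**Main outputs**
- `abstract/abstract.tex`
- `abstract/abstract-en.tex`
- `preface/...`
- `appendix/...`
- `acknowledgement/...`
- `achievements/...`
**Handoffs**
- The final-draft review in Stage 6 checks these parts together with the body
**Execution focus**
- Chinese and English titles and abstracts must be consistent
- Numbers, metrics, and method names in the abstract must match the body
- Appendices hold supplementary content only and do not repeat the main line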
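### Stage 5: Literature enhancement and citation verification
**Goal**
Systematically fill in the “sentences that should be supported by citations”, and make sure each citation matches its claim.
**Main inputs**
- Current Markdown / TeX version
- `references/*.bib`
- Existing PDFs / reference works
**Core skill**
- `thesis-citation-enhance-review`
**Main outputs**
- Enhanced bibliography database
- Citation reinforcement records
- Citation verification records
**Handoffs**
- Literature enhancement can run in parallel with Stages 3 and 4, but it must be closed out before the final polish
**Execution focus**
- Prioritize sentences that must have citation support: factual descriptions, method taxonomies, classic results, metric definitions, and data sources
- Get the citations right first, then add more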
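### Stage 6: Compilation, typesetting, and final-draft review
**Goal**
Close the quality loop: not just “it compiles”, but “it is fit for submission”.
**Main inputs**
- Complete TeX project
- Synchronized non-body parts
- Review checklist
**Core skill**
- `thesis-compile-review`
**Main outputs**
- `output/THESIS_BUILD_REPORT.md`
- `claude_md/review_checklist.md`
- List of closed / still-open issues
**Handoffs**
- The issues found in Stage 6 determine which level to go back to for the fix: Stage 1.5 / 2.5 / 3 / 4
**Execution focus**
- A successful compilation is only the minimum requirement
- Warnings need to be triaged: blocks submission / must fix / record and observe
- Also check: data consistency, cite key validity, whether template parameters are correct, and whether terminology and abbreviations are unified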
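### Stage 7: Final polish and AI-flavor removal
**Goal**
Polish toward Chinese degree-thesis style only after structure, evidence, and data are stable.
**Main inputs**
- A thesis version that has passed compilation and review
- `中文写作要求.md` (Chinese writing requirements)
- `GPT口癖与高频用词调研.md` (survey of GPT verbal tics and high-frequency wording)
**Core skill**
- `thesis-style-polisher`
**Main outputs**
- More natural final Chinese text
- De-templated introductions, chapter summaries, conclusion, and outlook
**Handoffs**
- Stage 7 must come last
- If polishing exposes structure or evidence problems, go back to the earlier stages rather than covering them up with heavy polishing
**Execution focus**
- Prioritize chapter introductions, chapter summaries, contribution statements, and the conclusion and outlook
- Tone down template tone, promotional tone, and common AI verbal tics
- The goal is to read “more like a degree thesis”, not “more like a paper optimized by AI”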
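## 7. The four fixed loops of this Pipeline
To keep it from degenerating into a “linear SOP” again, the following four loops should be fixed in place:
### 7.1 Structure loop
`outline -> a chapter's md -> chapter imbalance found -> back to outline / question_list`
### 7.2 Content loop
`paper / Overleaf / PDF -> extract content -> restructure narrative -> still reads like the original paper -> restructure again`
### 7.3 Typesetting loop
`tex -> compile -> warnings / figure placement / cross-reference issues found -> back to tex or figure planning`
### 7.4 Style loop
`body stable -> AI polish -> template tone / AI flavor found -> back to writing guidelines and wording control`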
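## 8. Execution priorities of this Pipeline
If resources are limited at run time, the priority order should be:
1. `thesis-question-list`
2. `thesis-source-role-mapper`
3. `thesis-chapter-reconstructor`
4. `thesis-markdown-aligner`
5. `thesis-tex-writeback`
6. `thesis-compile-review`
In other words:
- What truly cannot be skipped is “question list + material role mapping + chapter restructuring + compile review”
- What truly should not be done earliest is “polishing and AI-flavor removal”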
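## 9. Current conclusion
The first principle of this graduate-paper pipeline should be:
> **Restructure first, then write; converge in the intermediate layer first, then deliver in TeX; get evidence and structure right first, then style and polish.**
Therefore, if this Pipeline is later made formally executable, the thing most worth implementing first is not “one big all-in-one runner”, but getting the following skills solid:
- `thesis-workspace-init`
- `thesis-question-list`
- `thesis-source-role-mapper`
- `thesis-chapter-reconstructor`
- `thesis-compile-review`
Only once these five skills are solid does this Chinese graduate-thesis Pipeline have real practical value.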
FILE:pipelines/idea-brainstorm.pipeline.md
---
name: idea-brainstorm
version: 4.0
profile: idea-brainstorm
routing_hints: [idea, ideation, brainstorm, 点子, 选题, 找方向, 找 idea]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.idea-brainstorm.csv
contract_model: pipeline.frontmatter/v1
query_defaults:
  draft_profile: idea_brainstorm
  max_results: 1800
  core_size: 100
  evidence_mode: abstract
  direction_pool_min: 12
  direction_pool_max: 24
  idea_screen_top_n: 10
  idea_shortlist_size: 5
  report_top_n: 3
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- evidence_mode
- direction_pool_min
- direction_pool_max
- idea_screen_top_n
- idea_shortlist_size
- report_top_n
- time_window.from
- time_window.to
quality_contract:
  memo_bundle:
    required_outputs: [output/REPORT.md, output/APPENDIX.md, output/REPORT.json]
  signal_policy:
    min_rows: 10
  direction_policy:
    shortlist_min: 3
    shortlist_max: 5
    keep_min: 3
    cluster_diversity_min: 2
    lead_diversity_target: 3
    lead_diversity_axes: [cluster, direction_type, program_kind]
  screening_policy:
    keep_rank_max: 7
    maybe_rank_max: 12
    score_weights:
      discussion_worthiness: 0.24
      academic_value: 0.22
      evidence_grounding: 0.18
      direction_distinctness: 0.16
      first_probe_clarity: 0.10
      thesis_potential: 0.10
loop_policy:
  stage_retry_budget:
    C1: 2
    C3: 1
    C4: 1
  max_reroutes: 4
  require_human_on_retry_after_approval: true
stages:
  C0:
    title: Init + idea brief
    mode: no_prose
    required_skills: [workspace-init, pipeline-router, idea-brief, human-checkpoint]
    optional_skills: []
    produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/trace/IDEA_BRIEF.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
    human_checkpoint:
      approve: brainstorm brief
      write_to: DECISIONS.md
  C1:
    title: Retrieval + core set
    mode: no_prose
    required_skills: [literature-engineer, dedupe-rank]
    optional_skills: []
    produces: [papers/papers_raw.jsonl, papers/papers_dedup.jsonl, papers/core_set.csv, papers/retrieval_report.md]
  C2:
    title: Idea landscape / focus
    mode: no_prose
    required_skills: [taxonomy-builder, pipeline-router, human-checkpoint]
    optional_skills: []
    produces: [outline/taxonomy.yml, DECISIONS.md]
    human_checkpoint:
      approve: focus clusters / lenses + exclusions
      write_to: DECISIONS.md
  C3:
    title: Evidence signals
    mode: no_prose
    required_skills: [paper-notes, idea-signal-mapper]
    optional_skills: []
    produces: [papers/paper_notes.jsonl, papers/evidence_bank.jsonl, output/trace/IDEA_SIGNAL_TABLE.md, output/trace/IDEA_SIGNAL_TABLE.jsonl]
  C4:
    title: Direction pool + screening
    mode: short_prose_ok
    required_skills: [idea-direction-generator, idea-screener]
    optional_skills: []
    produces: [output/trace/IDEA_DIRECTION_POOL.md, output/trace/IDEA_DIRECTION_POOL.jsonl, output/trace/IDEA_SCREENING_TABLE.md, output/trace/IDEA_SCREENING_TABLE.jsonl]
  C5:
    title: Shortlist + memo synthesis + self-loop
    mode: prose_allowed
    required_skills: [idea-shortlist-curator, idea-memo-writer, deliverable-selfloop, artifact-contract-auditor]
    optional_skills: []
    produces: [output/trace/IDEA_SHORTLIST.md, output/trace/IDEA_SHORTLIST.jsonl, output/REPORT.md, output/APPENDIX.md, output/REPORT.json, output/DELIVERABLE_SELFLOOP_TODO.md, output/CONTRACT_REPORT.md]
---
# Pipeline: research idea brainstorm (signals -> directions -> memo)
Goal: produce a **discussion-ready research-idea memo** for PI / PhD readers.
This pipeline is for requests such as “找 research idea” (find a research idea), “brainstorm”, “选题” (choose a topic), and “找方向” (find a direction).
It is **not** for writing a survey draft and **not** for generating execution-grade project specs.
The terminal deliverable is a single default entrypoint:
- `output/REPORT.md`
That memo should feel like a mature brainstorm artifact:
- grounded in literature,
- small enough to discuss,
- clear about what is promising,
- honest about uncertainty,
- and rich enough to guide the next discussion round.
Artifact policy:
- Reader-facing terminal artifacts:
- `output/REPORT.md`
- `output/APPENDIX.md`
- `output/REPORT.json`
- Trace artifacts stay under `output/trace/`.
- The run is only shareable when both layers exist and pass contract audit.
Default profile:
- Retrieval route: `literature-engineer`
- `core_size=100`
- Evidence mode: `abstract`
- Signal table: 10-20 rows
- Direction pool: 12-24
- Shortlist: 3-5
- Final memo lead directions: 3
## Stage 0 - Init + idea brief (C0)
required_skills:
- workspace-init
- pipeline-router
- idea-brief
- human-checkpoint
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
Notes:
- The brief is the single source of truth for topic, audience, constraints, exclusions, targets, and query buckets.
- Default behavior: block once at C0 for human approval of the brief before retrieval.
## Stage 1 - Retrieval + core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
Notes:
- Use multi-query buckets from `queries.md`.
- If recall is too low/noisy, fix it upstream and rerun C1.
## Stage 2 - Idea landscape / focus (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- pipeline-router
- human-checkpoint
produces:
- outline/taxonomy.yml
- DECISIONS.md
human_checkpoint:
- approve: focus clusters / lenses + exclusions
- write_to: DECISIONS.md
Notes:
- Taxonomy is an idea landscape, not a paper outline.
- Default behavior: pause at C2 so the human can choose a few promising focus lenses.
## Stage 3 - Evidence signals (C3) [table-first]
required_skills:
- paper-notes
- idea-signal-mapper
produces:
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
Notes:
- The signal table should capture tensions, missing pieces, and academically meaningful axes rather than proposal-ready wedges.
- Keep this stage compact and table-first; no long prose.
## Stage 4 - Direction pool + screening (C4)
required_skills:
- idea-direction-generator
- idea-screener
produces:
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
Notes:
- Expansion should generate discussion-worthy research directions, not an operator cartesian product.
- Screening should score discussion value, distinctness, evidence grounding, and thesis potential.
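
The screening step can be pictured as a weighted sum followed by rank cutoffs. The sketch below reuses the `score_weights`, `keep_rank_max`, and `maybe_rank_max` values from this pipeline's front matter, but everything else is an assumption: the function names, the `scores` field, the 0-1 score scale, and the `keep` / `maybe` / `drop` labels are hypothetical, and `idea-screener` may combine the criteria differently.

```python
from __future__ import annotations

# Weights copied from the front matter (`quality_contract.screening_policy.score_weights`).
SCORE_WEIGHTS = {
    "discussion_worthiness": 0.24,
    "academic_value": 0.22,
    "evidence_grounding": 0.18,
    "direction_distinctness": 0.16,
    "first_probe_clarity": 0.10,
    "thesis_potential": 0.10,
}


def weighted_score(scores: dict[str, float], weights: dict[str, float] = SCORE_WEIGHTS) -> float:
    """Weighted sum over all criteria; refuses to score a direction with missing criteria."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


def screen(directions: list[dict], keep_rank_max: int = 7, maybe_rank_max: int = 12) -> list[dict]:
    """Rank directions by weighted score and label them by rank cutoff (assumed semantics)."""
    ranked = sorted(directions, key=lambda d: weighted_score(d["scores"]), reverse=True)
    results = []
    for rank, direction in enumerate(ranked, start=1):
        if rank <= keep_rank_max:
            decision = "keep"
        elif rank <= maybe_rank_max:
            decision = "maybe"
        else:
            decision = "drop"
        results.append({"id": direction["id"], "rank": rank, "decision": decision})
    return results


# Made-up directions with uniform per-criterion scores, for illustration only.
demo = [
    {"id": "d1", "scores": {name: 0.2 for name in SCORE_WEIGHTS}},
    {"id": "d2", "scores": {name: 0.9 for name in SCORE_WEIGHTS}},
    {"id": "d3", "scores": {name: 0.5 for name in SCORE_WEIGHTS}},
]
for row in screen(demo, keep_rank_max=1, maybe_rank_max=2):
    print(row)
```

Raising on a missing criterion, rather than defaulting it to zero, keeps an incompletely scored direction from being silently ranked low.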
## Stage 5 - Shortlist + memo synthesis + self-loop (C5)
required_skills:
- idea-shortlist-curator
- idea-memo-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- `output/trace/IDEA_SHORTLIST.md` is an internal convergence layer, not the user-facing final answer.
- `output/REPORT.md` is the terminal deliverable and should read like a discussion-ready research idea brainstorm memo.
- `deliverable-selfloop` should evaluate the full memo bundle plus its trace chain.
FILE:pipelines/lit-snapshot.pipeline.md
---
name: lit-snapshot
version: 2.3
profile: lit-snapshot
routing_hints: [snapshot, 快照, one-page, one page, 48h]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- outline/taxonomy.yml
- outline/outline.yml
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.lit-snapshot.csv
---
# Pipeline: literature snapshot (24-48h) [bullets-first]
Goal: a compact, reader-facing snapshot (`output/SNAPSHOT.md`) with a small but usable paper set and a paper-like structure. This is intentionally lighter than `arxiv-survey*` (no evidence packs, no BibTeX, no LaTeX).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- arxiv-search
- dedupe-rank
optional_skills:
- keyword-expansion
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Snapshot default: aim for a smaller but diverse set (e.g., core_set ~20-40) rather than a survey-scale pool.
- Practical retrieval loop (multi-query, then refine):
- Start broad with multiple query buckets (synonyms/acronyms/subtopics) and merge the results.
- If too few: add buckets + widen time window; if too noisy: rewrite keywords + add exclusions; rerun C1.
- If the snapshot feels generic, the fix is almost always upstream: broaden `queries.md` (more synonyms, fewer excludes, wider time window) and rerun C1.
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- outline-builder
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/outline.yml
human_checkpoint:
- approve: scope + outline (snapshot ToC)
- write_to: DECISIONS.md
Notes:
- Paper-like default: prefer fewer, thicker sections. For snapshots, avoid H3 explosion; treat H2 as the main organizing unit.
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent nodes and keep the ToC readable before writing.
## Stage 3 - Snapshot (C3) [SHORT PROSE OK]
required_skills:
- snapshot-writer
- deliverable-selfloop
- artifact-contract-auditor
optional_skills:
- prose-writer
produces:
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- The deliverable is bullets-first and pointer-heavy: every non-trivial claim should attach 1-2 concrete paper pointers from `papers/core_set.csv`.
- Avoid outline narration (e.g., `This section surveys ...`); write content claims + why-it-matters + pointers.
FILE:pipelines/peer-review.pipeline.md
---
name: peer-review
version: 2.3
profile: peer-review
routing_hints: [peer review, review report, referee, 审稿]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PAPER.md
- output/CLAIMS.md
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.peer-review.csv
---
# Pipeline: peer review / referee report
Goal: produce an actionable referee report (`output/REVIEW.md`) that is traceable to the submitted paper’s claims and evidence (no free-floating advice, no invented comparisons).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
## Stage 1 - Claims (C1)
required_skills:
- manuscript-ingest
- claims-extractor
produces:
- output/PAPER.md
- output/CLAIMS.md
Notes:
- Input expectation: you must have the manuscript text available as `output/PAPER.md` (or equivalent extracted text) before running this stage.
- Contract: every claim must include a source pointer (section/page/quote) so later critique is auditable.
## Stage 2 - Evidence audit (C2)
required_skills:
- evidence-auditor
- novelty-matrix
produces:
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
Notes:
- Evidence-auditor should write gaps/risks/verification steps, not “fix the paper”.
- Novelty-matrix should be conservative: compare against cited/known related work; if you lack sources, record that limitation explicitly.
## Stage 3 - Rubric write-up (C3)
required_skills:
- rubric-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Prefer concrete, minimal fixes (what experiment/ablation/analysis would resolve the concern) over generic “needs more experiments”.
FILE:pipelines/systematic-review.pipeline.md
---
name: systematic-review
version: 2.3
profile: systematic-review
routing_hints: [systematic review, systematic, prisma, 系统综述]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PROTOCOL.md
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/screening_log.csv
- papers/extraction_table.csv
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.systematic-review.csv
---
# Pipeline: systematic review (PRISMA-style)
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Protocol (C1)
required_skills:
- protocol-writer
produces:
- output/PROTOCOL.md
human_checkpoint:
- approve: protocol locked (query, inclusion/exclusion, databases, time window)
- write_to: DECISIONS.md
## Stage 2 - Retrieval & candidate pool (C2)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- arxiv-search
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Systematic-review contract: keep the candidate pool auditable. Prefer dedupe (good) over aggressive ranking/filtering (dangerous).
- `dedupe-rank` default: for this pipeline, it keeps the full deduped pool in `papers/core_set.csv` unless you explicitly set `queries.md:core_size` (screening should not silently drop papers).
## Stage 3 - Screening (C3)
required_skills:
- screening-manager
produces:
- papers/screening_log.csv
Notes:
- Use `papers/papers_dedup.jsonl` (or `papers/core_set.csv`) as the candidate list; `papers/screening_log.csv` must include protocol-grounded reasons for every decision.
- Practical auditability: number protocol clauses (`I1..` / `E1..`) and require screening rows to cite them (e.g., `reason_codes=E3`).
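
The auditability rule above (every screening row cites numbered protocol clauses) lends itself to a small validator. This is a minimal sketch under stated assumptions: the source only names the `reason_codes` field, so the `paper_id` and `decision` columns, the `;` separator for multiple codes, and the function name are hypothetical.

```python
from __future__ import annotations

import csv
import io
import re

CLAUSE_ID = re.compile(r"^[IE]\d+$")  # I1.. for inclusion clauses, E1.. for exclusion clauses


def invalid_rows(csv_text: str, protocol_clauses: set[str]) -> list[tuple[str, str]]:
    """Return (paper_id, problem) pairs for rows without valid protocol-grounded reasons."""
    problems = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row.get("reason_codes") or ""
        codes = [code.strip() for code in raw.split(";") if code.strip()]
        if not codes:
            problems.append((row["paper_id"], "no reason code"))
            continue
        for code in codes:
            if not CLAUSE_ID.match(code) or code not in protocol_clauses:
                problems.append((row["paper_id"], f"unknown clause {code}"))
    return problems


# Made-up screening log, for illustration only.
sample_log = "\n".join(
    [
        "paper_id,decision,reason_codes",
        "P1,include,I1;I2",
        "P2,exclude,E3",
        "P3,exclude,",
        "P4,exclude,E9",
    ]
)
clauses = {"I1", "I2", "E1", "E2", "E3"}
print(invalid_rows(sample_log, clauses))
```

In practice the clause set would be read from the locked `output/PROTOCOL.md`, so a screening decision can never cite a clause the protocol does not contain.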
## Stage 4 - Extraction (C4)
required_skills:
- extraction-form
- bias-assessor
produces:
- papers/extraction_table.csv
Notes:
- Extraction is schema-driven: do not write narrative synthesis here; keep all fields consistent and fill bias fields (low/unclear/high + notes).
## Stage 5 - Synthesis (C5) [PROSE ALLOWED]
required_skills:
- synthesis-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
FILE:pipelines/tutorial.pipeline.md
---
name: tutorial
version: 2.3
profile: tutorial
routing_hints: [tutorial, 教程]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/TUTORIAL_SPEC.md
- outline/concept_graph.yml
- outline/module_plan.yml
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.tutorial.csv
---
# Pipeline: tutorial (teaching loop)
Goal: a tutorial deliverable (`output/TUTORIAL.md`) that has a consistent running example and a real teaching loop (objectives -> steps -> exercises -> verification), not just a blog-style explanation.
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Spec (C1)
required_skills:
- tutorial-spec
produces:
- output/TUTORIAL_SPEC.md
Notes:
- The spec is the scope contract: audience, prerequisites, learning objectives, and a running example (if applicable).
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- concept-graph
- module-planner
- exercise-builder
produces:
- outline/concept_graph.yml
- outline/module_plan.yml
human_checkpoint:
- approve: target audience + scope + running example
- write_to: DECISIONS.md
Notes:
- Treat `outline/module_plan.yml` as the execution contract for writing: modules must have concrete outputs and at least one verifiable exercise each.
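
The contract in the note above (each module has concrete outputs and at least one verifiable exercise) could be checked before writing starts. The sketch below works on already-parsed data and invents its own schema: the keys `id`, `outputs`, `exercises`, and `verification` are hypothetical, and the real `outline/module_plan.yml` layout may differ.

```python
from __future__ import annotations


def module_plan_gaps(modules: list[dict]) -> list[str]:
    """List modules that lack concrete outputs or a verifiable exercise.

    Hypothetical schema: each module has `id`, `outputs` (list), and
    `exercises` (list of dicts, each optionally carrying `verification`).
    """
    gaps = []
    for module in modules:
        module_id = module.get("id", "<unnamed>")
        if not module.get("outputs"):
            gaps.append(f"{module_id}: no concrete outputs")
        exercises = module.get("exercises") or []
        if not any(exercise.get("verification") for exercise in exercises):
            gaps.append(f"{module_id}: no verifiable exercise")
    return gaps


# Made-up module plan, for illustration only.
plan = [
    {
        "id": "M1",
        "outputs": ["working tokenizer script"],
        "exercises": [{"prompt": "Tokenize a sample file", "verification": "token count matches"}],
    },
    {"id": "M2", "outputs": [], "exercises": [{"prompt": "Read the docs"}]},
]
print(module_plan_gaps(plan))
```

An empty result would mean the plan satisfies the stated contract; any entries are candidates for the C2 human checkpoint rather than something to fix silently during writing.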
## Stage 3 - Writing (C3) [PROSE ALLOWED]
required_skills:
- tutorial-module-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Write only within the approved scope; if you discover missing prerequisites, record them as a follow-up instead of expanding scope silently.
FILE:scripts/run.py
from __future__ import annotations
import argparse
import json
import re
import sys
from pathlib import Path
from typing import Any
def _choose_synthesis_mode(*, sec_id: str, sec_title: str, axes: list[str]) -> tuple[str, list[str]]:
"""Pick a synthesis mode to avoid repeating the same chapter-ending template everywhere (NO NEW FACTS)."""
title_low = (sec_title or "").lower()
axes_low = " ".join([str(a or "").lower() for a in (axes or [])])
if any(k in title_low for k in ["evaluation", "benchmark", "metrics", "measurement"]) or any(k in axes_low for k in ["evaluation", "benchmark", "metrics"]):
return (
"tradeoff_matrix",
[
"Synthesize by comparing approaches along a small trade-off matrix (axes + evaluation protocol), not by listing papers.",
"Keep contrasts decision-relevant (reliability/cost/safety) and reuse consistent evaluation anchors across H3s.",
],
)
if any(k in title_low for k in ["history", "evolution", "timeline", "trend"]):
return (
"timeline",
[
"Synthesize via a lightweight evolution story (what changed in assumptions/interfaces/evaluation over time).",
"Use the timeline to motivate why later H3s focus on different constraints.",
],
)
if any(k in title_low for k in ["security", "safety", "attack", "defense", "risk", "robust"]) or any(k in axes_low for k in ["safety", "security", "risk"]):
return (
"tension_resolution",
[
"Synthesize as a tension-resolution arc (capability vs safety/reliability) and how approaches resolve or shift that tension.",
"Make the failure modes concrete and connect them to evaluation protocols and mitigations.",
],
)
# Deterministic fallback: cycle a couple of modes to increase diversity across chapters.
try:
idx = int(str(sec_id or "0").strip())
except Exception:
idx = 0
if idx % 3 == 0:
return (
"case_study",
[
"Synthesize by anchoring the chapter around 1–2 representative system case studies and using H3s as axes of analysis.",
"Ensure the case studies only use papers already mapped to this chapter (no new citations).",
],
)
return (
"clusters",
[
"Synthesize by contrasting 2–3 clusters of approaches and making the trade-offs explicit (not a per-paper summary).",
"Use explicit connectors (however/therefore/building on) to keep paragraphs from becoming islands.",
],
)
def main() -> int:
parser = argparse.ArgumentParser()
parser.add_argument("--workspace", required=True)
parser.add_argument("--unit-id", default="")
parser.add_argument("--inputs", default="")
parser.add_argument("--outputs", default="")
parser.add_argument("--checkpoint", default="")
args = parser.parse_args()
repo_root = Path(__file__).resolve()
for _ in range(10):
if (repo_root / "AGENTS.md").exists():
break
parent = repo_root.parent
if parent == repo_root:
break
repo_root = parent
sys.path.insert(0, str(repo_root))
from tooling.common import (
ensure_dir,
load_yaml,
now_iso_seconds,
parse_semicolon_list,
read_jsonl,
write_jsonl,
)
workspace = Path(args.workspace).resolve()
inputs = parse_semicolon_list(args.inputs) or [
"outline/outline.yml",
"outline/subsection_briefs.jsonl",
"GOAL.md",
]
outputs = parse_semicolon_list(args.outputs) or ["outline/chapter_briefs.jsonl"]
outline_path = workspace / inputs[0]
briefs_path = workspace / inputs[1]
goal_path = workspace / (inputs[2] if len(inputs) >= 3 else "GOAL.md")
out_path = workspace / outputs[0]
freeze_marker = out_path.parent / "chapter_briefs.refined.ok"
if out_path.exists() and out_path.stat().st_size > 0 and freeze_marker.exists():
return 0
outline = load_yaml(outline_path) if outline_path.exists() else None
if not isinstance(outline, list) or not outline:
raise SystemExit(f"Invalid outline: {outline_path}")
briefs = read_jsonl(briefs_path)
if not briefs:
raise SystemExit(f"Missing or empty subsection briefs: {briefs_path}")
briefs_by_sub: dict[str, dict[str, Any]] = {}
for rec in briefs:
if not isinstance(rec, dict):
continue
sid = str(rec.get("sub_id") or "").strip()
if sid:
briefs_by_sub[sid] = rec
goal = _read_goal(goal_path)
records: list[dict[str, Any]] = []
for sec in outline:
if not isinstance(sec, dict):
continue
sec_id = str(sec.get("id") or "").strip()
sec_title = str(sec.get("title") or "").strip()
subs = sec.get("subsections") or []
if not (sec_id and sec_title and isinstance(subs, list) and subs):
continue
sub_list: list[dict[str, str]] = []
bridge_terms: list[str] = []
contrast_hooks: list[str] = []
axes: list[str] = []
for sub in subs:
if not isinstance(sub, dict):
continue
sub_id = str(sub.get("id") or "").strip()
sub_title = str(sub.get("title") or "").strip()
if not sub_id or not sub_title:
continue
sub_list.append({"sub_id": sub_id, "title": sub_title})
brief = briefs_by_sub.get(sub_id) or {}
for t in brief.get("bridge_terms") or []:
t = str(t or "").strip()
if t and t not in bridge_terms:
bridge_terms.append(t)
hook = str(brief.get("contrast_hook") or "").strip()
if hook and hook not in contrast_hooks:
contrast_hooks.append(hook)
for a in brief.get("axes") or []:
a = str(a or "").strip()
if a and a not in axes:
axes.append(a)
throughline: list[str] = []
if axes:
throughline.extend(axes[:6])
else:
throughline.append("organizing comparison: mechanism, protocol, and limitations")
key_contrasts = [c for c in contrast_hooks if c]
if not key_contrasts:
key_contrasts = ["cross-subsection contrast grounded in mapped papers"]
synthesis_mode, synthesis_preview = _choose_synthesis_mode(sec_id=sec_id, sec_title=sec_title, axes=axes)
lead_plan = [
"chapter problem + decision-relevant comparison frame",
"subsection map + chapter question decomposition",
"evaluation anchors + recurrent limitations",
]
records.append(
{
"section_id": sec_id,
"section_title": sec_title,
"subsections": sub_list,
"synthesis_mode": synthesis_mode,
"synthesis_preview": synthesis_preview[:2],
"throughline": throughline[:6],
"key_contrasts": key_contrasts[:6],
"lead_paragraph_plan": lead_plan,
"bridge_terms": bridge_terms[:12],
"generated_at": now_iso_seconds(),
}
)
ensure_dir(out_path.parent)
_check_no_placeholders(records)
write_jsonl(out_path, records)
return 0
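def _read_goal(path: Path) -> str:
    """Return the first substantive line of GOAL.md, or "" if there is none."""
    if not path.exists():
        return ""
    for line in path.read_text(encoding="utf-8", errors="ignore").splitlines():
        line = line.strip()
        # Skip blank lines and headings.
        if not line or line.startswith("#"):
            continue
        # Skip bullets, block quotes, and HTML comments.
        if line.startswith(("-", ">", "<!--")):
            continue
        low = line.lower()
        # Skip unfilled template prompts ("写一句话" means "write one sentence").
        if "写一句话" in line or "fill" in low:
            continue
        return line
    return ""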
def _check_no_placeholders(records: list[dict[str, Any]]) -> None:
    raw = json.dumps(records, ensure_ascii=False)
    low = raw.lower()
    if "…" in raw:
        raise SystemExit("chapter_briefs contains ellipsis placeholder")
    if "(placeholder)" in low:
        raise SystemExit("chapter_briefs contains placeholder markers")
    if re.search(r"(?i)\b(?:todo|tbd|fixme)\b", raw):
        raise SystemExit("chapter_briefs contains TODO/TBD/FIXME")
if __name__ == "__main__":
raise SystemExit(main())
FILE:tooling/__init__.py
FILE:tooling/common.py
from __future__ import annotations
import csv
import json
import os
import re
import shutil
import tempfile
from dataclasses import dataclass
from datetime import date, datetime
from pathlib import Path
from typing import Any, Iterable
import yaml
def today_iso() -> str:
return date.today().isoformat()
def now_iso_seconds() -> str:
return datetime.now().replace(microsecond=0).isoformat()
def ensure_dir(path: Path) -> None:
path.mkdir(parents=True, exist_ok=True)
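def atomic_write_text(path: Path, content: str) -> None:
    """Write `content` to a temp file in the target directory, then swap it into place."""
    ensure_dir(path.parent)
    # Same directory as the target so that os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
        os.replace(tmp_path, path)
    except Exception:
        # Best-effort cleanup of the temp file before re-raising.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise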
def backup_existing(path: Path) -> Path:
"""Rename an existing file to a timestamped `.bak.*` sibling and return the backup path."""
if not path.exists():
return path
stamp = datetime.now().replace(microsecond=0).isoformat().replace("-", "").replace(":", "")
backup = path.with_name(f"{path.name}.bak.{stamp}")
counter = 1
while backup.exists():
backup = path.with_name(f"{path.name}.bak.{stamp}.{counter}")
counter += 1
path.replace(backup)
return backup
def parse_semicolon_list(value: str | None) -> list[str]:
if not value:
return []
return [item.strip() for item in value.split(";") if item.strip()]
def normalize_title_for_dedupe(title: str) -> str:
title = title.lower()
title = re.sub(r"[^a-z0-9]+", " ", title)
title = re.sub(r"\s+", " ", title).strip()
return title
def normalize_axis_label(text: str) -> str:
text = re.sub(r"\s+", " ", (text or "").strip().lower())
text = text.rstrip(" .;:,;。")
text = re.sub(r"\s*/\s*", " ", text)
text = re.sub(r"[^a-z0-9]+", " ", text)
return text.strip()
def subsection_brief_generic_axis_norms() -> set[str]:
"""Axis labels that strict survey gates treat as scaffold-level defaults.
Keep this aligned across brief generation and quality-gate checks so the
generator does not promote an axis that the gate later classifies as generic.
"""
axes = {
"core mechanism and system architecture",
"training and data setup",
"evaluation protocol",
"evaluation protocol (benchmarks / metrics / human)",
"evaluation protocol (datasets / metrics / human)",
"evaluation protocol (datasets, metrics, human evaluation)",
"compute and efficiency",
"compute and latency constraints",
"efficiency and compute",
"tool interface contract (schemas / protocols)",
"tool selection / routing policy",
"sandboxing / permissions / observability",
"failure modes and limitations",
}
return {normalize_axis_label(axis) for axis in axes}
def read_jsonl(path: Path) -> list[dict[str, Any]]:
    records: list[dict[str, Any]] = []
    if not path.exists():
        return records
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            records.append(json.loads(line))
    return records
def write_jsonl(path: Path, records: Iterable[dict[str, Any]]) -> None:
    ensure_dir(path.parent)
    lines = [json.dumps(record, ensure_ascii=False) for record in records]
    atomic_write_text(path, "\n".join(lines) + ("\n" if lines else ""))
def read_tsv(path: Path) -> list[dict[str, str]]:
if not path.exists():
return []
with path.open("r", encoding="utf-8", newline="") as handle:
reader = csv.DictReader(handle, delimiter="\t")
return [dict(row) for row in reader]
def write_tsv(path: Path, rows: list[dict[str, Any]], fieldnames: list[str]) -> None:
ensure_dir(path.parent)
fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
try:
with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
writer.writeheader()
for row in rows:
writer.writerow({key: row.get(key, "") for key in fieldnames})
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except OSError:
pass
raise
@dataclass(frozen=True)
class UnitsTable:
fieldnames: list[str]
rows: list[dict[str, str]]
@staticmethod
def load(path: Path) -> "UnitsTable":
with path.open("r", encoding="utf-8", newline="") as handle:
reader = csv.DictReader(handle)
fieldnames = list(reader.fieldnames or [])
rows = [dict(row) for row in reader]
return UnitsTable(fieldnames=fieldnames, rows=rows)
def save(self, path: Path) -> None:
ensure_dir(path.parent)
fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
try:
with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
writer = csv.DictWriter(handle, fieldnames=self.fieldnames)
writer.writeheader()
for row in self.rows:
writer.writerow({key: row.get(key, "") for key in self.fieldnames})
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except OSError:
pass
raise
def load_yaml(path: Path) -> Any:
with path.open("r", encoding="utf-8") as handle:
return yaml.safe_load(handle)
def dump_yaml(path: Path, data: Any) -> None:
ensure_dir(path.parent)
text = yaml.safe_dump(
data,
allow_unicode=True,
sort_keys=False,
default_flow_style=False,
width=120,
)
atomic_write_text(path, text)
def copy_tree(src_dir: Path, dst_dir: Path, *, overwrite: bool) -> None:
if not src_dir.is_dir():
raise ValueError(f"Template directory not found: {src_dir}")
ensure_dir(dst_dir)
for src_path in src_dir.rglob("*"):
rel = src_path.relative_to(src_dir)
dst_path = dst_dir / rel
if src_path.is_dir():
ensure_dir(dst_path)
continue
ensure_dir(dst_path.parent)
if dst_path.exists() and not overwrite:
continue
shutil.copy2(src_path, dst_path)
def tokenize(text: str) -> list[str]:
text = text.lower()
text = re.sub(r"[^a-z0-9]+", " ", text)
return [token for token in text.split() if token]
_EN_STOPWORDS = {
    "a",
    "an",
    "and",
    "are",
    "as",
    "at",
    "be",
    "by",
    "for",
    "from",
    "in",
    "is",
    "it",
    "of",
    "on",
    "or",
    "that",
    "the",
    "this",
    "to",
    "with",
    "we",
    "our",
    "via",
    "towards",
    "toward",
    "using",
    "use",
    "based",
    "new",
    "into",
    "over",
    "under",
    "between",
    "within",
    "without",
    "beyond",
}
_GENERIC_PAPER_WORDS = {
"survey",
"review",
"tutorial",
"paper",
"approach",
"method",
"methods",
"model",
"models",
"framework",
"frameworks",
"system",
"systems",
"learning",
"deep",
"neural",
"network",
"networks",
"analysis",
"benchmark",
"benchmarks",
"dataset",
"datasets",
"evaluation",
"evaluating",
"towards",
"using",
"based",
"study",
"studies",
}
def candidate_keywords(titles: Iterable[str], *, top_k: int, min_freq: int) -> list[str]:
freq: dict[str, int] = {}
for title in titles:
for token in tokenize(title):
if token in _EN_STOPWORDS or token in _GENERIC_PAPER_WORDS:
continue
if len(token) < 3:
continue
freq[token] = freq.get(token, 0) + 1
candidates = [t for t, c in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])) if c >= min_freq]
return candidates[:top_k]
def update_status_log(status_path: Path, line: str) -> None:
ensure_dir(status_path.parent)
if status_path.exists():
existing = status_path.read_text(encoding="utf-8")
else:
existing = "# Status\n"
if "## Run log" not in existing:
existing = existing.rstrip() + "\n\n## Run log\n"
updated = existing.rstrip() + f"\n- {line}\n"
atomic_write_text(status_path, updated)
def update_status_field(status_path: Path, heading: str, value: str) -> None:
heading_line = f"## {heading}".strip()
bullet_line = f"- `{value}`"
if status_path.exists():
lines = status_path.read_text(encoding="utf-8").splitlines()
else:
lines = ["# Status"]
out: list[str] = []
i = 0
updated = False
while i < len(lines):
line = lines[i]
out.append(line)
if line.strip() == heading_line:
if i + 1 < len(lines) and lines[i + 1].lstrip().startswith("-"):
out.append(bullet_line)
i += 2
updated = True
continue
out.append(bullet_line)
updated = True
i += 1
if not updated:
out.extend(["", heading_line, bullet_line])
atomic_write_text(status_path, "\n".join(out).rstrip() + "\n")
def decisions_has_approval(decisions_path: Path, checkpoint: str) -> bool:
    if not checkpoint:
        return False
    if not decisions_path.exists():
        return False
    text = decisions_path.read_text(encoding="utf-8")
    pattern = rf"^\s*-\s*\[[xX]\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b"
    return re.search(pattern, text, flags=re.MULTILINE) is not None
def ensure_decisions_approval_checklist(decisions_path: Path) -> None:
if decisions_path.exists():
text = decisions_path.read_text(encoding="utf-8")
else:
text = "# Decisions log\n"
if re.search(r"^##\s+Approvals\b", text, flags=re.MULTILINE):
return
workspace = decisions_path.parent
checkpoints = _human_checkpoints_from_units(workspace)
if not checkpoints:
return
checklist_lines = ["## Approvals (check to unblock)"]
for checkpoint in checkpoints:
hint = _approval_hint(checkpoint)
suffix = f" ({hint})" if hint else ""
checklist_lines.append(f"- [ ] Approve {checkpoint}{suffix}")
checklist_lines.append("")
checklist = "\n".join(checklist_lines)
lines = text.splitlines()
if lines and lines[0].startswith("#"):
new_text = "\n".join([lines[0], "", checklist] + lines[1:]).rstrip() + "\n"
else:
new_text = (checklist + "\n" + text).rstrip() + "\n"
atomic_write_text(decisions_path, new_text)
def set_decisions_approval(decisions_path: Path, checkpoint: str, *, approved: bool) -> None:
    checkpoint = checkpoint.strip()
    if not checkpoint:
        raise ValueError("checkpoint must be non-empty")
    ensure_decisions_approval_checklist(decisions_path)
    # The checklist helper writes nothing when UNITS.csv lists no HUMAN checkpoints,
    # so the file may still be absent here; fall back to an empty log.
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n"
    lines = text.splitlines()
    pattern = re.compile(rf"^\s*-\s*\[\s*[xX ]\s*\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b")
    updated = False
    for idx, line in enumerate(lines):
        if pattern.search(line):
            lines[idx] = re.sub(
                r"\[\s*[xX ]\s*\]",
                "[x]" if approved else "[ ]",
                line,
                count=1,
            )
            updated = True
            break
    if not updated:
        insert_at = None
        for idx, line in enumerate(lines):
            if line.strip().startswith("## Approvals"):
                insert_at = idx + 1
                break
        if insert_at is None:
            lines.append("")
            lines.append("## Approvals (check to unblock)")
            insert_at = len(lines)
        lines.insert(insert_at, f"- [{'x' if approved else ' '}] Approve {checkpoint}")
    atomic_write_text(decisions_path, "\n".join(lines).rstrip() + "\n")
def _human_checkpoints_from_units(workspace: Path) -> list[str]:
units_path = workspace / "UNITS.csv"
if not units_path.exists():
return []
try:
table = UnitsTable.load(units_path)
except Exception:
return []
seen: set[str] = set()
out: list[str] = []
for row in table.rows:
owner = (row.get("owner") or "").strip().upper()
if owner != "HUMAN":
continue
checkpoint = (row.get("checkpoint") or "").strip()
if checkpoint and checkpoint not in seen:
seen.add(checkpoint)
out.append(checkpoint)
return out
def _approval_hint(checkpoint: str) -> str:
hints = {
"C0": "kickoff: scope/sources/time window/constraints",
"C1": "retrieval + core set",
"C2": "scope + outline",
"C3": "evidence ready",
"C4": "citations verified",
"C5": "allow prose writing",
}
return hints.get(checkpoint, "")
def upsert_checkpoint_block(decisions_path: Path, checkpoint: str, markdown_block: str) -> None:
    begin = f"<!-- BEGIN CHECKPOINT:{checkpoint} -->"
    end = f"<!-- END CHECKPOINT:{checkpoint} -->"
    block = "\n".join([begin, markdown_block.rstrip(), end, ""]).rstrip() + "\n"
    # Seed the approvals checklist first. It writes nothing when UNITS.csv lists
    # no HUMAN checkpoints, so guard the read instead of assuming the file exists.
    ensure_decisions_approval_checklist(decisions_path)
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n\n"
    pattern = re.compile(
        rf"{re.escape(begin)}.*?{re.escape(end)}\n?",
        flags=re.DOTALL,
    )
    if pattern.search(text):
        # Use a callable so backslashes in the block are inserted literally
        # rather than being parsed as regex replacement escapes.
        new_text = pattern.sub(lambda _match: block, text)
    else:
        new_text = text.rstrip() + "\n\n" + block
    atomic_write_text(decisions_path, new_text)
def seed_queries_from_topic(queries_path: Path, topic: str) -> None:
topic = topic.strip()
if not topic:
return
if queries_path.exists():
lines = queries_path.read_text(encoding="utf-8").splitlines()
else:
lines = [
"# Queries",
"",
"## Primary query",
"- keywords:",
" - \"\"",
"- exclude:",
" - \"\"",
"- max_results: \"\"",
"- core_size: \"\"",
"- time window:",
" - from: \"\"",
" - to: \"\"",
"",
"## Notes",
"-",
]
def _has_nonempty_values(token: str) -> bool:
in_block = False
for raw in lines:
stripped = raw.strip()
if stripped.startswith(f"- {token}:"):
in_block = True
continue
if not in_block:
continue
if raw.startswith(" - "):
value = stripped[2:].strip().strip('"').strip("'")
if value:
return True
continue
if stripped.startswith("- "):
break
return False
def _has_nonempty_scalar(token: str) -> bool:
for raw in lines:
stripped = raw.strip()
if not stripped.startswith(f"- {token}:"):
continue
value = stripped.split(":", 1)[1].split("#", 1)[0].strip().strip('"').strip("'")
return bool(value)
return False
def _has_nonempty_time_field(field: str) -> bool:
# Looks for lines like: ' - from: "2022"'
for raw in lines:
stripped = raw.strip()
if not stripped.startswith(f"- {field}:"):
continue
value = stripped.split(":", 1)[1].strip().strip('"').strip("'")
return bool(value)
return False
has_keywords = _has_nonempty_values("keywords")
has_excludes = _has_nonempty_values("exclude")
has_time_from = _has_nonempty_time_field("from")
has_time_to = _has_nonempty_time_field("to")
has_max_results = _has_nonempty_scalar("max_results")
has_core_size = _has_nonempty_scalar("core_size")
workspace = queries_path.parent
profile = pipeline_profile(workspace)
query_defaults = pipeline_query_defaults(workspace)
raw_tlow = topic.lower()
topic_for_queries = _sanitize_topic_for_query_seed(topic)
keyword_suggestions = [topic_for_queries]
tlow = topic_for_queries.lower()
is_agent = any(t in tlow for t in ("agent", "agents", "agentic"))
is_embodied = any(
t in tlow
for t in (
"embodied ai",
"embodied intelligence",
"embodied agent",
"embodied robotics",
"robot foundation model",
"robot learning",
"robot manipulation",
"vision-language-action",
"vla",
"generalist robot",
)
)
is_text_to_image = any(t in tlow for t in ("text-to-image", "text to image", "t2i"))
is_text_to_video = any(t in tlow for t in ("text-to-video", "text to video", "t2v"))
is_diffusion = "diffusion" in tlow
is_generative = is_text_to_image or is_text_to_video or is_diffusion or ("image generation" in tlow) or ("generative" in tlow)
if is_agent:
keyword_suggestions.extend(
[
"LLM agent",
"language model agent",
"tool use",
"function calling",
"tool-using agent",
"planning",
"memory",
"multi-agent",
"benchmark",
"safety",
]
)
exclude_suggestions: list[str] = []
if is_agent:
exclude_suggestions.append("agent-based modeling")
exclude_suggestions.extend(["react hooks", "perovskite", "banach", "coxeter"])
if is_embodied:
keyword_suggestions.extend(
[
"embodied AI survey",
"embodied AI review",
"embodied intelligence survey",
"embodied agent survey",
"robot foundation model survey",
"robot learning survey",
"robot manipulation survey",
"embodied robotics survey",
"vision-language-action survey",
"vision-language-action model",
"robot foundation model",
"generalist robot policy",
"world model robot",
]
)
stripped_output_terms = topic_for_queries.lower().strip() != raw_tlow.strip()
if stripped_output_terms and any(t in raw_tlow for t in ("latex", "pdf", "markdown", "typesetting")):
exclude_suggestions.extend(["latex", "pdf", "typesetting", "document layout"])
if is_generative:
keyword_suggestions.extend(
[
"text-to-image generation",
"text-guided image generation",
"diffusion model",
"denoising diffusion probabilistic model",
"latent diffusion",
"stable diffusion",
"classifier-free guidance",
"diffusion transformer",
"DiT",
"masked generative transformer",
"MaskGIT",
"autoregressive image generation",
"VQGAN",
"VQ-VAE",
"ControlNet",
"DreamBooth",
"textual inversion",
"LoRA fine-tuning",
]
)
default_max_results = query_defaults.get("max_results")
max_results_suggestion = str(default_max_results) if str(default_max_results or "").strip() else (
1800 if profile == "arxiv-survey" else (800 if (is_agent or is_generative or is_embodied) else 300)
)
time_from_suggestion = (
"2018" if is_embodied
else ("2022" if (is_agent and ("llm" in tlow or "language model" in tlow)) else ("2020" if is_generative else ""))
)
core_size_suggestion = str(query_defaults.get("core_size") or "").strip() or ("300" if profile == "arxiv-survey" else "")
out: list[str] = []
i = 0
while i < len(lines):
line = lines[i]
stripped = line.strip()
if stripped.startswith("- keywords:") and not has_keywords:
out.append(line)
i += 1
while i < len(lines) and lines[i].startswith(" - "):
i += 1
for kw in _dedupe_preserve_order(keyword_suggestions)[:14]:
out.append(f" - \"{kw}\"")
continue
if stripped.startswith("- exclude:") and not has_excludes:
out.append(line)
i += 1
while i < len(lines) and lines[i].startswith(" - "):
i += 1
for ex in _dedupe_preserve_order(exclude_suggestions)[:10]:
out.append(f" - \"{ex}\"")
continue
if stripped.startswith("- max_results:") and not has_max_results and max_results_suggestion:
out.append(f"- max_results: \"{max_results_suggestion}\"")
i += 1
continue
if stripped.startswith("- core_size:") and not has_core_size and core_size_suggestion:
out.append(f"- core_size: \"{core_size_suggestion}\"")
i += 1
continue
if stripped.startswith("- time window:") and not (has_time_from or has_time_to) and time_from_suggestion:
out.append(line)
i += 1
# Skip existing from/to lines if present.
while i < len(lines) and lines[i].startswith(" -"):
i += 1
out.append(f" - from: \"{time_from_suggestion}\"")
out.append(" - to: \"\"")
continue
out.append(line)
i += 1
out = _materialize_missing_query_defaults(out, query_defaults, allowed_fields=pipeline_overridable_query_fields(workspace))
atomic_write_text(queries_path, "\n".join(out).rstrip() + "\n")
def _dedupe_preserve_order(items: list[str]) -> list[str]:
    seen: set[str] = set()
    out: list[str] = []
    for item in items:
        item = item.strip()
        if not item or item in seen:
            continue
        seen.add(item)
        out.append(item)
    return out
def _sanitize_topic_for_query_seed(topic: str) -> str:
text = str(topic or "").strip()
if not text:
return ""
patterns = [
r"(?i)\bwith\s+latex\s*/\s*pdf\s+output\b",
r"(?i)\bwith\s+latex\s+output\b",
r"(?i)\bwith\s+pdf\s+output\b",
r"(?i)\bwith\s+markdown\s+output\b",
r"(?i)\blatex\s*/\s*pdf\s+output\b",
r"(?i)\bpdf\s+output\b",
r"(?i)\blatex\s+output\b",
r"(?i)\bmarkdown\s+output\b",
r"(?i)\bfor\s+latex\s*/\s*pdf\b",
]
for pattern in patterns:
text = re.sub(pattern, "", text)
text = re.sub(r"\s+", " ", text).strip(" ,;:-")
return text or topic
_LEGACY_PIPELINE_ALIASES = {
"idea-finder": "idea-brainstorm",
"idea-finder.pipeline.md": "idea-brainstorm",
"pipelines/idea-finder.pipeline.md": "idea-brainstorm",
}
def find_repo_root(start: Path | None = None) -> Path:
    """Walk up from *start* (default: this file) looking for AGENTS.md."""
    candidate = (start or Path(__file__)).resolve()
    for _ in range(10):
        if (candidate / "AGENTS.md").exists():
            return candidate
        parent = candidate.parent
        if parent == candidate:
            break
        candidate = parent
    raise FileNotFoundError("Could not find repo root (AGENTS.md marker)")
def _normalize_pipeline_lock_value(value: str) -> str:
raw = str(value or "").strip()
return _LEGACY_PIPELINE_ALIASES.get(raw, raw)
def resolve_pipeline_spec_path(*, repo_root: Path, pipeline_value: str) -> Path | None:
value = _normalize_pipeline_lock_value(pipeline_value)
if not value:
return None
candidate = Path(value)
if candidate.is_absolute() and candidate.exists():
return candidate.resolve()
rel_candidate = repo_root / value
if rel_candidate.exists():
return rel_candidate.resolve()
filename = Path(value).name
if filename:
direct = repo_root / "pipelines" / filename
if direct.exists():
return direct.resolve()
stem = filename
if stem.endswith(".pipeline.md"):
stem = stem[: -len(".pipeline.md")]
if stem:
direct = repo_root / "pipelines" / f"{stem}.pipeline.md"
if direct.exists():
return direct.resolve()
return None
def load_workspace_pipeline_spec(workspace: Path):
from tooling.pipeline_spec import PipelineSpec
try:
repo_root = find_repo_root(workspace)
except FileNotFoundError:
repo_root = Path(__file__).resolve().parents[1]
lock_path = workspace / "PIPELINE.lock.md"
if not lock_path.exists():
return None
pipeline_name = ""
try:
for raw in lock_path.read_text(encoding="utf-8", errors="ignore").splitlines():
line = raw.strip()
if line.startswith("pipeline:"):
pipeline_name = line.split(":", 1)[1].strip()
break
except Exception:
return None
if not pipeline_name:
return None
spec_path = resolve_pipeline_spec_path(repo_root=repo_root, pipeline_value=pipeline_name)
if spec_path is None:
return None
try:
return PipelineSpec.load(spec_path)
except Exception:
return None
def pipeline_query_defaults(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
return dict(spec.query_defaults) if spec is not None else {}
def pipeline_quality_contract(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
return dict(spec.quality_contract) if spec is not None else {}
def pipeline_quality_contract_value(workspace: Path, *keys: str, default: Any = None) -> Any:
current: Any = pipeline_quality_contract(workspace)
for key in keys:
if not isinstance(current, dict):
return default
current = current.get(str(key))
if current is None:
return default
return current
def pipeline_query_default(workspace: Path, key: str, default: Any = None) -> Any:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return default
return spec.query_default(key, default)
def pipeline_overridable_query_fields(workspace: Path) -> set[str]:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return set()
return set(spec.overridable_query_fields)
def pipeline_profile(workspace: Path) -> str:
"""Return the pipeline profile for a workspace.
Reads PIPELINE.lock.md to get the pipeline name, then loads the
pipeline spec file and reads its ``profile`` frontmatter field.
Falls back to ``"default"`` if anything is missing.
"""
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return "default"
return str(spec.profile or "default").strip() or "default"
def latest_outline_state(workspace: Path) -> dict[str, Any]:
path = Path(workspace).resolve() / "outline" / "outline_state.jsonl"
records = [rec for rec in read_jsonl(path) if isinstance(rec, dict)]
return dict(records[-1]) if records else {}
def _materialize_missing_query_defaults(lines: list[str], query_defaults: dict[str, Any], *, allowed_fields: set[str] | None = None) -> list[str]:
if not query_defaults:
return lines
existing_keys: set[str] = set()
for raw in lines:
stripped = raw.strip()
if not stripped.startswith("- ") or ":" not in stripped:
continue
key = stripped[2:].split(":", 1)[0].strip().lower().replace(" ", "_").replace("-", "_")
if key:
existing_keys.add(key)
additions: list[str] = []
for key, value in query_defaults.items():
norm_key = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
if not norm_key or norm_key in existing_keys:
continue
if allowed_fields and norm_key not in allowed_fields:
continue
rendered = _render_query_scalar(value)
if rendered is None:
continue
additions.append(f'- {norm_key}: "{rendered}"')
if not additions:
return lines
out = list(lines)
if out and out[-1].strip():
out.append("")
out.extend(additions)
return out
def _render_query_scalar(value: Any) -> str | None:
if value is None:
return None
if isinstance(value, bool):
return "true" if value else "false"
if isinstance(value, (int, float)):
return str(value)
if isinstance(value, str):
text = value.strip()
return text or None
return None
FILE:tooling/executor.py
from __future__ import annotations
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path
from tooling.common import (
UnitsTable,
atomic_write_text,
decisions_has_approval,
ensure_dir,
latest_outline_state,
load_workspace_pipeline_spec,
now_iso_seconds,
parse_semicolon_list,
set_decisions_approval,
update_status_field,
update_status_log,
)
@dataclass(frozen=True)
class RunResult:
unit_id: str | None
status: str
message: str
def _section_first_cutover_block_message(*, workspace: Path, outputs: list[str]) -> str | None:
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None or str(spec.structure_mode or "").strip().lower() != "section_first":
        return None
    if "outline/outline_state.jsonl" not in outputs:
        return None
    latest = latest_outline_state(workspace)
    if not latest:
        return "Section-first cutover is not actionable yet: `outline/outline_state.jsonl` is missing or empty. Rerun `outline-refiner` after fixing the chapter/section layer."
    structure_phase = str(latest.get("structure_phase") or "").strip()
    h3_status = str(latest.get("h3_status") or "").strip().lower()
    reroute_target = str(latest.get("reroute_target") or "").strip()
    retry_budget = str(latest.get("retry_budget_remaining") or "").strip()
    reroute_reason = str(latest.get("reroute_reason") or "").strip()
    if structure_phase.lower() == "decomposed" and h3_status == "stable":
        return None
    missing: list[str] = []
    for key in ("structure_phase", "h3_status", "approval_status", "reroute_target", "retry_budget_remaining"):
        if key not in latest:
            missing.append(key)
            continue
        if key in {"structure_phase", "h3_status"} and not str(latest.get(key) or "").strip():
            missing.append(key)
    if missing:
        return (
            "Section-first cutover cannot advance because the latest `outline_state.jsonl` record is missing "
            f"required fields: {', '.join(missing)}."
        )
    reroute_label = reroute_target or "unknown"
    retry_label = retry_budget or "unknown"
    reason_suffix = f" Reason: {reroute_reason}." if reroute_reason else ""
    return (
        "Section-first cutover is still blocked; "
        f"latest outline state has structure_phase={structure_phase or 'missing'}, "
        f"h3_status={h3_status or 'missing'}, reroute_target={reroute_label}, "
        f"retry_budget_remaining={retry_label}.{reason_suffix} Fix the reroute target explicitly, then rerun this unit."
    )
def _append_run_error(*, workspace: Path, unit_id: str, skill: str, kind: str, message: str, log_rel: str | None) -> None:
    """Append a short failure record to `output/RUN_ERRORS.md` (workspace-local).
    This is a human-facing error sink that survives reruns and makes BLOCKED states debuggable.
    """
    try:
        ensure_dir(workspace / "output")
        out_path = workspace / "output" / "RUN_ERRORS.md"
        stamp = now_iso_seconds()
        log_hint = f" (log: `{log_rel}`)" if log_rel else ""
        line = f"- {stamp} `{unit_id}` `{skill}` `{kind}`: {message}{log_hint}"
        if out_path.exists() and out_path.stat().st_size > 0:
            prev = out_path.read_text(encoding="utf-8", errors="ignore").rstrip() + "\n"
        else:
            prev = "# Run errors\n\n"
        atomic_write_text(out_path, prev + line + "\n")
    except Exception:
        # Never let the runner crash while trying to log an error.
        return
def run_one_unit(
    *,
    workspace: Path,
    repo_root: Path,
    strict: bool = False,
    auto_approve: set[str] | None = None,
) -> RunResult:
    units_path = workspace / "UNITS.csv"
    status_path = workspace / "STATUS.md"
    if not units_path.exists():
        return RunResult(unit_id=None, status="ERROR", message=f"Missing {units_path}")
    table = UnitsTable.load(units_path)
    runnable_idx = _find_first_runnable(table)
    if runnable_idx is None:
        return RunResult(unit_id=None, status="IDLE", message="No runnable unit found")
    row = table.rows[runnable_idx]
    unit_id = row.get("unit_id", "").strip()
    skill = row.get("skill", "").strip()
    owner = row.get("owner", "").strip().upper()
    row["status"] = "DOING"
    table.save(units_path)
    update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DOING {skill}")
    auto_approve_set = {str(x or "").strip().upper() for x in (auto_approve or set()) if str(x or "").strip()}
    if owner == "HUMAN":
        checkpoint = row.get("checkpoint", "").strip()
        if checkpoint and decisions_has_approval(workspace / "DECISIONS.md", checkpoint):
            row["status"] = "DONE"
            table.save(units_path)
            update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (HUMAN approved {checkpoint})")
            _refresh_status_checkpoint(status_path, table)
            return RunResult(unit_id=unit_id, status="DONE", message=f"HUMAN approved {checkpoint}")
        if checkpoint and checkpoint.upper() in auto_approve_set:
            set_decisions_approval(workspace / "DECISIONS.md", checkpoint, approved=True)
            row["status"] = "DONE"
            table.save(units_path)
            update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (AUTO approved {checkpoint})")
            _refresh_status_checkpoint(status_path, table)
            return RunResult(unit_id=unit_id, status="DONE", message=f"AUTO approved {checkpoint}")
        row["status"] = "BLOCKED"
        table.save(units_path)
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (await HUMAN approval {checkpoint})")
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Await HUMAN approval {checkpoint} in DECISIONS.md")
    script_path = repo_root / "scripts" / "run.py"
    if not script_path.exists():
        row["status"] = "BLOCKED"
        table.save(units_path)
        skill_md = "SKILL.md"
        update_status_log(
            status_path,
            (
                f"{now_iso_seconds()} {unit_id} BLOCKED "
                f"(no script for {skill}; run manually per {skill_md} then mark DONE)"
            ),
        )
        _refresh_status_checkpoint(status_path, table)
        return RunResult(
            unit_id=unit_id,
            status="BLOCKED",
            message=(
                f"No executable script for skill '{skill}'. "
                f"Run it manually by following `{skill_md}`, write the required outputs, "
                f"then mark the unit DONE (e.g., `python scripts/pipeline.py mark --workspace {workspace} --unit-id {unit_id} --status DONE`)."
            ),
        )
    inputs = parse_semicolon_list(row.get("inputs"))
    raw_outputs = parse_semicolon_list(row.get("outputs"))
    outputs = [_strip_optional_marker(rel) for rel in raw_outputs]
    required_outputs = [outputs[i] for i, rel in enumerate(raw_outputs) if not rel.strip().startswith("?")]
    checkpoint = row.get("checkpoint", "").strip()
    cmd = [
        sys.executable,
        str(script_path),
        "--workspace",
        str(workspace),
        "--unit-id",
        unit_id,
        "--inputs",
        ";".join(inputs),
        "--outputs",
        ";".join(outputs),
        "--checkpoint",
        checkpoint,
    ]
    log_rel = f"output/unit_logs/{unit_id}.{skill}.log"
    log_path = workspace / log_rel
    try:
        completed = subprocess.run(cmd, check=False, capture_output=True, text=True)
        if completed.stdout or completed.stderr or completed.returncode != 0:
            ensure_dir(log_path.parent)
            body = [
                "# Unit log\n",
                f"- unit_id: {unit_id}\n",
                f"- skill: {skill}\n",
                f"- exit: {completed.returncode}\n",
                f"- cmd: {' '.join(cmd)}\n",
                "\n## stdout\n\n",
                (completed.stdout or "(empty)") + "\n",
                "\n## stderr\n\n",
                (completed.stderr or "(empty)") + "\n",
            ]
            atomic_write_text(log_path, "".join(body))
    except Exception as exc:  # pragma: no cover
        row["status"] = "BLOCKED"
        table.save(units_path)
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (exec error)")
        _append_run_error(
            workspace=workspace,
            unit_id=unit_id,
            skill=skill,
            kind="exec_error",
            message=f"{type(exc).__name__}: {exc}",
            log_rel=None,
        )
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="BLOCKED", message=str(exc))
    missing = [rel for rel in required_outputs if rel and not (workspace / rel).exists()]
    if completed.returncode == 0 and not missing:
        if strict:
            from tooling.quality_gate import check_unit_outputs, write_quality_report
            try:
                issues = check_unit_outputs(skill=skill, workspace=workspace, outputs=outputs)
            except Exception as exc:  # pragma: no cover
                from tooling.quality_gate import QualityIssue
                issues = [
                    QualityIssue(
                        code="quality_gate_exception",
                        message=f"Quality gate crashed: {type(exc).__name__}: {exc}",
                    )
                ]
            # Avoid confusing stale QUALITY_GATE.md after a successful run.
            report_path = workspace / "output" / "QUALITY_GATE.md"
            if issues or report_path.exists():
                write_quality_report(workspace=workspace, unit_id=unit_id, skill=skill, issues=issues)
            if issues:
                row["status"] = "BLOCKED"
                table.save(units_path)
                rel_report = str((workspace / "output" / "QUALITY_GATE.md").relative_to(workspace))
                update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (quality gate: {rel_report})")
                _refresh_status_checkpoint(status_path, table)
                reroute_hint = _reroute_hint(workspace)
                return RunResult(
                    unit_id=unit_id,
                    status="BLOCKED",
                    message=f"Quality gate failed; see {rel_report}" + (f"; {reroute_hint}" if reroute_hint else ""),
                )
        cutover_block = _section_first_cutover_block_message(workspace=workspace, outputs=outputs)
        if cutover_block:
            row["status"] = "BLOCKED"
            table.save(units_path)
            update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (section-first cutover)")
            _append_run_error(
                workspace=workspace,
                unit_id=unit_id,
                skill=skill,
                kind="section_first_cutover",
                message=cutover_block,
                log_rel=log_rel if log_path.exists() else None,
            )
            _refresh_status_checkpoint(status_path, table)
            return RunResult(unit_id=unit_id, status="BLOCKED", message=cutover_block)
        row["status"] = "DONE"
        table.save(units_path)
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE {skill}")
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="DONE", message="OK")
    row["status"] = "BLOCKED"
    table.save(units_path)
    if missing:
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (missing outputs: {', '.join(missing)})")
        _append_run_error(
            workspace=workspace,
            unit_id=unit_id,
            skill=skill,
            kind="missing_outputs",
            message=f"Missing outputs: {', '.join(missing)}",
            log_rel=log_rel if log_path.exists() else None,
        )
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Missing outputs: {', '.join(missing)}" + (f"; see {log_rel}" if log_path.exists() else ""))
    update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (script failed)")
    _append_run_error(
        workspace=workspace,
        unit_id=unit_id,
        skill=skill,
        kind="script_failed",
        message=f"Skill script failed (exit {completed.returncode})",
        log_rel=log_rel if log_path.exists() else None,
    )
    _refresh_status_checkpoint(status_path, table)
    return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Skill script failed (exit {completed.returncode})" + (f"; see {log_rel}" if log_path.exists() else ""))
def _find_first_runnable(table: UnitsTable) -> int | None:
    status_ok = {"DONE", "SKIP"}
    unit_by_id = {row.get("unit_id", ""): row for row in table.rows}
    for idx, row in enumerate(table.rows):
        if row.get("status", "").strip().upper() not in {"TODO", "BLOCKED"}:
            continue
        deps = parse_semicolon_list(row.get("depends_on"))
        if not deps:
            return idx
        deps_done = True
        for dep_id in deps:
            dep = unit_by_id.get(dep_id)
            if not dep:
                deps_done = False
                break
            if dep.get("status", "").strip().upper() not in status_ok:
                deps_done = False
                break
        if deps_done:
            return idx
    return None
def _refresh_status_checkpoint(status_path: Path, table: UnitsTable) -> None:
    checkpoint = _compute_current_checkpoint(table)
    update_status_field(status_path, "Current checkpoint", checkpoint)
def _compute_current_checkpoint(table: UnitsTable) -> str:
    for row in table.rows:
        if row.get("status", "").strip().upper() not in {"DONE", "SKIP"}:
            return (row.get("checkpoint") or "").strip() or "C0"
    return "DONE"
def invalidate_downstream_units(table: UnitsTable, *, root_unit_id: str) -> list[str]:
    """Reset all transitive downstream dependents of `root_unit_id` to TODO.
    This is used when a previously satisfied upstream unit is reopened for rerun.
    Keeping downstream units as DONE would otherwise leave stale artifacts in place
    and make later `run` invocations stop too early.
    """
    root = str(root_unit_id or "").strip()
    if not root:
        return []
    direct_children: dict[str, list[dict[str, str]]] = {}
    for row in table.rows:
        unit_id = str(row.get("unit_id") or "").strip()
        for dep in parse_semicolon_list(row.get("depends_on")):
            direct_children.setdefault(dep, []).append(row)
    affected: list[str] = []
    seen: set[str] = set()
    stack = [root]
    while stack:
        current = stack.pop()
        for child in direct_children.get(current, []):
            child_id = str(child.get("unit_id") or "").strip()
            if not child_id or child_id in seen:
                continue
            seen.add(child_id)
            if str(child.get("status") or "").strip().upper() != "TODO":
                child["status"] = "TODO"
                affected.append(child_id)
            stack.append(child_id)
    return affected
def _strip_optional_marker(relpath: str) -> str:
    relpath = (relpath or "").strip()
    if relpath.startswith("?"):
        return relpath[1:].strip()
    return relpath
def _reroute_hint(workspace: Path) -> str:
    path = workspace / "output" / "REROUTE_STATE.json"
    if not path.exists() or path.stat().st_size <= 0:
        return ""
    try:
        import json
        data = json.loads(path.read_text(encoding="utf-8", errors="ignore") or "{}")
    except Exception:
        return ""
    if not isinstance(data, dict):
        return ""
    target = str(data.get("reroute_target") or "").strip()
    status = str(data.get("status") or "").strip()
    phase = str(data.get("structure_phase") or "").strip()
    h3 = str(data.get("h3_status") or "").strip()
    reason = str(data.get("reroute_reason") or "").strip()
    if not any([target, status, phase, h3]):
        return ""
    parts = []
    if status:
        parts.append(f"reroute_status={status}")
    if target:
        parts.append(f"reroute_target={target}")
    if phase:
        parts.append(f"structure_phase={phase}")
    if h3:
        parts.append(f"h3_status={h3}")
    if reason:
        parts.append(f"reason={reason}")
    return ", ".join(parts)
FILE:tooling/ideation.py
from __future__ import annotations
import json
import re
from dataclasses import asdict, dataclass, is_dataclass
from pathlib import Path
from typing import Any, Iterable
from tooling.common import (
atomic_write_text,
ensure_dir,
load_workspace_pipeline_spec,
pipeline_overridable_query_fields,
pipeline_query_default,
read_jsonl,
)
DEFAULT_IDEA_RUBRIC: list[tuple[str, float, str]] = [
("discussion_worthiness", 0.24, "is this direction worth a serious PI/PhD discussion right now?"),
("academic_value", 0.22, "could it open a meaningful paper/thesis-worthy research angle?"),
("evidence_grounding", 0.18, "can we point to concrete literature tensions rather than pure speculation?"),
("direction_distinctness", 0.16, "is it genuinely different from neighboring directions, not just a wording variant?"),
("first_probe_clarity", 0.10, "is there a plausible low-cost first probe that would teach us something?"),
("thesis_potential", 0.10, "does it plausibly grow beyond a one-off note into a stronger line of work?"),
]
STOPWORDS = {
"a", "an", "and", "the", "of", "to", "for", "in", "on", "with", "via", "by", "from", "or",
"work", "works", "study", "studies", "survey", "review", "benchmarks", "benchmark", "evaluation",
"agents", "agent", "model", "models", "llm", "large", "language", "using", "based",
}
CLUSTER_AXIS_HINTS: list[tuple[list[str], list[str], str, str]] = [
(["agent loop", "action"], ["observability granularity", "action-space design", "tool/environment boundary"], "mechanism", "Could sharpen how we think about the basic agent loop rather than adding yet another full stack."),
(["tool", "orchestration", "interface"], ["tool routing", "permission model", "API reliability assumptions"], "systems", "Could clarify whether interface design is the real bottleneck behind seemingly agent-level gains."),
(["planning", "reasoning"], ["search depth", "verification loop", "partial-observability handling"], "mechanism", "Could turn vague talk about reasoning quality into a sharper research question about what actually drives robust planning."),
(["memory", "retrieval", "rag"], ["retrieval policy", "state summarization cadence", "memory horizon"], "mechanism", "Could reveal when memory is a genuine capability lever versus a reporting convenience."),
(["self-improvement", "adaptation"], ["feedback type", "adaptation horizon", "evaluation signal quality"], "adaptation", "Could distinguish genuine improvement from evaluation-sensitive self-optimization."),
(["multi-agent", "coordination"], ["role specialization", "communication budget", "aggregation rule"], "coordination", "Could separate coordination benefits from simple redundancy or verification effects."),
(["benchmark", "evaluation"], ["metric choice", "task slice design", "failure-analysis lens"], "evaluation", "Could make evaluation choices themselves a research object instead of hidden background assumptions."),
(["safety", "security", "governance"], ["threat model", "monitoring granularity", "permission boundary"], "governance", "Could connect technical agent behaviors to deployment-relevant risk questions in a more principled way."),
]
GENERIC_LIMITATION_PATTERNS = [
"abstract-level evidence only",
"validate assumptions",
"full paper",
"before relying on this as key evidence",
]
BENCHMARK_HINTS = [
"HotpotQA", "FEVER", "ALFWorld", "WebShop", "HumanEval", "WebArena", "AgentBench",
"GSM8K", "MATH", "Wikipedia", "wiki", "API", "question answering", "fact verification",
]
AXIS_INSIGHT_LIBRARY: dict[str, dict[str, str]] = {
"observability granularity": {
"question": "whether several agent-loop gains are really planner gains, or whether they mostly come from changing what the planner gets to observe and when",
"confound": "planner quality and broader agent competence",
"thesis": "Several agent-loop gains remain hard to interpret because papers often improve observation access at the same time they improve the planner; before crediting planner depth, we should ask what the system was allowed to see.",
"insight": "A convincing result would not just move aggregate score. It would show which published gains survive once observation access is fixed, ideally producing a regime map of when observability—not planner depth—changes the failure story.",
"contribution_shape": "Could yield a causal-attribution result plus a reporting rule for agent-loop papers: claims about planning quality should specify and control observation access.",
"demotion": "This direction weakens sharply if the strongest prior work already holds observation access fixed while planner quality changes and still reports the same gain pattern.",
"kill_signal": "an anchor paper already fixes observation access while varying planner quality and the main conclusion still survives",
"missing_piece": "What is missing is a fixed-interface, fixed-budget comparison that varies only observation access and tracks whether the failure taxonomy—not just average score—changes.",
"program_kind": "causal attribution",
"time_to_clarity": "fast",
"priority_note": "Fastest to falsify on public tasks, and the answer would immediately change how several agent-loop results are interpreted.",
"reading_extract": "what the agent sees at each step, which ablations vary planner depth, and whether the action interface stays fixed",
"title": "Observability granularity vs planner depth",
},
"action-space design": {
"question": "whether robustness comes from better reasoning or from giving the agent a cleaner action interface",
"confound": "the shape of the action vocabulary and interface design",
"thesis": "Some agent-loop gains may be interface gains in disguise: when action vocabularies become cleaner or narrower, papers can attribute robustness to reasoning improvements that partly come from action-space design.",
"insight": "A useful result would show whether the same nominal planner still behaves very differently once the action space is widened, normalized, or made less ergonomic.",
"contribution_shape": "Could produce an action-space normalization protocol and a clearer account of which agent claims survive once interface ergonomics are controlled.",
"demotion": "This direction weakens if current papers already compare equivalent action spaces and still obtain the same ranking.",
"kill_signal": "the key anchor papers already normalize action vocabularies or API surfaces and still see the same ordering",
"missing_piece": "What is missing is an action-space-normalized comparison that keeps planner prompts fixed while changing only interface granularity and affordances.",
"program_kind": "interface normalization",
"time_to_clarity": "medium",
"priority_note": "Interesting and still thesis-relevant, but less immediate than observability because interface normalization usually requires more careful task redesign.",
"reading_extract": "how many actions are available, how semantically aligned the actions are, and whether interface cleanup happens together with reasoning changes",
"title": "Action-space design or agent competence?",
},
"tool/environment boundary": {
"question": "whether reliability depends more on boundary design between tool and environment than on the nominal reasoning loop",
"confound": "tool abstraction and environment modeling",
"thesis": "A number of agent results may really be about boundary design: where the system stops reasoning internally and starts delegating to tools can change reliability without changing the nominal planner much.",
"insight": "A strong result would show that several agent-level conclusions are actually boundary-design conclusions once tool abstraction is normalized.",
"contribution_shape": "Could yield a boundary-design taxonomy and a cleaner experimental recipe for separating planner quality from tool/interface scaffolding.",
"demotion": "This direction weakens if changing the boundary carefully barely affects the failure story.",
"kill_signal": "careful tool-boundary changes leave both the metric and the qualitative failure modes essentially unchanged",
"missing_piece": "What is missing is a comparison that keeps task semantics fixed while moving the tool/environment boundary in a controlled way.",
"program_kind": "systems boundary",
"time_to_clarity": "medium",
"priority_note": "Valuable as a systems-facing thesis line, but slower to turn into a decisive first readout than the lead mechanism questions.",
"reading_extract": "where tool calls begin, what state is exposed to the planner, and whether environment modeling changes together with the tool boundary",
"title": "Where the tool boundary really matters",
},
"search depth": {
"question": "whether apparent planning gains reflect deeper search or simply more inference-time budget to recover from weak initial choices",
"confound": "inference-time compute budget",
"thesis": "Many planning results still bundle depth with budget: deeper search often spends more tokens, branches, or retries, so it remains unclear whether depth changes reasoning quality or just buys more recovery opportunities.",
"insight": "A convincing result would show whether depth changes the nature of planning failures under a matched budget, rather than merely delaying failure by spending more compute.",
"contribution_shape": "Could produce a compute-normalized planning benchmark slice and a regime map for when search depth matters beyond extra inference budget.",
"demotion": "This direction weakens if prior work already equalizes budget and still finds depth-specific gains.",
"kill_signal": "the anchor papers already normalize token or wall-clock budget and the depth advantage remains intact",
"missing_piece": "What is missing is a compute-normalized study that holds token or wall-clock budget fixed while varying depth or branching, then checks whether failure modes actually change.",
"program_kind": "budget-normalized mechanism",
"time_to_clarity": "medium",
"priority_note": "Still thesis-sized, but it ranks behind observability because compute normalization is harder to defend and easier for prior work to have addressed already.",
"reading_extract": "the reported depth or branching settings, the effective compute budget, and whether shallow baselines were budget-matched",
"title": "Search depth or compute budget?",
},
"verification loop": {
"question": "whether verification contributes a distinct reasoning mechanism or mostly acts as expensive redundancy",
"confound": "extra compute and repeated checking",
"thesis": "Verification-heavy pipelines may look stronger because they retry, re-check, or filter more often, not necessarily because they add a distinct reasoning mechanism.",
"insight": "A useful result would separate verification as a mechanism from verification as a compute-expensive retry policy.",
"contribution_shape": "Could produce a cleaner protocol for comparing verification loops under matched retry or compute budgets.",
"demotion": "This direction weakens if verification can be replaced by simple repetition with little change in behavior.",
"kill_signal": "repeat-sampling or retry baselines already match the reported verification gain",
"missing_piece": "What is missing is a matched-budget comparison between explicit verification and simple repetition or reranking baselines.",
"program_kind": "verification audit",
"time_to_clarity": "medium",
"priority_note": "Decision-relevant when the group suspects redundancy masquerading as reasoning, though the probe is less clean than the top two directions.",
"reading_extract": "how many extra passes are spent on checking, whether retry baselines exist, and which errors verification actually fixes",
"title": "Verification or just expensive redundancy?",
},
"partial-observability handling": {
"question": "whether planning systems fail because they reason badly or because they are brittle under missing state information",
"confound": "state visibility",
"thesis": "Some planning failures may be epistemic before they are algorithmic: systems can look like weak planners when they are actually brittle under missing state information.",
"insight": "A strong result would show whether improved visibility changes the failure taxonomy more than planner changes do.",
"contribution_shape": "Could produce a planning-under-partial-observability benchmark slice and a clearer division between epistemic and algorithmic failure modes.",
"demotion": "This direction weakens if stronger visibility barely changes the failure pattern.",
"kill_signal": "improved visibility leaves both success rates and error types almost unchanged",
"missing_piece": "What is missing is a task slice that varies only state visibility while keeping planner scaffolding steady.",
"program_kind": "epistemic failure analysis",
"time_to_clarity": "medium",
"priority_note": "Useful if the group wants a deeper failure-analysis program, but it is currently less concrete than the lead directions.",
"reading_extract": "what information is hidden, which observations are restored by tools, and whether planner quality is tested separately from visibility",
"title": "Is planning failure really an observability problem?",
},
"retrieval policy": {
"question": "whether memory gains come from better stored knowledge or from better decisions about when and how retrieval is triggered",
"confound": "memory content versus retrieval timing",
"thesis": "Reported memory gains are often hard to interpret because memory content and retrieval triggers move together; some improvements may be protocol effects about when retrieval happens, not capability effects about what memory stores.",
"insight": "A convincing result would separate memory-content effects from retrieval-trigger effects and show whether the interpretation of current RAG-style gains changes once trigger policy is normalized.",
"contribution_shape": "Could yield a cleaner memory-evaluation protocol and a reusable result on when retrieval policy, rather than memory content, is the true hidden variable.",
"demotion": "This direction weakens if current work already varies retrieval policy independently of memory content and sees the same story.",
"kill_signal": "the main memory papers already keep the memory store fixed while varying retrieval triggers or timing and the conclusion does not move",
"missing_piece": "What is missing is a fixed-memory comparison that varies retrieval trigger and timing only, then checks whether the gain survives and which errors it actually removes.",
"program_kind": "protocol sensitivity",
"time_to_clarity": "medium",
"priority_note": "Different enough from the planning directions to stay in the lead set, but it ranks lower because the current evidence is thinner and more protocol-sensitive.",
"reading_extract": "what is stored in memory, when retrieval fires, what retrieval budget is allowed, and whether trigger policy is ever ablated independently",
"title": "Retrieval policy or memory content?",
},
"state summarization cadence": {
"question": "whether summarization helps because it improves reasoning or because it selectively compresses away troublesome state information",
"confound": "what gets forgotten versus what gets highlighted",
"thesis": "Summarization may help less because the agent reasons better and more because the system filters state in a favorable way; cadence choices can quietly redefine what information survives.",
"insight": "A useful result would show whether different summarization cadences preserve the same conclusions and failure taxonomy once memory content is held steady.",
"contribution_shape": "Could produce a summarization-control protocol for long-horizon agents and a clearer account of how state compression alters conclusions.",
"demotion": "This direction weakens if different summarization cadences preserve the same conclusions and failure taxonomy.",
"kill_signal": "changing summarization cadence leaves both retrieval behavior and downstream errors largely unchanged",
"missing_piece": "What is missing is a fixed-memory, fixed-task comparison that varies only summarization cadence and records what state is lost or amplified.",
"program_kind": "state compression",
"time_to_clarity": "medium",
"priority_note": "Promising but currently less mature than retrieval-policy framing because the concrete prior-work hooks are thinner.",
"reading_extract": "what gets summarized away, how often summaries are rewritten, and whether failure cases correlate with state compression choices",
"title": "What summarization cadence is really changing",
},
"memory horizon": {
"question": "whether longer memory helps because more context is useful or because evaluations reward persistence over selectivity",
"confound": "context length and evaluation design",
"thesis": "Longer memory can look like better capability even when the benchmark mainly rewards persistence or repeated access to earlier state, so horizon effects may partly be evaluation-design effects.",
"insight": "A convincing result would show whether extending horizon changes task framing and error patterns, not just aggregate score.",
"contribution_shape": "Could produce a regime map for when horizon length matters, versus when selective retrieval or better task design explains the gain.",
"demotion": "This direction weakens if memory horizon has little effect once retrieval policy is controlled.",
"kill_signal": "memory-length changes stop mattering once retrieval triggers or evaluation design are normalized",
"missing_piece": "What is missing is a horizon study that matches retrieval policy and benchmark framing while varying how much state is retained.",
"program_kind": "regime mapping",
"time_to_clarity": "medium",
"priority_note": "Conceptually useful, but currently less decisive than retrieval-policy framing for a first discussion round.",
"reading_extract": "whether longer context actually changes retrieval choices, and whether the benchmark rewards persistence more than selective recall",
"title": "Does longer memory really help?",
},
}
@dataclass(frozen=True)
class IdeaSignal:
    signal_id: str
    cluster: str
    direction_type: str
    theme: str
    claim_or_observation: str
    tension: str
    missing_piece: str
    possible_axis: str
    academic_value: str
    evidence_confidence: str
    paper_ids: list[str]
@dataclass(frozen=True)
class DirectionCard:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    focus_axis: str
    main_confound: str
    program_kind: str
    contribution_shape: str
    time_to_clarity: str
    one_line_thesis: str
    why_interesting: str
    literature_suggests: list[str]
    closest_prior_gap: list[str]
    missing_piece: str
    possible_variants: list[str]
    academic_value: str
    first_probes: list[str]
    what_counts_as_insight: str
    weakness_conditions: list[str]
    kill_criteria: list[str]
    what_would_change_mind: list[str]
    best_fit: str
    why_this_ranks_here: str
    evidence_confidence: str
    paper_ids: list[str]
    signal_ids: list[str]
    anchor_reading_notes: list[dict[str, str]]
@dataclass(frozen=True)
class ScreenedDirection:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    total_score: float
    discussion_worthiness: int
    academic_value_score: int
    evidence_grounding: int
    direction_distinctness: int
    first_probe_clarity: int
    thesis_potential: int
    recommendation: str
    rationale: str
def read_core_set(path: Path) -> list[dict[str, str]]:
    import csv
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8", newline="") as handle:
        return [dict(row) for row in csv.DictReader(handle)]
def write_jsonl(path: Path, rows: Iterable[dict[str, Any] | Any]) -> None:
    ensure_dir(path.parent)
    encoded: list[str] = []
    for row in rows:
        value = asdict(row) if is_dataclass(row) else row
        encoded.append(json.dumps(value, ensure_ascii=False))
    atomic_write_text(path, "\n".join(encoded).rstrip() + ("\n" if encoded else ""))
def write_json(path: Path, data: Any) -> None:
    ensure_dir(path.parent)
    atomic_write_text(path, json.dumps(data, ensure_ascii=False, indent=2).rstrip() + "\n")
def uniq_keep_order(items: Iterable[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or "").strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out
def slugify(text: str) -> str:
    raw = re.sub(r"[^A-Za-z0-9]+", "-", str(text or "").strip().lower())
    return raw.strip("-") or "x"
def clean_text(text: str, *, limit: int = 220) -> str:
s = str(text or "").strip()
s = s.replace("\n", " ")
s = re.sub(r"\s+", " ", s)
s = s.replace("|", ", ")
s = s.strip(" \"'`")
if len(s) <= limit:
return s
clipped = s[:limit].rsplit(" ", 1)[0].strip()
return clipped if clipped else s[:limit].strip()
def clean_sentence(text: str, *, limit: int = 180) -> str:
    s = clean_text(text, limit=max(limit * 2, limit))
    if len(s) <= limit:
        return s
    window = s[:limit + 40]
    pieces = re.split(r'(?<=[.!?;])\s+', window)
    acc: list[str] = []
    total = 0
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue
        nxt = total + len(piece) + (1 if acc else 0)
        # Never accept a piece that would push the result past `limit`; an
        # oversized first piece falls through to the word-boundary clip below.
        if nxt > limit:
            break
        acc.append(piece)
        total = nxt
        if total >= int(limit * 0.65):
            break
    if acc:
        return " ".join(acc).strip()
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()
def markdown_table(headers: list[str], rows: list[list[str]]) -> str:
    head = "| " + " | ".join(headers) + " |"
    sep = "| " + " | ".join(["---"] * len(headers)) + " |"
    body = ["| " + " | ".join(str(cell).replace("\n", " ").strip() for cell in row) + " |" for row in rows]
    return "\n".join([head, sep] + body)


def write_markdown(path: Path, text: str) -> None:
    ensure_dir(path.parent)
    atomic_write_text(path, text.rstrip() + "\n")
def extract_goal_from_goal_md(path: Path) -> str:
    if not path.exists():
        return "research ideas"
    # Strip each line before the heading check so indented `#` lines are skipped too.
    stripped = (ln.strip() for ln in path.read_text(encoding="utf-8", errors="ignore").splitlines())
    lines = [ln for ln in stripped if ln and not ln.startswith("#")]
    return lines[0] if lines else "research ideas"
def _idea_workspace_from_brief(path: Path) -> Path | None:
    try:
        return path.resolve().parents[2]
    except Exception:
        return None


def _require_positive_float(value: Any, *, field_name: str) -> float:
    try:
        parsed = float(value)
    except Exception as exc:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
    if parsed <= 0:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return parsed
def _query_int_override(workspace: Path, key: str, default: int) -> int:
    queries_path = workspace / "queries.md"
    normalized = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
    if not normalized:
        return int(default)
    if normalized not in pipeline_overridable_query_fields(workspace):
        return int(default)
    if not queries_path.exists():
        return int(default)
    # Scan `- key: value` bullets; the first matching key wins. Errors propagate to the caller.
    for raw in queries_path.read_text(encoding="utf-8", errors="ignore").splitlines():
        line = raw.strip()
        if not line.startswith("- ") or ":" not in line:
            continue
        line_key, value = line[2:].split(":", 1)
        line_key = line_key.strip().lower().replace(" ", "_").replace("-", "_")
        if line_key != normalized:
            continue
        cleaned = value.split("#", 1)[0].strip().strip('"').strip("'")
        if not cleaned:
            raise ValueError(f"Invalid ideation override: `{normalized}` is empty in queries.md.")
        try:
            parsed = int(cleaned)
        except Exception as exc:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be an integer in queries.md.") from exc
        if parsed <= 0:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be a positive integer in queries.md.")
        return parsed
    return int(default)
def _validate_score_weights(value: Any, *, field_name: str) -> dict[str, float]:
    # Only the keys of `required` are checked below; its values are not applied as defaults.
    required = {
        "discussion_worthiness": 0.24,
        "academic_value": 0.22,
        "evidence_grounding": 0.18,
        "direction_distinctness": 0.16,
        "first_probe_clarity": 0.10,
        "thesis_potential": 0.10,
    }
    if not isinstance(value, dict):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    weights: dict[str, float] = {}
    for key in required:
        if key not in value:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{key}")
        weights[key] = _require_positive_float(value.get(key), field_name=f"{field_name}.{key}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    # Normalize so the returned weights sum to 1 (up to rounding).
    return {key: round(weight / total, 6) for key, weight in weights.items()}


def _validate_diversity_axes(value: Any, *, field_name: str) -> list[str]:
    allowed = {"cluster", "direction_type", "program_kind"}
    if not isinstance(value, list):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    axes: list[str] = []
    seen: set[str] = set()
    for item in value:
        axis = str(item or "").strip().lower().replace("-", "_")
        if not axis:
            continue
        if axis not in allowed:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{axis}")
        if axis in seen:
            continue
        seen.add(axis)
        axes.append(axis)
    if not axes:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return axes
def resolve_idea_contract(workspace: Path) -> dict[str, Any]:
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None:
        raise ValueError("Missing active pipeline contract.")
    query_defaults = dict(spec.query_defaults)
    quality_contract = dict(spec.quality_contract)

    def _require_positive_int(value: Any, *, field_name: str) -> int:
        try:
            parsed = int(value)
        except Exception as exc:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
        if parsed <= 0:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
        return parsed

    signal_policy = quality_contract.get("signal_policy") or {}
    direction_policy = quality_contract.get("direction_policy") or {}
    screening_policy = quality_contract.get("screening_policy") or {}
    brief = parse_idea_brief(workspace / "output" / "trace" / "IDEA_BRIEF.md")
    focus_clusters = [str(x).strip() for x in (brief.get("focus_clusters") or []) if str(x).strip()]

    direction_pool_min_default = _require_positive_int(query_defaults.get("direction_pool_min"), field_name="query_defaults.direction_pool_min")
    direction_pool_max_default = _require_positive_int(query_defaults.get("direction_pool_max"), field_name="query_defaults.direction_pool_max")
    shortlist_size_default = _require_positive_int(query_defaults.get("idea_shortlist_size"), field_name="query_defaults.idea_shortlist_size")
    report_top_n_default = _require_positive_int(query_defaults.get("report_top_n"), field_name="query_defaults.report_top_n")
    idea_screen_top_n_default = _require_positive_int(query_defaults.get("idea_screen_top_n"), field_name="query_defaults.idea_screen_top_n")
    shortlist_min = _require_positive_int(direction_policy.get("shortlist_min"), field_name="quality_contract.direction_policy.shortlist_min")
    shortlist_max = _require_positive_int(direction_policy.get("shortlist_max"), field_name="quality_contract.direction_policy.shortlist_max")
    keep_min = _require_positive_int(direction_policy.get("keep_min"), field_name="quality_contract.direction_policy.keep_min")
    cluster_diversity_min = _require_positive_int(direction_policy.get("cluster_diversity_min"), field_name="quality_contract.direction_policy.cluster_diversity_min")
    lead_diversity_target = _require_positive_int(direction_policy.get("lead_diversity_target"), field_name="quality_contract.direction_policy.lead_diversity_target")
    lead_diversity_axes = _validate_diversity_axes(
        direction_policy.get("lead_diversity_axes"),
        field_name="quality_contract.direction_policy.lead_diversity_axes",
    )
    signal_table_min = _require_positive_int(signal_policy.get("min_rows"), field_name="quality_contract.signal_policy.min_rows")
    keep_rank_max = _require_positive_int(screening_policy.get("keep_rank_max"), field_name="quality_contract.screening_policy.keep_rank_max")
    maybe_rank_max = _require_positive_int(screening_policy.get("maybe_rank_max"), field_name="quality_contract.screening_policy.maybe_rank_max")
    score_weights = _validate_score_weights(
        screening_policy.get("score_weights"),
        field_name="quality_contract.screening_policy.score_weights",
    )

    direction_pool_min = _query_int_override(workspace, "direction_pool_min", direction_pool_min_default)
    direction_pool_max = _query_int_override(workspace, "direction_pool_max", direction_pool_max_default)
    shortlist_size = _query_int_override(workspace, "idea_shortlist_size", shortlist_size_default)
    report_top_n = _query_int_override(workspace, "report_top_n", report_top_n_default)
    idea_screen_top_n = _query_int_override(workspace, "idea_screen_top_n", idea_screen_top_n_default)

    if direction_pool_min > direction_pool_max:
        raise ValueError(f"Invalid ideation contract: direction_pool_min ({direction_pool_min}) exceeds direction_pool_max ({direction_pool_max}).")
    if shortlist_size < shortlist_min or shortlist_size > shortlist_max:
        raise ValueError(
            f"Invalid ideation contract: idea_shortlist_size ({shortlist_size}) must stay within "
            f"[{shortlist_min}, {shortlist_max}]."
        )
    if report_top_n > shortlist_size:
        raise ValueError(f"Invalid ideation contract: report_top_n ({report_top_n}) exceeds idea_shortlist_size ({shortlist_size}).")
    if maybe_rank_max < keep_rank_max:
        raise ValueError(
            f"Invalid ideation contract: maybe_rank_max ({maybe_rank_max}) must be >= keep_rank_max ({keep_rank_max})."
        )
    if idea_screen_top_n > direction_pool_max:
        raise ValueError(f"Invalid ideation contract: idea_screen_top_n ({idea_screen_top_n}) exceeds direction_pool_max ({direction_pool_max}).")

    return {
        "focus_clusters": focus_clusters,
        "direction_pool_min": direction_pool_min,
        "direction_pool_max": direction_pool_max,
        "idea_screen_top_n": idea_screen_top_n,
        "shortlist_size": shortlist_size,
        "report_top_n": report_top_n,
        "shortlist_min": shortlist_min,
        "shortlist_max": shortlist_max,
        "signal_table_min": signal_table_min,
        "keep_min": keep_min,
        "cluster_diversity_min": cluster_diversity_min,
        "lead_diversity_target": lead_diversity_target,
        "lead_diversity_axes": lead_diversity_axes,
        "keep_rank_max": keep_rank_max,
        "maybe_rank_max": maybe_rank_max,
        "score_weights": score_weights,
    }
def parse_idea_brief(path: Path) -> dict[str, Any]:
    text = path.read_text(encoding="utf-8", errors="ignore") if path.exists() else ""
    out: dict[str, Any] = {
        "goal": extract_goal_from_goal_md(path),
        "focus_clusters": [],
        "query_buckets": [],
        "exclusions": [],
        "constraints": [],
        "targets": {},
    }
    cur = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith("## "):
            cur = line[3:].strip().lower()
            continue
        if cur == "goal" and line.startswith("- Topic:"):
            out["goal"] = line.split(":", 1)[1].strip()
        if cur in {"focus after c2", "focus lenses after c2"} and line.startswith("- Focus clusters:"):
            clusters = [x.strip() for x in line.split(":", 1)[1].split(";") if x.strip()]
            cleaned: list[str] = []
            for item in clusters:
                low = item.lower()
                if "to be filled" in low or "fill after c2" in low or "placeholder" in low:
                    continue
                cleaned.append(item)
            out["focus_clusters"] = cleaned
        if cur == "query buckets" and re.match(r"^\d+\.\s+", line):
            out["query_buckets"].append(re.sub(r"^\d+\.\s+", "", line).strip())
        if cur in {"exclude terms", "exclusions"} and line.startswith("- "):
            value = line[2:].strip()
            if value and not value.lower().startswith("none"):
                out["exclusions"].append(value)
        if cur == "constraints" and line.startswith("- "):
            out["constraints"].append(line[2:].strip())
        if cur == "targets" and line.startswith("- ") and ":" in line:
            key, value = line[2:].split(":", 1)
            out["targets"][key.strip().lower().replace(" ", "_")] = value.strip()
    return out
def collect_note_index(path: Path) -> dict[str, dict[str, Any]]:
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        pid = str(rec.get("paper_id") or "").strip()
        if pid:
            out[pid] = rec
    return out


def keywords_from_cluster(name: str) -> list[str]:
    toks = [t for t in re.findall(r"[A-Za-z0-9]+", str(name or "").lower()) if t not in STOPWORDS and len(t) > 2]
    return uniq_keep_order(toks)


def score_note_to_cluster(cluster_name: str, note: dict[str, Any]) -> int:
    keys = keywords_from_cluster(cluster_name)
    blob_parts = [str(note.get("title") or "")]
    for field in ["summary_bullets", "limitations", "key_results", "method", "abstract"]:
        val = note.get(field)
        if isinstance(val, list):
            blob_parts.extend([str(x) for x in val])
        elif isinstance(val, str):
            blob_parts.append(val)
    blob = " ".join(blob_parts).lower()
    return sum(1 for key in keys if key in blob)


def map_notes_to_clusters(taxonomy_path: Path, notes_path: Path) -> dict[str, list[dict[str, Any]]]:
    import yaml

    taxonomy = yaml.safe_load(taxonomy_path.read_text(encoding="utf-8", errors="ignore")) if taxonomy_path.exists() else []
    notes = [r for r in read_jsonl(notes_path) if isinstance(r, dict)]
    out: dict[str, list[dict[str, Any]]] = {}
    for top in taxonomy or []:
        if not isinstance(top, dict):
            continue
        for child in top.get("children") or []:
            if not isinstance(child, dict):
                continue
            name = str(child.get("name") or "").strip()
            if not name:
                continue
            scored: list[tuple[int, dict[str, Any]]] = []
            for note in notes:
                s = score_note_to_cluster(name, note)
                if s > 0:
                    scored.append((s, note))
            scored.sort(key=lambda x: (-x[0], str(x[1].get("paper_id") or "")))
            out[name] = [n for _, n in scored[:8]]
    return out
def _cluster_profile(cluster: str) -> tuple[str, list[str], str]:
    low = cluster.lower()
    for keys, axes, direction_type, academic_value in CLUSTER_AXIS_HINTS:
        if any(key in low for key in keys):
            return direction_type, axes, academic_value
    return "research", ["assumption sensitivity", "failure analysis", "scope boundary"], "Could sharpen the way this sub-area is framed and compared."


def _note_bullets(note: dict[str, Any], field: str) -> list[str]:
    val = note.get(field)
    if isinstance(val, list):
        return [clean_text(x, limit=220) for x in val if clean_text(x, limit=220)]
    if isinstance(val, str) and clean_text(val, limit=220):
        return [clean_text(val, limit=220)]
    return []


def _specific_limitations(note: dict[str, Any]) -> list[str]:
    items = _note_bullets(note, "limitations")
    specific = []
    for item in items:
        low = item.lower()
        if any(pat in low for pat in GENERIC_LIMITATION_PATTERNS):
            continue
        specific.append(item)
    return specific or items


def _evidence_confidence(notes: list[dict[str, Any]]) -> str:
    if not notes:
        return "low"
    if any(str(n.get("evidence_level") or "").strip() == "fulltext" for n in notes):
        return "medium-high"
    specific = sum(
        1
        for n in notes
        if _specific_limitations(n) and not any(p in (_specific_limitations(n)[0].lower()) for p in GENERIC_LIMITATION_PATTERNS)
    )
    if len(notes) >= 3 and specific >= 2:
        return "medium"
    return "low-medium"


def _axis_profile(axis: str, cluster: str) -> dict[str, str]:
    base = AXIS_INSIGHT_LIBRARY.get(axis, {})
    confound = base.get("confound", "nearby design choices and evaluation framing")
    return {
        "question": base.get("question", f"whether {axis} is doing more conceptual work in {cluster.lower()} than current papers make explicit"),
        "confound": confound,
        "thesis": base.get("thesis", f"Current results in {cluster.lower()} may be hard to interpret because {axis} still moves together with {confound}."),
        "insight": base.get("insight", f"A meaningful result would show whether {axis} changes how we interpret the current literature rather than merely how we narrate it."),
        "contribution_shape": base.get("contribution_shape", f"Could turn {axis} into a cleaner explanatory variable for {cluster.lower()} rather than a background convenience."),
        "demotion": base.get("demotion", f"This direction would weaken if the strongest prior work already isolates {axis} cleanly."),
        "kill_signal": base.get("kill_signal", f"the strongest prior work already isolates {axis} against {confound}"),
        "missing_piece": base.get("missing_piece", f"What is missing is a cleaner comparison that varies {axis} while keeping {confound} fixed enough to change interpretation, not just presentation."),
        "program_kind": base.get("program_kind", "mechanism clarification"),
        "time_to_clarity": base.get("time_to_clarity", "medium"),
        "priority_note": base.get("priority_note", f"Worth discussion because it could turn {axis} into a sharper explanatory wedge for {cluster.lower()}."),
        "reading_extract": base.get("reading_extract", f"which variables change together with {axis}, what comparator is used, and whether any ablation already fixes {confound}"),
        "title": base.get("title", f"What {axis} is really doing"),
    }
def _result_fact(note: dict[str, Any]) -> str:
    candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
    scored: list[str] = []
    for cand in candidates:
        low = cand.lower()
        if any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "outperform", "achiev", "%"]) or any(ch.isdigit() for ch in cand):
            scored.append(cand)
    chosen = scored[0] if scored else _claim_text(note)
    chosen = re.sub(r"^(for instance|for example|e\.g\.)[:,]?\s*", "", chosen, flags=re.IGNORECASE)
    low = chosen.lower()
    tasks = _extract_task_mentions(note)
    task_text = "/".join(tasks[:2]) if tasks else "reported setting"
    percents = re.findall(r"\d+(?:\.\d+)?%", chosen)
    if "pass@1" in low and percents:
        if len(percents) >= 2:
            return clean_text(f"{task_text}: {percents[0]} pass@1 vs {percents[1]} in the reported comparison.", limit=95)
        return clean_text(f"{task_text}: {percents[0]} pass@1 in the reported comparison.", limit=95)
    if "success rate" in low and len(percents) >= 2:
        baseline = " over imitation/RL baselines" if "imitation" in low or "reinforcement learning" in low else " in the reported comparison"
        return clean_text(f"{task_text}: {percents[0]}/{percents[1]} success-rate gains{baseline}.", limit=95)
    if len(percents) >= 2:
        return clean_text(f"{task_text}: {percents[0]} vs {percents[1]} in the reported comparison.", limit=95)
    if len(percents) == 1:
        return clean_text(f"{task_text}: reported result reaches {percents[0]} in the abstract.", limit=95)
    return clean_text(chosen, limit=95)


def _metric_phrase(notes: list[dict[str, Any]]) -> str:
    blob = " ".join(_result_fact(note) for note in notes)
    low = blob.lower()
    if "pass@1" in low:
        return "pass@1 plus failure-type shifts"
    if "success rate" in low:
        return "success rate plus failure-type shifts"
    if "accuracy" in low:
        return "accuracy plus failure-type shifts"
    if "exact match" in low:
        return "exact match plus failure-type shifts"
    if "f1" in low:
        return "F1 plus failure-type shifts"
    if "%" in blob:
        return "the reported benchmark metric plus failure-type shifts"
    return "task success plus failure-type shifts"


def _sentence_has_concrete_hook(text: str) -> bool:
    low = str(text or "").lower()
    if any(ch.isdigit() for ch in str(text or "")):
        return True
    if any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "alfworld", "webshop", "hotpotqa", "fever", "humaneval", "webarena", "agentbench"]):
        return True
    return False


def _extract_task_mentions(note: dict[str, Any]) -> list[str]:
    blob_parts = []
    for field in ["key_results", "summary_bullets", "method", "abstract", "title"]:
        val = note.get(field)
        if isinstance(val, list):
            blob_parts.extend([str(x) for x in val])
        elif isinstance(val, str):
            blob_parts.append(val)
    blob = " ".join(blob_parts)
    low = blob.lower()
    positions: list[tuple[int, str]] = []
    for hint in BENCHMARK_HINTS:
        pos = low.find(hint.lower())
        if pos >= 0:
            positions.append((pos, hint))
    for match in re.finditer(r"\b(?:HumanEval|ALFWorld|WebShop|HotpotQA|FEVER|WebArena|AgentBench|GSM8K|MATH)\b", blob):
        positions.append((match.start(), match.group(0)))
    positions.sort(key=lambda item: (item[0], item[1]))
    return uniq_keep_order(name for _, name in positions)


def _task_phrase(notes: list[dict[str, Any]]) -> str:
    tasks: list[str] = []
    for note in notes:
        tasks.extend(_extract_task_mentions(note))
    uniq = uniq_keep_order(tasks)
    if not uniq:
        return "a small public task slice"
    if len(uniq) == 1:
        return uniq[0]
    return "/".join(uniq[:2])


def _claim_text(note: dict[str, Any]) -> str:
    candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
    for cand in candidates:
        if len(cand.split()) >= 8:
            return cand
    return clean_text(note.get("method") or note.get("title") or "", limit=220)
def _limitation_text(note: dict[str, Any], axis: str, cluster: str) -> str:
    limitations = _specific_limitations(note)
    if limitations and not any(pat in limitations[0].lower() for pat in GENERIC_LIMITATION_PATTERNS):
        return limitations[0]
    profile = _axis_profile(axis, cluster)
    return f"The paper still leaves unclear whether {axis} is genuinely responsible for the reported story in {cluster.lower()}, or whether that story is really driven by {profile['confound']}."


def _anchor_notes(note_index: dict[str, dict[str, Any]], paper_ids: list[str]) -> list[dict[str, Any]]:
    return [note_index[pid] for pid in paper_ids if pid in note_index]


def _paper_annotation(note: dict[str, Any], axis: str, cluster: str) -> str:
    title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
    profile = _axis_profile(axis, cluster)
    result = _result_fact(note)
    return clean_sentence(
        f"{title}: {result} Open gap: {axis} still moves with {profile['confound']}.",
        limit=300,
    )


def _synthesis_annotation(notes: list[dict[str, Any]], axis: str, cluster: str) -> str:
    profile = _axis_profile(axis, cluster)
    task_phrase = _task_phrase(notes)
    return clean_sentence(
        f"Across the anchor papers, the live question is whether {axis} changes interpretation itself or merely rides along with {profile['confound']}, especially on {task_phrase}.",
        limit=260,
    )
def build_signal_rows(*, cluster: str, notes: list[dict[str, Any]]) -> list[IdeaSignal]:
    if not notes:
        return []
    direction_type, axes, academic_value = _cluster_profile(cluster)
    evidence_confidence = _evidence_confidence(notes)
    anchors = notes[:3]
    paper_ids = uniq_keep_order(str(n.get("paper_id") or "").strip() for n in anchors)
    signals: list[IdeaSignal] = []
    for idx, axis in enumerate(axes[:3], start=1):
        anchor = anchors[(idx - 1) % len(anchors)]
        claim = _claim_text(anchor)
        tension = _limitation_text(anchor, axis, cluster)
        profile = _axis_profile(axis, cluster)
        missing_piece = clean_sentence(
            f"What is still missing is a direction that isolates whether {axis} is the real explanatory variable in {cluster.lower()}, rather than leaving it entangled with {profile['confound']}.",
            limit=240,
        )
        signals.append(IdeaSignal(
            signal_id=f"SIG-{slugify(cluster)[:16]}-{idx}",
            cluster=cluster,
            direction_type=direction_type,
            theme=f"{profile['title']} in {cluster}",
            claim_or_observation=claim,
            tension=clean_text(tension, limit=220),
            missing_piece=missing_piece,
            possible_axis=axis,
            academic_value=academic_value,
            evidence_confidence=evidence_confidence,
            paper_ids=paper_ids,
        ))
    return signals


def signal_table_markdown(rows: list[IdeaSignal]) -> str:
    table_rows = []
    for row in rows:
        table_rows.append([
            row.signal_id,
            row.cluster,
            row.theme,
            row.claim_or_observation,
            row.tension,
            row.missing_piece,
            row.possible_axis,
            row.academic_value,
            row.evidence_confidence,
            ", ".join(row.paper_ids),
        ])
    return "\n".join([
        "# IDEA_SIGNAL_TABLE",
        "",
        "## Research signals",
        "",
        markdown_table(["Signal ID", "Cluster", "Theme", "Claim / observation", "Tension", "Missing piece", "Possible axis", "Academic value", "Confidence", "Paper IDs"], table_rows),
        "",
    ])


def _title_from_signal(signal: IdeaSignal) -> str:
    profile = _axis_profile(signal.possible_axis, signal.cluster)
    return clean_sentence(profile["title"], limit=72)
def _best_fit(direction_type: str, axis: str) -> str:
    if direction_type == "governance":
        return "Best fit when the group wants a deployment-relevant reading of agent behavior rather than another internal benchmark story."
    if direction_type == "evaluation":
        return "Best fit when the discussion is really about what counts as convincing evidence, not just how to raise scores."
    if axis in {"search depth", "verification loop", "retrieval policy"}:
        return "Best fit for a PhD discussion that wants one sharper mechanism/confound question rather than a broad systems buildout."
    return "Best fit for PI/PhD discussions that want a thesis-worthy mechanism question rather than a generic benchmark wrapper."
def signals_to_direction_cards(signals: list[IdeaSignal], *, note_index: dict[str, dict[str, Any]], focus_clusters: list[str], pool_min: int, pool_max: int) -> list[DirectionCard]:
    focus = {x.strip() for x in focus_clusters if str(x).strip()}
    cards: list[DirectionCard] = []
    for idx, signal in enumerate(signals, start=1):
        profile = _axis_profile(signal.possible_axis, signal.cluster)
        anchors = _anchor_notes(note_index, signal.paper_ids)
        task_phrase = _task_phrase(anchors)
        metric_phrase = _metric_phrase(anchors)
        nearest = anchors[0] if anchors else {}
        nearest_title = clean_text((nearest.get("title") if isinstance(nearest, dict) else "") or (signal.paper_ids[0] if signal.paper_ids else signal.cluster), limit=90)
        nearest_task_phrase = _task_phrase([nearest]) if nearest else task_phrase
        literature_suggests = [_paper_annotation(note, signal.possible_axis, signal.cluster) for note in anchors[:2]]
        literature_suggests.append(_synthesis_annotation(anchors, signal.possible_axis, signal.cluster))
        closest_prior_gap = [
            clean_sentence(
                f"{nearest_title} is the closest prior anchor because it already reports concrete behavior on {nearest_task_phrase}. The unresolved point is whether those gains survive once {profile['confound']} is held fixed while {signal.possible_axis} is varied.",
                limit=220,
            ),
            clean_sentence(
                "The novelty test here is narrow, not rhetorical: if a strong anchor paper already runs that single-variable control, this direction should collapse quickly rather than stay alive as a vague confound story.",
                limit=220,
            ),
        ]
        one_line_thesis = clean_sentence(profile["thesis"], limit=230)
        why_interesting = clean_sentence(
            f"This is not just another benchmark wedge. It opens a {profile['program_kind']} line around {signal.possible_axis}: {profile['question']}.",
            limit=220,
        )
        missing_piece = clean_sentence(profile["missing_piece"], limit=230)
        possible_variants = [
            clean_sentence(f"Single-variable control on {task_phrase}: vary {signal.possible_axis} while holding {profile['confound']} fixed as far as the setup allows.", limit=210),
            clean_sentence("Replace score-only reporting with a failure-type comparison after the same intervention, so the result says more than whether the average went up.", limit=210),
            clean_sentence("Check whether the conclusion survives on one simple public task slice and one more tool- or environment-heavy setting before treating it as a general claim.", limit=210),
        ]
        first_probes = [
            clean_sentence(
                f"Intervention: vary {signal.possible_axis} while holding {profile['confound']} as fixed as possible on {task_phrase}. Readout: {metric_phrase}. Decisive if the interpretation changes even after the control.",
                limit=220,
            ),
            clean_sentence(
                f"Prior-work audit: inspect {nearest_title} for any ablation that already fixes {profile['confound']}, and if the conclusion survives, demote this direction.",
                limit=180,
            ),
        ]
        what_counts_as_insight = clean_sentence(profile['insight'], limit=190)
        kill_criteria = [
            clean_sentence(f"Kill quickly if {profile['kill_signal']}.", limit=170),
            clean_sentence(f"Kill if the first controlled probe leaves both {metric_phrase} and the failure taxonomy essentially unchanged.", limit=170),
        ]
        weakness_conditions = [
            clean_sentence(f"This direction is weaker if changing {signal.possible_axis} mostly rescales the aggregate metric without changing which failure modes appear or disappear.", limit=175),
            clean_sentence(profile['demotion'], limit=170),
        ]
        anchor_reading_notes: list[dict[str, str]] = []
        for note in anchors[:3]:
            title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
            result = _result_fact(note)
            anchor_reading_notes.append({
                "paper_title": title,
                "why_read": clean_sentence(f"Closest {signal.cluster.lower()} anchor with a concrete result hook on {task_phrase}: {result}", limit=180),
                "what_to_extract": clean_sentence(f"Extract the metric and comparator, then check whether {profile['reading_extract']}.", limit=180),
                "current_hook": clean_sentence(result, limit=180),
                "kill_signal": clean_sentence(f"Weaken this direction if the paper already shows {profile['kill_signal']}.", limit=180),
            })
        why_this_ranks_here = clean_sentence(profile['priority_note'], limit=170)
        cards.append(DirectionCard(
            direction_id=f"DIR-{idx:03d}",
            cluster=signal.cluster,
            direction_type=signal.direction_type,
            title=_title_from_signal(signal),
            focus_axis=signal.possible_axis,
            main_confound=profile['confound'],
            program_kind=profile['program_kind'],
            contribution_shape=profile['contribution_shape'],
            time_to_clarity=profile['time_to_clarity'],
            one_line_thesis=one_line_thesis,
            why_interesting=why_interesting,
            literature_suggests=literature_suggests,
            closest_prior_gap=closest_prior_gap,
            missing_piece=missing_piece,
            possible_variants=possible_variants,
            academic_value=profile['contribution_shape'],
            first_probes=first_probes,
            what_counts_as_insight=what_counts_as_insight,
            weakness_conditions=weakness_conditions,
            kill_criteria=kill_criteria,
            what_would_change_mind=kill_criteria,
            best_fit=_best_fit(signal.direction_type, signal.possible_axis),
            why_this_ranks_here=why_this_ranks_here,
            evidence_confidence=signal.evidence_confidence,
            paper_ids=signal.paper_ids,
            signal_ids=[signal.signal_id],
            anchor_reading_notes=anchor_reading_notes,
        ))
    # Focus clusters sort first. Direction IDs were assigned in input order above,
    # so they are not guaranteed to be sequential after this sort.
    cards.sort(key=lambda c: (0 if c.cluster in focus else 1, c.cluster, c.program_kind, c.title))
    # The slice only caps the pool at pool_max; it cannot pad a pool smaller than pool_min.
    target = max(pool_min, min(pool_max, len(cards)))
    return cards[:target]
def direction_pool_markdown(cards: list[DirectionCard]) -> str:
    rows: list[list[str]] = []
    for card in cards:
        rows.append([
            card.direction_id,
            card.cluster,
            card.direction_type,
            card.program_kind,
            card.title,
            clean_sentence(card.one_line_thesis, limit=100),
            clean_sentence(card.why_interesting, limit=100),
            clean_sentence(card.missing_piece, limit=100),
            clean_sentence(" / ".join(card.possible_variants), limit=120),
            clean_sentence(card.academic_value, limit=90),
            clean_sentence(" / ".join(card.first_probes), limit=120),
            card.evidence_confidence,
            ", ".join(card.paper_ids),
        ])
    return "\n".join([
        "# IDEA_DIRECTION_POOL",
        "",
        "## Candidate research directions",
        "",
        markdown_table(["Direction ID", "Cluster", "Type", "Program", "Title", "One-line thesis", "Why interesting", "Missing piece", "Possible variants", "Academic value", "First probes", "Confidence", "Paper IDs"], rows),
        "",
    ])


def _thesis_potential(direction_type: str, cluster: str) -> int:
    if direction_type in {"mechanism", "coordination", "adaptation"}:
        return 5
    if direction_type in {"systems", "governance"}:
        return 4
    if "benchmark" in cluster.lower() or direction_type == "evaluation":
        return 3
    return 4
def score_direction_cards(
    cards: list[DirectionCard],
    *,
    focus_clusters: list[str],
    keep_rank_max: int,
    maybe_rank_max: int,
    score_weights: dict[str, float] | None = None,
) -> list[ScreenedDirection]:
    focus = {x.strip() for x in focus_clusters if str(x).strip()}
    keep_rank_max = int(keep_rank_max)
    maybe_rank_max = int(maybe_rank_max)
    if keep_rank_max <= 0:
        raise ValueError("Invalid ideation scoring contract: keep_rank_max must be positive.")
    if maybe_rank_max < keep_rank_max:
        raise ValueError("Invalid ideation scoring contract: maybe_rank_max must be >= keep_rank_max.")
    # `score_weights` is effectively required: `None` or `{}` fails validation here.
    weights = _validate_score_weights(score_weights or {}, field_name="score_direction_cards.score_weights")
    cluster_counts: dict[str, int] = {}
    program_counts: dict[str, int] = {}
    for card in cards:
        cluster_counts[card.cluster] = cluster_counts.get(card.cluster, 0) + 1
        program_counts[card.program_kind] = program_counts.get(card.program_kind, 0) + 1
    provisional: list[tuple[DirectionCard, float, int, int, int, int, int, int]] = []
    for card in cards:
        discussion_worthiness = 5 if card.cluster in focus or card.time_to_clarity == "fast" else 4
        academic_value_score = 5 if any(token in card.contribution_shape.lower() for token in ["rule", "protocol", "regime map", "causal"]) else 4
        concrete_hooks = sum(1 for item in card.literature_suggests if _sentence_has_concrete_hook(item))
        evidence_grounding = (
            5 if len(card.paper_ids) >= 3 and concrete_hooks >= 2
            else 4 if len(card.paper_ids) >= 2 and concrete_hooks >= 1
            else 3
        )
        if program_counts.get(card.program_kind, 0) == 1 and cluster_counts.get(card.cluster, 0) == 1:
            direction_distinctness = 5
        elif program_counts.get(card.program_kind, 0) == 1 or cluster_counts.get(card.cluster, 0) == 1:
            direction_distinctness = 4
        else:
            direction_distinctness = 3
        probe_blob = " ".join(card.first_probes).lower()
        first_probe_clarity = (
            5 if all(token in probe_blob for token in ["intervention:", "readout:", "decisive if"])
            else 4 if "intervention:" in probe_blob
            else 3
        )
        thesis_potential = _thesis_potential(card.direction_type, card.cluster)
        total = round(
            discussion_worthiness * weights["discussion_worthiness"]
            + academic_value_score * weights["academic_value"]
            + evidence_grounding * weights["evidence_grounding"]
            + direction_distinctness * weights["direction_distinctness"]
            + first_probe_clarity * weights["first_probe_clarity"]
            + thesis_potential * weights["thesis_potential"],
            2,
        )
        provisional.append((card, total, discussion_worthiness, academic_value_score, evidence_grounding, direction_distinctness, first_probe_clarity, thesis_potential))
    # Rank by score (descending); ties go to fast-to-clarity cards, then cluster and title.
    provisional.sort(key=lambda row: (-row[1], 0 if row[0].time_to_clarity == "fast" else 1, row[0].cluster, row[0].title))
    rows: list[ScreenedDirection] = []
    for idx, (card, total, dw, av, eg, dd, fp, tp) in enumerate(provisional, start=1):
        recommendation = "keep" if idx <= keep_rank_max else ("maybe" if idx <= maybe_rank_max else "drop")
        strengths: list[str] = []
        if eg >= 5:
            strengths.append("concrete anchor evidence")
        if dd >= 5:
            strengths.append(f"distinct {card.program_kind} wedge")
        if fp >= 5:
            strengths.append("clean first probe")
        if av >= 5:
            strengths.append("thesis-sized payoff")
        if not strengths:
            strengths.append("discussion value")
        risk = "still abstract-first" if str(card.evidence_confidence).startswith("low") else f"main risk is controlling {card.main_confound} cleanly"
        rationale = clean_sentence(f"Strongest on {' + '.join(strengths[:2])}; {risk}.", limit=160)
        rows.append(ScreenedDirection(
            direction_id=card.direction_id,
            cluster=card.cluster,
            direction_type=card.direction_type,
            title=card.title,
            total_score=total,
            discussion_worthiness=dw,
            academic_value_score=av,
            evidence_grounding=eg,
            direction_distinctness=dd,
            first_probe_clarity=fp,
thesis_potential=tp,
recommendation=recommendation,
rationale=rationale,
))
return rows
def screening_table_markdown(rows: list[ScreenedDirection]) -> str:
data: list[list[str]] = []
for row in rows:
data.append([
row.direction_id,
row.cluster,
row.direction_type,
row.title,
f"{row.total_score:.2f}",
str(row.discussion_worthiness),
str(row.academic_value_score),
str(row.evidence_grounding),
str(row.direction_distinctness),
str(row.first_probe_clarity),
str(row.thesis_potential),
row.recommendation,
row.rationale,
])
return "\n".join([
"# IDEA_SCREENING_TABLE",
"",
"## Discussion-first screening",
"",
markdown_table(["Direction ID", "Cluster", "Type", "Title", "Total", "Discussion", "Academic value", "Evidence", "Distinctness", "First probe", "Thesis potential", "Decision", "Rationale"], data),
"",
])
def shortlist_markdown(records: list[dict[str, Any]]) -> str:
lines = ["# IDEA_SHORTLIST", "", "## Prioritized directions", ""]
for record in records:
lines.extend([
f"### Direction {record['rank']}. {record['title']}",
f"- Cluster: {record['cluster']}",
f"- Type: {record['direction_type']}",
f"- Focus axis: {record.get('focus_axis')}",
f"- Program kind: {record.get('program_kind')}",
f"- Main confound: {record.get('main_confound')}",
f"- Time to clarity: {record.get('time_to_clarity')}",
f"- One-line thesis: {record['one_line_thesis']}",
f"- Why this is interesting: {record['why_interesting']}",
f"- Why this ranks here: {record['why_this_ranks_here']}",
f"- Contribution shape: {record.get('contribution_shape')}",
"- What the literature already suggests:",
])
for item in record.get("literature_suggests") or []:
lines.append(f" - {item}")
lines.extend([
"- Closest prior work and why it does not settle the question:",
])
for item in record.get("closest_prior_gap") or []:
lines.append(f" - {item}")
lines.extend([
f"- What is still missing: {record['missing_piece']}",
"- Possible variants:",
])
for item in record.get("possible_variants") or []:
lines.append(f" - {item}")
lines.extend([
f"- Why this could matter academically: {record['academic_value']}",
"- First probes:",
])
for item in record.get("first_probes") or []:
lines.append(f" - {item}")
lines.extend([
f"- What would count as actual insight: {record['what_counts_as_insight']}",
"- What would make this weak or unconvincing:",
])
for item in record.get("weakness_conditions") or []:
lines.append(f" - {item}")
lines.extend([
"- Quick kill criteria:",
])
for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
lines.append(f" - {item}")
lines.extend([
f"- Best fit: {record['best_fit']}",
f"- Evidence confidence: {record['evidence_confidence']}",
"- Anchor papers: " + ", ".join(f"`{pid}`" for pid in record.get("paper_ids") or []),
f"- Why prioritized now: {record['why_prioritized']}",
"",
])
return "\n".join(lines)
def shortlist_snapshot_table(records: list[dict[str, Any]]) -> str:
    rows = []
    for rec in records:
        rows.append([
            str(rec.get("rank") or ""),
            rec.get("title", ""),
            clean_sentence(rec.get("why_this_ranks_here", ""), limit=52),
            clean_sentence(rec.get("contribution_shape", rec.get("academic_value", "")), limit=56),
            clean_sentence((rec.get("kill_criteria") or rec.get("what_would_change_mind") or [""])[0], limit=54),
        ])
    return markdown_table(["Rank", "Direction", "Why now", "If it survives", "Fast kill signal"], rows)
def build_report_payload(*, topic: str, shortlist: list[dict[str, Any]], deferred: list[dict[str, Any]], trace_paths: dict[str, str]) -> dict[str, Any]:
top = list(shortlist)
takeaways: list[str] = []
if top:
takeaways.append("The strongest directions are the ones most likely to change how existing results are interpreted, not just add another benchmark win.")
takeaways.append("The current rank order reflects a tradeoff between time-to-clarity, thesis-sized payoff, and how concrete the nearest prior-work gap already looks.")
program_kinds = uniq_keep_order(str(rec.get("program_kind") or "").strip() for rec in top)
if len(program_kinds) >= 2:
takeaways.append(f"The lead set is intentionally not one confound template repeated three times: it spans {', '.join(program_kinds[:3])} rather than a single explanatory mold.")
else:
takeaways.append("The lead set still sits in one tight neighborhood, so the next reading pass should stress-test whether those directions are genuinely distinct thesis lines.")
if any(str(rec.get("evidence_confidence") or "").startswith("low") for rec in top):
takeaways.append("Most anchors are still abstract-first, so the memo is best used to decide the next reading and falsification pass rather than to lock a project immediately.")
else:
takeaways.append("The evidence is already concrete enough for a serious PI/PhD discussion, though one deeper paper pass could still reshuffle the exact order.")
lead = top[0] if top else {}
lead_anchor = (lead.get("anchor_reading_notes") or [{}])[0] if top else {}
lead_anchor_title = lead_anchor.get("paper_title") if isinstance(lead_anchor, dict) else "the first anchor paper"
discussion_questions = [
"Which direction still survives once we ask for a single-variable control rather than a suggestive confound story?",
"Which lead direction would still matter if the nearest prior work already addressed the obvious control we are worried about?",
"Which candidate could plausibly turn into a thesis line with a reusable method, protocol, or regime map rather than a one-off empirical note?",
"Which ranking would change most after one full-paper reading pass on the anchor set?",
]
uncertainties = [
"The main remaining risk is novelty risk, not idea scarcity: closer reading may reveal that one lead direction is already settled by an existing control or ablation.",
"Evidence confidence is still bounded by abstract-first notes for part of the lead set, so paper-specific details could either strengthen or kill a direction quickly.",
"The memo is more reliable as a discussion and triage artifact than as a final commitment on exact project order.",
]
next_steps = []
if top:
next_steps.append(clean_sentence(f"Start with {lead.get('title')}: read {lead_anchor_title or 'the first anchor paper'} looking specifically for whether {lead.get('focus_axis')} is already isolated against {lead.get('main_confound')}.", limit=220))
first_probe = (lead.get("first_probes") or [""])[0]
if first_probe:
next_steps.append(clean_sentence(first_probe, limit=220))
next_steps.append(clean_sentence(f"Re-rank only after checking the quick kill criteria for {lead.get('title')} and at least one competing direction in the lead set.", limit=220))
else:
next_steps.extend([
"Read the anchor papers for the current top direction in full and test whether the memo's stated hidden variable still looks unresolved.",
"In the next PI/PhD discussion, use the ranking arguments and kill criteria to decide whether any direction deserves promotion into a sharper thesis candidate.",
])
return {
"topic": topic,
"takeaways": takeaways,
"top_directions": top,
"deferred_directions": deferred,
"discussion_questions": discussion_questions,
"uncertainties": uncertainties,
"next_steps": next_steps,
"trace_artifacts": trace_paths,
}
def report_markdown(payload: dict[str, Any]) -> str:
top_dirs = payload.get("top_directions") or []
deferred_idx = 3 + len(top_dirs)
discussion_idx = deferred_idx + 1
uncertainty_idx = deferred_idx + 2
next_idx = deferred_idx + 3
appendix_idx = deferred_idx + 4
lines = [
"# Research Idea Brainstorm Memo",
"",
"## 0. Scope and framing",
"",
f"- Topic: {payload.get('topic') or 'research ideas'}",
"- Intended readers: PI / PhD",
"- Goal: surface a small number of discussion-worthy research directions rather than force a final project choice.",
"- Current evidence basis: abstract-first notes unless otherwise stated.",
"- What this memo is not: not a final project spec, not a survey draft, and not a symmetric top-3 proposal pack.",
"",
"## 1. Big-picture takeaways",
"",
]
for item in payload.get("takeaways") or []:
lines.append(f"- {item}")
lines.extend([
"",
"## 2. Top directions at a glance",
"",
shortlist_snapshot_table(payload.get("top_directions") or []),
"",
])
for idx, record in enumerate(top_dirs, start=1):
lines.extend([
f"## {idx + 2}. Direction {idx} — {record.get('title')}",
"",
"### One-line thesis",
f"- {record.get('one_line_thesis')}",
"",
"### Why it belongs in the lead set",
f"- {record.get('why_this_ranks_here')}",
f"- Time to clarity: {record.get('time_to_clarity')}",
"",
"### What the current literature actually shows",
])
for item in record.get("literature_suggests") or []:
lines.append(f"- {item}")
lines.extend([
"",
"### Closest prior work and remaining gap",
])
for item in record.get("closest_prior_gap") or []:
lines.append(f"- {item}")
lines.extend([
f"- Missing piece: {record.get('missing_piece')}",
"",
"### If this direction is right, what contribution emerges",
f"- {record.get('contribution_shape') or record.get('academic_value')}",
f"- {record.get('what_counts_as_insight')}",
"",
"### Smallest decisive probe",
])
for item in record.get("first_probes") or []:
lines.append(f"- {item}")
lines.extend([
"",
"### Quick kill criteria",
])
for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
lines.append(f"- {item}")
lines.extend(["",])
lines.extend([
f"## {deferred_idx}. Other promising but not prioritized directions",
"",
])
deferred = payload.get("deferred_directions") or []
if deferred:
for record in deferred:
lines.append(f"- **{record.get('title')}** — {record.get('one_line_thesis')} (why not prioritized now: {record.get('why_not_prioritized')})")
else:
lines.append("- No additional deferred directions were retained in the current memo.")
lines.extend([
"",
f"## {discussion_idx}. Cross-cutting discussion questions",
"",
])
for item in payload.get("discussion_questions") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {uncertainty_idx}. Uncertainty and disagreement",
"",
])
for item in payload.get("uncertainties") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {next_idx}. Suggested next reading / next discussion step",
"",
])
for item in payload.get("next_steps") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {appendix_idx}. Appendix guide",
"",
"- `output/APPENDIX.md` for anchor-paper reading notes and deferred directions.",
"- `output/REPORT.json` for the structured version of this memo.",
])
return "\n".join(lines)
def appendix_markdown(payload: dict[str, Any], *, core_titles: dict[str, str]) -> str:
lines = [
"# Appendix to the Research Idea Brainstorm Memo",
"",
"## A. Deferred but still promising directions",
"",
]
deferred = payload.get("deferred_directions") or []
if deferred:
for rec in deferred:
lines.extend([
f"### {rec.get('title')}",
f"- One-line thesis: {rec.get('one_line_thesis')}",
f"- Program kind: {rec.get('program_kind')}",
f"- Why interesting: {rec.get('why_interesting')}",
f"- Why not prioritized now: {rec.get('why_not_prioritized')}",
"",
])
else:
lines.append("- (none)")
lines.extend([
"## B. Anchor papers and what to extract",
"",
])
for rec in payload.get("top_directions") or []:
lines.append(f"### {rec.get('title')}")
lines.append(f"- Program kind: {rec.get('program_kind')}")
lines.append(f"- Lead-set reason: {rec.get('why_this_ranks_here')}")
notes = rec.get("anchor_reading_notes") or []
if notes:
rows: list[list[str]] = []
for item in notes:
if not isinstance(item, dict):
continue
rows.append([
item.get("paper_title", "paper"),
item.get("why_read", ""),
item.get("what_to_extract", ""),
item.get("current_hook", ""),
item.get("kill_signal", ""),
])
if rows:
lines.append(markdown_table(["Anchor paper", "Why read now", "What to extract", "Current evidence hook", "Kill signal"], rows))
else:
for pid in rec.get("paper_ids") or []:
title = core_titles.get(pid, pid)
lines.append(f"- {title}")
else:
for pid in rec.get("paper_ids") or []:
title = core_titles.get(pid, pid)
lines.append(f"- {title}")
lines.append("")
return "\n".join(lines)
FILE:tooling/pipeline_spec.py
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import Any
import yaml
@dataclass(frozen=True)
class PipelineStage:
    id: str
    title: str
    checkpoint: str
    mode: str
    required_skills: tuple[str, ...]
    optional_skills: tuple[str, ...]
    produces: tuple[str, ...]
    human_checkpoint: dict[str, Any]
    @staticmethod
    def from_mapping(stage_id: str, data: Any, *, path: Path) -> "PipelineStage":
        if not isinstance(data, dict):
            raise ValueError(f"`stages.{stage_id}` must be a mapping in {path}")
        return PipelineStage(
            id=str(stage_id or "").strip(),
            title=str(data.get("title") or stage_id).strip() or str(stage_id),
            checkpoint=str(data.get("checkpoint") or stage_id).strip() or str(stage_id),
            mode=str(data.get("mode") or "").strip(),
            required_skills=_string_tuple(data.get("required_skills"), field_name=f"stages.{stage_id}.required_skills", path=path),
            optional_skills=_string_tuple(data.get("optional_skills"), field_name=f"stages.{stage_id}.optional_skills", path=path),
            produces=_string_tuple(data.get("produces"), field_name=f"stages.{stage_id}.produces", path=path),
            human_checkpoint=_mapping(data.get("human_checkpoint"), field_name=f"stages.{stage_id}.human_checkpoint", path=path, required=False),
        )
@dataclass(frozen=True)
class PipelineSpec:
    path: Path
    name: str
    version: str
    units_template: str
    default_checkpoints: tuple[str, ...]
    profile: str
    routing_hints: tuple[str, ...]
    routing_default: bool
    routing_priority: int
    target_artifacts: tuple[str, ...]
    contract_model: str
    structure_mode: str
    pre_retrieval_shell: dict[str, Any]
    binding_layers: tuple[str, ...]
    core_chapter_h3_target: int
    query_defaults: dict[str, Any]
    overridable_query_fields: tuple[str, ...]
    quality_contract: dict[str, Any]
    loop_policy: dict[str, Any]
    stages: dict[str, PipelineStage]
    variant_of: str
    variant_overrides: dict[str, Any]
    @staticmethod
    def load(path: Path) -> "PipelineSpec":
        resolved = path.resolve()
        raw_frontmatter, frontmatter = _load_variant_aware_frontmatter(resolved)
        name = str(frontmatter.get("name") or path.stem.replace(".pipeline", ""))
        units_template = str(frontmatter.get("units_template") or "")
        version = str(frontmatter.get("version") or "")
        default_checkpoints = _string_tuple(frontmatter.get("default_checkpoints"), field_name="default_checkpoints", path=resolved)
        profile = str(frontmatter.get("profile") or "default").strip() or "default"
        routing_hints = _string_tuple(frontmatter.get("routing_hints"), field_name="routing_hints", path=resolved)
        routing_default = bool(frontmatter.get("routing_default"))
        try:
            routing_priority = int(frontmatter.get("routing_priority") or 0)
        except Exception:
            routing_priority = 0
        target_artifacts = _string_tuple(frontmatter.get("target_artifacts"), field_name="target_artifacts", path=resolved)
        contract_model = str(frontmatter.get("contract_model") or "").strip()
        structure_mode = str(frontmatter.get("structure_mode") or "").strip()
        pre_retrieval_shell = _mapping(frontmatter.get("pre_retrieval_shell"), field_name="pre_retrieval_shell", path=resolved, required=False)
        binding_layers = _string_tuple(frontmatter.get("binding_layers"), field_name="binding_layers", path=resolved)
        try:
            core_chapter_h3_target = int(frontmatter.get("core_chapter_h3_target") or 0)
        except Exception:
            core_chapter_h3_target = 0
        query_defaults = _mapping(frontmatter.get("query_defaults"), field_name="query_defaults", path=resolved, required=False)
        overridable_query_fields = _string_tuple(
            frontmatter.get("overridable_query_fields"),
            field_name="overridable_query_fields",
            path=resolved,
        )
        quality_contract = _mapping(frontmatter.get("quality_contract"), field_name="quality_contract", path=resolved, required=False)
        loop_policy = _mapping(frontmatter.get("loop_policy"), field_name="loop_policy", path=resolved, required=False)
        stages = _parse_stages(frontmatter.get("stages"), path=resolved)
        variant_of = str(raw_frontmatter.get("variant_of") or "").strip()
        variant_overrides = _mapping(raw_frontmatter.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False)
        if not units_template:
            raise ValueError(f"Missing units_template in pipeline front matter: {resolved}")
        return PipelineSpec(
            path=resolved,
            name=name,
            version=version,
            units_template=units_template,
            default_checkpoints=default_checkpoints,
            profile=profile,
            routing_hints=routing_hints,
            routing_default=routing_default,
            routing_priority=routing_priority,
            target_artifacts=target_artifacts,
            contract_model=contract_model,
            structure_mode=structure_mode,
            pre_retrieval_shell=pre_retrieval_shell,
            binding_layers=binding_layers,
            core_chapter_h3_target=core_chapter_h3_target,
            query_defaults=query_defaults,
            overridable_query_fields=overridable_query_fields,
            quality_contract=quality_contract,
            loop_policy=loop_policy,
            stages=stages,
            variant_of=variant_of,
            variant_overrides=variant_overrides,
        )
    def query_default(self, key: str, default: Any = None) -> Any:
        return self.query_defaults.get(str(key or "").strip(), default)
    def allows_query_override(self, key: str) -> bool:
        return str(key or "").strip() in set(self.overridable_query_fields)
def _parse_frontmatter(text: str) -> dict[str, Any]:
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("Pipeline file must start with YAML front matter '---'")
    end_idx = None
    for idx in range(1, len(lines)):
        if lines[idx].strip() == "---":
            end_idx = idx
            break
    if end_idx is None:
        raise ValueError("Unterminated YAML front matter (missing closing '---')")
    raw = "\n".join(lines[1:end_idx])
    data = yaml.safe_load(raw) or {}
    if not isinstance(data, dict):
        raise ValueError("Pipeline YAML front matter must be a mapping")
    return data
def _load_variant_aware_frontmatter(path: Path, seen: set[Path] | None = None) -> tuple[dict[str, Any], dict[str, Any]]:
    resolved = path.resolve()
    active = set(seen or set())
    if resolved in active:
        chain = " -> ".join(str(p) for p in [*active, resolved])
        raise ValueError(f"Cyclic `variant_of` chain detected: {chain}")
    active.add(resolved)
    text = resolved.read_text(encoding="utf-8")
    raw = _parse_frontmatter(text)
    variant_of = str(raw.get("variant_of") or "").strip()
    if not variant_of:
        return raw, dict(raw)
    _validate_raw_variant_frontmatter(raw, path=resolved)
    base_path = _resolve_pipeline_reference(resolved, variant_of)
    _, base_effective = _load_variant_aware_frontmatter(base_path, seen=active)
    current = {key: raw[key] for key in ("name", "version") if key in raw}
    merged = _deep_merge(base_effective, current)
    merged = _deep_merge(
        merged,
        _mapping(raw.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False),
    )
    merged["variant_of"] = variant_of
    merged["variant_overrides"] = raw.get("variant_overrides") or {}
    return raw, merged
def _validate_raw_variant_frontmatter(raw: dict[str, Any], *, path: Path) -> None:
    allowed_raw_keys = {"name", "version", "variant_of", "variant_overrides"}
    extra_keys = sorted(str(key) for key in raw.keys() if str(key) not in allowed_raw_keys)
    if extra_keys:
        raise ValueError(
            "Variant pipeline files may only keep top-level keys "
            "`name`, `version`, `variant_of`, `variant_overrides`; "
            f"move the rest under `variant_overrides` in {path}: {', '.join(extra_keys)}"
        )
def _resolve_pipeline_reference(path: Path, ref: str) -> Path:
    value = str(ref or "").strip()
    if not value:
        raise ValueError(f"Empty `variant_of` reference in {path}")
    candidate = Path(value)
    if candidate.is_absolute() and candidate.exists():
        return candidate.resolve()
    repo_root = path.parent.parent
    for base in (path.parent, repo_root, repo_root / "pipelines"):
        direct = (base / value).resolve()
        if direct.exists():
            return direct
    stem = Path(value).name
    if stem.endswith(".pipeline.md"):
        stem = stem[: -len(".pipeline.md")]
    if stem:
        for base in (path.parent, repo_root / "pipelines"):
            direct = (base / f"{stem}.pipeline.md").resolve()
            if direct.exists():
                return direct
    raise ValueError(f"Could not resolve `variant_of: {value}` from {path}")
def _deep_merge(base: Any, override: Any) -> Any:
    if isinstance(base, list) and isinstance(override, dict) and any(str(key).startswith("__") for key in override.keys()):
        return _apply_list_patch(base, override)
    if isinstance(base, dict) and isinstance(override, dict):
        merged: dict[str, Any] = {str(k): v for k, v in base.items()}
        for key, value in override.items():
            if key in merged:
                merged[key] = _deep_merge(merged[key], value)
            else:
                merged[key] = value
        return merged
    return override
def _apply_list_patch(base: list[Any], override: dict[str, Any]) -> list[Any]:
    allowed_keys = {"__append__", "__prepend__", "__remove__", "__replace__"}
    bad_keys = sorted(str(key) for key in override.keys() if str(key) not in allowed_keys)
    if bad_keys:
        raise ValueError(
            "List patch overrides only support "
            "`__append__`, `__prepend__`, `__remove__`, `__replace__`; "
            f"got: {', '.join(bad_keys)}"
        )
    if "__replace__" in override:
        replacement = override.get("__replace__")
        if not isinstance(replacement, list):
            raise ValueError("List patch `__replace__` must be a YAML list")
        return list(replacement)
    current = list(base)
    remove_values = override.get("__remove__", [])
    prepend_values = override.get("__prepend__", [])
    append_values = override.get("__append__", [])
    for field_name, value in (
        ("__remove__", remove_values),
        ("__prepend__", prepend_values),
        ("__append__", append_values),
    ):
        if not isinstance(value, list):
            raise ValueError(f"List patch `{field_name}` must be a YAML list")
    if remove_values:
        current = [item for item in current if item not in remove_values]
    if prepend_values:
        current = list(prepend_values) + current
    if append_values:
        current = current + list(append_values)
    return current
def _mapping(value: Any, *, field_name: str, path: Path, required: bool) -> dict[str, Any]:
    if value is None:
        return {}
    if not isinstance(value, dict):
        raise ValueError(f"`{field_name}` must be a mapping in {path}")
    return {str(key): item for key, item in value.items()}
def _string_tuple(value: Any, *, field_name: str, path: Path) -> tuple[str, ...]:
    if value is None:
        return ()
    if not isinstance(value, list):
        raise ValueError(f"`{field_name}` must be a YAML list in {path}")
    out: list[str] = []
    for item in value:
        text = str(item or "").strip()
        if text:
            out.append(text)
    return tuple(out)
def _parse_stages(value: Any, *, path: Path) -> dict[str, PipelineStage]:
    if value is None:
        return {}
    items: list[tuple[str, Any]] = []
    if isinstance(value, dict):
        items = [(str(stage_id or "").strip(), data) for stage_id, data in value.items()]
    elif isinstance(value, list):
        for idx, item in enumerate(value):
            if not isinstance(item, dict):
                raise ValueError(f"`stages[{idx}]` must be a mapping in {path}")
            stage_id = str(item.get("id") or item.get("stage") or "").strip()
            if not stage_id:
                raise ValueError(f"`stages[{idx}]` is missing `id` in {path}")
            items.append((stage_id, item))
    else:
        raise ValueError(f"`stages` must be a mapping or list in {path}")
    stages: dict[str, PipelineStage] = {}
    for stage_id, data in items:
        stage = PipelineStage.from_mapping(stage_id, data, path=path)
        if not stage.id:
            raise ValueError(f"`stages` contains an empty stage id in {path}")
        stages[stage.id] = stage
    return stages
FILE:tooling/pipeline_text.py
from __future__ import annotations
import json
import re
from pathlib import Path
from typing import Any
from tooling.common import load_yaml, read_jsonl
def slug_unit_id(unit_id: str) -> str:
    raw = str(unit_id or '').strip()
    out: list[str] = []
    for ch in raw:
        out.append(ch if ch.isalnum() else '_')
    safe = ''.join(out).strip('_')
    return f'S{safe}' if safe else 'S'
def load_outline_sections(path: Path) -> list[dict[str, Any]]:
    outline = load_yaml(path) if path.exists() else []
    if not isinstance(outline, list):
        return []
    out: list[dict[str, Any]] = []
    for sec in outline:
        if not isinstance(sec, dict):
            continue
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        subsections: list[dict[str, str]] = []
        for sub in sec.get('subsections') or []:
            if not isinstance(sub, dict):
                continue
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                subsections.append({'id': sub_id, 'title': sub_title})
        if sec_id and sec_title:
            out.append({'id': sec_id, 'title': sec_title, 'subsections': subsections})
    return out
def iter_h3_units(path: Path) -> list[dict[str, str]]:
    units: list[dict[str, str]] = []
    for sec in load_outline_sections(path):
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        for sub in sec.get('subsections') or []:
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                units.append(
                    {
                        'section_id': sec_id,
                        'section_title': sec_title,
                        'sub_id': sub_id,
                        'title': sub_title,
                    }
                )
    return units
def read_jsonl_map(path: Path, key: str) -> dict[str, dict[str, Any]]:
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        val = str(rec.get(key) or '').strip()
        if val:
            out[val] = rec
    return out
def read_bib_keys(path: Path) -> set[str]:
    if not path.exists() or path.stat().st_size <= 0:
        return set()
    text = path.read_text(encoding='utf-8', errors='ignore')
    return set(re.findall(r'(?im)^@\w+\s*\{\s*([^,\s]+)\s*,', text))
def uniq_keep_order(items: list[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or '').strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out
def clean_excerpt(text: str, *, limit: int = 220) -> str:
    s = str(text or '').strip()
    s = s.replace('\n', ' ')
    s = re.sub(r'\s+', ' ', s)
    s = s.strip(' "\'`')
    s = s.replace('|', ', ')
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(' ', 1)[0].strip()
    return clipped if clipped else s[:limit].strip()
def citation_list(keys: list[str], *, max_keys: int = 3) -> str:
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'[@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'
def inline_evidence_phrase(keys: list[str], *, max_keys: int = 3) -> str:
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'in [@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'
def heading_blocks(md: str) -> list[tuple[str, str]]:
    blocks: list[tuple[str, str]] = []
    current = ''
    lines: list[str] = []
    for raw in (md or '').splitlines():
        if raw.startswith('### '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = raw[4:].strip()
            lines = []
            continue
        if raw.startswith('## '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = ''
            lines = []
            continue
        if current:
            lines.append(raw)
    if current:
        blocks.append((current, '\n'.join(lines).strip()))
    return blocks
def dump_jsonl_lines(records: list[dict[str, Any]]) -> str:
    return '\n'.join(json.dumps(r, ensure_ascii=False) for r in records).rstrip() + ('\n' if records else '')
Add bias/risk-of-bias assessment fields to an extraction table and populate them consistently. **Trigger**: bias, risk-of-bias, RoB, evidence quality, 偏倚评估,...
---
name: bias-assessor
description: 'Add bias/risk-of-bias assessment fields to an extraction table and populate them consistently.
**Trigger**: bias, risk-of-bias, RoB, evidence quality, 偏倚评估, 证据质量.
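**Use when**: a systematic review has already produced `papers/extraction_table.csv` and the bias/quality fields need to be filled in before synthesis.
**Skip if**: this is not a systematic review, or `papers/extraction_table.csv` does not exist yet.
**Network**: none.
**Guardrail**: use a simple, auditable scale (low/unclear/high) plus short notes; keep fields consistent.'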
version: 0.1.0
metadata:
  openclaw:
    requires:
      anyBins:
        - python3
        - python
---
# Bias Assessor (risk-of-bias, lightweight)
Goal: make evidence quality explicit in a way that is quick, consistent, and auditable.
## Inputs
- `papers/extraction_table.csv`
## Outputs
- Updated `papers/extraction_table.csv`
## Recommended fields
Use a simple 3-level scale (all lowercase): `low | unclear | high`.
Suggested columns to add (if missing):
- `rob_selection`
- `rob_measurement`
- `rob_confounding`
- `rob_reporting`
- `rob_overall`
- `rob_notes`
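A minimal sketch of adding the missing columns without touching existing data. It uses only the standard library and assumes a well-formed CSV with a header row; the function name `add_rob_columns` and the `paper_id` / `title` columns in the usage lines are hypothetical, not taken from the real extraction table.

```python
import csv
import io

# Column names as recommended above.
ROB_COLUMNS = [
    "rob_selection",
    "rob_measurement",
    "rob_confounding",
    "rob_reporting",
    "rob_overall",
    "rob_notes",
]


def add_rob_columns(csv_text: str) -> str:
    """Return csv_text with any missing RoB columns appended as empty cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or [])
    missing = [col for col in ROB_COLUMNS if col not in fieldnames]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames + missing, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for col in missing:
            row[col] = ""  # left blank so the assessor fills it in explicitly
        writer.writerow(row)
    return out.getvalue()


# Hypothetical two-column table used only for illustration.
sample = "paper_id,title\np1,Alpha\n"
updated = add_rob_columns(sample)
```

Running the function a second time on its own output leaves the table unchanged, which keeps column names stable once introduced.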
## Workflow
1. Read `papers/extraction_table.csv` and identify the set of included studies.
2. If RoB columns are missing, add them (keep names stable once introduced).
3. For each study, fill each RoB domain:
   - `low`: design/reporting plausibly controls the bias
   - `unclear`: not enough information to judge
   - `high`: clear risk (e.g., missing controls, ambiguous measurement, selective reporting)
4. Set `rob_overall` conservatively:
   - `high` if any domain is `high`
   - `unclear` if no `high` but at least one `unclear`
   - `low` only if all domains are `low`
5. Add 1–3 short notes in `rob_notes` that justify the rating.
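The conservative rule in step 4 can be sketched as a small function. This is an illustration rather than part of the skill: the name `overall_rob` is hypothetical, and it assumes each row is a dict keyed by the recommended domain columns.

```python
ROB_DOMAINS = ("rob_selection", "rob_measurement", "rob_confounding", "rob_reporting")
ROB_SCALE = ("low", "unclear", "high")


def overall_rob(row: dict) -> str:
    """Derive rob_overall from the four domain ratings; the most severe rating wins."""
    ratings = []
    for domain in ROB_DOMAINS:
        value = str(row.get(domain) or "").strip()
        if value not in ROB_SCALE:
            # Reject scale drift (e.g. "High", "moderate", blank) instead of guessing.
            raise ValueError(f"{domain} must be one of low|unclear|high, got {value!r}")
        ratings.append(value)
    if "high" in ratings:
        return "high"
    if "unclear" in ratings:
        return "unclear"
    return "low"


example = {
    "rob_selection": "low",
    "rob_measurement": "unclear",
    "rob_confounding": "low",
    "rob_reporting": "low",
}
example_overall = overall_rob(example)
```

Raising on an out-of-scale value, rather than silently mapping it, keeps the table auditable: a reviewer sees the drift instead of a guessed rating.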
## Definition of Done
- [ ] Every included paper row has all RoB columns filled.
- [ ] Values are strictly from `low|unclear|high` (no free-form scale drift).
- [ ] Notes are short and specific (what was missing / what was strong).
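The first two checklist items lend themselves to a mechanical check. The sketch below is a hypothetical helper (`rob_violations` is not part of the skill) that assumes rows have already been read into dicts. It only verifies that ratings come from the scale and that `rob_notes` is non-empty; whether a note is short and specific still needs a human read.

```python
RATING_COLUMNS = (
    "rob_selection",
    "rob_measurement",
    "rob_confounding",
    "rob_reporting",
    "rob_overall",
)
ALLOWED = {"low", "unclear", "high"}


def rob_violations(rows: list) -> list:
    """Return one human-readable message per Definition-of-Done violation."""
    problems = []
    for idx, row in enumerate(rows, start=1):
        for col in RATING_COLUMNS:
            value = str(row.get(col) or "").strip()
            if value not in ALLOWED:
                problems.append(f"row {idx}: {col}={value!r} is not one of low|unclear|high")
        if not str(row.get("rob_notes") or "").strip():
            problems.append(f"row {idx}: rob_notes is empty")
    return problems


# Illustrative rows; the note text is made up for the example.
good_row = {
    "rob_selection": "low",
    "rob_measurement": "low",
    "rob_confounding": "unclear",
    "rob_reporting": "low",
    "rob_overall": "unclear",
    "rob_notes": "no details on confounder adjustment",
}
bad_row = dict(good_row, rob_reporting="High", rob_notes=" ")
report = rob_violations([good_row, bad_row])
```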
## Troubleshooting
### Issue: the table has mixed or inconsistent RoB column names
**Fix**:
- Normalize to the recommended column names and keep a single set across all rows.
### Issue: the paper lacks enough methodological detail
**Fix**:
- Prefer `unclear` with a concrete note (“no details on X”) rather than guessing.
FILE:AGENTS.md
# AGENTS.md
This exported skill uses `AGENTS.md` only as a local repo-root marker for bundled helper scripts.
Source skill: `bias-assessor` from `research-units-pipeline-skills`.
Export slug: `bias-assessor`.
FILE:pipelines/arxiv-survey-latex.pipeline.md
---
name: arxiv-survey-latex
version: 3.8
variant_of: arxiv-survey
variant_overrides:
  routing_hints: [latex, pdf, tex, 可编译, 编译]
  routing_default: false
  routing_priority: 20
  units_template: templates/UNITS.arxiv-survey-latex.csv
  target_artifacts:
    __append__:
      - latex/main.tex
      - latex/main.pdf
      - output/LATEX_BUILD_REPORT.md
  stages:
    C5:
      title: Draft + PDF
      required_skills:
        __append__:
          - latex-scaffold
          - latex-compile-qa
      optional_skills:
        __remove__:
          - latex-scaffold
          - latex-compile-qa
      produces:
        __append__:
          - latex/main.tex
          - latex/main.pdf
          - output/LATEX_BUILD_REPORT.md
---
# Pipeline: arXiv survey / review (MD-first + LaTeX/PDF)
Variant of `arxiv-survey`.
Use this pipeline only when the default deliverable must include:
- `latex/main.tex`
- `latex/main.pdf`
- `output/LATEX_BUILD_REPORT.md`
All non-PDF survey behavior, defaults, checkpoints, and earlier stages inherit from `arxiv-survey`.
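The frontmatter above uses `__append__` and `__remove__` under `variant_overrides`, but this file does not spell out how they are resolved against `arxiv-survey`. The sketch below shows one plausible reading, under loud assumptions: nested mappings merge key by key, `__append__` adds missing items to the inherited list, `__remove__` drops items from it, and any other value replaces the inherited one. The function name `apply_overrides` is hypothetical.

```python
import copy


def apply_overrides(base, overrides):
    """Merge a variant's overrides onto the base pipeline contract (assumed semantics)."""
    if isinstance(overrides, dict):
        if overrides and set(overrides) <= {"__append__", "__remove__"}:
            # List directive: start from the inherited list, remove, then append.
            removed = overrides.get("__remove__", [])
            items = [x for x in list(base or []) if x not in removed]
            items += [x for x in overrides.get("__append__", []) if x not in items]
            return items
        merged = copy.deepcopy(base) if isinstance(base, dict) else {}
        for key, value in overrides.items():
            merged[key] = apply_overrides(merged.get(key), value)
        return merged
    # Scalars and plain lists simply replace the inherited value.
    return copy.deepcopy(overrides)
```

Under this reading, the C5 override would move `latex-scaffold` and `latex-compile-qa` from the inherited `optional_skills` into `required_skills` and extend `produces` with the three LaTeX artifacts, leaving every other stage untouched.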
FILE:pipelines/arxiv-survey.pipeline.md
---
name: arxiv-survey
version: 3.8
profile: arxiv-survey
routing_hints: [survey, review, 综述, 调研, literature review]
routing_default: true
routing_priority: 10
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
- outline/transitions.md
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/claim_evidence_matrix.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- outline/writer_context_packs.jsonl
- citations/ref.bib
- citations/verified.jsonl
- sections/sections_manifest.jsonl
- sections/h3_bodies.refined.ok
- sections/paragraphs_curated.refined.ok
- sections/style_harmonized.refined.ok
- sections/opener_varied.refined.ok
- sections/abstract.md
- sections/S1.md
- sections/S2.md
- sections/discussion.md
- sections/conclusion.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/SCHEMA_NORMALIZATION_REPORT.md
- output/EVIDENCE_SELFLOOP_TODO.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/PARAGRAPH_CURATION_REPORT.md
- output/FRONT_MATTER_REPORT.md
- output/CHAPTER_LEADS_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/GLOBAL_REVIEW.md
- output/DRAFT.md
- output/MERGE_REPORT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.arxiv-survey.csv
contract_model: pipeline.frontmatter/v1
structure_mode: section_first
pre_retrieval_shell:
enabled: true
approval_surface: false
allowed_h2: [Introduction, Related Work, Core Chapters, Discussion, Conclusion]
binding_layers: [chapter_skeleton, section_bindings, section_briefs, subsection_mapping]
core_chapter_h3_target: 3
query_defaults:
max_results: 1800
core_size: 300
per_subsection: 28
global_citation_min_subsections: 4
draft_profile: survey
citation_target: recommended
evidence_mode: abstract
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- per_subsection
- global_citation_min_subsections
- draft_profile
- citation_target
- enrich_metadata
- evidence_mode
- fulltext_max_papers
- fulltext_max_pages
- fulltext_min_chars
- time_window.from
- time_window.to
quality_contract:
citation_policy:
unique_hard_floor: 150
unique_recommended: 165
structure_policy:
max_final_h2_by_profile:
survey: 8
deep: 9
max_h3_by_profile:
survey: 10
deep: 12
front_matter_policy:
survey:
introduction:
min_cites: 35
min_paras: 5
min_chars: 2600
related_work:
min_cites: 50
min_paras: 6
min_chars: 3200
deep:
introduction:
min_cites: 40
min_paras: 6
min_chars: 3000
related_work:
min_cites: 55
min_paras: 7
min_chars: 3600
subsection_policy:
survey:
min_unique_citations: 12
min_chars: 4200
deep:
min_unique_citations: 14
min_chars: 5200
loop_policy:
stage_retry_budget:
C1: 2
C2: 2
C3: 1
C4: 1
max_reroutes: 4
require_human_on_retry_after_approval: true
stages:
C0:
title: Init
mode: no_prose
required_skills: [workspace-init, pipeline-router]
optional_skills: []
produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
C1:
title: Retrieval & core set
mode: no_prose
required_skills: [literature-engineer, dedupe-rank]
optional_skills: [keyword-expansion, survey-seed-harvest]
produces: [papers/papers_raw.jsonl, papers/retrieval_report.md, papers/papers_dedup.jsonl, papers/core_set.csv]
C2:
title: Structure
mode: no_prose
required_skills: [taxonomy-builder, chapter-skeleton, section-bindings, section-briefs, outline-builder, section-mapper, outline-refiner, pipeline-router, human-checkpoint]
optional_skills: [outline-budgeter]
produces: [outline/taxonomy.yml, outline/chapter_skeleton.yml, outline/section_bindings.jsonl, outline/section_binding_report.md, outline/section_briefs.jsonl, outline/outline.yml, outline/mapping.tsv, outline/coverage_report.md, outline/outline_state.jsonl, output/REROUTE_STATE.json, DECISIONS.md]
human_checkpoint:
approve: scope + section skeleton + outline
write_to: DECISIONS.md
C3:
title: Evidence
mode: no_prose
required_skills: [pdf-text-extractor, paper-notes, subsection-briefs, chapter-briefs]
optional_skills: []
produces: [papers/fulltext_index.jsonl, papers/paper_notes.jsonl, papers/evidence_bank.jsonl, outline/subsection_briefs.jsonl, outline/chapter_briefs.jsonl]
C4:
title: Citations + evidence packs
mode: no_prose
required_skills: [citation-verifier, evidence-binder, evidence-draft, table-schema, anchor-sheet, table-filler, appendix-table-writer, schema-normalizer, writer-context-pack, evidence-selfloop, claim-matrix-rewriter]
optional_skills: [survey-visuals]
produces: [citations/ref.bib, citations/verified.jsonl, outline/evidence_bindings.jsonl, outline/evidence_binding_report.md, outline/table_schema.md, outline/tables_index.md, outline/tables_appendix.md, output/TABLES_APPENDIX_REPORT.md, outline/evidence_drafts.jsonl, outline/anchor_sheet.jsonl, output/SCHEMA_NORMALIZATION_REPORT.md, outline/writer_context_packs.jsonl, output/EVIDENCE_SELFLOOP_TODO.md, outline/claim_evidence_matrix.md]
C5:
title: Draft
mode: prose_allowed
required_skills: [front-matter-writer, chapter-lead-writer, subsection-writer, writer-selfloop, section-logic-polisher, argument-selfloop, paragraph-curator, style-harmonizer, opener-variator, evaluation-anchor-checker, transition-weaver, section-merger, post-merge-voice-gate, citation-diversifier, citation-injector, draft-polisher, global-reviewer, pipeline-auditor, artifact-contract-auditor]
optional_skills: [prose-writer, subsection-polisher, redundancy-pruner, terminology-normalizer, limitation-weaver, latex-scaffold, latex-compile-qa]
produces: [outline/transitions.md, sections/sections_manifest.jsonl, sections/h3_bodies.refined.ok, sections/paragraphs_curated.refined.ok, sections/style_harmonized.refined.ok, sections/opener_varied.refined.ok, sections/abstract.md, sections/S1.md, sections/S2.md, sections/discussion.md, sections/conclusion.md, output/WRITER_SELFLOOP_TODO.md, output/EVAL_ANCHOR_REPORT.md, output/ARGUMENT_SELFLOOP_TODO.md, output/SECTION_ARGUMENT_SUMMARIES.jsonl, output/ARGUMENT_SKELETON.md, output/PARAGRAPH_CURATION_REPORT.md, output/FRONT_MATTER_REPORT.md, output/CHAPTER_LEADS_REPORT.md, output/SECTION_LOGIC_REPORT.md, output/MERGE_REPORT.md, output/DRAFT.md, output/POST_MERGE_VOICE_REPORT.md, output/CITATION_BUDGET_REPORT.md, output/CITATION_INJECTION_REPORT.md, output/GLOBAL_REVIEW.md, output/AUDIT_REPORT.md, output/CONTRACT_REPORT.md]
---
# Pipeline: arXiv survey / review (MD-first)
Default contract (survey-grade, A150++):
- `queries.md` defaults are set for a *survey deliverable* (no silent downgrade): `core_size=300`, `per_subsection=28`, global unique citations hard floor `>=150` (recommended `>=165` when `core_size=300`; default `citation_target=recommended`).
- `draft_profile` controls **writing strictness** (`survey` vs `deep`), not “speed mode”.
- `evidence_mode` controls **evidence strength** (`abstract` default; `fulltext` optional and heavier).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- survey-seed-harvest
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- `queries.md` may specify `max_results` and a year-based `time_window` (`from` / `to`); `arxiv-search` will paginate and attach arXiv metadata (categories, arxiv_id, etc.) when online.
- If you import an offline export but later have network, you can set `enrich_metadata: true` in `queries.md` (or run `arxiv-search --enrich-metadata`) to backfill missing abstracts/authors/categories via arXiv `id_list`.
- Evidence-first expectation (A150++): aim for a large dedup pool (target >=1200, not ~200) and a stable, verifiable core set (`core_size=300`) so later stages can bind wide in-scope citation pools without forcing out-of-scope drift.
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- chapter-skeleton
- section-bindings
- section-briefs
- outline-builder
- section-mapper
- outline-refiner
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
human_checkpoint:
- approve: scope + section skeleton + outline
- write_to: DECISIONS.md
Notes:
- `chapter-skeleton` is the first retrieval-informed chapter contract; it stays chapter-level and does not emit stable H3 ids.
- `section-bindings` measures chapter saturation before H3 decomposition and writes PASS/BLOCKED signals into `outline/section_binding_report.md`.
- `section-briefs` turns the chapter layer into decomposition guidance plus subsection seeds; `outline-builder` then derives the first stable `outline/outline.yml` from that section layer.
- `outline-refiner` now writes both `outline/outline_state.jsonl` and `output/REROUTE_STATE.json`; section-first reroute/block information should be read from those artifacts instead of inferred from prose notes.
- Evidence-first expectation: each subsection should be written as a *question to answer* (RQ) plus *evidence needs* (what kind of citations/results are required), not just generic scaffold bullets.
- Coverage default: `section-mapper` uses `queries.md:per_subsection` as the per-H3 mapping contract (A150++ default: 28) so later evidence binding and writing have enough in-scope citations to choose from.
- Diversity expectation: mapping should not over-reuse a few papers across unrelated H3s; reserve “global” works for genuinely cross-cutting citations (controlled by `global_citation_min_subsections`).
- Budget policy (paper-like): avoid H3 explosion; the outline gate uses `queries.md:draft_profile` to set max H3 (survey<=10, deep<=12).
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent H3s into fewer, thicker units, then rerun `section-mapper` → `outline-refiner` before `Approve C2`.
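The diversity rule in the notes above (a key counts as cross-cutting only when it is mapped to at least `global_citation_min_subsections` H3s) can be sketched as follows. The in-memory shape, a dict from H3 id to mapped bibkeys, is an assumption for illustration; the real layout of `outline/mapping.tsv` is not shown here, and `global_bibkeys` is a hypothetical name.

```python
from collections import defaultdict


def global_bibkeys(mapping, min_subsections=4):
    """Return bibkeys mapped to at least `min_subsections` distinct H3 ids."""
    seen = defaultdict(set)
    for h3_id, keys in mapping.items():
        for key in keys:
            seen[key].add(h3_id)  # sets ignore repeats inside one H3
    return {key for key, h3_ids in seen.items() if len(h3_ids) >= min_subsections}
```

Keys that appear in two or three unrelated H3s but stay below the threshold are the ones worth reviewing for over-reuse.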
## Stage 3 - Evidence (C3) [NO PROSE]
required_skills:
- pdf-text-extractor
- paper-notes
- subsection-briefs
- chapter-briefs
produces:
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
Notes:
- `queries.md` can set `evidence_mode: "abstract"|"fulltext"` (A150++ default: `abstract`).
- `queries.md` can set `draft_profile: "survey"|"deep"` to control writing gate strictness (A150++ default: `survey`).
- If `evidence_mode: "fulltext"`, `pdf-text-extractor` can be tuned via `fulltext_max_papers`, `fulltext_max_pages`, `fulltext_min_chars`.
- `subsection-briefs` converts each H3 into a verifiable writing card (scope_rule/rq/axes/clusters/paragraph_plan) so writing does not copy outline scaffolds.
- Optional refinement markers (recommended): treat briefs as *contracts*, not scaffolds. If you manually refine them and want to prevent regeneration, create:
- `outline/subsection_briefs.refined.ok`
- `outline/chapter_briefs.refined.ok`
These markers are used as explicit “reviewed/refined” signals and as a freeze switch (scripts won’t overwrite refined briefs).
## Stage 4 - Citations + evidence packs (C4) [NO PROSE]
required_skills:
- citation-verifier
- evidence-binder
- evidence-draft
- table-schema
- anchor-sheet
- table-filler
- appendix-table-writer
- schema-normalizer
- writer-context-pack
- evidence-selfloop
- claim-matrix-rewriter
optional_skills:
- survey-visuals
produces:
- citations/ref.bib
- citations/verified.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- output/SCHEMA_NORMALIZATION_REPORT.md
- outline/writer_context_packs.jsonl
- output/EVIDENCE_SELFLOOP_TODO.md
- outline/claim_evidence_matrix.md
Notes:
- `evidence-draft` turns paper notes into per-subsection evidence packs (claim candidates + concrete comparisons + eval protocol + limitations) that the writer must follow.
- `claim-matrix-rewriter` makes `outline/claim_evidence_matrix.md` a projection/index of evidence packs (not an outline expansion), so writer guidance stays evidence-first.
- `writer-context-pack` builds a deterministic per-H3 drafting pack (briefs + evidence + anchors + allowed cites), reducing hollow writing and making C5 more debuggable.
- Tables are part of the default survey deliverable, but split into two layers:
- `outline/tables_index.md` (internal index; produced by `table-filler`; useful for planning/debugging; NOT inserted into the paper)
- `outline/tables_appendix.md` (reader-facing; produced by `appendix-table-writer`; clean/publishable; inserted into the draft as an Appendix block by `section-merger`)
- Optional: `survey-visuals` can still produce timeline/figure specs as intermediate artifacts.
- Optional refinement markers (recommended): after you spot-check/refine C4 artifacts and want to freeze them, create:
- `outline/evidence_bindings.refined.ok`
- `outline/evidence_drafts.refined.ok`
- `outline/anchor_sheet.refined.ok`
- `outline/writer_context_packs.refined.ok`
These markers make “reviewed/refined” explicit and prevent accidental regeneration/overwrite; strict mode relies on content checks (placeholders/blocking_missing/scope), not marker presence.
## Stage 5 - Draft (C5) [PROSE AFTER C2]
required_skills:
- front-matter-writer
- chapter-lead-writer
- subsection-writer
- writer-selfloop
- section-logic-polisher
- argument-selfloop
- paragraph-curator
- style-harmonizer
- opener-variator
- evaluation-anchor-checker
- transition-weaver
- section-merger
- post-merge-voice-gate
- citation-diversifier
- citation-injector
- draft-polisher
- global-reviewer
- pipeline-auditor
- artifact-contract-auditor
optional_skills:
- prose-writer
- subsection-polisher
- redundancy-pruner
- terminology-normalizer
- limitation-weaver
- latex-scaffold
- latex-compile-qa
produces:
- sections/sections_manifest.jsonl
- sections/abstract.md
- sections/discussion.md
- sections/conclusion.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/MERGE_REPORT.md
- output/DRAFT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/GLOBAL_REVIEW.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
Notes:
- C5 writing system (semantic + minimal artifacts; no extra machinery):
- **Unit of work**: `sections/*.md` (front matter, H2 leads, H3 bodies). Avoid editing `output/DRAFT.md` directly until after merge.
- **Single source of truth (locked conventions)**: `output/ARGUMENT_SKELETON.md` → `## Consistency Contract` (terminology, scope boundary, evaluation protocol fields, baseline naming).
- **Write → check → fix (five gates)**:
1) `writer-selfloop` → `output/WRITER_SELFLOOP_TODO.md`: file existence, depth, citation scope, paper voice.
2) `section-logic-polisher` → `output/SECTION_LOGIC_REPORT.md`: paragraph linkage (no jump cuts / “paragraph islands”).
3) `argument-selfloop` → `output/ARGUMENT_SELFLOOP_TODO.md` + `output/ARGUMENT_SKELETON.md` + `output/SECTION_ARGUMENT_SUMMARIES.jsonl`: section-level closure + premise/definition stability.
4) `paragraph-curator` → `output/PARAGRAPH_CURATION_REPORT.md`: **select → evaluate → subset → fuse** so sections converge (reduce redundancy, strengthen synthesis) without changing citation keys.
5) `evaluation-anchor-checker` → `output/EVAL_ANCHOR_REPORT.md`: final section-level numeric hygiene sweep so surviving numeric claims keep same-sentence task/metric/constraint context before merge.
- **Openers-last**: draft the middle first; rewrite paragraph 1 last so it reflects real content (front matter + H3).
- Writing self-loop gate: `subsection-writer` ensures the full `sections/` file set exists (and emits `sections/sections_manifest.jsonl`); `writer-selfloop` blocks until depth/citation-scope/paper-voice checks pass, writing `output/WRITER_SELFLOOP_TODO.md` (PASS/FAIL).
- Argument self-loop gate: `argument-selfloop` blocks “smooth but hollow” sections by making the argument chain explicit (per-section paragraph moves + a global dependency skeleton). Its ledgers are intermediate artifacts and must never be merged into the paper.
- Style hygiene (C5 hard gate for `survey`/`deep`): treat `output/WRITER_SELFLOOP_TODO.md` Style Smells as mandatory fixes. Run `style-harmonizer` + `opener-variator` on flagged files, then rerun `writer-selfloop` before merge.
- Numeric hygiene gate: run `evaluation-anchor-checker` after the last section-level rewrites (`paragraph-curator` + `style-harmonizer` + `opener-variator`) and before merge. If later section rewrites touch the same H3 files, rerun it; do not wait for `pipeline-auditor` to discover underspecified numbers in the merged draft.
- Micro-fix routing (preferred over broad rewrites): if Style Smells are specific, use targeted micro-skills before a general harmonize pass:
- opener cadence / “overview” narration → `opener-variator`
- count-based limitation slots (“Two limitations…”) → `limitation-weaver`
- underspecified numeric/performance claims (missing task/metric/budget) → `evaluation-anchor-checker`
- Triage rule (prevents “patching evidence gaps with prose”): if `writer-selfloop` FAILs because a subsection cannot meet `must_use` *in-scope* (thin packs / missing anchors / out-of-scope citation pressure), stop and rerun the evidence loop (`evidence-selfloop` + upstream C2/C3/C4) instead of padding prose.
- WebWeaver-style “planner vs writer” split (single agent, two passes):
- Planner pass: for each section/subsection, pick the exact citation IDs to use from the evidence bank (`outline/evidence_drafts.jsonl`) and keep scope consistent with the outline.
- Writer pass: write that section using only those citation IDs; avoid dumping the whole notes set into context.
- Treat this stage as an iteration loop: draft per H3 → logic-polish (thesis + connectors) → weave transitions → merge → de-template/cohere → global review → (if gaps) back to C3/C4 → regenerate.
- Post-merge voice gate: `post-merge-voice-gate` treats `outline/transitions.md` as a high-frequency injection source. If it FAILs, fix the *source* (usually transitions via `transition-weaver`, or the owning `sections/*.md`) and re-merge; do not “patch around it” in `draft-polisher`.
- Depth target (profile-aware): each H3 should be “few but thick” (avoid stubs). Use `queries.md:draft_profile` as the contract:
- `survey`: >=10 paragraphs + >=12 unique cites
- `deep`: >=11 paragraphs + >=14 unique cites
In all profiles, require >=2 concrete contrasts + evaluation anchoring + a cross-paper synthesis paragraph + an explicit limitation.
- Profile semantics: `survey` is the default deliverable contract; `deep` is stricter (and typically pairs well with `evidence_mode: fulltext`).
- Coherence target (paper-like): for every H2 chapter with H3 subsections, write a short **chapter lead** block (`sections/S<sec_id>_lead.md`) that previews the comparison axes and how the H3s connect (no new headings; avoid generic glue).
- Anti-template style contract (paper-like, not “outline narration”):
- Avoid meta openers like “This subsection surveys/argues …” and slide-like navigation (“Next, we move from … / We now turn to …”).
- Keep signposting light: avoid repeating a literal opener label across many subsections (e.g., `Key takeaway:`); vary opener phrasing and cadence.
- Tone target: calm, academic, understated; delete hype words (`clearly`, `obviously`) and “PPT speaker notes”.
- Keep evidence-policy disclaimers **once** in front matter (not repeated across H3s).
- If you cite numbers, include minimal evaluation context (task + metric + constraint/budget/cost) in the same paragraph.
- Citation shape must be reader-facing: no adjacent citation blocks (e.g., `[@a] [@b]`), no duplicate keys in one block (e.g., `[@a; @a]`), and avoid tail-only citation style by keeping mid-sentence citations in each H3.
- `section-merger` merges `sections/*.md` plus `outline/transitions.md` (within-chapter H3→H3 by default). Between-H2 transition insertion is optional: create `outline/transitions.insert_h2.ok` in the workspace if you want narrator-style handoffs included.
- Tables are part of the default deliverable: `outline/tables_appendix.md` is inserted into the draft by `section-merger` as a single Appendix block (index tables in `outline/tables_index.md` remain intermediate) unless `outline/tables.insert.off` exists. Other visuals (`outline/timeline.md`, `outline/figures.md`) remain intermediate by default.
- Citation scope policy: citations are subsection-first (from `outline/evidence_bindings.jsonl`), with limited reuse allowed within the same H2 chapter to reduce brittleness; avoid cross-chapter “free cite” drift.
- Controlled flexibility: bibkeys mapped to >= `queries.md:global_citation_min_subsections` subsections (A150++ default: 4) are treated as cross-cutting/global; see `allowed_bibkeys_global` in writer packs / `sections_manifest.jsonl`.
- If global unique citations are low, run `citation-diversifier` → `citation-injector` *before* `draft-polisher` (the polisher treats citation keys as immutable).
- `queries.md` can set `citation_target: recommended|hard` to control whether the recommended target is enforced as blocking (default: `recommended` for A150++).
- If you intentionally add/remove citations after an earlier polish run, reset the citation-anchoring baseline before rerunning `draft-polisher`:
- delete `output/citation_anchors.prepolish.jsonl` (workspace-local), then rerun `draft-polisher`.
- Recommended skills (toolkit, not a rigid one-shot chain):
- Modular drafting: `subsection-writer` → `writer-selfloop` → `section-logic-polisher` → `argument-selfloop` → `paragraph-curator` → `style-harmonizer` → `opener-variator` → `evaluation-anchor-checker` → `transition-weaver` → `section-merger` → `draft-polisher` → `global-reviewer` → `pipeline-auditor`.
- Legacy one-shot drafting: `prose-writer` (kept for quick experiments; less debuggable).
- If the draft reads like “paragraph islands”, run `section-logic-polisher` and patch only failing `sections/S*.md` until PASS, then merge.
- Add `pipeline-auditor` after `global-reviewer` as a regression test (blocks on ellipsis, repeated boilerplate, and citation hygiene).
- If you also need a PDF deliverable, use `latex-scaffold` + `latex-compile-qa` (see `arxiv-survey-latex`).
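The countable part of the per-H3 depth target in the notes above (paragraph count and unique cites per profile) can be sketched as below. It assumes paragraphs are separated by blank lines and citations use the `[@key; @key2]` form shown in this file; the function name `h3_depth_report` is hypothetical, and the qualitative requirements (contrasts, synthesis, limitation) are not checked.

```python
import re

# (min paragraphs, min unique cites) per profile, as stated in the depth target.
DEPTH_TARGETS = {"survey": (10, 12), "deep": (11, 14)}
CITE_BLOCK = re.compile(r"\[(@[^\]]+)\]")
CITE_KEY = re.compile(r"@([A-Za-z0-9_:.\-]+)")


def h3_depth_report(markdown, profile="survey"):
    """Count paragraphs and unique citation keys in one H3 body."""
    min_paras, min_cites = DEPTH_TARGETS[profile]
    blocks = re.split(r"\n\s*\n", markdown.strip())
    # Heading lines are not counted as paragraphs.
    paragraphs = [b for b in blocks if b.strip() and not b.lstrip().startswith("#")]
    keys = set()
    for block in CITE_BLOCK.findall(markdown):
        keys.update(CITE_KEY.findall(block))
    return {
        "paragraphs": len(paragraphs),
        "unique_cites": len(keys),
        "ok": len(paragraphs) >= min_paras and len(keys) >= min_cites,
    }
```

A failing report is a signal to return to the evidence loop rather than to pad prose, in line with the triage rule.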
## Quality gates (strict mode)
- Citation coverage: expect a large, verifiable bibliography (A150++ default: `core_size=300` → `ref.bib` ~300) and high cite density:
- Per-H3: `survey` profile expects >=12 unique citations per H3; `deep` expects >=14.
- Front matter: `survey` profile expects Introduction>=35 and Related Work>=50 unique citations (dense positioning; no cite dumps).
- Global: `pipeline-auditor` gates on **global unique citations across the full draft**. A150++ defaults: hard `>=150`; recommended `>=165` (when bib=300). `queries.md:citation_target` controls which is blocking (default: `recommended`). If it fails, prefer `citation-diversifier` → `citation-injector` (in-scope, NO NEW FACTS) using each H3’s `allowed_bibkeys_selected` / `allowed_bibkeys_mapped` from `outline/writer_context_packs.jsonl`.
- Anti-template: drafts containing ellipsis placeholders (`…`) or leaked scaffold instructions (e.g., "enumerate 2-4 ...") should block and be regenerated from improved outline/mapping/evidence artifacts.
- Final polish hard gates (`survey`/`deep`): block on narration-template openers (e.g., `This subsection ...`), slide navigation phrasing, repeated opener stems/verbal tics, adjacent citation blocks (`[@a] [@b]`), duplicate keys in one block (`[@a; @a]`), and low H3 mid-sentence citation ratio (<30%).
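Two of the gates above are easy to approximate with regular expressions: the global unique-citation floor and the citation-shape rules. The sketch below is not the `pipeline-auditor` implementation. It assumes citations appear as `[@key]` or `[@a; @b]`, and it reads `citation_target` as selecting which floor blocks (150 for `hard`, 165 for `recommended`). All function names are hypothetical.

```python
import re

CITE_BLOCK = re.compile(r"\[(@[^\]]+)\]")
CITE_KEY = re.compile(r"@([A-Za-z0-9_:.\-]+)")
ADJACENT = re.compile(r"\[@[^\]]+\][ \t]*\[@[^\]]+\]")
FLOORS = {"hard": 150, "recommended": 165}  # A150++ defaults from this pipeline


def unique_keys(draft):
    """Collect every distinct citation key used anywhere in the draft."""
    keys = set()
    for block in CITE_BLOCK.findall(draft):
        keys.update(CITE_KEY.findall(block))
    return keys


def gate_passes(draft, citation_target="recommended"):
    """True when the draft meets the floor selected by citation_target."""
    return len(unique_keys(draft)) >= FLOORS[citation_target]


def shape_problems(draft):
    """Flag adjacent citation blocks and duplicate keys inside one block."""
    problems = []
    if ADJACENT.search(draft):
        problems.append("adjacent citation blocks")
    for block in CITE_BLOCK.findall(draft):
        keys = CITE_KEY.findall(block)
        if len(keys) != len(set(keys)):
            problems.append(f"duplicate key in [{block}]")
    return problems
```

The mid-sentence citation ratio is left out on purpose: it depends on sentence segmentation, which needs more care than a regex sketch can give.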
FILE:pipelines/graduate-paper-pipeline.md
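# Pipeline: Chinese graduate thesis restructuring and finalization (research-stage draft)
> Status: research-stage draft
> Current positioning: study the workflow and the skills it needs first; **do not bind UNITS yet, and do not force it into an executable contract**
> Audience: Chinese graduate thesis writing that already has a university template, prior papers, Overleaf sources, PDFs, BibTeX, experiment figures/tables, and intermediate working documents
> Core goal: restructure “existing material” into a Chinese graduate thesis with a unified main line, smooth structure, complete evidence, consistent terminology, and a compilable deliverable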
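## 1. What this pipeline really is
This is neither a linear “write a thesis from scratch” workflow nor an assembly workflow that “stitches a few papers together”. It is a thesis-engineering pipeline that is **driven by a question list, centered on a Markdown intermediate layer, and closed out in a TeX delivery layer**.
Its actual main loop is:
> `question list -> semantic localization -> Markdown restructuring -> TeX write-back -> compile review -> issue write-back -> next round`
The first goal of this pipeline is therefore not to “rush out a version of the body text”, but to get the following right first:
1. Recover the existing material first, rather than editing blindly inside `tex`.
2. Restructure chapter roles around the thesis main line first, rather than copying the narrative of the original papers.
3. Work out structure, evidence, terminology, and figures/tables in the Markdown intermediate layer first, then write back to TeX.
4. Resolve structure and evidence problems first, then style and “AI flavor” removal.
5. The real convergence point is not “one full pass is written”, but “the question list is cleared round by round, and compilation and review pass stably”.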
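## 2. The fundamental difference between this pipeline and the survey pipeline
It differs substantially from `arxiv-survey` / `arxiv-survey-latex` and must be modeled separately:
- the survey pipeline starts from “retrieving literature on a topic and converging on a structure layer by layer”;
- the graduate-paper pipeline starts from “inventorying and restructuring existing material”;
- the survey pipeline's core intermediate artifacts are `outline/evidence_drafts/...`;
- the graduate-paper pipeline's core intermediate artifacts are the outline, question list, intermediate chapter drafts, figure/table plan, and review records under `codex_md/`;
- the survey pipeline's main goal is to “generate a survey”;
- the graduate-paper pipeline's main goal is to “restructure existing research work into a finalized thesis that conforms to Chinese degree-thesis conventions”.
In one sentence:
> survey is more like a “retrieval-driven evidence-writing workflow”;
> graduate-paper is more like an “existing-material-driven thesis-engineering restructuring workflow”.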
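## 3. Workspace layering and key artifacts
This pipeline needs a clear separation between the **thinking layer** and the **delivery layer**.
### 3.1 Directory responsibilities
- `main.tex`: the top-level entry of the thesis; assembles the cover, abstract, table of contents, body, appendices, acknowledgements, achievements, and references.
- `chapters/`: final deliverable body `.tex` files.
- `abstract/`, `preface/`, `acknowledgement/`, `achievements/`: formally delivered non-body parts.
- `references/`: bibliography database and style files.
- `pdf/`: PDFs of published or submitted papers.
- `Overleaf_ref/`: existing source drafts, revision drafts, and supplementary material.
- `codex_md/`: the intermediate working layer; this is the **thesis restructuring workspace**.
- `claude_md/`: review checklists and final-draft verification records.
- `tmp_layout/`, `tmp_layout2/`: figure/table trial layouts and page-layout rehearsals.
The most important distinction is:
- `codex_md/` is the **thinking / restructuring area**
- `chapters/` is the **delivery / final-draft area**
Do not mix the two.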
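### 3.2 Intermediate artifacts recommended as fixtures
It is recommended to keep the following artifacts fixed for the long term in this pipeline:
- `codex_md/material_index.md`: inventory and index of existing material
- `codex_md/material_readiness.md`: material readiness and gap hints
- `codex_md/missing_info.md`: list of missing information
- `codex_md/question_list.md`: this round's question list, priorities, and acceptance criteria
- `codex_md/00_thesis_outline.md`: thesis main line, outline, and chapter roles
- `codex_md/chapter_role_map.md`: source material -> chapter -> role mapping table
- `codex_md/chapter_rewrite_rules.md`: chapter restructuring rules table
- `codex_md/terminology_glossary.md`: unified terminology table
- `codex_md/symbol_metric_table.md`: unified symbol / metric table
- `codex_md/figure_plan.md`: figure/table and layout plan
- `claude_md/review_checklist.md`: review checklist for compilation, typesetting, data, and template items
- `output/THESIS_BUILD_REPORT.md`: compilation and final-draft check report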
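## 4. Skills this pipeline actually needs
For now, setting reuse and generalization aside and decomposing only by what this thesis pipeline itself needs, the first version suggests the following skills.
### 4.1 Main-chain skills
1. `thesis-workspace-init`
Responsibility: initialize the thesis project, remind the user where to place material, set up the workspace, check whether `main.tex` compiles, and generate the material inventory.
2. `thesis-question-list`
Responsibility: create and maintain `question_list.md`, fixing each round's revision goals, issue priorities, boundaries, and acceptance criteria.
This is the **control-plane skill** of the whole pipeline.
3. `thesis-source-role-mapper`
Responsibility: map existing papers / templates / source drafts / PDFs / figure and table material to “thesis roles”, not just `paper -> chapter`.
4. `thesis-chapter-reconstructor`
Responsibility: restructure each chapter's goal, weight, handoffs, and narrative approach around the thesis main line.
This is the **core restructuring skill** of the whole pipeline.
5. `thesis-markdown-aligner`
Responsibility: unify the main line, terminology, symbols, metrics, and figure/table conventions, so that the chapters converge in the Markdown intermediate layer into one thesis rather than a set of papers.
6. `thesis-tex-writeback`
Responsibility: write content already sorted out in the Markdown layer back to `chapters/*.tex`, syncing figures/tables, equations, cross-references, and chapter handoffs.
7. `thesis-compile-review`
Responsibility: run compilation, warning triage, template-mode checks, data-consistency verification, and issue write-back.
It forms the final quality loop.
8. `thesis-style-polisher`
Responsibility: once structure, evidence, and data are stable, polish toward Chinese degree-thesis style and remove “AI flavor”.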
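### 4.2 Parallel side-branch skills
9. `thesis-visual-layout-planner`
Responsibility: figure/table planning, Mermaid sketches, temporary page composition, and text-figure rhythm checks.
10. `thesis-frontmatter-sync`
Responsibility: sync the abstract, cover, appendices, achievements, acknowledgements, lists, and other non-body parts, so they are not left to be filled in at the very end.
11. `thesis-citation-enhance-review`
Responsibility: locate sentences that must be supported by citations, expand candidate references, verify that citations match claims, and write back to `references/*.bib` and in-text citations.
### 4.3 How these 11 skills relate
The main chain is:
`thesis-workspace-init -> thesis-question-list -> thesis-source-role-mapper -> thesis-chapter-reconstructor -> thesis-markdown-aligner -> thesis-tex-writeback -> thesis-compile-review -> thesis-style-polisher`
The parallel side branches are:
- `thesis-visual-layout-planner`
- `thesis-frontmatter-sync`
- `thesis-citation-enhance-review`
Among them:
- `thesis-question-list` is the control plane for the whole workflow
- `thesis-chapter-reconstructor` is the core restructuring plane
- `thesis-compile-review` is the quality-loop plane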
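## 5. Pipeline overview (by stage)
| Stage | Goal | Core skills | Main outputs |
|---|---|---|---|
| Stage 0 | Project initialization and material intake | `thesis-workspace-init` | Directory skeleton, initial compile, material index, material readiness hints |
| Stage 1 | Recover existing material into the Markdown intermediate layer | `thesis-workspace-init` | Initial Markdown material, missing-information list, material gap hints |
| Stage 1.5 | Lock this round's questions and boundaries | `thesis-question-list` | `question_list.md` |
| Stage 2 | Map source material to thesis roles | `thesis-source-role-mapper` | Chapter role map, material grouped by chapter |
| Stage 2.5 | Restructure chapters around the main line | `thesis-chapter-reconstructor` | Restructured chapter Markdown |
| Stage 3 | Align, merge, and unify the whole-thesis structure | `thesis-markdown-aligner` | Stable outline, terminology table, symbol table, evidence-gap list |
| Stage 3.5 | Figure/table and layout rehearsal | `thesis-visual-layout-planner` | Figure/table plan, text-figure mapping, temporary layout results |
| Stage 4 | Write back to the TeX delivery layer | `thesis-tex-writeback` | Compilable chapter `.tex`, first complete `main.pdf` |
| Stage 4.5 | Sync non-body parts | `thesis-frontmatter-sync` | Synced versions of the abstract, appendices, cover, achievements, etc. |
| Stage 5 | Citation enhancement and verification | `thesis-citation-enhance-review` | Citation reinforcement results, verification records |
| Stage 6 | Compilation, typesetting, and final-draft review | `thesis-compile-review` | `THESIS_BUILD_REPORT`, review checklist |
| Stage 7 | Final polish and “AI flavor” removal | `thesis-style-polisher` | A more natural Chinese final draft |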
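## 6. Stage-by-stage details
### Stage 0: Project initialization and material intake
**Goal**
Turn the thesis into a maintainable project first, rather than a pile of scattered files.
**Main inputs**
- University template
- Existing repository / source drafts
- Basic metadata such as student ID, year, and Chinese and English titles
- Existing `main.tex`
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Basic directory skeleton
- Initial `main.tex` / initial `main.pdf`
- `codex_md/material_index.md`
- `codex_md/material_readiness.md`
**Handoffs**
- If `main.tex` does not compile yet, do not enter the body restructuring stages
- If the material is not yet in place, do not start the question list or chapter restructuring
**Execution focus**
- Make explicit which areas are material, which are workspace, and which are delivery
- Make sure it “compiles” first, then worry about “reads well”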
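### Stage 1: Recover existing material into the Markdown intermediate layer
**Goal**
Recover the reusable content from the existing `template / tex / Overleaf / PDF / bib` material into a Markdown intermediate layer, rather than forcing edits directly in the TeX layer.
**Main inputs**
- University template
- Existing `.tex`
- Overleaf source drafts
- PDFs
- Bibliography database
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Initial Markdown chapter material
- `codex_md/material_readiness.md`
- `codex_md/missing_info.md`
- Material index and list of information still to be supplied
**Handoffs**
- The outputs of Stage 1 feed directly into `thesis-question-list` and `thesis-source-role-mapper`
**Execution focus**
- The focus is “inventorying material assets”, not “writing pretty sentences first”
- Explicitly mark experiment details, figure captions, citations, and numeric checks that still need to be supplied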
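### Stage 1.5: Lock the question list and this round's goals
**Goal**
Establish the control plane for this round: make explicit what this round fixes and what it does not.
**Main inputs**
- Initial Markdown material
- Review feedback from the previous round
- Current compilation / structure / style problems
**Core skill**
- `thesis-question-list`
**Main outputs**
- `codex_md/question_list.md`
**Handoffs**
- This step is the starting point for all later restructuring and write-back
- Any structural dispute should first come back here to be re-prioritized
**Execution focus**
- Every issue must state clearly: what it is, why it matters, how to change it, and what level of acceptance is required
- The question list is not a memo; it is the entry point for iteration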
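### Stage 2: Map source material to thesis roles
**Goal**
Map existing material to the chapter roles of the thesis, rather than doing only a mechanical “paper to chapter” assignment.
**Main inputs**
- Existing papers, PDFs, source drafts, figures/tables
- Initial Markdown material
- `question_list.md`
**Core skill**
- `thesis-source-role-mapper`
**Main outputs**
- `codex_md/chapter_role_map.md`
- Markdown material grouped by chapter
- Content boundaries for each chapter
**Handoffs**
- This is the direct input to `thesis-chapter-reconstructor`
**Execution focus**
- Distinguish: method core / background support / validation evidence / system implementation / material for limitations and outlook
- If this is set wrong, the whole thesis will be skewed afterwards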
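### Stage 2.5: Restructure chapters around the main line
**Goal**
Turn the narrative of the original papers into a thesis-style narrative.
**Main inputs**
- Chapter role map
- `question_list.md`
- Markdown drafts of each chapter
**Core skill**
- `thesis-chapter-reconstructor`
**Main outputs**
- Restructured chapter Markdown
- `codex_md/chapter_rewrite_rules.md`
**Handoff**
- This is the core stage of the whole pipeline
- All later alignment, write-back, and compilation build on the restructuring done here
**Execution focus**
- State which question each chapter has to answer
- Decide which content to strengthen, weaken, move up, or push down
- Rewrite the transitions to the preceding and following chapters
- Replace the "selling-point narrative" with a "thesis main-line narrative"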
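### Stage 3: Markdown alignment, merging, and unification
**Goal**
Make the whole thesis read as one thesis at the intermediate layer, not as several papers stitched together.
**Main inputs**
- Restructured chapter Markdown
- `question_list.md`
**Core skill**
- `thesis-markdown-aligner`
**Main outputs**
- Stable `codex_md/00_thesis_outline.md`
- `codex_md/terminology_glossary.md`
- `codex_md/symbol_metric_table.md`
- Evidence-gap list
**Handoff**
- While the structure, terminology, or metrics are still unstable, do not enter `tex-writeback`
**Execution focus**
- Unify the research main line, terminology, abbreviations, symbols, metrics, and figure/table conventions
- Decide which content is explained once in Chapter 2, which is developed in Chapters 3-5, and which goes to the appendices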
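### Stage 3.5: Figure/table and layout rehearsal
**Goal**
Plan figures, tables, and layout before the body is finalized, so that figures do not become a last-minute patch job.
**Main inputs**
- Stable outline
- Chapter Markdown
- Existing figures, tables, and experimental results
**Core skill**
- `thesis-visual-layout-planner`
**Main outputs**
- `codex_md/figure_plan.md`
- Diagram drafts under `mermaid/`
- Trial layout results under `tmp_layout/` and `tmp_layout2/`
**Handoff**
- The figure plan directly shapes the figure-and-text structure of `tex-writeback`
**Execution focus**
- State which figures and tables each chapter needs to support the main line
- Sketch first, then assemble a provisional layout, then decide final placement and captions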
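### Stage 4: Write back to the TeX delivery layer
**Goal**
Write the content that has been straightened out at the Markdown layer back into `chapters/*.tex` and move into the delivery layer.
**Main inputs**
- Stable Markdown chapters
- Figure plan
- Unified terminology / symbol / metric tables
**Core skill**
- `thesis-tex-writeback`
**Main outputs**
- `chapters/*.tex`
- First complete `main.pdf`
**Handoff**
- If a structural problem surfaces at this stage, fix it at the Markdown layer first as a rule; do not reinvent the structure at the TeX layer
**Execution focus**
- Keep figures, tables, equations, cross-references, chapter introductions, and chapter summaries in sync
- The TeX layer is for delivery, not for rethinking the structure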
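### Stage 4.5: Sync the non-body parts
**Goal**
Maintain the abstract, appendices, cover, achievements, acknowledgements, and other non-body parts in parallel, so they are not left to the end and written up in low quality.
**Main inputs**
- Stable version of the body
- University template items
- Chinese and English titles and abstract information
**Core skill**
- `thesis-frontmatter-sync`
**Main outputs**
- `abstract/abstract.tex`
- `abstract/abstract-en.tex`
- `preface/...`
- `appendix/...`
- `acknowledgement/...`
- `achievements/...`
**Handoff**
- The final review in Stage 6 checks these parts together with the body
**Execution focus**
- Chinese and English titles and abstracts say the same thing
- Numbers, metrics, and method names in the abstract match the body
- Appendices hold only supplementary content and do not repeat the main line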
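### Stage 5: Literature enhancement and citation verification
**Goal**
Systematically fill in citations for sentences that should be backed by one, and make sure each citation matches its claim.
**Main inputs**
- Current Markdown / TeX version
- `references/*.bib`
- Existing PDFs / reference works
**Core skill**
- `thesis-citation-enhance-review`
**Main outputs**
- Enhanced reference library
- Citation reinforcement records
- Citation verification records
**Handoff**
- Literature enhancement can run in parallel with Stages 3 and 4, but it must be closed out before the final polish
**Execution focus**
- Look first for sentences that must be backed by a citation: factual descriptions, method taxonomies, classic results, metric definitions, and data sources
- Get citations right first, then add more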
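### Stage 6: Compilation, typesetting, and final review
**Goal**
Close the quality loop: the thesis should not merely compile, it should be fit for submission.
**Main inputs**
- Complete TeX project
- Synced non-body parts
- review checklist
**Core skill**
- `thesis-compile-review`
**Main outputs**
- `output/THESIS_BUILD_REPORT.md`
- `claude_md/review_checklist.md`
- List of closed / still-open issues
**Handoff**
- The problems found in Stage 6 decide which level to return to for the fix: Stage 1.5 / 2.5 / 3 / 4
**Execution focus**
- A successful compile is only the minimum requirement
- Warnings must be triaged: blocks submission / must fix / record and observe
- Also check data consistency, cite key validity, correctness of template parameters, and uniform terminology and abbreviations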
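### Stage 7: Final polish and removal of AI-sounding prose
**Goal**
Polish the text into Chinese degree-thesis style only after the structure, evidence, and data are stable.
**Main inputs**
- The thesis version that passed compilation and review
- `中文写作要求.md`
- `GPT口癖与高频用词调研.md`
**Core skill**
- `thesis-style-polisher`
**Main outputs**
- A more natural Chinese final draft
- De-templated chapter introductions, chapter summaries, and conclusion and outlook
**Handoff**
- Stage 7 must come last
- If polishing exposes structural or evidence problems, fall back to the earlier stages instead of polishing over them
**Execution focus**
- Handle chapter introductions, chapter summaries, contribution statements, and the conclusion and outlook first
- Tone down template-like phrasing, promotional phrasing, and common AI verbal tics
- The goal is "reads more like a degree thesis", not "reads more like an AI-optimized paper"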
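## 7. The four fixed loops of this pipeline
To keep the pipeline from degrading into a "linear SOP" again, the following four loops should be fixed:
### 7.1 Structure loop
`outline -> a chapter's md -> chapter imbalance found -> back to outline / question_list`
### 7.2 Content loop
`paper / Overleaf / PDF -> extract content -> restructure the narrative -> still reads like the original paper -> restructure again`
### 7.3 Typesetting loop
`tex -> compile -> warning / figure placement / cross-reference problems found -> back to tex or the figure plan`
### 7.4 Style loop
`body stable -> AI polish -> template-like or AI-sounding prose found -> back to the writing guidelines and wording control`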
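## 8. Execution priority of this pipeline
If resources are limited at run time, the priority order should be:
1. `thesis-question-list`
2. `thesis-source-role-mapper`
3. `thesis-chapter-reconstructor`
4. `thesis-markdown-aligner`
5. `thesis-tex-writeback`
6. `thesis-compile-review`
In other words:
- What truly cannot be skipped is "question list + material role mapping + chapter restructuring + compile review"
- What truly should not be done first is "polishing and removing AI-sounding prose"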
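## 9. Current conclusion
The first principle of this graduate-paper pipeline should be:
> **Restructure first, then write; converge in the intermediate layer first, then deliver in TeX; get evidence and structure right first, then style and polish.**
Therefore, if this pipeline is later made formally executable, the first thing to implement is not "one big all-in-one runner" but a solid version of the following skills:
- `thesis-workspace-init`
- `thesis-question-list`
- `thesis-source-role-mapper`
- `thesis-chapter-reconstructor`
- `thesis-compile-review`
Only once these five skills are stable does this Chinese graduation-thesis pipeline have real practical value.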
FILE:pipelines/idea-brainstorm.pipeline.md
---
name: idea-brainstorm
version: 4.0
profile: idea-brainstorm
routing_hints: [idea, ideation, brainstorm, 点子, 选题, 找方向, 找 idea]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.idea-brainstorm.csv
contract_model: pipeline.frontmatter/v1
query_defaults:
draft_profile: idea_brainstorm
max_results: 1800
core_size: 100
evidence_mode: abstract
direction_pool_min: 12
direction_pool_max: 24
idea_screen_top_n: 10
idea_shortlist_size: 5
report_top_n: 3
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- evidence_mode
- direction_pool_min
- direction_pool_max
- idea_screen_top_n
- idea_shortlist_size
- report_top_n
- time_window.from
- time_window.to
quality_contract:
memo_bundle:
required_outputs: [output/REPORT.md, output/APPENDIX.md, output/REPORT.json]
signal_policy:
min_rows: 10
direction_policy:
shortlist_min: 3
shortlist_max: 5
keep_min: 3
cluster_diversity_min: 2
lead_diversity_target: 3
lead_diversity_axes: [cluster, direction_type, program_kind]
screening_policy:
keep_rank_max: 7
maybe_rank_max: 12
score_weights:
discussion_worthiness: 0.24
academic_value: 0.22
evidence_grounding: 0.18
direction_distinctness: 0.16
first_probe_clarity: 0.10
thesis_potential: 0.10
loop_policy:
stage_retry_budget:
C1: 2
C3: 1
C4: 1
max_reroutes: 4
require_human_on_retry_after_approval: true
stages:
C0:
title: Init + idea brief
mode: no_prose
required_skills: [workspace-init, pipeline-router, idea-brief, human-checkpoint]
optional_skills: []
produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/trace/IDEA_BRIEF.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
human_checkpoint:
approve: brainstorm brief
write_to: DECISIONS.md
C1:
title: Retrieval + core set
mode: no_prose
required_skills: [literature-engineer, dedupe-rank]
optional_skills: []
produces: [papers/papers_raw.jsonl, papers/papers_dedup.jsonl, papers/core_set.csv, papers/retrieval_report.md]
C2:
title: Idea landscape / focus
mode: no_prose
required_skills: [taxonomy-builder, pipeline-router, human-checkpoint]
optional_skills: []
produces: [outline/taxonomy.yml, DECISIONS.md]
human_checkpoint:
approve: focus clusters / lenses + exclusions
write_to: DECISIONS.md
C3:
title: Evidence signals
mode: no_prose
required_skills: [paper-notes, idea-signal-mapper]
optional_skills: []
produces: [papers/paper_notes.jsonl, papers/evidence_bank.jsonl, output/trace/IDEA_SIGNAL_TABLE.md, output/trace/IDEA_SIGNAL_TABLE.jsonl]
C4:
title: Direction pool + screening
mode: short_prose_ok
required_skills: [idea-direction-generator, idea-screener]
optional_skills: []
produces: [output/trace/IDEA_DIRECTION_POOL.md, output/trace/IDEA_DIRECTION_POOL.jsonl, output/trace/IDEA_SCREENING_TABLE.md, output/trace/IDEA_SCREENING_TABLE.jsonl]
C5:
title: Shortlist + memo synthesis + self-loop
mode: prose_allowed
required_skills: [idea-shortlist-curator, idea-memo-writer, deliverable-selfloop, artifact-contract-auditor]
optional_skills: []
produces: [output/trace/IDEA_SHORTLIST.md, output/trace/IDEA_SHORTLIST.jsonl, output/REPORT.md, output/APPENDIX.md, output/REPORT.json, output/DELIVERABLE_SELFLOOP_TODO.md, output/CONTRACT_REPORT.md]
---
# Pipeline: research idea brainstorm (signals -> directions -> memo)
Goal: produce a **discussion-ready research-idea memo** for PI / PhD readers.
This pipeline is for “找 research idea / brainstorm / 选题 / 找方向”.
It is **not** for writing a survey draft and **not** for generating execution-grade project specs.
The terminal deliverable is a single default entrypoint:
- `output/REPORT.md`
That memo should feel like a mature brainstorm artifact:
- grounded in literature,
- small enough to discuss,
- clear about what is promising,
- honest about uncertainty,
- and rich enough to guide the next discussion round.
Artifact policy:
- Reader-facing terminal artifacts:
- `output/REPORT.md`
- `output/APPENDIX.md`
- `output/REPORT.json`
- Trace artifacts stay under `output/trace/`.
- The run is only shareable when both layers exist and pass contract audit.
Default profile:
- Retrieval route: `literature-engineer`
- `core_size=100`
- Evidence mode: `abstract`
- Signal table: 10-20 rows
- Direction pool: 12-24
- Shortlist: 3-5
- Final memo lead directions: 3
## Stage 0 - Init + idea brief (C0)
required_skills:
- workspace-init
- pipeline-router
- idea-brief
- human-checkpoint
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
Notes:
- The brief is the single source of truth for topic, audience, constraints, exclusions, targets, and query buckets.
- Default behavior: block once at C0 for human approval of the brief before retrieval.
## Stage 1 - Retrieval + core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
Notes:
- Use multi-query buckets from `queries.md`.
- If recall is too low/noisy, fix it upstream and rerun C1.
## Stage 2 - Idea landscape / focus (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- pipeline-router
- human-checkpoint
produces:
- outline/taxonomy.yml
- DECISIONS.md
human_checkpoint:
- approve: focus clusters / lenses + exclusions
- write_to: DECISIONS.md
Notes:
- Taxonomy is an idea landscape, not a paper outline.
- Default behavior: pause at C2 so the human can choose a few promising focus lenses.
## Stage 3 - Evidence signals (C3) [table-first]
required_skills:
- paper-notes
- idea-signal-mapper
produces:
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
Notes:
- The signal table should capture tensions, missing pieces, and academically meaningful axes rather than proposal-ready wedges.
- Keep this stage compact and table-first; no long prose.
## Stage 4 - Direction pool + screening (C4)
required_skills:
- idea-direction-generator
- idea-screener
produces:
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
Notes:
- Expansion should generate discussion-worthy research directions, not an operator cartesian product.
- Screening should score discussion value, distinctness, evidence grounding, and thesis potential.
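The screening step can be pictured as a weighted sum over the axes listed under `score_weights` in the frontmatter, followed by rank cut-offs. The sketch below is illustrative only: the function names, the `ratings` field, and the keep / maybe / drop labels are assumptions, and reading `keep_rank_max` and `maybe_rank_max` as inclusive rank thresholds is a guess rather than the documented behavior of `idea-screener`.

```python
from __future__ import annotations

# Weights copied from the score_weights block in this pipeline's frontmatter.
SCORE_WEIGHTS = {
    "discussion_worthiness": 0.24,
    "academic_value": 0.22,
    "evidence_grounding": 0.18,
    "direction_distinctness": 0.16,
    "first_probe_clarity": 0.10,
    "thesis_potential": 0.10,
}
KEEP_RANK_MAX = 7    # keep_rank_max, assumed to be an inclusive rank cut-off
MAYBE_RANK_MAX = 12  # maybe_rank_max, assumed to be an inclusive rank cut-off


def weighted_score(ratings: dict[str, float]) -> float:
    """Weighted sum of per-axis ratings; a missing axis counts as 0 (assumption)."""
    return sum(w * ratings.get(axis, 0.0) for axis, w in SCORE_WEIGHTS.items())


def screen(directions: list[dict]) -> list[dict]:
    """Rank directions by weighted score and label each one by its rank."""
    ranked = sorted(directions, key=lambda d: -weighted_score(d["ratings"]))
    rows = []
    for rank, direction in enumerate(ranked, start=1):
        if rank <= KEEP_RANK_MAX:
            decision = "keep"
        elif rank <= MAYBE_RANK_MAX:
            decision = "maybe"
        else:
            decision = "drop"
        rows.append({
            "id": direction["id"],
            "rank": rank,
            "score": round(weighted_score(direction["ratings"]), 4),
            "decision": decision,
        })
    return rows
```

Because the six weights sum to 1.0, the score stays on the same scale as the per-axis ratings, which makes rows in the screening table easy to compare by eye.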
## Stage 5 - Shortlist + memo synthesis + self-loop (C5)
required_skills:
- idea-shortlist-curator
- idea-memo-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- `output/trace/IDEA_SHORTLIST.md` is an internal convergence layer, not the user-facing final answer.
- `output/REPORT.md` is the terminal deliverable and should read like a discussion-ready research idea brainstorm memo.
- `deliverable-selfloop` should evaluate the full memo bundle plus its trace chain.
FILE:pipelines/lit-snapshot.pipeline.md
---
name: lit-snapshot
version: 2.3
profile: lit-snapshot
routing_hints: [snapshot, 快照, one-page, one page, 48h]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- outline/taxonomy.yml
- outline/outline.yml
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.lit-snapshot.csv
---
# Pipeline: literature snapshot (24-48h) [bullets-first]
Goal: a compact, reader-facing snapshot (`output/SNAPSHOT.md`) with a small but usable paper set and a paper-like structure. This is intentionally lighter than `arxiv-survey*` (no evidence packs, no BibTeX, no LaTeX).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- arxiv-search
- dedupe-rank
optional_skills:
- keyword-expansion
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Snapshot default: aim for a smaller but diverse set (e.g., core_set ~20-40) rather than a survey-scale pool.
- Practical retrieval loop (multi-query, then refine):
- Start broad with multiple query buckets (synonyms/acronyms/subtopics) and merge the results.
- If too few: add buckets + widen time window; if too noisy: rewrite keywords + add exclusions; rerun C1.
- If the snapshot feels generic, the fix is almost always upstream: broaden `queries.md` (more synonyms, fewer excludes, wider time window) and rerun C1.
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- outline-builder
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/outline.yml
human_checkpoint:
- approve: scope + outline (snapshot ToC)
- write_to: DECISIONS.md
Notes:
- Paper-like default: prefer fewer, thicker sections. For snapshots, avoid H3 explosion; treat H2 as the main organizing unit.
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent nodes and keep the ToC readable before writing.
## Stage 3 - Snapshot (C3) [SHORT PROSE OK]
required_skills:
- snapshot-writer
- deliverable-selfloop
- artifact-contract-auditor
optional_skills:
- prose-writer
produces:
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- The deliverable is bullets-first and pointer-heavy: every non-trivial claim should attach 1-2 concrete paper pointers from `papers/core_set.csv`.
- Avoid outline narration (e.g., `This section surveys ...`); write content claims + why-it-matters + pointers.
FILE:pipelines/peer-review.pipeline.md
---
name: peer-review
version: 2.3
profile: peer-review
routing_hints: [peer review, review report, referee, 审稿]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PAPER.md
- output/CLAIMS.md
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.peer-review.csv
---
# Pipeline: peer review / referee report
Goal: produce an actionable referee report (`output/REVIEW.md`) that is traceable to the submitted paper’s claims and evidence (no free-floating advice, no invented comparisons).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
## Stage 1 - Claims (C1)
required_skills:
- manuscript-ingest
- claims-extractor
produces:
- output/PAPER.md
- output/CLAIMS.md
Notes:
- Input expectation: you must have the manuscript text available as `output/PAPER.md` (or equivalent extracted text) before running this stage.
- Contract: every claim must include a source pointer (section/page/quote) so later critique is auditable.
## Stage 2 - Evidence audit (C2)
required_skills:
- evidence-auditor
- novelty-matrix
produces:
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
Notes:
- Evidence-auditor should write gaps/risks/verification steps, not “fix the paper”.
- Novelty-matrix should be conservative: compare against cited/known related work; if you lack sources, record that limitation explicitly.
## Stage 3 - Rubric write-up (C3)
required_skills:
- rubric-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Prefer concrete, minimal fixes (what experiment/ablation/analysis would resolve the concern) over generic “needs more experiments”.
FILE:pipelines/systematic-review.pipeline.md
---
name: systematic-review
version: 2.3
profile: systematic-review
routing_hints: [systematic review, systematic, prisma, 系统综述]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PROTOCOL.md
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/screening_log.csv
- papers/extraction_table.csv
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.systematic-review.csv
---
# Pipeline: systematic review (PRISMA-style)
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Protocol (C1)
required_skills:
- protocol-writer
produces:
- output/PROTOCOL.md
human_checkpoint:
- approve: protocol locked (query, inclusion/exclusion, databases, time window)
- write_to: DECISIONS.md
## Stage 2 - Retrieval & candidate pool (C2)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- arxiv-search
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Systematic-review contract: keep the candidate pool auditable. Prefer dedupe (good) over aggressive ranking/filtering (dangerous).
- `dedupe-rank` default: for this pipeline, it keeps the full deduped pool in `papers/core_set.csv` unless you explicitly set `queries.md:core_size` (screening should not silently drop papers).
## Stage 3 - Screening (C3)
required_skills:
- screening-manager
produces:
- papers/screening_log.csv
Notes:
- Use `papers/papers_dedup.jsonl` (or `papers/core_set.csv`) as the candidate list; `papers/screening_log.csv` must include protocol-grounded reasons for every decision.
- Practical auditability: number protocol clauses (`I1..` / `E1..`) and require screening rows to cite them (e.g., `reason_codes=E3`).
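One way to enforce the reason-code rule is a small check over the rows of `papers/screening_log.csv`. This is a minimal sketch under assumptions: the function name is hypothetical, multiple codes are assumed to be separated by semicolons, and the set of valid clause ids is passed in rather than parsed from `output/PROTOCOL.md`.

```python
import re

# Clause ids follow the I1.. / E1.. numbering suggested in the notes above.
CLAUSE_RE = re.compile(r"^[IE]\d+$")


def invalid_screening_rows(rows, protocol_clauses):
    """Return (row_index, problem) pairs for rows with missing or unknown reason codes."""
    problems = []
    for idx, row in enumerate(rows):
        raw = row.get("reason_codes") or ""
        codes = [code.strip() for code in raw.split(";") if code.strip()]
        if not codes:
            problems.append((idx, "no reason code"))
            continue
        for code in codes:
            if not CLAUSE_RE.match(code) or code not in protocol_clauses:
                problems.append((idx, f"unknown clause {code}"))
    return problems
```

Rows loaded with `csv.DictReader` can be passed in directly; an empty result means every decision cites at least one clause that exists in the protocol.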
## Stage 4 - Extraction (C4)
required_skills:
- extraction-form
- bias-assessor
produces:
- papers/extraction_table.csv
Notes:
- Extraction is schema-driven: do not write narrative synthesis here; keep all fields consistent and fill bias fields (low/unclear/high + notes).
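A schema check of this kind could look like the sketch below. The column names used in the example (`design`, `bias_selection`) and the function name are hypothetical; only the three bias levels come from the note above.

```python
# Allowed bias levels, as listed in the note above.
BIAS_LEVELS = {"low", "unclear", "high"}


def check_extraction_rows(rows, required_fields, bias_fields):
    """Return (row_index, field, problem) triples for empty fields and invalid bias levels."""
    issues = []
    for idx, row in enumerate(rows):
        for field in required_fields:
            if not (row.get(field) or "").strip():
                issues.append((idx, field, "empty"))
        for field in bias_fields:
            value = (row.get(field) or "").strip().lower()
            if value not in BIAS_LEVELS:
                issues.append((idx, field, "bias level must be low/unclear/high"))
    return issues
```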
## Stage 5 - Synthesis (C5) [PROSE ALLOWED]
required_skills:
- synthesis-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
FILE:pipelines/tutorial.pipeline.md
---
name: tutorial
version: 2.3
profile: tutorial
routing_hints: [tutorial, 教程]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/TUTORIAL_SPEC.md
- outline/concept_graph.yml
- outline/module_plan.yml
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.tutorial.csv
---
# Pipeline: tutorial (teaching loop)
Goal: a tutorial deliverable (`output/TUTORIAL.md`) that has a consistent running example and a real teaching loop (objectives -> steps -> exercises -> verification), not just a blog-style explanation.
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Spec (C1)
required_skills:
- tutorial-spec
produces:
- output/TUTORIAL_SPEC.md
Notes:
- The spec is the scope contract: audience, prerequisites, learning objectives, and a running example (if applicable).
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- concept-graph
- module-planner
- exercise-builder
produces:
- outline/concept_graph.yml
- outline/module_plan.yml
human_checkpoint:
- approve: target audience + scope + running example
- write_to: DECISIONS.md
Notes:
- Treat `outline/module_plan.yml` as the execution contract for writing: modules must have concrete outputs and at least one verifiable exercise each.
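The contract on `outline/module_plan.yml` can be checked mechanically before writing starts. The sketch below assumes a plan shaped like `{"modules": [{"id", "outputs", "exercises": [{"verification"}]}]}`; that schema and the function name are illustrative assumptions, not the actual format produced by `module-planner`.

```python
def modules_missing_contract(plan):
    """Return ids of modules that lack concrete outputs or a verifiable exercise."""
    failing = []
    for module in plan.get("modules", []):
        has_outputs = bool(module.get("outputs"))
        exercises = module.get("exercises") or []
        # An exercise counts as verifiable only if it states how to check the result.
        has_verifiable = any(exercise.get("verification") for exercise in exercises)
        if not (has_outputs and has_verifiable):
            failing.append(module.get("id", "<no id>"))
    return failing
```

The argument is a plain dictionary, such as the result of loading the YAML file, so the check works without depending on any particular loader.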
## Stage 3 - Writing (C3) [PROSE ALLOWED]
required_skills:
- tutorial-module-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Write only within the approved scope; if you discover missing prerequisites, record them as a follow-up instead of expanding scope silently.
FILE:tooling/__init__.py
FILE:tooling/common.py
from __future__ import annotations

import csv
import json
import os
import re
import shutil
import tempfile
from dataclasses import dataclass
from datetime import date, datetime
from pathlib import Path
from typing import Any, Iterable

import yaml


def today_iso() -> str:
    return date.today().isoformat()


def now_iso_seconds() -> str:
    return datetime.now().replace(microsecond=0).isoformat()


def ensure_dir(path: Path) -> None:
    path.mkdir(parents=True, exist_ok=True)


def atomic_write_text(path: Path, content: str) -> None:
    # Write to a temp file in the same directory, then swap it in with os.replace.
    ensure_dir(path.parent)
    fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
        os.replace(tmp_path, path)
    except Exception:
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise


def backup_existing(path: Path) -> Path:
    """Rename an existing file to a timestamped `.bak.*` sibling and return the backup path."""
    if not path.exists():
        return path
    stamp = datetime.now().replace(microsecond=0).isoformat().replace("-", "").replace(":", "")
    backup = path.with_name(f"{path.name}.bak.{stamp}")
    counter = 1
    while backup.exists():
        backup = path.with_name(f"{path.name}.bak.{stamp}.{counter}")
        counter += 1
    path.replace(backup)
    return backup


def parse_semicolon_list(value: str | None) -> list[str]:
    if not value:
        return []
    return [item.strip() for item in value.split(";") if item.strip()]


def normalize_title_for_dedupe(title: str) -> str:
    title = title.lower()
    title = re.sub(r"[^a-z0-9]+", " ", title)
    title = re.sub(r"\s+", " ", title).strip()
    return title


def normalize_axis_label(text: str) -> str:
    text = re.sub(r"\s+", " ", (text or "").strip().lower())
    text = text.rstrip(" .;:,;。")
    text = re.sub(r"\s*/\s*", " ", text)
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


def subsection_brief_generic_axis_norms() -> set[str]:
    """Axis labels that strict survey gates treat as scaffold-level defaults.

    Keep this aligned across brief generation and quality-gate checks so the
    generator does not promote an axis that the gate later classifies as generic.
    """
    axes = {
        "core mechanism and system architecture",
        "training and data setup",
        "evaluation protocol",
        "evaluation protocol (benchmarks / metrics / human)",
        "evaluation protocol (datasets / metrics / human)",
        "evaluation protocol (datasets, metrics, human evaluation)",
        "compute and efficiency",
        "compute and latency constraints",
        "efficiency and compute",
        "tool interface contract (schemas / protocols)",
        "tool selection / routing policy",
        "sandboxing / permissions / observability",
        "failure modes and limitations",
    }
    return {normalize_axis_label(axis) for axis in axes}


def read_jsonl(path: Path) -> list[dict[str, Any]]:
    records: list[dict[str, Any]] = []
    if not path.exists():
        return records
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            records.append(json.loads(line))
    return records


def write_jsonl(path: Path, records: Iterable[dict[str, Any]]) -> None:
    ensure_dir(path.parent)
    lines = [json.dumps(record, ensure_ascii=False) for record in records]
    atomic_write_text(path, "\n".join(lines) + ("\n" if lines else ""))


def read_tsv(path: Path) -> list[dict[str, str]]:
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [dict(row) for row in reader]


def write_tsv(path: Path, rows: list[dict[str, Any]], fieldnames: list[str]) -> None:
    ensure_dir(path.parent)
    fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
            writer.writeheader()
            for row in rows:
                writer.writerow({key: row.get(key, "") for key in fieldnames})
        os.replace(tmp_path, path)
    except Exception:
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise


@dataclass(frozen=True)
class UnitsTable:
    fieldnames: list[str]
    rows: list[dict[str, str]]

    @staticmethod
    def load(path: Path) -> "UnitsTable":
        with path.open("r", encoding="utf-8", newline="") as handle:
            reader = csv.DictReader(handle)
            fieldnames = list(reader.fieldnames or [])
            rows = [dict(row) for row in reader]
        return UnitsTable(fieldnames=fieldnames, rows=rows)

    def save(self, path: Path) -> None:
        ensure_dir(path.parent)
        fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
        try:
            with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
                writer = csv.DictWriter(handle, fieldnames=self.fieldnames)
                writer.writeheader()
                for row in self.rows:
                    writer.writerow({key: row.get(key, "") for key in self.fieldnames})
            os.replace(tmp_path, path)
        except Exception:
            try:
                os.unlink(tmp_path)
            except OSError:
                pass
            raise


def load_yaml(path: Path) -> Any:
    with path.open("r", encoding="utf-8") as handle:
        return yaml.safe_load(handle)


def dump_yaml(path: Path, data: Any) -> None:
    ensure_dir(path.parent)
    text = yaml.safe_dump(
        data,
        allow_unicode=True,
        sort_keys=False,
        default_flow_style=False,
        width=120,
    )
    atomic_write_text(path, text)


def copy_tree(src_dir: Path, dst_dir: Path, *, overwrite: bool) -> None:
    if not src_dir.is_dir():
        raise ValueError(f"Template directory not found: {src_dir}")
    ensure_dir(dst_dir)
    for src_path in src_dir.rglob("*"):
        rel = src_path.relative_to(src_dir)
        dst_path = dst_dir / rel
        if src_path.is_dir():
            ensure_dir(dst_path)
            continue
        ensure_dir(dst_path.parent)
        # Existing files are left untouched unless the caller asks to overwrite.
        if dst_path.exists() and not overwrite:
            continue
        shutil.copy2(src_path, dst_path)


def tokenize(text: str) -> list[str]:
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return [token for token in text.split() if token]


_EN_STOPWORDS = {
    "a",
    "an",
    "and",
    "are",
    "as",
    "at",
    "be",
    "by",
    "for",
    "from",
    "in",
    "is",
    "it",
    "of",
    "on",
    "or",
    "that",
    "the",
    "this",
    "to",
    "with",
    "we",
    "our",
    "via",
    "towards",
    "toward",
    "using",
    "use",
    "based",
    "new",
    "into",
    "over",
    "under",
    "between",
    "within",
    "without",
    "beyond",
}

_GENERIC_PAPER_WORDS = {
    "survey",
    "review",
    "tutorial",
    "paper",
    "approach",
    "method",
    "methods",
    "model",
    "models",
    "framework",
    "frameworks",
    "system",
    "systems",
    "learning",
    "deep",
    "neural",
    "network",
    "networks",
    "analysis",
    "benchmark",
    "benchmarks",
    "dataset",
    "datasets",
    "evaluation",
    "evaluating",
    "towards",
    "using",
    "based",
    "study",
    "studies",
}


def candidate_keywords(titles: Iterable[str], *, top_k: int, min_freq: int) -> list[str]:
    freq: dict[str, int] = {}
    for title in titles:
        for token in tokenize(title):
            if token in _EN_STOPWORDS or token in _GENERIC_PAPER_WORDS:
                continue
            if len(token) < 3:
                continue
            freq[token] = freq.get(token, 0) + 1
    # Most frequent first; ties broken alphabetically so the output is deterministic.
    candidates = [t for t, c in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])) if c >= min_freq]
    return candidates[:top_k]


def update_status_log(status_path: Path, line: str) -> None:
    ensure_dir(status_path.parent)
    if status_path.exists():
        existing = status_path.read_text(encoding="utf-8")
    else:
        existing = "# Status\n"
    if "## Run log" not in existing:
        existing = existing.rstrip() + "\n\n## Run log\n"
    updated = existing.rstrip() + f"\n- {line}\n"
    atomic_write_text(status_path, updated)


def update_status_field(status_path: Path, heading: str, value: str) -> None:
    heading_line = f"## {heading}".strip()
    bullet_line = f"- `{value}`"
    if status_path.exists():
        lines = status_path.read_text(encoding="utf-8").splitlines()
    else:
        lines = ["# Status"]
    out: list[str] = []
    i = 0
    updated = False
    while i < len(lines):
        line = lines[i]
        out.append(line)
        if line.strip() == heading_line:
            # Replace the bullet directly under the heading if there is one, else add it.
            if i + 1 < len(lines) and lines[i + 1].lstrip().startswith("-"):
                out.append(bullet_line)
                i += 2
                updated = True
                continue
            out.append(bullet_line)
            updated = True
        i += 1
    if not updated:
        out.extend(["", heading_line, bullet_line])
    atomic_write_text(status_path, "\n".join(out).rstrip() + "\n")


def decisions_has_approval(decisions_path: Path, checkpoint: str) -> bool:
    if not checkpoint:
        return False
    if not decisions_path.exists():
        return False
    text = decisions_path.read_text(encoding="utf-8")
    pattern = rf"^\s*-\s*\[[xX]\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b"
    return re.search(pattern, text, flags=re.MULTILINE) is not None


def ensure_decisions_approval_checklist(decisions_path: Path) -> None:
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n"
    if re.search(r"^##\s+Approvals\b", text, flags=re.MULTILINE):
        return
    workspace = decisions_path.parent
    checkpoints = _human_checkpoints_from_units(workspace)
    if not checkpoints:
        return
    checklist_lines = ["## Approvals (check to unblock)"]
    for checkpoint in checkpoints:
        hint = _approval_hint(checkpoint)
        suffix = f" ({hint})" if hint else ""
        checklist_lines.append(f"- [ ] Approve {checkpoint}{suffix}")
    checklist_lines.append("")
    checklist = "\n".join(checklist_lines)
    lines = text.splitlines()
    if lines and lines[0].startswith("#"):
        new_text = "\n".join([lines[0], "", checklist] + lines[1:]).rstrip() + "\n"
    else:
        new_text = (checklist + "\n" + text).rstrip() + "\n"
    atomic_write_text(decisions_path, new_text)


def set_decisions_approval(decisions_path: Path, checkpoint: str, *, approved: bool) -> None:
    checkpoint = checkpoint.strip()
    if not checkpoint:
        raise ValueError("checkpoint must be non-empty")
    ensure_decisions_approval_checklist(decisions_path)
    # The checklist helper writes nothing when UNITS.csv has no HUMAN checkpoints,
    # so the file may still be missing here; fall back to an empty log.
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n"
    lines = text.splitlines()
    pattern = re.compile(rf"^\s*-\s*\[\s*[xX ]\s*\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b")
    updated = False
    for idx, line in enumerate(lines):
        if pattern.search(line):
            lines[idx] = re.sub(
                r"\[\s*[xX ]\s*\]",
                "[x]" if approved else "[ ]",
                line,
                count=1,
            )
            updated = True
            break
    if not updated:
        insert_at = None
        for idx, line in enumerate(lines):
            if line.strip().startswith("## Approvals"):
                insert_at = idx + 1
                break
        if insert_at is None:
            lines.append("")
            lines.append("## Approvals (check to unblock)")
            insert_at = len(lines)
        lines.insert(insert_at, f"- [{'x' if approved else ' '}] Approve {checkpoint}")
    atomic_write_text(decisions_path, "\n".join(lines).rstrip() + "\n")


def _human_checkpoints_from_units(workspace: Path) -> list[str]:
    units_path = workspace / "UNITS.csv"
    if not units_path.exists():
        return []
    try:
        table = UnitsTable.load(units_path)
    except Exception:
        return []
    seen: set[str] = set()
    out: list[str] = []
    for row in table.rows:
        owner = (row.get("owner") or "").strip().upper()
        if owner != "HUMAN":
            continue
        checkpoint = (row.get("checkpoint") or "").strip()
        if checkpoint and checkpoint not in seen:
            seen.add(checkpoint)
            out.append(checkpoint)
    return out


def _approval_hint(checkpoint: str) -> str:
    hints = {
        "C0": "kickoff: scope/sources/time window/constraints",
        "C1": "retrieval + core set",
        "C2": "scope + outline",
        "C3": "evidence ready",
        "C4": "citations verified",
        "C5": "allow prose writing",
    }
    return hints.get(checkpoint, "")
def upsert_checkpoint_block(decisions_path: Path, checkpoint: str, markdown_block: str) -> None:
begin = f"<!-- BEGIN CHECKPOINT:{checkpoint} -->"
end = f"<!-- END CHECKPOINT:{checkpoint} -->"
block = "\n".join([begin, markdown_block.rstrip(), end, ""]).rstrip() + "\n"
if decisions_path.exists():
text = decisions_path.read_text(encoding="utf-8")
else:
text = "# Decisions log\n\n"
ensure_decisions_approval_checklist(decisions_path)
text = decisions_path.read_text(encoding="utf-8")
pattern = re.compile(
rf"{re.escape(begin)}.*?{re.escape(end)}\n?",
flags=re.DOTALL,
)
if pattern.search(text):
new_text = pattern.sub(block, text)
else:
new_text = text.rstrip() + "\n\n" + block
atomic_write_text(decisions_path, new_text)
def seed_queries_from_topic(queries_path: Path, topic: str) -> None:
topic = topic.strip()
if not topic:
return
if queries_path.exists():
lines = queries_path.read_text(encoding="utf-8").splitlines()
else:
lines = [
"# Queries",
"",
"## Primary query",
"- keywords:",
" - \"\"",
"- exclude:",
" - \"\"",
"- max_results: \"\"",
"- core_size: \"\"",
"- time window:",
" - from: \"\"",
" - to: \"\"",
"",
"## Notes",
"-",
]
def _has_nonempty_values(token: str) -> bool:
in_block = False
for raw in lines:
stripped = raw.strip()
if stripped.startswith(f"- {token}:"):
in_block = True
continue
if not in_block:
continue
if raw.startswith(" - "):
value = stripped[2:].strip().strip('"').strip("'")
if value:
return True
continue
if stripped.startswith("- "):
break
return False
def _has_nonempty_scalar(token: str) -> bool:
for raw in lines:
stripped = raw.strip()
if not stripped.startswith(f"- {token}:"):
continue
value = stripped.split(":", 1)[1].split("#", 1)[0].strip().strip('"').strip("'")
return bool(value)
return False
def _has_nonempty_time_field(field: str) -> bool:
# Looks for lines like: ' - from: "2022"'
for raw in lines:
stripped = raw.strip()
if not stripped.startswith(f"- {field}:"):
continue
value = stripped.split(":", 1)[1].strip().strip('"').strip("'")
return bool(value)
return False
has_keywords = _has_nonempty_values("keywords")
has_excludes = _has_nonempty_values("exclude")
has_time_from = _has_nonempty_time_field("from")
has_time_to = _has_nonempty_time_field("to")
has_max_results = _has_nonempty_scalar("max_results")
has_core_size = _has_nonempty_scalar("core_size")
workspace = queries_path.parent
profile = pipeline_profile(workspace)
query_defaults = pipeline_query_defaults(workspace)
raw_tlow = topic.lower()
topic_for_queries = _sanitize_topic_for_query_seed(topic)
keyword_suggestions = [topic_for_queries]
tlow = topic_for_queries.lower()
is_agent = any(t in tlow for t in ("agent", "agents", "agentic"))
is_embodied = any(
t in tlow
for t in (
"embodied ai",
"embodied intelligence",
"embodied agent",
"embodied robotics",
"robot foundation model",
"robot learning",
"robot manipulation",
"vision-language-action",
"vla",
"generalist robot",
)
)
is_text_to_image = any(t in tlow for t in ("text-to-image", "text to image", "t2i"))
is_text_to_video = any(t in tlow for t in ("text-to-video", "text to video", "t2v"))
is_diffusion = "diffusion" in tlow
is_generative = is_text_to_image or is_text_to_video or is_diffusion or ("image generation" in tlow) or ("generative" in tlow)
if is_agent:
keyword_suggestions.extend(
[
"LLM agent",
"language model agent",
"tool use",
"function calling",
"tool-using agent",
"planning",
"memory",
"multi-agent",
"benchmark",
"safety",
]
)
exclude_suggestions: list[str] = []
if is_agent:
exclude_suggestions.append("agent-based modeling")
exclude_suggestions.extend(["react hooks", "perovskite", "banach", "coxeter"])
if is_embodied:
keyword_suggestions.extend(
[
"embodied AI survey",
"embodied AI review",
"embodied intelligence survey",
"embodied agent survey",
"robot foundation model survey",
"robot learning survey",
"robot manipulation survey",
"embodied robotics survey",
"vision-language-action survey",
"vision-language-action model",
"robot foundation model",
"generalist robot policy",
"world model robot",
]
)
stripped_output_terms = topic_for_queries.lower().strip() != raw_tlow.strip()
if stripped_output_terms and any(t in raw_tlow for t in ("latex", "pdf", "markdown", "typesetting")):
exclude_suggestions.extend(["latex", "pdf", "typesetting", "document layout"])
if is_generative:
keyword_suggestions.extend(
[
"text-to-image generation",
"text-guided image generation",
"diffusion model",
"denoising diffusion probabilistic model",
"latent diffusion",
"stable diffusion",
"classifier-free guidance",
"diffusion transformer",
"DiT",
"masked generative transformer",
"MaskGIT",
"autoregressive image generation",
"VQGAN",
"VQ-VAE",
"ControlNet",
"DreamBooth",
"textual inversion",
"LoRA fine-tuning",
]
)
default_max_results = query_defaults.get("max_results")
# Keep the suggestion a string in both branches so later checks and formatting see one type.
max_results_suggestion = str(default_max_results) if str(default_max_results or "").strip() else str(
1800 if profile == "arxiv-survey" else (800 if (is_agent or is_generative or is_embodied) else 300)
)
time_from_suggestion = (
"2018" if is_embodied
else ("2022" if (is_agent and ("llm" in tlow or "language model" in tlow)) else ("2020" if is_generative else ""))
)
core_size_suggestion = str(query_defaults.get("core_size") or "").strip() or ("300" if profile == "arxiv-survey" else "")
out: list[str] = []
i = 0
while i < len(lines):
line = lines[i]
stripped = line.strip()
if stripped.startswith("- keywords:") and not has_keywords:
out.append(line)
i += 1
while i < len(lines) and lines[i].startswith(" - "):
i += 1
for kw in _dedupe_preserve_order(keyword_suggestions)[:14]:
out.append(f" - \"{kw}\"")
continue
if stripped.startswith("- exclude:") and not has_excludes:
out.append(line)
i += 1
while i < len(lines) and lines[i].startswith(" - "):
i += 1
for ex in _dedupe_preserve_order(exclude_suggestions)[:10]:
out.append(f" - \"{ex}\"")
continue
if stripped.startswith("- max_results:") and not has_max_results and max_results_suggestion:
out.append(f"- max_results: \"{max_results_suggestion}\"")
i += 1
continue
if stripped.startswith("- core_size:") and not has_core_size and core_size_suggestion:
out.append(f"- core_size: \"{core_size_suggestion}\"")
i += 1
continue
if stripped.startswith("- time window:") and not (has_time_from or has_time_to) and time_from_suggestion:
out.append(line)
i += 1
# Skip existing from/to lines if present.
while i < len(lines) and lines[i].startswith(" -"):
i += 1
out.append(f" - from: \"{time_from_suggestion}\"")
out.append(" - to: \"\"")
continue
out.append(line)
i += 1
out = _materialize_missing_query_defaults(out, query_defaults, allowed_fields=pipeline_overridable_query_fields(workspace))
atomic_write_text(queries_path, "\n".join(out).rstrip() + "\n")
def _dedupe_preserve_order(items: list[str]) -> list[str]:
"""Strip each item and drop empties and duplicates, keeping first-seen order."""
seen: set[str] = set()
out: list[str] = []
for item in items:
item = item.strip()
if not item or item in seen:
continue
seen.add(item)
out.append(item)
return out
def _sanitize_topic_for_query_seed(topic: str) -> str:
"""Remove output-format phrases such as "with LaTeX/PDF output" from *topic*; fall back to the original topic if nothing remains."""
text = str(topic or "").strip()
if not text:
return ""
patterns = [
r"(?i)\bwith\s+latex\s*/\s*pdf\s+output\b",
r"(?i)\bwith\s+latex\s+output\b",
r"(?i)\bwith\s+pdf\s+output\b",
r"(?i)\bwith\s+markdown\s+output\b",
r"(?i)\blatex\s*/\s*pdf\s+output\b",
r"(?i)\bpdf\s+output\b",
r"(?i)\blatex\s+output\b",
r"(?i)\bmarkdown\s+output\b",
r"(?i)\bfor\s+latex\s*/\s*pdf\b",
]
for pattern in patterns:
text = re.sub(pattern, "", text)
text = re.sub(r"\s+", " ", text).strip(" ,;:-")
return text or topic
_LEGACY_PIPELINE_ALIASES = {
"idea-finder": "idea-brainstorm",
"idea-finder.pipeline.md": "idea-brainstorm",
"pipelines/idea-finder.pipeline.md": "idea-brainstorm",
}
def find_repo_root(start: Path | None = None) -> Path:
"""Walk up from *start* (default: this file) looking for AGENTS.md."""
candidate = (start or Path(__file__)).resolve()
for _ in range(10):
if (candidate / "AGENTS.md").exists():
return candidate
parent = candidate.parent
if parent == candidate:
break
candidate = parent
raise FileNotFoundError("Could not find repo root (AGENTS.md marker)")
def _normalize_pipeline_lock_value(value: str) -> str:
raw = str(value or "").strip()
return _LEGACY_PIPELINE_ALIASES.get(raw, raw)
def resolve_pipeline_spec_path(*, repo_root: Path, pipeline_value: str) -> Path | None:
value = _normalize_pipeline_lock_value(pipeline_value)
if not value:
return None
candidate = Path(value)
if candidate.is_absolute() and candidate.exists():
return candidate.resolve()
rel_candidate = repo_root / value
if rel_candidate.exists():
return rel_candidate.resolve()
filename = Path(value).name
if filename:
direct = repo_root / "pipelines" / filename
if direct.exists():
return direct.resolve()
stem = filename
if stem.endswith(".pipeline.md"):
stem = stem[: -len(".pipeline.md")]
if stem:
direct = repo_root / "pipelines" / f"{stem}.pipeline.md"
if direct.exists():
return direct.resolve()
return None
def load_workspace_pipeline_spec(workspace: Path):
from tooling.pipeline_spec import PipelineSpec
try:
repo_root = find_repo_root(workspace)
except FileNotFoundError:
repo_root = Path(__file__).resolve().parents[1]
lock_path = workspace / "PIPELINE.lock.md"
if not lock_path.exists():
return None
pipeline_name = ""
try:
for raw in lock_path.read_text(encoding="utf-8", errors="ignore").splitlines():
line = raw.strip()
if line.startswith("pipeline:"):
pipeline_name = line.split(":", 1)[1].strip()
break
except Exception:
return None
if not pipeline_name:
return None
spec_path = resolve_pipeline_spec_path(repo_root=repo_root, pipeline_value=pipeline_name)
if spec_path is None:
return None
try:
return PipelineSpec.load(spec_path)
except Exception:
return None
def pipeline_query_defaults(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
return dict(spec.query_defaults) if spec is not None else {}
def pipeline_quality_contract(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
return dict(spec.quality_contract) if spec is not None else {}
def pipeline_quality_contract_value(workspace: Path, *keys: str, default: Any = None) -> Any:
current: Any = pipeline_quality_contract(workspace)
for key in keys:
if not isinstance(current, dict):
return default
current = current.get(str(key))
if current is None:
return default
return current
def pipeline_query_default(workspace: Path, key: str, default: Any = None) -> Any:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return default
return spec.query_default(key, default)
def pipeline_overridable_query_fields(workspace: Path) -> set[str]:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return set()
return set(spec.overridable_query_fields)
def pipeline_profile(workspace: Path) -> str:
"""Return the pipeline profile for a workspace.
Reads PIPELINE.lock.md to get the pipeline name, then loads the
pipeline spec file and reads its ``profile`` frontmatter field.
Falls back to ``"default"`` if anything is missing.
"""
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return "default"
return str(spec.profile or "default").strip() or "default"
def latest_outline_state(workspace: Path) -> dict[str, Any]:
path = Path(workspace).resolve() / "outline" / "outline_state.jsonl"
records = [rec for rec in read_jsonl(path) if isinstance(rec, dict)]
return dict(records[-1]) if records else {}
def _materialize_missing_query_defaults(lines: list[str], query_defaults: dict[str, Any], *, allowed_fields: set[str] | None = None) -> list[str]:
if not query_defaults:
return lines
existing_keys: set[str] = set()
for raw in lines:
stripped = raw.strip()
if not stripped.startswith("- ") or ":" not in stripped:
continue
key = stripped[2:].split(":", 1)[0].strip().lower().replace(" ", "_").replace("-", "_")
if key:
existing_keys.add(key)
additions: list[str] = []
for key, value in query_defaults.items():
norm_key = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
if not norm_key or norm_key in existing_keys:
continue
if allowed_fields and norm_key not in allowed_fields:
continue
rendered = _render_query_scalar(value)
if rendered is None:
continue
additions.append(f'- {norm_key}: "{rendered}"')
if not additions:
return lines
out = list(lines)
if out and out[-1].strip():
out.append("")
out.extend(additions)
return out
def _render_query_scalar(value: Any) -> str | None:
if value is None:
return None
if isinstance(value, bool):
return "true" if value else "false"
if isinstance(value, (int, float)):
return str(value)
if isinstance(value, str):
text = value.strip()
return text or None
return None
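The checkbox handling in `set_decisions_approval` can be tried without touching the filesystem. The following is a minimal sketch under stated assumptions: `toggle_approval` is a hypothetical in-memory helper (not part of `tooling.common`) that mirrors the same regex and insertion rules on a plain string instead of reading and writing `DECISIONS.md`.

```python
import re


def toggle_approval(text: str, checkpoint: str, approved: bool) -> str:
    """Return *text* with the checkbox for *checkpoint* set or cleared.

    Hypothetical helper for illustration; the original works on a file path.
    """
    lines = text.splitlines()
    pattern = re.compile(
        rf"^\s*-\s*\[\s*[xX ]\s*\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b"
    )
    mark = "[x]" if approved else "[ ]"
    for idx, line in enumerate(lines):
        if pattern.search(line):
            # Rewrite only the first checkbox on the matching line.
            lines[idx] = re.sub(r"\[\s*[xX ]\s*\]", mark, line, count=1)
            break
    else:
        # No entry yet: add one directly under the approvals heading,
        # creating the heading at the end of the text if it is absent.
        heading = next(
            (i for i, ln in enumerate(lines) if ln.strip().startswith("## Approvals")),
            None,
        )
        if heading is None:
            lines += ["", "## Approvals (check to unblock)"]
            heading = len(lines) - 1
        lines.insert(heading + 1, f"- {mark} Approve {checkpoint}")
    return "\n".join(lines).rstrip() + "\n"


sample = "# Decisions log\n\n## Approvals (check to unblock)\n- [ ] Approve C2\n"
approved_text = toggle_approval(sample, "C2", approved=True)   # ticks the C2 box
added_text = toggle_approval(sample, "C5", approved=False)     # adds an unticked C5 entry
```

Because the pattern ends in `\b`, a checkpoint such as `C2` does not match an entry for `C20`, and a new entry lands directly under the heading rather than at the end of the list.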
FILE:tooling/executor.py
from __future__ import annotations
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path
from tooling.common import (
UnitsTable,
atomic_write_text,
decisions_has_approval,
ensure_dir,
latest_outline_state,
load_workspace_pipeline_spec,
now_iso_seconds,
parse_semicolon_list,
set_decisions_approval,
update_status_field,
update_status_log,
)
@dataclass(frozen=True)
class RunResult:
unit_id: str | None
status: str
message: str
def _section_first_cutover_block_message(*, workspace: Path, outputs: list[str]) -> str | None:
spec = load_workspace_pipeline_spec(workspace)
if spec is None or str(spec.structure_mode or "").strip().lower() != "section_first":
return None
if "outline/outline_state.jsonl" not in outputs:
return None
latest = latest_outline_state(workspace)
if not latest:
return "Section-first cutover is not actionable yet: `outline/outline_state.jsonl` is missing or empty. Rerun `outline-refiner` after fixing the chapter/section layer."
structure_phase = str(latest.get("structure_phase") or "").strip()
h3_status = str(latest.get("h3_status") or "").strip().lower()
reroute_target = str(latest.get("reroute_target") or "").strip()
retry_budget = str(latest.get("retry_budget_remaining") or "").strip()
reroute_reason = str(latest.get("reroute_reason") or "").strip()
if structure_phase.lower() == "decomposed" and h3_status == "stable":
return None
missing: list[str] = []
for key in ("structure_phase", "h3_status", "approval_status", "reroute_target", "retry_budget_remaining"):
if key not in latest:
missing.append(key)
continue
if key in {"structure_phase", "h3_status"} and not str(latest.get(key) or "").strip():
missing.append(key)
if missing:
return (
"Section-first cutover cannot advance because the latest `outline_state.jsonl` record is missing "
f"required fields: {', '.join(missing)}."
)
reroute_label = reroute_target or "unknown"
retry_label = retry_budget or "unknown"
reason_suffix = f" Reason: {reroute_reason}." if reroute_reason else ""
return (
"Section-first cutover is still blocked; "
f"latest outline state has structure_phase={structure_phase or 'missing'}, "
f"h3_status={h3_status or 'missing'}, reroute_target={reroute_label}, "
f"retry_budget_remaining={retry_label}.{reason_suffix} Fix the reroute target explicitly, then rerun this unit."
)
def _append_run_error(*, workspace: Path, unit_id: str, skill: str, kind: str, message: str, log_rel: str | None) -> None:
"""Append a short failure record to `output/RUN_ERRORS.md` (workspace-local).
This is a human-facing error sink that survives reruns and makes BLOCKED states debuggable.
"""
try:
ensure_dir(workspace / "output")
out_path = workspace / "output" / "RUN_ERRORS.md"
stamp = now_iso_seconds()
log_hint = f" (log: `{log_rel}`)" if log_rel else ""
line = f"- {stamp} `{unit_id}` `{skill}` `{kind}`: {message}{log_hint}"
if out_path.exists() and out_path.stat().st_size > 0:
prev = out_path.read_text(encoding="utf-8", errors="ignore").rstrip() + "\n"
else:
prev = "# Run errors\n\n"
atomic_write_text(out_path, prev + line + "\n")
except Exception:
# Never let the runner crash while trying to log an error.
return
def run_one_unit(
*,
workspace: Path,
repo_root: Path,
strict: bool = False,
auto_approve: set[str] | None = None,
) -> RunResult:
units_path = workspace / "UNITS.csv"
status_path = workspace / "STATUS.md"
if not units_path.exists():
return RunResult(unit_id=None, status="ERROR", message=f"Missing {units_path}")
table = UnitsTable.load(units_path)
runnable_idx = _find_first_runnable(table)
if runnable_idx is None:
return RunResult(unit_id=None, status="IDLE", message="No runnable unit found")
row = table.rows[runnable_idx]
unit_id = row.get("unit_id", "").strip()
skill = row.get("skill", "").strip()
owner = row.get("owner", "").strip().upper()
row["status"] = "DOING"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DOING {skill}")
auto_approve_set = {str(x or "").strip().upper() for x in (auto_approve or set()) if str(x or "").strip()}
if owner == "HUMAN":
checkpoint = row.get("checkpoint", "").strip()
if checkpoint and decisions_has_approval(workspace / "DECISIONS.md", checkpoint):
row["status"] = "DONE"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (HUMAN approved {checkpoint})")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="DONE", message=f"HUMAN approved {checkpoint}")
if checkpoint and checkpoint.upper() in auto_approve_set:
set_decisions_approval(workspace / "DECISIONS.md", checkpoint, approved=True)
row["status"] = "DONE"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (AUTO approved {checkpoint})")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="DONE", message=f"AUTO approved {checkpoint}")
row["status"] = "BLOCKED"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (await HUMAN approval {checkpoint})")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Await HUMAN approval {checkpoint} in DECISIONS.md")
script_path = repo_root / "scripts" / "run.py"
if not script_path.exists():
row["status"] = "BLOCKED"
table.save(units_path)
skill_md = "SKILL.md"
update_status_log(
status_path,
(
f"{now_iso_seconds()} {unit_id} BLOCKED "
f"(no script for {skill}; run manually per {skill_md} then mark DONE)"
),
)
_refresh_status_checkpoint(status_path, table)
return RunResult(
unit_id=unit_id,
status="BLOCKED",
message=(
f"No executable script for skill '{skill}'. "
f"Run it manually by following `{skill_md}`, write the required outputs, "
f"then mark the unit DONE (e.g., `python scripts/pipeline.py mark --workspace {workspace} --unit-id {unit_id} --status DONE`)."
),
)
inputs = parse_semicolon_list(row.get("inputs"))
raw_outputs = parse_semicolon_list(row.get("outputs"))
outputs = [_strip_optional_marker(rel) for rel in raw_outputs]
required_outputs = [outputs[i] for i, rel in enumerate(raw_outputs) if not rel.strip().startswith("?")]
checkpoint = row.get("checkpoint", "").strip()
cmd = [
sys.executable,
str(script_path),
"--workspace",
str(workspace),
"--unit-id",
unit_id,
"--inputs",
";".join(inputs),
"--outputs",
";".join(outputs),
"--checkpoint",
checkpoint,
]
log_rel = f"output/unit_logs/{unit_id}.{skill}.log"
log_path = workspace / log_rel
try:
completed = subprocess.run(cmd, check=False, capture_output=True, text=True)
if completed.stdout or completed.stderr or completed.returncode != 0:
ensure_dir(log_path.parent)
body = [
f"# Unit log\n",
f"- unit_id: {unit_id}\n",
f"- skill: {skill}\n",
f"- exit: {completed.returncode}\n",
f"- cmd: {' '.join(cmd)}\n",
"\n## stdout\n\n",
(completed.stdout or "(empty)") + "\n",
"\n## stderr\n\n",
(completed.stderr or "(empty)") + "\n",
]
atomic_write_text(log_path, "".join(body))
except Exception as exc: # pragma: no cover
row["status"] = "BLOCKED"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (exec error)")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="exec_error",
message=f"{type(exc).__name__}: {exc}",
log_rel=None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=str(exc))
missing = [rel for rel in required_outputs if rel and not (workspace / rel).exists()]
if completed.returncode == 0 and not missing:
if strict:
from tooling.quality_gate import check_unit_outputs, write_quality_report
try:
issues = check_unit_outputs(skill=skill, workspace=workspace, outputs=outputs)
except Exception as exc: # pragma: no cover
from tooling.quality_gate import QualityIssue
issues = [
QualityIssue(
code="quality_gate_exception",
message=f"Quality gate crashed: {type(exc).__name__}: {exc}",
)
]
# Rewrite an existing QUALITY_GATE.md even when there are no issues, so a stale failure report does not linger after a successful run.
report_path = workspace / "output" / "QUALITY_GATE.md"
if issues or report_path.exists():
write_quality_report(workspace=workspace, unit_id=unit_id, skill=skill, issues=issues)
if issues:
row["status"] = "BLOCKED"
table.save(units_path)
rel_report = str((workspace / "output" / "QUALITY_GATE.md").relative_to(workspace))
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (quality gate: {rel_report})")
_refresh_status_checkpoint(status_path, table)
reroute_hint = _reroute_hint(workspace)
return RunResult(
unit_id=unit_id,
status="BLOCKED",
message=f"Quality gate failed; see {rel_report}" + (f"; {reroute_hint}" if reroute_hint else ""),
)
cutover_block = _section_first_cutover_block_message(workspace=workspace, outputs=outputs)
if cutover_block:
row["status"] = "BLOCKED"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (section-first cutover)")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="section_first_cutover",
message=cutover_block,
log_rel=log_rel if log_path.exists() else None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=cutover_block)
row["status"] = "DONE"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE {skill}")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="DONE", message="OK")
row["status"] = "BLOCKED"
table.save(units_path)
if missing:
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (missing outputs: {', '.join(missing)})")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="missing_outputs",
message=f"Missing outputs: {', '.join(missing)}",
log_rel=log_rel if log_path.exists() else None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Missing outputs: {', '.join(missing)}" + (f"; see {log_rel}" if log_path.exists() else ""))
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (script failed)")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="script_failed",
message=f"Skill script failed (exit {completed.returncode})",
log_rel=log_rel if log_path.exists() else None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Skill script failed (exit {completed.returncode})" + (f"; see {log_rel}" if log_path.exists() else ""))
def _find_first_runnable(table: UnitsTable) -> int | None:
"""Return the index of the first TODO/BLOCKED unit whose dependencies are all DONE or SKIP, or None."""
status_ok = {"DONE", "SKIP"}
unit_by_id = {row.get("unit_id", ""): row for row in table.rows}
for idx, row in enumerate(table.rows):
if row.get("status", "").strip().upper() not in {"TODO", "BLOCKED"}:
continue
deps = parse_semicolon_list(row.get("depends_on"))
if not deps:
return idx
deps_done = True
for dep_id in deps:
dep = unit_by_id.get(dep_id)
if not dep:
deps_done = False
break
if dep.get("status", "").strip().upper() not in status_ok:
deps_done = False
break
if deps_done:
return idx
return None
def _refresh_status_checkpoint(status_path: Path, table: UnitsTable) -> None:
checkpoint = _compute_current_checkpoint(table)
update_status_field(status_path, "Current checkpoint", checkpoint)
def _compute_current_checkpoint(table: UnitsTable) -> str:
"""Return the checkpoint of the first unit that is not DONE/SKIP ("C0" if blank), or "DONE" when every unit is finished."""
for row in table.rows:
if row.get("status", "").strip().upper() not in {"DONE", "SKIP"}:
return (row.get("checkpoint") or "").strip() or "C0"
return "DONE"
def invalidate_downstream_units(table: UnitsTable, *, root_unit_id: str) -> list[str]:
"""Reset all transitive downstream dependents of `root_unit_id` to TODO.
This is used when a previously satisfied upstream unit is reopened for rerun.
Keeping downstream units as DONE would otherwise leave stale artifacts in place
and make later `run` invocations stop too early.
"""
root = str(root_unit_id or "").strip()
if not root:
return []
direct_children: dict[str, list[dict[str, str]]] = {}
for row in table.rows:
for dep in parse_semicolon_list(row.get("depends_on")):
direct_children.setdefault(dep, []).append(row)
affected: list[str] = []
seen: set[str] = set()
stack = [root]
while stack:
current = stack.pop()
for child in direct_children.get(current, []):
child_id = str(child.get("unit_id") or "").strip()
if not child_id or child_id in seen:
continue
seen.add(child_id)
if str(child.get("status") or "").strip().upper() != "TODO":
child["status"] = "TODO"
affected.append(child_id)
stack.append(child_id)
return affected
def _strip_optional_marker(relpath: str) -> str:
"""Drop a leading "?" (the optional-output marker) from a relative path."""
relpath = (relpath or "").strip()
if relpath.startswith("?"):
return relpath[1:].strip()
return relpath
def _reroute_hint(workspace: Path) -> str:
path = workspace / "output" / "REROUTE_STATE.json"
if not path.exists() or path.stat().st_size <= 0:
return ""
try:
import json
data = json.loads(path.read_text(encoding="utf-8", errors="ignore") or "{}")
except Exception:
return ""
if not isinstance(data, dict):
return ""
target = str(data.get("reroute_target") or "").strip()
status = str(data.get("status") or "").strip()
phase = str(data.get("structure_phase") or "").strip()
h3 = str(data.get("h3_status") or "").strip()
reason = str(data.get("reroute_reason") or "").strip()
if not any([target, status, phase, h3]):
return ""
parts = []
if status:
parts.append(f"reroute_status={status}")
if target:
parts.append(f"reroute_target={target}")
if phase:
parts.append(f"structure_phase={phase}")
if h3:
parts.append(f"h3_status={h3}")
if reason:
parts.append(f"reason={reason}")
return ", ".join(parts)
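The interplay between `_find_first_runnable` and `invalidate_downstream_units` is easier to see on a small in-memory table. The following is a minimal sketch under stated assumptions: rows are plain dicts rather than a `UnitsTable`, and `parse_deps`, `first_runnable`, and `invalidate_downstream` are hypothetical stand-ins. In particular, `parse_deps` assumes `parse_semicolon_list` simply splits on `;` and drops blanks, which this section does not show.

```python
from typing import Optional


def parse_deps(value: Optional[str]) -> list:
    """Split a semicolon-separated field into ids (assumed format)."""
    return [part.strip() for part in (value or "").split(";") if part.strip()]


def first_runnable(rows: list) -> Optional[int]:
    """Index of the first TODO/BLOCKED row whose dependencies are DONE or SKIP."""
    by_id = {row.get("unit_id", ""): row for row in rows}
    for idx, row in enumerate(rows):
        if row.get("status", "").strip().upper() not in {"TODO", "BLOCKED"}:
            continue
        deps = parse_deps(row.get("depends_on"))
        # An unknown dependency id counts as unfinished.
        if all(
            by_id.get(dep, {}).get("status", "").strip().upper() in {"DONE", "SKIP"}
            for dep in deps
        ):
            return idx
    return None


def invalidate_downstream(rows: list, root_unit_id: str) -> list:
    """Reset every transitive dependent of the root to TODO; return changed ids."""
    children: dict = {}
    for row in rows:
        for dep in parse_deps(row.get("depends_on")):
            children.setdefault(dep, []).append(row)
    affected, seen, stack = [], set(), [root_unit_id]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            child_id = child.get("unit_id", "").strip()
            if not child_id or child_id in seen:
                continue
            seen.add(child_id)
            if child.get("status", "").strip().upper() != "TODO":
                child["status"] = "TODO"
                affected.append(child_id)
            stack.append(child_id)
    return affected


rows = [
    {"unit_id": "U1", "status": "DONE", "depends_on": ""},
    {"unit_id": "U2", "status": "DONE", "depends_on": "U1"},
    {"unit_id": "U3", "status": "TODO", "depends_on": "U2"},
    {"unit_id": "U4", "status": "DONE", "depends_on": "U2;U3"},
]
next_idx = first_runnable(rows)                 # index of U3
reset_ids = invalidate_downstream(rows, "U1")   # U2 and U4 are reopened
next_after_reset = first_runnable(rows)         # index of U2
```

Reopening `U1` moves the next runnable unit back from `U3` to `U2`. This is the situation the docstring of `invalidate_downstream_units` describes: without the reset, downstream units would stay DONE and later runs would stop too early.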
FILE:tooling/ideation.py
from __future__ import annotations
import json
import re
from dataclasses import asdict, dataclass, is_dataclass
from pathlib import Path
from typing import Any, Iterable
from tooling.common import (
atomic_write_text,
ensure_dir,
load_workspace_pipeline_spec,
pipeline_overridable_query_fields,
pipeline_query_default,
read_jsonl,
)
DEFAULT_IDEA_RUBRIC: list[tuple[str, float, str]] = [
("discussion_worthiness", 0.24, "is this direction worth a serious PI/PhD discussion right now?"),
("academic_value", 0.22, "could it open a meaningful paper/thesis-worthy research angle?"),
("evidence_grounding", 0.18, "can we point to concrete literature tensions rather than pure speculation?"),
("direction_distinctness", 0.16, "is it genuinely different from neighboring directions, not just a wording variant?"),
("first_probe_clarity", 0.10, "is there a plausible low-cost first probe that would teach us something?"),
("thesis_potential", 0.10, "does it plausibly grow beyond a one-off note into a stronger line of work?"),
]
STOPWORDS = {
"a", "an", "and", "the", "of", "to", "for", "in", "on", "with", "via", "by", "from", "or",
"work", "works", "study", "studies", "survey", "review", "benchmarks", "benchmark", "evaluation",
"agents", "agent", "model", "models", "llm", "large", "language", "using", "based",
}
CLUSTER_AXIS_HINTS: list[tuple[list[str], list[str], str, str]] = [
(["agent loop", "action"], ["observability granularity", "action-space design", "tool/environment boundary"], "mechanism", "Could sharpen how we think about the basic agent loop rather than adding yet another full stack."),
(["tool", "orchestration", "interface"], ["tool routing", "permission model", "API reliability assumptions"], "systems", "Could clarify whether interface design is the real bottleneck behind seemingly agent-level gains."),
(["planning", "reasoning"], ["search depth", "verification loop", "partial-observability handling"], "mechanism", "Could turn vague talk about reasoning quality into a sharper research question about what actually drives robust planning."),
(["memory", "retrieval", "rag"], ["retrieval policy", "state summarization cadence", "memory horizon"], "mechanism", "Could reveal when memory is a genuine capability lever versus a reporting convenience."),
(["self-improvement", "adaptation"], ["feedback type", "adaptation horizon", "evaluation signal quality"], "adaptation", "Could distinguish genuine improvement from evaluation-sensitive self-optimization."),
(["multi-agent", "coordination"], ["role specialization", "communication budget", "aggregation rule"], "coordination", "Could separate coordination benefits from simple redundancy or verification effects."),
(["benchmark", "evaluation"], ["metric choice", "task slice design", "failure-analysis lens"], "evaluation", "Could make evaluation choices themselves a research object instead of hidden background assumptions."),
(["safety", "security", "governance"], ["threat model", "monitoring granularity", "permission boundary"], "governance", "Could connect technical agent behaviors to deployment-relevant risk questions in a more principled way."),
]
GENERIC_LIMITATION_PATTERNS = [
"abstract-level evidence only",
"validate assumptions",
"full paper",
"before relying on this as key evidence",
]
BENCHMARK_HINTS = [
"HotpotQA", "FEVER", "ALFWorld", "WebShop", "HumanEval", "WebArena", "AgentBench",
"GSM8K", "MATH", "Wikipedia", "wiki", "API", "question answering", "fact verification",
]
AXIS_INSIGHT_LIBRARY: dict[str, dict[str, str]] = {
"observability granularity": {
"question": "whether several agent-loop gains are really planner gains, or whether they mostly come from changing what the planner gets to observe and when",
"confound": "planner quality and broader agent competence",
"thesis": "Several agent-loop gains remain hard to interpret because papers often improve observation access at the same time they improve the planner; before crediting planner depth, we should ask what the system was allowed to see.",
"insight": "A convincing result would not just move aggregate score. It would show which published gains survive once observation access is fixed, ideally producing a regime map of when observability—not planner depth—changes the failure story.",
"contribution_shape": "Could yield a causal-attribution result plus a reporting rule for agent-loop papers: claims about planning quality should specify and control observation access.",
"demotion": "This direction weakens sharply if the strongest prior work already holds observation access fixed while planner quality changes and still reports the same gain pattern.",
"kill_signal": "an anchor paper already fixes observation access while varying planner quality and the main conclusion still survives",
"missing_piece": "What is missing is a fixed-interface, fixed-budget comparison that varies only observation access and tracks whether the failure taxonomy—not just average score—changes.",
"program_kind": "causal attribution",
"time_to_clarity": "fast",
"priority_note": "Fastest to falsify on public tasks, and the answer would immediately change how several agent-loop results are interpreted.",
"reading_extract": "what the agent sees at each step, which ablations vary planner depth, and whether the action interface stays fixed",
"title": "Observability granularity vs planner depth",
},
"action-space design": {
"question": "whether robustness comes from better reasoning or from giving the agent a cleaner action interface",
"confound": "the shape of the action vocabulary and interface design",
"thesis": "Some agent-loop gains may be interface gains in disguise: when action vocabularies become cleaner or narrower, papers can attribute robustness to reasoning improvements that partly come from action-space design.",
"insight": "A useful result would show whether the same nominal planner still behaves very differently once the action space is widened, normalized, or made less ergonomic.",
"contribution_shape": "Could produce an action-space normalization protocol and a clearer account of which agent claims survive once interface ergonomics are controlled.",
"demotion": "This direction weakens if current papers already compare equivalent action spaces and still obtain the same ranking.",
"kill_signal": "the key anchor papers already normalize action vocabularies or API surfaces and still see the same ordering",
"missing_piece": "What is missing is an action-space-normalized comparison that keeps planner prompts fixed while changing only interface granularity and affordances.",
"program_kind": "interface normalization",
"time_to_clarity": "medium",
"priority_note": "Interesting and still thesis-relevant, but less immediate than observability because interface normalization usually requires more careful task redesign.",
"reading_extract": "how many actions are available, how semantically aligned the actions are, and whether interface cleanup happens together with reasoning changes",
"title": "Action-space design or agent competence?",
},
"tool/environment boundary": {
"question": "whether reliability depends more on boundary design between tool and environment than on the nominal reasoning loop",
"confound": "tool abstraction and environment modeling",
"thesis": "A number of agent results may really be about boundary design: where the system stops reasoning internally and starts delegating to tools can change reliability without changing the nominal planner much.",
"insight": "A strong result would show that several agent-level conclusions are actually boundary-design conclusions once tool abstraction is normalized.",
"contribution_shape": "Could yield a boundary-design taxonomy and a cleaner experimental recipe for separating planner quality from tool/interface scaffolding.",
"demotion": "This direction weakens if changing the boundary carefully barely affects the failure story.",
"kill_signal": "careful tool-boundary changes leave both the metric and the qualitative failure modes essentially unchanged",
"missing_piece": "What is missing is a comparison that keeps task semantics fixed while moving the tool/environment boundary in a controlled way.",
"program_kind": "systems boundary",
"time_to_clarity": "medium",
"priority_note": "Valuable as a systems-facing thesis line, but slower to turn into a decisive first readout than the lead mechanism questions.",
"reading_extract": "where tool calls begin, what state is exposed to the planner, and whether environment modeling changes together with the tool boundary",
"title": "Where the tool boundary really matters",
},
"search depth": {
"question": "whether apparent planning gains reflect deeper search or simply more inference-time budget to recover from weak initial choices",
"confound": "inference-time compute budget",
"thesis": "Many planning results still bundle depth with budget: deeper search often spends more tokens, branches, or retries, so it remains unclear whether depth changes reasoning quality or just buys more recovery opportunities.",
"insight": "A convincing result would show whether depth changes the nature of planning failures under a matched budget, rather than merely delaying failure by spending more compute.",
"contribution_shape": "Could produce a compute-normalized planning benchmark slice and a regime map for when search depth matters beyond extra inference budget.",
"demotion": "This direction weakens if prior work already equalizes budget and still finds depth-specific gains.",
"kill_signal": "the anchor papers already normalize token or wall-clock budget and the depth advantage remains intact",
"missing_piece": "What is missing is a compute-normalized study that holds token or wall-clock budget fixed while varying depth or branching, then checks whether failure modes actually change.",
"program_kind": "budget-normalized mechanism",
"time_to_clarity": "medium",
"priority_note": "Still thesis-sized, but it ranks behind observability because compute normalization is harder to defend and easier for prior work to have addressed already.",
"reading_extract": "the reported depth or branching settings, the effective compute budget, and whether shallow baselines were budget-matched",
"title": "Search depth or compute budget?",
},
"verification loop": {
"question": "whether verification contributes a distinct reasoning mechanism or mostly acts as expensive redundancy",
"confound": "extra compute and repeated checking",
"thesis": "Verification-heavy pipelines may look stronger because they retry, re-check, or filter more often, not necessarily because they add a distinct reasoning mechanism.",
"insight": "A useful result would separate verification as a mechanism from verification as a compute-expensive retry policy.",
"contribution_shape": "Could produce a cleaner protocol for comparing verification loops under matched retry or compute budgets.",
"demotion": "This direction weakens if verification can be replaced by simple repetition with little change in behavior.",
"kill_signal": "repeat-sampling or retry baselines already match the reported verification gain",
"missing_piece": "What is missing is a matched-budget comparison between explicit verification and simple repetition or reranking baselines.",
"program_kind": "verification audit",
"time_to_clarity": "medium",
"priority_note": "Decision-relevant when the group suspects redundancy masquerading as reasoning, though the probe is less clean than the top two directions.",
"reading_extract": "how many extra passes are spent on checking, whether retry baselines exist, and which errors verification actually fixes",
"title": "Verification or just expensive redundancy?",
},
"partial-observability handling": {
"question": "whether planning systems fail because they reason badly or because they are brittle under missing state information",
"confound": "state visibility",
"thesis": "Some planning failures may be epistemic before they are algorithmic: systems can look like weak planners when they are actually brittle under missing state information.",
"insight": "A strong result would show whether improved visibility changes the failure taxonomy more than planner changes do.",
"contribution_shape": "Could produce a planning-under-partial-observability benchmark slice and a clearer division between epistemic and algorithmic failure modes.",
"demotion": "This direction weakens if stronger visibility barely changes the failure pattern.",
"kill_signal": "improved visibility leaves both success rates and error types almost unchanged",
"missing_piece": "What is missing is a task slice that varies only state visibility while keeping planner scaffolding steady.",
"program_kind": "epistemic failure analysis",
"time_to_clarity": "medium",
"priority_note": "Useful if the group wants a deeper failure-analysis program, but it is currently less concrete than the lead directions.",
"reading_extract": "what information is hidden, which observations are restored by tools, and whether planner quality is tested separately from visibility",
"title": "Is planning failure really an observability problem?",
},
"retrieval policy": {
"question": "whether memory gains come from better stored knowledge or from better decisions about when and how retrieval is triggered",
"confound": "memory content versus retrieval timing",
"thesis": "Reported memory gains are often hard to interpret because memory content and retrieval triggers move together; some improvements may be protocol effects about when retrieval happens, not capability effects about what memory stores.",
"insight": "A convincing result would separate memory-content effects from retrieval-trigger effects and show whether the interpretation of current RAG-style gains changes once trigger policy is normalized.",
"contribution_shape": "Could yield a cleaner memory-evaluation protocol and a reusable result on when retrieval policy, rather than memory content, is the true hidden variable.",
"demotion": "This direction weakens if current work already varies retrieval policy independently of memory content and sees the same story.",
"kill_signal": "the main memory papers already keep the memory store fixed while varying retrieval triggers or timing and the conclusion does not move",
"missing_piece": "What is missing is a fixed-memory comparison that varies retrieval trigger and timing only, then checks whether the gain survives and which errors it actually removes.",
"program_kind": "protocol sensitivity",
"time_to_clarity": "medium",
"priority_note": "Different enough from the planning directions to stay in the lead set, but it ranks lower because the current evidence is thinner and more protocol-sensitive.",
"reading_extract": "what is stored in memory, when retrieval fires, what retrieval budget is allowed, and whether trigger policy is ever ablated independently",
"title": "Retrieval policy or memory content?",
},
"state summarization cadence": {
"question": "whether summarization helps because it improves reasoning or because it selectively compresses away troublesome state information",
"confound": "what gets forgotten versus what gets highlighted",
"thesis": "Summarization may help less because the agent reasons better and more because the system filters state in a favorable way; cadence choices can quietly redefine what information survives.",
"insight": "A useful result would show whether different summarization cadences preserve the same conclusions and failure taxonomy once memory content is held steady.",
"contribution_shape": "Could produce a summarization-control protocol for long-horizon agents and a clearer account of how state compression alters conclusions.",
"demotion": "This direction weakens if different summarization cadences preserve the same conclusions and failure taxonomy.",
"kill_signal": "changing summarization cadence leaves both retrieval behavior and downstream errors largely unchanged",
"missing_piece": "What is missing is a fixed-memory, fixed-task comparison that varies only summarization cadence and records what state is lost or amplified.",
"program_kind": "state compression",
"time_to_clarity": "medium",
"priority_note": "Promising but currently less mature than retrieval-policy framing because the concrete prior-work hooks are thinner.",
"reading_extract": "what gets summarized away, how often summaries are rewritten, and whether failure cases correlate with state compression choices",
"title": "What summarization cadence is really changing",
},
"memory horizon": {
"question": "whether longer memory helps because more context is useful or because evaluations reward persistence over selectivity",
"confound": "context length and evaluation design",
"thesis": "Longer memory can look like better capability even when the benchmark mainly rewards persistence or repeated access to earlier state, so horizon effects may partly be evaluation-design effects.",
"insight": "A convincing result would show whether extending horizon changes task framing and error patterns, not just aggregate score.",
"contribution_shape": "Could produce a regime map for when horizon length matters, versus when selective retrieval or better task design explains the gain.",
"demotion": "This direction weakens if memory horizon has little effect once retrieval policy is controlled.",
"kill_signal": "memory-length changes stop mattering once retrieval triggers or evaluation design are normalized",
"missing_piece": "What is missing is a horizon study that matches retrieval policy and benchmark framing while varying how much state is retained.",
"program_kind": "regime mapping",
"time_to_clarity": "medium",
"priority_note": "Conceptually useful, but currently less decisive than retrieval-policy framing for a first discussion round.",
"reading_extract": "whether longer context actually changes retrieval choices, and whether the benchmark rewards persistence more than selective recall",
"title": "Does longer memory really help?",
},
}
@dataclass(frozen=True)
class IdeaSignal:
    signal_id: str
    cluster: str
    direction_type: str
    theme: str
    claim_or_observation: str
    tension: str
    missing_piece: str
    possible_axis: str
    academic_value: str
    evidence_confidence: str
    paper_ids: list[str]


@dataclass(frozen=True)
class DirectionCard:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    focus_axis: str
    main_confound: str
    program_kind: str
    contribution_shape: str
    time_to_clarity: str
    one_line_thesis: str
    why_interesting: str
    literature_suggests: list[str]
    closest_prior_gap: list[str]
    missing_piece: str
    possible_variants: list[str]
    academic_value: str
    first_probes: list[str]
    what_counts_as_insight: str
    weakness_conditions: list[str]
    kill_criteria: list[str]
    what_would_change_mind: list[str]
    best_fit: str
    why_this_ranks_here: str
    evidence_confidence: str
    paper_ids: list[str]
    signal_ids: list[str]
    anchor_reading_notes: list[dict[str, str]]


@dataclass(frozen=True)
class ScreenedDirection:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    total_score: float
    discussion_worthiness: int
    academic_value_score: int
    evidence_grounding: int
    direction_distinctness: int
    first_probe_clarity: int
    thesis_potential: int
    recommendation: str
    rationale: str
def read_core_set(path: Path) -> list[dict[str, str]]:
    import csv

    if not path.exists():
        return []
    with path.open("r", encoding="utf-8", newline="") as handle:
        return [dict(row) for row in csv.DictReader(handle)]


def write_jsonl(path: Path, rows: Iterable[dict[str, Any] | Any]) -> None:
    ensure_dir(path.parent)
    encoded: list[str] = []
    for row in rows:
        value = asdict(row) if is_dataclass(row) else row
        encoded.append(json.dumps(value, ensure_ascii=False))
    atomic_write_text(path, "\n".join(encoded).rstrip() + ("\n" if encoded else ""))


def write_json(path: Path, data: Any) -> None:
    ensure_dir(path.parent)
    atomic_write_text(path, json.dumps(data, ensure_ascii=False, indent=2).rstrip() + "\n")
def uniq_keep_order(items: Iterable[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or "").strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out


def slugify(text: str) -> str:
    raw = re.sub(r"[^A-Za-z0-9]+", "-", str(text or "").strip().lower())
    return raw.strip("-") or "x"
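# Usage sketch (illustrative only; not part of the original module). The block
# restates `uniq_keep_order` and `slugify` so it runs on its own. The sample
# inputs are made up to show the edge cases: blank and None entries are dropped,
# and an all-punctuation string falls back to "x".

```python
import re
from typing import Iterable


def uniq_keep_order(items: Iterable[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or "").strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out


def slugify(text: str) -> str:
    raw = re.sub(r"[^A-Za-z0-9]+", "-", str(text or "").strip().lower())
    return raw.strip("-") or "x"


# Whitespace-only differences collapse to one entry; first occurrence wins.
deduped = uniq_keep_order(["search depth", " search depth ", "", None, "memory horizon"])

# Runs of non-alphanumerics become a single hyphen, then edges are trimmed.
slug = slugify("Search depth or compute budget?")

# Nothing alphanumeric survives, so the "x" fallback is returned.
fallback = slugify("???")
```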
def clean_text(text: str, *, limit: int = 220) -> str:
    s = str(text or "").strip()
    s = s.replace("\n", " ")
    s = re.sub(r"\s+", " ", s)
    s = s.replace("|", ", ")
    s = s.strip(" \"'`")
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()
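# Usage sketch (illustrative only; not part of the original module). It restates
# `clean_text` so the block is self-contained and uses invented inputs to show
# the two behaviors that matter downstream: pipes are rewritten so the text is
# safe inside a Markdown table cell, and over-long text is cut at a word boundary.

```python
import re


def clean_text(text: str, *, limit: int = 220) -> str:
    s = str(text or "").strip()
    s = s.replace("\n", " ")
    s = re.sub(r"\s+", " ", s)
    s = s.replace("|", ", ")
    s = s.strip(" \"'`")
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()


# Newlines and repeated spaces collapse, "|" becomes ", ", outer quotes are stripped.
cell = clean_text('"accuracy|F1 improve\n on  WebShop"')

# With limit=12 the cut falls inside "gamma", so the partial word is dropped.
clipped = clean_text("alpha beta gamma delta", limit=12)
```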
def clean_sentence(text: str, *, limit: int = 180) -> str:
    s = clean_text(text, limit=max(limit * 2, limit))
    if len(s) <= limit:
        return s
    window = s[:limit + 40]
    pieces = re.split(r'(?<=[.!?;])\s+', window)
    acc: list[str] = []
    total = 0
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue
        nxt = total + len(piece) + (1 if acc else 0)
        if nxt > limit and acc:
            break
        acc.append(piece)
        total = nxt
        if total >= int(limit * 0.65):
            break
    if acc:
        return " ".join(acc).strip()
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()
def markdown_table(headers: list[str], rows: list[list[str]]) -> str:
    head = "| " + " | ".join(headers) + " |"
    sep = "| " + " | ".join(["---"] * len(headers)) + " |"
    body = ["| " + " | ".join(str(cell).replace("\n", " ").strip() for cell in row) + " |" for row in rows]
    return "\n".join([head, sep] + body)
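# Usage sketch (illustrative only; not part of the original module). It restates
# `markdown_table` and renders one invented row. Note that the function flattens
# newlines inside cells but does not escape "|", so callers presumably pass cell
# text through `clean_text` first (an assumption about intended use).

```python
def markdown_table(headers: list[str], rows: list[list[str]]) -> str:
    head = "| " + " | ".join(headers) + " |"
    sep = "| " + " | ".join(["---"] * len(headers)) + " |"
    body = ["| " + " | ".join(str(cell).replace("\n", " ").strip() for cell in row) + " |" for row in rows]
    return "\n".join([head, sep] + body)


# The embedded newline in the second cell is replaced by a space.
table = markdown_table(["Rank", "Title"], [["1", "Search depth\nor budget"]])
table_lines = table.split("\n")
```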
def write_markdown(path: Path, text: str) -> None:
    ensure_dir(path.parent)
    atomic_write_text(path, text.rstrip() + "\n")
def extract_goal_from_goal_md(path: Path) -> str:
    if not path.exists():
        return "research ideas"
    lines = [ln.strip() for ln in path.read_text(encoding="utf-8", errors="ignore").splitlines() if ln.strip() and not ln.startswith("#")]
    return lines[0] if lines else "research ideas"


def _idea_workspace_from_brief(path: Path) -> Path | None:
    try:
        return path.resolve().parents[2]
    except Exception:
        return None


def _require_positive_float(value: Any, *, field_name: str) -> float:
    try:
        parsed = float(value)
    except Exception as exc:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
    if parsed <= 0:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return parsed
def _query_int_override(workspace: Path, key: str, default: int) -> int:
    """Return a positive integer override for `key` from queries.md, else `default`."""
    queries_path = workspace / "queries.md"
    normalized = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
    if not normalized:
        return int(default)
    if normalized not in pipeline_overridable_query_fields(workspace):
        return int(default)
    if not queries_path.exists():
        return int(default)
    # Scan `- key: value` bullets; the first matching key wins. Invalid values
    # raise instead of silently falling back to the default.
    for raw in queries_path.read_text(encoding="utf-8", errors="ignore").splitlines():
        line = raw.strip()
        if not line.startswith("- ") or ":" not in line:
            continue
        entry_key, value = line[2:].split(":", 1)
        entry_key = entry_key.strip().lower().replace(" ", "_").replace("-", "_")
        if entry_key != normalized:
            continue
        # Drop a trailing `# comment` and surrounding quotes before parsing.
        cleaned = value.split("#", 1)[0].strip().strip('"').strip("'")
        if not cleaned:
            raise ValueError(f"Invalid ideation override: `{normalized}` is empty in queries.md.")
        try:
            parsed = int(cleaned)
        except Exception as exc:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be an integer in queries.md.") from exc
        if parsed <= 0:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be a positive integer in queries.md.")
        return parsed
    return int(default)
def _validate_score_weights(value: Any, *, field_name: str) -> dict[str, float]:
    # Only the keys of `required` are enforced below; the numeric values are
    # never read as fallbacks, so every weight must come from the contract.
    required = {
        "discussion_worthiness": 0.24,
        "academic_value": 0.22,
        "evidence_grounding": 0.18,
        "direction_distinctness": 0.16,
        "first_probe_clarity": 0.10,
        "thesis_potential": 0.10,
    }
    if not isinstance(value, dict):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    weights: dict[str, float] = {}
    for key in required:
        if key not in value:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{key}")
        weights[key] = _require_positive_float(value.get(key), field_name=f"{field_name}.{key}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    # Normalize so the returned weights sum to 1 (up to 6-decimal rounding).
    return {key: round(weight / total, 6) for key, weight in weights.items()}
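# Worked sketch of the normalization step (illustrative only). The name
# `normalize_score_weights` and the sample weights are hypothetical; the block is
# a simplified stand-in for `_validate_score_weights`, not the module's own code.
# It shows that weights may be supplied on any positive scale and are rescaled
# to sum to 1, and that a missing key is rejected rather than defaulted.

```python
from typing import Any

REQUIRED_WEIGHT_KEYS = (
    "discussion_worthiness",
    "academic_value",
    "evidence_grounding",
    "direction_distinctness",
    "first_probe_clarity",
    "thesis_potential",
)


def normalize_score_weights(value: Any) -> dict[str, float]:
    """Simplified stand-in for `_validate_score_weights` (hypothetical name)."""
    if not isinstance(value, dict):
        raise ValueError("score_weights must be a mapping")
    weights: dict[str, float] = {}
    for key in REQUIRED_WEIGHT_KEYS:
        if key not in value:
            raise ValueError(f"missing weight: {key}")
        parsed = float(value[key])
        if parsed <= 0:
            raise ValueError(f"weight must be positive: {key}")
        weights[key] = parsed
    total = sum(weights.values())
    return {key: round(weight / total, 6) for key, weight in weights.items()}


# Weights given on an arbitrary scale (they sum to 10 here).
raw_weights = {
    "discussion_worthiness": 2,
    "academic_value": 2,
    "evidence_grounding": 2,
    "direction_distinctness": 2,
    "first_probe_clarity": 1,
    "thesis_potential": 1,
}
normalized = normalize_score_weights(raw_weights)
```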
def _validate_diversity_axes(value: Any, *, field_name: str) -> list[str]:
    allowed = {"cluster", "direction_type", "program_kind"}
    if not isinstance(value, list):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    axes: list[str] = []
    seen: set[str] = set()
    for item in value:
        axis = str(item or "").strip().lower().replace("-", "_")
        if not axis:
            continue
        if axis not in allowed:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{axis}")
        if axis in seen:
            continue
        seen.add(axis)
        axes.append(axis)
    if not axes:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return axes
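# Usage sketch (illustrative only; not part of the original module). It restates
# `_validate_diversity_axes` under the public-looking name `validate_diversity_axes`
# (a hypothetical rename) and feeds it invented inputs: mixed case and hyphens are
# normalized, duplicates and blanks are dropped, unknown axes raise.

```python
from typing import Any


def validate_diversity_axes(value: Any, *, field_name: str) -> list[str]:
    allowed = {"cluster", "direction_type", "program_kind"}
    if not isinstance(value, list):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    axes: list[str] = []
    seen: set[str] = set()
    for item in value:
        axis = str(item or "").strip().lower().replace("-", "_")
        if not axis:
            continue
        if axis not in allowed:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{axis}")
        if axis in seen:
            continue
        seen.add(axis)
        axes.append(axis)
    if not axes:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return axes


axes = validate_diversity_axes(["Cluster", "program-kind", "cluster", ""], field_name="lead_diversity_axes")

try:
    validate_diversity_axes(["venue"], field_name="lead_diversity_axes")
    rejected_unknown = False
except ValueError:
    rejected_unknown = True
```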
def resolve_idea_contract(workspace: Path) -> dict[str, Any]:
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None:
        raise ValueError("Missing active pipeline contract.")
    query_defaults = dict(spec.query_defaults)
    quality_contract = dict(spec.quality_contract)

    def _require_positive_int(value: Any, *, field_name: str) -> int:
        try:
            parsed = int(value)
        except Exception as exc:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
        if parsed <= 0:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
        return parsed

    signal_policy = quality_contract.get("signal_policy") or {}
    direction_policy = quality_contract.get("direction_policy") or {}
    screening_policy = quality_contract.get("screening_policy") or {}
    brief = parse_idea_brief(workspace / "output" / "trace" / "IDEA_BRIEF.md")
    focus_clusters = [str(x).strip() for x in (brief.get("focus_clusters") or []) if str(x).strip()]
    direction_pool_min_default = _require_positive_int(query_defaults.get("direction_pool_min"), field_name="query_defaults.direction_pool_min")
    direction_pool_max_default = _require_positive_int(query_defaults.get("direction_pool_max"), field_name="query_defaults.direction_pool_max")
    shortlist_size_default = _require_positive_int(query_defaults.get("idea_shortlist_size"), field_name="query_defaults.idea_shortlist_size")
    report_top_n_default = _require_positive_int(query_defaults.get("report_top_n"), field_name="query_defaults.report_top_n")
    idea_screen_top_n_default = _require_positive_int(query_defaults.get("idea_screen_top_n"), field_name="query_defaults.idea_screen_top_n")
    shortlist_min = _require_positive_int(direction_policy.get("shortlist_min"), field_name="quality_contract.direction_policy.shortlist_min")
    shortlist_max = _require_positive_int(direction_policy.get("shortlist_max"), field_name="quality_contract.direction_policy.shortlist_max")
    keep_min = _require_positive_int(direction_policy.get("keep_min"), field_name="quality_contract.direction_policy.keep_min")
    cluster_diversity_min = _require_positive_int(direction_policy.get("cluster_diversity_min"), field_name="quality_contract.direction_policy.cluster_diversity_min")
    lead_diversity_target = _require_positive_int(direction_policy.get("lead_diversity_target"), field_name="quality_contract.direction_policy.lead_diversity_target")
    lead_diversity_axes = _validate_diversity_axes(
        direction_policy.get("lead_diversity_axes"),
        field_name="quality_contract.direction_policy.lead_diversity_axes",
    )
    signal_table_min = _require_positive_int(signal_policy.get("min_rows"), field_name="quality_contract.signal_policy.min_rows")
    keep_rank_max = _require_positive_int(screening_policy.get("keep_rank_max"), field_name="quality_contract.screening_policy.keep_rank_max")
    maybe_rank_max = _require_positive_int(screening_policy.get("maybe_rank_max"), field_name="quality_contract.screening_policy.maybe_rank_max")
    score_weights = _validate_score_weights(
        screening_policy.get("score_weights"),
        field_name="quality_contract.screening_policy.score_weights",
    )
    direction_pool_min = _query_int_override(workspace, "direction_pool_min", direction_pool_min_default)
    direction_pool_max = _query_int_override(workspace, "direction_pool_max", direction_pool_max_default)
    shortlist_size = _query_int_override(workspace, "idea_shortlist_size", shortlist_size_default)
    report_top_n = _query_int_override(workspace, "report_top_n", report_top_n_default)
    idea_screen_top_n = _query_int_override(workspace, "idea_screen_top_n", idea_screen_top_n_default)
    if direction_pool_min > direction_pool_max:
        raise ValueError(f"Invalid ideation contract: direction_pool_min ({direction_pool_min}) exceeds direction_pool_max ({direction_pool_max}).")
    if shortlist_size < shortlist_min or shortlist_size > shortlist_max:
        raise ValueError(
            f"Invalid ideation contract: idea_shortlist_size ({shortlist_size}) must stay within "
            f"[{shortlist_min}, {shortlist_max}]."
        )
    if report_top_n > shortlist_size:
        raise ValueError(f"Invalid ideation contract: report_top_n ({report_top_n}) exceeds idea_shortlist_size ({shortlist_size}).")
    if maybe_rank_max < keep_rank_max:
        raise ValueError(
            f"Invalid ideation contract: maybe_rank_max ({maybe_rank_max}) must be >= keep_rank_max ({keep_rank_max})."
        )
    if idea_screen_top_n > direction_pool_max:
        raise ValueError(f"Invalid ideation contract: idea_screen_top_n ({idea_screen_top_n}) exceeds direction_pool_max ({direction_pool_max}).")
    return {
        "focus_clusters": focus_clusters,
        "direction_pool_min": direction_pool_min,
        "direction_pool_max": direction_pool_max,
        "idea_screen_top_n": idea_screen_top_n,
        "shortlist_size": shortlist_size,
        "report_top_n": report_top_n,
        "shortlist_min": shortlist_min,
        "shortlist_max": shortlist_max,
        "signal_table_min": signal_table_min,
        "keep_min": keep_min,
        "cluster_diversity_min": cluster_diversity_min,
        "lead_diversity_target": lead_diversity_target,
        "lead_diversity_axes": lead_diversity_axes,
        "keep_rank_max": keep_rank_max,
        "maybe_rank_max": maybe_rank_max,
        "score_weights": score_weights,
    }
def parse_idea_brief(path: Path) -> dict[str, Any]:
    text = path.read_text(encoding="utf-8", errors="ignore") if path.exists() else ""
    out: dict[str, Any] = {
        "goal": extract_goal_from_goal_md(path),
        "focus_clusters": [],
        "query_buckets": [],
        "exclusions": [],
        "constraints": [],
        "targets": {},
    }
    cur = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith("## "):
            cur = line[3:].strip().lower()
            continue
        if cur == "goal" and line.startswith("- Topic:"):
            out["goal"] = line.split(":", 1)[1].strip()
        if cur in {"focus after c2", "focus lenses after c2"} and line.startswith("- Focus clusters:"):
            clusters = [x.strip() for x in line.split(":", 1)[1].split(";") if x.strip()]
            cleaned: list[str] = []
            for item in clusters:
                low = item.lower()
                if "to be filled" in low or "fill after c2" in low or "placeholder" in low:
                    continue
                cleaned.append(item)
            out["focus_clusters"] = cleaned
        if cur == "query buckets" and re.match(r"^\d+\.\s+", line):
            out["query_buckets"].append(re.sub(r"^\d+\.\s+", "", line).strip())
        if cur in {"exclude terms", "exclusions"} and line.startswith("- "):
            value = line[2:].strip()
            if value and not value.lower().startswith("none"):
                out["exclusions"].append(value)
        if cur == "constraints" and line.startswith("- "):
            out["constraints"].append(line[2:].strip())
        if cur == "targets" and line.startswith("- ") and ":" in line:
            key, value = line[2:].split(":", 1)
            out["targets"][key.strip().lower().replace(" ", "_")] = value.strip()
    return out
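# Worked sketch of the brief format this parser accepts (illustrative only).
# The block restates `extract_goal_from_goal_md` and `parse_idea_brief` so it
# runs on its own; the sample brief below is invented, not taken from a real
# workspace. It shows which `## ` headings and bullet shapes are recognized.

```python
import re
import tempfile
from pathlib import Path
from typing import Any


def extract_goal_from_goal_md(path: Path) -> str:
    if not path.exists():
        return "research ideas"
    lines = [ln.strip() for ln in path.read_text(encoding="utf-8", errors="ignore").splitlines() if ln.strip() and not ln.startswith("#")]
    return lines[0] if lines else "research ideas"


def parse_idea_brief(path: Path) -> dict[str, Any]:
    text = path.read_text(encoding="utf-8", errors="ignore") if path.exists() else ""
    out: dict[str, Any] = {
        "goal": extract_goal_from_goal_md(path),
        "focus_clusters": [],
        "query_buckets": [],
        "exclusions": [],
        "constraints": [],
        "targets": {},
    }
    cur = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith("## "):
            cur = line[3:].strip().lower()
            continue
        if cur == "goal" and line.startswith("- Topic:"):
            out["goal"] = line.split(":", 1)[1].strip()
        if cur in {"focus after c2", "focus lenses after c2"} and line.startswith("- Focus clusters:"):
            clusters = [x.strip() for x in line.split(":", 1)[1].split(";") if x.strip()]
            cleaned: list[str] = []
            for item in clusters:
                low = item.lower()
                if "to be filled" in low or "fill after c2" in low or "placeholder" in low:
                    continue
                cleaned.append(item)
            out["focus_clusters"] = cleaned
        if cur == "query buckets" and re.match(r"^\d+\.\s+", line):
            out["query_buckets"].append(re.sub(r"^\d+\.\s+", "", line).strip())
        if cur in {"exclude terms", "exclusions"} and line.startswith("- "):
            value = line[2:].strip()
            if value and not value.lower().startswith("none"):
                out["exclusions"].append(value)
        if cur == "constraints" and line.startswith("- "):
            out["constraints"].append(line[2:].strip())
        if cur == "targets" and line.startswith("- ") and ":" in line:
            key, value = line[2:].split(":", 1)
            out["targets"][key.strip().lower().replace(" ", "_")] = value.strip()
    return out


# Invented sample brief: headings are matched case-insensitively.
SAMPLE_BRIEF = """\
# Idea brief

## Goal
- Topic: agent-loop evaluation

## Focus after C2
- Focus clusters: planning; memory; placeholder cluster

## Query buckets
1. agent planning benchmarks
2. retrieval-augmented memory

## Exclusions
- None

## Constraints
- public benchmarks only

## Targets
- Shortlist size: 5
"""

with tempfile.TemporaryDirectory() as tmp:
    brief_path = Path(tmp) / "IDEA_BRIEF.md"
    brief_path.write_text(SAMPLE_BRIEF, encoding="utf-8")
    brief = parse_idea_brief(brief_path)
```

# In this sketch the "placeholder cluster" entry and the "- None" exclusion are
# filtered out, and target values stay as strings ("5", not 5).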
def collect_note_index(path: Path) -> dict[str, dict[str, Any]]:
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        pid = str(rec.get("paper_id") or "").strip()
        if pid:
            out[pid] = rec
    return out


def keywords_from_cluster(name: str) -> list[str]:
    toks = [t for t in re.findall(r"[A-Za-z0-9]+", str(name or "").lower()) if t not in STOPWORDS and len(t) > 2]
    return uniq_keep_order(toks)


def score_note_to_cluster(cluster_name: str, note: dict[str, Any]) -> int:
    keys = keywords_from_cluster(cluster_name)
    blob_parts = [str(note.get("title") or "")]
    for field in ["summary_bullets", "limitations", "key_results", "method", "abstract"]:
        val = note.get(field)
        if isinstance(val, list):
            blob_parts.extend([str(x) for x in val])
        elif isinstance(val, str):
            blob_parts.append(val)
    blob = " ".join(blob_parts).lower()
    return sum(1 for key in keys if key in blob)
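# Worked sketch of the keyword-overlap score (illustrative only). The real
# STOPWORDS set is defined elsewhere in the module and is not shown here, so the
# small set below is a placeholder assumption; the notes are invented. The
# second call highlights that matching is by substring, so the keyword "search"
# also counts a hit inside "research".

```python
import re
from typing import Any

# Assumption: placeholder stopword list, not the module's actual STOPWORDS.
STOPWORDS = {"and", "the", "for", "with"}


def keywords_from_cluster(name: str) -> list[str]:
    toks = [t for t in re.findall(r"[A-Za-z0-9]+", str(name or "").lower()) if t not in STOPWORDS and len(t) > 2]
    # Order-preserving dedupe; stands in for the module's uniq_keep_order.
    return list(dict.fromkeys(toks))


def score_note_to_cluster(cluster_name: str, note: dict[str, Any]) -> int:
    keys = keywords_from_cluster(cluster_name)
    blob_parts = [str(note.get("title") or "")]
    for field in ["summary_bullets", "limitations", "key_results", "method", "abstract"]:
        val = note.get(field)
        if isinstance(val, list):
            blob_parts.extend([str(x) for x in val])
        elif isinstance(val, str):
            blob_parts.append(val)
    blob = " ".join(blob_parts).lower()
    return sum(1 for key in keys if key in blob)


note = {
    "title": "Tree search for long-horizon planning",
    "summary_bullets": ["Budget-matched baselines are missing."],
}

# Keywords are planning, search, depth; "depth" never appears in the note.
planning_score = score_note_to_cluster("Planning and search depth", note)

# Substring match: "search" is found inside "research".
substring_score = score_note_to_cluster("Search", {"title": "A research agenda"})
```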
def map_notes_to_clusters(taxonomy_path: Path, notes_path: Path) -> dict[str, list[dict[str, Any]]]:
    import yaml

    taxonomy = yaml.safe_load(taxonomy_path.read_text(encoding="utf-8", errors="ignore")) if taxonomy_path.exists() else []
    notes = [r for r in read_jsonl(notes_path) if isinstance(r, dict)]
    out: dict[str, list[dict[str, Any]]] = {}
    for top in taxonomy or []:
        if not isinstance(top, dict):
            continue
        for child in top.get("children") or []:
            if not isinstance(child, dict):
                continue
            name = str(child.get("name") or "").strip()
            if not name:
                continue
            scored: list[tuple[int, dict[str, Any]]] = []
            for note in notes:
                s = score_note_to_cluster(name, note)
                if s > 0:
                    scored.append((s, note))
            # Highest score first; ties broken by paper_id for determinism.
            scored.sort(key=lambda x: (-x[0], str(x[1].get("paper_id") or "")))
            out[name] = [n for _, n in scored[:8]]
    return out
def _cluster_profile(cluster: str) -> tuple[str, list[str], str]:
    low = cluster.lower()
    for keys, axes, direction_type, academic_value in CLUSTER_AXIS_HINTS:
        if any(key in low for key in keys):
            return direction_type, axes, academic_value
    return "research", ["assumption sensitivity", "failure analysis", "scope boundary"], "Could sharpen the way this sub-area is framed and compared."


def _note_bullets(note: dict[str, Any], field: str) -> list[str]:
    val = note.get(field)
    if isinstance(val, list):
        return [clean_text(x, limit=220) for x in val if clean_text(x, limit=220)]
    if isinstance(val, str) and clean_text(val, limit=220):
        return [clean_text(val, limit=220)]
    return []


def _specific_limitations(note: dict[str, Any]) -> list[str]:
    items = _note_bullets(note, "limitations")
    specific = []
    for item in items:
        low = item.lower()
        if any(pat in low for pat in GENERIC_LIMITATION_PATTERNS):
            continue
        specific.append(item)
    return specific or items


def _evidence_confidence(notes: list[dict[str, Any]]) -> str:
    if not notes:
        return "low"
    if any(str(n.get("evidence_level") or "").strip() == "fulltext" for n in notes):
        return "medium-high"
    specific = sum(
        1
        for n in notes
        if _specific_limitations(n) and not any(p in (_specific_limitations(n)[0].lower()) for p in GENERIC_LIMITATION_PATTERNS)
    )
    if len(notes) >= 3 and specific >= 2:
        return "medium"
    return "low-medium"
def _axis_profile(axis: str, cluster: str) -> dict[str, str]:
    base = AXIS_INSIGHT_LIBRARY.get(axis, {})
    confound = base.get("confound", "nearby design choices and evaluation framing")
    return {
        "question": base.get("question", f"whether {axis} is doing more conceptual work in {cluster.lower()} than current papers make explicit"),
        "confound": confound,
        "thesis": base.get("thesis", f"Current results in {cluster.lower()} may be hard to interpret because {axis} still moves together with {confound}."),
        "insight": base.get("insight", f"A meaningful result would show whether {axis} changes how we interpret the current literature rather than merely how we narrate it."),
        "contribution_shape": base.get("contribution_shape", f"Could turn {axis} into a cleaner explanatory variable for {cluster.lower()} rather than a background convenience."),
        "demotion": base.get("demotion", f"This direction would weaken if the strongest prior work already isolates {axis} cleanly."),
        "kill_signal": base.get("kill_signal", f"the strongest prior work already isolates {axis} against {confound}"),
        "missing_piece": base.get("missing_piece", f"What is missing is a cleaner comparison that varies {axis} while keeping {confound} fixed enough to change interpretation, not just presentation."),
        "program_kind": base.get("program_kind", "mechanism clarification"),
        "time_to_clarity": base.get("time_to_clarity", "medium"),
        "priority_note": base.get("priority_note", f"Worth discussion because it could turn {axis} into a sharper explanatory wedge for {cluster.lower()}."),
        "reading_extract": base.get("reading_extract", f"which variables change together with {axis}, what comparator is used, and whether any ablation already fixes {confound}"),
        "title": base.get("title", f"What {axis} is really doing"),
    }
def _result_fact(note: dict[str, Any]) -> str:
    candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
    scored: list[str] = []
    for cand in candidates:
        low = cand.lower()
        if any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "outperform", "achiev", "%"]) or any(ch.isdigit() for ch in cand):
            scored.append(cand)
    chosen = scored[0] if scored else _claim_text(note)
    chosen = re.sub(r"^(for instance|for example|e\.g\.)[:,]?\s*", "", chosen, flags=re.IGNORECASE)
    low = chosen.lower()
    tasks = _extract_task_mentions(note)
    task_text = "/".join(tasks[:2]) if tasks else "reported setting"
    percents = re.findall(r"\d+(?:\.\d+)?%", chosen)
    if "pass@1" in low and percents:
        if len(percents) >= 2:
            return clean_text(f"{task_text}: {percents[0]} pass@1 vs {percents[1]} in the reported comparison.", limit=95)
        return clean_text(f"{task_text}: {percents[0]} pass@1 in the reported comparison.", limit=95)
    if "success rate" in low and len(percents) >= 2:
        baseline = " over imitation/RL baselines" if "imitation" in low or "reinforcement learning" in low else " in the reported comparison"
        return clean_text(f"{task_text}: {percents[0]}/{percents[1]} success-rate gains{baseline}.", limit=95)
    if len(percents) >= 2:
        return clean_text(f"{task_text}: {percents[0]} vs {percents[1]} in the reported comparison.", limit=95)
    if len(percents) == 1:
        return clean_text(f"{task_text}: reported result reaches {percents[0]} in the abstract.", limit=95)
    return clean_text(chosen, limit=95)
def _metric_phrase(notes: list[dict[str, Any]]) -> str:
blob = " ".join(_result_fact(note) for note in notes)
low = blob.lower()
if "pass@1" in low:
return "pass@1 plus failure-type shifts"
if "success rate" in low:
return "success rate plus failure-type shifts"
if "accuracy" in low:
return "accuracy plus failure-type shifts"
if "exact match" in low:
return "exact match plus failure-type shifts"
if "f1" in low:
return "F1 plus failure-type shifts"
if "%" in blob:
return "the reported benchmark metric plus failure-type shifts"
return "task success plus failure-type shifts"
def _sentence_has_concrete_hook(text: str) -> bool:
raw = str(text or "")
low = raw.lower()
if any(ch.isdigit() for ch in raw):
return True
if any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "alfworld", "webshop", "hotpotqa", "fever", "humaneval", "webarena", "agentbench"]):
return True
return False
def _extract_task_mentions(note: dict[str, Any]) -> list[str]:
blob_parts: list[str] = []
for field in ["key_results", "summary_bullets", "method", "abstract", "title"]:
val = note.get(field)
if isinstance(val, list):
blob_parts.extend([str(x) for x in val])
elif isinstance(val, str):
blob_parts.append(val)
blob = " ".join(blob_parts)
low = blob.lower()
positions: list[tuple[int, str]] = []
for hint in BENCHMARK_HINTS:
pos = low.find(hint.lower())
if pos >= 0:
positions.append((pos, hint))
for match in re.finditer(r"\b(?:HumanEval|ALFWorld|WebShop|HotpotQA|FEVER|WebArena|AgentBench|GSM8K|MATH)\b", blob):
positions.append((match.start(), match.group(0)))
# Order mentions by first appearance in the text; ties fall back to the name.
positions.sort()
return uniq_keep_order(name for _, name in positions)
def _task_phrase(notes: list[dict[str, Any]]) -> str:
tasks: list[str] = []
for note in notes:
tasks.extend(_extract_task_mentions(note))
uniq = uniq_keep_order(tasks)
if not uniq:
return "a small public task slice"
if len(uniq) == 1:
return uniq[0]
return "/".join(uniq[:2])
def _claim_text(note: dict[str, Any]) -> str:
candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
for cand in candidates:
if len(cand.split()) >= 8:
return cand
return clean_text(note.get("method") or note.get("title") or "", limit=220)
def _limitation_text(note: dict[str, Any], axis: str, cluster: str) -> str:
limitations = _specific_limitations(note)
if limitations and not any(pat in limitations[0].lower() for pat in GENERIC_LIMITATION_PATTERNS):
return limitations[0]
profile = _axis_profile(axis, cluster)
return f"The paper still leaves unclear whether {axis} is genuinely responsible for the reported story in {cluster.lower()}, or whether that story is really driven by {profile['confound']}."
def _anchor_notes(note_index: dict[str, dict[str, Any]], paper_ids: list[str]) -> list[dict[str, Any]]:
return [note_index[pid] for pid in paper_ids if pid in note_index]
def _paper_annotation(note: dict[str, Any], axis: str, cluster: str) -> str:
title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
profile = _axis_profile(axis, cluster)
result = _result_fact(note)
return clean_sentence(
f"{title}: {result} Open gap: {axis} still moves with {profile['confound']}.",
limit=300,
)
def _synthesis_annotation(notes: list[dict[str, Any]], axis: str, cluster: str) -> str:
profile = _axis_profile(axis, cluster)
task_phrase = _task_phrase(notes)
return clean_sentence(
f"Across the anchor papers, the live question is whether {axis} changes interpretation itself or merely rides along with {profile['confound']}, especially on {task_phrase}.",
limit=260,
)
def build_signal_rows(*, cluster: str, notes: list[dict[str, Any]]) -> list[IdeaSignal]:
if not notes:
return []
direction_type, axes, academic_value = _cluster_profile(cluster)
evidence_confidence = _evidence_confidence(notes)
anchors = notes[:3]
paper_ids = uniq_keep_order(str(n.get("paper_id") or "").strip() for n in anchors)
signals: list[IdeaSignal] = []
for idx, axis in enumerate(axes[:3], start=1):
# Rotate through the anchors so each axis is paired with a different paper where possible.
anchor = anchors[(idx - 1) % len(anchors)]
claim = _claim_text(anchor)
tension = _limitation_text(anchor, axis, cluster)
profile = _axis_profile(axis, cluster)
missing_piece = clean_sentence(
f"What is still missing is a direction that isolates whether {axis} is the real explanatory variable in {cluster.lower()}, rather than leaving it entangled with {profile['confound']}.",
limit=240,
)
signals.append(IdeaSignal(
signal_id=f"SIG-{slugify(cluster)[:16]}-{idx}",
cluster=cluster,
direction_type=direction_type,
theme=f"{profile['title']} in {cluster}",
claim_or_observation=claim,
tension=clean_text(tension, limit=220),
missing_piece=missing_piece,
possible_axis=axis,
academic_value=academic_value,
evidence_confidence=evidence_confidence,
paper_ids=paper_ids,
))
return signals
def signal_table_markdown(rows: list[IdeaSignal]) -> str:
table_rows: list[list[str]] = []
for row in rows:
table_rows.append([
row.signal_id,
row.cluster,
row.theme,
row.claim_or_observation,
row.tension,
row.missing_piece,
row.possible_axis,
row.academic_value,
row.evidence_confidence,
", ".join(row.paper_ids),
])
return "\n".join([
"# IDEA_SIGNAL_TABLE",
"",
"## Research signals",
"",
markdown_table(["Signal ID", "Cluster", "Theme", "Claim / observation", "Tension", "Missing piece", "Possible axis", "Academic value", "Confidence", "Paper IDs"], table_rows),
"",
])
def _title_from_signal(signal: IdeaSignal) -> str:
profile = _axis_profile(signal.possible_axis, signal.cluster)
return clean_sentence(profile["title"], limit=72)
def _best_fit(direction_type: str, axis: str) -> str:
if direction_type == "governance":
return "Best fit when the group wants a deployment-relevant reading of agent behavior rather than another internal benchmark story."
if direction_type == "evaluation":
return "Best fit when the discussion is really about what counts as convincing evidence, not just how to raise scores."
if axis in {"search depth", "verification loop", "retrieval policy"}:
return "Best fit for a PhD discussion that wants one sharper mechanism/confound question rather than a broad systems buildout."
return "Best fit for PI/PhD discussions that want a thesis-worthy mechanism question rather than a generic benchmark wrapper."
def signals_to_direction_cards(signals: list[IdeaSignal], *, note_index: dict[str, dict[str, Any]], focus_clusters: list[str], pool_min: int, pool_max: int) -> list[DirectionCard]:
focus = {x.strip() for x in focus_clusters if str(x).strip()}
cards: list[DirectionCard] = []
for idx, signal in enumerate(signals, start=1):
profile = _axis_profile(signal.possible_axis, signal.cluster)
anchors = _anchor_notes(note_index, signal.paper_ids)
task_phrase = _task_phrase(anchors)
metric_phrase = _metric_phrase(anchors)
nearest = anchors[0] if anchors else {}
nearest_title = clean_text((nearest.get("title") if isinstance(nearest, dict) else "") or (signal.paper_ids[0] if signal.paper_ids else signal.cluster), limit=90)
nearest_task_phrase = _task_phrase([nearest]) if nearest else task_phrase
literature_suggests = [_paper_annotation(note, signal.possible_axis, signal.cluster) for note in anchors[:2]]
literature_suggests.append(_synthesis_annotation(anchors, signal.possible_axis, signal.cluster))
closest_prior_gap = [
clean_sentence(
f"{nearest_title} is the closest prior anchor because it already reports concrete behavior on {nearest_task_phrase}. The unresolved point is whether those gains survive once {profile['confound']} is held fixed while {signal.possible_axis} is varied.",
limit=220,
),
clean_sentence(
"The novelty test here is narrow, not rhetorical: if a strong anchor paper already runs that single-variable control, this direction should collapse quickly rather than stay alive as a vague confound story.",
limit=220,
),
]
one_line_thesis = clean_sentence(profile["thesis"], limit=230)
why_interesting = clean_sentence(
f"This is not just another benchmark wedge. It opens a {profile['program_kind']} line around {signal.possible_axis}: {profile['question']}.",
limit=220,
)
missing_piece = clean_sentence(profile["missing_piece"], limit=230)
possible_variants = [
clean_sentence(f"Single-variable control on {task_phrase}: vary {signal.possible_axis} while holding {profile['confound']} fixed as far as the setup allows.", limit=210),
clean_sentence("Replace score-only reporting with a failure-type comparison after the same intervention, so the result says more than whether the average went up.", limit=210),
clean_sentence("Check whether the conclusion survives on one simple public task slice and one more tool- or environment-heavy setting before treating it as a general claim.", limit=210),
]
first_probes = [
clean_sentence(
f"Intervention: vary {signal.possible_axis} while holding {profile['confound']} as fixed as possible on {task_phrase}. Readout: {metric_phrase}. Decisive if the interpretation changes even after the control.",
limit=220,
),
clean_sentence(
f"Prior-work audit: inspect {nearest_title} for any ablation that already fixes {profile['confound']}, and demote this direction if the paper's conclusion survives that control.",
limit=180,
),
]
what_counts_as_insight = clean_sentence(profile["insight"], limit=190)
kill_criteria = [
clean_sentence(f"Kill quickly if {profile['kill_signal']}.", limit=170),
clean_sentence(f"Kill if the first controlled probe leaves both {metric_phrase} and the failure taxonomy essentially unchanged.", limit=170),
]
weakness_conditions = [
clean_sentence(f"This direction is weaker if changing {signal.possible_axis} mostly rescales the aggregate metric without changing which failure modes appear or disappear.", limit=175),
clean_sentence(profile["demotion"], limit=170),
]
anchor_reading_notes: list[dict[str, str]] = []
for note in anchors[:3]:
title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
result = _result_fact(note)
anchor_reading_notes.append({
"paper_title": title,
"why_read": clean_sentence(f"Closest {signal.cluster.lower()} anchor with a concrete result hook on {task_phrase}: {result}", limit=180),
"what_to_extract": clean_sentence(f"Extract the metric and comparator, then check whether {profile['reading_extract']}.", limit=180),
"current_hook": clean_sentence(result, limit=180),
"kill_signal": clean_sentence(f"Weaken this direction if the paper already shows {profile['kill_signal']}.", limit=180),
})
why_this_ranks_here = clean_sentence(profile["priority_note"], limit=170)
cards.append(DirectionCard(
direction_id=f"DIR-{idx:03d}",
cluster=signal.cluster,
direction_type=signal.direction_type,
title=_title_from_signal(signal),
focus_axis=signal.possible_axis,
main_confound=profile["confound"],
program_kind=profile["program_kind"],
contribution_shape=profile["contribution_shape"],
time_to_clarity=profile["time_to_clarity"],
one_line_thesis=one_line_thesis,
why_interesting=why_interesting,
literature_suggests=literature_suggests,
closest_prior_gap=closest_prior_gap,
missing_piece=missing_piece,
possible_variants=possible_variants,
academic_value=profile["contribution_shape"],
first_probes=first_probes,
what_counts_as_insight=what_counts_as_insight,
weakness_conditions=weakness_conditions,
kill_criteria=kill_criteria,
what_would_change_mind=kill_criteria,
best_fit=_best_fit(signal.direction_type, signal.possible_axis),
why_this_ranks_here=why_this_ranks_here,
evidence_confidence=signal.evidence_confidence,
paper_ids=signal.paper_ids,
signal_ids=[signal.signal_id],
anchor_reading_notes=anchor_reading_notes,
))
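# Focus clusters sort first; within each group order by cluster, program kind, then title.
cards.sort(key=lambda c: (0 if c.cluster in focus else 1, c.cluster, c.program_kind, c.title))
# Clamp the pool size to [pool_min, pool_max]; slicing past the end is harmless,
# so a pool with fewer than pool_min cards is returned whole.
target = max(pool_min, min(pool_max, len(cards)))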
return cards[:target]
def direction_pool_markdown(cards: list[DirectionCard]) -> str:
rows: list[list[str]] = []
for card in cards:
rows.append([
card.direction_id,
card.cluster,
card.direction_type,
card.program_kind,
card.title,
clean_sentence(card.one_line_thesis, limit=100),
clean_sentence(card.why_interesting, limit=100),
clean_sentence(card.missing_piece, limit=100),
clean_sentence(" / ".join(card.possible_variants), limit=120),
clean_sentence(card.academic_value, limit=90),
clean_sentence(" / ".join(card.first_probes), limit=120),
card.evidence_confidence,
", ".join(card.paper_ids),
])
return "\n".join([
"# IDEA_DIRECTION_POOL",
"",
"## Candidate research directions",
"",
markdown_table(["Direction ID", "Cluster", "Type", "Program", "Title", "One-line thesis", "Why interesting", "Missing piece", "Possible variants", "Academic value", "First probes", "Confidence", "Paper IDs"], rows),
"",
])
def _thesis_potential(direction_type: str, cluster: str) -> int:
if direction_type in {"mechanism", "coordination", "adaptation"}:
return 5
if direction_type in {"systems", "governance"}:
return 4
if "benchmark" in cluster.lower() or direction_type == "evaluation":
return 3
return 4
def score_direction_cards(
cards: list[DirectionCard],
*,
focus_clusters: list[str],
keep_rank_max: int,
maybe_rank_max: int,
score_weights: dict[str, float] | None = None,
) -> list[ScreenedDirection]:
focus = {x.strip() for x in focus_clusters if str(x).strip()}
keep_rank_max = int(keep_rank_max)
maybe_rank_max = int(maybe_rank_max)
if keep_rank_max <= 0:
raise ValueError("Invalid ideation scoring contract: keep_rank_max must be positive.")
if maybe_rank_max < keep_rank_max:
raise ValueError("Invalid ideation scoring contract: maybe_rank_max must be >= keep_rank_max.")
weights = _validate_score_weights(score_weights or {}, field_name="score_direction_cards.score_weights")
cluster_counts: dict[str, int] = {}
program_counts: dict[str, int] = {}
for card in cards:
cluster_counts[card.cluster] = cluster_counts.get(card.cluster, 0) + 1
program_counts[card.program_kind] = program_counts.get(card.program_kind, 0) + 1
provisional: list[tuple[DirectionCard, float, int, int, int, int, int, int]] = []
for card in cards:
discussion_worthiness = 5 if card.cluster in focus or card.time_to_clarity == "fast" else 4
academic_value_score = 5 if any(token in card.contribution_shape.lower() for token in ["rule", "protocol", "regime map", "causal"]) else 4
concrete_hooks = sum(1 for item in card.literature_suggests if _sentence_has_concrete_hook(item))
evidence_grounding = 5 if len(card.paper_ids) >= 3 and concrete_hooks >= 2 else 4 if len(card.paper_ids) >= 2 and concrete_hooks >= 1 else 3
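# Distinctness: 5 when both the program kind and the cluster are unique in the pool, 4 when either is, else 3.
if program_counts.get(card.program_kind, 0) == 1 and cluster_counts.get(card.cluster, 0) == 1: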
direction_distinctness = 5
elif program_counts.get(card.program_kind, 0) == 1 or cluster_counts.get(card.cluster, 0) == 1:
direction_distinctness = 4
else:
direction_distinctness = 3
probe_blob = " ".join(card.first_probes).lower()
first_probe_clarity = 5 if all(token in probe_blob for token in ["intervention:", "readout:", "decisive if"]) else 4 if "intervention:" in probe_blob else 3
thesis_potential = _thesis_potential(card.direction_type, card.cluster)
total = round(
discussion_worthiness * weights["discussion_worthiness"]
+ academic_value_score * weights["academic_value"]
+ evidence_grounding * weights["evidence_grounding"]
+ direction_distinctness * weights["direction_distinctness"]
+ first_probe_clarity * weights["first_probe_clarity"]
+ thesis_potential * weights["thesis_potential"],
2,
)
provisional.append((card, total, discussion_worthiness, academic_value_score, evidence_grounding, direction_distinctness, first_probe_clarity, thesis_potential))
# Rank by total score, highest first; ties go to fast time-to-clarity, then cluster and title.
provisional.sort(key=lambda row: (-row[1], 0 if row[0].time_to_clarity == "fast" else 1, row[0].cluster, row[0].title))
rows: list[ScreenedDirection] = []
for idx, (card, total, dw, av, eg, dd, fp, tp) in enumerate(provisional, start=1):
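# Ranks up to keep_rank_max are kept, ranks up to maybe_rank_max are held as "maybe", and the rest are dropped.
recommendation = "keep" if idx <= keep_rank_max else "maybe" if idx <= maybe_rank_max else "drop"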
strengths: list[str] = []
if eg >= 5:
strengths.append("concrete anchor evidence")
if dd >= 5:
strengths.append(f"distinct {card.program_kind} wedge")
if fp >= 5:
strengths.append("clean first probe")
if av >= 5:
strengths.append("thesis-sized payoff")
if not strengths:
strengths.append("discussion value")
risk = "still abstract-first" if str(card.evidence_confidence).startswith("low") else f"main risk is controlling {card.main_confound} cleanly"
rationale = clean_sentence(f"Strongest on {' + '.join(strengths[:2])}; {risk}.", limit=160)
rows.append(ScreenedDirection(
direction_id=card.direction_id,
cluster=card.cluster,
direction_type=card.direction_type,
title=card.title,
total_score=total,
discussion_worthiness=dw,
academic_value_score=av,
evidence_grounding=eg,
direction_distinctness=dd,
first_probe_clarity=fp,
thesis_potential=tp,
recommendation=recommendation,
rationale=rationale,
))
return rows
def screening_table_markdown(rows: list[ScreenedDirection]) -> str:
data: list[list[str]] = []
for row in rows:
data.append([
row.direction_id,
row.cluster,
row.direction_type,
row.title,
f"{row.total_score:.2f}",
str(row.discussion_worthiness),
str(row.academic_value_score),
str(row.evidence_grounding),
str(row.direction_distinctness),
str(row.first_probe_clarity),
str(row.thesis_potential),
row.recommendation,
row.rationale,
])
return "\n".join([
"# IDEA_SCREENING_TABLE",
"",
"## Discussion-first screening",
"",
markdown_table(["Direction ID", "Cluster", "Type", "Title", "Total", "Discussion", "Academic value", "Evidence", "Distinctness", "First probe", "Thesis potential", "Decision", "Rationale"], data),
"",
])
def shortlist_markdown(records: list[dict[str, Any]]) -> str:
lines = ["# IDEA_SHORTLIST", "", "## Prioritized directions", ""]
for record in records:
lines.extend([
f"### Direction {record['rank']}. {record['title']}",
f"- Cluster: {record['cluster']}",
f"- Type: {record['direction_type']}",
f"- Focus axis: {record.get('focus_axis')}",
f"- Program kind: {record.get('program_kind')}",
f"- Main confound: {record.get('main_confound')}",
f"- Time to clarity: {record.get('time_to_clarity')}",
f"- One-line thesis: {record['one_line_thesis']}",
f"- Why this is interesting: {record['why_interesting']}",
f"- Why this ranks here: {record['why_this_ranks_here']}",
f"- Contribution shape: {record.get('contribution_shape')}",
"- What the literature already suggests:",
])
for item in record.get("literature_suggests") or []:
lines.append(f" - {item}")
lines.append("- Closest prior work and why it does not settle the question:")
for item in record.get("closest_prior_gap") or []:
lines.append(f" - {item}")
lines.extend([
f"- What is still missing: {record['missing_piece']}",
"- Possible variants:",
])
for item in record.get("possible_variants") or []:
lines.append(f" - {item}")
lines.extend([
f"- Why this could matter academically: {record['academic_value']}",
"- First probes:",
])
for item in record.get("first_probes") or []:
lines.append(f" - {item}")
lines.extend([
f"- What would count as actual insight: {record['what_counts_as_insight']}",
"- What would make this weak or unconvincing:",
])
for item in record.get("weakness_conditions") or []:
lines.append(f" - {item}")
lines.append("- Quick kill criteria:")
for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
lines.append(f" - {item}")
lines.extend([
f"- Best fit: {record['best_fit']}",
f"- Evidence confidence: {record['evidence_confidence']}",
"- Anchor papers: " + ", ".join(f"`{pid}`" for pid in record.get("paper_ids") or []),
f"- Why prioritized now: {record['why_prioritized']}",
"",
])
return "\n".join(lines)
def shortlist_snapshot_table(records: list[dict[str, Any]]) -> str:
rows: list[list[str]] = []
for rec in records:
rows.append([
str(rec.get("rank") or ""),
rec.get("title", ""),
clean_sentence(rec.get("why_this_ranks_here", ""), limit=52),
clean_sentence(rec.get("contribution_shape", rec.get("academic_value", "")), limit=56),
clean_sentence((rec.get("kill_criteria") or rec.get("what_would_change_mind") or [""])[0], limit=54),
])
return markdown_table(["Rank", "Direction", "Why now", "If it survives", "Fast kill signal"], rows)
def build_report_payload(*, topic: str, shortlist: list[dict[str, Any]], deferred: list[dict[str, Any]], trace_paths: dict[str, str]) -> dict[str, Any]:
top = list(shortlist)
takeaways: list[str] = []
if top:
takeaways.append("The strongest directions are the ones most likely to change how existing results are interpreted, not just add another benchmark win.")
takeaways.append("The current rank order reflects a tradeoff between time-to-clarity, thesis-sized payoff, and how concrete the nearest prior-work gap already looks.")
program_kinds = uniq_keep_order(str(rec.get("program_kind") or "").strip() for rec in top)
if len(program_kinds) >= 2:
takeaways.append(f"The lead set is intentionally not one confound template repeated three times: it spans {', '.join(program_kinds[:3])} rather than a single explanatory mold.")
else:
takeaways.append("The lead set still sits in one tight neighborhood, so the next reading pass should stress-test whether those directions are genuinely distinct thesis lines.")
if any(str(rec.get("evidence_confidence") or "").startswith("low") for rec in top):
takeaways.append("Most anchors are still abstract-first, so the memo is best used to decide the next reading and falsification pass rather than to lock a project immediately.")
else:
takeaways.append("The evidence is already concrete enough for a serious PI/PhD discussion, though one deeper paper pass could still reshuffle the exact order.")
lead = top[0] if top else {}
lead_anchor = (lead.get("anchor_reading_notes") or [{}])[0] if top else {}
lead_anchor_title = lead_anchor.get("paper_title") if isinstance(lead_anchor, dict) else "the first anchor paper"
discussion_questions = [
"Which direction still survives once we ask for a single-variable control rather than a suggestive confound story?",
"Which lead direction would still matter if the nearest prior work already addressed the obvious control we are worried about?",
"Which candidate could plausibly turn into a thesis line with a reusable method, protocol, or regime map rather than a one-off empirical note?",
"Which ranking would change most after one full-paper reading pass on the anchor set?",
]
uncertainties = [
"The main remaining risk is novelty risk, not idea scarcity: closer reading may reveal that one lead direction is already settled by an existing control or ablation.",
"Evidence confidence is still bounded by abstract-first notes for part of the lead set, so paper-specific details could either strengthen or kill a direction quickly.",
"The memo is more reliable as a discussion and triage artifact than as a final commitment on exact project order.",
]
next_steps = []
if top:
next_steps.append(clean_sentence(f"Start with {lead.get('title')}: read {lead_anchor_title or 'the first anchor paper'} looking specifically for whether {lead.get('focus_axis')} is already isolated against {lead.get('main_confound')}.", limit=220))
first_probe = (lead.get("first_probes") or [""])[0]
if first_probe:
next_steps.append(clean_sentence(first_probe, limit=220))
next_steps.append(clean_sentence(f"Re-rank only after checking the quick kill criteria for {lead.get('title')} and at least one competing direction in the lead set.", limit=220))
else:
next_steps.extend([
"Read the anchor papers for the current top direction in full and test whether the memo's stated hidden variable still looks unresolved.",
"In the next PI/PhD discussion, use the ranking arguments and kill criteria to decide whether any direction deserves promotion into a sharper thesis candidate.",
])
return {
"topic": topic,
"takeaways": takeaways,
"top_directions": top,
"deferred_directions": deferred,
"discussion_questions": discussion_questions,
"uncertainties": uncertainties,
"next_steps": next_steps,
"trace_artifacts": trace_paths,
}
def report_markdown(payload: dict[str, Any]) -> str:
top_dirs = payload.get("top_directions") or []
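# Sections 0-2 are fixed and each top direction takes one section starting at 3,
# so the trailing sections begin at 3 + len(top_dirs).
deferred_idx = 3 + len(top_dirs)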
discussion_idx = deferred_idx + 1
uncertainty_idx = deferred_idx + 2
next_idx = deferred_idx + 3
appendix_idx = deferred_idx + 4
lines = [
"# Research Idea Brainstorm Memo",
"",
"## 0. Scope and framing",
"",
f"- Topic: {payload.get('topic') or 'research ideas'}",
"- Intended readers: PI / PhD",
"- Goal: surface a small number of discussion-worthy research directions rather than force a final project choice.",
"- Current evidence basis: abstract-first notes unless otherwise stated.",
"- What this memo is not: not a final project spec, not a survey draft, and not a symmetric top-3 proposal pack.",
"",
"## 1. Big-picture takeaways",
"",
]
for item in payload.get("takeaways") or []:
lines.append(f"- {item}")
lines.extend([
"",
"## 2. Top directions at a glance",
"",
shortlist_snapshot_table(payload.get("top_directions") or []),
"",
])
for idx, record in enumerate(top_dirs, start=1):
lines.extend([
f"## {idx + 2}. Direction {idx} — {record.get('title')}",
"",
"### One-line thesis",
f"- {record.get('one_line_thesis')}",
"",
"### Why it belongs in the lead set",
f"- {record.get('why_this_ranks_here')}",
f"- Time to clarity: {record.get('time_to_clarity')}",
"",
"### What the current literature actually shows",
])
for item in record.get("literature_suggests") or []:
lines.append(f"- {item}")
lines.extend([
"",
"### Closest prior work and remaining gap",
])
for item in record.get("closest_prior_gap") or []:
lines.append(f"- {item}")
lines.extend([
f"- Missing piece: {record.get('missing_piece')}",
"",
"### If this direction is right, what contribution emerges",
f"- {record.get('contribution_shape') or record.get('academic_value')}",
f"- {record.get('what_counts_as_insight')}",
"",
"### Smallest decisive probe",
])
for item in record.get("first_probes") or []:
lines.append(f"- {item}")
lines.extend([
"",
"### Quick kill criteria",
])
for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
lines.append(f"- {item}")
lines.append("")
lines.extend([
f"## {deferred_idx}. Other promising but not prioritized directions",
"",
])
deferred = payload.get("deferred_directions") or []
if deferred:
for record in deferred:
lines.append(f"- **{record.get('title')}** — {record.get('one_line_thesis')} (why not prioritized now: {record.get('why_not_prioritized')})")
else:
lines.append("- No additional deferred directions were retained in the current memo.")
lines.extend([
"",
f"## {discussion_idx}. Cross-cutting discussion questions",
"",
])
for item in payload.get("discussion_questions") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {uncertainty_idx}. Uncertainty and disagreement",
"",
])
for item in payload.get("uncertainties") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {next_idx}. Suggested next reading / next discussion step",
"",
])
for item in payload.get("next_steps") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {appendix_idx}. Appendix guide",
"",
"- `output/APPENDIX.md` for anchor-paper reading notes and deferred directions.",
"- `output/REPORT.json` for the structured version of this memo.",
])
return "\n".join(lines)
def appendix_markdown(payload: dict[str, Any], *, core_titles: dict[str, str]) -> str:
lines = [
"# Appendix to the Research Idea Brainstorm Memo",
"",
"## A. Deferred but still promising directions",
"",
]
deferred = payload.get("deferred_directions") or []
if deferred:
for rec in deferred:
lines.extend([
f"### {rec.get('title')}",
f"- One-line thesis: {rec.get('one_line_thesis')}",
f"- Program kind: {rec.get('program_kind')}",
f"- Why interesting: {rec.get('why_interesting')}",
f"- Why not prioritized now: {rec.get('why_not_prioritized')}",
"",
])
else:
lines.append("- (none)")
lines.extend([
"## B. Anchor papers and what to extract",
"",
])
for rec in payload.get("top_directions") or []:
lines.append(f"### {rec.get('title')}")
lines.append(f"- Program kind: {rec.get('program_kind')}")
lines.append(f"- Lead-set reason: {rec.get('why_this_ranks_here')}")
notes = rec.get("anchor_reading_notes") or []
if notes:
rows: list[list[str]] = []
for item in notes:
if not isinstance(item, dict):
continue
rows.append([
item.get("paper_title", "paper"),
item.get("why_read", ""),
item.get("what_to_extract", ""),
item.get("current_hook", ""),
item.get("kill_signal", ""),
])
if rows:
lines.append(markdown_table(["Anchor paper", "Why read now", "What to extract", "Current evidence hook", "Kill signal"], rows))
else:
for pid in rec.get("paper_ids") or []:
title = core_titles.get(pid, pid)
lines.append(f"- {title}")
else:
for pid in rec.get("paper_ids") or []:
title = core_titles.get(pid, pid)
lines.append(f"- {title}")
lines.append("")
return "\n".join(lines)
FILE:tooling/pipeline_spec.py
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import Any
import yaml
@dataclass(frozen=True)
class PipelineStage:
id: str
title: str
checkpoint: str
mode: str
required_skills: tuple[str, ...]
optional_skills: tuple[str, ...]
produces: tuple[str, ...]
human_checkpoint: dict[str, Any]
@staticmethod
def from_mapping(stage_id: str, data: Any, *, path: Path) -> "PipelineStage":
if not isinstance(data, dict):
raise ValueError(f"`stages.{stage_id}` must be a mapping in {path}")
return PipelineStage(
id=str(stage_id or "").strip(),
title=str(data.get("title") or stage_id).strip() or str(stage_id),
checkpoint=str(data.get("checkpoint") or stage_id).strip() or str(stage_id),
mode=str(data.get("mode") or "").strip(),
required_skills=_string_tuple(data.get("required_skills"), field_name=f"stages.{stage_id}.required_skills", path=path),
optional_skills=_string_tuple(data.get("optional_skills"), field_name=f"stages.{stage_id}.optional_skills", path=path),
produces=_string_tuple(data.get("produces"), field_name=f"stages.{stage_id}.produces", path=path),
human_checkpoint=_mapping(data.get("human_checkpoint"), field_name=f"stages.{stage_id}.human_checkpoint", path=path, required=False),
)
@dataclass(frozen=True)
class PipelineSpec:
path: Path
name: str
version: str
units_template: str
default_checkpoints: tuple[str, ...]
profile: str
routing_hints: tuple[str, ...]
routing_default: bool
routing_priority: int
target_artifacts: tuple[str, ...]
contract_model: str
structure_mode: str
pre_retrieval_shell: dict[str, Any]
binding_layers: tuple[str, ...]
core_chapter_h3_target: int
query_defaults: dict[str, Any]
overridable_query_fields: tuple[str, ...]
quality_contract: dict[str, Any]
loop_policy: dict[str, Any]
stages: dict[str, PipelineStage]
variant_of: str
variant_overrides: dict[str, Any]
@staticmethod
def load(path: Path) -> "PipelineSpec":
resolved = path.resolve()
raw_frontmatter, frontmatter = _load_variant_aware_frontmatter(resolved)
name = str(frontmatter.get("name") or path.stem.replace(".pipeline", ""))
units_template = str(frontmatter.get("units_template") or "")
version = str(frontmatter.get("version") or "")
default_checkpoints = _string_tuple(frontmatter.get("default_checkpoints"), field_name="default_checkpoints", path=resolved)
profile = str(frontmatter.get("profile") or "default").strip() or "default"
routing_hints = _string_tuple(frontmatter.get("routing_hints"), field_name="routing_hints", path=resolved)
routing_default = bool(frontmatter.get("routing_default"))
# Fall back to 0 when routing_priority is missing or cannot be converted to an integer.
try:
routing_priority = int(frontmatter.get("routing_priority") or 0)
except Exception:
routing_priority = 0
target_artifacts = _string_tuple(frontmatter.get("target_artifacts"), field_name="target_artifacts", path=resolved)
contract_model = str(frontmatter.get("contract_model") or "").strip()
structure_mode = str(frontmatter.get("structure_mode") or "").strip()
pre_retrieval_shell = _mapping(frontmatter.get("pre_retrieval_shell"), field_name="pre_retrieval_shell", path=resolved, required=False)
binding_layers = _string_tuple(frontmatter.get("binding_layers"), field_name="binding_layers", path=resolved)
# Fall back to 0 when core_chapter_h3_target is missing or cannot be converted to an integer.
try:
core_chapter_h3_target = int(frontmatter.get("core_chapter_h3_target") or 0)
except Exception:
core_chapter_h3_target = 0
query_defaults = _mapping(frontmatter.get("query_defaults"), field_name="query_defaults", path=resolved, required=False)
overridable_query_fields = _string_tuple(
frontmatter.get("overridable_query_fields"),
field_name="overridable_query_fields",
path=resolved,
)
quality_contract = _mapping(frontmatter.get("quality_contract"), field_name="quality_contract", path=resolved, required=False)
loop_policy = _mapping(frontmatter.get("loop_policy"), field_name="loop_policy", path=resolved, required=False)
stages = _parse_stages(frontmatter.get("stages"), path=resolved)
variant_of = str(raw_frontmatter.get("variant_of") or "").strip()
variant_overrides = _mapping(raw_frontmatter.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False)
if not units_template:
raise ValueError(f"Missing units_template in pipeline front matter: {resolved}")
return PipelineSpec(
path=resolved,
name=name,
version=version,
units_template=units_template,
default_checkpoints=default_checkpoints,
profile=profile,
routing_hints=routing_hints,
routing_default=routing_default,
routing_priority=routing_priority,
target_artifacts=target_artifacts,
contract_model=contract_model,
structure_mode=structure_mode,
pre_retrieval_shell=pre_retrieval_shell,
binding_layers=binding_layers,
core_chapter_h3_target=core_chapter_h3_target,
query_defaults=query_defaults,
overridable_query_fields=overridable_query_fields,
quality_contract=quality_contract,
loop_policy=loop_policy,
stages=stages,
variant_of=variant_of,
variant_overrides=variant_overrides,
)
    def query_default(self, key: str, default: Any = None) -> Any:
        return self.query_defaults.get(str(key or "").strip(), default)
    def allows_query_override(self, key: str) -> bool:
        return str(key or "").strip() in set(self.overridable_query_fields)
def _parse_frontmatter(text: str) -> dict[str, Any]:
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("Pipeline file must start with YAML front matter '---'")
    end_idx = None
    for idx in range(1, len(lines)):
        if lines[idx].strip() == "---":
            end_idx = idx
            break
    if end_idx is None:
        raise ValueError("Unterminated YAML front matter (missing closing '---')")
    raw = "\n".join(lines[1:end_idx])
    data = yaml.safe_load(raw) or {}
    if not isinstance(data, dict):
        raise ValueError("Pipeline YAML front matter must be a mapping")
    return data
def _load_variant_aware_frontmatter(path: Path, seen: set[Path] | None = None) -> tuple[dict[str, Any], dict[str, Any]]:
    resolved = path.resolve()
    active = set(seen or set())
    if resolved in active:
        chain = " -> ".join(str(p) for p in [*active, resolved])
        raise ValueError(f"Cyclic `variant_of` chain detected: {chain}")
    active.add(resolved)
    text = resolved.read_text(encoding="utf-8")
    raw = _parse_frontmatter(text)
    variant_of = str(raw.get("variant_of") or "").strip()
    if not variant_of:
        return raw, dict(raw)
    _validate_raw_variant_frontmatter(raw, path=resolved)
    base_path = _resolve_pipeline_reference(resolved, variant_of)
    _, base_effective = _load_variant_aware_frontmatter(base_path, seen=active)
    current = {key: raw[key] for key in ("name", "version") if key in raw}
    merged = _deep_merge(base_effective, current)
    merged = _deep_merge(
        merged,
        _mapping(raw.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False),
    )
    merged["variant_of"] = variant_of
    merged["variant_overrides"] = raw.get("variant_overrides") or {}
    return raw, merged
def _validate_raw_variant_frontmatter(raw: dict[str, Any], *, path: Path) -> None:
    allowed_raw_keys = {"name", "version", "variant_of", "variant_overrides"}
    extra_keys = sorted(str(key) for key in raw.keys() if str(key) not in allowed_raw_keys)
    if extra_keys:
        raise ValueError(
            "Variant pipeline files may only keep top-level keys "
            "`name`, `version`, `variant_of`, `variant_overrides`; "
            f"move the rest under `variant_overrides` in {path}: {', '.join(extra_keys)}"
        )
def _resolve_pipeline_reference(path: Path, ref: str) -> Path:
    value = str(ref or "").strip()
    if not value:
        raise ValueError(f"Empty `variant_of` reference in {path}")
    candidate = Path(value)
    if candidate.is_absolute() and candidate.exists():
        return candidate.resolve()
    repo_root = path.parent.parent
    for base in (path.parent, repo_root, repo_root / "pipelines"):
        direct = (base / value).resolve()
        if direct.exists():
            return direct
    stem = Path(value).name
    if stem.endswith(".pipeline.md"):
        stem = stem[: -len(".pipeline.md")]
    if stem:
        for base in (path.parent, repo_root / "pipelines"):
            direct = (base / f"{stem}.pipeline.md").resolve()
            if direct.exists():
                return direct
    raise ValueError(f"Could not resolve `variant_of: {value}` from {path}")
def _deep_merge(base: Any, override: Any) -> Any:
    if isinstance(base, list) and isinstance(override, dict) and any(str(key).startswith("__") for key in override.keys()):
        return _apply_list_patch(base, override)
    if isinstance(base, dict) and isinstance(override, dict):
        merged: dict[str, Any] = {str(k): v for k, v in base.items()}
        for key, value in override.items():
            if key in merged:
                merged[key] = _deep_merge(merged[key], value)
            else:
                merged[key] = value
        return merged
    return override
def _apply_list_patch(base: list[Any], override: dict[str, Any]) -> list[Any]:
    allowed_keys = {"__append__", "__prepend__", "__remove__", "__replace__"}
    bad_keys = sorted(str(key) for key in override.keys() if str(key) not in allowed_keys)
    if bad_keys:
        raise ValueError(
            "List patch overrides only support "
            "`__append__`, `__prepend__`, `__remove__`, `__replace__`; "
            f"got: {', '.join(bad_keys)}"
        )
    if "__replace__" in override:
        replacement = override.get("__replace__")
        if not isinstance(replacement, list):
            raise ValueError("List patch `__replace__` must be a YAML list")
        return list(replacement)
    current = list(base)
    remove_values = override.get("__remove__", [])
    prepend_values = override.get("__prepend__", [])
    append_values = override.get("__append__", [])
    for field_name, value in (
        ("__remove__", remove_values),
        ("__prepend__", prepend_values),
        ("__append__", append_values),
    ):
        if not isinstance(value, list):
            raise ValueError(f"List patch `{field_name}` must be a YAML list")
    if remove_values:
        current = [item for item in current if item not in remove_values]
    if prepend_values:
        current = list(prepend_values) + current
    if append_values:
        current = current + list(append_values)
    return current
def _mapping(value: Any, *, field_name: str, path: Path, required: bool) -> dict[str, Any]:
    if value is None:
        return {}
    if not isinstance(value, dict):
        raise ValueError(f"`{field_name}` must be a mapping in {path}")
    return {str(key): item for key, item in value.items()}
def _string_tuple(value: Any, *, field_name: str, path: Path) -> tuple[str, ...]:
    if value is None:
        return ()
    if not isinstance(value, list):
        raise ValueError(f"`{field_name}` must be a YAML list in {path}")
    out: list[str] = []
    for item in value:
        text = str(item or "").strip()
        if text:
            out.append(text)
    return tuple(out)
def _parse_stages(value: Any, *, path: Path) -> dict[str, PipelineStage]:
    if value is None:
        return {}
    items: list[tuple[str, Any]] = []
    if isinstance(value, dict):
        items = [(str(stage_id or "").strip(), data) for stage_id, data in value.items()]
    elif isinstance(value, list):
        for idx, item in enumerate(value):
            if not isinstance(item, dict):
                raise ValueError(f"`stages[{idx}]` must be a mapping in {path}")
            stage_id = str(item.get("id") or item.get("stage") or "").strip()
            if not stage_id:
                raise ValueError(f"`stages[{idx}]` is missing `id` in {path}")
            items.append((stage_id, item))
    else:
        raise ValueError(f"`stages` must be a mapping or list in {path}")
    stages: dict[str, PipelineStage] = {}
    for stage_id, data in items:
        stage = PipelineStage.from_mapping(stage_id, data, path=path)
        if not stage.id:
            raise ValueError(f"`stages` contains an empty stage id in {path}")
        stages[stage.id] = stage
    return stages
FILE:tooling/pipeline_text.py
from __future__ import annotations
import json
import re
from pathlib import Path
from typing import Any
from tooling.common import load_yaml, read_jsonl
def slug_unit_id(unit_id: str) -> str:
    raw = str(unit_id or '').strip()
    out: list[str] = []
    for ch in raw:
        out.append(ch if ch.isalnum() else '_')
    safe = ''.join(out).strip('_')
    return f'S{safe}' if safe else 'S'
def load_outline_sections(path: Path) -> list[dict[str, Any]]:
    outline = load_yaml(path) if path.exists() else []
    if not isinstance(outline, list):
        return []
    out: list[dict[str, Any]] = []
    for sec in outline:
        if not isinstance(sec, dict):
            continue
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        subsections: list[dict[str, str]] = []
        for sub in sec.get('subsections') or []:
            if not isinstance(sub, dict):
                continue
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                subsections.append({'id': sub_id, 'title': sub_title})
        if sec_id and sec_title:
            out.append({'id': sec_id, 'title': sec_title, 'subsections': subsections})
    return out
def iter_h3_units(path: Path) -> list[dict[str, str]]:
    units: list[dict[str, str]] = []
    for sec in load_outline_sections(path):
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        for sub in sec.get('subsections') or []:
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                units.append(
                    {
                        'section_id': sec_id,
                        'section_title': sec_title,
                        'sub_id': sub_id,
                        'title': sub_title,
                    }
                )
    return units
def read_jsonl_map(path: Path, key: str) -> dict[str, dict[str, Any]]:
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        val = str(rec.get(key) or '').strip()
        if val:
            out[val] = rec
    return out
def read_bib_keys(path: Path) -> set[str]:
    if not path.exists() or path.stat().st_size <= 0:
        return set()
    text = path.read_text(encoding='utf-8', errors='ignore')
    return set(re.findall(r'(?im)^@\w+\s*\{\s*([^,\s]+)\s*,', text))
def uniq_keep_order(items: list[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or '').strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out
def clean_excerpt(text: str, *, limit: int = 220) -> str:
    s = str(text or '').strip()
    s = s.replace('\n', ' ')
    s = re.sub(r'\s+', ' ', s)
    s = s.strip(' "\'`')
    s = s.replace('|', ', ')
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(' ', 1)[0].strip()
    return clipped if clipped else s[:limit].strip()
def citation_list(keys: list[str], *, max_keys: int = 3) -> str:
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'[@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'
def inline_evidence_phrase(keys: list[str], *, max_keys: int = 3) -> str:
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'in [@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'
def heading_blocks(md: str) -> list[tuple[str, str]]:
    blocks: list[tuple[str, str]] = []
    current = ''
    lines: list[str] = []
    for raw in (md or '').splitlines():
        if raw.startswith('### '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = raw[4:].strip()
            lines = []
            continue
        if raw.startswith('## '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = ''
            lines = []
            continue
        if current:
            lines.append(raw)
    if current:
        blocks.append((current, '\n'.join(lines).strip()))
    return blocks
def dump_jsonl_lines(records: list[dict[str, Any]]) -> str:
    return '\n'.join(json.dumps(r, ensure_ascii=False) for r in records).rstrip() + ('\n' if records else '')
Retrieve paper metadata from arXiv using keyword queries and save results as JSONL (`papers/papers_raw.jsonl`). **Trigger**: arXiv, arxiv, paper search, meta...
---
name: arxiv-search
description: 'Retrieve paper metadata from arXiv using keyword queries and save results as JSONL (`papers/papers_raw.jsonl`).
**Trigger**: arXiv, arxiv, paper search, metadata retrieval, 文献检索, 论文检索, 拉取元数据, 离线导入.
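**Use when**: you need an initial paper set (Stage C1 of a survey/snapshot) sourced from arXiv (online retrieval or offline import of an export).
**Skip if**: a usable `papers/papers_raw.jsonl` already exists, or the data source is not arXiv.
**Network**: online retrieval needs network access; offline `--input <export.*>` does not.
**Guardrail**: metadata only; do not write long prose under `output/`.'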
version: 0.1.0
metadata:
openclaw:
requires:
anyBins:
- python3
- python
---
# arXiv Search (metadata-first)
Collect an initial paper set with enough metadata to support downstream ranking, taxonomy building, and citation generation.
When online, prefer rich arXiv metadata (categories, arxiv_id, pdf_url, published/updated, etc.). When offline, accept an export and convert it cleanly.
## Load Order
Always read:
- `references/domain_pack_overview.md` — how domain packs drive topic-specific behavior
Domain packs (loaded by topic match):
- `assets/domain_packs/llm_agents.json` — pinned IDs, query rewrite rules for LLM agent topics
## Script Boundary
Use `scripts/run.py` only for:
- arXiv API retrieval and XML parsing
- offline export conversion (CSV/JSON/JSONL normalization)
- metadata enrichment via `id_list` backfill
Do not treat `run.py` as the place for:
- hardcoded topic detection or query rewriting (use domain packs)
- domain-specific pinned paper lists (externalize to `assets/domain_packs/`)
## Input
- `queries.md` (keywords, excludes, time window)
## Outputs
- `papers/papers_raw.jsonl` (JSONL; 1 paper per line)
  - Each record includes at least: `title`, `authors`, `year`, `url`, `abstract`
  - When using the arXiv API online mode, records also include helpful metadata: `arxiv_id`, `pdf_url`, `categories`, `primary_category`, `published`, `updated`, `doi`, `journal_ref`, `comment`
- Convenience index (optional but generated by the script):
  - `papers/papers_raw.csv`
## Decision: online vs offline
- If you have network access: run arXiv API retrieval.
- If not: import an export the user provides (CSV/JSON/JSONL) and normalize fields.
- Hybrid: if you import offline but still have network later, you can **enrich missing fields** (abstract/authors/categories) via arXiv `id_list` using `--enrich-metadata` or `queries.md` `enrich_metadata: true`.
## Workflow (heuristic)
1. Read `queries.md` and expand into concrete query strings.
2. Retrieve results (online) or import an export (offline).
3. Normalize every record to include at least:
   - `title`, `authors` (array), `year`, `url`, `abstract`
4. Keep the set broad at this stage; dedupe/ranking comes next.
5. Apply time window and `max_results` if specified.
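Step 3 can be sketched as a small normalizer. This is a minimal sketch, not the logic in `scripts/run.py`: `normalize_record` is a hypothetical helper, and the assumption that an export stores authors as a `;`- or `,`-separated string is ours, not something the skill specifies.

```python
from typing import Any

REQUIRED_FIELDS = ("title", "authors", "year", "url", "abstract")


def normalize_record(raw: dict[str, Any]) -> dict[str, Any]:
    """Coerce one imported row into the minimal record shape (hypothetical helper)."""
    authors = raw.get("authors") or []
    if isinstance(authors, str):
        # Assumption: exports separate authors with ';' or ','.
        sep = ";" if ";" in authors else ","
        authors = [a.strip() for a in authors.split(sep) if a.strip()]
    year_text = str(raw.get("year") or "").strip()
    return {
        # Collapse internal whitespace so titles and abstracts compare cleanly.
        "title": " ".join(str(raw.get("title") or "").split()),
        "authors": list(authors),
        "year": int(year_text) if year_text.isdigit() else None,
        "url": str(raw.get("url") or "").strip(),
        "abstract": " ".join(str(raw.get("abstract") or "").split()),
    }
```

Every output record carries all five keys, so downstream steps can rely on their presence even when a value is empty.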
## Quality checklist
- [ ] `papers/papers_raw.jsonl` exists.
- [ ] Each line is valid JSON and contains `title`, `authors`, `year`, `url`.
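The second checklist item can be verified mechanically. The sketch below is an illustration only; `find_jsonl_problems` is a hypothetical name and is not part of the bundled script.

```python
import json


def find_jsonl_problems(text, required=("title", "authors", "year", "url")):
    """Return one human-readable problem per bad line (hypothetical helper)."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {lineno}: invalid JSON")
            continue
        if not isinstance(rec, dict):
            problems.append(f"line {lineno}: not a JSON object")
            continue
        missing = [key for key in required if key not in rec]
        if missing:
            problems.append(f"line {lineno}: missing {', '.join(missing)}")
    return problems
```

An empty return value means every non-blank line parsed and carried the required keys; pass the contents of `papers/papers_raw.jsonl` as `text`.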
## Side effects
- Allowed: create/overwrite `papers/papers_raw.jsonl`; append notes to `STATUS.md`.
- Not allowed: write prose sections in `output/` before writing is approved.
## Script
### Quick Start
- `python scripts/run.py --help`
- Online: `python scripts/run.py --workspace <workspace_dir> --query "<query>" --max-results 200`
- Offline import: `python scripts/run.py --workspace <workspace_dir> --input <export.csv|json|jsonl>`
### All Options
- `--query <q>`: repeatable; multiple queries are unioned
- `--exclude <term>`: repeatable; excludes applied after retrieval
- `--max-results <n>`: cap total retrieved
- `--input <export.*>`: offline mode (CSV/JSON/JSONL)
- `--enrich-metadata`: best-effort enrich via arXiv `id_list` (needs network)
- `queries.md` also supports: `keywords`, `exclude`, `time window`, `max_results`, `enrich_metadata`
### Examples
- Online (multi-query + excludes):
  - `python scripts/run.py --workspace <ws> --query "LLM agent" --query "tool use" --exclude "survey" --max-results 300`
- Fetch a single paper by arXiv ID (direct `id_list` fetch):
  - `python scripts/run.py --workspace <ws> --query 2509.02547 --max-results 1`
- Offline auto-detect (no flags):
  - Place `papers/import.csv` (or `.json/.jsonl`) under the workspace, then run: `python scripts/run.py --workspace <ws>`
- Offline import + time window (via `queries.md`):
  - Set `- time window: { from: 2022, to: 2025 }` then run offline import normally
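One plausible reading of the `time window` setting is a year filter over imported records. The sketch below assumes inclusive bounds and assumes that records without a year are dropped; both are our assumptions, and `within_time_window` is a hypothetical helper rather than the script's own function.

```python
def within_time_window(year, window):
    """Check a record year against {'from': ..., 'to': ...} (hypothetical helper)."""
    lo = window.get("from")
    hi = window.get("to")
    if year is None:
        return False  # assumption: undated records fall outside any window
    if lo is not None and year < lo:
        return False
    if hi is not None and year > hi:
        return False
    return True  # assumption: both bounds are inclusive
```

A missing `from` or `to` is treated as an open bound, so `{ from: 2022 }` alone keeps everything from 2022 onward.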
## Troubleshooting
### Common Issues
#### Issue: `papers/papers_raw.jsonl` is empty
**Symptom**:
- Script exits with “No results returned …” or output file is empty.
**Causes**:
- Network is blocked (online mode).
- Queries are too narrow or `queries.md` is empty.
**Solutions**:
- Use offline import: place `papers/import.csv|json|jsonl` in the workspace or pass `--input`.
- Broaden keywords and reduce excludes in `queries.md`.
- Run with explicit `--query` to sanity-check the parser.
#### Issue: Offline import records are missing fields
**Symptom**:
- Downstream steps fail because records lack `authors/year/abstract/url`.
**Causes**:
- Export columns don’t match expected fields; upstream export is incomplete.
**Solutions**:
- Ensure the export contains at least `title`, `authors`, `year`, `url`, `abstract`.
- If you later have network, use `--enrich-metadata` to backfill missing fields (best effort).
### Recovery Checklist
- [ ] Confirm `queries.md` has non-empty `keywords` (or pass `--query`).
- [ ] If offline: confirm workspace has `papers/import.*` and rerun.
- [ ] Spot-check 3–5 JSONL lines: valid JSON + required fields.
FILE:AGENTS.md
# AGENTS.md
This exported skill uses `AGENTS.md` only as a local repo-root marker for bundled helper scripts.
Source skill: `arxiv-search` from `research-units-pipeline-skills`.
Export slug: `arxiv-search`.
FILE:assets/domain_packs/embodied_ai.json
{
"_comment": "Embodied AI / robot foundation models domain pack",
"domain_id": "embodied_ai",
"display_name": "Embodied AI and Robot Foundation Models",
"topic_triggers": {
"trigger_group_a": [
"embodied",
"robot",
"robotic",
"manipulation",
"navigation",
"humanoid",
"mobile manipulation"
],
"trigger_group_b": [
"vision-language-action",
"vla",
"foundation model",
"policy",
"world model",
"generalist robot",
"action model"
]
},
"query_rewrite": {
"core_clause": "((all:embodied OR all:robot OR all:robotic) AND (all:\"vision-language-action\" OR all:vla OR all:policy OR all:\"foundation model\" OR all:\"world model\" OR all:manipulation OR all:navigation))",
"signal_clause": "(all:survey OR all:review OR all:\"generalist robot\" OR all:\"robot learning\" OR all:\"mobile manipulation\")"
}
}
FILE:assets/domain_packs/llm_agents.json
{
"_comment": "LLM Agents domain pack — shared by arxiv-search, literature-engineer, and dedupe-rank",
"domain_id": "llm_agents",
"display_name": "Tool-using LLM Agents",
"topic_triggers": {
"_comment": "If ALL of trigger_group_a AND trigger_group_b match, this domain pack is active",
"trigger_group_a": [
"agent",
"agents",
"agentic"
],
"trigger_group_b": [
"llm",
"language model",
"gpt"
],
"name_triggers": [
"react",
"reflexion",
"autogpt",
"toolformer",
"voyager",
"tree of thoughts",
"mrkl"
]
},
"pinned_classics": [
{
"arxiv_id": "2210.03629",
"label": "ReAct (reasoning + acting in LMs)"
},
{
"arxiv_id": "2302.04761",
"label": "Toolformer"
},
{
"arxiv_id": "2303.11366",
"label": "Reflexion"
},
{
"arxiv_id": "2305.10601",
"label": "Tree of Thoughts"
},
{
"arxiv_id": "2305.16291",
"label": "Voyager"
}
],
"pinned_surveys": [
{
"arxiv_id": "2308.11432",
"label": "A Survey on Large Language Model based Autonomous Agents"
},
{
"arxiv_id": "2503.21460",
"label": "Large Language Model Agent: A Survey on Methodology, Applications and Challenges"
},
{
"arxiv_id": "2503.23037",
"label": "Agentic Large Language Models, a survey"
},
{
"arxiv_id": "2411.18279",
"label": "Large Language Model-Brained GUI Agents: A Survey"
}
],
"query_rewrite": {
"_comment": "When this domain is active, rewrite arXiv queries to be more precise",
"core_clause": "((all:agent OR all:agents) AND (all:llm OR all:\"large language model\" OR all:\"language model\"))",
"signal_clause": "(all:tool OR all:tools OR all:\"tool use\" OR all:\"function calling\" OR all:planning OR all:reasoning OR all:acting)"
},
"survey_detection": {
"_comment": "Rules for boosting survey/review papers in ranking",
"title_keywords": [
"survey",
"review"
],
"agent_keywords": [
"agent",
"agents",
"agentic",
"multi-agent",
"multi agent"
],
"min_surveys_ratio": 0.025,
"min_surveys_floor": 4,
"min_surveys_cap": 8
}
}
FILE:pipelines/arxiv-survey-latex.pipeline.md
---
name: arxiv-survey-latex
version: 3.8
variant_of: arxiv-survey
variant_overrides:
  routing_hints: [latex, pdf, tex, 可编译, 编译]
  routing_default: false
  routing_priority: 20
  units_template: templates/UNITS.arxiv-survey-latex.csv
  target_artifacts:
    __append__:
      - latex/main.tex
      - latex/main.pdf
      - output/LATEX_BUILD_REPORT.md
  stages:
    C5:
      title: Draft + PDF
      required_skills:
        __append__:
          - latex-scaffold
          - latex-compile-qa
      optional_skills:
        __remove__:
          - latex-scaffold
          - latex-compile-qa
      produces:
        __append__:
          - latex/main.tex
          - latex/main.pdf
          - output/LATEX_BUILD_REPORT.md
---
# Pipeline: arXiv survey / review (MD-first + LaTeX/PDF)
Variant of `arxiv-survey`.
Use this pipeline only when the default deliverable must include:
- `latex/main.tex`
- `latex/main.pdf`
- `output/LATEX_BUILD_REPORT.md`
All non-PDF survey behavior, defaults, checkpoints, and earlier stages inherit from `arxiv-survey`.
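How the `__append__` / `__remove__` keys in the front matter change an inherited list can be sketched as follows. This is a condensed illustration modeled on the bundled `_apply_list_patch`, with validation omitted; `apply_list_patch` and the base lists shown are illustrative, not copied from the loader.

```python
def apply_list_patch(base, patch):
    """Apply a list patch to an inherited list (condensed sketch, no validation)."""
    if "__replace__" in patch:
        # A replacement discards the inherited list entirely.
        return list(patch["__replace__"])
    removed = patch.get("__remove__", [])
    kept = [item for item in base if item not in removed]
    return list(patch.get("__prepend__", [])) + kept + list(patch.get("__append__", []))


# Illustrative base values; the real lists live in `arxiv-survey`.
base_optional = ["prose-writer", "latex-scaffold", "latex-compile-qa"]
patched_optional = apply_list_patch(
    base_optional, {"__remove__": ["latex-scaffold", "latex-compile-qa"]}
)

base_targets = ["output/DRAFT.md"]
patched_targets = apply_list_patch(
    base_targets, {"__append__": ["latex/main.tex", "latex/main.pdf"]}
)
```

Removal runs before prepend and append, which is why the variant can move the two LaTeX skills from `optional_skills` to `required_skills` without duplicating them.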
FILE:pipelines/arxiv-survey.pipeline.md
---
name: arxiv-survey
version: 3.8
profile: arxiv-survey
routing_hints: [survey, review, 综述, 调研, literature review]
routing_default: true
routing_priority: 10
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
- outline/transitions.md
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/claim_evidence_matrix.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- outline/writer_context_packs.jsonl
- citations/ref.bib
- citations/verified.jsonl
- sections/sections_manifest.jsonl
- sections/h3_bodies.refined.ok
- sections/paragraphs_curated.refined.ok
- sections/style_harmonized.refined.ok
- sections/opener_varied.refined.ok
- sections/abstract.md
- sections/S1.md
- sections/S2.md
- sections/discussion.md
- sections/conclusion.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/SCHEMA_NORMALIZATION_REPORT.md
- output/EVIDENCE_SELFLOOP_TODO.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/PARAGRAPH_CURATION_REPORT.md
- output/FRONT_MATTER_REPORT.md
- output/CHAPTER_LEADS_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/GLOBAL_REVIEW.md
- output/DRAFT.md
- output/MERGE_REPORT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.arxiv-survey.csv
contract_model: pipeline.frontmatter/v1
structure_mode: section_first
pre_retrieval_shell:
  enabled: true
  approval_surface: false
  allowed_h2: [Introduction, Related Work, Core Chapters, Discussion, Conclusion]
binding_layers: [chapter_skeleton, section_bindings, section_briefs, subsection_mapping]
core_chapter_h3_target: 3
query_defaults:
  max_results: 1800
  core_size: 300
  per_subsection: 28
  global_citation_min_subsections: 4
  draft_profile: survey
  citation_target: recommended
  evidence_mode: abstract
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- per_subsection
- global_citation_min_subsections
- draft_profile
- citation_target
- enrich_metadata
- evidence_mode
- fulltext_max_papers
- fulltext_max_pages
- fulltext_min_chars
- time_window.from
- time_window.to
quality_contract:
  citation_policy:
    unique_hard_floor: 150
    unique_recommended: 165
  structure_policy:
    max_final_h2_by_profile:
      survey: 8
      deep: 9
    max_h3_by_profile:
      survey: 10
      deep: 12
  front_matter_policy:
    survey:
      introduction:
        min_cites: 35
        min_paras: 5
        min_chars: 2600
      related_work:
        min_cites: 50
        min_paras: 6
        min_chars: 3200
    deep:
      introduction:
        min_cites: 40
        min_paras: 6
        min_chars: 3000
      related_work:
        min_cites: 55
        min_paras: 7
        min_chars: 3600
  subsection_policy:
    survey:
      min_unique_citations: 12
      min_chars: 4200
    deep:
      min_unique_citations: 14
      min_chars: 5200
loop_policy:
  stage_retry_budget:
    C1: 2
    C2: 2
    C3: 1
    C4: 1
  max_reroutes: 4
  require_human_on_retry_after_approval: true
stages:
  C0:
    title: Init
    mode: no_prose
    required_skills: [workspace-init, pipeline-router]
    optional_skills: []
    produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
  C1:
    title: Retrieval & core set
    mode: no_prose
    required_skills: [literature-engineer, dedupe-rank]
    optional_skills: [keyword-expansion, survey-seed-harvest]
    produces: [papers/papers_raw.jsonl, papers/retrieval_report.md, papers/papers_dedup.jsonl, papers/core_set.csv]
  C2:
    title: Structure
    mode: no_prose
    required_skills: [taxonomy-builder, chapter-skeleton, section-bindings, section-briefs, outline-builder, section-mapper, outline-refiner, pipeline-router, human-checkpoint]
    optional_skills: [outline-budgeter]
    produces: [outline/taxonomy.yml, outline/chapter_skeleton.yml, outline/section_bindings.jsonl, outline/section_binding_report.md, outline/section_briefs.jsonl, outline/outline.yml, outline/mapping.tsv, outline/coverage_report.md, outline/outline_state.jsonl, output/REROUTE_STATE.json, DECISIONS.md]
    human_checkpoint:
      approve: scope + section skeleton + outline
      write_to: DECISIONS.md
  C3:
    title: Evidence
    mode: no_prose
    required_skills: [pdf-text-extractor, paper-notes, subsection-briefs, chapter-briefs]
    optional_skills: []
    produces: [papers/fulltext_index.jsonl, papers/paper_notes.jsonl, papers/evidence_bank.jsonl, outline/subsection_briefs.jsonl, outline/chapter_briefs.jsonl]
  C4:
    title: Citations + evidence packs
    mode: no_prose
    required_skills: [citation-verifier, evidence-binder, evidence-draft, table-schema, anchor-sheet, table-filler, appendix-table-writer, schema-normalizer, writer-context-pack, evidence-selfloop, claim-matrix-rewriter]
    optional_skills: [survey-visuals]
    produces: [citations/ref.bib, citations/verified.jsonl, outline/evidence_bindings.jsonl, outline/evidence_binding_report.md, outline/table_schema.md, outline/tables_index.md, outline/tables_appendix.md, output/TABLES_APPENDIX_REPORT.md, outline/evidence_drafts.jsonl, outline/anchor_sheet.jsonl, output/SCHEMA_NORMALIZATION_REPORT.md, outline/writer_context_packs.jsonl, output/EVIDENCE_SELFLOOP_TODO.md, outline/claim_evidence_matrix.md]
  C5:
    title: Draft
    mode: prose_allowed
    required_skills: [front-matter-writer, chapter-lead-writer, subsection-writer, writer-selfloop, section-logic-polisher, argument-selfloop, paragraph-curator, style-harmonizer, opener-variator, evaluation-anchor-checker, transition-weaver, section-merger, post-merge-voice-gate, citation-diversifier, citation-injector, draft-polisher, global-reviewer, pipeline-auditor, artifact-contract-auditor]
    optional_skills: [prose-writer, subsection-polisher, redundancy-pruner, terminology-normalizer, limitation-weaver, latex-scaffold, latex-compile-qa]
    produces: [outline/transitions.md, sections/sections_manifest.jsonl, sections/h3_bodies.refined.ok, sections/paragraphs_curated.refined.ok, sections/style_harmonized.refined.ok, sections/opener_varied.refined.ok, sections/abstract.md, sections/S1.md, sections/S2.md, sections/discussion.md, sections/conclusion.md, output/WRITER_SELFLOOP_TODO.md, output/EVAL_ANCHOR_REPORT.md, output/ARGUMENT_SELFLOOP_TODO.md, output/SECTION_ARGUMENT_SUMMARIES.jsonl, output/ARGUMENT_SKELETON.md, output/PARAGRAPH_CURATION_REPORT.md, output/FRONT_MATTER_REPORT.md, output/CHAPTER_LEADS_REPORT.md, output/SECTION_LOGIC_REPORT.md, output/MERGE_REPORT.md, output/DRAFT.md, output/POST_MERGE_VOICE_REPORT.md, output/CITATION_BUDGET_REPORT.md, output/CITATION_INJECTION_REPORT.md, output/GLOBAL_REVIEW.md, output/AUDIT_REPORT.md, output/CONTRACT_REPORT.md]
---
# Pipeline: arXiv survey / review (MD-first)
Default contract (survey-grade, A150++):
- `queries.md` defaults are set for a *survey deliverable* (no silent downgrade): `core_size=300`, `per_subsection=28`, global unique citations hard floor `>=150` (recommended `>=165` when `core_size=300`; default `citation_target=recommended`).
- `draft_profile` controls **writing strictness** (`survey` vs `deep`), not “speed mode”.
- `evidence_mode` controls **evidence strength** (`abstract` default; `fulltext` optional and heavier).
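The global unique-citation gate in the default contract can be approximated as below. This is a rough sketch under stated assumptions, not the auditor's implementation: it assumes `[@key]` and `[@a; @b]` citation markers, assumes any `citation_target` other than `recommended` falls back to the hard floor, and both function names are hypothetical.

```python
import re


def unique_citation_keys(draft_md: str) -> set[str]:
    """Collect distinct keys from [@key] / [@a; @b] markers (assumed syntax)."""
    keys: set[str] = set()
    for group in re.findall(r"\[(@[^\]]+)\]", draft_md):
        for part in group.split(";"):
            part = part.strip()
            if part.startswith("@"):
                keys.add(part[1:])
    return keys


def citation_gate(unique_count, *, citation_target="recommended", hard_floor=150, recommended=165):
    """Return PASS/FAIL against the contract thresholds (hypothetical helper)."""
    # Assumption: only `recommended` raises the bar above the hard floor.
    target = recommended if citation_target == "recommended" else hard_floor
    return "PASS" if unique_count >= target else "FAIL"
```

With the defaults, a draft holding 160 unique keys clears the hard floor of 150 but still fails the gate, because `citation_target=recommended` asks for 165.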
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- survey-seed-harvest
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- `queries.md` may specify `max_results` and a year `time window`; `arxiv-search` will paginate and attach arXiv metadata (categories, arxiv_id, etc.) when online.
- If you import an offline export but later have network, you can set `enrich_metadata: true` in `queries.md` (or run `arxiv-search --enrich-metadata`) to backfill missing abstracts/authors/categories via arXiv `id_list`.
- Evidence-first expectation (A150++): aim for a large dedup pool (target >=1200, not ~200) and a stable, verifiable core set (`core_size=300`) so later stages can bind wide in-scope citation pools without forcing out-of-scope drift.
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- chapter-skeleton
- section-bindings
- section-briefs
- outline-builder
- section-mapper
- outline-refiner
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
human_checkpoint:
- approve: scope + section skeleton + outline
- write_to: DECISIONS.md
Notes:
- `chapter-skeleton` is the first retrieval-informed chapter contract; it stays chapter-level and does not emit stable H3 ids.
- `section-bindings` measures chapter saturation before H3 decomposition and writes PASS/BLOCKED signals into `outline/section_binding_report.md`.
- `section-briefs` turns the chapter layer into decomposition guidance plus subsection seeds; `outline-builder` then derives the first stable `outline/outline.yml` from that section layer.
- `outline-refiner` now writes both `outline/outline_state.jsonl` and `output/REROUTE_STATE.json`; section-first reroute/block information should be read from those artifacts instead of inferred from prose notes.
- Evidence-first expectation: each subsection should be written as a *question to answer* (RQ) plus *evidence needs* (what kind of citations/results are required), not just generic scaffold bullets.
- Coverage default: `section-mapper` uses `queries.md:per_subsection` as the per-H3 mapping contract (A150++ default: 28) so later evidence binding and writing have enough in-scope citations to choose from.
- Diversity expectation: mapping should not over-reuse a few papers across unrelated H3s; reserve “global” works for genuinely cross-cutting citations (controlled by `global_citation_min_subsections`).
- Budget policy (paper-like): avoid H3 explosion; the outline gate uses `queries.md:draft_profile` to set max H3 (survey<=10, deep<=12).
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent H3s into fewer, thicker units, then rerun `section-mapper` → `outline-refiner` before `Approve C2`.
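The diversity expectation above can be illustrated with a minimal sketch that separates cross-cutting ("global") bibkeys from subsection-local ones. It assumes the mapping has already been loaded as `(subsection_id, bibkey)` pairs; the function name `split_global_keys`, the pair format, and the sample keys are hypothetical, and the default threshold of 4 mirrors the A150++ default quoted for `global_citation_min_subsections` elsewhere in this document.

```python
from collections import defaultdict


def split_global_keys(mapping_rows, min_subsections=4):
    """Split bibkeys into (global, local) sets.

    mapping_rows: iterable of (subsection_id, bibkey) pairs (assumed format;
    the real layout of outline/mapping.tsv may differ).
    """
    subsections_by_key = defaultdict(set)
    for subsection_id, bibkey in mapping_rows:
        subsections_by_key[bibkey].add(subsection_id)

    # A key counts as global once it is mapped to enough distinct subsections.
    global_keys = {
        key for key, subs in subsections_by_key.items()
        if len(subs) >= min_subsections
    }
    local_keys = set(subsections_by_key) - global_keys
    return global_keys, local_keys


# Hypothetical sample rows for illustration only.
rows = [
    ("3.1", "smith2023"), ("3.2", "smith2023"),
    ("4.1", "smith2023"), ("4.2", "smith2023"),
    ("3.1", "lee2024"), ("3.2", "lee2024"),
]
global_keys, local_keys = split_global_keys(rows)
```

Counting distinct subsections rather than raw rows keeps a key that appears twice in the same H3 from being promoted to global.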
## Stage 3 - Evidence (C3) [NO PROSE]
required_skills:
- pdf-text-extractor
- paper-notes
- subsection-briefs
- chapter-briefs
produces:
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
Notes:
- `queries.md` can set `evidence_mode: "abstract"|"fulltext"` (A150++ default: `abstract`).
- `queries.md` can set `draft_profile: "survey"|"deep"` to control writing gate strictness (A150++ default: `survey`).
- If `evidence_mode: "fulltext"`, `pdf-text-extractor` can be tuned via `fulltext_max_papers`, `fulltext_max_pages`, `fulltext_min_chars`.
- `subsection-briefs` converts each H3 into a verifiable writing card (scope_rule/rq/axes/clusters/paragraph_plan) so writing does not copy outline scaffolds.
- Optional refinement markers (recommended): treat briefs as *contracts*, not scaffolds. If you manually refine them and want to prevent regeneration, create:
- `outline/subsection_briefs.refined.ok`
- `outline/chapter_briefs.refined.ok`
These markers are used as explicit “reviewed/refined” signals and as a freeze switch (scripts won’t overwrite refined briefs).
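As a rough illustration of the freeze switch described above, the sketch below refuses to overwrite an artifact when a sibling `.refined.ok` marker exists. The helper names `is_frozen` and `write_unless_frozen` are hypothetical and the actual scripts may implement the check differently; only the marker naming pattern comes from the text.

```python
import tempfile
from pathlib import Path


def is_frozen(artifact: Path) -> bool:
    # outline/subsection_briefs.jsonl -> outline/subsection_briefs.refined.ok
    marker = artifact.with_name(artifact.stem + ".refined.ok")
    return marker.exists()


def write_unless_frozen(artifact: Path, content: str) -> bool:
    """Write content unless a refinement marker freezes the artifact."""
    if is_frozen(artifact):
        return False  # respect the human-refined version
    artifact.parent.mkdir(parents=True, exist_ok=True)
    artifact.write_text(content, encoding="utf-8")
    return True


# Demonstration in a throwaway directory (paths are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    briefs = Path(tmp) / "outline" / "subsection_briefs.jsonl"
    first_write = write_unless_frozen(briefs, "{}\n")
    briefs.with_name("subsection_briefs.refined.ok").touch()
    second_write = write_unless_frozen(briefs, "overwritten\n")
    final_text = briefs.read_text(encoding="utf-8")
```

After the marker is created, the second write is skipped and the refined content stays intact.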
## Stage 4 - Citations + evidence packs (C4) [NO PROSE]
required_skills:
- citation-verifier
- evidence-binder
- evidence-draft
- table-schema
- anchor-sheet
- table-filler
- appendix-table-writer
- schema-normalizer
- writer-context-pack
- evidence-selfloop
- claim-matrix-rewriter
optional_skills:
- survey-visuals
produces:
- citations/ref.bib
- citations/verified.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- output/SCHEMA_NORMALIZATION_REPORT.md
- outline/writer_context_packs.jsonl
- output/EVIDENCE_SELFLOOP_TODO.md
- outline/claim_evidence_matrix.md
Notes:
- `evidence-draft` turns paper notes into per-subsection evidence packs (claim candidates + concrete comparisons + eval protocol + limitations) that the writer must follow.
- `claim-matrix-rewriter` makes `outline/claim_evidence_matrix.md` a projection/index of evidence packs (not an outline expansion), so writer guidance stays evidence-first.
- `writer-context-pack` builds a deterministic per-H3 drafting pack (briefs + evidence + anchors + allowed cites), reducing hollow writing and making C5 more debuggable.
- Tables are part of the default survey deliverable, but split into two layers:
- `outline/tables_index.md` (internal index; produced by `table-filler`; useful for planning/debugging; NOT inserted into the paper)
- `outline/tables_appendix.md` (reader-facing; produced by `appendix-table-writer`; clean/publishable; inserted into the draft as an Appendix block by `section-merger`)
- Optional: `survey-visuals` can still produce timeline/figure specs as intermediate artifacts.
- Optional refinement markers (recommended): after you spot-check/refine C4 artifacts and want to freeze them, create:
- `outline/evidence_bindings.refined.ok`
- `outline/evidence_drafts.refined.ok`
- `outline/anchor_sheet.refined.ok`
- `outline/writer_context_packs.refined.ok`
These markers make “reviewed/refined” explicit and prevent accidental regeneration/overwrite; strict mode relies on content checks (placeholders/blocking_missing/scope), not marker presence.
## Stage 5 - Draft (C5) [PROSE AFTER C2]
required_skills:
- front-matter-writer
- chapter-lead-writer
- subsection-writer
- writer-selfloop
- section-logic-polisher
- argument-selfloop
- paragraph-curator
- style-harmonizer
- opener-variator
- evaluation-anchor-checker
- transition-weaver
- section-merger
- post-merge-voice-gate
- citation-diversifier
- citation-injector
- draft-polisher
- global-reviewer
- pipeline-auditor
- artifact-contract-auditor
optional_skills:
- prose-writer
- subsection-polisher
- redundancy-pruner
- terminology-normalizer
- limitation-weaver
- evaluation-anchor-checker
- latex-scaffold
- latex-compile-qa
produces:
- sections/sections_manifest.jsonl
- sections/abstract.md
- sections/discussion.md
- sections/conclusion.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/MERGE_REPORT.md
- output/DRAFT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/GLOBAL_REVIEW.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
Notes:
- C5 writing system (semantic + minimal artifacts; no extra machinery):
- **Unit of work**: `sections/*.md` (front matter, H2 leads, H3 bodies). Avoid editing `output/DRAFT.md` directly until after merge.
- **Single source of truth (locked conventions)**: `output/ARGUMENT_SKELETON.md` → `## Consistency Contract` (terminology, scope boundary, evaluation protocol fields, baseline naming).
- **Write → check → fix (five gates)**:
1) `writer-selfloop` → `output/WRITER_SELFLOOP_TODO.md`: file existence, depth, citation scope, paper voice.
2) `section-logic-polisher` → `output/SECTION_LOGIC_REPORT.md`: paragraph linkage (no jump cuts / “paragraph islands”).
3) `argument-selfloop` → `output/ARGUMENT_SELFLOOP_TODO.md` + `output/ARGUMENT_SKELETON.md` + `output/SECTION_ARGUMENT_SUMMARIES.jsonl`: section-level closure + premise/definition stability.
4) `paragraph-curator` → `output/PARAGRAPH_CURATION_REPORT.md`: **select → evaluate → subset → fuse** so sections converge (reduce redundancy, strengthen synthesis) without changing citation keys.
5) `evaluation-anchor-checker` → `output/EVAL_ANCHOR_REPORT.md`: final section-level numeric hygiene sweep so surviving numeric claims keep same-sentence task/metric/constraint context before merge.
- **Openers-last**: draft the middle first; rewrite paragraph 1 last so it reflects real content (front matter + H3).
- Writing self-loop gate: `subsection-writer` ensures the full `sections/` file set exists (and emits `sections/sections_manifest.jsonl`); `writer-selfloop` blocks until depth/citation-scope/paper-voice checks pass, writing `output/WRITER_SELFLOOP_TODO.md` (PASS/FAIL).
- Argument self-loop gate: `argument-selfloop` blocks “smooth but hollow” sections by making the argument chain explicit (per-section paragraph moves + a global dependency skeleton). Its ledgers are intermediate artifacts and must never be merged into the paper.
- Style hygiene (C5 hard gate for `survey`/`deep`): treat `output/WRITER_SELFLOOP_TODO.md` Style Smells as mandatory fixes. Run `style-harmonizer` + `opener-variator` on flagged files, then rerun `writer-selfloop` before merge.
- Numeric hygiene gate: run `evaluation-anchor-checker` after the last section-level rewrites (`paragraph-curator` + `style-harmonizer` + `opener-variator`) and before merge. If later section rewrites touch the same H3 files, rerun it; do not wait for `pipeline-auditor` to discover underspecified numbers in the merged draft.
- Micro-fix routing (preferred over broad rewrites): if Style Smells are specific, use targeted micro-skills before a general harmonize pass:
- opener cadence / “overview” narration → `opener-variator`
- count-based limitation slots (“Two limitations…”) → `limitation-weaver`
- underspecified numeric/performance claims (missing task/metric/budget) → `evaluation-anchor-checker`
- Triage rule (prevents “patching gaps with prose”): if `writer-selfloop` FAILs because a subsection cannot meet `must_use` *in-scope* (thin packs / missing anchors / out-of-scope citation pressure), stop and rerun the evidence loop (`evidence-selfloop` + upstream C2/C3/C4) instead of padding prose.
- WebWeaver-style “planner vs writer” split (single agent, two passes):
- Planner pass: for each section/subsection, pick the exact citation IDs to use from the evidence bank (`outline/evidence_drafts.jsonl`) and keep scope consistent with the outline.
- Writer pass: write that section using only those citation IDs; avoid dumping the whole notes set into context.
- Treat this stage as an iteration loop: draft per H3 → logic-polish (thesis + connectors) → weave transitions → merge → de-template/cohere → global review → (if gaps) back to C3/C4 → regenerate.
- Post-merge voice gate: `post-merge-voice-gate` treats `outline/transitions.md` as a high-frequency injection source. If it FAILs, fix the *source* (usually transitions via `transition-weaver`, or the owning `sections/*.md`) and re-merge; do not “patch around it” in `draft-polisher`.
- Depth target (profile-aware): each H3 should be “few but thick” (avoid stubs). Use `queries.md:draft_profile` as the contract:
- `survey`: >=10 paragraphs + >=12 unique cites
- `deep`: >=11 paragraphs + >=14 unique cites
In all profiles, require >=2 concrete contrasts + evaluation anchoring + a cross-paper synthesis paragraph + an explicit limitation.
- Profile semantics: `survey` is the default deliverable contract; `deep` is stricter (and typically pairs well with `evidence_mode: fulltext`).
- Coherence target (paper-like): for every H2 chapter with H3 subsections, write a short **chapter lead** block (`sections/S<sec_id>_lead.md`) that previews the comparison axes and how the H3s connect (no new headings; avoid generic glue).
- Anti-template style contract (paper-like, not “outline narration”):
- Avoid meta openers like “This subsection surveys/argues …” and slide-like navigation (“Next, we move from … / We now turn to …”).
- Keep signposting light: avoid repeating a literal opener label across many subsections (e.g., `Key takeaway:`); vary opener phrasing and cadence.
- Tone target: calm, academic, understated; delete hype words (`clearly`, `obviously`) and “PPT speaker notes”.
- Keep evidence-policy disclaimers **once** in front matter (not repeated across H3s).
- If you cite numbers, include minimal evaluation context (task + metric + constraint/budget/cost) in the same paragraph.
- Citation shape must be reader-facing: no adjacent citation blocks (e.g., `[@a] [@b]`), no duplicate keys in one block (e.g., `[@a; @a]`), and avoid tail-only citation style by keeping mid-sentence citations in each H3.
- `section-merger` merges `sections/*.md` plus `outline/transitions.md` (within-chapter H3→H3 by default). Between-H2 transition insertion is optional: create `outline/transitions.insert_h2.ok` in the workspace if you want narrator-style handoffs included.
- Tables are part of the default deliverable: `outline/tables_appendix.md` is inserted into the draft by `section-merger` as a single Appendix block (index tables in `outline/tables_index.md` remain intermediate) unless `outline/tables.insert.off` exists. Other visuals (`outline/timeline.md`, `outline/figures.md`) remain intermediate by default.
- Citation scope policy: citations are subsection-first (from `outline/evidence_bindings.jsonl`), with limited reuse allowed within the same H2 chapter to reduce brittleness; avoid cross-chapter “free cite” drift.
- Controlled flexibility: bibkeys mapped to >= `queries.md:global_citation_min_subsections` subsections (A150++ default: 4) are treated as cross-cutting/global; see `allowed_bibkeys_global` in writer packs / `sections_manifest.jsonl`.
- If global unique citations are low, run `citation-diversifier` → `citation-injector` *before* `draft-polisher` (the polisher treats citation keys as immutable).
- `queries.md` can set `citation_target: recommended|hard` to control whether the recommended target is enforced as blocking (default: `recommended` for A150++).
- If you intentionally add/remove citations after an earlier polish run, reset the citation-anchoring baseline before rerunning `draft-polisher`:
- delete `output/citation_anchors.prepolish.jsonl` (workspace-local), then rerun `draft-polisher`.
- Recommended skills (toolkit, not a rigid one-shot chain):
- Modular drafting: `subsection-writer` → `writer-selfloop` → `section-logic-polisher` → `argument-selfloop` → `paragraph-curator` → `style-harmonizer` → `opener-variator` → `evaluation-anchor-checker` → `transition-weaver` → `section-merger` → `draft-polisher` → `global-reviewer` → `pipeline-auditor`.
- Legacy one-shot drafting: `prose-writer` (kept for quick experiments; less debuggable).
- If the draft reads like “paragraph islands”, run `section-logic-polisher` and patch only failing `sections/S*.md` until PASS, then merge.
- Add `pipeline-auditor` after `global-reviewer` as a regression test (blocks on ellipsis, repeated boilerplate, and citation hygiene).
- If you also need a PDF deliverable, use `latex-scaffold` + `latex-compile-qa` (see `arxiv-survey-latex`).
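The citation-shape rules in the notes above (no adjacent blocks such as `[@a] [@b]`, no duplicate keys such as `[@a; @a]`) can be checked with a small regex pass. This is a minimal sketch under the assumption that citations use the bracketed `[@key; @key]` form shown in the examples; the function name `citation_shape_issues` is hypothetical and this is not the actual `pipeline-auditor` implementation.

```python
import re

# One bracketed citation block, e.g. [@a; @b].
CITE_BLOCK = re.compile(r"\[(@[^\]]+)\]")
# Two citation blocks separated only by whitespace, e.g. [@a] [@b].
ADJACENT_BLOCKS = re.compile(r"\[@[^\]]+\]\s*\[@[^\]]+\]")


def citation_shape_issues(text):
    """Return (issue_type, matched_text) tuples for shape violations."""
    issues = []
    for match in ADJACENT_BLOCKS.finditer(text):
        issues.append(("adjacent_blocks", match.group(0)))
    for match in CITE_BLOCK.finditer(text):
        keys = [k.strip().lstrip("@") for k in match.group(1).split(";")]
        if len(keys) != len(set(keys)):
            issues.append(("duplicate_keys", match.group(0)))
    return issues


# Illustrative sentence with one violation of each kind.
sample = (
    "Agents plan [@a] [@b] before acting, and tools help [@c; @c]. "
    "A clean cite looks like this [@d; @e]."
)
issues = citation_shape_issues(sample)
```

A fix for an adjacent pair would typically merge the two blocks into one (`[@a; @b]`) in the owning `sections/*.md` file rather than in the merged draft.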
## Quality gates (strict mode)
- Citation coverage: expect a large, verifiable bibliography (A150++ default: `core_size=300` → `ref.bib` ~300) and high cite density:
- Per-H3: `survey` profile expects >=12 unique citations per H3 (and deeper profiles may require more).
- Front matter: `survey` profile expects Introduction>=35 and Related Work>=50 unique citations (dense positioning; no cite dumps).
- Global: `pipeline-auditor` gates on **global unique citations across the full draft**. A150++ defaults: hard `>=150`; recommended `>=165` (when bib=300). `queries.md:citation_target` controls which is blocking (default: `recommended`). If it fails, prefer `citation-diversifier` → `citation-injector` (in-scope, NO NEW FACTS) using each H3’s `allowed_bibkeys_selected` / `allowed_bibkeys_mapped` from `outline/writer_context_packs.jsonl`.
- Anti-template: drafts containing ellipsis placeholders (`…`) or leaked scaffold instructions (e.g., "enumerate 2-4 ...") should block and be regenerated from improved outline/mapping/evidence artifacts.
- Final polish hard gates (`survey`/`deep`): block on narration-template openers (e.g., `This subsection ...`), slide navigation phrasing, repeated opener stems/verbal tics, adjacent citation blocks (`[@a] [@b]`), duplicate keys in one block (`[@a; @a]`), and low H3 mid-sentence citation ratio (<30%).
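To make the global citation gate concrete, here is a minimal sketch that counts unique keys across a draft and compares them with the A150++ thresholds quoted above (hard `>=150`, recommended `>=165`). The function names, the bracketed `[@key]` parsing, and the returned dictionary shape are assumptions for illustration; the real `pipeline-auditor` logic may differ.

```python
import re

CITE_BLOCK = re.compile(r"\[(@[^\]]+)\]")
CITE_KEY = re.compile(r"@([\w:.-]+)")


def global_unique_keys(draft_text):
    """Collect the set of distinct citation keys used anywhere in the draft."""
    keys = set()
    for block in CITE_BLOCK.finditer(draft_text):
        keys.update(CITE_KEY.findall(block.group(1)))
    return keys


def citation_gate(unique_count, citation_target="recommended",
                  hard_min=150, recommended_min=165):
    """Decide PASS/FAIL; citation_target selects which threshold blocks."""
    if citation_target == "recommended":
        blocking_min = recommended_min
    else:
        blocking_min = hard_min
    return {
        "unique": unique_count,
        "blocking_min": blocking_min,
        "status": "PASS" if unique_count >= blocking_min else "FAIL",
        "gap": max(0, blocking_min - unique_count),
    }


# Illustrative draft fragment: three distinct keys, one reused.
demo_keys = global_unique_keys("Claim [@a; @b] and reuse [@b], then [@c].")
demo_gate = citation_gate(160)
```

The `gap` value is the number of additional unique in-scope keys that `citation-injector` would need to place before the gate passes.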
FILE:pipelines/graduate-paper-pipeline.md
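# Pipeline: Chinese graduate thesis restructuring and finalization (research-stage draft)
> Status: research-stage draft
> Current positioning: study the workflow and the skills it needs first; **do not bind UNITS yet, and do not force it into an executable contract**
> Intended for: Chinese graduate thesis writing that starts from an existing university template, prior papers, Overleaf sources, PDFs, BibTeX, experiment figures and tables, and intermediate working documents
> Core goal: restructure the “existing materials” into a Chinese graduate thesis with a unified main line, smooth structure, complete evidence, consistent terminology, and a compilable deliverable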
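## 1. What this Pipeline really is
This is neither a linear “write the thesis from scratch” workflow nor an assembly workflow that “stitches several papers together”. It is a thesis-engineering Pipeline that is **driven by a question list, works mainly in a Markdown intermediate layer, and converges in a TeX delivery layer**.
Its actual main loop is:
> `question list -> semantic localization -> Markdown restructuring -> TeX write-back -> compile review -> write issues back -> enter the next round`
So the first goal of this Pipeline is not to “rush out a first version of the body text”, but to get the following right first:
1. Recover the existing materials first, rather than editing blindly inside the `tex` files.
2. Restructure chapter roles around the thesis main line first, rather than copying the narrative of the original papers.
3. Think through structure, evidence, terminology, and figures/tables in the Markdown intermediate layer first, then write back to TeX.
4. Solve structure and evidence problems first, then style and AI-sounding phrasing.
5. The real convergence point is not “one full pass written”, but “the question list is cleared round by round, and compilation and review pass stably”.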
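## 2. The fundamental difference between this Pipeline and the survey Pipeline
It differs substantially from `arxiv-survey` / `arxiv-survey-latex` and must be modeled separately:
- the survey pipeline starts from “retrieving literature by topic and converging the structure layer by layer”;
- the graduate-paper pipeline starts from “taking stock of existing materials and restructuring them”;
- the core intermediate artifacts of the survey pipeline are `outline/evidence_drafts/...`;
- the core intermediate artifacts of the graduate-paper pipeline are the outline, question list, chapter intermediate drafts, figure/table plan, and review records under `codex_md/`;
- the main goal of the survey pipeline is to “generate a survey”;
- the main goal of the graduate-paper pipeline is to “restructure existing research work into a finalized thesis that conforms to Chinese degree-thesis conventions”.
In one sentence:
> survey is more like a “retrieval-driven evidence-writing workflow”;
> graduate-paper is more like a “thesis-engineering restructuring workflow driven by existing materials”.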
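## 3. Workspace layering and key artifacts
This Pipeline needs to separate the **thinking layer** from the **delivery layer** explicitly.
### 3.1 Directory responsibilities
- `main.tex`: the thesis entry point; assembles the cover, abstract, table of contents, body, appendices, acknowledgements, achievements, and references.
- `chapters/`: final delivery-version body `.tex` files.
- `abstract/`, `preface/`, `acknowledgement/`, `achievements/`: formal deliverables outside the body text.
- `references/`: bibliography database and style files.
- `pdf/`: PDFs of published or submitted papers.
- `Overleaf_ref/`: existing source manuscripts, revised manuscripts, supplementary material.
- `codex_md/`: the intermediate working layer; this is the **thesis restructuring workspace**.
- `claude_md/`: review checklists and final-draft verification records.
- `tmp_layout/`, `tmp_layout2/`: trial figure/table layouts and page-layout rehearsals.
The most important distinction is:
- `codex_md/` is the **thinking / restructuring area**
- `chapters/` is the **delivery / finalization area**
Do not mix the two.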
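### 3.2 Recommended fixed intermediate artifacts
It is recommended to keep the following artifacts fixed over the long term in this Pipeline:
- `codex_md/material_index.md`: inventory and index of existing materials
- `codex_md/material_readiness.md`: material readiness and gap notes
- `codex_md/missing_info.md`: list of missing information
- `codex_md/question_list.md`: this round’s question list, priorities, and acceptance criteria
- `codex_md/00_thesis_outline.md`: thesis main line, outline, chapter roles
- `codex_md/chapter_role_map.md`: source material -> chapter -> role mapping table
- `codex_md/chapter_rewrite_rules.md`: chapter restructuring rules table
- `codex_md/terminology_glossary.md`: unified terminology table
- `codex_md/symbol_metric_table.md`: unified symbol / metric table
- `codex_md/figure_plan.md`: figure/table and layout plan
- `claude_md/review_checklist.md`: review checklist for compilation, typesetting, data, and template items
- `output/THESIS_BUILD_REPORT.md`: compilation and final-draft check report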
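Since this Pipeline is explicitly not yet an executable contract, the following is only a hedged sketch of how the fixed artifact list above could be checked in a workspace. The constant `FIXED_ARTIFACTS` copies the paths from the list; the function name `missing_artifacts` and the temporary-directory demo are hypothetical.

```python
import tempfile
from pathlib import Path

# Paths copied from the fixed intermediate artifact list in section 3.2.
FIXED_ARTIFACTS = [
    "codex_md/material_index.md",
    "codex_md/material_readiness.md",
    "codex_md/missing_info.md",
    "codex_md/question_list.md",
    "codex_md/00_thesis_outline.md",
    "codex_md/chapter_role_map.md",
    "codex_md/chapter_rewrite_rules.md",
    "codex_md/terminology_glossary.md",
    "codex_md/symbol_metric_table.md",
    "codex_md/figure_plan.md",
    "claude_md/review_checklist.md",
    "output/THESIS_BUILD_REPORT.md",
]


def missing_artifacts(workspace):
    """Return the fixed artifacts that do not yet exist under workspace."""
    root = Path(workspace)
    return [rel for rel in FIXED_ARTIFACTS if not (root / rel).is_file()]


# Demonstration: a workspace that only has the question list so far.
with tempfile.TemporaryDirectory() as tmp:
    question_list = Path(tmp) / "codex_md" / "question_list.md"
    question_list.parent.mkdir(parents=True)
    question_list.write_text("# question list\n", encoding="utf-8")
    missing = missing_artifacts(tmp)
```

A check like this only reports presence; whether each file is actually filled in is still a review question for the human or for `thesis-compile-review`.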
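## 4. The skills this Pipeline actually needs
Setting reuse and generalization aside for now, and decomposing by what this graduate thesis Pipeline itself needs, the first version should use the following skills.
### 4.1 Main-chain skills
1. `thesis-workspace-init`
Responsibilities: initialize the thesis project, remind the user to place materials, set up the workspace, check whether `main.tex` compiles, and generate the material inventory.
2. `thesis-question-list`
Responsibilities: create and maintain `question_list.md`, fixing each round’s revision goals, issue priorities, boundaries, and acceptance criteria.
This is the **control-plane skill** of the whole Pipeline.
3. `thesis-source-role-mapper`
Responsibilities: map existing papers / templates / source manuscripts / PDFs / figure and table materials to “thesis roles”, not just `paper -> chapter`.
4. `thesis-chapter-reconstructor`
Responsibilities: restructure each chapter’s goal, weight, handoffs, and narrative approach around the thesis main line.
This is the **core restructuring skill** of the whole Pipeline.
5. `thesis-markdown-aligner`
Responsibilities: unify the main line, terminology, symbols, metrics, and figure/table conventions, so that the chapters converge in the Markdown intermediate layer into one thesis rather than a set of papers.
6. `thesis-tex-writeback`
Responsibilities: write content already sorted out in the Markdown layer back to `chapters/*.tex`, and synchronize figures/tables, equations, cross-references, and chapter handoffs.
7. `thesis-compile-review`
Responsibilities: run compilation, warning grading, template-mode checks, data-consistency verification, and issue write-back.
It forms the final quality loop.
8. `thesis-style-polisher`
Responsibilities: once structure, evidence, and data are stable, polish toward Chinese degree-thesis style and remove AI-sounding phrasing.
### 4.2 Parallel side-branch skills
9. `thesis-visual-layout-planner`
Responsibilities: figure/table planning, Mermaid sketches, temporary page mock-ups, and checks on the rhythm between figures and text.
10. `thesis-frontmatter-sync`
Responsibilities: synchronize the abstract, cover, appendices, achievements, acknowledgements, lists, and other non-body parts, so they are not left to be filled in at the very end.
11. `thesis-citation-enhance-review`
Responsibilities: locate sentences that must be backed by citations, expand the candidate literature, verify that each citation matches its claim, and write back to `references/*.bib` and the in-text citations.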
### 4.3 How these 11 skills relate
The main chain is:
`thesis-workspace-init -> thesis-question-list -> thesis-source-role-mapper -> thesis-chapter-reconstructor -> thesis-markdown-aligner -> thesis-tex-writeback -> thesis-compile-review -> thesis-style-polisher`
The parallel side branches are:
- `thesis-visual-layout-planner`
- `thesis-frontmatter-sync`
- `thesis-citation-enhance-review`
Among them:
- `thesis-question-list` is the control plane for the whole workflow
- `thesis-chapter-reconstructor` is the core restructuring plane
- `thesis-compile-review` is the quality-loop plane
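## 5. Pipeline overview (by stage)
| Stage | Goal | Core skills | Main outputs |
|---|---|---|---|
| Stage 0 | Project initialization and material intake | `thesis-workspace-init` | Directory skeleton, initial compile, material index, material readiness notes |
| Stage 1 | Recover existing materials into the Markdown intermediate layer | `thesis-workspace-init` | Initial Markdown materials, missing-information list, material gap notes |
| Stage 1.5 | Lock this round’s issues and boundaries | `thesis-question-list` | `question_list.md` |
| Stage 2 | Map source materials to thesis roles | `thesis-source-role-mapper` | Chapter role map, materials grouped by chapter |
| Stage 2.5 | Restructure chapters around the main line | `thesis-chapter-reconstructor` | Restructured chapter Markdown |
| Stage 3 | Align, merge, and unify the whole-thesis structure | `thesis-markdown-aligner` | Stable outline, terminology table, symbol table, evidence-gap list |
| Stage 3.5 | Figure/table and layout rehearsal | `thesis-visual-layout-planner` | Figure/table plan, figure-to-text mapping, temporary layout results |
| Stage 4 | Write back to the TeX delivery layer | `thesis-tex-writeback` | Compilable chapter `.tex` files, first complete `main.pdf` |
| Stage 4.5 | Synchronize non-body parts | `thesis-frontmatter-sync` | Synchronized versions of the abstract, appendices, cover, achievements, etc. |
| Stage 5 | Citation enhancement and verification | `thesis-citation-enhance-review` | Citation reinforcement results, verification records |
| Stage 6 | Compilation, typesetting, and final-draft review | `thesis-compile-review` | `THESIS_BUILD_REPORT`, review checklist |
| Stage 7 | Final polishing and removal of AI-sounding phrasing | `thesis-style-polisher` | A more natural Chinese final draft |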
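## 6. Stage-by-stage details
### Stage 0: Project initialization and material intake
**Goal**
Turn the thesis into a maintainable project first, rather than a pile of scattered files.
**Main inputs**
- University template
- Existing repository / source manuscripts
- Basic metadata such as student ID, year, and Chinese and English titles
- Existing `main.tex`
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Basic directory skeleton
- Initial `main.tex` / initial `main.pdf`
- `codex_md/material_index.md`
- `codex_md/material_readiness.md`
**Handoff**
- If `main.tex` does not compile yet, do not enter the body restructuring stages
- If the materials are not yet in place, do not start the question list or chapter restructuring
**Execution focus**
- Make clear which areas hold materials, which are working areas, and which are delivery areas
- Make sure it “compiles” first, then worry about “writing it well”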
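### Stage 1: Recover existing materials into the Markdown intermediate layer
**Goal**
Recover the reusable content from the existing `template / tex / Overleaf / PDF / bib` sources into a Markdown intermediate layer, rather than forcing edits directly at the TeX layer.
**Main inputs**
- University template
- Existing `.tex`
- Overleaf source manuscripts
- PDF
- Bibliography database
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Initial Markdown chapter materials
- `codex_md/material_readiness.md`
- `codex_md/missing_info.md`
- Material index and list of information still to be supplied
**Handoff**
- The outputs of Stage 1 feed directly into `thesis-question-list` and `thesis-source-role-mapper`
**Execution focus**
- The focus is “taking stock of material assets”, not “writing pretty sentences first”
- Explicitly mark experiment details, figure captions, citations, and numeric checks that still need to be supplied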
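### Stage 1.5: Lock the question list and this round’s goals
**Goal**
Establish this round’s control plane and make clear what this round fixes and what it does not.
**Main inputs**
- Initial Markdown materials
- Review feedback from the previous round
- Current compilation / structure / style issues
**Core skill**
- `thesis-question-list`
**Main outputs**
- `codex_md/question_list.md`
**Handoff**
- This step is the starting point for all later restructuring and write-back
- Any structural dispute should come back here first to be re-prioritized
**Execution focus**
- Every issue must state clearly: what it is, why it matters, how to change it, and what level counts as accepted
- The question list is not a memo; it is the entry point of each iteration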
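### Stage 2: Map source materials to thesis roles
**Goal**
Map the existing materials to chapter roles in the thesis, rather than doing a mechanical “paper to chapter” assignment.
**Main inputs**
- Existing papers, PDFs, source manuscripts, figures and tables
- Initial Markdown materials
- `question_list.md`
**Core skill**
- `thesis-source-role-mapper`
**Main outputs**
- `codex_md/chapter_role_map.md`
- Markdown materials grouped by chapter
- Content boundaries for each chapter
**Handoff**
- This is the direct input to `thesis-chapter-reconstructor`
**Execution focus**
- Distinguish: method core / background support / validation evidence / system implementation / material for limitations and future work
- If this is set wrong, the whole thesis will be skewed afterwards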
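### Stage 2.5: Restructure chapters around the main line
**Goal**
Transform the original paper-style narrative into a thesis-style narrative.
**Main inputs**
- Chapter role map
- `question_list.md`
- Markdown drafts of each chapter
**Core skill**
- `thesis-chapter-reconstructor`
**Main outputs**
- Restructured chapter Markdown
- `codex_md/chapter_rewrite_rules.md`
**Handoff**
- This is the core stage of the whole Pipeline
- All later alignment, write-back, and compilation build on the restructuring results produced here
**Execution focus**
- Make clear what question each chapter must answer
- Decide which content to strengthen, weaken, move up, or push down
- Rewrite the handoffs to the preceding and following chapters
- Change the “selling-point narrative” into a “thesis main-line narrative”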
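### Stage 3: Markdown alignment, merging, and unification
**Goal**
Make the whole thesis become “one thesis” in the intermediate layer first, rather than several papers stitched together.
**Main inputs**
- Restructured chapter Markdown
- `question_list.md`
**Core skill**
- `thesis-markdown-aligner`
**Main outputs**
- Stable version of `codex_md/00_thesis_outline.md`
- `codex_md/terminology_glossary.md`
- `codex_md/symbol_metric_table.md`
- Evidence-gap list
**Handoff**
- While the structure, terminology, or metrics are unstable, do not enter `tex-writeback`
**Execution focus**
- Unify the research main line, terminology, abbreviations, symbols, metrics, and figure/table conventions
- Decide which content is explained once in Chapter 2, which is developed in Chapters 3-5, and which goes into the appendices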
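### Stage 3.5: Figure/table and layout rehearsal
**Goal**
Plan figures, tables, and layout before the body text is finalized; do not let figures and tables turn into a last-minute patching job.
**Main inputs**
- Stable outline
- Chapter Markdown
- Existing figures, tables, and experiment results
**Core skill**
- `thesis-visual-layout-planner`
**Main outputs**
- `codex_md/figure_plan.md`
- Diagram drafts under `mermaid/`
- Trial layout results under `tmp_layout/`, `tmp_layout2/`
**Handoff**
- The figure/table planning results directly affect the figure-and-text structure in `tex-writeback`
**Execution focus**
- Make clear which figures and tables each chapter needs to support the main line
- Sketch first, then do a temporary page mock-up, then decide final placement and captions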
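### Stage 4: Write back to the TeX delivery layer
**Goal**
Write the content already sorted out in the Markdown layer back to `chapters/*.tex` and enter the delivery layer.
**Main inputs**
- Stable Markdown chapters
- Figure/table plan
- Unified terminology / symbol / metric tables
**Core skill**
- `thesis-tex-writeback`
**Main outputs**
- `chapters/*.tex`
- First complete `main.pdf`
**Handoff**
- If a structural problem is found at this stage, the default is to go back and fix it in the Markdown layer, not to reinvent the structure in the TeX layer
**Execution focus**
- Synchronize figures/tables, equations, cross-references, chapter-opening introductions, and chapter-end summaries
- The TeX layer is responsible for delivery, not for rethinking the structure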
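### Stage 4.5: Synchronize non-body parts
**Goal**
Maintain the abstract, appendices, cover, achievements, acknowledgements, and other non-body parts in parallel, so they are not postponed to the end and filled in at low quality.
**Main inputs**
- Stable version of the body text
- University template items
- Chinese and English titles and abstract information
**Core skill**
- `thesis-frontmatter-sync`
**Main outputs**
- `abstract/abstract.tex`
- `abstract/abstract-en.tex`
- `preface/...`
- `appendix/...`
- `acknowledgement/...`
- `achievements/...`
**Handoff**
- The final-draft review in Stage 6 checks these parts together with the body text
**Execution focus**
- Chinese and English titles and abstracts use consistent wording and scope
- Numbers, metrics, and method names in the abstract match the body text
- Appendices hold only supplementary content and do not repeat the main line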
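### Stage 5: Literature enhancement and citation verification
**Goal**
Systematically fill in the “sentences that should be backed by citations”, and make sure each citation matches its claim.
**Main inputs**
- Current Markdown / TeX version
- `references/*.bib`
- Existing PDFs / reference works
**Core skill**
- `thesis-citation-enhance-review`
**Main outputs**
- Enhanced bibliography database
- Citation reinforcement records
- Citation verification records
**Handoff**
- Literature enhancement can run in parallel with Stages 3 and 4, but it must be closed out before the final polish
**Execution focus**
- Prioritize sentences that must have citation support: factual descriptions, method taxonomies, classic results, metric definitions, and data sources
- Get citations right first, then add more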
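### Stage 6: Compilation, typesetting, and final-draft review
**Goal**
Close the quality loop: not merely “it compiles”, but “it is fit to submit”.
**Main inputs**
- Complete TeX project
- Synchronized non-body parts
- review checklist
**Core skill**
- `thesis-compile-review`
**Main outputs**
- `output/THESIS_BUILD_REPORT.md`
- `claude_md/review_checklist.md`
- List of closed / still-open issues
**Handoff**
- The issues found in Stage 6 decide which layer to return to for the fix: Stage 1.5 / 2.5 / 3 / 4
**Execution focus**
- A successful compile is only the minimum requirement
- Warnings need graded handling: blocks submission / must fix / record and observe
- Also check: data consistency, cite key validity, whether template parameters are correct, and whether terminology and abbreviations are unified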
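### Stage 7: Final polishing and removal of AI-sounding phrasing
**Goal**
Polish toward Chinese degree-thesis style only after structure, evidence, and data are stable.
**Main inputs**
- A thesis version that has passed compilation and review
- `中文写作要求.md`
- `GPT口癖与高频用词调研.md`
**Core skill**
- `thesis-style-polisher`
**Main outputs**
- A more natural Chinese final draft
- De-templated introductions, summaries, and conclusion-and-outlook sections
**Handoff**
- Stage 7 must come last
- If polishing exposes structural or evidence problems, go back to an earlier stage instead of polishing over them
**Execution focus**
- Prioritize chapter-opening introductions, chapter-end summaries, contribution statements, and the conclusion and outlook
- Tone down template phrasing, promotional tone, and common AI verbal tics
- The goal is “reads more like a degree thesis”, not “reads more like an AI-optimized paper”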
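## 7. The four fixed loops of this Pipeline
To keep it from degrading into a “linear SOP” again, it is recommended to fix the following four loops:
### 7.1 Structure loop
`outline -> a chapter md -> chapter imbalance found -> back to outline / question_list`
### 7.2 Content loop
`paper / Overleaf / PDF -> extract content -> restructure narrative -> still reads like the original paper -> restructure again`
### 7.3 Typesetting loop
`tex -> compile -> warnings / figure placement / cross-reference problems found -> back to tex or figure planning`
### 7.4 Style loop
`body text stable -> AI polishing -> template tone / AI flavor found -> back to writing rules and wording control`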
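## 8. Execution priority of this Pipeline
If resources are limited in an actual run, the priority order should be:
1. `thesis-question-list`
2. `thesis-source-role-mapper`
3. `thesis-chapter-reconstructor`
4. `thesis-markdown-aligner`
5. `thesis-tex-writeback`
6. `thesis-compile-review`
In other words:
- What truly cannot be skipped is “question list + material role mapping + chapter restructuring + compile review”
- What truly should not be done earliest is “polishing and removing AI-sounding phrasing”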
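## 9. Current conclusion
The first principle of this graduate-paper pipeline should be:
> **Restructure first, then write; converge in the intermediate layer first, then deliver in TeX; get evidence and structure right first, then style and polish.**
So if this Pipeline is later made formally executable, the highest priority is not “one big all-in-one runner”, but making the following skills solid first:
- `thesis-workspace-init`
- `thesis-question-list`
- `thesis-source-role-mapper`
- `thesis-chapter-reconstructor`
- `thesis-compile-review`
Only once these five skills are solid does this Chinese graduate thesis Pipeline have real practical value.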
FILE:pipelines/idea-brainstorm.pipeline.md
---
name: idea-brainstorm
version: 4.0
profile: idea-brainstorm
routing_hints: [idea, ideation, brainstorm, 点子, 选题, 找方向, 找 idea]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.idea-brainstorm.csv
contract_model: pipeline.frontmatter/v1
query_defaults:
draft_profile: idea_brainstorm
max_results: 1800
core_size: 100
evidence_mode: abstract
direction_pool_min: 12
direction_pool_max: 24
idea_screen_top_n: 10
idea_shortlist_size: 5
report_top_n: 3
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- evidence_mode
- direction_pool_min
- direction_pool_max
- idea_screen_top_n
- idea_shortlist_size
- report_top_n
- time_window.from
- time_window.to
quality_contract:
memo_bundle:
required_outputs: [output/REPORT.md, output/APPENDIX.md, output/REPORT.json]
signal_policy:
min_rows: 10
direction_policy:
shortlist_min: 3
shortlist_max: 5
keep_min: 3
cluster_diversity_min: 2
lead_diversity_target: 3
lead_diversity_axes: [cluster, direction_type, program_kind]
screening_policy:
keep_rank_max: 7
maybe_rank_max: 12
score_weights:
discussion_worthiness: 0.24
academic_value: 0.22
evidence_grounding: 0.18
direction_distinctness: 0.16
first_probe_clarity: 0.10
thesis_potential: 0.10
loop_policy:
stage_retry_budget:
C1: 2
C3: 1
C4: 1
max_reroutes: 4
require_human_on_retry_after_approval: true
stages:
C0:
title: Init + idea brief
mode: no_prose
required_skills: [workspace-init, pipeline-router, idea-brief, human-checkpoint]
optional_skills: []
produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/trace/IDEA_BRIEF.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
human_checkpoint:
approve: brainstorm brief
write_to: DECISIONS.md
C1:
title: Retrieval + core set
mode: no_prose
required_skills: [literature-engineer, dedupe-rank]
optional_skills: []
produces: [papers/papers_raw.jsonl, papers/papers_dedup.jsonl, papers/core_set.csv, papers/retrieval_report.md]
C2:
title: Idea landscape / focus
mode: no_prose
required_skills: [taxonomy-builder, pipeline-router, human-checkpoint]
optional_skills: []
produces: [outline/taxonomy.yml, DECISIONS.md]
human_checkpoint:
approve: focus clusters / lenses + exclusions
write_to: DECISIONS.md
C3:
title: Evidence signals
mode: no_prose
required_skills: [paper-notes, idea-signal-mapper]
optional_skills: []
produces: [papers/paper_notes.jsonl, papers/evidence_bank.jsonl, output/trace/IDEA_SIGNAL_TABLE.md, output/trace/IDEA_SIGNAL_TABLE.jsonl]
C4:
title: Direction pool + screening
mode: short_prose_ok
required_skills: [idea-direction-generator, idea-screener]
optional_skills: []
produces: [output/trace/IDEA_DIRECTION_POOL.md, output/trace/IDEA_DIRECTION_POOL.jsonl, output/trace/IDEA_SCREENING_TABLE.md, output/trace/IDEA_SCREENING_TABLE.jsonl]
C5:
title: Shortlist + memo synthesis + self-loop
mode: prose_allowed
required_skills: [idea-shortlist-curator, idea-memo-writer, deliverable-selfloop, artifact-contract-auditor]
optional_skills: []
produces: [output/trace/IDEA_SHORTLIST.md, output/trace/IDEA_SHORTLIST.jsonl, output/REPORT.md, output/APPENDIX.md, output/REPORT.json, output/DELIVERABLE_SELFLOOP_TODO.md, output/CONTRACT_REPORT.md]
---
# Pipeline: research idea brainstorm (signals -> directions -> memo)
Goal: produce a **discussion-ready research-idea memo** for PI / PhD readers.
This pipeline is for “找 research idea / brainstorm / 选题 / 找方向”.
It is **not** for writing a survey draft and **not** for generating execution-grade project specs.
The terminal deliverable is a single default entrypoint:
- `output/REPORT.md`
That memo should feel like a mature brainstorm artifact:
- grounded in literature,
- small enough to discuss,
- clear about what is promising,
- honest about uncertainty,
- and rich enough to guide the next discussion round.
Artifact policy:
- Reader-facing terminal artifacts:
- `output/REPORT.md`
- `output/APPENDIX.md`
- `output/REPORT.json`
- Trace artifacts stay under `output/trace/`.
- The run is only shareable when both layers exist and pass contract audit.
Default profile:
- Retrieval route: `literature-engineer`
- `core_size=100`
- Evidence mode: `abstract`
- Signal table: 10-20 rows
- Direction pool: 12-24
- Shortlist: 3-5
- Final memo lead directions: 3
## Stage 0 - Init + idea brief (C0)
required_skills:
- workspace-init
- pipeline-router
- idea-brief
- human-checkpoint
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
Notes:
- The brief is the single source of truth for topic, audience, constraints, exclusions, targets, and query buckets.
- Default behavior: block once at C0 for human approval of the brief before retrieval.
## Stage 1 - Retrieval + core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
Notes:
- Use multi-query buckets from `queries.md`.
- If recall is too low/noisy, fix it upstream and rerun C1.
## Stage 2 - Idea landscape / focus (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- pipeline-router
- human-checkpoint
produces:
- outline/taxonomy.yml
- DECISIONS.md
human_checkpoint:
- approve: focus clusters / lenses + exclusions
- write_to: DECISIONS.md
Notes:
- Taxonomy is an idea landscape, not a paper outline.
- Default behavior: pause at C2 so the human can choose a few promising focus lenses.
## Stage 3 - Evidence signals (C3) [table-first]
required_skills:
- paper-notes
- idea-signal-mapper
produces:
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
Notes:
- The signal table should capture tensions, missing pieces, and academically meaningful axes rather than proposal-ready wedges.
- Keep this stage compact and table-first; no long prose.
## Stage 4 - Direction pool + screening (C4)
required_skills:
- idea-direction-generator
- idea-screener
produces:
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
Notes:
- Expansion should generate discussion-worthy research directions, not an operator cartesian product.
- Screening should score discussion value, distinctness, evidence grounding, and thesis potential.
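The screening step can be pictured with the `score_weights` and rank cut-offs from this pipeline's front matter. The sketch below is an assumption about how those numbers might combine: a weighted sum per direction, then `keep` / `maybe` / `drop` by rank using `keep_rank_max` and `maybe_rank_max`. The function names, the per-axis score scale, and the sample directions are hypothetical; `idea-screener` may work differently.

```python
# Weights copied from quality_contract.screening_policy.score_weights.
SCORE_WEIGHTS = {
    "discussion_worthiness": 0.24,
    "academic_value": 0.22,
    "evidence_grounding": 0.18,
    "direction_distinctness": 0.16,
    "first_probe_clarity": 0.10,
    "thesis_potential": 0.10,
}


def weighted_score(scores):
    """Weighted sum over the six axes (per-axis scale is an assumption)."""
    return sum(weight * scores[axis] for axis, weight in SCORE_WEIGHTS.items())


def screen(directions, keep_rank_max=7, maybe_rank_max=12):
    """Rank directions by weighted score and label them by rank band."""
    ranked = sorted(directions, key=lambda d: weighted_score(d["scores"]),
                    reverse=True)
    labeled = []
    for rank, direction in enumerate(ranked, start=1):
        if rank <= keep_rank_max:
            decision = "keep"
        elif rank <= maybe_rank_max:
            decision = "maybe"
        else:
            decision = "drop"
        labeled.append((rank, direction["id"], decision))
    return labeled


# Hypothetical directions with uniform per-axis scores, for illustration.
pool = [
    {"id": "d3", "scores": {axis: 1 for axis in SCORE_WEIGHTS}},
    {"id": "d1", "scores": {axis: 5 for axis in SCORE_WEIGHTS}},
    {"id": "d2", "scores": {axis: 3 for axis in SCORE_WEIGHTS}},
]
labeled = screen(pool, keep_rank_max=1, maybe_rank_max=2)
```

The demo uses tiny cut-offs so that all three bands appear; with the front-matter defaults, a pool of 12-24 directions would yield up to 7 `keep` and up to 5 `maybe` entries.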
## Stage 5 - Shortlist + memo synthesis + self-loop (C5)
required_skills:
- idea-shortlist-curator
- idea-memo-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- `output/trace/IDEA_SHORTLIST.md` is an internal convergence layer, not the user-facing final answer.
- `output/REPORT.md` is the terminal deliverable and should read like a discussion-ready brainstorm memo of research ideas.
- `deliverable-selfloop` should evaluate the full memo bundle plus its trace chain.
FILE:pipelines/lit-snapshot.pipeline.md
---
name: lit-snapshot
version: 2.3
profile: lit-snapshot
routing_hints: [snapshot, 快照, one-page, one page, 48h]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- outline/taxonomy.yml
- outline/outline.yml
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.lit-snapshot.csv
---
# Pipeline: literature snapshot (24-48h) [bullets-first]
Goal: a compact, reader-facing snapshot (`output/SNAPSHOT.md`) with a small but usable paper set and a paper-like structure. This is intentionally lighter than `arxiv-survey*` (no evidence packs, no BibTeX, no LaTeX).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- arxiv-search
- dedupe-rank
optional_skills:
- keyword-expansion
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Snapshot default: aim for a smaller but diverse set (e.g., core_set ~20-40) rather than a survey-scale pool.
- Practical retrieval loop (multi-query, then refine):
  - Start broad with multiple query buckets (synonyms/acronyms/subtopics) and merge the results.
  - If too few: add buckets + widen time window; if too noisy: rewrite keywords + add exclusions; rerun C1.
- If the snapshot feels generic, the fix is almost always upstream: broaden `queries.md` (more synonyms, fewer excludes, wider time window) and rerun C1.
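The bucket-merging step in the retrieval loop above can be sketched as follows. This is a minimal illustration, assuming each bucket is a plain keyword or phrase and that buckets are OR-joined over arXiv's `all:` field, which mirrors how the bundled `scripts/run.py` treats unfielded queries. `merge_buckets` is a hypothetical helper name, not part of the pipeline.

```python
def merge_buckets(buckets: list[str]) -> str:
    """OR-join keyword buckets into a single arXiv search_query string."""
    parts: list[str] = []
    for bucket in buckets:
        q = bucket.strip()
        if not q:
            continue
        # Multi-word buckets are quoted as phrases; single tokens are passed bare.
        parts.append(f'all:"{q}"' if " " in q else f"all:{q}")
    if not parts:
        return ""
    if len(parts) == 1:
        return parts[0]
    return "(" + " OR ".join(parts) + ")"


print(merge_buckets(["LLM agents", "tool use", "ReAct"]))
# (all:"LLM agents" OR all:"tool use" OR all:ReAct)
```

Adding a bucket widens recall without touching the other clauses, which is why "add buckets" is the first lever when the result set is too small.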
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- outline-builder
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/outline.yml
human_checkpoint:
- approve: scope + outline (snapshot ToC)
- write_to: DECISIONS.md
Notes:
- Paper-like default: prefer fewer, thicker sections. For snapshots, avoid H3 explosion; treat H2 as the main organizing unit.
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent nodes and keep the ToC readable before writing.
## Stage 3 - Snapshot (C3) [SHORT PROSE OK]
required_skills:
- snapshot-writer
- deliverable-selfloop
- artifact-contract-auditor
optional_skills:
- prose-writer
produces:
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- The deliverable is bullets-first and pointer-heavy: every non-trivial claim should attach 1-2 concrete paper pointers from `papers/core_set.csv`.
- Avoid outline narration (e.g., `This section surveys ...`); write content claims + why-it-matters + pointers.
FILE:pipelines/peer-review.pipeline.md
---
name: peer-review
version: 2.3
profile: peer-review
routing_hints: [peer review, review report, referee, 审稿]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PAPER.md
- output/CLAIMS.md
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.peer-review.csv
---
# Pipeline: peer review / referee report
Goal: produce an actionable referee report (`output/REVIEW.md`) that is traceable to the submitted paper’s claims and evidence (no free-floating advice, no invented comparisons).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
## Stage 1 - Claims (C1)
required_skills:
- manuscript-ingest
- claims-extractor
produces:
- output/PAPER.md
- output/CLAIMS.md
Notes:
- Input expectation: you must have the manuscript text available as `output/PAPER.md` (or equivalent extracted text) before running this stage.
- Contract: every claim must include a source pointer (section/page/quote) so later critique is auditable.
## Stage 2 - Evidence audit (C2)
required_skills:
- evidence-auditor
- novelty-matrix
produces:
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
Notes:
- Evidence-auditor should write gaps/risks/verification steps, not “fix the paper”.
- Novelty-matrix should be conservative: compare against cited/known related work; if you lack sources, record that limitation explicitly.
## Stage 3 - Rubric write-up (C3)
required_skills:
- rubric-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Prefer concrete, minimal fixes (what experiment/ablation/analysis would resolve the concern) over generic “needs more experiments”.
FILE:pipelines/systematic-review.pipeline.md
---
name: systematic-review
version: 2.3
profile: systematic-review
routing_hints: [systematic review, systematic, prisma, 系统综述]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PROTOCOL.md
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/screening_log.csv
- papers/extraction_table.csv
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.systematic-review.csv
---
# Pipeline: systematic review (PRISMA-style)
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Protocol (C1)
required_skills:
- protocol-writer
produces:
- output/PROTOCOL.md
human_checkpoint:
- approve: protocol locked (query, inclusion/exclusion, databases, time window)
- write_to: DECISIONS.md
## Stage 2 - Retrieval & candidate pool (C2)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- arxiv-search
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Systematic-review contract: keep the candidate pool auditable. Prefer dedupe (good) over aggressive ranking/filtering (dangerous).
- `dedupe-rank` default: for this pipeline, it keeps the full deduped pool in `papers/core_set.csv` unless you explicitly set `queries.md:core_size` (screening should not silently drop papers).
## Stage 3 - Screening (C3)
required_skills:
- screening-manager
produces:
- papers/screening_log.csv
Notes:
- Use `papers/papers_dedup.jsonl` (or `papers/core_set.csv`) as the candidate list; `papers/screening_log.csv` must include protocol-grounded reasons for every decision.
- Practical auditability: number protocol clauses (`I1..` / `E1..`) and require screening rows to cite them (e.g., `reason_codes=E3`).
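One way to enforce the reason-code rule above is a small check over the screening log. The sketch below is illustrative only: the `reason_codes` column and the `I1..` / `E1..` numbering come from the notes, while the `paper_id` and `decision` columns, the semicolon separator, and the function name are assumptions.

```python
import csv
import io
import re

CODE_RE = re.compile(r"^[IE]\d+$")


def invalid_screening_rows(csv_text: str, protocol_codes: set[str]) -> list[tuple[int, str]]:
    """Return (line_number, problem) pairs for rows with missing or unknown reason codes."""
    problems: list[tuple[int, str]] = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # the header is line 1
        raw = (row.get("reason_codes") or "").strip()
        codes = [c.strip() for c in raw.split(";") if c.strip()]
        if not codes:
            problems.append((line_no, "no reason code"))
            continue
        for code in codes:
            if not CODE_RE.match(code):
                problems.append((line_no, f"malformed code: {code}"))
            elif code not in protocol_codes:
                problems.append((line_no, f"code not in protocol: {code}"))
    return problems


sample = """paper_id,decision,reason_codes
P1,include,I1;I2
P2,exclude,E3
P3,exclude,
P4,exclude,E9
"""
print(invalid_screening_rows(sample, {"I1", "I2", "E1", "E2", "E3"}))
# [(4, 'no reason code'), (5, 'code not in protocol: E9')]
```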
## Stage 4 - Extraction (C4)
required_skills:
- extraction-form
- bias-assessor
produces:
- papers/extraction_table.csv
Notes:
- Extraction is schema-driven: do not write narrative synthesis here; keep all fields consistent and fill bias fields (low/unclear/high + notes).
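A schema check for the bias fields might look like the sketch below. Only the allowed values (low/unclear/high) come from the note above. The column names `bias_selection`, `bias_reporting`, and `bias_notes` are hypothetical placeholders for whatever the extraction form defines.

```python
import csv
import io

ALLOWED_BIAS = {"low", "unclear", "high"}


def check_bias_fields(csv_text: str, bias_columns: list[str]) -> list[str]:
    """List cells in the named bias columns that are empty or outside low/unclear/high."""
    errors: list[str] = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # the header is line 1
        for col in bias_columns:
            value = (row.get(col) or "").strip().lower()
            if value not in ALLOWED_BIAS:
                errors.append(f"line {line_no}, {col}: {value or '<empty>'}")
    return errors


table = """paper_id,bias_selection,bias_reporting,bias_notes
P1,low,unclear,randomized split
P2,medium,high,
P3,,low,no baseline
"""
print(check_bias_fields(table, ["bias_selection", "bias_reporting"]))
# ['line 3, bias_selection: medium', 'line 4, bias_selection: <empty>']
```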
## Stage 5 - Synthesis (C5) [PROSE ALLOWED]
required_skills:
- synthesis-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
FILE:pipelines/tutorial.pipeline.md
---
name: tutorial
version: 2.3
profile: tutorial
routing_hints: [tutorial, 教程]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/TUTORIAL_SPEC.md
- outline/concept_graph.yml
- outline/module_plan.yml
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.tutorial.csv
---
# Pipeline: tutorial (teaching loop)
Goal: a tutorial deliverable (`output/TUTORIAL.md`) that has a consistent running example and a real teaching loop (objectives -> steps -> exercises -> verification), not just a blog-style explanation.
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Spec (C1)
required_skills:
- tutorial-spec
produces:
- output/TUTORIAL_SPEC.md
Notes:
- The spec is the scope contract: audience, prerequisites, learning objectives, and a running example (if applicable).
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- concept-graph
- module-planner
- exercise-builder
produces:
- outline/concept_graph.yml
- outline/module_plan.yml
human_checkpoint:
- approve: target audience + scope + running example
- write_to: DECISIONS.md
Notes:
- Treat `outline/module_plan.yml` as the execution contract for writing: modules must have concrete outputs and at least one verifiable exercise each.
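The execution-contract rule above could be checked mechanically before writing starts. The sketch below assumes a plan shaped like the dict that `yaml.safe_load` would return for `outline/module_plan.yml`. The keys `modules`, `id`, `outputs`, `exercises`, and `verification` are hypothetical, since the real schema is not shown here.

```python
def incomplete_modules(plan: dict) -> list[str]:
    """Return ids of modules that lack concrete outputs or a verifiable exercise."""
    bad: list[str] = []
    for module in plan.get("modules") or []:
        has_outputs = bool(module.get("outputs"))
        exercises = module.get("exercises") or []
        # An exercise counts as verifiable only if it states how to check the result.
        has_verifiable = any(ex.get("verification") for ex in exercises)
        if not (has_outputs and has_verifiable):
            bad.append(str(module.get("id") or "<unnamed>"))
    return bad


plan = {
    "modules": [
        {"id": "M1", "outputs": ["tokenizer.py"],
         "exercises": [{"prompt": "Implement split()", "verification": "unit test passes"}]},
        {"id": "M2", "outputs": [],
         "exercises": [{"prompt": "Run the demo", "verification": "output matches"}]},
        {"id": "M3", "outputs": ["notes.md"],
         "exercises": [{"prompt": "Summarize the module"}]},
    ]
}
print(incomplete_modules(plan))
# ['M2', 'M3']
```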
## Stage 3 - Writing (C3) [PROSE ALLOWED]
required_skills:
- tutorial-module-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Write only within the approved scope; if you discover missing prerequisites, record them as a follow-up instead of expanding scope silently.
FILE:references/domain_pack_overview.md
# Shared Retrieval Domain Pack — LLM Agents
## Purpose
This domain pack externalizes the LLM-agent-specific pinned IDs, topic detection,
and query construction logic that was previously hardcoded in:
- `arxiv-search/scripts/run.py` (`_looks_like_llm_agent_topic`, `_llm_agent_query`)
- `literature-engineer/scripts/run.py` (`_pinned_arxiv_ids`)
- `dedupe-rank/scripts/run.py` (`_looks_like_llm_agent_topic`, `_pinned_records`, `_is_agent_survey_record`)
## Loading
All three retrieval skills should load:
```
assets/domain_packs/llm_agents.json
```
If the workspace topic does NOT match any domain pack's `topic_triggers`, the skills
should fall back to generic behavior (no pinning, no query rewriting, no survey boosting).
## How to Add a New Domain
1. Create a new JSON file under `assets/domain_packs/` (e.g., `robotics.json`)
2. Fill in `topic_triggers`, `pinned_classics`, `pinned_surveys`, `query_rewrite_rules`, and `survey_detection`
3. No code changes needed — the scripts load all packs and match by trigger
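As a rough illustration of the steps above, the sketch below builds a pack as a Python dict and applies the two-group trigger rule. It uses only the keys that the bundled `scripts/run.py` reads (`topic_triggers.trigger_group_a`, `trigger_group_b`, `name_triggers`, and `query_rewrite.core_clause` / `signal_clause`). The trigger words, clauses, and the `pack_matches` helper are invented for the example, and the pinning and survey-detection fields are omitted.

```python
import json

# Hypothetical pack contents; only the key names are taken from scripts/run.py.
robotics_pack = {
    "topic_triggers": {
        "trigger_group_a": ["robot", "manipulation"],
        "trigger_group_b": ["learning", "policy"],
        "name_triggers": ["examplebot"],
    },
    "query_rewrite": {
        "core_clause": "(all:robot OR all:manipulation)",
        "signal_clause": '(all:"reinforcement learning" OR all:policy)',
    },
}


def pack_matches(pack: dict, context_text: str) -> bool:
    """A pack matches when at least one trigger from each group appears in the text."""
    triggers = pack.get("topic_triggers") or {}
    group_a = [t.lower() for t in (triggers.get("trigger_group_a") or [])]
    group_b = [t.lower() for t in (triggers.get("trigger_group_b") or [])]
    if not group_a or not group_b:
        return False
    text = context_text.lower()
    return any(t in text for t in group_a) and any(t in text for t in group_b)


print(json.dumps(robotics_pack, indent=2))  # what robotics.json could contain
print(pack_matches(robotics_pack, "Survey of robot policy learning"))  # True
print(pack_matches(robotics_pack, "Survey of LLM agents"))  # False
```

When no pack matches, the skills take the generic path described under Loading, so a non-matching workspace is not an error.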
FILE:scripts/run.py
from __future__ import annotations
import argparse
import csv
import json
import re
import sys
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Any
def main() -> int:
parser = argparse.ArgumentParser()
parser.add_argument("--workspace", required=True)
parser.add_argument("--max-results", type=int, default=100)
parser.add_argument("--query", action="append", default=[])
parser.add_argument("--exclude", action="append", default=[])
parser.add_argument("--input", default="", help="Optional CSV/JSON/JSONL export to convert (offline mode)")
parser.add_argument(
"--enrich-metadata",
action="store_true",
help="Best-effort: for records with `arxiv_id` but missing fields (abstract/authors/categories), fetch metadata via arXiv API id_list.",
)
parser.add_argument("--unit-id", default="")
parser.add_argument("--inputs", default="")
parser.add_argument("--outputs", default="")
parser.add_argument("--checkpoint", default="")
args = parser.parse_args()
repo_root = Path(__file__).resolve()
for _ in range(10):
if (repo_root / "AGENTS.md").exists():
break
parent = repo_root.parent
if parent == repo_root:
break
repo_root = parent
sys.path.insert(0, str(repo_root))
from tooling.common import parse_semicolon_list, read_jsonl, write_jsonl
workspace = Path(args.workspace).resolve()
outputs = parse_semicolon_list(args.outputs) or ["papers/papers_raw.jsonl"]
out_path = workspace / outputs[0]
offline_input = args.input or _detect_offline_export(workspace)
if offline_input:
records = _convert_export(Path(offline_input).resolve())
_, _, max_results, year_from, year_to, enrich_md = _parse_queries_md(workspace / "queries.md")
if year_from or year_to:
records = [r for r in records if _within_year_window(r.get("year"), year_from=year_from, year_to=year_to)]
if max_results:
records = records[: int(max_results)]
if bool(args.enrich_metadata) or bool(enrich_md):
_best_effort_enrich(records)
write_jsonl(out_path, records)
_write_csv_index(out_path.with_suffix(".csv"), records)
return 0
queries = list(args.query)
excludes = list(args.exclude)
if not queries:
queries, excludes, max_results, year_from, year_to, _enrich_md = _parse_queries_md(workspace / "queries.md")
if max_results:
args.max_results = max_results
else:
year_from = None
year_to = None
if not queries:
raise SystemExit("No query provided (use --query, fill queries.md, or place an offline export under papers/import.(csv|json|jsonl))")
# Convenience: if the user passes a raw arXiv identifier, fetch it directly via `id_list`.
if len(queries) == 1:
arxiv_id = _normalize_arxiv_id_query(queries[0])
if arxiv_id:
url = "https://export.arxiv.org/api/query?" + urllib.parse.urlencode({"id_list": arxiv_id})
records, _raw_count = _search_arxiv_once(url=url, queries=[arxiv_id], excludes=[], year_from=None, year_to=None)
if not records:
raise SystemExit(f"No results returned for arXiv id_list={arxiv_id} (network blocked or id not found)")
existing = read_jsonl(out_path)
combined = existing + records
write_jsonl(out_path, combined)
_write_csv_index(out_path.with_suffix(".csv"), combined)
return 0
records = _search_arxiv_paged(
queries=queries,
excludes=excludes,
max_results=int(args.max_results),
year_from=year_from,
year_to=year_to,
workspace=workspace,
)
if not records:
raise SystemExit("No results returned (network blocked or query too narrow)")
existing = read_jsonl(out_path)
combined = existing + records
write_jsonl(out_path, combined)
_write_csv_index(out_path.with_suffix(".csv"), combined)
return 0
def _convert_export(path: Path) -> list[dict]:
if not path.exists():
raise SystemExit(f"Input not found: {path}")
suffix = path.suffix.lower()
if suffix == ".jsonl":
return _load_jsonl(path)
if suffix == ".json":
data = json.loads(path.read_text(encoding="utf-8"))
if not isinstance(data, list):
raise SystemExit("JSON export must be a list of records")
return [_normalize_record(rec) for rec in data]
if suffix == ".csv":
with path.open("r", encoding="utf-8", newline="") as handle:
reader = csv.DictReader(handle)
return [_normalize_record(row) for row in reader]
raise SystemExit(f"Unsupported export format: {path}")
def _normalize_arxiv_id_query(value: str) -> str:
    raw = str(value or "").strip()
    if not raw:
        return ""
    low = raw.lower()
    for prefix in ("arxiv:", "arxiv_id:", "arxiv-id:", "id:"):
        if low.startswith(prefix):
            raw = raw[len(prefix) :].strip()
            break
    raw = _strip_arxiv_version(raw)
    if re.fullmatch(r"\d{4}\.\d{4,5}", raw):
        return raw
    # Old-style ids carry an archive prefix whose subject class may be upper-case (e.g. math.GT/0309136).
    if re.fullmatch(r"[A-Za-z-]+(?:\.[A-Za-z-]+)?/\d{7}", raw):
        return raw
    return ""
def _detect_offline_export(workspace: Path) -> str:
candidates = [
workspace / "papers" / "import.csv",
workspace / "papers" / "import.json",
workspace / "papers" / "import.jsonl",
workspace / "papers" / "arxiv_export.csv",
workspace / "papers" / "arxiv_export.json",
workspace / "papers" / "arxiv_export.jsonl",
]
for path in candidates:
if path.exists():
return str(path)
return ""
def _load_jsonl(path: Path) -> list[dict]:
records: list[dict] = []
with path.open("r", encoding="utf-8") as handle:
for line in handle:
line = line.strip()
if not line:
continue
records.append(_normalize_record(json.loads(line)))
return records
def _write_csv_index(path: Path, records: list[dict[str, Any]]) -> None:
# Convenience artifact for humans to scan/filter.
from tooling.common import ensure_dir
ensure_dir(path.parent)
fieldnames = [
"title",
"year",
"url",
"arxiv_id",
"primary_category",
"categories",
"pdf_url",
"published",
"updated",
]
with path.open("w", encoding="utf-8", newline="") as handle:
writer = csv.DictWriter(handle, fieldnames=fieldnames)
writer.writeheader()
for rec in records:
cats = rec.get("categories") or []
if isinstance(cats, list):
cats = ",".join([str(c).strip() for c in cats if str(c).strip()])
writer.writerow(
{
"title": str(rec.get("title") or "").strip(),
"year": str(rec.get("year") or "").strip(),
"url": str(rec.get("url") or "").strip(),
"arxiv_id": str(rec.get("arxiv_id") or "").strip(),
"primary_category": str(rec.get("primary_category") or "").strip(),
"categories": str(cats or "").strip(),
"pdf_url": str(rec.get("pdf_url") or "").strip(),
"published": str(rec.get("published") or "").strip(),
"updated": str(rec.get("updated") or "").strip(),
}
)
def _normalize_record(rec: dict) -> dict:
title = str(rec.get("title") or "").strip()
url = str(rec.get("url") or rec.get("id") or "").strip()
authors = rec.get("authors") or rec.get("author") or []
if isinstance(authors, str):
authors = [a.strip() for a in authors.split(";") if a.strip()]
if not isinstance(authors, list):
authors = []
year = rec.get("year")
try:
year = int(year) if year is not None and str(year).strip() else ""
except ValueError:
year = ""
abstract = str(rec.get("abstract") or rec.get("summary") or "").strip()
arxiv_id = str(rec.get("arxiv_id") or "").strip()
if not arxiv_id and url and "arxiv.org/" in url:
arxiv_id = _extract_arxiv_id(url)
pdf_url = str(rec.get("pdf_url") or "").strip()
if not pdf_url and arxiv_id:
pdf_url = _default_pdf_url(arxiv_id)
return {
"title": title,
"authors": authors,
"year": year,
"url": url,
"abstract": abstract,
"source": rec.get("source") or "export",
"arxiv_id": arxiv_id,
"pdf_url": pdf_url,
"categories": rec.get("categories") or [],
"primary_category": str(rec.get("primary_category") or "").strip(),
"published": str(rec.get("published") or "").strip(),
"updated": str(rec.get("updated") or "").strip(),
"doi": str(rec.get("doi") or "").strip(),
"journal_ref": str(rec.get("journal_ref") or "").strip(),
"comment": str(rec.get("comment") or "").strip(),
}
def _parse_queries_md(path: Path) -> tuple[list[str], list[str], int | None, int | None, int | None, bool]:
if not path.exists():
return ([], [], None, None, None, False)
keywords: list[str] = []
excludes: list[str] = []
mode: str | None = None
year_from: int | None = None
year_to: int | None = None
max_results: int | None = None
enrich_metadata: bool = False
for raw in path.read_text(encoding="utf-8").splitlines():
line = raw.strip()
if line.startswith("- keywords:"):
mode = "keywords"
continue
if line.startswith("- exclude:"):
mode = "exclude"
continue
if line.startswith("- time window:"):
mode = "time"
continue
if line.startswith("- max_results:"):
value = line.split(":", 1)[1].strip().strip('"').strip("'")
try:
max_results = int(value)
except Exception:
max_results = None
mode = None
continue
if line.startswith("- enrich_metadata:"):
value = line.split(":", 1)[1].strip().strip('"').strip("'").lower()
if value in {"1", "true", "yes", "y"}:
enrich_metadata = True
elif value in {"0", "false", "no", "n", ""}:
enrich_metadata = False
mode = None
continue
if line.startswith("- ") and mode in {"keywords", "exclude"}:
value = line[2:].strip().strip('"').strip("'")
if not value:
continue
if mode == "keywords":
keywords.append(value)
else:
excludes.append(value)
continue
if mode == "time" and line.startswith("- from:"):
year_from = _parse_year(line.split(":", 1)[1].strip().strip('"').strip("'"))
continue
if mode == "time" and line.startswith("- to:"):
year_to = _parse_year(line.split(":", 1)[1].strip().strip('"').strip("'"))
continue
return (keywords, excludes, max_results, year_from, year_to, enrich_metadata)
def _best_effort_enrich(records: list[dict[str, Any]]) -> None:
# Enrich missing metadata via id_list. This is best-effort and should not
# fail the whole unit if the network is unavailable.
ids: list[str] = []
seen: set[str] = set()
def _needs(rec: dict[str, Any]) -> bool:
if not str(rec.get("arxiv_id") or "").strip():
return False
if str(rec.get("abstract") or "").strip() and (rec.get("authors") or rec.get("categories")):
return False
return True
for rec in records:
if not isinstance(rec, dict):
continue
if not _needs(rec):
continue
arxiv_id = _strip_arxiv_version(str(rec.get("arxiv_id") or "").strip())
if not arxiv_id or arxiv_id in seen:
continue
seen.add(arxiv_id)
ids.append(arxiv_id)
if not ids:
return
by_id: dict[str, dict[str, Any]] = {}
batch_size = 50
for start in range(0, len(ids), batch_size):
chunk = ids[start : start + batch_size]
url = "https://export.arxiv.org/api/query?" + urllib.parse.urlencode({"id_list": ",".join(chunk)})
try:
fetched, _raw_count = _search_arxiv_once(url=url, queries=[], excludes=[], year_from=None, year_to=None)
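# _search_arxiv_once raises SystemExit on request failures; `except Exception` alone would not catch it.
except (Exception, SystemExit) as exc: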
print(f"[arxiv-search] WARN: metadata enrichment skipped (network?): {exc}", file=sys.stderr)
return
for rec in fetched:
aid = _strip_arxiv_version(str(rec.get("arxiv_id") or "").strip())
if aid:
by_id[aid] = rec
time.sleep(3.0)
if not by_id:
return
# Fill only missing fields.
fill_fields = {
"title",
"authors",
"year",
"url",
"abstract",
"arxiv_id",
"pdf_url",
"categories",
"primary_category",
"published",
"updated",
"doi",
"journal_ref",
"comment",
}
for rec in records:
if not isinstance(rec, dict):
continue
arxiv_id = _strip_arxiv_version(str(rec.get("arxiv_id") or "").strip())
if not arxiv_id:
continue
src = by_id.get(arxiv_id)
if not src:
continue
for k in fill_fields:
if rec.get(k):
continue
if src.get(k):
rec[k] = src.get(k)
if rec.get("source") == "export":
rec["source"] = "export+arxiv"
def _strip_arxiv_version(arxiv_id: str) -> str:
    # 2509.03990v2 -> 2509.03990
    arxiv_id = (arxiv_id or "").strip()
    # Single backslash inside the raw string: `\d` must match a digit, not a literal backslash.
    return re.sub(r"v\d+$", "", arxiv_id)
def _search_arxiv_paged(
*,
queries: list[str],
excludes: list[str],
max_results: int,
year_from: int | None,
year_to: int | None,
workspace: Path | None = None,
) -> list[dict[str, Any]]:
q = _build_arxiv_query(queries, workspace=workspace)
if not q:
return []
target = max(1, int(max_results))
page_size = min(200, target)
all_records: list[dict[str, Any]] = []
seen_urls: set[str] = set()
start = 0
while len(all_records) < target:
url = "https://export.arxiv.org/api/query?" + urllib.parse.urlencode(
{"search_query": q, "start": start, "max_results": page_size}
)
batch, raw_count = _search_arxiv_once(url=url, queries=queries, excludes=excludes, year_from=year_from, year_to=year_to)
if raw_count == 0:
break
for rec in batch:
rec_url = str(rec.get("url") or "").strip()
if rec_url and rec_url in seen_urls:
continue
if rec_url:
seen_urls.add(rec_url)
all_records.append(rec)
start += raw_count
if raw_count < page_size:
break
# Be polite to the public API.
time.sleep(3.0)
return all_records[:target]
def _search_arxiv_once(
*,
url: str,
queries: list[str],
excludes: list[str],
year_from: int | None,
year_to: int | None,
) -> tuple[list[dict[str, Any]], int]:
try:
with urllib.request.urlopen(url, timeout=30) as resp:
content = resp.read()
except Exception as exc:
raise SystemExit(f"arXiv request failed (network?): {exc}")
root = ET.fromstring(content)
ns = {
"a": "http://www.w3.org/2005/Atom",
"arxiv": "http://arxiv.org/schemas/atom",
}
entries = root.findall("a:entry", ns)
raw_count = len(entries)
records: list[dict[str, Any]] = []
for entry in entries:
title = (entry.findtext("a:title", default="", namespaces=ns) or "").strip()
summary = (entry.findtext("a:summary", default="", namespaces=ns) or "").strip()
published = (entry.findtext("a:published", default="", namespaces=ns) or "").strip()
updated = (entry.findtext("a:updated", default="", namespaces=ns) or "").strip()
year: int | str = ""
if len(published) >= 4 and published[:4].isdigit():
year = int(published[:4])
if not _within_year_window(year, year_from=year_from, year_to=year_to):
continue
authors = [a.findtext("a:name", default="", namespaces=ns).strip() for a in entry.findall("a:author", ns)]
authors = [a for a in authors if a]
url_abs = (entry.findtext("a:id", default="", namespaces=ns) or "").strip()
arxiv_id = _extract_arxiv_id(url_abs)
pdf_url = _extract_pdf_url(entry, ns=ns) or (_default_pdf_url(arxiv_id) if arxiv_id else "")
categories = _extract_categories(entry, ns=ns)
primary_category = _extract_primary_category(entry, ns=ns) or (categories[0] if categories else "")
comment = (entry.findtext("arxiv:comment", default="", namespaces=ns) or "").strip()
doi = (entry.findtext("arxiv:doi", default="", namespaces=ns) or "").strip()
journal_ref = (entry.findtext("arxiv:journal_ref", default="", namespaces=ns) or "").strip()
record = {
"title": title,
"authors": authors,
"year": year,
"url": url_abs,
"abstract": summary,
"source": "arxiv",
"query": queries,
"arxiv_id": arxiv_id,
"pdf_url": pdf_url,
"categories": categories,
"primary_category": primary_category,
"published": published,
"updated": updated,
"doi": doi,
"journal_ref": journal_ref,
"comment": comment,
}
if excludes and _is_excluded(record, excludes):
continue
records.append(record)
return (records, raw_count)
def _build_arxiv_query(queries: list[str], workspace: Path | None = None) -> str:
queries = [q.strip() for q in (queries or []) if q and q.strip()]
if not queries:
return ""
# Domain-pack detection: when a pack matches the workspace context,
# rewrite the query using its core/signal clauses instead of raw ORs.
if workspace is not None:
pack = _load_domain_pack(workspace)
if _domain_pack_matches(queries, pack):
rewritten = _domain_pack_query(queries, pack) # type: ignore[arg-type]
if rewritten:
return rewritten
parts: list[str] = []
for q in queries:
# Allow advanced arXiv query strings (fielded / boolean) to pass through.
if (
(" AND " in q)
or (" OR " in q)
or ("NOT " in q)
or re.search(r"\b(?:all|ti|abs|au|cat|co|jr|rn):", q)
):
parts.append(q)
continue
if " " in q:
parts.append(f'all:"{q}"')
else:
parts.append(f"all:{q}")
if not parts:
return ""
if len(parts) == 1:
return parts[0]
return "(" + " OR ".join(parts) + ")"
def _load_domain_pack(workspace: Path) -> dict | None:
"""Load the first matching domain pack for this workspace.
Globs ``assets/domain_packs/*.json`` from the skill package root. For
each pack, checks ``topic_triggers.trigger_group_a`` **and**
``trigger_group_b`` against the combined text of the workspace's
``GOAL.md`` and ``queries.md``. A pack matches when **at least one**
trigger from group_a **and** at least one from group_b appear in the
text. Returns the first matching pack dict, or ``None``.
"""
skill_root = Path(__file__).resolve().parents[1] # .../arxiv-search/
pack_dir = skill_root / "assets" / "domain_packs"
if not pack_dir.is_dir():
return None
# Build the haystack from workspace context files.
haystack_parts: list[str] = []
for name in ("GOAL.md", "queries.md"):
ctx = workspace / name
if ctx.exists():
haystack_parts.append(ctx.read_text(encoding="utf-8"))
haystack = " ".join(haystack_parts).lower()
if not haystack.strip():
return None
for pack_path in sorted(pack_dir.glob("*.json")):
try:
pack = json.loads(pack_path.read_text(encoding="utf-8"))
except Exception:
continue
triggers = pack.get("topic_triggers") or {}
group_a = [t.lower() for t in (triggers.get("trigger_group_a") or [])]
group_b = [t.lower() for t in (triggers.get("trigger_group_b") or [])]
if not group_a or not group_b:
continue
if any(t in haystack for t in group_a) and any(t in haystack for t in group_b):
return pack
return None
def _domain_pack_matches(queries: list[str], pack: dict | None) -> bool:
"""Return True when *pack* is non-None (i.e. a domain pack matched)."""
return pack is not None
def _domain_pack_query(queries: list[str], pack: dict) -> str:
"""Build an AND-constrained arXiv query from a matched domain pack."""
qr = pack.get("query_rewrite") or {}
core = qr.get("core_clause", "")
signals = qr.get("signal_clause", "")
triggers = pack.get("topic_triggers") or {}
name_triggers = [n.lower() for n in (triggers.get("name_triggers") or [])]
names: list[str] = []
for q in queries:
qlow = q.lower()
if any(n in qlow for n in name_triggers):
names.append(q.strip())
names_clause = ""
if names:
parts: list[str] = []
for n in names[:12]:
if " " in n:
parts.append(f'all:"{n}"')
else:
parts.append(f"all:{n}")
names_clause = "(" + " OR ".join(parts) + ")"
if not core:
# Degenerate pack without query_rewrite — fall through to default.
return ""
if names_clause:
# Important: allow named classics through even if they don't match the core clause.
return f"(({core} AND {signals}) OR {names_clause})" if signals else f"({core} OR {names_clause})"
return f"({core} AND {signals})" if signals else core
def _parse_year(value: str) -> int | None:
value = (value or "").strip()
if not value:
return None
try:
year = int(value)
except Exception:
return None
return year if 1900 <= year <= 2100 else None
def _within_year_window(year: Any, *, year_from: int | None, year_to: int | None) -> bool:
if not year_from and not year_to:
return True
try:
y = int(year)
except Exception:
return False
if year_from and y < year_from:
return False
if year_to and y > year_to:
return False
return True
def _extract_arxiv_id(url_abs: str) -> str:
    url_abs = (url_abs or "").strip()
    if not url_abs:
        return ""
    parsed = urllib.parse.urlparse(url_abs)
    parts = [p for p in (parsed.path or "").split("/") if p]
    if not parts:
        return ""
    if parts[0] in {"abs", "pdf"} and len(parts) >= 2:
        # Join the remainder so old-style ids such as hep-th/9901001 keep their archive prefix.
        return "/".join(parts[1:]).replace(".pdf", "")
    return parts[-1].replace(".pdf", "")
def _default_pdf_url(arxiv_id: str) -> str:
arxiv_id = (arxiv_id or "").strip()
if not arxiv_id:
return ""
return f"https://arxiv.org/pdf/{arxiv_id}.pdf"
def _extract_pdf_url(entry: ET.Element, *, ns: dict[str, str]) -> str:
for link in entry.findall("a:link", ns):
href = (link.attrib.get("href") or "").strip()
ltype = (link.attrib.get("type") or "").strip()
title = (link.attrib.get("title") or "").strip().lower()
rel = (link.attrib.get("rel") or "").strip().lower()
if not href:
continue
if ltype == "application/pdf" or title == "pdf":
return href
if rel == "related" and href.endswith(".pdf"):
return href
return ""
def _extract_categories(entry: ET.Element, *, ns: dict[str, str]) -> list[str]:
out: list[str] = []
for cat in entry.findall("a:category", ns):
term = (cat.attrib.get("term") or "").strip()
if term and term not in out:
out.append(term)
return out
def _extract_primary_category(entry: ET.Element, *, ns: dict[str, str]) -> str:
node = entry.find("arxiv:primary_category", ns)
if node is None:
return ""
return (node.attrib.get("term") or "").strip()
def _is_excluded(record: dict, excludes: list[str]) -> bool:
hay = f"{record.get('title','')} {record.get('abstract','')}".lower()
for term in excludes:
term = term.strip().lower()
if term and term in hay:
return True
return False
if __name__ == "__main__":
raise SystemExit(main())
FILE:tooling/__init__.py
FILE:tooling/common.py
from __future__ import annotations
import csv
import json
import os
import re
import shutil
import tempfile
from dataclasses import dataclass
from datetime import date, datetime
from pathlib import Path
from typing import Any, Iterable
import yaml
def today_iso() -> str:
return date.today().isoformat()
def now_iso_seconds() -> str:
return datetime.now().replace(microsecond=0).isoformat()
def ensure_dir(path: Path) -> None:
path.mkdir(parents=True, exist_ok=True)
def atomic_write_text(path: Path, content: str) -> None:
ensure_dir(path.parent)
fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
try:
with os.fdopen(fd, "w", encoding="utf-8") as handle:
handle.write(content)
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except OSError:
pass
raise
def backup_existing(path: Path) -> Path:
"""Rename an existing file to a timestamped `.bak.*` sibling and return the backup path."""
if not path.exists():
return path
stamp = datetime.now().replace(microsecond=0).isoformat().replace("-", "").replace(":", "")
backup = path.with_name(f"{path.name}.bak.{stamp}")
counter = 1
while backup.exists():
backup = path.with_name(f"{path.name}.bak.{stamp}.{counter}")
counter += 1
path.replace(backup)
return backup
def parse_semicolon_list(value: str | None) -> list[str]:
if not value:
return []
return [item.strip() for item in value.split(";") if item.strip()]
def normalize_title_for_dedupe(title: str) -> str:
title = title.lower()
title = re.sub(r"[^a-z0-9]+", " ", title)
title = re.sub(r"\s+", " ", title).strip()
return title
def normalize_axis_label(text: str) -> str:
text = re.sub(r"\s+", " ", (text or "").strip().lower())
text = text.rstrip(" .;:,;。")
text = re.sub(r"\s*/\s*", " ", text)
text = re.sub(r"[^a-z0-9]+", " ", text)
return text.strip()
def subsection_brief_generic_axis_norms() -> set[str]:
"""Axis labels that strict survey gates treat as scaffold-level defaults.
Keep this aligned across brief generation and quality-gate checks so the
generator does not promote an axis that the gate later classifies as generic.
"""
axes = {
"core mechanism and system architecture",
"training and data setup",
"evaluation protocol",
"evaluation protocol (benchmarks / metrics / human)",
"evaluation protocol (datasets / metrics / human)",
"evaluation protocol (datasets, metrics, human evaluation)",
"compute and efficiency",
"compute and latency constraints",
"efficiency and compute",
"tool interface contract (schemas / protocols)",
"tool selection / routing policy",
"sandboxing / permissions / observability",
"failure modes and limitations",
}
return {normalize_axis_label(axis) for axis in axes}
def read_jsonl(path: Path) -> list[dict[str, Any]]:
records: list[dict[str, Any]] = []
if not path.exists():
return records
with path.open("r", encoding="utf-8") as handle:
for line in handle:
line = line.strip()
if not line:
continue
records.append(json.loads(line))
return records
def write_jsonl(path: Path, records: Iterable[dict[str, Any]]) -> None:
ensure_dir(path.parent)
lines = [json.dumps(record, ensure_ascii=False) for record in records]
atomic_write_text(path, "\n".join(lines) + ("\n" if lines else ""))
def read_tsv(path: Path) -> list[dict[str, str]]:
if not path.exists():
return []
with path.open("r", encoding="utf-8", newline="") as handle:
reader = csv.DictReader(handle, delimiter="\t")
return [dict(row) for row in reader]
def write_tsv(path: Path, rows: list[dict[str, Any]], fieldnames: list[str]) -> None:
ensure_dir(path.parent)
fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
try:
with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
writer.writeheader()
for row in rows:
writer.writerow({key: row.get(key, "") for key in fieldnames})
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except OSError:
pass
raise
@dataclass(frozen=True)
class UnitsTable:
fieldnames: list[str]
rows: list[dict[str, str]]
@staticmethod
def load(path: Path) -> "UnitsTable":
with path.open("r", encoding="utf-8", newline="") as handle:
reader = csv.DictReader(handle)
fieldnames = list(reader.fieldnames or [])
rows = [dict(row) for row in reader]
return UnitsTable(fieldnames=fieldnames, rows=rows)
def save(self, path: Path) -> None:
ensure_dir(path.parent)
fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
try:
with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
writer = csv.DictWriter(handle, fieldnames=self.fieldnames)
writer.writeheader()
for row in self.rows:
writer.writerow({key: row.get(key, "") for key in self.fieldnames})
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except OSError:
pass
raise
def load_yaml(path: Path) -> Any:
with path.open("r", encoding="utf-8") as handle:
return yaml.safe_load(handle)
def dump_yaml(path: Path, data: Any) -> None:
ensure_dir(path.parent)
text = yaml.safe_dump(
data,
allow_unicode=True,
sort_keys=False,
default_flow_style=False,
width=120,
)
atomic_write_text(path, text)
def copy_tree(src_dir: Path, dst_dir: Path, *, overwrite: bool) -> None:
if not src_dir.is_dir():
raise ValueError(f"Template directory not found: {src_dir}")
ensure_dir(dst_dir)
for src_path in src_dir.rglob("*"):
rel = src_path.relative_to(src_dir)
dst_path = dst_dir / rel
if src_path.is_dir():
ensure_dir(dst_path)
continue
ensure_dir(dst_path.parent)
if dst_path.exists() and not overwrite:
continue
shutil.copy2(src_path, dst_path)
def tokenize(text: str) -> list[str]:
text = text.lower()
text = re.sub(r"[^a-z0-9]+", " ", text)
return [token for token in text.split() if token]
_EN_STOPWORDS = {
    "a",
    "an",
    "and",
    "are",
    "as",
    "at",
    "be",
    "by",
    "for",
    "from",
    "in",
    "is",
    "it",
    "of",
    "on",
    "or",
    "that",
    "the",
    "this",
    "to",
    "with",
    "we",
    "our",
    "via",
    "towards",
    "toward",
    "using",
    "use",
    "based",
    "new",
    "into",
    "over",
    "under",
    "between",
    "within",
    "without",
    "beyond",
}
_GENERIC_PAPER_WORDS = {
"survey",
"review",
"tutorial",
"paper",
"approach",
"method",
"methods",
"model",
"models",
"framework",
"frameworks",
"system",
"systems",
"learning",
"deep",
"neural",
"network",
"networks",
"analysis",
"benchmark",
"benchmarks",
"dataset",
"datasets",
"evaluation",
"evaluating",
"towards",
"using",
"based",
"study",
"studies",
}
def candidate_keywords(titles: Iterable[str], *, top_k: int, min_freq: int) -> list[str]:
freq: dict[str, int] = {}
for title in titles:
for token in tokenize(title):
if token in _EN_STOPWORDS or token in _GENERIC_PAPER_WORDS:
continue
if len(token) < 3:
continue
freq[token] = freq.get(token, 0) + 1
candidates = [t for t, c in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])) if c >= min_freq]
return candidates[:top_k]
def update_status_log(status_path: Path, line: str) -> None:
ensure_dir(status_path.parent)
if status_path.exists():
existing = status_path.read_text(encoding="utf-8")
else:
existing = "# Status\n"
if "## Run log" not in existing:
existing = existing.rstrip() + "\n\n## Run log\n"
updated = existing.rstrip() + f"\n- {line}\n"
atomic_write_text(status_path, updated)
def update_status_field(status_path: Path, heading: str, value: str) -> None:
    """Set the single bullet under `## {heading}` to *value*, appending the section if it is absent."""
    heading_line = f"## {heading}".strip()
    bullet_line = f"- `{value}`"
    if status_path.exists():
        lines = status_path.read_text(encoding="utf-8").splitlines()
    else:
        lines = ["# Status"]
    out: list[str] = []
    i = 0
    updated = False
    while i < len(lines):
        line = lines[i]
        out.append(line)
        if line.strip() == heading_line:
            if i + 1 < len(lines) and lines[i + 1].lstrip().startswith("-"):
                # Replace the existing bullet: emit the new one and skip the old line.
                out.append(bullet_line)
                i += 2
                updated = True
                continue
            # Heading exists but has no bullet yet: insert one directly below it.
            out.append(bullet_line)
            updated = True
        i += 1
    if not updated:
        out.extend(["", heading_line, bullet_line])
    atomic_write_text(status_path, "\n".join(out).rstrip() + "\n")
def decisions_has_approval(decisions_path: Path, checkpoint: str) -> bool:
if not checkpoint:
return False
if not decisions_path.exists():
return False
text = decisions_path.read_text(encoding="utf-8")
pattern = rf"^\s*-\s*\[[xX]\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b"
return re.search(pattern, text, flags=re.MULTILINE) is not None
def ensure_decisions_approval_checklist(decisions_path: Path) -> None:
if decisions_path.exists():
text = decisions_path.read_text(encoding="utf-8")
else:
text = "# Decisions log\n"
if re.search(r"^##\s+Approvals\b", text, flags=re.MULTILINE):
return
workspace = decisions_path.parent
checkpoints = _human_checkpoints_from_units(workspace)
if not checkpoints:
return
checklist_lines = ["## Approvals (check to unblock)"]
for checkpoint in checkpoints:
hint = _approval_hint(checkpoint)
suffix = f" ({hint})" if hint else ""
checklist_lines.append(f"- [ ] Approve {checkpoint}{suffix}")
checklist_lines.append("")
checklist = "\n".join(checklist_lines)
lines = text.splitlines()
if lines and lines[0].startswith("#"):
new_text = "\n".join([lines[0], "", checklist] + lines[1:]).rstrip() + "\n"
else:
new_text = (checklist + "\n" + text).rstrip() + "\n"
atomic_write_text(decisions_path, new_text)
def set_decisions_approval(decisions_path: Path, checkpoint: str, *, approved: bool) -> None:
    """Tick or untick the approval checkbox for *checkpoint*, adding the line if it is missing."""
    checkpoint = checkpoint.strip()
    if not checkpoint:
        raise ValueError("checkpoint must be non-empty")
    ensure_decisions_approval_checklist(decisions_path)
    # The checklist helper writes nothing when UNITS.csv lists no HUMAN checkpoints,
    # so fall back to a fresh log instead of reading a file that may not exist.
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n"
    lines = text.splitlines()
    pattern = re.compile(rf"^\s*-\s*\[\s*[xX ]\s*\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b")
    updated = False
    for idx, line in enumerate(lines):
        if pattern.search(line):
            lines[idx] = re.sub(
                r"\[\s*[xX ]\s*\]",
                "[x]" if approved else "[ ]",
                line,
                count=1,
            )
            updated = True
            break
    if not updated:
        insert_at = None
        for idx, line in enumerate(lines):
            if line.strip().startswith("## Approvals"):
                insert_at = idx + 1
                break
        if insert_at is None:
            lines.append("")
            lines.append("## Approvals (check to unblock)")
            insert_at = len(lines)
        lines.insert(insert_at, f"- [{'x' if approved else ' '}] Approve {checkpoint}")
    atomic_write_text(decisions_path, "\n".join(lines).rstrip() + "\n")
def _human_checkpoints_from_units(workspace: Path) -> list[str]:
units_path = workspace / "UNITS.csv"
if not units_path.exists():
return []
try:
table = UnitsTable.load(units_path)
except Exception:
return []
seen: set[str] = set()
out: list[str] = []
for row in table.rows:
owner = (row.get("owner") or "").strip().upper()
if owner != "HUMAN":
continue
checkpoint = (row.get("checkpoint") or "").strip()
if checkpoint and checkpoint not in seen:
seen.add(checkpoint)
out.append(checkpoint)
return out
def _approval_hint(checkpoint: str) -> str:
hints = {
"C0": "kickoff: scope/sources/time window/constraints",
"C1": "retrieval + core set",
"C2": "scope + outline",
"C3": "evidence ready",
"C4": "citations verified",
"C5": "allow prose writing",
}
return hints.get(checkpoint, "")
def upsert_checkpoint_block(decisions_path: Path, checkpoint: str, markdown_block: str) -> None:
    """Insert or replace the marker-delimited block for *checkpoint* in the decisions log."""
    begin = f"<!-- BEGIN CHECKPOINT:{checkpoint} -->"
    end = f"<!-- END CHECKPOINT:{checkpoint} -->"
    block = "\n".join([begin, markdown_block.rstrip(), end, ""]).rstrip() + "\n"
    ensure_decisions_approval_checklist(decisions_path)
    # The checklist helper may leave the file uncreated (no HUMAN checkpoints),
    # so only read it when it exists.
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n\n"
    pattern = re.compile(
        rf"{re.escape(begin)}.*?{re.escape(end)}\n?",
        flags=re.DOTALL,
    )
    if pattern.search(text):
        # A callable replacement inserts the block literally; a plain string would
        # have its backslashes interpreted as regex escapes.
        new_text = pattern.sub(lambda _match: block, text)
    else:
        new_text = text.rstrip() + "\n\n" + block
    atomic_write_text(decisions_path, new_text)
def seed_queries_from_topic(queries_path: Path, topic: str) -> None:
topic = topic.strip()
if not topic:
return
if queries_path.exists():
lines = queries_path.read_text(encoding="utf-8").splitlines()
else:
lines = [
"# Queries",
"",
"## Primary query",
"- keywords:",
" - \"\"",
"- exclude:",
" - \"\"",
"- max_results: \"\"",
"- core_size: \"\"",
"- time window:",
" - from: \"\"",
" - to: \"\"",
"",
"## Notes",
"-",
]
def _has_nonempty_values(token: str) -> bool:
in_block = False
for raw in lines:
stripped = raw.strip()
if stripped.startswith(f"- {token}:"):
in_block = True
continue
if not in_block:
continue
if raw.startswith(" - "):
value = stripped[2:].strip().strip('"').strip("'")
if value:
return True
continue
if stripped.startswith("- "):
break
return False
def _has_nonempty_scalar(token: str) -> bool:
for raw in lines:
stripped = raw.strip()
if not stripped.startswith(f"- {token}:"):
continue
value = stripped.split(":", 1)[1].split("#", 1)[0].strip().strip('"').strip("'")
return bool(value)
return False
def _has_nonempty_time_field(field: str) -> bool:
# Looks for lines like: ' - from: "2022"'
for raw in lines:
stripped = raw.strip()
if not stripped.startswith(f"- {field}:"):
continue
value = stripped.split(":", 1)[1].strip().strip('"').strip("'")
return bool(value)
return False
has_keywords = _has_nonempty_values("keywords")
has_excludes = _has_nonempty_values("exclude")
has_time_from = _has_nonempty_time_field("from")
has_time_to = _has_nonempty_time_field("to")
has_max_results = _has_nonempty_scalar("max_results")
has_core_size = _has_nonempty_scalar("core_size")
workspace = queries_path.parent
profile = pipeline_profile(workspace)
query_defaults = pipeline_query_defaults(workspace)
raw_tlow = topic.lower()
topic_for_queries = _sanitize_topic_for_query_seed(topic)
keyword_suggestions = [topic_for_queries]
tlow = topic_for_queries.lower()
is_agent = any(t in tlow for t in ("agent", "agents", "agentic"))
is_embodied = any(
t in tlow
for t in (
"embodied ai",
"embodied intelligence",
"embodied agent",
"embodied robotics",
"robot foundation model",
"robot learning",
"robot manipulation",
"vision-language-action",
"vla",
"generalist robot",
)
)
is_text_to_image = any(t in tlow for t in ("text-to-image", "text to image", "t2i"))
is_text_to_video = any(t in tlow for t in ("text-to-video", "text to video", "t2v"))
is_diffusion = "diffusion" in tlow
is_generative = is_text_to_image or is_text_to_video or is_diffusion or ("image generation" in tlow) or ("generative" in tlow)
if is_agent:
keyword_suggestions.extend(
[
"LLM agent",
"language model agent",
"tool use",
"function calling",
"tool-using agent",
"planning",
"memory",
"multi-agent",
"benchmark",
"safety",
]
)
exclude_suggestions: list[str] = []
if is_agent:
exclude_suggestions.append("agent-based modeling")
exclude_suggestions.extend(["react hooks", "perovskite", "banach", "coxeter"])
if is_embodied:
keyword_suggestions.extend(
[
"embodied AI survey",
"embodied AI review",
"embodied intelligence survey",
"embodied agent survey",
"robot foundation model survey",
"robot learning survey",
"robot manipulation survey",
"embodied robotics survey",
"vision-language-action survey",
"vision-language-action model",
"robot foundation model",
"generalist robot policy",
"world model robot",
]
)
stripped_output_terms = topic_for_queries.lower().strip() != raw_tlow.strip()
if stripped_output_terms and any(t in raw_tlow for t in ("latex", "pdf", "markdown", "typesetting")):
exclude_suggestions.extend(["latex", "pdf", "typesetting", "document layout"])
if is_generative:
keyword_suggestions.extend(
[
"text-to-image generation",
"text-guided image generation",
"diffusion model",
"denoising diffusion probabilistic model",
"latent diffusion",
"stable diffusion",
"classifier-free guidance",
"diffusion transformer",
"DiT",
"masked generative transformer",
"MaskGIT",
"autoregressive image generation",
"VQGAN",
"VQ-VAE",
"ControlNet",
"DreamBooth",
"textual inversion",
"LoRA fine-tuning",
]
)
default_max_results = str(query_defaults.get("max_results") or "").strip()
# Keep the suggestion a string in every branch; it is rendered as a quoted scalar below.
max_results_suggestion = default_max_results or (
"1800" if profile == "arxiv-survey" else ("800" if (is_agent or is_generative or is_embodied) else "300")
)
time_from_suggestion = (
"2018" if is_embodied
else ("2022" if (is_agent and ("llm" in tlow or "language model" in tlow)) else ("2020" if is_generative else ""))
)
core_size_suggestion = str(query_defaults.get("core_size") or "").strip() or ("300" if profile == "arxiv-survey" else "")
out: list[str] = []
i = 0
while i < len(lines):
line = lines[i]
stripped = line.strip()
if stripped.startswith("- keywords:") and not has_keywords:
out.append(line)
i += 1
while i < len(lines) and lines[i].startswith(" - "):
i += 1
for kw in _dedupe_preserve_order(keyword_suggestions)[:14]:
out.append(f" - \"{kw}\"")
continue
if stripped.startswith("- exclude:") and not has_excludes:
out.append(line)
i += 1
while i < len(lines) and lines[i].startswith(" - "):
i += 1
for ex in _dedupe_preserve_order(exclude_suggestions)[:10]:
out.append(f" - \"{ex}\"")
continue
if stripped.startswith("- max_results:") and not has_max_results and max_results_suggestion:
out.append(f"- max_results: \"{max_results_suggestion}\"")
i += 1
continue
if stripped.startswith("- core_size:") and not has_core_size and core_size_suggestion:
out.append(f"- core_size: \"{core_size_suggestion}\"")
i += 1
continue
if stripped.startswith("- time window:") and not (has_time_from or has_time_to) and time_from_suggestion:
out.append(line)
i += 1
# Skip existing from/to lines if present.
while i < len(lines) and lines[i].startswith(" -"):
i += 1
out.append(f" - from: \"{time_from_suggestion}\"")
out.append(" - to: \"\"")
continue
out.append(line)
i += 1
out = _materialize_missing_query_defaults(out, query_defaults, allowed_fields=pipeline_overridable_query_fields(workspace))
atomic_write_text(queries_path, "\n".join(out).rstrip() + "\n")
def _dedupe_preserve_order(items: list[str]) -> list[str]:
seen: set[str] = set()
out: list[str] = []
for item in items:
item = item.strip()
if not item or item in seen:
continue
seen.add(item)
out.append(item)
return out
def _sanitize_topic_for_query_seed(topic: str) -> str:
text = str(topic or "").strip()
if not text:
return ""
patterns = [
r"(?i)\bwith\s+latex\s*/\s*pdf\s+output\b",
r"(?i)\bwith\s+latex\s+output\b",
r"(?i)\bwith\s+pdf\s+output\b",
r"(?i)\bwith\s+markdown\s+output\b",
r"(?i)\blatex\s*/\s*pdf\s+output\b",
r"(?i)\bpdf\s+output\b",
r"(?i)\blatex\s+output\b",
r"(?i)\bmarkdown\s+output\b",
r"(?i)\bfor\s+latex\s*/\s*pdf\b",
]
for pattern in patterns:
text = re.sub(pattern, "", text)
text = re.sub(r"\s+", " ", text).strip(" ,;:-")
return text or topic
_LEGACY_PIPELINE_ALIASES = {
"idea-finder": "idea-brainstorm",
"idea-finder.pipeline.md": "idea-brainstorm",
"pipelines/idea-finder.pipeline.md": "idea-brainstorm",
}
def find_repo_root(start: Path | None = None) -> Path:
    """Walk up from *start* (default: this file) looking for AGENTS.md."""
    candidate = (start or Path(__file__)).resolve()
    for _ in range(10):
        if (candidate / "AGENTS.md").exists():
            return candidate
        parent = candidate.parent
        if parent == candidate:
            break
        candidate = parent
    raise FileNotFoundError("Could not find repo root (AGENTS.md marker)")
def _normalize_pipeline_lock_value(value: str) -> str:
raw = str(value or "").strip()
return _LEGACY_PIPELINE_ALIASES.get(raw, raw)
def resolve_pipeline_spec_path(*, repo_root: Path, pipeline_value: str) -> Path | None:
value = _normalize_pipeline_lock_value(pipeline_value)
if not value:
return None
candidate = Path(value)
if candidate.is_absolute() and candidate.exists():
return candidate.resolve()
rel_candidate = repo_root / value
if rel_candidate.exists():
return rel_candidate.resolve()
filename = Path(value).name
if filename:
direct = repo_root / "pipelines" / filename
if direct.exists():
return direct.resolve()
stem = filename
if stem.endswith(".pipeline.md"):
stem = stem[: -len(".pipeline.md")]
if stem:
direct = repo_root / "pipelines" / f"{stem}.pipeline.md"
if direct.exists():
return direct.resolve()
return None
def load_workspace_pipeline_spec(workspace: Path):
from tooling.pipeline_spec import PipelineSpec
try:
repo_root = find_repo_root(workspace)
except FileNotFoundError:
repo_root = Path(__file__).resolve().parents[1]
lock_path = workspace / "PIPELINE.lock.md"
if not lock_path.exists():
return None
pipeline_name = ""
try:
for raw in lock_path.read_text(encoding="utf-8", errors="ignore").splitlines():
line = raw.strip()
if line.startswith("pipeline:"):
pipeline_name = line.split(":", 1)[1].strip()
break
except Exception:
return None
if not pipeline_name:
return None
spec_path = resolve_pipeline_spec_path(repo_root=repo_root, pipeline_value=pipeline_name)
if spec_path is None:
return None
try:
return PipelineSpec.load(spec_path)
except Exception:
return None
def pipeline_query_defaults(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
return dict(spec.query_defaults) if spec is not None else {}
def pipeline_quality_contract(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
return dict(spec.quality_contract) if spec is not None else {}
def pipeline_quality_contract_value(workspace: Path, *keys: str, default: Any = None) -> Any:
current: Any = pipeline_quality_contract(workspace)
for key in keys:
if not isinstance(current, dict):
return default
current = current.get(str(key))
if current is None:
return default
return current
def pipeline_query_default(workspace: Path, key: str, default: Any = None) -> Any:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return default
return spec.query_default(key, default)
def pipeline_overridable_query_fields(workspace: Path) -> set[str]:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return set()
return set(spec.overridable_query_fields)
def pipeline_profile(workspace: Path) -> str:
"""Return the pipeline profile for a workspace.
Reads PIPELINE.lock.md to get the pipeline name, then loads the
pipeline spec file and reads its ``profile`` frontmatter field.
Falls back to ``"default"`` if anything is missing.
"""
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return "default"
return str(spec.profile or "default").strip() or "default"
def latest_outline_state(workspace: Path) -> dict[str, Any]:
path = Path(workspace).resolve() / "outline" / "outline_state.jsonl"
records = [rec for rec in read_jsonl(path) if isinstance(rec, dict)]
return dict(records[-1]) if records else {}
def _materialize_missing_query_defaults(lines: list[str], query_defaults: dict[str, Any], *, allowed_fields: set[str] | None = None) -> list[str]:
if not query_defaults:
return lines
existing_keys: set[str] = set()
for raw in lines:
stripped = raw.strip()
if not stripped.startswith("- ") or ":" not in stripped:
continue
key = stripped[2:].split(":", 1)[0].strip().lower().replace(" ", "_").replace("-", "_")
if key:
existing_keys.add(key)
additions: list[str] = []
for key, value in query_defaults.items():
norm_key = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
if not norm_key or norm_key in existing_keys:
continue
if allowed_fields and norm_key not in allowed_fields:
continue
rendered = _render_query_scalar(value)
if rendered is None:
continue
additions.append(f'- {norm_key}: "{rendered}"')
if not additions:
return lines
out = list(lines)
if out and out[-1].strip():
out.append("")
out.extend(additions)
return out
def _render_query_scalar(value: Any) -> str | None:
if value is None:
return None
if isinstance(value, bool):
return "true" if value else "false"
if isinstance(value, (int, float)):
return str(value)
if isinstance(value, str):
text = value.strip()
return text or None
return None
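The approval gate in `decisions_has_approval` comes down to one multiline regex over `DECISIONS.md`. The sketch below reuses that pattern on an in-memory string so the matching rules are easier to see. `has_approval` and `SAMPLE` are hypothetical names introduced for illustration and are not part of `tooling/common.py`.

```python
import re


def has_approval(text: str, checkpoint: str) -> bool:
    """Return True when *text* has a ticked checkbox line for *checkpoint*."""
    if not checkpoint:
        return False
    # Same pattern as decisions_has_approval: a ticked box, an optional "Approve",
    # then the checkpoint id ending at a word boundary (so "C1" never matches "C10").
    pattern = rf"^\s*-\s*\[[xX]\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b"
    return re.search(pattern, text, flags=re.MULTILINE) is not None


# Hypothetical decisions log for illustration.
SAMPLE = """# Decisions log

## Approvals (check to unblock)
- [x] Approve C1 (retrieval + core set)
- [ ] Approve C2 (scope + outline)
- [X] C10
"""

print(has_approval(SAMPLE, "C1"))   # True: ticked, with the "Approve" prefix
print(has_approval(SAMPLE, "C2"))   # False: the box is still empty
print(has_approval(SAMPLE, "C10"))  # True: the "Approve" prefix is optional
```

The trailing `\b` is what keeps an approval for `C10` from unblocking `C1`.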
FILE:tooling/executor.py
from __future__ import annotations
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path
from tooling.common import (
UnitsTable,
atomic_write_text,
decisions_has_approval,
ensure_dir,
latest_outline_state,
load_workspace_pipeline_spec,
now_iso_seconds,
parse_semicolon_list,
set_decisions_approval,
update_status_field,
update_status_log,
)
@dataclass(frozen=True)
class RunResult:
unit_id: str | None
status: str
message: str
def _section_first_cutover_block_message(*, workspace: Path, outputs: list[str]) -> str | None:
spec = load_workspace_pipeline_spec(workspace)
if spec is None or str(spec.structure_mode or "").strip().lower() != "section_first":
return None
if "outline/outline_state.jsonl" not in outputs:
return None
latest = latest_outline_state(workspace)
if not latest:
return "Section-first cutover is not actionable yet: `outline/outline_state.jsonl` is missing or empty. Rerun `outline-refiner` after fixing the chapter/section layer."
structure_phase = str(latest.get("structure_phase") or "").strip()
h3_status = str(latest.get("h3_status") or "").strip().lower()
reroute_target = str(latest.get("reroute_target") or "").strip()
retry_budget = str(latest.get("retry_budget_remaining") or "").strip()
reroute_reason = str(latest.get("reroute_reason") or "").strip()
if structure_phase.lower() == "decomposed" and h3_status == "stable":
return None
missing: list[str] = []
for key in ("structure_phase", "h3_status", "approval_status", "reroute_target", "retry_budget_remaining"):
if key not in latest:
missing.append(key)
continue
if key in {"structure_phase", "h3_status"} and not str(latest.get(key) or "").strip():
missing.append(key)
if missing:
return (
"Section-first cutover cannot advance because the latest `outline_state.jsonl` record is missing "
f"required fields: {', '.join(missing)}."
)
reroute_label = reroute_target or "unknown"
retry_label = retry_budget or "unknown"
reason_suffix = f" Reason: {reroute_reason}." if reroute_reason else ""
return (
"Section-first cutover is still blocked; "
f"latest outline state has structure_phase={structure_phase or 'missing'}, "
f"h3_status={h3_status or 'missing'}, reroute_target={reroute_label}, "
f"retry_budget_remaining={retry_label}.{reason_suffix} Fix the reroute target explicitly, then rerun this unit."
)
def _append_run_error(*, workspace: Path, unit_id: str, skill: str, kind: str, message: str, log_rel: str | None) -> None:
"""Append a short failure record to `output/RUN_ERRORS.md` (workspace-local).
This is a human-facing error sink that survives reruns and makes BLOCKED states debuggable.
"""
try:
ensure_dir(workspace / "output")
out_path = workspace / "output" / "RUN_ERRORS.md"
stamp = now_iso_seconds()
log_hint = f" (log: `{log_rel}`)" if log_rel else ""
line = f"- {stamp} `{unit_id}` `{skill}` `{kind}`: {message}{log_hint}"
if out_path.exists() and out_path.stat().st_size > 0:
prev = out_path.read_text(encoding="utf-8", errors="ignore").rstrip() + "\n"
else:
prev = "# Run errors\n\n"
atomic_write_text(out_path, prev + line + "\n")
except Exception:
# Never let the runner crash while trying to log an error.
return
def run_one_unit(
*,
workspace: Path,
repo_root: Path,
strict: bool = False,
auto_approve: set[str] | None = None,
) -> RunResult:
units_path = workspace / "UNITS.csv"
status_path = workspace / "STATUS.md"
if not units_path.exists():
return RunResult(unit_id=None, status="ERROR", message=f"Missing {units_path}")
table = UnitsTable.load(units_path)
runnable_idx = _find_first_runnable(table)
if runnable_idx is None:
return RunResult(unit_id=None, status="IDLE", message="No runnable unit found")
row = table.rows[runnable_idx]
unit_id = row.get("unit_id", "").strip()
skill = row.get("skill", "").strip()
owner = row.get("owner", "").strip().upper()
row["status"] = "DOING"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DOING {skill}")
auto_approve_set = {str(x or "").strip().upper() for x in (auto_approve or set()) if str(x or "").strip()}
if owner == "HUMAN":
checkpoint = row.get("checkpoint", "").strip()
if checkpoint and decisions_has_approval(workspace / "DECISIONS.md", checkpoint):
row["status"] = "DONE"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (HUMAN approved {checkpoint})")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="DONE", message=f"HUMAN approved {checkpoint}")
if checkpoint and checkpoint.upper() in auto_approve_set:
set_decisions_approval(workspace / "DECISIONS.md", checkpoint, approved=True)
row["status"] = "DONE"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (AUTO approved {checkpoint})")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="DONE", message=f"AUTO approved {checkpoint}")
row["status"] = "BLOCKED"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (await HUMAN approval {checkpoint})")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Await HUMAN approval {checkpoint} in DECISIONS.md")
script_path = repo_root / "scripts" / "run.py"
if not script_path.exists():
row["status"] = "BLOCKED"
table.save(units_path)
skill_md = "SKILL.md"
update_status_log(
status_path,
(
f"{now_iso_seconds()} {unit_id} BLOCKED "
f"(no script for {skill}; run manually per {skill_md} then mark DONE)"
),
)
_refresh_status_checkpoint(status_path, table)
return RunResult(
unit_id=unit_id,
status="BLOCKED",
message=(
f"No executable script for skill '{skill}'. "
f"Run it manually by following `{skill_md}`, write the required outputs, "
f"then mark the unit DONE (e.g., `python scripts/pipeline.py mark --workspace {workspace} --unit-id {unit_id} --status DONE`)."
),
)
inputs = parse_semicolon_list(row.get("inputs"))
raw_outputs = parse_semicolon_list(row.get("outputs"))
outputs = [_strip_optional_marker(rel) for rel in raw_outputs]
required_outputs = [outputs[i] for i, rel in enumerate(raw_outputs) if not rel.strip().startswith("?")]
checkpoint = row.get("checkpoint", "").strip()
cmd = [
sys.executable,
str(script_path),
"--workspace",
str(workspace),
"--unit-id",
unit_id,
"--inputs",
";".join(inputs),
"--outputs",
";".join(outputs),
"--checkpoint",
checkpoint,
]
log_rel = f"output/unit_logs/{unit_id}.{skill}.log"
log_path = workspace / log_rel
try:
completed = subprocess.run(cmd, check=False, capture_output=True, text=True)
if completed.stdout or completed.stderr or completed.returncode != 0:
ensure_dir(log_path.parent)
body = [
"# Unit log\n",
f"- unit_id: {unit_id}\n",
f"- skill: {skill}\n",
f"- exit: {completed.returncode}\n",
f"- cmd: {' '.join(cmd)}\n",
"\n## stdout\n\n",
(completed.stdout or "(empty)") + "\n",
"\n## stderr\n\n",
(completed.stderr or "(empty)") + "\n",
]
atomic_write_text(log_path, "".join(body))
except Exception as exc: # pragma: no cover
row["status"] = "BLOCKED"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (exec error)")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="exec_error",
message=f"{type(exc).__name__}: {exc}",
log_rel=None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=str(exc))
missing = [rel for rel in required_outputs if rel and not (workspace / rel).exists()]
if completed.returncode == 0 and not missing:
if strict:
from tooling.quality_gate import check_unit_outputs, write_quality_report
try:
issues = check_unit_outputs(skill=skill, workspace=workspace, outputs=outputs)
except Exception as exc: # pragma: no cover
from tooling.quality_gate import QualityIssue
issues = [
QualityIssue(
code="quality_gate_exception",
message=f"Quality gate crashed: {type(exc).__name__}: {exc}",
)
]
# Avoid confusing stale QUALITY_GATE.md after a successful run.
report_path = workspace / "output" / "QUALITY_GATE.md"
if issues or report_path.exists():
write_quality_report(workspace=workspace, unit_id=unit_id, skill=skill, issues=issues)
if issues:
row["status"] = "BLOCKED"
table.save(units_path)
rel_report = str((workspace / "output" / "QUALITY_GATE.md").relative_to(workspace))
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (quality gate: {rel_report})")
_refresh_status_checkpoint(status_path, table)
reroute_hint = _reroute_hint(workspace)
return RunResult(
unit_id=unit_id,
status="BLOCKED",
message=f"Quality gate failed; see {rel_report}" + (f"; {reroute_hint}" if reroute_hint else ""),
)
cutover_block = _section_first_cutover_block_message(workspace=workspace, outputs=outputs)
if cutover_block:
row["status"] = "BLOCKED"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (section-first cutover)")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="section_first_cutover",
message=cutover_block,
log_rel=log_rel if log_path.exists() else None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=cutover_block)
row["status"] = "DONE"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE {skill}")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="DONE", message="OK")
row["status"] = "BLOCKED"
table.save(units_path)
if missing:
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (missing outputs: {', '.join(missing)})")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="missing_outputs",
message=f"Missing outputs: {', '.join(missing)}",
log_rel=log_rel if log_path.exists() else None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Missing outputs: {', '.join(missing)}" + (f"; see {log_rel}" if log_path.exists() else ""))
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (script failed)")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="script_failed",
message=f"Skill script failed (exit {completed.returncode})",
log_rel=log_rel if log_path.exists() else None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Skill script failed (exit {completed.returncode})" + (f"; see {log_rel}" if log_path.exists() else ""))
def _find_first_runnable(table: UnitsTable) -> int | None:
    """Return the index of the first TODO/BLOCKED unit whose dependencies are all DONE/SKIP."""
    status_ok = {"DONE", "SKIP"}
    unit_by_id = {row.get("unit_id", ""): row for row in table.rows}
    for idx, row in enumerate(table.rows):
        if row.get("status", "").strip().upper() not in {"TODO", "BLOCKED"}:
            continue
        deps = parse_semicolon_list(row.get("depends_on"))
        if not deps:
            return idx
        deps_done = True
        for dep_id in deps:
            dep = unit_by_id.get(dep_id)
            if not dep:
                # An unknown dependency id counts as unsatisfied.
                deps_done = False
                break
            if dep.get("status", "").strip().upper() not in status_ok:
                deps_done = False
                break
        if deps_done:
            return idx
    return None


def _refresh_status_checkpoint(status_path: Path, table: UnitsTable) -> None:
    checkpoint = _compute_current_checkpoint(table)
    update_status_field(status_path, "Current checkpoint", checkpoint)


def _compute_current_checkpoint(table: UnitsTable) -> str:
    """Return the checkpoint of the first unit not yet DONE/SKIP ("C0" if unset), else "DONE"."""
    for row in table.rows:
        if row.get("status", "").strip().upper() not in {"DONE", "SKIP"}:
            return (row.get("checkpoint") or "").strip() or "C0"
    return "DONE"
def invalidate_downstream_units(table: UnitsTable, *, root_unit_id: str) -> list[str]:
"""Reset all transitive downstream dependents of `root_unit_id` to TODO.
This is used when a previously satisfied upstream unit is reopened for rerun.
Keeping downstream units as DONE would otherwise leave stale artifacts in place
and make later `run` invocations stop too early.
"""
root = str(root_unit_id or "").strip()
if not root:
return []
direct_children: dict[str, list[dict[str, str]]] = {}
for row in table.rows:
for dep in parse_semicolon_list(row.get("depends_on")):
direct_children.setdefault(dep, []).append(row)
affected: list[str] = []
seen: set[str] = set()
stack = [root]
while stack:
current = stack.pop()
for child in direct_children.get(current, []):
child_id = str(child.get("unit_id") or "").strip()
if not child_id or child_id in seen:
continue
seen.add(child_id)
if str(child.get("status") or "").strip().upper() != "TODO":
child["status"] = "TODO"
affected.append(child_id)
stack.append(child_id)
return affected
def _strip_optional_marker(relpath: str) -> str:
    """Drop a leading `?` marker from a relative path."""
    relpath = (relpath or "").strip()
    if relpath.startswith("?"):
        return relpath[1:].strip()
    return relpath


def _reroute_hint(workspace: Path) -> str:
    """Summarize output/REROUTE_STATE.json as a one-line hint; return "" if absent, unreadable, or empty."""
    path = workspace / "output" / "REROUTE_STATE.json"
    if not path.exists() or path.stat().st_size <= 0:
        return ""
    try:
        import json

        data = json.loads(path.read_text(encoding="utf-8", errors="ignore") or "{}")
    except Exception:
        return ""
    if not isinstance(data, dict):
        return ""
    target = str(data.get("reroute_target") or "").strip()
    status = str(data.get("status") or "").strip()
    phase = str(data.get("structure_phase") or "").strip()
    h3 = str(data.get("h3_status") or "").strip()
    reason = str(data.get("reroute_reason") or "").strip()
    # A reason alone is not enough to emit a hint.
    if not any([target, status, phase, h3]):
        return ""
    parts = []
    if status:
        parts.append(f"reroute_status={status}")
    if target:
        parts.append(f"reroute_target={target}")
    if phase:
        parts.append(f"structure_phase={phase}")
    if h3:
        parts.append(f"h3_status={h3}")
    if reason:
        parts.append(f"reason={reason}")
    return ", ".join(parts)
FILE:tooling/ideation.py
from __future__ import annotations
import json
import re
from dataclasses import asdict, dataclass, is_dataclass
from pathlib import Path
from typing import Any, Iterable
from tooling.common import (
atomic_write_text,
ensure_dir,
load_workspace_pipeline_spec,
pipeline_overridable_query_fields,
pipeline_query_default,
read_jsonl,
)
DEFAULT_IDEA_RUBRIC: list[tuple[str, float, str]] = [
("discussion_worthiness", 0.24, "is this direction worth a serious PI/PhD discussion right now?"),
("academic_value", 0.22, "could it open a meaningful paper/thesis-worthy research angle?"),
("evidence_grounding", 0.18, "can we point to concrete literature tensions rather than pure speculation?"),
("direction_distinctness", 0.16, "is it genuinely different from neighboring directions, not just a wording variant?"),
("first_probe_clarity", 0.10, "is there a plausible low-cost first probe that would teach us something?"),
("thesis_potential", 0.10, "does it plausibly grow beyond a one-off note into a stronger line of work?"),
]
STOPWORDS = {
"a", "an", "and", "the", "of", "to", "for", "in", "on", "with", "via", "by", "from", "or",
"work", "works", "study", "studies", "survey", "review", "benchmarks", "benchmark", "evaluation",
"agents", "agent", "model", "models", "llm", "large", "language", "using", "based",
}
CLUSTER_AXIS_HINTS: list[tuple[list[str], list[str], str, str]] = [
(["agent loop", "action"], ["observability granularity", "action-space design", "tool/environment boundary"], "mechanism", "Could sharpen how we think about the basic agent loop rather than adding yet another full stack."),
(["tool", "orchestration", "interface"], ["tool routing", "permission model", "API reliability assumptions"], "systems", "Could clarify whether interface design is the real bottleneck behind seemingly agent-level gains."),
(["planning", "reasoning"], ["search depth", "verification loop", "partial-observability handling"], "mechanism", "Could turn vague talk about reasoning quality into a sharper research question about what actually drives robust planning."),
(["memory", "retrieval", "rag"], ["retrieval policy", "state summarization cadence", "memory horizon"], "mechanism", "Could reveal when memory is a genuine capability lever versus a reporting convenience."),
(["self-improvement", "adaptation"], ["feedback type", "adaptation horizon", "evaluation signal quality"], "adaptation", "Could distinguish genuine improvement from evaluation-sensitive self-optimization."),
(["multi-agent", "coordination"], ["role specialization", "communication budget", "aggregation rule"], "coordination", "Could separate coordination benefits from simple redundancy or verification effects."),
(["benchmark", "evaluation"], ["metric choice", "task slice design", "failure-analysis lens"], "evaluation", "Could make evaluation choices themselves a research object instead of hidden background assumptions."),
(["safety", "security", "governance"], ["threat model", "monitoring granularity", "permission boundary"], "governance", "Could connect technical agent behaviors to deployment-relevant risk questions in a more principled way."),
]
GENERIC_LIMITATION_PATTERNS = [
"abstract-level evidence only",
"validate assumptions",
"full paper",
"before relying on this as key evidence",
]
BENCHMARK_HINTS = [
"HotpotQA", "FEVER", "ALFWorld", "WebShop", "HumanEval", "WebArena", "AgentBench",
"GSM8K", "MATH", "Wikipedia", "wiki", "API", "question answering", "fact verification",
]
AXIS_INSIGHT_LIBRARY: dict[str, dict[str, str]] = {
"observability granularity": {
"question": "whether several agent-loop gains are really planner gains, or whether they mostly come from changing what the planner gets to observe and when",
"confound": "planner quality and broader agent competence",
"thesis": "Several agent-loop gains remain hard to interpret because papers often improve observation access at the same time they improve the planner; before crediting planner depth, we should ask what the system was allowed to see.",
"insight": "A convincing result would not just move aggregate score. It would show which published gains survive once observation access is fixed, ideally producing a regime map of when observability—not planner depth—changes the failure story.",
"contribution_shape": "Could yield a causal-attribution result plus a reporting rule for agent-loop papers: claims about planning quality should specify and control observation access.",
"demotion": "This direction weakens sharply if the strongest prior work already holds observation access fixed while planner quality changes and still reports the same gain pattern.",
"kill_signal": "an anchor paper already fixes observation access while varying planner quality and the main conclusion still survives",
"missing_piece": "What is missing is a fixed-interface, fixed-budget comparison that varies only observation access and tracks whether the failure taxonomy—not just average score—changes.",
"program_kind": "causal attribution",
"time_to_clarity": "fast",
"priority_note": "Fastest to falsify on public tasks, and the answer would immediately change how several agent-loop results are interpreted.",
"reading_extract": "what the agent sees at each step, which ablations vary planner depth, and whether the action interface stays fixed",
"title": "Observability granularity vs planner depth",
},
"action-space design": {
"question": "whether robustness comes from better reasoning or from giving the agent a cleaner action interface",
"confound": "the shape of the action vocabulary and interface design",
"thesis": "Some agent-loop gains may be interface gains in disguise: when action vocabularies become cleaner or narrower, papers can attribute robustness to reasoning improvements that partly come from action-space design.",
"insight": "A useful result would show whether the same nominal planner still behaves very differently once the action space is widened, normalized, or made less ergonomic.",
"contribution_shape": "Could produce an action-space normalization protocol and a clearer account of which agent claims survive once interface ergonomics are controlled.",
"demotion": "This direction weakens if current papers already compare equivalent action spaces and still obtain the same ranking.",
"kill_signal": "the key anchor papers already normalize action vocabularies or API surfaces and still see the same ordering",
"missing_piece": "What is missing is an action-space-normalized comparison that keeps planner prompts fixed while changing only interface granularity and affordances.",
"program_kind": "interface normalization",
"time_to_clarity": "medium",
"priority_note": "Interesting and still thesis-relevant, but less immediate than observability because interface normalization usually requires more careful task redesign.",
"reading_extract": "how many actions are available, how semantically aligned the actions are, and whether interface cleanup happens together with reasoning changes",
"title": "Action-space design or agent competence?",
},
"tool/environment boundary": {
"question": "whether reliability depends more on boundary design between tool and environment than on the nominal reasoning loop",
"confound": "tool abstraction and environment modeling",
"thesis": "A number of agent results may really be about boundary design: where the system stops reasoning internally and starts delegating to tools can change reliability without changing the nominal planner much.",
"insight": "A strong result would show that several agent-level conclusions are actually boundary-design conclusions once tool abstraction is normalized.",
"contribution_shape": "Could yield a boundary-design taxonomy and a cleaner experimental recipe for separating planner quality from tool/interface scaffolding.",
"demotion": "This direction weakens if changing the boundary carefully barely affects the failure story.",
"kill_signal": "careful tool-boundary changes leave both the metric and the qualitative failure modes essentially unchanged",
"missing_piece": "What is missing is a comparison that keeps task semantics fixed while moving the tool/environment boundary in a controlled way.",
"program_kind": "systems boundary",
"time_to_clarity": "medium",
"priority_note": "Valuable as a systems-facing thesis line, but slower to turn into a decisive first readout than the lead mechanism questions.",
"reading_extract": "where tool calls begin, what state is exposed to the planner, and whether environment modeling changes together with the tool boundary",
"title": "Where the tool boundary really matters",
},
"search depth": {
"question": "whether apparent planning gains reflect deeper search or simply more inference-time budget to recover from weak initial choices",
"confound": "inference-time compute budget",
"thesis": "Many planning results still bundle depth with budget: deeper search often spends more tokens, branches, or retries, so it remains unclear whether depth changes reasoning quality or just buys more recovery opportunities.",
"insight": "A convincing result would show whether depth changes the nature of planning failures under a matched budget, rather than merely delaying failure by spending more compute.",
"contribution_shape": "Could produce a compute-normalized planning benchmark slice and a regime map for when search depth matters beyond extra inference budget.",
"demotion": "This direction weakens if prior work already equalizes budget and still finds depth-specific gains.",
"kill_signal": "the anchor papers already normalize token or wall-clock budget and the depth advantage remains intact",
"missing_piece": "What is missing is a compute-normalized study that holds token or wall-clock budget fixed while varying depth or branching, then checks whether failure modes actually change.",
"program_kind": "budget-normalized mechanism",
"time_to_clarity": "medium",
"priority_note": "Still thesis-sized, but it ranks behind observability because compute normalization is harder to defend and easier for prior work to have addressed already.",
"reading_extract": "the reported depth or branching settings, the effective compute budget, and whether shallow baselines were budget-matched",
"title": "Search depth or compute budget?",
},
"verification loop": {
"question": "whether verification contributes a distinct reasoning mechanism or mostly acts as expensive redundancy",
"confound": "extra compute and repeated checking",
"thesis": "Verification-heavy pipelines may look stronger because they retry, re-check, or filter more often, not necessarily because they add a distinct reasoning mechanism.",
"insight": "A useful result would separate verification as a mechanism from verification as a compute-expensive retry policy.",
"contribution_shape": "Could produce a cleaner protocol for comparing verification loops under matched retry or compute budgets.",
"demotion": "This direction weakens if verification can be replaced by simple repetition with little change in behavior.",
"kill_signal": "repeat-sampling or retry baselines already match the reported verification gain",
"missing_piece": "What is missing is a matched-budget comparison between explicit verification and simple repetition or reranking baselines.",
"program_kind": "verification audit",
"time_to_clarity": "medium",
"priority_note": "Decision-relevant when the group suspects redundancy masquerading as reasoning, though the probe is less clean than the top two directions.",
"reading_extract": "how many extra passes are spent on checking, whether retry baselines exist, and which errors verification actually fixes",
"title": "Verification or just expensive redundancy?",
},
"partial-observability handling": {
"question": "whether planning systems fail because they reason badly or because they are brittle under missing state information",
"confound": "state visibility",
"thesis": "Some planning failures may be epistemic before they are algorithmic: systems can look like weak planners when they are actually brittle under missing state information.",
"insight": "A strong result would show whether improved visibility changes the failure taxonomy more than planner changes do.",
"contribution_shape": "Could produce a planning-under-partial-observability benchmark slice and a clearer division between epistemic and algorithmic failure modes.",
"demotion": "This direction weakens if stronger visibility barely changes the failure pattern.",
"kill_signal": "improved visibility leaves both success rates and error types almost unchanged",
"missing_piece": "What is missing is a task slice that varies only state visibility while keeping planner scaffolding steady.",
"program_kind": "epistemic failure analysis",
"time_to_clarity": "medium",
"priority_note": "Useful if the group wants a deeper failure-analysis program, but it is currently less concrete than the lead directions.",
"reading_extract": "what information is hidden, which observations are restored by tools, and whether planner quality is tested separately from visibility",
"title": "Is planning failure really an observability problem?",
},
"retrieval policy": {
"question": "whether memory gains come from better stored knowledge or from better decisions about when and how retrieval is triggered",
"confound": "memory content versus retrieval timing",
"thesis": "Reported memory gains are often hard to interpret because memory content and retrieval triggers move together; some improvements may be protocol effects about when retrieval happens, not capability effects about what memory stores.",
"insight": "A convincing result would separate memory-content effects from retrieval-trigger effects and show whether the interpretation of current RAG-style gains changes once trigger policy is normalized.",
"contribution_shape": "Could yield a cleaner memory-evaluation protocol and a reusable result on when retrieval policy, rather than memory content, is the true hidden variable.",
"demotion": "This direction weakens if current work already varies retrieval policy independently of memory content and sees the same story.",
"kill_signal": "the main memory papers already keep the memory store fixed while varying retrieval triggers or timing and the conclusion does not move",
"missing_piece": "What is missing is a fixed-memory comparison that varies retrieval trigger and timing only, then checks whether the gain survives and which errors it actually removes.",
"program_kind": "protocol sensitivity",
"time_to_clarity": "medium",
"priority_note": "Different enough from the planning directions to stay in the lead set, but it ranks lower because the current evidence is thinner and more protocol-sensitive.",
"reading_extract": "what is stored in memory, when retrieval fires, what retrieval budget is allowed, and whether trigger policy is ever ablated independently",
"title": "Retrieval policy or memory content?",
},
"state summarization cadence": {
"question": "whether summarization helps because it improves reasoning or because it selectively compresses away troublesome state information",
"confound": "what gets forgotten versus what gets highlighted",
"thesis": "Summarization may help less because the agent reasons better and more because the system filters state in a favorable way; cadence choices can quietly redefine what information survives.",
"insight": "A useful result would show whether different summarization cadences preserve the same conclusions and failure taxonomy once memory content is held steady.",
"contribution_shape": "Could produce a summarization-control protocol for long-horizon agents and a clearer account of how state compression alters conclusions.",
"demotion": "This direction weakens if different summarization cadences preserve the same conclusions and failure taxonomy.",
"kill_signal": "changing summarization cadence leaves both retrieval behavior and downstream errors largely unchanged",
"missing_piece": "What is missing is a fixed-memory, fixed-task comparison that varies only summarization cadence and records what state is lost or amplified.",
"program_kind": "state compression",
"time_to_clarity": "medium",
"priority_note": "Promising but currently less mature than retrieval-policy framing because the concrete prior-work hooks are thinner.",
"reading_extract": "what gets summarized away, how often summaries are rewritten, and whether failure cases correlate with state compression choices",
"title": "What summarization cadence is really changing",
},
"memory horizon": {
"question": "whether longer memory helps because more context is useful or because evaluations reward persistence over selectivity",
"confound": "context length and evaluation design",
"thesis": "Longer memory can look like better capability even when the benchmark mainly rewards persistence or repeated access to earlier state, so horizon effects may partly be evaluation-design effects.",
"insight": "A convincing result would show whether extending horizon changes task framing and error patterns, not just aggregate score.",
"contribution_shape": "Could produce a regime map for when horizon length matters, versus when selective retrieval or better task design explains the gain.",
"demotion": "This direction weakens if memory horizon has little effect once retrieval policy is controlled.",
"kill_signal": "memory-length changes stop mattering once retrieval triggers or evaluation design are normalized",
"missing_piece": "What is missing is a horizon study that matches retrieval policy and benchmark framing while varying how much state is retained.",
"program_kind": "regime mapping",
"time_to_clarity": "medium",
"priority_note": "Conceptually useful, but currently less decisive than retrieval-policy framing for a first discussion round.",
"reading_extract": "whether longer context actually changes retrieval choices, and whether the benchmark rewards persistence more than selective recall",
"title": "Does longer memory really help?",
},
}
@dataclass(frozen=True)
class IdeaSignal:
    signal_id: str
    cluster: str
    direction_type: str
    theme: str
    claim_or_observation: str
    tension: str
    missing_piece: str
    possible_axis: str
    academic_value: str
    evidence_confidence: str
    paper_ids: list[str]


@dataclass(frozen=True)
class DirectionCard:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    focus_axis: str
    main_confound: str
    program_kind: str
    contribution_shape: str
    time_to_clarity: str
    one_line_thesis: str
    why_interesting: str
    literature_suggests: list[str]
    closest_prior_gap: list[str]
    missing_piece: str
    possible_variants: list[str]
    academic_value: str
    first_probes: list[str]
    what_counts_as_insight: str
    weakness_conditions: list[str]
    kill_criteria: list[str]
    what_would_change_mind: list[str]
    best_fit: str
    why_this_ranks_here: str
    evidence_confidence: str
    paper_ids: list[str]
    signal_ids: list[str]
    anchor_reading_notes: list[dict[str, str]]


@dataclass(frozen=True)
class ScreenedDirection:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    total_score: float
    discussion_worthiness: int
    academic_value_score: int
    evidence_grounding: int
    direction_distinctness: int
    first_probe_clarity: int
    thesis_potential: int
    recommendation: str
    rationale: str
def read_core_set(path: Path) -> list[dict[str, str]]:
    import csv

    if not path.exists():
        return []
    with path.open("r", encoding="utf-8", newline="") as handle:
        return [dict(row) for row in csv.DictReader(handle)]


def write_jsonl(path: Path, rows: Iterable[dict[str, Any] | Any]) -> None:
    ensure_dir(path.parent)
    encoded: list[str] = []
    for row in rows:
        value = asdict(row) if is_dataclass(row) else row
        encoded.append(json.dumps(value, ensure_ascii=False))
    atomic_write_text(path, "\n".join(encoded).rstrip() + ("\n" if encoded else ""))


def write_json(path: Path, data: Any) -> None:
    ensure_dir(path.parent)
    atomic_write_text(path, json.dumps(data, ensure_ascii=False, indent=2).rstrip() + "\n")


def uniq_keep_order(items: Iterable[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or "").strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out


def slugify(text: str) -> str:
    raw = re.sub(r"[^A-Za-z0-9]+", "-", str(text or "").strip().lower())
    return raw.strip("-") or "x"


def clean_text(text: str, *, limit: int = 220) -> str:
    s = str(text or "").strip()
    s = s.replace("\n", " ")
    s = re.sub(r"\s+", " ", s)
    s = s.replace("|", ", ")
    s = s.strip(" \"'`")
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()


def clean_sentence(text: str, *, limit: int = 180) -> str:
    s = clean_text(text, limit=max(limit * 2, limit))
    if len(s) <= limit:
        return s
    window = s[:limit + 40]
    pieces = re.split(r'(?<=[.!?;])\s+', window)
    acc: list[str] = []
    total = 0
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue
        nxt = total + len(piece) + (1 if acc else 0)
        if nxt > limit and acc:
            break
        acc.append(piece)
        total = nxt
        if total >= int(limit * 0.65):
            break
    if acc:
        return " ".join(acc).strip()
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()


def markdown_table(headers: list[str], rows: list[list[str]]) -> str:
    head = "| " + " | ".join(headers) + " |"
    sep = "| " + " | ".join(["---"] * len(headers)) + " |"
    body = ["| " + " | ".join(str(cell).replace("\n", " ").strip() for cell in row) + " |" for row in rows]
    return "\n".join([head, sep] + body)


def write_markdown(path: Path, text: str) -> None:
    ensure_dir(path.parent)
    atomic_write_text(path, text.rstrip() + "\n")


def extract_goal_from_goal_md(path: Path) -> str:
    if not path.exists():
        return "research ideas"
    lines = [ln.strip() for ln in path.read_text(encoding="utf-8", errors="ignore").splitlines() if ln.strip() and not ln.startswith("#")]
    return lines[0] if lines else "research ideas"


def _idea_workspace_from_brief(path: Path) -> Path | None:
    try:
        return path.resolve().parents[2]
    except Exception:
        return None
def _require_positive_float(value: Any, *, field_name: str) -> float:
    try:
        parsed = float(value)
    except Exception as exc:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
    if parsed <= 0:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return parsed


def _query_int_override(workspace: Path, key: str, default: int) -> int:
    """Return the positive integer override for `key` from queries.md, else `default`.

    Only keys listed by `pipeline_overridable_query_fields` are honored. The first
    matching `- key: value` line wins; an empty, non-integer, or non-positive value
    raises ValueError.
    """
    queries_path = workspace / "queries.md"
    normalized = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
    if not normalized:
        return int(default)
    if normalized not in pipeline_overridable_query_fields(workspace):
        return int(default)
    if not queries_path.exists():
        return int(default)
    for raw in queries_path.read_text(encoding="utf-8", errors="ignore").splitlines():
        line = raw.strip()
        if not line.startswith("- ") or ":" not in line:
            continue
        raw_key, value = line[2:].split(":", 1)
        line_key = raw_key.strip().lower().replace(" ", "_").replace("-", "_")
        if line_key != normalized:
            continue
        # Drop any trailing `# comment` and surrounding quotes before parsing.
        cleaned = value.split("#", 1)[0].strip().strip('"').strip("'")
        if not cleaned:
            raise ValueError(f"Invalid ideation override: `{normalized}` is empty in queries.md.")
        try:
            parsed = int(cleaned)
        except Exception as exc:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be an integer in queries.md.") from exc
        if parsed <= 0:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be a positive integer in queries.md.")
        return parsed
    return int(default)
def _validate_score_weights(value: Any, *, field_name: str) -> dict[str, float]:
    required = {
        "discussion_worthiness": 0.24,
        "academic_value": 0.22,
        "evidence_grounding": 0.18,
        "direction_distinctness": 0.16,
        "first_probe_clarity": 0.10,
        "thesis_potential": 0.10,
    }
    if not isinstance(value, dict):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    weights: dict[str, float] = {}
    for key in required:
        if key not in value:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{key}")
        weights[key] = _require_positive_float(value.get(key), field_name=f"{field_name}.{key}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return {key: round(weight / total, 6) for key, weight in weights.items()}


def _validate_diversity_axes(value: Any, *, field_name: str) -> list[str]:
    allowed = {"cluster", "direction_type", "program_kind"}
    if not isinstance(value, list):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    axes: list[str] = []
    seen: set[str] = set()
    for item in value:
        axis = str(item or "").strip().lower().replace("-", "_")
        if not axis:
            continue
        if axis not in allowed:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{axis}")
        if axis in seen:
            continue
        seen.add(axis)
        axes.append(axis)
    if not axes:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return axes
def resolve_idea_contract(workspace: Path) -> dict[str, Any]:
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None:
        raise ValueError("Missing active pipeline contract.")
    query_defaults = dict(spec.query_defaults)
    quality_contract = dict(spec.quality_contract)

    def _require_positive_int(value: Any, *, field_name: str) -> int:
        try:
            parsed = int(value)
        except Exception as exc:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
        if parsed <= 0:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
        return parsed

    signal_policy = quality_contract.get("signal_policy") or {}
    direction_policy = quality_contract.get("direction_policy") or {}
    screening_policy = quality_contract.get("screening_policy") or {}
    brief = parse_idea_brief(workspace / "output" / "trace" / "IDEA_BRIEF.md")
    focus_clusters = [str(x).strip() for x in (brief.get("focus_clusters") or []) if str(x).strip()]
    direction_pool_min_default = _require_positive_int(query_defaults.get("direction_pool_min"), field_name="query_defaults.direction_pool_min")
    direction_pool_max_default = _require_positive_int(query_defaults.get("direction_pool_max"), field_name="query_defaults.direction_pool_max")
    shortlist_size_default = _require_positive_int(query_defaults.get("idea_shortlist_size"), field_name="query_defaults.idea_shortlist_size")
    report_top_n_default = _require_positive_int(query_defaults.get("report_top_n"), field_name="query_defaults.report_top_n")
    idea_screen_top_n_default = _require_positive_int(query_defaults.get("idea_screen_top_n"), field_name="query_defaults.idea_screen_top_n")
    shortlist_min = _require_positive_int(direction_policy.get("shortlist_min"), field_name="quality_contract.direction_policy.shortlist_min")
    shortlist_max = _require_positive_int(direction_policy.get("shortlist_max"), field_name="quality_contract.direction_policy.shortlist_max")
    keep_min = _require_positive_int(direction_policy.get("keep_min"), field_name="quality_contract.direction_policy.keep_min")
    cluster_diversity_min = _require_positive_int(direction_policy.get("cluster_diversity_min"), field_name="quality_contract.direction_policy.cluster_diversity_min")
    lead_diversity_target = _require_positive_int(direction_policy.get("lead_diversity_target"), field_name="quality_contract.direction_policy.lead_diversity_target")
    lead_diversity_axes = _validate_diversity_axes(
        direction_policy.get("lead_diversity_axes"),
        field_name="quality_contract.direction_policy.lead_diversity_axes",
    )
    signal_table_min = _require_positive_int(signal_policy.get("min_rows"), field_name="quality_contract.signal_policy.min_rows")
    keep_rank_max = _require_positive_int(screening_policy.get("keep_rank_max"), field_name="quality_contract.screening_policy.keep_rank_max")
    maybe_rank_max = _require_positive_int(screening_policy.get("maybe_rank_max"), field_name="quality_contract.screening_policy.maybe_rank_max")
    score_weights = _validate_score_weights(
        screening_policy.get("score_weights"),
        field_name="quality_contract.screening_policy.score_weights",
    )
    direction_pool_min = _query_int_override(workspace, "direction_pool_min", direction_pool_min_default)
    direction_pool_max = _query_int_override(workspace, "direction_pool_max", direction_pool_max_default)
    shortlist_size = _query_int_override(workspace, "idea_shortlist_size", shortlist_size_default)
    report_top_n = _query_int_override(workspace, "report_top_n", report_top_n_default)
    idea_screen_top_n = _query_int_override(workspace, "idea_screen_top_n", idea_screen_top_n_default)
    if direction_pool_min > direction_pool_max:
        raise ValueError(f"Invalid ideation contract: direction_pool_min ({direction_pool_min}) exceeds direction_pool_max ({direction_pool_max}).")
    if shortlist_size < shortlist_min or shortlist_size > shortlist_max:
        raise ValueError(
            f"Invalid ideation contract: idea_shortlist_size ({shortlist_size}) must stay within "
            f"[{shortlist_min}, {shortlist_max}]."
        )
    if report_top_n > shortlist_size:
        raise ValueError(f"Invalid ideation contract: report_top_n ({report_top_n}) exceeds idea_shortlist_size ({shortlist_size}).")
    if maybe_rank_max < keep_rank_max:
        raise ValueError(
            f"Invalid ideation contract: maybe_rank_max ({maybe_rank_max}) must be >= keep_rank_max ({keep_rank_max})."
        )
    if idea_screen_top_n > direction_pool_max:
        raise ValueError(f"Invalid ideation contract: idea_screen_top_n ({idea_screen_top_n}) exceeds direction_pool_max ({direction_pool_max}).")
    return {
"focus_clusters": focus_clusters,
"direction_pool_min": direction_pool_min,
"direction_pool_max": direction_pool_max,
"idea_screen_top_n": idea_screen_top_n,
"shortlist_size": shortlist_size,
"report_top_n": report_top_n,
"shortlist_min": shortlist_min,
"shortlist_max": shortlist_max,
"signal_table_min": signal_table_min,
"keep_min": keep_min,
"cluster_diversity_min": cluster_diversity_min,
"lead_diversity_target": lead_diversity_target,
"lead_diversity_axes": lead_diversity_axes,
"keep_rank_max": keep_rank_max,
"maybe_rank_max": maybe_rank_max,
"score_weights": score_weights,
}

def parse_idea_brief(path: Path) -> dict[str, Any]:
    """Parse the idea brief at `path` into goal, focus clusters, query buckets, exclusions, constraints, and targets.

    A missing file yields empty collections; the goal then comes from `extract_goal_from_goal_md` alone.
    """
    text = path.read_text(encoding="utf-8", errors="ignore") if path.exists() else ""
    out: dict[str, Any] = {
        "goal": extract_goal_from_goal_md(path),
        "focus_clusters": [],
        "query_buckets": [],
        "exclusions": [],
        "constraints": [],
        "targets": {},
    }
    cur = None  # lower-cased title of the current "## " section
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith("## "):
            cur = line[3:].strip().lower()
            continue
        if cur == "goal" and line.startswith("- Topic:"):
            out["goal"] = line.split(":", 1)[1].strip()
        if cur in {"focus after c2", "focus lenses after c2"} and line.startswith("- Focus clusters:"):
            clusters = [x.strip() for x in line.split(":", 1)[1].split(";") if x.strip()]
            cleaned: list[str] = []
            for item in clusters:
                low = item.lower()
                # Skip template placeholders that were never filled in.
                if "to be filled" in low or "fill after c2" in low or "placeholder" in low:
                    continue
                cleaned.append(item)
            out["focus_clusters"] = cleaned
        if cur == "query buckets" and re.match(r"^\d+\.\s+", line):
            out["query_buckets"].append(re.sub(r"^\d+\.\s+", "", line).strip())
        if cur in {"exclude terms", "exclusions"} and line.startswith("- "):
            value = line[2:].strip()
            if value and not value.lower().startswith("none"):
                out["exclusions"].append(value)
        if cur == "constraints" and line.startswith("- "):
            out["constraints"].append(line[2:].strip())
        if cur == "targets" and line.startswith("- ") and ":" in line:
            key, value = line[2:].split(":", 1)
            out["targets"][key.strip().lower().replace(" ", "_")] = value.strip()
    return out

def collect_note_index(path: Path) -> dict[str, dict[str, Any]]:
    """Index JSONL note records by `paper_id`; on duplicate ids the last record wins."""
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        pid = str(rec.get("paper_id") or "").strip()
        if pid:
            out[pid] = rec
    return out
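
def keywords_from_cluster(name: str) -> list[str]:
    """Return the lower-cased alphanumeric tokens of a cluster name, dropping stopwords and tokens of 2 characters or fewer."""
    toks = [t for t in re.findall(r"[A-Za-z0-9]+", str(name or "").lower()) if t not in STOPWORDS and len(t) > 2]
    return uniq_keep_order(toks)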

def score_note_to_cluster(cluster_name: str, note: dict[str, Any]) -> int:
    """Count how many cluster keywords occur in the note's text fields.

    Matching is by substring, so a keyword also matches inside longer words.
    """
    keys = keywords_from_cluster(cluster_name)
    blob_parts = [str(note.get("title") or "")]
    for field in ["summary_bullets", "limitations", "key_results", "method", "abstract"]:
        val = note.get(field)
        if isinstance(val, list):
            blob_parts.extend([str(x) for x in val])
        elif isinstance(val, str):
            blob_parts.append(val)
    blob = " ".join(blob_parts).lower()
    return sum(1 for key in keys if key in blob)

def map_notes_to_clusters(taxonomy_path: Path, notes_path: Path) -> dict[str, list[dict[str, Any]]]:
    """Map each second-level taxonomy node to its best keyword-matching notes (at most 8 per cluster)."""
    import yaml

    taxonomy = yaml.safe_load(taxonomy_path.read_text(encoding="utf-8", errors="ignore")) if taxonomy_path.exists() else []
    notes = [r for r in read_jsonl(notes_path) if isinstance(r, dict)]
    out: dict[str, list[dict[str, Any]]] = {}
    for top in taxonomy or []:
        if not isinstance(top, dict):
            continue
        for child in top.get("children") or []:
            if not isinstance(child, dict):
                continue
            name = str(child.get("name") or "").strip()
            if not name:
                continue
            scored: list[tuple[int, dict[str, Any]]] = []
            for note in notes:
                s = score_note_to_cluster(name, note)
                if s > 0:
                    scored.append((s, note))
            # Highest score first; ties broken by paper_id so the order is deterministic.
            scored.sort(key=lambda x: (-x[0], str(x[1].get("paper_id") or "")))
            out[name] = [n for _, n in scored[:8]]
    return out

def _cluster_profile(cluster: str) -> tuple[str, list[str], str]:
    """Return (direction_type, axes, academic_value) from the first matching CLUSTER_AXIS_HINTS entry, or a generic research profile."""
    low = cluster.lower()
    for keys, axes, direction_type, academic_value in CLUSTER_AXIS_HINTS:
        if any(key in low for key in keys):
            return direction_type, axes, academic_value
    return "research", ["assumption sensitivity", "failure analysis", "scope boundary"], "Could sharpen the way this sub-area is framed and compared."
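
def _note_bullets(note: dict[str, Any], field: str) -> list[str]:
    """Return the cleaned, non-empty strings stored under `field` (a list or a single string)."""
    val = note.get(field)
    if isinstance(val, list):
        # Clean each item once, then drop the ones that come back empty.
        cleaned = [clean_text(x, limit=220) for x in val]
        return [item for item in cleaned if item]
    if isinstance(val, str):
        cleaned_val = clean_text(val, limit=220)
        if cleaned_val:
            return [cleaned_val]
    return []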

def _specific_limitations(note: dict[str, Any]) -> list[str]:
    """Return the limitations that match no generic pattern; if all are generic, return them unfiltered."""
    items = _note_bullets(note, "limitations")
    specific = []
    for item in items:
        low = item.lower()
        if any(pat in low for pat in GENERIC_LIMITATION_PATTERNS):
            continue
        specific.append(item)
    return specific or items
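
def _evidence_confidence(notes: list[dict[str, Any]]) -> str:
    """Grade the evidence behind a set of notes as low, low-medium, medium, or medium-high."""
    if not notes:
        return "low"
    if any(str(n.get("evidence_level") or "").strip() == "fulltext" for n in notes):
        return "medium-high"
    # Count notes whose leading limitation is specific. _specific_limitations falls back
    # to generic items, so the first entry is re-checked against the generic patterns.
    specific = 0
    for n in notes:
        limitations = _specific_limitations(n)
        if limitations and not any(p in limitations[0].lower() for p in GENERIC_LIMITATION_PATTERNS):
            specific += 1
    if len(notes) >= 3 and specific >= 2:
        return "medium"
    return "low-medium"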

def _axis_profile(axis: str, cluster: str) -> dict[str, str]:
    base = AXIS_INSIGHT_LIBRARY.get(axis, {})
    confound = base.get("confound", "nearby design choices and evaluation framing")
    return {
        "question": base.get("question", f"whether {axis} is doing more conceptual work in {cluster.lower()} than current papers make explicit"),
        "confound": confound,
        "thesis": base.get("thesis", f"Current results in {cluster.lower()} may be hard to interpret because {axis} still moves together with {confound}."),
        "insight": base.get("insight", f"A meaningful result would show whether {axis} changes how we interpret the current literature rather than merely how we narrate it."),
        "contribution_shape": base.get("contribution_shape", f"Could turn {axis} into a cleaner explanatory variable for {cluster.lower()} rather than a background convenience."),
        "demotion": base.get("demotion", f"This direction would weaken if the strongest prior work already isolates {axis} cleanly."),
        "kill_signal": base.get("kill_signal", f"the strongest prior work already isolates {axis} against {confound}"),
        "missing_piece": base.get("missing_piece", f"What is missing is a cleaner comparison that varies {axis} while keeping {confound} fixed enough to change interpretation, not just presentation."),
        "program_kind": base.get("program_kind", "mechanism clarification"),
        "time_to_clarity": base.get("time_to_clarity", "medium"),
        "priority_note": base.get("priority_note", f"Worth discussion because it could turn {axis} into a sharper explanatory wedge for {cluster.lower()}."),
        "reading_extract": base.get("reading_extract", f"which variables change together with {axis}, what comparator is used, and whether any ablation already fixes {confound}"),
        "title": base.get("title", f"What {axis} is really doing"),
    }

def _result_fact(note: dict[str, Any]) -> str:
    """Condense the note's most metric-like bullet into a short result line (cleaned with limit=95)."""
    candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
    # Keep only candidates that mention a metric keyword or contain a digit.
    scored: list[str] = []
    for cand in candidates:
        low = cand.lower()
        if any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "outperform", "achiev", "%"]) or any(ch.isdigit() for ch in cand):
            scored.append(cand)
    chosen = scored[0] if scored else _claim_text(note)
    chosen = re.sub(r"^(for instance|for example|e\.g\.)[:,]?\s*", "", chosen, flags=re.IGNORECASE)
    low = chosen.lower()
    tasks = _extract_task_mentions(note)
    task_text = "/".join(tasks[:2]) if tasks else "reported setting"
    percents = re.findall(r"\d+(?:\.\d+)?%", chosen)
    if "pass@1" in low and percents:
        if len(percents) >= 2:
            return clean_text(f"{task_text}: {percents[0]} pass@1 vs {percents[1]} in the reported comparison.", limit=95)
        return clean_text(f"{task_text}: {percents[0]} pass@1 in the reported comparison.", limit=95)
    if "success rate" in low and len(percents) >= 2:
        baseline = " over imitation/RL baselines" if "imitation" in low or "reinforcement learning" in low else " in the reported comparison"
        return clean_text(f"{task_text}: {percents[0]}/{percents[1]} success-rate gains{baseline}.", limit=95)
    if len(percents) >= 2:
        return clean_text(f"{task_text}: {percents[0]} vs {percents[1]} in the reported comparison.", limit=95)
    if len(percents) == 1:
        return clean_text(f"{task_text}: reported result reaches {percents[0]} in the abstract.", limit=95)
    return clean_text(chosen, limit=95)

def _metric_phrase(notes: list[dict[str, Any]]) -> str:
    """Name the readout for a probe: the first metric found in the notes' result facts, plus failure-type shifts."""
    blob = " ".join(_result_fact(note) for note in notes)
    low = blob.lower()
    if "pass@1" in low:
        return "pass@1 plus failure-type shifts"
    if "success rate" in low:
        return "success rate plus failure-type shifts"
    if "accuracy" in low:
        return "accuracy plus failure-type shifts"
    if "exact match" in low:
        return "exact match plus failure-type shifts"
    if "f1" in low:
        return "F1 plus failure-type shifts"
    if "%" in blob:
        return "the reported benchmark metric plus failure-type shifts"
    return "task success plus failure-type shifts"

def _sentence_has_concrete_hook(text: str) -> bool:
    """Return True if the text contains a digit or names a known metric or benchmark."""
    raw = str(text or "")
    low = raw.lower()
    if any(ch.isdigit() for ch in raw):
        return True
    return any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "alfworld", "webshop", "hotpotqa", "fever", "humaneval", "webarena", "agentbench"])

def _extract_task_mentions(note: dict[str, Any]) -> list[str]:
    """Return benchmark and task names mentioned in the note, ordered by first appearance in its text."""
    blob_parts = []
    for field in ["key_results", "summary_bullets", "method", "abstract", "title"]:
        val = note.get(field)
        if isinstance(val, list):
            blob_parts.extend([str(x) for x in val])
        elif isinstance(val, str):
            blob_parts.append(val)
    blob = " ".join(blob_parts)
    low = blob.lower()
    positions: list[tuple[int, str]] = []
    # Case-insensitive lookup of the configured hints.
    for hint in BENCHMARK_HINTS:
        pos = low.find(hint.lower())
        if pos >= 0:
            positions.append((pos, hint))
    # Case-sensitive lookup of well-known benchmark names.
    for match in re.finditer(r"\b(?:HumanEval|ALFWorld|WebShop|HotpotQA|FEVER|WebArena|AgentBench|GSM8K|MATH)\b", blob):
        positions.append((match.start(), match.group(0)))
    positions.sort(key=lambda item: (item[0], item[1]))
    return uniq_keep_order(name for _, name in positions)

def _task_phrase(notes: list[dict[str, Any]]) -> str:
    """Join up to two distinct task names from the notes with "/", or fall back to a generic phrase."""
    tasks: list[str] = []
    for note in notes:
        tasks.extend(_extract_task_mentions(note))
    uniq = uniq_keep_order(tasks)
    if not uniq:
        return "a small public task slice"
    if len(uniq) == 1:
        return uniq[0]
    return "/".join(uniq[:2])

def _claim_text(note: dict[str, Any]) -> str:
    """Return the first result or summary bullet of at least 8 words, else the cleaned method or title."""
    candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
    for cand in candidates:
        if len(cand.split()) >= 8:
            return cand
    return clean_text(note.get("method") or note.get("title") or "", limit=220)

def _limitation_text(note: dict[str, Any], axis: str, cluster: str) -> str:
    """Return the note's first specific limitation, or a templated tension sentence when only generic ones exist."""
    limitations = _specific_limitations(note)
    if limitations and not any(pat in limitations[0].lower() for pat in GENERIC_LIMITATION_PATTERNS):
        return limitations[0]
    profile = _axis_profile(axis, cluster)
    return f"The paper still leaves unclear whether {axis} is genuinely responsible for the reported story in {cluster.lower()}, or whether that story is really driven by {profile['confound']}."
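
def _anchor_notes(note_index: dict[str, dict[str, Any]], paper_ids: list[str]) -> list[dict[str, Any]]:
    """Look up notes for `paper_ids`, preserving order and silently skipping ids missing from the index."""
    return [note_index[pid] for pid in paper_ids if pid in note_index]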

def _paper_annotation(note: dict[str, Any], axis: str, cluster: str) -> str:
    title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
    profile = _axis_profile(axis, cluster)
    result = _result_fact(note)
    return clean_sentence(
        f"{title}: {result} Open gap: {axis} still moves with {profile['confound']}.",
        limit=300,
    )

def _synthesis_annotation(notes: list[dict[str, Any]], axis: str, cluster: str) -> str:
    profile = _axis_profile(axis, cluster)
    task_phrase = _task_phrase(notes)
    return clean_sentence(
        f"Across the anchor papers, the live question is whether {axis} changes interpretation itself or merely rides along with {profile['confound']}, especially on {task_phrase}.",
        limit=260,
    )

def build_signal_rows(*, cluster: str, notes: list[dict[str, Any]]) -> list[IdeaSignal]:
    """Build up to three IdeaSignal rows for a cluster, one per axis, anchored on its first three notes."""
    if not notes:
        return []
    direction_type, axes, academic_value = _cluster_profile(cluster)
    evidence_confidence = _evidence_confidence(notes)
    anchors = notes[:3]
    paper_ids = uniq_keep_order(str(n.get("paper_id") or "").strip() for n in anchors)
    signals: list[IdeaSignal] = []
    for idx, axis in enumerate(axes[:3], start=1):
        # Rotate through the anchors so each axis draws its claim from a different paper where possible.
        anchor = anchors[(idx - 1) % len(anchors)]
        claim = _claim_text(anchor)
        tension = _limitation_text(anchor, axis, cluster)
        profile = _axis_profile(axis, cluster)
        missing_piece = clean_sentence(
            f"What is still missing is a direction that isolates whether {axis} is the real explanatory variable in {cluster.lower()}, rather than leaving it entangled with {profile['confound']}.",
            limit=240,
        )
        signals.append(IdeaSignal(
            signal_id=f"SIG-{slugify(cluster)[:16]}-{idx}",
            cluster=cluster,
            direction_type=direction_type,
            theme=f"{profile['title']} in {cluster}",
            claim_or_observation=claim,
            tension=clean_text(tension, limit=220),
            missing_piece=missing_piece,
            possible_axis=axis,
            academic_value=academic_value,
            evidence_confidence=evidence_confidence,
            paper_ids=paper_ids,
        ))
    return signals

def signal_table_markdown(rows: list[IdeaSignal]) -> str:
    table_rows = []
    for row in rows:
        table_rows.append([
            row.signal_id,
            row.cluster,
            row.theme,
            row.claim_or_observation,
            row.tension,
            row.missing_piece,
            row.possible_axis,
            row.academic_value,
            row.evidence_confidence,
            ", ".join(row.paper_ids),
        ])
    return "\n".join([
        "# IDEA_SIGNAL_TABLE",
        "",
        "## Research signals",
        "",
        markdown_table(["Signal ID", "Cluster", "Theme", "Claim / observation", "Tension", "Missing piece", "Possible axis", "Academic value", "Confidence", "Paper IDs"], table_rows),
        "",
    ])

def _title_from_signal(signal: IdeaSignal) -> str:
    profile = _axis_profile(signal.possible_axis, signal.cluster)
    return clean_sentence(profile["title"], limit=72)

def _best_fit(direction_type: str, axis: str) -> str:
    if direction_type == "governance":
        return "Best fit when the group wants a deployment-relevant reading of agent behavior rather than another internal benchmark story."
    if direction_type == "evaluation":
        return "Best fit when the discussion is really about what counts as convincing evidence, not just how to raise scores."
    if axis in {"search depth", "verification loop", "retrieval policy"}:
        return "Best fit for a PhD discussion that wants one sharper mechanism/confound question rather than a broad systems buildout."
    return "Best fit for PI/PhD discussions that want a thesis-worthy mechanism question rather than a generic benchmark wrapper."

def signals_to_direction_cards(signals: list[IdeaSignal], *, note_index: dict[str, dict[str, Any]], focus_clusters: list[str], pool_min: int, pool_max: int) -> list[DirectionCard]:
    """Expand each signal into a DirectionCard, order focus clusters first, and cap the pool at `pool_max`."""
    # Coerce to str before stripping so non-string entries cannot raise AttributeError.
    focus = {str(x).strip() for x in focus_clusters if str(x).strip()}
    cards: list[DirectionCard] = []
    for idx, signal in enumerate(signals, start=1):
        profile = _axis_profile(signal.possible_axis, signal.cluster)
        anchors = _anchor_notes(note_index, signal.paper_ids)
        task_phrase = _task_phrase(anchors)
        metric_phrase = _metric_phrase(anchors)
        nearest = anchors[0] if anchors else {}
        nearest_title = clean_text((nearest.get("title") if isinstance(nearest, dict) else "") or (signal.paper_ids[0] if signal.paper_ids else signal.cluster), limit=90)
        nearest_task_phrase = _task_phrase([nearest]) if nearest else task_phrase
        literature_suggests = [_paper_annotation(note, signal.possible_axis, signal.cluster) for note in anchors[:2]]
        literature_suggests.append(_synthesis_annotation(anchors, signal.possible_axis, signal.cluster))
        closest_prior_gap = [
            clean_sentence(
                f"{nearest_title} is the closest prior anchor because it already reports concrete behavior on {nearest_task_phrase}. The unresolved point is whether those gains survive once {profile['confound']} is held fixed while {signal.possible_axis} is varied.",
                limit=220,
            ),
            clean_sentence(
                "The novelty test here is narrow, not rhetorical: if a strong anchor paper already runs that single-variable control, this direction should collapse quickly rather than stay alive as a vague confound story.",
                limit=220,
            ),
        ]
        one_line_thesis = clean_sentence(profile["thesis"], limit=230)
        why_interesting = clean_sentence(
            f"This is not just another benchmark wedge. It opens a {profile['program_kind']} line around {signal.possible_axis}: {profile['question']}.",
            limit=220,
        )
        missing_piece = clean_sentence(profile["missing_piece"], limit=230)
        possible_variants = [
            clean_sentence(f"Single-variable control on {task_phrase}: vary {signal.possible_axis} while holding {profile['confound']} fixed as far as the setup allows.", limit=210),
            clean_sentence("Replace score-only reporting with a failure-type comparison after the same intervention, so the result says more than whether the average went up.", limit=210),
            clean_sentence("Check whether the conclusion survives on one simple public task slice and one more tool- or environment-heavy setting before treating it as a general claim.", limit=210),
        ]
        first_probes = [
            clean_sentence(
                f"Intervention: vary {signal.possible_axis} while holding {profile['confound']} as fixed as possible on {task_phrase}. Readout: {metric_phrase}. Decisive if the interpretation changes even after the control.",
                limit=220,
            ),
            clean_sentence(
                f"Prior-work audit: inspect {nearest_title} for any ablation that already fixes {profile['confound']}, and if the conclusion survives, demote this direction.",
                limit=180,
            ),
        ]
        what_counts_as_insight = clean_sentence(profile['insight'], limit=190)
        kill_criteria = [
            clean_sentence(f"Kill quickly if {profile['kill_signal']}.", limit=170),
            clean_sentence(f"Kill if the first controlled probe leaves both {metric_phrase} and the failure taxonomy essentially unchanged.", limit=170),
        ]
        weakness_conditions = [
            clean_sentence(f"This direction is weaker if changing {signal.possible_axis} mostly rescales the aggregate metric without changing which failure modes appear or disappear.", limit=175),
            clean_sentence(profile['demotion'], limit=170),
        ]
        anchor_reading_notes: list[dict[str, str]] = []
        for note in anchors[:3]:
            title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
            result = _result_fact(note)
            anchor_reading_notes.append({
                "paper_title": title,
                "why_read": clean_sentence(f"Closest {signal.cluster.lower()} anchor with a concrete result hook on {task_phrase}: {result}", limit=180),
                "what_to_extract": clean_sentence(f"Extract the metric and comparator, then check whether {profile['reading_extract']}.", limit=180),
                "current_hook": clean_sentence(result, limit=180),
                "kill_signal": clean_sentence(f"Weaken this direction if the paper already shows {profile['kill_signal']}.", limit=180),
            })
        why_this_ranks_here = clean_sentence(profile['priority_note'], limit=170)
        cards.append(DirectionCard(
            direction_id=f"DIR-{idx:03d}",
            cluster=signal.cluster,
            direction_type=signal.direction_type,
            title=_title_from_signal(signal),
            focus_axis=signal.possible_axis,
            main_confound=profile['confound'],
            program_kind=profile['program_kind'],
            contribution_shape=profile['contribution_shape'],
            time_to_clarity=profile['time_to_clarity'],
            one_line_thesis=one_line_thesis,
            why_interesting=why_interesting,
            literature_suggests=literature_suggests,
            closest_prior_gap=closest_prior_gap,
            missing_piece=missing_piece,
            possible_variants=possible_variants,
            academic_value=profile['contribution_shape'],
            first_probes=first_probes,
            what_counts_as_insight=what_counts_as_insight,
            weakness_conditions=weakness_conditions,
            kill_criteria=kill_criteria,
            # Copy so later mutation of one field cannot silently change the other.
            what_would_change_mind=list(kill_criteria),
            best_fit=_best_fit(signal.direction_type, signal.possible_axis),
            why_this_ranks_here=why_this_ranks_here,
            evidence_confidence=signal.evidence_confidence,
            paper_ids=signal.paper_ids,
            signal_ids=[signal.signal_id],
            anchor_reading_notes=anchor_reading_notes,
        ))
    # Direction IDs are assigned before this sort, so they reflect signal order, not final order.
    cards.sort(key=lambda c: (0 if c.cluster in focus else 1, c.cluster, c.program_kind, c.title))
    # pool_min cannot pad the pool: when fewer cards exist, the slice returns all of them.
    target = max(pool_min, min(pool_max, len(cards)))
    return cards[:target]

def direction_pool_markdown(cards: list[DirectionCard]) -> str:
    rows: list[list[str]] = []
    for card in cards:
        rows.append([
            card.direction_id,
            card.cluster,
            card.direction_type,
            card.program_kind,
            card.title,
            clean_sentence(card.one_line_thesis, limit=100),
            clean_sentence(card.why_interesting, limit=100),
            clean_sentence(card.missing_piece, limit=100),
            clean_sentence(" / ".join(card.possible_variants), limit=120),
            clean_sentence(card.academic_value, limit=90),
            clean_sentence(" / ".join(card.first_probes), limit=120),
            card.evidence_confidence,
            ", ".join(card.paper_ids),
        ])
    return "\n".join([
        "# IDEA_DIRECTION_POOL",
        "",
        "## Candidate research directions",
        "",
        markdown_table(["Direction ID", "Cluster", "Type", "Program", "Title", "One-line thesis", "Why interesting", "Missing piece", "Possible variants", "Academic value", "First probes", "Confidence", "Paper IDs"], rows),
        "",
    ])

def _thesis_potential(direction_type: str, cluster: str) -> int:
    """Score thesis potential from 3 to 5 by direction type; benchmark and evaluation directions score lowest."""
    if direction_type in {"mechanism", "coordination", "adaptation"}:
        return 5
    if direction_type in {"systems", "governance"}:
        return 4
    if "benchmark" in cluster.lower() or direction_type == "evaluation":
        return 3
    return 4

def score_direction_cards(
    cards: list[DirectionCard],
    *,
    focus_clusters: list[str],
    keep_rank_max: int,
    maybe_rank_max: int,
    score_weights: dict[str, float] | None = None,
) -> list[ScreenedDirection]:
    """Score each card on six criteria, rank by weighted total, and label each rank keep, maybe, or drop."""
    # Coerce to str before stripping so non-string entries cannot raise AttributeError.
    focus = {str(x).strip() for x in focus_clusters if str(x).strip()}
    keep_rank_max = int(keep_rank_max)
    maybe_rank_max = int(maybe_rank_max)
    if keep_rank_max <= 0:
        raise ValueError("Invalid ideation scoring contract: keep_rank_max must be positive.")
    if maybe_rank_max < keep_rank_max:
        raise ValueError("Invalid ideation scoring contract: maybe_rank_max must be >= keep_rank_max.")
    weights = _validate_score_weights(score_weights or {}, field_name="score_direction_cards.score_weights")
    cluster_counts: dict[str, int] = {}
    program_counts: dict[str, int] = {}
    for card in cards:
        cluster_counts[card.cluster] = cluster_counts.get(card.cluster, 0) + 1
        program_counts[card.program_kind] = program_counts.get(card.program_kind, 0) + 1
    provisional: list[tuple[DirectionCard, float, int, int, int, int, int, int]] = []
    for card in cards:
        discussion_worthiness = 5 if card.cluster in focus or card.time_to_clarity == "fast" else 4
        academic_value_score = 5 if any(token in card.contribution_shape.lower() for token in ["rule", "protocol", "regime map", "causal"]) else 4
        concrete_hooks = sum(1 for item in card.literature_suggests if _sentence_has_concrete_hook(item))
        if len(card.paper_ids) >= 3 and concrete_hooks >= 2:
            evidence_grounding = 5
        elif len(card.paper_ids) >= 2 and concrete_hooks >= 1:
            evidence_grounding = 4
        else:
            evidence_grounding = 3
        # A card is most distinct when both its program kind and its cluster are unique in the pool.
        unique_program = program_counts.get(card.program_kind, 0) == 1
        unique_cluster = cluster_counts.get(card.cluster, 0) == 1
        if unique_program and unique_cluster:
            direction_distinctness = 5
        elif unique_program or unique_cluster:
            direction_distinctness = 4
        else:
            direction_distinctness = 3
        probe_blob = " ".join(card.first_probes).lower()
        if all(token in probe_blob for token in ["intervention:", "readout:", "decisive if"]):
            first_probe_clarity = 5
        elif "intervention:" in probe_blob:
            first_probe_clarity = 4
        else:
            first_probe_clarity = 3
        thesis_potential = _thesis_potential(card.direction_type, card.cluster)
        total = round(
            discussion_worthiness * weights["discussion_worthiness"]
            + academic_value_score * weights["academic_value"]
            + evidence_grounding * weights["evidence_grounding"]
            + direction_distinctness * weights["direction_distinctness"]
            + first_probe_clarity * weights["first_probe_clarity"]
            + thesis_potential * weights["thesis_potential"],
            2,
        )
        provisional.append((card, total, discussion_worthiness, academic_value_score, evidence_grounding, direction_distinctness, first_probe_clarity, thesis_potential))
    # Highest total first; ties prefer fast time-to-clarity, then cluster and title for determinism.
    provisional.sort(key=lambda row: (-row[1], 0 if row[0].time_to_clarity == "fast" else 1, row[0].cluster, row[0].title))
    rows: list[ScreenedDirection] = []
    for idx, (card, total, dw, av, eg, dd, fp, tp) in enumerate(provisional, start=1):
        if idx <= keep_rank_max:
            recommendation = "keep"
        elif idx <= maybe_rank_max:
            recommendation = "maybe"
        else:
            recommendation = "drop"
        strengths: list[str] = []
        if eg >= 5:
            strengths.append("concrete anchor evidence")
        if dd >= 5:
            strengths.append(f"distinct {card.program_kind} wedge")
        if fp >= 5:
            strengths.append("clean first probe")
        if av >= 5:
            strengths.append("thesis-sized payoff")
        if not strengths:
            strengths.append("discussion value")
        risk = "still abstract-first" if str(card.evidence_confidence).startswith("low") else f"main risk is controlling {card.main_confound} cleanly"
        rationale = clean_sentence(f"Strongest on {' + '.join(strengths[:2])}; {risk}.", limit=160)
        rows.append(ScreenedDirection(
            direction_id=card.direction_id,
            cluster=card.cluster,
            direction_type=card.direction_type,
            title=card.title,
            total_score=total,
            discussion_worthiness=dw,
            academic_value_score=av,
            evidence_grounding=eg,
            direction_distinctness=dd,
            first_probe_clarity=fp,
            thesis_potential=tp,
            recommendation=recommendation,
            rationale=rationale,
        ))
    return rows

def screening_table_markdown(rows: list[ScreenedDirection]) -> str:
    data: list[list[str]] = []
    for row in rows:
        data.append([
            row.direction_id,
            row.cluster,
            row.direction_type,
            row.title,
            f"{row.total_score:.2f}",
            str(row.discussion_worthiness),
            str(row.academic_value_score),
            str(row.evidence_grounding),
            str(row.direction_distinctness),
            str(row.first_probe_clarity),
            str(row.thesis_potential),
            row.recommendation,
            row.rationale,
        ])
    return "\n".join([
        "# IDEA_SCREENING_TABLE",
        "",
        "## Discussion-first screening",
        "",
        markdown_table(["Direction ID", "Cluster", "Type", "Title", "Total", "Discussion", "Academic value", "Evidence", "Distinctness", "First probe", "Thesis potential", "Decision", "Rationale"], data),
        "",
    ])

def shortlist_markdown(records: list[dict[str, Any]]) -> str:
    """Render the shortlist records as one markdown section per direction."""
    lines = ["# IDEA_SHORTLIST", "", "## Prioritized directions", ""]
    for record in records:
        lines.extend([
            f"### Direction {record['rank']}. {record['title']}",
            f"- Cluster: {record['cluster']}",
            f"- Type: {record['direction_type']}",
            f"- Focus axis: {record.get('focus_axis')}",
            f"- Program kind: {record.get('program_kind')}",
            f"- Main confound: {record.get('main_confound')}",
            f"- Time to clarity: {record.get('time_to_clarity')}",
            f"- One-line thesis: {record['one_line_thesis']}",
            f"- Why this is interesting: {record['why_interesting']}",
            f"- Why this ranks here: {record['why_this_ranks_here']}",
            f"- Contribution shape: {record.get('contribution_shape')}",
            "- What the literature already suggests:",
        ])
        # Sub-items use a two-space indent so markdown nests them under the parent bullet.
        for item in record.get("literature_suggests") or []:
            lines.append(f"  - {item}")
        lines.extend([
            "- Closest prior work and why it does not settle the question:",
        ])
        for item in record.get("closest_prior_gap") or []:
            lines.append(f"  - {item}")
        lines.extend([
            f"- What is still missing: {record['missing_piece']}",
            "- Possible variants:",
        ])
        for item in record.get("possible_variants") or []:
            lines.append(f"  - {item}")
        lines.extend([
            f"- Why this could matter academically: {record['academic_value']}",
            "- First probes:",
        ])
        for item in record.get("first_probes") or []:
            lines.append(f"  - {item}")
        lines.extend([
            f"- What would count as actual insight: {record['what_counts_as_insight']}",
            "- What would make this weak or unconvincing:",
        ])
        for item in record.get("weakness_conditions") or []:
            lines.append(f"  - {item}")
        lines.extend([
            "- Quick kill criteria:",
        ])
        for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
            lines.append(f"  - {item}")
        lines.extend([
            f"- Best fit: {record['best_fit']}",
            f"- Evidence confidence: {record['evidence_confidence']}",
            "- Anchor papers: " + ", ".join(f"`{pid}`" for pid in record.get("paper_ids") or []),
            f"- Why prioritized now: {record['why_prioritized']}",
            "",
        ])
    return "\n".join(lines)

def shortlist_snapshot_table(records: list[dict[str, Any]]) -> str:
    rows = []
    for rec in records:
        rows.append([
            str(rec.get("rank") or ""),
            rec.get("title", ""),
            clean_sentence(rec.get("why_this_ranks_here", ""), limit=52),
            clean_sentence(rec.get("contribution_shape", rec.get("academic_value", "")), limit=56),
            clean_sentence((rec.get("kill_criteria") or rec.get("what_would_change_mind") or [""])[0], limit=54),
        ])
    return markdown_table(["Rank", "Direction", "Why now", "If it survives", "Fast kill signal"], rows)

def build_report_payload(*, topic: str, shortlist: list[dict[str, Any]], deferred: list[dict[str, Any]], trace_paths: dict[str, str]) -> dict[str, Any]:
    top = list(shortlist)
    takeaways: list[str] = []
    if top:
        takeaways.append("The strongest directions are the ones most likely to change how existing results are interpreted, not just add another benchmark win.")
        takeaways.append("The current rank order reflects a tradeoff between time-to-clarity, thesis-sized payoff, and how concrete the nearest prior-work gap already looks.")
        program_kinds = uniq_keep_order(str(rec.get("program_kind") or "").strip() for rec in top)
        if len(program_kinds) >= 2:
            takeaways.append(f"The lead set is intentionally not one confound template repeated three times: it spans {', '.join(program_kinds[:3])} rather than a single explanatory mold.")
        else:
            takeaways.append("The lead set still sits in one tight neighborhood, so the next reading pass should stress-test whether those directions are genuinely distinct thesis lines.")
        if any(str(rec.get("evidence_confidence") or "").startswith("low") for rec in top):
            takeaways.append("Most anchors are still abstract-first, so the memo is best used to decide the next reading and falsification pass rather than to lock a project immediately.")
        else:
            takeaways.append("The evidence is already concrete enough for a serious PI/PhD discussion, though one deeper paper pass could still reshuffle the exact order.")
    lead = top[0] if top else {}
    lead_anchor = (lead.get("anchor_reading_notes") or [{}])[0] if top else {}
    lead_anchor_title = lead_anchor.get("paper_title") if isinstance(lead_anchor, dict) else "the first anchor paper"
    discussion_questions = [
        "Which direction still survives once we ask for a single-variable control rather than a suggestive confound story?",
        "Which lead direction would still matter if the nearest prior work already addressed the obvious control we are worried about?",
        "Which candidate could plausibly turn into a thesis line with a reusable method, protocol, or regime map rather than a one-off empirical note?",
        "Which ranking would change most after one full-paper reading pass on the anchor set?",
    ]
    uncertainties = [
        "The main remaining risk is novelty risk, not idea scarcity: closer reading may reveal that one lead direction is already settled by an existing control or ablation.",
        "Evidence confidence is still bounded by abstract-first notes for part of the lead set, so paper-specific details could either strengthen or kill a direction quickly.",
        "The memo is more reliable as a discussion and triage artifact than as a final commitment on exact project order.",
    ]
    next_steps = []
    if top:
        next_steps.append(clean_sentence(f"Start with {lead.get('title')}: read {lead_anchor_title or 'the first anchor paper'} looking specifically for whether {lead.get('focus_axis')} is already isolated against {lead.get('main_confound')}.", limit=220))
        first_probe = (lead.get("first_probes") or [""])[0]
        if first_probe:
            next_steps.append(clean_sentence(first_probe, limit=220))
        next_steps.append(clean_sentence(f"Re-rank only after checking the quick kill criteria for {lead.get('title')} and at least one competing direction in the lead set.", limit=220))
    else:
        next_steps.extend([
            "Read the anchor papers for the current top direction in full and test whether the memo's stated hidden variable still looks unresolved.",
            "In the next PI/PhD discussion, use the ranking arguments and kill criteria to decide whether any direction deserves promotion into a sharper thesis candidate.",
        ])
    return {
        "topic": topic,
        "takeaways": takeaways,
        "top_directions": top,
        "deferred_directions": deferred,
        "discussion_questions": discussion_questions,
        "uncertainties": uncertainties,
        "next_steps": next_steps,
        "trace_artifacts": trace_paths,
    }
def report_markdown(payload: dict[str, Any]) -> str:
    top_dirs = payload.get("top_directions") or []
    deferred_idx = 3 + len(top_dirs)
    discussion_idx = deferred_idx + 1
    uncertainty_idx = deferred_idx + 2
    next_idx = deferred_idx + 3
    appendix_idx = deferred_idx + 4
    lines = [
        "# Research Idea Brainstorm Memo",
        "",
        "## 0. Scope and framing",
        "",
        f"- Topic: {payload.get('topic') or 'research ideas'}",
        "- Intended readers: PI / PhD",
        "- Goal: surface a small number of discussion-worthy research directions rather than force a final project choice.",
        "- Current evidence basis: abstract-first notes unless otherwise stated.",
        "- What this memo is not: not a final project spec, not a survey draft, and not a symmetric top-3 proposal pack.",
        "",
        "## 1. Big-picture takeaways",
        "",
    ]
    for item in payload.get("takeaways") or []:
        lines.append(f"- {item}")
    lines.extend([
        "",
        "## 2. Top directions at a glance",
        "",
        shortlist_snapshot_table(payload.get("top_directions") or []),
        "",
    ])
    for idx, record in enumerate(top_dirs, start=1):
        lines.extend([
            f"## {idx + 2}. Direction {idx} — {record.get('title')}",
            "",
            "### One-line thesis",
            f"- {record.get('one_line_thesis')}",
            "",
            "### Why it belongs in the lead set",
            f"- {record.get('why_this_ranks_here')}",
            f"- Time to clarity: {record.get('time_to_clarity')}",
            "",
            "### What the current literature actually shows",
        ])
        for item in record.get("literature_suggests") or []:
            lines.append(f"- {item}")
        lines.extend([
            "",
            "### Closest prior work and remaining gap",
        ])
        for item in record.get("closest_prior_gap") or []:
            lines.append(f"- {item}")
        lines.extend([
            f"- Missing piece: {record.get('missing_piece')}",
            "",
            "### If this direction is right, what contribution emerges",
            f"- {record.get('contribution_shape') or record.get('academic_value')}",
            f"- {record.get('what_counts_as_insight')}",
            "",
            "### Smallest decisive probe",
        ])
        for item in record.get("first_probes") or []:
            lines.append(f"- {item}")
        lines.extend([
            "",
            "### Quick kill criteria",
        ])
        for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
            lines.append(f"- {item}")
        lines.extend(["",])
    lines.extend([
        f"## {deferred_idx}. Other promising but not prioritized directions",
        "",
    ])
    deferred = payload.get("deferred_directions") or []
    if deferred:
        for record in deferred:
            lines.append(f"- **{record.get('title')}** — {record.get('one_line_thesis')} (why not prioritized now: {record.get('why_not_prioritized')})")
    else:
        lines.append("- No additional deferred directions were retained in the current memo.")
    lines.extend([
        "",
        f"## {discussion_idx}. Cross-cutting discussion questions",
        "",
    ])
    for item in payload.get("discussion_questions") or []:
        lines.append(f"- {item}")
    lines.extend([
        "",
        f"## {uncertainty_idx}. Uncertainty and disagreement",
        "",
    ])
    for item in payload.get("uncertainties") or []:
        lines.append(f"- {item}")
    lines.extend([
        "",
        f"## {next_idx}. Suggested next reading / next discussion step",
        "",
    ])
    for item in payload.get("next_steps") or []:
        lines.append(f"- {item}")
    lines.extend([
        "",
        f"## {appendix_idx}. Appendix guide",
        "",
        "- `output/APPENDIX.md` for anchor-paper reading notes and deferred directions.",
        "- `output/REPORT.json` for the structured version of this memo.",
    ])
    return "\n".join(lines)
def appendix_markdown(payload: dict[str, Any], *, core_titles: dict[str, str]) -> str:
    lines = [
        "# Appendix to the Research Idea Brainstorm Memo",
        "",
        "## A. Deferred but still promising directions",
        "",
    ]
    deferred = payload.get("deferred_directions") or []
    if deferred:
        for rec in deferred:
            lines.extend([
                f"### {rec.get('title')}",
                f"- One-line thesis: {rec.get('one_line_thesis')}",
                f"- Program kind: {rec.get('program_kind')}",
                f"- Why interesting: {rec.get('why_interesting')}",
                f"- Why not prioritized now: {rec.get('why_not_prioritized')}",
                "",
            ])
    else:
        lines.append("- (none)")
    lines.extend([
        "## B. Anchor papers and what to extract",
        "",
    ])
    for rec in payload.get("top_directions") or []:
        lines.append(f"### {rec.get('title')}")
        lines.append(f"- Program kind: {rec.get('program_kind')}")
        lines.append(f"- Lead-set reason: {rec.get('why_this_ranks_here')}")
        notes = rec.get("anchor_reading_notes") or []
        if notes:
            rows: list[list[str]] = []
            for item in notes:
                if not isinstance(item, dict):
                    continue
                rows.append([
                    item.get("paper_title", "paper"),
                    item.get("why_read", ""),
                    item.get("what_to_extract", ""),
                    item.get("current_hook", ""),
                    item.get("kill_signal", ""),
                ])
            if rows:
                lines.append(markdown_table(["Anchor paper", "Why read now", "What to extract", "Current evidence hook", "Kill signal"], rows))
            else:
                for pid in rec.get("paper_ids") or []:
                    title = core_titles.get(pid, pid)
                    lines.append(f"- {title}")
        else:
            for pid in rec.get("paper_ids") or []:
                title = core_titles.get(pid, pid)
                lines.append(f"- {title}")
        lines.append("")
    return "\n".join(lines)
FILE:tooling/pipeline_spec.py
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import Any
import yaml
@dataclass(frozen=True)
class PipelineStage:
    id: str
    title: str
    checkpoint: str
    mode: str
    required_skills: tuple[str, ...]
    optional_skills: tuple[str, ...]
    produces: tuple[str, ...]
    human_checkpoint: dict[str, Any]
    @staticmethod
    def from_mapping(stage_id: str, data: Any, *, path: Path) -> "PipelineStage":
        if not isinstance(data, dict):
            raise ValueError(f"`stages.{stage_id}` must be a mapping in {path}")
        return PipelineStage(
            id=str(stage_id or "").strip(),
            title=str(data.get("title") or stage_id).strip() or str(stage_id),
            checkpoint=str(data.get("checkpoint") or stage_id).strip() or str(stage_id),
            mode=str(data.get("mode") or "").strip(),
            required_skills=_string_tuple(data.get("required_skills"), field_name=f"stages.{stage_id}.required_skills", path=path),
            optional_skills=_string_tuple(data.get("optional_skills"), field_name=f"stages.{stage_id}.optional_skills", path=path),
            produces=_string_tuple(data.get("produces"), field_name=f"stages.{stage_id}.produces", path=path),
            human_checkpoint=_mapping(data.get("human_checkpoint"), field_name=f"stages.{stage_id}.human_checkpoint", path=path, required=False),
        )
@dataclass(frozen=True)
class PipelineSpec:
    path: Path
    name: str
    version: str
    units_template: str
    default_checkpoints: tuple[str, ...]
    profile: str
    routing_hints: tuple[str, ...]
    routing_default: bool
    routing_priority: int
    target_artifacts: tuple[str, ...]
    contract_model: str
    structure_mode: str
    pre_retrieval_shell: dict[str, Any]
    binding_layers: tuple[str, ...]
    core_chapter_h3_target: int
    query_defaults: dict[str, Any]
    overridable_query_fields: tuple[str, ...]
    quality_contract: dict[str, Any]
    loop_policy: dict[str, Any]
    stages: dict[str, PipelineStage]
    variant_of: str
    variant_overrides: dict[str, Any]
    @staticmethod
    def load(path: Path) -> "PipelineSpec":
        resolved = path.resolve()
        raw_frontmatter, frontmatter = _load_variant_aware_frontmatter(resolved)
        name = str(frontmatter.get("name") or path.stem.replace(".pipeline", ""))
        units_template = str(frontmatter.get("units_template") or "")
        version = str(frontmatter.get("version") or "")
        default_checkpoints = _string_tuple(frontmatter.get("default_checkpoints"), field_name="default_checkpoints", path=resolved)
        profile = str(frontmatter.get("profile") or "default").strip() or "default"
        routing_hints = _string_tuple(frontmatter.get("routing_hints"), field_name="routing_hints", path=resolved)
        routing_default = bool(frontmatter.get("routing_default"))
        try:
            routing_priority = int(frontmatter.get("routing_priority") or 0)
        except Exception:
            routing_priority = 0
        target_artifacts = _string_tuple(frontmatter.get("target_artifacts"), field_name="target_artifacts", path=resolved)
        contract_model = str(frontmatter.get("contract_model") or "").strip()
        structure_mode = str(frontmatter.get("structure_mode") or "").strip()
        pre_retrieval_shell = _mapping(frontmatter.get("pre_retrieval_shell"), field_name="pre_retrieval_shell", path=resolved, required=False)
        binding_layers = _string_tuple(frontmatter.get("binding_layers"), field_name="binding_layers", path=resolved)
        try:
            core_chapter_h3_target = int(frontmatter.get("core_chapter_h3_target") or 0)
        except Exception:
            core_chapter_h3_target = 0
        query_defaults = _mapping(frontmatter.get("query_defaults"), field_name="query_defaults", path=resolved, required=False)
        overridable_query_fields = _string_tuple(
            frontmatter.get("overridable_query_fields"),
            field_name="overridable_query_fields",
            path=resolved,
        )
        quality_contract = _mapping(frontmatter.get("quality_contract"), field_name="quality_contract", path=resolved, required=False)
        loop_policy = _mapping(frontmatter.get("loop_policy"), field_name="loop_policy", path=resolved, required=False)
        stages = _parse_stages(frontmatter.get("stages"), path=resolved)
        variant_of = str(raw_frontmatter.get("variant_of") or "").strip()
        variant_overrides = _mapping(raw_frontmatter.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False)
        if not units_template:
            raise ValueError(f"Missing units_template in pipeline front matter: {resolved}")
        return PipelineSpec(
            path=resolved,
            name=name,
            version=version,
            units_template=units_template,
            default_checkpoints=default_checkpoints,
            profile=profile,
            routing_hints=routing_hints,
            routing_default=routing_default,
            routing_priority=routing_priority,
            target_artifacts=target_artifacts,
            contract_model=contract_model,
            structure_mode=structure_mode,
            pre_retrieval_shell=pre_retrieval_shell,
            binding_layers=binding_layers,
            core_chapter_h3_target=core_chapter_h3_target,
            query_defaults=query_defaults,
            overridable_query_fields=overridable_query_fields,
            quality_contract=quality_contract,
            loop_policy=loop_policy,
            stages=stages,
            variant_of=variant_of,
            variant_overrides=variant_overrides,
        )
    def query_default(self, key: str, default: Any = None) -> Any:
        return self.query_defaults.get(str(key or "").strip(), default)
    def allows_query_override(self, key: str) -> bool:
        return str(key or "").strip() in set(self.overridable_query_fields)
def _parse_frontmatter(text: str) -> dict[str, Any]:
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("Pipeline file must start with YAML front matter '---'")
    end_idx = None
    for idx in range(1, len(lines)):
        if lines[idx].strip() == "---":
            end_idx = idx
            break
    if end_idx is None:
        raise ValueError("Unterminated YAML front matter (missing closing '---')")
    raw = "\n".join(lines[1:end_idx])
    data = yaml.safe_load(raw) or {}
    if not isinstance(data, dict):
        raise ValueError("Pipeline YAML front matter must be a mapping")
    return data
def _load_variant_aware_frontmatter(path: Path, seen: set[Path] | None = None) -> tuple[dict[str, Any], dict[str, Any]]:
    resolved = path.resolve()
    active = set(seen or set())
    if resolved in active:
        chain = " -> ".join(str(p) for p in [*active, resolved])
        raise ValueError(f"Cyclic `variant_of` chain detected: {chain}")
    active.add(resolved)
    text = resolved.read_text(encoding="utf-8")
    raw = _parse_frontmatter(text)
    variant_of = str(raw.get("variant_of") or "").strip()
    if not variant_of:
        return raw, dict(raw)
    _validate_raw_variant_frontmatter(raw, path=resolved)
    base_path = _resolve_pipeline_reference(resolved, variant_of)
    _, base_effective = _load_variant_aware_frontmatter(base_path, seen=active)
    current = {key: raw[key] for key in ("name", "version") if key in raw}
    merged = _deep_merge(base_effective, current)
    merged = _deep_merge(
        merged,
        _mapping(raw.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False),
    )
    merged["variant_of"] = variant_of
    merged["variant_overrides"] = raw.get("variant_overrides") or {}
    return raw, merged
def _validate_raw_variant_frontmatter(raw: dict[str, Any], *, path: Path) -> None:
    allowed_raw_keys = {"name", "version", "variant_of", "variant_overrides"}
    extra_keys = sorted(str(key) for key in raw.keys() if str(key) not in allowed_raw_keys)
    if extra_keys:
        raise ValueError(
            "Variant pipeline files may only keep top-level keys "
            "`name`, `version`, `variant_of`, `variant_overrides`; "
            f"move the rest under `variant_overrides` in {path}: {', '.join(extra_keys)}"
        )
def _resolve_pipeline_reference(path: Path, ref: str) -> Path:
    value = str(ref or "").strip()
    if not value:
        raise ValueError(f"Empty `variant_of` reference in {path}")
    candidate = Path(value)
    if candidate.is_absolute() and candidate.exists():
        return candidate.resolve()
    repo_root = path.parent.parent
    for base in (path.parent, repo_root, repo_root / "pipelines"):
        direct = (base / value).resolve()
        if direct.exists():
            return direct
    stem = Path(value).name
    if stem.endswith(".pipeline.md"):
        stem = stem[: -len(".pipeline.md")]
    if stem:
        for base in (path.parent, repo_root / "pipelines"):
            direct = (base / f"{stem}.pipeline.md").resolve()
            if direct.exists():
                return direct
    raise ValueError(f"Could not resolve `variant_of: {value}` from {path}")
def _deep_merge(base: Any, override: Any) -> Any:
    if isinstance(base, list) and isinstance(override, dict) and any(str(key).startswith("__") for key in override.keys()):
        return _apply_list_patch(base, override)
    if isinstance(base, dict) and isinstance(override, dict):
        merged: dict[str, Any] = {str(k): v for k, v in base.items()}
        for key, value in override.items():
            if key in merged:
                merged[key] = _deep_merge(merged[key], value)
            else:
                merged[key] = value
        return merged
    return override
def _apply_list_patch(base: list[Any], override: dict[str, Any]) -> list[Any]:
    allowed_keys = {"__append__", "__prepend__", "__remove__", "__replace__"}
    bad_keys = sorted(str(key) for key in override.keys() if str(key) not in allowed_keys)
    if bad_keys:
        raise ValueError(
            "List patch overrides only support "
            "`__append__`, `__prepend__`, `__remove__`, `__replace__`; "
            f"got: {', '.join(bad_keys)}"
        )
    if "__replace__" in override:
        replacement = override.get("__replace__")
        if not isinstance(replacement, list):
            raise ValueError("List patch `__replace__` must be a YAML list")
        return list(replacement)
    current = list(base)
    remove_values = override.get("__remove__", [])
    prepend_values = override.get("__prepend__", [])
    append_values = override.get("__append__", [])
    for field_name, value in (
        ("__remove__", remove_values),
        ("__prepend__", prepend_values),
        ("__append__", append_values),
    ):
        if not isinstance(value, list):
            raise ValueError(f"List patch `{field_name}` must be a YAML list")
    if remove_values:
        current = [item for item in current if item not in remove_values]
    if prepend_values:
        current = list(prepend_values) + current
    if append_values:
        current = current + list(append_values)
    return current
def _mapping(value: Any, *, field_name: str, path: Path, required: bool) -> dict[str, Any]:
    if value is None:
        return {}
    if not isinstance(value, dict):
        raise ValueError(f"`{field_name}` must be a mapping in {path}")
    return {str(key): item for key, item in value.items()}
def _string_tuple(value: Any, *, field_name: str, path: Path) -> tuple[str, ...]:
    if value is None:
        return ()
    if not isinstance(value, list):
        raise ValueError(f"`{field_name}` must be a YAML list in {path}")
    out: list[str] = []
    for item in value:
        text = str(item or "").strip()
        if text:
            out.append(text)
    return tuple(out)
def _parse_stages(value: Any, *, path: Path) -> dict[str, PipelineStage]:
    if value is None:
        return {}
    items: list[tuple[str, Any]] = []
    if isinstance(value, dict):
        items = [(str(stage_id or "").strip(), data) for stage_id, data in value.items()]
    elif isinstance(value, list):
        for idx, item in enumerate(value):
            if not isinstance(item, dict):
                raise ValueError(f"`stages[{idx}]` must be a mapping in {path}")
            stage_id = str(item.get("id") or item.get("stage") or "").strip()
            if not stage_id:
                raise ValueError(f"`stages[{idx}]` is missing `id` in {path}")
            items.append((stage_id, item))
    else:
        raise ValueError(f"`stages` must be a mapping or list in {path}")
    stages: dict[str, PipelineStage] = {}
    for stage_id, data in items:
        stage = PipelineStage.from_mapping(stage_id, data, path=path)
        if not stage.id:
            raise ValueError(f"`stages` contains an empty stage id in {path}")
        stages[stage.id] = stage
    return stages
FILE:tooling/pipeline_text.py
from __future__ import annotations
import json
import re
from pathlib import Path
from typing import Any
from tooling.common import load_yaml, read_jsonl
def slug_unit_id(unit_id: str) -> str:
    raw = str(unit_id or '').strip()
    out: list[str] = []
    for ch in raw:
        out.append(ch if ch.isalnum() else '_')
    safe = ''.join(out).strip('_')
    return f'S{safe}' if safe else 'S'
def load_outline_sections(path: Path) -> list[dict[str, Any]]:
    outline = load_yaml(path) if path.exists() else []
    if not isinstance(outline, list):
        return []
    out: list[dict[str, Any]] = []
    for sec in outline:
        if not isinstance(sec, dict):
            continue
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        subsections: list[dict[str, str]] = []
        for sub in sec.get('subsections') or []:
            if not isinstance(sub, dict):
                continue
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                subsections.append({'id': sub_id, 'title': sub_title})
        if sec_id and sec_title:
            out.append({'id': sec_id, 'title': sec_title, 'subsections': subsections})
    return out
def iter_h3_units(path: Path) -> list[dict[str, str]]:
    units: list[dict[str, str]] = []
    for sec in load_outline_sections(path):
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        for sub in sec.get('subsections') or []:
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                units.append(
                    {
                        'section_id': sec_id,
                        'section_title': sec_title,
                        'sub_id': sub_id,
                        'title': sub_title,
                    }
                )
    return units
def read_jsonl_map(path: Path, key: str) -> dict[str, dict[str, Any]]:
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        val = str(rec.get(key) or '').strip()
        if val:
            out[val] = rec
    return out
def read_bib_keys(path: Path) -> set[str]:
    if not path.exists() or path.stat().st_size <= 0:
        return set()
    text = path.read_text(encoding='utf-8', errors='ignore')
    return set(re.findall(r'(?im)^@\w+\s*\{\s*([^,\s]+)\s*,', text))
def uniq_keep_order(items: list[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or '').strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out
def clean_excerpt(text: str, *, limit: int = 220) -> str:
    s = str(text or '').strip()
    s = s.replace('\n', ' ')
    s = re.sub(r'\s+', ' ', s)
    s = s.strip(' "\'`')
    s = s.replace('|', ', ')
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(' ', 1)[0].strip()
    return clipped if clipped else s[:limit].strip()
def citation_list(keys: list[str], *, max_keys: int = 3) -> str:
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'[@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'
def inline_evidence_phrase(keys: list[str], *, max_keys: int = 3) -> str:
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'in [@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'
def heading_blocks(md: str) -> list[tuple[str, str]]:
    blocks: list[tuple[str, str]] = []
    current = ''
    lines: list[str] = []
    for raw in (md or '').splitlines():
        if raw.startswith('### '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = raw[4:].strip()
            lines = []
            continue
        if raw.startswith('## '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = ''
            lines = []
            continue
        if current:
            lines.append(raw)
    if current:
        blocks.append((current, '\n'.join(lines).strip()))
    return blocks
def dump_jsonl_lines(records: list[dict[str, Any]]) -> str:
    return '\n'.join(json.dumps(r, ensure_ascii=False) for r in records).rstrip() + ('\n' if records else '')
Audit the workspace against the pipeline artifact contract (DONE outputs + pipeline target_artifacts). Writes `output/CONTRACT_REPORT.md`. **Trigger**: contr...
---
name: artifact-contract-auditor
description: 'Audit the workspace against the pipeline artifact contract (DONE outputs + pipeline target_artifacts).
Writes `output/CONTRACT_REPORT.md`.
**Trigger**: contract audit, artifact contract, missing artifacts, target_artifacts, CONTRACT_REPORT.
**Use when**: you want an auditable PASS/FAIL view of whether a workspace is complete and self-contained (end of run or
before sharing).
**Skip if**: you are still intentionally mid-run and don’t care about completeness yet (but it’s still useful as a snapshot).
**Network**: none.
**Guardrail**: analysis-only; do not edit content artifacts; only write the report.'
version: 0.1.0
metadata:
  openclaw:
    requires:
      anyBins:
        - python3
        - python
---
# Artifact Contract Auditor
Purpose: make each workspace auditable and shareable.
This skill checks two contracts:
1) Units contract: if a unit is marked `DONE`, its required outputs must exist.
2) Pipeline contract: the pipeline’s `target_artifacts` (from the pipeline spec referenced by `PIPELINE.lock.md`) should exist for a complete run.
It always writes a report so workspaces can serve as regression baselines.
## Inputs
- `UNITS.csv`
- `PIPELINE.lock.md`
- Pipeline spec referenced by `PIPELINE.lock.md` (under `pipelines/*.pipeline.md`; reads YAML `target_artifacts`)
## Outputs
- `output/CONTRACT_REPORT.md`
## Workflow (analysis-only)
1) Read `UNITS.csv` and validate DONE outputs
- For every unit with `status=DONE`, verify each **required** output exists.
- Outputs prefixed with `?` are treated as optional and do not fail the contract.
2) Read `PIPELINE.lock.md` and validate pipeline target artifacts
- Resolve the pipeline spec under `pipelines/*.pipeline.md` and load `target_artifacts` from its YAML front matter.
- If the pipeline is complete (all units are `DONE/SKIP`), verify each **required** `target_artifacts` file exists.
3) Write `output/CONTRACT_REPORT.md` (always)
- Include missing DONE outputs (unit-level drift) and missing pipeline targets (pipeline-level completeness drift).
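
Step 1 of the workflow can be sketched roughly as follows. This is a minimal illustration, not the bundled script: the column names `unit_id`, `status`, and `outputs`, the semicolon separator inside `outputs`, and the helper name `missing_done_outputs` are all assumptions made for the example.

```python
import csv
import io
from pathlib import Path


def missing_done_outputs(units_csv_text: str, workspace: Path) -> dict[str, list[str]]:
    """Map each DONE unit id to its required outputs that are absent on disk.

    Assumed (hypothetical) schema: columns `unit_id`, `status`, `outputs`,
    with `outputs` holding semicolon-separated workspace-relative paths.
    """
    missing: dict[str, list[str]] = {}
    for row in csv.DictReader(io.StringIO(units_csv_text)):
        if (row.get("status") or "").strip().upper() != "DONE":
            continue
        unit_id = (row.get("unit_id") or "").strip()
        for raw in (row.get("outputs") or "").split(";"):
            rel = raw.strip()
            # A leading `?` marks an optional output, which never fails the contract.
            if not rel or rel.startswith("?"):
                continue
            if not (workspace / rel).exists():
                missing.setdefault(unit_id, []).append(rel)
    return missing
```

An empty result means no unit-level drift; any entry lists the required files to regenerate for that unit (or a reason to revert its status).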
## Status semantics
- `PASS`: pipeline complete (all units `DONE/SKIP`) AND all required target artifacts exist AND no DONE unit is missing required outputs.
- `OK`: pipeline incomplete (still running) BUT DONE unit outputs are consistent; missing targets are expected.
- `FAIL`: at least one DONE unit is missing required outputs OR pipeline is complete but required target artifacts are missing.
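
The three statuses above reduce to a small decision function. The sketch below is one way to encode them; the function name `contract_status` and its parameters are illustrative assumptions, not names taken from the bundled script.

```python
def contract_status(*, pipeline_complete: bool, missing_done_outputs: int, missing_targets: int) -> str:
    """Return PASS, OK, or FAIL following the status semantics listed above."""
    # Any DONE unit with a missing required output fails, complete or not.
    if missing_done_outputs > 0:
        return "FAIL"
    if pipeline_complete:
        # A complete pipeline must also have every required target artifact.
        return "PASS" if missing_targets == 0 else "FAIL"
    # Pipeline still running: missing targets are expected and do not fail the audit.
    return "OK"
```

Note that `OK` is only reachable mid-run: once every unit is `DONE/SKIP`, the result is either `PASS` or `FAIL`.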
## How to use this report (self-loop routing)
- If DONE outputs are missing: fix the contract drift (regenerate the missing artifacts, or revert the unit status to TODO/BLOCKED).
- If the pipeline is complete but target artifacts are missing: find which unit/skill owns each missing artifact and rerun that unit.
## Script
### Quick Start
- `python scripts/run.py --workspace workspaces/<ws>`
### All Options
- `--workspace <dir>`
- `--unit-id <U###>` (optional)
- `--inputs <semicolon-separated>` (unused; runner compatibility)
- `--outputs <semicolon-separated>` (unused; runner compatibility)
- `--checkpoint <C#>` (optional)
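
For orientation, the options above could be declared with `argparse` along these lines. This is a hedged sketch only: whether `scripts/run.py` actually uses `argparse`, whether `--workspace` is required, and the empty-string defaults are assumptions; `workspaces/demo` is a made-up path.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags; requiredness and defaults are assumptions."""
    parser = argparse.ArgumentParser(description="Audit a workspace against the artifact contract.")
    parser.add_argument("--workspace", required=True, help="workspace directory")
    parser.add_argument("--unit-id", default="", help="optional unit id")
    # Accepted only so a generic runner can pass them; the audit itself ignores both.
    parser.add_argument("--inputs", default="", help="semicolon-separated; unused")
    parser.add_argument("--outputs", default="", help="semicolon-separated; unused")
    parser.add_argument("--checkpoint", default="", help="optional checkpoint id")
    return parser


# Hypothetical invocation with an illustrative workspace path.
args = build_parser().parse_args(["--workspace", "workspaces/demo", "--checkpoint", "C5"])
```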
### Examples
- End-of-run audit (recommended before sharing a workspace):
- `python scripts/run.py --workspace workspaces/<ws>`
FILE:AGENTS.md
# AGENTS.md
This exported skill uses `AGENTS.md` only as a local repo-root marker for bundled helper scripts.
Source skill: `artifact-contract-auditor` from `research-units-pipeline-skills`.
Export slug: `artifact-contract-auditor`.
FILE:pipelines/arxiv-survey-latex.pipeline.md
---
name: arxiv-survey-latex
version: 3.8
variant_of: arxiv-survey
variant_overrides:
  routing_hints: [latex, pdf, tex, 可编译, 编译]
  routing_default: false
  routing_priority: 20
  units_template: templates/UNITS.arxiv-survey-latex.csv
  target_artifacts:
    __append__:
      - latex/main.tex
      - latex/main.pdf
      - output/LATEX_BUILD_REPORT.md
  stages:
    C5:
      title: Draft + PDF
      required_skills:
        __append__:
          - latex-scaffold
          - latex-compile-qa
      optional_skills:
        __remove__:
          - latex-scaffold
          - latex-compile-qa
      produces:
        __append__:
          - latex/main.tex
          - latex/main.pdf
          - output/LATEX_BUILD_REPORT.md
---
# Pipeline: arXiv survey / review (MD-first + LaTeX/PDF)
Variant of `arxiv-survey`.
Use this pipeline only when the default deliverable must include:
- `latex/main.tex`
- `latex/main.pdf`
- `output/LATEX_BUILD_REPORT.md`
All non-PDF survey behavior, defaults, checkpoints, and earlier stages inherit from `arxiv-survey`.
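
How a variant's `variant_overrides` combine with the base front matter can be illustrated with a simplified, self-contained version of the merge logic in `tooling/pipeline_spec.py`. The sketch omits that module's validation (unknown patch keys, non-list values), and the `base` and `overrides` dictionaries are heavily abbreviated stand-ins for the real front matter.

```python
from typing import Any


def apply_list_patch(base: list[Any], patch: dict[str, Any]) -> list[Any]:
    """Apply `__replace__`, or else `__remove__`, then `__prepend__`, then `__append__`."""
    if "__replace__" in patch:
        return list(patch["__replace__"])
    removed = patch.get("__remove__", [])
    current = [item for item in base if item not in removed]
    return list(patch.get("__prepend__", [])) + current + list(patch.get("__append__", []))


def deep_merge(base: Any, override: Any) -> Any:
    """Merge mappings key by key; a dict of `__*__` keys patches a base list."""
    if isinstance(base, list) and isinstance(override, dict) and any(str(k).startswith("__") for k in override):
        return apply_list_patch(base, override)
    if isinstance(base, dict) and isinstance(override, dict):
        merged = dict(base)
        for key, value in override.items():
            merged[key] = deep_merge(merged[key], value) if key in merged else value
        return merged
    # Scalars and unpatched lists are replaced wholesale by the override.
    return override


# Abbreviated stand-ins for the base pipeline and the LaTeX variant's overrides.
base = {
    "target_artifacts": ["output/DRAFT.md"],
    "stages": {"C5": {"title": "Draft", "optional_skills": ["prose-writer", "latex-scaffold"]}},
}
overrides = {
    "target_artifacts": {"__append__": ["latex/main.pdf"]},
    "stages": {"C5": {"title": "Draft + PDF", "optional_skills": {"__remove__": ["latex-scaffold"]}}},
}
effective = deep_merge(base, overrides)
```

The patch keys let a variant extend or trim an inherited list without restating it, which is why the LaTeX variant only names the three artifacts and two skills it changes.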
FILE:pipelines/arxiv-survey.pipeline.md
---
name: arxiv-survey
version: 3.8
profile: arxiv-survey
routing_hints: [survey, review, 综述, 调研, literature review]
routing_default: true
routing_priority: 10
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
- outline/transitions.md
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/claim_evidence_matrix.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- outline/writer_context_packs.jsonl
- citations/ref.bib
- citations/verified.jsonl
- sections/sections_manifest.jsonl
- sections/h3_bodies.refined.ok
- sections/paragraphs_curated.refined.ok
- sections/style_harmonized.refined.ok
- sections/opener_varied.refined.ok
- sections/abstract.md
- sections/S1.md
- sections/S2.md
- sections/discussion.md
- sections/conclusion.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/SCHEMA_NORMALIZATION_REPORT.md
- output/EVIDENCE_SELFLOOP_TODO.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/PARAGRAPH_CURATION_REPORT.md
- output/FRONT_MATTER_REPORT.md
- output/CHAPTER_LEADS_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/GLOBAL_REVIEW.md
- output/DRAFT.md
- output/MERGE_REPORT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.arxiv-survey.csv
contract_model: pipeline.frontmatter/v1
structure_mode: section_first
pre_retrieval_shell:
  enabled: true
  approval_surface: false
  allowed_h2: [Introduction, Related Work, Core Chapters, Discussion, Conclusion]
binding_layers: [chapter_skeleton, section_bindings, section_briefs, subsection_mapping]
core_chapter_h3_target: 3
query_defaults:
  max_results: 1800
  core_size: 300
  per_subsection: 28
  global_citation_min_subsections: 4
  draft_profile: survey
  citation_target: recommended
  evidence_mode: abstract
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- per_subsection
- global_citation_min_subsections
- draft_profile
- citation_target
- enrich_metadata
- evidence_mode
- fulltext_max_papers
- fulltext_max_pages
- fulltext_min_chars
- time_window.from
- time_window.to
quality_contract:
  citation_policy:
    unique_hard_floor: 150
    unique_recommended: 165
  structure_policy:
    max_final_h2_by_profile:
      survey: 8
      deep: 9
    max_h3_by_profile:
      survey: 10
      deep: 12
  front_matter_policy:
    survey:
      introduction:
        min_cites: 35
        min_paras: 5
        min_chars: 2600
      related_work:
        min_cites: 50
        min_paras: 6
        min_chars: 3200
    deep:
      introduction:
        min_cites: 40
        min_paras: 6
        min_chars: 3000
      related_work:
        min_cites: 55
        min_paras: 7
        min_chars: 3600
  subsection_policy:
    survey:
      min_unique_citations: 12
      min_chars: 4200
    deep:
      min_unique_citations: 14
      min_chars: 5200
loop_policy:
  stage_retry_budget:
    C1: 2
    C2: 2
    C3: 1
    C4: 1
  max_reroutes: 4
  require_human_on_retry_after_approval: true
stages:
  C0:
    title: Init
    mode: no_prose
    required_skills: [workspace-init, pipeline-router]
    optional_skills: []
    produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
  C1:
    title: Retrieval & core set
    mode: no_prose
    required_skills: [literature-engineer, dedupe-rank]
    optional_skills: [keyword-expansion, survey-seed-harvest]
    produces: [papers/papers_raw.jsonl, papers/retrieval_report.md, papers/papers_dedup.jsonl, papers/core_set.csv]
  C2:
    title: Structure
    mode: no_prose
    required_skills: [taxonomy-builder, chapter-skeleton, section-bindings, section-briefs, outline-builder, section-mapper, outline-refiner, pipeline-router, human-checkpoint]
    optional_skills: [outline-budgeter]
    produces: [outline/taxonomy.yml, outline/chapter_skeleton.yml, outline/section_bindings.jsonl, outline/section_binding_report.md, outline/section_briefs.jsonl, outline/outline.yml, outline/mapping.tsv, outline/coverage_report.md, outline/outline_state.jsonl, output/REROUTE_STATE.json, DECISIONS.md]
    human_checkpoint:
      approve: scope + section skeleton + outline
      write_to: DECISIONS.md
  C3:
    title: Evidence
    mode: no_prose
    required_skills: [pdf-text-extractor, paper-notes, subsection-briefs, chapter-briefs]
    optional_skills: []
    produces: [papers/fulltext_index.jsonl, papers/paper_notes.jsonl, papers/evidence_bank.jsonl, outline/subsection_briefs.jsonl, outline/chapter_briefs.jsonl]
  C4:
    title: Citations + evidence packs
    mode: no_prose
    required_skills: [citation-verifier, evidence-binder, evidence-draft, table-schema, anchor-sheet, table-filler, appendix-table-writer, schema-normalizer, writer-context-pack, evidence-selfloop, claim-matrix-rewriter]
    optional_skills: [survey-visuals]
    produces: [citations/ref.bib, citations/verified.jsonl, outline/evidence_bindings.jsonl, outline/evidence_binding_report.md, outline/table_schema.md, outline/tables_index.md, outline/tables_appendix.md, output/TABLES_APPENDIX_REPORT.md, outline/evidence_drafts.jsonl, outline/anchor_sheet.jsonl, output/SCHEMA_NORMALIZATION_REPORT.md, outline/writer_context_packs.jsonl, output/EVIDENCE_SELFLOOP_TODO.md, outline/claim_evidence_matrix.md]
  C5:
    title: Draft
    mode: prose_allowed
    required_skills: [front-matter-writer, chapter-lead-writer, subsection-writer, writer-selfloop, section-logic-polisher, argument-selfloop, paragraph-curator, style-harmonizer, opener-variator, evaluation-anchor-checker, transition-weaver, section-merger, post-merge-voice-gate, citation-diversifier, citation-injector, draft-polisher, global-reviewer, pipeline-auditor, artifact-contract-auditor]
    optional_skills: [prose-writer, subsection-polisher, redundancy-pruner, terminology-normalizer, limitation-weaver, latex-scaffold, latex-compile-qa]
    produces: [outline/transitions.md, sections/sections_manifest.jsonl, sections/h3_bodies.refined.ok, sections/paragraphs_curated.refined.ok, sections/style_harmonized.refined.ok, sections/opener_varied.refined.ok, sections/abstract.md, sections/S1.md, sections/S2.md, sections/discussion.md, sections/conclusion.md, output/WRITER_SELFLOOP_TODO.md, output/EVAL_ANCHOR_REPORT.md, output/ARGUMENT_SELFLOOP_TODO.md, output/SECTION_ARGUMENT_SUMMARIES.jsonl, output/ARGUMENT_SKELETON.md, output/PARAGRAPH_CURATION_REPORT.md, output/FRONT_MATTER_REPORT.md, output/CHAPTER_LEADS_REPORT.md, output/SECTION_LOGIC_REPORT.md, output/MERGE_REPORT.md, output/DRAFT.md, output/POST_MERGE_VOICE_REPORT.md, output/CITATION_BUDGET_REPORT.md, output/CITATION_INJECTION_REPORT.md, output/GLOBAL_REVIEW.md, output/AUDIT_REPORT.md, output/CONTRACT_REPORT.md]
---
# Pipeline: arXiv survey / review (MD-first)
Default contract (survey-grade, A150++):
- `queries.md` defaults are set for a *survey deliverable* (no silent downgrade): `core_size=300`, `per_subsection=28`, global unique citations hard floor `>=150` (recommended `>=165` when `core_size=300`; default `citation_target=recommended`).
- `draft_profile` controls **writing strictness** (`survey` vs `deep`), not “speed mode”.
- `evidence_mode` controls **evidence strength** (`abstract` default; `fulltext` optional and heavier).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- survey-seed-harvest
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- `queries.md` may specify `max_results` and a year `time window`; `arxiv-search` will paginate and attach arXiv metadata (categories, arxiv_id, etc.) when online.
- If you import an offline export but later have network, you can set `enrich_metadata: true` in `queries.md` (or run `arxiv-search --enrich-metadata`) to backfill missing abstracts/authors/categories via arXiv `id_list`.
- Evidence-first expectation (A150++): aim for a large dedup pool (target >=1200, not ~200) and a stable, verifiable core set (`core_size=300`) so later stages can bind wide in-scope citation pools without forcing out-of-scope drift.
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- chapter-skeleton
- section-bindings
- section-briefs
- outline-builder
- section-mapper
- outline-refiner
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
human_checkpoint:
- approve: scope + section skeleton + outline
- write_to: DECISIONS.md
Notes:
- `chapter-skeleton` is the first retrieval-informed chapter contract; it stays chapter-level and does not emit stable H3 ids.
- `section-bindings` measures chapter saturation before H3 decomposition and writes PASS/BLOCKED signals into `outline/section_binding_report.md`.
- `section-briefs` turns the chapter layer into decomposition guidance plus subsection seeds; `outline-builder` then derives the first stable `outline/outline.yml` from that section layer.
- `outline-refiner` now writes both `outline/outline_state.jsonl` and `output/REROUTE_STATE.json`; section-first reroute/block information should be read from those artifacts instead of inferred from prose notes.
- Evidence-first expectation: each subsection should be written as a *question to answer* (RQ) plus *evidence needs* (what kind of citations/results are required), not just generic scaffold bullets.
- Coverage default: `section-mapper` uses `queries.md:per_subsection` as the per-H3 mapping contract (A150++ default: 28) so later evidence binding and writing have enough in-scope citations to choose from.
- Diversity expectation: mapping should not over-reuse a few papers across unrelated H3s; reserve “global” works for genuinely cross-cutting citations (controlled by `global_citation_min_subsections`).
- Budget policy (paper-like): avoid H3 explosion; the outline gate uses `queries.md:draft_profile` to set max H3 (survey<=10, deep<=12).
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent H3s into fewer, thicker units, then rerun `section-mapper` → `outline-refiner` before `Approve C2`.
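The diversity rule in the notes above (bibkeys mapped to many H3s are treated as cross-cutting) can be sketched as follows. This is a minimal illustration, assuming the mapping has already been loaded into a dict of H3 id to bibkeys; the function name `split_global_keys`, the in-memory shape, and the sample keys are hypothetical and are not the actual `section-mapper` implementation. The default of 4 mirrors the A150++ default for `global_citation_min_subsections`.

```python
from collections import defaultdict


def split_global_keys(mapping, min_subsections=4):
    """Split bibkeys into cross-cutting ("global") and local sets.

    mapping: dict of H3 id -> iterable of bibkeys (hypothetical in-memory shape).
    A key counts as global when it is mapped to >= min_subsections distinct H3s.
    """
    seen_in = defaultdict(set)
    for h3_id, keys in mapping.items():
        for key in keys:
            seen_in[key].add(h3_id)
    global_keys = {k for k, subs in seen_in.items() if len(subs) >= min_subsections}
    local_keys = set(seen_in) - global_keys
    return global_keys, local_keys


# Made-up keys purely for illustration.
mapping = {
    "3.1": ["smith2023", "lee2024"],
    "3.2": ["smith2023", "chen2022"],
    "4.1": ["smith2023"],
    "4.2": ["smith2023", "lee2024"],
}
global_keys, local_keys = split_global_keys(mapping)
print(sorted(global_keys), sorted(local_keys))
```

Keys that land in the local set but still appear across unrelated H3s are the ones the diversity expectation asks you to review.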
## Stage 3 - Evidence (C3) [NO PROSE]
required_skills:
- pdf-text-extractor
- paper-notes
- subsection-briefs
- chapter-briefs
produces:
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
Notes:
- `queries.md` can set `evidence_mode: "abstract"|"fulltext"` (A150++ default: `abstract`).
- `queries.md` can set `draft_profile: "survey"|"deep"` to control writing gate strictness (A150++ default: `survey`).
- If `evidence_mode: "fulltext"`, `pdf-text-extractor` can be tuned via `fulltext_max_papers`, `fulltext_max_pages`, `fulltext_min_chars`.
- `subsection-briefs` converts each H3 into a verifiable writing card (scope_rule/rq/axes/clusters/paragraph_plan) so writing does not copy outline scaffolds.
- Optional refinement markers (recommended): treat briefs as *contracts*, not scaffolds. If you manually refine them and want to prevent regeneration, create:
- `outline/subsection_briefs.refined.ok`
- `outline/chapter_briefs.refined.ok`
These markers are used as explicit “reviewed/refined” signals and as a freeze switch (scripts won’t overwrite refined briefs).
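The freeze-switch behaviour of the `*.refined.ok` markers can be sketched like this. It is a minimal illustration under the assumption that the marker sits next to the artifact and shares its stem (as in `outline/subsection_briefs.jsonl` and `outline/subsection_briefs.refined.ok`); the helper name `can_regenerate` is hypothetical and is not taken from the pipeline scripts.

```python
import tempfile
from pathlib import Path


def can_regenerate(artifact: Path) -> bool:
    """Return False when a sibling `<stem>.refined.ok` marker freezes the artifact."""
    marker = artifact.with_name(artifact.stem + ".refined.ok")
    return not marker.exists()


# Demonstration in a throwaway directory so no real workspace is touched.
with tempfile.TemporaryDirectory() as tmp:
    outline = Path(tmp) / "outline"
    outline.mkdir()
    briefs = outline / "subsection_briefs.jsonl"
    briefs.write_text("{}\n", encoding="utf-8")
    before = can_regenerate(briefs)  # no marker yet
    (outline / "subsection_briefs.refined.ok").touch()
    after = can_regenerate(briefs)  # marker present, artifact is frozen

print(before, after)
```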
## Stage 4 - Citations + evidence packs (C4) [NO PROSE]
required_skills:
- citation-verifier
- evidence-binder
- evidence-draft
- table-schema
- anchor-sheet
- table-filler
- appendix-table-writer
- schema-normalizer
- writer-context-pack
- evidence-selfloop
- claim-matrix-rewriter
optional_skills:
- survey-visuals
produces:
- citations/ref.bib
- citations/verified.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- output/SCHEMA_NORMALIZATION_REPORT.md
- outline/writer_context_packs.jsonl
- output/EVIDENCE_SELFLOOP_TODO.md
- outline/claim_evidence_matrix.md
Notes:
- `evidence-draft` turns paper notes into per-subsection evidence packs (claim candidates + concrete comparisons + eval protocol + limitations) that the writer must follow.
- `claim-matrix-rewriter` makes `outline/claim_evidence_matrix.md` a projection/index of evidence packs (not an outline expansion), so writer guidance stays evidence-first.
- `writer-context-pack` builds a deterministic per-H3 drafting pack (briefs + evidence + anchors + allowed cites), reducing hollow writing and making C5 more debuggable.
- Tables are part of the default survey deliverable, but split into two layers:
- `outline/tables_index.md` (internal index; produced by `table-filler`; useful for planning/debugging; NOT inserted into the paper)
- `outline/tables_appendix.md` (reader-facing; produced by `appendix-table-writer`; clean/publishable; inserted into the draft as an Appendix block by `section-merger`)
- Optional: `survey-visuals` can still produce timeline/figure specs as intermediate artifacts.
- Optional refinement markers (recommended): after you spot-check/refine C4 artifacts and want to freeze them, create:
- `outline/evidence_bindings.refined.ok`
- `outline/evidence_drafts.refined.ok`
- `outline/anchor_sheet.refined.ok`
- `outline/writer_context_packs.refined.ok`
These markers make “reviewed/refined” explicit and prevent accidental regeneration/overwrite; strict mode relies on content checks (placeholders/blocking_missing/scope), not marker presence.
## Stage 5 - Draft (C5) [PROSE AFTER C2]
required_skills:
- front-matter-writer
- chapter-lead-writer
- subsection-writer
- writer-selfloop
- section-logic-polisher
- argument-selfloop
- paragraph-curator
- style-harmonizer
- opener-variator
- evaluation-anchor-checker
- transition-weaver
- section-merger
- post-merge-voice-gate
- citation-diversifier
- citation-injector
- draft-polisher
- global-reviewer
- pipeline-auditor
- artifact-contract-auditor
optional_skills:
- prose-writer
- subsection-polisher
- redundancy-pruner
- terminology-normalizer
- limitation-weaver
- latex-scaffold
- latex-compile-qa
produces:
- sections/sections_manifest.jsonl
- sections/abstract.md
- sections/discussion.md
- sections/conclusion.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/MERGE_REPORT.md
- output/DRAFT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/GLOBAL_REVIEW.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
Notes:
- C5 writing system (semantic + minimal artifacts; no extra machinery):
- **Unit of work**: `sections/*.md` (front matter, H2 leads, H3 bodies). Avoid editing `output/DRAFT.md` directly until after merge.
- **Single source of truth (locked definitions)**: `output/ARGUMENT_SKELETON.md` → `## Consistency Contract` (terminology, scope boundary, evaluation protocol fields, baseline naming).
- **Write → check → fix (five gates)**:
1) `writer-selfloop` → `output/WRITER_SELFLOOP_TODO.md`: file existence, depth, citation scope, paper voice.
2) `section-logic-polisher` → `output/SECTION_LOGIC_REPORT.md`: paragraph linkage (no jump cuts / “paragraph islands”).
3) `argument-selfloop` → `output/ARGUMENT_SELFLOOP_TODO.md` + `output/ARGUMENT_SKELETON.md` + `output/SECTION_ARGUMENT_SUMMARIES.jsonl`: section-level closure + premise/definition stability.
4) `paragraph-curator` → `output/PARAGRAPH_CURATION_REPORT.md`: **select → evaluate → subset → fuse** so sections converge (reduce redundancy, strengthen synthesis) without changing citation keys.
5) `evaluation-anchor-checker` → `output/EVAL_ANCHOR_REPORT.md`: final section-level numeric hygiene sweep so surviving numeric claims keep same-sentence task/metric/constraint context before merge.
- **Openers-last**: draft the middle first; rewrite paragraph 1 last so it reflects real content (front matter + H3).
- Writing self-loop gate: `subsection-writer` ensures the full `sections/` file set exists (and emits `sections/sections_manifest.jsonl`); `writer-selfloop` blocks until depth/citation-scope/paper-voice checks pass, writing `output/WRITER_SELFLOOP_TODO.md` (PASS/FAIL).
- Argument self-loop gate: `argument-selfloop` blocks “smooth but hollow” sections by making the argument chain explicit (per-section paragraph moves + a global dependency skeleton). Its ledgers are intermediate artifacts and must never be merged into the paper.
- Style hygiene (C5 hard gate for `survey`/`deep`): treat `output/WRITER_SELFLOOP_TODO.md` Style Smells as mandatory fixes. Run `style-harmonizer` + `opener-variator` on flagged files, then rerun `writer-selfloop` before merge.
- Numeric hygiene gate: run `evaluation-anchor-checker` after the last section-level rewrites (`paragraph-curator` + `style-harmonizer` + `opener-variator`) and before merge. If later section rewrites touch the same H3 files, rerun it; do not wait for `pipeline-auditor` to discover underspecified numbers in the merged draft.
- Micro-fix routing (preferred over broad rewrites): if Style Smells are specific, use targeted micro-skills before a general harmonize pass:
- opener cadence / “overview” narration → `opener-variator`
- count-based limitation slots (“Two limitations…”) → `limitation-weaver`
- underspecified numeric/performance claims (missing task/metric/budget) → `evaluation-anchor-checker`
- Triage rule (prevents “patching evidence gaps with prose”): if `writer-selfloop` FAILs because a subsection cannot meet `must_use` *in-scope* (thin packs / missing anchors / out-of-scope citation pressure), stop and rerun the evidence loop (`evidence-selfloop` + upstream C2/C3/C4) instead of padding prose.
- WebWeaver-style “planner vs writer” split (single agent, two passes):
- Planner pass: for each section/subsection, pick the exact citation IDs to use from the evidence bank (`outline/evidence_drafts.jsonl`) and keep scope consistent with the outline.
- Writer pass: write that section using only those citation IDs; avoid dumping the whole notes set into context.
- Treat this stage as an iteration loop: draft per H3 → logic-polish (thesis + connectors) → weave transitions → merge → de-template/cohere → global review → (if gaps) back to C3/C4 → regenerate.
- Post-merge voice gate: `post-merge-voice-gate` treats `outline/transitions.md` as a high-frequency injection source. If it FAILs, fix the *source* (usually transitions via `transition-weaver`, or the owning `sections/*.md`) and re-merge; do not “patch around it” in `draft-polisher`.
- Depth target (profile-aware): each H3 should be “few but thick” (avoid stubs). Use `queries.md:draft_profile` as the contract:
- `survey`: >=10 paragraphs + >=12 unique cites
- `deep`: >=11 paragraphs + >=14 unique cites
In all profiles, require >=2 concrete contrasts + evaluation anchoring + a cross-paper synthesis paragraph + an explicit limitation.
- Profile semantics: `survey` is the default deliverable contract; `deep` is stricter (and typically pairs well with `evidence_mode: fulltext`).
- Coherence target (paper-like): for every H2 chapter with H3 subsections, write a short **chapter lead** block (`sections/S<sec_id>_lead.md`) that previews the comparison axes and how the H3s connect (no new headings; avoid generic glue).
- Anti-template style contract (paper-like, not “outline narration”):
- Avoid meta openers like “This subsection surveys/argues …” and slide-like navigation (“Next, we move from … / We now turn to …”).
- Keep signposting light: avoid repeating a literal opener label across many subsections (e.g., `Key takeaway:`); vary opener phrasing and cadence.
- Tone target: calm, academic, understated; delete hype words (`clearly`, `obviously`) and “PPT speaker notes”.
- Keep evidence-policy disclaimers **once** in front matter (not repeated across H3s).
- If you cite numbers, include minimal evaluation context (task + metric + constraint/budget/cost) in the same paragraph.
- Citation shape must be reader-facing: no adjacent citation blocks (e.g., `[@a] [@b]`), no duplicate keys in one block (e.g., `[@a; @a]`), and avoid tail-only citation style by keeping mid-sentence citations in each H3.
- `section-merger` merges `sections/*.md` plus `outline/transitions.md` (within-chapter H3→H3 by default). Between-H2 transition insertion is optional: create `outline/transitions.insert_h2.ok` in the workspace if you want narrator-style handoffs included.
- Tables are part of the default deliverable: `outline/tables_appendix.md` is inserted into the draft by `section-merger` as a single Appendix block (index tables in `outline/tables_index.md` remain intermediate) unless `outline/tables.insert.off` exists. Other visuals (`outline/timeline.md`, `outline/figures.md`) remain intermediate by default.
- Citation scope policy: citations are subsection-first (from `outline/evidence_bindings.jsonl`), with limited reuse allowed within the same H2 chapter to reduce brittleness; avoid cross-chapter “free cite” drift.
- Controlled flexibility: bibkeys mapped to >= `queries.md:global_citation_min_subsections` subsections (A150++ default: 4) are treated as cross-cutting/global; see `allowed_bibkeys_global` in writer packs / `sections_manifest.jsonl`.
- If global unique citations are low, run `citation-diversifier` → `citation-injector` *before* `draft-polisher` (the polisher treats citation keys as immutable).
- `queries.md` can set `citation_target: recommended|hard` to control whether the recommended target is enforced as blocking (default: `recommended` for A150++).
- If you intentionally add/remove citations after an earlier polish run, reset the citation-anchoring baseline before rerunning `draft-polisher`:
- delete `output/citation_anchors.prepolish.jsonl` (workspace-local), then rerun `draft-polisher`.
- Recommended skills (toolkit, not a rigid one-shot chain):
- Modular drafting: `subsection-writer` → `writer-selfloop` → `section-logic-polisher` → `argument-selfloop` → `paragraph-curator` → `style-harmonizer` → `opener-variator` → `evaluation-anchor-checker` → `transition-weaver` → `section-merger` → `draft-polisher` → `global-reviewer` → `pipeline-auditor`.
- Legacy one-shot drafting: `prose-writer` (kept for quick experiments; less debuggable).
- If the draft reads like “paragraph islands”, run `section-logic-polisher` and patch only failing `sections/S*.md` until PASS, then merge.
- Add `pipeline-auditor` after `global-reviewer` as a regression test (blocks on ellipsis, repeated boilerplate, and citation hygiene).
- If you also need a PDF deliverable, use `latex-scaffold` + `latex-compile-qa` (see `arxiv-survey-latex`).
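The countable part of the per-H3 depth target in the notes above (paragraph and unique-citation minimums per profile) can be sketched as follows. This is a rough check under stated assumptions: paragraphs are separated by blank lines, citations use the `[@key; @key2]` bracket form shown in these notes, and the function name `check_depth` is hypothetical. It does not cover the qualitative requirements (contrasts, evaluation anchoring, synthesis, limitation) and is not the actual `writer-selfloop` logic.

```python
import re

# (min paragraphs, min unique cites) per draft_profile, as stated in the notes.
DEPTH_TARGETS = {"survey": (10, 12), "deep": (11, 14)}
CITE_BLOCK = re.compile(r"\[([^\[\]]*@[^\[\]]*)\]")
CITE_KEY = re.compile(r"@([A-Za-z0-9_:.\-]+)")


def check_depth(h3_text, profile="survey"):
    """Count paragraphs and unique citation keys in one H3 body."""
    min_paras, min_cites = DEPTH_TARGETS[profile]
    paragraphs = [
        p for p in re.split(r"\n\s*\n", h3_text)
        if p.strip() and not p.lstrip().startswith("#")  # skip heading lines
    ]
    keys = set()
    for block in CITE_BLOCK.findall(h3_text):
        keys.update(CITE_KEY.findall(block))
    return {
        "paragraphs": len(paragraphs),
        "unique_cites": len(keys),
        "pass": len(paragraphs) >= min_paras and len(keys) >= min_cites,
    }


# Synthetic body: 10 paragraphs, two made-up keys each.
sample = "\n\n".join(
    f"Paragraph {i} contrasts two methods [@key{i}; @alt{i}] on one task."
    for i in range(10)
)
survey_result = check_depth(sample, "survey")
deep_result = check_depth(sample, "deep")
print(survey_result, deep_result)
```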
## Quality gates (strict mode)
- Citation coverage: expect a large, verifiable bibliography (A150++ default: `core_size=300` → `ref.bib` ~300) and high cite density:
- Per-H3: `survey` profile expects >=12 unique citations per H3 (and deeper profiles may require more).
- Front matter: `survey` profile expects Introduction>=35 and Related Work>=50 unique citations (dense positioning; no cite dumps).
- Global: `pipeline-auditor` gates on **global unique citations across the full draft**. A150++ defaults: hard `>=150`; recommended `>=165` (when bib=300). `queries.md:citation_target` controls which is blocking (default: `recommended`). If it fails, prefer `citation-diversifier` → `citation-injector` (in-scope, NO NEW FACTS) using each H3’s `allowed_bibkeys_selected` / `allowed_bibkeys_mapped` from `outline/writer_context_packs.jsonl`.
- Anti-template: drafts containing ellipsis placeholders (`…`) or leaked scaffold instructions (e.g., "enumerate 2-4 ...") should block and be regenerated from improved outline/mapping/evidence artifacts.
- Final polish hard gates (`survey`/`deep`): block on narration-template openers (e.g., `This subsection ...`), slide navigation phrasing, repeated opener stems/verbal tics, adjacent citation blocks (`[@a] [@b]`), duplicate keys in one block (`[@a; @a]`), and low H3 mid-sentence citation ratio (<30%).
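Three of the citation gates listed above are mechanical enough to sketch: the global unique-citation floor, adjacent citation blocks, and duplicate keys inside one block. The snippet below is an illustration only, assuming `[@key; @key2]` bracket citations; the function name `citation_gate` and the report shape are hypothetical and do not describe how `pipeline-auditor` is implemented. The mid-sentence ratio is left out because its exact definition is not given here.

```python
import re

CITE_BLOCK = re.compile(r"\[([^\[\]]*@[^\[\]]*)\]")
CITE_KEY = re.compile(r"@([A-Za-z0-9_:.\-]+)")
# Two citation blocks separated only by whitespace, e.g. "[@a] [@b]".
ADJACENT = re.compile(r"\[[^\[\]]*@[^\[\]]*\]\s*\[[^\[\]]*@[^\[\]]*\]")


def citation_gate(draft, floor=150):
    """Report unique-key count and two citation-shape violations."""
    unique = set()
    duplicate_blocks = []
    for block in CITE_BLOCK.findall(draft):
        keys = CITE_KEY.findall(block)
        unique.update(keys)
        if len(keys) != len(set(keys)):
            duplicate_blocks.append(f"[{block}]")
    return {
        "unique": len(unique),
        "meets_floor": len(unique) >= floor,
        "duplicate_blocks": duplicate_blocks,
        "adjacent_blocks": ADJACENT.findall(draft),
    }


# Made-up draft text with one adjacent pair and one duplicate block.
draft = "Agents plan with tools [@a; @b]. Later work [@c] [@d] repeats [@a; @a]."
report = citation_gate(draft, floor=4)
print(report)
```

Passing `floor=165` instead of the default would model the recommended target rather than the hard floor.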
FILE:pipelines/graduate-paper-pipeline.md
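# Pipeline: Chinese graduate thesis restructuring and finalization (research-stage draft)
> Status: research-stage draft
> Current positioning: study the workflow and the skills it needs first; **do not bind UNITS yet, and do not force this into an executable contract**
> Audience: Chinese graduate thesis writing that starts from an existing university template, prior papers, Overleaf sources, PDFs, BibTeX, experiment figures and tables, and intermediate working documents
> Core goal: restructure the "existing materials" into a Chinese graduate thesis with a unified main line, smooth structure, complete evidence, consistent terminology, and a compilable deliverable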
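## 1. What this Pipeline really is
This is neither a linear "write a thesis from scratch" workflow nor an assembly workflow that "stitches several papers together". It is a thesis-engineering Pipeline that is **driven by a question list, centered on a Markdown intermediate layer, and closed out in the TeX delivery layer**.
Its actual main loop is:
> `question list -> semantic positioning -> Markdown restructuring -> TeX write-back -> compile review -> write issues back -> enter the next round`
So the first goal of this Pipeline is not to "rush out a first version of the body text", but to get the following right first:
1. Recover the existing materials first, instead of editing blindly in `tex`.
2. Restructure chapter roles around the thesis main line first, instead of copying the original papers' narrative.
3. Work out structure, evidence, terminology, and figures/tables in the Markdown intermediate layer first, then write back to TeX.
4. Resolve structure and evidence problems first, then style and "AI flavor" problems.
5. The real convergence point is not "one full pass written", but "the question list is emptied round by round, and compilation and review pass stably".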
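## 2. The fundamental difference between this Pipeline and the survey Pipeline
It differs substantially from `arxiv-survey` / `arxiv-survey-latex` and must be modeled separately:
- the survey pipeline starts from "retrieving literature for a topic and converging the structure layer by layer";
- the graduate-paper pipeline starts from "inventorying and restructuring existing materials";
- the survey pipeline's core intermediate artifacts are `outline/evidence_drafts/...`;
- the graduate-paper pipeline's core intermediate artifacts are the outline, question list, chapter working drafts, figure/table plan, and review records in `codex_md/`;
- the survey pipeline's main goal is to "generate a survey";
- the graduate-paper pipeline's main goal is to "restructure existing research work into a finalized thesis that meets Chinese degree-thesis conventions".
In one sentence:
> survey is more like a "retrieval-driven evidence-writing workflow";
> graduate-paper is more like an "existing-material-driven thesis-engineering restructuring workflow".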
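## 3. Workspace layering and key artifacts
This Pipeline needs a clear separation between the **thinking layer** and the **delivery layer**.
### 3.1 Directory responsibilities
- `main.tex`: the thesis entry point; assembles the cover, abstract, table of contents, body, appendices, acknowledgements, achievements, and references.
- `chapters/`: final delivery-version body `.tex`.
- `abstract/`, `preface/`, `acknowledgement/`, `achievements/`: formal non-body deliverables.
- `references/`: bibliography database and style files.
- `pdf/`: PDFs of published or submitted papers.
- `Overleaf_ref/`: existing sources, revision drafts, supplementary materials.
- `codex_md/`: the intermediate working layer; this is the **thesis restructuring workspace**.
- `claude_md/`: review checklists and final-draft verification records.
- `tmp_layout/`, `tmp_layout2/`: figure/table trial layouts and page-layout rehearsals.
The most important distinction is:
- `codex_md/` is the **thinking / restructuring area**
- `chapters/` is the **delivery / finalization area**
Do not mix the two.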
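The directory layout in 3.1 can be bootstrapped with a short script. This is a minimal sketch, assuming the directory names listed above sit directly under the thesis root; the function name `init_workspace` and the returned report shape are hypothetical, and the real `thesis-workspace-init` skill may do considerably more (for example, an initial compile).

```python
import tempfile
from pathlib import Path

# Directory names taken from section 3.1; `main.tex` is a file, so it is only checked.
THESIS_DIRS = [
    "chapters", "abstract", "preface", "acknowledgement", "achievements",
    "references", "pdf", "Overleaf_ref", "codex_md", "claude_md",
    "tmp_layout", "tmp_layout2",
]


def init_workspace(root):
    """Create any missing workspace directories and report what was done."""
    root = Path(root)
    created = []
    for name in THESIS_DIRS:
        path = root / name
        if not path.exists():
            path.mkdir(parents=True)
            created.append(name)
    return {"created": created, "has_main_tex": (root / "main.tex").is_file()}


# Demonstration in a throwaway directory; the second call is a no-op.
with tempfile.TemporaryDirectory() as tmp:
    first = init_workspace(tmp)
    second = init_workspace(tmp)

print(len(first["created"]), second["created"], first["has_main_tex"])
```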
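### 3.2 Intermediate artifacts recommended to be fixed
It is recommended to keep the following artifacts fixed over the long term in this Pipeline:
- `codex_md/material_index.md`: inventory and index of existing materials
- `codex_md/material_readiness.md`: material readiness and gap hints
- `codex_md/missing_info.md`: missing-information list
- `codex_md/question_list.md`: this round's issue list, priorities, and acceptance criteria
- `codex_md/00_thesis_outline.md`: thesis main line, outline, chapter roles
- `codex_md/chapter_role_map.md`: source material -> chapter -> role mapping table
- `codex_md/chapter_rewrite_rules.md`: chapter restructuring rules table
- `codex_md/terminology_glossary.md`: unified terminology table
- `codex_md/symbol_metric_table.md`: unified symbol / metric table
- `codex_md/figure_plan.md`: figure/table and layout plan
- `claude_md/review_checklist.md`: review checklist for compilation, typesetting, data, and template items
- `output/THESIS_BUILD_REPORT.md`: compilation and final-draft check report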
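## 4. The skills this Pipeline actually needs
For now, setting reuse and generalization aside and decomposing by what this graduate-thesis Pipeline itself needs, the first version should use the following skills.
### 4.1 Main-chain skills
1. `thesis-workspace-init`
Responsibility: initialize the thesis project, remind the user to place materials, set up the workspace, check whether `main.tex` compiles, and generate the material inventory.
2. `thesis-question-list`
Responsibility: create and maintain `question_list.md`, fixing each round's revision goals, issue priorities, boundaries, and acceptance criteria.
This is the **control-plane skill** of the whole Pipeline.
3. `thesis-source-role-mapper`
Responsibility: map existing papers / templates / sources / PDFs / figure and table materials to "thesis roles", rather than only doing `paper -> chapter`.
4. `thesis-chapter-reconstructor`
Responsibility: around the thesis main line, restructure each chapter's goal, weight, handoffs, and narrative approach.
This is the **core restructuring skill** of the whole Pipeline.
5. `thesis-markdown-aligner`
Responsibility: unify the main line, terminology, symbols, metrics, and figure/table conventions, so that the chapters converge in the Markdown intermediate layer into one thesis rather than a set of papers.
6. `thesis-tex-writeback`
Responsibility: write content already straightened out in the Markdown layer back to `chapters/*.tex`, and synchronize figures/tables, equations, cross-references, and chapter handoffs.
7. `thesis-compile-review`
Responsibility: run compilation, warning triage, template-mode checks, data-consistency verification, and issue write-back.
It forms the final quality loop.
8. `thesis-style-polisher`
Responsibility: once structure, evidence, and data are stable, polish toward Chinese degree-thesis style and remove "AI flavor".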
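### 4.2 Parallel side-branch skills
9. `thesis-visual-layout-planner`
Responsibility: figure/table planning, Mermaid sketches, temporary page composition, and figure-text rhythm checks.
10. `thesis-frontmatter-sync`
Responsibility: synchronize the abstract, cover, appendices, achievements, acknowledgements, lists, and other non-body parts, so they are not left to be filled in at the very end.
11. `thesis-citation-enhance-review`
Responsibility: locate sentences that must be supported by citations, expand candidate literature, verify that citations match the claims, and write back to `references/*.bib` and the in-text citations.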
### 4.3 How these 11 skills relate
The main chain is:
`thesis-workspace-init -> thesis-question-list -> thesis-source-role-mapper -> thesis-chapter-reconstructor -> thesis-markdown-aligner -> thesis-tex-writeback -> thesis-compile-review -> thesis-style-polisher`
The parallel side branches are:
- `thesis-visual-layout-planner`
- `thesis-frontmatter-sync`
- `thesis-citation-enhance-review`
Among them:
- `thesis-question-list` is the control plane for the whole workflow
- `thesis-chapter-reconstructor` is the core restructuring plane
- `thesis-compile-review` is the quality-loop plane
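## 5. Pipeline overview (by stage)
| Stage | Goal | Core skills | Main outputs |
|---|---|---|---|
| Stage 0 | Project initialization and material intake | `thesis-workspace-init` | Directory skeleton, initial compile, material index, material-readiness hints |
| Stage 1 | Recover existing materials into the Markdown intermediate layer | `thesis-workspace-init` | Initial Markdown materials, missing-information list, material-gap hints |
| Stage 1.5 | Lock this round's issues and boundaries | `thesis-question-list` | `question_list.md` |
| Stage 2 | Map source materials to thesis roles | `thesis-source-role-mapper` | Chapter role map, materials grouped by chapter |
| Stage 2.5 | Restructure chapters around the main line | `thesis-chapter-reconstructor` | Restructured chapter Markdown |
| Stage 3 | Align, consolidate, and unify the whole-thesis structure | `thesis-markdown-aligner` | Stable outline, terminology table, symbol table, evidence-gap list |
| Stage 3.5 | Figure/table and layout rehearsal | `thesis-visual-layout-planner` | Figure/table plan, figure-text mapping, temporary layout results |
| Stage 4 | Write back to the TeX delivery layer | `thesis-tex-writeback` | Compilable chapter `.tex`, first complete `main.pdf` |
| Stage 4.5 | Synchronize non-body parts | `thesis-frontmatter-sync` | Synchronized abstract, appendices, cover, achievements, etc. |
| Stage 5 | Citation enhancement and verification | `thesis-citation-enhance-review` | Citation reinforcement results, verification records |
| Stage 6 | Compilation, typesetting, and final-draft review | `thesis-compile-review` | `THESIS_BUILD_REPORT`, review checklist |
| Stage 7 | Final polish and AI-flavor removal | `thesis-style-polisher` | A more natural Chinese final draft |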
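## 6. Stage-by-stage details
### Stage 0: Project initialization and material intake
**Goal**
First turn the thesis into a maintainable project rather than a pile of scattered files.
**Main inputs**
- University template
- Existing repository / sources
- Basic metadata such as student ID, year, and Chinese and English titles
- Existing `main.tex`
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Basic directory skeleton
- Initial `main.tex` / initial `main.pdf`
- `codex_md/material_index.md`
- `codex_md/material_readiness.md`
**Handoffs**
- If `main.tex` does not compile yet, do not enter the body restructuring stages
- If materials are not yet in place, do not start the question list or chapter restructuring
**Execution focus**
- Make clear which areas are for materials, which are for work, and which are for delivery
- Ensure it "compiles" first, then talk about "well written"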
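### Stage 1: Recover existing materials into the Markdown intermediate layer
**Goal**
Recover the reusable content from the existing `template / tex / Overleaf / PDF / bib` into a Markdown intermediate layer, instead of forcing edits directly in the TeX layer.
**Main inputs**
- University template
- Existing `.tex`
- Overleaf sources
- PDFs
- Bibliography database
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Initial Markdown chapter materials
- `codex_md/material_readiness.md`
- `codex_md/missing_info.md`
- Material index and list of information still to be supplied
**Handoffs**
- The outputs of Stage 1 feed directly into `thesis-question-list` and `thesis-source-role-mapper`
**Execution focus**
- The focus is "inventorying material assets", not "writing pretty sentences first"
- Explicitly mark experiment details, figure captions, citations, and numeric checks that still need to be supplied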
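### Stage 1.5: Lock the question list and this round's goals
**Goal**
Establish this round's control plane and make clear what this round fixes and what it does not.
**Main inputs**
- Initial Markdown materials
- Review feedback from the previous round
- Current compilation / structure / style problems
**Core skill**
- `thesis-question-list`
**Main outputs**
- `codex_md/question_list.md`
**Handoffs**
- This step is the starting point for all later restructuring and write-back
- Any structural dispute should first come back here to be re-prioritized
**Execution focus**
- Every issue must state clearly: what it is, why it matters, how to change it, and what level counts as accepted
- The issue list is not a memo; it is the entry point for iteration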
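### Stage 2: Map source materials to thesis roles
**Goal**
Map the existing materials to chapter roles in the thesis, rather than doing a mechanical "paper to chapter" assignment.
**Main inputs**
- Existing papers, PDFs, sources, figures and tables
- Initial Markdown materials
- `question_list.md`
**Core skill**
- `thesis-source-role-mapper`
**Main outputs**
- `codex_md/chapter_role_map.md`
- Markdown materials grouped by chapter
- Content boundaries for each chapter
**Handoffs**
- This is the direct input to `thesis-chapter-reconstructor`
**Execution focus**
- Distinguish: method core / background support / validation evidence / system implementation / material for limitations and future work
- If this is decided wrongly, the whole thesis will be skewed afterwards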
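### Stage 2.5: Restructure chapters around the main line
**Goal**
Convert the original paper-style narrative into a thesis-style narrative.
**Main inputs**
- Chapter role map
- `question_list.md`
- Markdown drafts of each chapter
**Core skill**
- `thesis-chapter-reconstructor`
**Main outputs**
- Restructured chapter Markdown
- `codex_md/chapter_rewrite_rules.md`
**Handoffs**
- This is the core stage of the whole Pipeline
- All later alignment, write-back, and compilation build on the restructuring results produced here
**Execution focus**
- Make clear what question each chapter must answer
- Decide which content to strengthen, weaken, move up, or push down
- Rewrite the handoffs to the preceding and following chapters
- Change the "selling-point narrative" into a "thesis main-line narrative"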
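### Stage 3: Markdown alignment, consolidation, and unification
**Goal**
Make the whole thesis become "one thesis" in the intermediate layer first, rather than several papers stitched together.
**Main inputs**
- Restructured chapter Markdown
- `question_list.md`
**Core skill**
- `thesis-markdown-aligner`
**Main outputs**
- Stable version of `codex_md/00_thesis_outline.md`
- `codex_md/terminology_glossary.md`
- `codex_md/symbol_metric_table.md`
- Evidence-gap list
**Handoffs**
- While structure, terminology, or metrics are still unstable, do not enter `tex-writeback`
**Execution focus**
- Unify the research main line, terminology, abbreviations, symbols, metrics, and figure/table conventions
- Decide which content is explained once in Chapter 2, which is developed in Chapters 3-5, and which goes into the appendices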
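### Stage 3.5: Figure/table and layout rehearsal
**Goal**
Plan figures, tables, and layout before the body is finalized; do not let figures and tables become last-minute patchwork.
**Main inputs**
- Stable outline
- Chapter Markdown
- Existing figures, tables, and experiment results
**Core skill**
- `thesis-visual-layout-planner`
**Main outputs**
- `codex_md/figure_plan.md`
- `mermaid/` diagram sketches
- `tmp_layout/`, `tmp_layout2/` trial layout results
**Handoffs**
- The figure/table plan directly affects the figure-text structure in `tex-writeback`
**Execution focus**
- Make clear which figures and tables each chapter needs to support the main line
- Sketch first, then do temporary page composition, then decide final placement and captions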
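### Stage 4: Write back to the TeX delivery layer
**Goal**
Write the content already straightened out in the Markdown layer back to `chapters/*.tex`, and enter the delivery layer.
**Main inputs**
- Stable Markdown chapters
- Figure/table plan
- Unified terminology / symbol / metric tables
**Core skill**
- `thesis-tex-writeback`
**Main outputs**
- `chapters/*.tex`
- First complete `main.pdf`
**Handoffs**
- If structural problems are found at this stage, in principle go back and fix them in the Markdown layer first; do not reinvent the structure in the TeX layer
**Execution focus**
- Synchronize figures/tables, equations, cross-references, chapter-opening introductions, and chapter-closing summaries
- The TeX layer is responsible for delivery, not for rethinking structure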
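### Stage 4.5: Synchronize non-body parts
**Goal**
Maintain the abstract, appendices, cover, achievements, acknowledgements, and other non-body parts in parallel, so they are not pushed to the end and written as low-quality filler.
**Main inputs**
- Stable body version
- University template items
- Chinese and English titles and abstract information
**Core skill**
- `thesis-frontmatter-sync`
**Main outputs**
- `abstract/abstract.tex`
- `abstract/abstract-en.tex`
- `preface/...`
- `appendix/...`
- `acknowledgement/...`
- `achievements/...`
**Handoffs**
- The final-draft review in Stage 6 checks these parts together with the body
**Execution focus**
- Chinese and English titles and abstracts use consistent definitions
- Numbers, metrics, and method names in the abstract match the body
- Appendices hold supplementary content only and do not repeat the main line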
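### Stage 5: Literature enhancement and citation verification
**Goal**
Systematically fill in "sentences that should be supported by citations", and make sure citations match the claims.
**Main inputs**
- Current Markdown / TeX version
- `references/*.bib`
- Existing PDFs / reference works
**Core skill**
- `thesis-citation-enhance-review`
**Main outputs**
- Enhanced bibliography database
- Citation reinforcement records
- Citation verification records
**Handoffs**
- Literature enhancement can run in parallel with Stages 3 and 4, but it must be closed out before the final polish
**Execution focus**
- Prioritize sentences that must have citation support: factual descriptions, method taxonomies, classic conclusions, metric definitions, and data sources
- Get citations right first, then add more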
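### Stage 6: Compilation, typesetting, and final-draft review
**Goal**
Form a quality loop: not just "it compiles", but "it is fit to submit".
**Main inputs**
- Complete TeX project
- Synchronized non-body parts
- review checklist
**Core skill**
- `thesis-compile-review`
**Main outputs**
- `output/THESIS_BUILD_REPORT.md`
- `claude_md/review_checklist.md`
- List of closed / still-open issues
**Handoffs**
- Issues found in Stage 6 determine which level to return to for repair: Stage 1.5 / 2.5 / 3 / 4
**Execution focus**
- Passing compilation is only the minimum requirement
- Warnings need to be triaged: blocks submission / must fix / record and observe
- Also check: data consistency, cite key validity, whether template parameters are correct, and whether terminology and abbreviations are unified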
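The cite key validity check named in the Stage 6 checklist can be sketched as follows. This is a rough illustration, assuming LaTeX `\cite`-family commands in the chapter sources and standard BibTeX entries in the bibliography; the function name `missing_cite_keys` and the sample keys are hypothetical, and the regexes are deliberately simple (they do not special-case `@string` or `@comment` blocks).

```python
import re

# \cite, \citep, \citet*, with optional [..] arguments, then {key1, key2}.
CITE_CMD = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# "@article{key," style entry headers in a .bib file.
BIB_ENTRY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def missing_cite_keys(tex_source, bib_source):
    """Return cite keys used in the TeX text but not defined in the bib text."""
    defined = set(BIB_ENTRY.findall(bib_source))
    used = set()
    for group in CITE_CMD.findall(tex_source):
        used.update(k.strip() for k in group.split(",") if k.strip())
    return sorted(used - defined)


# Made-up sources purely for illustration.
tex = r"Prior work \cite{zhang2021, li2020} and \citep[see][]{wang2019}."
bib = "@article{zhang2021,\n  title={A}\n}\n@inproceedings{wang2019,\n  title={B}\n}\n"
missing = missing_cite_keys(tex, bib)
print(missing)
```

In a real workspace the two strings would come from reading `chapters/*.tex` and `references/*.bib`; any non-empty result is a candidate for the "must fix" warning level.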
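### Stage 7: Final polish and AI-flavor removal
**Goal**
Polish toward Chinese degree-thesis style only after structure, evidence, and data are stable.
**Main inputs**
- The thesis version that has passed compilation and review
- `中文写作要求.md` (Chinese writing requirements)
- `GPT口癖与高频用词调研.md` (survey of GPT verbal tics and high-frequency wording)
**Core skill**
- `thesis-style-polisher`
**Main outputs**
- A more natural Chinese final draft
- De-templated introductions, summaries, and conclusion-and-outlook sections
**Handoffs**
- Stage 7 must come last
- If polishing exposes structural or evidence problems, roll back to an earlier stage instead of forcing polish over them
**Execution focus**
- Prioritize chapter-opening introductions, chapter-closing summaries, contribution statements, and the conclusion and outlook
- Tone down template tone, promotional tone, and common AI verbal tics
- The target is "reads more like a degree thesis", not "reads more like an AI-optimized paper"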
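## 7. The four fixed loops of this Pipeline
To keep it from degrading back into a "linear SOP", it is recommended to fix the following four loops:
### 7.1 Structure loop
`outline -> a chapter's md -> chapter imbalance found -> back to outline / question_list`
### 7.2 Content loop
`paper / Overleaf / PDF -> extract content -> restructure narrative -> still reads like the original paper -> restructure again`
### 7.3 Typesetting loop
`tex -> compile -> warnings / figure placement / cross-reference problems found -> back to tex or figure planning`
### 7.4 Style loop
`body stable -> AI polish -> template tone / AI flavor found -> back to writing conventions and wording control`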
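## 8. Execution priority for this Pipeline
If resources are limited in an actual run, the priority should be:
1. `thesis-question-list`
2. `thesis-source-role-mapper`
3. `thesis-chapter-reconstructor`
4. `thesis-markdown-aligner`
5. `thesis-tex-writeback`
6. `thesis-compile-review`
In other words:
- What truly cannot be skipped is "question list + material role mapping + chapter restructuring + compile review"
- What truly should not be done earliest is "polishing and AI-flavor removal"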
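## 9. Current conclusion
The first principle of this graduate-paper pipeline should be:
> **Restructure first, then write; converge in the intermediate layer first, then deliver in TeX; get evidence and structure right first, then style and polish.**
So if this Pipeline is later made formally executable, the thing most worth implementing first is not "one big all-in-one runner", but making the following skills solid:
- `thesis-workspace-init`
- `thesis-question-list`
- `thesis-source-role-mapper`
- `thesis-chapter-reconstructor`
- `thesis-compile-review`
Only once these five skills are solid does this Chinese graduate-thesis Pipeline have real practical value.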
FILE:pipelines/idea-brainstorm.pipeline.md
---
name: idea-brainstorm
version: 4.0
profile: idea-brainstorm
routing_hints: [idea, ideation, brainstorm, 点子, 选题, 找方向, 找 idea]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.idea-brainstorm.csv
contract_model: pipeline.frontmatter/v1
query_defaults:
draft_profile: idea_brainstorm
max_results: 1800
core_size: 100
evidence_mode: abstract
direction_pool_min: 12
direction_pool_max: 24
idea_screen_top_n: 10
idea_shortlist_size: 5
report_top_n: 3
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- evidence_mode
- direction_pool_min
- direction_pool_max
- idea_screen_top_n
- idea_shortlist_size
- report_top_n
- time_window.from
- time_window.to
quality_contract:
memo_bundle:
required_outputs: [output/REPORT.md, output/APPENDIX.md, output/REPORT.json]
signal_policy:
min_rows: 10
direction_policy:
shortlist_min: 3
shortlist_max: 5
keep_min: 3
cluster_diversity_min: 2
lead_diversity_target: 3
lead_diversity_axes: [cluster, direction_type, program_kind]
screening_policy:
keep_rank_max: 7
maybe_rank_max: 12
score_weights:
discussion_worthiness: 0.24
academic_value: 0.22
evidence_grounding: 0.18
direction_distinctness: 0.16
first_probe_clarity: 0.10
thesis_potential: 0.10
loop_policy:
stage_retry_budget:
C1: 2
C3: 1
C4: 1
max_reroutes: 4
require_human_on_retry_after_approval: true
stages:
C0:
title: Init + idea brief
mode: no_prose
required_skills: [workspace-init, pipeline-router, idea-brief, human-checkpoint]
optional_skills: []
produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/trace/IDEA_BRIEF.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
human_checkpoint:
approve: brainstorm brief
write_to: DECISIONS.md
C1:
title: Retrieval + core set
mode: no_prose
required_skills: [literature-engineer, dedupe-rank]
optional_skills: []
produces: [papers/papers_raw.jsonl, papers/papers_dedup.jsonl, papers/core_set.csv, papers/retrieval_report.md]
C2:
title: Idea landscape / focus
mode: no_prose
required_skills: [taxonomy-builder, pipeline-router, human-checkpoint]
optional_skills: []
produces: [outline/taxonomy.yml, DECISIONS.md]
human_checkpoint:
approve: focus clusters / lenses + exclusions
write_to: DECISIONS.md
C3:
title: Evidence signals
mode: no_prose
required_skills: [paper-notes, idea-signal-mapper]
optional_skills: []
produces: [papers/paper_notes.jsonl, papers/evidence_bank.jsonl, output/trace/IDEA_SIGNAL_TABLE.md, output/trace/IDEA_SIGNAL_TABLE.jsonl]
C4:
title: Direction pool + screening
mode: short_prose_ok
required_skills: [idea-direction-generator, idea-screener]
optional_skills: []
produces: [output/trace/IDEA_DIRECTION_POOL.md, output/trace/IDEA_DIRECTION_POOL.jsonl, output/trace/IDEA_SCREENING_TABLE.md, output/trace/IDEA_SCREENING_TABLE.jsonl]
C5:
title: Shortlist + memo synthesis + self-loop
mode: prose_allowed
required_skills: [idea-shortlist-curator, idea-memo-writer, deliverable-selfloop, artifact-contract-auditor]
optional_skills: []
produces: [output/trace/IDEA_SHORTLIST.md, output/trace/IDEA_SHORTLIST.jsonl, output/REPORT.md, output/APPENDIX.md, output/REPORT.json, output/DELIVERABLE_SELFLOOP_TODO.md, output/CONTRACT_REPORT.md]
---
# Pipeline: research idea brainstorm (signals -> directions -> memo)
Goal: produce a **discussion-ready research-idea memo** for PI / PhD readers.
This pipeline is for requests such as “找 research idea” (find a research idea), “brainstorm”, “选题” (topic selection), and “找方向” (find a direction).
It is **not** for writing a survey draft and **not** for generating execution-grade project specs.
The terminal deliverable is a single default entrypoint:
- `output/REPORT.md`
That memo should feel like a mature brainstorm artifact:
- grounded in literature,
- small enough to discuss,
- clear about what is promising,
- honest about uncertainty,
- and rich enough to guide the next discussion round.
Artifact policy:
- Reader-facing terminal artifacts:
- `output/REPORT.md`
- `output/APPENDIX.md`
- `output/REPORT.json`
- Trace artifacts stay under `output/trace/`.
- The run is only shareable when both layers exist and pass contract audit.
Default profile:
- Retrieval route: `literature-engineer`
- `core_size=100`
- Evidence mode: `abstract`
- Signal table: 10-20 rows
- Direction pool: 12-24
- Shortlist: 3-5
- Final memo lead directions: 3
## Stage 0 - Init + idea brief (C0)
required_skills:
- workspace-init
- pipeline-router
- idea-brief
- human-checkpoint
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
Notes:
- The brief is the single source of truth for topic, audience, constraints, exclusions, targets, and query buckets.
- Default behavior: block once at C0 for human approval of the brief before retrieval.
## Stage 1 - Retrieval + core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
Notes:
- Use multi-query buckets from `queries.md`.
- If recall is too low or the results are too noisy, fix the queries upstream (in `queries.md`) and rerun C1.
## Stage 2 - Idea landscape / focus (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- pipeline-router
- human-checkpoint
produces:
- outline/taxonomy.yml
- DECISIONS.md
human_checkpoint:
- approve: focus clusters / lenses + exclusions
- write_to: DECISIONS.md
Notes:
- Taxonomy is an idea landscape, not a paper outline.
- Default behavior: pause at C2 so the human can choose a few promising focus lenses.
## Stage 3 - Evidence signals (C3) [table-first]
required_skills:
- paper-notes
- idea-signal-mapper
produces:
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
Notes:
- The signal table should capture tensions, missing pieces, and academically meaningful axes rather than proposal-ready wedges.
- Keep this stage compact and table-first; no long prose.
## Stage 4 - Direction pool + screening (C4)
required_skills:
- idea-direction-generator
- idea-screener
produces:
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
Notes:
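- Expansion should generate discussion-worthy research directions, not a mechanical Cartesian product of operators.
- Screening should score each direction on the `score_weights` axes from the front matter: discussion worthiness, academic value, evidence grounding, direction distinctness, first-probe clarity, and thesis potential.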
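As a rough illustration of how screening could combine these axes, the sketch below applies the `score_weights` and the `keep_rank_max` / `maybe_rank_max` cutoffs from this pipeline's front matter. The function names, the 0-1 score scale, and the keep/maybe/drop reading of the rank cutoffs are assumptions made for illustration; this is not the actual `idea-screener` implementation.

```python
from __future__ import annotations

# Weights copied from quality_contract.screening_policy.score_weights.
SCORE_WEIGHTS = {
    "discussion_worthiness": 0.24,
    "academic_value": 0.22,
    "evidence_grounding": 0.18,
    "direction_distinctness": 0.16,
    "first_probe_clarity": 0.10,
    "thesis_potential": 0.10,
}


def weighted_score(axis_scores: dict[str, float]) -> float:
    """Combine per-axis scores (assumed to lie on a 0-1 scale) into one number."""
    missing = sorted(set(SCORE_WEIGHTS) - set(axis_scores))
    if missing:
        raise ValueError(f"missing axis scores: {missing}")
    return sum(weight * axis_scores[axis] for axis, weight in SCORE_WEIGHTS.items())


def screening_bucket(rank: int, keep_rank_max: int = 7, maybe_rank_max: int = 12) -> str:
    """Assumed reading of the cutoffs: top ranks keep, the next band maybe, the rest drop."""
    if rank <= keep_rank_max:
        return "keep"
    if rank <= maybe_rank_max:
        return "maybe"
    return "drop"


def rank_directions(pool: dict[str, dict[str, float]]) -> list[tuple[str, float, str]]:
    """Sort directions by descending score (ties broken by name) and attach a bucket."""
    scored = sorted(
        ((weighted_score(scores), name) for name, scores in pool.items()),
        key=lambda item: (-item[0], item[1]),
    )
    return [
        (name, score, screening_bucket(rank))
        for rank, (score, name) in enumerate(scored, start=1)
    ]
```

Because the six weights sum to 1.0, a direction that scores 1.0 on every axis gets a combined score of 1.0, which keeps the combined score on the same scale as the per-axis scores.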
## Stage 5 - Shortlist + memo synthesis + self-loop (C5)
required_skills:
- idea-shortlist-curator
- idea-memo-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- `output/trace/IDEA_SHORTLIST.md` is an internal convergence layer, not the user-facing final answer.
- `output/REPORT.md` is the terminal deliverable and should read like a discussion-ready research idea brainstorm memo.
- `deliverable-selfloop` should evaluate the full memo bundle plus its trace chain.
FILE:pipelines/lit-snapshot.pipeline.md
---
name: lit-snapshot
version: 2.3
profile: lit-snapshot
routing_hints: [snapshot, 快照, one-page, one page, 48h]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- outline/taxonomy.yml
- outline/outline.yml
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.lit-snapshot.csv
---
# Pipeline: literature snapshot (24-48h) [bullets-first]
Goal: a compact, reader-facing snapshot (`output/SNAPSHOT.md`) with a small but usable paper set and a paper-like structure. This is intentionally lighter than `arxiv-survey*` (no evidence packs, no BibTeX, no LaTeX).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- arxiv-search
- dedupe-rank
optional_skills:
- keyword-expansion
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Snapshot default: aim for a smaller but diverse set (e.g., core_set ~20-40) rather than a survey-scale pool.
- Practical retrieval loop (multi-query, then refine):
- Start broad with multiple query buckets (synonyms/acronyms/subtopics) and merge the results.
- If too few: add buckets + widen time window; if too noisy: rewrite keywords + add exclusions; rerun C1.
- If the snapshot feels generic, the fix is almost always upstream: broaden `queries.md` (more synonyms, fewer excludes, wider time window) and rerun C1.
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- outline-builder
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/outline.yml
human_checkpoint:
- approve: scope + outline (snapshot ToC)
- write_to: DECISIONS.md
Notes:
- Paper-like default: prefer fewer, thicker sections. For snapshots, avoid H3 explosion; treat H2 as the main organizing unit.
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent nodes and keep the ToC readable before writing.
## Stage 3 - Snapshot (C3) [SHORT PROSE OK]
required_skills:
- snapshot-writer
- deliverable-selfloop
- artifact-contract-auditor
optional_skills:
- prose-writer
produces:
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- The deliverable is bullets-first and pointer-heavy: every non-trivial claim should attach 1-2 concrete paper pointers from `papers/core_set.csv`.
- Avoid outline narration (e.g., `This section surveys ...`); write content claims + why-it-matters + pointers.
FILE:pipelines/peer-review.pipeline.md
---
name: peer-review
version: 2.3
profile: peer-review
routing_hints: [peer review, review report, referee, 审稿]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PAPER.md
- output/CLAIMS.md
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.peer-review.csv
---
# Pipeline: peer review / referee report
Goal: produce an actionable referee report (`output/REVIEW.md`) that is traceable to the submitted paper’s claims and evidence (no free-floating advice, no invented comparisons).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
## Stage 1 - Claims (C1)
required_skills:
- manuscript-ingest
- claims-extractor
produces:
- output/PAPER.md
- output/CLAIMS.md
Notes:
- Input expectation: you must have the manuscript text available as `output/PAPER.md` (or equivalent extracted text) before running this stage.
- Contract: every claim must include a source pointer (section/page/quote) so later critique is auditable.
## Stage 2 - Evidence audit (C2)
required_skills:
- evidence-auditor
- novelty-matrix
produces:
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
Notes:
- Evidence-auditor should write gaps/risks/verification steps, not “fix the paper”.
- Novelty-matrix should be conservative: compare against cited/known related work; if you lack sources, record that limitation explicitly.
## Stage 3 - Rubric write-up (C3)
required_skills:
- rubric-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Prefer concrete, minimal fixes (what experiment/ablation/analysis would resolve the concern) over generic “needs more experiments”.
FILE:pipelines/systematic-review.pipeline.md
---
name: systematic-review
version: 2.3
profile: systematic-review
routing_hints: [systematic review, systematic, prisma, 系统综述]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PROTOCOL.md
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/screening_log.csv
- papers/extraction_table.csv
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.systematic-review.csv
---
# Pipeline: systematic review (PRISMA-style)
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Protocol (C1)
required_skills:
- protocol-writer
produces:
- output/PROTOCOL.md
human_checkpoint:
- approve: protocol locked (query, inclusion/exclusion, databases, time window)
- write_to: DECISIONS.md
## Stage 2 - Retrieval & candidate pool (C2)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- arxiv-search
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Systematic-review contract: keep the candidate pool auditable. Prefer dedupe (good) over aggressive ranking/filtering (dangerous).
- `dedupe-rank` default: for this pipeline, it keeps the full deduped pool in `papers/core_set.csv` unless you explicitly set `queries.md:core_size` (screening should not silently drop papers).
## Stage 3 - Screening (C3)
required_skills:
- screening-manager
produces:
- papers/screening_log.csv
Notes:
- Use `papers/papers_dedup.jsonl` (or `papers/core_set.csv`) as the candidate list; `papers/screening_log.csv` must include protocol-grounded reasons for every decision.
- Practical auditability: number protocol clauses (`I1..` / `E1..`) and require screening rows to cite them (e.g., `reason_codes=E3`).
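A minimal sketch of that auditability check, assuming the screening log has been read into dict rows (for example with `csv.DictReader`) and that multiple codes in `reason_codes` are separated by semicolons. The helper names and the separator are assumptions, not part of `screening-manager`.

```python
from __future__ import annotations

import re

# Numbered protocol clauses: I1, I2, ... for inclusion; E1, E2, ... for exclusion.
CLAUSE_RE = re.compile(r"[IE]\d+")


def bad_reason_codes(reason_codes: str) -> list[str]:
    """Return tokens that are not numbered protocol clauses.

    An empty field is reported as a problem, since every decision needs a reason.
    """
    tokens = [token.strip() for token in reason_codes.split(";") if token.strip()]
    if not tokens:
        return ["<empty>"]
    return [token for token in tokens if not CLAUSE_RE.fullmatch(token)]


def audit_screening_rows(rows: list[dict[str, str]]) -> list[tuple[int, list[str]]]:
    """Return (1-based row number, offending tokens) for each row that fails the check."""
    problems: list[tuple[int, list[str]]] = []
    for row_number, row in enumerate(rows, start=1):
        bad = bad_reason_codes(row.get("reason_codes") or "")
        if bad:
            problems.append((row_number, bad))
    return problems
```

Running a check like this before C4 would surface free-text reasons (such as `off-topic`) that cannot be traced back to a protocol clause.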
## Stage 4 - Extraction (C4)
required_skills:
- extraction-form
- bias-assessor
produces:
- papers/extraction_table.csv
Notes:
- Extraction is schema-driven: do not write narrative synthesis here; keep all fields consistent and fill bias fields (low/unclear/high + notes).
## Stage 5 - Synthesis (C5) [PROSE ALLOWED]
required_skills:
- synthesis-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
FILE:pipelines/tutorial.pipeline.md
---
name: tutorial
version: 2.3
profile: tutorial
routing_hints: [tutorial, 教程]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/TUTORIAL_SPEC.md
- outline/concept_graph.yml
- outline/module_plan.yml
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.tutorial.csv
---
# Pipeline: tutorial (teaching loop)
Goal: a tutorial deliverable (`output/TUTORIAL.md`) that has a consistent running example and a real teaching loop (objectives -> steps -> exercises -> verification), not just a blog-style explanation.
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Spec (C1)
required_skills:
- tutorial-spec
produces:
- output/TUTORIAL_SPEC.md
Notes:
- The spec is the scope contract: audience, prerequisites, learning objectives, and a running example (if applicable).
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- concept-graph
- module-planner
- exercise-builder
produces:
- outline/concept_graph.yml
- outline/module_plan.yml
human_checkpoint:
- approve: target audience + scope + running example
- write_to: DECISIONS.md
Notes:
- Treat `outline/module_plan.yml` as the execution contract for writing: modules must have concrete outputs and at least one verifiable exercise each.
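The contract above could be checked mechanically before writing starts. The sketch below assumes the plan has already been loaded into a dict and uses hypothetical keys (`modules`, `id`, `outputs`, `exercises`); the real `outline/module_plan.yml` schema is not shown here and may differ.

```python
from __future__ import annotations

from typing import Any


def modules_missing_contract(plan: dict[str, Any]) -> list[str]:
    """List modules that lack concrete outputs or a verifiable exercise.

    The keys `modules`, `id`, `outputs`, and `exercises` are hypothetical.
    """
    problems: list[str] = []
    for index, module in enumerate(plan.get("modules") or [], start=1):
        label = module.get("id") or f"module #{index}"
        if not module.get("outputs"):
            problems.append(f"{label}: no concrete outputs")
        if not module.get("exercises"):
            problems.append(f"{label}: no verifiable exercise")
    return problems
```

An empty result means every module satisfies both halves of the contract; anything else is a reason to revisit the plan at the C2 checkpoint rather than during writing.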
## Stage 3 - Writing (C3) [PROSE ALLOWED]
required_skills:
- tutorial-module-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Write only within the approved scope; if you discover missing prerequisites, record them as a follow-up instead of expanding scope silently.
FILE:scripts/run.py
from __future__ import annotations
import argparse
import csv
import sys
from datetime import datetime
from pathlib import Path
def _parse_semicolon_list(value: str | None) -> list[str]:
if not value:
return []
return [x.strip() for x in value.split(";") if x.strip()]
def _strip_optional_marker(relpath: str) -> tuple[str, bool]:
relpath = (relpath or "").strip()
if relpath.startswith("?"):
return relpath[1:].strip(), True
return relpath, False
def _read_pipeline_path(workspace: Path) -> str:
lock_path = workspace / "PIPELINE.lock.md"
if not lock_path.exists():
return ""
for raw in lock_path.read_text(encoding="utf-8", errors="ignore").splitlines():
line = raw.strip()
if line.startswith("pipeline:"):
return line.split(":", 1)[1].strip()
return ""
def _load_units(units_path: Path) -> list[dict[str, str]]:
with units_path.open("r", encoding="utf-8", newline="") as handle:
reader = csv.DictReader(handle)
return [dict(row) for row in reader]
def main() -> int:
p = argparse.ArgumentParser()
p.add_argument("--workspace", required=True)
p.add_argument("--unit-id", default="")
p.add_argument("--inputs", default="")
p.add_argument("--outputs", default="")
p.add_argument("--checkpoint", default="")
args = p.parse_args()
repo_root = Path(__file__).resolve()
for _ in range(10):
if (repo_root / "AGENTS.md").exists():
break
parent = repo_root.parent
if parent == repo_root:
break
repo_root = parent
sys.path.insert(0, str(repo_root))
from tooling.common import atomic_write_text, ensure_dir
from tooling.quality_gate import QualityIssue, write_quality_report
workspace = Path(args.workspace).resolve()
current_unit_id = (args.unit_id or "").strip()
units_path = workspace / "UNITS.csv"
if not units_path.exists():
raise SystemExit(f"Missing {units_path}")
pipeline_rel = _read_pipeline_path(workspace)
pipeline_path = (repo_root / pipeline_rel).resolve() if pipeline_rel else None
target_artifacts: list[tuple[str, bool]] = []
pipeline_load_error = ""
if pipeline_path and pipeline_path.exists():
try:
from tooling.pipeline_spec import PipelineSpec
spec = PipelineSpec.load(pipeline_path)
for rel_raw in spec.target_artifacts:
rel, optional = _strip_optional_marker(str(rel_raw or ""))
if rel:
target_artifacts.append((rel, optional))
except Exception as exc:
pipeline_load_error = str(exc)
elif pipeline_rel:
pipeline_load_error = f"pipeline spec not found: {pipeline_rel}"
rows = _load_units(units_path)
done_status = {"DONE", "SKIP"}
pipeline_complete = True
missing_done_outputs: list[tuple[str, str, str]] = [] # unit_id, skill, rel
for row in rows:
unit_id = (row.get("unit_id") or "").strip()
skill = (row.get("skill") or "").strip()
status = (row.get("status") or "").strip().upper()
# When this auditor runs via the executor, its own unit row is marked DOING
# (the runner updates status before launching the script). Treat the
# current unit as effectively 'done' for completeness calculation so a
# finished pipeline can reach PASS.
if unit_id == current_unit_id:
if status not in done_status and status != "DOING":
pipeline_complete = False
elif status not in done_status:
pipeline_complete = False
# Never treat the auditor's own outputs as missing, since it writes them.
if unit_id == current_unit_id:
continue
if status != "DONE":
continue
outs_raw = _parse_semicolon_list(row.get("outputs"))
for raw in outs_raw:
rel, optional = _strip_optional_marker(raw)
if optional or not rel:
continue
if not (workspace / rel).exists():
missing_done_outputs.append((unit_id, skill, rel))
missing_targets: list[str] = []
self_report_rel = "output/CONTRACT_REPORT.md"
for rel, optional in target_artifacts:
if optional:
continue
# This auditor produces the contract report; ignore its own target in the
# pre-write missing-target scan.
if rel == self_report_rel:
continue
if not (workspace / rel).exists():
missing_targets.append(rel)
# Status policy.
status = "PASS"
if pipeline_load_error:
status = "FAIL"
elif missing_done_outputs:
status = "FAIL"
elif pipeline_complete and missing_targets:
status = "FAIL"
elif not pipeline_complete:
status = "OK"
now = datetime.now().replace(microsecond=0).isoformat()
lines: list[str] = []
lines.append("# Contract report")
lines.append("")
lines.append(f"- Timestamp: `{now}`")
lines.append(f"- Status: {status}")
lines.append(f"- Pipeline complete (units): {'yes' if pipeline_complete else 'no'}")
lines.append(f"- Pipeline: `{pipeline_rel or '(missing PIPELINE.lock.md)'}`")
if pipeline_path and pipeline_path.exists():
lines.append(f"- Pipeline spec: `{pipeline_path.relative_to(repo_root)}`")
if pipeline_load_error:
lines.append(f"- Pipeline load error: `{pipeline_load_error}`")
lines.append("")
lines.append("## A. DONE units missing required outputs")
if not missing_done_outputs:
lines.append("- (none)")
else:
for unit_id, skill, rel in sorted(missing_done_outputs, key=lambda t: (t[0], t[2])):
lines.append(f"- `{unit_id}` `{skill}` missing: `{rel}`")
lines.append("")
lines.append("## B. Pipeline target artifacts missing")
if pipeline_load_error:
lines.append("- (unavailable because pipeline contract failed to load)")
elif not target_artifacts:
lines.append("- (no target_artifacts found in pipeline front matter)")
elif not missing_targets:
lines.append("- (none)")
else:
for rel in missing_targets:
lines.append(f"- `{rel}`")
lines.append("")
lines.append("## C. Next action")
lines.append("")
if status == "PASS":
lines.append("- Workspace looks self-contained and shareable.")
elif pipeline_load_error:
lines.append("- Fix the pipeline contract or pipeline lock before trusting this report.")
elif missing_done_outputs:
lines.append("- Fix the contract drift: a unit is marked DONE but its outputs are missing.")
lines.append("- Either regenerate the missing outputs (preferred) or revert the unit status to TODO/BLOCKED.")
elif pipeline_complete and missing_targets:
lines.append("- The pipeline is complete but target artifacts are missing.")
lines.append("- Find which unit/skill owns each missing artifact and rerun that unit.")
else:
lines.append("- Pipeline is still running; treat this report as a completeness snapshot.")
lines.append("- If you need a shareable baseline now, finish the remaining units and rerun this auditor.")
lines.append("")
out_path = workspace / "output" / "CONTRACT_REPORT.md"
ensure_dir(out_path.parent)
atomic_write_text(out_path, "\n".join(lines).rstrip() + "\n")
# Keep QUALITY_GATE.md as a single entry point: when the pipeline is complete (or
# contract drift is detected), record PASS/FAIL here so the final state is easy
# to read without reruns.
try:
auditor_unit_id = current_unit_id or "U999"
issues: list[QualityIssue] = []
if pipeline_load_error:
issues = [
QualityIssue(
code="contract_pipeline_load_failed",
message=f"Pipeline contract failed to load for contract audit: {pipeline_load_error}",
)
]
elif missing_done_outputs:
issues = [
QualityIssue(
code="contract_done_outputs_missing",
message=f"{len(missing_done_outputs)} missing outputs for DONE units; see `output/CONTRACT_REPORT.md`.",
)
]
elif pipeline_complete and missing_targets:
issues = [
QualityIssue(
code="contract_missing_target_artifacts",
message=f"{len(missing_targets)} missing required pipeline target artifacts; see `output/CONTRACT_REPORT.md`.",
)
]
if pipeline_complete or missing_done_outputs:
write_quality_report(
workspace=workspace,
unit_id=auditor_unit_id,
skill="artifact-contract-auditor",
issues=issues,
)
except Exception:
pass
return 0 if status in {"PASS", "OK"} else 2
if __name__ == "__main__":
raise SystemExit(main())
FILE:tooling/__init__.py
FILE:tooling/common.py
from __future__ import annotations
import csv
import json
import os
import re
import shutil
import tempfile
from dataclasses import dataclass
from datetime import date, datetime
from pathlib import Path
from typing import Any, Iterable
import yaml
def today_iso() -> str:
return date.today().isoformat()
def now_iso_seconds() -> str:
return datetime.now().replace(microsecond=0).isoformat()
def ensure_dir(path: Path) -> None:
path.mkdir(parents=True, exist_ok=True)
def atomic_write_text(path: Path, content: str) -> None:
ensure_dir(path.parent)
fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
try:
with os.fdopen(fd, "w", encoding="utf-8") as handle:
handle.write(content)
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except OSError:
pass
raise
def backup_existing(path: Path) -> Path:
"""Rename an existing file to a timestamped `.bak.*` sibling and return the backup path."""
if not path.exists():
return path
stamp = datetime.now().replace(microsecond=0).isoformat().replace("-", "").replace(":", "")
backup = path.with_name(f"{path.name}.bak.{stamp}")
counter = 1
while backup.exists():
backup = path.with_name(f"{path.name}.bak.{stamp}.{counter}")
counter += 1
path.replace(backup)
return backup
def parse_semicolon_list(value: str | None) -> list[str]:
if not value:
return []
return [item.strip() for item in value.split(";") if item.strip()]
def normalize_title_for_dedupe(title: str) -> str:
title = title.lower()
title = re.sub(r"[^a-z0-9]+", " ", title)
title = re.sub(r"\s+", " ", title).strip()
return title
def normalize_axis_label(text: str) -> str:
text = re.sub(r"\s+", " ", (text or "").strip().lower())
text = text.rstrip(" .;:,;。")
text = re.sub(r"\s*/\s*", " ", text)
text = re.sub(r"[^a-z0-9]+", " ", text)
return text.strip()
def subsection_brief_generic_axis_norms() -> set[str]:
"""Axis labels that strict survey gates treat as scaffold-level defaults.
Keep this aligned across brief generation and quality-gate checks so the
generator does not promote an axis that the gate later classifies as generic.
"""
axes = {
"core mechanism and system architecture",
"training and data setup",
"evaluation protocol",
"evaluation protocol (benchmarks / metrics / human)",
"evaluation protocol (datasets / metrics / human)",
"evaluation protocol (datasets, metrics, human evaluation)",
"compute and efficiency",
"compute and latency constraints",
"efficiency and compute",
"tool interface contract (schemas / protocols)",
"tool selection / routing policy",
"sandboxing / permissions / observability",
"failure modes and limitations",
}
return {normalize_axis_label(axis) for axis in axes}
def read_jsonl(path: Path) -> list[dict[str, Any]]:
records: list[dict[str, Any]] = []
if not path.exists():
return records
with path.open("r", encoding="utf-8") as handle:
for line in handle:
line = line.strip()
if not line:
continue
records.append(json.loads(line))
return records
def write_jsonl(path: Path, records: Iterable[dict[str, Any]]) -> None:
ensure_dir(path.parent)
lines = [json.dumps(record, ensure_ascii=False) for record in records]
atomic_write_text(path, "\n".join(lines) + ("\n" if lines else ""))
def read_tsv(path: Path) -> list[dict[str, str]]:
if not path.exists():
return []
with path.open("r", encoding="utf-8", newline="") as handle:
reader = csv.DictReader(handle, delimiter="\t")
return [dict(row) for row in reader]
def write_tsv(path: Path, rows: list[dict[str, Any]], fieldnames: list[str]) -> None:
ensure_dir(path.parent)
fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
try:
with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
writer.writeheader()
for row in rows:
writer.writerow({key: row.get(key, "") for key in fieldnames})
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except OSError:
pass
raise
@dataclass(frozen=True)
class UnitsTable:
fieldnames: list[str]
rows: list[dict[str, str]]
@staticmethod
def load(path: Path) -> "UnitsTable":
with path.open("r", encoding="utf-8", newline="") as handle:
reader = csv.DictReader(handle)
fieldnames = list(reader.fieldnames or [])
rows = [dict(row) for row in reader]
return UnitsTable(fieldnames=fieldnames, rows=rows)
def save(self, path: Path) -> None:
ensure_dir(path.parent)
fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
try:
with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
writer = csv.DictWriter(handle, fieldnames=self.fieldnames)
writer.writeheader()
for row in self.rows:
writer.writerow({key: row.get(key, "") for key in self.fieldnames})
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except OSError:
pass
raise
def load_yaml(path: Path) -> Any:
with path.open("r", encoding="utf-8") as handle:
return yaml.safe_load(handle)
def dump_yaml(path: Path, data: Any) -> None:
ensure_dir(path.parent)
text = yaml.safe_dump(
data,
allow_unicode=True,
sort_keys=False,
default_flow_style=False,
width=120,
)
atomic_write_text(path, text)
def copy_tree(src_dir: Path, dst_dir: Path, *, overwrite: bool) -> None:
if not src_dir.is_dir():
raise ValueError(f"Template directory not found: {src_dir}")
ensure_dir(dst_dir)
for src_path in src_dir.rglob("*"):
rel = src_path.relative_to(src_dir)
dst_path = dst_dir / rel
if src_path.is_dir():
ensure_dir(dst_path)
continue
ensure_dir(dst_path.parent)
if dst_path.exists() and not overwrite:
continue
shutil.copy2(src_path, dst_path)
def tokenize(text: str) -> list[str]:
text = text.lower()
text = re.sub(r"[^a-z0-9]+", " ", text)
return [token for token in text.split() if token]
_EN_STOPWORDS = {
    "a",
    "an",
    "and",
    "are",
    "as",
    "at",
    "be",
    "by",
    "for",
    "from",
    "in",
    "is",
    "it",
    "of",
    "on",
    "or",
    "that",
    "the",
    "this",
    "to",
    "with",
    "we",
    "our",
    "via",
    "towards",
    "toward",
    "using",
    "use",
    "based",
    "new",
    "into",
    "over",
    "under",
    "between",
    "within",
    "without",
    "beyond",
}
_GENERIC_PAPER_WORDS = {
"survey",
"review",
"tutorial",
"paper",
"approach",
"method",
"methods",
"model",
"models",
"framework",
"frameworks",
"system",
"systems",
"learning",
"deep",
"neural",
"network",
"networks",
"analysis",
"benchmark",
"benchmarks",
"dataset",
"datasets",
"evaluation",
"evaluating",
"towards",
"using",
"based",
"study",
"studies",
}
def candidate_keywords(titles: Iterable[str], *, top_k: int, min_freq: int) -> list[str]:
freq: dict[str, int] = {}
for title in titles:
for token in tokenize(title):
if token in _EN_STOPWORDS or token in _GENERIC_PAPER_WORDS:
continue
if len(token) < 3:
continue
freq[token] = freq.get(token, 0) + 1
candidates = [t for t, c in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])) if c >= min_freq]
return candidates[:top_k]
def update_status_log(status_path: Path, line: str) -> None:
ensure_dir(status_path.parent)
if status_path.exists():
existing = status_path.read_text(encoding="utf-8")
else:
existing = "# Status\n"
if "## Run log" not in existing:
existing = existing.rstrip() + "\n\n## Run log\n"
updated = existing.rstrip() + f"\n- {line}\n"
atomic_write_text(status_path, updated)
def update_status_field(status_path: Path, heading: str, value: str) -> None:
heading_line = f"## {heading}".strip()
bullet_line = f"- `{value}`"
if status_path.exists():
lines = status_path.read_text(encoding="utf-8").splitlines()
else:
lines = ["# Status"]
out: list[str] = []
i = 0
updated = False
while i < len(lines):
line = lines[i]
out.append(line)
if line.strip() == heading_line:
if i + 1 < len(lines) and lines[i + 1].lstrip().startswith("-"):
out.append(bullet_line)
i += 2
updated = True
continue
out.append(bullet_line)
updated = True
i += 1
if not updated:
out.extend(["", heading_line, bullet_line])
atomic_write_text(status_path, "\n".join(out).rstrip() + "\n")
def decisions_has_approval(decisions_path: Path, checkpoint: str) -> bool:
if not checkpoint:
return False
if not decisions_path.exists():
return False
text = decisions_path.read_text(encoding="utf-8")
pattern = rf"^\s*-\s*\[[xX]\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b"
return re.search(pattern, text, flags=re.MULTILINE) is not None
def ensure_decisions_approval_checklist(decisions_path: Path) -> None:
if decisions_path.exists():
text = decisions_path.read_text(encoding="utf-8")
else:
text = "# Decisions log\n"
if re.search(r"^##\s+Approvals\b", text, flags=re.MULTILINE):
return
workspace = decisions_path.parent
checkpoints = _human_checkpoints_from_units(workspace)
if not checkpoints:
return
checklist_lines = ["## Approvals (check to unblock)"]
for checkpoint in checkpoints:
hint = _approval_hint(checkpoint)
suffix = f" ({hint})" if hint else ""
checklist_lines.append(f"- [ ] Approve {checkpoint}{suffix}")
checklist_lines.append("")
checklist = "\n".join(checklist_lines)
lines = text.splitlines()
if lines and lines[0].startswith("#"):
new_text = "\n".join([lines[0], "", checklist] + lines[1:]).rstrip() + "\n"
else:
new_text = (checklist + "\n" + text).rstrip() + "\n"
atomic_write_text(decisions_path, new_text)
def set_decisions_approval(decisions_path: Path, checkpoint: str, *, approved: bool) -> None:
    checkpoint = checkpoint.strip()
    if not checkpoint:
        raise ValueError("checkpoint must be non-empty")
    ensure_decisions_approval_checklist(decisions_path)
    # The checklist helper writes nothing when UNITS.csv has no HUMAN checkpoints,
    # so DECISIONS.md may still be missing here; fall back to an empty log.
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n"
    lines = text.splitlines()
    pattern = re.compile(rf"^\s*-\s*\[\s*[xX ]\s*\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b")
    updated = False
    for idx, line in enumerate(lines):
        if pattern.search(line):
            # Flip only the first checkbox on the matching line.
            lines[idx] = re.sub(
                r"\[\s*[xX ]\s*\]",
                "[x]" if approved else "[ ]",
                line,
                count=1,
            )
            updated = True
            break
    if not updated:
        # No existing checklist entry: add one under the Approvals heading,
        # creating the heading at the end of the file if needed.
        insert_at = None
        for idx, line in enumerate(lines):
            if line.strip().startswith("## Approvals"):
                insert_at = idx + 1
                break
        if insert_at is None:
            lines.append("")
            lines.append("## Approvals (check to unblock)")
            insert_at = len(lines)
        lines.insert(insert_at, f"- [{'x' if approved else ' '}] Approve {checkpoint}")
    atomic_write_text(decisions_path, "\n".join(lines).rstrip() + "\n")
def _human_checkpoints_from_units(workspace: Path) -> list[str]:
    units_path = workspace / "UNITS.csv"
    if not units_path.exists():
        return []
    try:
        table = UnitsTable.load(units_path)
    except Exception:
        return []
    # Collect HUMAN-owned checkpoints in first-seen order, without duplicates.
    seen: set[str] = set()
    out: list[str] = []
    for row in table.rows:
        owner = (row.get("owner") or "").strip().upper()
        if owner != "HUMAN":
            continue
        checkpoint = (row.get("checkpoint") or "").strip()
        if checkpoint and checkpoint not in seen:
            seen.add(checkpoint)
            out.append(checkpoint)
    return out
def _approval_hint(checkpoint: str) -> str:
    hints = {
        "C0": "kickoff: scope/sources/time window/constraints",
        "C1": "retrieval + core set",
        "C2": "scope + outline",
        "C3": "evidence ready",
        "C4": "citations verified",
        "C5": "allow prose writing",
    }
    return hints.get(checkpoint, "")
def upsert_checkpoint_block(decisions_path: Path, checkpoint: str, markdown_block: str) -> None:
    begin = f"<!-- BEGIN CHECKPOINT:{checkpoint} -->"
    end = f"<!-- END CHECKPOINT:{checkpoint} -->"
    block = "\n".join([begin, markdown_block.rstrip(), end, ""]).rstrip() + "\n"
    # The checklist helper may create or extend DECISIONS.md, but it writes nothing when
    # UNITS.csv lists no HUMAN checkpoints, so fall back to an empty log if the file is absent.
    ensure_decisions_approval_checklist(decisions_path)
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n\n"
    pattern = re.compile(
        rf"{re.escape(begin)}.*?{re.escape(end)}\n?",
        flags=re.DOTALL,
    )
    if pattern.search(text):
        # Use a callable replacement so backslashes in the block are not parsed as regex escapes.
        new_text = pattern.sub(lambda _match: block, text)
    else:
        new_text = text.rstrip() + "\n\n" + block
    atomic_write_text(decisions_path, new_text)
def seed_queries_from_topic(queries_path: Path, topic: str) -> None:
topic = topic.strip()
if not topic:
return
if queries_path.exists():
lines = queries_path.read_text(encoding="utf-8").splitlines()
else:
lines = [
"# Queries",
"",
"## Primary query",
"- keywords:",
" - \"\"",
"- exclude:",
" - \"\"",
"- max_results: \"\"",
"- core_size: \"\"",
"- time window:",
" - from: \"\"",
" - to: \"\"",
"",
"## Notes",
"-",
]
def _has_nonempty_values(token: str) -> bool:
in_block = False
for raw in lines:
stripped = raw.strip()
if stripped.startswith(f"- {token}:"):
in_block = True
continue
if not in_block:
continue
if raw.startswith(" - "):
value = stripped[2:].strip().strip('"').strip("'")
if value:
return True
continue
if stripped.startswith("- "):
break
return False
def _has_nonempty_scalar(token: str) -> bool:
for raw in lines:
stripped = raw.strip()
if not stripped.startswith(f"- {token}:"):
continue
value = stripped.split(":", 1)[1].split("#", 1)[0].strip().strip('"').strip("'")
return bool(value)
return False
def _has_nonempty_time_field(field: str) -> bool:
# Looks for lines like: ' - from: "2022"'
for raw in lines:
stripped = raw.strip()
if not stripped.startswith(f"- {field}:"):
continue
value = stripped.split(":", 1)[1].strip().strip('"').strip("'")
return bool(value)
return False
has_keywords = _has_nonempty_values("keywords")
has_excludes = _has_nonempty_values("exclude")
has_time_from = _has_nonempty_time_field("from")
has_time_to = _has_nonempty_time_field("to")
has_max_results = _has_nonempty_scalar("max_results")
has_core_size = _has_nonempty_scalar("core_size")
workspace = queries_path.parent
profile = pipeline_profile(workspace)
query_defaults = pipeline_query_defaults(workspace)
raw_tlow = topic.lower()
topic_for_queries = _sanitize_topic_for_query_seed(topic)
keyword_suggestions = [topic_for_queries]
tlow = topic_for_queries.lower()
is_agent = any(t in tlow for t in ("agent", "agents", "agentic"))
is_embodied = any(
t in tlow
for t in (
"embodied ai",
"embodied intelligence",
"embodied agent",
"embodied robotics",
"robot foundation model",
"robot learning",
"robot manipulation",
"vision-language-action",
"vla",
"generalist robot",
)
)
is_text_to_image = any(t in tlow for t in ("text-to-image", "text to image", "t2i"))
is_text_to_video = any(t in tlow for t in ("text-to-video", "text to video", "t2v"))
is_diffusion = "diffusion" in tlow
is_generative = is_text_to_image or is_text_to_video or is_diffusion or ("image generation" in tlow) or ("generative" in tlow)
if is_agent:
keyword_suggestions.extend(
[
"LLM agent",
"language model agent",
"tool use",
"function calling",
"tool-using agent",
"planning",
"memory",
"multi-agent",
"benchmark",
"safety",
]
)
exclude_suggestions: list[str] = []
if is_agent:
exclude_suggestions.append("agent-based modeling")
exclude_suggestions.extend(["react hooks", "perovskite", "banach", "coxeter"])
if is_embodied:
keyword_suggestions.extend(
[
"embodied AI survey",
"embodied AI review",
"embodied intelligence survey",
"embodied agent survey",
"robot foundation model survey",
"robot learning survey",
"robot manipulation survey",
"embodied robotics survey",
"vision-language-action survey",
"vision-language-action model",
"robot foundation model",
"generalist robot policy",
"world model robot",
]
)
stripped_output_terms = topic_for_queries.lower().strip() != raw_tlow.strip()
if stripped_output_terms and any(t in raw_tlow for t in ("latex", "pdf", "markdown", "typesetting")):
exclude_suggestions.extend(["latex", "pdf", "typesetting", "document layout"])
if is_generative:
keyword_suggestions.extend(
[
"text-to-image generation",
"text-guided image generation",
"diffusion model",
"denoising diffusion probabilistic model",
"latent diffusion",
"stable diffusion",
"classifier-free guidance",
"diffusion transformer",
"DiT",
"masked generative transformer",
"MaskGIT",
"autoregressive image generation",
"VQGAN",
"VQ-VAE",
"ControlNet",
"DreamBooth",
"textual inversion",
"LoRA fine-tuning",
]
)
default_max_results = query_defaults.get("max_results")
max_results_suggestion = str(default_max_results) if str(default_max_results or "").strip() else (
1800 if profile == "arxiv-survey" else (800 if (is_agent or is_generative or is_embodied) else 300)
)
time_from_suggestion = (
"2018" if is_embodied
else ("2022" if (is_agent and ("llm" in tlow or "language model" in tlow)) else ("2020" if is_generative else ""))
)
core_size_suggestion = str(query_defaults.get("core_size") or "").strip() or ("300" if profile == "arxiv-survey" else "")
out: list[str] = []
i = 0
while i < len(lines):
line = lines[i]
stripped = line.strip()
if stripped.startswith("- keywords:") and not has_keywords:
out.append(line)
i += 1
while i < len(lines) and lines[i].startswith(" - "):
i += 1
for kw in _dedupe_preserve_order(keyword_suggestions)[:14]:
out.append(f" - \"{kw}\"")
continue
if stripped.startswith("- exclude:") and not has_excludes:
out.append(line)
i += 1
while i < len(lines) and lines[i].startswith(" - "):
i += 1
for ex in _dedupe_preserve_order(exclude_suggestions)[:10]:
out.append(f" - \"{ex}\"")
continue
if stripped.startswith("- max_results:") and not has_max_results and max_results_suggestion:
out.append(f"- max_results: \"{max_results_suggestion}\"")
i += 1
continue
if stripped.startswith("- core_size:") and not has_core_size and core_size_suggestion:
out.append(f"- core_size: \"{core_size_suggestion}\"")
i += 1
continue
if stripped.startswith("- time window:") and not (has_time_from or has_time_to) and time_from_suggestion:
out.append(line)
i += 1
# Skip existing from/to lines if present.
while i < len(lines) and lines[i].startswith(" -"):
i += 1
out.append(f" - from: \"{time_from_suggestion}\"")
out.append(" - to: \"\"")
continue
out.append(line)
i += 1
out = _materialize_missing_query_defaults(out, query_defaults, allowed_fields=pipeline_overridable_query_fields(workspace))
atomic_write_text(queries_path, "\n".join(out).rstrip() + "\n")
def _dedupe_preserve_order(items: list[str]) -> list[str]:
    seen: set[str] = set()
    out: list[str] = []
    for item in items:
        item = item.strip()
        if not item or item in seen:
            continue
        seen.add(item)
        out.append(item)
    return out
def _sanitize_topic_for_query_seed(topic: str) -> str:
    text = str(topic or "").strip()
    if not text:
        return ""
    # Output-format phrases describe the deliverable, not the research topic; strip them
    # before seeding keywords. Longer phrases come first so they win over their substrings.
    patterns = [
        r"(?i)\bwith\s+latex\s*/\s*pdf\s+output\b",
        r"(?i)\bwith\s+latex\s+output\b",
        r"(?i)\bwith\s+pdf\s+output\b",
        r"(?i)\bwith\s+markdown\s+output\b",
        r"(?i)\blatex\s*/\s*pdf\s+output\b",
        r"(?i)\bpdf\s+output\b",
        r"(?i)\blatex\s+output\b",
        r"(?i)\bmarkdown\s+output\b",
        r"(?i)\bfor\s+latex\s*/\s*pdf\b",
    ]
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    text = re.sub(r"\s+", " ", text).strip(" ,;:-")
    return text or topic
_LEGACY_PIPELINE_ALIASES = {
"idea-finder": "idea-brainstorm",
"idea-finder.pipeline.md": "idea-brainstorm",
"pipelines/idea-finder.pipeline.md": "idea-brainstorm",
}
def find_repo_root(start: Path | None = None) -> Path:
    """Walk up from *start* (default: this file) looking for AGENTS.md."""
    candidate = (start or Path(__file__)).resolve()
    # Bounded climb: give up after 10 levels or at the filesystem root.
    for _ in range(10):
        if (candidate / "AGENTS.md").exists():
            return candidate
        parent = candidate.parent
        if parent == candidate:
            break
        candidate = parent
    raise FileNotFoundError("Could not find repo root (AGENTS.md marker)")
def _normalize_pipeline_lock_value(value: str) -> str:
raw = str(value or "").strip()
return _LEGACY_PIPELINE_ALIASES.get(raw, raw)
def resolve_pipeline_spec_path(*, repo_root: Path, pipeline_value: str) -> Path | None:
value = _normalize_pipeline_lock_value(pipeline_value)
if not value:
return None
candidate = Path(value)
if candidate.is_absolute() and candidate.exists():
return candidate.resolve()
rel_candidate = repo_root / value
if rel_candidate.exists():
return rel_candidate.resolve()
filename = Path(value).name
if filename:
direct = repo_root / "pipelines" / filename
if direct.exists():
return direct.resolve()
stem = filename
if stem.endswith(".pipeline.md"):
stem = stem[: -len(".pipeline.md")]
if stem:
direct = repo_root / "pipelines" / f"{stem}.pipeline.md"
if direct.exists():
return direct.resolve()
return None
def load_workspace_pipeline_spec(workspace: Path):
from tooling.pipeline_spec import PipelineSpec
try:
repo_root = find_repo_root(workspace)
except FileNotFoundError:
repo_root = Path(__file__).resolve().parents[1]
lock_path = workspace / "PIPELINE.lock.md"
if not lock_path.exists():
return None
pipeline_name = ""
try:
for raw in lock_path.read_text(encoding="utf-8", errors="ignore").splitlines():
line = raw.strip()
if line.startswith("pipeline:"):
pipeline_name = line.split(":", 1)[1].strip()
break
except Exception:
return None
if not pipeline_name:
return None
spec_path = resolve_pipeline_spec_path(repo_root=repo_root, pipeline_value=pipeline_name)
if spec_path is None:
return None
try:
return PipelineSpec.load(spec_path)
except Exception:
return None
def pipeline_query_defaults(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
return dict(spec.query_defaults) if spec is not None else {}
def pipeline_quality_contract(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
return dict(spec.quality_contract) if spec is not None else {}
def pipeline_quality_contract_value(workspace: Path, *keys: str, default: Any = None) -> Any:
current: Any = pipeline_quality_contract(workspace)
for key in keys:
if not isinstance(current, dict):
return default
current = current.get(str(key))
if current is None:
return default
return current
def pipeline_query_default(workspace: Path, key: str, default: Any = None) -> Any:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return default
return spec.query_default(key, default)
def pipeline_overridable_query_fields(workspace: Path) -> set[str]:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return set()
return set(spec.overridable_query_fields)
def pipeline_profile(workspace: Path) -> str:
    """Return the pipeline profile for a workspace.

    Reads PIPELINE.lock.md to get the pipeline name, then loads the
    pipeline spec file and reads its ``profile`` frontmatter field.
    Falls back to ``"default"`` if anything is missing.
    """
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None:
        return "default"
    return str(spec.profile or "default").strip() or "default"
def latest_outline_state(workspace: Path) -> dict[str, Any]:
path = Path(workspace).resolve() / "outline" / "outline_state.jsonl"
records = [rec for rec in read_jsonl(path) if isinstance(rec, dict)]
return dict(records[-1]) if records else {}
def _materialize_missing_query_defaults(lines: list[str], query_defaults: dict[str, Any], *, allowed_fields: set[str] | None = None) -> list[str]:
if not query_defaults:
return lines
existing_keys: set[str] = set()
for raw in lines:
stripped = raw.strip()
if not stripped.startswith("- ") or ":" not in stripped:
continue
key = stripped[2:].split(":", 1)[0].strip().lower().replace(" ", "_").replace("-", "_")
if key:
existing_keys.add(key)
additions: list[str] = []
for key, value in query_defaults.items():
norm_key = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
if not norm_key or norm_key in existing_keys:
continue
if allowed_fields and norm_key not in allowed_fields:
continue
rendered = _render_query_scalar(value)
if rendered is None:
continue
additions.append(f'- {norm_key}: "{rendered}"')
if not additions:
return lines
out = list(lines)
if out and out[-1].strip():
out.append("")
out.extend(additions)
return out
def _render_query_scalar(value: Any) -> str | None:
if value is None:
return None
if isinstance(value, bool):
return "true" if value else "false"
if isinstance(value, (int, float)):
return str(value)
if isinstance(value, str):
text = value.strip()
return text or None
return None
FILE:tooling/executor.py
from __future__ import annotations
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path
from tooling.common import (
UnitsTable,
atomic_write_text,
decisions_has_approval,
ensure_dir,
latest_outline_state,
load_workspace_pipeline_spec,
now_iso_seconds,
parse_semicolon_list,
set_decisions_approval,
update_status_field,
update_status_log,
)
@dataclass(frozen=True)
class RunResult:
unit_id: str | None
status: str
message: str
def _section_first_cutover_block_message(*, workspace: Path, outputs: list[str]) -> str | None:
spec = load_workspace_pipeline_spec(workspace)
if spec is None or str(spec.structure_mode or "").strip().lower() != "section_first":
return None
if "outline/outline_state.jsonl" not in outputs:
return None
latest = latest_outline_state(workspace)
if not latest:
return "Section-first cutover is not actionable yet: `outline/outline_state.jsonl` is missing or empty. Rerun `outline-refiner` after fixing the chapter/section layer."
structure_phase = str(latest.get("structure_phase") or "").strip()
h3_status = str(latest.get("h3_status") or "").strip().lower()
reroute_target = str(latest.get("reroute_target") or "").strip()
retry_budget = str(latest.get("retry_budget_remaining") or "").strip()
reroute_reason = str(latest.get("reroute_reason") or "").strip()
if structure_phase.lower() == "decomposed" and h3_status == "stable":
return None
missing: list[str] = []
for key in ("structure_phase", "h3_status", "approval_status", "reroute_target", "retry_budget_remaining"):
if key not in latest:
missing.append(key)
continue
if key in {"structure_phase", "h3_status"} and not str(latest.get(key) or "").strip():
missing.append(key)
if missing:
return (
"Section-first cutover cannot advance because the latest `outline_state.jsonl` record is missing "
f"required fields: {', '.join(missing)}."
)
reroute_label = reroute_target or "unknown"
retry_label = retry_budget or "unknown"
reason_suffix = f" Reason: {reroute_reason}." if reroute_reason else ""
return (
"Section-first cutover is still blocked; "
f"latest outline state has structure_phase={structure_phase or 'missing'}, "
f"h3_status={h3_status or 'missing'}, reroute_target={reroute_label}, "
f"retry_budget_remaining={retry_label}.{reason_suffix} Fix the reroute target explicitly, then rerun this unit."
)
def _append_run_error(*, workspace: Path, unit_id: str, skill: str, kind: str, message: str, log_rel: str | None) -> None:
    """Append a short failure record to `output/RUN_ERRORS.md` (workspace-local).

    This is a human-facing error sink that survives reruns and makes BLOCKED states debuggable.
    """
    try:
        ensure_dir(workspace / "output")
        out_path = workspace / "output" / "RUN_ERRORS.md"
        stamp = now_iso_seconds()
        log_hint = f" (log: `{log_rel}`)" if log_rel else ""
        line = f"- {stamp} `{unit_id}` `{skill}` `{kind}`: {message}{log_hint}"
        if out_path.exists() and out_path.stat().st_size > 0:
            prev = out_path.read_text(encoding="utf-8", errors="ignore").rstrip() + "\n"
        else:
            prev = "# Run errors\n\n"
        atomic_write_text(out_path, prev + line + "\n")
    except Exception:
        # Never let the runner crash while trying to log an error.
        return
def run_one_unit(
*,
workspace: Path,
repo_root: Path,
strict: bool = False,
auto_approve: set[str] | None = None,
) -> RunResult:
units_path = workspace / "UNITS.csv"
status_path = workspace / "STATUS.md"
if not units_path.exists():
return RunResult(unit_id=None, status="ERROR", message=f"Missing {units_path}")
table = UnitsTable.load(units_path)
runnable_idx = _find_first_runnable(table)
if runnable_idx is None:
return RunResult(unit_id=None, status="IDLE", message="No runnable unit found")
row = table.rows[runnable_idx]
unit_id = row.get("unit_id", "").strip()
skill = row.get("skill", "").strip()
owner = row.get("owner", "").strip().upper()
row["status"] = "DOING"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DOING {skill}")
auto_approve_set = {str(x or "").strip().upper() for x in (auto_approve or set()) if str(x or "").strip()}
if owner == "HUMAN":
checkpoint = row.get("checkpoint", "").strip()
if checkpoint and decisions_has_approval(workspace / "DECISIONS.md", checkpoint):
row["status"] = "DONE"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (HUMAN approved {checkpoint})")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="DONE", message=f"HUMAN approved {checkpoint}")
if checkpoint and checkpoint.upper() in auto_approve_set:
set_decisions_approval(workspace / "DECISIONS.md", checkpoint, approved=True)
row["status"] = "DONE"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (AUTO approved {checkpoint})")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="DONE", message=f"AUTO approved {checkpoint}")
row["status"] = "BLOCKED"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (await HUMAN approval {checkpoint})")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Await HUMAN approval {checkpoint} in DECISIONS.md")
script_path = repo_root / "scripts" / "run.py"
if not script_path.exists():
row["status"] = "BLOCKED"
table.save(units_path)
skill_md = "SKILL.md"
update_status_log(
status_path,
(
f"{now_iso_seconds()} {unit_id} BLOCKED "
f"(no script for {skill}; run manually per {skill_md} then mark DONE)"
),
)
_refresh_status_checkpoint(status_path, table)
return RunResult(
unit_id=unit_id,
status="BLOCKED",
message=(
f"No executable script for skill '{skill}'. "
f"Run it manually by following `{skill_md}`, write the required outputs, "
f"then mark the unit DONE (e.g., `python scripts/pipeline.py mark --workspace {workspace} --unit-id {unit_id} --status DONE`)."
),
)
inputs = parse_semicolon_list(row.get("inputs"))
raw_outputs = parse_semicolon_list(row.get("outputs"))
outputs = [_strip_optional_marker(rel) for rel in raw_outputs]
required_outputs = [outputs[i] for i, rel in enumerate(raw_outputs) if not rel.strip().startswith("?")]
checkpoint = row.get("checkpoint", "").strip()
cmd = [
sys.executable,
str(script_path),
"--workspace",
str(workspace),
"--unit-id",
unit_id,
"--inputs",
";".join(inputs),
"--outputs",
";".join(outputs),
"--checkpoint",
checkpoint,
]
log_rel = f"output/unit_logs/{unit_id}.{skill}.log"
log_path = workspace / log_rel
try:
completed = subprocess.run(cmd, check=False, capture_output=True, text=True)
if completed.stdout or completed.stderr or completed.returncode != 0:
ensure_dir(log_path.parent)
body = [
"# Unit log\n",
f"- unit_id: {unit_id}\n",
f"- skill: {skill}\n",
f"- exit: {completed.returncode}\n",
f"- cmd: {' '.join(cmd)}\n",
"\n## stdout\n\n",
(completed.stdout or "(empty)") + "\n",
"\n## stderr\n\n",
(completed.stderr or "(empty)") + "\n",
]
atomic_write_text(log_path, "".join(body))
except Exception as exc: # pragma: no cover
row["status"] = "BLOCKED"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (exec error)")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="exec_error",
message=f"{type(exc).__name__}: {exc}",
log_rel=None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=str(exc))
missing = [rel for rel in required_outputs if rel and not (workspace / rel).exists()]
if completed.returncode == 0 and not missing:
if strict:
from tooling.quality_gate import check_unit_outputs, write_quality_report
try:
issues = check_unit_outputs(skill=skill, workspace=workspace, outputs=outputs)
except Exception as exc: # pragma: no cover
from tooling.quality_gate import QualityIssue
issues = [
QualityIssue(
code="quality_gate_exception",
message=f"Quality gate crashed: {type(exc).__name__}: {exc}",
)
]
# Avoid confusing stale QUALITY_GATE.md after a successful run.
report_path = workspace / "output" / "QUALITY_GATE.md"
if issues or report_path.exists():
write_quality_report(workspace=workspace, unit_id=unit_id, skill=skill, issues=issues)
if issues:
row["status"] = "BLOCKED"
table.save(units_path)
rel_report = str((workspace / "output" / "QUALITY_GATE.md").relative_to(workspace))
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (quality gate: {rel_report})")
_refresh_status_checkpoint(status_path, table)
reroute_hint = _reroute_hint(workspace)
return RunResult(
unit_id=unit_id,
status="BLOCKED",
message=f"Quality gate failed; see {rel_report}" + (f"; {reroute_hint}" if reroute_hint else ""),
)
cutover_block = _section_first_cutover_block_message(workspace=workspace, outputs=outputs)
if cutover_block:
row["status"] = "BLOCKED"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (section-first cutover)")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="section_first_cutover",
message=cutover_block,
log_rel=log_rel if log_path.exists() else None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=cutover_block)
row["status"] = "DONE"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE {skill}")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="DONE", message="OK")
row["status"] = "BLOCKED"
table.save(units_path)
if missing:
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (missing outputs: {', '.join(missing)})")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="missing_outputs",
message=f"Missing outputs: {', '.join(missing)}",
log_rel=log_rel if log_path.exists() else None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Missing outputs: {', '.join(missing)}" + (f"; see {log_rel}" if log_path.exists() else ""))
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (script failed)")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="script_failed",
message=f"Skill script failed (exit {completed.returncode})",
log_rel=log_rel if log_path.exists() else None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Skill script failed (exit {completed.returncode})" + (f"; see {log_rel}" if log_path.exists() else ""))
def _find_first_runnable(table: UnitsTable) -> int | None:
    status_ok = {"DONE", "SKIP"}
    unit_by_id = {row.get("unit_id", ""): row for row in table.rows}
    for idx, row in enumerate(table.rows):
        # Only TODO and BLOCKED units are candidates; BLOCKED units are retried.
        if row.get("status", "").strip().upper() not in {"TODO", "BLOCKED"}:
            continue
        deps = parse_semicolon_list(row.get("depends_on"))
        if not deps:
            return idx
        # A unit is runnable once every dependency exists and is DONE or SKIP.
        deps_done = True
        for dep_id in deps:
            dep = unit_by_id.get(dep_id)
            if not dep:
                deps_done = False
                break
            if dep.get("status", "").strip().upper() not in status_ok:
                deps_done = False
                break
        if deps_done:
            return idx
    return None
def _refresh_status_checkpoint(status_path: Path, table: UnitsTable) -> None:
checkpoint = _compute_current_checkpoint(table)
update_status_field(status_path, "Current checkpoint", checkpoint)
def _compute_current_checkpoint(table: UnitsTable) -> str:
for row in table.rows:
if row.get("status", "").strip().upper() not in {"DONE", "SKIP"}:
return (row.get("checkpoint") or "").strip() or "C0"
return "DONE"
def invalidate_downstream_units(table: UnitsTable, *, root_unit_id: str) -> list[str]:
    """Reset all transitive downstream dependents of `root_unit_id` to TODO.

    This is used when a previously satisfied upstream unit is reopened for rerun.
    Keeping downstream units as DONE would otherwise leave stale artifacts in place
    and make later `run` invocations stop too early.
    """
    root = str(root_unit_id or "").strip()
    if not root:
        return []
    # Map each unit id to the rows that depend on it directly.
    direct_children: dict[str, list[dict[str, str]]] = {}
    for row in table.rows:
        for dep in parse_semicolon_list(row.get("depends_on")):
            direct_children.setdefault(dep, []).append(row)
    affected: list[str] = []
    seen: set[str] = set()
    stack = [root]
    # Depth-first walk over dependents; `seen` guards against revisits and cycles.
    while stack:
        current = stack.pop()
        for child in direct_children.get(current, []):
            child_id = str(child.get("unit_id") or "").strip()
            if not child_id or child_id in seen:
                continue
            seen.add(child_id)
            if str(child.get("status") or "").strip().upper() != "TODO":
                child["status"] = "TODO"
                affected.append(child_id)
            stack.append(child_id)
    return affected
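# Illustrative sketch (not part of the original module): the transitive reset performed by
# `invalidate_downstream_units`, replayed on a toy dependency table. The helper name
# `_demo_invalidate_downstream` and the unit ids are hypothetical; plain dicts stand in
# for `UnitsTable` rows, and `depends_on` is assumed to be pre-split into a list.
def _demo_invalidate_downstream(root: str = "U1") -> list[str]:
    rows = [
        {"unit_id": "U1", "depends_on": [], "status": "DONE"},
        {"unit_id": "U2", "depends_on": ["U1"], "status": "DONE"},
        {"unit_id": "U3", "depends_on": ["U2"], "status": "BLOCKED"},
        {"unit_id": "U4", "depends_on": [], "status": "DONE"},
    ]
    children: dict[str, list[dict]] = {}
    for row in rows:
        for dep in row["depends_on"]:
            children.setdefault(dep, []).append(row)
    affected: list[str] = []
    seen: set[str] = set()
    stack = [root]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            if child["unit_id"] in seen:
                continue
            seen.add(child["unit_id"])
            # Anything not already TODO is reset; unrelated units such as U4 are never visited.
            if child["status"] != "TODO":
                child["status"] = "TODO"
                affected.append(child["unit_id"])
            stack.append(child["unit_id"])
    return affected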
def _strip_optional_marker(relpath: str) -> str:
    relpath = (relpath or "").strip()
    # A leading "?" marks an output as optional; drop it to get the real path.
    if relpath.startswith("?"):
        return relpath[1:].strip()
    return relpath
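# Illustrative sketch (not part of the original module): how a leading "?" in a unit's
# `outputs` column separates optional from required paths, mirroring the list handling
# in `run_one_unit`. The helper name `_demo_split_outputs` and the sample paths used in
# any call are hypothetical.
def _demo_split_outputs(raw_outputs: list[str]) -> tuple[list[str], list[str]]:
    def strip_marker(rel: str) -> str:
        rel = (rel or "").strip()
        return rel[1:].strip() if rel.startswith("?") else rel

    # `outputs` keeps every path (marker removed); `required` keeps only unmarked ones.
    outputs = [strip_marker(rel) for rel in raw_outputs]
    required = [outputs[i] for i, rel in enumerate(raw_outputs) if not rel.strip().startswith("?")]
    return outputs, required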
def _reroute_hint(workspace: Path) -> str:
    path = workspace / "output" / "REROUTE_STATE.json"
    if not path.exists() or path.stat().st_size <= 0:
        return ""
    try:
        import json

        data = json.loads(path.read_text(encoding="utf-8", errors="ignore") or "{}")
    except Exception:
        return ""
    if not isinstance(data, dict):
        return ""
    target = str(data.get("reroute_target") or "").strip()
    status = str(data.get("status") or "").strip()
    phase = str(data.get("structure_phase") or "").strip()
    h3 = str(data.get("h3_status") or "").strip()
    reason = str(data.get("reroute_reason") or "").strip()
    # A reason alone is not enough to produce a hint.
    if not any([target, status, phase, h3]):
        return ""
    parts = []
    if status:
        parts.append(f"reroute_status={status}")
    if target:
        parts.append(f"reroute_target={target}")
    if phase:
        parts.append(f"structure_phase={phase}")
    if h3:
        parts.append(f"h3_status={h3}")
    if reason:
        parts.append(f"reason={reason}")
    return ", ".join(parts)
FILE:tooling/ideation.py
from __future__ import annotations
import json
import re
from dataclasses import asdict, dataclass, is_dataclass
from pathlib import Path
from typing import Any, Iterable
from tooling.common import (
atomic_write_text,
ensure_dir,
load_workspace_pipeline_spec,
pipeline_overridable_query_fields,
pipeline_query_default,
read_jsonl,
)
DEFAULT_IDEA_RUBRIC: list[tuple[str, float, str]] = [
("discussion_worthiness", 0.24, "is this direction worth a serious PI/PhD discussion right now?"),
("academic_value", 0.22, "could it open a meaningful paper/thesis-worthy research angle?"),
("evidence_grounding", 0.18, "can we point to concrete literature tensions rather than pure speculation?"),
("direction_distinctness", 0.16, "is it genuinely different from neighboring directions, not just a wording variant?"),
("first_probe_clarity", 0.10, "is there a plausible low-cost first probe that would teach us something?"),
("thesis_potential", 0.10, "does it plausibly grow beyond a one-off note into a stronger line of work?"),
]
STOPWORDS = {
"a", "an", "and", "the", "of", "to", "for", "in", "on", "with", "via", "by", "from", "or",
"work", "works", "study", "studies", "survey", "review", "benchmarks", "benchmark", "evaluation",
"agents", "agent", "model", "models", "llm", "large", "language", "using", "based",
}
CLUSTER_AXIS_HINTS: list[tuple[list[str], list[str], str, str]] = [
(["agent loop", "action"], ["observability granularity", "action-space design", "tool/environment boundary"], "mechanism", "Could sharpen how we think about the basic agent loop rather than adding yet another full stack."),
(["tool", "orchestration", "interface"], ["tool routing", "permission model", "API reliability assumptions"], "systems", "Could clarify whether interface design is the real bottleneck behind seemingly agent-level gains."),
(["planning", "reasoning"], ["search depth", "verification loop", "partial-observability handling"], "mechanism", "Could turn vague talk about reasoning quality into a sharper research question about what actually drives robust planning."),
(["memory", "retrieval", "rag"], ["retrieval policy", "state summarization cadence", "memory horizon"], "mechanism", "Could reveal when memory is a genuine capability lever versus a reporting convenience."),
(["self-improvement", "adaptation"], ["feedback type", "adaptation horizon", "evaluation signal quality"], "adaptation", "Could distinguish genuine improvement from evaluation-sensitive self-optimization."),
(["multi-agent", "coordination"], ["role specialization", "communication budget", "aggregation rule"], "coordination", "Could separate coordination benefits from simple redundancy or verification effects."),
(["benchmark", "evaluation"], ["metric choice", "task slice design", "failure-analysis lens"], "evaluation", "Could make evaluation choices themselves a research object instead of hidden background assumptions."),
(["safety", "security", "governance"], ["threat model", "monitoring granularity", "permission boundary"], "governance", "Could connect technical agent behaviors to deployment-relevant risk questions in a more principled way."),
]
GENERIC_LIMITATION_PATTERNS = [
"abstract-level evidence only",
"validate assumptions",
"full paper",
"before relying on this as key evidence",
]
BENCHMARK_HINTS = [
"HotpotQA", "FEVER", "ALFWorld", "WebShop", "HumanEval", "WebArena", "AgentBench",
"GSM8K", "MATH", "Wikipedia", "wiki", "API", "question answering", "fact verification",
]
AXIS_INSIGHT_LIBRARY: dict[str, dict[str, str]] = {
"observability granularity": {
"question": "whether several agent-loop gains are really planner gains, or whether they mostly come from changing what the planner gets to observe and when",
"confound": "planner quality and broader agent competence",
"thesis": "Several agent-loop gains remain hard to interpret because papers often improve observation access at the same time they improve the planner; before crediting planner depth, we should ask what the system was allowed to see.",
"insight": "A convincing result would not just move aggregate score. It would show which published gains survive once observation access is fixed, ideally producing a regime map of when observability—not planner depth—changes the failure story.",
"contribution_shape": "Could yield a causal-attribution result plus a reporting rule for agent-loop papers: claims about planning quality should specify and control observation access.",
"demotion": "This direction weakens sharply if the strongest prior work already holds observation access fixed while planner quality changes and still reports the same gain pattern.",
"kill_signal": "an anchor paper already fixes observation access while varying planner quality and the main conclusion still survives",
"missing_piece": "What is missing is a fixed-interface, fixed-budget comparison that varies only observation access and tracks whether the failure taxonomy—not just average score—changes.",
"program_kind": "causal attribution",
"time_to_clarity": "fast",
"priority_note": "Fastest to falsify on public tasks, and the answer would immediately change how several agent-loop results are interpreted.",
"reading_extract": "what the agent sees at each step, which ablations vary planner depth, and whether the action interface stays fixed",
"title": "Observability granularity vs planner depth",
},
"action-space design": {
"question": "whether robustness comes from better reasoning or from giving the agent a cleaner action interface",
"confound": "the shape of the action vocabulary and interface design",
"thesis": "Some agent-loop gains may be interface gains in disguise: when action vocabularies become cleaner or narrower, papers can attribute robustness to reasoning improvements that partly come from action-space design.",
"insight": "A useful result would show whether the same nominal planner still behaves very differently once the action space is widened, normalized, or made less ergonomic.",
"contribution_shape": "Could produce an action-space normalization protocol and a clearer account of which agent claims survive once interface ergonomics are controlled.",
"demotion": "This direction weakens if current papers already compare equivalent action spaces and still obtain the same ranking.",
"kill_signal": "the key anchor papers already normalize action vocabularies or API surfaces and still see the same ordering",
"missing_piece": "What is missing is an action-space-normalized comparison that keeps planner prompts fixed while changing only interface granularity and affordances.",
"program_kind": "interface normalization",
"time_to_clarity": "medium",
"priority_note": "Interesting and still thesis-relevant, but less immediate than observability because interface normalization usually requires more careful task redesign.",
"reading_extract": "how many actions are available, how semantically aligned the actions are, and whether interface cleanup happens together with reasoning changes",
"title": "Action-space design or agent competence?",
},
"tool/environment boundary": {
"question": "whether reliability depends more on boundary design between tool and environment than on the nominal reasoning loop",
"confound": "tool abstraction and environment modeling",
"thesis": "A number of agent results may really be about boundary design: where the system stops reasoning internally and starts delegating to tools can change reliability without changing the nominal planner much.",
"insight": "A strong result would show that several agent-level conclusions are actually boundary-design conclusions once tool abstraction is normalized.",
"contribution_shape": "Could yield a boundary-design taxonomy and a cleaner experimental recipe for separating planner quality from tool/interface scaffolding.",
"demotion": "This direction weakens if changing the boundary carefully barely affects the failure story.",
"kill_signal": "careful tool-boundary changes leave both the metric and the qualitative failure modes essentially unchanged",
"missing_piece": "What is missing is a comparison that keeps task semantics fixed while moving the tool/environment boundary in a controlled way.",
"program_kind": "systems boundary",
"time_to_clarity": "medium",
"priority_note": "Valuable as a systems-facing thesis line, but slower to turn into a decisive first readout than the lead mechanism questions.",
"reading_extract": "where tool calls begin, what state is exposed to the planner, and whether environment modeling changes together with the tool boundary",
"title": "Where the tool boundary really matters",
},
"search depth": {
"question": "whether apparent planning gains reflect deeper search or simply more inference-time budget to recover from weak initial choices",
"confound": "inference-time compute budget",
"thesis": "Many planning results still bundle depth with budget: deeper search often spends more tokens, branches, or retries, so it remains unclear whether depth changes reasoning quality or just buys more recovery opportunities.",
"insight": "A convincing result would show whether depth changes the nature of planning failures under a matched budget, rather than merely delaying failure by spending more compute.",
"contribution_shape": "Could produce a compute-normalized planning benchmark slice and a regime map for when search depth matters beyond extra inference budget.",
"demotion": "This direction weakens if prior work already equalizes budget and still finds depth-specific gains.",
"kill_signal": "the anchor papers already normalize token or wall-clock budget and the depth advantage remains intact",
"missing_piece": "What is missing is a compute-normalized study that holds token or wall-clock budget fixed while varying depth or branching, then checks whether failure modes actually change.",
"program_kind": "budget-normalized mechanism",
"time_to_clarity": "medium",
"priority_note": "Still thesis-sized, but it ranks behind observability because compute normalization is harder to defend and easier for prior work to have addressed already.",
"reading_extract": "the reported depth or branching settings, the effective compute budget, and whether shallow baselines were budget-matched",
"title": "Search depth or compute budget?",
},
"verification loop": {
"question": "whether verification contributes a distinct reasoning mechanism or mostly acts as expensive redundancy",
"confound": "extra compute and repeated checking",
"thesis": "Verification-heavy pipelines may look stronger because they retry, re-check, or filter more often, not necessarily because they add a distinct reasoning mechanism.",
"insight": "A useful result would separate verification as a mechanism from verification as a compute-expensive retry policy.",
"contribution_shape": "Could produce a cleaner protocol for comparing verification loops under matched retry or compute budgets.",
"demotion": "This direction weakens if verification can be replaced by simple repetition with little change in behavior.",
"kill_signal": "repeat-sampling or retry baselines already match the reported verification gain",
"missing_piece": "What is missing is a matched-budget comparison between explicit verification and simple repetition or reranking baselines.",
"program_kind": "verification audit",
"time_to_clarity": "medium",
"priority_note": "Decision-relevant when the group suspects redundancy masquerading as reasoning, though the probe is less clean than the top two directions.",
"reading_extract": "how many extra passes are spent on checking, whether retry baselines exist, and which errors verification actually fixes",
"title": "Verification or just expensive redundancy?",
},
"partial-observability handling": {
"question": "whether planning systems fail because they reason badly or because they are brittle under missing state information",
"confound": "state visibility",
"thesis": "Some planning failures may be epistemic before they are algorithmic: systems can look like weak planners when they are actually brittle under missing state information.",
"insight": "A strong result would show whether improved visibility changes the failure taxonomy more than planner changes do.",
"contribution_shape": "Could produce a planning-under-partial-observability benchmark slice and a clearer division between epistemic and algorithmic failure modes.",
"demotion": "This direction weakens if stronger visibility barely changes the failure pattern.",
"kill_signal": "improved visibility leaves both success rates and error types almost unchanged",
"missing_piece": "What is missing is a task slice that varies only state visibility while keeping planner scaffolding steady.",
"program_kind": "epistemic failure analysis",
"time_to_clarity": "medium",
"priority_note": "Useful if the group wants a deeper failure-analysis program, but it is currently less concrete than the lead directions.",
"reading_extract": "what information is hidden, which observations are restored by tools, and whether planner quality is tested separately from visibility",
"title": "Is planning failure really an observability problem?",
},
"retrieval policy": {
"question": "whether memory gains come from better stored knowledge or from better decisions about when and how retrieval is triggered",
"confound": "memory content versus retrieval timing",
"thesis": "Reported memory gains are often hard to interpret because memory content and retrieval triggers move together; some improvements may be protocol effects about when retrieval happens, not capability effects about what memory stores.",
"insight": "A convincing result would separate memory-content effects from retrieval-trigger effects and show whether the interpretation of current RAG-style gains changes once trigger policy is normalized.",
"contribution_shape": "Could yield a cleaner memory-evaluation protocol and a reusable result on when retrieval policy, rather than memory content, is the true hidden variable.",
"demotion": "This direction weakens if current work already varies retrieval policy independently of memory content and sees the same story.",
"kill_signal": "the main memory papers already keep the memory store fixed while varying retrieval triggers or timing and the conclusion does not move",
"missing_piece": "What is missing is a fixed-memory comparison that varies retrieval trigger and timing only, then checks whether the gain survives and which errors it actually removes.",
"program_kind": "protocol sensitivity",
"time_to_clarity": "medium",
"priority_note": "Different enough from the planning directions to stay in the lead set, but it ranks lower because the current evidence is thinner and more protocol-sensitive.",
"reading_extract": "what is stored in memory, when retrieval fires, what retrieval budget is allowed, and whether trigger policy is ever ablated independently",
"title": "Retrieval policy or memory content?",
},
"state summarization cadence": {
"question": "whether summarization helps because it improves reasoning or because it selectively compresses away troublesome state information",
"confound": "what gets forgotten versus what gets highlighted",
"thesis": "Summarization may help less because the agent reasons better and more because the system filters state in a favorable way; cadence choices can quietly redefine what information survives.",
"insight": "A useful result would show whether different summarization cadences preserve the same conclusions and failure taxonomy once memory content is held steady.",
"contribution_shape": "Could produce a summarization-control protocol for long-horizon agents and a clearer account of how state compression alters conclusions.",
"demotion": "This direction weakens if different summarization cadences preserve the same conclusions and failure taxonomy.",
"kill_signal": "changing summarization cadence leaves both retrieval behavior and downstream errors largely unchanged",
"missing_piece": "What is missing is a fixed-memory, fixed-task comparison that varies only summarization cadence and records what state is lost or amplified.",
"program_kind": "state compression",
"time_to_clarity": "medium",
"priority_note": "Promising but currently less mature than retrieval-policy framing because the concrete prior-work hooks are thinner.",
"reading_extract": "what gets summarized away, how often summaries are rewritten, and whether failure cases correlate with state compression choices",
"title": "What summarization cadence is really changing",
},
"memory horizon": {
"question": "whether longer memory helps because more context is useful or because evaluations reward persistence over selectivity",
"confound": "context length and evaluation design",
"thesis": "Longer memory can look like better capability even when the benchmark mainly rewards persistence or repeated access to earlier state, so horizon effects may partly be evaluation-design effects.",
"insight": "A convincing result would show whether extending horizon changes task framing and error patterns, not just aggregate score.",
"contribution_shape": "Could produce a regime map for when horizon length matters, versus when selective retrieval or better task design explains the gain.",
"demotion": "This direction weakens if memory horizon has little effect once retrieval policy is controlled.",
"kill_signal": "memory-length changes stop mattering once retrieval triggers or evaluation design are normalized",
"missing_piece": "What is missing is a horizon study that matches retrieval policy and benchmark framing while varying how much state is retained.",
"program_kind": "regime mapping",
"time_to_clarity": "medium",
"priority_note": "Conceptually useful, but currently less decisive than retrieval-policy framing for a first discussion round.",
"reading_extract": "whether longer context actually changes retrieval choices, and whether the benchmark rewards persistence more than selective recall",
"title": "Does longer memory really help?",
},
}
@dataclass(frozen=True)
class IdeaSignal:
    signal_id: str
    cluster: str
    direction_type: str
    theme: str
    claim_or_observation: str
    tension: str
    missing_piece: str
    possible_axis: str
    academic_value: str
    evidence_confidence: str
    paper_ids: list[str]
@dataclass(frozen=True)
class DirectionCard:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    focus_axis: str
    main_confound: str
    program_kind: str
    contribution_shape: str
    time_to_clarity: str
    one_line_thesis: str
    why_interesting: str
    literature_suggests: list[str]
    closest_prior_gap: list[str]
    missing_piece: str
    possible_variants: list[str]
    academic_value: str
    first_probes: list[str]
    what_counts_as_insight: str
    weakness_conditions: list[str]
    kill_criteria: list[str]
    what_would_change_mind: list[str]
    best_fit: str
    why_this_ranks_here: str
    evidence_confidence: str
    paper_ids: list[str]
    signal_ids: list[str]
    anchor_reading_notes: list[dict[str, str]]
@dataclass(frozen=True)
class ScreenedDirection:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    total_score: float
    discussion_worthiness: int
    academic_value_score: int
    evidence_grounding: int
    direction_distinctness: int
    first_probe_clarity: int
    thesis_potential: int
    recommendation: str
    rationale: str
def read_core_set(path: Path) -> list[dict[str, str]]:
    import csv
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8", newline="") as handle:
        return [dict(row) for row in csv.DictReader(handle)]
def write_jsonl(path: Path, rows: Iterable[dict[str, Any] | Any]) -> None:
    ensure_dir(path.parent)
    encoded: list[str] = []
    for row in rows:
        value = asdict(row) if is_dataclass(row) else row
        encoded.append(json.dumps(value, ensure_ascii=False))
    atomic_write_text(path, "\n".join(encoded).rstrip() + ("\n" if encoded else ""))
def write_json(path: Path, data: Any) -> None:
    ensure_dir(path.parent)
    atomic_write_text(path, json.dumps(data, ensure_ascii=False, indent=2).rstrip() + "\n")
def uniq_keep_order(items: Iterable[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or "").strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out
def slugify(text: str) -> str:
    raw = re.sub(r"[^A-Za-z0-9]+", "-", str(text or "").strip().lower())
    return raw.strip("-") or "x"
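# Usage sketch for the two helpers above. The block restates them so it runs on
# its own; the sample inputs are made up for illustration and are not taken from
# the pipeline.

```python
import re
from typing import Iterable


def uniq_keep_order(items: Iterable[str]) -> list[str]:
    # Strip each item, drop empties, and keep only the first occurrence.
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or "").strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out


def slugify(text: str) -> str:
    # Collapse every run of non-alphanumeric characters into one hyphen.
    raw = re.sub(r"[^A-Za-z0-9]+", "-", str(text or "").strip().lower())
    return raw.strip("-") or "x"


# Illustrative inputs (assumptions, not pipeline data).
deduped = uniq_keep_order([" planning ", "memory", "planning", "", None])
slug = slugify("Tool/Environment Boundary!")
empty_slug = slugify("   ")
print(deduped, slug, empty_slug)
```

# Note the "x" fallback: an input with no alphanumeric characters still yields a
# usable, non-empty slug.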
def clean_text(text: str, *, limit: int = 220) -> str:
    s = str(text or "").strip()
    s = s.replace("\n", " ")
    s = re.sub(r"\s+", " ", s)
    s = s.replace("|", ", ")
    s = s.strip(" \"'`")
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()
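# A minimal sketch of how `clean_text` behaves, restated so the block runs on its
# own. The sample strings are invented for illustration.

```python
import re


def clean_text(text: str, *, limit: int = 220) -> str:
    s = str(text or "").strip()
    s = s.replace("\n", " ")
    s = re.sub(r"\s+", " ", s)       # collapse whitespace runs
    s = s.replace("|", ", ")         # pipes would break Markdown table cells
    s = s.strip(" \"'`")             # drop wrapping quotes and backticks
    if len(s) <= limit:
        return s
    # Over the limit: cut at the last space before it so no word is split.
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()


normalized = clean_text('  "Deeper search\n helps   on ALFWorld"  ')
truncated = clean_text("alpha beta gamma delta", limit=12)
print(normalized)
print(truncated)
```

# Truncation happens on a word boundary, so the result can be shorter than
# `limit` characters.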
def clean_sentence(text: str, *, limit: int = 180) -> str:
    # Clean with a doubled limit first so sentence boundaries past `limit` stay visible.
    s = clean_text(text, limit=limit * 2)
    if len(s) <= limit:
        return s
    window = s[:limit + 40]
    pieces = re.split(r'(?<=[.!?;])\s+', window)
    acc: list[str] = []
    total = 0
    # Accumulate whole sentences until the limit is hit or ~65% of it is filled.
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue
        nxt = total + len(piece) + (1 if acc else 0)
        if nxt > limit and acc:
            break
        acc.append(piece)
        total = nxt
        if total >= int(limit * 0.65):
            break
    if acc:
        return " ".join(acc).strip()
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()
def markdown_table(headers: list[str], rows: list[list[str]]) -> str:
    head = "| " + " | ".join(headers) + " |"
    sep = "| " + " | ".join(["---"] * len(headers)) + " |"
    body = ["| " + " | ".join(str(cell).replace("\n", " ").strip() for cell in row) + " |" for row in rows]
    return "\n".join([head, sep] + body)
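# Usage sketch for `markdown_table`, restated so the block is self-contained.
# The header names and row values are illustrative assumptions.

```python
def markdown_table(headers: list[str], rows: list[list[str]]) -> str:
    head = "| " + " | ".join(headers) + " |"
    sep = "| " + " | ".join(["---"] * len(headers)) + " |"
    # Cells are stringified and newlines flattened so each row stays on one line.
    body = ["| " + " | ".join(str(cell).replace("\n", " ").strip() for cell in row) + " |" for row in rows]
    return "\n".join([head, sep] + body)


table = markdown_table(
    ["Direction", "Score"],
    [["Search depth", 4], ["Retrieval\npolicy", 3.5]],
)
print(table)
```

# The helper does not escape `|` inside cells; callers that may pass raw text
# would likely want to run it through `clean_text` first.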
def write_markdown(path: Path, text: str) -> None:
    ensure_dir(path.parent)
    atomic_write_text(path, text.rstrip() + "\n")
def extract_goal_from_goal_md(path: Path) -> str:
    if not path.exists():
        return "research ideas"
    lines = [ln.strip() for ln in path.read_text(encoding="utf-8", errors="ignore").splitlines() if ln.strip() and not ln.startswith("#")]
    return lines[0] if lines else "research ideas"
def _idea_workspace_from_brief(path: Path) -> Path | None:
    try:
        return path.resolve().parents[2]
    except Exception:
        return None
def _require_positive_float(value: Any, *, field_name: str) -> float:
    try:
        parsed = float(value)
    except Exception as exc:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
    if parsed <= 0:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return parsed
def _query_int_override(workspace: Path, key: str, default: int) -> int:
    queries_path = workspace / "queries.md"
    normalized = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
    if not normalized:
        return int(default)
    if normalized not in pipeline_overridable_query_fields(workspace):
        return int(default)
    if not queries_path.exists():
        return int(default)
    # Scan `- key: value  # comment` bullets; the first bullet whose key matches wins.
    for raw in queries_path.read_text(encoding="utf-8", errors="ignore").splitlines():
        line = raw.strip()
        if not line.startswith("- ") or ":" not in line:
            continue
        name, value = line[2:].split(":", 1)
        name = name.strip().lower().replace(" ", "_").replace("-", "_")
        if name != normalized:
            continue
        cleaned = value.split("#", 1)[0].strip().strip('"').strip("'")
        if not cleaned:
            raise ValueError(f"Invalid ideation override: `{normalized}` is empty in queries.md.")
        try:
            parsed = int(cleaned)
        except Exception as exc:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be an integer in queries.md.") from exc
        if parsed <= 0:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be a positive integer in queries.md.")
        return parsed
    return int(default)
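# A minimal sketch of the bullet format `_query_int_override` reads from
# queries.md. `parse_int_override` is a hypothetical stand-in: it takes the file
# text directly and skips the `pipeline_overridable_query_fields` allow-list
# check, so it illustrates the line parsing only.

```python
def parse_int_override(text: str, key: str, default: int) -> int:
    wanted = key.strip().lower().replace(" ", "_").replace("-", "_")
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("- ") or ":" not in line:
            continue
        name, value = line[2:].split(":", 1)
        if name.strip().lower().replace(" ", "_").replace("-", "_") != wanted:
            continue
        # Drop a trailing `# comment`, then any wrapping quotes.
        cleaned = value.split("#", 1)[0].strip().strip('"').strip("'")
        parsed = int(cleaned)  # raises ValueError on empty or non-numeric text
        if parsed <= 0:
            raise ValueError(f"`{wanted}` must be a positive integer")
        return parsed
    return default


# Illustrative file content (an assumption, not a real queries.md).
queries = """\
# Queries
- report top n: 3  # keep the memo short
- idea-shortlist-size: "5"
- notes: free text
"""

report_top_n = parse_int_override(queries, "report_top_n", 7)
shortlist_size = parse_int_override(queries, "idea_shortlist_size", 9)
pool_max = parse_int_override(queries, "direction_pool_max", 12)
print(report_top_n, shortlist_size, pool_max)
```

# Spaces and hyphens in keys are both normalized to underscores, so
# `report top n`, `report-top-n`, and `report_top_n` all address the same field.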
def _validate_score_weights(value: Any, *, field_name: str) -> dict[str, float]:
    # Only the keys of `required` are enforced below; the numeric values are not read.
    required = {
        "discussion_worthiness": 0.24,
        "academic_value": 0.22,
        "evidence_grounding": 0.18,
        "direction_distinctness": 0.16,
        "first_probe_clarity": 0.10,
        "thesis_potential": 0.10,
    }
    if not isinstance(value, dict):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    weights: dict[str, float] = {}
    for key in required:
        if key not in value:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{key}")
        weights[key] = _require_positive_float(value.get(key), field_name=f"{field_name}.{key}")
    total = sum(weights.values())
    # Defensive guard: every weight was validated as positive, so this should not trigger.
    if total <= 0:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    # Normalize so the weights sum to 1 (up to 6-decimal rounding).
    return {key: round(weight / total, 6) for key, weight in weights.items()}
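# A small sketch of the normalization step in `_validate_score_weights`.
# `normalize_weights` is a hypothetical helper and the raw weights are
# illustrative: the point is that a contract may state weights on any positive
# scale and still end up with shares that sum to 1.

```python
def normalize_weights(raw: dict[str, float]) -> dict[str, float]:
    for key, weight in raw.items():
        if float(weight) <= 0:
            raise ValueError(f"weight for {key} must be positive")
    total = sum(float(w) for w in raw.values())
    # Same rounding as the validator above: 6 decimal places per share.
    return {key: round(float(w) / total, 6) for key, w in raw.items()}


# Weights written as percentages rather than fractions (an assumption).
weights = normalize_weights(
    {
        "discussion_worthiness": 24,
        "academic_value": 22,
        "evidence_grounding": 18,
        "direction_distinctness": 16,
        "first_probe_clarity": 10,
        "thesis_potential": 10,
    }
)
print(weights)
```

# Because each share is rounded independently, the sum can differ from 1.0 by a
# few millionths for weights that do not divide evenly.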
def _validate_diversity_axes(value: Any, *, field_name: str) -> list[str]:
    allowed = {"cluster", "direction_type", "program_kind"}
    if not isinstance(value, list):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    axes: list[str] = []
    seen: set[str] = set()
    for item in value:
        axis = str(item or "").strip().lower().replace("-", "_")
        if not axis:
            continue
        if axis not in allowed:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{axis}")
        if axis in seen:
            continue
        seen.add(axis)
        axes.append(axis)
    if not axes:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return axes
def resolve_idea_contract(workspace: Path) -> dict[str, Any]:
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None:
        raise ValueError("Missing active pipeline contract.")
    query_defaults = dict(spec.query_defaults)
    quality_contract = dict(spec.quality_contract)
    def _require_positive_int(value: Any, *, field_name: str) -> int:
        try:
            parsed = int(value)
        except Exception as exc:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
        if parsed <= 0:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
        return parsed
    signal_policy = quality_contract.get("signal_policy") or {}
    direction_policy = quality_contract.get("direction_policy") or {}
    screening_policy = quality_contract.get("screening_policy") or {}
    brief = parse_idea_brief(workspace / "output" / "trace" / "IDEA_BRIEF.md")
    focus_clusters = [str(x).strip() for x in (brief.get("focus_clusters") or []) if str(x).strip()]
    direction_pool_min_default = _require_positive_int(query_defaults.get("direction_pool_min"), field_name="query_defaults.direction_pool_min")
    direction_pool_max_default = _require_positive_int(query_defaults.get("direction_pool_max"), field_name="query_defaults.direction_pool_max")
    shortlist_size_default = _require_positive_int(query_defaults.get("idea_shortlist_size"), field_name="query_defaults.idea_shortlist_size")
    report_top_n_default = _require_positive_int(query_defaults.get("report_top_n"), field_name="query_defaults.report_top_n")
    idea_screen_top_n_default = _require_positive_int(query_defaults.get("idea_screen_top_n"), field_name="query_defaults.idea_screen_top_n")
    shortlist_min = _require_positive_int(direction_policy.get("shortlist_min"), field_name="quality_contract.direction_policy.shortlist_min")
    shortlist_max = _require_positive_int(direction_policy.get("shortlist_max"), field_name="quality_contract.direction_policy.shortlist_max")
    keep_min = _require_positive_int(direction_policy.get("keep_min"), field_name="quality_contract.direction_policy.keep_min")
    cluster_diversity_min = _require_positive_int(direction_policy.get("cluster_diversity_min"), field_name="quality_contract.direction_policy.cluster_diversity_min")
    lead_diversity_target = _require_positive_int(direction_policy.get("lead_diversity_target"), field_name="quality_contract.direction_policy.lead_diversity_target")
    lead_diversity_axes = _validate_diversity_axes(
        direction_policy.get("lead_diversity_axes"),
        field_name="quality_contract.direction_policy.lead_diversity_axes",
    )
    signal_table_min = _require_positive_int(signal_policy.get("min_rows"), field_name="quality_contract.signal_policy.min_rows")
    keep_rank_max = _require_positive_int(screening_policy.get("keep_rank_max"), field_name="quality_contract.screening_policy.keep_rank_max")
    maybe_rank_max = _require_positive_int(screening_policy.get("maybe_rank_max"), field_name="quality_contract.screening_policy.maybe_rank_max")
    score_weights = _validate_score_weights(
        screening_policy.get("score_weights"),
        field_name="quality_contract.screening_policy.score_weights",
    )
    direction_pool_min = _query_int_override(workspace, "direction_pool_min", direction_pool_min_default)
    direction_pool_max = _query_int_override(workspace, "direction_pool_max", direction_pool_max_default)
    shortlist_size = _query_int_override(workspace, "idea_shortlist_size", shortlist_size_default)
    report_top_n = _query_int_override(workspace, "report_top_n", report_top_n_default)
    idea_screen_top_n = _query_int_override(workspace, "idea_screen_top_n", idea_screen_top_n_default)
    if direction_pool_min > direction_pool_max:
        raise ValueError(f"Invalid ideation contract: direction_pool_min ({direction_pool_min}) exceeds direction_pool_max ({direction_pool_max}).")
    if shortlist_size < shortlist_min or shortlist_size > shortlist_max:
        raise ValueError(
            f"Invalid ideation contract: idea_shortlist_size ({shortlist_size}) must stay within "
            f"[{shortlist_min}, {shortlist_max}]."
        )
    if report_top_n > shortlist_size:
        raise ValueError(f"Invalid ideation contract: report_top_n ({report_top_n}) exceeds idea_shortlist_size ({shortlist_size}).")
    if maybe_rank_max < keep_rank_max:
        raise ValueError(
            f"Invalid ideation contract: maybe_rank_max ({maybe_rank_max}) must be >= keep_rank_max ({keep_rank_max})."
        )
    if idea_screen_top_n > direction_pool_max:
        raise ValueError(f"Invalid ideation contract: idea_screen_top_n ({idea_screen_top_n}) exceeds direction_pool_max ({direction_pool_max}).")
    return {
        "focus_clusters": focus_clusters,
        "direction_pool_min": direction_pool_min,
        "direction_pool_max": direction_pool_max,
        "idea_screen_top_n": idea_screen_top_n,
        "shortlist_size": shortlist_size,
        "report_top_n": report_top_n,
        "shortlist_min": shortlist_min,
        "shortlist_max": shortlist_max,
        "signal_table_min": signal_table_min,
        "keep_min": keep_min,
        "cluster_diversity_min": cluster_diversity_min,
        "lead_diversity_target": lead_diversity_target,
        "lead_diversity_axes": lead_diversity_axes,
        "keep_rank_max": keep_rank_max,
        "maybe_rank_max": maybe_rank_max,
        "score_weights": score_weights,
    }
def parse_idea_brief(path: Path) -> dict[str, Any]:
    text = path.read_text(encoding="utf-8", errors="ignore") if path.exists() else ""
    out: dict[str, Any] = {
        "goal": extract_goal_from_goal_md(path),
        "focus_clusters": [],
        "query_buckets": [],
        "exclusions": [],
        "constraints": [],
        "targets": {},
    }
    cur = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith("## "):
            cur = line[3:].strip().lower()
            continue
        if cur == "goal" and line.startswith("- Topic:"):
            out["goal"] = line.split(":", 1)[1].strip()
        if cur in {"focus after c2", "focus lenses after c2"} and line.startswith("- Focus clusters:"):
            clusters = [x.strip() for x in line.split(":", 1)[1].split(";") if x.strip()]
            cleaned: list[str] = []
            for item in clusters:
                low = item.lower()
                if "to be filled" in low or "fill after c2" in low or "placeholder" in low:
                    continue
                cleaned.append(item)
            out["focus_clusters"] = cleaned
        if cur == "query buckets" and re.match(r"^\d+\.\s+", line):
            out["query_buckets"].append(re.sub(r"^\d+\.\s+", "", line).strip())
        if cur in {"exclude terms", "exclusions"} and line.startswith("- "):
            value = line[2:].strip()
            if value and not value.lower().startswith("none"):
                out["exclusions"].append(value)
        if cur == "constraints" and line.startswith("- "):
            out["constraints"].append(line[2:].strip())
        if cur == "targets" and line.startswith("- ") and ":" in line:
            key, value = line[2:].split(":", 1)
            out["targets"][key.strip().lower().replace(" ", "_")] = value.strip()
    return out
def collect_note_index(path: Path) -> dict[str, dict[str, Any]]:
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        pid = str(rec.get("paper_id") or "").strip()
        if pid:
            out[pid] = rec
    return out
def keywords_from_cluster(name: str) -> list[str]:
    toks = [t for t in re.findall(r"[A-Za-z0-9]+", str(name or "").lower()) if t not in STOPWORDS and len(t) > 2]
    return uniq_keep_order(toks)
def score_note_to_cluster(cluster_name: str, note: dict[str, Any]) -> int:
    keys = keywords_from_cluster(cluster_name)
    blob_parts = [str(note.get("title") or "")]
    for field in ["summary_bullets", "limitations", "key_results", "method", "abstract"]:
        val = note.get(field)
        if isinstance(val, list):
            blob_parts.extend([str(x) for x in val])
        elif isinstance(val, str):
            blob_parts.append(val)
    blob = " ".join(blob_parts).lower()
    return sum(1 for key in keys if key in blob)
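# A self-contained sketch of the keyword-overlap score computed by
# `score_note_to_cluster`. The `STOPWORDS` set below is an assumption (the real
# constant is defined elsewhere in the module), `dict.fromkeys` stands in for
# `uniq_keep_order`, and the notes are invented examples.

```python
import re
from typing import Any

STOPWORDS = {"and", "the", "for", "with"}  # assumed; not the module's real list


def keywords_from_cluster(name: str) -> list[str]:
    toks = [t for t in re.findall(r"[A-Za-z0-9]+", str(name or "").lower()) if t not in STOPWORDS and len(t) > 2]
    return list(dict.fromkeys(toks))  # de-duplicate, keep first-seen order


def score_note_to_cluster(cluster_name: str, note: dict[str, Any]) -> int:
    keys = keywords_from_cluster(cluster_name)
    blob_parts = [str(note.get("title") or "")]
    for field in ["summary_bullets", "limitations", "key_results", "method", "abstract"]:
        val = note.get(field)
        if isinstance(val, list):
            blob_parts.extend(str(x) for x in val)
        elif isinstance(val, str):
            blob_parts.append(val)
    blob = " ".join(blob_parts).lower()
    # One point per cluster keyword that occurs anywhere in the note text.
    return sum(1 for key in keys if key in blob)


note = {
    "title": "Planning with tool use",
    "abstract": "We study replanning under partial observability.",
    "key_results": ["Tool calls cut errors by 12%."],
}
full_match = score_note_to_cluster("Planning and Tool Use", note)
substring_match = score_note_to_cluster("Memory", {"title": "Memoryless policies"})
print(full_match, substring_match)
```

# The match is a plain substring test, so "memory" also scores against
# "memoryless". That keeps the scorer cheap but can over-count short keywords.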
def map_notes_to_clusters(taxonomy_path: Path, notes_path: Path) -> dict[str, list[dict[str, Any]]]:
    import yaml
    taxonomy = yaml.safe_load(taxonomy_path.read_text(encoding="utf-8", errors="ignore")) if taxonomy_path.exists() else []
    notes = [r for r in read_jsonl(notes_path) if isinstance(r, dict)]
    out: dict[str, list[dict[str, Any]]] = {}
    for top in taxonomy or []:
        if not isinstance(top, dict):
            continue
        for child in top.get("children") or []:
            if not isinstance(child, dict):
                continue
            name = str(child.get("name") or "").strip()
            if not name:
                continue
            scored: list[tuple[int, dict[str, Any]]] = []
            for note in notes:
                s = score_note_to_cluster(name, note)
                if s > 0:
                    scored.append((s, note))
            scored.sort(key=lambda x: (-x[0], str(x[1].get("paper_id") or "")))
            out[name] = [n for _, n in scored[:8]]
    return out
def _cluster_profile(cluster: str) -> tuple[str, list[str], str]:
    low = cluster.lower()
    for keys, axes, direction_type, academic_value in CLUSTER_AXIS_HINTS:
        if any(key in low for key in keys):
            return direction_type, axes, academic_value
    return "research", ["assumption sensitivity", "failure analysis", "scope boundary"], "Could sharpen the way this sub-area is framed and compared."
def _note_bullets(note: dict[str, Any], field: str) -> list[str]:
    val = note.get(field)
    if isinstance(val, list):
        return [cleaned for x in val if (cleaned := clean_text(x, limit=220))]
    if isinstance(val, str) and (cleaned := clean_text(val, limit=220)):
        return [cleaned]
    return []
def _specific_limitations(note: dict[str, Any]) -> list[str]:
    items = _note_bullets(note, "limitations")
    specific = []
    for item in items:
        low = item.lower()
        if any(pat in low for pat in GENERIC_LIMITATION_PATTERNS):
            continue
        specific.append(item)
    # Fall back to the unfiltered list when every limitation is generic.
    return specific or items
def _evidence_confidence(notes: list[dict[str, Any]]) -> str:
    if not notes:
        return "low"
    if any(str(n.get("evidence_level") or "").strip() == "fulltext" for n in notes):
        return "medium-high"
    # Count notes whose leading limitation is paper-specific rather than a generic caveat.
    specific = 0
    for n in notes:
        limitations = _specific_limitations(n)
        if limitations and not any(p in limitations[0].lower() for p in GENERIC_LIMITATION_PATTERNS):
            specific += 1
    if len(notes) >= 3 and specific >= 2:
        return "medium"
    return "low-medium"
def _axis_profile(axis: str, cluster: str) -> dict[str, str]:
    base = AXIS_INSIGHT_LIBRARY.get(axis, {})
    confound = base.get("confound", "nearby design choices and evaluation framing")
    return {
        "question": base.get("question", f"whether {axis} is doing more conceptual work in {cluster.lower()} than current papers make explicit"),
        "confound": confound,
        "thesis": base.get("thesis", f"Current results in {cluster.lower()} may be hard to interpret because {axis} still moves together with {confound}."),
        "insight": base.get("insight", f"A meaningful result would show whether {axis} changes how we interpret the current literature rather than merely how we narrate it."),
        "contribution_shape": base.get("contribution_shape", f"Could turn {axis} into a cleaner explanatory variable for {cluster.lower()} rather than a background convenience."),
        "demotion": base.get("demotion", f"This direction would weaken if the strongest prior work already isolates {axis} cleanly."),
        "kill_signal": base.get("kill_signal", f"the strongest prior work already isolates {axis} against {confound}"),
        "missing_piece": base.get("missing_piece", f"What is missing is a cleaner comparison that varies {axis} while keeping {confound} fixed enough to change interpretation, not just presentation."),
        "program_kind": base.get("program_kind", "mechanism clarification"),
        "time_to_clarity": base.get("time_to_clarity", "medium"),
        "priority_note": base.get("priority_note", f"Worth discussion because it could turn {axis} into a sharper explanatory wedge for {cluster.lower()}."),
        "reading_extract": base.get("reading_extract", f"which variables change together with {axis}, what comparator is used, and whether any ablation already fixes {confound}"),
        "title": base.get("title", f"What {axis} is really doing"),
    }
def _result_fact(note: dict[str, Any]) -> str:
    candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
    scored: list[str] = []
    for cand in candidates:
        low = cand.lower()
        if any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "outperform", "achiev", "%"]) or any(ch.isdigit() for ch in cand):
            scored.append(cand)
    chosen = scored[0] if scored else _claim_text(note)
    chosen = re.sub(r"^(for instance|for example|e\.g\.)[:,]?\s*", "", chosen, flags=re.IGNORECASE)
    low = chosen.lower()
    tasks = _extract_task_mentions(note)
    task_text = "/".join(tasks[:2]) if tasks else "reported setting"
    percents = re.findall(r"\d+(?:\.\d+)?%", chosen)
    if "pass@1" in low and percents:
        if len(percents) >= 2:
            return clean_text(f"{task_text}: {percents[0]} pass@1 vs {percents[1]} in the reported comparison.", limit=95)
        return clean_text(f"{task_text}: {percents[0]} pass@1 in the reported comparison.", limit=95)
    if "success rate" in low and len(percents) >= 2:
        baseline = " over imitation/RL baselines" if "imitation" in low or "reinforcement learning" in low else " in the reported comparison"
        return clean_text(f"{task_text}: {percents[0]}/{percents[1]} success-rate gains{baseline}.", limit=95)
    if len(percents) >= 2:
        return clean_text(f"{task_text}: {percents[0]} vs {percents[1]} in the reported comparison.", limit=95)
    if len(percents) == 1:
        return clean_text(f"{task_text}: reported result reaches {percents[0]} in the abstract.", limit=95)
    return clean_text(chosen, limit=95)
def _metric_phrase(notes: list[dict[str, Any]]) -> str:
    blob = " ".join(_result_fact(note) for note in notes)
    low = blob.lower()
    if "pass@1" in low:
        return "pass@1 plus failure-type shifts"
    if "success rate" in low:
        return "success rate plus failure-type shifts"
    if "accuracy" in low:
        return "accuracy plus failure-type shifts"
    if "exact match" in low:
        return "exact match plus failure-type shifts"
    if "f1" in low:
        return "F1 plus failure-type shifts"
    if "%" in blob:
        return "the reported benchmark metric plus failure-type shifts"
    return "task success plus failure-type shifts"


def _sentence_has_concrete_hook(text: str) -> bool:
    raw = str(text or "")
    low = raw.lower()
    if any(ch.isdigit() for ch in raw):
        return True
    return any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "alfworld", "webshop", "hotpotqa", "fever", "humaneval", "webarena", "agentbench"])
def _extract_task_mentions(note: dict[str, Any]) -> list[str]:
    blob_parts: list[str] = []
    for field in ["key_results", "summary_bullets", "method", "abstract", "title"]:
        val = note.get(field)
        if isinstance(val, list):
            blob_parts.extend([str(x) for x in val])
        elif isinstance(val, str):
            blob_parts.append(val)
    blob = " ".join(blob_parts)
    low = blob.lower()
    positions: list[tuple[int, str]] = []
    for hint in BENCHMARK_HINTS:
        pos = low.find(hint.lower())
        if pos >= 0:
            positions.append((pos, hint))
    for match in re.finditer(r"\b(?:HumanEval|ALFWorld|WebShop|HotpotQA|FEVER|WebArena|AgentBench|GSM8K|MATH)\b", blob):
        positions.append((match.start(), match.group(0)))
    positions.sort(key=lambda item: (item[0], item[1]))
    return uniq_keep_order(name for _, name in positions)
def _task_phrase(notes: list[dict[str, Any]]) -> str:
    tasks: list[str] = []
    for note in notes:
        tasks.extend(_extract_task_mentions(note))
    uniq = uniq_keep_order(tasks)
    if not uniq:
        return "a small public task slice"
    if len(uniq) == 1:
        return uniq[0]
    return "/".join(uniq[:2])


def _claim_text(note: dict[str, Any]) -> str:
    candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
    for cand in candidates:
        if len(cand.split()) >= 8:
            return cand
    return clean_text(note.get("method") or note.get("title") or "", limit=220)
def _limitation_text(note: dict[str, Any], axis: str, cluster: str) -> str:
    limitations = _specific_limitations(note)
    if limitations and not any(pat in limitations[0].lower() for pat in GENERIC_LIMITATION_PATTERNS):
        return limitations[0]
    profile = _axis_profile(axis, cluster)
    return f"The paper still leaves unclear whether {axis} is genuinely responsible for the reported story in {cluster.lower()}, or whether that story is really driven by {profile['confound']}."


def _anchor_notes(note_index: dict[str, dict[str, Any]], paper_ids: list[str]) -> list[dict[str, Any]]:
    return [note_index[pid] for pid in paper_ids if pid in note_index]


def _paper_annotation(note: dict[str, Any], axis: str, cluster: str) -> str:
    title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
    profile = _axis_profile(axis, cluster)
    result = _result_fact(note)
    return clean_sentence(
        f"{title}: {result} Open gap: {axis} still moves with {profile['confound']}.",
        limit=300,
    )


def _synthesis_annotation(notes: list[dict[str, Any]], axis: str, cluster: str) -> str:
    profile = _axis_profile(axis, cluster)
    task_phrase = _task_phrase(notes)
    return clean_sentence(
        f"Across the anchor papers, the live question is whether {axis} changes interpretation itself or merely rides along with {profile['confound']}, especially on {task_phrase}.",
        limit=260,
    )
def build_signal_rows(*, cluster: str, notes: list[dict[str, Any]]) -> list[IdeaSignal]:
    if not notes:
        return []
    direction_type, axes, academic_value = _cluster_profile(cluster)
    evidence_confidence = _evidence_confidence(notes)
    anchors = notes[:3]
    paper_ids = uniq_keep_order(str(n.get("paper_id") or "").strip() for n in anchors)
    signals: list[IdeaSignal] = []
    for idx, axis in enumerate(axes[:3], start=1):
        anchor = anchors[(idx - 1) % len(anchors)]
        claim = _claim_text(anchor)
        tension = _limitation_text(anchor, axis, cluster)
        profile = _axis_profile(axis, cluster)
        missing_piece = clean_sentence(
            f"What is still missing is a direction that isolates whether {axis} is the real explanatory variable in {cluster.lower()}, rather than leaving it entangled with {profile['confound']}.",
            limit=240,
        )
        signals.append(IdeaSignal(
            signal_id=f"SIG-{slugify(cluster)[:16]}-{idx}",
            cluster=cluster,
            direction_type=direction_type,
            theme=f"{profile['title']} in {cluster}",
            claim_or_observation=claim,
            tension=clean_text(tension, limit=220),
            missing_piece=missing_piece,
            possible_axis=axis,
            academic_value=academic_value,
            evidence_confidence=evidence_confidence,
            paper_ids=paper_ids,
        ))
    return signals
def signal_table_markdown(rows: list[IdeaSignal]) -> str:
    table_rows = []
    for row in rows:
        table_rows.append([
            row.signal_id,
            row.cluster,
            row.theme,
            row.claim_or_observation,
            row.tension,
            row.missing_piece,
            row.possible_axis,
            row.academic_value,
            row.evidence_confidence,
            ", ".join(row.paper_ids),
        ])
    return "\n".join([
        "# IDEA_SIGNAL_TABLE",
        "",
        "## Research signals",
        "",
        markdown_table(["Signal ID", "Cluster", "Theme", "Claim / observation", "Tension", "Missing piece", "Possible axis", "Academic value", "Confidence", "Paper IDs"], table_rows),
        "",
    ])


def _title_from_signal(signal: IdeaSignal) -> str:
    profile = _axis_profile(signal.possible_axis, signal.cluster)
    return clean_sentence(profile["title"], limit=72)


def _best_fit(direction_type: str, axis: str) -> str:
    if direction_type == "governance":
        return "Best fit when the group wants a deployment-relevant reading of agent behavior rather than another internal benchmark story."
    if direction_type == "evaluation":
        return "Best fit when the discussion is really about what counts as convincing evidence, not just how to raise scores."
    if axis in {"search depth", "verification loop", "retrieval policy"}:
        return "Best fit for a PhD discussion that wants one sharper mechanism/confound question rather than a broad systems buildout."
    return "Best fit for PI/PhD discussions that want a thesis-worthy mechanism question rather than a generic benchmark wrapper."
def signals_to_direction_cards(signals: list[IdeaSignal], *, note_index: dict[str, dict[str, Any]], focus_clusters: list[str], pool_min: int, pool_max: int) -> list[DirectionCard]:
focus = {x.strip() for x in focus_clusters if str(x).strip()}
cards: list[DirectionCard] = []
for idx, signal in enumerate(signals, start=1):
profile = _axis_profile(signal.possible_axis, signal.cluster)
anchors = _anchor_notes(note_index, signal.paper_ids)
task_phrase = _task_phrase(anchors)
metric_phrase = _metric_phrase(anchors)
nearest = anchors[0] if anchors else {}
nearest_title = clean_text((nearest.get("title") if isinstance(nearest, dict) else "") or (signal.paper_ids[0] if signal.paper_ids else signal.cluster), limit=90)
nearest_task_phrase = _task_phrase([nearest]) if nearest else task_phrase
literature_suggests = [_paper_annotation(note, signal.possible_axis, signal.cluster) for note in anchors[:2]]
literature_suggests.append(_synthesis_annotation(anchors, signal.possible_axis, signal.cluster))
closest_prior_gap = [
clean_sentence(
f"{nearest_title} is the closest prior anchor because it already reports concrete behavior on {nearest_task_phrase}. The unresolved point is whether those gains survive once {profile['confound']} is held fixed while {signal.possible_axis} is varied.",
limit=220,
),
clean_sentence(
"The novelty test here is narrow, not rhetorical: if a strong anchor paper already runs that single-variable control, this direction should collapse quickly rather than stay alive as a vague confound story.",
limit=220,
),
]
one_line_thesis = clean_sentence(profile["thesis"], limit=230)
why_interesting = clean_sentence(
f"This is not just another benchmark wedge. It opens a {profile['program_kind']} line around {signal.possible_axis}: {profile['question']}.",
limit=220,
)
missing_piece = clean_sentence(profile["missing_piece"], limit=230)
possible_variants = [
clean_sentence(f"Single-variable control on {task_phrase}: vary {signal.possible_axis} while holding {profile['confound']} fixed as far as the setup allows.", limit=210),
clean_sentence("Replace score-only reporting with a failure-type comparison after the same intervention, so the result says more than whether the average went up.", limit=210),
clean_sentence("Check whether the conclusion survives on one simple public task slice and one more tool- or environment-heavy setting before treating it as a general claim.", limit=210),
]
first_probes = [
clean_sentence(
f"Intervention: vary {signal.possible_axis} while holding {profile['confound']} as fixed as possible on {task_phrase}. Readout: {metric_phrase}. Decisive if the interpretation changes even after the control.",
limit=220,
),
clean_sentence(
f"Prior-work audit: inspect {nearest_title} for any ablation that already fixes {profile['confound']}, and if the conclusion survives, demote this direction.",
limit=180,
),
]
what_counts_as_insight = clean_sentence(profile['insight'], limit=190)
kill_criteria = [
clean_sentence(f"Kill quickly if {profile['kill_signal']}.", limit=170),
clean_sentence(f"Kill if the first controlled probe leaves both {metric_phrase} and the failure taxonomy essentially unchanged.", limit=170),
]
weakness_conditions = [
clean_sentence(f"This direction is weaker if changing {signal.possible_axis} mostly rescales the aggregate metric without changing which failure modes appear or disappear.", limit=175),
clean_sentence(profile['demotion'], limit=170),
]
anchor_reading_notes: list[dict[str, str]] = []
for note in anchors[:3]:
title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
result = _result_fact(note)
anchor_reading_notes.append({
"paper_title": title,
"why_read": clean_sentence(f"Closest {signal.cluster.lower()} anchor with a concrete result hook on {task_phrase}: {result}", limit=180),
"what_to_extract": clean_sentence(f"Extract the metric and comparator, then check whether {profile['reading_extract']}.", limit=180),
"current_hook": clean_sentence(result, limit=180),
"kill_signal": clean_sentence(f"Weaken this direction if the paper already shows {profile['kill_signal']}.", limit=180),
})
why_this_ranks_here = clean_sentence(profile['priority_note'], limit=170)
cards.append(DirectionCard(
direction_id=f"DIR-{idx:03d}",
cluster=signal.cluster,
direction_type=signal.direction_type,
title=_title_from_signal(signal),
focus_axis=signal.possible_axis,
main_confound=profile['confound'],
program_kind=profile['program_kind'],
contribution_shape=profile['contribution_shape'],
time_to_clarity=profile['time_to_clarity'],
one_line_thesis=one_line_thesis,
why_interesting=why_interesting,
literature_suggests=literature_suggests,
closest_prior_gap=closest_prior_gap,
missing_piece=missing_piece,
possible_variants=possible_variants,
academic_value=profile['contribution_shape'],
first_probes=first_probes,
what_counts_as_insight=what_counts_as_insight,
weakness_conditions=weakness_conditions,
kill_criteria=kill_criteria,
what_would_change_mind=kill_criteria,
best_fit=_best_fit(signal.direction_type, signal.possible_axis),
why_this_ranks_here=why_this_ranks_here,
evidence_confidence=signal.evidence_confidence,
paper_ids=signal.paper_ids,
signal_ids=[signal.signal_id],
anchor_reading_notes=anchor_reading_notes,
))
cards.sort(key=lambda c: (0 if c.cluster in focus else 1, c.cluster, c.program_kind, c.title))
target = max(pool_min, min(pool_max, len(cards)))
return cards[:target]
def direction_pool_markdown(cards: list[DirectionCard]) -> str:
    rows: list[list[str]] = []
    for card in cards:
        rows.append([
            card.direction_id,
            card.cluster,
            card.direction_type,
            card.program_kind,
            card.title,
            clean_sentence(card.one_line_thesis, limit=100),
            clean_sentence(card.why_interesting, limit=100),
            clean_sentence(card.missing_piece, limit=100),
            clean_sentence(" / ".join(card.possible_variants), limit=120),
            clean_sentence(card.academic_value, limit=90),
            clean_sentence(" / ".join(card.first_probes), limit=120),
            card.evidence_confidence,
            ", ".join(card.paper_ids),
        ])
    return "\n".join([
        "# IDEA_DIRECTION_POOL",
        "",
        "## Candidate research directions",
        "",
        markdown_table(["Direction ID", "Cluster", "Type", "Program", "Title", "One-line thesis", "Why interesting", "Missing piece", "Possible variants", "Academic value", "First probes", "Confidence", "Paper IDs"], rows),
        "",
    ])


def _thesis_potential(direction_type: str, cluster: str) -> int:
    if direction_type in {"mechanism", "coordination", "adaptation"}:
        return 5
    if direction_type in {"systems", "governance"}:
        return 4
    if "benchmark" in cluster.lower() or direction_type == "evaluation":
        return 3
    return 4
def score_direction_cards(
    cards: list[DirectionCard],
    *,
    focus_clusters: list[str],
    keep_rank_max: int,
    maybe_rank_max: int,
    score_weights: dict[str, float] | None = None,
) -> list[ScreenedDirection]:
    focus = {x.strip() for x in focus_clusters if str(x).strip()}
    keep_rank_max = int(keep_rank_max)
    maybe_rank_max = int(maybe_rank_max)
    if keep_rank_max <= 0:
        raise ValueError("Invalid ideation scoring contract: keep_rank_max must be positive.")
    if maybe_rank_max < keep_rank_max:
        raise ValueError("Invalid ideation scoring contract: maybe_rank_max must be >= keep_rank_max.")
    weights = _validate_score_weights(score_weights or {}, field_name="score_direction_cards.score_weights")
    cluster_counts: dict[str, int] = {}
    program_counts: dict[str, int] = {}
    for card in cards:
        cluster_counts[card.cluster] = cluster_counts.get(card.cluster, 0) + 1
        program_counts[card.program_kind] = program_counts.get(card.program_kind, 0) + 1
    provisional: list[tuple[DirectionCard, float, int, int, int, int, int, int]] = []
    for card in cards:
        discussion_worthiness = 5 if card.cluster in focus or card.time_to_clarity == "fast" else 4
        academic_value_score = 5 if any(token in card.contribution_shape.lower() for token in ["rule", "protocol", "regime map", "causal"]) else 4
        concrete_hooks = sum(1 for item in card.literature_suggests if _sentence_has_concrete_hook(item))
        if len(card.paper_ids) >= 3 and concrete_hooks >= 2:
            evidence_grounding = 5
        elif len(card.paper_ids) >= 2 and concrete_hooks >= 1:
            evidence_grounding = 4
        else:
            evidence_grounding = 3
        if program_counts.get(card.program_kind, 0) == 1 and cluster_counts.get(card.cluster, 0) == 1:
            direction_distinctness = 5
        elif program_counts.get(card.program_kind, 0) == 1 or cluster_counts.get(card.cluster, 0) == 1:
            direction_distinctness = 4
        else:
            direction_distinctness = 3
        probe_blob = " ".join(card.first_probes).lower()
        if all(token in probe_blob for token in ["intervention:", "readout:", "decisive if"]):
            first_probe_clarity = 5
        elif "intervention:" in probe_blob:
            first_probe_clarity = 4
        else:
            first_probe_clarity = 3
        thesis_potential = _thesis_potential(card.direction_type, card.cluster)
        total = round(
            discussion_worthiness * weights["discussion_worthiness"]
            + academic_value_score * weights["academic_value"]
            + evidence_grounding * weights["evidence_grounding"]
            + direction_distinctness * weights["direction_distinctness"]
            + first_probe_clarity * weights["first_probe_clarity"]
            + thesis_potential * weights["thesis_potential"],
            2,
        )
        provisional.append((card, total, discussion_worthiness, academic_value_score, evidence_grounding, direction_distinctness, first_probe_clarity, thesis_potential))
    provisional.sort(key=lambda row: (-row[1], 0 if row[0].time_to_clarity == "fast" else 1, row[0].cluster, row[0].title))
    rows: list[ScreenedDirection] = []
    for idx, (card, total, dw, av, eg, dd, fp, tp) in enumerate(provisional, start=1):
        recommendation = "keep" if idx <= keep_rank_max else "maybe" if idx <= maybe_rank_max else "drop"
        strengths: list[str] = []
        if eg >= 5:
            strengths.append("concrete anchor evidence")
        if dd >= 5:
            strengths.append(f"distinct {card.program_kind} wedge")
        if fp >= 5:
            strengths.append("clean first probe")
        if av >= 5:
            strengths.append("thesis-sized payoff")
        if not strengths:
            strengths.append("discussion value")
        risk = "still abstract-first" if str(card.evidence_confidence).startswith("low") else f"main risk is controlling {card.main_confound} cleanly"
        rationale = clean_sentence(f"Strongest on {' + '.join(strengths[:2])}; {risk}.", limit=160)
        rows.append(ScreenedDirection(
            direction_id=card.direction_id,
            cluster=card.cluster,
            direction_type=card.direction_type,
            title=card.title,
            total_score=total,
            discussion_worthiness=dw,
            academic_value_score=av,
            evidence_grounding=eg,
            direction_distinctness=dd,
            first_probe_clarity=fp,
            thesis_potential=tp,
            recommendation=recommendation,
            rationale=rationale,
        ))
    return rows
def screening_table_markdown(rows: list[ScreenedDirection]) -> str:
    data: list[list[str]] = []
    for row in rows:
        data.append([
            row.direction_id,
            row.cluster,
            row.direction_type,
            row.title,
            f"{row.total_score:.2f}",
            str(row.discussion_worthiness),
            str(row.academic_value_score),
            str(row.evidence_grounding),
            str(row.direction_distinctness),
            str(row.first_probe_clarity),
            str(row.thesis_potential),
            row.recommendation,
            row.rationale,
        ])
    return "\n".join([
        "# IDEA_SCREENING_TABLE",
        "",
        "## Discussion-first screening",
        "",
        markdown_table(["Direction ID", "Cluster", "Type", "Title", "Total", "Discussion", "Academic value", "Evidence", "Distinctness", "First probe", "Thesis potential", "Decision", "Rationale"], data),
        "",
    ])
def shortlist_markdown(records: list[dict[str, Any]]) -> str:
lines = ["# IDEA_SHORTLIST", "", "## Prioritized directions", ""]
for record in records:
lines.extend([
f"### Direction {record['rank']}. {record['title']}",
f"- Cluster: {record['cluster']}",
f"- Type: {record['direction_type']}",
f"- Focus axis: {record.get('focus_axis')}",
f"- Program kind: {record.get('program_kind')}",
f"- Main confound: {record.get('main_confound')}",
f"- Time to clarity: {record.get('time_to_clarity')}",
f"- One-line thesis: {record['one_line_thesis']}",
f"- Why this is interesting: {record['why_interesting']}",
f"- Why this ranks here: {record['why_this_ranks_here']}",
f"- Contribution shape: {record.get('contribution_shape')}",
"- What the literature already suggests:",
])
for item in record.get("literature_suggests") or []:
lines.append(f" - {item}")
lines.extend([
"- Closest prior work and why it does not settle the question:",
])
for item in record.get("closest_prior_gap") or []:
lines.append(f" - {item}")
lines.extend([
f"- What is still missing: {record['missing_piece']}",
"- Possible variants:",
])
for item in record.get("possible_variants") or []:
lines.append(f" - {item}")
lines.extend([
f"- Why this could matter academically: {record['academic_value']}",
"- First probes:",
])
for item in record.get("first_probes") or []:
lines.append(f" - {item}")
lines.extend([
f"- What would count as actual insight: {record['what_counts_as_insight']}",
"- What would make this weak or unconvincing:",
])
for item in record.get("weakness_conditions") or []:
lines.append(f" - {item}")
lines.extend([
"- Quick kill criteria:",
])
for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
lines.append(f" - {item}")
lines.extend([
f"- Best fit: {record['best_fit']}",
f"- Evidence confidence: {record['evidence_confidence']}",
"- Anchor papers: " + ", ".join(f"`{pid}`" for pid in record.get("paper_ids") or []),
f"- Why prioritized now: {record['why_prioritized']}",
"",
])
return "\n".join(lines)
def shortlist_snapshot_table(records: list[dict[str, Any]]) -> str:
    rows = []
    for rec in records:
        rows.append([
            str(rec.get("rank") or ""),
            rec.get("title", ""),
            clean_sentence(rec.get("why_this_ranks_here", ""), limit=52),
            clean_sentence(rec.get("contribution_shape", rec.get("academic_value", "")), limit=56),
            clean_sentence((rec.get("kill_criteria") or rec.get("what_would_change_mind") or [""])[0], limit=54),
        ])
    return markdown_table(["Rank", "Direction", "Why now", "If it survives", "Fast kill signal"], rows)
def build_report_payload(*, topic: str, shortlist: list[dict[str, Any]], deferred: list[dict[str, Any]], trace_paths: dict[str, str]) -> dict[str, Any]:
top = list(shortlist)
takeaways: list[str] = []
if top:
takeaways.append("The strongest directions are the ones most likely to change how existing results are interpreted, not just add another benchmark win.")
takeaways.append("The current rank order reflects a tradeoff between time-to-clarity, thesis-sized payoff, and how concrete the nearest prior-work gap already looks.")
program_kinds = uniq_keep_order(str(rec.get("program_kind") or "").strip() for rec in top)
if len(program_kinds) >= 2:
takeaways.append(f"The lead set is intentionally not one confound template repeated three times: it spans {', '.join(program_kinds[:3])} rather than a single explanatory mold.")
else:
takeaways.append("The lead set still sits in one tight neighborhood, so the next reading pass should stress-test whether those directions are genuinely distinct thesis lines.")
if any(str(rec.get("evidence_confidence") or "").startswith("low") for rec in top):
takeaways.append("Most anchors are still abstract-first, so the memo is best used to decide the next reading and falsification pass rather than to lock a project immediately.")
else:
takeaways.append("The evidence is already concrete enough for a serious PI/PhD discussion, though one deeper paper pass could still reshuffle the exact order.")
lead = top[0] if top else {}
lead_anchor = (lead.get("anchor_reading_notes") or [{}])[0] if top else {}
lead_anchor_title = lead_anchor.get("paper_title") if isinstance(lead_anchor, dict) else "the first anchor paper"
discussion_questions = [
"Which direction still survives once we ask for a single-variable control rather than a suggestive confound story?",
"Which lead direction would still matter if the nearest prior work already addressed the obvious control we are worried about?",
"Which candidate could plausibly turn into a thesis line with a reusable method, protocol, or regime map rather than a one-off empirical note?",
"Which ranking would change most after one full-paper reading pass on the anchor set?",
]
uncertainties = [
"The main remaining risk is novelty risk, not idea scarcity: closer reading may reveal that one lead direction is already settled by an existing control or ablation.",
"Evidence confidence is still bounded by abstract-first notes for part of the lead set, so paper-specific details could either strengthen or kill a direction quickly.",
"The memo is more reliable as a discussion and triage artifact than as a final commitment on exact project order.",
]
next_steps = []
if top:
next_steps.append(clean_sentence(f"Start with {lead.get('title')}: read {lead_anchor_title or 'the first anchor paper'} looking specifically for whether {lead.get('focus_axis')} is already isolated against {lead.get('main_confound')}.", limit=220))
first_probe = (lead.get("first_probes") or [""])[0]
if first_probe:
next_steps.append(clean_sentence(first_probe, limit=220))
next_steps.append(clean_sentence(f"Re-rank only after checking the quick kill criteria for {lead.get('title')} and at least one competing direction in the lead set.", limit=220))
else:
next_steps.extend([
"Read the anchor papers for the current top direction in full and test whether the memo's stated hidden variable still looks unresolved.",
"In the next PI/PhD discussion, use the ranking arguments and kill criteria to decide whether any direction deserves promotion into a sharper thesis candidate.",
])
return {
"topic": topic,
"takeaways": takeaways,
"top_directions": top,
"deferred_directions": deferred,
"discussion_questions": discussion_questions,
"uncertainties": uncertainties,
"next_steps": next_steps,
"trace_artifacts": trace_paths,
}
def report_markdown(payload: dict[str, Any]) -> str:
top_dirs = payload.get("top_directions") or []
deferred_idx = 3 + len(top_dirs)
discussion_idx = deferred_idx + 1
uncertainty_idx = deferred_idx + 2
next_idx = deferred_idx + 3
appendix_idx = deferred_idx + 4
lines = [
"# Research Idea Brainstorm Memo",
"",
"## 0. Scope and framing",
"",
f"- Topic: {payload.get('topic') or 'research ideas'}",
"- Intended readers: PI / PhD",
"- Goal: surface a small number of discussion-worthy research directions rather than force a final project choice.",
"- Current evidence basis: abstract-first notes unless otherwise stated.",
"- What this memo is not: not a final project spec, not a survey draft, and not a symmetric top-3 proposal pack.",
"",
"## 1. Big-picture takeaways",
"",
]
for item in payload.get("takeaways") or []:
lines.append(f"- {item}")
lines.extend([
"",
"## 2. Top directions at a glance",
"",
shortlist_snapshot_table(payload.get("top_directions") or []),
"",
])
for idx, record in enumerate(top_dirs, start=1):
lines.extend([
f"## {idx + 2}. Direction {idx} — {record.get('title')}",
"",
"### One-line thesis",
f"- {record.get('one_line_thesis')}",
"",
"### Why it belongs in the lead set",
f"- {record.get('why_this_ranks_here')}",
f"- Time to clarity: {record.get('time_to_clarity')}",
"",
"### What the current literature actually shows",
])
for item in record.get("literature_suggests") or []:
lines.append(f"- {item}")
lines.extend([
"",
"### Closest prior work and remaining gap",
])
for item in record.get("closest_prior_gap") or []:
lines.append(f"- {item}")
lines.extend([
f"- Missing piece: {record.get('missing_piece')}",
"",
"### If this direction is right, what contribution emerges",
f"- {record.get('contribution_shape') or record.get('academic_value')}",
f"- {record.get('what_counts_as_insight')}",
"",
"### Smallest decisive probe",
])
for item in record.get("first_probes") or []:
lines.append(f"- {item}")
lines.extend([
"",
"### Quick kill criteria",
])
for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
lines.append(f"- {item}")
lines.extend(["",])
lines.extend([
f"## {deferred_idx}. Other promising but not prioritized directions",
"",
])
deferred = payload.get("deferred_directions") or []
if deferred:
for record in deferred:
lines.append(f"- **{record.get('title')}** — {record.get('one_line_thesis')} (why not prioritized now: {record.get('why_not_prioritized')})")
else:
lines.append("- No additional deferred directions were retained in the current memo.")
lines.extend([
"",
f"## {discussion_idx}. Cross-cutting discussion questions",
"",
])
for item in payload.get("discussion_questions") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {uncertainty_idx}. Uncertainty and disagreement",
"",
])
for item in payload.get("uncertainties") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {next_idx}. Suggested next reading / next discussion step",
"",
])
for item in payload.get("next_steps") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {appendix_idx}. Appendix guide",
"",
"- `output/APPENDIX.md` for anchor-paper reading notes and deferred directions.",
"- `output/REPORT.json` for the structured version of this memo.",
])
return "\n".join(lines)
def appendix_markdown(payload: dict[str, Any], *, core_titles: dict[str, str]) -> str:
lines = [
"# Appendix to the Research Idea Brainstorm Memo",
"",
"## A. Deferred but still promising directions",
"",
]
deferred = payload.get("deferred_directions") or []
if deferred:
for rec in deferred:
lines.extend([
f"### {rec.get('title')}",
f"- One-line thesis: {rec.get('one_line_thesis')}",
f"- Program kind: {rec.get('program_kind')}",
f"- Why interesting: {rec.get('why_interesting')}",
f"- Why not prioritized now: {rec.get('why_not_prioritized')}",
"",
])
else:
lines.append("- (none)")
lines.extend([
"## B. Anchor papers and what to extract",
"",
])
for rec in payload.get("top_directions") or []:
lines.append(f"### {rec.get('title')}")
lines.append(f"- Program kind: {rec.get('program_kind')}")
lines.append(f"- Lead-set reason: {rec.get('why_this_ranks_here')}")
notes = rec.get("anchor_reading_notes") or []
if notes:
rows: list[list[str]] = []
for item in notes:
if not isinstance(item, dict):
continue
rows.append([
item.get("paper_title", "paper"),
item.get("why_read", ""),
item.get("what_to_extract", ""),
item.get("current_hook", ""),
item.get("kill_signal", ""),
])
if rows:
lines.append(markdown_table(["Anchor paper", "Why read now", "What to extract", "Current evidence hook", "Kill signal"], rows))
else:
for pid in rec.get("paper_ids") or []:
title = core_titles.get(pid, pid)
lines.append(f"- {title}")
else:
for pid in rec.get("paper_ids") or []:
title = core_titles.get(pid, pid)
lines.append(f"- {title}")
lines.append("")
return "\n".join(lines)
FILE:tooling/pipeline_spec.py
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import Any
import yaml
@dataclass(frozen=True)
class PipelineStage:
    id: str
    title: str
    checkpoint: str
    mode: str
    required_skills: tuple[str, ...]
    optional_skills: tuple[str, ...]
    produces: tuple[str, ...]
    human_checkpoint: dict[str, Any]

    @staticmethod
    def from_mapping(stage_id: str, data: Any, *, path: Path) -> "PipelineStage":
        if not isinstance(data, dict):
            raise ValueError(f"`stages.{stage_id}` must be a mapping in {path}")
        return PipelineStage(
            id=str(stage_id or "").strip(),
            title=str(data.get("title") or stage_id).strip() or str(stage_id),
            checkpoint=str(data.get("checkpoint") or stage_id).strip() or str(stage_id),
            mode=str(data.get("mode") or "").strip(),
            required_skills=_string_tuple(data.get("required_skills"), field_name=f"stages.{stage_id}.required_skills", path=path),
            optional_skills=_string_tuple(data.get("optional_skills"), field_name=f"stages.{stage_id}.optional_skills", path=path),
            produces=_string_tuple(data.get("produces"), field_name=f"stages.{stage_id}.produces", path=path),
            human_checkpoint=_mapping(data.get("human_checkpoint"), field_name=f"stages.{stage_id}.human_checkpoint", path=path, required=False),
        )


@dataclass(frozen=True)
class PipelineSpec:
    path: Path
    name: str
    version: str
    units_template: str
    default_checkpoints: tuple[str, ...]
    profile: str
    routing_hints: tuple[str, ...]
    routing_default: bool
    routing_priority: int
    target_artifacts: tuple[str, ...]
    contract_model: str
    structure_mode: str
    pre_retrieval_shell: dict[str, Any]
    binding_layers: tuple[str, ...]
    core_chapter_h3_target: int
    query_defaults: dict[str, Any]
    overridable_query_fields: tuple[str, ...]
    quality_contract: dict[str, Any]
    loop_policy: dict[str, Any]
    stages: dict[str, PipelineStage]
    variant_of: str
    variant_overrides: dict[str, Any]

    @staticmethod
    def load(path: Path) -> "PipelineSpec":
        resolved = path.resolve()
        raw_frontmatter, frontmatter = _load_variant_aware_frontmatter(resolved)
        name = str(frontmatter.get("name") or path.stem.replace(".pipeline", ""))
        units_template = str(frontmatter.get("units_template") or "")
        version = str(frontmatter.get("version") or "")
        default_checkpoints = _string_tuple(frontmatter.get("default_checkpoints"), field_name="default_checkpoints", path=resolved)
        profile = str(frontmatter.get("profile") or "default").strip() or "default"
        routing_hints = _string_tuple(frontmatter.get("routing_hints"), field_name="routing_hints", path=resolved)
        routing_default = bool(frontmatter.get("routing_default"))
        try:
            routing_priority = int(frontmatter.get("routing_priority") or 0)
        except Exception:
            routing_priority = 0
        target_artifacts = _string_tuple(frontmatter.get("target_artifacts"), field_name="target_artifacts", path=resolved)
        contract_model = str(frontmatter.get("contract_model") or "").strip()
        structure_mode = str(frontmatter.get("structure_mode") or "").strip()
        pre_retrieval_shell = _mapping(frontmatter.get("pre_retrieval_shell"), field_name="pre_retrieval_shell", path=resolved, required=False)
        binding_layers = _string_tuple(frontmatter.get("binding_layers"), field_name="binding_layers", path=resolved)
        try:
            core_chapter_h3_target = int(frontmatter.get("core_chapter_h3_target") or 0)
        except Exception:
            core_chapter_h3_target = 0
        query_defaults = _mapping(frontmatter.get("query_defaults"), field_name="query_defaults", path=resolved, required=False)
        overridable_query_fields = _string_tuple(
            frontmatter.get("overridable_query_fields"),
            field_name="overridable_query_fields",
            path=resolved,
        )
        quality_contract = _mapping(frontmatter.get("quality_contract"), field_name="quality_contract", path=resolved, required=False)
        loop_policy = _mapping(frontmatter.get("loop_policy"), field_name="loop_policy", path=resolved, required=False)
        stages = _parse_stages(frontmatter.get("stages"), path=resolved)
        variant_of = str(raw_frontmatter.get("variant_of") or "").strip()
        variant_overrides = _mapping(raw_frontmatter.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False)
        if not units_template:
            raise ValueError(f"Missing units_template in pipeline front matter: {resolved}")
        return PipelineSpec(
            path=resolved,
            name=name,
            version=version,
            units_template=units_template,
            default_checkpoints=default_checkpoints,
            profile=profile,
            routing_hints=routing_hints,
            routing_default=routing_default,
            routing_priority=routing_priority,
            target_artifacts=target_artifacts,
            contract_model=contract_model,
            structure_mode=structure_mode,
            pre_retrieval_shell=pre_retrieval_shell,
            binding_layers=binding_layers,
            core_chapter_h3_target=core_chapter_h3_target,
            query_defaults=query_defaults,
            overridable_query_fields=overridable_query_fields,
            quality_contract=quality_contract,
            loop_policy=loop_policy,
            stages=stages,
            variant_of=variant_of,
            variant_overrides=variant_overrides,
        )

    def query_default(self, key: str, default: Any = None) -> Any:
        return self.query_defaults.get(str(key or "").strip(), default)

    def allows_query_override(self, key: str) -> bool:
        return str(key or "").strip() in set(self.overridable_query_fields)


def _parse_frontmatter(text: str) -> dict[str, Any]:
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("Pipeline file must start with YAML front matter '---'")
    end_idx = None
    for idx in range(1, len(lines)):
        if lines[idx].strip() == "---":
            end_idx = idx
            break
    if end_idx is None:
        raise ValueError("Unterminated YAML front matter (missing closing '---')")
    raw = "\n".join(lines[1:end_idx])
    data = yaml.safe_load(raw) or {}
    if not isinstance(data, dict):
        raise ValueError("Pipeline YAML front matter must be a mapping")
    return data


def _load_variant_aware_frontmatter(path: Path, seen: set[Path] | None = None) -> tuple[dict[str, Any], dict[str, Any]]:
    resolved = path.resolve()
    active = set(seen or set())
    if resolved in active:
        chain = " -> ".join(str(p) for p in [*active, resolved])
        raise ValueError(f"Cyclic `variant_of` chain detected: {chain}")
    active.add(resolved)
    text = resolved.read_text(encoding="utf-8")
    raw = _parse_frontmatter(text)
    variant_of = str(raw.get("variant_of") or "").strip()
    if not variant_of:
        return raw, dict(raw)
    _validate_raw_variant_frontmatter(raw, path=resolved)
    base_path = _resolve_pipeline_reference(resolved, variant_of)
    _, base_effective = _load_variant_aware_frontmatter(base_path, seen=active)
    current = {key: raw[key] for key in ("name", "version") if key in raw}
    merged = _deep_merge(base_effective, current)
    merged = _deep_merge(
        merged,
        _mapping(raw.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False),
    )
    merged["variant_of"] = variant_of
    merged["variant_overrides"] = raw.get("variant_overrides") or {}
    return raw, merged


def _validate_raw_variant_frontmatter(raw: dict[str, Any], *, path: Path) -> None:
    allowed_raw_keys = {"name", "version", "variant_of", "variant_overrides"}
    extra_keys = sorted(str(key) for key in raw.keys() if str(key) not in allowed_raw_keys)
    if extra_keys:
        raise ValueError(
            "Variant pipeline files may only keep top-level keys "
            "`name`, `version`, `variant_of`, `variant_overrides`; "
            f"move the rest under `variant_overrides` in {path}: {', '.join(extra_keys)}"
        )


def _resolve_pipeline_reference(path: Path, ref: str) -> Path:
    value = str(ref or "").strip()
    if not value:
        raise ValueError(f"Empty `variant_of` reference in {path}")
    candidate = Path(value)
    if candidate.is_absolute() and candidate.exists():
        return candidate.resolve()
    repo_root = path.parent.parent
    for base in (path.parent, repo_root, repo_root / "pipelines"):
        direct = (base / value).resolve()
        if direct.exists():
            return direct
    stem = Path(value).name
    if stem.endswith(".pipeline.md"):
        stem = stem[: -len(".pipeline.md")]
    if stem:
        for base in (path.parent, repo_root / "pipelines"):
            direct = (base / f"{stem}.pipeline.md").resolve()
            if direct.exists():
                return direct
    raise ValueError(f"Could not resolve `variant_of: {value}` from {path}")


def _deep_merge(base: Any, override: Any) -> Any:
    if isinstance(base, list) and isinstance(override, dict) and any(str(key).startswith("__") for key in override.keys()):
        return _apply_list_patch(base, override)
    if isinstance(base, dict) and isinstance(override, dict):
        merged: dict[str, Any] = {str(k): v for k, v in base.items()}
        for key, value in override.items():
            if key in merged:
                merged[key] = _deep_merge(merged[key], value)
            else:
                merged[key] = value
        return merged
    return override


def _apply_list_patch(base: list[Any], override: dict[str, Any]) -> list[Any]:
    allowed_keys = {"__append__", "__prepend__", "__remove__", "__replace__"}
    bad_keys = sorted(str(key) for key in override.keys() if str(key) not in allowed_keys)
    if bad_keys:
        raise ValueError(
            "List patch overrides only support "
            "`__append__`, `__prepend__`, `__remove__`, `__replace__`; "
            f"got: {', '.join(bad_keys)}"
        )
    if "__replace__" in override:
        replacement = override.get("__replace__")
        if not isinstance(replacement, list):
            raise ValueError("List patch `__replace__` must be a YAML list")
        return list(replacement)
    current = list(base)
    remove_values = override.get("__remove__", [])
    prepend_values = override.get("__prepend__", [])
    append_values = override.get("__append__", [])
    for field_name, value in (
        ("__remove__", remove_values),
        ("__prepend__", prepend_values),
        ("__append__", append_values),
    ):
        if not isinstance(value, list):
            raise ValueError(f"List patch `{field_name}` must be a YAML list")
    if remove_values:
        current = [item for item in current if item not in remove_values]
    if prepend_values:
        current = list(prepend_values) + current
    if append_values:
        current = current + list(append_values)
    return current


def _mapping(value: Any, *, field_name: str, path: Path, required: bool) -> dict[str, Any]:
    if value is None:
        return {}
    if not isinstance(value, dict):
        raise ValueError(f"`{field_name}` must be a mapping in {path}")
    return {str(key): item for key, item in value.items()}


def _string_tuple(value: Any, *, field_name: str, path: Path) -> tuple[str, ...]:
    if value is None:
        return ()
    if not isinstance(value, list):
        raise ValueError(f"`{field_name}` must be a YAML list in {path}")
    out: list[str] = []
    for item in value:
        text = str(item or "").strip()
        if text:
            out.append(text)
    return tuple(out)


def _parse_stages(value: Any, *, path: Path) -> dict[str, PipelineStage]:
    if value is None:
        return {}
    items: list[tuple[str, Any]] = []
    if isinstance(value, dict):
        items = [(str(stage_id or "").strip(), data) for stage_id, data in value.items()]
    elif isinstance(value, list):
        for idx, item in enumerate(value):
            if not isinstance(item, dict):
                raise ValueError(f"`stages[{idx}]` must be a mapping in {path}")
            stage_id = str(item.get("id") or item.get("stage") or "").strip()
            if not stage_id:
                raise ValueError(f"`stages[{idx}]` is missing `id` in {path}")
            items.append((stage_id, item))
    else:
        raise ValueError(f"`stages` must be a mapping or list in {path}")
    stages: dict[str, PipelineStage] = {}
    for stage_id, data in items:
        stage = PipelineStage.from_mapping(stage_id, data, path=path)
        if not stage.id:
            raise ValueError(f"`stages` contains an empty stage id in {path}")
        stages[stage.id] = stage
    return stages
FILE:tooling/pipeline_text.py
from __future__ import annotations

import json
import re
from pathlib import Path
from typing import Any

from tooling.common import load_yaml, read_jsonl


def slug_unit_id(unit_id: str) -> str:
    raw = str(unit_id or '').strip()
    out: list[str] = []
    for ch in raw:
        out.append(ch if ch.isalnum() else '_')
    safe = ''.join(out).strip('_')
    return f'S{safe}' if safe else 'S'


def load_outline_sections(path: Path) -> list[dict[str, Any]]:
    outline = load_yaml(path) if path.exists() else []
    if not isinstance(outline, list):
        return []
    out: list[dict[str, Any]] = []
    for sec in outline:
        if not isinstance(sec, dict):
            continue
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        subsections: list[dict[str, str]] = []
        for sub in sec.get('subsections') or []:
            if not isinstance(sub, dict):
                continue
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                subsections.append({'id': sub_id, 'title': sub_title})
        if sec_id and sec_title:
            out.append({'id': sec_id, 'title': sec_title, 'subsections': subsections})
    return out


def iter_h3_units(path: Path) -> list[dict[str, str]]:
    units: list[dict[str, str]] = []
    for sec in load_outline_sections(path):
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        for sub in sec.get('subsections') or []:
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                units.append(
                    {
                        'section_id': sec_id,
                        'section_title': sec_title,
                        'sub_id': sub_id,
                        'title': sub_title,
                    }
                )
    return units


def read_jsonl_map(path: Path, key: str) -> dict[str, dict[str, Any]]:
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        val = str(rec.get(key) or '').strip()
        if val:
            out[val] = rec
    return out


def read_bib_keys(path: Path) -> set[str]:
    if not path.exists() or path.stat().st_size <= 0:
        return set()
    text = path.read_text(encoding='utf-8', errors='ignore')
    return set(re.findall(r'(?im)^@\w+\s*\{\s*([^,\s]+)\s*,', text))


def uniq_keep_order(items: list[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or '').strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out


def clean_excerpt(text: str, *, limit: int = 220) -> str:
    s = str(text or '').strip()
    s = s.replace('\n', ' ')
    s = re.sub(r'\s+', ' ', s)
    s = s.strip(' "\'`')
    s = s.replace('|', ', ')
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(' ', 1)[0].strip()
    return clipped if clipped else s[:limit].strip()


def citation_list(keys: list[str], *, max_keys: int = 3) -> str:
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'[@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'


def inline_evidence_phrase(keys: list[str], *, max_keys: int = 3) -> str:
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'in [@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'


def heading_blocks(md: str) -> list[tuple[str, str]]:
    blocks: list[tuple[str, str]] = []
    current = ''
    lines: list[str] = []
    for raw in (md or '').splitlines():
        if raw.startswith('### '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = raw[4:].strip()
            lines = []
            continue
        if raw.startswith('## '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = ''
            lines = []
            continue
        if current:
            lines.append(raw)
    if current:
        blocks.append((current, '\n'.join(lines).strip()))
    return blocks


def dump_jsonl_lines(records: list[dict[str, Any]]) -> str:
    return '\n'.join(json.dumps(r, ensure_ascii=False) for r in records).rstrip() + ('\n' if records else '')
Argument self-loop: maintain an argument ledger + premise consistency report for drafted sections. **Trigger**: argument self-loop, argument chain, premise c...
---
name: argument-selfloop
description: 'Argument self-loop: maintain an argument ledger + premise consistency report for drafted sections.
**Trigger**: argument self-loop, argument chain, premise consistency, section self-check, paragraph contract, 论证自循环, 论证链路,
前提一致性, 段落论证动作.
**Use when**: you are in C5 (PROSE allowed), `sections/*.md` exist, and you want to prevent “smooth but hollow” writing
by enforcing argument moves + premise hygiene before merge.
**Skip if**: you are pre-C2 (NO PROSE), or evidence packs are scaffolded/thin (route upstream to `evidence-selfloop` first).
**Network**: none.
**Guardrail**: do not invent facts; do not add/remove/move citation keys; do not move citations across subsections; the
argument ledger is an intermediate artifact and must never be inserted into the paper.'
version: 0.1.0
metadata:
openclaw:
requires:
anyBins:
- python3
- python
---
# Argument Self-loop (write -> self-check -> ledger -> revise)
Purpose: upgrade C5 from “generate text” to “execute argument actions under explicit constraints”.
This skill packages the write -> self-check -> ledger -> revise mechanism as a reusable, pipeline-native component:
- write section-by-section
- self-check paragraph-by-paragraph
- maintain a small argument ledger that makes dependencies explicit
- revise only what fails until the chain is continuous
It complements (not replaces) the other self-loops:
- `evidence-selfloop`: blocks writing when packs are not writeable (do not pad)
- `writer-selfloop`: blocks template voice, missing sections/leads, scope/citation violations
- `argument-selfloop` (this skill): blocks *argument discontinuity* and *premise drift* (even when prose is fluent)
## Core idea: two intermediate artifacts (never in the paper)
This skill treats “argument structure” as a first-class intermediate artifact, like evidence packs.
Outputs:
- `output/SECTION_ARGUMENT_SUMMARIES.jsonl` (structured; per-section/per-paragraph argument moves)
- `output/ARGUMENT_SKELETON.md` (compact narrative + dependency map; not a prose restatement)
- `output/ARGUMENT_SELFLOOP_TODO.md` (PASS/FAIL + actionable edits)
These files are *not* reader-facing and must never be merged into `output/DRAFT.md`.
Downstream:
- `paragraph-curator` uses `output/SECTION_ARGUMENT_SUMMARIES.jsonl` (moves/outputs) + the `## Consistency Contract` to run a controlled select->evaluate->subset->fuse pass without changing citation keys.
## What this self-loop enforces (3 invariants)
After you complete a section (H3 or key front matter), the section must satisfy:
1) **Correct narrative linkage (paragraph-to-paragraph)**
- the relation between adjacent paragraphs is explicit (cause/contrast/refinement/boundary)
- no silent topic-switch; no “jump cut”
2) **Closed argument loop (section-level)**
The section answers, in its own text (not in a hidden outline):
- what question is it resolving?
- what argument path does it take?
- what is the conclusion?
- what premises does the conclusion rely on?
3) **Premises + definitions are explicit and stable**
- new terms / protocol assumptions are defined at first use
- the definition matches global usage (no drift)
- task/metric/constraint assumptions do not silently change across sections
The self-check must result in concrete edits: add a missing definition, add a bridge sentence, add an explicit contrast, add a scope boundary, delete/reorder a paragraph, or strengthen the local conclusion.
## Paragraph contract (argument actions)
Every paragraph must execute at least one argument action and be locally self-consistent.
Use this action set (can be combined, but never empty):
- `Claim`: a testable judgement/conclusion (avoid generic background)
- `Definition/Setup`: introduce a concept, assumption, task definition, protocol, comparison set
- `Justification`: reasoning chain or evidence support (including citations)
- `Contrast/Differentiation`: clarify differences, remove ambiguity
- `Boundary/Failure`: applicability limits, failure modes, threats to validity
- `Local Conclusion`: a reusable takeaway / constraint that downstream paragraphs can rely on
One-sentence self-check (per paragraph):
- "This paragraph’s action(s) are: <…>. Its output is: <…>."
If you cannot answer, the paragraph must be rewritten/merged/split until the action and output are clear.
## How to run it (LLM-first workflow)
1) Pick the scope of this pass
- default: run it after `writer-selfloop` PASS, before merge
- incremental: run it after finishing 1-2 H3s, so you catch drift early
2) For each target section file (start with H3 bodies)
- read the section
- do a paragraph-by-paragraph action labeling *in the ledger*, not in the prose
- identify failures (missing definition, missing bridge, missing conclusion, implicit premise)
- apply the fix to the *section file* (`sections/S<sub_id>.md`) without changing citation keys
3) Update the two-level ledger
- write/update the record for that section in `output/SECTION_ARGUMENT_SUMMARIES.jsonl`
- update `output/ARGUMENT_SKELETON.md` so it reflects:
- the section’s functional role in the paper
- what premises it consumes
- what conclusions/definitions it produces for downstream sections
4) Write `output/ARGUMENT_SELFLOOP_TODO.md`
- `- Status: FAIL` + a list of concrete edits when any section fails
- `- Status: PASS` only when all required sections are coherent and premises are stable
5) Rerun until PASS
## Output contract
### `output/ARGUMENT_SELFLOOP_TODO.md`
Must exist and start with:
- `- Status: PASS|FAIL`
Recommended structure (keep it short and debuggable):
- `## Failures (blocking)`
- `## Fix plan (actionable edits)` (per file)
- `## Premise drift watchlist (non-blocking)`
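The status-line rule above can be checked mechanically. The following is a minimal sketch, assuming the status must be the first non-empty line of the file; `read_status` is a hypothetical helper for illustration and is not claimed to match the bundled `scripts/run.py`.

```python
from __future__ import annotations

import re

# Matches the contract line "- Status: PASS" or "- Status: FAIL".
STATUS_RE = re.compile(r"^- Status:\s*(PASS|FAIL)\s*$")


def read_status(text: str) -> str | None:
    """Return 'PASS' or 'FAIL' if the first non-empty line is a status line, else None."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines (assumption: these are tolerated)
        match = STATUS_RE.match(line.strip())
        return match.group(1) if match else None
    return None


if __name__ == "__main__":
    sample = "- Status: FAIL\n## Failures (blocking)\n- sections/S3_2.md: missing bridge\n"
    print(read_status(sample))
```

A caller could treat `None` the same as `FAIL`, so that a malformed report blocks the pipeline rather than passing silently.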
### `output/SECTION_ARGUMENT_SUMMARIES.jsonl`
JSONL (one record per section/subsection).
Required fields per record:
- `kind`: `h3` | `front_matter` | `discussion` | `conclusion` (minimal set)
- `id`: for H3 use the subsection id (e.g., `"3.2"`)
- `title`
- `section_id`, `section_title` (for H3)
- `section_role`: what this unit does in the paper (e.g., `mechanism`, `evaluation_lens`, `risk_lens`, `synthesis`)
- `depends_on`: list of premises/definitions it assumes
- `adds`: list of premises/definitions/conclusions it introduces
- `paragraphs`: list of objects, each with:
- `i` (1-based paragraph index)
- `moves` (non-empty list; pick from: `claim`, `definition_setup`, `justification`, `contrast`, `boundary_failure`, `local_conclusion`)
- `output` (one sentence: what this paragraph produces)
Notes:
- This is an intermediate ledger: short, structural, no prose restatement.
- Do not paste long sentences from the draft. Use short summaries.
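To make the record shape concrete, here is a minimal sketch of a per-record check built only from the field list above. `check_record` and the example values are hypothetical; the bundled validator may apply different or stricter rules.

```python
from __future__ import annotations

import json

ALLOWED_KINDS = {"h3", "front_matter", "discussion", "conclusion"}
ALLOWED_MOVES = {
    "claim", "definition_setup", "justification",
    "contrast", "boundary_failure", "local_conclusion",
}
REQUIRED_FIELDS = ("kind", "id", "title", "section_role", "depends_on", "adds", "paragraphs")


def check_record(record: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record looks valid."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if record.get("kind") not in ALLOWED_KINDS:
        problems.append(f"unknown kind: {record.get('kind')!r}")
    if record.get("kind") == "h3":
        # H3 records additionally carry their parent section.
        problems += [f"missing field: {name}" for name in ("section_id", "section_title") if name not in record]
    for para in record.get("paragraphs") or []:
        if not isinstance(para, dict):
            problems.append("paragraph entry is not an object")
            continue
        moves = para.get("moves") or []
        if not moves:
            problems.append(f"paragraph {para.get('i')}: empty moves")
        unknown = sorted(set(moves) - ALLOWED_MOVES)
        if unknown:
            problems.append(f"paragraph {para.get('i')}: unknown moves {unknown}")
        if not str(para.get("output") or "").strip():
            problems.append(f"paragraph {para.get('i')}: missing output")
    return problems


# Illustrative record only; titles and outputs are made up for the example.
example = {
    "kind": "h3", "id": "3.2", "title": "Example subsection",
    "section_id": "3", "section_title": "Example section",
    "section_role": "mechanism", "depends_on": [], "adds": ["example definition"],
    "paragraphs": [{"i": 1, "moves": ["claim", "justification"], "output": "States the example claim."}],
}

if __name__ == "__main__":
    line = json.dumps(example, ensure_ascii=False)  # one JSONL line
    print(check_record(json.loads(line)))
```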
### `output/ARGUMENT_SKELETON.md`
A compact narrative/dependency map (not a retelling of the paper).
It should include:
- each H2/H3's **necessity** (what gap it fills)
- explicit **dependencies** (premises consumed, outputs produced)
- a global **Consistency Contract** section (single source of truth) that must not drift across edits:
- canonical terminology + synonym policy (what to call the same thing)
- task/environment/threat-model boundary (what counts as in-scope)
- evaluation protocol fields that make numbers interpretable (task + metric + constraint/budget/tool access)
- comparison set naming policy (baseline families; avoid drifting labels)
Minimum format requirement:
- `output/ARGUMENT_SKELETON.md` must contain a heading line: `## Consistency Contract`
Change rule (regression trigger):
- If you change any definition/protocol assumption/term naming, update the Consistency Contract first, then revise the affected `sections/*.md` to match, and rerun this self-loop until PASS.
Keep it "writer-facing": no reader signposting, no “in this section we…”.
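The minimum format requirement reduces to a single heading check. A minimal sketch, assuming the heading must appear on its own line exactly as written; `has_consistency_contract` is a hypothetical name.

```python
def has_consistency_contract(markdown: str) -> bool:
    """True when the skeleton contains the exact heading line '## Consistency Contract'."""
    return any(line.strip() == "## Consistency Contract" for line in markdown.splitlines())


if __name__ == "__main__":
    skeleton = "# Argument skeleton\n\n## Consistency Contract\n- canonical terminology: ...\n"
    print(has_consistency_contract(skeleton))
```

Matching the whole line (rather than a substring) avoids false positives from prose that merely mentions the contract.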
## Routing rules (avoid polishing around missing substance)
- If a section cannot produce a justified claim without new evidence: STOP and route to `evidence-selfloop`.
- If a section fails due to template voice / missing citations / out-of-scope keys: route to `writer-selfloop` / `citation-*` first.
- This skill is for argument continuity and premise hygiene, not for adding new facts.
## Script (generator + validator)
This skill includes a validator script so the pipeline can block on missing ledgers.
It does not write paper prose, but it does generate the required ledger artifacts from existing `sections/*.md` files and then validates coverage/consistency.
### Quick Start
- `python scripts/run.py --workspace workspaces/<ws>`
### All Options
- `--workspace <dir>`
- `--unit-id <U###>`
- `--inputs <semicolon-separated>`
- `--outputs <semicolon-separated>`
- `--checkpoint <C#>`
### Examples
- Validate that the ledgers exist, report PASS, and cover every H3:
- `python scripts/run.py --workspace workspaces/<ws>`
FILE:AGENTS.md
# AGENTS.md
This exported skill uses `AGENTS.md` only as a local repo-root marker for bundled helper scripts.
Source skill: `argument-selfloop` from `research-units-pipeline-skills`.
Export slug: `argument-selfloop`.
FILE:pipelines/arxiv-survey-latex.pipeline.md
---
name: arxiv-survey-latex
version: 3.8
variant_of: arxiv-survey
variant_overrides:
  routing_hints: [latex, pdf, tex, 可编译, 编译]
  routing_default: false
  routing_priority: 20
  units_template: templates/UNITS.arxiv-survey-latex.csv
  target_artifacts:
    __append__:
      - latex/main.tex
      - latex/main.pdf
      - output/LATEX_BUILD_REPORT.md
  stages:
    C5:
      title: Draft + PDF
      required_skills:
        __append__:
          - latex-scaffold
          - latex-compile-qa
      optional_skills:
        __remove__:
          - latex-scaffold
          - latex-compile-qa
      produces:
        __append__:
          - latex/main.tex
          - latex/main.pdf
          - output/LATEX_BUILD_REPORT.md
---
# Pipeline: arXiv survey / review (MD-first + LaTeX/PDF)
Variant of `arxiv-survey`.
Use this pipeline only when the default deliverable must include:
- `latex/main.tex`
- `latex/main.pdf`
- `output/LATEX_BUILD_REPORT.md`
All non-PDF survey behavior, defaults, checkpoints, and earlier stages inherit from `arxiv-survey`.
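How the `__append__` / `__remove__` patches in the front matter combine with the base pipeline can be illustrated with a trimmed re-implementation of the loader's `_deep_merge` / `_apply_list_patch` logic. This is a sketch only: it omits the loader's validation and error messages, and the base lists below are abbreviated stand-ins for the `arxiv-survey` C5 stage.

```python
from typing import Any


def apply_list_patch(base: list, patch: dict) -> list:
    """Apply __replace__, or else __remove__ then __prepend__ then __append__, to a list."""
    if "__replace__" in patch:
        return list(patch["__replace__"])
    removed = patch.get("__remove__", [])
    current = [item for item in base if item not in removed]
    return list(patch.get("__prepend__", [])) + current + list(patch.get("__append__", []))


def deep_merge(base: Any, override: Any) -> Any:
    """Merge mappings recursively; a dict of __ops__ over a list is treated as a list patch."""
    if isinstance(base, list) and isinstance(override, dict) and any(str(k).startswith("__") for k in override):
        return apply_list_patch(base, override)
    if isinstance(base, dict) and isinstance(override, dict):
        merged = dict(base)
        for key, value in override.items():
            merged[key] = deep_merge(merged[key], value) if key in merged else value
        return merged
    return override  # scalars and plain lists replace the base value


# Abbreviated base stage (the real C5 lists are much longer).
base_c5 = {
    "title": "Draft",
    "required_skills": ["section-merger", "draft-polisher"],
    "optional_skills": ["prose-writer", "latex-scaffold", "latex-compile-qa"],
}
overrides = {
    "title": "Draft + PDF",
    "required_skills": {"__append__": ["latex-scaffold", "latex-compile-qa"]},
    "optional_skills": {"__remove__": ["latex-scaffold", "latex-compile-qa"]},
}
merged = deep_merge(base_c5, overrides)

if __name__ == "__main__":
    print(merged["title"])
    print(merged["required_skills"])
    print(merged["optional_skills"])
```

The effect is that the two LaTeX skills move from optional to required while the rest of the inherited stage stays untouched.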
FILE:pipelines/arxiv-survey.pipeline.md
---
name: arxiv-survey
version: 3.8
profile: arxiv-survey
routing_hints: [survey, review, 综述, 调研, literature review]
routing_default: true
routing_priority: 10
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
- outline/transitions.md
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/claim_evidence_matrix.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- outline/writer_context_packs.jsonl
- citations/ref.bib
- citations/verified.jsonl
- sections/sections_manifest.jsonl
- sections/h3_bodies.refined.ok
- sections/paragraphs_curated.refined.ok
- sections/style_harmonized.refined.ok
- sections/opener_varied.refined.ok
- sections/abstract.md
- sections/S1.md
- sections/S2.md
- sections/discussion.md
- sections/conclusion.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/SCHEMA_NORMALIZATION_REPORT.md
- output/EVIDENCE_SELFLOOP_TODO.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/PARAGRAPH_CURATION_REPORT.md
- output/FRONT_MATTER_REPORT.md
- output/CHAPTER_LEADS_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/GLOBAL_REVIEW.md
- output/DRAFT.md
- output/MERGE_REPORT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.arxiv-survey.csv
contract_model: pipeline.frontmatter/v1
structure_mode: section_first
pre_retrieval_shell:
  enabled: true
  approval_surface: false
  allowed_h2: [Introduction, Related Work, Core Chapters, Discussion, Conclusion]
binding_layers: [chapter_skeleton, section_bindings, section_briefs, subsection_mapping]
core_chapter_h3_target: 3
query_defaults:
  max_results: 1800
  core_size: 300
  per_subsection: 28
  global_citation_min_subsections: 4
  draft_profile: survey
  citation_target: recommended
  evidence_mode: abstract
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- per_subsection
- global_citation_min_subsections
- draft_profile
- citation_target
- enrich_metadata
- evidence_mode
- fulltext_max_papers
- fulltext_max_pages
- fulltext_min_chars
- time_window.from
- time_window.to
quality_contract:
  citation_policy:
    unique_hard_floor: 150
    unique_recommended: 165
  structure_policy:
    max_final_h2_by_profile:
      survey: 8
      deep: 9
    max_h3_by_profile:
      survey: 10
      deep: 12
  front_matter_policy:
    survey:
      introduction:
        min_cites: 35
        min_paras: 5
        min_chars: 2600
      related_work:
        min_cites: 50
        min_paras: 6
        min_chars: 3200
    deep:
      introduction:
        min_cites: 40
        min_paras: 6
        min_chars: 3000
      related_work:
        min_cites: 55
        min_paras: 7
        min_chars: 3600
  subsection_policy:
    survey:
      min_unique_citations: 12
      min_chars: 4200
    deep:
      min_unique_citations: 14
      min_chars: 5200
loop_policy:
  stage_retry_budget:
    C1: 2
    C2: 2
    C3: 1
    C4: 1
  max_reroutes: 4
  require_human_on_retry_after_approval: true
stages:
  C0:
    title: Init
    mode: no_prose
    required_skills: [workspace-init, pipeline-router]
    optional_skills: []
    produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
  C1:
    title: Retrieval & core set
    mode: no_prose
    required_skills: [literature-engineer, dedupe-rank]
    optional_skills: [keyword-expansion, survey-seed-harvest]
    produces: [papers/papers_raw.jsonl, papers/retrieval_report.md, papers/papers_dedup.jsonl, papers/core_set.csv]
  C2:
    title: Structure
    mode: no_prose
    required_skills: [taxonomy-builder, chapter-skeleton, section-bindings, section-briefs, outline-builder, section-mapper, outline-refiner, pipeline-router, human-checkpoint]
    optional_skills: [outline-budgeter]
    produces: [outline/taxonomy.yml, outline/chapter_skeleton.yml, outline/section_bindings.jsonl, outline/section_binding_report.md, outline/section_briefs.jsonl, outline/outline.yml, outline/mapping.tsv, outline/coverage_report.md, outline/outline_state.jsonl, output/REROUTE_STATE.json, DECISIONS.md]
    human_checkpoint:
      approve: scope + section skeleton + outline
      write_to: DECISIONS.md
  C3:
    title: Evidence
    mode: no_prose
    required_skills: [pdf-text-extractor, paper-notes, subsection-briefs, chapter-briefs]
    optional_skills: []
    produces: [papers/fulltext_index.jsonl, papers/paper_notes.jsonl, papers/evidence_bank.jsonl, outline/subsection_briefs.jsonl, outline/chapter_briefs.jsonl]
  C4:
    title: Citations + evidence packs
    mode: no_prose
    required_skills: [citation-verifier, evidence-binder, evidence-draft, table-schema, anchor-sheet, table-filler, appendix-table-writer, schema-normalizer, writer-context-pack, evidence-selfloop, claim-matrix-rewriter]
    optional_skills: [survey-visuals]
    produces: [citations/ref.bib, citations/verified.jsonl, outline/evidence_bindings.jsonl, outline/evidence_binding_report.md, outline/table_schema.md, outline/tables_index.md, outline/tables_appendix.md, output/TABLES_APPENDIX_REPORT.md, outline/evidence_drafts.jsonl, outline/anchor_sheet.jsonl, output/SCHEMA_NORMALIZATION_REPORT.md, outline/writer_context_packs.jsonl, output/EVIDENCE_SELFLOOP_TODO.md, outline/claim_evidence_matrix.md]
  C5:
    title: Draft
    mode: prose_allowed
    required_skills: [front-matter-writer, chapter-lead-writer, subsection-writer, writer-selfloop, section-logic-polisher, argument-selfloop, paragraph-curator, style-harmonizer, opener-variator, evaluation-anchor-checker, transition-weaver, section-merger, post-merge-voice-gate, citation-diversifier, citation-injector, draft-polisher, global-reviewer, pipeline-auditor, artifact-contract-auditor]
    optional_skills: [prose-writer, subsection-polisher, redundancy-pruner, terminology-normalizer, limitation-weaver, latex-scaffold, latex-compile-qa]
    produces: [outline/transitions.md, sections/sections_manifest.jsonl, sections/h3_bodies.refined.ok, sections/paragraphs_curated.refined.ok, sections/style_harmonized.refined.ok, sections/opener_varied.refined.ok, sections/abstract.md, sections/S1.md, sections/S2.md, sections/discussion.md, sections/conclusion.md, output/WRITER_SELFLOOP_TODO.md, output/EVAL_ANCHOR_REPORT.md, output/ARGUMENT_SELFLOOP_TODO.md, output/SECTION_ARGUMENT_SUMMARIES.jsonl, output/ARGUMENT_SKELETON.md, output/PARAGRAPH_CURATION_REPORT.md, output/FRONT_MATTER_REPORT.md, output/CHAPTER_LEADS_REPORT.md, output/SECTION_LOGIC_REPORT.md, output/MERGE_REPORT.md, output/DRAFT.md, output/POST_MERGE_VOICE_REPORT.md, output/CITATION_BUDGET_REPORT.md, output/CITATION_INJECTION_REPORT.md, output/GLOBAL_REVIEW.md, output/AUDIT_REPORT.md, output/CONTRACT_REPORT.md]
---
# Pipeline: arXiv survey / review (MD-first)
Default contract (survey-grade, A150++):
- `queries.md` defaults are set for a *survey deliverable* (no silent downgrade): `core_size=300`, `per_subsection=28`, global unique citations hard floor `>=150` (recommended `>=165` when `core_size=300`; default `citation_target=recommended`).
- `draft_profile` controls **writing strictness** (`survey` vs `deep`), not “speed mode”.
- `evidence_mode` controls **evidence strength** (`abstract` default; `fulltext` optional and heavier).
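The global unique-citation contract can be sketched as a small counter over the merged draft. This assumes Pandoc-style bracketed citations such as `[@key]` or `[@a; @b]` (the form emitted by `citation_list` in `tooling/pipeline_text.py`); the function names and the three result labels are hypothetical, and the real gate in `pipeline-auditor` may count differently.

```python
from __future__ import annotations

import re

# A bracketed group containing at least one @key, e.g. "[@a]" or "[@a; @b]".
BRACKET_RE = re.compile(r"\[([^\[\]]*@[^\[\]]*)\]")
KEY_RE = re.compile(r"@([A-Za-z0-9_:\-]+)")


def unique_citation_keys(markdown: str) -> set[str]:
    """Collect distinct citation keys that appear inside bracketed citation groups."""
    keys: set[str] = set()
    for group in BRACKET_RE.findall(markdown):
        keys.update(KEY_RE.findall(group))
    return keys


def citation_gate(markdown: str, *, hard_floor: int = 150, recommended: int = 165) -> str:
    """Classify a draft against the hard floor and the recommended target."""
    count = len(unique_citation_keys(markdown))
    if count < hard_floor:
        return "FAIL"
    if count < recommended:
        return "PASS_BELOW_RECOMMENDED"  # hypothetical label for the in-between band
    return "PASS"


if __name__ == "__main__":
    draft = "Prior work [@Smith2020] and follow-ups [@Lee2021; @Smith2020] agree."
    print(sorted(unique_citation_keys(draft)))
    print(citation_gate(draft, hard_floor=2, recommended=3))
```

Counting a set of keys, rather than citation occurrences, is what distinguishes a globally under-cited draft from one that is merely locally dense.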
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- survey-seed-harvest
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- `queries.md` may specify `max_results` and a year range (`time_window.from` / `time_window.to`); `arxiv-search` will paginate and attach arXiv metadata (categories, arxiv_id, etc.) when online.
- If you import an offline export but later have network, you can set `enrich_metadata: true` in `queries.md` (or run `arxiv-search --enrich-metadata`) to backfill missing abstracts/authors/categories via arXiv `id_list`.
- Evidence-first expectation (A150++): aim for a large dedup pool (target >=1200, not ~200) and a stable, verifiable core set (`core_size=300`) so later stages can bind wide in-scope citation pools without forcing out-of-scope drift.
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- chapter-skeleton
- section-bindings
- section-briefs
- outline-builder
- section-mapper
- outline-refiner
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
human_checkpoint:
- approve: scope + section skeleton + outline
- write_to: DECISIONS.md
Notes:
- `chapter-skeleton` is the first retrieval-informed chapter contract; it stays chapter-level and does not emit stable H3 ids.
- `section-bindings` measures chapter saturation before H3 decomposition and writes PASS/BLOCKED signals into `outline/section_binding_report.md`.
- `section-briefs` turns the chapter layer into decomposition guidance plus subsection seeds; `outline-builder` then derives the first stable `outline/outline.yml` from that section layer.
- `outline-refiner` now writes both `outline/outline_state.jsonl` and `output/REROUTE_STATE.json`; section-first reroute/block information should be read from those artifacts instead of inferred from prose notes.
- Evidence-first expectation: each subsection should be written as a *question to answer* (RQ) plus *evidence needs* (what kind of citations/results are required), not just generic scaffold bullets.
- Coverage default: `section-mapper` uses `queries.md:per_subsection` as the per-H3 mapping contract (A150++ default: 28) so later evidence binding and writing have enough in-scope citations to choose from.
- Diversity expectation: mapping should not over-reuse a few papers across unrelated H3s; reserve “global” works for genuinely cross-cutting citations (controlled by `global_citation_min_subsections`).
- Budget policy (paper-like): avoid H3 explosion; the outline gate uses `queries.md:draft_profile` to set max H3 (survey<=10, deep<=12).
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent H3s into fewer, thicker units, then rerun `section-mapper` → `outline-refiner` before `Approve C2`.
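A minimal sketch of the diversity expectation in the notes above, assuming the H3-to-bibkey mapping has already been loaded into a plain dict. The function name `global_bibkeys`, the in-memory shape, and the sample ids are hypothetical; the real `section-mapper` logic may differ. It lists the bibkeys mapped to at least `global_citation_min_subsections` H3s so they can be reviewed as genuinely cross-cutting works.

```python
from collections import Counter


def global_bibkeys(mapping, min_subsections):
    """Return bibkeys that are mapped to at least `min_subsections` H3s.

    `mapping` is a hypothetical in-memory form: {h3_id: [bibkey, ...]}.
    A key repeated inside one H3 is counted once for that H3.
    """
    counts = Counter()
    for keys in mapping.values():
        counts.update(set(keys))
    return {key for key, n in counts.items() if n >= min_subsections}


if __name__ == "__main__":
    # Hypothetical sample mapping, for illustration only.
    sample = {
        "3.1": ["a", "b"],
        "3.2": ["a", "b", "a"],
        "4.1": ["a"],
        "4.2": ["a", "c"],
    }
    print(sorted(global_bibkeys(sample, min_subsections=4)))
```

Keys returned here are candidates for the "global" pool; everything else should stay bound to its own H3 or chapter.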
## Stage 3 - Evidence (C3) [NO PROSE]
required_skills:
- pdf-text-extractor
- paper-notes
- subsection-briefs
- chapter-briefs
produces:
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
Notes:
- `queries.md` can set `evidence_mode: "abstract"|"fulltext"` (A150++ default: `abstract`).
- `queries.md` can set `draft_profile: "survey"|"deep"` to control writing gate strictness (A150++ default: `survey`).
- If `evidence_mode: "fulltext"`, `pdf-text-extractor` can be tuned via `fulltext_max_papers`, `fulltext_max_pages`, `fulltext_min_chars`.
- `subsection-briefs` converts each H3 into a verifiable writing card (scope_rule/rq/axes/clusters/paragraph_plan) so writing does not copy outline scaffolds.
- Optional refinement markers (recommended): treat briefs as *contracts*, not scaffolds. If you manually refine them and want to prevent regeneration, create:
- `outline/subsection_briefs.refined.ok`
- `outline/chapter_briefs.refined.ok`
These markers are used as explicit “reviewed/refined” signals and as a freeze switch (scripts won’t overwrite refined briefs).
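The freeze switch described above can be sketched as follows. This is an illustration under assumptions, not the pipeline's actual script: the helper names `is_frozen` and `write_unless_frozen` are hypothetical, and the marker is assumed to sit next to the artifact as `<stem>.refined.ok` (matching `outline/subsection_briefs.refined.ok` for `outline/subsection_briefs.jsonl`).

```python
from pathlib import Path


def is_frozen(artifact: Path) -> bool:
    """True if a `<stem>.refined.ok` marker exists next to the artifact."""
    marker = artifact.with_name(artifact.stem + ".refined.ok")
    return marker.exists()


def write_unless_frozen(artifact: Path, content: str) -> bool:
    """Write `content` unless the artifact is frozen; return True if written."""
    if is_frozen(artifact):
        return False  # a human refined this file; do not overwrite it
    artifact.parent.mkdir(parents=True, exist_ok=True)
    artifact.write_text(content, encoding="utf-8")
    return True
```

A regeneration script would call `write_unless_frozen(Path("outline/subsection_briefs.jsonl"), new_text)` and report the skip when it returns `False`.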
## Stage 4 - Citations + evidence packs (C4) [NO PROSE]
required_skills:
- citation-verifier
- evidence-binder
- evidence-draft
- table-schema
- anchor-sheet
- table-filler
- appendix-table-writer
- schema-normalizer
- writer-context-pack
- evidence-selfloop
- claim-matrix-rewriter
optional_skills:
- survey-visuals
produces:
- citations/ref.bib
- citations/verified.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- output/SCHEMA_NORMALIZATION_REPORT.md
- outline/writer_context_packs.jsonl
- output/EVIDENCE_SELFLOOP_TODO.md
- outline/claim_evidence_matrix.md
Notes:
- `evidence-draft` turns paper notes into per-subsection evidence packs (claim candidates + concrete comparisons + eval protocol + limitations) that the writer must follow.
- `claim-matrix-rewriter` makes `outline/claim_evidence_matrix.md` a projection/index of evidence packs (not an outline expansion), so writer guidance stays evidence-first.
- `writer-context-pack` builds a deterministic per-H3 drafting pack (briefs + evidence + anchors + allowed cites), reducing hollow writing and making C5 more debuggable.
- Tables are part of the default survey deliverable, but split into two layers:
- `outline/tables_index.md` (internal index; produced by `table-filler`; useful for planning/debugging; NOT inserted into the paper)
- `outline/tables_appendix.md` (reader-facing; produced by `appendix-table-writer`; clean/publishable; inserted into the draft as an Appendix block by `section-merger`)
- Optional: `survey-visuals` can still produce timeline/figure specs as intermediate artifacts.
- Optional refinement markers (recommended): after you spot-check/refine C4 artifacts and want to freeze them, create:
- `outline/evidence_bindings.refined.ok`
- `outline/evidence_drafts.refined.ok`
- `outline/anchor_sheet.refined.ok`
- `outline/writer_context_packs.refined.ok`
These markers make “reviewed/refined” explicit and prevent accidental regeneration/overwrite; strict mode relies on content checks (placeholders/blocking_missing/scope), not marker presence.
## Stage 5 - Draft (C5) [PROSE AFTER C2]
required_skills:
- front-matter-writer
- chapter-lead-writer
- subsection-writer
- writer-selfloop
- section-logic-polisher
- argument-selfloop
- paragraph-curator
- style-harmonizer
- opener-variator
- evaluation-anchor-checker
- transition-weaver
- section-merger
- post-merge-voice-gate
- citation-diversifier
- citation-injector
- draft-polisher
- global-reviewer
- pipeline-auditor
- artifact-contract-auditor
optional_skills:
- prose-writer
- subsection-polisher
- redundancy-pruner
- terminology-normalizer
- limitation-weaver
- latex-scaffold
- latex-compile-qa
produces:
- sections/sections_manifest.jsonl
- sections/abstract.md
- sections/discussion.md
- sections/conclusion.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/MERGE_REPORT.md
- output/DRAFT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/GLOBAL_REVIEW.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
Notes:
- C5 writing system (semantic + minimal artifacts; no extra machinery):
- **Unit of work**: `sections/*.md` (front matter, H2 leads, H3 bodies). Avoid editing `output/DRAFT.md` directly until after merge.
- **Single source of truth (locked conventions)**: `output/ARGUMENT_SKELETON.md` → `## Consistency Contract` (terminology, scope boundary, evaluation protocol fields, baseline naming).
- **Write → check → fix (five gates)**:
1) `writer-selfloop` → `output/WRITER_SELFLOOP_TODO.md`: file existence, depth, citation scope, paper voice.
2) `section-logic-polisher` → `output/SECTION_LOGIC_REPORT.md`: paragraph linkage (no jump cuts / “paragraph islands”).
3) `argument-selfloop` → `output/ARGUMENT_SELFLOOP_TODO.md` + `output/ARGUMENT_SKELETON.md` + `output/SECTION_ARGUMENT_SUMMARIES.jsonl`: section-level closure + premise/definition stability.
4) `paragraph-curator` → `output/PARAGRAPH_CURATION_REPORT.md`: **select → evaluate → subset → fuse** so sections converge (reduce redundancy, strengthen synthesis) without changing citation keys.
5) `evaluation-anchor-checker` → `output/EVAL_ANCHOR_REPORT.md`: final section-level numeric hygiene sweep so surviving numeric claims keep same-sentence task/metric/constraint context before merge.
- **Openers-last**: draft the middle first; rewrite paragraph 1 last so it reflects real content (front matter + H3).
- Writing self-loop gate: `subsection-writer` ensures the full `sections/` file set exists (and emits `sections/sections_manifest.jsonl`); `writer-selfloop` blocks until depth/citation-scope/paper-voice checks pass, writing `output/WRITER_SELFLOOP_TODO.md` (PASS/FAIL).
- Argument self-loop gate: `argument-selfloop` blocks “smooth but hollow” sections by making the argument chain explicit (per-section paragraph moves + a global dependency skeleton). Its ledgers are intermediate artifacts and must never be merged into the paper.
- Style hygiene (C5 hard gate for `survey`/`deep`): treat `output/WRITER_SELFLOOP_TODO.md` Style Smells as mandatory fixes. Run `style-harmonizer` + `opener-variator` on flagged files, then rerun `writer-selfloop` before merge.
- Numeric hygiene gate: run `evaluation-anchor-checker` after the last section-level rewrites (`paragraph-curator` + `style-harmonizer` + `opener-variator`) and before merge. If later section rewrites touch the same H3 files, rerun it; do not wait for `pipeline-auditor` to discover underspecified numbers in the merged draft.
- Micro-fix routing (preferred over broad rewrites): if Style Smells are specific, use targeted micro-skills before a general harmonize pass:
- opener cadence / “overview” narration → `opener-variator`
- count-based limitation slots (“Two limitations…”) → `limitation-weaver`
- underspecified numeric/performance claims (missing task/metric/budget) → `evaluation-anchor-checker`
- Triage rule (prevents “patching holes with prose”): if `writer-selfloop` FAILs because a subsection cannot meet `must_use` *in-scope* (thin packs / missing anchors / out-of-scope citation pressure), stop and rerun the evidence loop (`evidence-selfloop` + upstream C2/C3/C4) instead of padding prose.
- WebWeaver-style “planner vs writer” split (single agent, two passes):
- Planner pass: for each section/subsection, pick the exact citation IDs to use from the evidence bank (`outline/evidence_drafts.jsonl`) and keep scope consistent with the outline.
- Writer pass: write that section using only those citation IDs; avoid dumping the whole notes set into context.
- Treat this stage as an iteration loop: draft per H3 → logic-polish (thesis + connectors) → weave transitions → merge → de-template/cohere → global review → (if gaps) back to C3/C4 → regenerate.
- Post-merge voice gate: `post-merge-voice-gate` treats `outline/transitions.md` as a high-frequency injection source. If it FAILs, fix the *source* (usually transitions via `transition-weaver`, or the owning `sections/*.md`) and re-merge; do not “patch around it” in `draft-polisher`.
- Depth target (profile-aware): each H3 should be “few but thick” (avoid stubs). Use `queries.md:draft_profile` as the contract:
- `survey`: >=10 paragraphs + >=12 unique cites
- `deep`: >=11 paragraphs + >=14 unique cites
In all profiles, require >=2 concrete contrasts + evaluation anchoring + a cross-paper synthesis paragraph + an explicit limitation.
- Profile semantics: `survey` is the default deliverable contract; `deep` is stricter (and typically pairs well with `evidence_mode: fulltext`).
- Coherence target (paper-like): for every H2 chapter with H3 subsections, write a short **chapter lead** block (`sections/S<sec_id>_lead.md`) that previews the comparison axes and how the H3s connect (no new headings; avoid generic glue).
- Anti-template style contract (paper-like, not “outline narration”):
- Avoid meta openers like “This subsection surveys/argues …” and slide-like navigation (“Next, we move from … / We now turn to …”).
- Keep signposting light: avoid repeating a literal opener label across many subsections (e.g., `Key takeaway:`); vary opener phrasing and cadence.
- Tone target: calm, academic, understated; delete hype words (`clearly`, `obviously`) and “PPT speaker notes”.
- Keep evidence-policy disclaimers **once** in front matter (not repeated across H3s).
- If you cite numbers, include minimal evaluation context (task + metric + constraint/budget/cost) in the same paragraph.
- Citation shape must be reader-facing: no adjacent citation blocks (e.g., `[@a] [@b]`), no duplicate keys in one block (e.g., `[@a; @a]`), and avoid tail-only citation style by keeping mid-sentence citations in each H3.
- `section-merger` merges `sections/*.md` plus `outline/transitions.md` (within-chapter H3→H3 by default). Between-H2 transition insertion is optional: create `outline/transitions.insert_h2.ok` in the workspace if you want narrator-style handoffs included.
- Tables are part of the default deliverable: `outline/tables_appendix.md` is inserted into the draft by `section-merger` as a single Appendix block (index tables in `outline/tables_index.md` remain intermediate) unless `outline/tables.insert.off` exists. Other visuals (`outline/timeline.md`, `outline/figures.md`) remain intermediate by default.
- Citation scope policy: citations are subsection-first (from `outline/evidence_bindings.jsonl`), with limited reuse allowed within the same H2 chapter to reduce brittleness; avoid cross-chapter “free cite” drift.
- Controlled flexibility: bibkeys mapped to >= `queries.md:global_citation_min_subsections` subsections (A150++ default: 4) are treated as cross-cutting/global; see `allowed_bibkeys_global` in writer packs / `sections_manifest.jsonl`.
- If global unique citations are low, run `citation-diversifier` → `citation-injector` *before* `draft-polisher` (the polisher treats citation keys as immutable).
- `queries.md` can set `citation_target: recommended|hard` to control whether the recommended target is enforced as blocking (default: `recommended` for A150++).
- If you intentionally add/remove citations after an earlier polish run, reset the citation-anchoring baseline before rerunning `draft-polisher`:
- delete `output/citation_anchors.prepolish.jsonl` (workspace-local), then rerun `draft-polisher`.
- Recommended skills (toolkit, not a rigid one-shot chain):
- Modular drafting: `subsection-writer` → `writer-selfloop` → `section-logic-polisher` → `argument-selfloop` → `paragraph-curator` → `style-harmonizer` → `opener-variator` → `evaluation-anchor-checker` → `transition-weaver` → `section-merger` → `draft-polisher` → `global-reviewer` → `pipeline-auditor`.
- Legacy one-shot drafting: `prose-writer` (kept for quick experiments; less debuggable).
- If the draft reads like “paragraph islands”, run `section-logic-polisher` and patch only failing `sections/S*.md` until PASS, then merge.
- Add `pipeline-auditor` after `global-reviewer` as a regression test (blocks on ellipsis, repeated boilerplate, and citation hygiene).
- If you also need a PDF deliverable, use `latex-scaffold` + `latex-compile-qa` (see `arxiv-survey-latex`).
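Before deciding whether to run `citation-diversifier` → `citation-injector`, it can help to count global unique citations directly. The sketch below assumes the draft uses the `[@key; @key2]` citation shape shown in these notes. The function names and regexes are illustrative, not the auditor's actual implementation, and the floors are passed as parameters that default to the A150++ values quoted in this document (hard `150`, recommended `165`).

```python
import re

CITE_BLOCK = re.compile(r"\[(@[^\]]+)\]")  # one bracketed citation block
CITE_KEY = re.compile(r"@([\w:.\-]+)")     # one key inside a block


def unique_citation_keys(text):
    """Collect the set of distinct citation keys used anywhere in `text`."""
    keys = set()
    for block in CITE_BLOCK.findall(text):
        keys.update(CITE_KEY.findall(block))
    return keys


def citation_gate(text, citation_target="recommended", hard=150, recommended=165):
    """Return (passed, unique_count, floor) for the chosen blocking target."""
    floor = recommended if citation_target == "recommended" else hard
    count = len(unique_citation_keys(text))
    return count >= floor, count, floor
```

For example, `citation_gate(open("output/DRAFT.md", encoding="utf-8").read())` returns the pass flag together with the current count and the floor it was compared against.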
## Quality gates (strict mode)
- Citation coverage: expect a large, verifiable bibliography (A150++ default: `core_size=300` → `ref.bib` ~300) and high cite density:
- Per-H3: `survey` profile expects >=12 unique citations per H3; `deep` expects >=14.
- Front matter: `survey` profile expects Introduction>=35 and Related Work>=50 unique citations (dense positioning; no cite dumps).
- Global: `pipeline-auditor` gates on **global unique citations across the full draft**. A150++ defaults: hard `>=150`; recommended `>=165` (when bib=300). `queries.md:citation_target` controls which is blocking (default: `recommended`). If it fails, prefer `citation-diversifier` → `citation-injector` (in-scope, NO NEW FACTS) using each H3’s `allowed_bibkeys_selected` / `allowed_bibkeys_mapped` from `outline/writer_context_packs.jsonl`.
- Anti-template: drafts containing ellipsis placeholders (`…`) or leaked scaffold instructions (e.g., "enumerate 2-4 ...") should block and be regenerated from improved outline/mapping/evidence artifacts.
- Final polish hard gates (`survey`/`deep`): block on narration-template openers (e.g., `This subsection ...`), slide navigation phrasing, repeated opener stems or verbal tics, adjacent citation blocks (`[@a] [@b]`), duplicate keys in one block (`[@a; @a]`), and low H3 mid-sentence citation ratio (<30%).
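The citation-shape gates above can be approximated with a few regular expressions. This is a sketch under stated assumptions, not the real `pipeline-auditor`. The function names are hypothetical, and "mid-sentence" is assumed here to mean that the next non-space character after the citation block is not sentence-ending punctuation (`.`, `!`, `?`).

```python
import re

BLOCK = re.compile(r"\[(@[^\]]+)\]")
ADJACENT = re.compile(r"\[@[^\]]+\]\s*\[@[^\]]+\]")


def adjacent_blocks(text):
    """Return adjacent citation blocks such as `[@a] [@b]`."""
    return [m.group(0) for m in ADJACENT.finditer(text)]


def duplicate_key_blocks(text):
    """Return blocks that repeat a key, such as `[@a; @a]`."""
    found = []
    for m in BLOCK.finditer(text):
        keys = [k.strip().lstrip("@") for k in m.group(1).split(";")]
        if len(keys) != len(set(keys)):
            found.append(m.group(0))
    return found


def mid_sentence_ratio(text):
    """Share of citation blocks NOT directly followed by '.', '!' or '?'."""
    blocks = list(BLOCK.finditer(text))
    if not blocks:
        return 0.0
    mid = 0
    for m in blocks:
        rest = text[m.end():].lstrip()
        if rest and rest[0] not in ".!?":
            mid += 1
    return mid / len(blocks)
```

Under these assumptions, an H3 would be flagged when `adjacent_blocks` or `duplicate_key_blocks` return anything, or when `mid_sentence_ratio` falls below `0.30`.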
FILE:pipelines/graduate-paper-pipeline.md
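# Pipeline: Chinese graduation thesis restructuring and finalization (research-stage draft)
> Status: research-stage draft
> Current positioning: study the workflow and the skills it needs first; **do not bind UNITS yet, and do not force it into an executable contract**
> Audience: Chinese graduation thesis writing that starts from an existing school template, prior papers, Overleaf sources, PDFs, BibTeX, experiment figures and tables, and intermediate working documents
> Core goal: restructure the “existing materials” into a Chinese graduation thesis with a unified main line, smooth structure, complete evidence, consistent terminology, and a compilable deliverable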
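## 1. The essence of this Pipeline
This is neither a linear “write a thesis from scratch” workflow nor an assembly workflow that “stitches a few papers together”. It is a thesis-engineering Pipeline that is **driven by a question list, centered on a Markdown intermediate layer, and closed out in a TeX delivery layer**.
Its actual main loop is:
> `question list -> semantic localization -> Markdown restructuring -> TeX write-back -> compile review -> question write-back -> next round`
So the first goal of this Pipeline is not to “rush out a draft of the body text” but to get the following right first:
1. Recover the existing materials first, rather than editing blindly in `tex`.
2. Restructure chapter roles around the thesis main line first, rather than copying the narrative of the original papers.
3. Think through structure, evidence, terminology, and figures/tables in the Markdown intermediate layer first, then write back to TeX.
4. Resolve structure and evidence problems first, then style and AI-flavor removal.
5. The real convergence point is not “one full pass written” but “the question list is cleared round by round, and compilation and review pass stably”.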
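## 2. The fundamental difference between this Pipeline and the survey Pipeline
It differs substantially from `arxiv-survey` / `arxiv-survey-latex` and must be modeled separately:
- the survey pipeline starts from “retrieve literature for a topic and converge the structure layer by layer”;
- the graduate-paper pipeline starts from “inventory and restructure existing materials”;
- the survey pipeline's core intermediate artifacts are `outline/evidence_drafts/...`;
- the graduate-paper pipeline's core intermediate artifacts are the outline, question list, intermediate chapter drafts, figure/table plan, and review records under `codex_md/`;
- the survey pipeline's main goal is to “generate a survey”;
- the graduate-paper pipeline's main goal is to “restructure existing research work into a finalized thesis that conforms to Chinese degree-thesis norms”.
In one sentence:
> survey is closer to a “retrieval-driven evidence-writing workflow”;
> graduate-paper is closer to an “existing-materials-driven thesis-engineering restructuring workflow”.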
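## 3. Workspace layering and key artifacts
This Pipeline needs a clear separation between the **thinking layer** and the **delivery layer**.
### 3.1 Directory responsibilities
- `main.tex`: the thesis entry point; assembles the cover, abstract, table of contents, body, appendices, acknowledgements, achievements, and references.
- `chapters/`: the final deliverable body `.tex` files.
- `abstract/`, `preface/`, `acknowledgement/`, `achievements/`: formally delivered non-body parts.
- `references/`: the bibliography database and style files.
- `pdf/`: PDFs of published or submitted papers.
- `Overleaf_ref/`: existing sources, revision drafts, and supplementary materials.
- `codex_md/`: the intermediate working layer; this is the **thesis restructuring workspace**.
- `claude_md/`: review checklists and final-draft verification records.
- `tmp_layout/`, `tmp_layout2/`: figure/table trial layouts and page-layout rehearsals.
The most important distinction is:
- `codex_md/` is the **thinking / restructuring area**
- `chapters/` is the **delivery / finalization area**
Do not mix the two.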
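### 3.2 Recommended fixed intermediate artifacts
It is recommended to keep the following artifacts fixed in this Pipeline over the long term:
- `codex_md/material_index.md`: inventory and index of existing materials
- `codex_md/material_readiness.md`: material readiness and gap notes
- `codex_md/missing_info.md`: list of missing information
- `codex_md/question_list.md`: this round's question list, priorities, and acceptance criteria
- `codex_md/00_thesis_outline.md`: thesis main line, outline, and chapter roles
- `codex_md/chapter_role_map.md`: source material -> chapter -> role mapping table
- `codex_md/chapter_rewrite_rules.md`: table of chapter restructuring rules
- `codex_md/terminology_glossary.md`: terminology unification table
- `codex_md/symbol_metric_table.md`: symbol / metric unification table
- `codex_md/figure_plan.md`: figure/table and layout plan
- `claude_md/review_checklist.md`: review checklist for compilation, typesetting, data, and template items
- `output/THESIS_BUILD_REPORT.md`: compilation and final-draft check report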
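## 4. The skills this Pipeline actually needs
For now, setting reuse and generalization aside and decomposing by what this thesis Pipeline itself needs, the first version should use the following skills.
### 4.1 Main-chain skills
1. `thesis-workspace-init`
Responsibility: initialize the thesis project, remind the user where to place materials, set up the workspace, check whether `main.tex` compiles, and generate the material inventory.
2. `thesis-question-list`
Responsibility: create and maintain `question_list.md`, fixing each round's revision goals, question priorities, boundaries, and acceptance criteria.
This is the **control-plane skill** of the whole Pipeline.
3. `thesis-source-role-mapper`
Responsibility: map existing papers / templates / sources / PDFs / figure and table materials to “thesis roles”, not just `paper -> chapter`.
4. `thesis-chapter-reconstructor`
Responsibility: restructure each chapter's goal, weight, hand-offs, and narrative style around the thesis main line.
This is the **core restructuring skill** of the whole Pipeline.
5. `thesis-markdown-aligner`
Responsibility: unify the main line, terminology, symbols, metrics, and figure/table conventions, so that the chapters converge in the Markdown intermediate layer into one thesis rather than a set of papers.
6. `thesis-tex-writeback`
Responsibility: write content already straightened out in the Markdown layer back to `chapters/*.tex`, and synchronize figures/tables, equations, cross-references, and chapter hand-offs.
7. `thesis-compile-review`
Responsibility: run compilation, warning triage, template-mode checks, data-consistency verification, and question write-back.
It forms the final quality loop.
8. `thesis-style-polisher`
Responsibility: once structure, evidence, and data are stable, polish toward Chinese degree-thesis style and remove AI flavor.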
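### 4.2 Parallel side-branch skills
9. `thesis-visual-layout-planner`
Responsibility: figure/table planning, Mermaid sketches, temporary page composition, and checks on text-figure rhythm.
10. `thesis-frontmatter-sync`
Responsibility: synchronize the abstract, cover, appendices, achievements, acknowledgements, lists, and other non-body parts, so they are not left to be filled in at the very end.
11. `thesis-citation-enhance-review`
Responsibility: locate sentences that must be supported by citations, expand candidate references, verify that citations match the claims, and write back to `references/*.bib` and the in-text citations.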
### 4.3 How these 11 skills relate
The main chain is:
`thesis-workspace-init -> thesis-question-list -> thesis-source-role-mapper -> thesis-chapter-reconstructor -> thesis-markdown-aligner -> thesis-tex-writeback -> thesis-compile-review -> thesis-style-polisher`
The parallel side branches are:
- `thesis-visual-layout-planner`
- `thesis-frontmatter-sync`
- `thesis-citation-enhance-review`
Among them:
- `thesis-question-list` is the control plane for the whole workflow
- `thesis-chapter-reconstructor` is the core restructuring plane
- `thesis-compile-review` is the quality-loop plane
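## 5. Pipeline overview (by stage)
| Stage | Goal | Core skills | Main outputs |
|---|---|---|---|
| Stage 0 | Project initialization and material intake | `thesis-workspace-init` | Directory skeleton, initial compilation, material index, material readiness notes |
| Stage 1 | Recover existing materials into the Markdown intermediate layer | `thesis-workspace-init` | Initial Markdown materials, missing-information list, material gap notes |
| Stage 1.5 | Lock this round's questions and boundaries | `thesis-question-list` | `question_list.md` |
| Stage 2 | Map source materials to thesis roles | `thesis-source-role-mapper` | Chapter role map, materials grouped by chapter |
| Stage 2.5 | Restructure chapters around the main line | `thesis-chapter-reconstructor` | Restructured chapter Markdown |
| Stage 3 | Align, merge, and unify the whole-thesis structure | `thesis-markdown-aligner` | Stable outline, terminology table, symbol table, evidence-gap list |
| Stage 3.5 | Figure/table and layout rehearsal | `thesis-visual-layout-planner` | Figure/table plan, figure-text mapping, temporary layout results |
| Stage 4 | Write back to the TeX delivery layer | `thesis-tex-writeback` | Compilable chapter `.tex`, first complete `main.pdf` |
| Stage 4.5 | Synchronize non-body parts | `thesis-frontmatter-sync` | Synchronized versions of the abstract, appendices, cover, achievements, etc. |
| Stage 5 | Citation enhancement and verification | `thesis-citation-enhance-review` | Citation reinforcement results, verification records |
| Stage 6 | Compilation, typesetting, and final-draft review | `thesis-compile-review` | `THESIS_BUILD_REPORT`, review checklist |
| Stage 7 | Final polish and AI-flavor removal | `thesis-style-polisher` | A more natural Chinese final draft |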
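## 6. Stage-by-stage details
### Stage 0: Project initialization and material intake
**Goal**
Turn the thesis into a maintainable project first, rather than a pile of scattered files.
**Main inputs**
- School template
- Existing repository / sources
- Basic metadata such as student ID, year, and Chinese and English titles
- Existing `main.tex`
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Basic directory skeleton
- Initial `main.tex` / initial `main.pdf`
- `codex_md/material_index.md`
- `codex_md/material_readiness.md`
**Hand-offs**
- If `main.tex` does not compile yet, do not enter the body restructuring stages
- If materials are not yet in place, do not start the question list or chapter restructuring
**Execution focus**
- Make clear which areas are material areas, which are working areas, and which are delivery areas
- Ensure “it compiles” first, then discuss “it reads well”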
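### Stage 1: Recover existing materials into the Markdown intermediate layer
**Goal**
Recover the reusable content from the existing `template / tex / Overleaf / PDF / bib` into a Markdown intermediate layer, rather than forcing edits directly in the TeX layer.
**Main inputs**
- School template
- Existing `.tex`
- Overleaf sources
- PDFs
- Bibliography database
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Initial Markdown chapter materials
- `codex_md/material_readiness.md`
- `codex_md/missing_info.md`
- Material index and list of information still to be supplied
**Hand-offs**
- Stage 1 outputs feed directly into `thesis-question-list` and `thesis-source-role-mapper`
**Execution focus**
- The focus is “inventorying material assets”, not “writing pretty sentences first”
- Explicitly mark experiment details, figure captions, citations, and numeric checks that still need to be supplied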
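### Stage 1.5: Lock the question list and this round's goals
**Goal**
Establish this round's control plane: make explicit what this round fixes and what it does not.
**Main inputs**
- Initial Markdown materials
- Review feedback from the previous round
- Current compilation / structure / style problems
**Core skill**
- `thesis-question-list`
**Main outputs**
- `codex_md/question_list.md`
**Hand-offs**
- This step is the starting point for all later restructuring and write-back
- Any structural dispute should first come back here to be re-prioritized
**Execution focus**
- Every question must state clearly: what it is, why it matters, how to fix it, and what level of acceptance is required
- The question list is not a memo; it is the entry point for iteration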
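### Stage 2: Map source materials to thesis roles
**Goal**
Map existing materials to chapter roles in the thesis, rather than making a mechanical “paper to chapter” assignment.
**Main inputs**
- Existing papers, PDFs, sources, figures and tables
- Initial Markdown materials
- `question_list.md`
**Core skill**
- `thesis-source-role-mapper`
**Main outputs**
- `codex_md/chapter_role_map.md`
- Markdown materials grouped by chapter
- Content boundaries for each chapter
**Hand-offs**
- This is the direct input to `thesis-chapter-reconstructor`
**Execution focus**
- Distinguish: method core / background support / validation evidence / system implementation / material for limitations and future work
- If this is set wrong, the whole thesis will be skewed afterwards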
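### Stage 2.5: Restructure chapters around the main line
**Goal**
Transform paper-style narrative into thesis-style narrative.
**Main inputs**
- Chapter role map
- `question_list.md`
- Markdown drafts for each chapter
**Core skill**
- `thesis-chapter-reconstructor`
**Main outputs**
- Restructured chapter Markdown
- `codex_md/chapter_rewrite_rules.md`
**Hand-offs**
- This is the core stage of the whole Pipeline
- All later alignment, write-back, and compilation build on the restructuring results produced here
**Execution focus**
- Make clear what question each chapter must answer
- Decide which content to strengthen, weaken, move up, or push down
- Rewrite the hand-offs to the preceding and following chapters
- Change the “selling-point narrative” into a “thesis main-line narrative”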
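### Stage 3: Markdown alignment, merging, and unification
**Goal**
Make the whole thesis become “one thesis” in the intermediate layer first, rather than a concatenation of several papers.
**Main inputs**
- Restructured chapter Markdown
- `question_list.md`
**Core skill**
- `thesis-markdown-aligner`
**Main outputs**
- Stable `codex_md/00_thesis_outline.md`
- `codex_md/terminology_glossary.md`
- `codex_md/symbol_metric_table.md`
- Evidence-gap list
**Hand-offs**
- While structure, terminology, or metrics are unstable, do not enter `tex-writeback`
**Execution focus**
- Unify the research main line, terminology, abbreviations, symbols, metrics, and figure/table conventions
- Decide which content is explained once in Chapter 2, which is developed in Chapters 3-5, and which goes into the appendices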
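### Stage 3.5: Figure/table and layout rehearsal
**Goal**
Plan figures, tables, and layout before the body is finalized; do not let figures and tables become a last-minute patch job.
**Main inputs**
- Stable outline
- Chapter Markdown
- Existing figures, tables, and experiment results
**Core skill**
- `thesis-visual-layout-planner`
**Main outputs**
- `codex_md/figure_plan.md`
- `mermaid/` diagram drafts
- `tmp_layout/`, `tmp_layout2/` trial layout results
**Hand-offs**
- The figure/table plan directly shapes the text-figure structure in `tex-writeback`
**Execution focus**
- Make clear which figures and tables each chapter needs to support the main line
- Sketch first, then do temporary page composition, then decide final placement and captions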
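### Stage 4: Write back to the TeX delivery layer
**Goal**
Write content already straightened out in the Markdown layer back to `chapters/*.tex` and move into the delivery layer.
**Main inputs**
- Stable Markdown chapters
- Figure/table plan
- Terminology / symbol / metric unification tables
**Core skill**
- `thesis-tex-writeback`
**Main outputs**
- `chapters/*.tex`
- First complete `main.pdf`
**Hand-offs**
- If structural problems surface at this stage, in principle go back and fix them in the Markdown layer first; do not reinvent structure in the TeX layer
**Execution focus**
- Synchronize figures/tables, equations, cross-references, chapter-opening introductions, and chapter-end summaries
- The TeX layer is responsible for delivery, not for rethinking structure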
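### Stage 4.5: Synchronize non-body parts
**Goal**
Maintain the abstract, appendices, cover, achievements, acknowledgements, and other non-body parts in parallel, so they are not pushed to the end and filled in at low quality.
**Main inputs**
- Stable body version
- School template items
- Chinese and English titles and abstract information
**Core skill**
- `thesis-frontmatter-sync`
**Main outputs**
- `abstract/abstract.tex`
- `abstract/abstract-en.tex`
- `preface/...`
- `appendix/...`
- `acknowledgement/...`
- `achievements/...`
**Hand-offs**
- The Stage 6 final-draft review checks these parts together with the body
**Execution focus**
- Chinese and English titles and abstracts use consistent wording
- Numbers, metrics, and method names in the abstract match the body
- Appendices hold only supplementary content and do not repeat the main line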
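### Stage 5: Literature enhancement and citation verification
**Goal**
Systematically fill in the “sentences that should be supported by citations” and make sure citations match the claims.
**Main inputs**
- Current Markdown / TeX version
- `references/*.bib`
- Existing PDFs / reference works
**Core skill**
- `thesis-citation-enhance-review`
**Main outputs**
- Enhanced bibliography database
- Citation reinforcement records
- Citation verification records
**Hand-offs**
- Literature enhancement can run in parallel with Stages 3 and 4, but it must be closed out before the final polish
**Execution focus**
- Prioritize sentences that must have citation support: factual descriptions, method taxonomies, classic results, metric definitions, and data sources
- Get citations right first, then add more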
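### Stage 6: Compilation, typesetting, and final-draft review
**Goal**
Form the quality loop: not just “it compiles” but “it is fit to submit”.
**Main inputs**
- Complete TeX project
- Synchronized non-body parts
- review checklist
**Core skill**
- `thesis-compile-review`
**Main outputs**
- `output/THESIS_BUILD_REPORT.md`
- `claude_md/review_checklist.md`
- List of closed / still-open questions
**Hand-offs**
- Problems found in Stage 6 determine which layer to return to for the fix: Stage 1.5 / 2.5 / 3 / 4
**Execution focus**
- A successful compilation is only the minimum requirement
- Warnings need triage: blocks submission / must fix / record and observe
- Also check: data consistency, cite key validity, whether template parameters are correct, and whether terminology and abbreviations are unified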
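One of the Stage 6 checks, cite key validity, lends itself to a small script. The sketch below is an illustration only: the function names and both regexes are assumptions, it handles common `\cite`-family commands with optional arguments, and it does not parse BibTeX fully. It reports keys cited in TeX text that have no matching entry in the `.bib` text.

```python
import re

# \cite, \citep, \citet*, ... with optional [..] arguments, then {key1,key2}
CITE_CMD = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# @article{key, / @inproceedings{ key ,
BIB_ENTRY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def cited_keys(tex: str) -> set:
    """Collect every key used in a \\cite-style command."""
    keys = set()
    for group in CITE_CMD.findall(tex):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys


def bib_keys(bib: str) -> set:
    """Collect the entry keys declared in BibTeX source text."""
    return set(BIB_ENTRY.findall(bib))


def missing_cite_keys(tex: str, bib: str) -> list:
    """Keys cited in `tex` that are not declared in `bib`, sorted."""
    return sorted(cited_keys(tex) - bib_keys(bib))
```

In practice the `tex` argument would be the concatenated contents of `chapters/*.tex` and the `bib` argument the contents of `references/*.bib`; a non-empty result belongs in the “blocks submission” warning tier.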
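### Stage 7: Final polish and AI-flavor removal
**Goal**
Polish toward Chinese degree-thesis style only after structure, evidence, and data are stable.
**Main inputs**
- Thesis version that has passed compilation and review
- `中文写作要求.md` (Chinese writing requirements)
- `GPT口癖与高频用词调研.md` (survey of GPT verbal tics and high-frequency wording)
**Core skill**
- `thesis-style-polisher`
**Main outputs**
- A more natural Chinese final draft
- De-templated introductions, summaries, conclusion, and outlook
**Hand-offs**
- Stage 7 must come last
- If polishing exposes structural or evidence problems, fall back to an earlier stage rather than covering them up with heavy polish
**Execution focus**
- Prioritize chapter-opening introductions, chapter-end summaries, contribution statements, and the conclusion and outlook
- Tone down template voice, promotional voice, and common AI verbal tics
- The goal is “reads more like a degree thesis”, not “reads more like an AI-optimized paper”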
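## 7. The four fixed loops of this Pipeline
To keep it from degrading back into a “linear SOP”, it is recommended to fix the following four loops:
### 7.1 Structure loop
`outline -> a chapter's md -> chapter imbalance found -> back to outline / question_list`
### 7.2 Content loop
`paper / Overleaf / PDF -> extract content -> restructure narrative -> still reads like the original paper -> restructure again`
### 7.3 Typesetting loop
`tex -> compile -> warning / figure placement / cross-reference problems found -> back to tex or figure planning`
### 7.4 Style loop
`body stable -> AI polish -> template voice / AI flavor found -> back to writing norms and wording control`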
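## 8. Execution priority of this Pipeline
If resources are limited in an actual run, the priority should be:
1. `thesis-question-list`
2. `thesis-source-role-mapper`
3. `thesis-chapter-reconstructor`
4. `thesis-markdown-aligner`
5. `thesis-tex-writeback`
6. `thesis-compile-review`
In other words:
- What truly cannot be skipped is “question list + material role mapping + chapter restructuring + compile review”
- What truly should not be done earliest is “polish and AI-flavor removal”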
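## 9. Current conclusion
The first principle of this graduate-paper pipeline should be:
> **Restructure first, then write; converge in the intermediate layer first, then deliver in TeX; get evidence and structure right first, then style and polish.**
So if this Pipeline is later made formally executable, the thing most worth implementing first is not “one big all-in-one runner” but making the following skills solid:
- `thesis-workspace-init`
- `thesis-question-list`
- `thesis-source-role-mapper`
- `thesis-chapter-reconstructor`
- `thesis-compile-review`
Once these five skills are solid, this Chinese graduation thesis Pipeline will have real practical value.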
FILE:pipelines/idea-brainstorm.pipeline.md
---
name: idea-brainstorm
version: 4.0
profile: idea-brainstorm
routing_hints: [idea, ideation, brainstorm, 点子, 选题, 找方向, 找 idea]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.idea-brainstorm.csv
contract_model: pipeline.frontmatter/v1
query_defaults:
draft_profile: idea_brainstorm
max_results: 1800
core_size: 100
evidence_mode: abstract
direction_pool_min: 12
direction_pool_max: 24
idea_screen_top_n: 10
idea_shortlist_size: 5
report_top_n: 3
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- evidence_mode
- direction_pool_min
- direction_pool_max
- idea_screen_top_n
- idea_shortlist_size
- report_top_n
- time_window.from
- time_window.to
quality_contract:
memo_bundle:
required_outputs: [output/REPORT.md, output/APPENDIX.md, output/REPORT.json]
signal_policy:
min_rows: 10
direction_policy:
shortlist_min: 3
shortlist_max: 5
keep_min: 3
cluster_diversity_min: 2
lead_diversity_target: 3
lead_diversity_axes: [cluster, direction_type, program_kind]
screening_policy:
keep_rank_max: 7
maybe_rank_max: 12
score_weights:
discussion_worthiness: 0.24
academic_value: 0.22
evidence_grounding: 0.18
direction_distinctness: 0.16
first_probe_clarity: 0.10
thesis_potential: 0.10
loop_policy:
stage_retry_budget:
C1: 2
C3: 1
C4: 1
max_reroutes: 4
require_human_on_retry_after_approval: true
stages:
C0:
title: Init + idea brief
mode: no_prose
required_skills: [workspace-init, pipeline-router, idea-brief, human-checkpoint]
optional_skills: []
produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/trace/IDEA_BRIEF.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
human_checkpoint:
approve: brainstorm brief
write_to: DECISIONS.md
C1:
title: Retrieval + core set
mode: no_prose
required_skills: [literature-engineer, dedupe-rank]
optional_skills: []
produces: [papers/papers_raw.jsonl, papers/papers_dedup.jsonl, papers/core_set.csv, papers/retrieval_report.md]
C2:
title: Idea landscape / focus
mode: no_prose
required_skills: [taxonomy-builder, pipeline-router, human-checkpoint]
optional_skills: []
produces: [outline/taxonomy.yml, DECISIONS.md]
human_checkpoint:
approve: focus clusters / lenses + exclusions
write_to: DECISIONS.md
C3:
title: Evidence signals
mode: no_prose
required_skills: [paper-notes, idea-signal-mapper]
optional_skills: []
produces: [papers/paper_notes.jsonl, papers/evidence_bank.jsonl, output/trace/IDEA_SIGNAL_TABLE.md, output/trace/IDEA_SIGNAL_TABLE.jsonl]
C4:
title: Direction pool + screening
mode: short_prose_ok
required_skills: [idea-direction-generator, idea-screener]
optional_skills: []
produces: [output/trace/IDEA_DIRECTION_POOL.md, output/trace/IDEA_DIRECTION_POOL.jsonl, output/trace/IDEA_SCREENING_TABLE.md, output/trace/IDEA_SCREENING_TABLE.jsonl]
C5:
title: Shortlist + memo synthesis + self-loop
mode: prose_allowed
required_skills: [idea-shortlist-curator, idea-memo-writer, deliverable-selfloop, artifact-contract-auditor]
optional_skills: []
produces: [output/trace/IDEA_SHORTLIST.md, output/trace/IDEA_SHORTLIST.jsonl, output/REPORT.md, output/APPENDIX.md, output/REPORT.json, output/DELIVERABLE_SELFLOOP_TODO.md, output/CONTRACT_REPORT.md]
---
# Pipeline: research idea brainstorm (signals -> directions -> memo)
Goal: produce a **discussion-ready research-idea memo** for PI / PhD readers.
This pipeline is for requests such as “找 research idea” (find a research idea), “brainstorm”, “选题” (choose a topic), and “找方向” (find a direction).
It is **not** for writing a survey draft and **not** for generating execution-grade project specs.
The terminal deliverable is a single default entrypoint:
- `output/REPORT.md`
That memo should feel like a mature brainstorm artifact:
- grounded in literature,
- small enough to discuss,
- clear about what is promising,
- honest about uncertainty,
- and rich enough to guide the next discussion round.
Artifact policy:
- Reader-facing terminal artifacts:
- `output/REPORT.md`
- `output/APPENDIX.md`
- `output/REPORT.json`
- Trace artifacts stay under `output/trace/`.
- The run is only shareable when both layers exist and pass contract audit.
Default profile:
- Retrieval route: `literature-engineer`
- `core_size=100`
- Evidence mode: `abstract`
- Signal table: 10-20 rows
- Direction pool: 12-24
- Shortlist: 3-5
- Final memo lead directions: 3
## Stage 0 - Init + idea brief (C0)
required_skills:
- workspace-init
- pipeline-router
- idea-brief
- human-checkpoint
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
Notes:
- The brief is the single source of truth for topic, audience, constraints, exclusions, targets, and query buckets.
- Default behavior: block once at C0 for human approval of the brief before retrieval.
## Stage 1 - Retrieval + core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
Notes:
- Use multi-query buckets from `queries.md`.
- If recall is too low/noisy, fix it upstream and rerun C1.
## Stage 2 - Idea landscape / focus (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- pipeline-router
- human-checkpoint
produces:
- outline/taxonomy.yml
- DECISIONS.md
human_checkpoint:
- approve: focus clusters / lenses + exclusions
- write_to: DECISIONS.md
Notes:
- Taxonomy is an idea landscape, not a paper outline.
- Default behavior: pause at C2 so the human can choose a few promising focus lenses.
## Stage 3 - Evidence signals (C3) [table-first]
required_skills:
- paper-notes
- idea-signal-mapper
produces:
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
Notes:
- The signal table should capture tensions, missing pieces, and academically meaningful axes rather than proposal-ready wedges.
- Keep this stage compact and table-first; no long prose.
## Stage 4 - Direction pool + screening (C4)
required_skills:
- idea-direction-generator
- idea-screener
produces:
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
Notes:
- Expansion should generate discussion-worthy research directions, not an operator Cartesian product.
- Screening should score discussion value, distinctness, evidence grounding, and thesis potential.
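One way the four screening criteria could be turned into a ranking is sketched below. The criteria names come from the note above; the 1-5 scale, the equal weighting, and the names `ScreenedDirection` and `rank_directions` are assumptions for illustration, not the behavior of `idea-screener`.

```python
from __future__ import annotations

from dataclasses import dataclass

# Criteria named in the screening note; the 1-5 scale and equal weights are assumptions.
CRITERIA = ("discussion_value", "distinctness", "evidence_grounding", "thesis_potential")


@dataclass(frozen=True)
class ScreenedDirection:
    direction_id: str
    scores: dict[str, int]  # criterion -> score on a hypothetical 1-5 scale

    def total(self) -> int:
        # A missing criterion raises KeyError rather than counting as zero.
        return sum(self.scores[name] for name in CRITERIA)


def rank_directions(pool: list[ScreenedDirection]) -> list[ScreenedDirection]:
    """Sort by total score (highest first), breaking ties by id for determinism."""
    return sorted(pool, key=lambda d: (-d.total(), d.direction_id))
```

A deterministic tie-break keeps reruns of the screening table stable, which matters when the shortlist is cut from the top of this ordering.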
## Stage 5 - Shortlist + memo synthesis + self-loop (C5)
required_skills:
- idea-shortlist-curator
- idea-memo-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- `output/trace/IDEA_SHORTLIST.md` is an internal convergence layer, not the user-facing final answer.
- `output/REPORT.md` is the terminal deliverable and should read like a discussion-ready research idea brainstorm memo.
- `deliverable-selfloop` should evaluate the full memo bundle plus its trace chain.
FILE:pipelines/lit-snapshot.pipeline.md
---
name: lit-snapshot
version: 2.3
profile: lit-snapshot
routing_hints: [snapshot, 快照, one-page, one page, 48h]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- outline/taxonomy.yml
- outline/outline.yml
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.lit-snapshot.csv
---
# Pipeline: literature snapshot (24-48h) [bullets-first]
Goal: a compact, reader-facing snapshot (`output/SNAPSHOT.md`) with a small but usable paper set and a paper-like structure. This is intentionally lighter than `arxiv-survey*` (no evidence packs, no BibTeX, no LaTeX).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- arxiv-search
- dedupe-rank
optional_skills:
- keyword-expansion
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Snapshot default: aim for a smaller but diverse set (e.g., core_set ~20-40) rather than a survey-scale pool.
- Practical retrieval loop (multi-query, then refine):
  - Start broad with multiple query buckets (synonyms/acronyms/subtopics) and merge the results.
  - If too few: add buckets + widen time window; if too noisy: rewrite keywords + add exclusions; rerun C1.
- If the snapshot feels generic, the fix is almost always upstream: broaden `queries.md` (more synonyms, fewer excludes, wider time window) and rerun C1.
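The "merge the results" step of the retrieval loop above can be sketched as a title-normalized union across buckets. This is a sketch under assumptions: the `title` and `bucket` record fields and the function names are hypothetical, and the normalization simply mirrors the title cleanup used for dedupe in `tooling/common.py`.

```python
from __future__ import annotations

import re


def normalize_title(title: str) -> str:
    """Lowercase and collapse every non-alphanumeric run to a single space."""
    title = re.sub(r"[^a-z0-9]+", " ", title.lower())
    return re.sub(r"\s+", " ", title).strip()


def merge_buckets(buckets: dict[str, list[dict[str, str]]]) -> list[dict[str, str]]:
    """Merge per-bucket results, keeping the first record seen for each title."""
    seen: set[str] = set()
    merged: list[dict[str, str]] = []
    for bucket_name, records in buckets.items():
        for record in records:
            key = normalize_title(record.get("title", ""))
            if not key or key in seen:
                continue  # skip untitled records and repeats from later buckets
            seen.add(key)
            merged.append({**record, "bucket": bucket_name})
    return merged
```

Recording the originating bucket on each record makes it easier to see which query bucket is contributing noise before rewriting keywords.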
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- outline-builder
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/outline.yml
human_checkpoint:
- approve: scope + outline (snapshot ToC)
- write_to: DECISIONS.md
Notes:
- Paper-like default: prefer fewer, thicker sections. For snapshots, avoid H3 explosion; treat H2 as the main organizing unit.
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent nodes and keep the ToC readable before writing.
## Stage 3 - Snapshot (C3) [SHORT PROSE OK]
required_skills:
- snapshot-writer
- deliverable-selfloop
- artifact-contract-auditor
optional_skills:
- prose-writer
produces:
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- The deliverable is bullets-first and pointer-heavy: every non-trivial claim should attach 1-2 concrete paper pointers from `papers/core_set.csv`.
- Avoid outline narration (e.g., `This section surveys ...`); write content claims + why-it-matters + pointers.
FILE:pipelines/peer-review.pipeline.md
---
name: peer-review
version: 2.3
profile: peer-review
routing_hints: [peer review, review report, referee, 审稿]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PAPER.md
- output/CLAIMS.md
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.peer-review.csv
---
# Pipeline: peer review / referee report
Goal: produce an actionable referee report (`output/REVIEW.md`) that is traceable to the submitted paper’s claims and evidence (no free-floating advice, no invented comparisons).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
## Stage 1 - Claims (C1)
required_skills:
- manuscript-ingest
- claims-extractor
produces:
- output/PAPER.md
- output/CLAIMS.md
Notes:
- Input expectation: you must have the manuscript text available as `output/PAPER.md` (or equivalent extracted text) before running this stage.
- Contract: every claim must include a source pointer (section/page/quote) so later critique is auditable.
## Stage 2 - Evidence audit (C2)
required_skills:
- evidence-auditor
- novelty-matrix
produces:
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
Notes:
- Evidence-auditor should write gaps/risks/verification steps, not “fix the paper”.
- Novelty-matrix should be conservative: compare against cited/known related work; if you lack sources, record that limitation explicitly.
## Stage 3 - Rubric write-up (C3)
required_skills:
- rubric-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Prefer concrete, minimal fixes (what experiment/ablation/analysis would resolve the concern) over generic “needs more experiments”.
FILE:pipelines/systematic-review.pipeline.md
---
name: systematic-review
version: 2.3
profile: systematic-review
routing_hints: [systematic review, systematic, prisma, 系统综述]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PROTOCOL.md
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/screening_log.csv
- papers/extraction_table.csv
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.systematic-review.csv
---
# Pipeline: systematic review (PRISMA-style)
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Protocol (C1)
required_skills:
- protocol-writer
produces:
- output/PROTOCOL.md
human_checkpoint:
- approve: protocol locked (query, inclusion/exclusion, databases, time window)
- write_to: DECISIONS.md
## Stage 2 - Retrieval & candidate pool (C2)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- arxiv-search
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Systematic-review contract: keep the candidate pool auditable. Prefer dedupe (good) over aggressive ranking/filtering (dangerous).
- `dedupe-rank` default: for this pipeline, it keeps the full deduped pool in `papers/core_set.csv` unless you explicitly set `queries.md:core_size` (screening should not silently drop papers).
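The "keep the full pool unless `core_size` is set" rule can be sketched as follows. `select_core_set` is a hypothetical helper, not the `dedupe-rank` implementation, and it assumes the deduped list is already ordered best-first.

```python
from __future__ import annotations


def select_core_set(deduped: list[dict], core_size: int | None) -> list[dict]:
    """Keep the whole deduped pool unless an explicit core_size asks for a cut."""
    if core_size is None:
        return list(deduped)
    if core_size <= 0:
        raise ValueError("core_size must be positive when set")
    # Assumes `deduped` is already ordered by rank, best first.
    return deduped[:core_size]
```

Passing `None` corresponds to leaving `core_size` unset in `queries.md`, so no paper leaves the candidate list before screening records a reason for it.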
## Stage 3 - Screening (C3)
required_skills:
- screening-manager
produces:
- papers/screening_log.csv
Notes:
- Use `papers/papers_dedup.jsonl` (or `papers/core_set.csv`) as the candidate list; `papers/screening_log.csv` must include protocol-grounded reasons for every decision.
- Practical auditability: number protocol clauses (`I1..` / `E1..`) and require screening rows to cite them (e.g., `reason_codes=E3`).
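A small audit along these lines could verify that every screening row cites a known protocol clause. This is a sketch: only the `reason_codes` column and the `I1..` / `E1..` id style come from the note above; the `paper_id` and `decision` columns, the `;` separator for multiple codes, and the function name are assumptions.

```python
from __future__ import annotations

import csv
import io
import re

CLAUSE_RE = re.compile(r"^[IE]\d+$")  # matches ids such as I1 or E3


def audit_screening_rows(csv_text: str, protocol_clauses: set[str]) -> list[str]:
    """Return one message per row whose reason codes are missing or unknown."""
    problems: list[str] = []
    # Data rows start on line 2 because line 1 is the CSV header.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        codes = [c.strip() for c in (row.get("reason_codes") or "").split(";") if c.strip()]
        if not codes:
            problems.append(f"line {line_no}: no reason_codes")
            continue
        for code in codes:
            if not CLAUSE_RE.match(code) or code not in protocol_clauses:
                problems.append(f"line {line_no}: unknown clause {code}")
    return problems
```

Run it against the clause ids numbered in `output/PROTOCOL.md`; an empty list means every decision is traceable to a protocol clause.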
## Stage 4 - Extraction (C4)
required_skills:
- extraction-form
- bias-assessor
produces:
- papers/extraction_table.csv
Notes:
- Extraction is schema-driven: do not write narrative synthesis here; keep all fields consistent and fill bias fields (low/unclear/high + notes).
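The bias-field rule can be checked mechanically. The sketch below assumes the extraction rows are already loaded as dictionaries; the column name `selection_bias` used in the usage note and the function name `check_bias_fields` are hypothetical, while the allowed values come from the note above.

```python
from __future__ import annotations

ALLOWED_BIAS = {"low", "unclear", "high"}


def check_bias_fields(rows: list[dict[str, str]], bias_columns: list[str]) -> list[str]:
    """Flag rows whose bias cells are empty or outside low/unclear/high."""
    problems: list[str] = []
    for index, row in enumerate(rows, start=1):
        for column in bias_columns:
            value = (row.get(column) or "").strip().lower()
            if value not in ALLOWED_BIAS:
                problems.append(f"row {index}: {column}={value or '(empty)'}")
    return problems
```

For example, `check_bias_fields(rows, ["selection_bias"])` lists every row where that cell is blank or uses a value such as `medium` that the schema does not allow.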
## Stage 5 - Synthesis (C5) [PROSE ALLOWED]
required_skills:
- synthesis-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
FILE:pipelines/tutorial.pipeline.md
---
name: tutorial
version: 2.3
profile: tutorial
routing_hints: [tutorial, 教程]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/TUTORIAL_SPEC.md
- outline/concept_graph.yml
- outline/module_plan.yml
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.tutorial.csv
---
# Pipeline: tutorial (teaching loop)
Goal: a tutorial deliverable (`output/TUTORIAL.md`) that has a consistent running example and a real teaching loop (objectives -> steps -> exercises -> verification), not just a blog-style explanation.
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Spec (C1)
required_skills:
- tutorial-spec
produces:
- output/TUTORIAL_SPEC.md
Notes:
- The spec is the scope contract: audience, prerequisites, learning objectives, and a running example (if applicable).
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- concept-graph
- module-planner
- exercise-builder
produces:
- outline/concept_graph.yml
- outline/module_plan.yml
human_checkpoint:
- approve: target audience + scope + running example
- write_to: DECISIONS.md
Notes:
- Treat `outline/module_plan.yml` as the execution contract for writing: modules must have concrete outputs and at least one verifiable exercise each.
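The execution contract above can be expressed as a structural check on the loaded plan. This is a sketch only: the schema of `outline/module_plan.yml` is not shown here, so the `modules`, `id`, `outputs`, and `exercises` keys are assumed names, and the function takes an already-parsed dictionary.

```python
from __future__ import annotations

from typing import Any


def check_module_plan(plan: dict[str, Any]) -> list[str]:
    """Return one message per module that breaks the outputs/exercise contract."""
    problems: list[str] = []
    for module in plan.get("modules") or []:
        module_id = str(module.get("id") or "(no id)")
        if not module.get("outputs"):
            problems.append(f"{module_id}: no concrete outputs")
        if not module.get("exercises"):
            problems.append(f"{module_id}: no verifiable exercise")
    return problems
```

Running a check like this before the C2 approval keeps the writing stage from starting on a module that has nothing to verify.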
## Stage 3 - Writing (C3) [PROSE ALLOWED]
required_skills:
- tutorial-module-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Write only within the approved scope; if you discover missing prerequisites, record them as a follow-up instead of expanding scope silently.
FILE:scripts/run.py
from __future__ import annotations
import argparse
import json
import re
import sys
from pathlib import Path
from typing import Any
ALLOWED_MOVES = {
'setup',
'thesis',
'contrast',
'evidence',
'evaluation',
'limitation',
'synthesis',
'takeaway',
}
def _slug_unit_id(unit_id: str) -> str:
    """Map a unit id to a filename-safe `S<id>` slug (non-alphanumerics become `_`)."""
    raw = str(unit_id or '').strip()
    out: list[str] = []
    for ch in raw:
        out.append(ch if ch.isalnum() else '_')
    safe = ''.join(out).strip('_')
    return f'S{safe}' if safe else 'S'


def _paragraphs(text: str) -> list[str]:
    """Split text into non-empty paragraphs separated by blank lines."""
    return [p.strip() for p in re.split(r'\n\s*\n', text.strip()) if p.strip()]


def _moves(idx: int, total: int, paragraph: str) -> list[str]:
    """Tag a paragraph with argument moves using position and keyword heuristics."""
    low = paragraph.lower()
    moves: list[str] = []
    if idx == 0:
        moves.extend(['setup', 'thesis'])
    if any(tok in low for tok in ['however', 'whereas', 'by contrast', 'while']):
        moves.append('contrast')
    # A pandoc-style citation such as [@key] counts as evidence.
    if re.search(r'\[@[^\]]+\]', paragraph):
        moves.append('evidence')
    if any(tok in low for tok in ['benchmark', 'metric', 'protocol', 'evaluation', 'latency', 'cost']):
        moves.append('evaluation')
    if any(tok in low for tok in ['limitation', 'risk', 'constraint', 'caveat']):
        moves.append('limitation')
    if idx >= max(1, total - 2):
        moves.append('synthesis')
    if idx == total - 1:
        moves.append('takeaway')
    # Emit moves in canonical order; fall back to 'evidence' when nothing matched.
    return [m for m in ['setup', 'thesis', 'contrast', 'evidence', 'evaluation', 'limitation', 'synthesis', 'takeaway'] if m in set(moves)] or ['evidence']
def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument('--workspace', required=True)
    parser.add_argument('--unit-id', default='')
    parser.add_argument('--inputs', default='')
    parser.add_argument('--outputs', default='')
    parser.add_argument('--checkpoint', default='')
    args = parser.parse_args()

    repo_root = Path(__file__).resolve()
    for _ in range(10):
        if (repo_root / "AGENTS.md").exists():
            break
        parent = repo_root.parent
        if parent == repo_root:
            break
        repo_root = parent
    sys.path.insert(0, str(repo_root))
    from tooling.common import atomic_write_text, ensure_dir, load_yaml, now_iso_seconds, parse_semicolon_list
    from tooling.quality_gate import check_unit_outputs, write_quality_report

    workspace = Path(args.workspace).resolve()
    unit_id = str(args.unit_id or 'U1025').strip() or 'U1025'
    inputs = parse_semicolon_list(args.inputs)
    outputs = parse_semicolon_list(args.outputs) or [
        'output/ARGUMENT_SELFLOOP_TODO.md',
        'output/SECTION_ARGUMENT_SUMMARIES.jsonl',
        'output/ARGUMENT_SKELETON.md',
    ]
    todo_rel = next((x for x in outputs if x.endswith('ARGUMENT_SELFLOOP_TODO.md')), 'output/ARGUMENT_SELFLOOP_TODO.md')
    summaries_rel = next((x for x in outputs if x.endswith('SECTION_ARGUMENT_SUMMARIES.jsonl')), 'output/SECTION_ARGUMENT_SUMMARIES.jsonl')
    skeleton_rel = next((x for x in outputs if x.endswith('ARGUMENT_SKELETON.md')), 'output/ARGUMENT_SKELETON.md')
    todo_path = workspace / todo_rel
    summaries_path = workspace / summaries_rel
    skeleton_path = workspace / skeleton_rel
    ensure_dir(todo_path.parent)

    outline_rel = next((x for x in inputs if x.endswith('outline.yml')), 'outline/outline.yml')
    outline = load_yaml(workspace / outline_rel) if (workspace / outline_rel).exists() else []
    records: list[dict[str, Any]] = []
    watchlist: list[str] = []
    if isinstance(outline, list):
        for sec in outline:
            if not isinstance(sec, dict):
                continue
            sec_id = str(sec.get('id') or '').strip()
            sec_title = str(sec.get('title') or '').strip()
            subs = [sub for sub in (sec.get('subsections') or []) if isinstance(sub, dict)]
            for sub in subs:
                sub_id = str(sub.get('id') or '').strip()
                title = str(sub.get('title') or '').strip()
                path = workspace / 'sections' / f'{_slug_unit_id(sub_id)}.md'
                text = path.read_text(encoding='utf-8', errors='ignore') if path.exists() else ''
                paras = _paragraphs(text)
                paragraph_records = []
                for idx, para in enumerate(paras):
                    mv = _moves(idx, len(paras), para)
                    if 'limitation' not in mv and idx == len(paras) - 2:
                        mv = mv + ['limitation']
                    paragraph_records.append({'index': idx + 1, 'moves': mv, 'preview': para[:240].strip()})
                records.append(
                    {
                        'kind': 'h3',
                        'id': sub_id,
                        'title': title,
                        'section_id': sec_id,
                        'section_title': sec_title,
                        'paragraphs': paragraph_records,
                    }
                )
                if not paras:
                    watchlist.append(f'{sub_id} missing prose body')
                elif len(paras) < 8:
                    watchlist.append(f'{sub_id} has thin paragraph budget ({len(paras)})')
    atomic_write_text(summaries_path, '\n'.join(json.dumps(r, ensure_ascii=False) for r in records).rstrip() + ('\n' if records else ''))

    skeleton_lines = [
        '# Argument skeleton',
        '',
        '## Consistency Contract',
        '',
        '- Use one stable naming scheme for comparable agent components, evaluation settings, and benchmark references.',
        '- Keep evaluation claims tied to task, metric, and protocol constraints rather than architecture labels alone.',
        '- Treat limitations as part of the argument, not as detachable boilerplate.',
        '- Keep subsection comparisons chapter-scoped unless a citation is globally in-scope.',
        '',
        '## Chapter dependencies',
        '',
    ]
    if isinstance(outline, list):
        for sec in outline:
            if not isinstance(sec, dict):
                continue
            title = str(sec.get('title') or '').strip()
            subs = [str(sub.get('title') or '').strip() for sub in (sec.get('subsections') or []) if isinstance(sub, dict) and str(sub.get('title') or '').strip()]
            if subs:
                skeleton_lines.append(f'- {title}: ' + '; '.join(subs))
    skeleton_lines.extend(['', '## Watchlist', ''])
    if watchlist:
        skeleton_lines.extend([f'- {item}' for item in watchlist])
    else:
        skeleton_lines.append('- (none)')
    atomic_write_text(skeleton_path, '\n'.join(skeleton_lines).rstrip() + '\n')

    todo_lines = [
        '# Argument self-loop report',
        '',
        '- Status: PASS',
        f'- Generated at: `{now_iso_seconds()}`',
        '',
        '## Summary',
        '',
        '- Section argument summaries were regenerated from the current `sections/` files.',
        '- The global consistency contract was refreshed for downstream curation and review.',
        '',
    ]
    if watchlist:
        todo_lines.extend(['## Watchlist', ''])
        todo_lines.extend([f'- {item}' for item in watchlist])
        todo_lines.append('')
    atomic_write_text(todo_path, '\n'.join(todo_lines).rstrip() + '\n')

    issues = check_unit_outputs(skill='argument-selfloop', workspace=workspace, outputs=[todo_rel, summaries_rel, skeleton_rel])
    if issues:
        write_quality_report(workspace=workspace, unit_id=unit_id, skill='argument-selfloop', issues=issues)
        return 2
    return 0


if __name__ == '__main__':
    raise SystemExit(main())
FILE:tooling/__init__.py
FILE:tooling/common.py
from __future__ import annotations
import csv
import json
import os
import re
import shutil
import tempfile
from dataclasses import dataclass
from datetime import date, datetime
from pathlib import Path
from typing import Any, Iterable
import yaml
def today_iso() -> str:
    return date.today().isoformat()


def now_iso_seconds() -> str:
    return datetime.now().replace(microsecond=0).isoformat()


def ensure_dir(path: Path) -> None:
    path.mkdir(parents=True, exist_ok=True)


def atomic_write_text(path: Path, content: str) -> None:
    ensure_dir(path.parent)
    fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
        os.replace(tmp_path, path)
    except Exception:
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise


def backup_existing(path: Path) -> Path:
    """Rename an existing file to a timestamped `.bak.*` sibling and return the backup path."""
    if not path.exists():
        return path
    stamp = datetime.now().replace(microsecond=0).isoformat().replace("-", "").replace(":", "")
    backup = path.with_name(f"{path.name}.bak.{stamp}")
    counter = 1
    while backup.exists():
        backup = path.with_name(f"{path.name}.bak.{stamp}.{counter}")
        counter += 1
    path.replace(backup)
    return backup
def parse_semicolon_list(value: str | None) -> list[str]:
if not value:
return []
return [item.strip() for item in value.split(";") if item.strip()]
def normalize_title_for_dedupe(title: str) -> str:
title = title.lower()
title = re.sub(r"[^a-z0-9]+", " ", title)
title = re.sub(r"\s+", " ", title).strip()
return title
def normalize_axis_label(text: str) -> str:
text = re.sub(r"\s+", " ", (text or "").strip().lower())
text = text.rstrip(" .;:,;。")
text = re.sub(r"\s*/\s*", " ", text)
text = re.sub(r"[^a-z0-9]+", " ", text)
return text.strip()
def subsection_brief_generic_axis_norms() -> set[str]:
"""Axis labels that strict survey gates treat as scaffold-level defaults.
Keep this aligned across brief generation and quality-gate checks so the
generator does not promote an axis that the gate later classifies as generic.
"""
axes = {
"core mechanism and system architecture",
"training and data setup",
"evaluation protocol",
"evaluation protocol (benchmarks / metrics / human)",
"evaluation protocol (datasets / metrics / human)",
"evaluation protocol (datasets, metrics, human evaluation)",
"compute and efficiency",
"compute and latency constraints",
"efficiency and compute",
"tool interface contract (schemas / protocols)",
"tool selection / routing policy",
"sandboxing / permissions / observability",
"failure modes and limitations",
}
return {normalize_axis_label(axis) for axis in axes}
def read_jsonl(path: Path) -> list[dict[str, Any]]:
records: list[dict[str, Any]] = []
if not path.exists():
return records
with path.open("r", encoding="utf-8") as handle:
for line in handle:
line = line.strip()
if not line:
continue
records.append(json.loads(line))
return records
def write_jsonl(path: Path, records: Iterable[dict[str, Any]]) -> None:
ensure_dir(path.parent)
lines = [json.dumps(record, ensure_ascii=False) for record in records]
atomic_write_text(path, "\n".join(lines) + ("\n" if lines else ""))
def read_tsv(path: Path) -> list[dict[str, str]]:
if not path.exists():
return []
with path.open("r", encoding="utf-8", newline="") as handle:
reader = csv.DictReader(handle, delimiter="\t")
return [dict(row) for row in reader]
def write_tsv(path: Path, rows: list[dict[str, Any]], fieldnames: list[str]) -> None:
ensure_dir(path.parent)
fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
try:
with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
writer.writeheader()
for row in rows:
writer.writerow({key: row.get(key, "") for key in fieldnames})
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except OSError:
pass
raise
@dataclass(frozen=True)
class UnitsTable:
fieldnames: list[str]
rows: list[dict[str, str]]
@staticmethod
def load(path: Path) -> "UnitsTable":
with path.open("r", encoding="utf-8", newline="") as handle:
reader = csv.DictReader(handle)
fieldnames = list(reader.fieldnames or [])
rows = [dict(row) for row in reader]
return UnitsTable(fieldnames=fieldnames, rows=rows)
def save(self, path: Path) -> None:
ensure_dir(path.parent)
fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
try:
with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
writer = csv.DictWriter(handle, fieldnames=self.fieldnames)
writer.writeheader()
for row in self.rows:
writer.writerow({key: row.get(key, "") for key in self.fieldnames})
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except OSError:
pass
raise
def load_yaml(path: Path) -> Any:
with path.open("r", encoding="utf-8") as handle:
return yaml.safe_load(handle)
def dump_yaml(path: Path, data: Any) -> None:
ensure_dir(path.parent)
text = yaml.safe_dump(
data,
allow_unicode=True,
sort_keys=False,
default_flow_style=False,
width=120,
)
atomic_write_text(path, text)
def copy_tree(src_dir: Path, dst_dir: Path, *, overwrite: bool) -> None:
if not src_dir.is_dir():
raise ValueError(f"Template directory not found: {src_dir}")
ensure_dir(dst_dir)
for src_path in src_dir.rglob("*"):
rel = src_path.relative_to(src_dir)
dst_path = dst_dir / rel
if src_path.is_dir():
ensure_dir(dst_path)
continue
ensure_dir(dst_path.parent)
if dst_path.exists() and not overwrite:
continue
shutil.copy2(src_path, dst_path)
def tokenize(text: str) -> list[str]:
text = text.lower()
text = re.sub(r"[^a-z0-9]+", " ", text)
return [token for token in text.split() if token]
_EN_STOPWORDS = {
"a",
"an",
"and",
"are",
"as",
"at",
"be",
"by",
"for",
"from",
"in",
"is",
"it",
"of",
"on",
"or",
"that",
"the",
"this",
"to",
"with",
"we",
"our",
"via",
"towards",
"toward",
"using",
"use",
"based",
"new",
"towards",
"into",
"over",
"under",
"between",
"within",
"without",
"beyond",
}
_GENERIC_PAPER_WORDS = {
"survey",
"review",
"tutorial",
"paper",
"approach",
"method",
"methods",
"model",
"models",
"framework",
"frameworks",
"system",
"systems",
"learning",
"deep",
"neural",
"network",
"networks",
"analysis",
"benchmark",
"benchmarks",
"dataset",
"datasets",
"evaluation",
"evaluating",
"towards",
"using",
"based",
"study",
"studies",
}
def candidate_keywords(titles: Iterable[str], *, top_k: int, min_freq: int) -> list[str]:
freq: dict[str, int] = {}
for title in titles:
for token in tokenize(title):
if token in _EN_STOPWORDS or token in _GENERIC_PAPER_WORDS:
continue
if len(token) < 3:
continue
freq[token] = freq.get(token, 0) + 1
candidates = [t for t, c in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])) if c >= min_freq]
return candidates[:top_k]
def update_status_log(status_path: Path, line: str) -> None:
    ensure_dir(status_path.parent)
    if status_path.exists():
        existing = status_path.read_text(encoding="utf-8")
    else:
        existing = "# Status\n"
    if "## Run log" not in existing:
        existing = existing.rstrip() + "\n\n## Run log\n"
    updated = existing.rstrip() + f"\n- {line}\n"
    atomic_write_text(status_path, updated)


def update_status_field(status_path: Path, heading: str, value: str) -> None:
    heading_line = f"## {heading}".strip()
    bullet_line = f"- `{value}`"
    if status_path.exists():
        lines = status_path.read_text(encoding="utf-8").splitlines()
    else:
        lines = ["# Status"]
    out: list[str] = []
    i = 0
    updated = False
    while i < len(lines):
        line = lines[i]
        out.append(line)
        if line.strip() == heading_line:
            if i + 1 < len(lines) and lines[i + 1].lstrip().startswith("-"):
                out.append(bullet_line)
                i += 2
                updated = True
                continue
            out.append(bullet_line)
            updated = True
        i += 1
    if not updated:
        out.extend(["", heading_line, bullet_line])
    atomic_write_text(status_path, "\n".join(out).rstrip() + "\n")
def decisions_has_approval(decisions_path: Path, checkpoint: str) -> bool:
    if not checkpoint:
        return False
    if not decisions_path.exists():
        return False
    text = decisions_path.read_text(encoding="utf-8")
    pattern = rf"^\s*-\s*\[[xX]\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b"
    return re.search(pattern, text, flags=re.MULTILINE) is not None


def ensure_decisions_approval_checklist(decisions_path: Path) -> None:
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n"
    if re.search(r"^##\s+Approvals\b", text, flags=re.MULTILINE):
        return
    workspace = decisions_path.parent
    checkpoints = _human_checkpoints_from_units(workspace)
    if not checkpoints:
        return
    checklist_lines = ["## Approvals (check to unblock)"]
    for checkpoint in checkpoints:
        hint = _approval_hint(checkpoint)
        suffix = f" ({hint})" if hint else ""
        checklist_lines.append(f"- [ ] Approve {checkpoint}{suffix}")
    checklist_lines.append("")
    checklist = "\n".join(checklist_lines)
    lines = text.splitlines()
    if lines and lines[0].startswith("#"):
        new_text = "\n".join([lines[0], "", checklist] + lines[1:]).rstrip() + "\n"
    else:
        new_text = (checklist + "\n" + text).rstrip() + "\n"
    atomic_write_text(decisions_path, new_text)
def set_decisions_approval(decisions_path: Path, checkpoint: str, *, approved: bool) -> None:
    checkpoint = checkpoint.strip()
    if not checkpoint:
        raise ValueError("checkpoint must be non-empty")
    ensure_decisions_approval_checklist(decisions_path)
    # The checklist helper writes nothing when UNITS.csv has no HUMAN checkpoints,
    # so the file may still be missing here; start from a fresh header in that case.
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n"
    lines = text.splitlines()
    pattern = re.compile(rf"^\s*-\s*\[\s*[xX ]\s*\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b")
    updated = False
    for idx, line in enumerate(lines):
        if pattern.search(line):
            # Flip only the checkbox; keep the rest of the line (hint text) intact.
            lines[idx] = re.sub(
                r"\[\s*[xX ]\s*\]",
                "[x]" if approved else "[ ]",
                line,
                count=1,
            )
            updated = True
            break
    if not updated:
        insert_at = None
        for idx, line in enumerate(lines):
            if line.strip().startswith("## Approvals"):
                insert_at = idx + 1
                break
        if insert_at is None:
            lines.append("")
            lines.append("## Approvals (check to unblock)")
            insert_at = len(lines)
        lines.insert(insert_at, f"- [{'x' if approved else ' '}] Approve {checkpoint}")
    atomic_write_text(decisions_path, "\n".join(lines).rstrip() + "\n")
def _human_checkpoints_from_units(workspace: Path) -> list[str]:
    units_path = workspace / "UNITS.csv"
    if not units_path.exists():
        return []
    try:
        table = UnitsTable.load(units_path)
    except Exception:
        return []
    seen: set[str] = set()
    out: list[str] = []
    for row in table.rows:
        owner = (row.get("owner") or "").strip().upper()
        if owner != "HUMAN":
            continue
        checkpoint = (row.get("checkpoint") or "").strip()
        if checkpoint and checkpoint not in seen:
            seen.add(checkpoint)
            out.append(checkpoint)
    return out


def _approval_hint(checkpoint: str) -> str:
    hints = {
        "C0": "kickoff: scope/sources/time window/constraints",
        "C1": "retrieval + core set",
        "C2": "scope + outline",
        "C3": "evidence ready",
        "C4": "citations verified",
        "C5": "allow prose writing",
    }
    return hints.get(checkpoint, "")
def upsert_checkpoint_block(decisions_path: Path, checkpoint: str, markdown_block: str) -> None:
    begin = f"<!-- BEGIN CHECKPOINT:{checkpoint} -->"
    end = f"<!-- END CHECKPOINT:{checkpoint} -->"
    block = "\n".join([begin, markdown_block.rstrip(), end, ""]).rstrip() + "\n"
    ensure_decisions_approval_checklist(decisions_path)
    # Read after the checklist step so its edits are kept. The helper writes nothing
    # when UNITS.csv has no HUMAN checkpoints, so fall back to a fresh header.
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n\n"
    pattern = re.compile(
        rf"{re.escape(begin)}.*?{re.escape(end)}\n?",
        flags=re.DOTALL,
    )
    if pattern.search(text):
        # Use a callable so backslashes in the block are not read as regex escapes.
        new_text = pattern.sub(lambda _match: block, text)
    else:
        new_text = text.rstrip() + "\n\n" + block
    atomic_write_text(decisions_path, new_text)
def seed_queries_from_topic(queries_path: Path, topic: str) -> None:
topic = topic.strip()
if not topic:
return
if queries_path.exists():
lines = queries_path.read_text(encoding="utf-8").splitlines()
else:
lines = [
"# Queries",
"",
"## Primary query",
"- keywords:",
" - \"\"",
"- exclude:",
" - \"\"",
"- max_results: \"\"",
"- core_size: \"\"",
"- time window:",
" - from: \"\"",
" - to: \"\"",
"",
"## Notes",
"-",
]
def _has_nonempty_values(token: str) -> bool:
in_block = False
for raw in lines:
stripped = raw.strip()
if stripped.startswith(f"- {token}:"):
in_block = True
continue
if not in_block:
continue
if raw.startswith(" - "):
value = stripped[2:].strip().strip('"').strip("'")
if value:
return True
continue
if stripped.startswith("- "):
break
return False
def _has_nonempty_scalar(token: str) -> bool:
for raw in lines:
stripped = raw.strip()
if not stripped.startswith(f"- {token}:"):
continue
value = stripped.split(":", 1)[1].split("#", 1)[0].strip().strip('"').strip("'")
return bool(value)
return False
def _has_nonempty_time_field(field: str) -> bool:
# Looks for lines like: ' - from: "2022"'
for raw in lines:
stripped = raw.strip()
if not stripped.startswith(f"- {field}:"):
continue
value = stripped.split(":", 1)[1].strip().strip('"').strip("'")
return bool(value)
return False
has_keywords = _has_nonempty_values("keywords")
has_excludes = _has_nonempty_values("exclude")
has_time_from = _has_nonempty_time_field("from")
has_time_to = _has_nonempty_time_field("to")
has_max_results = _has_nonempty_scalar("max_results")
has_core_size = _has_nonempty_scalar("core_size")
workspace = queries_path.parent
profile = pipeline_profile(workspace)
query_defaults = pipeline_query_defaults(workspace)
raw_tlow = topic.lower()
topic_for_queries = _sanitize_topic_for_query_seed(topic)
keyword_suggestions = [topic_for_queries]
tlow = topic_for_queries.lower()
is_agent = any(t in tlow for t in ("agent", "agents", "agentic"))
is_embodied = any(
t in tlow
for t in (
"embodied ai",
"embodied intelligence",
"embodied agent",
"embodied robotics",
"robot foundation model",
"robot learning",
"robot manipulation",
"vision-language-action",
"vla",
"generalist robot",
)
)
is_text_to_image = any(t in tlow for t in ("text-to-image", "text to image", "t2i"))
is_text_to_video = any(t in tlow for t in ("text-to-video", "text to video", "t2v"))
is_diffusion = "diffusion" in tlow
is_generative = is_text_to_image or is_text_to_video or is_diffusion or ("image generation" in tlow) or ("generative" in tlow)
if is_agent:
keyword_suggestions.extend(
[
"LLM agent",
"language model agent",
"tool use",
"function calling",
"tool-using agent",
"planning",
"memory",
"multi-agent",
"benchmark",
"safety",
]
)
exclude_suggestions: list[str] = []
if is_agent:
exclude_suggestions.append("agent-based modeling")
exclude_suggestions.extend(["react hooks", "perovskite", "banach", "coxeter"])
if is_embodied:
keyword_suggestions.extend(
[
"embodied AI survey",
"embodied AI review",
"embodied intelligence survey",
"embodied agent survey",
"robot foundation model survey",
"robot learning survey",
"robot manipulation survey",
"embodied robotics survey",
"vision-language-action survey",
"vision-language-action model",
"robot foundation model",
"generalist robot policy",
"world model robot",
]
)
stripped_output_terms = topic_for_queries.lower().strip() != raw_tlow.strip()
if stripped_output_terms and any(t in raw_tlow for t in ("latex", "pdf", "markdown", "typesetting")):
exclude_suggestions.extend(["latex", "pdf", "typesetting", "document layout"])
if is_generative:
keyword_suggestions.extend(
[
"text-to-image generation",
"text-guided image generation",
"diffusion model",
"denoising diffusion probabilistic model",
"latent diffusion",
"stable diffusion",
"classifier-free guidance",
"diffusion transformer",
"DiT",
"masked generative transformer",
"MaskGIT",
"autoregressive image generation",
"VQGAN",
"VQ-VAE",
"ControlNet",
"DreamBooth",
"textual inversion",
"LoRA fine-tuning",
]
)
default_max_results = query_defaults.get("max_results")
max_results_suggestion = str(default_max_results or "").strip() or str(
1800 if profile == "arxiv-survey" else (800 if (is_agent or is_generative or is_embodied) else 300)
)
time_from_suggestion = (
"2018" if is_embodied
else ("2022" if (is_agent and ("llm" in tlow or "language model" in tlow)) else ("2020" if is_generative else ""))
)
core_size_suggestion = str(query_defaults.get("core_size") or "").strip() or ("300" if profile == "arxiv-survey" else "")
out: list[str] = []
i = 0
while i < len(lines):
line = lines[i]
stripped = line.strip()
if stripped.startswith("- keywords:") and not has_keywords:
out.append(line)
i += 1
while i < len(lines) and lines[i].startswith(" - "):
i += 1
for kw in _dedupe_preserve_order(keyword_suggestions)[:14]:
out.append(f" - \"{kw}\"")
continue
if stripped.startswith("- exclude:") and not has_excludes:
out.append(line)
i += 1
while i < len(lines) and lines[i].startswith(" - "):
i += 1
for ex in _dedupe_preserve_order(exclude_suggestions)[:10]:
out.append(f" - \"{ex}\"")
continue
if stripped.startswith("- max_results:") and not has_max_results and max_results_suggestion:
out.append(f"- max_results: \"{max_results_suggestion}\"")
i += 1
continue
if stripped.startswith("- core_size:") and not has_core_size and core_size_suggestion:
out.append(f"- core_size: \"{core_size_suggestion}\"")
i += 1
continue
if stripped.startswith("- time window:") and not (has_time_from or has_time_to) and time_from_suggestion:
out.append(line)
i += 1
# Skip existing from/to lines if present.
while i < len(lines) and lines[i].startswith(" -"):
i += 1
out.append(f" - from: \"{time_from_suggestion}\"")
out.append(" - to: \"\"")
continue
out.append(line)
i += 1
out = _materialize_missing_query_defaults(out, query_defaults, allowed_fields=pipeline_overridable_query_fields(workspace))
atomic_write_text(queries_path, "\n".join(out).rstrip() + "\n")
def _dedupe_preserve_order(items: list[str]) -> list[str]:
    seen: set[str] = set()
    out: list[str] = []
    for item in items:
        item = item.strip()
        if not item or item in seen:
            continue
        seen.add(item)
        out.append(item)
    return out
def _sanitize_topic_for_query_seed(topic: str) -> str:
    text = str(topic or "").strip()
    if not text:
        return ""
    patterns = [
        r"(?i)\bwith\s+latex\s*/\s*pdf\s+output\b",
        r"(?i)\bwith\s+latex\s+output\b",
        r"(?i)\bwith\s+pdf\s+output\b",
        r"(?i)\bwith\s+markdown\s+output\b",
        r"(?i)\blatex\s*/\s*pdf\s+output\b",
        r"(?i)\bpdf\s+output\b",
        r"(?i)\blatex\s+output\b",
        r"(?i)\bmarkdown\s+output\b",
        r"(?i)\bfor\s+latex\s*/\s*pdf\b",
    ]
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    text = re.sub(r"\s+", " ", text).strip(" ,;:-")
    return text or topic
_LEGACY_PIPELINE_ALIASES = {
"idea-finder": "idea-brainstorm",
"idea-finder.pipeline.md": "idea-brainstorm",
"pipelines/idea-finder.pipeline.md": "idea-brainstorm",
}
def find_repo_root(start: Path | None = None) -> Path:
    """Walk up from *start* (default: this file) looking for AGENTS.md."""
    candidate = (start or Path(__file__)).resolve()
    for _ in range(10):
        if (candidate / "AGENTS.md").exists():
            return candidate
        parent = candidate.parent
        if parent == candidate:
            break
        candidate = parent
    raise FileNotFoundError("Could not find repo root (AGENTS.md marker)")
def _normalize_pipeline_lock_value(value: str) -> str:
raw = str(value or "").strip()
return _LEGACY_PIPELINE_ALIASES.get(raw, raw)
def resolve_pipeline_spec_path(*, repo_root: Path, pipeline_value: str) -> Path | None:
value = _normalize_pipeline_lock_value(pipeline_value)
if not value:
return None
candidate = Path(value)
if candidate.is_absolute() and candidate.exists():
return candidate.resolve()
rel_candidate = repo_root / value
if rel_candidate.exists():
return rel_candidate.resolve()
filename = Path(value).name
if filename:
direct = repo_root / "pipelines" / filename
if direct.exists():
return direct.resolve()
stem = filename
if stem.endswith(".pipeline.md"):
stem = stem[: -len(".pipeline.md")]
if stem:
direct = repo_root / "pipelines" / f"{stem}.pipeline.md"
if direct.exists():
return direct.resolve()
return None
def load_workspace_pipeline_spec(workspace: Path):
from tooling.pipeline_spec import PipelineSpec
try:
repo_root = find_repo_root(workspace)
except FileNotFoundError:
repo_root = Path(__file__).resolve().parents[1]
lock_path = workspace / "PIPELINE.lock.md"
if not lock_path.exists():
return None
pipeline_name = ""
try:
for raw in lock_path.read_text(encoding="utf-8", errors="ignore").splitlines():
line = raw.strip()
if line.startswith("pipeline:"):
pipeline_name = line.split(":", 1)[1].strip()
break
except Exception:
return None
if not pipeline_name:
return None
spec_path = resolve_pipeline_spec_path(repo_root=repo_root, pipeline_value=pipeline_name)
if spec_path is None:
return None
try:
return PipelineSpec.load(spec_path)
except Exception:
return None
def pipeline_query_defaults(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
return dict(spec.query_defaults) if spec is not None else {}
def pipeline_quality_contract(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
return dict(spec.quality_contract) if spec is not None else {}
def pipeline_quality_contract_value(workspace: Path, *keys: str, default: Any = None) -> Any:
current: Any = pipeline_quality_contract(workspace)
for key in keys:
if not isinstance(current, dict):
return default
current = current.get(str(key))
if current is None:
return default
return current
def pipeline_query_default(workspace: Path, key: str, default: Any = None) -> Any:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return default
return spec.query_default(key, default)
def pipeline_overridable_query_fields(workspace: Path) -> set[str]:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return set()
return set(spec.overridable_query_fields)
def pipeline_profile(workspace: Path) -> str:
    """Return the pipeline profile for a workspace.

    Reads PIPELINE.lock.md to get the pipeline name, then loads the
    pipeline spec file and reads its ``profile`` frontmatter field.
    Falls back to ``"default"`` if anything is missing.
    """
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None:
        return "default"
    return str(spec.profile or "default").strip() or "default"
def latest_outline_state(workspace: Path) -> dict[str, Any]:
path = Path(workspace).resolve() / "outline" / "outline_state.jsonl"
records = [rec for rec in read_jsonl(path) if isinstance(rec, dict)]
return dict(records[-1]) if records else {}
def _materialize_missing_query_defaults(lines: list[str], query_defaults: dict[str, Any], *, allowed_fields: set[str] | None = None) -> list[str]:
if not query_defaults:
return lines
existing_keys: set[str] = set()
for raw in lines:
stripped = raw.strip()
if not stripped.startswith("- ") or ":" not in stripped:
continue
key = stripped[2:].split(":", 1)[0].strip().lower().replace(" ", "_").replace("-", "_")
if key:
existing_keys.add(key)
additions: list[str] = []
for key, value in query_defaults.items():
norm_key = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
if not norm_key or norm_key in existing_keys:
continue
if allowed_fields and norm_key not in allowed_fields:
continue
rendered = _render_query_scalar(value)
if rendered is None:
continue
additions.append(f'- {norm_key}: "{rendered}"')
if not additions:
return lines
out = list(lines)
if out and out[-1].strip():
out.append("")
out.extend(additions)
return out
def _render_query_scalar(value: Any) -> str | None:
    if value is None:
        return None
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        text = value.strip()
        return text or None
    return None
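
A minimal sketch of how the default-materialization step above behaves, written as a standalone approximation rather than the module's own code. The names `normalize_key`, `render_scalar`, and `materialize` are hypothetical stand-ins for `_materialize_missing_query_defaults` and `_render_query_scalar`; the sample `queries` lines and default values are invented for illustration, and the `allowed_fields` filter is left out for brevity.

```python
from __future__ import annotations

from typing import Any


def normalize_key(key: str) -> str:
    # Same normalization applied to existing keys and default keys:
    # lowercase, with spaces and hyphens mapped to underscores.
    return str(key or "").strip().lower().replace(" ", "_").replace("-", "_")


def render_scalar(value: Any) -> str | None:
    # bool is tested before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        return value.strip() or None
    return None


def materialize(lines: list[str], defaults: dict[str, Any]) -> list[str]:
    # Collect keys already present as "- key: value" bullets.
    existing = {
        normalize_key(s[2:].split(":", 1)[0])
        for s in (raw.strip() for raw in lines)
        if s.startswith("- ") and ":" in s
    }
    additions: list[str] = []
    for key, value in defaults.items():
        norm = normalize_key(key)
        rendered = render_scalar(value)
        if norm and norm not in existing and rendered is not None:
            additions.append(f'- {norm}: "{rendered}"')
    # Separate appended defaults from the existing content with one blank line.
    spacer = [""] if additions and lines and lines[-1].strip() else []
    return lines + spacer + additions


# Hypothetical queries file content and pipeline defaults.
queries = ['- keywords:', '  - "LLM agent"', '- max results: "800"']
result = materialize(
    queries,
    {"max-results": 1800, "core_size": 300, "enrich": True, "note": None},
)
print("\n".join(result))
```

In this sketch `max-results` is skipped because it normalizes to the same key as the existing `max results` bullet, and `note` is skipped because `None` has no scalar rendering.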
FILE:tooling/executor.py
from __future__ import annotations
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path
from tooling.common import (
UnitsTable,
atomic_write_text,
decisions_has_approval,
ensure_dir,
latest_outline_state,
load_workspace_pipeline_spec,
now_iso_seconds,
parse_semicolon_list,
set_decisions_approval,
update_status_field,
update_status_log,
)
@dataclass(frozen=True)
class RunResult:
unit_id: str | None
status: str
message: str
def _section_first_cutover_block_message(*, workspace: Path, outputs: list[str]) -> str | None:
spec = load_workspace_pipeline_spec(workspace)
if spec is None or str(spec.structure_mode or "").strip().lower() != "section_first":
return None
if "outline/outline_state.jsonl" not in outputs:
return None
latest = latest_outline_state(workspace)
if not latest:
return "Section-first cutover is not actionable yet: `outline/outline_state.jsonl` is missing or empty. Rerun `outline-refiner` after fixing the chapter/section layer."
structure_phase = str(latest.get("structure_phase") or "").strip()
h3_status = str(latest.get("h3_status") or "").strip().lower()
reroute_target = str(latest.get("reroute_target") or "").strip()
retry_budget = str(latest.get("retry_budget_remaining") or "").strip()
reroute_reason = str(latest.get("reroute_reason") or "").strip()
if structure_phase.lower() == "decomposed" and h3_status == "stable":
return None
missing: list[str] = []
for key in ("structure_phase", "h3_status", "approval_status", "reroute_target", "retry_budget_remaining"):
if key not in latest:
missing.append(key)
continue
if key in {"structure_phase", "h3_status"} and not str(latest.get(key) or "").strip():
missing.append(key)
if missing:
return (
"Section-first cutover cannot advance because the latest `outline_state.jsonl` record is missing "
f"required fields: {', '.join(missing)}."
)
reroute_label = reroute_target or "unknown"
retry_label = retry_budget or "unknown"
reason_suffix = f" Reason: {reroute_reason}." if reroute_reason else ""
return (
"Section-first cutover is still blocked; "
f"latest outline state has structure_phase={structure_phase or 'missing'}, "
f"h3_status={h3_status or 'missing'}, reroute_target={reroute_label}, "
f"retry_budget_remaining={retry_label}.{reason_suffix} Fix the reroute target explicitly, then rerun this unit."
)
def _append_run_error(*, workspace: Path, unit_id: str, skill: str, kind: str, message: str, log_rel: str | None) -> None:
    """Append a short failure record to `output/RUN_ERRORS.md` (workspace-local).

    This is a human-facing error sink that survives reruns and makes BLOCKED states debuggable.
    """
    try:
        ensure_dir(workspace / "output")
        out_path = workspace / "output" / "RUN_ERRORS.md"
        stamp = now_iso_seconds()
        log_hint = f" (log: `{log_rel}`)" if log_rel else ""
        line = f"- {stamp} `{unit_id}` `{skill}` `{kind}`: {message}{log_hint}"
        if out_path.exists() and out_path.stat().st_size > 0:
            prev = out_path.read_text(encoding="utf-8", errors="ignore").rstrip() + "\n"
        else:
            prev = "# Run errors\n\n"
        atomic_write_text(out_path, prev + line + "\n")
    except Exception:
        # Never let the runner crash while trying to log an error.
        return
def run_one_unit(
*,
workspace: Path,
repo_root: Path,
strict: bool = False,
auto_approve: set[str] | None = None,
) -> RunResult:
units_path = workspace / "UNITS.csv"
status_path = workspace / "STATUS.md"
if not units_path.exists():
return RunResult(unit_id=None, status="ERROR", message=f"Missing {units_path}")
table = UnitsTable.load(units_path)
runnable_idx = _find_first_runnable(table)
if runnable_idx is None:
return RunResult(unit_id=None, status="IDLE", message="No runnable unit found")
row = table.rows[runnable_idx]
unit_id = row.get("unit_id", "").strip()
skill = row.get("skill", "").strip()
owner = row.get("owner", "").strip().upper()
row["status"] = "DOING"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DOING {skill}")
auto_approve_set = {str(x or "").strip().upper() for x in (auto_approve or set()) if str(x or "").strip()}
if owner == "HUMAN":
checkpoint = row.get("checkpoint", "").strip()
if checkpoint and decisions_has_approval(workspace / "DECISIONS.md", checkpoint):
row["status"] = "DONE"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (HUMAN approved {checkpoint})")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="DONE", message=f"HUMAN approved {checkpoint}")
if checkpoint and checkpoint.upper() in auto_approve_set:
set_decisions_approval(workspace / "DECISIONS.md", checkpoint, approved=True)
row["status"] = "DONE"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (AUTO approved {checkpoint})")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="DONE", message=f"AUTO approved {checkpoint}")
row["status"] = "BLOCKED"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (await HUMAN approval {checkpoint})")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Await HUMAN approval {checkpoint} in DECISIONS.md")
script_path = repo_root / "scripts" / "run.py"
if not script_path.exists():
row["status"] = "BLOCKED"
table.save(units_path)
skill_md = "SKILL.md"
update_status_log(
status_path,
(
f"{now_iso_seconds()} {unit_id} BLOCKED "
f"(no script for {skill}; run manually per {skill_md} then mark DONE)"
),
)
_refresh_status_checkpoint(status_path, table)
return RunResult(
unit_id=unit_id,
status="BLOCKED",
message=(
f"No executable script for skill '{skill}'. "
f"Run it manually by following `{skill_md}`, write the required outputs, "
f"then mark the unit DONE (e.g., `python scripts/pipeline.py mark --workspace {workspace} --unit-id {unit_id} --status DONE`)."
),
)
inputs = parse_semicolon_list(row.get("inputs"))
raw_outputs = parse_semicolon_list(row.get("outputs"))
outputs = [_strip_optional_marker(rel) for rel in raw_outputs]
required_outputs = [outputs[i] for i, rel in enumerate(raw_outputs) if not rel.strip().startswith("?")]
checkpoint = row.get("checkpoint", "").strip()
cmd = [
sys.executable,
str(script_path),
"--workspace",
str(workspace),
"--unit-id",
unit_id,
"--inputs",
";".join(inputs),
"--outputs",
";".join(outputs),
"--checkpoint",
checkpoint,
]
log_rel = f"output/unit_logs/{unit_id}.{skill}.log"
log_path = workspace / log_rel
try:
completed = subprocess.run(cmd, check=False, capture_output=True, text=True)
if completed.stdout or completed.stderr or completed.returncode != 0:
ensure_dir(log_path.parent)
body = [
"# Unit log\n",
f"- unit_id: {unit_id}\n",
f"- skill: {skill}\n",
f"- exit: {completed.returncode}\n",
f"- cmd: {' '.join(cmd)}\n",
"\n## stdout\n\n",
(completed.stdout or "(empty)") + "\n",
"\n## stderr\n\n",
(completed.stderr or "(empty)") + "\n",
]
atomic_write_text(log_path, "".join(body))
except Exception as exc: # pragma: no cover
row["status"] = "BLOCKED"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (exec error)")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="exec_error",
message=f"{type(exc).__name__}: {exc}",
log_rel=None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=str(exc))
missing = [rel for rel in required_outputs if rel and not (workspace / rel).exists()]
if completed.returncode == 0 and not missing:
if strict:
from tooling.quality_gate import check_unit_outputs, write_quality_report
try:
issues = check_unit_outputs(skill=skill, workspace=workspace, outputs=outputs)
except Exception as exc: # pragma: no cover
from tooling.quality_gate import QualityIssue
issues = [
QualityIssue(
code="quality_gate_exception",
message=f"Quality gate crashed: {type(exc).__name__}: {exc}",
)
]
# Avoid confusing stale QUALITY_GATE.md after a successful run.
report_path = workspace / "output" / "QUALITY_GATE.md"
if issues or report_path.exists():
write_quality_report(workspace=workspace, unit_id=unit_id, skill=skill, issues=issues)
if issues:
row["status"] = "BLOCKED"
table.save(units_path)
rel_report = str((workspace / "output" / "QUALITY_GATE.md").relative_to(workspace))
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (quality gate: {rel_report})")
_refresh_status_checkpoint(status_path, table)
reroute_hint = _reroute_hint(workspace)
return RunResult(
unit_id=unit_id,
status="BLOCKED",
message=f"Quality gate failed; see {rel_report}" + (f"; {reroute_hint}" if reroute_hint else ""),
)
cutover_block = _section_first_cutover_block_message(workspace=workspace, outputs=outputs)
if cutover_block:
row["status"] = "BLOCKED"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (section-first cutover)")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="section_first_cutover",
message=cutover_block,
log_rel=log_rel if log_path.exists() else None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=cutover_block)
row["status"] = "DONE"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE {skill}")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="DONE", message="OK")
row["status"] = "BLOCKED"
table.save(units_path)
if missing:
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (missing outputs: {', '.join(missing)})")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="missing_outputs",
message=f"Missing outputs: {', '.join(missing)}",
log_rel=log_rel if log_path.exists() else None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Missing outputs: {', '.join(missing)}" + (f"; see {log_rel}" if log_path.exists() else ""))
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (script failed)")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="script_failed",
message=f"Skill script failed (exit {completed.returncode})",
log_rel=log_rel if log_path.exists() else None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Skill script failed (exit {completed.returncode})" + (f"; see {log_rel}" if log_path.exists() else ""))
def _find_first_runnable(table: UnitsTable) -> int | None:
    status_ok = {"DONE", "SKIP"}
    unit_by_id = {row.get("unit_id", ""): row for row in table.rows}
    for idx, row in enumerate(table.rows):
        if row.get("status", "").strip().upper() not in {"TODO", "BLOCKED"}:
            continue
        deps = parse_semicolon_list(row.get("depends_on"))
        if not deps:
            return idx
        deps_done = True
        for dep_id in deps:
            dep = unit_by_id.get(dep_id)
            if not dep:
                deps_done = False
                break
            if dep.get("status", "").strip().upper() not in status_ok:
                deps_done = False
                break
        if deps_done:
            return idx
    return None
def _refresh_status_checkpoint(status_path: Path, table: UnitsTable) -> None:
checkpoint = _compute_current_checkpoint(table)
update_status_field(status_path, "Current checkpoint", checkpoint)
def _compute_current_checkpoint(table: UnitsTable) -> str:
    for row in table.rows:
        if row.get("status", "").strip().upper() not in {"DONE", "SKIP"}:
            return (row.get("checkpoint") or "").strip() or "C0"
    return "DONE"
def invalidate_downstream_units(table: UnitsTable, *, root_unit_id: str) -> list[str]:
    """Reset all transitive downstream dependents of `root_unit_id` to TODO.

    This is used when a previously satisfied upstream unit is reopened for rerun.
    Keeping downstream units as DONE would otherwise leave stale artifacts in place
    and make later `run` invocations stop too early.
    """
    root = str(root_unit_id or "").strip()
    if not root:
        return []
    # Map each unit id to the rows that list it in `depends_on`.
    direct_children: dict[str, list[dict[str, str]]] = {}
    for row in table.rows:
        for dep in parse_semicolon_list(row.get("depends_on")):
            direct_children.setdefault(dep, []).append(row)
    affected: list[str] = []
    seen: set[str] = set()
    stack = [root]
    while stack:
        current = stack.pop()
        for child in direct_children.get(current, []):
            child_id = str(child.get("unit_id") or "").strip()
            if not child_id or child_id in seen:
                continue
            seen.add(child_id)
            if str(child.get("status") or "").strip().upper() != "TODO":
                child["status"] = "TODO"
                affected.append(child_id)
            # Keep walking even when the child was already TODO so deeper dependents are reset too.
            stack.append(child_id)
    return affected
def _strip_optional_marker(relpath: str) -> str:
    relpath = (relpath or "").strip()
    if relpath.startswith("?"):
        return relpath[1:].strip()
    return relpath
def _reroute_hint(workspace: Path) -> str:
    path = workspace / "output" / "REROUTE_STATE.json"
    if not path.exists() or path.stat().st_size <= 0:
        return ""
    try:
        import json
        data = json.loads(path.read_text(encoding="utf-8", errors="ignore") or "{}")
    except Exception:
        return ""
    if not isinstance(data, dict):
        return ""
    target = str(data.get("reroute_target") or "").strip()
    status = str(data.get("status") or "").strip()
    phase = str(data.get("structure_phase") or "").strip()
    h3 = str(data.get("h3_status") or "").strip()
    reason = str(data.get("reroute_reason") or "").strip()
    if not any([target, status, phase, h3]):
        return ""
    parts = []
    if status:
        parts.append(f"reroute_status={status}")
    if target:
        parts.append(f"reroute_target={target}")
    if phase:
        parts.append(f"structure_phase={phase}")
    if h3:
        parts.append(f"h3_status={h3}")
    if reason:
        parts.append(f"reason={reason}")
    return ", ".join(parts)
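
The dependency gating and downstream invalidation in the executor above can be sketched in isolation. This is an approximation under stated assumptions, not the module's code: plain dicts stand in for `UnitsTable` rows, the hypothetical `split_deps` helper assumes `depends_on` is a semicolon-separated list of unit ids, the traversal is assumed to continue through children that are already `TODO`, and the four-unit table is invented for illustration.

```python
from __future__ import annotations


def split_deps(row: dict[str, str]) -> list[str]:
    # Assumption: depends_on is a semicolon-separated list of unit ids.
    return [d.strip() for d in (row.get("depends_on") or "").split(";") if d.strip()]


def first_runnable(rows: list[dict[str, str]]) -> int | None:
    # A unit is runnable when it is TODO/BLOCKED and every dependency is DONE or SKIP.
    by_id = {r["unit_id"]: r for r in rows}
    for idx, row in enumerate(rows):
        if row["status"] not in {"TODO", "BLOCKED"}:
            continue
        deps = split_deps(row)
        # An unknown dependency id yields status None, which blocks the unit.
        if all(by_id.get(d, {}).get("status") in {"DONE", "SKIP"} for d in deps):
            return idx
    return None


def invalidate_downstream(rows: list[dict[str, str]], root: str) -> list[str]:
    # Build a parent -> children map, then walk it depth-first from the root.
    children: dict[str, list[dict[str, str]]] = {}
    for row in rows:
        for dep in split_deps(row):
            children.setdefault(dep, []).append(row)
    affected: list[str] = []
    seen: set[str] = set()
    stack = [root]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            cid = child["unit_id"]
            if cid in seen:
                continue
            seen.add(cid)
            if child["status"] != "TODO":
                child["status"] = "TODO"
                affected.append(cid)
            stack.append(cid)
    return affected


rows = [
    {"unit_id": "U1", "status": "DONE", "depends_on": ""},
    {"unit_id": "U2", "status": "DONE", "depends_on": "U1"},
    {"unit_id": "U3", "status": "DONE", "depends_on": "U2"},
    {"unit_id": "U4", "status": "TODO", "depends_on": "U1;U3"},
]

before = first_runnable(rows)  # index of U4: all of its dependencies are DONE

# Reopen U1 for a rerun and reset everything that transitively depends on it.
rows[0]["status"] = "TODO"
affected = invalidate_downstream(rows, "U1")
after = first_runnable(rows)  # index of U1: it has no dependencies

print(before, affected, after)
```

Only units whose status actually changed are reported in `affected`, which is why `U4` (already `TODO`) is visited but not listed.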
FILE:tooling/ideation.py
from __future__ import annotations
import json
import re
from dataclasses import asdict, dataclass, is_dataclass
from pathlib import Path
from typing import Any, Iterable
from tooling.common import (
atomic_write_text,
ensure_dir,
load_workspace_pipeline_spec,
pipeline_overridable_query_fields,
pipeline_query_default,
read_jsonl,
)
DEFAULT_IDEA_RUBRIC: list[tuple[str, float, str]] = [
("discussion_worthiness", 0.24, "is this direction worth a serious PI/PhD discussion right now?"),
("academic_value", 0.22, "could it open a meaningful paper/thesis-worthy research angle?"),
("evidence_grounding", 0.18, "can we point to concrete literature tensions rather than pure speculation?"),
("direction_distinctness", 0.16, "is it genuinely different from neighboring directions, not just a wording variant?"),
("first_probe_clarity", 0.10, "is there a plausible low-cost first probe that would teach us something?"),
("thesis_potential", 0.10, "does it plausibly grow beyond a one-off note into a stronger line of work?"),
]
STOPWORDS = {
"a", "an", "and", "the", "of", "to", "for", "in", "on", "with", "via", "by", "from", "or",
"work", "works", "study", "studies", "survey", "review", "benchmarks", "benchmark", "evaluation",
"agents", "agent", "model", "models", "llm", "large", "language", "using", "based",
}
CLUSTER_AXIS_HINTS: list[tuple[list[str], list[str], str, str]] = [
(["agent loop", "action"], ["observability granularity", "action-space design", "tool/environment boundary"], "mechanism", "Could sharpen how we think about the basic agent loop rather than adding yet another full stack."),
(["tool", "orchestration", "interface"], ["tool routing", "permission model", "API reliability assumptions"], "systems", "Could clarify whether interface design is the real bottleneck behind seemingly agent-level gains."),
(["planning", "reasoning"], ["search depth", "verification loop", "partial-observability handling"], "mechanism", "Could turn vague talk about reasoning quality into a sharper research question about what actually drives robust planning."),
(["memory", "retrieval", "rag"], ["retrieval policy", "state summarization cadence", "memory horizon"], "mechanism", "Could reveal when memory is a genuine capability lever versus a reporting convenience."),
(["self-improvement", "adaptation"], ["feedback type", "adaptation horizon", "evaluation signal quality"], "adaptation", "Could distinguish genuine improvement from evaluation-sensitive self-optimization."),
(["multi-agent", "coordination"], ["role specialization", "communication budget", "aggregation rule"], "coordination", "Could separate coordination benefits from simple redundancy or verification effects."),
(["benchmark", "evaluation"], ["metric choice", "task slice design", "failure-analysis lens"], "evaluation", "Could make evaluation choices themselves a research object instead of hidden background assumptions."),
(["safety", "security", "governance"], ["threat model", "monitoring granularity", "permission boundary"], "governance", "Could connect technical agent behaviors to deployment-relevant risk questions in a more principled way."),
]
GENERIC_LIMITATION_PATTERNS = [
"abstract-level evidence only",
"validate assumptions",
"full paper",
"before relying on this as key evidence",
]
BENCHMARK_HINTS = [
"HotpotQA", "FEVER", "ALFWorld", "WebShop", "HumanEval", "WebArena", "AgentBench",
"GSM8K", "MATH", "Wikipedia", "wiki", "API", "question answering", "fact verification",
]
AXIS_INSIGHT_LIBRARY: dict[str, dict[str, str]] = {
"observability granularity": {
"question": "whether several agent-loop gains are really planner gains, or whether they mostly come from changing what the planner gets to observe and when",
"confound": "planner quality and broader agent competence",
"thesis": "Several agent-loop gains remain hard to interpret because papers often improve observation access at the same time they improve the planner; before crediting planner depth, we should ask what the system was allowed to see.",
"insight": "A convincing result would not just move aggregate score. It would show which published gains survive once observation access is fixed, ideally producing a regime map of when observability—not planner depth—changes the failure story.",
"contribution_shape": "Could yield a causal-attribution result plus a reporting rule for agent-loop papers: claims about planning quality should specify and control observation access.",
"demotion": "This direction weakens sharply if the strongest prior work already holds observation access fixed while planner quality changes and still reports the same gain pattern.",
"kill_signal": "an anchor paper already fixes observation access while varying planner quality and the main conclusion still survives",
"missing_piece": "What is missing is a fixed-interface, fixed-budget comparison that varies only observation access and tracks whether the failure taxonomy—not just average score—changes.",
"program_kind": "causal attribution",
"time_to_clarity": "fast",
"priority_note": "Fastest to falsify on public tasks, and the answer would immediately change how several agent-loop results are interpreted.",
"reading_extract": "what the agent sees at each step, which ablations vary planner depth, and whether the action interface stays fixed",
"title": "Observability granularity vs planner depth",
},
"action-space design": {
"question": "whether robustness comes from better reasoning or from giving the agent a cleaner action interface",
"confound": "the shape of the action vocabulary and interface design",
"thesis": "Some agent-loop gains may be interface gains in disguise: when action vocabularies become cleaner or narrower, papers can attribute robustness to reasoning improvements that partly come from action-space design.",
"insight": "A useful result would show whether the same nominal planner still behaves very differently once the action space is widened, normalized, or made less ergonomic.",
"contribution_shape": "Could produce an action-space normalization protocol and a clearer account of which agent claims survive once interface ergonomics are controlled.",
"demotion": "This direction weakens if current papers already compare equivalent action spaces and still obtain the same ranking.",
"kill_signal": "the key anchor papers already normalize action vocabularies or API surfaces and still see the same ordering",
"missing_piece": "What is missing is an action-space-normalized comparison that keeps planner prompts fixed while changing only interface granularity and affordances.",
"program_kind": "interface normalization",
"time_to_clarity": "medium",
"priority_note": "Interesting and still thesis-relevant, but less immediate than observability because interface normalization usually requires more careful task redesign.",
"reading_extract": "how many actions are available, how semantically aligned the actions are, and whether interface cleanup happens together with reasoning changes",
"title": "Action-space design or agent competence?",
},
"tool/environment boundary": {
"question": "whether reliability depends more on boundary design between tool and environment than on the nominal reasoning loop",
"confound": "tool abstraction and environment modeling",
"thesis": "A number of agent results may really be about boundary design: where the system stops reasoning internally and starts delegating to tools can change reliability without changing the nominal planner much.",
"insight": "A strong result would show that several agent-level conclusions are actually boundary-design conclusions once tool abstraction is normalized.",
"contribution_shape": "Could yield a boundary-design taxonomy and a cleaner experimental recipe for separating planner quality from tool/interface scaffolding.",
"demotion": "This direction weakens if changing the boundary carefully barely affects the failure story.",
"kill_signal": "careful tool-boundary changes leave both the metric and the qualitative failure modes essentially unchanged",
"missing_piece": "What is missing is a comparison that keeps task semantics fixed while moving the tool/environment boundary in a controlled way.",
"program_kind": "systems boundary",
"time_to_clarity": "medium",
"priority_note": "Valuable as a systems-facing thesis line, but slower to turn into a decisive first readout than the lead mechanism questions.",
"reading_extract": "where tool calls begin, what state is exposed to the planner, and whether environment modeling changes together with the tool boundary",
"title": "Where the tool boundary really matters",
},
"search depth": {
"question": "whether apparent planning gains reflect deeper search or simply more inference-time budget to recover from weak initial choices",
"confound": "inference-time compute budget",
"thesis": "Many planning results still bundle depth with budget: deeper search often spends more tokens, branches, or retries, so it remains unclear whether depth changes reasoning quality or just buys more recovery opportunities.",
"insight": "A convincing result would show whether depth changes the nature of planning failures under a matched budget, rather than merely delaying failure by spending more compute.",
"contribution_shape": "Could produce a compute-normalized planning benchmark slice and a regime map for when search depth matters beyond extra inference budget.",
"demotion": "This direction weakens if prior work already equalizes budget and still finds depth-specific gains.",
"kill_signal": "the anchor papers already normalize token or wall-clock budget and the depth advantage remains intact",
"missing_piece": "What is missing is a compute-normalized study that holds token or wall-clock budget fixed while varying depth or branching, then checks whether failure modes actually change.",
"program_kind": "budget-normalized mechanism",
"time_to_clarity": "medium",
"priority_note": "Still thesis-sized, but it ranks behind observability because compute normalization is harder to defend and easier for prior work to have addressed already.",
"reading_extract": "the reported depth or branching settings, the effective compute budget, and whether shallow baselines were budget-matched",
"title": "Search depth or compute budget?",
},
"verification loop": {
"question": "whether verification contributes a distinct reasoning mechanism or mostly acts as expensive redundancy",
"confound": "extra compute and repeated checking",
"thesis": "Verification-heavy pipelines may look stronger because they retry, re-check, or filter more often, not necessarily because they add a distinct reasoning mechanism.",
"insight": "A useful result would separate verification as a mechanism from verification as a compute-expensive retry policy.",
"contribution_shape": "Could produce a cleaner protocol for comparing verification loops under matched retry or compute budgets.",
"demotion": "This direction weakens if verification can be replaced by simple repetition with little change in behavior.",
"kill_signal": "repeat-sampling or retry baselines already match the reported verification gain",
"missing_piece": "What is missing is a matched-budget comparison between explicit verification and simple repetition or reranking baselines.",
"program_kind": "verification audit",
"time_to_clarity": "medium",
"priority_note": "Decision-relevant when the group suspects redundancy masquerading as reasoning, though the probe is less clean than the top two directions.",
"reading_extract": "how many extra passes are spent on checking, whether retry baselines exist, and which errors verification actually fixes",
"title": "Verification or just expensive redundancy?",
},
"partial-observability handling": {
"question": "whether planning systems fail because they reason badly or because they are brittle under missing state information",
"confound": "state visibility",
"thesis": "Some planning failures may be epistemic before they are algorithmic: systems can look like weak planners when they are actually brittle under missing state information.",
"insight": "A strong result would show whether improved visibility changes the failure taxonomy more than planner changes do.",
"contribution_shape": "Could produce a planning-under-partial-observability benchmark slice and a clearer division between epistemic and algorithmic failure modes.",
"demotion": "This direction weakens if stronger visibility barely changes the failure pattern.",
"kill_signal": "improved visibility leaves both success rates and error types almost unchanged",
"missing_piece": "What is missing is a task slice that varies only state visibility while keeping planner scaffolding steady.",
"program_kind": "epistemic failure analysis",
"time_to_clarity": "medium",
"priority_note": "Useful if the group wants a deeper failure-analysis program, but it is currently less concrete than the lead directions.",
"reading_extract": "what information is hidden, which observations are restored by tools, and whether planner quality is tested separately from visibility",
"title": "Is planning failure really an observability problem?",
},
"retrieval policy": {
"question": "whether memory gains come from better stored knowledge or from better decisions about when and how retrieval is triggered",
"confound": "memory content versus retrieval timing",
"thesis": "Reported memory gains are often hard to interpret because memory content and retrieval triggers move together; some improvements may be protocol effects about when retrieval happens, not capability effects about what memory stores.",
"insight": "A convincing result would separate memory-content effects from retrieval-trigger effects and show whether the interpretation of current RAG-style gains changes once trigger policy is normalized.",
"contribution_shape": "Could yield a cleaner memory-evaluation protocol and a reusable result on when retrieval policy, rather than memory content, is the true hidden variable.",
"demotion": "This direction weakens if current work already varies retrieval policy independently of memory content and sees the same story.",
"kill_signal": "the main memory papers already keep the memory store fixed while varying retrieval triggers or timing and the conclusion does not move",
"missing_piece": "What is missing is a fixed-memory comparison that varies retrieval trigger and timing only, then checks whether the gain survives and which errors it actually removes.",
"program_kind": "protocol sensitivity",
"time_to_clarity": "medium",
"priority_note": "Different enough from the planning directions to stay in the lead set, but it ranks lower because the current evidence is thinner and more protocol-sensitive.",
"reading_extract": "what is stored in memory, when retrieval fires, what retrieval budget is allowed, and whether trigger policy is ever ablated independently",
"title": "Retrieval policy or memory content?",
},
"state summarization cadence": {
"question": "whether summarization helps because it improves reasoning or because it selectively compresses away troublesome state information",
"confound": "what gets forgotten versus what gets highlighted",
"thesis": "Summarization may help less because the agent reasons better and more because the system filters state in a favorable way; cadence choices can quietly redefine what information survives.",
"insight": "A useful result would show whether different summarization cadences preserve the same conclusions and failure taxonomy once memory content is held steady.",
"contribution_shape": "Could produce a summarization-control protocol for long-horizon agents and a clearer account of how state compression alters conclusions.",
"demotion": "This direction weakens if different summarization cadences preserve the same conclusions and failure taxonomy.",
"kill_signal": "changing summarization cadence leaves both retrieval behavior and downstream errors largely unchanged",
"missing_piece": "What is missing is a fixed-memory, fixed-task comparison that varies only summarization cadence and records what state is lost or amplified.",
"program_kind": "state compression",
"time_to_clarity": "medium",
"priority_note": "Promising but currently less mature than retrieval-policy framing because the concrete prior-work hooks are thinner.",
"reading_extract": "what gets summarized away, how often summaries are rewritten, and whether failure cases correlate with state compression choices",
"title": "What summarization cadence is really changing",
},
"memory horizon": {
"question": "whether longer memory helps because more context is useful or because evaluations reward persistence over selectivity",
"confound": "context length and evaluation design",
"thesis": "Longer memory can look like better capability even when the benchmark mainly rewards persistence or repeated access to earlier state, so horizon effects may partly be evaluation-design effects.",
"insight": "A convincing result would show whether extending horizon changes task framing and error patterns, not just aggregate score.",
"contribution_shape": "Could produce a regime map for when horizon length matters, versus when selective retrieval or better task design explains the gain.",
"demotion": "This direction weakens if memory horizon has little effect once retrieval policy is controlled.",
"kill_signal": "memory-length changes stop mattering once retrieval triggers or evaluation design are normalized",
"missing_piece": "What is missing is a horizon study that matches retrieval policy and benchmark framing while varying how much state is retained.",
"program_kind": "regime mapping",
"time_to_clarity": "medium",
"priority_note": "Conceptually useful, but currently less decisive than retrieval-policy framing for a first discussion round.",
"reading_extract": "whether longer context actually changes retrieval choices, and whether the benchmark rewards persistence more than selective recall",
"title": "Does longer memory really help?",
},
}
@dataclass(frozen=True)
class IdeaSignal:
signal_id: str
cluster: str
direction_type: str
theme: str
claim_or_observation: str
tension: str
missing_piece: str
possible_axis: str
academic_value: str
evidence_confidence: str
paper_ids: list[str]
@dataclass(frozen=True)
class DirectionCard:
direction_id: str
cluster: str
direction_type: str
title: str
focus_axis: str
main_confound: str
program_kind: str
contribution_shape: str
time_to_clarity: str
one_line_thesis: str
why_interesting: str
literature_suggests: list[str]
closest_prior_gap: list[str]
missing_piece: str
possible_variants: list[str]
academic_value: str
first_probes: list[str]
what_counts_as_insight: str
weakness_conditions: list[str]
kill_criteria: list[str]
what_would_change_mind: list[str]
best_fit: str
why_this_ranks_here: str
evidence_confidence: str
paper_ids: list[str]
signal_ids: list[str]
anchor_reading_notes: list[dict[str, str]]
@dataclass(frozen=True)
class ScreenedDirection:
direction_id: str
cluster: str
direction_type: str
title: str
total_score: float
discussion_worthiness: int
academic_value_score: int
evidence_grounding: int
direction_distinctness: int
first_probe_clarity: int
thesis_potential: int
recommendation: str
rationale: str
def read_core_set(path: Path) -> list[dict[str, str]]:
import csv
if not path.exists():
return []
with path.open("r", encoding="utf-8", newline="") as handle:
return [dict(row) for row in csv.DictReader(handle)]
def write_jsonl(path: Path, rows: Iterable[dict[str, Any] | Any]) -> None:
ensure_dir(path.parent)
encoded: list[str] = []
for row in rows:
value = asdict(row) if is_dataclass(row) else row
encoded.append(json.dumps(value, ensure_ascii=False))
atomic_write_text(path, "\n".join(encoded).rstrip() + ("\n" if encoded else ""))
def write_json(path: Path, data: Any) -> None:
ensure_dir(path.parent)
atomic_write_text(path, json.dumps(data, ensure_ascii=False, indent=2).rstrip() + "\n")
def uniq_keep_order(items: Iterable[str]) -> list[str]:
out: list[str] = []
seen: set[str] = set()
for item in items:
value = str(item or "").strip()
if not value or value in seen:
continue
seen.add(value)
out.append(value)
return out
def slugify(text: str) -> str:
raw = re.sub(r"[^A-Za-z0-9]+", "-", str(text or "").strip().lower())
return raw.strip("-") or "x"
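A quick usage sketch for the two string helpers above, `uniq_keep_order` and `slugify`. The bodies are copied from this section with indentation restored (an assumption about the original layout), and the sample inputs are invented for illustration.

```python
import re
from typing import Iterable


def uniq_keep_order(items: Iterable[str]) -> list[str]:
    # Strip each item, drop empties, and keep only the first occurrence.
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or "").strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out


def slugify(text: str) -> str:
    # Collapse every run of non-alphanumerics into one hyphen; fall back to "x".
    raw = re.sub(r"[^A-Za-z0-9]+", "-", str(text or "").strip().lower())
    return raw.strip("-") or "x"


# Hypothetical sample inputs.
print(uniq_keep_order([" P12", "P07", "P12", "", "P07 "]))  # ['P12', 'P07']
print(slugify("Search depth or compute budget?"))  # search-depth-or-compute-budget
print(slugify("???"))  # x
```

Because `slugify` never returns an empty string, identifiers built from it (such as the `SIG-...` signal IDs later in the module) always have a non-empty slug segment.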
def clean_text(text: str, *, limit: int = 220) -> str:
s = str(text or "").strip()
s = s.replace("\n", " ")
s = re.sub(r"\s+", " ", s)
s = s.replace("|", ", ")
s = s.strip(" \"'`")
if len(s) <= limit:
return s
clipped = s[:limit].rsplit(" ", 1)[0].strip()
return clipped if clipped else s[:limit].strip()
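A minimal sketch of how `clean_text` behaves on two invented inputs. The body is the function above with indentation restored. The comment about table cells is an inference from how the result is used elsewhere in this section, not something the source states.

```python
import re


def clean_text(text: str, *, limit: int = 220) -> str:
    s = str(text or "").strip()
    s = s.replace("\n", " ")
    s = re.sub(r"\s+", " ", s)
    # Assumption: pipes are replaced so the value can sit inside a Markdown table cell.
    s = s.replace("|", ", ")
    s = s.strip(" \"'`")
    if len(s) <= limit:
        return s
    # Cut at the limit, then back up to the previous space to avoid a split word.
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()


print(clean_text("Tool use|planning\n  results"))  # Tool use, planning results
print(clean_text("alpha beta gamma", limit=12))    # alpha beta
```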
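def clean_sentence(text: str, *, limit: int = 180) -> str:
    """Trim `text` to roughly `limit` characters, preferring sentence boundaries."""
    s = clean_text(text, limit=max(limit * 2, limit))
    if len(s) <= limit:
        return s
    # Look a little past the limit so a sentence ending just beyond it can still be split off.
    window = s[:limit + 40]
    pieces = re.split(r'(?<=[.!?;])\s+', window)
    acc: list[str] = []
    total = 0
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue
        nxt = total + len(piece) + (1 if acc else 0)
        # The first piece is always kept, even when it alone exceeds the limit.
        if nxt > limit and acc:
            break
        acc.append(piece)
        total = nxt
        # Stop once the kept sentences fill about 65% of the limit.
        if total >= int(limit * 0.65):
            break
    if acc:
        return " ".join(acc).strip()
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()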
def markdown_table(headers: list[str], rows: list[list[str]]) -> str:
head = "| " + " | ".join(headers) + " |"
sep = "| " + " | ".join(["---"] * len(headers)) + " |"
body = ["| " + " | ".join(str(cell).replace("\n", " ").strip() for cell in row) + " |" for row in rows]
return "\n".join([head, sep] + body)
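A short usage sketch for `markdown_table`, with the body copied from above and indentation restored. The header names and row values are hypothetical. In this section the only caller shown is `signal_table_markdown`, which passes `IdeaSignal` fields.

```python
def markdown_table(headers: list[str], rows: list[list[str]]) -> str:
    head = "| " + " | ".join(headers) + " |"
    sep = "| " + " | ".join(["---"] * len(headers)) + " |"
    # Newlines inside a cell are flattened to spaces so each row stays on one line.
    body = ["| " + " | ".join(str(cell).replace("\n", " ").strip() for cell in row) + " |" for row in rows]
    return "\n".join([head, sep] + body)


# Hypothetical row for illustration.
table = markdown_table(["Signal ID", "Cluster"], [["SIG-planning-1", "Planning\nand search"]])
print(table)
# | Signal ID | Cluster |
# | --- | --- |
# | SIG-planning-1 | Planning and search |
```

Note that the function flattens newlines but does not escape `|` inside cells, so callers need to sanitize pipes beforehand (for example with `clean_text`).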
def write_markdown(path: Path, text: str) -> None:
ensure_dir(path.parent)
atomic_write_text(path, text.rstrip() + "\n")
def extract_goal_from_goal_md(path: Path) -> str:
if not path.exists():
return "research ideas"
lines = [ln.strip() for ln in path.read_text(encoding="utf-8", errors="ignore").splitlines() if ln.strip() and not ln.startswith("#")]
return lines[0] if lines else "research ideas"
def _idea_workspace_from_brief(path: Path) -> Path | None:
try:
return path.resolve().parents[2]
except Exception:
return None
def _require_positive_float(value: Any, *, field_name: str) -> float:
try:
parsed = float(value)
except Exception as exc:
raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
if parsed <= 0:
raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
return parsed
def _query_int_override(workspace: Path, key: str, default: int) -> int:
    """Return a positive integer override for `key` from queries.md, else `default`."""
    queries_path = workspace / "queries.md"
    normalized = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
    if not normalized:
        return int(default)
    if normalized not in pipeline_overridable_query_fields(workspace):
        return int(default)
    if not queries_path.exists():
        return int(default)
    # Scan `- key: value` bullets; the first bullet whose key matches decides the result.
    for raw in queries_path.read_text(encoding="utf-8", errors="ignore").splitlines():
        line = raw.strip()
        if not line.startswith("- ") or ":" not in line:
            continue
        line_key, value = line[2:].split(":", 1)
        line_key = line_key.strip().lower().replace(" ", "_").replace("-", "_")
        if line_key != normalized:
            continue
        # Drop a trailing `# comment` and surrounding quotes before parsing.
        cleaned = value.split("#", 1)[0].strip().strip('"').strip("'")
        if not cleaned:
            raise ValueError(f"Invalid ideation override: `{normalized}` is empty in queries.md.")
        try:
            parsed = int(cleaned)
        except Exception as exc:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be an integer in queries.md.") from exc
        if parsed <= 0:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be a positive integer in queries.md.")
        return parsed
    return int(default)
def _validate_score_weights(value: Any, *, field_name: str) -> dict[str, float]:
required = {
"discussion_worthiness": 0.24,
"academic_value": 0.22,
"evidence_grounding": 0.18,
"direction_distinctness": 0.16,
"first_probe_clarity": 0.10,
"thesis_potential": 0.10,
}
if not isinstance(value, dict):
raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
weights: dict[str, float] = {}
for key in required:
if key not in value:
raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{key}")
weights[key] = _require_positive_float(value.get(key), field_name=f"{field_name}.{key}")
total = sum(weights.values())
if total <= 0:
raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
return {key: round(weight / total, 6) for key, weight in weights.items()}
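The normalization step in `_validate_score_weights` can be sketched in isolation. The helper name `normalize_weights` and the 0-100 sample weights are hypothetical. As far as this section shows, the numeric values in `required` are not used as fallbacks: only the keys are checked, and every weight must be supplied by the contract.

```python
REQUIRED_KEYS = (
    "discussion_worthiness",
    "academic_value",
    "evidence_grounding",
    "direction_distinctness",
    "first_probe_clarity",
    "thesis_potential",
)


def normalize_weights(raw: dict[str, float]) -> dict[str, float]:
    """Hypothetical helper: require every key, then rescale so the weights sum to 1."""
    weights: dict[str, float] = {}
    for key in REQUIRED_KEYS:
        if key not in raw:
            raise ValueError(f"missing weight: {key}")
        parsed = float(raw[key])
        if parsed <= 0:
            raise ValueError(f"weight must be positive: {key}")
        weights[key] = parsed
    total = sum(weights.values())
    return {key: round(weight / total, 6) for key, weight in weights.items()}


# Hypothetical contract values on a 0-100 scale; they come back on a 0-1 scale.
scaled = normalize_weights({
    "discussion_worthiness": 24,
    "academic_value": 22,
    "evidence_grounding": 18,
    "direction_distinctness": 16,
    "first_probe_clarity": 10,
    "thesis_potential": 10,
})
print(scaled["discussion_worthiness"])  # 0.24
```

Rescaling by the total means a contract can state weights on any positive scale, and only their ratios matter for the final score.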
def _validate_diversity_axes(value: Any, *, field_name: str) -> list[str]:
allowed = {"cluster", "direction_type", "program_kind"}
if not isinstance(value, list):
raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
axes: list[str] = []
seen: set[str] = set()
for item in value:
axis = str(item or "").strip().lower().replace("-", "_")
if not axis:
continue
if axis not in allowed:
raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{axis}")
if axis in seen:
continue
seen.add(axis)
axes.append(axis)
if not axes:
raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
return axes
def resolve_idea_contract(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
raise ValueError("Missing active pipeline contract.")
query_defaults = dict(spec.query_defaults)
quality_contract = dict(spec.quality_contract)
def _require_positive_int(value: Any, *, field_name: str) -> int:
try:
parsed = int(value)
except Exception as exc:
raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
if parsed <= 0:
raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
return parsed
signal_policy = quality_contract.get("signal_policy") or {}
direction_policy = quality_contract.get("direction_policy") or {}
screening_policy = quality_contract.get("screening_policy") or {}
brief = parse_idea_brief(workspace / "output" / "trace" / "IDEA_BRIEF.md")
focus_clusters = [str(x).strip() for x in (brief.get("focus_clusters") or []) if str(x).strip()]
direction_pool_min_default = _require_positive_int(query_defaults.get("direction_pool_min"), field_name="query_defaults.direction_pool_min")
direction_pool_max_default = _require_positive_int(query_defaults.get("direction_pool_max"), field_name="query_defaults.direction_pool_max")
shortlist_size_default = _require_positive_int(query_defaults.get("idea_shortlist_size"), field_name="query_defaults.idea_shortlist_size")
report_top_n_default = _require_positive_int(query_defaults.get("report_top_n"), field_name="query_defaults.report_top_n")
idea_screen_top_n_default = _require_positive_int(query_defaults.get("idea_screen_top_n"), field_name="query_defaults.idea_screen_top_n")
shortlist_min = _require_positive_int(direction_policy.get("shortlist_min"), field_name="quality_contract.direction_policy.shortlist_min")
shortlist_max = _require_positive_int(direction_policy.get("shortlist_max"), field_name="quality_contract.direction_policy.shortlist_max")
keep_min = _require_positive_int(direction_policy.get("keep_min"), field_name="quality_contract.direction_policy.keep_min")
cluster_diversity_min = _require_positive_int(direction_policy.get("cluster_diversity_min"), field_name="quality_contract.direction_policy.cluster_diversity_min")
lead_diversity_target = _require_positive_int(direction_policy.get("lead_diversity_target"), field_name="quality_contract.direction_policy.lead_diversity_target")
lead_diversity_axes = _validate_diversity_axes(
direction_policy.get("lead_diversity_axes"),
field_name="quality_contract.direction_policy.lead_diversity_axes",
)
signal_table_min = _require_positive_int(signal_policy.get("min_rows"), field_name="quality_contract.signal_policy.min_rows")
keep_rank_max = _require_positive_int(screening_policy.get("keep_rank_max"), field_name="quality_contract.screening_policy.keep_rank_max")
maybe_rank_max = _require_positive_int(screening_policy.get("maybe_rank_max"), field_name="quality_contract.screening_policy.maybe_rank_max")
score_weights = _validate_score_weights(
screening_policy.get("score_weights"),
field_name="quality_contract.screening_policy.score_weights",
)
direction_pool_min = _query_int_override(workspace, "direction_pool_min", direction_pool_min_default)
direction_pool_max = _query_int_override(workspace, "direction_pool_max", direction_pool_max_default)
shortlist_size = _query_int_override(workspace, "idea_shortlist_size", shortlist_size_default)
report_top_n = _query_int_override(workspace, "report_top_n", report_top_n_default)
idea_screen_top_n = _query_int_override(workspace, "idea_screen_top_n", idea_screen_top_n_default)
if direction_pool_min > direction_pool_max:
raise ValueError(f"Invalid ideation contract: direction_pool_min ({direction_pool_min}) exceeds direction_pool_max ({direction_pool_max}).")
if shortlist_size < shortlist_min or shortlist_size > shortlist_max:
raise ValueError(
f"Invalid ideation contract: idea_shortlist_size ({shortlist_size}) must stay within "
f"[{shortlist_min}, {shortlist_max}]."
)
if report_top_n > shortlist_size:
raise ValueError(f"Invalid ideation contract: report_top_n ({report_top_n}) exceeds idea_shortlist_size ({shortlist_size}).")
if maybe_rank_max < keep_rank_max:
raise ValueError(
f"Invalid ideation contract: maybe_rank_max ({maybe_rank_max}) must be >= keep_rank_max ({keep_rank_max})."
)
if idea_screen_top_n > direction_pool_max:
raise ValueError(f"Invalid ideation contract: idea_screen_top_n ({idea_screen_top_n}) exceeds direction_pool_max ({direction_pool_max}).")
return {
"focus_clusters": focus_clusters,
"direction_pool_min": direction_pool_min,
"direction_pool_max": direction_pool_max,
"idea_screen_top_n": idea_screen_top_n,
"shortlist_size": shortlist_size,
"report_top_n": report_top_n,
"shortlist_min": shortlist_min,
"shortlist_max": shortlist_max,
"signal_table_min": signal_table_min,
"keep_min": keep_min,
"cluster_diversity_min": cluster_diversity_min,
"lead_diversity_target": lead_diversity_target,
"lead_diversity_axes": lead_diversity_axes,
"keep_rank_max": keep_rank_max,
"maybe_rank_max": maybe_rank_max,
"score_weights": score_weights,
}
def parse_idea_brief(path: Path) -> dict[str, Any]:
text = path.read_text(encoding="utf-8", errors="ignore") if path.exists() else ""
out: dict[str, Any] = {
"goal": extract_goal_from_goal_md(path),
"focus_clusters": [],
"query_buckets": [],
"exclusions": [],
"constraints": [],
"targets": {},
}
cur = None
for raw in text.splitlines():
line = raw.rstrip()
if line.startswith("## "):
cur = line[3:].strip().lower()
continue
if cur == "goal" and line.startswith("- Topic:"):
out["goal"] = line.split(":", 1)[1].strip()
if cur in {"focus after c2", "focus lenses after c2"} and line.startswith("- Focus clusters:"):
clusters = [x.strip() for x in line.split(":", 1)[1].split(";") if x.strip()]
cleaned: list[str] = []
for item in clusters:
low = item.lower()
if "to be filled" in low or "fill after c2" in low or "placeholder" in low:
continue
cleaned.append(item)
out["focus_clusters"] = cleaned
if cur == "query buckets" and re.match(r"^\d+\.\s+", line):
out["query_buckets"].append(re.sub(r"^\d+\.\s+", "", line).strip())
if cur in {"exclude terms", "exclusions"} and line.startswith("- "):
value = line[2:].strip()
if value and not value.lower().startswith("none"):
out["exclusions"].append(value)
if cur == "constraints" and line.startswith("- "):
out["constraints"].append(line[2:].strip())
if cur == "targets" and line.startswith("- ") and ":" in line:
key, value = line[2:].split(":", 1)
out["targets"][key.strip().lower().replace(" ", "_")] = value.strip()
return out
def collect_note_index(path: Path) -> dict[str, dict[str, Any]]:
out: dict[str, dict[str, Any]] = {}
for rec in read_jsonl(path):
if not isinstance(rec, dict):
continue
pid = str(rec.get("paper_id") or "").strip()
if pid:
out[pid] = rec
return out
def keywords_from_cluster(name: str) -> list[str]:
toks = [t for t in re.findall(r"[A-Za-z0-9]+", str(name or "").lower()) if t not in STOPWORDS and len(t) > 2]
return uniq_keep_order(toks)
def score_note_to_cluster(cluster_name: str, note: dict[str, Any]) -> int:
keys = keywords_from_cluster(cluster_name)
blob_parts = [str(note.get("title") or "")]
for field in ["summary_bullets", "limitations", "key_results", "method", "abstract"]:
val = note.get(field)
if isinstance(val, list):
blob_parts.extend([str(x) for x in val])
elif isinstance(val, str):
blob_parts.append(val)
blob = " ".join(blob_parts).lower()
return sum(1 for key in keys if key in blob)
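A self-contained sketch of the keyword-overlap scoring used by `keywords_from_cluster` and `score_note_to_cluster`. The `STOPWORDS` set below is a stand-in (the module's real list is defined outside this section), and the note is invented.

```python
import re

# Hypothetical stopword list; the real STOPWORDS lives elsewhere in the module.
STOPWORDS = {"and", "the", "for", "with"}


def keywords_from_cluster(name: str) -> list[str]:
    toks = [t for t in re.findall(r"[A-Za-z0-9]+", name.lower()) if t not in STOPWORDS and len(t) > 2]
    return list(dict.fromkeys(toks))  # de-duplicate, keeping first-seen order


def score_note_to_cluster(cluster_name: str, note: dict) -> int:
    parts = [str(note.get("title") or "")]
    for field in ["summary_bullets", "limitations", "key_results", "method", "abstract"]:
        val = note.get(field)
        if isinstance(val, list):
            parts.extend(str(x) for x in val)
        elif isinstance(val, str):
            parts.append(val)
    blob = " ".join(parts).lower()
    # One point per cluster keyword that appears anywhere in the note text.
    return sum(1 for key in keywords_from_cluster(cluster_name) if key in blob)


note = {"title": "Tree search for deeper planning", "summary_bullets": ["Depth-limited rollouts"]}
print(score_note_to_cluster("Planning and search depth", note))  # 3
print(score_note_to_cluster("Memory horizon", note))             # 0
```

Matching is substring-based, so a keyword such as `search` also scores inside `research`. That keeps the mapping cheap but can over-count for short cluster keywords.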
def map_notes_to_clusters(taxonomy_path: Path, notes_path: Path) -> dict[str, list[dict[str, Any]]]:
import yaml
taxonomy = yaml.safe_load(taxonomy_path.read_text(encoding="utf-8", errors="ignore")) if taxonomy_path.exists() else []
notes = [r for r in read_jsonl(notes_path) if isinstance(r, dict)]
out: dict[str, list[dict[str, Any]]] = {}
for top in taxonomy or []:
if not isinstance(top, dict):
continue
for child in top.get("children") or []:
if not isinstance(child, dict):
continue
name = str(child.get("name") or "").strip()
if not name:
continue
scored: list[tuple[int, dict[str, Any]]] = []
for note in notes:
s = score_note_to_cluster(name, note)
if s > 0:
scored.append((s, note))
scored.sort(key=lambda x: (-x[0], str(x[1].get("paper_id") or "")))
out[name] = [n for _, n in scored[:8]]
return out
def _cluster_profile(cluster: str) -> tuple[str, list[str], str]:
low = cluster.lower()
for keys, axes, direction_type, academic_value in CLUSTER_AXIS_HINTS:
if any(key in low for key in keys):
return direction_type, axes, academic_value
return "research", ["assumption sensitivity", "failure analysis", "scope boundary"], "Could sharpen the way this sub-area is framed and compared."
def _note_bullets(note: dict[str, Any], field: str) -> list[str]:
val = note.get(field)
if isinstance(val, list):
return [clean_text(x, limit=220) for x in val if clean_text(x, limit=220)]
if isinstance(val, str) and clean_text(val, limit=220):
return [clean_text(val, limit=220)]
return []
def _specific_limitations(note: dict[str, Any]) -> list[str]:
items = _note_bullets(note, "limitations")
specific = []
for item in items:
low = item.lower()
if any(pat in low for pat in GENERIC_LIMITATION_PATTERNS):
continue
specific.append(item)
return specific or items
def _evidence_confidence(notes: list[dict[str, Any]]) -> str:
if not notes:
return "low"
if any(str(n.get("evidence_level") or "").strip() == "fulltext" for n in notes):
return "medium-high"
specific = sum(
1
for n in notes
if _specific_limitations(n) and not any(p in (_specific_limitations(n)[0].lower()) for p in GENERIC_LIMITATION_PATTERNS)
)
if len(notes) >= 3 and specific >= 2:
return "medium"
return "low-medium"
def _axis_profile(axis: str, cluster: str) -> dict[str, str]:
base = AXIS_INSIGHT_LIBRARY.get(axis, {})
confound = base.get("confound", "nearby design choices and evaluation framing")
return {
"question": base.get("question", f"whether {axis} is doing more conceptual work in {cluster.lower()} than current papers make explicit"),
"confound": confound,
"thesis": base.get("thesis", f"Current results in {cluster.lower()} may be hard to interpret because {axis} still moves together with {confound}."),
"insight": base.get("insight", f"A meaningful result would show whether {axis} changes how we interpret the current literature rather than merely how we narrate it."),
"contribution_shape": base.get("contribution_shape", f"Could turn {axis} into a cleaner explanatory variable for {cluster.lower()} rather than a background convenience."),
"demotion": base.get("demotion", f"This direction would weaken if the strongest prior work already isolates {axis} cleanly."),
"kill_signal": base.get("kill_signal", f"the strongest prior work already isolates {axis} against {confound}"),
"missing_piece": base.get("missing_piece", f"What is missing is a cleaner comparison that varies {axis} while keeping {confound} fixed enough to change interpretation, not just presentation."),
"program_kind": base.get("program_kind", "mechanism clarification"),
"time_to_clarity": base.get("time_to_clarity", "medium"),
"priority_note": base.get("priority_note", f"Worth discussion because it could turn {axis} into a sharper explanatory wedge for {cluster.lower()}."),
"reading_extract": base.get("reading_extract", f"which variables change together with {axis}, what comparator is used, and whether any ablation already fixes {confound}"),
"title": base.get("title", f"What {axis} is really doing"),
}
def _result_fact(note: dict[str, Any]) -> str:
candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
scored: list[str] = []
for cand in candidates:
low = cand.lower()
if any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "outperform", "achiev", "%"]) or any(ch.isdigit() for ch in cand):
scored.append(cand)
chosen = scored[0] if scored else _claim_text(note)
chosen = re.sub(r"^(for instance|for example|e\.g\.)[:,]?\s*", "", chosen, flags=re.IGNORECASE)
low = chosen.lower()
tasks = _extract_task_mentions(note)
task_text = "/".join(tasks[:2]) if tasks else "reported setting"
percents = re.findall(r"\d+(?:\.\d+)?%", chosen)
if "pass@1" in low and percents:
if len(percents) >= 2:
return clean_text(f"{task_text}: {percents[0]} pass@1 vs {percents[1]} in the reported comparison.", limit=95)
return clean_text(f"{task_text}: {percents[0]} pass@1 in the reported comparison.", limit=95)
if "success rate" in low and len(percents) >= 2:
baseline = " over imitation/RL baselines" if "imitation" in low or "reinforcement learning" in low else " in the reported comparison"
return clean_text(f"{task_text}: {percents[0]}/{percents[1]} success-rate gains{baseline}.", limit=95)
if len(percents) >= 2:
return clean_text(f"{task_text}: {percents[0]} vs {percents[1]} in the reported comparison.", limit=95)
if len(percents) == 1:
return clean_text(f"{task_text}: reported result reaches {percents[0]} in the abstract.", limit=95)
return clean_text(chosen, limit=95)
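The percent-handling tail of `_result_fact` can be illustrated with a reduced sketch. `percent_summary` is a hypothetical name, it omits the `pass@1` and `success rate` branches, and the sample sentence and its numbers are invented rather than taken from any paper.

```python
import re


def percent_summary(task_text: str, sentence: str) -> str:
    """Hypothetical reduction of the generic branches: two, one, or no percentages."""
    percents = re.findall(r"\d+(?:\.\d+)?%", sentence)
    if len(percents) >= 2:
        return f"{task_text}: {percents[0]} vs {percents[1]} in the reported comparison."
    if len(percents) == 1:
        return f"{task_text}: reported result reaches {percents[0]} in the abstract."
    return sentence


# Invented example sentence.
print(percent_summary("HumanEval", "Reaches 71.2% accuracy versus 48% for the baseline."))
# HumanEval: 71.2% vs 48% in the reported comparison.
```

The pairing is purely positional: the first percentage in the sentence is treated as the headline figure and the second as the comparator, so a sentence that lists the baseline first would be summarized in reverse.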
def _metric_phrase(notes: list[dict[str, Any]]) -> str:
blob = " ".join(_result_fact(note) for note in notes)
low = blob.lower()
if "pass@1" in low:
return "pass@1 plus failure-type shifts"
if "success rate" in low:
return "success rate plus failure-type shifts"
if "accuracy" in low:
return "accuracy plus failure-type shifts"
if "exact match" in low:
return "exact match plus failure-type shifts"
if "f1" in low:
return "F1 plus failure-type shifts"
if "%" in blob:
return "the reported benchmark metric plus failure-type shifts"
return "task success plus failure-type shifts"
def _sentence_has_concrete_hook(text: str) -> bool:
low = str(text or "").lower()
if any(ch.isdigit() for ch in str(text or "")):
return True
if any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "alfworld", "webshop", "hotpotqa", "fever", "humaneval", "webarena", "agentbench"]):
return True
return False
def _extract_task_mentions(note: dict[str, Any]) -> list[str]:
blob_parts = []
for field in ["key_results", "summary_bullets", "method", "abstract", "title"]:
val = note.get(field)
if isinstance(val, list):
blob_parts.extend([str(x) for x in val])
elif isinstance(val, str):
blob_parts.append(val)
blob = " ".join(blob_parts)
low = blob.lower()
positions: list[tuple[int, str]] = []
for hint in BENCHMARK_HINTS:
pos = low.find(hint.lower())
if pos >= 0:
positions.append((pos, hint))
for match in re.finditer(r"\b(?:HumanEval|ALFWorld|WebShop|HotpotQA|FEVER|WebArena|AgentBench|GSM8K|MATH)\b", blob):
positions.append((match.start(), match.group(0)))
positions.sort(key=lambda item: (item[0], item[1]))
return uniq_keep_order(name for _, name in positions)
def _task_phrase(notes: list[dict[str, Any]]) -> str:
tasks: list[str] = []
for note in notes:
tasks.extend(_extract_task_mentions(note))
uniq = uniq_keep_order(tasks)
if not uniq:
return "a small public task slice"
if len(uniq) == 1:
return uniq[0]
return "/".join(uniq[:2])
def _claim_text(note: dict[str, Any]) -> str:
candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
for cand in candidates:
if len(cand.split()) >= 8:
return cand
return clean_text(note.get("method") or note.get("title") or "", limit=220)
def _limitation_text(note: dict[str, Any], axis: str, cluster: str) -> str:
limitations = _specific_limitations(note)
if limitations and not any(pat in limitations[0].lower() for pat in GENERIC_LIMITATION_PATTERNS):
return limitations[0]
profile = _axis_profile(axis, cluster)
return f"The paper still leaves unclear whether {axis} is genuinely responsible for the reported story in {cluster.lower()}, or whether that story is really driven by {profile['confound']}."
def _anchor_notes(note_index: dict[str, dict[str, Any]], paper_ids: list[str]) -> list[dict[str, Any]]:
return [note_index[pid] for pid in paper_ids if pid in note_index]
def _paper_annotation(note: dict[str, Any], axis: str, cluster: str) -> str:
title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
profile = _axis_profile(axis, cluster)
result = _result_fact(note)
return clean_sentence(
f"{title}: {result} Open gap: {axis} still moves with {profile['confound']}.",
limit=300,
)
def _synthesis_annotation(notes: list[dict[str, Any]], axis: str, cluster: str) -> str:
profile = _axis_profile(axis, cluster)
task_phrase = _task_phrase(notes)
return clean_sentence(
f"Across the anchor papers, the live question is whether {axis} changes interpretation itself or merely rides along with {profile['confound']}, especially on {task_phrase}.",
limit=260,
)
def build_signal_rows(*, cluster: str, notes: list[dict[str, Any]]) -> list[IdeaSignal]:
    if not notes:
        return []
    direction_type, axes, academic_value = _cluster_profile(cluster)
    evidence_confidence = _evidence_confidence(notes)
    anchors = notes[:3]
    paper_ids = uniq_keep_order(str(n.get("paper_id") or "").strip() for n in anchors)
    signals: list[IdeaSignal] = []
    for idx, axis in enumerate(axes[:3], start=1):
        # Rotate through the anchor notes so each axis gets its own claim source.
        anchor = anchors[(idx - 1) % len(anchors)]
        claim = _claim_text(anchor)
        tension = _limitation_text(anchor, axis, cluster)
        profile = _axis_profile(axis, cluster)
        missing_piece = clean_sentence(
            f"What is still missing is a direction that isolates whether {axis} is the real explanatory variable in {cluster.lower()}, rather than leaving it entangled with {profile['confound']}.",
            limit=240,
        )
        signals.append(IdeaSignal(
            signal_id=f"SIG-{slugify(cluster)[:16]}-{idx}",
            cluster=cluster,
            direction_type=direction_type,
            theme=f"{profile['title']} in {cluster}",
            claim_or_observation=claim,
            tension=clean_text(tension, limit=220),
            missing_piece=missing_piece,
            possible_axis=axis,
            academic_value=academic_value,
            evidence_confidence=evidence_confidence,
            paper_ids=paper_ids,
        ))
    return signals
def signal_table_markdown(rows: list[IdeaSignal]) -> str:
table_rows = []
for row in rows:
table_rows.append([
row.signal_id,
row.cluster,
row.theme,
row.claim_or_observation,
row.tension,
row.missing_piece,
row.possible_axis,
row.academic_value,
row.evidence_confidence,
", ".join(row.paper_ids),
])
return "\n".join([
"# IDEA_SIGNAL_TABLE",
"",
"## Research signals",
"",
markdown_table(["Signal ID", "Cluster", "Theme", "Claim / observation", "Tension", "Missing piece", "Possible axis", "Academic value", "Confidence", "Paper IDs"], table_rows),
"",
])
def _title_from_signal(signal: IdeaSignal) -> str:
profile = _axis_profile(signal.possible_axis, signal.cluster)
return clean_sentence(profile["title"], limit=72)
def _best_fit(direction_type: str, axis: str) -> str:
    if direction_type == "governance":
        return "Best fit when the group wants a deployment-relevant reading of agent behavior rather than another internal benchmark story."
    if direction_type == "evaluation":
        return "Best fit when the discussion is really about what counts as convincing evidence, not just how to raise scores."
    if axis in {"search depth", "verification loop", "retrieval policy"}:
        return "Best fit for a PhD discussion that wants one sharper mechanism/confound question rather than a broad systems buildout."
    return "Best fit for PI/PhD discussions that want a thesis-worthy mechanism question rather than a generic benchmark wrapper."
def signals_to_direction_cards(signals: list[IdeaSignal], *, note_index: dict[str, dict[str, Any]], focus_clusters: list[str], pool_min: int, pool_max: int) -> list[DirectionCard]:
focus = {x.strip() for x in focus_clusters if str(x).strip()}
cards: list[DirectionCard] = []
for idx, signal in enumerate(signals, start=1):
profile = _axis_profile(signal.possible_axis, signal.cluster)
anchors = _anchor_notes(note_index, signal.paper_ids)
task_phrase = _task_phrase(anchors)
metric_phrase = _metric_phrase(anchors)
nearest = anchors[0] if anchors else {}
nearest_title = clean_text((nearest.get("title") if isinstance(nearest, dict) else "") or (signal.paper_ids[0] if signal.paper_ids else signal.cluster), limit=90)
nearest_task_phrase = _task_phrase([nearest]) if nearest else task_phrase
literature_suggests = [_paper_annotation(note, signal.possible_axis, signal.cluster) for note in anchors[:2]]
literature_suggests.append(_synthesis_annotation(anchors, signal.possible_axis, signal.cluster))
closest_prior_gap = [
clean_sentence(
f"{nearest_title} is the closest prior anchor because it already reports concrete behavior on {nearest_task_phrase}. The unresolved point is whether those gains survive once {profile['confound']} is held fixed while {signal.possible_axis} is varied.",
limit=220,
),
clean_sentence(
"The novelty test here is narrow, not rhetorical: if a strong anchor paper already runs that single-variable control, this direction should collapse quickly rather than stay alive as a vague confound story.",
limit=220,
),
]
one_line_thesis = clean_sentence(profile["thesis"], limit=230)
why_interesting = clean_sentence(
f"This is not just another benchmark wedge. It opens a {profile['program_kind']} line around {signal.possible_axis}: {profile['question']}.",
limit=220,
)
missing_piece = clean_sentence(profile["missing_piece"], limit=230)
possible_variants = [
clean_sentence(f"Single-variable control on {task_phrase}: vary {signal.possible_axis} while holding {profile['confound']} fixed as far as the setup allows.", limit=210),
clean_sentence("Replace score-only reporting with a failure-type comparison after the same intervention, so the result says more than whether the average went up.", limit=210),
clean_sentence("Check whether the conclusion survives on one simple public task slice and one more tool- or environment-heavy setting before treating it as a general claim.", limit=210),
]
first_probes = [
clean_sentence(
f"Intervention: vary {signal.possible_axis} while holding {profile['confound']} as fixed as possible on {task_phrase}. Readout: {metric_phrase}. Decisive if the interpretation changes even after the control.",
limit=220,
),
clean_sentence(
f"Prior-work audit: inspect {nearest_title} for any ablation that already fixes {profile['confound']}, and if the conclusion survives, demote this direction.",
limit=180,
),
]
what_counts_as_insight = clean_sentence(profile['insight'], limit=190)
kill_criteria = [
clean_sentence(f"Kill quickly if {profile['kill_signal']}.", limit=170),
clean_sentence(f"Kill if the first controlled probe leaves both {metric_phrase} and the failure taxonomy essentially unchanged.", limit=170),
]
weakness_conditions = [
clean_sentence(f"This direction is weaker if changing {signal.possible_axis} mostly rescales the aggregate metric without changing which failure modes appear or disappear.", limit=175),
clean_sentence(profile['demotion'], limit=170),
]
anchor_reading_notes: list[dict[str, str]] = []
for note in anchors[:3]:
title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
result = _result_fact(note)
anchor_reading_notes.append({
"paper_title": title,
"why_read": clean_sentence(f"Closest {signal.cluster.lower()} anchor with a concrete result hook on {task_phrase}: {result}", limit=180),
"what_to_extract": clean_sentence(f"Extract the metric and comparator, then check whether {profile['reading_extract']}.", limit=180),
"current_hook": clean_sentence(result, limit=180),
"kill_signal": clean_sentence(f"Weaken this direction if the paper already shows {profile['kill_signal']}.", limit=180),
})
why_this_ranks_here = clean_sentence(profile['priority_note'], limit=170)
cards.append(DirectionCard(
direction_id=f"DIR-{idx:03d}",
cluster=signal.cluster,
direction_type=signal.direction_type,
title=_title_from_signal(signal),
focus_axis=signal.possible_axis,
main_confound=profile['confound'],
program_kind=profile['program_kind'],
contribution_shape=profile['contribution_shape'],
time_to_clarity=profile['time_to_clarity'],
one_line_thesis=one_line_thesis,
why_interesting=why_interesting,
literature_suggests=literature_suggests,
closest_prior_gap=closest_prior_gap,
missing_piece=missing_piece,
possible_variants=possible_variants,
academic_value=profile['contribution_shape'],
first_probes=first_probes,
what_counts_as_insight=what_counts_as_insight,
weakness_conditions=weakness_conditions,
kill_criteria=kill_criteria,
what_would_change_mind=kill_criteria,
best_fit=_best_fit(signal.direction_type, signal.possible_axis),
why_this_ranks_here=why_this_ranks_here,
evidence_confidence=signal.evidence_confidence,
paper_ids=signal.paper_ids,
signal_ids=[signal.signal_id],
anchor_reading_notes=anchor_reading_notes,
))
cards.sort(key=lambda c: (0 if c.cluster in focus else 1, c.cluster, c.program_kind, c.title))
target = max(pool_min, min(pool_max, len(cards)))
return cards[:target]
def direction_pool_markdown(cards: list[DirectionCard]) -> str:
rows: list[list[str]] = []
for card in cards:
rows.append([
card.direction_id,
card.cluster,
card.direction_type,
card.program_kind,
card.title,
clean_sentence(card.one_line_thesis, limit=100),
clean_sentence(card.why_interesting, limit=100),
clean_sentence(card.missing_piece, limit=100),
clean_sentence(" / ".join(card.possible_variants), limit=120),
clean_sentence(card.academic_value, limit=90),
clean_sentence(" / ".join(card.first_probes), limit=120),
card.evidence_confidence,
", ".join(card.paper_ids),
])
return "\n".join([
"# IDEA_DIRECTION_POOL",
"",
"## Candidate research directions",
"",
markdown_table(["Direction ID", "Cluster", "Type", "Program", "Title", "One-line thesis", "Why interesting", "Missing piece", "Possible variants", "Academic value", "First probes", "Confidence", "Paper IDs"], rows),
"",
])
def _thesis_potential(direction_type: str, cluster: str) -> int:
    if direction_type in {"mechanism", "coordination", "adaptation"}:
        return 5
    if direction_type in {"systems", "governance"}:
        return 4
    if "benchmark" in cluster.lower() or direction_type == "evaluation":
        return 3
    return 4
def score_direction_cards(
cards: list[DirectionCard],
*,
focus_clusters: list[str],
keep_rank_max: int,
maybe_rank_max: int,
score_weights: dict[str, float] | None = None,
) -> list[ScreenedDirection]:
focus = {x.strip() for x in focus_clusters if str(x).strip()}
keep_rank_max = int(keep_rank_max)
maybe_rank_max = int(maybe_rank_max)
if keep_rank_max <= 0:
raise ValueError("Invalid ideation scoring contract: keep_rank_max must be positive.")
if maybe_rank_max < keep_rank_max:
raise ValueError("Invalid ideation scoring contract: maybe_rank_max must be >= keep_rank_max.")
weights = _validate_score_weights(score_weights or {}, field_name="score_direction_cards.score_weights")
cluster_counts: dict[str, int] = {}
program_counts: dict[str, int] = {}
for card in cards:
cluster_counts[card.cluster] = cluster_counts.get(card.cluster, 0) + 1
program_counts[card.program_kind] = program_counts.get(card.program_kind, 0) + 1
provisional: list[tuple[DirectionCard, float, int, int, int, int, int, int]] = []
for card in cards:
discussion_worthiness = 5 if card.cluster in focus or card.time_to_clarity == "fast" else 4
academic_value_score = 5 if any(token in card.contribution_shape.lower() for token in ["rule", "protocol", "regime map", "causal"]) else 4
concrete_hooks = sum(1 for item in card.literature_suggests if _sentence_has_concrete_hook(item))
evidence_grounding = 5 if len(card.paper_ids) >= 3 and concrete_hooks >= 2 else 4 if len(card.paper_ids) >= 2 and concrete_hooks >= 1 else 3
if program_counts.get(card.program_kind, 0) == 1 and cluster_counts.get(card.cluster, 0) == 1:
direction_distinctness = 5
elif program_counts.get(card.program_kind, 0) == 1 or cluster_counts.get(card.cluster, 0) == 1:
direction_distinctness = 4
else:
direction_distinctness = 3
probe_blob = " ".join(card.first_probes).lower()
first_probe_clarity = 5 if all(token in probe_blob for token in ["intervention:", "readout:", "decisive if"]) else 4 if "intervention:" in probe_blob else 3
thesis_potential = _thesis_potential(card.direction_type, card.cluster)
total = round(
discussion_worthiness * weights["discussion_worthiness"]
+ academic_value_score * weights["academic_value"]
+ evidence_grounding * weights["evidence_grounding"]
+ direction_distinctness * weights["direction_distinctness"]
+ first_probe_clarity * weights["first_probe_clarity"]
+ thesis_potential * weights["thesis_potential"],
2,
)
provisional.append((card, total, discussion_worthiness, academic_value_score, evidence_grounding, direction_distinctness, first_probe_clarity, thesis_potential))
provisional.sort(key=lambda row: (-row[1], 0 if row[0].time_to_clarity == "fast" else 1, row[0].cluster, row[0].title))
rows: list[ScreenedDirection] = []
for idx, (card, total, dw, av, eg, dd, fp, tp) in enumerate(provisional, start=1):
recommendation = "keep" if idx <= keep_rank_max else "maybe" if idx <= maybe_rank_max else "drop"
strengths: list[str] = []
if eg >= 5:
strengths.append("concrete anchor evidence")
if dd >= 5:
strengths.append(f"distinct {card.program_kind} wedge")
if fp >= 5:
strengths.append("clean first probe")
if av >= 5:
strengths.append("thesis-sized payoff")
if not strengths:
strengths.append("discussion value")
risk = "still abstract-first" if str(card.evidence_confidence).startswith("low") else f"main risk is controlling {card.main_confound} cleanly"
rationale = clean_sentence(f"Strongest on {' + '.join(strengths[:2])}; {risk}.", limit=160)
rows.append(ScreenedDirection(
direction_id=card.direction_id,
cluster=card.cluster,
direction_type=card.direction_type,
title=card.title,
total_score=total,
discussion_worthiness=dw,
academic_value_score=av,
evidence_grounding=eg,
direction_distinctness=dd,
first_probe_clarity=fp,
thesis_potential=tp,
recommendation=recommendation,
rationale=rationale,
))
return rows
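# A minimal sketch of the scoring arithmetic in `score_direction_cards`, assuming six
# component scores on a 1-5 scale and a weight per component. `EXAMPLE_WEIGHTS`,
# `weighted_total`, and `recommend` are hypothetical names used only for illustration;
# the real weights come from `score_weights` after `_validate_score_weights`, whose
# rules are not shown in this file excerpt.

```python
from typing import Dict

# Hypothetical weights chosen so that they sum to 1.0 (an assumption, not the pipeline's values).
EXAMPLE_WEIGHTS: Dict[str, float] = {
    "discussion_worthiness": 0.2,
    "academic_value": 0.2,
    "evidence_grounding": 0.2,
    "direction_distinctness": 0.15,
    "first_probe_clarity": 0.15,
    "thesis_potential": 0.1,
}


def weighted_total(scores: Dict[str, int], weights: Dict[str, float]) -> float:
    """Weighted sum of the component scores, rounded to two decimals."""
    return round(sum(scores[name] * weights[name] for name in weights), 2)


def recommend(rank: int, keep_rank_max: int, maybe_rank_max: int) -> str:
    """Map a 1-based rank to keep / maybe / drop using the two rank thresholds."""
    if rank <= keep_rank_max:
        return "keep"
    if rank <= maybe_rank_max:
        return "maybe"
    return "drop"


if __name__ == "__main__":
    scores = {name: 4 for name in EXAMPLE_WEIGHTS}
    scores["thesis_potential"] = 5
    print(weighted_total(scores, EXAMPLE_WEIGHTS))
    print([recommend(rank, keep_rank_max=3, maybe_rank_max=5) for rank in range(1, 7)])
```

# Because the recommendation depends only on rank, a card's decision can change when
# other cards are added or removed, even if its own total stays the same.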
def screening_table_markdown(rows: list[ScreenedDirection]) -> str:
data: list[list[str]] = []
for row in rows:
data.append([
row.direction_id,
row.cluster,
row.direction_type,
row.title,
f"{row.total_score:.2f}",
str(row.discussion_worthiness),
str(row.academic_value_score),
str(row.evidence_grounding),
str(row.direction_distinctness),
str(row.first_probe_clarity),
str(row.thesis_potential),
row.recommendation,
row.rationale,
])
return "\n".join([
"# IDEA_SCREENING_TABLE",
"",
"## Discussion-first screening",
"",
markdown_table(["Direction ID", "Cluster", "Type", "Title", "Total", "Discussion", "Academic value", "Evidence", "Distinctness", "First probe", "Thesis potential", "Decision", "Rationale"], data),
"",
])
def shortlist_markdown(records: list[dict[str, Any]]) -> str:
lines = ["# IDEA_SHORTLIST", "", "## Prioritized directions", ""]
for record in records:
lines.extend([
f"### Direction {record['rank']}. {record['title']}",
f"- Cluster: {record['cluster']}",
f"- Type: {record['direction_type']}",
f"- Focus axis: {record.get('focus_axis')}",
f"- Program kind: {record.get('program_kind')}",
f"- Main confound: {record.get('main_confound')}",
f"- Time to clarity: {record.get('time_to_clarity')}",
f"- One-line thesis: {record['one_line_thesis']}",
f"- Why this is interesting: {record['why_interesting']}",
f"- Why this ranks here: {record['why_this_ranks_here']}",
f"- Contribution shape: {record.get('contribution_shape')}",
"- What the literature already suggests:",
])
for item in record.get("literature_suggests") or []:
lines.append(f" - {item}")
lines.extend([
"- Closest prior work and why it does not settle the question:",
])
for item in record.get("closest_prior_gap") or []:
lines.append(f" - {item}")
lines.extend([
f"- What is still missing: {record['missing_piece']}",
"- Possible variants:",
])
for item in record.get("possible_variants") or []:
lines.append(f" - {item}")
lines.extend([
f"- Why this could matter academically: {record['academic_value']}",
"- First probes:",
])
for item in record.get("first_probes") or []:
lines.append(f" - {item}")
lines.extend([
f"- What would count as actual insight: {record['what_counts_as_insight']}",
"- What would make this weak or unconvincing:",
])
for item in record.get("weakness_conditions") or []:
lines.append(f" - {item}")
lines.extend([
"- Quick kill criteria:",
])
for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
lines.append(f" - {item}")
lines.extend([
f"- Best fit: {record['best_fit']}",
f"- Evidence confidence: {record['evidence_confidence']}",
"- Anchor papers: " + ", ".join(f"`{pid}`" for pid in record.get("paper_ids") or []),
f"- Why prioritized now: {record['why_prioritized']}",
"",
])
return "\n".join(lines)
def shortlist_snapshot_table(records: list[dict[str, Any]]) -> str:
rows = []
for rec in records:
rows.append([
str(rec.get("rank") or ""),
rec.get("title", ""),
clean_sentence(rec.get("why_this_ranks_here", ""), limit=52),
clean_sentence(rec.get("contribution_shape", rec.get("academic_value", "")), limit=56),
clean_sentence((rec.get("kill_criteria") or rec.get("what_would_change_mind") or [""])[0], limit=54),
])
return markdown_table(["Rank", "Direction", "Why now", "If it survives", "Fast kill signal"], rows)
def build_report_payload(*, topic: str, shortlist: list[dict[str, Any]], deferred: list[dict[str, Any]], trace_paths: dict[str, str]) -> dict[str, Any]:
top = list(shortlist)
takeaways: list[str] = []
if top:
takeaways.append("The strongest directions are the ones most likely to change how existing results are interpreted, not just add another benchmark win.")
takeaways.append("The current rank order reflects a tradeoff between time-to-clarity, thesis-sized payoff, and how concrete the nearest prior-work gap already looks.")
program_kinds = uniq_keep_order(str(rec.get("program_kind") or "").strip() for rec in top)
if len(program_kinds) >= 2:
takeaways.append(f"The lead set is intentionally not one confound template repeated three times: it spans {', '.join(program_kinds[:3])} rather than a single explanatory mold.")
else:
takeaways.append("The lead set still sits in one tight neighborhood, so the next reading pass should stress-test whether those directions are genuinely distinct thesis lines.")
if any(str(rec.get("evidence_confidence") or "").startswith("low") for rec in top):
takeaways.append("Most anchors are still abstract-first, so the memo is best used to decide the next reading and falsification pass rather than to lock a project immediately.")
else:
takeaways.append("The evidence is already concrete enough for a serious PI/PhD discussion, though one deeper paper pass could still reshuffle the exact order.")
lead = top[0] if top else {}
lead_anchor = (lead.get("anchor_reading_notes") or [{}])[0] if top else {}
lead_anchor_title = lead_anchor.get("paper_title") if isinstance(lead_anchor, dict) else "the first anchor paper"
discussion_questions = [
"Which direction still survives once we ask for a single-variable control rather than a suggestive confound story?",
"Which lead direction would still matter if the nearest prior work already addressed the obvious control we are worried about?",
"Which candidate could plausibly turn into a thesis line with a reusable method, protocol, or regime map rather than a one-off empirical note?",
"Which ranking would change most after one full-paper reading pass on the anchor set?",
]
uncertainties = [
"The main remaining risk is novelty risk, not idea scarcity: closer reading may reveal that one lead direction is already settled by an existing control or ablation.",
"Evidence confidence is still bounded by abstract-first notes for part of the lead set, so paper-specific details could either strengthen or kill a direction quickly.",
"The memo is more reliable as a discussion and triage artifact than as a final commitment on exact project order.",
]
next_steps = []
if top:
next_steps.append(clean_sentence(f"Start with {lead.get('title')}: read {lead_anchor_title or 'the first anchor paper'} looking specifically for whether {lead.get('focus_axis')} is already isolated against {lead.get('main_confound')}.", limit=220))
first_probe = (lead.get("first_probes") or [""])[0]
if first_probe:
next_steps.append(clean_sentence(first_probe, limit=220))
next_steps.append(clean_sentence(f"Re-rank only after checking the quick kill criteria for {lead.get('title')} and at least one competing direction in the lead set.", limit=220))
else:
next_steps.extend([
"Read the anchor papers for the current top direction in full and test whether the memo's stated hidden variable still looks unresolved.",
"In the next PI/PhD discussion, use the ranking arguments and kill criteria to decide whether any direction deserves promotion into a sharper thesis candidate.",
])
return {
"topic": topic,
"takeaways": takeaways,
"top_directions": top,
"deferred_directions": deferred,
"discussion_questions": discussion_questions,
"uncertainties": uncertainties,
"next_steps": next_steps,
"trace_artifacts": trace_paths,
}
def report_markdown(payload: dict[str, Any]) -> str:
top_dirs = payload.get("top_directions") or []
deferred_idx = 3 + len(top_dirs)
discussion_idx = deferred_idx + 1
uncertainty_idx = deferred_idx + 2
next_idx = deferred_idx + 3
appendix_idx = deferred_idx + 4
lines = [
"# Research Idea Brainstorm Memo",
"",
"## 0. Scope and framing",
"",
f"- Topic: {payload.get('topic') or 'research ideas'}",
"- Intended readers: PI / PhD",
"- Goal: surface a small number of discussion-worthy research directions rather than force a final project choice.",
"- Current evidence basis: abstract-first notes unless otherwise stated.",
"- What this memo is not: not a final project spec, not a survey draft, and not a symmetric top-3 proposal pack.",
"",
"## 1. Big-picture takeaways",
"",
]
for item in payload.get("takeaways") or []:
lines.append(f"- {item}")
lines.extend([
"",
"## 2. Top directions at a glance",
"",
shortlist_snapshot_table(payload.get("top_directions") or []),
"",
])
for idx, record in enumerate(top_dirs, start=1):
lines.extend([
f"## {idx + 2}. Direction {idx} — {record.get('title')}",
"",
"### One-line thesis",
f"- {record.get('one_line_thesis')}",
"",
"### Why it belongs in the lead set",
f"- {record.get('why_this_ranks_here')}",
f"- Time to clarity: {record.get('time_to_clarity')}",
"",
"### What the current literature actually shows",
])
for item in record.get("literature_suggests") or []:
lines.append(f"- {item}")
lines.extend([
"",
"### Closest prior work and remaining gap",
])
for item in record.get("closest_prior_gap") or []:
lines.append(f"- {item}")
lines.extend([
f"- Missing piece: {record.get('missing_piece')}",
"",
"### If this direction is right, what contribution emerges",
f"- {record.get('contribution_shape') or record.get('academic_value')}",
f"- {record.get('what_counts_as_insight')}",
"",
"### Smallest decisive probe",
])
for item in record.get("first_probes") or []:
lines.append(f"- {item}")
lines.extend([
"",
"### Quick kill criteria",
])
for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
lines.append(f"- {item}")
lines.extend(["",])
lines.extend([
f"## {deferred_idx}. Other promising but not prioritized directions",
"",
])
deferred = payload.get("deferred_directions") or []
if deferred:
for record in deferred:
lines.append(f"- **{record.get('title')}** — {record.get('one_line_thesis')} (why not prioritized now: {record.get('why_not_prioritized')})")
else:
lines.append("- No additional deferred directions were retained in the current memo.")
lines.extend([
"",
f"## {discussion_idx}. Cross-cutting discussion questions",
"",
])
for item in payload.get("discussion_questions") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {uncertainty_idx}. Uncertainty and disagreement",
"",
])
for item in payload.get("uncertainties") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {next_idx}. Suggested next reading / next discussion step",
"",
])
for item in payload.get("next_steps") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {appendix_idx}. Appendix guide",
"",
"- `output/APPENDIX.md` for anchor-paper reading notes and deferred directions.",
"- `output/REPORT.json` for the structured version of this memo.",
])
return "\n".join(lines)
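# A small sketch of the section numbering used by `report_markdown`, assuming the
# layout shown above: sections 0-2 are fixed, each top direction takes one section,
# and five trailing sections follow. `memo_section_numbers` is a hypothetical helper
# written for illustration; it is not part of the module.

```python
from typing import Dict, List, Union


def memo_section_numbers(n_top: int) -> Dict[str, Union[int, List[int]]]:
    """Return the heading numbers for a memo with `n_top` lead directions."""
    # Direction sections start at 3 because 0, 1, and 2 are the fixed preamble.
    directions = [idx + 2 for idx in range(1, n_top + 1)]
    deferred = 3 + n_top
    return {
        "directions": directions,
        "deferred": deferred,
        "discussion": deferred + 1,
        "uncertainty": deferred + 2,
        "next_steps": deferred + 3,
        "appendix": deferred + 4,
    }


if __name__ == "__main__":
    print(memo_section_numbers(3))
```

# With an empty shortlist the deferred section becomes section 3, so the numbering
# stays contiguous either way.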
def appendix_markdown(payload: dict[str, Any], *, core_titles: dict[str, str]) -> str:
lines = [
"# Appendix to the Research Idea Brainstorm Memo",
"",
"## A. Deferred but still promising directions",
"",
]
deferred = payload.get("deferred_directions") or []
if deferred:
for rec in deferred:
lines.extend([
f"### {rec.get('title')}",
f"- One-line thesis: {rec.get('one_line_thesis')}",
f"- Program kind: {rec.get('program_kind')}",
f"- Why interesting: {rec.get('why_interesting')}",
f"- Why not prioritized now: {rec.get('why_not_prioritized')}",
"",
])
else:
lines.append("- (none)")
lines.extend([
"## B. Anchor papers and what to extract",
"",
])
for rec in payload.get("top_directions") or []:
lines.append(f"### {rec.get('title')}")
lines.append(f"- Program kind: {rec.get('program_kind')}")
lines.append(f"- Lead-set reason: {rec.get('why_this_ranks_here')}")
notes = rec.get("anchor_reading_notes") or []
if notes:
rows: list[list[str]] = []
for item in notes:
if not isinstance(item, dict):
continue
rows.append([
item.get("paper_title", "paper"),
item.get("why_read", ""),
item.get("what_to_extract", ""),
item.get("current_hook", ""),
item.get("kill_signal", ""),
])
if rows:
lines.append(markdown_table(["Anchor paper", "Why read now", "What to extract", "Current evidence hook", "Kill signal"], rows))
else:
for pid in rec.get("paper_ids") or []:
title = core_titles.get(pid, pid)
lines.append(f"- {title}")
else:
for pid in rec.get("paper_ids") or []:
title = core_titles.get(pid, pid)
lines.append(f"- {title}")
lines.append("")
return "\n".join(lines)
FILE:tooling/pipeline_spec.py
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import Any
import yaml
@dataclass(frozen=True)
class PipelineStage:
    id: str
    title: str
    checkpoint: str
    mode: str
    required_skills: tuple[str, ...]
    optional_skills: tuple[str, ...]
    produces: tuple[str, ...]
    human_checkpoint: dict[str, Any]

    @staticmethod
    def from_mapping(stage_id: str, data: Any, *, path: Path) -> "PipelineStage":
        if not isinstance(data, dict):
            raise ValueError(f"`stages.{stage_id}` must be a mapping in {path}")
        return PipelineStage(
            id=str(stage_id or "").strip(),
            title=str(data.get("title") or stage_id).strip() or str(stage_id),
            checkpoint=str(data.get("checkpoint") or stage_id).strip() or str(stage_id),
            mode=str(data.get("mode") or "").strip(),
            required_skills=_string_tuple(data.get("required_skills"), field_name=f"stages.{stage_id}.required_skills", path=path),
            optional_skills=_string_tuple(data.get("optional_skills"), field_name=f"stages.{stage_id}.optional_skills", path=path),
            produces=_string_tuple(data.get("produces"), field_name=f"stages.{stage_id}.produces", path=path),
            human_checkpoint=_mapping(data.get("human_checkpoint"), field_name=f"stages.{stage_id}.human_checkpoint", path=path, required=False),
        )
@dataclass(frozen=True)
class PipelineSpec:
path: Path
name: str
version: str
units_template: str
default_checkpoints: tuple[str, ...]
profile: str
routing_hints: tuple[str, ...]
routing_default: bool
routing_priority: int
target_artifacts: tuple[str, ...]
contract_model: str
structure_mode: str
pre_retrieval_shell: dict[str, Any]
binding_layers: tuple[str, ...]
core_chapter_h3_target: int
query_defaults: dict[str, Any]
overridable_query_fields: tuple[str, ...]
quality_contract: dict[str, Any]
loop_policy: dict[str, Any]
stages: dict[str, PipelineStage]
variant_of: str
variant_overrides: dict[str, Any]
@staticmethod
def load(path: Path) -> "PipelineSpec":
resolved = path.resolve()
raw_frontmatter, frontmatter = _load_variant_aware_frontmatter(resolved)
name = str(frontmatter.get("name") or path.stem.replace(".pipeline", ""))
units_template = str(frontmatter.get("units_template") or "")
version = str(frontmatter.get("version") or "")
default_checkpoints = _string_tuple(frontmatter.get("default_checkpoints"), field_name="default_checkpoints", path=resolved)
profile = str(frontmatter.get("profile") or "default").strip() or "default"
routing_hints = _string_tuple(frontmatter.get("routing_hints"), field_name="routing_hints", path=resolved)
routing_default = bool(frontmatter.get("routing_default"))
try:
routing_priority = int(frontmatter.get("routing_priority") or 0)
except Exception:
routing_priority = 0
target_artifacts = _string_tuple(frontmatter.get("target_artifacts"), field_name="target_artifacts", path=resolved)
contract_model = str(frontmatter.get("contract_model") or "").strip()
structure_mode = str(frontmatter.get("structure_mode") or "").strip()
pre_retrieval_shell = _mapping(frontmatter.get("pre_retrieval_shell"), field_name="pre_retrieval_shell", path=resolved, required=False)
binding_layers = _string_tuple(frontmatter.get("binding_layers"), field_name="binding_layers", path=resolved)
try:
core_chapter_h3_target = int(frontmatter.get("core_chapter_h3_target") or 0)
except Exception:
core_chapter_h3_target = 0
query_defaults = _mapping(frontmatter.get("query_defaults"), field_name="query_defaults", path=resolved, required=False)
overridable_query_fields = _string_tuple(
frontmatter.get("overridable_query_fields"),
field_name="overridable_query_fields",
path=resolved,
)
quality_contract = _mapping(frontmatter.get("quality_contract"), field_name="quality_contract", path=resolved, required=False)
loop_policy = _mapping(frontmatter.get("loop_policy"), field_name="loop_policy", path=resolved, required=False)
stages = _parse_stages(frontmatter.get("stages"), path=resolved)
variant_of = str(raw_frontmatter.get("variant_of") or "").strip()
variant_overrides = _mapping(raw_frontmatter.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False)
if not units_template:
raise ValueError(f"Missing units_template in pipeline front matter: {resolved}")
return PipelineSpec(
path=resolved,
name=name,
version=version,
units_template=units_template,
default_checkpoints=default_checkpoints,
profile=profile,
routing_hints=routing_hints,
routing_default=routing_default,
routing_priority=routing_priority,
target_artifacts=target_artifacts,
contract_model=contract_model,
structure_mode=structure_mode,
pre_retrieval_shell=pre_retrieval_shell,
binding_layers=binding_layers,
core_chapter_h3_target=core_chapter_h3_target,
query_defaults=query_defaults,
overridable_query_fields=overridable_query_fields,
quality_contract=quality_contract,
loop_policy=loop_policy,
stages=stages,
variant_of=variant_of,
variant_overrides=variant_overrides,
)
def query_default(self, key: str, default: Any = None) -> Any:
return self.query_defaults.get(str(key or "").strip(), default)
def allows_query_override(self, key: str) -> bool:
return str(key or "").strip() in set(self.overridable_query_fields)
def _parse_frontmatter(text: str) -> dict[str, Any]:
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("Pipeline file must start with YAML front matter '---'")
    # Find the closing delimiter of the front matter block.
    end_idx = None
    for idx in range(1, len(lines)):
        if lines[idx].strip() == "---":
            end_idx = idx
            break
    if end_idx is None:
        raise ValueError("Unterminated YAML front matter (missing closing '---')")
    raw = "\n".join(lines[1:end_idx])
    data = yaml.safe_load(raw) or {}
    if not isinstance(data, dict):
        raise ValueError("Pipeline YAML front matter must be a mapping")
    return data
def _load_variant_aware_frontmatter(path: Path, seen: set[Path] | None = None) -> tuple[dict[str, Any], dict[str, Any]]:
    resolved = path.resolve()
    active = set(seen or set())
    if resolved in active:
        # `active` is an unordered set, so sort the visited files to keep the message stable.
        chain = " -> ".join([*sorted(str(p) for p in active), str(resolved)])
        raise ValueError(f"Cyclic `variant_of` chain detected: {chain}")
    active.add(resolved)
    text = resolved.read_text(encoding="utf-8")
    raw = _parse_frontmatter(text)
    variant_of = str(raw.get("variant_of") or "").strip()
    if not variant_of:
        return raw, dict(raw)
    _validate_raw_variant_frontmatter(raw, path=resolved)
    base_path = _resolve_pipeline_reference(resolved, variant_of)
    _, base_effective = _load_variant_aware_frontmatter(base_path, seen=active)
    # Only `name` and `version` are taken from the variant's top level; everything else
    # must arrive through `variant_overrides`.
    current = {key: raw[key] for key in ("name", "version") if key in raw}
    merged = _deep_merge(base_effective, current)
    merged = _deep_merge(
        merged,
        _mapping(raw.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False),
    )
    merged["variant_of"] = variant_of
    merged["variant_overrides"] = raw.get("variant_overrides") or {}
    return raw, merged
def _validate_raw_variant_frontmatter(raw: dict[str, Any], *, path: Path) -> None:
    allowed_raw_keys = {"name", "version", "variant_of", "variant_overrides"}
    extra_keys = sorted(str(key) for key in raw.keys() if str(key) not in allowed_raw_keys)
    if extra_keys:
        raise ValueError(
            "Variant pipeline files may only keep top-level keys "
            "`name`, `version`, `variant_of`, `variant_overrides`; "
            f"move the rest under `variant_overrides` in {path}: {', '.join(extra_keys)}"
        )


def _resolve_pipeline_reference(path: Path, ref: str) -> Path:
    value = str(ref or "").strip()
    if not value:
        raise ValueError(f"Empty `variant_of` reference in {path}")
    candidate = Path(value)
    if candidate.is_absolute() and candidate.exists():
        return candidate.resolve()
    repo_root = path.parent.parent
    # First try the reference as a literal relative path.
    for base in (path.parent, repo_root, repo_root / "pipelines"):
        direct = (base / value).resolve()
        if direct.exists():
            return direct
    # Then treat it as a bare pipeline name and append the `.pipeline.md` suffix.
    stem = Path(value).name
    if stem.endswith(".pipeline.md"):
        stem = stem[: -len(".pipeline.md")]
    if stem:
        for base in (path.parent, repo_root / "pipelines"):
            direct = (base / f"{stem}.pipeline.md").resolve()
            if direct.exists():
                return direct
    raise ValueError(f"Could not resolve `variant_of: {value}` from {path}")
def _deep_merge(base: Any, override: Any) -> Any:
if isinstance(base, list) and isinstance(override, dict) and any(str(key).startswith("__") for key in override.keys()):
return _apply_list_patch(base, override)
if isinstance(base, dict) and isinstance(override, dict):
merged: dict[str, Any] = {str(k): v for k, v in base.items()}
for key, value in override.items():
if key in merged:
merged[key] = _deep_merge(merged[key], value)
else:
merged[key] = value
return merged
return override
def _apply_list_patch(base: list[Any], override: dict[str, Any]) -> list[Any]:
    allowed_keys = {"__append__", "__prepend__", "__remove__", "__replace__"}
    bad_keys = sorted(str(key) for key in override.keys() if str(key) not in allowed_keys)
    if bad_keys:
        raise ValueError(
            "List patch overrides only support "
            "`__append__`, `__prepend__`, `__remove__`, `__replace__`; "
            f"got: {', '.join(bad_keys)}"
        )
    if "__replace__" in override:
        replacement = override.get("__replace__")
        if not isinstance(replacement, list):
            raise ValueError("List patch `__replace__` must be a YAML list")
        return list(replacement)
    current = list(base)
    remove_values = override.get("__remove__", [])
    prepend_values = override.get("__prepend__", [])
    append_values = override.get("__append__", [])
    for field_name, value in (
        ("__remove__", remove_values),
        ("__prepend__", prepend_values),
        ("__append__", append_values),
    ):
        if not isinstance(value, list):
            raise ValueError(f"List patch `{field_name}` must be a YAML list")
    if remove_values:
        current = [item for item in current if item not in remove_values]
    if prepend_values:
        current = list(prepend_values) + current
    if append_values:
        current = current + list(append_values)
    return current


def _mapping(value: Any, *, field_name: str, path: Path, required: bool) -> dict[str, Any]:
    if value is None:
        return {}
    if not isinstance(value, dict):
        raise ValueError(f"`{field_name}` must be a mapping in {path}")
    return {str(key): item for key, item in value.items()}


def _string_tuple(value: Any, *, field_name: str, path: Path) -> tuple[str, ...]:
    if value is None:
        return ()
    if not isinstance(value, list):
        raise ValueError(f"`{field_name}` must be a YAML list in {path}")
    out: list[str] = []
    for item in value:
        text = str(item or "").strip()
        if text:
            out.append(text)
    return tuple(out)
def _parse_stages(value: Any, *, path: Path) -> dict[str, PipelineStage]:
    if value is None:
        return {}
    items: list[tuple[str, Any]] = []
    if isinstance(value, dict):
        items = [(str(stage_id or "").strip(), data) for stage_id, data in value.items()]
    elif isinstance(value, list):
        for idx, item in enumerate(value):
            if not isinstance(item, dict):
                raise ValueError(f"`stages[{idx}]` must be a mapping in {path}")
            stage_id = str(item.get("id") or item.get("stage") or "").strip()
            if not stage_id:
                raise ValueError(f"`stages[{idx}]` is missing `id` in {path}")
            items.append((stage_id, item))
    else:
        raise ValueError(f"`stages` must be a mapping or list in {path}")
    stages: dict[str, PipelineStage] = {}
    for stage_id, data in items:
        stage = PipelineStage.from_mapping(stage_id, data, path=path)
        if not stage.id:
            raise ValueError(f"`stages` contains an empty stage id in {path}")
        stages[stage.id] = stage
    return stages
FILE:tooling/pipeline_text.py
from __future__ import annotations

import json
import re
from pathlib import Path
from typing import Any

from tooling.common import load_yaml, read_jsonl


def slug_unit_id(unit_id: str) -> str:
    raw = str(unit_id or '').strip()
    out: list[str] = []
    for ch in raw:
        out.append(ch if ch.isalnum() else '_')
    safe = ''.join(out).strip('_')
    return f'S{safe}' if safe else 'S'


def load_outline_sections(path: Path) -> list[dict[str, Any]]:
    outline = load_yaml(path) if path.exists() else []
    if not isinstance(outline, list):
        return []
    out: list[dict[str, Any]] = []
    for sec in outline:
        if not isinstance(sec, dict):
            continue
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        subsections: list[dict[str, str]] = []
        for sub in sec.get('subsections') or []:
            if not isinstance(sub, dict):
                continue
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                subsections.append({'id': sub_id, 'title': sub_title})
        if sec_id and sec_title:
            out.append({'id': sec_id, 'title': sec_title, 'subsections': subsections})
    return out


def iter_h3_units(path: Path) -> list[dict[str, str]]:
    units: list[dict[str, str]] = []
    for sec in load_outline_sections(path):
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        for sub in sec.get('subsections') or []:
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                units.append(
                    {
                        'section_id': sec_id,
                        'section_title': sec_title,
                        'sub_id': sub_id,
                        'title': sub_title,
                    }
                )
    return units


def read_jsonl_map(path: Path, key: str) -> dict[str, dict[str, Any]]:
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        val = str(rec.get(key) or '').strip()
        if val:
            out[val] = rec
    return out


def read_bib_keys(path: Path) -> set[str]:
    if not path.exists() or path.stat().st_size <= 0:
        return set()
    text = path.read_text(encoding='utf-8', errors='ignore')
    return set(re.findall(r'(?im)^@\w+\s*\{\s*([^,\s]+)\s*,', text))
def uniq_keep_order(items: list[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or '').strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out


def clean_excerpt(text: str, *, limit: int = 220) -> str:
    s = str(text or '').strip()
    s = s.replace('\n', ' ')
    s = re.sub(r'\s+', ' ', s)
    s = s.strip(' "\'`')
    s = s.replace('|', ', ')
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(' ', 1)[0].strip()
    return clipped if clipped else s[:limit].strip()


def citation_list(keys: list[str], *, max_keys: int = 3) -> str:
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'[@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'


def inline_evidence_phrase(keys: list[str], *, max_keys: int = 3) -> str:
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'in [@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'


def heading_blocks(md: str) -> list[tuple[str, str]]:
    blocks: list[tuple[str, str]] = []
    current = ''
    lines: list[str] = []
    for raw in (md or '').splitlines():
        if raw.startswith('### '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = raw[4:].strip()
            lines = []
            continue
        if raw.startswith('## '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = ''
            lines = []
            continue
        if current:
            lines.append(raw)
    if current:
        blocks.append((current, '\n'.join(lines).strip()))
    return blocks


def dump_jsonl_lines(records: list[dict[str, Any]]) -> str:
    return '\n'.join(json.dumps(r, ensure_ascii=False) for r in records).rstrip() + ('\n' if records else '')
Curate reader-facing survey tables for the Appendix (clean layout + high information density), using only in-scope evidence and existing citation keys. **Tri...
---
name: appendix-table-writer
description: 'Curate reader-facing survey tables for the Appendix (clean layout + high information density), using only in-scope
  evidence and existing citation keys.
  **Trigger**: appendix tables, publishable tables, survey tables, reader tables, 附录表格, 可发表表格, 综述表格.
  **Use when**: you have C4 artifacts (evidence packs + anchor sheet + citations) and want tables that look like a real survey
  (not internal logs).
  **Skip if**: `outline/tables_appendix.md` already exists and is refined (>=2 tables; citation-backed; no placeholders; not
  index-y).
  **Network**: none.
  **Guardrail**: no invented facts; no pipeline jargon; no paragraph cells; use only keys present in `citations/ref.bib`.'
version: 0.1.0
metadata:
  openclaw:
    requires:
      anyBins:
        - python3
        - python
---
# Appendix Table Writer (publishable survey tables)
## Why this exists
The pipeline can produce index tables that are useful for planning/debugging, but read like internal artifacts.
This skill writes publishable, reader-facing tables that can live in an Appendix:
- cleaner layout
- higher information density
- survey-style organization (methods/benchmarks/risks), not intermediate state
Index tables remain in `outline/tables_index.md` and should not be copied verbatim into the paper.
## Inputs
- `outline/table_schema.md` (table intent + evidence mapping)
- `outline/tables_index.md` (internal index; optional but recommended)
- `outline/subsection_briefs.jsonl`
- `outline/evidence_drafts.jsonl`
- `outline/anchor_sheet.jsonl`
- `citations/ref.bib`
- Optional: `GOAL.md`
Read as needed:
- `references/table_cell_hygiene.md` when Appendix table cells still copy raw paper self-narration or generic result wrappers
Machine-readable assets:
- `assets/table_cell_hygiene.json`
## Output
- `outline/tables_appendix.md`
## Roles (use explicitly)
### Survey Table Curator (reader lens)
Mission: choose tables a reader actually wants in a survey Appendix.
Do:
- prefer 2-3 tables that answer big questions (methods, evaluation, risks)
- make rows comparable (same row unit across the table)
- make the table legible without reading the whole paper
Avoid:
- one-row-per-H3 index dumps
- columns named like internal axes ("axes", "blocking_missing", "evidence readiness")
### Production Editor (layout)
Mission: make the table look publishable in LaTeX.
Do:
- keep columns <= 4
- keep cells short (phrases, not sentences)
- use `<br>` sparingly (0-1 per cell; never a list dump)
Avoid:
- 6-8 columns with tiny unreadable text
- cells that look like notes (semicolon chains + slash lists + long parentheticals)
- slash-separated axis markers (A/B/C) in captions/headers/cells (post-merge voice gate will flag them); use commas or 'and' instead
- internal axis jargon that reads like an intermediate artifact once printed (e.g., calling table columns "tokens"); prefer "protocol details/metadata/assumptions"
### Evidence Steward (verifiability)
Mission: prevent hallucinations.
Do:
- every row must include citations in a dedicated column (e.g., "Key refs")
- only restate what appears in evidence packs / anchor sheet
- when evidence is thin, prefer fewer rows with stronger grounding
Avoid:
- "representative works" with no supporting claim in packs/anchors
- adding benchmark/method details not present upstream
## Table contract (publishable, Appendix-ready)
`outline/tables_appendix.md` must:
- contain >=2 Markdown tables
- use a caption line before each table, e.g. `**Appendix Table A1. ...**`
- contain no headings (`#`, `##`, `###`) inside the file (the merger adds an Appendix heading)
- contain no placeholders (`TODO`, `TBD`, `FIXME`, `...`, unicode ellipsis)
- contain citations in rows using `[@BibKey]` (keys must exist in `citations/ref.bib`)
- avoid pipeline jargon and index-like column names
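
The mechanical parts of this contract can be checked with a short script. The following is a minimal sketch, not the bundled `scripts/run.py`: the function name `check_appendix_contract` is hypothetical, and the table-detection heuristic (a header row followed by a `|---|` separator row) is an assumption. It does not attempt the jargon check in the last bullet, which needs human judgment.

```python
import re

# Placeholders named in the contract: TODO, TBD, FIXME, "...", unicode ellipsis.
PLACEHOLDER_RE = re.compile(r"\b(?:TODO|TBD|FIXME)\b|\.\.\.|\u2026")
# A Markdown table separator row such as |---|---| or | :--- | ---: |.
SEPARATOR_RE = re.compile(r"^\s*\|?\s*:?-{3,}:?\s*(?:\|\s*:?-{3,}:?\s*)*\|?\s*$")
HEADING_RE = re.compile(r"^#{1,3}\s")
CITE_RE = re.compile(r"\[@[^\]\s]+\]")


def check_appendix_contract(md: str) -> list[str]:
    """Return contract violations; an empty list means the text passes."""
    problems: list[str] = []
    lines = md.splitlines()
    if any(HEADING_RE.match(line) for line in lines):
        problems.append("contains headings")
    if PLACEHOLDER_RE.search(md):
        problems.append("contains placeholders")
    tables = 0
    i = 0
    while i + 1 < len(lines):
        # Assumed heuristic: a table starts at a header row followed by a separator row.
        if "|" in lines[i] and SEPARATOR_RE.match(lines[i + 1]):
            tables += 1
            i += 2
            # Every data row must carry at least one [@BibKey] citation.
            while i < len(lines) and "|" in lines[i]:
                if not CITE_RE.search(lines[i]):
                    problems.append(f"row without citation: {lines[i].strip()}")
                i += 1
        else:
            i += 1
    if tables < 2:
        problems.append(f"expected >=2 tables, found {tables}")
    return problems
```

Whether each cited key exists in `citations/ref.bib` is a separate check, since it needs the bibliography as a second input.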
## Workflow (explicit inputs)
- Start from `GOAL.md` (scope) and `outline/table_schema.md` (what each table must answer).
- Use `outline/tables_index.md` as a shortlist source, but do not paste it verbatim.
- Fill rows/cells using `outline/subsection_briefs.jsonl`, `outline/evidence_drafts.jsonl`, and `outline/anchor_sheet.jsonl` (no guessing).
- Validate every cited key against `citations/ref.bib`.
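
The last step can be sketched as follows. The bib-entry regex is the one used by `read_bib_keys` in `tooling/pipeline_text.py`; the function name `missing_cite_keys` is hypothetical, and the sketch assumes each citation is written as its own `[@Key]` bracket.

```python
import re

# Same pattern as read_bib_keys in tooling/pipeline_text.py.
BIB_ENTRY_RE = re.compile(r"(?im)^@\w+\s*\{\s*([^,\s]+)\s*,")
# Assumption: one key per bracket, e.g. "[@KeyA] and [@KeyB]".
CITE_KEY_RE = re.compile(r"\[@([^\]\s;,]+)\]")


def missing_cite_keys(table_md: str, bib_text: str) -> list[str]:
    """Return cited keys absent from the bibliography, in first-seen order."""
    bib_keys = set(BIB_ENTRY_RE.findall(bib_text))
    missing: list[str] = []
    for key in CITE_KEY_RE.findall(table_md):
        if key not in bib_keys and key not in missing:
            missing.append(key)
    return missing
```

A non-empty result means a row cites a key that `citations/ref.bib` does not define; per the guardrail, drop the row or fix it upstream rather than inventing an entry.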
## Recommended Appendix tables (default set)
If you are unsure what to build, start with these two:
1) Method/architecture map (representative works)
   - Row unit: work/system line (not H3 id)
   - Columns (example):
     - Work (short name)
     - Core idea (1 short phrase)
     - Loop + interface assumptions (1 short phrase; reader-facing)
     - Key refs (2-4 cite keys)
2) Evaluation protocol / benchmark map
   - Row unit: benchmark / evaluation setting (or a canonical protocol dimension if benchmarks are thin)
   - Columns (example):
     - Benchmark / setting
     - Task + metric (phrases, not definitions)
     - Key protocol constraints (budget/cost/latency/steps/tool access/threat model)
     - Key refs (2-4 cite keys)

Optional third (only if it stays clean):
3) Risk / threat-surface map
   - Row unit: threat/failure mode category
   - Columns: surface; why it matters; mitigation pattern; key refs
## Positive / negative examples (style)
Bad (index table / internal notes):
- Column: "Axes"
- Cell: `planning / memory / tools / eval / safety` (slash dump)
- Rows: every H3 id with 5+ `<br>` lines
Good (survey table):
- Column labels are reader-facing ("Core idea", "Task + metric", "Constraint")
- Cells are short phrases (no narration)
- A reader can scan and compare rows quickly
Also good (avoid intermediate-artifact tells):
- Don't label columns as "token(s)". If you need the idea, rewrite as "protocol details/metadata/assumptions".
- Avoid ASCII arrows like `->` inside cells; prefer natural phrasing (e.g., "interleaves reasoning traces with tool actions").
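
Several of these style tells are easy to flag mechanically before a human pass. The sketch below is illustrative only: `lint_cell` is a hypothetical helper, and the thresholds (three or more slash-separated items count as a dump; more than one `<br>` per cell is too many) are assumptions drawn from the guidance above, not the post-merge voice gate's actual rules.

```python
import re

# Three or more slash-separated items, e.g. "planning / memory / tools".
SLASH_DUMP_RE = re.compile(r"\w+(?:\s*/\s*\w+){2,}")
BR_RE = re.compile(r"(?i)<br\s*/?>")


def lint_cell(cell: str) -> list[str]:
    """Return style issues found in a single table cell."""
    issues: list[str] = []
    if SLASH_DUMP_RE.search(cell):
        issues.append("slash list")
    if "->" in cell:
        issues.append("ascii arrow")
    if len(BR_RE.findall(cell)) > 1:
        issues.append("too many <br>")
    return issues
```

A clean result does not prove a cell reads well; it only rules out the patterns listed here.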
## When to stop / route upstream
If you cannot fill a row without guessing:
- remove the row (prefer fewer, solid rows), and
- route upstream: strengthen `evidence-draft` / `anchor-sheet` for that area.
## Script (generator + validator)
### Quick Start
- `python scripts/run.py --help`
- `python scripts/run.py --workspace workspaces/<ws>`
### All Options
- `--workspace <workspace_dir>` (required)
- `--unit-id <id>` (optional; used only for runner bookkeeping)
- `--inputs <a;b;c>` (optional; ignored by the validator; kept for runner compatibility)
- `--outputs <relpath>` (optional; defaults to `outline/tables_appendix.md`)
- `--checkpoint <C#>` (optional; ignored by the validator)
### Examples
- Generate and validate the default appendix tables file:
`python scripts/run.py --workspace workspaces/e2e-agent-survey-latex-verify-YYYYMMDD-HHMMSS`
- Pass the output path explicitly (shown here with the default; substitute your own relative path if the workspace writes appendix tables elsewhere):
`python scripts/run.py --workspace workspaces/<ws> --outputs outline/tables_appendix.md`
Notes:
- This script writes `outline/tables_appendix.md` from the existing evidence artifacts and then validates the result.
- It always writes a short report to `output/TABLES_APPENDIX_REPORT.md`.
FILE:AGENTS.md
# AGENTS.md
This exported skill uses `AGENTS.md` only as a local repo-root marker for bundled helper scripts.
Source skill: `appendix-table-writer` from `research-units-pipeline-skills`.
Export slug: `appendix-table-writer`.
FILE:assets/table_cell_hygiene.json
{
"leading_context_pattern": "(?i)^(?:building on this foundation|against this backdrop|in this context|to this end|towards this end|specifically|for example|for instance|in addition|moreover|furthermore|additionally|then|crucially)[: ,]+\\s*",
"leading_author_finding_pattern": "(?i)^(?:our\\s+(?:extensive\\s+)?evaluation(?:s)?(?:\\s*\\([^)]*\\))?|evaluations?\\s+across\\s+[^,]{0,160}|through\\s+(?:simulated|extensive|comprehensive)\\s+[^,]{0,160}\\s+experiments?|we|our\\s+(?:results|analysis|study|experiments?))\\s+(?:(?:also|further|furthermore|empirically|experimentally)\\s+)?(?:show|shows|find|finds|demonstrate|demonstrates|validate|validates|reveal|reveals|indicate|indicates)\\s+(?:that\\s+)?",
"leading_author_action_pattern": "(?i)^(?:we|our\\s+(?:approach|method|framework|system)|specifically,\\s*we|building on this foundation,\\s*we)\\s+(?:introduce|present|propose|develop|describe|analyze|review|explore|evaluate|design|conduct)\\b[^,]{0,220},\\s*",
"generic_summary_patterns": [
"(?i)^our extensive evaluation\\b",
"(?i)^evaluations? across\\b",
"(?i)^through simulated and real-world experiments\\b",
"(?i)^through extensive benchmarking\\b",
"(?i)^building on this foundation, we present\\b",
"(?i)^specifically, we introduce\\b",
"(?i)^we propose\\s+1\\)",
"(?i)^our main contribution is\\b",
"(?i)^from a detailed search of \\d+ articles\\b",
"(?i)^evaluation mentions include\\b",
"(?i)^when comparing results, anchor the paragraph with\\b",
"(?i)^three key findings mapped to these axes\\b"
],
"drop_if_starts_with": [
"we ",
"our ",
"this paper ",
"this survey "
]
}
FILE:pipelines/arxiv-survey-latex.pipeline.md
---
name: arxiv-survey-latex
version: 3.8
variant_of: arxiv-survey
variant_overrides:
  routing_hints: [latex, pdf, tex, 可编译, 编译]
  routing_default: false
  routing_priority: 20
  units_template: templates/UNITS.arxiv-survey-latex.csv
  target_artifacts:
    __append__:
      - latex/main.tex
      - latex/main.pdf
      - output/LATEX_BUILD_REPORT.md
  stages:
    C5:
      title: Draft + PDF
      required_skills:
        __append__:
          - latex-scaffold
          - latex-compile-qa
      optional_skills:
        __remove__:
          - latex-scaffold
          - latex-compile-qa
      produces:
        __append__:
          - latex/main.tex
          - latex/main.pdf
          - output/LATEX_BUILD_REPORT.md
---
# Pipeline: arXiv survey / review (MD-first + LaTeX/PDF)
Variant of `arxiv-survey`.
Use this pipeline only when the default deliverable must include:
- `latex/main.tex`
- `latex/main.pdf`
- `output/LATEX_BUILD_REPORT.md`
All non-PDF survey behavior, defaults, checkpoints, and earlier stages inherit from `arxiv-survey`.
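
To make the inheritance concrete, here is a simplified sketch of how the C5 overrides in this file's front matter combine with the base stage. It follows the list-patch rules of the bundled loader (`__replace__` wins outright; otherwise `__remove__`, then `__prepend__`, then `__append__`) but omits its validation and error handling. The base stage shown is abbreviated for illustration and is not the full `arxiv-survey` C5 definition.

```python
from typing import Any


def apply_list_patch(base: list[Any], patch: dict[str, Any]) -> list[Any]:
    """Apply __replace__, or __remove__ then __prepend__ then __append__."""
    if "__replace__" in patch:
        return list(patch["__replace__"])
    removed = patch.get("__remove__", [])
    current = [item for item in base if item not in removed]
    return list(patch.get("__prepend__", [])) + current + list(patch.get("__append__", []))


def deep_merge(base: Any, override: Any) -> Any:
    """Recursively merge mappings; a dict with __keys__ patches a base list."""
    if isinstance(base, list) and isinstance(override, dict) and any(
        str(key).startswith("__") for key in override
    ):
        return apply_list_patch(base, override)
    if isinstance(base, dict) and isinstance(override, dict):
        merged = dict(base)
        for key, value in override.items():
            merged[key] = deep_merge(merged[key], value) if key in merged else value
        return merged
    return override


# Abbreviated base stage (illustrative subset of the real skill lists).
base_c5 = {
    "title": "Draft",
    "required_skills": ["section-merger", "draft-polisher"],
    "optional_skills": ["prose-writer", "latex-scaffold", "latex-compile-qa"],
}
overrides = {
    "title": "Draft + PDF",
    "required_skills": {"__append__": ["latex-scaffold", "latex-compile-qa"]},
    "optional_skills": {"__remove__": ["latex-scaffold", "latex-compile-qa"]},
}
merged = deep_merge(base_c5, overrides)
```

The effect is that the two LaTeX skills move from optional to required for this variant, while every key the variant does not mention keeps its base value.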
FILE:pipelines/arxiv-survey.pipeline.md
---
name: arxiv-survey
version: 3.8
profile: arxiv-survey
routing_hints: [survey, review, 综述, 调研, literature review]
routing_default: true
routing_priority: 10
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
- outline/transitions.md
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/claim_evidence_matrix.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- outline/writer_context_packs.jsonl
- citations/ref.bib
- citations/verified.jsonl
- sections/sections_manifest.jsonl
- sections/h3_bodies.refined.ok
- sections/paragraphs_curated.refined.ok
- sections/style_harmonized.refined.ok
- sections/opener_varied.refined.ok
- sections/abstract.md
- sections/S1.md
- sections/S2.md
- sections/discussion.md
- sections/conclusion.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/SCHEMA_NORMALIZATION_REPORT.md
- output/EVIDENCE_SELFLOOP_TODO.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/PARAGRAPH_CURATION_REPORT.md
- output/FRONT_MATTER_REPORT.md
- output/CHAPTER_LEADS_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/GLOBAL_REVIEW.md
- output/DRAFT.md
- output/MERGE_REPORT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.arxiv-survey.csv
contract_model: pipeline.frontmatter/v1
structure_mode: section_first
pre_retrieval_shell:
enabled: true
approval_surface: false
allowed_h2: [Introduction, Related Work, Core Chapters, Discussion, Conclusion]
binding_layers: [chapter_skeleton, section_bindings, section_briefs, subsection_mapping]
core_chapter_h3_target: 3
query_defaults:
max_results: 1800
core_size: 300
per_subsection: 28
global_citation_min_subsections: 4
draft_profile: survey
citation_target: recommended
evidence_mode: abstract
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- per_subsection
- global_citation_min_subsections
- draft_profile
- citation_target
- enrich_metadata
- evidence_mode
- fulltext_max_papers
- fulltext_max_pages
- fulltext_min_chars
- time_window.from
- time_window.to
quality_contract:
citation_policy:
unique_hard_floor: 150
unique_recommended: 165
structure_policy:
max_final_h2_by_profile:
survey: 8
deep: 9
max_h3_by_profile:
survey: 10
deep: 12
front_matter_policy:
survey:
introduction:
min_cites: 35
min_paras: 5
min_chars: 2600
related_work:
min_cites: 50
min_paras: 6
min_chars: 3200
deep:
introduction:
min_cites: 40
min_paras: 6
min_chars: 3000
related_work:
min_cites: 55
min_paras: 7
min_chars: 3600
subsection_policy:
survey:
min_unique_citations: 12
min_chars: 4200
deep:
min_unique_citations: 14
min_chars: 5200
loop_policy:
stage_retry_budget:
C1: 2
C2: 2
C3: 1
C4: 1
max_reroutes: 4
require_human_on_retry_after_approval: true
stages:
C0:
title: Init
mode: no_prose
required_skills: [workspace-init, pipeline-router]
optional_skills: []
produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
C1:
title: Retrieval & core set
mode: no_prose
required_skills: [literature-engineer, dedupe-rank]
optional_skills: [keyword-expansion, survey-seed-harvest]
produces: [papers/papers_raw.jsonl, papers/retrieval_report.md, papers/papers_dedup.jsonl, papers/core_set.csv]
C2:
title: Structure
mode: no_prose
required_skills: [taxonomy-builder, chapter-skeleton, section-bindings, section-briefs, outline-builder, section-mapper, outline-refiner, pipeline-router, human-checkpoint]
optional_skills: [outline-budgeter]
produces: [outline/taxonomy.yml, outline/chapter_skeleton.yml, outline/section_bindings.jsonl, outline/section_binding_report.md, outline/section_briefs.jsonl, outline/outline.yml, outline/mapping.tsv, outline/coverage_report.md, outline/outline_state.jsonl, output/REROUTE_STATE.json, DECISIONS.md]
human_checkpoint:
approve: scope + section skeleton + outline
write_to: DECISIONS.md
C3:
title: Evidence
mode: no_prose
required_skills: [pdf-text-extractor, paper-notes, subsection-briefs, chapter-briefs]
optional_skills: []
produces: [papers/fulltext_index.jsonl, papers/paper_notes.jsonl, papers/evidence_bank.jsonl, outline/subsection_briefs.jsonl, outline/chapter_briefs.jsonl]
C4:
title: Citations + evidence packs
mode: no_prose
required_skills: [citation-verifier, evidence-binder, evidence-draft, table-schema, anchor-sheet, table-filler, appendix-table-writer, schema-normalizer, writer-context-pack, evidence-selfloop, claim-matrix-rewriter]
optional_skills: [survey-visuals]
produces: [citations/ref.bib, citations/verified.jsonl, outline/evidence_bindings.jsonl, outline/evidence_binding_report.md, outline/table_schema.md, outline/tables_index.md, outline/tables_appendix.md, output/TABLES_APPENDIX_REPORT.md, outline/evidence_drafts.jsonl, outline/anchor_sheet.jsonl, output/SCHEMA_NORMALIZATION_REPORT.md, outline/writer_context_packs.jsonl, output/EVIDENCE_SELFLOOP_TODO.md, outline/claim_evidence_matrix.md]
C5:
title: Draft
mode: prose_allowed
required_skills: [front-matter-writer, chapter-lead-writer, subsection-writer, writer-selfloop, section-logic-polisher, argument-selfloop, paragraph-curator, style-harmonizer, opener-variator, evaluation-anchor-checker, transition-weaver, section-merger, post-merge-voice-gate, citation-diversifier, citation-injector, draft-polisher, global-reviewer, pipeline-auditor, artifact-contract-auditor]
optional_skills: [prose-writer, subsection-polisher, redundancy-pruner, terminology-normalizer, limitation-weaver, latex-scaffold, latex-compile-qa]
produces: [outline/transitions.md, sections/sections_manifest.jsonl, sections/h3_bodies.refined.ok, sections/paragraphs_curated.refined.ok, sections/style_harmonized.refined.ok, sections/opener_varied.refined.ok, sections/abstract.md, sections/S1.md, sections/S2.md, sections/discussion.md, sections/conclusion.md, output/WRITER_SELFLOOP_TODO.md, output/EVAL_ANCHOR_REPORT.md, output/ARGUMENT_SELFLOOP_TODO.md, output/SECTION_ARGUMENT_SUMMARIES.jsonl, output/ARGUMENT_SKELETON.md, output/PARAGRAPH_CURATION_REPORT.md, output/FRONT_MATTER_REPORT.md, output/CHAPTER_LEADS_REPORT.md, output/SECTION_LOGIC_REPORT.md, output/MERGE_REPORT.md, output/DRAFT.md, output/POST_MERGE_VOICE_REPORT.md, output/CITATION_BUDGET_REPORT.md, output/CITATION_INJECTION_REPORT.md, output/GLOBAL_REVIEW.md, output/AUDIT_REPORT.md, output/CONTRACT_REPORT.md]
---
# Pipeline: arXiv survey / review (MD-first)
Default contract (survey-grade, A150++):
- `queries.md` defaults are set for a *survey deliverable* (no silent downgrade): `core_size=300`, `per_subsection=28`, global unique citations hard floor `>=150` (recommended `>=165` when `core_size=300`; default `citation_target=recommended`).
- `draft_profile` controls **writing strictness** (`survey` vs `deep`), not “speed mode”.
- `evidence_mode` controls **evidence strength** (`abstract` default; `fulltext` optional and heavier).
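
As a rough illustration of the global unique-citation gate described above, the sketch below counts distinct keys in a draft and compares them with the active target. The defaults (150 and 165) come from `quality_contract.citation_policy`; the function names, the returned fields, and the rule that any `citation_target` other than `recommended` falls back to the hard floor are assumptions, not the auditor's actual implementation.

```python
import re

# A bracket holding at least one @key, e.g. [@a] or [@a; @b].
BRACKET_RE = re.compile(r"\[([^\]]*@[^\]]*)\]")
KEY_RE = re.compile(r"@([^\s;,\]]+)")


def unique_citation_keys(md: str) -> set[str]:
    """Collect distinct citation keys from all bracketed citations."""
    keys: set[str] = set()
    for inner in BRACKET_RE.findall(md):
        keys.update(KEY_RE.findall(inner))
    return keys


def citation_gate(
    md: str,
    *,
    citation_target: str = "recommended",
    hard_floor: int = 150,
    recommended: int = 165,
) -> dict[str, object]:
    """Compare the unique-key count with the active target (assumed rule)."""
    required = recommended if citation_target == "recommended" else hard_floor
    found = len(unique_citation_keys(md))
    return {
        "unique": found,
        "required": required,
        "passed": found >= required,
        "gap": max(0, required - found),
    }
```

The `gap` value is the number of additional unique keys the draft would need in order to pass.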
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- survey-seed-harvest
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- `queries.md` may specify `max_results` and a year range (`time_window.from`, `time_window.to`); `arxiv-search` will paginate and attach arXiv metadata (categories, arxiv_id, etc.) when online.
- If you import an offline export but later have network, you can set `enrich_metadata: true` in `queries.md` (or run `arxiv-search --enrich-metadata`) to backfill missing abstracts/authors/categories via arXiv `id_list`.
- Evidence-first expectation (A150++): aim for a large dedup pool (target >=1200, not ~200) and a stable, verifiable core set (`core_size=300`) so later stages can bind wide in-scope citation pools without forcing out-of-scope drift.
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- chapter-skeleton
- section-bindings
- section-briefs
- outline-builder
- section-mapper
- outline-refiner
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
human_checkpoint:
- approve: scope + section skeleton + outline
- write_to: DECISIONS.md
Notes:
- `chapter-skeleton` is the first retrieval-informed chapter contract; it stays chapter-level and does not emit stable H3 ids.
- `section-bindings` measures chapter saturation before H3 decomposition and writes PASS/BLOCKED signals into `outline/section_binding_report.md`.
- `section-briefs` turns the chapter layer into decomposition guidance plus subsection seeds; `outline-builder` then derives the first stable `outline/outline.yml` from that section layer.
- `outline-refiner` now writes both `outline/outline_state.jsonl` and `output/REROUTE_STATE.json`; section-first reroute/block information should be read from those artifacts instead of inferred from prose notes.
- Evidence-first expectation: each subsection should be written as a *question to answer* (RQ) plus *evidence needs* (what kind of citations/results are required), not just generic scaffold bullets.
- Coverage default: `section-mapper` uses `queries.md:per_subsection` as the per-H3 mapping contract (A150++ default: 28) so later evidence binding and writing have enough in-scope citations to choose from.
- Diversity expectation: mapping should not over-reuse a few papers across unrelated H3s; reserve “global” works for genuinely cross-cutting citations (controlled by `global_citation_min_subsections`).
- Budget policy (paper-like): avoid H3 explosion; the outline gate uses `queries.md:draft_profile` to set max H3 (survey<=10, deep<=12).
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent H3s into fewer, thicker units, then rerun `section-mapper` → `outline-refiner` before `Approve C2`.
## Stage 3 - Evidence (C3) [NO PROSE]
required_skills:
- pdf-text-extractor
- paper-notes
- subsection-briefs
- chapter-briefs
produces:
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
Notes:
- `queries.md` can set `evidence_mode: "abstract"|"fulltext"` (A150++ default: `abstract`).
- `queries.md` can set `draft_profile: "survey"|"deep"` to control writing gate strictness (A150++ default: `survey`).
- If `evidence_mode: "fulltext"`, `pdf-text-extractor` can be tuned via `fulltext_max_papers`, `fulltext_max_pages`, `fulltext_min_chars`.
- `subsection-briefs` converts each H3 into a verifiable writing card (scope_rule/rq/axes/clusters/paragraph_plan) so writing does not copy outline scaffolds.
- Optional refinement markers (recommended): treat briefs as *contracts*, not scaffolds. If you manually refine them and want to prevent regeneration, create:
  - `outline/subsection_briefs.refined.ok`
  - `outline/chapter_briefs.refined.ok`
  These markers are used as explicit “reviewed/refined” signals and as a freeze switch (scripts won’t overwrite refined briefs).
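
The freeze-switch behavior can be pictured with a small sketch. It assumes the marker sits next to the artifact with the extension swapped for `.refined.ok`, as in the two paths above; `is_frozen` and `write_unless_frozen` are hypothetical names, and the real scripts may apply the rule differently (for example, this sketch only honors the marker when the artifact already exists).

```python
from pathlib import Path


def is_frozen(artifact: Path) -> bool:
    """True if a sibling `<stem>.refined.ok` marker exists."""
    # e.g. outline/subsection_briefs.jsonl -> outline/subsection_briefs.refined.ok
    marker = artifact.with_name(artifact.stem + ".refined.ok")
    return marker.exists()


def write_unless_frozen(artifact: Path, content: str) -> bool:
    """Write the artifact unless it exists and is marked refined.

    Returns True if the file was written, False if the write was skipped.
    """
    if artifact.exists() and is_frozen(artifact):
        return False
    artifact.parent.mkdir(parents=True, exist_ok=True)
    artifact.write_text(content, encoding="utf-8")
    return True
```

Deleting the marker file re-enables regeneration for that artifact.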
## Stage 4 - Citations + evidence packs (C4) [NO PROSE]
required_skills:
- citation-verifier
- evidence-binder
- evidence-draft
- table-schema
- anchor-sheet
- table-filler
- appendix-table-writer
- schema-normalizer
- writer-context-pack
- evidence-selfloop
- claim-matrix-rewriter
optional_skills:
- survey-visuals
produces:
- citations/ref.bib
- citations/verified.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- output/SCHEMA_NORMALIZATION_REPORT.md
- outline/writer_context_packs.jsonl
- output/EVIDENCE_SELFLOOP_TODO.md
- outline/claim_evidence_matrix.md
Notes:
- `evidence-draft` turns paper notes into per-subsection evidence packs (claim candidates + concrete comparisons + eval protocol + limitations) that the writer must follow.
- `claim-matrix-rewriter` makes `outline/claim_evidence_matrix.md` a projection/index of evidence packs (not an outline expansion), so writer guidance stays evidence-first.
- `writer-context-pack` builds a deterministic per-H3 drafting pack (briefs + evidence + anchors + allowed cites), reducing hollow writing and making C5 more debuggable.
- Tables are part of the default survey deliverable, but split into two layers:
- `outline/tables_index.md` (internal index; produced by `table-filler`; useful for planning/debugging; NOT inserted into the paper)
- `outline/tables_appendix.md` (reader-facing; produced by `appendix-table-writer`; clean/publishable; inserted into the draft as an Appendix block by `section-merger`)
- Optional: `survey-visuals` can still produce timeline/figure specs as intermediate artifacts.
- Optional refinement markers (recommended): after you spot-check/refine C4 artifacts and want to freeze them, create:
- `outline/evidence_bindings.refined.ok`
- `outline/evidence_drafts.refined.ok`
- `outline/anchor_sheet.refined.ok`
- `outline/writer_context_packs.refined.ok`
These markers make “reviewed/refined” explicit and prevent accidental regeneration/overwrite; strict mode relies on content checks (placeholders/blocking_missing/scope), not marker presence.
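The freeze behaviour described above can be sketched as a small guard that a generator script calls before overwriting a C4 artifact. This is only an illustration: `marker_for` and `may_regenerate` are hypothetical helper names, and the sketch assumes the marker sits next to the artifact with the final extension replaced by `.refined.ok`, as in the four marker paths listed.

```python
from pathlib import Path


def marker_for(artifact: Path) -> Path:
    """Map an artifact to its freeze marker (assumed naming convention),
    e.g. outline/anchor_sheet.jsonl -> outline/anchor_sheet.refined.ok."""
    return artifact.with_name(artifact.stem + ".refined.ok")


def may_regenerate(artifact: Path) -> bool:
    """A script may overwrite the artifact only when no freeze marker exists."""
    return not marker_for(artifact).exists()


if __name__ == "__main__":
    target = Path("outline/anchor_sheet.jsonl")
    if may_regenerate(target):
        print(f"regenerating {target}")
    else:
        print(f"{target} is frozen by {marker_for(target)}; skipping")
```

Note that this only models the overwrite guard; as stated above, strict mode still relies on content checks rather than marker presence.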
## Stage 5 - Draft (C5) [PROSE AFTER C2]
required_skills:
- front-matter-writer
- chapter-lead-writer
- subsection-writer
- writer-selfloop
- section-logic-polisher
- argument-selfloop
- paragraph-curator
- style-harmonizer
- opener-variator
- evaluation-anchor-checker
- transition-weaver
- section-merger
- post-merge-voice-gate
- citation-diversifier
- citation-injector
- draft-polisher
- global-reviewer
- pipeline-auditor
- artifact-contract-auditor
optional_skills:
- prose-writer
- subsection-polisher
- redundancy-pruner
- terminology-normalizer
- limitation-weaver
- latex-scaffold
- latex-compile-qa
produces:
- sections/sections_manifest.jsonl
- sections/abstract.md
- sections/discussion.md
- sections/conclusion.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/MERGE_REPORT.md
- output/DRAFT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/GLOBAL_REVIEW.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
Notes:
- C5 writing system (semantic + minimal artifacts; no extra machinery):
- **Unit of work**: `sections/*.md` (front matter, H2 leads, H3 bodies). Avoid editing `output/DRAFT.md` directly until after merge.
- **Single source of truth (locked conventions)**: `output/ARGUMENT_SKELETON.md` → `## Consistency Contract` (terminology, scope boundary, evaluation protocol fields, baseline naming).
- **Write → check → fix (five gates)**:
1) `writer-selfloop` → `output/WRITER_SELFLOOP_TODO.md`: file existence, depth, citation scope, paper voice.
2) `section-logic-polisher` → `output/SECTION_LOGIC_REPORT.md`: paragraph linkage (no jump cuts / “paragraph islands”).
3) `argument-selfloop` → `output/ARGUMENT_SELFLOOP_TODO.md` + `output/ARGUMENT_SKELETON.md` + `output/SECTION_ARGUMENT_SUMMARIES.jsonl`: section-level closure + premise/definition stability.
4) `paragraph-curator` → `output/PARAGRAPH_CURATION_REPORT.md`: **select → evaluate → subset → fuse** so sections converge (reduce redundancy, strengthen synthesis) without changing citation keys.
5) `evaluation-anchor-checker` → `output/EVAL_ANCHOR_REPORT.md`: final section-level numeric hygiene sweep so surviving numeric claims keep same-sentence task/metric/constraint context before merge.
- **Openers-last**: draft the middle first; rewrite paragraph 1 last so it reflects real content (front matter + H3).
- Writing self-loop gate: `subsection-writer` ensures the full `sections/` file set exists (and emits `sections/sections_manifest.jsonl`); `writer-selfloop` blocks until depth/citation-scope/paper-voice checks pass, writing `output/WRITER_SELFLOOP_TODO.md` (PASS/FAIL).
- Argument self-loop gate: `argument-selfloop` blocks “smooth but hollow” sections by making the argument chain explicit (per-section paragraph moves + a global dependency skeleton). Its ledgers are intermediate artifacts and must never be merged into the paper.
- Style hygiene (C5 hard gate for `survey`/`deep`): treat `output/WRITER_SELFLOOP_TODO.md` Style Smells as mandatory fixes. Run `style-harmonizer` + `opener-variator` on flagged files, then rerun `writer-selfloop` before merge.
- Numeric hygiene gate: run `evaluation-anchor-checker` after the last section-level rewrites (`paragraph-curator` + `style-harmonizer` + `opener-variator`) and before merge. If later section rewrites touch the same H3 files, rerun it; do not wait for `pipeline-auditor` to discover underspecified numbers in the merged draft.
- Micro-fix routing (preferred over broad rewrites): if Style Smells are specific, use targeted micro-skills before a general harmonize pass:
- opener cadence / “overview” narration → `opener-variator`
- count-based limitation slots (“Two limitations…”) → `limitation-weaver`
- underspecified numeric/performance claims (missing task/metric/budget) → `evaluation-anchor-checker`
- Triage rule (prevents “patching evidence gaps with prose”): if `writer-selfloop` FAILs because a subsection cannot meet `must_use` *in-scope* (thin packs / missing anchors / out-of-scope citation pressure), stop and rerun the evidence loop (`evidence-selfloop` + upstream C2/C3/C4) instead of padding prose.
- WebWeaver-style “planner vs writer” split (single agent, two passes):
- Planner pass: for each section/subsection, pick the exact citation IDs to use from the evidence bank (`outline/evidence_drafts.jsonl`) and keep scope consistent with the outline.
- Writer pass: write that section using only those citation IDs; avoid dumping the whole notes set into context.
- Treat this stage as an iteration loop: draft per H3 → logic-polish (thesis + connectors) → weave transitions → merge → de-template/cohere → global review → (if gaps) back to C3/C4 → regenerate.
- Post-merge voice gate: `post-merge-voice-gate` treats `outline/transitions.md` as a high-frequency injection source. If it FAILs, fix the *source* (usually transitions via `transition-weaver`, or the owning `sections/*.md`) and re-merge; do not “patch around it” in `draft-polisher`.
- Depth target (profile-aware): each H3 should be “fewer but thicker” (avoid stubs). Use `queries.md:draft_profile` as the contract:
- `survey`: >=10 paragraphs + >=12 unique cites
- `deep`: >=11 paragraphs + >=14 unique cites
In all profiles, require >=2 concrete contrasts + evaluation anchoring + a cross-paper synthesis paragraph + an explicit limitation.
- Profile semantics: `survey` is the default deliverable contract; `deep` is stricter (and typically pairs well with `evidence_mode: fulltext`).
- Coherence target (paper-like): for every H2 chapter with H3 subsections, write a short **chapter lead** block (`sections/S<sec_id>_lead.md`) that previews the comparison axes and how the H3s connect (no new headings; avoid generic glue).
- Anti-template style contract (paper-like, not “outline narration”):
- Avoid meta openers like “This subsection surveys/argues …” and slide-like navigation (“Next, we move from … / We now turn to …”).
- Keep signposting light: avoid repeating a literal opener label across many subsections (e.g., `Key takeaway:`); vary opener phrasing and cadence.
- Tone target: calm, academic, understated; delete hype words (`clearly`, `obviously`) and “PPT speaker notes”.
- Keep evidence-policy disclaimers **once** in front matter (not repeated across H3s).
- If you cite numbers, include minimal evaluation context (task + metric + constraint/budget/cost) in the same paragraph.
- Citation shape must be reader-facing: no adjacent citation blocks (e.g., `[@a] [@b]`), no duplicate keys in one block (e.g., `[@a; @a]`), and avoid tail-only citation style by keeping mid-sentence citations in each H3.
- `section-merger` merges `sections/*.md` plus `outline/transitions.md` (within-chapter H3→H3 by default). Between-H2 transition insertion is optional: create `outline/transitions.insert_h2.ok` in the workspace if you want narrator-style handoffs included.
- Tables are part of the default deliverable: `outline/tables_appendix.md` is inserted into the draft by `section-merger` as a single Appendix block (index tables in `outline/tables_index.md` remain intermediate) unless `outline/tables.insert.off` exists. Other visuals (`outline/timeline.md`, `outline/figures.md`) remain intermediate by default.
- Citation scope policy: citations are subsection-first (from `outline/evidence_bindings.jsonl`), with limited reuse allowed within the same H2 chapter to reduce brittleness; avoid cross-chapter “free cite” drift.
- Controlled flexibility: bibkeys mapped to >= `queries.md:global_citation_min_subsections` subsections (A150++ default: 4) are treated as cross-cutting/global; see `allowed_bibkeys_global` in writer packs / `sections_manifest.jsonl`.
- If global unique citations are low, run `citation-diversifier` → `citation-injector` *before* `draft-polisher` (the polisher treats citation keys as immutable).
- `queries.md` can set `citation_target: recommended|hard` to control whether the recommended target is enforced as blocking (default: `recommended` for A150++).
- If you intentionally add/remove citations after an earlier polish run, reset the citation-anchoring baseline before rerunning `draft-polisher`:
- delete `output/citation_anchors.prepolish.jsonl` (workspace-local), then rerun `draft-polisher`.
- Recommended skills (toolkit, not a rigid one-shot chain):
- Modular drafting: `subsection-writer` → `writer-selfloop` → `section-logic-polisher` → `argument-selfloop` → `paragraph-curator` → `style-harmonizer` → `opener-variator` → `evaluation-anchor-checker` → `transition-weaver` → `section-merger` → `draft-polisher` → `global-reviewer` → `pipeline-auditor`.
- Legacy one-shot drafting: `prose-writer` (kept for quick experiments; less debuggable).
- If the draft reads like “paragraph islands”, run `section-logic-polisher` and patch only failing `sections/S*.md` until PASS, then merge.
- Add `pipeline-auditor` after `global-reviewer` as a regression test (blocks on ellipsis, repeated boilerplate, and citation hygiene).
- If you also need a PDF deliverable, use `latex-scaffold` + `latex-compile-qa` (see `arxiv-survey-latex`).
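The “controlled flexibility” rule in the notes above (bibkeys mapped to at least `global_citation_min_subsections` subsections count as cross-cutting) can be sketched as follows. The input shape is an assumption made for illustration: a plain mapping from H3 id to its mapped bibkeys, not the actual schema of `outline/evidence_bindings.jsonl`. The function name `split_global_bibkeys` is hypothetical.

```python
from collections import defaultdict


def split_global_bibkeys(bindings, min_subsections=4):
    """Split bibkeys into (global_keys, local_keys).

    bindings: dict mapping an H3 id to an iterable of bibkeys mapped to it
              (assumed shape, for illustration only).
    min_subsections: threshold at which a key is treated as cross-cutting
                     (4 is the A150++ default quoted in the notes).
    """
    seen_in = defaultdict(set)
    for sub_id, keys in bindings.items():
        for key in keys:
            seen_in[key].add(sub_id)
    global_keys = {k for k, subs in seen_in.items() if len(subs) >= min_subsections}
    local_keys = set(seen_in) - global_keys
    return global_keys, local_keys


if __name__ == "__main__":
    demo = {
        "3.1": ["smith2023", "lee2022"],
        "3.2": ["smith2023"],
        "4.1": ["smith2023", "chen2024"],
        "4.2": ["smith2023", "lee2022"],
    }
    print(split_global_bibkeys(demo))
```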
## Quality gates (strict mode)
- Citation coverage: expect a large, verifiable bibliography (A150++ default: `core_size=300` → `ref.bib` ~300) and high cite density:
- Per-H3: `survey` profile expects >=12 unique citations per H3 (and deeper profiles may require more).
- Front matter: `survey` profile expects Introduction>=35 and Related Work>=50 unique citations (dense positioning; no cite dumps).
- Global: `pipeline-auditor` gates on **global unique citations across the full draft**. A150++ defaults: hard `>=150`; recommended `>=165` (when bib=300). `queries.md:citation_target` controls which is blocking (default: `recommended`). If it fails, prefer `citation-diversifier` → `citation-injector` (in-scope, NO NEW FACTS) using each H3’s `allowed_bibkeys_selected` / `allowed_bibkeys_mapped` from `outline/writer_context_packs.jsonl`.
- Anti-template: drafts containing ellipsis placeholders (`…`) or leaked scaffold instructions (e.g., "enumerate 2-4 ...") should block and be regenerated from improved outline/mapping/evidence artifacts.
- Final polish hard gates (`survey`/`deep`): block on narration-template openers (e.g., `This subsection ...`), slide navigation phrasing, repeated opener stems/verbal tics, adjacent citation blocks (`[@a] [@b]`), duplicate keys in one block (`[@a; @a]`), and low H3 mid-sentence citation ratio (<30%).
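Two of the citation gates above are mechanical enough to sketch: the citation-shape checks (adjacent blocks, duplicate keys in one block) and the global unique-citation threshold. The sketch assumes Pandoc-style `[@key; @key2]` blocks as shown in the examples. The function names are hypothetical, and this is not the `pipeline-auditor` implementation. The 150/165 defaults are the A150++ values quoted above.

```python
import re

# One citation block such as [@a] or [@a; @b].
CITE_BLOCK = re.compile(r"\[(@[^\]]+)\]")
# Two blocks separated only by whitespace, e.g. "[@a] [@b]".
ADJACENT = re.compile(r"\[@[^\]]+\]\s*\[@[^\]]+\]")


def keys_in_block(block_body):
    """Split '@a; @b' into ['a', 'b']."""
    return [part.strip().lstrip("@") for part in block_body.split(";") if part.strip()]


def citation_shape_issues(text):
    """Return a list of human-readable shape problems found in the text."""
    issues = []
    if ADJACENT.search(text):
        issues.append("adjacent citation blocks")
    for match in CITE_BLOCK.finditer(text):
        keys = keys_in_block(match.group(1))
        if len(keys) != len(set(keys)):
            issues.append(f"duplicate key in [{match.group(1)}]")
    return issues


def unique_citation_keys(text):
    """Collect the set of distinct bibkeys cited anywhere in the text."""
    keys = set()
    for match in CITE_BLOCK.finditer(text):
        keys.update(keys_in_block(match.group(1)))
    return keys


def passes_global_gate(n_unique, citation_target="recommended",
                       hard_min=150, recommended_min=165):
    """Apply whichever threshold `citation_target` marks as blocking."""
    threshold = recommended_min if citation_target == "recommended" else hard_min
    return n_unique >= threshold
```

The mid-sentence citation ratio is left out on purpose, since the text does not define how a “mid-sentence” citation is counted.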
FILE:pipelines/graduate-paper-pipeline.md
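# Pipeline: Chinese graduate thesis restructuring and finalization (research-stage draft)
> Status: research-stage draft
> Current positioning: study the workflow and the skills it needs first; **do not bind UNITS yet, and do not force it into an executable contract**
> Intended for: writing a Chinese graduate thesis when a university template, existing papers, Overleaf sources, PDFs, BibTeX, experiment figures/tables, and intermediate working documents already exist
> Core goal: restructure the “existing material” into a Chinese graduate thesis with a unified main line, smooth structure, complete evidence, consistent terminology, and a compilable deliverable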
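## 1. What this pipeline really is
This is neither a linear “write a thesis from scratch” workflow nor an assembly workflow that “stitches a few papers together”. It is a thesis-engineering pipeline that is **driven by a question list, centered on a Markdown intermediate layer, and closed out in a TeX delivery layer**.
Its actual main loop is:
> `question list -> semantic localization -> Markdown restructuring -> TeX write-back -> compile review -> write issues back -> enter the next round`
The first goal of this pipeline is therefore not to “rush out a first draft of the body”, but to get the following right first:
1. Recover the existing material first, instead of editing blindly in `tex`.
2. Restructure chapter roles around the thesis main line first, instead of copying the original papers' narrative.
3. Work out structure, evidence, terminology, and figures/tables in the Markdown intermediate layer first, then write back to TeX.
4. Resolve structure and evidence problems first, then style and “AI flavor” removal.
5. The real convergence point is not “finished one pass”, but “the question list is emptied round by round, and compilation and review pass stably”.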
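## 2. How this pipeline fundamentally differs from the survey pipeline
It differs substantially from `arxiv-survey` / `arxiv-survey-latex` and must be modeled separately:
- the survey pipeline starts from “retrieve literature on a topic and converge the structure layer by layer”;
- the graduate-paper pipeline starts from “inventory and restructure existing material”;
- the survey pipeline's core intermediate artifacts are `outline/evidence_drafts/...`;
- the graduate-paper pipeline's core intermediate artifacts are the outline, question list, intermediate chapter drafts, figure plan, and review records under `codex_md/`;
- the survey pipeline's main goal is to “generate a survey”;
- the graduate-paper pipeline's main goal is to “restructure existing research work into a finalized thesis that meets Chinese degree-thesis conventions”.
In one sentence:
> survey is closer to a “retrieval-driven evidence-writing workflow”;
> graduate-paper is closer to an “existing-material-driven thesis-engineering restructuring workflow”.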
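## 3. Workspace layers and key artifacts
This pipeline needs a clear separation between the **thinking layer** and the **delivery layer**.
### 3.1 Directory responsibilities
- `main.tex`: top-level entry point of the thesis; assembles the cover, abstract, table of contents, body, appendices, acknowledgements, achievements, and references.
- `chapters/`: final-delivery body chapters as `.tex`.
- `abstract/`, `preface/`, `acknowledgement/`, `achievements/`: formally delivered non-body parts.
- `references/`: bibliography database and style files.
- `pdf/`: PDFs of published or submitted papers.
- `Overleaf_ref/`: existing sources, revised manuscripts, supplementary material.
- `codex_md/`: intermediate working layer; this is the **thesis restructuring workspace**.
- `claude_md/`: review checklists and final-draft verification records.
- `tmp_layout/`, `tmp_layout2/`: figure/table trial layouts and page-layout rehearsals.
The most important distinction is:
- `codex_md/` is the **thinking / restructuring area**
- `chapters/` is the **delivery / finalization area**
Do not mix the two.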
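### 3.2 Intermediate artifacts worth fixing in place
It is recommended to keep the following artifacts as long-lived fixtures of this pipeline:
- `codex_md/material_index.md`: inventory and index of existing material
- `codex_md/material_readiness.md`: material readiness and gap notes
- `codex_md/missing_info.md`: list of missing information
- `codex_md/question_list.md`: this round's question list, priorities, and acceptance criteria
- `codex_md/00_thesis_outline.md`: thesis main line, outline, and chapter roles
- `codex_md/chapter_role_map.md`: mapping table of source material -> chapter -> role
- `codex_md/chapter_rewrite_rules.md`: chapter restructuring guidelines
- `codex_md/terminology_glossary.md`: unified terminology table
- `codex_md/symbol_metric_table.md`: unified symbol / metric table
- `codex_md/figure_plan.md`: figure/table and layout plan
- `claude_md/review_checklist.md`: review checklist for compilation, typesetting, data, and template items
- `output/THESIS_BUILD_REPORT.md`: compilation and final-draft check report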
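Because this pipeline is still a research-stage draft with no executable contract, the following is only a sketch of how the fixed artifact list above could be checked in a workspace. The helper name `missing_artifacts` is hypothetical; the paths are copied from the list.

```python
from pathlib import Path

# Fixed intermediate artifacts, as listed in section 3.2.
FIXED_ARTIFACTS = [
    "codex_md/material_index.md",
    "codex_md/material_readiness.md",
    "codex_md/missing_info.md",
    "codex_md/question_list.md",
    "codex_md/00_thesis_outline.md",
    "codex_md/chapter_role_map.md",
    "codex_md/chapter_rewrite_rules.md",
    "codex_md/terminology_glossary.md",
    "codex_md/symbol_metric_table.md",
    "codex_md/figure_plan.md",
    "claude_md/review_checklist.md",
    "output/THESIS_BUILD_REPORT.md",
]


def missing_artifacts(workspace):
    """Return the fixed artifacts that do not yet exist under `workspace`."""
    root = Path(workspace)
    return [rel for rel in FIXED_ARTIFACTS if not (root / rel).is_file()]


if __name__ == "__main__":
    for rel in missing_artifacts("."):
        print(f"missing: {rel}")
```

Not every artifact is expected at every stage (for example, the build report only appears in Stage 6), so a missing file is a prompt for review rather than a failure.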
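## 4. The skills this pipeline actually needs
For now, setting reuse and generalization aside and decomposing by what this thesis pipeline itself needs, the first version should use the following skills.
### 4.1 Main-chain skills
1. `thesis-workspace-init`
Responsibility: initialize the thesis project, remind the user where to place material, set up the workspace, check whether `main.tex` compiles, and generate the material inventory.
2. `thesis-question-list`
Responsibility: create and maintain `question_list.md`, pinning down each round's revision goals, issue priorities, boundaries, and acceptance criteria.
This is the **control-plane skill** of the whole pipeline.
3. `thesis-source-role-mapper`
Responsibility: map existing papers / templates / sources / PDFs / figure and table material to “thesis roles”, rather than only doing `paper -> chapter`.
4. `thesis-chapter-reconstructor`
Responsibility: around the thesis main line, restructure each chapter's goal, weight, transitions, and narrative approach.
This is the **core restructuring skill** of the whole pipeline.
5. `thesis-markdown-aligner`
Responsibility: unify the main line, terminology, symbols, metrics, and figure/table conventions, so the chapters converge in the Markdown intermediate layer into one thesis rather than a set of papers.
6. `thesis-tex-writeback`
Responsibility: write content that has already been straightened out in the Markdown layer back into `chapters/*.tex`, keeping figures/tables, equations, cross-references, and chapter transitions in sync.
7. `thesis-compile-review`
Responsibility: run compilation, warning triage, template-mode checks, data-consistency verification, and issue write-back.
It forms the final quality loop.
8. `thesis-style-polisher`
Responsibility: once structure, evidence, and data are stable, polish toward Chinese degree-thesis style and remove “AI flavor”.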
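### 4.2 Parallel side-branch skills
9. `thesis-visual-layout-planner`
Responsibility: figure/table planning, Mermaid sketches, temporary page composition, and text-figure rhythm checks.
10. `thesis-frontmatter-sync`
Responsibility: sync the abstract, cover, appendices, achievements, acknowledgements, name lists, and other non-body parts, so they are not left to be patched in at the very end.
11. `thesis-citation-enhance-review`
Responsibility: locate sentences that must be backed by citations, expand candidate references, verify that citations match the claims, and write back to `references/*.bib` and the in-text citations.
### 4.3 How these 11 skills relate
The main chain is:
`thesis-workspace-init -> thesis-question-list -> thesis-source-role-mapper -> thesis-chapter-reconstructor -> thesis-markdown-aligner -> thesis-tex-writeback -> thesis-compile-review -> thesis-style-polisher`
The parallel side branches are:
- `thesis-visual-layout-planner`
- `thesis-frontmatter-sync`
- `thesis-citation-enhance-review`
Where:
- `thesis-question-list` is the control plane for the whole flow
- `thesis-chapter-reconstructor` is the core restructuring plane
- `thesis-compile-review` is the quality-loop plane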
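## 5. Pipeline overview (by stage)
| Stage | Goal | Core skills | Main outputs |
|---|---|---|---|
| Stage 0 | Project initialization and material intake | `thesis-workspace-init` | Directory skeleton, initial compile, material index, material-readiness notes |
| Stage 1 | Recover existing material into the Markdown intermediate layer | `thesis-workspace-init` | Initial Markdown material, missing-information list, material-gap notes |
| Stage 1.5 | Lock this round's questions and boundaries | `thesis-question-list` | `question_list.md` |
| Stage 2 | Map source material to thesis roles | `thesis-source-role-mapper` | Chapter role map, material grouped by chapter |
| Stage 2.5 | Restructure chapters around the main line | `thesis-chapter-reconstructor` | Restructured chapter Markdown |
| Stage 3 | Align, consolidate, and unify the whole-thesis structure | `thesis-markdown-aligner` | Stable outline, terminology table, symbol table, evidence-gap list |
| Stage 3.5 | Figure/table and layout rehearsal | `thesis-visual-layout-planner` | Figure plan, text-figure mapping, temporary layout results |
| Stage 4 | Write back to the TeX delivery layer | `thesis-tex-writeback` | Compilable chapter `.tex`, first complete `main.pdf` |
| Stage 4.5 | Sync non-body parts | `thesis-frontmatter-sync` | Synced versions of the abstract, appendices, cover, achievements, etc. |
| Stage 5 | Citation enhancement and verification | `thesis-citation-enhance-review` | Citation reinforcement results, verification records |
| Stage 6 | Compilation, typesetting, and final-draft review | `thesis-compile-review` | `THESIS_BUILD_REPORT`, review checklist |
| Stage 7 | Final polish and “AI flavor” removal | `thesis-style-polisher` | A more natural Chinese final draft |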
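## 6. Stage-by-stage details
### Stage 0: Project initialization and material intake
**Goal**
First turn the thesis into a maintainable project rather than a pile of scattered files.
**Main inputs**
- University template
- Existing repository / sources
- Basic metadata such as student ID, year, and Chinese and English titles
- Existing `main.tex`
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Basic directory skeleton
- Initial `main.tex` / initial `main.pdf`
- `codex_md/material_index.md`
- `codex_md/material_readiness.md`
**Handoffs**
- If `main.tex` does not compile yet, do not enter the body-restructuring stages
- If the material has not been put in place yet, do not start the question list or chapter restructuring
**Execution focus**
- Make clear which areas are material, which are workspace, and which are delivery
- First make sure it “compiles”, then talk about “writing well”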
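### Stage 1: Recover existing material into the Markdown intermediate layer
**Goal**
Recover the reusable content from the existing `template / tex / Overleaf / PDF / bib` into a Markdown intermediate layer, instead of hard-editing directly in the TeX layer.
**Main inputs**
- University template
- Existing `.tex`
- Overleaf sources
- PDFs
- Bibliography database
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Initial Markdown chapter material
- `codex_md/material_readiness.md`
- `codex_md/missing_info.md`
- Material index and list of information still to be supplied
**Handoffs**
- Stage 1 outputs feed directly into `thesis-question-list` and `thesis-source-role-mapper`
**Execution focus**
- The focus is a “material asset inventory”, not “writing pretty sentences first”
- Explicitly mark experiment details, figure captions, citations, and numeric checks that are still pending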
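### Stage 1.5: Lock the question list and this round's goals
**Goal**
Set up this round's control plane and make clear what this round will fix and what it will not.
**Main inputs**
- Initial Markdown material
- Review feedback from the previous round
- Current compilation / structure / style issues
**Core skill**
- `thesis-question-list`
**Main outputs**
- `codex_md/question_list.md`
**Handoffs**
- This step is the starting point for all later restructuring and write-back
- Any structural dispute should come back here first to be re-prioritized
**Execution focus**
- Every issue must state clearly: what it is, why it matters, how to change it, and what level of acceptance is required
- The question list is not a memo; it is the entry point for iteration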
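### Stage 2: Map source material to thesis roles
**Goal**
Map the existing material to chapter roles in the thesis, rather than doing only a mechanical “paper to chapter” assignment.
**Main inputs**
- Existing papers, PDFs, sources, figures and tables
- Initial Markdown material
- `question_list.md`
**Core skill**
- `thesis-source-role-mapper`
**Main outputs**
- `codex_md/chapter_role_map.md`
- Markdown material grouped by chapter
- Content boundaries for each chapter
**Handoffs**
- This is the direct input to `thesis-chapter-reconstructor`
**Execution focus**
- Distinguish: method core / background support / validation evidence / system implementation / material for limitations and future work
- If this is set wrong, the whole thesis will skew afterwards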
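### Stage 2.5: Restructure chapters around the main line
**Goal**
Transform the original paper-style narrative into a thesis-style narrative.
**Main inputs**
- Chapter role map
- `question_list.md`
- Markdown drafts of each chapter
**Core skill**
- `thesis-chapter-reconstructor`
**Main outputs**
- Restructured chapter Markdown
- `codex_md/chapter_rewrite_rules.md`
**Handoffs**
- This is the core stage of the whole pipeline
- All later alignment, write-back, and compilation build on the restructuring results produced here
**Execution focus**
- Make clear which question each chapter has to answer
- Decide which content to strengthen, weaken, move up, or push down
- Rewrite the transitions to the preceding and following chapters
- Turn the “selling-point narrative” into a “thesis main-line narrative”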
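### Stage 3: Markdown alignment, consolidation, and unification
**Goal**
Make the whole thesis become “one thesis” in the intermediate layer first, rather than several papers spliced together.
**Main inputs**
- Restructured chapter Markdown
- `question_list.md`
**Core skill**
- `thesis-markdown-aligner`
**Main outputs**
- Stable `codex_md/00_thesis_outline.md`
- `codex_md/terminology_glossary.md`
- `codex_md/symbol_metric_table.md`
- Evidence-gap list
**Handoffs**
- While structure, terminology, or metrics are unstable, do not enter `tex-writeback`
**Execution focus**
- Unify the research main line, terminology, abbreviations, symbols, metrics, and figure/table conventions
- Decide which content is explained once in Chapter 2, which is developed in Chapters 3-5, and which goes into the appendices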
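### Stage 3.5: Figure/table and layout rehearsal
**Goal**
Plan figures, tables, and layout before the body is finalized; do not let figures and tables turn into last-minute patchwork.
**Main inputs**
- Stable outline
- Chapter Markdown
- Existing figures/tables and experiment results
**Core skill**
- `thesis-visual-layout-planner`
**Main outputs**
- `codex_md/figure_plan.md`
- Diagram drafts under `mermaid/`
- Trial layout results under `tmp_layout/` and `tmp_layout2/`
**Handoffs**
- Figure/table planning results directly affect the text-figure structure of `tex-writeback`
**Execution focus**
- Make clear which figures and tables each chapter needs to support the main line
- Sketch first, then do temporary page composition, then decide final placement and captions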
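### Stage 4: Write back to the TeX delivery layer
**Goal**
Write content that has already been straightened out in the Markdown layer back into `chapters/*.tex` and move into the delivery layer.
**Main inputs**
- Stable Markdown chapters
- Figure plan
- Unified terminology / symbol / metric tables
**Core skill**
- `thesis-tex-writeback`
**Main outputs**
- `chapters/*.tex`
- First complete `main.pdf`
**Handoffs**
- If a structural problem is found at this stage, as a rule go back and fix it in the Markdown layer first; do not reinvent structure in the TeX layer
**Execution focus**
- Keep figures/tables, equations, cross-references, chapter-opening introductions, and chapter-end summaries in sync
- The TeX layer is responsible for delivery, not for rethinking structure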
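### Stage 4.5: Sync non-body parts
**Goal**
Maintain the abstract, appendices, cover, achievements, acknowledgements, and other non-body parts in parallel, so they are not pushed to the end and turned into low-quality fill-ins.
**Main inputs**
- Stable version of the body
- University template items
- Chinese and English titles and abstract information
**Core skill**
- `thesis-frontmatter-sync`
**Main outputs**
- `abstract/abstract.tex`
- `abstract/abstract-en.tex`
- `preface/...`
- `appendix/...`
- `acknowledgement/...`
- `achievements/...`
**Handoffs**
- The final-draft review in Stage 6 checks these parts together with the body
**Execution focus**
- Chinese and English titles and abstracts use consistent wording and scope
- Numbers, metrics, and method names in the abstract match the body
- Appendices hold only supplementary content and do not repeat the main line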
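### Stage 5: Literature enhancement and citation verification
**Goal**
Systematically fill in the “sentences that should be backed by citations”, and make sure citations match the claims.
**Main inputs**
- Current Markdown / TeX version
- `references/*.bib`
- Existing PDFs / reference works
**Core skill**
- `thesis-citation-enhance-review`
**Main outputs**
- Enhanced bibliography database
- Citation reinforcement records
- Citation verification records
**Handoffs**
- Literature enhancement can run in parallel with Stages 3 and 4, but it must be closed out before the final polish
**Execution focus**
- Prioritize sentences that must have citation support: factual descriptions, method taxonomies, classic conclusions, metric definitions, and data sources
- Get citations right first, then add more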
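### Stage 6: Compilation, typesetting, and final-draft review
**Goal**
Close the quality loop: not merely “compiles”, but “fit to submit”.
**Main inputs**
- Complete TeX project
- Synced non-body parts
- review checklist
**Core skill**
- `thesis-compile-review`
**Main outputs**
- `output/THESIS_BUILD_REPORT.md`
- `claude_md/review_checklist.md`
- List of closed / still-open issues
**Handoffs**
- Issues found in Stage 6 determine which layer to return to for the fix: Stage 1.5 / 2.5 / 3 / 4
**Execution focus**
- A successful compile is only the minimum requirement
- Warnings need tiered handling: blocks submission / must fix / record and observe
- Also check: data consistency, cite-key validity, whether template parameters are correct, and whether terminology and abbreviations are unified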
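The three warning tiers named above can be sketched as a small classifier over LaTeX log lines. The tier names come from the text, but which message falls into which tier is an assumption made purely for illustration. The source does not define the mapping, and `triage` is a hypothetical helper rather than part of `thesis-compile-review`.

```python
from enum import Enum


class Severity(Enum):
    BLOCKS_SUBMISSION = "blocks submission"
    MUST_FIX = "must fix"
    OBSERVE = "record and observe"


def triage(log_line):
    """Assign a tier to one warning line from a LaTeX log.

    Assumed mapping (not defined by the pipeline document):
    - undefined citations/references block submission,
    - overfull boxes must be fixed,
    - everything else is recorded and observed.
    """
    line = log_line.strip()
    if line.startswith("LaTeX Warning:") and "undefined" in line:
        return Severity.BLOCKS_SUBMISSION
    if line.startswith("Overfull \\hbox"):
        return Severity.MUST_FIX
    return Severity.OBSERVE


def summarize(warning_lines):
    """Count warnings per tier so the build report can list them by severity."""
    counts = {severity: 0 for severity in Severity}
    for line in warning_lines:
        counts[triage(line)] += 1
    return counts
```

A real mapping would need to be agreed in `claude_md/review_checklist.md` before any of this is relied on.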
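### Stage 7: Final polish and “AI flavor” removal
**Goal**
Only after structure, evidence, and data are stable, polish toward Chinese degree-thesis style.
**Main inputs**
- A thesis version that has passed compilation and review
- `中文写作要求.md`
- `GPT口癖与高频用词调研.md`
**Core skill**
- `thesis-style-polisher`
**Main outputs**
- A more natural Chinese final draft
- De-templated introductions, summaries, and conclusion-and-outlook sections
**Handoffs**
- Stage 7 must come last
- If polishing exposes structural or evidence problems, fall back to an earlier stage instead of polishing over them
**Execution focus**
- Prioritize chapter-opening introductions, chapter-end summaries, contribution statements, and the conclusion and outlook
- Tone down template-speak, promotional tone, and common AI verbal tics
- The goal is “reads more like a degree thesis”, not “reads more like an AI-optimized paper”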
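## 7. The four fixed loops of this pipeline
To keep it from degenerating into a “linear SOP” again, it is recommended to fix the following four loops in place:
### 7.1 Structure loop
`outline -> a chapter's md -> chapter imbalance found -> back to outline / question_list`
### 7.2 Content loop
`paper / Overleaf / PDF -> extract content -> restructure narrative -> still reads like the original paper -> restructure again`
### 7.3 Typesetting loop
`tex -> compile -> warning / figure placement / cross-reference issues found -> back to tex or figure planning`
### 7.4 Style loop
`body stable -> AI polish -> template-speak / AI flavor found -> back to writing guidelines and wording control`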
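## 8. Execution priority of this pipeline
If resources are limited in an actual run, the priority should be:
1. `thesis-question-list`
2. `thesis-source-role-mapper`
3. `thesis-chapter-reconstructor`
4. `thesis-markdown-aligner`
5. `thesis-tex-writeback`
6. `thesis-compile-review`
In other words:
- What truly cannot be skipped is “question list + material role mapping + chapter restructuring + compile review”
- What truly should not be done earliest is “polishing and AI-flavor removal”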
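## 9. Current conclusion
The first principle of this graduate-paper pipeline should be:
> **Restructure first, then write; converge in the intermediate layer first, then deliver in TeX; get evidence and structure right first, then style and polish.**
Therefore, if this pipeline is later made formally executable, the thing most worth implementing first is not “one big do-everything runner”, but getting the following skills solid:
- `thesis-workspace-init`
- `thesis-question-list`
- `thesis-source-role-mapper`
- `thesis-chapter-reconstructor`
- `thesis-compile-review`
Once these five skills are solid, this Chinese graduate-thesis pipeline will have real practical value.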
FILE:pipelines/idea-brainstorm.pipeline.md
---
name: idea-brainstorm
version: 4.0
profile: idea-brainstorm
routing_hints: [idea, ideation, brainstorm, 点子, 选题, 找方向, 找 idea]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.idea-brainstorm.csv
contract_model: pipeline.frontmatter/v1
query_defaults:
draft_profile: idea_brainstorm
max_results: 1800
core_size: 100
evidence_mode: abstract
direction_pool_min: 12
direction_pool_max: 24
idea_screen_top_n: 10
idea_shortlist_size: 5
report_top_n: 3
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- evidence_mode
- direction_pool_min
- direction_pool_max
- idea_screen_top_n
- idea_shortlist_size
- report_top_n
- time_window.from
- time_window.to
quality_contract:
memo_bundle:
required_outputs: [output/REPORT.md, output/APPENDIX.md, output/REPORT.json]
signal_policy:
min_rows: 10
direction_policy:
shortlist_min: 3
shortlist_max: 5
keep_min: 3
cluster_diversity_min: 2
lead_diversity_target: 3
lead_diversity_axes: [cluster, direction_type, program_kind]
screening_policy:
keep_rank_max: 7
maybe_rank_max: 12
score_weights:
discussion_worthiness: 0.24
academic_value: 0.22
evidence_grounding: 0.18
direction_distinctness: 0.16
first_probe_clarity: 0.10
thesis_potential: 0.10
loop_policy:
stage_retry_budget:
C1: 2
C3: 1
C4: 1
max_reroutes: 4
require_human_on_retry_after_approval: true
stages:
C0:
title: Init + idea brief
mode: no_prose
required_skills: [workspace-init, pipeline-router, idea-brief, human-checkpoint]
optional_skills: []
produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/trace/IDEA_BRIEF.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
human_checkpoint:
approve: brainstorm brief
write_to: DECISIONS.md
C1:
title: Retrieval + core set
mode: no_prose
required_skills: [literature-engineer, dedupe-rank]
optional_skills: []
produces: [papers/papers_raw.jsonl, papers/papers_dedup.jsonl, papers/core_set.csv, papers/retrieval_report.md]
C2:
title: Idea landscape / focus
mode: no_prose
required_skills: [taxonomy-builder, pipeline-router, human-checkpoint]
optional_skills: []
produces: [outline/taxonomy.yml, DECISIONS.md]
human_checkpoint:
approve: focus clusters / lenses + exclusions
write_to: DECISIONS.md
C3:
title: Evidence signals
mode: no_prose
required_skills: [paper-notes, idea-signal-mapper]
optional_skills: []
produces: [papers/paper_notes.jsonl, papers/evidence_bank.jsonl, output/trace/IDEA_SIGNAL_TABLE.md, output/trace/IDEA_SIGNAL_TABLE.jsonl]
C4:
title: Direction pool + screening
mode: short_prose_ok
required_skills: [idea-direction-generator, idea-screener]
optional_skills: []
produces: [output/trace/IDEA_DIRECTION_POOL.md, output/trace/IDEA_DIRECTION_POOL.jsonl, output/trace/IDEA_SCREENING_TABLE.md, output/trace/IDEA_SCREENING_TABLE.jsonl]
C5:
title: Shortlist + memo synthesis + self-loop
mode: prose_allowed
required_skills: [idea-shortlist-curator, idea-memo-writer, deliverable-selfloop, artifact-contract-auditor]
optional_skills: []
produces: [output/trace/IDEA_SHORTLIST.md, output/trace/IDEA_SHORTLIST.jsonl, output/REPORT.md, output/APPENDIX.md, output/REPORT.json, output/DELIVERABLE_SELFLOOP_TODO.md, output/CONTRACT_REPORT.md]
---
# Pipeline: research idea brainstorm (signals -> directions -> memo)
Goal: produce a **discussion-ready research-idea memo** for PI / PhD readers.
This pipeline is for requests such as “找 research idea / brainstorm / 选题 / 找方向” (find a research idea, brainstorm, pick a topic, find a direction).
It is **not** for writing a survey draft and **not** for generating execution-grade project specs.
The terminal deliverable is a single default entrypoint:
- `output/REPORT.md`
That memo should feel like a mature brainstorm artifact:
- grounded in literature,
- small enough to discuss,
- clear about what is promising,
- honest about uncertainty,
- and rich enough to guide the next discussion round.
Artifact policy:
- Reader-facing terminal artifacts:
- `output/REPORT.md`
- `output/APPENDIX.md`
- `output/REPORT.json`
- Trace artifacts stay under `output/trace/`.
- The run is only shareable when both layers exist and pass contract audit.
Default profile:
- Retrieval route: `literature-engineer`
- `core_size=100`
- Evidence mode: `abstract`
- Signal table: 10-20 rows
- Direction pool: 12-24
- Shortlist: 3-5
- Final memo lead directions: 3
## Stage 0 - Init + idea brief (C0)
required_skills:
- workspace-init
- pipeline-router
- idea-brief
- human-checkpoint
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
Notes:
- The brief is the single source of truth for topic, audience, constraints, exclusions, targets, and query buckets.
- Default behavior: block once at C0 for human approval of the brief before retrieval.
## Stage 1 - Retrieval + core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
Notes:
- Use multi-query buckets from `queries.md`.
- If recall is too low/noisy, fix it upstream and rerun C1.
## Stage 2 - Idea landscape / focus (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- pipeline-router
- human-checkpoint
produces:
- outline/taxonomy.yml
- DECISIONS.md
human_checkpoint:
- approve: focus clusters / lenses + exclusions
- write_to: DECISIONS.md
Notes:
- Taxonomy is an idea landscape, not a paper outline.
- Default behavior: pause at C2 so the human can choose a few promising focus lenses.
## Stage 3 - Evidence signals (C3) [table-first]
required_skills:
- paper-notes
- idea-signal-mapper
produces:
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
Notes:
- The signal table should capture tensions, missing pieces, and academically meaningful axes rather than proposal-ready wedges.
- Keep this stage compact and table-first; no long prose.
## Stage 4 - Direction pool + screening (C4)
required_skills:
- idea-direction-generator
- idea-screener
produces:
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
Notes:
- Expansion should generate discussion-worthy research directions, not an operator cartesian product.
- Screening should score discussion value, distinctness, evidence grounding, and thesis potential.
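The screening step can be illustrated with the `score_weights` and `screening_policy` values from this pipeline's front matter. The sketch assumes that each direction carries one numeric rating per weighted criterion, and that `keep_rank_max` / `maybe_rank_max` are inclusive rank cut-offs. Both are interpretations, and the `drop` label is a hypothetical name for everything below the `maybe` band.

```python
# Weights copied from quality_contract.screening_policy.score_weights.
SCORE_WEIGHTS = {
    "discussion_worthiness": 0.24,
    "academic_value": 0.22,
    "evidence_grounding": 0.18,
    "direction_distinctness": 0.16,
    "first_probe_clarity": 0.10,
    "thesis_potential": 0.10,
}


def screening_score(ratings):
    """Weighted sum of per-criterion ratings (any consistent numeric scale)."""
    missing = set(SCORE_WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    return sum(weight * ratings[name] for name, weight in SCORE_WEIGHTS.items())


def bucket(rank, keep_rank_max=7, maybe_rank_max=12):
    """Map a 1-based rank to a screening bucket (assumed inclusive cut-offs)."""
    if rank <= keep_rank_max:
        return "keep"
    if rank <= maybe_rank_max:
        return "maybe"
    return "drop"


def screen(directions):
    """directions: dict of direction id -> ratings dict.
    Returns (direction id, score, bucket) tuples, best first."""
    scored = sorted(
        ((dir_id, screening_score(r)) for dir_id, r in directions.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [(dir_id, score, bucket(i)) for i, (dir_id, score) in enumerate(scored, start=1)]
```

The weights sum to 1.0, so a direction rated uniformly at some value scores that value.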
## Stage 5 - Shortlist + memo synthesis + self-loop (C5)
required_skills:
- idea-shortlist-curator
- idea-memo-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- `output/trace/IDEA_SHORTLIST.md` is an internal convergence layer, not the user-facing final answer.
- `output/REPORT.md` is the terminal deliverable and should read like a discussion-ready research idea brainstorm memo.
- `deliverable-selfloop` should evaluate the full memo bundle plus its trace chain.
FILE:pipelines/lit-snapshot.pipeline.md
---
name: lit-snapshot
version: 2.3
profile: lit-snapshot
routing_hints: [snapshot, 快照, one-page, one page, 48h]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- outline/taxonomy.yml
- outline/outline.yml
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.lit-snapshot.csv
---
# Pipeline: literature snapshot (24-48h) [bullets-first]
Goal: a compact, reader-facing snapshot (`output/SNAPSHOT.md`) with a small but usable paper set and a paper-like structure. This is intentionally lighter than `arxiv-survey*` (no evidence packs, no BibTeX, no LaTeX).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- arxiv-search
- dedupe-rank
optional_skills:
- keyword-expansion
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Snapshot default: aim for a smaller but diverse set (e.g., core_set ~20-40) rather than a survey-scale pool.
- Practical retrieval loop (multi-query, then refine):
- Start broad with multiple query buckets (synonyms/acronyms/subtopics) and merge the results.
- If too few: add buckets + widen time window; if too noisy: rewrite keywords + add exclusions; rerun C1.
- If the snapshot feels generic, the fix is almost always upstream: broaden `queries.md` (more synonyms, fewer excludes, wider time window) and rerun C1.
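The retrieval loop in the notes above can be written down as a tiny decision helper. This is a sketch under stated assumptions: the 20-40 core-set range is the example from the notes, and noisiness is passed in as a human judgement because the notes do not define a metric for it. Checking noise before size is also a choice made here, and `next_retrieval_action` is a hypothetical name.

```python
def next_retrieval_action(core_size, too_noisy, target_range=(20, 40)):
    """Suggest the next C1 move from the size and noisiness of the core set.

    core_size: number of rows in papers/core_set.csv.
    too_noisy: human judgement that the set contains too many off-topic papers.
    target_range: (low, high) snapshot core-set size; only `low` is used here.
    """
    low, _high = target_range
    if too_noisy:
        return "rewrite keywords + add exclusions, then rerun C1"
    if core_size < low:
        return "add query buckets + widen time window, then rerun C1"
    return "proceed to C2"


if __name__ == "__main__":
    print(next_retrieval_action(core_size=12, too_noisy=False))
```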
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- outline-builder
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/outline.yml
human_checkpoint:
- approve: scope + outline (snapshot ToC)
- write_to: DECISIONS.md
Notes:
- Paper-like default: prefer fewer, thicker sections. For snapshots, avoid H3 explosion; treat H2 as the main organizing unit.
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent nodes and keep the ToC readable before writing.
## Stage 3 - Snapshot (C3) [SHORT PROSE OK]
required_skills:
- snapshot-writer
- deliverable-selfloop
- artifact-contract-auditor
optional_skills:
- prose-writer
produces:
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- The deliverable is bullets-first and pointer-heavy: every non-trivial claim should attach 1-2 concrete paper pointers from `papers/core_set.csv`.
- Avoid outline narration (e.g., `This section surveys ...`); write content claims + why-it-matters + pointers.
FILE:pipelines/peer-review.pipeline.md
---
name: peer-review
version: 2.3
profile: peer-review
routing_hints: [peer review, review report, referee, 审稿]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PAPER.md
- output/CLAIMS.md
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.peer-review.csv
---
# Pipeline: peer review / referee report
Goal: produce an actionable referee report (`output/REVIEW.md`) that is traceable to the submitted paper’s claims and evidence (no free-floating advice, no invented comparisons).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
## Stage 1 - Claims (C1)
required_skills:
- manuscript-ingest
- claims-extractor
produces:
- output/PAPER.md
- output/CLAIMS.md
Notes:
- Input expectation: you must have the manuscript text available as `output/PAPER.md` (or equivalent extracted text) before running this stage.
- Contract: every claim must include a source pointer (section/page/quote) so later critique is auditable.
## Stage 2 - Evidence audit (C2)
required_skills:
- evidence-auditor
- novelty-matrix
produces:
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
Notes:
- Evidence-auditor should write gaps/risks/verification steps, not “fix the paper”.
- Novelty-matrix should be conservative: compare against cited/known related work; if you lack sources, record that limitation explicitly.
## Stage 3 - Rubric write-up (C3)
required_skills:
- rubric-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Prefer concrete, minimal fixes (what experiment/ablation/analysis would resolve the concern) over generic “needs more experiments”.
FILE:pipelines/systematic-review.pipeline.md
---
name: systematic-review
version: 2.3
profile: systematic-review
routing_hints: [systematic review, systematic, prisma, 系统综述]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PROTOCOL.md
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/screening_log.csv
- papers/extraction_table.csv
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.systematic-review.csv
---
# Pipeline: systematic review (PRISMA-style)
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Protocol (C1)
required_skills:
- protocol-writer
produces:
- output/PROTOCOL.md
human_checkpoint:
- approve: protocol locked (query, inclusion/exclusion, databases, time window)
- write_to: DECISIONS.md
## Stage 2 - Retrieval & candidate pool (C2)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- arxiv-search
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Systematic-review contract: keep the candidate pool auditable. Prefer dedupe (good) over aggressive ranking/filtering (dangerous).
- `dedupe-rank` default: for this pipeline, it keeps the full deduped pool in `papers/core_set.csv` unless you explicitly set `queries.md:core_size` (screening should not silently drop papers).
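
The "dedupe, do not filter" contract above can be sketched as follows. This is a minimal illustration, not the `dedupe-rank` implementation: it assumes each record carries a `title` field, and the helper name `dedupe_keep_all` is hypothetical. The normalization mirrors `normalize_title_for_dedupe` in `tooling/common.py`. Dropped records are returned rather than discarded so the pool stays auditable.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase and collapse non-alphanumerics (same idea as normalize_title_for_dedupe)."""
    title = re.sub(r"[^a-z0-9]+", " ", title.lower())
    return re.sub(r"\s+", " ", title).strip()


def dedupe_keep_all(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (kept, dropped). Only exact normalized-title duplicates are dropped."""
    seen: set[str] = set()
    kept: list[dict] = []
    dropped: list[dict] = []
    for rec in records:
        key = normalize_title(str(rec.get("title") or ""))
        if key and key in seen:
            dropped.append(rec)  # keep a trace of what was removed and why
            continue
        if key:
            seen.add(key)
        kept.append(rec)  # records without a title are never silently dropped
    return kept, dropped


# Hypothetical sample pool; the first two titles normalize to the same key.
pool = [
    {"title": "A Survey of LLM Agents"},
    {"title": "a survey of  LLM-agents"},
    {"title": "Tool Use in Language Models"},
]
kept, dropped = dedupe_keep_all(pool)
print(len(kept), len(dropped))
```

No ranking or size cap is applied here, which matches the note that screening, not retrieval, should decide what leaves the pool.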
## Stage 3 - Screening (C3)
required_skills:
- screening-manager
produces:
- papers/screening_log.csv
Notes:
- Use `papers/papers_dedup.jsonl` (or `papers/core_set.csv`) as the candidate list; `papers/screening_log.csv` must include protocol-grounded reasons for every decision.
- Practical auditability: number protocol clauses (`I1..` / `E1..`) and require screening rows to cite them (e.g., `reason_codes=E3`).
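
A minimal sketch of the reason-code check described above. Only the `reason_codes` column and the `I1..` / `E1..` numbering come from the notes; the other column names, the `;` separator for multiple codes, and the helper name `invalid_screening_rows` are assumptions made for illustration.

```python
import re

# Clause ids look like I1, I2, ... (inclusion) or E1, E2, ... (exclusion).
CODE_RE = re.compile(r"^[IE]\d+$")


def invalid_screening_rows(rows: list[dict], protocol_codes: set[str]) -> list[int]:
    """Return indices of rows whose reason codes are missing, malformed, or not in the protocol."""
    bad: list[int] = []
    for idx, row in enumerate(rows):
        raw = str(row.get("reason_codes") or "")
        codes = [c.strip() for c in raw.split(";") if c.strip()]  # ';' separator is an assumption
        if not codes or any(not CODE_RE.match(c) or c not in protocol_codes for c in codes):
            bad.append(idx)
    return bad


# Hypothetical rows and protocol clause ids.
rows = [
    {"paper_id": "P1", "decision": "exclude", "reason_codes": "E3"},
    {"paper_id": "P2", "decision": "include", "reason_codes": "I1;I2"},
    {"paper_id": "P3", "decision": "exclude", "reason_codes": ""},
    {"paper_id": "P4", "decision": "exclude", "reason_codes": "E9"},
]
protocol = {"I1", "I2", "E1", "E2", "E3"}
bad = invalid_screening_rows(rows, protocol)
print(bad)
```

Rows P3 (no code) and P4 (code not defined in the protocol) are flagged; a check like this could run before the screening log is accepted.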
## Stage 4 - Extraction (C4)
required_skills:
- extraction-form
- bias-assessor
produces:
- papers/extraction_table.csv
Notes:
- Extraction is schema-driven: do not write narrative synthesis here; keep all fields consistent and fill bias fields (low/unclear/high + notes).
## Stage 5 - Synthesis (C5) [PROSE ALLOWED]
required_skills:
- synthesis-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
FILE:pipelines/tutorial.pipeline.md
---
name: tutorial
version: 2.3
profile: tutorial
routing_hints: [tutorial, 教程]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/TUTORIAL_SPEC.md
- outline/concept_graph.yml
- outline/module_plan.yml
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.tutorial.csv
---
# Pipeline: tutorial (teaching loop)
Goal: a tutorial deliverable (`output/TUTORIAL.md`) that has a consistent running example and a real teaching loop (objectives -> steps -> exercises -> verification), not just a blog-style explanation.
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Spec (C1)
required_skills:
- tutorial-spec
produces:
- output/TUTORIAL_SPEC.md
Notes:
- The spec is the scope contract: audience, prerequisites, learning objectives, and a running example (if applicable).
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- concept-graph
- module-planner
- exercise-builder
produces:
- outline/concept_graph.yml
- outline/module_plan.yml
human_checkpoint:
- approve: target audience + scope + running example
- write_to: DECISIONS.md
Notes:
- Treat `outline/module_plan.yml` as the execution contract for writing: modules must have concrete outputs and at least one verifiable exercise each.
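
One way to enforce the execution contract above is a small pre-writing check. The sketch below is illustrative only: the module schema (`id`, `outputs`, `exercises`) is assumed, since the real layout of `outline/module_plan.yml` is not shown here, and `modules_missing_contract` is a hypothetical helper.

```python
def modules_missing_contract(modules: list[dict]) -> dict[str, list[str]]:
    """Map module id -> list of missing contract fields (concrete outputs, exercises)."""
    problems: dict[str, list[str]] = {}
    for module in modules:
        missing: list[str] = []
        if not module.get("outputs"):
            missing.append("outputs")
        if not module.get("exercises"):
            missing.append("exercises")
        if missing:
            problems[str(module.get("id") or "?")] = missing
    return problems


# Hypothetical plan, as it might look after loading the YAML file.
plan = [
    {"id": "M1", "outputs": ["notebook.ipynb"], "exercises": ["run the baseline"]},
    {"id": "M2", "outputs": [], "exercises": ["extend the parser"]},
    {"id": "M3", "outputs": ["report.md"]},
]
problems = modules_missing_contract(plan)
print(problems)
```

An empty result would mean every module has at least one concrete output and one exercise, which is the condition the note asks for before writing starts.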
## Stage 3 - Writing (C3) [PROSE ALLOWED]
required_skills:
- tutorial-module-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Write only within the approved scope; if you discover missing prerequisites, record them as a follow-up instead of expanding scope silently.
FILE:references/table_cell_hygiene.md
# Table Cell Hygiene
## Purpose
Appendix tables are reader-facing.
They should not expose raw evidence snippets that still sound like paper self-narration.
## Drop or rewrite
- `we propose ...`
- `our extensive evaluation ...`
- `evaluations across ... validate ...`
- `through extensive benchmarking ...`
- `specifically, we introduce ...`
- `building on this foundation, we present ...`
- writer-instruction cells such as `Evaluation mentions include ...` or `When comparing results, anchor ...`
- survey-meta or scope-setting lines that belong in prose, not in a table cell
## Keep
- short result clauses
- benchmark / metric anchors
- compact limitation or risk statements
- row-comparable evidence phrases
## Boundary
The hygiene policy belongs in `assets/table_cell_hygiene.json`.
The script should apply it deterministically and skip unusable cells rather than preserving reader-unfriendly wrappers.
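
A minimal sketch of the boundary described above, assuming the policy file is a JSON object. The `drop_if_starts_with` key is the one read by `scripts/run.py`; the policy contents, the sample cells, and the helper names `load_policy` and `keep_cell` are hypothetical.

```python
import json


def load_policy(raw: str) -> dict:
    """Parse the policy text; fall back to an empty policy on malformed input."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    return data if isinstance(data, dict) else {}


def keep_cell(text: str, policy: dict) -> bool:
    """Return False for empty cells and cells that start with a dropped prefix."""
    prefixes = [
        str(p).strip().lower()
        for p in (policy.get("drop_if_starts_with") or [])
        if str(p).strip()
    ]
    low = text.strip().lower()
    return bool(low) and not any(low.startswith(p) for p in prefixes)


# Hypothetical policy content and candidate cells.
policy = load_policy('{"drop_if_starts_with": ["we propose", "our extensive evaluation"]}')
cells = ["We propose a new planner", "Accuracy improves on held-out tasks", ""]
kept = [c for c in cells if keep_cell(c, policy)]
print(kept)
```

Keeping the prefixes in the asset file rather than in code means the drop list can change without touching the script, and the same input always yields the same table.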
FILE:scripts/run.py
from __future__ import annotations
import argparse
import json
import re
import sys
from pathlib import Path
from typing import Any
_TABLE_SEP = re.compile(r"(?m)^\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)+\|?$")
_ASSET_ROOT = Path(__file__).resolve().parents[1] / "assets"
def _load_json(path: Path) -> dict[str, Any]:
    if not path.exists() or path.stat().st_size <= 0:
        return {}
    try:
        data = json.loads(path.read_text(encoding="utf-8", errors="ignore"))
    except Exception:
        return {}
    return data if isinstance(data, dict) else {}


_CELL_HYGIENE = _load_json(_ASSET_ROOT / "table_cell_hygiene.json")


def _compile_pattern(key: str, default: str) -> re.Pattern[str]:
    return re.compile(str(_CELL_HYGIENE.get(key) or default).strip())


def _compile_pattern_list(key: str, default: list[str]) -> list[re.Pattern[str]]:
    raw = _CELL_HYGIENE.get(key)
    if not isinstance(raw, list):
        raw = default
    return [re.compile(str(item).strip()) for item in raw if str(item).strip()]
_LEADING_CONTEXT_RE = _compile_pattern(
"leading_context_pattern",
r"(?i)^(?:building on this foundation|against this backdrop|in this context|to this end|towards this end|specifically|for example|for instance|in addition|moreover|furthermore|additionally|then|crucially)[: ,]+\s*",
)
_LEADING_AUTHOR_FINDING_RE = _compile_pattern(
"leading_author_finding_pattern",
r"(?i)^(?:we|our\s+(?:results|analysis|study|experiments?))\s+(?:show|shows|find|finds|demonstrate|demonstrates|validate|validates|reveal|reveals|indicate|indicates)\s+(?:that\s+)?",
)
_LEADING_AUTHOR_ACTION_RE = _compile_pattern(
"leading_author_action_pattern",
r"(?i)^(?:we|our\s+(?:approach|method|framework|system))\s+(?:introduce|present|propose|develop|describe|analyze|review|explore|evaluate|design|conduct)\b[^,]{0,220},\s*",
)
_GENERIC_SUMMARY_PATTERNS = _compile_pattern_list(
"generic_summary_patterns",
[r"(?i)^our extensive evaluation\b", r"(?i)^we propose\s+1\)"],
)
_DROP_PREFIXES = [str(x).strip().lower() for x in (_CELL_HYGIENE.get("drop_if_starts_with") or []) if str(x).strip()]
def _is_placeholder(text: str) -> bool:
    low = (text or "").strip().lower()
    if not low:
        return True
    if "<!-- scaffold" in low:
        return True
    if "(placeholder)" in low:
        return True
    if re.search(r"(?i)\b(?:todo|tbd|fixme)\b", low):
        return True
    if "…" in (text or ""):
        return True
    if re.search(r"(?m)\.\.\.+", text or ""):
        return True
    return False


def _md_table(headers: list[str], rows: list[list[str]]) -> str:
    head = "| " + " | ".join(headers) + " |"
    sep = "| " + " | ".join(["---"] * len(headers)) + " |"
    body = ["| " + " | ".join(cell.replace("\n", " ").strip() for cell in row) + " |" for row in rows]
    return "\n".join([head, sep] + body)


def _clean(text: str, *, limit: int = 140) -> str:
    s = str(text or "").strip()
    s = s.replace("\n", " ")
    s = re.sub(r"\s+", " ", s)
    s = s.replace("|", ", ")
    s = s.replace("/", " and ")
    s = s.strip(" \"'`")
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()


def _sanitize_cell_text(text: str, *, limit: int = 140) -> str:
    s = _clean(text, limit=limit * 2)
    if not s:
        return ""
    if any(p.search(s) for p in _GENERIC_SUMMARY_PATTERNS):
        return ""
    s = _LEADING_CONTEXT_RE.sub("", s)
    s = _LEADING_AUTHOR_FINDING_RE.sub("", s)
    s = _LEADING_AUTHOR_ACTION_RE.sub("", s)
    s = re.sub(r"\s+", " ", s).strip(" ,;:-")
    low = s.lower()
    if any(low.startswith(prefix) for prefix in _DROP_PREFIXES):
        return ""
    return _clean(s, limit=limit)


def _uniq(items: list[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        v = str(item or "").strip()
        if not v or v in seen:
            continue
        seen.add(v)
        out.append(v)
    return out


def _cite_cell(keys: list[str], *, limit: int = 4) -> str:
    keys = _uniq(keys)[:limit]
    return f"[@{'; '.join(keys)}]" if keys else ""


def _load_jsonl(path: Path) -> list[dict[str, Any]]:
    out: list[dict[str, Any]] = []
    if not path.exists() or path.stat().st_size <= 0:
        return out
    for raw in path.read_text(encoding="utf-8", errors="ignore").splitlines():
        raw = raw.strip()
        if not raw:
            continue
        try:
            rec = json.loads(raw)
        except Exception:
            continue
        if isinstance(rec, dict):
            out.append(rec)
    return out
def main() -> int:
parser = argparse.ArgumentParser()
parser.add_argument("--workspace", required=True)
parser.add_argument("--unit-id", default="")
parser.add_argument("--inputs", default="")
parser.add_argument("--outputs", default="")
parser.add_argument("--checkpoint", default="")
args = parser.parse_args()
repo_root = Path(__file__).resolve()
for _ in range(10):
if (repo_root / "AGENTS.md").exists():
break
parent = repo_root.parent
if parent == repo_root:
break
repo_root = parent
sys.path.insert(0, str(repo_root))
from tooling.common import atomic_write_text, ensure_dir, parse_semicolon_list
from tooling.quality_gate import QualityIssue, check_unit_outputs, write_quality_report
workspace = Path(args.workspace).resolve()
unit_id = str(args.unit_id or "U0927").strip() or "U0927"
outputs = parse_semicolon_list(args.outputs) or ["outline/tables_appendix.md", "output/TABLES_APPENDIX_REPORT.md"]
out_rel = next((x for x in outputs if x.endswith("tables_appendix.md")), "outline/tables_appendix.md")
report_rel = next((x for x in outputs if x.endswith("TABLES_APPENDIX_REPORT.md")), "output/TABLES_APPENDIX_REPORT.md")
out_path = workspace / out_rel
report_path = workspace / report_rel
ensure_dir(out_path.parent)
ensure_dir(report_path.parent)
briefs = {str(r.get("sub_id") or "").strip(): r for r in _load_jsonl(workspace / "outline" / "subsection_briefs.jsonl") if str(r.get("sub_id") or "").strip()}
packs = {str(r.get("sub_id") or "").strip(): r for r in _load_jsonl(workspace / "outline" / "evidence_drafts.jsonl") if str(r.get("sub_id") or "").strip()}
anchors = {str(r.get("sub_id") or "").strip(): r for r in _load_jsonl(workspace / "outline" / "anchor_sheet.jsonl") if str(r.get("sub_id") or "").strip()}
if briefs and packs:
sub_ids = [sid for sid in briefs.keys() if sid in packs]
sub_ids.sort(key=lambda s: [int(x) if x.isdigit() else x for x in re.split(r"([0-9]+)", s)])
rows_a1: list[list[str]] = []
rows_a2: list[list[str]] = []
rows_a3: list[list[str]] = []
for sid in sub_ids:
brief = briefs.get(sid) or {}
pack = packs.get(sid) or {}
anchor_rec = anchors.get(sid) or {}
title = str(brief.get("title") or pack.get("title") or sid).strip()
area = f"{sid} {title}".strip()
axes = [_clean(x, limit=64) for x in (brief.get("axes") or []) if _clean(x, limit=64)]
comparisons = pack.get("concrete_comparisons") or []
comp = comparisons[0] if comparisons and isinstance(comparisons[0], dict) else {}
comp_axis = _clean(comp.get("axis") or "", limit=64)
lens_parts = []
if comp_axis:
lens_parts.append(comp_axis)
lens_parts.extend(axes[:2])
lens = ", ".join(_uniq(lens_parts)[:3])
highlights: list[str] = []
ref_keys: list[str] = []
if isinstance(comp, dict):
a_label = _clean(comp.get("A_label") or "", limit=36)
b_label = _clean(comp.get("B_label") or "", limit=36)
if a_label and b_label and comp_axis:
highlights.append(f"{a_label} vs {b_label} on {comp_axis}")
for hl_key in ["A_highlights", "B_highlights"]:
for hl in comp.get(hl_key) or []:
if not isinstance(hl, dict):
continue
excerpt = _sanitize_cell_text(hl.get("excerpt") or "", limit=100)
if excerpt:
highlights.append(excerpt)
ref_keys.extend([str(x).strip() for x in (hl.get("citations") or []) if str(x).strip()])
ref_keys.extend([str(x).strip() for x in (comp.get("citations") or []) if str(x).strip()])
for rec in (anchor_rec.get("anchors") or [])[:3]:
if not isinstance(rec, dict):
continue
ref_keys.extend([str(x).strip() for x in (rec.get("citations") or []) if str(x).strip()])
rows_a1.append([
area,
lens or _clean(title, limit=64),
"<br>".join(_uniq(highlights)[:2]) or _sanitize_cell_text((pack.get("definitions_setup") or [{}])[0].get("bullet") or "evidence-backed comparison", limit=120) or "evidence-backed comparison",
_cite_cell(ref_keys),
])
anchor_bits: list[str] = []
anchor_keys: list[str] = []
for rec in (anchor_rec.get("anchors") or [])[:3]:
if not isinstance(rec, dict):
continue
txt = _sanitize_cell_text(rec.get("text") or "", limit=120)
if txt:
anchor_bits.append(txt)
anchor_keys.extend([str(x).strip() for x in (rec.get("citations") or []) if str(x).strip()])
for rec in (pack.get("evaluation_protocol") or [])[:2]:
if not isinstance(rec, dict):
continue
txt = _sanitize_cell_text(rec.get("bullet") or "", limit=100)
if txt:
anchor_bits.append(txt)
anchor_keys.extend([str(x).strip() for x in (rec.get("citations") or []) if str(x).strip()])
rows_a2.append([
area,
"<br>".join(_uniq(anchor_bits)[:3]) or lens or _clean(title, limit=96),
_cite_cell(anchor_keys or ref_keys),
])
risk_bits: list[str] = []
risk_keys: list[str] = []
for rec in (pack.get("failures_limitations") or [])[:2]:
if not isinstance(rec, dict):
continue
txt = _sanitize_cell_text(rec.get("bullet") or rec.get("excerpt") or "", limit=110)
if txt:
risk_bits.append(txt)
risk_keys.extend([str(x).strip() for x in (rec.get("citations") or []) if str(x).strip()])
if risk_bits:
rows_a3.append([
area,
"<br>".join(_uniq(risk_bits)[:2]),
_cite_cell(risk_keys or ref_keys),
])
parts: list[str] = []
parts.append("**Appendix Table A1. Cross-section comparison map for the survey areas.**")
parts.append("")
parts.append(_md_table(["Survey area", "Comparison focus", "Representative evidence", "Key refs"], rows_a1))
parts.append("")
parts.append("**Appendix Table A2. Concrete evaluation anchors and protocol cues by survey area.**")
parts.append("")
parts.append(_md_table(["Survey area", "Anchor facts and protocol cues", "Key refs"], rows_a2))
if rows_a3:
parts.append("")
parts.append("**Appendix Table A3. Recurring limitations and risk surfaces highlighted across the evidence base.**")
parts.append("")
parts.append(_md_table(["Survey area", "Risk or limitation signal", "Key refs"], rows_a3))
text = "\n".join(parts).rstrip() + "\n"
atomic_write_text(out_path, text)
issues = check_unit_outputs(skill="appendix-table-writer", workspace=workspace, outputs=[out_rel])
status = "PASS" if not issues else "FAIL"
rep_lines = [
"# Appendix tables report",
"",
f"- Status: {status}",
]
if issues:
rep_lines.extend(["", "## Issues"])
rep_lines.extend([f"- {it.message}" for it in issues])
write_quality_report(workspace=workspace, unit_id=unit_id, skill="appendix-table-writer", issues=issues)
else:
text = out_path.read_text(encoding="utf-8", errors="ignore")
rep_lines.extend([
f"- Tables detected: {len(re.findall(_TABLE_SEP, text))}",
f"- Output: `{out_rel}`",
])
atomic_write_text(report_path, "\n".join(rep_lines).rstrip() + "\n")
return 0 if not issues else 2
if __name__ == "__main__":
    raise SystemExit(main())
FILE:tooling/__init__.py
FILE:tooling/common.py
from __future__ import annotations
import csv
import json
import os
import re
import shutil
import tempfile
from dataclasses import dataclass
from datetime import date, datetime
from pathlib import Path
from typing import Any, Iterable
import yaml
def today_iso() -> str:
    return date.today().isoformat()


def now_iso_seconds() -> str:
    return datetime.now().replace(microsecond=0).isoformat()


def ensure_dir(path: Path) -> None:
    path.mkdir(parents=True, exist_ok=True)


def atomic_write_text(path: Path, content: str) -> None:
    ensure_dir(path.parent)
    fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
        os.replace(tmp_path, path)
    except Exception:
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise


def backup_existing(path: Path) -> Path:
    """Rename an existing file to a timestamped `.bak.*` sibling and return the backup path."""
    if not path.exists():
        return path
    stamp = datetime.now().replace(microsecond=0).isoformat().replace("-", "").replace(":", "")
    backup = path.with_name(f"{path.name}.bak.{stamp}")
    counter = 1
    while backup.exists():
        backup = path.with_name(f"{path.name}.bak.{stamp}.{counter}")
        counter += 1
    path.replace(backup)
    return backup


def parse_semicolon_list(value: str | None) -> list[str]:
    if not value:
        return []
    return [item.strip() for item in value.split(";") if item.strip()]


def normalize_title_for_dedupe(title: str) -> str:
    title = title.lower()
    title = re.sub(r"[^a-z0-9]+", " ", title)
    title = re.sub(r"\s+", " ", title).strip()
    return title


def normalize_axis_label(text: str) -> str:
    text = re.sub(r"\s+", " ", (text or "").strip().lower())
    text = text.rstrip(" .;:,;。")
    text = re.sub(r"\s*/\s*", " ", text)
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()
def subsection_brief_generic_axis_norms() -> set[str]:
    """Axis labels that strict survey gates treat as scaffold-level defaults.

    Keep this aligned across brief generation and quality-gate checks so the
    generator does not promote an axis that the gate later classifies as generic.
    """
    axes = {
        "core mechanism and system architecture",
        "training and data setup",
        "evaluation protocol",
        "evaluation protocol (benchmarks / metrics / human)",
        "evaluation protocol (datasets / metrics / human)",
        "evaluation protocol (datasets, metrics, human evaluation)",
        "compute and efficiency",
        "compute and latency constraints",
        "efficiency and compute",
        "tool interface contract (schemas / protocols)",
        "tool selection / routing policy",
        "sandboxing / permissions / observability",
        "failure modes and limitations",
    }
    return {normalize_axis_label(axis) for axis in axes}
def read_jsonl(path: Path) -> list[dict[str, Any]]:
    records: list[dict[str, Any]] = []
    if not path.exists():
        return records
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            records.append(json.loads(line))
    return records


def write_jsonl(path: Path, records: Iterable[dict[str, Any]]) -> None:
    ensure_dir(path.parent)
    lines = [json.dumps(record, ensure_ascii=False) for record in records]
    atomic_write_text(path, "\n".join(lines) + ("\n" if lines else ""))


def read_tsv(path: Path) -> list[dict[str, str]]:
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [dict(row) for row in reader]


def write_tsv(path: Path, rows: list[dict[str, Any]], fieldnames: list[str]) -> None:
    ensure_dir(path.parent)
    fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
            writer.writeheader()
            for row in rows:
                writer.writerow({key: row.get(key, "") for key in fieldnames})
        os.replace(tmp_path, path)
    except Exception:
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
@dataclass(frozen=True)
class UnitsTable:
    fieldnames: list[str]
    rows: list[dict[str, str]]

    @staticmethod
    def load(path: Path) -> "UnitsTable":
        with path.open("r", encoding="utf-8", newline="") as handle:
            reader = csv.DictReader(handle)
            fieldnames = list(reader.fieldnames or [])
            rows = [dict(row) for row in reader]
        return UnitsTable(fieldnames=fieldnames, rows=rows)

    def save(self, path: Path) -> None:
        ensure_dir(path.parent)
        fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
        try:
            with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
                writer = csv.DictWriter(handle, fieldnames=self.fieldnames)
                writer.writeheader()
                for row in self.rows:
                    writer.writerow({key: row.get(key, "") for key in self.fieldnames})
            os.replace(tmp_path, path)
        except Exception:
            try:
                os.unlink(tmp_path)
            except OSError:
                pass
            raise
def load_yaml(path: Path) -> Any:
    with path.open("r", encoding="utf-8") as handle:
        return yaml.safe_load(handle)


def dump_yaml(path: Path, data: Any) -> None:
    ensure_dir(path.parent)
    text = yaml.safe_dump(
        data,
        allow_unicode=True,
        sort_keys=False,
        default_flow_style=False,
        width=120,
    )
    atomic_write_text(path, text)


def copy_tree(src_dir: Path, dst_dir: Path, *, overwrite: bool) -> None:
    if not src_dir.is_dir():
        raise ValueError(f"Template directory not found: {src_dir}")
    ensure_dir(dst_dir)
    for src_path in src_dir.rglob("*"):
        rel = src_path.relative_to(src_dir)
        dst_path = dst_dir / rel
        if src_path.is_dir():
            ensure_dir(dst_path)
            continue
        ensure_dir(dst_path.parent)
        if dst_path.exists() and not overwrite:
            continue
        shutil.copy2(src_path, dst_path)


def tokenize(text: str) -> list[str]:
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return [token for token in text.split() if token]
_EN_STOPWORDS = {
"a",
"an",
"and",
"are",
"as",
"at",
"be",
"by",
"for",
"from",
"in",
"is",
"it",
"of",
"on",
"or",
"that",
"the",
"this",
"to",
"with",
"we",
"our",
"via",
"towards",
"toward",
"using",
"use",
"based",
"new",
"towards",
"into",
"over",
"under",
"between",
"within",
"without",
"beyond",
}
_GENERIC_PAPER_WORDS = {
"survey",
"review",
"tutorial",
"paper",
"approach",
"method",
"methods",
"model",
"models",
"framework",
"frameworks",
"system",
"systems",
"learning",
"deep",
"neural",
"network",
"networks",
"analysis",
"benchmark",
"benchmarks",
"dataset",
"datasets",
"evaluation",
"evaluating",
"towards",
"using",
"based",
"study",
"studies",
}
def candidate_keywords(titles: Iterable[str], *, top_k: int, min_freq: int) -> list[str]:
    freq: dict[str, int] = {}
    for title in titles:
        for token in tokenize(title):
            if token in _EN_STOPWORDS or token in _GENERIC_PAPER_WORDS:
                continue
            if len(token) < 3:
                continue
            freq[token] = freq.get(token, 0) + 1
    candidates = [t for t, c in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])) if c >= min_freq]
    return candidates[:top_k]
def update_status_log(status_path: Path, line: str) -> None:
    ensure_dir(status_path.parent)
    if status_path.exists():
        existing = status_path.read_text(encoding="utf-8")
    else:
        existing = "# Status\n"
    if "## Run log" not in existing:
        existing = existing.rstrip() + "\n\n## Run log\n"
    updated = existing.rstrip() + f"\n- {line}\n"
    atomic_write_text(status_path, updated)


def update_status_field(status_path: Path, heading: str, value: str) -> None:
    heading_line = f"## {heading}".strip()
    bullet_line = f"- `{value}`"
    if status_path.exists():
        lines = status_path.read_text(encoding="utf-8").splitlines()
    else:
        lines = ["# Status"]
    out: list[str] = []
    i = 0
    updated = False
    while i < len(lines):
        line = lines[i]
        out.append(line)
        if line.strip() == heading_line:
            if i + 1 < len(lines) and lines[i + 1].lstrip().startswith("-"):
                out.append(bullet_line)
                i += 2
                updated = True
                continue
            out.append(bullet_line)
            updated = True
        i += 1
    if not updated:
        out.extend(["", heading_line, bullet_line])
    atomic_write_text(status_path, "\n".join(out).rstrip() + "\n")
def decisions_has_approval(decisions_path: Path, checkpoint: str) -> bool:
    if not checkpoint:
        return False
    if not decisions_path.exists():
        return False
    text = decisions_path.read_text(encoding="utf-8")
    pattern = rf"^\s*-\s*\[[xX]\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b"
    return re.search(pattern, text, flags=re.MULTILINE) is not None


def ensure_decisions_approval_checklist(decisions_path: Path) -> None:
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n"
    if re.search(r"^##\s+Approvals\b", text, flags=re.MULTILINE):
        return
    workspace = decisions_path.parent
    checkpoints = _human_checkpoints_from_units(workspace)
    if not checkpoints:
        return
    checklist_lines = ["## Approvals (check to unblock)"]
    for checkpoint in checkpoints:
        hint = _approval_hint(checkpoint)
        suffix = f" ({hint})" if hint else ""
        checklist_lines.append(f"- [ ] Approve {checkpoint}{suffix}")
    checklist_lines.append("")
    checklist = "\n".join(checklist_lines)
    lines = text.splitlines()
    if lines and lines[0].startswith("#"):
        new_text = "\n".join([lines[0], "", checklist] + lines[1:]).rstrip() + "\n"
    else:
        new_text = (checklist + "\n" + text).rstrip() + "\n"
    atomic_write_text(decisions_path, new_text)
def set_decisions_approval(decisions_path: Path, checkpoint: str, *, approved: bool) -> None:
    checkpoint = checkpoint.strip()
    if not checkpoint:
        raise ValueError("checkpoint must be non-empty")
    ensure_decisions_approval_checklist(decisions_path)
    # The checklist helper can return without creating the file (for example when
    # UNITS.csv has no HUMAN checkpoints), so only read it back when it exists.
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n"
    lines = text.splitlines()
    pattern = re.compile(rf"^\s*-\s*\[\s*[xX ]\s*\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b")
    updated = False
    for idx, line in enumerate(lines):
        if pattern.search(line):
            lines[idx] = re.sub(
                r"\[\s*[xX ]\s*\]",
                "[x]" if approved else "[ ]",
                line,
                count=1,
            )
            updated = True
            break
    if not updated:
        insert_at = None
        for idx, line in enumerate(lines):
            if line.strip().startswith("## Approvals"):
                insert_at = idx + 1
                break
        if insert_at is None:
            lines.append("")
            lines.append("## Approvals (check to unblock)")
            insert_at = len(lines)
        lines.insert(insert_at, f"- [{'x' if approved else ' '}] Approve {checkpoint}")
    atomic_write_text(decisions_path, "\n".join(lines).rstrip() + "\n")
def _human_checkpoints_from_units(workspace: Path) -> list[str]:
    units_path = workspace / "UNITS.csv"
    if not units_path.exists():
        return []
    try:
        table = UnitsTable.load(units_path)
    except Exception:
        return []
    seen: set[str] = set()
    out: list[str] = []
    for row in table.rows:
        owner = (row.get("owner") or "").strip().upper()
        if owner != "HUMAN":
            continue
        checkpoint = (row.get("checkpoint") or "").strip()
        if checkpoint and checkpoint not in seen:
            seen.add(checkpoint)
            out.append(checkpoint)
    return out


def _approval_hint(checkpoint: str) -> str:
    hints = {
        "C0": "kickoff: scope/sources/time window/constraints",
        "C1": "retrieval + core set",
        "C2": "scope + outline",
        "C3": "evidence ready",
        "C4": "citations verified",
        "C5": "allow prose writing",
    }
    return hints.get(checkpoint, "")
def upsert_checkpoint_block(decisions_path: Path, checkpoint: str, markdown_block: str) -> None:
    begin = f"<!-- BEGIN CHECKPOINT:{checkpoint} -->"
    end = f"<!-- END CHECKPOINT:{checkpoint} -->"
    block = "\n".join([begin, markdown_block.rstrip(), end, ""]).rstrip() + "\n"
    ensure_decisions_approval_checklist(decisions_path)
    # The checklist helper can return without creating the file, so fall back to
    # a fresh log instead of reading a path that may not exist.
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n\n"
    pattern = re.compile(
        rf"{re.escape(begin)}.*?{re.escape(end)}\n?",
        flags=re.DOTALL,
    )
    if pattern.search(text):
        # Use a callable so backslashes in the block are inserted literally
        # rather than being interpreted as regex replacement escapes.
        new_text = pattern.sub(lambda _match: block, text)
    else:
        new_text = text.rstrip() + "\n\n" + block
    atomic_write_text(decisions_path, new_text)
def seed_queries_from_topic(queries_path: Path, topic: str) -> None:
topic = topic.strip()
if not topic:
return
if queries_path.exists():
lines = queries_path.read_text(encoding="utf-8").splitlines()
else:
lines = [
"# Queries",
"",
"## Primary query",
"- keywords:",
" - \"\"",
"- exclude:",
" - \"\"",
"- max_results: \"\"",
"- core_size: \"\"",
"- time window:",
" - from: \"\"",
" - to: \"\"",
"",
"## Notes",
"-",
]
def _has_nonempty_values(token: str) -> bool:
in_block = False
for raw in lines:
stripped = raw.strip()
if stripped.startswith(f"- {token}:"):
in_block = True
continue
if not in_block:
continue
if raw.startswith(" - "):
value = stripped[2:].strip().strip('"').strip("'")
if value:
return True
continue
if stripped.startswith("- "):
break
return False
def _has_nonempty_scalar(token: str) -> bool:
for raw in lines:
stripped = raw.strip()
if not stripped.startswith(f"- {token}:"):
continue
value = stripped.split(":", 1)[1].split("#", 1)[0].strip().strip('"').strip("'")
return bool(value)
return False
def _has_nonempty_time_field(field: str) -> bool:
# Looks for lines like: ' - from: "2022"'
for raw in lines:
stripped = raw.strip()
if not stripped.startswith(f"- {field}:"):
continue
value = stripped.split(":", 1)[1].strip().strip('"').strip("'")
return bool(value)
return False
has_keywords = _has_nonempty_values("keywords")
has_excludes = _has_nonempty_values("exclude")
has_time_from = _has_nonempty_time_field("from")
has_time_to = _has_nonempty_time_field("to")
has_max_results = _has_nonempty_scalar("max_results")
has_core_size = _has_nonempty_scalar("core_size")
workspace = queries_path.parent
profile = pipeline_profile(workspace)
query_defaults = pipeline_query_defaults(workspace)
raw_tlow = topic.lower()
topic_for_queries = _sanitize_topic_for_query_seed(topic)
keyword_suggestions = [topic_for_queries]
tlow = topic_for_queries.lower()
is_agent = any(t in tlow for t in ("agent", "agents", "agentic"))
is_embodied = any(
t in tlow
for t in (
"embodied ai",
"embodied intelligence",
"embodied agent",
"embodied robotics",
"robot foundation model",
"robot learning",
"robot manipulation",
"vision-language-action",
"vla",
"generalist robot",
)
)
is_text_to_image = any(t in tlow for t in ("text-to-image", "text to image", "t2i"))
is_text_to_video = any(t in tlow for t in ("text-to-video", "text to video", "t2v"))
is_diffusion = "diffusion" in tlow
is_generative = is_text_to_image or is_text_to_video or is_diffusion or ("image generation" in tlow) or ("generative" in tlow)
if is_agent:
keyword_suggestions.extend(
[
"LLM agent",
"language model agent",
"tool use",
"function calling",
"tool-using agent",
"planning",
"memory",
"multi-agent",
"benchmark",
"safety",
]
)
exclude_suggestions: list[str] = []
if is_agent:
exclude_suggestions.append("agent-based modeling")
exclude_suggestions.extend(["react hooks", "perovskite", "banach", "coxeter"])
if is_embodied:
keyword_suggestions.extend(
[
"embodied AI survey",
"embodied AI review",
"embodied intelligence survey",
"embodied agent survey",
"robot foundation model survey",
"robot learning survey",
"robot manipulation survey",
"embodied robotics survey",
"vision-language-action survey",
"vision-language-action model",
"robot foundation model",
"generalist robot policy",
"world model robot",
]
)
stripped_output_terms = topic_for_queries.lower().strip() != raw_tlow.strip()
if stripped_output_terms and any(t in raw_tlow for t in ("latex", "pdf", "markdown", "typesetting")):
exclude_suggestions.extend(["latex", "pdf", "typesetting", "document layout"])
if is_generative:
keyword_suggestions.extend(
[
"text-to-image generation",
"text-guided image generation",
"diffusion model",
"denoising diffusion probabilistic model",
"latent diffusion",
"stable diffusion",
"classifier-free guidance",
"diffusion transformer",
"DiT",
"masked generative transformer",
"MaskGIT",
"autoregressive image generation",
"VQGAN",
"VQ-VAE",
"ControlNet",
"DreamBooth",
"textual inversion",
"LoRA fine-tuning",
]
)
default_max_results = query_defaults.get("max_results")
max_results_suggestion = str(default_max_results) if str(default_max_results or "").strip() else (
"1800" if profile == "arxiv-survey" else ("800" if (is_agent or is_generative or is_embodied) else "300")
)
time_from_suggestion = (
"2018" if is_embodied
else ("2022" if (is_agent and ("llm" in tlow or "language model" in tlow)) else ("2020" if is_generative else ""))
)
core_size_suggestion = str(query_defaults.get("core_size") or "").strip() or ("300" if profile == "arxiv-survey" else "")
out: list[str] = []
i = 0
while i < len(lines):
line = lines[i]
stripped = line.strip()
if stripped.startswith("- keywords:") and not has_keywords:
out.append(line)
i += 1
while i < len(lines) and lines[i].startswith(" - "):
i += 1
for kw in _dedupe_preserve_order(keyword_suggestions)[:14]:
out.append(f" - \"{kw}\"")
continue
if stripped.startswith("- exclude:") and not has_excludes:
out.append(line)
i += 1
while i < len(lines) and lines[i].startswith(" - "):
i += 1
for ex in _dedupe_preserve_order(exclude_suggestions)[:10]:
out.append(f" - \"{ex}\"")
continue
if stripped.startswith("- max_results:") and not has_max_results and max_results_suggestion:
out.append(f"- max_results: \"{max_results_suggestion}\"")
i += 1
continue
if stripped.startswith("- core_size:") and not has_core_size and core_size_suggestion:
out.append(f"- core_size: \"{core_size_suggestion}\"")
i += 1
continue
if stripped.startswith("- time window:") and not (has_time_from or has_time_to) and time_from_suggestion:
out.append(line)
i += 1
# Skip existing from/to lines if present.
while i < len(lines) and lines[i].startswith(" -"):
i += 1
out.append(f" - from: \"{time_from_suggestion}\"")
out.append(" - to: \"\"")
continue
out.append(line)
i += 1
out = _materialize_missing_query_defaults(out, query_defaults, allowed_fields=pipeline_overridable_query_fields(workspace))
atomic_write_text(queries_path, "\n".join(out).rstrip() + "\n")
def _dedupe_preserve_order(items: list[str]) -> list[str]:
"""Strip each item, drop empties and duplicates, and keep first-seen order."""
seen: set[str] = set()
out: list[str] = []
for item in items:
item = item.strip()
if not item or item in seen:
continue
seen.add(item)
out.append(item)
return out
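def _sanitize_topic_for_query_seed(topic: str) -> str:
"""Remove output-format phrases (e.g. "with LaTeX/PDF output") so they do not leak into query keywords."""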
text = str(topic or "").strip()
if not text:
return ""
patterns = [
r"(?i)\bwith\s+latex\s*/\s*pdf\s+output\b",
r"(?i)\bwith\s+latex\s+output\b",
r"(?i)\bwith\s+pdf\s+output\b",
r"(?i)\bwith\s+markdown\s+output\b",
r"(?i)\blatex\s*/\s*pdf\s+output\b",
r"(?i)\bpdf\s+output\b",
r"(?i)\blatex\s+output\b",
r"(?i)\bmarkdown\s+output\b",
r"(?i)\bfor\s+latex\s*/\s*pdf\b",
]
for pattern in patterns:
text = re.sub(pattern, "", text)
text = re.sub(r"\s+", " ", text).strip(" ,;:-")
return text or topic
_LEGACY_PIPELINE_ALIASES = {
"idea-finder": "idea-brainstorm",
"idea-finder.pipeline.md": "idea-brainstorm",
"pipelines/idea-finder.pipeline.md": "idea-brainstorm",
}
def find_repo_root(start: Path | None = None) -> Path:
"""Walk up from *start* (default: this file) for at most 10 levels looking for AGENTS.md; raise FileNotFoundError if absent."""
candidate = (start or Path(__file__)).resolve()
for _ in range(10):
if (candidate / "AGENTS.md").exists():
return candidate
parent = candidate.parent
if parent == candidate:
break
candidate = parent
raise FileNotFoundError("Could not find repo root (AGENTS.md marker)")
def _normalize_pipeline_lock_value(value: str) -> str:
raw = str(value or "").strip()
return _LEGACY_PIPELINE_ALIASES.get(raw, raw)
def resolve_pipeline_spec_path(*, repo_root: Path, pipeline_value: str) -> Path | None:
value = _normalize_pipeline_lock_value(pipeline_value)
if not value:
return None
candidate = Path(value)
if candidate.is_absolute() and candidate.exists():
return candidate.resolve()
rel_candidate = repo_root / value
if rel_candidate.exists():
return rel_candidate.resolve()
filename = Path(value).name
if filename:
direct = repo_root / "pipelines" / filename
if direct.exists():
return direct.resolve()
stem = filename
if stem.endswith(".pipeline.md"):
stem = stem[: -len(".pipeline.md")]
if stem:
direct = repo_root / "pipelines" / f"{stem}.pipeline.md"
if direct.exists():
return direct.resolve()
return None
def load_workspace_pipeline_spec(workspace: Path):
from tooling.pipeline_spec import PipelineSpec
try:
repo_root = find_repo_root(workspace)
except FileNotFoundError:
repo_root = Path(__file__).resolve().parents[1]
lock_path = workspace / "PIPELINE.lock.md"
if not lock_path.exists():
return None
pipeline_name = ""
try:
for raw in lock_path.read_text(encoding="utf-8", errors="ignore").splitlines():
line = raw.strip()
if line.startswith("pipeline:"):
pipeline_name = line.split(":", 1)[1].strip()
break
except Exception:
return None
if not pipeline_name:
return None
spec_path = resolve_pipeline_spec_path(repo_root=repo_root, pipeline_value=pipeline_name)
if spec_path is None:
return None
try:
return PipelineSpec.load(spec_path)
except Exception:
return None
def pipeline_query_defaults(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
return dict(spec.query_defaults) if spec is not None else {}
def pipeline_quality_contract(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
return dict(spec.quality_contract) if spec is not None else {}
def pipeline_quality_contract_value(workspace: Path, *keys: str, default: Any = None) -> Any:
current: Any = pipeline_quality_contract(workspace)
for key in keys:
if not isinstance(current, dict):
return default
current = current.get(str(key))
if current is None:
return default
return current
def pipeline_query_default(workspace: Path, key: str, default: Any = None) -> Any:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return default
return spec.query_default(key, default)
def pipeline_overridable_query_fields(workspace: Path) -> set[str]:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return set()
return set(spec.overridable_query_fields)
def pipeline_profile(workspace: Path) -> str:
"""Return the pipeline profile for a workspace.
Reads PIPELINE.lock.md to get the pipeline name, then loads the
pipeline spec file and reads its ``profile`` frontmatter field.
Falls back to ``"default"`` if anything is missing.
"""
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return "default"
return str(spec.profile or "default").strip() or "default"
def latest_outline_state(workspace: Path) -> dict[str, Any]:
path = Path(workspace).resolve() / "outline" / "outline_state.jsonl"
records = [rec for rec in read_jsonl(path) if isinstance(rec, dict)]
return dict(records[-1]) if records else {}
def _materialize_missing_query_defaults(lines: list[str], query_defaults: dict[str, Any], *, allowed_fields: set[str] | None = None) -> list[str]:
if not query_defaults:
return lines
existing_keys: set[str] = set()
for raw in lines:
stripped = raw.strip()
if not stripped.startswith("- ") or ":" not in stripped:
continue
key = stripped[2:].split(":", 1)[0].strip().lower().replace(" ", "_").replace("-", "_")
if key:
existing_keys.add(key)
additions: list[str] = []
for key, value in query_defaults.items():
norm_key = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
if not norm_key or norm_key in existing_keys:
continue
if allowed_fields and norm_key not in allowed_fields:
continue
rendered = _render_query_scalar(value)
if rendered is None:
continue
additions.append(f'- {norm_key}: "{rendered}"')
if not additions:
return lines
out = list(lines)
if out and out[-1].strip():
out.append("")
out.extend(additions)
return out
def _render_query_scalar(value: Any) -> str | None:
"""Render a bool, number, or non-empty string as text; return None for anything else."""
if value is None:
return None
if isinstance(value, bool):
return "true" if value else "false"
if isinstance(value, (int, float)):
return str(value)
if isinstance(value, str):
text = value.strip()
return text or None
return None
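The topic clean-up and keyword de-duplication used when seeding query keywords can be tried in isolation. The following is a minimal standalone sketch, not the module itself: `sanitize_topic` and `dedupe_preserve_order` are illustrative copies of `_sanitize_topic_for_query_seed` and `_dedupe_preserve_order`, and `_OUTPUT_PHRASES` is a shortened, assumed subset of the pattern list above.

```python
from __future__ import annotations

import re

# Shortened, illustrative subset of the output-format phrases stripped above.
_OUTPUT_PHRASES = [
    r"(?i)\bwith\s+latex\s*/\s*pdf\s+output\b",
    r"(?i)\bwith\s+markdown\s+output\b",
    r"(?i)\bpdf\s+output\b",
]


def sanitize_topic(topic: str) -> str:
    """Drop output-format phrases; fall back to the original topic if nothing is left."""
    text = str(topic or "").strip()
    if not text:
        return ""
    for pattern in _OUTPUT_PHRASES:
        text = re.sub(pattern, "", text)
    # Collapse whitespace left behind by the removals and trim dangling punctuation.
    text = re.sub(r"\s+", " ", text).strip(" ,;:-")
    return text or topic


def dedupe_preserve_order(items: list[str]) -> list[str]:
    """Strip items, skip empties and repeats, keep the first occurrence of each."""
    seen: set[str] = set()
    out: list[str] = []
    for item in items:
        item = item.strip()
        if not item or item in seen:
            continue
        seen.add(item)
        out.append(item)
    return out


seed = sanitize_topic("LLM agents survey with LaTeX/PDF output")
keywords = dedupe_preserve_order([seed, "LLM agent", " LLM agent ", "", "planning"])
print(seed)
print(keywords)
```

Note the fallback in `sanitize_topic`: a topic that consists only of an output phrase is returned unchanged rather than reduced to an empty seed.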
FILE:tooling/executor.py
from __future__ import annotations
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path
from tooling.common import (
UnitsTable,
atomic_write_text,
decisions_has_approval,
ensure_dir,
latest_outline_state,
load_workspace_pipeline_spec,
now_iso_seconds,
parse_semicolon_list,
set_decisions_approval,
update_status_field,
update_status_log,
)
@dataclass(frozen=True)
class RunResult:
unit_id: str | None
status: str
message: str
def _section_first_cutover_block_message(*, workspace: Path, outputs: list[str]) -> str | None:
spec = load_workspace_pipeline_spec(workspace)
if spec is None or str(spec.structure_mode or "").strip().lower() != "section_first":
return None
if "outline/outline_state.jsonl" not in outputs:
return None
latest = latest_outline_state(workspace)
if not latest:
return "Section-first cutover is not actionable yet: `outline/outline_state.jsonl` is missing or empty. Rerun `outline-refiner` after fixing the chapter/section layer."
structure_phase = str(latest.get("structure_phase") or "").strip()
h3_status = str(latest.get("h3_status") or "").strip().lower()
reroute_target = str(latest.get("reroute_target") or "").strip()
retry_budget = str(latest.get("retry_budget_remaining") or "").strip()
reroute_reason = str(latest.get("reroute_reason") or "").strip()
if structure_phase.lower() == "decomposed" and h3_status == "stable":
return None
missing: list[str] = []
for key in ("structure_phase", "h3_status", "approval_status", "reroute_target", "retry_budget_remaining"):
if key not in latest:
missing.append(key)
continue
if key in {"structure_phase", "h3_status"} and not str(latest.get(key) or "").strip():
missing.append(key)
if missing:
return (
"Section-first cutover cannot advance because the latest `outline_state.jsonl` record is missing "
f"required fields: {', '.join(missing)}."
)
reroute_label = reroute_target or "unknown"
retry_label = retry_budget or "unknown"
reason_suffix = f" Reason: {reroute_reason}." if reroute_reason else ""
return (
"Section-first cutover is still blocked; "
f"latest outline state has structure_phase={structure_phase or 'missing'}, "
f"h3_status={h3_status or 'missing'}, reroute_target={reroute_label}, "
f"retry_budget_remaining={retry_label}.{reason_suffix} Fix the reroute target explicitly, then rerun this unit."
)
def _append_run_error(*, workspace: Path, unit_id: str, skill: str, kind: str, message: str, log_rel: str | None) -> None:
"""Append a short failure record to `output/RUN_ERRORS.md` (workspace-local).
This is a human-facing error sink that survives reruns and makes BLOCKED states debuggable.
"""
try:
ensure_dir(workspace / "output")
out_path = workspace / "output" / "RUN_ERRORS.md"
stamp = now_iso_seconds()
log_hint = f" (log: `{log_rel}`)" if log_rel else ""
line = f"- {stamp} `{unit_id}` `{skill}` `{kind}`: {message}{log_hint}"
if out_path.exists() and out_path.stat().st_size > 0:
prev = out_path.read_text(encoding="utf-8", errors="ignore").rstrip() + "\n"
else:
prev = "# Run errors\n\n"
atomic_write_text(out_path, prev + line + "\n")
except Exception:
# Never let the runner crash while trying to log an error.
return
def run_one_unit(
*,
workspace: Path,
repo_root: Path,
strict: bool = False,
auto_approve: set[str] | None = None,
) -> RunResult:
units_path = workspace / "UNITS.csv"
status_path = workspace / "STATUS.md"
if not units_path.exists():
return RunResult(unit_id=None, status="ERROR", message=f"Missing {units_path}")
table = UnitsTable.load(units_path)
runnable_idx = _find_first_runnable(table)
if runnable_idx is None:
return RunResult(unit_id=None, status="IDLE", message="No runnable unit found")
row = table.rows[runnable_idx]
unit_id = row.get("unit_id", "").strip()
skill = row.get("skill", "").strip()
owner = row.get("owner", "").strip().upper()
row["status"] = "DOING"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DOING {skill}")
auto_approve_set = {str(x or "").strip().upper() for x in (auto_approve or set()) if str(x or "").strip()}
if owner == "HUMAN":
checkpoint = row.get("checkpoint", "").strip()
if checkpoint and decisions_has_approval(workspace / "DECISIONS.md", checkpoint):
row["status"] = "DONE"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (HUMAN approved {checkpoint})")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="DONE", message=f"HUMAN approved {checkpoint}")
if checkpoint and checkpoint.upper() in auto_approve_set:
set_decisions_approval(workspace / "DECISIONS.md", checkpoint, approved=True)
row["status"] = "DONE"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (AUTO approved {checkpoint})")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="DONE", message=f"AUTO approved {checkpoint}")
row["status"] = "BLOCKED"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (await HUMAN approval {checkpoint})")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Await HUMAN approval {checkpoint} in DECISIONS.md")
script_path = repo_root / "scripts" / "run.py"
if not script_path.exists():
row["status"] = "BLOCKED"
table.save(units_path)
skill_md = "SKILL.md"
update_status_log(
status_path,
(
f"{now_iso_seconds()} {unit_id} BLOCKED "
f"(no script for {skill}; run manually per {skill_md} then mark DONE)"
),
)
_refresh_status_checkpoint(status_path, table)
return RunResult(
unit_id=unit_id,
status="BLOCKED",
message=(
f"No executable script for skill '{skill}'. "
f"Run it manually by following `{skill_md}`, write the required outputs, "
f"then mark the unit DONE (e.g., `python scripts/pipeline.py mark --workspace {workspace} --unit-id {unit_id} --status DONE`)."
),
)
inputs = parse_semicolon_list(row.get("inputs"))
raw_outputs = parse_semicolon_list(row.get("outputs"))
outputs = [_strip_optional_marker(rel) for rel in raw_outputs]
required_outputs = [outputs[i] for i, rel in enumerate(raw_outputs) if not rel.strip().startswith("?")]
checkpoint = row.get("checkpoint", "").strip()
cmd = [
sys.executable,
str(script_path),
"--workspace",
str(workspace),
"--unit-id",
unit_id,
"--inputs",
";".join(inputs),
"--outputs",
";".join(outputs),
"--checkpoint",
checkpoint,
]
log_rel = f"output/unit_logs/{unit_id}.{skill}.log"
log_path = workspace / log_rel
try:
completed = subprocess.run(cmd, check=False, capture_output=True, text=True)
if completed.stdout or completed.stderr or completed.returncode != 0:
ensure_dir(log_path.parent)
body = [
"# Unit log\n",
f"- unit_id: {unit_id}\n",
f"- skill: {skill}\n",
f"- exit: {completed.returncode}\n",
f"- cmd: {' '.join(cmd)}\n",
"\n## stdout\n\n",
(completed.stdout or "(empty)") + "\n",
"\n## stderr\n\n",
(completed.stderr or "(empty)") + "\n",
]
atomic_write_text(log_path, "".join(body))
except Exception as exc: # pragma: no cover
row["status"] = "BLOCKED"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (exec error)")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="exec_error",
message=f"{type(exc).__name__}: {exc}",
log_rel=None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=str(exc))
missing = [rel for rel in required_outputs if rel and not (workspace / rel).exists()]
if completed.returncode == 0 and not missing:
if strict:
from tooling.quality_gate import check_unit_outputs, write_quality_report
try:
issues = check_unit_outputs(skill=skill, workspace=workspace, outputs=outputs)
except Exception as exc: # pragma: no cover
from tooling.quality_gate import QualityIssue
issues = [
QualityIssue(
code="quality_gate_exception",
message=f"Quality gate crashed: {type(exc).__name__}: {exc}",
)
]
# Rewrite the report even without issues when an old QUALITY_GATE.md exists,
# so a passing run does not leave a stale failure report behind.
report_path = workspace / "output" / "QUALITY_GATE.md"
if issues or report_path.exists():
write_quality_report(workspace=workspace, unit_id=unit_id, skill=skill, issues=issues)
if issues:
row["status"] = "BLOCKED"
table.save(units_path)
rel_report = str(report_path.relative_to(workspace))
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (quality gate: {rel_report})")
_refresh_status_checkpoint(status_path, table)
reroute_hint = _reroute_hint(workspace)
return RunResult(
unit_id=unit_id,
status="BLOCKED",
message=f"Quality gate failed; see {rel_report}" + (f"; {reroute_hint}" if reroute_hint else ""),
)
cutover_block = _section_first_cutover_block_message(workspace=workspace, outputs=outputs)
if cutover_block:
row["status"] = "BLOCKED"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (section-first cutover)")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="section_first_cutover",
message=cutover_block,
log_rel=log_rel if log_path.exists() else None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=cutover_block)
row["status"] = "DONE"
table.save(units_path)
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE {skill}")
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="DONE", message="OK")
row["status"] = "BLOCKED"
table.save(units_path)
if missing:
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (missing outputs: {', '.join(missing)})")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="missing_outputs",
message=f"Missing outputs: {', '.join(missing)}",
log_rel=log_rel if log_path.exists() else None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Missing outputs: {', '.join(missing)}" + (f"; see {log_rel}" if log_path.exists() else ""))
update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (script failed)")
_append_run_error(
workspace=workspace,
unit_id=unit_id,
skill=skill,
kind="script_failed",
message=f"Skill script failed (exit {completed.returncode})",
log_rel=log_rel if log_path.exists() else None,
)
_refresh_status_checkpoint(status_path, table)
return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Skill script failed (exit {completed.returncode})" + (f"; see {log_rel}" if log_path.exists() else ""))
def _find_first_runnable(table: UnitsTable) -> int | None:
"""Return the index of the first TODO/BLOCKED unit whose dependencies are all DONE or SKIP."""
status_ok = {"DONE", "SKIP"}
unit_by_id = {row.get("unit_id", ""): row for row in table.rows}
for idx, row in enumerate(table.rows):
if row.get("status", "").strip().upper() not in {"TODO", "BLOCKED"}:
continue
deps = parse_semicolon_list(row.get("depends_on"))
if not deps:
return idx
deps_done = True
for dep_id in deps:
dep = unit_by_id.get(dep_id)
if not dep:
deps_done = False
break
if dep.get("status", "").strip().upper() not in status_ok:
deps_done = False
break
if deps_done:
return idx
return None
def _refresh_status_checkpoint(status_path: Path, table: UnitsTable) -> None:
checkpoint = _compute_current_checkpoint(table)
update_status_field(status_path, "Current checkpoint", checkpoint)
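def _compute_current_checkpoint(table: UnitsTable) -> str:
"""Return the checkpoint of the first unit that is not DONE/SKIP, or "DONE" when every unit is finished."""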
for row in table.rows:
if row.get("status", "").strip().upper() not in {"DONE", "SKIP"}:
return (row.get("checkpoint") or "").strip() or "C0"
return "DONE"
def invalidate_downstream_units(table: UnitsTable, *, root_unit_id: str) -> list[str]:
"""Reset all transitive downstream dependents of `root_unit_id` to TODO.
This is used when a previously satisfied upstream unit is reopened for rerun.
Keeping downstream units as DONE would otherwise leave stale artifacts in place
and make later `run` invocations stop too early.
Returns the ids of the units whose status was actually changed.
"""
root = str(root_unit_id or "").strip()
if not root:
return []
direct_children: dict[str, list[dict[str, str]]] = {}
for row in table.rows:
for dep in parse_semicolon_list(row.get("depends_on")):
direct_children.setdefault(dep, []).append(row)
affected: list[str] = []
seen: set[str] = set()
stack = [root]
while stack:
current = stack.pop()
for child in direct_children.get(current, []):
child_id = str(child.get("unit_id") or "").strip()
if not child_id or child_id in seen:
continue
seen.add(child_id)
if str(child.get("status") or "").strip().upper() != "TODO":
child["status"] = "TODO"
affected.append(child_id)
stack.append(child_id)
return affected
def _strip_optional_marker(relpath: str) -> str:
"""Drop the leading `?` that marks an output path as optional."""
relpath = (relpath or "").strip()
if relpath.startswith("?"):
return relpath[1:].strip()
return relpath
def _reroute_hint(workspace: Path) -> str:
path = workspace / "output" / "REROUTE_STATE.json"
if not path.exists() or path.stat().st_size <= 0:
return ""
try:
import json
data = json.loads(path.read_text(encoding="utf-8", errors="ignore") or "{}")
except Exception:
return ""
if not isinstance(data, dict):
return ""
target = str(data.get("reroute_target") or "").strip()
status = str(data.get("status") or "").strip()
phase = str(data.get("structure_phase") or "").strip()
h3 = str(data.get("h3_status") or "").strip()
reason = str(data.get("reroute_reason") or "").strip()
if not any([target, status, phase, h3]):
return ""
parts = []
if status:
parts.append(f"reroute_status={status}")
if target:
parts.append(f"reroute_target={target}")
if phase:
parts.append(f"structure_phase={phase}")
if h3:
parts.append(f"h3_status={h3}")
if reason:
parts.append(f"reason={reason}")
return ", ".join(parts)
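The scheduling rules in `_find_first_runnable` and `invalidate_downstream_units` can be illustrated without `UnitsTable`. This is a hedged sketch on plain dictionaries: `split_deps`, `first_runnable`, and `invalidate_downstream` are hypothetical stand-ins (the real code uses `parse_semicolon_list` and `UnitsTable`, whose implementations are not shown here), and the unit ids are made up.

```python
from __future__ import annotations


def split_deps(value: str | None) -> list[str]:
    """Split a semicolon-separated `depends_on` cell into unit ids (assumed format)."""
    return [part.strip() for part in (value or "").split(";") if part.strip()]


def first_runnable(rows: list[dict[str, str]]) -> int | None:
    """Index of the first TODO/BLOCKED row whose dependencies are all DONE or SKIP."""
    by_id = {row["unit_id"]: row for row in rows}
    for idx, row in enumerate(rows):
        if row["status"].upper() not in {"TODO", "BLOCKED"}:
            continue
        deps = split_deps(row.get("depends_on"))
        # An unknown dependency id resolves to an empty status and blocks the row.
        if all(by_id.get(d, {}).get("status", "").upper() in {"DONE", "SKIP"} for d in deps):
            return idx
    return None


def invalidate_downstream(rows: list[dict[str, str]], root_id: str) -> list[str]:
    """Reset every transitive dependent of `root_id` to TODO; return the ids that changed."""
    children: dict[str, list[dict[str, str]]] = {}
    for row in rows:
        for dep in split_deps(row.get("depends_on")):
            children.setdefault(dep, []).append(row)
    affected: list[str] = []
    seen: set[str] = set()
    stack = [root_id]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            child_id = child["unit_id"]
            if child_id in seen:
                continue
            seen.add(child_id)
            if child["status"].upper() != "TODO":
                child["status"] = "TODO"
                affected.append(child_id)
            stack.append(child_id)
    return affected


rows = [
    {"unit_id": "U1", "status": "TODO", "depends_on": ""},
    {"unit_id": "U2", "status": "DONE", "depends_on": "U1"},
    {"unit_id": "U3", "status": "DONE", "depends_on": "U2"},
    {"unit_id": "U4", "status": "DONE", "depends_on": ""},
]
changed = invalidate_downstream(rows, "U1")
next_idx = first_runnable(rows)
print(changed, next_idx)
```

Reopening `U1` resets its transitive dependents but leaves the unrelated `U4` alone, and the next runnable unit is `U1` itself because `U2` and `U3` now wait on it.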
FILE:tooling/ideation.py
from __future__ import annotations
import json
import re
from dataclasses import asdict, dataclass, is_dataclass
from pathlib import Path
from typing import Any, Iterable
from tooling.common import (
atomic_write_text,
ensure_dir,
load_workspace_pipeline_spec,
pipeline_overridable_query_fields,
pipeline_query_default,
read_jsonl,
)
DEFAULT_IDEA_RUBRIC: list[tuple[str, float, str]] = [
("discussion_worthiness", 0.24, "is this direction worth a serious PI/PhD discussion right now?"),
("academic_value", 0.22, "could it open a meaningful paper/thesis-worthy research angle?"),
("evidence_grounding", 0.18, "can we point to concrete literature tensions rather than pure speculation?"),
("direction_distinctness", 0.16, "is it genuinely different from neighboring directions, not just a wording variant?"),
("first_probe_clarity", 0.10, "is there a plausible low-cost first probe that would teach us something?"),
("thesis_potential", 0.10, "does it plausibly grow beyond a one-off note into a stronger line of work?"),
]
STOPWORDS = {
"a", "an", "and", "the", "of", "to", "for", "in", "on", "with", "via", "by", "from", "or",
"work", "works", "study", "studies", "survey", "review", "benchmarks", "benchmark", "evaluation",
"agents", "agent", "model", "models", "llm", "large", "language", "using", "based",
}
CLUSTER_AXIS_HINTS: list[tuple[list[str], list[str], str, str]] = [
(["agent loop", "action"], ["observability granularity", "action-space design", "tool/environment boundary"], "mechanism", "Could sharpen how we think about the basic agent loop rather than adding yet another full stack."),
(["tool", "orchestration", "interface"], ["tool routing", "permission model", "API reliability assumptions"], "systems", "Could clarify whether interface design is the real bottleneck behind seemingly agent-level gains."),
(["planning", "reasoning"], ["search depth", "verification loop", "partial-observability handling"], "mechanism", "Could turn vague talk about reasoning quality into a sharper research question about what actually drives robust planning."),
(["memory", "retrieval", "rag"], ["retrieval policy", "state summarization cadence", "memory horizon"], "mechanism", "Could reveal when memory is a genuine capability lever versus a reporting convenience."),
(["self-improvement", "adaptation"], ["feedback type", "adaptation horizon", "evaluation signal quality"], "adaptation", "Could distinguish genuine improvement from evaluation-sensitive self-optimization."),
(["multi-agent", "coordination"], ["role specialization", "communication budget", "aggregation rule"], "coordination", "Could separate coordination benefits from simple redundancy or verification effects."),
(["benchmark", "evaluation"], ["metric choice", "task slice design", "failure-analysis lens"], "evaluation", "Could make evaluation choices themselves a research object instead of hidden background assumptions."),
(["safety", "security", "governance"], ["threat model", "monitoring granularity", "permission boundary"], "governance", "Could connect technical agent behaviors to deployment-relevant risk questions in a more principled way."),
]
GENERIC_LIMITATION_PATTERNS = [
"abstract-level evidence only",
"validate assumptions",
"full paper",
"before relying on this as key evidence",
]
BENCHMARK_HINTS = [
"HotpotQA", "FEVER", "ALFWorld", "WebShop", "HumanEval", "WebArena", "AgentBench",
"GSM8K", "MATH", "Wikipedia", "wiki", "API", "question answering", "fact verification",
]
AXIS_INSIGHT_LIBRARY: dict[str, dict[str, str]] = {
"observability granularity": {
"question": "whether several agent-loop gains are really planner gains, or whether they mostly come from changing what the planner gets to observe and when",
"confound": "planner quality and broader agent competence",
"thesis": "Several agent-loop gains remain hard to interpret because papers often improve observation access at the same time they improve the planner; before crediting planner depth, we should ask what the system was allowed to see.",
"insight": "A convincing result would not just move aggregate score. It would show which published gains survive once observation access is fixed, ideally producing a regime map of when observability—not planner depth—changes the failure story.",
"contribution_shape": "Could yield a causal-attribution result plus a reporting rule for agent-loop papers: claims about planning quality should specify and control observation access.",
"demotion": "This direction weakens sharply if the strongest prior work already holds observation access fixed while planner quality changes and still reports the same gain pattern.",
"kill_signal": "an anchor paper already fixes observation access while varying planner quality and the main conclusion still survives",
"missing_piece": "What is missing is a fixed-interface, fixed-budget comparison that varies only observation access and tracks whether the failure taxonomy—not just average score—changes.",
"program_kind": "causal attribution",
"time_to_clarity": "fast",
"priority_note": "Fastest to falsify on public tasks, and the answer would immediately change how several agent-loop results are interpreted.",
"reading_extract": "what the agent sees at each step, which ablations vary planner depth, and whether the action interface stays fixed",
"title": "Observability granularity vs planner depth",
},
"action-space design": {
"question": "whether robustness comes from better reasoning or from giving the agent a cleaner action interface",
"confound": "the shape of the action vocabulary and interface design",
"thesis": "Some agent-loop gains may be interface gains in disguise: when action vocabularies become cleaner or narrower, papers can attribute robustness to reasoning improvements that partly come from action-space design.",
"insight": "A useful result would show whether the same nominal planner still behaves very differently once the action space is widened, normalized, or made less ergonomic.",
"contribution_shape": "Could produce an action-space normalization protocol and a clearer account of which agent claims survive once interface ergonomics are controlled.",
"demotion": "This direction weakens if current papers already compare equivalent action spaces and still obtain the same ranking.",
"kill_signal": "the key anchor papers already normalize action vocabularies or API surfaces and still see the same ordering",
"missing_piece": "What is missing is an action-space-normalized comparison that keeps planner prompts fixed while changing only interface granularity and affordances.",
"program_kind": "interface normalization",
"time_to_clarity": "medium",
"priority_note": "Interesting and still thesis-relevant, but less immediate than observability because interface normalization usually requires more careful task redesign.",
"reading_extract": "how many actions are available, how semantically aligned the actions are, and whether interface cleanup happens together with reasoning changes",
"title": "Action-space design or agent competence?",
},
"tool/environment boundary": {
"question": "whether reliability depends more on boundary design between tool and environment than on the nominal reasoning loop",
"confound": "tool abstraction and environment modeling",
"thesis": "A number of agent results may really be about boundary design: where the system stops reasoning internally and starts delegating to tools can change reliability without changing the nominal planner much.",
"insight": "A strong result would show that several agent-level conclusions are actually boundary-design conclusions once tool abstraction is normalized.",
"contribution_shape": "Could yield a boundary-design taxonomy and a cleaner experimental recipe for separating planner quality from tool/interface scaffolding.",
"demotion": "This direction weakens if changing the boundary carefully barely affects the failure story.",
"kill_signal": "careful tool-boundary changes leave both the metric and the qualitative failure modes essentially unchanged",
"missing_piece": "What is missing is a comparison that keeps task semantics fixed while moving the tool/environment boundary in a controlled way.",
"program_kind": "systems boundary",
"time_to_clarity": "medium",
"priority_note": "Valuable as a systems-facing thesis line, but slower to turn into a decisive first readout than the lead mechanism questions.",
"reading_extract": "where tool calls begin, what state is exposed to the planner, and whether environment modeling changes together with the tool boundary",
"title": "Where the tool boundary really matters",
},
"search depth": {
"question": "whether apparent planning gains reflect deeper search or simply more inference-time budget to recover from weak initial choices",
"confound": "inference-time compute budget",
"thesis": "Many planning results still bundle depth with budget: deeper search often spends more tokens, branches, or retries, so it remains unclear whether depth changes reasoning quality or just buys more recovery opportunities.",
"insight": "A convincing result would show whether depth changes the nature of planning failures under a matched budget, rather than merely delaying failure by spending more compute.",
"contribution_shape": "Could produce a compute-normalized planning benchmark slice and a regime map for when search depth matters beyond extra inference budget.",
"demotion": "This direction weakens if prior work already equalizes budget and still finds depth-specific gains.",
"kill_signal": "the anchor papers already normalize token or wall-clock budget and the depth advantage remains intact",
"missing_piece": "What is missing is a compute-normalized study that holds token or wall-clock budget fixed while varying depth or branching, then checks whether failure modes actually change.",
"program_kind": "budget-normalized mechanism",
"time_to_clarity": "medium",
"priority_note": "Still thesis-sized, but it ranks behind observability because compute normalization is harder to defend and easier for prior work to have addressed already.",
"reading_extract": "the reported depth or branching settings, the effective compute budget, and whether shallow baselines were budget-matched",
"title": "Search depth or compute budget?",
},
"verification loop": {
"question": "whether verification contributes a distinct reasoning mechanism or mostly acts as expensive redundancy",
"confound": "extra compute and repeated checking",
"thesis": "Verification-heavy pipelines may look stronger because they retry, re-check, or filter more often, not necessarily because they add a distinct reasoning mechanism.",
"insight": "A useful result would separate verification as a mechanism from verification as a compute-expensive retry policy.",
"contribution_shape": "Could produce a cleaner protocol for comparing verification loops under matched retry or compute budgets.",
"demotion": "This direction weakens if verification can be replaced by simple repetition with little change in behavior.",
"kill_signal": "repeat-sampling or retry baselines already match the reported verification gain",
"missing_piece": "What is missing is a matched-budget comparison between explicit verification and simple repetition or reranking baselines.",
"program_kind": "verification audit",
"time_to_clarity": "medium",
"priority_note": "Decision-relevant when the group suspects redundancy masquerading as reasoning, though the probe is less clean than the top two directions.",
"reading_extract": "how many extra passes are spent on checking, whether retry baselines exist, and which errors verification actually fixes",
"title": "Verification or just expensive redundancy?",
},
"partial-observability handling": {
"question": "whether planning systems fail because they reason badly or because they are brittle under missing state information",
"confound": "state visibility",
"thesis": "Some planning failures may be epistemic before they are algorithmic: systems can look like weak planners when they are actually brittle under missing state information.",
"insight": "A strong result would show whether improved visibility changes the failure taxonomy more than planner changes do.",
"contribution_shape": "Could produce a planning-under-partial-observability benchmark slice and a clearer division between epistemic and algorithmic failure modes.",
"demotion": "This direction weakens if stronger visibility barely changes the failure pattern.",
"kill_signal": "improved visibility leaves both success rates and error types almost unchanged",
"missing_piece": "What is missing is a task slice that varies only state visibility while keeping planner scaffolding steady.",
"program_kind": "epistemic failure analysis",
"time_to_clarity": "medium",
"priority_note": "Useful if the group wants a deeper failure-analysis program, but it is currently less concrete than the lead directions.",
"reading_extract": "what information is hidden, which observations are restored by tools, and whether planner quality is tested separately from visibility",
"title": "Is planning failure really an observability problem?",
},
"retrieval policy": {
"question": "whether memory gains come from better stored knowledge or from better decisions about when and how retrieval is triggered",
"confound": "memory content versus retrieval timing",
"thesis": "Reported memory gains are often hard to interpret because memory content and retrieval triggers move together; some improvements may be protocol effects about when retrieval happens, not capability effects about what memory stores.",
"insight": "A convincing result would separate memory-content effects from retrieval-trigger effects and show whether the interpretation of current RAG-style gains changes once trigger policy is normalized.",
"contribution_shape": "Could yield a cleaner memory-evaluation protocol and a reusable result on when retrieval policy, rather than memory content, is the true hidden variable.",
"demotion": "This direction weakens if current work already varies retrieval policy independently of memory content and sees the same story.",
"kill_signal": "the main memory papers already keep the memory store fixed while varying retrieval triggers or timing and the conclusion does not move",
"missing_piece": "What is missing is a fixed-memory comparison that varies retrieval trigger and timing only, then checks whether the gain survives and which errors it actually removes.",
"program_kind": "protocol sensitivity",
"time_to_clarity": "medium",
"priority_note": "Different enough from the planning directions to stay in the lead set, but it ranks lower because the current evidence is thinner and more protocol-sensitive.",
"reading_extract": "what is stored in memory, when retrieval fires, what retrieval budget is allowed, and whether trigger policy is ever ablated independently",
"title": "Retrieval policy or memory content?",
},
"state summarization cadence": {
"question": "whether summarization helps because it improves reasoning or because it selectively compresses away troublesome state information",
"confound": "what gets forgotten versus what gets highlighted",
"thesis": "Summarization may help less because the agent reasons better and more because the system filters state in a favorable way; cadence choices can quietly redefine what information survives.",
"insight": "A useful result would show whether different summarization cadences preserve the same conclusions and failure taxonomy once memory content is held steady.",
"contribution_shape": "Could produce a summarization-control protocol for long-horizon agents and a clearer account of how state compression alters conclusions.",
"demotion": "This direction weakens if different summarization cadences preserve the same conclusions and failure taxonomy.",
"kill_signal": "changing summarization cadence leaves both retrieval behavior and downstream errors largely unchanged",
"missing_piece": "What is missing is a fixed-memory, fixed-task comparison that varies only summarization cadence and records what state is lost or amplified.",
"program_kind": "state compression",
"time_to_clarity": "medium",
"priority_note": "Promising but currently less mature than retrieval-policy framing because the concrete prior-work hooks are thinner.",
"reading_extract": "what gets summarized away, how often summaries are rewritten, and whether failure cases correlate with state compression choices",
"title": "What summarization cadence is really changing",
},
"memory horizon": {
"question": "whether longer memory helps because more context is useful or because evaluations reward persistence over selectivity",
"confound": "context length and evaluation design",
"thesis": "Longer memory can look like better capability even when the benchmark mainly rewards persistence or repeated access to earlier state, so horizon effects may partly be evaluation-design effects.",
"insight": "A convincing result would show whether extending horizon changes task framing and error patterns, not just aggregate score.",
"contribution_shape": "Could produce a regime map for when horizon length matters, versus when selective retrieval or better task design explains the gain.",
"demotion": "This direction weakens if memory horizon has little effect once retrieval policy is controlled.",
"kill_signal": "memory-length changes stop mattering once retrieval triggers or evaluation design are normalized",
"missing_piece": "What is missing is a horizon study that matches retrieval policy and benchmark framing while varying how much state is retained.",
"program_kind": "regime mapping",
"time_to_clarity": "medium",
"priority_note": "Conceptually useful, but currently less decisive than retrieval-policy framing for a first discussion round.",
"reading_extract": "whether longer context actually changes retrieval choices, and whether the benchmark rewards persistence more than selective recall",
"title": "Does longer memory really help?",
},
}
@dataclass(frozen=True)
class IdeaSignal:
    signal_id: str
    cluster: str
    direction_type: str
    theme: str
    claim_or_observation: str
    tension: str
    missing_piece: str
    possible_axis: str
    academic_value: str
    evidence_confidence: str
    paper_ids: list[str]


@dataclass(frozen=True)
class DirectionCard:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    focus_axis: str
    main_confound: str
    program_kind: str
    contribution_shape: str
    time_to_clarity: str
    one_line_thesis: str
    why_interesting: str
    literature_suggests: list[str]
    closest_prior_gap: list[str]
    missing_piece: str
    possible_variants: list[str]
    academic_value: str
    first_probes: list[str]
    what_counts_as_insight: str
    weakness_conditions: list[str]
    kill_criteria: list[str]
    what_would_change_mind: list[str]
    best_fit: str
    why_this_ranks_here: str
    evidence_confidence: str
    paper_ids: list[str]
    signal_ids: list[str]
    anchor_reading_notes: list[dict[str, str]]


@dataclass(frozen=True)
class ScreenedDirection:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    total_score: float
    discussion_worthiness: int
    academic_value_score: int
    evidence_grounding: int
    direction_distinctness: int
    first_probe_clarity: int
    thesis_potential: int
    recommendation: str
    rationale: str
def read_core_set(path: Path) -> list[dict[str, str]]:
    import csv

    if not path.exists():
        return []
    with path.open("r", encoding="utf-8", newline="") as handle:
        return [dict(row) for row in csv.DictReader(handle)]


def write_jsonl(path: Path, rows: Iterable[dict[str, Any] | Any]) -> None:
    ensure_dir(path.parent)
    encoded: list[str] = []
    for row in rows:
        # Dataclass rows are converted to plain dicts before encoding.
        value = asdict(row) if is_dataclass(row) else row
        encoded.append(json.dumps(value, ensure_ascii=False))
    atomic_write_text(path, "\n".join(encoded).rstrip() + ("\n" if encoded else ""))


def write_json(path: Path, data: Any) -> None:
    ensure_dir(path.parent)
    atomic_write_text(path, json.dumps(data, ensure_ascii=False, indent=2).rstrip() + "\n")


def uniq_keep_order(items: Iterable[str]) -> list[str]:
    # Drop empty and duplicate values while preserving first-seen order.
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or "").strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out


def slugify(text: str) -> str:
    raw = re.sub(r"[^A-Za-z0-9]+", "-", str(text or "").strip().lower())
    return raw.strip("-") or "x"
def clean_text(text: str, *, limit: int = 220) -> str:
    s = str(text or "").strip()
    s = s.replace("\n", " ")
    s = re.sub(r"\s+", " ", s)
    s = s.replace("|", ", ")
    s = s.strip(" \"'`")
    if len(s) <= limit:
        return s
    # Clip at the last word boundary that fits within the limit.
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()


def clean_sentence(text: str, *, limit: int = 180) -> str:
    s = clean_text(text, limit=max(limit * 2, limit))
    if len(s) <= limit:
        return s
    # Prefer whole sentences: accumulate sentence pieces until the limit is hit
    # or at least 65% of the limit is filled.
    window = s[:limit + 40]
    pieces = re.split(r'(?<=[.!?;])\s+', window)
    acc: list[str] = []
    total = 0
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue
        nxt = total + len(piece) + (1 if acc else 0)
        if nxt > limit and acc:
            break
        acc.append(piece)
        total = nxt
        if total >= int(limit * 0.65):
            break
    if acc:
        return " ".join(acc).strip()
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()
def markdown_table(headers: list[str], rows: list[list[str]]) -> str:
    head = "| " + " | ".join(headers) + " |"
    sep = "| " + " | ".join(["---"] * len(headers)) + " |"
    body = ["| " + " | ".join(str(cell).replace("\n", " ").strip() for cell in row) + " |" for row in rows]
    return "\n".join([head, sep] + body)


def write_markdown(path: Path, text: str) -> None:
    ensure_dir(path.parent)
    atomic_write_text(path, text.rstrip() + "\n")


def extract_goal_from_goal_md(path: Path) -> str:
    if not path.exists():
        return "research ideas"
    # First non-empty line that is not a markdown heading.
    lines = [ln.strip() for ln in path.read_text(encoding="utf-8", errors="ignore").splitlines() if ln.strip() and not ln.startswith("#")]
    return lines[0] if lines else "research ideas"


def _idea_workspace_from_brief(path: Path) -> Path | None:
    try:
        return path.resolve().parents[2]
    except Exception:
        return None
def _require_positive_float(value: Any, *, field_name: str) -> float:
    try:
        parsed = float(value)
    except Exception as exc:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
    if parsed <= 0:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return parsed


def _query_int_override(workspace: Path, key: str, default: int) -> int:
    # Return the positive-integer override for `key` from queries.md, or
    # `default` when the key is not overridable or not present.
    queries_path = workspace / "queries.md"
    normalized = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
    if not normalized:
        return int(default)
    if normalized not in pipeline_overridable_query_fields(workspace):
        return int(default)
    if not queries_path.exists():
        return int(default)
    for raw in queries_path.read_text(encoding="utf-8", errors="ignore").splitlines():
        line = raw.strip()
        if not line.startswith("- ") or ":" not in line:
            continue
        # Use a separate name so the `key` parameter is not shadowed.
        line_key, value = line[2:].split(":", 1)
        line_key = line_key.strip().lower().replace(" ", "_").replace("-", "_")
        if line_key != normalized:
            continue
        # Strip trailing `# comment` text and surrounding quotes.
        cleaned = value.split("#", 1)[0].strip().strip('"').strip("'")
        if not cleaned:
            raise ValueError(f"Invalid ideation override: `{normalized}` is empty in queries.md.")
        try:
            parsed = int(cleaned)
        except Exception as exc:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be an integer in queries.md.") from exc
        if parsed <= 0:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be a positive integer in queries.md.")
        return parsed
    return int(default)
def _validate_score_weights(value: Any, *, field_name: str) -> dict[str, float]:
    # Only the keys of `required` are checked here; the supplied weights are
    # normalized to sum to 1.
    required = {
        "discussion_worthiness": 0.24,
        "academic_value": 0.22,
        "evidence_grounding": 0.18,
        "direction_distinctness": 0.16,
        "first_probe_clarity": 0.10,
        "thesis_potential": 0.10,
    }
    if not isinstance(value, dict):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    weights: dict[str, float] = {}
    for key in required:
        if key not in value:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{key}")
        weights[key] = _require_positive_float(value.get(key), field_name=f"{field_name}.{key}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return {key: round(weight / total, 6) for key, weight in weights.items()}


def _validate_diversity_axes(value: Any, *, field_name: str) -> list[str]:
    allowed = {"cluster", "direction_type", "program_kind"}
    if not isinstance(value, list):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    axes: list[str] = []
    seen: set[str] = set()
    for item in value:
        axis = str(item or "").strip().lower().replace("-", "_")
        if not axis:
            continue
        if axis not in allowed:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{axis}")
        if axis in seen:
            continue
        seen.add(axis)
        axes.append(axis)
    if not axes:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return axes
def resolve_idea_contract(workspace: Path) -> dict[str, Any]:
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None:
        raise ValueError("Missing active pipeline contract.")
    query_defaults = dict(spec.query_defaults)
    quality_contract = dict(spec.quality_contract)

    def _require_positive_int(value: Any, *, field_name: str) -> int:
        try:
            parsed = int(value)
        except Exception as exc:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
        if parsed <= 0:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
        return parsed

    signal_policy = quality_contract.get("signal_policy") or {}
    direction_policy = quality_contract.get("direction_policy") or {}
    screening_policy = quality_contract.get("screening_policy") or {}
    brief = parse_idea_brief(workspace / "output" / "trace" / "IDEA_BRIEF.md")
    focus_clusters = [str(x).strip() for x in (brief.get("focus_clusters") or []) if str(x).strip()]

    # Contract defaults; each must be a positive integer.
    direction_pool_min_default = _require_positive_int(query_defaults.get("direction_pool_min"), field_name="query_defaults.direction_pool_min")
    direction_pool_max_default = _require_positive_int(query_defaults.get("direction_pool_max"), field_name="query_defaults.direction_pool_max")
    shortlist_size_default = _require_positive_int(query_defaults.get("idea_shortlist_size"), field_name="query_defaults.idea_shortlist_size")
    report_top_n_default = _require_positive_int(query_defaults.get("report_top_n"), field_name="query_defaults.report_top_n")
    idea_screen_top_n_default = _require_positive_int(query_defaults.get("idea_screen_top_n"), field_name="query_defaults.idea_screen_top_n")
    shortlist_min = _require_positive_int(direction_policy.get("shortlist_min"), field_name="quality_contract.direction_policy.shortlist_min")
    shortlist_max = _require_positive_int(direction_policy.get("shortlist_max"), field_name="quality_contract.direction_policy.shortlist_max")
    keep_min = _require_positive_int(direction_policy.get("keep_min"), field_name="quality_contract.direction_policy.keep_min")
    cluster_diversity_min = _require_positive_int(direction_policy.get("cluster_diversity_min"), field_name="quality_contract.direction_policy.cluster_diversity_min")
    lead_diversity_target = _require_positive_int(direction_policy.get("lead_diversity_target"), field_name="quality_contract.direction_policy.lead_diversity_target")
    lead_diversity_axes = _validate_diversity_axes(
        direction_policy.get("lead_diversity_axes"),
        field_name="quality_contract.direction_policy.lead_diversity_axes",
    )
    signal_table_min = _require_positive_int(signal_policy.get("min_rows"), field_name="quality_contract.signal_policy.min_rows")
    keep_rank_max = _require_positive_int(screening_policy.get("keep_rank_max"), field_name="quality_contract.screening_policy.keep_rank_max")
    maybe_rank_max = _require_positive_int(screening_policy.get("maybe_rank_max"), field_name="quality_contract.screening_policy.maybe_rank_max")
    score_weights = _validate_score_weights(
        screening_policy.get("score_weights"),
        field_name="quality_contract.screening_policy.score_weights",
    )

    # Per-workspace overrides from queries.md take precedence over the defaults.
    direction_pool_min = _query_int_override(workspace, "direction_pool_min", direction_pool_min_default)
    direction_pool_max = _query_int_override(workspace, "direction_pool_max", direction_pool_max_default)
    shortlist_size = _query_int_override(workspace, "idea_shortlist_size", shortlist_size_default)
    report_top_n = _query_int_override(workspace, "report_top_n", report_top_n_default)
    idea_screen_top_n = _query_int_override(workspace, "idea_screen_top_n", idea_screen_top_n_default)

    # Cross-field consistency checks.
    if direction_pool_min > direction_pool_max:
        raise ValueError(f"Invalid ideation contract: direction_pool_min ({direction_pool_min}) exceeds direction_pool_max ({direction_pool_max}).")
    if shortlist_size < shortlist_min or shortlist_size > shortlist_max:
        raise ValueError(
            f"Invalid ideation contract: idea_shortlist_size ({shortlist_size}) must stay within "
            f"[{shortlist_min}, {shortlist_max}]."
        )
    if report_top_n > shortlist_size:
        raise ValueError(f"Invalid ideation contract: report_top_n ({report_top_n}) exceeds idea_shortlist_size ({shortlist_size}).")
    if maybe_rank_max < keep_rank_max:
        raise ValueError(
            f"Invalid ideation contract: maybe_rank_max ({maybe_rank_max}) must be >= keep_rank_max ({keep_rank_max})."
        )
    if idea_screen_top_n > direction_pool_max:
        raise ValueError(f"Invalid ideation contract: idea_screen_top_n ({idea_screen_top_n}) exceeds direction_pool_max ({direction_pool_max}).")
    return {
        "focus_clusters": focus_clusters,
        "direction_pool_min": direction_pool_min,
        "direction_pool_max": direction_pool_max,
        "idea_screen_top_n": idea_screen_top_n,
        "shortlist_size": shortlist_size,
        "report_top_n": report_top_n,
        "shortlist_min": shortlist_min,
        "shortlist_max": shortlist_max,
        "signal_table_min": signal_table_min,
        "keep_min": keep_min,
        "cluster_diversity_min": cluster_diversity_min,
        "lead_diversity_target": lead_diversity_target,
        "lead_diversity_axes": lead_diversity_axes,
        "keep_rank_max": keep_rank_max,
        "maybe_rank_max": maybe_rank_max,
        "score_weights": score_weights,
    }
def parse_idea_brief(path: Path) -> dict[str, Any]:
    text = path.read_text(encoding="utf-8", errors="ignore") if path.exists() else ""
    out: dict[str, Any] = {
        "goal": extract_goal_from_goal_md(path),
        "focus_clusters": [],
        "query_buckets": [],
        "exclusions": [],
        "constraints": [],
        "targets": {},
    }
    cur = None  # lower-cased title of the current `## ` section
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith("## "):
            cur = line[3:].strip().lower()
            continue
        if cur == "goal" and line.startswith("- Topic:"):
            out["goal"] = line.split(":", 1)[1].strip()
        if cur in {"focus after c2", "focus lenses after c2"} and line.startswith("- Focus clusters:"):
            clusters = [x.strip() for x in line.split(":", 1)[1].split(";") if x.strip()]
            cleaned: list[str] = []
            for item in clusters:
                low = item.lower()
                # Skip template placeholders that were never filled in.
                if "to be filled" in low or "fill after c2" in low or "placeholder" in low:
                    continue
                cleaned.append(item)
            out["focus_clusters"] = cleaned
        if cur == "query buckets" and re.match(r"^\d+\.\s+", line):
            out["query_buckets"].append(re.sub(r"^\d+\.\s+", "", line).strip())
        if cur in {"exclude terms", "exclusions"} and line.startswith("- "):
            value = line[2:].strip()
            if value and not value.lower().startswith("none"):
                out["exclusions"].append(value)
        if cur == "constraints" and line.startswith("- "):
            out["constraints"].append(line[2:].strip())
        if cur == "targets" and line.startswith("- ") and ":" in line:
            key, value = line[2:].split(":", 1)
            out["targets"][key.strip().lower().replace(" ", "_")] = value.strip()
    return out
def collect_note_index(path: Path) -> dict[str, dict[str, Any]]:
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        pid = str(rec.get("paper_id") or "").strip()
        if pid:
            out[pid] = rec
    return out


def keywords_from_cluster(name: str) -> list[str]:
    toks = [t for t in re.findall(r"[A-Za-z0-9]+", str(name or "").lower()) if t not in STOPWORDS and len(t) > 2]
    return uniq_keep_order(toks)


def score_note_to_cluster(cluster_name: str, note: dict[str, Any]) -> int:
    # Count how many cluster keywords occur (as substrings) in the note text.
    keys = keywords_from_cluster(cluster_name)
    blob_parts = [str(note.get("title") or "")]
    for field in ["summary_bullets", "limitations", "key_results", "method", "abstract"]:
        val = note.get(field)
        if isinstance(val, list):
            blob_parts.extend([str(x) for x in val])
        elif isinstance(val, str):
            blob_parts.append(val)
    blob = " ".join(blob_parts).lower()
    return sum(1 for key in keys if key in blob)


def map_notes_to_clusters(taxonomy_path: Path, notes_path: Path) -> dict[str, list[dict[str, Any]]]:
    import yaml

    taxonomy = yaml.safe_load(taxonomy_path.read_text(encoding="utf-8", errors="ignore")) if taxonomy_path.exists() else []
    notes = [r for r in read_jsonl(notes_path) if isinstance(r, dict)]
    out: dict[str, list[dict[str, Any]]] = {}
    for top in taxonomy or []:
        if not isinstance(top, dict):
            continue
        for child in top.get("children") or []:
            if not isinstance(child, dict):
                continue
            name = str(child.get("name") or "").strip()
            if not name:
                continue
            scored: list[tuple[int, dict[str, Any]]] = []
            for note in notes:
                s = score_note_to_cluster(name, note)
                if s > 0:
                    scored.append((s, note))
            # Highest score first; ties broken by paper_id. Keep the top 8 notes.
            scored.sort(key=lambda x: (-x[0], str(x[1].get("paper_id") or "")))
            out[name] = [n for _, n in scored[:8]]
    return out
def _cluster_profile(cluster: str) -> tuple[str, list[str], str]:
    low = cluster.lower()
    for keys, axes, direction_type, academic_value in CLUSTER_AXIS_HINTS:
        if any(key in low for key in keys):
            return direction_type, axes, academic_value
    return "research", ["assumption sensitivity", "failure analysis", "scope boundary"], "Could sharpen the way this sub-area is framed and compared."


def _note_bullets(note: dict[str, Any], field: str) -> list[str]:
    val = note.get(field)
    if isinstance(val, list):
        return [clean_text(x, limit=220) for x in val if clean_text(x, limit=220)]
    if isinstance(val, str) and clean_text(val, limit=220):
        return [clean_text(val, limit=220)]
    return []


def _specific_limitations(note: dict[str, Any]) -> list[str]:
    items = _note_bullets(note, "limitations")
    specific = []
    for item in items:
        low = item.lower()
        if any(pat in low for pat in GENERIC_LIMITATION_PATTERNS):
            continue
        specific.append(item)
    # Fall back to all limitations when every one of them matches a generic pattern.
    return specific or items


def _evidence_confidence(notes: list[dict[str, Any]]) -> str:
    if not notes:
        return "low"
    if any(str(n.get("evidence_level") or "").strip() == "fulltext" for n in notes):
        return "medium-high"
    specific = sum(
        1
        for n in notes
        if _specific_limitations(n) and not any(p in (_specific_limitations(n)[0].lower()) for p in GENERIC_LIMITATION_PATTERNS)
    )
    if len(notes) >= 3 and specific >= 2:
        return "medium"
    return "low-medium"
def _axis_profile(axis: str, cluster: str) -> dict[str, str]:
    # Look up the axis in AXIS_INSIGHT_LIBRARY; unknown axes and missing fields
    # fall back to generic templated text.
    base = AXIS_INSIGHT_LIBRARY.get(axis, {})
    confound = base.get("confound", "nearby design choices and evaluation framing")
    return {
        "question": base.get("question", f"whether {axis} is doing more conceptual work in {cluster.lower()} than current papers make explicit"),
        "confound": confound,
        "thesis": base.get("thesis", f"Current results in {cluster.lower()} may be hard to interpret because {axis} still moves together with {confound}."),
        "insight": base.get("insight", f"A meaningful result would show whether {axis} changes how we interpret the current literature rather than merely how we narrate it."),
        "contribution_shape": base.get("contribution_shape", f"Could turn {axis} into a cleaner explanatory variable for {cluster.lower()} rather than a background convenience."),
        "demotion": base.get("demotion", f"This direction would weaken if the strongest prior work already isolates {axis} cleanly."),
        "kill_signal": base.get("kill_signal", f"the strongest prior work already isolates {axis} against {confound}"),
        "missing_piece": base.get("missing_piece", f"What is missing is a cleaner comparison that varies {axis} while keeping {confound} fixed enough to change interpretation, not just presentation."),
        "program_kind": base.get("program_kind", "mechanism clarification"),
        "time_to_clarity": base.get("time_to_clarity", "medium"),
        "priority_note": base.get("priority_note", f"Worth discussion because it could turn {axis} into a sharper explanatory wedge for {cluster.lower()}."),
        "reading_extract": base.get("reading_extract", f"which variables change together with {axis}, what comparator is used, and whether any ablation already fixes {confound}"),
        "title": base.get("title", f"What {axis} is really doing"),
    }
def _result_fact(note: dict[str, Any]) -> str:
    candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
    # Prefer bullets that mention a metric or contain a digit.
    scored: list[str] = []
    for cand in candidates:
        low = cand.lower()
        if any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "outperform", "achiev", "%"]) or any(ch.isdigit() for ch in cand):
            scored.append(cand)
    chosen = scored[0] if scored else _claim_text(note)
    chosen = re.sub(r"^(for instance|for example|e\.g\.)[:,]?\s*", "", chosen, flags=re.IGNORECASE)
    low = chosen.lower()
    tasks = _extract_task_mentions(note)
    task_text = "/".join(tasks[:2]) if tasks else "reported setting"
    percents = re.findall(r"\d+(?:\.\d+)?%", chosen)
    if "pass@1" in low and percents:
        if len(percents) >= 2:
            return clean_text(f"{task_text}: {percents[0]} pass@1 vs {percents[1]} in the reported comparison.", limit=95)
        return clean_text(f"{task_text}: {percents[0]} pass@1 in the reported comparison.", limit=95)
    if "success rate" in low and len(percents) >= 2:
        baseline = " over imitation/RL baselines" if "imitation" in low or "reinforcement learning" in low else " in the reported comparison"
        return clean_text(f"{task_text}: {percents[0]}/{percents[1]} success-rate gains{baseline}.", limit=95)
    if len(percents) >= 2:
        return clean_text(f"{task_text}: {percents[0]} vs {percents[1]} in the reported comparison.", limit=95)
    if len(percents) == 1:
        return clean_text(f"{task_text}: reported result reaches {percents[0]} in the abstract.", limit=95)
    return clean_text(chosen, limit=95)


def _metric_phrase(notes: list[dict[str, Any]]) -> str:
    blob = " ".join(_result_fact(note) for note in notes)
    low = blob.lower()
    if "pass@1" in low:
        return "pass@1 plus failure-type shifts"
    if "success rate" in low:
        return "success rate plus failure-type shifts"
    if "accuracy" in low:
        return "accuracy plus failure-type shifts"
    if "exact match" in low:
        return "exact match plus failure-type shifts"
    if "f1" in low:
        return "F1 plus failure-type shifts"
    if "%" in blob:
        return "the reported benchmark metric plus failure-type shifts"
    return "task success plus failure-type shifts"


def _sentence_has_concrete_hook(text: str) -> bool:
    low = str(text or "").lower()
    if any(ch.isdigit() for ch in str(text or "")):
        return True
    if any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "alfworld", "webshop", "hotpotqa", "fever", "humaneval", "webarena", "agentbench"]):
        return True
    return False
def _extract_task_mentions(note: dict[str, Any]) -> list[str]:
    blob_parts = []
    for field in ["key_results", "summary_bullets", "method", "abstract", "title"]:
        val = note.get(field)
        if isinstance(val, list):
            blob_parts.extend([str(x) for x in val])
        elif isinstance(val, str):
            blob_parts.append(val)
    blob = " ".join(blob_parts)
    low = blob.lower()
    # Record (position, name) pairs so benchmarks are returned in order of first mention.
    positions: list[tuple[int, str]] = []
    for hint in BENCHMARK_HINTS:
        pos = low.find(hint.lower())
        if pos >= 0:
            positions.append((pos, hint))
    for match in re.finditer(r"\b(?:HumanEval|ALFWorld|WebShop|HotpotQA|FEVER|WebArena|AgentBench|GSM8K|MATH)\b", blob):
        positions.append((match.start(), match.group(0)))
    positions.sort(key=lambda item: (item[0], item[1]))
    return uniq_keep_order(name for _, name in positions)


def _task_phrase(notes: list[dict[str, Any]]) -> str:
    tasks: list[str] = []
    for note in notes:
        tasks.extend(_extract_task_mentions(note))
    uniq = uniq_keep_order(tasks)
    if not uniq:
        return "a small public task slice"
    if len(uniq) == 1:
        return uniq[0]
    return "/".join(uniq[:2])


def _claim_text(note: dict[str, Any]) -> str:
    candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
    for cand in candidates:
        if len(cand.split()) >= 8:
            return cand
    return clean_text(note.get("method") or note.get("title") or "", limit=220)


def _limitation_text(note: dict[str, Any], axis: str, cluster: str) -> str:
    limitations = _specific_limitations(note)
    if limitations and not any(pat in limitations[0].lower() for pat in GENERIC_LIMITATION_PATTERNS):
        return limitations[0]
    profile = _axis_profile(axis, cluster)
    return f"The paper still leaves unclear whether {axis} is genuinely responsible for the reported story in {cluster.lower()}, or whether that story is really driven by {profile['confound']}."


def _anchor_notes(note_index: dict[str, dict[str, Any]], paper_ids: list[str]) -> list[dict[str, Any]]:
    return [note_index[pid] for pid in paper_ids if pid in note_index]


def _paper_annotation(note: dict[str, Any], axis: str, cluster: str) -> str:
    title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
    profile = _axis_profile(axis, cluster)
    result = _result_fact(note)
    return clean_sentence(
        f"{title}: {result} Open gap: {axis} still moves with {profile['confound']}.",
        limit=300,
    )


def _synthesis_annotation(notes: list[dict[str, Any]], axis: str, cluster: str) -> str:
    profile = _axis_profile(axis, cluster)
    task_phrase = _task_phrase(notes)
    return clean_sentence(
        f"Across the anchor papers, the live question is whether {axis} changes interpretation itself or merely rides along with {profile['confound']}, especially on {task_phrase}.",
        limit=260,
    )
def build_signal_rows(*, cluster: str, notes: list[dict[str, Any]]) -> list[IdeaSignal]:
    """Build one IdeaSignal per axis (at most three), rotating through the first three notes as anchors."""
    if not notes:
        return []
    direction_type, axes, academic_value = _cluster_profile(cluster)
    evidence_confidence = _evidence_confidence(notes)
    anchors = notes[:3]
    paper_ids = uniq_keep_order(str(n.get("paper_id") or "").strip() for n in anchors)
    signals: list[IdeaSignal] = []
    for idx, axis in enumerate(axes[:3], start=1):
        # Cycle through the anchors so each axis is paired with a (possibly different) paper.
        anchor = anchors[(idx - 1) % len(anchors)]
        claim = _claim_text(anchor)
        tension = _limitation_text(anchor, axis, cluster)
        profile = _axis_profile(axis, cluster)
        missing_piece = clean_sentence(
            f"What is still missing is a direction that isolates whether {axis} is the real explanatory variable in {cluster.lower()}, rather than leaving it entangled with {profile['confound']}.",
            limit=240,
        )
        signals.append(IdeaSignal(
            signal_id=f"SIG-{slugify(cluster)[:16]}-{idx}",
            cluster=cluster,
            direction_type=direction_type,
            theme=f"{profile['title']} in {cluster}",
            claim_or_observation=claim,
            tension=clean_text(tension, limit=220),
            missing_piece=missing_piece,
            possible_axis=axis,
            academic_value=academic_value,
            evidence_confidence=evidence_confidence,
            paper_ids=paper_ids,
        ))
    return signals
def signal_table_markdown(rows: list[IdeaSignal]) -> str:
    table_rows = []
    for row in rows:
        table_rows.append([
            row.signal_id,
            row.cluster,
            row.theme,
            row.claim_or_observation,
            row.tension,
            row.missing_piece,
            row.possible_axis,
            row.academic_value,
            row.evidence_confidence,
            ", ".join(row.paper_ids),
        ])
    return "\n".join([
        "# IDEA_SIGNAL_TABLE",
        "",
        "## Research signals",
        "",
        markdown_table(["Signal ID", "Cluster", "Theme", "Claim / observation", "Tension", "Missing piece", "Possible axis", "Academic value", "Confidence", "Paper IDs"], table_rows),
        "",
    ])
def _title_from_signal(signal: IdeaSignal) -> str:
    profile = _axis_profile(signal.possible_axis, signal.cluster)
    return clean_sentence(profile["title"], limit=72)


def _best_fit(direction_type: str, axis: str) -> str:
    if direction_type == "governance":
        return "Best fit when the group wants a deployment-relevant reading of agent behavior rather than another internal benchmark story."
    if direction_type == "evaluation":
        return "Best fit when the discussion is really about what counts as convincing evidence, not just how to raise scores."
    if axis in {"search depth", "verification loop", "retrieval policy"}:
        return "Best fit for a PhD discussion that wants one sharper mechanism/confound question rather than a broad systems buildout."
    return "Best fit for PI/PhD discussions that want a thesis-worthy mechanism question rather than a generic benchmark wrapper."
def signals_to_direction_cards(signals: list[IdeaSignal], *, note_index: dict[str, dict[str, Any]], focus_clusters: list[str], pool_min: int, pool_max: int) -> list[DirectionCard]:
    """Expand each signal into a DirectionCard, sort focus clusters first, and trim the pool."""
    focus = {x.strip() for x in focus_clusters if str(x).strip()}
    cards: list[DirectionCard] = []
    for idx, signal in enumerate(signals, start=1):
        profile = _axis_profile(signal.possible_axis, signal.cluster)
        anchors = _anchor_notes(note_index, signal.paper_ids)
        task_phrase = _task_phrase(anchors)
        metric_phrase = _metric_phrase(anchors)
        nearest = anchors[0] if anchors else {}
        nearest_title = clean_text((nearest.get("title") if isinstance(nearest, dict) else "") or (signal.paper_ids[0] if signal.paper_ids else signal.cluster), limit=90)
        nearest_task_phrase = _task_phrase([nearest]) if nearest else task_phrase
        literature_suggests = [_paper_annotation(note, signal.possible_axis, signal.cluster) for note in anchors[:2]]
        literature_suggests.append(_synthesis_annotation(anchors, signal.possible_axis, signal.cluster))
        closest_prior_gap = [
            clean_sentence(
                f"{nearest_title} is the closest prior anchor because it already reports concrete behavior on {nearest_task_phrase}. The unresolved point is whether those gains survive once {profile['confound']} is held fixed while {signal.possible_axis} is varied.",
                limit=220,
            ),
            clean_sentence(
                "The novelty test here is narrow, not rhetorical: if a strong anchor paper already runs that single-variable control, this direction should collapse quickly rather than stay alive as a vague confound story.",
                limit=220,
            ),
        ]
        one_line_thesis = clean_sentence(profile["thesis"], limit=230)
        why_interesting = clean_sentence(
            f"This is not just another benchmark wedge. It opens a {profile['program_kind']} line around {signal.possible_axis}: {profile['question']}.",
            limit=220,
        )
        missing_piece = clean_sentence(profile["missing_piece"], limit=230)
        possible_variants = [
            clean_sentence(f"Single-variable control on {task_phrase}: vary {signal.possible_axis} while holding {profile['confound']} fixed as far as the setup allows.", limit=210),
            clean_sentence("Replace score-only reporting with a failure-type comparison after the same intervention, so the result says more than whether the average went up.", limit=210),
            clean_sentence("Check whether the conclusion survives on one simple public task slice and one more tool- or environment-heavy setting before treating it as a general claim.", limit=210),
        ]
        first_probes = [
            clean_sentence(
                f"Intervention: vary {signal.possible_axis} while holding {profile['confound']} as fixed as possible on {task_phrase}. Readout: {metric_phrase}. Decisive if the interpretation changes even after the control.",
                limit=220,
            ),
            clean_sentence(
                f"Prior-work audit: inspect {nearest_title} for any ablation that already fixes {profile['confound']}, and if the conclusion survives, demote this direction.",
                limit=180,
            ),
        ]
        what_counts_as_insight = clean_sentence(profile['insight'], limit=190)
        kill_criteria = [
            clean_sentence(f"Kill quickly if {profile['kill_signal']}.", limit=170),
            clean_sentence(f"Kill if the first controlled probe leaves both {metric_phrase} and the failure taxonomy essentially unchanged.", limit=170),
        ]
        weakness_conditions = [
            clean_sentence(f"This direction is weaker if changing {signal.possible_axis} mostly rescales the aggregate metric without changing which failure modes appear or disappear.", limit=175),
            clean_sentence(profile['demotion'], limit=170),
        ]
        anchor_reading_notes: list[dict[str, str]] = []
        for note in anchors[:3]:
            title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
            result = _result_fact(note)
            anchor_reading_notes.append({
                "paper_title": title,
                "why_read": clean_sentence(f"Closest {signal.cluster.lower()} anchor with a concrete result hook on {task_phrase}: {result}", limit=180),
                "what_to_extract": clean_sentence(f"Extract the metric and comparator, then check whether {profile['reading_extract']}.", limit=180),
                "current_hook": clean_sentence(result, limit=180),
                "kill_signal": clean_sentence(f"Weaken this direction if the paper already shows {profile['kill_signal']}.", limit=180),
            })
        why_this_ranks_here = clean_sentence(profile['priority_note'], limit=170)
        cards.append(DirectionCard(
            direction_id=f"DIR-{idx:03d}",
            cluster=signal.cluster,
            direction_type=signal.direction_type,
            title=_title_from_signal(signal),
            focus_axis=signal.possible_axis,
            main_confound=profile['confound'],
            program_kind=profile['program_kind'],
            contribution_shape=profile['contribution_shape'],
            time_to_clarity=profile['time_to_clarity'],
            one_line_thesis=one_line_thesis,
            why_interesting=why_interesting,
            literature_suggests=literature_suggests,
            closest_prior_gap=closest_prior_gap,
            missing_piece=missing_piece,
            possible_variants=possible_variants,
            academic_value=profile['contribution_shape'],
            first_probes=first_probes,
            what_counts_as_insight=what_counts_as_insight,
            weakness_conditions=weakness_conditions,
            kill_criteria=kill_criteria,
            what_would_change_mind=kill_criteria,
            best_fit=_best_fit(signal.direction_type, signal.possible_axis),
            why_this_ranks_here=why_this_ranks_here,
            evidence_confidence=signal.evidence_confidence,
            paper_ids=signal.paper_ids,
            signal_ids=[signal.signal_id],
            anchor_reading_notes=anchor_reading_notes,
        ))
    # Focus clusters sort first; ties break on cluster, program kind, then title.
    cards.sort(key=lambda c: (0 if c.cluster in focus else 1, c.cluster, c.program_kind, c.title))
    # Clamp the pool size: at most pool_max, but never below pool_min.
    target = max(pool_min, min(pool_max, len(cards)))
    return cards[:target]
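# Illustrative sketch, not part of the original module: the pool-size clamp used at
# the end of `signals_to_direction_cards`, pulled out so its edge cases are easy to
# see. `_example_pool_target` is a hypothetical helper name and the numbers in the
# comments below are made up for illustration.
def _example_pool_target(n_cards: int, pool_min: int, pool_max: int) -> int:
    # Same expression as above: cap at pool_max, then raise to at least pool_min.
    return max(pool_min, min(pool_max, n_cards))
# With pool_min=4 and pool_max=8: 12 cards -> 8, 6 cards -> 6, 2 cards -> 4.
# A target larger than the number of cards is harmless, because the slice
# `cards[:target]` simply returns every available card.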
def direction_pool_markdown(cards: list[DirectionCard]) -> str:
    rows: list[list[str]] = []
    for card in cards:
        rows.append([
            card.direction_id,
            card.cluster,
            card.direction_type,
            card.program_kind,
            card.title,
            clean_sentence(card.one_line_thesis, limit=100),
            clean_sentence(card.why_interesting, limit=100),
            clean_sentence(card.missing_piece, limit=100),
            clean_sentence(" / ".join(card.possible_variants), limit=120),
            clean_sentence(card.academic_value, limit=90),
            clean_sentence(" / ".join(card.first_probes), limit=120),
            card.evidence_confidence,
            ", ".join(card.paper_ids),
        ])
    return "\n".join([
        "# IDEA_DIRECTION_POOL",
        "",
        "## Candidate research directions",
        "",
        markdown_table(["Direction ID", "Cluster", "Type", "Program", "Title", "One-line thesis", "Why interesting", "Missing piece", "Possible variants", "Academic value", "First probes", "Confidence", "Paper IDs"], rows),
        "",
    ])
def _thesis_potential(direction_type: str, cluster: str) -> int:
    if direction_type in {"mechanism", "coordination", "adaptation"}:
        return 5
    if direction_type in {"systems", "governance"}:
        return 4
    if "benchmark" in cluster.lower() or direction_type == "evaluation":
        return 3
    return 4
def score_direction_cards(
    cards: list[DirectionCard],
    *,
    focus_clusters: list[str],
    keep_rank_max: int,
    maybe_rank_max: int,
    score_weights: dict[str, float] | None = None,
) -> list[ScreenedDirection]:
    """Score each card on six rubric dimensions (3-5 each), rank by weighted total, and band into keep/maybe/drop."""
    focus = {x.strip() for x in focus_clusters if str(x).strip()}
    keep_rank_max = int(keep_rank_max)
    maybe_rank_max = int(maybe_rank_max)
    if keep_rank_max <= 0:
        raise ValueError("Invalid ideation scoring contract: keep_rank_max must be positive.")
    if maybe_rank_max < keep_rank_max:
        raise ValueError("Invalid ideation scoring contract: maybe_rank_max must be >= keep_rank_max.")
    weights = _validate_score_weights(score_weights or {}, field_name="score_direction_cards.score_weights")
    # Count how many cards share each cluster / program kind; used for the distinctness score.
    cluster_counts: dict[str, int] = {}
    program_counts: dict[str, int] = {}
    for card in cards:
        cluster_counts[card.cluster] = cluster_counts.get(card.cluster, 0) + 1
        program_counts[card.program_kind] = program_counts.get(card.program_kind, 0) + 1
    provisional: list[tuple[DirectionCard, float, int, int, int, int, int, int]] = []
    for card in cards:
        discussion_worthiness = 5 if card.cluster in focus or card.time_to_clarity == "fast" else 4
        academic_value_score = 5 if any(token in card.contribution_shape.lower() for token in ["rule", "protocol", "regime map", "causal"]) else 4
        concrete_hooks = sum(1 for item in card.literature_suggests if _sentence_has_concrete_hook(item))
        evidence_grounding = 5 if len(card.paper_ids) >= 3 and concrete_hooks >= 2 else 4 if len(card.paper_ids) >= 2 and concrete_hooks >= 1 else 3
        if program_counts.get(card.program_kind, 0) == 1 and cluster_counts.get(card.cluster, 0) == 1:
            direction_distinctness = 5
        elif program_counts.get(card.program_kind, 0) == 1 or cluster_counts.get(card.cluster, 0) == 1:
            direction_distinctness = 4
        else:
            direction_distinctness = 3
        probe_blob = " ".join(card.first_probes).lower()
        first_probe_clarity = 5 if all(token in probe_blob for token in ["intervention:", "readout:", "decisive if"]) else 4 if "intervention:" in probe_blob else 3
        thesis_potential = _thesis_potential(card.direction_type, card.cluster)
        total = round(
            discussion_worthiness * weights["discussion_worthiness"]
            + academic_value_score * weights["academic_value"]
            + evidence_grounding * weights["evidence_grounding"]
            + direction_distinctness * weights["direction_distinctness"]
            + first_probe_clarity * weights["first_probe_clarity"]
            + thesis_potential * weights["thesis_potential"],
            2,
        )
        provisional.append((card, total, discussion_worthiness, academic_value_score, evidence_grounding, direction_distinctness, first_probe_clarity, thesis_potential))
    # Highest total first; ties prefer "fast" time-to-clarity, then cluster and title for determinism.
    provisional.sort(key=lambda row: (-row[1], 0 if row[0].time_to_clarity == "fast" else 1, row[0].cluster, row[0].title))
    rows: list[ScreenedDirection] = []
    for idx, (card, total, dw, av, eg, dd, fp, tp) in enumerate(provisional, start=1):
        recommendation = "keep" if idx <= keep_rank_max else "maybe" if idx <= maybe_rank_max else "drop"
        strengths: list[str] = []
        if eg >= 5:
            strengths.append("concrete anchor evidence")
        if dd >= 5:
            strengths.append(f"distinct {card.program_kind} wedge")
        if fp >= 5:
            strengths.append("clean first probe")
        if av >= 5:
            strengths.append("thesis-sized payoff")
        if not strengths:
            strengths.append("discussion value")
        risk = "still abstract-first" if str(card.evidence_confidence).startswith("low") else f"main risk is controlling {card.main_confound} cleanly"
        rationale = clean_sentence(f"Strongest on {' + '.join(strengths[:2])}; {risk}.", limit=160)
        rows.append(ScreenedDirection(
            direction_id=card.direction_id,
            cluster=card.cluster,
            direction_type=card.direction_type,
            title=card.title,
            total_score=total,
            discussion_worthiness=dw,
            academic_value_score=av,
            evidence_grounding=eg,
            direction_distinctness=dd,
            first_probe_clarity=fp,
            thesis_potential=tp,
            recommendation=recommendation,
            rationale=rationale,
        ))
    return rows
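# Illustrative sketch, not part of the original module: the weighted total and the
# rank banding from `score_direction_cards`, reduced to two tiny helpers. The helper
# names are hypothetical, and the weights in the worked example are invented for
# illustration; the real weights come from `_validate_score_weights`, whose rules
# are not shown here.
def _example_weighted_total(scores: dict, weights: dict) -> float:
    # Multiply each rubric score by its weight, sum, and round to two decimals.
    return round(sum(scores[name] * weights[name] for name in scores), 2)


def _example_band(rank: int, keep_rank_max: int, maybe_rank_max: int) -> str:
    # Ranks are 1-based: the first keep_rank_max are "keep", then "maybe", then "drop".
    return "keep" if rank <= keep_rank_max else "maybe" if rank <= maybe_rank_max else "drop"
# Worked example with made-up weights 0.2/0.2/0.2/0.15/0.15/0.1 and scores 5/4/4/3/5/5:
#   1.0 + 0.8 + 0.8 + 0.45 + 0.75 + 0.5 = 4.3
# With keep_rank_max=3 and maybe_rank_max=5, rank 3 is "keep", rank 4 is "maybe",
# and rank 6 is "drop".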
def screening_table_markdown(rows: list[ScreenedDirection]) -> str:
    data: list[list[str]] = []
    for row in rows:
        data.append([
            row.direction_id,
            row.cluster,
            row.direction_type,
            row.title,
            f"{row.total_score:.2f}",
            str(row.discussion_worthiness),
            str(row.academic_value_score),
            str(row.evidence_grounding),
            str(row.direction_distinctness),
            str(row.first_probe_clarity),
            str(row.thesis_potential),
            row.recommendation,
            row.rationale,
        ])
    return "\n".join([
        "# IDEA_SCREENING_TABLE",
        "",
        "## Discussion-first screening",
        "",
        markdown_table(["Direction ID", "Cluster", "Type", "Title", "Total", "Discussion", "Academic value", "Evidence", "Distinctness", "First probe", "Thesis potential", "Decision", "Rationale"], data),
        "",
    ])
def shortlist_markdown(records: list[dict[str, Any]]) -> str:
    lines = ["# IDEA_SHORTLIST", "", "## Prioritized directions", ""]
    for record in records:
        lines.extend([
            f"### Direction {record['rank']}. {record['title']}",
            f"- Cluster: {record['cluster']}",
            f"- Type: {record['direction_type']}",
            f"- Focus axis: {record.get('focus_axis')}",
            f"- Program kind: {record.get('program_kind')}",
            f"- Main confound: {record.get('main_confound')}",
            f"- Time to clarity: {record.get('time_to_clarity')}",
            f"- One-line thesis: {record['one_line_thesis']}",
            f"- Why this is interesting: {record['why_interesting']}",
            f"- Why this ranks here: {record['why_this_ranks_here']}",
            f"- Contribution shape: {record.get('contribution_shape')}",
            "- What the literature already suggests:",
        ])
        for item in record.get("literature_suggests") or []:
            lines.append(f"  - {item}")
        lines.extend([
            "- Closest prior work and why it does not settle the question:",
        ])
        for item in record.get("closest_prior_gap") or []:
            lines.append(f"  - {item}")
        lines.extend([
            f"- What is still missing: {record['missing_piece']}",
            "- Possible variants:",
        ])
        for item in record.get("possible_variants") or []:
            lines.append(f"  - {item}")
        lines.extend([
            f"- Why this could matter academically: {record['academic_value']}",
            "- First probes:",
        ])
        for item in record.get("first_probes") or []:
            lines.append(f"  - {item}")
        lines.extend([
            f"- What would count as actual insight: {record['what_counts_as_insight']}",
            "- What would make this weak or unconvincing:",
        ])
        for item in record.get("weakness_conditions") or []:
            lines.append(f"  - {item}")
        lines.extend([
            "- Quick kill criteria:",
        ])
        for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
            lines.append(f"  - {item}")
        lines.extend([
            f"- Best fit: {record['best_fit']}",
            f"- Evidence confidence: {record['evidence_confidence']}",
            "- Anchor papers: " + ", ".join(f"`{pid}`" for pid in record.get("paper_ids") or []),
            f"- Why prioritized now: {record['why_prioritized']}",
            "",
        ])
    return "\n".join(lines)
def shortlist_snapshot_table(records: list[dict[str, Any]]) -> str:
    rows = []
    for rec in records:
        rows.append([
            str(rec.get("rank") or ""),
            rec.get("title", ""),
            clean_sentence(rec.get("why_this_ranks_here", ""), limit=52),
            clean_sentence(rec.get("contribution_shape", rec.get("academic_value", "")), limit=56),
            clean_sentence((rec.get("kill_criteria") or rec.get("what_would_change_mind") or [""])[0], limit=54),
        ])
    return markdown_table(["Rank", "Direction", "Why now", "If it survives", "Fast kill signal"], rows)
def build_report_payload(*, topic: str, shortlist: list[dict[str, Any]], deferred: list[dict[str, Any]], trace_paths: dict[str, str]) -> dict[str, Any]:
    top = list(shortlist)
    takeaways: list[str] = []
    if top:
        takeaways.append("The strongest directions are the ones most likely to change how existing results are interpreted, not just add another benchmark win.")
        takeaways.append("The current rank order reflects a tradeoff between time-to-clarity, thesis-sized payoff, and how concrete the nearest prior-work gap already looks.")
        program_kinds = uniq_keep_order(str(rec.get("program_kind") or "").strip() for rec in top)
        if len(program_kinds) >= 2:
            takeaways.append(f"The lead set is intentionally not one confound template repeated three times: it spans {', '.join(program_kinds[:3])} rather than a single explanatory mold.")
        else:
            takeaways.append("The lead set still sits in one tight neighborhood, so the next reading pass should stress-test whether those directions are genuinely distinct thesis lines.")
        if any(str(rec.get("evidence_confidence") or "").startswith("low") for rec in top):
            takeaways.append("Most anchors are still abstract-first, so the memo is best used to decide the next reading and falsification pass rather than to lock a project immediately.")
        else:
            takeaways.append("The evidence is already concrete enough for a serious PI/PhD discussion, though one deeper paper pass could still reshuffle the exact order.")
    lead = top[0] if top else {}
    lead_anchor = (lead.get("anchor_reading_notes") or [{}])[0] if top else {}
    lead_anchor_title = lead_anchor.get("paper_title") if isinstance(lead_anchor, dict) else "the first anchor paper"
    discussion_questions = [
        "Which direction still survives once we ask for a single-variable control rather than a suggestive confound story?",
        "Which lead direction would still matter if the nearest prior work already addressed the obvious control we are worried about?",
        "Which candidate could plausibly turn into a thesis line with a reusable method, protocol, or regime map rather than a one-off empirical note?",
        "Which ranking would change most after one full-paper reading pass on the anchor set?",
    ]
    uncertainties = [
        "The main remaining risk is novelty risk, not idea scarcity: closer reading may reveal that one lead direction is already settled by an existing control or ablation.",
        "Evidence confidence is still bounded by abstract-first notes for part of the lead set, so paper-specific details could either strengthen or kill a direction quickly.",
        "The memo is more reliable as a discussion and triage artifact than as a final commitment on exact project order.",
    ]
    next_steps = []
    if top:
        next_steps.append(clean_sentence(f"Start with {lead.get('title')}: read {lead_anchor_title or 'the first anchor paper'} looking specifically for whether {lead.get('focus_axis')} is already isolated against {lead.get('main_confound')}.", limit=220))
        first_probe = (lead.get("first_probes") or [""])[0]
        if first_probe:
            next_steps.append(clean_sentence(first_probe, limit=220))
        next_steps.append(clean_sentence(f"Re-rank only after checking the quick kill criteria for {lead.get('title')} and at least one competing direction in the lead set.", limit=220))
    else:
        next_steps.extend([
            "Read the anchor papers for the current top direction in full and test whether the memo's stated hidden variable still looks unresolved.",
            "In the next PI/PhD discussion, use the ranking arguments and kill criteria to decide whether any direction deserves promotion into a sharper thesis candidate.",
        ])
    return {
        "topic": topic,
        "takeaways": takeaways,
        "top_directions": top,
        "deferred_directions": deferred,
        "discussion_questions": discussion_questions,
        "uncertainties": uncertainties,
        "next_steps": next_steps,
        "trace_artifacts": trace_paths,
    }
def report_markdown(payload: dict[str, Any]) -> str:
    top_dirs = payload.get("top_directions") or []
    # Sections 0-2 are fixed and each top direction takes one section (3, 4, ...),
    # so the trailing sections start right after the last direction.
    deferred_idx = 3 + len(top_dirs)
    discussion_idx = deferred_idx + 1
    uncertainty_idx = deferred_idx + 2
    next_idx = deferred_idx + 3
    appendix_idx = deferred_idx + 4
    lines = [
        "# Research Idea Brainstorm Memo",
        "",
        "## 0. Scope and framing",
        "",
        f"- Topic: {payload.get('topic') or 'research ideas'}",
        "- Intended readers: PI / PhD",
        "- Goal: surface a small number of discussion-worthy research directions rather than force a final project choice.",
        "- Current evidence basis: abstract-first notes unless otherwise stated.",
        "- What this memo is not: not a final project spec, not a survey draft, and not a symmetric top-3 proposal pack.",
        "",
        "## 1. Big-picture takeaways",
        "",
    ]
    for item in payload.get("takeaways") or []:
        lines.append(f"- {item}")
    lines.extend([
        "",
        "## 2. Top directions at a glance",
        "",
        shortlist_snapshot_table(payload.get("top_directions") or []),
        "",
    ])
    for idx, record in enumerate(top_dirs, start=1):
        lines.extend([
            f"## {idx + 2}. Direction {idx} — {record.get('title')}",
            "",
            "### One-line thesis",
            f"- {record.get('one_line_thesis')}",
            "",
            "### Why it belongs in the lead set",
            f"- {record.get('why_this_ranks_here')}",
            f"- Time to clarity: {record.get('time_to_clarity')}",
            "",
            "### What the current literature actually shows",
        ])
        for item in record.get("literature_suggests") or []:
            lines.append(f"- {item}")
        lines.extend([
            "",
            "### Closest prior work and remaining gap",
        ])
        for item in record.get("closest_prior_gap") or []:
            lines.append(f"- {item}")
        lines.extend([
            f"- Missing piece: {record.get('missing_piece')}",
            "",
            "### If this direction is right, what contribution emerges",
            f"- {record.get('contribution_shape') or record.get('academic_value')}",
            f"- {record.get('what_counts_as_insight')}",
            "",
            "### Smallest decisive probe",
        ])
        for item in record.get("first_probes") or []:
            lines.append(f"- {item}")
        lines.extend([
            "",
            "### Quick kill criteria",
        ])
        for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
            lines.append(f"- {item}")
        lines.append("")
    lines.extend([
        f"## {deferred_idx}. Other promising but not prioritized directions",
        "",
    ])
    deferred = payload.get("deferred_directions") or []
    if deferred:
        for record in deferred:
            lines.append(f"- **{record.get('title')}** — {record.get('one_line_thesis')} (why not prioritized now: {record.get('why_not_prioritized')})")
    else:
        lines.append("- No additional deferred directions were retained in the current memo.")
    lines.extend([
        "",
        f"## {discussion_idx}. Cross-cutting discussion questions",
        "",
    ])
    for item in payload.get("discussion_questions") or []:
        lines.append(f"- {item}")
    lines.extend([
        "",
        f"## {uncertainty_idx}. Uncertainty and disagreement",
        "",
    ])
    for item in payload.get("uncertainties") or []:
        lines.append(f"- {item}")
    lines.extend([
        "",
        f"## {next_idx}. Suggested next reading / next discussion step",
        "",
    ])
    for item in payload.get("next_steps") or []:
        lines.append(f"- {item}")
    lines.extend([
        "",
        f"## {appendix_idx}. Appendix guide",
        "",
        "- `output/APPENDIX.md` for anchor-paper reading notes and deferred directions.",
        "- `output/REPORT.json` for the structured version of this memo.",
    ])
    return "\n".join(lines)
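# Illustrative sketch, not part of the original module: the section-numbering
# arithmetic from `report_markdown`, isolated so the offsets can be checked by hand.
# `_example_section_numbers` is a hypothetical helper name used only for illustration.
def _example_section_numbers(n_top: int) -> dict:
    # Sections 0, 1 and 2 are fixed; direction k (1-based) is printed as section k + 2.
    deferred_idx = 3 + n_top
    return {
        "directions": [k + 2 for k in range(1, n_top + 1)],
        "deferred": deferred_idx,
        "discussion": deferred_idx + 1,
        "uncertainty": deferred_idx + 2,
        "next_steps": deferred_idx + 3,
        "appendix": deferred_idx + 4,
    }
# With three top directions the memo runs: directions 3-5, deferred 6, discussion 7,
# uncertainty 8, next steps 9, appendix 10. With none, the deferred section is 3.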
def appendix_markdown(payload: dict[str, Any], *, core_titles: dict[str, str]) -> str:
    lines = [
        "# Appendix to the Research Idea Brainstorm Memo",
        "",
        "## A. Deferred but still promising directions",
        "",
    ]
    deferred = payload.get("deferred_directions") or []
    if deferred:
        for rec in deferred:
            lines.extend([
                f"### {rec.get('title')}",
                f"- One-line thesis: {rec.get('one_line_thesis')}",
                f"- Program kind: {rec.get('program_kind')}",
                f"- Why interesting: {rec.get('why_interesting')}",
                f"- Why not prioritized now: {rec.get('why_not_prioritized')}",
                "",
            ])
    else:
        lines.append("- (none)")
    lines.extend([
        "## B. Anchor papers and what to extract",
        "",
    ])
    for rec in payload.get("top_directions") or []:
        lines.append(f"### {rec.get('title')}")
        lines.append(f"- Program kind: {rec.get('program_kind')}")
        lines.append(f"- Lead-set reason: {rec.get('why_this_ranks_here')}")
        notes = rec.get("anchor_reading_notes") or []
        if notes:
            rows: list[list[str]] = []
            for item in notes:
                if not isinstance(item, dict):
                    continue
                rows.append([
                    item.get("paper_title", "paper"),
                    item.get("why_read", ""),
                    item.get("what_to_extract", ""),
                    item.get("current_hook", ""),
                    item.get("kill_signal", ""),
                ])
            if rows:
                lines.append(markdown_table(["Anchor paper", "Why read now", "What to extract", "Current evidence hook", "Kill signal"], rows))
            else:
                for pid in rec.get("paper_ids") or []:
                    title = core_titles.get(pid, pid)
                    lines.append(f"- {title}")
        else:
            for pid in rec.get("paper_ids") or []:
                title = core_titles.get(pid, pid)
                lines.append(f"- {title}")
        lines.append("")
    return "\n".join(lines)
FILE:tooling/pipeline_spec.py
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import Any
import yaml
@dataclass(frozen=True)
class PipelineStage:
    id: str
    title: str
    checkpoint: str
    mode: str
    required_skills: tuple[str, ...]
    optional_skills: tuple[str, ...]
    produces: tuple[str, ...]
    human_checkpoint: dict[str, Any]

    @staticmethod
    def from_mapping(stage_id: str, data: Any, *, path: Path) -> "PipelineStage":
        if not isinstance(data, dict):
            raise ValueError(f"`stages.{stage_id}` must be a mapping in {path}")
        return PipelineStage(
            id=str(stage_id or "").strip(),
            title=str(data.get("title") or stage_id).strip() or str(stage_id),
            checkpoint=str(data.get("checkpoint") or stage_id).strip() or str(stage_id),
            mode=str(data.get("mode") or "").strip(),
            required_skills=_string_tuple(data.get("required_skills"), field_name=f"stages.{stage_id}.required_skills", path=path),
            optional_skills=_string_tuple(data.get("optional_skills"), field_name=f"stages.{stage_id}.optional_skills", path=path),
            produces=_string_tuple(data.get("produces"), field_name=f"stages.{stage_id}.produces", path=path),
            human_checkpoint=_mapping(data.get("human_checkpoint"), field_name=f"stages.{stage_id}.human_checkpoint", path=path, required=False),
        )
@dataclass(frozen=True)
class PipelineSpec:
    path: Path
    name: str
    version: str
    units_template: str
    default_checkpoints: tuple[str, ...]
    profile: str
    routing_hints: tuple[str, ...]
    routing_default: bool
    routing_priority: int
    target_artifacts: tuple[str, ...]
    contract_model: str
    structure_mode: str
    pre_retrieval_shell: dict[str, Any]
    binding_layers: tuple[str, ...]
    core_chapter_h3_target: int
    query_defaults: dict[str, Any]
    overridable_query_fields: tuple[str, ...]
    quality_contract: dict[str, Any]
    loop_policy: dict[str, Any]
    stages: dict[str, PipelineStage]
    variant_of: str
    variant_overrides: dict[str, Any]

    @staticmethod
    def load(path: Path) -> "PipelineSpec":
        resolved = path.resolve()
        raw_frontmatter, frontmatter = _load_variant_aware_frontmatter(resolved)
        name = str(frontmatter.get("name") or path.stem.replace(".pipeline", ""))
        units_template = str(frontmatter.get("units_template") or "")
        version = str(frontmatter.get("version") or "")
        default_checkpoints = _string_tuple(frontmatter.get("default_checkpoints"), field_name="default_checkpoints", path=resolved)
        profile = str(frontmatter.get("profile") or "default").strip() or "default"
        routing_hints = _string_tuple(frontmatter.get("routing_hints"), field_name="routing_hints", path=resolved)
        routing_default = bool(frontmatter.get("routing_default"))
        try:
            routing_priority = int(frontmatter.get("routing_priority") or 0)
        except Exception:
            routing_priority = 0
        target_artifacts = _string_tuple(frontmatter.get("target_artifacts"), field_name="target_artifacts", path=resolved)
        contract_model = str(frontmatter.get("contract_model") or "").strip()
        structure_mode = str(frontmatter.get("structure_mode") or "").strip()
        pre_retrieval_shell = _mapping(frontmatter.get("pre_retrieval_shell"), field_name="pre_retrieval_shell", path=resolved, required=False)
        binding_layers = _string_tuple(frontmatter.get("binding_layers"), field_name="binding_layers", path=resolved)
        try:
            core_chapter_h3_target = int(frontmatter.get("core_chapter_h3_target") or 0)
        except Exception:
            core_chapter_h3_target = 0
        query_defaults = _mapping(frontmatter.get("query_defaults"), field_name="query_defaults", path=resolved, required=False)
        overridable_query_fields = _string_tuple(
            frontmatter.get("overridable_query_fields"),
            field_name="overridable_query_fields",
            path=resolved,
        )
        quality_contract = _mapping(frontmatter.get("quality_contract"), field_name="quality_contract", path=resolved, required=False)
        loop_policy = _mapping(frontmatter.get("loop_policy"), field_name="loop_policy", path=resolved, required=False)
        stages = _parse_stages(frontmatter.get("stages"), path=resolved)
        variant_of = str(raw_frontmatter.get("variant_of") or "").strip()
        variant_overrides = _mapping(raw_frontmatter.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False)
        if not units_template:
            raise ValueError(f"Missing units_template in pipeline front matter: {resolved}")
        return PipelineSpec(
            path=resolved,
            name=name,
            version=version,
            units_template=units_template,
            default_checkpoints=default_checkpoints,
            profile=profile,
            routing_hints=routing_hints,
            routing_default=routing_default,
            routing_priority=routing_priority,
            target_artifacts=target_artifacts,
            contract_model=contract_model,
            structure_mode=structure_mode,
            pre_retrieval_shell=pre_retrieval_shell,
            binding_layers=binding_layers,
            core_chapter_h3_target=core_chapter_h3_target,
            query_defaults=query_defaults,
            overridable_query_fields=overridable_query_fields,
            quality_contract=quality_contract,
            loop_policy=loop_policy,
            stages=stages,
            variant_of=variant_of,
            variant_overrides=variant_overrides,
        )

    def query_default(self, key: str, default: Any = None) -> Any:
        return self.query_defaults.get(str(key or "").strip(), default)

    def allows_query_override(self, key: str) -> bool:
        return str(key or "").strip() in set(self.overridable_query_fields)
def _parse_frontmatter(text: str) -> dict[str, Any]:
    """Parse the leading `---`-delimited YAML front matter of a pipeline file into a mapping."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("Pipeline file must start with YAML front matter '---'")
    # Find the first closing delimiter after the opening line.
    end_idx = None
    for idx in range(1, len(lines)):
        if lines[idx].strip() == "---":
            end_idx = idx
            break
    if end_idx is None:
        raise ValueError("Unterminated YAML front matter (missing closing '---')")
    raw = "\n".join(lines[1:end_idx])
    data = yaml.safe_load(raw) or {}
    if not isinstance(data, dict):
        raise ValueError("Pipeline YAML front matter must be a mapping")
    return data
def _load_variant_aware_frontmatter(path: Path, seen: set[Path] | None = None) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return `(raw, effective)` front matter, resolving `variant_of` chains recursively."""
    resolved = path.resolve()
    active = set(seen or set())
    if resolved in active:
        # `active` is a set, so the earlier paths in this message are not guaranteed
        # to appear in traversal order; only the final entry is known to close the cycle.
        chain = " -> ".join(str(p) for p in [*active, resolved])
        raise ValueError(f"Cyclic `variant_of` chain detected: {chain}")
    active.add(resolved)
    text = resolved.read_text(encoding="utf-8")
    raw = _parse_frontmatter(text)
    variant_of = str(raw.get("variant_of") or "").strip()
    if not variant_of:
        return raw, dict(raw)
    _validate_raw_variant_frontmatter(raw, path=resolved)
    base_path = _resolve_pipeline_reference(resolved, variant_of)
    _, base_effective = _load_variant_aware_frontmatter(base_path, seen=active)
    # Layering order: base pipeline, then this file's name/version, then its variant_overrides.
    current = {key: raw[key] for key in ("name", "version") if key in raw}
    merged = _deep_merge(base_effective, current)
    merged = _deep_merge(
        merged,
        _mapping(raw.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False),
    )
    merged["variant_of"] = variant_of
    merged["variant_overrides"] = raw.get("variant_overrides") or {}
    return raw, merged
def _validate_raw_variant_frontmatter(raw: dict[str, Any], *, path: Path) -> None:
    allowed_raw_keys = {"name", "version", "variant_of", "variant_overrides"}
    extra_keys = sorted(str(key) for key in raw.keys() if str(key) not in allowed_raw_keys)
    if extra_keys:
        raise ValueError(
            "Variant pipeline files may only keep top-level keys "
            "`name`, `version`, `variant_of`, `variant_overrides`; "
            f"move the rest under `variant_overrides` in {path}: {', '.join(extra_keys)}"
        )
def _resolve_pipeline_reference(path: Path, ref: str) -> Path:
value = str(ref or "").strip()
if not value:
raise ValueError(f"Empty `variant_of` reference in {path}")
candidate = Path(value)
if candidate.is_absolute() and candidate.exists():
return candidate.resolve()
repo_root = path.parent.parent
for base in (path.parent, repo_root, repo_root / "pipelines"):
direct = (base / value).resolve()
if direct.exists():
return direct
stem = Path(value).name
if stem.endswith(".pipeline.md"):
stem = stem[: -len(".pipeline.md")]
if stem:
for base in (path.parent, repo_root / "pipelines"):
direct = (base / f"{stem}.pipeline.md").resolve()
if direct.exists():
return direct
raise ValueError(f"Could not resolve `variant_of: {value}` from {path}")
def _deep_merge(base: Any, override: Any) -> Any:
    if isinstance(base, list) and isinstance(override, dict) and any(str(key).startswith("__") for key in override.keys()):
        return _apply_list_patch(base, override)
    if isinstance(base, dict) and isinstance(override, dict):
        merged: dict[str, Any] = {str(k): v for k, v in base.items()}
        for key, value in override.items():
            if key in merged:
                merged[key] = _deep_merge(merged[key], value)
            else:
                merged[key] = value
        return merged
    return override


def _apply_list_patch(base: list[Any], override: dict[str, Any]) -> list[Any]:
    allowed_keys = {"__append__", "__prepend__", "__remove__", "__replace__"}
    bad_keys = sorted(str(key) for key in override.keys() if str(key) not in allowed_keys)
    if bad_keys:
        raise ValueError(
            "List patch overrides only support "
            "`__append__`, `__prepend__`, `__remove__`, `__replace__`; "
            f"got: {', '.join(bad_keys)}"
        )
    if "__replace__" in override:
        replacement = override.get("__replace__")
        if not isinstance(replacement, list):
            raise ValueError("List patch `__replace__` must be a YAML list")
        return list(replacement)
    current = list(base)
    remove_values = override.get("__remove__", [])
    prepend_values = override.get("__prepend__", [])
    append_values = override.get("__append__", [])
    for field_name, value in (
        ("__remove__", remove_values),
        ("__prepend__", prepend_values),
        ("__append__", append_values),
    ):
        if not isinstance(value, list):
            raise ValueError(f"List patch `{field_name}` must be a YAML list")
    if remove_values:
        current = [item for item in current if item not in remove_values]
    if prepend_values:
        current = list(prepend_values) + current
    if append_values:
        current = current + list(append_values)
    return current
def _mapping(value: Any, *, field_name: str, path: Path, required: bool) -> dict[str, Any]:
    if value is None:
        return {}
    if not isinstance(value, dict):
        raise ValueError(f"`{field_name}` must be a mapping in {path}")
    return {str(key): item for key, item in value.items()}


def _string_tuple(value: Any, *, field_name: str, path: Path) -> tuple[str, ...]:
    if value is None:
        return ()
    if not isinstance(value, list):
        raise ValueError(f"`{field_name}` must be a YAML list in {path}")
    out: list[str] = []
    for item in value:
        text = str(item or "").strip()
        if text:
            out.append(text)
    return tuple(out)


def _parse_stages(value: Any, *, path: Path) -> dict[str, PipelineStage]:
    if value is None:
        return {}
    items: list[tuple[str, Any]] = []
    if isinstance(value, dict):
        items = [(str(stage_id or "").strip(), data) for stage_id, data in value.items()]
    elif isinstance(value, list):
        for idx, item in enumerate(value):
            if not isinstance(item, dict):
                raise ValueError(f"`stages[{idx}]` must be a mapping in {path}")
            stage_id = str(item.get("id") or item.get("stage") or "").strip()
            if not stage_id:
                raise ValueError(f"`stages[{idx}]` is missing `id` in {path}")
            items.append((stage_id, item))
    else:
        raise ValueError(f"`stages` must be a mapping or list in {path}")
    stages: dict[str, PipelineStage] = {}
    for stage_id, data in items:
        stage = PipelineStage.from_mapping(stage_id, data, path=path)
        if not stage.id:
            raise ValueError(f"`stages` contains an empty stage id in {path}")
        stages[stage.id] = stage
    return stages
FILE:tooling/pipeline_text.py
from __future__ import annotations

import json
import re
from pathlib import Path
from typing import Any

from tooling.common import load_yaml, read_jsonl


def slug_unit_id(unit_id: str) -> str:
    raw = str(unit_id or '').strip()
    out: list[str] = []
    for ch in raw:
        out.append(ch if ch.isalnum() else '_')
    safe = ''.join(out).strip('_')
    return f'S{safe}' if safe else 'S'


def load_outline_sections(path: Path) -> list[dict[str, Any]]:
    outline = load_yaml(path) if path.exists() else []
    if not isinstance(outline, list):
        return []
    out: list[dict[str, Any]] = []
    for sec in outline:
        if not isinstance(sec, dict):
            continue
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        subsections: list[dict[str, str]] = []
        for sub in sec.get('subsections') or []:
            if not isinstance(sub, dict):
                continue
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                subsections.append({'id': sub_id, 'title': sub_title})
        if sec_id and sec_title:
            out.append({'id': sec_id, 'title': sec_title, 'subsections': subsections})
    return out
def iter_h3_units(path: Path) -> list[dict[str, str]]:
    units: list[dict[str, str]] = []
    for sec in load_outline_sections(path):
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        for sub in sec.get('subsections') or []:
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                units.append(
                    {
                        'section_id': sec_id,
                        'section_title': sec_title,
                        'sub_id': sub_id,
                        'title': sub_title,
                    }
                )
    return units


def read_jsonl_map(path: Path, key: str) -> dict[str, dict[str, Any]]:
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        val = str(rec.get(key) or '').strip()
        if val:
            out[val] = rec
    return out


def read_bib_keys(path: Path) -> set[str]:
    if not path.exists() or path.stat().st_size <= 0:
        return set()
    text = path.read_text(encoding='utf-8', errors='ignore')
    return set(re.findall(r'(?im)^@\w+\s*\{\s*([^,\s]+)\s*,', text))


def uniq_keep_order(items: list[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or '').strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out
def clean_excerpt(text: str, *, limit: int = 220) -> str:
    s = str(text or '').strip()
    s = s.replace('\n', ' ')
    s = re.sub(r'\s+', ' ', s)
    s = s.strip(' "\'`')
    s = s.replace('|', ', ')
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(' ', 1)[0].strip()
    return clipped if clipped else s[:limit].strip()


def citation_list(keys: list[str], *, max_keys: int = 3) -> str:
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'[@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'


def inline_evidence_phrase(keys: list[str], *, max_keys: int = 3) -> str:
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'in [@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'


def heading_blocks(md: str) -> list[tuple[str, str]]:
    blocks: list[tuple[str, str]] = []
    current = ''
    lines: list[str] = []
    for raw in (md or '').splitlines():
        if raw.startswith('### '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = raw[4:].strip()
            lines = []
            continue
        if raw.startswith('## '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = ''
            lines = []
            continue
        if current:
            lines.append(raw)
    if current:
        blocks.append((current, '\n'.join(lines).strip()))
    return blocks


def dump_jsonl_lines(records: list[dict[str, Any]]) -> str:
    return '\n'.join(json.dumps(r, ensure_ascii=False) for r in records).rstrip() + ('\n' if records else '')
Extract per-subsection “anchor facts” (NO PROSE) from evidence packs so the writer is forced to include concrete numbers/benchmarks/limitations instead of ge...
---
name: anchor-sheet
description: 'Extract per-subsection “anchor facts” (NO PROSE) from evidence packs so the writer is forced to include concrete
numbers/benchmarks/limitations instead of generic summaries.
**Trigger**: anchor sheet, anchor facts, numeric anchors, evidence hooks, 写作锚点, 数字锚点, 证据钩子.
**Use when**: `outline/evidence_drafts.jsonl` exists and you want stronger, evidence-anchored writing in `sections/*.md`.
**Skip if**: evidence packs are incomplete (fix `evidence-draft` first).
**Network**: none.
**Guardrail**: NO PROSE; do not invent facts; only select from existing evidence snippets/highlights.'
version: 0.1.0
metadata:
  openclaw:
    requires:
      anyBins:
        - python3
        - python
---
# Anchor Sheet (evidence → write hooks) [NO PROSE]
Purpose: make “what to actually say” explicit:
- select quantitative snippets (numbers/percentages)
- select evaluation anchors (benchmarks/datasets/metrics)
- select limitations/failure hooks
This prevents the writer from producing paragraph-shaped but **content-poor** prose.
## Inputs
- `outline/evidence_drafts.jsonl`
- `citations/ref.bib`
## Outputs
- `outline/anchor_sheet.jsonl`
## Output format (`outline/anchor_sheet.jsonl`)
JSONL (one object per H3 subsection).
Required fields:
- `sub_id`, `title`
- `anchors` (list; each anchor has `hook_type`, `text`, `citations`, and optional `paper_id/evidence_id/pointer`)
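As a rough illustration of the format above, one JSONL line could be built and checked as follows. This is only a sketch: `missing_fields` is a hypothetical helper, and every value in `example` (the `sub_id`, the `hook_type` label, the key `Smith2023`) is a made-up placeholder rather than output from a real run.

```python
import json

REQUIRED_FIELDS = ("sub_id", "title", "anchors")
REQUIRED_ANCHOR_FIELDS = ("hook_type", "text", "citations")


def missing_fields(record: dict) -> list[str]:
    """Return the names of required fields absent from one anchor-sheet record."""
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    for idx, anchor in enumerate(record.get("anchors") or []):
        for name in REQUIRED_ANCHOR_FIELDS:
            if name not in anchor:
                missing.append(f"anchors[{idx}].{name}")
    return missing


# Illustrative record only: none of these values come from a real workspace.
example = {
    "sub_id": "3.1",
    "title": "Example subsection",
    "anchors": [
        {
            "hook_type": "quantitative",  # assumed label, not a confirmed enum value
            "text": "Illustrative anchor text with a number (42%).",
            "citations": ["Smith2023"],
        }
    ],
}

# One JSON object per line, one line per H3 subsection.
line = json.dumps(example, ensure_ascii=False)
```

The optional `paper_id`/`evidence_id`/`pointer` fields are left out of the check because the format lists them as optional.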
## Workflow
1. Read `outline/evidence_drafts.jsonl`.
2. Prefer anchors that contain:
   - a number (%, counts, scores)
   - an explicit benchmark/dataset/metric name
   - an explicit limitation/failure statement
3. Filter anchors to only citation keys present in `citations/ref.bib`.
4. Write `outline/anchor_sheet.jsonl`.
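Step 3 of the workflow could look roughly like the sketch below. The key-extraction pattern mirrors `read_bib_keys` in `tooling/pipeline_text.py`. `filter_anchors` is a hypothetical helper, and dropping an anchor once none of its keys survive is an assumption about the intended behavior, not something the skill text states.

```python
import re


def bib_keys(bib_text: str) -> set[str]:
    """Collect entry keys from BibTeX text (same pattern as `read_bib_keys`)."""
    return set(re.findall(r"(?im)^@\w+\s*\{\s*([^,\s]+)\s*,", bib_text))


def filter_anchors(anchors: list[dict], allowed: set[str]) -> list[dict]:
    """Strip unknown citation keys; drop anchors left with no citations (assumed policy)."""
    kept = []
    for anchor in anchors:
        cites = [key for key in anchor.get("citations") or [] if key in allowed]
        if cites:
            kept.append({**anchor, "citations": cites})
    return kept
```

Filtering before writing keeps every anchor in `outline/anchor_sheet.jsonl` resolvable against `citations/ref.bib`.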
## Quality checklist
- [ ] Every H3 has >=10 cite-backed anchors (A150++ hard target).
- [ ] At least 1 anchor contains digits when the evidence pack contains digits.
- [ ] No placeholders (`TODO`/`…`/`(placeholder)`).
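The checklist above lends itself to a mechanical check. The following is a minimal sketch, assuming each record follows the output format described earlier. `checklist_issues` and its `pack_has_digits` flag are hypothetical names, and the real script may apply the rules differently.

```python
import re

# Placeholder markers named in the checklist.
PLACEHOLDER_RE = re.compile(r"TODO|…|\(placeholder\)")


def checklist_issues(record: dict, *, pack_has_digits: bool, min_anchors: int = 10) -> list[str]:
    """Return human-readable checklist violations for one H3 record."""
    anchors = record.get("anchors") or []
    cited = [a for a in anchors if a.get("citations")]
    issues = []
    if len(cited) < min_anchors:
        issues.append(f"only {len(cited)} cite-backed anchors (< {min_anchors})")
    # The digit rule only applies when the evidence pack itself contains digits.
    if pack_has_digits and not any(re.search(r"\d", str(a.get("text") or "")) for a in anchors):
        issues.append("no anchor contains digits")
    if any(PLACEHOLDER_RE.search(str(a.get("text") or "")) for a in anchors):
        issues.append("placeholder text found")
    return issues
```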
## Consumption policy (for C5 writers)
Anchors are intended to prevent “long but empty” prose. Treat them as **must-use hooks**, not optional ideas.
Recommended minimums per H3 (A150++):
- >=3 protocol anchors (benchmark/dataset/metric/budget/tool access)
- >=3 limitation/failure hooks (concrete, not generic “future work”)
- If digits exist in the evidence pack: include >=1 cited numeric anchor (digit + citation in the same paragraph)
Note:
- Anchor text is trimmed for readability and **does not** include ellipsis markers (to reduce accidental leakage into prose).
## Script
### Quick Start
- `python scripts/run.py --help`
- `python scripts/run.py --workspace workspaces/<ws>`
### All Options
- `--workspace <dir>`
- `--unit-id <U###>`
- `--inputs <semicolon-separated>`
- `--outputs <semicolon-separated>`
- `--checkpoint <C#>`
### Examples
- Default IO:
  - `python scripts/run.py --workspace workspaces/<ws>`
- Explicit IO:
  - `python scripts/run.py --workspace workspaces/<ws> --inputs "outline/evidence_drafts.jsonl;citations/ref.bib" --outputs "outline/anchor_sheet.jsonl"`
### Refinement marker (recommended; prevents churn)
When you are satisfied with anchor facts (and they are actually subsection-specific), create:
- `outline/anchor_sheet.refined.ok`
This is an explicit "I reviewed/refined this" signal:
- prevents scripts from regenerating and undoing your work
- (in strict runs) can be used as a completion signal before writing
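A script that honors this marker might guard regeneration along these lines. This is a sketch only: `should_regenerate` is a hypothetical helper, and the bundled `scripts/run.py` may implement the freeze differently.

```python
from pathlib import Path


def should_regenerate(workspace: Path, artifact: str = "outline/anchor_sheet.jsonl") -> bool:
    """Return False when the matching `.refined.ok` marker exists next to the artifact."""
    target = workspace / artifact
    # outline/anchor_sheet.jsonl -> outline/anchor_sheet.refined.ok
    marker = target.with_name(target.stem + ".refined.ok")
    return not marker.exists()
```

Creating an empty `outline/anchor_sheet.refined.ok` in the workspace is then enough to make the guard return `False`.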
FILE:AGENTS.md
# AGENTS.md
This exported skill uses `AGENTS.md` only as a local repo-root marker for bundled helper scripts.
Source skill: `anchor-sheet` from `research-units-pipeline-skills`.
Export slug: `anchor-sheet`.
FILE:pipelines/arxiv-survey-latex.pipeline.md
---
name: arxiv-survey-latex
version: 3.8
variant_of: arxiv-survey
variant_overrides:
  routing_hints: [latex, pdf, tex, 可编译, 编译]
  routing_default: false
  routing_priority: 20
  units_template: templates/UNITS.arxiv-survey-latex.csv
  target_artifacts:
    __append__:
      - latex/main.tex
      - latex/main.pdf
      - output/LATEX_BUILD_REPORT.md
  stages:
    C5:
      title: Draft + PDF
      required_skills:
        __append__:
          - latex-scaffold
          - latex-compile-qa
      optional_skills:
        __remove__:
          - latex-scaffold
          - latex-compile-qa
      produces:
        __append__:
          - latex/main.tex
          - latex/main.pdf
          - output/LATEX_BUILD_REPORT.md
---
# Pipeline: arXiv survey / review (MD-first + LaTeX/PDF)
Variant of `arxiv-survey`.
Use this pipeline only when the default deliverable must include:
- `latex/main.tex`
- `latex/main.pdf`
- `output/LATEX_BUILD_REPORT.md`
All non-PDF survey behavior, defaults, checkpoints, and earlier stages inherit from `arxiv-survey`.
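To see how the `__append__` / `__remove__` overrides in this variant's front matter combine with the base pipeline, here is a condensed sketch of the merge logic in the loader's `_deep_merge` and `_apply_list_patch` (validation omitted). The `base_c5` lists are abbreviated stand-ins for the real `arxiv-survey` C5 stage, not the full contract.

```python
from typing import Any


def apply_list_patch(base: list[Any], patch: dict[str, Any]) -> list[Any]:
    """Apply a list patch: replace wins; otherwise remove, then prepend, then append."""
    if "__replace__" in patch:
        return list(patch["__replace__"])
    removed = patch.get("__remove__", [])
    current = [item for item in base if item not in removed]
    return list(patch.get("__prepend__", [])) + current + list(patch.get("__append__", []))


def deep_merge(base: Any, override: Any) -> Any:
    """Recursively merge mappings; a dict with `__`-prefixed keys patches a base list."""
    if isinstance(base, list) and isinstance(override, dict) and any(str(k).startswith("__") for k in override):
        return apply_list_patch(base, override)
    if isinstance(base, dict) and isinstance(override, dict):
        merged = dict(base)
        for key, value in override.items():
            merged[key] = deep_merge(merged[key], value) if key in merged else value
        return merged
    return override


# Abbreviated base stage (illustrative subset of the real lists).
base_c5 = {
    "title": "Draft",
    "required_skills": ["subsection-writer", "section-merger"],
    "optional_skills": ["prose-writer", "latex-scaffold", "latex-compile-qa"],
}
override_c5 = {
    "title": "Draft + PDF",
    "required_skills": {"__append__": ["latex-scaffold", "latex-compile-qa"]},
    "optional_skills": {"__remove__": ["latex-scaffold", "latex-compile-qa"]},
}
merged_c5 = deep_merge(base_c5, override_c5)
```

The effect is that the two LaTeX skills move from optional to required in C5 while every other inherited entry stays in place.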
FILE:pipelines/arxiv-survey.pipeline.md
---
name: arxiv-survey
version: 3.8
profile: arxiv-survey
routing_hints: [survey, review, 综述, 调研, literature review]
routing_default: true
routing_priority: 10
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
- outline/transitions.md
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/claim_evidence_matrix.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- outline/writer_context_packs.jsonl
- citations/ref.bib
- citations/verified.jsonl
- sections/sections_manifest.jsonl
- sections/h3_bodies.refined.ok
- sections/paragraphs_curated.refined.ok
- sections/style_harmonized.refined.ok
- sections/opener_varied.refined.ok
- sections/abstract.md
- sections/S1.md
- sections/S2.md
- sections/discussion.md
- sections/conclusion.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/SCHEMA_NORMALIZATION_REPORT.md
- output/EVIDENCE_SELFLOOP_TODO.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/PARAGRAPH_CURATION_REPORT.md
- output/FRONT_MATTER_REPORT.md
- output/CHAPTER_LEADS_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/GLOBAL_REVIEW.md
- output/DRAFT.md
- output/MERGE_REPORT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.arxiv-survey.csv
contract_model: pipeline.frontmatter/v1
structure_mode: section_first
pre_retrieval_shell:
enabled: true
approval_surface: false
allowed_h2: [Introduction, Related Work, Core Chapters, Discussion, Conclusion]
binding_layers: [chapter_skeleton, section_bindings, section_briefs, subsection_mapping]
core_chapter_h3_target: 3
query_defaults:
max_results: 1800
core_size: 300
per_subsection: 28
global_citation_min_subsections: 4
draft_profile: survey
citation_target: recommended
evidence_mode: abstract
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- per_subsection
- global_citation_min_subsections
- draft_profile
- citation_target
- enrich_metadata
- evidence_mode
- fulltext_max_papers
- fulltext_max_pages
- fulltext_min_chars
- time_window.from
- time_window.to
quality_contract:
citation_policy:
unique_hard_floor: 150
unique_recommended: 165
structure_policy:
max_final_h2_by_profile:
survey: 8
deep: 9
max_h3_by_profile:
survey: 10
deep: 12
front_matter_policy:
survey:
introduction:
min_cites: 35
min_paras: 5
min_chars: 2600
related_work:
min_cites: 50
min_paras: 6
min_chars: 3200
deep:
introduction:
min_cites: 40
min_paras: 6
min_chars: 3000
related_work:
min_cites: 55
min_paras: 7
min_chars: 3600
subsection_policy:
survey:
min_unique_citations: 12
min_chars: 4200
deep:
min_unique_citations: 14
min_chars: 5200
loop_policy:
stage_retry_budget:
C1: 2
C2: 2
C3: 1
C4: 1
max_reroutes: 4
require_human_on_retry_after_approval: true
stages:
C0:
title: Init
mode: no_prose
required_skills: [workspace-init, pipeline-router]
optional_skills: []
produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
C1:
title: Retrieval & core set
mode: no_prose
required_skills: [literature-engineer, dedupe-rank]
optional_skills: [keyword-expansion, survey-seed-harvest]
produces: [papers/papers_raw.jsonl, papers/retrieval_report.md, papers/papers_dedup.jsonl, papers/core_set.csv]
C2:
title: Structure
mode: no_prose
required_skills: [taxonomy-builder, chapter-skeleton, section-bindings, section-briefs, outline-builder, section-mapper, outline-refiner, pipeline-router, human-checkpoint]
optional_skills: [outline-budgeter]
produces: [outline/taxonomy.yml, outline/chapter_skeleton.yml, outline/section_bindings.jsonl, outline/section_binding_report.md, outline/section_briefs.jsonl, outline/outline.yml, outline/mapping.tsv, outline/coverage_report.md, outline/outline_state.jsonl, output/REROUTE_STATE.json, DECISIONS.md]
human_checkpoint:
approve: scope + section skeleton + outline
write_to: DECISIONS.md
C3:
title: Evidence
mode: no_prose
required_skills: [pdf-text-extractor, paper-notes, subsection-briefs, chapter-briefs]
optional_skills: []
produces: [papers/fulltext_index.jsonl, papers/paper_notes.jsonl, papers/evidence_bank.jsonl, outline/subsection_briefs.jsonl, outline/chapter_briefs.jsonl]
C4:
title: Citations + evidence packs
mode: no_prose
required_skills: [citation-verifier, evidence-binder, evidence-draft, table-schema, anchor-sheet, table-filler, appendix-table-writer, schema-normalizer, writer-context-pack, evidence-selfloop, claim-matrix-rewriter]
optional_skills: [survey-visuals]
produces: [citations/ref.bib, citations/verified.jsonl, outline/evidence_bindings.jsonl, outline/evidence_binding_report.md, outline/table_schema.md, outline/tables_index.md, outline/tables_appendix.md, output/TABLES_APPENDIX_REPORT.md, outline/evidence_drafts.jsonl, outline/anchor_sheet.jsonl, output/SCHEMA_NORMALIZATION_REPORT.md, outline/writer_context_packs.jsonl, output/EVIDENCE_SELFLOOP_TODO.md, outline/claim_evidence_matrix.md]
C5:
title: Draft
mode: prose_allowed
required_skills: [front-matter-writer, chapter-lead-writer, subsection-writer, writer-selfloop, section-logic-polisher, argument-selfloop, paragraph-curator, style-harmonizer, opener-variator, evaluation-anchor-checker, transition-weaver, section-merger, post-merge-voice-gate, citation-diversifier, citation-injector, draft-polisher, global-reviewer, pipeline-auditor, artifact-contract-auditor]
optional_skills: [prose-writer, subsection-polisher, redundancy-pruner, terminology-normalizer, limitation-weaver, latex-scaffold, latex-compile-qa]
produces: [outline/transitions.md, sections/sections_manifest.jsonl, sections/h3_bodies.refined.ok, sections/paragraphs_curated.refined.ok, sections/style_harmonized.refined.ok, sections/opener_varied.refined.ok, sections/abstract.md, sections/S1.md, sections/S2.md, sections/discussion.md, sections/conclusion.md, output/WRITER_SELFLOOP_TODO.md, output/EVAL_ANCHOR_REPORT.md, output/ARGUMENT_SELFLOOP_TODO.md, output/SECTION_ARGUMENT_SUMMARIES.jsonl, output/ARGUMENT_SKELETON.md, output/PARAGRAPH_CURATION_REPORT.md, output/FRONT_MATTER_REPORT.md, output/CHAPTER_LEADS_REPORT.md, output/SECTION_LOGIC_REPORT.md, output/MERGE_REPORT.md, output/DRAFT.md, output/POST_MERGE_VOICE_REPORT.md, output/CITATION_BUDGET_REPORT.md, output/CITATION_INJECTION_REPORT.md, output/GLOBAL_REVIEW.md, output/AUDIT_REPORT.md, output/CONTRACT_REPORT.md]
---
# Pipeline: arXiv survey / review (MD-first)
Default contract (survey-grade, A150++):
- `queries.md` defaults are set for a *survey deliverable* (no silent downgrade): `core_size=300`, `per_subsection=28`, global unique citations hard floor `>=150` (recommended `>=165` when `core_size=300`; default `citation_target=recommended`).
- `draft_profile` controls **writing strictness** (`survey` vs `deep`), not “speed mode”.
- `evidence_mode` controls **evidence strength** (`abstract` default; `fulltext` optional and heavier).
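A minimal sketch of how these defaults and overrides might resolve, using the `query_defaults`, `overridable_query_fields`, and `citation_policy` values from the front matter. `effective_queries` and `citation_floor` are hypothetical helpers. Silently ignoring non-overridable keys, and falling back to the hard floor for any `citation_target` other than `recommended`, are both assumptions.

```python
from typing import Any

QUERY_DEFAULTS: dict[str, Any] = {
    "max_results": 1800,
    "core_size": 300,
    "per_subsection": 28,
    "global_citation_min_subsections": 4,
    "draft_profile": "survey",
    "citation_target": "recommended",
    "evidence_mode": "abstract",
}
OVERRIDABLE = {
    "keywords", "exclude", "max_results", "core_size", "per_subsection",
    "global_citation_min_subsections", "draft_profile", "citation_target",
    "enrich_metadata", "evidence_mode", "fulltext_max_papers",
    "fulltext_max_pages", "fulltext_min_chars", "time_window.from", "time_window.to",
}
CITATION_POLICY = {"unique_hard_floor": 150, "unique_recommended": 165}


def effective_queries(user: dict[str, Any]) -> dict[str, Any]:
    """Overlay user values onto the defaults, keeping only overridable fields."""
    out = dict(QUERY_DEFAULTS)
    for key, value in user.items():
        if key in OVERRIDABLE:  # assumption: other keys are ignored, not rejected
            out[key] = value
    return out


def citation_floor(queries: dict[str, Any]) -> int:
    """Pick the global unique-citation minimum implied by `citation_target`."""
    if queries.get("citation_target") == "recommended":
        return CITATION_POLICY["unique_recommended"]
    return CITATION_POLICY["unique_hard_floor"]  # assumed fallback
```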
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- survey-seed-harvest
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- `queries.md` may specify `max_results` and a year `time_window` (`from`/`to`); `arxiv-search` will paginate and attach arXiv metadata (categories, arxiv_id, etc.) when online.
- If you import an offline export but later have network, you can set `enrich_metadata: true` in `queries.md` (or run `arxiv-search --enrich-metadata`) to backfill missing abstracts/authors/categories via arXiv `id_list`.
- Evidence-first expectation (A150++): aim for a large dedup pool (target >=1200, not ~200) and a stable, verifiable core set (`core_size=300`) so later stages can bind wide in-scope citation pools without forcing out-of-scope drift.
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- chapter-skeleton
- section-bindings
- section-briefs
- outline-builder
- section-mapper
- outline-refiner
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
human_checkpoint:
- approve: scope + section skeleton + outline
- write_to: DECISIONS.md
Notes:
- `chapter-skeleton` is the first retrieval-informed chapter contract; it stays chapter-level and does not emit stable H3 ids.
- `section-bindings` measures chapter saturation before H3 decomposition and writes PASS/BLOCKED signals into `outline/section_binding_report.md`.
- `section-briefs` turns the chapter layer into decomposition guidance plus subsection seeds; `outline-builder` then derives the first stable `outline/outline.yml` from that section layer.
- `outline-refiner` now writes both `outline/outline_state.jsonl` and `output/REROUTE_STATE.json`; section-first reroute/block information should be read from those artifacts instead of inferred from prose notes.
- Evidence-first expectation: each subsection should be written as a *question to answer* (RQ) plus *evidence needs* (what kind of citations/results are required), not just generic scaffold bullets.
- Coverage default: `section-mapper` uses `queries.md:per_subsection` as the per-H3 mapping contract (A150++ default: 28) so later evidence binding and writing have enough in-scope citations to choose from.
- Diversity expectation: mapping should not over-reuse a few papers across unrelated H3s; reserve “global” works for genuinely cross-cutting citations (controlled by `global_citation_min_subsections`).
- Budget policy (paper-like): avoid H3 explosion; the outline gate uses `queries.md:draft_profile` to set max H3 (survey<=10, deep<=12).
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent H3s into fewer, thicker units, then rerun `section-mapper` → `outline-refiner` before `Approve C2`.
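The H3 budget rule in these notes can be expressed as a small check. The sketch below assumes the outline has the shape returned by `load_outline_sections` (a list of sections, each with a `subsections` list). `h3_over_budget` is a hypothetical helper, not the actual outline gate.

```python
# Limits taken from `quality_contract.structure_policy.max_h3_by_profile`.
MAX_H3_BY_PROFILE = {"survey": 10, "deep": 12}


def h3_over_budget(outline: list[dict], draft_profile: str = "survey") -> int:
    """Return how many H3 subsections exceed the profile budget (0 when within budget)."""
    total = sum(len(sec.get("subsections") or []) for sec in outline)
    return max(0, total - MAX_H3_BY_PROFILE[draft_profile])
```

A positive result is the signal to run `outline-budgeter` and merge adjacent H3s before approving C2.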
## Stage 3 - Evidence (C3) [NO PROSE]
required_skills:
- pdf-text-extractor
- paper-notes
- subsection-briefs
- chapter-briefs
produces:
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
Notes:
- `queries.md` can set `evidence_mode: "abstract"|"fulltext"` (A150++ default: `abstract`).
- `queries.md` can set `draft_profile: "survey"|"deep"` to control writing gate strictness (A150++ default: `survey`).
- If `evidence_mode: "fulltext"`, `pdf-text-extractor` can be tuned via `fulltext_max_papers`, `fulltext_max_pages`, `fulltext_min_chars`.
- `subsection-briefs` converts each H3 into a verifiable writing card (scope_rule/rq/axes/clusters/paragraph_plan) so writing does not copy outline scaffolds.
- Optional refinement markers (recommended): treat briefs as *contracts*, not scaffolds. If you manually refine them and want to prevent regeneration, create:
  - `outline/subsection_briefs.refined.ok`
  - `outline/chapter_briefs.refined.ok`
  These markers are used as explicit “reviewed/refined” signals and as a freeze switch (scripts won’t overwrite refined briefs).
## Stage 4 - Citations + evidence packs (C4) [NO PROSE]
required_skills:
- citation-verifier
- evidence-binder
- evidence-draft
- table-schema
- anchor-sheet
- table-filler
- appendix-table-writer
- schema-normalizer
- writer-context-pack
- evidence-selfloop
- claim-matrix-rewriter
optional_skills:
- survey-visuals
produces:
- citations/ref.bib
- citations/verified.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- output/SCHEMA_NORMALIZATION_REPORT.md
- outline/writer_context_packs.jsonl
- output/EVIDENCE_SELFLOOP_TODO.md
- outline/claim_evidence_matrix.md
Notes:
- `evidence-draft` turns paper notes into per-subsection evidence packs (claim candidates + concrete comparisons + eval protocol + limitations) that the writer must follow.
- `claim-matrix-rewriter` makes `outline/claim_evidence_matrix.md` a projection/index of evidence packs (not an outline expansion), so writer guidance stays evidence-first.
- `writer-context-pack` builds a deterministic per-H3 drafting pack (briefs + evidence + anchors + allowed cites), reducing hollow writing and making C5 more debuggable.
- Tables are part of the default survey deliverable, but split into two layers:
  - `outline/tables_index.md` (internal index; produced by `table-filler`; useful for planning/debugging; NOT inserted into the paper)
  - `outline/tables_appendix.md` (reader-facing; produced by `appendix-table-writer`; clean/publishable; inserted into the draft as an Appendix block by `section-merger`)
- Optional: `survey-visuals` can still produce timeline/figure specs as intermediate artifacts.
- Optional refinement markers (recommended): after you spot-check/refine C4 artifacts and want to freeze them, create:
  - `outline/evidence_bindings.refined.ok`
  - `outline/evidence_drafts.refined.ok`
  - `outline/anchor_sheet.refined.ok`
  - `outline/writer_context_packs.refined.ok`
  These markers make “reviewed/refined” explicit and prevent accidental regeneration/overwrite; strict mode relies on content checks (placeholders/blocking_missing/scope), not marker presence.
## Stage 5 - Draft (C5) [PROSE AFTER C2]
required_skills:
- front-matter-writer
- chapter-lead-writer
- subsection-writer
- writer-selfloop
- section-logic-polisher
- argument-selfloop
- paragraph-curator
- style-harmonizer
- opener-variator
- evaluation-anchor-checker
- transition-weaver
- section-merger
- post-merge-voice-gate
- citation-diversifier
- citation-injector
- draft-polisher
- global-reviewer
- pipeline-auditor
- artifact-contract-auditor
optional_skills:
- prose-writer
- subsection-polisher
- redundancy-pruner
- terminology-normalizer
- limitation-weaver
- latex-scaffold
- latex-compile-qa
produces:
- sections/sections_manifest.jsonl
- sections/abstract.md
- sections/discussion.md
- sections/conclusion.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/MERGE_REPORT.md
- output/DRAFT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/GLOBAL_REVIEW.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
Notes:
- C5 writing system (semantic + minimal artifacts; no extra machinery):
  - **Unit of work**: `sections/*.md` (front matter, H2 leads, H3 bodies). Avoid editing `output/DRAFT.md` directly until after merge.
  - **Single source of truth (locked conventions)**: `output/ARGUMENT_SKELETON.md` → `## Consistency Contract` (terminology, scope boundary, evaluation protocol fields, baseline naming).
  - **Write → check → fix (five gates)**:
    1) `writer-selfloop` → `output/WRITER_SELFLOOP_TODO.md`: file existence, depth, citation scope, paper voice.
    2) `section-logic-polisher` → `output/SECTION_LOGIC_REPORT.md`: paragraph linkage (no jump cuts / “paragraph islands”).
    3) `argument-selfloop` → `output/ARGUMENT_SELFLOOP_TODO.md` + `output/ARGUMENT_SKELETON.md` + `output/SECTION_ARGUMENT_SUMMARIES.jsonl`: section-level closure + premise/definition stability.
    4) `paragraph-curator` → `output/PARAGRAPH_CURATION_REPORT.md`: **select → evaluate → subset → fuse** so sections converge (reduce redundancy, strengthen synthesis) without changing citation keys.
    5) `evaluation-anchor-checker` → `output/EVAL_ANCHOR_REPORT.md`: final section-level numeric hygiene sweep so surviving numeric claims keep same-sentence task/metric/constraint context before merge.
  - **Openers-last**: draft the middle first; rewrite paragraph 1 last so it reflects real content (front matter + H3).
- Writing self-loop gate: `subsection-writer` ensures the full `sections/` file set exists (and emits `sections/sections_manifest.jsonl`); `writer-selfloop` blocks until depth/citation-scope/paper-voice checks pass, writing `output/WRITER_SELFLOOP_TODO.md` (PASS/FAIL).
- Argument self-loop gate: `argument-selfloop` blocks “smooth but hollow” sections by making the argument chain explicit (per-section paragraph moves + a global dependency skeleton). Its ledgers are intermediate artifacts and must never be merged into the paper.
- Style hygiene (C5 hard gate for `survey`/`deep`): treat `output/WRITER_SELFLOOP_TODO.md` Style Smells as mandatory fixes. Run `style-harmonizer` + `opener-variator` on flagged files, then rerun `writer-selfloop` before merge.
- Numeric hygiene gate: run `evaluation-anchor-checker` after the last section-level rewrites (`paragraph-curator` + `style-harmonizer` + `opener-variator`) and before merge. If later section rewrites touch the same H3 files, rerun it; do not wait for `pipeline-auditor` to discover underspecified numbers in the merged draft.
- Micro-fix routing (preferred over broad rewrites): if Style Smells are specific, use targeted micro-skills before a general harmonize pass:
  - opener cadence / “overview” narration → `opener-variator`
  - count-based limitation slots (“Two limitations…”) → `limitation-weaver`
  - underspecified numeric/performance claims (missing task/metric/budget) → `evaluation-anchor-checker`
- Triage rule (prevents patching evidence gaps with prose): if `writer-selfloop` FAILs because a subsection cannot meet `must_use` *in-scope* (thin packs / missing anchors / out-of-scope citation pressure), stop and rerun the evidence loop (`evidence-selfloop` + upstream C2/C3/C4) instead of padding prose.
- WebWeaver-style “planner vs writer” split (single agent, two passes):
  - Planner pass: for each section/subsection, pick the exact citation IDs to use from the evidence bank (`outline/evidence_drafts.jsonl`) and keep scope consistent with the outline.
  - Writer pass: write that section using only those citation IDs; avoid dumping the whole notes set into context.
- Treat this stage as an iteration loop: draft per H3 → logic-polish (thesis + connectors) → weave transitions → merge → de-template/cohere → global review → (if gaps) back to C3/C4 → regenerate.
- Post-merge voice gate: `post-merge-voice-gate` treats `outline/transitions.md` as a high-frequency injection source. If it FAILs, fix the *source* (usually transitions via `transition-weaver`, or the owning `sections/*.md`) and re-merge; do not “patch around it” in `draft-polisher`.
- Depth target (profile-aware): each H3 should be “few but thick” (少而厚; avoid stubs). Use `queries.md:draft_profile` as the contract:
  - `survey`: >=10 paragraphs + >=12 unique cites
  - `deep`: >=11 paragraphs + >=14 unique cites
  - All profiles: require >=2 concrete contrasts + evaluation anchoring + a cross-paper synthesis paragraph + an explicit limitation.
- Profile semantics: `survey` is the default deliverable contract; `deep` is stricter (and typically pairs well with `evidence_mode: fulltext`).
- Coherence target (paper-like): for every H2 chapter with H3 subsections, write a short **chapter lead** block (`sections/S<sec_id>_lead.md`) that previews the comparison axes and how the H3s connect (no new headings; avoid generic glue).
- Anti-template style contract (paper-like, not “outline narration”):
  - Avoid meta openers like “This subsection surveys/argues …” and slide-like navigation (“Next, we move from … / We now turn to …”).
  - Keep signposting light: avoid repeating a literal opener label across many subsections (e.g., `Key takeaway:`); vary opener phrasing and cadence.
  - Tone target: calm, academic, understated; delete hype words (`clearly`, `obviously`) and “PPT speaker notes”.
  - Keep evidence-policy disclaimers **once** in front matter (not repeated across H3s).
  - If you cite numbers, include minimal evaluation context (task + metric + constraint/budget/cost) in the same paragraph.
  - Citation shape must be reader-facing: no adjacent citation blocks (e.g., `[@a] [@b]`), no duplicate keys in one block (e.g., `[@a; @a]`), and avoid tail-only citation style by keeping mid-sentence citations in each H3.
- `section-merger` merges `sections/*.md` plus `outline/transitions.md` (within-chapter H3→H3 by default). Between-H2 transition insertion is optional: create `outline/transitions.insert_h2.ok` in the workspace if you want narrator-style handoffs included.
- Tables are part of the default deliverable: `outline/tables_appendix.md` is inserted into the draft by `section-merger` as a single Appendix block (index tables in `outline/tables_index.md` remain intermediate) unless `outline/tables.insert.off` exists. Other visuals (`outline/timeline.md`, `outline/figures.md`) remain intermediate by default.
- Citation scope policy: citations are subsection-first (from `outline/evidence_bindings.jsonl`), with limited reuse allowed within the same H2 chapter to reduce brittleness; avoid cross-chapter “free cite” drift.
- Controlled flexibility: bibkeys mapped to >= `queries.md:global_citation_min_subsections` subsections (A150++ default: 4) are treated as cross-cutting/global; see `allowed_bibkeys_global` in writer packs / `sections_manifest.jsonl`.
- If global unique citations are low, run `citation-diversifier` → `citation-injector` *before* `draft-polisher` (the polisher treats citation keys as immutable).
- `queries.md` can set `citation_target: recommended|hard` to control whether the recommended target is enforced as blocking (default: `recommended` for A150++).
- If you intentionally add/remove citations after an earlier polish run, reset the citation-anchoring baseline before rerunning `draft-polisher`:
  - delete `output/citation_anchors.prepolish.jsonl` (workspace-local), then rerun `draft-polisher`.
- Recommended skills (toolkit, not a rigid one-shot chain):
  - Modular drafting: `subsection-writer` → `writer-selfloop` → `section-logic-polisher` → `argument-selfloop` → `paragraph-curator` → `style-harmonizer` → `opener-variator` → `evaluation-anchor-checker` → `transition-weaver` → `section-merger` → `draft-polisher` → `global-reviewer` → `pipeline-auditor`.
  - Legacy one-shot drafting: `prose-writer` (kept for quick experiments; less debuggable).
- If the draft reads like “paragraph islands”, run `section-logic-polisher` and patch only failing `sections/S*.md` until PASS, then merge.
- Add `pipeline-auditor` after `global-reviewer` as a regression test (blocks on ellipsis, repeated boilerplate, and citation hygiene).
- If you also need a PDF deliverable, use `latex-scaffold` + `latex-compile-qa` (see `arxiv-survey-latex`).
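The profile-aware depth target above can be approximated with a small check. This is a minimal sketch, not the pipeline's own gate: it assumes paragraphs are separated by blank lines and that citations use Pandoc-style blocks such as `[@a; @b]`; the names `DEPTH_TARGETS` and `depth_failures` are hypothetical. It covers only the paragraph and unique-cite counts, not the contrast, anchoring, synthesis, or limitation requirements.

```python
from __future__ import annotations

import re

# Thresholds copied from the depth-target bullet; the dict layout is illustrative.
DEPTH_TARGETS = {
    "survey": {"paragraphs": 10, "unique_cites": 12},
    "deep": {"paragraphs": 11, "unique_cites": 14},
}
# Assumed citation syntax: Pandoc-style blocks such as [@a] or [@a; @b].
CITE_BLOCK = re.compile(r"\[(@[^\]]+)\]")


def unique_cite_keys(text: str) -> set[str]:
    """Collect the distinct citation keys used anywhere in the text."""
    keys = set()
    for block in CITE_BLOCK.findall(text):
        for part in block.split(";"):
            part = part.strip()
            if part.startswith("@"):
                keys.add(part[1:])
    return keys


def depth_failures(h3_text: str, profile: str) -> list[str]:
    """Return human-readable failures; an empty list means the counts pass."""
    target = DEPTH_TARGETS[profile]
    # Assumption: blank lines separate paragraphs; heading lines are not paragraphs.
    paragraphs = [
        p for p in re.split(r"\n\s*\n", h3_text)
        if p.strip() and not p.lstrip().startswith("#")
    ]
    failures = []
    if len(paragraphs) < target["paragraphs"]:
        failures.append(f"paragraphs: {len(paragraphs)} < {target['paragraphs']}")
    n_keys = len(unique_cite_keys(h3_text))
    if n_keys < target["unique_cites"]:
        failures.append(f"unique cites: {n_keys} < {target['unique_cites']}")
    return failures
```

A failing result here is a signal to go back to the evidence loop rather than to pad the subsection.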
## Quality gates (strict mode)
- Citation coverage: expect a large, verifiable bibliography (A150++ default: `core_size=300` → `ref.bib` ~300) and high cite density:
  - Per-H3: `survey` profile expects >=12 unique citations per H3 (and deeper profiles may require more).
  - Front matter: `survey` profile expects Introduction>=35 and Related Work>=50 unique citations (dense positioning; no cite dumps).
  - Global: `pipeline-auditor` gates on **global unique citations across the full draft**. A150++ defaults: hard `>=150`; recommended `>=165` (when bib=300). `queries.md:citation_target` controls which is blocking (default: `recommended`). If it fails, prefer `citation-diversifier` → `citation-injector` (in-scope, NO NEW FACTS) using each H3’s `allowed_bibkeys_selected` / `allowed_bibkeys_mapped` from `outline/writer_context_packs.jsonl`.
- Anti-template: drafts containing ellipsis placeholders (`…`) or leaked scaffold instructions (e.g., "enumerate 2-4 ...") should block and be regenerated from improved outline/mapping/evidence artifacts.
- Final polish hard gates (`survey`/`deep`): block on narration-template openers (e.g., `This subsection ...`), slide navigation phrasing, repeated opener stems / verbal tics (口癖), adjacent citation blocks (`[@a] [@b]`), duplicate keys in one block (`[@a; @a]`), and low H3 mid-sentence citation ratio (<30%).
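The citation-shape gates named above (adjacent blocks, duplicate keys, mid-sentence ratio) lend themselves to a regex sketch. The following is illustrative only: the source does not define how the mid-sentence ratio is computed, so this sketch assumes a citation counts as “tail” when only closing punctuation or a line end follows it. The function name `citation_shape_issues` is hypothetical.

```python
from __future__ import annotations

import re

CITE_BLOCK = re.compile(r"\[@[^\]]+\]")
ADJACENT = re.compile(r"\[@[^\]]+\]\s*\[@[^\]]+\]")
# Assumed definition of a tail citation: only sentence punctuation or end of line follows.
TAIL_AFTER = re.compile(r"\s*(?:[.!?]|$)", re.MULTILINE)


def block_keys(block: str) -> list[str]:
    """Split '[@a; @b]' into ['a', 'b']."""
    return [p.strip().lstrip("@") for p in block[1:-1].split(";") if p.strip()]


def citation_shape_issues(text: str, min_mid_ratio: float = 0.30) -> list[str]:
    """Return the shape problems found in one H3's text."""
    issues = []
    if ADJACENT.search(text):
        issues.append("adjacent citation blocks")
    blocks = list(CITE_BLOCK.finditer(text))
    for match in blocks:
        keys = block_keys(match.group(0))
        if len(keys) != len(set(keys)):
            issues.append(f"duplicate keys in {match.group(0)}")
    if blocks:
        # A block is mid-sentence when more sentence text follows it.
        mid = sum(1 for m in blocks if not TAIL_AFTER.match(text, m.end()))
        ratio = mid / len(blocks)
        if ratio < min_mid_ratio:
            issues.append(f"mid-sentence citation ratio {ratio:.0%} below {min_mid_ratio:.0%}")
    return issues
```

The 30% default mirrors the threshold quoted in the gate; the real auditor may count differently.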
FILE:pipelines/graduate-paper-pipeline.md
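# Pipeline: Chinese graduate thesis restructuring and finalization (research-stage draft)
> Status: research-stage draft
> Current positioning: study the workflow and the skills it needs first; **do not bind UNITS yet, and do not force it into an executable contract**
> Audience: Chinese graduate thesis writing where a school template, prior papers, Overleaf sources, PDFs, BibTeX, experiment figures/tables, and intermediate working documents already exist
> Core goal: restructure the “existing material” into a Chinese graduate thesis with a unified main line, smooth structure, complete evidence, consistent terminology, and a compilable deliverable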
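## 1. What this Pipeline really is
This is neither a linear “write a thesis from scratch” workflow nor an assembly workflow that “stitches several papers together”. It is a thesis-engineering Pipeline that is **driven by a question list, centered on a Markdown intermediate layer, and closed out in a TeX delivery layer**.
Its actual main loop is:
> `question list -> semantic localization -> Markdown restructuring -> TeX write-back -> compile review -> issue write-back -> next round`
The first goal of this Pipeline is therefore not to “get a draft of the body written quickly”, but to get the following right first:
1. Recover the existing material first, rather than editing blindly inside `tex`.
2. Restructure chapter roles around the thesis main line first, rather than copying the narrative of the original papers.
3. Work out structure, evidence, terminology, and figures/tables in the Markdown intermediate layer first, then write back to TeX.
4. Resolve structure and evidence problems first, then style and “AI flavor” removal.
5. The real convergence point is not “written once through”, but “the question list is cleared round by round, and compilation and review pass stably”.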
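## 2. The fundamental difference between this Pipeline and the survey Pipeline
It differs substantially from `arxiv-survey` / `arxiv-survey-latex` and must be modeled separately:
- the survey pipeline starts from “retrieve literature by topic and converge the structure layer by layer”;
- the graduate-paper pipeline starts from “inventory and restructuring of existing material”;
- the core intermediate artifacts of the survey pipeline are `outline/evidence_drafts/...`;
- the core intermediate artifacts of the graduate-paper pipeline are the outline, question list, intermediate chapter drafts, figure plan, and review records in `codex_md/`;
- the main goal of the survey pipeline is to “produce a survey”;
- the main goal of the graduate-paper pipeline is to “restructure existing research work into a finalized thesis that conforms to Chinese degree-thesis conventions”.
In one sentence:
> survey is closer to a “retrieval-driven evidence-writing workflow”;
> graduate-paper is closer to an “existing-material-driven thesis-engineering restructuring workflow”.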
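## 3. Workspace layering and key artifacts
This Pipeline needs a clear separation between the **thinking layer** and the **delivery layer**.
### 3.1 Directory responsibilities
- `main.tex`: the thesis entry point; assembles the cover, abstract, table of contents, body, appendices, acknowledgements, achievements, and references.
- `chapters/`: final delivery version of the body `.tex`.
- `abstract/`, `preface/`, `acknowledgement/`, `achievements/`: formal deliverables outside the body.
- `references/`: bibliography database and style files.
- `pdf/`: PDFs of published or submitted papers.
- `Overleaf_ref/`: existing sources, revised manuscripts, supplementary material.
- `codex_md/`: the intermediate working layer; this is the **thesis restructuring workspace**.
- `claude_md/`: review checklists and final-draft verification records.
- `tmp_layout/`, `tmp_layout2/`: trial figure/table layouts and page-layout rehearsals.
The most important distinction is:
- `codex_md/` is the **thinking / restructuring area**
- `chapters/` is the **delivery / finalization area**
Do not mix the two.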
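### 3.2 Intermediate artifacts worth fixing in place
It is recommended to keep the following artifacts fixed for the long term in this Pipeline:
- `codex_md/material_index.md`: inventory and index of existing material
- `codex_md/material_readiness.md`: material readiness and gap hints
- `codex_md/missing_info.md`: list of missing information
- `codex_md/question_list.md`: this round's question list, priorities, and acceptance criteria
- `codex_md/00_thesis_outline.md`: thesis main line, outline, chapter roles
- `codex_md/chapter_role_map.md`: mapping table of source material -> chapter -> role
- `codex_md/chapter_rewrite_rules.md`: chapter restructuring rules
- `codex_md/terminology_glossary.md`: unified terminology table
- `codex_md/symbol_metric_table.md`: unified symbol / metric table
- `codex_md/figure_plan.md`: figure/table and layout plan
- `claude_md/review_checklist.md`: review checklist for compilation, typesetting, data, and template items
- `output/THESIS_BUILD_REPORT.md`: compilation and final-draft check report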
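Because this pipeline is explicitly not yet bound to an executable contract, the following is only a sketch of how the fixed artifact list above could be checked in a workspace. The paths are copied from the list; the function name `missing_artifacts` and the rule that an empty file counts as missing are assumptions.

```python
from __future__ import annotations

from pathlib import Path

# Paths copied from the fixed-artifact list above.
FIXED_ARTIFACTS = [
    "codex_md/material_index.md",
    "codex_md/material_readiness.md",
    "codex_md/missing_info.md",
    "codex_md/question_list.md",
    "codex_md/00_thesis_outline.md",
    "codex_md/chapter_role_map.md",
    "codex_md/chapter_rewrite_rules.md",
    "codex_md/terminology_glossary.md",
    "codex_md/symbol_metric_table.md",
    "codex_md/figure_plan.md",
    "claude_md/review_checklist.md",
    "output/THESIS_BUILD_REPORT.md",
]


def missing_artifacts(workspace: Path) -> list[str]:
    """Return artifacts that are absent or empty under `workspace`."""
    missing = []
    for rel in FIXED_ARTIFACTS:
        path = workspace / rel
        # Assumption: an empty file is treated the same as a missing one.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing
```

Not every artifact is expected at every stage (the build report, for instance, only appears after a compile review), so the result is a status view rather than a hard gate.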
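## 4. The skills this Pipeline actually needs
Setting reuse and generalization aside for now and splitting by what this thesis Pipeline itself needs, the first version should use the following skills.
### 4.1 Main-chain skills
1. `thesis-workspace-init`
Responsibility: initialize the thesis project, remind the user to place material, set up the workspace, check whether `main.tex` compiles, and generate the material inventory.
2. `thesis-question-list`
Responsibility: create and maintain `question_list.md`, fixing each round's modification goals, issue priorities, boundaries, and acceptance criteria.
This is the **control-plane skill** of the whole Pipeline.
3. `thesis-source-role-mapper`
Responsibility: map existing papers / templates / sources / PDFs / figure material to “thesis roles”, not just `paper -> chapter`.
4. `thesis-chapter-reconstructor`
Responsibility: restructure each chapter's goal, weight, hand-offs, and narrative approach around the thesis main line.
This is the **core restructuring skill** of the whole Pipeline.
5. `thesis-markdown-aligner`
Responsibility: unify the main line, terminology, symbols, metrics, and figure/table conventions, so that the chapters converge in the Markdown intermediate layer into one thesis rather than a set of papers.
6. `thesis-tex-writeback`
Responsibility: write content already straightened out in the Markdown layer back to `chapters/*.tex`, syncing figures/tables, equations, cross-references, and chapter hand-offs.
7. `thesis-compile-review`
Responsibility: run compilation, warning triage, template-mode checks, data-consistency verification, and issue write-back.
It is responsible for forming the final quality loop.
8. `thesis-style-polisher`
Responsibility: once structure, evidence, and data are stable, polish toward Chinese degree-thesis style and remove “AI flavor”.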
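### 4.2 Parallel side-branch skills
9. `thesis-visual-layout-planner`
Responsibility: figure/table planning, Mermaid sketches, temporary layouts, and text-figure rhythm checks.
10. `thesis-frontmatter-sync`
Responsibility: sync the abstract, cover, appendices, achievements, acknowledgements, lists, and other non-body parts, so they are not left to be filled in at the very end.
11. `thesis-citation-enhance-review`
Responsibility: locate sentences that must be backed by citations, expand candidate literature, verify that citations match claims, and write back to `references/*.bib` and in-text citations.
### 4.3 How these 11 skills relate
The main chain is:
`thesis-workspace-init -> thesis-question-list -> thesis-source-role-mapper -> thesis-chapter-reconstructor -> thesis-markdown-aligner -> thesis-tex-writeback -> thesis-compile-review -> thesis-style-polisher`
The parallel side branches are:
- `thesis-visual-layout-planner`
- `thesis-frontmatter-sync`
- `thesis-citation-enhance-review`
Among them:
- `thesis-question-list` is the control plane for the whole workflow
- `thesis-chapter-reconstructor` is the core restructuring plane
- `thesis-compile-review` is the quality-loop plane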
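## 5. Pipeline overview (by stage)
| Stage | Goal | Core skills | Main outputs |
|---|---|---|---|
| Stage 0 | Project initialization and material intake | `thesis-workspace-init` | Directory skeleton, initial compile, material index, material readiness hints |
| Stage 1 | Recover existing material into the Markdown intermediate layer | `thesis-workspace-init` | Initial Markdown material, missing-information list, material gap hints |
| Stage 1.5 | Lock this round's issues and boundaries | `thesis-question-list` | `question_list.md` |
| Stage 2 | Map source material to thesis roles | `thesis-source-role-mapper` | Chapter role map, material grouped by chapter |
| Stage 2.5 | Restructure chapters around the main line | `thesis-chapter-reconstructor` | Restructured chapter Markdown |
| Stage 3 | Align, merge, and unify the whole-thesis structure | `thesis-markdown-aligner` | Stable outline, terminology table, symbol table, evidence gap list |
| Stage 3.5 | Figure/table and layout rehearsal | `thesis-visual-layout-planner` | Figure plan, text-figure mapping, temporary layout results |
| Stage 4 | Write back to the TeX delivery layer | `thesis-tex-writeback` | Compilable chapter `.tex`, first complete `main.pdf` |
| Stage 4.5 | Sync non-body parts | `thesis-frontmatter-sync` | Synced versions of the abstract, appendices, cover, achievements, etc. |
| Stage 5 | Citation enhancement and verification | `thesis-citation-enhance-review` | Citation reinforcement results, verification records |
| Stage 6 | Compilation, typesetting, and final review | `thesis-compile-review` | `THESIS_BUILD_REPORT`, review checklist |
| Stage 7 | Final polish and “AI flavor” removal | `thesis-style-polisher` | A more natural Chinese final draft |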
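## 6. Stage-by-stage details
### Stage 0: Project initialization and material intake
**Goal**
Turn the thesis into a maintainable project first, rather than a pile of scattered files.
**Main inputs**
- School template
- Existing repository / sources
- Basic metadata such as student ID, year, and Chinese and English titles
- Existing `main.tex`
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Basic directory skeleton
- Initial `main.tex` / initial `main.pdf`
- `codex_md/material_index.md`
- `codex_md/material_readiness.md`
**Hand-offs**
- If `main.tex` does not compile yet, do not enter the body-restructuring stages
- If material is not yet in place, do not start the question list or chapter restructuring
**Execution focus**
- Make clear which areas are material, which are workspace, and which are delivery
- Ensure “it compiles” first, then talk about “it reads well”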
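### Stage 1: Recover existing material into the Markdown intermediate layer
**Goal**
Recover the reusable content from the existing `template / tex / Overleaf / PDF / bib` into a Markdown intermediate layer, rather than hard-editing at the TeX layer.
**Main inputs**
- School template
- Existing `.tex`
- Overleaf sources
- PDFs
- Bibliography database
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Initial Markdown chapter material
- `codex_md/material_readiness.md`
- `codex_md/missing_info.md`
- Material index and list of information still to be supplied
**Hand-offs**
- Stage 1 outputs feed directly into `thesis-question-list` and `thesis-source-role-mapper`
**Execution focus**
- The focus is “inventory of material assets”, not “writing pretty sentences first”
- Explicitly mark experiment details, figure captions, citations, and numeric checks that still need to be supplied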
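### Stage 1.5: Lock the question list and this round's goals
**Goal**
Establish this round's control plane, making clear what this round fixes and what it does not.
**Main inputs**
- Initial Markdown material
- Review feedback from the previous round
- Current compilation / structure / style issues
**Core skill**
- `thesis-question-list`
**Main outputs**
- `codex_md/question_list.md`
**Hand-offs**
- This step is the starting point for all later restructuring and write-back
- Any structural dispute should first come back here to be re-prioritized
**Execution focus**
- Every issue must state clearly: what it is, why it matters, how to fix it, and what counts as accepted
- The question list is not a memo; it is the entry point for iteration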
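### Stage 2: Map source material to thesis roles
**Goal**
Map existing material to chapter roles in the thesis, rather than doing only a mechanical “paper to chapter” assignment.
**Main inputs**
- Existing papers, PDFs, sources, figures/tables
- Initial Markdown material
- `question_list.md`
**Core skill**
- `thesis-source-role-mapper`
**Main outputs**
- `codex_md/chapter_role_map.md`
- Markdown material grouped by chapter
- Content boundaries for each chapter
**Hand-offs**
- This is the direct input to `thesis-chapter-reconstructor`
**Execution focus**
- Distinguish: method core / background support / validation evidence / system implementation / material for limitations and outlook
- If this mapping is wrong, the whole thesis will be skewed afterwards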
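### Stage 2.5: Restructure chapters around the main line
**Goal**
Convert paper-style narrative into thesis-style narrative.
**Main inputs**
- Chapter role map
- `question_list.md`
- Markdown drafts for each chapter
**Core skill**
- `thesis-chapter-reconstructor`
**Main outputs**
- Restructured chapter Markdown
- `codex_md/chapter_rewrite_rules.md`
**Hand-offs**
- This is the core stage of the whole Pipeline
- All later alignment, write-back, and compilation build on the restructuring done here
**Execution focus**
- Make clear which question each chapter must answer
- Decide which content to strengthen, weaken, move up, or push down
- Rewrite the hand-offs with the preceding and following chapters
- Change the “selling-point narrative” into a “thesis main-line narrative”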
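### Stage 3: Markdown alignment, merging, and unification
**Goal**
Make the whole thesis become “one thesis” at the intermediate layer first, rather than a splice of several papers.
**Main inputs**
- Restructured chapter Markdown
- `question_list.md`
**Core skill**
- `thesis-markdown-aligner`
**Main outputs**
- Stable `codex_md/00_thesis_outline.md`
- `codex_md/terminology_glossary.md`
- `codex_md/symbol_metric_table.md`
- Evidence gap list
**Hand-offs**
- While structure, terminology, or metrics are unstable, do not enter `tex-writeback`
**Execution focus**
- Unify the research main line, terminology, abbreviations, symbols, metrics, and figure/table conventions
- Decide which content is explained once in Chapter 2, which is expanded in Chapters 3-5, and which goes to the appendix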
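### Stage 3.5: Figure/table and layout rehearsal
**Goal**
Plan figures, tables, and layout before the body is finalized; do not let figures become a last-minute patch job.
**Main inputs**
- Stable outline
- Chapter Markdown
- Existing figures/tables and experiment results
**Core skill**
- `thesis-visual-layout-planner`
**Main outputs**
- `codex_md/figure_plan.md`
- `mermaid/` diagram drafts
- `tmp_layout/`, `tmp_layout2/` trial layout results
**Hand-offs**
- The figure/table plan directly affects the text-figure structure in `tex-writeback`
**Execution focus**
- Make clear which figures/tables each chapter needs to support the main line
- Sketch first, then do a temporary layout, then decide final placement and captions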
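### Stage 4: Write back to the TeX delivery layer
**Goal**
Write content already straightened out in the Markdown layer back to `chapters/*.tex` and enter the delivery layer.
**Main inputs**
- Stable Markdown chapters
- Figure/table plan
- Unified terminology / symbol / metric tables
**Core skill**
- `thesis-tex-writeback`
**Main outputs**
- `chapters/*.tex`
- First complete `main.pdf`
**Hand-offs**
- If structural problems surface at this stage, in principle go back and fix them in the Markdown layer first; do not reinvent structure in the TeX layer
**Execution focus**
- Sync figures/tables, equations, cross-references, chapter introductions, and chapter summaries
- The TeX layer is responsible for delivery, not for rethinking structure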
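### Stage 4.5: Sync non-body parts
**Goal**
Maintain the abstract, appendices, cover, achievements, acknowledgements, and other non-body parts in parallel, so they are not pushed to the end and written up as low-quality filler.
**Main inputs**
- Stable body version
- School template items
- Chinese and English titles and abstract information
**Core skill**
- `thesis-frontmatter-sync`
**Main outputs**
- `abstract/abstract.tex`
- `abstract/abstract-en.tex`
- `preface/...`
- `appendix/...`
- `acknowledgement/...`
- `achievements/...`
**Hand-offs**
- The final review in Stage 6 checks these parts together with the body
**Execution focus**
- Chinese and English titles and abstracts are consistent
- Numbers, metrics, and method names in the abstract match the body
- Appendices hold only supplementary content and do not repeat the main line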
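### Stage 5: Literature enhancement and citation verification
**Goal**
Systematically fill in “sentences that should be backed by citations”, and ensure citations match the claims.
**Main inputs**
- Current Markdown / TeX version
- `references/*.bib`
- Existing PDFs / reference works
**Core skill**
- `thesis-citation-enhance-review`
**Main outputs**
- Enhanced bibliography database
- Citation reinforcement records
- Citation verification records
**Hand-offs**
- Literature enhancement can run in parallel with Stages 3 and 4, but it must close out before final polishing
**Execution focus**
- Prioritize sentences that must have citation support: factual descriptions, method taxonomies, classic results, metric definitions, data sources
- Get citations right first, then add more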
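### Stage 6: Compilation, typesetting, and final review
**Goal**
Form a quality loop: not just “it compiles”, but “it is fit to submit”.
**Main inputs**
- Complete TeX project
- Synced non-body parts
- review checklist
**Core skill**
- `thesis-compile-review`
**Main outputs**
- `output/THESIS_BUILD_REPORT.md`
- `claude_md/review_checklist.md`
- List of closed / still-open issues
**Hand-offs**
- Issues found in Stage 6 determine which level (Stage 1.5 / 2.5 / 3 / 4) to return to for the fix
**Execution focus**
- A successful compile is only the minimum requirement
- Warnings must be triaged: blocks submission / must fix / record and observe
- Also check: data consistency, cite key validity, whether template parameters are correct, and whether terminology and abbreviations are unified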
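The three-tier warning triage can be illustrated with a small log classifier. This is a sketch under stated assumptions: the tier names paraphrase the three levels above, while the mapping from log patterns to tiers is entirely illustrative (a school template may justify a different split). Only common LaTeX log message shapes are matched; the function name `triage_log` is hypothetical.

```python
from __future__ import annotations

import re

# Assumed mapping from log patterns to tiers; adjust to the template's real rules.
TIER_RULES = [
    ("block-submission", re.compile(r"LaTeX Warning: (Citation|Reference) .* undefined")),
    ("must-fix", re.compile(r"Overfull \\hbox|multiply defined")),
]
DEFAULT_TIER = "record-and-observe"


def triage_log(log_text: str) -> dict[str, list[str]]:
    """Group warning-like log lines into the three tiers."""
    tiers = {"block-submission": [], "must-fix": [], DEFAULT_TIER: []}
    for line in log_text.splitlines():
        # Skip lines that do not look like warnings or box reports.
        if not any(tag in line for tag in ("Warning", "Overfull", "Underfull")):
            continue
        for tier, pattern in TIER_RULES:
            if pattern.search(line):
                tiers[tier].append(line.strip())
                break
        else:
            # No rule matched: keep the line, but only as an observation.
            tiers[DEFAULT_TIER].append(line.strip())
    return tiers
```

Anything in the first tier would send the run back to an earlier stage; the last tier only needs to be recorded in the build report.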
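### Stage 7: Final polish and “AI flavor” removal
**Goal**
Polish toward Chinese degree-thesis style only after structure, evidence, and data are stable.
**Main inputs**
- Thesis version that has passed compilation and review
- `中文写作要求.md` (Chinese writing requirements)
- `GPT口癖与高频用词调研.md` (survey of GPT verbal tics and high-frequency wording)
**Core skill**
- `thesis-style-polisher`
**Main outputs**
- A more natural Chinese final draft
- De-templated chapter introductions, chapter summaries, and conclusion and outlook
**Hand-offs**
- Stage 7 must come last
- If polishing exposes structure or evidence problems, fall back to an earlier stage rather than covering them up with polish
**Execution focus**
- Prioritize chapter introductions, chapter summaries, contribution statements, and the conclusion and outlook
- Tone down template-speak, promotional tone, and common AI verbal tics
- The goal is “more like a degree thesis”, not “more like an AI-optimized paper”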
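## 7. The four fixed loops of this Pipeline
To keep it from degrading again into a “linear SOP”, fix the following four loops in place:
### 7.1 Structure loop
`outline -> a chapter's md -> chapter imbalance found -> back to outline / question_list`
### 7.2 Content loop
`paper / Overleaf / PDF -> extract content -> restructure narrative -> still reads like the original paper -> restructure again`
### 7.3 Typesetting loop
`tex -> compile -> warning / figure placement / cross-reference problems found -> back to tex or figure planning`
### 7.4 Style loop
`body stable -> AI polish -> template-speak / AI flavor found -> back to writing guidelines and wording control`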
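## 8. Execution priority for this Pipeline
If resources are limited at run time, the priority order should be:
1. `thesis-question-list`
2. `thesis-source-role-mapper`
3. `thesis-chapter-reconstructor`
4. `thesis-markdown-aligner`
5. `thesis-tex-writeback`
6. `thesis-compile-review`
In other words:
- What truly cannot be skipped is “question list + material role mapping + chapter restructuring + compile review”
- What truly should not be done earliest is “polishing and AI flavor removal”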
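## 9. Current conclusion
The first principle of this graduate-paper pipeline should be:
> **Restructure first, then write; converge in the intermediate layer first, then deliver in TeX; get evidence and structure right first, then style and polish.**
Therefore, if this Pipeline is later made formally executable, the top priority is not “one big all-in-one runner”, but getting the following skills solid first:
- `thesis-workspace-init`
- `thesis-question-list`
- `thesis-source-role-mapper`
- `thesis-chapter-reconstructor`
- `thesis-compile-review`
Once these five skills are solid, this Chinese graduate thesis Pipeline has real practical value.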
FILE:pipelines/idea-brainstorm.pipeline.md
---
name: idea-brainstorm
version: 4.0
profile: idea-brainstorm
routing_hints: [idea, ideation, brainstorm, 点子, 选题, 找方向, 找 idea]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.idea-brainstorm.csv
contract_model: pipeline.frontmatter/v1
query_defaults:
draft_profile: idea_brainstorm
max_results: 1800
core_size: 100
evidence_mode: abstract
direction_pool_min: 12
direction_pool_max: 24
idea_screen_top_n: 10
idea_shortlist_size: 5
report_top_n: 3
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- evidence_mode
- direction_pool_min
- direction_pool_max
- idea_screen_top_n
- idea_shortlist_size
- report_top_n
- time_window.from
- time_window.to
quality_contract:
memo_bundle:
required_outputs: [output/REPORT.md, output/APPENDIX.md, output/REPORT.json]
signal_policy:
min_rows: 10
direction_policy:
shortlist_min: 3
shortlist_max: 5
keep_min: 3
cluster_diversity_min: 2
lead_diversity_target: 3
lead_diversity_axes: [cluster, direction_type, program_kind]
screening_policy:
keep_rank_max: 7
maybe_rank_max: 12
score_weights:
discussion_worthiness: 0.24
academic_value: 0.22
evidence_grounding: 0.18
direction_distinctness: 0.16
first_probe_clarity: 0.10
thesis_potential: 0.10
loop_policy:
stage_retry_budget:
C1: 2
C3: 1
C4: 1
max_reroutes: 4
require_human_on_retry_after_approval: true
stages:
C0:
title: Init + idea brief
mode: no_prose
required_skills: [workspace-init, pipeline-router, idea-brief, human-checkpoint]
optional_skills: []
produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/trace/IDEA_BRIEF.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
human_checkpoint:
approve: brainstorm brief
write_to: DECISIONS.md
C1:
title: Retrieval + core set
mode: no_prose
required_skills: [literature-engineer, dedupe-rank]
optional_skills: []
produces: [papers/papers_raw.jsonl, papers/papers_dedup.jsonl, papers/core_set.csv, papers/retrieval_report.md]
C2:
title: Idea landscape / focus
mode: no_prose
required_skills: [taxonomy-builder, pipeline-router, human-checkpoint]
optional_skills: []
produces: [outline/taxonomy.yml, DECISIONS.md]
human_checkpoint:
approve: focus clusters / lenses + exclusions
write_to: DECISIONS.md
C3:
title: Evidence signals
mode: no_prose
required_skills: [paper-notes, idea-signal-mapper]
optional_skills: []
produces: [papers/paper_notes.jsonl, papers/evidence_bank.jsonl, output/trace/IDEA_SIGNAL_TABLE.md, output/trace/IDEA_SIGNAL_TABLE.jsonl]
C4:
title: Direction pool + screening
mode: short_prose_ok
required_skills: [idea-direction-generator, idea-screener]
optional_skills: []
produces: [output/trace/IDEA_DIRECTION_POOL.md, output/trace/IDEA_DIRECTION_POOL.jsonl, output/trace/IDEA_SCREENING_TABLE.md, output/trace/IDEA_SCREENING_TABLE.jsonl]
C5:
title: Shortlist + memo synthesis + self-loop
mode: prose_allowed
required_skills: [idea-shortlist-curator, idea-memo-writer, deliverable-selfloop, artifact-contract-auditor]
optional_skills: []
produces: [output/trace/IDEA_SHORTLIST.md, output/trace/IDEA_SHORTLIST.jsonl, output/REPORT.md, output/APPENDIX.md, output/REPORT.json, output/DELIVERABLE_SELFLOOP_TODO.md, output/CONTRACT_REPORT.md]
---
# Pipeline: research idea brainstorm (signals -> directions -> memo)
Goal: produce a **discussion-ready research-idea memo** for PI / PhD readers.
This pipeline is for “找 research idea / brainstorm / 选题 / 找方向”.
It is **not** for writing a survey draft and **not** for generating execution-grade project specs.
The terminal deliverable is a single default entrypoint:
- `output/REPORT.md`
That memo should feel like a mature brainstorm artifact:
- grounded in literature,
- small enough to discuss,
- clear about what is promising,
- honest about uncertainty,
- and rich enough to guide the next discussion round.
Artifact policy:
- Reader-facing terminal artifacts:
  - `output/REPORT.md`
  - `output/APPENDIX.md`
  - `output/REPORT.json`
- Trace artifacts stay under `output/trace/`.
- The run is only shareable when both layers exist and pass contract audit.
Default profile:
- Retrieval route: `literature-engineer`
- `core_size=100`
- Evidence mode: `abstract`
- Signal table: 10-20 rows
- Direction pool: 12-24
- Shortlist: 3-5
- Final memo lead directions: 3
## Stage 0 - Init + idea brief (C0)
required_skills:
- workspace-init
- pipeline-router
- idea-brief
- human-checkpoint
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
Notes:
- The brief is the single source of truth for topic, audience, constraints, exclusions, targets, and query buckets.
- Default behavior: block once at C0 for human approval of the brief before retrieval.
## Stage 1 - Retrieval + core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
Notes:
- Use multi-query buckets from `queries.md`.
- If recall is too low/noisy, fix it upstream and rerun C1.
## Stage 2 - Idea landscape / focus (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- pipeline-router
- human-checkpoint
produces:
- outline/taxonomy.yml
- DECISIONS.md
human_checkpoint:
- approve: focus clusters / lenses + exclusions
- write_to: DECISIONS.md
Notes:
- Taxonomy is an idea landscape, not a paper outline.
- Default behavior: pause at C2 so the human can choose a few promising focus lenses.
## Stage 3 - Evidence signals (C3) [table-first]
required_skills:
- paper-notes
- idea-signal-mapper
produces:
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
Notes:
- The signal table should capture tensions, missing pieces, and academically meaningful axes rather than proposal-ready wedges.
- Keep this stage compact and table-first; no long prose.
## Stage 4 - Direction pool + screening (C4)
required_skills:
- idea-direction-generator
- idea-screener
produces:
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
Notes:
- Expansion should generate discussion-worthy research directions, not an operator cartesian product.
- Screening should score discussion value, distinctness, evidence grounding, and thesis potential.
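The screening step can be pictured as a weighted sum followed by rank-based bucketing. The weights and the `keep_rank_max` / `maybe_rank_max` values below are copied from this pipeline's front matter; everything else is an assumption made for illustration: the 0-1 score scale, the function names, and the `"drop"` label for ranks beyond `maybe_rank_max`. This is not the `idea-screener` implementation.

```python
from __future__ import annotations

# Weights copied from `screening_policy.score_weights` in the front matter.
SCORE_WEIGHTS = {
    "discussion_worthiness": 0.24,
    "academic_value": 0.22,
    "evidence_grounding": 0.18,
    "direction_distinctness": 0.16,
    "first_probe_clarity": 0.10,
    "thesis_potential": 0.10,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-axis scores (assumed to lie in [0, 1]) into one number."""
    missing = SCORE_WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing score axes: {sorted(missing)}")
    return sum(weight * scores[axis] for axis, weight in SCORE_WEIGHTS.items())


def bucket_by_rank(rank: int, keep_rank_max: int = 7, maybe_rank_max: int = 12) -> str:
    """Map a 1-based rank to a screening bucket; 'drop' is an assumed label."""
    if rank <= keep_rank_max:
        return "keep"
    if rank <= maybe_rank_max:
        return "maybe"
    return "drop"
```

Since the six weights sum to 1.0, a direction that scores 1.0 on every axis gets a combined score of 1.0, which keeps the combined number on the same scale as its inputs.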
## Stage 5 - Shortlist + memo synthesis + self-loop (C5)
required_skills:
- idea-shortlist-curator
- idea-memo-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- `output/trace/IDEA_SHORTLIST.md` is an internal convergence layer, not the user-facing final answer.
- `output/REPORT.md` is the terminal deliverable and should read like a discussion-ready research idea brainstorm memo.
- `deliverable-selfloop` should evaluate the full memo bundle plus its trace chain.
FILE:pipelines/lit-snapshot.pipeline.md
---
name: lit-snapshot
version: 2.3
profile: lit-snapshot
routing_hints: [snapshot, 快照, one-page, one page, 48h]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- outline/taxonomy.yml
- outline/outline.yml
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.lit-snapshot.csv
---
# Pipeline: literature snapshot (24-48h) [bullets-first]
Goal: a compact, reader-facing snapshot (`output/SNAPSHOT.md`) with a small but usable paper set and a paper-like structure. This is intentionally lighter than `arxiv-survey*` (no evidence packs, no BibTeX, no LaTeX).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- arxiv-search
- dedupe-rank
optional_skills:
- keyword-expansion
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Snapshot default: aim for a smaller but diverse set (e.g., core_set ~20-40) rather than a survey-scale pool.
- Practical retrieval loop (multi-query, then refine):
  - Start broad with multiple query buckets (synonyms/acronyms/subtopics) and merge the results.
  - If too few: add buckets + widen time window; if too noisy: rewrite keywords + add exclusions; rerun C1.
- If the snapshot feels generic, the fix is almost always upstream: broaden `queries.md` (more synonyms, fewer excludes, wider time window) and rerun C1.
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- outline-builder
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/outline.yml
human_checkpoint:
- approve: scope + outline (snapshot ToC)
- write_to: DECISIONS.md
Notes:
- Paper-like default: prefer fewer, thicker sections. For snapshots, avoid H3 explosion; treat H2 as the main organizing unit.
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent nodes and keep the ToC readable before writing.
## Stage 3 - Snapshot (C3) [SHORT PROSE OK]
required_skills:
- snapshot-writer
- deliverable-selfloop
- artifact-contract-auditor
optional_skills:
- prose-writer
produces:
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- The deliverable is bullets-first and pointer-heavy: every non-trivial claim should attach 1-2 concrete paper pointers from `papers/core_set.csv`.
- Avoid outline narration (e.g., `This section surveys ...`); write content claims + why-it-matters + pointers.
FILE:pipelines/peer-review.pipeline.md
---
name: peer-review
version: 2.3
profile: peer-review
routing_hints: [peer review, review report, referee, 审稿]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PAPER.md
- output/CLAIMS.md
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.peer-review.csv
---
# Pipeline: peer review / referee report
Goal: produce an actionable referee report (`output/REVIEW.md`) that is traceable to the submitted paper’s claims and evidence (no free-floating advice, no invented comparisons).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
## Stage 1 - Claims (C1)
required_skills:
- manuscript-ingest
- claims-extractor
produces:
- output/PAPER.md
- output/CLAIMS.md
Notes:
- Input expectation: you must have the manuscript text available as `output/PAPER.md` (or equivalent extracted text) before running this stage.
- Contract: every claim must include a source pointer (section/page/quote) so later critique is auditable.
## Stage 2 - Evidence audit (C2)
required_skills:
- evidence-auditor
- novelty-matrix
produces:
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
Notes:
- Evidence-auditor should write gaps/risks/verification steps, not “fix the paper”.
- Novelty-matrix should be conservative: compare against cited/known related work; if you lack sources, record that limitation explicitly.
## Stage 3 - Rubric write-up (C3)
required_skills:
- rubric-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Prefer concrete, minimal fixes (what experiment/ablation/analysis would resolve the concern) over generic “needs more experiments”.
FILE:pipelines/systematic-review.pipeline.md
---
name: systematic-review
version: 2.3
profile: systematic-review
routing_hints: [systematic review, systematic, prisma, 系统综述]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PROTOCOL.md
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/screening_log.csv
- papers/extraction_table.csv
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.systematic-review.csv
---
# Pipeline: systematic review (PRISMA-style)
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Protocol (C1)
required_skills:
- protocol-writer
produces:
- output/PROTOCOL.md
human_checkpoint:
- approve: protocol locked (query, inclusion/exclusion, databases, time window)
- write_to: DECISIONS.md
## Stage 2 - Retrieval & candidate pool (C2)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- arxiv-search
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Systematic-review contract: keep the candidate pool auditable. Prefer dedupe (good) over aggressive ranking/filtering (dangerous).
- `dedupe-rank` default: for this pipeline, it keeps the full deduped pool in `papers/core_set.csv` unless you explicitly set `queries.md:core_size` (screening should not silently drop papers).
## Stage 3 - Screening (C3)
required_skills:
- screening-manager
produces:
- papers/screening_log.csv
Notes:
- Use `papers/papers_dedup.jsonl` (or `papers/core_set.csv`) as the candidate list; `papers/screening_log.csv` must include protocol-grounded reasons for every decision.
- Practical auditability: number protocol clauses (`I1..` / `E1..`) and require screening rows to cite them (e.g., `reason_codes=E3`).
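The auditability rule above (every screening row cites numbered protocol clauses) can be checked mechanically. A minimal sketch follows, assuming a `reason_codes` column as in the example above; the `paper_id` column name and the `;` separator for multiple codes are assumptions, as is the function name `rows_with_bad_reasons`.

```python
from __future__ import annotations

import csv
import io


def rows_with_bad_reasons(csv_text: str, protocol_codes: set[str]) -> list[str]:
    """Return ids of screening rows whose reason codes are empty or unknown."""
    bad = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: multiple codes in one cell are separated by ';'.
        raw = row.get("reason_codes") or ""
        codes = [code.strip() for code in raw.split(";") if code.strip()]
        if not codes or any(code not in protocol_codes for code in codes):
            bad.append(row.get("paper_id") or "")
    return bad
```

In use, `protocol_codes` would be the set of clause labels (`I1..`, `E1..`) parsed from `output/PROTOCOL.md`, and a non-empty result would block the stage.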
## Stage 4 - Extraction (C4)
required_skills:
- extraction-form
- bias-assessor
produces:
- papers/extraction_table.csv
Notes:
- Extraction is schema-driven: do not write narrative synthesis here; keep all fields consistent and fill bias fields (low/unclear/high + notes).
## Stage 5 - Synthesis (C5) [PROSE ALLOWED]
required_skills:
- synthesis-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
FILE:pipelines/tutorial.pipeline.md
---
name: tutorial
version: 2.3
profile: tutorial
routing_hints: [tutorial, 教程]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/TUTORIAL_SPEC.md
- outline/concept_graph.yml
- outline/module_plan.yml
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.tutorial.csv
---
# Pipeline: tutorial (teaching loop)
Goal: a tutorial deliverable (`output/TUTORIAL.md`) that has a consistent running example and a real teaching loop (objectives -> steps -> exercises -> verification), not just a blog-style explanation.
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Spec (C1)
required_skills:
- tutorial-spec
produces:
- output/TUTORIAL_SPEC.md
Notes:
- The spec is the scope contract: audience, prerequisites, learning objectives, and a running example (if applicable).
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- concept-graph
- module-planner
- exercise-builder
produces:
- outline/concept_graph.yml
- outline/module_plan.yml
human_checkpoint:
- approve: target audience + scope + running example
- write_to: DECISIONS.md
Notes:
- Treat `outline/module_plan.yml` as the execution contract for writing: modules must have concrete outputs and at least one verifiable exercise each.
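One way to enforce this contract before writing starts is a small structural check. The schema below (`modules`, `id`, `outputs`, `exercises`, `verification`) is an assumption for illustration only; the real `outline/module_plan.yml` layout may differ.

```python
def module_plan_gaps(plan):
    """Return one message per module that lacks outputs or a verifiable exercise."""
    gaps = []
    for module in plan.get("modules") or []:
        module_id = str(module.get("id") or "?")
        if not module.get("outputs"):
            gaps.append(f"{module_id}: no concrete outputs")
        exercises = module.get("exercises") or []
        # An exercise counts as verifiable only if it states how to check the answer.
        verifiable = [
            ex for ex in exercises
            if isinstance(ex, dict) and str(ex.get("verification") or "").strip()
        ]
        if not verifiable:
            gaps.append(f"{module_id}: no verifiable exercise")
    return gaps
```

An empty list means the plan satisfies both conditions for every module.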
## Stage 3 - Writing (C3) [PROSE ALLOWED]
required_skills:
- tutorial-module-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Write only within the approved scope; if you discover missing prerequisites, record them as a follow-up instead of expanding scope silently.
FILE:scripts/run.py
from __future__ import annotations
import argparse
import json
import re
import sys
from pathlib import Path
from typing import Any
def _assert_h3_cutover_ready(*, workspace: Path, consumer: str) -> None:
    from tooling.common import read_jsonl
    state_path = workspace / "outline" / "outline_state.jsonl"
    if not state_path.exists() or state_path.stat().st_size <= 0:
        return
    records = [rec for rec in read_jsonl(state_path) if isinstance(rec, dict)]
    if not records:
        return
    latest = records[-1]
    cutover_keys = {"structure_phase", "h3_status", "approval_status", "reroute_target", "retry_budget_remaining"}
    if not any(key in latest for key in cutover_keys):
        return
    h3_status = str(latest.get("h3_status") or "").strip().lower()
    if h3_status == "stable":
        return
    structure_phase = str(latest.get("structure_phase") or "").strip() or "unknown"
    reroute_target = str(latest.get("reroute_target") or "").strip() or "none"
    raise SystemExit(
        f"{consumer} is blocked until stable H3 ids exist: "
        f"`outline/outline_state.jsonl` reports h3_status={h3_status or 'missing'} "
        f"(structure_phase={structure_phase}, reroute_target={reroute_target})."
    )
def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--workspace", required=True)
    parser.add_argument("--unit-id", default="")
    parser.add_argument("--inputs", default="")
    parser.add_argument("--outputs", default="")
    parser.add_argument("--checkpoint", default="")
    args = parser.parse_args()
    repo_root = Path(__file__).resolve()
    for _ in range(10):
        if (repo_root / "AGENTS.md").exists():
            break
        parent = repo_root.parent
        if parent == repo_root:
            break
        repo_root = parent
    sys.path.insert(0, str(repo_root))
    from tooling.common import ensure_dir, latest_outline_state, load_workspace_pipeline_spec, now_iso_seconds, parse_semicolon_list, read_jsonl, write_jsonl
    workspace = Path(args.workspace).resolve()
    spec = load_workspace_pipeline_spec(workspace)
    if spec is not None and str(spec.structure_mode or "").strip() == "section_first":
        state = latest_outline_state(workspace)
        if not state:
            raise SystemExit("Missing outline_state.jsonl; run the section-first C2 planner pass before anchor-sheet.")
        if str(state.get("structure_phase") or "").strip() != "decomposed" or str(state.get("h3_status") or "").strip() != "stable":
            raise SystemExit("Section-first cutover not ready: `outline_state.jsonl` must report `structure_phase: decomposed` and `h3_status: stable` before anchor-sheet.")
    inputs = parse_semicolon_list(args.inputs) or [
        "outline/evidence_drafts.jsonl",
        "citations/ref.bib",
    ]
    outputs = parse_semicolon_list(args.outputs) or ["outline/anchor_sheet.jsonl"]
    packs_path = workspace / inputs[0]
    bib_path = workspace / inputs[1]
    out_path = workspace / outputs[0]
    freeze_marker = out_path.parent / "anchor_sheet.refined.ok"
    if out_path.exists() and out_path.stat().st_size > 0 and freeze_marker.exists():
        return 0
    _assert_h3_cutover_ready(workspace=workspace, consumer="anchor-sheet")
    packs = read_jsonl(packs_path)
    if not packs:
        raise SystemExit(f"Missing or empty evidence packs: {packs_path}")
    bib_text = bib_path.read_text(encoding="utf-8", errors="ignore") if bib_path.exists() else ""
    bib_keys = set(re.findall(r"(?im)^@\w+\s*\{\s*([^,\s]+)\s*,", bib_text))
    records: list[dict[str, Any]] = []
    for rec in packs:
        if not isinstance(rec, dict):
            continue
        sub_id = str(rec.get("sub_id") or "").strip()
        title = str(rec.get("title") or "").strip()
        if not sub_id or not title:
            continue
        anchors: list[dict[str, Any]] = []

        def add_anchor(*, hook_type: str, text: str, citations: list[str], paper_id: str = "", evidence_id: str = "", pointer: str = "") -> None:
            text = _trim(text)
            norm: list[str] = []
            seen: set[str] = set()
            for c in citations:
                k = _normalize_cite_key(c, bib_keys=bib_keys)
                if k and k not in seen:
                    seen.add(k)
                    norm.append(k)
            citations = norm
            if not text or not citations:
                return
            anchors.append(
                {
                    "hook_type": hook_type,
                    "text": text,
                    "citations": citations,
                    "paper_id": paper_id,
                    "evidence_id": evidence_id,
                    "pointer": pointer,
                }
            )

        # Prefer quantitative excerpts from A/B highlights.
        for comp in rec.get("concrete_comparisons") or []:
            if not isinstance(comp, dict):
                continue
            for side in ("A_highlights", "B_highlights"):
                for hl in comp.get(side) or []:
                    if not isinstance(hl, dict):
                        continue
                    excerpt = str(hl.get("excerpt") or "").strip()
                    cites = [str(c).strip() for c in (hl.get("citations") or []) if str(c).strip()]
                    paper_id = str(hl.get("paper_id") or "").strip()
                    evidence_id = str(hl.get("evidence_id") or "").strip()
                    pointer = str(hl.get("pointer") or "").strip()
                    if re.search(r"\d", excerpt):
                        add_anchor(hook_type="quant", text=excerpt, citations=cites, paper_id=paper_id, evidence_id=evidence_id, pointer=pointer)
        # Evidence snippets: grab quantitative/evaluation/limitation hooks directly.
        for sn in rec.get("evidence_snippets") or []:
            if not isinstance(sn, dict):
                continue
            text = str(sn.get("text") or "").strip()
            cites = [str(c).strip() for c in (sn.get("citations") or []) if str(c).strip()]
            paper_id = str(sn.get("paper_id") or "").strip()
            evidence_id = str(sn.get("evidence_id") or "").strip()
            prov = sn.get("provenance") or {}
            pointer = str((prov or {}).get("pointer") or "").strip()
            if re.search(r"\d", text):
                add_anchor(hook_type="quant", text=text, citations=cites, paper_id=paper_id, evidence_id=evidence_id, pointer=pointer)
            if re.search(r"(?i)\b(benchmark|dataset|metric|evaluation|protocol)\b|评测|基准|数据集|指标", text):
                add_anchor(hook_type="eval", text=text, citations=cites, paper_id=paper_id, evidence_id=evidence_id, pointer=pointer)
            if re.search(r"(?i)\b(limitation|limited|failure|risk|caveat|threat)\b|局限|受限|失败|风险|待验证", text):
                add_anchor(hook_type="limitation", text=text, citations=cites, paper_id=paper_id, evidence_id=evidence_id, pointer=pointer)
        # Limitations/failures.
        for it in rec.get("failures_limitations") or []:
            if not isinstance(it, dict):
                continue
            add_anchor(
                hook_type="limitation",
                text=str(it.get("excerpt") or it.get("bullet") or it.get("text") or "").strip(),
                citations=[str(c).strip() for c in (it.get("citations") or []) if str(c).strip()],
                paper_id=str(it.get("paper_id") or "").strip(),
                evidence_id=str(it.get("evidence_id") or "").strip(),
                pointer=str(it.get("pointer") or "").strip(),
            )
        # Claim candidates: preserve cite-backed, non-placeholder claim sentences as additional anchors.
        for it in rec.get("claim_candidates") or []:
            if not isinstance(it, dict):
                continue
            add_anchor(
                hook_type="claim",
                text=str(it.get("claim") or "").strip(),
                citations=[str(c).strip() for c in (it.get("citations") or []) if str(c).strip()],
                paper_id=str(it.get("paper_id") or "").strip(),
                evidence_id=str(it.get("evidence_id") or "").strip(),
                pointer=str(it.get("pointer") or "").strip(),
            )
        # If still thin, top off with any remaining cite-backed evidence snippets rather than failing on extraction narrowness.
        if len(anchors) < 12:
            for sn in rec.get("evidence_snippets") or []:
                if not isinstance(sn, dict):
                    continue
                add_anchor(
                    hook_type="fact",
                    text=str(sn.get("text") or "").strip(),
                    citations=[str(c).strip() for c in (sn.get("citations") or []) if str(c).strip()],
                    paper_id=str(sn.get("paper_id") or "").strip(),
                    evidence_id=str(sn.get("evidence_id") or "").strip(),
                    pointer=str((sn.get("provenance") or {}).get("pointer") or "").strip(),
                )
                if len(anchors) >= 12:
                    break
        # De-dupe anchors by normalized text.
        deduped: list[dict[str, Any]] = []
        seen: set[str] = set()
        for a in anchors:
            norm = re.sub(r"\s+", " ", str(a.get("text") or "").strip().lower())
            norm = re.sub(r"\[@[^\]]+\]", "", norm)
            if not norm or norm in seen:
                continue
            seen.add(norm)
            deduped.append(a)

        def try_append(*, hook_type: str, text: str, citations: list[str], paper_id: str = "", evidence_id: str = "", pointer: str = "") -> None:
            if len(deduped) >= 12:
                return
            norm_text = _trim(text)
            norm = re.sub(r"\s+", " ", norm_text.strip().lower())
            norm = re.sub(r"\[@[^\]]+\]", "", norm)
            if not norm or norm in seen:
                return
            cite_list = []
            seen_cites: set[str] = set()
            for c in citations:
                key = _normalize_cite_key(c, bib_keys=bib_keys)
                if key and key not in seen_cites:
                    seen_cites.add(key)
                    cite_list.append(key)
            if not cite_list:
                return
            seen.add(norm)
            deduped.append({
                "hook_type": hook_type,
                "text": norm_text,
                "citations": cite_list,
                "paper_id": paper_id,
                "evidence_id": evidence_id,
                "pointer": pointer,
            })

        if len(deduped) < 12:
            for sn in rec.get("evidence_snippets") or []:
                if not isinstance(sn, dict):
                    continue
                try_append(
                    hook_type="fact",
                    text=str(sn.get("text") or "").strip(),
                    citations=[str(c).strip() for c in (sn.get("citations") or []) if str(c).strip()],
                    paper_id=str(sn.get("paper_id") or "").strip(),
                    evidence_id=str(sn.get("evidence_id") or "").strip(),
                    pointer=str((sn.get("provenance") or {}).get("pointer") or "").strip(),
                )
                if len(deduped) >= 12:
                    break
        if len(deduped) < 12:
            for it in rec.get("claim_candidates") or []:
                if not isinstance(it, dict):
                    continue
                try_append(
                    hook_type="claim",
                    text=str(it.get("claim") or "").strip(),
                    citations=[str(c).strip() for c in (it.get("citations") or []) if str(c).strip()],
                    paper_id=str(it.get("paper_id") or "").strip(),
                    evidence_id=str(it.get("evidence_id") or "").strip(),
                    pointer=str(it.get("pointer") or "").strip(),
                )
                if len(deduped) >= 12:
                    break
        if len(deduped) < 12:
            for it in rec.get("evaluation_protocol") or []:
                if not isinstance(it, dict):
                    continue
                try_append(
                    hook_type="eval",
                    text=str(it.get("bullet") or "").strip(),
                    citations=[str(c).strip() for c in (it.get("citations") or []) if str(c).strip()],
                    pointer=str(it.get("pointer") or "").strip(),
                )
                if len(deduped) >= 12:
                    break
        records.append({"sub_id": sub_id, "title": title, "anchors": deduped[:12], "generated_at": now_iso_seconds()})
    _check_no_placeholders(records)
    ensure_dir(out_path.parent)
    write_jsonl(out_path, records)
    return 0
def _normalize_cite_key(cite: str, *, bib_keys: set[str]) -> str:
    cite = str(cite or "").strip()
    if cite.startswith("[@") and cite.endswith("]"):
        cite = cite[2:-1].strip()
    if cite.startswith("@"):
        cite = cite[1:].strip()
    key = cite.strip()
    if not key:
        return ""
    if bib_keys and key not in bib_keys:
        return ""
    return key
def _trim(text: str, *, max_len: int = 280) -> str:
    text = re.sub(r"\s+", " ", (text or "").strip())
    if len(text) > int(max_len):
        # Avoid adding ellipsis markers that may accidentally leak into prose.
        text = text[: int(max_len)].rstrip()
    return text
def _check_no_placeholders(records: list[dict[str, Any]]) -> None:
    raw = json.dumps(records, ensure_ascii=False)
    low = raw.lower()
    if "(placeholder)" in low:
        raise SystemExit("anchor_sheet contains placeholder markers")
    if re.search(r"(?i)\b(?:todo|tbd|fixme)\b", raw):
        raise SystemExit("anchor_sheet contains TODO/TBD/FIXME")
if __name__ == "__main__":
    raise SystemExit(main())
FILE:tooling/__init__.py
FILE:tooling/common.py
from __future__ import annotations
import csv
import json
import os
import re
import shutil
import tempfile
from dataclasses import dataclass
from datetime import date, datetime
from pathlib import Path
from typing import Any, Iterable
import yaml
def today_iso() -> str:
    return date.today().isoformat()
def now_iso_seconds() -> str:
    return datetime.now().replace(microsecond=0).isoformat()
def ensure_dir(path: Path) -> None:
    path.mkdir(parents=True, exist_ok=True)
def atomic_write_text(path: Path, content: str) -> None:
    ensure_dir(path.parent)
    fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
        os.replace(tmp_path, path)
    except Exception:
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
def backup_existing(path: Path) -> Path:
    """Rename an existing file to a timestamped `.bak.*` sibling and return the backup path."""
    if not path.exists():
        return path
    stamp = datetime.now().replace(microsecond=0).isoformat().replace("-", "").replace(":", "")
    backup = path.with_name(f"{path.name}.bak.{stamp}")
    counter = 1
    while backup.exists():
        backup = path.with_name(f"{path.name}.bak.{stamp}.{counter}")
        counter += 1
    path.replace(backup)
    return backup
def parse_semicolon_list(value: str | None) -> list[str]:
    if not value:
        return []
    return [item.strip() for item in value.split(";") if item.strip()]
def normalize_title_for_dedupe(title: str) -> str:
    title = title.lower()
    title = re.sub(r"[^a-z0-9]+", " ", title)
    title = re.sub(r"\s+", " ", title).strip()
    return title
def normalize_axis_label(text: str) -> str:
    text = re.sub(r"\s+", " ", (text or "").strip().lower())
    text = text.rstrip(" .;:,;。")
    text = re.sub(r"\s*/\s*", " ", text)
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()
def subsection_brief_generic_axis_norms() -> set[str]:
    """Axis labels that strict survey gates treat as scaffold-level defaults.

    Keep this aligned across brief generation and quality-gate checks so the
    generator does not promote an axis that the gate later classifies as generic.
    """
    axes = {
        "core mechanism and system architecture",
        "training and data setup",
        "evaluation protocol",
        "evaluation protocol (benchmarks / metrics / human)",
        "evaluation protocol (datasets / metrics / human)",
        "evaluation protocol (datasets, metrics, human evaluation)",
        "compute and efficiency",
        "compute and latency constraints",
        "efficiency and compute",
        "tool interface contract (schemas / protocols)",
        "tool selection / routing policy",
        "sandboxing / permissions / observability",
        "failure modes and limitations",
    }
    return {normalize_axis_label(axis) for axis in axes}
def read_jsonl(path: Path) -> list[dict[str, Any]]:
    records: list[dict[str, Any]] = []
    if not path.exists():
        return records
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            records.append(json.loads(line))
    return records
def write_jsonl(path: Path, records: Iterable[dict[str, Any]]) -> None:
    ensure_dir(path.parent)
    lines = [json.dumps(record, ensure_ascii=False) for record in records]
    atomic_write_text(path, "\n".join(lines) + ("\n" if lines else ""))
def read_tsv(path: Path) -> list[dict[str, str]]:
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [dict(row) for row in reader]
def write_tsv(path: Path, rows: list[dict[str, Any]], fieldnames: list[str]) -> None:
    ensure_dir(path.parent)
    fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
            writer.writeheader()
            for row in rows:
                writer.writerow({key: row.get(key, "") for key in fieldnames})
        os.replace(tmp_path, path)
    except Exception:
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
@dataclass(frozen=True)
class UnitsTable:
    fieldnames: list[str]
    rows: list[dict[str, str]]

    @staticmethod
    def load(path: Path) -> "UnitsTable":
        with path.open("r", encoding="utf-8", newline="") as handle:
            reader = csv.DictReader(handle)
            fieldnames = list(reader.fieldnames or [])
            rows = [dict(row) for row in reader]
        return UnitsTable(fieldnames=fieldnames, rows=rows)

    def save(self, path: Path) -> None:
        ensure_dir(path.parent)
        fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
        try:
            with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
                writer = csv.DictWriter(handle, fieldnames=self.fieldnames)
                writer.writeheader()
                for row in self.rows:
                    writer.writerow({key: row.get(key, "") for key in self.fieldnames})
            os.replace(tmp_path, path)
        except Exception:
            try:
                os.unlink(tmp_path)
            except OSError:
                pass
            raise
def load_yaml(path: Path) -> Any:
    with path.open("r", encoding="utf-8") as handle:
        return yaml.safe_load(handle)
def dump_yaml(path: Path, data: Any) -> None:
    ensure_dir(path.parent)
    text = yaml.safe_dump(
        data,
        allow_unicode=True,
        sort_keys=False,
        default_flow_style=False,
        width=120,
    )
    atomic_write_text(path, text)
def copy_tree(src_dir: Path, dst_dir: Path, *, overwrite: bool) -> None:
    if not src_dir.is_dir():
        raise ValueError(f"Template directory not found: {src_dir}")
    ensure_dir(dst_dir)
    for src_path in src_dir.rglob("*"):
        rel = src_path.relative_to(src_dir)
        dst_path = dst_dir / rel
        if src_path.is_dir():
            ensure_dir(dst_path)
            continue
        ensure_dir(dst_path.parent)
        if dst_path.exists() and not overwrite:
            continue
        shutil.copy2(src_path, dst_path)
def tokenize(text: str) -> list[str]:
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return [token for token in text.split() if token]
_EN_STOPWORDS = {
"a",
"an",
"and",
"are",
"as",
"at",
"be",
"by",
"for",
"from",
"in",
"is",
"it",
"of",
"on",
"or",
"that",
"the",
"this",
"to",
"with",
"we",
"our",
"via",
"towards",
"toward",
"using",
"use",
"based",
"new",
"towards",
"into",
"over",
"under",
"between",
"within",
"without",
"beyond",
}
_GENERIC_PAPER_WORDS = {
"survey",
"review",
"tutorial",
"paper",
"approach",
"method",
"methods",
"model",
"models",
"framework",
"frameworks",
"system",
"systems",
"learning",
"deep",
"neural",
"network",
"networks",
"analysis",
"benchmark",
"benchmarks",
"dataset",
"datasets",
"evaluation",
"evaluating",
"towards",
"using",
"based",
"study",
"studies",
}
def candidate_keywords(titles: Iterable[str], *, top_k: int, min_freq: int) -> list[str]:
    freq: dict[str, int] = {}
    for title in titles:
        for token in tokenize(title):
            if token in _EN_STOPWORDS or token in _GENERIC_PAPER_WORDS:
                continue
            if len(token) < 3:
                continue
            freq[token] = freq.get(token, 0) + 1
    candidates = [t for t, c in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])) if c >= min_freq]
    return candidates[:top_k]
def update_status_log(status_path: Path, line: str) -> None:
    ensure_dir(status_path.parent)
    if status_path.exists():
        existing = status_path.read_text(encoding="utf-8")
    else:
        existing = "# Status\n"
    if "## Run log" not in existing:
        existing = existing.rstrip() + "\n\n## Run log\n"
    updated = existing.rstrip() + f"\n- {line}\n"
    atomic_write_text(status_path, updated)
def update_status_field(status_path: Path, heading: str, value: str) -> None:
    heading_line = f"## {heading}".strip()
    bullet_line = f"- `{value}`"
    if status_path.exists():
        lines = status_path.read_text(encoding="utf-8").splitlines()
    else:
        lines = ["# Status"]
    out: list[str] = []
    i = 0
    updated = False
    while i < len(lines):
        line = lines[i]
        out.append(line)
        if line.strip() == heading_line:
            if i + 1 < len(lines) and lines[i + 1].lstrip().startswith("-"):
                out.append(bullet_line)
                i += 2
                updated = True
                continue
            out.append(bullet_line)
            updated = True
        i += 1
    if not updated:
        out.extend(["", heading_line, bullet_line])
    atomic_write_text(status_path, "\n".join(out).rstrip() + "\n")
def decisions_has_approval(decisions_path: Path, checkpoint: str) -> bool:
    if not checkpoint:
        return False
    if not decisions_path.exists():
        return False
    text = decisions_path.read_text(encoding="utf-8")
    pattern = rf"^\s*-\s*\[[xX]\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b"
    return re.search(pattern, text, flags=re.MULTILINE) is not None
def ensure_decisions_approval_checklist(decisions_path: Path) -> None:
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n"
    if re.search(r"^##\s+Approvals\b", text, flags=re.MULTILINE):
        return
    workspace = decisions_path.parent
    checkpoints = _human_checkpoints_from_units(workspace)
    if not checkpoints:
        return
    checklist_lines = ["## Approvals (check to unblock)"]
    for checkpoint in checkpoints:
        hint = _approval_hint(checkpoint)
        suffix = f" ({hint})" if hint else ""
        checklist_lines.append(f"- [ ] Approve {checkpoint}{suffix}")
    checklist_lines.append("")
    checklist = "\n".join(checklist_lines)
    lines = text.splitlines()
    if lines and lines[0].startswith("#"):
        new_text = "\n".join([lines[0], "", checklist] + lines[1:]).rstrip() + "\n"
    else:
        new_text = (checklist + "\n" + text).rstrip() + "\n"
    atomic_write_text(decisions_path, new_text)
def set_decisions_approval(decisions_path: Path, checkpoint: str, *, approved: bool) -> None:
    checkpoint = checkpoint.strip()
    if not checkpoint:
        raise ValueError("checkpoint must be non-empty")
    ensure_decisions_approval_checklist(decisions_path)
    # The checklist helper writes the file only when HUMAN checkpoints exist,
    # so fall back to an empty log instead of reading a file that may be missing.
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n"
    lines = text.splitlines()
    pattern = re.compile(rf"^\s*-\s*\[\s*[xX ]\s*\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b")
    updated = False
    for idx, line in enumerate(lines):
        if pattern.search(line):
            lines[idx] = re.sub(
                r"\[\s*[xX ]\s*\]",
                "[x]" if approved else "[ ]",
                line,
                count=1,
            )
            updated = True
            break
    if not updated:
        insert_at = None
        for idx, line in enumerate(lines):
            if line.strip().startswith("## Approvals"):
                insert_at = idx + 1
                break
        if insert_at is None:
            lines.append("")
            lines.append("## Approvals (check to unblock)")
            insert_at = len(lines)
        lines.insert(insert_at, f"- [{'x' if approved else ' '}] Approve {checkpoint}")
    atomic_write_text(decisions_path, "\n".join(lines).rstrip() + "\n")
def _human_checkpoints_from_units(workspace: Path) -> list[str]:
    units_path = workspace / "UNITS.csv"
    if not units_path.exists():
        return []
    try:
        table = UnitsTable.load(units_path)
    except Exception:
        return []
    seen: set[str] = set()
    out: list[str] = []
    for row in table.rows:
        owner = (row.get("owner") or "").strip().upper()
        if owner != "HUMAN":
            continue
        checkpoint = (row.get("checkpoint") or "").strip()
        if checkpoint and checkpoint not in seen:
            seen.add(checkpoint)
            out.append(checkpoint)
    return out
def _approval_hint(checkpoint: str) -> str:
    hints = {
        "C0": "kickoff: scope/sources/time window/constraints",
        "C1": "retrieval + core set",
        "C2": "scope + outline",
        "C3": "evidence ready",
        "C4": "citations verified",
        "C5": "allow prose writing",
    }
    return hints.get(checkpoint, "")
def upsert_checkpoint_block(decisions_path: Path, checkpoint: str, markdown_block: str) -> None:
    begin = f"<!-- BEGIN CHECKPOINT:{checkpoint} -->"
    end = f"<!-- END CHECKPOINT:{checkpoint} -->"
    block = "\n".join([begin, markdown_block.rstrip(), end, ""]).rstrip() + "\n"
    ensure_decisions_approval_checklist(decisions_path)
    # The checklist helper writes the file only when HUMAN checkpoints exist,
    # so read it back only if it is actually there.
    if decisions_path.exists():
        text = decisions_path.read_text(encoding="utf-8")
    else:
        text = "# Decisions log\n\n"
    pattern = re.compile(
        rf"{re.escape(begin)}.*?{re.escape(end)}\n?",
        flags=re.DOTALL,
    )
    if pattern.search(text):
        # Use a callable so backslashes in the markdown block are inserted
        # literally instead of being parsed as regex replacement escapes.
        new_text = pattern.sub(lambda _match: block, text)
    else:
        new_text = text.rstrip() + "\n\n" + block
    atomic_write_text(decisions_path, new_text)
def seed_queries_from_topic(queries_path: Path, topic: str) -> None:
    topic = topic.strip()
    if not topic:
        return
    if queries_path.exists():
        lines = queries_path.read_text(encoding="utf-8").splitlines()
    else:
        lines = [
            "# Queries",
            "",
            "## Primary query",
            "- keywords:",
            " - \"\"",
            "- exclude:",
            " - \"\"",
            "- max_results: \"\"",
            "- core_size: \"\"",
            "- time window:",
            " - from: \"\"",
            " - to: \"\"",
            "",
            "## Notes",
            "-",
        ]

    def _has_nonempty_values(token: str) -> bool:
        in_block = False
        for raw in lines:
            stripped = raw.strip()
            if stripped.startswith(f"- {token}:"):
                in_block = True
                continue
            if not in_block:
                continue
            if raw.startswith(" - "):
                value = stripped[2:].strip().strip('"').strip("'")
                if value:
                    return True
                continue
            if stripped.startswith("- "):
                break
        return False

    def _has_nonempty_scalar(token: str) -> bool:
        for raw in lines:
            stripped = raw.strip()
            if not stripped.startswith(f"- {token}:"):
                continue
            value = stripped.split(":", 1)[1].split("#", 1)[0].strip().strip('"').strip("'")
            return bool(value)
        return False

    def _has_nonempty_time_field(field: str) -> bool:
        # Looks for lines like: ' - from: "2022"'
        for raw in lines:
            stripped = raw.strip()
            if not stripped.startswith(f"- {field}:"):
                continue
            value = stripped.split(":", 1)[1].strip().strip('"').strip("'")
            return bool(value)
        return False

    has_keywords = _has_nonempty_values("keywords")
    has_excludes = _has_nonempty_values("exclude")
    has_time_from = _has_nonempty_time_field("from")
    has_time_to = _has_nonempty_time_field("to")
    has_max_results = _has_nonempty_scalar("max_results")
    has_core_size = _has_nonempty_scalar("core_size")
    workspace = queries_path.parent
    profile = pipeline_profile(workspace)
    query_defaults = pipeline_query_defaults(workspace)
    raw_tlow = topic.lower()
    topic_for_queries = _sanitize_topic_for_query_seed(topic)
    keyword_suggestions = [topic_for_queries]
    tlow = topic_for_queries.lower()
    is_agent = any(t in tlow for t in ("agent", "agents", "agentic"))
    is_embodied = any(
        t in tlow
        for t in (
            "embodied ai",
            "embodied intelligence",
            "embodied agent",
            "embodied robotics",
            "robot foundation model",
            "robot learning",
            "robot manipulation",
            "vision-language-action",
            "vla",
            "generalist robot",
        )
    )
    is_text_to_image = any(t in tlow for t in ("text-to-image", "text to image", "t2i"))
    is_text_to_video = any(t in tlow for t in ("text-to-video", "text to video", "t2v"))
    is_diffusion = "diffusion" in tlow
    is_generative = is_text_to_image or is_text_to_video or is_diffusion or ("image generation" in tlow) or ("generative" in tlow)
    if is_agent:
        keyword_suggestions.extend(
            [
                "LLM agent",
                "language model agent",
                "tool use",
                "function calling",
                "tool-using agent",
                "planning",
                "memory",
                "multi-agent",
                "benchmark",
                "safety",
            ]
        )
    exclude_suggestions: list[str] = []
    if is_agent:
        exclude_suggestions.append("agent-based modeling")
        exclude_suggestions.extend(["react hooks", "perovskite", "banach", "coxeter"])
    if is_embodied:
        keyword_suggestions.extend(
            [
                "embodied AI survey",
                "embodied AI review",
                "embodied intelligence survey",
                "embodied agent survey",
                "robot foundation model survey",
                "robot learning survey",
                "robot manipulation survey",
                "embodied robotics survey",
                "vision-language-action survey",
                "vision-language-action model",
                "robot foundation model",
                "generalist robot policy",
                "world model robot",
            ]
        )
    stripped_output_terms = topic_for_queries.lower().strip() != raw_tlow.strip()
    if stripped_output_terms and any(t in raw_tlow for t in ("latex", "pdf", "markdown", "typesetting")):
        exclude_suggestions.extend(["latex", "pdf", "typesetting", "document layout"])
    if is_generative:
        keyword_suggestions.extend(
            [
                "text-to-image generation",
                "text-guided image generation",
                "diffusion model",
                "denoising diffusion probabilistic model",
                "latent diffusion",
                "stable diffusion",
                "classifier-free guidance",
                "diffusion transformer",
                "DiT",
                "masked generative transformer",
                "MaskGIT",
                "autoregressive image generation",
                "VQGAN",
                "VQ-VAE",
                "ControlNet",
                "DreamBooth",
                "textual inversion",
                "LoRA fine-tuning",
            ]
        )
    default_max_results = query_defaults.get("max_results")
    max_results_suggestion = str(default_max_results) if str(default_max_results or "").strip() else (
        1800 if profile == "arxiv-survey" else (800 if (is_agent or is_generative or is_embodied) else 300)
    )
    time_from_suggestion = (
        "2018" if is_embodied
        else ("2022" if (is_agent and ("llm" in tlow or "language model" in tlow)) else ("2020" if is_generative else ""))
    )
    core_size_suggestion = str(query_defaults.get("core_size") or "").strip() or ("300" if profile == "arxiv-survey" else "")
    out: list[str] = []
    i = 0
    while i < len(lines):
        line = lines[i]
        stripped = line.strip()
        if stripped.startswith("- keywords:") and not has_keywords:
            out.append(line)
            i += 1
            while i < len(lines) and lines[i].startswith(" - "):
                i += 1
            for kw in _dedupe_preserve_order(keyword_suggestions)[:14]:
                out.append(f" - \"{kw}\"")
            continue
        if stripped.startswith("- exclude:") and not has_excludes:
            out.append(line)
            i += 1
            while i < len(lines) and lines[i].startswith(" - "):
                i += 1
            for ex in _dedupe_preserve_order(exclude_suggestions)[:10]:
                out.append(f" - \"{ex}\"")
            continue
        if stripped.startswith("- max_results:") and not has_max_results and max_results_suggestion:
            out.append(f"- max_results: \"{max_results_suggestion}\"")
            i += 1
            continue
        if stripped.startswith("- core_size:") and not has_core_size and core_size_suggestion:
            out.append(f"- core_size: \"{core_size_suggestion}\"")
            i += 1
            continue
        if stripped.startswith("- time window:") and not (has_time_from or has_time_to) and time_from_suggestion:
            out.append(line)
            i += 1
            # Skip existing from/to lines if present.
            while i < len(lines) and lines[i].startswith(" -"):
                i += 1
            out.append(f" - from: \"{time_from_suggestion}\"")
            out.append(" - to: \"\"")
            continue
        out.append(line)
        i += 1
    out = _materialize_missing_query_defaults(out, query_defaults, allowed_fields=pipeline_overridable_query_fields(workspace))
    atomic_write_text(queries_path, "\n".join(out).rstrip() + "\n")
def _dedupe_preserve_order(items: list[str]) -> list[str]:
    seen: set[str] = set()
    out: list[str] = []
    for item in items:
        item = item.strip()
        if not item or item in seen:
            continue
        seen.add(item)
        out.append(item)
    return out
def _sanitize_topic_for_query_seed(topic: str) -> str:
    text = str(topic or "").strip()
    if not text:
        return ""
    patterns = [
        r"(?i)\bwith\s+latex\s*/\s*pdf\s+output\b",
        r"(?i)\bwith\s+latex\s+output\b",
        r"(?i)\bwith\s+pdf\s+output\b",
        r"(?i)\bwith\s+markdown\s+output\b",
        r"(?i)\blatex\s*/\s*pdf\s+output\b",
        r"(?i)\bpdf\s+output\b",
        r"(?i)\blatex\s+output\b",
        r"(?i)\bmarkdown\s+output\b",
        r"(?i)\bfor\s+latex\s*/\s*pdf\b",
    ]
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    text = re.sub(r"\s+", " ", text).strip(" ,;:-")
    return text or topic
_LEGACY_PIPELINE_ALIASES = {
    "idea-finder": "idea-brainstorm",
    "idea-finder.pipeline.md": "idea-brainstorm",
    "pipelines/idea-finder.pipeline.md": "idea-brainstorm",
}
def find_repo_root(start: Path | None = None) -> Path:
    """Walk up from *start* (default: this file) looking for AGENTS.md."""
    candidate = (start or Path(__file__)).resolve()
    for _ in range(10):
        if (candidate / "AGENTS.md").exists():
            return candidate
        parent = candidate.parent
        if parent == candidate:
            break
        candidate = parent
    raise FileNotFoundError("Could not find repo root (AGENTS.md marker)")
def _normalize_pipeline_lock_value(value: str) -> str:
    raw = str(value or "").strip()
    return _LEGACY_PIPELINE_ALIASES.get(raw, raw)
def resolve_pipeline_spec_path(*, repo_root: Path, pipeline_value: str) -> Path | None:
    value = _normalize_pipeline_lock_value(pipeline_value)
    if not value:
        return None
    candidate = Path(value)
    if candidate.is_absolute() and candidate.exists():
        return candidate.resolve()
    rel_candidate = repo_root / value
    if rel_candidate.exists():
        return rel_candidate.resolve()
    filename = Path(value).name
    if filename:
        direct = repo_root / "pipelines" / filename
        if direct.exists():
            return direct.resolve()
    stem = filename
    if stem.endswith(".pipeline.md"):
        stem = stem[: -len(".pipeline.md")]
    if stem:
        direct = repo_root / "pipelines" / f"{stem}.pipeline.md"
        if direct.exists():
            return direct.resolve()
    return None


def load_workspace_pipeline_spec(workspace: Path):
    from tooling.pipeline_spec import PipelineSpec

    try:
        repo_root = find_repo_root(workspace)
    except FileNotFoundError:
        repo_root = Path(__file__).resolve().parents[1]
    lock_path = workspace / "PIPELINE.lock.md"
    if not lock_path.exists():
        return None
    # Read the first `pipeline:` entry from the lock file.
    pipeline_name = ""
    try:
        for raw in lock_path.read_text(encoding="utf-8", errors="ignore").splitlines():
            line = raw.strip()
            if line.startswith("pipeline:"):
                pipeline_name = line.split(":", 1)[1].strip()
                break
    except Exception:
        return None
    if not pipeline_name:
        return None
    spec_path = resolve_pipeline_spec_path(repo_root=repo_root, pipeline_value=pipeline_name)
    if spec_path is None:
        return None
    try:
        return PipelineSpec.load(spec_path)
    except Exception:
        return None


def pipeline_query_defaults(workspace: Path) -> dict[str, Any]:
    spec = load_workspace_pipeline_spec(workspace)
    return dict(spec.query_defaults) if spec is not None else {}


def pipeline_quality_contract(workspace: Path) -> dict[str, Any]:
    spec = load_workspace_pipeline_spec(workspace)
    return dict(spec.quality_contract) if spec is not None else {}


def pipeline_quality_contract_value(workspace: Path, *keys: str, default: Any = None) -> Any:
    # Walk the nested contract one key at a time; any miss returns `default`.
    current: Any = pipeline_quality_contract(workspace)
    for key in keys:
        if not isinstance(current, dict):
            return default
        current = current.get(str(key))
        if current is None:
            return default
    return current


def pipeline_query_default(workspace: Path, key: str, default: Any = None) -> Any:
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None:
        return default
    return spec.query_default(key, default)


def pipeline_overridable_query_fields(workspace: Path) -> set[str]:
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None:
        return set()
    return set(spec.overridable_query_fields)


def pipeline_profile(workspace: Path) -> str:
    """Return the pipeline profile for a workspace.

    Reads PIPELINE.lock.md to get the pipeline name, then loads the
    pipeline spec file and reads its ``profile`` frontmatter field.
    Falls back to ``"default"`` if anything is missing.
    """
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None:
        return "default"
    return str(spec.profile or "default").strip() or "default"


def latest_outline_state(workspace: Path) -> dict[str, Any]:
    path = Path(workspace).resolve() / "outline" / "outline_state.jsonl"
    records = [rec for rec in read_jsonl(path) if isinstance(rec, dict)]
    return dict(records[-1]) if records else {}


def _materialize_missing_query_defaults(lines: list[str], query_defaults: dict[str, Any], *, allowed_fields: set[str] | None = None) -> list[str]:
    if not query_defaults:
        return lines
    # Collect the keys already present as `- key: value` bullets.
    existing_keys: set[str] = set()
    for raw in lines:
        stripped = raw.strip()
        if not stripped.startswith("- ") or ":" not in stripped:
            continue
        key = stripped[2:].split(":", 1)[0].strip().lower().replace(" ", "_").replace("-", "_")
        if key:
            existing_keys.add(key)
    # Append only defaults that are missing, allowed, and renderable as scalars.
    additions: list[str] = []
    for key, value in query_defaults.items():
        norm_key = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
        if not norm_key or norm_key in existing_keys:
            continue
        if allowed_fields and norm_key not in allowed_fields:
            continue
        rendered = _render_query_scalar(value)
        if rendered is None:
            continue
        additions.append(f'- {norm_key}: "{rendered}"')
    if not additions:
        return lines
    out = list(lines)
    if out and out[-1].strip():
        out.append("")
    out.extend(additions)
    return out


def _render_query_scalar(value: Any) -> str | None:
    if value is None:
        return None
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        text = value.strip()
        return text or None
    return None
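
The default-materialization logic above can be tried in isolation. The following is a minimal sketch, not the module's actual API: `render_scalar`, `normalize_key`, and `add_missing_defaults` are hypothetical stand-ins for the private helpers, the `allowed_fields` filter is omitted, and the query keys (`max-results`, `enrich metadata`, `time_window`) are invented for illustration.

```python
from __future__ import annotations

from typing import Any


def render_scalar(value: Any) -> str | None:
    """Render bools, numbers, and non-empty strings; skip everything else."""
    if value is None:
        return None
    if isinstance(value, bool):  # bool before int: bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        return value.strip() or None
    return None


def normalize_key(key: Any) -> str:
    """Lowercase a key and map spaces and hyphens to underscores."""
    return str(key or "").strip().lower().replace(" ", "_").replace("-", "_")


def add_missing_defaults(lines: list[str], defaults: dict[str, Any]) -> list[str]:
    """Append `- key: "value"` bullets for defaults not already present."""
    existing = {
        normalize_key(line.strip()[2:].split(":", 1)[0])
        for line in lines
        if line.strip().startswith("- ") and ":" in line
    }
    additions = []
    for key, value in defaults.items():
        norm = normalize_key(key)
        rendered = render_scalar(value)
        if norm and norm not in existing and rendered is not None:
            additions.append(f'- {norm}: "{rendered}"')
    if not additions:
        return lines
    out = list(lines)
    if out and out[-1].strip():
        out.append("")  # blank separator before the appended defaults
    return out + additions


# Hypothetical query lines and pipeline defaults.
query_lines = ["- max-results: 200", "- keywords: agents"]
defaults = {"max_results": 50, "enrich metadata": True, "time_window": {"from": 2020}}
result = add_missing_defaults(query_lines, defaults)
```

In this sketch `max_results` is skipped because the hyphenated bullet already defines it after normalization, and `time_window` is skipped because a dict is not a renderable scalar, so only `enrich_metadata` is appended.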
FILE:tooling/executor.py
from __future__ import annotations
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path
from tooling.common import (
    UnitsTable,
    atomic_write_text,
    decisions_has_approval,
    ensure_dir,
    latest_outline_state,
    load_workspace_pipeline_spec,
    now_iso_seconds,
    parse_semicolon_list,
    set_decisions_approval,
    update_status_field,
    update_status_log,
)


@dataclass(frozen=True)
class RunResult:
    unit_id: str | None
    status: str
    message: str


def _section_first_cutover_block_message(*, workspace: Path, outputs: list[str]) -> str | None:
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None or str(spec.structure_mode or "").strip().lower() != "section_first":
        return None
    if "outline/outline_state.jsonl" not in outputs:
        return None
    latest = latest_outline_state(workspace)
    if not latest:
        return "Section-first cutover is not actionable yet: `outline/outline_state.jsonl` is missing or empty. Rerun `outline-refiner` after fixing the chapter/section layer."
    structure_phase = str(latest.get("structure_phase") or "").strip()
    h3_status = str(latest.get("h3_status") or "").strip().lower()
    reroute_target = str(latest.get("reroute_target") or "").strip()
    # Keep an exhausted budget (0) visible instead of collapsing it to "unknown".
    raw_retry_budget = latest.get("retry_budget_remaining")
    retry_budget = "" if raw_retry_budget is None else str(raw_retry_budget).strip()
    reroute_reason = str(latest.get("reroute_reason") or "").strip()
    if structure_phase.lower() == "decomposed" and h3_status == "stable":
        return None
    missing: list[str] = []
    for key in ("structure_phase", "h3_status", "approval_status", "reroute_target", "retry_budget_remaining"):
        if key not in latest:
            missing.append(key)
            continue
        if key in {"structure_phase", "h3_status"} and not str(latest.get(key) or "").strip():
            missing.append(key)
    if missing:
        return (
            "Section-first cutover cannot advance because the latest `outline_state.jsonl` record is missing "
            f"required fields: {', '.join(missing)}."
        )
    reroute_label = reroute_target or "unknown"
    retry_label = retry_budget or "unknown"
    reason_suffix = f" Reason: {reroute_reason}." if reroute_reason else ""
    return (
        "Section-first cutover is still blocked; "
        f"latest outline state has structure_phase={structure_phase or 'missing'}, "
        f"h3_status={h3_status or 'missing'}, reroute_target={reroute_label}, "
        f"retry_budget_remaining={retry_label}.{reason_suffix} Fix the reroute target explicitly, then rerun this unit."
    )


def _append_run_error(*, workspace: Path, unit_id: str, skill: str, kind: str, message: str, log_rel: str | None) -> None:
    """Append a short failure record to `output/RUN_ERRORS.md` (workspace-local).

    This is a human-facing error sink that survives reruns and makes BLOCKED states debuggable.
    """
    try:
        ensure_dir(workspace / "output")
        out_path = workspace / "output" / "RUN_ERRORS.md"
        stamp = now_iso_seconds()
        log_hint = f" (log: `{log_rel}`)" if log_rel else ""
        line = f"- {stamp} `{unit_id}` `{skill}` `{kind}`: {message}{log_hint}"
        if out_path.exists() and out_path.stat().st_size > 0:
            prev = out_path.read_text(encoding="utf-8", errors="ignore").rstrip() + "\n"
        else:
            prev = "# Run errors\n\n"
        atomic_write_text(out_path, prev + line + "\n")
    except Exception:
        # Never let the runner crash while trying to log an error.
        return


def run_one_unit(
    *,
    workspace: Path,
    repo_root: Path,
    strict: bool = False,
    auto_approve: set[str] | None = None,
) -> RunResult:
    units_path = workspace / "UNITS.csv"
    status_path = workspace / "STATUS.md"
    if not units_path.exists():
        return RunResult(unit_id=None, status="ERROR", message=f"Missing {units_path}")
    table = UnitsTable.load(units_path)
    runnable_idx = _find_first_runnable(table)
    if runnable_idx is None:
        return RunResult(unit_id=None, status="IDLE", message="No runnable unit found")
    row = table.rows[runnable_idx]
    unit_id = row.get("unit_id", "").strip()
    skill = row.get("skill", "").strip()
    owner = row.get("owner", "").strip().upper()
    row["status"] = "DOING"
    table.save(units_path)
    update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DOING {skill}")
    auto_approve_set = {str(x or "").strip().upper() for x in (auto_approve or set()) if str(x or "").strip()}

    # HUMAN-owned units are approval checkpoints: DONE if approved, otherwise BLOCKED.
    if owner == "HUMAN":
        checkpoint = row.get("checkpoint", "").strip()
        if checkpoint and decisions_has_approval(workspace / "DECISIONS.md", checkpoint):
            row["status"] = "DONE"
            table.save(units_path)
            update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (HUMAN approved {checkpoint})")
            _refresh_status_checkpoint(status_path, table)
            return RunResult(unit_id=unit_id, status="DONE", message=f"HUMAN approved {checkpoint}")
        if checkpoint and checkpoint.upper() in auto_approve_set:
            set_decisions_approval(workspace / "DECISIONS.md", checkpoint, approved=True)
            row["status"] = "DONE"
            table.save(units_path)
            update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (AUTO approved {checkpoint})")
            _refresh_status_checkpoint(status_path, table)
            return RunResult(unit_id=unit_id, status="DONE", message=f"AUTO approved {checkpoint}")
        row["status"] = "BLOCKED"
        table.save(units_path)
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (await HUMAN approval {checkpoint})")
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Await HUMAN approval {checkpoint} in DECISIONS.md")

    # Without a script the unit must be completed manually.
    script_path = repo_root / "scripts" / "run.py"
    if not script_path.exists():
        row["status"] = "BLOCKED"
        table.save(units_path)
        skill_md = "SKILL.md"
        update_status_log(
            status_path,
            (
                f"{now_iso_seconds()} {unit_id} BLOCKED "
                f"(no script for {skill}; run manually per {skill_md} then mark DONE)"
            ),
        )
        _refresh_status_checkpoint(status_path, table)
        return RunResult(
            unit_id=unit_id,
            status="BLOCKED",
            message=(
                f"No executable script for skill '{skill}'. "
                f"Run it manually by following `{skill_md}`, write the required outputs, "
                f"then mark the unit DONE (e.g., `python scripts/pipeline.py mark --workspace {workspace} --unit-id {unit_id} --status DONE`)."
            ),
        )

    inputs = parse_semicolon_list(row.get("inputs"))
    raw_outputs = parse_semicolon_list(row.get("outputs"))
    outputs = [_strip_optional_marker(rel) for rel in raw_outputs]
    # Outputs prefixed with "?" are optional and do not block completion.
    required_outputs = [outputs[i] for i, rel in enumerate(raw_outputs) if not rel.strip().startswith("?")]
    checkpoint = row.get("checkpoint", "").strip()
    cmd = [
        sys.executable,
        str(script_path),
        "--workspace",
        str(workspace),
        "--unit-id",
        unit_id,
        "--inputs",
        ";".join(inputs),
        "--outputs",
        ";".join(outputs),
        "--checkpoint",
        checkpoint,
    ]
    log_rel = f"output/unit_logs/{unit_id}.{skill}.log"
    log_path = workspace / log_rel
    try:
        completed = subprocess.run(cmd, check=False, capture_output=True, text=True)
        # Write a unit log only when the script produced output or failed.
        if completed.stdout or completed.stderr or completed.returncode != 0:
            ensure_dir(log_path.parent)
            body = [
                "# Unit log\n",
                f"- unit_id: {unit_id}\n",
                f"- skill: {skill}\n",
                f"- exit: {completed.returncode}\n",
                f"- cmd: {' '.join(cmd)}\n",
                "\n## stdout\n\n",
                (completed.stdout or "(empty)") + "\n",
                "\n## stderr\n\n",
                (completed.stderr or "(empty)") + "\n",
            ]
            atomic_write_text(log_path, "".join(body))
    except Exception as exc:  # pragma: no cover
        row["status"] = "BLOCKED"
        table.save(units_path)
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (exec error)")
        _append_run_error(
            workspace=workspace,
            unit_id=unit_id,
            skill=skill,
            kind="exec_error",
            message=f"{type(exc).__name__}: {exc}",
            log_rel=None,
        )
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="BLOCKED", message=str(exc))

    missing = [rel for rel in required_outputs if rel and not (workspace / rel).exists()]
    if completed.returncode == 0 and not missing:
        if strict:
            from tooling.quality_gate import check_unit_outputs, write_quality_report

            try:
                issues = check_unit_outputs(skill=skill, workspace=workspace, outputs=outputs)
            except Exception as exc:  # pragma: no cover
                from tooling.quality_gate import QualityIssue

                issues = [
                    QualityIssue(
                        code="quality_gate_exception",
                        message=f"Quality gate crashed: {type(exc).__name__}: {exc}",
                    )
                ]
            # Avoid confusing stale QUALITY_GATE.md after a successful run.
            report_path = workspace / "output" / "QUALITY_GATE.md"
            if issues or report_path.exists():
                write_quality_report(workspace=workspace, unit_id=unit_id, skill=skill, issues=issues)
            if issues:
                row["status"] = "BLOCKED"
                table.save(units_path)
                rel_report = str((workspace / "output" / "QUALITY_GATE.md").relative_to(workspace))
                update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (quality gate: {rel_report})")
                _refresh_status_checkpoint(status_path, table)
                reroute_hint = _reroute_hint(workspace)
                return RunResult(
                    unit_id=unit_id,
                    status="BLOCKED",
                    message=f"Quality gate failed; see {rel_report}" + (f"; {reroute_hint}" if reroute_hint else ""),
                )
        cutover_block = _section_first_cutover_block_message(workspace=workspace, outputs=outputs)
        if cutover_block:
            row["status"] = "BLOCKED"
            table.save(units_path)
            update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (section-first cutover)")
            _append_run_error(
                workspace=workspace,
                unit_id=unit_id,
                skill=skill,
                kind="section_first_cutover",
                message=cutover_block,
                log_rel=log_rel if log_path.exists() else None,
            )
            _refresh_status_checkpoint(status_path, table)
            return RunResult(unit_id=unit_id, status="BLOCKED", message=cutover_block)
        row["status"] = "DONE"
        table.save(units_path)
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE {skill}")
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="DONE", message="OK")

    # Failure paths: missing required outputs take precedence over a non-zero exit.
    row["status"] = "BLOCKED"
    table.save(units_path)
    if missing:
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (missing outputs: {', '.join(missing)})")
        _append_run_error(
            workspace=workspace,
            unit_id=unit_id,
            skill=skill,
            kind="missing_outputs",
            message=f"Missing outputs: {', '.join(missing)}",
            log_rel=log_rel if log_path.exists() else None,
        )
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Missing outputs: {', '.join(missing)}" + (f"; see {log_rel}" if log_path.exists() else ""))
    update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (script failed)")
    _append_run_error(
        workspace=workspace,
        unit_id=unit_id,
        skill=skill,
        kind="script_failed",
        message=f"Skill script failed (exit {completed.returncode})",
        log_rel=log_rel if log_path.exists() else None,
    )
    _refresh_status_checkpoint(status_path, table)
    return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Skill script failed (exit {completed.returncode})" + (f"; see {log_rel}" if log_path.exists() else ""))


def _find_first_runnable(table: UnitsTable) -> int | None:
    status_ok = {"DONE", "SKIP"}
    unit_by_id = {row.get("unit_id", ""): row for row in table.rows}
    for idx, row in enumerate(table.rows):
        if row.get("status", "").strip().upper() not in {"TODO", "BLOCKED"}:
            continue
        deps = parse_semicolon_list(row.get("depends_on"))
        if not deps:
            return idx
        # A unit is runnable only when every dependency exists and is DONE or SKIP.
        deps_done = True
        for dep_id in deps:
            dep = unit_by_id.get(dep_id)
            if not dep:
                deps_done = False
                break
            if dep.get("status", "").strip().upper() not in status_ok:
                deps_done = False
                break
        if deps_done:
            return idx
    return None


def _refresh_status_checkpoint(status_path: Path, table: UnitsTable) -> None:
    checkpoint = _compute_current_checkpoint(table)
    update_status_field(status_path, "Current checkpoint", checkpoint)


def _compute_current_checkpoint(table: UnitsTable) -> str:
    for row in table.rows:
        if row.get("status", "").strip().upper() not in {"DONE", "SKIP"}:
            return (row.get("checkpoint") or "").strip() or "C0"
    return "DONE"


def invalidate_downstream_units(table: UnitsTable, *, root_unit_id: str) -> list[str]:
    """Reset all transitive downstream dependents of `root_unit_id` to TODO.

    This is used when a previously satisfied upstream unit is reopened for rerun.
    Keeping downstream units as DONE would otherwise leave stale artifacts in place
    and make later `run` invocations stop too early.
    """
    root = str(root_unit_id or "").strip()
    if not root:
        return []
    # Map each unit id to the rows that list it in `depends_on`.
    direct_children: dict[str, list[dict[str, str]]] = {}
    for row in table.rows:
        for dep in parse_semicolon_list(row.get("depends_on")):
            direct_children.setdefault(dep, []).append(row)
    affected: list[str] = []
    seen: set[str] = set()
    stack = [root]
    # Depth-first walk over dependents; `seen` guards against revisits and cycles.
    while stack:
        current = stack.pop()
        for child in direct_children.get(current, []):
            child_id = str(child.get("unit_id") or "").strip()
            if not child_id or child_id in seen:
                continue
            seen.add(child_id)
            if str(child.get("status") or "").strip().upper() != "TODO":
                child["status"] = "TODO"
                affected.append(child_id)
            stack.append(child_id)
    return affected


def _strip_optional_marker(relpath: str) -> str:
    relpath = (relpath or "").strip()
    if relpath.startswith("?"):
        return relpath[1:].strip()
    return relpath


def _reroute_hint(workspace: Path) -> str:
    path = workspace / "output" / "REROUTE_STATE.json"
    if not path.exists() or path.stat().st_size <= 0:
        return ""
    try:
        import json

        data = json.loads(path.read_text(encoding="utf-8", errors="ignore") or "{}")
    except Exception:
        return ""
    if not isinstance(data, dict):
        return ""
    target = str(data.get("reroute_target") or "").strip()
    status = str(data.get("status") or "").strip()
    phase = str(data.get("structure_phase") or "").strip()
    h3 = str(data.get("h3_status") or "").strip()
    reason = str(data.get("reroute_reason") or "").strip()
    if not any([target, status, phase, h3]):
        return ""
    parts = []
    if status:
        parts.append(f"reroute_status={status}")
    if target:
        parts.append(f"reroute_target={target}")
    if phase:
        parts.append(f"structure_phase={phase}")
    if h3:
        parts.append(f"h3_status={h3}")
    if reason:
        parts.append(f"reason={reason}")
    return ", ".join(parts)
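
The interplay between dependency gating (`_find_first_runnable`) and downstream invalidation (`invalidate_downstream_units`) is easier to see on a toy unit table. The sketch below is illustrative only: it uses plain dict rows instead of `UnitsTable`, `parse_deps` is a hypothetical stand-in for `parse_semicolon_list` (assumed to split on `;` and drop blanks), and the unit ids are invented.

```python
from __future__ import annotations


def parse_deps(value: str | None) -> list[str]:
    # Stand-in for `parse_semicolon_list` (assumed: split on ';', drop blanks).
    return [part.strip() for part in (value or "").split(";") if part.strip()]


def first_runnable(rows: list[dict[str, str]]) -> int | None:
    """Index of the first TODO/BLOCKED row whose dependencies are all DONE or SKIP."""
    by_id = {row["unit_id"]: row for row in rows}
    for idx, row in enumerate(rows):
        if row["status"] not in {"TODO", "BLOCKED"}:
            continue
        deps = parse_deps(row.get("depends_on"))
        # An unknown dependency id counts as unsatisfied.
        if all(by_id.get(d, {}).get("status") in {"DONE", "SKIP"} for d in deps):
            return idx
    return None


def invalidate_downstream(rows: list[dict[str, str]], root: str) -> list[str]:
    """Reset every transitive dependent of `root` to TODO; return the ids that changed."""
    children: dict[str, list[dict[str, str]]] = {}
    for row in rows:
        for dep in parse_deps(row.get("depends_on")):
            children.setdefault(dep, []).append(row)
    affected: list[str] = []
    seen: set[str] = set()
    stack = [root]
    while stack:
        for child in children.get(stack.pop(), []):
            cid = child["unit_id"]
            if cid in seen:
                continue
            seen.add(cid)
            if child["status"] != "TODO":
                child["status"] = "TODO"
                affected.append(cid)
            stack.append(cid)
    return affected


# Hypothetical four-unit chain: U001 -> U002 -> U003 -> U004.
rows = [
    {"unit_id": "U001", "status": "DONE", "depends_on": ""},
    {"unit_id": "U002", "status": "DONE", "depends_on": "U001"},
    {"unit_id": "U003", "status": "BLOCKED", "depends_on": "U002"},
    {"unit_id": "U004", "status": "TODO", "depends_on": "U003"},
]
runnable_before = first_runnable(rows)   # BLOCKED units are retried once deps are satisfied
rows[0]["status"] = "TODO"               # reopen the upstream unit
reset = invalidate_downstream(rows, "U001")
runnable_after = first_runnable(rows)
```

Reopening `U001` here resets `U002` and `U003` but leaves `U004` out of the returned list because it was already TODO, which matches the "only report what changed" behavior of the function above.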
FILE:tooling/ideation.py
from __future__ import annotations
import json
import re
from dataclasses import asdict, dataclass, is_dataclass
from pathlib import Path
from typing import Any, Iterable
from tooling.common import (
    atomic_write_text,
    ensure_dir,
    load_workspace_pipeline_spec,
    pipeline_overridable_query_fields,
    pipeline_query_default,
    read_jsonl,
)

# (criterion, weight, guiding question); weights sum to 1.0.
DEFAULT_IDEA_RUBRIC: list[tuple[str, float, str]] = [
    ("discussion_worthiness", 0.24, "is this direction worth a serious PI/PhD discussion right now?"),
    ("academic_value", 0.22, "could it open a meaningful paper/thesis-worthy research angle?"),
    ("evidence_grounding", 0.18, "can we point to concrete literature tensions rather than pure speculation?"),
    ("direction_distinctness", 0.16, "is it genuinely different from neighboring directions, not just a wording variant?"),
    ("first_probe_clarity", 0.10, "is there a plausible low-cost first probe that would teach us something?"),
    ("thesis_potential", 0.10, "does it plausibly grow beyond a one-off note into a stronger line of work?"),
]

STOPWORDS = {
    "a", "an", "and", "the", "of", "to", "for", "in", "on", "with", "via", "by", "from", "or",
    "work", "works", "study", "studies", "survey", "review", "benchmarks", "benchmark", "evaluation",
    "agents", "agent", "model", "models", "llm", "large", "language", "using", "based",
}
CLUSTER_AXIS_HINTS: list[tuple[list[str], list[str], str, str]] = [
(["agent loop", "action"], ["observability granularity", "action-space design", "tool/environment boundary"], "mechanism", "Could sharpen how we think about the basic agent loop rather than adding yet another full stack."),
(["tool", "orchestration", "interface"], ["tool routing", "permission model", "API reliability assumptions"], "systems", "Could clarify whether interface design is the real bottleneck behind seemingly agent-level gains."),
(["planning", "reasoning"], ["search depth", "verification loop", "partial-observability handling"], "mechanism", "Could turn vague talk about reasoning quality into a sharper research question about what actually drives robust planning."),
(["memory", "retrieval", "rag"], ["retrieval policy", "state summarization cadence", "memory horizon"], "mechanism", "Could reveal when memory is a genuine capability lever versus a reporting convenience."),
(["self-improvement", "adaptation"], ["feedback type", "adaptation horizon", "evaluation signal quality"], "adaptation", "Could distinguish genuine improvement from evaluation-sensitive self-optimization."),
(["multi-agent", "coordination"], ["role specialization", "communication budget", "aggregation rule"], "coordination", "Could separate coordination benefits from simple redundancy or verification effects."),
(["benchmark", "evaluation"], ["metric choice", "task slice design", "failure-analysis lens"], "evaluation", "Could make evaluation choices themselves a research object instead of hidden background assumptions."),
(["safety", "security", "governance"], ["threat model", "monitoring granularity", "permission boundary"], "governance", "Could connect technical agent behaviors to deployment-relevant risk questions in a more principled way."),
]
GENERIC_LIMITATION_PATTERNS = [
"abstract-level evidence only",
"validate assumptions",
"full paper",
"before relying on this as key evidence",
]
BENCHMARK_HINTS = [
"HotpotQA", "FEVER", "ALFWorld", "WebShop", "HumanEval", "WebArena", "AgentBench",
"GSM8K", "MATH", "Wikipedia", "wiki", "API", "question answering", "fact verification",
]
AXIS_INSIGHT_LIBRARY: dict[str, dict[str, str]] = {
"observability granularity": {
"question": "whether several agent-loop gains are really planner gains, or whether they mostly come from changing what the planner gets to observe and when",
"confound": "planner quality and broader agent competence",
"thesis": "Several agent-loop gains remain hard to interpret because papers often improve observation access at the same time they improve the planner; before crediting planner depth, we should ask what the system was allowed to see.",
"insight": "A convincing result would not just move aggregate score. It would show which published gains survive once observation access is fixed, ideally producing a regime map of when observability—not planner depth—changes the failure story.",
"contribution_shape": "Could yield a causal-attribution result plus a reporting rule for agent-loop papers: claims about planning quality should specify and control observation access.",
"demotion": "This direction weakens sharply if the strongest prior work already holds observation access fixed while planner quality changes and still reports the same gain pattern.",
"kill_signal": "an anchor paper already fixes observation access while varying planner quality and the main conclusion still survives",
"missing_piece": "What is missing is a fixed-interface, fixed-budget comparison that varies only observation access and tracks whether the failure taxonomy—not just average score—changes.",
"program_kind": "causal attribution",
"time_to_clarity": "fast",
"priority_note": "Fastest to falsify on public tasks, and the answer would immediately change how several agent-loop results are interpreted.",
"reading_extract": "what the agent sees at each step, which ablations vary planner depth, and whether the action interface stays fixed",
"title": "Observability granularity vs planner depth",
},
"action-space design": {
"question": "whether robustness comes from better reasoning or from giving the agent a cleaner action interface",
"confound": "the shape of the action vocabulary and interface design",
"thesis": "Some agent-loop gains may be interface gains in disguise: when action vocabularies become cleaner or narrower, papers can attribute robustness to reasoning improvements that partly come from action-space design.",
"insight": "A useful result would show whether the same nominal planner still behaves very differently once the action space is widened, normalized, or made less ergonomic.",
"contribution_shape": "Could produce an action-space normalization protocol and a clearer account of which agent claims survive once interface ergonomics are controlled.",
"demotion": "This direction weakens if current papers already compare equivalent action spaces and still obtain the same ranking.",
"kill_signal": "the key anchor papers already normalize action vocabularies or API surfaces and still see the same ordering",
"missing_piece": "What is missing is an action-space-normalized comparison that keeps planner prompts fixed while changing only interface granularity and affordances.",
"program_kind": "interface normalization",
"time_to_clarity": "medium",
"priority_note": "Interesting and still thesis-relevant, but less immediate than observability because interface normalization usually requires more careful task redesign.",
"reading_extract": "how many actions are available, how semantically aligned the actions are, and whether interface cleanup happens together with reasoning changes",
"title": "Action-space design or agent competence?",
},
"tool/environment boundary": {
"question": "whether reliability depends more on boundary design between tool and environment than on the nominal reasoning loop",
"confound": "tool abstraction and environment modeling",
"thesis": "A number of agent results may really be about boundary design: where the system stops reasoning internally and starts delegating to tools can change reliability without changing the nominal planner much.",
"insight": "A strong result would show that several agent-level conclusions are actually boundary-design conclusions once tool abstraction is normalized.",
"contribution_shape": "Could yield a boundary-design taxonomy and a cleaner experimental recipe for separating planner quality from tool/interface scaffolding.",
"demotion": "This direction weakens if changing the boundary carefully barely affects the failure story.",
"kill_signal": "careful tool-boundary changes leave both the metric and the qualitative failure modes essentially unchanged",
"missing_piece": "What is missing is a comparison that keeps task semantics fixed while moving the tool/environment boundary in a controlled way.",
"program_kind": "systems boundary",
"time_to_clarity": "medium",
"priority_note": "Valuable as a systems-facing thesis line, but slower to turn into a decisive first readout than the lead mechanism questions.",
"reading_extract": "where tool calls begin, what state is exposed to the planner, and whether environment modeling changes together with the tool boundary",
"title": "Where the tool boundary really matters",
},
"search depth": {
"question": "whether apparent planning gains reflect deeper search or simply more inference-time budget to recover from weak initial choices",
"confound": "inference-time compute budget",
"thesis": "Many planning results still bundle depth with budget: deeper search often spends more tokens, branches, or retries, so it remains unclear whether depth changes reasoning quality or just buys more recovery opportunities.",
"insight": "A convincing result would show whether depth changes the nature of planning failures under a matched budget, rather than merely delaying failure by spending more compute.",
"contribution_shape": "Could produce a compute-normalized planning benchmark slice and a regime map for when search depth matters beyond extra inference budget.",
"demotion": "This direction weakens if prior work already equalizes budget and still finds depth-specific gains.",
"kill_signal": "the anchor papers already normalize token or wall-clock budget and the depth advantage remains intact",
"missing_piece": "What is missing is a compute-normalized study that holds token or wall-clock budget fixed while varying depth or branching, then checks whether failure modes actually change.",
"program_kind": "budget-normalized mechanism",
"time_to_clarity": "medium",
"priority_note": "Still thesis-sized, but it ranks behind observability because compute normalization is harder to defend and easier for prior work to have addressed already.",
"reading_extract": "the reported depth or branching settings, the effective compute budget, and whether shallow baselines were budget-matched",
"title": "Search depth or compute budget?",
},
"verification loop": {
"question": "whether verification contributes a distinct reasoning mechanism or mostly acts as expensive redundancy",
"confound": "extra compute and repeated checking",
"thesis": "Verification-heavy pipelines may look stronger because they retry, re-check, or filter more often, not necessarily because they add a distinct reasoning mechanism.",
"insight": "A useful result would separate verification as a mechanism from verification as a compute-expensive retry policy.",
"contribution_shape": "Could produce a cleaner protocol for comparing verification loops under matched retry or compute budgets.",
"demotion": "This direction weakens if verification can be replaced by simple repetition with little change in behavior.",
"kill_signal": "repeat-sampling or retry baselines already match the reported verification gain",
"missing_piece": "What is missing is a matched-budget comparison between explicit verification and simple repetition or reranking baselines.",
"program_kind": "verification audit",
"time_to_clarity": "medium",
"priority_note": "Decision-relevant when the group suspects redundancy masquerading as reasoning, though the probe is less clean than the top two directions.",
"reading_extract": "how many extra passes are spent on checking, whether retry baselines exist, and which errors verification actually fixes",
"title": "Verification or just expensive redundancy?",
},
"partial-observability handling": {
"question": "whether planning systems fail because they reason badly or because they are brittle under missing state information",
"confound": "state visibility",
"thesis": "Some planning failures may be epistemic before they are algorithmic: systems can look like weak planners when they are actually brittle under missing state information.",
"insight": "A strong result would show whether improved visibility changes the failure taxonomy more than planner changes do.",
"contribution_shape": "Could produce a planning-under-partial-observability benchmark slice and a clearer division between epistemic and algorithmic failure modes.",
"demotion": "This direction weakens if stronger visibility barely changes the failure pattern.",
"kill_signal": "improved visibility leaves both success rates and error types almost unchanged",
"missing_piece": "What is missing is a task slice that varies only state visibility while keeping planner scaffolding steady.",
"program_kind": "epistemic failure analysis",
"time_to_clarity": "medium",
"priority_note": "Useful if the group wants a deeper failure-analysis program, but it is currently less concrete than the lead directions.",
"reading_extract": "what information is hidden, which observations are restored by tools, and whether planner quality is tested separately from visibility",
"title": "Is planning failure really an observability problem?",
},
"retrieval policy": {
"question": "whether memory gains come from better stored knowledge or from better decisions about when and how retrieval is triggered",
"confound": "memory content versus retrieval timing",
"thesis": "Reported memory gains are often hard to interpret because memory content and retrieval triggers move together; some improvements may be protocol effects about when retrieval happens, not capability effects about what memory stores.",
"insight": "A convincing result would separate memory-content effects from retrieval-trigger effects and show whether the interpretation of current RAG-style gains changes once trigger policy is normalized.",
"contribution_shape": "Could yield a cleaner memory-evaluation protocol and a reusable result on when retrieval policy, rather than memory content, is the true hidden variable.",
"demotion": "This direction weakens if current work already varies retrieval policy independently of memory content and sees the same story.",
"kill_signal": "the main memory papers already keep the memory store fixed while varying retrieval triggers or timing and the conclusion does not move",
"missing_piece": "What is missing is a fixed-memory comparison that varies retrieval trigger and timing only, then checks whether the gain survives and which errors it actually removes.",
"program_kind": "protocol sensitivity",
"time_to_clarity": "medium",
"priority_note": "Different enough from the planning directions to stay in the lead set, but it ranks lower because the current evidence is thinner and more protocol-sensitive.",
"reading_extract": "what is stored in memory, when retrieval fires, what retrieval budget is allowed, and whether trigger policy is ever ablated independently",
"title": "Retrieval policy or memory content?",
},
"state summarization cadence": {
"question": "whether summarization helps because it improves reasoning or because it selectively compresses away troublesome state information",
"confound": "what gets forgotten versus what gets highlighted",
"thesis": "Summarization may help less because the agent reasons better and more because the system filters state in a favorable way; cadence choices can quietly redefine what information survives.",
"insight": "A useful result would show whether different summarization cadences preserve the same conclusions and failure taxonomy once memory content is held steady.",
"contribution_shape": "Could produce a summarization-control protocol for long-horizon agents and a clearer account of how state compression alters conclusions.",
"demotion": "This direction weakens if different summarization cadences preserve the same conclusions and failure taxonomy.",
"kill_signal": "changing summarization cadence leaves both retrieval behavior and downstream errors largely unchanged",
"missing_piece": "What is missing is a fixed-memory, fixed-task comparison that varies only summarization cadence and records what state is lost or amplified.",
"program_kind": "state compression",
"time_to_clarity": "medium",
"priority_note": "Promising but currently less mature than retrieval-policy framing because the concrete prior-work hooks are thinner.",
"reading_extract": "what gets summarized away, how often summaries are rewritten, and whether failure cases correlate with state compression choices",
"title": "What summarization cadence is really changing",
},
"memory horizon": {
"question": "whether longer memory helps because more context is useful or because evaluations reward persistence over selectivity",
"confound": "context length and evaluation design",
"thesis": "Longer memory can look like better capability even when the benchmark mainly rewards persistence or repeated access to earlier state, so horizon effects may partly be evaluation-design effects.",
"insight": "A convincing result would show whether extending horizon changes task framing and error patterns, not just aggregate score.",
"contribution_shape": "Could produce a regime map for when horizon length matters, versus when selective retrieval or better task design explains the gain.",
"demotion": "This direction weakens if memory horizon has little effect once retrieval policy is controlled.",
"kill_signal": "memory-length changes stop mattering once retrieval triggers or evaluation design are normalized",
"missing_piece": "What is missing is a horizon study that matches retrieval policy and benchmark framing while varying how much state is retained.",
"program_kind": "regime mapping",
"time_to_clarity": "medium",
"priority_note": "Conceptually useful, but currently less decisive than retrieval-policy framing for a first discussion round.",
"reading_extract": "whether longer context actually changes retrieval choices, and whether the benchmark rewards persistence more than selective recall",
"title": "Does longer memory really help?",
},
}


@dataclass(frozen=True)
class IdeaSignal:
    signal_id: str
    cluster: str
    direction_type: str
    theme: str
    claim_or_observation: str
    tension: str
    missing_piece: str
    possible_axis: str
    academic_value: str
    evidence_confidence: str
    paper_ids: list[str]


@dataclass(frozen=True)
class DirectionCard:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    focus_axis: str
    main_confound: str
    program_kind: str
    contribution_shape: str
    time_to_clarity: str
    one_line_thesis: str
    why_interesting: str
    literature_suggests: list[str]
    closest_prior_gap: list[str]
    missing_piece: str
    possible_variants: list[str]
    academic_value: str
    first_probes: list[str]
    what_counts_as_insight: str
    weakness_conditions: list[str]
    kill_criteria: list[str]
    what_would_change_mind: list[str]
    best_fit: str
    why_this_ranks_here: str
    evidence_confidence: str
    paper_ids: list[str]
    signal_ids: list[str]
    anchor_reading_notes: list[dict[str, str]]


@dataclass(frozen=True)
class ScreenedDirection:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    total_score: float
    discussion_worthiness: int
    academic_value_score: int
    evidence_grounding: int
    direction_distinctness: int
    first_probe_clarity: int
    thesis_potential: int
    recommendation: str
    rationale: str


def read_core_set(path: Path) -> list[dict[str, str]]:
    import csv

    if not path.exists():
        return []
    with path.open("r", encoding="utf-8", newline="") as handle:
        return [dict(row) for row in csv.DictReader(handle)]


def write_jsonl(path: Path, rows: Iterable[dict[str, Any] | Any]) -> None:
    ensure_dir(path.parent)
    encoded: list[str] = []
    for row in rows:
        value = asdict(row) if is_dataclass(row) else row
        encoded.append(json.dumps(value, ensure_ascii=False))
    atomic_write_text(path, "\n".join(encoded).rstrip() + ("\n" if encoded else ""))


def write_json(path: Path, data: Any) -> None:
    ensure_dir(path.parent)
    atomic_write_text(path, json.dumps(data, ensure_ascii=False, indent=2).rstrip() + "\n")


def uniq_keep_order(items: Iterable[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or "").strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out


def slugify(text: str) -> str:
    raw = re.sub(r"[^A-Za-z0-9]+", "-", str(text or "").strip().lower())
    return raw.strip("-") or "x"


def clean_text(text: str, *, limit: int = 220) -> str:
    s = str(text or "").strip()
    s = s.replace("\n", " ")
    s = re.sub(r"\s+", " ", s)
    s = s.replace("|", ", ")
    s = s.strip(" \"'`")
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()
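
The cleaning steps in `clean_text` can be checked in isolation: pipes are rewritten so the text cannot break a Markdown table row, surrounding quotes are stripped, and over-long text is clipped at the last space inside the limit. A minimal sketch, with the helper repeated so the snippet runs on its own; the sample strings are illustrative assumptions, not pipeline data.

```python
import re


def clean_text(text: str, *, limit: int = 220) -> str:
    # Same steps as the module helper, repeated for a standalone demo.
    s = str(text or "").strip()
    s = s.replace("\n", " ")
    s = re.sub(r"\s+", " ", s)
    s = s.replace("|", ", ")
    s = s.strip(" \"'`")
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()


# Pipes become ", " and the wrapping quotes are removed.
pipe_free = clean_text("'planning|memory'\n")
# Text longer than the limit is cut back to the last full word.
clipped = clean_text("alpha beta gamma delta", limit=12)
print(pipe_free)
print(clipped)
```

Because whitespace is collapsed before pipes are replaced, an input such as `a | b` keeps a double space after the comma; callers that care about exact spacing would need a second collapse.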


def clean_sentence(text: str, *, limit: int = 180) -> str:
    s = clean_text(text, limit=max(limit * 2, limit))
    if len(s) <= limit:
        return s
    window = s[:limit + 40]
    pieces = re.split(r'(?<=[.!?;])\s+', window)
    acc: list[str] = []
    total = 0
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue
        nxt = total + len(piece) + (1 if acc else 0)
        if nxt > limit and acc:
            break
        acc.append(piece)
        total = nxt
        if total >= int(limit * 0.65):
            break
    if acc:
        return " ".join(acc).strip()
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()


def markdown_table(headers: list[str], rows: list[list[str]]) -> str:
    head = "| " + " | ".join(headers) + " |"
    sep = "| " + " | ".join(["---"] * len(headers)) + " |"
    body = ["| " + " | ".join(str(cell).replace("\n", " ").strip() for cell in row) + " |" for row in rows]
    return "\n".join([head, sep] + body)
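
As a quick illustration of what `markdown_table` emits, the sketch below repeats the helper and renders two rows, one of which contains a newline and a non-string cell. The header names and row values are made up for the example.

```python
def markdown_table(headers: list[str], rows: list[list[str]]) -> str:
    # Same logic as the module helper, repeated for a standalone demo.
    head = "| " + " | ".join(headers) + " |"
    sep = "| " + " | ".join(["---"] * len(headers)) + " |"
    body = ["| " + " | ".join(str(cell).replace("\n", " ").strip() for cell in row) + " |" for row in rows]
    return "\n".join([head, sep] + body)


# Newlines inside a cell are flattened; non-string cells are stringified.
table = markdown_table(["ID", "Note"], [["SIG-1", "two\nlines "], [2, "ok"]])
print(table)
```

Pipes inside cells are not escaped here, which is presumably why text fields pass through `clean_text` before they reach a table.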


def write_markdown(path: Path, text: str) -> None:
    ensure_dir(path.parent)
    atomic_write_text(path, text.rstrip() + "\n")


def extract_goal_from_goal_md(path: Path) -> str:
    if not path.exists():
        return "research ideas"
    # Skip blank lines and Markdown headings, including headings with leading whitespace.
    lines = [
        ln.strip()
        for ln in path.read_text(encoding="utf-8", errors="ignore").splitlines()
        if ln.strip() and not ln.strip().startswith("#")
    ]
    return lines[0] if lines else "research ideas"


def _idea_workspace_from_brief(path: Path) -> Path | None:
    try:
        return path.resolve().parents[2]
    except Exception:
        return None


def _require_positive_float(value: Any, *, field_name: str) -> float:
    try:
        parsed = float(value)
    except Exception as exc:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
    if parsed <= 0:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return parsed


def _query_int_override(workspace: Path, key: str, default: int) -> int:
    queries_path = workspace / "queries.md"
    normalized = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
    if not normalized:
        return int(default)
    if normalized not in pipeline_overridable_query_fields(workspace):
        return int(default)
    if not queries_path.exists():
        return int(default)
    # Scan `- key: value` bullets; the first bullet whose key matches decides the result,
    # and an empty, non-integer, or non-positive value raises instead of falling back.
    for raw in queries_path.read_text(encoding="utf-8", errors="ignore").splitlines():
        line = raw.strip()
        if not line.startswith("- ") or ":" not in line:
            continue
        found_key, value = line[2:].split(":", 1)
        found_key = found_key.strip().lower().replace(" ", "_").replace("-", "_")
        if found_key != normalized:
            continue
        cleaned = value.split("#", 1)[0].strip().strip('"').strip("'")
        if not cleaned:
            raise ValueError(f"Invalid ideation override: `{normalized}` is empty in queries.md.")
        try:
            parsed = int(cleaned)
        except Exception as exc:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be an integer in queries.md.") from exc
        if parsed <= 0:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be a positive integer in queries.md.")
        return parsed
    return int(default)
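
The override lookup expects `queries.md` bullets of the form `- key: value  # optional comment`. The sketch below isolates just the line-parsing step; `parse_override_line` is a hypothetical helper introduced for illustration, and it deliberately leaves out the allow-list check against `pipeline_overridable_query_fields` and the integer validation that the real function performs.

```python
from typing import Optional


def parse_override_line(raw: str) -> Optional[tuple[str, str]]:
    """Hypothetical helper: split one `- key: value  # comment` bullet."""
    line = raw.strip()
    if not line.startswith("- ") or ":" not in line:
        return None  # not an override bullet
    key, value = line[2:].split(":", 1)
    # Keys are matched case-insensitively, with spaces and hyphens folded to underscores.
    key = key.strip().lower().replace(" ", "_").replace("-", "_")
    # Trailing comments and surrounding quotes are dropped from the value.
    cleaned = value.split("#", 1)[0].strip().strip('"').strip("'")
    return key, cleaned


plain = parse_override_line("- direction_pool_min: 8  # raise for broad topics")
quoted = parse_override_line('- Direction-Pool Max: "12"')
heading = parse_override_line("## Notes")
print(plain, quoted, heading)
```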


def _validate_score_weights(value: Any, *, field_name: str) -> dict[str, float]:
    required = {
        "discussion_worthiness": 0.24,
        "academic_value": 0.22,
        "evidence_grounding": 0.18,
        "direction_distinctness": 0.16,
        "first_probe_clarity": 0.10,
        "thesis_potential": 0.10,
    }
    if not isinstance(value, dict):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    weights: dict[str, float] = {}
    for key in required:
        if key not in value:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{key}")
        weights[key] = _require_positive_float(value.get(key), field_name=f"{field_name}.{key}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return {key: round(weight / total, 6) for key, weight in weights.items()}
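
Only the keys of `required` are consulted; the contract supplies the actual values, which are then rescaled to sum to one. A minimal sketch of that normalization step, assuming a contract that states the weights on a 0-100 scale (the numbers are illustrative, and `normalize_weights` is a hypothetical stand-in for the final line of the validator):

```python
def normalize_weights(weights: dict[str, float]) -> dict[str, float]:
    # Rescale positive weights so they sum to 1, rounded to six decimals.
    total = sum(weights.values())
    return {key: round(weight / total, 6) for key, weight in weights.items()}


raw_weights = {
    "discussion_worthiness": 24,
    "academic_value": 22,
    "evidence_grounding": 18,
    "direction_distinctness": 16,
    "first_probe_clarity": 10,
    "thesis_potential": 10,
}
normalized = normalize_weights(raw_weights)
print(normalized)
```

Normalizing at validation time means a contract can express weights on any positive scale without changing the relative ranking they produce.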


def _validate_diversity_axes(value: Any, *, field_name: str) -> list[str]:
    allowed = {"cluster", "direction_type", "program_kind"}
    if not isinstance(value, list):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    axes: list[str] = []
    seen: set[str] = set()
    for item in value:
        axis = str(item or "").strip().lower().replace("-", "_")
        if not axis:
            continue
        if axis not in allowed:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{axis}")
        if axis in seen:
            continue
        seen.add(axis)
        axes.append(axis)
    if not axes:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return axes


def resolve_idea_contract(workspace: Path) -> dict[str, Any]:
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None:
        raise ValueError("Missing active pipeline contract.")
    query_defaults = dict(spec.query_defaults)
    quality_contract = dict(spec.quality_contract)

    def _require_positive_int(value: Any, *, field_name: str) -> int:
        try:
            parsed = int(value)
        except Exception as exc:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
        if parsed <= 0:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
        return parsed

    signal_policy = quality_contract.get("signal_policy") or {}
    direction_policy = quality_contract.get("direction_policy") or {}
    screening_policy = quality_contract.get("screening_policy") or {}
    brief = parse_idea_brief(workspace / "output" / "trace" / "IDEA_BRIEF.md")
    focus_clusters = [str(x).strip() for x in (brief.get("focus_clusters") or []) if str(x).strip()]

    # Contract defaults: every field must be present and a positive integer.
    direction_pool_min_default = _require_positive_int(query_defaults.get("direction_pool_min"), field_name="query_defaults.direction_pool_min")
    direction_pool_max_default = _require_positive_int(query_defaults.get("direction_pool_max"), field_name="query_defaults.direction_pool_max")
    shortlist_size_default = _require_positive_int(query_defaults.get("idea_shortlist_size"), field_name="query_defaults.idea_shortlist_size")
    report_top_n_default = _require_positive_int(query_defaults.get("report_top_n"), field_name="query_defaults.report_top_n")
    idea_screen_top_n_default = _require_positive_int(query_defaults.get("idea_screen_top_n"), field_name="query_defaults.idea_screen_top_n")
    shortlist_min = _require_positive_int(direction_policy.get("shortlist_min"), field_name="quality_contract.direction_policy.shortlist_min")
    shortlist_max = _require_positive_int(direction_policy.get("shortlist_max"), field_name="quality_contract.direction_policy.shortlist_max")
    keep_min = _require_positive_int(direction_policy.get("keep_min"), field_name="quality_contract.direction_policy.keep_min")
    cluster_diversity_min = _require_positive_int(direction_policy.get("cluster_diversity_min"), field_name="quality_contract.direction_policy.cluster_diversity_min")
    lead_diversity_target = _require_positive_int(direction_policy.get("lead_diversity_target"), field_name="quality_contract.direction_policy.lead_diversity_target")
    lead_diversity_axes = _validate_diversity_axes(
        direction_policy.get("lead_diversity_axes"),
        field_name="quality_contract.direction_policy.lead_diversity_axes",
    )
    signal_table_min = _require_positive_int(signal_policy.get("min_rows"), field_name="quality_contract.signal_policy.min_rows")
    keep_rank_max = _require_positive_int(screening_policy.get("keep_rank_max"), field_name="quality_contract.screening_policy.keep_rank_max")
    maybe_rank_max = _require_positive_int(screening_policy.get("maybe_rank_max"), field_name="quality_contract.screening_policy.maybe_rank_max")
    score_weights = _validate_score_weights(
        screening_policy.get("score_weights"),
        field_name="quality_contract.screening_policy.score_weights",
    )

    # Per-workspace overrides from queries.md, falling back to the contract defaults.
    direction_pool_min = _query_int_override(workspace, "direction_pool_min", direction_pool_min_default)
    direction_pool_max = _query_int_override(workspace, "direction_pool_max", direction_pool_max_default)
    shortlist_size = _query_int_override(workspace, "idea_shortlist_size", shortlist_size_default)
    report_top_n = _query_int_override(workspace, "report_top_n", report_top_n_default)
    idea_screen_top_n = _query_int_override(workspace, "idea_screen_top_n", idea_screen_top_n_default)

    # Cross-field consistency checks on the resolved values.
    if direction_pool_min > direction_pool_max:
        raise ValueError(f"Invalid ideation contract: direction_pool_min ({direction_pool_min}) exceeds direction_pool_max ({direction_pool_max}).")
    if shortlist_size < shortlist_min or shortlist_size > shortlist_max:
        raise ValueError(
            f"Invalid ideation contract: idea_shortlist_size ({shortlist_size}) must stay within "
            f"[{shortlist_min}, {shortlist_max}]."
        )
    if report_top_n > shortlist_size:
        raise ValueError(f"Invalid ideation contract: report_top_n ({report_top_n}) exceeds idea_shortlist_size ({shortlist_size}).")
    if maybe_rank_max < keep_rank_max:
        raise ValueError(
            f"Invalid ideation contract: maybe_rank_max ({maybe_rank_max}) must be >= keep_rank_max ({keep_rank_max})."
        )
    if idea_screen_top_n > direction_pool_max:
        raise ValueError(f"Invalid ideation contract: idea_screen_top_n ({idea_screen_top_n}) exceeds direction_pool_max ({direction_pool_max}).")
    return {
        "focus_clusters": focus_clusters,
        "direction_pool_min": direction_pool_min,
        "direction_pool_max": direction_pool_max,
        "idea_screen_top_n": idea_screen_top_n,
        "shortlist_size": shortlist_size,
        "report_top_n": report_top_n,
        "shortlist_min": shortlist_min,
        "shortlist_max": shortlist_max,
        "signal_table_min": signal_table_min,
        "keep_min": keep_min,
        "cluster_diversity_min": cluster_diversity_min,
        "lead_diversity_target": lead_diversity_target,
        "lead_diversity_axes": lead_diversity_axes,
        "keep_rank_max": keep_rank_max,
        "maybe_rank_max": maybe_rank_max,
        "score_weights": score_weights,
    }


def parse_idea_brief(path: Path) -> dict[str, Any]:
    text = path.read_text(encoding="utf-8", errors="ignore") if path.exists() else ""
    out: dict[str, Any] = {
        "goal": extract_goal_from_goal_md(path),
        "focus_clusters": [],
        "query_buckets": [],
        "exclusions": [],
        "constraints": [],
        "targets": {},
    }
    cur = None  # lower-cased title of the current `## ` section
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith("## "):
            cur = line[3:].strip().lower()
            continue
        if cur == "goal" and line.startswith("- Topic:"):
            out["goal"] = line.split(":", 1)[1].strip()
        if cur in {"focus after c2", "focus lenses after c2"} and line.startswith("- Focus clusters:"):
            clusters = [x.strip() for x in line.split(":", 1)[1].split(";") if x.strip()]
            cleaned: list[str] = []
            for item in clusters:
                low = item.lower()
                # Drop template placeholders that were never filled in.
                if "to be filled" in low or "fill after c2" in low or "placeholder" in low:
                    continue
                cleaned.append(item)
            out["focus_clusters"] = cleaned
        if cur == "query buckets" and re.match(r"^\d+\.\s+", line):
            out["query_buckets"].append(re.sub(r"^\d+\.\s+", "", line).strip())
        if cur in {"exclude terms", "exclusions"} and line.startswith("- "):
            value = line[2:].strip()
            if value and not value.lower().startswith("none"):
                out["exclusions"].append(value)
        if cur == "constraints" and line.startswith("- "):
            out["constraints"].append(line[2:].strip())
        if cur == "targets" and line.startswith("- ") and ":" in line:
            key, value = line[2:].split(":", 1)
            out["targets"][key.strip().lower().replace(" ", "_")] = value.strip()
    return out


def collect_note_index(path: Path) -> dict[str, dict[str, Any]]:
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        pid = str(rec.get("paper_id") or "").strip()
        if pid:
            out[pid] = rec
    return out


def keywords_from_cluster(name: str) -> list[str]:
    toks = [t for t in re.findall(r"[A-Za-z0-9]+", str(name or "").lower()) if t not in STOPWORDS and len(t) > 2]
    return uniq_keep_order(toks)


def score_note_to_cluster(cluster_name: str, note: dict[str, Any]) -> int:
    keys = keywords_from_cluster(cluster_name)
    blob_parts = [str(note.get("title") or "")]
    for field in ["summary_bullets", "limitations", "key_results", "method", "abstract"]:
        val = note.get(field)
        if isinstance(val, list):
            blob_parts.extend([str(x) for x in val])
        elif isinstance(val, str):
            blob_parts.append(val)
    blob = " ".join(blob_parts).lower()
    return sum(1 for key in keys if key in blob)
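
The score is a count of cluster keywords that occur anywhere in the note text, using substring matching rather than whole-word matching. A simplified sketch of that behavior follows; the `STOPWORDS` set here is an assumed stand-in for the module-level constant, and `score` takes a plain string instead of a note dict.

```python
import re

STOPWORDS = {"and", "the", "for", "with"}  # assumption: stand-in for the module's set


def keywords_from_cluster(name: str) -> list[str]:
    toks = [t for t in re.findall(r"[A-Za-z0-9]+", name.lower()) if t not in STOPWORDS and len(t) > 2]
    return list(dict.fromkeys(toks))  # de-duplicate while keeping order


def score(cluster_name: str, note_text: str) -> int:
    blob = note_text.lower()
    return sum(1 for key in keywords_from_cluster(cluster_name) if key in blob)


keys = keywords_from_cluster("Memory and Retrieval Policy")
hit = score("Memory and Retrieval Policy", "We ablate retrieval triggers with a fixed memory store.")
# Substring matching also counts partial-word hits: "use" inside "misused", "tool" inside "tooling".
partial = score("Tool use", "misused tooling")
print(keys, hit, partial)
```

The partial-word hits mean short keywords can inflate a note's score, so the top-8 cut in `map_notes_to_clusters` is best read as a coarse relevance filter rather than a precise match.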


def map_notes_to_clusters(taxonomy_path: Path, notes_path: Path) -> dict[str, list[dict[str, Any]]]:
    import yaml

    taxonomy = yaml.safe_load(taxonomy_path.read_text(encoding="utf-8", errors="ignore")) if taxonomy_path.exists() else []
    notes = [r for r in read_jsonl(notes_path) if isinstance(r, dict)]
    out: dict[str, list[dict[str, Any]]] = {}
    for top in taxonomy or []:
        if not isinstance(top, dict):
            continue
        for child in top.get("children") or []:
            if not isinstance(child, dict):
                continue
            name = str(child.get("name") or "").strip()
            if not name:
                continue
            scored: list[tuple[int, dict[str, Any]]] = []
            for note in notes:
                s = score_note_to_cluster(name, note)
                if s > 0:
                    scored.append((s, note))
            # Highest score first; ties are broken by paper_id for deterministic output.
            scored.sort(key=lambda x: (-x[0], str(x[1].get("paper_id") or "")))
            out[name] = [n for _, n in scored[:8]]
    return out


def _cluster_profile(cluster: str) -> tuple[str, list[str], str]:
    low = cluster.lower()
    for keys, axes, direction_type, academic_value in CLUSTER_AXIS_HINTS:
        if any(key in low for key in keys):
            return direction_type, axes, academic_value
    return "research", ["assumption sensitivity", "failure analysis", "scope boundary"], "Could sharpen the way this sub-area is framed and compared."


def _note_bullets(note: dict[str, Any], field: str) -> list[str]:
    val = note.get(field)
    if isinstance(val, list):
        return [clean_text(x, limit=220) for x in val if clean_text(x, limit=220)]
    if isinstance(val, str) and clean_text(val, limit=220):
        return [clean_text(val, limit=220)]
    return []


def _specific_limitations(note: dict[str, Any]) -> list[str]:
    items = _note_bullets(note, "limitations")
    specific = []
    for item in items:
        low = item.lower()
        if any(pat in low for pat in GENERIC_LIMITATION_PATTERNS):
            continue
        specific.append(item)
    # Fall back to the unfiltered list when every limitation is generic.
    return specific or items


def _evidence_confidence(notes: list[dict[str, Any]]) -> str:
    if not notes:
        return "low"
    if any(str(n.get("evidence_level") or "").strip() == "fulltext" for n in notes):
        return "medium-high"
    specific = sum(
        1
        for n in notes
        if _specific_limitations(n) and not any(p in (_specific_limitations(n)[0].lower()) for p in GENERIC_LIMITATION_PATTERNS)
    )
    if len(notes) >= 3 and specific >= 2:
        return "medium"
    return "low-medium"


def _axis_profile(axis: str, cluster: str) -> dict[str, str]:
    base = AXIS_INSIGHT_LIBRARY.get(axis, {})
    confound = base.get("confound", "nearby design choices and evaluation framing")
    return {
        "question": base.get("question", f"whether {axis} is doing more conceptual work in {cluster.lower()} than current papers make explicit"),
        "confound": confound,
        "thesis": base.get("thesis", f"Current results in {cluster.lower()} may be hard to interpret because {axis} still moves together with {confound}."),
        "insight": base.get("insight", f"A meaningful result would show whether {axis} changes how we interpret the current literature rather than merely how we narrate it."),
        "contribution_shape": base.get("contribution_shape", f"Could turn {axis} into a cleaner explanatory variable for {cluster.lower()} rather than a background convenience."),
        "demotion": base.get("demotion", f"This direction would weaken if the strongest prior work already isolates {axis} cleanly."),
        "kill_signal": base.get("kill_signal", f"the strongest prior work already isolates {axis} against {confound}"),
        "missing_piece": base.get("missing_piece", f"What is missing is a cleaner comparison that varies {axis} while keeping {confound} fixed enough to change interpretation, not just presentation."),
        "program_kind": base.get("program_kind", "mechanism clarification"),
        "time_to_clarity": base.get("time_to_clarity", "medium"),
        "priority_note": base.get("priority_note", f"Worth discussion because it could turn {axis} into a sharper explanatory wedge for {cluster.lower()}."),
        "reading_extract": base.get("reading_extract", f"which variables change together with {axis}, what comparator is used, and whether any ablation already fixes {confound}"),
        "title": base.get("title", f"What {axis} is really doing"),
    }


def _result_fact(note: dict[str, Any]) -> str:
    candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
    scored: list[str] = []
    for cand in candidates:
        low = cand.lower()
        if any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "outperform", "achiev", "%"]) or any(ch.isdigit() for ch in cand):
            scored.append(cand)
    chosen = scored[0] if scored else _claim_text(note)
    chosen = re.sub(r"^(for instance|for example|e\.g\.)[:,]?\s*", "", chosen, flags=re.IGNORECASE)
    low = chosen.lower()
    tasks = _extract_task_mentions(note)
    task_text = "/".join(tasks[:2]) if tasks else "reported setting"
    percents = re.findall(r"\d+(?:\.\d+)?%", chosen)
    if "pass@1" in low and percents:
        if len(percents) >= 2:
            return clean_text(f"{task_text}: {percents[0]} pass@1 vs {percents[1]} in the reported comparison.", limit=95)
        return clean_text(f"{task_text}: {percents[0]} pass@1 in the reported comparison.", limit=95)
    if "success rate" in low and len(percents) >= 2:
        baseline = " over imitation/RL baselines" if "imitation" in low or "reinforcement learning" in low else " in the reported comparison"
        return clean_text(f"{task_text}: {percents[0]}/{percents[1]} success-rate gains{baseline}.", limit=95)
    if len(percents) >= 2:
        return clean_text(f"{task_text}: {percents[0]} vs {percents[1]} in the reported comparison.", limit=95)
    if len(percents) == 1:
        return clean_text(f"{task_text}: reported result reaches {percents[0]} in the abstract.", limit=95)
    return clean_text(chosen, limit=95)


def _metric_phrase(notes: list[dict[str, Any]]) -> str:
    blob = " ".join(_result_fact(note) for note in notes)
    low = blob.lower()
    if "pass@1" in low:
        return "pass@1 plus failure-type shifts"
    if "success rate" in low:
        return "success rate plus failure-type shifts"
    if "accuracy" in low:
        return "accuracy plus failure-type shifts"
    if "exact match" in low:
        return "exact match plus failure-type shifts"
    if "f1" in low:
        return "F1 plus failure-type shifts"
    if "%" in blob:
        return "the reported benchmark metric plus failure-type shifts"
    return "task success plus failure-type shifts"


def _sentence_has_concrete_hook(text: str) -> bool:
    low = str(text or "").lower()
    if any(ch.isdigit() for ch in str(text or "")):
        return True
    if any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "alfworld", "webshop", "hotpotqa", "fever", "humaneval", "webarena", "agentbench"]):
        return True
    return False


def _extract_task_mentions(note: dict[str, Any]) -> list[str]:
    blob_parts = []
    for field in ["key_results", "summary_bullets", "method", "abstract", "title"]:
        val = note.get(field)
        if isinstance(val, list):
            blob_parts.extend([str(x) for x in val])
        elif isinstance(val, str):
            blob_parts.append(val)
    blob = " ".join(blob_parts)
    low = blob.lower()
    positions: list[tuple[int, str]] = []
    for hint in BENCHMARK_HINTS:
        pos = low.find(hint.lower())
        if pos >= 0:
            positions.append((pos, hint))
    for match in re.finditer(r"\b(?:HumanEval|ALFWorld|WebShop|HotpotQA|FEVER|WebArena|AgentBench|GSM8K|MATH)\b", blob):
        positions.append((match.start(), match.group(0)))
    # Order mentions by first appearance in the text, then de-duplicate.
    positions.sort(key=lambda item: (item[0], item[1]))
    return uniq_keep_order(name for _, name in positions)


def _task_phrase(notes: list[dict[str, Any]]) -> str:
    tasks: list[str] = []
    for note in notes:
        tasks.extend(_extract_task_mentions(note))
    uniq = uniq_keep_order(tasks)
    if not uniq:
        return "a small public task slice"
    if len(uniq) == 1:
        return uniq[0]
    return "/".join(uniq[:2])


def _claim_text(note: dict[str, Any]) -> str:
    candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
    for cand in candidates:
        if len(cand.split()) >= 8:
            return cand
    return clean_text(note.get("method") or note.get("title") or "", limit=220)


def _limitation_text(note: dict[str, Any], axis: str, cluster: str) -> str:
    limitations = _specific_limitations(note)
    if limitations and not any(pat in limitations[0].lower() for pat in GENERIC_LIMITATION_PATTERNS):
        return limitations[0]
    profile = _axis_profile(axis, cluster)
    return f"The paper still leaves unclear whether {axis} is genuinely responsible for the reported story in {cluster.lower()}, or whether that story is really driven by {profile['confound']}."


def _anchor_notes(note_index: dict[str, dict[str, Any]], paper_ids: list[str]) -> list[dict[str, Any]]:
    return [note_index[pid] for pid in paper_ids if pid in note_index]


def _paper_annotation(note: dict[str, Any], axis: str, cluster: str) -> str:
    title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
    profile = _axis_profile(axis, cluster)
    result = _result_fact(note)
    return clean_sentence(
        f"{title}: {result} Open gap: {axis} still moves with {profile['confound']}.",
        limit=300,
    )


def _synthesis_annotation(notes: list[dict[str, Any]], axis: str, cluster: str) -> str:
    profile = _axis_profile(axis, cluster)
    task_phrase = _task_phrase(notes)
    return clean_sentence(
        f"Across the anchor papers, the live question is whether {axis} changes interpretation itself or merely rides along with {profile['confound']}, especially on {task_phrase}.",
        limit=260,
    )


def build_signal_rows(*, cluster: str, notes: list[dict[str, Any]]) -> list[IdeaSignal]:
    if not notes:
        return []
    direction_type, axes, academic_value = _cluster_profile(cluster)
    evidence_confidence = _evidence_confidence(notes)
    anchors = notes[:3]
    paper_ids = uniq_keep_order(str(n.get("paper_id") or "").strip() for n in anchors)
    signals: list[IdeaSignal] = []
    # One signal per axis (at most three), cycling through the anchor notes.
    for idx, axis in enumerate(axes[:3], start=1):
        anchor = anchors[(idx - 1) % len(anchors)]
        claim = _claim_text(anchor)
        tension = _limitation_text(anchor, axis, cluster)
        profile = _axis_profile(axis, cluster)
        missing_piece = clean_sentence(
            f"What is still missing is a direction that isolates whether {axis} is the real explanatory variable in {cluster.lower()}, rather than leaving it entangled with {profile['confound']}.",
            limit=240,
        )
        signals.append(IdeaSignal(
            signal_id=f"SIG-{slugify(cluster)[:16]}-{idx}",
            cluster=cluster,
            direction_type=direction_type,
            theme=f"{profile['title']} in {cluster}",
            claim_or_observation=claim,
            tension=clean_text(tension, limit=220),
            missing_piece=missing_piece,
            possible_axis=axis,
            academic_value=academic_value,
            evidence_confidence=evidence_confidence,
            paper_ids=paper_ids,
        ))
    return signals


def signal_table_markdown(rows: list[IdeaSignal]) -> str:
    table_rows = []
    for row in rows:
        table_rows.append([
            row.signal_id,
            row.cluster,
            row.theme,
            row.claim_or_observation,
            row.tension,
            row.missing_piece,
            row.possible_axis,
            row.academic_value,
            row.evidence_confidence,
            ", ".join(row.paper_ids),
        ])
    return "\n".join([
        "# IDEA_SIGNAL_TABLE",
        "",
        "## Research signals",
        "",
        markdown_table(["Signal ID", "Cluster", "Theme", "Claim / observation", "Tension", "Missing piece", "Possible axis", "Academic value", "Confidence", "Paper IDs"], table_rows),
        "",
    ])


def _title_from_signal(signal: IdeaSignal) -> str:
    profile = _axis_profile(signal.possible_axis, signal.cluster)
    return clean_sentence(profile["title"], limit=72)


def _best_fit(direction_type: str, axis: str) -> str:
    if direction_type == "governance":
        return "Best fit when the group wants a deployment-relevant reading of agent behavior rather than another internal benchmark story."
    if direction_type == "evaluation":
        return "Best fit when the discussion is really about what counts as convincing evidence, not just how to raise scores."
    if axis in {"search depth", "verification loop", "retrieval policy"}:
        return "Best fit for a PhD discussion that wants one sharper mechanism/confound question rather than a broad systems buildout."
    return "Best fit for PI/PhD discussions that want a thesis-worthy mechanism question rather than a generic benchmark wrapper."


def signals_to_direction_cards(signals: list[IdeaSignal], *, note_index: dict[str, dict[str, Any]], focus_clusters: list[str], pool_min: int, pool_max: int) -> list[DirectionCard]:
    focus = {x.strip() for x in focus_clusters if str(x).strip()}
    cards: list[DirectionCard] = []
    for idx, signal in enumerate(signals, start=1):
        profile = _axis_profile(signal.possible_axis, signal.cluster)
        anchors = _anchor_notes(note_index, signal.paper_ids)
        task_phrase = _task_phrase(anchors)
        metric_phrase = _metric_phrase(anchors)
        nearest = anchors[0] if anchors else {}
        nearest_title = clean_text((nearest.get("title") if isinstance(nearest, dict) else "") or (signal.paper_ids[0] if signal.paper_ids else signal.cluster), limit=90)
        nearest_task_phrase = _task_phrase([nearest]) if nearest else task_phrase
        literature_suggests = [_paper_annotation(note, signal.possible_axis, signal.cluster) for note in anchors[:2]]
        literature_suggests.append(_synthesis_annotation(anchors, signal.possible_axis, signal.cluster))
        closest_prior_gap = [
            clean_sentence(
                f"{nearest_title} is the closest prior anchor because it already reports concrete behavior on {nearest_task_phrase}. The unresolved point is whether those gains survive once {profile['confound']} is held fixed while {signal.possible_axis} is varied.",
                limit=220,
            ),
            clean_sentence(
                "The novelty test here is narrow, not rhetorical: if a strong anchor paper already runs that single-variable control, this direction should collapse quickly rather than stay alive as a vague confound story.",
                limit=220,
            ),
        ]
        one_line_thesis = clean_sentence(profile["thesis"], limit=230)
        why_interesting = clean_sentence(
            f"This is not just another benchmark wedge. It opens a {profile['program_kind']} line around {signal.possible_axis}: {profile['question']}.",
            limit=220,
        )
        missing_piece = clean_sentence(profile["missing_piece"], limit=230)
        possible_variants = [
            clean_sentence(f"Single-variable control on {task_phrase}: vary {signal.possible_axis} while holding {profile['confound']} fixed as far as the setup allows.", limit=210),
            clean_sentence("Replace score-only reporting with a failure-type comparison after the same intervention, so the result says more than whether the average went up.", limit=210),
            clean_sentence("Check whether the conclusion survives on one simple public task slice and one more tool- or environment-heavy setting before treating it as a general claim.", limit=210),
        ]
        first_probes = [
            clean_sentence(
                f"Intervention: vary {signal.possible_axis} while holding {profile['confound']} as fixed as possible on {task_phrase}. Readout: {metric_phrase}. Decisive if the interpretation changes even after the control.",
                limit=220,
            ),
            clean_sentence(
                f"Prior-work audit: inspect {nearest_title} for any ablation that already fixes {profile['confound']}, and if the conclusion survives, demote this direction.",
                limit=180,
            ),
        ]
        what_counts_as_insight = clean_sentence(profile['insight'], limit=190)
        kill_criteria = [
            clean_sentence(f"Kill quickly if {profile['kill_signal']}.", limit=170),
            clean_sentence(f"Kill if the first controlled probe leaves both {metric_phrase} and the failure taxonomy essentially unchanged.", limit=170),
        ]
        weakness_conditions = [
            clean_sentence(f"This direction is weaker if changing {signal.possible_axis} mostly rescales the aggregate metric without changing which failure modes appear or disappear.", limit=175),
            clean_sentence(profile['demotion'], limit=170),
        ]
        anchor_reading_notes: list[dict[str, str]] = []
        for note in anchors[:3]:
            title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
            result = _result_fact(note)
            anchor_reading_notes.append({
                "paper_title": title,
                "why_read": clean_sentence(f"Closest {signal.cluster.lower()} anchor with a concrete result hook on {task_phrase}: {result}", limit=180),
                "what_to_extract": clean_sentence(f"Extract the metric and comparator, then check whether {profile['reading_extract']}.", limit=180),
                "current_hook": clean_sentence(result, limit=180),
                "kill_signal": clean_sentence(f"Weaken this direction if the paper already shows {profile['kill_signal']}.", limit=180),
            })
        why_this_ranks_here = clean_sentence(profile['priority_note'], limit=170)
        cards.append(DirectionCard(
            direction_id=f"DIR-{idx:03d}",
            cluster=signal.cluster,
            direction_type=signal.direction_type,
            title=_title_from_signal(signal),
            focus_axis=signal.possible_axis,
            main_confound=profile['confound'],
            program_kind=profile['program_kind'],
            contribution_shape=profile['contribution_shape'],
            time_to_clarity=profile['time_to_clarity'],
            one_line_thesis=one_line_thesis,
            why_interesting=why_interesting,
            literature_suggests=literature_suggests,
            closest_prior_gap=closest_prior_gap,
            missing_piece=missing_piece,
            possible_variants=possible_variants,
            academic_value=profile['contribution_shape'],
            first_probes=first_probes,
            what_counts_as_insight=what_counts_as_insight,
weakness_conditions=weakness_conditions,
kill_criteria=kill_criteria,
what_would_change_mind=kill_criteria,
best_fit=_best_fit(signal.direction_type, signal.possible_axis),
why_this_ranks_here=why_this_ranks_here,
evidence_confidence=signal.evidence_confidence,
paper_ids=signal.paper_ids,
signal_ids=[signal.signal_id],
anchor_reading_notes=anchor_reading_notes,
))
cards.sort(key=lambda c: (0 if c.cluster in focus else 1, c.cluster, c.program_kind, c.title))
target = max(pool_min, min(pool_max, len(cards)))
return cards[:target]
def direction_pool_markdown(cards: list[DirectionCard]) -> str:
rows: list[list[str]] = []
for card in cards:
rows.append([
card.direction_id,
card.cluster,
card.direction_type,
card.program_kind,
card.title,
clean_sentence(card.one_line_thesis, limit=100),
clean_sentence(card.why_interesting, limit=100),
clean_sentence(card.missing_piece, limit=100),
clean_sentence(" / ".join(card.possible_variants), limit=120),
clean_sentence(card.academic_value, limit=90),
clean_sentence(" / ".join(card.first_probes), limit=120),
card.evidence_confidence,
", ".join(card.paper_ids),
])
return "\n".join([
"# IDEA_DIRECTION_POOL",
"",
"## Candidate research directions",
"",
markdown_table(["Direction ID", "Cluster", "Type", "Program", "Title", "One-line thesis", "Why interesting", "Missing piece", "Possible variants", "Academic value", "First probes", "Confidence", "Paper IDs"], rows),
"",
])
def _thesis_potential(direction_type: str, cluster: str) -> int:
    if direction_type in {"mechanism", "coordination", "adaptation"}:
        return 5
    if direction_type in {"systems", "governance"}:
        return 4
    if "benchmark" in cluster.lower() or direction_type == "evaluation":
        return 3
    return 4
def score_direction_cards(
cards: list[DirectionCard],
*,
focus_clusters: list[str],
keep_rank_max: int,
maybe_rank_max: int,
score_weights: dict[str, float] | None = None,
) -> list[ScreenedDirection]:
focus = {x.strip() for x in focus_clusters if str(x).strip()}
keep_rank_max = int(keep_rank_max)
maybe_rank_max = int(maybe_rank_max)
if keep_rank_max <= 0:
raise ValueError("Invalid ideation scoring contract: keep_rank_max must be positive.")
if maybe_rank_max < keep_rank_max:
raise ValueError("Invalid ideation scoring contract: maybe_rank_max must be >= keep_rank_max.")
weights = _validate_score_weights(score_weights or {}, field_name="score_direction_cards.score_weights")
cluster_counts: dict[str, int] = {}
program_counts: dict[str, int] = {}
for card in cards:
cluster_counts[card.cluster] = cluster_counts.get(card.cluster, 0) + 1
program_counts[card.program_kind] = program_counts.get(card.program_kind, 0) + 1
provisional: list[tuple[DirectionCard, float, int, int, int, int, int, int]] = []
for card in cards:
discussion_worthiness = 5 if card.cluster in focus or card.time_to_clarity == "fast" else 4
academic_value_score = 5 if any(token in card.contribution_shape.lower() for token in ["rule", "protocol", "regime map", "causal"]) else 4
concrete_hooks = sum(1 for item in card.literature_suggests if _sentence_has_concrete_hook(item))
evidence_grounding = 5 if len(card.paper_ids) >= 3 and concrete_hooks >= 2 else 4 if len(card.paper_ids) >= 2 and concrete_hooks >= 1 else 3
if program_counts.get(card.program_kind, 0) == 1 and cluster_counts.get(card.cluster, 0) == 1:
direction_distinctness = 5
elif program_counts.get(card.program_kind, 0) == 1 or cluster_counts.get(card.cluster, 0) == 1:
direction_distinctness = 4
else:
direction_distinctness = 3
probe_blob = " ".join(card.first_probes).lower()
first_probe_clarity = 5 if all(token in probe_blob for token in ["intervention:", "readout:", "decisive if"]) else 4 if "intervention:" in probe_blob else 3
thesis_potential = _thesis_potential(card.direction_type, card.cluster)
total = round(
discussion_worthiness * weights["discussion_worthiness"]
+ academic_value_score * weights["academic_value"]
+ evidence_grounding * weights["evidence_grounding"]
+ direction_distinctness * weights["direction_distinctness"]
+ first_probe_clarity * weights["first_probe_clarity"]
+ thesis_potential * weights["thesis_potential"],
2,
)
provisional.append((card, total, discussion_worthiness, academic_value_score, evidence_grounding, direction_distinctness, first_probe_clarity, thesis_potential))
provisional.sort(key=lambda row: (-row[1], 0 if row[0].time_to_clarity == "fast" else 1, row[0].cluster, row[0].title))
rows: list[ScreenedDirection] = []
for idx, (card, total, dw, av, eg, dd, fp, tp) in enumerate(provisional, start=1):
recommendation = "keep" if idx <= keep_rank_max else "maybe" if idx <= maybe_rank_max else "drop"
strengths: list[str] = []
if eg >= 5:
strengths.append("concrete anchor evidence")
if dd >= 5:
strengths.append(f"distinct {card.program_kind} wedge")
if fp >= 5:
strengths.append("clean first probe")
if av >= 5:
strengths.append("thesis-sized payoff")
if not strengths:
strengths.append("discussion value")
risk = "still abstract-first" if str(card.evidence_confidence).startswith("low") else f"main risk is controlling {card.main_confound} cleanly"
rationale = clean_sentence(f"Strongest on {' + '.join(strengths[:2])}; {risk}.", limit=160)
rows.append(ScreenedDirection(
direction_id=card.direction_id,
cluster=card.cluster,
direction_type=card.direction_type,
title=card.title,
total_score=total,
discussion_worthiness=dw,
academic_value_score=av,
evidence_grounding=eg,
direction_distinctness=dd,
first_probe_clarity=fp,
thesis_potential=tp,
recommendation=recommendation,
rationale=rationale,
))
return rows
def screening_table_markdown(rows: list[ScreenedDirection]) -> str:
data: list[list[str]] = []
for row in rows:
data.append([
row.direction_id,
row.cluster,
row.direction_type,
row.title,
f"{row.total_score:.2f}",
str(row.discussion_worthiness),
str(row.academic_value_score),
str(row.evidence_grounding),
str(row.direction_distinctness),
str(row.first_probe_clarity),
str(row.thesis_potential),
row.recommendation,
row.rationale,
])
return "\n".join([
"# IDEA_SCREENING_TABLE",
"",
"## Discussion-first screening",
"",
markdown_table(["Direction ID", "Cluster", "Type", "Title", "Total", "Discussion", "Academic value", "Evidence", "Distinctness", "First probe", "Thesis potential", "Decision", "Rationale"], data),
"",
])
def shortlist_markdown(records: list[dict[str, Any]]) -> str:
lines = ["# IDEA_SHORTLIST", "", "## Prioritized directions", ""]
for record in records:
lines.extend([
f"### Direction {record['rank']}. {record['title']}",
f"- Cluster: {record['cluster']}",
f"- Type: {record['direction_type']}",
f"- Focus axis: {record.get('focus_axis')}",
f"- Program kind: {record.get('program_kind')}",
f"- Main confound: {record.get('main_confound')}",
f"- Time to clarity: {record.get('time_to_clarity')}",
f"- One-line thesis: {record['one_line_thesis']}",
f"- Why this is interesting: {record['why_interesting']}",
f"- Why this ranks here: {record['why_this_ranks_here']}",
f"- Contribution shape: {record.get('contribution_shape')}",
"- What the literature already suggests:",
])
for item in record.get("literature_suggests") or []:
lines.append(f" - {item}")
lines.extend([
"- Closest prior work and why it does not settle the question:",
])
for item in record.get("closest_prior_gap") or []:
lines.append(f" - {item}")
lines.extend([
f"- What is still missing: {record['missing_piece']}",
"- Possible variants:",
])
for item in record.get("possible_variants") or []:
lines.append(f" - {item}")
lines.extend([
f"- Why this could matter academically: {record['academic_value']}",
"- First probes:",
])
for item in record.get("first_probes") or []:
lines.append(f" - {item}")
lines.extend([
f"- What would count as actual insight: {record['what_counts_as_insight']}",
"- What would make this weak or unconvincing:",
])
for item in record.get("weakness_conditions") or []:
lines.append(f" - {item}")
lines.extend([
"- Quick kill criteria:",
])
for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
lines.append(f" - {item}")
lines.extend([
f"- Best fit: {record['best_fit']}",
f"- Evidence confidence: {record['evidence_confidence']}",
"- Anchor papers: " + ", ".join(f"`{pid}`" for pid in record.get("paper_ids") or []),
f"- Why prioritized now: {record['why_prioritized']}",
"",
])
return "\n".join(lines)
def shortlist_snapshot_table(records: list[dict[str, Any]]) -> str:
rows = []
for rec in records:
rows.append([
str(rec.get("rank") or ""),
rec.get("title", ""),
clean_sentence(rec.get("why_this_ranks_here", ""), limit=52),
clean_sentence(rec.get("contribution_shape", rec.get("academic_value", "")), limit=56),
clean_sentence((rec.get("kill_criteria") or rec.get("what_would_change_mind") or [""])[0], limit=54),
])
return markdown_table(["Rank", "Direction", "Why now", "If it survives", "Fast kill signal"], rows)
def build_report_payload(*, topic: str, shortlist: list[dict[str, Any]], deferred: list[dict[str, Any]], trace_paths: dict[str, str]) -> dict[str, Any]:
top = list(shortlist)
takeaways: list[str] = []
if top:
takeaways.append("The strongest directions are the ones most likely to change how existing results are interpreted, not just add another benchmark win.")
takeaways.append("The current rank order reflects a tradeoff between time-to-clarity, thesis-sized payoff, and how concrete the nearest prior-work gap already looks.")
program_kinds = uniq_keep_order(str(rec.get("program_kind") or "").strip() for rec in top)
if len(program_kinds) >= 2:
takeaways.append(f"The lead set is intentionally not one confound template repeated three times: it spans {', '.join(program_kinds[:3])} rather than a single explanatory mold.")
else:
takeaways.append("The lead set still sits in one tight neighborhood, so the next reading pass should stress-test whether those directions are genuinely distinct thesis lines.")
if any(str(rec.get("evidence_confidence") or "").startswith("low") for rec in top):
takeaways.append("Most anchors are still abstract-first, so the memo is best used to decide the next reading and falsification pass rather than to lock a project immediately.")
else:
takeaways.append("The evidence is already concrete enough for a serious PI/PhD discussion, though one deeper paper pass could still reshuffle the exact order.")
lead = top[0] if top else {}
lead_anchor = (lead.get("anchor_reading_notes") or [{}])[0] if top else {}
lead_anchor_title = lead_anchor.get("paper_title") if isinstance(lead_anchor, dict) else "the first anchor paper"
discussion_questions = [
"Which direction still survives once we ask for a single-variable control rather than a suggestive confound story?",
"Which lead direction would still matter if the nearest prior work already addressed the obvious control we are worried about?",
"Which candidate could plausibly turn into a thesis line with a reusable method, protocol, or regime map rather than a one-off empirical note?",
"Which ranking would change most after one full-paper reading pass on the anchor set?",
]
uncertainties = [
"The main remaining risk is novelty risk, not idea scarcity: closer reading may reveal that one lead direction is already settled by an existing control or ablation.",
"Evidence confidence is still bounded by abstract-first notes for part of the lead set, so paper-specific details could either strengthen or kill a direction quickly.",
"The memo is more reliable as a discussion and triage artifact than as a final commitment on exact project order.",
]
next_steps = []
if top:
next_steps.append(clean_sentence(f"Start with {lead.get('title')}: read {lead_anchor_title or 'the first anchor paper'} looking specifically for whether {lead.get('focus_axis')} is already isolated against {lead.get('main_confound')}.", limit=220))
first_probe = (lead.get("first_probes") or [""])[0]
if first_probe:
next_steps.append(clean_sentence(first_probe, limit=220))
next_steps.append(clean_sentence(f"Re-rank only after checking the quick kill criteria for {lead.get('title')} and at least one competing direction in the lead set.", limit=220))
else:
next_steps.extend([
"Read the anchor papers for the current top direction in full and test whether the memo's stated hidden variable still looks unresolved.",
"In the next PI/PhD discussion, use the ranking arguments and kill criteria to decide whether any direction deserves promotion into a sharper thesis candidate.",
])
return {
"topic": topic,
"takeaways": takeaways,
"top_directions": top,
"deferred_directions": deferred,
"discussion_questions": discussion_questions,
"uncertainties": uncertainties,
"next_steps": next_steps,
"trace_artifacts": trace_paths,
}
def report_markdown(payload: dict[str, Any]) -> str:
top_dirs = payload.get("top_directions") or []
deferred_idx = 3 + len(top_dirs)
discussion_idx = deferred_idx + 1
uncertainty_idx = deferred_idx + 2
next_idx = deferred_idx + 3
appendix_idx = deferred_idx + 4
lines = [
"# Research Idea Brainstorm Memo",
"",
"## 0. Scope and framing",
"",
f"- Topic: {payload.get('topic') or 'research ideas'}",
"- Intended readers: PI / PhD",
"- Goal: surface a small number of discussion-worthy research directions rather than force a final project choice.",
"- Current evidence basis: abstract-first notes unless otherwise stated.",
"- What this memo is not: not a final project spec, not a survey draft, and not a symmetric top-3 proposal pack.",
"",
"## 1. Big-picture takeaways",
"",
]
for item in payload.get("takeaways") or []:
lines.append(f"- {item}")
lines.extend([
"",
"## 2. Top directions at a glance",
"",
shortlist_snapshot_table(payload.get("top_directions") or []),
"",
])
for idx, record in enumerate(top_dirs, start=1):
lines.extend([
f"## {idx + 2}. Direction {idx} — {record.get('title')}",
"",
"### One-line thesis",
f"- {record.get('one_line_thesis')}",
"",
"### Why it belongs in the lead set",
f"- {record.get('why_this_ranks_here')}",
f"- Time to clarity: {record.get('time_to_clarity')}",
"",
"### What the current literature actually shows",
])
for item in record.get("literature_suggests") or []:
lines.append(f"- {item}")
lines.extend([
"",
"### Closest prior work and remaining gap",
])
for item in record.get("closest_prior_gap") or []:
lines.append(f"- {item}")
lines.extend([
f"- Missing piece: {record.get('missing_piece')}",
"",
"### If this direction is right, what contribution emerges",
f"- {record.get('contribution_shape') or record.get('academic_value')}",
f"- {record.get('what_counts_as_insight')}",
"",
"### Smallest decisive probe",
])
for item in record.get("first_probes") or []:
lines.append(f"- {item}")
lines.extend([
"",
"### Quick kill criteria",
])
for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
lines.append(f"- {item}")
lines.extend(["",])
lines.extend([
f"## {deferred_idx}. Other promising but not prioritized directions",
"",
])
deferred = payload.get("deferred_directions") or []
if deferred:
for record in deferred:
lines.append(f"- **{record.get('title')}** — {record.get('one_line_thesis')} (why not prioritized now: {record.get('why_not_prioritized')})")
else:
lines.append("- No additional deferred directions were retained in the current memo.")
lines.extend([
"",
f"## {discussion_idx}. Cross-cutting discussion questions",
"",
])
for item in payload.get("discussion_questions") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {uncertainty_idx}. Uncertainty and disagreement",
"",
])
for item in payload.get("uncertainties") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {next_idx}. Suggested next reading / next discussion step",
"",
])
for item in payload.get("next_steps") or []:
lines.append(f"- {item}")
lines.extend([
"",
f"## {appendix_idx}. Appendix guide",
"",
"- `output/APPENDIX.md` for anchor-paper reading notes and deferred directions.",
"- `output/REPORT.json` for the structured version of this memo.",
])
return "\n".join(lines)
def appendix_markdown(payload: dict[str, Any], *, core_titles: dict[str, str]) -> str:
lines = [
"# Appendix to the Research Idea Brainstorm Memo",
"",
"## A. Deferred but still promising directions",
"",
]
deferred = payload.get("deferred_directions") or []
if deferred:
for rec in deferred:
lines.extend([
f"### {rec.get('title')}",
f"- One-line thesis: {rec.get('one_line_thesis')}",
f"- Program kind: {rec.get('program_kind')}",
f"- Why interesting: {rec.get('why_interesting')}",
f"- Why not prioritized now: {rec.get('why_not_prioritized')}",
"",
])
else:
lines.append("- (none)")
lines.extend([
"## B. Anchor papers and what to extract",
"",
])
for rec in payload.get("top_directions") or []:
lines.append(f"### {rec.get('title')}")
lines.append(f"- Program kind: {rec.get('program_kind')}")
lines.append(f"- Lead-set reason: {rec.get('why_this_ranks_here')}")
notes = rec.get("anchor_reading_notes") or []
if notes:
rows: list[list[str]] = []
for item in notes:
if not isinstance(item, dict):
continue
rows.append([
item.get("paper_title", "paper"),
item.get("why_read", ""),
item.get("what_to_extract", ""),
item.get("current_hook", ""),
item.get("kill_signal", ""),
])
if rows:
lines.append(markdown_table(["Anchor paper", "Why read now", "What to extract", "Current evidence hook", "Kill signal"], rows))
else:
for pid in rec.get("paper_ids") or []:
title = core_titles.get(pid, pid)
lines.append(f"- {title}")
else:
for pid in rec.get("paper_ids") or []:
title = core_titles.get(pid, pid)
lines.append(f"- {title}")
lines.append("")
return "\n".join(lines)
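The weighted ranking performed by `score_direction_cards` above can be illustrated with a small standalone sketch. It is a simplification under stated assumptions: the weight values, the card ids, and the sub-scores are hypothetical, and the sort uses only the total (the real function also breaks ties on time-to-clarity, cluster, and title).

```python
# Hypothetical weights; the real values come from the caller's `score_weights`.
WEIGHTS = {
    "discussion_worthiness": 0.20,
    "academic_value": 0.20,
    "evidence_grounding": 0.20,
    "direction_distinctness": 0.15,
    "first_probe_clarity": 0.15,
    "thesis_potential": 0.10,
}


def weighted_total(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted sum of the six sub-scores, rounded to two decimals."""
    return round(sum(scores[name] * weight for name, weight in weights.items()), 2)


def recommend(rank: int, keep_rank_max: int, maybe_rank_max: int) -> str:
    """Map a 1-based rank to keep / maybe / drop using the two cutoffs."""
    if rank <= keep_rank_max:
        return "keep"
    return "maybe" if rank <= maybe_rank_max else "drop"


# Hypothetical cards: direction id -> sub-scores on the 3-5 scale used above.
cards = {
    "DIR-001": dict(discussion_worthiness=5, academic_value=5, evidence_grounding=5,
                    direction_distinctness=4, first_probe_clarity=5, thesis_potential=5),
    "DIR-002": dict(discussion_worthiness=4, academic_value=4, evidence_grounding=3,
                    direction_distinctness=3, first_probe_clarity=3, thesis_potential=4),
    "DIR-003": dict(discussion_worthiness=5, academic_value=4, evidence_grounding=4,
                    direction_distinctness=5, first_probe_clarity=4, thesis_potential=3),
}

# Highest total first, then assign decisions by rank position.
ranked = sorted(cards, key=lambda cid: -weighted_total(cards[cid], WEIGHTS))
decisions = {
    cid: recommend(rank, keep_rank_max=1, maybe_rank_max=2)
    for rank, cid in enumerate(ranked, start=1)
}
print(ranked)
print(decisions)
```

Because the decision depends on rank position rather than on an absolute score threshold, exactly `keep_rank_max` cards are kept whenever at least that many exist, however close the totals are.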
FILE:tooling/pipeline_spec.py
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import Any
import yaml
@dataclass(frozen=True)
class PipelineStage:
id: str
title: str
checkpoint: str
mode: str
required_skills: tuple[str, ...]
optional_skills: tuple[str, ...]
produces: tuple[str, ...]
human_checkpoint: dict[str, Any]
@staticmethod
def from_mapping(stage_id: str, data: Any, *, path: Path) -> "PipelineStage":
if not isinstance(data, dict):
raise ValueError(f"`stages.{stage_id}` must be a mapping in {path}")
return PipelineStage(
id=str(stage_id or "").strip(),
title=str(data.get("title") or stage_id).strip() or str(stage_id),
checkpoint=str(data.get("checkpoint") or stage_id).strip() or str(stage_id),
mode=str(data.get("mode") or "").strip(),
required_skills=_string_tuple(data.get("required_skills"), field_name=f"stages.{stage_id}.required_skills", path=path),
optional_skills=_string_tuple(data.get("optional_skills"), field_name=f"stages.{stage_id}.optional_skills", path=path),
produces=_string_tuple(data.get("produces"), field_name=f"stages.{stage_id}.produces", path=path),
human_checkpoint=_mapping(data.get("human_checkpoint"), field_name=f"stages.{stage_id}.human_checkpoint", path=path, required=False),
)
@dataclass(frozen=True)
class PipelineSpec:
path: Path
name: str
version: str
units_template: str
default_checkpoints: tuple[str, ...]
profile: str
routing_hints: tuple[str, ...]
routing_default: bool
routing_priority: int
target_artifacts: tuple[str, ...]
contract_model: str
structure_mode: str
pre_retrieval_shell: dict[str, Any]
binding_layers: tuple[str, ...]
core_chapter_h3_target: int
query_defaults: dict[str, Any]
overridable_query_fields: tuple[str, ...]
quality_contract: dict[str, Any]
loop_policy: dict[str, Any]
stages: dict[str, PipelineStage]
variant_of: str
variant_overrides: dict[str, Any]
@staticmethod
def load(path: Path) -> "PipelineSpec":
resolved = path.resolve()
raw_frontmatter, frontmatter = _load_variant_aware_frontmatter(resolved)
name = str(frontmatter.get("name") or path.stem.replace(".pipeline", ""))
units_template = str(frontmatter.get("units_template") or "")
version = str(frontmatter.get("version") or "")
default_checkpoints = _string_tuple(frontmatter.get("default_checkpoints"), field_name="default_checkpoints", path=resolved)
profile = str(frontmatter.get("profile") or "default").strip() or "default"
routing_hints = _string_tuple(frontmatter.get("routing_hints"), field_name="routing_hints", path=resolved)
routing_default = bool(frontmatter.get("routing_default"))
try:
routing_priority = int(frontmatter.get("routing_priority") or 0)
except Exception:
routing_priority = 0
target_artifacts = _string_tuple(frontmatter.get("target_artifacts"), field_name="target_artifacts", path=resolved)
contract_model = str(frontmatter.get("contract_model") or "").strip()
structure_mode = str(frontmatter.get("structure_mode") or "").strip()
pre_retrieval_shell = _mapping(frontmatter.get("pre_retrieval_shell"), field_name="pre_retrieval_shell", path=resolved, required=False)
binding_layers = _string_tuple(frontmatter.get("binding_layers"), field_name="binding_layers", path=resolved)
try:
core_chapter_h3_target = int(frontmatter.get("core_chapter_h3_target") or 0)
except Exception:
core_chapter_h3_target = 0
query_defaults = _mapping(frontmatter.get("query_defaults"), field_name="query_defaults", path=resolved, required=False)
overridable_query_fields = _string_tuple(
frontmatter.get("overridable_query_fields"),
field_name="overridable_query_fields",
path=resolved,
)
quality_contract = _mapping(frontmatter.get("quality_contract"), field_name="quality_contract", path=resolved, required=False)
loop_policy = _mapping(frontmatter.get("loop_policy"), field_name="loop_policy", path=resolved, required=False)
stages = _parse_stages(frontmatter.get("stages"), path=resolved)
variant_of = str(raw_frontmatter.get("variant_of") or "").strip()
variant_overrides = _mapping(raw_frontmatter.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False)
if not units_template:
raise ValueError(f"Missing units_template in pipeline front matter: {resolved}")
return PipelineSpec(
path=resolved,
name=name,
version=version,
units_template=units_template,
default_checkpoints=default_checkpoints,
profile=profile,
routing_hints=routing_hints,
routing_default=routing_default,
routing_priority=routing_priority,
target_artifacts=target_artifacts,
contract_model=contract_model,
structure_mode=structure_mode,
pre_retrieval_shell=pre_retrieval_shell,
binding_layers=binding_layers,
core_chapter_h3_target=core_chapter_h3_target,
query_defaults=query_defaults,
overridable_query_fields=overridable_query_fields,
quality_contract=quality_contract,
loop_policy=loop_policy,
stages=stages,
variant_of=variant_of,
variant_overrides=variant_overrides,
)
def query_default(self, key: str, default: Any = None) -> Any:
return self.query_defaults.get(str(key or "").strip(), default)
def allows_query_override(self, key: str) -> bool:
return str(key or "").strip() in set(self.overridable_query_fields)
def _parse_frontmatter(text: str) -> dict[str, Any]:
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("Pipeline file must start with YAML front matter '---'")
    end_idx = None
    for idx in range(1, len(lines)):
        if lines[idx].strip() == "---":
            end_idx = idx
            break
    if end_idx is None:
        raise ValueError("Unterminated YAML front matter (missing closing '---')")
    raw = "\n".join(lines[1:end_idx])
    data = yaml.safe_load(raw) or {}
    if not isinstance(data, dict):
        raise ValueError("Pipeline YAML front matter must be a mapping")
    return data
def _load_variant_aware_frontmatter(path: Path, seen: set[Path] | None = None) -> tuple[dict[str, Any], dict[str, Any]]:
resolved = path.resolve()
active = set(seen or set())
if resolved in active:
chain = " -> ".join(str(p) for p in [*active, resolved])
raise ValueError(f"Cyclic `variant_of` chain detected: {chain}")
active.add(resolved)
text = resolved.read_text(encoding="utf-8")
raw = _parse_frontmatter(text)
variant_of = str(raw.get("variant_of") or "").strip()
if not variant_of:
return raw, dict(raw)
_validate_raw_variant_frontmatter(raw, path=resolved)
base_path = _resolve_pipeline_reference(resolved, variant_of)
_, base_effective = _load_variant_aware_frontmatter(base_path, seen=active)
current = {key: raw[key] for key in ("name", "version") if key in raw}
merged = _deep_merge(base_effective, current)
merged = _deep_merge(
merged,
_mapping(raw.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False),
)
merged["variant_of"] = variant_of
merged["variant_overrides"] = raw.get("variant_overrides") or {}
return raw, merged
def _validate_raw_variant_frontmatter(raw: dict[str, Any], *, path: Path) -> None:
allowed_raw_keys = {"name", "version", "variant_of", "variant_overrides"}
extra_keys = sorted(str(key) for key in raw.keys() if str(key) not in allowed_raw_keys)
if extra_keys:
raise ValueError(
"Variant pipeline files may only keep top-level keys "
"`name`, `version`, `variant_of`, `variant_overrides`; "
f"move the rest under `variant_overrides` in {path}: {', '.join(extra_keys)}"
)
def _resolve_pipeline_reference(path: Path, ref: str) -> Path:
value = str(ref or "").strip()
if not value:
raise ValueError(f"Empty `variant_of` reference in {path}")
candidate = Path(value)
if candidate.is_absolute() and candidate.exists():
return candidate.resolve()
repo_root = path.parent.parent
for base in (path.parent, repo_root, repo_root / "pipelines"):
direct = (base / value).resolve()
if direct.exists():
return direct
stem = Path(value).name
if stem.endswith(".pipeline.md"):
stem = stem[: -len(".pipeline.md")]
if stem:
for base in (path.parent, repo_root / "pipelines"):
direct = (base / f"{stem}.pipeline.md").resolve()
if direct.exists():
return direct
raise ValueError(f"Could not resolve `variant_of: {value}` from {path}")
def _deep_merge(base: Any, override: Any) -> Any:
    # A dict whose keys start with `__` is a list patch, not a replacement value.
    if isinstance(base, list) and isinstance(override, dict) and any(str(key).startswith("__") for key in override.keys()):
        return _apply_list_patch(base, override)
    if isinstance(base, dict) and isinstance(override, dict):
        merged: dict[str, Any] = {str(k): v for k, v in base.items()}
        for key, value in override.items():
            # Normalize override keys the same way as base keys, so non-string
            # YAML keys (e.g. ints) merge instead of creating a second entry.
            name = str(key)
            if name in merged:
                merged[name] = _deep_merge(merged[name], value)
            else:
                merged[name] = value
        return merged
    return override
def _apply_list_patch(base: list[Any], override: dict[str, Any]) -> list[Any]:
allowed_keys = {"__append__", "__prepend__", "__remove__", "__replace__"}
bad_keys = sorted(str(key) for key in override.keys() if str(key) not in allowed_keys)
if bad_keys:
raise ValueError(
"List patch overrides only support "
"`__append__`, `__prepend__`, `__remove__`, `__replace__`; "
f"got: {', '.join(bad_keys)}"
)
if "__replace__" in override:
replacement = override.get("__replace__")
if not isinstance(replacement, list):
raise ValueError("List patch `__replace__` must be a YAML list")
return list(replacement)
current = list(base)
remove_values = override.get("__remove__", [])
prepend_values = override.get("__prepend__", [])
append_values = override.get("__append__", [])
for field_name, value in (
("__remove__", remove_values),
("__prepend__", prepend_values),
("__append__", append_values),
):
if not isinstance(value, list):
raise ValueError(f"List patch `{field_name}` must be a YAML list")
if remove_values:
current = [item for item in current if item not in remove_values]
if prepend_values:
current = list(prepend_values) + current
if append_values:
current = current + list(append_values)
return current
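def _mapping(value: Any, *, field_name: str, path: Path, required: bool) -> dict[str, Any]:
    if value is None:
        # Honor `required`: a missing mandatory mapping is an error, an optional one is empty.
        if required:
            raise ValueError(f"Missing required mapping `{field_name}` in {path}")
        return {}
    if not isinstance(value, dict):
        raise ValueError(f"`{field_name}` must be a mapping in {path}")
    return {str(key): item for key, item in value.items()}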
def _string_tuple(value: Any, *, field_name: str, path: Path) -> tuple[str, ...]:
    if value is None:
        return ()
    if not isinstance(value, list):
        raise ValueError(f"`{field_name}` must be a YAML list in {path}")
    out: list[str] = []
    for item in value:
        text = str(item or "").strip()
        if text:
            out.append(text)
    return tuple(out)
def _parse_stages(value: Any, *, path: Path) -> dict[str, PipelineStage]:
if value is None:
return {}
items: list[tuple[str, Any]] = []
if isinstance(value, dict):
items = [(str(stage_id or "").strip(), data) for stage_id, data in value.items()]
elif isinstance(value, list):
for idx, item in enumerate(value):
if not isinstance(item, dict):
raise ValueError(f"`stages[{idx}]` must be a mapping in {path}")
stage_id = str(item.get("id") or item.get("stage") or "").strip()
if not stage_id:
raise ValueError(f"`stages[{idx}]` is missing `id` in {path}")
items.append((stage_id, item))
else:
raise ValueError(f"`stages` must be a mapping or list in {path}")
stages: dict[str, PipelineStage] = {}
for stage_id, data in items:
stage = PipelineStage.from_mapping(stage_id, data, path=path)
if not stage.id:
raise ValueError(f"`stages` contains an empty stage id in {path}")
stages[stage.id] = stage
return stages
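The way `variant_overrides` is folded into a base pipeline by `_deep_merge` and `_apply_list_patch` above can be illustrated with a standalone sketch. It re-implements a simplified version of the merge (it skips validation of unknown `__` keys and of non-list patch values), so it shows the behavior rather than reproducing the module's code. The `base` and `overrides` mappings, including the `max_results` and `draft_profile` keys, are hypothetical sample data.

```python
from typing import Any


def apply_list_patch(base: list[Any], patch: dict[str, Any]) -> list[Any]:
    """Simplified list patch: `__replace__` wins; otherwise remove, prepend, append."""
    if "__replace__" in patch:
        return list(patch["__replace__"])
    removed = patch.get("__remove__", [])
    current = [item for item in base if item not in removed]
    return list(patch.get("__prepend__", [])) + current + list(patch.get("__append__", []))


def deep_merge(base: Any, override: Any) -> Any:
    """Merge mappings recursively; a dict with `__` keys patches a base list."""
    if isinstance(base, list) and isinstance(override, dict) and any(
        str(key).startswith("__") for key in override
    ):
        return apply_list_patch(base, override)
    if isinstance(base, dict) and isinstance(override, dict):
        merged = dict(base)
        for key, value in override.items():
            merged[key] = deep_merge(merged[key], value) if key in merged else value
        return merged
    # Scalars, plain lists, and mismatched types: the override replaces the base.
    return override


# Hypothetical base front matter and variant overrides.
base = {
    "default_checkpoints": ["C0", "C1", "C2"],
    "query_defaults": {"max_results": 200, "draft_profile": "survey"},
}
overrides = {
    "default_checkpoints": {"__remove__": ["C1"], "__append__": ["C5"]},
    "query_defaults": {"max_results": 50},
}

effective = deep_merge(base, overrides)
print(effective)
```

A plain list in the override replaces the base list outright; the `__append__` / `__prepend__` / `__remove__` form is what lets a variant adjust an inherited list without restating it.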
FILE:tooling/pipeline_text.py
from __future__ import annotations
import json
import re
from pathlib import Path
from typing import Any
from tooling.common import load_yaml, read_jsonl
def slug_unit_id(unit_id: str) -> str:
    raw = str(unit_id or '').strip()
    out: list[str] = []
    for ch in raw:
        out.append(ch if ch.isalnum() else '_')
    safe = ''.join(out).strip('_')
    return f'S{safe}' if safe else 'S'
def load_outline_sections(path: Path) -> list[dict[str, Any]]:
outline = load_yaml(path) if path.exists() else []
if not isinstance(outline, list):
return []
out: list[dict[str, Any]] = []
for sec in outline:
if not isinstance(sec, dict):
continue
sec_id = str(sec.get('id') or '').strip()
sec_title = str(sec.get('title') or '').strip()
subsections: list[dict[str, str]] = []
for sub in sec.get('subsections') or []:
if not isinstance(sub, dict):
continue
sub_id = str(sub.get('id') or '').strip()
sub_title = str(sub.get('title') or '').strip()
if sub_id and sub_title:
subsections.append({'id': sub_id, 'title': sub_title})
if sec_id and sec_title:
out.append({'id': sec_id, 'title': sec_title, 'subsections': subsections})
return out
def iter_h3_units(path: Path) -> list[dict[str, str]]:
    units: list[dict[str, str]] = []
    for sec in load_outline_sections(path):
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        for sub in sec.get('subsections') or []:
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                units.append(
                    {
                        'section_id': sec_id,
                        'section_title': sec_title,
                        'sub_id': sub_id,
                        'title': sub_title,
                    }
                )
    return units
def read_jsonl_map(path: Path, key: str) -> dict[str, dict[str, Any]]:
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        val = str(rec.get(key) or '').strip()
        if val:
            out[val] = rec
    return out
def read_bib_keys(path: Path) -> set[str]:
    if not path.exists() or path.stat().st_size <= 0:
        return set()
    text = path.read_text(encoding='utf-8', errors='ignore')
    return set(re.findall(r'(?im)^@\w+\s*\{\s*([^,\s]+)\s*,', text))
def uniq_keep_order(items: list[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or '').strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out
def clean_excerpt(text: str, *, limit: int = 220) -> str:
    s = str(text or '').strip()
    s = s.replace('\n', ' ')
    s = re.sub(r'\s+', ' ', s)
    s = s.strip(' "\'`')
    s = s.replace('|', ', ')
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(' ', 1)[0].strip()
    return clipped if clipped else s[:limit].strip()
def citation_list(keys: list[str], *, max_keys: int = 3) -> str:
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'[@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'
def inline_evidence_phrase(keys: list[str], *, max_keys: int = 3) -> str:
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'in [@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'
def heading_blocks(md: str) -> list[tuple[str, str]]:
    blocks: list[tuple[str, str]] = []
    current = ''
    lines: list[str] = []
    for raw in (md or '').splitlines():
        if raw.startswith('### '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = raw[4:].strip()
            lines = []
            continue
        if raw.startswith('## '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = ''
            lines = []
            continue
        if current:
            lines.append(raw)
    if current:
        blocks.append((current, '\n'.join(lines).strip()))
    return blocks
def dump_jsonl_lines(records: list[dict[str, Any]]) -> str:
    return '\n'.join(json.dumps(r, ensure_ascii=False) for r in records).rstrip() + ('\n' if records else '')
Download a small corpus of open-access arXiv survey/review PDFs about LLM agents and extract text for style learning. **Trigger**: agent survey corpus, ref c...
---
name: agent-survey-corpus
description: 'Download a small corpus of open-access arXiv survey/review PDFs about LLM agents and extract text for style
learning.
**Trigger**: agent survey corpus, ref corpus, download surveys, 学习综述写法, 下载 survey.
**Use when**: you want to study how real agent surveys structure sections (6–8 H2), size subsections, and write evidence-backed
comparisons.
**Skip if**: you cannot download PDFs (no network) or you don''t want local PDF files.
**Network**: required.
**Guardrail**: only download arXiv PDFs; store under `ref/` and keep large files out of git.'
version: 0.1.0
metadata:
openclaw:
requires:
anyBins:
- python3
- python
---
# Agent Survey Corpus (arXiv PDFs → text extracts)
Goal: create a small, local reference library so you can **learn from real agent surveys** when refining:
- C2 outline structure (paper-like sectioning)
- C4 tables/claims organization
- C5 writing style and density
This is intentionally *not* part of the pipeline; it is an optional, repo-level toolkit.
## Inputs
- `ref/agent-surveys/arxiv_ids.txt`
## Outputs
- `ref/agent-surveys/pdfs/`
- `ref/agent-surveys/text/`
- `ref/agent-surveys/STYLE_REPORT.md` (tracked; auto-generated summary)
## Workflow
1) Edit `ref/agent-surveys/arxiv_ids.txt` (one arXiv id per line).
2) Run the downloader to fetch PDFs and extract the first N pages to text.
3) Skim the extracted text under `ref/agent-surveys/text/`:
- look at section counts (H2), subsection granularity (H3), and how the surveys transition between chapters.
- identify repeated rhetorical patterns you want the pipeline writer to imitate.
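Steps 1 and 2 of the workflow can be sketched roughly as follows. This is a minimal illustration, not the bundled `scripts/run.py`: the helper names `parse_arxiv_ids` and `pdf_url` are hypothetical, and both the `#` comment handling and the `https://arxiv.org/pdf/<id>` URL pattern are assumptions made for the example.

```python
from __future__ import annotations


def parse_arxiv_ids(text: str) -> list[str]:
    """Return unique arXiv ids (one per line), preserving first-seen order."""
    ids: list[str] = []
    seen: set[str] = set()
    for raw in text.splitlines():
        # Assumption: anything after '#' is a comment and blank lines are skipped.
        line = raw.split('#', 1)[0].strip()
        if not line or line in seen:
            continue
        seen.add(line)
        ids.append(line)
    return ids


def pdf_url(arxiv_id: str) -> str:
    """Build a PDF URL for an id (assumed URL pattern, not taken from the script)."""
    return f'https://arxiv.org/pdf/{arxiv_id}'
```

A downloader built on this would read `ref/agent-surveys/arxiv_ids.txt`, call `parse_arxiv_ids` on its contents, and pause between requests in the spirit of `--sleep`.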
## Script
### Quick Start
- `python scripts/run.py --help`
- `python scripts/run.py --workspace . --max-pages 20`
### All Options
- `--workspace <dir>` (use `.` to write into repo root)
- `--inputs <semicolon-separated>` (default: `ref/agent-surveys/arxiv_ids.txt`)
- `--max-pages <N>` (default: 20)
- `--sleep <seconds>` (default: 1.0)
- `--overwrite` (re-download + re-extract)
### Examples
- Download/extract into repo root `ref/`:
- `python scripts/run.py --workspace . --max-pages 20`
- Download/extract into a specific folder (treated as workspace root):
- `python scripts/run.py --workspace /tmp/surveys --max-pages 30`
## Troubleshooting
- **Download fails / timeout**: rerun with a larger `--sleep`, or try fewer ids.
- **Text extract is empty**: the PDF may be scanned; try another survey or increase `--max-pages`.
- **Files showing up in git status**: PDFs and text extracts are meant to be ignored via `.gitignore` (`ref/**/pdfs/`, `ref/**/text/`); if they still appear, check that those patterns are present.
FILE:AGENTS.md
# AGENTS.md
This exported skill uses `AGENTS.md` only as a local repo-root marker for bundled helper scripts.
Source skill: `agent-survey-corpus` from `research-units-pipeline-skills`.
Export slug: `agent-survey-corpus`.
FILE:pipelines/arxiv-survey-latex.pipeline.md
---
name: arxiv-survey-latex
version: 3.8
variant_of: arxiv-survey
variant_overrides:
routing_hints: [latex, pdf, tex, 可编译, 编译]
routing_default: false
routing_priority: 20
units_template: templates/UNITS.arxiv-survey-latex.csv
target_artifacts:
__append__:
- latex/main.tex
- latex/main.pdf
- output/LATEX_BUILD_REPORT.md
stages:
C5:
title: Draft + PDF
required_skills:
__append__:
- latex-scaffold
- latex-compile-qa
optional_skills:
__remove__:
- latex-scaffold
- latex-compile-qa
produces:
__append__:
- latex/main.tex
- latex/main.pdf
- output/LATEX_BUILD_REPORT.md
---
# Pipeline: arXiv survey / review (MD-first + LaTeX/PDF)
Variant of `arxiv-survey`.
Use this pipeline only when the default deliverable must include:
- `latex/main.tex`
- `latex/main.pdf`
- `output/LATEX_BUILD_REPORT.md`
All non-PDF survey behavior, defaults, checkpoints, and earlier stages inherit from `arxiv-survey`.
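The `__append__` / `__remove__` directives in the frontmatter above suggest a list-editing merge on top of the base `arxiv-survey` contract. The sketch below shows one way such a merge could behave; the function name `apply_overrides` and the exact semantics (remove first, then append without duplicates, recurse into nested mappings) are assumptions, not the pipeline loader's confirmed behavior.

```python
from __future__ import annotations

from copy import deepcopy
from typing import Any

DIRECTIVES = {'__append__', '__remove__'}


def apply_overrides(base: Any, override: Any) -> Any:
    """Merge a variant override onto a base value (assumed semantics)."""
    if isinstance(override, dict):
        # A mapping made only of directives edits a list instead of replacing it.
        if override and set(override) <= DIRECTIVES:
            removed = set(override.get('__remove__') or [])
            items = [x for x in list(base or []) if x not in removed]
            for x in override.get('__append__') or []:
                if x not in items:
                    items.append(x)
            return items
        # Otherwise merge key by key, keeping base keys the override omits.
        merged = deepcopy(base) if isinstance(base, dict) else {}
        for key, value in override.items():
            merged[key] = apply_overrides(merged.get(key), value)
        return merged
    # Scalars and plain lists replace the base value outright.
    return deepcopy(override)
```

Under these assumptions, the C5 override would move `latex-scaffold` and `latex-compile-qa` from `optional_skills` to `required_skills` while leaving every other inherited field untouched.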
FILE:pipelines/arxiv-survey.pipeline.md
---
name: arxiv-survey
version: 3.8
profile: arxiv-survey
routing_hints: [survey, review, 综述, 调研, literature review]
routing_default: true
routing_priority: 10
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
- outline/transitions.md
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/claim_evidence_matrix.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- outline/writer_context_packs.jsonl
- citations/ref.bib
- citations/verified.jsonl
- sections/sections_manifest.jsonl
- sections/h3_bodies.refined.ok
- sections/paragraphs_curated.refined.ok
- sections/style_harmonized.refined.ok
- sections/opener_varied.refined.ok
- sections/abstract.md
- sections/S1.md
- sections/S2.md
- sections/discussion.md
- sections/conclusion.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/SCHEMA_NORMALIZATION_REPORT.md
- output/EVIDENCE_SELFLOOP_TODO.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/PARAGRAPH_CURATION_REPORT.md
- output/FRONT_MATTER_REPORT.md
- output/CHAPTER_LEADS_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/GLOBAL_REVIEW.md
- output/DRAFT.md
- output/MERGE_REPORT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.arxiv-survey.csv
contract_model: pipeline.frontmatter/v1
structure_mode: section_first
pre_retrieval_shell:
enabled: true
approval_surface: false
allowed_h2: [Introduction, Related Work, Core Chapters, Discussion, Conclusion]
binding_layers: [chapter_skeleton, section_bindings, section_briefs, subsection_mapping]
core_chapter_h3_target: 3
query_defaults:
max_results: 1800
core_size: 300
per_subsection: 28
global_citation_min_subsections: 4
draft_profile: survey
citation_target: recommended
evidence_mode: abstract
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- per_subsection
- global_citation_min_subsections
- draft_profile
- citation_target
- enrich_metadata
- evidence_mode
- fulltext_max_papers
- fulltext_max_pages
- fulltext_min_chars
- time_window.from
- time_window.to
quality_contract:
citation_policy:
unique_hard_floor: 150
unique_recommended: 165
structure_policy:
max_final_h2_by_profile:
survey: 8
deep: 9
max_h3_by_profile:
survey: 10
deep: 12
front_matter_policy:
survey:
introduction:
min_cites: 35
min_paras: 5
min_chars: 2600
related_work:
min_cites: 50
min_paras: 6
min_chars: 3200
deep:
introduction:
min_cites: 40
min_paras: 6
min_chars: 3000
related_work:
min_cites: 55
min_paras: 7
min_chars: 3600
subsection_policy:
survey:
min_unique_citations: 12
min_chars: 4200
deep:
min_unique_citations: 14
min_chars: 5200
loop_policy:
stage_retry_budget:
C1: 2
C2: 2
C3: 1
C4: 1
max_reroutes: 4
require_human_on_retry_after_approval: true
stages:
C0:
title: Init
mode: no_prose
required_skills: [workspace-init, pipeline-router]
optional_skills: []
produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
C1:
title: Retrieval & core set
mode: no_prose
required_skills: [literature-engineer, dedupe-rank]
optional_skills: [keyword-expansion, survey-seed-harvest]
produces: [papers/papers_raw.jsonl, papers/retrieval_report.md, papers/papers_dedup.jsonl, papers/core_set.csv]
C2:
title: Structure
mode: no_prose
required_skills: [taxonomy-builder, chapter-skeleton, section-bindings, section-briefs, outline-builder, section-mapper, outline-refiner, pipeline-router, human-checkpoint]
optional_skills: [outline-budgeter]
produces: [outline/taxonomy.yml, outline/chapter_skeleton.yml, outline/section_bindings.jsonl, outline/section_binding_report.md, outline/section_briefs.jsonl, outline/outline.yml, outline/mapping.tsv, outline/coverage_report.md, outline/outline_state.jsonl, output/REROUTE_STATE.json, DECISIONS.md]
human_checkpoint:
approve: scope + section skeleton + outline
write_to: DECISIONS.md
C3:
title: Evidence
mode: no_prose
required_skills: [pdf-text-extractor, paper-notes, subsection-briefs, chapter-briefs]
optional_skills: []
produces: [papers/fulltext_index.jsonl, papers/paper_notes.jsonl, papers/evidence_bank.jsonl, outline/subsection_briefs.jsonl, outline/chapter_briefs.jsonl]
C4:
title: Citations + evidence packs
mode: no_prose
required_skills: [citation-verifier, evidence-binder, evidence-draft, table-schema, anchor-sheet, table-filler, appendix-table-writer, schema-normalizer, writer-context-pack, evidence-selfloop, claim-matrix-rewriter]
optional_skills: [survey-visuals]
produces: [citations/ref.bib, citations/verified.jsonl, outline/evidence_bindings.jsonl, outline/evidence_binding_report.md, outline/table_schema.md, outline/tables_index.md, outline/tables_appendix.md, output/TABLES_APPENDIX_REPORT.md, outline/evidence_drafts.jsonl, outline/anchor_sheet.jsonl, output/SCHEMA_NORMALIZATION_REPORT.md, outline/writer_context_packs.jsonl, output/EVIDENCE_SELFLOOP_TODO.md, outline/claim_evidence_matrix.md]
C5:
title: Draft
mode: prose_allowed
required_skills: [front-matter-writer, chapter-lead-writer, subsection-writer, writer-selfloop, section-logic-polisher, argument-selfloop, paragraph-curator, style-harmonizer, opener-variator, evaluation-anchor-checker, transition-weaver, section-merger, post-merge-voice-gate, citation-diversifier, citation-injector, draft-polisher, global-reviewer, pipeline-auditor, artifact-contract-auditor]
optional_skills: [prose-writer, subsection-polisher, redundancy-pruner, terminology-normalizer, limitation-weaver, latex-scaffold, latex-compile-qa]
produces: [outline/transitions.md, sections/sections_manifest.jsonl, sections/h3_bodies.refined.ok, sections/paragraphs_curated.refined.ok, sections/style_harmonized.refined.ok, sections/opener_varied.refined.ok, sections/abstract.md, sections/S1.md, sections/S2.md, sections/discussion.md, sections/conclusion.md, output/WRITER_SELFLOOP_TODO.md, output/EVAL_ANCHOR_REPORT.md, output/ARGUMENT_SELFLOOP_TODO.md, output/SECTION_ARGUMENT_SUMMARIES.jsonl, output/ARGUMENT_SKELETON.md, output/PARAGRAPH_CURATION_REPORT.md, output/FRONT_MATTER_REPORT.md, output/CHAPTER_LEADS_REPORT.md, output/SECTION_LOGIC_REPORT.md, output/MERGE_REPORT.md, output/DRAFT.md, output/POST_MERGE_VOICE_REPORT.md, output/CITATION_BUDGET_REPORT.md, output/CITATION_INJECTION_REPORT.md, output/GLOBAL_REVIEW.md, output/AUDIT_REPORT.md, output/CONTRACT_REPORT.md]
---
# Pipeline: arXiv survey / review (MD-first)
Default contract (survey-grade, A150++):
- `queries.md` defaults are set for a *survey deliverable* (no silent downgrade): `core_size=300`, `per_subsection=28`, global unique citations hard floor `>=150` (recommended `>=165` when `core_size=300`; default `citation_target=recommended`).
- `draft_profile` controls **writing strictness** (`survey` vs `deep`), not “speed mode”.
- `evidence_mode` controls **evidence strength** (`abstract` default; `fulltext` optional and heavier).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- survey-seed-harvest
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- `queries.md` may specify `max_results` and a year-based `time_window`; `arxiv-search` will paginate and attach arXiv metadata (categories, arxiv_id, etc.) when online.
- If you import an offline export but later have network, you can set `enrich_metadata: true` in `queries.md` (or run `arxiv-search --enrich-metadata`) to backfill missing abstracts/authors/categories via arXiv `id_list`.
- Evidence-first expectation (A150++): aim for a large dedup pool (target >=1200, not ~200) and a stable, verifiable core set (`core_size=300`) so later stages can bind wide in-scope citation pools without forcing out-of-scope drift.
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- chapter-skeleton
- section-bindings
- section-briefs
- outline-builder
- section-mapper
- outline-refiner
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/chapter_skeleton.yml
- outline/section_bindings.jsonl
- outline/section_binding_report.md
- outline/section_briefs.jsonl
- outline/outline.yml
- outline/mapping.tsv
- outline/coverage_report.md
- outline/outline_state.jsonl
- output/REROUTE_STATE.json
human_checkpoint:
- approve: scope + section skeleton + outline
- write_to: DECISIONS.md
Notes:
- `chapter-skeleton` is the first retrieval-informed chapter contract; it stays chapter-level and does not emit stable H3 ids.
- `section-bindings` measures chapter saturation before H3 decomposition and writes PASS/BLOCKED signals into `outline/section_binding_report.md`.
- `section-briefs` turns the chapter layer into decomposition guidance plus subsection seeds; `outline-builder` then derives the first stable `outline/outline.yml` from that section layer.
- `outline-refiner` now writes both `outline/outline_state.jsonl` and `output/REROUTE_STATE.json`; section-first reroute/block information should be read from those artifacts instead of inferred from prose notes.
- Evidence-first expectation: each subsection should be written as a *question to answer* (RQ) plus *evidence needs* (what kind of citations/results are required), not just generic scaffold bullets.
- Coverage default: `section-mapper` uses `queries.md:per_subsection` as the per-H3 mapping contract (A150++ default: 28) so later evidence binding and writing have enough in-scope citations to choose from.
- Diversity expectation: mapping should not over-reuse a few papers across unrelated H3s; reserve “global” works for genuinely cross-cutting citations (controlled by `global_citation_min_subsections`).
- Budget policy (paper-like): avoid H3 explosion; the outline gate uses `queries.md:draft_profile` to set max H3 (survey<=10, deep<=12).
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent H3s into fewer, thicker units, then rerun `section-mapper` → `outline-refiner` before `Approve C2`.
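The diversity expectation above treats a bibkey as cross-cutting once it is mapped to at least `global_citation_min_subsections` subsections (default 4). A minimal sketch of that classification follows; it works on an in-memory mapping of H3 id to bibkeys rather than parsing `outline/mapping.tsv`, whose column layout is not shown here, and the function name `global_bibkeys` is hypothetical.

```python
from __future__ import annotations

from collections import defaultdict


def global_bibkeys(mapping: dict[str, list[str]], *, min_subsections: int = 4) -> set[str]:
    """Return bibkeys mapped to at least `min_subsections` distinct H3 ids."""
    seen_in: dict[str, set[str]] = defaultdict(set)
    for sub_id, keys in mapping.items():
        for key in keys:
            # A set per key means repeats inside one H3 are counted once.
            seen_in[key].add(sub_id)
    return {key for key, subs in seen_in.items() if len(subs) >= min_subsections}
```

Keys outside the returned set would stay subsection-scoped, which is what keeps a handful of popular papers from being reused across unrelated H3s.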
## Stage 3 - Evidence (C3) [NO PROSE]
required_skills:
- pdf-text-extractor
- paper-notes
- subsection-briefs
- chapter-briefs
produces:
- papers/fulltext_index.jsonl
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
Notes:
- `queries.md` can set `evidence_mode: "abstract"|"fulltext"` (A150++ default: `abstract`).
- `queries.md` can set `draft_profile: "survey"|"deep"` to control writing gate strictness (A150++ default: `survey`).
- If `evidence_mode: "fulltext"`, `pdf-text-extractor` can be tuned via `fulltext_max_papers`, `fulltext_max_pages`, `fulltext_min_chars`.
- `subsection-briefs` converts each H3 into a verifiable writing card (scope_rule/rq/axes/clusters/paragraph_plan) so writing does not copy outline scaffolds.
- Optional refinement markers (recommended): treat briefs as *contracts*, not scaffolds. If you manually refine them and want to prevent regeneration, create:
- `outline/subsection_briefs.refined.ok`
- `outline/chapter_briefs.refined.ok`
These markers are used as explicit “reviewed/refined” signals and as a freeze switch (scripts won’t overwrite refined briefs).
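The freeze-switch behavior of the `*.refined.ok` markers can be sketched as below. This is an illustration under assumptions: the helper names `refined_marker` and `may_regenerate` are hypothetical, and the marker path is derived by swapping the artifact's extension for `.refined.ok`, which matches the two examples listed above.

```python
from __future__ import annotations

from pathlib import Path


def refined_marker(artifact: Path) -> Path:
    """Map e.g. outline/subsection_briefs.jsonl to outline/subsection_briefs.refined.ok."""
    return artifact.with_name(f'{artifact.stem}.refined.ok')


def may_regenerate(artifact: Path) -> bool:
    """Allow overwriting only when no refinement marker sits next to the artifact."""
    return not refined_marker(artifact).exists()
```

A generator script would call `may_regenerate(path)` before writing and skip the artifact when it returns `False`.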
## Stage 4 - Citations + evidence packs (C4) [NO PROSE]
required_skills:
- citation-verifier
- evidence-binder
- evidence-draft
- table-schema
- anchor-sheet
- table-filler
- appendix-table-writer
- schema-normalizer
- writer-context-pack
- evidence-selfloop
- claim-matrix-rewriter
optional_skills:
- survey-visuals
produces:
- citations/ref.bib
- citations/verified.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_binding_report.md
- outline/table_schema.md
- outline/tables_index.md
- outline/tables_appendix.md
- output/TABLES_APPENDIX_REPORT.md
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
- output/SCHEMA_NORMALIZATION_REPORT.md
- outline/writer_context_packs.jsonl
- output/EVIDENCE_SELFLOOP_TODO.md
- outline/claim_evidence_matrix.md
Notes:
- `evidence-draft` turns paper notes into per-subsection evidence packs (claim candidates + concrete comparisons + eval protocol + limitations) that the writer must follow.
- `claim-matrix-rewriter` makes `outline/claim_evidence_matrix.md` a projection/index of evidence packs (not an outline expansion), so writer guidance stays evidence-first.
- `writer-context-pack` builds a deterministic per-H3 drafting pack (briefs + evidence + anchors + allowed cites), reducing hollow writing and making C5 more debuggable.
- Tables are part of the default survey deliverable, but split into two layers:
- `outline/tables_index.md` (internal index; produced by `table-filler`; useful for planning/debugging; NOT inserted into the paper)
- `outline/tables_appendix.md` (reader-facing; produced by `appendix-table-writer`; clean/publishable; inserted into the draft as an Appendix block by `section-merger`)
- Optional: `survey-visuals` can still produce timeline/figure specs as intermediate artifacts.
- Optional refinement markers (recommended): after you spot-check/refine C4 artifacts and want to freeze them, create:
- `outline/evidence_bindings.refined.ok`
- `outline/evidence_drafts.refined.ok`
- `outline/anchor_sheet.refined.ok`
- `outline/writer_context_packs.refined.ok`
These markers make “reviewed/refined” explicit and prevent accidental regeneration/overwrite; strict mode relies on content checks (placeholders/blocking_missing/scope), not marker presence.
## Stage 5 - Draft (C5) [PROSE AFTER C2]
required_skills:
- front-matter-writer
- chapter-lead-writer
- subsection-writer
- writer-selfloop
- section-logic-polisher
- argument-selfloop
- paragraph-curator
- style-harmonizer
- opener-variator
- evaluation-anchor-checker
- transition-weaver
- section-merger
- post-merge-voice-gate
- citation-diversifier
- citation-injector
- draft-polisher
- global-reviewer
- pipeline-auditor
- artifact-contract-auditor
optional_skills:
- prose-writer
- subsection-polisher
- redundancy-pruner
- terminology-normalizer
- limitation-weaver
- latex-scaffold
- latex-compile-qa
produces:
- sections/sections_manifest.jsonl
- sections/abstract.md
- sections/discussion.md
- sections/conclusion.md
- output/WRITER_SELFLOOP_TODO.md
- output/EVAL_ANCHOR_REPORT.md
- output/SECTION_LOGIC_REPORT.md
- output/ARGUMENT_SELFLOOP_TODO.md
- output/SECTION_ARGUMENT_SUMMARIES.jsonl
- output/ARGUMENT_SKELETON.md
- output/MERGE_REPORT.md
- output/DRAFT.md
- output/POST_MERGE_VOICE_REPORT.md
- output/CITATION_BUDGET_REPORT.md
- output/CITATION_INJECTION_REPORT.md
- output/GLOBAL_REVIEW.md
- output/AUDIT_REPORT.md
- output/CONTRACT_REPORT.md
Notes:
- C5 writing system (semantic + minimal artifacts; no extra machinery):
- **Unit of work**: `sections/*.md` (front matter, H2 leads, H3 bodies). Avoid editing `output/DRAFT.md` directly until after merge.
- **Single source of truth (locked definitions)**: `output/ARGUMENT_SKELETON.md` → `## Consistency Contract` (terminology, scope boundary, evaluation protocol fields, baseline naming).
- **Write → check → fix (five gates)**:
1) `writer-selfloop` → `output/WRITER_SELFLOOP_TODO.md`: file existence, depth, citation scope, paper voice.
2) `section-logic-polisher` → `output/SECTION_LOGIC_REPORT.md`: paragraph linkage (no jump cuts / “paragraph islands”).
3) `argument-selfloop` → `output/ARGUMENT_SELFLOOP_TODO.md` + `output/ARGUMENT_SKELETON.md` + `output/SECTION_ARGUMENT_SUMMARIES.jsonl`: section-level closure + premise/definition stability.
4) `paragraph-curator` → `output/PARAGRAPH_CURATION_REPORT.md`: **select → evaluate → subset → fuse** so sections converge (reduce redundancy, strengthen synthesis) without changing citation keys.
5) `evaluation-anchor-checker` → `output/EVAL_ANCHOR_REPORT.md`: final section-level numeric hygiene sweep so surviving numeric claims keep same-sentence task/metric/constraint context before merge.
- **Openers-last**: draft the middle first; rewrite paragraph 1 last so it reflects real content (front matter + H3).
- Writing self-loop gate: `subsection-writer` ensures the full `sections/` file set exists (and emits `sections/sections_manifest.jsonl`); `writer-selfloop` blocks until depth/citation-scope/paper-voice checks pass, writing `output/WRITER_SELFLOOP_TODO.md` (PASS/FAIL).
- Argument self-loop gate: `argument-selfloop` blocks “smooth but hollow” sections by making the argument chain explicit (per-section paragraph moves + a global dependency skeleton). Its ledgers are intermediate artifacts and must never be merged into the paper.
- Style hygiene (C5 hard gate for `survey`/`deep`): treat `output/WRITER_SELFLOOP_TODO.md` Style Smells as mandatory fixes. Run `style-harmonizer` + `opener-variator` on flagged files, then rerun `writer-selfloop` before merge.
- Numeric hygiene gate: run `evaluation-anchor-checker` after the last section-level rewrites (`paragraph-curator` + `style-harmonizer` + `opener-variator`) and before merge. If later section rewrites touch the same H3 files, rerun it; do not wait for `pipeline-auditor` to discover underspecified numbers in the merged draft.
- Micro-fix routing (preferred over broad rewrites): if Style Smells are specific, use targeted micro-skills before a general harmonize pass:
- opener cadence / “overview” narration → `opener-variator`
- count-based limitation slots (“Two limitations…”) → `limitation-weaver`
- underspecified numeric/performance claims (missing task/metric/budget) → `evaluation-anchor-checker`
- Triage rule (prevents “patching evidence gaps with prose”): if `writer-selfloop` FAILs because a subsection cannot meet `must_use` *in-scope* (thin packs / missing anchors / out-of-scope citation pressure), stop and rerun the evidence loop (`evidence-selfloop` + upstream C2/C3/C4) instead of padding prose.
- WebWeaver-style “planner vs writer” split (single agent, two passes):
- Planner pass: for each section/subsection, pick the exact citation IDs to use from the evidence bank (`outline/evidence_drafts.jsonl`) and keep scope consistent with the outline.
- Writer pass: write that section using only those citation IDs; avoid dumping the whole notes set into context.
- Treat this stage as an iteration loop: draft per H3 → logic-polish (thesis + connectors) → weave transitions → merge → de-template/cohere → global review → (if gaps) back to C3/C4 → regenerate.
- Post-merge voice gate: `post-merge-voice-gate` treats `outline/transitions.md` as a high-frequency injection source. If it FAILs, fix the *source* (usually transitions via `transition-weaver`, or the owning `sections/*.md`) and re-merge; do not “patch around it” in `draft-polisher`.
- Depth target (profile-aware): each H3 should be “few but thick” (avoid stubs). Use `queries.md:draft_profile` as the contract:
- `survey`: >=10 paragraphs + >=12 unique cites
- `deep`: >=11 paragraphs + >=14 unique cites
In all profiles, require >=2 concrete contrasts + evaluation anchoring + a cross-paper synthesis paragraph + an explicit limitation.
- Profile semantics: `survey` is the default deliverable contract; `deep` is stricter (and typically pairs well with `evidence_mode: fulltext`).
- Coherence target (paper-like): for every H2 chapter with H3 subsections, write a short **chapter lead** block (`sections/S<sec_id>_lead.md`) that previews the comparison axes and how the H3s connect (no new headings; avoid generic glue).
- Anti-template style contract (paper-like, not “outline narration”):
- Avoid meta openers like “This subsection surveys/argues …” and slide-like navigation (“Next, we move from … / We now turn to …”).
- Keep signposting light: avoid repeating a literal opener label across many subsections (e.g., `Key takeaway:`); vary opener phrasing and cadence.
- Tone target: calm, academic, understated; delete hype words (`clearly`, `obviously`) and “PPT speaker notes”.
- Keep evidence-policy disclaimers **once** in front matter (not repeated across H3s).
- If you cite numbers, include minimal evaluation context (task + metric + constraint/budget/cost) in the same paragraph.
- Citation shape must be reader-facing: no adjacent citation blocks (e.g., `[@a] [@b]`), no duplicate keys in one block (e.g., `[@a; @a]`), and avoid tail-only citation style by keeping mid-sentence citations in each H3.
- `section-merger` merges `sections/*.md` plus `outline/transitions.md` (within-chapter H3→H3 by default). Between-H2 transition insertion is optional: create `outline/transitions.insert_h2.ok` in the workspace if you want narrator-style handoffs included.
- Tables are part of the default deliverable: `outline/tables_appendix.md` is inserted into the draft by `section-merger` as a single Appendix block (index tables in `outline/tables_index.md` remain intermediate) unless `outline/tables.insert.off` exists. Other visuals (`outline/timeline.md`, `outline/figures.md`) remain intermediate by default.
- Citation scope policy: citations are subsection-first (from `outline/evidence_bindings.jsonl`), with limited reuse allowed within the same H2 chapter to reduce brittleness; avoid cross-chapter “free cite” drift.
- Controlled flexibility: bibkeys mapped to >= `queries.md:global_citation_min_subsections` subsections (A150++ default: 4) are treated as cross-cutting/global; see `allowed_bibkeys_global` in writer packs / `sections_manifest.jsonl`.
- If global unique citations are low, run `citation-diversifier` → `citation-injector` *before* `draft-polisher` (the polisher treats citation keys as immutable).
- `queries.md` can set `citation_target: recommended|hard` to control whether the recommended target is enforced as blocking (default: `recommended` for A150++).
- If you intentionally add/remove citations after an earlier polish run, reset the citation-anchoring baseline before rerunning `draft-polisher`:
- delete `output/citation_anchors.prepolish.jsonl` (workspace-local), then rerun `draft-polisher`.
- Recommended skills (toolkit, not a rigid one-shot chain):
- Modular drafting: `subsection-writer` → `writer-selfloop` → `section-logic-polisher` → `argument-selfloop` → `paragraph-curator` → `style-harmonizer` → `opener-variator` → `evaluation-anchor-checker` → `transition-weaver` → `section-merger` → `draft-polisher` → `global-reviewer` → `pipeline-auditor`.
- Legacy one-shot drafting: `prose-writer` (kept for quick experiments; less debuggable).
- If the draft reads like “paragraph islands”, run `section-logic-polisher` and patch only failing `sections/S*.md` until PASS, then merge.
- Add `pipeline-auditor` after `global-reviewer` as a regression test (blocks on ellipsis, repeated boilerplate, and citation hygiene).
- If you also need a PDF deliverable, use `latex-scaffold` + `latex-compile-qa` (see `arxiv-survey-latex`).
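The citation-shape rules in the notes above (no adjacent blocks such as `[@a] [@b]`, no duplicate keys such as `[@a; @a]`) lend themselves to a small regex check. The sketch below is an approximation that assumes Pandoc-style `[@key; @key]` blocks; the function name `citation_shape_issues` is hypothetical and this is not the check used by `pipeline-auditor`.

```python
from __future__ import annotations

import re

CITE_BLOCK = re.compile(r'\[(@[^\]]+)\]')
ADJACENT = re.compile(r'\[@[^\]]+\]\s*\[@[^\]]+\]')


def citation_shape_issues(text: str) -> list[str]:
    """List adjacent citation blocks and blocks that repeat a key."""
    issues: list[str] = []
    for m in ADJACENT.finditer(text):
        issues.append(f'adjacent citation blocks: {m.group(0)}')
    for m in CITE_BLOCK.finditer(text):
        # Split "[@a; @b]" into bare keys before comparing.
        keys = [k.strip().lstrip('@') for k in m.group(1).split(';') if k.strip()]
        if len(keys) != len(set(keys)):
            issues.append(f'duplicate keys in one block: {m.group(0)}')
    return issues
```

An empty list means the text passes both shape rules; the mid-sentence citation ratio would need a separate, sentence-aware check.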
## Quality gates (strict mode)
- Citation coverage: expect a large, verifiable bibliography (A150++ default: `core_size=300` → `ref.bib` ~300) and high cite density:
- Per-H3: `survey` profile expects >=12 unique citations per H3 (and deeper profiles may require more).
- Front matter: `survey` profile expects Introduction>=35 and Related Work>=50 unique citations (dense positioning; no cite dumps).
- Global: `pipeline-auditor` gates on **global unique citations across the full draft**. A150++ defaults: hard `>=150`; recommended `>=165` (when bib=300). `queries.md:citation_target` controls which is blocking (default: `recommended`). If it fails, prefer `citation-diversifier` → `citation-injector` (in-scope, NO NEW FACTS) using each H3’s `allowed_bibkeys_selected` / `allowed_bibkeys_mapped` from `outline/writer_context_packs.jsonl`.
- Anti-template: drafts containing ellipsis placeholders (`…`) or leaked scaffold instructions (e.g., "enumerate 2-4 ...") should block and be regenerated from improved outline/mapping/evidence artifacts.
- Final polish hard gates (`survey`/`deep`): block on narration-template openers (e.g., `This subsection ...`), slide navigation phrasing, repeated opener stems/verbal tics, adjacent citation blocks (`[@a] [@b]`), duplicate keys in one block (`[@a; @a]`), and low H3 mid-sentence citation ratio (<30%).
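The global gate described above can be approximated in a few lines. The sketch assumes Pandoc-style `[@key]` citations in `output/DRAFT.md` and uses the A150++ defaults stated in this file (hard floor 150, recommended 165); the function names are hypothetical and the actual `pipeline-auditor` logic may differ.

```python
from __future__ import annotations

import re


def unique_citation_keys(draft: str) -> set[str]:
    """Collect distinct keys from blocks like [@a] or [@a; @b]."""
    keys: set[str] = set()
    for block in re.findall(r'\[(@[^\]]+)\]', draft):
        for part in block.split(';'):
            key = part.strip().lstrip('@')
            if key:
                keys.add(key)
    return keys


def citation_gate(
    draft: str,
    *,
    citation_target: str = 'recommended',
    hard_floor: int = 150,
    recommended: int = 165,
) -> tuple[bool, int, int]:
    """Return (passed, unique_count, threshold) for the selected target."""
    threshold = recommended if citation_target == 'recommended' else hard_floor
    count = len(unique_citation_keys(draft))
    return count >= threshold, count, threshold
```

When the gate fails, the gap `threshold - count` is roughly the number of new in-scope keys that `citation-diversifier` → `citation-injector` would need to place.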
FILE:pipelines/graduate-paper-pipeline.md
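# Pipeline: Chinese graduate thesis restructuring and finalization (research-stage draft)
> Status: research-stage draft
> Current positioning: study the workflow and the skills it needs first; **do not bind UNITS yet, and do not force it into an executable contract**
> Audience: Chinese graduate thesis writing that starts from an existing university template, prior papers, Overleaf sources, PDFs, BibTeX, experiment figures and tables, and intermediate working documents
> Core goal: restructure the "existing materials" into a Chinese graduate thesis with a unified main line, smooth structure, complete evidence, consistent terminology, and a compilable deliverable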
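## 1. The nature of this pipeline
This is neither a linear "write a thesis from scratch" workflow nor a "stitch several papers together" assembly workflow. It is a thesis-engineering pipeline that is **driven by a question list, centered on a Markdown intermediate layer, and closed out in a TeX delivery layer**.
Its actual main loop is:
> `question list -> semantic localization -> Markdown restructuring -> TeX write-back -> compile and review -> write issues back -> enter the next round`
The first goal of this pipeline is therefore not to "rush out a draft of the body text" but to get the following right first:
1. Recover the existing materials first, rather than editing blindly in `tex`.
2. Rebuild chapter roles around the thesis's main line first, rather than copying the narrative of the original papers.
3. Work out structure, evidence, terminology, and figures/tables in the Markdown intermediate layer first, then write back to TeX.
4. Resolve structure and evidence problems first, then address style and the removal of AI-sounding phrasing.
5. The real convergence point is not "one full pass has been written" but "the question list is cleared round by round, and compilation and review pass stably".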
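## 2. The fundamental difference between this pipeline and the survey pipeline
It differs substantially from `arxiv-survey` / `arxiv-survey-latex` and must be modeled separately:
- the survey pipeline starts from "retrieve literature by topic and converge the structure layer by layer";
- the graduate-paper pipeline starts from "inventory and restructure existing materials";
- the survey pipeline's core intermediate artifact is `outline/evidence_drafts/...`;
- the graduate-paper pipeline's core intermediate artifacts are the outline, question list, chapter intermediate drafts, figure plan, and review records under `codex_md/`;
- the survey pipeline's main goal is "produce a survey";
- the graduate-paper pipeline's main goal is "restructure existing research work into a finalized thesis that meets Chinese degree-thesis conventions".
In one sentence:
> survey is closer to a "retrieval-driven evidence-writing workflow";
> graduate-paper is closer to an "existing-material-driven thesis-engineering restructuring workflow".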
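## 3. Workspace layering and key artifacts
This pipeline needs a clear separation between the **thinking layer** and the **delivery layer**.
### 3.1 Directory responsibilities
- `main.tex`: the thesis entry point; assembles the cover, abstract, table of contents, body, appendices, acknowledgements, achievements, and references.
- `chapters/`: the final deliverable body `.tex` files.
- `abstract/`, `preface/`, `acknowledgement/`, `achievements/`: the formal non-body deliverables.
- `references/`: the bibliography database and style files.
- `pdf/`: PDFs of published or submitted papers.
- `Overleaf_ref/`: existing sources, revision drafts, and supplementary materials.
- `codex_md/`: the intermediate working layer; this is the **thesis restructuring workspace**.
- `claude_md/`: review checklists and final-draft verification records.
- `tmp_layout/`, `tmp_layout2/`: trial figure layouts and page-layout rehearsals.
The most important distinction is:
- `codex_md/` is the **thinking / restructuring area**
- `chapters/` is the **delivery / finalization area**
Do not mix the two.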
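### 3.2 Intermediate artifacts worth fixing in place
It is recommended to keep the following artifacts fixed for the lifetime of this pipeline:
- `codex_md/material_index.md`: inventory and index of existing materials
- `codex_md/material_readiness.md`: material readiness and gap notes
- `codex_md/missing_info.md`: list of missing information
- `codex_md/question_list.md`: this round's question list, priorities, and acceptance criteria
- `codex_md/00_thesis_outline.md`: thesis main line, outline, and chapter roles
- `codex_md/chapter_role_map.md`: mapping table of source material -> chapter -> role
- `codex_md/chapter_rewrite_rules.md`: chapter restructuring rules
- `codex_md/terminology_glossary.md`: unified terminology table
- `codex_md/symbol_metric_table.md`: unified symbol / metric table
- `codex_md/figure_plan.md`: figure, table, and layout plan
- `claude_md/review_checklist.md`: review checklist for compilation, typesetting, data, and template items
- `output/THESIS_BUILD_REPORT.md`: compilation and final-draft check report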
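## 4. The skills this pipeline actually needs
For now, setting reuse and generalization aside and splitting purely by what this thesis pipeline itself needs, the first version should use the following skills.
### 4.1 Main-chain skills
1. `thesis-workspace-init`
   Responsibility: initialize the thesis project, remind the user where to place materials, set up the workspace, check whether `main.tex` compiles, and generate the material inventory.
2. `thesis-question-list`
   Responsibility: create and maintain `question_list.md`, pinning down each round's revision goals, question priorities, boundaries, and acceptance criteria.
   This is the **control-plane skill** of the whole pipeline.
3. `thesis-source-role-mapper`
   Responsibility: map existing papers / templates / sources / PDFs / figure materials to "thesis roles", not just `paper -> chapter`.
4. `thesis-chapter-reconstructor`
   Responsibility: restructure each chapter's goal, weight, transitions, and narrative around the thesis main line.
   This is the **core restructuring skill** of the whole pipeline.
5. `thesis-markdown-aligner`
   Responsibility: unify the main line, terminology, symbols, metrics, and figure conventions, so that the chapters converge in the Markdown intermediate layer into one thesis rather than a set of papers.
6. `thesis-tex-writeback`
   Responsibility: write the content already straightened out in the Markdown layer back to `chapters/*.tex`, and sync figures, equations, cross-references, and chapter transitions.
7. `thesis-compile-review`
   Responsibility: run compilation, warning triage, template-mode checks, data-consistency verification, and question write-back.
   It forms the final quality loop.
8. `thesis-style-polisher`
   Responsibility: once structure, evidence, and data are stable, polish toward Chinese degree-thesis style and remove AI flavor.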
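### 4.2 Parallel side-branch skills
9. `thesis-visual-layout-planner`
   Responsibility: figure planning, Mermaid sketches, temporary page composition, and checks on figure-text rhythm.
10. `thesis-frontmatter-sync`
    Responsibility: sync the abstract, cover, appendices, achievements, acknowledgements, lists, and other non-body parts, so they are not left to be patched in at the very end.
11. `thesis-citation-enhance-review`
    Responsibility: locate sentences that must be backed by citations, expand the candidate literature, verify that citations match the claims, and write back to `references/*.bib` and the in-text citations.
### 4.3 How these 11 skills relate
The main chain is:
`thesis-workspace-init -> thesis-question-list -> thesis-source-role-mapper -> thesis-chapter-reconstructor -> thesis-markdown-aligner -> thesis-tex-writeback -> thesis-compile-review -> thesis-style-polisher`
The parallel side branches are:
- `thesis-visual-layout-planner`
- `thesis-frontmatter-sync`
- `thesis-citation-enhance-review`
Among them:
- `thesis-question-list` is the control plane for the whole workflow
- `thesis-chapter-reconstructor` is the core restructuring plane
- `thesis-compile-review` is the quality-loop plane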
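## 5. Pipeline overview (by stage)
| Stage | Goal | Core skills | Main outputs |
|---|---|---|---|
| Stage 0 | Project initialization and material intake | `thesis-workspace-init` | Directory skeleton, initial compile, material index, material readiness notes |
| Stage 1 | Recover existing materials into the Markdown intermediate layer | `thesis-workspace-init` | Initial Markdown materials, missing-information list, material gap notes |
| Stage 1.5 | Lock this round's questions and boundaries | `thesis-question-list` | `question_list.md` |
| Stage 2 | Map source materials to thesis roles | `thesis-source-role-mapper` | Chapter role map, materials grouped by chapter |
| Stage 2.5 | Restructure chapters around the main line | `thesis-chapter-reconstructor` | Restructured chapter Markdown |
| Stage 3 | Align, merge, and unify the whole-thesis structure | `thesis-markdown-aligner` | Stable outline, terminology table, symbol table, evidence gap list |
| Stage 3.5 | Figure and layout rehearsal | `thesis-visual-layout-planner` | Figure plan, figure-text mapping, temporary layout results |
| Stage 4 | Write back to the TeX delivery layer | `thesis-tex-writeback` | Compilable chapter `.tex`, first complete `main.pdf` |
| Stage 4.5 | Sync non-body parts | `thesis-frontmatter-sync` | Synced versions of the abstract, appendices, cover, achievements, etc. |
| Stage 5 | Citation enhancement and verification | `thesis-citation-enhance-review` | Citation reinforcement results, verification records |
| Stage 6 | Compilation, typesetting, and final review | `thesis-compile-review` | `THESIS_BUILD_REPORT`, review checklist |
| Stage 7 | Final polish and removal of AI flavor | `thesis-style-polisher` | A more natural Chinese final draft |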
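## 6. Stage-by-stage details
### Stage 0: Project initialization and material intake
**Goal**
First turn the thesis into a maintainable project rather than a pile of scattered files.
**Main inputs**
- University template
- Existing repository / sources
- Basic metadata such as student ID, year, and Chinese and English titles
- Existing `main.tex`
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Basic directory skeleton
- Initial `main.tex` / initial `main.pdf`
- `codex_md/material_index.md`
- `codex_md/material_readiness.md`
**Handoff**
- If `main.tex` does not compile yet, do not enter the body restructuring stages
- If the materials are not yet in place, do not start the question list or chapter restructuring
**Execution focus**
- Make explicit which areas are for materials, which are for work, and which are for delivery
- Make sure it "compiles" first, then worry about "reads well"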
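### Stage 1: Recover existing materials into the Markdown intermediate layer
**Goal**
Recover the reusable content from the existing `template / tex / Overleaf / PDF / bib` into a Markdown intermediate layer, rather than forcing edits directly in the TeX layer.
**Main inputs**
- University template
- Existing `.tex`
- Overleaf sources
- PDFs
- Bibliography database
**Core skill**
- `thesis-workspace-init`
**Main outputs**
- Initial Markdown chapter materials
- `codex_md/material_readiness.md`
- `codex_md/missing_info.md`
- Material index and list of information still to be supplied
**Handoff**
- The outputs of Stage 1 feed directly into `thesis-question-list` and `thesis-source-role-mapper`
**Execution focus**
- The focus is an "inventory of material assets", not "writing pretty sentences first"
- Explicitly mark experiment details, figure captions, citations, and numeric checks that still need to be supplied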
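### Stage 1.5: Lock the question list and this round's goals
**Goal**
Establish this round's control plane and make clear what this round fixes and what it does not.
**Main inputs**
- Initial Markdown materials
- Review feedback from the previous round
- Current compilation / structure / style problems
**Core skill**
- `thesis-question-list`
**Main outputs**
- `codex_md/question_list.md`
**Handoff**
- This step is the starting point for all later restructuring and write-back
- Any structural dispute should come back here first to be re-prioritized
**Execution focus**
- Every question must state clearly: what it is, why it matters, how to fix it, and what level of acceptance is required
- The question list is not a memo; it is the entry point for iteration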
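### Stage 2: Map source materials to thesis roles
**Goal**
Map the existing materials to chapter roles in the thesis, rather than mechanically assigning "paper to chapter".
**Main inputs**
- Existing papers, PDFs, sources, figures
- Initial Markdown materials
- `question_list.md`
**Core skill**
- `thesis-source-role-mapper`
**Main outputs**
- `codex_md/chapter_role_map.md`
- Markdown materials grouped by chapter
- Content boundaries for each chapter
**Handoff**
- This is the direct input to `thesis-chapter-reconstructor`
**Execution focus**
- Distinguish: method core / background support / validation evidence / system implementation / material for limitations and outlook
- If this is set wrong, the whole thesis will be skewed downstream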
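### Stage 2.5: Restructure chapters around the main line
**Goal**
Convert the paper-style narrative into a thesis-style narrative.
**Main inputs**
- Chapter role map
- `question_list.md`
- Markdown drafts of each chapter
**Core skill**
- `thesis-chapter-reconstructor`
**Main outputs**
- Restructured chapter Markdown
- `codex_md/chapter_rewrite_rules.md`
**Handoff**
- This is the core stage of the whole pipeline
- All later alignment, write-back, and compilation build on the restructuring results from this stage
**Execution focus**
- Make clear which question each chapter answers
- Decide which content to strengthen, weaken, move up, or push down
- Rewrite the transitions to the preceding and following chapters
- Turn the "selling-point narrative" into a "thesis main-line narrative"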
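### Stage 3: Markdown alignment, merging, and unification
**Goal**
Make the whole thesis become "one thesis" in the intermediate layer first, rather than several papers spliced together.
**Main inputs**
- Restructured chapter Markdown
- `question_list.md`
**Core skill**
- `thesis-markdown-aligner`
**Main outputs**
- Stable version of `codex_md/00_thesis_outline.md`
- `codex_md/terminology_glossary.md`
- `codex_md/symbol_metric_table.md`
- Evidence gap list
**Handoff**
- While the structure, terminology, or metrics are still unstable, do not enter `tex-writeback`
**Execution focus**
- Unify the research main line, terminology, abbreviations, symbols, metrics, and figure conventions
- Decide which content is explained once in Chapter 2, which is developed in Chapters 3-5, and which goes into the appendices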
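### Stage 3.5: Figure and layout rehearsal
**Goal**
Plan figures and layout before the body is finalized; do not let figures turn into a last-minute patch job.
**Main inputs**
- Stable outline
- Chapter Markdown
- Existing figures and experiment results
**Core skill**
- `thesis-visual-layout-planner`
**Main outputs**
- `codex_md/figure_plan.md`
- `mermaid/` diagram sketches
- `tmp_layout/`, `tmp_layout2/` trial layout results
**Handoff**
- The figure planning results directly affect the figure-text structure in `tex-writeback`
**Execution focus**
- Make clear which figures and tables each chapter needs to support the main line
- Sketch first, then do a temporary page composition, then decide final placement and captions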
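### Stage 4: Write back to the TeX delivery layer
**Goal**
Write the content already straightened out in the Markdown layer back to `chapters/*.tex` and enter the delivery layer.
**Main inputs**
- Stable Markdown chapters
- Figure plan
- Unified terminology / symbol / metric tables
**Core skill**
- `thesis-tex-writeback`
**Main outputs**
- `chapters/*.tex`
- First complete `main.pdf`
**Handoff**
- If structural problems surface at this stage, the default is to go back and fix them in the Markdown layer rather than reinventing the structure in the TeX layer
**Execution focus**
- Sync figures, equations, cross-references, chapter-opening introductions, and chapter-closing summaries
- The TeX layer is responsible for delivery, not for rethinking the structure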
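### Stage 4.5: Sync non-body parts
**Goal**
Maintain the abstract, appendices, cover, achievements, acknowledgements, and other non-body parts in parallel, so they are not left until the end and written up in a rush at low quality.
**Main inputs**
- Stable version of the body
- University template items
- Chinese and English titles and abstract information
**Core skill**
- `thesis-frontmatter-sync`
**Main outputs**
- `abstract/abstract.tex`
- `abstract/abstract-en.tex`
- `preface/...`
- `appendix/...`
- `acknowledgement/...`
- `achievements/...`
**Handoff**
- The final review in Stage 6 checks these parts together with the body
**Execution focus**
- Chinese and English titles and abstracts are consistent with each other
- Numbers, metrics, and method names in the abstract match the body
- Appendices contain only supplementary content and do not repeat the main line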
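### Stage 5: Literature enhancement and citation verification
**Goal**
Systematically fill in citations for "sentences that should be backed by a citation", and make sure each citation matches its claim.
**Main inputs**
- Current Markdown / TeX version
- `references/*.bib`
- Existing PDFs / reference works
**Core skill**
- `thesis-citation-enhance-review`
**Main outputs**
- Enhanced bibliography database
- Citation reinforcement records
- Citation verification records
**Handoff**
- Literature enhancement can run in parallel with Stages 3 and 4, but it must be closed out before the final polish
**Execution focus**
- Prioritize sentences that must have citation support: factual descriptions, method taxonomies, classic conclusions, metric definitions, and data sources
- Get citations right first, then add more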
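### Stage 6: Compilation, typesetting, and final review
**Goal**
Form a quality loop: not just "it compiles", but "it is fit to submit".
**Main inputs**
- Complete TeX project
- Synced non-body parts
- Review checklist
**Core skill**
- `thesis-compile-review`
**Main outputs**
- `output/THESIS_BUILD_REPORT.md`
- `claude_md/review_checklist.md`
- List of closed / still-open questions
**Handoff**
- Problems found in Stage 6 determine which layer to return to for the fix: Stage 1.5 / 2.5 / 3 / 4
**Execution focus**
- A successful compile is only the minimum requirement
- Warnings need to be triaged by level: blocks submission / must fix / record and observe
- Also check: data consistency, validity of cite keys, correctness of template parameters, and consistency of terminology and abbreviations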
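
The three-level warning triage described for this stage can be sketched as a rule table applied to the LaTeX log. This is a minimal illustration, not the `thesis-compile-review` implementation: the level names (`block`, `must_fix`, `observe`) and the assignment of each message pattern to a level are assumptions that a real project would tune to its own template.

```python
import re

# Hypothetical severity rules; the first matching rule wins.
# The patterns target common LaTeX log messages.
RULES = [
    ("block", re.compile(r"^! ")),                      # hard TeX errors
    ("block", re.compile(r"Citation .* undefined")),    # broken cite keys
    ("block", re.compile(r"Reference .* undefined")),   # broken cross-references
    ("must_fix", re.compile(r"multiply defined")),      # duplicate labels
    ("must_fix", re.compile(r"Overfull \\hbox")),       # text running into the margin
    ("observe", re.compile(r"Underfull \\[hv]box")),    # loose spacing
    ("observe", re.compile(r"Font shape .* not available")),
]


def triage(log_text: str) -> dict[str, list[str]]:
    """Sort log lines into severity buckets; unmatched lines are ignored."""
    buckets: dict[str, list[str]] = {"block": [], "must_fix": [], "observe": []}
    for line in log_text.splitlines():
        for level, pattern in RULES:
            if pattern.search(line):
                buckets[level].append(line.strip())
                break
    return buckets
```

A build report could then refuse to mark the run as submittable while `triage(log)["block"]` is non-empty, and list the other two buckets for follow-up.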
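### Stage 7: Final polish and removal of AI flavor
**Goal**
Only after structure, evidence, and data are stable, polish toward Chinese degree-thesis style.
**Main inputs**
- A thesis version that has passed compilation and review
- `中文写作要求.md` (Chinese writing requirements)
- `GPT口癖与高频用词调研.md` (survey of GPT verbal tics and high-frequency words)
**Core skill**
- `thesis-style-polisher`
**Main outputs**
- A more natural Chinese final draft
- De-templated introductions, summaries, conclusions, and outlook
**Handoff**
- Stage 7 must come last
- If polishing exposes structural or evidence problems, fall back to an earlier stage instead of polishing over them
**Execution focus**
- Prioritize chapter-opening introductions, chapter-closing summaries, contribution statements, and the conclusion and outlook
- Tone down template phrasing, promotional phrasing, and common AI verbal tics
- The goal is "reads more like a degree thesis", not "reads more like an AI-optimized paper"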
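## 7. The four fixed loops of this pipeline
To keep it from degrading into a "linear SOP" again, it is recommended to fix the following four loops in place:
### 7.1 Structure loop
`outline -> a chapter's md -> chapter imbalance found -> back to outline / question_list`
### 7.2 Content loop
`paper / Overleaf / PDF -> extract content -> restructure narrative -> still reads like the original paper -> restructure again`
### 7.3 Typesetting loop
`tex -> compile -> warning / figure placement / cross-reference problems found -> back to tex or figure planning`
### 7.4 Style loop
`body stable -> AI polish -> template phrasing / AI flavor found -> back to writing guidelines and wording control`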
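## 8. Execution priority for this pipeline
If resources are limited at run time, the priority order should be:
1. `thesis-question-list`
2. `thesis-source-role-mapper`
3. `thesis-chapter-reconstructor`
4. `thesis-markdown-aligner`
5. `thesis-tex-writeback`
6. `thesis-compile-review`
In other words:
- What truly cannot be skipped is "question list + material role mapping + chapter restructuring + compile review"
- What truly should not be done first is "polishing and removing AI flavor"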
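## 9. Current conclusion
The first principle of this graduate-paper pipeline should be:
> **Restructure first, then write; converge in the intermediate layer first, then deliver in TeX; get evidence and structure right first, then style and polish.**
Therefore, if this pipeline is later made formally executable, the first thing worth implementing is not "one big all-in-one runner", but making the following skills solid:
- `thesis-workspace-init`
- `thesis-question-list`
- `thesis-source-role-mapper`
- `thesis-chapter-reconstructor`
- `thesis-compile-review`
Only once these five skills are solid does this Chinese graduate thesis pipeline have real practical value.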
FILE:pipelines/idea-brainstorm.pipeline.md
---
name: idea-brainstorm
version: 4.0
profile: idea-brainstorm
routing_hints: [idea, ideation, brainstorm, 点子, 选题, 找方向, 找 idea]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
- outline/taxonomy.yml
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.idea-brainstorm.csv
contract_model: pipeline.frontmatter/v1
query_defaults:
  draft_profile: idea_brainstorm
  max_results: 1800
  core_size: 100
  evidence_mode: abstract
  direction_pool_min: 12
  direction_pool_max: 24
  idea_screen_top_n: 10
  idea_shortlist_size: 5
  report_top_n: 3
overridable_query_fields:
- keywords
- exclude
- max_results
- core_size
- evidence_mode
- direction_pool_min
- direction_pool_max
- idea_screen_top_n
- idea_shortlist_size
- report_top_n
- time_window.from
- time_window.to
quality_contract:
  memo_bundle:
    required_outputs: [output/REPORT.md, output/APPENDIX.md, output/REPORT.json]
  signal_policy:
    min_rows: 10
  direction_policy:
    shortlist_min: 3
    shortlist_max: 5
    keep_min: 3
    cluster_diversity_min: 2
    lead_diversity_target: 3
    lead_diversity_axes: [cluster, direction_type, program_kind]
  screening_policy:
    keep_rank_max: 7
    maybe_rank_max: 12
    score_weights:
      discussion_worthiness: 0.24
      academic_value: 0.22
      evidence_grounding: 0.18
      direction_distinctness: 0.16
      first_probe_clarity: 0.10
      thesis_potential: 0.10
loop_policy:
  stage_retry_budget:
    C1: 2
    C3: 1
    C4: 1
  max_reroutes: 4
  require_human_on_retry_after_approval: true
stages:
  C0:
    title: Init + idea brief
    mode: no_prose
    required_skills: [workspace-init, pipeline-router, idea-brief, human-checkpoint]
    optional_skills: []
    produces: [STATUS.md, UNITS.csv, CHECKPOINTS.md, DECISIONS.md, GOAL.md, queries.md, output/trace/IDEA_BRIEF.md, output/QUALITY_GATE.md, output/RUN_ERRORS.md]
    human_checkpoint:
      approve: brainstorm brief
      write_to: DECISIONS.md
  C1:
    title: Retrieval + core set
    mode: no_prose
    required_skills: [literature-engineer, dedupe-rank]
    optional_skills: []
    produces: [papers/papers_raw.jsonl, papers/papers_dedup.jsonl, papers/core_set.csv, papers/retrieval_report.md]
  C2:
    title: Idea landscape / focus
    mode: no_prose
    required_skills: [taxonomy-builder, pipeline-router, human-checkpoint]
    optional_skills: []
    produces: [outline/taxonomy.yml, DECISIONS.md]
    human_checkpoint:
      approve: focus clusters / lenses + exclusions
      write_to: DECISIONS.md
  C3:
    title: Evidence signals
    mode: no_prose
    required_skills: [paper-notes, idea-signal-mapper]
    optional_skills: []
    produces: [papers/paper_notes.jsonl, papers/evidence_bank.jsonl, output/trace/IDEA_SIGNAL_TABLE.md, output/trace/IDEA_SIGNAL_TABLE.jsonl]
  C4:
    title: Direction pool + screening
    mode: short_prose_ok
    required_skills: [idea-direction-generator, idea-screener]
    optional_skills: []
    produces: [output/trace/IDEA_DIRECTION_POOL.md, output/trace/IDEA_DIRECTION_POOL.jsonl, output/trace/IDEA_SCREENING_TABLE.md, output/trace/IDEA_SCREENING_TABLE.jsonl]
  C5:
    title: Shortlist + memo synthesis + self-loop
    mode: prose_allowed
    required_skills: [idea-shortlist-curator, idea-memo-writer, deliverable-selfloop, artifact-contract-auditor]
    optional_skills: []
    produces: [output/trace/IDEA_SHORTLIST.md, output/trace/IDEA_SHORTLIST.jsonl, output/REPORT.md, output/APPENDIX.md, output/REPORT.json, output/DELIVERABLE_SELFLOOP_TODO.md, output/CONTRACT_REPORT.md]
---
# Pipeline: research idea brainstorm (signals -> directions -> memo)
Goal: produce a **discussion-ready research-idea memo** for PI / PhD readers.
This pipeline is for requests such as “找 research idea / brainstorm / 选题 / 找方向” (finding a research idea, brainstorming, choosing a topic, finding a direction).
It is **not** for writing a survey draft and **not** for generating execution-grade project specs.
The terminal deliverable is a single default entrypoint:
- `output/REPORT.md`
That memo should feel like a mature brainstorm artifact:
- grounded in literature,
- small enough to discuss,
- clear about what is promising,
- honest about uncertainty,
- and rich enough to guide the next discussion round.
Artifact policy:
- Reader-facing terminal artifacts:
  - `output/REPORT.md`
  - `output/APPENDIX.md`
  - `output/REPORT.json`
- Trace artifacts stay under `output/trace/`.
- The run is only shareable when both layers exist and pass contract audit.
Default profile:
- Retrieval route: `literature-engineer`
- `core_size=100`
- Evidence mode: `abstract`
- Signal table: 10-20 rows
- Direction pool: 12-24
- Shortlist: 3-5
- Final memo lead directions: 3
## Stage 0 - Init + idea brief (C0)
required_skills:
- workspace-init
- pipeline-router
- idea-brief
- human-checkpoint
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/trace/IDEA_BRIEF.md
Notes:
- The brief is the single source of truth for topic, audience, constraints, exclusions, targets, and query buckets.
- Default behavior: block once at C0 for human approval of the brief before retrieval.
## Stage 1 - Retrieval + core set (C1)
required_skills:
- literature-engineer
- dedupe-rank
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/retrieval_report.md
Notes:
- Use multi-query buckets from `queries.md`.
- If recall is too low/noisy, fix it upstream and rerun C1.
## Stage 2 - Idea landscape / focus (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- pipeline-router
- human-checkpoint
produces:
- outline/taxonomy.yml
- DECISIONS.md
human_checkpoint:
- approve: focus clusters / lenses + exclusions
- write_to: DECISIONS.md
Notes:
- Taxonomy is an idea landscape, not a paper outline.
- Default behavior: pause at C2 so the human can choose a few promising focus lenses.
## Stage 3 - Evidence signals (C3) [table-first]
required_skills:
- paper-notes
- idea-signal-mapper
produces:
- papers/paper_notes.jsonl
- papers/evidence_bank.jsonl
- output/trace/IDEA_SIGNAL_TABLE.md
- output/trace/IDEA_SIGNAL_TABLE.jsonl
Notes:
- The signal table should capture tensions, missing pieces, and academically meaningful axes rather than proposal-ready wedges.
- Keep this stage compact and table-first; no long prose.
## Stage 4 - Direction pool + screening (C4)
required_skills:
- idea-direction-generator
- idea-screener
produces:
- output/trace/IDEA_DIRECTION_POOL.md
- output/trace/IDEA_DIRECTION_POOL.jsonl
- output/trace/IDEA_SCREENING_TABLE.md
- output/trace/IDEA_SCREENING_TABLE.jsonl
Notes:
- Expansion should generate discussion-worthy research directions, not an operator cartesian product.
- Screening should score discussion value, distinctness, evidence grounding, and thesis potential.
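
The screening step can be illustrated with the `score_weights`, `keep_rank_max`, and `maybe_rank_max` values from this pipeline's frontmatter. The sketch below is an assumption about how those fields might combine, not the `idea-screener` implementation: it assumes each direction carries a per-axis score in the range 0 to 1, that the total is a plain weighted sum, and that the two rank limits act as cutoffs on the sorted list. The function names and the `keep` / `maybe` / `drop` labels are hypothetical.

```python
from __future__ import annotations

# Weights copied from quality_contract.screening_policy.score_weights.
WEIGHTS = {
    "discussion_worthiness": 0.24,
    "academic_value": 0.22,
    "evidence_grounding": 0.18,
    "direction_distinctness": 0.16,
    "first_probe_clarity": 0.10,
    "thesis_potential": 0.10,
}
KEEP_RANK_MAX = 7    # assumed: ranks 1..7 are kept
MAYBE_RANK_MAX = 12  # assumed: ranks 8..12 are "maybe", the rest are dropped


def weighted_score(axis_scores: dict[str, float]) -> float:
    """Weighted sum over the six axes; a missing axis counts as 0."""
    return sum(w * axis_scores.get(axis, 0.0) for axis, w in WEIGHTS.items())


def screen(directions: dict[str, dict[str, float]]) -> list[tuple[str, float, str]]:
    """Rank directions by score and label each one by its rank bucket."""
    ranked = sorted(
        directions.items(),
        key=lambda item: weighted_score(item[1]),
        reverse=True,
    )
    table = []
    for rank, (name, axes) in enumerate(ranked, start=1):
        if rank <= KEEP_RANK_MAX:
            decision = "keep"
        elif rank <= MAYBE_RANK_MAX:
            decision = "maybe"
        else:
            decision = "drop"
        table.append((name, round(weighted_score(axes), 4), decision))
    return table
```

With the default direction pool of 12-24 entries, this reading would always leave at least five "maybe" rows behind the seven kept ones, which is consistent with a shortlist of 3-5 being curated afterwards.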
## Stage 5 - Shortlist + memo synthesis + self-loop (C5)
required_skills:
- idea-shortlist-curator
- idea-memo-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/trace/IDEA_SHORTLIST.md
- output/trace/IDEA_SHORTLIST.jsonl
- output/REPORT.md
- output/APPENDIX.md
- output/REPORT.json
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- `output/trace/IDEA_SHORTLIST.md` is an internal convergence layer, not the user-facing final answer.
- `output/REPORT.md` is the terminal deliverable and should read like a discussion-ready research idea brainstorm memo.
- `deliverable-selfloop` should evaluate the full memo bundle plus its trace chain.
FILE:pipelines/lit-snapshot.pipeline.md
---
name: lit-snapshot
version: 2.3
profile: lit-snapshot
routing_hints: [snapshot, 快照, one-page, one page, 48h]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
- outline/taxonomy.yml
- outline/outline.yml
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.lit-snapshot.csv
---
# Pipeline: literature snapshot (24-48h) [bullets-first]
Goal: a compact, reader-facing snapshot (`output/SNAPSHOT.md`) with a small but usable paper set and a paper-like structure. This is intentionally lighter than `arxiv-survey*` (no evidence packs, no BibTeX, no LaTeX).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Retrieval & core set (C1)
required_skills:
- arxiv-search
- dedupe-rank
optional_skills:
- keyword-expansion
produces:
- papers/papers_raw.jsonl
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Snapshot default: aim for a smaller but diverse set (e.g., core_set ~20-40) rather than a survey-scale pool.
- Practical retrieval loop (multi-query, then refine):
  - Start broad with multiple query buckets (synonyms/acronyms/subtopics) and merge the results.
  - If too few: add buckets + widen time window; if too noisy: rewrite keywords + add exclusions; rerun C1.
- If the snapshot feels generic, the fix is almost always upstream: broaden `queries.md` (more synonyms, fewer excludes, wider time window) and rerun C1.
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- taxonomy-builder
- outline-builder
optional_skills:
- outline-budgeter
produces:
- outline/taxonomy.yml
- outline/outline.yml
human_checkpoint:
- approve: scope + outline (snapshot ToC)
- write_to: DECISIONS.md
Notes:
- Paper-like default: prefer fewer, thicker sections. For snapshots, avoid H3 explosion; treat H2 as the main organizing unit.
- If the outline is over-fragmented, use `outline-budgeter` (NO PROSE) to merge adjacent nodes and keep the ToC readable before writing.
## Stage 3 - Snapshot (C3) [SHORT PROSE OK]
required_skills:
- snapshot-writer
- deliverable-selfloop
- artifact-contract-auditor
optional_skills:
- prose-writer
produces:
- output/SNAPSHOT.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- The deliverable is bullets-first and pointer-heavy: every non-trivial claim should attach 1-2 concrete paper pointers from `papers/core_set.csv`.
- Avoid outline narration (e.g., `This section surveys ...`); write content claims + why-it-matters + pointers.
FILE:pipelines/peer-review.pipeline.md
---
name: peer-review
version: 2.3
profile: peer-review
routing_hints: [peer review, review report, referee, 审稿]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PAPER.md
- output/CLAIMS.md
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.peer-review.csv
---
# Pipeline: peer review / referee report
Goal: produce an actionable referee report (`output/REVIEW.md`) that is traceable to the submitted paper’s claims and evidence (no free-floating advice, no invented comparisons).
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
## Stage 1 - Claims (C1)
required_skills:
- manuscript-ingest
- claims-extractor
produces:
- output/PAPER.md
- output/CLAIMS.md
Notes:
- Input expectation: you must have the manuscript text available as `output/PAPER.md` (or equivalent extracted text) before running this stage.
- Contract: every claim must include a source pointer (section/page/quote) so later critique is auditable.
## Stage 2 - Evidence audit (C2)
required_skills:
- evidence-auditor
- novelty-matrix
produces:
- output/MISSING_EVIDENCE.md
- output/NOVELTY_MATRIX.md
Notes:
- Evidence-auditor should write gaps/risks/verification steps, not “fix the paper”.
- Novelty-matrix should be conservative: compare against cited/known related work; if you lack sources, record that limitation explicitly.
## Stage 3 - Rubric write-up (C3)
required_skills:
- rubric-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/REVIEW.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Prefer concrete, minimal fixes (what experiment/ablation/analysis would resolve the concern) over generic “needs more experiments”.
FILE:pipelines/systematic-review.pipeline.md
---
name: systematic-review
version: 2.3
profile: systematic-review
routing_hints: [systematic review, systematic, prisma, 系统综述]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/PROTOCOL.md
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
- papers/screening_log.csv
- papers/extraction_table.csv
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3,C4,C5]
units_template: templates/UNITS.systematic-review.csv
---
# Pipeline: systematic review (PRISMA-style)
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Protocol (C1)
required_skills:
- protocol-writer
produces:
- output/PROTOCOL.md
human_checkpoint:
- approve: protocol locked (query, inclusion/exclusion, databases, time window)
- write_to: DECISIONS.md
## Stage 2 - Retrieval & candidate pool (C2)
required_skills:
- literature-engineer
- dedupe-rank
optional_skills:
- keyword-expansion
- arxiv-search
produces:
- papers/papers_raw.jsonl
- papers/retrieval_report.md
- papers/papers_dedup.jsonl
- papers/core_set.csv
Notes:
- Systematic-review contract: keep the candidate pool auditable. Prefer dedupe (good) over aggressive ranking/filtering (dangerous).
- `dedupe-rank` default: for this pipeline, it keeps the full deduped pool in `papers/core_set.csv` unless you explicitly set `queries.md:core_size` (screening should not silently drop papers).
## Stage 3 - Screening (C3)
required_skills:
- screening-manager
produces:
- papers/screening_log.csv
Notes:
- Use `papers/papers_dedup.jsonl` (or `papers/core_set.csv`) as the candidate list; `papers/screening_log.csv` must include protocol-grounded reasons for every decision.
- Practical auditability: number protocol clauses (`I1..` / `E1..`) and require screening rows to cite them (e.g., `reason_codes=E3`).
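
The reason-code convention above lends itself to a mechanical check. The following sketch assumes a CSV with a `reason_codes` column (the name comes from the example above), assumes multiple codes are separated by `;`, and uses hypothetical `paper_id` and `decision` columns; the real `papers/screening_log.csv` schema may differ.

```python
import csv
import io
import re

# Protocol clause ids look like I1, I2, ... (inclusion) or E1, E2, ... (exclusion).
CODE_RE = re.compile(r"^[IE]\d+$")


def audit_screening_rows(csv_text: str, protocol_codes: set[str]) -> list[str]:
    """Report rows whose reason codes are missing, malformed, or not in the protocol."""
    problems: list[str] = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        raw = row.get("reason_codes") or ""
        codes = [code.strip() for code in raw.split(";") if code.strip()]
        if not codes:
            problems.append(f"line {line_no}: no reason code")
            continue
        for code in codes:
            if not CODE_RE.match(code):
                problems.append(f"line {line_no}: malformed code {code!r}")
            elif code not in protocol_codes:
                problems.append(f"line {line_no}: unknown clause {code}")
    return problems
```

An empty result means every screening decision cites at least one clause that exists in the locked protocol.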
## Stage 4 - Extraction (C4)
required_skills:
- extraction-form
- bias-assessor
produces:
- papers/extraction_table.csv
Notes:
- Extraction is schema-driven: do not write narrative synthesis here; keep all fields consistent and fill bias fields (low/unclear/high + notes).
## Stage 5 - Synthesis (C5) [PROSE ALLOWED]
required_skills:
- synthesis-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/SYNTHESIS.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
FILE:pipelines/tutorial.pipeline.md
---
name: tutorial
version: 2.3
profile: tutorial
routing_hints: [tutorial, 教程]
routing_priority: 30
target_artifacts:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
- output/TUTORIAL_SPEC.md
- outline/concept_graph.yml
- outline/module_plan.yml
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/QUALITY_GATE.md
- output/RUN_ERRORS.md
- output/CONTRACT_REPORT.md
default_checkpoints: [C0,C1,C2,C3]
units_template: templates/UNITS.tutorial.csv
---
# Pipeline: tutorial (teaching loop)
Goal: a tutorial deliverable (`output/TUTORIAL.md`) that has a consistent running example and a real teaching loop (objectives -> steps -> exercises -> verification), not just a blog-style explanation.
## Stage 0 - Init (C0)
required_skills:
- workspace-init
- pipeline-router
produces:
- STATUS.md
- UNITS.csv
- CHECKPOINTS.md
- DECISIONS.md
- GOAL.md
- queries.md
## Stage 1 - Spec (C1)
required_skills:
- tutorial-spec
produces:
- output/TUTORIAL_SPEC.md
Notes:
- The spec is the scope contract: audience, prerequisites, learning objectives, and a running example (if applicable).
## Stage 2 - Structure (C2) [NO PROSE]
required_skills:
- concept-graph
- module-planner
- exercise-builder
produces:
- outline/concept_graph.yml
- outline/module_plan.yml
human_checkpoint:
- approve: target audience + scope + running example
- write_to: DECISIONS.md
Notes:
- Treat `outline/module_plan.yml` as the execution contract for writing: modules must have concrete outputs and at least one verifiable exercise each.
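
The contract above (concrete outputs and at least one verifiable exercise per module) can be checked once the plan has been parsed into a dictionary. The schema used here is hypothetical: the keys `modules`, `id`, `outputs`, `exercises`, and `verification` are assumptions for illustration, and the actual layout of `outline/module_plan.yml` may differ.

```python
def check_module_plan(plan: dict) -> list[str]:
    """Return one message per contract violation in an already-parsed module plan."""
    problems: list[str] = []
    for module in plan.get("modules", []):
        name = module.get("id", "<unnamed>")
        # A module needs at least one concrete output.
        if not module.get("outputs"):
            problems.append(f"{name}: no concrete outputs")
        # An exercise counts as verifiable only if it states how to verify it.
        exercises = module.get("exercises") or []
        if not any(exercise.get("verification") for exercise in exercises):
            problems.append(f"{name}: no verifiable exercise")
    return problems
```

Running this before Stage 3 would surface modules that are not yet ready to be written.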
## Stage 3 - Writing (C3) [PROSE ALLOWED]
required_skills:
- tutorial-module-writer
- deliverable-selfloop
- artifact-contract-auditor
produces:
- output/TUTORIAL.md
- output/DELIVERABLE_SELFLOOP_TODO.md
- output/CONTRACT_REPORT.md
Notes:
- Write only within the approved scope; if you discover missing prerequisites, record them as a follow-up instead of expanding scope silently.
FILE:scripts/run.py
from __future__ import annotations
import argparse
import json
import sys
import time
import urllib.request
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Any


def _read_ids(path: Path) -> list[str]:
    ids: list[str] = []
    if not path.exists():
        raise SystemExit(f"Missing id list: {path}")
    for raw in path.read_text(encoding="utf-8", errors="ignore").splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        line = line.replace("https://arxiv.org/abs/", "").replace("http://arxiv.org/abs/", "")
        line = line.replace("https://arxiv.org/pdf/", "").replace("http://arxiv.org/pdf/", "")
        line = line.removesuffix(".pdf")
        ids.append(line)
    # de-dupe preserving order
    out: list[str] = []
    seen: set[str] = set()
    for x in ids:
        if x in seen:
            continue
        seen.add(x)
        out.append(x)
    return out


def _fetch_arxiv_meta(arxiv_id: str) -> dict[str, Any]:
    url = f"https://export.arxiv.org/api/query?id_list={arxiv_id}"
    req = urllib.request.Request(url, headers={"User-Agent": "codex-agent-survey-corpus/1.0"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        data = resp.read()
    ns = {"a": "http://www.w3.org/2005/Atom"}
    root = ET.fromstring(data)
    entry = root.find("a:entry", ns)
    if entry is None:
        return {}
    title = (entry.findtext("a:title", default="", namespaces=ns) or "").strip().replace("\n", " ")
    published = (entry.findtext("a:published", default="", namespaces=ns) or "").strip()
    updated = (entry.findtext("a:updated", default="", namespaces=ns) or "").strip()
    abs_url = (entry.findtext("a:id", default="", namespaces=ns) or "").strip()
    return {
        "arxiv_id": arxiv_id,
        "title": title,
        "published": published,
        "updated": updated,
        "abs_url": abs_url,
    }


def _download(url: str, dst: Path, *, overwrite: bool, sleep_s: float) -> None:
    if dst.exists() and not overwrite:
        return
    dst.parent.mkdir(parents=True, exist_ok=True)
    req = urllib.request.Request(url, headers={"User-Agent": "codex-agent-survey-corpus/1.0"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        dst.write_bytes(resp.read())
    if sleep_s:
        time.sleep(max(0.0, float(sleep_s)))


def _extract_text(pdf_path: Path, *, max_pages: int) -> tuple[int, str]:
    import re
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    pages = min(len(doc), max(1, int(max_pages)))
    chunks: list[str] = []
    for i in range(pages):
        page = doc.load_page(i)
        chunks.append(page.get_text("text"))
    doc.close()
    text = "\n".join(chunks)
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n\s*\n+", "\n\n", text)
    return pages, text.strip() + "\n"


def _extract_section_headings(text: str) -> list[str]:
    """Best-effort extraction of top-level section headings from PDF text.

    Heuristics:
    - Keep only numeric top-level headings (1..15).
    - Prefer headings that look like paper sections (avoid tables/lists).
    - Deduplicate by section number.

    This is intentionally approximate; PDF text extraction is noisy.
    """
    import re

    allow_single = {
        "abstract",
        "introduction",
        "background",
        "preliminaries",
        "overview",
        "methodology",
        "methods",
        "taxonomy",
        "evaluation",
        "benchmarks",
        "benchmark",
        "datasets",
        "experiments",
        "results",
        "applications",
        "challenges",
        "discussion",
        "conclusion",
        "references",
        "appendix",
        "planning",
        "memory",
        "techniques",
        "scenarios",
        "safety",
        "security",
    }
    lines = [ln.strip() for ln in (text or "").splitlines() if ln.strip()]
    candidates: list[tuple[int, str]] = []

    def _maybe_add(num: int, title: str) -> None:
        title = re.sub(r"\s+", " ", (title or "").strip())
        if not title:
            return
        if num < 1 or num > 15:
            return
        if len(title) < 3 or len(title) > 80:
            return
        low = title.lower()
        if low.startswith(("table", "figure", "fig.", "fig ")):
            return
        if title.endswith("-"):
            return
        if len(re.findall(r"\d", title)) > 3:
            return
        if title.count(",") >= 3:
            return
        if "http" in low:
            return
        # Single-token headings are common, but avoid false positives like country names.
        words = [w for w in re.split(r"\s+", title) if w]
        if len(words) == 1:
            w = words[0]
            if not w.isalpha():
                return
            if w.lower() not in allow_single:
                return
        candidates.append((num, title))

    i = 0
    while i < len(lines):
        ln = lines[i]
        # Pattern: "1" then "Introduction" on the next line.
        if re.fullmatch(r"\d{1,2}", ln) and i + 1 < len(lines):
            nxt = lines[i + 1]
            if re.fullmatch(r"[A-Z][A-Za-z0-9 \-,:/()]{2,80}", nxt):
                _maybe_add(int(ln), nxt)
                i += 2
                continue
        m = re.match(r"^(\d{1,2})\s+([A-Z][A-Za-z0-9 \-,:/()]{2,80})$", ln)
        if m:
            _maybe_add(int(m.group(1)), m.group(2))
        i += 1
    # Deduplicate by number (keep first occurrence).
    by_num: dict[int, str] = {}
    for num, title in candidates:
        by_num.setdefault(num, title)
    return [f"{n} {by_num[n]}" for n in sorted(by_num)]


def _build_style_report(records: list[dict[str, object]], *, workspace: Path, out_dir: Path) -> str:
    import statistics

    rows = []
    counts = []
    for rec in records:
        arxiv_id = str(rec.get("arxiv_id") or "").strip()
        status = str(rec.get("status") or "").strip()
        pages = int(rec.get("pages_extracted") or 0)
        text_path = workspace / str(rec.get("text_path") or "")
        headings: list[str] = []
        if status == "ok" and text_path.exists() and text_path.stat().st_size > 0:
            headings = _extract_section_headings(text_path.read_text(encoding="utf-8", errors="ignore"))
            counts.append(len(headings))
        sample = "; ".join(headings[:8])
        if len(headings) > 8:
            sample += " …"
        rows.append((arxiv_id, pages, len(headings), sample))
    ok = sum(1 for r in records if str(r.get("status") or "").strip() == "ok")
    total = len(records)
    lines = [
        "# Agent survey style report (auto)",
        "",
        f"- Papers: {total} (ok={ok}, error={total - ok})",
        "",
    ]
    if counts:
        lines.append("## Detected top-level section counts")
        lines.append("")
        lines.append(f"- min/median/max: {min(counts)}/{statistics.median(counts)}/{max(counts)}")
        lines.append("")
    lines.extend([
        "## Per-paper detected headings (best-effort)",
        "",
        "| arXiv | pages | #sections | headings (first 8) |",
        "|---|---:|---:|---|",
    ])
    for arxiv_id, pages, nsec, sample in rows:
        lines.append(f"| {arxiv_id} | {pages} | {nsec} | {sample} |")
    lines.append("")
    lines.append("## How to use (for pipeline tuning)")
    lines.append("")
    lines.append("- Target paper-like structure: ~6–8 top-level sections with fewer, thicker subsections.")
    lines.append("- Front matter (Intro/Related Work) typically has higher citation density than a single H3 subsection.")
    lines.append("- Use this report to sanity-check whether your generated outline and section sizing resembles real surveys.")
    lines.append("")
    return "\n".join(lines)
def main() -> int:
parser = argparse.ArgumentParser()
parser.add_argument("--workspace", required=True)
parser.add_argument("--unit-id", default="")
parser.add_argument("--inputs", default="")
parser.add_argument("--outputs", default="")
parser.add_argument("--checkpoint", default="")
parser.add_argument("--max-pages", type=int, default=20)
parser.add_argument("--sleep", type=float, default=1.0)
parser.add_argument("--overwrite", action="store_true")
args = parser.parse_args()
workspace = Path(args.workspace).resolve()
ids_rel = "ref/agent-surveys/arxiv_ids.txt"
if str(args.inputs).strip():
# first token wins (semicolon-separated)
ids_rel = str(args.inputs).split(";", 1)[0].strip() or ids_rel
ids_path = workspace / ids_rel
arxiv_ids = _read_ids(ids_path)
if not arxiv_ids:
raise SystemExit(f"No arXiv ids found in {ids_path}")
out_dir = workspace / "ref" / "agent-surveys"
pdf_dir = out_dir / "pdfs"
text_dir = out_dir / "text"
index_path = out_dir / "index.jsonl"
out_dir.mkdir(parents=True, exist_ok=True)
pdf_dir.mkdir(parents=True, exist_ok=True)
text_dir.mkdir(parents=True, exist_ok=True)
records: list[dict[str, Any]] = []
for arxiv_id in arxiv_ids:
rec: dict[str, Any] = {
"arxiv_id": arxiv_id,
"pdf_url": f"https://arxiv.org/pdf/{arxiv_id}.pdf",
"pdf_path": str((pdf_dir / f"{arxiv_id}.pdf").relative_to(workspace)),
"text_path": str((text_dir / f"{arxiv_id}.txt").relative_to(workspace)),
"max_pages": int(args.max_pages),
"status": "",
"error": "",
"pages_extracted": 0,
}
try:
rec.update(_fetch_arxiv_meta(arxiv_id))
except Exception as exc:
# Metadata is best-effort.
rec["meta_error"] = f"{type(exc).__name__}: {exc}"
pdf_path = workspace / rec["pdf_path"]
text_path = workspace / rec["text_path"]
try:
_download(rec["pdf_url"], pdf_path, overwrite=bool(args.overwrite), sleep_s=float(args.sleep))
pages, text = _extract_text(pdf_path, max_pages=int(args.max_pages))
rec["pages_extracted"] = int(pages)
if bool(args.overwrite) or not text_path.exists():
text_path.write_text(text, encoding="utf-8")
rec["status"] = "ok"
except Exception as exc:
rec["status"] = "error"
rec["error"] = f"{type(exc).__name__}: {exc}"
records.append(rec)
index_path.write_text("\n".join(json.dumps(r, ensure_ascii=False) for r in records).rstrip() + "\n", encoding="utf-8")
print(f"Wrote {index_path.relative_to(workspace)}")
try:
report = _build_style_report(records, workspace=workspace, out_dir=out_dir)
report_path = out_dir / "STYLE_REPORT.md"
report_path.write_text(report, encoding="utf-8")
print(f"Wrote {report_path.relative_to(workspace)}")
except Exception as exc:
print(f"Style report skipped: {type(exc).__name__}: {exc}", file=sys.stderr)
# Always return 0: this is a manual helper (not a pipeline unit); per-paper failures are recorded in index.jsonl.
return 0
if __name__ == "__main__":
raise SystemExit(main())
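The heading detector in the script above accepts two layouts found in extracted PDF text: a bare section number on its own line with the title on the next line, and a number and title on the same line. The following is a minimal sketch of only those two patterns, assuming the same regular expressions as the script. `split_numbered_headings` is a hypothetical name, and the filtering done by `_maybe_add` (length limits, single-word allow-list, deduplication) is deliberately left out.

```python
import re

# Same title character class the script uses for both layouts.
TITLE = r"[A-Z][A-Za-z0-9 \-,:/()]{2,80}"


def split_numbered_headings(text: str) -> list[tuple[int, str]]:
    """Return (number, title) pairs for both heading layouts (illustrative only)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    found: list[tuple[int, str]] = []
    i = 0
    while i < len(lines):
        ln = lines[i]
        # Layout A: "1" on one line, "Introduction" on the next.
        if (
            re.fullmatch(r"\d{1,2}", ln)
            and i + 1 < len(lines)
            and re.fullmatch(TITLE, lines[i + 1])
        ):
            found.append((int(ln), lines[i + 1]))
            i += 2
            continue
        # Layout B: "2 Background" on a single line.
        m = re.match(rf"^(\d{{1,2}})\s+({TITLE})$", ln)
        if m:
            found.append((int(m.group(1)), m.group(2)))
        i += 1
    return found


sample = "1\nIntroduction\nsome body text.\n2 Background\n"
print(split_numbered_headings(sample))
```

Body lines that start with a lowercase letter match neither pattern, which is why the title class begins with `[A-Z]`.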
FILE:tooling/__init__.py
FILE:tooling/common.py
from __future__ import annotations
import csv
import json
import os
import re
import shutil
import tempfile
from dataclasses import dataclass
from datetime import date, datetime
from pathlib import Path
from typing import Any, Iterable
import yaml
def today_iso() -> str:
return date.today().isoformat()
def now_iso_seconds() -> str:
return datetime.now().replace(microsecond=0).isoformat()
def ensure_dir(path: Path) -> None:
path.mkdir(parents=True, exist_ok=True)
def atomic_write_text(path: Path, content: str) -> None:
ensure_dir(path.parent)
fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
try:
with os.fdopen(fd, "w", encoding="utf-8") as handle:
handle.write(content)
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except OSError:
pass
raise
def backup_existing(path: Path) -> Path:
"""Rename an existing file to a timestamped `.bak.*` sibling and return the backup path (or `path` unchanged if it does not exist)."""
if not path.exists():
return path
stamp = datetime.now().replace(microsecond=0).isoformat().replace("-", "").replace(":", "")
backup = path.with_name(f"{path.name}.bak.{stamp}")
counter = 1
while backup.exists():
backup = path.with_name(f"{path.name}.bak.{stamp}.{counter}")
counter += 1
path.replace(backup)
return backup
def parse_semicolon_list(value: str | None) -> list[str]:
if not value:
return []
return [item.strip() for item in value.split(";") if item.strip()]
def normalize_title_for_dedupe(title: str) -> str:
title = title.lower()
title = re.sub(r"[^a-z0-9]+", " ", title)
title = re.sub(r"\s+", " ", title).strip()
return title
def normalize_axis_label(text: str) -> str:
text = re.sub(r"\s+", " ", (text or "").strip().lower())
text = text.rstrip(" .;:,;。")
text = re.sub(r"\s*/\s*", " ", text)
text = re.sub(r"[^a-z0-9]+", " ", text)
return text.strip()
def subsection_brief_generic_axis_norms() -> set[str]:
"""Axis labels that strict survey gates treat as scaffold-level defaults.
Keep this aligned across brief generation and quality-gate checks so the
generator does not promote an axis that the gate later classifies as generic.
"""
axes = {
"core mechanism and system architecture",
"training and data setup",
"evaluation protocol",
"evaluation protocol (benchmarks / metrics / human)",
"evaluation protocol (datasets / metrics / human)",
"evaluation protocol (datasets, metrics, human evaluation)",
"compute and efficiency",
"compute and latency constraints",
"efficiency and compute",
"tool interface contract (schemas / protocols)",
"tool selection / routing policy",
"sandboxing / permissions / observability",
"failure modes and limitations",
}
return {normalize_axis_label(axis) for axis in axes}
def read_jsonl(path: Path) -> list[dict[str, Any]]:
records: list[dict[str, Any]] = []
if not path.exists():
return records
with path.open("r", encoding="utf-8") as handle:
for line in handle:
line = line.strip()
if not line:
continue
records.append(json.loads(line))
return records
def write_jsonl(path: Path, records: Iterable[dict[str, Any]]) -> None:
ensure_dir(path.parent)
lines = [json.dumps(record, ensure_ascii=False) for record in records]
atomic_write_text(path, "\n".join(lines) + ("\n" if lines else ""))
def read_tsv(path: Path) -> list[dict[str, str]]:
if not path.exists():
return []
with path.open("r", encoding="utf-8", newline="") as handle:
reader = csv.DictReader(handle, delimiter="\t")
return [dict(row) for row in reader]
def write_tsv(path: Path, rows: list[dict[str, Any]], fieldnames: list[str]) -> None:
ensure_dir(path.parent)
fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
try:
with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
writer.writeheader()
for row in rows:
writer.writerow({key: row.get(key, "") for key in fieldnames})
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except OSError:
pass
raise
@dataclass(frozen=True)
class UnitsTable:
fieldnames: list[str]
rows: list[dict[str, str]]
@staticmethod
def load(path: Path) -> "UnitsTable":
with path.open("r", encoding="utf-8", newline="") as handle:
reader = csv.DictReader(handle)
fieldnames = list(reader.fieldnames or [])
rows = [dict(row) for row in reader]
return UnitsTable(fieldnames=fieldnames, rows=rows)
def save(self, path: Path) -> None:
ensure_dir(path.parent)
fd, tmp_path = tempfile.mkstemp(prefix=path.name, dir=str(path.parent))
try:
with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
writer = csv.DictWriter(handle, fieldnames=self.fieldnames)
writer.writeheader()
for row in self.rows:
writer.writerow({key: row.get(key, "") for key in self.fieldnames})
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except OSError:
pass
raise
def load_yaml(path: Path) -> Any:
with path.open("r", encoding="utf-8") as handle:
return yaml.safe_load(handle)
def dump_yaml(path: Path, data: Any) -> None:
ensure_dir(path.parent)
text = yaml.safe_dump(
data,
allow_unicode=True,
sort_keys=False,
default_flow_style=False,
width=120,
)
atomic_write_text(path, text)
def copy_tree(src_dir: Path, dst_dir: Path, *, overwrite: bool) -> None:
if not src_dir.is_dir():
raise ValueError(f"Template directory not found: {src_dir}")
ensure_dir(dst_dir)
for src_path in src_dir.rglob("*"):
rel = src_path.relative_to(src_dir)
dst_path = dst_dir / rel
if src_path.is_dir():
ensure_dir(dst_path)
continue
ensure_dir(dst_path.parent)
if dst_path.exists() and not overwrite:
continue
shutil.copy2(src_path, dst_path)
def tokenize(text: str) -> list[str]:
text = text.lower()
text = re.sub(r"[^a-z0-9]+", " ", text)
return [token for token in text.split() if token]
_EN_STOPWORDS = {
"a",
"an",
"and",
"are",
"as",
"at",
"be",
"by",
"for",
"from",
"in",
"is",
"it",
"of",
"on",
"or",
"that",
"the",
"this",
"to",
"with",
"we",
"our",
"via",
"towards",
"toward",
"using",
"use",
"based",
"new",
"into",
"over",
"under",
"between",
"within",
"without",
"beyond",
}
_GENERIC_PAPER_WORDS = {
"survey",
"review",
"tutorial",
"paper",
"approach",
"method",
"methods",
"model",
"models",
"framework",
"frameworks",
"system",
"systems",
"learning",
"deep",
"neural",
"network",
"networks",
"analysis",
"benchmark",
"benchmarks",
"dataset",
"datasets",
"evaluation",
"evaluating",
"towards",
"using",
"based",
"study",
"studies",
}
def candidate_keywords(titles: Iterable[str], *, top_k: int, min_freq: int) -> list[str]:
freq: dict[str, int] = {}
for title in titles:
for token in tokenize(title):
if token in _EN_STOPWORDS or token in _GENERIC_PAPER_WORDS:
continue
if len(token) < 3:
continue
freq[token] = freq.get(token, 0) + 1
candidates = [t for t, c in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])) if c >= min_freq]
return candidates[:top_k]
def update_status_log(status_path: Path, line: str) -> None:
ensure_dir(status_path.parent)
if status_path.exists():
existing = status_path.read_text(encoding="utf-8")
else:
existing = "# Status\n"
if "## Run log" not in existing:
existing = existing.rstrip() + "\n\n## Run log\n"
updated = existing.rstrip() + f"\n- {line}\n"
atomic_write_text(status_path, updated)
def update_status_field(status_path: Path, heading: str, value: str) -> None:
heading_line = f"## {heading}".strip()
bullet_line = f"- `{value}`"
if status_path.exists():
lines = status_path.read_text(encoding="utf-8").splitlines()
else:
lines = ["# Status"]
out: list[str] = []
i = 0
updated = False
while i < len(lines):
line = lines[i]
out.append(line)
if line.strip() == heading_line:
if i + 1 < len(lines) and lines[i + 1].lstrip().startswith("-"):
out.append(bullet_line)
i += 2
updated = True
continue
out.append(bullet_line)
updated = True
i += 1
if not updated:
out.extend(["", heading_line, bullet_line])
atomic_write_text(status_path, "\n".join(out).rstrip() + "\n")
def decisions_has_approval(decisions_path: Path, checkpoint: str) -> bool:
if not checkpoint:
return False
if not decisions_path.exists():
return False
text = decisions_path.read_text(encoding="utf-8")
pattern = rf"^\s*-\s*\[[xX]\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b"
return re.search(pattern, text, flags=re.MULTILINE) is not None
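`decisions_has_approval` treats a checkpoint as approved only when a checked Markdown checkbox names it, with an optional `Approve` prefix and a word boundary after the checkpoint id. A small self-contained sketch of that matching rule, assuming the same pattern; `approved` is a hypothetical stand-in that takes the file text directly instead of a path:

```python
import re


def approved(decisions_text: str, checkpoint: str) -> bool:
    """Illustrative copy of the checkbox test, operating on text instead of a file."""
    pattern = rf"^\s*-\s*\[[xX]\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b"
    return re.search(pattern, decisions_text, flags=re.MULTILINE) is not None


text = (
    "## Approvals (check to unblock)\n"
    "- [x] Approve C1 (retrieval + core set)\n"
    "- [ ] Approve C2 (scope + outline)\n"
)
print(approved(text, "C1"), approved(text, "C2"))
```

The trailing `\b` keeps a checked `C20` line from approving `C2`, and the unchecked `[ ]` box never matches because the character class only admits `x` or `X`.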
def ensure_decisions_approval_checklist(decisions_path: Path) -> None:
if decisions_path.exists():
text = decisions_path.read_text(encoding="utf-8")
else:
text = "# Decisions log\n"
if re.search(r"^##\s+Approvals\b", text, flags=re.MULTILINE):
return
workspace = decisions_path.parent
checkpoints = _human_checkpoints_from_units(workspace)
if not checkpoints:
return
checklist_lines = ["## Approvals (check to unblock)"]
for checkpoint in checkpoints:
hint = _approval_hint(checkpoint)
suffix = f" ({hint})" if hint else ""
checklist_lines.append(f"- [ ] Approve {checkpoint}{suffix}")
checklist_lines.append("")
checklist = "\n".join(checklist_lines)
lines = text.splitlines()
if lines and lines[0].startswith("#"):
new_text = "\n".join([lines[0], "", checklist] + lines[1:]).rstrip() + "\n"
else:
new_text = (checklist + "\n" + text).rstrip() + "\n"
atomic_write_text(decisions_path, new_text)
def set_decisions_approval(decisions_path: Path, checkpoint: str, *, approved: bool) -> None:
checkpoint = checkpoint.strip()
if not checkpoint:
raise ValueError("checkpoint must be non-empty")
ensure_decisions_approval_checklist(decisions_path)
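# The checklist helper can return without creating the file (e.g. no HUMAN checkpoints in UNITS.csv).
text = decisions_path.read_text(encoding="utf-8") if decisions_path.exists() else "# Decisions log\n"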
lines = text.splitlines()
pattern = re.compile(rf"^\s*-\s*\[\s*[xX ]\s*\]\s*(?:Approve\s*)?{re.escape(checkpoint)}\b")
updated = False
for idx, line in enumerate(lines):
if pattern.search(line):
lines[idx] = re.sub(
r"\[\s*[xX ]\s*\]",
"[x]" if approved else "[ ]",
line,
count=1,
)
updated = True
break
if not updated:
insert_at = None
for idx, line in enumerate(lines):
if line.strip().startswith("## Approvals"):
insert_at = idx + 1
break
if insert_at is None:
lines.append("")
lines.append("## Approvals (check to unblock)")
insert_at = len(lines)
lines.insert(insert_at, f"- [{'x' if approved else ' '}] Approve {checkpoint}")
atomic_write_text(decisions_path, "\n".join(lines).rstrip() + "\n")
def _human_checkpoints_from_units(workspace: Path) -> list[str]:
units_path = workspace / "UNITS.csv"
if not units_path.exists():
return []
try:
table = UnitsTable.load(units_path)
except Exception:
return []
seen: set[str] = set()
out: list[str] = []
for row in table.rows:
owner = (row.get("owner") or "").strip().upper()
if owner != "HUMAN":
continue
checkpoint = (row.get("checkpoint") or "").strip()
if checkpoint and checkpoint not in seen:
seen.add(checkpoint)
out.append(checkpoint)
return out
def _approval_hint(checkpoint: str) -> str:
hints = {
"C0": "kickoff: scope/sources/time window/constraints",
"C1": "retrieval + core set",
"C2": "scope + outline",
"C3": "evidence ready",
"C4": "citations verified",
"C5": "allow prose writing",
}
return hints.get(checkpoint, "")
def upsert_checkpoint_block(decisions_path: Path, checkpoint: str, markdown_block: str) -> None:
begin = f"<!-- BEGIN CHECKPOINT:{checkpoint} -->"
end = f"<!-- END CHECKPOINT:{checkpoint} -->"
block = "\n".join([begin, markdown_block.rstrip(), end, ""]).rstrip() + "\n"
if decisions_path.exists():
text = decisions_path.read_text(encoding="utf-8")
else:
text = "# Decisions log\n\n"
ensure_decisions_approval_checklist(decisions_path)
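# Re-read only if the checklist helper actually wrote the file; otherwise keep the text loaded above.
text = decisions_path.read_text(encoding="utf-8") if decisions_path.exists() else text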
pattern = re.compile(
rf"{re.escape(begin)}.*?{re.escape(end)}\n?",
flags=re.DOTALL,
)
if pattern.search(text):
new_text = pattern.sub(block, text)
else:
new_text = text.rstrip() + "\n\n" + block
atomic_write_text(decisions_path, new_text)
def seed_queries_from_topic(queries_path: Path, topic: str) -> None:
topic = topic.strip()
if not topic:
return
if queries_path.exists():
lines = queries_path.read_text(encoding="utf-8").splitlines()
else:
lines = [
"# Queries",
"",
"## Primary query",
"- keywords:",
" - \"\"",
"- exclude:",
" - \"\"",
"- max_results: \"\"",
"- core_size: \"\"",
"- time window:",
" - from: \"\"",
" - to: \"\"",
"",
"## Notes",
"-",
]
def _has_nonempty_values(token: str) -> bool:
in_block = False
for raw in lines:
stripped = raw.strip()
if stripped.startswith(f"- {token}:"):
in_block = True
continue
if not in_block:
continue
if raw.startswith(" - "):
value = stripped[2:].strip().strip('"').strip("'")
if value:
return True
continue
if stripped.startswith("- "):
break
return False
def _has_nonempty_scalar(token: str) -> bool:
for raw in lines:
stripped = raw.strip()
if not stripped.startswith(f"- {token}:"):
continue
value = stripped.split(":", 1)[1].split("#", 1)[0].strip().strip('"').strip("'")
return bool(value)
return False
def _has_nonempty_time_field(field: str) -> bool:
# Looks for lines like: ' - from: "2022"'
for raw in lines:
stripped = raw.strip()
if not stripped.startswith(f"- {field}:"):
continue
value = stripped.split(":", 1)[1].strip().strip('"').strip("'")
return bool(value)
return False
has_keywords = _has_nonempty_values("keywords")
has_excludes = _has_nonempty_values("exclude")
has_time_from = _has_nonempty_time_field("from")
has_time_to = _has_nonempty_time_field("to")
has_max_results = _has_nonempty_scalar("max_results")
has_core_size = _has_nonempty_scalar("core_size")
workspace = queries_path.parent
profile = pipeline_profile(workspace)
query_defaults = pipeline_query_defaults(workspace)
raw_tlow = topic.lower()
topic_for_queries = _sanitize_topic_for_query_seed(topic)
keyword_suggestions = [topic_for_queries]
tlow = topic_for_queries.lower()
is_agent = any(t in tlow for t in ("agent", "agents", "agentic"))
is_embodied = any(
t in tlow
for t in (
"embodied ai",
"embodied intelligence",
"embodied agent",
"embodied robotics",
"robot foundation model",
"robot learning",
"robot manipulation",
"vision-language-action",
"vla",
"generalist robot",
)
)
is_text_to_image = any(t in tlow for t in ("text-to-image", "text to image", "t2i"))
is_text_to_video = any(t in tlow for t in ("text-to-video", "text to video", "t2v"))
is_diffusion = "diffusion" in tlow
is_generative = is_text_to_image or is_text_to_video or is_diffusion or ("image generation" in tlow) or ("generative" in tlow)
if is_agent:
keyword_suggestions.extend(
[
"LLM agent",
"language model agent",
"tool use",
"function calling",
"tool-using agent",
"planning",
"memory",
"multi-agent",
"benchmark",
"safety",
]
)
exclude_suggestions: list[str] = []
if is_agent:
exclude_suggestions.append("agent-based modeling")
exclude_suggestions.extend(["react hooks", "perovskite", "banach", "coxeter"])
if is_embodied:
keyword_suggestions.extend(
[
"embodied AI survey",
"embodied AI review",
"embodied intelligence survey",
"embodied agent survey",
"robot foundation model survey",
"robot learning survey",
"robot manipulation survey",
"embodied robotics survey",
"vision-language-action survey",
"vision-language-action model",
"robot foundation model",
"generalist robot policy",
"world model robot",
]
)
stripped_output_terms = topic_for_queries.lower().strip() != raw_tlow.strip()
if stripped_output_terms and any(t in raw_tlow for t in ("latex", "pdf", "markdown", "typesetting")):
exclude_suggestions.extend(["latex", "pdf", "typesetting", "document layout"])
if is_generative:
keyword_suggestions.extend(
[
"text-to-image generation",
"text-guided image generation",
"diffusion model",
"denoising diffusion probabilistic model",
"latent diffusion",
"stable diffusion",
"classifier-free guidance",
"diffusion transformer",
"DiT",
"masked generative transformer",
"MaskGIT",
"autoregressive image generation",
"VQGAN",
"VQ-VAE",
"ControlNet",
"DreamBooth",
"textual inversion",
"LoRA fine-tuning",
]
)
default_max_results = query_defaults.get("max_results")
max_results_suggestion = str(default_max_results) if str(default_max_results or "").strip() else (
str(1800 if profile == "arxiv-survey" else (800 if (is_agent or is_generative or is_embodied) else 300))
)
time_from_suggestion = (
"2018" if is_embodied
else ("2022" if (is_agent and ("llm" in tlow or "language model" in tlow)) else ("2020" if is_generative else ""))
)
core_size_suggestion = str(query_defaults.get("core_size") or "").strip() or ("300" if profile == "arxiv-survey" else "")
out: list[str] = []
i = 0
while i < len(lines):
line = lines[i]
stripped = line.strip()
if stripped.startswith("- keywords:") and not has_keywords:
out.append(line)
i += 1
while i < len(lines) and lines[i].startswith(" - "):
i += 1
for kw in _dedupe_preserve_order(keyword_suggestions)[:14]:
out.append(f" - \"{kw}\"")
continue
if stripped.startswith("- exclude:") and not has_excludes:
out.append(line)
i += 1
while i < len(lines) and lines[i].startswith(" - "):
i += 1
for ex in _dedupe_preserve_order(exclude_suggestions)[:10]:
out.append(f" - \"{ex}\"")
continue
if stripped.startswith("- max_results:") and not has_max_results and max_results_suggestion:
out.append(f"- max_results: \"{max_results_suggestion}\"")
i += 1
continue
if stripped.startswith("- core_size:") and not has_core_size and core_size_suggestion:
out.append(f"- core_size: \"{core_size_suggestion}\"")
i += 1
continue
if stripped.startswith("- time window:") and not (has_time_from or has_time_to) and time_from_suggestion:
out.append(line)
i += 1
# Skip existing from/to lines if present.
while i < len(lines) and lines[i].startswith(" -"):
i += 1
out.append(f" - from: \"{time_from_suggestion}\"")
out.append(" - to: \"\"")
continue
out.append(line)
i += 1
out = _materialize_missing_query_defaults(out, query_defaults, allowed_fields=pipeline_overridable_query_fields(workspace))
atomic_write_text(queries_path, "\n".join(out).rstrip() + "\n")
def _dedupe_preserve_order(items: list[str]) -> list[str]:
seen: set[str] = set()
out: list[str] = []
for item in items:
item = item.strip()
if not item or item in seen:
continue
seen.add(item)
out.append(item)
return out
def _sanitize_topic_for_query_seed(topic: str) -> str:
text = str(topic or "").strip()
if not text:
return ""
patterns = [
r"(?i)\bwith\s+latex\s*/\s*pdf\s+output\b",
r"(?i)\bwith\s+latex\s+output\b",
r"(?i)\bwith\s+pdf\s+output\b",
r"(?i)\bwith\s+markdown\s+output\b",
r"(?i)\blatex\s*/\s*pdf\s+output\b",
r"(?i)\bpdf\s+output\b",
r"(?i)\blatex\s+output\b",
r"(?i)\bmarkdown\s+output\b",
r"(?i)\bfor\s+latex\s*/\s*pdf\b",
]
for pattern in patterns:
text = re.sub(pattern, "", text)
text = re.sub(r"\s+", " ", text).strip(" ,;:-")
return text or topic
_LEGACY_PIPELINE_ALIASES = {
"idea-finder": "idea-brainstorm",
"idea-finder.pipeline.md": "idea-brainstorm",
"pipelines/idea-finder.pipeline.md": "idea-brainstorm",
}
def find_repo_root(start: Path | None = None) -> Path:
"""Walk up from *start* (default: this file) looking for AGENTS.md."""
candidate = (start or Path(__file__)).resolve()
for _ in range(10):
if (candidate / "AGENTS.md").exists():
return candidate
parent = candidate.parent
if parent == candidate:
break
candidate = parent
raise FileNotFoundError("Could not find repo root (AGENTS.md marker)")
def _normalize_pipeline_lock_value(value: str) -> str:
raw = str(value or "").strip()
return _LEGACY_PIPELINE_ALIASES.get(raw, raw)
def resolve_pipeline_spec_path(*, repo_root: Path, pipeline_value: str) -> Path | None:
value = _normalize_pipeline_lock_value(pipeline_value)
if not value:
return None
candidate = Path(value)
if candidate.is_absolute() and candidate.exists():
return candidate.resolve()
rel_candidate = repo_root / value
if rel_candidate.exists():
return rel_candidate.resolve()
filename = Path(value).name
if filename:
direct = repo_root / "pipelines" / filename
if direct.exists():
return direct.resolve()
stem = filename
if stem.endswith(".pipeline.md"):
stem = stem[: -len(".pipeline.md")]
if stem:
direct = repo_root / "pipelines" / f"{stem}.pipeline.md"
if direct.exists():
return direct.resolve()
return None
def load_workspace_pipeline_spec(workspace: Path):
from tooling.pipeline_spec import PipelineSpec
try:
repo_root = find_repo_root(workspace)
except FileNotFoundError:
repo_root = Path(__file__).resolve().parents[1]
lock_path = workspace / "PIPELINE.lock.md"
if not lock_path.exists():
return None
pipeline_name = ""
try:
for raw in lock_path.read_text(encoding="utf-8", errors="ignore").splitlines():
line = raw.strip()
if line.startswith("pipeline:"):
pipeline_name = line.split(":", 1)[1].strip()
break
except Exception:
return None
if not pipeline_name:
return None
spec_path = resolve_pipeline_spec_path(repo_root=repo_root, pipeline_value=pipeline_name)
if spec_path is None:
return None
try:
return PipelineSpec.load(spec_path)
except Exception:
return None
def pipeline_query_defaults(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
return dict(spec.query_defaults) if spec is not None else {}
def pipeline_quality_contract(workspace: Path) -> dict[str, Any]:
spec = load_workspace_pipeline_spec(workspace)
return dict(spec.quality_contract) if spec is not None else {}
def pipeline_quality_contract_value(workspace: Path, *keys: str, default: Any = None) -> Any:
current: Any = pipeline_quality_contract(workspace)
for key in keys:
if not isinstance(current, dict):
return default
current = current.get(str(key))
if current is None:
return default
return current
def pipeline_query_default(workspace: Path, key: str, default: Any = None) -> Any:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return default
return spec.query_default(key, default)
def pipeline_overridable_query_fields(workspace: Path) -> set[str]:
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return set()
return set(spec.overridable_query_fields)
def pipeline_profile(workspace: Path) -> str:
"""Return the pipeline profile for a workspace.
Reads PIPELINE.lock.md to get the pipeline name, then loads the
pipeline spec file and reads its ``profile`` frontmatter field.
Falls back to ``"default"`` if anything is missing.
"""
spec = load_workspace_pipeline_spec(workspace)
if spec is None:
return "default"
return str(spec.profile or "default").strip() or "default"
def latest_outline_state(workspace: Path) -> dict[str, Any]:
path = Path(workspace).resolve() / "outline" / "outline_state.jsonl"
records = [rec for rec in read_jsonl(path) if isinstance(rec, dict)]
return dict(records[-1]) if records else {}
def _materialize_missing_query_defaults(lines: list[str], query_defaults: dict[str, Any], *, allowed_fields: set[str] | None = None) -> list[str]:
if not query_defaults:
return lines
existing_keys: set[str] = set()
for raw in lines:
stripped = raw.strip()
if not stripped.startswith("- ") or ":" not in stripped:
continue
key = stripped[2:].split(":", 1)[0].strip().lower().replace(" ", "_").replace("-", "_")
if key:
existing_keys.add(key)
additions: list[str] = []
for key, value in query_defaults.items():
norm_key = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
if not norm_key or norm_key in existing_keys:
continue
if allowed_fields and norm_key not in allowed_fields:
continue
rendered = _render_query_scalar(value)
if rendered is None:
continue
additions.append(f'- {norm_key}: "{rendered}"')
if not additions:
return lines
out = list(lines)
if out and out[-1].strip():
out.append("")
out.extend(additions)
return out
def _render_query_scalar(value: Any) -> str | None:
if value is None:
return None
if isinstance(value, bool):
return "true" if value else "false"
if isinstance(value, (int, float)):
return str(value)
if isinstance(value, str):
text = value.strip()
return text or None
return None
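`_render_query_scalar` checks `bool` before `int`/`float`. The order matters because `bool` is a subclass of `int` in Python, so a numeric branch placed first would render `True` as `"True"` instead of the lowercase `"true"` the function returns for booleans. A minimal sketch of that ordering, assuming the same branches; `render_scalar` is a hypothetical standalone copy:

```python
from typing import Any, Optional


def render_scalar(value: Any) -> Optional[str]:
    """Illustrative copy: render a query default as text, or None if unsupported."""
    if value is None:
        return None
    # bool must be tested first: isinstance(True, int) is True.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        text = value.strip()
        return text or None
    # Lists, dicts and other types are not rendered.
    return None


print(render_scalar(True), render_scalar(300), render_scalar("  "), render_scalar([1]))
```

Blank strings and non-scalar values both come back as `None`, which the caller uses to skip the field rather than write an empty default.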
FILE:tooling/executor.py
from __future__ import annotations
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path
from tooling.common import (
UnitsTable,
atomic_write_text,
decisions_has_approval,
ensure_dir,
latest_outline_state,
load_workspace_pipeline_spec,
now_iso_seconds,
parse_semicolon_list,
set_decisions_approval,
update_status_field,
update_status_log,
)
@dataclass(frozen=True)
class RunResult:
unit_id: str | None
status: str
message: str
def _section_first_cutover_block_message(*, workspace: Path, outputs: list[str]) -> str | None:
spec = load_workspace_pipeline_spec(workspace)
if spec is None or str(spec.structure_mode or "").strip().lower() != "section_first":
return None
if "outline/outline_state.jsonl" not in outputs:
return None
latest = latest_outline_state(workspace)
if not latest:
return "Section-first cutover is not actionable yet: `outline/outline_state.jsonl` is missing or empty. Rerun `outline-refiner` after fixing the chapter/section layer."
structure_phase = str(latest.get("structure_phase") or "").strip()
h3_status = str(latest.get("h3_status") or "").strip().lower()
reroute_target = str(latest.get("reroute_target") or "").strip()
retry_budget = str(latest.get("retry_budget_remaining") or "").strip()
reroute_reason = str(latest.get("reroute_reason") or "").strip()
if structure_phase.lower() == "decomposed" and h3_status == "stable":
return None
missing: list[str] = []
for key in ("structure_phase", "h3_status", "approval_status", "reroute_target", "retry_budget_remaining"):
if key not in latest:
missing.append(key)
continue
if key in {"structure_phase", "h3_status"} and not str(latest.get(key) or "").strip():
missing.append(key)
if missing:
return (
"Section-first cutover cannot advance because the latest `outline_state.jsonl` record is missing "
f"required fields: {', '.join(missing)}."
)
reroute_label = reroute_target or "unknown"
retry_label = retry_budget or "unknown"
reason_suffix = f" Reason: {reroute_reason}." if reroute_reason else ""
return (
"Section-first cutover is still blocked; "
f"latest outline state has structure_phase={structure_phase or 'missing'}, "
f"h3_status={h3_status or 'missing'}, reroute_target={reroute_label}, "
f"retry_budget_remaining={retry_label}.{reason_suffix} Fix the reroute target explicitly, then rerun this unit."
)
def _append_run_error(*, workspace: Path, unit_id: str, skill: str, kind: str, message: str, log_rel: str | None) -> None:
"""Append a short failure record to `output/RUN_ERRORS.md` (workspace-local).
This is a human-facing error sink that survives reruns and makes BLOCKED states debuggable.
"""
try:
ensure_dir(workspace / "output")
out_path = workspace / "output" / "RUN_ERRORS.md"
stamp = now_iso_seconds()
log_hint = f" (log: `{log_rel}`)" if log_rel else ""
line = f"- {stamp} `{unit_id}` `{skill}` `{kind}`: {message}{log_hint}"
if out_path.exists() and out_path.stat().st_size > 0:
prev = out_path.read_text(encoding="utf-8", errors="ignore").rstrip() + "\n"
else:
prev = "# Run errors\n\n"
atomic_write_text(out_path, prev + line + "\n")
except Exception:
# Never let the runner crash while trying to log an error.
return
def run_one_unit(
    *,
    workspace: Path,
    repo_root: Path,
    strict: bool = False,
    auto_approve: set[str] | None = None,
) -> RunResult:
    units_path = workspace / "UNITS.csv"
    status_path = workspace / "STATUS.md"
    if not units_path.exists():
        return RunResult(unit_id=None, status="ERROR", message=f"Missing {units_path}")

    table = UnitsTable.load(units_path)
    runnable_idx = _find_first_runnable(table)
    if runnable_idx is None:
        return RunResult(unit_id=None, status="IDLE", message="No runnable unit found")

    row = table.rows[runnable_idx]
    unit_id = row.get("unit_id", "").strip()
    skill = row.get("skill", "").strip()
    owner = row.get("owner", "").strip().upper()
    row["status"] = "DOING"
    table.save(units_path)
    update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DOING {skill}")

    auto_approve_set = {str(x or "").strip().upper() for x in (auto_approve or set()) if str(x or "").strip()}
    if owner == "HUMAN":
        checkpoint = row.get("checkpoint", "").strip()
        if checkpoint and decisions_has_approval(workspace / "DECISIONS.md", checkpoint):
            row["status"] = "DONE"
            table.save(units_path)
            update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (HUMAN approved {checkpoint})")
            _refresh_status_checkpoint(status_path, table)
            return RunResult(unit_id=unit_id, status="DONE", message=f"HUMAN approved {checkpoint}")
        if checkpoint and checkpoint.upper() in auto_approve_set:
            set_decisions_approval(workspace / "DECISIONS.md", checkpoint, approved=True)
            row["status"] = "DONE"
            table.save(units_path)
            update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE (AUTO approved {checkpoint})")
            _refresh_status_checkpoint(status_path, table)
            return RunResult(unit_id=unit_id, status="DONE", message=f"AUTO approved {checkpoint}")
        row["status"] = "BLOCKED"
        table.save(units_path)
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (await HUMAN approval {checkpoint})")
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Await HUMAN approval {checkpoint} in DECISIONS.md")

    script_path = repo_root / "scripts" / "run.py"
    if not script_path.exists():
        row["status"] = "BLOCKED"
        table.save(units_path)
        skill_md = "SKILL.md"
        update_status_log(
            status_path,
            (
                f"{now_iso_seconds()} {unit_id} BLOCKED "
                f"(no script for {skill}; run manually per {skill_md} then mark DONE)"
            ),
        )
        _refresh_status_checkpoint(status_path, table)
        return RunResult(
            unit_id=unit_id,
            status="BLOCKED",
            message=(
                f"No executable script for skill '{skill}'. "
                f"Run it manually by following `{skill_md}`, write the required outputs, "
                f"then mark the unit DONE (e.g., `python scripts/pipeline.py mark --workspace {workspace} --unit-id {unit_id} --status DONE`)."
            ),
        )

    inputs = parse_semicolon_list(row.get("inputs"))
    raw_outputs = parse_semicolon_list(row.get("outputs"))
    outputs = [_strip_optional_marker(rel) for rel in raw_outputs]
    required_outputs = [outputs[i] for i, rel in enumerate(raw_outputs) if not rel.strip().startswith("?")]
    checkpoint = row.get("checkpoint", "").strip()
    cmd = [
        sys.executable,
        str(script_path),
        "--workspace",
        str(workspace),
        "--unit-id",
        unit_id,
        "--inputs",
        ";".join(inputs),
        "--outputs",
        ";".join(outputs),
        "--checkpoint",
        checkpoint,
    ]
    log_rel = f"output/unit_logs/{unit_id}.{skill}.log"
    log_path = workspace / log_rel
    try:
        completed = subprocess.run(cmd, check=False, capture_output=True, text=True)
        if completed.stdout or completed.stderr or completed.returncode != 0:
            ensure_dir(log_path.parent)
            body = [
                "# Unit log\n",
                f"- unit_id: {unit_id}\n",
                f"- skill: {skill}\n",
                f"- exit: {completed.returncode}\n",
                f"- cmd: {' '.join(cmd)}\n",
                "\n## stdout\n\n",
                (completed.stdout or "(empty)") + "\n",
                "\n## stderr\n\n",
                (completed.stderr or "(empty)") + "\n",
            ]
            atomic_write_text(log_path, "".join(body))
    except Exception as exc:  # pragma: no cover
        row["status"] = "BLOCKED"
        table.save(units_path)
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (exec error)")
        _append_run_error(
            workspace=workspace,
            unit_id=unit_id,
            skill=skill,
            kind="exec_error",
            message=f"{type(exc).__name__}: {exc}",
            log_rel=None,
        )
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="BLOCKED", message=str(exc))

    missing = [rel for rel in required_outputs if rel and not (workspace / rel).exists()]
    if completed.returncode == 0 and not missing:
        if strict:
            from tooling.quality_gate import check_unit_outputs, write_quality_report

            try:
                issues = check_unit_outputs(skill=skill, workspace=workspace, outputs=outputs)
            except Exception as exc:  # pragma: no cover
                from tooling.quality_gate import QualityIssue

                issues = [
                    QualityIssue(
                        code="quality_gate_exception",
                        message=f"Quality gate crashed: {type(exc).__name__}: {exc}",
                    )
                ]
            # Avoid confusing stale QUALITY_GATE.md after a successful run.
            report_path = workspace / "output" / "QUALITY_GATE.md"
            if issues or report_path.exists():
                write_quality_report(workspace=workspace, unit_id=unit_id, skill=skill, issues=issues)
            if issues:
                row["status"] = "BLOCKED"
                table.save(units_path)
                rel_report = str((workspace / "output" / "QUALITY_GATE.md").relative_to(workspace))
                update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (quality gate: {rel_report})")
                _refresh_status_checkpoint(status_path, table)
                reroute_hint = _reroute_hint(workspace)
                return RunResult(
                    unit_id=unit_id,
                    status="BLOCKED",
                    message=f"Quality gate failed; see {rel_report}" + (f"; {reroute_hint}" if reroute_hint else ""),
                )
        cutover_block = _section_first_cutover_block_message(workspace=workspace, outputs=outputs)
        if cutover_block:
            row["status"] = "BLOCKED"
            table.save(units_path)
            update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (section-first cutover)")
            _append_run_error(
                workspace=workspace,
                unit_id=unit_id,
                skill=skill,
                kind="section_first_cutover",
                message=cutover_block,
                log_rel=log_rel if log_path.exists() else None,
            )
            _refresh_status_checkpoint(status_path, table)
            return RunResult(unit_id=unit_id, status="BLOCKED", message=cutover_block)
        row["status"] = "DONE"
        table.save(units_path)
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} DONE {skill}")
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="DONE", message="OK")

    row["status"] = "BLOCKED"
    table.save(units_path)
    if missing:
        update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (missing outputs: {', '.join(missing)})")
        _append_run_error(
            workspace=workspace,
            unit_id=unit_id,
            skill=skill,
            kind="missing_outputs",
            message=f"Missing outputs: {', '.join(missing)}",
            log_rel=log_rel if log_path.exists() else None,
        )
        _refresh_status_checkpoint(status_path, table)
        return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Missing outputs: {', '.join(missing)}" + (f"; see {log_rel}" if log_path.exists() else ""))
    update_status_log(status_path, f"{now_iso_seconds()} {unit_id} BLOCKED (script failed)")
    _append_run_error(
        workspace=workspace,
        unit_id=unit_id,
        skill=skill,
        kind="script_failed",
        message=f"Skill script failed (exit {completed.returncode})",
        log_rel=log_rel if log_path.exists() else None,
    )
    _refresh_status_checkpoint(status_path, table)
    return RunResult(unit_id=unit_id, status="BLOCKED", message=f"Skill script failed (exit {completed.returncode})" + (f"; see {log_rel}" if log_path.exists() else ""))
def _find_first_runnable(table: UnitsTable) -> int | None:
    status_ok = {"DONE", "SKIP"}
    unit_by_id = {row.get("unit_id", ""): row for row in table.rows}
    for idx, row in enumerate(table.rows):
        if row.get("status", "").strip().upper() not in {"TODO", "BLOCKED"}:
            continue
        deps = parse_semicolon_list(row.get("depends_on"))
        if not deps:
            return idx
        deps_done = True
        for dep_id in deps:
            dep = unit_by_id.get(dep_id)
            if not dep:
                deps_done = False
                break
            if dep.get("status", "").strip().upper() not in status_ok:
                deps_done = False
                break
        if deps_done:
            return idx
    return None


def _refresh_status_checkpoint(status_path: Path, table: UnitsTable) -> None:
    checkpoint = _compute_current_checkpoint(table)
    update_status_field(status_path, "Current checkpoint", checkpoint)


def _compute_current_checkpoint(table: UnitsTable) -> str:
    for row in table.rows:
        if row.get("status", "").strip().upper() not in {"DONE", "SKIP"}:
            return (row.get("checkpoint") or "").strip() or "C0"
    return "DONE"
def invalidate_downstream_units(table: UnitsTable, *, root_unit_id: str) -> list[str]:
    """Reset all transitive downstream dependents of `root_unit_id` to TODO.

    This is used when a previously satisfied upstream unit is reopened for rerun.
    Keeping downstream units as DONE would otherwise leave stale artifacts in place
    and make later `run` invocations stop too early.
    """
    root = str(root_unit_id or "").strip()
    if not root:
        return []

    # Map each unit id to the rows that list it in `depends_on`.
    direct_children: dict[str, list[dict[str, str]]] = {}
    for row in table.rows:
        for dep in parse_semicolon_list(row.get("depends_on")):
            direct_children.setdefault(dep, []).append(row)

    # Depth-first walk over dependents; `seen` guards against cycles and repeats.
    affected: list[str] = []
    seen: set[str] = set()
    stack = [root]
    while stack:
        current = stack.pop()
        for child in direct_children.get(current, []):
            child_id = str(child.get("unit_id") or "").strip()
            if not child_id or child_id in seen:
                continue
            seen.add(child_id)
            if str(child.get("status") or "").strip().upper() != "TODO":
                child["status"] = "TODO"
                affected.append(child_id)
            stack.append(child_id)
    return affected
def _strip_optional_marker(relpath: str) -> str:
    relpath = (relpath or "").strip()
    if relpath.startswith("?"):
        return relpath[1:].strip()
    return relpath


def _reroute_hint(workspace: Path) -> str:
    path = workspace / "output" / "REROUTE_STATE.json"
    if not path.exists() or path.stat().st_size <= 0:
        return ""
    try:
        import json

        data = json.loads(path.read_text(encoding="utf-8", errors="ignore") or "{}")
    except Exception:
        return ""
    if not isinstance(data, dict):
        return ""
    target = str(data.get("reroute_target") or "").strip()
    status = str(data.get("status") or "").strip()
    phase = str(data.get("structure_phase") or "").strip()
    h3 = str(data.get("h3_status") or "").strip()
    reason = str(data.get("reroute_reason") or "").strip()
    if not any([target, status, phase, h3]):
        return ""
    parts = []
    if status:
        parts.append(f"reroute_status={status}")
    if target:
        parts.append(f"reroute_target={target}")
    if phase:
        parts.append(f"structure_phase={phase}")
    if h3:
        parts.append(f"h3_status={h3}")
    if reason:
        parts.append(f"reason={reason}")
    return ", ".join(parts)
FILE:tooling/ideation.py
from __future__ import annotations
import json
import re
from dataclasses import asdict, dataclass, is_dataclass
from pathlib import Path
from typing import Any, Iterable
from tooling.common import (
atomic_write_text,
ensure_dir,
load_workspace_pipeline_spec,
pipeline_overridable_query_fields,
pipeline_query_default,
read_jsonl,
)
DEFAULT_IDEA_RUBRIC: list[tuple[str, float, str]] = [
("discussion_worthiness", 0.24, "is this direction worth a serious PI/PhD discussion right now?"),
("academic_value", 0.22, "could it open a meaningful paper/thesis-worthy research angle?"),
("evidence_grounding", 0.18, "can we point to concrete literature tensions rather than pure speculation?"),
("direction_distinctness", 0.16, "is it genuinely different from neighboring directions, not just a wording variant?"),
("first_probe_clarity", 0.10, "is there a plausible low-cost first probe that would teach us something?"),
("thesis_potential", 0.10, "does it plausibly grow beyond a one-off note into a stronger line of work?"),
]
STOPWORDS = {
"a", "an", "and", "the", "of", "to", "for", "in", "on", "with", "via", "by", "from", "or",
"work", "works", "study", "studies", "survey", "review", "benchmarks", "benchmark", "evaluation",
"agents", "agent", "model", "models", "llm", "large", "language", "using", "based",
}
CLUSTER_AXIS_HINTS: list[tuple[list[str], list[str], str, str]] = [
(["agent loop", "action"], ["observability granularity", "action-space design", "tool/environment boundary"], "mechanism", "Could sharpen how we think about the basic agent loop rather than adding yet another full stack."),
(["tool", "orchestration", "interface"], ["tool routing", "permission model", "API reliability assumptions"], "systems", "Could clarify whether interface design is the real bottleneck behind seemingly agent-level gains."),
(["planning", "reasoning"], ["search depth", "verification loop", "partial-observability handling"], "mechanism", "Could turn vague talk about reasoning quality into a sharper research question about what actually drives robust planning."),
(["memory", "retrieval", "rag"], ["retrieval policy", "state summarization cadence", "memory horizon"], "mechanism", "Could reveal when memory is a genuine capability lever versus a reporting convenience."),
(["self-improvement", "adaptation"], ["feedback type", "adaptation horizon", "evaluation signal quality"], "adaptation", "Could distinguish genuine improvement from evaluation-sensitive self-optimization."),
(["multi-agent", "coordination"], ["role specialization", "communication budget", "aggregation rule"], "coordination", "Could separate coordination benefits from simple redundancy or verification effects."),
(["benchmark", "evaluation"], ["metric choice", "task slice design", "failure-analysis lens"], "evaluation", "Could make evaluation choices themselves a research object instead of hidden background assumptions."),
(["safety", "security", "governance"], ["threat model", "monitoring granularity", "permission boundary"], "governance", "Could connect technical agent behaviors to deployment-relevant risk questions in a more principled way."),
]
GENERIC_LIMITATION_PATTERNS = [
"abstract-level evidence only",
"validate assumptions",
"full paper",
"before relying on this as key evidence",
]
BENCHMARK_HINTS = [
"HotpotQA", "FEVER", "ALFWorld", "WebShop", "HumanEval", "WebArena", "AgentBench",
"GSM8K", "MATH", "Wikipedia", "wiki", "API", "question answering", "fact verification",
]
AXIS_INSIGHT_LIBRARY: dict[str, dict[str, str]] = {
"observability granularity": {
"question": "whether several agent-loop gains are really planner gains, or whether they mostly come from changing what the planner gets to observe and when",
"confound": "planner quality and broader agent competence",
"thesis": "Several agent-loop gains remain hard to interpret because papers often improve observation access at the same time they improve the planner; before crediting planner depth, we should ask what the system was allowed to see.",
"insight": "A convincing result would not just move aggregate score. It would show which published gains survive once observation access is fixed, ideally producing a regime map of when observability—not planner depth—changes the failure story.",
"contribution_shape": "Could yield a causal-attribution result plus a reporting rule for agent-loop papers: claims about planning quality should specify and control observation access.",
"demotion": "This direction weakens sharply if the strongest prior work already holds observation access fixed while planner quality changes and still reports the same gain pattern.",
"kill_signal": "an anchor paper already fixes observation access while varying planner quality and the main conclusion still survives",
"missing_piece": "What is missing is a fixed-interface, fixed-budget comparison that varies only observation access and tracks whether the failure taxonomy—not just average score—changes.",
"program_kind": "causal attribution",
"time_to_clarity": "fast",
"priority_note": "Fastest to falsify on public tasks, and the answer would immediately change how several agent-loop results are interpreted.",
"reading_extract": "what the agent sees at each step, which ablations vary planner depth, and whether the action interface stays fixed",
"title": "Observability granularity vs planner depth",
},
"action-space design": {
"question": "whether robustness comes from better reasoning or from giving the agent a cleaner action interface",
"confound": "the shape of the action vocabulary and interface design",
"thesis": "Some agent-loop gains may be interface gains in disguise: when action vocabularies become cleaner or narrower, papers can attribute robustness to reasoning improvements that partly come from action-space design.",
"insight": "A useful result would show whether the same nominal planner still behaves very differently once the action space is widened, normalized, or made less ergonomic.",
"contribution_shape": "Could produce an action-space normalization protocol and a clearer account of which agent claims survive once interface ergonomics are controlled.",
"demotion": "This direction weakens if current papers already compare equivalent action spaces and still obtain the same ranking.",
"kill_signal": "the key anchor papers already normalize action vocabularies or API surfaces and still see the same ordering",
"missing_piece": "What is missing is an action-space-normalized comparison that keeps planner prompts fixed while changing only interface granularity and affordances.",
"program_kind": "interface normalization",
"time_to_clarity": "medium",
"priority_note": "Interesting and still thesis-relevant, but less immediate than observability because interface normalization usually requires more careful task redesign.",
"reading_extract": "how many actions are available, how semantically aligned the actions are, and whether interface cleanup happens together with reasoning changes",
"title": "Action-space design or agent competence?",
},
"tool/environment boundary": {
"question": "whether reliability depends more on boundary design between tool and environment than on the nominal reasoning loop",
"confound": "tool abstraction and environment modeling",
"thesis": "A number of agent results may really be about boundary design: where the system stops reasoning internally and starts delegating to tools can change reliability without changing the nominal planner much.",
"insight": "A strong result would show that several agent-level conclusions are actually boundary-design conclusions once tool abstraction is normalized.",
"contribution_shape": "Could yield a boundary-design taxonomy and a cleaner experimental recipe for separating planner quality from tool/interface scaffolding.",
"demotion": "This direction weakens if changing the boundary carefully barely affects the failure story.",
"kill_signal": "careful tool-boundary changes leave both the metric and the qualitative failure modes essentially unchanged",
"missing_piece": "What is missing is a comparison that keeps task semantics fixed while moving the tool/environment boundary in a controlled way.",
"program_kind": "systems boundary",
"time_to_clarity": "medium",
"priority_note": "Valuable as a systems-facing thesis line, but slower to turn into a decisive first readout than the lead mechanism questions.",
"reading_extract": "where tool calls begin, what state is exposed to the planner, and whether environment modeling changes together with the tool boundary",
"title": "Where the tool boundary really matters",
},
"search depth": {
"question": "whether apparent planning gains reflect deeper search or simply more inference-time budget to recover from weak initial choices",
"confound": "inference-time compute budget",
"thesis": "Many planning results still bundle depth with budget: deeper search often spends more tokens, branches, or retries, so it remains unclear whether depth changes reasoning quality or just buys more recovery opportunities.",
"insight": "A convincing result would show whether depth changes the nature of planning failures under a matched budget, rather than merely delaying failure by spending more compute.",
"contribution_shape": "Could produce a compute-normalized planning benchmark slice and a regime map for when search depth matters beyond extra inference budget.",
"demotion": "This direction weakens if prior work already equalizes budget and still finds depth-specific gains.",
"kill_signal": "the anchor papers already normalize token or wall-clock budget and the depth advantage remains intact",
"missing_piece": "What is missing is a compute-normalized study that holds token or wall-clock budget fixed while varying depth or branching, then checks whether failure modes actually change.",
"program_kind": "budget-normalized mechanism",
"time_to_clarity": "medium",
"priority_note": "Still thesis-sized, but it ranks behind observability because compute normalization is harder to defend and easier for prior work to have addressed already.",
"reading_extract": "the reported depth or branching settings, the effective compute budget, and whether shallow baselines were budget-matched",
"title": "Search depth or compute budget?",
},
"verification loop": {
"question": "whether verification contributes a distinct reasoning mechanism or mostly acts as expensive redundancy",
"confound": "extra compute and repeated checking",
"thesis": "Verification-heavy pipelines may look stronger because they retry, re-check, or filter more often, not necessarily because they add a distinct reasoning mechanism.",
"insight": "A useful result would separate verification as a mechanism from verification as a compute-expensive retry policy.",
"contribution_shape": "Could produce a cleaner protocol for comparing verification loops under matched retry or compute budgets.",
"demotion": "This direction weakens if verification can be replaced by simple repetition with little change in behavior.",
"kill_signal": "repeat-sampling or retry baselines already match the reported verification gain",
"missing_piece": "What is missing is a matched-budget comparison between explicit verification and simple repetition or reranking baselines.",
"program_kind": "verification audit",
"time_to_clarity": "medium",
"priority_note": "Decision-relevant when the group suspects redundancy masquerading as reasoning, though the probe is less clean than the top two directions.",
"reading_extract": "how many extra passes are spent on checking, whether retry baselines exist, and which errors verification actually fixes",
"title": "Verification or just expensive redundancy?",
},
"partial-observability handling": {
"question": "whether planning systems fail because they reason badly or because they are brittle under missing state information",
"confound": "state visibility",
"thesis": "Some planning failures may be epistemic before they are algorithmic: systems can look like weak planners when they are actually brittle under missing state information.",
"insight": "A strong result would show whether improved visibility changes the failure taxonomy more than planner changes do.",
"contribution_shape": "Could produce a planning-under-partial-observability benchmark slice and a clearer division between epistemic and algorithmic failure modes.",
"demotion": "This direction weakens if stronger visibility barely changes the failure pattern.",
"kill_signal": "improved visibility leaves both success rates and error types almost unchanged",
"missing_piece": "What is missing is a task slice that varies only state visibility while keeping planner scaffolding steady.",
"program_kind": "epistemic failure analysis",
"time_to_clarity": "medium",
"priority_note": "Useful if the group wants a deeper failure-analysis program, but it is currently less concrete than the lead directions.",
"reading_extract": "what information is hidden, which observations are restored by tools, and whether planner quality is tested separately from visibility",
"title": "Is planning failure really an observability problem?",
},
"retrieval policy": {
"question": "whether memory gains come from better stored knowledge or from better decisions about when and how retrieval is triggered",
"confound": "memory content versus retrieval timing",
"thesis": "Reported memory gains are often hard to interpret because memory content and retrieval triggers move together; some improvements may be protocol effects about when retrieval happens, not capability effects about what memory stores.",
"insight": "A convincing result would separate memory-content effects from retrieval-trigger effects and show whether the interpretation of current RAG-style gains changes once trigger policy is normalized.",
"contribution_shape": "Could yield a cleaner memory-evaluation protocol and a reusable result on when retrieval policy, rather than memory content, is the true hidden variable.",
"demotion": "This direction weakens if current work already varies retrieval policy independently of memory content and sees the same story.",
"kill_signal": "the main memory papers already keep the memory store fixed while varying retrieval triggers or timing and the conclusion does not move",
"missing_piece": "What is missing is a fixed-memory comparison that varies retrieval trigger and timing only, then checks whether the gain survives and which errors it actually removes.",
"program_kind": "protocol sensitivity",
"time_to_clarity": "medium",
"priority_note": "Different enough from the planning directions to stay in the lead set, but it ranks lower because the current evidence is thinner and more protocol-sensitive.",
"reading_extract": "what is stored in memory, when retrieval fires, what retrieval budget is allowed, and whether trigger policy is ever ablated independently",
"title": "Retrieval policy or memory content?",
},
"state summarization cadence": {
"question": "whether summarization helps because it improves reasoning or because it selectively compresses away troublesome state information",
"confound": "what gets forgotten versus what gets highlighted",
"thesis": "Summarization may help less because the agent reasons better and more because the system filters state in a favorable way; cadence choices can quietly redefine what information survives.",
"insight": "A useful result would show whether different summarization cadences preserve the same conclusions and failure taxonomy once memory content is held steady.",
"contribution_shape": "Could produce a summarization-control protocol for long-horizon agents and a clearer account of how state compression alters conclusions.",
"demotion": "This direction weakens if different summarization cadences preserve the same conclusions and failure taxonomy.",
"kill_signal": "changing summarization cadence leaves both retrieval behavior and downstream errors largely unchanged",
"missing_piece": "What is missing is a fixed-memory, fixed-task comparison that varies only summarization cadence and records what state is lost or amplified.",
"program_kind": "state compression",
"time_to_clarity": "medium",
"priority_note": "Promising but currently less mature than retrieval-policy framing because the concrete prior-work hooks are thinner.",
"reading_extract": "what gets summarized away, how often summaries are rewritten, and whether failure cases correlate with state compression choices",
"title": "What summarization cadence is really changing",
},
"memory horizon": {
"question": "whether longer memory helps because more context is useful or because evaluations reward persistence over selectivity",
"confound": "context length and evaluation design",
"thesis": "Longer memory can look like better capability even when the benchmark mainly rewards persistence or repeated access to earlier state, so horizon effects may partly be evaluation-design effects.",
"insight": "A convincing result would show whether extending horizon changes task framing and error patterns, not just aggregate score.",
"contribution_shape": "Could produce a regime map for when horizon length matters, versus when selective retrieval or better task design explains the gain.",
"demotion": "This direction weakens if memory horizon has little effect once retrieval policy is controlled.",
"kill_signal": "memory-length changes stop mattering once retrieval triggers or evaluation design are normalized",
"missing_piece": "What is missing is a horizon study that matches retrieval policy and benchmark framing while varying how much state is retained.",
"program_kind": "regime mapping",
"time_to_clarity": "medium",
"priority_note": "Conceptually useful, but currently less decisive than retrieval-policy framing for a first discussion round.",
"reading_extract": "whether longer context actually changes retrieval choices, and whether the benchmark rewards persistence more than selective recall",
"title": "Does longer memory really help?",
},
}
@dataclass(frozen=True)
class IdeaSignal:
    signal_id: str
    cluster: str
    direction_type: str
    theme: str
    claim_or_observation: str
    tension: str
    missing_piece: str
    possible_axis: str
    academic_value: str
    evidence_confidence: str
    paper_ids: list[str]


@dataclass(frozen=True)
class DirectionCard:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    focus_axis: str
    main_confound: str
    program_kind: str
    contribution_shape: str
    time_to_clarity: str
    one_line_thesis: str
    why_interesting: str
    literature_suggests: list[str]
    closest_prior_gap: list[str]
    missing_piece: str
    possible_variants: list[str]
    academic_value: str
    first_probes: list[str]
    what_counts_as_insight: str
    weakness_conditions: list[str]
    kill_criteria: list[str]
    what_would_change_mind: list[str]
    best_fit: str
    why_this_ranks_here: str
    evidence_confidence: str
    paper_ids: list[str]
    signal_ids: list[str]
    anchor_reading_notes: list[dict[str, str]]


@dataclass(frozen=True)
class ScreenedDirection:
    direction_id: str
    cluster: str
    direction_type: str
    title: str
    total_score: float
    discussion_worthiness: int
    academic_value_score: int
    evidence_grounding: int
    direction_distinctness: int
    first_probe_clarity: int
    thesis_potential: int
    recommendation: str
    rationale: str
def read_core_set(path: Path) -> list[dict[str, str]]:
    import csv

    if not path.exists():
        return []
    with path.open("r", encoding="utf-8", newline="") as handle:
        return [dict(row) for row in csv.DictReader(handle)]


def write_jsonl(path: Path, rows: Iterable[dict[str, Any] | Any]) -> None:
    ensure_dir(path.parent)
    encoded: list[str] = []
    for row in rows:
        value = asdict(row) if is_dataclass(row) else row
        encoded.append(json.dumps(value, ensure_ascii=False))
    atomic_write_text(path, "\n".join(encoded).rstrip() + ("\n" if encoded else ""))


def write_json(path: Path, data: Any) -> None:
    ensure_dir(path.parent)
    atomic_write_text(path, json.dumps(data, ensure_ascii=False, indent=2).rstrip() + "\n")


def uniq_keep_order(items: Iterable[str]) -> list[str]:
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or "").strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out


def slugify(text: str) -> str:
    raw = re.sub(r"[^A-Za-z0-9]+", "-", str(text or "").strip().lower())
    return raw.strip("-") or "x"


def clean_text(text: str, *, limit: int = 220) -> str:
    s = str(text or "").strip()
    s = s.replace("\n", " ")
    s = re.sub(r"\s+", " ", s)
    s = s.replace("|", ", ")
    s = s.strip(" \"'`")
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()


def clean_sentence(text: str, *, limit: int = 180) -> str:
    s = clean_text(text, limit=max(limit * 2, limit))
    if len(s) <= limit:
        return s
    window = s[:limit + 40]
    pieces = re.split(r'(?<=[.!?;])\s+', window)
    acc: list[str] = []
    total = 0
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue
        nxt = total + len(piece) + (1 if acc else 0)
        if nxt > limit and acc:
            break
        acc.append(piece)
        total = nxt
        if total >= int(limit * 0.65):
            break
    if acc:
        return " ".join(acc).strip()
    clipped = s[:limit].rsplit(" ", 1)[0].strip()
    return clipped if clipped else s[:limit].strip()


def markdown_table(headers: list[str], rows: list[list[str]]) -> str:
    head = "| " + " | ".join(headers) + " |"
    sep = "| " + " | ".join(["---"] * len(headers)) + " |"
    body = ["| " + " | ".join(str(cell).replace("\n", " ").strip() for cell in row) + " |" for row in rows]
    return "\n".join([head, sep] + body)


def write_markdown(path: Path, text: str) -> None:
    ensure_dir(path.parent)
    atomic_write_text(path, text.rstrip() + "\n")
def extract_goal_from_goal_md(path: Path) -> str:
    if not path.exists():
        return "research ideas"
    # Strip before filtering so indented headings (e.g. "  # Goal") are skipped too.
    stripped = (ln.strip() for ln in path.read_text(encoding="utf-8", errors="ignore").splitlines())
    lines = [ln for ln in stripped if ln and not ln.startswith("#")]
    return lines[0] if lines else "research ideas"
def _idea_workspace_from_brief(path: Path) -> Path | None:
    try:
        return path.resolve().parents[2]
    except Exception:
        return None


def _require_positive_float(value: Any, *, field_name: str) -> float:
    try:
        parsed = float(value)
    except Exception as exc:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
    if parsed <= 0:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return parsed


def _query_int_override(workspace: Path, key: str, default: int) -> int:
    """Return a positive integer override for `key` from queries.md, else `default`."""
    queries_path = workspace / "queries.md"
    normalized = str(key or "").strip().lower().replace(" ", "_").replace("-", "_")
    if not normalized:
        return int(default)
    if normalized not in pipeline_overridable_query_fields(workspace):
        return int(default)
    if not queries_path.exists():
        return int(default)
    # The original wrapped this loop in a try/except that only re-raised; it is dropped here.
    for raw in queries_path.read_text(encoding="utf-8", errors="ignore").splitlines():
        line = raw.strip()
        if not line.startswith("- ") or ":" not in line:
            continue
        # Distinct names keep the `key` parameter from being shadowed.
        raw_key, raw_value = line[2:].split(":", 1)
        if raw_key.strip().lower().replace(" ", "_").replace("-", "_") != normalized:
            continue
        cleaned = raw_value.split("#", 1)[0].strip().strip('"').strip("'")
        if not cleaned:
            raise ValueError(f"Invalid ideation override: `{normalized}` is empty in queries.md.")
        try:
            parsed = int(cleaned)
        except Exception as exc:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be an integer in queries.md.") from exc
        if parsed <= 0:
            raise ValueError(f"Invalid ideation override: `{normalized}` must be a positive integer in queries.md.")
        return parsed
    return int(default)
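

# Illustrative sketch (not part of the original module): `_query_int_override`
# reads bullet lines of the form "- key: value  # comment" from queries.md.
# The hypothetical helper below applies the same normalization to one line so
# the accepted format is easy to see. Its name and its ("", "") return value
# for non-matching lines are assumptions made for this example.
def _sketch_parse_override_line(raw: str) -> tuple[str, str]:
    line = raw.strip()
    if not line.startswith("- ") or ":" not in line:
        return ("", "")
    key, value = line[2:].split(":", 1)
    # Keys are compared case-insensitively with spaces and hyphens as underscores.
    key = key.strip().lower().replace(" ", "_").replace("-", "_")
    # Trailing "# ..." comments and surrounding quotes are dropped from the value.
    cleaned = value.split("#", 1)[0].strip().strip('"').strip("'")
    return (key, cleaned)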


def _validate_score_weights(value: Any, *, field_name: str) -> dict[str, float]:
    # Only the keys of `required` are consulted below. The numbers record a
    # reference weighting and are never applied as defaults: every key must be
    # present in `value` with a positive weight.
    required = {
        "discussion_worthiness": 0.24,
        "academic_value": 0.22,
        "evidence_grounding": 0.18,
        "direction_distinctness": 0.16,
        "first_probe_clarity": 0.10,
        "thesis_potential": 0.10,
    }
    if not isinstance(value, dict):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    weights: dict[str, float] = {}
    for key in required:
        if key not in value:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{key}")
        weights[key] = _require_positive_float(value.get(key), field_name=f"{field_name}.{key}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    # Normalize so the returned weights sum to 1 (up to rounding).
    return {key: round(weight / total, 6) for key, weight in weights.items()}
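

# Illustrative sketch (not part of the original module): the normalization
# in `_validate_score_weights` divides each positive weight by the sum, and
# `score_direction_cards` later combines 1-5 criterion scores with those
# weights. The hypothetical helper below shows both steps together on plain
# dicts. Its name and signature are assumptions for this example.
def _sketch_weighted_total(scores: dict[str, int], raw_weights: dict[str, float]) -> float:
    total_weight = sum(raw_weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    normalized = {key: weight / total_weight for key, weight in raw_weights.items()}
    # Weighted sum of the per-criterion scores, rounded to two decimals.
    return round(sum(scores[key] * normalized[key] for key in normalized), 2)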


def _validate_diversity_axes(value: Any, *, field_name: str) -> list[str]:
    allowed = {"cluster", "direction_type", "program_kind"}
    if not isinstance(value, list):
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    axes: list[str] = []
    seen: set[str] = set()
    for item in value:
        axis = str(item or "").strip().lower().replace("-", "_")
        if not axis:
            continue
        if axis not in allowed:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}.{axis}")
        if axis in seen:
            continue
        seen.add(axis)
        axes.append(axis)
    if not axes:
        raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
    return axes


def resolve_idea_contract(workspace: Path) -> dict[str, Any]:
    spec = load_workspace_pipeline_spec(workspace)
    if spec is None:
        raise ValueError("Missing active pipeline contract.")
    query_defaults = dict(spec.query_defaults)
    quality_contract = dict(spec.quality_contract)

    def _require_positive_int(value: Any, *, field_name: str) -> int:
        try:
            parsed = int(value)
        except Exception as exc:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}") from exc
        if parsed <= 0:
            raise ValueError(f"Missing or invalid ideation contract field: {field_name}")
        return parsed

    signal_policy = quality_contract.get("signal_policy") or {}
    direction_policy = quality_contract.get("direction_policy") or {}
    screening_policy = quality_contract.get("screening_policy") or {}
    brief = parse_idea_brief(workspace / "output" / "trace" / "IDEA_BRIEF.md")
    focus_clusters = [str(x).strip() for x in (brief.get("focus_clusters") or []) if str(x).strip()]
    direction_pool_min_default = _require_positive_int(query_defaults.get("direction_pool_min"), field_name="query_defaults.direction_pool_min")
    direction_pool_max_default = _require_positive_int(query_defaults.get("direction_pool_max"), field_name="query_defaults.direction_pool_max")
    shortlist_size_default = _require_positive_int(query_defaults.get("idea_shortlist_size"), field_name="query_defaults.idea_shortlist_size")
    report_top_n_default = _require_positive_int(query_defaults.get("report_top_n"), field_name="query_defaults.report_top_n")
    idea_screen_top_n_default = _require_positive_int(query_defaults.get("idea_screen_top_n"), field_name="query_defaults.idea_screen_top_n")
    shortlist_min = _require_positive_int(direction_policy.get("shortlist_min"), field_name="quality_contract.direction_policy.shortlist_min")
    shortlist_max = _require_positive_int(direction_policy.get("shortlist_max"), field_name="quality_contract.direction_policy.shortlist_max")
    keep_min = _require_positive_int(direction_policy.get("keep_min"), field_name="quality_contract.direction_policy.keep_min")
    cluster_diversity_min = _require_positive_int(direction_policy.get("cluster_diversity_min"), field_name="quality_contract.direction_policy.cluster_diversity_min")
    lead_diversity_target = _require_positive_int(direction_policy.get("lead_diversity_target"), field_name="quality_contract.direction_policy.lead_diversity_target")
    lead_diversity_axes = _validate_diversity_axes(
        direction_policy.get("lead_diversity_axes"),
        field_name="quality_contract.direction_policy.lead_diversity_axes",
    )
    signal_table_min = _require_positive_int(signal_policy.get("min_rows"), field_name="quality_contract.signal_policy.min_rows")
    keep_rank_max = _require_positive_int(screening_policy.get("keep_rank_max"), field_name="quality_contract.screening_policy.keep_rank_max")
    maybe_rank_max = _require_positive_int(screening_policy.get("maybe_rank_max"), field_name="quality_contract.screening_policy.maybe_rank_max")
    score_weights = _validate_score_weights(
        screening_policy.get("score_weights"),
        field_name="quality_contract.screening_policy.score_weights",
    )
    direction_pool_min = _query_int_override(workspace, "direction_pool_min", direction_pool_min_default)
    direction_pool_max = _query_int_override(workspace, "direction_pool_max", direction_pool_max_default)
    shortlist_size = _query_int_override(workspace, "idea_shortlist_size", shortlist_size_default)
    report_top_n = _query_int_override(workspace, "report_top_n", report_top_n_default)
    idea_screen_top_n = _query_int_override(workspace, "idea_screen_top_n", idea_screen_top_n_default)
    if direction_pool_min > direction_pool_max:
        raise ValueError(f"Invalid ideation contract: direction_pool_min ({direction_pool_min}) exceeds direction_pool_max ({direction_pool_max}).")
    if shortlist_size < shortlist_min or shortlist_size > shortlist_max:
        raise ValueError(
            f"Invalid ideation contract: idea_shortlist_size ({shortlist_size}) must stay within "
            f"[{shortlist_min}, {shortlist_max}]."
        )
    if report_top_n > shortlist_size:
        raise ValueError(f"Invalid ideation contract: report_top_n ({report_top_n}) exceeds idea_shortlist_size ({shortlist_size}).")
    if maybe_rank_max < keep_rank_max:
        raise ValueError(
            f"Invalid ideation contract: maybe_rank_max ({maybe_rank_max}) must be >= keep_rank_max ({keep_rank_max})."
        )
    if idea_screen_top_n > direction_pool_max:
        raise ValueError(f"Invalid ideation contract: idea_screen_top_n ({idea_screen_top_n}) exceeds direction_pool_max ({direction_pool_max}).")
    return {
        "focus_clusters": focus_clusters,
        "direction_pool_min": direction_pool_min,
        "direction_pool_max": direction_pool_max,
        "idea_screen_top_n": idea_screen_top_n,
        "shortlist_size": shortlist_size,
        "report_top_n": report_top_n,
        "shortlist_min": shortlist_min,
        "shortlist_max": shortlist_max,
        "signal_table_min": signal_table_min,
        "keep_min": keep_min,
        "cluster_diversity_min": cluster_diversity_min,
        "lead_diversity_target": lead_diversity_target,
        "lead_diversity_axes": lead_diversity_axes,
        "keep_rank_max": keep_rank_max,
        "maybe_rank_max": maybe_rank_max,
        "score_weights": score_weights,
    }


def parse_idea_brief(path: Path) -> dict[str, Any]:
    text = path.read_text(encoding="utf-8", errors="ignore") if path.exists() else ""
    out: dict[str, Any] = {
        "goal": extract_goal_from_goal_md(path),
        "focus_clusters": [],
        "query_buckets": [],
        "exclusions": [],
        "constraints": [],
        "targets": {},
    }
    cur = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith("## "):
            cur = line[3:].strip().lower()
            continue
        if cur == "goal" and line.startswith("- Topic:"):
            out["goal"] = line.split(":", 1)[1].strip()
        if cur in {"focus after c2", "focus lenses after c2"} and line.startswith("- Focus clusters:"):
            clusters = [x.strip() for x in line.split(":", 1)[1].split(";") if x.strip()]
            cleaned: list[str] = []
            for item in clusters:
                low = item.lower()
                if "to be filled" in low or "fill after c2" in low or "placeholder" in low:
                    continue
                cleaned.append(item)
            out["focus_clusters"] = cleaned
        if cur == "query buckets" and re.match(r"^\d+\.\s+", line):
            out["query_buckets"].append(re.sub(r"^\d+\.\s+", "", line).strip())
        if cur in {"exclude terms", "exclusions"} and line.startswith("- "):
            value = line[2:].strip()
            if value and not value.lower().startswith("none"):
                out["exclusions"].append(value)
        if cur == "constraints" and line.startswith("- "):
            out["constraints"].append(line[2:].strip())
        if cur == "targets" and line.startswith("- ") and ":" in line:
            key, value = line[2:].split(":", 1)
            out["targets"][key.strip().lower().replace(" ", "_")] = value.strip()
    return out


def collect_note_index(path: Path) -> dict[str, dict[str, Any]]:
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        pid = str(rec.get("paper_id") or "").strip()
        if pid:
            out[pid] = rec
    return out


def keywords_from_cluster(name: str) -> list[str]:
    toks = [t for t in re.findall(r"[A-Za-z0-9]+", str(name or "").lower()) if t not in STOPWORDS and len(t) > 2]
    return uniq_keep_order(toks)


def score_note_to_cluster(cluster_name: str, note: dict[str, Any]) -> int:
    keys = keywords_from_cluster(cluster_name)
    blob_parts = [str(note.get("title") or "")]
    for field in ["summary_bullets", "limitations", "key_results", "method", "abstract"]:
        val = note.get(field)
        if isinstance(val, list):
            blob_parts.extend([str(x) for x in val])
        elif isinstance(val, str):
            blob_parts.append(val)
    blob = " ".join(blob_parts).lower()
    return sum(1 for key in keys if key in blob)
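

# Illustrative sketch (not part of the original module): `score_note_to_cluster`
# counts how many cluster keywords occur as substrings of the note text. The
# hypothetical helper below reproduces that idea on plain strings; the small
# stopword set stands in for the module's STOPWORDS, whose contents are not
# shown here. Because matching is by substring, a short keyword such as "use"
# also matches inside longer words such as "because".
import re  # the module already uses `re`; repeated so the sketch stands alone


def _sketch_keyword_overlap(cluster_name: str, text: str, stopwords: frozenset = frozenset({"and", "the", "for", "with"})) -> int:
    tokens = [t for t in re.findall(r"[A-Za-z0-9]+", cluster_name.lower()) if t not in stopwords and len(t) > 2]
    unique = list(dict.fromkeys(tokens))  # de-duplicate while keeping order
    blob = text.lower()
    return sum(1 for token in unique if token in blob)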


def map_notes_to_clusters(taxonomy_path: Path, notes_path: Path) -> dict[str, list[dict[str, Any]]]:
    import yaml

    taxonomy = yaml.safe_load(taxonomy_path.read_text(encoding="utf-8", errors="ignore")) if taxonomy_path.exists() else []
    notes = [r for r in read_jsonl(notes_path) if isinstance(r, dict)]
    out: dict[str, list[dict[str, Any]]] = {}
    for top in taxonomy or []:
        if not isinstance(top, dict):
            continue
        for child in top.get("children") or []:
            if not isinstance(child, dict):
                continue
            name = str(child.get("name") or "").strip()
            if not name:
                continue
            scored: list[tuple[int, dict[str, Any]]] = []
            for note in notes:
                s = score_note_to_cluster(name, note)
                if s > 0:
                    scored.append((s, note))
            scored.sort(key=lambda x: (-x[0], str(x[1].get("paper_id") or "")))
            out[name] = [n for _, n in scored[:8]]
    return out


def _cluster_profile(cluster: str) -> tuple[str, list[str], str]:
    low = cluster.lower()
    for keys, axes, direction_type, academic_value in CLUSTER_AXIS_HINTS:
        if any(key in low for key in keys):
            return direction_type, axes, academic_value
    return "research", ["assumption sensitivity", "failure analysis", "scope boundary"], "Could sharpen the way this sub-area is framed and compared."


def _note_bullets(note: dict[str, Any], field: str) -> list[str]:
    val = note.get(field)
    if isinstance(val, list):
        return [clean_text(x, limit=220) for x in val if clean_text(x, limit=220)]
    if isinstance(val, str) and clean_text(val, limit=220):
        return [clean_text(val, limit=220)]
    return []


def _specific_limitations(note: dict[str, Any]) -> list[str]:
    items = _note_bullets(note, "limitations")
    specific = []
    for item in items:
        low = item.lower()
        if any(pat in low for pat in GENERIC_LIMITATION_PATTERNS):
            continue
        specific.append(item)
    return specific or items


def _evidence_confidence(notes: list[dict[str, Any]]) -> str:
    if not notes:
        return "low"
    if any(str(n.get("evidence_level") or "").strip() == "fulltext" for n in notes):
        return "medium-high"
    specific = sum(
        1
        for n in notes
        if _specific_limitations(n) and not any(p in (_specific_limitations(n)[0].lower()) for p in GENERIC_LIMITATION_PATTERNS)
    )
    if len(notes) >= 3 and specific >= 2:
        return "medium"
    return "low-medium"


def _axis_profile(axis: str, cluster: str) -> dict[str, str]:
    base = AXIS_INSIGHT_LIBRARY.get(axis, {})
    confound = base.get("confound", "nearby design choices and evaluation framing")
    return {
        "question": base.get("question", f"whether {axis} is doing more conceptual work in {cluster.lower()} than current papers make explicit"),
        "confound": confound,
        "thesis": base.get("thesis", f"Current results in {cluster.lower()} may be hard to interpret because {axis} still moves together with {confound}."),
        "insight": base.get("insight", f"A meaningful result would show whether {axis} changes how we interpret the current literature rather than merely how we narrate it."),
        "contribution_shape": base.get("contribution_shape", f"Could turn {axis} into a cleaner explanatory variable for {cluster.lower()} rather than a background convenience."),
        "demotion": base.get("demotion", f"This direction would weaken if the strongest prior work already isolates {axis} cleanly."),
        "kill_signal": base.get("kill_signal", f"the strongest prior work already isolates {axis} against {confound}"),
        "missing_piece": base.get("missing_piece", f"What is missing is a cleaner comparison that varies {axis} while keeping {confound} fixed enough to change interpretation, not just presentation."),
        "program_kind": base.get("program_kind", "mechanism clarification"),
        "time_to_clarity": base.get("time_to_clarity", "medium"),
        "priority_note": base.get("priority_note", f"Worth discussion because it could turn {axis} into a sharper explanatory wedge for {cluster.lower()}."),
        "reading_extract": base.get("reading_extract", f"which variables change together with {axis}, what comparator is used, and whether any ablation already fixes {confound}"),
        "title": base.get("title", f"What {axis} is really doing"),
    }


def _result_fact(note: dict[str, Any]) -> str:
    candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
    scored: list[str] = []
    for cand in candidates:
        low = cand.lower()
        if any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "outperform", "achiev", "%"]) or any(ch.isdigit() for ch in cand):
            scored.append(cand)
    chosen = scored[0] if scored else _claim_text(note)
    chosen = re.sub(r"^(for instance|for example|e\.g\.)[:,]?\s*", "", chosen, flags=re.IGNORECASE)
    low = chosen.lower()
    tasks = _extract_task_mentions(note)
    task_text = "/".join(tasks[:2]) if tasks else "reported setting"
    percents = re.findall(r"\d+(?:\.\d+)?%", chosen)
    if "pass@1" in low and percents:
        if len(percents) >= 2:
            return clean_text(f"{task_text}: {percents[0]} pass@1 vs {percents[1]} in the reported comparison.", limit=95)
        return clean_text(f"{task_text}: {percents[0]} pass@1 in the reported comparison.", limit=95)
    if "success rate" in low and len(percents) >= 2:
        baseline = " over imitation/RL baselines" if "imitation" in low or "reinforcement learning" in low else " in the reported comparison"
        return clean_text(f"{task_text}: {percents[0]}/{percents[1]} success-rate gains{baseline}.", limit=95)
    if len(percents) >= 2:
        return clean_text(f"{task_text}: {percents[0]} vs {percents[1]} in the reported comparison.", limit=95)
    if len(percents) == 1:
        return clean_text(f"{task_text}: reported result reaches {percents[0]} in the abstract.", limit=95)
    return clean_text(chosen, limit=95)


def _metric_phrase(notes: list[dict[str, Any]]) -> str:
    blob = " ".join(_result_fact(note) for note in notes)
    low = blob.lower()
    if "pass@1" in low:
        return "pass@1 plus failure-type shifts"
    if "success rate" in low:
        return "success rate plus failure-type shifts"
    if "accuracy" in low:
        return "accuracy plus failure-type shifts"
    if "exact match" in low:
        return "exact match plus failure-type shifts"
    if "f1" in low:
        return "F1 plus failure-type shifts"
    if "%" in blob:
        return "the reported benchmark metric plus failure-type shifts"
    return "task success plus failure-type shifts"


def _sentence_has_concrete_hook(text: str) -> bool:
    low = str(text or "").lower()
    if any(ch.isdigit() for ch in str(text or "")):
        return True
    if any(token in low for token in ["pass@1", "accuracy", "success rate", "exact match", "f1", "alfworld", "webshop", "hotpotqa", "fever", "humaneval", "webarena", "agentbench"]):
        return True
    return False


def _extract_task_mentions(note: dict[str, Any]) -> list[str]:
    blob_parts = []
    for field in ["key_results", "summary_bullets", "method", "abstract", "title"]:
        val = note.get(field)
        if isinstance(val, list):
            blob_parts.extend([str(x) for x in val])
        elif isinstance(val, str):
            blob_parts.append(val)
    blob = " ".join(blob_parts)
    low = blob.lower()
    positions: list[tuple[int, str]] = []
    for hint in BENCHMARK_HINTS:
        pos = low.find(hint.lower())
        if pos >= 0:
            positions.append((pos, hint))
    for match in re.finditer(r"\b(?:HumanEval|ALFWorld|WebShop|HotpotQA|FEVER|WebArena|AgentBench|GSM8K|MATH)\b", blob):
        positions.append((match.start(), match.group(0)))
    positions.sort(key=lambda item: (item[0], item[1]))
    return uniq_keep_order(name for _, name in positions)


def _task_phrase(notes: list[dict[str, Any]]) -> str:
    tasks: list[str] = []
    for note in notes:
        tasks.extend(_extract_task_mentions(note))
    uniq = uniq_keep_order(tasks)
    if not uniq:
        return "a small public task slice"
    if len(uniq) == 1:
        return uniq[0]
    return "/".join(uniq[:2])


def _claim_text(note: dict[str, Any]) -> str:
    candidates = _note_bullets(note, "key_results") + _note_bullets(note, "summary_bullets")
    for cand in candidates:
        if len(cand.split()) >= 8:
            return cand
    return clean_text(note.get("method") or note.get("title") or "", limit=220)


def _limitation_text(note: dict[str, Any], axis: str, cluster: str) -> str:
    limitations = _specific_limitations(note)
    if limitations and not any(pat in limitations[0].lower() for pat in GENERIC_LIMITATION_PATTERNS):
        return limitations[0]
    profile = _axis_profile(axis, cluster)
    return f"The paper still leaves unclear whether {axis} is genuinely responsible for the reported story in {cluster.lower()}, or whether that story is really driven by {profile['confound']}."


def _anchor_notes(note_index: dict[str, dict[str, Any]], paper_ids: list[str]) -> list[dict[str, Any]]:
    return [note_index[pid] for pid in paper_ids if pid in note_index]


def _paper_annotation(note: dict[str, Any], axis: str, cluster: str) -> str:
    title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
    profile = _axis_profile(axis, cluster)
    result = _result_fact(note)
    return clean_sentence(
        f"{title}: {result} Open gap: {axis} still moves with {profile['confound']}.",
        limit=300,
    )


def _synthesis_annotation(notes: list[dict[str, Any]], axis: str, cluster: str) -> str:
    profile = _axis_profile(axis, cluster)
    task_phrase = _task_phrase(notes)
    return clean_sentence(
        f"Across the anchor papers, the live question is whether {axis} changes interpretation itself or merely rides along with {profile['confound']}, especially on {task_phrase}.",
        limit=260,
    )


def build_signal_rows(*, cluster: str, notes: list[dict[str, Any]]) -> list[IdeaSignal]:
    if not notes:
        return []
    direction_type, axes, academic_value = _cluster_profile(cluster)
    evidence_confidence = _evidence_confidence(notes)
    anchors = notes[:3]
    paper_ids = uniq_keep_order(str(n.get("paper_id") or "").strip() for n in anchors)
    signals: list[IdeaSignal] = []
    for idx, axis in enumerate(axes[:3], start=1):
        anchor = anchors[(idx - 1) % len(anchors)]
        claim = _claim_text(anchor)
        tension = _limitation_text(anchor, axis, cluster)
        profile = _axis_profile(axis, cluster)
        missing_piece = clean_sentence(
            f"What is still missing is a direction that isolates whether {axis} is the real explanatory variable in {cluster.lower()}, rather than leaving it entangled with {profile['confound']}.",
            limit=240,
        )
        signals.append(IdeaSignal(
            signal_id=f"SIG-{slugify(cluster)[:16]}-{idx}",
            cluster=cluster,
            direction_type=direction_type,
            theme=f"{profile['title']} in {cluster}",
            claim_or_observation=claim,
            tension=clean_text(tension, limit=220),
            missing_piece=missing_piece,
            possible_axis=axis,
            academic_value=academic_value,
            evidence_confidence=evidence_confidence,
            paper_ids=paper_ids,
        ))
    return signals


def signal_table_markdown(rows: list[IdeaSignal]) -> str:
    table_rows = []
    for row in rows:
        table_rows.append([
            row.signal_id,
            row.cluster,
            row.theme,
            row.claim_or_observation,
            row.tension,
            row.missing_piece,
            row.possible_axis,
            row.academic_value,
            row.evidence_confidence,
            ", ".join(row.paper_ids),
        ])
    return "\n".join([
        "# IDEA_SIGNAL_TABLE",
        "",
        "## Research signals",
        "",
        markdown_table(["Signal ID", "Cluster", "Theme", "Claim / observation", "Tension", "Missing piece", "Possible axis", "Academic value", "Confidence", "Paper IDs"], table_rows),
        "",
    ])


def _title_from_signal(signal: IdeaSignal) -> str:
    profile = _axis_profile(signal.possible_axis, signal.cluster)
    return clean_sentence(profile["title"], limit=72)


def _best_fit(direction_type: str, axis: str) -> str:
    if direction_type == "governance":
        return "Best fit when the group wants a deployment-relevant reading of agent behavior rather than another internal benchmark story."
    if direction_type == "evaluation":
        return "Best fit when the discussion is really about what counts as convincing evidence, not just how to raise scores."
    if axis in {"search depth", "verification loop", "retrieval policy"}:
        return "Best fit for a PhD discussion that wants one sharper mechanism/confound question rather than a broad systems buildout."
    return "Best fit for PI/PhD discussions that want a thesis-worthy mechanism question rather than a generic benchmark wrapper."


def signals_to_direction_cards(signals: list[IdeaSignal], *, note_index: dict[str, dict[str, Any]], focus_clusters: list[str], pool_min: int, pool_max: int) -> list[DirectionCard]:
    focus = {x.strip() for x in focus_clusters if str(x).strip()}
    cards: list[DirectionCard] = []
    for idx, signal in enumerate(signals, start=1):
        profile = _axis_profile(signal.possible_axis, signal.cluster)
        anchors = _anchor_notes(note_index, signal.paper_ids)
        task_phrase = _task_phrase(anchors)
        metric_phrase = _metric_phrase(anchors)
        nearest = anchors[0] if anchors else {}
        nearest_title = clean_text((nearest.get("title") if isinstance(nearest, dict) else "") or (signal.paper_ids[0] if signal.paper_ids else signal.cluster), limit=90)
        nearest_task_phrase = _task_phrase([nearest]) if nearest else task_phrase
        literature_suggests = [_paper_annotation(note, signal.possible_axis, signal.cluster) for note in anchors[:2]]
        literature_suggests.append(_synthesis_annotation(anchors, signal.possible_axis, signal.cluster))
        closest_prior_gap = [
            clean_sentence(
                f"{nearest_title} is the closest prior anchor because it already reports concrete behavior on {nearest_task_phrase}. The unresolved point is whether those gains survive once {profile['confound']} is held fixed while {signal.possible_axis} is varied.",
                limit=220,
            ),
            clean_sentence(
                "The novelty test here is narrow, not rhetorical: if a strong anchor paper already runs that single-variable control, this direction should collapse quickly rather than stay alive as a vague confound story.",
                limit=220,
            ),
        ]
        one_line_thesis = clean_sentence(profile["thesis"], limit=230)
        why_interesting = clean_sentence(
            f"This is not just another benchmark wedge. It opens a {profile['program_kind']} line around {signal.possible_axis}: {profile['question']}.",
            limit=220,
        )
        missing_piece = clean_sentence(profile["missing_piece"], limit=230)
        possible_variants = [
            clean_sentence(f"Single-variable control on {task_phrase}: vary {signal.possible_axis} while holding {profile['confound']} fixed as far as the setup allows.", limit=210),
            clean_sentence("Replace score-only reporting with a failure-type comparison after the same intervention, so the result says more than whether the average went up.", limit=210),
            clean_sentence("Check whether the conclusion survives on one simple public task slice and one more tool- or environment-heavy setting before treating it as a general claim.", limit=210),
        ]
        first_probes = [
            clean_sentence(
                f"Intervention: vary {signal.possible_axis} while holding {profile['confound']} as fixed as possible on {task_phrase}. Readout: {metric_phrase}. Decisive if the interpretation changes even after the control.",
                limit=220,
            ),
            clean_sentence(
                f"Prior-work audit: inspect {nearest_title} for any ablation that already fixes {profile['confound']}, and if the conclusion survives, demote this direction.",
                limit=180,
            ),
        ]
        what_counts_as_insight = clean_sentence(profile['insight'], limit=190)
        kill_criteria = [
            clean_sentence(f"Kill quickly if {profile['kill_signal']}.", limit=170),
            clean_sentence(f"Kill if the first controlled probe leaves both {metric_phrase} and the failure taxonomy essentially unchanged.", limit=170),
        ]
        weakness_conditions = [
            clean_sentence(f"This direction is weaker if changing {signal.possible_axis} mostly rescales the aggregate metric without changing which failure modes appear or disappear.", limit=175),
            clean_sentence(profile['demotion'], limit=170),
        ]
        anchor_reading_notes: list[dict[str, str]] = []
        for note in anchors[:3]:
            title = clean_text(note.get("title") or note.get("paper_id") or "paper", limit=110)
            result = _result_fact(note)
            anchor_reading_notes.append({
                "paper_title": title,
                "why_read": clean_sentence(f"Closest {signal.cluster.lower()} anchor with a concrete result hook on {task_phrase}: {result}", limit=180),
                "what_to_extract": clean_sentence(f"Extract the metric and comparator, then check whether {profile['reading_extract']}.", limit=180),
                "current_hook": clean_sentence(result, limit=180),
                "kill_signal": clean_sentence(f"Weaken this direction if the paper already shows {profile['kill_signal']}.", limit=180),
            })
        why_this_ranks_here = clean_sentence(profile['priority_note'], limit=170)
        cards.append(DirectionCard(
            direction_id=f"DIR-{idx:03d}",
            cluster=signal.cluster,
            direction_type=signal.direction_type,
            title=_title_from_signal(signal),
            focus_axis=signal.possible_axis,
            main_confound=profile['confound'],
            program_kind=profile['program_kind'],
            contribution_shape=profile['contribution_shape'],
            time_to_clarity=profile['time_to_clarity'],
            one_line_thesis=one_line_thesis,
            why_interesting=why_interesting,
            literature_suggests=literature_suggests,
            closest_prior_gap=closest_prior_gap,
            missing_piece=missing_piece,
            possible_variants=possible_variants,
            academic_value=profile['contribution_shape'],
            first_probes=first_probes,
            what_counts_as_insight=what_counts_as_insight,
            weakness_conditions=weakness_conditions,
            kill_criteria=kill_criteria,
            what_would_change_mind=kill_criteria,
            best_fit=_best_fit(signal.direction_type, signal.possible_axis),
            why_this_ranks_here=why_this_ranks_here,
            evidence_confidence=signal.evidence_confidence,
            paper_ids=signal.paper_ids,
            signal_ids=[signal.signal_id],
            anchor_reading_notes=anchor_reading_notes,
        ))
    cards.sort(key=lambda c: (0 if c.cluster in focus else 1, c.cluster, c.program_kind, c.title))
    target = max(pool_min, min(pool_max, len(cards)))
    return cards[:target]
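

# Illustrative sketch (not part of the original module): the pool size in
# `signals_to_direction_cards` is clamped with max(pool_min, min(pool_max, n)).
# The hypothetical helper below isolates that arithmetic. Note that when fewer
# than `pool_min` cards exist, the target exceeds the card count and the slice
# simply returns every card; nothing is padded.
def _sketch_pool_target(pool_min: int, pool_max: int, available: int) -> int:
    return max(pool_min, min(pool_max, available))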


def direction_pool_markdown(cards: list[DirectionCard]) -> str:
    rows: list[list[str]] = []
    for card in cards:
        rows.append([
            card.direction_id,
            card.cluster,
            card.direction_type,
            card.program_kind,
            card.title,
            clean_sentence(card.one_line_thesis, limit=100),
            clean_sentence(card.why_interesting, limit=100),
            clean_sentence(card.missing_piece, limit=100),
            clean_sentence(" / ".join(card.possible_variants), limit=120),
            clean_sentence(card.academic_value, limit=90),
            clean_sentence(" / ".join(card.first_probes), limit=120),
            card.evidence_confidence,
            ", ".join(card.paper_ids),
        ])
    return "\n".join([
        "# IDEA_DIRECTION_POOL",
        "",
        "## Candidate research directions",
        "",
        markdown_table(["Direction ID", "Cluster", "Type", "Program", "Title", "One-line thesis", "Why interesting", "Missing piece", "Possible variants", "Academic value", "First probes", "Confidence", "Paper IDs"], rows),
        "",
    ])


def _thesis_potential(direction_type: str, cluster: str) -> int:
    if direction_type in {"mechanism", "coordination", "adaptation"}:
        return 5
    if direction_type in {"systems", "governance"}:
        return 4
    if "benchmark" in cluster.lower() or direction_type == "evaluation":
        return 3
    return 4


def score_direction_cards(
    cards: list[DirectionCard],
    *,
    focus_clusters: list[str],
    keep_rank_max: int,
    maybe_rank_max: int,
    score_weights: dict[str, float] | None = None,
) -> list[ScreenedDirection]:
    focus = {x.strip() for x in focus_clusters if str(x).strip()}
    keep_rank_max = int(keep_rank_max)
    maybe_rank_max = int(maybe_rank_max)
    if keep_rank_max <= 0:
        raise ValueError("Invalid ideation scoring contract: keep_rank_max must be positive.")
    if maybe_rank_max < keep_rank_max:
        raise ValueError("Invalid ideation scoring contract: maybe_rank_max must be >= keep_rank_max.")
    weights = _validate_score_weights(score_weights or {}, field_name="score_direction_cards.score_weights")
    cluster_counts: dict[str, int] = {}
    program_counts: dict[str, int] = {}
    for card in cards:
        cluster_counts[card.cluster] = cluster_counts.get(card.cluster, 0) + 1
        program_counts[card.program_kind] = program_counts.get(card.program_kind, 0) + 1
    provisional: list[tuple[DirectionCard, float, int, int, int, int, int, int]] = []
    for card in cards:
        discussion_worthiness = 5 if card.cluster in focus or card.time_to_clarity == "fast" else 4
        academic_value_score = 5 if any(token in card.contribution_shape.lower() for token in ["rule", "protocol", "regime map", "causal"]) else 4
        concrete_hooks = sum(1 for item in card.literature_suggests if _sentence_has_concrete_hook(item))
        evidence_grounding = 5 if len(card.paper_ids) >= 3 and concrete_hooks >= 2 else 4 if len(card.paper_ids) >= 2 and concrete_hooks >= 1 else 3
        if program_counts.get(card.program_kind, 0) == 1 and cluster_counts.get(card.cluster, 0) == 1:
            direction_distinctness = 5
        elif program_counts.get(card.program_kind, 0) == 1 or cluster_counts.get(card.cluster, 0) == 1:
            direction_distinctness = 4
        else:
            direction_distinctness = 3
        probe_blob = " ".join(card.first_probes).lower()
        first_probe_clarity = 5 if all(token in probe_blob for token in ["intervention:", "readout:", "decisive if"]) else 4 if "intervention:" in probe_blob else 3
        thesis_potential = _thesis_potential(card.direction_type, card.cluster)
        total = round(
            discussion_worthiness * weights["discussion_worthiness"]
            + academic_value_score * weights["academic_value"]
            + evidence_grounding * weights["evidence_grounding"]
            + direction_distinctness * weights["direction_distinctness"]
            + first_probe_clarity * weights["first_probe_clarity"]
            + thesis_potential * weights["thesis_potential"],
            2,
        )
        provisional.append((card, total, discussion_worthiness, academic_value_score, evidence_grounding, direction_distinctness, first_probe_clarity, thesis_potential))
    provisional.sort(key=lambda row: (-row[1], 0 if row[0].time_to_clarity == "fast" else 1, row[0].cluster, row[0].title))
    rows: list[ScreenedDirection] = []
    for idx, (card, total, dw, av, eg, dd, fp, tp) in enumerate(provisional, start=1):
        recommendation = "keep" if idx <= keep_rank_max else "maybe" if idx <= maybe_rank_max else "drop"
        strengths: list[str] = []
        if eg >= 5:
            strengths.append("concrete anchor evidence")
        if dd >= 5:
            strengths.append(f"distinct {card.program_kind} wedge")
        if fp >= 5:
            strengths.append("clean first probe")
        if av >= 5:
            strengths.append("thesis-sized payoff")
        if not strengths:
            strengths.append("discussion value")
        risk = "still abstract-first" if str(card.evidence_confidence).startswith("low") else f"main risk is controlling {card.main_confound} cleanly"
        rationale = clean_sentence(f"Strongest on {' + '.join(strengths[:2])}; {risk}.", limit=160)
        rows.append(ScreenedDirection(
            direction_id=card.direction_id,
            cluster=card.cluster,
            direction_type=card.direction_type,
            title=card.title,
            total_score=total,
            discussion_worthiness=dw,
            academic_value_score=av,
            evidence_grounding=eg,
            direction_distinctness=dd,
            first_probe_clarity=fp,
            thesis_potential=tp,
            recommendation=recommendation,
            rationale=rationale,
        ))
    return rows
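

# Illustrative sketch (not part of the original module): after sorting,
# `score_direction_cards` turns a 1-based rank into a decision with two
# thresholds. The hypothetical helper below shows that mapping on its own,
# assuming 0 < keep_rank_max <= maybe_rank_max as the function itself enforces.
def _sketch_rank_decision(rank: int, keep_rank_max: int, maybe_rank_max: int) -> str:
    if rank <= keep_rank_max:
        return "keep"
    if rank <= maybe_rank_max:
        return "maybe"
    return "drop"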


def screening_table_markdown(rows: list[ScreenedDirection]) -> str:
    data: list[list[str]] = []
    for row in rows:
        data.append([
            row.direction_id,
            row.cluster,
            row.direction_type,
            row.title,
            f"{row.total_score:.2f}",
            str(row.discussion_worthiness),
            str(row.academic_value_score),
            str(row.evidence_grounding),
            str(row.direction_distinctness),
            str(row.first_probe_clarity),
            str(row.thesis_potential),
            row.recommendation,
            row.rationale,
        ])
    return "\n".join([
        "# IDEA_SCREENING_TABLE",
        "",
        "## Discussion-first screening",
        "",
        markdown_table(["Direction ID", "Cluster", "Type", "Title", "Total", "Discussion", "Academic value", "Evidence", "Distinctness", "First probe", "Thesis potential", "Decision", "Rationale"], data),
        "",
    ])
def shortlist_markdown(records: list[dict[str, Any]]) -> str:
    """Render the prioritized shortlist as markdown, one H3 block per direction."""
    lines = ["# IDEA_SHORTLIST", "", "## Prioritized directions", ""]
    for record in records:
        lines.extend([
            f"### Direction {record['rank']}. {record['title']}",
            f"- Cluster: {record['cluster']}",
            f"- Type: {record['direction_type']}",
            f"- Focus axis: {record.get('focus_axis')}",
            f"- Program kind: {record.get('program_kind')}",
            f"- Main confound: {record.get('main_confound')}",
            f"- Time to clarity: {record.get('time_to_clarity')}",
            f"- One-line thesis: {record['one_line_thesis']}",
            f"- Why this is interesting: {record['why_interesting']}",
            f"- Why this ranks here: {record['why_this_ranks_here']}",
            f"- Contribution shape: {record.get('contribution_shape')}",
            "- What the literature already suggests:",
        ])
        for item in record.get("literature_suggests") or []:
            lines.append(f" - {item}")
        lines.extend([
            "- Closest prior work and why it does not settle the question:",
        ])
        for item in record.get("closest_prior_gap") or []:
            lines.append(f" - {item}")
        lines.extend([
            f"- What is still missing: {record['missing_piece']}",
            "- Possible variants:",
        ])
        for item in record.get("possible_variants") or []:
            lines.append(f" - {item}")
        lines.extend([
            f"- Why this could matter academically: {record['academic_value']}",
            "- First probes:",
        ])
        for item in record.get("first_probes") or []:
            lines.append(f" - {item}")
        lines.extend([
            f"- What would count as actual insight: {record['what_counts_as_insight']}",
            "- What would make this weak or unconvincing:",
        ])
        for item in record.get("weakness_conditions") or []:
            lines.append(f" - {item}")
        lines.extend([
            "- Quick kill criteria:",
        ])
        # Fall back to the older `what_would_change_mind` field when no kill criteria are recorded.
        for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
            lines.append(f" - {item}")
        lines.extend([
            f"- Best fit: {record['best_fit']}",
            f"- Evidence confidence: {record['evidence_confidence']}",
            "- Anchor papers: " + ", ".join(f"`{pid}`" for pid in record.get("paper_ids") or []),
            f"- Why prioritized now: {record['why_prioritized']}",
            "",
        ])
    return "\n".join(lines)
def shortlist_snapshot_table(records: list[dict[str, Any]]) -> str:
    """Render the compact at-a-glance table for the lead directions."""
    rows = []
    for rec in records:
        rows.append([
            str(rec.get("rank") or ""),
            rec.get("title", ""),
            clean_sentence(rec.get("why_this_ranks_here", ""), limit=52),
            # Use `or` so an empty or None `contribution_shape` still falls back to
            # `academic_value`, matching the fallback used in `report_markdown`.
            clean_sentence(rec.get("contribution_shape") or rec.get("academic_value") or "", limit=56),
            clean_sentence((rec.get("kill_criteria") or rec.get("what_would_change_mind") or [""])[0], limit=54),
        ])
    return markdown_table(["Rank", "Direction", "Why now", "If it survives", "Fast kill signal"], rows)
def build_report_payload(*, topic: str, shortlist: list[dict[str, Any]], deferred: list[dict[str, Any]], trace_paths: dict[str, str]) -> dict[str, Any]:
    """Assemble the structured memo payload consumed by `report_markdown` and `appendix_markdown`."""
    top = list(shortlist)
    takeaways: list[str] = []
    if top:
        takeaways.append("The strongest directions are the ones most likely to change how existing results are interpreted, not just add another benchmark win.")
        takeaways.append("The current rank order reflects a tradeoff between time-to-clarity, thesis-sized payoff, and how concrete the nearest prior-work gap already looks.")
        program_kinds = uniq_keep_order(str(rec.get("program_kind") or "").strip() for rec in top)
        if len(program_kinds) >= 2:
            takeaways.append(f"The lead set is intentionally not one confound template repeated three times: it spans {', '.join(program_kinds[:3])} rather than a single explanatory mold.")
        else:
            takeaways.append("The lead set still sits in one tight neighborhood, so the next reading pass should stress-test whether those directions are genuinely distinct thesis lines.")
        if any(str(rec.get("evidence_confidence") or "").startswith("low") for rec in top):
            takeaways.append("Most anchors are still abstract-first, so the memo is best used to decide the next reading and falsification pass rather than to lock a project immediately.")
        else:
            takeaways.append("The evidence is already concrete enough for a serious PI/PhD discussion, though one deeper paper pass could still reshuffle the exact order.")
    lead = top[0] if top else {}
    lead_anchor = (lead.get("anchor_reading_notes") or [{}])[0] if top else {}
    lead_anchor_title = lead_anchor.get("paper_title") if isinstance(lead_anchor, dict) else "the first anchor paper"
    discussion_questions = [
        "Which direction still survives once we ask for a single-variable control rather than a suggestive confound story?",
        "Which lead direction would still matter if the nearest prior work already addressed the obvious control we are worried about?",
        "Which candidate could plausibly turn into a thesis line with a reusable method, protocol, or regime map rather than a one-off empirical note?",
        "Which ranking would change most after one full-paper reading pass on the anchor set?",
    ]
    uncertainties = [
        "The main remaining risk is novelty risk, not idea scarcity: closer reading may reveal that one lead direction is already settled by an existing control or ablation.",
        "Evidence confidence is still bounded by abstract-first notes for part of the lead set, so paper-specific details could either strengthen or kill a direction quickly.",
        "The memo is more reliable as a discussion and triage artifact than as a final commitment on exact project order.",
    ]
    next_steps = []
    if top:
        next_steps.append(clean_sentence(f"Start with {lead.get('title')}: read {lead_anchor_title or 'the first anchor paper'} looking specifically for whether {lead.get('focus_axis')} is already isolated against {lead.get('main_confound')}.", limit=220))
        first_probe = (lead.get("first_probes") or [""])[0]
        if first_probe:
            next_steps.append(clean_sentence(first_probe, limit=220))
        next_steps.append(clean_sentence(f"Re-rank only after checking the quick kill criteria for {lead.get('title')} and at least one competing direction in the lead set.", limit=220))
    else:
        next_steps.extend([
            "Read the anchor papers for the current top direction in full and test whether the memo's stated hidden variable still looks unresolved.",
            "In the next PI/PhD discussion, use the ranking arguments and kill criteria to decide whether any direction deserves promotion into a sharper thesis candidate.",
        ])
    return {
        "topic": topic,
        "takeaways": takeaways,
        "top_directions": top,
        "deferred_directions": deferred,
        "discussion_questions": discussion_questions,
        "uncertainties": uncertainties,
        "next_steps": next_steps,
        "trace_artifacts": trace_paths,
    }
def report_markdown(payload: dict[str, Any]) -> str:
    """Render the memo payload as markdown with consecutively numbered H2 sections."""
    top_dirs = payload.get("top_directions") or []
    # Sections 0-2 are fixed and each top direction takes one section (3 .. 2 + N),
    # so the trailing sections start at 3 + N.
    deferred_idx = 3 + len(top_dirs)
    discussion_idx = deferred_idx + 1
    uncertainty_idx = deferred_idx + 2
    next_idx = deferred_idx + 3
    appendix_idx = deferred_idx + 4
    lines = [
        "# Research Idea Brainstorm Memo",
        "",
        "## 0. Scope and framing",
        "",
        f"- Topic: {payload.get('topic') or 'research ideas'}",
        "- Intended readers: PI / PhD",
        "- Goal: surface a small number of discussion-worthy research directions rather than force a final project choice.",
        "- Current evidence basis: abstract-first notes unless otherwise stated.",
        "- What this memo is not: not a final project spec, not a survey draft, and not a symmetric top-3 proposal pack.",
        "",
        "## 1. Big-picture takeaways",
        "",
    ]
    for item in payload.get("takeaways") or []:
        lines.append(f"- {item}")
    lines.extend([
        "",
        "## 2. Top directions at a glance",
        "",
        shortlist_snapshot_table(payload.get("top_directions") or []),
        "",
    ])
    for idx, record in enumerate(top_dirs, start=1):
        lines.extend([
            f"## {idx + 2}. Direction {idx} — {record.get('title')}",
            "",
            "### One-line thesis",
            f"- {record.get('one_line_thesis')}",
            "",
            "### Why it belongs in the lead set",
            f"- {record.get('why_this_ranks_here')}",
            f"- Time to clarity: {record.get('time_to_clarity')}",
            "",
            "### What the current literature actually shows",
        ])
        for item in record.get("literature_suggests") or []:
            lines.append(f"- {item}")
        lines.extend([
            "",
            "### Closest prior work and remaining gap",
        ])
        for item in record.get("closest_prior_gap") or []:
            lines.append(f"- {item}")
        lines.extend([
            f"- Missing piece: {record.get('missing_piece')}",
            "",
            "### If this direction is right, what contribution emerges",
            f"- {record.get('contribution_shape') or record.get('academic_value')}",
            f"- {record.get('what_counts_as_insight')}",
            "",
            "### Smallest decisive probe",
        ])
        for item in record.get("first_probes") or []:
            lines.append(f"- {item}")
        lines.extend([
            "",
            "### Quick kill criteria",
        ])
        for item in record.get("kill_criteria") or record.get("what_would_change_mind") or []:
            lines.append(f"- {item}")
        lines.extend(["",])
    lines.extend([
        f"## {deferred_idx}. Other promising but not prioritized directions",
        "",
    ])
    deferred = payload.get("deferred_directions") or []
    if deferred:
        for record in deferred:
            lines.append(f"- **{record.get('title')}** — {record.get('one_line_thesis')} (why not prioritized now: {record.get('why_not_prioritized')})")
    else:
        lines.append("- No additional deferred directions were retained in the current memo.")
    lines.extend([
        "",
        f"## {discussion_idx}. Cross-cutting discussion questions",
        "",
    ])
    for item in payload.get("discussion_questions") or []:
        lines.append(f"- {item}")
    lines.extend([
        "",
        f"## {uncertainty_idx}. Uncertainty and disagreement",
        "",
    ])
    for item in payload.get("uncertainties") or []:
        lines.append(f"- {item}")
    lines.extend([
        "",
        f"## {next_idx}. Suggested next reading / next discussion step",
        "",
    ])
    for item in payload.get("next_steps") or []:
        lines.append(f"- {item}")
    lines.extend([
        "",
        f"## {appendix_idx}. Appendix guide",
        "",
        "- `output/APPENDIX.md` for anchor-paper reading notes and deferred directions.",
        "- `output/REPORT.json` for the structured version of this memo.",
    ])
    return "\n".join(lines)
def appendix_markdown(payload: dict[str, Any], *, core_titles: dict[str, str]) -> str:
    """Render the appendix: deferred directions (A) and anchor-paper reading notes (B)."""
    lines = [
        "# Appendix to the Research Idea Brainstorm Memo",
        "",
        "## A. Deferred but still promising directions",
        "",
    ]
    deferred = payload.get("deferred_directions") or []
    if deferred:
        for rec in deferred:
            lines.extend([
                f"### {rec.get('title')}",
                f"- One-line thesis: {rec.get('one_line_thesis')}",
                f"- Program kind: {rec.get('program_kind')}",
                f"- Why interesting: {rec.get('why_interesting')}",
                f"- Why not prioritized now: {rec.get('why_not_prioritized')}",
                "",
            ])
    else:
        # Keep a blank line after the placeholder so the next heading is not
        # glued to the list item.
        lines.extend(["- (none)", ""])
    lines.extend([
        "## B. Anchor papers and what to extract",
        "",
    ])
    for rec in payload.get("top_directions") or []:
        lines.append(f"### {rec.get('title')}")
        lines.append(f"- Program kind: {rec.get('program_kind')}")
        lines.append(f"- Lead-set reason: {rec.get('why_this_ranks_here')}")
        notes = rec.get("anchor_reading_notes") or []
        if notes:
            rows: list[list[str]] = []
            for item in notes:
                if not isinstance(item, dict):
                    continue
                rows.append([
                    item.get("paper_title", "paper"),
                    item.get("why_read", ""),
                    item.get("what_to_extract", ""),
                    item.get("current_hook", ""),
                    item.get("kill_signal", ""),
                ])
            if rows:
                lines.append(markdown_table(["Anchor paper", "Why read now", "What to extract", "Current evidence hook", "Kill signal"], rows))
            else:
                # Notes existed but none were usable mappings: fall back to plain titles.
                for pid in rec.get("paper_ids") or []:
                    title = core_titles.get(pid, pid)
                    lines.append(f"- {title}")
        else:
            for pid in rec.get("paper_ids") or []:
                title = core_titles.get(pid, pid)
                lines.append(f"- {title}")
        lines.append("")
    return "\n".join(lines)
FILE:tooling/pipeline_spec.py
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import Any
import yaml
@dataclass(frozen=True)
class PipelineStage:
    id: str
    title: str
    checkpoint: str
    mode: str
    required_skills: tuple[str, ...]
    optional_skills: tuple[str, ...]
    produces: tuple[str, ...]
    human_checkpoint: dict[str, Any]

    @staticmethod
    def from_mapping(stage_id: str, data: Any, *, path: Path) -> "PipelineStage":
        """Build a stage from its YAML mapping; `title` and `checkpoint` default to the stage id."""
        if not isinstance(data, dict):
            raise ValueError(f"`stages.{stage_id}` must be a mapping in {path}")
        return PipelineStage(
            id=str(stage_id or "").strip(),
            title=str(data.get("title") or stage_id).strip() or str(stage_id),
            checkpoint=str(data.get("checkpoint") or stage_id).strip() or str(stage_id),
            mode=str(data.get("mode") or "").strip(),
            required_skills=_string_tuple(data.get("required_skills"), field_name=f"stages.{stage_id}.required_skills", path=path),
            optional_skills=_string_tuple(data.get("optional_skills"), field_name=f"stages.{stage_id}.optional_skills", path=path),
            produces=_string_tuple(data.get("produces"), field_name=f"stages.{stage_id}.produces", path=path),
            human_checkpoint=_mapping(data.get("human_checkpoint"), field_name=f"stages.{stage_id}.human_checkpoint", path=path, required=False),
        )
@dataclass(frozen=True)
class PipelineSpec:
    path: Path
    name: str
    version: str
    units_template: str
    default_checkpoints: tuple[str, ...]
    profile: str
    routing_hints: tuple[str, ...]
    routing_default: bool
    routing_priority: int
    target_artifacts: tuple[str, ...]
    contract_model: str
    structure_mode: str
    pre_retrieval_shell: dict[str, Any]
    binding_layers: tuple[str, ...]
    core_chapter_h3_target: int
    query_defaults: dict[str, Any]
    overridable_query_fields: tuple[str, ...]
    quality_contract: dict[str, Any]
    loop_policy: dict[str, Any]
    stages: dict[str, PipelineStage]
    variant_of: str
    variant_overrides: dict[str, Any]

    @staticmethod
    def load(path: Path) -> "PipelineSpec":
        """Load a pipeline file, resolving any `variant_of` chain into effective front matter."""
        resolved = path.resolve()
        raw_frontmatter, frontmatter = _load_variant_aware_frontmatter(resolved)
        name = str(frontmatter.get("name") or path.stem.replace(".pipeline", ""))
        units_template = str(frontmatter.get("units_template") or "")
        version = str(frontmatter.get("version") or "")
        default_checkpoints = _string_tuple(frontmatter.get("default_checkpoints"), field_name="default_checkpoints", path=resolved)
        profile = str(frontmatter.get("profile") or "default").strip() or "default"
        routing_hints = _string_tuple(frontmatter.get("routing_hints"), field_name="routing_hints", path=resolved)
        routing_default = bool(frontmatter.get("routing_default"))
        # Non-numeric values fall back to 0 instead of failing the whole load.
        try:
            routing_priority = int(frontmatter.get("routing_priority") or 0)
        except (TypeError, ValueError):
            routing_priority = 0
        target_artifacts = _string_tuple(frontmatter.get("target_artifacts"), field_name="target_artifacts", path=resolved)
        contract_model = str(frontmatter.get("contract_model") or "").strip()
        structure_mode = str(frontmatter.get("structure_mode") or "").strip()
        pre_retrieval_shell = _mapping(frontmatter.get("pre_retrieval_shell"), field_name="pre_retrieval_shell", path=resolved, required=False)
        binding_layers = _string_tuple(frontmatter.get("binding_layers"), field_name="binding_layers", path=resolved)
        try:
            core_chapter_h3_target = int(frontmatter.get("core_chapter_h3_target") or 0)
        except (TypeError, ValueError):
            core_chapter_h3_target = 0
        query_defaults = _mapping(frontmatter.get("query_defaults"), field_name="query_defaults", path=resolved, required=False)
        overridable_query_fields = _string_tuple(
            frontmatter.get("overridable_query_fields"),
            field_name="overridable_query_fields",
            path=resolved,
        )
        quality_contract = _mapping(frontmatter.get("quality_contract"), field_name="quality_contract", path=resolved, required=False)
        loop_policy = _mapping(frontmatter.get("loop_policy"), field_name="loop_policy", path=resolved, required=False)
        stages = _parse_stages(frontmatter.get("stages"), path=resolved)
        # Variant metadata comes from the raw (unmerged) front matter of this file.
        variant_of = str(raw_frontmatter.get("variant_of") or "").strip()
        variant_overrides = _mapping(raw_frontmatter.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False)
        if not units_template:
            raise ValueError(f"Missing units_template in pipeline front matter: {resolved}")
        return PipelineSpec(
            path=resolved,
            name=name,
            version=version,
            units_template=units_template,
            default_checkpoints=default_checkpoints,
            profile=profile,
            routing_hints=routing_hints,
            routing_default=routing_default,
            routing_priority=routing_priority,
            target_artifacts=target_artifacts,
            contract_model=contract_model,
            structure_mode=structure_mode,
            pre_retrieval_shell=pre_retrieval_shell,
            binding_layers=binding_layers,
            core_chapter_h3_target=core_chapter_h3_target,
            query_defaults=query_defaults,
            overridable_query_fields=overridable_query_fields,
            quality_contract=quality_contract,
            loop_policy=loop_policy,
            stages=stages,
            variant_of=variant_of,
            variant_overrides=variant_overrides,
        )

    def query_default(self, key: str, default: Any = None) -> Any:
        return self.query_defaults.get(str(key or "").strip(), default)

    def allows_query_override(self, key: str) -> bool:
        return str(key or "").strip() in set(self.overridable_query_fields)
def _parse_frontmatter(text: str) -> dict[str, Any]:
    """Parse the leading `---` ... `---` YAML block of a pipeline file into a mapping."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("Pipeline file must start with YAML front matter '---'")
    end_idx = None
    for idx in range(1, len(lines)):
        if lines[idx].strip() == "---":
            end_idx = idx
            break
    if end_idx is None:
        raise ValueError("Unterminated YAML front matter (missing closing '---')")
    raw = "\n".join(lines[1:end_idx])
    data = yaml.safe_load(raw) or {}
    if not isinstance(data, dict):
        raise ValueError("Pipeline YAML front matter must be a mapping")
    return data
def _load_variant_aware_frontmatter(path: Path, seen: set[Path] | None = None) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return `(raw, effective)` front matter, merging a `variant_of` base recursively."""
    resolved = path.resolve()
    active = set(seen or set())
    if resolved in active:
        # `active` is a set, so the order shown here is not guaranteed to follow the chain.
        chain = " -> ".join(str(p) for p in [*active, resolved])
        raise ValueError(f"Cyclic `variant_of` chain detected: {chain}")
    active.add(resolved)
    text = resolved.read_text(encoding="utf-8")
    raw = _parse_frontmatter(text)
    variant_of = str(raw.get("variant_of") or "").strip()
    if not variant_of:
        return raw, dict(raw)
    _validate_raw_variant_frontmatter(raw, path=resolved)
    base_path = _resolve_pipeline_reference(resolved, variant_of)
    _, base_effective = _load_variant_aware_frontmatter(base_path, seen=active)
    # Merge order: base front matter, then this file's `name`/`version`, then its overrides.
    current = {key: raw[key] for key in ("name", "version") if key in raw}
    merged = _deep_merge(base_effective, current)
    merged = _deep_merge(
        merged,
        _mapping(raw.get("variant_overrides"), field_name="variant_overrides", path=resolved, required=False),
    )
    merged["variant_of"] = variant_of
    merged["variant_overrides"] = raw.get("variant_overrides") or {}
    return raw, merged
def _validate_raw_variant_frontmatter(raw: dict[str, Any], *, path: Path) -> None:
    """Reject variant files that carry top-level keys outside the allowed set."""
    allowed_raw_keys = {"name", "version", "variant_of", "variant_overrides"}
    extra_keys = sorted(str(key) for key in raw.keys() if str(key) not in allowed_raw_keys)
    if extra_keys:
        raise ValueError(
            "Variant pipeline files may only keep top-level keys "
            "`name`, `version`, `variant_of`, `variant_overrides`; "
            f"move the rest under `variant_overrides` in {path}: {', '.join(extra_keys)}"
        )
def _resolve_pipeline_reference(path: Path, ref: str) -> Path:
    """Resolve a `variant_of` reference to an existing pipeline file."""
    value = str(ref or "").strip()
    if not value:
        raise ValueError(f"Empty `variant_of` reference in {path}")
    candidate = Path(value)
    if candidate.is_absolute() and candidate.exists():
        return candidate.resolve()
    # Treat the grandparent of the referencing file as the repo root.
    repo_root = path.parent.parent
    for base in (path.parent, repo_root, repo_root / "pipelines"):
        direct = (base / value).resolve()
        if direct.exists():
            return direct
    # Fall back to a bare name: `foo` or `foo.pipeline.md` both map to `foo.pipeline.md`.
    stem = Path(value).name
    if stem.endswith(".pipeline.md"):
        stem = stem[: -len(".pipeline.md")]
    if stem:
        for base in (path.parent, repo_root / "pipelines"):
            direct = (base / f"{stem}.pipeline.md").resolve()
            if direct.exists():
                return direct
    raise ValueError(f"Could not resolve `variant_of: {value}` from {path}")
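def _deep_merge(base: Any, override: Any) -> Any:
    """Recursively merge `override` into `base`; dicts merge, lists accept `__op__` patches, anything else is replaced."""
    if isinstance(base, list) and isinstance(override, dict) and any(str(key).startswith("__") for key in override.keys()):
        return _apply_list_patch(base, override)
    if isinstance(base, dict) and isinstance(override, dict):
        merged: dict[str, Any] = {str(k): v for k, v in base.items()}
        for key, value in override.items():
            # Base keys were stringified above, so stringify override keys too;
            # otherwise a non-string YAML key would never match its base entry.
            name = str(key)
            if name in merged:
                merged[name] = _deep_merge(merged[name], value)
            else:
                merged[name] = value
        return merged
    return override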
def _apply_list_patch(base: list[Any], override: dict[str, Any]) -> list[Any]:
    """Apply a list patch; `__replace__` wins outright, otherwise remove, then prepend, then append."""
    allowed_keys = {"__append__", "__prepend__", "__remove__", "__replace__"}
    bad_keys = sorted(str(key) for key in override.keys() if str(key) not in allowed_keys)
    if bad_keys:
        raise ValueError(
            "List patch overrides only support "
            "`__append__`, `__prepend__`, `__remove__`, `__replace__`; "
            f"got: {', '.join(bad_keys)}"
        )
    if "__replace__" in override:
        replacement = override.get("__replace__")
        if not isinstance(replacement, list):
            raise ValueError("List patch `__replace__` must be a YAML list")
        return list(replacement)
    current = list(base)
    remove_values = override.get("__remove__", [])
    prepend_values = override.get("__prepend__", [])
    append_values = override.get("__append__", [])
    for field_name, value in (
        ("__remove__", remove_values),
        ("__prepend__", prepend_values),
        ("__append__", append_values),
    ):
        if not isinstance(value, list):
            raise ValueError(f"List patch `{field_name}` must be a YAML list")
    if remove_values:
        current = [item for item in current if item not in remove_values]
    if prepend_values:
        current = list(prepend_values) + current
    if append_values:
        current = current + list(append_values)
    return current
def _mapping(value: Any, *, field_name: str, path: Path, required: bool) -> dict[str, Any]:
    """Coerce an optional YAML mapping to a dict with string keys."""
    # Note: `required` is accepted but not enforced here; a missing value always yields {}.
    if value is None:
        return {}
    if not isinstance(value, dict):
        raise ValueError(f"`{field_name}` must be a mapping in {path}")
    return {str(key): item for key, item in value.items()}
def _string_tuple(value: Any, *, field_name: str, path: Path) -> tuple[str, ...]:
    """Coerce an optional YAML list to a tuple of stripped, non-empty strings."""
    if value is None:
        return ()
    if not isinstance(value, list):
        raise ValueError(f"`{field_name}` must be a YAML list in {path}")
    out: list[str] = []
    for item in value:
        text = str(item or "").strip()
        if text:
            out.append(text)
    return tuple(out)
def _parse_stages(value: Any, *, path: Path) -> dict[str, PipelineStage]:
    """Parse `stages` given either as a mapping keyed by id or as a list of mappings with `id`/`stage`."""
    if value is None:
        return {}
    items: list[tuple[str, Any]] = []
    if isinstance(value, dict):
        items = [(str(stage_id or "").strip(), data) for stage_id, data in value.items()]
    elif isinstance(value, list):
        for idx, item in enumerate(value):
            if not isinstance(item, dict):
                raise ValueError(f"`stages[{idx}]` must be a mapping in {path}")
            stage_id = str(item.get("id") or item.get("stage") or "").strip()
            if not stage_id:
                raise ValueError(f"`stages[{idx}]` is missing `id` in {path}")
            items.append((stage_id, item))
    else:
        raise ValueError(f"`stages` must be a mapping or list in {path}")
    stages: dict[str, PipelineStage] = {}
    for stage_id, data in items:
        stage = PipelineStage.from_mapping(stage_id, data, path=path)
        if not stage.id:
            raise ValueError(f"`stages` contains an empty stage id in {path}")
        stages[stage.id] = stage
    return stages
FILE:tooling/pipeline_text.py
from __future__ import annotations
import json
import re
from pathlib import Path
from typing import Any
from tooling.common import load_yaml, read_jsonl
def slug_unit_id(unit_id: str) -> str:
    """Turn a unit id into a filename-safe slug prefixed with `S` (non-alphanumerics become `_`)."""
    raw = str(unit_id or '').strip()
    out: list[str] = []
    for ch in raw:
        out.append(ch if ch.isalnum() else '_')
    safe = ''.join(out).strip('_')
    return f'S{safe}' if safe else 'S'
def load_outline_sections(path: Path) -> list[dict[str, Any]]:
    """Load the outline YAML, keeping only sections and subsections that have both `id` and `title`."""
    outline = load_yaml(path) if path.exists() else []
    if not isinstance(outline, list):
        return []
    out: list[dict[str, Any]] = []
    for sec in outline:
        if not isinstance(sec, dict):
            continue
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        subsections: list[dict[str, str]] = []
        for sub in sec.get('subsections') or []:
            if not isinstance(sub, dict):
                continue
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                subsections.append({'id': sub_id, 'title': sub_title})
        if sec_id and sec_title:
            out.append({'id': sec_id, 'title': sec_title, 'subsections': subsections})
    return out
def iter_h3_units(path: Path) -> list[dict[str, str]]:
    """Flatten the outline into one record per H3 subsection, tagged with its parent section."""
    units: list[dict[str, str]] = []
    for sec in load_outline_sections(path):
        sec_id = str(sec.get('id') or '').strip()
        sec_title = str(sec.get('title') or '').strip()
        for sub in sec.get('subsections') or []:
            sub_id = str(sub.get('id') or '').strip()
            sub_title = str(sub.get('title') or '').strip()
            if sub_id and sub_title:
                units.append(
                    {
                        'section_id': sec_id,
                        'section_title': sec_title,
                        'sub_id': sub_id,
                        'title': sub_title,
                    }
                )
    return units
def read_jsonl_map(path: Path, key: str) -> dict[str, dict[str, Any]]:
    """Index JSONL records by `key`; later records overwrite earlier ones with the same value."""
    out: dict[str, dict[str, Any]] = {}
    for rec in read_jsonl(path):
        if not isinstance(rec, dict):
            continue
        val = str(rec.get(key) or '').strip()
        if val:
            out[val] = rec
    return out
def read_bib_keys(path: Path) -> set[str]:
    """Return the entry keys found in a BibTeX file, or an empty set if it is missing or empty."""
    if not path.exists() or path.stat().st_size <= 0:
        return set()
    text = path.read_text(encoding='utf-8', errors='ignore')
    # Match `@type{key,` at the start of a line and capture the key.
    return set(re.findall(r'(?im)^@\w+\s*\{\s*([^,\s]+)\s*,', text))
def uniq_keep_order(items: list[str]) -> list[str]:
    """Drop empty values and duplicates while preserving first-seen order."""
    out: list[str] = []
    seen: set[str] = set()
    for item in items:
        value = str(item or '').strip()
        if not value or value in seen:
            continue
        seen.add(value)
        out.append(value)
    return out
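def clean_excerpt(text: str, *, limit: int = 220) -> str:
    """Collapse whitespace, strip quotes, make the text table-safe, and clip at a word boundary."""
    s = str(text or '').strip()
    s = s.replace('\n', ' ')
    s = re.sub(r'\s+', ' ', s)
    s = s.strip(' "\'`')
    # Pipes would break markdown table cells, so swap them for commas.
    s = s.replace('|', ', ')
    if len(s) <= limit:
        return s
    clipped = s[:limit].rsplit(' ', 1)[0].strip()
    return clipped if clipped else s[:limit].strip()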
def citation_list(keys: list[str], *, max_keys: int = 3) -> str:
    """Format up to `max_keys` unique keys as `[@a]`, `[@a] and [@b]`, or `[@a], [@b], and [@c]`."""
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'[@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'
def inline_evidence_phrase(keys: list[str], *, max_keys: int = 3) -> str:
    """Same joining rules as `citation_list`, with each citation phrased as `in [@key]`."""
    items = uniq_keep_order(keys)[:max_keys]
    if not items:
        return ''
    cites = [f'in [@{k}]' for k in items]
    if len(cites) == 1:
        return cites[0]
    if len(cites) == 2:
        return f'{cites[0]} and {cites[1]}'
    return ', '.join(cites[:-1]) + f', and {cites[-1]}'
def heading_blocks(md: str) -> list[tuple[str, str]]:
    """Split markdown into `(H3 title, body)` pairs; an H2 heading ends the current H3 block."""
    blocks: list[tuple[str, str]] = []
    current = ''
    lines: list[str] = []
    for raw in (md or '').splitlines():
        if raw.startswith('### '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = raw[4:].strip()
            lines = []
            continue
        if raw.startswith('## '):
            if current:
                blocks.append((current, '\n'.join(lines).strip()))
            current = ''
            lines = []
            continue
        if current:
            lines.append(raw)
    if current:
        blocks.append((current, '\n'.join(lines).strip()))
    return blocks
def dump_jsonl_lines(records: list[dict[str, Any]]) -> str:
    """Serialize records as JSONL with a single trailing newline (empty string for no records)."""
    return '\n'.join(json.dumps(r, ensure_ascii=False) for r in records).rstrip() + ('\n' if records else '')