@clawhub-c-narcissus-fa49e1df87
---
name: inspiration-case-figure-guide
description: >
  Design and refine publication-facing research paper figures from drafts,
  abstracts, reviewer comments, or short method descriptions. Use when the
  user needs an inspiration-source figure, case schematic, toy example
  walkthrough, motivation/problem-gap board, idea-to-model bridge,
  visual-first candidate board, style comparison, image-generation brief, or
  ChatGPT/OpenAI image prompt for scientific figures.
license: MIT-0
metadata:
  version: "2.7.0"
  openclaw:
    skillKey: inspiration-case-figure-guide
---
# Inspiration / Case Figure Guide
Use this skill as a stateful, human-in-the-loop figure director for research
paper figures. Start from the intended reader effect and paper logic, then
choose figure role, layout, visual rhetoric, style family, and image-generation
brief.
Load only the reference files needed for the current step. Do not bulk-load all
references.
## Hard Rules
- Do not produce SVG, Mermaid, TikZ, Graphviz, HTML/CSS, matplotlib, or other
code-rendered output as the final figure.
- Use the host's native OpenAI image-generation capability for final visuals
when available. If image generation is unavailable, stop before generation
and provide a copyable prompt handoff.
- Keep each turn to one modality:
  - `TEXT_ONLY`: analysis, plan, state update, brief, prompt, critique, or
    question. No image generation.
  - `IMAGE_ONLY`: image generation only. No prose.
- Do not fake image generation or claim an image was generated when no image
tool was used.
- Ask for optional reference images at useful decision points, but never make
them required.
- Analyze reference images as design evidence. Extract layout, hierarchy,
density, label strategy, color semantics, and transferable visual grammar;
do not copy them exactly.
- Give an opinionated default recommendation in every planning or selection
reply, and state what would make another option better.
- Put copyable next-turn user prompts only in the final section named
`下一步你可以这样问`.
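The one-modality rule above can be checked mechanically. The following is a minimal sketch, not part of the skill itself: the function name `turn_violations` and its arguments (`text`, `image_count`) are hypothetical and only illustrate how a host might lint a turn before sending it.

```python
# Illustrative lint for the one-modality-per-turn rule.
# `turn_violations`, `text`, and `image_count` are hypothetical names.
def turn_violations(mode: str, text: str, image_count: int) -> list:
    """Return a list of rule violations for a single turn (empty if clean)."""
    problems = []
    if mode == "TEXT_ONLY":
        # TEXT_ONLY turns may contain prose but no generated images.
        if image_count > 0:
            problems.append("TEXT_ONLY turn must not generate images")
    elif mode == "IMAGE_ONLY":
        # IMAGE_ONLY turns contain generated images and no prose.
        if text.strip():
            problems.append("IMAGE_ONLY turn must not contain prose")
        if image_count == 0:
            problems.append("IMAGE_ONLY turn must contain at least one generated image")
    else:
        problems.append(f"unknown modality: {mode}")
    return problems
```

An empty list means the turn respects the modality rule; the check says nothing about the other hard rules.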
## First Trigger Gate
If this is a new figure-design task and there is no active state with
`start_confirmed: true`, the first reply must be `STARTUP_PLAN_ONLY`.
In that first reply:
1. Show `当前执行计划` near the beginning with
`当前处于:第 0/N 步 — 启动确认与流程预览`.
2. Preview the complete workflow and explain what each step will do.
3. Briefly say what can be inferred from the user's material and what the user
may optionally provide.
4. Mention that reference images can help but are optional.
5. Recommend the default route for proceeding directly.
6. Do not perform substantive figure analysis, scheme ranking, prompt
construction, or image generation yet.
7. End with `当前状态与产物` and `下一步你可以这样问`.
Treat confirmation phrases such as "开始", "继续", "确认开始", "直接开始",
"按默认路线开始", or a new paper/material bundle plus a request to proceed as
confirmation. Then set `start_confirmed: true` and move to intake.
For the full gate behavior, read
`references/startup-confirmation-gate-protocol.md`.
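The gate can be read as a small decision function. The sketch below is illustrative only and makes several assumptions: the function names, the `startup_plan_shown` flag, the `INTAKE` and `CONTINUE_WORKFLOW` labels, and substring matching on the listed phrases are all hypothetical. It does not cover the "new material bundle plus a request to proceed" case, and the authoritative behavior remains in the referenced protocol file.

```python
# Illustrative sketch of the first-trigger gate; all names are hypothetical.
CONFIRM_PHRASES = ("开始", "继续", "确认开始", "直接开始", "按默认路线开始")


def is_confirmation(message: str) -> bool:
    """True if the message contains one of the listed confirmation phrases."""
    return any(phrase in message for phrase in CONFIRM_PHRASES)


def next_mode(state: dict, message: str) -> str:
    """Decide the reply mode for the current turn and update `state` in place."""
    if state.get("start_confirmed"):
        return "CONTINUE_WORKFLOW"
    # Only accept a confirmation after the startup plan has been shown once
    # (assumption: the host tracks this with `startup_plan_shown`).
    if state.get("startup_plan_shown") and is_confirmation(message):
        state["start_confirmed"] = True
        return "INTAKE"
    state["startup_plan_shown"] = True
    return "STARTUP_PLAN_ONLY"
```

Under these assumptions, the first message of a new task always yields `STARTUP_PLAN_ONLY`, and a later "按默认路线开始" flips `start_confirmed` and moves to intake.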
## Visible Plan And State
Every `TEXT_ONLY` reply after the gate must include a visible
`当前执行计划` block near the beginning. Include:
- `当前处于:第 X/Y 步 — <step name>`
- `本轮目标:...`
- `计划步骤:...`
- `本轮是否调整计划:无 / 因为...,调整为...`
Every `TEXT_ONLY` reply must also end with:
```markdown
## 当前状态与产物
- state_version: v2.7
- start_confirmed:
- 当前处于计划第 X/Y 步:
- working_plan:
- fixed_decisions:
- changed_assumptions:
- recommended_default:
- reference_image_status:
- artifacts_so_far:
- immediate_next_action:
## 下一步你可以这样问
1. 请根据引导skill以及当前的状态,继续...
2. ...
```
Make the visible plan, footer, default recommendation, and final prompt
suggestions consistent. Read these references when the conversation involves
continuation, recovery, or plan changes:
- `references/planning-and-state-update-protocol.md`
- `references/plan-step-visibility-protocol.md`
- `references/state-and-turn-contract.md`
- `references/session-state-schema-v2.md`
- `references/next-step-consistency-protocol.md`
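To keep the footer consistent from turn to turn, a host could render it from a single state object rather than retyping it. This is a minimal sketch under stated assumptions: `render_footer`, the `step_name` key, and the plain-dict state are hypothetical, and the real field semantics are defined in `references/session-state-schema-v2.md`.

```python
# Illustrative footer renderer; function and key names are hypothetical.
TAIL_FIELDS = (
    "working_plan",
    "fixed_decisions",
    "changed_assumptions",
    "recommended_default",
    "reference_image_status",
    "artifacts_so_far",
    "immediate_next_action",
)


def render_footer(state: dict, step: int, total: int, next_prompts: list) -> str:
    """Render the state footer and the next-turn prompt list as markdown."""
    confirmed = str(bool(state.get("start_confirmed", False))).lower()
    lines = [
        "## 当前状态与产物",
        "- state_version: v2.7",
        f"- start_confirmed: {confirmed}",
        f"- 当前处于计划第 {step}/{total} 步:{state.get('step_name', '')}",
    ]
    # Emit the remaining fields in the same order as the template.
    for key in TAIL_FIELDS:
        lines.append(f"- {key}: {state.get(key, '')}")
    lines.append("## 下一步你可以这样问")
    lines.extend(f"{i}. {prompt}" for i, prompt in enumerate(next_prompts, 1))
    return "\n".join(lines)
```

Rendering both the footer and the prompt list from one call makes it harder for the visible plan and the suggested next steps to drift apart.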
## Core Workflow
1. Startup confirmation gate.
2. Intake and source readiness: identify paper material, figure slot, missing
inputs, and optional reference images.
3. Figure Effect Contract: define what the reader should understand in 10
seconds and 60 seconds, and what misconception the figure must prevent.
4. Paper compression and bottleneck diagnosis: compress claim, gap, mechanism,
evidence, and the main explanation bottleneck.
5. Figure opportunity map: compare plausible figure roles and recommend one
default direction.
6. Candidate scheme generation: produce multiple text schemes with reader
effect, layout skeleton, content units, style risk, and reviewer risk.
7. Selection and locking: choose or combine a scheme; update fixed decisions.
8. Content architecture and panel choreography: define reading order, labels,
panel count, hierarchy, and aspect ratio.
9. Visual decision rounds: use a figure-direction, layout, style, metaphor, or
density board when the next choice is easier to make visually.
10. Image brief and prompt: prepare a generation-ready brief and negative
constraints.
11. `IMAGE_ONLY` generation: generate the agreed candidate batch only after a
text-only brief turn.
12. Review and final package: diagnose the image, revise if needed, then provide
title, caption draft, callout labels, and paper-integration notes.
Use `references/request-template.md` or
`references/user-input-bundle-template.md` to compress incomplete user input.
## Figure Direction References
For paper-logic and figure-type decisions, read only the relevant files:
- `references/human-best-practice-methodology.md`: high-level method for
top-tier inspiration/case figures.
- `references/taxonomy-reference.md`: reader question, gap type, narrative
role, rhetoric, style, density, and risk taxonomy.
- `references/inspiration-case-patterns.md`: inspiration-source and case
schematic pattern families.
- `references/figure-scheme-patterns.md`: reusable figure schemes such as
motivation contrast, toy storyboard, method pipeline, and idea-to-model
bridge.
- `references/design-principles-by-type.md`: detailed design principles by
figure type.
- `references/recommendation-and-reference-image-protocol.md`: default
recommendation and reference-image handling.
## Visual Decision Boards
When seeing candidates would be more informative, do not force the user to
choose category, layout, style, metaphor, or density from prose alone.
Before a visual board, use a `TEXT_ONLY` reply to specify:
- what stays fixed
- what varies
- how many candidates will be generated
- what the user should compare
- the recommended default if the user wants to proceed
The following turn may be `IMAGE_ONLY`.
Typical board choices:
- Figure-direction board: 3-5 candidates with different figure roles.
- Layout board: 3-5 panel skeletons with the same role and style.
- Style board: 4-8 style families when style is a live decision.
- Metaphor board: 3-5 visual metaphors.
- Density board: 2-4 variants from sparse hero to denser evidence board.
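The typical candidate counts above can be kept in one lookup so that a planned board is easy to sanity-check. This is a sketch only: the keys and the function name `is_typical_count` are hypothetical, and because the source calls these ranges "typical", a count outside the range is a prompt to reconsider rather than an error.

```python
# Typical candidate-count ranges per board type, copied from the list above.
# Dictionary keys and the function name are hypothetical.
BOARD_RANGES = {
    "figure_direction": (3, 5),
    "layout": (3, 5),
    "style": (4, 8),
    "metaphor": (3, 5),
    "density": (2, 4),
}


def is_typical_count(board: str, count: int) -> bool:
    """True if `count` falls inside the typical range for this board type."""
    low, high = BOARD_RANGES[board]
    return low <= count <= high
```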
Read these files when visual choices are active:
- `references/visual-first-decision-board-protocol.md`
- `references/visual-decision-protocol.md`
- `references/visual-style-taxonomy-and-selection.md`
- `references/image-generation-policy.md`
## Image Brief Contract
A generation brief must include:
- figure role and paper slot
- primary reader effect
- central claim/gap/mechanism
- anchor case or analogy
- layout skeleton and reading path
- panel count and aspect ratio
- style family and risk controls
- color semantics
- label and text-density rules
- candidate count
- negative constraints
For reusable wording, read `references/prompt-library.md`.
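Before moving to an `IMAGE_ONLY` turn, the brief can be checked against the contract. The sketch below assumes the brief is a plain dict; the snake_case keys and the function name `missing_brief_fields` are hypothetical stand-ins for the eleven items listed above.

```python
# Illustrative completeness check for a generation brief.
# Key names are hypothetical; they mirror the contract items one-to-one.
REQUIRED_BRIEF_FIELDS = (
    "figure_role_and_paper_slot",
    "primary_reader_effect",
    "central_claim_gap_mechanism",
    "anchor_case_or_analogy",
    "layout_skeleton_and_reading_path",
    "panel_count_and_aspect_ratio",
    "style_family_and_risk_controls",
    "color_semantics",
    "label_and_text_density_rules",
    "candidate_count",
    "negative_constraints",
)


def missing_brief_fields(brief: dict) -> list:
    """Return contract fields that are absent or empty, in contract order.

    Empty strings, empty lists, None, and 0 all count as missing.
    """
    return [key for key in REQUIRED_BRIEF_FIELDS if not brief.get(key)]
```

An empty result means the brief covers every contract item; it does not judge whether the content of each item is good.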
## Completion Criteria
The task is complete when the user has one or more of these artifacts:
- locked figure thesis and role
- selected candidate scheme or visual board direction
- generation-ready image prompt
- generated candidate image batch
- review diagnosis and revision plan
- final title, caption, callout labels, and paper-placement notes
## ClawHub Safety Notes
This is an instruction-only skill. It declares no environment variables, no CLI
binaries, no install steps, and no external service credentials. Publish it
under MIT-0 only; do not add conflicting license language.
FILE:references/design-principles-by-type.md
# Detailed Design Principles by Figure Type
This reference gives richer verbal guidance for how each major figure type should be designed. Use it when the user asks for design philosophy, when a candidate direction needs deeper explanation, or when you want stronger refinement advice without relying on images or code.
## How to Read This Reference
Each figure type is described through:
- `Purpose`: what this type is trying to accomplish in the paper
- `Reader effect`: what should become easier for the reader after seeing it
- `Best paper slot`: where this figure usually belongs in the paper and why
- `Structural priorities`: what must be clear in the layout
- `Text strategy`: how much wording belongs in the figure
- `Typical misuse`: how this type often goes wrong
- `Refinement cues`: how to improve it without changing its role
## A. Motivation / Problem-Gap Figure
### Purpose
This figure establishes necessity. Its job is to make the reader feel that the problem is concrete, the existing framing misses something, and the new method is not solving an invented inconvenience.
### Reader effect
After reading it, the reader should say:
- "I see the failure mode."
- "I understand why current methods do not fully solve it."
- "I now expect the method section to address this exact gap."
### Best paper slot
- `intro`: strongest default; this figure prepares the reader to accept the method
- `analysis`: useful when the paper later returns to a richer failure taxonomy
- `appendix`: acceptable when the motivation is real but not central to the final contribution mix
- `method`: usually only when the paper's structure merges the problem framing tightly into formulation
### Structural priorities
- Put the failure signal first, not the explanation.
- Use a small number of contrasts.
- Organize the figure so the problematic phenomenon is visible before any fix is shown.
- If there is a baseline, show why it fails in the same coordinate system or narrative frame.
### Text strategy
- Use short labels for the failure, missing signal, or bottleneck.
- Keep long explanation out of the figure body.
- If one sentence must appear inside the figure, it should express the entire takeaway.
### Typical misuse
- The figure becomes a mini related-work board.
- The failure only becomes understandable after reading the caption twice.
- The figure announces the method too early and weakens the problem setup.
### Refinement cues
- Remove one panel if two panels already establish the gap.
- Replace explanatory text with one visual contradiction.
- If the problem still feels abstract, swap the broad overview for one strong case.
## B. Toy Example / Case Walkthrough Figure
### Purpose
This figure makes the idea concrete through one controlled example. It is often the best way to show why the method matters when the method itself is conceptually simple but easy to misunderstand.
### Reader effect
After reading it, the reader should be able to retell the example in words and explain what changed between the failing state and the corrected state.
### Best paper slot
- `intro`: use here when one simple case can teach the problem immediately
- `method`: use here when the case is the cleanest bridge into the proposed mechanism
- `analysis`: use here when the case mainly serves interpretation or diagnosis
- `appendix`: use here for longer walkthroughs that would otherwise slow the main story
### Structural priorities
- Keep the same entities, tokens, or objects across all stages.
- Make the transition point visually explicit.
- If there are four panels, they should usually follow setup -> conflict -> intervention -> resolution.
- Avoid branching unless the whole point is comparison.
### Text strategy
- Labels should attach to states or changes, not narrate the whole case.
- Use arrows or numbered stage labels to preserve order.
- Avoid mixing two examples unless the user explicitly needs comparison.
### Typical misuse
- Different examples appear across panels and the reader cannot track identity.
- Every panel contains multiple sub-events, so the story loses rhythm.
- The case is too realistic and noisy for the intended conceptual point.
### Refinement cues
- Reduce to one exemplary transition.
- Replace decorative detail with stage clarity.
- If the figure feels flat, add one intermediate state rather than more commentary.
## C. Method Overview / Architecture Figure
### Purpose
This figure provides the navigational map for the paper. It is the figure readers return to when they forget where a module sits or what flows into what.
### Reader effect
After reading it, the reader should know:
- the major components
- the direction of information flow
- what enters and leaves each important stage
- where the paper's novelty sits
### Best paper slot
- `method`: strongest default; this is normally the anchor figure of the section
- `intro`: useful as an early teaser if the full pipeline is simple enough to preview
- `analysis`: only if the figure is being reused as a decomposition tool rather than as the main architecture
- `appendix`: good for expanded versions, engineering detail, or alternative variants
### Structural priorities
- One dominant reading path is mandatory.
- Novel modules should be visually distinct from routine ones.
- Inputs and outputs should be clearly typed: data, prompts, states, embeddings, losses, actions, etc.
- If there are multiple flows, choose one primary flow and subordinate the rest.
### Text strategy
- Module names should match the paper text exactly.
- Put detail in the caption, not inside every block.
- If a block needs explanation longer than a short phrase, it may belong in a secondary figure.
### Typical misuse
- Every block has the same weight, so novelty and plumbing look identical.
- Auxiliary losses, data sources, and evidence panels are all squeezed into the same board.
- The figure tries to be both an overview and a full algorithm trace.
### Refinement cues
- Pull evidence out into a separate figure if the overview is crowded.
- Highlight only the interfaces the reader must remember.
- Use grouping, spacing, or background regions to show hierarchy.
## D. Idea-to-Model Bridge Figure
### Purpose
This figure explains why the formal model is the right implementation of the intuitive idea. It is especially important when the method introduces new variables, losses, latent states, rewards, constraints, or optimization structures.
### Reader effect
After reading it, the reader should be able to answer:
- why this variable exists
- why this constraint or objective is needed
- how the intuitive principle is encoded in the model
### Best paper slot
- `method`: strongest default because the bridge usually belongs near formalization
- `intro`: useful when the paper needs an intuition-preserving preview before equations
- `analysis`: useful when the bridge is primarily explanatory after the fact
- `appendix`: useful for extended derivational bridges or extra interpretive detail
### Structural priorities
- Preserve a visible path from idea -> intermediate representation -> formal mechanism.
- The middle layer is usually the decisive one.
- Reuse symbols, shapes, or color semantics between the conceptual and formal sides.
- If equations appear, they should anchor the structure rather than dominate it.
### Text strategy
- Prefer naming one quantity per visual role.
- Use concise phrases such as "shared latent factor", "consistency score", or "routing signal".
- Put derivational detail in the caption or body text.
### Typical misuse
- The figure jumps directly from intuition to equations.
- The visual side and formal side use unrelated language and symbols.
- The figure explains the model but never explains why this model reflects the idea.
### Refinement cues
- Add one intermediate state, map, or variable family.
- Mirror notation across left and right halves.
- If the figure feels too mathematical, convert one equation block into a semantic diagram.
## E. Process / Loop / Timeline Figure
### Purpose
This figure explains how the system evolves. Use it when the novelty is procedural, iterative, agentic, or time-dependent.
### Reader effect
After reading it, the reader should understand state transition: what exists before the step, what is updated, what feedback is collected, and how the next step differs.
### Best paper slot
- `method`: strongest default when the process itself is part of the contribution
- `analysis`: strong when the loop view mainly explains behavior or diagnostics
- `intro`: good when procedural novelty is the paper's core hook
- `appendix`: appropriate for long traces, rollout detail, or implementation-specific unrolling
### Structural priorities
- Make the recurrent state explicit.
- Make stage boundaries explicit.
- Show one full cycle before collapsing into abstraction.
- If the loop is complex, number the phases.
### Text strategy
- Time markers help more than long descriptions.
- State names should be stable across the whole loop.
- If arrows proliferate, replace some arrows with stage regions or numbered bands.
### Typical misuse
- The figure is drawn like an overview pipeline even though the point is recurrence.
- Inputs, states, and outputs are visually indistinguishable.
- Feedback arrows overwhelm the main forward path.
### Refinement cues
- Reduce secondary arrows.
- Emphasize the one state that persists across iterations.
- If the loop still feels opaque, add one side panel showing a concrete trace.
## F. Dataset / Benchmark / Protocol Figure
### Purpose
This figure builds trust in the evidence pipeline. It tells the reader how raw material becomes benchmark items, what the evaluation slices are, and why the final evidence should be considered valid.
### Reader effect
After reading it, the reader should understand the benchmark's construction logic rather than treating it as an opaque source of numbers.
### Best paper slot
- `method`: strongest when dataset or benchmark construction is part of the technical contribution
- `intro`: appropriate when the benchmark itself is the headline contribution
- `analysis`: useful when protocol interpretation matters for understanding downstream evidence
- `appendix`: appropriate for curation rules, annotation detail, and extra slice definitions
### Structural priorities
- Funnels, filters, and slices should be aligned.
- Counts or scale markers should be easy to compare.
- If there are multiple benchmark dimensions, show the organizing principle once and reuse it.
- Keep protocol separate from results whenever possible.
### Text strategy
- Use counts, stage names, and slice labels.
- Avoid stuffing definitions and evaluation claims into the same panel.
- If there are many labels, prefer tables or aligned lists outside the main graphic flow.
### Typical misuse
- The protocol is reduced to one tiny box before the paper jumps to results.
- The figure is dominated by decorative arrows but hides the actual curation logic.
- Benchmark taxonomy and metric reporting are mixed into one unreadable sheet.
### Refinement cues
- Split source -> curation -> evaluation into explicit stages.
- Add counts only where they clarify scale.
- If trust is the issue, emphasize filtering criteria rather than ornament.
## G. Evidence / Result / Ablation Figure
### Purpose
This figure persuades. It should tell the reader what to believe, why the claim is supported, and which comparison matters.
### Reader effect
After reading it, the reader should know the main claim and the strongest supporting evidence without scanning every subplot equally.
### Best paper slot
- `analysis`: strongest default; this is where evidence boards usually deliver the most value
- `intro`: use only for one especially persuasive evidence figure that helps sell the paper early
- `method`: usually a weak fit unless evidence is intentionally threaded into the method exposition
- `appendix`: ideal for extra ablations, stress tests, and secondary comparisons
### Structural priorities
- One panel should dominate.
- Supporting panels should answer a clear question such as "is it robust?", "why does it work?", or "where does it fail?"
- The comparison target should be intentional rather than exhaustive.
### Text strategy
- Highlight the claimed result, not every result.
- Use legends and annotations only where the reader may misread the claim.
- Keep axes and legends aligned across related subplots.
### Typical misuse
- The figure is just a dump of all experimental outputs.
- Qualitative and quantitative evidence are mixed with no hierarchy.
- The main claim is hidden because every panel is equally emphasized.
### Refinement cues
- Promote one main comparison.
- Move secondary ablations to appendix if they dilute the message.
- Add a subtitle for the board-level claim if the evidence is visually diverse.
## H. Theory / Proof Intuition Figure
### Purpose
This figure lowers the barrier to formal reasoning. It should provide an intuitive picture for the theorem, proof strategy, or geometry behind the result.
### Reader effect
After reading it, the reader should know what quantity moves, what structure constrains it, and why the formal conclusion is plausible.
### Best paper slot
- `method`: useful when theoretical framing is central to the technical definition
- `analysis`: strongest when the figure interprets theory after the reader already knows the method
- `appendix`: common when the theory matters but is not required for first-pass understanding
- `intro`: rare, but possible when the paper is primarily theoretical and the theorem-level intuition is the main hook
### Structural priorities
- Keep the number of tracked objects very small.
- Spatial structure should correspond to logical structure.
- If there is a bound, show what is bounded and in which direction the argument works.
### Text strategy
- Use labels for quantities, sets, regions, directions, or boundaries.
- Avoid full theorem statements inside the figure.
- Use the caption to connect the visual metaphor to the formal claim.
### Typical misuse
- The figure simply redraws the theorem statement with arrows.
- Too many symbols make the "intuition" harder than the proof sketch.
- The visual metaphor conflicts with the actual logic of the theorem.
### Refinement cues
- Keep only the minimal quantities necessary.
- Turn algebraic dependence into spatial dependence when possible.
- If the theorem has multiple claims, illustrate only the one that unlocks intuition.
## Cross-Type Principles
### 1. Design around the reader's bottleneck
The strongest figure is usually the one that resolves the reader's hardest unresolved question, not the one with the most content.
### 2. Prefer one dominant message per figure
If a figure is trying to motivate, explain the method, and prove the result at once, it usually underperforms on all three jobs.
### 3. Use visual hierarchy as argument hierarchy
The most important panel, path, or quantity should be visually dominant. If everything is equally strong, nothing is prioritized.
### 4. Let captions carry prose
Figures should carry structure. Captions should carry the explanatory sentences that are too long for the drawing.
### 5. Add detail by clarifying transitions, not by adding decoration
When a figure feels weak, the missing piece is often:
- an intermediate state
- a clearer stage boundary
- a stronger case anchor
- a more explicit comparison
Extra icons, colors, or visual embellishment fix a weak figure less often.
### 6. Match the figure to the paper slot
The same figure type changes character depending on placement:
- in `intro`, the figure should reduce entry cost and build motivation quickly
- in `method`, the figure should stabilize the technical map
- in `analysis`, the figure should strengthen belief or interpretation
- in `appendix`, the figure can afford density, edge cases, and extended detail
If a figure feels misplaced, the problem is often not the drawing style but the slot.
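The slot guidance in this reference can be condensed into two lookups. This is a rough sketch, not a rule engine: the type keys and the function name `slot_advice` are hypothetical, the toy-example type has no single strongest slot in the text above, and the dataset and theory entries are conditional in the source ("strongest when ...").

```python
# Strongest paper slot per figure type, as stated in sections A to H above.
# Keys are hypothetical identifiers for the eight figure types.
STRONGEST_SLOT = {
    "motivation_problem_gap": "intro",
    "toy_example_case_walkthrough": None,  # no single strongest slot is named
    "method_overview_architecture": "method",
    "idea_to_model_bridge": "method",
    "process_loop_timeline": "method",
    "dataset_benchmark_protocol": "method",  # conditional in the source
    "evidence_result_ablation": "analysis",
    "theory_proof_intuition": "analysis",  # conditional in the source
}

# What a figure should do in each slot (cross-type principle 6).
SLOT_JOB = {
    "intro": "reduce entry cost and build motivation quickly",
    "method": "stabilize the technical map",
    "analysis": "strengthen belief or interpretation",
    "appendix": "afford density, edge cases, and extended detail",
}


def slot_advice(figure_type: str, slot: str) -> str:
    """Describe the job of `slot` and flag when it is not the strongest slot."""
    job = SLOT_JOB[slot]
    strongest = STRONGEST_SLOT.get(figure_type)
    if strongest is None or strongest == slot:
        return f"{slot}: {job}"
    return f"{slot}: {job} (strongest slot for this type: {strongest})"
```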
FILE:references/figure-scheme-patterns.md
# Figure Scheme Patterns
Use these patterns to convert a structured request into concrete candidate figure directions.
## 1. Motivation Contrast Board
Design goal:
- make the problem feel inevitable rather than optional
- create enough tension that the reader wants the method before seeing it
Best for:
- `why_is_this_problem_real`
- `idea_motivation_or_problem_gap`
- `contrast_before_after`
What this figure is doing cognitively:
- it is not teaching the full method
- it is establishing that the old framing misses something important
- it should reduce the reader's resistance before they enter the method section
Best paper slot:
- `intro`: strongest default; this is where the figure can create necessity before the method appears
- `analysis`: useful when the paper introduces a new failure taxonomy later, but weaker than intro for first use
- `appendix`: only if the motivation is secondary and the main paper must stay narrow
- `method`: usually a poor fit unless the paper structure merges motivation and formulation tightly
Recommended panel recipe:
- Panel A: failure, contradiction, or bottleneck
- Panel B: why the old framing fails
- Panel C: the desired takeaway or target behavior
Design priorities:
- the failure signal should be legible in under three seconds
- the contrast should be structural, not just color-coded
- the conclusion should be a single takeaway, not a mini literature review
Typical failure mode:
- too much text and not enough visual contrast
Refinement lever:
- keep only one memorable failure case
## 2. Toy Case Storyboard
Design goal:
- compress an abstract idea into one concrete, inspectable path
- let the reader mentally replay the logic after leaving the page
Best for:
- `can_i_see_the_core_case`
- `toy_example_or_case_evidence`
- `storyboard_case_walkthrough`
What this figure is doing cognitively:
- it anchors the method in a case the reader can simulate
- it is useful when the abstraction would otherwise feel arbitrary
- it can also serve as a bridge from motivation to mechanism
Best paper slot:
- `intro`: strong when one case can teach the whole problem quickly
- `method`: strong when the case is the cleanest bridge into the mechanism
- `analysis`: useful when the case is mainly diagnostic or interpretive
- `appendix`: good for extended walkthroughs that would overload the core narrative
Recommended panel recipe:
- setup
- confusing state or failure
- intervention
- corrected or revealing outcome
Design priorities:
- keep the same example identity across all panels
- every stage should add one new piece of information
- annotations should sit on the transition, not all over the frame
Typical failure mode:
- too many case branches in one figure
Refinement lever:
- carry a single case consistently across all panels
## 3. Method Overview Pipeline
Design goal:
- give the reader a stable mental map of the whole method
- show where the novelty lives without making the whole pipeline look equally important
Best for:
- `what_are_the_parts_and_data_flow`
- `method_overview_or_architecture`
- `block_arrow_pipeline`
What this figure is doing cognitively:
- it reduces navigation cost for the rest of the paper
- it tells the reader what each block is responsible for
- it should support the section structure, not compete with it
Best paper slot:
- `method`: strongest default; this is usually the anchor figure for the whole section
- `intro`: useful when the paper needs one early high-level teaser of the system
- `analysis`: only if the figure is reframed as a diagnostic decomposition rather than a full overview
- `appendix`: suitable for expanded variants, implementation detail, or extended module breakdown
Recommended panel recipe:
- inputs and setup
- core modules and flow
- outputs and learning targets
Design priorities:
- preserve one dominant reading path
- visually emphasize novel modules, not routine plumbing
- name blocks the same way the text and caption name them
Typical failure mode:
- all modules are drawn with equal weight, so the novelty disappears
Refinement lever:
- visually emphasize only the novel or decision-critical blocks
## 4. Idea-to-Model Bridge
Design goal:
- turn "this idea sounds good" into "I now understand why these variables or losses exist"
- make the implementation feel logically earned
Best for:
- `how_does_the_idea_become_a_model`
- `idea_to_model_logic_bridge`
- `equation_diagram_hybrid`
What this figure is doing cognitively:
- it fills the gap between intuition and formalization
- it is often the missing figure when reviewers say the method feels heuristic
- it is especially useful when the paper introduces constraints, latent states, rewards, or multi-term losses
Best paper slot:
- `method`: strongest default; the figure usually belongs near the formalization or objective section
- `intro`: works if the paper needs an early intuitive bridge before equations
- `analysis`: useful when the bridge is mainly interpretive rather than definitional
- `appendix`: suitable for full derivational bridges or extra mechanism detail
Recommended panel recipe:
- intuition or principle
- intermediate state, variable, or constraint
- loss, module, or objective
Design priorities:
- the middle layer matters most; without it the bridge usually fails
- reuse symbols or visual motifs across concept and model sides
- equations should support the picture, not replace it
Typical failure mode:
- the figure jumps directly from intuition to equations
Refinement lever:
- add one intermediate representation or variable layer
## 5. Process Loop Unroll
Design goal:
- make a time-dependent or iterative contribution easy to follow
- reveal not just components, but state changes and feedback
Best for:
- `what_happens_over_time`
- `training_or_inference_process`
- `feedback_loop_or_cycle`
What this figure is doing cognitively:
- it helps the reader track evolution, not just structure
- it is useful for training curricula, agent loops, search, retrieval, planning, self-improvement, and multi-stage inference
- it usually answers questions the overview pipeline cannot answer alone
Best paper slot:
- `method`: strongest default when the process is part of the core contribution
- `analysis`: strong when the process view is mainly diagnostic or explanatory
- `intro`: useful if the procedural novelty is the paper's most memorable hook
- `appendix`: good for full traces, long rollouts, or implementation-level unrolling
Recommended panel recipe:
- current state
- action or update
- evaluation or feedback
- next state
Design priorities:
- a recurrent state should be clearly marked
- show one full cycle before abstracting
- numbered phases usually help more than extra arrows
Typical failure mode:
- arrows go in too many directions and the reader loses order
Refinement lever:
- force a dominant reading direction and number the stages
## 6. Dataset / Benchmark Protocol Figure
Design goal:
- make the evidence pipeline feel auditable and fair
- show how examples become benchmark units and how claims are measured
Best for:
- `data_or_benchmark_construction`
- `benchmark_or_dataset_protocol`
- `what_evidence_should_i_believe`
What this figure is doing cognitively:
- it builds trust in the benchmark or data contribution
- it prevents the reader from collapsing all evidence into "they collected some data somehow"
- it is especially valuable when curation, slicing, annotation, or evaluation axes are part of the contribution
Best paper slot:
- `method`: strongest when data or benchmark construction is a contribution section of its own
- `intro`: useful when the benchmark itself is the headline contribution
- `analysis`: works when protocol interpretation matters for reading the results
- `appendix`: appropriate for expanded curation details, annotation policies, or extra benchmark slices
Recommended panel recipe:
- source pool
- filtering / annotation
- benchmark slices or evaluation axes
Design priorities:
- counts, funnels, and splits should be visually aligned
- protocol figures should foreground process before headline numbers
- if there are multiple task slices, show the organizing principle once
Typical failure mode:
- protocol and final results are mixed together
Refinement lever:
- separate protocol explanation from evidence comparison
## 7. Evidence Comparison Panel
Design goal:
- make one claim believable with the minimum number of panels
- emphasize evidential force rather than architectural completeness
Best for:
- `result_or_ablation_evidence`
- `quantitative_result_plot`
- `ablation_or_mechanism_probe`
What this figure is doing cognitively:
- it converts results into conviction
- it should make it obvious what changed, whether it helped, and why that support matters
- it is often the strongest figure for rebuttal or camera-ready strengthening
Best paper slot:
- `analysis`: strongest default; this is where evidence boards usually have the most force
- `intro`: useful only when one evidence board is needed to sell the whole paper early
- `method`: usually a poor fit unless the method and evidence are intentionally interleaved
- `appendix`: ideal for extra ablations, failure taxonomies, or lower-priority comparisons
Recommended panel recipe:
- main metric comparison
- qualitative or mechanism evidence
- ablation or boundary case
Design priorities:
- one main panel should dominate
- supporting panels should answer "why believe this result"
- highlight only the comparison that matters to the claim
Typical failure mode:
- too many subplots without one main claim
Refinement lever:
- make one panel dominant and demote the rest to support
## 8. Theory Intuition Figure
Design goal:
- give the reader a picture for a theorem, bound, or proof strategy
- lower the cost of entering formal analysis
Best for:
- `theory_or_proof_intuition`
- `what_is_the_proof_intuition`
- `minimal_vector_or_plot`
What this figure is doing cognitively:
- it translates symbolic structure into spatial or causal structure
- it should explain the role of quantities, not restate the theorem
- it is especially useful when the proof depends on geometry, invariance, partitioning, or monotonic movement
Best paper slot:
- `method`: useful when the theoretical picture is part of the main technical story
- `analysis`: strongest when the theory is interpreted after the core method is already known
- `appendix`: common when the theory is important but not needed for first-pass reading
- `intro`: rare, but possible if the paper's main claim is fundamentally theoretical
Recommended panel recipe:
- visual metaphor
- key quantity
- intuitive consequence
Design priorities:
- minimalism matters more here than elsewhere
- the reader should know exactly which quantity to track
- labels should attach to shapes or directions, not float as prose
Typical failure mode:
- the figure becomes another theorem statement in disguise
Refinement lever:
- remove most symbolic detail and keep only the quantities the viewer must track
## Default Answer Pattern
When the user does not specify a figure family, propose:
1. one direction that solves the logic gap most directly
2. one direction that is easiest for readers to scan
3. one direction that is strongest for persuasion
Then recommend one of the three based on the user's `figure_slot` and `density`.
FILE:references/human-best-practice-methodology.md
# Human Best-Practice Methodology for Inspiration / Case Scientific Figures
This reference converts human scientific-figure practice into a repeatable workflow.
## 1. Start with reader effect
Before drawing, specify:
- 10-second takeaway
- 60-second takeaway
- reader question
- misconception to prevent
- figure thesis
- anchor case or evidence
A figure should be designed as a visual argument, not as decoration.
## 2. Compress the paper logic
Use this compression chain:
```text
Prior framing → observed failure / blind spot → design principle → mechanism → expected benefit
```
For case figures:
```text
Case setup → baseline confusion / failure → proposed intervention → corrected outcome → why it matters
```
For inspiration-source figures:
```text
Observed phenomenon / human practice → transferable principle → computational mechanism → paper contribution
```
## 3. Diagnose the bottleneck
Common bottlenecks:
- motivation is abstract
- method leap is abrupt
- baseline-vs-ours difference is unclear
- analogy is decorative
- case is vivid but not tied to claim
- contribution is hidden inside modules
- figure is a list disguised as a diagram
## 4. Choose figure role before style
Choose among:
- motivation / problem-gap board
- inspiration-source figure
- toy example / case walkthrough
- idea-to-model bridge
- method intuition figure
- qualitative example figure
- mechanism + result snapshot
## 5. Choreograph panels
Ask:
- where should the eye enter?
- what action happens next?
- what is the turning point?
- what is the final insight?
- what can move to the caption?
## 6. Design layout
Common layouts:
- left-to-right transformation
- top-down narrative
- before / after / why
- bridge: real world ↔ model
- central mechanism with callouts
- one-case storyboard
## 7. Visual hierarchy and editing
- Make the main claim visually dominant.
- Remove any element that does not support the thesis.
- Avoid redundant arrows.
- Use white space to separate conceptual groups.
- Keep labels short and few.
- Keep colors semantically stable.
## 8. Generate multiple candidates
When the user must judge style, layout, metaphor, or density, show multiple candidates:
- early: 3–5
- refinement: 2–4
- final repair: 1–2
Ask the user to choose by image and comment on concrete axes.
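A minimal sketch of the candidate-count ranges above, assuming a host wants to clamp a requested batch size before generation. The phase keys and the function name are hypothetical and not part of this skill.

```python
# Candidate-count ranges per workflow phase, copied from the list above.
# The phase keys and function name are illustrative assumptions.
CANDIDATE_RANGES = {
    "early": (3, 5),
    "refinement": (2, 4),
    "final_repair": (1, 2),
}


def clamp_candidate_count(phase: str, requested: int) -> int:
    """Clamp a requested candidate count into the range for the given phase."""
    low, high = CANDIDATE_RANGES[phase]
    return max(low, min(high, requested))
```

A request for eight early candidates would be reduced to five under this sketch; a request for one refinement candidate would be raised to two.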
## 9. Review using a figure audit
Audit dimensions:
- paper-claim fit
- reader effect
- panel order
- visual hierarchy
- label accuracy
- text load
- metaphor-mechanism alignment
- reviewer-friendliness
- novelty emphasis
- accessibility and color consistency
FILE:references/image-generation-policy.md
# Image Generation Policy
## Banned Outputs
Do not generate or provide:
- SVG
- inline SVG
- Mermaid
- TikZ
- Graphviz
- HTML/CSS diagrams as final figure
- Python/matplotlib diagrams as final figure
- code that the user must render as the figure
## Required Route
Use ChatGPT Images 2.0, or whatever image-generation capability the host provides, for final visuals.
In ChatGPT web: use Create Image / image generation.
In other environments: if no image generation API/tool exists, output a text-only prompt and tell the user to run it in ChatGPT Images 2.0 or another image-generation API.
## Text in Generated Figures
Image models may distort dense text. Therefore:
- use minimal labels
- keep labels short
- avoid equations unless central and simple
- put long explanation in the caption, not inside the image
- if exact typography is critical, generate a near-final visual with minimal text and ask for a separate human/layout pass outside this skill
## Visual-first exploratory boards
Image generation is not reserved only for the final polished figure. If the user needs to choose figure direction, layout, visual style, metaphor, or density, the assistant may use an exploratory `IMAGE_ONLY` multi-candidate board earlier in the workflow.
These boards must still obey turn separation: the preceding text turn explains what stays fixed and what varies; the board turn contains images only; the following text turn reviews candidates and updates state.
FILE:references/inspiration-case-patterns.md
# Inspiration and Case Figure Patterns
## Inspiration-Source Figures
### Real-World Analogy to Mechanism
Use when the paper idea was inspired by a real-world process, human reasoning, biological system, scientific workflow, or expert practice.
Panel flow:
1. concrete observed practice
2. transferable principle
3. computational abstraction
4. paper method/effect
### Failure to Design Principle
Use when the most persuasive figure should start from a failure case.
Panel flow:
1. baseline failure
2. hidden missing signal
3. design principle
4. proposed behavior
### Expert Reasoning to Algorithm
Use when the method imitates or formalizes expert behavior.
Panel flow:
1. expert trace
2. abstract operations
3. algorithm modules
4. final output
## Case Schematic Figures
### One Case Storyboard
Use when one concrete example can teach the contribution.
Panel flow:
1. setup
2. confusing/failing state
3. intervention
4. corrected/revealing outcome
### Before / After / Why
Use when qualitative difference is persuasive but needs to be tied to the paper claim.
Panel flow:
1. input/scenario
2. old output
3. new output
4. why the difference matters
### Case + Mechanism Overlay
Use when the internal mechanism matters more than the before/after result alone.
Panel flow:
1. raw case
2. critical regions/tokens/states
3. mechanism overlay
4. final decision/output
FILE:references/next-step-consistency-protocol.md
# Next-Step Consistency Protocol
Use this protocol in every `TEXT_ONLY` turn.
## Purpose
The user should not see scattered or contradictory suggestions about what to ask next. The answer may discuss the workflow and recommend an action, but copyable follow-up prompts belong only at the very end.
## Hard rule
Only the final section named `下一步你可以这样问` may contain copyable user prompts.
## Allowed in the body
- recommended default action
- workflow stage
- plan status
- trade-off analysis
- candidate comparison
- prompt draft or image brief
- state summary
## Not allowed in the body
- “你可以这样问...”
- “下一步可以问...”
- numbered lists of follow-up user messages
- copyable recovery prompts outside the final section
- prompt suggestions that conflict with the default recommendation
## Final-section checks
Before sending a text response, verify:
1. The final section exists.
2. It is the last section in the answer.
3. Item 1 matches `recommended_default` unless explicitly explained.
4. Alternatives are compatible and not contradictory.
5. The fallback prompt is included when useful: `请根据引导skill以及当前的状态,继续告诉我下一步做什么。`
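Checks 1 and 2 are mechanical and can be sketched as a small validator. This is an illustration only: it assumes the reply uses Markdown `#` headings and that the heading text matches the section name exactly, and the function name is hypothetical. Checks 3 to 5 need judgment about content and are not covered.

```python
FINAL_HEADING = "下一步你可以这样问"


def check_final_section(reply: str) -> list[str]:
    """Return a list of problems; an empty list means checks 1 and 2 pass."""
    # Collect heading texts in order of appearance (assumes Markdown '#' headings).
    headings = [
        line.lstrip("#").strip()
        for line in reply.splitlines()
        if line.startswith("#")
    ]
    problems = []
    if FINAL_HEADING not in headings:
        problems.append("final section is missing")
    elif headings[-1] != FINAL_HEADING:
        problems.append("final section is not the last section")
    return problems
```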
## State footer wording
The state footer may include `下一轮建议`, but it should be phrased as an action summary, such as:
- `下一轮建议(动作,不写成用户提问句):锁定候选2并转成图像生成prompt。`
Do not phrase state bullets as copyable prompts.
FILE:references/plan-step-visibility-protocol.md
# Plan Step Visibility Protocol
Every text-only reply must make the execution plan visible, not just store it in the footer. The user should always know which step of the plan is being executed now.
## 1. Required visible plan block
Every `TEXT_ONLY` reply must include a compact visible plan block near the beginning of the answer, before detailed analysis or recommendations. This applies to opening turns, continuation turns, review turns, and short corrective turns.
Use a concise block such as:
```markdown
## 当前执行计划
- 当前处于:第 X/Y 步 — <step name>
- 本轮目标:...
- 计划步骤:
1. ... ✅ / 已完成
2. ... ⏳ / 当前
3. ... ⬜ / 待执行
4. ... ⬜ / 待执行
- 本轮是否调整计划:无 / 因为...,调整为...
```
For very short rule-update answers, the block may be shorter, but it must still state `当前处于:第 X/Y 步`.
## 2. Current-step marking
The current step must be explicit. Do not only say “当前执行计划:继续推进”. Mark the exact step as one of:
- `第 1 步:输入与材料判断`
- `第 2 步:Figure Effect Contract`
- `第 3 步:论文逻辑压缩与瓶颈诊断`
- `第 4 步:图机会地图`
- `第 5 步:候选图方案生成`
- `第 6 步:方案锁定`
- `第 7 步:内容架构与 panel choreography`
- `第 8 步:视觉决策轮 / 参考图分析`
- `第 9 步:图像 brief / prompt 构建`
- `第 10 步:IMAGE_ONLY 图像生成`
- `第 11 步:图稿诊断与修订`
- `第 12 步:最终 caption / legend / 正文插入段`
The assistant may collapse or rename steps for a specific task, but it must still show the current step number and total number of steps.
## 3. Footer alignment
The state footer must repeat the current plan step in compact form:
```markdown
- 当前处于计划第 X/Y 步:...
- 当前执行计划:...
```
The visible plan block, state footer, default recommendation, and final `下一步你可以这样问` prompts must all agree.
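The agreement between the visible plan block and the footer can be spot-checked mechanically. The sketch below assumes both markers use numeric `X/Y` values and the exact wording shown above; the function name is hypothetical. It does not verify the default recommendation or the final prompts.

```python
import re

# Marker in the visible plan block, e.g. "当前处于:第 3/9 步".
PLAN_STEP = re.compile(r"当前处于:第\s*(\d+)\s*/\s*(\d+)\s*步")
# Marker in the state footer, e.g. "当前处于计划第 3/9 步".
FOOTER_STEP = re.compile(r"当前处于计划第\s*(\d+)\s*/\s*(\d+)\s*步")


def steps_agree(reply: str) -> bool:
    """True only if both markers exist and report the same step and total."""
    plan = PLAN_STEP.search(reply)
    footer = FOOTER_STEP.search(reply)
    if plan is None or footer is None:
        return False
    return plan.groups() == footer.groups()
```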
## 4. Do not violate final-only prompt rule
The visible plan block may say what action will happen next, but it must not contain copyable user-prompt sentences. Copyable next-turn prompts still belong only in the final `下一步你可以这样问` section.
## 5. Startup gate step
Since v2.5, the first skill-trigger response uses a special step before substantive execution:
`当前处于:第 0/N 步 — 启动确认与流程预览`
This step must still use the normal visible plan block, but it previews the workflow and asks for confirmation rather than analyzing the paper. After confirmation, the next current step should become `第 1/N 步 — 输入与材料判断`.
FILE:references/planning-and-state-update-protocol.md
# Planning and State Update Protocol
This protocol makes the skill feel like a guided design process rather than a sequence of disconnected prompts.
## 0. Startup confirmation gate
For every newly triggered figure-design task, if there is no active state with `start_confirmed: true`, the first text-only reply must be `STARTUP_PLAN_ONLY`.
The assistant must first show the full workflow plan, describe each step, list helpful optional materials, recommend a default start route, and wait for user confirmation. It must not perform substantive figure analysis, paper interpretation, scheme selection, prompt construction, or image generation in this first gate response.
The startup gate should mark the current step as:
`当前处于:第 0/N 步 — 启动确认与流程预览`
Set in the footer:
- `start_confirmed: false`
- `awaiting_user_confirmation: true`
- `阶段:Startup Confirmation Gate`
After the user confirms, set `start_confirmed: true` and proceed to Round 0 / Intake.
## 1. Opening execution plan after confirmation
Once the user confirms start, begin the actual workflow with a compact execution plan. The plan should be visible to the user and should come before detailed figure analysis.
Required plan fields:
- `current_stage`: where the workflow is starting
- `goal_this_round`: what the current answer will produce
- `planned_steps`: 3–6 upcoming steps, such as intake → effect contract → opportunity map → candidate schemes → image brief → image generation
- `inference_policy`: what can be inferred from the draft and what would genuinely need user input
- `reference_image_check`: whether optional reference images would help now
- `default_route`: recommended path if the user wants to proceed directly
## 2. Adaptive plan updates
The plan is allowed to change. Update it when:
- the user adds new paper material or changes the target figure slot
- a figure effect contract is created or revised
- the user chooses or rejects a scheme
- reference images reveal useful visual principles
- generated candidates expose a layout, density, or metaphor problem
- the workflow moves to prompt generation, image generation, review, or final captioning
When the plan changes, include a short note: `计划调整:因为...,所以接下来...`
## 2A. Visible plan and current step in every text turn
Every text-only reply must show a compact plan block near the beginning of the answer. It is not enough to store the plan in the state footer. The user must be able to see both the plan and exactly where the current turn sits within it.
Required fields:
- `当前处于:第 X/Y 步 — <step name>`
- `本轮目标:...`
- `计划步骤:...` with completed/current/waiting markers
- `本轮是否调整计划:无 / 因为...,调整为...`
The footer must repeat the same step as `当前处于计划第 X/Y 步:...`.
The plan block may describe actions, but must not include copyable next-turn user prompts.
## 3. Mandatory state update after every text answer
Every text-only reply must end with `当前状态与产物` and `下一步你可以这样问`. This includes short acknowledgements, corrections, and turns that only modify a rule.
The state update must not be a blank template. It must record the newest state of the task:
- current stage
- current plan step, written as `当前处于计划第 X/Y 步:...`
- startup gate status: `start_confirmed` and `awaiting_user_confirmation`, when relevant
- current working plan
- plan adjustment, if any
- fixed decisions
- unresolved decisions
- recommended default
- reference image status
- artifacts created so far
- next recommended action
## 4. Compact footer pattern
Use this pattern at the end of every text-only answer:
```markdown
## 当前状态与产物
- 阶段:...
- 当前处于计划第 X/Y 步:...
- start_confirmed:true / false / 已完成启动门控
- awaiting_user_confirmation:true / false
- 当前执行计划:...
- 计划调整:无 / 因为...,已调整为...
- 已定:...
- 待定:...
- 默认推荐:...
- 参考图状态:未询问 / 已询问可选 / 已提供 / 已分析 / 暂不需要
- 产物:...
- 下一轮建议(动作,不写成用户提问句):...
- 渲染规则提醒:ChatGPT web 使用原生图像生成的独立动作;OpenClaw/Codex/Trae/API 使用 OpenAI ChatGPT Images 2.0 或更新模型;禁止 SVG / Mermaid / TikZ / Graphviz / 代码绘图替代。
## 下一步你可以这样问
1. `请根据引导skill以及当前的状态,继续...`
2. `请根据引导skill以及当前的状态,继续...`
3. 不知道下一步时:`请根据引导skill以及当前的状态,继续告诉我下一步做什么。`
```
For a startup-gate response, prompt 1 should normally be a confirmation/start prompt.
## 5. What not to do
Do not:
- start paper analysis in the first trigger before the user confirms the workflow
- answer only with prose and no state footer
- hide the plan internally
- keep using an outdated plan after the user selects a different route
- ask for reference images as a blocker
- treat the state footer as optional when the answer is short
## v2.3 next-step prompt consistency
The final section `下一步你可以这样问` is the only place where copyable next-turn prompts may appear.
Rules:
- Do not place prompt suggestions in the opening plan, body, tables, default recommendation, or state bullets.
- In the body, write actions rather than user prompts: use “默认推荐动作:锁定布局骨架” instead of “你可以问我继续锁定布局骨架”.
- In the state footer, `下一轮建议` must be an action summary, not a copyable sentence.
- The first prompt in `下一步你可以这样问` should normally match the default recommendation.
- If the final prompt list offers alternatives, they must be mutually compatible and labeled by purpose.
- Run a last-pass consistency check: no earlier line should tell the user to ask something that contradicts the final list.
Do not:
- place suggested next-turn user prompts anywhere except the final `下一步你可以这样问` section
- let the final prompt list contradict the default recommendation or current plan
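The last-pass consistency check can be approximated with a simple scan of everything before the final section. The sketch below is an assumption-laden illustration: the marker strings are taken from examples in this skill pack, the function name is hypothetical, and a string match cannot catch every rephrased prompt suggestion.

```python
FINAL_HEADING = "## 下一步你可以这样问"
# Phrases that signal a copyable user prompt; drawn from examples in this skill pack.
BANNED_IN_BODY = ("你可以这样问", "下一步可以问", "请根据引导skill以及当前的状态")


def body_prompt_leaks(reply: str) -> list[str]:
    """Return body lines (before the final section) that look like copyable prompts."""
    # If the final heading is absent, the whole reply is treated as body.
    body, _, _ = reply.partition(FINAL_HEADING)
    return [
        line
        for line in body.splitlines()
        if any(marker in line for marker in BANNED_IN_BODY)
    ]
```

An empty result means no obvious leak was found; it does not prove the body is free of prompt suggestions.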
## v2.7 visual decision board state
When a visual decision board is proposed, generated, reviewed, skipped, or selected, update the state footer with:
- `视觉决策板状态`
- board type
- varied axis
- fixed elements
- candidate count
- selected candidate, if any
- default recommendation
This board state is separate from the final image-generation state.
FILE:references/prompt-library.md
# User Prompt Library
## Start from a paper
`请根据引导skill以及当前的状态,继续阅读我的论文初稿,并先输出适合做灵感来源图/案例示意图的3个方向。`
## Deepen one direction
`请根据引导skill以及当前的状态,继续细化第2个方向,输出panel-by-panel方案。`
## Convert to image brief
`请根据引导skill以及当前的状态,继续把当前方案转成ChatGPT Images 2.0生成prompt,但本轮只输出文字。`
## Generate image
`请根据引导skill以及当前的状态,继续只生成图。`
## After image generation
`请根据引导skill以及当前的状态,继续告诉我下一步做什么。`
## Review a generated draft
`请根据引导skill以及当前的状态,继续诊断这版图的问题,并给出最小修改版prompt。`
## Compare visual styles
`请根据引导skill以及当前的状态,继续生成一个视觉风格候选板,比较3D、卡通/故事板、磁贴/卡片、editorial flat和正式架构风,并给出默认推荐风格。`
## Lock a style
`请根据引导skill以及当前的状态,继续锁定默认推荐风格,并把它写入ChatGPT Images 2.0图像生成prompt。`
FILE:references/recommendation-and-reference-image-protocol.md
# Recommendation and Reference Image Protocol
This reference governs two small but important behaviors in every research-figure design conversation.
## 1. Give the user a recommended choice
After presenting multiple options, always identify one recommended default so the user can continue without decision fatigue.
Recommended wording:
- `我建议优先选择:方案 X。理由:...`
- `默认推荐路线:...`
- `如果你想直接推进,我建议下一步做:...`
A good recommendation should be:
1. **claim-aligned**: it supports the paper's main claim and Figure Effect Contract.
2. **reader-centered**: it improves the 10-second and 60-second understanding.
3. **reviewer-safe**: it reduces likely misunderstanding.
4. **visually feasible**: it can be rendered clearly with limited labels.
5. **reversible**: it says when another option would be better.
Do not make the recommendation purely aesthetic.
## 2. Ask for optional reference images at useful moments
Ask for 1–3 reference images when they would help determine layout, density, style family, visual metaphor, or revision direction.
Use this low-friction sentence:
`如果你有1–3张参考图,可以发给我,我会分析它们的布局、信息层级、文字密度和视觉语言;如果没有,我也会继续按论文主张推进。`
Do not ask in every turn. Good moments include:
- initial intake for a visually ambitious figure
- before style-family or layout-skeleton candidate generation
- before converting a selected scheme into an image prompt
- when the user has feedback like “更像顶会论文图” or “更像某篇论文的图”
- when reviewing generated images
## 3. How to analyze provided reference images
Analyze reference images as design evidence. Extract:
- figure role and reader effect
- reading path
- panel count and panel boundaries
- visual hierarchy
- label density and label placement
- color semantics
- object vocabulary and icon style
- metaphor strength
- what is transferable to the current paper
- what should be avoided
## 4. What not to do
- Do not require reference images.
- Do not stop if no reference images are supplied.
- Do not copy the exact composition, distinctive style, marks, or labels of a reference figure.
- Do not let a reference image override the paper's own claim.
- Do not recommend an option without explaining why it serves the reader effect.
FILE:references/request-template.md
# Request Compression Template
Use this template internally when the user provides a figure need.
## Slots
- `claim`:
- `gap`:
- `evidence`:
- `anchor_case`:
- `figure_slot`:
- `density`:
- `style_bias`:
## Conversion Rule
Convert the slots into:
- one primary reader question
- one primary logical gap
- one primary figure role
- one preferred rhetoric
- one preferred visual grammar
## Response Rule
Return:
1. interpreted need
2. three candidate directions
3. strongest recommendation
4. one next-step refinement
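The slots can be held in a small record so that empty ones are easy to spot. This is a sketch under assumptions: the class and method names are hypothetical, and an empty slot only marks something to infer from the draft or ask about, not an error.

```python
from dataclasses import dataclass, fields


@dataclass
class RequestSlots:
    """Internal request slots; field names mirror the Slots list above."""

    claim: str = ""
    gap: str = ""
    evidence: str = ""
    anchor_case: str = ""
    figure_slot: str = ""
    density: str = ""
    style_bias: str = ""

    def missing(self) -> list[str]:
        """Names of slots that are still empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]
```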
FILE:references/session-state-schema-v2.md
# Session State Schema v2.7
Every text-only reply should preserve a compact state. v2.5 added the startup confirmation gate: the first skill-trigger response previews the plan and waits for confirmation before substantive figure design. v2.6 added explicit visual-style taxonomy state. v2.7 keeps both and adds visual-first decision board state for early category/layout/style/metaphor/density choices.
```yaml
state_version: v2.7
mode_this_turn: TEXT_ONLY | IMAGE_ONLY
start_gate:
start_confirmed: false | true
awaiting_user_confirmation: false | true
startup_plan_shown: false | true
confirmation_summary:
working_plan:
plan_id:
visible_plan_block_required: true
plan_status: startup_gate | opening | active | adjusted | completed
current_stage:
current_step_index:
total_steps:
current_step_name:
goal_this_round:
planned_steps:
completed_steps:
plan_adjustments:
next_step:
stage: Startup Confirmation Gate | Intake | Effect Contract | Opportunity Map | Candidate Schemes | Selected Scheme | Visual Decision | Image Brief Ready | Image Generated | Review | Revision | Final Text Package
reference_images:
status: not_asked | requested_optional | provided | analyzed | not_needed
count:
transferable_principles:
avoid_copying:
recommended_default:
current_recommendation:
reasons:
alternative_when:
paper_summary:
topic:
main_claim:
problem_gap:
core_mechanism:
contribution_delta:
figure_effect_contract:
figure_slot:
target_reader:
10_second_takeaway:
60_second_takeaway:
reader_question:
misconception_to_prevent:
figure_thesis:
anchor_case_or_evidence:
figure_decisions:
figure_family:
selected_scheme:
panel_count:
reading_path:
layout_skeleton:
visual_rhetoric:
style_family:
style_candidates_considered:
default_style_recommendation:
style_rationale:
style_risks:
aspect_ratio:
label_policy:
color_semantics:
density:
visual_decision_boards:
visual_decision_mode: text_only | exploratory_image_board | final_image_batch
visual_board_recommended: false | true
visual_board_type: figure_direction | layout | style | metaphor | density | refinement | final_candidate
visual_board_axis_varied:
visual_board_candidate_count:
visual_board_status: not_started | proposed | confirmed | generated | reviewed | skipped
visual_board_fixed_elements:
selected_visual_candidate:
default_visual_recommendation:
visual_candidate_history:
- batch_id:
candidate_count:
varied_axis:
fixed_elements:
user_selection:
rejected_reason:
artifacts:
- name:
type:
status:
open_questions:
- ...
next_recommended_actions:
- ...
```
## Mandatory footer
Every text-only response must end with an updated footer. The state cannot be copied forward unchanged unless nothing changed; even then, the current working plan and next action must be restated. Every text-only response must include a visible `当前执行计划` block near the beginning and must end with:
```markdown
## 当前状态与产物
- 阶段:...
- 当前处于计划第 X/Y 步:...
- start_confirmed:true / false / 已完成启动门控
- awaiting_user_confirmation:true / false
- 当前执行计划:...
- 计划调整:无 / ...
- 已定:...
- 待定:...
- 默认推荐:...
- 参考图状态:未询问 / 已询问可选 / 已提供 / 已分析 / 暂不需要
- 视觉风格状态:未开始 / 已比较 / 已推荐 / 已锁定;默认推荐风格:...
- 视觉决策板状态:未开始 / 建议生成 / 已确认 / 已生成 / 已评审 / 已跳过;类型:...;变化轴:...
- 产物:...
- 下一轮建议(动作,不写成用户提问句):...
- 渲染规则提醒:ChatGPT web 使用原生图像生成的独立动作;OpenClaw/Codex/Trae/API 使用 OpenAI ChatGPT Images 2.0 或更新模型;禁止 SVG / Mermaid / TikZ / Graphviz / 代码绘图替代。
## 下一步你可以这样问
1. `请根据引导skill以及当前的状态,继续...`
2. `请根据引导skill以及当前的状态,继续...`
3. 不知道下一步时:`请根据引导skill以及当前的状态,继续告诉我下一步做什么。`
```
For the startup gate, the first final prompt should normally ask to confirm start and continue to 第1步.
## v2.3 next-step prompt consistency
- `next_recommended_actions` stores action summaries only.
- Copyable user prompts must be printed only in the final `下一步你可以这样问` section.
- The first final prompt should normally match `recommended_default`.
- The state footer's `下一轮建议` should be phrased as an action, not as a user question.
## v2.4 plan-step visibility
- Every text-only answer must show a visible `当前执行计划` block near the beginning.
- The visible block must include `当前处于:第 X/Y 步 — <step name>`.
- The footer must repeat the same current step as `当前处于计划第 X/Y 步:...`.
- The visible plan block, footer, default recommendation, and final prompt list must be consistent.
- The visible plan block must not contain copyable next-turn prompt suggestions; those remain only in `下一步你可以这样问`.
## v2.5 startup confirmation gate
- First skill-trigger reply uses `stage: Startup Confirmation Gate`.
- First skill-trigger reply sets `start_confirmed: false` and `awaiting_user_confirmation: true`.
- No substantive figure analysis happens until the user confirms.
- After confirmation, set `start_confirmed: true` and move to Intake.
## v2.6 visual style taxonomy
- When visual style is a decision, compare a compact set of relevant style families rather than giving only a vague style adjective.
- Supported mainstream choices include editorial flat, formal architecture schematic, mechanism snapshot, premium scientific illustration, isometric / soft 3D, low-poly abstract 3D, cartoon / comic-lite, storyboard, tile / card / mosaic, paper-cut collage, blueprint, dashboard metaphor, mini-evidence infographic, and minimal line-art.
- The assistant must recommend one `默认推荐风格` and record style rationale and risks in state.
- For high-variance styles such as 3D, cartoon, photorealistic, and tile/mosaic boards, record risk controls in `style_risks`.
## v2.7 visual-first decision boards
- When category, layout, style, metaphor, or density is primarily a visual decision, the assistant may recommend or proceed to an exploratory `IMAGE_ONLY` board before the final figure-generation round.
- Boards are tracked separately from final image batches using `visual_decision_boards`.
- Board types include `figure_direction`, `layout`, `style`, `metaphor`, `density`, `refinement`, and `final_candidate`.
- A board should normally vary one dominant axis while fixing paper thesis, anchor case, and core labels.
- After a board is reviewed, record the selected candidate and default recommendation.
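A minimal sketch of how the board fields in `visual_decision_boards` could be sanity-checked, assuming the state is held as a plain dictionary keyed by the schema names above. The function name is hypothetical; the allowed values are copied from the schema.

```python
BOARD_TYPES = {
    "figure_direction", "layout", "style", "metaphor",
    "density", "refinement", "final_candidate",
}
BOARD_STATUSES = {
    "not_started", "proposed", "confirmed", "generated", "reviewed", "skipped",
}


def validate_board(board: dict) -> list[str]:
    """Return a list of problems found in one visual decision board record."""
    problems = []
    if board.get("visual_board_type") not in BOARD_TYPES:
        problems.append("unknown visual_board_type")
    if board.get("visual_board_status") not in BOARD_STATUSES:
        problems.append("unknown visual_board_status")
    # After review, the selection and the default recommendation must be recorded.
    if board.get("visual_board_status") == "reviewed":
        for key in ("selected_visual_candidate", "default_visual_recommendation"):
            if not board.get(key):
                problems.append(f"{key} must be recorded after review")
    return problems
```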
FILE:references/startup-confirmation-gate-protocol.md
# Startup Confirmation Gate Protocol
This protocol prevents the skill from jumping into figure design before the user understands the workflow.
## 1. First-trigger behavior
When the skill is invoked for a new task and there is no active state with `start_confirmed: true`, the first text-only reply must be `STARTUP_PLAN_ONLY`.
The assistant must not yet perform substantive figure analysis, scheme ranking, prompt construction, or image generation. It may briefly acknowledge the available input, but it should not start reading or interpreting the paper in detail.
## 2. What the first reply must contain
The first reply must include:
1. a visible `当前执行计划` block with `当前处于:第 0/N 步 — 启动确认与流程预览`
2. a complete step-by-step workflow preview, with each step described in plain language
3. what the user can provide before starting: draft/PDF, abstract, method summary, existing sketch, reference figures, target slot, constraints, style preference, and preferred or avoided visual families (such as 3D, cartoon/comic-lite, tile/card/mosaic, editorial flat, or formal architecture)
4. how the skill will behave after confirmation: execute one step at a time, update state after every text turn, adapt the plan when necessary, keep copyable prompts only at the end
5. a recommended default route for users who want to proceed directly
6. optional reference-image note: reference images help but are not required
7. a state footer indicating `start_confirmed: false` and `awaiting_user_confirmation: true`
8. final `下一步你可以这样问` prompts, with the first prompt normally being a confirmation/start prompt
## 3. What counts as confirmation
After the startup gate, treat any of the following as confirmation:
- the user says 确认开始 / 开始 / 继续 / 直接开始 / 按默认路线开始 / 可以开始
- the user sends paper material with an instruction to proceed
- the user chooses a target figure slot or route and asks to continue
- the user uses a final suggested prompt that explicitly asks to start the workflow
Once confirmed, set `start_confirmed: true` and move to Round 0 / Intake.
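The first confirmation rule (explicit phrases) and the resulting state change can be sketched as follows. This is an illustration only: the function names are hypothetical, exact matching is a deliberately conservative assumption, and resetting `awaiting_user_confirmation` to `false` is inferred rather than stated above. The other three rules require reading the user's intent and are not covered.

```python
# Explicit confirmation phrases listed in the first rule above.
CONFIRM_PHRASES = ("确认开始", "开始", "继续", "直接开始", "按默认路线开始", "可以开始")


def is_explicit_confirmation(message: str) -> bool:
    """True if the message is exactly one listed phrase, ignoring end punctuation."""
    text = message.strip().strip("。.!!")
    return text in CONFIRM_PHRASES


def apply_confirmation(state: dict, message: str) -> dict:
    """Return an updated copy of the state if the message confirms the start."""
    if not is_explicit_confirmation(message):
        return state
    return {
        **state,
        "start_confirmed": True,
        "awaiting_user_confirmation": False,  # assumption: cleared on confirmation
        "stage": "Intake",
    }
```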
## 4. If the user changes the plan before confirmation
If the user edits the workflow, target slot, candidate count, reference-image policy, visual style policy, or response style before confirmation, update the startup plan and remain in the confirmation gate until the user confirms.
## 5. If the user asks to skip confirmation
If the first user message already says they want to skip the startup gate or immediately execute, still give a very compact gate once unless the skill already has an active confirmed state. The gate may be short, but it must still ask for confirmation before substantive work.
## 6. First-trigger template
```markdown
## 当前执行计划
- 当前处于:第 0/9 步 — 启动确认与流程预览
- 本轮目标:先展示完整制图流程、每一步会做什么、需要/可选材料、默认推进路线;等用户确认后再进入第 1 步。
- 计划步骤:
0. 启动确认与流程预览 ⏳ 当前
1. 输入与材料判断 ⬜ 待确认后执行
2. Figure Effect Contract ⬜ 待执行
3. 论文逻辑压缩与瓶颈诊断 ⬜ 待执行
4. 图机会地图与默认推荐 ⬜ 待执行
5. 候选方案生成与锁定 ⬜ 待执行
6. 图像 brief / prompt 构建 ⬜ 待执行
7. IMAGE_ONLY 候选图生成 ⬜ 待执行
8. 图稿诊断、修订与论文配套文字 ⬜ 待执行
- 本轮是否调整计划:无
```
Do not put copyable user prompts inside this plan block. Those belong only in the final `下一步你可以这样问` section.
FILE:references/state-and-turn-contract.md
# State and Turn Contract
This contract is strict: every text-only answer must either execute the current confirmed step or, before confirmation, maintain the startup gate and update recoverable state.
## Mandatory Startup Confirmation Gate
At the first trigger of a new figure-design task, the assistant must provide a startup plan preview and wait for user confirmation before substantive execution.
The first response must be `STARTUP_PLAN_ONLY` and must include:
- visible `当前执行计划` block
- `当前处于:第 0/N 步 — 启动确认与流程预览`
- a complete list of workflow steps and plain-language descriptions
- optional materials the user can provide
- optional reference-image note
- default recommended route
- final-only prompt suggestions for confirming or modifying the start
- footer with `start_confirmed: false` and `awaiting_user_confirmation: true`
It must not yet analyze the paper, rank figure schemes, build prompts, or generate images.
## Mandatory Opening Plan After Confirmation
After the user confirms start, the assistant must begin with a concise skill-driven plan before detailed analysis. The plan must be derived from the workflow in `SKILL.md`, not from a generic conversation template.
The opening execution plan must include:
- current stage
- goal of the current reply
- next 3–6 planned steps
- what will be inferred vs. requested
- whether optional reference images would help now
- recommended default route for immediate progress
The plan may change. When it changes, record the reason as `计划调整` in the state footer.
## Mandatory Visible Plan Step
Every text-only response must show a visible `当前执行计划` block near the beginning of the answer. The block must list the active plan and explicitly mark the current step as `当前处于:第 X/Y 步 — <step name>`. This is required for startup gate turns, opening turns, continuation turns, review turns, and short rule-modification turns.
The state footer must repeat the same current step as `当前处于计划第 X/Y 步:...`.
The visible plan block may describe workflow actions but must not include copyable next-turn user prompts.
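One way to check that the visible plan block and the footer name the same step is to parse the `第 X/Y 步` marker from both lines. This is an illustrative sketch only; `STEP_PATTERN`, `extract_step`, and `steps_consistent` are hypothetical helper names, and the regular expression assumes the marker is written with Arabic digits.

```python
import re
from typing import Optional, Tuple

# Matches the "第 X/Y 步" marker used in both the plan block and the footer.
STEP_PATTERN = re.compile(r"第\s*(\d+)\s*/\s*(\d+)\s*步")


def extract_step(line: str) -> Optional[Tuple[int, int]]:
    """Return (current, total) parsed from a plan-step line, or None if absent."""
    match = STEP_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def steps_consistent(visible_line: str, footer_line: str) -> bool:
    """True when both lines carry the same X/Y step marker."""
    visible = extract_step(visible_line)
    return visible is not None and visible == extract_step(footer_line)
```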
## Mandatory State Footer for Text Turns
Every text-only response must end with:
1. 当前状态与产物
2. 下一步你可以这样问
The footer exists because the skill may not persist in an agent/session. It must contain enough context for another assistant turn to continue. It must be updated after **every** text-only answer, including startup-gate replies, short answers, corrections, and replies that only modify the skill rules.
## Modality Rule
- Text turn: no image generation.
- Image turn: image generation only; no text.
If a user asks for both an explanation and an image in one message, first provide the text-only image brief, then ask the user to request image-only generation in their next message.
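The modality rule amounts to splitting a mixed request into ordered single-modality turns. A minimal sketch, assuming a hypothetical `split_request` helper and the `TEXT_ONLY` / `IMAGE_ONLY` labels from the skill's hard rules:

```python
from typing import List


def split_request(wants_explanation: bool, wants_image: bool) -> List[str]:
    """Order the single-modality turns needed to serve one user request."""
    if wants_image and wants_explanation:
        # Text brief first; the image turn waits for the user's next message.
        return ["TEXT_ONLY", "IMAGE_ONLY"]
    if wants_image:
        return ["IMAGE_ONLY"]
    return ["TEXT_ONLY"]
```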
## Recovery Phrase
Always remind the user to include:
`请根据引导skill以及当前的状态,继续...`
Recommended fallback:
`请根据引导skill以及当前的状态,继续告诉我下一步做什么。`
---
# v2 Addendum
## Visual candidate reminder
Every text-only planning response should remind the user that visual choices are best made from multiple candidates. Prefer 3–5 candidates in early exploration and 2–4 in refinement.
## Rendering rule reminder
Every text-only response should repeat one compact rule:
ChatGPT web uses native image generation as a separate action; OpenClaw / Codex / Trae / API hosts use OpenAI ChatGPT Images 2.0 or newer; SVG, Mermaid, TikZ, Graphviz, HTML/CSS, matplotlib, and other code-rendered figure fallbacks are forbidden.
## Stronger state requirements
The state block must include:
- current plan step, written as `当前处于计划第 X/Y 步:...`
- startup gate status when relevant: `start_confirmed` and `awaiting_user_confirmation`
- current working plan and any plan adjustment
- figure effect contract
- selected scheme
- current visual decision
- candidate history
- next candidate batch design
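If the state block is kept as a mapping, the required items above can be checked mechanically. The sketch below is an assumption about how one might do that: the key names in `REQUIRED_STATE_KEYS` and the `missing_state_keys` helper are hypothetical and are not defined by the skill. The startup gate fields are left out because they are required only when relevant.

```python
from typing import Dict, List

# Hypothetical key names mirroring the required state items listed above.
REQUIRED_STATE_KEYS = (
    "current_plan_step",
    "working_plan",
    "figure_effect_contract",
    "selected_scheme",
    "current_visual_decision",
    "candidate_history",
    "next_candidate_batch",
)


def missing_state_keys(state: Dict[str, object]) -> List[str]:
    """List required keys absent from the state, in the order given above."""
    return [key for key in REQUIRED_STATE_KEYS if key not in state]
```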
## Recommended default choice
Every text-only planning reply must include one opinionated default path so the user can proceed immediately:
- `我建议优先选择:...`
- `默认推荐路线:...`
- `如果你想直接推进,我建议下一步做:...`
The recommendation should be justified by reader effect, paper claim, reviewer risk, visual clarity, or generation feasibility. It must be stored in the compact state as `recommended_default`.
## Optional reference image prompt
Ask for optional reference images when they would help with layout, style, density, or visual metaphor:
`如果你有1–3张参考图,可以发给我,我会分析它们的布局、信息层级、文字密度和视觉语言;如果没有,我也会继续按论文主张推进。`
Do not ask this in every single turn. Do not block progress when the user has no reference images. If images are provided, analyze transferable principles and avoid exact copying.
## Footer must be an update, not a placeholder
The footer should record what changed in the current turn. If the turn only updates a rule, record the rule update as an artifact or fixed decision. Do not copy an old state block unchanged.
## v2.3 next-step prompt placement rule
Copyable next-turn prompt suggestions must appear only in the final `下一步你可以这样问` section. Do not place them in the opening plan, analysis body, option tables, recommendation block, or state bullets.
The state footer may summarize the immediate next action, but it should not introduce additional user-prompt wording. The final prompt list must be checked against the current plan and default recommendation so the response does not contain conflicting next steps. Prompt 1 should normally match `recommended_default`.
## v2.4 plan-step visibility rule
Every text-only reply must visibly list the plan and the current step near the beginning, not only in the footer. The current step must be explicit, for example `当前处于:第 4/12 步 — 图机会地图`. The footer repeats the same step. If the plan changes, both the visible plan block and footer must reflect the change.
## v2.5 startup confirmation rule
The first trigger is a gate, not the first design step. It previews the plan and waits for confirmation. After confirmation, the skill works step-by-step from Intake onward.
## v2.6 style-state requirement
When visual style is discussed, the text reply and state footer must record whether style is `not_started`, `style_board_proposed`, `default_recommended`, or `locked`. If a default style is recommended, include the rationale and at least one risk control.
Do not let style recommendations contradict earlier reader-effect or layout decisions. If a user asks for a style that conflicts with the paper slot, explain the risk and offer a safer adaptation.
FILE:references/taxonomy-reference.md
# Figure Taxonomy Reference
This file contains the text reference for the figure classification system. Use it when turning a vague figure need into a structured figure plan.
## Recommended Retrieval Order
When the user is vague, classify in this order:
1. `Reader Question`
2. `Logical Gap Type`
3. `Function / Narrative Role`
4. `Visual Rhetoric`
5. `Visual Grammar / Style`
6. `Evidence Type`
7. `Density / Layout`
8. `Editing Lever`
This order is better than starting from style, because users usually describe what is still unclear, not what drawing primitive they want.
## 1. Reader Question
- `why_is_this_problem_real`
- Use when the reader still does not buy the problem, failure mode, or practical need.
- `can_i_see_the_core_case`
- Use when the core intuition is best conveyed through one example, counterexample, or walk-through.
- `how_does_the_idea_become_a_model`
- Use when the idea is understandable in words, but the jump to variables, losses, modules, or objectives feels abrupt.
- `what_are_the_parts_and_data_flow`
- Use when the reader needs a stable map of components, interfaces, and data flow.
- `what_happens_over_time`
- Use when the contribution is procedural: training loop, inference process, retrieval chain, agent interaction, or iterative update.
- `what_evidence_should_i_believe`
- Use when the job of the figure is persuasion through comparison, ablation, protocol, or case evidence.
- `what_is_the_proof_intuition`
- Use when a theorem, bound, or proof needs a visual explanation.
- `what_is_the_main_message`
- Use when the figure should compress the main takeaway into one memorable message.
## 2. Logical Gap Type
- `phenomenon_to_problem`
- Bridge from observed phenomenon to a well-defined problem.
- `problem_to_hypothesis`
- Bridge from problem statement to the key hypothesis or principle.
- `hypothesis_to_mechanism`
- Bridge from high-level intuition to a concrete mechanism.
- `mechanism_to_objective`
- Bridge from mechanism to variables, constraints, losses, or objective terms.
- `objective_to_algorithm`
- Bridge from the mathematical target to a computable procedure.
- `algorithm_to_system`
- Bridge from a local algorithmic step to the full system or workflow.
- `system_to_evidence`
- Bridge from system description to believable evidence.
- `theory_to_intuition`
- Bridge from theorem or bound to a visual picture.
## 3. Function / Narrative Role
- `idea_motivation_or_problem_gap`
- Expose the limitation, bottleneck, contradiction, failure case, or missing signal.
- `toy_example_or_case_evidence`
- Use one concrete case to explain the central intuition.
- `method_overview_or_architecture`
- The main framework figure or module overview.
- `idea_to_model_logic_bridge`
- Make the bridge from intuition to variables, modules, losses, constraints, or search operators explicit.
- `training_or_inference_process`
- Explain training, inference, retrieval, planning, or iterative refinement over time.
- `data_or_benchmark_construction`
- Explain how a dataset, benchmark, task family, or evaluation protocol is constructed.
- `result_or_ablation_evidence`
- Support a claim through quantitative or qualitative evidence.
- `theory_or_proof_intuition`
- Provide a visual interpretation of theory.
- `general_explanatory_figure`
- A fallback category for explanatory figures that do not fit neatly elsewhere.
## 4. Visual Rhetoric
- `contrast_before_after`
- Before vs after, failure vs success, old vs new.
- `progressive_stage_reveal`
- Reveal the explanation in stages.
- `causal_chain`
- Make cause -> mechanism -> outcome explicit.
- `decompose_then_recompose`
- Split the system into pieces, then show how they recombine.
- `zoom_in_zoom_out`
- Move between global overview and local detail.
- `mapping_alignment`
- Align two spaces, modalities, roles, or state sets.
- `feedback_loop_or_cycle`
- Emphasize iteration, recurrence, or feedback.
- `search_space_or_design_space`
- Show a frontier, family map, design space, or trade-off structure.
- `storyboard_case_walkthrough`
- Walk through one concrete example like a storyboard.
- `direct_exposition`
- Plain explanation with minimal rhetorical structure.
## 5. Visual Grammar / Style
- `block_arrow_pipeline`
- Block modules connected by arrows.
- `input_output_triptych`
- Input / intermediate / output or 3-part explanation board.
- `graph_or_network_schematic`
- Graph, topology, relation network, or structured state diagram.
- `sequence_token_timeline`
- Ordered steps, token progression, timeline, or recurrent sequence.
- `image_grid_or_qualitative_panel`
- Image board, qualitative strip, or example gallery.
- `matrix_heatmap_or_chart_hybrid`
- Heatmap, matrix, chart board, or mixed evidence panel.
- `equation_diagram_hybrid`
- Equations, variables, and visual mechanism in one board.
- `trajectory_or_environment_scene`
- Trajectory, environment, planning, robotics, or embodied setting.
- `wide_landscape_multi_panel`
- Wide horizontal figure with several coordinated panels.
- `vertical_stack`
- Top-to-bottom layered explanation.
- `minimal_vector_or_plot`
- Sparse theory-style diagram or minimal plot.
## 6. Evidence Type
- `toy_case_or_counterexample`
- Synthetic or toy support.
- `real_case_study`
- Concrete real example.
- `quantitative_result_plot`
- Quantitative curves, bars, or metrics.
- `ablation_or_mechanism_probe`
- Ablation, probing, or mechanism analysis.
- `benchmark_or_dataset_protocol`
- Benchmark or protocol explanation.
- `theory_or_bound_support`
- Formal support such as bounds or proof-related evidence.
- `visual_output_evidence`
- Qualitative visual results.
- `system_trace_or_log`
- Workflow traces, trajectories, logs, or state transitions.
- `conceptual_support`
- Conceptual support that is explanatory rather than empirical.
## 7. Density / Layout
- `hero_single_panel`
- Strong, memorable main figure with one dominant frame.
- `wide_ribbon`
- Wide ribbon-like method figure.
- `two_stage_split`
- Two major blocks, often concept vs implementation.
- `2x2_grid`
- Symmetric multi-panel board.
- `dense_reference_sheet`
- Dense technical summary or appendix-style board.
- `vertical_story`
- Narrative stack.
- `landscape_story`
- Horizontal story progression.
## 8. Editing Lever
- `simplify_text_load`
- Remove excess text and push explanation into structure.
- `strengthen_flow`
- Make the dominant reading path obvious.
- `add_case_anchor`
- Introduce one memorable example.
- `add_intermediate_state`
- Show the missing intermediate representation or variable.
- `make_bridge_explicit`
- Make the logic jump visible.
- `tighten_color_semantics`
- Ensure color has stable meaning.
- `separate_evidence_from_method`
- Avoid mixing architecture and evaluation too early.
- `reduce_panel_redundancy`
- Remove repeated panels that do not add information.
- `increase_reader_guidance`
- Add numbering, phase labels, or callouts that support reading order.
## Practical Rule
When the user says:
- "the motivation is weak" -> start with `why_is_this_problem_real`
- "the jump to the model is abrupt" -> start with `how_does_the_idea_become_a_model`
- "the method is hard to follow" -> start with `what_are_the_parts_and_data_flow`
- "the process is unclear" -> start with `what_happens_over_time`
- "the evidence is not convincing" -> start with `what_evidence_should_i_believe`
- "the theory is too dense" -> start with `what_is_the_proof_intuition`
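The practical rule above is a direct lookup from a user complaint to a starting `Reader Question`. A minimal sketch: the dictionary reproduces the six pairs listed, while the `starting_reader_question` helper and its exact-match normalization are assumptions (real complaints would need fuzzier matching).

```python
from typing import Optional

PRACTICAL_RULE = {
    "the motivation is weak": "why_is_this_problem_real",
    "the jump to the model is abrupt": "how_does_the_idea_become_a_model",
    "the method is hard to follow": "what_are_the_parts_and_data_flow",
    "the process is unclear": "what_happens_over_time",
    "the evidence is not convincing": "what_evidence_should_i_believe",
    "the theory is too dense": "what_is_the_proof_intuition",
}


def starting_reader_question(complaint: str) -> Optional[str]:
    """Return the starting reader question for a known complaint, else None."""
    # Strip surrounding whitespace and quotes, a trailing period, and case.
    normalized = complaint.strip().strip("\"'").rstrip(".").lower()
    return PRACTICAL_RULE.get(normalized)
```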
FILE:references/user-input-bundle-template.md
# User Input Bundle Template
Use this template internally when the user wants a structured intake form.
Ask only for fields that would materially improve the next design step.
## Paper identity
- Title:
- Field / venue target:
- Figure slot: first chapter / introduction / method / result / analysis / appendix
## Paper logic
- Main claim:
- Prior methods or framing:
- Problem gap:
- Core mechanism:
- Evidence or experiments:
## Figure goal
- What should readers understand in 10 seconds?
- What should readers understand in 60 seconds?
- What misconception should the figure prevent?
- Should the figure be inspiration-source, case schematic, idea-to-model bridge, or undecided?
## Anchor case or analogy
- Concrete example:
- Real-world inspiration source:
- Failure case:
- Baseline-vs-ours contrast:
## Visual preferences
- Aspect ratio: portrait / landscape / square / undecided
- Tone: formal / modern / editorial / minimalist / vivid / undecided
- Text tolerance: very low / medium / high
- Need multiple candidates: yes / no / undecided
## Optional reference figures
Ask the user to attach 1–3 reference figures only if layout, density, hierarchy,
or visual language would benefit from visual evidence. Continue without them if
none are available.
FILE:references/visual-decision-protocol.md
# Visual Decision Protocol
Version: 2.7.0
Use this reference whenever the next decision is visual.
## Visual-decision-first rule
If the user is deciding among style, layout, density, visual metaphor, figure direction, or image direction, do not rely only on prose. Prepare an exploratory candidate image board when the choice is primarily visual.
Do not postpone all visual comparison until the final figure-generation round. Early boards are allowed and encouraged when they help the user choose a direction.
## Candidate counts
- Early visual decision board: 3–5 candidates
- Exploration: 3–5 candidates
- Narrow refinement: 2–4 candidates
- Final repair: 1–2 candidates
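The candidate counts above can be held in a small table and used to clamp a requested batch size. This is a sketch under assumptions: the stage keys and the `clamp_candidate_count` helper are hypothetical names; the ranges are the ones listed.

```python
# (minimum, maximum) candidates per stage, as listed above.
CANDIDATE_RANGES = {
    "early_visual_decision_board": (3, 5),
    "exploration": (3, 5),
    "narrow_refinement": (2, 4),
    "final_repair": (1, 2),
}


def clamp_candidate_count(stage: str, requested: int) -> int:
    """Clamp a requested batch size into the range allowed for the stage."""
    low, high = CANDIDATE_RANGES[stage]
    return max(low, min(high, requested))
```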
## Batch design
Each board should vary exactly one dominant axis unless the user explicitly asks for a broad first exploration:
- figure direction / category
- layout skeleton
- style family, using the visual style taxonomy when style is the varied axis
- density
- visual metaphor
- evidence integration
- label policy
- color semantics
- reviewer tone
Keep the paper thesis, selected scheme, and anchor case fixed unless the user asks to change them.
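The one-axis rule can be checked before a batch is generated by comparing candidate specifications. The sketch below assumes each candidate is described as a flat mapping from axis name to a string value; `varied_axes` and `is_single_axis_batch` are hypothetical helpers, not part of the skill.

```python
from typing import Dict, List, Set


def varied_axes(candidates: List[Dict[str, str]]) -> Set[str]:
    """Return the axes whose value differs across the candidate specs."""
    axes: Set[str] = set()
    for spec in candidates:
        axes.update(spec)
    return {
        axis
        for axis in axes
        if len({spec.get(axis) for spec in candidates}) > 1
    }


def is_single_axis_batch(
    candidates: List[Dict[str, str]], broad_exploration: bool = False
) -> bool:
    """True when exactly one axis varies, or a broad exploration was requested."""
    return broad_exploration or len(varied_axes(candidates)) == 1
```

A batch in which nothing varies also fails the check, since such a board gives the user nothing to compare.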
## Style-family board
When the visual decision is style, offer a compact style-family board with 4–8 relevant choices. Include mainstream choices when they fit the figure: clean editorial flat, formal architecture schematic, mechanism snapshot, premium scientific illustration, isometric / soft 3D, low-poly abstract 3D, cartoon / comic-lite, storyboard panels, tile / card / mosaic board, paper-cut collage, blueprint / technical drawing, dashboard metaphor, mini-evidence infographic, and minimal line-art.
For each style candidate, state: best fit, main benefit, main risk, prompt cue, and suitability for the current paper slot. Always provide `默认推荐风格`.
Do not treat style as decoration. Tie every style option to reader effect, paper claim, reviewer risk, and generation robustness.
## Text-only pre-board / pre-generation reply
Before a batch, state:
- what is fixed
- what varies
- how many candidates will be generated
- whether this is an exploratory decision board or the final candidate batch
- what the user should evaluate after images appear
- the rendering rule: native image generation / ChatGPT Images 2.0; no SVG or code fallback
## Image-only generation turn
The generation turn contains no prose. Generate the images only.
## Post-image reply
After generation, ask the user to choose by image number / letter and comment on:
- layout clarity
- paper-claim fit
- mechanism clarity
- density / clutter
- text readability
- style fit
- whether the figure feels appropriate for the target paper slot
## Default recommendation in visual decisions
A visual decision reply must not be a neutral catalog only. It should include:
- fixed elements
- varied axis
- candidate count
- evaluation criteria
- `默认推荐路线` or `我建议优先选择`
Choose the default based on reader effect, panel clarity, paper-claim fit, reviewer risk, and likely image-generation robustness.
## Reference images in visual decisions
At suitable visual decision points, ask whether the user has 1–3 reference images. If supplied, analyze:
- reading path
- panel structure
- hierarchy
- density
- label policy
- color semantics
- metaphor / object vocabulary
- what to borrow as design principles
- what not to copy
Use reference images to improve the current figure, not to imitate the source exactly. If no references are supplied, continue with the best inferred direction.
## Related reference
Use `visual-style-taxonomy-and-selection.md` for detailed style choices, defaults, risks, and prompt cues.
## Visual decision board types
Use `visual-first-decision-board-protocol.md` for the full rules. Supported early boards include:
- figure-direction board;
- layout board;
- style board;
- metaphor board;
- density board;
- refinement board.
A board is a decision aid, not the final polished figure.
FILE:references/visual-first-decision-board-protocol.md
# Visual-First Decision Board Protocol
Version: 2.7.0
This protocol prevents the inspiration/case figure guide from becoming a text-only questionnaire that only generates images once at the end. For inspiration-source, case-schematic, and idea-to-model bridge figures, many choices are inherently visual: category, panel skeleton, metaphor strength, style family, and density.
## Core principle
When a decision is primarily visual, let the user compare generated visual samples earlier in the workflow. A visual decision board is not the final polished figure; it is a quick multi-candidate board for selecting a direction.
## When to use a visual decision board
Use or recommend an `IMAGE_ONLY` exploratory board when:
- the user is choosing among inspiration-source figure, case walkthrough, motivation board, idea-to-model bridge, or introduction hero;
- the user is comparing style families such as editorial flat, formal schematic, mechanism snapshot, premium scientific illustration, isometric / soft 3D, cartoon / comic-lite, tile / card / mosaic, paper-cut, blueprint, dashboard, or minimal line-art;
- the layout choice is visual: bridge, storyboard, before-after, layered stack, tile board, radial loop, or central mechanism with callouts;
- the metaphor choice is visual: bridge, funnel, lens, map, loop, scaffold, or card board;
- the user asks for multiple candidates, different styles, different types, or says they cannot decide;
- more text would likely not resolve the choice.
Do not use a board when the paper claim is still too unclear, the user explicitly asks not to generate images yet, or the decision is mainly about argument logic rather than visual form.
## Board types
### 1. Figure-direction board
Purpose: select the role of the figure.
Typical batch: 3–5 images.
Examples: inspiration-source bridge vs case walkthrough vs problem-gap board vs idea-to-model bridge vs intro hero.
Prompt discipline: keep the paper thesis, anchor case, and core labels fixed; vary only figure role and composition.
### 2. Layout board
Purpose: choose the panel skeleton.
Typical batch: 3–5 images.
Examples: horizontal bridge, split before-after, storyboard panels, tile matrix, layered stack, central mechanism with callouts.
Prompt discipline: keep figure role and style fixed; vary only layout.
### 3. Style board
Purpose: choose visual communication style.
Typical batch: 3–5 images.
Examples: clean editorial flat, formal architecture schematic, mechanism snapshot, premium scientific illustration, isometric / soft 3D, mature cartoon / storyboard, tile/card/mosaic, paper-cut layered collage, blueprint / technical drawing, minimal line-art.
Prompt discipline: keep thesis, panel plan, labels, color semantics, and object vocabulary fixed; vary only style family.
### 4. Metaphor board
Purpose: choose the central visual metaphor.
Typical batch: 3–5 images.
Examples: bridge, funnel, lens, loop, map, scaffold, card board, evidence trail.
Prompt discipline: keep role, panel count, and style fixed; vary only metaphor.
### 5. Density board
Purpose: choose the amount of information.
Typical batch: 2–4 images.
Examples: sparse intro hero, balanced 3-panel figure, moderate case walkthrough, dense mini-evidence board.
Prompt discipline: keep role, layout, and style fixed; vary only density and label amount.
## Required text turn before a board
Unless the user already asked for direct visual candidates and the state is sufficient, the preceding `TEXT_ONLY` reply should briefly state:
- board type;
- candidate count;
- what stays fixed;
- what varies;
- what the user should compare;
- the default recommendation if the user wants to proceed.
Do not over-explain. The purpose is to move from imagined options to visual evidence.
## Image-only generation turn
The board-generation turn must be `IMAGE_ONLY`: no prose, no state footer, no explanation.
## After a board
The next `TEXT_ONLY` reply must:
1. identify the strongest candidate;
2. explain why it fits the reader effect, paper slot, and figure thesis;
3. note risks or required modifications;
4. record the board in state;
5. provide one default recommendation;
6. end with the standard `当前状态与产物` and `下一步你可以这样问` sections.
## State fields
Track:
```yaml
visual_decision_mode: text_only | exploratory_image_board | final_image_batch
visual_board_recommended: true | false
visual_board_type: figure_direction | layout | style | metaphor | density | refinement | final_candidate
visual_board_axis_varied:
visual_board_candidate_count:
visual_board_status: proposed | confirmed | generated | reviewed | skipped
visual_board_fixed_elements:
visual_candidate_history:
selected_visual_candidate:
default_visual_recommendation:
```
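The enumerated fields in the block above can be validated with a simple lookup. A minimal sketch, assuming the state is available as a plain mapping; `ALLOWED_VALUES` copies the enumerations from the YAML block, while `invalid_fields` is a hypothetical helper.

```python
from typing import Dict

# Allowed values copied from the enumerated state fields above.
ALLOWED_VALUES = {
    "visual_decision_mode": (
        "text_only", "exploratory_image_board", "final_image_batch",
    ),
    "visual_board_type": (
        "figure_direction", "layout", "style", "metaphor",
        "density", "refinement", "final_candidate",
    ),
    "visual_board_status": (
        "proposed", "confirmed", "generated", "reviewed", "skipped",
    ),
}


def invalid_fields(state: Dict[str, object]) -> Dict[str, object]:
    """Map each enumerated field to its offending value; unset fields are skipped."""
    return {
        key: state[key]
        for key, allowed in ALLOWED_VALUES.items()
        if key in state and state[key] not in allowed
    }
```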
## Non-goals
- Do not use exploratory boards to bypass the reader-effect contract or paper logic compression.
- Do not vary category, layout, style, metaphor, density, labels, and colors all at once except in a deliberately broad first exploration.
- Do not treat the first exploratory board as final. It is a decision aid.
FILE:references/visual-style-taxonomy-and-selection.md
# Visual Style Taxonomy and Selection Protocol
Use this reference when the conversation reaches visual language, style-family selection, candidate-board design, or image-prompt construction.
## Core rule: style is chosen after effect
Do not start a scientific figure from style. First define the reader effect, paper claim, figure role, and layout logic. Then choose a style that helps the reader interpret the figure quickly and safely.
A style choice should answer:
1. What does this style help the reader understand faster?
2. What paper slot does it fit: introduction hero, method overview, case walkthrough, result intuition, rebuttal, or appendix?
3. What reviewer risk does it create: too playful, too decorative, too dense, too photorealistic, too product-like, or too vague?
4. How robust is it for ChatGPT Images 2.0 generation?
## Mandatory style-family matrix
When style is a live decision, present 4–8 relevant styles from the matrix below and recommend one default. Do not present every style every time.
| Style family | Best for | Strength | Risk | Prompt cues |
|---|---|---|---|---|
| Clean editorial flat | Intro hero, cross-domain explanation, concept bridge | Clear, paper-safe, readable | May feel generic if not anchored by a strong metaphor | clean flat scientific illustration, minimal labels, soft shadows, high contrast |
| Formal architecture schematic | Technical method overview, ML/system papers | Reviewer-safe, precise module relations | Can become dry or box-heavy | formal architecture diagram style, structured modules, crisp arrows, restrained palette |
| Mechanism snapshot | Abstract algorithm intuition, core mechanism | Shows action and causality | May omit implementation detail | central mechanism, callouts, cause-effect arrows, compact mini-panels |
| Premium scientific illustration | High-impact intro figure, interdisciplinary journal | Polished, memorable, publication-facing | Can become over-rendered or decorative | premium scientific illustration, clean depth, refined lighting, minimal text |
| Isometric / soft 3D | Systems, networks, spatial relations, layered processes | Makes depth, hierarchy, and components tangible | Risk of toy-like or product-render look | isometric 3D scientific diagram, soft depth, clean materials, controlled perspective |
| Low-poly / abstract 3D | Conceptual landscapes, optimization, latent spaces | Good for abstract spaces and gradients | Can become vague if not panel-anchored | abstract low-poly 3D landscape, scientific metaphor, sparse labels |
| Cartoon / comic-lite | Case walkthrough, failure mode story, human-centered examples | Intuitive, friendly, memorable | May look childish or less serious | mature editorial cartoon style, restrained, not childish, research-paper appropriate |
| Storyboard panels | Before/after, one-case evolution, intervention path | Strong temporal logic | Can become too narrative if mechanism is missing | storyboard scientific figure, sequential panels, short labels, clear transitions |
| Tile / card / mosaic board | Taxonomy, multiple cases, method components, comparison grid | Great for modular comparison and scannability | Can feel like a poster if hierarchy is weak | tile-based scientific infographic, modular cards, consistent icons, grouped hierarchy |
| Paper-cut / layered collage | Inspiration-source, real-world-to-model bridge, multi-source intuition | Distinctive and approachable | Can look decorative if overused | layered paper-cut scientific illustration, clean edges, limited palette |
| Blueprint / technical drawing | Method mechanics, system process, design constraints | Precise, technical tone | Too cold for broad readers | blueprint-style scientific schematic, thin lines, labeled components, minimal text |
| Dashboard / interface metaphor | Evaluation logic, monitoring, data pipeline, decision support | Makes state, metrics, and feedback loops visible | Risk of fake UI clutter | clean research dashboard metaphor, abstract panels, no fake app details |
| Mini-evidence infographic | Connecting a concept to an empirical signal | Bridges idea and evidence | Can be too dense for first-chapter hero figure | mini plots, small evidence cards, concise annotations, caption-heavy |
| Minimal line-art schematic | Theory, equation-to-intuition bridge, formal argument | Extremely readable and safe | May look plain | minimal line-art scientific schematic, sparse labels, strong whitespace |
| Photorealistic / cinematic | Rare: physical experiments or concrete material scenes | High realism and attention | Usually risky for conceptual ML figures; noise and fake detail | use only when concrete realism is necessary; avoid photorealistic noise |
## Default recommendation heuristics
If no style preference is supplied, choose the safest style according to the paper slot:
- First chapter / introduction hero: **clean editorial flat** or **premium scientific illustration**.
- Technical method figure: **formal architecture schematic** or **mechanism snapshot**.
- Case schematic: **storyboard panels** or **cartoon / comic-lite** if human intuition matters.
- Inspiration-source figure: **paper-cut / layered collage**, **clean editorial flat**, or restrained **premium scientific illustration**.
- Multi-case / taxonomy / component comparison: **tile / card / mosaic board**.
- System / network / layered pipeline: **isometric / soft 3D** only if depth helps; otherwise formal schematic.
- Abstract latent space / optimization intuition: **low-poly / abstract 3D** only if the spatial metaphor is central.
- Reviewer-sensitive venue: prefer **formal architecture schematic**, **mechanism snapshot**, **minimal line-art**, or **clean editorial flat**.
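The heuristics above can be sketched as a lookup keyed by paper slot. The slot keys, the `default_style` helper, and the choice to return the first listed style are assumptions made for illustration; the source lists alternatives without ranking them. The latent-space case is omitted because no fallback style is given for it.

```python
DEFAULT_STYLES = {
    "introduction_hero": [
        "clean editorial flat", "premium scientific illustration",
    ],
    "technical_method": [
        "formal architecture schematic", "mechanism snapshot",
    ],
    "case_schematic": ["storyboard panels", "cartoon / comic-lite"],
    "inspiration_source": [
        "paper-cut / layered collage", "clean editorial flat",
        "premium scientific illustration",
    ],
    "multi_case_comparison": ["tile / card / mosaic board"],
    "reviewer_sensitive": [
        "formal architecture schematic", "mechanism snapshot",
        "minimal line-art", "clean editorial flat",
    ],
}


def default_style(slot: str, depth_helps: bool = False) -> str:
    """Pick one default style for a paper slot (first listed option, by assumption)."""
    if slot == "system_pipeline":
        # Isometric only if depth helps; otherwise fall back to formal schematic.
        if depth_helps:
            return "isometric / soft 3D"
        return "formal architecture schematic"
    return DEFAULT_STYLES[slot][0]
```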
## Style-board protocol
When style selection is useful, create a style board with 3–5 candidates. Hold fixed:
- figure thesis
- panel structure
- labels
- color semantics
- anchor case
Vary only the style family. For each style candidate, provide:
- style name
- why it helps the paper claim
- what could go wrong
- best paper slot
- prompt cue
- recommendation score: high / medium / low
Always include one `默认推荐风格` with reasons.
## Reference image use for style
If the user provides reference figures, analyze their style as principles rather than copying them:
- line weight / shape language
- depth model: flat, layered, isometric, full 3D
- object realism: symbolic, cartoon, semi-realistic, photorealistic
- color semantics
- density and whitespace
- label placement
- panel rhythm
- seriousness / playfulness level
Do not copy distinctive artwork, exact composition, branding, or unique labels from references.
## Style prompt safety rules
- Avoid vague style-only prompts such as `make it beautiful`, `high-tech`, or `Nature style` without concrete visual rules.
- Specify style through concrete features: line weight, depth, material, panel rhythm, label density, palette role, and icon realism.
- Use `research-paper appropriate`, `minimal text`, and `clear hierarchy` in most prompts.
- For cartoon style, add `mature`, `restrained`, `not childish` unless the paper context explicitly wants playful visuals.
- For 3D style, add `controlled perspective`, `clean surfaces`, `no photorealistic clutter`, and `labels remain flat/readable`.
- For tile/card style, add `clear grouping`, `one idea per card`, and `dominant central thesis`.
- For premium illustration, add `not decorative`, `mechanism-driven`, and `paper-safe`.
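The safety rules above can be applied when a style prompt is assembled. This is a minimal sketch, not a prescribed implementation: the style keys, `build_style_prompt`, and `vague_phrases_in` are hypothetical, and adding the base cues to every prompt is a simplification of "in most prompts". The cue strings themselves are taken from the rules.

```python
from typing import List

BASE_CUES = ["research-paper appropriate", "minimal text", "clear hierarchy"]

# Style-specific cues from the rules above; keys are hypothetical.
STYLE_CUES = {
    "cartoon": ["mature", "restrained", "not childish"],
    "3d": [
        "controlled perspective", "clean surfaces",
        "no photorealistic clutter", "labels remain flat/readable",
    ],
    "tile_card": [
        "clear grouping", "one idea per card", "dominant central thesis",
    ],
    "premium_illustration": ["not decorative", "mechanism-driven", "paper-safe"],
}

VAGUE_PHRASES = ("make it beautiful", "high-tech", "nature style")


def build_style_prompt(
    description: str, style_key: str, playful_ok: bool = False
) -> str:
    """Join a figure description with the base and style-specific cues."""
    extra = STYLE_CUES.get(style_key, [])
    if style_key == "cartoon" and playful_ok:
        extra = []  # the paper context explicitly wants playful visuals
    return ", ".join([description] + BASE_CUES + list(extra))


def vague_phrases_in(prompt: str) -> List[str]:
    """Flag vague style-only phrases so they can be backed by concrete rules."""
    lowered = prompt.lower()
    return [phrase for phrase in VAGUE_PHRASES if phrase in lowered]
```

A flagged phrase is a prompt for review rather than an error: the rule allows such wording when concrete visual rules accompany it.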
## State fields to update
When style is selected or explored, update:
- `figure_decisions.style_family`
- `figure_decisions.style_candidates_considered`
- `figure_decisions.default_style_recommendation`
- `figure_decisions.style_rationale`
- `figure_decisions.style_risks`
- `visual_candidate_history.varied_axis: style_family`
FILE:agents/openai.yaml
interface:
  display_name: "Inspiration / Case Figure Guide"
  short_description: "Research figure design guide"
  default_prompt: "Use $inspiration-case-figure-guide to turn my paper idea or draft into a publication-facing inspiration-source or case schematic figure plan."
Convert a paper deep-reading report, method description, or model introduction into publication-ready framework figure concepts through a stateful multi-roun...
---
name: paper-framework-figure-studio-pro
description: Convert a paper deep-reading report, method description, or model introduction into publication-ready framework figure concepts through a stateful multi-round workflow. Use when the user wants top-conference/top-journal framework diagrams, style exploration, human approval between rounds, concrete image prompts, separate text-versus-image turns, and iterative refinement toward a final paper figure.
license: MIT-0
metadata:
display_name: Paper Framework Figure Studio Pro
version: "1.4.3"
author: OpenAI
tags: research-figure, scientific-illustration, framework-diagram, paper-writing, clawhub
compatibility: ChatGPT web, Codex, Trae, OpenClaw, ClawHub marketplace, skills.sh. Tool-agnostic. Requires an image-generation capability in the host environment for rendering.
openclaw:
skillKey: paper-framework-figure-studio-pro
---
# Paper Framework Figure Studio Pro
Turn a paper's deep-reading report, model introduction, or method summary into a publication-ready **framework figure** through a **stateful, multi-round, human-in-the-loop studio workflow**.
Use this skill when the user wants any of the following:
- a main paper framework diagram
- a method overview figure for a conference or journal paper
- multi-style exploration of academic figure directions
- iterative narrowing from broad style families to polished final renders
- concrete, detailed prompts for image generation rather than vague design suggestions
- explicit human confirmation between rounds
- a final optional pass that writes the figure legend / caption / panel callouts
This skill is primarily for **framework figures**, not benchmark plots, tables, or raw quantitative charts.
## Core operating model
This skill is not a one-shot prompt generator. It is a **conversation-driven figure studio** with explicit state.
### Mandatory turn separation protocol
Every generation cycle must follow this order:
1. **Text-only planning / summary turn**
2. **Text-only confirmation turn** asking whether to generate the next candidate batch now
3. **Image-only generation action/turn** after the user says yes
4. **Text-only evaluation turn** asking the user to choose from the generated images
If the host environment tends to auto-generate images together with text, the skill must still behave as if those are forbidden to co-occur, and should explicitly defer image generation to the next turn after confirmation.
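One way to picture the four-step order is as a tiny state machine. The sketch below is illustrative only: the `Turn` enum, `next_turn`, and `is_valid_reply` are hypothetical names, and two behaviours are assumptions, namely staying in the confirmation turn until the user says yes and looping back to planning after evaluation.

```python
from enum import Enum


class Turn(Enum):
    PLAN = "text-only planning"
    CONFIRM = "text-only confirmation"
    IMAGE = "image-only generation"
    EVALUATE = "text-only evaluation"


CYCLE = [Turn.PLAN, Turn.CONFIRM, Turn.IMAGE, Turn.EVALUATE]


def next_turn(current, user_said_yes=False):
    """Return the turn that may follow `current` in one generation cycle."""
    if current is Turn.CONFIRM and not user_said_yes:
        # No approval yet: stay in text and never start generating.
        return Turn.CONFIRM
    return CYCLE[(CYCLE.index(current) + 1) % len(CYCLE)]


def is_valid_reply(has_text, has_image):
    """A reply is valid only if it carries exactly one modality."""
    return has_text != has_image
```

The `is_valid_reply` check encodes the rule that text and image output must not co-occur, even when the host would allow it.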
## First-contact protocol
On the **first planning turn**, before the studio fully starts, the assistant should briefly orient the user.
### Preferred but not mandatory upstream input
The assistant should explicitly tell the user that the best upstream input is usually a paper deep-reading report in Markdown, and should recommend the user first use the **paper-deep-reading** skill for the target paper or draft when available.
Recommended reminder wording should communicate all of the following:
- best practice is to first generate a paper deep-reading report or structured reading report
- the report should preferably be saved in **Markdown**
- the user may use the paper-deep-reading skill at `https://clawhub.ai/c-narcissus/paper-deep-reading` if they want a strong upstream report
- this is **recommended, not required**
- the skill also accepts less-complete inputs such as a method sketch, module list, algorithm description, or early-stage design notes
### First-turn readiness check
After the recommendation, the assistant should explicitly ask whether the user is **ready to start figure design now**.
The first-turn planning reply should therefore do four things in order:
1. remind the user that a Markdown deep-reading report is the preferred input, though not mandatory
2. state what kinds of partial inputs are also acceptable
3. ask whether the user is ready to begin figure design now
4. preview that once the report or description is read, the first image round will usually be a **multi-style candidate board** for the user to choose from
### First turn after ingesting the report or method description
Once the assistant has read the user's deep-reading report or method description and extracted the Figure Brief, it should explicitly tell the user that the normal next move is to generate **one batch of multiple style directions** for visual comparison.
That first post-ingest planning turn should:
- summarize the extracted Figure Brief
- name the first decision as a **style-family decision**
- explain that the decision should be made by looking at generated candidate images rather than prose only
- ask whether to generate the first multi-style candidate board now
Across the whole workflow, the assistant should:
1. read the user's paper deep-reading report, method description, or model summary
2. extract a clean **Figure Brief**
3. open a multi-round workflow
4. after each round, summarize the current state and ask for one concrete user decision
5. **separate all text planning from all image generation**
6. use **image candidates as the actual decision surface** whenever the user is choosing between visual schemes
7. update the running state after every user choice and every generation batch
8. progressively narrow style, structure, density, and detail
9. after the user selects a final direction, ask whether to also draft the **figure caption / legend / panel explanation text**
10. end every text-planning reply with a short **Next Steps** block so the user always knows what the next one or two actions will be
11. after every text-planning reply, record the updated session state, including generated deliverables and user selections
12. remind the user that future turns should explicitly ask to continue with this skill based on the current saved state
13. remind the user that if they are unsure what to ask next after images are generated, they can simply type **"接下来做什么"** (or **"what should we do next"**) to receive guided next-step instructions
## Non-negotiable rules
- **Human approval gate required** between major rounds.
- **Do not jump straight to a final image** from the initial paper description.
- **Do not silently change the chosen direction** without telling the user.
- **Do not mix planning text and image generation in the same reply** when the host supports separate image actions. First do the text reply. Then do image generation as a distinct action.
- **Treat this as a hard runtime constraint:** if a reply contains explanation, summary, questions, next-step guidance, or confirmation requests, that reply must be **text-only** and must not trigger image generation.
- **Before every image batch there must be a dedicated confirmation reply** whose sole job is to ask whether the user wants to generate that batch now. The actual image-generation action must happen only after that confirmation, as a separate next action/turn.
- **Do not use SVG as the primary or fallback rendering path** for framework-figure candidate boards or finals.
- **For any round where the user must choose among visual schemes, do not ask them to choose only from prose descriptions. Generate visual candidates first, then ask them to choose by looking at the images.**
- **Before each new generation batch, explicitly ask whether the user wants to generate the next set of candidate images now.**
- **Before each generation batch, explicitly name the rendering path that will be used in the current host: ChatGPT web should use native Create image via the assistant under Extended Thinking or the strongest available thinking-assisted path; IDE/API hosts should use OpenAI ChatGPT Images 2.0 or a newer supported OpenAI image model.**
- **Always track state** using a structure equivalent to `assets/conversation_state.template.json`.
- **On the first turn, recommend but do not require a Markdown deep-reading report created with the paper-deep-reading skill before figure work begins.**
- **After reading the report or method description for the first time, explicitly propose generating a first multi-style candidate board for selection.**
- **After every text-only reply, update the running state with the current figure brief, generated deliverables, pending decisions, and the user's recorded preferences.**
- **At the end of every text-only reply, remind the user that later messages should explicitly ask this skill to continue from the current state.**
- **At the end of every text-only planning reply, explicitly remind the user that if they do not know how to continue after the next image batch, they can simply type `接下来做什么` (or `what should we do next`) to get guided prompt suggestions and next-step help.**
- **At the end of every text-only reply, explicitly restate the rendering rule for the current host:** ChatGPT web should use the assistant's native image generation under the strongest available thinking-assisted path (prefer Extended Thinking when available) without asking the user to manually switch to Create image; IDE/API hosts must use OpenAI ChatGPT Images 2.0 or a newer supported OpenAI image model; SVG and other vector-code fallbacks are forbidden.
- **Generate multiple candidates per batch**. Early exploration batches should usually produce 3 to 5 candidates. Later refinement batches should usually produce 2 to 4 candidates.
- **Make prompts concrete**. Avoid vague instructions like “make it look academic” unless followed by precise layout, hierarchy, color, metaphor, and typography requirements.
- **Prefer framework-figure clarity over decorative complexity**.
- **Ask whether the user wants figure text help after the final image direction is chosen**.
- **At the end of every text-only planning reply, explicitly tell the user what the next one or two steps are, whether the next step is a text decision or an image-generation step, and what kinds of feedback they should be ready to give after the images appear.**
## Host-specific image-generation policy
When this skill reaches a generation step, use the host environment's best available **native OpenAI image-generation path**, and keep it separate from the text-planning reply.
The separation rule is absolute: a planning / explanation / confirmation message and an image-generation action must never be bundled into one assistant reply.
### Hard prohibition
- **Do not use SVG as the rendering path for candidate boards or final framework figures.**
- **Do not switch to code-drawn vector output as a substitute for native image generation.**
- **Do not use Mermaid, TikZ, Graphviz, or other vector-code fallbacks for framework-figure rendering rounds.**
- The intended rendering path is **OpenAI Create image / ChatGPT Images**, specifically **ChatGPT Images 2.0 or a newer supported OpenAI image model if the host exposes one**.
### ChatGPT web
- The preferred interaction is: the user stays inside the normal chat and the assistant triggers the host's native image generation as a separate image action.
- **Do not ask the user to manually switch tools or manually click a Create image mode first**; the skill should treat image generation as a native follow-up action after the planning reply.
- Prefer **Extended Thinking** for framework-figure generation when the host exposes it.
- If the host experience exposes **Thinking** or **images with thinking** but not an explicit Extended Thinking label, prefer the strongest available reasoning-assisted image path.
- Treat the image step as its own action after the user says to proceed.
### OpenClaw, Codex, Trae, or other IDE / API-driven hosts
- Use **OpenAI ChatGPT Images 2.0** at minimum for raster image output.
- If the host exposes a newer OpenAI image-generation version than ChatGPT Images 2.0, use the newer supported OpenAI version.
- If the host requires an API key and no OpenAI API key is available, **pause before generation and explicitly tell the user that image generation cannot proceed until they provide or configure an OpenAI API key**.
- When generation is blocked by missing credentials, do not fake progress and do not switch to SVG as a fallback.
- Do **not** replace image generation with SVG, Mermaid, TikZ, Graphviz, or other vector-code fallbacks when the task is a framework-figure rendering step.
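The host policy above amounts to a short decision procedure. The following is a rough sketch under stated assumptions: `choose_rendering_path` is a hypothetical helper, the host labels such as `"chatgpt-web"` and `"codex"` are illustrative strings rather than real identifiers, and the returned dictionary shape is invented for this example.

```python
def choose_rendering_path(host, has_openai_api_key=False, requires_api_key=True):
    """Pick the image path for a host, or pause when credentials are missing.

    `host` values are hypothetical labels used only in this sketch.
    """
    if host == "chatgpt-web":
        return {
            "action": "generate",
            "path": "native image generation (prefer Extended Thinking)",
        }
    # Everything else is treated as an IDE / API-driven host.
    if requires_api_key and not has_openai_api_key:
        return {
            "action": "pause",
            "path": None,
            "message": "Image generation cannot proceed until an OpenAI API key "
                       "is provided or configured.",
        }
    return {"action": "generate", "path": "OpenAI ChatGPT Images 2.0 or newer"}


blocked = choose_rendering_path("codex", has_openai_api_key=False)
```

Note that no branch returns a vector-code path: when generation is blocked, the only outcome is a pause with an explicit message.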
### Required interaction split
- Keep the **text planning reply** and the **image-generation action** separate.
- When a round is a **visual decision round**, the normal sequence is:
1. text turn: summarize state, remind the user of the rendering path that will be used in this host, and ask whether to generate the next candidate board
2. image action: use **OpenAI Create image / ChatGPT Images 2.0 or newer supported OpenAI image generation** to generate the candidate board or multi-image batch
3. text turn: briefly label the shown candidates and ask the user to choose by image number / letter
## Visual-decision-first protocol
This skill uses a **visual-decision-first** workflow.
That means:
- The assistant may explain what varies across options in text.
- But when the user's next decision is fundamentally about figure appearance, layout, style family, internal visual language, or refinement direction, the assistant should not stop at text-only options.
- Instead, the assistant should ask whether to generate a **candidate board** for that decision.
- After generation, the assistant should present a short mapping such as **A / B / C / D** or **1 / 2 / 3 / 4** tied to the generated images and ask the user to choose from the images.
Good visual-decision rounds include:
- style family selection
- structural skeleton selection
- density / audience-bias tradeoff when it changes the figure look
- internal visual language selection
- refinement direction selection
- final shortlist selection
## Mandatory next-step navigation
Every **text-only** reply in the workflow must end with a short navigation block.
The navigation block should be concrete and user-facing, not abstract process language. It should tell the user:
1. what the very next step is
2. whether that next step is **another text decision** or a **separate image-generation action**
3. what the user will need to do right after images are generated
4. what kinds of feedback will be most useful in the next turn
It should also repeat one short **rendering-rule reminder** in every text-only reply so the user sees it every round:
- In **ChatGPT web**, image generation must be a separate native image action under the strongest available thinking-assisted path; prefer **Extended Thinking** when available, and do **not** ask the user to manually switch to Create image.
- In **OpenClaw / Codex / Trae / API hosts**, image generation must use **OpenAI ChatGPT Images 2.0** at minimum, or a newer supported OpenAI image model if available.
- **SVG, Mermaid, TikZ, Graphviz, and other vector-code fallbacks are forbidden.**
### Required structure
Use a compact structure such as:
- **Next step:** [ask permission to generate the next candidate board / summarize a shortlisted direction / write caption text]
- **After the images appear, please choose by image** and optionally comment on: [layout / density / comparison clarity / icon style / equations / mini-result snapshots / clutter / reviewer-friendliness]
- **Then I will:** [update the chosen direction and prepare the next narrower batch]
- **For the next turn:** please explicitly ask `paper-framework-figure-studio-pro` to continue from the current saved state.
In addition to the user-facing navigation block, the assistant should also internally update the saved session state after the text reply is composed. That state update should include:
- the latest accepted inputs
- the current Figure Brief
- every generated deliverable so far
- the most recent user choices and rejected options
- the currently pending decision
- the next candidate board that would be generated if the user says yes
### Session continuity reminder
Because some hosts do not automatically preserve skill-specific working memory in a reliable way, the assistant should remind the user at the end of each text-only planning turn that future messages should explicitly say something like:
- “Please continue with **Paper Framework Figure Studio Pro** from the current saved state.”
- “Use **paper-framework-figure-studio-pro** to continue from the current state and apply this new change request.”
This reminder should be brief, but it should appear consistently so the user knows how to resume the workflow in later turns. The assistant should also record the updated session state after every text-planning reply, including generated deliverables and user selections.
### Examples
Example A:
- **Next step:** If you want, the next action is to generate the style-family candidate board as a separate image batch.
- **After the images appear, please choose by image** (A/B/C/D) and tell me what you like or dislike about layout, modernity, and explanation strength.
- **Then I will:** update the state and prepare the next structural-skeleton board.
Example B:
- **Next step:** The next action is to generate a refinement batch focused only on reducing clutter and strengthening the baseline-vs-ours comparison.
- **After the images appear, please choose by image** and note whether you want fewer labels, cleaner arrows, or stronger mini-result snapshots.
- **Then I will:** lock the winning direction and ask whether you also want caption / legend / panel text.
Do not omit this navigation block. The user should always know the next one or two moves.
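As an illustration, the four-line navigation block could be assembled by a helper like the one below. This is only a sketch: `navigation_block` and its parameters are hypothetical, and the rendering-rule and help reminders described elsewhere in this skill would still need to be appended separately.

```python
def navigation_block(next_step, feedback_axes, then_action,
                     skill="paper-framework-figure-studio-pro"):
    """Build the user-facing navigation block as Markdown bullet lines."""
    lines = [
        f"- **Next step:** {next_step}",
        "- **After the images appear, please choose by image** and optionally "
        "comment on: " + " / ".join(feedback_axes),
        f"- **Then I will:** {then_action}",
        f"- **For the next turn:** please explicitly ask `{skill}` to continue "
        "from the current saved state.",
    ]
    return "\n".join(lines)


block = navigation_block(
    next_step="generate the style-family candidate board as a separate image batch",
    feedback_axes=["layout", "density", "clutter"],
    then_action="update the state and prepare the structural-skeleton board",
)
```

Building the block from a fixed template makes it harder to drop one of the four items by accident.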
## Workflow overview
Follow this sequence unless the user explicitly asks to skip or compress a stage.
### Round 0 — Intake and figure brief construction
Read the user's deep-reading report or model description and construct a **Figure Brief**.
The Figure Brief must capture at least:
- paper or method title
- one-sentence scientific claim
- what the figure must explain
- target figure type: framework overview
- likely venue level and audience familiarity
- mandatory modules to show
- optional modules to compare
- what should remain outside the figure
- preferred page format: A4 portrait, A4 landscape, or unknown
- whether the user values safety, modernity, mechanism explanation, or visual memorability more
Use the template in `assets/figure_brief_template.md`.
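For orientation, the required fields can be pictured as a small record type. The sketch below is not the contents of `assets/figure_brief_template.md`; the class name, field names, and the `problems` check are all assumptions made for illustration, and the example values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List

PAGE_FORMATS = {"A4 portrait", "A4 landscape", "unknown"}


@dataclass
class FigureBrief:
    title: str
    scientific_claim: str          # one sentence
    must_explain: str
    mandatory_modules: List[str]
    optional_modules: List[str] = field(default_factory=list)
    keep_outside_figure: List[str] = field(default_factory=list)
    figure_type: str = "framework overview"
    venue_and_audience: str = "unknown"
    page_format: str = "unknown"
    # safety, modernity, mechanism explanation, or visual memorability
    top_priority: str = "unknown"

    def problems(self) -> List[str]:
        """List the gaps that should be resolved before the first board."""
        issues = []
        for name in ("title", "scientific_claim", "must_explain"):
            if not getattr(self, name).strip():
                issues.append(f"{name} is empty")
        if not self.mandatory_modules:
            issues.append("no mandatory modules listed")
        if self.page_format not in PAGE_FORMATS:
            issues.append(f"unknown page format: {self.page_format}")
        return issues


brief = FigureBrief(
    title="(placeholder title)",
    scientific_claim="(placeholder one-sentence claim)",
    must_explain="how the modules connect",
    mandatory_modules=["encoder", "aggregator"],
    page_format="A4 landscape",
)
```

An empty `problems()` list suggests the brief is complete enough to propose the first candidate board; anything else is a natural question to put to the user.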
### Round 1 — Style-family candidate board
Do **not** ask the user to choose only from a written list of families.
Instead:
1. summarize 3 to 5 candidate families very briefly in text
2. ask whether to generate the **style-family candidate board now**
3. in a separate image action, generate 3 to 5 style-distinct figure candidates for the same paper content
4. after the images are shown, label them in a short text turn and ask the user to choose a primary direction and optionally a backup
Recommended default families:
1. **Academic Conservative** — standard top-tier ML paper overview
2. **Modern Modular Tiles** — magnetic-card / dashboard-like figure blocks
3. **Mechanism + Result Snapshots** — each stage shows both mechanism and local effect
4. **Editorial Flat Illustration** — modern flat/cartoon academic style, friendly but rigorous
5. **Premium Scientific Illustration** — soft-3D / high-polish scientific editorial rendering
### Round 2 — Structural-skeleton candidate board
Within the selected family, do not stop at textual skeleton descriptions.
Instead:
1. propose 2 to 4 structural skeletons very briefly
2. ask whether to generate the **structural-skeleton candidate board now**
3. in a separate image action, generate 2 to 4 candidates where the content stays fixed but the composition changes, such as:
- left-to-right pipeline
- top-down narrative stack
- central model + surrounding callouts
- modular tile grid
- comparison split with baseline vs ours
4. after the images are shown, ask the user to choose from the images
### Round 3 — Density / reviewer-bias candidate board
If density or reviewer bias will materially affect the visual appearance, do not ask the user to decide only from prose.
Instead:
1. explain that the next board will compare, for example:
- technical / formal
- cross-domain / easier to understand
- visually modern but still rigorous
and/or
- low density
- medium density
- high density
2. ask whether to generate the **density-and-bias candidate board now**
3. generate 2 to 4 candidates in a separate image action
4. ask the user to choose from the images
### Round 4 — Internal visual-language candidate board
Narrow the figure's visual language.
Typical selectable elements:
- avatars or no avatars
- mini scatterplots or no mini scatterplots
- per-step result snapshots or mechanism only
- one equation or several small equations
- minimal labels or richer callout labels
- baseline comparison included or deferred
Protocol:
1. summarize what will vary
2. ask whether to generate the **internal-visual-language board now**
3. generate 2 to 4 candidates in a separate image action
4. ask the user to choose from the images
### Round 5 — Exploration batch generation
Prepare a concrete batch with 3 to 5 candidate prompts.
Important:
- keep the paper content fixed
- vary only a few style axes per batch
- state clearly what differs across candidates
- ask whether to generate this batch now
- after the user approves, perform image generation in a separate action
- after the images appear, ask the user to choose from the actual images rather than from abstract prose
Then update state with the generated batch metadata.
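The batch rules above (fixed content, a few varied axes, 3 to 5 candidates) can be sketched as follows. The helper `build_exploration_batch` is hypothetical, the cap of three varied axes is an assumed reading of "a few", and raising an error on out-of-range sizes is stricter than the "usually" in the text.

```python
from string import ascii_uppercase


def build_exploration_batch(fixed_content, candidates, max_axes=3):
    """Label candidates A, B, C... and report which axes differ.

    `candidates` is a list of dicts mapping a style axis to its value.
    """
    if not 3 <= len(candidates) <= 5:
        raise ValueError("an exploration batch should hold 3 to 5 candidates")
    axes = sorted({axis for overrides in candidates for axis in overrides})
    if len(axes) > max_axes:
        # Assumption: "a few" is read as at most `max_axes` axes.
        raise ValueError(f"too many varied axes: {axes}")
    labelled = [
        {"id": label, "content": fixed_content, "varies": dict(overrides)}
        for label, overrides in zip(ascii_uppercase, candidates)
    ]
    return {"varied_axes": axes, "candidates": labelled}


batch = build_exploration_batch(
    "same paper content",
    [{"palette": "muted"}, {"palette": "vivid"}, {"palette": "muted", "icons": "flat"}],
)
```

The returned `varied_axes` list is the kind of batch metadata that could be written back into the session state, and it doubles as the "what differs across candidates" statement for the user.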
### Round 6 — Selection and refinement
After the user chooses a winner or shortlist:
- summarize what won
- summarize what the user disliked
- propose the next refinement axis
- ask whether to generate the narrower refinement batch now
- generate the refinement batch in a separate image action
- after the images appear, ask the user to choose from the actual images
Typical refinement axes:
- stronger hierarchy
- less clutter
- more legible equations
- better baseline-vs-ours comparison
- cleaner client graph
- more journal-like typography
- more modern or less playful icons
- closer to A4 publication balance
### Round 7 — Finalization
Once the user selects a final direction:
- confirm the final figure intent
- ask whether they also want:
- panel labels
- legend text
- figure caption
- figure explanation for the paper body
- bilingual callout wording
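The round sequence and its candidate counts can be summarized in a lookup table. This is a sketch rather than part of the skill bundle: the names `ROUND_CANDIDATES` and `check_batch_size` are hypothetical, and the range for Round 6 is taken from the general rule that later refinement batches usually produce 2 to 4 candidates.

```python
# Round number -> (min, max) candidates per image batch; None means no image batch.
ROUND_CANDIDATES = {
    0: None,      # intake and Figure Brief construction
    1: (3, 5),    # style-family candidate board
    2: (2, 4),    # structural-skeleton candidate board
    3: (2, 4),    # density / reviewer-bias candidate board
    4: (2, 4),    # internal visual-language candidate board
    5: (3, 5),    # exploration batch
    6: (2, 4),    # refinement batch (from the general refinement rule)
    7: None,      # finalization and optional caption work
}


def check_batch_size(round_no, n_candidates):
    """Return True if `n_candidates` fits the usual range for that round."""
    bounds = ROUND_CANDIDATES.get(round_no)
    if bounds is None:
        return n_candidates == 0
    low, high = bounds
    return low <= n_candidates <= high
```

Since the counts are described as usual ranges, a real workflow might warn rather than reject when a batch falls outside them.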
## Mandatory text-turn protocol
Every non-image reply should follow this pattern.
### A. Current state
Briefly state:
- current round
- current chosen family and skeleton
- current unresolved decision
### B. Visual decision to be made
If the next decision is visual, say that the next step should be based on **candidate images**, not only verbal descriptions.
### C. Ask permission for the next image batch
Ask a bounded confirmation such as:
- “Do you want me to generate the next style-family candidate board now?”
- “Do you want me to generate the structural-layout candidates now?”
- “Do you want me to generate the next refinement batch now?”
### D. What the next batch will vary
State 2 to 5 controlled axes that will differ across the generated images.
### E. After images are shown
In the next text turn after generation:
- label the shown candidates clearly
- give a one-line difference summary for each candidate
- ask the user to choose by image ID, for example **A**, **B**, **C**, **D**
## Mandatory image-turn protocol
Every image-generation step must be independent from the planning text turn.
- Do not include a long discussion inside the image-generation step.
- The generation action should use a concrete prompt assembled from the current state.
- Early rounds should produce multiple style-diverse candidates.
- Later rounds should produce tightly controlled refinements.
- The batch should be assembled so that the user can make a real choice **from the generated images**.
## Prompt-construction standard
Build prompts from the following layers, in this order.
1. **Figure goal** — what the figure explains scientifically
2. **Paper framing** — title and one-line claim
3. **Required content blocks** — the exact modules that must appear
4. **Narrative order** — the intended reading path
5. **Style family** — one of the chosen families
6. **Structural skeleton** — layout archetype
7. **Visual vocabulary** — icons, nodes, mini plots, avatars, tiles, cards
8. **Typography requirements** — concise labels, sharp text, panel headings
9. **Color semantics** — blue shared, orange personal, green collaboration by default
10. **Comparative emphasis** — consensus-only vs beyond consensus if included
11. **Output constraints** — A4, portrait/landscape, publication-ready, uncluttered, legible
12. **Batch-difference instruction** — what should differ across candidate A/B/C/D and what must stay fixed
Use the detailed templates in `assets/prompt_library.md`.
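The layered ordering can be sketched as a simple assembler. This is an assumption-laden illustration, not the format used in `assets/prompt_library.md`: the layer keys, the `assemble_prompt` function, and the one-line-per-layer output are all hypothetical. The comparative-emphasis layer is treated as optional because the list above includes it only "if included".

```python
LAYER_ORDER = [
    "figure_goal", "paper_framing", "required_content_blocks", "narrative_order",
    "style_family", "structural_skeleton", "visual_vocabulary", "typography",
    "color_semantics", "comparative_emphasis", "output_constraints",
    "batch_difference",
]
OPTIONAL_LAYERS = {"comparative_emphasis"}


def assemble_prompt(layers):
    """Join the layers in the fixed order, regardless of dict order."""
    missing = [name for name in LAYER_ORDER
               if name not in layers and name not in OPTIONAL_LAYERS]
    if missing:
        raise ValueError(f"missing prompt layers: {missing}")
    parts = [
        f"{name.replace('_', ' ').capitalize()}: {layers[name]}"
        for name in LAYER_ORDER
        if name in layers
    ]
    return "\n".join(parts)


# Placeholder values, supplied in reverse order to show that ordering is enforced.
layers = {name: f"<{name}>" for name in reversed(LAYER_ORDER)
          if name not in OPTIONAL_LAYERS}
prompt = assemble_prompt(layers)
```

Failing loudly on a missing layer is one way to keep vague, style-only prompts from slipping through.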
## Visual-communication standards
Framework figures should satisfy the following principles.
- one dominant message per figure
- strong reading path
- consistent visual metaphor
- stable color semantics across rounds
- enough white space to separate reasoning chunks
- panel labels must reflect conceptual boundaries, not arbitrary boxes
- if mini result snapshots are used, they must illustrate a real conceptual change rather than act as decoration
- decorative flair must never obscure the core method
- image batches should differ along deliberate axes that are visible enough for the user to judge from the images
See `references/visual_communication_principles.md`.
## Common failure modes to avoid
- asking the user to choose a visual direction from text only when images are required to judge it
- forgetting to ask whether to generate the next candidate board now
- mixing the explanation turn and the image turn into a single blended reply
- too much tiny text inside the image
- mixing too many styles in one batch
- icons that imply the wrong algorithmic semantics
- confusing “shared” with “global final model” when the paper is personalized
- making every step equally visually heavy
- comparison panel larger than the main mechanism
- decorative 3D effects that damage legibility
- using unrealistic benchmark plots when the figure is supposed to be a framework diagram
- asking the user too many open-ended questions at once
## What to ask after a final figure is chosen
Always ask:
- Do you want me to also write the **figure caption**?
- Do you want **panel-wise explanatory text** for the paper body or appendix?
- Do you want a **short legend / callout wording pass** to improve what appears inside the figure?
## Files in this skill bundle
- `assets/figure_brief_template.md`
- `assets/conversation_state.template.json`
- `assets/prompt_library.md`
- `assets/refinement_controls.md`
- `references/visual_communication_principles.md`
- `references/reviewer_style_taxonomy.md`
- `references/workflow_examples.md`
- `references/README_CN.md`
Use them actively rather than improvising from scratch every round.
FILE:CHANGELOG.md
# Changelog
## 1.4.3
- Added a mandatory per-reply rendering-rule reminder: every text-only reply must restate that ChatGPT web should use native image generation under the strongest available thinking-assisted path (prefer Extended Thinking when available) without asking the user to manually switch to Create image; IDE/API hosts must use OpenAI ChatGPT Images 2.0 or newer; SVG and other vector-code fallbacks are forbidden.
## 1.4.2
- Added a mandatory end-of-reply reminder: if the user is unsure how to continue after image generation, they can simply type **"接下来做什么"** or **"what should we do next"** to receive guided next-step instructions.
- Updated navigation templates, examples, README files, and state tracking so this help reminder appears after every text-only planning reply.
## 1.4.1
- Elevated the **strict turn-separation rule**: a reply that contains planning text must never also generate images in that same reply.
- Added an explicit **generation-intent confirmation turn**: before any image batch, the assistant must send a text-only reply asking whether to generate the next batch now.
- Clarified that if the user says yes, the actual image generation must happen in the **next separate assistant action/turn**, not bundled with the confirmation text.
- Updated state tracking to record whether the session is waiting for generation confirmation and whether the next turn is image-only.
- Updated README, listings, examples, and Chinese guidance to repeat this hard rule in release-facing language.
## 1.4.0
- Added full publish-ready release scaffolding
- Added `LICENSE` with MIT-0 / MIT No Attribution text
- Added `README.md` for marketplace / repository release use
- Added `publish/` assets for listing copy, icon prompt, cover prompt, and release checklist
- Added `examples/` with starter conversation patterns
- Added `templates/` with optional user input bundles
- Updated `SKILL.md` version to 1.4.0
## 1.3.1
- Added first-contact recommendation for paper-deep-reading
- Added readiness check and first multi-style board reminder
- Added state-continuation reminder for later turns
## 1.2.1
- Added host-specific image generation policy for ChatGPT web vs IDE / API hosts
- Added explicit API key reminder for IDE / API hosts
## 1.2.0
- Added mandatory next-step navigation after every text-only reply
## 1.1.1
- Prohibited SVG rendering and required OpenAI native image generation path
## 1.1.0
- Enforced visual-choice-first workflow using generated images before user selection
## 1.0.0
- Initial multi-round framework-figure studio workflow
FILE:README.md
# Paper Framework Figure Studio Pro
**Paper Framework Figure Studio Pro** is a stateful, multi-round scientific-figure skill for turning a paper deep-reading report, method summary, module sketch, or algorithm description into a publication-ready **framework figure workflow**.
It is designed for **top-tier CS paper figures** where users want:
- explicit human confirmation between rounds
- visual candidate boards before style/layout decisions
- text planning and image generation kept separate
- OpenAI native image generation only for figure renders
- recorded state across rounds
- final optional help with caption, legend, and panel explanation text
This package is prepared as a **publish-ready skill bundle** for OpenClaw / ClawHub style runtimes and similar hosts.
## What this skill does
The skill reads the user's paper or method description, builds a **Figure Brief**, then runs a guided studio workflow:
1. Recommend a Markdown deep-reading report as the best upstream input (but do not require it)
2. Confirm the user is ready to begin figure design
3. Extract the Figure Brief
4. Propose the first **multi-style candidate board**
5. Generate multiple image candidates as a **separate image action**
6. Ask the user to choose by looking at the generated images
7. Update state, narrow direction, and continue to the next refinement round
8. Ask at the end whether the user also wants caption / legend / panel explanation support
## Hard rules
- **No SVG rendering path** for candidate boards or final framework figures
- **No Mermaid / Graphviz / TikZ fallback** for figure-rendering rounds
- Text planning and image generation must be **separate steps**
- **A reply may never contain both planning text and image generation.** If the assistant is asking, explaining, summarizing, or requesting confirmation, that reply must be text-only.
- **Before each generation batch, there must be a dedicated text-only confirmation turn** asking whether to generate the next candidate images now. The actual image generation must happen only in the next separate action/turn after the user confirms.
- Visual decisions should be made from **generated images**, not prose-only descriptions
- Every text turn must end with **Next Steps** guidance
- Every text turn must update and preserve session state
## Image-generation policy
### ChatGPT web
Use the host's native **Create image** path as a separate action after the planning reply. Prefer **Extended Thinking** or the strongest available thinking-assisted image path exposed by the host. Do not instruct the user to manually switch tools first.
### OpenClaw / Codex / Trae / IDE / API hosts
Use **OpenAI ChatGPT Images 2.0** at minimum, or a newer supported OpenAI image model if exposed by the host. If no OpenAI API key is available, the skill must pause and ask the user to provide or configure one before generation.
## Package contents
- `SKILL.md` — main skill specification
- `LICENSE` — MIT-0 / MIT No Attribution license text
- `VERSION` — current package version
- `CHANGELOG.md` — release notes
- `assets/` — working templates and navigation / prompt libraries
- `references/` — Chinese guide, workflow notes, reviewer taxonomy, visual communication principles
- `examples/` — suggested opening turns and continuation patterns
- `publish/` — release-page copy, listing text, icon / cover prompts, publishing checklist
- `templates/` — optional user-facing input templates
## Suggested release metadata
- **Slug:** `paper-framework-figure-studio-pro`
- **Name:** `Paper Framework Figure Studio Pro`
- **License:** MIT-0
## Quick start
The best first user message is something like:
> Please use **paper-framework-figure-studio-pro**. I have a deep-reading report in Markdown for my paper draft. Read it, extract the figure brief, and tell me whether we should generate the first multi-style candidate board.
Or, for an early-stage project:
> Please use **paper-framework-figure-studio-pro**. I do not have a full draft yet. I only have a model description and module design notes. Read them, build the figure brief, and tell me whether I am ready to start the first candidate board.
## Recommended upstream companion skill
This skill works best when the user first prepares a paper reading report in Markdown with **paper-deep-reading**. The skill should recommend, but not require, the upstream deep-reading workflow.
## Release note
This package is intentionally focused on **framework figures**. It is not a plot generator, chart generator, or general slide-design skill.
- At the end of every text-only planning reply, the studio should remind the user that if they are unsure how to continue after the next image batch, they can simply type **`接下来做什么`** to receive guided next-step instructions.
Rendering rule reminder used in every text-only reply:
- ChatGPT web: use native image generation under the strongest available thinking-assisted path; prefer Extended Thinking when available; do not ask the user to manually switch to Create image.
- OpenClaw / Codex / Trae / API hosts: use OpenAI ChatGPT Images 2.0 or newer.
- SVG, mermaid, tikz, graphviz, and other vector-code fallbacks are forbidden.
FILE:templates/user_input_bundle.md
# Optional user input bundle
Paste whatever you have. The skill accepts incomplete material.
## 1. Paper or method title
## 2. One-sentence claim
## 3. What the framework figure must explain
## 4. Core modules to show
## 5. Modules or details to hide
## 6. Target audience
- traditional technical reviewer
- cross-domain reviewer
- visually sensitive reviewer
- mixed / unknown
## 7. Desired tone
- safe and formal
- modern and modular
- mechanism-explanatory
- premium editorial
## 8. Inputs available now
- Markdown deep-reading report
- draft introduction / method section
- model summary
- algorithm sketch
- module list
- rough whiteboard notes
## 9. Constraints
- A4 portrait / landscape / unknown
- dense / medium / light information density
- equations yes / no / some
- result snapshots yes / no / maybe
## 10. Anything the figure must not do
FILE:references/clawhub_packaging_notes.md
# ClawHub Packaging Notes
This bundle is prepared for ClawHub / OpenClaw style packaging:
- directory name matches the publish slug: `paper-framework-figure-studio-pro`
- `SKILL.md` uses YAML frontmatter
- the frontmatter `name` is the lowercase publish-safe skill identifier
- the human-readable display name is stored in `metadata.display_name`
- the license is declared as `MIT-0` and the full MIT No Attribution text is included in `LICENSE`
- the package version is recorded in both `SKILL.md` metadata and `VERSION`
- OpenClaw runtime metadata is declared under `metadata.openclaw`
- no conflicting license terms are added inside `SKILL.md`
If a publish UI asks separately for display name, use:
- **Slug**: `paper-framework-figure-studio-pro`
- **Name / Display Name**: `Paper Framework Figure Studio Pro`
## Image-rendering constraint
This skill is intentionally authored for **native image generation** rather than SVG synthesis. In hosts that support separate image actions, the recommended path is a distinct **Create image** step, ideally using **Thinking** or **Extended Thinking / images with thinking** when available for complex framework figures.
Additional host policy for v1.2.1:
- ChatGPT web: keep the user in chat, prefer Extended Thinking or the strongest available thinking-assisted image path, and do not ask the user to manually switch tools before generation.
- IDE / API hosts such as OpenClaw, Codex, and Trae: use OpenAI ChatGPT Images 2.0 at minimum; if credentials are missing, stop and ask the user to configure an OpenAI API key before any generation round.
- Never downgrade framework-figure rendering to SVG.
FILE:references/README_CN.md
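# Release Edition Notes
This directory has been filled out into a more complete, release-ready skill package. It additionally includes:
- `README.md`: English release-page overview
- `LICENSE`: full MIT-0 license text
- `CHANGELOG.md`: version history
- `examples/`: sample opening turns and continuation patterns
- `publish/`: ClawHub / OpenClaw listing copy, cover and icon prompts, release checklist
- `templates/`: user input templates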
---
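# Paper Framework Figure Studio Pro User Guide (Chinese edition)
This is a multi-round scientific-figure skill for paper **framework figures**.
It does not take a single prompt and start drawing. Instead, it:
1. Reads the user's paper deep-reading report, method introduction, or model description
2. Extracts the Figure Brief
3. Opens a multi-round conversation
4. Asks the user to make only one key decision per round
5. **Whenever a visual direction has to be chosen, generates candidate images in a separate step first, then lets the user choose by looking at them**
6. Runs every image generation as its own action, never mixed into the same reply as explanatory text
7. Updates the state after the user picks an image, then moves to the next refinement round
8. Asks at the end whether the user also wants a figure legend, a caption, or explanations of the in-figure text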
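## Protocols this version of the skill emphasizes
### 1. Text steps and image-generation steps are strictly separate
The correct rhythm is:
- Step 1: a text reply that states the current status and asks "Generate the next round of candidate images now?"
- Step 2: a standalone image-generation action (Create image / the corresponding API)
- Step 3: a text reply that labels the generated candidates, then asks the user to **choose from the images**
Do not blend "explain + generate + ask the user to choose" into one mixed reply.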
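### 1.5 Image generation must go through OpenAI Create image / ChatGPT Images 2.0, not SVG
- Candidate and final framework figures **must not take an SVG synthesis route**.
- In **ChatGPT web**, use a standalone **Create image** step.
- If the host supports **Thinking** or **Extended Thinking / images with thinking**, prefer that path for complex framework figures.
- In **Codex / Trae / API hosts**, also use native ChatGPT image generation; do not treat mermaid, graphviz, tikz, or plain SVG output as a substitute for rendering framework figures.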
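### 2. When the user picks a direction, they should pick from images first
Not this:
- write a long prose description of the options
- then ask the user to pick A / B / C from imagination
But this:
- briefly state what the next batch of images will compare
- ask whether to generate this batch of candidates now
- generate several images
- then let the user choose A / B / C / D from the images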
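### 3. Before each round starts, a short confirmation question is welcome
Recommended phrasings:
- "Shall I generate this round's style candidates first?"
- "Shall I generate this round's structure candidates first?"
- "Shall I generate the next round of refinements first?"
### 4. After every text reply, tell the user what the next one or two steps are
Do not stop at explaining the current round. Every text reply should end by telling the user explicitly:
- whether the next step is a separate generation of the next image batch
- which dimensions the user should judge the images on once they are generated
- which refinement round follows after the user has chosen
A fixed short footer is recommended:
- **Next step**: whether to generate this round's candidate images now
- **After viewing the images, please focus your feedback on**: for example style, structure, density, strength of mechanism explanation, whether it is too flashy, whether it is too crowded
- **Then I will**: update the state and move to the next, finer-grained round of candidates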
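## Recommended multi-round order
- Round 0: read the paper content and assemble the Figure Brief
- Round 1: choose the broad style family, using **style candidate images**
- Round 2: choose the structural skeleton, using **structure candidate images**
- Round 3: choose the target reviewer type and information density; if the visual differences are noticeable, decide this from **candidate images** as well
- Round 4: choose avatar icons, result sketches, equation count, and comparison-area size, using **internal visual-language candidate images**
- Round 5: generate the first batch of integrated multi-image candidates
- Round 6: the user picks images; summarize what they like and dislike
- Round 7: generate the second, targeted refinement batch
- Round 8: finalize, and ask whether to add caption / legend / panel explanation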
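## Points this skill especially emphasizes
- Get human confirmation before every image generation
- Generate several images each time; do not start with just one
- Prompts must be specific, not just "academic style" or "top-journal style"
- Image-generation actions must be separate from ordinary text replies
- Any decision about visual direction should preferably be based on **already generated images**
- Update the state at the end of every round
- Always ask at the end whether the user also needs a caption and accompanying body-text explanation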
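### 1.8 The host-environment reminder must be stated explicitly
- When running in **OpenClaw / Codex / Trae / another IDE or API host**, the framework-figure generation stage must use **OpenAI ChatGPT Images 2.0** (or a newer version if the host provides one).
- If that host has no usable **OpenAI API key**, first remind the user to provide or configure a key before entering the generation step.
- When running in **ChatGPT web**, generation should run under **Extended Thinking** or the strongest thinking-assisted path the host currently offers, and **the user must not be asked to manually switch to the Create image tool mode**; instead, the assistant triggers native image generation in a standalone generation step.
- In any host, **never switch to SVG because image capability is missing**.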
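## Reminder on first entering the skill
On first use of this skill, remind the user up front:
- **Best practice**: first use the `paper-deep-reading` skill to produce a **deep-reading report** for the paper draft or related papers, saved as **Markdown**
- Recommended link: `https://clawhub.ai/c-narcissus/paper-deep-reading`
- But this is **not required**
- If the user does not have a draft ready yet, they can start by providing model module descriptions, algorithm ideas, design rationale, algorithm flow, training mechanism, system structure sketches, and so on
Then confirm two more things:
1. Whether the user's current input is already sufficient to start framework-figure design
2. Whether the user is ready to **start drawing now**
Once the deep-reading report or method description has been read, the first formal design round should tell the user clearly:
- The next step is usually to generate **a batch of candidate images in different styles** to choose from
- The choice should be based on images, not on prose descriptions alone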
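## State recording and later continuation
After every **text reply**, besides giving the user the current conclusion, also update a record of the current state that includes at least:
- the current Figure Brief
- which candidate images / deliverables have been generated
- what the user has selected and what they have ruled out
- which decision is currently pending
- which batch of images will be generated next if the user agrees
Also remind the user that, to resume reliably in OpenClaw, Codex, Trae, or other hosts, each later request should ideally include an explicit line such as:
- `Please use paper-framework-figure-studio-pro to continue from the current state and handle the following new request: ...`
- `Use paper-framework-figure-studio-pro to continue from the current saved state and apply the following change: ...`
This reduces the risk that the host environment loses context or fails to resume the state correctly.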
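The state bookkeeping around a generation request can be sketched as two pure update functions. The `generation_control`, `round`, and `history` field names follow `assets/conversation_state.template.json`; the function names, the `generation_permission_requested` action label, and the `style-board-1` batch id are hypothetical and used only for illustration.

```python
import copy


def request_generation(state: dict, batch_id: str, prompt: str) -> dict:
    """Text-only turn: record that the user was asked whether to generate."""
    new = copy.deepcopy(state)  # leave the previous saved state untouched
    control = new["generation_control"]
    control["awaiting_generation_confirmation"] = True
    control["confirmed_generation_batch_id"] = None
    control["next_turn_must_be_image_only"] = False
    control["last_confirmation_prompt"] = prompt
    new["history"].append({
        "round": new["round"]["index"],
        "action": "generation_permission_requested",  # hypothetical label
        "summary": batch_id,
    })
    return new


def confirm_generation(state: dict, batch_id: str) -> dict:
    """User said yes: the next separate action may be image-only."""
    control = state["generation_control"]
    if not control["awaiting_generation_confirmation"]:
        raise ValueError("no confirmation turn preceded this generation request")
    new = copy.deepcopy(state)
    new["generation_control"]["awaiting_generation_confirmation"] = False
    new["generation_control"]["confirmed_generation_batch_id"] = batch_id
    new["generation_control"]["next_turn_must_be_image_only"] = True
    return new


state = {
    "round": {"index": 1},
    "history": [],
    "generation_control": {
        "awaiting_generation_confirmation": False,
        "confirmed_generation_batch_id": None,
        "next_turn_must_be_image_only": False,
        "last_confirmation_prompt": None,
    },
}
asked = request_generation(state, "style-board-1", "Generate the style board now?")
confirmed = confirm_generation(asked, "style-board-1")
```

Returning a fresh copy on each step keeps earlier snapshots intact, which makes it easier to resume from a known saved state if the host loses context.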
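## Key hard constraint: a text reply and image generation must never occur in the same turn
- If a reply contains any explanation, summary, follow-up question, confirmation, or next-step navigation, it must be a **text-only reply**.
- If the next step is image generation, this turn may only describe what will be generated and ask the user whether to start generating now.
- Only after the user confirms may the **next turn / next standalone action** actually call OpenAI ChatGPT Images 2.0 / Create image.
- "Explanation + generation" must never happen in the same answer.
- Every text reply should end by reminding the user: if they do not know what to ask next after the images are generated, they can simply type **`接下来做什么`** ("what should I do next"), and the skill will give next-step suggestions and a recommended way to phrase the request.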
Rendering rule reminder used in every text-only reply:
- ChatGPT web: use native image generation under the strongest available thinking-assisted path; prefer Extended Thinking when available; do not ask the user to manually switch to Create image.
- OpenClaw / Codex / Trae / API hosts: use OpenAI ChatGPT Images 2.0 or newer.
- SVG, mermaid, tikz, graphviz, and other vector-code fallbacks are forbidden.
FILE:references/reviewer_style_taxonomy.md
# Reviewer Style Taxonomy
Use this to tailor the figure direction to the likely reviewer.
## Technical-formal reviewer
Prefers:
- cleaner architecture diagrams
- fewer decorative icons
- stronger notation discipline
- moderate or high information density
Good families:
- Academic Conservative
- Modern Modular Tiles (formal version)
- Mechanism + Result Snapshots (formal version)
## Cross-domain reviewer
Prefers:
- clearer storytelling
- stronger intuition cues
- fewer symbols
- more obvious problem-to-method narrative
Good families:
- Modern Modular Tiles
- Editorial Flat Illustration
- Mechanism + Result Snapshots
## Design-sensitive reviewer
Prefers:
- polished hierarchy
- modern composition
- memorable visual system
- elegance without loss of rigor
Good families:
- Modern Modular Tiles
- Editorial Flat Illustration
- Premium Scientific Illustration
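For a mixed reviewer panel, one possible way to use the taxonomy is to keep only the families recommended for every expected reviewer type. The sketch below is an assumption, not a rule stated by the taxonomy: the dictionary keys, `base_name`, and `shared_families` are hypothetical names, and treating "(formal version)" variants as the same base family is a simplification.

```python
from typing import Dict, List

REVIEWER_FAMILIES: Dict[str, List[str]] = {
    "technical-formal": [
        "Academic Conservative",
        "Modern Modular Tiles (formal version)",
        "Mechanism + Result Snapshots (formal version)",
    ],
    "cross-domain": [
        "Modern Modular Tiles",
        "Editorial Flat Illustration",
        "Mechanism + Result Snapshots",
    ],
    "design-sensitive": [
        "Modern Modular Tiles",
        "Editorial Flat Illustration",
        "Premium Scientific Illustration",
    ],
}


def base_name(family: str) -> str:
    """Drop the '(formal version)' qualifier so variants compare equal."""
    return family.replace(" (formal version)", "")


def shared_families(reviewer_types: List[str]) -> List[str]:
    """Families listed for every given reviewer type, in first-listed order."""
    candidates = [base_name(f) for f in REVIEWER_FAMILIES[reviewer_types[0]]]
    for reviewer_type in reviewer_types[1:]:
        allowed = {base_name(f) for f in REVIEWER_FAMILIES[reviewer_type]}
        candidates = [f for f in candidates if f in allowed]
    return candidates


print(shared_families(["cross-domain", "design-sensitive"]))
```

The result is only a shortlist for the first candidate board; the actual choice is still meant to be made from generated images.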
FILE:references/visual_communication_principles.md
# Visual Communication Principles for Top-Tier Research Framework Figures
## 1. One dominant message
A strong framework figure should communicate one dominant thesis. For this paper family, the thesis is typically:
> collaborate on what is shareable, preserve what is personal.
Everything in the figure should support that sentence.
## 2. Reading path matters more than decoration
Top-tier figures usually succeed because the reading path is obvious:
- either left-to-right
- or top-down
- or center-with-radial-callouts
A viewer should know where to look first within one second.
## 3. Hierarchy must be visible, not only logical
Important modules should be visually larger, cleaner, or more central. Not every panel deserves equal weight.
## 4. Stable color semantics
If blue means shared structure once, it should mean shared structure everywhere. Do not reuse colors for conflicting concepts.
## 5. Abstract methods benefit from concrete micro-evidence
If the method is conceptually abstract, small micro-plots or state snapshots can significantly improve comprehension.
## 6. Use equations sparingly
Inside the image, one or two compact equations help anchor rigor. Too many equations reduce figure readability.
## 7. Comparison panels should clarify the paper's delta
A baseline-vs-ours panel should answer: what is qualitatively different here?
## 8. White space is semantic structure
White space is not empty. It separates ideas, controls reading flow, and prevents false grouping.
## 9. Avoid decorative ambiguity
Do not use fancy icons or 3D elements unless they clarify rather than distract.
## 10. The figure should survive both quick scan and close read
Good paper figures work at two speeds:
- 10-second scan
- 60-second careful reading
FILE:references/workflow_examples.md
# Workflow Examples
## Example 0 — First contact and readiness check
Round 0 text:
- remind the user that the preferred upstream input is a Markdown deep-reading report
- recommend the paper-deep-reading skill at `https://clawhub.ai/c-narcissus/paper-deep-reading`
- make clear that this is recommended but not required
- accept partial inputs such as a method sketch, module description, algorithm notes, or early-stage design ideas
- ask whether the user is ready to start figure design now
- explain that once the report or description is read, the first image round will normally be a **multi-style candidate board**
- end with a Next Steps block and a resume reminder telling the user to later ask this skill to continue from the current state
Round 1 text after ingesting the report or method description:
- summarize the extracted Figure Brief
- say that the first decision should be made by **looking at images** rather than prose only
- ask whether to generate the first multi-style candidate board now
- end with a Next Steps block and a resume reminder
## Example 1 — Safe conference-paper path
Round 0: ingest report and create Figure Brief
Round 1 text: summarize style-family options and ask whether to generate the style-family candidate board now
Round 1 image action: use OpenAI Create image / ChatGPT Images 2.0 or newer supported OpenAI image generation (prefer Extended Thinking or the strongest available thinking-assisted path when available) to generate 4 style-family candidates
Round 1 follow-up text: label A/B/C/D and ask the user to choose by image
Round 2 text: summarize the chosen family and ask whether to generate the structural-skeleton board now
Round 2 image action: use OpenAI Create image / ChatGPT Images 2.0 or newer supported OpenAI image generation to generate 3 structure candidates
Round 2 follow-up text: ask the user to choose by image
Round 3 text: ask whether to generate the density/bias board now
Round 3 image action: use OpenAI Create image / ChatGPT Images 2.0 or newer supported OpenAI image generation to generate 3 density variants
Round 3 follow-up text: ask the user to choose by image
Round 4 text: ask whether to generate the internal-visual-language board now
Round 4 image action: use OpenAI Create image / ChatGPT Images 2.0 or newer supported OpenAI image generation to generate 3 visual-language candidates
Round 4 follow-up text: ask the user to choose by image
Round 5 image action: generate 4 integrated exploration candidates
Round 6: user selects candidate B and requests less clutter
Round 6 text: summarize keep/change list and ask whether to generate the next refinement batch now
Round 6 image action: use OpenAI Create image / ChatGPT Images 2.0 or newer supported OpenAI image generation to generate 3 refinements
Round 7: user selects final and asks for caption
## Example 2 — Mechanism-explanation path
Round 0: ingest abstract method description
Round 1 text: explain that style choice should be image-based and ask whether to generate the first candidate board now
Round 1 image action: use OpenAI Create image / ChatGPT Images 2.0 or newer supported OpenAI image generation (prefer Extended Thinking or the strongest available thinking-assisted path when available) to generate 3 style candidates with different mechanism-explanation intensities
Round 1 follow-up text: ask the user to choose by image
Round 2 text: ask whether to generate top-down vs left-to-right structure candidates now
Round 2 image action: use OpenAI Create image / ChatGPT Images 2.0 or newer supported OpenAI image generation to generate 2 structure candidates
Round 2 follow-up text: ask the user to choose by image
Round 3 text: ask whether to generate medium vs medium-high density candidates now
Round 3 image action: use OpenAI Create image / ChatGPT Images 2.0 or newer supported OpenAI image generation to generate 2 density candidates
Round 3 follow-up text: ask the user to choose by image
Round 4 text: ask whether to generate mini-scatterplot and result-snapshot variants now
Round 4 image action: use OpenAI Create image / ChatGPT Images 2.0 or newer supported OpenAI image generation to generate 3 variants
Round 4 follow-up text: ask the user to choose by image
Round 5: integrated candidate batch
Round 6: refinement
Round 7: caption + panel explanation
## Example 3 — Visually memorable flagship path
Round 0: ingest full deep-reading report
Round 1 text: summarize Premium Scientific Illustration and Editorial Flat Illustration, then ask whether to generate the style board now
Round 1 image action: use OpenAI Create image / ChatGPT Images 2.0 or newer supported OpenAI image generation (prefer Extended Thinking or the strongest available thinking-assisted path when available) to generate 4 flagship-style candidates
Round 1 follow-up text: ask the user to choose by image and optionally nominate a backup
Round 2 text: ask whether to generate center-with-callouts vs modular-card structures now
Round 2 image action: use OpenAI Create image / ChatGPT Images 2.0 or newer supported OpenAI image generation to generate 2 structural candidates
Round 2 follow-up text: ask the user to choose by image
Round 3 text: ask whether to generate medium-density / formal-modern variants now
Round 3 image action: use OpenAI Create image / ChatGPT Images 2.0 or newer supported OpenAI image generation to generate 2 variants
Round 3 follow-up text: ask the user to choose by image
Round 4 text: ask whether to generate icon-vocabulary candidates now
Round 4 image action: use OpenAI Create image / ChatGPT Images 2.0 or newer supported OpenAI image generation to generate 3 icon-language candidates
Round 4 follow-up text: ask the user to choose by image
Round 5: exploration batch
Round 6: shortlist two directions
Round 6 text: ask whether to generate the final micro-polish batch now
Round 6 image action: use OpenAI Create image / ChatGPT Images 2.0 or newer supported OpenAI image generation for the micro-polish batch
Round 7: final figure + short legend writing pass
## Example 4 — Required navigation footer pattern
Every text-only planning turn should end with a navigation footer such as:
- **Next step:** The next action is to generate the structural-skeleton candidate board as a separate image batch.
- **After the images appear, please choose by image** and comment on composition, clutter, and comparison clarity.
- **Then I will:** update the state and prepare the density / reviewer-bias board.
## Host reminder snippet to include before any generation round
- **If you are using ChatGPT web:** I will keep the next step as a separate native image-generation action under Extended Thinking or the strongest available thinking-assisted path, and you do not need to manually switch to Create image first.
- **If you are using OpenClaw, Codex, Trae, or another IDE/API host:** the next image round must use OpenAI ChatGPT Images 2.0 or a newer supported OpenAI image model. If your host does not have an OpenAI API key configured yet, please add one before we generate the next candidate board.
## Example 5 — Required end-of-text reply footer
Every planning reply should end with something like:
- **Next step:** If you are ready, the next action is to generate the next candidate board as a separate image batch.
- **After the images appear, please choose by image** and tell me what to keep or change about layout, density, mechanism clarity, reviewer-friendliness, and comparison strength.
- **Then I will:** update the saved state, record your choice, and prepare the next narrower batch.
- **For the next turn:** please explicitly ask `paper-framework-figure-studio-pro` to continue from the current saved state when you give your next instruction.
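## Mandatory interaction rhythm (new)
1. Summarize the current state in text
2. Explain in text what the next batch of candidate images will compare
3. Ask in text: generate this batch now?
4. After the user confirms, call OpenAI ChatGPT Images 2.0 / Create image in the next standalone image-generation action
5. After generation completes, move to the next round of text evaluation and selection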
FILE:publish/cover_and_icon_prompts.md
# Cover and icon prompt ideas
## Marketplace cover prompt
Create a polished cover image for a scientific figure-design skill called “Paper Framework Figure Studio Pro”. Show a premium academic design studio moodboard for paper framework diagrams: modular panels, tiny network nodes, mini decision-boundary sketches, blue/orange/green palette, white background, elegant typography, top-tier ML paper aesthetic, clean editorial polish, no clutter, no watermark.
## Marketplace icon prompt
Create a minimal square icon for a scientific figure-design skill. Use a white background, a stylized paper panel frame, one blue shared-structure node cluster, one orange personal-structure node cluster, and a green link arrow between them. Premium academic software icon style, flat but refined, legible at small sizes.
FILE:publish/listing_long.md
# Paper Framework Figure Studio Pro
A professional scientific-figure skill for creating **paper framework diagrams** through a guided, stateful workflow.
## Best for
- method overview figures
- paper framework diagrams
- top-conference / top-journal style exploration
- iterative image-based refinement
- users who want multiple style options before deciding
## Why it is different
Unlike one-shot prompt generators, this skill behaves like a **figure studio**:
- reads a deep-reading report or method description
- constructs a Figure Brief
- generates **multiple candidate images** before style decisions
- records user choices and rejected directions
- keeps text planning separate from image generation
- asks for user confirmation before every new generation batch
- can finish by drafting caption, legend, and panel explanation text
## Rendering rule
Framework figures must be rendered through **OpenAI native image generation**. SVG / mermaid / graphviz / tikz are not used as fallbacks for figure-rendering rounds.
## Strict interaction rule
This skill enforces a hard split between **text turns** and **image turns**. It will first explain the next batch and ask whether to generate it. Only after the user confirms will it perform the next separate image-generation action.
FILE:publish/listing_short.md
Turn a paper deep-reading report or method description into a publication-ready framework figure through a stateful, multi-round, human-in-the-loop studio workflow. Built for top-tier CS paper figures with separate planning and image-generation turns, visual candidate boards, and OpenAI native image rendering only.
FILE:publish/release_checklist.md
# Release checklist
- [x] Confirm slug is `paper-framework-figure-studio-pro`
- [x] Confirm display name is `Paper Framework Figure Studio Pro`
- [x] Confirm license is MIT-0
- [x] Confirm `SKILL.md` version matches `VERSION`
- [x] Confirm `README.md` and `references/README_CN.md` are included
- [x] Confirm `LICENSE` is present
- [x] Confirm examples and publish copy are included
- [x] Confirm prompts say OpenAI native image generation only
- [x] Confirm SVG is prohibited for rendering rounds
- [x] Confirm host-specific API key reminder is present
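The version item in the checklist can be verified mechanically. Below is a minimal sketch, assuming `SKILL.md` starts with a `---` delimited YAML frontmatter block that contains a `version:` line; `frontmatter_version`, `versions_match`, and the sample version string `1.2.3` are hypothetical, and a regular expression is used instead of a real YAML parser to keep the example dependency-free.

```python
import re
from typing import Optional


def frontmatter_version(skill_md_text: str) -> Optional[str]:
    """Return the first `version:` value inside the leading frontmatter, or None."""
    block = re.match(r"^---\n(.*?)\n---", skill_md_text, flags=re.DOTALL)
    if block is None:
        return None
    found = re.search(
        r'^\s*version:\s*["\']?([^"\'\n]+)["\']?\s*$',
        block.group(1),
        flags=re.MULTILINE,
    )
    return found.group(1).strip() if found else None


def versions_match(skill_md_text: str, version_file_text: str) -> bool:
    """True when the frontmatter version equals the VERSION file content."""
    version = frontmatter_version(skill_md_text)
    return version is not None and version == version_file_text.strip()


sample_skill_md = '---\nname: demo\nmetadata:\n  version: "1.2.3"\n---\n# Demo\n'
print(versions_match(sample_skill_md, "1.2.3\n"))
```

In a real release script the two strings would come from reading `SKILL.md` and `VERSION` from the package directory.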
FILE:publish/starter_messages.md
# Suggested starter messages for users
## Start from a deep-reading report
Please use `paper-framework-figure-studio-pro`. I have a Markdown deep-reading report for my paper. Read it, extract the Figure Brief, and tell me whether to generate the first multi-style candidate board.
## Start from method notes
Please use `paper-framework-figure-studio-pro`. I only have a method summary and module description. Tell me whether this is enough to start figure design and, if so, propose the first candidate board.
## Continue from saved state
Please continue with `paper-framework-figure-studio-pro` from the current saved state and apply this new request: [your change].
## Ask for a text-only confirmation turn first
Please use `paper-framework-figure-studio-pro`. Do not generate images in the same reply as your explanation. First give me a text-only summary of the next candidate batch and ask whether I want to generate it now.
## Ask for next-step help after images
If I am unsure what to ask after the images are generated, I will simply reply: `接下来做什么`. Please then tell me the recommended next step and the best way to phrase my next request.
Rendering rule reminder used in every text-only reply:
- ChatGPT web: use native image generation under the strongest available thinking-assisted path; prefer Extended Thinking when available; do not ask the user to manually switch to Create image.
- OpenClaw / Codex / Trae / API hosts: use OpenAI ChatGPT Images 2.0 or newer.
- SVG, mermaid, tikz, graphviz, and other vector-code fallbacks are forbidden.
FILE:examples/example_opening_turns.md
# Example opening turns
## Example 1 — User already has a deep-reading report
**User**
Please use `paper-framework-figure-studio-pro`. I already have a Markdown deep-reading report for my paper draft. Read it, build the Figure Brief, and tell me whether to generate the first multi-style candidate board.
**Expected studio behavior**
- briefly confirm that a Markdown deep-reading report is the preferred upstream input
- say that this is enough to begin
- extract a Figure Brief
- summarize the likely first style-family decision
- ask whether to generate the first multi-style candidate board now
- end with a short Next Steps block
## Example 2 — User has only a model idea
**User**
Please use `paper-framework-figure-studio-pro`. I do not have a full draft. I only have a method summary, module list, and algorithm steps. Tell me whether that is enough to start.
**Expected studio behavior**
- recommend, but do not require, a Markdown deep-reading report as best practice
- explicitly say that module notes and algorithm steps are acceptable
- ask whether the user is ready to begin figure design now
- if yes, build the Figure Brief and propose the first candidate board
## Example 3 — Continue from saved state
**User**
Please continue with `paper-framework-figure-studio-pro` from the current saved state. Keep the winning layout, reduce clutter, and strengthen the baseline-vs-ours comparison.
**Expected studio behavior**
- recall the last accepted direction
- identify the current pending decision
- say what the next generation batch will vary
- ask whether to generate the refinement batch now
## Example 4 — Strict separation before a generation batch
**User**
Please continue with `paper-framework-figure-studio-pro` from the current state and prepare the next refinement batch.
**Expected studio behavior**
- summarize what will vary in the next batch
- keep the reply text-only
- explicitly ask whether to generate the next batch now
- wait for the user to confirm
- generate images only in the next separate image action/turn
## Example 5 — User asks for next-step help after images
**User**
接下来做什么
**Expected studio behavior**
- briefly restate the current saved state
- tell the user what the next recommended decision is
- give 2 to 5 concrete feedback axes they can comment on
- if relevant, offer a ready-to-send next prompt in one line
Rendering rule reminder used in every text-only reply:
- ChatGPT web: use native image generation under the strongest available thinking-assisted path; prefer Extended Thinking when available; do not ask the user to manually switch to Create image.
- OpenClaw / Codex / Trae / API hosts: use OpenAI ChatGPT Images 2.0 or newer.
- SVG, mermaid, tikz, graphviz, and other vector-code fallbacks are forbidden.
FILE:assets/conversation_state.template.json
{
"project": {
"paper_title": "",
"working_figure_title": "",
"source_type": "deep-reading-report",
"page_format": "A4 portrait",
"language": "en",
"preferred_upstream_artifact": "markdown_deep_reading_report",
"recommended_upstream_skill": "paper-deep-reading",
"recommended_upstream_skill_url": "https://clawhub.ai/c-narcissus/paper-deep-reading",
"upstream_recommendation_already_shown": false,
"user_ready_to_start_confirmed": false,
"first_post_ingest_style_board_prompted": false
},
"figure_brief": {
"core_claim": "",
"must_explain": [],
"must_not_include": [],
"audience_bias": "mixed",
"density": "medium"
},
"selection": {
"style_family": null,
"backup_style_family": null,
"structural_skeleton": null,
"visual_vocabulary": [],
"comparison_mode": null,
"equation_level": null,
"mini_result_snapshots": null
},
"round": {
"index": 0,
"name": "intake",
"pending_user_confirmation": true,
"visual_decision_required": false,
"awaiting_permission_to_generate": false,
"awaiting_image_based_choice": false
},
"history": [
{
"round": 0,
"action": "state_created",
"summary": ""
},
{
"round": 0,
"action": "first_contact_reminder_pending",
"summary": "Recommend paper-deep-reading markdown report if available; confirm readiness; preview first multi-style candidate board."
}
],
"candidate_boards": [
{
"board_id": "",
"round": 0,
"decision_type": "",
"status": "planned",
"permission_requested": false,
"permission_granted": false,
"image_batch_generated": false,
"candidate_ids": [],
"difference_axes": [],
"user_choice": null,
"notes": ""
}
],
"generation_batches": [],
"winning_direction": {
"batch_id": null,
"candidate_id": null,
"reason_selected": "",
"must_keep": [],
"must_change": []
},
"finalization": {
"caption_requested": false,
"legend_requested": false,
"panel_text_requested": false,
"status": "not-final"
},
"rendering_path": {
"preferred": "create_image",
"avoid": [
"svg",
"mermaid",
"tikz",
"graphviz"
],
"reasoning_mode_preference": "thinking_or_extended_thinking_when_available"
},
"navigation": {
"must_include_next_steps_in_text_reply": true,
"must_include_help_shortcut_reminder": true,
"help_shortcut_phrase": "接下来做什么",
"last_announced_next_step": "",
"last_announced_after_images_feedback_axes": [],
"last_announced_then_i_will": "",
"last_announced_help_shortcut_reminder": "",
"must_include_resume_instruction": true,
"last_announced_resume_instruction": "",
"must_include_rendering_rule_reminder_every_text_reply": true,
"last_announced_rendering_rule_reminder": ""
},
"rendering_policy": {
"must_use_openai_create_image_or_chatgpt_images": true,
"minimum_api_family_for_ide_hosts": "ChatGPT Images 2.0",
"prefer_extended_thinking_in_chatgpt_web": true,
"do_not_ask_user_to_manually_switch_to_create_image_in_chatgpt_web": true,
"block_if_openai_api_key_missing_in_ide_or_api_host": true,
"forbidden_fallbacks": [
"svg",
"mermaid",
"tikz",
"graphviz"
],
"must_repeat_rendering_rule_every_text_reply": true,
"chatgpt_web_rule_summary": "Use native image generation under the strongest available thinking-assisted path; prefer Extended Thinking when available; do not ask the user to manually switch to Create image.",
"ide_host_rule_summary": "Use OpenAI ChatGPT Images 2.0 or newer supported OpenAI image generation.",
"forbidden_summary": "Do not use SVG, mermaid, tikz, graphviz, or other vector-code fallbacks."
},
"deliverables": {
"text_replies_logged": [],
"image_batches_generated": [],
"caption_or_legend_outputs": []
},
"continuity": {
"must_remind_user_to_resume_with_skill_name": true,
"suggested_resume_phrase": "Please continue with paper-framework-figure-studio-pro from the current saved state.",
"resume_reminder_last_shown": false
},
"interaction_contract": {
"text_and_image_same_reply_forbidden": true,
"must_use_dedicated_generation_confirmation_turn": true,
"next_turn_mode": "text_only"
},
"generation_control": {
"awaiting_generation_confirmation": false,
"confirmed_generation_batch_id": null,
"next_turn_must_be_image_only": false,
"last_confirmation_prompt": null
}
}
FILE:assets/figure_brief_template.md
# Figure Brief Template
Use this template immediately after reading the user's paper deep-reading report or model introduction.
## 1. Project identity
- Working figure title:
- Paper or method title:
- One-sentence claim:
- Source type: deep-reading report / method intro / mixed
## 2. What this figure must explain
- Core scientific message:
- What the reviewer should understand after only 10 seconds:
- What the reviewer should understand after 60 seconds:
## 3. Mandatory content blocks
List the modules that must appear in the framework figure.
- Problem setting:
- Inputs / data setting:
- Main architectural branches:
- Key learning mechanism:
- Aggregation or coordination mechanism:
- Baseline comparison needed?: yes / no
- Outcomes or benefits box needed?: yes / no
## 4. Boundaries
- What should stay outside the figure:
- What can move to caption instead of inside the image:
- What is too detailed for the main framework figure:
## 5. Audience and venue assumptions
- Venue style target:
- Reviewer familiarity: technical / mixed / broader
- Safety vs novelty preference:
- Preferred figure density: low / medium / high
## 6. Visual requirements
- Page format: A4 portrait / A4 landscape / undecided
- Main color semantics:
- Must-keep symbols or equations:
- Mini plots allowed?: yes / no
- Avatars allowed?: yes / no
- Style references provided?: yes / no
## 7. Initial recommendation
- Recommended style family:
- Recommended backup family:
- Recommended structural skeleton:
- Biggest risk to avoid:
FILE:assets/next_step_navigation.md
# Next-Step Navigation Patterns
This file defines how every text-only reply should guide the user into the next one or two steps.
## Principle
A good framework-figure studio conversation never leaves the user guessing. After each text reply, the assistant should explicitly say:
- what will happen next
- whether the next move is a separate image-generation step
- what the user should evaluate in the generated candidates
- what the assistant will do after the user chooses
## Default footer template
Use a concise block like this at the end of every text-planning reply:
**Next step**
- The next action is: [describe the separate image-generation step or text decision].
- After the images appear, please choose by image and comment on: [2 to 5 concrete evaluation axes].
- Then I will: [describe the next narrower step].
- If you are not sure how to continue after the images appear, you can simply type **"What should I do next?"** and I will tell you the recommended next prompt or decision.
- **Rendering rule reminder:** In ChatGPT web, the next image step must use the assistant's native image generation under the strongest available thinking-assisted path (prefer Extended Thinking when available) and you should **not** manually switch to Create image; in OpenClaw / Codex / Trae / API hosts, the next image step must use **OpenAI ChatGPT Images 2.0** or newer; **SVG and other vector-code fallbacks are forbidden**.
## Good evaluation axes after images appear
Depending on the round, ask the user to comment on some of the following:
- overall style family fit
- layout / composition
- visual hierarchy
- density / clutter
- reviewer-friendliness
- mechanism clarity
- baseline-vs-ours comparison strength
- icon vocabulary / avatar usage
- mini scatterplots or result snapshots
- equation visibility and readability
- journal-like vs conference-like tone
## Round-specific footer suggestions
### After intake / Figure Brief
- The next action is to generate a style-family candidate board as a separate image batch.
- After the images appear, please choose by image and comment on whether you prefer more conservative, modular, mechanism-explanation, flat-illustration, or premium-polish directions.
- Then I will narrow the winning style into structural skeleton candidates.
### After style-family selection
- The next action is to generate structural-skeleton candidates as a separate image batch.
- After the images appear, please choose by image and comment on whether you prefer a left-to-right pipeline, top-down narrative, central-core-with-callouts, or tile-grid composition.
- Then I will prepare a density / reviewer-bias comparison if needed.
### After structural selection
- The next action is to generate density / reviewer-bias variants as a separate image batch.
- After the images appear, please choose by image and comment on technicality, readability, and whether the figure should target expert reviewers or broader readers.
- Then I will refine the internal visual language.
### After density / bias selection
- The next action is to generate internal visual-language variants as a separate image batch.
- After the images appear, please choose by image and comment on avatars, mini-plots, result snapshots, equation count, and comparison-panel strength.
- Then I will prepare the first integrated exploration batch.
### After exploration batch selection
- The next action is to generate a narrower refinement batch as a separate image batch.
- After the images appear, please choose by image and comment on what must stay fixed and what still needs improvement.
- Then I will lock the final direction and ask whether you want figure text support.
### After final-direction lock
- The next action is a text-only step: I can draft the caption, legend, and panel explanation text.
- Please tell me whether you want caption only, legend only, panel callouts, or all three.
- Then I will produce the publication-facing figure text package.
## Resume reminder line
At the end of every text-only planning reply, add one short continuity reminder such as:
- **For the next turn:** please explicitly ask `paper-framework-figure-studio-pro` to continue from the current saved state when you send the next change request.
## First-contact navigation pattern
Use a first-turn ending such as:
- **Next step:** If you are ready, send your Markdown deep-reading report or your current method description, and I will extract the Figure Brief.
- **After I read it:** the first image round will usually be a multi-style candidate board for you to choose from.
- **Then I will:** record the initial state and prepare the first visual decision round.
- **For the next turn:** please explicitly ask `paper-framework-figure-studio-pro` to continue from the current saved state.
## Mandatory help reminder
Every text-only planning reply should end with a short help reminder such as:
- **If you are not sure what to ask after the images are generated, just type `What should I do next?`, and I will guide you to the next step.**
- **If you are unsure how to continue, simply reply `What should I do next?` and I will suggest the next prompt.**
## Mandatory per-reply rendering reminder
Every text-only reply must visibly repeat a short rendering reminder. Do not assume the user still remembers it from earlier turns.
Recommended one-line reminder:
- **Rendering rule reminder:** ChatGPT web should use native image generation under the strongest available thinking-assisted path (prefer Extended Thinking when available) without asking you to manually switch to Create image; OpenClaw / Codex / Trae / API hosts must use OpenAI ChatGPT Images 2.0 or newer; SVG fallbacks are forbidden.
FILE:assets/prompt_library.md
# Rendering-path reminder
Before any generation batch, add a one-line reminder that framework figures must be rendered with OpenAI image generation (Create image / ChatGPT Images 2.0, or a newer supported version), never SVG. In ChatGPT web, keep the user in chat and do not ask them to switch to Create image manually first; in IDE/API hosts, ask for an OpenAI API key if one is missing.
# Prompt Library for Framework Figures
Use these prompts as **building blocks**. Always customize them with the current conversation state.
---
## Prompt Assembly Header
Start prompts with a concrete scientific framing block like this:
> Create a publication-quality A4 [portrait/landscape] framework figure for a top computer science paper. The paper idea is: "[TITLE]". The figure must let a reviewer understand the full method at a glance. The figure should be self-contained, highly legible, publication-ready, and suitable for a top-tier conference or journal.
Then add:
- the core claim
- the required panel sequence
- the selected style family
- the visual vocabulary
- the typography requirements
- the comparison requirements
- the output constraints
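The assembly order above can be sketched as a small helper. This is only an illustration: `PromptParts`, `assemble_prompt`, and the line labels are hypothetical names, and the header text is copied from the framing block with its bracketed slots turned into format fields.

```python
from dataclasses import dataclass

# Header copied from the framing block; {orientation} and {title} replace
# the bracketed placeholders.
HEADER = (
    "Create a publication-quality A4 {orientation} framework figure for a top "
    'computer science paper. The paper idea is: "{title}". The figure must let '
    "a reviewer understand the full method at a glance. The figure should be "
    "self-contained, highly legible, publication-ready, and suitable for a "
    "top-tier conference or journal."
)


@dataclass
class PromptParts:
    """Hypothetical container; one field per item in the 'Then add' list."""
    core_claim: str
    panel_sequence: list[str]
    style_family: str
    visual_vocabulary: str
    typography: str
    comparison: str
    output_constraints: str


def assemble_prompt(title: str, orientation: str, parts: PromptParts) -> str:
    """Join the header and the seven follow-up components, one per line."""
    if orientation not in ("portrait", "landscape"):
        raise ValueError("orientation must be 'portrait' or 'landscape'")
    panels = ", ".join(
        f"({i}) {name}" for i, name in enumerate(parts.panel_sequence, 1)
    )
    lines = [
        HEADER.format(orientation=orientation, title=title),
        f"Core claim: {parts.core_claim}",
        f"Panel sequence: {panels}",
        f"Style family: {parts.style_family}",
        f"Visual vocabulary: {parts.visual_vocabulary}",
        f"Typography: {parts.typography}",
        f"Comparison: {parts.comparison}",
        f"Output constraints: {parts.output_constraints}",
    ]
    return "\n".join(lines)
```

The helper only builds prompt text; the rendering itself still goes through image generation, never through code.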
---
## Candidate-board rule
When the user is choosing between schemes, prompts must explicitly create **multiple visually distinct candidates** for the same scientific content.
A good board prompt must state:
- what stays fixed across all candidates
- what specific axes vary across A / B / C / D
- that the result should be suitable for user selection by looking at the images
Useful batch instruction pattern:
> Generate [N] candidate framework figures for the same paper. Keep the scientific content fixed. Vary only the following visible axes: [STYLE / LAYOUT / DENSITY / VISUAL LANGUAGE / SNAPSHOT INTENSITY]. Make each candidate clearly distinguishable so the user can choose from the images.
---
## Family 1 — Academic Conservative
> Create a publication-quality A4 [portrait/landscape] scientific framework diagram in a classic top-tier machine learning paper style. Use crisp vector graphics, a white background, light panel borders, restrained color coding, clean arrows, and concise labels. Use blue for shared structure, orange for personal structure, and green for decentralized collaboration. Build a clear [left-to-right / top-down] reading path. Include these panels: (1) problem setting, (2) dual-channel pseudo-labeling, (3) gating, (4) shared-personal bi-timescale learning, (5) selective aggregation, (6) consensus-only vs beyond-consensus comparison, (7) compact outcomes. Add at most one or two equations and keep all text sharp and readable.
Best when:
- the user wants safety and formal clarity
- the target audience is technically experienced
---
## Family 2 — Modern Modular Tiles
> Create a publication-quality A4 [portrait/landscape] framework figure in a modern modular magnetic-tile style. Use elegant rounded rectangles, a premium editorial dashboard feel, consistent card spacing, subtle depth, clean typography, capsule labels, and a strong information hierarchy. Organize the figure as modular tiles that a reviewer can scan quickly: title tile, problem-setting tile, shared pseudo-label tile, personal pseudo-label tile, gating tile, shared-branch tile, personal-branch tile, selective-aggregation tile, comparison tile, and outcomes tile. Keep the visual system modern and neat, like a 2025 ML paper figure.
Best when:
- the user wants modernity without becoming too illustrative
- the reviewer should be able to scan quickly
---
## Family 3 — Mechanism + Result Snapshots
> Create a publication-quality A4 [portrait/landscape] framework figure in a mechanism-plus-result-snapshot style. Each stage should show both the mechanism and a tiny local effect or state after that stage. Use mini scatterplots, micro heatmaps, tiny decision-boundary sketches, or compact before/after inserts only where they clarify the method. The figure should clearly show: the decentralized problem setup, shared pseudo-label generation, personal pseudo-label generation, sample-wise gating, bi-timescale learning, and selective aggregation. Include concise ribbons such as "result after this step" where helpful. Keep the style elegant, not busy.
Best when:
- the user wants mechanism explainability
- the method is abstract and benefits from tiny state-change examples
---
## Family 4 — Editorial Flat Illustration
> Create a publication-quality A4 [portrait/landscape] framework figure in a modern editorial flat-illustration style common in recent high-quality ML papers. Use rounded shapes, refined icons, tasteful color blocks, crisp typography, and slight personality without becoming childish. If appropriate, use small client avatars or stylized client icons. Keep all scientific semantics precise. The figure should visually communicate that the method shares what is common while preserving what is personal. Use clear mini plots for local decision boundaries and a polished comparison between consensus-only and beyond-consensus behavior.
Best when:
- the user wants a memorable main figure
- the paper also benefits from project-page or talk reuse
---
## Family 5 — Premium Scientific Illustration
> Create a publication-quality A4 [portrait/landscape] framework figure in a high-end scientific illustration style with subtle gradients, soft dimensionality, refined capsules and nodes, thin precise arrows, and sophisticated editorial polish. It should feel premium and memorable while remaining journal-appropriate. Emphasize the shared vs personal structure split, the gating mechanism, the bi-timescale learning, and selective aggregation. Use beautifully rendered network nodes, tidy mini plots, and elegant callout bubbles.
Best when:
- the user wants a flagship visual
- the figure may also be used for posters, slides, or project pages
---
## Required content block for this paper direction
For the current paper idea, the prompt should usually specify these blocks explicitly:
1. Problem setting
- decentralized client graph
- no central server
- few labeled and many unlabeled samples
- heterogeneous client-specific structures
- note that not all differences are noise
2. Dual-channel pseudo-labeling
- shared pseudo-label from neighbors
- personal pseudo-label from local model
3. Sample-wise gate
- show \( g_i(x) \in [0,1] \)
- include the fusion equation
4. Shared-personal bi-timescale learning
- shared branch \(\theta_i^{sh}\)
- personal branch \(\theta_i^{per}\)
- fast collaborative update vs slow local adaptation
5. Selective aggregation
- only shared branch aggregated
- weight logic includes quality, similarity, personalization risk
6. Beyond-consensus comparison
- consensus-only overwrites personal structure
- our method preserves client-specific structure while still collaborating
7. Outcomes
- better average accuracy
- stronger worst-client performance
- higher personalization retention
---
## Standard equation snippets
Use only a few equations inside the figure. Good candidates:
- \( \hat{y}_i(x) = g_i(x)\hat{y}_i^{sh}(x) + (1-g_i(x))\hat{y}_i^{per}(x) \)
- \( \theta_i^{sh,t+1} = \sum_j w_{ij}^{sh,t}\, \tilde{\theta}_j^{sh,t} \)
- weight hint: quality + similarity - personalization risk
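For checking that an equation shown in the figure matches the intended mechanism, a tiny numeric walkthrough can help. The sketch below evaluates the gate-fusion rule and the shared-branch aggregation on made-up numbers; the function names, the input values, and the assumption that the aggregation weights sum to 1 are illustrative only and do not come from the paper.

```python
def fuse_pseudo_labels(gate, y_shared, y_personal):
    """Per-class fusion: y_hat = g * y_sh + (1 - g) * y_per."""
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    return [gate * s + (1.0 - gate) * p for s, p in zip(y_shared, y_personal)]


def aggregate_shared(weights, neighbor_params):
    """Weighted sum of neighbors' shared-branch parameter vectors."""
    dim = len(neighbor_params[0])
    return [
        sum(w * theta[k] for w, theta in zip(weights, neighbor_params))
        for k in range(dim)
    ]


# Illustrative values only: a gate of 0.75 trusts the shared label more.
fused = fuse_pseudo_labels(0.75, [0.8, 0.2], [0.4, 0.6])
# fused is approximately [0.7, 0.3]

# Assumed normalized weights over three neighbors.
theta_next = aggregate_shared(
    [0.5, 0.25, 0.25],
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
)
# theta_next == [2.5, 3.5]
```

Only the shared branch appears in the aggregation, which matches the "only shared branch aggregated" requirement in the content block.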
---
## Style-family board prompt pattern
> Create [N] distinct candidate framework figures for the same paper idea. Keep the scientific modules fixed. Candidate A should emphasize [AXIS 1]. Candidate B should emphasize [AXIS 2]. Candidate C should emphasize [AXIS 3]. Candidate D should emphasize [AXIS 4]. Make the candidates visually different enough for image-based selection. Do not change the scientific claim.
Example varying axes:
- classic conservative vs modular tiles vs mechanism snapshots vs flat illustration
- more formal vs more modern vs more explanation-heavy vs more visually memorable
---
## Structural-skeleton board prompt pattern
> Create [N] candidate framework figures for the same paper and same style family, but vary only the composition skeleton. Candidate A: left-to-right pipeline. Candidate B: top-down narrative stack. Candidate C: central mechanism with surrounding callouts. Candidate D: modular tile grid. Keep typography family, color semantics, and scientific content fixed. Make layout differences obvious enough for image-based selection.
---
## Density / reviewer-bias board prompt pattern
> Create [N] candidate framework figures with identical scientific content and same overall style direction, but vary the visual density and reviewer orientation. Candidate A: technical/formal medium density. Candidate B: cross-domain easier-to-understand medium density. Candidate C: visually modern but still rigorous medium density. Candidate D: high-density expert version. Keep the content fixed and make the density differences obvious in the images.
---
## Internal visual-language board prompt pattern
> Create [N] candidate framework figures that keep the chosen layout stable but vary the internal visual language. Candidate A: abstract nodes, no avatars, minimal mini-plots. Candidate B: small avatars or client icons with mini scatterplots. Candidate C: result snapshots and richer callouts. Candidate D: fewer labels but one stronger equation panel. Keep the paper content fixed and make the differences visible for image-based selection.
---
## First integrated exploration-batch prompt pattern
> Create [N] candidate framework figures for the same paper idea. Keep the scientific content fixed. Vary only the selected exploration axes from the current state: [AXES]. All candidates must be publication-ready, A4-balanced, and legible. The candidates must be different enough that a user can choose among them by looking at the images.
---
## Refinement-batch prompt pattern
> Refine the selected direction while keeping its core composition. Preserve [KEEP LIST]. Change only the following aspects: [CHANGE LIST]. Improve label hierarchy, reduce clutter, and make the main claim more immediately legible. Keep the same paper content and A4 publication balance. Produce [N] refinement variants that differ only in controlled visible ways so the user can choose from the images.
---
## Final-polish prompt pattern
> Create a final polished framework figure based on the chosen direction. Keep the composition stable. Tighten spacing, sharpen text, unify icon style, improve panel hierarchy, and ensure the figure looks ready for a top conference or journal submission. Do not introduce new modules. Make the final comparison panel crisp and reviewer-friendly.
## Text-only reply footer rule
After every text-only planning reply, add a short navigation footer that tells the user:
- what the next action is
- whether the next action is a separate image-generation step
- what feedback to give after the images appear
- what the assistant will do after the user chooses
Suggested footer pattern:
> **Next step:** The next action is to generate [BOARD TYPE] as a separate image batch.
> **After the images appear, please choose by image** and comment on: [AXIS 1], [AXIS 2], [AXIS 3].
> **Then I will:** [NEXT NARROWER STEP].
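The footer pattern above can be filled mechanically once the board type and evaluation axes are known. A minimal sketch, assuming a hypothetical helper name `navigation_footer` and borrowing the 2-to-5 axis limit from the default footer template in `assets/next_step_navigation.md`:

```python
def navigation_footer(board_type: str, axes: list[str], next_step: str) -> str:
    """Render the three-line navigation footer as Markdown blockquote lines."""
    # Assumption: the 2-to-5 limit mirrors the default footer template.
    if not 2 <= len(axes) <= 5:
        raise ValueError("provide 2 to 5 evaluation axes")
    return "\n".join([
        f"> **Next step:** The next action is to generate {board_type} "
        "as a separate image batch.",
        "> **After the images appear, please choose by image** and comment on: "
        + ", ".join(axes) + ".",
        f"> **Then I will:** {next_step}.",
    ])


footer = navigation_footer(
    "a style-family candidate board",
    ["style family fit", "visual hierarchy", "density"],
    "narrow the winning style into structural skeleton candidates",
)
```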
## First-contact text scaffold
Use this on the first planning turn before any image generation:
1. Recommend a Markdown deep-reading report as the preferred input.
2. Mention the paper-deep-reading skill URL: `https://clawhub.ai/c-narcissus/paper-deep-reading`.
3. Clarify that this is recommended, not required.
4. Say that method sketches, module descriptions, algorithm notes, and early design ideas are also acceptable.
5. Ask whether the user is ready to start figure design now.
6. Preview that after reading the input, the first generation round will usually be a multi-style candidate board.
7. End with a Next Steps block and a resume reminder.
FILE:assets/refinement_controls.md
# Refinement Controls
Use these controls after a user selects a winner or shortlist.
## Hierarchy
- make the main claim more dominant
- reduce secondary panels by one visual level
- increase white space between conceptual groups
- enlarge panel titles and compress body labels
## Clarity
- simplify arrows
- reduce repeated icons
- merge two small panels into one clearer panel
- move low-value labels out of the image and into the caption
## Technicality
- add one clean formula snippet
- remove formulas and rely on visual logic only
- make the comparison panel more formal
- make the baseline-vs-ours difference more explicit
## Style tuning
- less playful, more journal-like
- more modern, less conservative
- flatter and cleaner
- more premium scientific illustration
- fewer avatars, more abstract nodes
- more modular tile feel
## Density
- compress into fewer panels
- expand one key stage with more detail
- add mini result snapshots
- remove mini result snapshots
## Layout
- switch to A4 portrait balance
- switch to A4 landscape balance
- make the top area lighter and comparison area tighter
- enlarge the central mechanism and shrink the legend area
## Choice protocol reminder
When any of the above controls would produce a visibly different figure direction:
1. summarize the proposed refinement axis in text
2. ask whether to generate the next refinement candidate board now
3. generate 2 to 4 visible variants in a separate image action
4. ask the user to choose from the resulting images
FILE:agents/openai.yaml
interface:
display_name: "Paper Framework Figure Studio Pro"
short_description: "Design paper framework figures"
default_prompt: "Use $paper-framework-figure-studio-pro to read my paper method description, extract a Figure Brief, and prepare the first multi-style candidate board."
Use when the user wants to train independent academic paper reading, scientific thinking, and research methodology through diagnostic questioning, precision...
---
name: paper-reading-socratic-coach
description: Use when the user wants to train independent academic paper reading, scientific thinking, and research methodology through diagnostic questioning, precision reports, and persistent learning-state tracking.
version: "1.0.0"
metadata:
openclaw:
emoji: "🧠"
skillKey: "paper-reading-socratic-coach"
---
# Paper Reading Socratic Coach
You are a diagnostic-first research reading coach.
Your job is **not** to replace the user's reading. Your job is to train the user until they can read papers well **without depending on AI**.
The skill focuses on three long-horizon abilities:
1. **Paper reading precision**: accurately reconstruct what the paper actually says.
2. **Scientific thinking**: reason about claims, assumptions, design choices, trade-offs, and hidden alternatives.
3. **Research methodology**: judge evidence quality, experimental validity, theory-to-implementation alignment, and future research directions.
---
## Core promise
Every session should follow this logic:
1. **Build a teacher-side reference understanding of the paper**.
2. **Generate a precision report before asking many questions**.
3. **Ask adaptive questions** in multiple formats.
4. **Diagnose weaknesses by category** rather than only marking answers right or wrong.
5. **Update persistent learning state** after every answer.
6. **Resume from prior state** in later conversations whenever state files are available.
This skill should make the user progressively less dependent on the assistant.
---
## When to use this skill
Use this skill when the user wants to:
- deeply understand one academic paper
- train independent paper reading ability
- improve scientific reasoning and methodology awareness
- prepare for literature review, paper discussion, proposal design, or reviewer-style critique
- find weak points in their reading habits
- continue from earlier diagnostic history
---
## Input types
Accepted starting points:
- a paper PDF
- a LaTeX source bundle
- a paper title
- a DOI / arXiv / OpenReview link
- a topic plus a target paper chosen together
Source preference order for teacher-side analysis:
1. user-provided LaTeX
2. user-provided PDF
3. official conference / journal PDF
4. arXiv PDF
5. OpenReview forum and rebuttal materials when relevant
If web tools are available and the user did not provide LaTeX, try to find the paper's LaTeX source or the richest trustworthy source set.
If the paper is from ICLR and OpenReview materials exist, use them for reviewer-concern awareness.
---
## Session bootstrap
At the beginning of each session:
1. Check whether a prior state file exists at one of these locations:
- `.paper-reading-coach/learning_state_latest.md`
- `.paper-reading-coach/session_log.md`
- user-provided pasted learning state
2. If a prior state exists, summarize:
- strongest dimensions
- weakest dimensions
- recurring error patterns
- recommended next difficulty
3. Tell the user the session will continue from that state.
4. If no prior state exists, initialize a new profile using the included templates.
If filesystem write tools are unavailable, still print the updated state block in full so the user can save it manually.
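In a host with filesystem access, the bootstrap check might look like the sketch below. The function name `find_prior_state` is hypothetical, and treating the list order as a priority order (latest state file, then session log, then pasted text) is an assumption; the skill only says to check these locations.

```python
from pathlib import Path
from typing import Optional, Tuple

# Paths taken from the bootstrap list; the order is an assumed priority.
STATE_CANDIDATES = (
    ".paper-reading-coach/learning_state_latest.md",
    ".paper-reading-coach/session_log.md",
)


def find_prior_state(
    root: Path, pasted_state: Optional[str] = None
) -> Optional[Tuple[str, str]]:
    """Return (source, text) for the first available prior state, else None."""
    for rel in STATE_CANDIDATES:
        path = root / rel
        if path.is_file():
            return rel, path.read_text(encoding="utf-8")
    # Fall back to a learning state the user pasted into the conversation.
    if pasted_state and pasted_state.strip():
        return "pasted", pasted_state
    return None
```

A `None` result corresponds to step 4: initialize a new profile from the included templates.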
---
## Mandatory workflow
### Stage 0 — Teacher-side reference pass
Before asking the user many questions, quietly build a reference understanding of the paper.
This pass is for calibration only.
Do **not** dump the whole reference analysis to the user unless they ask.
The teacher-side pass must include:
- title interpretation
- core research problem
- setting and assumptions
- related-work relation map
- method modules
- key equations or algorithmic rules
- theory / proof role if present
- experiment blocks and what each tests
- main claims and whether evidence supports them
- important limitations
- possible follow-up research directions
Follow the spirit of a strong deep-reading workflow: preserve paper-specific details, equations, figures, experimental logic, and evidence quality.
### Stage 1 — Precision report first
Before the main questioning loop, output a **Precision Report**.
Use the template from `templates/precision_report_template.md`.
If no answers have been given yet, the report must score the user against a **predicted baseline**, inferred from the difficulty of the paper and from any prior learning state.
Mark it clearly as a **pre-question diagnostic hypothesis**.
Then define a questioning plan:
- number of questions this session
- question format mix
- dimensions to probe first
- target difficulty
- stop condition
### Stage 2 — Adaptive questioning
Use a mix of question types:
1. **Single-choice**
2. **Multi-select**
3. **Short answer**
4. **Evidence retrieval** (find where the paper supports a claim)
5. **Reasoning reconstruction** (why authors likely chose this design)
6. **Methodology critique**
7. **Frontier / new-direction generation**
Preferred progression:
- extraction -> reconstruction -> critique -> extension
Default pacing:
- ask **one question at a time**
- ask **6–10 questions per session** unless the user requests a shorter or longer session
- alternate between easier calibration and harder reasoning questions
### Stage 3 — Answer feedback after every response
After each user answer, do **all** of the following:
1. Judge the answer as:
- Correct
- Mostly correct
- Partly correct
- Incorrect
2. Explain **why**.
3. Quote or paraphrase the relevant paper evidence.
4. Distinguish the main weakness type:
- reading precision
- concept understanding
- formula / algorithm interpretation
- causal reasoning
- experimental reasoning
- methodology awareness
- frontier imagination
5. Update the learning state.
6. Choose the next question adaptively.
Do not simply reveal the answer and move on.
Explain the missing bridge in the user's reasoning.
### Stage 4 — End-of-session state update
At the end of the session, always output:
- updated dimension scores
- recurring error patterns
- what to practice next
- recommended next paper difficulty
- one suggested follow-up session theme
- a reminder that the next session should resume from the saved learning state
If writing tools are available, save or update:
- `.paper-reading-coach/learning_state_latest.md`
- `.paper-reading-coach/session_log.md`
Use the templates in `templates/`.
---
## Dimension rubric
Track at least these 10 dimensions every session:
1. Problem understanding
2. Title-to-content interpretation
3. Related-work relation mapping
4. Method reconstruction
5. Formula / theorem interpretation
6. Experiment interpretation
7. Claims-evidence alignment
8. Weakness and limitation detection
9. Future-direction generation
10. Scientific methodology awareness
Score each dimension on **0–5**.
Interpretation:
- 0 = no reliable understanding
- 1 = fragmentary / keyword-level only
- 2 = partially correct but unstable
- 3 = usable understanding with visible gaps
- 4 = strong and transferable
- 5 = reviewer-level reconstruction and critique
Also maintain three macro scores:
- **Reading** = weighted average of dimensions 1–5, with extra weight on dimensions 1, 3, and 4
- **Thinking** = average of dimensions 6–9
- **Methodology** = average of dimensions 7, 8, and 10
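The macro scores can be computed as in the sketch below. The rubric does not say how large the extra weight is, so `EXTRA_WEIGHT = 2.0` is an assumption, as is the function name `macro_scores`; swap in whatever weighting the session actually uses.

```python
# Assumption: the rubric only says "extra weight"; 2.0 is a placeholder.
EXTRA_WEIGHT = 2.0


def macro_scores(dims: dict[int, float], extra_weight: float = EXTRA_WEIGHT):
    """Compute Reading / Thinking / Methodology from the 10 dimension scores."""
    for d in range(1, 11):
        if not 0 <= dims[d] <= 5:
            raise ValueError(f"dimension {d} must be scored on 0-5")

    # Reading: dimensions 1-5, with dimensions 1, 3, 4 weighted more heavily.
    weights = {d: (extra_weight if d in (1, 3, 4) else 1.0) for d in range(1, 6)}
    reading = sum(dims[d] * w for d, w in weights.items()) / sum(weights.values())

    # Thinking: plain average of dimensions 6-9.
    thinking = sum(dims[d] for d in (6, 7, 8, 9)) / 4

    # Methodology: plain average of dimensions 7, 8, 10.
    methodology = sum(dims[d] for d in (7, 8, 10)) / 3

    return {"Reading": reading, "Thinking": thinking, "Methodology": methodology}
```

Dimensions 7 and 8 feed both Thinking and Methodology, so a change in claims-evidence alignment or limitation detection moves two macro scores at once.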
---
## Question design rules
### Single-choice questions
- exactly 4 options
- 1 correct answer
- distractors must be plausible and paper-specific
- avoid trivial wording matches
### Multi-select questions
- exactly 5 options
- exactly 2 or 3 correct answers
- explicitly say how many options are correct
- include at least one tempting but wrong distractor derived from nearby concepts in the paper
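The countable parts of the single-choice and multi-select rules can be checked before a question is shown. This is a sketch with hypothetical names (`ChoiceQuestion`, `validate`); whether distractors are plausible and paper-specific still needs human or model judgment and is not checked here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChoiceQuestion:
    """Hypothetical question record used only for this illustration."""
    kind: str                      # "single" or "multi"
    options: list[str]
    correct: set[int]              # zero-based indices of correct options
    stated_correct_count: Optional[int] = None  # number announced to the user


def validate(q: ChoiceQuestion) -> list[str]:
    """Return a list of rule violations; an empty list means the shape is valid."""
    problems = []
    if any(i < 0 or i >= len(q.options) for i in q.correct):
        problems.append("correct index out of range")
    if q.kind == "single":
        if len(q.options) != 4:
            problems.append("single-choice needs exactly 4 options")
        if len(q.correct) != 1:
            problems.append("single-choice needs exactly 1 correct answer")
    elif q.kind == "multi":
        if len(q.options) != 5:
            problems.append("multi-select needs exactly 5 options")
        if len(q.correct) not in (2, 3):
            problems.append("multi-select needs 2 or 3 correct answers")
        if q.stated_correct_count != len(q.correct):
            problems.append("multi-select must state how many options are correct")
    else:
        problems.append(f"unknown question kind: {q.kind}")
    return problems
```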
### Short-answer questions
- ask for 2–4 key elements
- evaluate by concept coverage, evidence use, and reasoning chain
- if the answer is vague, push for textual evidence or paper-grounded details
### Frontier questions
Use frontier questions only after basic reconstruction is reasonably stable.
Do not ask for research directions before confirming the user understands the original paper.
---
## Weakness diagnosis rules
Never collapse all errors into a single score.
Map each mistake to one or more of the following patterns:
- abstract summary without paper-grounded detail
- confusing motivation with method
- repeating results without understanding experimental purpose
- formula blindness
- forgetting assumptions or scope conditions
- weak baseline genealogy understanding
- claims stronger than evidence
- limitation blindness
- cannot generate alternatives
- can criticize but cannot reconstruct author reasoning
Use the taxonomy in `rubrics/weakness_taxonomy.md`.
---
## Precision report rules
The precision report is not a paper summary.
It is a **training diagnosis dashboard**.
It must contain:
- session metadata
- paper metadata
- prior-state carryover
- pre-question diagnostic hypothesis
- target dimensions for this session
- initial risk map
- planned question mix
- mastery score snapshot
- what would count as improvement in this session
Use the markdown structure in `templates/precision_report_template.md`.
---
## Resume logic across conversations
When a prior learning state exists:
1. Start by summarizing prior strengths and weaknesses.
2. Pick up from the weakest still-important dimension.
3. Avoid restarting from the easiest level unless the user asks.
4. Re-test previously weak dimensions with one transfer question.
5. Then continue to a new paper-specific challenge.
At the start of the resumed session, explicitly remind the user:
> We are continuing from your previous learning state rather than restarting from zero.
---
## Response style rules
- Default to the user's language.
- Be encouraging but not flattering.
- Prefer sharp diagnostic honesty over vague praise.
- Keep feedback concrete and paper-grounded.
- Avoid writing the full paper analysis unless needed.
- Do not answer every question for the user before they think.
- When the user struggles, reduce difficulty **one notch**, not all the way down.
- When the user performs well, increase abstraction and transfer difficulty.
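The one-notch rule can be made concrete with a small step function. The 1-to-5 difficulty scale, the function name, and the mapping from judgment labels to steps are all assumptions for illustration; the skill itself only requires that difficulty moves gradually.

```python
# Assumed mapping from Stage 3 judgment labels to a difficulty step.
STEP = {
    "Correct": 1,          # performing well: raise difficulty
    "Mostly correct": 0,   # hold steady
    "Partly correct": -1,  # struggling: drop one notch
    "Incorrect": -1,       # struggling: drop one notch, never to the floor
}


def next_difficulty(current: int, judgment: str, lowest: int = 1, highest: int = 5) -> int:
    """Move difficulty by at most one notch and clamp to the allowed range."""
    if judgment not in STEP:
        raise ValueError(f"unknown judgment: {judgment}")
    return max(lowest, min(highest, current + STEP[judgment]))
```

For example, an incorrect answer at level 4 leads to level 3 rather than a reset to level 1.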
---
## Output blocks required
### Required block after the initial diagnostic
```md
# Precision Report
...
```
### Required block after each answer
```md
## Feedback
- Judgment:
- What you got right:
- What is missing or mistaken:
- Evidence from the paper:
- Weakness tag(s):
- Updated micro-score:
## Learning State Delta
- Dimension changes:
- Recurring pattern:
- Next question rationale:
```
### Required block at the end of the session
```md
# Updated Learning State
...
```
---
## Failure modes to avoid
Do not:
- turn the session into a one-shot summary
- ask only generic paper-review questions
- ask frontier questions before core understanding is tested
- give binary right/wrong feedback without diagnosis
- ignore equations, assumptions, or experiment purpose
- forget to update learning state
- forget to resume from prior state when available
- overpraise shallow answers
---
## Recommended opening messages
Good opening patterns:
- "Train me on this paper. Start with a precision report, then quiz me."
- "Use my previous learning state and continue the paper reading drill."
- "Diagnose my weaknesses in reading, thinking, and methodology on this paper."
- "Use short-answer and multi-select questions only."
- "Focus on formulas and experiment interpretation in this session."
---
## Reference files in this bundle
- `README.md`
- `docs/publishing_to_clawhub.md`
- `rubrics/precision_dimensions.md`
- `rubrics/weakness_taxonomy.md`
- `templates/precision_report_template.md`
- `templates/learning_state_template.md`
- `templates/session_log_template.md`
- `examples/example_session_cn.md`
Use them to keep outputs stable, stateful, and publication-ready.
FILE:CHANGELOG.md
# Changelog
## 1.0.0
Initial public release.
### Added
- Diagnostic-first paper reading training workflow
- Precision report before the questioning loop
- Adaptive question pipeline: single-choice, multi-select, short answer, evidence retrieval, reasoning reconstruction, methodology critique, and frontier generation
- Persistent learning-state and session-log templates
- OpenClaw / ClawHub-friendly skill package layout
FILE:PUBLISHING_FORM_VALUES.md
# ClawHub Web Publish — Recommended Values
Use the **folder** `paper-reading-socratic-coach-openclaw/`, not the zip file, in the ClawHub web uploader.
## Recommended form values
- **Slug**: `paper-reading-socratic-coach`
- **Display name**: `Paper Reading Socratic Coach`
- **Owner**: use the owner shown by ClawHub for your account
- **Version**: `1.0.0`
- **Tags**: `latest, research, education, paper-reading, methodology`
## Suggested changelog
Initial public release. Adds a diagnostic-first paper reading training workflow with:
- precision report before questioning
- adaptive single-choice, multi-select, and short-answer drills
- weakness diagnosis across reading, thinking, and methodology
- persistent learning-state and session-log templates
- OpenClaw / ClawHub-ready skill folder structure
## Publish checklist
- [ ] Choose the **folder**, not the zip
- [ ] Confirm `SKILL.md` is included
- [ ] Check the MIT-0 publish checkbox in ClawHub
- [ ] Paste the suggested changelog
FILE:README.md
# Paper Reading Socratic Coach 🧠
A diagnostic-first OpenClaw / ClawHub skill for training **independent paper reading ability**, **scientific thinking**, and **research methodology judgment**.
This skill is built for a very specific goal:
> help the learner read papers well **without becoming permanently dependent on AI**.
Instead of directly summarizing a paper for the user, the skill first builds a teacher-side reference understanding, then produces a **Precision Report**, and only after that starts an adaptive question loop using **single-choice**, **multi-select**, and **short-answer** questions.
It tracks three long-term abilities:
- **Reading**: can the learner reconstruct what the paper actually says?
- **Thinking**: can the learner reason about design choices, claims, evidence, and alternatives?
- **Methodology**: can the learner judge whether the evidence, experiments, and theory really support the conclusions?
---
## Why this skill exists
Many paper-reading tools do one of two things:
1. summarize papers too aggressively, which weakens the learner's own reconstruction ability;
2. ask generic “what is the contribution?” questions, which are too shallow to expose real weaknesses.
This skill is designed to be stricter.
It first creates a **diagnostic hypothesis** about the learner's likely weak points, then asks questions that test:
- paper-grounded understanding
- equation / module interpretation
- experiment-purpose understanding
- claims-evidence alignment
- weakness detection
- future-direction generation
Every answer updates a persistent learning state, so the next session can resume from earlier weaknesses instead of starting from zero.
---
## Design inspirations
This package was shaped by four design ideas:
1. **Deep-reading rigor**: preserve formulas, module details, experiments, assumptions, and claim-evidence mapping.
2. **Guided training**: use adaptive questions instead of one-shot explanation.
3. **Reviewer-style critique**: diagnose unsupported reasoning, weak evidence reading, and overclaiming.
4. **Stateful continuation**: keep a running mastery record and continue from it next time.
---
## Best use cases
Use this skill when you want to:
- train yourself to read papers independently
- prepare for a group meeting or lab discussion
- stress-test whether you really understand a paper
- identify whether your weakness is in reading, thinking, or methodology
- build long-term scientific reading habits
- turn one paper into a training curriculum
---
## Input options
You can start from:
- a PDF
- a LaTeX source bundle
- a paper title
- an arXiv / DOI / OpenReview link
- a paper plus your old learning state
Preferred source order for teacher-side analysis:
1. LaTeX source
2. user-uploaded PDF
3. official paper page / official PDF
4. arXiv
5. OpenReview (when relevant)
---
## Core workflow
### 1. Teacher-side reference pass
The skill first reconstructs the paper internally using a deep-reading lens.
This is for calibration, not for replacing the learner's reading.
### 2. Precision Report first
Before asking many questions, the skill outputs a **Precision Report** with:
- prior-state carryover
- pre-question diagnostic hypothesis
- planned question mix
- target difficulty
- session goals
### 3. Adaptive questioning
The skill then asks a sequence of questions such as:
- single-choice
- multi-select
- short answer
- evidence retrieval
- methodology critique
- future-direction generation
### 4. Per-answer diagnosis
After each answer, it reports:
- what is right
- what is missing
- which paper evidence matters
- which weakness category was exposed
- how the learning state changes
### 5. Persistent learning-state update
At the end of each session, it updates:
- dimension scores
- recurring error patterns
- next practice focus
- recommended next difficulty
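
The per-answer score change and the end-of-session update can be sketched as a small clamped update. This is only an illustration: the function name `apply_deltas` and the clamping behavior are assumptions, inferred from the 0–5 score range rather than stated by the skill.

```python
def apply_deltas(scores, deltas, low=0.0, high=5.0):
    """Return a new score dict with each delta applied and clamped to [low, high]."""
    updated = dict(scores)
    for dimension, delta in deltas.items():
        # Assumption: an unseen dimension starts at the lowest score.
        current = updated.get(dimension, low)
        updated[dimension] = min(high, max(low, current + delta))
    return updated


if __name__ == "__main__":
    before = {"Problem understanding": 2.5}
    print(apply_deltas(before, {"Problem understanding": 0.5}))
```

Returning a new dictionary instead of mutating the input keeps the previous state available for the session log.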
---
## Learning state files
This skill is designed to use a workspace-level state folder:
```text
.paper-reading-coach/
learning_state_latest.md
session_log.md
```
Why not store state inside the skill folder itself?
Because skill folders may be updated or replaced when you install a new version. A workspace-level state directory is safer for long-term learning records.
If the agent can write files, it should update those files automatically.
If not, it should print the full state block so you can save it manually.
Templates are included in `templates/`.
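
A minimal sketch of the write-or-print behavior described above, assuming the agent runs Python and knows the workspace root. The function name `save_learning_state` is hypothetical; only the folder and file names come from this README.

```python
from pathlib import Path

STATE_DIR = ".paper-reading-coach"
STATE_FILE = "learning_state_latest.md"


def save_learning_state(workspace, state_markdown):
    """Write the state under the workspace; return the path, or None on failure."""
    target = Path(workspace) / STATE_DIR / STATE_FILE
    try:
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(state_markdown, encoding="utf-8")
        return target
    except OSError:
        # Fallback: print the full block so the user can save it manually.
        print(state_markdown)
        return None
```

Because the state lives outside the skill folder, reinstalling or upgrading the skill does not touch it.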
---
## File structure
```text
paper-reading-socratic-coach-openclaw/
├─ SKILL.md
├─ README.md
├─ LICENSE
├─ docs/
│ └─ publishing_to_clawhub.md
├─ rubrics/
│ ├─ precision_dimensions.md
│ └─ weakness_taxonomy.md
├─ templates/
│ ├─ precision_report_template.md
│ ├─ learning_state_template.md
│ └─ session_log_template.md
└─ examples/
└─ example_session_cn.md
```
---
## Quick start
### Example 1 — first diagnostic session
```text
Train me on this paper. Start with a precision report, then quiz me.
[paste title / upload PDF / provide arXiv link]
```
### Example 2 — continue from previous state
```text
Use my previous learning state and continue the paper reading drill on this new paper.
```
### Example 3 — methodology-focused session
```text
Train me on this paper, but focus on experiment interpretation, claims-evidence alignment, and methodology.
```
### Example 4 — question-format control
```text
Use only multi-select and short-answer questions in this session.
```
### Example 5 — frontier extension after basic understanding
```text
After checking that I understand the paper, test whether I can propose serious follow-up directions.
```
---
## Example interaction (Chinese)
See:
- `examples/example_session_cn.md`
That file shows a full session skeleton:
- Precision Report
- first question
- answer feedback
- learning-state delta
- end-of-session update
---
## What makes this different
This skill is deliberately **not** a pure paper summarizer.
It behaves more like a strict tutor + reviewer + research-method coach.
It distinguishes between errors like:
- “you did not read closely enough”
- “you understood the words but not the logic”
- “you can restate the method but cannot judge evidence quality”
- “you can criticize the paper but cannot reconstruct the authors' reasoning path”
That distinction is crucial if the goal is long-term growth.
---
## Score system
The skill scores 10 dimensions, each on a 0–5 scale:
1. Problem understanding
2. Title-to-content interpretation
3. Related-work relation mapping
4. Method reconstruction
5. Formula / theorem interpretation
6. Experiment interpretation
7. Claims-evidence alignment
8. Weakness and limitation detection
9. Future-direction generation
10. Scientific methodology awareness
It also maintains three macro scores:
- **Reading**
- **Thinking**
- **Methodology**
---
## Recommended usage pattern
A strong routine is:
1. choose one paper;
2. do one diagnostic-first session;
3. save the updated learning state;
4. return later for a continuation session;
5. periodically switch to a harder paper or a new subfield.
You can also run focused sessions such as:
- formula-only drill
- experiment-only drill
- related-work mapping drill
- frontier-generation drill
---
## Publishing notes
This bundle is intentionally **text-only** and lightweight to make it easier to inspect, version, and publish to ClawHub.
Before publishing, replace the placeholder homepage in `SKILL.md` with your real GitHub repository URL.
See:
- `docs/publishing_to_clawhub.md`
---
## Suggested GitHub repo topics
```text
openclaw-skill
clawhub
research-skill
paper-reading
socratic-learning
academic-reading
research-methodology
study-skill
```
---
## License
Use **MIT-0** if you plan to publish this exact bundle to ClawHub so the repository license matches the registry distribution model.
FILE:templates/learning_state_template.md
# Learning State
## Learner Profile
- Learner name:
- Default language:
- Current stage: beginner / intermediate / advanced / reviewer-track
- Preferred question formats:
- Avoided question formats:
## Current Macro Scores
- Reading:
- Thinking:
- Methodology:
## Current Dimension Scores
| Dimension | Score (0–5) | Trend | Notes |
|---|---:|---|---|
| Problem understanding | | | |
| Title-to-content interpretation | | | |
| Related-work relation mapping | | | |
| Method reconstruction | | | |
| Formula / theorem interpretation | | | |
| Experiment interpretation | | | |
| Claims-evidence alignment | | | |
| Weakness and limitation detection | | | |
| Future-direction generation | | | |
| Scientific methodology awareness | | | |
## Recurring Error Patterns
-
-
-
## Strongest Abilities
-
-
-
## Weakest Abilities
-
-
-
## Recommended Next Session
- Target paper difficulty:
- Target dimensions:
- Suggested question mix:
- Suggested session length:
## Resume Note
Use this profile at the beginning of the next conversation and continue from the weakest still-important dimension instead of restarting from zero.
FILE:templates/precision_report_template.md
# Precision Report
## Session Metadata
- Session ID:
- Date:
- Language:
- Paper:
- Source type:
- Prior learning state detected: yes / no
## Prior-State Carryover
- Strongest retained dimensions:
- Weakest retained dimensions:
- Recurring error patterns:
- Recommended starting difficulty:
## Pre-Question Diagnostic Hypothesis
> This section is a hypothesis before the main questioning loop.
- Reading risk:
- Thinking risk:
- Methodology risk:
- Most likely weak dimensions this session:
- Why these are likely weak points:
## Initial Mastery Snapshot (Hypothesis)
| Dimension | Score (0–5) | Confidence | Notes |
|---|---:|---:|---|
| Problem understanding | | | |
| Title-to-content interpretation | | | |
| Related-work relation mapping | | | |
| Method reconstruction | | | |
| Formula / theorem interpretation | | | |
| Experiment interpretation | | | |
| Claims-evidence alignment | | | |
| Weakness and limitation detection | | | |
| Future-direction generation | | | |
| Scientific methodology awareness | | | |
## Planned Question Mix
- Total questions:
- Single-choice:
- Multi-select:
- Short-answer:
- Evidence retrieval:
- Critique / frontier:
## Session Targets
- Dimension(s) to probe first:
- Dimension(s) to strengthen by end of session:
- What improvement would count as success today:
## Immediate Training Strategy
- Ask easier calibration question first? yes / no
- Increase difficulty when:
- Decrease difficulty when:
- Stop condition:
FILE:templates/session_log_template.md
# Session Log
## Session Entry
- Session ID:
- Date:
- Paper:
- Source type:
- Start difficulty:
- End difficulty:
## Questions Asked
1.
2.
3.
4.
5.
## Performance Summary
- Best answered dimensions:
- Weakest dimensions exposed:
- Most important mistake:
- Most important improvement:
## State Delta
- Score increases:
- Score decreases:
- New recurring pattern(s):
- Resolved recurring pattern(s):
## Next Session Recommendation
- First question type to use next time:
- First dimension to retest:
- Whether to change paper difficulty:
FILE:rubrics/precision_dimensions.md
# Precision Dimensions
Each dimension is scored from 0 to 5.
## 1. Problem understanding
Can the learner clearly state what scientific problem the paper solves, under what setting, and why that problem matters?
## 2. Title-to-content interpretation
Can the learner explain why the title is phrased that way and how the keywords map to the actual method and claims?
## 3. Related-work relation mapping
Can the learner explain which prior work the paper inherits from, contrasts with, extends, or replaces?
## 4. Method reconstruction
Can the learner reconstruct the method step by step rather than only reciting high-level motivation?
## 5. Formula / theorem interpretation
Can the learner explain what the main equations, objectives, constraints, or theorem statements actually mean?
## 6. Experiment interpretation
Can the learner explain what each major experiment block is trying to test and what the evidence shows?
## 7. Claims-evidence alignment
Can the learner judge whether the paper's strongest claims are really supported by theory, experiments, ablations, or only narrative?
## 8. Weakness and limitation detection
Can the learner detect under-justified assumptions, missing controls, weak baselines, brittle choices, or structural limitations?
## 9. Future-direction generation
Can the learner propose plausible follow-up work, including both natural extensions and more boundary-pushing directions?
## 10. Scientific methodology awareness
Can the learner reason about rigor, fairness, reproducibility, validity, scope conditions, and theory-to-practice alignment?
## Macro-score suggestion
### Reading
Weighted average of dimensions 1–5, with extra weight on dimensions 1, 4, and 5.
### Thinking
Average of dimensions 6–9.
### Methodology
Average of dimensions 7, 8, and 10.
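
The macro-score suggestion can be sketched as follows. The rubric does not say how much "extra weight" to use, so `EXTRA_WEIGHT = 2.0` is an assumption, and the helper names are illustrative only.

```python
# Assumption: the rubric gives no number for "extra weight"; 2.0 is a placeholder.
EXTRA_WEIGHT = 2.0


def weighted_mean(scores, dims, heavy=()):
    """Average the given dimension scores, up-weighting any listed in `heavy`."""
    weights = {d: (EXTRA_WEIGHT if d in heavy else 1.0) for d in dims}
    total = sum(scores[d] * w for d, w in weights.items())
    return total / sum(weights.values())


def macro_scores(scores):
    """Map dimension scores (keys 1..10, values 0-5) to the three macro scores."""
    return {
        "Reading": weighted_mean(scores, range(1, 6), heavy=(1, 4, 5)),
        "Thinking": weighted_mean(scores, range(6, 10)),
        "Methodology": weighted_mean(scores, (7, 8, 10)),
    }


if __name__ == "__main__":
    demo = {d: 3.0 for d in range(1, 11)}
    print(macro_scores(demo))
```

Dimensions 7 and 8 feed both Thinking and Methodology, so a change in either one moves two macro scores at once.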
FILE:rubrics/weakness_taxonomy.md
# Weakness Taxonomy
Use these tags to diagnose errors precisely.
## Reading precision
- RP-1: abstract summary without paper-specific detail
- RP-2: confuses problem setting with motivation
- RP-3: cannot map title terms to method content
- RP-4: remembers results but not assumptions
## Method reconstruction
- MR-1: confuses pipeline order
- MR-2: cannot explain module interactions
- MR-3: restates intuition but not mechanism
- MR-4: formula blindness
## Evidence reading
- ER-1: cannot explain what an experiment block tests
- ER-2: reports numbers without interpretation
- ER-3: ignores ablations or controls
- ER-4: overstates claims beyond evidence
## Scientific thinking
- ST-1: cannot reconstruct author reasoning path
- ST-2: cannot compare against nearby alternatives
- ST-3: critique is generic, not paper-grounded
- ST-4: can criticize but cannot rebuild the design logic
## Methodology awareness
- MA-1: limitation blindness
- MA-2: weak reproducibility awareness
- MA-3: weak baseline genealogy understanding
- MA-4: theory-to-implementation mismatch not noticed
## Frontier generation
- FG-1: only proposes engineering tweaks
- FG-2: no clear scientific question in follow-up ideas
- FG-3: new direction not connected to bottlenecks in the paper
- FG-4: ideas are broad but not testable
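
If the tags are processed programmatically, a lookup like the one below may help. It is a sketch: the prefixes and category names are copied from the headings above, while `tag_category` and the regular expression are hypothetical.

```python
import re

# Prefixes and category names copied from the taxonomy headings.
CATEGORIES = {
    "RP": "Reading precision",
    "MR": "Method reconstruction",
    "ER": "Evidence reading",
    "ST": "Scientific thinking",
    "MA": "Methodology awareness",
    "FG": "Frontier generation",
}
# Each category currently defines tags numbered 1 through 4.
TAG_RE = re.compile(r"^([A-Z]{2})-([1-4])$")


def tag_category(tag):
    """Return the category name for a tag like 'RP-2', or None if it is unknown."""
    match = TAG_RE.match(tag.strip())
    if not match:
        return None
    return CATEGORIES.get(match.group(1))
```

Returning `None` for unknown tags lets a caller flag a mistyped tag rather than record it silently.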
FILE:examples/example_session_cn.md
# 示例会话(中文)
## 用户输入
```text
请用这个 skill 训练我读论文。先生成精度报告,再开始提问。
论文:Semi-supervised Decentralized Personalized Federated Learning
```
## 理想输出骨架
# Precision Report
## Session Metadata
- Session ID: 2026-04-20-demo-01
- Date: 2026-04-20
- Language: 中文
- Paper: Semi-supervised Decentralized Personalized Federated Learning
- Source type: PDF + LaTeX
- Prior learning state detected: no
## Prior-State Carryover
- Strongest retained dimensions: 无
- Weakest retained dimensions: 无
- Recurring error patterns: 无
- Recommended starting difficulty: 中等
## Pre-Question Diagnostic Hypothesis
- Reading risk: 容易停留在“半监督 + 去中心化 + 个性化联邦学习”几个关键词层面,但不能精确还原问题设定。
- Thinking risk: 可能能说出 motivation,但未必能解释为什么作者这样分解模块或损失项。
- Methodology risk: 可能会复述实验结果,却说不清每个实验块到底在验证什么。
- Most likely weak dimensions this session:
- Method reconstruction
- Claims-evidence alignment
- Scientific methodology awareness
## Initial Mastery Snapshot (Hypothesis)
| Dimension | Score (0–5) | Confidence | Notes |
|---|---:|---:|---|
| Problem understanding | 2.5 | 0.45 | 题目较长,概念负荷高 |
| Title-to-content interpretation | 2.0 | 0.40 | 多概念耦合,容易混淆 |
| Related-work relation mapping | 2.0 | 0.35 | 需要跨 PFL / semi-supervised / decentralized |
| Method reconstruction | 1.8 | 0.35 | 高风险项 |
| Formula / theorem interpretation | 1.8 | 0.30 | 高风险项 |
| Experiment interpretation | 2.0 | 0.35 | 高风险项 |
| Claims-evidence alignment | 1.8 | 0.30 | 高风险项 |
| Weakness and limitation detection | 2.0 | 0.35 | |
| Future-direction generation | 2.5 | 0.30 | |
| Scientific methodology awareness | 1.8 | 0.30 | 高风险项 |
## Planned Question Mix
- Total questions: 9
- Single-choice: 2
- Multi-select: 2
- Short-answer: 3
- Evidence retrieval: 1
- Critique / frontier: 1
## Session Targets
- Dimension(s) to probe first: 问题理解、方法重建
- Dimension(s) to strengthen by end of session: 实验解读、claim-evidence 对齐
- What improvement would count as success today: 能清楚说出论文究竟解决什么、核心模块如何配合、实验块分别验证什么
---
## Question 1 (Single-choice)
这篇论文最核心要解决的问题是什么?
A. 如何在中心服务器上提升全局模型精度
B. 如何在没有中心服务器的个性化联邦学习中,同时利用未标注数据并保持个体模型适配性
C. 如何把监督学习完全替换为无监督聚类
D. 如何降低 GPU 显存占用
## 用户回答
```text
B
```
## Feedback
- Judgment: Correct
- What you got right: 抓住了“无中心服务器 + 个性化 + 未标注数据利用”这三个核心约束。
- What is missing or mistaken: 还缺少“为什么这些约束同时出现会使问题更难”的解释。
- Evidence from the paper: 论文的问题设定不是普通 FL,也不是普通半监督学习,而是多约束叠加场景。
- Weakness tag(s): RP-2, ST-1
- Updated micro-score: Problem understanding +0.5
## Learning State Delta
- Dimension changes: Problem understanding 2.5 -> 3.0
- Recurring pattern: 能抓住主旨,但解释难点来源时仍偏粗略
- Next question rationale: 继续追问方法层,确认不是只会读标题
---
## End-of-Session Update
# Updated Learning State
## Current Macro Scores
- Reading: 2.8
- Thinking: 2.3
- Methodology: 2.1
## Recurring Error Patterns
- 容易先讲 motivation,后讲不清模块之间如何配合
- 实验结果能复述,但实验目的拆解不够细
- 能指出不足,但改进建议还不够 paper-grounded
## Recommended Next Session
- Target paper difficulty: 中等偏上
- Target dimensions: Method reconstruction, Claims-evidence alignment, Scientific methodology awareness
- Suggested question mix: short-answer + evidence retrieval + one frontier question
- Suggested session length: 6 questions
## Resume Note
下次对话应直接读取该 learning state,从 “Method reconstruction” 的迁移性复测题开始,而不是重新做最基础的题。
FILE:docs/publishing_to_clawhub.md
# Publish to ClawHub
## Important
Use the **folder** `paper-reading-socratic-coach-openclaw/` in the web uploader.
Do **not** upload only the zip if the page asks you to choose a folder.
## Recommended web-form values
- Slug: `paper-reading-socratic-coach`
- Display name: `Paper Reading Socratic Coach`
- Version: `1.0.0`
- Tags: `latest, research, education, paper-reading, methodology`
## Suggested changelog
Initial public release. Adds a diagnostic-first paper reading trainer that starts from a precision report, asks adaptive questions, diagnoses weakness categories, and updates a reusable learning-state record for later sessions.
Produce a source-aware, research-generative, MIT-0-compatible single-file deep-reading report for a research paper, with a rich narrative body, a final claim...
---
name: paper-deep-reading
description: Produce a source-aware, research-generative, MIT-0-compatible single-file deep-reading report for a research paper, with a rich narrative body, a final claim-to-evidence appendix, and the companion artifacts report.md, traceability_manifest.json, latex_paragraphs.json, artifact_index.json, and research_lens.json. Use for OpenClaw or ClawHub paper-reading tasks that need deep analysis without a webpage reader.
---
# Paper Deep Reading: Source-Aware + Research-Generative Single-File Pipeline
Use this skill when the user wants a **deep, paper-grounded, auditable, idea-generative reading report** for one computer-science paper or a small paper batch.
The input may be:
- a user-provided PDF
- a user-provided LaTeX source tree or `.tex` files
- only the paper title or citation-like paper name
The default output is **text-first, audit-first, and idea-oriented**.
This version intentionally **does not rely on a webpage reader**.
## 1) Core deliverables
1. **Human-readable report**
- `report.md`
2. **Machine-readable trace artifacts**
- `traceability_manifest.json`
- `latex_paragraphs.json`
- `artifact_index.json`
3. **Machine-readable research artifact**
- `research_lens.json`
The report is the primary user-facing deliverable.
It must read like a serious deep-reading memo, not a thin checklist dump.
## 2) ClawHub and MIT-0 package discipline
This skill package is intended to stay compatible with **ClawHub / OpenClaw skill packaging**.
Keep the package lean:
- keep only files that another agent needs to execute the workflow
- do not reintroduce auxiliary docs such as `README.md` or `CHANGELOG.md`
- keep support files, templates, and scripts focused on execution
Keep the package license-safe:
- this package is prepared for ClawHub's `MIT-0` publication model, and the local bundle keeps the MIT-0 license text in `LICENSE.txt`
- do not add restrictive or conflicting license terms elsewhere in the package
- do not vendor third-party projects or assets into the skill unless their license is compatible with `MIT-0` redistribution expectations
- when external tooling is useful, document it or install it outside the skill instead of copying its source tree into the package
## 3) Depth bar: match the Codex version even in single-file mode
Do **not** treat the OpenClaw / ClawHub version as a lightweight summary mode.
The report depth bar should stay close to the richer Codex version:
- cover the full 25-section reading scope
- preserve central equations instead of flattening them into prose
- explain why modules exist, not only what they are called
- reconstruct likely author-side reasoning when the evidence supports it
- connect experiments back to claims, ablations, and alternative explanations
- extract reusable research patterns and future ideas
The single-file constraint changes **presentation**, not **analysis quality**.
## 4) Research-generative overlay
This version keeps the original traceability and formula-preservation bar, but adds the **research-generative reading layer** from the new methodology.
The report must now help the user answer not only:
- what the paper did
but also:
- how the authors may have found the direction
- what hidden assumption `C` broke
- what unavailable mechanism `Y` had to be replaced
- what surrogate mechanism `Z` the paper constructed
- how each module maps to a failure mode
- why key citations matter in the story
- what hidden assumption can seed the next paper
Use [references/research-generative-methodology.md](references/research-generative-methodology.md) whenever the user wants:
- author-perspective reading
- idea mining
- reverse story construction
- module-level design logic
- citation-function analysis
- boundary-pushing future directions
## 5) Verification surface: body first, appendix last
The report itself remains the primary verification surface, but the **detailed evidence placement** is:
1. **Main body**
- readable section-by-section analysis
- `### Anchored Points` blocks near the relevant discussion
- concise claim bullets in the form `- [C5.2][evidence-backed interpretation] ...`
2. **Final appendix**
- detailed claim-by-claim evidence records
- exact source files
- section paths
- line spans
- page hints when available
- quote snippets and excerpt windows
- notes that help a human verify the claim quickly
Do **not** clutter the main narrative by inserting long locator bullets immediately after every claim.
Keep the main body readable, and move the detailed original-paragraph explanation to the final `# Appendix: Claim -> Evidence Index`.
Use [scripts/render_inline_trace_report.py](scripts/render_inline_trace_report.py) after drafting the report and manifest to materialize or refresh that appendix.
## 6) Formula-first preservation
When the paper contains key formulas, the report must **not** compress them into prose-only summaries.
For each central equation or objective, the report must explicitly include:
1. the equation itself in readable math form
2. symbol-by-symbol explanation
3. what optimization / estimation / filtering role it plays
4. why the authors likely wrote it in this form instead of a nearby alternative
5. how it connects to the previous and next module
6. what may be brittle, heuristic, under-justified, or computationally expensive about it
Do not weaken equation detail for the sake of shorter presentation.
## 7) Source acquisition policy
Always assemble the **best available evidence package** before writing.
Preferred reading order:
1. **arXiv LaTeX/source package**
2. **user-provided LaTeX**
3. **best available PDF**
4. **supplementary material**
5. **OpenReview thread / rebuttal / meta-review when relevant**
### 7.1 When LaTeX is available
Treat LaTeX as the primary structural source.
Use PDF only as a visual and pagination aid for:
- figure interpretation
- table reading
- page-local narrative flow
- page anchors
### 7.2 When only PDF is available
Do not stop at PDF summarization immediately.
First check whether the same paper has a matching arXiv LaTeX/source package.
If it exists and matches the same paper, switch to **LaTeX-primary + PDF-assisted** reading.
If not, continue with the PDF and say explicitly that the reading is **PDF-primary**.
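
The ordering and the LaTeX-versus-PDF decision above can be sketched as below. The source-kind strings (`arxiv_latex`, `user_latex`, and so on) and both function names are hypothetical labels chosen for this sketch; the skill does not define them.

```python
# Hypothetical labels, listed in the preferred reading order from section 7.
READING_ORDER = [
    "arxiv_latex",
    "user_latex",
    "pdf",
    "supplementary",
    "openreview",
]


def order_sources(available):
    """Return the available source kinds sorted by the preferred reading order."""
    return [kind for kind in READING_ORDER if kind in available]


def reading_mode(available):
    """Label the reading mode; None means no primary source was found."""
    if "arxiv_latex" in available or "user_latex" in available:
        # A PDF, when present, serves only as a visual and pagination aid.
        return "LaTeX-primary + PDF-assisted"
    if "pdf" in available:
        return "PDF-primary"
    return None
```

The mode label should be stated explicitly in the report, so the reader knows how much structural evidence backs the anchors.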
### 7.3 When only title is available
Search for the paper and collect:
1. arXiv source package if available
2. the best PDF
3. supplementary PDF or appendix if available
4. OpenReview forum if the venue is ICLR or otherwise OpenReview-hosted
Never silently analyze the wrong paper.
Disambiguate by title, authors, abstract, year, and method keywords.
### 7.4 OpenReview policy
If the paper is an ICLR or OpenReview-hosted paper, look for:
- reviewer comments
- meta-review or area-chair summary
- author rebuttal or response
- revision signals relevant to acceptance
Use them to enrich:
- reviewer-lens audit
- confidence in claimed contributions
- limitations and unresolved doubts
## 8) Language policy
Write the **skill instructions, internal prompts, and template skeletons in English**.
Choose the **report language** from the user's current request language by default.
- if the user's current request is primarily in Chinese, write the report in Chinese
- if the user's current request is primarily in a language other than Chinese, write the report in English
- if the user explicitly requests another language, follow that explicit instruction
- if the request is mixed-language, follow the dominant user language in the current request
When writing the report in Chinese:
- keep proper nouns and fixed technical identifiers in English
- this includes paper titles, method names, module names, datasets, baselines, theorem or object names, citation names, equation symbols, claim IDs, filenames, and JSON keys
- translate section headings and explanatory prose into Chinese, but do not translate artifact filenames, schema fields, or claim IDs
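
One rough way to apply the default language choice is a character-count heuristic. This is an assumption-laden sketch: `choose_report_language` is a hypothetical name, and counting CJK characters is only a crude proxy for "primarily in Chinese", since one Chinese character carries more content than one Latin letter.

```python
def choose_report_language(request_text, explicit=None):
    """Pick the report language: an explicit request wins, else the dominant script."""
    if explicit:
        return explicit
    letters = [ch for ch in request_text if ch.isalpha()]
    if not letters:
        return "English"
    # CJK Unified Ideographs block, used as a rough signal for Chinese text.
    chinese = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    return "Chinese" if chinese > len(letters) / 2 else "English"
```

An agent that can judge the request language directly should prefer that judgment over this count.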
## 9) Mandatory artifacts
### 9.1 `report.md`
The report must cover, whenever the evidence supports it:
1. paper identification and source package used
2. one-sentence thesis and research equation
3. title interpretation
4. what problem the paper really solves
5. scientific problem ladder
6. how the authors may have found the direction
7. how the authors built the story
8. related work, key citations, and what was still missing
9. main idea
10. symbols, assumptions, and notation
11. key formulas and equation-by-equation explanation
12. theory / proof / practice mapping
13. algorithm or module walkthrough with concrete example
14. method deep reading: the author-thinking behind each module
15. figure explanation
16. experimental design
17. experiments as story evidence and claim alignment audit
18. reviewer-lens audit
19. innovation points and claim-by-claim support audit
20. story-making pattern worth learning
21. weaknesses and limitations
22. innovation type and scientific-boundary judgment
23. future directions and stronger idea paths
24. vivid plain-language story summary
25. exact sources used
Use [templates/report_template.md](templates/report_template.md) as the default skeleton.
For each numbered section:
- start with `### Anchored Points`
- add one or more claim bullets in the exact form `- [C<section>.<index>][label] claim text`
- keep the bullets concise
- follow the bullets with a **real explanatory section**, not just more bullets
- add tables, formulas, examples, reviewer-style critique, or story reconstruction when they help understanding
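
The claim bullet form above is regular enough to parse with one pattern. The sketch below is illustrative and is not the bundle's own tooling; `CLAIM_RE` and `parse_claims` are hypothetical names.

```python
import re

# Matches bullets such as "- [C5.2][evidence-backed interpretation] claim text".
CLAIM_RE = re.compile(r"^- \[C(\d+)\.(\d+)\]\[([^\]]+)\]\s+(.+)$")


def parse_claims(report_text):
    """Yield (claim_id, label, text) for each claim bullet found in the report."""
    for line in report_text.splitlines():
        match = CLAIM_RE.match(line.strip())
        if match:
            section, index, label, text = match.groups()
            yield f"C{section}.{index}", label, text


if __name__ == "__main__":
    sample = "### Anchored Points\n- [C5.2][plausible inference] The loss acts as a surrogate.\n"
    print(list(parse_claims(sample)))
```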
### 9.2 `traceability_manifest.json`
This is the claim-to-evidence map.
Rules:
- every claim id in the main report body must appear in the manifest
- one bullet must not hide multiple independent claims under one id
- if a claim depends on multiple paragraphs / equations / tables / appendix passages, list them separately
- each claim entry should include `interpretation_type`
- each claim entry should preferably include `research_role`
- each claim entry should include human-friendly locator data when possible
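
The first rule can be checked mechanically. The sketch below assumes a manifest shaped like `{"claims": [{"claim_id": "C5.2", ...}]}`; that layout is a guess made for illustration, and the real schema should be taken from the bundle's contract and validation script.

```python
import re

CLAIM_ID_RE = re.compile(r"\[(C\d+\.\d+)\]")


def missing_claims(report_text, manifest):
    """Return claim IDs cited in the report but absent from the manifest."""
    cited = set(CLAIM_ID_RE.findall(report_text))
    # Assumed layout: a top-level "claims" list whose entries carry "claim_id".
    mapped = {entry["claim_id"] for entry in manifest.get("claims", [])}
    return sorted(cited - mapped)
```

An empty result means every cited claim has a manifest entry; it does not show that the listed evidence is correct.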
### 9.3 `latex_paragraphs.json`
This is the stable LaTeX anchor index.
Each entry must keep:
- `paragraph_id`
- `source_path`
- `line_start`
- `line_end`
- `section_path`
- `kind`
- `text`
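
A quick structural check for one entry might look like this. The key names come from the list above; the function name `entry_problems` and the line-span rule (1-indexed, start not after end) are assumptions.

```python
REQUIRED_KEYS = (
    "paragraph_id", "source_path", "line_start", "line_end",
    "section_path", "kind", "text",
)


def entry_problems(entry):
    """List problems with one paragraph entry; an empty list means it passes."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in entry]
    start, end = entry.get("line_start"), entry.get("line_end")
    # Assumption: line numbers are 1-indexed integers with start <= end.
    if isinstance(start, int) and isinstance(end, int) and (start < 1 or end < start):
        problems.append("invalid line span")
    return problems
```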
### 9.4 `artifact_index.json`
A compact index for the generated text-first bundle.
It should list the locations of:
- `report.md`
- `traceability_manifest.json`
- `latex_paragraphs.json`
- `research_lens.json`
- main PDF if any
- supplementary PDF if any
- source package path if known
### 9.5 `research_lens.json`
This is the compact idea-mining artifact.
Use [templates/research_lens.template.json](templates/research_lens.template.json) and [references/artifact_contract.md](references/artifact_contract.md).
It should capture:
- the paper's research equation
- the likely direction-finding path
- challenge-to-module mapping
- per-module hidden assumptions
- citation logic
- story pattern worth reusing
- strongest future idea directions
## 10) Claim discipline
### 10.1 Claim ids
Use stable section-local ids such as:
- `C3.1`
- `C5.2`
- `C14.4`
### 10.2 Claim splitting rule
Do not hide multiple judgments in one claim bullet.
### 10.3 Evidence completeness rule
List all materially relevant evidence for a claim, not just one convenient paragraph.
### 10.4 Interpretation labels
Each claim must declare exactly one of:
- `evidence-backed interpretation`
- `plausible inference`
- `speculation`
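
A label check is easy to automate once claims are available as `(claim_id, label)` pairs. The sketch below is illustrative; `invalid_labels` is a hypothetical helper, and only the three label strings come from this section.

```python
# The three interpretation labels listed above, copied verbatim.
ALLOWED_LABELS = {
    "evidence-backed interpretation",
    "plausible inference",
    "speculation",
}


def invalid_labels(claims):
    """Return the (claim_id, label) pairs whose label is not an allowed one."""
    return [(cid, label) for cid, label in claims if label not in ALLOWED_LABELS]
```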
### 10.5 Research-generative honesty rule
If the report reconstructs likely author reasoning, it must still point to the exact paragraphs that motivate that reconstruction.
Idea generation is required, but fabrication is forbidden.
## 11) Writing style for verification and idea generation
Prefer a report that is pleasant to read **and** easy to audit.
For every claim, the user should be able to answer:
1. What section-level conclusion is being made?
2. Is it direct evidence, plausible inference, or speculation?
3. Where should I verify it in the appendix?
For the strongest sections, the report should also answer:
1. What hidden assumption broke?
2. What missing mechanism was replaced?
3. What future paper becomes possible if that assumption fails harder?
Use phrasing such as:
- "A plausible author-side thinking path is ..."
- "This module is best understood as a surrogate for ..."
- "The citation is not ornamental; it functions as ..."
- "The deepest reusable lesson is ..."
- "This weakness can be converted into a new research direction ..."
The report should sound like a research mentor reconstructing how the work may have been invented, not like a generic summarizer.
## 12) Grounded workflow
1. Assemble the best source package.
2. If LaTeX is available, extract paragraph anchors with `scripts/extract_latex_paragraphs.py`.
3. Draft `report.md` using anchored claim IDs in the main body.
4. Keep the claim bullets concise and put the longer explanation in prose, tables, and formula walkthroughs after them.
5. Fill `traceability_manifest.json` so each claim points to one or more paragraph IDs or fallback anchors.
6. Fill `research_lens.json` so the paper's research equation, story structure, module logic, citation functions, and future directions are captured in structured form.
7. Fill `artifact_index.json` so the bundle stays portable.
8. Run `scripts/validate_traceability.py`.
9. Run `scripts/render_inline_trace_report.py` to append or refresh the final `Claim -> Evidence Index` appendix in `report.md`.
10. Only then finalize the bundle.
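Steps 2, 8, and 9 can be chained from a small driver. The sketch below only builds the command lines. The flags match the argument parsers of the bundled scripts; the `paper_src` and `bundle` directory names and the `build_commands` helper are assumptions made for illustration.

```python
import sys
from pathlib import Path


def build_commands(source_dir: Path, bundle_dir: Path, paper_id: str) -> list[list[str]]:
    """Return the script invocations for steps 2, 8, and 9, in order."""
    report = str(bundle_dir / "report.md")
    manifest = str(bundle_dir / "traceability_manifest.json")
    paragraphs = str(bundle_dir / "latex_paragraphs.json")
    py = sys.executable
    return [
        # Step 2: extract paragraph anchors from the LaTeX tree.
        [py, "scripts/extract_latex_paragraphs.py", str(source_dir), paragraphs,
         "--paper-id", paper_id],
        # Step 8: validate claim-to-evidence traceability.
        [py, "scripts/validate_traceability.py", "--report", report,
         "--manifest", manifest, "--paragraphs", paragraphs],
        # Step 9: append or refresh the appendix, writing back into report.md.
        [py, "scripts/render_inline_trace_report.py", "--report", report,
         "--manifest", manifest, "--paragraphs", paragraphs,
         "--output", report, "--language", "auto"],
    ]


commands = build_commands(Path("paper_src"), Path("bundle"), "demo-paper")
for argv in commands:
    print(" ".join(argv[1:]))
```

Running each entry with `subprocess.run(argv, check=True)` stops the chain at step 8 when validation fails, because `validate_traceability.py` exits with status 1 on errors.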
## 13) Failure handling
If some sources cannot be found, do not abort.
State clearly what was attempted, what was found, what was missing, and how that affects confidence.
Then continue with the best grounded report possible.
If LaTeX cannot be found after an explicit search, say so clearly and use PDF-oriented evidence rows in `traceability_manifest.json` instead of pretending paragraph anchors exist.
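A PDF-oriented evidence row might look like the sketch below. The locator keys mirror the ones `scripts/validate_traceability.py` accepts; the `source_kind` and `locator_method` values and both helper names are assumptions, not values defined by the bundle.

```python
# Locator fields the validator accepts; at least one must be non-empty.
LOCATOR_KEYS = ("paragraph_id", "page", "caption_label", "equation_label",
                "synctex", "source_file", "doc", "locator_snippets")


def pdf_evidence_row(evidence_id: str, pdf_name: str, page: int, quote: str, note: str) -> dict:
    """Build a fallback evidence row that locates a claim by PDF page and quote."""
    return {
        "evidence_id": evidence_id,
        "source_kind": "pdf_page",            # assumed value
        "source_file": pdf_name,
        "page": page,
        "locator_method": "pdf_text_search",  # assumed value
        "quote_text": quote,
        "notes": note,
    }


def has_usable_locator(evidence: dict) -> bool:
    """Mirror the validator's rule for a usable locator field."""
    return any(evidence.get(key) for key in LOCATOR_KEYS)


row = pdf_evidence_row(
    "E5.2.a", "paper.pdf", 2,
    "representation drift across heterogeneous clients",
    "LaTeX source not found; located by page and quote.",
)
print(has_usable_locator(row))                    # True
print(has_usable_locator({"evidence_id": "E1"}))  # False
```

The row deliberately omits `paragraph_id`, so nothing in the manifest implies a LaTeX anchor that does not exist.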
FILE:LICENSE.txt
MIT No Attribution
Copyright 2026 paper-deep-reading contributors
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
FILE:requirements.txt
markdown>=3.6
pyyaml>=6.0
FILE:_meta.json
{
"ownerId": "kn7fxns1xpr6z67w885my7d7k98506vv",
"slug": "paper-deep-reading",
"version": "1.4.1",
"publishedAt": 1777033789067
}
FILE:templates/artifact_index.template.json
{
"schema_version": "1.1",
"paper_id": "<paper_id>",
"report": "report.md",
"traceability_manifest": "traceability_manifest.json",
"latex_paragraphs": "latex_paragraphs.json",
"research_lens": "research_lens.json",
"source_package": "<optional source package path>",
"pdfs": {
"main": "<optional main pdf path>",
"supplementary": "<optional supplementary pdf path>"
},
"notes": "Text-first bundle. Verify claims directly from report.md and source locators."
}
FILE:templates/latex_paragraphs.template.json
{
"schema_version": "1.2.0",
"paper_id": "<paper-slug>",
"paragraphs": [
{
"paragraph_id": "P-sections-introduction-tex-0001",
"source_path": "sections/introduction.tex",
"line_start": 10,
"line_end": 18,
"section_path": ["1 Introduction"],
"kind": "paragraph",
"text": "Sample paragraph text extracted from LaTeX."
}
]
}
FILE:templates/report_template.md
# Deep Reading Report: <paper-title>
Use this template together with `traceability_manifest.json` and `research_lens.json`.
Write the final report in Chinese when the user's current request is primarily in Chinese; keep proper nouns, fixed technical identifiers, claim IDs, filenames, and JSON keys in English. Otherwise write the report in English.
For every numbered section below:
- start with `### Anchored Points`
- add one or more claims in the exact form `- [C<section>.<index>][label] claim text`
- keep the claim bullets concise and judgment-focused
- make sure every main-body claim ID appears in `traceability_manifest.json`
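- if a claim is reconstructive rather than directly stated, label it `plausible inference` or `speculation` (not `evidence-backed interpretation`) in both the bullet and the manifest's `interpretation_type`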
- if one claim depends on multiple source locations, list every materially necessary source location as separate evidence rows in `traceability_manifest.json`
- if one bullet contains multiple independent claims, split it into multiple claim IDs before writing the manifest
- after the anchored points, add the longer explanation, tables, formulas, critique, and author-side reconstruction as needed
- do not paste detailed locator bullets in the middle of the body; reserve that for the final appendix
## 1. Paper Identification and Source Package Used
### Anchored Points
After anchored points, state the title, authors, venue or status, reading mode (`LaTeX-primary`, `PDF-primary`, or mixed), exact source files used, what was searched for, and what was missing.
## 2. One-Sentence Thesis and Research Equation
### Anchored Points
After anchored points, summarize the paper in one sentence and express the research equation in the form: old success -> broken assumption -> hard setting -> borrowed tool -> unavailable mechanism -> surrogate mechanism.
## 3. Title Interpretation
### Anchored Points
After anchored points, interpret the title term by term and explain how each keyword maps to the actual method, setting, and claim scope.
## 4. What Problem the Paper Really Solves
### Anchored Points
After anchored points, explain the direct problem, the practical pain point, the scientific question, and the larger pressure from the parent field.
## 5. Scientific Problem Ladder
### Anchored Points
After anchored points, build the ladder explicitly from paper-local problem to broader AI or systems boundary, and note any upper-level bottlenecks introduced by the method itself.
## 6. How the Authors May Have Found This Direction
### Anchored Points
After anchored points, reconstruct the likely dissatisfaction, near-transfer from neighboring methods, blocking constraint, and why the surrogate mechanism was worth trying. Keep uncertainty explicit.
## 7. How the Authors Built the Story
### Anchored Points
After anchored points, map challenge -> failure mode -> design principle -> module -> ablation or evidence, and judge whether the story forms a coherent loop instead of a bag of modules.
## 8. Related Work, Key Citations, and What Was Still Missing
### Anchored Points
After anchored points, explain what the key cited works solved, what they left open, and the narrative role of each citation cluster: field anchor, limitation evidence, method ancestor, baseline pressure, protocol justification, or contrast boundary.
## 9. Main Idea
### Anchored Points
After anchored points, explain the conceptual replacement or coordination logic that makes the method coherent, rather than repeating only module names.
## 10. Symbols, Assumptions, and Notation
### Anchored Points
After anchored points, introduce the important symbols, operators, assumptions, and task-specific objects before relying on them heavily later.
## 11. Key Formulas and Equation-by-Equation Explanation
### Anchored Points
After anchored points, preserve the central formulas in readable math form. For each one, explain symbols, role, why this form was chosen, how it connects to adjacent modules, and what looks fragile, heuristic, or expensive.
## 12. Theory / Proof / Practice Mapping
### Anchored Points
After anchored points, explain what is proved, why it is proved, what reviewer concern it addresses, how theory maps to implementation, and where theory and practice diverge.
## 13. Algorithm or Module Walkthrough with Concrete Example
### Anchored Points
After anchored points, give a step-by-step pipeline walkthrough with at least one concrete mini-example that instantiates inputs, intermediate states, and outputs.
## 14. Method Deep Reading: The Author-Thinking Behind Each Module
### Anchored Points
After anchored points, explain for each major module: the failure being fixed, the ideal but unavailable solution, the proxy signal actually used, the hidden assumption, the risk, and the future idea that appears if the assumption breaks.
## 15. Figure Explanation
### Anchored Points
After anchored points, interpret the key figures or captions, explain what each figure is meant to demonstrate, and judge whether the visual evidence really supports the associated claim.
## 16. Experimental Design
### Anchored Points
After anchored points, explain datasets, tasks, baselines, metrics, ablations, implementation details, and how the compared methods map back to the related-work landscape.
## 17. Experiments as Story Evidence and Claim Alignment Audit
### Anchored Points
After anchored points, explain what claim each main result is supposed to support, what alternative explanation it rules out, and whether the evidence strongly, partially, or weakly supports the claim.
## 18. Reviewer-Lens Audit
### Anchored Points
After anchored points, assess novelty, significance, soundness, rigor, reproducibility, clarity, missing controls, limitations honesty, and any OpenReview reviewer or rebuttal signal when available.
## 19. Innovation Points and Claim-by-Claim Support Audit
### Anchored Points
After anchored points, list the paper's main contribution claims and judge whether each one is supported by theory, experiments, qualitative evidence, reviewer discussion, or only weak evidence.
## 20. Story-Making Pattern Worth Learning
### Anchored Points
After anchored points, extract the reusable pattern from the paper, such as a replacement story, three-module story, closed loop, empty-cell positioning, or hidden-assumption break.
## 21. Weaknesses and Limitations
### Anchored Points
After anchored points, discuss unresolved weaknesses, failure modes, scope limits, hidden costs, and where the current idea is likely to break.
## 22. Innovation Type and Scientific-Boundary Judgment
### Anchored Points
After anchored points, judge whether the work is incremental, cross-pollinated, conceptually reframing, or potentially boundary-pushing, and explain why.
## 23. Future Directions and Stronger Idea Paths
### Anchored Points
After anchored points, propose next-step ideas, stronger boundary directions, alternative modules, or more decisive experiments. Tie the best future ideas to hidden assumptions whose failure would break the current method.
## 24. Vivid Plain-Language Story Summary
### Anchored Points
After anchored points, write a short memorable story that stays technically faithful while remaining accessible to a non-specialist.
## 25. Exact Sources Used
### Anchored Points
After anchored points, list exactly which PDFs, LaTeX files, supplementary materials, OpenReview pages, or screenshots were used, and explicitly mention missing or ambiguous sources.
---
## Optional Structured Notes for `research_lens.json`
### Research Equation
- old success / paradigm:
- broken assumption:
- hard setting:
- borrowed tool:
- unavailable mechanism:
- surrogate mechanism:
### Challenge-to-Module Map
| Challenge | Failure mode | Design principle | Module | Evidence |
|---|---|---|---|---|
### Module Lens Table
| Module | Failure fixed | Ideal unavailable solution | Available proxy | Hidden assumption | Future research point |
|---|---|---|---|---|---|
### Citation Function Table
| Citation cluster | Narrative function | Assumption inherited | How the paper modifies it |
|---|---|---|---|
### Story Pattern Worth Reusing
- pattern name:
- compact formula:
- lesson:
### Boundary-Pushing Idea List
- hidden assumption:
- what breaks:
- next mechanism worth exploring:
- linked claim ids:
---
# Appendix: Claim -> Evidence Index
Render this appendix only after the main body is complete.
Use `scripts/render_inline_trace_report.py` to append or refresh the detailed evidence appendix.
For each claim ID from the main report body, create a subsection like:
## C<section>.<index>
- Interpretation type:
- Statement:
- Research role:
- Confidence:
### Evidence 1
- Source file:
- Section path:
- Lines:
- Page:
- Locator method:
- Quote:
- Excerpt window:
- Notes:
FILE:templates/research_lens.template.json
{
"schema_version": "paper-research-lens/1.0",
"paper_id": "<paper-id>",
"research_equation": {
"one_sentence_thesis": "<A works under C, T breaks C, M needs Y, so the paper builds Z>",
"valuable_paradigm": "<old success>",
"broken_assumption": "<hidden assumption>",
"hard_setting": "<realistic constraint>",
"borrowed_tool": "<neighboring method>",
"unavailable_mechanism": "<missing Y>",
"surrogate_mechanism": "<replacement Z>",
"claim_ids": ["C2.1", "C6.1"]
},
"direction_reconstruction": {
"starting_dissatisfaction": "<likely starting dissatisfaction>",
"almost_worked_transfer": "<method that almost transferred>",
"blocking_constraint": "<why it still failed>",
"replacement_logic": "<how Y was replaced by Z>",
"claim_ids": ["C6.1", "C7.1"]
},
"challenge_module_map": [
{
"challenge": "<challenge>",
"failure_mode": "<failure mode>",
"design_principle": "<design principle>",
"module": "<module>",
"ablation_or_evidence": "<evidence>",
"claim_ids": ["C7.1", "C17.2"]
}
],
"module_lenses": [
{
"module": "<module>",
"failure_fixed": "<failure fixed>",
"ideal_unavailable_solution": "<ideal unavailable solution>",
"available_proxy": "<available proxy>",
"hidden_assumption": "<hidden assumption>",
"future_direction": "<next idea if the assumption breaks>",
"claim_ids": ["C14.1", "C23.1"]
}
],
"citation_logic": [
{
"citation_cluster": "<citation cluster>",
"narrative_function": "<field anchor / limitation / ancestor / baseline / contrast>",
"assumption_inherited": "<assumption from prior work>",
"paper_move": "<how this paper modifies it>",
"claim_ids": ["C8.1"]
}
],
"story_patterns": [
{
"pattern_name": "<replacement story / closed loop / empty cell>",
"formula": "<compact formula>",
"lesson": "<reusable lesson>",
"claim_ids": ["C20.1", "C23.1"]
}
],
"boundary_directions": [
{
"title": "<future direction>",
"hidden_assumption": "<assumption>",
"what_breaks": "<what fails if it breaks>",
"new_direction": "<new mechanism or setting>",
"claim_ids": ["C21.1", "C23.1"]
}
]
}
FILE:templates/traceability_manifest.template.json
{
"schema_version": "1.3.0",
"paper_id": "<paper-slug>",
"claims": [
{
"claim_id": "C5.2",
"section_id": "5",
"report_anchor": "## 5. Scientific Problem Ladder",
"statement": "The paper turns representation drift under heterogeneous clients into an explicit design target.",
"interpretation_type": "evidence-backed interpretation",
"research_role": "problem framing",
"confidence": "high",
"evidences": [
{
"evidence_id": "E5.2.a",
"source_kind": "latex_paragraph",
"source_file": "sections/introduction.tex",
"paragraph_id": "P-sections-introduction-tex-0008",
"line_start": 41,
"line_end": 56,
"locator_method": "synctex_preferred",
"quote_text": "representation drift across heterogeneous clients",
"synctex": {
"pdf": "paper.pdf",
"page": 2,
"bbox": [76, 145, 480, 238]
},
"notes": "Gap statement in introduction."
}
]
}
]
}
FILE:scripts/extract_latex_paragraphs.py
#!/usr/bin/env python3
from __future__ import annotations

import argparse
import json
import re
from pathlib import Path

SECTION_RE = re.compile(r'\\(section|subsection|subsubsection|paragraph)\*?\{(.+?)\}')
BEGIN_RE = re.compile(r'\\begin\{([A-Za-z*]+)\}')
END_RE = re.compile(r'\\end\{([A-Za-z*]+)\}')
CAPTION_RE = re.compile(r'\\caption(?:\[[^\]]*\])?\{(.+?)\}')
INPUT_RE = re.compile(r'\\(?:input|include)\{(.+?)\}')

BLOCK_KINDS = {
    'equation': 'equation',
    'align': 'equation',
    'align*': 'equation',
    'gather': 'equation',
    'gather*': 'equation',
    'multline': 'equation',
    'multline*': 'equation',
    'theorem': 'theorem',
    'lemma': 'theorem',
    'proposition': 'theorem',
    'corollary': 'theorem',
    'definition': 'definition',
    'remark': 'remark',
    'algorithm': 'algorithm',
    'algorithm*': 'algorithm',
    'figure': 'figure',
    'figure*': 'figure',
    'table': 'table',
    'table*': 'table',
}
def sanitize_text(text: str) -> str:
    text = text.replace('\t', ' ')
    lines = []
    for raw in text.splitlines():
        line = raw.rstrip('\n')
        if not line.strip():
            lines.append('')
            continue
        if line.lstrip().startswith('%'):
            lines.append('')
            continue
        # Strip trailing comments when not escaped.
        parts = re.split(r'(?<!\\)%', line, maxsplit=1)
        lines.append(parts[0].rstrip())
    return '\n'.join(lines)


def stable_id(rel_path: str, index: int) -> str:
    slug = rel_path.replace('\\', '/').replace('/', '-').replace('.', '-')
    return f'P-{slug}-{index:04d}'


def emit_entry(entries: list[dict], rel_path: str, start: int, end: int, text: str, section_path: list[str], kind: str) -> None:
    cleaned = ' '.join(text.split())
    if not cleaned:
        return
    entry = {
        'paragraph_id': stable_id(rel_path, len(entries) + 1),
        'source_path': rel_path,
        'line_start': start,
        'line_end': end,
        'section_path': section_path[:],
        'kind': kind,
        'text': cleaned,
    }
    entries.append(entry)
def extract_blocks(rel_path: str, text: str) -> list[dict]:
    """Split one sanitized .tex file into paragraph, environment, and caption entries."""
    lines = text.splitlines()
    entries: list[dict] = []
    section_stack: list[str] = []
    buffer: list[str] = []
    para_start: int | None = None
    env_name: str | None = None
    env_start: int | None = None
    env_lines: list[str] = []

    def flush_paragraph(current_line: int) -> None:
        nonlocal buffer, para_start
        if buffer and para_start is not None:
            emit_entry(entries, rel_path, para_start, current_line, '\n'.join(buffer), section_stack, 'paragraph')
        buffer = []
        para_start = None

    for idx, line in enumerate(lines, start=1):
        sec_match = SECTION_RE.search(line)
        if sec_match and env_name is None:
            flush_paragraph(idx - 1)
            level, title = sec_match.groups()
            title = title.strip()
            level_order = {'section': 1, 'subsection': 2, 'subsubsection': 3, 'paragraph': 4}[level]
            section_stack[:] = section_stack[: level_order - 1]
            section_stack.append(title)
            continue
        begin_match = BEGIN_RE.search(line)
        end_match = END_RE.search(line)
        if env_name is None and any(m and m.group(1) == 'document' for m in (begin_match, end_match)):
            # \begin{document} and \end{document} wrap the whole body; capturing them
            # as a block would swallow every paragraph into a single entry.
            flush_paragraph(idx - 1)
            continue
        if env_name is None and begin_match:
            flush_paragraph(idx - 1)
            env_name = begin_match.group(1)
            env_start = idx
            env_lines = []
            # Fall through so an \end{...} on the same line still closes the block.
        if env_name is not None:
            env_lines.append(line)
            if any(m.group(1) == env_name for m in END_RE.finditer(line)):
                raw = '\n'.join(env_lines)
                kind = BLOCK_KINDS.get(env_name, 'environment')
                emit_entry(entries, rel_path, env_start or idx, idx, raw, section_stack, kind)
                if kind in {'figure', 'table'}:
                    for caption_match in CAPTION_RE.finditer(raw):
                        emit_entry(entries, rel_path, env_start or idx, idx, caption_match.group(1), section_stack, f'{kind}_caption')
                env_name = None
                env_start = None
                env_lines = []
            continue
        if not line.strip():
            flush_paragraph(idx - 1)
            continue
        if para_start is None:
            para_start = idx
        buffer.append(line)
    flush_paragraph(len(lines))
    return entries
def main() -> None:
    parser = argparse.ArgumentParser(description='Extract structured LaTeX paragraph anchors.')
    parser.add_argument('input_dir', help='Directory containing .tex files')
    parser.add_argument('output', help='Output JSON path')
    parser.add_argument('--paper-id', default='paper', help='Paper identifier')
    args = parser.parse_args()
    root = Path(args.input_dir).resolve()
    tex_files = sorted(root.rglob('*.tex'))
    if not tex_files:
        raise SystemExit(f'No .tex files found under {root}')
    paragraphs: list[dict] = []
    for tex_file in tex_files:
        rel = tex_file.relative_to(root).as_posix()
        text = sanitize_text(tex_file.read_text(encoding='utf-8', errors='ignore'))
        paragraphs.extend(extract_blocks(rel, text))
    payload = {
        'schema_version': '1.2.0',
        'paper_id': args.paper_id,
        'paragraphs': paragraphs,
    }
    out_path = Path(args.output)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding='utf-8')
    print(f'Wrote {len(paragraphs)} anchors to {out_path}')


if __name__ == '__main__':
    main()
FILE:scripts/render_inline_trace_report.py
#!/usr/bin/env python3
from __future__ import annotations

import argparse
import json
import re
from pathlib import Path

APPENDIX_HEADING = '# Appendix: Claim -> Evidence Index'
CLAIM_BULLET_RE = re.compile(r'^\s*-\s+\[(C\d+(?:\.\d+)*)\](?:\[[^\]]+\])?', re.MULTILINE)


def load_json(path: Path) -> dict:
    return json.loads(path.read_text(encoding='utf-8'))


def clean_latex_text(text: str) -> str:
    text = text.replace('~', ' ')
    text = re.sub(r'\\begin\{[^}]+\}', ' ', text)
    text = re.sub(r'\\end\{[^}]+\}', ' ', text)
    text = re.sub(r'\\(cite|ref|label|url|footnote)\{[^}]*\}', ' ', text)
    text = re.sub(r'\\textbf\{([^}]*)\}', r'\1', text)
    text = re.sub(r'\\textit\{([^}]*)\}', r'\1', text)
    text = re.sub(r'\\emph\{([^}]*)\}', r'\1', text)
    text = re.sub(r'\\[a-zA-Z]+\*?(?:\[[^\]]*\])?\{([^}]*)\}', r'\1', text)
    text = re.sub(r'\\[a-zA-Z]+\*?', ' ', text)
    text = text.replace('{', '').replace('}', '')
    return re.sub(r'\s+', ' ', text).strip()


def section_label(paragraph: dict) -> str:
    section_path = paragraph.get('section_path') or []
    if section_path:
        return ' > '.join(str(part) for part in section_path)
    kind = paragraph.get('kind') or ''
    if kind == 'abstract':
        return 'Abstract'
    return 'Front matter / unspecified'


def start_excerpt(text: str, n_words: int = 14) -> str:
    words = clean_latex_text(text).split()
    return ' '.join(words[:n_words])


def end_excerpt(text: str, n_words: int = 14) -> str:
    words = clean_latex_text(text).split()
    return ' '.join(words[-n_words:])


def detect_language(report_text: str, language: str) -> str:
    if language in {'en', 'zh'}:
        return language
    cjk_chars = sum(1 for char in report_text if '\u4e00' <= char <= '\u9fff')
    return 'zh' if cjk_chars >= 20 else 'en'
def claim_ids_in_body(report_text: str) -> list[str]:
    body = report_text.split(APPENDIX_HEADING, 1)[0]
    seen: set[str] = set()
    ordered: list[str] = []
    for claim_id in CLAIM_BULLET_RE.findall(body):
        if claim_id in seen:
            continue
        seen.add(claim_id)
        ordered.append(claim_id)
    return ordered


def localize(language: str) -> dict[str, str]:
    if language == 'zh':
        return {
            'interpretation_type': '判定类型',
            'statement': '结论',
            'research_role': '研究角色',
            'confidence': '置信度',
            'human_locators': '人工定位提示',
            'evidence': '证据',
            'source_file': '来源文件',
            'section_path': '章节路径',
            'lines': '行号',
            'page': '页码',
            'locator_method': '定位方式',
            'quote': '引用片段',
            'excerpt_window': '摘录窗口',
            'notes': '备注',
            'missing_manifest': 'traceability_manifest.json 中缺少该 claim。',
            'unknown': '未知',
            'not_available': '无',
        }
    return {
        'interpretation_type': 'Interpretation type',
        'statement': 'Statement',
        'research_role': 'Research role',
        'confidence': 'Confidence',
        'human_locators': 'Human locators',
        'evidence': 'Evidence',
        'source_file': 'Source file',
        'section_path': 'Section path',
        'lines': 'Lines',
        'page': 'Page',
        'locator_method': 'Locator method',
        'quote': 'Quote',
        'excerpt_window': 'Excerpt window',
        'notes': 'Notes',
        'missing_manifest': 'Missing from traceability_manifest.json.',
        'unknown': 'unknown',
        'not_available': 'n/a',
    }
def format_optional_value(value: object, unknown_label: str) -> str:
    if value is None:
        return unknown_label
    text = str(value).strip()
    return text or unknown_label


def format_line_span(evidence: dict, paragraph: dict | None, unknown_label: str) -> str:
    line_start = evidence.get('line_start')
    line_end = evidence.get('line_end')
    if paragraph:
        line_start = paragraph.get('line_start', line_start)
        line_end = paragraph.get('line_end', line_end)
    if line_start and line_end:
        return f'{line_start}-{line_end}'
    if line_start:
        return str(line_start)
    return unknown_label


def format_evidence_block(evidence: dict, paragraph: dict | None, labels: dict[str, str], index: int) -> list[str]:
    source_file = evidence.get('source_file') or evidence.get('doc')
    section_path = labels['unknown']
    excerpt_window = labels['not_available']
    if paragraph:
        source_file = paragraph.get('source_path', source_file)
        section_path = section_label(paragraph)
        excerpt_window = f'"{start_excerpt(paragraph.get("text", ""))}" -> "{end_excerpt(paragraph.get("text", ""))}"'
    if not source_file and evidence.get('paragraph_id'):
        source_file = evidence.get('paragraph_id')
    lines = format_line_span(evidence, paragraph, labels['unknown'])
    page_value = evidence.get('page')
    synctex = evidence.get('synctex')
    if page_value is None and isinstance(synctex, dict):
        # The manifest template stores the page under `synctex.page`.
        page_value = synctex.get('page')
    page = format_optional_value(page_value, labels['not_available'])
    locator_method = format_optional_value(evidence.get('locator_method') or evidence.get('relation'), labels['not_available'])
    quote = format_optional_value(evidence.get('quote_text') or evidence.get('quote'), labels['not_available'])
    locator_snippets = evidence.get('locator_snippets')
    notes_value = evidence.get('notes')
    if isinstance(locator_snippets, list) and locator_snippets:
        snippet_text = '; '.join(str(item).strip() for item in locator_snippets if str(item).strip())
        if notes_value:
            notes_value = f'{notes_value} | Locator snippets: {snippet_text}'
        else:
            notes_value = f'Locator snippets: {snippet_text}'
    notes = format_optional_value(notes_value, labels['not_available'])
    return [
        f'### {labels["evidence"]} {index}',
        f'- {labels["source_file"]}: `{format_optional_value(source_file, labels["unknown"])}`',
        f'- {labels["section_path"]}: {section_path}',
        f'- {labels["lines"]}: {lines}',
        f'- {labels["page"]}: {page}',
        f'- {labels["locator_method"]}: `{locator_method}`',
        f'- {labels["quote"]}: {quote}',
        f'- {labels["excerpt_window"]}: {excerpt_window}',
        f'- {labels["notes"]}: {notes}',
    ]
def render_appendix(report_text: str, manifest: dict, paragraphs: dict, language: str) -> str:
    labels = localize(language)
    claim_ids = claim_ids_in_body(report_text)
    para_map = {
        item['paragraph_id']: item
        for item in paragraphs.get('paragraphs', [])
        if isinstance(item, dict) and item.get('paragraph_id')
    }
    claim_map = {
        item['claim_id']: item
        for item in manifest.get('claims', [])
        if isinstance(item, dict) and item.get('claim_id')
    }
    appendix_lines = [APPENDIX_HEADING, '']
    for claim_id in claim_ids:
        claim = claim_map.get(claim_id)
        appendix_lines.append(f'## {claim_id}')
        if not claim:
            appendix_lines.append(f'- {labels["notes"]}: {labels["missing_manifest"]}')
            appendix_lines.append('')
            continue
        appendix_lines.extend(
            [
                f'- {labels["interpretation_type"]}: {format_optional_value(claim.get("interpretation_type"), labels["unknown"])}',
                f'- {labels["statement"]}: {format_optional_value(claim.get("statement") or claim.get("claim_text"), labels["unknown"])}',
                f'- {labels["research_role"]}: {format_optional_value(claim.get("research_role"), labels["not_available"])}',
                f'- {labels["confidence"]}: {format_optional_value(claim.get("confidence"), labels["not_available"])}',
            ]
        )
        human_locators = claim.get('human_locators')
        if isinstance(human_locators, list) and human_locators:
            appendix_lines.append(f'- {labels["human_locators"]}:')
            for locator in human_locators:
                # Two-space indent so Markdown nests these under the bullet above.
                appendix_lines.append(f'  - {str(locator).strip()}')
        appendix_lines.append('')
        evidences = claim.get('evidences')
        if not isinstance(evidences, list):
            evidences = claim.get('evidence', [])
        for index, evidence in enumerate(evidences, start=1):
            if not isinstance(evidence, dict):
                continue
            paragraph = para_map.get(evidence.get('paragraph_id'))
            appendix_lines.extend(format_evidence_block(evidence, paragraph, labels, index))
            appendix_lines.append('')
    appendix = '\n'.join(appendix_lines).rstrip() + '\n'
    if APPENDIX_HEADING in report_text:
        body = report_text.split(APPENDIX_HEADING, 1)[0].rstrip()
        return body + '\n\n' + appendix
    return report_text.rstrip() + '\n\n' + appendix
def main() -> None:
    parser = argparse.ArgumentParser(description='Render the final claim-to-evidence appendix into report.md')
    parser.add_argument('--report', required=True)
    parser.add_argument('--manifest', required=True)
    parser.add_argument('--paragraphs', required=True)
    parser.add_argument('--output', required=True)
    parser.add_argument('--language', choices=['auto', 'en', 'zh'], default='auto')
    args = parser.parse_args()
    report_text = Path(args.report).read_text(encoding='utf-8')
    manifest = load_json(Path(args.manifest))
    paragraphs = load_json(Path(args.paragraphs))
    language = detect_language(report_text, args.language)
    rendered = render_appendix(report_text, manifest, paragraphs, language)
    Path(args.output).write_text(rendered, encoding='utf-8')


if __name__ == '__main__':
    main()
FILE:scripts/validate_traceability.py
#!/usr/bin/env python3
from __future__ import annotations

import argparse
import json
import re
from pathlib import Path

APPENDIX_HEADING = '# Appendix: Claim -> Evidence Index'
CLAIM_BULLET_RE = re.compile(r'^\s*-\s+\[(C\d+(?:\.\d+)*)\](?:\[[^\]]+\])?', re.MULTILINE)
ALLOWED_LABELS = {
    'evidence-backed interpretation',
    'plausible inference',
    'speculation',
}


def load_json(path: Path) -> dict:
    try:
        return json.loads(path.read_text(encoding='utf-8'))
    except Exception as exc:
        raise SystemExit(f'Failed to read JSON {path}: {exc}') from exc


def extract_report_claims(report_path: Path) -> list[str]:
    text = report_path.read_text(encoding='utf-8')
    if APPENDIX_HEADING in text:
        text = text.split(APPENDIX_HEADING, 1)[0]
    return CLAIM_BULLET_RE.findall(text)
def validate(report_path: Path, manifest_path: Path, paragraphs_path: Path | None = None) -> tuple[list[str], list[str]]:
    errors: list[str] = []
    warnings: list[str] = []
    report_claims = extract_report_claims(report_path)
    report_set = set(report_claims)
    duplicates_in_report = sorted({claim for claim in report_claims if report_claims.count(claim) > 1})
    if duplicates_in_report:
        errors.append(f'Duplicate claim ids in report: {", ".join(duplicates_in_report)}')
    manifest = load_json(manifest_path)
    claims = manifest.get('claims', [])
    if not isinstance(claims, list):
        errors.append('`claims` must be a list in traceability manifest.')
        claims = []
    paragraph_ids: set[str] = set()
    if paragraphs_path:
        paragraphs = load_json(paragraphs_path).get('paragraphs', [])
        if isinstance(paragraphs, list):
            paragraph_ids = {item.get('paragraph_id') for item in paragraphs if isinstance(item, dict) and item.get('paragraph_id')}
    manifest_ids: list[str] = []
    for idx, entry in enumerate(claims, start=1):
        if not isinstance(entry, dict):
            errors.append(f'Claim entry #{idx} is not an object.')
            continue
        claim_id = entry.get('claim_id')
        if not claim_id:
            errors.append(f'Claim entry #{idx} is missing `claim_id`.')
            continue
        manifest_ids.append(claim_id)
        label = entry.get('interpretation_type')
        if label not in ALLOWED_LABELS:
            errors.append(f'{claim_id}: invalid interpretation_type `{label}`.')
        statement = str(entry.get('statement') or entry.get('claim_text') or '').strip()
        if not statement:
            errors.append(f'{claim_id}: missing statement.')
        evidences = entry.get('evidences')
        if not isinstance(evidences, list):
            evidences = entry.get('evidence')
        if not isinstance(evidences, list) or not evidences:
            errors.append(f'{claim_id}: missing evidence list.')
            continue
        seen_evidence_ids: set[str] = set()
        for eidx, evidence in enumerate(evidences, start=1):
            if not isinstance(evidence, dict):
                errors.append(f'{claim_id}: evidence #{eidx} is not an object.')
                continue
            evidence_id = evidence.get('evidence_id')
            if not evidence_id:
                errors.append(f'{claim_id}: evidence #{eidx} is missing evidence_id.')
            elif evidence_id in seen_evidence_ids:
                errors.append(f'{claim_id}: duplicate evidence_id `{evidence_id}`.')
            else:
                seen_evidence_ids.add(evidence_id)
            paragraph_id = evidence.get('paragraph_id')
            if paragraph_id and paragraph_ids and paragraph_id not in paragraph_ids:
                errors.append(f'{claim_id}: paragraph_id `{paragraph_id}` not found in latex_paragraphs.json.')
            locator_method = evidence.get('locator_method') or evidence.get('relation')
            if not locator_method:
                warnings.append(f'{claim_id}: evidence #{eidx} has no locator_method.')
            if not any(
                evidence.get(key)
                for key in ('paragraph_id', 'page', 'caption_label', 'equation_label', 'synctex', 'source_file', 'doc', 'locator_snippets')
            ):
                errors.append(f'{claim_id}: evidence #{eidx} has no usable locator field.')
    manifest_dupes = sorted({claim_id for claim_id in manifest_ids if manifest_ids.count(claim_id) > 1})
    if manifest_dupes:
        errors.append(f'Duplicate claim ids in manifest: {", ".join(manifest_dupes)}')
    manifest_set = set(manifest_ids)
    missing_from_manifest = sorted(report_set - manifest_set)
    if missing_from_manifest:
        errors.append('Claims present in report but missing from manifest: ' + ', '.join(missing_from_manifest))
    missing_from_report = sorted(manifest_set - report_set)
    if missing_from_report:
        warnings.append('Claims present in manifest but missing from report: ' + ', '.join(missing_from_report))
    return errors, warnings
def main() -> None:
    parser = argparse.ArgumentParser(description='Validate claim-to-evidence traceability.')
    parser.add_argument('--report', required=True, help='Path to report.md')
    parser.add_argument('--manifest', required=True, help='Path to traceability_manifest.json')
    parser.add_argument('--paragraphs', help='Path to latex_paragraphs.json')
    args = parser.parse_args()
    report_path = Path(args.report)
    manifest_path = Path(args.manifest)
    paragraphs_path = Path(args.paragraphs) if args.paragraphs else None
    errors, warnings = validate(report_path, manifest_path, paragraphs_path)
    if warnings:
        print('Warnings:')
        for item in warnings:
            print(f' - {item}')
    if errors:
        print('Errors:')
        for item in errors:
            print(f' - {item}')
        raise SystemExit(1)
    print('Traceability validation passed.')


if __name__ == '__main__':
    main()
FILE:references/artifact_contract.md
# Artifact Contract
## `artifact_index.json`
Top-level index for all outputs that belong to one paper reading bundle.
Required keys:
- `schema_version`
- `paper_id`
- `report`
- `traceability_manifest`
- `latex_paragraphs`
- `research_lens`
Optional keys:
- `source_package`
- `pdfs`
- `notes`
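
A minimal sketch of how the required keys above could be checked before a bundle is accepted. The function name `check_artifact_index` is hypothetical and is not one of the bundled scripts; only the key names come from the lists above, and the check looks at key presence only.

```python
from __future__ import annotations

import json

# Key names copied from the contract above.
REQUIRED_KEYS = (
    "schema_version",
    "paper_id",
    "report",
    "traceability_manifest",
    "latex_paragraphs",
    "research_lens",
)


def check_artifact_index(index: dict) -> list[str]:
    """Return one message per missing required key; an empty list means complete."""
    return [f"missing required key `{key}`" for key in REQUIRED_KEYS if key not in index]


if __name__ == "__main__":
    # Hypothetical, deliberately incomplete index used only for illustration.
    example = json.loads('{"schema_version": "1", "paper_id": "demo", "report": "report.md"}')
    for problem in check_artifact_index(example):
        print(problem)
```

Optional keys are ignored on purpose, so an index that carries `pdfs` or `notes` passes as long as the six required keys exist.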
## `traceability_manifest.json`
Maps report claims to source evidence.
Each claim entry should include:
- `claim_id`
- `section_id`
- `report_anchor`
- `statement`
- `interpretation_type`
- `confidence`
- `evidences`
Recommended extra fields:
- `research_role`
- `human_locators`
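Each evidence entry may include the fields below. At least one locator field (for example `paragraph_id`, `page`, `source_file`, or `synctex`) should be present, because the traceability validator reports an entry with no usable locator as an error: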
- `evidence_id`
- `source_kind`
- `source_file`
- `paragraph_id`
- `page`
- `line_start`
- `line_end`
- `locator_method`
- `synctex`
- `quote_text`
- `notes`
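
To make the field lists concrete, here is a sketch of a single claim entry written as a Python dictionary. Every value (IDs, file names, line numbers, quote text) is invented for illustration, the helper `missing_claim_fields` is hypothetical, and the assumption that `interpretation_type` reuses the report's interpretation labels and that `confidence` is a short string is mine, not the contract's.

```python
from __future__ import annotations

import json

# Field names copied from the "Each claim entry should include" list above.
CLAIM_FIELDS = (
    "claim_id",
    "section_id",
    "report_anchor",
    "statement",
    "interpretation_type",
    "confidence",
    "evidences",
)

# All values below are made up; only the key names come from the contract.
example_claim = {
    "claim_id": "C2.1",
    "section_id": "method",
    "report_anchor": "claim-c2-1",
    "statement": "The auxiliary loss acts as a surrogate for missing supervision.",
    "interpretation_type": "plausible inference",
    "confidence": "medium",
    "evidences": [
        {
            "evidence_id": "C2.1-E1",
            "source_kind": "latex",
            "source_file": "sections/method.tex",
            "paragraph_id": "p-0042",
            "line_start": 118,
            "line_end": 124,
            "locator_method": "paragraph_id",
            "quote_text": "we introduce an auxiliary objective",
        }
    ],
}


def missing_claim_fields(claim: dict) -> list[str]:
    """List the contract fields that a claim entry does not carry."""
    return [field for field in CLAIM_FIELDS if field not in claim]


if __name__ == "__main__":
    print(json.dumps(example_claim, indent=2, ensure_ascii=False))
    print(missing_claim_fields(example_claim))
```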
## `latex_paragraphs.json`
Stable anchor list extracted from LaTeX.
Each paragraph entry should include:
- `paragraph_id`
- `source_path`
- `line_start`
- `line_end`
- `section_path`
- `kind`
- `text`
## `research_lens.json`
This is the compact idea-mining layer.
It should capture:
- research equation
- direction reconstruction
- challenge-to-module map
- module hidden assumptions
- citation logic
- reusable story pattern
- strongest future directions
Every `claim_ids` entry inside `research_lens.json` must point to a real report claim.
## Report requirement
In `report.md`, every main-body claim bullet should appear in the form:
- `[C<section>.<index>][interpretation label] statement`
The detailed locator material should live in the final `# Appendix: Claim -> Evidence Index`, not in the middle of the narrative body.
For each claim entry in the appendix, provide enough detail for a human verifier to know:
- which source file to open
- which section / subsection to inspect
- which line span or page span to inspect
- what quote snippet or excerpt window to look for
- what note or role explains why that evidence matters
## Claim typing
Allowed interpretation labels:
- `evidence-backed interpretation`
- `plausible inference`
- `speculation`
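
The bullet form and the allowed labels can be checked mechanically. The sketch below is one possible parser, not the bundled validator: the function name `parse_claim_bullet` is hypothetical, and it assumes that `<section>` and `<index>` are plain integers, as in `C2.1`.

```python
from __future__ import annotations

import re

# Labels copied from the "Claim typing" list above.
ALLOWED_LABELS = {
    "evidence-backed interpretation",
    "plausible inference",
    "speculation",
}

# Matches "- [C<section>.<index>][label] statement" (integer section/index assumed).
CLAIM_BULLET = re.compile(r"^\s*-\s*\[(C\d+\.\d+)\]\[([^\]]+)\]\s+(.+?)\s*$")


def parse_claim_bullet(line: str) -> dict[str, str] | None:
    """Parse one report line; return None when the line is not a claim bullet."""
    match = CLAIM_BULLET.match(line)
    if match is None:
        return None
    claim_id, label, statement = match.groups()
    if label not in ALLOWED_LABELS:
        raise ValueError(f"{claim_id}: `{label}` is not an allowed interpretation label")
    return {"claim_id": claim_id, "label": label, "statement": statement}


if __name__ == "__main__":
    print(parse_claim_bullet("- [C2.1][plausible inference] The loss is a surrogate for missing labels."))
```

Raising on an unknown label, rather than returning `None`, keeps a mistyped label from being silently treated as ordinary narrative text.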
FILE:references/research-generative-methodology.md
# Research-Generative Methodology
Use this reference when the user wants more than grounded verification.
Its goal is to turn a paper reading into a **research-generation exercise** while keeping every important statement source-aware.
The core move is:
> Read the paper as a hidden design path.
## 1. Research Equation
Compress the paper into:
`old success + broken assumption + hard setting + borrowed tool + surrogate mechanism`
Useful questions:
- What important paradigm already worked?
- What hidden assumption made it work?
- In what realistic setting does that assumption fail?
- What neighboring method almost transfers?
- What missing mechanism `Y` blocks direct transfer?
- What surrogate `Z` does the paper build instead?
## 2. How the Direction Was Likely Found
Use evidence-backed phrasing:
- "The authors likely noticed that ..."
- "A plausible thinking path is ..."
- "The setup suggests ..."
Try to reconstruct:
- starting dissatisfaction
- tempting transferred method
- blocking constraint
- replacement logic
## 3. How the Story Was Built
Look for:
`challenge -> failure mode -> design principle -> module -> ablation`
Strong papers often create a loop instead of a bag of tricks.
Explain whether one module creates the resource that the next module needs.
## 4. Method Deep Reading
For each module, reconstruct:
`failure + ideal unavailable solution + available proxy + design choice + hidden assumption + risk`
The most useful framing is usually:
> This module is not just a trick; it is a surrogate for the missing mechanism `Y`.
## 5. Reverse Citation Logic
Treat citations as narrative functions:
- field anchor
- limitation evidence
- method ancestor
- neighboring inspiration
- baseline pressure
- protocol justification
- contrast boundary
Explain what permission each key citation gives the paper.
## 6. Experiments as Story Evidence
Read each result as:
`claim + counterfactual + metric + stress condition`
Ask:
- what claim it supports
- what alternative explanation it rules out
- which module it validates
- whether the stress condition really matches the paper's target difficulty
## 7. Story Pattern Worth Learning
Extract one reusable pattern, such as:
- replacement story
- three-module story
- two-axis empty cell
- closed-loop contribution
- hidden-assumption break
## 8. Weakness to New Idea Conversion
Use:
`future work = current method + violated assumption + new mechanism`
For each strong weakness, ask which follow-up paper becomes possible if the key hidden assumption is violated even more severely.
## 9. Writing Rules
Prefer phrasing like:
- "A plausible author-side thinking path is ..."
- "This module is best understood as a surrogate for ..."
- "The citation is not ornamental; it functions as ..."
- "This weakness can be converted into a new research direction ..."
Avoid:
- restating the abstract
- listing sections without causal explanation
- paraphrasing equations without saying why they exist
- speaking as if private author intent were directly observed
FILE:references/synctex_locator_notes.md
# SyncTeX Locator Notes
This bundle does not require SyncTeX for every run, but the reader and artifact contract are designed to use it whenever available.
## Recommended workflow
1. Compile the paper with SyncTeX enabled, for example:
   - `latexmk -pdf -synctex=1 main.tex`
2. Keep:
   - the compiled PDF
   - the generated `.synctex.gz`
3. Preserve `source_path`, `line_start`, and `line_end` in `latex_paragraphs.json`
4. During bundle enrichment, convert line anchors to PDF page / bbox coordinates
5. Store the result under each evidence item, for example:

```json
"synctex": {
  "pdf": "paper.pdf",
  "page": 3,
  "bbox": [70, 120, 480, 170]
}
```
## Localization priority
1. explicit SyncTeX page + bbox
2. explicit PDF page + bbox already stored in the manifest
3. source path + line span
4. text-search fallback
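
The priority order above can be expressed as a small selector. This is a sketch under assumptions: the function name `pick_locator` and the returned strategy names are hypothetical, and a top-level `bbox` key on the evidence item is assumed for the second case because the contract lists `page` but does not name a manifest-level bbox field.

```python
from __future__ import annotations


def pick_locator(evidence: dict) -> str:
    """Return the highest-priority localization strategy the evidence item supports."""
    synctex = evidence.get("synctex") or {}
    # 1. explicit SyncTeX page + bbox
    if synctex.get("page") is not None and synctex.get("bbox"):
        return "synctex"
    # 2. explicit PDF page + bbox stored in the manifest (`bbox` key is assumed)
    if evidence.get("page") is not None and evidence.get("bbox"):
        return "pdf_page_bbox"
    # 3. source path + line span
    if (
        evidence.get("source_file")
        and evidence.get("line_start") is not None
        and evidence.get("line_end") is not None
    ):
        return "source_lines"
    # 4. text-search fallback
    return "text_search"


if __name__ == "__main__":
    print(pick_locator({"synctex": {"pdf": "paper.pdf", "page": 3, "bbox": [70, 120, 480, 170]}}))
```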
FILE:agents/openai.yaml
interface:
display_name: "Paper Deep Reading"
short_description: "Deep single-file paper reading for ClawHub."
brand_color: "#2563EB"
default_prompt: "Deep-read this paper, produce a rich narrative report.md, keep main-body claims anchored with claim IDs, move detailed evidence into the final Claim -> Evidence Index appendix, emit traceability_manifest.json, latex_paragraphs.json, artifact_index.json, and research_lens.json, and match the report language to the user request language."
Offline-first workflow for turning Chinese web page video or audio into text and Word deliverables. Use when Codex needs to (1) extract playable media stream...
---
name: web-video-transcribe-docx
description: Offline-first workflow for turning Chinese web page video or audio into text and Word deliverables. Use when Codex needs to (1) extract playable media streams from arbitrary web pages, including pages that expose MP4, M3U8, MPD, or separate audio streams, (2) download direct media URLs with common headers such as Referer or Origin, (3) run local Chinese ASR with SenseVoice via sherpa-onnx, (4) clean or chapterize the transcript conservatively, or (5) produce TXT and DOCX files from the result. Toutiao pages are a supported special case, not the only target.
license: MIT-0
metadata: {"openclaw":{"requires":{"anyBins":["python","python3","py"]}},"author":"qu_yw","version":"1.0.0","category":"media-transcription"}
---
# Web Video Transcribe Docx
## Overview
Use the bundled scripts to extract media, download it, transcribe it offline, and render DOCX output.
Prefer the deterministic scripts before hand-rolling new code.
Use `{baseDir}` when constructing file paths inside this skill so the instructions stay portable across agents and marketplaces.
## Environment
- Require Python 3 and local filesystem access.
- Require network access for first-run model download and for fetching page or media URLs.
- Require a local Chrome or Edge browser only when extracting media from a web page.
## Quick Start
1. Run `python {baseDir}/scripts/bootstrap_env.py` once in the target environment.
2. For a generic web page URL, run `python {baseDir}/scripts/pipeline_web_to_docx.py <url> --output-dir <dir>`.
3. For a direct media URL, run `python {baseDir}/scripts/download_url.py <url> <output>` and then `python {baseDir}/scripts/transcribe_sensevoice.py --input <file> --output-txt <txt> --output-docx <docx>`.
4. For a local media file, run `python {baseDir}/scripts/transcribe_sensevoice.py --input <file> --output-txt <txt> --output-docx <docx>`.
5. If the user asks for a polished reading version rather than a raw transcript, read [references/cleanup-guidelines.md](references/cleanup-guidelines.md), produce a refined `.txt`, and then render it with `python {baseDir}/scripts/transcript_to_docx.py`.
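
When step 3 has to be driven from another script, the command line can be assembled as a list. The sketch below is illustrative only: `build_download_command` is a hypothetical helper, and it relies on the `url`, `output`, and repeatable `--header KEY=VALUE` arguments that `scripts/download_url.py` defines.

```python
from __future__ import annotations

import sys
from pathlib import Path


def build_download_command(
    base_dir: Path,
    url: str,
    output: Path,
    headers: dict[str, str] | None = None,
) -> list[str]:
    """Build the argv list for scripts/download_url.py without running it."""
    command = [sys.executable, str(base_dir / "scripts" / "download_url.py"), url, str(output)]
    for key, value in (headers or {}).items():
        # download_url.py expects each header as a separate KEY=VALUE argument.
        command += ["--header", f"{key}={value}"]
    return command


if __name__ == "__main__":
    # Placeholder paths and URL; pass the result to subprocess.run(..., check=True).
    print(build_download_command(
        Path("skill"),
        "https://example.com/video.m3u8",
        Path("out/video.mp4"),
        {"Referer": "https://example.com/"},
    ))
```

Passing a list instead of a shell string avoids quoting problems when a URL or header value contains `&`, spaces, or quotes.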
## Example Requests
- "Transcribe the Chinese audio from this web video and export it as a Word document."
- "Turn this MP4 into a transcript, then reorganize it into chaptered reading notes."
- "This page needs a `Referer` header for media download. Extract the media stream and convert it to DOCX."
## Workflow
### 1. Classify the source
- **Generic page URL**: Use `python {baseDir}/scripts/pipeline_web_to_docx.py` first. If the generic pipeline fails on a Toutiao page, fall back to the site-specific `python {baseDir}/scripts/pipeline_toutiao_to_docx.py` or `python {baseDir}/scripts/extract_toutiao_media.py`.
- **Direct media URL**: Use `python {baseDir}/scripts/download_url.py`, then transcribe.
- **Local file**: Transcribe directly.
### 2. Preserve raw outputs
- Keep the raw transcript as its own `.txt`.
- If you produce a cleaned or chapterized version, save it as a separate file.
- Do not overwrite the raw transcript unless the user explicitly asks.
### 3. Prefer the audio stream
- If a page exposes a dedicated audio stream, prefer downloading that instead of the full video stream.
- If the page only exposes a video stream, let ffmpeg decode audio during transcription.
- If the page exposes HLS or DASH manifests, download them through the bundled downloader or pipeline rather than a raw HTTP GET, which would save only the manifest text and not the media segments it lists.
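
The audio-first preference can be applied directly to the JSON manifest that the extractors emit. The sketch below assumes the manifest has been loaded into a dictionary; `choose_download_target` is a hypothetical helper, while the `best_audio_url`, `best_video_url`, `best_audio_headers`, and `best_video_headers` keys are the ones the extractor scripts write.

```python
from __future__ import annotations


def choose_download_target(info: dict) -> tuple[str, dict[str, str]] | None:
    """Pick the audio stream when present, else the video stream, else None."""
    for kind in ("audio", "video"):
        url = info.get(f"best_{kind}_url")
        if url:
            # Headers may be missing or null in the manifest; normalize to a dict.
            return url, dict(info.get(f"best_{kind}_headers") or {})
    return None


if __name__ == "__main__":
    # Placeholder manifest values for illustration.
    manifest = {
        "best_audio_url": "https://example.com/media-audio.m4a",
        "best_audio_headers": {"Referer": "https://example.com/"},
        "best_video_url": "https://example.com/media-video.mp4",
    }
    print(choose_download_target(manifest))
```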
### 4. Refine conservatively
- Preserve meaning.
- Fix obvious ASR mistakes, punctuation, paragraph breaks, headings, and chapter boundaries.
- Do not invent quotes or historical claims that are not supported by the transcript.
- If a passage is too noisy to restore confidently, keep it neutral instead of fabricating detail.
### 5. Stay within scope
- Only download URLs that the user supplied directly or that the extractor captured from the target page.
- Do not request, store, or exfiltrate cookies, access tokens, or account credentials.
- Do not attempt to bypass DRM, login walls, or geo-restriction controls.
- If a page requires authenticated browser state that is not already available, say so plainly and stop at the supported boundary.
### 6. Render deliverables
- Use `python {baseDir}/scripts/transcript_to_docx.py` for generic TXT-to-DOCX rendering.
- Use the raw transcript for auditability and the refined transcript for reading quality.
## Scripts
- `scripts/bootstrap_env.py`
  Install or verify the Python packages used by this skill.
- `scripts/extract_web_media.py`
  Open a generic web page in a real browser, capture likely media URLs plus common download headers, and emit a JSON manifest.
- `scripts/extract_toutiao_media.py`
  Open a Toutiao page in a real browser, capture audio/video URLs, and emit a JSON manifest with the same schema as the generic extractor.
- `scripts/download_url.py`
  Download a direct media URL to disk with a stable user agent, optional headers, and HLS/DASH handling.
- `scripts/transcribe_sensevoice.py`
  Download the SenseVoice model on demand, segment media, run offline ASR, and emit TXT and optional DOCX.
- `scripts/transcript_to_docx.py`
  Render timestamped transcripts or chapterized notes into a Word document.
- `scripts/pipeline_web_to_docx.py`
  Run the generic end-to-end pipeline: extract, download, transcribe, and render.
- `scripts/pipeline_toutiao_to_docx.py`
  Run the Toutiao-specialized end-to-end pipeline for cases where the generic extractor is not preferred.
## References
- Read [references/workflow.md](references/workflow.md) for dependency expectations, command examples, output layout, and troubleshooting.
- Read [references/cleanup-guidelines.md](references/cleanup-guidelines.md) before polishing a noisy transcript into a chapterized reading copy.
- Read [references/publishing.md](references/publishing.md) when preparing marketplace metadata, tags, versioning, or ClawHub publish commands.
## Validation
- Run `python {baseDir}/scripts/bootstrap_env.py` before first use in a fresh environment.
- Validate the skill folder with `skill-creator/scripts/quick_validate.py`.
- Prefer testing `--help` and one representative happy path after changing the scripts.
- If extraction fails on a page, capture a direct media URL with browser tooling and continue with the downloader + transcriber.
- Do not promise support for DRM-protected streams, authenticated cookies, or sites that only expose encrypted EME playback.
FILE:scripts/bootstrap_env.py
#!/usr/bin/env python3
from __future__ import annotations

import argparse
import importlib.util
import subprocess
import sys

# (import name, pip package name)
REQUIRED_PACKAGES = [
    ("requests", "requests"),
    ("playwright", "playwright"),
    ("imageio_ffmpeg", "imageio-ffmpeg"),
    ("sherpa_onnx", "sherpa-onnx"),
    ("docx", "python-docx"),
    ("numpy", "numpy"),
]


def missing_packages() -> list[str]:
    missing: list[str] = []
    for module_name, package_name in REQUIRED_PACKAGES:
        if importlib.util.find_spec(module_name) is None:
            missing.append(package_name)
    return missing


def main() -> int:
    parser = argparse.ArgumentParser(description="Install Python packages required by the skill.")
    parser.add_argument(
        "--install-playwright-browser",
        action="store_true",
        help="Also run `python -m playwright install chromium`.",
    )
    args = parser.parse_args()
    missing = missing_packages()
    if missing:
        subprocess.run([sys.executable, "-m", "pip", "install", *missing], check=True)
    else:
        print("All required Python packages are already installed.")
    if args.install_playwright_browser:
        subprocess.run([sys.executable, "-m", "playwright", "install", "chromium"], check=True)
    print("Environment bootstrap complete.")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
FILE:scripts/download_url.py
#!/usr/bin/env python3
from __future__ import annotations
import argparse
import json
from pathlib import Path
from pipeline_common import download_media_url
def load_headers(
inline_headers: list[str] | None = None,
headers_json: str | None = None,
) -> dict[str, str]:
headers: dict[str, str] = {}
if headers_json:
candidate = Path(headers_json)
if candidate.is_file():
loaded = json.loads(candidate.read_text(encoding="utf-8"))
else:
loaded = json.loads(headers_json)
if not isinstance(loaded, dict):
raise SystemExit("`--headers-json` must point to or contain a JSON object.")
headers.update({str(key): str(value) for key, value in loaded.items() if value is not None})
for item in inline_headers or []:
if "=" not in item:
raise SystemExit(f"Invalid `--header` value: {item!r}. Expected KEY=VALUE.")
key, value = item.split("=", 1)
key = key.strip()
value = value.strip()
if not key or not value:
raise SystemExit(f"Invalid `--header` value: {item!r}. Expected KEY=VALUE.")
headers[key] = value
return headers
def main() -> int:
parser = argparse.ArgumentParser(
description="Download a direct media URL to disk, including HLS manifests when possible."
)
parser.add_argument("url", help="Direct media URL")
parser.add_argument("output", help="Output file path")
parser.add_argument(
"--header",
action="append",
help="Optional request header in KEY=VALUE form; may be repeated",
)
parser.add_argument(
"--headers-json",
help="Optional JSON string or path to a JSON file containing additional request headers",
)
args = parser.parse_args()
output_path = Path(args.output)
headers = load_headers(args.header, args.headers_json)
download_media_url(args.url, output_path, headers=headers)
print(f"Saved: {output_path.resolve()}")
return 0
if __name__ == "__main__":
raise SystemExit(main())
FILE:scripts/extract_toutiao_media.py
#!/usr/bin/env python3
from __future__ import annotations
import argparse
import json
import time
from pathlib import Path
try:
from playwright.sync_api import sync_playwright
except ImportError as exc:
raise SystemExit(
"Missing Playwright. Run `python scripts/bootstrap_env.py` first."
) from exc
from pipeline_common import USER_AGENT, find_browser_executable, sanitize_captured_headers
def _looks_like_audio(url: str, content_type: str = "") -> bool:
lowered = url.lower()
ctype = content_type.lower()
return (
"media-audio" in lowered
or "/audio" in lowered
or "audio/" in ctype
or "mime_type=audio" in lowered
)
def _looks_like_video(url: str, content_type: str = "") -> bool:
lowered = url.lower()
ctype = content_type.lower()
return (
"media-video" in lowered
or ".mp4" in lowered
or ".m3u8" in lowered
or "video/" in ctype
or "mime_type=video" in lowered
)
def _is_relevant(url: str) -> bool:
lowered = url.lower()
return (
"toutiaovod.com" in lowered
or "toutiao.com" in lowered
or ".mp4" in lowered
or ".m3u8" in lowered
or "media-audio" in lowered
or "media-video" in lowered
)
def extract_media_info(url: str, wait_seconds: int = 12) -> dict[str, object]:
    browser_executable = find_browser_executable()
    audio_candidates: dict[str, dict[str, object]] = {}
    video_candidates: dict[str, dict[str, object]] = {}

    def collect(
        candidate_url: str,
        content_type: str = "",
        headers: dict[str, str] | None = None,
    ) -> None:
        if not _is_relevant(candidate_url):
            return
        candidate = {
            "url": candidate_url,
            "content_type": content_type or None,
            "headers": sanitize_captured_headers(headers),
        }
        if _looks_like_audio(candidate_url, content_type):
            audio_candidates.setdefault(candidate_url, candidate)
        elif _looks_like_video(candidate_url, content_type):
            video_candidates.setdefault(candidate_url, candidate)

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=True, executable_path=browser_executable)
        page = browser.new_page(user_agent=USER_AGENT)
        page.on("request", lambda request: collect(request.url, headers=getattr(request, "headers", None)))
        page.on(
            "response",
            lambda response: collect(
                response.url,
                (response.headers or {}).get("content-type", ""),
                headers=getattr(response.request, "headers", None),
            ),
        )
        page.goto(url, wait_until="domcontentloaded", timeout=120000)
        # Wait through Playwright instead of time.sleep(): in the sync API,
        # time.sleep() blocks the thread that dispatches request/response events.
        page.wait_for_timeout(wait_seconds * 1000)
        title = page.title()
        final_url = page.url
        browser.close()

    audio_list = list(audio_candidates.values())
    video_list = list(video_candidates.values())
    best_audio = audio_list[0] if audio_list else None
    best_video = video_list[0] if video_list else None
    return {
        "source_url": url,
        "final_url": final_url,
        "title": title,
        "audio_urls": [str(item["url"]) for item in audio_list],
        "video_urls": [str(item["url"]) for item in video_list],
        "audio_candidates": audio_list,
        "video_candidates": video_list,
        "best_audio_url": best_audio["url"] if best_audio else None,
        "best_video_url": best_video["url"] if best_video else None,
        "best_audio_headers": best_audio["headers"] if best_audio else None,
        "best_video_headers": best_video["headers"] if best_video else None,
        "best_audio": best_audio,
        "best_video": best_video,
    }
def main() -> int:
parser = argparse.ArgumentParser(description="Extract audio/video URLs from a Toutiao video page.")
parser.add_argument("url", help="Toutiao page URL")
parser.add_argument("--wait-seconds", type=int, default=12, help="Extra wait time after page load")
parser.add_argument("--output-json", help="Optional output JSON path")
args = parser.parse_args()
info = extract_media_info(args.url, wait_seconds=args.wait_seconds)
rendered = json.dumps(info, ensure_ascii=False, indent=2)
if args.output_json:
Path(args.output_json).write_text(rendered + "\n", encoding="utf-8")
print(f"Saved media info to: {Path(args.output_json).resolve()}")
else:
print(rendered)
return 0
if __name__ == "__main__":
raise SystemExit(main())
FILE:scripts/extract_web_media.py
#!/usr/bin/env python3
from __future__ import annotations
import argparse
import json
import time
from pathlib import Path
from urllib.parse import urlparse
try:
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
from playwright.sync_api import sync_playwright
except ImportError as exc:
raise SystemExit(
"Missing Playwright. Run `python scripts/bootstrap_env.py` first."
) from exc
from pipeline_common import (
AUDIO_SUFFIXES,
STREAM_SUFFIXES,
USER_AGENT,
VIDEO_SUFFIXES,
classify_media_url,
find_browser_executable,
is_probable_media_url,
sanitize_captured_headers,
)
def _candidate_score(
url: str,
*,
kind: str,
content_type: str = "",
headers: dict[str, str] | None = None,
) -> int:
lowered = url.lower()
ctype = content_type.lower()
suffix = Path(urlparse(url).path).suffix.lower()
score = 200 if kind == "audio" else 100
if suffix in AUDIO_SUFFIXES | VIDEO_SUFFIXES:
score += 60
elif suffix in STREAM_SUFFIXES:
score += 30
if ctype.startswith("audio/") or ctype.startswith("video/"):
score += 40
if "media-audio" in lowered or "media-video" in lowered:
score += 30
if any(token in lowered for token in ("playurl", "playlist", "manifest", "stream")):
score += 15
if headers:
score += 5 * len(headers)
return score
def _upsert_candidate(
store: dict[str, dict[str, object]],
candidate_url: str,
*,
content_type: str = "",
headers: dict[str, str] | None = None,
) -> None:
if not candidate_url.startswith(("http://", "https://")):
return
if not is_probable_media_url(candidate_url, content_type):
return
kind = classify_media_url(candidate_url, content_type)
if kind is None:
return
clean_headers = sanitize_captured_headers(headers)
candidate = {
"url": candidate_url,
"kind": kind,
"content_type": content_type or None,
"headers": clean_headers,
"score": _candidate_score(
candidate_url,
kind=kind,
content_type=content_type,
headers=clean_headers,
),
}
existing = store.get(candidate_url)
if existing is None:
store[candidate_url] = candidate
return
if candidate["score"] > existing.get("score", 0):
store[candidate_url] = candidate
return
if candidate["content_type"] and not existing.get("content_type"):
existing["content_type"] = candidate["content_type"]
merged_headers = dict(existing.get("headers") or {})
merged_headers.update(clean_headers)
existing["headers"] = merged_headers
def _trigger_page_media(page) -> None:
try:
page.evaluate(
"""
() => {
for (const video of document.querySelectorAll("video")) {
video.muted = true;
const promise = video.play();
if (promise && promise.catch) {
promise.catch(() => {});
}
}
}
"""
)
except Exception:
pass
selectors = [
"video",
"[aria-label*='play' i]",
"[aria-label*='播放']",
"[title*='play' i]",
"[title*='播放']",
"[class*='play']",
"[id*='play']",
]
for selector in selectors:
try:
locator = page.locator(selector).first
locator.click(timeout=1500)
page.wait_for_timeout(800)
except PlaywrightTimeoutError:
continue
except Exception:
continue
def extract_media_info(url: str, wait_seconds: int = 12) -> dict[str, object]:
    browser_executable = find_browser_executable()
    candidates: dict[str, dict[str, object]] = {}

    def collect(candidate_url: str, *, content_type: str = "", headers: dict[str, str] | None = None) -> None:
        _upsert_candidate(
            candidates,
            candidate_url,
            content_type=content_type,
            headers=headers,
        )

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=True, executable_path=browser_executable)
        page = browser.new_page(user_agent=USER_AGENT)
        page.on(
            "request",
            lambda request: collect(
                request.url,
                headers=getattr(request, "headers", None),
            ),
        )
        page.on(
            "response",
            lambda response: collect(
                response.url,
                content_type=(response.headers or {}).get("content-type", ""),
                headers=getattr(response.request, "headers", None),
            ),
        )
        page.goto(url, wait_until="domcontentloaded", timeout=120000)
        page.wait_for_timeout(1500)
        _trigger_page_media(page)
        # Wait through Playwright instead of time.sleep(): in the sync API,
        # time.sleep() blocks the thread that dispatches request/response events.
        page.wait_for_timeout(wait_seconds * 1000)
        title = page.title()
        final_url = page.url
        browser.close()

    audio_candidates = sorted(
        (item for item in candidates.values() if item["kind"] == "audio"),
        key=lambda item: (-int(item["score"]), str(item["url"])),
    )
    video_candidates = sorted(
        (item for item in candidates.values() if item["kind"] == "video"),
        key=lambda item: (-int(item["score"]), str(item["url"])),
    )
    best_audio = audio_candidates[0] if audio_candidates else None
    best_video = video_candidates[0] if video_candidates else None
    return {
        "source_url": url,
        "final_url": final_url,
        "title": title,
        "audio_urls": [str(item["url"]) for item in audio_candidates],
        "video_urls": [str(item["url"]) for item in video_candidates],
        "audio_candidates": audio_candidates,
        "video_candidates": video_candidates,
        "best_audio_url": best_audio["url"] if best_audio else None,
        "best_video_url": best_video["url"] if best_video else None,
        "best_audio_headers": best_audio["headers"] if best_audio else None,
        "best_video_headers": best_video["headers"] if best_video else None,
        "best_audio": best_audio,
        "best_video": best_video,
    }
def main() -> int:
parser = argparse.ArgumentParser(
description="Extract audio/video URLs and common headers from a generic web page."
)
parser.add_argument("url", help="Web page URL")
parser.add_argument("--wait-seconds", type=int, default=12, help="Extra wait time after page load")
parser.add_argument("--output-json", help="Optional output JSON path")
args = parser.parse_args()
info = extract_media_info(args.url, wait_seconds=args.wait_seconds)
rendered = json.dumps(info, ensure_ascii=False, indent=2)
if args.output_json:
Path(args.output_json).write_text(rendered + "\n", encoding="utf-8")
print(f"Saved media info to: {Path(args.output_json).resolve()}")
else:
print(rendered)
return 0
if __name__ == "__main__":
raise SystemExit(main())
FILE:scripts/pipeline_common.py
from __future__ import annotations
import os
import shutil
import subprocess
import tarfile
from pathlib import Path
from typing import Iterable
from urllib.parse import urlparse
try:
import imageio_ffmpeg
import requests
except ImportError as exc:
raise SystemExit(
"Missing Python dependencies. Run `python scripts/bootstrap_env.py` first."
) from exc
USER_AGENT = (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
)
AUDIO_SUFFIXES = {".mp3", ".m4a", ".wav", ".aac", ".flac", ".ogg"}
VIDEO_SUFFIXES = {".mp4", ".mkv", ".webm"}
STREAM_SUFFIXES = {".m3u8", ".mpd"}
MEDIA_SUFFIXES = AUDIO_SUFFIXES | VIDEO_SUFFIXES | STREAM_SUFFIXES
MEDIA_SUFFIX_ORDER = (
".mp3",
".m4a",
".wav",
".aac",
".flac",
".ogg",
".mp4",
".mkv",
".webm",
".m3u8",
".mpd",
)
M3U8_CONTENT_TYPES = {
"application/vnd.apple.mpegurl",
"application/x-mpegurl",
}
KEEP_CAPTURED_HEADERS = {
"origin": "Origin",
"referer": "Referer",
}
MODEL_ARCHIVE_NAME = "sherpa-onnx-sense-voice-zh-en-ja-ko-yue-int8-2024-07-17.tar.bz2"
MODEL_DIR_NAME = "sherpa-onnx-sense-voice-zh-en-ja-ko-yue-int8-2024-07-17"
MODEL_URL = (
"https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/"
f"{MODEL_ARCHIVE_NAME}"
)
def cache_root() -> Path:
if os.name == "nt":
base = Path(os.environ.get("LOCALAPPDATA", Path.home() / "AppData" / "Local"))
else:
base = Path(os.environ.get("XDG_CACHE_HOME", Path.home() / ".cache"))
root = base / "web-video-transcribe-docx"
root.mkdir(parents=True, exist_ok=True)
return root
def format_seconds(seconds: float) -> str:
rounded = max(0, int(round(seconds)))
hours = rounded // 3600
minutes = (rounded % 3600) // 60
secs = rounded % 60
return f"{hours:02d}:{minutes:02d}:{secs:02d}"
def ffmpeg_executable() -> str:
return imageio_ffmpeg.get_ffmpeg_exe()
def run_ffmpeg(args: list[str]) -> None:
cmd = [ffmpeg_executable(), "-nostdin", "-y", "-v", "error", *args]
subprocess.run(cmd, check=True)
def guess_suffix_from_url(url: str) -> str:
lowered_path = urlparse(url).path.lower()
for suffix in MEDIA_SUFFIX_ORDER:
if lowered_path.endswith(suffix):
return suffix
lowered = url.lower()
for suffix in MEDIA_SUFFIX_ORDER:
if suffix in lowered:
return suffix
return ".mp4"
def sanitize_captured_headers(headers: dict[str, str] | None) -> dict[str, str]:
if not headers:
return {}
sanitized: dict[str, str] = {}
for key, value in headers.items():
if not value:
continue
normalized = key.strip().lower()
target_key = KEEP_CAPTURED_HEADERS.get(normalized)
if target_key:
sanitized[target_key] = value.strip()
return sanitized
def merge_headers(headers: dict[str, str] | None = None) -> dict[str, str]:
merged = {"User-Agent": USER_AGENT}
if headers:
for key, value in headers.items():
if key and value:
merged[key] = value
return merged
def looks_like_streaming_manifest(url: str, content_type: str = "") -> bool:
lowered = url.lower()
ctype = content_type.lower()
return any(suffix in lowered for suffix in STREAM_SUFFIXES) or any(
token in ctype for token in M3U8_CONTENT_TYPES
)
def classify_media_url(url: str, content_type: str = "") -> str | None:
lowered = url.lower()
ctype = content_type.lower()
suffix = Path(urlparse(url).path).suffix.lower()
if (
suffix in AUDIO_SUFFIXES
or "media-audio" in lowered
or "/audio" in lowered
or "audio/" in ctype
or "mime_type=audio" in lowered
):
return "audio"
if (
suffix in VIDEO_SUFFIXES
or suffix in STREAM_SUFFIXES
or "media-video" in lowered
or ".mp4" in lowered
or ".m3u8" in lowered
or ".mpd" in lowered
or "video/" in ctype
or "mime_type=video" in lowered
or any(token in ctype for token in M3U8_CONTENT_TYPES)
):
return "video"
return None
def is_probable_media_url(url: str, content_type: str = "") -> bool:
if classify_media_url(url, content_type):
return True
lowered = url.lower()
return any(
token in lowered
for token in ("playurl", "playlist", "stream", "manifest", "media", "audio", "video")
)
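def download_file(url: str, destination: Path, headers: dict[str, str] | None = None) -> Path:
    destination.parent.mkdir(parents=True, exist_ok=True)
    merged_headers = merge_headers(headers)
    # Write to a temporary sibling first so an interrupted download never leaves
    # a truncated file at `destination` (the model cache reuses existing files).
    partial = destination.with_name(destination.name + ".part")
    with requests.get(url, headers=merged_headers, stream=True, timeout=60) as response:
        response.raise_for_status()
        with partial.open("wb") as handle:
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                if chunk:
                    handle.write(chunk)
    partial.replace(destination)
    return destination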
def _safe_extract_tar(archive: Path, destination: Path) -> None:
    destination = destination.resolve()
    with tarfile.open(archive, "r:*") as tar:
        members = tar.getmembers()
        for member in members:
            target = (destination / member.name).resolve()
            # Compare path components, not string prefixes: a sibling directory
            # such as `<destination>-evil` would pass a plain startswith() check.
            if target != destination and destination not in target.parents:
                raise RuntimeError(f"Unsafe tar member path: {member.name}")
        tar.extractall(destination)
def ensure_sensevoice_model(target_root: Path | None = None) -> Path:
root = target_root or cache_root()
root.mkdir(parents=True, exist_ok=True)
model_dir = root / MODEL_DIR_NAME
model_file = model_dir / "model.int8.onnx"
tokens_file = model_dir / "tokens.txt"
if model_file.is_file() and tokens_file.is_file():
return model_dir
archive_path = root / MODEL_ARCHIVE_NAME
if not archive_path.is_file():
download_file(MODEL_URL, archive_path)
_safe_extract_tar(archive_path, root)
if not model_file.is_file() or not tokens_file.is_file():
raise RuntimeError("SenseVoice model download completed, but required files are missing.")
return model_dir
def download_media_url(
url: str,
destination: Path,
*,
headers: dict[str, str] | None = None,
content_type: str = "",
) -> Path:
if not looks_like_streaming_manifest(url, content_type=content_type):
return download_file(url, destination, headers=headers)
destination.parent.mkdir(parents=True, exist_ok=True)
merged_headers = merge_headers(headers)
ffmpeg_args = ["-user_agent", merged_headers["User-Agent"]]
referer = merged_headers.get("Referer")
if referer:
ffmpeg_args.extend(["-referer", referer])
extra_headers = {
key: value
for key, value in merged_headers.items()
if key not in {"User-Agent", "Referer"}
}
if extra_headers:
header_blob = "".join(f"{key}: {value}\r\n" for key, value in extra_headers.items())
ffmpeg_args.extend(["-headers", header_blob])
ffmpeg_args.extend(["-i", url, "-map", "0", "-c", "copy", str(destination)])
run_ffmpeg(ffmpeg_args)
return destination
def segment_media(input_path: Path, output_dir: Path, segment_seconds: int = 600) -> list[Path]:
if output_dir.exists():
for old in output_dir.glob("part_*.wav"):
old.unlink()
output_dir.mkdir(parents=True, exist_ok=True)
pattern = output_dir / "part_%03d.wav"
run_ffmpeg(
[
"-i",
str(input_path),
"-ar",
"16000",
"-ac",
"1",
"-f",
"segment",
"-segment_time",
str(segment_seconds),
"-reset_timestamps",
"1",
str(pattern),
]
)
return sorted(output_dir.glob("part_*.wav"))
def find_browser_executable() -> str:
candidates: Iterable[str | None] = (
shutil.which("chrome"),
shutil.which("msedge"),
shutil.which("chromium"),
r"C:\Program Files\Google\Chrome\Application\chrome.exe",
r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe",
r"C:\Program Files\Microsoft\Edge\Application\msedge.exe",
r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe",
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
"/Applications/Microsoft Edge.app/Contents/MacOS/Microsoft Edge",
"/usr/bin/google-chrome",
"/usr/bin/microsoft-edge",
"/usr/bin/chromium",
"/snap/bin/chromium",
)
for candidate in candidates:
if candidate and Path(candidate).exists():
return str(candidate)
raise RuntimeError(
"No supported Chrome/Edge browser executable was found. "
"Install Chrome or Edge, or extend `find_browser_executable()`."
)
FILE:scripts/pipeline_toutiao_to_docx.py
#!/usr/bin/env python3
from __future__ import annotations
import argparse
import json
from pathlib import Path
from urllib.parse import urlparse
from extract_toutiao_media import extract_media_info
from pipeline_common import STREAM_SUFFIXES, classify_media_url, download_media_url, guess_suffix_from_url
from transcript_to_docx import write_docx_from_text
from transcribe_sensevoice import transcribe_media
def is_direct_media_url(url: str) -> bool:
lowered = url.lower()
suffix = Path(urlparse(url).path).suffix.lower()
return suffix in {".mp4", ".mp3", ".m4a", ".wav", ".aac", ".flac", ".ogg", ".webm", ".mkv"} or (
"media-audio" in lowered or "media-video" in lowered or ".m3u8" in lowered
)
def is_toutiao_page(url: str) -> bool:
lowered = url.lower()
return "toutiao.com/video/" in lowered or "m.toutiao.com/video/" in lowered
def pick_candidate(media_info: dict[str, object]) -> dict[str, object] | None:
for key in ("best_audio", "best_video"):
candidate = media_info.get(key)
if isinstance(candidate, dict) and candidate.get("url"):
return candidate
if media_info.get("best_audio_url"):
return {
"url": media_info["best_audio_url"],
"headers": media_info.get("best_audio_headers") or {},
"kind": "audio",
}
if media_info.get("best_video_url"):
return {
"url": media_info["best_video_url"],
"headers": media_info.get("best_video_headers") or {},
"kind": "video",
}
return None
def download_suffix_for_candidate(url: str, kind: str | None = None) -> str:
suffix = guess_suffix_from_url(url)
if suffix in STREAM_SUFFIXES:
return ".m4a" if kind == "audio" else ".mp4"
return suffix
def main() -> int:
parser = argparse.ArgumentParser(description="Extract, download, transcribe, and render a Toutiao video page.")
parser.add_argument("source", help="Toutiao page URL, direct media URL, or local media file")
parser.add_argument("--output-dir", required=True, help="Directory for outputs")
parser.add_argument("--wait-seconds", type=int, default=12, help="Extra page wait after load for Toutiao")
parser.add_argument("--segment-seconds", type=int, default=600, help="Audio segmentation length")
parser.add_argument("--threads", type=int, default=4, help="ASR threads")
parser.add_argument("--language", default="zh", choices=["auto", "zh", "en", "ja", "ko", "yue"])
parser.add_argument("--title", help="Optional DOCX title override")
args = parser.parse_args()
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
source = args.source
title = args.title
if source.startswith(("http://", "https://")):
if is_toutiao_page(source):
media_info = extract_media_info(source, wait_seconds=args.wait_seconds)
(output_dir / "media_info.json").write_text(
json.dumps(media_info, ensure_ascii=False, indent=2) + "\n",
encoding="utf-8",
)
selected = pick_candidate(media_info)
if not selected:
raise SystemExit("No playable audio/video URL was extracted from the Toutiao page.")
selected_url = str(selected["url"])
kind = str(selected.get("kind") or classify_media_url(selected_url) or "video")
suffix = download_suffix_for_candidate(selected_url, kind)
media_path = output_dir / f"source_media{suffix}"
download_media_url(
selected_url,
media_path,
headers=dict(selected.get("headers") or {}),
content_type=str(selected.get("content_type") or ""),
)
title = title or str(media_info.get("title") or "Toutiao Transcript")
elif is_direct_media_url(source):
kind = classify_media_url(source) or "video"
suffix = download_suffix_for_candidate(source, kind)
media_path = output_dir / f"source_media{suffix}"
download_media_url(source, media_path)
title = title or Path(urlparse(source).path).stem or "Web Media Transcript"
else:
raise SystemExit(
"Unsupported page URL. This pipeline extracts page media only for Toutiao. "
"For other sites, capture a direct media URL first."
)
else:
media_path = Path(source)
if not media_path.is_file():
raise SystemExit(f"Local input file not found: {media_path}")
title = title or media_path.stem
transcript_text = transcribe_media(
media_path,
segment_seconds=args.segment_seconds,
threads=args.threads,
language=args.language,
)
transcript_txt = output_dir / "transcript.txt"
transcript_docx = output_dir / "transcript.docx"
transcript_txt.write_text(transcript_text, encoding="utf-8")
write_docx_from_text(transcript_text, transcript_docx, title=title)
print(f"Saved transcript: {transcript_txt.resolve()}")
print(f"Saved DOCX: {transcript_docx.resolve()}")
return 0
if __name__ == "__main__":
raise SystemExit(main())
FILE:scripts/pipeline_web_to_docx.py
#!/usr/bin/env python3
from __future__ import annotations
import argparse
import json
from pathlib import Path
from urllib.parse import urlparse
from extract_toutiao_media import extract_media_info as extract_toutiao_media_info
from extract_web_media import extract_media_info as extract_web_media_info
from pipeline_common import (
MEDIA_SUFFIXES,
STREAM_SUFFIXES,
classify_media_url,
download_media_url,
guess_suffix_from_url,
)
from transcript_to_docx import write_docx_from_text
from transcribe_sensevoice import transcribe_media
def is_direct_media_url(url: str) -> bool:
lowered = url.lower()
suffix = Path(urlparse(url).path).suffix.lower()
return suffix in MEDIA_SUFFIXES or any(
token in lowered for token in ("media-audio", "media-video", ".m3u8", ".mpd")
)
def is_toutiao_page(url: str) -> bool:
lowered = url.lower()
return "toutiao.com/video/" in lowered or "m.toutiao.com/video/" in lowered
def load_headers(
inline_headers: list[str] | None = None,
headers_json: str | None = None,
) -> dict[str, str]:
headers: dict[str, str] = {}
if headers_json:
candidate = Path(headers_json)
if candidate.is_file():
loaded = json.loads(candidate.read_text(encoding="utf-8"))
else:
loaded = json.loads(headers_json)
if not isinstance(loaded, dict):
raise SystemExit("`--headers-json` must point to or contain a JSON object.")
headers.update({str(key): str(value) for key, value in loaded.items() if value is not None})
for item in inline_headers or []:
if "=" not in item:
raise SystemExit(f"Invalid `--header` value: {item!r}. Expected KEY=VALUE.")
key, value = item.split("=", 1)
key = key.strip()
value = value.strip()
if not key or not value:
raise SystemExit(f"Invalid `--header` value: {item!r}. Expected KEY=VALUE.")
headers[key] = value
return headers
def pick_candidate(media_info: dict[str, object]) -> dict[str, object] | None:
for key in ("best_audio", "best_video"):
candidate = media_info.get(key)
if isinstance(candidate, dict) and candidate.get("url"):
return candidate
if media_info.get("best_audio_url"):
return {
"url": media_info["best_audio_url"],
"headers": media_info.get("best_audio_headers") or {},
"kind": "audio",
}
if media_info.get("best_video_url"):
return {
"url": media_info["best_video_url"],
"headers": media_info.get("best_video_headers") or {},
"kind": "video",
}
return None
def download_suffix_for_candidate(url: str, kind: str | None = None) -> str:
suffix = guess_suffix_from_url(url)
if suffix in STREAM_SUFFIXES:
return ".m4a" if kind == "audio" else ".mp4"
return suffix
def source_title(source: str) -> str:
path = Path(urlparse(source).path)
return path.stem or "web-media-transcript"
def main() -> int:
parser = argparse.ArgumentParser(
description="Extract, download, transcribe, and render a generic web page, media URL, or local file."
)
parser.add_argument("source", help="Web page URL, direct media URL, or local media file")
parser.add_argument("--output-dir", required=True, help="Directory for outputs")
parser.add_argument("--wait-seconds", type=int, default=12, help="Extra page wait after load")
parser.add_argument("--segment-seconds", type=int, default=600, help="Audio segmentation length")
parser.add_argument("--threads", type=int, default=4, help="ASR threads")
parser.add_argument("--language", default="zh", choices=["auto", "zh", "en", "ja", "ko", "yue"])
parser.add_argument("--title", help="Optional DOCX title override")
parser.add_argument(
"--extractor",
default="auto",
choices=["auto", "generic", "toutiao"],
help="Page extractor mode for page URLs",
)
parser.add_argument(
"--header",
action="append",
help="Optional download header in KEY=VALUE form; may be repeated",
)
parser.add_argument(
"--headers-json",
help="Optional JSON string or path to a JSON file containing additional download headers",
)
args = parser.parse_args()
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
source = args.source
title = args.title
extra_headers = load_headers(args.header, args.headers_json)
if source.startswith(("http://", "https://")):
if is_direct_media_url(source):
kind = classify_media_url(source) or "video"
suffix = download_suffix_for_candidate(source, kind)
media_path = output_dir / f"source_media{suffix}"
download_media_url(source, media_path, headers=extra_headers)
title = title or source_title(source)
else:
if args.extractor == "toutiao":
if not is_toutiao_page(source):
raise SystemExit("`--extractor toutiao` only supports Toutiao video pages.")
media_info = extract_toutiao_media_info(source, wait_seconds=args.wait_seconds)
elif args.extractor == "auto" and is_toutiao_page(source):
media_info = extract_toutiao_media_info(source, wait_seconds=args.wait_seconds)
else:
media_info = extract_web_media_info(source, wait_seconds=args.wait_seconds)
(output_dir / "media_info.json").write_text(
json.dumps(media_info, ensure_ascii=False, indent=2) + "\n",
encoding="utf-8",
)
selected = pick_candidate(media_info)
if not selected:
raise SystemExit(
"No playable audio/video URL was extracted from the page. "
"The page may use DRM, require login cookies, or hide media behind unsupported scripts."
)
selected_url = str(selected["url"])
selected_headers = dict(selected.get("headers") or {})
selected_headers.update(extra_headers)
kind = str(selected.get("kind") or classify_media_url(selected_url) or "video")
content_type = str(selected.get("content_type") or "")
suffix = download_suffix_for_candidate(selected_url, kind)
media_path = output_dir / f"source_media{suffix}"
download_media_url(
selected_url,
media_path,
headers=selected_headers,
content_type=content_type,
)
title = title or str(media_info.get("title") or source_title(source))
media_title = title or "Web Media Transcript"
else:
media_path = Path(source)
if not media_path.is_file():
raise SystemExit(f"Local input file not found: {media_path}")
media_title = title or media_path.stem
transcript_text = transcribe_media(
media_path,
segment_seconds=args.segment_seconds,
threads=args.threads,
language=args.language,
)
transcript_txt = output_dir / "transcript.txt"
transcript_docx = output_dir / "transcript.docx"
transcript_txt.write_text(transcript_text, encoding="utf-8")
write_docx_from_text(transcript_text, transcript_docx, title=media_title)
print(f"Saved transcript: {transcript_txt.resolve()}")
print(f"Saved DOCX: {transcript_docx.resolve()}")
return 0
if __name__ == "__main__":
raise SystemExit(main())
FILE:scripts/transcribe_sensevoice.py
#!/usr/bin/env python3
from __future__ import annotations
import argparse
import re
import shutil
import tempfile
import wave
from pathlib import Path
try:
import numpy as np
import sherpa_onnx
except ImportError as exc:
raise SystemExit(
"Missing transcription dependencies. Run `python scripts/bootstrap_env.py` first."
) from exc
from pipeline_common import ensure_sensevoice_model, format_seconds, segment_media
from transcript_to_docx import write_docx_from_text
NORMALIZE_REPLACEMENTS = {
"事频": "视频",
"势频": "视频",
"谋选": "毛选",
"某选": "毛选",
"神做": "神作",
"立经": "历经",
"正聋发聩": "振聋发聩",
"阵聋发聩": "振聋发聩",
"前数年": "前苏联",
"造猫画虎": "照猫画虎",
"辅持": "扶持",
"时间那位到": "时间回到",
}
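def normalize_text(text: str) -> str:
# Drop whitespace only where it touches a CJK character, so space-delimited
# output (for example with `--language en`) keeps its word boundaries.
cleaned = re.sub(r"(?<=[\u4e00-\u9fff])\s+|\s+(?=[\u4e00-\u9fff])", "", text)
cleaned = re.sub(r"\s+", " ", cleaned)
for old, new in NORMALIZE_REPLACEMENTS.items():
cleaned = cleaned.replace(old, new)
return cleaned.strip()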
def load_wav(path: Path) -> tuple[int, np.ndarray, float]:
with wave.open(str(path), "rb") as wav_file:
sample_rate = wav_file.getframerate()
frame_count = wav_file.getnframes()
data = wav_file.readframes(frame_count)
samples = np.frombuffer(data, dtype=np.int16).astype(np.float32) / 32768.0
duration = frame_count / float(sample_rate)
return sample_rate, samples, duration
def build_recognizer(model_dir: Path, threads: int, language: str) -> sherpa_onnx.OfflineRecognizer:
return sherpa_onnx.OfflineRecognizer.from_sense_voice(
model=str(model_dir / "model.int8.onnx"),
tokens=str(model_dir / "tokens.txt"),
num_threads=threads,
language=language,
use_itn=True,
)
def transcribe_media(
input_path: Path,
*,
segment_seconds: int = 600,
threads: int = 4,
language: str = "zh",
model_dir: Path | None = None,
) -> str:
resolved_model_dir = ensure_sensevoice_model(model_dir)
recognizer = build_recognizer(resolved_model_dir, threads, language)
temp_root = Path(tempfile.mkdtemp(prefix="web-video-transcribe-docx-"))
try:
segments = segment_media(input_path, temp_root / "segments", segment_seconds=segment_seconds)
if not segments:
raise RuntimeError("No audio segments were created from the input media.")
cursor = 0.0
lines: list[str] = []
for segment in segments:
sample_rate, samples, duration = load_wav(segment)
stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
recognizer.decode_stream(stream)
text = normalize_text(stream.result.text)
start_text = format_seconds(cursor)
end_text = format_seconds(cursor + duration)
stamp = f"[{start_text}-{end_text}]"
lines.append(stamp if not text else f"{stamp} {text}")
cursor += duration
return "\n\n".join(lines).strip() + "\n"
finally:
shutil.rmtree(temp_root, ignore_errors=True)
def main() -> int:
parser = argparse.ArgumentParser(description="Run offline SenseVoice transcription on a local media file.")
parser.add_argument("--input", required=True, help="Input local media file")
parser.add_argument("--output-txt", required=True, help="Transcript output path")
parser.add_argument("--output-docx", help="Optional DOCX output path")
parser.add_argument("--title", help="Optional document title for DOCX")
parser.add_argument("--segment-seconds", type=int, default=600, help="Segment length for ffmpeg splitting")
parser.add_argument("--threads", type=int, default=4, help="Number of ASR threads")
parser.add_argument(
"--language",
default="zh",
choices=["auto", "zh", "en", "ja", "ko", "yue"],
help="SenseVoice language hint",
)
parser.add_argument("--model-dir", help="Optional pre-downloaded model directory")
args = parser.parse_args()
input_path = Path(args.input)
output_txt = Path(args.output_txt)
model_dir = Path(args.model_dir) if args.model_dir else None
text = transcribe_media(
input_path,
segment_seconds=args.segment_seconds,
threads=args.threads,
language=args.language,
model_dir=model_dir,
)
output_txt.write_text(text, encoding="utf-8")
print(f"Saved transcript: {output_txt.resolve()}")
if args.output_docx:
output_docx = Path(args.output_docx)
write_docx_from_text(text, output_docx, title=args.title or input_path.stem)
print(f"Saved DOCX: {output_docx.resolve()}")
return 0
if __name__ == "__main__":
raise SystemExit(main())
FILE:scripts/transcript_to_docx.py
#!/usr/bin/env python3
from __future__ import annotations
import argparse
import re
from pathlib import Path
try:
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml.ns import qn
from docx.shared import Pt
except ImportError as exc:
raise SystemExit(
"Missing python-docx. Run `python scripts/bootstrap_env.py` first."
) from exc
TIMESTAMP_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2}(?:-\d{2}:\d{2}:\d{2})?)\]\s*(.*)$")
def _set_font(run, size: float, *, bold: bool = False, italic: bool = False) -> None:
run.font.name = "Microsoft YaHei"
run._element.rPr.rFonts.set(qn("w:eastAsia"), "Microsoft YaHei")
run.font.size = Pt(size)
run.bold = bold
run.italic = italic
def _set_normal_style(document: Document) -> None:
style = document.styles["Normal"]
style.font.name = "Microsoft YaHei"
style._element.rPr.rFonts.set(qn("w:eastAsia"), "Microsoft YaHei")
style.font.size = Pt(11)
def write_docx_from_text(text: str, output_path: Path, title: str | None = None) -> None:
document = Document()
_set_normal_style(document)
if title:
title_paragraph = document.add_paragraph()
title_paragraph.alignment = WD_ALIGN_PARAGRAPH.CENTER
title_run = title_paragraph.add_run(title)
_set_font(title_run, 16, bold=True)
for raw_line in text.splitlines():
line = raw_line.strip()
if not line:
continue
if line.startswith("# "):
document.add_heading(line[2:].strip(), level=1)
continue
if line.startswith("## "):
document.add_heading(line[3:].strip(), level=1)
continue
if line.startswith("### "):
document.add_heading(line[4:].strip(), level=2)
continue
timestamp_match = TIMESTAMP_RE.match(line)
if timestamp_match:
document.add_heading(timestamp_match.group(1), level=2)
trailing_text = timestamp_match.group(2).strip()
if trailing_text:
document.add_paragraph(trailing_text)
continue
document.add_paragraph(line)
document.save(output_path)
def main() -> int:
parser = argparse.ArgumentParser(description="Render a UTF-8 transcript or chapterized text file into DOCX.")
parser.add_argument("--input", required=True, help="Input text file")
parser.add_argument("--output", required=True, help="Output .docx path")
parser.add_argument("--title", help="Optional document title")
args = parser.parse_args()
input_path = Path(args.input)
output_path = Path(args.output)
text = input_path.read_text(encoding="utf-8")
title = args.title or input_path.stem
write_docx_from_text(text, output_path, title=title)
print(f"Saved: {output_path.resolve()}")
return 0
if __name__ == "__main__":
raise SystemExit(main())
FILE:references/cleanup-guidelines.md
# Cleanup Guidelines
Use these rules when turning a noisy ASR transcript into a readable chapterized draft.
## Preserve meaning
- Keep the speaker's claims, examples, and structure intact.
- Fix obvious ASR errors, but do not add facts that are not supported by the audio.
- Do not manufacture historical quotes or exact wording when the source is uncertain.
## Split outputs by purpose
- Raw transcript: Keep timestamps and preserve recoverable wording.
- Reading copy: Add chapter headings, rewrite broken sentences, and remove repeated filler.
## Typical fixes
- Correct obvious homophone errors in names, titles, dates, and keywords.
- Add punctuation and paragraph breaks.
- Merge broken sentences that are clearly part of one thought.
- Replace impossible phrases with conservative reconstructions.
## Chapter refinement pattern
1. Read the full transcript once.
2. Identify the major article or topic boundaries.
3. Add neutral chapter headings such as `## 第1章 ...`.
4. Rewrite each chapter into readable prose while preserving the argument order.
5. Keep the tone factual rather than ornamental.
## Uncertainty rule
If a fragment is too noisy to restore confidently:
- keep the sentence more general, or
- omit only the irrecoverable fragment while preserving the surrounding argument.
Do not guess at quotes, names, or numbers you cannot support.
FILE:references/publishing.md
# Publishing
## Release Checklist
1. Keep the skill folder name, frontmatter `name`, and release zip basename aligned.
2. Bump the version in `SKILL.md` metadata before each public release.
3. Re-run validation and at least one real extraction or transcription smoke test.
4. Keep the root `LICENSE` file and the `license` field in `SKILL.md` aligned. This release uses `MIT-0`.
5. If you have a public repository or homepage, add it to `metadata.openclaw.homepage` before publication.
## Marketplace Notes
### ClawHub / OpenClaw
- The listing benefits from `metadata.openclaw.requires` because the marketplace can surface runtime requirements.
- The broader Agent Skills spec also documents an optional `compatibility` field, but some older validators still reject it. This release keeps the runtime notes in the Markdown body to stay compatible with stricter legacy tooling.
- Keep `metadata` as a compact JSON object in frontmatter. This is easier for scanners and marketplace tooling to consume.
- This skill currently declares `python`, `python3`, or `py` as acceptable runtime binaries because the scripts are Python-first.
- Browser extraction is optional. Do not mark Chrome or Edge as hard requirements at the metadata level because direct-media and local-file flows still work without a browser.
### Agent Skills ecosystem
- Keep frontmatter portable: `name` and `description` are the hard minimum; prefer conservative optional fields when you need to satisfy older validators.
- Avoid marketplace-specific instructions in the main workflow unless they materially help the agent use the skill.
## Suggested Tags
- transcription
- asr
- chinese
- docx
- video
- audio
- web-scraping
- toutiao
## Suggested Short Summary
Offline-first Chinese web media transcription skill that extracts playable audio or video from ordinary web pages, runs local SenseVoice ASR, and exports TXT plus DOCX deliverables.
## ClawHub Publish Template
Replace the placeholder values before running:
```powershell
clawhub skill publish `
--slug web-video-transcribe-docx `
--version 1.0.0 `
--summary "Offline-first Chinese web media transcription skill with TXT and DOCX output." `
--tags transcription,asr,chinese,docx,video,audio `
--dir .
```
## Pre-Publish Smoke Tests
```powershell
python {baseDir}/scripts/bootstrap_env.py
python {baseDir}/scripts/extract_web_media.py --help
python {baseDir}/scripts/pipeline_web_to_docx.py --help
python {baseDir}/scripts/download_url.py --help
```
For a real smoke test, run either:
```powershell
python {baseDir}/scripts/pipeline_web_to_docx.py "https://example.com/video-page" --output-dir ".\out"
```
or:
```powershell
python {baseDir}/scripts/transcribe_sensevoice.py --input ".\sample.wav" --output-txt ".\out\transcript.txt" --output-docx ".\out\transcript.docx"
```
FILE:references/workflow.md
# Workflow
## Supported inputs
- Generic web page URLs that expose playable media requests in the browser, including many news, learning, and social video pages
- Toutiao page URLs such as `https://www.toutiao.com/video/...` or `https://m.toutiao.com/video/...`
- Direct MP4, MP3, M4A, WAV, AAC, FLAC, and OGG URLs
- Direct HLS or DASH manifests such as `.m3u8` or `.mpd`
- Local media files in the same formats
## Recommended commands
### 1. Install dependencies
```powershell
python scripts/bootstrap_env.py
```
### 2. Generic web page to transcript and DOCX
```powershell
python scripts/pipeline_web_to_docx.py "https://example.com/video-page" --output-dir ".\out"
```
Outputs:
- `media_info.json`
- `source_media.*`
- `transcript.txt`
- `transcript.docx`
### 3. Toutiao page to transcript and DOCX
```powershell
python scripts/pipeline_web_to_docx.py "https://www.toutiao.com/video/..." --output-dir ".\out"
```
If a specific Toutiao page needs the site-specific extractor:
```powershell
python scripts/pipeline_toutiao_to_docx.py "https://www.toutiao.com/video/..." --output-dir ".\out"
```
### 4. Direct media file or direct media URL to transcript and DOCX
```powershell
python scripts/transcribe_sensevoice.py --input ".\source_media.mp4" --output-txt ".\transcript.txt" --output-docx ".\transcript.docx"
```
If the media is a direct URL that needs request headers, download it first, then transcribe the saved file:
```powershell
python scripts/download_url.py "https://cdn.example.com/video.m3u8" ".\source_media.mp4" --header "Referer=https://example.com/video-page" --header "Origin=https://example.com"
```
### 5. Chapterized or manually refined text to DOCX
```powershell
python scripts/transcript_to_docx.py --input ".\chapter_refined.txt" --output ".\chapter_refined.docx" --title "视频章节整理稿"
```
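As a rough illustration of the input this renderer expects, the sketch below classifies transcript lines the way `transcript_to_docx.py` appears to: Markdown-style `#` headings, `[HH:MM:SS-HH:MM:SS]` timestamp lines, and plain paragraphs. The regular expression is copied from the script; the helper name `classify_line` is hypothetical and exists only for this example.

```python
from __future__ import annotations

import re

# Same pattern as TIMESTAMP_RE in scripts/transcript_to_docx.py.
TIMESTAMP_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2}(?:-\d{2}:\d{2}:\d{2})?)\]\s*(.*)$")


def classify_line(raw_line: str) -> tuple[str, str]:
    """Hypothetical helper: return (kind, text) for one transcript line."""
    line = raw_line.strip()
    if not line:
        return ("blank", "")
    # Longest prefix first so "### " is not mistaken for a shorter marker.
    for prefix in ("### ", "## ", "# "):
        if line.startswith(prefix):
            return ("heading", line[len(prefix):].strip())
    match = TIMESTAMP_RE.match(line)
    if match:
        # The renderer turns the timestamp into a heading and any trailing
        # text into a paragraph; here only the timestamp is returned.
        return ("timestamp", match.group(1))
    return ("paragraph", line)


if __name__ == "__main__":
    for sample in ["## 第1章 背景", "[00:00:00-00:10:00] 大家好", "普通段落"]:
        print(classify_line(sample))
```

Running a refined draft through a check like this before rendering can help spot lines that would fall through to plain paragraphs, such as a timestamp written without zero padding.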
## Model cache
The SenseVoice model is downloaded on demand into a user cache directory:
- Windows: `%LOCALAPPDATA%\web-video-transcribe-docx`
- Other systems: `~/.cache/web-video-transcribe-docx`
The model is not bundled inside the skill because it is large.
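A minimal sketch of how the cache location described above could be resolved. This is not the skill's actual `cache_root()` implementation, which is not shown here; the function name `sketch_cache_root`, its parameters, and the fallback used when `LOCALAPPDATA` is unset are assumptions made for illustration.

```python
from __future__ import annotations

import os
import sys
from pathlib import Path
from typing import Mapping

APP_DIR_NAME = "web-video-transcribe-docx"


def sketch_cache_root(
    platform: str = sys.platform,
    env: Mapping[str, str] | None = None,
    home: Path | None = None,
) -> Path:
    """Hypothetical resolver mirroring the documented cache locations."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    if platform.startswith("win"):
        local_app_data = env.get("LOCALAPPDATA")
        if local_app_data:
            return Path(local_app_data) / APP_DIR_NAME
        # Assumption: fall back to the non-Windows layout if the variable is unset.
    return home / ".cache" / APP_DIR_NAME


if __name__ == "__main__":
    print(sketch_cache_root())
```

Printing the resolved path is a quick way to find the directory when a cached archive needs to be inspected or removed by hand.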
## Browser expectations
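`extract_web_media.py` and `extract_toutiao_media.py` try these browsers in order:
- Chrome, Edge, or Chromium on PATH
- Common Windows install paths for Chrome and Edge
- Common macOS application paths for Chrome and Edge
- Common Linux paths such as `/usr/bin/google-chrome`, `/usr/bin/microsoft-edge`, `/usr/bin/chromium`, and `/snap/bin/chromium`
If no browser is found, install Chrome or Edge, or extend `find_browser_executable()` for the local environment.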
## Output rules
- Keep the raw transcript and the refined transcript as separate files.
- Use raw transcript files for auditability.
- Use refined transcript files for end-user reading quality.
- Prefer the page's dedicated audio stream when one exists.
- Prefer the generic page pipeline first, then fall back to the Toutiao-specific extractor only when needed.
- Preserve `media_info.json` whenever you extracted a page URL so later retries can reuse the captured media URL and headers.
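
The last rule can be sketched as follows: a retry reads the saved `media_info.json` and recovers the media URL and headers without re-opening the page. The keys `best_audio`, `best_video`, `url`, `headers`, and `kind` follow what the bundled pipelines read; the helper name `load_saved_candidate` is hypothetical, and real files may carry additional fields.

```python
from __future__ import annotations

import json
from pathlib import Path


def load_saved_candidate(media_info_path: Path) -> dict[str, object] | None:
    """Hypothetical helper: pick a saved media URL and headers for a retry."""
    media_info = json.loads(media_info_path.read_text(encoding="utf-8"))
    # Audio is checked first, matching the "prefer the dedicated audio stream" rule.
    for key, default_kind in (("best_audio", "audio"), ("best_video", "video")):
        candidate = media_info.get(key)
        if isinstance(candidate, dict) and candidate.get("url"):
            return {
                "url": candidate["url"],
                "headers": candidate.get("headers") or {},
                "kind": candidate.get("kind") or default_kind,
            }
    return None
```

The recovered URL and headers can then be handed to the download step. Captured media URLs may expire, so a failed retry can still require a fresh page extraction.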
## Troubleshooting
### Playwright import error
Run:
```powershell
python scripts/bootstrap_env.py
```
### DOCX import error
Install `python-docx` via the bootstrap script.
### sherpa-onnx or model download failure
- Check internet access for the first run.
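- Re-run the same command. If extraction keeps failing, delete the partially downloaded archive from the cache directory first, because an existing archive file is reused rather than downloaded again.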
### Page extraction finds nothing
The bundled generic extractor works for many ordinary video pages, but not all. Common failure modes:
1. The page uses DRM or encrypted EME playback.
2. The media request requires login cookies that the skill does not replay.
3. The video is rendered in a way that never exposes a stable media request to the browser session.
Fallback:
1. Capture a direct media URL with browser automation or developer tools.
2. Preserve the required `Referer` or `Origin` headers.
3. Use `download_url.py` and `transcribe_sensevoice.py`.
FILE:agents/openai.yaml
interface:
display_name: "Web Video Transcribe DOCX"
short_description: "Extract web media, transcribe Chinese audio, and export TXT or DOCX"
default_prompt: "Use $web-video-transcribe-docx to extract media from a web page or media URL, transcribe the audio into cleaned Chinese text, and generate a Word document."