Pippin

@clawhub-pippin1214-edf7ee0d4a
RedNote Research
---
name: rednote-research
description: Research a topic through RedNote/Xiaohongshu discussion signals using either public-web mode (no login) or optional login-enhanced browser review when the user explicitly chooses deeper access. Use when checking RedNote community sentiment, reputation, latest policy/community updates, gossip/drama/news synthesis, local recommendations like restaurants/shops, when recovering evidence from weak public-web snippets/titles/OCR/subtitle fragments, or when analyzing posts, comments, screenshots, image posts, video/gif snippets, subtitles, or audio/transcript clues. Especially useful for prompts like "查小红书口碑", "搜 RedNote 讨论", "看看最近有什么风向/新政策", "总结八卦/争议", "找本地探店推荐", "分析评论区", "分析截图/视频/字幕", "根据截图线索继续搜", "总结某个账号最近发了什么", or "做一个 RedNote 社区情报初筛".
---

# RedNote Community Intelligence

Research a topic with a RedNote/Xiaohongshu-first lens. Default to public-web mode, but support an optional login-enhanced path when the user explicitly wants fuller coverage. Expand queries deliberately, collect signals from multiple source types, separate evidence from vibe, and return a concise report that is honest about uncertainty.

## Access modes

Read `references/access-modes.md` when deciding whether to stay in public-web mode or offer login-enhanced browser review.
Read `references/login-enhanced-workflow.md` when the user explicitly chooses deeper access and you need an execution pattern for authenticated review.
Read `references/minimal-user-input-paths.md` when public-web access is weak and the user prefers not to log in.
Read `references/account-summary-template.md` when the task is to summarize a creator/account or recent posting behavior.

Default behavior:
- start in public-web mode
- do not assume login
- if the user wants fuller account-level, recent-post, or comment-level coverage, offer the login-enhanced path as an explicit choice
- if the user declines login, ask for the smallest useful seed input instead of giving up
- state in the final answer whether findings came from public-web mode or login-enhanced mode

## Core operating rules

- Treat RedNote as a signal-discovery layer, not final proof.
- Prefer a few inspectable sources over many shallow snippets.
- Separate direct evidence, repeated anecdote, platform chatter, and rumor.
- Put dates on fast-moving claims whenever possible.
- Do not imply access to hidden comments, full threads, or app-only media.
- If a page is inaccessible, do not overclaim from the search snippet alone.
- Keep modality explicit: text-page, screenshot, image, video, gif, audio, or transcript.
- Separate extraction from interpretation: OCR/ASR output is evidence, not automatic truth.
- When media access is partial, say exactly what is visible and what remains uninspectable.

## Default workflow

1. Clarify the subject, time scope, geography, output goal, and whether the user wants no-login mode or login-enhanced mode.
2. Start in public-web mode unless the user explicitly chooses login-enhanced mode.
3. Build a compact query set with mixed query families.
4. Search broadly across RedNote, official sources, media, and supporting review sites.
5. If public-web coverage is too thin for the task, explain that and offer login-enhanced browser review as the next step.
6. Extract recurring claims, contradictions, and missing evidence.
7. Score credibility separately from risk or recommendation strength.
8. Deliver a short report with links, caveats, next checks, and a brief note about which access mode was used.

## 1) Clarify the research target

Identify:
- canonical name in Chinese and English
- aliases, abbreviations, nicknames, hashtags, old names
- category: `education`, `policy`, `gossip`, `local`, or `general`
- geography if relevant: city, district, mall, campus, country, online/offline
- time scope: latest 7 days, latest month, current season, or broader background
- user intent: reputation check, update scan, controversy synthesis, shortlist, comment analysis, or post/video analysis

If the prompt is broad, infer likely aliases before searching.

For account-summary tasks, ask for the smallest useful identifier if available: profile URL, user ID/handle, screenshot, copied title list, or 3-5 recent note links. If the user wants fuller coverage and agrees to log in, switch from public-web mode to login-enhanced browser review instead of pretending public-web search is complete. If the user does not want login, read `references/minimal-user-input-paths.md` and ask for the least burdensome seed material that will improve coverage.

## 2) Build queries

Use `scripts/query_builder.py` when deterministic query expansion would help, especially if you need a media-focused query set or a starter claim log schema.
Use `scripts/recovery_query_builder.py` when your starting point is weak public-web evidence: a thin search snippet, partial title, OCR fragment, subtitle line, hashtag, price, or visible date that needs recovery-oriented search pivots.

Prefer a mixed query set instead of one giant keyword dump:
- `overview`: baseline discovery
- `latest`: newest updates and recent turns in sentiment
- `trending`: hot discussion and rumor-tracking discovery
- `comment`: comment-area reactions and repeated talking points
- `review`: reputation, quality, warning signs, user experience
- `recommendation`: worth-it, shortlist, comparison, local picks
- `verification`: official notices, registration records, named responses, implementation details

Typical source patterns:
- `site:xiaohongshu.com <entity> <modifier>`
- `site:www.xiaohongshu.com <entity> <modifier>`
- `<entity> 小红书 <modifier>`
- `<entity> <modifier>`

Category hints:
- `education`: 口碑, 避雷, 退费, 课程质量, 就业, offer, 合同, 维权
- `policy`: 政策, 新规, 通知, 官方回应, 执行, 解读, 影响
- `gossip`: 爆料, 八卦, 翻车, 塌房, 争议, 后续, 聊天记录, 回应
- `local`: 推荐, 探店, 菜品, 排队, 价格, 服务, 环境, 值不值, 避雷
- `general`: 评价, 口碑, 体验, 真实反馈, 怎么样, 值不值

Query-building heuristics:
- Start with 8-16 queries, not 40+.
- Mix discovery queries with 2-4 verification queries.
- Add geography for local or policy tasks.
- Use a narrow time scope for fast-moving topics.
- Search aliases and nicknames when drama or local slang is involved.
- For cross-market topics, run both Chinese and English variants.
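The heuristics above can be sketched as a small query expander. This is a minimal illustration, not the contents of `scripts/query_builder.py`: the function name `build_queries`, the trimmed modifier lists, and the choice of two verification modifiers are all assumptions made for the example.

```python
from itertools import product

# Modifiers copied from the category hints above, trimmed to keep the set compact.
CATEGORY_MODIFIERS = {
    "education": ["口碑", "避雷", "退费"],
    "policy": ["政策", "新规", "解读"],
    "gossip": ["爆料", "争议", "后续"],
    "local": ["推荐", "探店", "避雷"],
    "general": ["评价", "口碑", "怎么样"],
}
# Two of the source patterns listed above.
SOURCE_PATTERNS = [
    "site:xiaohongshu.com {entity} {modifier}",
    "{entity} 小红书 {modifier}",
]
# Assumption: these two modifiers stand in for the `verification` family.
VERIFICATION_MODIFIERS = ["官方回应", "通知"]


def build_queries(entity, category="general", geography=None, limit=16):
    """Return discovery queries followed by verification queries.

    Verification queries are always kept; discovery queries fill the
    remaining room under `limit`.
    """
    # Geography is prepended for local or policy tasks.
    subject = f"{geography} {entity}" if geography else entity
    modifiers = CATEGORY_MODIFIERS.get(category, CATEGORY_MODIFIERS["general"])
    discovery = [
        pattern.format(entity=subject, modifier=modifier)
        for pattern, modifier in product(SOURCE_PATTERNS, modifiers)
    ]
    verification = [f"{subject} {modifier}" for modifier in VERIFICATION_MODIFIERS]
    room = max(limit - len(verification), 0)
    # dict.fromkeys removes duplicates while preserving order.
    discovery = list(dict.fromkeys(discovery))[:room]
    return discovery + verification
```

Calling `build_queries("某咖啡店", category="local", geography="上海")` yields six discovery queries plus two verification queries, which stays inside the 8-16 range suggested above.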

## 3) Search public-web sources

Prefer breadth before depth. Search first, then fetch only the strongest pages.

Target source mix:
- RedNote/Xiaohongshu indexed pages and snippets
- official statements, brands, schools, stores, regulators, or platform notices
- reputable media reports for disputes or policy coverage
- maps/review/listing sites for local businesses
- forums and other community sites only as supplementary anecdotal signals

Search heuristics:
- favor recency for policy, gossip, and local recommendations
- keep a short source list with one-line relevance notes
- search exact names, aliases, hashtags, and comparison targets
- cross-check surprising claims with at least one non-RedNote source when possible
- if the task is about a specific account and public-web search returns thin results, say so explicitly instead of overclaiming; then offer the login-enhanced path or ask for a few seed links/screenshots

## 4) Extract claims and discussion patterns

Normalize findings into compact bullets with fields like:
- claim type: complaint / praise / neutral fact / official claim / media report / rumor / recommendation
- theme: pricing, quality, service, fraud risk, policy impact, taste, queue, environment, controversy, support, etc.
- evidence snippet
- source URL
- source class
- visible date
- repetition count if multiple sources echo the same point
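As a rough sketch, the fields above map onto a small record type, and the repetition count can be derived by counting distinct source URLs per claim type and theme. The names `Finding` and `repetition_counts` are hypothetical and are not taken from the skill's scripts.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    claim_type: str    # complaint / praise / neutral fact / official claim / ...
    theme: str         # pricing, quality, service, ...
    snippet: str       # evidence snippet as seen
    source_url: str
    source_class: str
    visible_date: str  # empty string when no date is visible


def repetition_counts(findings):
    """Count distinct source URLs echoing each (claim_type, theme) pair."""
    sources = defaultdict(set)
    for finding in findings:
        # A set ignores the same URL captured twice.
        sources[(finding.claim_type, finding.theme)].add(finding.source_url)
    return {key: len(urls) for key, urls in sources.items()}
```

Counting distinct URLs rather than raw rows keeps one page fetched twice from inflating the repetition count.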

Read `references/output-patterns.md` when you need output templates or comment clustering patterns.
Read `references/claim-log-schema.md` when the task is evidence-heavy, rumor-sensitive, or needs claim-by-claim tracking.
Read `references/multimodal-capture.md` when screenshots, images, videos, gifs, subtitles, or audio cues materially affect the answer.
Read `references/public-web-recovery.md` when the first page is partial, blocked, snippet-only, or clearly weaker than the underlying media/discussion.
Use `scripts/claim_log_tools.py` to initialize, normalize, or summarize a structured claim log when you have enough evidence items that manual tracking will become noisy.

### Post / comment / screenshot / image / video / gif / audio analysis

Stay explicit about what is and is not directly observable from public-web access.

Break analysis into layers:
1. **Surface metadata** — visible title, caption, date, platform text, source URL.
2. **Observed media evidence** — visible text, OCR-able text, subtitles, scene details, sequence, speaker labels, or audio/transcript clues.
3. **Content summary** — what is clearly shown, spoken, or claimed.
4. **Reaction summary** — visible comment themes, sentiment split, repeated jokes, skepticism, support.
5. **Credibility check** — firsthand evidence vs repost vs edit-heavy clip vs rumor relay.
6. **Open questions** — what would require login, in-app rendering, browser automation, direct file access, frame extraction, OCR cleanup, or ASR.

If the user provides screenshots, transcripts, fetched page text, or media files, analyze those directly and keep extraction separate from interpretation.

### Claim-first working pattern

When the topic is messy, do not jump straight from search results to a vibe summary.

Use this loop instead:
1. list the 2-6 decision-relevant claims
2. attach evidence items with explicit modality and access level
3. downgrade anything that remains snippet-only or relay-only
4. summarize only after the strongest claim/evidence pairs are visible
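Step 3 of this loop can be sketched as a credibility cap on evidence that is still snippet-only or relay-only. The function name `downgrade_weak_access` and the default cap of 3 are assumptions for illustration; the access-level strings follow the claim log schema in `references/claim-log-schema.md`.

```python
# Access levels that count as snippet-only or relay-only.
WEAK_ACCESS_LEVELS = {"search snippet", "quoted relay"}


def downgrade_weak_access(evidence, cap=3):
    """Return copies of evidence items with credibility capped for weak access.

    `cap` is an assumed ceiling, not a value defined by the skill.
    """
    adjusted = []
    for item in evidence:
        item = dict(item)  # copy so the caller's log is not mutated
        if item["access_level"] in WEAK_ACCESS_LEVELS:
            item["credibility"] = min(item["credibility"], cap)
        adjusted.append(item)
    return adjusted
```

Items with `fetched page` or `direct file` access pass through unchanged, so the cap only affects evidence that could not be inspected directly.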

Good trigger conditions for a claim log:
- rumor-heavy controversies
- screenshot-led accusations
- policy interpretation disputes
- local recommendation tasks with sharply conflicting chatter
- any answer where you need to explain why one repeated claim is still weak

## 5) Verify before concluding

Read `references/verification-patterns.md` when the task involves rumors, policy changes, business legitimacy, or claims that could materially affect a decision.

Default verification moves:
- find the earliest visible source, not just the loudest repost
- separate claim, response, and confirmed consequence
- check whether the page is firsthand, quoted, scraped, or relayed
- look for official names, dates, location details, and implementation language
- for local businesses, compare RedNote chatter with maps/review data or official menu/hours pages
- for policy topics, prioritize primary notices over interpretation posts
- for gossip, keep anonymous screenshots and clipped media at rumor level unless independently corroborated
- for screenshots, note whether key text is fully visible, cropped, or OCR-uncertain
- for videos, distinguish caption-level evidence from frame-level evidence
- for audio claims, distinguish direct transcript, ASR-derived wording, and second-hand paraphrase

## 6) Score credibility and decision risk

Read `references/scoring-rubric.md` when you need the full rubric.

Use at least two separate judgments:

### Credibility score (0-5)
- 5: official documents, regulator notices, court/government records, direct statements, reputable reporting
- 4: detailed firsthand post or review with dates, screenshots, prices, names, or concrete specifics
- 3: specific but weakly corroborated anecdote or snippet
- 2: vague anecdote, repost, engagement bait, or SEO page
- 1: obvious rumor or unsourced assertion
- 0: cannot inspect or verify

### Risk / caution / recommendation score (0-5)
Interpret the second score according to task type:
- education / reputation / policy / gossip: higher means more caution or downside risk
- local recommendation: a higher score may instead mean stronger recommendation confidence, but only if you label it that way explicitly; otherwise keep it as a caution score to avoid ambiguity

Weight repeated, independent, recent, and specific evidence more heavily than loud but vague posts.
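One possible way to make this weighting concrete is sketched below. It is only an illustration under stated assumptions: the 90-day half-life, the one-item-per-domain rule used as a proxy for independence, and the function names are not part of the rubric.

```python
from datetime import date


def evidence_weight(credibility, visible_date, today, half_life_days=90):
    """Credibility scaled by an exponential recency decay (assumed half-life)."""
    age_days = max((today - visible_date).days, 0)
    return credibility * 0.5 ** (age_days / half_life_days)


def weighted_risk(items, today):
    """Combine risk scores from (credibility, risk, visible_date, domain) tuples.

    Only the heaviest item per domain is kept, so reposts hosted on one
    site do not count as independent evidence.
    """
    best_per_domain = {}
    for credibility, risk, visible_date, domain in items:
        weight = evidence_weight(credibility, visible_date, today)
        if domain not in best_per_domain or weight > best_per_domain[domain][0]:
            best_per_domain[domain] = (weight, risk)
    total = sum(weight for weight, _ in best_per_domain.values())
    if total == 0:
        return None  # nothing inspectable: report inconclusive, not zero risk
    return sum(weight * risk for weight, risk in best_per_domain.values()) / total
```

Returning `None` for empty or zero-credibility input keeps "no evidence" distinct from "low risk", which matches the skill's preference for saying `inconclusive`.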

## 7) Deliver the report

Keep the report concise and decision-oriented.

Choose the smallest fitting format:

### A) Quick snapshot
- Subject:
- Category:
- Time scope:
- Overall signal: positive / mixed / caution / high risk / inconclusive
- Confidence: low / medium / high

### B) Findings
- 3-6 bullets, strongest evidence first

### C) Evidence list
Use compact bullets when tables are awkward:
- `[credibility 4 | score 4 | first-hand | 2025-09] refund complaints repeat across multiple posts — <url>`

### D) Discussion clusters
- cluster name
- representative wording pattern
- approximate repetition count
- confidence note

### E) What remains unverified
- missing items that public-web access cannot confirm

### F) Suggested next checks
- official notice or registration lookup
- a more recent search window
- map/review cross-check for local places
- direct in-app or browser review if the user wants deeper comment or media extraction

## Fast paths

### Quick reputation check
1. Build a mixed `overview` + `review` + `verification` query set.
2. Search 6-12 strong queries.
3. Capture 5-10 sources.
4. Score each source.
5. Return a short summary plus caveats.

### Latest update or policy scan
1. Use `latest` + `trending` + `verification`.
2. Bias toward the last 7-30 days.
3. Separate official update from community interpretation.
4. State whether the trend is confirmed, contested, or still rumor-level.

### Local recommendation scan
1. Use category `local`.
2. Mix review, recommendation, complaint, and verification queries.
3. Cluster themes: taste, price, queue, service, environment, location convenience.
4. Return a shortlist plus tradeoffs, not just one winner.

### Comment or post analysis
1. Collect visible text, screenshots, snippets, or transcript first.
2. Cluster reactions into 3-5 themes.
3. Mark what is directly seen vs inferred.
4. State clearly when deeper extraction would require login, browser automation, or direct media processing.
5. If the user wants deeper comment-level coverage, offer login-enhanced mode as an explicit escalation path.
6. If the user declines login, ask for screenshots or copied comment text instead of pretending the full thread was inspected.

### Account summary or recent-post scan
1. Start with public-web mode and gather any inspectable profile URL, note URLs, snippets, mirrors, or search-engine traces.
2. Read `references/account-summary-template.md` for output structure.
3. If the goal is a broad impression only, summarize from public-web evidence with caveats.
4. If the goal is recent-post completeness, tell the user public-web coverage may be partial and offer login-enhanced browser review.
5. If the user chooses login-enhanced mode, read `references/login-enhanced-workflow.md` and follow the controlled authenticated-review path.
6. If the user does not want login, read `references/minimal-user-input-paths.md` and ask for a few seed links, screenshots, or copied note titles to improve coverage.
7. Distinguish clearly between account-level observations, note-level evidence, and anything missing because of access limits.

### Screenshot / image-led analysis
1. Capture the page context plus image-visible text, prices, dates, names, and watermarks.
2. Note image legibility and likely OCR uncertainty.
3. Separate image-contained claims from caption-contained claims.
4. If the page itself is weak, pivot on the strongest visible fragment with `scripts/recovery_query_builder.py`.
5. Log the strongest inspectable claim(s) before summarizing.

### Video / subtitle / gif-led analysis
1. Capture caption, visible duration, upload date, and any subtitle/on-screen text.
2. Distinguish clip content from commentary about the clip.
3. If you only have snippet-level access, keep conclusions provisional and pivot on distinctive subtitle fragments or overlays with `scripts/recovery_query_builder.py`.
4. Say whether frames or the original file would materially improve confidence.

### Audio / transcript-led analysis
1. Identify whether you have direct audio, subtitles, ASR text, or only quoted paraphrases.
2. Treat transcript quality as part of the evidence rating.
3. Avoid overreading tone, sarcasm, or exact wording without direct audio access.
4. If the only foothold is a quoted line, subtitle fragment, or repost caption, use `scripts/recovery_query_builder.py` to search for the earliest visible source or mirrors.
5. Log the spoken claim separately from reactions to it.

## Reliability caveats

- Search indexing can lag behind live app discussion.
- Viral repetition does not equal verification.
- Snippets can omit qualifiers or updates visible only on the landing page.
- Local quality and policy enforcement can change quickly; recency matters.
- Platform anti-bot controls can make no-login account research much thinner than in-app browsing.
- If evidence stays thin after cross-checking, say `inconclusive` rather than stretching.

## Recommended product posture

Treat this skill as a dual-mode RedNote research tool:
- **public-web mode** for broad research, weak-clue recovery, and no-login investigations
- **login-enhanced mode** for fuller account, recent-post, and comment review when the user explicitly opts in

When neither mode is enough on its own, use a hybrid path:
- public-web evidence + a few user-provided links/screenshots

FILE:references/access-modes.md
# RedNote Access Modes

Use this file when deciding how deep to go on a RedNote/Xiaohongshu task.

## Principle

The user should choose the access level.
Do not assume login by default.
Start in public-web mode unless the user explicitly wants deeper coverage and is willing to log in.

## Mode 1: Public-web mode (default)

Use when:
- the user wants a fast reputation check, trend scan, local recommendation scan, gossip synthesis, or weak-clue recovery
- the user does not want to log in
- the task can tolerate incomplete coverage

Capabilities:
- search-engine discovery
- public page fetching
- snippet/title/OCR/subtitle recovery
- cross-source verification
- structured claim extraction and concise reporting

Limits:
- weak coverage for cold accounts or newly posted content
- comment areas are often unavailable or incomplete
- user profile feeds may be inaccessible or thin
- public-web indexing can lag behind the app
- platform anti-bot controls may block direct browsing

Recommended phrasing:
- "I can do a public-web pass without login; coverage may be partial."
- "If you want fuller account-level or comment-level coverage, you can choose the login-enhanced mode."

## Mode 2: Login-enhanced mode (optional)

Use only when the user explicitly agrees to log in.

Use when:
- the user wants a fuller account summary
- the user wants recent posts from a specific account
- the user wants comment-area inspection
- public-web mode is too thin due to anti-bot, missing indexing, or cold-account visibility

What login changes:
- profile pages and recent posts are more likely to load fully
- note lists, media, and metadata may become inspectable
- comment and reaction patterns may become much easier to analyze
- browser automation can operate on a real authenticated session instead of a blocked public view

### Login path to offer the user

Present login as a choice, not a requirement.

Suggested explanation:
- "This skill supports two paths: public-web mode (no login, partial coverage) and login-enhanced mode (more complete coverage). If you want, you can log in inside the controlled browser session and I’ll use that session only for this research task."

Suggested workflow:
1. Explain the tradeoff between public-web mode and login-enhanced mode.
2. Ask whether the user wants to stay no-login or switch to login-enhanced mode.
3. If the user chooses login:
   - open the RedNote/Xiaohongshu site in the browser automation environment
   - let the user complete QR/code/password login themselves
   - wait for confirmation that login succeeded
   - continue research inside that authenticated browser session
4. If the user does not choose login, stay in public-web mode and proceed with explicit caveats.

## Safety and privacy notes

- Never force login.
- Never ask for the user’s password in chat when a normal site login flow is available.
- Prefer user-completed QR or site-native login flows inside the browser session.
- Treat the authenticated session as task-scoped and privacy-sensitive.
- Do not over-collect data just because login is available.
- State clearly when a finding depends on authenticated access versus public-web access.

## Decision rule

Choose the smallest sufficient access mode.

- If the user asks a broad question and public-web mode is enough, stay no-login.
- If the user asks for account-level completeness, recent-post coverage, or comments, offer login-enhanced mode.
- If public-web mode fails because of anti-bot or thin indexing, explicitly say login-enhanced mode is the next escalation path.
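The decision rule can be read as a small function. The sketch below is one hedged interpretation; `AccessMode`, `next_step`, and the need labels are hypothetical names chosen for the example.

```python
from enum import Enum


class AccessMode(Enum):
    PUBLIC_WEB = "public-web"
    LOGIN_ENHANCED = "login-enhanced"


# Assumed labels for the needs that justify offering login.
DEEP_COVERAGE_NEEDS = {"account-completeness", "recent-posts", "comments"}


def next_step(needs, public_web_sufficient, user_opted_in):
    """Return (mode, action) for the smallest sufficient access mode."""
    wants_depth = bool(DEEP_COVERAGE_NEEDS & set(needs))
    escalation_useful = wants_depth or not public_web_sufficient
    if not escalation_useful:
        # Broad question and public-web is enough: stay no-login.
        return AccessMode.PUBLIC_WEB, "proceed"
    if user_opted_in:
        # Login is used only after an explicit user choice.
        return AccessMode.LOGIN_ENHANCED, "proceed"
    # Never assume login: stay in public-web mode and offer the escalation.
    return AccessMode.PUBLIC_WEB, "offer-login"
```

Note that an opt-in alone does not trigger login here: if public-web mode already answers the question, the function stays in the smaller mode.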

## Output note

When login-enhanced mode was used, say so briefly in the answer.
Example:
- "This summary is based on an authenticated in-app/browser review, so coverage is fuller than public-web search alone."

When public-web mode was used, say that too.
Example:
- "This summary is based on public-web evidence only; recent or low-visibility posts may be missing."

FILE:references/account-summary-template.md
# Account Summary Templates

Use this file for account-level RedNote/Xiaohongshu tasks.

## 1) Quick account snapshot

- Account:
- Access mode: public-web only / login-enhanced browser review
- Coverage window requested:
- Coverage window actually inspected:
- Material inspected: profile / recent notes / comments / screenshots / mirrors
- Overall impression:
- Confidence: low / medium / high

## 2) Recent 1-3 month summary

- **Main themes:**
  - theme 1
  - theme 2
  - theme 3
- **Posting pattern:** frequent / moderate / sparse / bursty
- **Content format mix:** image posts / video / guide / roundup / repost / mixed
- **Tone/style:** diary-like / recommendation-heavy / polished / commercial / informational / meme-like
- **Likely audience:** locals / tourists / fans / shoppers / general lifestyle readers

## 3) Theme clustering pattern

For each cluster:
- **Theme:** brunch spots / coffee / AZ local events / malls / discounts / daily life / beauty / travel / etc.
- **Evidence:** representative post titles, visible captions, repeated venue types, recurring hashtags
- **Approx. share:** dominant / recurring / occasional
- **Confidence:** low / medium / high

## 4) Commercial intent / collaboration check

Use only if evidence exists.
- obvious ad markers visible?
- repeated merchant promotion?
- unusually similar call-to-action structure?
- affiliate / discount / booking / store push patterns?

Label carefully:
- no obvious signs
- possible commercial leaning
- clear sponsored/promotional pattern

## 5) Comment-area summary pattern

Only use when comments were directly inspected.

- **Main audience reaction:** positive / mixed / skeptical / negative / joking
- **Comment clusters:**
  - where is this?
  - is it worth it?
  - price/value discussion
  - queue / booking / logistics
  - support / hype / fandom
- **Caveat:** comment review is sampled unless explicitly exhaustive

## 6) Best-practice caveat lines

Public-web mode caveat:
- "This summary is based on public-web traces only, so recent or low-visibility posts may be missing."

Login-enhanced caveat:
- "This summary is based on an authenticated browser review of visible account content; it is fuller than public-web search but still limited to the material directly inspected."

## 7) Recommended final answer shape

### A. Snapshot
- Account:
- Access mode:
- Time window:
- Overall read:

### B. What this account has been posting about
- 3-6 bullets

### C. Posting style and pattern
- 2-4 bullets

### D. If relevant: audience reaction / commercial signals
- 2-4 bullets

### E. What may still be missing
- 1-3 bullets

FILE:references/claim-log-schema.md
# RedNote Structured Claim Log

Use this file when the task is noisy, multimodal, rumor-prone, or evidence-heavy. The goal is to separate claims from proof instead of dumping vibes into a summary.

## 1) Core rule

A claim log is not the final answer. It is the working ledger that keeps:
- what was claimed
- where it came from
- what evidence modality supports it
- how strong that support is
- what remains unresolved

## 2) Minimum claim fields

For each material claim, capture:
- `claim_id`
- `claim_text`
- `claim_type`: fact / allegation / praise / complaint / rumor / official statement / recommendation
- `theme`: pricing / service / policy / controversy / quality / fraud-risk / taste / queue / subtitles / audio / image-text / etc.
- `entity`
- `time_scope`
- `geography`
- `status`: supported / mixed / weak / contradicted / unresolved
- `confidence`: low / medium / high
- `notes`

## 3) Evidence item fields

Each claim can map to one or more evidence items.

Capture:
- `evidence_id`
- `claim_id`
- `source_url`
- `source_class`: official / media / first-hand / community / low-trust
- `modality`: text-page / image / screenshot / video / gif / audio / mixed
- `access_level`: direct file / fetched page / search snippet / quoted relay
- `visible_date`
- `extract`
- `summary`
- `credibility`: 0-5
- `score`: 0-5 (the risk / caution / recommendation score, kept separate from `credibility`)
- `supports`: supports / partially-supports / contradicts / contextual-only

## 4) Compact JSON template

```json
{
  "claims": [
    {
      "claim_id": "c1",
      "claim_text": "...",
      "claim_type": "allegation",
      "theme": ["controversy", "video"],
      "entity": "...",
      "time_scope": "2026-03",
      "geography": "...",
      "status": "weak",
      "confidence": "low",
      "notes": "Only snippet-level video references so far"
    }
  ],
  "evidence": [
    {
      "evidence_id": "e1",
      "claim_id": "c1",
      "source_url": "https://...",
      "source_class": "community",
      "modality": "video",
      "access_level": "search snippet",
      "visible_date": "2026-03-20",
      "extract": "caption or OCR/transcript fragment",
      "summary": "What is directly visible",
      "credibility": 2,
      "score": 3,
      "supports": "partially-supports"
    }
  ]
}
```
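A log in this shape can be checked mechanically before it is summarized. The validator below is a minimal sketch, not the behavior of `scripts/claim_log_tools.py`; the function name `validate_claim_log` is an assumption, while the allowed values are taken from the field lists in this file.

```python
import json

CLAIM_STATUSES = {"supported", "mixed", "weak", "contradicted", "unresolved"}
ACCESS_LEVELS = {"direct file", "fetched page", "search snippet", "quoted relay"}
SUPPORT_VALUES = {"supports", "partially-supports", "contradicts", "contextual-only"}


def validate_claim_log(raw_json):
    """Return a list of problems found; an empty list means the log passed."""
    log = json.loads(raw_json)
    claims = log.get("claims", [])
    claim_ids = {claim.get("claim_id") for claim in claims}
    problems = []
    for claim in claims:
        if claim.get("status") not in CLAIM_STATUSES:
            problems.append(f"{claim.get('claim_id')}: unknown status")
    for item in log.get("evidence", []):
        eid = item.get("evidence_id")
        # Every evidence item must point at a claim that exists in the log.
        if item.get("claim_id") not in claim_ids:
            problems.append(f"{eid}: no matching claim")
        if item.get("access_level") not in ACCESS_LEVELS:
            problems.append(f"{eid}: unknown access_level")
        if item.get("supports") not in SUPPORT_VALUES:
            problems.append(f"{eid}: unknown supports value")
        for field in ("credibility", "score"):
            value = item.get(field)
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not is_int or not 0 <= value <= 5:
                problems.append(f"{eid}: {field} must be an integer from 0 to 5")
    return problems
```

Collecting problems in a list instead of raising on the first one makes it easier to repair a noisy log in a single pass.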

## 5) Spreadsheet-style markdown fallback

When JSON is too heavy, use bullets grouped by claim.

Template:
- **Claim c1:** store raised prices after going viral
  - status: mixed
  - confidence: medium
  - evidence: `[credibility 4 | score 3 | image | fetched page | 2026-03] menu screenshot shows updated prices — <url>`
  - evidence: `[credibility 3 | score 2 | community | snippet | 2026-03] repeated complaints mention 贵 but without receipts — <url>`
  - gap: need official menu or multiple dated receipts

## 6) Status meanings

- `supported`: strongest available evidence points in the same direction
- `mixed`: meaningful evidence conflicts or scope is unclear
- `weak`: claim appears repeatedly but evidence is thin
- `contradicted`: stronger evidence cuts against the claim
- `unresolved`: too little inspectable evidence either way

## 7) Multimodal notes

When the evidence is not plain text, add one short modality note:
- image-text legible / partly legible / unclear
- video sequence direct / excerpted / not inspectable
- audio transcript direct / ASR-derived / second-hand quote

## 8) Aggregation guidance

Before writing the final answer:
1. identify which 2-4 claims actually matter to the user's decision
2. collapse duplicate evidence items
3. keep rumor claims visible but clearly downgraded
4. separate "widely discussed" from "well supported"
5. cite the strongest evidence for each conclusion
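Steps 2, 4, and 5 can be sketched as one pass over the evidence list. The helper name `summarize_evidence` is hypothetical, and using `credibility` as the measure of "strongest" is an assumption made for the example.

```python
def summarize_evidence(evidence):
    """Per claim, return the strongest item and the number of distinct sources.

    `distinct_sources` approximates "widely discussed"; the credibility of
    `best` approximates "well supported". The two are reported separately.
    """
    by_claim = {}
    for item in evidence:
        entry = by_claim.setdefault(
            item["claim_id"], {"best": item, "sources": set()}
        )
        # A set collapses duplicate evidence items that share a source URL.
        entry["sources"].add(item["source_url"])
        if item["credibility"] > entry["best"]["credibility"]:
            entry["best"] = item
    return {
        claim_id: {
            "best": entry["best"],
            "distinct_sources": len(entry["sources"]),
        }
        for claim_id, entry in by_claim.items()
    }
```

A claim with many distinct sources but a low-credibility best item is the "widely discussed but weakly supported" case that should stay visible and clearly downgraded in the final answer.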

## 9) Final-answer bridge

Translate the claim log into a human summary using this order:
1. strongest confirmed point
2. most repeated but weaker claim
3. major contradiction or uncertainty
4. practical recommendation or next check

FILE:references/login-enhanced-workflow.md
# Login-Enhanced Workflow

Use this file when the user explicitly chooses deeper RedNote/Xiaohongshu access and is willing to log in.

## Goal

Turn login-enhanced mode into a controlled, user-choice workflow.
Do not treat login as implicit.
Do not request passwords in chat when a site-native login flow is available.

## When to escalate to login-enhanced mode

Offer this mode when one or more of these are true:
- the user wants a full account summary or recent-post scan
- the user wants comment-area analysis
- public-web mode returned thin or blocked results
- the user wants higher confidence on an account, post, or trend

Suggested wording:
- "Public-web mode is too thin for this account. If you want, you can switch to login-enhanced mode and log in inside the controlled browser session for fuller coverage."

## Recommended browser flow

1. Explain why public-web mode is incomplete.
2. Ask whether the user wants to stay no-login or switch to login-enhanced mode.
3. If the user chooses login-enhanced mode:
   - open RedNote/Xiaohongshu in browser automation
   - navigate to the relevant profile/search/post page if known
   - tell the user to complete QR/code/password login in the browser session themselves
   - wait for confirmation that login succeeded
4. After login, continue only with the minimum navigation needed for the task.
5. At the end, summarize findings and say the result used login-enhanced access.

## What to collect after login

Collect the smallest evidence set that answers the user's request.

For an account summary, prefer:
- profile headline and self-description
- visible recent note list
- visible timestamps / posting cadence
- recurring themes
- visible media formats: image post, video, repost, guide, etc.
- visible engagement clues when available
- visible comments only if the user asked for them

## Account summary checklist

For "summarize this account's recent 1-3 months":
- identify the visible date range actually inspected
- count or estimate how many recent notes were directly inspected
- group notes into 3-6 recurring themes
- note whether posts are mostly food, local guide, lifestyle, ads, reposts, or mixed
- note whether style is diary-like, recommendation-heavy, deal-focused, aesthetic, meme-like, or informational
- note whether there are obvious campaigns, merchant collaborations, or repeated venue types
- note what remains missing even after login

## Comment review checklist

When comments are requested:
- inspect only enough comments to identify major clusters
- cluster into 3-5 themes
- separate support / criticism / joking / questions / purchase intent
- do not imply exhaustive full-thread coverage unless the full thread was actually reviewed

## Failure and fallback

If login cannot be completed or the platform still blocks access:
- say exactly what failed
- fall back to public-web mode
- ask for seed links, screenshots, or copied titles if needed

Example fallback wording:
- "Login-enhanced review didn’t complete successfully, so I’m falling back to public-web evidence plus any links/screenshots you provide."

## Privacy and restraint

- Do not browse unrelated private areas.
- Do not message, like, follow, or post unless the user explicitly asks.
- Do not scrape more history than needed.
- Treat authenticated sessions as sensitive.

## Output note

Add one short line in the final answer:
- "Access mode: login-enhanced browser review"
or
- "Access mode: public-web only"

FILE:references/minimal-user-input-paths.md
# Minimal User Input Paths

Use this file when public-web access is weak and the user does not want to log in.

## Principle

Ask for the smallest possible input that unlocks a much better answer.
Do not immediately ask for everything.

## Best low-friction inputs

From most useful to least useful:
1. profile URL
2. 3-5 recent note URLs
3. screenshot(s) of the profile and recent note list
4. copied note titles or captions
5. one screenshot per note with visible date/title/caption

## Suggested asks by task

### Account summary
Ask for one of:
- profile URL
- 3-5 recent note links
- profile screenshot + recent post screenshot list

### Post or comment analysis
Ask for one of:
- post URL
- screenshot(s) of the post and comments
- copied caption / comments / subtitles

### Local recommendation scan tied to one creator
Ask for one of:
- 3 representative posts
- visible venue names and dates
- profile screenshot plus top recent note titles

## Why this works

A few seed artifacts are usually enough to:
- infer themes
- estimate posting cadence
- identify recurring places or products
- analyze tone and commercial leaning
- recover more public-web evidence from titles, OCR, subtitles, and venue names

## Suggested phrasing

- "If you’d rather not log in, send me the profile link or 3-5 recent note links and I can still build a solid summary."
- "A screenshot of the profile and recent posts is enough for a first-pass account read."
- "You don’t need to send everything; a few representative posts are usually enough."
FILE:references/multimodal-capture.md
# RedNote Multimodal Capture Patterns

Use this file when the task involves screenshots, image posts, videos, gifs, subtitles, audio clips, or partial pages where the visible snippet is weaker than the underlying media signal.

## 1) Goal

Convert fragile public-web media evidence into inspectable notes without pretending to have more access than you do.

Priorities:
1. preserve what is directly visible
2. separate OCR/transcript output from interpretation
3. capture recovery paths when search snippets are partial
4. log each claim with evidence strength and modality

## 2) Observation ladder

Work from the most direct material available.

1. **Original media file or user-provided screenshot/video/audio**
2. **Fetched page with visible caption, alt text, surrounding text, or embedded metadata**
3. **Search snippet or cached preview**
4. **Secondary discussion about the media**

Do not collapse these levels together. A caption about a video is not the same as direct video inspection.

## 3) Image / screenshot workflow

For screenshots, posters, menus, notices, chat logs, or image-heavy posts:

### Capture
- note source URL and visible page title
- preserve visible date, username, location, and caption text if any
- describe image count and obvious layout if visible
- if text inside the image matters, treat OCR output as extracted evidence, not as guaranteed ground truth

### Extract
Break the image into layers:
- **Visible text:** exact words, numbers, prices, names, dates, hashtags, watermarks
- **Visual context:** scene, product, place, document type, UI layout, whether it looks cropped or edited
- **Confidence note:** clear / partially legible / low-resolution / occluded

### Analyze
Ask:
- is the key claim actually inside the image, or only in the caption/snippet?
- is the screenshot complete or selectively cropped?
- do visual details support or weaken the claim?
- is there a document number, store name, map pin, date, or interface cue that enables follow-up verification?

### Safe wording
- "The screenshot appears to show..."
- "The visible text suggests..."
- "I can read X with medium confidence; the lower section is cut off."

## 4) Video / gif workflow

Use when a post or search result points to a clip, reel, vlog, surveillance excerpt, or reaction video.

### Capture
Log:
- caption/title text
- visible duration if shown
- source URL
- visible upload date
- whether audio, subtitles, or comments are available from the public-web surface

### Extract in channels
- **Frames:** key scenes, objects, people, locations, documents, gestures, on-screen overlays
- **On-screen text:** subtitles, location stickers, date overlays, product names, prices
- **Narrative sequence:** what happens first / next / last
- **Edit cues:** cuts, zooms, reaction overlays, stitched repost indicators, meme captions

### Analysis questions
- does the clip itself show the claimed event, or only a reaction to it?
- are subtitles consistent with what is visible?
- does the clip appear heavily edited or excerpted?
- what would need actual frame extraction to verify more confidently?

### If only snippet-level access exists
Do not claim full clip analysis. Say:
- "Public-web access shows the caption/snippet, but not enough of the clip to verify the full sequence."
- "This looks like a video-led discussion, but I would need the file or extracted frames to analyze the clip itself."

## 5) Audio / ASR workflow

Use when the topic relies on voice notes, spoken explanations, livestream audio, or subtitle-less clips.

### Capture
- identify whether audio is directly available, indirectly referenced, or absent
- record speaker labels only when explicit
- preserve surrounding metadata: date, caption, URL, claimed context

### Extract
Separate:
- **Direct transcript** — exact spoken words if a transcript or subtitle exists
- **ASR-derived text** — machine-readable approximation from audio, if obtained
- **Interpretation** — meaning, tone, accusation, explanation, promise, denial

### Reliability notes
Lower confidence when:
- speaker overlap is strong
- background noise is heavy
- clip is very short
- the uploaded audio appears edited or spliced
- quote summaries differ from available transcript

### Safe wording
- "If the subtitle line is accurate, the speaker is claiming..."
- "I do not have enough direct audio access to verify tone or exact wording."
- "The available evidence is transcript-level, not waveform-level."

## 6) Partial-page fallback strategies

When the landing page is weak, inaccessible, or only partly indexed, recover carefully:

1. capture the search snippet verbatim, or closely enough to preserve the claim
2. search the exact title/caption phrase in quotes
3. search distinctive names, prices, dates, hashtags, or subtitle fragments
4. search for reposts, mirrors, or discussions quoting the same line
5. look for non-RedNote corroboration using the same entities or phrases
6. mark whether each recovered source is original, relay, or commentary

Good recovery targets:
- exact subtitle phrase
- menu price or product name
- notice title + date
- unique complaint wording
- hashtag + location

## 7) Evidence packaging for multimodal tasks

For each evidence item, log:
- modality: image / screenshot / video / gif / audio / mixed / text-page
- access level: direct file / fetched page / snippet / secondary relay
- extracted text: visible text or transcript fragment
- visual or audio summary
- claim supported
- confidence
- missing piece
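
The fields above map onto a small record type. The following is a minimal sketch, assuming Python dataclasses; `EvidenceItem`, `is_provisional`, `as_dict`, and the two allowed-value sets are hypothetical names chosen for illustration and are not part of the skill's scripts. Treating secondary relays as provisional alongside snippets is also an assumption.

```python
from dataclasses import dataclass, asdict
from typing import Any, Dict

# Allowed values copied from the checklist above; the set names are illustrative.
MODALITIES = {"image", "screenshot", "video", "gif", "audio", "mixed", "text-page"}
ACCESS_LEVELS = {"direct file", "fetched page", "snippet", "secondary relay"}


@dataclass
class EvidenceItem:
    modality: str
    access_level: str
    extracted_text: str
    media_summary: str
    claim_supported: str
    confidence: str
    missing_piece: str = ""

    def __post_init__(self) -> None:
        # Reject values outside the checklist so typos do not slip into the log.
        if self.modality not in MODALITIES:
            raise ValueError(f"unknown modality: {self.modality!r}")
        if self.access_level not in ACCESS_LEVELS:
            raise ValueError(f"unknown access level: {self.access_level!r}")

    def is_provisional(self) -> bool:
        # Assumption: snippet- or relay-level access keeps the claim provisional.
        return self.access_level in {"snippet", "secondary relay"}

    def as_dict(self) -> Dict[str, Any]:
        # Plain dict, ready for json.dumps.
        return asdict(self)
```

Validating at construction time is a design choice; loosen the sets if a task needs labels beyond the checklist.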

## 8) Common traps

Avoid these mistakes:
- treating OCR text as perfectly accurate
- assuming subtitles are faithful to speech
- inferring full comment sentiment from one snippet
- calling a reposted clip "first-hand evidence"
- confusing popularity with verification
- ignoring crop/edit signs in screenshots or videos

## 9) Decision rule

If media matters to the conclusion, but you only have snippet-level access, keep the final answer provisional and say what direct media artifact would most improve confidence.

FILE:references/output-patterns.md
# RedNote Community Intelligence Output Patterns

Use this file when the task needs compact but structured output, especially for comment clustering, fast-moving community updates, local recommendation summaries, and rumor-sensitive writeups.

## 1) Quick scan output

- Subject:
- Category:
- Time scope:
- Overall signal:
- Confidence:
- One-line take:

## 2) Evidence bullet format

Use one bullet per source or per consolidated claim.

Pattern:
- `[credibility 4 | score 3 | first-hand | 2026-03] refund dispute repeated across 3 posts — <url>`
- `[credibility 5 | score 4 | official | 2026-03] regulator notice confirms policy change — <url>`
- `[credibility 3 | score 1 | community | date unknown] many users praise atmosphere but specifics are thin — <url>`

Write the summary so the reader can understand the point without opening the link.
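
The bracketed pattern above can be produced mechanically. This is a small sketch, not part of the skill's scripts: `format_evidence_bullet` is a hypothetical helper, and the 0-5 range check assumes the scale used in the scoring rubric.

```python
DASH = "\u2014"  # the separator character used in the pattern above


def format_evidence_bullet(credibility: int, score: int, source_class: str,
                           summary: str, url: str, date: str = "") -> str:
    """Render one evidence bullet in the bracketed pattern."""
    for name, value in (("credibility", credibility), ("score", score)):
        if not 0 <= value <= 5:
            raise ValueError(f"{name} must be between 0 and 5, got {value}")
    # Fall back to the explicit "date unknown" label instead of leaving a gap.
    shown_date = date or "date unknown"
    return (f"- [credibility {credibility} | score {score} | {source_class} | "
            f"{shown_date}] {summary} {DASH} {url}")
```

For example, `format_evidence_bullet(3, 1, "community", "praise is broad but thin", "<url>")` yields a bullet labeled `date unknown` rather than an empty date slot.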

## 3) Comment clustering pattern

Cluster comments only when you have visible text, search snippets, screenshots, or fetched content. Do not imply full-thread access if you only saw snippets.

For each cluster, capture:
- cluster name
- stance: support / oppose / mixed / joking / skeptical / recommendation / complaint
- representative wording pattern
- approximate repetition count
- evidence quality note

Template:
- **Cluster:** price complaints
  - stance: complaint
  - pattern: "贵", "不值这个价", "性价比一般"
  - repetition: ~6 visible mentions
  - note: mostly snippet-level, medium confidence

- **Cluster:** still worth trying once
  - stance: mixed recommendation
  - pattern: "排队久但是拍照出片", "适合打卡"
  - repetition: ~4 visible mentions
  - note: consistent but not deeply verified

### Useful clustering buckets

Use 3-5 buckets unless the evidence is unusually rich:
- price / value
- quality / outcome
- service / attitude
- logistics / queue / wait / access
- credibility / scam / official response
- humor / meme / sarcasm
- support / defense / fandom
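
A first pass at clustering can be done with plain keyword matching before any manual read. The sketch below is illustrative only: `BUCKET_KEYWORDS`, `cluster_comments`, and `render_clusters` are hypothetical names, and the keyword lists are examples rather than a vetted lexicon. Stance and evidence-quality notes still need a human or model judgment.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical keyword map: bucket names follow the list above, keywords are examples only.
BUCKET_KEYWORDS: Dict[str, List[str]] = {
    "price / value": ["贵", "不值", "性价比"],
    "logistics / queue / wait / access": ["排队", "等位", "预约"],
    "service / attitude": ["服务", "态度"],
}


def cluster_comments(comments: List[str]) -> Dict[str, List[str]]:
    """Group visible comments by the first bucket whose keyword they contain."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for comment in comments:
        for bucket, keywords in BUCKET_KEYWORDS.items():
            if any(keyword in comment for keyword in keywords):
                clusters[bucket].append(comment)
                break
        else:
            # No keyword matched; keep the comment visible instead of dropping it.
            clusters["unclustered"].append(comment)
    return dict(clusters)


def render_clusters(clusters: Dict[str, List[str]]) -> str:
    """Emit each cluster name with an approximate visible-mention count."""
    lines = []
    for bucket, items in clusters.items():
        lines.append(f"- **Cluster:** {bucket}")
        lines.append(f"  - repetition: ~{len(items)} visible mentions")
    return "\n".join(lines)
```

The counts are counts of inspected comments only, so the "visible mentions" wording in the output is deliberate.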

## 4) Latest update or policy scan pattern

Use this structure for policy or community updates:
- **What changed:** the claimed update in one sentence
- **Confirmed by:** official / media / community chatter
- **Earliest visible date:**
- **Likely impact:** who is affected and how
- **Disagreement / confusion:** where interpretation diverges
- **Confidence:** low / medium / high

## 5) Gossip or controversy synthesis pattern

Keep allegation, response, and consequence separate.

Template:
- **Trigger event:** what started the discussion
- **Main allegations or rumors:** 2-4 bullets
- **Response:** official or involved-party reaction
- **What is actually verified:** hard facts only
- **What remains rumor-level:** unverified claims, leaks, screenshots, hearsay
- **Current community mood:** mockery / anger / fatigue / split / support

## 6) Local recommendation pattern

Use tradeoffs instead of declaring a single winner.

Template:
- **Best for:** photo spot / date / fast meal / family / specialty dish / late night
- **Strengths:**
- **Common complaints:**
- **Price impression:** cheap / fair / expensive for what it is
- **Queue / booking note:**
- **Decision:** worth trying / only if nearby / hype > quality / avoid for now

### Shortlist comparison pattern

When comparing multiple local options, use one bullet per venue:
- `Name — best for X; common complaint Y; confidence low/medium/high`

## 7) Media analysis guardrails

For post, video, or gif tasks, explicitly mark:
- what is directly visible
- what is inferred from captions or snippets
- what would require login, in-app rendering, full comment expansion, or frame extraction

Safe phrasing examples:
- "Based on the visible snippet..."
- "I can infer a likely reaction split, but not confirm full comment distribution from public search results alone."
- "To analyze the clip itself rather than metadata, I would need the file or extracted frames."

## 8) Verification-aware conclusion pattern

Use this when evidence is noisy:
- **Strongest confirmed point:**
- **Most repeated but weakly verified claim:**
- **Main uncertainty:**
- **Decision posture:** proceed / proceed cautiously / wait for confirmation / avoid for now

FILE:references/public-web-recovery.md
# RedNote Public-Web Recovery Patterns

Use this file when the first RedNote/Xiaohongshu result is partial, blocked, weakly indexed, or reduced to a search snippet.

## 1) Goal

Recover inspectable evidence from the public web without pretending you accessed the full in-app thread.

Priority order:
1. preserve the original snippet/title/caption clues
2. pivot on distinctive fragments
3. recover original or mirrored pages
4. separate originals, relays, and commentary
5. log what remains inaccessible

## 2) What to preserve immediately

Before chasing more links, capture:
- search query used
- visible result title
- visible snippet text
- URL/domain
- visible date if shown
- any prices, names, hashtags, store names, usernames, dates, or subtitle phrases
- whether the result seems to be a post, profile, note, menu image, video page, or discussion about a post

These details often disappear after the next search pivot.

## 3) Recovery pivots

Use the strongest available fragment, not generic keywords.

Best pivots:
- exact title phrase in quotes
- distinctive subtitle line in quotes
- hashtag + location
- store name + dish name + price
- notice title + date
- complaint wording + refund / contract / order / queue / course
- username + unique phrase

Weak pivots:
- generic words like `避雷`, `评价`, `值得吗`
- emotional summaries with no entity/date
- broad one-word category labels
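
A crude filter can flag the first kind of weak pivot before a search is spent on it. This is a sketch under simple assumptions: `GENERIC_TERMS` and `is_weak_pivot` are hypothetical names, the fragment is assumed to be whitespace-separated, and the check only catches fragments made entirely of the generic terms listed. It does not detect broad category labels or emotional summaries.

```python
# Generic terms taken from the weak-pivot examples above; extend as needed.
GENERIC_TERMS = {"避雷", "评价", "值得吗"}


def is_weak_pivot(fragment: str) -> bool:
    """Return True when a fragment is empty or contains only generic terms."""
    tokens = fragment.split()
    return not tokens or all(token in GENERIC_TERMS for token in tokens)
```

A fragment that mixes an entity with a generic term, such as a store name plus `避雷`, passes the filter because the entity makes it searchable.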

## 4) Recovery workflow

### A. Exact-fragment recovery
1. search the visible title or subtitle fragment in quotes
2. search the same fragment with and without the entity name
3. search the fragment on `site:xiaohongshu.com` and the wider web
4. compare whether results are originals, mirrors, or quote-relays
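
Steps 1-3 above amount to a small query matrix. The sketch below shows one way to generate it; `exact_fragment_queries` is a hypothetical helper, not a function from `scripts/query_builder.py`, and it assumes the search engine honors double-quoted phrases and the `site:` operator.

```python
from typing import List


def exact_fragment_queries(fragment: str, entity: str = "") -> List[str]:
    """Build quoted-phrase queries: with/without entity, RedNote-scoped and wider web."""
    # Strip stray double quotes so the fragment can be wrapped in exactly one pair.
    quoted = '"' + fragment.strip().replace('"', "") + '"'
    bases = [quoted]
    if entity.strip():
        bases.append(f"{entity.strip()} {quoted}")  # same fragment, with the entity name
    queries: List[str] = []
    for base in bases:
        queries.append(f"site:xiaohongshu.com {base}")  # RedNote-scoped search
        queries.append(base)                            # wider web
    return queries
```

Step 4, judging whether each result is an original, a mirror, or a quote-relay, is left to manual review.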

### B. Entity-detail recovery
Use when you only have partial page access.
1. combine entity name with the strongest inspectable detail: price, product, date, district, hashtag, event name
2. search those combinations in both Chinese and English where relevant
3. search likely aliases, abbreviations, nicknames, or old names

### C. Cross-source recovery
Use when RedNote pages are thin.
1. search the same phrase on news, maps, forum, blog, or official domains
2. recover named actors, dates, and consequences
3. use those recovered facts to search back into RedNote discussion

### D. Media-led recovery
Use when the snippet points to image/video/audio evidence.
1. pivot on on-screen text, watermark, subtitle fragment, menu price, or visible date
2. separate caption-based claims from media-contained claims
3. search for reposts or mirrors quoting the same text
4. keep the claim provisional if you still cannot inspect the media itself

## 5) Source labeling after recovery

When recovery succeeds, tag the page clearly:
- **original**: likely firsthand or original host page
- **mirror/repost**: copied or republished material
- **commentary**: people discussing the original claim
- **verification**: official or documentary follow-up

Do not let a commentary page silently inherit the credibility of the original artifact.

## 6) Good recovery notes to carry into the final answer

Examples:
- "Search snippet suggested a refund dispute; exact phrase recovery found two reposts and one detailed firsthand post."
- "The indexed RedNote page was thin, but a quoted subtitle fragment recovered mirrored discussion and an official response page."
- "I recovered menu-price discussion from public reposts, but not the original image file, so the pricing claim stays medium-confidence."

## 7) When to stop

Stop escalating when:
- repeated searches return only recycled commentary
- no distinct named source emerges
- the remaining uncertainty depends on in-app comments, full media access, or login-only rendering

At that point, say what the public web does show and what artifact would most improve confidence.

FILE:references/scoring-rubric.md
# RedNote Community Intelligence Scoring Rubric

Use this file when the task needs more careful judgment than the short rubric in `SKILL.md`.

## Source classes

1. **Official / documentary**
   - official brand, school, shop, regulator, or platform pages
   - government notices, court judgments, company registrations, policy releases

2. **Reported / journalistic**
   - established media or industry reporting with named sources

3. **First-hand user account**
   - detailed personal experience with dates, prices, screenshots, names, order details, or concrete specifics

4. **Community discussion / anecdote**
   - RedNote/Xiaohongshu snippets, Zhihu answers, Tieba/forum threads, reposts, roundups

5. **Low-trust pages**
   - SEO farms, scraped aggregators, unverifiable reposts

## Credibility cues

Increase credibility when a source has:
- named entity and consistent branding
- concrete dates, fees, store names, policy text, screenshots, or timeline details
- direct first-hand experience
- corroboration from independent sources
- recent publication date when the topic is time-sensitive

Lower credibility when a source has:
- vague emotional wording without specifics
- no visible date or context
- engagement bait wording like "千万别去" or "塌房实锤" with no evidence
- copied text across many pages
- obvious marketing or affiliate incentives
- edited screenshots or second-hand relays with no provenance

## Risk and recommendation cues by category

### Education reputation
High-signal red flags:
- guaranteed admission, job, or offer claims
- refund disputes or impossible refund conditions
- mismatch between sales pitch and contract
- fake or inflated employment or admission outcomes
- repeated complaints about absent teaching or support

### Policy or community updates
High-signal caution signs:
- claims of a new rule with no primary source
- community over-reading a vague official notice
- screenshots without source link or date
- old policy being recirculated as new

### Gossip or controversy
High-signal caution signs:
- anonymous chat logs with no provenance
- "everyone is saying" rumor loops
- clipped videos or gifs stripped of context
- quote screenshots without original post link

### Local recommendations
Useful decision cues:
- repeated praise on a specific dish or service strength
- repeated complaints on price, queue, hygiene, or service attitude
- consistency across different source types such as RedNote plus maps/reviews plus official menu or hours pages
- recency, because local quality can shift quickly

## Suggested scoring method

For each source:
- assign `credibility: 0-5`
- assign `score: 0-5`
- tag one or more themes such as `policy`, `refund`, `pricing`, `support`, `controversy`, `taste`, `service`, `environment`

Interpret `score` carefully:
- for risk-oriented tasks, higher = more caution or downside risk
- for recommendation tasks, either relabel the field clearly or explain whether high means stronger recommendation vs stronger caution

Then aggregate:
- if sources are numerous but even the strongest are weak, say `signal exists but confidence is limited`
- if a few high-credibility sources show severe issues, escalate overall caution even if sentiment is mixed
- if positive claims are mostly official marketing, do not treat them as independent reputation evidence
- if local praise is broad but shallow, avoid overconfident recommendations

## Lightweight aggregation rule

Use this simple decision rule unless the user needs a more formal method:
1. Start from the strongest 3-5 inspectable sources.
2. Let independent corroboration matter more than repetition within one platform.
3. Let recency matter more for policy, drama, and local venues.
4. If the evidence is split, say `mixed` and state what would resolve the uncertainty.
5. If the evidence is thin, say `inconclusive` instead of forcing a verdict.
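
The rule can be approximated in a few lines. This is a rough sketch, not the skill's method: `aggregate_signal` is a hypothetical function, the field names (`credibility`, `source_url`, `supports`) are borrowed from the claim log format, "independent" is approximated as "different domain", and the threshold of three independent sources is an assumption. Rule 3 (recency) is not modeled.

```python
from typing import Any, Dict, List
from urllib.parse import urlparse


def aggregate_signal(evidence: List[Dict[str, Any]], top_n: int = 5,
                     min_sources: int = 3) -> str:
    """Return 'inconclusive', 'mixed', 'supports', or 'contradicts' for one claim."""
    # Rule 1: keep only the strongest few sources by credibility.
    strongest = sorted(evidence, key=lambda item: item["credibility"], reverse=True)[:top_n]

    # Rule 2: count each domain once so repetition within one platform adds nothing.
    # The sort is stable, so each domain keeps the stance of its most credible item.
    stance_by_domain: Dict[str, str] = {}
    for item in strongest:
        domain = urlparse(item["source_url"]).netloc
        stance_by_domain.setdefault(domain, item["supports"])

    # Rule 5: too few independent sources means no verdict.
    if len(stance_by_domain) < min_sources:
        return "inconclusive"

    stances = set(stance_by_domain.values()) - {"contextual-only"}
    # Rule 4: disagreement or only partial support is reported as mixed.
    if {"supports", "contradicts"} <= stances or "partially-supports" in stances:
        return "mixed"
    if stances == {"supports"}:
        return "supports"
    if stances == {"contradicts"}:
        return "contradicts"
    return "inconclusive"
```

When the result is `mixed` or `inconclusive`, the written answer should still state what would resolve the uncertainty; the function only picks the label.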

## Example normalized evidence items

- Source class: first-hand user account
- Theme: refund
- Summary: user reports refund denied after promised trial period; includes screenshots and contract excerpt
- Credibility: 4
- Score: 4
- Date: 2025-09
- URL: <link>

- Source class: official / documentary
- Theme: policy
- Summary: city notice confirms revised licensing rule and implementation date
- Credibility: 5
- Score: 4
- Date: 2026-03
- URL: <link>

- Source class: community discussion
- Theme: taste
- Summary: repeated praise for signature dish, but service complaints recur during peak hours
- Credibility: 3
- Score: 2
- Date: 2026-03
- URL: <link>

## Example overall labels

- **Positive**: mostly positive first-hand feedback, no serious recurring red flags, official claims broadly consistent
- **Mixed**: quality or interpretation varies, complaints exist but are not clearly systemic
- **Caution**: repeated complaints, unclear policy interpretation, or weakly verified controversy signals
- **High risk**: strong signs of fraud, legal trouble, repeated refund or credential issues, or verified severe misconduct
- **Inconclusive**: too little inspectable evidence to judge confidently

FILE:references/verification-patterns.md
# RedNote Community Intelligence Verification Patterns

Use this file when a claim could materially change the answer: rumors, policy changes, legitimacy checks, major controversies, safety concerns, or expensive local decisions.

## 1) Claim decomposition

Split the topic into separate parts before verifying:
- **Claim:** what is being said
- **Actor:** who allegedly did or announced it
- **Time:** when it supposedly happened
- **Place / jurisdiction:** where it applies
- **Consequence:** what changed in practice

Do not verify a vague bundle like "this school is a scam" in one step. Break it into inspectable subclaims such as refunds, licensing, outcome claims, lawsuits, or official penalties.
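
One way to keep the decomposition honest is to record each subclaim with the five parts as explicit fields. This is a minimal sketch; `Subclaim` and `missing_parts` are hypothetical names, not part of the skill's scripts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subclaim:
    claim: str             # what is being said
    actor: str = ""        # who allegedly did or announced it
    time: str = ""         # when it supposedly happened
    place: str = ""        # where it applies (place / jurisdiction)
    consequence: str = ""  # what changed in practice

    def missing_parts(self) -> List[str]:
        """List the parts still blank, i.e. what to pin down before verifying."""
        return [name for name in ("actor", "time", "place", "consequence")
                if not getattr(self, name)]
```

A bundle such as "this school is a scam" would become several `Subclaim` records (refunds, licensing, outcome claims), each with its own list of missing parts.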

## 2) Source ladder

Prefer this order:
1. primary source or official record
2. reputable report with named sourcing
3. detailed firsthand post with concrete specifics
4. community discussion that helps discover themes
5. rumor relays, reposts, SEO pages

A loud claim can stay low-confidence if it never climbs the ladder.

## 3) Earliest-source check

For viral claims, ask:
- what is the earliest visible source?
- is the viral post original or copied?
- do later posts add evidence or just repeat wording?
- is the screenshot cropped, translated, or edited?

If you cannot locate the origin, keep the claim at rumor level.

## 4) Policy verification pattern

For policy or rule changes:
- find the primary notice, circular, or named authority first
- capture the effective date, jurisdiction, and affected group
- separate the official text from community interpretation
- check whether the update is new, a draft, or an older rule resurfacing
- note enforcement uncertainty if the notice exists but implementation evidence is thin

## 5) Business legitimacy pattern

For schools, clinics, agencies, shops, or service providers:
- verify the legal or operating name if possible
- look for registration, licensing, named address, or official contact points
- compare marketing claims with contract, menu, tuition, or service details
- note mismatches between branding and the formal entity
- treat missing basics as caution signals, not automatic proof of fraud

## 6) Controversy verification pattern

Keep these lanes separate:
- **allegation**
- **response**
- **verified consequence**

Examples of verified consequence:
- official statement issued
- account removed or suspended
- event canceled
- regulatory or legal action documented
- named organization confirms personnel change

Do not upgrade an allegation to fact just because a response exists.

## 7) Local recommendation verification pattern

When local hype is strong:
- check recency of praise and complaints
- compare RedNote enthusiasm with maps or review-site complaints
- watch for one photogenic item dominating otherwise weak reviews
- separate `good for photos` from `good to eat / good service / worth the trip`
- if the place is new, say quality may still be unstable

## 8) Safe wording patterns

Use these phrases when certainty is limited:
- "The strongest confirmed point is..."
- "This claim appears repeatedly, but I did not find a primary source confirming it."
- "Public-web evidence supports X, while Y remains rumor-level."
- "The update is likely real, but scope and enforcement remain unclear."
- "I found popularity signals, not enough evidence for a strong quality claim."

## 9) Minimum bar before strong conclusions

Before saying `high risk`, `confirmed`, or `worth recommending`, try to meet at least one of these bars:
- one high-credibility source plus independent corroboration
- multiple specific firsthand accounts with consistent details
- official record plus visible real-world implementation evidence

If the bar is not met, downgrade the confidence or overall signal.
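
The first two bars can be pre-checked against a claim's evidence list. The sketch below is an approximation under stated assumptions: `meets_minimum_bar` and `HIGH_CREDIBILITY` are hypothetical, "high credibility" is taken as 4 or above on the 0-5 scale, and "independent corroboration" is approximated as a supporting source from a different source class. Consistency of details and real-world implementation evidence (the third bar) still need a manual read.

```python
from typing import Any, Dict, List

HIGH_CREDIBILITY = 4  # assumed cut-off on the 0-5 credibility scale


def meets_minimum_bar(evidence: List[Dict[str, Any]]) -> bool:
    """Pre-check bars 1 and 2; bar 3 is not modeled here."""
    supporting = [item for item in evidence if item.get("supports") == "supports"]
    classes = {item.get("source_class") for item in supporting}

    # Bar 1: one high-credibility source plus a supporting source of a different class.
    has_high = any(item.get("credibility", 0) >= HIGH_CREDIBILITY for item in supporting)
    if has_high and len(classes) >= 2:
        return True

    # Bar 2: at least two supporting first-hand accounts.
    first_hand = [item for item in supporting if item.get("source_class") == "first-hand"]
    return len(first_hand) >= 2
```

A `False` result is a prompt to downgrade the wording, not proof that the claim is wrong.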

FILE:scripts/claim_log_tools.py
#!/usr/bin/env python3
import argparse
import json
import sys
from collections import Counter, defaultdict
from pathlib import Path
from typing import Any, Dict, List

CLAIM_LOG_TEMPLATE = {
    'claims': [],
    'evidence': [],
}


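def load_json(path: str) -> Dict[str, Any]:
    """Read a claim log from a file path, or from stdin when path is '-'."""
    if path == '-':
        return json.load(sys.stdin)
    return json.loads(Path(path).read_text(encoding='utf-8'))


def dump_json(data: Dict[str, Any]) -> None:
    # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
    print(json.dumps(data, ensure_ascii=False, indent=2))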


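def normalize(data: Dict[str, Any]) -> Dict[str, Any]:
    """Fill in default fields for every claim and evidence item.

    Evidence that references an unknown claim_id gets a placeholder claim so
    that no evidence is orphaned in the summary.
    """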
    claims = data.get('claims') or []
    evidence = data.get('evidence') or []

    normalized_claims: List[Dict[str, Any]] = []
    for idx, claim in enumerate(claims, start=1):
        item = dict(claim)
        item.setdefault('claim_id', f'c{idx}')
        item.setdefault('claim_text', '')
        item.setdefault('claim_type', 'fact')
        theme = item.get('theme', [])
        if isinstance(theme, str):
            theme = [theme] if theme else []
        item['theme'] = theme
        item.setdefault('entity', '')
        item.setdefault('time_scope', '')
        item.setdefault('geography', '')
        item.setdefault('status', 'unresolved')
        item.setdefault('confidence', 'low')
        item.setdefault('notes', '')
        normalized_claims.append(item)

    claim_ids = {claim['claim_id'] for claim in normalized_claims}
    normalized_evidence: List[Dict[str, Any]] = []
    for idx, evidence_item in enumerate(evidence, start=1):
        item = dict(evidence_item)
        item.setdefault('evidence_id', f'e{idx}')
        item.setdefault('claim_id', '')
        if item['claim_id'] and item['claim_id'] not in claim_ids:
            claim_ids.add(item['claim_id'])
            normalized_claims.append({
                'claim_id': item['claim_id'],
                'claim_text': '',
                'claim_type': 'fact',
                'theme': [],
                'entity': '',
                'time_scope': '',
                'geography': '',
                'status': 'unresolved',
                'confidence': 'low',
                'notes': 'Auto-created because evidence referenced a missing claim_id',
            })
        item.setdefault('source_url', '')
        item.setdefault('source_class', 'community')
        item.setdefault('modality', 'text-page')
        item.setdefault('access_level', 'fetched page')
        item.setdefault('visible_date', '')
        item.setdefault('extract', '')
        item.setdefault('summary', '')
        # Missing, None, or empty values become 0; a non-numeric string raises ValueError.
        item['credibility'] = int(item.get('credibility', 0) or 0)
        item['score'] = int(item.get('score', 0) or 0)
        item.setdefault('supports', 'contextual-only')
        normalized_evidence.append(item)

    return {'claims': normalized_claims, 'evidence': normalized_evidence}


def summary(data: Dict[str, Any]) -> Dict[str, Any]:
    """Build overall counts and per-claim stats; expects the output of normalize()."""
    claims = data.get('claims', [])
    evidence = data.get('evidence', [])
    evidence_by_claim = defaultdict(list)
    for item in evidence:
        evidence_by_claim[item.get('claim_id', '')].append(item)

    source_classes = Counter(item.get('source_class', 'unknown') for item in evidence)
    modalities = Counter(item.get('modality', 'unknown') for item in evidence)
    access_levels = Counter(item.get('access_level', 'unknown') for item in evidence)
    statuses = Counter(item.get('status', 'unresolved') for item in claims)

    claim_summaries = []
    for claim in claims:
        claim_id = claim['claim_id']
        linked = evidence_by_claim.get(claim_id, [])
        credibility_values = [item['credibility'] for item in linked]
        score_values = [item['score'] for item in linked]
        support_counts = Counter(item.get('supports', 'contextual-only') for item in linked)
        claim_summaries.append({
            'claim_id': claim_id,
            'claim_text': claim.get('claim_text', ''),
            'status': claim.get('status', 'unresolved'),
            'confidence': claim.get('confidence', 'low'),
            'evidence_count': len(linked),
            'max_credibility': max(credibility_values) if credibility_values else 0,
            'avg_credibility': round(sum(credibility_values) / len(credibility_values), 2) if credibility_values else 0,
            'avg_score': round(sum(score_values) / len(score_values), 2) if score_values else 0,
            'support_mix': dict(support_counts),
        })

    return {
        'claim_count': len(claims),
        'evidence_count': len(evidence),
        'status_counts': dict(statuses),
        'source_class_counts': dict(source_classes),
        'modality_counts': dict(modalities),
        'access_level_counts': dict(access_levels),
        'claims': claim_summaries,
    }


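def init_template() -> Dict[str, Any]:
    # Return fresh lists so callers cannot mutate the shared module-level template.
    return {key: list(value) for key, value in CLAIM_LOG_TEMPLATE.items()}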


def main() -> None:
    parser = argparse.ArgumentParser(description='Normalize and summarize RedNote structured claim logs.')
    subparsers = parser.add_subparsers(dest='command', required=True)

    subparsers.add_parser('init', help='Print an empty claim log template')

    normalize_parser = subparsers.add_parser('normalize', help='Normalize a claim log JSON file')
    normalize_parser.add_argument('path', help='Path to claim log JSON, or - for stdin')

    summary_parser = subparsers.add_parser('summary', help='Summarize a claim log JSON file')
    summary_parser.add_argument('path', help='Path to claim log JSON, or - for stdin')
    summary_parser.add_argument('--markdown', action='store_true', help='Emit markdown instead of JSON')

    args = parser.parse_args()

    if args.command == 'init':
        dump_json(init_template())
        return

    data = normalize(load_json(args.path))

    if args.command == 'normalize':
        dump_json(data)
        return

    result = summary(data)
    if not args.markdown:
        dump_json(result)
        return

    print('# Claim log summary')
    print()
    print(f"- Claims: {result['claim_count']}")
    print(f"- Evidence items: {result['evidence_count']}")
    print(f"- Status counts: {json.dumps(result['status_counts'], ensure_ascii=False)}")
    print(f"- Source classes: {json.dumps(result['source_class_counts'], ensure_ascii=False)}")
    print(f"- Modalities: {json.dumps(result['modality_counts'], ensure_ascii=False)}")
    print(f"- Access levels: {json.dumps(result['access_level_counts'], ensure_ascii=False)}")
    print()
    print('## Claim-level view')
    for item in result['claims']:
        text = item['claim_text'] or '(blank claim text)'
        print(f"- {item['claim_id']} | status={item['status']} | confidence={item['confidence']} | evidence={item['evidence_count']} | max_cred={item['max_credibility']} | avg_cred={item['avg_credibility']} | avg_score={item['avg_score']} | {text}")


if __name__ == '__main__':
    main()

FILE:scripts/query_builder.py
#!/usr/bin/env python3
import argparse
import json
import re
from typing import Dict, Iterable, List

CATEGORY_MODIFIERS: Dict[str, List[str]] = {
    "education": [
        "评价", "口碑", "避雷", "靠不靠谱", "真实体验", "投诉", "退费",
        "虚假宣传", "课程质量", "师资", "就业", "offer", "录取", "学费", "合同",
        "隐形消费", "霸王条款", "维权", "中介", "保录", "保offer"
    ],
    "policy": [
        "政策", "新规", "通知", "官方回应", "执行", "解读", "影响", "整改",
        "监管", "实施", "变化", "风向"
    ],
    "gossip": [
        "八卦", "爆料", "热议", "争议", "翻车", "塌房", "后续", "回应",
        "聊天记录", "是真的吗", "真假", "瓜", "避雷"
    ],
    "local": [
        "推荐", "探店", "评价", "口碑", "避雷", "值得去吗", "排队", "菜品",
        "价格", "服务", "环境", "打卡", "攻略", "值不值"
    ],
    "general": [
        "评价", "口碑", "避雷", "体验", "投诉", "靠谱吗", "真实反馈",
        "最近", "热议", "怎么样", "值不值"
    ],
}

FAMILY_MODIFIERS: Dict[str, List[str]] = {
    "overview": ["小红书", "怎么样", "评价", "口碑"],
    "latest": ["最新", "最近", "近况", "本周", "本月"],
    "trending": ["热议", "爆了", "上热搜", "讨论", "风向", "后续"],
    "comment": ["评论", "评论区", "大家怎么说", "反馈", "吐槽"],
    "review": ["避雷", "体验", "值得吗", "投诉", "真实体验"],
    "recommendation": ["推荐", "值得去吗", "值不值", "攻略", "对比"],
    "verification": ["官方回应", "通知", "声明", "注册", "资质", "处罚", "法院", "备案"],
    "image": ["截图", "图片", "配图", "海报", "菜单", "聊天记录", "通知图"],
    "video": ["视频", "片段", "录像", "监控", "vlog", "gif", "动图"],
    "subtitle": ["字幕", "文案", "台词", "画面文字", "字幕截图"],
    "audio": ["录音", "语音", "音频", "直播录屏", "说了什么"],
}

SOURCE_PATTERNS = [
    "site:xiaohongshu.com {entity} {modifier}",
    "site:www.xiaohongshu.com {entity} {modifier}",
    "{entity} 小红书 {modifier}",
    "{entity} {modifier}",
]

CLAIM_LOG_TEMPLATE = {
    "claims": [
        {
            "claim_id": "c1",
            "claim_text": "...",
            "claim_type": "fact|allegation|praise|complaint|rumor|official statement|recommendation",
            "theme": ["pricing|service|policy|controversy|quality|fraud-risk|taste|queue|image-text|video|audio"],
            "entity": "",
            "time_scope": "",
            "geography": "",
            "status": "supported|mixed|weak|contradicted|unresolved",
            "confidence": "low|medium|high",
            "notes": ""
        }
    ],
    "evidence": [
        {
            "evidence_id": "e1",
            "claim_id": "c1",
            "source_url": "",
            "source_class": "official|media|first-hand|community|low-trust",
            "modality": "text-page|image|screenshot|video|gif|audio|mixed",
            "access_level": "direct file|fetched page|search snippet|quoted relay",
            "visible_date": "",
            "extract": "",
            "summary": "",
            "credibility": 0,
            "score": 0,
            "supports": "supports|partially-supports|contradicts|contextual-only"
        }
    ]
}

REPORT_TEMPLATE = {
    "snapshot": {
        "subject": "",
        "category": "education|policy|gossip|local|general",
        "time_scope": "",
        "overall_signal": "positive|mixed|caution|high risk|inconclusive",
        "confidence": "low|medium|high",
    },
    "main_findings": [
        "finding 1",
        "finding 2",
    ],
    "discussion_clusters": [
        {
            "theme": "pricing|quality|service|fraud risk|policy impact|taste|queue|environment|controversy|support",
            "summary": "",
            "repetition_count": 0,
            "sentiment": "positive|mixed|negative|split",
            "confidence": "low|medium|high",
        }
    ],
    "evidence": [
        {
            "credibility": 0,
            "score": 0,
            "theme": "refund|teaching|outcomes|pricing|support|legal|credentials|policy|controversy|taste|service|environment|image-text|video|audio",
            "summary": "",
            "source_class": "official|media|first-hand|community|low-trust",
            "modality": "text-page|image|screenshot|video|gif|audio|mixed",
            "access_level": "direct file|fetched page|search snippet|quoted relay",
            "url": "",
            "date": "",
        }
    ],
    "unverified": [""],
    "next_checks": [""],
}


def clean_text(text: str) -> str:
    return re.sub(r"\s+", " ", text.strip())


def split_csv(values: Iterable[str]) -> List[str]:
    out: List[str] = []
    for value in values:
        for item in value.split(","):
            item = clean_text(item)
            if item:
                out.append(item)
    return out


def dedupe(items: Iterable[str]) -> List[str]:
    seen = set()
    out: List[str] = []
    for item in items:
        normalized = clean_text(item)
        if normalized and normalized not in seen:
            seen.add(normalized)
            out.append(normalized)
    return out


def combined_modifiers(category: str, families: List[str]) -> List[str]:
    base = ["小红书"] + CATEGORY_MODIFIERS.get(category, CATEGORY_MODIFIERS["general"])
    family_terms: List[str] = []
    for family in families:
        family_terms.extend(FAMILY_MODIFIERS.get(family, []))
    return dedupe(base + family_terms)


def validate_families(families: List[str]) -> List[str]:
    allowed = set(FAMILY_MODIFIERS)
    invalid = [family for family in families if family not in allowed]
    if invalid:
        raise SystemExit(
            "Invalid family value(s): "
            + ", ".join(invalid)
            + ". Allowed values: "
            + ", ".join(sorted(allowed))
        )
    return families


def apply_time_scope(terms: Iterable[str], time_scope: str) -> List[str]:
    scope = clean_text(time_scope)
    if not scope:
        return list(terms)
    return dedupe(list(terms) + [scope])


def build_queries(
    entity: str,
    category: str,
    families: List[str],
    aliases: List[str],
    peers: List[str],
    geography: List[str],
    time_scope: str,
    limit: int,
) -> List[str]:
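    # Normalize the hint once so the substring check below compares like with like.
    time_scope = clean_text(time_scope)
    modifiers = apply_time_scope(combined_modifiers(category, families), time_scope)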
    names = dedupe([entity] + aliases)
    places = dedupe(geography)
    queries: List[str] = []

    for name in names:
        for modifier in modifiers:
            for pattern in SOURCE_PATTERNS:
                queries.append(pattern.format(entity=name, modifier=modifier).replace("小红书 小红书", "小红书"))
            for place in places:
                geo_modifier = modifier.replace("小红书", "").strip() or "小红书"
                queries.append(f"{place} {name} 小红书 {geo_modifier}".replace("小红书 小红书", "小红书"))
                queries.append(f"{name} {place} {geo_modifier}".replace("小红书 小红书", "小红书"))

    for name in names:
        for family in families:
            if family == "latest":
                queries.extend([
                    f"{name} 最新",
                    f"{name} 最近",
                    f"{name} 最新消息",
                    f"{name} 官方回应",
                ])
            elif family == "trending":
                queries.extend([
                    f"{name} 热议",
                    f"{name} 爆料",
                    f"{name} 后续",
                    f"{name} 争议",
                ])
            elif family == "comment":
                queries.extend([
                    f"{name} 评论区",
                    f"{name} 评论 怎么说",
                    f"{name} 吐槽",
                    f"{name} 反馈",
                ])
            elif family == "review":
                queries.extend([
                    f"{name} 真实体验",
                    f"{name} 避雷",
                    f"{name} 值不值",
                    f"{name} 怎么样",
                ])
            elif family == "recommendation":
                queries.extend([
                    f"{name} 推荐吗",
                    f"{name} 攻略",
                    f"{name} 对比",
                    f"{name} 值得去吗",
                ])
            elif family == "verification":
                queries.extend([
                    f"{name} 官方回应",
                    f"{name} 声明",
                    f"{name} 注册",
                    f"{name} 资质",
                    f"{name} 处罚",
                ])
            elif family == "image":
                queries.extend([
                    f"{name} 截图",
                    f"{name} 图片",
                    f"{name} 聊天记录",
                    f"{name} 菜单 图",
                ])
            elif family == "video":
                queries.extend([
                    f"{name} 视频",
                    f"{name} 片段",
                    f"{name} 录屏",
                    f"{name} gif",
                ])
            elif family == "subtitle":
                queries.extend([
                    f"{name} 字幕",
                    f"{name} 台词",
                    f"{name} 画面文字",
                    f"{name} 字幕截图",
                ])
            elif family == "audio":
                queries.extend([
                    f"{name} 录音",
                    f"{name} 语音",
                    f"{name} 音频",
                    f"{name} 说了什么",
                ])

    for name in names:
        for peer in peers:
            queries.append(f"{name} vs {peer}")
            queries.append(f"{name} {peer} 对比")
            queries.append(f"{name} 和 {peer} 怎么选")

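    # Append the time hint to any query that does not already contain it.
    if time_scope:
        queries = [f"{query} {time_scope}" if time_scope not in query else query for query in queries]

    # Truncation keeps the first `limit` unique queries in generation order, so the
    # family-specific and peer queries only appear when `limit` is large enough.
    return dedupe(queries)[:limit]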


def to_markdown(payload: Dict[str, object], include_claim_log: bool) -> str:
    lines = [f"# Research starter: {payload['entity']}", ""]
    lines.append(f"- Category: {payload['category']}")
    lines.append(f"- Families: {', '.join(payload['families'])}")
    if payload["aliases"]:
        lines.append(f"- Aliases: {', '.join(payload['aliases'])}")
    if payload["geography"]:
        lines.append(f"- Geography: {', '.join(payload['geography'])}")
    if payload["time_scope"]:
        lines.append(f"- Time scope hint: {payload['time_scope']}")
    if payload["peers"]:
        lines.append(f"- Comparison targets: {', '.join(payload['peers'])}")
    lines.extend(["", "## Recommended queries"])
    for idx, query in enumerate(payload["recommended_queries"], start=1):
        lines.append(f"{idx}. {query}")
    lines.extend([
        "",
        "## Suggested search order",
        "1. Run 3-5 overview/review queries to map the topic.",
        "2. Run 2-4 verification queries before trusting viral claims.",
        "3. If media matters, run image/video/subtitle/audio queries before concluding.",
        "4. Turn the strongest claims into a claim log before writing the final summary.",
        "",
        "## Normalized report template",
        "```json",
        json.dumps(REPORT_TEMPLATE, ensure_ascii=False, indent=2),
        "```",
    ])
    if include_claim_log:
        lines.extend([
            "",
            "## Structured claim log template",
            "```json",
            json.dumps(CLAIM_LOG_TEMPLATE, ensure_ascii=False, indent=2),
            "```",
        ])
    return "\n".join(lines)


def main() -> None:
    parser = argparse.ArgumentParser(description="Build RedNote/Xiaohongshu public-web research queries and normalized evidence templates.")
    parser.add_argument("keyword", help="Entity or topic to research")
    parser.add_argument("--category", default="general", choices=["education", "policy", "gossip", "local", "general"], help="Choose the topic category")
    parser.add_argument("--family", action="append", default=[], help="Query family to emphasize; repeatable or comma-separated")
    parser.add_argument("--alias", action="append", default=[], help="Alias or alternate name; repeatable or comma-separated")
    parser.add_argument("--peer", action="append", default=[], help="Optional comparison target; repeatable or comma-separated")
    parser.add_argument("--geo", action="append", default=[], help="City, district, mall, campus, or market; repeatable or comma-separated")
    parser.add_argument("--time-scope", default="", help="Optional time hint such as '近7天' or '2026' to append to queries")
    parser.add_argument("--limit", type=int, default=18, help="Maximum query count")
    parser.add_argument("--json", action="store_true", help="Emit JSON instead of markdown")
    parser.add_argument("--claim-log", action="store_true", help="Include the structured claim log template in the output")
    args = parser.parse_args()

    entity = clean_text(args.keyword)
    default_families = ["overview", "review", "verification"]
    families = validate_families(dedupe(split_csv(args.family)) or default_families)
    aliases = dedupe(split_csv(args.alias))
    peers = dedupe(split_csv(args.peer))
    geography = dedupe(split_csv(args.geo))

    queries = build_queries(
        entity=entity,
        category=args.category,
        families=families,
        aliases=aliases,
        peers=peers,
        geography=geography,
        time_scope=args.time_scope,
        limit=args.limit,
    )

    payload = {
        "entity": entity,
        "category": args.category,
        "families": families,
        "aliases": aliases,
        "peers": peers,
        "geography": geography,
        "time_scope": clean_text(args.time_scope),
        "recommended_queries": queries,
        "normalized_report_template": REPORT_TEMPLATE,
    }
    if args.claim_log:
        payload["claim_log_template"] = CLAIM_LOG_TEMPLATE

    if args.json:
        print(json.dumps(payload, ensure_ascii=False, indent=2))
        return

    print(to_markdown(payload, include_claim_log=args.claim_log))


if __name__ == "__main__":
    main()

FILE:scripts/recovery_query_builder.py
#!/usr/bin/env python3
import argparse
import json
import re
from typing import Iterable, List

STOPWORDS = {
    "的", "了", "和", "是", "在", "就", "都", "而", "及", "与", "着", "或", "被", "把", "让",
    "我们", "你们", "他们", "这个", "那个", "一个", "一些", "没有", "不是", "真的", "感觉",
    "the", "a", "an", "and", "or", "of", "to", "for", "in", "on", "with", "from", "by", "is",
}


def clean(text: str) -> str:
    return re.sub(r"\s+", " ", text.strip())


def dedupe(items: Iterable[str]) -> List[str]:
    seen = set()
    out = []
    for item in items:
        item = clean(item)
        key = item.casefold()
        if item and key not in seen:
            seen.add(key)
            out.append(item)
    return out


def extract_tokens(text: str) -> List[str]:
    text = clean(text)
    if not text:
        return []
    # Runs of CJK characters, ASCII letters/digits, and a few joiner symbols (# @ _ . : -).
    chunks = re.findall(r"[\u4e00-\u9fffA-Za-z0-9#@_.:-]{2,}", text)
    # Each run already contains only allowed characters, so no further splitting is needed.
    tokens = [chunk for chunk in chunks if chunk.casefold() not in STOPWORDS]
    return dedupe(tokens)


def extract_dates(text: str) -> List[str]:
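    patterns = [
        # Full dates such as 2024-05-03 or 2024年5月3日; the trailing 月 is kept for year-month forms like 2024年5月.
        r"20\d{2}[-/.年]\d{1,2}(?:[-/.月]\d{1,2}[日号]?|月)?",
        r"\d{1,2}月\d{1,2}[日号]?",
        r"近\d+[天日周月年]",
    ]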
    dates = []
    for pattern in patterns:
        dates.extend(match.group(0) for match in re.finditer(pattern, text))
    return dedupe(dates)


def extract_hashtags(text: str) -> List[str]:
    return dedupe(re.findall(r"[#＃][^#＃\s]{2,30}", text))


def quote_candidates(title: str, snippet: str, max_quotes: int) -> List[str]:
    candidates = []
    for source in [title, snippet]:
        source = clean(source)
        if not source:
            continue
        clauses = re.split(r"[。！？!?.；;，,、]\s*", source)
        for clause in clauses:
            clause = clean(clause)
            if 6 <= len(clause) <= 20:
                candidates.append(f'"{clause}"')
    return dedupe(candidates)[:max_quotes]


def build_queries(entity: str, title: str, snippet: str, extracted_text: str, context: List[str], limit: int) -> List[str]:
    # Normalize the entity so the token comparisons below are not thrown off by stray whitespace.
    entity = clean(entity)
    text_pool = " ".join([entity, title, snippet, extracted_text, *context])
    tokens = extract_tokens(text_pool)
    hashtags = extract_hashtags(text_pool)
    quotes = quote_candidates(title, snippet or extracted_text, max_quotes=4)
    dates = extract_dates(text_pool)

    # extract_tokens already drops stopwords and one-character tokens.
    strong_tokens = tokens[:12]
    context_terms = dedupe(context + hashtags + dates)[:8]

    queries = []

    if entity:
        queries.extend([
            f'{entity} 小红书',
            f'site:xiaohongshu.com {entity}',
            f'site:www.xiaohongshu.com {entity}',
        ])

    for quote in quotes:
        queries.append(quote)
        if entity:
            queries.append(f'{entity} {quote}')
        queries.append(f'site:xiaohongshu.com {quote}')

    for token in strong_tokens[:8]:
        if entity and token.casefold() != entity.casefold():
            queries.append(f'{entity} {token}')
            queries.append(f'{entity} 小红书 {token}')
        if not entity or token.casefold() != entity.casefold():
            queries.append(token)

    for term in context_terms:
        if entity:
            queries.append(f'{entity} {term}')
            queries.append(f'{entity} 小红书 {term}')
        for quote in quotes[:2]:
            queries.append(f'{quote} {term}')

    for token in strong_tokens[:6]:
        queries.append(f'site:xiaohongshu.com {token}')
        queries.append(f'site:www.xiaohongshu.com {token}')

    return dedupe(queries)[:limit]


def main() -> None:
    parser = argparse.ArgumentParser(description='Build recovery-oriented search queries from partial RedNote snippets, titles, OCR text, or subtitle fragments.')
    parser.add_argument('--entity', default='', help='Main subject/entity if known')
    parser.add_argument('--title', default='', help='Visible title or page title')
    parser.add_argument('--snippet', default='', help='Search snippet or short visible description')
    parser.add_argument('--text', default='', help='OCR / subtitle / extracted visible text fragment')
    parser.add_argument('--context', action='append', default=[], help='Extra context tokens such as city, product, hashtag, date; repeatable or comma-separated')
    parser.add_argument('--limit', type=int, default=20, help='Maximum number of recovery queries')
    parser.add_argument('--json', action='store_true', help='Emit JSON instead of markdown')
    args = parser.parse_args()

    context = []
    for value in args.context:
        for item in value.split(','):
            item = clean(item)
            if item:
                context.append(item)

    queries = build_queries(args.entity, args.title, args.snippet, args.text, dedupe(context), args.limit)
    payload = {
        'entity': clean(args.entity),
        'title': clean(args.title),
        'snippet': clean(args.snippet),
        'text': clean(args.text),
        'context': dedupe(context),
        'queries': queries,
        'notes': [
            'Use quoted clauses first to recover reposts or mirrors.',
            'Then pivot on distinctive names, prices, dates, hashtags, or subtitle fragments.',
            'Treat recovered pages as original, relay, or commentary separately in the claim log.',
        ],
    }

    if args.json:
        print(json.dumps(payload, ensure_ascii=False, indent=2))
        return

    print('# Recovery query set')
    print()
    if payload['entity']:
        print(f"- Entity: {payload['entity']}")
    if payload['title']:
        print(f"- Title: {payload['title']}")
    if payload['snippet']:
        print(f"- Snippet: {payload['snippet']}")
    if payload['text']:
        print(f"- Extracted text: {payload['text']}")
    if payload['context']:
        print(f"- Context: {', '.join(payload['context'])}")
    print()
    print('## Queries')
    for i, query in enumerate(payload['queries'], start=1):
        print(f'{i}. {query}')
    print()
    print('## Recovery notes')
    for note in payload['notes']:
        print(f'- {note}')


if __name__ == '__main__':
    main()