wangzairong

@clawhub-wangzairong-b68a57dde8

Multi-Skill-Eval | Integrated Skill Evaluation System

---
name: multi-skill-eval
description: |
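  An integrated, multi-method skill evaluation system. It combines static analysis (skill-assessment), rubric-based quality scoring (skill-evaluator), and autonomous benchmarking (skill-eval) to evaluate, compare, audit, or improve OpenClaw skills. Coverage includes documentation completeness, code quality, a 25-criterion rubric, and multi-model benchmarks.

  Trigger words (Chinese): 评估技能、技能评分、技能审计、对比技能、批量评估、技能质量检查、静态分析、基准测试、触发词检测、幽灵工具检测
  Trigger words (English): evaluate skill, compare skills, audit skill, benchmark skill, static analysis, skill quality, skill assessment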

  Use when you need to evaluate, audit, benchmark, or improve OpenClaw skills.
---

# Multi-Skill-Eval v1.0.0
## Integrated Multi-Method Skill Evaluation System

Combines three evaluation approaches into one unified system:
1. **Skill Assessment** — lightweight static analysis (fast, automated)
2. **Skill Evaluator** — 25-criterion rubric scoring (ISO 25010, OpenSSF, Shneiderman)
3. **Skill-Eval** — autonomous benchmark evaluation with skill card generation

---

## 🚀 Quick Start

```bash
# Full evaluation (all three methods)
multi-skill-eval ~/.openclaw/skills/my-skill

# Quick static analysis
multi-skill-eval ~/.openclaw/skills/my-skill --method quick

# Full evaluation with a detailed report
multi-skill-eval ~/.openclaw/skills/my-skill --method full

# Compare two skills
multi-skill-eval --compare skill-a skill-b

# Batch-evaluate all local skills
multi-skill-eval --all

# Run the benchmark with a specific model
multi-skill-eval ~/.openclaw/skills/my-skill --method benchmark --model minimax/MiniMax-M2
```

---

## Three Evaluation Methods

### Method 1: Static Analysis (fast, about 30 seconds)

Lightweight automated checks covering four dimensions:

```bash
python3 scripts/static-analyze.py ~/.openclaw/skills/my-skill
python3 scripts/static-analyze.py ~/.openclaw/skills/my-skill --json    # machine-readable output
```

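**Checks:**
- Documentation completeness (SKILL.md, description quality, examples)
- Code quality and security signals (script syntax, error handling)
- Configuration friendliness (documented environment variables, clear defaults)
- Maintainability signals (versioning, recent updates)

**Output:** a 0-100 score plus a list of issues grouped by severity.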

---

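### Method 2: Rubric Scoring (detailed, about 10 minutes)

25 criteria across 8 categories, combining automated checks with manual review.

**Run the automated structural checks:**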
```bash
python3 scripts/eval-skill.py ~/.openclaw/skills/my-skill --json --verbose
```


**Then score manually with** `references/rubric.md`.

#### The 25 Criteria (8 Categories)

| # | Category | Framework | Criteria |
|---|----------|-----------|----------|
| 1 | Functional Suitability | ISO 25010 | Completeness, Correctness, Appropriateness |
| 2 | Reliability | ISO 25010 | Fault Tolerance, Error Reporting, Recoverability |
| 3 | Performance / Context | ISO 25010 + Agent | Token Cost, Execution Efficiency |
| 4 | Usability — AI Agent | Shneiderman, Gerhardt-Powals | Learnability, Consistency, Feedback, Error Prevention |
| 5 | Usability — Human | Tognazzini, Norman | Discoverability, Forgiveness |
| 6 | Security | ISO 25010 + OpenSSF | Credentials, Input Validation, Data Safety |
| 7 | Maintainability | ISO 25010 | Modularity, Modifiability, Testability |
| 8 | Agent-Specific | Novel | Trigger Precision, Progressive Disclosure, Composability, Idempotency, Escape Hatches |

**Scoring:** Each criterion is scored 0–4, so the 25 criteria give a maximum total of 100.

| Score | Verdict | Action |
|-------|---------|--------|
| 90–100 | Excellent | Publish confidently |
| 80–89 | Good | Publishable, note known issues |
| 70–79 | Acceptable | Fix P0s before publishing |
| 60–69 | Needs Work | Fix P0+P1 before publishing |
| <60 | Not Ready | Significant rework needed |
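
The verdict table above can be expressed as a small lookup. This is a minimal sketch, not part of the packaged scripts: the function name `rubric_verdict` is hypothetical, and it assumes the total is an integer sum of the 25 criterion scores.

```python
from typing import Tuple

# Lower bound of each band, taken from the verdict table (highest first).
BANDS = [
    (90, "Excellent", "Publish confidently"),
    (80, "Good", "Publishable, note known issues"),
    (70, "Acceptable", "Fix P0s before publishing"),
    (60, "Needs Work", "Fix P0+P1 before publishing"),
]


def rubric_verdict(total: int) -> Tuple[str, str]:
    """Map a 0-100 rubric total to its (verdict, action) pair."""
    if not 0 <= total <= 100:
        raise ValueError("total must be in 0..100, got %r" % (total,))
    for floor, verdict, action in BANDS:
        if total >= floor:
            return verdict, action
    return "Not Ready", "Significant rework needed"
```

For example, `rubric_verdict(72)` falls in the 70–79 band and returns the "Acceptable" verdict with its action.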

#### Rubric Score Sheet

Copy `assets/EVAL-TEMPLATE.md` to the skill directory as `EVAL.md`.

**P0 Issues (blocks publishing):**
- Missing SKILL.md or invalid frontmatter
- Hardcoded credentials or secrets
- Phantom tooling (referenced scripts not in package)
- No description or description < 50 chars

**P1 Issues (should fix):**
- No usage examples
- No error handling in scripts
- Missing dependency documentation
- Unclear trigger conditions
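
Some of the P0 checks are mechanical enough to sketch. The example below is illustrative only and is not the logic of `eval-skill.py`: the function name `find_p0_issues` is hypothetical, the description is assumed to be passed in already parsed from the frontmatter, and phantom tooling is detected under the assumption that bundled tools are referenced as `scripts/<name>.py`, `.sh`, or `.js`.

```python
import re
from pathlib import Path
from typing import List

# Assumed convention: bundled tools are referenced as scripts/<name>.<ext>.
SCRIPT_REF = re.compile(r"\bscripts/[\w.-]+\.(?:py|sh|js)\b")


def find_p0_issues(skill_dir: str, description: str) -> List[str]:
    """Return a list of P0 findings for the skill in skill_dir."""
    root = Path(skill_dir)
    skill_md = root / "SKILL.md"
    if not skill_md.is_file():
        return ["Missing SKILL.md"]

    issues = []
    if len(description.strip()) < 50:
        issues.append("No description or description < 50 chars")

    # A referenced script that is not in the package counts as phantom tooling.
    text = skill_md.read_text(encoding="utf-8")
    for ref in sorted(set(SCRIPT_REF.findall(text))):
        if not (root / ref).is_file():
            issues.append("Phantom tooling: " + ref)
    return issues
```

Hardcoded-credential detection is deliberately left out here; it needs pattern lists that this document does not specify.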

---

### Method 3: Autonomous Benchmark (deep, about 30 minutes per skill)

Full multi-phase evaluation with multi-model support. **Requires AI agent execution.**

```bash
# Spawn benchmark via AI agent
multi-skill-eval /path/to/skill --method benchmark --model claude-sonnet-4
```

> ⚠️ **Note**: The benchmark method requires an AI agent to orchestrate subagent execution. The CLI coordinates the workflow but actual execution happens through AI agent sessions.

> 📋 **Planned**: Self-evolution improvement engine (Phase 7+) is planned but not yet implemented.

#### Phase 1: Pre-flight Analysis

1. Read `SKILL.md` — understand claims, dependencies, target use cases
2. Classify skill type:
   - **Capability uplift** — teaches the agent something it can't do well
   - **Encoded preference** — sequences steps according to specific process
3. Dependency check:
   - Required CLI tools, API keys, env vars
   - Mark `dependency-gated` if credentials missing (skip eval, not fault of skill)
   - Check for **phantom tooling** (referenced scripts not in package)
4. Marketing claims check: flag any metrics ("7.8x faster") without evidence
5. Read knowledge base: `knowledge/lessons.md`, `eval-patterns.md`, `failures.md`
6. Check prior evaluations: `knowledge/skill-profiles/<slug>.md`
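
The marketing-claims check in step 4 could start from a simple pattern scan. This is a rough sketch under stated assumptions: the regex and the comparison words it looks for are hypothetical choices, not a list defined by this skill, and a match only means "look for evidence here", not "the claim is false".

```python
import re
from typing import List

# Hypothetical pattern: a number with "x", "×", or "%" followed by a comparison word,
# e.g. "7.8x faster" or "40% fewer".
CLAIM = re.compile(
    r"\b\d+(?:\.\d+)?\s*(?:x|×|%)\s+(?:faster|slower|cheaper|better|fewer|less|more)\b",
    re.IGNORECASE,
)


def find_metric_claims(skill_md_text: str) -> List[str]:
    """Return every metric-style claim found in the SKILL.md text."""
    return [match.group(0) for match in CLAIM.finditer(skill_md_text)]
```

Each hit should then be checked manually against whatever evidence the skill package actually provides.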

#### Phase 2: Test Case Design

Design 2-3 test prompts across four categories:
- **Outcome** — Did the task complete correctly?
- **Process** — Did the agent follow the skill's intended steps?
- **Style** — Does output follow skill-claimed conventions?
- **Efficiency** — Reasonable time/token usage?

**Assertion design (two layers):**

*Layer 1: Deterministic checks* (fast, reproducible)
- File existence, word counts, keyword presence
- Format compliance (valid JSON, SQL, markdown)
- Programmatic verification (run tests, check syntax)

*Layer 2: Rubric-based quality assessment* (LLM-as-judge)
- Judge model (NOT execution model) grades output against specific rubric
- Structured scoring, not pass/fail

**Key assertion patterns:**
- Banned-word checks for style-constrained skills (highly discriminating)
- Methodology/structure assertions for technical domains (baseline already strong on correctness)
- Output-floor assertions: required sections must appear even in error/fallback paths
- Bilingual keyword variants for Chinese-language skills (索引/index, 前导通配符/leading wildcard)
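
Layer 1 checks are plain functions over files and strings. The helpers below are a minimal sketch with hypothetical names, covering file existence, a word-count floor, and JSON validity. Note the assumption in `check_min_words`: whitespace splitting undercounts Chinese text, so a character-based floor would be needed for Chinese-language outputs.

```python
import json
from pathlib import Path


def check_file_exists(path: str) -> bool:
    """True if the expected output file was produced."""
    return Path(path).is_file()


def check_min_words(text: str, minimum: int) -> bool:
    """True if the text has at least `minimum` whitespace-separated words."""
    return len(text.split()) >= minimum


def check_valid_json(text: str) -> bool:
    """True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
```

Because these checks are deterministic, they can run on both the with-skill and without-skill outputs before any judge model is involved.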

#### Phase 3: Execution

For each test case, spawn two subagents:

**With-skill:**
```
[Model: <execution_model>]
Read the skill at <skill-path>/SKILL.md and follow its instructions.
Task: <prompt>
Save outputs to: <workspace>/iteration-<N>/<test-name>/with_skill/outputs/
```

**Without-skill (baseline):**
```
[Model: <execution_model>]
Complete this task using only built-in capabilities. Do NOT read SKILL.md.
Task: <prompt>
Save outputs to: <workspace>/iteration-<N>/<test-name>/without_skill/outputs/
```

**Multi-model mode:** Run same skill across multiple models to check cross-model consistency.

#### Phase 4: Grading

Programmatic grading for deterministic checks. LLM-based grading for qualitative:

```bash
python3 scripts/grade-assertions.py --workspace /path/to/results
```

Save to `grading.json`:
```json
{
  "expectations": [
    {"text": "assertion text", "passed": true, "evidence": "..."}
  ],
  "summary": {"passed": N, "failed": N, "total": N, "pass_rate": 0.X}
}
```

#### Phase 5: Benchmark Aggregation

```json
{
  "with_skill": {"pass_rate": 0.X, "avg_time": "Ns", "avg_tokens": N},
  "without_skill": {"pass_rate": 0.X, "avg_time": "Ns", "avg_tokens": N},
  "delta": {"pass_rate": "+0.XX", "time": "+Xx"},
  "model_used": "claude-sonnet-4",
  "verdict": "Recommended"
}
```

**Efficiency flags:** Flag skills where quality delta ≈ 0 but cost delta >2x ("high-overhead framework inflation").
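
The efficiency flag can be computed from the two aggregated runs. This is a sketch under assumptions: "quality delta ≈ 0" is interpreted as an absolute pass-rate difference of at most 0.05 (the `quality_eps` default is a hypothetical choice), and cost is measured by average tokens only.

```python
from typing import Dict


def aggregate(with_skill: Dict, without_skill: Dict,
              quality_eps: float = 0.05, cost_ratio_limit: float = 2.0) -> Dict:
    """Compare with-skill and baseline runs and flag high-overhead inflation."""
    if without_skill["avg_tokens"] <= 0:
        raise ValueError("baseline avg_tokens must be positive")
    quality_delta = with_skill["pass_rate"] - without_skill["pass_rate"]
    token_ratio = with_skill["avg_tokens"] / without_skill["avg_tokens"]
    return {
        "delta": {
            "pass_rate": round(quality_delta, 2),
            "token_ratio": round(token_ratio, 2),
        },
        # Flag: no meaningful quality gain, but more than 2x the cost.
        "high_overhead": abs(quality_delta) <= quality_eps
        and token_ratio > cost_ratio_limit,
    }
```

A skill that matches the baseline pass rate while using 2.5 times the tokens would be flagged; a skill that doubles the pass rate at the same cost ratio would not.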

#### Phase 6: Skill Card Generation

```bash
python3 scripts/generate_skill_card.py \
  --workspace /path/to/results \
  --skill-name "My Skill" \
  --skill-slug my-skill \
  --eval-model claude-sonnet-4 \
  --output skill-cards/my-skill-v1.md
```

**Skill Card Contents:**
- Metadata: name, source, eval date, model, engine version
- Overall score 0-10 (Quality 0-5 + Delta 0-3 + Efficiency 0-2)
- With-skill vs without-skill comparison table
- Per-test-case breakdown with assertions, timing, grading
- Strengths / Weaknesses
- Recommendation: Recommended / Conditional / Marginal / Not Recommended

#### Phase 7: Leaderboard Update

```bash
python3 scripts/generate_leaderboard.py --cards-dir skill-cards --output leaderboard/index.html
```

---

## Self-Evolution Improvement Engine

> ⚠️ **Planned — Not Yet Implemented**
>
> The self-evolution improvement engine is designed but not yet implemented. The knowledge base (`knowledge/improve/`) contains proven patterns and lessons that inform manual skill improvement, but automatic skill rewriting is not available.

### Planned Improvement Process (Phase 7-12)

1. **Read knowledge base:**
   - `knowledge/improve/lessons.md` — proven strategies
   - `knowledge/improve/patterns.md` — category-specific playbooks
   - `knowledge/improve/failures.md` — what NOT to try

2. **Diagnose root cause:**
   - Skill too vague? (Doesn't specify enough to change model behavior)
   - Skill redundant? (Teaches things model already knows)
   - Skill too heavy? (Adds overhead without proportional quality gain)
   - Missing structure? (No clear output format)
   - Phantom tooling? (References tools that don't exist)
   - Reference manual anti-pattern? (>200 lines of educational content)
   - Library-as-skill anti-pattern? (Contains code instead of instructions)

3. **Select improvement strategy** from patterns:
   - Reference Manual Slim-Down: Delete 70%+ redundant content, add MUST/ALWAYS/NEVER mandates
   - Library-to-Instructions: Convert code to behavioral instructions
   - Phantom Tooling Replacement: Replace missing tool references with inline instructions
   - Overhead Routing: Add quick-mode vs full-framework routing
   - Assertion-Aligned Rewrite: Rewrite to pass specific failed assertions

4. **Rewrite SKILL.md** with selected strategy:
   - Default: Remove > Add (delete 60-80% first, then add behavioral mandates)
   - Add specific, enforceable conventions
   - Remove redundant content model already handles
   - Save as `SKILL-improved.md`

5. **Update assertions** to match improved skill

6. **Re-evaluate** with improved version

### Planned Re-Eval (Phase 10-11)

Run same eval against `SKILL-improved.md`:
- Score improved by >= 1.5 points → Success
- Less than 50% of previously-failed assertions fixed → Document limitation, move on

### Planned Improvement Knowledge Update (Phase 12)

After each improvement batch:
- Update `knowledge/improve/lessons.md` with what worked
- Update `knowledge/improve/patterns.md` with reusable patterns
- Update `knowledge/improve/failures.md` with failed attempts
- Fold proven patterns back into this SKILL.md

---

## Scoring Summary

| Method | Speed | Coverage | Best For |
|--------|-------|---------|----------|
| Static Analysis | ~30s | 4 dimensions | Quick comparison, batch scan |
| Rubric Scoring | ~10min | 25 criteria | Pre-publish audit, detailed report |
| Benchmark Eval | ~30min | Full + self-evolution (planned) | Production evaluation, skill improvement |

| Overall Score | Verdict |
|--------------|---------|
| 7-10 | Recommended |
| 5-6.9 | Conditional |
| 3-4.9 | Marginal |
| 0-2.9 | Not Recommended |
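
The skill card's overall score (Quality 0-5 + Delta 0-3 + Efficiency 0-2) and the verdict table above combine as follows. This is a sketch with hypothetical function names; it assumes the three components are simply added, and it treats the band boundaries as lower bounds, so a value such as 6.95 counts as Conditional.

```python
def overall_score(quality: float, delta: float, efficiency: float) -> float:
    """Sum the three skill-card components after range-checking each one."""
    limits = (("quality", quality, 5), ("delta", delta, 3), ("efficiency", efficiency, 2))
    for name, value, cap in limits:
        if not 0 <= value <= cap:
            raise ValueError("%s must be in 0..%d, got %r" % (name, cap, value))
    return quality + delta + efficiency


def recommendation(score: float) -> str:
    """Map a 0-10 overall score to the verdict from the table above."""
    if score >= 7:
        return "Recommended"
    if score >= 5:
        return "Conditional"
    if score >= 3:
        return "Marginal"
    return "Not Recommended"
```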

---

## Anti-Patterns to Detect

- **Reference manual anti-pattern**: SKILL.md >200 lines of educational content (not behavioral instructions)
- **Library-as-skill anti-pattern**: SKILL.md contains Python/JS class definitions instead of instructions
- **Phantom tooling**: SKILL.md references scripts/binaries not in the package
- **Phantom tooling framework skills**: Evaluate template/output structure separately from real data execution
- **Unsubstantiated claims**: Skill claims specific metrics without evidence — do not use self-reported numbers
- **High-overhead framework inflation**: Quality delta ≈ 0 but cost delta >2x — penalize efficiency

---

## Deeper Security Scanning

For thorough security audits, complement with [SkillLens](https://www.npmjs.com/package/skilllens):
```bash
npx skilllens scan /path/to/skill
```
Checks: exfiltration, code execution, persistence, privilege bypass, prompt injection.

---

## Dependencies

- Python 3.6+ (for eval-skill.py, static-analyze.py, grade-assertions.py)
- PyYAML (`pip install pyyaml`) — frontmatter parsing
- Node.js (for SkillLens security scanning)
FILE:CLAWHUB_PUBLISH.md
# ClawHub Publish Metadata

## multi-skill-eval v1.0.1

**Slug**: multi-skill-eval
**Name**: Multi-Skill-Eval
**Version**: 1.0.1

---

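## Changelog

### v1.0.1 (current version)
- Added Chinese trigger words and Chinese usage scenarios
- Fixed an issue in grade-assertions.py that was the same as the one in eval-skill.py
- Added the previously missing generate_skill_card.py and generate_leaderboard.py
- Removed the dangerous-function detection in static-analyze.py, which caused self-contradictory results
- Documented that the benchmark requires AI agent execution and that self-evolution is a planned feature

### v1.0.0
- Initial release
- Integrated three evaluation methods: static analysis, rubric scoring, and benchmarking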

---

## Publish Command

```bash
clawhub publish /Users/metaclaw/.openclaw/skills/multi-skill-eval \
  --slug multi-skill-eval \
  --name "Multi-Skill-Eval" \
  --version 1.0.1 \
  --changelog "v1.0.1: 添加中文触发词和中文使用场景；修复grade-assertions.py与eval-skill.py相同问题；补全缺失的generate_skill_card.py和generate_leaderboard.py；移除static-analyze.py中导致自我矛盾的dangerous函数检测；标注benchmark需要AI agent执行，self-evolution为计划中功能"
```

---

## ClawHub Links

- Skill page: https://clawhub.com/multi-skill-eval
- Publish ID: k971zvy4szs4a4yqdgssgsqrmx85h3qa

---

## Tags

- evaluation
- skill-assessment
- rubric
- benchmark
- multi-model
- openclaw
- skill-quality
- auditing
- Chinese-supported

FILE:assets/EVAL-TEMPLATE.md
# [SKILL_NAME] Evaluation

**Date:** YYYY-MM-DD
**Evaluator:** [agent/human]
**Skill version:** [version or commit]
**Automated score:** [run eval-skill.py and paste summary]

---

## Automated Checks

```
[paste eval-skill.py output here]
```

## Manual Assessment

| # | Criterion | Score | Notes |
|---|-----------|-------|-------|
| 1.1 | Completeness | /4 | |
| 1.2 | Correctness | /4 | |
| 1.3 | Appropriateness | /4 | |
| 2.1 | Fault Tolerance | /4 | |
| 2.2 | Error Reporting | /4 | |
| 2.3 | Recoverability | /4 | |
| 3.1 | Token Cost | /4 | |
| 3.2 | Execution Efficiency | /4 | |
| 4.1 | Learnability | /4 | |
| 4.2 | Consistency | /4 | |
| 4.3 | Feedback Quality | /4 | |
| 4.4 | Error Prevention | /4 | |
| 5.1 | Discoverability | /4 | |
| 5.2 | Forgiveness | /4 | |
| 6.1 | Credential Handling | /4 | |
| 6.2 | Input Validation | /4 | |
| 6.3 | Data Safety | /4 | |
| 7.1 | Modularity | /4 | |
| 7.2 | Modifiability | /4 | |
| 7.3 | Testability | /4 | |
| 8.1 | Trigger Precision | /4 | |
| 8.2 | Progressive Disclosure | /4 | |
| 8.3 | Composability | /4 | |
| 8.4 | Idempotency | /4 | |
| 8.5 | Escape Hatches | /4 | |
| | **TOTAL** | **/100** | |

## Priority Fixes

### P0 — Fix Before Publishing
1.

### P1 — Should Fix
1.

### P2 — Nice to Have
1.

## Revision History
| Date | Score | Notes |
|------|-------|-------|
| | /100 | Baseline |

FILE:knowledge/eval-patterns.md
# Reusable Evaluation Patterns

## Assertion Templates

### Banned-Word Check (Style-Constrained)
```python
{"type": "keyword_absent", "keyword": "filler_word"}
```
- Highly discriminating for writing skills
- Base model uses common filler words freely
- A skill with a banned-word list can produce a 100% delta on this assertion

### Output-Floor Assertion (Required Sections)
```python
{"type": "required_section", "section": "source", "example": "Sources: [...]"}
```
- Skills with required output sections must assert them even in error/fallback paths
- Template compliance drift under data outages is a known failure mode

### Bilingual Keyword Variant
```python
{"type": "either_keyword", "keywords": ["索引", "index"]}
```
- Reduces false negatives when model responds in Chinese

### Methodology Adherence (Technical Skills)
```python
{"type": "methodology_check", "requires": ["EXPLAIN", "ANALYZE"]}
```
- For technical domains where baseline is already strong on correctness
- Target process, not just output correctness
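
The four assertion templates above can be checked by one small dispatcher. This is a sketch, not the behavior of `grade-assertions.py`: the `evaluate` function is hypothetical, and it assumes case-insensitive substring matching, which means a banned word such as "very" would also match inside "every". A word-boundary match may be preferable for English banned-word lists.

```python
from typing import Dict


def evaluate(assertion: Dict, output: str) -> bool:
    """Return True if the model output satisfies the assertion."""
    kind = assertion["type"]
    text = output.lower()
    if kind == "keyword_absent":
        return assertion["keyword"].lower() not in text
    if kind == "required_section":
        return assertion["section"].lower() in text
    if kind == "either_keyword":
        # Passes if any language variant of the keyword appears.
        return any(word.lower() in text for word in assertion["keywords"])
    if kind == "methodology_check":
        # Passes only if every required step is mentioned.
        return all(step.lower() in text for step in assertion["requires"])
    raise ValueError("unknown assertion type: %s" % kind)
```

With this sketch, the bilingual template passes on a Chinese answer that mentions 索引 even when the English word "index" never appears.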

## Category-Specific Patterns

### Writing/Style Skills
- Always add banned-word assertions for skill-defined forbidden vocabulary
- Assert output structure (sections, headers, format)

### Technical Analysis (SQL, Debugging)
- Prefer methodology/structure assertions over correctness
- Baseline models are already strong on correctness
- Focus on systematic process adherence

### CLI Wrapper Skills
- Assert: tool is actually invoked
- Assert: output meaningfully differs from baseline
- Assert: graceful handling when dependency missing

### Reference Manual Skills
- Detect: >200 lines of educational content
- Rewrite: delete 70%+, add MUST/ALWAYS/NEVER behavioral mandates
- Add quick-mode routing for simple use cases
FILE:knowledge/failures.md
# Failure Mode Catalog

## Evaluation Failures

### Non-Discriminating Assertions
- **Symptom**: Assertion passes with AND without skill
- **Root cause**: Assertion tests generic capability baseline already has
- **Fix**: Replace with skill-specific behavior assertion

### Phantom Tooling
- **Symptom**: SKILL.md references script X but file doesn't exist
- **Root cause**: Skill author documented aspirational tooling
- **Fix**: Separate framework/template evaluation from operational readiness

### False Negatives (Multilingual)
- **Symptom**: A Chinese-language skill fails assertions written with English keywords
- **Root cause**: Model responds in Chinese, keyword checks miss it
- **Fix**: Add bilingual keyword variants

### Self-Grading Bias
- **Symptom**: Same model grades its own output
- **Root cause**: Execution model = judge model
- **Fix**: Use separate judge model from different provider

## Skill Quality Failures

### Reference Manual Anti-Pattern
- **Symptom**: SKILL.md is 200+ lines of educational content
- **Root cause**: Author wrote a textbook, not a skill
- **Fix**: Delete 70%+, add behavioral mandates (MUST/ALWAYS/NEVER)

### Library-as-Skill Anti-Pattern
- **Symptom**: SKILL.md contains Python/JS class definitions
- **Root cause**: Author wrote code instead of instructions
- **Fix**: Convert to behavioral instructions describing what the agent should do

### High-Overhead Framework Inflation
- **Symptom**: Quality delta ≈ 0 but time/token cost >2x baseline
- **Root cause**: Framework adds overhead without proportional benefit
- **Fix**: Add quick-mode routing or reject as not publishable
FILE:knowledge/improve/failures.md
# Failed Improvement Attempts

## What Doesn't Work

### Incremental content addition
- Adding more explanation to a skill that already has structural problems doesn't help
- Root cause: wrong paradigm, not insufficient detail

### Rewriting without diagnosis
- Blindly rewriting SKILL.md without first identifying root cause of failures wastes cycles
- Always diagnose before rewriting

### Trying to "improve" content that should be deleted
- Skills with reference-manual anti-pattern: adding more content (even "better" content) doesn't fix the problem
- The issue is verbosity and lack of behavioral directives, not content quality

### Over-engineering simple skills
- Simple CLI wrapper skills do not benefit from complex multi-step workflows
- Keep simple skills simple

## Known Failure Modes

1. **Same model grades own output** → confirmation bias, inflated scores
2. **>3 improvement cycles** → diminishing returns, need to document and move on
3. **Complex assertions on simple skills** → overhead without quality gain
4. **Ignoring phantom tooling** → operational failures in production
5. **Unsubstantiated claims as evidence** → circular self-referencing validation
FILE:knowledge/improve/lessons.md
# Improvement Lessons (What Actually Works)

## Verified Improvements

### Deletion-first rewriting
- Skills that simply deleted 60-80% of content and added strong directives consistently improved more than those that tried to "improve" existing content
- The model has too much context already — less is more for behavioral control

### MUST/ALWAYS/NEVER directives work
- Vague guidance ("consider doing X") has near-zero behavioral effect
- Strong directives ("MUST do X before Y") significantly change agent behavior
- Best results: 1 directive per major step, max 10 directives total

### Quick-mode routing reduces overhead
- Adding a simple "if [simple case], use this fast path" reduced time cost by 40-60% for simple tasks
- No quality degradation observed for the targeted simple cases

### Bilingual assertions reduce false negatives
- Adding Chinese keyword variants alongside English reduced false negatives by ~30% for Chinese-language skills
- Cost: assertion complexity increases, but worth it for non-English skills

## Partial Success Patterns

### Framework-to-instructions conversion
- Partially effective — some code patterns converted well, others lost nuance
- Best for: CLI wrappers, data transformation helpers
- Poor for: complex stateful logic

### Phantom tooling replacement
- Works when the missing tool is simple (single command)
- Doesn't work when the tool encapsulates complex logic (rewrite just shifts complexity elsewhere)

## Failed Experiments

### Adding more examples to vague skills
- Adding examples to a skill that had structural problems (wrong paradigm) did not improve scores
- Fix the structure first, then add examples

### Multi-round iterative improvement
- Attempting >3 improvement cycles on the same skill rarely yields additional gains
- After 2 cycles with <0.5 point improvement, document limitations and move on

### Using the same model for improvement + eval
- Self-improvement introduces confirmation bias
- Best practice: use different model for improvement vs. grading
FILE:knowledge/improve/patterns.md
# Improvement Strategies (Proven Playbooks)

## Strategy 1: Reference Manual Slim-Down
**Problem**: >200 lines of educational content, agent skips most of it
**Solution**: Delete 70%+ redundant content, replace with behavioral mandates
**When to use**: SKILL.md reads like a textbook, not a skill

**Steps**:
1. Identify which paragraphs describe WHAT the agent should do vs WHY
2. Delete all "why" and "background" content (keep <30% of original)
3. Add MUST/ALWAYS/NEVER directives
4. Add quick-mode routing for simple cases

**Example transformation**:
Before: "The agent should first analyze the database schema, considering the relationships between tables, then carefully examine the indexes, taking into account the query patterns..."
After: "MUST analyze schema before querying. ALWAYS check index coverage for WHERE clauses."

## Strategy 2: Library-to-Instructions
**Problem**: SKILL.md contains Python/JS class definitions instead of behavioral instructions
**Solution**: Convert code to step-by-step behavioral guidance
**When to use**: Skill is a "library" with classes and methods

**Steps**:
1. Identify each function/class purpose
2. Rewrite as: "When [trigger], do [step 1] → [step 2] → [step 3]"
3. Remove implementation details
4. Keep only the behavioral contract

## Strategy 3: Phantom Tooling Replacement
**Problem**: SKILL.md references scripts/binaries that don't exist in the package
**Solution**: Replace tool references with equivalent inline instructions
**When to use**: Phantom tooling detected during eval

**Steps**:
1. List all referenced but missing tools
2. For each: write equivalent step-by-step instructions
3. Replace tool call with inline instructions
4. Add graceful fallback for missing dependencies

## Strategy 4: Overhead Routing
**Problem**: High time/token cost with marginal quality gain
**Solution**: Add fast-path routing for simple cases
**When to use**: Quality delta ≈ 0 but cost delta >2x

**Steps**:
1. Identify which use cases are simple (can be handled by baseline)
2. Add conditional: "If [simple case], skip full workflow"
3. Add lightweight alternative for simple cases
4. Keep full workflow only for complex cases

## Strategy 5: Assertion-Aligned Rewrite
**Problem**: Skill fails specific assertions that represent real regressions
**Solution**: Rewrite specifically to pass the failed assertions
**When to use**: Score < 7 due to specific, fixable failures

**Steps**:
1. List all failed assertions with their root cause
2. For each failure: identify which part of SKILL.md causes it
3. Rewrite that section to satisfy the assertion
4. Re-run assertions — if still failing, dig deeper into root cause

## Anti-Pattern Recovery

| Anti-Pattern | Detection | Fix |
|--------------|-----------|-----|
| Reference Manual | Body >200 lines, no behavioral verbs | Delete 70%+, add MUST/ALWAYS |
| Library-as-Skill | Code in SKILL.md | Convert to instructions |
| Phantom Tooling | Missing scripts | Replace with inline instructions |
| Framework Inflation | Cost >2x, delta ≈ 0 | Add quick-mode routing |
| Unsubstantiated Claims | "7.8x faster" without evidence | Remove or add evidence |
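
Two rows of the detection column lend themselves to a quick heuristic. The sketch below is an assumption-laden illustration with hypothetical names: "no behavioral verbs" is read as "no MUST/ALWAYS/NEVER anywhere in the file", only non-blank lines are counted, and a line that starts with `class <Name>` is taken as a sign of code in SKILL.md, which can misfire on prose that happens to begin with the word "class".

```python
import re
from typing import List

MANDATE = re.compile(r"\b(?:MUST|ALWAYS|NEVER)\b")
CLASS_DEF = re.compile(r"^\s*(?:export\s+)?class\s+\w+", re.MULTILINE)


def detect_anti_patterns(skill_md_text: str, max_lines: int = 200) -> List[str]:
    """Return the names of anti-patterns suggested by simple text heuristics."""
    found = []
    body_lines = [line for line in skill_md_text.splitlines() if line.strip()]
    # Long document with no behavioral mandates reads like a reference manual.
    if len(body_lines) > max_lines and not MANDATE.search(skill_md_text):
        found.append("Reference Manual")
    # Python or JS class definitions suggest a library pasted into the skill.
    if CLASS_DEF.search(skill_md_text):
        found.append("Library-as-Skill")
    return found
```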
FILE:knowledge/lessons.md
# Evaluation Knowledge Base

## Lessons Learned (Accumulated Eval Wisdom)

### General
- Skills that rely on external credentials (dependency-gated) should be marked as such and evaluated separately from quality signals
- Phantom tooling (missing scripts) should be flagged separately from framework/template value
- Unsubstantiated marketing claims should be noted but not used in scoring

### Assertion Design
- Assertions that pass in both with-skill and without-skill are non-discriminating — always include skill-specific assertions
- Banned-word checks are highly discriminating for style-constrained skills (100% delta observed)
- For technical correctness tasks, baseline models are already strong — focus on methodology and structure, not correctness
- Bilingual keyword variants (Chinese + English) reduce false negatives for multilingual skills

### Category Patterns
- **Capability uplift skills**: target structural elements the model CAN produce but doesn't by default (excellent discriminators)
- **Framework-heavy skills**: can justify 50-90% time overhead IF they consistently improve actionability and formatting
- **CLI wrapper skills**: assert tool invocation, meaningful output delta, graceful dependency handling
- **Reference manual anti-pattern**: >200 lines of educational content — should be behavioral mandates instead
- **Library-as-skill anti-pattern**: code in SKILL.md instead of instructions — convert to behavioral guidance

### Efficiency
- Quality delta ≈ 0 but cost delta >2x should heavily penalize efficiency score
- Quick-mode routing reduces overhead for simple use cases
FILE:references/rubric.md
# Skill Evaluation Rubric

Full multi-framework rubric with concrete criteria per score level.
Automated checks (via `scripts/eval-skill.py`) cover structure, trigger, docs, scripts, and security.
This rubric adds the **manual assessment** dimensions that require judgment.

## Scoring: 0–4 per criterion

| Score | Meaning | Guideline |
|-------|---------|-----------|
| 0 | Fail | Missing or broken — blocks publishing |
| 1 | Poor | Present but inadequate — significant rework needed |
| 2 | Acceptable | Works but has notable gaps — should improve |
| 3 | Good | Solid with minor issues — publishable |
| 4 | Excellent | Best-in-class — no meaningful improvements |

---

## 1. Functional Suitability (ISO 25010)

### 1.1 Completeness
*Does it cover the domain adequately?*
- **4:** Covers all common operations + edge cases. A user would rarely need to go outside the skill.
- **3:** Covers core operations. Missing some advanced/niche features.
- **2:** Covers basic operations. Missing several expected features.
- **1:** Only covers a fraction of the domain. Major gaps.
- **0:** Stub or non-functional.

### 1.2 Correctness
*Do operations produce correct results?*
- **4:** All operations tested and verified. Edge cases handled.
- **3:** Core operations correct. Minor bugs in edge cases.
- **2:** Mostly correct but some operations produce wrong results.
- **1:** Several operations are broken or produce incorrect output.
- **0:** Fundamentally broken.

### 1.3 Appropriateness
*Is the technical approach suitable?*
- **4:** Zero external deps, portable, follows platform conventions perfectly.
- **3:** Minimal deps, good approach, minor awkwardness.
- **2:** Unnecessary deps or suboptimal approach for some operations.
- **1:** Wrong tool for the job or heavy unnecessary dependencies.
- **0:** Fundamentally misdesigned.

---

## 2. Reliability (ISO 25010)

### 2.1 Fault Tolerance
*How does it handle failures?*
- **4:** Retries transient errors, graceful fallbacks, partial operations succeed even if some fail.
- **3:** Catches and reports errors clearly. Some retry or fallback logic.
- **2:** Catches errors but dies on first failure. No retry.
- **1:** Unhandled exceptions for common error cases.
- **0:** Crashes with tracebacks on normal error conditions.

### 2.2 Error Reporting
*Are errors actionable?*
- **4:** Structured errors (JSON in --json mode), consistent stderr, actionable messages with fix suggestions.
- **3:** Clear error messages to stderr. User knows what went wrong.
- **2:** Error messages exist but are vague or inconsistent (mixed stdout/stderr).
- **1:** Raw tracebacks or cryptic messages.
- **0:** Silent failures.

### 2.3 Recoverability
*Can interrupted operations resume?*
- **4:** Checkpoint/resume for batch operations. Idempotent by default.
- **3:** No checkpoint but operations are idempotent (safe to re-run).
- **2:** Re-running is safe but redundant work is done.
- **1:** Re-running may cause duplicates or errors.
- **0:** Interrupted operations leave corrupted state.

---

## 3. Performance / Context Efficiency

### 3.1 SKILL.md Token Cost
*Is the SKILL.md appropriately sized for its value?*
- **4:** <150 lines. Progressive disclosure via references. Every line earns its tokens.
- **3:** 150-250 lines. Some content could move to references.
- **2:** 250-400 lines. Verbose explanations an AI doesn't need.
- **1:** 400+ lines. Wastes context window. Should be restructured.
- **0:** So large it crowds out other context.

### 3.2 Execution Efficiency
*Does the script avoid unnecessary work?*
- **4:** Paginated, filtered, incremental where possible. Rate-limited politely.
- **3:** Handles pagination. Some unnecessary full-scans.
- **2:** Works but does redundant API calls or full-library scans.
- **1:** Extremely slow or wasteful for normal operations.
- **0:** Unusably slow.

---

## 4. Usability — AI Agent (Shneiderman, Gerhardt-Powals)

### 4.1 Cold-Start Learnability
*Can a fresh AI instance use this skill correctly on first try?*
- **4:** SKILL.md + examples are sufficient. Troubleshooting reference covers errors. Agent never needs to read source.
- **3:** SKILL.md covers core usage. Some edge cases require experimentation.
- **2:** Agent needs to read source code to understand some operations.
- **1:** SKILL.md is insufficient. Agent will struggle without source access.
- **0:** No documentation. Agent must reverse-engineer everything.

### 4.2 Consistency
*Are patterns uniform across commands/operations?*
- **4:** Same flag names, same output format, same error handling everywhere.
- **3:** Mostly consistent. One or two exceptions.
- **2:** Inconsistent flags, mixed output formats, or uneven error handling.
- **1:** Every command behaves differently.
- **0:** No discernible pattern.

### 4.3 Feedback Quality
*Does the skill communicate progress and results clearly?*
- **4:** Progress indicators, summary reports, emoji-scannable output, JSON mode.
- **3:** Clear success/failure messages. Batch progress shown.
- **2:** Basic output. Hard to tell success from failure in batch operations.
- **1:** Minimal output. Agent has to infer outcomes.
- **0:** Silent or cryptic.

### 4.4 Error Prevention
*Does the skill prevent mistakes before they happen?*
- **4:** Input validation, safe defaults (recoverable > permanent), dry-run for destructive ops, duplicate detection.
- **3:** Some validation and safe defaults. Destructive ops have confirmation.
- **2:** Minimal validation. Unsafe defaults for some operations.
- **1:** No validation. Easy to accidentally destroy data.
- **0:** Actively dangerous defaults.
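
One way to read "dry-run for destructive ops" is that the destructive path must be requested explicitly. The following is a minimal sketch under that assumption; `delete_files` and its `apply` parameter are illustrative names, not an existing API.

```python
import os
import tempfile


def delete_files(paths, apply=False):
    """Return a list of actions; only touch the disk when apply=True."""
    actions = []
    for p in paths:
        if not os.path.isfile(p):
            actions.append(f"skip (not a file): {p}")
            continue
        if apply:
            os.remove(p)
            actions.append(f"deleted: {p}")
        else:
            actions.append(f"would delete: {p}")
    return actions


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "old.txt")
    open(target, "w").close()
    preview = delete_files([target])              # dry-run is the default
    still_there = os.path.exists(target)
    applied = delete_files([target], apply=True)  # explicit opt-in
    gone = not os.path.exists(target)
```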

---

## 5. Usability — Human End User (Tognazzini, Norman)

### 5.1 Discoverability
*Can a human figure out what's available?*
- **4:** --help on all commands with examples. SKILL.md documents everything.
- **3:** --help works. SKILL.md covers core operations.
- **2:** Basic --help. Some features undocumented.
- **1:** Minimal help text. Must read source.
- **0:** No help or documentation.

### 5.2 Forgiveness / Undo
*Can mistakes be reversed?*
- **4:** Destructive ops default to recoverable (trash). Undo available. Confirmation prompts.
- **3:** Confirmation on destructive ops. Some recoverability.
- **2:** Confirmation exists but permanent is the default.
- **1:** No confirmation on destructive operations.
- **0:** Destructive operations with no warning.
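
A recoverable default can be as simple as moving files into a trash directory instead of deleting them. The sketch below assumes a local `.trash` folder; `move_to_trash` and `restore` are hypothetical helper names.

```python
import os
import shutil
import tempfile


def move_to_trash(path, trash_dir):
    """Move path into trash_dir and return the new location."""
    os.makedirs(trash_dir, exist_ok=True)
    base = os.path.basename(path)
    dest = os.path.join(trash_dir, base)
    n = 1
    while os.path.exists(dest):  # never overwrite an earlier trashed file
        dest = os.path.join(trash_dir, f"{base}.{n}")
        n += 1
    shutil.move(path, dest)
    return dest


def restore(trashed_path, original_path):
    """Undo: move a trashed file back to where it came from."""
    shutil.move(trashed_path, original_path)


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "notes.txt")
    with open(original, "w", encoding="utf-8") as f:
        f.write("keep me")
    trashed = move_to_trash(original, os.path.join(tmp, ".trash"))
    removed = not os.path.exists(original)
    restore(trashed, original)
    with open(original, encoding="utf-8") as f:
        restored_text = f.read()
```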

---

## 6. Security (ISO 25010 + OSS Hygiene)

### 6.1 Credential Handling
*How are secrets managed?*
- **4:** All secrets via env vars. No personal data in source. Documented.
- **3:** Secrets via env vars. Minor issues (undocumented optional var).
- **2:** Mostly env vars but some hardcoded values.
- **1:** Credentials in source code.
- **0:** Credentials in source AND committed to git.
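
For the top score, a skill typically reads every secret from the environment and fails with an actionable message when one is missing. A minimal sketch follows; `MY_SKILL_API_KEY` is a made-up variable name, and the `env` parameter exists only so the example can run without touching the real environment.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required environment variable is not set."""


def require_env(name, env=None):
    """Return the value of env var `name`, or raise with a helpful message."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise MissingSecretError(
            f"{name} is not set. Export it before running, e.g. export {name}=..."
        )
    return value


# Hypothetical variable name, injected through `env` for the demo.
token = require_env("MY_SKILL_API_KEY", env={"MY_SKILL_API_KEY": "abc123"})
```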

### 6.2 Input Validation
*Are inputs sanitized?*
- **4:** All user inputs validated with helpful error messages.
- **3:** Key inputs validated. Some edge cases unchecked.
- **2:** Minimal validation. Bad input causes API errors.
- **1:** No validation. Garbage in, garbage out.
- **0:** Injection-vulnerable.
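
"Validated with helpful error messages" might look like the sketch below, which rejects anything that is not a plain slug before it can reach a file path or an API call. The slug rule itself is an assumption chosen for the example, not a rule defined by this rubric.

```python
import re

# Assumed rule: lowercase letters, digits and hyphens, 1 to 64 characters.
SLUG_RE = re.compile(r"[a-z0-9][a-z0-9-]{0,63}")


def validate_slug(value):
    """Return value unchanged if it is a safe slug, else raise ValueError."""
    if not isinstance(value, str) or not SLUG_RE.fullmatch(value):
        raise ValueError(
            f"invalid slug {value!r}: use lowercase letters, digits and hyphens "
            "(1-64 characters, not starting with a hyphen)"
        )
    return value


ok = validate_slug("multi-skill-eval")
try:
    validate_slug("../etc/passwd")
    error = None
except ValueError as exc:
    error = str(exc)  # the message tells the caller how to fix the input
```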

### 6.3 Data Safety
*Are write operations safe by default?*
- **4:** Dry-run defaults, confirmation prompts, safe defaults (trash > delete).
- **3:** Most write ops have confirmation. Safe defaults.
- **2:** Some write ops unprotected.
- **1:** Write ops fire without confirmation.
- **0:** Silent data destruction possible.

---

## 7. Maintainability (ISO 25010)

### 7.1 Modularity
*Is the code well-organized?*
- **4:** Clear separation of concerns. Helpers extracted. Easy to add features.
- **3:** Functions organized by command. Some helpers extracted.
- **2:** Monolithic but navigable. Would benefit from refactoring.
- **1:** Spaghetti. Hard to find things.
- **0:** Incomprehensible.

### 7.2 Modifiability
*How easy is it to change?*
- **4:** Adding a new command is copy-paste-modify. Patterns are clear.
- **3:** Clear patterns. Some globals or tight coupling.
- **2:** Can be modified but requires understanding implicit dependencies.
- **1:** Changes break other things. No clear patterns.
- **0:** Effectively frozen — too fragile to touch.

### 7.3 Testability
*Can it be tested?*
- **4:** Test suite exists. Functions return values. Mockable API layer.
- **3:** No test suite but functions are testable (return values, pure helpers).
- **2:** Some testable functions. Others depend on live API or sys.exit.
- **1:** Everything depends on live API. sys.exit everywhere.
- **0:** Untestable architecture.
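
The difference between scores 2 and 3 often comes down to whether logic returns values or exits the process. A small sketch, with hypothetical function names, of keeping the logic pure and confining process exit to one thin wrapper:

```python
def summarize(statuses):
    """Pure helper: count statuses and return (summary, exit_code)."""
    summary = {"pass": 0, "warn": 0, "fail": 0}
    for s in statuses:
        summary[s] += 1
    exit_code = 1 if summary["fail"] else 0
    return summary, exit_code


def run(statuses):
    """Thin entry point: returns an exit code instead of calling sys.exit()."""
    summary, code = summarize(statuses)
    print(summary)
    return code


# A script would end with: sys.exit(run(sys.argv[1:]))
summary, code = summarize(["pass", "warn", "fail", "pass"])
```

Since `summarize` never exits or prints, a test can call it directly and assert on the returned tuple.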

---

## 8. Agent-Specific Heuristics

### 8.1 Trigger Precision
*Does the description activate the skill reliably?*
- **4:** Specific domain + action words + "Use when..." contexts. Low false positive/negative risk.
- **3:** Good domain keywords. Some ambiguity with similar skills.
- **2:** Too broad or too narrow. Frequent false triggers or missed triggers.
- **1:** Description doesn't match what the skill does.
- **0:** No description.

### 8.2 Progressive Disclosure
*Is context loaded efficiently?*
- **4:** 3+ levels: description → SKILL.md → references. Agent loads only what's needed.
- **3:** 2 levels: description → SKILL.md. Everything in one file but concise.
- **2:** Everything in SKILL.md. No reference files despite complex domain.
- **1:** Requires reading source code for basic usage.
- **0:** No disclosure hierarchy.

### 8.3 Composability
*Can it work with other tools in a pipeline?*
- **4:** --json for all commands, stdin input, proper exit codes, stderr for errors.
- **3:** --json for key commands. Good exit codes.
- **2:** Limited machine-readable output, available for only some commands.
- **1:** Human-readable only. No structured output.
- **0:** Output is unparseable.
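
The four traits named for a score of 4 can be sketched in one small command. `count-lines` is a hypothetical tool invented for this example; the streams are passed as parameters only so the behavior can be exercised without a real shell.

```python
import argparse
import io
import json
import sys


def cli(argv, stdin=sys.stdin, stdout=sys.stdout, stderr=sys.stderr):
    """Count non-blank lines from stdin; return a process exit code."""
    parser = argparse.ArgumentParser(prog="count-lines")
    parser.add_argument("--json", action="store_true",
                        help="machine-readable output")
    args = parser.parse_args(argv)

    lines = [ln for ln in stdin.read().splitlines() if ln.strip()]
    if not lines:
        print("error: no input on stdin", file=stderr)  # errors go to stderr
        return 2                                        # non-zero on failure

    if args.json:
        print(json.dumps({"lines": len(lines)}), file=stdout)
    else:
        print(f"{len(lines)} line(s)", file=stdout)
    return 0


out, err = io.StringIO(), io.StringIO()
code = cli(["--json"], stdin=io.StringIO("a\nb\n"), stdout=out, stderr=err)
```

Keeping stdout free of diagnostics is what lets a downstream tool parse the JSON without filtering.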

### 8.4 Idempotency
*Is it safe to re-run?*
- **4:** All operations are idempotent. Re-running produces same state.
- **3:** Most operations idempotent. Some may create duplicates.
- **2:** Core operations idempotent. Batch ops may duplicate.
- **1:** Re-running causes problems.
- **0:** Re-running corrupts state.
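
An idempotent operation checks the current state before changing it, so a second run is a no-op. The sketch below uses a hypothetical `ensure_line` helper that appends a line to a file only when it is missing.

```python
import os
import tempfile


def ensure_line(path, line):
    """Append `line` to the file at `path` unless it is already present.

    Returns True if the file was changed, False if it was already up to date.
    """
    content = ""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            content = f.read()
    if line in content.splitlines():
        return False
    # Start on a fresh line if the file does not end with a newline.
    prefix = "" if not content or content.endswith("\n") else "\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(prefix + line + "\n")
    return True


with tempfile.TemporaryDirectory() as tmp:
    cfg = os.path.join(tmp, "ignore.txt")
    first = ensure_line(cfg, "__pycache__")
    second = ensure_line(cfg, "__pycache__")  # re-run: nothing to do
    with open(cfg, encoding="utf-8") as f:
        final_lines = f.read().splitlines()
```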

### 8.5 Escape Hatches
*Can the agent/user override default behavior?*
- **4:** --force, --dry-run, --verbose, --quiet, --json, custom API URLs.
- **3:** --force and --dry-run available. Good flag coverage.
- **2:** Some overrides. Missing key flags.
- **1:** Minimal overrides. Hard to customize behavior.
- **0:** No overrides. Take it or leave it.

---

## Scoring Template

Copy this for your evaluation:

```
| Category | Subcriteria | Score | Notes |
|----------|-------------|-------|-------|
| 1.1 Completeness | | /4 | |
| 1.2 Correctness | | /4 | |
| 1.3 Appropriateness | | /4 | |
| 2.1 Fault Tolerance | | /4 | |
| 2.2 Error Reporting | | /4 | |
| 2.3 Recoverability | | /4 | |
| 3.1 Token Cost | | /4 | |
| 3.2 Execution Efficiency | | /4 | |
| 4.1 Learnability | | /4 | |
| 4.2 Consistency | | /4 | |
| 4.3 Feedback Quality | | /4 | |
| 4.4 Error Prevention | | /4 | |
| 5.1 Discoverability | | /4 | |
| 5.2 Forgiveness | | /4 | |
| 6.1 Credential Handling | | /4 | |
| 6.2 Input Validation | | /4 | |
| 6.3 Data Safety | | /4 | |
| 7.1 Modularity | | /4 | |
| 7.2 Modifiability | | /4 | |
| 7.3 Testability | | /4 | |
| 8.1 Trigger Precision | | /4 | |
| 8.2 Progressive Disclosure | | /4 | |
| 8.3 Composability | | /4 | |
| 8.4 Idempotency | | /4 | |
| 8.5 Escape Hatches | | /4 | |
| **TOTAL** | | **/100** | |
```
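
The total is out of 100 because the template lists 25 criteria, each scored 0 to 4. A small sketch for tallying a filled-in template is shown below; the scores are made up and `total_score` is an illustrative helper, not part of the bundled scripts.

```python
MAX_PER_CRITERION = 4
CRITERIA_COUNT = 25

# Number of criteria in each numbered section of the template above.
SECTION_SIZES = {1: 3, 2: 3, 3: 2, 4: 4, 5: 2, 6: 3, 7: 3, 8: 5}


def total_score(scores):
    """Sum per-criterion scores (criterion id -> 0..4) into a /100 total."""
    if len(scores) != CRITERIA_COUNT:
        raise ValueError(f"expected {CRITERIA_COUNT} scores, got {len(scores)}")
    for cid, s in scores.items():
        if not isinstance(s, int) or not 0 <= s <= MAX_PER_CRITERION:
            raise ValueError(
                f"{cid}: score must be an integer 0-{MAX_PER_CRITERION}, got {s!r}"
            )
    return sum(scores.values())


ids = [f"{sec}.{i}" for sec, n in SECTION_SIZES.items() for i in range(1, n + 1)]
example = {cid: 3 for cid in ids}  # made-up scores: every criterion rated 3
example["6.1"] = 4                 # except credential handling, rated 4
total = total_score(example)
```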

FILE:scripts/eval-skill.py
#!/usr/bin/env python3
"""Automated skill evaluation — structural and heuristic checks.

Usage:
    python3 eval-skill.py <path-to-skill-directory>
    python3 eval-skill.py <path-to-skill-directory> --json
    python3 eval-skill.py <path-to-skill-directory> --verbose

Checks file structure, SKILL.md quality, script health, and agent-specific heuristics.
Outputs a report with pass/warn/fail per check and an overall structural score.

This covers the AUTOMATABLE portion of the evaluation. The full rubric
(references/rubric.md) includes manual assessment criteria that require
human or agent judgment.
"""

import argparse
import ast
import json
import os
import re
import sys
import yaml


# --- Check infrastructure ---

class CheckResult:
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"

    def __init__(self, name, status, message="", category=""):
        self.name = name
        self.status = status
        self.message = message
        self.category = category

    def to_dict(self):
        return {
            "name": self.name,
            "status": self.status,
            "message": self.message,
            "category": self.category,
        }


def check(name, category="general"):
    """Decorator to register a check function."""
    def decorator(func):
        func._check_name = name
        func._check_category = category
        return func
    return decorator


# --- File structure checks ---

@check("SKILL.md exists", "structure")
def check_skill_md_exists(skill_path):
    path = os.path.join(skill_path, "SKILL.md")
    if os.path.isfile(path):
        return CheckResult("SKILL.md exists", CheckResult.PASS, category="structure")
    return CheckResult("SKILL.md exists", CheckResult.FAIL,
                       "SKILL.md is required", category="structure")


@check("SKILL.md has valid frontmatter", "structure")
def check_frontmatter(skill_path):
    path = os.path.join(skill_path, "SKILL.md")
    if not os.path.isfile(path):
        return CheckResult("SKILL.md has valid frontmatter", CheckResult.FAIL,
                           "SKILL.md not found", category="structure")
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()

    # Extract YAML frontmatter
    match = re.match(r'^---\s*\n(.*?)\n---', content, re.DOTALL)
    if not match:
        return CheckResult("SKILL.md has valid frontmatter", CheckResult.FAIL,
                           "No YAML frontmatter found (must start with ---)", category="structure")

    try:
        fm = yaml.safe_load(match.group(1))
    except Exception as e:
        return CheckResult("SKILL.md has valid frontmatter", CheckResult.FAIL,
                           f"Invalid YAML: {e}", category="structure")

    if not isinstance(fm, dict):
        return CheckResult("SKILL.md has valid frontmatter", CheckResult.FAIL,
                           "Frontmatter must be a YAML mapping", category="structure")

    missing = []
    if not fm.get("name"):
        missing.append("name")
    if not fm.get("description"):
        missing.append("description")

    if missing:
        return CheckResult("SKILL.md has valid frontmatter", CheckResult.FAIL,
                           f"Missing required fields: {', '.join(missing)}", category="structure")

    return CheckResult("SKILL.md has valid frontmatter", CheckResult.PASS, category="structure")


@check("Skill name matches directory", "structure")
def check_name_matches_dir(skill_path):
    dir_name = os.path.basename(os.path.abspath(skill_path))
    path = os.path.join(skill_path, "SKILL.md")
    if not os.path.isfile(path):
        return CheckResult("Skill name matches directory", CheckResult.FAIL,
                           "SKILL.md not found", category="structure")
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
    match = re.match(r'^---\s*\n(.*?)\n---', content, re.DOTALL)
    if not match:
        return CheckResult("Skill name matches directory", CheckResult.WARN,
                           "No frontmatter to check", category="structure")
    try:
        fm = yaml.safe_load(match.group(1))
        name = fm.get("name", "")
    except Exception:
        return CheckResult("Skill name matches directory", CheckResult.WARN,
                           "Could not parse frontmatter", category="structure")

    if name == dir_name:
        return CheckResult("Skill name matches directory", CheckResult.PASS, category="structure")
    return CheckResult("Skill name matches directory", CheckResult.WARN,
                       f"name='{name}' but directory='{dir_name}'", category="structure")


@check("No extraneous files", "structure")
def check_no_extraneous(skill_path):
    """Check for files that shouldn't be in a skill (README, CHANGELOG, etc.)."""
    bad_files = {"README.md", "CHANGELOG.md", "INSTALLATION_GUIDE.md",
                 "QUICK_REFERENCE.md", "LICENSE", "LICENSE.md"}
    # Normalize once so the case-insensitive comparison does not rebuild the set per file.
    bad_upper = {b.upper() for b in bad_files}
    found = [f for f in os.listdir(skill_path) if f.upper() in bad_upper]
    if found:
        return CheckResult("No extraneous files", CheckResult.WARN,
                           f"Found: {', '.join(found)} — skills shouldn't include these",
                           category="structure")
    return CheckResult("No extraneous files", CheckResult.PASS, category="structure")


@check("Resource directories are non-empty", "structure")
def check_resource_dirs(skill_path):
    """If scripts/, references/, or assets/ exist, they should contain files."""
    empty = []
    for d in ("scripts", "references", "assets"):
        dp = os.path.join(skill_path, d)
        if os.path.isdir(dp):
            contents = [f for f in os.listdir(dp) if not f.startswith(".") and f != "__pycache__"]
            if not contents:
                empty.append(d)
    if empty:
        return CheckResult("Resource directories are non-empty", CheckResult.WARN,
                           f"Empty directories: {', '.join(empty)}", category="structure")
    return CheckResult("Resource directories are non-empty", CheckResult.PASS, category="structure")


# --- Description quality ---

@check("Description length adequate", "trigger")
def check_description_length(skill_path):
    fm = _get_frontmatter(skill_path)
    if not fm:
        return CheckResult("Description length adequate", CheckResult.FAIL,
                           "No frontmatter", category="trigger")
    # A present-but-empty description parses as None, so normalize to a string.
    desc = str(fm.get("description") or "")
    words = len(desc.split())
    if words < 15:
        return CheckResult("Description length adequate", CheckResult.FAIL,
                           f"Description is only {words} words — too short for reliable triggering",
                           category="trigger")
    if words < 30:
        return CheckResult("Description length adequate", CheckResult.WARN,
                           f"Description is {words} words — consider adding trigger contexts",
                           category="trigger")
    if words > 150:
        return CheckResult("Description length adequate", CheckResult.WARN,
                           f"Description is {words} words — may be too long (wastes context in metadata)",
                           category="trigger")
    return CheckResult("Description length adequate", CheckResult.PASS,
                       f"{words} words", category="trigger")


@check("Description includes trigger contexts", "trigger")
def check_trigger_contexts(skill_path):
    """Good descriptions say WHEN to use the skill, not just WHAT it does."""
    fm = _get_frontmatter(skill_path)
    if not fm:
        return CheckResult("Description includes trigger contexts", CheckResult.FAIL,
                           "No frontmatter", category="trigger")
    # A present-but-empty description parses as None, so normalize to a string.
    desc = str(fm.get("description") or "").lower()
    trigger_phrases = ["use when", "use for", "use if", "when the user",
                       "when asked", "when you need", "for tasks like",
                       "such as", "e.g.", "for example"]
    found = [p for p in trigger_phrases if p in desc]
    if not found:
        return CheckResult("Description includes trigger contexts", CheckResult.WARN,
                           "No trigger phrases found — add 'Use when...' to improve activation",
                           category="trigger")
    return CheckResult("Description includes trigger contexts", CheckResult.PASS,
                       f"Found: {', '.join(found[:3])}", category="trigger")


# --- SKILL.md body quality ---

@check("SKILL.md body length", "documentation")
def check_body_length(skill_path):
    path = os.path.join(skill_path, "SKILL.md")
    if not os.path.isfile(path):
        return CheckResult("SKILL.md body length", CheckResult.FAIL, "Not found", category="documentation")
    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    # Count lines after frontmatter
    in_fm = False
    body_lines = 0
    fm_ended = False
    for line in lines:
        if line.strip() == "---":
            if not fm_ended:
                in_fm = not in_fm
                if not in_fm:
                    fm_ended = True
                continue
        if fm_ended:
            body_lines += 1

    if body_lines < 10:
        return CheckResult("SKILL.md body length", CheckResult.FAIL,
                           f"Only {body_lines} lines — too short to be useful", category="documentation")
    if body_lines > 500:
        return CheckResult("SKILL.md body length", CheckResult.WARN,
                           f"{body_lines} lines — consider splitting into reference files", category="documentation")
    return CheckResult("SKILL.md body length", CheckResult.PASS,
                       f"{body_lines} lines", category="documentation")


@check("References are linked from SKILL.md", "documentation")
def check_references_linked(skill_path):
    """If references/ exists, SKILL.md should link to the files."""
    ref_dir = os.path.join(skill_path, "references")
    if not os.path.isdir(ref_dir):
        return CheckResult("References are linked from SKILL.md", CheckResult.PASS,
                           "No references/ directory", category="documentation")
    ref_files = [f for f in os.listdir(ref_dir) if not f.startswith(".")]
    if not ref_files:
        return CheckResult("References are linked from SKILL.md", CheckResult.PASS,
                           "No reference files", category="documentation")

    skill_md = os.path.join(skill_path, "SKILL.md")
    with open(skill_md, "r", encoding="utf-8") as f:
        content = f.read()

    unlinked = [f for f in ref_files if f not in content]
    if unlinked:
        return CheckResult("References are linked from SKILL.md", CheckResult.WARN,
                           f"Unlinked references: {', '.join(unlinked)}", category="documentation")
    return CheckResult("References are linked from SKILL.md", CheckResult.PASS, category="documentation")


# --- Script health ---

@check("Python scripts parse without errors", "scripts")
def check_python_syntax(skill_path):
    scripts_dir = os.path.join(skill_path, "scripts")
    if not os.path.isdir(scripts_dir):
        return CheckResult("Python scripts parse without errors", CheckResult.PASS,
                           "No scripts/ directory", category="scripts")

    errors = []
    checked = 0
    for f in os.listdir(scripts_dir):
        if f.endswith(".py"):
            checked += 1
            fpath = os.path.join(scripts_dir, f)
            try:
                with open(fpath, "r", encoding="utf-8") as fh:
                    ast.parse(fh.read(), filename=f)
            except SyntaxError as e:
                errors.append(f"{f}:{e.lineno}: {e.msg}")

    if not checked:
        return CheckResult("Python scripts parse without errors", CheckResult.PASS,
                           "No Python scripts", category="scripts")
    if errors:
        return CheckResult("Python scripts parse without errors", CheckResult.FAIL,
                           "\n".join(errors), category="scripts")
    return CheckResult("Python scripts parse without errors", CheckResult.PASS,
                       f"{checked} script(s) OK", category="scripts")


@check("Scripts use no external dependencies", "scripts")
def check_no_ext_deps(skill_path):
    """Check that Python scripts only import stdlib modules."""
    scripts_dir = os.path.join(skill_path, "scripts")
    if not os.path.isdir(scripts_dir):
        return CheckResult("Scripts use no external dependencies", CheckResult.PASS,
                           "No scripts/", category="scripts")

    # Common stdlib modules (not exhaustive, but covers common ones)
    stdlib = {
        "abc", "argparse", "ast", "base64", "bisect", "calendar", "cgi",
        "cmd", "codecs", "collections", "configparser", "contextlib", "copy",
        "csv", "dataclasses", "datetime", "decimal", "difflib", "email",
        "enum", "errno", "fnmatch", "fractions", "ftplib", "functools",
        "getopt", "getpass", "glob", "gzip", "hashlib", "heapq", "hmac",
        "html", "http", "imaplib", "importlib", "inspect", "io", "ipaddress",
        "itertools", "json", "keyword", "linecache", "locale", "logging",
        "lzma", "math", "mimetypes", "multiprocessing", "operator", "os",
        "pathlib", "pickle", "platform", "plistlib", "pprint", "profile",
        "queue", "random", "re", "readline", "reprlib", "secrets",
        "select", "shelve", "shlex", "shutil", "signal", "smtplib",
        "socket", "sqlite3", "ssl", "stat", "statistics", "string",
        "struct", "subprocess", "sys", "syslog", "tarfile", "tempfile",
        "textwrap", "threading", "time", "timeit", "token", "tokenize",
        "tomllib", "traceback", "types", "typing", "unicodedata",
        "unittest", "urllib", "uuid", "venv", "warnings", "weakref",
        "webbrowser", "xml", "xmlrpc", "zipfile", "zipimport", "zlib",
        "_thread", "__future__",
    }

    ext_deps = []
    for f in os.listdir(scripts_dir):
        if not f.endswith(".py"):
            continue
        fpath = os.path.join(scripts_dir, f)
        try:
            with open(fpath, "r", encoding="utf-8") as fh:
                tree = ast.parse(fh.read())
        except SyntaxError:
            continue

        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    top = alias.name.split(".")[0]
                    if top not in stdlib:
                        ext_deps.append(f"{f}: import {alias.name}")
            elif isinstance(node, ast.ImportFrom):
                if node.module:
                    top = node.module.split(".")[0]
                    if top not in stdlib:
                        ext_deps.append(f"{f}: from {node.module}")

    if ext_deps:
        return CheckResult("Scripts use no external dependencies", CheckResult.WARN,
                           f"Possible external deps: {'; '.join(ext_deps[:5])}",
                           category="scripts")
    return CheckResult("Scripts use no external dependencies", CheckResult.PASS, category="scripts")


@check("No hardcoded credentials or emails", "security")
def check_no_hardcoded_secrets(skill_path):
    """Scan for common patterns: API keys, emails, tokens in source files."""
    patterns = [
        (r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', "email address"),
        (r'(?:api[_-]?key|token|secret|password)\s*[=:]\s*["\'][^"\']{8,}', "possible credential"),
        (r'sk-[a-zA-Z0-9]{20,}', "OpenAI-style API key"),
        (r'ghp_[a-zA-Z0-9]{36}', "GitHub PAT"),
    ]
    findings = []

    for root, dirs, files in os.walk(skill_path):
        dirs[:] = [d for d in dirs if d not in ("__pycache__", ".git", "node_modules")]
        for f in files:
            if f.endswith((".py", ".md", ".sh", ".js", ".ts", ".yaml", ".yml", ".json", ".toml")):
                fpath = os.path.join(root, f)
                rel = os.path.relpath(fpath, skill_path)
                try:
                    with open(fpath, "r", encoding="utf-8", errors="ignore") as fh:
                        for i, line in enumerate(fh, 1):
                            for pattern, desc in patterns:
                                match = re.search(pattern, line)
                                if not match:
                                    continue
                                if desc == "email address":
                                    email = match.group().lower()
                                    # Skip well-known safe or placeholder patterns
                                    safe_domains = (
                                        "example.com", "example.org",
                                        "outlook.com", "hotmail.com",
                                        "googlegroups.com",
                                        "iam.gserviceaccount.com",
                                        "placeholder", "your",
                                    )
                                    if any(d in email for d in safe_domains):
                                        continue
                                findings.append(f"{rel}:{i}: {desc}")
                except (IOError, UnicodeDecodeError):
                    pass

    if findings:
        # Deduplicate first so the reported count matches the listed items
        unique = list(dict.fromkeys(findings))
        return CheckResult("No hardcoded credentials or emails", CheckResult.WARN,
                           f"Found {len(unique)} potential issues:\n" + "\n".join(unique[:10]),
                           category="security")
    return CheckResult("No hardcoded credentials or emails", CheckResult.PASS, category="security")


@check("Environment variables documented", "security")
def check_env_vars_documented(skill_path):
    """If scripts reference os.environ, SKILL.md should document those vars."""
    scripts_dir = os.path.join(skill_path, "scripts")
    if not os.path.isdir(scripts_dir):
        return CheckResult("Environment variables documented", CheckResult.PASS,
                           "No scripts/", category="security")

    env_vars = set()
    for f in os.listdir(scripts_dir):
        if not f.endswith(".py"):
            continue
        fpath = os.path.join(scripts_dir, f)
        with open(fpath, "r", encoding="utf-8", errors="ignore") as fh:
            content = fh.read()
        # Match os.environ.get("VAR") and os.environ["VAR"]
        for m in re.finditer(r'os\.environ(?:\.get)?\s*[\[(]\s*["\'](\w+)', content):
            var = m.group(1)
            # Skip obviously dummy/example variable names
            if var not in ("VAR", "KEY", "VALUE", "NAME", "ENV"):
                env_vars.add(var)

    if not env_vars:
        return CheckResult("Environment variables documented", CheckResult.PASS,
                           "No env vars found in scripts", category="security")

    skill_md = os.path.join(skill_path, "SKILL.md")
    with open(skill_md, "r", encoding="utf-8") as f:
        skill_content = f.read()

    undocumented = [v for v in env_vars if v not in skill_content]
    if undocumented:
        return CheckResult("Environment variables documented", CheckResult.WARN,
                           f"Undocumented env vars: {', '.join(sorted(undocumented))}",
                           category="security")
    return CheckResult("Environment variables documented", CheckResult.PASS,
                       f"All {len(env_vars)} env vars documented", category="security")


# --- Helpers ---

def _get_frontmatter(skill_path):
    path = os.path.join(skill_path, "SKILL.md")
    if not os.path.isfile(path):
        return None
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
    match = re.match(r'^---\s*\n(.*?)\n---', content, re.DOTALL)
    if not match:
        return None
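    try:
        fm = yaml.safe_load(match.group(1))
    except Exception:
        return None
    # Callers use fm.get(...), so only a YAML mapping is usable.
    return fm if isinstance(fm, dict) else None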


# --- Runner ---

ALL_CHECKS = [
    check_skill_md_exists,
    check_frontmatter,
    check_name_matches_dir,
    check_no_extraneous,
    check_resource_dirs,
    check_description_length,
    check_trigger_contexts,
    check_body_length,
    check_references_linked,
    check_python_syntax,
    check_no_ext_deps,
    check_no_hardcoded_secrets,
    check_env_vars_documented,
]


def run_checks(skill_path, verbose=False):
    results = []
    for check_fn in ALL_CHECKS:
        try:
            result = check_fn(skill_path)
            results.append(result)
        except Exception as e:
            results.append(CheckResult(
                check_fn._check_name, CheckResult.FAIL,
                f"Check crashed: {e}", check_fn._check_category
            ))
    return results


def print_report(results, skill_path, verbose=False):
    counts = {CheckResult.PASS: 0, CheckResult.WARN: 0, CheckResult.FAIL: 0}
    by_category = {}

    for r in results:
        counts[r.status] += 1
        by_category.setdefault(r.category, []).append(r)

    skill_name = os.path.basename(os.path.abspath(skill_path))
    print(f"\n📋 Skill Evaluation: {skill_name}")
    print(f"{'=' * 50}")
    print(f"Path: {os.path.abspath(skill_path)}")
    print()

    icons = {CheckResult.PASS: "✅", CheckResult.WARN: "⚠️ ", CheckResult.FAIL: "❌"}

    for cat in ["structure", "trigger", "documentation", "scripts", "security"]:
        if cat not in by_category:
            continue
        print(f"  [{cat.upper()}]")
        for r in by_category[cat]:
            icon = icons[r.status]
            print(f"    {icon} {r.name}")
            if r.message and (verbose or r.status != CheckResult.PASS):
                for line in r.message.split("\n"):
                    print(f"       {line}")
        print()

    print(f"{'=' * 50}")
    print(f"  ✅ Pass: {counts[CheckResult.PASS]}  "
          f"⚠️  Warn: {counts[CheckResult.WARN]}  "
          f"❌ Fail: {counts[CheckResult.FAIL]}")

    total = len(results)
    score = counts[CheckResult.PASS] / total * 100 if total else 0
    print(f"  Structural score: {score:.0f}% ({counts[CheckResult.PASS]}/{total} checks passed)")
    print()
    print("  ⓘ  This covers automated/structural checks only.")
    print("     For the full evaluation, use the manual rubric in references/rubric.md")
    print()


def main():
    parser = argparse.ArgumentParser(description="Evaluate a Clawdbot skill")
    parser.add_argument("path", help="Path to skill directory")
    parser.add_argument("--json", action="store_true", help="JSON output")
    parser.add_argument("--verbose", "-v", action="store_true", help="Show details for passing checks too")
    args = parser.parse_args()

    if not os.path.isdir(args.path):
        print(f"Error: '{args.path}' is not a directory", file=sys.stderr)
        sys.exit(1)

    results = run_checks(args.path, verbose=args.verbose)

    if args.json:
        output = {
            "skill": os.path.basename(os.path.abspath(args.path)),
            "path": os.path.abspath(args.path),
            "checks": [r.to_dict() for r in results],
            "summary": {
                "pass": sum(1 for r in results if r.status == CheckResult.PASS),
                "warn": sum(1 for r in results if r.status == CheckResult.WARN),
                "fail": sum(1 for r in results if r.status == CheckResult.FAIL),
            }
        }
        print(json.dumps(output, indent=2))
    else:
        print_report(results, args.path, verbose=args.verbose)


if __name__ == "__main__":
    main()

FILE:scripts/generate_leaderboard.py
#!/usr/bin/env python3
"""
Generate an HTML leaderboard from skill cards.

Usage:
    python3 generate_leaderboard.py --cards-dir skill-cards --output leaderboard/index.html

This script reads all skill card markdown files from the cards directory
and generates a sortable HTML leaderboard.
"""

import argparse
import os
import re
import sys
from pathlib import Path


def parse_skill_card(filepath):
    """Parse a skill card markdown file and extract metadata."""
    try:
        with open(filepath, "r", encoding="utf-8") as f:
            content = f.read()
    except (IOError, UnicodeDecodeError):
        return None
    
    card = {
        "file": os.path.basename(filepath),
        "path": str(filepath),
        "name": "Unknown",
        "slug": "unknown",
        "eval_date": "N/A",
        "model": "N/A",
        "overall_score": 0,
        "quality": 0,
        "delta": 0,
        "efficiency": 0,
        "recommendation": "N/A",
        "strengths": [],
        "weaknesses": [],
    }
    
    # Extract skill name
    match = re.search(r'\*\*Skill\*\*:\s*(.+)', content)
    if match:
        card["name"] = match.group(1).strip()
    
    # Extract slug
    match = re.search(r'\*\*Slug\*\*:\s*(.+)', content)
    if match:
        card["slug"] = match.group(1).strip()
    
    # Extract eval date
    match = re.search(r'\*\*Eval Date\*\*:\s*(.+)', content)
    if match:
        card["eval_date"] = match.group(1).strip()
    
    # Extract model
    match = re.search(r'\*\*Model\*\*:\s*(.+)', content)
    if match:
        card["model"] = match.group(1).strip()
    
    # Extract overall score
    match = re.search(r'\*\*Overall:\s*(\d+)/10\*\*', content)
    if match:
        card["overall_score"] = int(match.group(1))
    
    # Extract component scores
    score_matches = re.findall(r'\|\s*(\w+)\s*\|\s*(\d+)\s*\|', content)
    for name, score in score_matches:
        if name == "Quality":
            card["quality"] = int(score)
        elif name == "Delta":
            card["delta"] = int(score)
        elif name == "Efficiency":
            card["efficiency"] = int(score)
    
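    # Extract recommendation. The warning emoji is U+26A0 plus an optional
    # U+FE0F variation selector, so match it as a sequence rather than one character.
    match = re.search(r'#{1,3}\s*(?:✅|\u26a0\ufe0f?|⚡|❌)\s*\*\*(Recommended|Conditional|Marginal|Not Recommended)\*\*', content)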
    if match:
        card["recommendation"] = match.group(1)
    
    # Extract strengths
    strengths_section = re.search(r'## Strengths\s*\n\s*\n(.*?)(?=\n##|\n---|\Z)', content, re.DOTALL)
    if strengths_section:
        strengths = re.findall(r'✅\s*(.+)', strengths_section.group(1))
        card["strengths"] = [s.strip() for s in strengths]
    
    # Extract weaknesses
    weaknesses_section = re.search(r'## Weaknesses\s*\n\s*\n(.*?)(?=\n##|\n---)', content, re.DOTALL)
    if weaknesses_section:
        weaknesses = re.findall(r'❌\s*(.+)', weaknesses_section.group(1))
        card["weaknesses"] = [w.strip() for w in weaknesses]
    
    return card
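

# Illustrative sketch (not part of the original script; all values are
# hypothetical): the fragments of a skill card that parse_skill_card() looks for.
#
#   **Skill**: My Skill
#   **Slug**: my-skill
#   **Eval Date**: 2025-01-01 12:00
#   **Model**: some-model
#   **Overall: 7/10**
#   | Quality | 4 | 5 |
#   | Delta | 2 | 3 |
#   | Efficiency | 1 | 2 |
#   ### ✅ **Recommended**
#   ## Strengths
#   (blank line)
#   - ✅ first strength
#
# Any field that is missing keeps the default set at the top of the function.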


def generate_html(cards, args):
    """Generate HTML leaderboard from cards data."""
    
    # Sort cards by overall score (descending)
    cards.sort(key=lambda x: x.get("overall_score", 0), reverse=True)
    
    html = []
    html.append("<!DOCTYPE html>")
    html.append("<html lang='zh-CN'>")
    html.append("<head>")
    html.append("  <meta charset='UTF-8'>")
    html.append("  <meta name='viewport' content='width=device-width, initial-scale=1.0'>")
    html.append(f"  <title>Skill Leaderboard | {args.title}</title>")
    html.append("  <style>")
    html.append("    * { box-sizing: border-box; margin: 0; padding: 0; }")
    html.append("    body { font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;")
    html.append("          background: #f5f5f5; color: #333; padding: 20px; }")
    html.append("    .container { max-width: 1200px; margin: 0 auto; }")
    html.append("    h1 { text-align: center; margin-bottom: 30px; color: #1a1a2e; }")
    html.append("    .stats { display: flex; justify-content: center; gap: 40px; margin-bottom: 30px; }")
    html.append("    .stat { text-align: center; }")
    html.append("    .stat-value { font-size: 2em; font-weight: bold; color: #4361ee; }")
    html.append("    .stat-label { color: #666; font-size: 0.9em; }")
    html.append("    table { width: 100%; background: white; border-radius: 10px;")
    html.append("           box-shadow: 0 2px 10px rgba(0,0,0,0.1); overflow: hidden; }")
    html.append("    th { background: #4361ee; color: white; padding: 15px; text-align: left; }")
    html.append("    td { padding: 12px 15px; border-bottom: 1px solid #eee; }")
    html.append("    tr:hover { background: #f8f9fa; }")
    html.append("    .score { font-weight: bold; font-size: 1.2em; }")
    html.append("    .score-high { color: #22c55e; }")
    html.append("    .score-mid { color: #f59e0b; }")
    html.append("    .score-low { color: #ef4444; }")
    html.append("    .badge { display: inline-block; padding: 4px 10px; border-radius: 20px;")
    html.append("             font-size: 0.8em; font-weight: 500; }")
    html.append("    .badge-recommended { background: #dcfce7; color: #166534; }")
    html.append("    .badge-conditional { background: #fef9c3; color: #854d0e; }")
    html.append("    .badge-marginal { background: #fed7aa; color: #9a3412; }")
    html.append("    .badge-not-recommended { background: #fee2e2; color: #991b1b; }")
    html.append("    .strengths, .weaknesses { font-size: 0.85em; color: #666; }")
    html.append("    .sort-btn { background: none; border: none; color: inherit; cursor: pointer;")
    html.append("              font-weight: inherit; padding: 0; }")
    html.append("    .sort-btn:hover { color: white; }")
    html.append("    .footer { text-align: center; margin-top: 20px; color: #666; font-size: 0.9em; }")
    html.append("  </style>")
    html.append("</head>")
    html.append("<body>")
    html.append("  <div class='container'>")
    html.append(f"    <h1>{args.title}</h1>")
    html.append("    <div class='stats'>")
    html.append(f"      <div class='stat'><div class='stat-value'>{len(cards)}</div><div class='stat-label'>已评估技能</div></div>")
    
    recommended = sum(1 for c in cards if c.get("recommendation") == "Recommended")
    html.append(f"      <div class='stat'><div class='stat-value'>{recommended}</div><div class='stat-label'>推荐使用</div></div>")
    
    avg_score = sum(c.get("overall_score", 0) for c in cards) / max(len(cards), 1)
    html.append(f"      <div class='stat'><div class='stat-value'>{avg_score:.1f}</div><div class='stat-label'>平均分</div></div>")
    
    html.append("    </div>")
    html.append("    <table id='leaderboard'>")
    html.append("      <thead>")
    html.append("        <tr>")
    html.append("          <th>#</th>")
    html.append("          <th>技能名称</th>")
    html.append("          <th>Slug</th>")
    html.append("          <th>总分</th>")
    html.append("          <th>质量</th>")
    html.append("          <th>提升</th>")
    html.append("          <th>效率</th>")
    html.append("          <th>推荐</th>")
    html.append("          <th>评估日期</th>")
    html.append("          <th>优劣势</th>")
    html.append("        </tr>")
    html.append("      </thead>")
    html.append("      <tbody>")
    
    # Card fields are free text parsed from markdown, so escape them before
    # embedding them in the page. Imported here because the list named `html`
    # shadows the standard-library module inside this function.
    from html import escape
    
    for i, card in enumerate(cards, 1):
        score = card.get("overall_score", 0)
        if score >= 7:
            score_class = "score-high"
        elif score >= 5:
            score_class = "score-mid"
        else:
            score_class = "score-low"
        
        rec = card.get("recommendation", "N/A")
        badge_class = f"badge-{rec.lower().replace(' ', '-')}"
        
        strengths = escape(", ".join(card.get("strengths", [])[:2])) if card.get("strengths") else "-"
        weaknesses = escape(", ".join(card.get("weaknesses", [])[:2])) if card.get("weaknesses") else "-"
        
        html.append("        <tr>")
        html.append(f"          <td>{i}</td>")
        html.append(f"          <td><strong>{escape(card.get('name', 'Unknown'))}</strong></td>")
        html.append(f"          <td>{escape(card.get('slug', 'N/A'))}</td>")
        html.append(f"          <td class='score {score_class}'>{score}/10</td>")
        html.append(f"          <td>{card.get('quality', 0)}/5</td>")
        html.append(f"          <td>{card.get('delta', 0)}/3</td>")
        html.append(f"          <td>{card.get('efficiency', 0)}/2</td>")
        html.append(f"          <td><span class='badge {badge_class}'>{rec}</span></td>")
        html.append(f"          <td>{escape(card.get('eval_date', 'N/A'))}</td>")
        html.append(f"          <td><div class='strengths'>👍 {strengths}</div><div class='weaknesses'>👎 {weaknesses}</div></td>")
        html.append("        </tr>")
    
    html.append("      </tbody>")
    html.append("    </table>")
    html.append(f"    <div class='footer'>Generated by multi-skill-eval v1.0.0 | {len(cards)} skills evaluated</div>")
    html.append("  </div>")
    html.append("</body>")
    html.append("</html>")
    
    return "\n".join(html)


def main():
    parser = argparse.ArgumentParser(
        description="Generate an HTML leaderboard from skill cards"
    )
    parser.add_argument(
        "--cards-dir", "-c", required=True,
        help="Directory containing skill card markdown files"
    )
    parser.add_argument(
        "--output", "-o", required=True,
        help="Output HTML file path"
    )
    parser.add_argument(
        "--title", "-t", default="OpenClaw Skill Leaderboard",
        help="Leaderboard title"
    )
    args = parser.parse_args()
    
    cards_dir = Path(args.cards_dir)
    if not cards_dir.exists():
        print(f"Error: Cards directory not found: {cards_dir}", file=sys.stderr)
        sys.exit(1)
    
    # Find all markdown files
    card_files = list(cards_dir.glob("*.md"))
    
    if not card_files:
        print(f"Warning: No skill card files found in {cards_dir}", file=sys.stderr)
        # Generate empty leaderboard
        cards = []
    else:
        # Parse each card
        cards = []
        for filepath in card_files:
            card = parse_skill_card(filepath)
            if card:
                cards.append(card)
    
    # Generate HTML
    html = generate_html(cards, args)
    
    # Write output
    output_path = Path(args.output)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(html)
    
    print(f"Leaderboard generated: {args.output}")
    print(f"  Total skills: {len(cards)}")


if __name__ == "__main__":
    main()

FILE:scripts/generate_skill_card.py
#!/usr/bin/env python3
"""
Generate a standardized Skill Card from benchmark results.

Usage:
    python3 generate_skill_card.py \
        --workspace /path/to/results \
        --skill-name "My Skill" \
        --skill-slug my-skill \
        --eval-model claude-sonnet-4 \
        --output skill-cards/my-skill-v1.md

Skill Card Contents:
- Metadata: name, source, eval date, model, engine version
- Overall score 0-10 (Quality 0-5 + Delta 0-3 + Efficiency 0-2)
- With-skill vs without-skill comparison table
- Per-test-case breakdown with assertions, timing, grading
- Strengths / Weaknesses
- Recommendation: Recommended / Conditional / Marginal / Not Recommended
"""

import argparse
import json
import os
from datetime import datetime


def load_benchmark_data(workspace):
    """Load all benchmark data from workspace."""
    data = {
        "grading": None,
        "benchmark": None,
        "preflight": None,
    }
    
    # Load grading.json
    grading_path = os.path.join(workspace, "grading.json")
    if os.path.exists(grading_path):
        with open(grading_path, "r", encoding="utf-8") as f:
            data["grading"] = json.load(f)
    
    # Load benchmark aggregation if exists
    benchmark_path = os.path.join(workspace, "benchmark-aggregation.json")
    if os.path.exists(benchmark_path):
        with open(benchmark_path, "r", encoding="utf-8") as f:
            data["benchmark"] = json.load(f)
    
    # Load preflight results if exists
    preflight_path = os.path.join(workspace, "preflight.json")
    if os.path.exists(preflight_path):
        with open(preflight_path, "r", encoding="utf-8") as f:
            data["preflight"] = json.load(f)
    
    return data


def calculate_overall_score(grading, benchmark, preflight):
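    """
    Calculate overall score 0-10:
    - Quality: 0-5, the grading pass rate scaled by 5 and truncated
    - Delta: 0-3 based on with_skill vs without_skill pass-rate improvement
    - Efficiency: 0-2 based on the time overhead ratio; stays at 2 when no
      benchmark data is available

    The preflight argument is currently unused.
    """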
    quality = 0
    delta = 0
    efficiency = 2  # Default to max efficiency
    
    # Quality score from grading (0-5)
    if grading and grading.get("summary", {}).get("total", 0) > 0:
        pass_rate = grading["summary"]["pass_rate"]
        quality = int(pass_rate * 5)
    
    # Delta score from benchmark (0-3)
    if benchmark:
        benchmark_delta = benchmark.get("delta", {})
        pass_rate_delta = benchmark_delta.get("pass_rate", "+0.00")
        
        # Parse delta value (e.g., "+0.15" -> 0.15); str() also accepts a numeric value
        try:
            delta_value = float(str(pass_rate_delta).replace("+", ""))
            if delta_value >= 0.3:
                delta = 3
            elif delta_value >= 0.2:
                delta = 2
            elif delta_value >= 0.1:
                delta = 1
            else:
                delta = 0
        except (ValueError, AttributeError):
            delta = 0
        
        # Efficiency penalty for high-overhead skills
        time_delta = benchmark_delta.get("time", "1x")
        try:
            time_ratio = float(str(time_delta).replace("x", "").replace("+", ""))
            if time_ratio > 3:
                efficiency = 0
            elif time_ratio > 2:
                efficiency = 1
        except (ValueError, AttributeError):
            pass
    
    overall = quality + delta + efficiency
    return min(overall, 10), {
        "quality": quality,
        "delta": delta,
        "efficiency": efficiency,
        "overall": min(overall, 10)
    }
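

# Worked example (illustrative sketch; the input values are hypothetical):
#   grading pass_rate = 0.85        -> quality    = int(0.85 * 5) = 4
#   benchmark delta pass_rate "+0.22" -> delta    = 2   (>= 0.2, < 0.3)
#   benchmark delta time "2.4x"     -> efficiency = 1   (> 2, not > 3)
#   overall = 4 + 2 + 1 = 7         -> get_recommendation(7, ...) == "Recommended"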


def get_recommendation(overall_score, grading, benchmark, preflight):
    """Determine recommendation based on overall score."""
    if overall_score >= 7:
        return "Recommended"
    elif overall_score >= 5:
        return "Conditional"
    elif overall_score >= 3:
        return "Marginal"
    else:
        return "Not Recommended"


def detect_strengths(grading, benchmark, preflight):
    """Identify skill strengths from data."""
    strengths = []
    
    if grading:
        # Check for high pass rate areas
        for assertion in grading.get("assertions", []):
            if assertion.get("passed"):
                cat = assertion.get("category", "")
                if cat == "quality":
                    strengths.append(f"输出质量达标")
                elif cat == "format":
                    strengths.append(f"格式规范")
                elif cat == "security":
                    strengths.append(f"安全性良好")
    
    if benchmark:
        # Check for significant improvements
        if benchmark.get("with_skill", {}).get("pass_rate", 0) > 0.8:
            strengths.append("高任务完成率")
    
    if preflight:
        # Check dependencies; only claim availability when preflight reports it
        deps = preflight.get("dependencies", {})
        if deps.get("all_available", False):
            strengths.append("依赖完整可用")
    
    return list(dict.fromkeys(strengths))[:5]  # Deduplicate in stable order, max 5


def detect_weaknesses(grading, benchmark, preflight):
    """Identify skill weaknesses from data."""
    weaknesses = []
    
    if grading:
        # Check for failed assertions
        for assertion in grading.get("assertions", []):
            if not assertion.get("passed"):
                cat = assertion.get("category", "")
                if cat == "format":
                    weaknesses.append("格式不规范")
                elif cat == "security":
                    weaknesses.append("安全性问题")
                elif cat == "completeness":
                    weaknesses.append("输出不完整")
    
    if preflight:
        # Check for phantom tooling
        phantom = preflight.get("phantom_tooling", [])
        if phantom:
            weaknesses.append(f"幽灵工具: {', '.join(phantom[:2])}")
        
        # Check for unverified claims
        claims = preflight.get("unverified_claims", [])
        if claims:
            weaknesses.append("存在未验证的营销声明")
    
    if benchmark:
        # Check for high overhead
        delta = benchmark.get("delta", {})
        time_ratio = delta.get("time", "1x")
        try:
            ratio = float(str(time_ratio).replace("x", "").replace("+", ""))
            if ratio > 2:
                weaknesses.append(f"开销较高: {time_ratio}")
        except ValueError:
            pass
    
    return list(dict.fromkeys(weaknesses))[:5]  # Deduplicate in stable order, max 5


def generate_skill_card(args):
    """Generate the complete skill card."""
    workspace = os.path.abspath(args.workspace)
    data = load_benchmark_data(workspace)
    
    # Calculate scores
    overall_score, score_breakdown = calculate_overall_score(
        data.get("grading"),
        data.get("benchmark"),
        data.get("preflight")
    )
    
    recommendation = get_recommendation(
        overall_score,
        data.get("grading"),
        data.get("benchmark"),
        data.get("preflight")
    )
    
    strengths = detect_strengths(
        data.get("grading"),
        data.get("benchmark"),
        data.get("preflight")
    )
    
    weaknesses = detect_weaknesses(
        data.get("grading"),
        data.get("benchmark"),
        data.get("preflight")
    )
    
    # Build skill card
    card = []
    card.append("# Skill Card")
    card.append("")
    card.append(f"**Skill**: {args.skill_name}")
    card.append(f"**Slug**: {args.skill_slug}")
    card.append(f"**Eval Date**: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
    card.append(f"**Model**: {args.eval_model}")
    card.append("")
    card.append("---")
    card.append("")
    card.append("## Metadata")
    card.append("")
    card.append(f"| Field | Value |")
    card.append(f"|-------|-------|")
    card.append(f"| Skill Name | {args.skill_name} |")
    card.append(f"| Skill Slug | {args.skill_slug} |")
    card.append(f"| Evaluation Date | {datetime.now().strftime('%Y-%m-%d')} |")
    card.append(f"| Evaluator Model | {args.eval_model} |")
    card.append(f"| Engine Version | multi-skill-eval v1.0.0 |")
    if args.source:
        card.append(f"| Source | {args.source} |")
    card.append("")
    card.append("## Overall Score")
    card.append("")
    card.append(f"**Overall: {overall_score}/10**")
    card.append("")
    card.append(f"| Component | Score | Max |")
    card.append(f"|-----------|-------|-----|")
    card.append(f"| Quality | {score_breakdown['quality']} | 5 |")
    card.append(f"| Delta | {score_breakdown['delta']} | 3 |")
    card.append(f"| Efficiency | {score_breakdown['efficiency']} | 2 |")
    card.append(f"| **Total** | **{overall_score}** | **10** |")
    card.append("")
    card.append("### Score Breakdown")
    card.append("")
    card.append("- **Quality (0-5)**: Based on grading pass rate")
    card.append("- **Delta (0-3)**: Based on with-skill vs without-skill improvement")
    card.append("- **Efficiency (0-2)**: Based on time/token cost ratio")
    card.append("")
    card.append("---")
    card.append("")
    card.append("## With-Skill vs Without-Skill Comparison")
    card.append("")
    
    if data.get("benchmark"):
        bm = data["benchmark"]
        card.append(f"| Metric | With-Skill | Without-Skill | Delta |")
        card.append(f"|---------|-----------|---------------|-------|")
        card.append(f"| Pass Rate | {bm.get('with_skill', {}).get('pass_rate', 'N/A')} | {bm.get('without_skill', {}).get('pass_rate', 'N/A')} | {bm.get('delta', {}).get('pass_rate', 'N/A')} |")
        card.append(f"| Avg Time | {bm.get('with_skill', {}).get('avg_time', 'N/A')} | {bm.get('without_skill', {}).get('avg_time', 'N/A')} | {bm.get('delta', {}).get('time', 'N/A')} |")
        card.append(f"| Avg Tokens | {bm.get('with_skill', {}).get('avg_tokens', 'N/A')} | {bm.get('without_skill', {}).get('avg_tokens', 'N/A')} | {bm.get('delta', {}).get('tokens', 'N/A')} |")
    else:
        card.append("*No benchmark data available.*")
    card.append("")
    card.append("---")
    card.append("")
    card.append("## Per-Test-Case Breakdown")
    card.append("")
    
    if data.get("grading"):
        assertions = data["grading"].get("assertions", [])
        by_category = {}
        for a in assertions:
            cat = a.get("category", "other")
            by_category.setdefault(cat, []).append(a)
        
        for cat, items in by_category.items():
            card.append(f"### {cat.upper()}")
            card.append("")
            for item in items:
                status = "✅" if item.get("passed") else "❌"
                card.append(f"- {status} {item.get('text', '')}")
                if item.get('evidence'):
                    card.append(f"  - Evidence: {item.get('evidence')}")
            card.append("")
    else:
        card.append("*No per-test data available.*")
        card.append("")
    
    card.append("---")
    card.append("")
    card.append("## Strengths")
    card.append("")
    if strengths:
        for s in strengths:
            card.append(f"- ✅ {s}")
    else:
        card.append("*No specific strengths identified.*")
    card.append("")
    card.append("## Weaknesses")
    card.append("")
    if weaknesses:
        for w in weaknesses:
            card.append(f"- ❌ {w}")
    else:
        card.append("*No specific weaknesses identified.*")
    card.append("")
    card.append("---")
    card.append("")
    card.append("## Recommendation")
    card.append("")
    
    rec_emoji = {
        "Recommended": "✅",
        "Conditional": "⚠️",
        "Marginal": "⚡",
        "Not Recommended": "❌"
    }
    rec_text = {
        "Recommended": "该技能已通过基准测试，推荐使用。",
        "Conditional": "该技能在特定场景下可用，需注意已知问题。",
        "Marginal": "该技能存在明显局限，建议修复后再使用。",
        "Not Recommended": "该技能未通过基准测试，暂不推荐使用。"
    }
    
    card.append(f"### {rec_emoji.get(recommendation, '❓')} **{recommendation}**")
    card.append("")
    card.append(rec_text.get(recommendation, ""))
    card.append("")
    
    # Dependencies section
    if data.get("preflight"):
        card.append("---")
        card.append("")
        card.append("## Dependencies")
        card.append("")
        deps = data["preflight"].get("dependencies", {})
        if deps:
            card.append(f"- Required: {', '.join(deps.get('required', []))}")
            card.append(f"- Optional: {', '.join(deps.get('optional', []))}")
            card.append(f"- All Available: {'Yes' if deps.get('all_available') else 'No'}")
        else:
            card.append("*No dependency information available.*")
        card.append("")
    
    return "\n".join(card)


def main():
    parser = argparse.ArgumentParser(
        description="Generate a standardized Skill Card from benchmark results"
    )
    parser.add_argument(
        "--workspace", "-w", required=True,
        help="Path to benchmark results workspace"
    )
    parser.add_argument(
        "--skill-name", "-n", required=True,
        help="Name of the skill"
    )
    parser.add_argument(
        "--skill-slug", "-s", required=True,
        help="Slug/identifier for the skill"
    )
    parser.add_argument(
        "--eval-model", "-m", default="minimax/MiniMax-M2",
        help="Model used for evaluation"
    )
    parser.add_argument(
        "--source",
        help="Source of the skill (URL, path, etc.)"
    )
    parser.add_argument(
        "--output", "-o",
        help="Output file path (default: stdout)"
    )
    args = parser.parse_args()
    
    card = generate_skill_card(args)
    
    if args.output:
        # dirname is empty for a bare filename, and os.makedirs("") raises
        out_dir = os.path.dirname(args.output)
        if out_dir:
            os.makedirs(out_dir, exist_ok=True)
        with open(args.output, "w", encoding="utf-8") as f:
            f.write(card)
        print(f"Skill card generated: {args.output}")
    else:
        print(card)


if __name__ == "__main__":
    main()

FILE:scripts/grade-assertions.py
#!/usr/bin/env python3
"""
Assertion-based grading for skill benchmark results.

Usage:
    python3 grade-assertions.py --workspace /path/to/results
    python3 grade-assertions.py --workspace /path/to/results --verbose

This script reads benchmark execution results from a workspace directory
and grades them against defined assertions.

Workspace structure expected:
    /path/to/results/
        grading-config.json      # Assertion definitions
        iteration-1/
            test-name/
                with_skill/outputs/
                    result.md
                without_skill/outputs/
                    result.md
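
Example grading-config.json (an illustrative sketch inferred from how this
script reads the file; the keyword, section, and descriptions are hypothetical):
    {
      "Assertions": [
        {
          "type": "keyword_present",
          "description": "Output mentions the skill name",
          "category": "quality",
          "params": {"keyword": "my-skill", "case_sensitive": false}
        },
        {
          "type": "required_section",
          "description": "Output has a Summary section",
          "category": "format",
          "params": {"section": "Summary"}
        }
      ]
    }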

Output:
    grading.json with pass/fail per assertion and summary statistics.
"""

import argparse
import json
import os
import re
import sys
from pathlib import Path


# --- Assertion Types ---

class AssertionResult:
    def __init__(self, assertion_type, text, passed, evidence="", category=""):
        self.type = assertion_type
        self.text = text
        self.passed = passed
        self.evidence = evidence
        self.category = category

    def to_dict(self):
        return {
            "type": self.type,
            "text": self.text,
            "passed": self.passed,
            "evidence": self.evidence,
            "category": self.category,
        }


def check_keyword_absent(content, keyword, case_sensitive=False):
    """Assertion: a banned keyword should NOT appear in output."""
    if case_sensitive:
        found = keyword in content
    else:
        found = keyword.lower() in content.lower()
    return not found


def check_keyword_present(content, keyword, case_sensitive=False):
    """Assertion: a required keyword MUST appear in output."""
    if case_sensitive:
        found = keyword in content
    else:
        found = keyword.lower() in content.lower()
    return found


def check_required_section(content, section_marker):
    """Assertion: a required section header must appear in output."""
    # Match markdown headers like "## Section Name" or "### Section"
    pattern = rf"^#+\s+{re.escape(section_marker)}"
    return bool(re.search(pattern, content, re.MULTILINE))


def check_file_exists(filepath):
    """Assertion: a file must exist."""
    return os.path.isfile(filepath)


def check_format_json(content):
    """Assertion: content must be valid JSON."""
    try:
        json.loads(content)
        return True
    except (json.JSONDecodeError, TypeError):
        return False


def check_format_markdown(content, min_lines=5):
    """Assertion: content must be valid markdown with minimum lines."""
    lines = content.strip().split("\n")
    has_headers = any(re.match(r"^#+ ", line) for line in lines)
    return len(lines) >= min_lines and has_headers


def check_bilingual_keywords(content, zh_keyword, en_keyword):
    """Assertion: at least one language variant of a keyword must appear."""
    has_zh = zh_keyword in content
    has_en = en_keyword.lower() in content.lower()
    return has_zh or has_en


def check_no_hardcoded_credentials(content):
    """Assertion: no API keys, tokens, or credentials in content."""
    patterns = [
        r'sk-[a-zA-Z0-9]{20,}',  # OpenAI key
        r'ghp_[a-zA-Z0-9]{36}',  # GitHub PAT
        r'api[_-]?key["\']?\s*[=:]\s*["\'][^"\']{8,}',
        r'token["\']?\s*[=:]\s*["\'][^"\']{8,}',
    ]
    for pattern in patterns:
        if re.search(pattern, content, re.IGNORECASE):
            return False
    return True


def check_mention_count(content, keyword, min_count=1):
    """Assertion: a keyword must appear at least min_count times."""
    return content.lower().count(keyword.lower()) >= min_count


# --- Grading Engine ---

AVAILABLE_CHECKS = {
    "keyword_absent": check_keyword_absent,
    "keyword_present": check_keyword_present,
    "required_section": check_required_section,
    "file_exists": check_file_exists,
    "format_json": check_format_json,
    "format_markdown": check_format_markdown,
    "bilingual_keywords": check_bilingual_keywords,
    "no_hardcoded_credentials": check_no_hardcoded_credentials,
    "mention_count": check_mention_count,
}


def load_grading_config(workspace):
    """Load grading configuration from grading-config.json or generate defaults."""
    config_path = os.path.join(workspace, "grading-config.json")
    if os.path.exists(config_path):
        with open(config_path, "r", encoding="utf-8") as f:
            return json.load(f)
    
    # Default: grade all output files for basic quality
    return {
        "Assertions": [],
        "default_checks": {
            "format_markdown": True,
            "no_hardcoded_credentials": True,
        }
    }


def find_output_files(workspace):
    """Find all output files in the workspace."""
    outputs = []
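    for root, dirs, files in os.walk(workspace):
        # Only collect files under a directory named "outputs". Compare path
        # components below the workspace so that a workspace path that merely
        # contains the substring "outputs" does not match every file.
        if "outputs" in Path(root).relative_to(workspace).parts: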
            for f in files:
                if f.endswith((".md", ".txt", ".json", ".html")):
                    outputs.append(os.path.join(root, f))
    return outputs


def grade_file(filepath, assertions):
    """Grade a single output file against assertions."""
    results = []
    
    try:
        with open(filepath, "r", encoding="utf-8") as f:
            content = f.read()
    except (IOError, UnicodeDecodeError):
        return [AssertionResult(
            "file_exists", f"File: {filepath}", False,
            f"Could not read file", "file"
        )]
    
    # Check each assertion
    for assertion in assertions:
        check_type = assertion.get("type", "")
        params = assertion.get("params", {})
        category = assertion.get("category", "general")
        description = assertion.get("description", f"{check_type} check")
        
        if check_type not in AVAILABLE_CHECKS:
            results.append(AssertionResult(
                check_type, description, False,
                f"Unknown assertion type: {check_type}", category
            ))
            continue
        
        check_fn = AVAILABLE_CHECKS[check_type]
        
        try:
            # Handle different check signatures
            if check_type == "keyword_absent":
                passed = check_fn(content, params.get("keyword"), params.get("case_sensitive", False))
                evidence = f"'{params.get('keyword')}' {'found' if not passed else 'not found'}"
            elif check_type == "keyword_present":
                passed = check_fn(content, params.get("keyword"), params.get("case_sensitive", False))
                evidence = f"'{params.get('keyword')}' {'found' if passed else 'not found'}"
            elif check_type == "required_section":
                passed = check_fn(content, params.get("section"))
                evidence = f"Section '{params.get('section')}' {'found' if passed else 'not found'}"
            elif check_type == "file_exists":
                passed = check_fn(filepath)
                evidence = f"File exists: {passed}"
            elif check_type == "format_json":
                passed = check_fn(content)
                evidence = f"Valid JSON: {passed}"
            elif check_type == "format_markdown":
                passed = check_fn(content, params.get("min_lines", 5))
                evidence = f"Valid markdown with {len(content.strip().splitlines())} lines: {passed}"
            elif check_type == "bilingual_keywords":
                passed = check_fn(content, params.get("zh"), params.get("en"))
                evidence = f"Keywords '{params.get('zh')}' or '{params.get('en')}' found: {passed}"
            elif check_type == "no_hardcoded_credentials":
                passed = check_fn(content)
                evidence = f"No credentials found: {passed}"
            elif check_type == "mention_count":
                passed = check_fn(content, params.get("keyword"), params.get("min_count", 1))
                evidence = f"'{params.get('keyword')}' appears {content.lower().count(params.get('keyword').lower())} times (min: {params.get('min_count', 1)})"
            else:
                passed = False
                evidence = "Unknown check type"
            
            results.append(AssertionResult(
                check_type, description, passed, evidence, category
            ))
        except Exception as e:
            results.append(AssertionResult(
                check_type, description, False, f"Check error: {e}", category
            ))
    
    return results


def grade_workspace(workspace, verbose=False):
    """Grade all outputs in a workspace against assertions."""
    workspace = os.path.abspath(workspace)
    
    if not os.path.exists(workspace):
        return {
            "error": f"Workspace not found: {workspace}",
            "expectations": [],
            "summary": {"passed": 0, "failed": 0, "total": 0, "pass_rate": 0}
        }
    
    config = load_grading_config(workspace)
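    # Accept the lowercase key as well so a config using "assertions" is not silently ignored
    assertions = config.get("Assertions") or config.get("assertions", [])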
    
    # If no assertions defined, use default checks
    if not assertions:
        assertions = _get_default_assertions()
    
    output_files = find_output_files(workspace)
    
    if verbose:
        print(f"Grading workspace: {workspace}")
        print(f"Found {len(output_files)} output files")
        print(f"Running {len(assertions)} assertions per file")
    
    all_results = []
    
    for filepath in output_files:
        if verbose:
            print(f"  Grading: {os.path.relpath(filepath, workspace)}")
        results = grade_file(filepath, assertions)
        all_results.extend(results)
    
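    # Deduplicate by text, but never let a pass in one file mask a failure in
    # another: keep the first result per assertion unless a later one fails.
    by_text = {}
    for r in all_results:
        kept = by_text.get(r.text)
        if kept is None or (kept.passed and not r.passed):
            by_text[r.text] = r
    unique_results = list(by_text.values())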
    
    return {
        "workspace": workspace,
        "files_checked": len(output_files),
        "assertions": [r.to_dict() for r in unique_results],
        "expectations": [r.to_dict() for r in unique_results],
        "summary": {
            "passed": sum(1 for r in unique_results if r.passed),
            "failed": sum(1 for r in unique_results if not r.passed),
            "total": len(unique_results),
            "pass_rate": sum(1 for r in unique_results if r.passed) / max(len(unique_results), 1),
        }
    }


def _get_default_assertions():
    """Get default assertions for markdown-based outputs."""
    return [
        {
            "type": "format_markdown",
            "description": "Output should be valid markdown with headers",
            "category": "format",
            "params": {"min_lines": 5}
        },
        {
            "type": "no_hardcoded_credentials",
            "description": "Output should not contain hardcoded credentials",
            "category": "security",
            "params": {}
        },
        {
            "type": "keyword_present",
            "description": "Output should contain substantive content",
            "category": "quality",
            "params": {"keyword": "##", "case_sensitive": False}
        },
    ]


def main():
    parser = argparse.ArgumentParser(
        description="Grade benchmark results against assertions"
    )
    parser.add_argument(
        "--workspace", "-w", required=True,
        help="Path to benchmark results workspace"
    )
    parser.add_argument(
        "--verbose", "-v", action="store_true",
        help="Show detailed grading output"
    )
    parser.add_argument(
        "--json", action="store_true",
        help="Output JSON format"
    )
    args = parser.parse_args()
    
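    result = grade_workspace(args.workspace, verbose=args.verbose)

    # grade_workspace reports a missing workspace through an "error" key; in
    # that case there is no directory to write grading.json into, so stop here.
    if "error" in result:
        raise SystemExit(result["error"])

    # Save grading.json
    grading_path = os.path.join(args.workspace, "grading.json")
    with open(grading_path, "w", encoding="utf-8") as f:
        json.dump(result, f, indent=2, ensure_ascii=False)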
    
    if args.json:
        print(json.dumps(result, indent=2, ensure_ascii=False))
    else:
        summary = result["summary"]
        total = summary["total"]
        passed = summary["passed"]
        failed = summary["failed"]
        rate = summary["pass_rate"]
        
        print(f"\n📊 Grading Results")
        print(f"{'=' * 50}")
        print(f"Files checked: {result.get('files_checked', 0)}")
        print(f"Total assertions: {total}")
        print(f"  ✅ Passed: {passed}")
        print(f"  ❌ Failed: {failed}")
        print(f"  Pass rate: {rate:.1%}")
        print(f"{'=' * 50}")
        
        if failed > 0 and args.verbose:
            print("\nFailed assertions:")
            for a in result["assertions"]:
                if not a["passed"]:
                    print(f"  ❌ {a['text']}")
                    print(f"     Evidence: {a['evidence']}")
        
        print(f"\n📁 Results saved to: {grading_path}")


if __name__ == "__main__":
    main()

FILE:scripts/static-analyze.py
#!/usr/bin/env python3
"""Static analysis for skill-assessment (Method 1).

Usage:
    python3 static-analyze.py <path-to-skill>
    python3 static-analyze.py <path-to-skill> --json
    python3 static-analyze.py <path-to-skill> --problems-only
    python3 static-analyze.py --compare <skill-a> <skill-b>
    python3 static-analyze.py --all
"""

import argparse
import json
import os
import re
import sys
import yaml


def check_documentation(skill_path):
    issues = []
    score = 100

    # Check SKILL.md exists
    skill_md = os.path.join(skill_path, "SKILL.md")
    if not os.path.isfile(skill_md):
        issues.append({"severity": "error", "check": "SKILL.md exists", "msg": "SKILL.md not found"})
        return {"score": 0, "issues": issues}

    # Parse frontmatter
    with open(skill_md, "r", encoding="utf-8") as f:
        content = f.read()

    match = re.match(r'^---\s*\n(.*?)\n---', content, re.DOTALL)
    if not match:
        issues.append({"severity": "error", "check": "frontmatter", "msg": "No YAML frontmatter"})
        score -= 30
    else:
        try:
            fm = yaml.safe_load(match.group(1))
            if not isinstance(fm, dict):
                issues.append({"severity": "error", "check": "frontmatter", "msg": "Frontmatter must be YAML dict"})
                score -= 20
            else:
                if not fm.get("name"):
                    issues.append({"severity": "error", "check": "name field", "msg": "Missing 'name' in frontmatter"})
                    score -= 15
                if not fm.get("description"):
                    issues.append({"severity": "warning", "check": "description field", "msg": "Missing 'description' in frontmatter"})
                    score -= 10
                elif len(fm["description"].split()) < 15:
                    issues.append({"severity": "warning", "check": "description length", "msg": f"Description too short ({len(fm['description'].split())} words)"})
                    score -= 5

                # Check trigger contexts
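                # A present-but-empty description loads as None; coerce it so
                # .lower() cannot raise and be misreported as a YAML parse error.
                desc_lower = str(fm.get("description") or "").lower()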
                trigger_phrases = ["use when", "use for", "when the user", "when you need", "such as", "for example"]
                has_trigger = any(p in desc_lower for p in trigger_phrases)
                if not has_trigger:
                    issues.append({"severity": "info", "check": "trigger contexts", "msg": "No trigger phrases found — add 'Use when...' to improve activation"})
                    score -= 5
        except Exception as e:
            issues.append({"severity": "error", "check": "frontmatter parse", "msg": f"YAML parse error: {e}"})
            score -= 20

    # Check body length: count lines after the closing frontmatter delimiter
    # (counting from the first '---' would include the frontmatter itself).
    body = content[match.end():] if match else content
    body_lines = len(body.strip("\n").split("\n")) if body.strip() else 0

    if body_lines < 10:
        issues.append({"severity": "error", "check": "body length", "msg": f"Body only {body_lines} lines — too short"})
        score -= 15
    elif body_lines > 500:
        issues.append({"severity": "info", "check": "body length", "msg": f"Body {body_lines} lines — consider splitting"})

    # Check for usage examples in body
    example_indicators = ["example", "usage", "```bash", "```python", "$ "]
    has_examples = any(indicator in content.lower() for indicator in example_indicators)
    if not has_examples:
        issues.append({"severity": "warning", "check": "examples", "msg": "No usage examples found"})
        score -= 10

    return {"score": max(0, score), "issues": issues}


def check_code_quality(skill_path):
    issues = []
    score = 100

    scripts_dir = os.path.join(skill_path, "scripts")
    if not os.path.isdir(scripts_dir):
        return {"score": 100, "issues": [{"severity": "info", "check": "scripts", "msg": "No scripts/ directory"}]}

    py_files = [f for f in os.listdir(scripts_dir) if f.endswith(".py")]
    if not py_files:
        return {"score": 100, "issues": []}

    for f in py_files:
        fpath = os.path.join(scripts_dir, f)
        try:
            with open(fpath, "r", encoding="utf-8") as fh:
                code = fh.read()

            # Syntax check
            try:
                compile(code, fpath, "exec")
            except SyntaxError as e:
                issues.append({"severity": "error", "check": f"syntax ({f})", "msg": f"Syntax error at line {e.lineno}: {e.msg}"})
                score -= 10

            # Check for hardcoded secrets
            secret_patterns = [
                r'api[_-]?key\s*[=:]\s*["\'][^"\']{8,}',
                r'token\s*[=:]\s*["\'][^"\']{8,}',
                r'sk-[a-zA-Z0-9]{20,}',
                r'ghp_[a-zA-Z0-9]{36}',
            ]
            for pattern in secret_patterns:
                if re.search(pattern, code, re.IGNORECASE):
                    issues.append({"severity": "error", "check": f"secrets ({f})", "msg": f"Possible hardcoded credential in {f}"})
                    score -= 15

            # Note: "dangerous functions" (os.system, eval, exec) check removed.
            # Presence of these functions is not a security issue without
            # understanding data flow. Real security issues are caught above.

        except Exception as e:
            issues.append({"severity": "error", "check": "read error", "msg": f"Could not read {f}: {e}"})
            score -= 5

    return {"score": max(0, score), "issues": issues}


def check_configuration(skill_path):
    issues = []
    score = 100

    skill_md = os.path.join(skill_path, "SKILL.md")
    if not os.path.isfile(skill_md):
        return {"score": 0, "issues": [{"severity": "error", "check": "SKILL.md", "msg": "SKILL.md not found"}]}

    with open(skill_md, "r", encoding="utf-8") as f:
        content = f.read()

    # Check env vars referenced
    env_refs = re.findall(r'\$\{?([A-Z_][A-Z0-9_]*)\}?', content)
    if env_refs:
        # Check they are documented
        env_section = "env" in content.lower() or "environment" in content.lower() or "variable" in content.lower()
        if not env_section:
            issues.append({"severity": "warning", "check": "env vars", "msg": f"Env vars found but not documented: {', '.join(sorted(set(env_refs))[:5])}"})
            score -= 10

    # Check for clear command examples
    has_commands = bool(re.search(r'```(bash|sh|shell|python)', content))
    if not has_commands:
        issues.append({"severity": "info", "check": "command examples", "msg": "No command examples found"})
        score -= 5

    return {"score": max(0, score), "issues": issues}


def check_maintenance(skill_path):
    issues = []
    score = 100

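    # Check for versioning
    content = ""  # stays empty when SKILL.md is missing, so the regex below cannot hit an undefined name
    skill_md = os.path.join(skill_path, "SKILL.md")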
    if os.path.isfile(skill_md):
        with open(skill_md, "r", encoding="utf-8") as f:
            content = f.read()
        if "version" not in content.lower():
            issues.append({"severity": "info", "check": "version", "msg": "No version info found in SKILL.md"})
            score -= 5

    # Check for update date patterns
    update_indicators = re.findall(r'(updated?|modified?|changed?|revised?):?\s*\d{4}-\d{2}-\d{2}', content, re.IGNORECASE)
    if update_indicators:
        issues.append({"severity": "info", "check": "update date", "msg": "Found update timestamps"})

    # Check directory structure
    for subdir in ["scripts", "references", "assets", "knowledge"]:
        dp = os.path.join(skill_path, subdir)
        if os.path.isdir(dp):
            contents = [f for f in os.listdir(dp) if not f.startswith(".") and f != "__pycache__"]
            if not contents:
                issues.append({"severity": "warning", "check": f"{subdir}/ empty", "msg": f"{subdir}/ is empty"})
                score -= 5

    return {"score": max(0, score), "issues": issues}


def analyze(skill_path, json_output=False, problems_only=False):
    if not os.path.isdir(skill_path):
        print(f"Error: {skill_path} is not a directory", file=sys.stderr)
        sys.exit(1)

    doc = check_documentation(skill_path)
    code = check_code_quality(skill_path)
    config = check_configuration(skill_path)
    maint = check_maintenance(skill_path)

    total_score = (doc["score"] * 0.3 + code["score"] * 0.3 + config["score"] * 0.2 + maint["score"] * 0.2)

    all_issues = doc["issues"] + code["issues"] + config["issues"] + maint["issues"]

    if json_output:
        result = {
            "skill": os.path.basename(os.path.abspath(skill_path)),
            "path": os.path.abspath(skill_path),
            "scores": {
                "documentation": doc["score"],
                "code_quality": code["score"],
                "configuration": config["score"],
                "maintenance": maint["score"],
            },
            "overall": round(total_score, 1),
            "issues": all_issues,
        }
        print(json.dumps(result, indent=2, ensure_ascii=False))
    else:
        skill_name = os.path.basename(os.path.abspath(skill_path))
        print(f"\n📊 Skill Assessment: {skill_name}")
        print(f"{'=' * 50}")
        print(f"  Documentation:   {doc['score']}/100")
        print(f"  Code Quality:    {code['score']}/100")
        print(f"  Configuration:   {config['score']}/100")
        print(f"  Maintenance:     {maint['score']}/100")
        print(f"  ─────────────────────────")
        print(f"  Overall:         {total_score:.0f}/100")
        print()

        if problems_only:
            filtered = [i for i in all_issues if i["severity"] != "info"]
        else:
            filtered = all_issues

        if filtered:
            print("  Issues:")
            for issue in filtered:
                icon = {"error": "❌", "warning": "⚠️ ", "info": "ℹ️ "}[issue["severity"]]
                print(f"    {icon} [{issue['check']}] {issue['msg']}")
        else:
            print("  ✅ No issues found")
        print()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Static skill assessment")
    parser.add_argument("path", nargs="?", help="Path to skill directory")
    parser.add_argument("--json", action="store_true", help="JSON output")
    parser.add_argument("--problems-only", action="store_true", help="Show only errors/warnings")
    parser.add_argument("--compare", nargs=2, metavar=("SKILL-A", "SKILL-B"), help="Compare two skills")
    parser.add_argument("--all", action="store_true", help="Analyze all skills in ~/.openclaw/skills/")
    args = parser.parse_args()

    if args.compare:
        for p in args.compare:
            analyze(p, args.json, args.problems_only)
    elif args.all:
        skills_dir = os.path.expanduser("~/.openclaw/skills/")
        if os.path.isdir(skills_dir):
            for d in os.listdir(skills_dir):
                full = os.path.join(skills_dir, d)
                if os.path.isdir(full) and not d.startswith("."):
                    analyze(full, args.json, args.problems_only)
        else:
            print(f"Skills dir not found: {skills_dir}")
    elif args.path:
        analyze(args.path, args.json, args.problems_only)
    else:
        parser.print_help()


Multi Find Skills（技能搜索全能版）

Skill

技能搜索全能版 | Multi Find Skills 触发场景：用户询问"有什么技能可以帮我..."、"找一个能做X的技能"、"有没有技能可以..."、"帮我搜一下XXX相关的技能"、"search for skill"、"找技能" 功能：三生态(ClawHub+LobeHub+skills.sh)搜索+质量核...

---
name: multi-find-skills
description: |
  技能搜索全能版 | Multi Find Skills
  触发场景：用户询问"有什么技能可以帮我..."、"找一个能做X的技能"、"有没有技能可以..."、"帮我搜一下XXX相关的技能"、"search for skill"、"找技能"
  功能：三生态(ClawHub+LobeHub+skills.sh)搜索+质量核验+安装验证+生命周期管理
  触发禁用：安装skill用clawhub install、管理已安装用openclaw skills list、创建新skill用skill-creator
author: "wangzairong (https://github.com/wangzairong)"
evolution:
  created_at: "2026-04-23"
  version: "2.6.0"
  domain: "#engineering"
  success_rate: 0
  parent_skill: ""
  last_used: ""
  use_count: 0
metadata:
  openclaw:
    emoji: "🔍"
    requires:
      bins: [clawhub]
---

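# 🔍 Multi Find Skills (All-in-One Skill Search) v2.6.0

> A skill search tool that covers three ecosystems, **ClawHub**, **LobeHub**, and **skills.sh**, with quality checks, install verification, and full lifecycle management.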

---

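## Trigger phrases

Activate this skill when the user says any of the following:

- "有什么技能可以帮我..." (What skills can help me...)
- "有没有技能可以..." (Is there a skill that can...)
- "找一个能做 X 的技能" (Find a skill that can do X)
- "有没有 X 相关的技能" (Are there any skills related to X)
- "我需要一个能...的技能" (I need a skill that can...)
- "帮我搜一下 XXX 相关的技能" (Search for skills related to XXX)
- "X 有什么技能吗" (Are there any skills for X)
- "search for skill"

## Trigger decision

### When to Use ✅

- The user asks "what skill can help me do X"
- The user says "find a skill that can do Y"
- The user asks whether a ready-made skill already exists for some capability
- The user wants to extend capabilities (design, testing, deployment, and so on)

### When NOT to Use ❌

- Installing a skill whose name is already known → use `clawhub install <skill-name>` directly
- Managing installed skills → use `openclaw skills list`
- Creating a new skill → use `skill-creator`
- Tasks unrelated to skill search → just do the task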

---

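## Core features

1. **Three-ecosystem search**: ClawHub + LobeHub + skills.sh, multiple sources
2. **Quality checks**: install-count thresholds + Stars + source reputation
3. **Install verification**: checks that the skill files exist after installation
4. **Safe re-runs**: idempotent by design; retry with `--force`
5. **Lifecycle management**: update / uninstall / list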

---

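## Workflow

### Quick flow

```
User request → extract keywords → search all sources → quality filter → sort → show results → provide install command → verify install
```

### Detailed flow (8 steps)

```
Step 1: Understand the request
  ├─ Identify the domain (React / testing / design / deployment)
  ├─ Identify the concrete task (write tests / create animations / review PRs)
  ├─ Match against the quick-reference.md keywords (skip if nothing matches)
  └─ Decide whether a skill search is appropriate

Step 2: Build the keywords
  ├─ Combine domain + concrete task → keywords
  └─ Chinese → translate to English

Step 3: Search all sources in parallel
  ├─ skills.sh: npx skills find "<keyword>"
  ├─ ClawHub: clawhub search "<keyword>"
  ├─ LobeHub: npx -y @lobehub/market-cli skills search --q "<keyword>"
  └─ Run in parallel and merge the results

Step 4: Quality filter
  ├─ Installs < 100 → ⚠️
  ├─ Stars < 5 and unknown source → ⚠️
  └─ Official / well-known vendor → 🌟 preferred

Step 5: Sort
  └─ By install count, descending

Step 6: Show results (Chinese-language format)
  └─ Name + description + installs + install command + ✅/❌

Step 7: Provide the install command
  └─ Use the command that matches the source

Step 8: Verify after install
  └─ ls ~/.openclaw/skills/<skill-name>/SKILL.md
```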

---

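## Search command cheat sheet

| Source | Command | Notes |
|------|------|------|
| skills.sh | `npx skills find "<keyword>"` | 1️⃣ First choice: high quality, updated frequently |
| ClawHub | `clawhub search "<keyword>"` | 2️⃣ Second choice: official ecosystem, strict quality control |
| ClawHub (popular) | `clawhub explore --sort installs --limit 20` | Browse popular skills by install count |
| LobeHub | `npx -y @lobehub/market-cli skills search --q "<keyword>"` | 3️⃣ Supplementary: rich in community contributions |
| OpenClaw Directory | https://www.openclawdirectory.dev/skills | Browse by category on the web |
| GitHub | Web search for `site:github.com "openclaw skill"` | Supplementary source, filter manually |

**Search order**: skills.sh → ClawHub → LobeHub → GitHub

**Timeouts**: if one source times out, continue with the others; if all of them time out, suggest visiting the websites manually.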
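
The parallel search and its timeout rule can be sketched in Python as follows. This is a minimal illustration, not the skill's actual implementation: `SOURCES`, `run_cli`, `search_all`, and the 30-second default are all assumptions introduced here; only the three CLI commands come from the table above.

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess

# Search commands taken from the cheat sheet; the keyword is appended last.
SOURCES = {
    "skills.sh": ["npx", "skills", "find"],
    "ClawHub": ["clawhub", "search"],
    "LobeHub": ["npx", "-y", "@lobehub/market-cli", "skills", "search", "--q"],
}


def run_cli(command, timeout):
    """Run one search command; return stdout, or None on timeout or failure."""
    try:
        done = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        return None
    return done.stdout if done.returncode == 0 else None


def search_all(keyword, timeout=30, runner=run_cli):
    """Query every source in parallel; a source that fails is simply skipped."""
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        futures = {
            name: pool.submit(runner, base + [keyword], timeout)
            for name, base in SOURCES.items()
        }
        results = {name: future.result() for name, future in futures.items()}
    ok = {name: out for name, out in results.items() if out is not None}
    if not ok:
        # Hypothetical message for the "all sources timed out" case.
        print("All sources timed out; try the websites manually.")
    return ok
```

Passing the keyword as a separate list element (no shell involved) also means it is never interpreted by a shell, which complements the input validation described later.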

See `references/sources.md` (search tips and a detailed description of skills.sh)

---

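## Quality checks

### Install-count thresholds

| Tier | Installs | Notes |
|------|--------|------|
| 🌟 Recommended | ≥1,000 | Mature, well validated by the community |
| ✅ Minimum bar | ≥100 | Usable; mention the risk |
| ⚠️ Use with caution | <100 | Limited community validation |

**Hard requirements**: Stars ≥ 5 and a clearly identified author or source

### Recommendation priority

1. 🌟 Official or well-known vendor + high install count (≥100K top tier, ≥10K popular)
   - vercel-labs, anthropics, microsoft, and openclaw have routine security audits
2. ✅ Community skill with ≥1K installs and a clearly identified author
3. ⚠️ Fewer than 100 installs or an unknown source: use with caution

> skills.sh install counts come from anonymous telemetry (reported automatically when a user installs a skill); see `references/sources.md`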
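
One way to read the thresholds above as code is sketched below. It is an interpretation, not the skill's own logic: the `SkillHit` record, the tier names, and the choice to treat "Stars < 5 with no known owner" as the caution case (the rule from step 4 of the workflow) are assumptions made for illustration.

```python
from dataclasses import dataclass

# Vendors named in the recommendation priority list.
TRUSTED_OWNERS = {"vercel-labs", "anthropics", "microsoft", "openclaw"}


@dataclass
class SkillHit:
    name: str
    owner: str      # empty string when the author or source is unknown
    installs: int
    stars: int


def quality_tier(hit):
    """Map a search hit to one of three tiers (hypothetical tier names)."""
    if hit.installs < 100 or (hit.stars < 5 and not hit.owner):
        return "caution"
    if hit.installs >= 1000 or hit.owner in TRUSTED_OWNERS:
        return "recommended"
    return "acceptable"


def rank(hits):
    """Sort by install count, highest first."""
    return sorted(hits, key=lambda h: h.installs, reverse=True)
```

Because the thresholds are rules of thumb, keeping them in one small function makes them easy to adjust.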

---

## Install commands

| Source | Command |
|------|------|
| **ClawHub** | `clawhub install <skill-name>` |
| **LobeHub** | `npx -y @lobehub/market-cli skills install <skill-name> --agent open-claw` |
| **skills.sh** | `npx skills add <owner/repo@skill> -g -y` |
| **GitHub** | `npx skills add <owner/repo> -g -y` |

See `references/sources.md` (single skill vs whole-repository installs)
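
Choosing the command by source (step 7) can be expressed as a small lookup. The sketch below only rearranges the commands from the table; the names `INSTALL_TEMPLATES` and `install_command` are hypothetical.

```python
# One argument list per source, mirroring the table above.
INSTALL_TEMPLATES = {
    "clawhub": ["clawhub", "install", "{skill}"],
    "lobehub": ["npx", "-y", "@lobehub/market-cli", "skills", "install",
                "{skill}", "--agent", "open-claw"],
    "skills.sh": ["npx", "skills", "add", "{skill}", "-g", "-y"],
    "github": ["npx", "skills", "add", "{skill}", "-g", "-y"],
}


def install_command(source, skill):
    """Return the install command for a source as an argument list."""
    try:
        template = INSTALL_TEMPLATES[source.lower()]
    except KeyError:
        raise ValueError(f"unknown source: {source}") from None
    return [part.replace("{skill}", skill) for part in template]
```

For display, `" ".join(install_command("ClawHub", "my-skill"))` yields the one-line form used in the output format.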

### China mirror (when a ClawHub install fails)

```bash
clawhub config set registry https://cn.clawhub-mirror.com
clawhub install <skill-name> --registry https://cn.clawhub-mirror.com
```

---

## Install verification

```bash
ls ~/.openclaw/skills/<skill-name>/SKILL.md
# File exists = install succeeded
```

- File exists → ✅ installed
- File missing → ❌ install failed; try `clawhub install <skill-name> --force`
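
The same check in Python, as a minimal sketch: `is_installed` and its `skills_dir` parameter are hypothetical names, and the default path is the `~/.openclaw/skills/` directory used throughout this document.

```python
from pathlib import Path


def is_installed(skill_name, skills_dir=None):
    """Return True when <skills_dir>/<skill_name>/SKILL.md exists."""
    root = Path(skills_dir) if skills_dir else Path.home() / ".openclaw" / "skills"
    return (root / skill_name / "SKILL.md").is_file()
```

The check is read-only, so it is safe to repeat after every install or retry.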

---

## Lifecycle management

### Update a skill

```bash
clawhub install <skill-name> --force
npx skills add <owner/repo@skill> -g -y --force
```

### Uninstall a skill

```bash
rm -rf ~/.openclaw/skills/<skill-name>/
# Verify: ls ~/.openclaw/skills/<skill-name>/SKILL.md  # should fail with "No such file or directory"
```

### List installed skills

```bash
openclaw skills list
ls ~/.openclaw/skills/
```

---

## Output format

```
🔍 "【需求关键词】" 搜索结果（共 X 个来源）：

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📦 skills.sh（高质量首选）
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. 【skill-name】
   总安装量：XXX,XXX | 24h 安装量：X,XXX ↑ | 排名：#XX
   安装命令：npx skills add owner/repo@skill -g -y

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📦 ClawHub（官方生态）
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. 【skill-name】
   下载：X,XXX | Stars: XX | ✅ 已安装 / ❌ 未安装
   安装命令：clawhub install skill-name

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🌐 LobeHub（社区市场）
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. 【skill-name】
   安装量：XX,XXX
   安装命令：npx -y @lobehub/market-cli skills install <skill-name> --agent open-claw

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🐙 GitHub（补充来源）
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. 【owner/repo】
   描述：【描述】
   Stars: XX | 安装：npx skills add owner/repo -g -y

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
💡 推荐：【最推荐的技能】，理由：【一句话】
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

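**Notes**:
- 总安装量 (total installs): All Time cumulative installs (larger means more mature)
- 24h 安装量 (24h installs): Trending 24h new installs (larger means more popular right now)
- 排名 (rank): current position on the skills.sh leaderboard

See `references/sources.md` (install-count fields and install command syntax)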

---

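## Safety mechanisms

### Input validation

```bash
# Maximum length: 100 characters (message: "keyword too long")
[[ ${#keyword} -gt 100 ]] && echo "关键词过长" && exit 1

# Reject dangerous characters to prevent injection (message: "contains illegal characters")
[[ "$keyword" =~ [[:space:]]*[\;\|\`\"\'\\] ]] && echo "包含非法字符" && exit 1

# Minimum length: 2 characters (message: "keyword needs at least 2 characters")
[[ ${#keyword} -lt 2 ]] && echo "关键词至少2个字符" && exit 1
```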
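
The same three rules can be written in Python, for callers that do not go through a shell. This is a sketch under the limits stated above (2 to 100 characters, no `;`, `|`, backtick, quote, or backslash); `validate_keyword`, `FORBIDDEN`, and the English messages are hypothetical.

```python
# Characters rejected by the injection guard above.
FORBIDDEN = set(";|`\"'\\")


def validate_keyword(keyword):
    """Return an error message, or None when the keyword is acceptable."""
    if len(keyword) > 100:
        return "keyword too long (max 100 characters)"
    if len(keyword) < 2:
        return "keyword too short (min 2 characters)"
    if FORBIDDEN & set(keyword):
        return "keyword contains a forbidden character"
    return None
```

`len()` counts Unicode code points, so a two-character Chinese keyword passes the minimum-length rule.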

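### Idempotency and safe re-runs

| Operation | Idempotent | Safe to re-run |
|------|--------|---------|
| Search | ✅ Naturally idempotent | ✅ Safe, repeatable |
| Install | ✅ Idempotent (with `--force`) | ✅ Supported, use `--force` |
| Verify | ✅ Idempotent | ✅ Safe, read-only |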

---

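## Troubleshooting

| Problem | Fix |
|------|---------|
| No results from ClawHub | Check the network, try English keywords, browse popular skills with `clawhub explore` |
| No results from skills.sh | Confirm `npx` is available, check the keyword spelling |
| Rate limited | Wait 1 hour, use an alternative source |
| Install fails | Check the path, confirm `~/.openclaw/skills/` is writable |
| Verification fails | Reinstall: `clawhub install <skill-name> --force` |

See `references/troubleshooting.md` for detailed fixes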

---

## When no suitable skill is found

```
我搜索了 "【关键词】" 相关的 skills，但没有找到合适的匹配。

我仍然可以直接帮你完成这个任务！要我试试吗？

如果经常需要这个功能，可以考虑创建自定义 skill：
npx skills init my-custom-skill
```

---

## Related skills

- `clawhub` - the ClawHub CLI
- `skill-creator` - create new skills
- `healthcheck` - system health check

---

*🔍 Multi Find Skills (All-in-One Skill Search) v2.6.0*
FILE:CLAWHUB_PUBLISH.md
# CLAWHUB_PUBLISH.md

## Release information

| Field | Value |
|------|-----|
| **Slug** | multi-find-skills |
| **Name** | Multi Find Skills（技能搜索全能版） |
| **Version** | 2.6.0 |
| **Author** | wangzairong |
| **Tags** | search, skill, finder, clawhub, skills-sh, openclaw, multi-source |
| **Category** | utilities |

## Description (shown on ClawHub)

中文：同时支持 ClawHub、LobeHub 和 skills.sh 三大生态的技能搜索工具，具备质量核验、安装验证和完整生命周期管理功能。支持6个搜索来源。

English: All-in-One Skill Finder for the ClawHub, LobeHub, and skills.sh ecosystems with quality checks, install verification, and full lifecycle management. Supports 6 search sources.

## Notes for users in China

In China, install this skill through the ClawHub mirror:

```bash
clawhub install multi-find-skills --registry https://cn.clawhub-mirror.com
```

Or set the mirror as the default registry:

```bash
clawhub config set registry https://cn.clawhub-mirror.com
```

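## Dependencies

This skill depends on the following CLI tools:

| Tool | Install | Required |
|------|---------|------|
| `clawhub` | `npm i -g clawhub` | ✅ |
| `npx` | Bundled with Node.js | ✅ |

If installing clawhub from China fails, use the mirror.

## About the quality thresholds

The install-count thresholds (≥100 / ≥1,000) and the Stars threshold (≥5) in this skill's documentation are rules of thumb, not checks enforced by the platform. When choosing a skill, also weigh its last update date, community feedback, and similar signals.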

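## Features

- **Three-ecosystem search**: ClawHub + LobeHub + skills.sh, 6 sources queried in parallel
- **Quality checks**: install-count thresholds + Stars + source reputation
- **Install verification**: checks that the skill files exist after installation
- **Safe re-runs**: idempotent by design; retry safely with `--force`
- **Lifecycle management**: update / uninstall / list
- **Input validation**: keyword length and dangerous-character checks

## Support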

- **GitHub Issues**: https://github.com/wangzairong/skills
- **Discord**: https://discord.com/invite/clawd
- **ClawHub**: https://clawhub.ai/skills/multi-find-skills

## 更新日志

### v2.6.0
- 优化：quick-reference.md 全面升级（新增热门技能速查、3个新分类、扩展关键词）
- 优化：匹配 quick-reference.md 关键词移至第1步（理解需求时同步匹配），未匹配则跳过
- 优化：术语统一（搜索词 → 关键词），全流程一致
- 优化：搜索命令速查体现搜索顺序（1️⃣2️⃣3️⃣标注）
- 新增：输出格式中 skills.sh 新增"总安装量/24h安装量/排名"三字段
- 新增：输出格式新增 GitHub 渠道格式（owner/repo + Stars + 安装命令）
- 优化：LobeHub 输出格式改用安装命令替代链接
- 精简：整合 skills.sh 排名机制到推荐优先级判断
- 迁移：搜索技巧、安装命令语法、安装量字段说明移至 sources.md
- 更新：SKILL.md 正文只保留快速参考，详细内容引导至 sources.md

### v2.0.0
- 优化token成本：正文精简，详细内容移至 references/
- 新增生命周期管理：update/uninstall/list 指令
- 新增安全重跑机制：幂等性说明 + --force 安全重跑
- 新增输入验证：关键词长度+危险字符校验
- 触发词扩展到 frontmatter description
- 保留原始完整触发词、工作流、全6来源输出格式
- 补回搜索策略（按功能/提供商/热度）+ 查看详情 Step 3
- 补回核心/集成/Agent技能分类推荐列表
- 补回最佳实践（5条指导原则）

### v1.2.0
- 扩充搜索来源为 6 个独立来源
- 新增"搜索策略"章节（按功能/提供商/热度）
- 新增"最佳实践"、"安装提示"章节

### v1.1.0
- 统一安装路径为 `~/.openclaw/skills/`
- 添加质量门槛标注（可灵活调整）

### v1.0.0
- 双生态搜索（ClawHub + skills.sh）
- 质量核验流程
- 安装验证机制
- 整合自 eaton/find-skills-v2 + fangkelvin/find-skills-skill + guohongbin-git/skill-finder-cn

---

*本文件为 ClawHub 发布专用补充文档，不影响技能核心功能。*

FILE:references/changelog.md
# 更新日志

## v2.6.0
- 📝 优化：quick-reference.md 全面升级（新增热门技能速查、3个新分类、扩展关键词）
- 📝 优化：匹配 quick-reference.md 关键词移至第1步（理解需求时同步匹配，未匹配则跳过）
- 📝 优化：术语统一（搜索词 → 关键词），全流程一致
- 📝 优化：搜索命令速查体现搜索顺序（1️⃣2️⃣3️⃣标注）
- 📝 新增：输出格式中 skills.sh 新增"总安装量/24h安装量/排名"三字段
- 📝 新增：输出格式新增 GitHub 渠道格式（owner/repo + Stars + 安装命令）
- 📝 优化：LobeHub 输出格式改用安装命令替代链接
- 📝 精简：整合 skills.sh 排名机制到推荐优先级判断
- 📝 迁移：搜索技巧、安装命令语法、安装量字段说明移至 sources.md
- 📝 更新：SKILL.md 正文只保留快速参考，详细内容引导至 sources.md

## v2.5.0
- 📝 优化：LobeHub 输出格式改用安装命令替代链接
- 📝 精简：整合 skills.sh 排名机制到推荐优先级判断
- 📝 迁移：搜索技巧、安装命令语法、安装量字段说明移至 sources.md
- 📝 更新：SKILL.md 正文只保留快速参考，详细内容引导至 sources.md
- 📝 精简：合并重复内容（搜索策略章节精简）
- 📝 优化：安装命令章节增加"单个 vs 仓库"对比
- 📝 整理：分区更清晰（触发判断/核心功能/工作流程/搜索命令/质量核验/安装/生命周期/输出格式/安全机制/故障排除）

## v2.4.0
- 📝 重构：工作流程全步骤详解（第2-8步）
- 📝 新增：第2步详解（搜索词构成：领域+任务组合）
- 📝 新增：第3步详解（搜索顺序建议、超时处理）
- 📝 新增：第4步详解（筛选规则、检查项阈值）
- 📝 新增：第5-7步详解（输出格式示例）
- 📝 新增：第8步详解（安装验证命令与结果判断）

## v2.3.0
- 📝 重构：工作流程细化，详解第1步（理解用户需求）
- 📝 新增：判断是否搜索技能的决策表
- 📝 新增：关键词匹配优先级（参考 quick-reference.md）
- 📝 新增：用户需求 → 搜索关键词 转换示例

## v2.2.0
- 📝 新增：skills.sh 高级用法（--list 预览、仓库 vs 单个 skill）
- 📝 新增：skills.sh 排行榜分类说明（All Time / Trending 24h / Hot）
- 📝 新增：skills.sh 排名机制解释（匿名遥测数据）
- 📝 更新：质量核验章节，区分 ClawHub vs skills.sh 评估标准

## v2.1.0
- 🔧 修复：`clawhub search --sort` 不存在，改为 `clawhub explore --sort`
- 🔧 修复：LobeHub 命令从 `skills search "keyword"` 改为 `skills search --q "keyword"`
- 🗑️ 删除：无效引用 `find-skills (eaton)`（该技能不存在）
- 🗑️ 删除：排行榜优先逻辑（skills.sh leaderboard 返回 HTML 无法解析）
- 🗑️ 删除：装饰性技能分类列表（核心/集成/Agent技能）
- 📝 精简：移除与 references/ 重复的内容，正文只保留快速参考
- 📝 更新：搜索来源说明，明确标注 CLI vs 网页搜索

## v2.0.0
- ✅ 优化token成本：正文精简，详细内容移至references/
- ✅ 新增生命周期管理：update/uninstall/list指令
- ✅ 新增安全重跑机制：幂等性说明 + --force安全重跑
- ✅ 新增输入验证：关键词长度+危险字符校验
- ✅ 触发词扩展到frontmatter description
- ✅ 保留原始完整触发词、工作流、全6来源输出格式
- ✅ 补回搜索策略（按功能/提供商/热度）+ 查看详情Step 3
- ✅ 补回核心/集成/Agent技能分类推荐列表
- ✅ 补回最佳实践（5条指导原则）

## v1.2.0
- ✅ 扩充"技能搜索来源"为6个独立来源
- ✅ 新增"搜索策略"章节（按功能/提供商/热度）
- ✅ 新增"最佳实践"章节
- ✅ 新增"相关技能"章节
- ✅ 新增"安装提示"章节

## v1.1.0
- ✅ 统一安装路径为 `~/.openclaw/skills/`
- ✅ 添加 "When NOT to Use" 段落
- ✅ 添加 "找不到时" 指引
- ✅ 质量门槛标注"可灵活调整"

## v1.0.0
- ✅ 双生态搜索（ClawHub + skills.sh）
- ✅ 质量核验流程（安装量/Stars/来源）
- ✅ 安装验证机制
- ✅ 双语输出格式

FILE:references/quick-reference.md
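# Quick reference: keywords for common requests

> Used in step 1 ("match against the quick-reference.md keywords") to quickly identify the domain keywords in a user's request.

---

## 🔥 Popular skills (highest install counts first)

| Skill | Installs | Source | Matching requests |
|--------|--------|------|----------|
| `find-skills` | 1.2M | vercel-labs | "找技能" (find a skill), "搜索 skill" (search for a skill) |
| `vercel-react-best-practices` | 351K | vercel-labs | React best practices |
| `frontend-design` | 340K | anthropics | Frontend development, UI design |
| `soultrace` | 318K | soultrace-ai | Agent personality, memory management |
| `web-design-guidelines` | 280K | vercel-labs | Web design, design guidelines |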

---

## 前端 / React

| 需求 | 搜索关键词 |
|------|-----------|
| React开发 | `react`, `react hooks`, `react component` |
| Next.js | `nextjs`, `next.js`, `ssr` |
| 前端样式 | `css`, `tailwind`, `styled-components` |
| 前端测试 | `react testing`, `playwright`, `e2e` |
| 响应式设计 | `responsive`, `mobile`, `adaptive` |
| 前端设计 | `frontend-design`, `ui design` |
| React最佳实践 | `vercel-react-best-practices` |

---

## 测试

| 需求 | 搜索关键词 |
|------|-----------|
| 单元测试 | `testing`, `jest`, `vitest` |
| E2E测试 | `playwright`, `cypress`, `e2e testing` |
| 集成测试 | `integration testing`, `api testing` |
| 性能测试 | `performance testing`, `load testing`, `k6` |
| 自动化测试 | `automation testing`, `test automation` |

---

## DevOps / 部署

| 需求 | 搜索关键词 |
|------|-----------|
| Docker | `docker`, `container`, `dockerfile` |
| Kubernetes | `kubernetes`, `k8s`, `helm` |
| CI/CD | `deploy`, `ci-cd`, `github actions`, `gitlab ci` |
| 监控 | `monitoring`, `logging`, `prometheus` |
| 云部署 | `aws`, `azure`, `gcp`, `vercel`, `netlify` |
| Azure | `microsoft-foundry`, `azure-ai` |
| 自动化部署 | `deployment automation`, `cd pipeline` |

---

## 文档生成

| 需求 | 搜索关键词 |
|------|-----------|
| README生成 | `readme`, `documentation`, `docs generator` |
| API文档 | `api-docs`, `swagger`, `openapi` |
| Changelog | `changelog`, `release notes` |
| 代码文档 | `jsdoc`, `typedoc`, `sphinx` |
| 技术文档 | `technical writing`, `docs as code` |

---

## 代码质量

| 需求 | 搜索关键词 |
|------|-----------|
| 代码审查 | `code review`, `lint`, `eslint` |
| 代码重构 | `refactor`, `clean code` |
| 代码格式化 | `prettier`, `format`, `formatter` |
| 静态分析 | `static analysis`, `sonarqube` |
| 类型检查 | `typescript`, `type checking`, `flow` |

---

## 设计 / UX

| 需求 | 搜索关键词 |
|------|-----------|
| UI设计 | `ui design`, `component library`, `design system` |
| UX优化 | `ux`, `user experience`, `usability` |
| 可访问性 | `accessibility`, `a11y`, `wcag` |
| 图标 | `icon`, `emoji`, `illustration` |
| 设计规范 | `web-design-guidelines`, `design guidelines` |
| Figma | `figma`, `figma design`, `ui prototyping` |

---

## SEO / 营销

| 需求 | 搜索关键词 |
|------|-----------|
| SEO审计 | `seo audit`, `seo analysis` |
| SEO优化 | `seo`, `search engine optimization` |
| 内容营销 | `content marketing`, `copywriting` |
| 社交媒体 | `social media`, `twitter`, `discord` |
| 广告投放 | `paid ads`, `google ads`, `facebook ads` |

---

## 数据库 / 后端

| 需求 | 搜索关键词 |
|------|-----------|
| 数据库设计 | `database design`, `sql`, `postgresql` |
| API开发 | `api`, `rest api`, `graphql` |
| 认证授权 | `auth`, `oauth`, `jwt`, `authentication` |
| 缓存 | `cache`, `redis`, `memcached` |
| 后端框架 | `nodejs`, `express`, `fastapi`, `django` |
| ORM | `orm`, `prisma`, `sqlalchemy` |

---

## 移动端

| 需求 | 搜索关键词 |
|------|-----------|
| React Native | `react native`, `mobile development` |
| Flutter | `flutter`, `cross-platform` |
| iOS开发 | `ios`, `swift`, `xcode` |
| Android开发 | `android`, `kotlin`, `java` |
| 移动端测试 | `mobile testing`, `appium` |

---

## AI / Agent

| 需求 | 搜索关键词 |
|------|-----------|
| LLM集成 | `llm`, `openai`, `claude`, `gpt` |
| Prompt工程 | `prompt engineering`, `prompt template` |
| Agent开发 | `agent`, `autonomous`, `multi-agent` |
| 向量数据库 | `vector db`, `pinecone`, `chroma` |
| RAG | `rag`, `retrieval`, `knowledge retrieval` |
| Agent个性 | `soultrace`, `agent personality`, `memory` |

---

## 效率工具 / 自动化

| 需求 | 搜索关键词 |
|------|-----------|
| 效率工具 | `productivity`, `automation`, `workflow` |
| 任务管理 | `task management`, `todo`, `project` |
| 笔记 | `note taking`, `knowledge base`, `obsidian` |
| 定时任务 | `cron`, `scheduler`, `job queue` |
| 工作流自动化 | `workflow automation`, `n8n`, `make` |

---

## 视频 / 媒体

| 需求 | 搜索关键词 |
|------|-----------|
| 视频生成 | `video generation`, `remotion`, `animation` |
| 图片生成 | `image generation`, `midjourney`, `dalle` |
| 音频处理 | `audio`, `speech to text`, `tts` |
| 媒体处理 | `media processing`, `ffmpeg` |

---

## 数据处理

| 需求 | 搜索关键词 |
|------|-----------|
| 数据分析 | `data analysis`, `pandas`, `jupyter` |
| 数据可视化 | `visualization`, `charts`, `dashboard` |
| ETL | `etl`, `data pipeline`, `airflow` |
| 数据工程 | `data engineering`, `spark`, `kafka` |

---

## 安全

| 需求 | 搜索关键词 |
|------|-----------|
| 安全审计 | `security audit`, `penetration testing` |
| 密钥管理 | `secret management`, `vault`, `aws secrets` |
| 依赖扫描 | `dependency scanning`, `snyk`, `renovate` |
| 安全加固 | `security hardening`, `owasp` |

---

## 平台集成

| 平台 | 搜索关键词 |
|------|-----------|
| GitHub | `github`, `github api`, `github actions` |
| 飞书 | `feishu`, `lark`, `飞书` |
| Notion | `notion`, `notion api` |
| Slack | `slack`, `slack bot` |
| Discord | `discord`, `discord bot` |
| Twitter/X | `twitter`, `x api`, `tweet` |
| 微信 | `wechat`, `wechat bot`, `wechat mp` |

---

## 多 Agent / 协作

| 需求 | 搜索关键词 |
|------|-----------|
| 多Agent系统 | `multi-agent`, `agent orchestration` |
| Agent通信 | `agent communication`, `agent protocol` |
| 任务编排 | `task orchestration`, `workflow` |
| Agent协作 | `agent collaboration`, `multi-agent ai` |

---

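## Search strategy summary

1. **Specific first, then general**: search for specific keywords first and widen the scope only if nothing is found
2. **Try both languages**: translate Chinese requests into English before searching
3. **Expand synonyms**: try several synonyms for the same concept
4. **Combine official and community**: prefer official packages, fill gaps with community ones
5. **Prefer popular skills**: match high-install-count skills first (see the 🔥 popular skills table at the top)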

---

## Keyword → skill quick mapping

| Keyword in the request | Matching skill |
|----------------|----------|
| `find skill`, `搜技能` | `find-skills` (1.2M) |
| `frontend`, `前端` | `frontend-design` (340K) |
| `react best`, `React最佳实践` | `vercel-react-best-practices` (351K) |
| `web design`, `Web设计` | `web-design-guidelines` (280K) |
| `agent personality`, `Agent个性` | `soultrace` (318K) |
| `video`, `动画` | `remotion-best-practices` (269K) |
| `azure`, `microsoft云` | `microsoft-foundry` (254K) |
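
The mapping table can be applied with a plain substring lookup, sketched below. The dictionary holds a subset of the rows above; `KEYWORD_TO_SKILL` and `match_known_skill` are hypothetical names, and substring matching is a simplifying assumption rather than the skill's documented matching rule.

```python
# Keyword → skill pairs copied from the mapping table (subset).
KEYWORD_TO_SKILL = {
    "find skill": "find-skills",
    "搜技能": "find-skills",
    "frontend": "frontend-design",
    "前端": "frontend-design",
    "react best": "vercel-react-best-practices",
    "web design": "web-design-guidelines",
    "agent personality": "soultrace",
    "video": "remotion-best-practices",
    "azure": "microsoft-foundry",
}


def match_known_skill(request):
    """Return the first mapped skill whose keyword occurs in the request, else None."""
    text = request.lower()
    for keyword, skill in KEYWORD_TO_SKILL.items():
        if keyword in text:
            return skill
    return None
```

A `None` result corresponds to the "skip if nothing matches" branch of step 1, after which the normal search proceeds.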
FILE:references/sources.md
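# Search sources in detail

## 1. ClawHub (official ecosystem)

```bash
# Search skills (vector search)
clawhub search "<keyword>"

# Browse popular skills (sortable)
clawhub explore --sort installs --limit 20   # by install count
clawhub explore --sort rating --limit 20     # by rating
clawhub explore --sort trending --limit 20   # by trend

# Show details
clawhub inspect <skill-name>
```

**Characteristics**: official ecosystem, strict quality control, standardized SKILL.md format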

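## 2. skills.sh (Vercel ecosystem, high-quality first choice)

### Basic commands

```bash
# Search skills
npx skills find "<keyword>"

# Browse all available skills
npx skills add vercel-labs/agent-skills --list
```

### Advanced usage

| Command | Notes |
|------|------|
| `npx skills find "<keyword>"` | Search skills |
| `npx skills add owner/repo --list` | Preview available skills (without installing) |
| `npx skills list` | List installed skills |
| `npx skills list -g` | List globally installed skills |
| `npx skills find "popular"` | Browse popular skills |

### Leaderboard categories

| Category | Meaning | Use it to |
|------|------|---------|
| **All Time** | Ranked by total installs | Find mature, time-tested skills |
| **Trending (24h)** | Installs in the last 24 hours | Spot the latest trends |
| **Hot** | Real-time popularity | Follow what is popular right now |

See the full leaderboard at https://skills.sh

### Ranking mechanism

skills.sh install counts come from **anonymous telemetry** (reported automatically on install, aggregated, with no personal tracking):

- ≥100K → top tier | ≥10K → popular
- Official skills (vercel-labs, anthropics, microsoft) have routine security audits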
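
Install counts are shown in abbreviated form (for example `1.2M` or `340K`), so comparing them against the 100K and 10K cut-offs needs a small conversion. The sketch below is illustrative: `parse_installs`, `popularity_label`, and the `"unranked"` label are hypothetical names, not part of skills.sh.

```python
def parse_installs(text):
    """Convert strings such as '1.2M', '340K' or '12,345' to an integer."""
    cleaned = text.strip().replace(",", "").upper()
    multiplier = 1
    if cleaned.endswith("K"):
        multiplier, cleaned = 1_000, cleaned[:-1]
    elif cleaned.endswith("M"):
        multiplier, cleaned = 1_000_000, cleaned[:-1]
    # round() absorbs floating-point noise such as 1.2 * 1_000_000.
    return round(float(cleaned) * multiplier)


def popularity_label(installs):
    """Apply the cut-offs listed above: >=100K is top tier, >=10K is popular."""
    if installs >= 100_000:
        return "top"
    if installs >= 10_000:
        return "hot"
    return "unranked"
```

Parsing to integers first also makes the counts usable as a sort key when ordering results by installs.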

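**Characteristics**: built by Vercel, active community, frequent updates

## 3. LobeHub (community marketplace)

```bash
# Search skills
npx -y @lobehub/market-cli skills search --q "<keyword>"
```

**Characteristics**: community-contributed, broad catalog, good for discovering innovative or experimental skills

## 4. OpenClaw Directory

- Website: https://www.openclawdirectory.dev/skills
- Search by category, popularity, or keyword
- Detail pages show install counts and descriptions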

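## 5. GitHub (supplementary source)

```text
# Web search queries (enter these in a search engine, not a shell)
site:github.com "openclaw skill"
site:github.com "SKILL.md"
```

**Recommended sources**:
- `vercel-labs/agent-skills` - official Vercel repository
- `anthropics/` - Anthropic-related
- `microsoft/` - Microsoft-related
- `openclaw/` - official OpenClaw organization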

## 6. Community forums

- SitePoint: https://www.sitepoint.com/community/
- Discord: https://discord.com/invite/clawd

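## Search tips

1. **Use specific keywords**: `"react testing"` works better than `"testing"` alone
2. **Try alternative terms**: if `"deploy"` returns nothing, try `"deployment"`
3. **Check popular sources**: vercel-labs, anthropics, microsoft, openclaw
4. **Translate Chinese keywords**: understand the request in Chinese first, then translate it into English to search
5. **Search order**: skills.sh (first choice) → ClawHub → LobeHub → GitHub
6. **Timeout handling**: if a single source times out, continue with the others; if all of them time out, suggest visiting the websites manually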
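
The search order and timeout handling in tips 5 and 6 can be sketched roughly as follows. The function name and the stub sources are hypothetical; a real implementation would wrap the CLI calls listed earlier. The sketch also assumes that the first source returning a non-empty result wins, which the text does not specify.

```python
from typing import Callable, Dict, List


def search_with_fallback(
    query: str, sources: Dict[str, Callable[[str], List[str]]]
) -> List[str]:
    """Try each source in insertion order, skipping sources that time out."""
    timed_out = []
    for name, search in sources.items():
        try:
            results = search(query)
        except TimeoutError:
            timed_out.append(name)  # one source timed out: move on to the next
            continue
        if results:
            return results
    if sources and len(timed_out) == len(sources):
        # Every source timed out: the caller should suggest manual browsing.
        raise TimeoutError("all sources timed out; visit the websites manually")
    return []


def slow_source(_query: str) -> List[str]:
    raise TimeoutError  # stub standing in for a source that times out


def clawhub_stub(query: str) -> List[str]:
    return [f"{query}-skill"]  # stub standing in for a real search call


sources = {"skills.sh": slow_source, "ClawHub": clawhub_stub}
print(search_with_fallback("weather", sources))  # ['weather-skill']
```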

---

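## Appendix: output format fields

### skills.sh install count fields

| Field | Description |
|------|------|
| Total installs | Cumulative All Time installs (higher means more mature) |
| 24h installs | New installs in the Trending 24h window (higher means more popular) |
| Rank | Current position on the skills.sh leaderboard |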

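### Install command syntax

```bash
# Install a specific skill
npx skills add owner/repo@skill-name -g -y

# Install an entire skill repository
npx skills add owner/repo -g -y

# Flags
# -g: global install (user level)
# -y: skip confirmation prompts
# @skill-name: select a specific skill (omit to install the whole repository)
```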
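
To make the `owner/repo@skill-name` format concrete, here is a minimal parsing sketch. `SkillSpec` and `parse_skill_spec` are hypothetical names used for illustration only; this is not how the `skills` CLI itself parses its arguments.

```python
from typing import NamedTuple, Optional


class SkillSpec(NamedTuple):
    owner: str
    repo: str
    skill: Optional[str]  # None means "the whole repository"


def parse_skill_spec(spec: str) -> SkillSpec:
    """Split 'owner/repo' or 'owner/repo@skill-name' into its parts."""
    repo_part, _, skill = spec.partition("@")
    owner, sep, repo = repo_part.partition("/")
    if not (owner and sep and repo):
        raise ValueError(f"expected owner/repo[@skill-name], got {spec!r}")
    return SkillSpec(owner, repo, skill or None)


print(parse_skill_spec("vercel-labs/agent-skills@seo-best-practices"))
print(parse_skill_spec("vercel-labs/agent-skills"))
```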

### LobeHub install command

```bash
npx -y @lobehub/market-cli skills install <skill-name> --agent open-claw
```
FILE:references/troubleshooting.md
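# Troubleshooting guide

## ClawHub search returns no results

**Causes**:
- Network connectivity problems
- Keywords not specific enough
- Rate limit triggered

**Solutions**:
```bash
# 1. Check the network
curl -I https://clawhub.ai

# 2. Try different keywords (synonyms or English)
clawhub search "weather"          # try English
clawhub search "tavily search"    # try a specific name

# 3. Browse popular skills to discover other options
clawhub explore --sort installs --limit 20

# 4. Search manually on the website (`open` is macOS; use `xdg-open` on Linux)
open https://clawhub.ai
```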

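## skills.sh search returns no results

**Causes**:
- `npx` is not available
- Network problems
- Incorrect package name or keyword

**Solutions**:
```bash
# 1. Confirm npx is available
npx --version

# 2. Try a more general keyword
npx skills find "web"

# 3. Use the full package name
npx skills find "vercel-labs/agent-skills@seo-best-practices"
```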

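## LobeHub search returns no results

**Causes**:
- `npx` is not available
- Network problems

**Solutions**:
```bash
# 1. Confirm npx is available
npx --version

# 2. Try searching on the web (quote the URL so the shell does not interpret ? and <>)
open "https://lobehub.com/skills?q=<keyword>"
```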

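## Rate limiting

**Cause**: the ClawHub API rate limit was triggered

**Solutions**:
```bash
# 1. Wait 1 hour and try again
sleep 3600

# 2. Use an alternative source (website search); quote the URL for the shell
open "https://clawhub.ai/search?q=<keyword>"

# 3. Search GitHub manually
open "https://github.com/search?q=openclaw+skill+<keyword>"
```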
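
When the keyword contains spaces or non-ASCII characters, the fallback URLs above need URL encoding. A small sketch using only the standard library follows; `fallback_search_urls` is a hypothetical helper, and the URL patterns are the ones shown in the block above.

```python
from typing import List
from urllib.parse import quote_plus


def fallback_search_urls(keyword: str) -> List[str]:
    """Build the website and GitHub search URLs for a keyword."""
    q = quote_plus(keyword)  # encodes spaces as '+' and non-ASCII as %XX
    return [
        f"https://clawhub.ai/search?q={q}",
        f"https://github.com/search?q=openclaw+skill+{q}",
    ]


for url in fallback_search_urls("web scraping"):
    print(url)
```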

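## Installation fails

**Causes**:
- Network problems
- Permission problems
- Wrong package name
- Installed to the wrong path

**Solutions**:
```bash
# 1. Check that ~/.openclaw/skills/ is writable
ls -la ~/.openclaw/skills/

# 2. Retry with --force
clawhub install <skill-name> --force

# 3. Confirm the package name is correct
clawhub search "<keyword>"  # confirm the exact package name

# 4. Use the China mirror (for network problems)
clawhub install <skill-name> --registry https://cn.clawhub-mirror.com
```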

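## Post-install verification fails

**Causes**:
- The installation actually failed without reporting an error
- The skill was installed to the wrong path

**Solutions**:
```bash
# 1. Check whether the skill landed in the correct path
ls ~/.openclaw/skills/<skill-name>/SKILL.md  # correct path
ls ~/.openclaw/<skill-name>/SKILL.md         # wrong path

# 2. If it was installed to the wrong path, move it manually
mv ~/.openclaw/<skill-name> ~/.openclaw/skills/<skill-name>/

# 3. Reinstall
clawhub install <skill-name> --force
```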
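
The path check in step 1 can also be scripted. The sketch below assumes the directory layout shown above (`~/.openclaw/skills/<skill-name>/SKILL.md` as the correct location); `locate_skill` and its return labels are hypothetical.

```python
from pathlib import Path


def locate_skill(name: str, root: Path) -> str:
    """Report where SKILL.md for `name` ended up under the OpenClaw root."""
    if (root / "skills" / name / "SKILL.md").is_file():
        return "ok"  # installed in the correct path
    if (root / name / "SKILL.md").is_file():
        return "misplaced"  # installed one level too high; move it manually
    return "missing"  # not found in either location; reinstall


print(locate_skill("my-skill", Path.home() / ".openclaw"))
```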

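## Search results are low quality

**Causes**:
- Keywords are too broad
- No relevant skill actually exists

**Solutions**:
```bash
# 1. Use more specific keywords
clawhub search "seo audit"       # more specific than "seo"
clawhub search "web scraping"     # more specific than "web"

# 2. Try several related keywords
clawhub search "weather"
clawhub search "weather forecast"
clawhub search "weather api"

# 3. Browse popular skills for ideas
clawhub explore --sort installs --limit 20

# 4. If there really is no suitable match, tell the user so explicitly
```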

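## Invalid command options

**Symptom**: running a command fails with an error such as `unknown option '--sort'`

**Cause**: the command was given an option it does not support

**Solution**:
```bash
# Sorting belongs to the explore command, not search
clawhub explore --sort installs --limit 20  # ✅ correct

# Not this
clawhub search --sort installs              # ❌ wrong
```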
