@clawhub-wangzairong-b68a57dde8
---
name: multi-skill-eval
description: |
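Integrated multi-method skill evaluation system. Combines static analysis (skill-assessment), rubric-based quality scoring (skill-evaluator), and autonomous benchmarking (skill-eval). Use it to comprehensively evaluate, compare, audit, or improve OpenClaw skills. Covers documentation completeness, code quality, 25-criterion rubric scoring, and multi-model benchmarking.
Trigger phrases (Chinese): 评估技能、技能评分、技能审计、对比技能、批量评估、技能质量检查、静态分析、基准测试、触发词检测、幽灵工具检测
Trigger phrases (English): evaluate skill, compare skills, audit skill, benchmark skill, static analysis, skill quality, skill assessment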
Use when you need to evaluate, audit, benchmark, or improve OpenClaw skills.
---
# Multi-Skill-Eval v1.0.1
## Integrated Multi-Method Skill Evaluation System
Combines three evaluation approaches into one unified system:
1. **Skill Assessment** — lightweight static analysis (fast, automated)
2. **Skill Evaluator** — 25-criterion rubric scoring (ISO 25010, OpenSSF, Shneiderman)
3. **Skill-Eval** — autonomous benchmark evaluation with skill card generation
---
## 🚀 Quick Start
```bash
# Full evaluation (all three methods)
multi-skill-eval ~/.openclaw/skills/my-skill
# Quick static analysis
multi-skill-eval ~/.openclaw/skills/my-skill --method quick
# Full evaluation with a detailed report
multi-skill-eval ~/.openclaw/skills/my-skill --method full
# Compare two skills
multi-skill-eval --compare skill-a skill-b
# Batch-evaluate all local skills
multi-skill-eval --all
# Run the benchmark with a specific model
multi-skill-eval ~/.openclaw/skills/my-skill --method benchmark --model minimax/MiniMax-M2
```
---
## Three Evaluation Methods
### Method 1: Static Analysis (fast, about 30 seconds)
Lightweight automated checks covering four dimensions:
```bash
python3 scripts/static-analyze.py ~/.openclaw/skills/my-skill
python3 scripts/static-analyze.py ~/.openclaw/skills/my-skill --json # machine-readable output
```
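**Checks:**
- Documentation completeness (SKILL.md, description quality, examples)
- Code quality and security signals (script syntax, error handling)
- Configuration friendliness (documented environment variables, clear defaults)
- Maintainability signals (versioning, recent updates)
**Output:** a 0-100 score plus a list of issues grouped by severity.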
---
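### Method 2: Rubric Scoring (detailed, about 10 minutes)
25 criteria across 8 categories, combining automated checks with manual review.
**Run the automated structural checks:**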
```bash
python3 scripts/eval-skill.py ~/.openclaw/skills/my-skill --json --verbose
```
**Then score manually** using `references/rubric.md`.
#### The 25 Criteria (8 Categories)
| # | Category | Framework | Criteria |
|---|----------|-----------|----------|
| 1 | Functional Suitability | ISO 25010 | Completeness, Correctness, Appropriateness |
| 2 | Reliability | ISO 25010 | Fault Tolerance, Error Reporting, Recoverability |
| 3 | Performance / Context | ISO 25010 + Agent | Token Cost, Execution Efficiency |
| 4 | Usability — AI Agent | Shneiderman, Gerhardt-Powals | Learnability, Consistency, Feedback, Error Prevention |
| 5 | Usability — Human | Tognazzini, Norman | Discoverability, Forgiveness |
| 6 | Security | ISO 25010 + OpenSSF | Credentials, Input Validation, Data Safety |
| 7 | Maintainability | ISO 25010 | Modularity, Modifiability, Testability |
| 8 | Agent-Specific | Novel | Trigger Precision, Progressive Disclosure, Composability, Idempotency, Escape Hatches |
**Scoring:** Each criterion 0–4. Total 100 max.
| Score | Verdict | Action |
|-------|---------|--------|
| 90–100 | Excellent | Publish confidently |
| 80–89 | Good | Publishable, note known issues |
| 70–79 | Acceptable | Fix P0s before publishing |
| 60–69 | Needs Work | Fix P0+P1 before publishing |
| <60 | Not Ready | Significant rework needed |
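The scoring arithmetic and verdict bands above can be sketched in a few lines. This is an illustrative helper, not code taken from `scripts/eval-skill.py`; the names `rubric_total`, `rubric_verdict`, and `VERDICT_BANDS` are hypothetical.

```python
from typing import Dict

# Verdict bands copied from the table above: (minimum total, verdict).
VERDICT_BANDS = [
    (90, "Excellent"),
    (80, "Good"),
    (70, "Acceptable"),
    (60, "Needs Work"),
    (0, "Not Ready"),
]


def rubric_total(scores: Dict[str, int]) -> int:
    """Sum 25 criterion scores, each an integer from 0 to 4 (max 100)."""
    if len(scores) != 25:
        raise ValueError(f"expected 25 criteria, got {len(scores)}")
    for name, value in scores.items():
        if not 0 <= value <= 4:
            raise ValueError(f"{name}: score {value} is outside 0-4")
    return sum(scores.values())


def rubric_verdict(total: int) -> str:
    """Map a 0-100 rubric total to its verdict band."""
    for minimum, verdict in VERDICT_BANDS:
        if total >= minimum:
            return verdict
    raise ValueError("total must be non-negative")
```

For example, a skill scoring 3 on every criterion totals 75, which falls in the Acceptable band.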
#### Rubric Score Sheet
Copy `assets/EVAL-TEMPLATE.md` to the skill directory as `EVAL.md`.
**P0 Issues (blocks publishing):**
- Missing SKILL.md or invalid frontmatter
- Hardcoded credentials or secrets
- Phantom tooling (referenced scripts not in package)
- No description or description < 50 chars
**P1 Issues (should fix):**
- No usage examples
- No error handling in scripts
- Missing dependency documentation
- Unclear trigger conditions
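Two of the P0 checks above are mechanical enough to sketch: description length and phantom tooling. The following is a minimal illustration under stated assumptions, not the logic of the bundled scripts. The regular expression `SCRIPT_REF` (paths under `scripts/` ending in `.py`, `.sh`, or `.js`) and both function names are hypothetical.

```python
import re
from typing import Iterable, List

# Assumption: script references look like "scripts/<name>.<py|sh|js>".
SCRIPT_REF = re.compile(r"\bscripts/[\w./-]+\.(?:py|sh|js)\b")


def find_phantom_scripts(skill_md: str, package_files: Iterable[str]) -> List[str]:
    """Return scripts referenced in SKILL.md text that are absent from the package."""
    present = set(package_files)
    referenced = sorted(set(SCRIPT_REF.findall(skill_md)))
    return [ref for ref in referenced if ref not in present]


def p0_issues(skill_md: str, description: str, package_files: Iterable[str]) -> List[str]:
    """Collect the P0 issues that can be checked without human judgment."""
    issues = []
    if len(description.strip()) < 50:
        issues.append("No description or description < 50 chars")
    for ref in find_phantom_scripts(skill_md, package_files):
        issues.append(f"Phantom tooling: {ref}")
    return issues
```

`package_files` is expected to hold paths relative to the skill directory, for instance collected with `pathlib.Path.rglob`. Credential detection and frontmatter validation are left out because they need more context than a regex provides.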
---
### Method 3: Autonomous Benchmark (deep, about 30 minutes per skill)
Full multi-phase evaluation with multi-model support. **Requires AI agent execution.**
```bash
# Spawn benchmark via AI agent
multi-skill-eval /path/to/skill --method benchmark --model claude-sonnet-4
```
> ⚠️ **Note**: The benchmark method requires an AI agent to orchestrate subagent execution. The CLI coordinates the workflow but actual execution happens through AI agent sessions.
> 📋 **Planned**: Self-evolution improvement engine (Phase 7+) is planned but not yet implemented.
#### Phase 1: Pre-flight Analysis
1. Read `SKILL.md` — understand claims, dependencies, target use cases
2. Classify skill type:
   - **Capability uplift**: teaches the agent something it can't do well
   - **Encoded preference**: sequences steps according to a specific process
3. Dependency check:
   - Required CLI tools, API keys, env vars
   - Mark `dependency-gated` if credentials are missing (skip the eval; this is not a fault of the skill)
   - Check for **phantom tooling** (referenced scripts not in package)
4. Marketing claims check: flag any metrics ("7.8x faster") without evidence
5. Read knowledge base: `knowledge/lessons.md`, `knowledge/eval-patterns.md`, `knowledge/failures.md`
6. Check prior evaluations: `knowledge/skill-profiles/<slug>.md`
#### Phase 2: Test Case Design
Design 2-3 test prompts that together cover four categories:
- **Outcome** — Did the task complete correctly?
- **Process** — Did the agent follow the skill's intended steps?
- **Style** — Does output follow skill-claimed conventions?
- **Efficiency** — Reasonable time/token usage?
**Assertion design (two layers):**
*Layer 1: Deterministic checks* (fast, reproducible)
- File existence, word counts, keyword presence
- Format compliance (valid JSON, SQL, markdown)
- Programmatic verification (run tests, check syntax)
*Layer 2: Rubric-based quality assessment* (LLM-as-judge)
- Judge model (NOT execution model) grades output against specific rubric
- Structured scoring, not pass/fail
**Key assertion patterns:**
- Banned-word checks for style-constrained skills (highly discriminating)
- Methodology/structure assertions for technical domains (baseline already strong on correctness)
- Output-floor assertions: required sections must appear even in error/fallback paths
- Bilingual keyword variants for Chinese-language skills (索引/index, 前导通配符/leading wildcard)
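The Layer 1 deterministic checks listed above (file existence, word counts, format compliance) might look like the sketch below. All three function names are hypothetical and are not part of `scripts/grade-assertions.py`.

```python
import json
from pathlib import Path
from typing import Iterable, List


def missing_files(output_dir: str, expected: Iterable[str]) -> List[str]:
    """Return the expected output files that do not exist under output_dir."""
    root = Path(output_dir)
    return [name for name in expected if not (root / name).is_file()]


def word_count_between(text: str, minimum: int, maximum: int) -> bool:
    """Check a whitespace-delimited word count (not meaningful for unsegmented Chinese text)."""
    count = len(text.split())
    return minimum <= count <= maximum


def is_valid_json(text: str) -> bool:
    """Format-compliance check: does the output parse as JSON?"""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True
```

Checks like these are cheap and reproducible, so they can run on every iteration before any judge model is involved.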
#### Phase 3: Execution
For each test case, spawn two subagents:
**With-skill:**
```
[Model: <execution_model>]
Read the skill at <skill-path>/SKILL.md and follow its instructions.
Task: <prompt>
Save outputs to: <workspace>/iteration-<N>/<test-name>/with_skill/outputs/
```
**Without-skill (baseline):**
```
[Model: <execution_model>]
Complete this task using only built-in capabilities. Do NOT read SKILL.md.
Task: <prompt>
Save outputs to: <workspace>/iteration-<N>/<test-name>/without_skill/outputs/
```
**Multi-model mode:** Run same skill across multiple models to check cross-model consistency.
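The two prompt templates above can be filled programmatically so that both subagents receive the same task and model. This is a sketch only: `build_prompts` and its parameter names are hypothetical, and how the prompts are handed to subagents depends on the agent runtime.

```python
from typing import Dict

WITH_SKILL_TEMPLATE = (
    "[Model: {model}]\n"
    "Read the skill at {skill_path}/SKILL.md and follow its instructions.\n"
    "Task: {task}\n"
    "Save outputs to: {workspace}/iteration-{iteration}/{test_name}/with_skill/outputs/"
)

BASELINE_TEMPLATE = (
    "[Model: {model}]\n"
    "Complete this task using only built-in capabilities. Do NOT read SKILL.md.\n"
    "Task: {task}\n"
    "Save outputs to: {workspace}/iteration-{iteration}/{test_name}/without_skill/outputs/"
)


def build_prompts(model: str, skill_path: str, task: str,
                  workspace: str, iteration: int, test_name: str) -> Dict[str, str]:
    """Render the with-skill and baseline prompts for one test case."""
    fields = dict(model=model, skill_path=skill_path, task=task,
                  workspace=workspace, iteration=iteration, test_name=test_name)
    return {
        "with_skill": WITH_SKILL_TEMPLATE.format(**fields),
        # str.format ignores the unused skill_path field in the baseline template.
        "without_skill": BASELINE_TEMPLATE.format(**fields),
    }
```

For multi-model mode, the same function can be called once per model name so that only the `[Model: ...]` line differs between runs.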
#### Phase 4: Grading
Grade deterministic checks programmatically and qualitative checks with an LLM judge:
```bash
python3 scripts/grade-assertions.py --workspace /path/to/results
```
Save to `grading.json`:
```json
{
"expectations": [
{"text": "assertion text", "passed": true, "evidence": "..."}
],
"summary": {"passed": N, "failed": N, "total": N, "pass_rate": 0.X}
}
```
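The `summary` object in `grading.json` can be derived from the `expectations` list. The helper below is a sketch with a hypothetical name (`summarize_expectations`); rounding `pass_rate` to two decimals is an assumption.

```python
from typing import Any, Dict, List


def summarize_expectations(expectations: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build the summary block from a list of graded expectations."""
    total = len(expectations)
    passed = sum(1 for item in expectations if item["passed"])
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        # Assumption: report 0.0 rather than dividing by zero when nothing was graded.
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }
```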
#### Phase 5: Benchmark Aggregation
```json
{
"with_skill": {"pass_rate": 0.X, "avg_time": "Ns", "avg_tokens": N},
"without_skill": {"pass_rate": 0.X, "avg_time": "Ns", "avg_tokens": N},
"delta": {"pass_rate": "+0.XX", "time": "+Xx"},
"model_used": "claude-sonnet-4",
"verdict": "Recommended"
}
```
**Efficiency flags:** Flag skills where quality delta ≈ 0 but cost delta >2x ("high-overhead framework inflation").
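The efficiency flag can be expressed as a small predicate. The sketch below reads "cost delta >2x" as a with-skill to baseline token ratio above 2.0 and "quality delta ≈ 0" as an absolute pass-rate difference of at most 0.05. Both thresholds and the function name are assumptions, not values taken from the skill's scripts.

```python
def is_framework_inflation(with_pass_rate: float, without_pass_rate: float,
                           with_tokens: float, without_tokens: float,
                           quality_tolerance: float = 0.05,
                           cost_ratio_limit: float = 2.0) -> bool:
    """Flag skills that cost far more than baseline without improving quality."""
    if without_tokens <= 0:
        raise ValueError("baseline token count must be positive")
    quality_delta = with_pass_rate - without_pass_rate
    cost_ratio = with_tokens / without_tokens
    return abs(quality_delta) <= quality_tolerance and cost_ratio > cost_ratio_limit
```

The same predicate works with average time in place of tokens, since only the ratio matters.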
#### Phase 6: Skill Card Generation
```bash
python3 scripts/generate_skill_card.py \
--workspace /path/to/results \
--skill-name "My Skill" \
--skill-slug my-skill \
--eval-model claude-sonnet-4 \
--output skill-cards/my-skill-v1.md
```
**Skill Card Contents:**
- Metadata: name, source, eval date, model, engine version
- Overall score 0-10 (Quality 0-5 + Delta 0-3 + Efficiency 0-2)
- With-skill vs without-skill comparison table
- Per-test-case breakdown with assertions, timing, grading
- Strengths / Weaknesses
- Recommendation: Recommended / Conditional / Marginal / Not Recommended
#### Phase 7: Leaderboard Update
```bash
python3 scripts/generate_leaderboard.py --cards-dir skill-cards --output leaderboard/index.html
```
---
## Self-Evolution Improvement Engine
> ⚠️ **Planned — Not Yet Implemented**
>
> The self-evolution improvement engine is designed but not yet implemented. The knowledge base (`knowledge/improve/`) contains proven patterns and lessons that inform manual skill improvement, but automatic skill rewriting is not available.
### Planned Improvement Process (Phase 7-12)
1. **Read knowledge base:**
   - `knowledge/improve/lessons.md`: proven strategies
   - `knowledge/improve/patterns.md`: category-specific playbooks
   - `knowledge/improve/failures.md`: what NOT to try
2. **Diagnose root cause:**
   - Skill too vague? (Doesn't specify enough to change model behavior)
   - Skill redundant? (Teaches things model already knows)
   - Skill too heavy? (Adds overhead without proportional quality gain)
   - Missing structure? (No clear output format)
   - Phantom tooling? (References tools that don't exist)
   - Reference manual anti-pattern? (>200 lines of educational content)
   - Library-as-skill anti-pattern? (Contains code instead of instructions)
3. **Select improvement strategy** from patterns:
   - Reference Manual Slim-Down: Delete 70%+ redundant content, add MUST/ALWAYS/NEVER mandates
   - Library-to-Instructions: Convert code to behavioral instructions
   - Phantom Tooling Replacement: Replace missing tool references with inline instructions
   - Overhead Routing: Add quick-mode vs full-framework routing
   - Assertion-Aligned Rewrite: Rewrite to pass specific failed assertions
4. **Rewrite SKILL.md** with selected strategy:
   - Default: Remove > Add (delete 60-80% first, then add behavioral mandates)
   - Add specific, enforceable conventions
   - Remove redundant content model already handles
   - Save as `SKILL-improved.md`
5. **Update assertions** to match improved skill
6. **Re-evaluate** with improved version
### Planned Re-Eval (Phase 10-11)
Run same eval against `SKILL-improved.md`:
- Score improved by >= 1.5 points → Success
- Less than 50% of previously-failed assertions fixed → Document limitation, move on
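The two re-evaluation rules above can be sketched as a decision function. Since the improvement engine is planned rather than implemented, this is purely illustrative: the function name, the return labels, and the `"inconclusive"` outcome for cases the two rules do not cover are all assumptions.

```python
def reeval_outcome(old_score: float, new_score: float,
                   previously_failed: int, now_fixed: int) -> str:
    """Classify a re-evaluation of SKILL-improved.md against the original run."""
    if new_score - old_score >= 1.5:
        return "success"
    # Assumption: with no previously failed assertions there is nothing left to fix.
    fixed_ratio = now_fixed / previously_failed if previously_failed else 1.0
    if fixed_ratio < 0.5:
        return "document-limitation"
    # Assumption: the source does not define this middle case.
    return "inconclusive"
```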
### Planned Improvement Knowledge Update (Phase 12)
After each improvement batch:
- Update `knowledge/improve/lessons.md` with what worked
- Update `knowledge/improve/patterns.md` with reusable patterns
- Update `knowledge/improve/failures.md` with failed attempts
- Fold proven patterns back into this SKILL.md
---
## Scoring Summary
| Method | Speed | Coverage | Best For |
|--------|-------|---------|----------|
| Static Analysis | ~30s | 4 dimensions | Quick comparison, batch scan |
| Rubric Scoring | ~10min | 25 criteria | Pre-publish audit, detailed report |
| Benchmark Eval | ~30min per skill | Full + self-evolution (planned) | Production evaluation, skill improvement |
| Overall Score | Verdict |
|--------------|---------|
| 7-10 | Recommended |
| 5-6.9 | Conditional |
| 3-4.9 | Marginal |
| 0-2.9 | Not Recommended |
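The overall score combines the three components listed under Skill Card Contents (Quality 0-5, Delta 0-3, Efficiency 0-2) and maps to the verdicts in the table above. The sketch below uses hypothetical function names and treats each band's lower bound as the cutoff, so a value such as 6.95 lands in Conditional. How `scripts/generate_skill_card.py` derives each component is not shown here.

```python
def overall_score(quality: float, delta: float, efficiency: float) -> float:
    """Combine Quality (0-5), Delta (0-3), and Efficiency (0-2) into a 0-10 score."""
    limits = {"quality": (quality, 5), "delta": (delta, 3), "efficiency": (efficiency, 2)}
    for name, (value, maximum) in limits.items():
        if not 0 <= value <= maximum:
            raise ValueError(f"{name} must be between 0 and {maximum}, got {value}")
    return round(quality + delta + efficiency, 1)


def overall_verdict(score: float) -> str:
    """Map a 0-10 overall score to a recommendation."""
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    if score >= 7:
        return "Recommended"
    if score >= 5:
        return "Conditional"
    if score >= 3:
        return "Marginal"
    return "Not Recommended"
```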
---
## Anti-Patterns to Detect
- **Reference manual anti-pattern**: SKILL.md >200 lines of educational content (not behavioral instructions)
- **Library-as-skill anti-pattern**: SKILL.md contains Python/JS class definitions instead of instructions
- **Phantom tooling**: SKILL.md references scripts/binaries not in the package
- **Framework skills with phantom tooling**: Evaluate the template/output structure separately from operational readiness (execution against real data)
- **Unsubstantiated claims**: Skill claims specific metrics without evidence — do not use self-reported numbers
- **High-overhead framework inflation**: Quality delta ≈ 0 but cost delta >2x — penalize efficiency
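Two of these anti-patterns have rough mechanical proxies. The sketch below flags a SKILL.md with more than 200 non-blank lines and one containing Python/JS-style class definitions. It is a heuristic with hypothetical names: line count alone cannot tell educational content from behavioral instructions, so a hit should prompt manual review rather than an automatic penalty.

```python
import re
from typing import List

# Assumption: a class definition starts a line and is followed by "(", ":", "{", or "extends".
CLASS_DEF = re.compile(
    r"^[ \t]*(?:export[ \t]+)?class[ \t]+\w+[ \t]*(?:\(|:|\{|extends\b)",
    re.MULTILINE,
)


def detect_anti_patterns(skill_md: str, max_lines: int = 200) -> List[str]:
    """Return heuristic anti-pattern flags for the text of a SKILL.md file."""
    findings = []
    non_blank = [line for line in skill_md.splitlines() if line.strip()]
    if len(non_blank) > max_lines:
        findings.append("reference-manual")
    if CLASS_DEF.search(skill_md):
        findings.append("library-as-skill")
    return findings
```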
---
## Deeper Security Scanning
For thorough security audits, complement with [SkillLens](https://www.npmjs.com/package/skilllens):
```bash
npx skilllens scan /path/to/skill
```
Checks: exfiltration, code execution, persistence, privilege bypass, prompt injection.
---
## Dependencies
- Python 3.6+ (for eval-skill.py, static-analyze.py, grade-assertions.py)
- PyYAML (`pip install pyyaml`) — frontmatter parsing
- Node.js (for SkillLens security scanning)
FILE:CLAWHUB_PUBLISH.md
# ClawHub Publish Metadata
## multi-skill-eval v1.0.1
**Slug**: multi-skill-eval
**Name**: Multi-Skill-Eval
**Version**: 1.0.1
---
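## Changelog
### v1.0.1 (current version)
- Added Chinese trigger phrases and Chinese usage scenarios
- Fixed the same issue in grade-assertions.py as in eval-skill.py
- Added the missing generate_skill_card.py and generate_leaderboard.py
- Removed the dangerous-function detection in static-analyze.py that caused a self-contradiction
- Marked the benchmark as requiring AI agent execution and self-evolution as a planned feature
### v1.0.0
- Initial release
- Integrates three evaluation methods: static analysis, rubric scoring, and benchmarking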
---
## Publish Command
```bash
clawhub publish /Users/metaclaw/.openclaw/skills/multi-skill-eval \
--slug multi-skill-eval \
--name "Multi-Skill-Eval" \
--version 1.0.1 \
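--changelog "v1.0.1: Added Chinese trigger phrases and Chinese usage scenarios; fixed the same issue in grade-assertions.py as in eval-skill.py; added the missing generate_skill_card.py and generate_leaderboard.py; removed the dangerous-function detection in static-analyze.py that caused a self-contradiction; marked the benchmark as requiring AI agent execution and self-evolution as a planned feature"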
```
---
## ClawHub Links
- Skill page: https://clawhub.com/multi-skill-eval
- Publish ID: k971zvy4szs4a4yqdgssgsqrmx85h3qa
---
## Tags
- evaluation
- skill-assessment
- rubric
- benchmark
- multi-model
- openclaw
- skill-quality
- auditing
- Chinese-supported
FILE:assets/EVAL-TEMPLATE.md
# [SKILL_NAME] Evaluation
**Date:** YYYY-MM-DD
**Evaluator:** [agent/human]
**Skill version:** [version or commit]
**Automated score:** [run eval-skill.py and paste summary]
---
## Automated Checks
```
[paste eval-skill.py output here]
```
## Manual Assessment
| # | Criterion | Score | Notes |
|---|-----------|-------|-------|
| 1.1 | Completeness | /4 | |
| 1.2 | Correctness | /4 | |
| 1.3 | Appropriateness | /4 | |
| 2.1 | Fault Tolerance | /4 | |
| 2.2 | Error Reporting | /4 | |
| 2.3 | Recoverability | /4 | |
| 3.1 | Token Cost | /4 | |
| 3.2 | Execution Efficiency | /4 | |
| 4.1 | Learnability | /4 | |
| 4.2 | Consistency | /4 | |
| 4.3 | Feedback Quality | /4 | |
| 4.4 | Error Prevention | /4 | |
| 5.1 | Discoverability | /4 | |
| 5.2 | Forgiveness | /4 | |
| 6.1 | Credential Handling | /4 | |
| 6.2 | Input Validation | /4 | |
| 6.3 | Data Safety | /4 | |
| 7.1 | Modularity | /4 | |
| 7.2 | Modifiability | /4 | |
| 7.3 | Testability | /4 | |
| 8.1 | Trigger Precision | /4 | |
| 8.2 | Progressive Disclosure | /4 | |
| 8.3 | Composability | /4 | |
| 8.4 | Idempotency | /4 | |
| 8.5 | Escape Hatches | /4 | |
| | **TOTAL** | **/100** | |
## Priority Fixes
### P0 — Fix Before Publishing
1.
### P1 — Should Fix
1.
### P2 — Nice to Have
1.
## Revision History
| Date | Score | Notes |
|------|-------|-------|
| | /100 | Baseline |
FILE:knowledge/eval-patterns.md
# Reusable Evaluation Patterns
## Assertion Templates
### Banned-Word Check (Style-Constrained)
```python
{"type": "keyword_absent", "keyword": "filler_word"}
```
- Highly discriminating for writing skills
- Base model uses common filler words freely
- A skill with a banned-word list can show a 100% delta on this assertion (observed in prior evals)
### Output-Floor Assertion (Required Sections)
```python
{"type": "required_section", "section": "source", "example": "Sources: [...]"}
```
- Skills with required output sections must assert them even in error/fallback paths
- Template compliance drift under data outages is a known failure mode
### Bilingual Keyword Variant
```python
{"type": "either_keyword", "keywords": ["索引", "index"]}
```
- Reduces false negatives when model responds in Chinese
### Methodology Adherence (Technical Skills)
```python
{"type": "methodology_check", "requires": ["EXPLAIN", "ANALYZE"]}
```
- For technical domains where baseline is already strong on correctness
- Target process, not just output correctness
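The four assertion templates above could be evaluated by a single dispatcher. The semantics below are assumptions inferred from the template names and fields (case-insensitive substring matching in every case); the actual behavior of `scripts/grade-assertions.py` may differ, and `check_assertion` is a hypothetical name.

```python
from typing import Any, Dict


def check_assertion(assertion: Dict[str, Any], output: str) -> bool:
    """Evaluate one assertion dict against a model output string."""
    kind = assertion["type"]
    text = output.lower()
    if kind == "keyword_absent":
        # Passes when the banned word does not appear.
        return assertion["keyword"].lower() not in text
    if kind == "required_section":
        # Passes when the required section label appears anywhere.
        return assertion["section"].lower() in text
    if kind == "either_keyword":
        # Passes when any variant (for example Chinese or English) appears.
        return any(keyword.lower() in text for keyword in assertion["keywords"])
    if kind == "methodology_check":
        # Passes only when every required step is mentioned.
        return all(step.lower() in text for step in assertion["requires"])
    raise ValueError(f"unknown assertion type: {kind}")
```

Plain substring matching is deliberately crude: a banned word embedded in a longer word would also count as present, so word-boundary matching may be preferable for English-only checks.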
## Category-Specific Patterns
### Writing/Style Skills
- Always add banned-word assertions for skill-defined forbidden vocabulary
- Assert output structure (sections, headers, format)
### Technical Analysis (SQL, Debugging)
- Prefer methodology/structure assertions over correctness
- Baseline models are already strong on correctness
- Focus on systematic process adherence
### CLI Wrapper Skills
- Assert: tool is actually invoked
- Assert: output meaningfully differs from baseline
- Assert: graceful handling when dependency missing
### Reference Manual Skills
- Detect: >200 lines of educational content
- Rewrite: delete 70%+, add MUST/ALWAYS/NEVER behavioral mandates
- Add quick-mode routing for simple use cases
FILE:knowledge/failures.md
# Failure Mode Catalog
## Evaluation Failures
### Non-Discriminating Assertions
- **Symptom**: Assertion passes with AND without skill
- **Root cause**: Assertion tests generic capability baseline already has
- **Fix**: Replace with skill-specific behavior assertion
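Non-discriminating assertions can be found mechanically by comparing the with-skill and baseline grading results. The sketch below assumes each run has been reduced to a mapping from assertion text to pass/fail; the function name is hypothetical.

```python
from typing import Dict, List


def non_discriminating(with_skill: Dict[str, bool],
                       without_skill: Dict[str, bool]) -> List[str]:
    """Return assertions that pass both with and without the skill."""
    return sorted(
        text for text, passed in with_skill.items()
        if passed and without_skill.get(text, False)
    )
```

Assertions returned here say nothing about the skill's contribution and are candidates for replacement with skill-specific behavior checks.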
### Phantom Tooling
- **Symptom**: SKILL.md references script X but file doesn't exist
- **Root cause**: Skill author documented aspirational tooling
- **Fix**: Separate framework/template evaluation from operational readiness
### False Negatives (Multilingual)
- **Symptom**: Chinese-language skill fails English-keyword assertions
- **Root cause**: Model responds in Chinese, keyword checks miss it
- **Fix**: Add bilingual keyword variants
### Self-Grading Bias
- **Symptom**: Same model grades its own output
- **Root cause**: Execution model = judge model
- **Fix**: Use separate judge model from different provider
## Skill Quality Failures
### Reference Manual Anti-Pattern
- **Symptom**: SKILL.md is 200+ lines of educational content
- **Root cause**: Author wrote a textbook, not a skill
- **Fix**: Delete 70%+, add behavioral mandates (MUST/ALWAYS/NEVER)
### Library-as-Skill Anti-Pattern
- **Symptom**: SKILL.md contains Python/JS class definitions
- **Root cause**: Author wrote code instead of instructions
- **Fix**: Convert to behavioral instructions describing what the agent should do
### High-Overhead Framework Inflation
- **Symptom**: Quality delta ≈ 0 but time/token cost >2x baseline
- **Root cause**: Framework adds overhead without proportional benefit
- **Fix**: Add quick-mode routing or reject as not publishable
FILE:knowledge/improve/failures.md
# Failed Improvement Attempts
## What Doesn't Work
### Incremental content addition
- Adding more explanation to a skill that already has structural problems doesn't help
- Root cause: wrong paradigm, not insufficient detail
### Rewriting without diagnosis
- Blindly rewriting SKILL.md without first identifying root cause of failures wastes cycles
- Always diagnose before rewriting
### Trying to "improve" content that should be deleted
- Skills with reference-manual anti-pattern: adding more content (even "better" content) doesn't fix the problem
- The issue is verbosity and lack of behavioral directives, not content quality
### Over-engineering simple skills
- Simple CLI wrapper skills do not benefit from complex multi-step workflows
- Keep simple skills simple
## Known Failure Modes
1. **Same model grades own output** → confirmation bias, inflated scores
2. **>3 improvement cycles** → diminishing returns, need to document and move on
3. **Complex assertions on simple skills** → overhead without quality gain
4. **Ignoring phantom tooling** → operational failures in production
5. **Unsubstantiated claims as evidence** → circular self-referencing validation
FILE:knowledge/improve/lessons.md
# Improvement Lessons (What Actually Works)
## Verified Improvements
### Deletion-first rewriting
- Skills that simply deleted 60-80% of content and added strong directives consistently improved more than those that tried to "improve" existing content
- The model has too much context already — less is more for behavioral control
### MUST/ALWAYS/NEVER directives work
- Vague guidance ("consider doing X") has near-zero behavioral effect
- Strong directives ("MUST do X before Y") significantly change agent behavior
- Best results: 1 directive per major step, max 10 directives total
### Quick-mode routing reduces overhead
- Adding a simple "if [simple case], use this fast path" reduced time cost by 40-60% for simple tasks
- No quality degradation observed for the targeted simple cases
### Bilingual assertions reduce false negatives
- Adding Chinese keyword variants alongside English reduced false negatives by ~30% for Chinese-language skills
- Cost: assertion complexity increases, but worth it for non-English skills
## Partial Success Patterns
### Framework-to-instructions conversion
- Partially effective — some code patterns converted well, others lost nuance
- Best for: CLI wrappers, data transformation helpers
- Poor for: complex stateful logic
### Phantom tooling replacement
- Works when the missing tool is simple (single command)
- Doesn't work when the tool encapsulates complex logic (rewrite just shifts complexity elsewhere)
## Failed Experiments
### Adding more examples to vague skills
- Adding examples to a skill that had structural problems (wrong paradigm) did not improve scores
- Fix the structure first, then add examples
### Multi-round iterative improvement
- Attempting >3 improvement cycles on the same skill rarely yields additional gains
- After 2 cycles with <0.5 point improvement, document limitations and move on
### Using the same model for improvement + eval
- Self-improvement introduces confirmation bias
- Best practice: use different model for improvement vs. grading
FILE:knowledge/improve/patterns.md
# Improvement Strategies (Proven Playbooks)
## Strategy 1: Reference Manual Slim-Down
**Problem**: >200 lines of educational content, agent skips most of it
**Solution**: Delete 70%+ redundant content, replace with behavioral mandates
**When to use**: SKILL.md reads like a textbook, not a skill
**Steps**:
1. Identify which paragraphs describe WHAT the agent should do vs WHY
2. Delete all "why" and "background" content (keep <30% of original)
3. Add MUST/ALWAYS/NEVER directives
4. Add quick-mode routing for simple cases
**Example transformation**:
Before: "The agent should first analyze the database schema, considering the relationships between tables, then carefully examine the indexes, taking into account the query patterns..."
After: "MUST analyze schema before querying. ALWAYS check index coverage for WHERE clauses."
## Strategy 2: Library-to-Instructions
**Problem**: SKILL.md contains Python/JS class definitions instead of behavioral instructions
**Solution**: Convert code to step-by-step behavioral guidance
**When to use**: Skill is a "library" with classes and methods
**Steps**:
1. Identify each function/class purpose
2. Rewrite as: "When [trigger], do [step 1] → [step 2] → [step 3]"
3. Remove implementation details
4. Keep only the behavioral contract
## Strategy 3: Phantom Tooling Replacement
**Problem**: SKILL.md references scripts/binaries that don't exist in the package
**Solution**: Replace tool references with equivalent inline instructions
**When to use**: Phantom tooling detected during eval
**Steps**:
1. List all referenced but missing tools
2. For each: write equivalent step-by-step instructions
3. Replace tool call with inline instructions
4. Add graceful fallback for missing dependencies
## Strategy 4: Overhead Routing
**Problem**: High time/token cost with marginal quality gain
**Solution**: Add fast-path routing for simple cases
**When to use**: Quality delta ≈ 0 but cost delta >2x
**Steps**:
1. Identify which use cases are simple (can be handled by baseline)
2. Add conditional: "If [simple case], skip full workflow"
3. Add lightweight alternative for simple cases
4. Keep full workflow only for complex cases
## Strategy 5: Assertion-Aligned Rewrite
**Problem**: Skill fails specific assertions that represent real regressions
**Solution**: Rewrite specifically to pass the failed assertions
**When to use**: Score < 7 due to specific, fixable failures
**Steps**:
1. List all failed assertions with their root cause
2. For each failure: identify which part of SKILL.md causes it
3. Rewrite that section to satisfy the assertion
4. Re-run assertions — if still failing, dig deeper into root cause
## Anti-Pattern Recovery
| Anti-Pattern | Detection | Fix |
|--------------|-----------|-----|
| Reference Manual | Body >200 lines, no behavioral verbs | Delete 70%+, add MUST/ALWAYS |
| Library-as-Skill | Code in SKILL.md | Convert to instructions |
| Phantom Tooling | Missing scripts | Replace with inline instructions |
| Framework Inflation | Cost >2x, delta ≈ 0 | Add quick-mode routing |
| Unsubstantiated Claims | "7.8x faster" without evidence | Remove or add evidence |
FILE:knowledge/lessons.md
# Evaluation Knowledge Base
## Lessons Learned (Accumulated Eval Wisdom)
### General
- Skills that rely on external credentials (dependency-gated) should be marked as such and evaluated separately from quality signals
- Phantom tooling (missing scripts) should be flagged separately from framework/template value
- Unsubstantiated marketing claims should be noted but not used in scoring
### Assertion Design
- Assertions that pass in both with-skill and without-skill are non-discriminating — always include skill-specific assertions
- Banned-word checks are highly discriminating for style-constrained skills (100% delta observed)
- For technical correctness tasks, baseline models are already strong — focus on methodology and structure, not correctness
- Bilingual keyword variants (Chinese + English) reduce false negatives for multilingual skills
### Category Patterns
- **Capability uplift skills**: target structural elements the model CAN produce but doesn't by default (excellent discriminators)
- **Framework-heavy skills**: can justify 50-90% time overhead IF they consistently improve actionability and formatting
- **CLI wrapper skills**: assert tool invocation, meaningful output delta, graceful dependency handling
- **Reference manual anti-pattern**: >200 lines of educational content — should be behavioral mandates instead
- **Library-as-skill anti-pattern**: code in SKILL.md instead of instructions — convert to behavioral guidance
### Efficiency
- Quality delta ≈ 0 but cost delta >2x should heavily penalize efficiency score
- Quick-mode routing reduces overhead for simple use cases
FILE:references/rubric.md
# Skill Evaluation Rubric
Full multi-framework rubric with concrete criteria per score level.
Automated checks (via `scripts/eval-skill.py`) cover structure, trigger, docs, scripts, and security.
This rubric adds the **manual assessment** dimensions that require judgment.
## Scoring: 0–4 per criterion
| Score | Meaning | Guideline |
|-------|---------|-----------|
| 0 | Fail | Missing or broken — blocks publishing |
| 1 | Poor | Present but inadequate — significant rework needed |
| 2 | Acceptable | Works but has notable gaps — should improve |
| 3 | Good | Solid with minor issues — publishable |
| 4 | Excellent | Best-in-class — no meaningful improvements |
---
## 1. Functional Suitability (ISO 25010)
### 1.1 Completeness
*Does it cover the domain adequately?*
- **4:** Covers all common operations + edge cases. A user would rarely need to go outside the skill.
- **3:** Covers core operations. Missing some advanced/niche features.
- **2:** Covers basic operations. Missing several expected features.
- **1:** Only covers a fraction of the domain. Major gaps.
- **0:** Stub or non-functional.
### 1.2 Correctness
*Do operations produce correct results?*
- **4:** All operations tested and verified. Edge cases handled.
- **3:** Core operations correct. Minor bugs in edge cases.
- **2:** Mostly correct but some operations produce wrong results.
- **1:** Several operations are broken or produce incorrect output.
- **0:** Fundamentally broken.
### 1.3 Appropriateness
*Is the technical approach suitable?*
- **4:** Zero external deps, portable, follows platform conventions perfectly.
- **3:** Minimal deps, good approach, minor awkwardness.
- **2:** Unnecessary deps or suboptimal approach for some operations.
- **1:** Wrong tool for the job or heavy unnecessary dependencies.
- **0:** Fundamentally misdesigned.
---
## 2. Reliability (ISO 25010)
### 2.1 Fault Tolerance
*How does it handle failures?*
- **4:** Retries transient errors, graceful fallbacks, partial operations succeed even if some fail.
- **3:** Catches and reports errors clearly. Some retry or fallback logic.
- **2:** Catches errors but dies on first failure. No retry.
- **1:** Unhandled exceptions for common error cases.
- **0:** Crashes with tracebacks on normal error conditions.
### 2.2 Error Reporting
*Are errors actionable?*
- **4:** Structured errors (JSON in --json mode), consistent stderr, actionable messages with fix suggestions.
- **3:** Clear error messages to stderr. User knows what went wrong.
- **2:** Error messages exist but are vague or inconsistent (mixed stdout/stderr).
- **1:** Raw tracebacks or cryptic messages.
- **0:** Silent failures.
### 2.3 Recoverability
*Can interrupted operations resume?*
- **4:** Checkpoint/resume for batch operations. Idempotent by default.
- **3:** No checkpoint but operations are idempotent (safe to re-run).
- **2:** Re-running is safe but redundant work is done.
- **1:** Re-running may cause duplicates or errors.
- **0:** Interrupted operations leave corrupted state.
---
## 3. Performance / Context Efficiency
### 3.1 SKILL.md Token Cost
*Is the SKILL.md appropriately sized for its value?*
- **4:** <150 lines. Progressive disclosure via references. Every line earns its tokens.
- **3:** 150-250 lines. Some content could move to references.
- **2:** 250-400 lines. Verbose explanations an AI doesn't need.
- **1:** 400+ lines. Wastes context window. Should be restructured.
- **0:** So large it crowds out other context.
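As a starting point for criterion 3.1, the line-count bands above can be turned into a provisional score. This is a sketch with a hypothetical function name. The bands overlap at 250 and 400 lines, so assigning 250 to score 3 and 400 to score 1 is an assumption, and a score of 0 is left to manual judgment because it has no numeric threshold.

```python
def token_cost_score(line_count: int) -> int:
    """Suggest a provisional 3.1 score (1-4) from the SKILL.md line count."""
    if line_count < 0:
        raise ValueError("line_count must be non-negative")
    if line_count < 150:
        return 4
    if line_count <= 250:
        return 3
    if line_count < 400:
        return 2
    return 1
```

The reviewer should still adjust the result for progressive disclosure and for whether each line earns its tokens.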
### 3.2 Execution Efficiency
*Does the script avoid unnecessary work?*
- **4:** Paginated, filtered, incremental where possible. Rate-limited politely.
- **3:** Handles pagination. Some unnecessary full-scans.
- **2:** Works but does redundant API calls or full-library scans.
- **1:** Extremely slow or wasteful for normal operations.
- **0:** Unusably slow.
---
## 4. Usability — AI Agent (Shneiderman, Gerhardt-Powals)
### 4.1 Cold-Start Learnability
*Can a fresh AI instance use this skill correctly on first try?*
- **4:** SKILL.md + examples are sufficient. Troubleshooting reference covers errors. Agent never needs to read source.
- **3:** SKILL.md covers core usage. Some edge cases require experimentation.
- **2:** Agent needs to read source code to understand some operations.
- **1:** SKILL.md is insufficient. Agent will struggle without source access.
- **0:** No documentation. Agent must reverse-engineer everything.
### 4.2 Consistency
*Are patterns uniform across commands/operations?*
- **4:** Same flag names, same output format, same error handling everywhere.
- **3:** Mostly consistent. One or two exceptions.
- **2:** Inconsistent flags, mixed output formats, or uneven error handling.
- **1:** Every command behaves differently.
- **0:** No discernible pattern.
### 4.3 Feedback Quality
*Does the skill communicate progress and results clearly?*
- **4:** Progress indicators, summary reports, emoji-scannable output, JSON mode.
- **3:** Clear success/failure messages. Batch progress shown.
- **2:** Basic output. Hard to tell success from failure in batch operations.
- **1:** Minimal output. Agent has to infer outcomes.
- **0:** Silent or cryptic.
### 4.4 Error Prevention
*Does the skill prevent mistakes before they happen?*
- **4:** Input validation, safe defaults (recoverable > permanent), dry-run for destructive ops, duplicate detection.
- **3:** Some validation and safe defaults. Destructive ops have confirmation.
- **2:** Minimal validation. Unsafe defaults for some operations.
- **1:** No validation. Easy to accidentally destroy data.
- **0:** Actively dangerous defaults.
---
## 5. Usability — Human End User (Tognazzini, Norman)
### 5.1 Discoverability
*Can a human figure out what's available?*
- **4:** --help on all commands with examples. SKILL.md documents everything.
- **3:** --help works. SKILL.md covers core operations.
- **2:** Basic --help. Some features undocumented.
- **1:** Minimal help text. Must read source.
- **0:** No help or documentation.
### 5.2 Forgiveness / Undo
*Can mistakes be reversed?*
- **4:** Destructive ops default to recoverable (trash). Undo available. Confirmation prompts.
- **3:** Confirmation on destructive ops. Some recoverability.
- **2:** Confirmation exists but permanent is the default.
- **1:** No confirmation on destructive operations.
- **0:** Destructive operations with no warning.
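
"Default to recoverable" can be as simple as moving a file into a trash directory instead of unlinking it. A minimal sketch, assuming a hypothetical skill-local trash directory; `move_to_trash` and `restore` are illustrative names, not part of any existing skill.

```python
import shutil
import time
from pathlib import Path


def move_to_trash(path, trash_dir):
    """Move a file into trash_dir instead of deleting it; return the new location."""
    src = Path(path)
    trash = Path(trash_dir)
    trash.mkdir(parents=True, exist_ok=True)
    # A timestamp prefix avoids clobbering an earlier trashed file of the same name.
    dest = trash / f"{time.time_ns()}-{src.name}"
    shutil.move(str(src), str(dest))
    return dest


def restore(trashed_path, original_path):
    """Undo: move a trashed file back to where it came from."""
    shutil.move(str(trashed_path), str(original_path))
```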
---
## 6. Security (ISO 25010 + OSS Hygiene)
### 6.1 Credential Handling
*How are secrets managed?*
- **4:** All secrets via env vars. No personal data in source. Documented.
- **3:** Secrets via env vars. Minor issues (undocumented optional var).
- **2:** Mostly env vars but some hardcoded values.
- **1:** Credentials in source code.
- **0:** Credentials in source AND committed to git.
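
For reference, a score of 4 usually looks something like the sketch below: the secret comes only from the environment, and a missing value fails loudly with instructions. The variable name `MY_SKILL_API_KEY` and the helper `get_api_key` are assumptions made up for this example.

```python
import os


def get_api_key(env=None, var="MY_SKILL_API_KEY"):
    """Read the API key from the environment; never fall back to a hardcoded value."""
    env = os.environ if env is None else env
    key = env.get(var, "").strip()
    if not key:
        raise RuntimeError(
            f"{var} is not set. Export it before running, e.g. export {var}=..."
        )
    return key
```

Passing `env` explicitly keeps the helper testable without touching the real process environment.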
### 6.2 Input Validation
*Are inputs sanitized?*
- **4:** All user inputs validated with helpful error messages.
- **3:** Key inputs validated. Some edge cases unchecked.
- **2:** Minimal validation. Bad input causes API errors.
- **1:** No validation. Garbage in, garbage out.
- **0:** Injection-vulnerable.
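
"Validated with helpful error messages" means rejecting bad input before it reaches an API or a shell, and saying what a valid value looks like. A small sketch under the assumption that the input is a skill slug; the slug format and `validate_slug` are hypothetical.

```python
import re

# Lowercase letters and digits, separated by single hyphens.
SLUG_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def validate_slug(value):
    """Return the slug unchanged if valid, else raise with an actionable message."""
    if not isinstance(value, str) or not SLUG_RE.fullmatch(value):
        raise ValueError(
            f"invalid slug {value!r}: use lowercase letters, digits and "
            "single hyphens, e.g. 'my-skill'"
        )
    return value
```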
### 6.3 Data Safety
*Are write operations safe by default?*
- **4:** Dry-run defaults, confirmation prompts, safe defaults (trash > delete).
- **3:** Most write ops have confirmation. Safe defaults.
- **2:** Some write ops unprotected.
- **1:** Write ops fire without confirmation.
- **0:** Silent data destruction possible.
---
## 7. Maintainability (ISO 25010)
### 7.1 Modularity
*Is the code well-organized?*
- **4:** Clear separation of concerns. Helpers extracted. Easy to add features.
- **3:** Functions organized by command. Some helpers extracted.
- **2:** Monolithic but navigable. Would benefit from refactoring.
- **1:** Spaghetti. Hard to find things.
- **0:** Incomprehensible.
### 7.2 Modifiability
*How easy is it to change?*
- **4:** Adding a new command is copy-paste-modify. Patterns are clear.
- **3:** Clear patterns. Some globals or tight coupling.
- **2:** Can be modified but requires understanding implicit dependencies.
- **1:** Changes break other things. No clear patterns.
- **0:** Effectively frozen — too fragile to touch.
### 7.3 Testability
*Can it be tested?*
- **4:** Test suite exists. Functions return values. Mockable API layer.
- **3:** No test suite but functions are testable (return values, pure helpers).
- **2:** Some testable functions. Others depend on live API or sys.exit.
- **1:** Everything depends on live API. sys.exit everywhere.
- **0:** Untestable architecture.
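
The difference between a 3 and a 1 here is mostly whether logic returns values or calls `sys.exit` directly. A rough sketch of the testable shape, with hypothetical helper names:

```python
import sys


def summarize(statuses):
    """Pure helper: count statuses and return a dict instead of printing or exiting."""
    counts = {"pass": 0, "warn": 0, "fail": 0}
    for status in statuses:
        counts[status] += 1
    return counts


def exit_code(counts):
    """Map counts to a process exit code; still a plain return value."""
    return 1 if counts["fail"] else 0


def main(argv=None):
    # Only a thin entry point reads sys.argv; a script would call sys.exit(main()).
    statuses = sys.argv[1:] if argv is None else argv
    return exit_code(summarize(statuses))
```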
---
## 8. Agent-Specific Heuristics
### 8.1 Trigger Precision
*Does the description activate the skill reliably?*
- **4:** Specific domain + action words + "Use when..." contexts. Low false positive/negative risk.
- **3:** Good domain keywords. Some ambiguity with similar skills.
- **2:** Too broad or too narrow. Frequent false triggers or missed triggers.
- **1:** Description doesn't match what the skill does.
- **0:** No description.
### 8.2 Progressive Disclosure
*Is context loaded efficiently?*
- **4:** 3+ levels: description → SKILL.md → references. Agent loads only what's needed.
- **3:** 2 levels: description → SKILL.md. Everything in one file but concise.
- **2:** Everything in SKILL.md. No reference files despite complex domain.
- **1:** Requires reading source code for basic usage.
- **0:** No disclosure hierarchy.
### 8.3 Composability
*Can it work with other tools in a pipeline?*
- **4:** --json for all commands, stdin input, proper exit codes, stderr for errors.
- **3:** --json for key commands. Good exit codes.
- **2:** Machine-readable output is limited to a few commands.
- **1:** Human-readable only. No structured output.
- **0:** Output is unparseable.
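
The level-4 behaviour (structured output on stdout, errors on stderr, meaningful exit code) can be sketched as follows. `emit` and the shape of `result` are assumptions for illustration only.

```python
import json
import sys


def emit(result, as_json, out=None, err=None):
    """Write the result to stdout and errors to stderr; return the exit code."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    ok = not result.get("errors")
    if as_json:
        # One JSON document per invocation keeps the output easy to pipe.
        json.dump(result, out, ensure_ascii=False)
        out.write("\n")
    else:
        out.write(f"{result['name']}: {'ok' if ok else 'failed'}\n")
    for msg in result.get("errors", []):
        err.write(f"error: {msg}\n")
    return 0 if ok else 1
```

Keeping errors off stdout means a downstream `json.loads` never has to skip over diagnostics.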
### 8.4 Idempotency
*Is it safe to re-run?*
- **4:** All operations are idempotent. Re-running produces same state.
- **3:** Most operations idempotent. Some may create duplicates.
- **2:** Core operations idempotent. Batch ops may duplicate.
- **1:** Re-running causes problems.
- **0:** Re-running corrupts state.
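
An idempotent operation checks the current state before changing it, so a second run is a no-op. A minimal sketch, assuming the operation is "make sure this line is in a file"; `ensure_line` is a hypothetical helper.

```python
from pathlib import Path


def ensure_line(path, line):
    """Append line to the file only if absent; return True when the file changed."""
    p = Path(path)
    text = p.read_text(encoding="utf-8") if p.exists() else ""
    if line in text.splitlines():
        return False  # already in the desired state, nothing to do
    # Add a newline first if the existing content does not end with one.
    prefix = "" if not text or text.endswith("\n") else "\n"
    with p.open("a", encoding="utf-8") as f:
        f.write(f"{prefix}{line}\n")
    return True
```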
### 8.5 Escape Hatches
*Can the agent/user override default behavior?*
- **4:** --force, --dry-run, --verbose, --quiet, --json, custom API URLs.
- **3:** --force and --dry-run available. Good flag coverage.
- **2:** Some overrides. Missing key flags.
- **1:** Minimal overrides. Hard to customize behavior.
- **0:** No overrides. Take it or leave it.
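
With `argparse`, the full set of overrides listed for a score of 4 takes only a few lines. The program name `my-skill` and the default API URL below are placeholders, not real endpoints.

```python
import argparse


def build_parser():
    """Build a CLI parser exposing the common escape hatches."""
    parser = argparse.ArgumentParser(prog="my-skill")
    parser.add_argument("--force", action="store_true",
                        help="skip confirmation prompts")
    parser.add_argument("--dry-run", action="store_true",
                        help="show what would change without changing anything")
    parser.add_argument("--json", action="store_true",
                        help="machine-readable output")
    # --verbose and --quiet contradict each other, so only one is accepted.
    noise = parser.add_mutually_exclusive_group()
    noise.add_argument("--verbose", "-v", action="store_true")
    noise.add_argument("--quiet", "-q", action="store_true")
    parser.add_argument("--api-url", default="https://api.example.com",
                        help="override the API endpoint (placeholder default)")
    return parser
```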
---
## Scoring Template
Copy this for your evaluation:
```
| Category | Subcriteria | Score | Notes |
|----------|-------------|-------|-------|
| 1.1 Completeness | | /4 | |
| 1.2 Correctness | | /4 | |
| 1.3 Appropriateness | | /4 | |
| 2.1 Fault Tolerance | | /4 | |
| 2.2 Error Reporting | | /4 | |
| 2.3 Recoverability | | /4 | |
| 3.1 Token Cost | | /4 | |
| 3.2 Execution Efficiency | | /4 | |
| 4.1 Learnability | | /4 | |
| 4.2 Consistency | | /4 | |
| 4.3 Feedback Quality | | /4 | |
| 4.4 Error Prevention | | /4 | |
| 5.1 Discoverability | | /4 | |
| 5.2 Forgiveness | | /4 | |
| 6.1 Credential Handling | | /4 | |
| 6.2 Input Validation | | /4 | |
| 6.3 Data Safety | | /4 | |
| 7.1 Modularity | | /4 | |
| 7.2 Modifiability | | /4 | |
| 7.3 Testability | | /4 | |
| 8.1 Trigger Precision | | /4 | |
| 8.2 Progressive Disclosure | | /4 | |
| 8.3 Composability | | /4 | |
| 8.4 Idempotency | | /4 | |
| 8.5 Escape Hatches | | /4 | |
| **TOTAL** | | **/100** | |
```
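
Since there are 25 criteria scored 0 to 4, the total is a plain sum out of 100. If the filled-in scores are kept in a dict, a small checker like this sketch can catch missing or out-of-range entries; `total_score` is a hypothetical helper, not part of the bundled scripts.

```python
def total_score(scores):
    """Sum 25 per-criterion scores (each an integer 0-4) into a total out of 100."""
    if len(scores) != 25:
        raise ValueError(f"expected 25 criteria, got {len(scores)}")
    for criterion, value in scores.items():
        if not isinstance(value, int) or not 0 <= value <= 4:
            raise ValueError(f"{criterion}: score must be an integer 0-4, got {value!r}")
    return sum(scores.values())
```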
FILE:scripts/eval-skill.py
#!/usr/bin/env python3
"""Automated skill evaluation — structural and heuristic checks.

Usage:
    python3 eval-skill.py <path-to-skill-directory>
    python3 eval-skill.py <path-to-skill-directory> --json
    python3 eval-skill.py <path-to-skill-directory> --verbose

Checks file structure, SKILL.md quality, script health, and agent-specific heuristics.
Outputs a report with pass/warn/fail per check and an overall structural score.

This covers the AUTOMATABLE portion of the evaluation. The full rubric
(references/rubric.md) includes manual assessment criteria that require
human or agent judgment.
"""
import argparse
import ast
import json
import os
import re
import sys
import yaml
# --- Check infrastructure ---
class CheckResult:
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"

    def __init__(self, name, status, message="", category=""):
        self.name = name
        self.status = status
        self.message = message
        self.category = category

    def to_dict(self):
        return {
            "name": self.name,
            "status": self.status,
            "message": self.message,
            "category": self.category,
        }


def check(name, category="general"):
    """Decorator to register a check function."""
    def decorator(func):
        func._check_name = name
        func._check_category = category
        return func
    return decorator
# --- File structure checks ---
@check("SKILL.md exists", "structure")
def check_skill_md_exists(skill_path):
    path = os.path.join(skill_path, "SKILL.md")
    if os.path.isfile(path):
        return CheckResult("SKILL.md exists", CheckResult.PASS, category="structure")
    return CheckResult("SKILL.md exists", CheckResult.FAIL,
                       "SKILL.md is required", category="structure")
@check("SKILL.md has valid frontmatter", "structure")
def check_frontmatter(skill_path):
    path = os.path.join(skill_path, "SKILL.md")
    if not os.path.isfile(path):
        return CheckResult("SKILL.md has valid frontmatter", CheckResult.FAIL,
                           "SKILL.md not found", category="structure")
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
    # Extract YAML frontmatter
    match = re.match(r'^---\s*\n(.*?)\n---', content, re.DOTALL)
    if not match:
        return CheckResult("SKILL.md has valid frontmatter", CheckResult.FAIL,
                           "No YAML frontmatter found (must start with ---)", category="structure")
    try:
        fm = yaml.safe_load(match.group(1))
    except Exception as e:
        return CheckResult("SKILL.md has valid frontmatter", CheckResult.FAIL,
                           f"Invalid YAML: {e}", category="structure")
    if not isinstance(fm, dict):
        return CheckResult("SKILL.md has valid frontmatter", CheckResult.FAIL,
                           "Frontmatter must be a YAML mapping", category="structure")
    missing = []
    if not fm.get("name"):
        missing.append("name")
    if not fm.get("description"):
        missing.append("description")
    if missing:
        return CheckResult("SKILL.md has valid frontmatter", CheckResult.FAIL,
                           f"Missing required fields: {', '.join(missing)}", category="structure")
    return CheckResult("SKILL.md has valid frontmatter", CheckResult.PASS, category="structure")
@check("Skill name matches directory", "structure")
def check_name_matches_dir(skill_path):
    dir_name = os.path.basename(os.path.abspath(skill_path))
    path = os.path.join(skill_path, "SKILL.md")
    if not os.path.isfile(path):
        return CheckResult("Skill name matches directory", CheckResult.FAIL,
                           "SKILL.md not found", category="structure")
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
    match = re.match(r'^---\s*\n(.*?)\n---', content, re.DOTALL)
    if not match:
        return CheckResult("Skill name matches directory", CheckResult.WARN,
                           "No frontmatter to check", category="structure")
    try:
        fm = yaml.safe_load(match.group(1))
        # A non-mapping frontmatter raises AttributeError here and is reported below.
        name = fm.get("name", "")
    except Exception:
        return CheckResult("Skill name matches directory", CheckResult.WARN,
                           "Could not parse frontmatter", category="structure")
    if name == dir_name:
        return CheckResult("Skill name matches directory", CheckResult.PASS, category="structure")
    return CheckResult("Skill name matches directory", CheckResult.WARN,
                       f"name='{name}' but directory='{dir_name}'", category="structure")
@check("No extraneous files", "structure")
def check_no_extraneous(skill_path):
    """Check for files that shouldn't be in a skill (README, CHANGELOG, etc.)."""
    bad_files = {"README.md", "CHANGELOG.md", "INSTALLATION_GUIDE.md",
                 "QUICK_REFERENCE.md", "LICENSE", "LICENSE.md"}
    # Compare case-insensitively; build the uppercase set once, not once per file.
    bad_upper = {b.upper() for b in bad_files}
    found = [f for f in os.listdir(skill_path) if f.upper() in bad_upper]
    if found:
        return CheckResult("No extraneous files", CheckResult.WARN,
                           f"Found: {', '.join(found)} — skills shouldn't include these",
                           category="structure")
    return CheckResult("No extraneous files", CheckResult.PASS, category="structure")
@check("Resource directories are non-empty", "structure")
def check_resource_dirs(skill_path):
    """If scripts/, references/, or assets/ exist, they should contain files."""
    empty = []
    for d in ("scripts", "references", "assets"):
        dp = os.path.join(skill_path, d)
        if os.path.isdir(dp):
            contents = [f for f in os.listdir(dp) if not f.startswith(".") and f != "__pycache__"]
            if not contents:
                empty.append(d)
    if empty:
        return CheckResult("Resource directories are non-empty", CheckResult.WARN,
                           f"Empty directories: {', '.join(empty)}", category="structure")
    return CheckResult("Resource directories are non-empty", CheckResult.PASS, category="structure")
# --- Description quality ---
@check("Description length adequate", "trigger")
def check_description_length(skill_path):
    fm = _get_frontmatter(skill_path)
    if not fm or not isinstance(fm, dict):
        return CheckResult("Description length adequate", CheckResult.FAIL,
                           "No frontmatter", category="trigger")
    # A null description in YAML loads as None, so normalize to a string first.
    desc = str(fm.get("description") or "")
    # Whitespace splitting undercounts text written without spaces (e.g. Chinese).
    words = len(desc.split())
    if words < 15:
        return CheckResult("Description length adequate", CheckResult.FAIL,
                           f"Description is only {words} words — too short for reliable triggering",
                           category="trigger")
    if words < 30:
        return CheckResult("Description length adequate", CheckResult.WARN,
                           f"Description is {words} words — consider adding trigger contexts",
                           category="trigger")
    if words > 150:
        return CheckResult("Description length adequate", CheckResult.WARN,
                           f"Description is {words} words — may be too long (wastes context in metadata)",
                           category="trigger")
    return CheckResult("Description length adequate", CheckResult.PASS,
                       f"{words} words", category="trigger")
@check("Description includes trigger contexts", "trigger")
def check_trigger_contexts(skill_path):
    """Good descriptions say WHEN to use the skill, not just WHAT it does."""
    fm = _get_frontmatter(skill_path)
    if not fm or not isinstance(fm, dict):
        return CheckResult("Description includes trigger contexts", CheckResult.FAIL,
                           "No frontmatter", category="trigger")
    # A null description in YAML loads as None, so normalize to a string first.
    desc = str(fm.get("description") or "").lower()
    trigger_phrases = ["use when", "use for", "use if", "when the user",
                       "when asked", "when you need", "for tasks like",
                       "such as", "e.g.", "for example"]
    found = [p for p in trigger_phrases if p in desc]
    if not found:
        return CheckResult("Description includes trigger contexts", CheckResult.WARN,
                           "No trigger phrases found — add 'Use when...' to improve activation",
                           category="trigger")
    return CheckResult("Description includes trigger contexts", CheckResult.PASS,
                       f"Found: {', '.join(found[:3])}", category="trigger")
# --- SKILL.md body quality ---
@check("SKILL.md body length", "documentation")
def check_body_length(skill_path):
    path = os.path.join(skill_path, "SKILL.md")
    if not os.path.isfile(path):
        return CheckResult("SKILL.md body length", CheckResult.FAIL, "Not found", category="documentation")
    # Read as UTF-8 explicitly so the result does not depend on the platform default.
    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    # Count lines after frontmatter ("---" separator lines are never counted)
    in_fm = False
    body_lines = 0
    fm_ended = False
    for line in lines:
        if line.strip() == "---":
            if not fm_ended:
                in_fm = not in_fm
                if not in_fm:
                    fm_ended = True
            continue
        if fm_ended:
            body_lines += 1
    if body_lines < 10:
        return CheckResult("SKILL.md body length", CheckResult.FAIL,
                           f"Only {body_lines} lines — too short to be useful", category="documentation")
    if body_lines > 500:
        return CheckResult("SKILL.md body length", CheckResult.WARN,
                           f"{body_lines} lines — consider splitting into reference files", category="documentation")
    return CheckResult("SKILL.md body length", CheckResult.PASS,
                       f"{body_lines} lines", category="documentation")
@check("References are linked from SKILL.md", "documentation")
def check_references_linked(skill_path):
    """If references/ exists, SKILL.md should link to the files."""
    ref_dir = os.path.join(skill_path, "references")
    if not os.path.isdir(ref_dir):
        return CheckResult("References are linked from SKILL.md", CheckResult.PASS,
                           "No references/ directory", category="documentation")
    ref_files = [f for f in os.listdir(ref_dir) if not f.startswith(".")]
    if not ref_files:
        return CheckResult("References are linked from SKILL.md", CheckResult.PASS,
                           "No reference files", category="documentation")
    skill_md = os.path.join(skill_path, "SKILL.md")
    with open(skill_md, "r", encoding="utf-8") as f:
        content = f.read()
    # A reference counts as linked if its file name appears anywhere in SKILL.md.
    unlinked = [f for f in ref_files if f not in content]
    if unlinked:
        return CheckResult("References are linked from SKILL.md", CheckResult.WARN,
                           f"Unlinked references: {', '.join(unlinked)}", category="documentation")
    return CheckResult("References are linked from SKILL.md", CheckResult.PASS, category="documentation")
# --- Script health ---
@check("Python scripts parse without errors", "scripts")
def check_python_syntax(skill_path):
    scripts_dir = os.path.join(skill_path, "scripts")
    if not os.path.isdir(scripts_dir):
        return CheckResult("Python scripts parse without errors", CheckResult.PASS,
                           "No scripts/ directory", category="scripts")
    errors = []
    checked = 0
    for f in os.listdir(scripts_dir):
        if f.endswith(".py"):
            checked += 1
            fpath = os.path.join(scripts_dir, f)
            try:
                with open(fpath, "r", encoding="utf-8") as fh:
                    ast.parse(fh.read(), filename=f)
            except SyntaxError as e:
                errors.append(f"{f}:{e.lineno}: {e.msg}")
    if not checked:
        return CheckResult("Python scripts parse without errors", CheckResult.PASS,
                           "No Python scripts", category="scripts")
    if errors:
        return CheckResult("Python scripts parse without errors", CheckResult.FAIL,
                           "\n".join(errors), category="scripts")
    return CheckResult("Python scripts parse without errors", CheckResult.PASS,
                       f"{checked} script(s) OK", category="scripts")
@check("Scripts use no external dependencies", "scripts")
def check_no_ext_deps(skill_path):
    """Check that Python scripts only import stdlib modules."""
    scripts_dir = os.path.join(skill_path, "scripts")
    if not os.path.isdir(scripts_dir):
        return CheckResult("Scripts use no external dependencies", CheckResult.PASS,
                           "No scripts/", category="scripts")
    # Common stdlib modules (not exhaustive, but covers common ones)
    stdlib = {
        "abc", "argparse", "ast", "base64", "bisect", "calendar", "cgi",
        "cmd", "codecs", "collections", "configparser", "contextlib", "copy",
        "csv", "dataclasses", "datetime", "decimal", "difflib", "email",
        "enum", "errno", "fnmatch", "fractions", "ftplib", "functools",
        "getopt", "getpass", "glob", "gzip", "hashlib", "heapq", "hmac",
        "html", "http", "imaplib", "importlib", "inspect", "io", "ipaddress",
        "itertools", "json", "keyword", "linecache", "locale", "logging",
        "lzma", "math", "mimetypes", "multiprocessing", "operator", "os",
        "pathlib", "pickle", "platform", "plistlib", "pprint", "profile",
        "queue", "random", "re", "readline", "reprlib", "secrets",
        "select", "shelve", "shlex", "shutil", "signal", "smtplib",
        "socket", "sqlite3", "ssl", "stat", "statistics", "string",
        "struct", "subprocess", "sys", "syslog", "tarfile", "tempfile",
        "textwrap", "threading", "time", "timeit", "token", "tokenize",
        "tomllib", "traceback", "types", "typing", "unicodedata",
        "unittest", "urllib", "uuid", "venv", "warnings", "weakref",
        "webbrowser", "xml", "xmlrpc", "zipfile", "zipimport", "zlib",
        "_thread", "__future__",
    }
    ext_deps = []
    for f in os.listdir(scripts_dir):
        if not f.endswith(".py"):
            continue
        fpath = os.path.join(scripts_dir, f)
        try:
            with open(fpath, "r", encoding="utf-8") as fh:
                tree = ast.parse(fh.read())
        except SyntaxError:
            # Syntax errors are reported by check_python_syntax; skip the file here.
            continue
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    top = alias.name.split(".")[0]
                    if top not in stdlib:
                        ext_deps.append(f"{f}: import {alias.name}")
            elif isinstance(node, ast.ImportFrom):
                # node.module is None for relative imports such as "from . import x".
                if node.module:
                    top = node.module.split(".")[0]
                    if top not in stdlib:
                        ext_deps.append(f"{f}: from {node.module}")
    if ext_deps:
        return CheckResult("Scripts use no external dependencies", CheckResult.WARN,
                           f"Possible external deps: {'; '.join(ext_deps[:5])}",
                           category="scripts")
    return CheckResult("Scripts use no external dependencies", CheckResult.PASS, category="scripts")
@check("No hardcoded credentials or emails", "security")
def check_no_hardcoded_secrets(skill_path):
    """Scan for common patterns: API keys, emails, tokens in source files."""
    patterns = [
        (r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', "email address"),
        (r'(?:api[_-]?key|token|secret|password)\s*[=:]\s*["\'][^"\']{8,}', "possible credential"),
        (r'sk-[a-zA-Z0-9]{20,}', "OpenAI-style API key"),
        (r'ghp_[a-zA-Z0-9]{36}', "GitHub PAT"),
    ]
    # Substrings that mark an email as a placeholder or a well-known safe address.
    safe_domains = (
        "example.com", "example.org",
        "outlook.com", "hotmail.com",
        "googlegroups.com",
        "iam.gserviceaccount.com",
        "placeholder", "your",
    )
    scanned_exts = (".py", ".md", ".sh", ".js", ".ts", ".yaml", ".yml", ".json", ".toml")
    findings = []
    for root, dirs, files in os.walk(skill_path):
        # Prune directories in place so os.walk does not descend into them.
        dirs[:] = [d for d in dirs if d not in ("__pycache__", ".git", "node_modules")]
        for f in files:
            if not f.endswith(scanned_exts):
                continue
            fpath = os.path.join(root, f)
            rel = os.path.relpath(fpath, skill_path)
            try:
                with open(fpath, "r", encoding="utf-8", errors="ignore") as fh:
                    for i, line in enumerate(fh, 1):
                        for pattern, desc in patterns:
                            match = re.search(pattern, line)
                            if not match:
                                continue
                            if desc == "email address":
                                email = match.group().lower()
                                # Skip well-known safe patterns
                                if any(d in email for d in safe_domains):
                                    continue
                            findings.append(f"{rel}:{i}: {desc}")
            except OSError:
                pass
    if findings:
        # Deduplicate, then report the count of distinct findings and the first 10.
        unique = list(dict.fromkeys(findings))
        return CheckResult("No hardcoded credentials or emails", CheckResult.WARN,
                           f"Found {len(unique)} potential issues:\n" + "\n".join(unique[:10]),
                           category="security")
    return CheckResult("No hardcoded credentials or emails", CheckResult.PASS, category="security")
@check("Environment variables documented", "security")
def check_env_vars_documented(skill_path):
    """If scripts reference os.environ, SKILL.md should document those vars."""
    scripts_dir = os.path.join(skill_path, "scripts")
    if not os.path.isdir(scripts_dir):
        return CheckResult("Environment variables documented", CheckResult.PASS,
                           "No scripts/", category="security")
    env_vars = set()
    for f in os.listdir(scripts_dir):
        if not f.endswith(".py"):
            continue
        fpath = os.path.join(scripts_dir, f)
        with open(fpath, "r", encoding="utf-8", errors="ignore") as fh:
            content = fh.read()
        # Match os.environ.get("VAR") and os.environ["VAR"] (os.getenv is not covered)
        for m in re.finditer(r'os\.environ(?:\.get)?\s*[\[(]\s*["\'](\w+)', content):
            var = m.group(1)
            # Skip obviously dummy/example variable names
            if var not in ("VAR", "KEY", "VALUE", "NAME", "ENV"):
                env_vars.add(var)
    if not env_vars:
        return CheckResult("Environment variables documented", CheckResult.PASS,
                           "No env vars found in scripts", category="security")
    skill_md = os.path.join(skill_path, "SKILL.md")
    # A missing SKILL.md means nothing is documented, not that the check crashed.
    skill_content = ""
    if os.path.isfile(skill_md):
        with open(skill_md, "r", encoding="utf-8") as f:
            skill_content = f.read()
    undocumented = [v for v in env_vars if v not in skill_content]
    if undocumented:
        return CheckResult("Environment variables documented", CheckResult.WARN,
                           f"Undocumented env vars: {', '.join(sorted(undocumented))}",
                           category="security")
    return CheckResult("Environment variables documented", CheckResult.PASS,
                       f"All {len(env_vars)} env vars documented", category="security")
# --- Helpers ---
def _get_frontmatter(skill_path):
    """Return the parsed YAML frontmatter of SKILL.md, or None if missing or invalid."""
    path = os.path.join(skill_path, "SKILL.md")
    if not os.path.isfile(path):
        return None
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
    match = re.match(r'^---\s*\n(.*?)\n---', content, re.DOTALL)
    if not match:
        return None
    try:
        return yaml.safe_load(match.group(1))
    except Exception:
        return None
# --- Runner ---
ALL_CHECKS = [
check_skill_md_exists,
check_frontmatter,
check_name_matches_dir,
check_no_extraneous,
check_resource_dirs,
check_description_length,
check_trigger_contexts,
check_body_length,
check_references_linked,
check_python_syntax,
check_no_ext_deps,
check_no_hardcoded_secrets,
check_env_vars_documented,
]
def run_checks(skill_path, verbose=False):
    results = []
    for check_fn in ALL_CHECKS:
        try:
            result = check_fn(skill_path)
            results.append(result)
        except Exception as e:
            # A crashing check is reported as a failure instead of aborting the run.
            results.append(CheckResult(
                check_fn._check_name, CheckResult.FAIL,
                f"Check crashed: {e}", check_fn._check_category
            ))
    return results
def print_report(results, skill_path, verbose=False):
    counts = {CheckResult.PASS: 0, CheckResult.WARN: 0, CheckResult.FAIL: 0}
    by_category = {}
    for r in results:
        counts[r.status] += 1
        by_category.setdefault(r.category, []).append(r)
    skill_name = os.path.basename(os.path.abspath(skill_path))
    print(f"\n📋 Skill Evaluation: {skill_name}")
    print(f"{'=' * 50}")
    print(f"Path: {os.path.abspath(skill_path)}")
    print()
    icons = {CheckResult.PASS: "✅", CheckResult.WARN: "⚠️ ", CheckResult.FAIL: "❌"}
    for cat in ["structure", "trigger", "documentation", "scripts", "security"]:
        if cat not in by_category:
            continue
        print(f" [{cat.upper()}]")
        for r in by_category[cat]:
            icon = icons[r.status]
            print(f" {icon} {r.name}")
            # Messages for passing checks are shown only in verbose mode.
            if r.message and (verbose or r.status != CheckResult.PASS):
                for line in r.message.split("\n"):
                    print(f" {line}")
        print()
    print(f"{'=' * 50}")
    print(f" ✅ Pass: {counts[CheckResult.PASS]} "
          f"⚠️ Warn: {counts[CheckResult.WARN]} "
          f"❌ Fail: {counts[CheckResult.FAIL]}")
    total = len(results)
    score = counts[CheckResult.PASS] / total * 100 if total else 0
    print(f" Structural score: {score:.0f}% ({counts[CheckResult.PASS]}/{total} checks passed)")
    print()
    print(" ⓘ This covers automated/structural checks only.")
    print(" For the full evaluation, use the manual rubric in references/rubric.md")
    print()
def main():
    parser = argparse.ArgumentParser(description="Evaluate an OpenClaw skill")
    parser.add_argument("path", help="Path to skill directory")
    parser.add_argument("--json", action="store_true", help="JSON output")
    parser.add_argument("--verbose", "-v", action="store_true", help="Show details for passing checks too")
    args = parser.parse_args()
    if not os.path.isdir(args.path):
        print(f"Error: '{args.path}' is not a directory", file=sys.stderr)
        sys.exit(1)
    results = run_checks(args.path, verbose=args.verbose)
    if args.json:
        output = {
            "skill": os.path.basename(os.path.abspath(args.path)),
            "path": os.path.abspath(args.path),
            "checks": [r.to_dict() for r in results],
            "summary": {
                "pass": sum(1 for r in results if r.status == CheckResult.PASS),
                "warn": sum(1 for r in results if r.status == CheckResult.WARN),
                "fail": sum(1 for r in results if r.status == CheckResult.FAIL),
            }
        }
        # ensure_ascii=False keeps non-ASCII check messages readable in the JSON.
        print(json.dumps(output, indent=2, ensure_ascii=False))
    else:
        print_report(results, args.path, verbose=args.verbose)


if __name__ == "__main__":
    main()
FILE:scripts/generate_leaderboard.py
#!/usr/bin/env python3
"""
Generate an HTML leaderboard from skill cards.

Usage:
    python3 generate_leaderboard.py --cards-dir skill-cards --output leaderboard/index.html

This script reads all skill card markdown files from the cards directory
and generates an HTML leaderboard sorted by overall score (highest first).
"""
import argparse
import os
import re
import sys
from pathlib import Path
def parse_skill_card(filepath):
    """Parse a skill card markdown file and extract metadata."""
    try:
        with open(filepath, "r", encoding="utf-8") as f:
            content = f.read()
    except (OSError, UnicodeDecodeError):
        return None
    card = {
        "file": os.path.basename(filepath),
        "path": str(filepath),
        "name": "Unknown",
        "slug": "unknown",
        "eval_date": "N/A",
        "model": "N/A",
        "overall_score": 0,
        "quality": 0,
        "delta": 0,
        "efficiency": 0,
        "recommendation": "N/A",
        "strengths": [],
        "weaknesses": [],
    }
    # Extract the "**Label**: value" metadata fields (skill name, slug, eval date, model)
    for label, key in (("Skill", "name"), ("Slug", "slug"),
                       ("Eval Date", "eval_date"), ("Model", "model")):
        match = re.search(rf'\*\*{re.escape(label)}\*\*:\s*(.+)', content)
        if match:
            card[key] = match.group(1).strip()
    # Extract overall score
    match = re.search(r'\*\*Overall:\s*(\d+)/10\*\*', content)
    if match:
        card["overall_score"] = int(match.group(1))
    # Extract component scores from table rows like "| Quality | 4 |"
    component_keys = {"Quality": "quality", "Delta": "delta", "Efficiency": "efficiency"}
    for name, score in re.findall(r'\|\s*(\w+)\s*\|\s*(\d+)\s*\|', content):
        if name in component_keys:
            card[component_keys[name]] = int(score)
    # Extract recommendation. The warning sign may carry a variation selector
    # (U+FE0F), so it cannot be matched as one member of a character class.
    match = re.search(
        r'#{1,3}\s*(?:✅|⚠\ufe0f?|⚡|❌)\s*'
        r'\*\*(Recommended|Conditional|Marginal|Not Recommended)\*\*', content)
    if match:
        card["recommendation"] = match.group(1)
    # Extract strengths (\Z also matches when the section is the last one in the file)
    strengths_section = re.search(r'## Strengths\s*\n\s*\n(.*?)(?=\n##|\n---|\Z)', content, re.DOTALL)
    if strengths_section:
        strengths = re.findall(r'✅\s*(.+)', strengths_section.group(1))
        card["strengths"] = [s.strip() for s in strengths]
    # Extract weaknesses
    weaknesses_section = re.search(r'## Weaknesses\s*\n\s*\n(.*?)(?=\n##|\n---|\Z)', content, re.DOTALL)
    if weaknesses_section:
        weaknesses = re.findall(r'❌\s*(.+)', weaknesses_section.group(1))
        card["weaknesses"] = [w.strip() for w in weaknesses]
    return card
def generate_html(cards, args):
    """Generate HTML leaderboard from cards data."""
    # Imported locally because the list below is also named "html".
    from html import escape

    # Sort cards by overall score (descending)
    cards.sort(key=lambda x: x.get("overall_score", 0), reverse=True)
    # Card text comes from markdown files, so escape it before embedding in HTML.
    title = escape(args.title)
    html = []
    html.append("<!DOCTYPE html>")
    html.append("<html lang='zh-CN'>")
    html.append("<head>")
    html.append(" <meta charset='UTF-8'>")
    html.append(" <meta name='viewport' content='width=device-width, initial-scale=1.0'>")
    html.append(f" <title>Skill Leaderboard | {title}</title>")
    html.append(" <style>")
    html.append(" * { box-sizing: border-box; margin: 0; padding: 0; }")
    html.append(" body { font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;")
    html.append(" background: #f5f5f5; color: #333; padding: 20px; }")
    html.append(" .container { max-width: 1200px; margin: 0 auto; }")
    html.append(" h1 { text-align: center; margin-bottom: 30px; color: #1a1a2e; }")
    html.append(" .stats { display: flex; justify-content: center; gap: 40px; margin-bottom: 30px; }")
    html.append(" .stat { text-align: center; }")
    html.append(" .stat-value { font-size: 2em; font-weight: bold; color: #4361ee; }")
    html.append(" .stat-label { color: #666; font-size: 0.9em; }")
    html.append(" table { width: 100%; background: white; border-radius: 10px;")
    html.append(" box-shadow: 0 2px 10px rgba(0,0,0,0.1); overflow: hidden; }")
    html.append(" th { background: #4361ee; color: white; padding: 15px; text-align: left; }")
    html.append(" td { padding: 12px 15px; border-bottom: 1px solid #eee; }")
    html.append(" tr:hover { background: #f8f9fa; }")
    html.append(" .score { font-weight: bold; font-size: 1.2em; }")
    html.append(" .score-high { color: #22c55e; }")
    html.append(" .score-mid { color: #f59e0b; }")
    html.append(" .score-low { color: #ef4444; }")
    html.append(" .badge { display: inline-block; padding: 4px 10px; border-radius: 20px;")
    html.append(" font-size: 0.8em; font-weight: 500; }")
    html.append(" .badge-recommended { background: #dcfce7; color: #166534; }")
    html.append(" .badge-conditional { background: #fef9c3; color: #854d0e; }")
    html.append(" .badge-marginal { background: #fed7aa; color: #9a3412; }")
    html.append(" .badge-not-recommended { background: #fee2e2; color: #991b1b; }")
    html.append(" .strengths, .weaknesses { font-size: 0.85em; color: #666; }")
    html.append(" .sort-btn { background: none; border: none; color: inherit; cursor: pointer;")
    html.append(" font-weight: inherit; padding: 0; }")
    html.append(" .sort-btn:hover { color: white; }")
    html.append(" .footer { text-align: center; margin-top: 20px; color: #666; font-size: 0.9em; }")
    html.append(" </style>")
    html.append("</head>")
    html.append("<body>")
    html.append(" <div class='container'>")
    html.append(f" <h1>{title}</h1>")
    html.append(" <div class='stats'>")
    html.append(f" <div class='stat'><div class='stat-value'>{len(cards)}</div><div class='stat-label'>已评估技能</div></div>")
    recommended = sum(1 for c in cards if c.get("recommendation") == "Recommended")
    html.append(f" <div class='stat'><div class='stat-value'>{recommended}</div><div class='stat-label'>推荐使用</div></div>")
    # max(..., 1) avoids division by zero for an empty leaderboard.
    avg_score = sum(c.get("overall_score", 0) for c in cards) / max(len(cards), 1)
    html.append(f" <div class='stat'><div class='stat-value'>{avg_score:.1f}</div><div class='stat-label'>平均分</div></div>")
    html.append(" </div>")
    html.append(" <table id='leaderboard'>")
    html.append(" <thead>")
    html.append(" <tr>")
    html.append(" <th>#</th>")
    html.append(" <th>技能名称</th>")
    html.append(" <th>Slug</th>")
    html.append(" <th>总分</th>")
    html.append(" <th>质量</th>")
    html.append(" <th>提升</th>")
    html.append(" <th>效率</th>")
    html.append(" <th>推荐</th>")
    html.append(" <th>评估日期</th>")
    html.append(" <th>优劣势</th>")
    html.append(" </tr>")
    html.append(" </thead>")
    html.append(" <tbody>")
    for i, card in enumerate(cards, 1):
        score = card.get("overall_score", 0)
        if score >= 7:
            score_class = "score-high"
        elif score >= 5:
            score_class = "score-mid"
        else:
            score_class = "score-low"
        rec = card.get("recommendation", "N/A")
        badge_class = f"badge-{rec.lower().replace(' ', '-')}"
        # Show at most two strengths and two weaknesses per row.
        strengths = escape(", ".join(card["strengths"][:2])) if card.get("strengths") else "-"
        weaknesses = escape(", ".join(card["weaknesses"][:2])) if card.get("weaknesses") else "-"
        html.append(" <tr>")
        html.append(f" <td>{i}</td>")
        html.append(f" <td><strong>{escape(card.get('name', 'Unknown'))}</strong></td>")
        html.append(f" <td>{escape(card.get('slug', 'N/A'))}</td>")
        html.append(f" <td class='score {score_class}'>{score}/10</td>")
        html.append(f" <td>{card.get('quality', 0)}/5</td>")
        html.append(f" <td>{card.get('delta', 0)}/3</td>")
        html.append(f" <td>{card.get('efficiency', 0)}/2</td>")
        html.append(f" <td><span class='badge {badge_class}'>{escape(rec)}</span></td>")
        html.append(f" <td>{escape(card.get('eval_date', 'N/A'))}</td>")
        html.append(f" <td><div class='strengths'>👍 {strengths}</div><div class='weaknesses'>👎 {weaknesses}</div></td>")
        html.append(" </tr>")
    html.append(" </tbody>")
    html.append(" </table>")
    html.append(f" <div class='footer'>Generated by multi-skill-eval v1.0.0 | {len(cards)} skills evaluated</div>")
    html.append(" </div>")
    html.append("</body>")
    html.append("</html>")
    return "\n".join(html)
def main():
    parser = argparse.ArgumentParser(
        description="Generate an HTML leaderboard from skill cards"
    )
    parser.add_argument(
        "--cards-dir", "-c", required=True,
        help="Directory containing skill card markdown files"
    )
    parser.add_argument(
        "--output", "-o", required=True,
        help="Output HTML file path"
    )
    parser.add_argument(
        "--title", "-t", default="OpenClaw Skill Leaderboard",
        help="Leaderboard title"
    )
    args = parser.parse_args()
    cards_dir = Path(args.cards_dir)
    if not cards_dir.exists():
        print(f"Error: Cards directory not found: {cards_dir}", file=sys.stderr)
        sys.exit(1)
    # Find all markdown files
    card_files = list(cards_dir.glob("*.md"))
    if not card_files:
        # Not fatal: an empty leaderboard is still generated.
        print(f"Warning: No skill card files found in {cards_dir}", file=sys.stderr)
    # Parse each card, skipping files that cannot be read
    cards = []
    for filepath in card_files:
        card = parse_skill_card(filepath)
        if card:
            cards.append(card)
    # Generate HTML
    html = generate_html(cards, args)
    # Write output
    output_path = Path(args.output)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(html)
    print(f"Leaderboard generated: {args.output}")
    print(f" Total skills: {len(cards)}")


if __name__ == "__main__":
    main()
FILE:scripts/generate_skill_card.py
#!/usr/bin/env python3
"""
Generate a standardized Skill Card from benchmark results.
Usage:
python3 generate_skill_card.py \
--workspace /path/to/results \
--skill-name "My Skill" \
--skill-slug my-skill \
--eval-model claude-sonnet-4 \
--output skill-cards/my-skill-v1.md
Skill Card Contents:
- Metadata: name, source, eval date, model, engine version
- Overall score 0-10 (Quality 0-5 + Delta 0-3 + Efficiency 0-2)
- With-skill vs without-skill comparison table
- Per-test-case breakdown with assertions, timing, grading
- Strengths / Weaknesses
- Recommendation: Recommended / Conditional / Marginal / Not Recommended
"""
import argparse
import json
import os
import sys
import time
from datetime import datetime
from pathlib import Path
def load_benchmark_data(workspace):
"""Load all benchmark data from workspace."""
data = {
"grading": None,
"benchmark": None,
"preflight": None,
}
# Load grading.json
grading_path = os.path.join(workspace, "grading.json")
if os.path.exists(grading_path):
with open(grading_path, "r", encoding="utf-8") as f:
data["grading"] = json.load(f)
# Load benchmark aggregation if exists
benchmark_path = os.path.join(workspace, "benchmark-aggregation.json")
if os.path.exists(benchmark_path):
with open(benchmark_path, "r", encoding="utf-8") as f:
data["benchmark"] = json.load(f)
# Load preflight results if exists
preflight_path = os.path.join(workspace, "preflight.json")
if os.path.exists(preflight_path):
with open(preflight_path, "r", encoding="utf-8") as f:
data["preflight"] = json.load(f)
return data
def calculate_overall_score(grading, benchmark, preflight):
    """
    Calculate overall score 0-10:
    - Quality: 0-5 based on grading pass rate
    - Delta: 0-3 based on with_skill vs without_skill improvement
    - Efficiency: 0-2 based on the with-skill/without-skill time ratio
    """
    quality = 0
    delta = 0
    efficiency = 2  # Defaults to the maximum when no benchmark timing is available
    # Quality score from grading (0-5)
    if grading and grading.get("summary", {}).get("total", 0) > 0:
        pass_rate = grading["summary"].get("pass_rate", 0)
        quality = int(pass_rate * 5)
    if benchmark:
        benchmark_delta = benchmark.get("delta", {})
        # Delta score from benchmark (0-3)
        pass_rate_delta = benchmark_delta.get("pass_rate", "+0.00")
        # Parse delta value (e.g., "+0.15" -> 0.15); str() also accepts plain numbers
        try:
            delta_value = float(str(pass_rate_delta).replace("+", ""))
        except ValueError:
            delta_value = 0.0
        if delta_value >= 0.3:
            delta = 3
        elif delta_value >= 0.2:
            delta = 2
        elif delta_value >= 0.1:
            delta = 1
        # Efficiency penalty for high-overhead skills (e.g., "2.5x" -> 2.5)
        time_delta = benchmark_delta.get("time", "1x")
        try:
            time_ratio = float(str(time_delta).replace("x", "").replace("+", ""))
        except ValueError:
            time_ratio = 1.0
        if time_ratio > 3:
            efficiency = 0
        elif time_ratio > 2:
            efficiency = 1
    overall = min(quality + delta + efficiency, 10)
    return overall, {
        "quality": quality,
        "delta": delta,
        "efficiency": efficiency,
        "overall": overall,
    }
def get_recommendation(overall_score, grading, benchmark, preflight):
"""Determine recommendation based on overall score."""
if overall_score >= 7:
return "Recommended"
elif overall_score >= 5:
return "Conditional"
elif overall_score >= 3:
return "Marginal"
else:
return "Not Recommended"
def detect_strengths(grading, benchmark, preflight):
    """Identify skill strengths from data."""
    strengths = []
    if grading:
        # Map each passed assertion category to a strength label
        for assertion in grading.get("assertions", []):
            if assertion.get("passed"):
                cat = assertion.get("category", "")
                if cat == "quality":
                    strengths.append("输出质量达标")
                elif cat == "format":
                    strengths.append("格式规范")
                elif cat == "security":
                    strengths.append("安全性良好")
    if benchmark:
        # Check for a high with-skill pass rate
        if benchmark.get("with_skill", {}).get("pass_rate", 0) > 0.8:
            strengths.append("高任务完成率")
    if preflight:
        # Only claim dependency health when preflight reports it explicitly,
        # so the card never contradicts its own "All Available" line
        deps = preflight.get("dependencies", {})
        if deps.get("all_available", False):
            strengths.append("依赖完整可用")
    # Deduplicate while preserving order (set() ordering is arbitrary), max 5
    return list(dict.fromkeys(strengths))[:5]
def detect_weaknesses(grading, benchmark, preflight):
    """Identify skill weaknesses from data."""
    weaknesses = []
    if grading:
        # Check for failed assertions
        for assertion in grading.get("assertions", []):
            if not assertion.get("passed"):
                cat = assertion.get("category", "")
                if cat == "format":
                    weaknesses.append("格式不规范")
                elif cat == "security":
                    weaknesses.append("安全性问题")
                elif cat == "completeness":
                    weaknesses.append("输出不完整")
    if preflight:
        # Check for phantom tooling
        phantom = preflight.get("phantom_tooling", [])
        if phantom:
            weaknesses.append(f"幽灵工具: {', '.join(phantom[:2])}")
        # Check for unverified claims
        claims = preflight.get("unverified_claims", [])
        if claims:
            weaknesses.append("存在未验证的营销声明")
    if benchmark:
        # Check for high overhead
        delta = benchmark.get("delta", {})
        time_ratio = delta.get("time", "1x")
        try:
            ratio = float(str(time_ratio).replace("x", "").replace("+", ""))
            if ratio > 2:
                weaknesses.append(f"开销较高: {time_ratio}")
        except ValueError:
            # Unparseable ratio: skip the overhead check instead of hiding all errors
            pass
    # Deduplicate while preserving order, max 5
    return list(dict.fromkeys(weaknesses))[:5]
def generate_skill_card(args):
"""Generate the complete skill card."""
workspace = os.path.abspath(args.workspace)
data = load_benchmark_data(workspace)
# Calculate scores
overall_score, score_breakdown = calculate_overall_score(
data.get("grading"),
data.get("benchmark"),
data.get("preflight")
)
recommendation = get_recommendation(
overall_score,
data.get("grading"),
data.get("benchmark"),
data.get("preflight")
)
strengths = detect_strengths(
data.get("grading"),
data.get("benchmark"),
data.get("preflight")
)
weaknesses = detect_weaknesses(
data.get("grading"),
data.get("benchmark"),
data.get("preflight")
)
# Build skill card
card = []
card.append("# Skill Card")
card.append("")
card.append(f"**Skill**: {args.skill_name}")
card.append(f"**Slug**: {args.skill_slug}")
card.append(f"**Eval Date**: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
card.append(f"**Model**: {args.eval_model}")
card.append("")
card.append("---")
card.append("")
card.append("## Metadata")
card.append("")
card.append(f"| Field | Value |")
card.append(f"|-------|-------|")
card.append(f"| Skill Name | {args.skill_name} |")
card.append(f"| Skill Slug | {args.skill_slug} |")
card.append(f"| Evaluation Date | {datetime.now().strftime('%Y-%m-%d')} |")
card.append(f"| Evaluator Model | {args.eval_model} |")
card.append(f"| Engine Version | multi-skill-eval v1.0.0 |")
if args.source:
card.append(f"| Source | {args.source} |")
card.append("")
card.append("## Overall Score")
card.append("")
card.append(f"**Overall: {overall_score}/10**")
card.append("")
card.append(f"| Component | Score | Max |")
card.append(f"|-----------|-------|-----|")
card.append(f"| Quality | {score_breakdown['quality']} | 5 |")
card.append(f"| Delta | {score_breakdown['delta']} | 3 |")
card.append(f"| Efficiency | {score_breakdown['efficiency']} | 2 |")
card.append(f"| **Total** | **{overall_score}** | **10** |")
card.append("")
card.append("### Score Breakdown")
card.append("")
card.append("- **Quality (0-5)**: Based on grading pass rate")
card.append("- **Delta (0-3)**: Based on with-skill vs without-skill improvement")
card.append("- **Efficiency (0-2)**: Based on time/token cost ratio")
card.append("")
card.append("---")
card.append("")
card.append("## With-Skill vs Without-Skill Comparison")
card.append("")
if data.get("benchmark"):
bm = data["benchmark"]
card.append(f"| Metric | With-Skill | Without-Skill | Delta |")
card.append(f"|---------|-----------|---------------|-------|")
card.append(f"| Pass Rate | {bm.get('with_skill', {}).get('pass_rate', 'N/A')} | {bm.get('without_skill', {}).get('pass_rate', 'N/A')} | {bm.get('delta', {}).get('pass_rate', 'N/A')} |")
card.append(f"| Avg Time | {bm.get('with_skill', {}).get('avg_time', 'N/A')} | {bm.get('without_skill', {}).get('avg_time', 'N/A')} | {bm.get('delta', {}).get('time', 'N/A')} |")
card.append(f"| Avg Tokens | {bm.get('with_skill', {}).get('avg_tokens', 'N/A')} | {bm.get('without_skill', {}).get('avg_tokens', 'N/A')} | {bm.get('delta', {}).get('tokens', 'N/A')} |")
else:
card.append("*No benchmark data available.*")
card.append("")
card.append("---")
card.append("")
card.append("## Per-Test-Case Breakdown")
card.append("")
if data.get("grading"):
assertions = data["grading"].get("assertions", [])
by_category = {}
for a in assertions:
cat = a.get("category", "other")
by_category.setdefault(cat, []).append(a)
for cat, items in by_category.items():
card.append(f"### {cat.upper()}")
card.append("")
for item in items:
status = "✅" if item.get("passed") else "❌"
card.append(f"- {status} {item.get('text', '')}")
if item.get('evidence'):
card.append(f" - Evidence: {item.get('evidence')}")
card.append("")
else:
card.append("*No per-test data available.*")
card.append("")
card.append("---")
card.append("")
card.append("## Strengths")
card.append("")
if strengths:
for s in strengths:
card.append(f"- ✅ {s}")
else:
card.append("*No specific strengths identified.*")
card.append("")
card.append("## Weaknesses")
card.append("")
if weaknesses:
for w in weaknesses:
card.append(f"- ❌ {w}")
else:
card.append("*No specific weaknesses identified.*")
card.append("")
card.append("---")
card.append("")
card.append("## Recommendation")
card.append("")
rec_emoji = {
"Recommended": "✅",
"Conditional": "⚠️",
"Marginal": "⚡",
"Not Recommended": "❌"
}
rec_text = {
"Recommended": "该技能已通过基准测试,推荐使用。",
"Conditional": "该技能在特定场景下可用,需注意已知问题。",
"Marginal": "该技能存在明显局限,建议修复后再使用。",
"Not Recommended": "该技能未通过基准测试,暂不推荐使用。"
}
card.append(f"### {rec_emoji.get(recommendation, '❓')} **{recommendation}**")
card.append("")
card.append(rec_text.get(recommendation, ""))
card.append("")
# Dependencies section
if data.get("preflight"):
card.append("---")
card.append("")
card.append("## Dependencies")
card.append("")
deps = data["preflight"].get("dependencies", {})
if deps:
card.append(f"- Required: {', '.join(deps.get('required', []))}")
card.append(f"- Optional: {', '.join(deps.get('optional', []))}")
card.append(f"- All Available: {'Yes' if deps.get('all_available') else 'No'}")
else:
card.append("*No dependency information available.*")
card.append("")
return "\n".join(card)
def main():
parser = argparse.ArgumentParser(
description="Generate a standardized Skill Card from benchmark results"
)
parser.add_argument(
"--workspace", "-w", required=True,
help="Path to benchmark results workspace"
)
parser.add_argument(
"--skill-name", "-n", required=True,
help="Name of the skill"
)
parser.add_argument(
"--skill-slug", "-s", required=True,
help="Slug/identifier for the skill"
)
parser.add_argument(
"--eval-model", "-m", default="minimax/MiniMax-M2",
help="Model used for evaluation"
)
parser.add_argument(
"--source",
help="Source of the skill (URL, path, etc.)"
)
parser.add_argument(
"--output", "-o",
help="Output file path (default: stdout)"
)
args = parser.parse_args()
card = generate_skill_card(args)
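    if args.output:
        # dirname is "" for a bare filename, and os.makedirs("") raises FileNotFoundError
        out_dir = os.path.dirname(args.output)
        if out_dir:
            os.makedirs(out_dir, exist_ok=True)
        with open(args.output, "w", encoding="utf-8") as f:
            f.write(card)
        print(f"Skill card generated: {args.output}")
    else:
        print(card)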
if __name__ == "__main__":
main()
FILE:scripts/grade-assertions.py
#!/usr/bin/env python3
"""
Assertion-based grading for skill benchmark results.
Usage:
python3 grade-assertions.py --workspace /path/to/results
python3 grade-assertions.py --workspace /path/to/results --verbose
This script reads benchmark execution results from a workspace directory
and grades them against defined assertions.
Workspace structure expected:
/path/to/results/
grading-config.json # Assertion definitions
iteration-1/
test-name/
with_skill/outputs/
result.md
without_skill/outputs/
result.md
Output:
grading.json with pass/fail per assertion and summary statistics.
"""
import argparse
import json
import os
import re
import sys
from pathlib import Path
# --- Assertion Types ---
class AssertionResult:
def __init__(self, assertion_type, text, passed, evidence="", category=""):
self.type = assertion_type
self.text = text
self.passed = passed
self.evidence = evidence
self.category = category
def to_dict(self):
return {
"type": self.type,
"text": self.text,
"passed": self.passed,
"evidence": self.evidence,
"category": self.category,
}
def check_keyword_absent(content, keyword, case_sensitive=False):
"""Assertion: a banned keyword should NOT appear in output."""
if case_sensitive:
found = keyword in content
else:
found = keyword.lower() in content.lower()
return not found
def check_keyword_present(content, keyword, case_sensitive=False):
"""Assertion: a required keyword MUST appear in output."""
if case_sensitive:
found = keyword in content
else:
found = keyword.lower() in content.lower()
return found
def check_required_section(content, section_marker):
"""Assertion: a required section header must appear in output."""
# Match markdown headers like "## Section Name" or "### Section"
pattern = rf"^#+\s+{re.escape(section_marker)}"
return bool(re.search(pattern, content, re.MULTILINE))
def check_file_exists(filepath):
"""Assertion: a file must exist."""
return os.path.isfile(filepath)
def check_format_json(content):
"""Assertion: content must be valid JSON."""
try:
json.loads(content)
return True
except (json.JSONDecodeError, TypeError):
return False
def check_format_markdown(content, min_lines=5):
"""Assertion: content must be valid markdown with minimum lines."""
lines = content.strip().split("\n")
has_headers = any(re.match(r"^#+ ", line) for line in lines)
return len(lines) >= min_lines and has_headers
def check_bilingual_keywords(content, zh_keyword, en_keyword):
"""Assertion: at least one language variant of a keyword must appear."""
has_zh = zh_keyword in content
has_en = en_keyword.lower() in content.lower()
return has_zh or has_en
def check_no_hardcoded_credentials(content):
"""Assertion: no API keys, tokens, or credentials in content."""
patterns = [
r'sk-[a-zA-Z0-9]{20,}', # OpenAI key
r'ghp_[a-zA-Z0-9]{36}', # GitHub PAT
r'api[_-]?key["\']?\s*[=:]\s*["\'][^"\']{8,}',
r'token["\']?\s*[=:]\s*["\'][^"\']{8,}',
]
for pattern in patterns:
if re.search(pattern, content, re.IGNORECASE):
return False
return True
def check_mention_count(content, keyword, min_count=1):
"""Assertion: a keyword must appear at least min_count times."""
return content.lower().count(keyword.lower()) >= min_count
# --- Grading Engine ---
AVAILABLE_CHECKS = {
"keyword_absent": check_keyword_absent,
"keyword_present": check_keyword_present,
"required_section": check_required_section,
"file_exists": check_file_exists,
"format_json": check_format_json,
"format_markdown": check_format_markdown,
"bilingual_keywords": check_bilingual_keywords,
"no_hardcoded_credentials": check_no_hardcoded_credentials,
"mention_count": check_mention_count,
}
def load_grading_config(workspace):
"""Load grading configuration from grading-config.json or generate defaults."""
config_path = os.path.join(workspace, "grading-config.json")
if os.path.exists(config_path):
with open(config_path, "r", encoding="utf-8") as f:
return json.load(f)
# Default: grade all output files for basic quality
return {
"Assertions": [],
"default_checks": {
"format_markdown": True,
"no_hardcoded_credentials": True,
}
}
def find_output_files(workspace):
"""Find all output files in the workspace."""
outputs = []
for root, dirs, files in os.walk(workspace):
# Skip the workspace root and non-output directories
if "outputs" in root:
for f in files:
if f.endswith((".md", ".txt", ".json", ".html")):
outputs.append(os.path.join(root, f))
return outputs
def grade_file(filepath, assertions):
"""Grade a single output file against assertions."""
results = []
try:
with open(filepath, "r", encoding="utf-8") as f:
content = f.read()
except (IOError, UnicodeDecodeError):
return [AssertionResult(
"file_exists", f"File: {filepath}", False,
f"Could not read file", "file"
)]
# Check each assertion
for assertion in assertions:
check_type = assertion.get("type", "")
params = assertion.get("params", {})
category = assertion.get("category", "general")
description = assertion.get("description", f"{check_type} check")
if check_type not in AVAILABLE_CHECKS:
results.append(AssertionResult(
check_type, description, False,
f"Unknown assertion type: {check_type}", category
))
continue
check_fn = AVAILABLE_CHECKS[check_type]
try:
# Handle different check signatures
if check_type == "keyword_absent":
passed = check_fn(content, params.get("keyword"), params.get("case_sensitive", False))
evidence = f"'{params.get('keyword')}' {'found' if not passed else 'not found'}"
elif check_type == "keyword_present":
passed = check_fn(content, params.get("keyword"), params.get("case_sensitive", False))
evidence = f"'{params.get('keyword')}' {'found' if passed else 'not found'}"
elif check_type == "required_section":
passed = check_fn(content, params.get("section"))
evidence = f"Section '{params.get('section')}' {'found' if passed else 'not found'}"
elif check_type == "file_exists":
passed = check_fn(filepath)
evidence = f"File exists: {passed}"
elif check_type == "format_json":
passed = check_fn(content)
evidence = f"Valid JSON: {passed}"
elif check_type == "format_markdown":
passed = check_fn(content, params.get("min_lines", 5))
evidence = f"Valid markdown with {len(content.strip().splitlines())} lines: {passed}"
elif check_type == "bilingual_keywords":
passed = check_fn(content, params.get("zh"), params.get("en"))
evidence = f"Keywords '{params.get('zh')}' or '{params.get('en')}' found: {passed}"
elif check_type == "no_hardcoded_credentials":
passed = check_fn(content)
evidence = f"No credentials found: {passed}"
elif check_type == "mention_count":
passed = check_fn(content, params.get("keyword"), params.get("min_count", 1))
evidence = f"'{params.get('keyword')}' appears {content.lower().count(params.get('keyword').lower())} times (min: {params.get('min_count', 1)})"
else:
passed = False
evidence = "Unknown check type"
results.append(AssertionResult(
check_type, description, passed, evidence, category
))
except Exception as e:
results.append(AssertionResult(
check_type, description, False, f"Check error: {e}", category
))
return results
def grade_workspace(workspace, verbose=False):
"""Grade all outputs in a workspace against assertions."""
workspace = os.path.abspath(workspace)
if not os.path.exists(workspace):
return {
"error": f"Workspace not found: {workspace}",
"expectations": [],
"summary": {"passed": 0, "failed": 0, "total": 0, "pass_rate": 0}
}
config = load_grading_config(workspace)
# Accept the lowercase key as well as the legacy capitalized one
assertions = config.get("assertions") or config.get("Assertions", [])
# If no assertions defined, use default checks
if not assertions:
assertions = _get_default_assertions()
output_files = find_output_files(workspace)
if verbose:
print(f"Grading workspace: {workspace}")
print(f"Found {len(output_files)} output files")
print(f"Running {len(assertions)} assertions per file")
all_results = []
for filepath in output_files:
if verbose:
print(f" Grading: {os.path.relpath(filepath, workspace)}")
results = grade_file(filepath, assertions)
all_results.extend(results)
# Deduplicate by text
seen = set()
unique_results = []
for r in all_results:
if r.text not in seen:
seen.add(r.text)
unique_results.append(r)
return {
"workspace": workspace,
"files_checked": len(output_files),
"assertions": [r.to_dict() for r in unique_results],
"expectations": [r.to_dict() for r in unique_results],
"summary": {
"passed": sum(1 for r in unique_results if r.passed),
"failed": sum(1 for r in unique_results if not r.passed),
"total": len(unique_results),
"pass_rate": sum(1 for r in unique_results if r.passed) / max(len(unique_results), 1),
}
}
def _get_default_assertions():
"""Get default assertions for markdown-based outputs."""
return [
{
"type": "format_markdown",
"description": "Output should be valid markdown with headers",
"category": "format",
"params": {"min_lines": 5}
},
{
"type": "no_hardcoded_credentials",
"description": "Output should not contain hardcoded credentials",
"category": "security",
"params": {}
},
{
"type": "keyword_present",
"description": "Output should contain substantive content",
"category": "quality",
"params": {"keyword": "##", "case_sensitive": False}
},
]
def main():
parser = argparse.ArgumentParser(
description="Grade benchmark results against assertions"
)
parser.add_argument(
"--workspace", "-w", required=True,
help="Path to benchmark results workspace"
)
parser.add_argument(
"--verbose", "-v", action="store_true",
help="Show detailed grading output"
)
parser.add_argument(
"--json", action="store_true",
help="Output JSON format"
)
args = parser.parse_args()
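    result = grade_workspace(args.workspace, verbose=args.verbose)
    # A missing workspace cannot hold grading.json; report the error and stop
    if "error" in result:
        print(f"Error: {result['error']}", file=sys.stderr)
        sys.exit(1)
    # Save grading.json
    grading_path = os.path.join(args.workspace, "grading.json")
    with open(grading_path, "w", encoding="utf-8") as f:
        json.dump(result, f, indent=2, ensure_ascii=False)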
if args.json:
print(json.dumps(result, indent=2, ensure_ascii=False))
else:
summary = result["summary"]
total = summary["total"]
passed = summary["passed"]
failed = summary["failed"]
rate = summary["pass_rate"]
print(f"\n📊 Grading Results")
print(f"{'=' * 50}")
print(f"Files checked: {result.get('files_checked', 0)}")
print(f"Total assertions: {total}")
print(f" ✅ Passed: {passed}")
print(f" ❌ Failed: {failed}")
print(f" Pass rate: {rate:.1%}")
print(f"{'=' * 50}")
if failed > 0 and args.verbose:
print("\nFailed assertions:")
for a in result["assertions"]:
if not a["passed"]:
print(f" ❌ {a['text']}")
print(f" Evidence: {a['evidence']}")
print(f"\n📁 Results saved to: {grading_path}")
if __name__ == "__main__":
main()
FILE:scripts/static-analyze.py
#!/usr/bin/env python3
"""Static analysis for skill-assessment (Method 1).
Usage:
python3 static-analyze.py <path-to-skill>
python3 static-analyze.py <path-to-skill> --json
python3 static-analyze.py <path-to-skill> --problems-only
python3 static-analyze.py --compare <skill-a> <skill-b>
python3 static-analyze.py --all
"""
import argparse
import json
import os
import re
import sys
import yaml
def check_documentation(skill_path):
issues = []
score = 100
# Check SKILL.md exists
skill_md = os.path.join(skill_path, "SKILL.md")
if not os.path.isfile(skill_md):
issues.append({"severity": "error", "check": "SKILL.md exists", "msg": "SKILL.md not found"})
return {"score": 0, "issues": issues}
# Parse frontmatter
with open(skill_md, "r", encoding="utf-8") as f:
content = f.read()
match = re.match(r'^---\s*\n(.*?)\n---', content, re.DOTALL)
if not match:
issues.append({"severity": "error", "check": "frontmatter", "msg": "No YAML frontmatter"})
score -= 30
else:
try:
fm = yaml.safe_load(match.group(1))
if not isinstance(fm, dict):
issues.append({"severity": "error", "check": "frontmatter", "msg": "Frontmatter must be YAML dict"})
score -= 20
else:
if not fm.get("name"):
issues.append({"severity": "error", "check": "name field", "msg": "Missing 'name' in frontmatter"})
score -= 15
if not fm.get("description"):
issues.append({"severity": "warning", "check": "description field", "msg": "Missing 'description' in frontmatter"})
score -= 10
elif len(fm["description"].split()) < 15:
issues.append({"severity": "warning", "check": "description length", "msg": f"Description too short ({len(fm['description'].split())} words)"})
score -= 5
# Check trigger contexts
desc_lower = str(fm.get("description") or "").lower()
trigger_phrases = ["use when", "use for", "when the user", "when you need", "such as", "for example"]
has_trigger = any(p in desc_lower for p in trigger_phrases)
if not has_trigger:
issues.append({"severity": "info", "check": "trigger contexts", "msg": "No trigger phrases found — add 'Use when...' to improve activation"})
score -= 5
except Exception as e:
issues.append({"severity": "error", "check": "frontmatter parse", "msg": f"YAML parse error: {e}"})
score -= 20
    # Check body length: count only the lines after the closing frontmatter delimiter
    body = content[match.end():] if match else content
    body_lines = len(body.strip().splitlines())
if body_lines < 10:
issues.append({"severity": "error", "check": "body length", "msg": f"Body only {body_lines} lines — too short"})
score -= 15
elif body_lines > 500:
issues.append({"severity": "info", "check": "body length", "msg": f"Body {body_lines} lines — consider splitting"})
# Check for usage examples in body
example_indicators = ["example", "usage", "```bash", "```python", "$ "]
has_examples = any(indicator in content.lower() for indicator in example_indicators)
if not has_examples:
issues.append({"severity": "warning", "check": "examples", "msg": "No usage examples found"})
score -= 10
return {"score": max(0, score), "issues": issues}
def check_code_quality(skill_path):
issues = []
score = 100
scripts_dir = os.path.join(skill_path, "scripts")
if not os.path.isdir(scripts_dir):
return {"score": 100, "issues": [{"severity": "info", "check": "scripts", "msg": "No scripts/ directory"}]}
py_files = [f for f in os.listdir(scripts_dir) if f.endswith(".py")]
if not py_files:
return {"score": 100, "issues": []}
for f in py_files:
fpath = os.path.join(scripts_dir, f)
try:
with open(fpath, "r", encoding="utf-8") as fh:
code = fh.read()
# Syntax check
try:
compile(code, fpath, "exec")
except SyntaxError as e:
issues.append({"severity": "error", "check": f"syntax ({f})", "msg": f"Syntax error at line {e.lineno}: {e.msg}"})
score -= 10
# Check for hardcoded secrets
secret_patterns = [
r'api[_-]?key\s*[=:]\s*["\'][^"\']{8,}',
r'token\s*[=:]\s*["\'][^"\']{8,}',
r'sk-[a-zA-Z0-9]{20,}',
r'ghp_[a-zA-Z0-9]{36}',
]
for pattern in secret_patterns:
if re.search(pattern, code, re.IGNORECASE):
issues.append({"severity": "error", "check": f"secrets ({f})", "msg": f"Possible hardcoded credential in {f}"})
score -= 15
# Note: "dangerous functions" (os.system, eval, exec) check removed.
# Presence of these functions is not a security issue without
# understanding data flow. Real security issues are caught above.
except Exception as e:
issues.append({"severity": "error", "check": "read error", "msg": f"Could not read {f}: {e}"})
score -= 5
return {"score": max(0, score), "issues": issues}
def check_configuration(skill_path):
issues = []
score = 100
skill_md = os.path.join(skill_path, "SKILL.md")
if not os.path.isfile(skill_md):
return {"score": 0, "issues": [{"severity": "error", "check": "SKILL.md", "msg": "SKILL.md not found"}]}
with open(skill_md, "r", encoding="utf-8") as f:
content = f.read()
# Check env vars referenced
env_refs = re.findall(r'\$\{?([A-Z_][A-Z0-9_]*)\}?', content)
if env_refs:
# Check they are documented
env_section = "env" in content.lower() or "environment" in content.lower() or "variable" in content.lower()
if not env_section:
issues.append({"severity": "warning", "check": "env vars", "msg": f"Env vars found but not documented: {', '.join(set(env_refs[:5]))}"})
score -= 10
# Check for clear command examples
has_commands = bool(re.search(r'```(bash|sh|shell|python)', content))
if not has_commands:
issues.append({"severity": "info", "check": "command examples", "msg": "No command examples found"})
score -= 5
return {"score": max(0, score), "issues": issues}
def check_maintenance(skill_path):
issues = []
score = 100
# Check for versioning
skill_md = os.path.join(skill_path, "SKILL.md")
if os.path.isfile(skill_md):
with open(skill_md, "r", encoding="utf-8") as f:
content = f.read()
if "version" not in content.lower():
issues.append({"severity": "info", "check": "version", "msg": "No version info found in SKILL.md"})
score -= 5
# Check for update date patterns
update_indicators = re.findall(r'(updated?|modified?|changed?|revised?):?\s*\d{4}-\d{2}-\d{2}', content, re.IGNORECASE)
if update_indicators:
issues.append({"severity": "info", "check": "update date", "msg": "Found update timestamps"})
# Check directory structure
for subdir in ["scripts", "references", "assets", "knowledge"]:
dp = os.path.join(skill_path, subdir)
if os.path.isdir(dp):
contents = [f for f in os.listdir(dp) if not f.startswith(".") and f != "__pycache__"]
if not contents:
issues.append({"severity": "warning", "check": f"{subdir}/ empty", "msg": f"{subdir}/ is empty"})
score -= 5
return {"score": max(0, score), "issues": issues}
def analyze(skill_path, json_output=False, problems_only=False):
if not os.path.isdir(skill_path):
print(f"Error: {skill_path} is not a directory", file=sys.stderr)
sys.exit(1)
doc = check_documentation(skill_path)
code = check_code_quality(skill_path)
config = check_configuration(skill_path)
maint = check_maintenance(skill_path)
total_score = (doc["score"] * 0.3 + code["score"] * 0.3 + config["score"] * 0.2 + maint["score"] * 0.2)
all_issues = doc["issues"] + code["issues"] + config["issues"] + maint["issues"]
if json_output:
result = {
"skill": os.path.basename(os.path.abspath(skill_path)),
"path": os.path.abspath(skill_path),
"scores": {
"documentation": doc["score"],
"code_quality": code["score"],
"configuration": config["score"],
"maintenance": maint["score"],
},
"overall": round(total_score, 1),
"issues": all_issues,
}
print(json.dumps(result, indent=2))
else:
skill_name = os.path.basename(os.path.abspath(skill_path))
print(f"\n📊 Skill Assessment: {skill_name}")
print(f"{'=' * 50}")
print(f" Documentation: {doc['score']}/100")
print(f" Code Quality: {code['score']}/100")
print(f" Configuration: {config['score']}/100")
print(f" Maintenance: {maint['score']}/100")
print(f" ─────────────────────────")
print(f" Overall: {total_score:.0f}/100")
print()
if problems_only:
filtered = [i for i in all_issues if i["severity"] != "info"]
else:
filtered = all_issues
if filtered:
print(" Issues:")
for issue in filtered:
icon = {"error": "❌", "warning": "⚠️ ", "info": "ℹ️ "}[issue["severity"]]
print(f" {icon} [{issue['check']}] {issue['msg']}")
else:
print(" ✅ No issues found")
print()
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Static skill assessment")
parser.add_argument("path", nargs="?", help="Path to skill directory")
parser.add_argument("--json", action="store_true", help="JSON output")
parser.add_argument("--problems-only", action="store_true", help="Show only errors/warnings")
parser.add_argument("--compare", nargs=2, metavar=("SKILL-A", "SKILL-B"), help="Compare two skills")
parser.add_argument("--all", action="store_true", help="Analyze all skills in ~/.openclaw/skills/")
args = parser.parse_args()
if args.compare:
for p in args.compare:
analyze(p, args.json, args.problems_only)
elif args.all:
skills_dir = os.path.expanduser("~/.openclaw/skills/")
if os.path.isdir(skills_dir):
for d in os.listdir(skills_dir):
full = os.path.join(skills_dir, d)
if os.path.isdir(full) and not d.startswith("."):
analyze(full, args.json, args.problems_only)
else:
print(f"Skills dir not found: {skills_dir}")
elif args.path:
analyze(args.path, args.json, args.problems_only)
else:
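parser.print_help()
Multi Find Skills (技能搜索全能版): all-in-one skill search. Trigger scenarios: the user asks "有什么技能可以帮我...", "找一个能做X的技能", "有没有技能可以...", "帮我搜一下XXX相关的技能", "search for skill", or "找技能". Function: search across three ecosystems (ClawHub + LobeHub + skills.sh) + quality...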
---
name: multi-find-skills
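description: |
  Multi Find Skills (技能搜索全能版): all-in-one skill search
  Trigger scenarios: the user asks "有什么技能可以帮我...", "找一个能做X的技能", "有没有技能可以...", "帮我搜一下XXX相关的技能", "search for skill", or "找技能"
  Function: search across three ecosystems (ClawHub + LobeHub + skills.sh), plus quality verification, install verification, and lifecycle management
  Do not trigger: to install a skill use clawhub install, to manage installed skills use openclaw skills list, to create a new skill use skill-creator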
author: "wangzairong (https://github.com/wangzairong)"
evolution:
created_at: "2026-04-23"
version: "2.6.0"
domain: "#engineering"
success_rate: 0
parent_skill: ""
last_used: ""
use_count: 0
metadata:
openclaw:
emoji: "🔍"
requires:
bins: [clawhub]
---
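# 🔍 Multi Find Skills (All-in-One Skill Search) v2.6.0
> A skill search tool covering the **ClawHub**, **LobeHub**, and **skills.sh** ecosystems, with quality verification, install verification, and full lifecycle management.
---
## Trigger Phrases
Activate this skill when the user says any of the following:
- "有什么技能可以帮我..." (What skills can help me...)
- "有没有技能可以..." (Is there a skill that can...)
- "找一个能做 X 的技能" (Find a skill that can do X)
- "有没有 X 相关的技能" (Are there any skills related to X)
- "我需要一个能...的技能" (I need a skill that can...)
- "帮我搜一下 XXX 相关的技能" (Search for skills related to XXX)
- "X 有什么技能吗" (Are there any skills for X)
- "search for skill"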
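## When to Trigger
### When to Use ✅
- The user asks "what skill can help me do X"
- The user says "find a skill that can do Y"
- The user asks whether a ready-made skill already exists for some capability
- The user wants to extend capabilities (design, testing, deployment, and so on)
### When NOT to Use ❌
- Installing a skill whose name is already known → use `clawhub install <skill-name>` directly
- Managing installed skills → use `openclaw skills list`
- Creating a new skill → use `skill-creator`
- Tasks unrelated to skill search → just carry out the task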
---
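## Core Features
1. **Three-ecosystem search**: ClawHub + LobeHub + skills.sh
2. **Quality verification**: install-count thresholds + Stars + source reputation
3. **Install verification**: automatically checks that the files exist after installation
4. **Safe re-runs**: idempotent design; retry with `--force`
5. **Lifecycle management**: update / uninstall / list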
---
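## Workflow
### Quick Flow
```
User request → extract keywords → search all sources → quality filter → sort → present results → provide install command → verify install
```
### Detailed Flow (8 Steps)
```
Step 1: Understand the user's request
├─ Identify the domain (React / testing / design / deployment)
├─ Identify the specific task (write tests / create animations / review PRs)
├─ Match keywords against quick-reference.md (domain keywords; skip if nothing matches)
└─ Decide whether a skill search is appropriate
Step 2: Process keywords
├─ Combine domain + specific task → keywords
└─ Chinese → translate into English
Step 3: Search all sources in parallel
├─ skills.sh: npx skills find "<keyword>"
├─ ClawHub: clawhub search "<keyword>"
├─ LobeHub: npx -y @lobehub/market-cli skills search --q "<keyword>"
└─ Run in parallel and merge the results
Step 4: Quality filter
├─ Installs < 100 → ⚠️
├─ Stars < 5 and unknown source → ⚠️
└─ Official / well-known vendor → 🌟 preferred
Step 5: Sort
└─ By install count, descending
Step 6: Present results (Chinese-language format)
└─ Name + description + installs + install command + ✅/❌
Step 7: Provide the install command
└─ Use the command that matches the source
Step 8: Verify after installation
└─ ls ~/.openclaw/skills/<skill-name>/SKILL.md
```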
---
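## Search Command Cheat Sheet
| Source | Command | Notes |
|------|------|------|
| skills.sh | `npx skills find "<keyword>"` | 1️⃣ First choice: high quality, updated quickly |
| ClawHub | `clawhub search "<keyword>"` | 2️⃣ Second choice: official ecosystem, strict quality control |
| ClawHub (popular) | `clawhub explore --sort installs --limit 20` | Browse popular skills by install count |
| LobeHub | `npx -y @lobehub/market-cli skills search --q "<keyword>"` | 3️⃣ Supplementary: rich community contributions |
| OpenClaw Directory | https://www.openclawdirectory.dev/skills | Browse by category on the web |
| GitHub | Web search for `site:github.com "openclaw skill"` | Supplementary source; filter manually |
**Search order**: skills.sh → ClawHub → LobeHub → GitHub
**Timeout handling**: if a single source times out, continue with the other sources; if every source times out, suggest visiting the websites manually.
See `references/sources.md` for search tips and a detailed description of skills.sh.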
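The timeout rule can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the skill's actual implementation: `search_all`, `all_sources_timed_out`, and the `sources` mapping are hypothetical names, and each callable is assumed to wrap one of the CLI commands from the table.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def search_all(sources, keyword, timeout_s=10.0):
    """Query every source in parallel and tolerate slow ones.

    `sources` maps a source name to a callable that takes the keyword and
    returns a list of results (a hypothetical interface for illustration).
    Returns (results, timed_out): the results of each source that answered
    in time, and the names of the sources that did not.
    """
    results, timed_out = {}, []
    pool = ThreadPoolExecutor(max_workers=max(1, len(sources)))
    futures = {name: pool.submit(fn, keyword) for name, fn in sources.items()}
    for name, future in futures.items():
        try:
            results[name] = future.result(timeout=timeout_s)
        except FutureTimeout:
            timed_out.append(name)  # skip this source, keep the others
    pool.shutdown(wait=False)  # do not block on sources that are still running
    return results, timed_out


def all_sources_timed_out(results, timed_out):
    """True when no source answered, i.e. suggest visiting the websites manually."""
    return not results and bool(timed_out)
```

The timeout is applied per source while waiting, so the total wait can exceed `timeout_s` when several sources are slow; a shared deadline would be a reasonable refinement.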
---
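## Quality Verification
### Install-Count Thresholds
| Tier | Installs | Notes |
|------|--------|------|
| 🌟 Preferred | ≥1,000 | Mature and well validated by the community |
| ✅ Minimum bar | ≥100 | Usable, but flag the risk |
| ⚠️ Use with caution | <100 | Limited community validation |
**Hard requirement**: Stars ≥ 5 and a clearly identified author/source
### Recommendation Priority
1. 🌟 Official / well-known vendor + high install count (≥100K top tier, ≥10K popular)
   - vercel-labs, anthropics, microsoft, and openclaw undergo routine security audits
2. ✅ Community skills with ≥1K installs and a clearly identified author
3. ⚠️ Installs <100 or unknown source: use with caution
> skills.sh install counts come from anonymous telemetry (reported automatically when a user installs a skill); see `references/sources.md`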
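One possible reading of these thresholds, sketched in Python. The `Skill` dataclass, `quality_tier`, `rank`, and the tier labels are names invented for this illustration; the way the install-count table and the Stars rule are combined is an assumption, since the document does not spell out the exact precedence.

```python
from dataclasses import dataclass

# Vendors the document names as having routine security audits
TRUSTED_OWNERS = {"vercel-labs", "anthropics", "microsoft", "openclaw"}


@dataclass
class Skill:
    name: str
    installs: int
    stars: int = 0
    owner: str = ""  # empty string means the author/source is unknown


def quality_tier(skill):
    """Classify a skill as 'recommended', 'acceptable', or 'caution'."""
    if skill.installs < 100:
        return "caution"  # limited community validation
    if skill.stars < 5 and not skill.owner:
        return "caution"  # few stars and no identifiable source
    if skill.owner in TRUSTED_OWNERS or skill.installs >= 1000:
        return "recommended"
    return "acceptable"  # meets the minimum bar; flag the risk


def rank(skills):
    """Sort by install count, descending (Step 5 of the workflow)."""
    return sorted(skills, key=lambda s: s.installs, reverse=True)
```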
---
## Install Commands
| Source | Command |
|------|------|
| **ClawHub** | `clawhub install <skill-name>` |
| **LobeHub** | `npx -y @lobehub/market-cli skills install <skill-name> --agent open-claw` |
| **skills.sh** | `npx skills add <owner/repo@skill> -g -y` |
| **GitHub** | `npx skills add <owner/repo> -g -y` |
See `references/sources.md` for single-skill versus whole-repository installs.
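A small helper can map a source to the matching command from the table. The sketch below is illustrative only: `install_command` and the source labels are hypothetical, and the command templates are copied from the table rather than verified against each CLI.

```python
def install_command(source, name):
    """Return the install command as an argument list for the given source.

    `source` is one of "clawhub", "lobehub", "skills.sh", "github" (labels
    chosen for this sketch). `name` is the skill name, or owner/repo[@skill]
    for skills.sh and GitHub.
    """
    if source == "clawhub":
        return ["clawhub", "install", name]
    if source == "lobehub":
        return ["npx", "-y", "@lobehub/market-cli", "skills", "install",
                name, "--agent", "open-claw"]
    if source in ("skills.sh", "github"):
        return ["npx", "skills", "add", name, "-g", "-y"]
    raise ValueError(f"unknown source: {source}")
```

Returning a list rather than a string lets the caller pass it to `subprocess.run` without `shell=True`, so the skill name is never interpreted by a shell.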
### China Mirror (when a ClawHub install fails)
```bash
clawhub config set registry https://cn.clawhub-mirror.com
clawhub install <skill-name> --registry https://cn.clawhub-mirror.com
```
---
## Install Verification
```bash
ls ~/.openclaw/skills/<skill-name>/SKILL.md
# File exists = install succeeded
```
- File exists → ✅ install succeeded
- File missing → ❌ install failed; try `clawhub install <skill-name> --force`
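The same check can be written in Python. `is_installed` is a hypothetical helper for illustration, and the optional `skills_dir` parameter is an addition that makes the check testable against a temporary directory.

```python
from pathlib import Path


def is_installed(skill_name, skills_dir=None):
    """Return True when <skills_dir>/<skill_name>/SKILL.md exists.

    Defaults to ~/.openclaw/skills, the install path used throughout
    this document.
    """
    root = Path(skills_dir) if skills_dir else Path.home() / ".openclaw" / "skills"
    return (root / skill_name / "SKILL.md").is_file()
```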
---
## Lifecycle Management
### Update a Skill
```bash
clawhub install <skill-name> --force
npx skills add <owner/repo@skill> -g -y --force
```
### Uninstall a Skill
```bash
rm -rf ~/.openclaw/skills/<skill-name>/
# Verify: ls ~/.openclaw/skills/<skill-name>/SKILL.md # should report "No such file or directory"
```
### List Installed Skills
```bash
openclaw skills list
ls ~/.openclaw/skills/
```
---
## Output Format
```
🔍 "【需求关键词】" 搜索结果(共 X 个来源):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📦 skills.sh(高质量首选)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. 【skill-name】
总安装量:XXX,XXX | 24h 安装量:X,XXX ↑ | 排名:#XX
安装命令:npx skills add owner/repo@skill -g -y
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📦 ClawHub(官方生态)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. 【skill-name】
下载:X,XXX | Stars: XX | ✅ 已安装 / ❌ 未安装
安装命令:clawhub install skill-name
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🌐 LobeHub(社区市场)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. 【skill-name】
安装量:XX,XXX
安装命令:npx -y @lobehub/market-cli skills install <skill-name> --agent open-claw
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🐙 GitHub(补充来源)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. 【owner/repo】
描述:【描述】
Stars: XX | 安装:npx skills add owner/repo -g -y
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
💡 推荐:【最推荐的技能】,理由:【一句话】
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
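**Field notes**:
- 总安装量 (total installs): cumulative All Time install count (higher means more mature)
- 24h 安装量 (24h installs): new installs from Trending 24h (higher means more popular right now)
- 排名 (rank): current position on the skills.sh leaderboard
See `references/sources.md` for install-count field definitions and install command syntax.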
---
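## Safety Mechanisms
### Input Validation
```bash
# Maximum length: 100 characters
[[ ${#keyword} -gt 100 ]] && echo "关键词过长" && exit 1
# Reject dangerous characters (guards against injection): ; | ` " ' \
unsafe='[;|`"'\''\\]'
[[ "$keyword" =~ $unsafe ]] && echo "包含非法字符" && exit 1
# Minimum length: 2 characters
[[ ${#keyword} -lt 2 ]] && echo "关键词至少2个字符" && exit 1
```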
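For callers that build the search command from Python rather than a shell script, the same three checks might look like the sketch below. `validate_keyword` and `FORBIDDEN` are hypothetical names, and the English error messages are illustrative rather than the skill's own.

```python
# Characters rejected by the shell snippet: ; | ` " ' and backslash
FORBIDDEN = set(';|`"\'\\')


def validate_keyword(keyword):
    """Return an error message, or None when the keyword is acceptable."""
    if len(keyword) > 100:
        return "keyword too long"
    if len(keyword) < 2:
        return "keyword must be at least 2 characters"
    if any(ch in FORBIDDEN for ch in keyword):
        return "keyword contains illegal characters"
    return None
```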
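### Idempotency & Safe Re-runs
| Operation | Idempotent | Safe to re-run |
|------|--------|---------|
| Search | ✅ Naturally idempotent | ✅ Safe and repeatable |
| Install | ✅ Idempotent (with `--force`) | ✅ Supported; use `--force` |
| Verify | ✅ Idempotent | ✅ Safe; read-only |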
---
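## Troubleshooting
| Problem | Solution |
|------|---------|
| No ClawHub results | Check the network, try English keywords, browse popular skills with `clawhub explore` |
| No skills.sh results | Confirm that `npx` is available and check the keyword spelling |
| Rate limited | Wait 1 hour or use an alternative source |
| Install failed | Check the path and confirm that `~/.openclaw/skills/` is writable |
| Verification failed | Reinstall: `clawhub install <skill-name> --force` |
See `references/troubleshooting.md` for detailed fixes.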
---
## When No Suitable Skill Is Found
```
我搜索了 "【关键词】" 相关的 skills,但没有找到合适的匹配。
我仍然可以直接帮你完成这个任务!要我试试吗?
如果经常需要这个功能,可以考虑创建自定义 skill:
npx skills init my-custom-skill
```
---
## Related Skills
- `clawhub` - ClawHub CLI tool
- `skill-creator` - create new skills
- `healthcheck` - system health check
---
*🔍 Multi Find Skills (All-in-One Skill Search) v2.6.0*
FILE:CLAWHUB_PUBLISH.md
# CLAWHUB_PUBLISH.md
## Release Information
| Field | Value |
|------|-----|
| **Slug** | multi-find-skills |
| **Name** | Multi Find Skills (All-in-One Skill Search) |
| **Version** | 2.6.0 |
| **Author** | wangzairong |
| **Tags** | search, skill, finder, clawhub, skills-sh, openclaw, multi-source |
| **Category** | utilities |
## Description (shown on ClawHub)
Chinese: 同时支持 ClawHub、LobeHub 和 skills.sh 三大生态的技能搜索工具,具备质量核验、安装验证和完整生命周期管理功能。支持6个搜索来源。
English: All-in-One Skill Finder for the ClawHub, LobeHub, and skills.sh ecosystems with quality checks, install verification, and full lifecycle management. Supports 6 search sources.
## Notes for China
In China, install this skill through the clawhub mirror:
```bash
clawhub install multi-find-skills --registry https://cn.clawhub-mirror.com
```
Or set the mirror as the default registry:
```bash
clawhub config set registry https://cn.clawhub-mirror.com
```
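## Dependencies
This skill depends on the following CLI tools:
| Tool | How to install | Required |
|------|---------|------|
| `clawhub` | `npm i -g clawhub` | ✅ |
| `npx` | Bundled with Node.js | ✅ |
If installing clawhub fails in China, use the mirror site.
## About the Quality Thresholds
The install-count thresholds (≥100 / ≥1,000) and the Stars threshold (≥5) in this skill's documentation are rules of thumb, not checks enforced by any platform. When choosing a skill, also weigh factors such as how recently it was updated and what the community says about it.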
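## Features
- **Three-ecosystem search**: ClawHub + LobeHub + skills.sh, 6 sources queried in parallel
- **Quality verification**: install-count thresholds + Stars + source reputation
- **Install verification**: automatically checks that the files exist after installation
- **Safe re-runs**: idempotent design; retry safely with `--force`
- **Lifecycle management**: update / uninstall / list
- **Input validation**: keyword length + dangerous-character checks
## Support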
- **GitHub Issues**: https://github.com/wangzairong/skills
- **Discord**: https://discord.com/invite/clawd
- **ClawHub**: https://clawhub.ai/skills/multi-find-skills
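## Changelog
### v2.6.0
- Improved: full upgrade of quick-reference.md (added a popular-skills cheat sheet, 3 new categories, and expanded keywords)
- Improved: matching against quick-reference.md keywords moved to Step 1 (done while understanding the request); skipped when nothing matches
- Improved: unified terminology ("search term" → "keyword") across the whole workflow
- Improved: the search command cheat sheet now reflects the search order (marked 1️⃣2️⃣3️⃣)
- Added: three skills.sh fields in the output format (total installs / 24h installs / rank)
- Added: a GitHub section in the output format (owner/repo + Stars + install command)
- Improved: the LobeHub output format now shows the install command instead of a link
- Trimmed: folded the skills.sh ranking mechanism into the recommendation-priority logic
- Moved: search tips, install command syntax, and install-count field notes to sources.md
- Updated: the SKILL.md body keeps only the quick reference and points to sources.md for details
### v2.0.0
- Reduced token cost: slimmed the body and moved details to references/
- Added lifecycle management: update / uninstall / list commands
- Added safe re-runs: idempotency notes + `--force` retries
- Added input validation: keyword length + dangerous-character checks
- Extended the trigger phrases into the frontmatter description
- Kept the original full trigger phrases, workflow, and the output format for all 6 sources
- Restored search strategies (by function / provider / popularity) + the "view details" Step 3
- Restored the recommended skill lists by category (core / integration / Agent)
- Restored best practices (5 guiding principles)
### v1.2.0
- Expanded the search sources to 6 independent sources
- Added a "Search Strategies" section (by function / provider / popularity)
- Added "Best Practices" and "Install Tips" sections
### v1.1.0
- Unified the install path to `~/.openclaw/skills/`
- Marked the quality thresholds as adjustable
### v1.0.0
- Dual-ecosystem search (ClawHub + skills.sh)
- Quality verification flow
- Install verification mechanism
- Consolidated from eaton/find-skills-v2 + fangkelvin/find-skills-skill + guohongbin-git/skill-finder-cn
---
*This file is a supplementary document for publishing to ClawHub and does not affect the skill's core functionality.*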
FILE:references/changelog.md
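# Changelog
## v2.6.0
- 📝 Improved: full upgrade of quick-reference.md (added a popular-skills cheat sheet, 3 new categories, and expanded keywords)
- 📝 Improved: matching against quick-reference.md keywords moved to Step 1 (done while understanding the request; skipped when nothing matches)
- 📝 Improved: unified terminology ("search term" → "keyword") across the whole workflow
- 📝 Improved: the search command cheat sheet now reflects the search order (marked 1️⃣2️⃣3️⃣)
- 📝 Added: three skills.sh fields in the output format (total installs / 24h installs / rank)
- 📝 Added: a GitHub section in the output format (owner/repo + Stars + install command)
- 📝 Improved: the LobeHub output format now shows the install command instead of a link
- 📝 Trimmed: folded the skills.sh ranking mechanism into the recommendation-priority logic
- 📝 Moved: search tips, install command syntax, and install-count field notes to sources.md
- 📝 Updated: the SKILL.md body keeps only the quick reference and points to sources.md for details
## v2.5.0
- 📝 Improved: the LobeHub output format now shows the install command instead of a link
- 📝 Trimmed: folded the skills.sh ranking mechanism into the recommendation-priority logic
- 📝 Moved: search tips, install command syntax, and install-count field notes to sources.md
- 📝 Updated: the SKILL.md body keeps only the quick reference and points to sources.md for details
- 📝 Trimmed: merged duplicated content (condensed the search-strategy section)
- 📝 Improved: the install-commands section now compares single-skill and whole-repository installs
- 📝 Reorganized: clearer sections (when to trigger / core features / workflow / search commands / quality verification / install / lifecycle / output format / safety mechanisms / troubleshooting)
## v2.4.0
- 📝 Reworked: detailed walkthrough of every workflow step (Steps 2-8)
- 📝 Added: Step 2 details (how search terms are built: domain + task)
- 📝 Added: Step 3 details (recommended search order, timeout handling)
- 📝 Added: Step 4 details (filter rules, check thresholds)
- 📝 Added: Steps 5-7 details (output format examples)
- 📝 Added: Step 8 details (install verification command and how to read the result)
## v2.3.0
- 📝 Reworked: finer-grained workflow with a detailed Step 1 (understanding the user's request)
- 📝 Added: decision table for whether to search for a skill
- 📝 Added: keyword matching priority (see quick-reference.md)
- 📝 Added: examples of converting a user request into search keywords
## v2.2.0
- 📝 Added: advanced skills.sh usage (`--list` preview, repository vs single skill)
- 📝 Added: skills.sh leaderboard categories (All Time / Trending 24h / Hot)
- 📝 Added: explanation of the skills.sh ranking mechanism (anonymous telemetry)
- 📝 Updated: the quality verification section now distinguishes ClawHub and skills.sh criteria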
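## v2.1.0
- 🔧 Fixed: `clawhub search --sort` does not exist; replaced with `clawhub explore --sort`
- 🔧 Fixed: LobeHub command changed from `skills search "keyword"` to `skills search --q "keyword"`
- 🗑️ Removed: invalid reference to `find-skills (eaton)` (that skill does not exist)
- 🗑️ Removed: leaderboard-first logic (the skills.sh leaderboard returns HTML that cannot be parsed)
- 🗑️ Removed: decorative skill category lists (core / integration / Agent skills)
- 📝 Trimmed: removed content duplicated in references/; the body keeps only the quick reference
- 📝 Updated: search source descriptions now state clearly whether each is a CLI or a web search
## v2.0.0
- ✅ Reduced token cost: slimmed the body and moved details to references/
- ✅ Added lifecycle management: update / uninstall / list commands
- ✅ Added safe re-runs: idempotency notes + `--force` retries
- ✅ Added input validation: keyword length + dangerous-character checks
- ✅ Extended the trigger phrases into the frontmatter description
- ✅ Kept the original full trigger phrases, workflow, and the output format for all 6 sources
- ✅ Restored search strategies (by function / provider / popularity) + the "view details" Step 3
- ✅ Restored the recommended skill lists by category (core / integration / Agent)
- ✅ Restored best practices (5 guiding principles)
## v1.2.0
- ✅ Expanded "Skill Search Sources" to 6 independent sources
- ✅ Added a "Search Strategies" section (by function / provider / popularity)
- ✅ Added a "Best Practices" section
- ✅ Added a "Related Skills" section
- ✅ Added an "Install Tips" section
## v1.1.0
- ✅ Unified the install path to `~/.openclaw/skills/`
- ✅ Added a "When NOT to Use" section
- ✅ Added guidance for when nothing is found
- ✅ Marked the quality thresholds as adjustable
## v1.0.0
- ✅ Dual-ecosystem search (ClawHub + skills.sh)
- ✅ Quality verification flow (installs / Stars / source)
- ✅ Install verification mechanism
- ✅ Bilingual output format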
FILE:references/quick-reference.md
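# Common Request Keyword Cheat Sheet
> Used in Step 1 ("match keywords against quick-reference.md") to quickly identify the domain keywords in a user's request.
---
## 🔥 Popular Skills at a Glance (highest install counts first)
| Skill | Installs | Source | Matching requests |
|--------|--------|------|----------|
| `find-skills` | 1.2M | vercel-labs | "找技能" (find skills), "搜索 skill" (search for a skill) |
| `vercel-react-best-practices` | 351K | vercel-labs | React best practices |
| `frontend-design` | 340K | anthropics | Frontend development, UI design |
| `soultrace` | 318K | soultrace-ai | Agent personality, memory management |
| `web-design-guidelines` | 280K | vercel-labs | Web design, design guidelines |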
---
## Frontend / React
| Need | Search keywords |
|------|-----------|
| React development | `react`, `react hooks`, `react component` |
| Next.js | `nextjs`, `next.js`, `ssr` |
| Frontend styling | `css`, `tailwind`, `styled-components` |
| Frontend testing | `react testing`, `playwright`, `e2e` |
| Responsive design | `responsive`, `mobile`, `adaptive` |
| Frontend design | `frontend-design`, `ui design` |
| React best practices | `vercel-react-best-practices` |
---
## Testing
| Need | Search keywords |
|------|-----------|
| Unit testing | `testing`, `jest`, `vitest` |
| E2E testing | `playwright`, `cypress`, `e2e testing` |
| Integration testing | `integration testing`, `api testing` |
| Performance testing | `performance testing`, `load testing`, `k6` |
| Test automation | `automation testing`, `test automation` |
---
## DevOps / Deployment
| Need | Search keywords |
|------|-----------|
| Docker | `docker`, `container`, `dockerfile` |
| Kubernetes | `kubernetes`, `k8s`, `helm` |
| CI/CD | `deploy`, `ci-cd`, `github actions`, `gitlab ci` |
| Monitoring | `monitoring`, `logging`, `prometheus` |
| Cloud deployment | `aws`, `azure`, `gcp`, `vercel`, `netlify` |
| Azure | `microsoft-foundry`, `azure-ai` |
| Automated deployment | `deployment automation`, `cd pipeline` |
---
## Documentation Generation
| Need | Search keywords |
|------|-----------|
| README generation | `readme`, `documentation`, `docs generator` |
| API docs | `api-docs`, `swagger`, `openapi` |
| Changelog | `changelog`, `release notes` |
| Code documentation | `jsdoc`, `typedoc`, `sphinx` |
| Technical documentation | `technical writing`, `docs as code` |
---
## Code Quality
| Need | Search keywords |
|------|-----------|
| Code review | `code review`, `lint`, `eslint` |
| Refactoring | `refactor`, `clean code` |
| Code formatting | `prettier`, `format`, `formatter` |
| Static analysis | `static analysis`, `sonarqube` |
| Type checking | `typescript`, `type checking`, `flow` |
---
## Design / UX
| Need | Search keywords |
|------|-----------|
| UI design | `ui design`, `component library`, `design system` |
| UX improvement | `ux`, `user experience`, `usability` |
| Accessibility | `accessibility`, `a11y`, `wcag` |
| Icons | `icon`, `emoji`, `illustration` |
| Design guidelines | `web-design-guidelines`, `design guidelines` |
| Figma | `figma`, `figma design`, `ui prototyping` |
---
## SEO / Marketing
| Need | Search keywords |
|------|-----------|
| SEO audit | `seo audit`, `seo analysis` |
| SEO optimization | `seo`, `search engine optimization` |
| Content marketing | `content marketing`, `copywriting` |
| Social media | `social media`, `twitter`, `discord` |
| Paid advertising | `paid ads`, `google ads`, `facebook ads` |
---
## Database / Backend
| Need | Search keywords |
|------|-----------|
| Database design | `database design`, `sql`, `postgresql` |
| API development | `api`, `rest api`, `graphql` |
| Authentication and authorization | `auth`, `oauth`, `jwt`, `authentication` |
| Caching | `cache`, `redis`, `memcached` |
| Backend frameworks | `nodejs`, `express`, `fastapi`, `django` |
| ORM | `orm`, `prisma`, `sqlalchemy` |
---
## Mobile
| Need | Search keywords |
|------|-----------|
| React Native | `react native`, `mobile development` |
| Flutter | `flutter`, `cross-platform` |
| iOS development | `ios`, `swift`, `xcode` |
| Android development | `android`, `kotlin`, `java` |
| Mobile testing | `mobile testing`, `appium` |
---
## AI / Agent
| Need | Search keywords |
|------|-----------|
| LLM integration | `llm`, `openai`, `claude`, `gpt` |
| Prompt engineering | `prompt engineering`, `prompt template` |
| Agent development | `agent`, `autonomous`, `multi-agent` |
| Vector databases | `vector db`, `pinecone`, `chroma` |
| RAG | `rag`, `retrieval`, `knowledge retrieval` |
| Agent personality | `soultrace`, `agent personality`, `memory` |
---
## Productivity / Automation
| Need | Search keywords |
|------|-----------|
| Productivity tools | `productivity`, `automation`, `workflow` |
| Task management | `task management`, `todo`, `project` |
| Note taking | `note taking`, `knowledge base`, `obsidian` |
| Scheduled jobs | `cron`, `scheduler`, `job queue` |
| Workflow automation | `workflow automation`, `n8n`, `make` |
---
## Video / Media
| Need | Search keywords |
|------|-----------|
| Video generation | `video generation`, `remotion`, `animation` |
| Image generation | `image generation`, `midjourney`, `dalle` |
| Audio processing | `audio`, `speech to text`, `tts` |
| Media processing | `media processing`, `ffmpeg` |
---
## Data Processing
| Need | Search keywords |
|------|-----------|
| Data analysis | `data analysis`, `pandas`, `jupyter` |
| Data visualization | `visualization`, `charts`, `dashboard` |
| ETL | `etl`, `data pipeline`, `airflow` |
| Data engineering | `data engineering`, `spark`, `kafka` |
---
## Security
| Need | Search keywords |
|------|-----------|
| Security audit | `security audit`, `penetration testing` |
| Secret management | `secret management`, `vault`, `aws secrets` |
| Dependency scanning | `dependency scanning`, `snyk`, `renovate` |
| Security hardening | `security hardening`, `owasp` |
---
## Platform Integrations
| Platform | Search keywords |
|------|-----------|
| GitHub | `github`, `github api`, `github actions` |
| Feishu (Lark) | `feishu`, `lark`, `飞书` |
| Notion | `notion`, `notion api` |
| Slack | `slack`, `slack bot` |
| Discord | `discord`, `discord bot` |
| Twitter/X | `twitter`, `x api`, `tweet` |
| WeChat | `wechat`, `wechat bot`, `wechat mp` |
---
## Multi-Agent / Collaboration
| Need | Search keywords |
|------|-----------|
| Multi-agent systems | `multi-agent`, `agent orchestration` |
| Agent communication | `agent communication`, `agent protocol` |
| Task orchestration | `task orchestration`, `workflow` |
| Agent collaboration | `agent collaboration`, `multi-agent ai` |
---
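## Search Strategy Summary
1. **Specific to general**: search specific keywords first and widen the scope only when nothing is found
2. **Try other languages**: translate Chinese requests into English before searching
3. **Expand with synonyms**: try several synonyms for the same concept
4. **Combine official and community**: prefer official packages and fill gaps with community ones
5. **Prefer popular skills**: match high-install skills first (see the 🔥 popular-skills table at the top)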
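The "specific to general" strategy amounts to a simple fallback loop. In the sketch below, `search_with_fallback` and the injected `search` callable are hypothetical; `search` stands in for whichever source command is being queried.

```python
def search_with_fallback(keywords, search):
    """Try keywords from most specific to most general.

    `keywords` is ordered from specific to general; `search` is any callable
    that takes a keyword and returns a list of results (an assumed interface).
    Returns (keyword_used, results), or (None, []) when every attempt is empty.
    """
    for keyword in keywords:
        results = search(keyword)
        if results:
            return keyword, results
    return None, []
```

For example, `search_with_fallback(["react testing", "testing"], search)` only falls back to the broader `"testing"` when `"react testing"` returns nothing.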
---
## Keyword → Skill Quick Mapping
| User request keywords | Matching skill |
|----------------|----------|
| `find skill`, `搜技能` | `find-skills` (1.2M) |
| `frontend`, `前端` | `frontend-design` (340K) |
| `react best`, `React最佳实践` | `vercel-react-best-practices` (351K) |
| `web design`, `Web设计` | `web-design-guidelines` (280K) |
| `agent personality`, `Agent个性` | `soultrace` (318K) |
| `video`, `动画` | `remotion-best-practices` (269K) |
| `azure`, `microsoft云` | `microsoft-foundry` (254K) |
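The mapping table can be applied with a plain dictionary lookup. This is a sketch under assumptions: `KEYWORD_TO_SKILL` copies the rows of the table, while `match_skill` and its case-insensitive substring matching are illustrative choices, not the skill's documented behavior.

```python
# Keyword → skill pairs copied from the table above
KEYWORD_TO_SKILL = {
    "find skill": "find-skills",
    "搜技能": "find-skills",
    "frontend": "frontend-design",
    "前端": "frontend-design",
    "react best": "vercel-react-best-practices",
    "React最佳实践": "vercel-react-best-practices",
    "web design": "web-design-guidelines",
    "Web设计": "web-design-guidelines",
    "agent personality": "soultrace",
    "Agent个性": "soultrace",
    "video": "remotion-best-practices",
    "动画": "remotion-best-practices",
    "azure": "microsoft-foundry",
    "microsoft云": "microsoft-foundry",
}


def match_skill(request):
    """Return the first mapped skill whose keyword appears in the request."""
    text = request.lower()
    for keyword, skill in KEYWORD_TO_SKILL.items():
        if keyword.lower() in text:
            return skill
    return None  # no match: skip this step and continue with a normal search
```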
FILE:references/sources.md
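# Search Sources in Detail
## 1. ClawHub (official ecosystem)
```bash
# Search skills (vector search)
clawhub search "<keyword>"
# Browse popular skills (sorting supported)
clawhub explore --sort installs --limit 20 # by install count
clawhub explore --sort rating --limit 20 # by rating
clawhub explore --sort trending --limit 20 # by trend
# View details
clawhub inspect <skill-name>
```
**Highlights**: official ecosystem, strict quality control, well-structured SKILL.md format
## 2. skills.sh (Vercel ecosystem, first choice for quality)
### Basic Commands
```bash
# Search skills
npx skills find "<keyword>"
# Browse all available skills
npx skills add vercel-labs/agent-skills --list
```
### Advanced Usage
| Command | Description |
|------|------|
| `npx skills find "<keyword>"` | Search skills |
| `npx skills add owner/repo --list` | Preview available skills (without installing) |
| `npx skills list` | List installed skills |
| `npx skills list -g` | List globally installed skills |
| `npx skills find "popular"` | Browse popular skills |
### Leaderboard Categories
| Category | Description | When to use |
|------|------|---------|
| **All Time** | Ranked by total installs | Find mature, time-tested skills |
| **Trending (24h)** | Installs in the last 24 hours | Spot the latest trends |
| **Hot** | Real-time popularity | Follow what is gaining traction right now |
**Visit** https://skills.sh for the full leaderboard
### Ranking Mechanism
skills.sh install counts are based on **anonymous telemetry** (reported automatically on install, aggregated, with no personal tracking):
- ≥100K → top tier | ≥10K → popular
- Official skills (vercel-labs, anthropics, microsoft) undergo routine security audits
**Highlights**: built by Vercel, active community, frequent updates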
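## 3. LobeHub (community marketplace)
```bash
# Search skills
npx -y @lobehub/market-cli skills search --q "<keyword>"
```
**Highlights**: community-contributed, broad marketplace, good for discovering innovative or experimental skills
## 4. OpenClaw Directory
- Website: https://www.openclawdirectory.dev/skills
- Search by category, popularity, or keyword
- Detail pages show install counts and descriptions
## 5. GitHub (supplementary source)
```bash
# Web search queries (enter these in a search engine, not a shell)
site:github.com "openclaw skill"
site:github.com "SKILL.md"
```
**Recommended sources**:
- `vercel-labs/agent-skills` - official Vercel
- `anthropics/` - Anthropic-related
- `microsoft/` - Microsoft-related
- `openclaw/` - official OpenClaw
## 6. Community Forums
- SitePoint: https://www.sitepoint.com/community/
- Discord: https://discord.com/invite/clawd
## Search Tips
1. **Use specific keywords**: `"react testing"` works better than `"testing"` alone
2. **Try alternative terms**: if `"deploy"` returns nothing, try `"deployment"`
3. **Check popular sources**: vercel-labs, anthropics, microsoft, openclaw
4. **Translate Chinese keywords**: understand the request in Chinese first, then search in English
5. **Search order**: skills.sh (first choice) → ClawHub → LobeHub → GitHub
6. **Timeout handling**: if a single source times out, continue with the other sources; if every source times out, suggest visiting the websites manually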
---
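## Appendix: Output Format Fields
### skills.sh Install-Count Fields
| Field | Description |
|------|------|
| 总安装量 (total installs) | Cumulative All Time install count (higher means more mature) |
| 24h 安装量 (24h installs) | New installs from Trending 24h (higher means more popular right now) |
| 排名 (rank) | Current position on the skills.sh leaderboard |
### Install Command Syntax
```bash
# Install a specific skill
npx skills add owner/repo@skill-name -g -y
# Install a whole skill repository
npx skills add owner/repo -g -y
# Parameters
# -g: global (user-level) install
# -y: skip the confirmation prompt
# @skill-name: select one specific skill (omit to install the whole repository)
```
### LobeHub Install Command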
```bash
npx -y @lobehub/market-cli skills install <skill-name> --agent open-claw
```
FILE:references/troubleshooting.md
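# Troubleshooting Guide
## ClawHub Search Returns No Results
**Causes**:
- Network connectivity problems
- Keywords not specific enough
- Rate limit triggered
**Solutions**:
```bash
# 1. Check the network
curl -I https://clawhub.ai
# 2. Try different keywords (synonyms / English)
clawhub search "weather" # try English
clawhub search "tavily search" # try a specific name
# 3. Browse popular skills to discover other options
clawhub explore --sort installs --limit 20
# 4. Search on the website manually
open https://clawhub.ai
```
## skills.sh Search Returns No Results
**Causes**:
- `npx` is not available
- Network problems
- Wrong package name or keyword
**Solutions**:
```bash
# 1. Confirm that npx is available
npx --version
# 2. Try a more general keyword
npx skills find "web"
# 3. Use the full package name
npx skills find "vercel-labs/agent-skills@seo-best-practices"
```
## LobeHub Search Returns No Results
**Causes**:
- `npx` is not available
- Network problems
**Solutions**:
```bash
# 1. Confirm that npx is available
npx --version
# 2. Try the web search (quote the URL so the shell does not interpret ? < >)
open "https://lobehub.com/skills?q=<keyword>"
```
## Rate Limiting
**Cause**: the ClawHub API rate limit was triggered
**Solutions**:
```bash
# 1. Wait 1 hour and try again
sleep 3600
# 2. Use an alternative source (website search)
open "https://clawhub.ai/search?q=<keyword>"
# 3. Search GitHub manually
open "https://github.com/search?q=openclaw+skill+<keyword>"
```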
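## Install Failed
**Causes**:
- Network problems
- Permission problems
- Wrong package name
- Installed to the wrong path
**Solutions**:
```bash
# 1. Check whether ~/.openclaw/skills/ is writable
ls -la ~/.openclaw/skills/
# 2. Retry with --force
clawhub install <skill-name> --force
# 3. Confirm the package name
clawhub search "<keyword>" # confirm the exact package name
# 4. Use the China mirror (for network problems)
clawhub install <skill-name> --registry https://cn.clawhub-mirror.com
```
## Verification Fails After Install
**Causes**:
- The install actually failed without reporting an error
- The skill was installed to the wrong path
**Solutions**:
```bash
# 1. Check whether the skill landed in the right path
ls ~/.openclaw/skills/<skill-name>/SKILL.md # correct path
ls ~/.openclaw/<skill-name>/SKILL.md # wrong path
# 2. If it was installed to the wrong path, move it manually
mv ~/.openclaw/<skill-name> ~/.openclaw/skills/<skill-name>/
# 3. Reinstall
clawhub install <skill-name> --force
```
## Low-Quality Search Results
**Causes**:
- Keywords are too broad
- No relevant skill actually exists
**Solutions**:
```bash
# 1. Use more specific keywords
clawhub search "seo audit" # more specific than "seo"
clawhub search "web scraping" # more specific than "web"
# 2. Try several related keywords
clawhub search "weather"
clawhub search "weather forecast"
clawhub search "weather api"
# 3. Browse popular skills for ideas
clawhub explore --sort installs --limit 20
# 4. If there really is no suitable match, tell the user so explicitly
```
## Invalid Command Options
**Symptom**: a command fails with an error such as `unknown option '--sort'`
**Cause**: the command was given an option that does not exist
**Solution**:
```bash
# Sorting belongs to the explore command, not search
clawhub explore --sort installs --limit 20 # ✅ correct
# not
clawhub search --sort installs # ❌ wrong
```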