@clawhub-harryzsh-88cbe715e7
---
name: wikisage
description: "A Karpathy-style persistent LLM wiki. Use when: (1) user says '加进wiki/ingest/摄入', (2) user says '查wiki/wiki里有没有', (3) user says '整理wiki/lint', (4) answering questions that should check long-lived local knowledge first. Also use after answering valuable technical questions to ask if user wants to save to wiki."
metadata:
---
# Wikisage Skill
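A persistent wiki built on Karpathy's llm-wiki pattern.
The LLM writes and maintains all content; the user supplies sources, exploration direction, and questions.
Plain local markdown files, navigated through index.md, with no vector database required.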
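## 📍 Path conventions (environment-variable driven)
Every path in this skill is derived from environment variables; nothing is hard-coded:
| Variable | Default | Purpose |
|------|--------|------|
| `WIKI_ROOT` | `$HOME/.openclaw/workspace/wiki` | Root directory of the wiki markdown files |
| `MCPORTER_CONFIG` | `$HOME/.openclaw/workspace/config/mcporter.json` | mcporter config file (optional) |
| `WIKI_SKILL_DIR` | `$HOME/.openclaw/workspace/skills/wikisage` | The skill's own directory (where the scripts live) |
On first deployment, export these three variables in your shell/agent environment (or rely on the defaults).
The examples below write `$WIKI_ROOT` and friends instead of absolute paths.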
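The resolution rule in the table can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill: `resolve_paths` and its `env`/`home` parameters are hypothetical names, and the logic simply mirrors the "environment variable first, default second" behaviour that `dedup.py` and `lint.py` apply to `WIKI_ROOT`.

```python
import os
from pathlib import Path

# Defaults copied from the table above, expressed relative to the home directory.
_DEFAULTS = {
    "WIKI_ROOT": ".openclaw/workspace/wiki",
    "MCPORTER_CONFIG": ".openclaw/workspace/config/mcporter.json",
    "WIKI_SKILL_DIR": ".openclaw/workspace/skills/wikisage",
}


def resolve_paths(env=None, home=None):
    """Return {variable: Path}, preferring the environment over the defaults."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    resolved = {}
    for name, default in _DEFAULTS.items():
        value = env.get(name)
        # An unset or empty variable falls back to the default under the home directory.
        resolved[name] = Path(value).expanduser() if value else home / default
    return resolved
```

Calling `resolve_paths()` with no arguments reads the real environment; passing `env` and `home` explicitly keeps the helper easy to test.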
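## 🛠 Execution channel: Obsidian MCP (preferred, strongly recommended)
> **This skill is designed around the Obsidian filesystem MCP server.** It still runs without MCP (via the `read`/`write`/`edit` fallback), but MCP makes it more robust: the allowed-dir boundary acts as a safety net, errors are more structured, and the LLM cannot accidentally write outside the wiki.
**All wiki file reads and writes go through the Obsidian filesystem MCP first**, not the generic `read`/`write` tools.
| Operation | MCP call |
|------|----------|
| Read a file | `mcporter call obsidian.read_text_file path=<abs path>` |
| Write/overwrite a file | `mcporter call obsidian.write_file path=<abs> content=<str>` |
| List a directory | `mcporter call obsidian.list_directory path=<abs>` |
| Search file names | `mcporter call obsidian.search_files path=<abs> pattern=<glob>` |
| Edit a file | `mcporter call obsidian.edit_file path=<abs> edits=...` |
| Show the boundary | `mcporter call obsidian.list_allowed_directories` |
**Every call needs `--config $MCPORTER_CONFIG`**
(mcporter has a dual-config pitfall: it reads both `~/.claude.json` and the project config, and without `--config` it only sees the servers in claude.json)
**Fallback**: when MCP is unavailable (daemon down, server not healthy), fall back to the generic `read`/`write`/`edit`/`exec grep` tools, and tell the user in the reply: "MCP is offline, using fallback".
**Full-text search does not go through MCP**: MCP search only matches file names. To search content, use:
- qmd-search (workspace collection, BM25; fast, but the index may lag)
- `exec grep -rn "<keyword>" $WIKI_ROOT/`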
---
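## Triggers
| User says | Action |
|---|---|
| "加进 wiki" / "ingest" / "摄入这篇" (add to the wiki / ingest this) | → ingest flow |
| "查 wiki" / "wiki 里有没有" / "从 wiki 查" (look it up in the wiki) | → query flow |
| "整理 wiki" / "wiki 健康检查" / "lint" (tidy the wiki / wiki health check) | → lint flow |
| Technical questions involving **clients, past decisions, or account information** | → check the local wiki first, then answer |
| General technical questions (no specific context) | → straight to MCP → LLM |
| After answering a valuable technical question | → ask "Do you want to save this to the wiki?" |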
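## Three-layer architecture
```
$WIKI_ROOT/
├── raw/                 Source documents (read-only; added by the user, never modified by the LLM)
├── pages/               Markdown pages generated and maintained by the LLM
│   ├── aws/             AWS services, architecture, compliance
│   ├── ai/              AI/LLM technology
│   ├── clients/         Client information (accounts, contacts, projects)
│   ├── projects/        Specific projects
│   └── ops/             Operations, kubectl, DevOps
├── index.md             Catalog of all pages (title + one-line description + path), updated after every ingest
├── log.md               Operation log (append-only; format: ## [YYYY-MM-DD] ingest | title)
└── .ingest-cache.json   SHA256 dedup cache (maintained by dedup.py, not part of the Obsidian vault)
```
**There is only one wiki directory:** `$WIKI_ROOT` (which is also the Obsidian MCP allowed dir)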
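## Query flow
See `scripts/query.md` for details.
Core logic:
1. Use `obsidian.read_text_file` to read `$WIKI_ROOT/index.md` and find the relevant pages
2. Use `obsidian.read_text_file` to read those pages in full, synthesize an answer, and cite sources as `> 参考:[[page name]]`
3. If the answer itself is valuable → ask the user whether to save it back to the wiki
## Ingest flow
See `scripts/ingest.md` for details.
Core logic:
0. Run `dedup.py check` for deduplication (when the source is a file/URL) → stop on DUPLICATE
1. Use `obsidian.read_text_file` to read index.md and decide whether a related page already exists
2. Use `obsidian.write_file` / `obsidian.edit_file` to create or update pages (one ingest may touch 5-15 pages)
3. Use `obsidian.edit_file` to update index.md
4. Use `obsidian.edit_file` to append to log.md (`## [YYYY-MM-DD] ingest | source title`)
5. Run `dedup.py record` to record the SHA256 cache entry (when the source is a file/URL)
## Lint flow
See `scripts/lint.md` for details.
Checks: orphan pages, missing concept pages, index.md inconsistencies, contradictory content, stale content
(the lint.py script reads the filesystem directly from Python and does not go through MCP; the LLM's Layer 2 fixes go through MCP)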
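## Page template
The field labels and section headings are literal and stay in Chinese: `最后更新` (last updated), `来源数量` (source count), `分类` (category), `置信度` (confidence); `概述` (overview), `核心内容` (core content), `相关页面` (related pages), `来源` (sources).
```markdown
# 页面标题
**最后更新:** YYYY-MM-DD
**来源数量:** N
**分类:** aws/security
**置信度:** EXTRACTED <!-- page-level default; individual paragraphs may override it -->
## 概述
## 核心内容
<!-- Confidence can be marked at paragraph/sentence level with an inline tag: -->
<!-- [EXTRACTED] a fact taken directly from the source text -->
<!-- [INFERRED] a conclusion reasoned from the sources -->
<!-- [AMBIGUOUS] the source itself is vaguely worded -->
<!-- [UNVERIFIED] background knowledge added by the AI, not verified against a source -->
## 相关页面
- [[相关页面名]]
## 来源
- [[原始文档页面名]]
- [外部链接](https://...)
```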
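### Confidence tag rules (mandatory)
| Tag | Meaning | When to use |
|-----|------|-----------|
| `EXTRACTED` | A fact taken directly from the source text | Pricing, API parameters, official wording |
| `INFERRED` | Derived by reasoning over or combining sources | "So the monthly cost is about $80" (the source only gave a unit price) |
| `AMBIGUOUS` | The source itself is unclear | The document contradicts itself or is vaguely worded |
| `UNVERIFIED` | Background knowledge added by the AI, with no source | Common-knowledge filler added to make the page read smoothly |
**Principles:**
- Write the page-level default confidence in the frontmatter; **do not omit it**
- If a page mixes content of different confidence levels, **mark it with an inline tag at the start of the paragraph or the end of the sentence**
- If a query answer cites `INFERRED` / `UNVERIFIED` content, **say so explicitly in the answer** ("this part is inferred")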
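The disclosure rule for query answers can be checked mechanically. The sketch below is an illustration under stated assumptions, not something the skill ships: `tags_to_disclose` is a hypothetical helper, it only looks at inline `[TAG]` markers in the quoted text, and it ignores the page-level default confidence, which would need to be handled separately.

```python
import re

# Inline tags as written in the page template, e.g. "[INFERRED] ...".
TAG_PATTERN = re.compile(r"\[(EXTRACTED|INFERRED|AMBIGUOUS|UNVERIFIED)\]")

# Tags that the rules above require an answer to call out explicitly.
MUST_DISCLOSE = {"INFERRED", "UNVERIFIED"}


def tags_to_disclose(quoted_text: str) -> set:
    """Return the inline tags in quoted_text that an answer must mention."""
    return set(TAG_PATTERN.findall(quoted_text)) & MUST_DISCLOSE
```

An empty result means the quoted passage carries no inline tag that needs a caveat; a non-empty one tells the agent which caveat to add.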
## log.md format
Each entry has the form `## [YYYY-MM-DD] {operation} | {title}`
```
## [2026-04-09] ingest | Karpathy llm-wiki pattern
## [2026-04-09] query | S3 Files POSIX access options
## [2026-04-09] lint | Full-wiki health check
```
Use `grep "^## \[" $WIKI_ROOT/log.md | tail -10` to see the most recent operations.
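Because every entry follows one fixed shape, the log is easy to parse. The following is a minimal sketch, assuming the format is exactly `## [YYYY-MM-DD] {operation} | {title}` with a single-word operation; `LogEntry` and `parse_log_line` are hypothetical names used only for illustration.

```python
import re
from typing import NamedTuple, Optional

# "## [2026-04-09] ingest | Some title" -> date, operation, title
LOG_LINE = re.compile(r"^## \[(\d{4}-\d{2}-\d{2})\] (\S+) \| (.+)$")


class LogEntry(NamedTuple):
    date: str
    operation: str
    title: str


def parse_log_line(line: str) -> Optional[LogEntry]:
    """Parse one log.md heading line; return None for lines that are not entries."""
    match = LOG_LINE.match(line.rstrip("\n"))
    return LogEntry(*match.groups()) if match else None
```

Lines that do not match (blank lines, free-form notes) simply yield `None`, so a caller can filter them out while iterating over the file.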
FILE:README.md
# wikisage
A **Karpathy-style LLM Wiki** packaged as an [AgentSkill](https://github.com/openclaw/openclaw) for
[OpenClaw](https://openclaw.ai) / Claude Code / any skill-aware agent.
> Persistent, plain-markdown knowledge base where **the LLM writes and maintains all content**,
> and the user supplies sources, exploration direction, and questions.
> No vector database — an `index.md` plus Obsidian-style `[[wikilinks]]` is enough.
Inspired by Andrej Karpathy's "LLM wiki" pattern.
---
## ✨ Features
- **Three-layer structure**: `raw/` (sources) → `pages/` (LLM-maintained knowledge) → `index.md` (navigation)
- **Confidence tagging**: every page declares `EXTRACTED` / `INFERRED` / `AMBIGUOUS` / `UNVERIFIED` at both frontmatter and paragraph level — the LLM must surface this in answers
- **SHA256 dedup** for ingest sources (files / URLs), so you never re-index the same PDF twice
- **Two-layer lint**:
  - *Layer 1*: `lint.py` (mechanical scan: orphans, missing concept pages, stale pages, missing cross-refs, missing confidence tags, index consistency)
  - *Layer 2*: LLM walks the report and fixes issues interactively via MCP
- **Obsidian MCP first**: all reads/writes prefer the filesystem-sandboxed Obsidian MCP server, with `read`/`write`/`edit` fallback
- **Everything logged**: `log.md` is an append-only timeline of every ingest / query / lint operation
- **Cross-platform**: Linux, macOS, Windows — pure `pathlib`, no POSIX-only calls
---
## 🖥️ Platform support
Works on **Linux, macOS, and Windows**. All scripts use `pathlib` + `os.path.expanduser("~")`,
so `~` resolves correctly everywhere (`/home/you` on Linux, `/Users/you` on macOS,
`C:\Users\you` on Windows). There are no POSIX-only syscalls.
Only the *shell one-liners* in this README differ per OS — see platform-specific blocks below.
> **Heads-up**: this skill relies on an **Obsidian filesystem MCP server** as its primary
> read/write channel. It falls back to `read`/`write`/`edit` tools if MCP isn't wired up, but
> you get meaningfully better behavior (sandboxing, structured errors) with it. See
> [Dependencies](#-dependencies) below.
## 📦 Install
### As an OpenClaw skill
**Linux / macOS:**
```bash
git clone https://github.com/harryzsh/wikisage \
~/.openclaw/workspace/skills/wikisage
```
**Windows (PowerShell):**
```powershell
git clone https://github.com/harryzsh/wikisage `
"$HOME\.openclaw\workspace\skills\wikisage"
```
That's it — OpenClaw auto-discovers skills at startup.
### As a Claude Code skill
**Linux / macOS:**
```bash
git clone https://github.com/harryzsh/wikisage ~/.claude/skills/wikisage
```
**Windows (PowerShell):**
```powershell
git clone https://github.com/harryzsh/wikisage "$HOME\.claude\skills\wikisage"
```
### As a generic agent skill
Copy the folder into whatever directory your agent scans for skills, or point the agent at
`SKILL.md` directly.
---
## ⚙️ Configuration
All paths are driven by environment variables with safe defaults:
| Variable | Default | Purpose |
|----------|---------|---------|
| `WIKI_ROOT` | `$HOME/.openclaw/workspace/wiki` | Where the markdown wiki lives |
| `WIKI_SKILL_DIR` | `$HOME/.openclaw/workspace/skills/wikisage` | Where this skill is installed (scripts referenced by SKILL.md) |
| `MCPORTER_CONFIG` | `$HOME/.openclaw/workspace/config/mcporter.json` | Optional — path to your [mcporter](https://github.com/CrazyPython/mcporter) config (for the Obsidian MCP server) |
| `AWS_REGION` / `WIKI_EMBED_SECRET` | `us-east-1` / `wikisage/opensearch` | Only used by the optional `embed.py` (see below) |
> The skill itself is **channel-agnostic**. It does not push notifications anywhere. If you
> want weekly lint reports delivered to chat/email/a webhook, pipe `lint.py --summary` from
> your scheduler — see [Weekly lint schedule](#-weekly-lint-schedule) for examples.
Set them once in your shell profile, agent env, or cron line.
**Linux / macOS (bash/zsh):**
```bash
export WIKI_ROOT="$HOME/my-wiki"
export WIKI_SKILL_DIR="$HOME/.openclaw/workspace/skills/wikisage"
```
**Windows (PowerShell, current session):**
```powershell
$env:WIKI_ROOT = "$HOME\my-wiki"
$env:WIKI_SKILL_DIR = "$HOME\.openclaw\workspace\skills\wikisage"
```
**Windows (persistent, user-level):**
```powershell
[Environment]::SetEnvironmentVariable("WIKI_ROOT", "$HOME\my-wiki", "User")
[Environment]::SetEnvironmentVariable("WIKI_SKILL_DIR", "$HOME\.openclaw\workspace\skills\wikisage", "User")
```
> **Note for Windows users**: defaults like `~/.openclaw/workspace/wiki` resolve to
> `C:\Users\<you>\.openclaw\workspace\wiki`. If you prefer a more Windows-native location
> (e.g. `%USERPROFILE%\Documents\wiki`), just set `WIKI_ROOT` explicitly.
---
## 🗂 Initial wiki layout
After install, create the empty skeleton (or let the first ingest create it).
**Linux / macOS:**
```bash
mkdir -p "$WIKI_ROOT"/{raw,pages/{aws,ai,clients,projects,ops},.lint-history}
cat > "$WIKI_ROOT/index.md" <<'EOF'
# Wiki Index
_Pages auto-listed here by the LLM after each ingest._
EOF
touch "$WIKI_ROOT/log.md"
```
**Windows (PowerShell):**
```powershell
$root = $env:WIKI_ROOT
"raw","pages\aws","pages\ai","pages\clients","pages\projects","pages\ops",".lint-history" |
ForEach-Object { New-Item -ItemType Directory -Force -Path "$root\$_" | Out-Null }
"# Wiki Index`n`n_Pages auto-listed here by the LLM after each ingest._" |
Set-Content -Path "$root\index.md" -Encoding UTF8
New-Item -ItemType File -Force -Path "$root\log.md" | Out-Null
```
---
## 🔌 Dependencies
### Required
- **Python ≥ 3.9** (stdlib only for `lint.py` / `dedup.py` — no `pip install` needed)
- **Git** (to clone this repo)
### Strongly recommended: Obsidian filesystem MCP server
This skill is **designed around an Obsidian-style filesystem MCP server** as its primary
read/write channel. All operating rules in [`SKILL.md`](./SKILL.md) assume the LLM can call
`obsidian.read_text_file`, `obsidian.write_file`, `obsidian.edit_file`, `obsidian.list_directory`,
`obsidian.search_files`, and `obsidian.list_allowed_directories`.
**Why it matters:**
- Sandboxes all writes inside `$WIKI_ROOT` (allowed-dir enforcement) — the LLM can't accidentally
touch files outside the wiki.
- Gives structured errors the LLM can reason about, instead of raw shell failures.
- Matches the Obsidian editor's view if you also open the same directory in Obsidian desktop
(with any filesystem-based sync plugin) — works fine on Windows, macOS, Linux.
**How to wire it up via [mcporter](https://github.com/CrazyPython/mcporter):**
Add this to your `mcporter.json` (path defaults to `~/.openclaw/workspace/config/mcporter.json`,
or wherever `$MCPORTER_CONFIG` points):
```json
{
  "servers": {
    "obsidian": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "<absolute path to $WIKI_ROOT>"]
    }
  }
}
```
Replace `<absolute path to $WIKI_ROOT>` with the real path — most MCP launchers don't expand
environment variables inside the `args` array. Examples:
- Linux/macOS: `/home/you/.openclaw/workspace/wiki` or `/Users/you/wiki`
- Windows: `C:\\Users\\you\\.openclaw\\workspace\\wiki` (escape the backslashes in JSON)
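If hand-escaping the path is error-prone, the snippet can be generated instead. This is a small sketch, assuming the config shape shown above; `obsidian_server_config` is a hypothetical helper, and `json.dumps` is what takes care of doubling the backslashes in a Windows path.

```python
import json


def obsidian_server_config(wiki_root) -> str:
    """Render the mcporter.json snippet above for a given wiki directory."""
    config = {
        "servers": {
            "obsidian": {
                "type": "stdio",
                "command": "npx",
                "args": [
                    "-y",
                    "@modelcontextprotocol/server-filesystem",
                    str(wiki_root),  # must already be an absolute path
                ],
            }
        }
    }
    return json.dumps(config, indent=2)
```

For example, `print(obsidian_server_config(r"C:\Users\you\wiki"))` emits the path with escaped backslashes, ready to paste into the config file.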
Alternative MCP servers that work the same way:
- [`@modelcontextprotocol/server-filesystem`](https://github.com/modelcontextprotocol/servers/tree/main/src/filesystem) (vanilla, used above)
- Any other filesystem-style MCP server that exposes `read_text_file` / `write_file` / `edit_file` / `list_directory`
**Fallback without MCP:** the skill still works — `SKILL.md` explicitly tells the LLM to fall
back to plain `read` / `write` / `edit` tools and log that MCP is offline. You lose the sandbox
guarantee but everything else keeps running.
### Optional / experimental
- `embed.py` — Bedrock Titan embeddings → OpenSearch indexing. Requires AWS creds, a secret named
`$WIKI_EMBED_SECRET` containing `{endpoint, username, password}`, and
`pip install boto3 opensearch-py requests-aws4auth`. Skip unless you want semantic search on
top of the wiki.
---
## 🚀 Usage
Once the skill is loaded, talk to your agent naturally:
| You say | Skill does |
|---------|-----------|
| "加进 wiki" / "ingest this" | Reads `index.md`, decides new-vs-update, writes page + updates index + logs |
| "查 wiki" / "what do we have on X" | Reads `index.md` + relevant pages, answers with `> 参考:[[page]]` citations |
| "整理 wiki" / "lint the wiki" | Runs `lint.py`, then LLM walks the report interactively to fix issues |
Under the hood the agent follows the flows in `scripts/ingest.md`, `scripts/query.md`, `scripts/lint.md`.
---
## 🗓 Weekly lint schedule
`lint.py` only *scans* and writes a report to `$WIKI_ROOT/.lint-history/YYYY-MM-DD.md`. It
does not push notifications anywhere — **delivery is your scheduler's job**. Use
`--summary` to get a single-line status suitable for piping into mail/chat/webhooks.
### Linux / macOS (cron)
```cron
# Every Monday 02:00 local time: run full lint, write report to .lint-history/
# (a crontab entry must stay on one line; cron does not support backslash continuations)
0 2 * * 1 WIKI_ROOT=$HOME/.openclaw/workspace/wiki python3 $HOME/.openclaw/workspace/skills/wikisage/scripts/lint.py >> $HOME/.openclaw/workspace/wiki/.lint-history/cron.log 2>&1
```
**Pipe the one-line summary to whatever you use:**
```bash
# email
python3 .../lint.py --summary | mail -s 'wiki lint' you@example.com
# Slack incoming webhook
python3 .../lint.py --summary | \
xargs -I{} curl -s -X POST -H 'Content-type: application/json' \
--data '{"text":"{}"}' https://hooks.slack.com/services/XXX/YYY/ZZZ
# Any chat via openclaw CLI (Feishu / Discord / Telegram / Slack / ...)
python3 .../lint.py --summary | \
xargs -I{} openclaw message send --channel feishu --target user:ou_xxx --message {}
# Discord webhook
python3 .../lint.py --summary | \
xargs -I{} curl -s -X POST -H 'Content-type: application/json' \
--data '{"content":"{}"}' https://discord.com/api/webhooks/XXX/YYY
```
### Windows (Task Scheduler, PowerShell)
Register a weekly task that runs Monday 02:00:
```powershell
$wikiRoot = "$HOME\.openclaw\workspace\wiki"
$skillDir = "$HOME\.openclaw\workspace\skills\wikisage"
$action = New-ScheduledTaskAction -Execute "python" `
-Argument "`"$skillDir\scripts\lint.py`""
$trigger = New-ScheduledTaskTrigger -Weekly -DaysOfWeek Monday -At 2am
$principal = New-ScheduledTaskPrincipal -UserId "$env:USERNAME" -LogonType Interactive
$settings = New-ScheduledTaskSettingsSet -StartWhenAvailable
Register-ScheduledTask -TaskName "wikisage-weekly-lint" `
-Action $action -Trigger $trigger -Principal $principal -Settings $settings
[Environment]::SetEnvironmentVariable("WIKI_ROOT", $wikiRoot, "User")
```
To deliver a summary to chat/email, wrap it in a small script that pipes `lint.py --summary`
into your tool of choice, then point the scheduled task at that wrapper.
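Such a wrapper could look like the sketch below. It is an assumption-laden illustration rather than a shipped script: the function names are hypothetical, the webhook URL is supplied by you, and the JSON keys (`text` for Slack, `content` for Discord) are taken from the shell examples above. Building the body with `json.dumps` also keeps quotes inside the summary from breaking the payload.

```python
import json
import subprocess
import sys
import urllib.request


def build_payload(summary: str, key: str = "text") -> bytes:
    """JSON-encode the summary; key is "text" for Slack, "content" for Discord."""
    return json.dumps({key: summary.strip()}).encode("utf-8")


def lint_summary(lint_script: str) -> str:
    """Run `lint.py --summary` with the current interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, lint_script, "--summary"],
        capture_output=True,
        text=True,
    )
    return result.stdout


def post(webhook_url: str, payload: bytes) -> None:
    """POST the JSON payload to the webhook."""
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30):
        pass
```

A scheduled task would then call something like `post(url, build_payload(lint_summary(path_to_lint_py)))`, where `url` and `path_to_lint_py` are your own values.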
---
## 🧭 Why this pattern?
Plain markdown + Obsidian-style links gives you:
- **Zero lock-in** — it's just `.md` files; any editor works
- **Version-control friendly** — your wiki content belongs in a separate (private) git repo
- **Grep-able forever** — no ORM, no schema migrations, no embeddings to rebuild
- **LLM-native** — every page fits in context, and the whole `index.md` is an agent's cognitive map
The LLM is responsible for *curation* (deduping, cross-referencing, contradiction detection),
not just bulk-dumping. Hence the confidence tags, the lint flow, and the append-only `log.md`.
---
## 📁 Repository layout
```
wikisage/
├── SKILL.md # skill manifest + operating rules (what the LLM reads)
├── scripts/
│ ├── ingest.md # ingest flow spec
│ ├── query.md # query flow spec
│ ├── lint.md # lint flow spec (Layer 1 + Layer 2)
│ ├── lint.py # Layer 1 mechanical scanner
│ ├── dedup.py # SHA256 dedup cache for sources
│ └── embed.py # optional: Bedrock Titan → OpenSearch
├── README.md # you are here
└── LICENSE # MIT
```
---
## 🔐 Separate your wiki content from this skill
**Do not commit your actual wiki (`$WIKI_ROOT`) to this public repo.**
This repo contains only the *skill definition*. Your wiki content — clients, account IDs,
decisions — should live in:
- a **separate private repo** (recommended), or
- a local Mutagen/rclone mount, or
- AWS S3 / any blob store
That separation is the whole point: the skill is reusable across machines; the knowledge is yours.
---
## 📝 License
MIT — see [LICENSE](./LICENSE).
## 🙏 Credits
Pattern inspired by [Andrej Karpathy](https://x.com/karpathy)'s "LLM wiki" idea.
Built for / battle-tested on [OpenClaw](https://openclaw.ai).
FILE:scripts/dedup.py
#!/usr/bin/env python3
"""
Wiki ingest dedup cache (SHA256).
Usage:
    python3 dedup.py check <file_or_url>                            # exit 0 = new content, 1 = duplicate, 2 = error
    python3 dedup.py record <file_or_url> <wiki_page_path> [title]  # record one entry
    python3 dedup.py list                                           # list all recorded entries
    python3 dedup.py stats                                          # show statistics
Environment variables:
    WIKI_ROOT    wiki markdown root directory (default ~/.openclaw/workspace/wiki)
Cache file: $WIKI_ROOT/.ingest-cache.json
Format:
    {
      "<sha256>": {
        "source": "file:/path/to/pdf OR https://...",
        "title": "source title (optional)",
        "wiki_page": "pages/aws/xxx.md",
        "ingested_at": "2026-04-25T06:07:00Z",
        "size_bytes": 12345
      },
      ...
    }
"""
import hashlib
import json
import os
import sys
import urllib.request
from datetime import datetime, timezone
from pathlib import Path
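

def _wiki_dir() -> Path:
    """Resolve the wiki root from $WIKI_ROOT, falling back to the default location."""
    env = os.environ.get("WIKI_ROOT")
    if env:
        return Path(env).expanduser()
    return Path.home() / ".openclaw/workspace/wiki"


WIKI_DIR = _wiki_dir()
CACHE_FILE = WIKI_DIR / ".ingest-cache.json"


def load_cache() -> dict:
    """Load the cache; a missing, unreadable, or corrupt file is treated as empty."""
    if not CACHE_FILE.exists():
        return {}
    try:
        # Explicit UTF-8: titles may be non-ASCII and the platform default differs on Windows.
        return json.loads(CACHE_FILE.read_text(encoding="utf-8"))
    except (ValueError, OSError):  # ValueError covers JSON and Unicode decode errors
        return {}


def save_cache(cache: dict) -> None:
    """Write the cache as pretty-printed UTF-8 JSON."""
    CACHE_FILE.parent.mkdir(parents=True, exist_ok=True)
    CACHE_FILE.write_text(
        json.dumps(cache, indent=2, ensure_ascii=False, sort_keys=True),
        encoding="utf-8",
    )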


def compute_hash(source: str) -> tuple[str, int]:
    """Return (sha256_hex, size_bytes) for a local path or URL. Raises on fetch errors."""
    if source.startswith(("http://", "https://")):
        req = urllib.request.Request(source, headers={"User-Agent": "wikisage-dedup/1.0"})
        with urllib.request.urlopen(req, timeout=30) as resp:
            data = resp.read()
    else:
        path = Path(source).expanduser()
        if not path.exists():
            raise FileNotFoundError(f"source not found: {source}")
        data = path.read_bytes()
    return hashlib.sha256(data).hexdigest(), len(data)


def cmd_check(args: list[str]) -> int:
    """Print NEW or DUPLICATE for a source; exit code 0 = new, 1 = duplicate, 2 = error."""
    if not args:
        print("usage: dedup.py check <file_or_url>", file=sys.stderr)
        return 2
    source = args[0]
    try:
        digest, size = compute_hash(source)
    except Exception as e:
        print(f"ERROR computing hash for {source}: {e}", file=sys.stderr)
        return 2
    cache = load_cache()
    if digest in cache:
        entry = cache[digest]
        print("DUPLICATE")
        print(f" sha256: {digest}")
        print(f" source: {entry.get('source')}")
        print(f" title: {entry.get('title', '-')}")
        print(f" wiki_page: {entry.get('wiki_page')}")
        print(f" ingested: {entry.get('ingested_at')}")
        return 1
    print("NEW")
    print(f" sha256: {digest}")
    print(f" size: {size} bytes")
    return 0


def cmd_record(args: list[str]) -> int:
    """Hash the source and store (or overwrite) its cache entry."""
    if len(args) < 2:
        print("usage: dedup.py record <file_or_url> <wiki_page_path> [title]", file=sys.stderr)
        return 2
    source = args[0]
    wiki_page = args[1]
    title = args[2] if len(args) > 2 else ""
    try:
        digest, size = compute_hash(source)
    except Exception as e:
        print(f"ERROR computing hash for {source}: {e}", file=sys.stderr)
        return 2
    cache = load_cache()
    now = datetime.now(timezone.utc).isoformat(timespec="seconds").replace("+00:00", "Z")
    cache[digest] = {
        "source": source,
        "title": title,
        "wiki_page": wiki_page,
        "ingested_at": now,
        "size_bytes": size,
    }
    save_cache(cache)
    print(f"RECORDED {digest[:16]}... {wiki_page}")
    return 0


def cmd_list(_args: list[str]) -> int:
    """List all cache entries, oldest first."""
    cache = load_cache()
    if not cache:
        print("(cache empty)")
        return 0
    for digest, entry in sorted(cache.items(), key=lambda kv: kv[1].get("ingested_at", "")):
        # Fall back to placeholders: formatting None with a width spec raises TypeError.
        ingested = entry.get("ingested_at") or "?"
        wiki_page = entry.get("wiki_page") or "-"
        print(f"{digest[:16]}... {ingested:<20} {wiki_page:<40} {entry.get('source')}")
    return 0


def cmd_stats(_args: list[str]) -> int:
    """Print the entry count, total source size, and cache file location."""
    cache = load_cache()
    total = len(cache)
    size = sum(e.get("size_bytes", 0) for e in cache.values())
    print(f"entries: {total}")
    print(f"total size: {size:,} bytes")
    print(f"cache file: {CACHE_FILE}")
    return 0


COMMANDS = {
    "check": cmd_check,
    "record": cmd_record,
    "list": cmd_list,
    "stats": cmd_stats,
}


def main(argv: list[str]) -> int:
    if len(argv) < 2 or argv[1] not in COMMANDS:
        print(__doc__)
        return 2
    return COMMANDS[argv[1]](argv[2:])


if __name__ == "__main__":
    sys.exit(main(sys.argv))
FILE:scripts/embed.py
#!/usr/bin/env python3
"""
embed.py - Bedrock Titan Embeddings → OpenSearch
Relative page paths are resolved against $WIKI_WORKSPACE (default ~/.openclaw/workspace).
Usage:
    # Index one page
    python3 embed.py --page wiki/pages/aws/eks.md --index wiki-personal
    # Index several pages
    python3 embed.py --pages wiki/pages/aws/eks.md wiki/pages/ai/litellm.md --index wiki-personal
    # Vector search
    python3 embed.py --query "EKS node group configuration" --index wiki-personal --top-k 5
    # Client wiki
    python3 embed.py --page wiki-clients/clientA/pages/xxx.md --index wiki-client-clientA
"""
import argparse
import json
import os
import sys
from datetime import datetime, timezone

import boto3

REGION = os.environ.get("AWS_REGION", "us-east-1")
SECRET_NAME = os.environ.get("WIKI_EMBED_SECRET", "wikisage/opensearch")
WORKSPACE = os.environ.get("WIKI_WORKSPACE", os.path.expanduser("~/.openclaw/workspace"))


def get_opensearch_config():
    """Fetch the OpenSearch settings ({endpoint, username, password}) from Secrets Manager."""
    sm = boto3.client("secretsmanager", region_name=REGION)
    secret = sm.get_secret_value(SecretId=SECRET_NAME)
    return json.loads(secret["SecretString"])


def get_embedding(text: str) -> list:
    """Embed text with Bedrock Titan Text Embeddings V2."""
    bedrock = boto3.client("bedrock-runtime", region_name=REGION)
    # Crude character-based cap (Titan v2 accepts at most 8,192 tokens); this is not a token count.
    body = json.dumps({"inputText": text[:8000]})
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    result = json.loads(response["body"].read())
    return result["embedding"]


def ensure_index(os_client, index_name: str):
    """Create the index if it does not exist yet."""
    if not os_client.indices.exists(index=index_name):
        mapping = {
            "settings": {"index": {"knn": True}},
            "mappings": {
                "properties": {
                    "page": {"type": "keyword"},
                    "content": {"type": "text"},
                    "embedding": {
                        "type": "knn_vector",
                        "dimension": 1024,  # Titan v2 default
                        "method": {
                            "name": "hnsw",
                            "space_type": "cosinesimil",
                            "engine": "nmslib",
                        },
                    },
                    "category": {"type": "keyword"},
                    "updated_at": {"type": "date"},
                }
            },
        }
        os_client.indices.create(index=index_name, body=mapping)
        print(f"✅ Created index: {index_name}")


def index_page(os_client, index_name: str, page_path: str):
    """Index a single page."""
    full_path = page_path if os.path.isabs(page_path) else os.path.join(WORKSPACE, page_path)
    with open(full_path, "r", encoding="utf-8") as f:
        content = f.read()
    # Derive the category from the path; normalize separators so Windows paths match too.
    category = "general"
    normalized = page_path.replace("\\", "/")
    for cat in ["aws", "ai", "projects", "ops"]:
        if f"/{cat}/" in normalized:
            category = cat
            break
    print(f"📄 Generating embedding: {page_path}")
    embedding = get_embedding(content)
    doc = {
        "page": page_path,
        "content": content,
        "embedding": embedding,
        "category": category,
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12).
        "updated_at": datetime.now(timezone.utc).isoformat(),
    }
    os_client.index(index=index_name, id=page_path, body=doc)
    print(f"✅ Indexed: {page_path}")


def search(os_client, index_name: str, query: str, top_k: int = 5):
    """Vector (k-NN) search; returns page, score, category, and a 200-character preview per hit."""
    print(f"🔍 Searching: {query}")
    embedding = get_embedding(query)
    search_body = {
        "size": top_k,
        "query": {
            "knn": {
                "embedding": {
                    "vector": embedding,
                    "k": top_k,
                }
            }
        },
        "_source": ["page", "content", "category"],
    }
    response = os_client.search(index=index_name, body=search_body)
    hits = response["hits"]["hits"]
    results = []
    for hit in hits:
        results.append({
            "page": hit["_source"]["page"],
            "score": hit["_score"],
            "category": hit["_source"].get("category", ""),
            "preview": hit["_source"]["content"][:200],
        })
    return results


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--page", help="path of a single page")
    parser.add_argument("--pages", nargs="+", help="paths of several pages")
    parser.add_argument("--query", help="search query")
    parser.add_argument("--index", required=True, help="OpenSearch index name")
    parser.add_argument("--top-k", type=int, default=5)
    args = parser.parse_args()
    # Fetch the OpenSearch configuration
    config = get_opensearch_config()
    endpoint = config.get("endpoint")  # .get: a missing key reaches the friendly error below
    if not endpoint:
        print(f"❌ OpenSearch endpoint is not configured; update the Secrets Manager secret: {SECRET_NAME}")
        sys.exit(1)
    # Strip the https:// prefix
    host = endpoint.replace("https://", "").rstrip("/")
    from opensearchpy import OpenSearch, RequestsHttpConnection

    # Basic auth (fine-grained access control)
    os_client = OpenSearch(
        hosts=[{"host": host, "port": 443}],
        http_auth=(config["username"], config["password"]),
        use_ssl=True,
        verify_certs=True,
        connection_class=RequestsHttpConnection,
    )
    # Make sure the index exists
    ensure_index(os_client, args.index)
    if args.query:
        results = search(os_client, args.index, args.query, args.top_k)
        print(json.dumps(results, ensure_ascii=False, indent=2))
    elif args.page:
        index_page(os_client, args.index, args.page)
    elif args.pages:
        for page in args.pages:
            index_page(os_client, args.index, page)
    else:
        print("❌ Specify --page, --pages, or --query")
        sys.exit(1)


if __name__ == "__main__":
    main()
FILE:scripts/ingest.md
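# Ingest flow
Run this when the user provides a document, PDF, or URL, or when an answer worth keeping has been produced.
All wiki reads and writes go through Obsidian MCP (see the execution channel section of SKILL.md).
For brevity, the examples below omit `--config`; real calls must include it:
`--config $MCPORTER_CONFIG` (default `~/.openclaw/workspace/config/mcporter.json`)
In the examples, `$WIKI_ROOT` defaults to `~/.openclaw/workspace/wiki` and `$WIKI_SKILL_DIR` defaults to `~/.openclaw/workspace/skills/wikisage`.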
---
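## ⚠️ Mandatory: Step 0 dedup check (always run when the source is a file/URL)
If the ingest source is a **specific file or URL** (PDF, blog post, documentation link), compute the SHA256 and check for duplicates **before reading it**:
```bash
python3 $WIKI_SKILL_DIR/scripts/dedup.py check <file_or_url>
```
- Output `NEW` → continue with Step 1 below
- Output `DUPLICATE` → **stop**, tell the user this content was already ingested (and point to the corresponding wiki page), and ask whether to force a re-ingest
If the source is **conversation content or an ad-hoc conclusion** (no original file/URL), **skip Step 0** and start from Step 1.
After the page is written, record the cache entry between Step 4 and Step 5 (Step 4.5 in the standard flow):
```bash
python3 $WIKI_SKILL_DIR/scripts/dedup.py \
  record <file_or_url> pages/<category>/<slug>.md "source title"
```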
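The Step 0 gate can also be driven from the exit code rather than by parsing the printed text, since `dedup.py check` returns 0 for new content, 1 for a duplicate, and 2 for usage or fetch errors. A minimal sketch follows; `classify_check`, `run_check`, and the action labels are hypothetical names chosen for illustration.

```python
import subprocess
import sys


def classify_check(returncode: int) -> str:
    """Map a `dedup.py check` exit code to the next action in the ingest flow."""
    if returncode == 0:
        return "continue"   # NEW: proceed to Step 1
    if returncode == 1:
        return "ask-user"   # DUPLICATE: stop and ask whether to re-ingest
    return "error"          # usage error or hash/fetch failure (dedup.py returns 2)


def run_check(dedup_script: str, source: str) -> str:
    """Run `dedup.py check <source>` and return the action to take."""
    result = subprocess.run([sys.executable, dedup_script, "check", source])
    return classify_check(result.returncode)
```

Treating every code other than 0 and 1 as an error keeps a failed download from being mistaken for new content.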
---
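## ⚠️ Mandatory: decide first, then choose how to store
### Step 1: read index.md and find related pages
```bash
mcporter call obsidian.read_text_file path=$WIKI_ROOT/index.md
```
Scan the titles and descriptions of all existing pages and decide which page the new content relates to most closely.
### Step 2: decide how to store it
```
New content
│
├── Is there a related page in index.md?
│   │
│   ├── YES: what kind of content is it?
│   │   ├── Extends/supplements an existing concept → update the existing page, append a new section
│   │   ├── Independent subtopic (can stand alone) → create a new page + add a [[link]] on the original page
│   │   ├── Contradicts existing content → flag the contradiction and ask the user which is correct
│   │   └── Exact duplicate → do not store; tell the user the content already exists
│   │
│   └── NO → create a new page
│
└── After storing: update index.md (page count + 1, add the entry)
```
### Step 3: decision criteria in detail
| Situation | Basis for the decision | Action |
|------|---------|------|
| Different angles on the same concept | Same topic keyword (e.g. both are about "NIS 2") | Update the existing page |
| Independent subtopic | Different titles with some overlap (e.g. "NIS 2 action checklist" vs "NIS 2 overview") | Create a new page + cross-link |
| Entirely different topic | No overlap | Create a new page |
| Contradictory content | Two places describe the same fact differently | Ask the user |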
---
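## Standard ingest flow
```
Step 0: Dedup check (dedup.py check; file/URL sources only)
  - NEW → continue
  - DUPLICATE → stop and tell the user it was already ingested
Step 1: Read index.md (MCP read_text_file; decide whether a related page already exists)
Step 2: Depending on the decision:
  - Create → MCP write_file (whole page)
  - Update → MCP edit_file (partial) or write_file (overwrite the whole page)
Step 3: Append to $WIKI_ROOT/log.md (MCP edit_file, append-only; format: ## [YYYY-MM-DD] ingest | title)
Step 4: Update $WIKI_ROOT/index.md (MCP edit_file: add an entry for a new page; revise the description for an update)
Step 4.5: Record the dedup cache (dedup.py record; file/URL sources only)
Step 5: One ingest may touch 5-15 related pages; update the cross-references one by one (MCP edit_file)
Step 6: Tell the user the result (page path + which pages were updated)
```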
### MCP call examples
```bash
# Create a page
mcporter call obsidian.write_file \
  path=$WIKI_ROOT/pages/aws/security-hub.md \
  content='# Security Hub
**最后更新:** 2026-04-25
...'
# Update a page (partial edit)
mcporter call obsidian.edit_file \
  path=$WIKI_ROOT/pages/aws/security-hub.md \
  edits='[{"oldText":"## 相关页面\n- [[A]]","newText":"## 相关页面\n- [[A]]\n- [[B]]"}]'
# Append to log.md (use edit_file to add a line at the end of the file, or read and rewrite the whole file)
```
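Hand-writing the `edits` JSON inside shell quotes is easy to get wrong once the text contains newlines or quotes. The sketch below builds the same payload programmatically; `append_link_edit` is a hypothetical helper, and the `oldText`/`newText` keys are the ones used in the example above.

```python
import json


def append_link_edit(existing_block: str, new_link: str) -> str:
    """Build the `edits` argument that appends one [[link]] bullet to a block."""
    edit = {
        "oldText": existing_block,
        "newText": f"{existing_block}\n- [[{new_link}]]",
    }
    # ensure_ascii=False keeps non-ASCII headings readable in the payload.
    return json.dumps([edit], ensure_ascii=False)
```

The returned string can be passed as the value of `edits=` in the `obsidian.edit_file` call.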
---
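## Page file naming conventions
```
$WIKI_ROOT/pages/
├── aws/         AWS services, compliance, architecture
│   └── sources/ Summaries of source documents (raw sources)
├── ai/          AI/LLM topics
│   └── sources/
├── projects/    Project-related pages
└── ops/         Operations-related pages
```
- File names: lowercase with hyphens, e.g. `security-hub.md`, `nis2-compliance-checklist.md`
- `sources/` holds summaries of source documents; the parent directory holds the compiled knowledge pages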
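The file-name rule can be sketched as a tiny slug function. This is an illustrative assumption about how a title maps to a file name, not the skill's own code: `slugify` is a hypothetical helper, and it only keeps ASCII letters and digits, so a purely non-ASCII title would produce an empty slug and needs a manually chosen name.

```python
import re


def slugify(title: str) -> str:
    """Lowercase the title and join its ASCII letter/digit runs with hyphens."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)
```

For instance, a page titled "Security Hub" would be stored as `slugify("Security Hub") + ".md"`.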
---
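## Page template
The field labels and section headings are literal and stay in Chinese: `最后更新` (last updated), `来源数量` (source count), `分类` (category), `置信度` (confidence); `概述` (overview), `核心内容` (core content), `相关页面` (related pages), `来源` (sources).
```markdown
# 页面标题
**最后更新:** YYYY-MM-DD
**来源数量:** N
**分类:** aws/security <!-- the category path -->
**置信度:** EXTRACTED <!-- EXTRACTED | INFERRED | AMBIGUOUS | UNVERIFIED -->
## 概述
<!-- One paragraph that explains what this topic is. -->
## 核心内容
...
<!-- Paragraph-level inline confidence tags (mandatory on pages with mixed confidence): -->
<!-- [EXTRACTED] taken directly from the source text -->
<!-- [INFERRED] reasoned from the sources -->
<!-- [AMBIGUOUS] the source itself is vague -->
<!-- [UNVERIFIED] background knowledge added by the AI, no source -->
## 相关页面
- [[相关页面名]]
## 来源
- [[原始文档页面名]]
- [外部链接](https://...)
```
**Confidence tagging principles:**
1. The `置信度:` field in the frontmatter is the page-level default; **do not omit it**
2. If part of a page is taken from the source text (EXTRACTED) and part is inferred by the AI (INFERRED), **put an inline tag in front of each paragraph**
3. When a query answer cites INFERRED/UNVERIFIED content, say explicitly that it is inferred
4. See "Confidence tag rules" in SKILL.md for the detailed rules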
---
## Lightweight lint after ingest
```bash
python3 $WIKI_SKILL_DIR/scripts/lint.py --quick
```
This only checks index.md consistency, so that an ingest does not leave the index in a dirty state.
FILE:scripts/lint.md
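# Lint flow (two layers: mechanical scan + LLM cleanup)
Aligned with Karpathy's LLM Wiki pattern: **the LLM is the real linter**; the script only does the mechanical scan and the reminding.
In the examples, `$WIKI_ROOT` defaults to `~/.openclaw/workspace/wiki` and
`$WIKI_SKILL_DIR` defaults to `~/.openclaw/workspace/skills/wikisage`.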
---
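## Layer 1: mechanical scan (lint.py · cron, every Monday 02:00 in the scheduler's local time)
Run by `scripts/lint.py`, which writes a report to `$WIKI_ROOT/.lint-history/YYYY-MM-DD.md`.
### Checks (6 items, aligned with Karpathy's original)
| # | Check | Done by |
|---|---|---|
| 1 | index.md consistency (entry without a file / file without an entry) | Script ✅ |
| 2 | Orphan pages (not referenced by any page via a [[link]]) | Script ✅ |
| 3 | Missing concept pages (a [[link]] with no matching file) | Script ✅ |
| 4 | Missing cross-references (A mentions B but has no [[B]] link) | Script ✅ |
| 5 | Stale pages (not updated for more than 90 days, by mtime) | Script ✅ |
| 6 | Contradictory content / superseded claims / coverage gaps | **LLM** (Layer 2) |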
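Checks 2 and 3 both fall out of one pass over the `[[wikilinks]]`. The sketch below illustrates the idea on an in-memory dictionary; it is not the logic of `lint.py` itself. `scan_links` is a hypothetical function, pages are assumed to be keyed by slug, and links from index.md are deliberately left out.

```python
import re

# Same wikilink shape as the [[link]] syntax used throughout the wiki.
WIKILINK = re.compile(r"\[\[([^\]]+)\]\]")


def scan_links(pages: dict):
    """pages maps slug -> markdown body. Returns (orphans, missing) as sorted lists."""
    linked = set()
    for slug, body in pages.items():
        for target in WIKILINK.findall(body):
            if target != slug:  # a self-link does not rescue an orphan
                linked.add(target)
    orphans = sorted(set(pages) - linked)   # pages nobody links to
    missing = sorted(linked - set(pages))   # link targets with no page
    return orphans, missing
```

Orphans are pages that never appear as a link target; missing concept pages are link targets that never appear as a page.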
### Scheduling (cross-platform)
**Linux / macOS (cron):**
```cron
0 2 * * 1 WIKI_ROOT=$HOME/.openclaw/workspace/wiki python3 $HOME/.openclaw/workspace/skills/wikisage/scripts/lint.py >> $HOME/.openclaw/workspace/wiki/.lint-history/cron.log 2>&1
```
**Windows (Task Scheduler)**: see the *Weekly lint schedule* section of `README.md` (use `python.exe` rather than `python3`).
The script only writes the report and prints to stdout; it does not push notifications. For a push such as "This week's lint: X orphans, Y missing pages…":
- add `--summary` to get a one-line summary
- pipe it from your cron/Task Scheduler job into the tool of your choice (email, Slack webhook, Discord webhook, `openclaw message`, a Feishu custom bot…); examples are in the README
---
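## Layer 2: LLM cleanup (user-triggered · executed by the agent)
**Execution channel**: all Layer 2 reads and writes of wiki files go through **Obsidian MCP** (see the execution channel section of SKILL.md).
The Layer 1 `lint.py` script still reads the filesystem directly from Python, which is fast and has no side effects while scanning.
### Triggers
Any of the following phrases from the user starts Layer 2:
- "整理 wiki" (tidy the wiki)
- "wiki 健康检查" (wiki health check)
- "lint" (without arguments)
- "整理 wiki 矛盾" (resolve wiki contradictions; runs check 6 only)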
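### Execution flow
```
Step 1: Read the latest lint report
  → exec: ls $WIKI_ROOT/.lint-history/*.md | tail -1   (the *.md glob keeps cron.log from sorting last)
  → mcporter call obsidian.read_text_file path=<report file>
  → If no report is found (cron has not run yet): first run python3 scripts/lint.py --no-log by hand
Step 2: Work through each category (in priority order)
  [Orphan pages] usually a link from index.md or another page was forgotten
  → For each orphan:
    - Read the page to see what it contains
    - Decide where it belongs: who should reference it? (index.md certainly should)
    - Ask the user: "I suggest adding a [[orphan]] link on page X, do you agree?"
    - Agreed → edit the target page + index.md
  [Missing concept pages] a [[link]] is referenced but no file exists
  → Sort by number of references (handle the most-referenced first)
  → For each:
    - Look at what the referencing pages say about it
    - Decide: does this concept **have standalone value**?
      - Yes → suggest creating a page (ask the user whether it is needed)
      - No (just a passing mention) → suggest turning it into plain text and removing the [[]]
      - It is an alias of another page → suggest changing it to the correct slug
  [Missing cross-references] A mentions B but has no [[B]]
  → For each: ask "Add a [[B]] link in section Y of page X?"
  → Agreed → insert the link
  [Stale pages]
  → For each:
    - Read the page content
    - Is it actually stale? (only time-sensitive content counts; conceptual content does not)
    - Stale → suggest: update / annotate / delete
    - Ask the user to decide
  [Contradictory content] (Layer 2 only)
  → Scan all pages for descriptions of the same concept/fact
  → Compare them to find contradictions
  → Mark with ⚠️ + ask the user which one is correct
  → Edit the page + update the log
  [Coverage gaps] (Layer 2 only)
  → Scan the wiki's topic coverage for important topics that may be missing
  → Suggest: "Shall I research X and add a page?"
Step 3: After each group of changes, sync the following (all via MCP edit_file / write_file)
  - index.md (pages added or removed)
  - log.md (append ## [date] lint-fix | what was done)
Step 4: Wrap up
  - Run python3 scripts/lint.py --no-log once more to verify
  - Report: N items fixed, M items left unhandled (not urgent)
```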
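### Behavioral principles (important)
1. **Ask item by item, never change in bulk**: every change is confirmed by the user, so the knowledge structure is not damaged
2. **Err on the conservative side**: when unsure, ask; do not act on your own
3. **Link convention**: slugs are always lowercase with hyphens (`obsidian`, `aws-security-hub`), to avoid orphans caused by case differences
4. **Sync index.md with each batch of changes**: this prevents a dirty state if something fails midway
5. **Always keep log.md up to date**: the log is the wiki's timeline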
---
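## Lightweight lint (triggered automatically after ingest)
```bash
python3 $WIKI_SKILL_DIR/scripts/lint.py --quick
```
This only checks index.md consistency, so that an ingest does not leave the index in a dirty state. It can be called as the last step of the ingest flow.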
---
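## Report archive
```
$WIKI_ROOT/.lint-history/
├── 2026-04-18.md   ← today's report
├── 2026-04-20.md   ← produced by cron next Monday
├── 2026-04-27.md
└── cron.log        ← cron run log (stderr ends up here too)
```
If the wiki is synced to a local editor (Obsidian, VS Code, etc.) via Mutagen/git, the user can browse past reports locally as well.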
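Picking the newest report takes a little care, because `cron.log` lives in the same directory and sorts after the dated files. The sketch below filters on the `YYYY-MM-DD.md` naming pattern first; `latest_report` is a hypothetical helper written for illustration, not part of `lint.py`.

```python
import re
from pathlib import Path
from typing import Optional

# Only files named like 2026-04-20.md count as reports; cron.log is ignored.
REPORT_NAME = re.compile(r"^\d{4}-\d{2}-\d{2}\.md$")


def latest_report(history_dir) -> Optional[Path]:
    """Return the newest YYYY-MM-DD.md in history_dir, or None if there is none."""
    history = Path(history_dir)
    if not history.is_dir():
        return None
    reports = [p for p in history.iterdir() if REPORT_NAME.match(p.name)]
    # ISO dates sort chronologically as plain strings.
    return max(reports, key=lambda p: p.name, default=None)
```

A `None` result corresponds to the "cron has not run yet" case, where the scan has to be run by hand first.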
---
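## Run frequency
- Layer 1 (script): automatic, **every Monday 02:00** (scheduler local time)
- Layer 2 (LLM): **user-triggered**; the user decides whether to tidy up after seeing the weekly report notification
- After ingest: **automatic --quick** (lightweight lint to prevent a dirty index)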
FILE:scripts/lint.py
#!/usr/bin/env python3
"""
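Wiki lint script (Karpathy LLM Wiki pattern · Layer 1 mechanical scan)

Layer 1: mechanical scan; writes a report and prints a summary to stdout
Layer 2: LLM-assisted cleanup (triggered when the user says 「整理 wiki」; not this script's job)

Usage:
    python3 lint.py                    # full lint
    python3 lint.py --quick            # light lint (run after ingest)
    python3 lint.py --wiki-root /path  # custom wiki root (or set $WIKI_ROOT)
    python3 lint.py --summary          # print only a one-line summary to stdout (for cron pipes)
    python3 lint.py --no-log           # do not append to log.md (preview mode)

Environment variables:
    WIKI_ROOT    wiki markdown root (default ~/.openclaw/workspace/wiki)

Report output:
    $WIKI_ROOT/.lint-history/YYYY-MM-DD.md   # persisted report
    stdout                                   # full report, or a one-line summary with --summary

Push notifications? Pipe the summary yourself in cron/Task Scheduler:
    python3 lint.py --summary | mail -s 'wiki lint' YOUR_EMAIL_ADDRESS
    python3 lint.py --summary | xargs -I{} openclaw message send --target user:xxx --message {}
    # Slack expects a JSON body, so wrap the line first (this variant needs jq)
    python3 lint.py --summary | jq -Rs '{text: .}' | curl -X POST -H 'Content-Type: application/json' -d @- https://hooks.slack.com/services/...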
"""
import os
import re
import sys
import argparse
from datetime import datetime, timedelta
from pathlib import Path
from collections import defaultdict
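def default_wiki_root() -> Path:
    env = os.environ.get("WIKI_ROOT")
    if env:
        return Path(env).expanduser()
    return Path.home() / ".openclaw/workspace/wiki"


# "Missing cross-reference" heuristic: two closely related pages that do not
# [[link]] to each other may have missed a cross-reference.
# Simplified version: if page A's body mentions page B's title as plain text
# (not wrapped in [[]]), count it as a possible missing cross-reference.
# Pages under the directories below are excluded from that check.
SKIP_MENTION_CHECK_DIRS = {"sources", "raw", ".lint-history"}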
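def find_all_pages(wiki_dir: Path):
    """Return every markdown file under pages/, sorted by path."""
    pages_dir = wiki_dir / "pages"
    if not pages_dir.exists():
        return []
    return sorted(pages_dir.rglob("*.md"))


def extract_title(page_path: Path) -> str:
    """Return the page's first `# heading`, falling back to the file stem."""
    try:
        content = page_path.read_text(encoding="utf-8", errors="ignore")
        m = re.search(r"^#\s+(.+)$", content, re.MULTILINE)
        return m.group(1).strip() if m else page_path.stem
    except OSError:
        return page_path.stem


def extract_links(content: str):
    """Extract all [[link]] targets, dropping any |alias or #heading suffix."""
    targets = []
    for raw in re.findall(r"\[\[([^\]]+)\]\]", content):
        # [[page|shown text]] and [[page#section]] both point at "page"
        target = re.split(r"[|#]", raw, maxsplit=1)[0].strip()
        if target:
            targets.append(target)
    return targets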
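def check_index_consistency(wiki_dir: Path):
    """Compare the [[links]] in index.md with the page files that actually exist."""
    issues = []
    index_file = wiki_dir / "index.md"
    if not index_file.exists():
        return [f"❌ index.md not found: {index_file}"]
    index_content = index_file.read_text(encoding="utf-8", errors="ignore")
    index_links = set(extract_links(index_content))
    actual_pages = {p.stem for p in find_all_pages(wiki_dir)}
    # Entries in index.md with no page file (try the raw link and its slug form).
    # Sorted so the report order is stable between runs.
    for link in sorted(index_links):
        slug = link.replace(" ", "-").lower()
        if link not in actual_pages and slug not in actual_pages:
            issues.append(f" 📋 index.md lists an entry with no file: [[{link}]]")
    # Page files that index.md never mentions
    for page in sorted(actual_pages):
        if page not in index_links and page.replace("-", " ") not in index_links:
            issues.append(f" 📋 File exists but is not listed in index.md: {page}.md")
    return issues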
def check_orphan_pages(wiki_dir: Path):
    """List pages that no [[link]] in any page or in index.md points to."""
    pages = find_all_pages(wiki_dir)
    if not pages:
        return []
    all_links = set()
    for page in pages:
        content = page.read_text(encoding="utf-8", errors="ignore")
        all_links.update(extract_links(content))
    # Links in index.md count as references too
    index_file = wiki_dir / "index.md"
    if index_file.exists():
        all_links.update(extract_links(index_file.read_text(encoding="utf-8", errors="ignore")))
    orphans = []
    for page in pages:
        stem = page.stem
        if stem not in all_links and stem.replace("-", " ") not in all_links:
            rel = page.relative_to(wiki_dir)
            orphans.append(f" - {rel}")
    return orphans
def check_missing_concept_pages(wiki_dir: Path):
    """List [[link]] targets that have no page file, most-referenced first."""
    pages = find_all_pages(wiki_dir)
    actual_pages = {p.stem for p in pages}
    link_refs = defaultdict(list)
    for page in pages:
        content = page.read_text(encoding="utf-8", errors="ignore")
        for link in extract_links(content):
            slug = link.replace(" ", "-").lower()
            if link not in actual_pages and slug not in actual_pages:
                link_refs[link].append(page.stem)
    missing = []
    for link, refs in sorted(link_refs.items(), key=lambda x: -len(x[1])):
        # Show at most three referencing pages per entry
        shown = ", ".join(refs[:3]) + ("..." if len(refs) > 3 else "")
        missing.append(f" - [[{link}]] (referenced by {len(refs)} page(s): {shown})")
    return missing
def check_stale_pages(wiki_dir: Path, days: int = 90):
    """List pages whose file modification time is older than `days` days."""
    pages = find_all_pages(wiki_dir)
    stale = []
    now = datetime.now()
    cutoff = now - timedelta(days=days)
    for page in pages:
        mtime = datetime.fromtimestamp(page.stat().st_mtime)
        if mtime < cutoff:
            rel = page.relative_to(wiki_dir)
            delta = (now - mtime).days
            stale.append(f" - {rel} ({delta} days since last update)")
    return stale
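def check_missing_confidence(wiki_dir: Path):
    """
    Check that each page carries a `**置信度:**` (confidence) field near the top.

    Older pages may lack it, but newly written or updated pages should have it;
    without the tag, the Query flow cannot judge how trustworthy a source is.
    Returns markdown list items, like the other checks.
    """
    pages = find_all_pages(wiki_dir)
    missing = []
    tag_pat = re.compile(r"^\*\*置信度:\*\*", re.MULTILINE)
    for page in pages:
        rel = page.relative_to(wiki_dir)
        # Skip summary pages under sources/ and raw/ (excerpts, treated as EXTRACTED).
        # Match on the path relative to the wiki root, so that a parent directory
        # that happens to be named "raw" or "sources" does not skip every page.
        if any(seg in SKIP_MENTION_CHECK_DIRS for seg in rel.parts):
            continue
        try:
            # Only the first 1024 characters are inspected
            head = page.read_text(encoding="utf-8", errors="replace")[:1024]
        except OSError:
            continue
        if not tag_pat.search(head):
            missing.append(f" - {rel}")
    return missing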
def check_missing_cross_refs(wiki_dir: Path):
    """
    Flag possibly missing cross-references:
    page A's body mentions page B's full title as plain text (not inside [[]]),
    but A contains no [[B]] link, so a cross-reference may be missing.
    """
    pages = find_all_pages(wiki_dir)
    # Build title <-> page mappings
    title_to_page = {}
    page_to_title = {}
    for p in pages:
        # Skip pages under sources/raw: they are summaries, not concept hubs
        rel = p.relative_to(wiki_dir)
        if any(part in SKIP_MENTION_CHECK_DIRS for part in rel.parts):
            continue
        title = extract_title(p)
        # Very short titles (< 4 characters) cause false positives; skip them
        if len(title) < 4:
            continue
        title_to_page[title] = p
        page_to_title[p] = title
    suggestions = []
    for page in page_to_title:
        content = page.read_text(encoding="utf-8", errors="ignore")
        # Remove existing [[links]]; what remains are plain-text mentions
        stripped = re.sub(r"\[\[[^\]]+\]\]", "", content)
        existing_links_normalized = {l.lower() for l in extract_links(content)}
        for other_title, other_page in title_to_page.items():
            if other_page == page:
                continue
            # This page mentions other_title as plain text...
            if other_title in stripped:
                # ...but none of its [[links]] names other_page's stem or title.
                # Both sides are lowercased, since the link set is lowercased.
                if (other_page.stem.lower() not in existing_links_normalized
                        and other_title.lower() not in existing_links_normalized):
                    rel = page.relative_to(wiki_dir)
                    suggestions.append(
                        f" - {rel} mentions '{other_title}' but has no [[{other_page.stem}]] link"
                    )
    # Deduplicate, then cap at 30 entries to keep the report short
    return sorted(set(suggestions))[:30]
def write_report_file(wiki_dir: Path, report_md: str, now_date: str) -> Path:
    """Persist the report to wiki/.lint-history/YYYY-MM-DD.md."""
    history_dir = wiki_dir / ".lint-history"
    history_dir.mkdir(exist_ok=True)
    report_file = history_dir / f"{now_date}.md"
    # Explicit UTF-8: the report contains emoji and may be written on Windows
    report_file.write_text(report_md, encoding="utf-8")
    return report_file
def build_summary(wiki_dir: Path, now_date: str, stats: dict) -> str:
    """Build the one-line alert summary used by --summary and external pipes."""
    total_issues = (
        stats.get("index_issues", 0)
        + stats.get("orphans", 0)
        + stats.get("missing_concepts", 0)
        + stats.get("missing_cross_refs", 0)
        + stats.get("stale", 0)
        + stats.get("missing_confidence", 0)
    )
    report_path = f"{wiki_dir}/.lint-history/{now_date}.md"
    if total_issues == 0:
        return f"📚 Wiki Lint: ✅ 0 issues | report: {report_path}"
    return (
        f"📚 Wiki Lint: {total_issues} issues "
        f"(index:{stats.get('index_issues', 0)} "
        f"orphans:{stats.get('orphans', 0)} "
        f"missing-concepts:{stats.get('missing_concepts', 0)} "
        f"missing-xref:{stats.get('missing_cross_refs', 0)} "
        f"stale:{stats.get('stale', 0)} "
        f"no-confidence:{stats.get('missing_confidence', 0)}) "
        f"| report: {report_path}"
    )
def run_lint(wiki_dir: Path, quick: bool = False, write_log: bool = True, summary_only: bool = False):
    now = datetime.now().strftime("%Y-%m-%d %H:%M")
    now_date = datetime.now().strftime("%Y-%m-%d")
    report_lines = [f"# Wiki Lint report: {wiki_dir} ({now})\n"]
    # Under --summary, stdout must carry exactly one line, so skip the banner
    if not summary_only:
        print(f"🔍 Lint: {wiki_dir} ({'quick mode' if quick else 'full mode'})")
    # 1. index.md consistency
    index_issues = check_index_consistency(wiki_dir)
    report_lines.append("## 📋 index.md consistency")
    if index_issues:
        report_lines.extend(index_issues)
    else:
        report_lines.append(" ✅ No issues")
    report_lines.append("")
    # Counters for the summary line and log.md
    stats = {"index_issues": len(index_issues)}
    if not quick:
        # 2. Orphan pages
        orphans = check_orphan_pages(wiki_dir)
        stats["orphans"] = len(orphans)
        report_lines.append(f"## ⚠️ Orphan pages ({len(orphans)})")
        report_lines.extend(orphans if orphans else [" ✅ No orphan pages"])
        report_lines.append("")
        # 3. Missing concept pages
        missing = check_missing_concept_pages(wiki_dir)
        stats["missing_concepts"] = len(missing)
        report_lines.append(f"## 🔗 Missing concept pages ({len(missing)})")
        report_lines.extend(missing if missing else [" ✅ No missing concept pages"])
        report_lines.append("")
        # 4. Missing cross-references
        cross_refs = check_missing_cross_refs(wiki_dir)
        stats["missing_cross_refs"] = len(cross_refs)
        report_lines.append(f"## 🔀 Possibly missing cross-references ({len(cross_refs)})")
        report_lines.append(" _Rule: page A's body mentions page B's title without a [[B]] link_")
        report_lines.extend(cross_refs if cross_refs else [" ✅ None"])
        report_lines.append("")
        # 5. Stale content
        stale = check_stale_pages(wiki_dir)
        stats["stale"] = len(stale)
        report_lines.append(f"## 📅 Stale pages ({len(stale)}, not updated in over 90 days)")
        report_lines.extend(stale if stale else [" ✅ No stale pages"])
        report_lines.append("")
        # 5.5 Missing confidence tag
        no_conf = check_missing_confidence(wiki_dir)
        stats["missing_confidence"] = len(no_conf)
        report_lines.append(f"## 🏷️ Missing confidence tag ({len(no_conf)})")
        report_lines.append(" _Rule: each page should carry a `**置信度:**` field (EXTRACTED/INFERRED/AMBIGUOUS/UNVERIFIED)_")
        report_lines.extend(no_conf if no_conf else [" ✅ Every page has a confidence tag"])
        report_lines.append("")
        # 6. Contradictions / gaps: these need the LLM (Layer 2)
        report_lines.append("## 💡 Needs LLM judgment (Layer 2)")
        report_lines.append(" - Contradictions (the same fact described inconsistently across pages)")
        report_lines.append(" - Outdated claims (a newer source overturns an older statement)")
        report_lines.append(" - Data gaps (topics worth searching the web for)")
        report_lines.append(" → Say 「整理 wiki」 in chat to have the LLM go through them one by one")
        report_lines.append("")
    report = "\n".join(report_lines)
    if not summary_only:
        print(report)
    # Write the report file
    report_file = write_report_file(wiki_dir, report, now_date)
    if not summary_only:
        print(f"\n📄 Report saved: {report_file.relative_to(wiki_dir)}")
    # Append to log.md
    if write_log:
        log_file = wiki_dir / "log.md"
        mode = "quick" if quick else "full"
        with open(log_file, "a", encoding="utf-8") as f:
            f.write(f"\n## [{now_date}] lint | {mode} lint\n\n")
            f.write(f"- Mode: {mode}\n")
            f.write(f"- Report: `.lint-history/{now_date}.md`\n")
            if not quick:
                f.write(f"- Orphan pages: {stats['orphans']}\n")
                f.write(f"- Missing concept pages: {stats['missing_concepts']}\n")
                f.write(f"- Missing cross-references: {stats['missing_cross_refs']}\n")
                f.write(f"- Stale pages: {stats['stale']}\n")
                f.write(f"- Missing confidence tag: {stats['missing_confidence']}\n")
            f.write("\n")
    # --summary: print a single line to stdout for external pipes
    if summary_only:
        print(build_summary(wiki_dir, now_date, stats))
    return report, stats
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--quick", action="store_true", help="light mode (check index.md only)")
    parser.add_argument("--wiki-root", default=None, help="wiki root directory (default: $WIKI_ROOT or ~/.openclaw/workspace/wiki)")
    parser.add_argument("--summary", action="store_true", help="print only a one-line summary to stdout (for cron/Scheduler pipes to email/chat/webhook)")
    parser.add_argument("--no-log", action="store_true", help="do not append to log.md (preview mode)")
    args = parser.parse_args()
    wiki_dir = Path(args.wiki_root).expanduser() if args.wiki_root else default_wiki_root()
    if not wiki_dir.exists():
        print(f"❌ wiki directory not found: {wiki_dir}", file=sys.stderr)
        print("   Hint: set $WIKI_ROOT or pass --wiki-root", file=sys.stderr)
        sys.exit(2)
    run_lint(
        wiki_dir=wiki_dir,
        quick=args.quick,
        write_log=not args.no_log,
        summary_only=args.summary,
    )
FILE:scripts/query.md
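# Query flow
Run this flow when the user asks a technical question or explicitly says "查 wiki" (check the wiki).
## ⚠️ Mandatory order: wiki → MCP → LLM
All wiki reads go through Obsidian MCP (see the execution channel section in SKILL.md).
The examples below are abbreviated and omit `--config`; real calls must include it:
`--config $MCPORTER_CONFIG` (default `~/.openclaw/workspace/config/mcporter.json`)
`$WIKI_ROOT` in the examples defaults to `~/.openclaw/workspace/wiki`.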
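### Step 1: Read wiki/index.md
```bash
mcporter call obsidian.read_text_file path=$WIKI_ROOT/index.md
```
→ Scan all page titles and descriptions for relevant pages
→ Found → read the relevant pages in full (Step 2 below) → synthesize the answer
→ Not found → go to Step 3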
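### Step 2: Read specific pages
```bash
mcporter call obsidian.read_text_file \
  path=$WIKI_ROOT/pages/<category>/<slug>.md
```
Read several pages if needed and combine them one by one.
**For fuzzy full-text search** (MCP can only glob file names):
```bash
# Preferred: qmd-search (BM25) over the workspace collection
# Fallback: plain grep
exec grep -rn "keyword" $WIKI_ROOT/pages/
```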
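### Step 3: Query external MCP / search (optional, as needed)
If the local wiki has nothing, query external sources that fit the topic (AWS documentation, pricing, Tavily search, and so on).
Which MCP servers are available depends on what the user has configured in `$MCPORTER_CONFIG`:
```bash
# Example: AWS documentation (if aws-kb is configured)
mcporter call 'aws-kb.aws___search_documentation(search_phrase: "keyword")'
# Example: AWS pricing (if aws-pricing is configured)
mcporter call 'aws-pricing.get_aws_pricing(service_code: "...", region: "us-east-1")'
# Example: web search (if tavily is configured)
mcporter call tavily.search query="keyword"
```
→ Results found → answer from the MCP results and attach reference links
→ No results → the LLM answers directly (last resort)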
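The wiki → MCP → LLM order amounts to a first-hit-wins fallback chain. A minimal sketch of that control flow, assuming each source is a callable that returns text or `None`; the function name and the stand-in lookups are hypothetical, and real lookups would call `mcporter` or the model:

```python
from typing import Callable, Optional, Sequence, Tuple

# Each source is a (label, lookup) pair; lookup returns None when it has nothing.
Source = Tuple[str, Callable[[str], Optional[str]]]


def answer_in_order(question: str, sources: Sequence[Source]) -> Tuple[str, str]:
    """Try each source in order and return (label, answer) from the first hit."""
    for label, lookup in sources:
        result = lookup(question)
        if result:
            return label, result
    raise LookupError("no source produced an answer")


# Stand-in lookups for illustration only
sources = [
    ("wiki", lambda q: None),                     # nothing relevant in index.md
    ("mcp", lambda q: f"docs result for {q!r}"),  # external MCP search hit
    ("llm", lambda q: "model answer"),            # final fallback
]
print(answer_in_order("S3 pricing", sources))  # ('mcp', "docs result for 'S3 pricing'")
```

Returning the label alongside the answer makes it easy to cite the right kind of source afterwards.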
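### Step 4: Compose the answer
Answer from the wiki or MCP content and cite the source at the end:
- Wiki source: `> Reference: [[page name]]`
- MCP source: `> Reference: [AWS documentation link]`
**Confidence transparency (mandatory):**
- When reading a page, note the `置信度:` (confidence) field in the frontmatter and any inline tags in the body ([EXTRACTED] / [INFERRED] / [AMBIGUOUS] / [UNVERIFIED])
- If the answer draws on `INFERRED` / `UNVERIFIED` / `AMBIGUOUS` content, **say so explicitly in the answer**:
  - INFERRED → "This point is inferred (the source only says…)"
  - UNVERIFIED → "This is general knowledge I added; it is not in the wiki sources"
  - AMBIGUOUS → "The source is vague here; please check the details against the original"
- If everything is EXTRACTED, no special note is needed (by default, content is taken straight from the source)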
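The disclosure rule can be reduced to a lookup from inline tags to required notes. A minimal sketch, assuming tags appear in the page body exactly as `[INFERRED]`, `[UNVERIFIED]`, and so on; the function name and the English wording of each note are illustrative, not fixed by the skill:

```python
import re
from typing import List

# Tag names come from the rules above; the note wording is illustrative.
DISCLOSURES = {
    "INFERRED": "This point is inferred; the source does not state it directly.",
    "UNVERIFIED": "This is general knowledge I added; it is not in the wiki sources.",
    "AMBIGUOUS": "The source is vague here; please check it against the original.",
}
TAG = re.compile(r"\[(EXTRACTED|INFERRED|AMBIGUOUS|UNVERIFIED)\]")


def required_disclosures(page_text: str) -> List[str]:
    """Return the notes an answer must include, one per non-EXTRACTED tag found."""
    found = set(TAG.findall(page_text))
    # Iterate DISCLOSURES rather than the set so the output order is stable.
    return [note for tag, note in DISCLOSURES.items() if tag in found]


sample = "Limit is 5 GB [EXTRACTED]. Probably per region [INFERRED]."
print(required_disclosures(sample))
```

A page tagged only `[EXTRACTED]` yields an empty list, which matches the rule that no special note is needed in that case.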
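### Step 5: Ask whether to save to the wiki
If the answer is valuable (new knowledge, customer information, a decision record), ask the user:
> "Should this answer be saved to the wiki?"
If yes, create a new page through MCP:
```bash
mcporter call obsidian.write_file \
  path=$WIKI_ROOT/pages/<category>/queries/<date>-<slug>.md \
  content='...'
```
Then run the ingest flow to update index.md and log.md.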
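The target path for a saved answer can be assembled from its parts. A minimal sketch, assuming `<date>` means an ISO `YYYY-MM-DD` date and `<slug>` is the lowercased, hyphenated title; both are assumptions, since the flow does not spell out either format, and `query_page_path` is a hypothetical helper:

```python
import re
from datetime import date
from pathlib import Path


def query_page_path(wiki_root: Path, category: str, title: str, day: date) -> Path:
    """Build pages/<category>/queries/<date>-<slug>.md under the wiki root."""
    # Assumed slug rule: lowercase, whitespace/underscores collapsed to hyphens.
    slug = re.sub(r"[\s_]+", "-", title.strip().lower()).strip("-")
    return wiki_root / "pages" / category / "queries" / f"{day.isoformat()}-{slug}.md"


path = query_page_path(Path("/tmp/wiki"), "aws", "S3 Pricing Tiers", date(2026, 4, 18))
print(path.as_posix())  # /tmp/wiki/pages/aws/queries/2026-04-18-s3-pricing-tiers.md
```

The resulting string is what would be passed as `path=` to `obsidian.write_file`.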