HarryZhu

@clawhub-harryzsh-88cbe715e7
Wikisage
Skill
---
name: wikisage
description: "A Karpathy-style persistent LLM wiki. Use when: (1) user says '加进wiki/ingest/摄入', (2) user says '查wiki/wiki里有没有', (3) user says '整理wiki/lint', (4) answering questions that should check long-lived local knowledge first. Also use after answering valuable technical questions to ask if user wants to save to wiki."
metadata:
---

# Wikisage Skill

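A persistent wiki based on Karpathy's llm-wiki pattern.
The LLM writes and maintains all content; the user supplies sources, exploration direction, and questions.
Plain local markdown files, navigated through index.md; no vector database required.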

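## 📍 Path conventions (environment-variable driven)

Every path in this skill is derived from environment variables; nothing is hard-coded:

| Variable | Default | Purpose |
|------|--------|------|
| `WIKI_ROOT` | `$HOME/.openclaw/workspace/wiki` | Root directory of the wiki markdown files |
| `MCPORTER_CONFIG` | `$HOME/.openclaw/workspace/config/mcporter.json` | mcporter config file (optional) |
| `WIKI_SKILL_DIR` | `$HOME/.openclaw/workspace/skills/wikisage` | The skill's own directory (where the scripts live) |

On first deployment, export these three variables in your shell/agent environment (or rely on the defaults).
The examples below write `$WIKI_ROOT` and similar variables instead of absolute paths.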
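
The lookup order described above (environment variable first, home-relative default otherwise) can be sketched in Python as follows. This is a minimal illustration modelled on how `dedup.py` resolves `WIKI_ROOT`; the `DEFAULTS` dict and the `resolve_paths` helper are hypothetical names, not part of the skill.

```python
import os
from pathlib import Path

# Defaults copied from the table above, relative to the user's home directory.
DEFAULTS = {
    "WIKI_ROOT": ".openclaw/workspace/wiki",
    "MCPORTER_CONFIG": ".openclaw/workspace/config/mcporter.json",
    "WIKI_SKILL_DIR": ".openclaw/workspace/skills/wikisage",
}


def resolve_paths(env=None):
    """Return {variable: Path}, preferring the environment over the defaults."""
    env = os.environ if env is None else env
    resolved = {}
    for name, default in DEFAULTS.items():
        value = env.get(name)
        # An unset or empty variable falls back to the home-relative default.
        resolved[name] = Path(value).expanduser() if value else Path.home() / default
    return resolved
```

Calling `resolve_paths()` with no argument reads the real process environment; passing a dict makes the behaviour easy to check without touching `os.environ`.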

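## 🛠 Execution channel: Obsidian MCP (preferred, strongly recommended)

> **This skill is designed around an Obsidian filesystem MCP server.** It still runs without MCP (via the `read`/`write`/`edit` fallback), but it is more robust with it: the allowed-dir boundary acts as a safety net, errors are more structured, and the LLM cannot accidentally write outside the wiki.

**All wiki file reads and writes go through the Obsidian filesystem MCP first**, not the generic `read`/`write` tools.

| Operation | MCP call |
|------|----------|
| Read a file | `mcporter call obsidian.read_text_file path=<abs path>` |
| Write/overwrite a file | `mcporter call obsidian.write_file path=<abs> content=<str>` |
| List a directory | `mcporter call obsidian.list_directory path=<abs>` |
| Search file names | `mcporter call obsidian.search_files path=<abs> pattern=<glob>` |
| Edit a file | `mcporter call obsidian.edit_file path=<abs> edits=...` |
| Show the boundary | `mcporter call obsidian.list_allowed_directories` |

**Every call needs `--config $MCPORTER_CONFIG`.**
(mcporter has a dual-config pitfall: it reads both `~/.claude.json` and the project config, and without `--config` it only sees the servers defined in claude.json.)

**Fallback**: when MCP is unavailable (daemon down, server not healthy), fall back to the generic `read`/`write`/`edit`/`exec grep` tools, and tell the user in the reply that "MCP is offline, using fallback".

**Full-text search does not go through MCP**: the MCP search only matches file names. To search content, use:
- qmd-search (workspace collection, BM25; fast, but the index may lag behind)
- `exec grep -rn "keyword" $WIKI_ROOT/`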
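
Where a shell `grep` is not available (for example on a bare Windows host), the same content search can be approximated in Python. The sketch below is an illustration only: `grep_wiki` is a hypothetical helper, and unlike `grep -rn` it deliberately looks at `*.md` files only.

```python
from pathlib import Path


def grep_wiki(root, keyword):
    """Yield (relative_path, line_number, line) for each markdown line containing keyword."""
    root = Path(root)
    for md_file in sorted(root.rglob("*.md")):
        # errors="replace" keeps the scan going if a file is not valid UTF-8.
        text = md_file.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if keyword in line:
                yield md_file.relative_to(root).as_posix(), number, line.strip()
```

The match is a plain case-sensitive substring test, which is the closest equivalent to the `grep` call above without any flags.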

---

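## Triggers

| User says | Action |
|---|---|
| "加进 wiki" (add to wiki) / "ingest" / "摄入这篇" (ingest this) | → ingest flow |
| "查 wiki" (check the wiki) / "wiki 里有没有" (is it in the wiki) / "从 wiki 查" (look it up in the wiki) | → query flow |
| "整理 wiki" (tidy the wiki) / "wiki 健康检查" (wiki health check) / "lint" | → lint flow |
| A technical question involving **clients, past decisions, or account information** | → check the local wiki first, then answer |
| A general technical question (no specific context) | → go straight to MCP → LLM |
| After answering a valuable technical question | → ask "Do you want to save this to the wiki?" |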

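## Three-layer architecture

```
$WIKI_ROOT/
├── raw/                  Source documents (read-only; added by the user, never modified by the LLM)
├── pages/                Markdown files generated and maintained by the LLM
│   ├── aws/              AWS services, architecture, compliance
│   ├── ai/               AI/LLM technology
│   ├── clients/          Client information (accounts, contacts, projects)
│   ├── projects/         Individual projects
│   └── ops/              Operations, kubectl, DevOps
├── index.md              Directory of all pages (title + one-line description + path), updated after every ingest
├── log.md                Operation log (append-only; format: ## [YYYY-MM-DD] ingest | title)
└── .ingest-cache.json    SHA256 dedup cache (maintained by dedup.py; not part of the Obsidian vault)
```

**There is only one wiki directory:** `$WIKI_ROOT` (which is also the Obsidian MCP allowed dir)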

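## Query flow

See `scripts/query.md` for details.

Core logic:
1. Use `obsidian.read_text_file` to read `$WIKI_ROOT/index.md` and find the relevant pages
2. Use `obsidian.read_text_file` to read those pages in full, synthesize an answer, and cite sources as `> 参考：[[page name]]`
3. If the answer itself is valuable → ask the user whether to save it back to the wiki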

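## Ingest flow

See `scripts/ingest.md` for details.

Core logic:
0. Run `dedup.py check` (when the source is a file/URL) → stop on DUPLICATE
1. Use `obsidian.read_text_file` to read index.md and decide whether a related page already exists
2. Use `obsidian.write_file` / `obsidian.edit_file` to create or update pages (a single ingest may touch 5-15 pages)
3. Use `obsidian.edit_file` to update index.md
4. Use `obsidian.edit_file` to append to log.md (`## [YYYY-MM-DD] ingest | source title`)
5. Run `dedup.py record` to store the SHA256 in the cache (when the source is a file/URL)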

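## Lint flow

See `scripts/lint.md` for details.

Checks: orphan pages, missing concept pages, index.md inconsistencies, contradictory content, stale content
(the lint.py script reads the filesystem directly from Python and does not go through MCP; the LLM's Layer 2 fixes do go through MCP)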

## Page template

```markdown
# 页面标题

**最后更新：** YYYY-MM-DD
**来源数量：** N
**分类：** aws/security
**置信度：** EXTRACTED  <!-- page-wide default; can be overridden per paragraph -->

## 概述

## 核心内容

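<!-- Confidence can be marked at paragraph/sentence level with an inline tag: -->
<!-- [EXTRACTED] a fact taken directly from the source text -->
<!-- [INFERRED]  a conclusion reasoned from the sources -->
<!-- [AMBIGUOUS] the source itself is vague -->
<!-- [UNVERIFIED] common knowledge/background added by the AI, not verified against a source -->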

## 相关页面
- [[相关页面名]]

## 来源
- [[原始文档页面名]]
- [外部链接](https://...)
```

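### Confidence tag rules (mandatory)

| Tag | Meaning | When to use |
|-----|------|-----------|
| `EXTRACTED`  | A fact taken directly from the source text | Pricing, API parameters, official wording |
| `INFERRED`   | Derived by reasoning over or combining sources | "So the monthly cost is about $80" (the source only gave the unit price) |
| `AMBIGUOUS`  | The source itself is unclear | The document contradicts itself or is vaguely worded |
| `UNVERIFIED` | Background knowledge added by the AI, with no source | General descriptions added to make the page read smoothly |

**Principles:**
- The page-wide default confidence goes in the frontmatter; **do not omit it**
- If a page mixes content of different confidence levels, **mark it with an inline tag at the start of the paragraph or end of the sentence**
- If a query answer cites `INFERRED` / `UNVERIFIED` content, **say so explicitly in the answer** ("this point is inferred")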
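
The third principle can be checked mechanically before an answer is sent. The following is a rough sketch, assuming inline tags are written exactly as `[INFERRED]`, `[UNVERIFIED]`, and so on; `tags_to_disclose` is a hypothetical helper, and it ignores the page-wide default in the frontmatter.

```python
import re

# Inline tags as listed in the table above, e.g. "[INFERRED] monthly cost is about $80".
TAG_PATTERN = re.compile(r"\[(EXTRACTED|INFERRED|AMBIGUOUS|UNVERIFIED)\]")
NEEDS_DISCLOSURE = {"INFERRED", "UNVERIFIED"}


def tags_to_disclose(text):
    """Return the sorted tags found in `text` that an answer must mention explicitly."""
    found = set(TAG_PATTERN.findall(text))
    return sorted(found & NEEDS_DISCLOSURE)
```

For a quoted passage that mixes an `[EXTRACTED]` price with an `[INFERRED]` monthly total, only the inferred part needs the explicit caveat in the reply.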

## log.md format

Each record has the form `## [YYYY-MM-DD] {operation} | {title}`

```
## [2026-04-09] ingest | Karpathy llm-wiki 模式
## [2026-04-09] query | S3 Files POSIX 访问方案
## [2026-04-09] lint | 全库健康检查
```

Use `grep "^## \[" $WIKI_ROOT/log.md | tail -10` to see the most recent operations.
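
The record format is regular enough to build and parse with a few lines of Python. This is an illustrative sketch only; `format_entry` and `recent_entries` are hypothetical helpers, and the regular expression assumes the operation is a single word such as `ingest`, `query`, or `lint`.

```python
import re
from datetime import date

# Matches "## [YYYY-MM-DD] operation | title".
LOG_LINE = re.compile(r"^## \[(\d{4}-\d{2}-\d{2})\] (\w+) \| (.+)$")


def format_entry(operation, title, day=None):
    """Build one log.md record; `day` defaults to today."""
    day = day or date.today()
    return f"## [{day.isoformat()}] {operation} | {title}"


def recent_entries(log_text, limit=10):
    """Return the last `limit` records as (date, operation, title) tuples."""
    matches = (LOG_LINE.match(line) for line in log_text.splitlines())
    entries = [m.groups() for m in matches if m]
    return entries[-limit:]
```

`recent_entries` mirrors the `grep ... | tail -10` one-liner: lines that do not match the record format are skipped.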

FILE:README.md
# wikisage

A **Karpathy-style LLM Wiki** packaged as an [AgentSkill](https://github.com/openclaw/openclaw) for
[OpenClaw](https://openclaw.ai) / Claude Code / any skill-aware agent.

> Persistent, plain-markdown knowledge base where **the LLM writes and maintains all content**,
> and the user supplies sources, exploration direction, and questions.
> No vector database — an `index.md` plus Obsidian-style `[[wikilinks]]` is enough.

Inspired by Andrej Karpathy's "LLM wiki" pattern.

---

## ✨ Features

- **Three-layer structure**: `raw/` (sources) → `pages/` (LLM-maintained knowledge) → `index.md` (navigation)
- **Confidence tagging**: every page declares `EXTRACTED` / `INFERRED` / `AMBIGUOUS` / `UNVERIFIED` at both frontmatter and paragraph level — the LLM must surface this in answers
- **SHA256 dedup** for ingest sources (files / URLs), so you never re-index the same PDF twice
- **Two-layer lint**:
  - *Layer 1* — `lint.py` (mechanical scan: orphans, missing concept pages, stale pages, missing cross-refs, missing confidence tags, index consistency)
  - *Layer 2* — LLM walks the report and fixes issues interactively via MCP
- **Obsidian MCP first**: all reads/writes prefer the filesystem-sandboxed Obsidian MCP server, with `read`/`write`/`edit` fallback
- **Everything is logged**: `log.md` is an append-only timeline of every ingest / query / lint operation
- **Cross-platform**: Linux, macOS, Windows — pure `pathlib`, no POSIX-only calls

---

## 🖥️ Platform support

Works on **Linux, macOS, and Windows**. All scripts use `pathlib` + `os.path.expanduser("~")`,
so `~` resolves correctly everywhere (`/home/you` on Linux, `/Users/you` on macOS,
`C:\Users\you` on Windows). There are no POSIX-only syscalls.

Only the *shell one-liners* in this README differ per OS — see platform-specific blocks below.

> **Heads-up**: this skill relies on an **Obsidian filesystem MCP server** as its primary
> read/write channel. It falls back to `read`/`write`/`edit` tools if MCP isn't wired up, but
> you get meaningfully better behavior (sandboxing, structured errors) with it. See
> [Dependencies](#-dependencies) below.

## 📦 Install

### As an OpenClaw skill

**Linux / macOS:**
```bash
git clone https://github.com/harryzsh/wikisage \
  ~/.openclaw/workspace/skills/wikisage
```

**Windows (PowerShell):**
```powershell
git clone https://github.com/harryzsh/wikisage `
  "$HOME\.openclaw\workspace\skills\wikisage"
```

That's it — OpenClaw auto-discovers skills at startup.

### As a Claude Code skill

**Linux / macOS:**
```bash
git clone https://github.com/harryzsh/wikisage ~/.claude/skills/wikisage
```

**Windows (PowerShell):**
```powershell
git clone https://github.com/harryzsh/wikisage "$HOME\.claude\skills\wikisage"
```

### As a generic agent skill

Copy the folder into whatever directory your agent scans for skills, or point the agent at
`SKILL.md` directly.

---

## ⚙️ Configuration

All paths are driven by environment variables with safe defaults:

| Variable | Default | Purpose |
|----------|---------|---------|
| `WIKI_ROOT` | `$HOME/.openclaw/workspace/wiki` | Where the markdown wiki lives |
| `WIKI_SKILL_DIR` | `$HOME/.openclaw/workspace/skills/wikisage` | Where this skill is installed (scripts referenced by SKILL.md) |
| `MCPORTER_CONFIG` | `$HOME/.openclaw/workspace/config/mcporter.json` | Optional — path to your [mcporter](https://github.com/CrazyPython/mcporter) config (for the Obsidian MCP server) |
| `AWS_REGION` / `WIKI_EMBED_SECRET` | `us-east-1` / `wikisage/opensearch` | Only used by the optional `embed.py` (see below) |

> The skill itself is **channel-agnostic**. It does not push notifications anywhere. If you
> want weekly lint reports delivered to chat/email/a webhook, pipe `lint.py --summary` from
> your scheduler — see [Weekly lint schedule](#-weekly-lint-schedule) for examples.

Set the environment variables once in your shell profile, agent env, or cron line.

**Linux / macOS (bash/zsh):**
```bash
export WIKI_ROOT="$HOME/my-wiki"
export WIKI_SKILL_DIR="$HOME/.openclaw/workspace/skills/wikisage"
```

**Windows (PowerShell, current session):**
```powershell
$env:WIKI_ROOT = "$HOME\my-wiki"
$env:WIKI_SKILL_DIR = "$HOME\.openclaw\workspace\skills\wikisage"
```

**Windows (persistent, user-level):**
```powershell
[Environment]::SetEnvironmentVariable("WIKI_ROOT", "$HOME\my-wiki", "User")
[Environment]::SetEnvironmentVariable("WIKI_SKILL_DIR", "$HOME\.openclaw\workspace\skills\wikisage", "User")
```

> **Note for Windows users**: defaults like `~/.openclaw/workspace/wiki` resolve to
> `C:\Users\<you>\.openclaw\workspace\wiki`. If you prefer a more Windows-native location
> (e.g. `%USERPROFILE%\Documents\wiki`), just set `WIKI_ROOT` explicitly.

---

## 🗂 Initial wiki layout

After install, create the empty skeleton (or let the first ingest create it).

**Linux / macOS:**
```bash
mkdir -p "$WIKI_ROOT"/{raw,pages/{aws,ai,clients,projects,ops},.lint-history}
cat > "$WIKI_ROOT/index.md" <<'EOF'
# Wiki Index

_Pages auto-listed here by the LLM after each ingest._
EOF
touch "$WIKI_ROOT/log.md"
```

**Windows (PowerShell):**
```powershell
$root = $env:WIKI_ROOT
"raw","pages\aws","pages\ai","pages\clients","pages\projects","pages\ops",".lint-history" |
  ForEach-Object { New-Item -ItemType Directory -Force -Path "$root\$_" | Out-Null }
"# Wiki Index`n`n_Pages auto-listed here by the LLM after each ingest._" |
  Set-Content -Path "$root\index.md" -Encoding UTF8
New-Item -ItemType File -Force -Path "$root\log.md" | Out-Null
```

---

## 🔌 Dependencies

### Required
- **Python ≥ 3.9** (stdlib only for `lint.py` / `dedup.py` — no `pip install` needed)
- **Git** (to clone this repo)

### Strongly recommended: Obsidian filesystem MCP server

This skill is **designed around an Obsidian-style filesystem MCP server** as its primary
read/write channel. All operating rules in [`SKILL.md`](./SKILL.md) assume the LLM can call
`obsidian.read_text_file`, `obsidian.write_file`, `obsidian.edit_file`, `obsidian.list_directory`,
`obsidian.search_files`, and `obsidian.list_allowed_directories`.

**Why it matters:**
- Sandboxes all writes inside `$WIKI_ROOT` (allowed-dir enforcement) — the LLM can't accidentally
  touch files outside the wiki.
- Gives structured errors the LLM can reason about, instead of raw shell failures.
- Matches the Obsidian editor's view if you also open the same directory in Obsidian desktop
  (with any filesystem-based sync plugin) — works fine on Windows, macOS, Linux.

**How to wire it up via [mcporter](https://github.com/CrazyPython/mcporter):**

Add this to your `mcporter.json` (path defaults to `~/.openclaw/workspace/config/mcporter.json`,
or wherever `$MCPORTER_CONFIG` points):

```json
{
  "servers": {
    "obsidian": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "<absolute path to $WIKI_ROOT>"]
    }
  }
}
```

Replace `<absolute path to $WIKI_ROOT>` with the real path — most MCP launchers don't expand
environment variables inside the `args` array. Examples:
- Linux/macOS: `/home/you/.openclaw/workspace/wiki` or `/Users/you/wiki`
- Windows:     `C:\\Users\\you\\.openclaw\\workspace\\wiki` (escape the backslashes in JSON)
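
Because the path has to be written out in full, it can be convenient to generate the JSON rather than edit it by hand. The sketch below is one possible way to do that; `obsidian_server_config` is a hypothetical helper, and it only reproduces the structure shown above. `json.dumps` takes care of escaping Windows backslashes.

```python
import json
import os
from pathlib import Path


def obsidian_server_config(wiki_root):
    """Build the `servers` entry shown above with an absolute wiki path."""
    absolute = str(Path(wiki_root).expanduser().resolve())
    return {
        "servers": {
            "obsidian": {
                "type": "stdio",
                "command": "npx",
                "args": ["-y", "@modelcontextprotocol/server-filesystem", absolute],
            }
        }
    }


if __name__ == "__main__":
    root = os.environ.get("WIKI_ROOT", "~/.openclaw/workspace/wiki")
    print(json.dumps(obsidian_server_config(root), indent=2))
```

The printed JSON can be pasted into `mcporter.json`; if the file already defines other servers, merge the `obsidian` entry into the existing `servers` object instead of overwriting it.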

Alternative MCP servers that work the same way:
- [`@modelcontextprotocol/server-filesystem`](https://github.com/modelcontextprotocol/servers/tree/main/src/filesystem) (vanilla, used above)
- Any other filesystem-style MCP server that exposes `read_text_file` / `write_file` / `edit_file` / `list_directory`

**Fallback without MCP:** the skill still works — `SKILL.md` explicitly tells the LLM to fall
back to plain `read` / `write` / `edit` tools and log that MCP is offline. You lose the sandbox
guarantee but everything else keeps running.

### Optional / experimental
- `embed.py` — Bedrock Titan embeddings → OpenSearch indexing. Requires AWS creds, a secret named
  `$WIKI_EMBED_SECRET` containing `{endpoint, username, password}`, and
  `pip install boto3 opensearch-py requests-aws4auth`. Skip unless you want semantic search on
  top of the wiki.

---

## 🚀 Usage

Once the skill is loaded, talk to your agent naturally:

| You say | Skill does |
|---------|-----------|
| "加进 wiki" / "ingest this" | Reads `index.md`, decides new-vs-update, writes page + updates index + logs |
| "查 wiki" / "what do we have on X" | Reads `index.md` + relevant pages, answers with `> 参考：[[page]]` citations |
| "整理 wiki" / "lint the wiki" | Runs `lint.py`, then LLM walks the report interactively to fix issues |

Under the hood the agent follows the flows in `scripts/ingest.md`, `scripts/query.md`, `scripts/lint.md`.

---

## 🗓 Weekly lint schedule

`lint.py` only *scans* and writes a report to `$WIKI_ROOT/.lint-history/YYYY-MM-DD.md`. It
does not push notifications anywhere — **delivery is your scheduler's job**. Use
`--summary` to get a single-line status suitable for piping into mail/chat/webhooks.

### Linux / macOS (cron)

```cron
# every Monday 02:00 local time: run full lint, write report to .lint-history/
# (a crontab entry must stay on one line; cron does not support backslash continuation)
0 2 * * 1 WIKI_ROOT=$HOME/.openclaw/workspace/wiki python3 $HOME/.openclaw/workspace/skills/wikisage/scripts/lint.py >> $HOME/.openclaw/workspace/wiki/.lint-history/cron.log 2>&1
```

**Pipe the one-line summary to whatever you use:**

```bash
# email
python3 .../lint.py --summary | mail -s 'wiki lint' you@example.com

# Slack incoming webhook
python3 .../lint.py --summary | \
  xargs -I{} curl -s -X POST -H 'Content-type: application/json' \
    --data '{"text":"{}"}' https://hooks.slack.com/services/XXX/YYY/ZZZ

# Any chat via openclaw CLI (Feishu / Discord / Telegram / Slack / ...)
python3 .../lint.py --summary | \
  xargs -I{} openclaw message send --channel feishu --target user:ou_xxx --message {}

# Discord webhook
python3 .../lint.py --summary | \
  xargs -I{} curl -s -X POST -H 'Content-type: application/json' \
    --data '{"content":"{}"}' https://discord.com/api/webhooks/XXX/YYY
```

### Windows (Task Scheduler, PowerShell)

Register a weekly task that runs Monday 02:00:

```powershell
$wikiRoot  = "$HOME\.openclaw\workspace\wiki"
$skillDir  = "$HOME\.openclaw\workspace\skills\wikisage"
$action    = New-ScheduledTaskAction -Execute "python" `
    -Argument "`"$skillDir\scripts\lint.py`""
$trigger   = New-ScheduledTaskTrigger -Weekly -DaysOfWeek Monday -At 2am
$principal = New-ScheduledTaskPrincipal -UserId "$env:USERNAME" -LogonType Interactive
$settings  = New-ScheduledTaskSettingsSet -StartWhenAvailable

Register-ScheduledTask -TaskName "wikisage-weekly-lint" `
  -Action $action -Trigger $trigger -Principal $principal -Settings $settings

[Environment]::SetEnvironmentVariable("WIKI_ROOT", $wikiRoot, "User")
```

To deliver a summary to chat/email, wrap it in a small script that pipes `lint.py --summary`
into your tool of choice, then point the scheduled task at that wrapper.

---

## 🧭 Why this pattern?

Plain markdown + Obsidian-style links gives you:

- **Zero lock-in** — it's just `.md` files; any editor works
- **Version-control friendly** — your wiki content belongs in a separate (private) git repo
- **Grep-able forever** — no ORM, no schema migrations, no embeddings to rebuild
- **LLM-native** — every page fits in context, and the whole `index.md` is an agent's cognitive map

The LLM is responsible for *curation* (deduping, cross-referencing, contradiction detection),
not just bulk-dumping. Hence the confidence tags, the lint flow, and the append-only `log.md`.

---

## 📁 Repository layout

```
wikisage/
├── SKILL.md              # skill manifest + operating rules (what the LLM reads)
├── scripts/
│   ├── ingest.md         # ingest flow spec
│   ├── query.md          # query flow spec
│   ├── lint.md           # lint flow spec (Layer 1 + Layer 2)
│   ├── lint.py           # Layer 1 mechanical scanner
│   ├── dedup.py          # SHA256 dedup cache for sources
│   └── embed.py          # optional: Bedrock Titan → OpenSearch
├── README.md             # you are here
└── LICENSE               # MIT
```

---

## 🔐 Separate your wiki content from this skill

**Do not commit your actual wiki (`$WIKI_ROOT`) to this public repo.**

This repo contains only the *skill definition*. Your wiki content — clients, account IDs,
decisions — should live in:

- a **separate private repo** (recommended), or
- a local Mutagen/rclone mount, or
- AWS S3 / any blob store

That separation is the whole point: the skill is reusable across machines; the knowledge is yours.

---

## 📝 License

MIT — see [LICENSE](./LICENSE).

## 🙏 Credits

Pattern inspired by [Andrej Karpathy](https://x.com/karpathy)'s "LLM wiki" idea.
Built for / battle-tested on [OpenClaw](https://openclaw.ai).

FILE:scripts/dedup.py
#!/usr/bin/env python3
"""
Wiki ingest dedup cache (SHA256)

Usage:
  python3 dedup.py check <file_or_url>    # exit code 0 = new content, 1 = duplicate, 2 = error
  python3 dedup.py record <file_or_url> <wiki_page_path> [title]  # record an entry
  python3 dedup.py list                   # list all recorded entries
  python3 dedup.py stats                  # summary statistics

Environment variables:
  WIKI_ROOT    root directory of the wiki markdown files (default ~/.openclaw/workspace/wiki)

Cache file: $WIKI_ROOT/.ingest-cache.json
Format:
  {
    "<sha256>": {
      "source": "file:/path/to/pdf OR https://...",
      "title": "source title (optional)",
      "wiki_page": "pages/aws/xxx.md",
      "ingested_at": "2026-04-25T06:07:00Z",
      "size_bytes": 12345
    },
    ...
  }
"""

import hashlib
import json
import os
import sys
import urllib.request
from datetime import datetime, timezone
from pathlib import Path

def _wiki_dir() -> Path:
    env = os.environ.get("WIKI_ROOT")
    if env:
        return Path(env).expanduser()
    return Path.home() / ".openclaw/workspace/wiki"


WIKI_DIR = _wiki_dir()
CACHE_FILE = WIKI_DIR / ".ingest-cache.json"


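def load_cache() -> dict:
    if not CACHE_FILE.exists():
        return {}
    try:
        # Read as UTF-8 explicitly: the cache is written with ensure_ascii=False,
        # and the platform default encoding (e.g. on Windows) may not be UTF-8.
        return json.loads(CACHE_FILE.read_text(encoding="utf-8"))
    except (ValueError, OSError):
        # ValueError covers both JSONDecodeError and UnicodeDecodeError.
        return {}


def save_cache(cache: dict) -> None:
    CACHE_FILE.parent.mkdir(parents=True, exist_ok=True)
    CACHE_FILE.write_text(
        json.dumps(cache, indent=2, ensure_ascii=False, sort_keys=True),
        encoding="utf-8",
    )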


def compute_hash(source: str) -> tuple[str, int]:
    """Return (sha256_hex, size_bytes) for a local path or URL. Raises on fetch errors."""
    if source.startswith(("http://", "https://")):
        req = urllib.request.Request(source, headers={"User-Agent": "wikisage-dedup/1.0"})
        with urllib.request.urlopen(req, timeout=30) as resp:
            data = resp.read()
    else:
        path = Path(source).expanduser()
        if not path.exists():
            raise FileNotFoundError(f"source not found: {source}")
        data = path.read_bytes()
    return hashlib.sha256(data).hexdigest(), len(data)


def cmd_check(args: list[str]) -> int:
    if not args:
        print("usage: dedup.py check <file_or_url>", file=sys.stderr)
        return 2
    source = args[0]
    try:
        digest, size = compute_hash(source)
    except Exception as e:
        print(f"ERROR computing hash for {source}: {e}", file=sys.stderr)
        return 2

    cache = load_cache()
    if digest in cache:
        entry = cache[digest]
        print("DUPLICATE")
        print(f"  sha256:     {digest}")
        print(f"  source:     {entry.get('source')}")
        print(f"  title:      {entry.get('title', '-')}")
        print(f"  wiki_page:  {entry.get('wiki_page')}")
        print(f"  ingested:   {entry.get('ingested_at')}")
        return 1
    print("NEW")
    print(f"  sha256: {digest}")
    print(f"  size:   {size} bytes")
    return 0


def cmd_record(args: list[str]) -> int:
    if len(args) < 2:
        print("usage: dedup.py record <file_or_url> <wiki_page_path> [title]", file=sys.stderr)
        return 2
    source = args[0]
    wiki_page = args[1]
    title = args[2] if len(args) > 2 else ""

    try:
        digest, size = compute_hash(source)
    except Exception as e:
        print(f"ERROR computing hash for {source}: {e}", file=sys.stderr)
        return 2

    cache = load_cache()
    now = datetime.now(timezone.utc).isoformat(timespec="seconds").replace("+00:00", "Z")
    cache[digest] = {
        "source": source,
        "title": title,
        "wiki_page": wiki_page,
        "ingested_at": now,
        "size_bytes": size,
    }
    save_cache(cache)
    print(f"RECORDED {digest[:16]}...  {wiki_page}")
    return 0


def cmd_list(_args: list[str]) -> int:
    cache = load_cache()
    if not cache:
        print("(cache empty)")
        return 0
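    for digest, entry in sorted(cache.items(), key=lambda kv: kv[1].get("ingested_at", "")):
        # Fall back to "?" so a missing field cannot break the "<40" width format (None rejects it).
        page = entry.get("wiki_page") or "?"
        print(f"{digest[:16]}...  {entry.get('ingested_at', '?'):<20}  {page:<40}  {entry.get('source')}")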
    return 0


def cmd_stats(_args: list[str]) -> int:
    cache = load_cache()
    total = len(cache)
    size = sum(e.get("size_bytes", 0) for e in cache.values())
    print(f"entries:     {total}")
    print(f"total size:  {size:,} bytes")
    print(f"cache file:  {CACHE_FILE}")
    return 0


COMMANDS = {
    "check": cmd_check,
    "record": cmd_record,
    "list": cmd_list,
    "stats": cmd_stats,
}


def main(argv: list[str]) -> int:
    if len(argv) < 2 or argv[1] not in COMMANDS:
        print(__doc__)
        return 2
    return COMMANDS[argv[1]](argv[2:])


if __name__ == "__main__":
    sys.exit(main(sys.argv))

FILE:scripts/embed.py
#!/usr/bin/env python3
"""
embed.py - Bedrock Titan Embeddings → OpenSearch

Usage:
  # Index a single page
  python3 embed.py --page wiki/pages/aws/eks.md --index wiki-personal

  # Index multiple pages
  python3 embed.py --pages wiki/pages/aws/eks.md wiki/pages/ai/litellm.md --index wiki-personal

  # Vector search
  python3 embed.py --query "EKS 节点组配置" --index wiki-personal --top-k 5

  # Client wiki
  python3 embed.py --page wiki-clients/clientA/pages/xxx.md --index wiki-client-clientA
"""

import argparse
import json
import os
import sys
import boto3
from datetime import datetime

REGION = os.environ.get("AWS_REGION", "us-east-1")
SECRET_NAME = os.environ.get("WIKI_EMBED_SECRET", "wikisage/opensearch")
WORKSPACE = os.environ.get("WIKI_WORKSPACE", os.path.expanduser("~/.openclaw/workspace"))


def get_opensearch_config():
    sm = boto3.client("secretsmanager", region_name=REGION)
    secret = sm.get_secret_value(SecretId=SECRET_NAME)
    return json.loads(secret["SecretString"])


def get_embedding(text: str) -> list:
    bedrock = boto3.client("bedrock-runtime", region_name=REGION)
    body = json.dumps({"inputText": text[:8000]})  # truncates by characters, not tokens (Titan v2 accepts up to 8192 tokens)
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    result = json.loads(response["body"].read())
    return result["embedding"]


def ensure_index(os_client, index_name: str):
    """Create the index if it does not exist."""
    if not os_client.indices.exists(index=index_name):
        mapping = {
            "settings": {"index": {"knn": True}},
            "mappings": {
                "properties": {
                    "page": {"type": "keyword"},
                    "content": {"type": "text"},
                    "embedding": {
                        "type": "knn_vector",
                        "dimension": 1024,  # Titan v2 default
                        "method": {
                            "name": "hnsw",
                            "space_type": "cosinesimil",
                            "engine": "nmslib",
                        },
                    },
                    "category": {"type": "keyword"},
                    "updated_at": {"type": "date"},
                }
            },
        }
        os_client.indices.create(index=index_name, body=mapping)
        print(f"✅ Created index: {index_name}")


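def index_page(os_client, index_name: str, page_path: str):
    """Index a single page."""
    full_path = os.path.join(WORKSPACE, page_path) if not os.path.isabs(page_path) else page_path
    # Wiki pages contain non-ASCII text, so read them as UTF-8 explicitly.
    with open(full_path, "r", encoding="utf-8") as f:
        content = f.read()

    # Derive the category from the path; normalise Windows separators first.
    category = "general"
    normalized = page_path.replace("\\", "/")
    for cat in ["aws", "ai", "projects", "ops"]:
        if f"/{cat}/" in normalized:
            category = cat
            break

    print(f"📄 Generating embedding: {page_path}")
    embedding = get_embedding(content)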

    doc = {
        "page": page_path,
        "content": content,
        "embedding": embedding,
        "category": category,
        "updated_at": datetime.utcnow().isoformat(),
    }

    os_client.index(index=index_name, id=page_path, body=doc)
    print(f"✅ Indexed: {page_path}")


def search(os_client, index_name: str, query: str, top_k: int = 5):
    """Vector search."""
    print(f"🔍 Searching: {query}")
    embedding = get_embedding(query)

    search_body = {
        "size": top_k,
        "query": {
            "knn": {
                "embedding": {
                    "vector": embedding,
                    "k": top_k,
                }
            }
        },
        "_source": ["page", "content", "category"],
    }

    response = os_client.search(index=index_name, body=search_body)
    hits = response["hits"]["hits"]

    results = []
    for hit in hits:
        results.append({
            "page": hit["_source"]["page"],
            "score": hit["_score"],
            "category": hit["_source"].get("category", ""),
            "preview": hit["_source"]["content"][:200],
        })

    return results


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--page", help="path to a single page")
    parser.add_argument("--pages", nargs="+", help="paths to multiple pages")
    parser.add_argument("--query", help="search query")
    parser.add_argument("--index", required=True, help="OpenSearch index name")
    parser.add_argument("--top-k", type=int, default=5)
    args = parser.parse_args()

    # Load the OpenSearch configuration
    config = get_opensearch_config()
    # .get() so a secret without an "endpoint" key reaches the error message below
    endpoint = config.get("endpoint")
    if not endpoint:
        print(f"❌ OpenSearch endpoint is not configured; update the Secrets Manager secret: {SECRET_NAME}")
        sys.exit(1)

    # Strip the https:// prefix
    host = endpoint.replace("https://", "").rstrip("/")

    from opensearchpy import OpenSearch, RequestsHttpConnection
    from requests_aws4auth import AWS4Auth  # not used by the basic-auth client below

    # Use basic auth (fine-grained access control)
    os_client = OpenSearch(
        hosts=[{"host": host, "port": 443}],
        http_auth=(config["username"], config["password"]),
        use_ssl=True,
        verify_certs=True,
        connection_class=RequestsHttpConnection,
    )

    # Make sure the index exists
    ensure_index(os_client, args.index)

    if args.query:
        results = search(os_client, args.index, args.query, args.top_k)
        print(json.dumps(results, ensure_ascii=False, indent=2))

    elif args.page:
        index_page(os_client, args.index, args.page)

    elif args.pages:
        for page in args.pages:
            index_page(os_client, args.index, page)

    else:
        print("❌ Specify one of --page, --pages, or --query")
        sys.exit(1)


if __name__ == "__main__":
    main()

FILE:scripts/ingest.md
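# Ingest flow

Run this when the user supplies a document, PDF, or URL, or when a valuable answer has been produced.

All wiki reads and writes go through Obsidian MCP (see the execution channel section of SKILL.md).
The examples below are abbreviated and omit `--config`; real calls must include it:
`--config $MCPORTER_CONFIG` (default `~/.openclaw/workspace/config/mcporter.json`)

In the examples, `$WIKI_ROOT` defaults to `~/.openclaw/workspace/wiki` and `$WIKI_SKILL_DIR` defaults to `~/.openclaw/workspace/skills/wikisage`.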

---

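## ⚠️ Mandatory: Step 0 dedup check (always run when the source is a file/URL)

If the ingest source is a **specific file or URL** (PDF, blog post, documentation link), compute its SHA256 for dedup **before reading it**: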

```bash
python3 $WIKI_SKILL_DIR/scripts/dedup.py check <file_or_url>
```

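- Output `NEW` → continue with Step 1 below
- Output `DUPLICATE` → **stop**, tell the user this content has already been ingested (and point to the corresponding wiki page), and ask whether to force a re-ingest

If the source is **conversation content or an ad-hoc conclusion** (no original file/URL), **skip Step 0** and start from Step 1.

After writing the pages, record the cache entry as an extra step between Step 4 and Step 5: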
```bash
python3 $WIKI_SKILL_DIR/scripts/dedup.py \
  record <file_or_url> pages/<category>/<slug>.md "source title"
```
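
`dedup.py` keys its cache on the SHA256 of the source bytes, so the same document fetched from a different path or URL is still reported as `DUPLICATE`. The check/record pair can be pictured with an in-memory dict. This is a simplified sketch, not the script itself; `check` and `record` here are hypothetical stand-ins for the two subcommands.

```python
import hashlib


def check(cache, data):
    """Return ("NEW" | "DUPLICATE", digest) for the given source bytes."""
    digest = hashlib.sha256(data).hexdigest()
    return ("DUPLICATE" if digest in cache else "NEW"), digest


def record(cache, digest, wiki_page):
    """Remember which wiki page this content was ingested into."""
    cache[digest] = {"wiki_page": wiki_page}
```

Because the key is the content hash, a single changed byte in the source produces a new digest and the content is treated as `NEW` again.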

---

## ⚠️ Mandatory: decide first, then choose how to store

### Step 1: read index.md and find related pages

```bash
mcporter call obsidian.read_text_file path=$WIKI_ROOT/index.md
```

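Scan the titles and descriptions of all existing pages and decide which page the new content is most closely related to.

### Step 2: decide how to store it

```
New content
  │
  ├── Is there a related page in index.md?
  │     │
  │     ├── YES: what kind of content is it?
  │     │     ├── Extends/supplements an existing concept → update the existing page, append a new section
  │     │     ├── Independent subtopic (can stand alone) → create a new page + add a [[link]] on the original page
  │     │     ├── Contradicts existing content → flag the contradiction, ask the user which is correct
  │     │     └── Exact duplicate → do not store; tell the user related content already exists
  │     │
  │     └── NO → create a new page
  │
  └── After storing: update index.md (page count + 1, add an entry)
```

### Step 3: decision criteria in detail

| Situation | Basis for the decision | Action |
|------|---------|------|
| Different angles on the same concept | Same topic keyword (e.g. both are about "NIS 2") | Update the existing page |
| Independent subtopic | Different titles, but overlapping (e.g. "NIS 2 action checklist" vs "NIS 2 overview") | Create a new page + cross-link |
| Completely different topic | No overlap | Create a new page |
| Contradictory content | Two places describe the same fact differently | Ask the user |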

---

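## Standard ingest flow

```
Step 0: Dedup check (dedup.py check, file/URL sources only)
         - NEW    → continue
         - DUPLICATE → stop and tell the user it has already been ingested
Step 1: Read index.md (MCP read_text_file; decide whether a related page already exists)
Step 2: Depending on the decision:
         - Create → MCP write_file (whole page)
         - Update → MCP edit_file (partial) or write_file (overwrite the whole page)
Step 3: Append to $WIKI_ROOT/log.md (MCP edit_file, append-only; format: ## [YYYY-MM-DD] ingest | title)
Step 4: Update $WIKI_ROOT/index.md (MCP edit_file: add an entry for a new page; revise the description for an update)
Step 4.5: Record the dedup cache entry (dedup.py record, file/URL sources only)
Step 5: A single ingest may touch 5-15 related pages; update the cross-references one by one (MCP edit_file)
Step 6: Tell the user the result (page path + which pages were updated)
```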

### MCP call examples

```bash
# Create a new page
mcporter call obsidian.write_file \
  path=$WIKI_ROOT/pages/aws/security-hub.md \
  content='# Security Hub

**最后更新：** 2026-04-25
...'

# Update a page (partial edit)
mcporter call obsidian.edit_file \
  path=$WIKI_ROOT/pages/aws/security-hub.md \
  edits='[{"oldText":"## 相关页面\n- [[A]]","newText":"## 相关页面\n- [[A]]\n- [[B]]"}]'

# Append to log.md (use edit_file to add a line at the end of the file, or read and rewrite the whole file)
```

---

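## Page file naming conventions

```
$WIKI_ROOT/pages/
├── aws/              AWS services, compliance, architecture
│   └── sources/      Summaries of source documents (raw sources)
├── ai/               AI/LLM topics
│   └── sources/
├── projects/         Project-related pages
└── ops/              Operations-related pages
```

- File names: lowercase with hyphens, e.g. `security-hub.md`, `nis2-compliance-checklist.md`
- `sources/` holds summaries of the source documents; the parent directory holds the compiled knowledge pages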
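
The lowercase-with-hyphens rule can be applied mechanically to ASCII titles. The sketch below is one possible reading of the convention; `slugify` is a hypothetical helper, and titles without ASCII letters or digits (for example purely Chinese titles) still need a hand-picked slug.

```python
import re


def slugify(title):
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    if not words:
        raise ValueError("title has no ASCII letters or digits; choose a slug by hand")
    return "-".join(words) + ".md"
```

With this rule, a title such as "Security Hub" maps to the `security-hub.md` file name used in the examples.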

---

## Page template

```markdown
# 页面标题

**最后更新：** YYYY-MM-DD
**来源数量：** N
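**分类：** aws/security (path)
**置信度：** EXTRACTED  <!-- EXTRACTED | INFERRED | AMBIGUOUS | UNVERIFIED -->

## 概述
One paragraph that explains what this topic is.

## 核心内容
...

<!-- Paragraph-level inline confidence tags (mandatory on pages with mixed confidence): -->
<!-- [EXTRACTED] taken directly from the source text -->
<!-- [INFERRED]  reasoned from the sources -->
<!-- [AMBIGUOUS] the source itself is vague -->
<!-- [UNVERIFIED] common knowledge added by the AI, no source -->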

## 相关页面
- [[相关页面名]]

## 来源
- [[原始文档页面名]]
- [外部链接](https://...)
```

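**Confidence tagging principles:**
1. The `置信度:` field in the frontmatter is the page-wide default; **do not omit it**
2. If part of a page is taken from the source text (EXTRACTED) and part is inferred by the AI (INFERRED), **put an inline tag in front of the paragraph**
3. When a query cites INFERRED/UNVERIFIED content, the answer must say that it is inferred
4. See the confidence tag rules in SKILL.md for the detailed rules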

---

## Lightweight lint after an ingest

```bash
python3 $WIKI_SKILL_DIR/scripts/lint.py --quick
```

This only checks index.md consistency, so that an ingest does not leave the index out of sync.

FILE:scripts/lint.md
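# Lint flow (two layers: mechanical scan + LLM cleanup)

This follows the Karpathy LLM Wiki pattern: **the LLM is the real linter**; the script only does the mechanical scan and the reminders.

In the examples, `$WIKI_ROOT` defaults to `~/.openclaw/workspace/wiki` and
`$WIKI_SKILL_DIR` defaults to `~/.openclaw/workspace/skills/wikisage`.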

---

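## Layer 1: mechanical scan (lint.py · cron, Mondays 02:00 local time)

Run by `scripts/lint.py`, which writes its report to `$WIKI_ROOT/.lint-history/YYYY-MM-DD.md`.

### Checks (6 items, aligned with Karpathy's original)

| # | Check | Done by |
|---|---|---|
| 1 | index.md consistency (entry exists but the file does not / file exists but is not listed) | Script ✅ |
| 2 | Orphan pages (not referenced by any page via [[link]]) | Script ✅ |
| 3 | Missing concept pages ([[link]] with no corresponding file) | Script ✅ |
| 4 | Missing cross-references (A mentions B but has no [[B]] link) | Script ✅ |
| 5 | Stale pages (not updated for more than 90 days, by mtime) | Script ✅ |
| 6 | Contradictory content / superseded claims / data gaps | **LLM** (Layer 2) |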

### Scheduling (cross-platform)

**Linux / macOS (cron):**
```cron
0 2 * * 1 WIKI_ROOT=$HOME/.openclaw/workspace/wiki python3 $HOME/.openclaw/workspace/skills/wikisage/scripts/lint.py >> $HOME/.openclaw/workspace/wiki/.lint-history/cron.log 2>&1
```

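cron evaluates `0 2 * * 1` in the host's local time zone, so this is 02:00 UTC only on a machine set to UTC. The shell opens `cron.log` before the script starts, so `.lint-history/` must already exist for the redirect to succeed; create it once with `mkdir -p` before the first scheduled run.

**Windows (Task Scheduler):** see the *Weekly lint schedule* section of `README.md` (use `python.exe` rather than `python3`).

The script only writes the report and prints to stdout; it does not send notifications. For a push such as "This week's lint: X orphans, Y missing pages…":
- Add `--summary` to get a one-line summary
- In your cron/Task Scheduler entry, pipe that line to the tool of your choice (email, Slack webhook, Discord webhook, `openclaw message`, a Feishu custom bot…). Examples are in the README.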
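
If a notification should fire only when something is wrong, the summary line can be parsed first. This is a sketch that assumes the line format produced by `build_summary` in `scripts/lint.py`; `parse_summary`, `should_notify`, and the threshold are hypothetical:

```python
import re

def parse_summary(line: str) -> dict:
    """Parse the per-check counts out of a `lint.py --summary` line."""
    head = line.split(" | report:", 1)[0]  # drop the report path
    return {name: int(n) for name, n in re.findall(r"([a-z-]+):(\d+)", head)}

def should_notify(line: str, threshold: int = 1) -> bool:
    """True when the total number of issues reaches the threshold."""
    return sum(parse_summary(line).values()) >= threshold

line = ("📚 Wiki Lint: 5 issues (index:1 orphans:2 missing-concepts:0 "
        "missing-xref:1 stale:1 no-confidence:0) | report: /w/.lint-history/2026-04-20.md")
print(parse_summary(line)["orphans"], should_notify(line))
```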

---

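## Layer 2: LLM cleanup (user-triggered · run by the agent)

**Execution channel**: in Layer 2, all reads and writes of wiki files go through **Obsidian MCP** (see the execution channel section of SKILL.md).
The Layer 1 `lint.py` script still reads the filesystem directly from Python, which is fast and does not modify any page.


### Triggers

Any of the following keywords from the user starts Layer 2:
- "整理 wiki" (tidy the wiki)
- "wiki 健康检查" (wiki health check)
- "lint" (with no arguments)
- "整理 wiki 矛盾" (tidy wiki contradictions; runs item 6 only)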

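### Procedure

```
Step 1: Read the latest lint report
  → exec: ls $WIKI_ROOT/.lint-history/*.md | tail -1   (the *.md glob keeps cron.log from sorting last)
  → mcporter call obsidian.read_text_file path=<report file>
  → If there is no report (cron has not run yet): first run python3 $WIKI_SKILL_DIR/scripts/lint.py --no-log by hand

Step 2: Handle each category (in priority order)

  [Orphan pages] Usually a link from index.md or another page was missed
    → For each orphan:
        - Read the page
        - Decide where it belongs: which page should reference it? (index.md always should)
        - Ask the user: "I suggest adding an [[orphan]] link on page X. OK?"
        - If yes → edit the target page + index.md

  [Missing concept pages] A [[link]] points to a file that does not exist
    → Sort by reference count (most-referenced first)
    → For each one:
        - Look at what the referencing pages say about it
        - Decide: does the concept have **standalone value**?
          - Yes → suggest creating a page (ask the user whether they want it)
          - No (a passing mention) → suggest plain text and remove the [[]]
          - It is an alias of another page → suggest the correct slug

  [Missing cross-references] A mentions B but has no [[B]]
    → For each one, ask: "Add a [[B]] link to section Y of page X?"
    → If yes → insert the link

  [Stale pages]
    → For each one:
        - Read the page
        - Is it really stale? (only time-sensitive content counts; conceptual content does not)
        - If stale → suggest: update / annotate / delete
        - Ask the user to decide

  [Contradictions] (Layer 2 only)
    → Scan all pages for descriptions of the same concept or fact
    → Compare them to find contradictions
    → Mark with ⚠️ + ask the user which version is correct
    → Edit the page + update the log

  [Data gaps] (Layer 2 only)
    → Review the wiki's topic coverage for important topics that may be missing
    → Suggest: "Shall I search for X and add a page?"

Step 3: After each group of changes, sync (all via MCP edit_file / write_file)
  - index.md (pages added or removed)
  - log.md (append ## [date] lint-fix | what was done)

Step 4: Wrap up
  - Run python3 $WIKI_SKILL_DIR/scripts/lint.py --no-log again to verify
  - Report: fixed N items, M left unhandled (not urgent)
```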
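
Step 1 has to pick the newest dated report while ignoring `cron.log`, which lives in the same directory. A minimal sketch of that selection in Python; `latest_report` is a hypothetical helper and is not part of `lint.py`:

```python
import re
import tempfile
from pathlib import Path

REPORT_NAME = re.compile(r"\d{4}-\d{2}-\d{2}\.md")

def latest_report(history_dir: Path):
    """Return the newest YYYY-MM-DD.md report, or None if there is none."""
    if not history_dir.is_dir():
        return None
    # ISO dates sort chronologically when sorted as plain strings
    reports = sorted(p for p in history_dir.iterdir() if REPORT_NAME.fullmatch(p.name))
    return reports[-1] if reports else None

# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    history = Path(tmp)
    for name in ("2026-04-18.md", "2026-04-27.md", "cron.log"):
        (history / name).write_text("", encoding="utf-8")
    print(latest_report(history).name)  # 2026-04-27.md
```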

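### Behavior principles (important)

1. **Ask item by item; no bulk edits.** The user confirms every change, so the knowledge structure is not damaged
2. **Err on the side of caution.** If unsure, ask; do not decide on your own
3. **Link conventions**: slugs are always lowercase with hyphens (`obsidian`, `aws-security-hub`), which avoids orphans caused by case differences
4. **Sync index.md with every batch of changes**, so a mid-run failure does not leave a dirty state
5. **Always update log.md.** The log is the wiki's timeline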
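
For principle 5, a log entry can be formatted before it is appended through MCP. The sketch below follows the `## [date] lint-fix | what was done` heading shape used in this workflow; `format_log_entry` and the bullet layout for details are assumptions:

```python
from datetime import date

def format_log_entry(kind: str, summary: str, details=(), day=None) -> str:
    """Build one log.md entry: a dated heading followed by optional bullets."""
    day = day or date.today()
    lines = [f"## [{day.isoformat()}] {kind} | {summary}", ""]
    lines += [f"- {d}" for d in details]
    return "\n" + "\n".join(lines) + "\n"

entry = format_log_entry(
    "lint-fix", "linked 2 orphans",
    ["index.md: added [[guardduty]]", "security-hub.md: added [[guardduty]]"],
    day=date(2026, 4, 20),
)
print(entry)
```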

---

## Lightweight lint (runs automatically after ingest)

```bash
python3 $WIKI_SKILL_DIR/scripts/lint.py --quick
```

Checks only index.md consistency, so an ingest does not leave the wiki in a dirty state. The last step of the ingest workflow can call it.

---

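## Report archive

```
$WIKI_ROOT/.lint-history/
├── 2026-04-18.md          ← today's report
├── 2026-04-20.md          ← produced by cron next Monday
├── 2026-04-27.md
└── cron.log               ← cron run log (stderr goes here too)
```

If the wiki is synced to a local editor (Obsidian, VS Code, etc.) via Mutagen or git, the user can also browse past reports locally.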

---

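## Run frequency

- Layer 1 (script): automatic, **every Monday 02:00 UTC**
- Layer 2 (LLM): **user-triggered**, after the user sees the weekly notification and decides whether to tidy up
- After ingest: **automatic --quick** (lightweight lint to catch dirty state)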

FILE:scripts/lint.py
#!/usr/bin/env python3
"""
Wiki lint script (Karpathy LLM Wiki pattern · Layer 1 mechanical scan)

Layer 1: mechanical scan; writes a report and prints it (or a one-line summary) to stdout
Layer 2: LLM cleanup (triggered when the user says "整理 wiki"; not this script's job)

Usage:
  python3 lint.py                    # full lint
  python3 lint.py --quick            # lightweight lint (run after ingest)
  python3 lint.py --wiki-root /path  # custom wiki root (or set $WIKI_ROOT)
  python3 lint.py --summary          # print only a one-line summary to stdout (for cron pipes)
  python3 lint.py --no-log           # do not append to log.md (preview mode; the report file is still written)

Environment variables:
  WIKI_ROOT          wiki markdown root (default ~/.openclaw/workspace/wiki)

Report output:
  $WIKI_ROOT/.lint-history/YYYY-MM-DD.md   # persistent report
  stdout                                    # full report, or a one-line summary with --summary

Push notifications? Pipe it yourself in cron/Task Scheduler:
  python3 lint.py --summary | mail -s 'wiki lint' [email protected]
  python3 lint.py --summary | xargs -I{} openclaw message send --target user:xxx --message {}
  python3 lint.py --summary | curl -X POST -d @- https://hooks.slack.com/services/...
"""

import os
import re
import sys
import argparse
from datetime import datetime, timedelta
from pathlib import Path
from collections import defaultdict


def default_wiki_root() -> Path:
    env = os.environ.get("WIKI_ROOT")
    if env:
        return Path(env).expanduser()
    return Path.home() / ".openclaw/workspace/wiki"


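# "Missing cross-reference" heuristic: if the body of page A mentions the title of page B
# as plain text (not wrapped in [[]]) and A has no [[B]] link, report a possible missing cross-reference.
# Pages under the directories below are skipped by this check and by the confidence-label check.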
SKIP_MENTION_CHECK_DIRS = {"sources", "raw", ".lint-history"}


def find_all_pages(wiki_dir: Path):
    pages_dir = wiki_dir / "pages"
    if not pages_dir.exists():
        return []
    return sorted(pages_dir.rglob("*.md"))


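def extract_title(page_path: Path) -> str:
    """Return the text of the first `# ` heading, or the file stem if there is none."""
    try:
        # Read as UTF-8 explicitly; the platform default can differ on Windows
        content = page_path.read_text(encoding="utf-8", errors="ignore")
    except OSError:
        return page_path.stem
    m = re.search(r"^#\s+(.+)$", content, re.MULTILINE)
    return m.group(1).strip() if m else page_path.stem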


def extract_links(content: str):
    """提取所有 [[链接]]"""
    return re.findall(r"\[\[([^\]]+)\]\]", content)


def check_index_consistency(wiki_dir: Path):
    issues = []
    index_file = wiki_dir / "index.md"
    if not index_file.exists():
        return [f"❌ index.md 不存在：{index_file}"]

    index_content = index_file.read_text(encoding="utf-8")
    index_links = set(extract_links(index_content))
    actual_pages = {p.stem for p in find_all_pages(wiki_dir)}

    for link in index_links:
        slug = link.replace(" ", "-").lower()
        if link not in actual_pages and slug not in actual_pages:
            issues.append(f"  📋 index.md 有条目但文件不存在：[[{link}]]")

    for page in actual_pages:
        if page not in index_links and page.replace("-", " ") not in index_links:
            issues.append(f"  📋 文件存在但 index.md 未记录：{page}.md")

    return issues


def check_orphan_pages(wiki_dir: Path):
    pages = find_all_pages(wiki_dir)
    if not pages:
        return []
    all_links = set()
    for page in pages:
        content = page.read_text(encoding="utf-8", errors="ignore")
        all_links.update(extract_links(content))
    # Links in index.md count as references too
    index_file = wiki_dir / "index.md"
    if index_file.exists():
        all_links.update(extract_links(index_file.read_text(encoding="utf-8")))

    orphans = []
    for page in pages:
        stem = page.stem
        if stem not in all_links and stem.replace("-", " ") not in all_links:
            rel = page.relative_to(wiki_dir)
            orphans.append(f"  - {rel}")
    return orphans


def check_missing_concept_pages(wiki_dir: Path):
    pages = find_all_pages(wiki_dir)
    actual_pages = {p.stem for p in pages}

    link_refs = defaultdict(list)
    for page in pages:
        content = page.read_text(encoding="utf-8", errors="ignore")
        for link in extract_links(content):
            slug = link.replace(" ", "-").lower()
            if link not in actual_pages and slug not in actual_pages:
                link_refs[link].append(page.stem)

    missing = []
    for link, refs in sorted(link_refs.items(), key=lambda x: -len(x[1])):
        missing.append(f"  - [[{link}]] — 被 {len(refs)} 个页面引用（{', '.join(refs[:3])}{'...' if len(refs) > 3 else ''}）")
    return missing


def check_stale_pages(wiki_dir: Path, days: int = 90):
    pages = find_all_pages(wiki_dir)
    stale = []
    cutoff = datetime.now() - timedelta(days=days)
    for page in pages:
        mtime = datetime.fromtimestamp(page.stat().st_mtime)
        if mtime < cutoff:
            rel = page.relative_to(wiki_dir)
            delta = (datetime.now() - mtime).days
            stale.append(f"  - {rel}（{delta} 天未更新）")
    return stale


def check_missing_confidence(wiki_dir: Path):
    """
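    Check that each page has a `**置信度：**` line within its first 1024 characters.
    Older pages may lack one, but new or updated pages should add it; without it, a Query
    cannot judge how reliable the content is.
    Returns markdown list items, like the other checks.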
    """
    pages = find_all_pages(wiki_dir)
    missing = []
    tag_pat = re.compile(r"^\*\*置信度：\*\*", re.MULTILINE)
    for page in pages:
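        # Skip summary pages under sources/ and raw/ (excerpts, treated as EXTRACTED).
        # Match on the path relative to the wiki root, so a parent directory that happens
        # to be named "sources" or "raw" does not exclude every page.
        if any(seg in page.relative_to(wiki_dir).parts for seg in SKIP_MENTION_CHECK_DIRS):
            continue
        try:
            head = page.read_text(encoding="utf-8", errors="replace")[:1024]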
        except OSError:
            continue
        if not tag_pat.search(head):
            rel = page.relative_to(wiki_dir)
            missing.append(f"  - {rel}")
    return missing


def check_missing_cross_refs(wiki_dir: Path):
    """
    检查可能缺失的交叉引用：
    页面 A 的正文里提到了页面 B 的完整标题（纯文本，非 [[]]），
    但 A 的「相关页面」章节没有 [[B]] 链接 → 可能漏了交叉引用
    """
    pages = find_all_pages(wiki_dir)
    # 建 title → path 映射
    title_to_page = {}
    page_to_title = {}
    for p in pages:
        # 跳过 sources/raw 下的页面（它们本来就是摘要，不适合做概念枢纽）
        rel = p.relative_to(wiki_dir)
        if any(part in SKIP_MENTION_CHECK_DIRS for part in rel.parts):
            continue
        title = extract_title(p)
        # 标题太短（< 4 字符）会误报，跳过
        if len(title) < 4:
            continue
        title_to_page[title] = p
        page_to_title[p] = title

    suggestions = []
    for page, title in page_to_title.items():
        content = page.read_text(encoding="utf-8", errors="ignore")
        # Remove this page's existing [[links]]; what remains is plain-text mentions
        stripped = re.sub(r"\[\[[^\]]+\]\]", "", content)
        existing_links = set(extract_links(content))
        existing_links_normalized = {l.lower() for l in existing_links}

        for other_title, other_page in title_to_page.items():
            if other_page == page:
                continue
            # The body mentions other_title as plain text
            if other_title in stripped:
                # ...but no [[link]] targets other_page.stem or other_title (compared case-insensitively)
                if (other_page.stem.lower() not in existing_links_normalized
                        and other_title.lower() not in existing_links_normalized):
                    rel = page.relative_to(wiki_dir)
                    suggestions.append(f"  - {rel} 提到了「{other_title}」但没建立 [[{other_page.stem}]] 链接")

    # 去重（一个页面可能提到多次同一个别人，只报一次）
    return sorted(set(suggestions))[:30]  # 限制 30 条防爆


def write_report_file(wiki_dir: Path, report_md: str, now_date: str) -> Path:
    """报告写到 wiki/.lint-history/YYYY-MM-DD.md（持久化）"""
    history_dir = wiki_dir / ".lint-history"
    history_dir.mkdir(exist_ok=True)
    report_file = history_dir / f"{now_date}.md"
    report_file.write_text(report_md, encoding="utf-8")
    return report_file


def build_summary(wiki_dir: Path, now_date: str, stats: dict) -> str:
    """构造一行报警摘要，用于 --summary / 外部推送 pipe。"""
    total_issues = (
        stats.get("index_issues", 0)
        + stats.get("orphans", 0)
        + stats.get("missing_concepts", 0)
        + stats.get("missing_cross_refs", 0)
        + stats.get("stale", 0)
        + stats.get("missing_confidence", 0)
    )
    report_path = f"{wiki_dir}/.lint-history/{now_date}.md"
    if total_issues == 0:
        return f"📚 Wiki Lint: ✅ 0 issues | report: {report_path}"
    return (
        f"📚 Wiki Lint: {total_issues} issues "
        f"(index:{stats.get('index_issues', 0)} "
        f"orphans:{stats.get('orphans', 0)} "
        f"missing-concepts:{stats.get('missing_concepts', 0)} "
        f"missing-xref:{stats.get('missing_cross_refs', 0)} "
        f"stale:{stats.get('stale', 0)} "
        f"no-confidence:{stats.get('missing_confidence', 0)}) "
        f"| report: {report_path}"
    )


def run_lint(wiki_dir: Path, quick: bool = False, write_log: bool = True, summary_only: bool = False):
    now = datetime.now().strftime("%Y-%m-%d %H:%M")
    now_date = datetime.now().strftime("%Y-%m-%d")
    report_lines = [f"# Wiki Lint 报告 — {wiki_dir} — {now}\n"]

    # Keep stdout to the single summary line when --summary is set
    if not summary_only:
        print(f"🔍 Lint: {wiki_dir} ({'轻量模式' if quick else '完整模式'})")

    # 1. index.md 一致性
    index_issues = check_index_consistency(wiki_dir)
    report_lines.append("## 📋 index.md 一致性")
    if index_issues:
        report_lines.extend(index_issues)
    else:
        report_lines.append("  ✅ 无问题")
    report_lines.append("")

    # 统计数据
    stats = {"index_issues": len(index_issues)}

    if not quick:
        # 2. 孤儿页面
        orphans = check_orphan_pages(wiki_dir)
        stats["orphans"] = len(orphans)
        report_lines.append(f"## ⚠️ 孤儿页面（{len(orphans)} 个）")
        report_lines.extend(orphans if orphans else ["  ✅ 无孤儿页面"])
        report_lines.append("")

        # 3. 缺失概念页
        missing = check_missing_concept_pages(wiki_dir)
        stats["missing_concepts"] = len(missing)
        report_lines.append(f"## 🔗 缺失概念页（{len(missing)} 个）")
        report_lines.extend(missing if missing else ["  ✅ 无缺失概念页"])
        report_lines.append("")

        # 4. 缺失交叉引用
        cross_refs = check_missing_cross_refs(wiki_dir)
        stats["missing_cross_refs"] = len(cross_refs)
        report_lines.append(f"## 🔀 可能缺失的交叉引用（{len(cross_refs)} 个）")
        report_lines.append("  _规则：页面 A 正文提到页面 B 的标题但没有 [[B]] 链接_")
        report_lines.extend(cross_refs if cross_refs else ["  ✅ 无"])
        report_lines.append("")

        # 5. 过时内容
        stale = check_stale_pages(wiki_dir)
        stats["stale"] = len(stale)
        report_lines.append(f"## 📅 过时页面（{len(stale)} 个，超过 90 天）")
        report_lines.extend(stale if stale else ["  ✅ 无过时页面"])
        report_lines.append("")

        # 5.5 缺置信度标签
        no_conf = check_missing_confidence(wiki_dir)
        stats["missing_confidence"] = len(no_conf)
        report_lines.append(f"## 🏷️ 缺置信度标签（{len(no_conf)} 个）")
        report_lines.append("  _规则：页面 frontmatter 应有 `**置信度：**` 字段（EXTRACTED/INFERRED/AMBIGUOUS/UNVERIFIED）_")
        report_lines.extend(no_conf if no_conf else ["  ✅ 全部页面都有置信度标签"])
        report_lines.append("")

        # 6. 矛盾内容 / 空白点 → 需要 LLM（Layer 2）
        report_lines.append("## 💡 需要 LLM 判断（Layer 2）")
        report_lines.append("  - 矛盾内容（同一事实在多页描述不一致）")
        report_lines.append("  - 过时说法（新来源推翻旧说法）")
        report_lines.append("  - 数据空白（可以上网搜的主题）")
        report_lines.append("  → 在对话里说「整理 wiki」触发 LLM 逐项处理")
        report_lines.append("")

    report = "\n".join(report_lines)
    if not summary_only:
        print(report)

    # 写报告文件
    report_file = write_report_file(wiki_dir, report, now_date)
    if not summary_only:
        print(f"\n📄 报告已保存：{report_file.relative_to(wiki_dir)}")

    # 追加 log.md
    if write_log:
        log_file = wiki_dir / "log.md"
        mode = "快速" if quick else "完整"
        with open(log_file, "a", encoding="utf-8") as f:
            f.write(f"\n## [{now_date}] lint | {mode} lint\n\n")
            f.write(f"- 模式：{mode}\n")
            f.write(f"- 报告：`.lint-history/{now_date}.md`\n")
            if not quick:
                f.write(f"- 孤儿页面：{stats['orphans']} 个\n")
                f.write(f"- 缺失概念页：{stats['missing_concepts']} 个\n")
                f.write(f"- 缺失交叉引用：{stats['missing_cross_refs']} 个\n")
                f.write(f"- 过时页面：{stats['stale']} 个\n")
                f.write(f"- 缺置信度标签：{stats['missing_confidence']} 个\n")
            f.write("\n")

    # --summary: 只打一行到 stdout，供外部 pipe
    if summary_only:
        print(build_summary(wiki_dir, now_date, stats))

    return report, stats


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--quick", action="store_true", help="轻量模式（只查 index.md）")
    parser.add_argument("--wiki-root", default=None, help="wiki 根目录（默认 $WIKI_ROOT 或 ~/.openclaw/workspace/wiki）")
    parser.add_argument("--summary", action="store_true", help="只打印一行摘要到 stdout（供 cron/Scheduler pipe 到邮件/聊天/webhook）")
    parser.add_argument("--no-log", action="store_true", help="不追加 log.md（预览模式）")
    args = parser.parse_args()

    wiki_dir = Path(args.wiki_root).expanduser() if args.wiki_root else default_wiki_root()
    if not wiki_dir.exists():
        print(f"❌ wiki 目录不存在：{wiki_dir}", file=sys.stderr)
        print(f"   提示：设置 $WIKI_ROOT 或用 --wiki-root 指定", file=sys.stderr)
        sys.exit(2)

    run_lint(
        wiki_dir=wiki_dir,
        quick=args.quick,
        write_log=not args.no_log,
        summary_only=args.summary,
    )

FILE:scripts/query.md
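# Query workflow

Run this when the user asks a technical question or explicitly says "查 wiki" (check the wiki).

## ⚠️ Mandatory order: wiki → MCP → LLM

All wiki reads go through Obsidian MCP (see the execution channel section of SKILL.md).
The examples below are abbreviated and omit `--config`; real calls must include
`--config $MCPORTER_CONFIG` (default `~/.openclaw/workspace/config/mcporter.json`).

In the examples, `$WIKI_ROOT` defaults to `~/.openclaw/workspace/wiki`.

### Step 1: read wiki/index.md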

```bash
mcporter call obsidian.read_text_file path=$WIKI_ROOT/index.md
```

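→ Scan every page title and description for relevant pages
→ Found → read the relevant pages in full (Step 2 below) → answer from them
→ Not found → go to Step 3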
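
The index scan can be sketched as simple keyword matching. This assumes each index.md entry sits on one line as `[[slug]]` followed by a short description, which the source does not spell out; `find_candidates` is a hypothetical helper, and in practice the agent judges relevance itself:

```python
import re

LINK_RE = re.compile(r"\[\[([^\]|#]+)")

def find_candidates(index_text: str, keywords) -> list:
    """Return slugs of index entries that mention any keyword, best match first."""
    kws = [k.lower() for k in keywords]
    hits = []
    for line in index_text.splitlines():
        links = LINK_RE.findall(line)
        if not links:
            continue
        score = sum(kw in line.lower() for kw in kws)
        if score:
            hits.append((score, links[0].strip()))
    hits.sort(key=lambda h: -h[0])  # stable sort keeps index order for ties
    return [slug for _, slug in hits]

index = (
    "- [[security-hub]] AWS Security Hub findings and compliance\n"
    "- [[guardduty]] threat detection\n"
    "- [[nis2-compliance-checklist]] NIS2 compliance checklist\n"
)
print(find_candidates(index, ["compliance", "nis2"]))
```

An empty result corresponds to the "not found" branch above.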

### Step 2: read specific pages

```bash
mcporter call obsidian.read_text_file \
  path=$WIKI_ROOT/pages/<category>/<slug>.md
```

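Read several pages at once if needed and combine them.

**For fuzzy full-text search** (MCP can only glob file names):
```bash
# Preferred: qmd-search (BM25) over the workspace collection
# Fallback: plain grep
exec grep -rn "keyword" $WIKI_ROOT/pages/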
```

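### Step 3: query external MCP servers / search (optional, as needed)

If the local wiki has nothing, query external sources that fit the topic (AWS documentation, pricing, Tavily search, etc.).
Which MCP servers are available depends on what the user has configured in `$MCPORTER_CONFIG`:

```bash
# Example: AWS documentation (if aws-kb is configured)
mcporter call 'aws-kb.aws___search_documentation(search_phrase: "keyword")'

# Example: AWS pricing (if aws-pricing is configured)
mcporter call 'aws-pricing.get_aws_pricing(service_code: "...", region: "us-east-1")'

# Example: web search (if tavily is configured)
mcporter call tavily.search query="keyword"
```

→ Results → answer from the MCP results and attach reference links
→ No results → the LLM answers directly (last resort)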

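### Step 4: compose the answer

Answer from the wiki or MCP content and cite sources at the end:
- wiki source: `> Reference: [[page name]]`
- MCP source: `> Reference: [AWS documentation link]`

**Confidence transparency (mandatory):**
- When reading a page, note the `**置信度：**` line in its header and the inline tags in the body ([EXTRACTED] / [INFERRED] / [AMBIGUOUS] / [UNVERIFIED])
- If the answer cites `INFERRED` / `UNVERIFIED` / `AMBIGUOUS` content, **say so explicitly in the answer**:
  - INFERRED → "This point is inferred (the source only says…)"
  - UNVERIFIED → "This is general knowledge I added; it is not in the wiki sources"
  - AMBIGUOUS → "The source is vague here; please check it against the original"
- If everything is EXTRACTED, no special note is needed (source text is the default)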
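
The disclosure rule amounts to a lookup from the tags found in the cited content to the notes the answer must carry. A minimal sketch, with hypothetical names and English wording adapted from the list above:

```python
DISCLAIMERS = {
    "INFERRED": "This point is inferred; the source does not state it directly.",
    "UNVERIFIED": "This is general knowledge I added; it is not in the wiki sources.",
    "AMBIGUOUS": "The source is vague here; please check it against the original.",
}

def required_disclaimers(tags) -> list:
    """Return the notes an answer must include for the given confidence tags."""
    # EXTRACTED needs no note, so it has no entry in the table
    return [DISCLAIMERS[t] for t in ("INFERRED", "UNVERIFIED", "AMBIGUOUS") if t in tags]

print(required_disclaimers({"EXTRACTED", "INFERRED"}))
```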

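### Step 5: ask whether to save to the wiki

If the answer is valuable (new knowledge, customer information, a decision record), ask the user:
> "Should I save this answer to the wiki?"

If yes, create a page through MCP:
```bash
mcporter call obsidian.write_file \
  path=$WIKI_ROOT/pages/<category>/queries/<date>-<slug>.md \
  content='...'
```
Then run the ingest workflow to update index.md and log.md.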
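
The target path follows the `pages/<category>/queries/<date>-<slug>.md` pattern. The sketch below fills it in; the ISO `YYYY-MM-DD` date format is an assumption (the source does not fix one), and `query_page_path` is a hypothetical helper:

```python
from datetime import date
from pathlib import PurePosixPath

def query_page_path(wiki_root: str, category: str, slug: str, day: date) -> str:
    """Build the path for a saved query page under pages/<category>/queries/."""
    name = f"{day.isoformat()}-{slug}.md"
    return str(PurePosixPath(wiki_root) / "pages" / category / "queries" / name)

print(query_page_path("/w", "aws", "security-hub-pricing", date(2026, 4, 25)))
# /w/pages/aws/queries/2026-04-25-security-hub-pricing.md
```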