@clawhub-hgvgfgvh-345c67f17a
Enables a strict 7-phase collaborative workflow for diagnosing and fixing complex, intermittent, or multi-layer bugs with verified user input and data-backed...
---
name: complex-bug-debugging-with-ai
description: A meta-methodology for collaborative debugging between humans and AI on complex bugs. When the user reports bugs that are "weird / intermittent / multi-layered / not fixed by restart / cross-system / stuck for a long time", activate this SKILL's 7-phase workflow: "Business-Flow Alignment → Symptom Structuring → Boundary Probing Loop → Solution Layout → Execute & Verify → Failure Escalation → Closed-Loop Documentation". This is dual discipline — it constrains BOTH the AI (no subjective claims, no riding assumptions, must announce failed plans, must stop when user is uncooperative) AND the user (verify the flow diagram, answer structured questions, give precise counter-signals, own the solution decision). On non-cooperation, the AI must call it out using built-in scripts and never push forward "while sick".
---
# Complex Bug Debugging with AI (Engineering Harness for Human × AI Collaboration)
## What this is
Not a case library — **the collaboration workflow itself**.
> Case library `bug-pattern-diagnosis` answers "**what is this bug**"
> This SKILL answers "**how to debug a complex bug together with AI**"
**Core belief**: complex bugs cannot be cracked by AI alone, nor by humans alone. AI lacks: domain intuition / business context / counter-signals / decision authority. Humans lack: bandwidth to run 100 commands. **Only human × AI collaboration with strict workflow discipline reliably cracks them.**
## When to activate
Activate proactively when the user describes:
- "Stuck / been debugging this for a long time"
- "Weird / not reproducible / intermittent"
- "Heals after restart, but comes back"
- "Looks like X, but fixing X didn't help"
- "Multiple services / nodes / clusters involved"
- "Looks contradictory on the surface"
For plain NPE / compile errors / "how do I write this function" → **do NOT activate**, just handle directly.
---
## Prerequisite: model and capability pre-check (MUST do)
### 1. Model must be Opus 4.7 (or equivalent)
- Weak models ride the first hypothesis forever (internally consistent but wrong) and drive the user into a ditch
- Opus 4.7 **counter-doubts itself** (e.g. suspects "the workspace code may not match the deployed code" and proactively pulls the deployed jar to decompile and compare)
- **If current model is not Opus 4.7 → tell user to switch first, do not push forward "while sick"**
### 2. Capability completeness
The floor of debugging capability is set by the **weakest tool**:
| Capability | Impact if missing |
|---|---|
| Code access (Read / Grep) | Cannot verify business logic |
| Infrastructure (K8S MCP / SSH) | Cannot inspect pods / nodes |
| Data access (DB MCP) | Forced to trust verbal reports |
| Log access (real logs) | Stuck "guessing the stack" |
| Network / HTTP | Cannot run experiments |
| Specialized SKILLs (e.g. server-log-analysis) | Efficiency drops |
**Plug whatever is missing. Do not start work while sick.**
---
## Four hard rules the AI must follow throughout
### ① No subjective claims
Every conclusion must be backed by **data we just collected or code we just read**. **Forbidden**: "should be / probably is / usually is" as a conclusion. **Allowed**: "based on the metrics I just pulled, the cause is ...".
### ② No riding on assumptions
User's stated direction ≠ truth. Your previous round's hypothesis ≠ confirmed fact. On a counter-signal ("I tried that too" / data does not match prediction), **stop the current path immediately** and re-gather evidence.
### ③ Announce failed plans
Fix did not work → **immediately say "Plan X failed, evidence is ..."**, auto-escalate to next plan. **Forbidden**: "should be fixed, you try" / "partially worked..." / silently switching plans.
### ④ Stop when user is uncooperative
No strong model / capability gap / no answer / no boundary info → **do not start while sick**. Use the scripts below to call it out. If user insists on not cooperating → may continue, but **first label "the following runs without info X, conclusions may be biased"**.
---
## Dual Discipline: proactive inquiry + mandatory user-cooperation checks
> Collaboration is not one-sided. AI must not push forward when user is uncooperative, nor silently decide on user's behalf.
### Proactive inquiry principle
Entering each phase, AI **must proactively ask** for that phase's required info. **Forbidden**: user gives a vague description, AI dives in head-first.
### 9 user-uncooperative signals + AI scripts (use directly, do not improvise)
#### ① Not using a strong model
```
⚠️ Current model is not Opus 4.7. Weak models ride assumptions (consistent but wrong).
Recommend switching first. If you insist, challenge every "should be ..." with "what data backs this?".
```
#### ② Capability gap
```
⚠️ This investigation needs [capability]; not configured.
Impact: [impact]. Please configure first.
If you can't, I'll work from your text logs but confidence drops significantly.
```
#### ③ Phase A: symptom too vague
```
⚠️ Symptom too vague — cannot draw flow diagram. Please provide at least 2 of 3:
1. One-line symptom ("API X returns 500 / device sends register but no reply")
2. Real log / API response / screenshot
3. Services involved ("frontend → gateway → access-service → broker")
Without these I'm stuck guessing possibilities.
```
#### ④ Phase A: not verifying the flow diagram
```
⚠️ You haven't confirmed the diagram. If it's wrong, every later discussion is on a wrong premise.
Reply "right" or "wrong, the key is XXX" before we continue.
```
#### ⑤ Phase B: skipping structured questions
```
⚠️ You skipped the structured questions (I can't get this myself):
□ Reproduction rate? □ Environment? □ Recent changes? □ Did YOU reproduce it (gold question)?
Without these I can only guess. Please answer each.
```
#### ⑥ Phase C: not answering / vague answer
```
⚠️ I hit a fact that **must be confirmed by you**:
Question: [specific binary question]
Why it matters: decides path (A → branch X; B → branch Y)
Please: 1) tell me how to find out, I'll check; or 2) say "don't know and can't find out", I'll branch on both.
Don't change the topic, otherwise I can't shrink the diagnostic space.
```
#### ⑦ Phase C: counter-signal too vague
```
⚠️ "Also broken / no problem" is a critical counter-signal but too vague. Please add:
- How exactly did you try? (command / tool / steps)
- What did you see? (output / error code)
- Was the environment identical?
Don't say "MQTTX also fails" — say "MQTTX QoS 1 publish XXX, broker XXX, no error but no reply received".
```
#### ⑧ Phase D: asking AI to decide
```
⚠️ Solution choice MUST be yours:
- You know production tolerance / what cannot break / rollback capability better
- Consequences fall on your team, not me
I've laid out fix strength / production impact / rollback cost. Decide based on "how much impact today is acceptable".
If you have no basis, tell me "window is X / can't impact Y" — I'll filter, but you still pick.
```
#### ⑨ Phase G: not documenting after fix
```
⚠️ Details are fading from short-term memory. Strongly recommend documenting now (5 min):
- BUGxx.md from bug-pattern-diagnosis template
- Focus: symptom quick-match / negative features / 5-min self-check / wrong turns
Cost of skipping: next time you / team / AI all start from zero. Reply "document" or "skip" — be explicit.
```
### Compliance Gates (self-check before each transition)
| Transition | Gate |
|---|---|
| A → B | Did user verify the flow diagram? |
| B → C | Answered structured questions? Filled "I reproduced it"? |
| Each loop in C | Last round's question answered? Counter-signal specific? |
| C → D | Decisive evidence sufficient? AI not self-persuading? |
| D → E | User picked a plan? Or making AI decide? |
| E → F/G | Verification complete? Before/after side-by-side? |
| G done | Agreed to document? BUGxx.md complete? |
**Any failed gate → stop and use the script. Do not push past it.**
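The gate table can be read as a lookup from a phase transition to the script used when the gate fails. The following is a minimal Python sketch under that reading. The `GATES` dictionary and the `check_transition` function are hypothetical names, and only the transitions whose script number this document states explicitly are encoded.

```python
# Hypothetical encoding of the gate table: (from, to) -> (condition, fallback script).
# Only transitions whose script number is stated in this document are listed.
GATES = {
    ("A", "B"): ("user verified the flow diagram", "④"),
    ("B", "C"): ("structured questions answered", "⑤"),
    ("D", "E"): ("user picked a plan", "⑧"),
    ("G", "done"): ("user agreed to document", "⑨"),
}


def check_transition(src: str, dst: str, passed: bool) -> str:
    """Return the action for a transition: proceed, or stop and use the script."""
    condition, script = GATES[(src, dst)]
    if passed:
        return f"{src} -> {dst}: gate passed ({condition})"
    # A failed gate never proceeds; it names the script to use instead.
    return f"STOP at {src}: gate failed ({condition}); use script {script}"


if __name__ == "__main__":
    print(check_transition("A", "B", passed=False))
```

Keeping the gates in one table makes it harder to skip a check silently, which is the behaviour the rule above forbids.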
---
## The 7-phase workflow
### Phase A: Business-Flow Alignment [draw the map first, do not fix yet]
> Different mental models of the "flow" → every later discussion is two ships passing in the night.
**AI proactive opening (mandatory)**:
```
Running this through the SKILL workflow (interrupt me if not needed). For Phase A I need:
1. One-line symptom (don't guess causes yet)
2. Real log / API response / screenshot
3. Which services / flow it touches
I'll draw the diagram for you to confirm.
```
**AI does**:
1. Asks / Reads code, draws end-to-end flow diagram
2. Re-states symptom: "what I understood = what you said"
3. Lists "I know X" / "I do not know Y"
**Human verification (mandatory Gate)**:
- "Right" → enter B
- "Wrong, key is XXX" → redraw
- No verification → use script ④
**Anti-patterns**: diving into code first / moving past without verification.
### Phase B: Symptom structuring + domain info gathering
**AI proactive opening (mandatory)**:
```
Entering B. Answer each (any miss skews the investigation):
□ Reproduction rate: 100% / intermittent / specific conditions?
□ Environment: reproduces locally?
□ Multi-instance: single / multi-replica?
□ Recent changes: deploy / scale-out / config / dependency upgrade?
□ Log signature: concentrated / spread? time window?
□ Did YOU reproduce it? (gold question) Method? What did you see?
□ What directions have you suspected / ruled out?
I'll re-prompt anything vague or skipped.
```
**Human supplies domain info**: "this is broker cluster" / "we scaled out last week" / "I tried with MQTTX, also fails" ← **this 'I reproduced it' is gold**.
**Gate**: 5+ items answered → C; under 3 → script ⑤; vague counter-signal → script ⑦.
**Anti-patterns**: investigating without structuring / filling skipped items by imagination.
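The Phase B gate can be sketched as a small function over the checklist answers. This is an illustration only: the item keys and the `phase_b_gate` name are hypothetical, and the document does not say what happens when 3 or 4 items are answered, so the sketch assumes the AI re-prompts the skipped items in that case.

```python
def phase_b_gate(answers: dict[str, str]) -> str:
    """Decide the next action from the Phase B checklist answers.

    `answers` maps a checklist item to the user's reply; an empty or
    whitespace-only reply counts as skipped.
    """
    answered = sum(1 for reply in answers.values() if reply.strip())
    if answered >= 5:
        return "enter C"
    if answered < 3:
        return "use script ⑤"
    # Assumption: the source leaves the 3-4 range unspecified.
    return "re-prompt skipped items"


if __name__ == "__main__":
    replies = {
        "reproduction_rate": "about 50%",
        "environment": "prod only",
        "multi_instance": "2 replicas",
        "recent_changes": "",
        "log_signature": "",
        "user_reproduced": "yes, with MQTTX",
        "ruled_out": "",
    }
    print(phase_b_gate(replies))
```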
### Phase C: AI-driven boundary probing loop [core engine]
> Complex bugs almost never get pinpointed by a single experiment. Must converge by looping.
**AI proactive opening (mandatory)**:
```
Entering C. Loop: symptom → boundary experiment → side-by-side data → if doubt, ask you.
Each round I'll: state hypothesis explicitly, show data side-by-side, stop and ask on facts that need you.
Interrupt me anytime with "wait, why does this say XXX?" — encouraged, helps me avoid self-persuasion.
This round expects [commands], needs [capability]. Capabilities ready?
```
**AI per loop**:
1. Design experiment that **bisects the diagnostic space** (not exhaustive command-spam)
2. Auto-execute: MCP / shell / code reads / cross-node compare
3. **Display side-by-side**:
| Experiment | Predicted | Actual | Match? |
|---|---|---|---|
| Entry A | should pass | passed ✅ | ✓ |
| Entry B | should pass | failed ❌ | **✗ anomaly** |
4. Self-check "actual fully matches hypothesis?":
- Full match + sufficient → tentative conclusion → D
- Any "doesn't fit" data → **do not force conclusion**, list doubts, ask
- Insufficient → next round
**Human**: read AI's listed doubts / **interrupt** AI's self-persuasion: "wait, why does that number say XXX?"
**Gate**: AI's questions must be **answered or explicitly marked "don't know"**. Counter-signals must be specific.
**Anti-patterns**: 10 commands without side-by-side / "exhaustive" not "bisecting" / partial match → conclude / **not exposing doubts (worst!)** / continuing after unanswered question.
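One round of the loop, the side-by-side display plus the self-check, can be sketched as follows. The `Result` class and both function names are hypothetical, and comparing predicted and actual values as plain strings is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Result:
    experiment: str
    predicted: str
    actual: str

    @property
    def matches(self) -> bool:
        return self.predicted == self.actual


def side_by_side(results: list[Result]) -> str:
    """Render predicted vs. actual as a markdown table."""
    lines = ["| Experiment | Predicted | Actual | Match? |", "|---|---|---|---|"]
    for r in results:
        mark = "✓" if r.matches else "**✗ anomaly**"
        lines.append(f"| {r.experiment} | {r.predicted} | {r.actual} | {mark} |")
    return "\n".join(lines)


def next_step(results: list[Result], evidence_sufficient: bool) -> str:
    """Apply the self-check: any mismatch blocks a conclusion."""
    if any(not r.matches for r in results):
        return "list doubts and ask the user"
    if evidence_sufficient:
        return "tentative conclusion, enter D"
    return "design the next round"


if __name__ == "__main__":
    round_1 = [Result("Entry A", "pass", "pass"), Result("Entry B", "pass", "fail")]
    print(side_by_side(round_1))
    print(next_step(round_1, evidence_sufficient=True))
```

The mismatch check runs before the sufficiency check, so a single anomaly blocks a conclusion even when the rest of the evidence looks complete.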
### Phase D: Solution design + risk laydown [AI lays out, human decides]
**AI proactive opening (mandatory)**:
```
Entering D. I list every viable plan, **final pick is yours**.
Tell me: maintenance window today? what cannot break? rollback capability?
If you say "you choose" → look at "production impact" column first. I won't decide for you (you bear consequences).
```
**AI lists all plans, never decides**:
| Plan | Steps | Fix strength | Production impact | Rollback cost | Recommendation | Reasoning |
|---|---|---|---|---|---|---|
**Gate**: explicit pick → E; "you choose" → script ⑧; rushing without picking → "I will not act before you pick".
**Anti-patterns**: saying "I recommend X" and then acting on it / hiding plans / no production-impact assessment.
### Phase E: Execute + verify in real time [prove while you act]
> "I think it's fixed" is the biggest trap.
**AI does**:
1. Execute fix
2. **Immediately re-run Phase C's decisive experiment** (same command, same input)
3. Before/after side-by-side:
| Metric | Before | After | Matches expectation? |
|---|---|---|---|
**Anti-patterns**: "should be fixed" without verifying / partial improvement → "fixed" / delegating verification.
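The before/after check can be sketched as a function that declares the fix verified only when every metric meets its expectation. The metric names, the thresholds in the usage example, and the `verify_fix` name are hypothetical.

```python
from typing import Callable


def verify_fix(
    before: dict[str, float],
    after: dict[str, float],
    expectations: dict[str, Callable[[float], bool]],
) -> tuple[str, bool]:
    """Return a before/after markdown table and whether ALL expectations hold."""
    lines = ["| Metric | Before | After | Matches expectation? |", "|---|---|---|---|"]
    all_ok = True
    for metric, expected in expectations.items():
        ok = expected(after[metric])
        all_ok = all_ok and ok  # partial improvement does not count as fixed
        lines.append(f"| {metric} | {before[metric]} | {after[metric]} | {'✅' if ok else '❌'} |")
    return "\n".join(lines), all_ok


if __name__ == "__main__":
    # Illustrative values only.
    table, fixed = verify_fix(
        before={"routing_entries": 3, "cross_node_failures": 5},
        after={"routing_entries": 46, "cross_node_failures": 0},
        expectations={
            "routing_entries": lambda v: v >= 41,
            "cross_node_failures": lambda v: v == 0,
        },
    )
    print(table)
    print("fixed" if fixed else "not fixed: go to Phase F")
```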
### Phase F: Proactive escalation on plan failure [most failure-prone]
**AI does**:
1. Data does not match → **immediately say "Plan X failed, evidence is ..."**
2. Analyze failure cause
3. **Auto-escalate to next plan** (unless next plan's risk goes up — then ask human)
4. Re-execute + re-verify
**Real-case example**:
```
Plan 1 (restart pod) failed. Evidence: routing table predicted ≈41, actually 3 ❌; cross-node publish still failing ❌.
Cause: hostPath persistence makes node skip mria full bootstrap on restart.
Escalating to Plan 3 (cluster leave + join): routing table 3 → 46 ✅; cross-node publish all pass ✅. Fix successful.
```
**Anti-patterns**: "should be fixed, you try" / "partially worked..." / silently switching / asking user to decide next step.
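The escalation rule (announce the failure with evidence, move to the next plan automatically, pause only when risk goes up) can be sketched as a loop. The `Plan` class, the integer `risk` field, and the `run_with_escalation` name are hypothetical; how risk is actually rated is left open by this document.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Plan:
    name: str
    risk: int  # hypothetical ordinal risk level; higher means riskier
    apply: Callable[[], None]


def run_with_escalation(
    plans: list[Plan], verify: Callable[[], tuple[bool, str]]
) -> tuple[list[str], bool]:
    """Try plans in order; announce each failure; stop to ask when risk rises."""
    log: list[str] = []
    previous_risk = None
    for plan in plans:
        if previous_risk is not None and plan.risk > previous_risk:
            log.append(f"{plan.name} raises the risk level: asking the user before continuing")
            return log, False
        plan.apply()
        ok, evidence = verify()  # re-run the decisive experiment
        if ok:
            log.append(f"{plan.name} succeeded, evidence is {evidence}")
            return log, True
        log.append(f"{plan.name} failed, evidence is {evidence}")
        previous_risk = plan.risk
    return log, False


if __name__ == "__main__":
    state = {"fixed": False}
    plans = [
        Plan("Plan 1", 1, lambda: None),
        Plan("Plan 2", 1, lambda: state.update(fixed=True)),
    ]
    log, ok = run_with_escalation(plans, lambda: (state["fixed"], f"fixed={state['fixed']}"))
    print("\n".join(log))
```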
### Phase G: Closed-loop documentation [mandatory closing]
> Document immediately, do not push to tomorrow: hard-won details fade fast.
**AI proactive opening (mandatory, do not wait for user)**:
```
✅ Fix verified. **Entering G now (mandatory)** — details fading fast.
Writing BUGxx.md from bug-pattern-diagnosis template (5 min).
Confirm: □ Document (default) → start writing □ Skip → say "skip", and understand: next time everyone starts from zero
```
**AI writes** `BUGxx.md` using `bug-pattern-diagnosis` template, **4 mandatory sections**:
- **Symptom quick-match** (verifiable, greppable)
- **Negative features** (when this case does NOT apply ← prevents misdiagnosis)
- **5-minute self-check commands** (next person can copy-paste)
- **Wrong turns this time** (why Plan 1 failed / why we thought it was X)
**Gate**: no response → script ⑨ + default to documenting. "Skip" → say "OK, I won't learn from this either".
**Anti-patterns**: not documenting / waiting for user to bring it up / case missing negative features and wrong turns.
---
## One-page diagram (compact)
```
[Pre-check] model = Opus 4.7 + capability complete ← any miss → script ① / ②
↓
[A] Flow alignment ─ open: "symptom/log/services" ─ Gate: user verifies diagram ─ red: don't dive into code
↓
[B] Symptom structuring ─ open: 7-item checklist ─ Gate: ≥5 answered + "did you reproduce" ─ red: must collect counter-signals
↓
[C] Boundary probing loop ─ open: "bisect/side-by-side/ask doubts" ─ Gate: user must answer ─ red: no subjective / no riding / expose doubts
↓
[D] Solution layout ─ open: "window/what cannot break/rollback" ─ Gate: user picks ─ red: AI doesn't decide / doesn't hide plans
↓
[E] Execute + verify ─ red: not verified = not fixed
↓ ──fixed──→ [G]
↓
└──not fixed──→ [F] AI declares failure + evidence + auto-escalates → back to E
↓
[G] Documentation ─ open: default to document ─ Gate: no response → default to document
```
---
## Anti-pattern quick reference (human / AI ↔ scripts)
**AI anti-patterns** (self-watch): skipping A and diving into code / no side-by-side display / "should" as a conclusion / continuing past a counter-signal / self-persuading into a fast conclusion / acting in D before the user picks / no verification after executing / vague language hiding a failure / delegating verification / not documenting.
**Human anti-patterns → AI script**:
| Human anti-pattern | Script |
|---|---|
| Weak model on complex bug | ① |
| Missing key capability | ② |
| Symptom too vague | ③ |
| Pushing past flow diagram | ④ |
| Skipping structured questions | ⑤ |
| Not answering / vague | ⑥ |
| Counter-signal too coarse | ⑦ |
| Asking AI to decide | ⑧ |
| Not documenting | ⑨ |
| Throwing bug to AI and walking away | ⑥+⑦ AI proactive ping |
| "Just fix per BUGxx" | "Cases inspire direction, not the answer. Start at A to align flow" |
> **AI must not enable non-cooperation. Using a script ≠ refusing collaboration — it makes the cost of non-cooperation visible so the user can decide.**
---
## Relationship with `bug-pattern-diagnosis`
`bug-pattern-diagnosis` = **case library** (illnesses already seen); this SKILL = **treatment manual** (how to see a patient).
**Typical chain**: user reports complex bug → this SKILL runs 7 phases → at Phase C use `bug-pattern-diagnosis` for inspiration → return to C and continue → success → at Phase G write new BUGxx.md via `bug-pattern-diagnosis` template. They feed each other.
---
## Self-evolution
After every investigation: any new red line? anti-pattern not covered? phase to split? Yes → proactively suggest update. **This SKILL was itself evolved using its own methodology** — that is its self-consistency property.
---
## One-line summary
> **Complex-bug debugging = Opus 4.7 × complete capabilities × 7 phases × 4 AI red lines × dual cooperation gating × closed-loop documentation.**
> **This SKILL constrains AI AND user. On non-cooperation, AI must call it out via the scripts and let the user choose to fix or skip — not push forward "while sick".**
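Matches the bug symptoms a user describes against a library of documented bug cases to quickly judge what class of problem it is, where the root cause lies, and how to investigate. After each successful diagnosis the new case is written back to the library so experience keeps accumulating. Suited to reports of "strange errors / intermittent failures / reproduces only in some environments / bizarre logs".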
---
name: bug-pattern-diagnosis
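description: Matches the bug symptoms a user describes against a library of documented bug cases to quickly judge what class of problem it is, where the root cause lies, and how to investigate. After each successful diagnosis the new case is written back to the library so experience keeps accumulating. Suited to reports of "strange errors / intermittent failures / reproduces only in some environments / bizarre logs".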
---
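# Bug Pattern Diagnosis (Bug Symptom Diagnosis and Experience Capture)
## Skill responsibilities
This Skill **diagnoses bugs the way a doctor examines a patient**:
1. **Collect symptoms**: have the user describe how the bug presents (error messages, reproduction rate, environment, log signature, and so on)
2. **Recall past cases**: look through the `experience/` directory for similar bugs seen before, **as reference memory only**
3. **Diagnose independently**: investigate on the merits of the current context; past cases are **only a source of ideas and hypotheses**, and their conclusions are never reused directly
4. **Capture new experience**: after each newly located bug, write it into the case library as a new `BUGxx.md` for future **reference**
**Core value**: let experience from past pitfalls **inspire** the direction of the next investigation without replacing independent thinking. Similar symptoms are not necessarily the same bug; identical-looking logs can hide completely different root causes.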
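## Iron rule: experience is a reference, not the answer
> **This is the most fundamental usage principle of this Skill.**
The `BUGxx.md` files in the case library are **a veteran doctor's case notes**, not **prescription templates**. When a new patient arrives:
- ✅ **Allowed**: take from the case notes "what this kind of symptom usually makes worth suspecting", "what methods found it before", and "which pitfalls are easy to hit"
- ❌ **Not allowed**: copy the root cause, the fix, or the code change just because the symptoms look similar
### Why cases must not be quoted directly
1. **Symptoms can be similar while root causes differ completely**: the same "NPE + intermittent + multi-replica" could be the dropped header from this case, an occasional Redis connection-pool failure, a race during a GC stop-the-world pause, or, more likely still, **several problems stacked together**.
2. **Project environments differ enormously**: the same code structure can behave very differently under different versions, configurations, and dependencies.
3. **AI misjudgment is costly**: copying a case verbatim makes the user change code on a wrong premise, drags the investigation off course, and buries the real bug.
4. **The value of experience is to "prompt thinking", not to "hand over answers"**: a good doctor reads case notes to broaden the line of thought, not to copy and paste.
### Correct usage
| Scenario | Wrong | Right |
|---|---|---|
| High symptom match | "This is BUG01. Add `x-token-payload` in `invokeRemoteDeviceOpt`" | "What you describe reminds me of one trait of BUG01: asymmetric logs across replicas. I suggest you **verify first**: do the cross-replica logs really show 'one with a stack trace, one without'? If so, we follow that direction" |
| Partial feature match | "It is not exactly the same, but BUG01's fix should work" | "BUG01 has **an investigation technique that may apply**: send 100 consecutive requests and see whether the success rate is about `1/N`. Use it first to check whether this is a cross-replica state inconsistency" |
| Case contains concrete code | Paste BUG01's fix code for the user to copy | "BUG01's fix was to complete header propagation, but for your project we first need to confirm: (1) which headers does your receiver actually depend on? (2) does the serialization match the entry point? Code comes only after these are answered" |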
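## Case library structure
```
bug-pattern-diagnosis/
├── SKILL.md              ← this file (responsibilities, flow, matching rules)
└── experience/           ← case library
    ├── BUG01.md          ← one document per case
    ├── BUG02.md
    └── ...
```
Every `BUGxx.md` follows a fixed structure so it can be searched quickly:
1. **Case summary** (one sentence)
2. **Symptoms / feature quick-match** (like the "positive signs" in a medical record, used for matching)
3. **Detailed explanation** (mechanism / root-cause chain)
4. **Investigation methodology** (diagnostic flow / key techniques)
5. **Fix** (root fix + fallback + hardening)
6. **Prevention checklist** (checklist to prevent recurrence)
7. **Playbook for similar bugs** (step-by-step investigation when a similar symptom appears)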
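## When to activate
Prefer this Skill when the user describes problems such as:
- "An endpoint **sometimes** errors and sometimes works"
- "Reproduces in production but not locally or in the test environment"
- "The error in the logs looks strange / does not match the code / the stack points nowhere clear"
- "Behavior differs across pods / instances / replicas / machines"
- "I clearly sent xxx, but the server says xxx is missing"
- "Occasional timeouts / occasional 500s / occasional permission denied"
- Any description of the form "**interleaved / intermittent / non-deterministic**"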
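## Core flow
### Step 1: Read the case library index
Read the "**Symptoms / feature quick-match**" section of **every** `BUG*.md` file under `experience/` (the first 30-50 lines of each file are usually enough). Do not read whole files at the start.
### Step 2: Structure the symptoms
Extract the **key features** from the user's description, covering at least these dimensions:
| Dimension | Examples |
|---|---|
| Error signal | NPE / 500 / 403 / timeout / wrong data / deadlock |
| Reproduction rate | 100% / 50% / occasional / specific conditions |
| Environment differences | Not reproducible locally? Reproduces in test? In production? |
| Multi-instance traits | Single replica / multiple replicas? How many? |
| Log distribution | Concentrated on one instance / spread over several / one with a stack trace and one without |
| Trigger conditions | Specific user / specific parameters / specific time window |
| Recent changes | Release? Config change? Scale-out or scale-in? Dependency upgrade? |
### Step 3: Recall cases (not match conclusions)
**Compare** the structured features against each case in the library for similarity, but do **not** conclude from that. However high the similarity, a case is only **"past experience worth consulting"**, not "the confirmed answer".
How to treat similarity:
- **High symptom match** (3 or more key features hit) → treat the case as the "**priority suspect direction**", but first guide the user to **independently verify** that the key features really hold
- **Partial symptom match** (1-2 features hit) → treat the case as a "**possible source of ideas**" and point out that one or two of its investigation techniques may apply
- **No match** (0 hits) → use the general methodology and investigate from scratch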
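The three similarity tiers can be sketched as a function over sets of features. Representing features as plain strings and counting exact overlaps is a simplifying assumption, and `recall_tier` is a hypothetical name; in practice the comparison is a judgment call, not string matching.

```python
def recall_tier(symptoms: set[str], case_features: set[str]) -> str:
    """Classify how a past case should be used, based on feature hits."""
    hits = len(symptoms & case_features)
    if hits >= 3:
        return "priority suspect direction (verify independently first)"
    if hits >= 1:
        return "possible source of ideas"
    return "no match: use the general methodology"


if __name__ == "__main__":
    bug01 = {"intermittent", "multi-replica", "NPE", "asymmetric logs"}
    reported = {"intermittent", "multi-replica", "NPE"}
    print(recall_tier(reported, bug01))
```

Even the top tier returns a direction to verify, never a diagnosis, which matches the rule that cases are recalled rather than matched to conclusions.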
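### Step 4: Give investigation suggestions (not a diagnosis)
Do **not** pronounce "this is BUGxx". Speak in the tone of "there are similar past cases; here are **possible directions and ways to verify them**".
Recommended response structure:
- 🧭 **Structured restatement of the current symptoms** (confirm nothing was misunderstood)
- 💭 **Experience reference**: mention 1-2 related cases (simply "this reminds me of BUGxx, investigated earlier, which had a similar trait X"), but **do not paste the root cause or the fix code**
- 🔬 **Independent verification suggestions**: list 2-3 quick checks that can test the hypothesis fast (success-rate statistics over consecutive requests / cross-replica log comparison / packet capture of a given header, and so on)
- 🎯 **Concrete investigation path for the current project**: give **tailored** next actions based on the user's project structure and code, not the generic steps from the case
- ⚠️ **State the uncertainty**: make clear that "this is only a guess based on similar symptoms; the final root cause can be confirmed only with the user's help in verifying"
### Step 4 counter-example (forbidden output)
```
❌ "Based on the symptoms you describe, this is BUG01 (intermittent multi-replica NPE).
 The root cause is that invokeRemoteDeviceOpt fails to pass x-token-payload.
 Apply the following code change: [BUG01 code pasted directly]"
```
### Step 4 positive example
```
✅ "A few traits in your description stand out:
 1) Intermittent failure, with a success rate that seems close to 50%
 2) The logs look token-related even though the client clearly sent a token
 This reminds me of BUG01, investigated earlier, where similar symptoms turned out to be
 'identity header dropped on internal calls between replicas'. Your project is not
 necessarily the same problem, so **I suggest verifying a few key hypotheses first**:
 Q1: How many replicas does your service run? Is the success rate roughly 1/N?
 Q2: Can you tail all replica logs at once and check whether a failed request leaves
 traces on several replicas, with asymmetric levels of detail?
 Run these two checks and tell me the results, then we decide the investigation path."
```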
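### Step 5: Capture a new case (optional)
**Proactively ask the user** whether to turn this investigation into a new case when:
- The case library had no match at all, but the bug was successfully located this time
- An existing case matches part of the symptoms, but there is a significant new variant
- The investigation produced a new diagnostic method
Once the user agrees, create `BUGxx.md` following the "**Case document structure**" (template below), numbered as the current highest number + 1.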
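## General methodology (use when no existing case matches)
Investigate an unfamiliar bug in this order:
### 1. Quantify the symptom
- Success rate? Send 100 consecutive requests and count
- Reproduction conditions? Can a minimal case reproduce it reliably?
- Failure time distribution? Concentrated in a time window / a user / a machine?
### 2. Check the deployment topology
- How many replicas? Is the success rate roughly `1/N` or `(N-1)/N`?
- Does it reproduce with a single replica? If not → strongly suggests inconsistent behavior across replicas
- Any gray release or canary? Are old and new versions running side by side?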
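The replica-ratio check is simple arithmetic and can be sketched as follows. The function name and the 10% tolerance are arbitrary illustrative choices, and a matching ratio is only a hint that routing is replica-dependent, not proof.

```python
def replica_ratio_hint(
    successes: int, total: int, replicas: int, tolerance: float = 0.1
) -> tuple[float, list[str]]:
    """Compare an observed success rate with 1/N and (N-1)/N for N replicas."""
    if total <= 0 or replicas < 1:
        raise ValueError("total must be positive and replicas at least 1")
    rate = successes / total
    candidates = {"1/N": 1 / replicas, "(N-1)/N": (replicas - 1) / replicas}
    matched = [label for label, value in candidates.items() if abs(rate - value) <= tolerance]
    return rate, matched


if __name__ == "__main__":
    rate, matched = replica_ratio_hint(successes=33, total=100, replicas=3)
    print(f"success rate {rate:.0%}, close to: {matched or 'neither ratio'}")
```

With two replicas both ratios equal 50%, so the function reports both labels; the ratio alone cannot tell which replica is the healthy one.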
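### 3. Compare logs across instances
- By traceId or time window, pull the logs of **all related instances** and read them side by side
- Watch for "the same request leaving mismatched evidence on two instances"
- Watch for "different instances logging at different levels of detail" (one with a stack trace, one without)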
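The cross-instance comparison can be sketched as grouping log lines by trace ID and flagging traces whose logs are asymmetric across instances. The `traceId=<id>` log format, the `at ` prefix used to detect stack frames, and both function names are assumptions for illustration; real log formats will differ.

```python
import re
from collections import defaultdict

TRACE = re.compile(r"traceId=(\w+)")  # hypothetical log format


def group_by_trace(logs: dict[str, list[str]]) -> dict[str, dict[str, list[str]]]:
    """Group lines as trace id -> instance -> lines.

    Lines without a trace id (such as stack frames) are attached to the
    most recent trace id seen on the same instance.
    """
    grouped: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for instance, lines in logs.items():
        current = None
        for line in lines:
            found = TRACE.search(line)
            if found:
                current = found.group(1)
            if current is not None:
                grouped[current][instance].append(line)
    return grouped


def asymmetric_traces(grouped: dict[str, dict[str, list[str]]]) -> list[str]:
    """Return trace ids seen on several instances where only some have a stack."""
    flagged = []
    for trace_id, per_instance in grouped.items():
        has_stack = {
            any(line.lstrip().startswith("at ") for line in lines)
            for lines in per_instance.values()
        }
        if len(per_instance) > 1 and has_stack == {True, False}:
            flagged.append(trace_id)
    return flagged


if __name__ == "__main__":
    logs = {
        "pod-a": ["ERROR traceId=abc system exception", "\tat Controller.method(Controller.java:1)"],
        "pod-b": ["ERROR traceId=abc response 10001"],
    }
    print(asymmetric_traces(group_by_trace(logs)))
```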
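### 4. Code vs. deployment consistency
- Decompile the `.class` files in the deployed jar with `javap -c -p` and compare them with the local source
- Image tag and git commit hash must line up
- Are ConfigMaps / environment variables identical across all replicas?
### 5. Protocol-boundary audit
If a context-propagation problem is suspected:
- Draw the full lifecycle of "identity / traceId / MDC / ThreadLocal"
- Mark every "cross-thread / cross-process / cross-instance" boundary
- Check whether each boundary has an explicit pack-and-unpack mechanism
- Use tcpdump, or print `getHeaderNames()` on the receiver, to verify which headers are actually passed
### 6. Compare with similar code
- Is there code in the project that **does the same thing** but **has no bug**?
- Diff the two → the difference is very likely where the bug comes from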
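## Case document structure (template for creating a new BUG*.md)
```markdown
# BUGxx: <one-line title that highlights the most characteristic symptom>
## Case summary
<One paragraph, under 200 words. State: symptom, reproduction conditions, root-cause type, scope of impact>
## Symptoms / feature quick-match (used for matching)
> When N or more of the following features hold together, this case is highly likely
- [ ] Feature 1 (specific, verifiable)
- [ ] Feature 2
- [ ] Feature 3
- [ ] ...
### Key log fingerprints
<Paste typical error-log snippets so they can later be matched directly with grep>
### Negative features that will not appear (exclusions)
- If <xxx> appears, it is not this case
## Detailed explanation / root-cause chain
<Mechanism. Use diagrams / tables / code references to explain data flow, control flow, and state changes>
## Investigation methodology
### Techniques used
- <Method 1: e.g. "cross-replica log comparison">
- <Method 2: e.g. "javap bytecode comparison">
- ...
### Diagnostic steps (in order)
1. ...
2. ...
3. ...
## Fix
### Root fix
<Core fix point, code example, impact assessment>
### Fallback / defense
<Second-line defenses so nothing crashes if the root fix misses a spot>
### Hardening / cleanup
<Long-term improvements, coding conventions, scans for similar code>
## Prevention checklist
Development:
- [ ] ...
Deployment:
- [ ] ...
Review:
- [ ] ...
## Playbook for similar intermittent bugs
When a similar symptom appears, proceed in this order:
1. Quantified reproduction (10 min)
2. Check topology (5 min)
3. Log comparison (30 min)
4. ...
## References
- Related file paths
- Related PRs / commits
- Related wiki pages
```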
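## Output style constraints
- **Direction first, then the evidence chain**: tell the user what is currently worth suspecting and why, instead of asserting "this is bug X"
- **Never conclude in a 100% certain tone**: similar symptoms do not imply the same root cause; use phrasing such as "looks like / worth suspecting / may be / there are similar past cases"
- **Suggestions must always be actionable**: give specific commands and specific quick checks, not only abstract theory
- **Encourage verifying before changing code**: better to ask "please look at xxx first" a few more times than to have the user change code on a guess
## Prohibited actions (Hard Rules)
- ❌ **Do not quote fix code from a case directly**: never paste a code snippet from `BUGxx.md` to the user as a "ready-made answer". Case code is only "how that project fixed it" and may not suit the current project.
- ❌ **Do not assert "this is BUGxx"**: at most say "this reminds me of BUGxx" or "this shares features with BUGxx".
- ❌ **Do not skip independent verification**: even if all 5 features hit, have the user run 1-2 quick checks on the key hypotheses before discussing a fix.
- ❌ **Do not copy a case's "prevention checklist / Playbook" wholesale**: these are **material for thinking** and must be trimmed and rewritten for the current context every time.
## Case capture constraints
- Do not create a case for the sake of creating one; capture only when there is real new value (a new symptom combination, a new investigation technique)
- Do not borrow generic bug-taxonomy terms (OOM, deadlock, and the like); organize cases by **symptom description** ("the same request leaves mismatched logs on different replicas" is easier to match than "state-consistency bug")
- The case library is indexed by the **symptom** of the problem, not by "tech stack" or "vulnerability type"
- End every case by stating its **applicability boundary** and **counter-examples** (when it is not this case) to help accurate identification next time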
FILE:experience/BUG01.md
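# BUG01: Intermittent multi-replica NPE (stateful business logic × stateless deployment × identity context not propagated)
## Case summary
The frontend calls an endpoint with a valid token, and across consecutive requests **about 50% succeed and 50% return `errorCode: 10001,服务异常`** (the message means "service exception"). The server is a multi-replica k8s deployment (Deployment + round-robin Service). The failing replica logs an NPE with a full stack trace, triggered by `ChainContextHolder.getTokenPayload()` returning null, while the "entry replica" for the same request logs only a single business-error line with no stack trace.
The root cause is an intermittent failure produced by three interacting layers: "**application-level stateful routing** (partition owner)", "**k8s stateless round-robin dispatch**", and "**an internal forwarding protocol that drops the `x-token-payload` header**". **Examined in isolation, none of the three layers explains the failure, but correcting the one protocol layer resolves it completely**.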
---
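## Symptoms / feature quick-match (used for matching)
> When **4 or more** of the following features hit, this case is highly likely; **2-3** hits still justify investigating along this case's line first.
### Client-side symptoms
- [ ] Same token, same body, sent many times in a row: the success rate is **roughly `1/replica count`** or `(N-1)/N`, with N the replica count (about 50% is typical with two replicas)
- [ ] Failures return a business error code (such as 10001 / INTERNAL_SERVER_ERROR / "服务异常"), not 401 / 403 / timeout
- [ ] A local single-replica environment **succeeds 100%** and cannot reproduce it
- [ ] The error message **looks like a token problem** ("unauthenticated / identity missing / user does not exist"), yet the client clearly sent a valid token
### Server-side log features (the most diagnostic!)
- [ ] The same client request leaves traces **in the logs of several replicas** (normally it should appear on only one)
- [ ] Logs on different replicas are **asymmetric**:
  - Replica A: `ERROR` + full exception stack (a low-level exception such as NPE / ClassCastException / IllegalState)
  - Replica B: `ERROR` or `WARN` + one short business error code line, **no stack**
- [ ] On the replica with the stack, the NPE points at code that **reads ThreadLocal / MDC / context storage** (such as `ChainContextHolder.getXxx()`, `RequestContextHolder.getXxx()`, `MDC.get()`)
- [ ] On the replica without the stack, there is an error log shaped like `response XXXX,XXX` (from RestTemplate / Feign handling an internal HTTP response)
### Deployment / architecture features
- [ ] The target service is a multi-replica **k8s Deployment** (not a StatefulSet)
- [ ] The service internally uses a **Kafka Consumer Group** / **application-level partition routing** / **in-memory cache + partition ownership**
- [ ] The code contains `RestTemplate.exchange` / `Feign` calls that send HTTP requests to **other pod IPs of the same service**
- [ ] There is something like a "yellow pages table" (a `partition → ip:port` record in the DB or Redis)
### Code features
- [ ] The receiving Controller **dereferences** ThreadLocal / context directly (such as `ChainContextHolder.getTokenPayload().getAccount()`) without null-safety
- [ ] Internal calls use a literal placeholder such as "feign_token" / "system_token" (for example `Authorization: Bearer feign_token`)
- [ ] Internal forwarding passes only `Authorization` and **does not pass** identity headers such as `x-token-payload` / `x-user-context` / `x-trace-id`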
### 关键日志指纹(可直接 grep)
```text
# A 副本(接收端)典型日志
ERROR ... GlobalExceptionHandler - found system exception,null/<接口路径>
java.lang.NullPointerException: Cannot invoke "XxxPayload.getXxx()" because
the return value of "XxxContextHolder.getPayload()" is null
at <Controller>.<method>(<Controller>.java:NNN)
# B 副本(转发端)典型日志
ERROR ... <Service> - <methodName> response <errorCode>,<errorMessage>
(就这一行,后面没有 "at xxx.xxx" 的栈)
```
### 反向排除项
- 如果**单副本也 100% 失败** → 不是本案例,是纯代码 bug
- 如果**所有副本日志都有完整堆栈** → 不是本案例,是单机异常没有跨副本链路
- 如果**token 确实过期了**(日志里有 `token expired` / `invalid signature`)→ 不是本案例
- 如果故障时间段集中在**扩缩容/发版前后几分钟**内 → 可能是 Kafka rebalance 期间的瞬时问题,不是本案例
---
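## Detailed explanation / root-cause chain
### The three layers
| Layer | What it does | A problem on its own? |
|---|---|---|
| **Code** | The Service behind a business endpoint contains partition routing: `MonitorCacheMgr.send(tbCode)` finds the partition owner pod from `hash(tbCode)`, and non-owners forward to the owner over internal HTTP | Not a bug (a reasonable design for stateful business) |
| **Deployment** | k8s Deployment with 2+ replicas; the Service dispatches external requests round-robin. **With two replicas each pod receives about 50% of requests, regardless of business routing** | Not a bug (standard k8s usage) |
| **Protocol** | The internal forward carries only `Authorization: Bearer feign_token` (a literal placeholder) and **omits `x-token-payload`**. The receiving `AuthInterceptor` only accepts `x-token-payload`; when it is absent, the `TokenPayload` held by `ChainContextHolder` is null | **This is the real bug** |
### How one request "splits" into two
From the frontend's point of view there is 1 request; the server actually handles 2 HTTP calls:
```text
Frontend ──► request #1 ──► k8s Service (round-robin)
│
50% │ 50%
┌────────────────┴────────────────┐
▼ ▼
Pod A (owner of tbCode) Pod B (NOT owner)
────────────────── ──────────────────
1. AuthInterceptor 1. AuthInterceptor
parses x-token-payload ✓ parses x-token-payload ✓
2. Controller reads account ✓ 2. Controller reads account ✓
3. send(): this pod is the owner 3. send(): owner is computed to be Pod A
→ local call() → issues [request #2], forwarded to Pod A
4. business logic succeeds │
5. returns ApiResponse(0) ✅ │ ❌ x-token-payload not forwarded
▼
Pod A as the receiver:
5. AuthInterceptor cannot find
x-token-payload
→ ChainContextHolder = null
6. Controller calls .getAccount() unguarded
💥 NPE
7. GlobalExceptionHandler catches it
→ prints the full NPE stack trace (log ①)
→ returns ApiResponse(10001)
▲
│ HTTP 200 + body{errorCode:10001}
│
Pod B receives the response:
8. errorCode != 0
log.error("...response 10001,服务异常")
(log ②, one line, no stack trace)
9. throw ServiceException
→ returns ApiResponse(10001) to the frontend
```
### Why "some succeed and some fail"
The k8s Service decides independently for every request which replica receives it, and the decision has nothing to do with the token:
- Lucky: the request lands on the owner pod → no internal forward → success
- Unlucky: the request lands on a non-owner pod → internal forward → header missing → NPE → failure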
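The arithmetic behind the `1/R` success rate can be checked with a toy model. The sketch below is illustrative only: it assumes strict round-robin, a single business key with one owner pod, and a forward that always fails; the class and method names are hypothetical and not taken from the project.

```java
import java.util.stream.IntStream;

/** Toy model: only a direct hit on the owner pod avoids the broken forward. */
class RoundRobinOutcome {
    /**
     * Counts successful requests when `requests` calls for one business key are
     * spread round-robin over `replicas` pods and every forwarded call fails.
     */
    static long successes(int replicas, int ownerIndex, int requests) {
        return IntStream.range(0, requests)
                .map(i -> i % replicas)            // pod chosen by the load balancer
                .filter(pod -> pod == ownerIndex)  // only the owner serves without forwarding
                .count();
    }

    public static void main(String[] args) {
        for (int replicas = 1; replicas <= 4; replicas++) {
            System.out.printf("replicas=%d successes=%d/120%n",
                    replicas, successes(replicas, 0, 120));
        }
    }
}
```

A real Service may distribute requests randomly rather than in strict rotation, so observed rates will scatter around `1/R` instead of hitting it exactly.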
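### Why "one log has a stack trace and the other does not"
The project's `GlobalExceptionHandler` handles exceptions in two classes:
| Exception type | Meaning | Log level | Stack trace |
|---|---|---|---|
| `SystemException` / uncaught `Throwable` | Unexpected system failure (NPE, SQL error, etc.) | `ERROR` | ✅ Full stack trace |
| `ServiceException` / business exception | Business-rule violation (an expected error) | `WARN` or `ERROR` | ❌ No stack trace |
- **Pod A** (receiver): the NPE really happens here → SystemException path → **stack trace**
- **Pod B** (forwarder): only wraps the `errorCode` into a `ServiceException` and rethrows → business-exception path → **no stack trace**
### Why login is stateless but the bug is still a state bug
This is the point most easily confused:
- **The authentication layer (JWT token) really is stateless**: every pod can parse the token independently
- **The business layer is heavily stateful**, because:
1. The Kafka Consumer Group protocol lets a partition be consumed by only one consumer in the group → pods and partitions are tightly bound
2. High-frequency device data must be aggregated in memory (thousands to tens of thousands of records per second; writing to Redis cannot keep up)
3. The aggregated data lives only in the memory of the pod that consumes that partition → queries must be routed to the owner
So the project had to **hand-roll an application-level routing layer** (a partition "yellow pages" table + pod-to-pod HTTP forwarding) on top of a stateless k8s deployment, and that forwarding protocol is where the bug breeds.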
---
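## Investigation methodology
### Core techniques used
| Technique | Purpose | When to use |
|---|---|---|
| **Quantified boundary test** | Send 100 consecutive requests and measure the success rate to tell a deterministic bug from an intermittent one | Always the first step |
| **Replica-count comparison** | Compare the success rate against `1/R`; quickly points at inconsistent behavior between replicas | As soon as the failure looks intermittent |
| **Cross-replica log comparison** | Pull the logs of every replica for the same traceId / time window and read them **side by side** | When an internal call chain is suspected |
| **Source vs deployed bytecode comparison** | Decompile the deployed jar with `javap -c -p` and diff against local source | When a wrong version may have been deployed |
| **Identity-propagation lifecycle diagram** | Mark every thread/process/instance boundary and audit whether each one restores the context | When context loss is suspected |
| **Header capture** | `tcpdump`, or print `getHeaderNames()` at the receiver's entry point | When header passing looks wrong |
| **Comparison with similar code** | Find code in the project that does the same thing without the bug → diff the difference | When several similar implementations exist |
### Diagnostic steps (in order)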
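#### 1. Quantify the symptom (10 min)
```bash
# Send the same request 100 times (set URL, TOKEN and BODY first)
for i in $(seq 1 100); do
  curl -s -X POST "$URL" -H "Authorization: $TOKEN" \
    -H 'Content-Type: application/json' -d "$BODY" \
    | jq -r '.errorCode' | head -1
done | sort | uniq -c
```
Read the success rate:
- 100% / 0% → not this case; follow another investigation path
- **About 50% (2 replicas) / about 33% (3 replicas) / ... → strongly points to this case**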
#### 2. Confirm the replica topology (5 min)
```bash
kubectl get deploy <service-name> -o json | jq '.spec.replicas'
kubectl get pods -l app=<service-name>
```
- Replica count > 1 and the Service is a plain ClusterIP/LoadBalancer → the "multi-replica round-robin" condition is met
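#### 3. Cross-replica log comparison (30 min)
```bash
# Tail every replica in parallel, prefixing each line with the pod name.
# "-o name" prints "pod/<name>", so sed must not use "/" as its delimiter.
for pod in $(kubectl get pods -l app=<service> -o name); do
  kubectl logs -f "$pod" --tail=100 | sed "s|^|[$pod] |" &
done
wait
```
Send 1 failing request and observe:
- Do logs appear on several replicas **at the same time**?
- Are the **format and level of detail asymmetric** between replicas?
If both hold → this case type is essentially confirmed.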
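#### 4. Locate the code (30 min)
- Find the line that the stack trace points at on the replica that has one (the NPE line)
- Find the call site of `log.error("...response XXX,XXX")` on the replica without a stack trace
- Look at the `restTemplate.exchange` / `Feign` method behind that call
- Check whether the URL looks like `http://<dynamic IP>:<port>/...` (direct pod-to-pod)
- Check whether the headers lack identity headers such as `x-token-payload` / `x-user-context`
#### 5. Confirm the protocol mismatch (15 min)
- Capture the full header list of a request that came through the Gateway
- Capture the header list added by the internal forward
- Diff the two and find what is missing
- Then read the receiver's `AuthInterceptor` / `@PreAuthorize` / context-parsing code to confirm which headers it depends on
**The set difference is the list of headers to fix.**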
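The header diff in step 5 is a plain set difference over case-insensitive names. A minimal sketch, assuming the two header lists have already been captured as strings; `HeaderDiff` is a hypothetical helper and `x-remote-ip` in the usage note is an invented header name used only for illustration:

```java
import java.util.Collection;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class HeaderDiff {
    /** Returns lower-cased header names seen at the gateway hop but absent on the internal forward. */
    static Set<String> missingOnForward(Collection<String> gatewayHeaders,
                                        Collection<String> forwardHeaders) {
        Set<String> forwarded = new TreeSet<>();
        for (String h : forwardHeaders) {
            forwarded.add(h.toLowerCase(Locale.ROOT)); // HTTP header names are case-insensitive
        }
        Set<String> missing = new TreeSet<>();
        for (String h : gatewayHeaders) {
            String name = h.toLowerCase(Locale.ROOT);
            if (!forwarded.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }
}
```

For example, comparing `[Authorization, Content-Type, x-token-payload, x-remote-ip]` against `[authorization, content-type]` yields `x-remote-ip` and `x-token-payload`; only the subset that the receiver actually reads needs to be forwarded.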
---
## Fix
### Root fix (P0, required)
**In the outbound HTTP code of the internal forward, add the missing identity-context headers.** Reference implementation:
```java
private Integer invokeRemoteDeviceOpt(JsonNode node, NodePartition nodePartition, RestTemplate restTemplate) {
String newUrl = String.format("http://%s:%s/rest/v1/access/device/product/services",
nodePartition.getIp(), nodePartition.getPort());
HttpHeaders headers = new HttpHeaders();
headers.add("Authorization", RequestHeaderConstant.HTTP_FEIGN_TOKEN_BEARER.getValue());
headers.add("Content-Type", "application/json");
headers.add("Accept", "application/json");
// Core fix: pass the identity context through
TokenPayload tokenPayload = ChainContextHolder.getTokenPayload();
if (tokenPayload != null) {
TokenContext tokenContext = TokenContext.builder().payload(tokenPayload).build();
headers.add(RequestHeaderConstant.HTTP_TOKEN_PAYLOAD_HEADER.getValue(),
JSONUtil.INSTANCE.toJson(tokenContext));
} else {
log.warn("invokeRemoteDeviceOpt missing TokenPayload, forward to {} without x-token-payload", newUrl);
}
ChainContext ctx = ChainContextHolder.get();
if (ctx != null) {
if (StringUtils.isNotBlank(ctx.getRemoteIp())) {
headers.add(RequestHeaderConstant.HTTP_REMOTE_IP_HEADER.getValue(), ctx.getRemoteIp());
}
if (StringUtils.isNotBlank(ctx.getLocale())) {
headers.add(RequestHeaderConstant.HTTP_LANGUAGE.getValue(), ctx.getLocale());
}
}
// ... the restTemplate.exchange call that follows stays unchanged
}
```
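**Key point**: the serialization must match what the Gateway entry point uses (for example both use `JSONUtil.INSTANCE.toJson(TokenContext)`); otherwise deserialization fails on the receiver.
### Fallback (P1, recommended)
**Make the receiving Controller's ThreadLocal dereference null-safe**: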
```java
@PostMapping(value = "/rest/v1/access/device/product/services")
public ApiResponse<Integer> innerDeviceOptRequest(@RequestBody JsonNode jsonNode) {
String tbCode = ChainContextHolder.getTbCode();
TokenPayload payload = ChainContextHolder.getTokenPayload();
String account = payload != null ? payload.getAccount() : null;
String tid = payload != null ? payload.getTid() : null;
// ...
}
```
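Why this helps: if a future caller drops the header again, the endpoint does not crash outright; the worst case is a degraded request with account=null.
### Hardening (P2, long term)
- **Extract a shared helper** `forwardHeadersBuilder()`: assemble the headers for every "internal HTTP forward" in one place so that the next similar piece of code cannot forget one
- **Remove dead code**: check the project for dead code that was meant to handle internal calls but has broken logic (e.g. `url.startsWith("/inner/") && url.equals("feign_token")`, which is always false)
- **Dedicated `/inner/...` paths**: in the long run, give external APIs and internal forwards **different URL prefixes**, and have the receiver read identity explicitly from the body instead of relying on headers. See the existing `/inner/rest/...` design pattern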
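A shared header builder along the lines of `forwardHeadersBuilder()` might look like the sketch below. It is deliberately framework-free: it returns a plain `Map` instead of Spring's `HttpHeaders`, takes the already serialized payload as a string, and the `x-remote-ip` header name is a hypothetical stand-in for whatever `RequestHeaderConstant` defines. Unlike the reference fix above, which logs a warning and forwards anyway, this variant fails fast when the identity context is missing; which policy fits is a project decision.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ForwardHeaders {
    /**
     * Builds the headers for a pod-to-pod forward.
     * `tokenPayloadJson` is the serialized identity context of the current request.
     */
    static Map<String, String> build(String authorization, String tokenPayloadJson, String remoteIp) {
        if (tokenPayloadJson == null || tokenPayloadJson.isBlank()) {
            // Refuse to send a request that the receiver cannot attribute to a user.
            throw new IllegalStateException("missing identity context; refusing to forward");
        }
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Authorization", authorization);
        headers.put("Content-Type", "application/json");
        headers.put("Accept", "application/json");
        headers.put("x-token-payload", tokenPayloadJson);
        if (remoteIp != null && !remoteIp.isBlank()) {
            headers.put("x-remote-ip", remoteIp); // hypothetical header name
        }
        return headers;
    }
}
```

Routing every internal forward through one such method turns "did we forget a header" into a single place to review.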
---
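## Prevention checklist
### When writing an outbound HTTP call
- [ ] Does this call **cross a network boundary**? (Pod-to-pod inside the same service counts too)
- [ ] Does the receiver depend on `ThreadLocal` / `MDC` / a context store?
- [ ] Did I explicitly forward the identity headers (`x-token-payload` / `x-user-context`), the traceId, and the MDC?
- [ ] Does the header set match exactly what the **service entry point** (Gateway / AuthFilter) injects?
- [ ] Is the fallback clear when the context is null (log.warn + continue / throw)?
### When writing an inbound Controller
- [ ] Who calls this endpoint? External clients? Other services? The service's own pods?
- [ ] Can every caller guarantee the headers I depend on?
- [ ] Before calling `ContextHolder.getXxx().getYyy()` unguarded, confirm that the worst case cannot NPE
- [ ] Or prefer **explicit parameters** + `@Validated` and avoid the ThreadLocal dependency entirely
### When reviewing a stateful service
- [ ] Does the application hold implicit state such as "session state / partition ownership / in-memory cache"?
- [ ] Is it deployed as a Deployment or a StatefulSet? Does that match the business semantics?
- [ ] Does the internal routing logic propagate every required piece of context?
- [ ] Can the failure be detected in a **multi-replica environment** by sending 100 consecutive requests with the same business key?
### Deployment
- [ ] Can the source version and the deployed version be verified against each other (image tag, git commit hash)?
- [ ] Does a canary pod run first during a gradual rollout?
- [ ] Has a stateful-service upgrade been tested for cross-version compatibility of internal calls?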
---
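## Playbook for similar intermittent bugs
For vague complaints such as "occasional NPE / occasional 500 / occasional 403", proceed in this order:
### Step 1: Quantified reproduction (10 min)
Send many consecutive requests and compute the success rate:
- 100% / 0% → the bug is deterministic, not intermittent; leave this playbook and debug it as a stable failure
- Any other ratio → continue
### Step 2: Check the replica count (5 min)
Use `kubectl get deploy` to read the replica count R.
- Success rate ≈ `1/R` or `(R-1)/R`? → behavior differs between replicas
- Can a single replica reproduce it? If not → **strongly points to this case type**
### Step 3: Cross-replica log comparison (30 min)
- Send 1 failing request while tailing every replica's log
- Look for evidence of "the same request leaves traces on several replicas + asymmetric logs"
- Locate the code behind both the replica with the stack trace and the one without
### Step 4: Source vs deployment comparison (15 min)
- Pull the jar from a production pod and decompile the key classes
- Diff against local source
- Any difference is a major lead (first suspect that a wrong version was deployed)
### Step 5: Draw the context-propagation chain (30 min)
- Mark every thread/process/instance boundary
- Check whether each boundary has an explicit pack → unpack mechanism
- Verify what is actually transmitted with a packet capture or by printing headers at the entry point
### Step 6: Compare similar code (15 min)
- Find 5-10 places that do something similar
- Diff the buggy code against the working code
### Step 7: Minimal reproduction + regression test (1 hour)
- Build a reproduction with "the minimum replica count + the minimum preconditions"
- **Turn it into an integration test** to prevent recurrence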
---
## References
### File paths for this case (project: cuavcloudservice)
- Root-fix location: `CuavCloudApplyService/CuavCloudService/.../application/device/DeviceService.java` (`invokeRemoteDeviceOpt`)
- NPE site: `CuavCloudApplyService/CuavCloudService/.../api/device/DeviceAccessController.java:313`
- Context definition: `cuavcloudcbb/.../context/ChainContextHolder.java`, `ChainContext.java`
- Header constants: `cuavcloudcbb/.../constant/RequestHeaderConstant.java`
- Gateway injection point: `cuavcloudservice/.../gateway/filter/AuthorizationFilter.java` (lines 133/274/338/345)
- Receiver-side parsing: `cuavcloudcbb/.../authentication/application/interceptor/AuthInterceptor.java`
- Partition routing: `CuavCloudService/.../domain/cache/MonitorCacheMgr.java` (`send`), `domain/merchmant/NodePartitionMgr.java`
- Partition mapping table: `t_kafka_partition` (DB) + `access.partition.cache.v4.{id}` (Redis)
### Key concepts
- Stateless JWT authentication
- Kafka Consumer Group partition assignment
- k8s Deployment vs StatefulSet
- ThreadLocal is lost across network boundaries
- Asymmetric handling of SystemException / ServiceException in `GlobalExceptionHandler`
### One-sentence summary
> **When the stateless k8s deployment model collides with stateful application logic, the pod-to-pod HTTP forward drops `x-token-payload`, and round-robin across replicas turns the bug into Schrödinger's cat. Each pod's log holds half of the evidence (stack trace on the receiver, only a business error code on the forwarder), so the full chain can only be reconstructed by comparing logs across pods.**
FILE:experience/BUG02.md
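# BUG02: Netty eventLoop intermittently terminates and then self-recovers: dynamic compilation exhausts the heap × reconnect amplification
## Case summary
During load tests or continuous reporting, the test stub's log first shows `RejectedExecutionException: event executor terminated` and `Force-closing a channel whose registration task was not accepted by an event loop`, followed by `OutOfMemoryError: Java heap space` related to `JdkCompiler` / `ProtobufProxy`. Neither the service node nor the JVM process **restarted**, and after a short while the connection recovered by itself and login succeeded again. The root cause is not that the remote service was unavailable; rather, **the local process suffered transient heap exhaustion and a connection storm under the combination of high-frequency dynamic protobuf compilation, repeated connect calls, and repeated heartbeat scheduling**, and then appeared to "self-heal" thanks to GC and newly established connections.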
---
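## Symptom / feature quick reference (for matching)
> When **4 or more** of the features below match, this case is highly likely; with **2-3** matches it is still worth investigating along these lines first.
### Surface symptoms
- [ ] The log first reports `event executor terminated` / `registration task was not accepted by an event loop`
- [ ] `OutOfMemoryError: Java heap space` appears in the same time window
- [ ] The OOM does not kill the JVM; neither the process nor the node restarts
- [ ] After tens of seconds to a few minutes the connection recovers by itself, and "connected / login succeeded" appears again
- [ ] The failure looks like "occasional remote network errors", yet not every connection keeps failing
### Log features (most diagnostic!)
- [ ] The stack contains `com.baidu.bjf.remoting.protobuf`, `JdkCompiler.doCompile`, and `ProtobufProxy.create` together
- [ ] The business frames also contain `NettyUtil.encode(...)`
- [ ] The call chain `channelInactive -> reconnect -> connect` appears frequently
- [ ] Around a failed or closed `connect()`, `Connect to <host>,<port>` is repeated several times
- [ ] The same process shows `RejectedExecutionException` and, later, successful login logs
### Code features
- [ ] The encoding helper calls `ProtobufProxy.create(cls)` on every invocation, with no `Codec` cache
- [ ] The same `NettyClient` may have `connect()` called by several threads at once
- [ ] `channelInactive` triggers a reconnect directly
- [ ] After a successful login, heartbeats are sent via `scheduleAtFixedRate(...)`, but the old task is never cancelled explicitly
- [ ] The test stub / simulator has several threads continuously sending different protobuf messages at high frequency
### Key log fingerprints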
```text
WARN ... AbstractChannel - Force-closing a channel whose registration task
was not accepted by an event loop
java.util.concurrent.RejectedExecutionException: event executor terminated
ERROR ... rejectedExecution - Failed to submit a listener notification task.
Event loop shut down?
java.util.concurrent.RejectedExecutionException: event executor terminated
An exception has occurred in the compiler (17.0.12)
java.lang.OutOfMemoryError: Java heap space
java.lang.IllegalStateException: Compilation failed. class:
com.xxx.$$JProtoBufClass, diagnostics: []
at com.baidu.bjf.remoting.protobuf.utils.compiler.JdkCompiler.doCompile(...)
INFO ... NettyClient - Connect to <host>,<port>
INFO ... NettyProxyClientHandler - result login ...
```
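### Exclusions
- If the JVM **exits**, the process is restarted, or the container has an OOMKilled record → not this case; it is a process-level OOM
- If there is only `event executor terminated` and **no trace at all** of `JdkCompiler` / `ProtobufProxy` / `Java heap space` → first check the network, explicit shutdown calls, and thread-pool lifecycle
- If login **never succeeds again** → more likely the remote service is unavailable or the network is fully down; not this case
- If compilation is slow only once during startup and everything is stable afterwards → this is just cold-start cost, not necessarily this case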
---
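## Detailed explanation / root-cause chain
### The four layers
| Layer | What it does | Does it cause the bug on its own? |
|---|---|---|
| **Encoding** | `NettyUtil.encode()` calls `ProtobufProxy.create(cls).encode(t)` every time, triggering JProtobuf's dynamic generation/compilation of `$$JProtoBufClass` | Risky, but not necessarily an immediate bug |
| **Message model** | Some message classes have many fields and deep nesting; the first dynamic compilation is expensive and has a high memory peak | Risky, but needs load to trigger |
| **Connection** | `getClient()`, `channelInactive()`, and the failure callback can all call `connect()` again, with no single-flight protection | Amplifies the failure |
| **Scheduling** | Heartbeats are scheduled after a successful login; several successful reconnects can accumulate several heartbeat tasks | Keeps generating extra traffic and encoding pressure |
### Typical failure flow
```text
Test threads keep sending messages
│
├─► NettyUtil.encode(...) is called repeatedly
│ └─► ProtobufProxy.create(...)
│ └─► JdkCompiler.doCompile(...)
│
├─► A heavy message / first compilation spikes heap usage
│ └─► Java heap space
│
├─► After the compilation fails, some send/connect callbacks keep running
│ └─► connect() / addListener() / register() still try to submit tasks
│
├─► The original event loop is no longer usable
│ └─► RejectedExecutionException: event executor terminated
│
├─► But the JVM is still alive, and then:
│ 1. Some sender threads exit on the exception or slow down
│ 2. GC reclaims part of the heap
│ 3. A new connection is established
│
└─► In the log this looks like: no restart, yet it "recovered by itself"
```
### Why "the node did not restart but recovered by itself"
This is the point in this case that most easily leads to a wrong diagnosis:
1. **An OOM does not mean the JVM exits immediately**
This `OutOfMemoryError` occurred on a business thread inside `JdkCompiler.doCompile(...)`. In many situations it only fails the current thread or task and does not take the whole process down right away.
2. **What broke was that particular eventLoop / channel, not the whole node**
`event executor terminated` means the event loop that connection depended on can no longer accept tasks; it does not mean no new connection can be created later.
3. **The pressure is transient, not necessarily sustained**
The first dynamic compilation of certain message types is the most expensive. Once the peak passes, GC kicks in, and some threads stop, the process may return to a state where it can keep running.
4. **The code keeps trying to reconnect**
As long as the remote service is actually available, some later reconnect is likely to succeed, so the log contains both "it just blew up" and "login succeeded again".
### Code-level evidence chain (current project)
#### 1. A protobuf codec is created dynamically on every encode
`cuavcloudcbb/CuavCloudBasicCBB/BasicToolsAdapterCBB/src/main/java/com/cuav/basictools/util/NettyUtil.java`
- `encode(Class<T> cls, T t)` does `return ProtobufProxy.create(cls).encode(t);` directly
- There is no cache layer such as `ConcurrentHashMap<Class<?>, Codec<?>>`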
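The missing cache layer can be sketched without depending on JProtobuf. In the sketch below the expensive `ProtobufProxy.create(cls)` call is represented by an injected factory function, and `CodecCache` is a hypothetical name; it shows the caching technique only, not the project's actual fix.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Per-class cache so the expensive codec factory runs at most once per message type. */
class CodecCache<C> {
    private final Map<Class<?>, C> cache = new ConcurrentHashMap<>();
    private final Function<Class<?>, C> factory;

    /** `factory` stands in for the costly creation call, e.g. a wrapper around ProtobufProxy.create. */
    CodecCache(Function<Class<?>, C> factory) {
        this.factory = factory;
    }

    /** Returns the cached codec for `type`, creating it on first use. */
    C get(Class<?> type) {
        // computeIfAbsent is atomic per key, so concurrent callers do not compile twice.
        return cache.computeIfAbsent(type, factory);
    }
}
```

With such a cache, `encode` would look up the codec through `get(cls)` instead of creating one per call, so compilation cost is paid once per class rather than once per message.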
#### 2. The connect entry points lack concurrency protection
`cuavcloudservice/CuavCloudApplyService/CuavCloudTestService/src/main/java/com/cuav/cloud/test/application/netty/NettyInstantService.java`
- `getClient()` calls `client.connect()` directly when `!client.getConnectStatus()`
- When several business threads see "not connected" at the same time, they can trigger connect concurrently
`cuavcloudcbb/CuavCloudBasicCBB/BasicToolsAdapterCBB/src/main/java/com/cuav/basictools/config/netty/heart/NettyClientHeartHandler.java`
- `channelInactive()` calls `this.clientReconnect.reconnect()` directly
`cuavcloudcbb/CuavCloudBasicCBB/BasicToolsAdapterCBB/src/main/java/com/cuav/basictools/config/netty/client/NettyClient.java`
- `connect()` has no `synchronized` / CAS / in-flight flag
- The failure callback also runs `eventLoop.schedule(() -> connect(), 40, TimeUnit.SECONDS)`
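One common way to add the missing in-flight flag is a CAS on an `AtomicBoolean`. The sketch below is a stand-alone illustration: `SingleFlightConnector` and `attemptFinished` are hypothetical names, and the real connect logic is represented by a `Runnable`.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Lets at most one connect attempt run at a time, whichever entry point asks for it. */
class SingleFlightConnector {
    private final AtomicBoolean inFlight = new AtomicBoolean(false);
    private final Runnable doConnect;

    SingleFlightConnector(Runnable doConnect) {
        this.doConnect = doConnect;
    }

    /** Starts a connect attempt unless one is already running; returns whether it started. */
    boolean tryConnect() {
        if (!inFlight.compareAndSet(false, true)) {
            return false; // another caller already owns the current attempt
        }
        try {
            doConnect.run();
            return true;
        } catch (RuntimeException e) {
            inFlight.set(false); // the attempt never started, so allow a retry
            throw e;
        }
    }

    /** To be called from the connect-completion callback, on success and on failure. */
    void attemptFinished() {
        inFlight.set(false);
    }
}
```

`getClient()`, `channelInactive()`, and the delayed retry would all go through `tryConnect()`, so concurrent callers collapse into a single attempt.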
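#### 3. A heartbeat is scheduled again after every successful login
`cuavcloudcbb/CuavCloudBasicCBB/BasicToolsAdapterCBB/src/main/java/com/cuav/basictools/config/netty/client/NettyProxyClientHandler.java`
- After a successful login, heartbeats are sent via `scheduleAtFixedRate(...)`
- The old `ScheduledFuture` does not appear to be stored or cancelled
- If several reconnects succeed, heartbeat tasks are likely to accumulate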
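Binding the heartbeat to the connection lifecycle comes down to keeping the `ScheduledFuture` and cancelling it before scheduling a new one. A minimal sketch using only `java.util.concurrent`; `HeartbeatSlot` is a hypothetical name, and a Netty `EventLoop` (which is itself a `ScheduledExecutorService`) could take the place of the scheduler argument.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Holds at most one live heartbeat task; restarting cancels the previous one first. */
class HeartbeatSlot {
    private ScheduledFuture<?> current;

    /** Cancels any running heartbeat and schedules a new one; returns the new future. */
    synchronized ScheduledFuture<?> restart(ScheduledExecutorService scheduler,
                                            Runnable beat, long periodMillis) {
        stop();
        current = scheduler.scheduleAtFixedRate(beat, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        return current;
    }

    /** To be called when the channel closes, so an old task stops writing to a dead channel. */
    synchronized void stop() {
        if (current != null) {
            current.cancel(false);
            current = null;
        }
    }
}
```

Calling `restart` on each successful login and `stop` on channel close keeps the number of live heartbeat tasks at one per client.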
#### 4. The test stub sends high-frequency messages from several threads
`cuavcloudservice/CuavCloudApplyService/CuavCloudTestService/src/main/java/com/cuav/cloud/test/application/sfl210/Sfl210TestService.java`
- Three threads, `threadHeart`, `threadChildeHeart`, and `threadUav`, continuously send different messages
- `sendDph120Heart(...)` additionally sends posture / beam config in sequence
`cuavcloudservice/CuavCloudApplyService/CuavCloudTestService/src/main/java/com/cuav/cloud/test/application/groundstation/GroundStationRadarTestService.java`
- The heartbeat thread loop sends `RadarHeart`, `RadarPosture`, `RadarBeamConfig`, and `RadarUav` back to back
---
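## Investigation methodology
### Core techniques used
- **Timeline alignment**: put the OOM, Netty reconnect, and login-success logs on one second-level timeline and read their order
- **Heap evidence collection**: GC logs, heap dumps, container memory curves
- **Connection-storm detection**: count `Connect to ...` and `result login` occurrences within the same minute
- **Hot-message identification**: find which protobuf message's first encode most easily triggers compilation/OOM
- **Code-path audit**: trace the loop `getClient -> connect -> channelInactive -> reconnect -> scheduleAtFixedRate`
### Diagnostic steps (in order)
#### 1. First confirm whether this is a "self-recovering OOM"
Check whether the failure window satisfies all of the following:
- `Java heap space` appears
- `event executor terminated` appears in the same window (either may be logged first; in the case summary the rejection was logged before the OOM)
- Connection/login succeeds again afterwards
- No process restart and no container recreation in between
If so, follow this case first.
#### 2. Quantify the connection storm
Count within a 1-minute window of the log:
- How many times `Connect to <host>,<port>` appears
- How many times `result login` appears
- How many times `channelInactive` / `close ...remoteAddress` appears
If these numbers spike in the failure window, this is not a single disconnect but "reconnect amplification".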
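The per-minute counts can be produced with a few lines of code once the log is loaded. The sketch below assumes every log line starts with a timestamp shaped like `2024-01-01 12:00:03` (an assumed layout; adjust the prefix length for the real format), and `ConnectStormCounter` is a hypothetical helper.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ConnectStormCounter {
    private static final int MINUTE_PREFIX = "yyyy-MM-dd HH:mm".length();

    /** Buckets the lines that contain `marker` by the minute found in their timestamp prefix. */
    static Map<String, Integer> perMinute(List<String> lines, String marker) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted, so the timeline reads in order
        for (String line : lines) {
            if (line.length() >= MINUTE_PREFIX && line.contains(marker)) {
                counts.merge(line.substring(0, MINUTE_PREFIX), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Running it once with `"Connect to"` and once with `"result login"` gives the two series to compare; a minute in which connects far outnumber logins is a candidate storm window.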
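#### 3. Pin down the heaviest protobuf type
Search for these keywords:
- `Compilation failed. class:`
- `$$JProtoBufClass`
- `NettyUtil.encode(`
Check which message type was the last to fail before the OOM.
If it is always a message model with many fields and complex nesting, that strongly supports this case.
#### 4. Check whether the process really stayed alive
Verify:
- Container/process uptime is unchanged
- No OOMKilled record and no increase in restart count
- Thread dumps or logs still show old threads at work
This step separates a process-level OOM from a thread-level / business-level OOM.
#### 5. Audit whether the connect entry points are re-entrant
Focus on:
- Whether `getClient()` can be called by several threads concurrently
- Whether `channelInactive()` reconnects directly
- Whether the failure listener schedules another delayed connect
- Whether anything enforces "only one connect in flight" across these entry points
#### 6. Audit the heartbeat task lifecycle
Focus on:
- Whether `scheduleAtFixedRate` runs after every successful login
- Whether the old heartbeat task is cancelled when the old channel closes
- Whether "the connection changed but the old task still writes to the old ctx/channel" can happen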
---
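## Fix
### Root fix (P0, required)
1. **Add a codec cache around `ProtobufProxy.create(...)`**
Avoid dynamic compilation on every encode/decode.
2. **Add single-flight protection to `NettyClient.connect()`**
Allow only one connection attempt in flight at any time, so several threads cannot connect simultaneously.
3. **Bind the heartbeat task to the connection lifecycle**
Store the `ScheduledFuture` after a successful login; cancel the old task before closing or reconnecting, then register the new one.
### Fallback / defense (P1, recommended)
1. **Warm up high-risk messages**
Before the application or test starts, initialize the codec for heavy message models once, to avoid a cold compile on the first real send.
2. **Trip a circuit breaker quickly after an OOM**
If `Compilation failed` / `Java heap space` occurs repeatedly within a period, pause sending new messages to avoid further amplification.
3. **Reconnect backoff**
Add exponential backoff, a maximum retry interval, and debouncing, so that a broken link cannot saturate the event loop and the heap together.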
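The backoff rule in item 3 can be written as a small pure function. A minimal sketch under assumed parameters (a base delay that doubles per attempt, up to a cap); `ReconnectBackoff` is a hypothetical name, and jitter is left out to keep the result deterministic.

```java
class ReconnectBackoff {
    /**
     * Delay before retry number `attempt` (0-based): baseMillis doubled once per
     * previous attempt, never exceeding maxMillis.
     */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        if (attempt < 0 || baseMillis <= 0 || maxMillis < baseMillis) {
            throw new IllegalArgumentException("invalid backoff parameters");
        }
        long delay = baseMillis;
        // Stop doubling once the cap is reached, which also avoids overflow for large attempts.
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }
}
```

With a 1 s base and a 40 s cap the sequence is 1 s, 2 s, 4 s, 8 s, 16 s, 32 s, 40 s, 40 s, and so on. In practice a random jitter is usually added on top so that many clients do not retry in lockstep.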
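### Hardening / cleanup (P2, long term)
- Move every `while + sleep + getClient().send(...)` sender thread in the test stub under a scheduler, so bare threads cannot run out of control
- Maintain a dedicated precompile list for protobuf messages with many fields and heavy nesting
- Add observability metrics for Netty connection management: current connection state, reconnect count, number of active heartbeat tasks, time of the most recent OOM
- Split `connectStatus` from "logged in or not" into finer states such as "connecting / connected / logged in / closed"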
---
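## Prevention checklist
Development:
- [ ] Every high-frequency call site of `ProtobufProxy.create(...)` has been assessed for caching
- [ ] Message types with a high first-send cost are warmed up at startup
- [ ] `connect()` is protected against concurrency and cannot be entered by several threads at once
- [ ] Scheduled tasks are bound to the channel lifecycle, and old tasks can be cancelled
- [ ] Before sending from a bare thread loop, abnormal exit and backpressure have been considered
Deployment:
- [ ] JVM `-Xms/-Xmx` match the load-test peak; do not run on the default heap size
- [ ] GC logs, heap curves, and restart counts are monitored
- [ ] "Recovered after a process restart" can be told apart from "recovered without a restart"
Review:
- [ ] On "occasional Netty disconnects", was the same time window checked for OOM/GC anomalies?
- [ ] On "automatic recovery", was transient resource exhaustion considered rather than network jitter?
- [ ] On `scheduleAtFixedRate`, was the cancellation logic checked?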
---
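## Playbook for similar intermittent bugs
When "Netty occasionally disconnects but then recovers by itself", proceed in this order:
1. Build the timeline (10 min)
Align `OOM`, `RejectedExecutionException`, `Connect to ...`, and `result login`.
2. Check whether the process restarted (5 min)
If it recovered without a restart, first suspect "transient resource exhaustion" as in this case.
3. Check heap and GC (15 min)
Look at heap usage, Full GC, and pause times before and after the failure.
4. Count reconnects and logins (15 min)
Decide whether it is a connection storm.
5. Find dynamic-compilation hot spots (20 min)
grep `$$JProtoBufClass` / `JdkCompiler.doCompile` / `Compilation failed`.
6. Audit the connection loop code (30 min)
Draw every `connect` entry point, failure callback, `channelInactive`, and scheduled-task registration as a diagram.
7. Verify with a minimal fix first (1 hour)
Try `codec cache + single-flight connect + cancel old heartbeat` first, then see whether the failure disappears.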
---
## References
### File paths for this case (project: `d:\SpotterProNew\IdeaProjects`)
- Encoding entry point: `cuavcloudcbb/CuavCloudBasicCBB/BasicToolsAdapterCBB/src/main/java/com/cuav/basictools/util/NettyUtil.java`
- Netty client: `cuavcloudcbb/CuavCloudBasicCBB/BasicToolsAdapterCBB/src/main/java/com/cuav/basictools/config/netty/client/NettyClient.java`
- Reconnect bridge: `cuavcloudcbb/CuavCloudBasicCBB/BasicToolsAdapterCBB/src/main/java/com/cuav/basictools/config/netty/client/NettyClientReconnect.java`
- Heartbeat / reconnect on disconnect: `cuavcloudcbb/CuavCloudBasicCBB/BasicToolsAdapterCBB/src/main/java/com/cuav/basictools/config/netty/heart/NettyClientHeartHandler.java`
- Heartbeat scheduling after login: `cuavcloudcbb/CuavCloudBasicCBB/BasicToolsAdapterCBB/src/main/java/com/cuav/basictools/config/netty/client/NettyProxyClientHandler.java`
- Test-stub connection entry point: `cuavcloudservice/CuavCloudApplyService/CuavCloudTestService/src/main/java/com/cuav/cloud/test/application/netty/NettyInstantService.java`
- High-pressure sender example 1: `cuavcloudservice/CuavCloudApplyService/CuavCloudTestService/src/main/java/com/cuav/cloud/test/application/sfl210/Sfl210TestService.java`
- High-pressure sender example 2: `cuavcloudservice/CuavCloudApplyService/CuavCloudTestService/src/main/java/com/cuav/cloud/test/application/groundstation/GroundStationRadarTestService.java`
- Heavy message example: `cuavcloudcbb/CuavCloudBasicCBB/InterfaceDefineCBB/src/main/java/com/cuav/cloud/protobuf/model/Dph120HeartList.java`
### Key concepts
- JProtobuf / `ProtobufProxy` dynamically generating `$$JProtoBufClass`
- Runtime compilation cost of `JdkCompiler.doCompile(...)`
- Netty `EventLoop` lifecycle
- `RejectedExecutionException: event executor terminated`
- Transient OOM without process exit: "pseudo self-recovery"
### One-sentence summary
> **When a test stub stacks "high-frequency dynamic protobuf compilation", "re-entrant connect", and "heartbeat tasks that are never reclaimed", the process first shows Netty event-loop termination and connection failures and then reveals a heap OOM in `JdkCompiler`. Because the JVM does not necessarily exit and reconnects keep running, the visible symptom is often "the node did not restart, and after a while it was fine again".**
FILE:experience/BUG03.md
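# BUG03: "Sync death" of the route table on a single EMQX cluster node: persisted data × broken mria incremental replication × restart does not heal
## Case summary
Devices/clients keep sending MQTT messages (e.g. `thing/product/<sn>/register`), but the service subscribed on another node (access-service) **receives nothing**; at the business level "the device never manages to register and never gets a register_reply". `cluster status` looks normal (all nodes `running`), ACLs are open by default, and test publishes from healthy nodes work. **Only messages published through one specific node are dropped locally by that node** (`dropped.no_subscribers` is abnormally high), and that node's route table has far fewer entries than the others.
The root cause is that on that EMQX (5.x) node the **mria (mnesia replication layer) route table is out of sync with the rest of the cluster**: the node's view contains only its own local clients' subscriptions and not those on remote nodes, so cross-node publishes are judged to have "no subscribers" and dropped. **Simply restarting the node does not fix it**, because the mnesia data persisted on hostPath makes the node "believe it is an existing member" and skip the full bootstrap; a full resync has to be forced with `cluster leave` + `cluster join`.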
---
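## Symptom / feature quick reference (for matching)
> When **4 or more** of the features below match, this case is highly likely; with **2-3** matches it is still worth investigating along these lines first.
### Business-level symptoms
- [ ] A device publishes MQTT messages (register / heart / report, etc.) and the subscribing service has **no log at all** of receiving them
- [ ] For the same device and message type, delivery **sometimes works and sometimes does not** (depending on which broker node the LB hashes the device/client to)
- [ ] Other topics on the service side (osd / events / state) **keep receiving messages**; only certain topics do not → easily mistaken for "a code problem with this topic"
- [ ] Reconnecting the client (disconnect → reconnect) sometimes "suddenly fixes it", but the user cannot find a pattern afterwards
- [ ] Application-level ACL / authentication / topic spelling have all been checked and are fine
### Broker-side metric features (most diagnostic!)
- [ ] `emqx_ctl broker metrics` shows that on **a single node** the share of `messages.dropped.no_subscribers` among received messages is **abnormally high** (>30%)
- [ ] On the other nodes the same metric is **almost 0** or very low (<1%)
- [ ] The faulty node's `messages.received` is not 0, so the publish did reach the broker
- [ ] An HTTP API publish to the faulty node explicitly returns `{"message":"no_matching_subscribers","reason_code":16}`
- [ ] The same message published to a healthy node reaches the subscriber immediately
### Cluster route-table features
- [ ] `emqx_ctl cluster status` shows **every node running and no stopped node** (everything looks fine!)
- [ ] `emqx_ctl topics list | wc -l` gives **hugely different** results on different nodes (e.g. 3 vs 41)
- [ ] The faulty node's `topics list` contains only subscriptions of **its own clients**
- [ ] A healthy node's `topics list` shows subscriptions of clients on **all nodes** (including cross-node wildcard subscriptions)
- [ ] `emqx_ctl mnesia` lists all three nodes under `running db nodes` (mria metadata looks in sync)
### Deployment / architecture features
- [ ] EMQX is deployed as a **cluster** (3 nodes or more)
- [ ] Deployed as a **k8s StatefulSet** with the mnesia data directory (`/opt/emqx/data/mnesia/...`) persisted on **hostPath / PV**
- [ ] The Service is a plain ClusterIP/LoadBalancer, and clients are hashed by the LB to any node
- [ ] Subscribers (service side) and publishers (devices) are **not necessarily connected to the same node**
- [ ] At least one node in the cluster has been through an **abnormal restart / OOM / network jitter**
### Key log/metric fingerprints
```text
# Broker metrics on the faulty node (abnormal)
messages.dropped : 320863
messages.dropped.no_subscribers: 320863 # ~64% of received
messages.publish : 501240
messages.received : 501240
# Broker metrics on a healthy node (normal)
messages.dropped : 1358
messages.dropped.no_subscribers: 1358 # ~0.03% of received
messages.received : 4079734
# Response of an HTTP API publish to the faulty node
{"message":"no_matching_subscribers","reason_code":16}
# topics list on the faulty node (incomplete)
thing/product/SN-XXX/register -> emqx-1 # only this node's own client subscriptions
(missing the wildcard subscription thing/product/+/register -> emqx-2)
# topics list on a healthy node (complete)
thing/product/+/register -> emqx-2 # the wildcard subscription is visible
thing/product/+/osd -> emqx-2
thing/product/+/state -> emqx-2
... dozens of entries
```
### Exclusions
- If `cluster status` shows `stopped_nodes` → not this case; recover the node itself first
- If `dropped.no_subscribers` is high on **all nodes** → not this case; the subscribers really are not up or have all crashed
- If the faulty node's `topics list` is **identical** to a healthy node's (same count, same content) → not this case; check the subscriber side (whether the client has disconnected)
- If it is a **shared subscription `$queue/`** and only one group member has crashed → not this case; it is a group-membership problem
- If **no topic at all** gets through (including clients on the same node talking to each other) → not this case; the node itself is completely broken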
---
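## Detailed explanation / root-cause chain
### The three layers
| Layer | What it does | Does it cause the bug on its own? |
|---|---|---|
| **Cluster** | EMQX 5.x uses mria (mnesia + a replication layer) to synchronize cluster metadata; the route table `emqx_route` is of type `ram_copies` and relies on mria for real-time replication | Not a bug (standard design) |
| **Persistence** | Node mnesia metadata is persisted to hostPath `/opt/emqx/data/mnesia/<node>/`, and the node reads its local metadata on restart | Not a bug (it preserves cluster membership) |
| **Sync link** | After a restart / network jitter / OOM, the mria incremental replication link is broken, but **the node's local data makes it "believe it is already in the cluster" and skip the full bootstrap**; later increments cannot fill the gap | **This is the real bug** |
### Why `cluster status` looks healthy
- `cluster status` checks **ekka membership** (built on Erlang distribution connectivity, i.e. what `erlang:nodes()` reports)
- Erlang RPC between the nodes works → ekka considers membership healthy → the `running_nodes` list is complete
- But **mria's route-table replication is a separate mechanism**: based on `mria_lib:rpc_to_core_node/3` + listening to `mnesia` events
- A broken mria replication link is not reflected in `cluster status`, while the route table stays frozen at the snapshot taken when the link broke (or worse, holds only local data)
### The faulty node's "pseudo-synced" state
```text
View from node emqx-0:
├─ ekka members: [emqx-0, emqx-1, emqx-2] ✓ "I am in the cluster"
├─ mria running_nodes: [emqx-0, emqx-1, emqx-2] ✓ "all db nodes are present"
└─ contents of emqx_route:
thing/product/SN-X/register -> emqx-0 ← subscribed by a local client (written by this node)
thing/product/SN-Y/osd -> emqx-1 ← a few entries that happened to sync over
(many wildcard subscriptions and other nodes' client subscriptions are missing)
View from nodes emqx-1 / emqx-2:
└─ contents of emqx_route: complete, 41 entries
thing/product/+/register -> emqx-2 ← access-service's subscription
thing/product/+/osd -> emqx-2
... plus the subscriptions of clients on all nodes
```
### Publish routing decision
```text
client publish thing/product/SN-X/register (QoS 1) → emqx-0
│
├─► emqx-0's broker module queries the local emqx_route table
│ └─► tries to match the wildcard subscription thing/product/+/register
│ └─► ❌ the table does not contain this subscription!
│
├─► emqx-0 decides no_matching_subscribers
│ ├─► drops the message
│ ├─► messages.dropped.no_subscribers++
│ └─► the HTTP API returns {"message":"no_matching_subscribers","reason_code":16}
│
└─► Device / MQTTX view: publish succeeded (the broker replied PUBACK; QoS 1 is fine at the network level)
but register_reply never arrives (the message never reached access-service)
```
### Why "restarting the pod does not fix it"
This is the nastiest part of this case:
1. **mnesia data is persisted on hostPath**
After the pod restarts, `/opt/emqx/data/mnesia/<node>/` is still there
2. **The node reads its local mnesia schema on startup**
It finds that it is already a cluster member with historical data → skips the full bootstrap and enters "incremental sync" mode
3. **But the incremental sync link itself is what is broken** (which is what caused the failure in the first place)
→ the route table stays incomplete
4. On the surface the node starts normally, `cluster status` shows running, and everything looks fine → **the bug persists**
The correct repair has to **explicitly leave the cluster and join again** to trigger a full mria bootstrap:
```text
emqx_ctl cluster leave
└─► mria actively tears down the replication link and clears the local route table
emqx_ctl cluster join <other-node>
└─► mria triggers a full bootstrap: pulls a complete snapshot of every table from a core node
```
### Why "a client reconnect sometimes fixes it"
- If the LB hashes the reconnecting client to a healthy node → the publish is routed by a healthy node → works
- If it is hashed to the faulty node again → the publish is still dropped
- The user perceives "it works when we are lucky" and cannot reproduce it reliably → misdiagnosed as a client problem or network jitter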
---
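## Investigation methodology
### Core techniques used
| Technique | Purpose | When to use |
|---|---|---|
| **Cross-node metric comparison** | Run `emqx_ctl broker metrics` on every node and compare the `dropped.no_subscribers` ratio | Always the first step; pays off within 5 minutes |
| **Cross-node route-table count comparison** | `emqx_ctl topics list \| wc -l` to see how route-table size differs per node | Strongly points to a single-node sync problem |
| **Cross-node publish experiment** | Publish the same message through every node via the EMQX HTTP API and see which one fails | The "gold standard" for isolating the faulty node |
| **HTTP API publish error-code recognition** | `no_matching_subscribers` (reason_code 16) is a strong signal of an incomplete route table | Suspect this case as soon as this code appears |
| **Client online metrics + subscription check** | `clients/{id}/subscriptions` shows the topics and QoS the subscriber actually holds | To rule out "the subscriber never subscribed" |
| **mnesia state audit** | `emqx_ctl mnesia` shows the state of each mria table and the node roles (core / replicant) | When an mria sync problem is suspected |
### Diagnostic steps (in order)
#### 1. Confirm the symptom on the business side (5 min)
- grep the target topic in the subscribing service's entry logger (e.g. access-service's `InboundMessageRouter`)
- Confirm that **certain topics receive nothing at all** while **other topics keep receiving**
- This step rules out "the whole service is down"
#### 2. Cross-node broker metric comparison (5 min)
```bash
# Run on every emqx node:
for i in 0 1 2; do
  echo "=== emqx-$i ==="
  kubectl exec -n cuav-cloud cuav-base-emqx-$i -- \
    emqx_ctl broker metrics 2>/dev/null | grep -E 'dropped|received|publish' | head -10
done
```
Look at the ratio `messages.dropped.no_subscribers / messages.received`:
- All nodes <1% → not this case
- **One node >30% while the others are close to 0%** → strongly points to this case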
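The ratio rule above can be expressed as a small check over the per-node counters. The sketch below is illustrative: `DropRatioCheck` is a hypothetical helper, the counters are passed in as `{droppedNoSubscribers, received}` pairs, and the thresholds are parameters rather than fixed values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class DropRatioCheck {
    /** Share of received messages that were dropped for lack of subscribers. */
    static double dropRatio(long droppedNoSubscribers, long received) {
        return received == 0 ? 0.0 : (double) droppedNoSubscribers / received;
    }

    /**
     * Returns the nodes whose ratio exceeds `suspectAbove`, but only when at least one
     * other node is below `healthyBelow`; if every node is high, the cause lies elsewhere.
     */
    static List<String> suspects(Map<String, long[]> countersByNode,
                                 double suspectAbove, double healthyBelow) {
        List<String> high = new ArrayList<>();
        boolean anyHealthy = false;
        for (Map.Entry<String, long[]> e : countersByNode.entrySet()) {
            double ratio = dropRatio(e.getValue()[0], e.getValue()[1]);
            if (ratio > suspectAbove) {
                high.add(e.getKey());
            } else if (ratio < healthyBelow) {
                anyHealthy = true;
            }
        }
        return anyHealthy ? high : List.of();
    }
}
```

Fed with the fingerprint numbers from this case (320863 of 501240 on one node, 1358 of 4079734 on another) and thresholds of 0.30 and 0.01, it flags only the first node.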
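#### 3. Cross-node route-table count comparison (2 min)
```bash
for i in 0 1 2; do
  echo "=== emqx-$i topics count ==="
  kubectl exec -n cuav-cloud cuav-base-emqx-$i -- \
    emqx_ctl topics list 2>/dev/null | wc -l
done
```
If the counts differ by more than 5 (especially an order-of-magnitude gap such as 3 vs 41) → **this case is essentially confirmed**
#### 4. Cross-node publish experiment (gold standard, 10 min)
Use the EMQX HTTP API (`/api/v5/publish`), **addressing each node's dashboard port directly**, to publish the same message (same topic / same payload, with a unique tid to tell them apart), and grep for the tid on the subscriber side at the same time:
```bash
# Reach each pod directly through the headless service
for i in 0 1 2; do
  EMQX_HOST="cuav-base-emqx-$i.cuav-base-emqx-headless.cuav-cloud.svc.cluster.local"
  TOKEN=$(curl -sS -X POST "http://$EMQX_HOST:18083/api/v5/login" \
    -H 'Content-Type: application/json' \
    -d '{"username":"admin","password":"<dashboard-pwd>"}' \
    | sed -E 's/.*"token":"([^"]+)".*/\1/')
  echo "--- publish to emqx-$i ---"
  # The payload below is base64 for "TEST"; substitute one that carries your unique tid
  curl -sS -X POST "http://$EMQX_HOST:18083/api/v5/publish" \
    -H "Authorization: Bearer $TOKEN" \
    -H 'Content-Type: application/json' \
    -d '{"topic":"<target-topic>","payload":"VEVTVA==","payload_encoding":"base64","qos":1,"retain":false}'
done
```
Read the responses:
- `{"id":"..."}` → the message was routed
- `{"message":"no_matching_subscribers","reason_code":16}` → **that node's route table is incomplete; this is hard evidence**
Then grep for the tid in the subscriber's log to confirm which nodes' publishes really arrived.
#### 5. Audit the faulty node's route table directly (5 min)
```bash
# On the faulty node:
emqx_ctl topics list | head -50
emqx_ctl topics list | grep -E 'thing/product/\+/' # look at wildcard subscriptions
```
Compared with a healthy node:
- The faulty node lacks most wildcard subscriptions
- The faulty node has only a few subscriptions from its own clients
#### 6. Audit the mria state (10 min)
```bash
emqx_ctl mnesia
```
Pay attention to:
- Whether `running db nodes` is complete
- Whether `master node tables` is empty (empty is normal)
- The `ram_copies` node list of the `emqx_route` table
The mria state usually looks "normal", which is exactly what makes this bug deceptive.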
---
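## Fix
### Root fix (P0, required)
**Run cluster leave + cluster join on the faulty node to force a full mria resync**:
```bash
# On the faulty node (e.g. emqx-0):
emqx_ctl cluster leave
# Output: Leave the cluster successfully.
# At this point the node's cluster status lists only itself
emqx_ctl cluster join cuav-base@cuav-base-emqx-1.cuav-base-emqx-headless.cuav-cloud.svc.cluster.local
# Output: Join the cluster successfully.
# mria triggers a full bootstrap and pulls every table from a core node
```
**Key points**:
- `cluster leave` does not delete the local data files, but it makes the node "forget that it is an existing member"
- On the subsequent `join`, mria performs a full bootstrap and the route table is copied in full from a core node
- Impact: clients connected to that node disconnect and reconnect once (about 1-2 seconds); after reconnecting, the LB may place them on any node
**What not to do**:
- ❌ A bare `kubectl delete pod`: the persisted data means a restart does not fix it
- ❌ Deleting `/opt/emqx/data/mnesia/<node>/` and restarting: risky, configuration may be lost (the authentication tables, admin accounts, etc. are disc_copies)
- ⚠️ Last resort only: `cluster leave` first, stop the pod, clear the hostPath, then start again → equivalent to joining as a brand-new node
### Fallback / defense (P1, recommended)
1. **Add monitoring and alerts (most important)**
- Per-node ratio `messages.dropped.no_subscribers / messages.received` (alert above 5%)
- Pairwise difference of `emqx_ctl topics list | wc -l` between nodes (alert when the difference exceeds 5)
- These two metrics were the only early-warning signals found reliable for this case; `cluster status` does not reflect the problem at all
2. **Client-side automatic reconnect + backoff**
If a node's route table is incomplete, a reconnecting client has a chance of landing on a healthy node, which softens the business impact
3. **ACK verification for critical publishes**
For important business messages (such as register), wait for the reply after publishing, and alert/retry on timeout
### Hardening / cleanup (P2, long term)
1. **Upgrade EMQX**
5.1.3 is a relatively early 5.x release, and mria synchronization received many stability fixes in later versions (5.4+). Evaluate an upgrade to 5.4+ / 5.6+
2. **Turn the procedure into a runbook**
Capture "incomplete route table diagnosis + leave/join repair" as a runbook so that operations can handle it independently within 5 minutes
3. **Avoid hostPath**
Switch to a PVC + a standard StorageClass. With hostPath, data can be lost or mismatched when a pod moves to another node, which aggravates mria sync problems
4. **Audit the cluster topology**
Evaluate whether to split core/replicant roles (supported in 5.x), to avoid an overly complex sync mesh when every node is a core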
---
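## Prevention checklist
### EMQX deployment / operations
- [ ] Use a PVC rather than hostPath when deploying an EMQX 5.x cluster
- [ ] Monitor the `messages.dropped.no_subscribers` share; alert when a single node exceeds 5%
- [ ] Monitor the difference in `topics list` counts between nodes; alert when it exceeds 5
- [ ] Do not rely on `cluster status` alone to judge cluster health (it does not reflect mria sync state)
- [ ] After an abnormal node restart / OOM / network jitter, proactively run a cross-node publish check
### When a service integrates with EMQX
- [ ] Configure the service-side MQTT client with automatic reconnect + exponential backoff
- [ ] Add ACK verification to critical publishes (register / command delivery); do not fire-and-forget
- [ ] Track "messages received in the past N minutes" for critical topics on the service side; alert when it stays at 0 for N minutes
### Review / release
- [ ] "Devices keep publishing to one topic but the service receives nothing, other topics are fine" → suspect broker routing immediately
- [ ] "Restarting a broker node did not help" → stop restarting; run the cross-node metric comparison
- [ ] "A client reconnect fixes it, but it comes back" → strongly suspect one node's route table; run the cross-node publish experiment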
---
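## Playbook for similar intermittent bugs
For symptoms such as "an MQTT device published but the service did not receive it / delivery works on and off / some topics work and some do not", proceed in this order:
### Step 1: Determine on the business side which kind of "not received" it is (5 min)
grep the target topic in the subscribing service's entry logger (the one that prints every received topic):
- Exactly 0 entries → the message never reached the service; the problem is in broker routing / the subscription chain
- Some are received → the problem is more likely on the publisher side (QoS, disconnects, retransmission, etc.)
### Step 2: Cross-node broker metric comparison (5 min)
Run `emqx_ctl broker metrics` on every node and compare the `dropped.no_subscribers` ratio:
- Abnormally high on a single node → **strongly points to this case** (route-table sync failure)
- Low on all nodes → the problem is not in broker routing; check the subscriber
### Step 3: Cross-node route-table count comparison (2 min)
Run `emqx_ctl topics list | wc -l` on every node. A huge difference → **this case is essentially confirmed**.
### Step 4: Cross-node publish experiment to identify the faulty node (10 min)
Publish the same message through every node via the HTTP API and see which one returns `no_matching_subscribers`.
### Step 5: Repair with cluster leave + join (2 min)
On the faulty node: `emqx_ctl cluster leave` → `emqx_ctl cluster join <healthy-node>`.
### Step 6: Verify the repair with another cross-node publish (5 min)
Repeat Step 4 and confirm that publishes through every node succeed.
### Step 7: Add alerts + write the postmortem (30 min)
- Add monitoring metrics (drop ratio, topic-count difference)
- Write the incident postmortem and capture the investigation as a runbook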
---
## References
### K8S Resources Related to This Case (project: cuav-cloud)
- StatefulSet: `cuav-base-emqx` (namespace `cuav-cloud`)
- Pods: `cuav-base-emqx-0` / `cuav-base-emqx-1` / `cuav-base-emqx-2`
- Services: `cuav-base-emqx` (ordinary ClusterIP) + `cuav-base-emqx-headless` (headless, direct per-pod access)
- Persistence: `hostPath: /opt/cloud/hostPath/emqx-ha/emqx-data`
- EMQX version: 5.1.3
- Dashboard port: 18083; MQTT port: 1883
### Related Service-Side Code
- Unified entry point for inbound messages (to confirm whether a message truly was not received): `cuavcloudcbb/CuavCloudBasicCBB/AccessAdapterCBB/src/main/java/com/cuav/access/adapter/mqtt/InboundMessageRouter.java`
- MQTT call log: `/logs/ms_log/cuav-cloud-access-service/mqtt_call.log`
- Logging config: `cuavcloudopensource/CuavCloudOpenSource/OpenSourceDependency/src/main/resources/logback-spring.xml` (`MQTT_LOG` appender)
### Key EMQX Command Cheat Sheet
```bash
# Cluster membership (unreliable: still looks normal when mria sync is broken)
emqx_ctl cluster status
# mria state (check how ram_copies / disc_copies are distributed across nodes)
emqx_ctl mnesia
# Routing table (the key diagnostic indicator)
emqx_ctl topics list | wc -l
emqx_ctl topics list | grep '<topic-pattern>'
# Broker-wide metrics
emqx_ctl broker metrics | grep -E 'dropped|received|publish'
# Client subscription list
curl -H "Authorization: Bearer $TOKEN" \
  http://emqx:18083/api/v5/clients/<clientid>/subscriptions
# HTTP API publish (bypasses the client and goes straight through the broker's internal routing)
# The JSON Content-Type header is added because curl -d defaults to form encoding
curl -X POST http://emqx:18083/api/v5/publish \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"topic":"...","payload":"<base64>","payload_encoding":"base64","qos":1}'
# Repair commands (run on the faulty node)
emqx_ctl cluster leave
emqx_ctl cluster join <healthy-node-name>
```
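### Key Concepts
- EMQX 5.x mria (mnesia + replication layer) architecture
- mnesia `ram_copies` vs `disc_copies` vs `null_copies`
- Cluster-wide replication mechanism of the `emqx_route` table (routing table)
- core node vs replicant node roles
- ekka cluster membership management vs mria data replication (two independent mechanisms)
- `no_matching_subscribers` (reason_code 16) returned by HTTP API publish
- Side effects of StatefulSet hostPath persistence on cluster consistency
### One-Sentence Summary
> **The EMQX cluster "looks healthy" (`cluster status` shows every node running, ACL fully open), yet every publish on one node is swallowed while the other nodes work normally. The hard evidence is `dropped.no_subscribers` above 30% on a single node together with an abnormally low `topics list` entry count. This is not a connection problem, not a subscription problem, and not an ACL problem: the mria routing table on that node has fallen out of sync with the cluster. Restarting the pod does not help, because the persisted data makes the node skip the full bootstrap; only `cluster leave` + `cluster join` forces a resync and repairs it.**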
---
name: bug-pattern-diagnosis-en
description: Diagnoses bugs by matching user-reported symptoms against a curated library of past bug cases under experience/, using prior cases as memory references (never as copy-paste answers). Accumulates new cases after each successful investigation. Use when the user reports "weird error / intermittent failure / only reproduces in specific environments / strange logs".
---
# Bug Pattern Diagnosis (Symptom-Based Triage & Experience Accumulation)
## Skill Purpose
This skill works **like a doctor diagnosing a patient**:
1. **Collect symptoms** — Ask the user to describe the bug (error messages, reproduction rate, environment, log patterns, etc.)
2. **Recall past cases** — Look up similar bugs in the `experience/` folder as **reference memory**
3. **Diagnose independently** — Investigate using the current project's context. Past cases are only **sources of inspiration and hypotheses**, never reused as-is
4. **Accumulate new experience** — After successfully identifying a new bug, save it as a new `BUGxx.md` for future **reference**
**Core value**: Let past troubleshooting experience **inspire** the direction of new investigations — but never replace independent thinking. Similar symptoms do not guarantee the same root cause. Logs that look identical may hide completely different failures.
## Iron Rule: Experience Is Reference, Not the Answer
> **This is the single most important principle of this skill.**
The `BUGxx.md` files under `experience/` are **an old doctor's case notebook**, not **a prescription template**. When examining a new patient:
- ✅ **What you MAY do**: Use the notebook to learn "what conditions are worth suspecting given these symptoms", "what methods have worked before", "what traps are easy to fall into"
- ❌ **What you MUST NOT do**: Copy the root cause, copy the fix, or copy code changes just because the symptoms look similar
### Why Direct Reuse Is Forbidden
1. **Symptoms can match, root causes may not**: "NPE + intermittent + multi-replica" may be a missing header in one case, a Redis connection pool flake in another, or a GC stop-the-world race in yet another — or **multiple overlapping problems**.
2. **Projects differ wildly**: The same code pattern behaves differently across versions, configs, and dependencies.
3. **AI misdiagnosis is expensive**: Copy-pasting a case misleads the user into changing code based on a wrong premise, masking the real bug.
4. **The value of experience is in *prompting thought*, not *giving answers***: Good doctors read case files to broaden their thinking, not to copy-paste treatment plans.
### Correct Usage Posture
| Situation | Wrong approach | Right approach |
|---|---|---|
| Symptoms match closely | "This is BUG01, modify `invokeRemoteDeviceOpt` and add `x-token-payload`" | "Your description reminds me of BUG01 — one of its signatures was asymmetric logs across replicas. **Before proceeding**, can you verify: do the failing requests actually produce 'stack trace on one replica, one-liner on another'?" |
| Partial feature match | "Not exactly, but the BUG01 fix should still work" | "There's a **diagnostic trick from BUG01** that might apply here — send 100 requests in a row and see if the success rate is ≈ `1/N`. Let's validate that first." |
| Case contains specific code | Paste BUG01's fix and ask user to apply | "BUG01's fix idea was propagating headers, but for your project you need to first confirm: (1) What headers does your receiver actually depend on? (2) Is the serialization identical to the gateway's? We'll need these answers before writing any code." |
## Case Library Structure
```
bug-pattern-diagnosis-en/
├── SKILL.md              ← This file (purpose, flow, matching rules)
└── experience/           ← Case library
    ├── BUG01.md          ← One document per case
    ├── BUG02.md
    └── ...
```
Each `BUGxx.md` follows a fixed structure for fast lookup:
1. **Case summary** (one paragraph)
2. **Symptom / signature checklist** (like "positive findings" in a medical case — used for matching)
3. **Detailed explanation** (pathology / root-cause chain)
4. **Diagnostic methodology** (investigation flow / key techniques)
5. **Remediation plan** (fix + safety net + hardening)
6. **Prevention checklist** (to avoid recurrence)
7. **Playbook for similar intermittent bugs** (step-by-step for analogous cases)
## When to Activate This Skill
Prefer this skill when the user describes issues like:
- "This endpoint **sometimes** errors and sometimes works"
- "Reproduces in production but not locally/in testing"
- "The error in the logs looks weird / doesn't match the code / the stack points nowhere useful"
- "Multiple pods / instances / replicas / machines behave inconsistently"
- "I clearly sent xxx, but the server says xxx is missing"
- "Occasional timeouts / occasional 500s / occasional permission denied"
- Anything framed as "**cross-layer / intermittent / non-deterministic**"
## Core Flow
### Step 1: Read the Case Library Index
Read the **symptom / signature checklist** section from **all** `BUG*.md` files under `experience/` (the first 30-50 lines of each file typically suffice). Do not read full files upfront.
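One way to read only the checklist sections, sketched under the assumption that every case file follows the fixed structure and opens that section with a `## Symptom` heading. The helper name `symptom_section` is hypothetical:

```bash
# Print only the symptom / signature section of one case file: from its
# "## Symptom" heading up to, but not including, the next "## " heading.
# "### " subheadings inside the section are kept.
symptom_section() {
  awk '
    /^## Symptom/        { in_section = 1; print; next }
    in_section && /^## / { exit }
    in_section           { print }
  ' "$1"
}

# Example: build a compact index of the whole library
#   for f in experience/BUG*.md; do echo "== $f"; symptom_section "$f"; done
```

This keeps the initial read independent of how long each case file is, rather than relying on a fixed line count.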
### Step 2: Structure the Symptoms
Extract **key features** from the user's description, covering at least these dimensions:
| Dimension | Examples |
|---|---|
| Error signal | NPE / 500 / 403 / timeout / data corruption / deadlock |
| Reproduction rate | 100% / 50% / sporadic / specific conditions |
| Environment delta | Doesn't repro locally? repros in test? in prod? |
| Multi-instance traits | Single replica / multiple? how many? |
| Log distribution | Concentrated on one instance / scattered / one has stack one doesn't |
| Triggering conditions | Specific user / specific params / specific time window |
| Recent changes | New release? config change? scale up/down? dependency upgrade? |
### Step 3: Recall Cases (Not Match Conclusions)
Compare the structured features against each case library entry for **similarity**, but **never** treat the result as a conclusion. No matter how high the similarity, a case is only a **"prior experience worth consulting"**, not a **"confirmed answer"**.
Similarity handling:
- **High overlap** (3+ key features match) → Treat the case as a **"priority suspicion direction"**, but guide the user to **independently verify** whether those key features actually hold
- **Partial overlap** (1-2 features match) → Treat the case as a **"possible source of ideas"**; mention that one or two of its diagnostic tricks may apply
- **No match** (0 features) → Go to the general methodology and investigate from scratch
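The three tiers above amount to a small decision rule. A sketch, with a hypothetical helper name `recall_tier` that takes the number of matching key features:

```bash
# Map the number of matching key features to the handling tier described above.
recall_tier() {
  matches="$1"
  if [ "$matches" -ge 3 ]; then
    echo "high overlap: priority suspicion direction, verify its key features independently"
  elif [ "$matches" -ge 1 ]; then
    echo "partial overlap: possible source of ideas, borrow one or two diagnostic tricks"
  else
    echo "no match: use the general methodology"
  fi
}
```

Whatever tier comes out, the result selects a direction to investigate and never a conclusion.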
### Step 4: Offer Investigation Suggestions (Not Diagnostic Conclusions)
**Do not** assert "this is BUGxx". Use the tone of "past cases had similar traits, here are **possible directions and ways to verify**".
Recommended response structure:
- 🧭 **Restate the structured symptoms** (confirm you understood correctly)
- 💭 **Experience reference**: Mention 1-2 relevant cases briefly ("this reminds me of BUGxx which had a similar trait X") — but **do not paste root cause or fix code**
- 🔬 **Independent verification steps**: List 2-3 quick checks the user can run to validate hypotheses (success rate over 100 requests / cross-replica log comparison / header packet capture, etc.)
- 🎯 **Project-specific investigation path**: Based on the user's project structure and code style, give **customized** next steps — not the generic steps from the case
- ⚠️ **State uncertainty clearly**: Say plainly "this is just a hypothesis based on similar symptoms; the real root cause needs your verification"
### Step 4 — Negative Example (DO NOT OUTPUT)
```
❌ "Based on your symptoms, this is BUG01 (multi-replica intermittent NPE).
The root cause is that invokeRemoteDeviceOpt drops x-token-payload.
Apply this fix: [pastes BUG01's code directly]"
```
### Step 4 — Positive Example
```
✅ "A few things in your description stand out:
1) Intermittent failure with success rate near 50%
2) The error looks token-related, but the client clearly sent a token
This reminds me of a past case (BUG01) where similar symptoms were
caused by 'multi-replica internal calls dropping an identity header'.
Your project isn't necessarily the same thing, so **let's validate
a couple of hypotheses first**:
Q1: How many replicas does your service have? Does the success rate
approximate 1/N?
Q2: Can you tail logs from all replicas simultaneously and see if
the failing request leaves traces on multiple replicas with
asymmetric detail levels?
Run those two checks, share the results, and we'll pick a path."
```
### Step 5: Accumulate a New Case (Optional)
Proactively **ask the user** whether to capture this investigation as a new case when:
- No case in the library matched, but the bug was successfully identified
- An existing case partially matched, but there's a significantly new variant
- A new diagnostic technique emerged during investigation
With user consent, create `BUGxx.md` (numbering = current max + 1) following the **case document template** below.
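The "current max + 1" numbering can be sketched as follows, assuming case files are named `BUG<digits>.md` and numbers are zero-padded to two digits as in `BUG01.md`. The helper name `next_bug_file` is hypothetical:

```bash
# Print the file name for the next case: highest existing number + 1.
# $1 (optional): case directory, default "experience".
next_bug_file() {
  dir="${1:-experience}"
  # Strip the prefix, leading zeros and suffix, then take the largest number.
  max=$(ls "$dir" 2>/dev/null \
        | sed -n 's/^BUG0*\([0-9][0-9]*\)\.md$/\1/p' \
        | sort -n | tail -n 1)
  printf 'BUG%02d.md\n' $(( ${max:-0} + 1 ))
}

# Example:
#   next_bug_file experience
```

Stripping leading zeros before the arithmetic avoids numbers such as `08` being treated as invalid octal values.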
## General Methodology (When No Case Matches)
Investigate unknown bugs in this order:
### 1. Quantify the Symptom
- Success rate? Fire 100 consecutive requests and tally
- Reproduction conditions? Can a minimal case reproduce reliably?
- Time-of-failure distribution? Specific time window / user / machine?
### 2. Inspect Deployment Topology
- How many replicas? Does success rate ≈ `1/N` or `(N-1)/N`?
- Can a single replica reproduce? If not → strong signal of inter-replica inconsistency
- Any canaries / gray releases? Is there a mix of old and new versions?
### 3. Cross-Instance Log Comparison
- By traceId / time window, pull logs from **all relevant instances** and view side-by-side
- Look for cases where "the same request leaves dislocated evidence across instances"
- Watch for **asymmetric verbosity** (one instance has full stack, another has a one-liner)
### 4. Source vs Deployed Binary Parity
- `javap -c -p` the deployed jar's key classes and diff against local source
- Verify image tag, git commit hash match expectations
- Check ConfigMaps / env vars are identical across all replicas
### 5. Protocol Boundary Audit
If you suspect context propagation issues:
- Draw the full lifecycle of "identity / traceId / MDC / ThreadLocal"
- Mark every "cross-thread / cross-process / cross-instance" boundary
- For each boundary, does an explicit pack → unpack mechanism exist?
- Use tcpdump / print `getHeaderNames()` at the receiver to verify what headers actually arrive
### 6. Compare Against Sibling Code
- Does the project contain **similar code** that does **not** have this bug?
- Diff the buggy code vs the healthy code — the delta is often the culprit
## Case Document Template (for new BUG*.md files)
```markdown
# BUGxx: <One-line title emphasizing the most distinctive symptom>
## Case Summary
<One paragraph, under 200 words: phenomenon, reproduction conditions, root-cause type, blast radius>
## Symptom / Signature Checklist (for matching)
> If N+ of these hold simultaneously, high probability of this case
- [ ] Feature 1 (concrete, verifiable)
- [ ] Feature 2
- [ ] Feature 3
- [ ] ...
### Key Log Fingerprints
<Paste typical error log snippets so future investigations can grep-match them>
### Exclusion Criteria (Negative Signals)
- If <xxx> appears, this is NOT this case
## Detailed Explanation / Root Cause Chain
<The pathology. Use diagrams / tables / code references to show data flow, control flow, state transitions>
## Diagnostic Methodology
### Techniques Used
- <Technique 1, e.g. "cross-replica log comparison">
- <Technique 2, e.g. "javap bytecode diff">
- ...
### Diagnostic Steps (in order)
1. ...
2. ...
3. ...
## Remediation Plan
### Root-Cause Fix
<The core fix, with code reference, impact assessment>
### Safety Net / Defense in Depth
<Secondary defenses so a missed fix still doesn't crash>
### Hardening / Cleanup
<Long-term improvements, coding standards, sibling-code sweeps>
## Prevention Checklist
During development:
- [ ] ...
During deployment:
- [ ] ...
During review:
- [ ] ...
## Playbook for Similar Intermittent Bugs
When symptoms look similar, walk through:
1. Quantify reproduction (10 min)
2. Check topology (5 min)
3. Cross-replica log diff (30 min)
4. ...
## References
- Relevant file paths
- Relevant PRs / commits
- Relevant wiki
```
## Output Style Constraints
- **Lead with direction, not with the conclusion**: Tell the user what's worth suspecting and why, instead of asserting "this is bug X"
- **Never speak in 100% certain terms**: Similar symptoms ≠ same root cause. Use phrases like "looks like / worth suspecting / could be / past cases suggest"
- **Recommendations must be executable**: Provide concrete commands or quick checks, not abstract theory
- **Make the user verify before coding**: Better to ask 2 "please confirm..." questions than to let them edit code based on a guess
## Hard Rules (Forbidden Actions)
- ❌ **Never paste the fix code from a case directly**: The code in `BUGxx.md` is "how that project fixed it", not necessarily right for the current project
- ❌ **Never assert "this is BUGxx"**: At most say "this reminds me of BUGxx" / "has similar traits to BUGxx"
- ❌ **Never skip independent verification**: Even if 5 of 5 features match, run at least 1-2 quick checks before discussing fixes
- ❌ **Never copy-paste the case's prevention checklist or playbook wholesale**: These are **thinking material**; tailor and rewrite each time based on the current context
## Case Accumulation Rules
- Don't create a case just to create one — only capture truly novel value (new symptom combinations, new diagnostic techniques)
- Don't use generic bug taxonomy terms (OOM, deadlock) — use **symptom descriptions** to organize cases ("same request leaves dislocated evidence across replicas" matches better than "state consistency bug")
- Index the library by **phenomenon**, not by "tech stack" or "vulnerability class"
- End every case with **applicability boundaries** and **counter-examples** (what situations are NOT this case), to sharpen future identification
FILE:experience/BUG01.md
# BUG01: Intermittent Multi-Replica NPE — Stateful Business × Stateless Deployment × Dropped Identity Context
## Case Summary
A client sends a request with a valid token to an endpoint. Repeated requests succeed **roughly 50% of the time** and fail the rest with `errorCode: 10001, "service exception"`. The server runs multiple k8s replicas (Deployment + round-robin Service). The failing replica's logs contain a full NPE stack caused by `ChainContextHolder.getTokenPayload()` returning null, while the "entry replica" for the same request only has a one-line business error (no stack).
The root cause is the intersection of three layers:
1. **Application-layer stateful routing** (partition ownership)
2. **k8s stateless round-robin distribution**
3. **Internal forwarding protocol dropping the `x-token-payload` header**
**No single layer causes the failure on its own; it only appears where the three intersect.** Fixing the protocol layer alone is nevertheless sufficient to resolve it.
---
## Symptom / Signature Checklist (for matching)
> When **4+** of these features are present, this case is a high-probability match. With **2-3** features, this direction is still worth prioritizing for investigation.
### Client-Side Phenomena
- [ ] Same token, same body, fired repeatedly: success rate is **approximately `1/replicaCount`** or `(replicaCount-1)/replicaCount` (≈ 50% with 2 replicas)
- [ ] Failures return a business error code (e.g. 10001 / INTERNAL_SERVER_ERROR / "service exception") — not 401/403/timeout
- [ ] Single-replica local environment has **100% success** — cannot reproduce locally
- [ ] The error **looks like a token problem** ("unauthenticated / identity missing / user not found"), yet the client clearly sent a valid token
### Server-Side Log Features (MOST diagnostic!)
- [ ] A single client request leaves traces on **multiple replicas** in their logs (it should normally appear on just one)
- [ ] The different replicas' logs are **asymmetric in verbosity**:
  - Replica A: `ERROR` + full exception stack (NPE / ClassCastException / IllegalState against some low-level code)
  - Replica B: `ERROR` or `WARN` + a one-line short business error code, **no stack**
- [ ] On the replica that has a stack, the NPE points to code that **dereferences ThreadLocal / MDC / a context store** (e.g. `ChainContextHolder.getXxx()`, `RequestContextHolder.getXxx()`, `MDC.get()`)
- [ ] On the replica without a stack, the log looks like `response XXXX,XXX` (from RestTemplate/Feign handling an internal HTTP response)
### Deployment / Architecture Features
- [ ] Target service is a **k8s Deployment** with multiple replicas (not StatefulSet)
- [ ] The service uses **Kafka Consumer Groups** / **application-layer partition routing** / **in-memory cache with partition ownership**
- [ ] The code has `RestTemplate.exchange` / `Feign` calls that send HTTP **to this service's other pod IPs**
- [ ] There's some kind of "yellow pages table" (DB or Redis recording `partition → ip:port`)
### Code Features
- [ ] The receiver controller **directly dereferences** ThreadLocal/context (e.g. `ChainContextHolder.getTokenPayload().getAccount()`) without null-safety
- [ ] A "feign_token" / "system_token" literal placeholder exists for internal calls (e.g. `Authorization: Bearer feign_token`)
- [ ] Internal forwarding only sends `Authorization`, **missing** `x-token-payload` / `x-user-context` / `x-trace-id` etc.
### Key Log Fingerprints (directly greppable)
```text
# Replica A (receiver) — typical log
ERROR ... GlobalExceptionHandler - found system exception,null/<endpoint-path>
java.lang.NullPointerException: Cannot invoke "XxxPayload.getXxx()" because
the return value of "XxxContextHolder.getPayload()" is null
at <Controller>.<method>(<Controller>.java:NNN)
# Replica B (forwarder) — typical log
ERROR ... <Service> - <methodName> response <errorCode>,<errorMessage>
(that's it — no "at xxx.xxx" stack frames after it)
```
### Exclusion Criteria (Negative Signals)
- **100% failure on a single replica** → NOT this case; it's a plain code bug
- **All replicas show full stacks** → NOT this case; no cross-replica chain
- **Token actually expired** (logs contain `token expired` / `invalid signature`) → NOT this case
- **Failures concentrate within minutes of scale events / deploys** → Possibly a transient Kafka rebalance issue, not this case
---
## Detailed Explanation / Root Cause Chain
### The Three-Layer Decomposition
| Layer | What it is | Is it a bug on its own? |
|---|---|---|
| **Code layer** | Some business endpoint's service has partition routing logic: `MonitorCacheMgr.send(tbCode)` hashes `tbCode` to find the partition owner pod; non-owners forward over internal HTTP to the owner | Not a bug (legitimate design for stateful business) |
| **Deployment layer** | k8s Deployment with 2+ replicas; Service uses round-robin to distribute external traffic. Replicas have 50/50 hit rate regardless of business routing | Not a bug (standard k8s usage) |
| **Protocol layer** | Internal forwarding only carries `Authorization: Bearer feign_token` (literal placeholder); **does not carry `x-token-payload`**. Receiver's `AuthInterceptor` only honors `x-token-payload`; if missing, `ChainContextHolder.TokenPayload = null` | **This is the actual bug** |
### Request "Fission" Flow
The client sees 1 request; the server actually processes 2 HTTP calls:
```text
Client ──► Request #1 ──► k8s Service (round-robin)
                                │
                        50%     │     50%
                ┌───────────────┴───────────────┐
                ▼                               ▼
  Pod A (owner of tbCode)             Pod B (NOT owner)
  ──────────────────                  ──────────────────
  1. AuthInterceptor                  1. AuthInterceptor
     parses x-token-payload ✓            parses x-token-payload ✓
  2. Controller reads account ✓       2. Controller reads account ✓
  3. send(): I am the owner           3. send(): owner is Pod A
     → local call()                      → initiates Request #2 to Pod A
  4. business logic executes                    │
  5. Returns ApiResponse(0) ✅                  │ ❌ drops x-token-payload
                                                ▼
                                      Pod A as receiver:
                                      5. AuthInterceptor can't find
                                         x-token-payload
                                         → ChainContextHolder = null
                                      6. Controller dereferences .getAccount()
                                         💥 NPE
                                      7. GlobalExceptionHandler catches
                                         → logs full NPE stack (log ①)
                                         → returns ApiResponse(10001)
                                                ▲
                                                │ HTTP 200 + body{errorCode:10001}
                                                │
                                      Pod B receives response:
                                      8. errorCode != 0
                                         log.error("...response 10001,service exception")
                                         (log ②, one line, no stack)
                                      9. throw ServiceException
                                         → returns ApiResponse(10001) to client
```
### Why the Failure Is Intermittent
k8s Service round-robin picks the target replica for each request independently of token content:
- Lucky — hits the owner directly → no forwarding → 100% success
- Unlucky — hits a non-owner → forwarding triggered → dropped header → NPE → failure
### Why the Logs Are Asymmetric
The service's `GlobalExceptionHandler` sorts exceptions into two classes:
| Exception type | Semantics | Log level | Stack printed |
|---|---|---|---|
| `SystemException` / uncaught `Throwable` | Unexpected system error (NPE, SQL fault, etc.) | `ERROR` | ✅ Full stack |
| `ServiceException` / business exception | Business rule violation (expected error) | `WARN` or `ERROR` | ❌ No stack |
- **Pod A** (receiver): Actual NPE → SystemException path → **stack present**
- **Pod B** (forwarder): Just wraps `errorCode` into `ServiceException` → business exception path → **no stack**
### Why "Login Is Stateless" Doesn't Save You
The common confusion:
- **Auth layer (JWT token) is indeed stateless** — every pod can independently decode the token
- **Business layer is heavily stateful**, because:
  1. Kafka Consumer Group protocol: one partition can only be consumed by one consumer in the group → pods bind tightly to partitions
  2. High-frequency device data must aggregate in memory (thousands per second, Redis can't absorb it)
  3. Aggregate data lives only in the pod consuming that partition → queries must route to the owner
So the project has to hand-roll application-layer routing (partition yellow pages + pod-to-pod HTTP forwarding) on top of stateless k8s deployment — and this forwarding protocol is exactly where bugs breed.
---
## Diagnostic Methodology
### Core Techniques Used
| Technique | Purpose | When to use |
|---|---|---|
| **Quantified reproduction** | Fire 100 requests in a row, tally the success rate; confirm deterministic vs intermittent | Required first step |
| **Replica-count comparison** | Match success rate against `1/N`; directly points at inter-replica inconsistency | Immediately when intermittency is observed |
| **Cross-replica log correlation** | Pull logs from all replicas by traceId / time window, view **side by side** | When internal call chains are suspected |
| **Source vs deployed-binary diff** | `javap -c -p` the deployed jar, diff against local source | When wrong-version deployment is suspected |
| **Context-propagation lifecycle diagram** | Mark every cross-thread/process/instance boundary and audit each for an explicit pack → unpack step | When context loss is suspected |
| **Header packet capture** | `tcpdump` / print `getHeaderNames()` at the receiver | When header transfer is suspected |
| **Sibling code comparison** | Find 5-10 similar implementations in the project; diff the buggy one against the healthy ones | When multiple similar implementations exist |
### Diagnostic Steps (in order)
#### 1. Quantify the Symptom (10 min)
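```bash
# Fire 100 identical requests (replace URL, token, body) and tally the errorCode values.
# In this case errorCode 0 means success. The Content-Type header assumes a JSON body.
for i in $(seq 1 100); do
  curl -s -X POST "$URL" \
    -H "Authorization: $TOKEN" \
    -H "Content-Type: application/json" \
    -d "$BODY" \
    | jq -r '.errorCode' | head -1
done | sort | uniq -c
```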
Read the success rate:
- 100% / 0% → NOT this case; take a different path
- **≈ 50% (2 replicas) / ≈ 33% (3 replicas) / ... → strong signal of this case**
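To make "≈ 1/N" less of a judgment call, the tally can be compared against the replica count. This is a sketch only: the helper name `rate_vs_replicas` is hypothetical, and the default tolerance of 10 percentage points is an arbitrary assumption that should be tuned to the sample size:

```bash
# Compare an observed success rate with 1/N and (N-1)/N for N replicas.
# Usage: rate_vs_replicas <successes> <total> <replicas> [tolerance_pct]
rate_vs_replicas() {
  awk -v ok="$1" -v total="$2" -v n="$3" -v tol="${4:-10}" '
    function abs(x) { return x < 0 ? -x : x }
    BEGIN {
      if (total <= 0 || n <= 0) { print "invalid input"; exit }
      rate = 100 * ok / total
      if (rate == 0 || rate == 100)
        verdict = "deterministic, not this pattern"
      else if (n > 1 && abs(rate - 100 / n) <= tol)
        verdict = "near 1/N"
      else if (n > 1 && abs(rate - 100 * (n - 1) / n) <= tol)
        verdict = "near (N-1)/N"
      else
        verdict = "no clear 1/N relation"
      printf "%.0f%% success, %d replicas: %s\n", rate, n, verdict
    }'
}

# Example: 52 successes out of 100 requests against a 2-replica Deployment
#   rate_vs_replicas 52 100 2
```

A "near 1/N" verdict is only a signal that inter-replica inconsistency is worth suspecting; the cross-replica log check is still needed to confirm it.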
#### 2. Confirm Replica Topology (5 min)
```bash
kubectl get deploy <service-name> -o json | jq '.spec.replicas'
kubectl get pods -l app=<service-name>
```
- replicaCount > 1 with ordinary ClusterIP/LoadBalancer Service → "multi-replica round-robin" condition satisfied
#### 3. Cross-Replica Log Correlation (30 min)
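```bash
# Tail every replica in parallel, prefix each line with the pod name.
# `-o name` yields "pod/<name>", so sed uses "|" as its delimiter instead of "/".
for pod in $(kubectl get pods -l app=<service> -o name); do
  kubectl logs -f "$pod" --tail=100 | sed "s|^|[$pod] |" &
done
wait   # keeps streaming until interrupted (Ctrl-C)
```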
Fire one failing request and observe:
- Does the log appear **simultaneously** on multiple replicas?
- Is **verbosity asymmetric** between replicas?
Both present → strongly consistent with this case type.
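The asymmetry can also be counted rather than eyeballed. The sketch below assumes merged logs in which every line starts with a `[pod]` prefix, and it treats lines beginning with `at ` as Java stack frames; the helper name `log_asymmetry` is hypothetical:

```bash
# Summarize merged, pod-prefixed logs ("[pod] line") read from stdin:
# per pod, count ERROR lines and Java stack-frame lines ("at pkg.Class.method(").
log_asymmetry() {
  awk '
    match($0, /^\[[^]]+\] /) {
      pod  = substr($0, 2, RLENGTH - 3)      # text between "[" and "] "
      line = substr($0, RLENGTH + 1)         # the original log line
      if (!(pod in seen)) { seen[pod] = 1; order[++n] = pod }
      if (line ~ /ERROR/)                 errors[pod]++
      if (line ~ /^[ \t]*at [A-Za-z_$<]/) frames[pod]++
    }
    END {
      for (i = 1; i <= n; i++) {
        p = order[i]
        printf "%s errors=%d stack_frames=%d\n", p, errors[p], frames[p]
      }
    }'
}

# Example: save the merged tail output to a file, fire one failing request, then
#   log_asymmetry < merged.log
```

One pod with an error and many stack frames next to another pod with an error and zero stack frames is the pattern this step looks for.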
#### 4. Localize the Code (30 min)
- Find the file/line pointed to by the NPE stack in the "has-stack" replica
- Find the `log.error("...response XXX,XXX")` call site in the "no-stack" replica
- Inspect the `restTemplate.exchange` / `Feign` method surrounding that logging
- Check the URL — is it in the form `http://<dynamic-ip>:<port>/...` (pod-to-pod direct)?
- Inspect the Headers — missing `x-token-payload` / `x-user-context`?
#### 5. Confirm Protocol Mismatch (15 min)
- Capture the full header list of a request as it passes through the Gateway
- Capture the headers added during internal forwarding
- Diff the two and identify what's missing
- Trace the receiver's `AuthInterceptor` / `@PreAuthorize` / context-parsing code to confirm which headers it depends on
**The difference set is exactly the header list to fix.**
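Once both captures are saved as text, the difference set can be computed mechanically. A sketch, assuming each capture file holds one `Name: value` header per line; the helper names `header_names` and `missing_headers` are hypothetical:

```bash
# Print the lower-cased, sorted, de-duplicated header names of one capture file.
header_names() {
  sed -n 's/^\([^:]*\):.*/\1/p' "$1" | tr 'A-Z' 'a-z' | LC_ALL=C sort -u
}

# Print header names present in the first capture but absent from the second.
# Usage: missing_headers <gateway-capture> <internal-forward-capture>
missing_headers() {
  a=$(mktemp)
  b=$(mktemp)
  header_names "$1" > "$a"
  header_names "$2" > "$b"
  LC_ALL=C comm -23 "$a" "$b"    # lines unique to the first file
  rm -f "$a" "$b"
}

# Example:
#   missing_headers gateway_headers.txt internal_headers.txt
```

Header names are compared case-insensitively because HTTP header names are case-insensitive; header values are ignored on purpose, so a placeholder `Authorization` value still counts as present.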
---
## Remediation Plan
### Root-Cause Fix (P0, required)
**Propagate identity context headers in the outbound internal HTTP call.** Reference implementation:
```java
private Integer invokeRemoteDeviceOpt(JsonNode node, NodePartition nodePartition, RestTemplate restTemplate) {
    String newUrl = String.format("http://%s:%s/rest/v1/access/device/product/services",
            nodePartition.getIp(), nodePartition.getPort());
    HttpHeaders headers = new HttpHeaders();
    headers.add("Authorization", RequestHeaderConstant.HTTP_FEIGN_TOKEN_BEARER.getValue());
    headers.add("Content-Type", "application/json");
    headers.add("Accept", "application/json");
    // Core fix: propagate the identity context
    TokenPayload tokenPayload = ChainContextHolder.getTokenPayload();
    if (tokenPayload != null) {
        TokenContext tokenContext = TokenContext.builder().payload(tokenPayload).build();
        headers.add(RequestHeaderConstant.HTTP_TOKEN_PAYLOAD_HEADER.getValue(),
                JSONUtil.INSTANCE.toJson(tokenContext));
    } else {
        log.warn("invokeRemoteDeviceOpt missing TokenPayload, forward to {} without x-token-payload", newUrl);
    }
    ChainContext ctx = ChainContextHolder.get();
    if (ctx != null) {
        if (StringUtils.isNotBlank(ctx.getRemoteIp())) {
            headers.add(RequestHeaderConstant.HTTP_REMOTE_IP_HEADER.getValue(), ctx.getRemoteIp());
        }
        if (StringUtils.isNotBlank(ctx.getLocale())) {
            headers.add(RequestHeaderConstant.HTTP_LANGUAGE.getValue(), ctx.getLocale());
        }
    }
    // ... subsequent restTemplate.exchange remains unchanged
}
```
**Key**: Serialization must match the Gateway entry point exactly (e.g. both use `JSONUtil.INSTANCE.toJson(TokenContext)`), otherwise the receiver's deserialization will fail.
### Safety Net (P1, recommended)
**Make receiver-side ThreadLocal dereferences null-safe:**
```java
@PostMapping(value = "/rest/v1/access/device/product/services")
public ApiResponse<Integer> innerDeviceOptRequest(@RequestBody JsonNode jsonNode) {
    String tbCode = ChainContextHolder.getTbCode();
    TokenPayload payload = ChainContextHolder.getTokenPayload();
    String account = payload != null ? payload.getAccount() : null;
    String tid = payload != null ? payload.getTid() : null;
    // ...
}
```
Defensive value: if a future caller also forgets to propagate headers, the process won't crash — it will merely degrade to `account=null`.
### Hardening (P2, long-term)
- **Extract a common utility** `forwardHeadersBuilder()` to unify the "internal HTTP forwarding" header assembly so nobody forgets again
- **Remove dead code**: scan for "intended to handle internal calls but with broken logic" (e.g. `url.startsWith("/inner/") && url.equals("feign_token")` — always false)
- **Separate URL namespace**: long-term, route external APIs and internal forwards to **different URL prefixes**; let the receiver pull identity from the body explicitly rather than relying on headers. Existing `/inner/rest/...` pattern is a good reference
---
## Prevention Checklist
### When writing "outbound HTTP calls"
- [ ] Does this call **cross a network boundary**? (Even pod-to-pod within the same service counts)
- [ ] Does the receiver depend on `ThreadLocal` / `MDC` / context storage?
- [ ] Did I explicitly propagate identity headers (`x-token-payload` / `x-user-context`), traceId, MDC?
- [ ] Is the header set **identical** to what's injected at the service entry point (Gateway / AuthFilter)?
- [ ] When context is null, is the degradation strategy clear (log.warn + continue vs throw)?
### When writing "inbound Controllers"
- [ ] Which callers will hit this endpoint? External? Other services? My own pods?
- [ ] Can every caller guarantee to supply the headers I depend on?
- [ ] Before dereferencing `ContextHolder.getXxx().getYyy()`, is the worst case NPE-safe?
- [ ] Better yet, should I use **explicit parameters** + `@Validated` and avoid ThreadLocal dependencies entirely?
### When reviewing stateful services
- [ ] Does the app have hidden state (session / partition ownership / in-memory cache)?
- [ ] Is it deployed as Deployment or StatefulSet? Does it match the business semantics?
- [ ] Does internal routing fully propagate all required context?
- [ ] In a **multi-replica environment**, has the same business key been exercised with 100 repeated requests to check for failures?
### Deployment stage
- [ ] Is the source version ↔ deployed version verifiable? (image tag, git commit hash)
- [ ] During gray releases, does a canary pod run first?
- [ ] For stateful service upgrades, has cross-version internal-call compatibility been tested?
---
## Playbook for Similar Intermittent Bugs
When you hear vague reports like "occasional NPE / occasional 500 / occasional 403", walk through:
### Step 1: Quantify the Reproduction (10 min)
Fire N requests and compute the success rate:
- 100% / 0% → skip to Step 5 (deterministic bug path)
- Any other ratio → continue
### Step 2: Check Replica Count (5 min)
`kubectl get deploy` — how many replicas?
- Success rate ≈ `1/N` or `(N-1)/N`? → inter-replica inconsistency
- Can a single replica reproduce it? If not → **strongly indicates this case type**
### Step 3: Cross-Replica Log Correlation (30 min)
- Fire one failing request while tailing logs from all replicas
- Look for "same request leaves traces on multiple replicas + asymmetric verbosity"
- Identify the code locations for both "replica with stack" and "replica without stack"
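One way to do the correlation in Step 3 is to group log lines by trace ID across replicas. This is a minimal sketch: the `traceId=<id>` log format and the function name are assumptions, so adjust the pattern to the real log layout.

```python
import re
from collections import defaultdict

# Assumed log format: each relevant line contains "traceId=<id>".
TRACE_RE = re.compile(r"traceId=(\w+)")

def correlate_by_trace(logs_by_replica: dict) -> dict:
    """Return {trace_id: {replica: line_count}} for traces seen on 2+ replicas."""
    seen = defaultdict(lambda: defaultdict(int))
    for replica, lines in logs_by_replica.items():
        for line in lines:
            match = TRACE_RE.search(line)
            if match:
                seen[match.group(1)][replica] += 1
    # Keep only requests that left traces on more than one replica.
    return {
        trace: dict(counts)
        for trace, counts in seen.items()
        if len(counts) > 1
    }
```

Unequal line counts for the same trace ID are a rough proxy for "asymmetric verbosity": the replica with more lines is usually the one holding the stack trace.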
### Step 4: Source vs Deployed Diff (15 min)
- Pull the jar from a live pod and decompile the critical classes
- Diff against local source
- Any difference is a major lead (suspect wrong-version deployment first)
### Step 5: Draw the Context Propagation Chain (30 min)
- Mark every cross-thread / cross-process / cross-instance boundary
- Check every boundary for an explicit pack → unpack mechanism
- Validate with packet capture or header-printing at the receiver
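The pack → unpack idea in Step 5 can be illustrated with a language-neutral sketch. It is written in Python with `contextvars`, not in the project's Java code. `x-token-payload` and `x-user-context` come from the checklist; `x-trace-id` and all function names are hypothetical.

```python
from contextvars import ContextVar

# "x-trace-id" is a hypothetical header name used only for illustration.
PROPAGATED_HEADERS = ("x-token-payload", "x-user-context", "x-trace-id")

# Per-request context; plays the role of a ThreadLocal holder.
request_context: ContextVar = ContextVar("request_context")

def pack_context() -> dict:
    """Sender side: copy the current context into outbound headers."""
    ctx = request_context.get({})
    return {name: ctx[name] for name in PROPAGATED_HEADERS if name in ctx}

def unpack_context(headers: dict) -> list:
    """Receiver side: restore the context and return the missing header names."""
    lowered = {key.lower(): value for key, value in headers.items()}
    restored = {n: lowered[n] for n in PROPAGATED_HEADERS if n in lowered}
    request_context.set(restored)
    return [n for n in PROPAGATED_HEADERS if n not in restored]
```

Returning the list of missing headers makes the degradation decision (warn and continue, or reject) explicit at the boundary instead of surfacing later as a null dereference.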
### Step 6: Sibling Code Comparison (15 min)
- Find 5-10 code sites doing similar things
- Diff the buggy code vs the healthy ones
### Step 7: Minimal Reproduction + Test Fixation (1 hour)
- Build a "minimum replica count + minimum precondition" repro case
- **Turn it into an integration test** to prevent regression
---
## References
### Relevant File Paths (project: cuavcloudservice)
- Root-cause fix site: `CuavCloudApplyService/CuavCloudService/.../application/device/DeviceService.java` (`invokeRemoteDeviceOpt`)
- NPE crash site: `CuavCloudApplyService/CuavCloudService/.../api/device/DeviceAccessController.java:313`
- Context definitions: `cuavcloudcbb/.../context/ChainContextHolder.java`, `ChainContext.java`
- Header constants: `cuavcloudcbb/.../constant/RequestHeaderConstant.java`
- Gateway injection points: `cuavcloudservice/.../gateway/filter/AuthorizationFilter.java` (lines 133/274/338/345)
- Receiver parsing: `cuavcloudcbb/.../authentication/application/interceptor/AuthInterceptor.java`
- Partition routing: `CuavCloudService/.../domain/cache/MonitorCacheMgr.java` (`send`), `domain/merchmant/NodePartitionMgr.java`
- Partition mapping store: `t_kafka_partition` (DB) + `access.partition.cache.v4.{id}` (Redis)
### Key Concepts
- JWT stateless authentication
- Kafka Consumer Group partition assignment
- k8s Deployment vs StatefulSet
- ThreadLocal lost across network boundaries
- Asymmetric `GlobalExceptionHandler` treatment of SystemException vs ServiceException
### One-Liner Summary
> **When k8s's stateless deployment philosophy clashes with application-layer stateful business logic, internal pod-to-pod HTTP forwarding drops `x-token-payload`, and replica round-robin turns the bug into Schrödinger's cat. The two pods each hold half the evidence (stack on the receiver, business error on the forwarder); only cross-pod log correlation reveals the full chain.**
Install, configure, start, stop, and verify local or remote development infrastructure across Windows, Linux, and macOS by executing commands through a unifi...
---
name: universal-shell-deployer
description: Install, configure, start, stop, and verify local or remote development infrastructure across Windows, Linux, and macOS by executing commands through a unified workflow. Use when the user asks to set up databases, MinIO, ZLMediaKit, Docker, Redis, PostgreSQL, MySQL, Nginx, Node.js, Java, or other developer environments on local machines or remote hosts.
---
# Universal Shell Deployer
## Purpose
Use this skill for cross-platform environment setup driven by command execution.
Targets may be:
- local Windows, Linux, or macOS
- remote Linux or macOS over SSH
- remote Windows over PowerShell Remoting or another configured command bridge
This skill is configuration-first. Always read the sibling `config.json` before planning or executing changes.
## Required Inputs
Before executing anything, identify:
1. target node
2. target service or recipe
3. desired action: `install`, `configure`, `start`, `stop`, `restart`, `status`, `verify`, `uninstall`
4. whether elevated privileges are allowed
5. whether the task is local-only, remote-only, or mixed
If any of these are unclear, ask the user.
## Config Contract
Read `config.json` in the same directory and use it as the single source of truth for:
- default behavior
- node definitions
- recipe preferences
- execution history
- current state
Update `config.json` after meaningful progress so future runs can resume from the last known state.
## Execution Workflow
### 1. Load config
Read `config.json` and resolve:
- `defaults`
- `nodes`
- `recipes`
- `state`
If the requested node does not exist, ask whether to create it before proceeding.
### 2. Select node
Choose a node by name. Respect the node's:
- `transport`
- `os`
- `shell`
- `workdir`
- `packageManager`
- privilege policy
If the node says `enabled: false`, do not use it without user confirmation.
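Node selection could be resolved roughly as follows. This is a sketch only: the skill does not define a merge order, so the precedence used here (defaults, then node fields, then `overrides`) and the function name are assumptions.

```python
def resolve_node(config: dict, name: str, allow_disabled: bool = False) -> dict:
    """Merge defaults, node fields, and node overrides (later values win).

    Hypothetical helper; the precedence order is an assumption.
    """
    nodes = config.get("nodes", {})
    if name not in nodes:
        raise KeyError(f"node {name!r} is not defined; ask before creating it")
    node = nodes[name]
    # A disabled node needs explicit user confirmation before use.
    if not node.get("enabled", False) and not allow_disabled:
        raise PermissionError(f"node {name!r} is disabled; confirm with the user")
    resolved = dict(config.get("defaults", {}))
    resolved.update({k: v for k, v in node.items() if k != "overrides"})
    resolved.update(node.get("overrides", {}))
    return resolved
```

Passing `allow_disabled=True` is meant to be done only after the user has confirmed, which keeps the confirmation step visible in the calling code.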
### 3. Select recipe
If a named recipe exists, use it as the default implementation.
Prefer the recipe's:
- install method
- package names
- service names
- ports
- environment variables
- health checks
If no recipe exists, build a minimal plan using the node defaults and keep it idempotent.
### 4. Plan before execution
State the concrete execution plan before changing the system:
- which node will be used
- which transport will be used
- which commands will run
- what success looks like
- what state fields will be updated
Break risky changes into small steps.
### 5. Execute by transport
#### Local
Run commands directly in the correct shell:
- Windows: prefer `powershell`
- Linux/macOS: prefer `bash`
#### SSH
Use `ssh` and keep commands non-interactive where possible.
Prefer:
- explicit usernames
- explicit hostnames
- idempotent shell commands
- small batches instead of one giant script
#### Windows remote
Use the configured command bridge in the node definition. Default to PowerShell Remoting semantics unless the user configured something else.
### 6. Verify every step
After install or configuration, verify with one or more of:
- package version
- service status
- listening port
- process existence
- HTTP health endpoint
- storage login or test query
Never mark a step complete without a verification signal.
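As one example of a verification signal, a listening-port check can be done with the standard library alone. The function name and the default timeout are illustrative choices.

```python
import socket

def port_is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A successful TCP connect only shows that something accepts connections on that port. Pair it with a service-level check such as `redis-cli ping` from the recipe when one is available.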
### 7. Persist state
Write back useful state into `config.json`, such as:
- `state.lastSelectedNode`
- `state.lastRecipe`
- `state.lastAction`
- `state.lastResults`
- `state.installations`
- `state.services`
Record failures with timestamps and short error summaries.
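Failure recording might look like the sketch below. The entry fields (`timestamp`, `node`, `action`, `error`) and the 200-character cap are assumptions, since the skill only asks for a timestamp and a short error summary.

```python
from datetime import datetime, timezone

def record_failure(config: dict, node: str, action: str, error: str) -> dict:
    """Append a timestamped failure entry to state.failures and return it.

    Hypothetical helper; the entry shape is illustrative.
    """
    text = error.strip()
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "node": node,
        "action": action,
        # Keep only the first line, capped, as the short summary.
        "error": text.splitlines()[0][:200] if text else "",
    }
    state = config.setdefault("state", {})
    state.setdefault("failures", []).append(entry)
    state["lastSelectedNode"] = node
    state["lastAction"] = action
    return entry
```

The function only mutates the in-memory dictionary. Writing it back to `config.json` is left to the caller so that a partial run cannot corrupt the file.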
## Cross-Platform Rules
- Prefer the package manager declared in the node config.
- Do not assume `sudo` on Windows.
- Do not assume `systemctl` exists on all Linux hosts.
- Do not assume Homebrew exists on macOS unless config says so.
- For containers, prefer `docker compose` when already configured in the node or recipe.
- For direct binary installs, pin the version only when the user asked for it or the recipe requires it.
## Safety Rules
- Never run destructive commands unless the user explicitly approved them.
- Never wipe data directories during reinstall unless the user asked for a reset.
- For database setup, prefer creating dedicated users and data directories.
- For internet downloads, prefer official release URLs recorded in the recipe or confirmed by the user.
- Surface privilege escalation clearly before executing privileged commands.
## Recommended State Shape
When updating `config.json`, use these sections consistently:
- `nodes.<node>.connection`: stable connection metadata
- `nodes.<node>.overrides`: node-specific behavior overrides
- `recipes.<recipe>`: reusable install/config templates
- `state.installations.<node>.<service>`: install status and version
- `state.services.<node>.<service>`: running status and health
- `state.lastResults.<node>`: last action summary
## Suggested Recipe Coverage
Keep recipes for common environment services:
- `redis`
- `postgresql`
- `mysql`
- `minio`
- `zlmediakit`
- `docker`
- `nginx`
- `nodejs`
- `python`
- `java`
Each recipe should define:
- package names by platform
- service names by platform
- default ports
- install steps
- configuration targets
- verification commands
## Response Style
When using this skill:
1. briefly confirm the target node and action
2. show the planned commands before execution
3. execute in small validated steps
4. summarize what changed
5. mention which `config.json` fields were updated
## Example Requests
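- "Install MinIO on this local Windows machine and have it start on boot"
- "Install PostgreSQL 16 on the remote Ubuntu machine"
- "Install ZLMediaKit on dev-linux-01 and verify ports 1935 and 8080"
- "Set up the missing Docker, Redis, and Node.js development environment on macOS"
- "Update the MinIO configuration on prod-edge-02 without deleting data"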
FILE:config.json
{
"version": 1,
"defaults": {
"node": "local-default",
"transport": "local",
"os": "auto",
"shell": "auto",
"packageManager": "auto",
"workdir": "",
"timeoutSeconds": 1800,
"useSudo": false,
"confirmBeforePrivilegeEscalation": true,
"verifyAfterEachStep": true,
"preferDockerCompose": false,
"downloadRoot": "",
"dataRoot": "",
"env": {}
},
"nodes": {
"local-default": {
"enabled": true,
"transport": "local",
"os": "windows",
"shell": "powershell",
"packageManager": "winget",
"workdir": "",
"tags": [
"local",
"default"
],
"connection": {
"host": "localhost",
"port": null,
"username": "",
"authRef": ""
},
"overrides": {
"useSudo": false,
"preferDockerCompose": false,
"timeoutSeconds": 1800,
"downloadRoot": "",
"dataRoot": ""
}
},
"dev-linux-01": {
"enabled": false,
"transport": "ssh",
"os": "linux",
"shell": "bash",
"packageManager": "apt",
"workdir": "/opt/devstack",
"tags": [
"remote",
"linux",
"example"
],
"connection": {
"host": "192.168.1.10",
"port": 22,
"username": "root",
"authRef": "ssh-dev-linux-01"
},
"overrides": {
"useSudo": false,
"preferDockerCompose": true,
"timeoutSeconds": 2400,
"downloadRoot": "/opt/downloads",
"dataRoot": "/data"
}
},
"dev-win-01": {
"enabled": false,
"transport": "psremoting",
"os": "windows",
"shell": "powershell",
"packageManager": "winget",
"workdir": "C:\\devstack",
"tags": [
"remote",
"windows",
"example"
],
"connection": {
"host": "win-dev-01",
"port": 5985,
"username": "Administrator",
"authRef": "ps-dev-win-01"
},
"overrides": {
"useSudo": false,
"preferDockerCompose": true,
"timeoutSeconds": 2400,
"downloadRoot": "C:\\downloads",
"dataRoot": "D:\\data"
}
}
},
"recipes": {
"redis": {
"enabled": true,
"installMode": "package-manager",
"packages": {
"windows": [
"Redis"
],
"linux": [
"redis-server"
],
"macos": [
"redis"
]
},
"services": {
"windows": [
"Redis"
],
"linux": [
"redis-server"
],
"macos": [
"redis"
]
},
"ports": [
6379
],
"verify": [
"redis-server --version",
"redis-cli ping"
]
},
"postgresql": {
"enabled": true,
"installMode": "package-manager",
"packages": {
"windows": [
"PostgreSQL.PostgreSQL"
],
"linux": [
"postgresql",
"postgresql-contrib"
],
"macos": [
"postgresql@16"
]
},
"services": {
"windows": [
"postgresql-x64-16"
],
"linux": [
"postgresql"
],
"macos": [
"postgresql@16"
]
},
"ports": [
5432
],
"verify": [
"psql --version"
]
},
"mysql": {
"enabled": true,
"installMode": "package-manager",
"packages": {
"windows": [
"Oracle.MySQL"
],
"linux": [
"mysql-server"
],
"macos": [
"mysql"
]
},
"services": {
"windows": [
"MySQL80"
],
"linux": [
"mysql"
],
"macos": [
"mysql"
]
},
"ports": [
3306
],
"verify": [
"mysql --version"
]
},
"minio": {
"enabled": true,
"installMode": "binary-or-container",
"ports": [
9000,
9001
],
"env": {
"MINIO_ROOT_USER": "minioadmin",
"MINIO_ROOT_PASSWORD": "change-me"
},
"verify": [
"minio --version"
]
},
"zlmediakit": {
"enabled": true,
"installMode": "binary-or-source",
"ports": [
1935,
554,
8080,
8443
],
"verify": [
"ffprobe -version"
]
}
},
"state": {
"lastSelectedNode": "local-default",
"lastRecipe": "",
"lastAction": "",
"lastResults": {},
"installations": {
"local-default": {},
"dev-linux-01": {},
"dev-win-01": {}
},
"services": {
"local-default": {},
"dev-linux-01": {},
"dev-win-01": {}
},
"failures": []
}
}
Unified bug investigation and closure by combining source code, database, server logs, and software platform query capabilities. Use when users require evide...
---
name: multi-capability-bug-closure
description: >-
  Unified bug investigation and closure by combining source code, database, server logs, and software platform query capabilities. Use when users require evidence-based conclusions from real data rather than static code analysis only.
---
# Multi-Capability Bug Closure Skill
## When to Use
Use this skill when the user requires:
- Proactive use of available capability systems to complete bug investigation
- Real evidence from databases and server logs
- Verifiable conclusions and root-cause analysis
- A full closure loop: locate -> prove -> conclude -> recommend
This skill must prioritize real runtime evidence and must not conclude from code reading alone.
## Mandatory Guiding Prompt (Keep Original Intent)
> Proactively use the currently available capability system. Complete problem localization and evidence-based reasoning by yourself (preferably with real database and server-side data), then provide conclusions and causes. Do not conclude based on code reading alone. Since you can query databases, server logs, and actual software platform operations, complete the full closed-loop investigation autonomously. I only care about final conclusions backed by evidence.

---
## Prerequisite Check: Four Capability Systems (Required)
Before each investigation, verify all four capabilities are available and readable:
1. Source code access capability (for understanding and localization)
2. Database read capability (for deep validation)
3. Server log download and analysis capability (for behavioral and timeline evidence)
4. Software platform query capability (for business-side verification)
### Validation Criteria
- **Source code capability**: can read related code files in the workspace (at least one core router/service file).
- **Database capability**: can execute at least one read-only SQL query (for example, `select 1`).
- **Server log capability**: can read log-skill configuration and access a target log directory or sample logs.
- **Platform query capability**: related platform skill is readable, and its docs/API description are readable.
### If Any Capability Is Missing (Must Notify User)
If any capability is missing, first notify that evidence closure cannot be guaranteed, then provide supplementation recommendations:
- Source code capability: provide relevant source context or key directories directly
- Database capability: add MCP services for the target system (Postgres/SQLite/Redis, etc.)
- Server log capability: install or add [server-log-analysis](https://clawhub.ai/hgvgfgvh/server-log-analysis)
- Platform query capability: add a query skill or MCP tools for the target business platform

Do not provide a final root-cause conclusion before capabilities are complete. Provide only an evidence-gap checklist.

---
## Standard Workflow (Required)
1. **Structure the problem**
   - Extract time window, service name, error keywords, and identifiers (SN/ID/traceId).
2. **Log-side evidence (server)**
   - Pull raw logs in the target time window.
   - Measure frequency, grouping, and continuity of anomalies.
3. **Database-side evidence**
   - Query object existence, relationships, status, and timestamps in key tables.
   - Cross-check across databases when needed.
4. **Code-side localization**
   - Locate exception enums, throw paths, and routing branch conditions.
   - Use code only to explain mechanism, not to replace real-data conclusions.
5. **Platform-side validation**
   - Validate object status and business configuration through platform queries.
6. **Merge the evidence chain**
   - Connect evidence in this order: logs -> database -> code mechanism -> platform validation.
7. **Output final conclusion**
   - Classify as data issue / code issue / configuration issue / environment issue.
8. **Provide fix recommendations and verification standards**
   - Give executable steps and clear pass criteria for closure.
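The merge step of the workflow (logs -> database -> code mechanism -> platform validation) can be sketched as a small ordering function. The item shape (`{"source": ..., "note": ...}`), the source labels, and the function name are hypothetical.

```python
# Order taken from the workflow: logs -> database -> code -> platform.
EVIDENCE_ORDER = ("log", "database", "code", "platform")

def merge_evidence(items: list) -> dict:
    """Order evidence items by source and list the sources still missing.

    Hypothetical helper; each item is assumed to be a dict with a "source" key.
    """
    ordered = sorted(
        (item for item in items if item["source"] in EVIDENCE_ORDER),
        key=lambda item: EVIDENCE_ORDER.index(item["source"]),
    )
    present = {item["source"] for item in ordered}
    gaps = [source for source in EVIDENCE_ORDER if source not in present]
    # Code reading alone is never enough: require log and database evidence.
    conclusive = "log" in present and "database" in present
    return {"chain": ordered, "gaps": gaps, "conclusive": conclusive}
```

When `conclusive` is false, the `gaps` list doubles as the evidence-gap checklist to report instead of a final root cause.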
---
## Output Template (Required)
```markdown
## Problem Summary
[One-sentence summary of symptom and impact]
## Key Evidence
- Log evidence: [time, service, key lines, frequency stats]
- Database evidence: [tables, conditions, query results]
- Code evidence: [trigger path and branch conditions]
- Platform evidence: [API/config validation results]
## Root-Cause Assessment
[Final root cause + category: data/code/config/environment]
## Confidence
[High/Medium/Low] + [reason]
## Fix Recommendations
1. [Actionable step]
2. [Actionable step]
## Verification Criteria
- [Metric 1, e.g. error count reaches zero]
- [Metric 2, e.g. mapping restored]
```
---
## Constraints
- Do not fabricate evidence.
- Do not skip database or log evidence and jump straight to conclusions.
- If a capability is unavailable, explicitly declare the evidence gap and request supplementation.
- For sensitive credentials, prefer environment variables or secret managers; do not expose plaintext in outputs.
Connect to remote servers over SSH, read sibling config.yaml to understand service metadata and log locations, download only required log snippets to local t...
---
name: server-log-analysis
description: Connect to remote servers over SSH, read sibling config.yaml to understand service metadata and log locations, download only required log snippets to local temp for analysis, and diagnose issues from evidence. Use when users ask to troubleshoot remote service logs, investigate backend exceptions, or perform SSH-based log diagnostics.
---
# Server Log Analysis
## Purpose
Use this Skill to investigate service issues when logs are stored on remote servers.
This Skill assumes:
- The agent can connect to servers via SSH or equivalent remote execution tooling.
- `config.yaml` in this Skill directory defines service metadata, log paths, and business context.
- Before deep analysis, relevant log snippets should be copied to local `temp/` first.
## Required Reading
- Read `config.yaml` first.
- Read `reference.md` when field details or command patterns are needed.
## Core Workflow
1. Read `config.yaml`.
2. Map the user issue to one or more configured services.
3. Define the smallest necessary investigation scope:
   - target service
   - target host
   - relevant time window
   - candidate log files
4. Connect to the target server via SSH or available remote tools.
5. Perform remote checks before downloading:
   - file existence and file size
   - last modified time
   - whether keyword filtering or tail output is sufficient
6. Download only minimal required log snippets to configured local `temp/`.
7. Analyze local copies for errors, timing correlation, repeated failures, and likely root cause.
8. Output concise diagnosis with conclusions, evidence, uncertainty, and follow-up actions.
## Investigation Rules
- Prioritize service definitions and business context in `config.yaml`; do not guess.
- Prefer remote filtering before full download:
  - narrow time window first
  - then filter by keywords
  - use tail first for recent incidents
- Download full logs only when snippets are insufficient.
- Local filenames should clearly include service, host, and time range.
- Unless explicitly requested, do not fetch sensitive files, binaries, or unrelated large archives.
- For cross-service issues, analyze primary service first, then expand to dependencies.
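The "filter remotely before downloading" rule can be expressed as a read-only shell pipeline built on the local side. In this sketch the function name is hypothetical and keywords are assumed to be plain words with no regex metacharacters. `tail -n` and `grep -E` are standard options.

```python
import shlex

def build_remote_filter(path: str, tail_lines: int = 3000, keywords=()) -> str:
    """Compose a read-only pipeline: take the tail first, then filter by keyword."""
    command = f"tail -n {int(tail_lines)} {shlex.quote(path)}"
    if keywords:
        # Keywords are joined into one extended-regex alternation.
        pattern = "|".join(keywords)
        command += f" | grep -E {shlex.quote(pattern)}"
    return command
```

For an assumed log file `/logs/ms_log/a/app.log` and the keywords `ERROR` and `timeout`, this yields `tail -n 3000 /logs/ms_log/a/app.log | grep -E 'ERROR|timeout'`. That string can be passed as the remote command to `ssh`, so only matching lines leave the server.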
## Service Selection
When user intent is ambiguous:
1. Use service `aliases`, `keywords`, and `description` in `config.yaml`.
2. Pick the service with the highest semantic match.
3. If still unclear, ask the user which service to inspect before remote connection.
## Remote Pre-Check Checklist
Before downloading logs, confirm:
- host configuration matches target service
- configured log files exist
- which log file was updated most recently
- whether rolling logs must be included
- whether issue is recent or historical
Common remote checks include:
- file metadata checks
- recent log tail checks
- quick keyword search
- time-window extraction
- process/service status when needed
## Local Download Rules
Store downloaded logs under configured `local_temp_dir`.
Recommended filename format:
`<service>__<host>__<log_name>__<time_hint>.log`
Priority order:
1. recent tail logs
2. keyword-filtered snippets
3. explicit time-window snippets
4. full file as last resort
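A small helper can keep local filenames in the recommended format. This is a sketch: the function name is hypothetical, and replacing unsafe characters with `-` is a design choice rather than something the skill prescribes.

```python
import re

def local_log_filename(service: str, host: str, log_name: str, time_hint: str) -> str:
    """Build `<service>__<host>__<log_name>__<time_hint>.log` from safe parts."""
    def clean(part: str) -> str:
        # Underscores are replaced too, so "__" stays an unambiguous separator.
        return re.sub(r"[^A-Za-z0-9.-]+", "-", part).strip("-")
    parts = (service, host, log_name, time_hint)
    return "__".join(clean(part) for part in parts) + ".log"
```

Because every part is sanitized, splitting a filename on `__` always recovers exactly four fields.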
## Analysis Focus
Focus on:
- startup failures
- repeated exceptions
- timeout and connection issues
- resource pressure signals
- failures in DB/cache/message queue/DNS/HTTP upstream dependencies
- config errors exposed by stack traces or startup logs
- timestamp alignment across related services
The response should include:
- issue summary
- key evidence
- preliminary cause
- confidence level
- next verification steps
## Security Constraints
- Treat `config.yaml` as operations metadata; do not store plaintext secrets.
- Prefer environment variables, key files, or external secret managers for SSH credentials.
- Unless explicitly requested, do not modify remote files or restart services.
- Unless requested, do not auto-delete downloaded logs.
## Exception Handling
If remote access fails:
1. Clearly state which step failed.
2. State target host and service.
3. Ask user for correct SSH access method, network path, or credentials.
If configured log path does not exist:
1. Clearly identify missing path.
2. Check whether alternate paths are configured for the same service.
3. Ask user whether deployment paths changed.
## Quick Execution Order
Always follow this order:
1. Read `config.yaml`.
2. Identify service and host.
3. Perform remote log pre-checks.
4. Copy minimal required logs to `temp/`.
5. Analyze locally.
6. Summarize conclusions with evidence.
FILE:config.yaml
version: 1
analysis:
  local_temp_dir: temp/server-log-analysis
  default_time_window: 2h
  default_tail_lines: 3000
  max_download_mb_per_file: 50
  prefer_remote_filter: true
  preserve_downloads: true
connections:
  default-server:
    host: hostname
    port: 22
    username: root
    auth:
      method: password
      password_env: SSH_PASSWORD  # placeholder variable name; never store the plaintext password in this file
services:
  example-service:
    description: >
      Example business service. Replace with a clear business description,
      including service responsibility, key upstream/downstream dependencies,
      and which user-facing failures usually indicate this service should be
      investigated first.
    aliases:
      - example
      - demo-service
    keywords:
      - login
      - timeout
      - order
    connection: default-server
    workdir: /opt/example-service
    startup_command: systemctl status example-service
    investigation_hints:
      - Prioritize startup-failure related checks first.
      - Correlate timeout errors with upstream gateway logs.
      - During peak traffic, watch for database connection exhaustion.
    log_files:
      - name: a
        path: /logs/ms_log/a/
        format: plain
        priority: high
        purpose: Logs for the microservice handling direct device access.
download_profiles:
  recent-tail:
    description: Quickly fetch recent logs
    mode: tail
    lines: 3000
  recent-errors:
    description: Keyword-focused log snippets for quick diagnosis
    mode: keyword
    keywords:
      - ERROR
      - WARN
      - Exception
      - timeout
      - refused
      - failed
  bounded-window:
    description: Extract a time window around the incident
    mode: time-range
    before: 15m
    after: 15m
FILE:reference.md
# Reference Notes
## Configuration Design
`config.yaml` is the operations configuration hub for this Skill.
Recommended content:
- connection targets
- service descriptions
- log file paths
- related configuration files
- investigation hints
- default download strategies
Do not include:
- plaintext passwords
- private keys
- one-off incident narratives (put those in separate reports)
## Field Definitions
### `analysis`
- `local_temp_dir`: local directory for downloaded log snippets
- `default_time_window`: default time range when user does not provide one
- `default_tail_lines`: default number of recent lines to fetch
- `max_download_mb_per_file`: soft limit for single-file download size
- `prefer_remote_filter`: whether to filter remotely before downloading
- `preserve_downloads`: whether to keep local files after analysis
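Values such as `default_time_window: 2h` or the `before: 15m` / `after: 15m` pair in a time-range profile need to be turned into concrete timestamps before logs can be extracted. The sketch below assumes a duration syntax of `<integer><s|m|h|d>`, inferred from the sample values; both function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumed duration syntax: an integer followed by s, m, h, or d.
UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}

def parse_duration(text: str) -> timedelta:
    """Parse strings such as "15m" or "2h" into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if not match:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**{UNITS[match.group(2)]: int(match.group(1))})

def incident_window(incident: datetime, before: str = "15m", after: str = "15m"):
    """Return the (start, end) timestamps surrounding an incident."""
    return incident - parse_duration(before), incident + parse_duration(after)
```

Rejecting unknown formats with an error is deliberate, because silently falling back to a default window could hide the relevant log lines.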
### `connections`
Each connection entry represents an SSH target or access endpoint.
Recommended fields:
- `host`
- `port`
- `username`
- `auth.method`
- `auth.password_env` or key reference
- `notes`
For jump hosts, extend with:
- `jump_host`
- `jump_port`
- `jump_username`
### `services`
Each service should be documented so it can be understood without tribal knowledge.
Recommended fields:
- `description`
- `aliases`
- `keywords`
- `connection`
- `workdir`
- `startup_command`
- `investigation_hints`
- `log_files`
- `related_files`
- `related_services`
### `log_files`
Each log entry should include not only path but also usage intent.
Recommended fields:
- `name`
- `path`
- `format`
- `priority`
- `purpose`
## Recommended Remote Investigation Flow
1. Map the issue to a service in `config.yaml`.
2. Confirm the connection target.
3. Check file existence and file size sanity.
4. Inspect recent tail output or keyword matches remotely first.
5. Download only minimal required snippets.
6. Analyze locally.
7. Expand scope only when evidence is insufficient.
## Recommended Download Strategy
Expand scope in this order:
1. recent lines from highest-priority logs
2. snippets around issue keywords
3. logs from an explicit time window
4. rotated logs if issue started earlier
5. full-file download only when necessary
## Service Description Example
`Provides user login, token refresh, and session validation. Common failures include Redis session errors, downstream auth timeouts, and startup misconfiguration.`
## Investigation Hint Examples
- `When login fails, correlate gateway and auth-service logs.`
- `If startup fails after deployment, inspect application-prod.yml and systemd environment variables.`
- `If latency spikes, correlate application timeout logs with database connection exhaustion.`
## Output Template
Use a concise structure:
```markdown
## Issue Summary
[One-sentence description]
## Key Evidence
- [Log evidence 1]
- [Log evidence 2]
## Preliminary Assessment
[Most likely cause]
## Confidence
[High/Medium/Low]
## Suggested Next Steps
- [Suggestion 1]
- [Suggestion 2]
```
## Notes
- Keep service names, aliases, and external descriptions consistent.
- Prefer canonical service names in responses.
- Update `config.yaml` promptly when log paths or deployment topology changes.