@clawhub-quochungto-93dad49abd
---
name: component-identifier
description: Decompose a system into well-defined components using structured discovery techniques. Use this skill whenever the user is designing a new system from requirements, breaking down a monolith into modules, deciding how to organize code into packages/services, asking "what components should this system have?", or struggling with component granularity — even if they don't use the word "component."
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/fundamentals-of-software-architecture/skills/component-identifier
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
  - architecture-characteristics-identifier
source-books:
  - id: fundamentals-of-software-architecture
    title: "Fundamentals of Software Architecture"
    authors: ["Mark Richards", "Neal Ford"]
    chapters: [8]
tags: [software-architecture, architecture, components, decomposition, modularity, domain-driven-design]
execution:
  tier: 1
  mode: full
  inputs:
    - type: none
      description: "System requirements, user stories, or domain description; the skill guides discovery from there"
  tools-required: [Read, Write]
  tools-optional: [Grep, Glob]
  mcps-required: []
  environment: "Any agent environment. If a codebase exists, can analyze existing component structure."
---
# Component Identifier
## When to Use
You're designing a system and need to figure out what the building blocks should be — what components, modules, or services to create and how they relate. Typical situations:
- New system from requirements — "we have these user stories, what components do we need?"
- Monolith restructuring — "our code is a mess, how should we reorganize?"
- Prerequisite for architecture style selection: the identified components feed into quantum analysis
- Team is falling into the Entity Trap — creating `UserManager`, `OrderManager` instead of real components
Before starting, verify:
- Do you have requirements, user stories, or at least a domain description? If not, help gather them first.
- Do you know the architecture characteristics? If not, use `architecture-characteristics-identifier` first — characteristics affect component division.
## Context & Input Gathering
### Input Sufficiency Check
The skill needs to know WHO uses the system and WHAT they do. Without actors and actions, component identification is guesswork.
Check the user's prompt for:
- System purpose and domain
- Users/roles/actors
- Key workflows or use cases
- Any existing structure (if restructuring)
### Required Context (must have — ask if missing)
- **System purpose:** What does this system do?
→ Check prompt for: domain description, problem statement
→ If missing, ask: "In one sentence, what is the main purpose of this system?"
- **Actors/users:** Who uses this system?
→ Check prompt for: user types, roles, personas
→ If missing, ask: "Who are the main types of users? For example: customers, admins, operators, external systems?"
- **Key workflows:** What do users DO with the system?
→ Check prompt for: user stories, features, use cases, actions
→ If missing, ask: "What are the 5-7 most important things users do with this system? For example: place an order, submit a review, process a payment."
### Observable Context (gather from environment)
- **Existing codebase:** If restructuring, scan for current component structure
→ Look for: package directories, service folders, module boundaries
→ Reveals: current partitioning (technical vs domain), coupling patterns
- **Architecture characteristics:** If already identified, use them to inform component division
→ Look for: output from `architecture-characteristics-identifier`
→ Reveals: which parts need different quality attributes
### Default Assumptions
- If no existing system → greenfield, use domain partitioning (industry-standard default)
- If actors unclear → assume at least: end user, admin, external system
- If partitioning preference not stated → recommend domain partitioning (book's recommendation for modern architectures)
### Sufficiency Threshold
```
SUFFICIENT when: system purpose + at least 3 actors + at least 5 workflows are known
PROCEED WITH DEFAULTS when: system purpose is known but actors/workflows are sparse
MUST ASK when: system purpose is unclear or no workflows are stated
```
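The threshold can be sketched as a small classifier. This is a minimal illustration, not part of the skill itself: `GatheredContext` and `sufficiency` are hypothetical names, and treating "zero workflows" as MUST ASK even when actors are known is one reading of the rules above.

```python
from dataclasses import dataclass, field


@dataclass
class GatheredContext:
    """What is known so far about the system (hypothetical container)."""
    purpose: str = ""
    actors: list[str] = field(default_factory=list)
    workflows: list[str] = field(default_factory=list)


def sufficiency(ctx: GatheredContext) -> str:
    """Classify the gathered context using the three threshold rules."""
    has_purpose = bool(ctx.purpose.strip())
    # MUST ASK: purpose is unclear, or no workflows are stated at all.
    if not has_purpose or not ctx.workflows:
        return "MUST_ASK"
    # SUFFICIENT: purpose + at least 3 actors + at least 5 workflows.
    if len(ctx.actors) >= 3 and len(ctx.workflows) >= 5:
        return "SUFFICIENT"
    # Purpose is known, but actors or workflows are sparse.
    return "PROCEED_WITH_DEFAULTS"
```

A context with a purpose, one actor, and one workflow lands in `PROCEED_WITH_DEFAULTS`, which is where the default assumptions listed earlier apply.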
## Process
### Step 1: Choose Partitioning Style
**ACTION:** Decide between technical partitioning (layers) and domain partitioning (workflows).
**WHY:** This is the most fundamental decision — it determines the shape of everything else. Technical partitioning (Presentation → Business Rules → Persistence) was the standard for decades, but domain partitioning (organized by business workflows) has become the industry standard for both monoliths and microservices. Domain partitioning makes it easier to migrate to distributed architecture later, aligns with how the business thinks, and produces components with higher functional cohesion.
| Style | Organizes by | Best for | Watch out for |
|-------|-------------|----------|--------------|
| **Technical** | Layers: presentation, business, persistence | Simple CRUD apps, teams familiar with layered patterns | Domains smeared across layers, hard to migrate |
| **Domain** | Workflows: order processing, inventory, shipping | Modern apps, microservice-ready, cross-functional teams | Customization code appears in multiple places |
**IF** the user hasn't specified → recommend domain partitioning with explanation.
**IF** the user has an existing technically-partitioned system → note the trade-offs of restructuring.
### Step 2: Identify Actors and Actions
**ACTION:** List all actors (users, roles, external systems) and map their actions.
**WHY:** Components should align with what users DO, not what data exists. The Actor/Actions approach (from the Rational Unified Process) starts from real usage patterns, not database tables. This prevents the Entity Trap — the most common component identification mistake. If you start from "what data do we store?", you get `UserManager`, `OrderManager` (an ORM, not an architecture). If you start from "what do users do?", you get `PlaceOrder`, `ProcessPayment`, `ManageInventory` (real workflows).
**Alternative:** For event-heavy systems, use Event Storming instead — map domain events first, then group into components.
Output a table:
```
| Actor | Actions |
|-------|---------|
| Customer | Browse catalog, place order, track delivery, submit review |
| Store owner | Manage inventory, set prices, view reports |
| Payment system | Process payment, issue refund |
```
### Step 3: Map Actions to Initial Components
**ACTION:** Group related actions into candidate components. Each component should represent a cohesive workflow.
**WHY:** The goal is a coarse-grained substrate — not the final design. The likelihood of getting the perfect design on the first attempt is "disparagingly small" (the book's words). What you're building is a starting hypothesis to iterate on. Grouping related actions ensures each component has a clear, unified purpose — high functional cohesion.
Rules for grouping:
- Actions that always happen together → same component
- Actions performed by the same actor on the same domain concept → likely same component
- Actions that need different quality attributes → likely different components
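As a rough sketch of these grouping rules, the snippet below buckets actions by the domain concept they touch and the quality attribute they need. The `Action` fields `concept` and `characteristic` are assumed labels that an architect would assign by hand; nothing in the source prescribes this data model, and grouping across actors on the same concept is a simplification.

```python
from collections import defaultdict
from typing import NamedTuple


class Action(NamedTuple):
    actor: str
    name: str
    concept: str         # domain concept the action touches (assumed label)
    characteristic: str  # dominant quality attribute it needs (assumed label)


def candidate_components(actions: list[Action]) -> dict[tuple[str, str], list[str]]:
    """Group actions on the same concept; split when quality attributes differ."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for action in actions:
        groups[(action.concept, action.characteristic)].append(action.name)
    return dict(groups)


actions = [
    Action("Customer", "place order", "ordering", "elasticity"),
    Action("Customer", "track delivery", "ordering", "elasticity"),
    Action("Store owner", "view reports", "reporting", "batch"),
]
for key, names in candidate_components(actions).items():
    print(key, names)
```

Each resulting group is only a starting hypothesis; the later steps exist to challenge it.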
### Step 4: Assign Requirements to Components
**ACTION:** Map each requirement/user story to the component that handles it. Look for mismatches.
**WHY:** This is the validation step — if a requirement doesn't fit cleanly into any component, either the requirement spans too many concerns or the component boundaries are wrong. Requirements that force you to touch 3+ components for a single user action indicate the wrong granularity.
Watch for:
- Requirements that don't fit any component → create a new one
- Requirements that span many components → either the requirement is too broad or components need restructuring
- Components with no requirements → remove them (they're imaginary)
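The three checks above lend themselves to a mechanical pass over the requirement mapping. The following is a minimal sketch under assumed inputs: `mapping` (requirement to list of component names) and `review_mapping` are hypothetical, and `max_span=2` encodes the "3+ components" signal from the WHY paragraph.

```python
def review_mapping(
    mapping: dict[str, list[str]],
    components: set[str],
    max_span: int = 2,
) -> dict[str, list[str]]:
    """Flag mismatches between requirements and candidate components."""
    # Requirements that no component handles: a component is probably missing.
    unmapped = [req for req, comps in mapping.items() if not comps]
    # Requirements touching more than max_span components: wrong granularity
    # or an overly broad requirement.
    spanning = [req for req, comps in mapping.items() if len(set(comps)) > max_span]
    # Components that no requirement needs: candidates for removal.
    used = {comp for comps in mapping.values() for comp in comps}
    unused = sorted(components - used)
    return {"unmapped": unmapped, "spanning": spanning, "unused": unused}
```

An empty result in all three lists does not prove the design is right; it only means this step found nothing to restructure.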
### Step 5: Analyze Architecture Characteristics Per Component
**ACTION:** Check if different components need different quality attributes. Components with different characteristics may need to be in different deployment units (quanta).
**WHY:** This is where component identification connects to quantum analysis. If the Order Processing component needs high elasticity (flash sales) but the Reporting component needs only batch processing, they have different characteristic profiles. This difference suggests they should be separate quanta — which drives the monolith vs distributed decision. Without this step, you might design components that look clean but can't be deployed or scaled appropriately.
**IF** components have uniform characteristics → they can stay in one deployment unit (monolith is fine).
**IF** components have different characteristics → flag for `architecture-quantum-analyzer`. These may become separate quanta.
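One way to make the uniform-versus-different check concrete is to compare each component's characteristic set against the most common profile. This is an illustrative heuristic, not the book's method; `characteristic_variance` and the profile data are assumptions.

```python
from collections import Counter


def characteristic_variance(profiles: dict[str, set[str]]) -> dict[str, bool]:
    """Return True for each component whose profile differs from the majority."""
    if not profiles:
        return {}
    counts = Counter(frozenset(chars) for chars in profiles.values())
    baseline, _ = counts.most_common(1)[0]  # most frequent profile
    return {name: frozenset(chars) != baseline for name, chars in profiles.items()}


variance = characteristic_variance({
    "OrderProcessing": {"elasticity"},
    "Reporting": {"availability"},
    "Catalog": {"availability"},
})
# Any True value means the design should be flagged for quantum analysis.
needs_quantum_analysis = any(variance.values())
```

If no component differs, the components can stay in one deployment unit, matching the first **IF** above.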
### Step 6: Check for the Entity Trap
**ACTION:** Review the component design for signs of the Entity Trap anti-pattern.
**WHY:** The Entity Trap is the #1 component identification mistake. It happens when the architect creates components that mirror database entities (`UserManager`, `OrderManager`, `ProductManager`) with CRUD operations instead of real workflow components. This produces an ORM, not an architecture — high coupling, low cohesion, no clear behavior boundaries. The fix is to refocus on workflows: "what does the system DO?" not "what does it STORE?"
Detection checklist:
- [ ] Components are named `[Entity]Manager` or `[Entity]Service`
- [ ] Each component has primarily Create/Read/Update/Delete operations
- [ ] Components map 1:1 to database tables
- [ ] No workflow or behavioral logic is captured
**IF** Entity Trap detected → restructure around workflows using Step 2's actors/actions.
### Step 7: Assess Granularity and Iterate
**ACTION:** Evaluate whether each component is the right size. Restructure if needed.
**WHY:** There is no formula for the right granularity — it requires iterative refinement. Too fine-grained = too much communication between components (chatty architecture). Too coarse-grained = too many responsibilities per component (bloated modules). The sweet spot is components where each handles one cohesive workflow without excessive external calls.
Signs of wrong granularity:
- **Too fine:** A single user action requires calling 5+ components
- **Too coarse:** A single component handles 10+ unrelated responsibilities
- **Just right:** Each component handles 2-5 related actions with minimal cross-component calls
This step feeds back to Step 3 — iterate until stable.
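The signs listed above are rules of thumb, and the text is explicit that no formula exists. With that caveat, a rough sketch can still surface candidates for review; the function names and the idea of recording call paths per user action are assumptions made for illustration.

```python
def chatty_user_actions(call_paths: dict[str, list[str]], limit: int = 5) -> list[str]:
    """User actions that touch `limit` or more distinct components (too fine)."""
    return [
        action
        for action, components in call_paths.items()
        if len(set(components)) >= limit
    ]


def component_size(related_actions: int, unrelated_responsibilities: int = 0) -> str:
    """Apply the 'too coarse' and 'just right' thresholds to one component."""
    if unrelated_responsibilities >= 10:
        return "too coarse"
    if 2 <= related_actions <= 5:
        return "just right"
    return "needs review"


paths = {
    "checkout": ["Cart", "Pricing", "Tax", "Payment", "Shipping"],
    "browse": ["Catalog"],
}
print(chatty_user_actions(paths))
print(component_size(related_actions=3))
```

Anything flagged here goes back through Step 3 for regrouping.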
## Inputs
- System requirements, user stories, or domain description
- Architecture characteristics (from `architecture-characteristics-identifier` or user input)
- Optionally: existing codebase to restructure
## Outputs
### Component Identification Report
```markdown
# Component Design: {System Name}
## Partitioning Style
{Domain / Technical} — {reasoning}
## Actors and Actions
| Actor | Actions |
|-------|---------|
| {actor} | {action1, action2, action3} |
## Identified Components
| Component | Responsibility | Key actions | Architecture characteristics |
|-----------|---------------|-------------|----------------------------|
| {name} | {what it does} | {actions it handles} | {relevant -ilities} |
## Requirement Mapping
| Requirement/Story | Component(s) | Notes |
|-------------------|-------------|-------|
| {requirement} | {component} | {any concerns} |
## Entity Trap Check
{Pass / Warning} — {reasoning}
## Granularity Assessment
{Assessment of component sizing — any too fine or too coarse?}
## Characteristic Variance
| Component | Primary characteristic | Differs from others? |
|-----------|---------------------|:---:|
| {component} | {characteristic} | Yes/No |
{If variance detected: flag for quantum analysis}
## Component Relationship Map
{Text diagram showing how components communicate and depend on each other}
```
## Key Principles
- **Workflows, not entities** — Components should represent what the system DOES, not what it STORES. "Process Order" is a component. "Order Manager" is an Entity Trap. Start from actors and actions, not from the database schema.
- **Domain partitioning by default** — The industry trend is firmly toward domain partitioning for both monoliths and microservices. Technical partitioning (layers) smears domains across all layers and makes migration to distributed architecture difficult. Unless you have a specific reason for layers, use domain partitioning.
- **Iteration is the process** — The chance of getting the right component design on the first attempt is near zero. Build a hypothesis, map requirements, find the mismatches, restructure. Component identification is inherently iterative — don't expect to be done in one pass.
- **Different characteristics = different components** — If two parts of the system need different quality attributes (one needs high availability, another needs high throughput), they should be separate components. This separation is what enables them to become separate quanta if needed.
- **Granularity has no formula** — Too fine = chatty. Too coarse = bloated. There's no mathematical answer. The right size is where each component handles one cohesive workflow without excessive cross-component calls. Use the iterative cycle to converge.
- **Ask about workflows, not data** — When gathering input from stakeholders, ask "what do your users DO?" not "what data do you have?" The first question reveals components. The second reveals the Entity Trap.
## Examples
**Scenario: Online auction system (Going, Going, Gone)**
Trigger: "We're building an online auction platform. What components do we need?"
Process: Asked about actors — identified Bidder, Auctioneer, System Admin. Mapped actions: Bidder (view items, place bids, track bids), Auctioneer (create auction, start/stop, manage items), Admin (manage users, view reports). Grouped into components: BidCapture, BidTracking, AuctionSession, ItemManagement, UserManagement, Reporting. Analyzed characteristics — discovered BidCapture needs different characteristics for bidders (high elasticity) vs auctioneers (high reliability). Split BidCapture into BidderCapture + AuctioneerCapture. Entity Trap check: passed — components are workflow-based, not entity-based. Flagged characteristic variance for quantum analysis.
Output: 7 components with characteristic analysis showing the BidderCapture/AuctioneerCapture split and quantum implications.
**Scenario: Detecting the Entity Trap**
Trigger: "Here's our current design: UserManager, OrderManager, ProductManager, PaymentManager. Each handles CRUD for its entity. Does this look right?"
Process: Immediately identified the Entity Trap — all components are [Entity]Manager with CRUD operations. This is an ORM, not an architecture. Asked about actors and workflows: who uses this system and what do they do? Discovered workflows: "browse catalog and place order" (spans Product + Order + Payment), "process payment and update inventory" (spans Payment + Product). Restructured around workflows: OrderProcessing (browse → select → checkout), PaymentProcessing (charge → confirm → receipt), InventoryManagement (stock → reorder → catalog), UserAuthentication. Entity Trap check: resolved.
Output: Restructured from 4 entity-based to 4 workflow-based components with explanation of why the original design was an Entity Trap.
**Scenario: Greenfield with sparse requirements**
Trigger: "We're building an employee scheduling app for a hospital. That's all I know so far."
Process: Insufficient information — asked clarifying questions one at a time: (1) "Who are the main users?" → nurses, doctors, HR admin, department heads. (2) "What are the key things these users do?" → request shifts, swap shifts, approve PTO, generate compliance reports, view schedules. (3) "Are there parts with different performance/availability needs?" → yes, the schedule viewer needs to be always-on (nurses check between rounds) but reporting is weekly batch. Used Actor/Actions to identify: ShiftScheduling, ShiftSwapping, PTOManagement, ComplianceReporting, ScheduleViewing. Flagged ScheduleViewing vs ComplianceReporting as having different availability characteristics.
Output: 5 components with input gathering process documented, showing how asking the right questions leads to better component design.
## References
- For component discovery techniques in detail, see [references/discovery-techniques.md](references/discovery-techniques.md)
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Fundamentals of Software Architecture by Mark Richards, Neal Ford.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-architecture-characteristics-identifier`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
---
name: architecture-tradeoff-analyzer
description: Systematically analyze trade-offs across quality attribute dimensions for architecture decisions. Use this skill whenever the user is comparing architecture options, weighing competing quality attributes (performance vs scalability, simplicity vs flexibility), making any structural technology decision, evaluating monolith vs distributed, choosing communication patterns, or asking "what are the trade-offs?" — even if they don't explicitly say "trade-off analysis."
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/fundamentals-of-software-architecture/skills/architecture-tradeoff-analyzer
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
source-books:
  - id: fundamentals-of-software-architecture
    title: "Fundamentals of Software Architecture"
    authors: ["Mark Richards", "Neal Ford"]
    chapters: [1, 2, 4, 5, 18, 19]
tags: [software-architecture, architecture, trade-offs, decision-making, quality-attributes]
depends-on: [] # Foundation skill, no dependencies
execution:
  tier: 1
  mode: full
  inputs:
    - type: none
      description: "Architecture decision context from the user: what options they're considering and why"
  tools-required: [Read, Write]
  tools-optional: [Grep, Glob]
  mcps-required: []
  environment: "Any agent environment. If a codebase exists, can read it for context."
---
# Architecture Trade-off Analyzer
## When to Use
Use this skill when an architecture decision needs to be made and there are competing concerns. Typical triggers:
- The user is choosing between architecture styles (monolith vs microservices, event-driven vs request-reply)
- The user mentions competing quality attributes ("we need it fast AND scalable AND simple")
- The user is presenting a single option as "the best" — they haven't analyzed trade-offs yet
- The user is stuck in analysis paralysis or decision avoidance
- An architecture decision needs to be documented with rationale
Before starting, verify:
- Is there actually a decision to make? (If the user is just asking for information, explain concepts instead)
- Are there at least 2 viable options to compare? (If only one option exists, help identify alternatives first)
## Context
### Required Context (must have before proceeding)
- **The decision to make:** What architectural choice is being considered? Ask the user if not stated.
- **At least 2 viable options:** You can't analyze trade-offs with only one option. If the user presents just one, help them identify at least one alternative.
### Observable Context (gather from environment if available)
- **Codebase structure:** If a project exists, scan for architecture patterns already in use
→ Look for: directory structure, config files (docker-compose, k8s manifests), service boundaries
→ If unavailable: treat as greenfield
- **Existing architecture docs:** Check for ADRs, architecture diagrams, tech specs
→ Look for: `docs/`, `architecture/`, `adr/`, `*.adr.md`, README architecture sections
→ If unavailable: start from the user's verbal description
- **Team and operational context:** Team size, deployment frequency, cloud provider, budget constraints
→ If unavailable: ask the user for the top constraints
### Default Assumptions
- If no quality attributes specified → ask the user to pick their top 3 driving concerns
- If no team context → assume a small team (3-8 developers)
- If no existing architecture → assume greenfield project
- If no deployment context → assume cloud-native
## Process
### Step 1: Frame the Decision
**ACTION:** Clearly state the architectural decision that needs to be made. Name the competing options.
**WHY:** Fuzzy decisions produce fuzzy analysis. "Should we use microservices?" is too vague. "Should our order processing system use microservices or a modular monolith, given that we need independent scaling of the payment module?" is a decision you can actually analyze. Framing also prevents scope creep — the analysis stays focused on THIS decision.
**IF** the user hasn't specified options → help them enumerate at least 2-3 viable alternatives before proceeding.
### Step 2: Identify Relevant Quality Attributes
**ACTION:** Determine which architecture characteristics (quality attributes) are affected by this decision. Focus on the user's top 3 driving characteristics.
**WHY:** Not all quality attributes matter for every decision. Analyzing 15 attributes produces noise. The "Top-3 Rule" from architecture practice says: keep driving characteristics to three maximum. This forces prioritization and makes trade-offs visible. Each additional characteristic you design support for complicates the overall system — like flying a helicopter where every control affects every other control.
Common quality attribute categories:
- **Operational:** availability, scalability, performance, reliability, elasticity
- **Structural:** maintainability, extensibility, modularity, testability, deployability
- **Cross-cutting:** security, observability, simplicity, cost, time-to-market
For the full taxonomy, see [references/quality-attributes.md](references/quality-attributes.md).
**IF** the user says "all of them are important" → push back. Ask: "If you could only optimize for 3, which would they be?" This reveals the real priorities.
### Step 3: Analyze Each Option's Advantages
**ACTION:** For each architectural option, list what it does WELL across the identified quality attributes.
**WHY:** Start with advantages to build a fair picture. Most teams already lean toward one option — this step validates their intuition and builds confidence that you understand the options before challenging them.
### Step 4: Hunt for the Negatives
**ACTION:** For each option, actively search for disadvantages, risks, and hidden costs. This is the critical step.
**WHY:** "Programmers know the benefits of everything and the trade-offs of nothing. Architects need to understand both." The natural human bias is to see advantages of the preferred option and disadvantages of alternatives. An architect's core job is to overcome this bias. If you can't articulate the downsides of your chosen approach, you haven't analyzed it deeply enough. If you think there are no trade-offs, you haven't found them yet (First Law, Corollary 1).
Probe for:
- What gets WORSE when you optimize for the advantages?
- What new failure modes does this option introduce?
- What operational burden does it create?
- What happens when the system scales 10x?
- What coupling does this option hide?
### Step 5: Build the Trade-off Matrix
**ACTION:** Create a comparison table with options as columns and quality attributes as rows. For each cell, mark whether the option supports (+), hurts (-), or is neutral (=) for that attribute, with a brief justification.
**WHY:** A visual matrix makes trade-offs undeniable. When you see that Option A is (+) on scalability but (-) on simplicity and (-) on cost, while Option B is the reverse, the decision becomes a conscious prioritization rather than a gut feeling. This is also the primary artifact stakeholders can review.
Format:
```
| Quality Attribute | Option A | Option B |
|-------------------|---------------|---------------|
| Scalability | + (why) | - (why) |
| Simplicity | - (why) | + (why) |
| Cost | - (why) | + (why) |
| Deployability | + (why) | = (neutral) |
```
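If the matrix is assembled programmatically, a small renderer keeps the marks and justifications consistent. This is a sketch only: `render_matrix` and its input shape (attribute, then option, then a `(mark, why)` pair) are hypothetical.

```python
VALID_MARKS = {"+", "-", "="}


def render_matrix(
    options: list[str],
    ratings: dict[str, dict[str, tuple[str, str]]],
) -> str:
    """Render a trade-off matrix as a Markdown table."""
    header = "| Quality Attribute | " + " | ".join(options) + " |"
    divider = "|" + "---|" * (len(options) + 1)
    rows = [header, divider]
    for attribute, by_option in ratings.items():
        cells = []
        for option in options:
            mark, why = by_option[option]
            if mark not in VALID_MARKS:
                raise ValueError(f"unknown mark {mark!r} for {attribute}/{option}")
            cells.append(f"{mark} ({why})")
        rows.append(f"| {attribute} | " + " | ".join(cells) + " |")
    return "\n".join(rows)


print(render_matrix(
    ["Monolith", "Microservices"],
    {"Simplicity": {
        "Monolith": ("+", "one deployable"),
        "Microservices": ("-", "many services"),
    }},
))
```

Requiring a justification for every cell is deliberate: a bare `+` or `-` hides the reasoning that stakeholders need to review.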
### Step 6: Identify Synergies and Conflicts
**ACTION:** Note where quality attributes reinforce each other (synergies) or conflict (tensions) within each option.
**WHY:** Trade-offs aren't just between options — they exist WITHIN options too. Security almost always hurts performance. Scalability often conflicts with simplicity. Recognizing these internal tensions prevents surprise later. It also reveals "least worst" opportunities — options where the internal conflicts are most manageable for your specific context.
### Step 7: Apply the "Least Worst" Principle
**ACTION:** Recommend an option based on which has the most acceptable set of trade-offs for THIS specific context. Frame as "least worst" not "best."
**WHY:** There is no "best" architecture — only trade-offs. The goal is the option whose downsides are most tolerable given the team's constraints, business goals, and risk appetite. Framing as "least worst" sets honest expectations: you're choosing which problems you'd rather have, not eliminating problems. This prevents "Covering Your Assets" — the anti-pattern where architects avoid decisions out of fear of being wrong.
**The decision depends on:** deployment environment, business drivers, company culture, budgets, timeframes, developer skill set, and operational maturity.
**After stating the recommendation, always include a "Context Sensitivity" analysis:** explicitly state what would change the recommendation. For example: "If the team had 2+ years of microservices experience, we'd recommend Option A instead." This prevents the recommendation from being treated as universal truth — it's context-dependent, and the conditions under which it would flip should be transparent.
**Reference named anti-patterns when relevant:** When the analysis reveals a decision-making dysfunction (fear of deciding, repeated debates, lost rationale), name the anti-pattern explicitly:
- **Covering Your Assets** — avoiding decisions out of fear of being wrong
- **Groundhog Day** — same decisions debated repeatedly because rationale wasn't recorded
- **Email-Driven Architecture** — decisions lost because they were communicated only via email, not documented
- **Analysis Paralysis** — over-analyzing without deciding (the flip side of Covering Your Assets)
Naming the anti-pattern helps teams recognize and break the pattern.
### Step 8: Document the Decision
**ACTION:** Produce a Trade-off Analysis Report (see Output format below). If the decision is architecturally significant, also produce an ADR summary.
**WHY:** "Why is more important than how" (Second Law). Future developers can look at a system and figure out HOW it's structured. What they can't figure out is WHY those choices were made. Without documentation, teams fall into the "Groundhog Day" anti-pattern — the same decisions get debated repeatedly because nobody recorded the rationale.
A decision is architecturally significant if it affects: structure, nonfunctional characteristics, dependencies, interfaces, or construction techniques.
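The significance test and the ADR summary can be captured in a few lines. The sketch below is illustrative: `is_architecturally_significant` and `DecisionRecord` are hypothetical names, and the record's fields simply mirror the ADR section of the report format in this skill.

```python
from dataclasses import dataclass

SIGNIFICANT_AREAS = frozenset({
    "structure",
    "nonfunctional characteristics",
    "dependencies",
    "interfaces",
    "construction techniques",
})


def is_architecturally_significant(affected_areas: set[str]) -> bool:
    """True if the decision touches any of the five significant areas."""
    return any(area.strip().lower() in SIGNIFICANT_AREAS for area in affected_areas)


@dataclass
class DecisionRecord:
    title: str
    context: str       # forces at play
    decision: str      # active voice, including the WHY
    consequences: str  # positive AND negative trade-offs
    status: str = "Proposed"

    def to_markdown(self) -> str:
        return "\n".join([
            f"## {self.title}",
            f"- **Status:** {self.status}",
            f"- **Context:** {self.context}",
            f"- **Decision:** {self.decision}",
            f"- **Consequences:** {self.consequences}",
        ])
```

Keeping `consequences` as a required field makes it harder to record a decision without its downsides.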
## Inputs
- The architectural decision to analyze (from user)
- Quality attributes to evaluate (from user, or discovered in Step 2)
- Optionally: existing codebase, architecture docs, team context
## Outputs
### Trade-off Analysis Report
```markdown
# Trade-off Analysis: {Decision Title}
## Decision
{Clear statement of the decision being analyzed}
## Options Considered
1. **{Option A}** — {one-line description}
2. **{Option B}** — {one-line description}
3. **{Option C}** — {if applicable}
## Driving Quality Attributes
1. {Top priority attribute}
2. {Second priority}
3. {Third priority}
## Trade-off Matrix
| Quality Attribute | Option A | Option B | Option C |
|-------------------|----------|----------|----------|
| {Attribute 1} | + reason | - reason | = reason |
| {Attribute 2} | - reason | + reason | + reason |
| {Attribute 3} | + reason | = reason | - reason |
## Synergies and Conflicts
- {Option A}: {attribute X} reinforces {attribute Y} because...
- {Option B}: {attribute X} conflicts with {attribute Z} because...
## Recommendation
**{Recommended option}** — the least worst choice for this context because:
- {Primary justification tied to top driving attribute}
- {Secondary justification}
- {Acknowledged downsides and why they're acceptable}
## Risks of This Choice
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| {Risk 1} | Low/Med/High | Low/Med/High | {Concrete mitigation} |
| {Risk 2} | Low/Med/High | Low/Med/High | {Concrete mitigation} |
## Context Sensitivity
This recommendation assumes: {key assumptions}.
- **If {constraint X changed}** → we'd recommend {Option Y} instead because {reason}
- **If {constraint Z changed}** → the trade-off balance shifts toward {Option W} because {reason}
## Architecture Decision Record (if architecturally significant)
- **Status:** Proposed
- **Context:** {forces at play}
- **Decision:** {active voice, with full WHY justification}
- **Consequences:** {positive AND negative trade-offs}
```
## Key Principles
- **Everything is a trade-off** — If you think you've found something that isn't, you haven't looked hard enough. This isn't pessimism; it's the foundational reality of architecture that enables honest decision-making.
- **Why over how** — Document the reasoning, not just the outcome. Future developers can reverse-engineer how a system works; they can't reverse-engineer why it was built that way. WHY prevents repeated debates.
- **Least worst, not best** — Never shoot for the "best" architecture. Aim for the one with the most acceptable set of trade-offs for your specific context. This framing sets honest expectations and prevents decision paralysis.
- **Top-3 quality attributes** — Resist the urge to optimize for everything. Each additional quality attribute you support complicates the system. Force stakeholders to choose their top 3 driving characteristics.
- **Hunt the negatives** — The value of a trade-off analysis is in the disadvantages you discover, not the advantages. Advantages are easy to see. Disadvantages require deliberate searching. An analysis with no negatives is an incomplete analysis.
- **Context is everything** — "It depends" is a valid answer. The same trade-off analysis on the same options will produce different recommendations for different teams, budgets, timelines, and business goals. Never copy an architecture decision from another project without re-analyzing the trade-offs in YOUR context. Always state what would change the recommendation if constraints shifted.
- **Name the dysfunction** — When you spot a decision-making anti-pattern (fear of deciding, repeated debates, lost decisions), name it explicitly. Covering Your Assets, Groundhog Day, Email-Driven Architecture, and Analysis Paralysis are common patterns that teams fall into. Naming the pattern is the first step to breaking it.
## Examples
**Scenario: Messaging pattern for auction system**
Trigger: "Should we use pub/sub topics or point-to-point queues for our bidding system?"
Process: Framed decision as topics vs queues for bid distribution. Identified driving attributes: extensibility, security, monitoring. Built matrix showing topics excel at extensibility and decoupling, but queues excel at security (isolated access), heterogeneous contracts (per-consumer formats), and monitoring (per-queue metrics). Identified conflict: extensibility vs security within topics.
Output: Recommended queues for the payment and analytics services (security-critical), topics for the bid streaming service (extensibility-critical). Trade-off: accept tighter coupling in exchange for security and monitoring control.
**Scenario: Monolith vs microservices for sandwich ordering app**
Trigger: "We're building a simple ordering system but customization per franchise is key. Microservices?"
Process: Framed as modular monolith vs microkernel. Top 3 attributes: customization, simplicity, cost. Matrix showed both options are acceptable — modular monolith is simpler and cheaper but customization requires explicit override design; microkernel maps naturally to customization (plug-in per franchise) but adds infrastructure complexity. Applied least-worst: for a small team with budget constraints, modular monolith with a customization override endpoint is the least worst — it's simpler, cheaper, and customization can be explicitly designed with fitness functions.
Output: Trade-off analysis report + ADR recommending modular monolith with override endpoint. Explicitly documented the trade-off: accepting manual customization management in exchange for simplicity and low cost.
**Scenario: Synchronous vs asynchronous for high-scale system**
Trigger: "We're redesigning our auction platform. REST everywhere or should we mix in message queues?"
Process: Framed as synchronous-default vs async-where-needed. Top 3: reliability, scalability, simplicity. Matrix: sync is simpler and easier to debug but creates cascading failures under load; async scales better and buffers spikes but introduces data synchronization complexity, potential deadlocks, and harder debugging. Applied context: the payment service needs reliability buffering (many auctions ending simultaneously), but the session management service is simple request/response.
Output: Recommended mixed approach: synchronous by default (simpler), asynchronous for payment processing and bid capture (reliability-critical, spike-prone). Documented the "use synchronous by default, asynchronous when necessary" principle.
## References
- For the full list of quality attributes and their definitions, see [references/quality-attributes.md](references/quality-attributes.md)
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Fundamentals of Software Architecture by Mark Richards, Neal Ford.
## Related BookForge Skills
This skill is standalone. Browse more BookForge skills: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/quality-attributes.md
# Architecture Quality Attributes Reference
Use this reference when you need the full list of quality attributes for trade-off analysis. Organized by category.
## Operational Characteristics
These affect how the system runs in production.
| Attribute | Definition | Trade-off tension |
|-----------|-----------|-------------------|
| **Availability** | System uptime level (e.g., 99.9%) | Higher availability = more redundancy = higher cost |
| **Continuity** | Disaster recovery capability | Better DR = more infrastructure = higher cost/complexity |
| **Performance** | Response time, throughput, capacity | Better performance often conflicts with security (encryption overhead) and maintainability (optimized code is harder to read) |
| **Recoverability** | Time to recover from failure | Fast recovery = more sophisticated backup/failover = cost |
| **Reliability** | Data integrity, fail-safe behavior | Higher reliability often requires synchronous processing = lower throughput |
| **Robustness** | Error handling, boundary conditions | More robust = more defensive code = slower development velocity |
| **Scalability** | Handle growing load | Better scalability typically requires distributed architecture = more complexity, more cost |
| **Elasticity** | Handle sudden bursts | Elastic systems need auto-scaling infrastructure = cloud cost variability |
## Structural Characteristics
These affect how the codebase is organized and evolved.
| Attribute | Definition | Trade-off tension |
|-----------|-----------|-------------------|
| **Configurability** | End-user customization ease | More configurable = more code paths = harder to test |
| **Extensibility** | Adding new functionality | More extensible = more abstraction = initial complexity overhead |
| **Maintainability** | Ease of changes | Better maintainability often requires more modular architecture = more inter-service communication |
| **Modularity** | Separation of concerns | More modular = more boundaries = more coordination overhead |
| **Testability** | Ease of testing | Better testability requires clean interfaces = more upfront design effort |
| **Deployability** | Ease and frequency of releases | Better deployability often requires microservices or containers = operational complexity |
| **Portability** | Run on multiple platforms | More portable = more abstraction layers = potential performance loss |
| **Upgradeability** | Ease of version upgrades | Easier upgrades = more backward compatibility = code bloat |
## Cross-Cutting Characteristics
These span multiple categories.
| Attribute | Definition | Trade-off tension |
|-----------|-----------|-------------------|
| **Security** | Protection against threats | More security = more encryption/indirection = worse performance |
| **Accessibility** | Support for all users | Better accessibility = more implementation effort = slower delivery |
| **Observability** | Visibility into system behavior | More observability = more instrumentation = slight performance overhead |
| **Simplicity** | Ease of understanding | Simpler systems are easier to maintain but may sacrifice scalability/flexibility |
| **Cost** | Total cost of ownership | Lower cost often means accepting trade-offs in scalability, reliability, or performance |
| **Time-to-market** | Speed of initial delivery | Faster delivery often means accepting technical debt |
## Common Trade-off Pairs
These quality attributes frequently conflict:
| Attribute A | vs | Attribute B | Why they conflict |
|-------------|:---:|-------------|-------------------|
| Performance | vs | Security | Encryption, indirection, and access control add latency |
| Scalability | vs | Simplicity | Distributed systems scale better but are fundamentally more complex |
| Scalability | vs | Cost | More instances, more infrastructure, more operational overhead |
| Maintainability | vs | Performance | Clean, readable code runs slower than hand-optimized code |
| Extensibility | vs | Simplicity | Abstraction layers for future flexibility add current complexity |
| Reliability | vs | Performance | Synchronous processing ensures data integrity but limits throughput |
| Deployability | vs | Simplicity | Independent deployments require service boundaries = more moving parts |
| Time-to-market | vs | Maintainability | Shortcuts speed delivery but create technical debt |
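The pairs in the table above can be checked mechanically against a stakeholder's shortlist. The following is a minimal sketch, assuming attribute names are compared case-insensitively; the names `TRADE_OFF_PAIRS` and `conflicting_pairs` are illustrative and not part of the skill.

```python
from __future__ import annotations

from itertools import combinations

# Pairs copied from the "Common Trade-off Pairs" table (lowercased).
TRADE_OFF_PAIRS = {
    frozenset({"performance", "security"}),
    frozenset({"scalability", "simplicity"}),
    frozenset({"scalability", "cost"}),
    frozenset({"maintainability", "performance"}),
    frozenset({"extensibility", "simplicity"}),
    frozenset({"reliability", "performance"}),
    frozenset({"deployability", "simplicity"}),
    frozenset({"time-to-market", "maintainability"}),
}


def conflicting_pairs(selected: list[str]) -> list[tuple[str, str]]:
    """Return every pair of selected attributes that the table lists as conflicting."""
    names = sorted({name.lower() for name in selected})
    return [pair for pair in combinations(names, 2) if frozenset(pair) in TRADE_OFF_PAIRS]


print(conflicting_pairs(["Scalability", "Simplicity", "Cost"]))
# [('cost', 'scalability'), ('scalability', 'simplicity')]
```

A non-empty result is a prompt to discuss the tension explicitly, not a reason to reject the shortlist.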
Guide the systematic selection of an architecture style by evaluating domain needs, architecture characteristics, quantum count, data constraints, and organi...
---
name: architecture-style-selector
description: Guide the systematic selection of an architecture style by evaluating domain needs, architecture characteristics, quantum count, data constraints, and organizational factors against all major architecture styles (layered, pipeline, microkernel, service-based, event-driven, space-based, microservices). Use this skill whenever the user is choosing an architecture pattern, deciding between monolith and distributed, comparing architecture styles (e.g., "event-driven vs microservices"), asking "which architecture should we use?", starting a new system and considering options, or reconsidering their current architecture — even if they don't use the phrase "architecture style."
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/fundamentals-of-software-architecture/skills/architecture-style-selector
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
- architecture-characteristics-identifier
- architecture-quantum-analyzer
source-books:
- id: fundamentals-of-software-architecture
title: "Fundamentals of Software Architecture"
authors: ["Mark Richards", "Neal Ford"]
chapters: [9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
tags: [software-architecture, architecture, style-selection, monolith, distributed, microservices, event-driven, decision-making]
execution:
tier: 1
mode: full
inputs:
- type: none
description: "System description, requirements, and organizational context — the skill guides the entire selection process"
tools-required: [Read, Write]
tools-optional: [Grep, Glob]
mcps-required: []
environment: "Any agent environment. If a codebase exists, can analyze current architecture."
---
# Architecture Style Selector
## When to Use
You need to choose or recommend an architecture style for a system. This is the culminating architecture decision — it integrates characteristics analysis, quantum analysis, and feasibility checking into a concrete style recommendation. Typical situations:
- New system — "we're building X, what architecture should we use?"
- Architecture evaluation — "should we use microservices or event-driven?"
- Migration — "we've outgrown our monolith, what should we move to?"
- Validation — "we chose microservices, was that right?"
- Post-analysis — you've identified characteristics and quanta, now pick the style
Before starting, verify:
- Are architecture characteristics identified? If not, use `architecture-characteristics-identifier` first — you need the top 3 driving characteristics.
- Is quantum analysis done? If components need different quality attributes, use `architecture-quantum-analyzer` first — quantum count determines monolith vs distributed.
- If the user is only asking whether distribution is feasible (not which style), use `distributed-feasibility-checker` instead.
## Context & Input Gathering
### Input Sufficiency Check
This skill synthesizes multiple analysis dimensions. You can proceed with partial information and fill gaps during the process, but certain inputs dramatically improve the recommendation quality.
### Required Context (must have — ask if missing)
- **System purpose and domain:** What does this system do?
→ Check prompt for: domain description, problem statement, business context
→ If missing, ask: "What does your system do? What problem does it solve?"
- **Driving architecture characteristics:** What quality attributes matter most? (Top 3)
→ Check prompt for: scalability, performance, availability, deployability, elasticity, etc.
→ If available from prior `architecture-characteristics-identifier` output, use those
→ If missing, ask: "What are the top 3 quality attributes this system must excel at? For example: (a) scalability, (b) performance, (c) simplicity, (d) deployability, (e) fault tolerance, (f) elasticity, (g) evolutionary/agility, (h) cost"
- **Number of architecture quanta:** Do different parts need different characteristics?
→ Check prompt for: quantum analysis results, mentions of "some parts need X while others need Y"
→ If available from prior `architecture-quantum-analyzer` output, use that
→ If missing, ask: "Do all parts of your system share the same quality attribute needs, or do some parts need different characteristics? For example: 'the order processing needs high scalability but reporting just needs batch processing.'"
### Important Context (strongly recommended — ask if easy to obtain)
- **Team size and experience:** How many developers? What architectures has the team built before?
→ Check prompt for: team mentions, experience level, technology familiarity
→ If missing, ask: "How large is your development team, and what architecture styles has your team worked with before?"
- **Data architecture constraints:** Can data be partitioned? Are ACID transactions required across workflows?
→ Check prompt for: database mentions, transaction requirements, consistency needs
→ If missing and relevant (distributed styles under consideration), ask: "Does your system require strict transactional consistency across different workflows, or can parts tolerate eventual consistency?"
### Observable Context (gather from environment)
- **Existing architecture:** If restructuring, scan for current patterns
→ Look for: package structure (layered? domain?), service folders, docker-compose, k8s manifests
→ Reveals: current style, migration starting point
- **Infrastructure maturity:** What deployment and monitoring tools exist?
→ Look for: CI/CD configs, monitoring configs, container orchestration
→ Reveals: operational readiness for distributed styles
### Default Assumptions
- If characteristics unknown → ask before proceeding (this is critical input)
- If quantum count unknown → assume single quantum (default to monolith evaluation first)
- If team experience unknown → assume moderate experience (can handle service-based but not full microservices)
- If data constraints unknown → assume shared database is acceptable
### Sufficiency Threshold
```
SUFFICIENT: system purpose + top 3 characteristics + quantum count are known
PROCEED WITH DEFAULTS: system purpose + characteristics are known, quantum unclear
MUST ASK: system purpose OR driving characteristics are missing
```
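The threshold above can be read as a three-way check. This is a sketch only; the function name and parameter names are hypothetical, and missing inputs are represented as `None` or an empty list.

```python
from __future__ import annotations


def input_sufficiency(
    system_purpose: str | None,
    top_characteristics: list[str] | None,
    quantum_count: int | None,
) -> str:
    """Classify the gathered context using the three levels of the threshold."""
    # System purpose and driving characteristics are the critical inputs.
    if not system_purpose or not top_characteristics:
        return "MUST ASK"
    # An unknown quantum count falls back to the single-quantum default.
    if quantum_count is None:
        return "PROCEED WITH DEFAULTS"
    return "SUFFICIENT"
```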
## Process
### Step 1: Determine Monolith vs Distributed
**ACTION:** Based on quantum analysis, make the first and most impactful fork in the decision tree.
**WHY:** This is the single most important architectural decision. Every subsequent choice flows from it. The book is explicit: if a single set of architecture characteristics suffices for the entire system (one quantum), a monolith offers real advantages — simpler deployment, simpler testing, simpler debugging, lower cost. Distribution should only be chosen when different parts genuinely need different quality attributes, requiring multiple independent deployment units. Getting this wrong is expensive: choosing distributed when monolith suffices adds unnecessary operational complexity; choosing monolith when distribution is needed creates a bottleneck that's painful to refactor later.
**Decision logic:**
- **One quantum** (all components share the same characteristic profile) → **Monolith** likely sufficient. Evaluate layered, pipeline, and microkernel.
- **Multiple quanta** (components need different characteristics) → **Distributed** likely needed. Evaluate service-based, event-driven, space-based, and microservices.
- **Uncertain** → Default to monolith evaluation first. It's easier to extract services from a well-structured monolith than to merge poorly separated microservices.
**IF** monolith → proceed to Step 2A.
**IF** distributed → proceed to Step 2B.
**IF** uncertain → evaluate monolith options first (Step 2A), then check if they can support the requirements. If not, proceed to Step 2B.
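As a rough illustration of this fork, the sketch below maps a quantum count to the next step and the styles evaluated there. It assumes an unknown count is passed as `None`; `first_fork` and the two list constants are names invented for this example.

```python
from __future__ import annotations

MONOLITHIC = ["layered", "pipeline", "microkernel"]
DISTRIBUTED = ["service-based", "event-driven", "space-based", "microservices"]


def first_fork(quantum_count: int | None) -> tuple[str, list[str]]:
    """Return the next step and the candidate styles to evaluate in it."""
    if quantum_count is not None and quantum_count < 1:
        raise ValueError("quantum_count must be at least 1")
    if quantum_count is None or quantum_count == 1:
        # One quantum, or uncertain: evaluate monolithic styles first.
        return "Step 2A", MONOLITHIC
    # Multiple quanta: distribution is likely needed.
    return "Step 2B", DISTRIBUTED
```

In the uncertain case the caller still has to fall through to Step 2B if no monolithic style can support the requirements; the function only picks the starting point.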
### Step 2A: Evaluate Monolithic Styles
**ACTION:** Score the three monolithic styles against the driving characteristics using the comparison matrix. Check for domain/architecture isomorphism.
**WHY:** Each monolithic style has a distinct profile. Layered excels at simplicity and cost but scores 1 on nearly everything else. Pipeline is ideal for linear data processing but can't handle complex interactions. Microkernel is the best monolith for extensibility and customization. Choosing between them isn't arbitrary — it's driven by which style's natural strengths align with your driving characteristics, and whether the problem domain's shape naturally maps to the architecture's topology (isomorphism).
Consult the comparison matrix in [references/style-comparison-matrix.md](references/style-comparison-matrix.md) for detailed ratings.
**Evaluation for each candidate:**
| Style | Consider when... | Eliminate when... |
|-------|-----------------|-------------------|
| **Layered** | Simplicity and low cost are primary drivers; requirements still evolving; team is small | Scalability, elasticity, or deployability are driving characteristics |
| **Pipeline** | Data flows linearly through processing stages; ETL, content processing, orchestration | Workflows are not linear; complex user interaction patterns |
| **Microkernel** | High customizability needed; regional/client variations; plug-in extensibility | Need independent scaling of parts; multiple quanta required |
**Isomorphism check:** Does the problem domain naturally match the architecture topology?
- Data transformation pipeline → Pipeline
- Customizable product with rules/variants → Microkernel
- Simple business app, uncertain requirements → Layered (iterate from here)
**Output:** 1-2 candidate monolithic styles with characteristic scores, or a conclusion that monolith cannot meet the requirements (proceed to Step 2B).
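The "Eliminate when..." column of the table above can be applied as a first-pass filter before scoring. This sketch encodes only those elimination rules, under the simplifying assumption that "linear workflow" and quantum count are known inputs; the function and parameter names are hypothetical.

```python
from __future__ import annotations

# Drivers that rule out layered, per the "Eliminate when..." column.
LAYERED_BLOCKERS = {"scalability", "elasticity", "deployability"}


def monolithic_candidates(
    drivers: set[str], linear_workflow: bool, quantum_count: int
) -> list[str]:
    """Return the monolithic styles that survive the elimination rules."""
    candidates = []
    if not drivers & LAYERED_BLOCKERS:
        candidates.append("layered")
    if linear_workflow:
        candidates.append("pipeline")
    if quantum_count == 1:
        candidates.append("microkernel")
    # An empty list means no monolithic style fits: proceed to Step 2B.
    return candidates
```

The survivors still need the "Consider when..." judgment and the isomorphism check; a style that is not eliminated is not automatically a good fit.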
### Step 2B: Evaluate Distributed Styles
**ACTION:** Score the four distributed styles against the driving characteristics. Factor in data architecture and communication style.
**WHY:** Distributed styles have dramatically different profiles. Service-based is the pragmatic middle ground — good at most things, extreme at nothing, and preserves ACID transactions through shared database. Event-driven excels at performance and scalability but is the hardest to test. Space-based handles extreme elasticity through in-memory processing but at high cost. Microservices maximize independence but require the most operational maturity. The right choice depends on which characteristics you need to MAXIMIZE, not just "support."
Consult the comparison matrix in [references/style-comparison-matrix.md](references/style-comparison-matrix.md) for detailed ratings.
**Evaluation for each candidate:**
| Style | Consider when... | Eliminate when... |
|-------|-----------------|-------------------|
| **Service-based** | Need pragmatic distribution without full microservices complexity; ACID transactions needed; team transitioning from monolith | Need extreme scalability or elasticity; need per-service technology diversity |
| **Event-driven** | Performance and scalability are primary drivers; natural event flow in domain; real-time processing | Need request-reply semantics; team lacks async debugging experience; strong consistency required everywhere |
| **Space-based** | Extreme and unpredictable elasticity needs; variable load patterns; cost is secondary | Predictable, steady load; budget-constrained; data consistency is critical |
| **Microservices** | Maximum team autonomy; independent deployability; different tech stacks per service; mature DevOps | Small team; immature DevOps; highly coupled domain; need ACID transactions across services |
**Data architecture sub-decision:** Where should data live?
- **Shared database** → Service-based (simplest, preserves ACID)
- **Logically partitioned** → Service-based with domain-scoped schemas
- **Per-service databases** → Microservices or event-driven (requires eventual consistency)
**Communication sub-decision:** Synchronous or asynchronous?
- **Synchronous** (REST, gRPC) → Convenient but creates runtime coupling; limits scalability
- **Asynchronous** (events, messaging) → Better scalability and decoupling; harder to debug
- **Hybrid** → Most common in practice; synchronous for queries, async for commands/events
### Step 3: Check Organizational Fit
**ACTION:** Validate that the candidate style(s) match the team's capabilities and organizational constraints.
**WHY:** The technically ideal architecture may be operationally infeasible. A team of 5 developers without distributed systems experience choosing microservices is setting up for a distributed monolith — the worst possible outcome. The book is clear: organizational factors (team size, DevOps maturity, deployment process, budget) can and should override purely technical analysis. An architecture the team can't operate is worse than a simpler architecture they can operate well.
| Factor | Impact on style selection |
|--------|-------------------------|
| **Team size <10** | Avoid microservices. Consider service-based or monolith. |
| **Team size 10-30** | Service-based or limited microservices (start with 3-5 services). |
| **Team size 30+** | Microservices viable if DevOps maturity is high. |
| **No distributed experience** | Start with monolith or service-based. Do NOT jump to microservices. |
| **Immature CI/CD** | Avoid any style requiring per-service pipelines. Service-based max. |
| **Tight budget** | Monolithic styles (layered, pipeline, microkernel) strongly favored. |
| **Mergers/acquisitions expected** | Favor integration-friendly styles (service-based, microservices). |
| **Must ship fast** | Service-based or layered. Avoid event-driven and space-based (long setup). |
**IF** organizational constraints eliminate the technically best option → recommend the next-best style that the team CAN operate, with a roadmap to grow into the ideal style.
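Several rows of the table above are hard constraints that can be expressed as a filter over the seven styles. The sketch below covers team size, distributed experience, CI/CD maturity, and delivery pressure; budget and merger rows are preferences and are left to judgment. All names here are assumptions made for the example.

```python
from __future__ import annotations

MONOLITHIC = {"layered", "pipeline", "microkernel"}
ALL_STYLES = MONOLITHIC | {"service-based", "event-driven", "space-based", "microservices"}


def organizationally_feasible(
    team_size: int,
    has_distributed_experience: bool,
    mature_ci_cd: bool,
    must_ship_fast: bool,
) -> set[str]:
    """Return the styles the team could plausibly operate, per the table."""
    allowed = set(ALL_STYLES)
    if team_size < 10:
        allowed.discard("microservices")
    if not has_distributed_experience or not mature_ci_cd:
        # "Start with monolith or service-based" / "Service-based max".
        allowed &= MONOLITHIC | {"service-based"}
    if must_ship_fast:
        allowed -= {"event-driven", "space-based"}
    return allowed
```

Intersecting this set with the candidates from Step 2A or 2B shows whether the technically preferred style survives; if it does not, the next-best surviving style becomes the recommendation.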
### Step 4: Check for Anti-Patterns
**ACTION:** Verify the candidate style doesn't match known anti-patterns for this domain.
**WHY:** Each style has specific failure modes that are predictable and preventable. The architecture sinkhole in layered, the reuse-coupling trap in SOA, enforced heterogeneity in microservices — these aren't edge cases, they're the most common mistakes. Checking for anti-patterns before committing to a style prevents choosing a style that will fail in a predictable way. This is the "measure twice, cut once" step.
| Anti-pattern | Style affected | Detection | Resolution |
|-------------|---------------|-----------|------------|
| **Architecture Sinkhole** | Layered | >20% of requests pass through layers with no processing | Switch to open layers or consider a different style |
| **Distributed Monolith** | Microservices | Services share DB, deploy in lockstep, require synchronized changes | Consolidate into service-based, or fix service boundaries |
| **Too-Fine-Grained Services** | Microservices | Services smaller than bounded contexts, excessive inter-service calls | Merge related services; "microservice" is a label, not a description |
| **Enforced Heterogeneity** | Microservices | Mandating different tech per service | Use appropriate tech, not mandatory diversity |
| **Transactions Across Boundaries** | Microservices | Need for ACID across services | Fix granularity — services needing transactions belong together |
| **Broker/Mediator Mismatch** | Event-driven | Using broker for complex error-handling workflows, or mediator for simple fire-and-forget | Match topology to workflow complexity |
| **Reuse Coupling Trap** | SOA-style | Shared services create coupling between all consumers | Prefer duplication over coupling in distributed systems |
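The sinkhole detection rule in the table is a simple ratio test. A minimal sketch follows, assuming each sampled request has already been labeled as pass-through (`True`) or not (`False`); how that label is obtained (tracing, code review, logging) is outside this example, and the function names are hypothetical.

```python
from __future__ import annotations

from typing import Sequence


def sinkhole_share(pass_through: Sequence[bool]) -> float:
    """Fraction of sampled requests that passed through layers with no processing."""
    if not pass_through:
        return 0.0
    return sum(pass_through) / len(pass_through)


def has_sinkhole_problem(pass_through: Sequence[bool], threshold: float = 0.20) -> bool:
    """Flag the anti-pattern when the share exceeds the 20% threshold from the table."""
    return sinkhole_share(pass_through) > threshold
```

For instance, 3 pass-through requests in a sample of 10 gives a share of 0.3 and flags the anti-pattern, while 2 in 10 sits exactly at the threshold and does not.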
### Step 5: Score and Recommend
**ACTION:** Produce a scored comparison of the top 2-3 candidate styles and make a clear recommendation.
**WHY:** Architecture decisions are never binary — they're trade-off decisions where the goal is the "least worst set of trade-offs" (the book's exact words). Presenting scored alternatives with explicit trade-offs enables informed decision-making rather than dogmatic style selection. The recommendation should be specific enough to act on, including not just which style but how to get started with it.
**Scoring method:**
1. List the top 3 driving characteristics
2. For each candidate style, look up the star rating (1-5) for each characteristic in [references/style-comparison-matrix.md](references/style-comparison-matrix.md)
3. Sum the scores (max possible = 15 for 3 characteristics at 5 stars each)
4. Apply organizational fit modifier: -1 per significant organizational gap
5. Apply isomorphism bonus: +1 if domain naturally maps to the style's topology
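The scoring method above reduces to a small calculation. This is a sketch under stated assumptions: ratings are supplied by the caller as a characteristic-to-rating mapping, and the sample numbers in the usage lines are placeholders, not values taken from the comparison matrix. The function name `style_score` is illustrative.

```python
from __future__ import annotations


def style_score(
    ratings: dict[str, int],
    drivers: list[str],
    org_gaps: int = 0,
    isomorphic: bool = False,
) -> int:
    """Sum the ratings for the top 3 drivers, then apply the two modifiers."""
    if len(drivers) != 3:
        raise ValueError("exactly 3 driving characteristics are expected")
    for name in drivers:
        if not 1 <= ratings[name] <= 5:
            raise ValueError(f"rating for {name} must be between 1 and 5")
    total = sum(ratings[name] for name in drivers)  # at most 15
    total -= org_gaps  # -1 per significant organizational gap
    if isomorphic:
        total += 1  # +1 when the domain maps to the style's topology
    return total


# Placeholder ratings for one hypothetical candidate style.
example = {"scalability": 3, "simplicity": 4, "cost": 5}
print(style_score(example, ["scalability", "simplicity", "cost"], org_gaps=1, isomorphic=True))
# 12
```

Running the same function over each candidate gives the totals for the comparison table in the output template; close totals are a signal to weigh the first-priority characteristic more heavily rather than to trust the sum.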
## Inputs
- System description and domain
- Driving architecture characteristics (top 3, prioritized)
- Architecture quantum count (from quantum analysis or estimation)
- Team size, experience, and organizational constraints
- Data architecture constraints (ACID needs, partitioning feasibility)
## Outputs
### Architecture Style Recommendation
```markdown
# Architecture Style Selection: {System Name}
## Decision Context
**System:** {what it does}
**Driving characteristics:** {top 3, in priority order}
**Architecture quanta:** {count and reasoning}
**Team context:** {size, experience, constraints}
## Step 1: Monolith vs Distributed
**Decision:** {Monolith / Distributed / Hybrid}
**Reasoning:** {quantum count, characteristic variance, organizational factors}
## Candidate Evaluation
| Criterion | {Style A} | {Style B} | {Style C} |
|-----------|:---------:|:---------:|:---------:|
| {Characteristic 1} (priority) | {score}/5 | {score}/5 | {score}/5 |
| {Characteristic 2} | {score}/5 | {score}/5 | {score}/5 |
| {Characteristic 3} | {score}/5 | {score}/5 | {score}/5 |
| **Characteristic total** | **{sum}** | **{sum}** | **{sum}** |
| Organizational fit | {Good/Fair/Poor} | {Good/Fair/Poor} | {Good/Fair/Poor} |
| Domain isomorphism | {Yes/No} | {Yes/No} | {Yes/No} |
| Anti-pattern risk | {risk or "none"} | {risk or "none"} | {risk or "none"} |
## Data Architecture
**Data location:** {shared DB / partitioned / per-service}
**Communication:** {sync / async / hybrid}
**Consistency model:** {ACID / eventual / mixed}
## Recommendation
**Selected style: {Style Name}**
**Why this style:**
- {Primary reason — how it matches driving characteristics}
- {Secondary reason — organizational fit, isomorphism}
**Trade-offs accepted:**
- {What you give up by choosing this style}
- {What you gain}
**Trade-offs rejected (why alternatives were not chosen):**
- {Style B}: {why it was eliminated}
- {Style C}: {why it was eliminated}
## Getting Started
1. {First concrete step to implement this style}
2. {Second step}
3. {Key pattern or practice to adopt}
## Migration Path (if applicable)
{If current system exists, how to get from here to there}
```
## Key Principles
- **Everything is a trade-off** — The First Law of Software Architecture. There is no "best" architecture style — only the one with the least worst set of trade-offs for your specific context. Every style gains something by sacrificing something else. Anyone claiming one style is universally superior hasn't understood the problem.
- **Quantum count drives the first fork** — The monolith vs distributed decision is not a matter of preference. One quantum = monolith is architecturally sufficient. Multiple quanta with different characteristic needs = distribution is architecturally required. This is a structural determination, not a philosophical one.
- **Organizational fit trumps technical optimality** — The best architecture is one the team can actually build and operate. A technically perfect microservices design operated by a team without distributed experience produces worse outcomes than a "suboptimal" service-based architecture they can run well. Factor in team size, DevOps maturity, and operational capability.
- **Domain/architecture isomorphism matters** — Some problem domains naturally match certain architecture topologies. Customization-heavy systems map to microkernel. Linear data processing maps to pipeline. High-scale event processing maps to event-driven. Fighting isomorphism creates friction; embracing it creates natural solutions.
- **Start simple, evolve up** — When uncertain, start with the simplest style that could work. It's far easier to extract services from a well-structured monolith than to merge poorly separated microservices. Service-based architecture is often the best "starting distributed" option because it offers distribution benefits at moderate complexity.
- **Monolith is not a dirty word** — The book explicitly lists monolith advantages: simpler deployment, simpler testing, simpler debugging, lower cost. Many successful systems run as monoliths. Choosing monolith when it fits is a sign of architectural maturity, not backwardness. Don't recommend distribution to be "modern."
## Examples
**Scenario: Nationwide sandwich shop ordering system (Silicon Sandwiches)**
Trigger: "We're building an online ordering system for a sandwich franchise. Need web/mobile ordering, regional customization, promotions, and POS integration."
Process: Identified characteristics — scalability (lunch rush traffic), customizability (regional recipes), availability (ordering must work). Quantum analysis: single quantum — all features share the same scalability/availability profile. Monolith is sufficient. Evaluated monolithic styles: Layered (simplicity fits, but customizability scores 1), Microkernel (customizability is built-in, regional variations as plug-ins, scores well). Isomorphism check: the customization requirement naturally maps to microkernel's plug-in topology. Organizational fit: small team, low budget — microkernel's simplicity (4) and cost (5) work well.
Output: **Microkernel** recommended. Core system handles ordering workflow; plug-in components handle regional customization (recipes, prices, promotions). Layered was backup option but doesn't structurally support customization. Trade-off accepted: limited to single quantum scaling. Trade-off rejected: microservices would add operational cost without solving the core customization problem.
**Scenario: Online auction with real-time bidding (Going, Going, Gone)**
Trigger: "Building an online auction platform. Need real-time bidding, live video streaming, bid tracking, and payment processing. Expecting thousands of concurrent bidders."
Process: Identified characteristics — elasticity (bursty bidder traffic), performance (sub-second bids), reliability (bids cannot be lost), availability (auctioneer feed can't drop). Quantum analysis: MULTIPLE quanta — bidder-facing components need different characteristics (high elasticity) than auctioneer-facing components (high reliability). Different quanta → distributed. Evaluated distributed styles: Service-based (pragmatic but scores 2 on elasticity — insufficient for burst traffic), Event-driven (performance 5, scalability 5 — matches real-time event flow), Microservices (scalability 5 but adds complexity beyond what's needed), Space-based (elasticity 5 but massive cost for this scale). Isomorphism: real-time bid events naturally flow as events — event-driven topology matches the domain. Data: per-component databases (bidding needs strong consistency, tracking can be eventual). Communication: async for bid streams, sync for payment.
Output: **Event-driven architecture** with microservices topology for service boundaries. Bid processing uses broker topology for real-time event flow. Payment uses mediator topology for workflow orchestration with error handling. Trade-off accepted: harder to test, eventual consistency for non-critical paths. Trade-off rejected: service-based can't handle the elasticity requirements; pure microservices without event-driven doesn't match the domain's natural event flow.
**Scenario: Internal business application for insurance company**
Trigger: "We're building an insurance claim processing system. Multi-page forms where each page depends on context from previous pages. Team of 8 developers, no distributed experience. Budget is tight."
Process: Identified characteristics — reliability (claims can't be lost), simplicity (team is small), cost (budget-constrained). Quantum analysis: single quantum — the multi-page form workflow is HIGHLY semantically coupled (each page depends on previous context). This is a textbook example where distribution would create pain. Monolith is clearly appropriate. Organizational fit: 8 developers, no distributed experience, tight budget — this eliminates all distributed styles. Evaluated monolithic styles: Layered (high simplicity and low cost, handles the coupled workflow well), Pipeline (doesn't fit — forms aren't linear data transformations), Microkernel (doesn't fit — no plug-in/customization requirement). Anti-pattern check: watch for sinkhole anti-pattern as the app grows.
Output: **Layered architecture** recommended. The high semantic coupling of multi-page forms naturally fits a single deployment unit. Team size and experience align perfectly. Trade-off accepted: limited scalability and deployability — but these aren't driving characteristics. Trade-off rejected: Service-based was considered but adds unnecessary distribution complexity for a system with one quantum and a team without distributed experience. Explicit note: "highly coupled problem domain matches poorly with highly decoupled distributed architectures."
## References
- For detailed style ratings and profiles, see [references/style-comparison-matrix.md](references/style-comparison-matrix.md)
- For architecture characteristics identification, use `architecture-characteristics-identifier`
- For quantum analysis, use `architecture-quantum-analyzer`
- For distributed feasibility checking, use `distributed-feasibility-checker`
- For documenting the final decision, use `architecture-decision-record-creator`
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Fundamentals of Software Architecture by Mark Richards, Neal Ford.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-architecture-characteristics-identifier`
- `clawhub install bookforge-architecture-quantum-analyzer`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/style-comparison-matrix.md
# Architecture Style Comparison Matrix
> Source: Fundamentals of Software Architecture (Richards & Ford), Chapters 10-17
> Each characteristic rated 1-5 (1 = poorly supported, 5 = strongest feature)
## Complete Ratings Table
| Characteristic | Layered | Pipeline | Microkernel | Service-Based | Event-Driven | Space-Based | Microservices |
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| **Partitioning** | Technical | Technical | Domain/Tech | Domain | Technical | Domain | Domain |
| **Quanta** | 1 | 1 | 1 | 1 to many | 1 to many | 1 | 1 to many |
| **Deployability** | 1 | 2 | 3 | 4 | 3 | 3 | 4 |
| **Elasticity** | 1 | 1 | 1 | 2 | 3 | 5 | 5 |
| **Evolutionary** | 1 | 3 | 3 | 3 | 5 | 3 | 5 |
| **Fault tolerance** | 1 | 1 | 1 | 4 | 5 | 3 | 4 |
| **Modularity** | 1 | 3 | 3 | 4 | 4 | 3 | 5 |
| **Overall cost** | 5 | 5 | 5 | 4 | 3 | 2 | 1 |
| **Performance** | 2 | 2 | 3 | 3 | 5 | 5 | 2 |
| **Reliability** | 3 | 3 | 3 | 4 | 3 | 4 | 4 |
| **Scalability** | 1 | 1 | 1 | 3 | 5 | 5 | 5 |
| **Simplicity** | 5 | 5 | 4 | 3 | 1 | 1 | 1 |
| **Testability** | 2 | 3 | 3 | 4 | 2 | 1 | 4 |
> Note: Orchestration-driven SOA is intentionally excluded. The book treats it as a historical pattern with known coupling problems. If evaluating SOA, see Chapter 16.
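The ratings can also be encoded and queried to narrow the field before trade-off analysis. The sketch below is an illustration, not a procedure from the book: `RATINGS` copies four rows from the table above, while the `shortlist` helper and its rule of summing ratings are assumptions made for this example.

```python
# Ratings copied from the table above (subset of rows, 1-5 scale).
STYLES = ["Layered", "Pipeline", "Microkernel", "Service-Based",
          "Event-Driven", "Space-Based", "Microservices"]

RATINGS = {
    "elasticity":  [1, 1, 1, 2, 3, 5, 5],
    "performance": [2, 2, 3, 3, 5, 5, 2],
    "simplicity":  [5, 5, 4, 3, 1, 1, 1],
    "cost":        [5, 5, 5, 4, 3, 2, 1],
}


def shortlist(driving, top=3):
    """Rank styles by the summed rating of the driving characteristics.

    Hypothetical helper: summing ratings is a simplification used only
    to narrow the field; it does not replace trade-off analysis.
    """
    totals = {
        style: sum(RATINGS[c][i] for c in driving)
        for i, style in enumerate(STYLES)
    }
    # Sort by total, highest first; ties keep table order (stable sort).
    ranked = sorted(totals.items(), key=lambda kv: -kv[1])
    return ranked[:top]


print(shortlist(["elasticity", "performance"]))
# [('Space-Based', 10), ('Event-Driven', 8), ('Microservices', 7)]
```

A shortlist like this only says which styles rate well on paper; cost, team experience, and quantum analysis can still rule out the top entry.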
---
## Style Profiles
### Monolithic Styles
#### Layered Architecture
- **Topology:** Presentation → Business → Persistence → Database (closed layers by default)
- **Best for:** Small/simple applications, tight budgets, teams still analyzing requirements, starting points when style is undecided
- **Strengths:** Simplicity (5), cost (5), reliability (3)
- **Weaknesses:** Deployability (1), elasticity (1), scalability (1), fault tolerance (1), modularity (1), evolutionary (1)
- **Anti-pattern:** Architecture Sinkhole — requests pass through layers with no processing. Use the 80-20 rule: acceptable if roughly 20% of requests or fewer are sinkholes. If around 80% of requests are sinkholes, layered is the wrong style.
- **Key trade-off:** Easy to understand and build, but degrades quickly as applications grow larger
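The sinkhole check in the profile above can be turned into a rough measurement. This is a minimal sketch under stated assumptions: the `did_work_in_layers` flag, the function names, and the "review" verdict for the middle band are all hypothetical, and how a request is judged to be a sinkhole is left to the reader.

```python
def sinkhole_ratio(requests):
    """Fraction of requests that pass through the layers with no processing.

    `requests` is a hypothetical list of dicts carrying a boolean
    `did_work_in_layers` flag; how that flag is measured is up to you.
    """
    if not requests:
        return 0.0
    sinkholes = sum(1 for r in requests if not r["did_work_in_layers"])
    return sinkholes / len(requests)


def sinkhole_verdict(ratio):
    # Thresholds follow the 80-20 rule of thumb; the middle band and the
    # exact boundary handling are assumptions of this sketch.
    if ratio <= 0.20:
        return "acceptable"
    if ratio >= 0.80:
        return "wrong style"
    return "review"


sample = [{"did_work_in_layers": True}] * 9 + [{"did_work_in_layers": False}]
print(sinkhole_verdict(sinkhole_ratio(sample)))  # acceptable
```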
#### Pipeline Architecture
- **Topology:** Source → Filter → Filter → ... → Sink (unidirectional data flow)
- **Best for:** ETL tools, data transformations, orchestration engines, shell command chains, content processing
- **Strengths:** Simplicity (5), cost (5), modularity (3)
- **Weaknesses:** Elasticity (1), scalability (1), fault tolerance (1), performance (2)
- **Anti-pattern:** Forcing bidirectional data flow into a unidirectional pipeline
- **Key trade-off:** Excellent for linear processing workflows but cannot handle complex interaction patterns
- **Isomorphism:** Problems with linear data transformation stages naturally map to this style
#### Microkernel Architecture
- **Topology:** Core system + plug-in components (compile-time or runtime)
- **Best for:** Product-based applications with high customizability, IDE-like systems, insurance rules engines, tax software, workflow engines
- **Strengths:** Cost (5), simplicity (4), testability (3), deployability (3)
- **Weaknesses:** Elasticity (1), scalability (1), fault tolerance (1)
- **Anti-pattern:** Plug-in dependencies on each other (violates independence)
- **Key trade-off:** Excellent extensibility through plug-ins, but limited to single quantum (monolithic core)
- **Isomorphism:** Any problem requiring high customizability or regional/client variations naturally maps here
### Distributed Styles
#### Service-Based Architecture
- **Topology:** 4-12 coarse-grained domain services + shared database + separately deployed UI
- **Best for:** Pragmatic distributed systems, teams transitioning from monolith, domain-driven applications needing some distribution benefits without full microservices complexity
- **Strengths:** Deployability (4), fault tolerance (4), modularity (4), reliability (4), testability (4), cost (4)
- **Weaknesses:** Elasticity (2), scalability (3)
- **Anti-pattern:** Creating too many services (>12, becoming accidental microservices) or too few (<3, becoming a distributed monolith)
- **Key trade-off:** Best balance of distributed benefits vs operational complexity. The "pragmatic middle ground." ACID transactions still possible within services due to shared database.
- **Unique advantage:** Preserves database-level ACID transactions while gaining independent deployability
#### Event-Driven Architecture
- **Topology:** Broker topology (no central mediator, events flow freely) or Mediator topology (central orchestrator coordinates events)
- **Best for:** High-performance, highly scalable systems with complex event processing, real-time systems, IoT platforms
- **Strengths:** Performance (5), scalability (5), fault tolerance (5), evolutionary (5)
- **Weaknesses:** Simplicity (1), testability (2), overall cost (3)
- **Broker vs Mediator:**
- **Broker:** Higher decoupling, better performance, but no central error handling or workflow control
- **Mediator:** Better error handling and workflow control, but introduces coupling and potential bottleneck
- **Anti-pattern:** Using broker for complex workflows requiring error handling, or mediator for simple event notifications
- **Key trade-off:** Highest performance and scalability of any style, but hardest to test and reason about (eventual consistency, race conditions)
#### Space-Based Architecture
- **Topology:** Processing units with in-memory data grids, messaging grid, data pumps to persistent storage
- **Best for:** Systems with extreme and unpredictable scalability/elasticity needs — concert ticketing, auction systems, social media events
- **Strengths:** Elasticity (5), scalability (5), performance (5)
- **Weaknesses:** Simplicity (1), testability (1), overall cost (2)
- **Anti-pattern:** Using for systems with normal, predictable load patterns (massive over-engineering)
- **Key trade-off:** Can handle virtually unlimited scale through in-memory processing, but extremely expensive and complex to build and test
- **When it shines:** Variable load that would be cost-prohibitive to provision for peak with traditional architectures
#### Microservices Architecture
- **Topology:** Fine-grained, independently deployable services, each with its own database (bounded context), communicating via REST/messaging
- **Best for:** Maximum independent deployability, evolutionary architecture, large teams needing autonomy, systems requiring different technology stacks per service
- **Strengths:** Scalability (5), elasticity (5), evolutionary (5), modularity (5), deployability (4), fault tolerance (4), testability (4)
- **Weaknesses:** Overall cost (1), simplicity (1), performance (2)
- **Anti-patterns:**
- Enforced heterogeneity (mandating different tech stacks per service)
- Too fine-grained services (more communication overhead than benefit)
- Transactions across service boundaries (fix granularity instead!)
- **Key trade-off:** Maximum flexibility and scalability, but maximum operational complexity and cost. Requires mature DevOps practices.
- **Granularity guidance:** "Microservice" is a label, not a description (Martin Fowler). Service boundaries should capture a domain or workflow — use Purpose, Transactions, and Choreography to find boundaries.
---
## Quick Selection Guide
### By Primary Driving Characteristic
| If you need... | Consider first | Consider second |
|---|---|---|
| **Scalability** | Microservices, Event-driven | Space-based |
| **Elasticity** | Space-based, Microservices | Event-driven |
| **Performance** | Event-driven, Space-based | Pipeline (for data processing) |
| **Simplicity** | Layered, Pipeline | Microkernel |
| **Low cost** | Layered, Pipeline, Microkernel | Service-based |
| **Deployability** | Microservices, Service-based | Event-driven |
| **Fault tolerance** | Event-driven, Microservices | Service-based |
| **Evolutionary** | Microservices, Event-driven | Service-based |
| **Testability** | Microservices, Service-based | Microkernel, Pipeline |
| **Customizability** | Microkernel | Service-based |
| **Reliability** | Service-based, Microservices | Space-based |
### By Domain Isomorphism
| Domain pattern | Natural fit |
|---|---|
| Linear data processing, ETL | Pipeline |
| High customization, plug-in rules | Microkernel |
| Small/simple CRUD application | Layered |
| Pragmatic business application | Service-based |
| Real-time event processing | Event-driven |
| Extreme/variable scalability needs | Space-based |
| Maximum team autonomy + independent scaling | Microservices |
| Highly coupled domain (e.g., multi-page forms) | Service-based or Layered (NOT microservices) |
### By Organizational Context
| Context | Recommended |
|---|---|
| Small team (<10), tight budget | Layered or Microkernel |
| Medium team, needs some distribution | Service-based |
| Large team (30+), mature DevOps | Microservices or Event-driven |
| Unknown requirements, starting point | Layered (then migrate) |
| Must deliver fast, iterate later | Service-based |
Quantify architecture risk using a 2D risk matrix (impact x likelihood, scored 1-9) and produce structured risk assessment reports. Use this skill whenever t...
---
name: architecture-risk-assessor
description: Quantify architecture risk using a 2D risk matrix (impact x likelihood, scored 1-9) and produce structured risk assessment reports. Use this skill whenever the user asks about architecture risks, wants to evaluate risk across services or components, needs a risk matrix, mentions risk assessment, risk analysis, risk heat map, risk scoring, or asks "what are the risks?" for any architecture — even if they don't explicitly say "risk assessment." Also triggers when the user mentions unproven technology risk, scalability risk, availability concerns, security risk, data integrity risk, or wants to prioritize risks for stakeholder meetings.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/fundamentals-of-software-architecture/skills/architecture-risk-assessor
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
source-books:
- id: fundamentals-of-software-architecture
title: "Fundamentals of Software Architecture"
authors: ["Mark Richards", "Neal Ford"]
chapters: [20]
tags: [software-architecture, architecture, risk, risk-matrix, risk-assessment, governance]
depends-on: []
execution:
tier: 1
mode: full
inputs:
- type: none
description: "Architecture context from the user — system description, services, components, and concerns"
tools-required: [Read, Write]
tools-optional: [Grep, Glob]
mcps-required: []
environment: "Any agent environment. If a codebase exists, can read it for context."
---
# Architecture Risk Assessor
## When to Use
You need to systematically identify and quantify risks in a software architecture. Typical triggers:
- The user describes a system and asks "what are the risks?"
- The user is adopting unproven or unfamiliar technologies
- The user needs to present risk findings to stakeholders or leadership
- The user is planning a migration, new service, or significant architecture change
- The user wants to compare risk profiles across services or domains
- The user is doing sprint/iteration planning and wants to identify high-risk stories
Before starting, verify:
- Is there an architecture to assess? (At minimum, a description of services/components and their responsibilities)
- Is there a specific risk dimension to focus on, or should you cover all standard criteria?
## Context
### Required Context (must have before proceeding)
- **System description:** What services, components, or domains exist in the architecture?
-> Check prompt for: service names, component descriptions, architecture diagrams, system overviews
-> Check environment for: docker-compose files, k8s manifests, service directories, README files
-> If still missing, ask: "Can you describe the services or components in your architecture and their primary responsibilities?"
- **Risk concerns:** What is the user worried about? What prompted this risk assessment?
-> Check prompt for: mentions of failures, performance issues, security concerns, scaling problems, technology uncertainty
-> If still missing, proceed with all standard risk criteria (scalability, availability, performance, security, data integrity)
### Observable Context (gather from environment if available)
- **Technology stack:** What technologies are in use?
-> Look for: package.json, requirements.txt, go.mod, Dockerfile, infrastructure configs
-> If unavailable: rely on user description
- **Architecture style:** Monolith, microservices, event-driven, etc.
-> Look for: service count, communication patterns, deployment configs
-> If unavailable: infer from user description
- **Existing risk documentation:** Previous risk assessments, incident reports, post-mortems
-> Look for: docs/risks/, incident reports, ADRs mentioning risk
-> If unavailable: start fresh
### Default Assumptions
- If no risk criteria specified -> use the standard five: scalability, availability, performance, security, data integrity
- If no team context -> assume a team with moderate experience in the primary technology
- If technology maturity unknown -> ask, because unproven tech always gets highest risk (9)
- If no existing risk assessments found -> treat as first assessment (no direction indicators)
### Sufficiency Threshold
```
SUFFICIENT when ALL of these are true:
- System description with identifiable services/components is known
- At least one risk concern or dimension is identified
- Technology stack is known or can be inferred
PROCEED WITH DEFAULTS when:
- System description is known
- Risk criteria can use standard defaults
- Technology details are partially available
MUST ASK when:
- No system description exists (cannot assess risk without knowing what to assess)
- User mentions "unproven technology" but doesn't specify which one
- The architecture is ambiguous (could be interpreted as monolith or distributed)
```
## Process
### Step 1: Identify Architecture Components
**ACTION:** List all services, components, or domain areas that will be assessed for risk. Name each one clearly.
**WHY:** Risk assessment maps criteria AGAINST specific areas of the architecture. Without a clear component list, the assessment becomes vague hand-waving. Each component carries different risk profiles — a payment service has different risk exposure than a notification service. Identifying components first creates the columns of your risk assessment table.
**IF** the user provided a clear service list -> use it directly
**ELSE IF** a codebase is available -> scan for service boundaries (separate deployables, bounded contexts)
**ELSE** -> ask the user to enumerate their primary services or domains
### Step 2: Determine Risk Criteria
**ACTION:** Select the risk criteria (dimensions) to evaluate. Start with the standard five unless the user specifies different ones:
1. **Scalability** — Can each component handle increased load without degradation?
2. **Availability** — What is the impact and likelihood of each component going down?
3. **Performance** — Can each component meet latency and throughput requirements?
4. **Security** — What is the exposure to unauthorized access, data breaches, or compliance violations?
5. **Data Integrity** — What is the risk of data loss, corruption, or inconsistency?
**WHY:** Risk criteria form the rows of your assessment table. Using standardized criteria ensures consistency across assessments and makes them comparable over time. Custom criteria can be added for domain-specific concerns (e.g., "regulatory compliance" for fintech, "patient safety" for healthcare).
**IF** the user mentioned specific concerns -> add those as additional criteria
**IF** the domain has regulatory requirements -> add a compliance criterion
### Step 3: Score Each Cell Using the Risk Matrix
**ACTION:** For each component-criteria pair, assess two dimensions independently:
- **Impact** (1-3): How severe would it be if this risk materializes?
- 1 = Low: minor inconvenience, easy recovery
- 2 = Medium: significant disruption, recoverable with effort
- 3 = High: severe damage, potential data loss, major business impact
- **Likelihood** (1-3): How probable is it that this risk materializes?
- 1 = Low: unlikely given current architecture and controls
- 2 = Medium: possible under certain conditions (peak load, specific failures)
- 3 = High: probable or already occurring
- **Risk Score** = Impact x Likelihood (range: 1-9)
Classify the composite score:
- **1-2: Low risk (green)** — acceptable, monitor only
- **3-4: Medium risk (yellow)** — needs attention, plan mitigation
- **6-9: High risk (red)** — requires immediate action or architectural change
**WHY:** The 2D matrix separates two fundamentally different aspects of risk that people conflate. A risk with high impact but low likelihood (earthquake destroys data center) requires a different response than a risk with low impact but high likelihood (cache miss causing a slightly slower page load). Multiplying them produces a single comparable score, but keeping both dimensions visible enables smarter mitigation — you can reduce impact OR reduce likelihood.
**CRITICAL RULE:** For any unproven or unknown technology, always assign the highest risk score (9 — impact 3 x likelihood 3). Teams consistently underestimate the risk of technologies they haven't used in production. Unknown unknowns are the most dangerous risks.
For detailed matrix layout and visual reference, see [references/risk-matrix-template.md](references/risk-matrix-template.md).
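The scoring rules in this step can be sketched in a few lines of Python. The function names and the `unproven_tech` flag are assumptions chosen for illustration; the bands and the maximum-score rule come from the text above.

```python
def risk_score(impact, likelihood):
    """Composite score = impact x likelihood, each rated 1-3."""
    for name, value in (("impact", impact), ("likelihood", likelihood)):
        if value not in (1, 2, 3):
            raise ValueError(f"{name} must be 1, 2, or 3, got {value!r}")
    return impact * likelihood


def classify(score):
    """Map a composite score to the low/medium/high bands."""
    if score <= 2:
        return "low"
    if score <= 4:
        return "medium"
    return "high"  # only 6 or 9 can reach this branch


def score_cell(impact, likelihood, unproven_tech=False):
    # Hypothetical flag: applies the rule that unproven technology
    # always receives the maximum score (3 x 3 = 9).
    if unproven_tech:
        impact, likelihood = 3, 3
    score = risk_score(impact, likelihood)
    return score, classify(score)


print(score_cell(3, 1))                      # (3, 'medium')
print(score_cell(1, 2, unproven_tech=True))  # (9, 'high')
```

Keeping `impact` and `likelihood` as separate arguments, rather than accepting a precomputed score, preserves both dimensions for the mitigation discussion.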
### Step 4: Build the Risk Assessment Table
**ACTION:** Construct a comprehensive risk assessment table mapping all criteria (rows) against all components (columns). Include row totals (accumulated risk per criterion) and column totals (accumulated risk per component).
**WHY:** The table is the primary artifact. Row totals reveal which risk criteria are most concerning across the entire system — a high scalability total means the architecture has a systemic scalability problem, not just one service. Column totals reveal which components carry the most risk — these are the parts that need the most architectural attention. Both views are essential: criteria totals drive architectural strategy, component totals drive prioritization.
Format:
```
| Risk Criteria | Service A | Service B | Service C | Total |
|------------------|-----------|-----------|-----------|-------|
| Scalability | 6 (H) | 2 (L) | 4 (M) | 12 |
| Availability | 3 (M) | 9 (H) | 1 (L) | 13 |
| Performance | 2 (L) | 4 (M) | 6 (H) | 12 |
| Security | 9 (H) | 3 (M) | 3 (M) | 15 |
| Data Integrity | 6 (H) | 1 (L) | 9 (H) | 16 |
| **Total** | **26** | **19** | **23** | |
```
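The row and column totals in the example table can be reproduced with a short script. This is a sketch only: the `SCORES` layout (criterion mapped to one score per service) is an assumed data structure, and the numbers are the ones shown in the table above.

```python
# Scores from the example table: criterion -> score per service.
SERVICES = ["Service A", "Service B", "Service C"]
SCORES = {
    "Scalability":    [6, 2, 4],
    "Availability":   [3, 9, 1],
    "Performance":    [2, 4, 6],
    "Security":       [9, 3, 3],
    "Data Integrity": [6, 1, 9],
}

# Row totals: accumulated risk per criterion (points at systemic issues).
row_totals = {criterion: sum(cells) for criterion, cells in SCORES.items()}

# Column totals: accumulated risk per component (points at where to focus).
column_totals = {
    service: sum(cells[i] for cells in SCORES.values())
    for i, service in enumerate(SERVICES)
}

worst_criterion = max(row_totals, key=row_totals.get)
worst_component = max(column_totals, key=column_totals.get)
print(worst_criterion, row_totals[worst_criterion])    # Data Integrity 16
print(worst_component, column_totals[worst_component])  # Service A 26
```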
### Step 5: Add Risk Direction Indicators
**ACTION:** For each cell, add a direction indicator showing whether the risk is improving (+), worsening (-), or stable (=) compared to the previous assessment or recent trends.
**WHY:** A risk score of 6 that is improving (+) tells a very different story than a risk score of 6 that is worsening (-). Direction matters as much as current state. It shows whether mitigation efforts are working, whether new risks are emerging, and where to focus future attention. For first-time assessments, use observable signals: recent incidents suggest worsening (-), recent infrastructure improvements suggest improving (+).
**IF** this is the first assessment -> use contextual signals (recent incidents, known issues, recent improvements) to infer direction
**IF** previous assessments exist -> compare directly
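When a previous assessment exists, the indicator can be derived by comparing the two scores for a cell. A minimal sketch: the `direction` function and its `None` return for a missing baseline are assumptions for illustration, not part of the source method.

```python
def direction(previous, current):
    """Return '+', '-', or '=' for one cell of the assessment.

    A lower score than last time means the risk is improving ('+');
    a higher score means it is worsening ('-').
    """
    if previous is None:
        # First assessment: there is no baseline, so the direction has
        # to be inferred from contextual signals by a person.
        return None
    if current < previous:
        return "+"
    if current > previous:
        return "-"
    return "="


print(direction(9, 6))  # +
print(direction(4, 6))  # -
print(direction(6, 6))  # =
```

Note the sign convention: `+` marks a falling score, because the indicator describes the trend of the situation rather than the trend of the number.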
### Step 6: Create Filtered Views for Stakeholders
**ACTION:** Produce a filtered version of the risk assessment showing ONLY high-risk cells (scores 6-9). Replace low and medium cells with dots. Avoid dashes as placeholders, since a dash already marks a worsening risk.
**WHY:** Stakeholders and leadership don't need to see every cell. Showing only red (high-risk) areas focuses attention on what matters and prevents "risk fatigue" where everything looks concerning. The filtered view is what you present in meetings. The full table is the reference document for the architecture team.
Format:
```
| Risk Criteria | Service A | Service B | Service C |
|------------------|-----------|-----------|-----------|
| Scalability | 6 (H) - | . | . |
| Availability | . | 9 (H) - | . |
| Performance | . | . | 6 (H) = |
| Security | 9 (H) = | . | . |
| Data Integrity | 6 (H) + | . | 9 (H) - |
```
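Filtering can be done mechanically from the full table. The sketch below reuses the scores from the Step 4 example; the `filtered_view` helper, the `HIGH` constant, and the choice to drop rows with no high-risk cell are assumptions of this illustration.

```python
SERVICES = ["Service A", "Service B", "Service C"]
SCORES = {
    "Scalability":    [6, 2, 4],
    "Availability":   [3, 9, 1],
    "Performance":    [2, 4, 6],
    "Security":       [9, 3, 3],
    "Data Integrity": [6, 1, 9],
}
HIGH = 6  # lowest score classified as high risk


def filtered_view(scores):
    """Keep high-risk cells, replace the rest with '.', drop empty rows."""
    view = {}
    for criterion, cells in scores.items():
        row = [str(score) if score >= HIGH else "." for score in cells]
        if any(cell != "." for cell in row):
            view[criterion] = row
    return view


for criterion, row in filtered_view(SCORES).items():
    print(f"{criterion:<15}", *row)
```

Direction indicators are left out here for brevity; in practice they would be appended to each surviving cell.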
### Step 7: Recommend Mitigations for High-Risk Areas
**ACTION:** For each high-risk cell (6-9), propose a specific mitigation strategy. Include the estimated risk score AFTER mitigation to show the expected improvement.
**WHY:** Risk assessment without mitigation is just worry. The value is in the response plan. Including post-mitigation scores makes the business case concrete — "spending $X on database clustering reduces data integrity risk from 9 to 3." This feeds directly into budget negotiations with stakeholders.
**IF** the user needs Agile story-level risk analysis -> also apply the risk matrix to user stories (Step 8)
**ELSE** -> proceed to output
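Recording the score before and after each proposed mitigation makes the expected improvement easy to rank. The example below is a sketch with made-up entries: the `Mitigation` class, the actions, and the post-mitigation estimates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Mitigation:
    component: str
    criterion: str
    action: str
    score_before: int
    score_after: int  # an estimate, not a measurement

    @property
    def reduction(self):
        return self.score_before - self.score_after


# Illustrative entries only; actions and estimates are assumptions.
plan = [
    Mitigation("Service C", "Data Integrity", "nightly backups", 9, 6),
    Mitigation("Service A", "Security", "add token-based auth", 9, 3),
    Mitigation("Service B", "Availability", "add a second instance", 9, 4),
]

# Largest expected reduction first.
ranked = sorted(plan, key=lambda m: m.reduction, reverse=True)
for m in ranked:
    print(f"{m.component} / {m.criterion}: "
          f"{m.score_before} -> {m.score_after} ({m.action})")
```

Ranking by reduction alone ignores what each mitigation costs, so a real report would weigh the two together.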
### Step 8 (Optional): Agile Story Risk Analysis
**ACTION:** Apply the risk matrix to individual user stories during iteration planning:
- **Impact dimension:** Overall impact if this story is NOT completed in the iteration
- **Likelihood dimension:** Probability that this story will NOT be completed (complexity, dependencies, unknowns)
- Identify stories scoring 6-9 as high-risk and flag them for priority attention
**WHY:** The same risk matrix that works for architecture works for sprint planning. High-risk stories — those with high impact if missed AND high likelihood of not completing — should be started early, broken down further, or given to the most experienced developers. This connects architectural risk thinking to daily development practice.
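Applied to a backlog, the same arithmetic flags the stories that need early attention. A small sketch follows; the story names and ratings are invented for illustration.

```python
# (story, impact if missed, likelihood of not finishing), each rated 1-3.
# Story names and ratings are made up for this example.
stories = [
    ("Export report as CSV", 1, 2),
    ("Migrate legacy customer data", 3, 3),
    ("Integrate payment provider", 3, 2),
    ("Update footer links", 1, 1),
]

# Keep stories scoring 6-9 and list the riskiest first.
high_risk = sorted(
    ((name, impact * likelihood)
     for name, impact, likelihood in stories
     if impact * likelihood >= 6),
    key=lambda item: item[1],
    reverse=True,
)
print(high_risk)
# [('Migrate legacy customer data', 9), ('Integrate payment provider', 6)]
```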
## Inputs
- System description with identifiable services/components (from user or codebase)
- Risk concerns or dimensions to evaluate (from user, or use defaults)
- Optionally: previous risk assessments, incident history, technology stack details
## Outputs
### Architecture Risk Assessment Report
```markdown
# Architecture Risk Assessment: {System Name}
## Assessment Scope
- **Date:** {date}
- **Assessed by:** {who}
- **Architecture style:** {monolith/microservices/event-driven/etc.}
- **Components assessed:** {count}
- **Risk criteria:** {list}
## Architecture Components
1. **{Component A}** — {responsibility}
2. **{Component B}** — {responsibility}
3. **{Component C}** — {responsibility}
## Full Risk Assessment
| Risk Criteria | Component A | Component B | Component C | Total |
|------------------|------------------|------------------|------------------|-------|
| Scalability | {score} ({L/M/H}) {dir} | ... | ... | {sum} |
| Availability | ... | ... | ... | {sum} |
| Performance | ... | ... | ... | {sum} |
| Security | ... | ... | ... | {sum} |
| Data Integrity | ... | ... | ... | {sum} |
| **Total** | **{sum}** | **{sum}** | **{sum}** | |
### Scoring Key
- Score = Impact (1-3) x Likelihood (1-3)
- Low (L): 1-2 | Medium (M): 3-4 | High (H): 6-9
- Direction: + improving, - worsening, = stable
## High-Risk Summary (Filtered View)
{Filtered table showing only 6-9 scores}
## Risk Details and Mitigations
### {Component A} — {Risk Criteria} (Score: {N})
- **Impact ({1-3}):** {why this impact level}
- **Likelihood ({1-3}):** {why this likelihood level}
- **Direction:** {+/-/=} {reason}
- **Mitigation:** {specific recommendation}
- **Post-mitigation estimate:** {expected new score}
{Repeat for each high-risk cell}
## Systemic Risk Observations
- {Pattern observed across multiple components}
- {Risk criteria with highest total — indicates systemic issue}
- {Components with highest totals — indicates architectural attention needed}
## Recommendations Priority
1. {Highest priority mitigation} — addresses {risk}
2. {Second priority} — addresses {risk}
3. {Third priority} — addresses {risk}
```
## Key Principles
- **Separate impact from likelihood** — These are fundamentally different dimensions that require different mitigation strategies. High-impact/low-likelihood risks need contingency plans. Low-impact/high-likelihood risks need engineering fixes. Conflating them into a single "risk level" hides the appropriate response.
- **Unknown technology = maximum risk** — For any unproven or unfamiliar technology, always assign the highest risk score (9). Teams systematically underestimate the danger of technologies they haven't used in production. Unknown unknowns are the most dangerous class of risk. This isn't pessimism — it's calibration against a well-documented bias.
- **Direction matters as much as magnitude** — A risk score of 6 that is improving tells a completely different story than a score of 4 that is worsening. Track direction with every assessment. It reveals whether mitigation efforts are working and where new risks are emerging.
- **Filter for your audience** — Show the full risk assessment to the architecture team. Show only high-risk areas (6-9) to stakeholders and leadership. Risk fatigue is real — if everything looks red, nothing gets attention. Filtering focuses the conversation on what actually needs action.
- **Risk assessment is continuous, not one-time** — Architecture risk changes as the system evolves, new technologies are adopted, team composition shifts, and business requirements change. A risk assessment is a living document. Treat it like a fitness function — measure regularly, compare to previous results, act on trends.
- **Mitigation has a cost** — Every mitigation strategy costs something: money, complexity, development time, or operational burden. Present mitigations alongside their costs so stakeholders can make informed decisions. Sometimes accepting a medium risk is better than the cost of eliminating it.
## Examples
**Scenario: E-commerce platform risk assessment**
Trigger: "We have 4 services: customer registration, catalog checkout, order fulfillment, and order shipment. Can you assess the architecture risks?"
Process: Listed 4 components. Applied standard 5 risk criteria. Scored each cell using the impact x likelihood matrix. Customer registration scored low across the board. Catalog checkout scored high on performance (6) due to peak-hour load. Order fulfillment scored high on availability (9) and data integrity (6) because lost orders mean lost revenue. Order shipment scored medium across most criteria. Created filtered view showing only the 3 high-risk cells. Recommended: message queue between checkout and fulfillment (reduces availability risk from 9 to 3), database replication for order data (reduces data integrity risk from 6 to 2), auto-scaling for catalog during peak hours (reduces performance risk from 6 to 2).
Output: Full risk assessment report with filtered stakeholder view and 3 prioritized mitigations with cost estimates.
**Scenario: Unproven technology risk flag**
Trigger: "We're evaluating using CockroachDB for our new service. Nobody on the team has used it before."
Process: Immediately flagged CockroachDB as unproven technology — assigned risk score 9 (impact 3 x likelihood 3) per the unknown-technology rule. Assessed the service across standard criteria. Data integrity risk automatically high (9) due to unfamiliar database behavior under edge cases. Availability risk elevated (6) because the team can't troubleshoot production issues quickly. Recommended mitigations: proof-of-concept with production-like load before committing, dedicated learning sprint, identify rollback strategy to known database, engage vendor support.
Output: Risk assessment highlighting the technology risk with specific de-risking steps and a decision gate (proceed only if PoC succeeds).
**Scenario: Agile story risk analysis for sprint planning**
Trigger: "We have 12 stories for the next sprint. Some feel risky. How do we figure out which ones to prioritize?"
Process: Applied the risk matrix to each story. Impact = business impact if story not completed this sprint. Likelihood = probability of non-completion (based on complexity, dependencies, unknowns). Identified 3 stories scoring 6-9: a payment integration story (impact 3 x likelihood 2 = 6, depends on external API), a data migration story (impact 3 x likelihood 3 = 9, unknown data quality), and a performance optimization story (impact 2 x likelihood 3 = 6, requires load testing infrastructure). Recommended: start data migration story on day 1 with senior developer, spike the payment integration dependency immediately, defer performance story to next sprint if load testing infra isn't ready.
Output: Story risk matrix with all 12 stories scored, 3 flagged as high-risk with specific handling recommendations.
## References
- For the detailed risk matrix layout with visual scoring guide, see [references/risk-matrix-template.md](references/risk-matrix-template.md)
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Fundamentals of Software Architecture by Mark Richards, Neal Ford.
## Related BookForge Skills
This skill is standalone. Browse more BookForge skills: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/risk-matrix-template.md
# Risk Matrix Template
## The 2D Risk Matrix
Architecture risk is quantified using two independent dimensions, each scored 1-3. The composite score (1-9) determines the risk classification.
### Dimension 1: Impact
How severe would it be if this risk materializes?
| Score | Level | Description |
|:-----:|--------|----------------------------------------------------------------|
| 1 | Low | Minor inconvenience. Easy recovery. No data loss. No revenue impact. |
| 2 | Medium | Significant disruption. Recoverable with effort. Partial degradation of service. Some user impact. |
| 3 | High | Severe damage. Potential data loss. Major business impact. Revenue loss. Regulatory consequences. |
### Dimension 2: Likelihood
How probable is it that this risk materializes?
| Score | Level | Description |
|:-----:|--------|----------------------------------------------------------------|
| 1 | Low | Unlikely given current architecture and controls. Would require multiple simultaneous failures. |
| 2 | Medium | Possible under specific conditions: peak load, partial failures, edge cases. Has happened in similar systems. |
| 3 | High | Probable or already occurring. Known issue. Architectural weakness actively exploitable. |
### Composite Risk Score Matrix
```
                  IMPACT
             1       2       3
         +-------+-------+-------+
       1 |   1   |   2   |   3   |
         | (low) | (low) | (med) |
 L       +-------+-------+-------+
 I     2 |   2   |   4   |   6   |
 K       | (low) | (med) | HIGH  |
 E       +-------+-------+-------+
 L     3 |   3   |   6   |   9   |
 I       | (med) | HIGH  | HIGH  |
 H       +-------+-------+-------+
 O
 O
 D
```
### Risk Classification Thresholds
| Score Range | Classification | Color | Response |
|:-----------:|:--------------:|:------:|----------------------------------------------------------|
| 1-2 | Low | Green | Acceptable. Monitor only. No immediate action required. |
| 3-4 | Medium | Yellow | Needs attention. Plan mitigation within current or next iteration. |
| 6-9 | High | Red | Requires immediate action or architectural change. Escalate to stakeholders. |
Note: Scores 5, 7, and 8 cannot occur in a 3x3 matrix (no product of two values in the range 1-3 yields them). The missing 5 is intentional: it leaves a clear gap between medium (max 4) and high (min 6).
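The scoring and thresholds above can be expressed as a small sketch. The function names `risk_score` and `classify` are illustrative assumptions, not part of the template.

```python
def risk_score(impact: int, likelihood: int) -> int:
    """Composite score: impact (1-3) multiplied by likelihood (1-3)."""
    for name, value in (("impact", impact), ("likelihood", likelihood)):
        if value not in (1, 2, 3):
            raise ValueError(f"{name} must be 1, 2, or 3, got {value!r}")
    return impact * likelihood


def classify(score: int) -> str:
    """Map a composite score onto the low / medium / high thresholds."""
    if score <= 2:
        return "low"
    if score <= 4:
        return "medium"
    return "high"


# Every score the 3x3 matrix can actually produce.
POSSIBLE_SCORES = sorted({i * l for i in (1, 2, 3) for l in (1, 2, 3)})

if __name__ == "__main__":
    for s in POSSIBLE_SCORES:
        print(s, classify(s))
```

Because the reachable scores skip 5, the `score <= 4` check is enough to separate medium from high.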
## Standard Risk Criteria
These five criteria cover the most common architecture risk dimensions. Add domain-specific criteria as needed.
| Criterion | What It Measures | Common Risk Signals |
|------------------|----------------------------------------------------------|---------------------|
| **Scalability** | Can the component handle increased load? | No auto-scaling, shared database bottlenecks, synchronous chains, single-threaded processing |
| **Availability** | What happens when the component goes down? | No redundancy, single points of failure, no health checks, no circuit breakers |
| **Performance** | Can the component meet latency/throughput requirements? | Missing caching, N+1 queries, synchronous external calls in hot paths, no CDN |
| **Security** | What is the exposure to breaches or unauthorized access? | Unencrypted data at rest/transit, missing auth, broad network exposure, unpatched dependencies |
| **Data Integrity** | What is the risk of data loss, corruption, or inconsistency? | No backups, eventual consistency without conflict resolution, shared mutable state, no validation |
## Risk Assessment Table Template
### Full View (for architecture team)
```markdown
| Risk Criteria | Service A | Service B | Service C | Total |
|------------------|------------------|------------------|------------------|-------|
| Scalability | 6 (H) - | 2 (L) = | 4 (M) + | 12 |
| Availability | 3 (M) = | 9 (H) - | 1 (L) = | 13 |
| Performance | 2 (L) + | 4 (M) = | 6 (H) - | 12 |
| Security | 9 (H) = | 3 (M) + | 3 (M) = | 15 |
| Data Integrity | 6 (H) + | 1 (L) = | 9 (H) - | 16 |
| **Total** | **26** | **19** | **23** | |
```
### Filtered View (for stakeholder meetings — high-risk only)
```markdown
| Risk Criteria | Service A | Service B | Service C |
|------------------|------------------|------------------|------------------|
| Scalability | 6 (H) - | . | . |
| Availability | . | 9 (H) - | . |
| Performance | . | . | 6 (H) - |
| Security | 9 (H) = | . | . |
| Data Integrity | 6 (H) + | . | 9 (H) - |
```
Dots (.) replace low and medium scores to reduce noise and focus attention on what needs action.
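A minimal sketch of how the filtered view could be derived from the full scores. The data mirrors the example table; the helper name `filtered_view` is hypothetical, and direction indicators are left out for brevity.

```python
HIGH_RISK_THRESHOLD = 6  # scores 6-9 are classified as high


def filtered_view(scores, threshold=HIGH_RISK_THRESHOLD):
    """Replace every score below the high-risk threshold with a dot."""
    return {
        criterion: {
            service: (f"{score} (H)" if score >= threshold else ".")
            for service, score in row.items()
        }
        for criterion, row in scores.items()
    }


FULL_VIEW = {
    "Scalability": {"Service A": 6, "Service B": 2, "Service C": 4},
    "Availability": {"Service A": 3, "Service B": 9, "Service C": 1},
    "Performance": {"Service A": 2, "Service B": 4, "Service C": 6},
    "Security": {"Service A": 9, "Service B": 3, "Service C": 3},
    "Data Integrity": {"Service A": 6, "Service B": 1, "Service C": 9},
}

if __name__ == "__main__":
    for criterion, row in filtered_view(FULL_VIEW).items():
        print(criterion, row)
```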
## Direction Indicators
| Symbol | Meaning | Description |
|:------:|------------|------------------------------------------------|
| + | Improving | Risk is decreasing. Mitigation efforts working. Recent improvements made. |
| - | Worsening | Risk is increasing. New issues emerging. Previous mitigations insufficient. |
| = | Stable | No significant change since last assessment. |
### How to Determine Direction
For first-time assessments (no previous baseline):
- **Recent incidents in this area** -> worsening (-)
- **Recent infrastructure improvements** -> improving (+)
- **No recent changes or incidents** -> stable (=)
- **New/unproven technology recently adopted** -> worsening (-)
For subsequent assessments:
- Compare directly to previous risk scores
- Consider whether mitigation actions from the last assessment were implemented
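For subsequent assessments, the direct score comparison can be sketched as below. The function name `direction` is an assumption, and the sketch covers only the score comparison; whether earlier mitigations were actually implemented still needs human judgment.

```python
def direction(previous_score: int, current_score: int) -> str:
    """Return the direction symbol for a cell, given two assessments."""
    if current_score < previous_score:
        return "+"  # improving: risk is decreasing
    if current_score > previous_score:
        return "-"  # worsening: risk is increasing
    return "="      # stable: no change since last assessment
```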
## Unproven Technology Rule
**For any technology that the team has NOT used in production, always assign the maximum risk score (9).**
This is not negotiable. Teams systematically underestimate the risk of unfamiliar technologies because:
1. **Unknown unknowns** — You don't know what failure modes exist until you've operated the technology under real load
2. **Optimism bias** — Technology evaluations tend to focus on features (benefits) rather than operational characteristics (risks)
3. **Vendor marketing** — Published benchmarks don't reflect your specific usage patterns, data shapes, or scale
4. **Support gap** — When the team can't diagnose production issues, mean time to recovery (MTTR) skyrockets
The score of 9 is a forcing function: it ensures unproven technologies get explicit attention, a proof-of-concept, and a rollback plan before being committed to production.
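One way to make the rule hard to bypass is to encode it in the scoring helper itself. This is a sketch under the assumption that the assessor records a boolean `used_in_production` flag per technology; the flag and function name are hypothetical.

```python
MAX_RISK_SCORE = 9


def score_with_unproven_rule(impact: int, likelihood: int,
                             used_in_production: bool) -> int:
    """Apply the unproven technology rule before the normal calculation."""
    if not used_in_production:
        # Forcing function: the estimated impact and likelihood are ignored.
        return MAX_RISK_SCORE
    return impact * likelihood
```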
## Agile Story Risk Matrix
The same 2D matrix applies to user stories during sprint/iteration planning:
| Dimension | Architecture Risk | Story Risk |
|-------------|------------------------------------------|-------------------------------------------|
| **Impact** | Severity if risk materializes | Business impact if story is NOT completed |
| **Likelihood** | Probability risk materializes | Probability story will NOT be completed |
### Story Risk Signals
| Factor | Impact on Likelihood |
|---------------------------------|---------------------|
| Complex story with many unknowns | Increases (2-3) |
| Depends on external API/team | Increases (2-3) |
| Developer has done similar work | Decreases (1) |
| Story is well-spiked/prototyped | Decreases (1) |
| Requires new infrastructure | Increases (2-3) |
### Handling High-Risk Stories (6-9)
1. Start them on day 1 of the iteration — don't leave them until the end
2. Assign to the most experienced developer for that area
3. Break into smaller stories if possible to reduce likelihood
4. Identify and resolve dependencies before the iteration starts
5. Have a "plan B" scope reduction if the story is at risk mid-sprint
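The story-level matrix can be sketched as follows to produce a "start these on day 1" list. The `Story` class and the sample stories are illustrative assumptions; the scores follow the same impact x likelihood rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Story:
    name: str
    impact: int       # business impact if NOT completed (1-3)
    likelihood: int   # probability of non-completion (1-3)

    @property
    def score(self) -> int:
        return self.impact * self.likelihood


def high_risk_first(stories, threshold=6):
    """Return stories scoring at or above the threshold, riskiest first."""
    flagged = [s for s in stories if s.score >= threshold]
    # sorted() is stable, so equal scores keep their backlog order.
    return sorted(flagged, key=lambda s: s.score, reverse=True)


BACKLOG = [
    Story("payment integration", impact=3, likelihood=2),
    Story("data migration", impact=3, likelihood=3),
    Story("profile page copy", impact=1, likelihood=1),
    Story("performance optimization", impact=2, likelihood=3),
]

if __name__ == "__main__":
    for story in high_risk_first(BACKLOG):
        print(story.score, story.name)
```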
## Mitigation Documentation Template
For each high-risk cell in the assessment, document:
```markdown
### {Component} — {Risk Criterion} (Score: {N})
**Impact ({1-3}):** {Specific justification for impact level}
**Likelihood ({1-3}):** {Specific justification for likelihood level}
**Direction:** {+/-/=} — {Reason for trend}
**Root Cause:** {What architectural characteristic or decision creates this risk}
**Mitigation Options:**
1. {Option A} — Cost: {$X / complexity / time}
- Post-mitigation score: {expected new score}
- Trade-off: {what you give up}
2. {Option B} — Cost: {$X / complexity / time}
- Post-mitigation score: {expected new score}
- Trade-off: {what you give up}
**Recommended:** {Option N} because {justification tied to constraints}
```
Analyze a system's architecture quanta — independently deployable units with distinct quality attribute needs. Use this skill whenever the user needs to dete...
---
name: architecture-quantum-analyzer
description: Analyze a system's architecture quanta — independently deployable units with distinct quality attribute needs. Use this skill whenever the user needs to determine if their system should be monolith or distributed, is analyzing deployment boundaries, evaluating which parts of a system need different scalability/reliability/performance characteristics, decomposing a monolith, or asking "should this be one service or many?" — even if they don't use the term "quantum."
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/fundamentals-of-software-architecture/skills/architecture-quantum-analyzer
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
- architecture-characteristics-identifier
source-books:
  - id: fundamentals-of-software-architecture
    title: "Fundamentals of Software Architecture"
    authors: ["Mark Richards", "Neal Ford"]
    chapters: [3, 7, 8]
tags: [software-architecture, architecture, quantum, deployment, monolith-vs-distributed, coupling, connascence]
execution:
  tier: 2
  mode: hybrid
  inputs:
    - type: codebase
      description: "A software project to analyze for quantum boundaries, or a system description if no codebase exists"
  tools-required: [Read, Grep, Glob]
  tools-optional: [Bash]
  mcps-required: []
  environment: "Best with access to a codebase. Can work from system description for greenfield projects."
---
# Architecture Quantum Analyzer
An **architecture quantum** is an independently deployable artifact with high functional cohesion and synchronous connascence. The quantum count determines whether a system should use a monolithic or a distributed architecture: different quanta need different quality attributes, and a single monolith can optimize for only one set.
## When to Use
You're deciding whether a system should be one deployable unit or many, or you're analyzing the deployment boundaries of an existing system. Typical situations:
- "Should this be a monolith or microservices?" — quantum analysis provides the answer
- Decomposing a monolith — which parts should separate?
- Performance/scalability issues in one part of a system while other parts are fine
- Different teams need to deploy independently
- Pre-requisite for `architecture-style-selector` — quantum count informs style choice
Before starting, verify:
- Do you know the system's architecture characteristics? If not, use `architecture-characteristics-identifier` first
- Is there a codebase to analyze, or is this a greenfield design?
## Context
### Required Context (must have before proceeding)
- **System description:** What does the system do? What are its major components/services? Ask the user if not apparent.
- **Architecture characteristics:** The quality attributes that matter for this system. If unknown, invoke `architecture-characteristics-identifier` first.
### Observable Context (gather from environment if available)
- **Codebase structure:** Scan for service boundaries, packages, modules
→ Look for: `docker-compose.yml`, `k8s/` manifests, service directories, separate `package.json`/`pyproject.toml` per service
→ These reveal deployment topology and current boundaries
- **Communication patterns:** How do components talk to each other?
→ Look for: HTTP client imports (`httpx`, `requests`, `axios`), message queue imports (`pika`, `kafka`, `amqplib`), gRPC definitions
→ Synchronous = potentially the same quantum. Asynchronous = potentially separate quanta.
- **Database configuration:** Shared database = single quantum. Per-service databases = potential separate quanta
→ Look for: database connection configs, ORM models, migration files
- **Deployment configs:** What deploys together vs separately?
→ Look for: `docker-compose.yml` services, Kubernetes deployments, CI/CD pipeline stages
- **Architecture characteristics per component:** Do different parts have different scaling/reliability needs?
→ Look for: replica counts, resource limits, SLA configs, autoscaling rules
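As a rough illustration of the communication-pattern scan, the sketch below looks for the Python libraries named above in a source string. The library sets, the regex, and the function name are assumptions for illustration. An import is only a hint: it does not prove how a component actually communicates at runtime.

```python
import re

# Hypothetical mapping, taken from the Python libraries listed above.
SYNC_LIBS = {"httpx", "requests"}
ASYNC_LIBS = {"pika", "kafka"}

# Captures the top-level module of "import x" / "from x import y" lines.
# Only the first module of "import a, b" is captured.
IMPORT_RE = re.compile(
    r"^\s*(?:import|from)\s+([A-Za-z_][A-Za-z0-9_]*)", re.MULTILINE
)


def communication_hints(source: str) -> dict[str, set[str]]:
    """Return which known sync / async client libraries a module imports."""
    modules = set(IMPORT_RE.findall(source))
    return {"sync": modules & SYNC_LIBS, "async": modules & ASYNC_LIBS}
```

In practice this would be run over each service directory, with the results feeding the communication map in Step 2.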
### Default Assumptions
- If no codebase → work from user's system description (greenfield analysis)
- If no deployment configs → assume everything deploys together (monolith)
- If no explicit characteristics per component → ask user which parts have different needs
## Process
### Step 1: Identify Components
**ACTION:** List all major components, services, or modules in the system.
**WHY:** You can't find quantum boundaries without knowing what the pieces are. Components are the building blocks — quanta are how they group based on deployment and coupling. If you're analyzing a codebase, scan the file structure. If greenfield, list the planned components.
**IF** codebase exists → scan directory structure, docker-compose, deployment configs
**ELSE** → ask user to list the major components and their responsibilities
### Step 2: Map Communication Patterns
**ACTION:** For each pair of components that communicate, determine if the communication is synchronous or asynchronous.
**WHY:** This is the critical step. Synchronous connascence (one component waits for another's response) means both components share fate during the call — they MUST have compatible operational characteristics. If Service A calls Service B synchronously and A needs 99.99% availability but B only has 99%, A's availability is capped at B's. Asynchronous communication (fire-and-forget via message queue) breaks this fate-sharing — each component can have independent characteristics.
Map each communication as:
- **Synchronous:** REST calls, gRPC, direct function calls, shared database reads/writes
- **Asynchronous:** Message queues (RabbitMQ, Kafka, SQS), event buses, async event publishing
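The fate-sharing argument can be made concrete with a little arithmetic. This sketch assumes failures are independent and that every request needs every synchronous dependency; under those assumptions the caller's effective availability is the product of the individual availabilities, so it can never exceed that of its weakest dependency.

```python
import math


def effective_availability(own: float, *sync_dependencies: float) -> float:
    """Availability of a caller that blocks on each synchronous dependency.

    Assumes independent failures and that every request touches every
    dependency. Availabilities are fractions, e.g. 0.9999 for 99.99%.
    """
    return own * math.prod(sync_dependencies)


if __name__ == "__main__":
    # Service A (99.99%) calling Service B (99%) synchronously.
    print(f"{effective_availability(0.9999, 0.99):.6f}")
```

With an asynchronous hand-off, B's downtime delays processing but does not fail A's request, so B would drop out of the product.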
### Step 3: Identify Architecture Characteristics Per Component
**ACTION:** Determine what quality attributes each component needs. Look for differences between components.
**WHY:** The whole point of quantum analysis is discovering that different parts of the system need DIFFERENT characteristics. If everything needs the same scalability, reliability, and performance — it's one quantum, and a monolith is fine. But if the bidding engine needs extreme elasticity while the payment service needs extreme reliability, they're in different quanta with different architectural needs. This non-uniformity is what drives the need for distributed architecture.
**CAUTION — the uniform characteristics anti-pattern:** Don't assume the whole system has one set of characteristics. This is the most common mistake. Ask: "Does the order processing part of the system need the same scalability as the notification part?" If the answer is no, you have multiple quanta.
### Step 4: Group Into Quanta
**ACTION:** Group components into quanta based on the three-criteria test. Components belong to the same quantum if they satisfy ALL THREE:
1. **Deploy together** — they ship as one unit (or must be deployed in lockstep)
2. **High functional cohesion** — they serve a unified business purpose together
3. **Synchronous connascence** — they communicate synchronously (fate-sharing)
**WHY:** These three criteria are AND conditions, not OR. Two services might deploy independently (criterion 1 fails) but communicate synchronously (criterion 3 met) — they're still NOT the same quantum because independent deployment means they CAN have different characteristics. Conversely, two components that deploy together but serve unrelated purposes (low cohesion) are forced into the same quantum by deployment, but this might be a design problem worth flagging.
**Remember:** Databases are part of the quantum. If two services share a database, they share a quantum — because you can't deploy the database independently from either service.
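The three-criteria test lends itself to a simple grouping sketch: treat every pair that satisfies all three criteria as linked, then collect the connected groups. The `Link` class and function names are hypothetical, and the sketch does not model the shared-database rule separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    a: str
    b: str
    deploy_together: bool
    cohesive: bool
    synchronous: bool

    def same_quantum(self) -> bool:
        # AND, not OR: all three criteria must hold.
        return self.deploy_together and self.cohesive and self.synchronous


def group_into_quanta(components, links):
    """Union components joined by links that pass the three-criteria test."""
    parent = {c: c for c in components}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for link in links:
        if link.same_quantum():
            parent[find(link.a)] = find(link.b)

    groups = {}
    for c in components:
        groups.setdefault(find(c), set()).add(c)
    return sorted(sorted(g) for g in groups.values())


COMPONENTS = ["Bidder", "Auction", "Payment", "Notification"]
LINKS = [
    Link("Bidder", "Auction", True, True, True),
    Link("Auction", "Payment", False, False, False),
    Link("Auction", "Notification", False, False, False),
]

if __name__ == "__main__":
    print(group_into_quanta(COMPONENTS, LINKS))
```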
### Step 5: Analyze Quantum Characteristics
**ACTION:** For each identified quantum, list its driving architecture characteristics (use the top 3 from `architecture-characteristics-identifier`). Note where quanta DIFFER.
**WHY:** The value of quantum analysis is revealing that different quanta have different needs. If Quantum A needs elasticity + performance and Quantum B needs reliability + security, a single monolith cannot optimize for both simultaneously. This difference is what justifies the complexity of distributed architecture.
### Step 6: Determine Architecture Direction
**ACTION:** Based on quantum count and characteristic differences, recommend monolith vs distributed.
**WHY:** This is the payoff. The quantum count IS the architecture style driver:
| Quantum count | Characteristic uniformity | Recommendation |
|:---:|:---:|-------------|
| 1 | N/A (only one) | **Monolith** — single set of characteristics, simple deployment |
| Multiple | Same characteristics | **Monolith might still work** — if quanta need the same things, a monolith can satisfy all |
| Multiple | **Different** characteristics | **Distributed required** — different quanta need different optimization, monolith can't serve both |
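The decision table can be read as a small function. In this sketch each quantum is described by a set of driving characteristics, and "uniform" is taken to mean that all sets are identical; that definition and the function name are simplifying assumptions.

```python
def recommend(profiles: dict[str, frozenset[str]]) -> str:
    """Map quantum count and characteristic uniformity to a direction."""
    if not profiles:
        raise ValueError("at least one quantum is required")
    if len(profiles) == 1:
        return "monolith"
    uniform = len(set(profiles.values())) == 1
    return "monolith might still work" if uniform else "distributed"


AUCTION = {
    "bidding": frozenset({"elasticity", "performance"}),
    "payment": frozenset({"reliability", "security"}),
    "notification": frozenset({"availability"}),
}

if __name__ == "__main__":
    print(recommend(AUCTION))
```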
## Inputs
- Codebase to analyze (preferred) OR system description for greenfield
- Architecture characteristics (from `architecture-characteristics-identifier` or user input)
## Outputs
### Quantum Analysis Report
```markdown
# Quantum Analysis: {System Name}
## Components Identified
| Component | Responsibility | Deployment unit |
|-----------|---------------|----------------|
| {name} | {what it does} | {how it deploys} |
## Communication Map
| From | To | Type | Mechanism | Fate-sharing? |
|------|-----|------|-----------|:---:|
| {A} | {B} | Sync/Async | REST/MQ/gRPC | Yes/No |
## Quantum Map
| Quantum | Components | Driving Characteristics | Communication type |
|---------|-----------|------------------------|:---:|
| {Quantum 1} | {A, B} | {elasticity, performance} | Internal: sync |
| {Quantum 2} | {C} | {reliability, security} | External: async from Q1 |
## Characteristic Comparison
| Characteristic | Quantum 1 | Quantum 2 | Quantum 3 | Uniform? |
|---------------|-----------|-----------|-----------|:---:|
| {attr} | High/Med/Low | High/Med/Low | High/Med/Low | Yes/No |
## Architecture Direction
**Quantum count:** {N}
**Characteristic uniformity:** {Uniform / Non-uniform}
**Recommendation:** {Monolith / Distributed}
**Reasoning:** {why, based on quantum analysis}
## Warnings
- {Any anti-patterns detected: uniform characteristics assumption, shared DB coupling, etc.}
```
## Key Principles
- **Synchronous connascence = shared fate** — If Service A calls Service B synchronously, they must have compatible operational characteristics for the duration of that call. A highly scalable caller paired with a non-scalable callee creates a bottleneck. This is why sync communication defines quantum boundaries.
- **The database is part of the quantum** — A shared database means shared deployment. You cannot independently deploy services that share a database without risk of schema conflicts. Legacy systems with one shared database are, by definition, a single quantum regardless of how many services exist.
- **Non-uniform characteristics drive distribution** — The ONLY valid reason to accept the complexity of distributed architecture is that different parts of the system need genuinely different quality attributes. If everything needs the same characteristics, keep it monolith. Distribution for its own sake is unnecessary complexity.
- **Don't assume uniformity** — The most common mistake is applying one set of characteristics to the entire system. Ask about each major component: "Does this part need the same scalability/reliability/performance as the other parts?" Differences reveal quantum boundaries.
- **Quantum = bounded context (deployment lens)** — In Domain-Driven Design, bounded contexts define functional boundaries. Architecture quanta add the deployment and operational perspective. A bounded context with its own database that deploys independently IS a quantum.
## Examples
**Scenario: Online auction system (Going Going Gone)**
Trigger: "Our auction platform has bidding, payment, and notification features. Should we use microservices?"
Process: Identified 4 components (Bidder, Auction, Payment, Notification). Mapped communication: Bidder↔Auction is synchronous REST (same quantum — bidders need instant auction state), Auction→Payment is async via message queue (different quantum — payment needs reliability, not speed), Auction→Notification is async (different quantum — notifications can be delayed). Characteristics: Bidding quantum needs elasticity+performance (auction traffic bursts), Payment quantum needs reliability+security, Notification quantum needs availability. Three quanta with different characteristics → distributed architecture required.
Output: Quantum analysis showing 3 quanta, non-uniform characteristics, recommending distributed with event-driven communication between quanta.
**Scenario: Simple ordering system analysis**
Trigger: "We're a small team building an ordering app. Our CTO wants microservices but I think we're overcomplicating things."
Process: Identified components (Order, Inventory, Payment, User). All communicate synchronously via REST, share one PostgreSQL database, deploy as one Docker container. All need the same moderate characteristics (availability, simplicity). One quantum with uniform characteristics → monolith recommended. Flagged the shared database as proof of single quantum.
Output: Quantum analysis showing 1 quantum, recommending monolith. Diplomatically addressed CTO's microservices enthusiasm by showing the quantum analysis doesn't justify distribution.
**Scenario: Codebase analysis of existing system**
Trigger: User has a codebase at `./test-env/` — "Analyze this system's architecture quanta"
Process: Scanned docker-compose.yml, found 4 services with different networks. Read source files, found synchronous HTTP calls (httpx) between bidder and auction services, asynchronous RabbitMQ between auction→payment and auction→notification. Read architecture-characteristics.yaml, found different characteristic profiles per service. Grouped: Bidder+Auction (sync, shared network, same scaling) = Quantum 1, Payment (async consumer, independent) = Quantum 2, Notification (async consumer, independent) = Quantum 3.
Output: Full quantum analysis with communication map, quantum groupings, characteristic comparison, and distributed architecture recommendation.
## References
- For connascence types and their implications, see [references/connascence-types.md](references/connascence-types.md)
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Fundamentals of Software Architecture by Mark Richards, Neal Ford.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-architecture-characteristics-identifier`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/connascence-types.md
# Connascence Types Reference
Connascence defines how components are coupled. Two components are connascent if a change in one requires the other to be modified to maintain correctness. Use this to identify quantum boundaries.
## Static Connascence (discoverable from source code)
| Type | Description | Quantum impact |
|------|-----------|---------------|
| **Name (CoN)** | Components agree on entity names | Weakest. Normal coupling. |
| **Type (CoT)** | Components agree on entity types | Normal coupling. |
| **Meaning (CoM)** | Components agree on meaning of values (e.g., TRUE=1) | Moderate coupling. |
| **Position (CoP)** | Components agree on parameter order | Moderate coupling. |
| **Algorithm (CoA)** | Components agree on an algorithm (e.g., hashing) | Strongest static. Cross-service algorithm agreement = tight coupling. |
Static connascence exists at compile time. It indicates structural coupling but does NOT define quantum boundaries by itself.
## Dynamic Connascence (observable at runtime)
| Type | Description | Quantum impact |
|------|-----------|---------------|
| **Execution (CoE)** | Order of execution matters | Implies synchronous coordination. |
| **Timing (CoT)** | Timing of execution matters (race conditions) | Implies shared fate — same quantum. |
| **Values (CoV)** | Multiple values must change together (distributed transactions) | Strong coupling — same quantum or explicit saga pattern. |
| **Identity (CoI)** | Components reference the same entity | Strongest dynamic. Shared mutable state = same quantum. |
## The Quantum Rule
**Synchronous connascence = same quantum.** If Component A synchronously waits for Component B's response, they share operational fate for the duration of the call. Their availability, scalability, and performance characteristics must be compatible.
**Asynchronous communication separates quanta.** Fire-and-forget via message queues means Component A does not wait for Component B, so the two can have independent operational characteristics.
## Strength Ordering (weakest to strongest)
```
Static: Name → Type → Meaning → Position → Algorithm
Dynamic: Execution → Timing → Values → Identity
```
Stronger connascence = harder to refactor = more likely to be within the same quantum.
Design automated governance mechanisms (fitness functions) that objectively measure and enforce architecture characteristics over time. Use this skill whenev...
---
name: architecture-fitness-function-designer
description: Design automated governance mechanisms (fitness functions) that objectively measure and enforce architecture characteristics over time. Use this skill whenever the user asks about architecture governance, fitness functions, automated architecture testing, architecture compliance checks, preventing architecture erosion, enforcing layer dependencies, cyclomatic complexity thresholds, ArchUnit or NetArchTest rules, structural tests for architecture, CI/CD architecture gates, chaos engineering as governance, measuring architecture characteristics objectively, architecture drift detection, continuous architecture verification, or wants to ensure their codebase stays aligned with architecture decisions -- even if they don't use the term "fitness function."
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/fundamentals-of-software-architecture/skills/architecture-fitness-function-designer
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
source-books:
  - id: fundamentals-of-software-architecture
    title: "Fundamentals of Software Architecture"
    authors: ["Mark Richards", "Neal Ford"]
    chapters: [6]
tags: [software-architecture, fitness-functions, governance, metrics, architecture-erosion, ArchUnit, CI-CD, cyclomatic-complexity, chaos-engineering]
depends-on:
  - architecture-characteristics-identifier
  - modularity-health-evaluator
execution:
  tier: 2
  mode: hybrid
  inputs:
    - type: codebase
      description: "A software project with identified architecture characteristics to govern"
    - type: none
      description: "Alternatively, a description of architecture decisions and characteristics to protect"
  tools-required: [Read, Write]
  tools-optional: [Grep, Glob, Bash]
  mcps-required: []
  environment: "Best results inside a codebase directory with CI/CD pipeline access. Can also produce a governance plan from descriptions."
---
# Architecture Fitness Function Designer
## When to Use
You need to create automated, objective mechanisms that verify your architecture characteristics are maintained over time. Typical triggers:
- The user has identified architecture characteristics (scalability, deployability, testability, etc.) but has no automated way to verify them
- The user's architecture decisions are being violated without detection -- code is drifting from the intended design
- The user wants to enforce structural rules (layer dependencies, package access, no circular dependencies)
- The user wants CI/CD pipeline gates that prevent architecture degradation
- The user is concerned about architecture erosion -- decisions made months ago are no longer reflected in the code
- The user mentions chaos engineering, ArchUnit, NetArchTest, or architectural compliance testing
Before starting, verify:
- Are the target architecture characteristics already identified? (If not, invoke the `architecture-characteristics-identifier` skill first)
- Does the user have a CI/CD pipeline where fitness functions can be integrated?
- What technology stack is in use? (This determines which fitness function tools are available)
## Context & Input Gathering
### Required Context (must have before proceeding)
- **Architecture characteristics to govern:** Which quality attributes need automated enforcement?
-> Check prompt for: scalability, performance, deployability, testability, security, modularity, maintainability
-> Check environment for: ADRs, architecture docs, quality attribute definitions
-> If still missing, ask: "Which architecture characteristics are most important for your system? Pick your top 3."
- **Technology stack:** What language/framework is the codebase built with?
-> Check prompt for: Java, Kotlin, C#, .NET, Python, Go, JavaScript/TypeScript, Spring, Django
-> Check environment for: pom.xml, build.gradle, package.json, go.mod, requirements.txt, *.csproj
-> If still missing, ask: "What is your primary technology stack? This determines which fitness function tools are available."
### Observable Context (gather from environment if available)
- **Existing CI/CD pipeline:** Is there a pipeline to integrate fitness functions into?
-> Look for: Jenkinsfile, .github/workflows/, .gitlab-ci.yml, Dockerfile, docker-compose.yml
-> If unavailable: design fitness functions as standalone test suites
- **Existing architecture tests:** Are there any structural tests already in place?
-> Look for: ArchUnit tests, NetArchTest files, custom architecture validation scripts
-> If unavailable: start from scratch
- **Codebase structure:** How is the code organized?
-> Look for: package structure, layer boundaries, module organization
-> If unavailable: rely on user description
### Default Assumptions
- If no CI/CD pipeline exists -> design fitness functions as test suites that can be run locally and later integrated
- If architecture characteristics are not formally identified -> derive them from the user's concern description
- If no specific thresholds are mentioned -> use common defaults (e.g., CC<10 per function, p95 < 200ms, p99 < 500ms)
- If technology is unknown -> provide language-agnostic fitness function designs with tool recommendations
### Sufficiency Threshold
```
SUFFICIENT when ALL of these are true:
- At least one architecture characteristic is identified for governance
- Technology stack is known or estimable
- The user's governance concern is clear (what they want to prevent)
PROCEED WITH DEFAULTS when:
- Characteristics are identified
- Technology stack is partially known
- Specific thresholds can use industry defaults
MUST ASK when:
- No architecture characteristics are identified AND cannot be inferred
- The user's concern is too vague to design specific fitness functions
```
## Process
### Step 1: Inventory Architecture Characteristics to Govern
**ACTION:** List all architecture characteristics that need automated governance. For each, identify:
- Category: operational (runtime behavior), structural (code organization), or process (development workflow)
- Current state: is this characteristic being measured at all today?
- Risk level: what is the consequence if this characteristic degrades undetected?
**WHY:** Fitness functions are only valuable when they protect characteristics that matter. Trying to govern everything creates noise and slows the pipeline. By categorizing characteristics, you determine which types of fitness functions to create. Operational characteristics need runtime monitoring. Structural characteristics need build-time analysis. Process characteristics need CI/CD pipeline metrics. Prioritize by risk -- a silently degrading scalability characteristic is more dangerous than a slightly suboptimal code style metric.
**IF** characteristics are already identified (from `architecture-characteristics-identifier`) -> proceed with that list
**ELSE** -> extract characteristics from the user's concern description and architecture documentation
### Step 2: Define Measurable Thresholds for Each Characteristic
**ACTION:** For each architecture characteristic, define what "good" looks like with concrete, measurable thresholds:
**Operational characteristics:**
- Response time: use percentiles, not averages. p95 < 200ms, p99 < 500ms (averages hide tail latency)
- Throughput: requests per second under expected load
- Availability: uptime percentage (99.9% ≈ 8.76 hours downtime/year)
- Scalability: response time degradation under 2x, 5x, 10x load
**Structural characteristics:**
- Cyclomatic complexity: CC<10 per function (simple/low risk), 10-20 (moderate), >20 (problematic), >50 (untestable)
- Layer violations: zero tolerance for bypassing layers (e.g., UI directly calling database)
- Package dependency rules: no circular dependencies, enforced dependency direction
- Component coupling: maximum efferent coupling per module
**Process characteristics:**
- Test coverage: minimum percentage per module (e.g., >80% for critical paths)
- Deployment frequency: target deployments per week/day
- Change lead time: commit to production time
- Mean time to recover (MTTR): maximum acceptable recovery time
**WHY:** Without concrete thresholds, fitness functions become subjective opinions rather than objective tests. A fitness function that says "performance should be good" is useless. A fitness function that says "p95 response time for /api/orders must be under 200ms" is a pass/fail gate. The threshold is the line between "architecture is intact" and "architecture is eroding." For response times specifically, averages are misleading -- an average of 50ms can hide a p99 of 5000ms, meaning 1% of requests give users a terrible experience. Always use percentiles.
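The gap between an average and a tail percentile can be shown with a small sketch. It assumes latency samples are already collected as a list of milliseconds; `percentile`, `check_latency`, and the sample data are hypothetical names and values chosen for illustration, and the nearest-rank method is one of several valid percentile definitions.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def check_latency(samples_ms, thresholds_ms):
    """Return {percentile: (observed_ms, passed)} for each configured threshold."""
    results = {}
    for pct, limit in thresholds_ms.items():
        observed = percentile(samples_ms, pct)
        results[pct] = (observed, observed < limit)
    return results


# Hypothetical sample: 98 fast requests and 2 very slow ones.
latencies_ms = [50] * 98 + [5000] * 2
average_ms = mean(latencies_ms)
report = check_latency(latencies_ms, {95: 200, 99: 500})
```

With this sample the average stays under a 200ms budget and the p95 check passes, while the p99 check fails, which is the situation the percentile thresholds are meant to expose.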
### Step 3: Classify Each Fitness Function
**ACTION:** For each fitness function, classify along five dimensions:
1. **Scope: Atomic vs Holistic**
- Atomic: tests a single characteristic in isolation (e.g., "no class exceeds CC of 20")
- Holistic: tests the interplay of multiple characteristics (e.g., "security + performance: encryption must not push p95 above 300ms")
2. **Cadence: Triggered vs Continuous**
- Triggered: runs on specific events (commit, PR, deployment)
- Continuous: runs constantly in production (monitoring, alerting)
3. **Nature: Static vs Dynamic**
- Static: analyzes code/configuration without running it (linting, dependency analysis)
- Dynamic: requires running the system (load tests, chaos tests, integration tests)
4. **Automation: Automated vs Manual**
- Automated: runs without human intervention (preferred)
- Manual: requires human judgment (code review checklists, architecture review boards)
5. **Temporality: Fixed vs Evolving**
- Fixed: threshold stays constant (zero layer violations)
- Evolving: threshold tightens over time (CC limit drops from 30 to 20 to 10 as codebase matures)
**WHY:** Classification determines where and how each fitness function is implemented. An atomic/triggered/static/automated fitness function is a unit test in CI. A holistic/continuous/dynamic/automated fitness function is a production monitoring alert. A holistic/triggered/dynamic/manual fitness function is a pre-release load test with human review. Without classification, teams implement all fitness functions in the same way, which either misses runtime issues (all static) or slows the pipeline (all dynamic).
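One way to make the classification actionable is to record it as data and derive a default placement from it. The sketch below is an assumption-laden illustration: the `FitnessFunction` record, the `suggested_home` function, and the four placement labels are hypothetical, and the mapping only mirrors the three combinations described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FitnessFunction:
    name: str
    scope: str        # "atomic" or "holistic"
    cadence: str      # "triggered" or "continuous"
    nature: str       # "static" or "dynamic"
    automation: str   # "automated" or "manual"
    temporality: str  # "fixed" or "evolving"


def suggested_home(ff: FitnessFunction) -> str:
    """Map a classification to the place where the check would typically run."""
    if ff.automation == "manual":
        return "human review gate"
    if ff.cadence == "continuous":
        return "production monitoring alert"
    if ff.nature == "static":
        return "CI build stage"
    return "staging test stage"


cc_limit = FitnessFunction("CC limit", "atomic", "triggered", "static", "automated", "evolving")
p95_alert = FitnessFunction("p95 alert", "holistic", "continuous", "dynamic", "automated", "fixed")
```

Here `suggested_home(cc_limit)` lands in the CI build stage and `suggested_home(p95_alert)` lands in production monitoring; a real team would likely refine the rules to match its own pipeline.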
### Step 4: Design Implementation for Each Fitness Function
**ACTION:** For each classified fitness function, specify the concrete implementation:
**For structural fitness functions (static/triggered):**
- Java/Kotlin: ArchUnit tests in the test suite
```java
@ArchTest
static final ArchRule no_layer_violations =
    noClasses().that().resideInAPackage("..service..")
        .should().dependOnClassesThat().resideInAPackage("..controller..");
```
- C#/.NET: NetArchTest
- Python: custom pytest fixtures using AST analysis or import linting
- Any language: custom scripts analyzing dependency graphs
**For operational fitness functions (dynamic/continuous):**
- Response time monitoring with percentile alerting (Prometheus, Datadog, New Relic)
- Load test suites (k6, Gatling, Locust) with pass/fail thresholds
- Chaos engineering: randomly terminate instances to verify resilience (inspired by Netflix Simian Army)
- Health check endpoints with degradation detection
**For process fitness functions (triggered):**
- CI pipeline gates: test coverage checks, deployment frequency tracking
- Git hooks: commit message format, branch naming conventions
- Build-time metrics: build duration, artifact size budgets
**WHY:** A fitness function that exists only as documentation is not a fitness function -- it is a wish. Implementation specifics ensure each function actually runs, produces a pass/fail result, and blocks or alerts when the architecture is violated. The tool choice matters because some fitness functions only work with specific ecosystems. ArchUnit is powerful for JVM projects but useless for Python. Chaos engineering requires production-like environments. Design the implementation around the team's actual capabilities and tooling.
**IF** codebase is available -> **AGENT: EXECUTES** -- generate fitness function test files, CI config, monitoring config
**ELSE** -> produce implementation specifications with code templates
### Step 5: Design the Integration Strategy
**ACTION:** Determine where each fitness function runs in the development lifecycle:
```
Developer Workstation     CI Pipeline             Staging               Production
├── Pre-commit hooks      ├── Build stage         ├── Load tests        ├── Continuous monitoring
│   └── Linting           │   └── ArchUnit        │   └── p95 gates     │   └── p95/p99 alerts
│   └── CC check (fast)   │   └── CC analysis     ├── Chaos tests       ├── Chaos engineering
├── Pre-push hooks        ├── Test stage          │   └── Resilience    │   └── Simian Army
│   └── Dep. analysis     │   └── Coverage gate   ├── Security scans    ├── Architecture drift
                          ├── Quality gate        │   └── OWASP/SAST    │   └── Daily reports
                          │   └── Pass/fail                             ├── SLA monitoring
                          ├── Deploy gate                               │   └── Uptime alerts
                          │   └── Approval
```
**WHY:** Fitness functions placed too early slow developers down (running load tests on every commit). Fitness functions placed too late catch problems when they are expensive to fix (finding layer violations in production). The integration strategy matches each fitness function to the earliest point where it can run without unacceptable delay. Static/atomic functions run on every commit. Dynamic/holistic functions run in staging or production. This mirrors the testing pyramid: fast/cheap tests run frequently, slow/expensive tests run at key gates.
**HANDOFF TO HUMAN** for production chaos engineering setup -- injecting failures in production requires organizational buy-in, blast radius controls, and runbook preparation that go beyond what an agent can configure.
### Step 6: Create the Fitness Function Governance Report
**ACTION:** Produce the complete fitness function design document combining all classifications, implementations, and integration points.
**WHY:** The governance report serves as the architecture team's contract with the development team. It documents what is being governed, why, and how -- so developers understand that a failing fitness function is not a "broken test" but an architecture violation that needs architectural resolution, not a test skip. Without this document, fitness functions are treated as optional tests that can be ignored under deadline pressure.
## Inputs
- Architecture characteristics to govern (from the `architecture-characteristics-identifier` skill or user description)
- Technology stack and CI/CD pipeline configuration
- Existing architecture decisions or ADRs
- Optionally: current codebase for structural analysis, production monitoring setup
## Outputs
### Fitness Function Governance Report
```markdown
# Fitness Function Governance Report: {System Name}
## Governance Scope
- **Date:** {date}
- **Architecture characteristics governed:** {list}
- **Technology stack:** {stack}
- **CI/CD pipeline:** {tool}
## Fitness Function Inventory
| ID | Characteristic | Fitness Function | Threshold | Scope | Cadence | Nature | Automation | Temporality |
|----|---------------|-----------------|-----------|-------|---------|--------|------------|-------------|
| FF-01 | {characteristic} | {description} | {threshold} | {atomic/holistic} | {triggered/continuous} | {static/dynamic} | {auto/manual} | {fixed/evolving} |
## Implementation Details
### FF-01: {Fitness Function Name}
- **Protects:** {characteristic}
- **Threshold:** {measurable pass/fail criteria}
- **Classification:** {scope} / {cadence} / {nature} / {automation} / {temporality}
- **Implementation:** {tool and code/config}
- **Integration point:** {where it runs in the lifecycle}
- **Failure action:** {block pipeline / alert / report}
- **Evolving threshold:** {how the threshold changes over time, if applicable}
## Integration Map
{Lifecycle diagram showing where each FF runs}
## Temporal Evolution Plan
| Phase | Timeline | FF Changes |
|-------|----------|------------|
| Baseline | Now | {initial thresholds — permissive to establish baseline} |
| Tighten | +3 months | {reduce CC limit, increase coverage requirement} |
| Mature | +6 months | {add holistic FFs, chaos engineering} |
## Architecture Erosion Risk Assessment
| Risk | Without Fitness Functions | With Fitness Functions |
|------|------------------------|---------------------|
| {risk description} | {undetected until...} | {caught at... by FF-xx} |
```
## Key Principles
- **Fitness functions must be objective and automated** -- A fitness function that requires subjective human judgment is a code review, not a fitness function. The defining characteristic is objectivity: a machine can evaluate the result as pass or fail without interpretation. Manual fitness functions are acceptable only as a temporary measure while automation is being built, and they must have a migration plan to automation.
- **Measure percentiles, not averages, for operational characteristics** -- An average response time of 100ms can hide a p99 of 5 seconds. Averages are statistically misleading for latency distributions, which are typically long-tailed. Always define operational thresholds using p95 or p99 percentiles. This is the single most common measurement mistake in architecture governance.
- **Fitness functions are tests, not monitoring** -- Monitoring tells you what happened. Fitness functions tell you whether it was acceptable. A fitness function wraps a measurement in a pass/fail threshold. Response time monitoring without a threshold is observability. Response time monitoring that alerts when p95 exceeds 200ms is a fitness function. The threshold transforms data into governance.
- **Start permissive, tighten over time (temporal fitness functions)** -- A codebase with functions averaging CC of 35 cannot jump to a CC<10 threshold overnight. Set initial thresholds just below current worst-case, then ratchet them down quarterly. This prevents fitness functions from being disabled under pressure ("we can't ship if this test blocks us") while still driving improvement. The goal is a trend line, not immediate perfection.
- **Holistic fitness functions catch what atomic ones miss** -- Individual characteristics may pass their thresholds while the system as a whole degrades. Security encryption may pass its test, and response time may pass its test, but the combination degrades user experience. Holistic fitness functions test the interaction between characteristics. They are harder to build but catch the most dangerous architectural problems -- the ones that emerge from trade-off conflicts.
- **Architecture erosion is silent without fitness functions** -- Code naturally drifts from architectural intent. Developers under deadline pressure take shortcuts. Layer boundaries get bypassed. Dependency directions reverse. Without automated detection, this erosion accumulates until the architecture exists only in documentation, not in code. Fitness functions are the immune system that detects violations before they metastasize.
## Examples
**Scenario: Java Spring Boot microservices governance**
Trigger: "We identified scalability, deployability, and testability as our top architecture characteristics. How do we create automated checks to ensure our codebase doesn't drift from these goals? We use Java with Spring Boot and have a Jenkins CI pipeline."
Process: Inventoried three characteristics across operational, structural, and process categories. Defined thresholds: scalability (p95 <200ms under 2x load), deployability (deploy time <15min, zero-downtime deploys), testability (>80% coverage on service layer, CC<10 per method). Classified each: scalability = atomic/triggered/dynamic (load test in staging), deployability = atomic/triggered/static (build time check) + holistic/continuous/dynamic (deploy monitoring), testability = atomic/triggered/static (ArchUnit + JaCoCo in CI). Designed ArchUnit tests for layer dependency enforcement. Configured Jenkins pipeline gates: build -> ArchUnit -> coverage -> deploy-to-staging -> k6 load test -> promote.
Output: 8 fitness functions with Jenkins pipeline integration, ArchUnit test file, k6 load test script, and temporal evolution plan (tighten CC from 20 to 10 over 6 months).
**Scenario: Cross-database dependency enforcement**
Trigger: "Our architecture decision says 'no service should directly depend on another service's database.' How do we enforce this automatically? We have 8 microservices in a Kotlin/Spring project."
Process: Identified this as a structural fitness function protecting data isolation (a key microservices characteristic). Designed an ArchUnit test that verifies each service's repository classes only reference their own database schema. Added a configuration-level fitness function: database connection strings in each service's config must only point to that service's database. Classified both as atomic/triggered/static/automated. Created a holistic companion: an integration test that detects cross-service database queries by analyzing SQL query logs. Integrated the checks into the CI pipeline as blocking gates.
Output: ArchUnit test class enforcing package-to-schema mapping, config validation script, integration test for cross-database query detection, and CI pipeline configuration.
**Scenario: Architecture erosion prevention program**
Trigger: "Our CTO is concerned about architecture erosion. We made decisions 6 months ago but nobody checks if the code still follows them. How do we set up governance that doesn't rely on manual code reviews?"
Process: Audited existing ADRs to identify 5 key architecture decisions. Mapped each decision to a testable fitness function: (1) layered architecture compliance -> ArchUnit layer rules, (2) no circular package dependencies -> JDepend analysis in CI, (3) API response time SLAs -> p95 monitoring with alerting, (4) maximum component coupling -> efferent coupling threshold in static analysis, (5) security: no plaintext secrets -> secret scanning in pre-commit hooks. Classified all as automated. Designed temporal evolution: start with reporting-only mode (2 weeks to establish baseline), then warning mode (2 weeks for team awareness), then blocking mode (permanent). Created architecture erosion dashboard showing fitness function pass rates over time.
Output: 5 fitness functions with phased rollout plan, ArchUnit configuration, CI pipeline gates, monitoring dashboard spec, and team communication template explaining the new governance approach.
## References
- For the complete fitness function classification taxonomy with examples per category, see [references/fitness-function-catalog.md](references/fitness-function-catalog.md)
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Fundamentals of Software Architecture by Mark Richards, Neal Ford.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-architecture-characteristics-identifier`
- `clawhub install bookforge-modularity-health-evaluator`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/fitness-function-catalog.md
# Fitness Function Catalog
Reference catalog of fitness function types, implementations, and examples organized by architecture characteristic category. Read this file when you need specific implementation details for a fitness function type.
## Classification Dimensions
Every fitness function is classified along five dimensions. This table provides the full taxonomy:
| Dimension | Option A | Option B | Decision Factor |
|-----------|----------|----------|-----------------|
| **Scope** | Atomic (single characteristic) | Holistic (multiple characteristics) | Does the check involve trade-offs between characteristics? |
| **Cadence** | Triggered (on event) | Continuous (always running) | Can you afford to wait for an event, or must violations be caught immediately? |
| **Nature** | Static (no execution) | Dynamic (requires running system) | Does the check need runtime behavior or just code analysis? |
| **Automation** | Automated (machine evaluates) | Manual (human evaluates) | Can the pass/fail criteria be expressed as a machine-readable rule? |
| **Temporality** | Fixed (constant threshold) | Evolving (threshold changes over time) | Is the target state known, or is the team incrementally improving? |
## Structural Fitness Functions
Structural fitness functions verify code organization and design rules at build time. They are typically the easiest to implement and often the most cost-effective governance mechanism, because they need no running system.
### Layer Dependency Enforcement
**Protects:** Modularity, maintainability
**Classification:** Atomic / Triggered / Static / Automated / Fixed
**Threshold:** Zero violations (any layer bypass = failure)
**Java/Kotlin (ArchUnit):**
```java
import com.tngtech.archunit.junit.ArchTest;
import com.tngtech.archunit.lang.ArchRule;
import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses;

// Services sit below controllers, so they must not depend on them.
@ArchTest
static final ArchRule services_should_not_access_controllers =
    noClasses().that().resideInAPackage("..service..")
        .should().dependOnClassesThat().resideInAPackage("..controller..");

// Controllers must go through the service layer instead of calling repositories directly.
@ArchTest
static final ArchRule controllers_should_not_access_repositories =
    noClasses().that().resideInAPackage("..controller..")
        .should().dependOnClassesThat().resideInAPackage("..repository..");
```
**C# (.NET with NetArchTest):**
```csharp
var result = Types.InAssembly(typeof(ServiceClass).Assembly)
.That().ResideInNamespace("Services")
.ShouldNot().HaveDependencyOn("Controllers")
.GetResult();
Assert.True(result.IsSuccessful);
```
**Python (custom with importlib or AST):**
```python
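import ast
import os

def check_no_controller_imports_in_services(service_dir):
    """Fail if any module under service_dir imports from a controller package."""
    violations = []
    for root, _, files in os.walk(service_dir):
        for name in files:
            if not name.endswith(".py"):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8") as handle:
                tree = ast.parse(handle.read(), filename=path)
            for node in ast.walk(tree):
                # Cover both "import x.controller" and "from x import controller".
                if isinstance(node, ast.Import):
                    imported = [alias.name for alias in node.names]
                elif isinstance(node, ast.ImportFrom):
                    imported = [f"{node.module or ''}.{alias.name}" for alias in node.names]
                else:
                    continue
                for module in imported:
                    if "controller" in module:
                        violations.append(f"{path}: imports {module}")
    assert not violations, f"Layer violations: {violations}"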
```
### Cyclomatic Complexity Thresholds
**Protects:** Testability, maintainability
**Classification:** Atomic / Triggered / Static / Automated / Evolving
**Thresholds:**
| CC Score | Risk Level | Action |
|:--------:|-----------|--------|
| 1-10 | Low risk | No action needed |
| 11-20 | Moderate risk | Flag for review, refactor when touched |
| 21-50 | High risk | Must refactor before next release |
| 51+ | Untestable | Block pipeline, immediate refactoring required |
**Implementation:** Integrate into CI with language-specific tools:
- Java: PMD, SonarQube, Checkstyle
- Python: radon, flake8 with mccabe plugin
- JavaScript/TypeScript: ESLint complexity rule
- C#: NDepend, SonarQube
- Go: gocyclo
**Evolving threshold example:**
```
Month 1-3: Block on CC > 50 (catch only untestable code)
Month 4-6: Block on CC > 30 (start driving improvement)
Month 7-12: Block on CC > 15 (approach target)
Month 13+: Block on CC > 10 (target state)
```
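The schedule above can be encoded so the CI gate picks the right limit automatically. This is a minimal sketch, assuming months are counted from the start of the governance program and treating month 12 as part of the 7-12 phase; `CC_SCHEDULE`, `cc_limit_for_month`, and `cc_violations` are hypothetical names.

```python
# (first month of phase, maximum allowed CC), ordered by start month.
CC_SCHEDULE = [(1, 50), (4, 30), (7, 15), (13, 10)]


def cc_limit_for_month(month: int) -> int:
    """Return the CC limit in force during the given program month (1-based)."""
    if month < 1:
        raise ValueError("month must be 1 or greater")
    limit = CC_SCHEDULE[0][1]
    for start, max_cc in CC_SCHEDULE:
        if month >= start:
            limit = max_cc
    return limit


def cc_violations(cc_by_function: dict, month: int) -> dict:
    """Return the functions whose CC exceeds the limit for this month."""
    limit = cc_limit_for_month(month)
    return {name: cc for name, cc in cc_by_function.items() if cc > limit}
```

Keeping the schedule in one data structure makes each quarterly tightening a one-line, reviewable change.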
### No Circular Dependencies
**Protects:** Modularity, deployability
**Classification:** Atomic / Triggered / Static / Automated / Fixed
**Threshold:** Zero circular dependencies between packages/modules
**Java (ArchUnit):**
```java
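import com.tngtech.archunit.junit.ArchTest;
import com.tngtech.archunit.lang.ArchRule;
import static com.tngtech.archunit.library.dependencies.SlicesRuleDefinition.slices;
@ArchTest
static final ArchRule no_cycles =
    slices().matching("com.myapp.(*)..").should().beFreeOfCycles();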
```
**Any language (custom):** Build a dependency graph from imports and run a topological sort. If the sort fails, circular dependencies exist.
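The topological-sort approach can be sketched with Python's standard `graphlib` module (Python 3.9+). It assumes the dependency graph has already been extracted into a dictionary mapping each module to the set of modules it imports; `find_cycle` and the module names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def find_cycle(dependencies):
    """Return the nodes of one dependency cycle, or None if the graph is acyclic.

    dependencies maps each module to the set of modules it imports.
    """
    try:
        # Consuming static_order() forces the sort; it raises CycleError on a cycle.
        tuple(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # The second argument holds the cycle; first and last node are the same.
        return err.args[1]
    return None


acyclic = {"orders": {"billing"}, "billing": {"shared"}, "shared": set()}
cyclic = {"orders": {"billing"}, "billing": {"orders"}}
```

In a CI gate, a non-`None` result from `find_cycle` would fail the build and print the returned node list so the offending modules are visible.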
### Package Access Rules
**Protects:** Encapsulation, modularity
**Classification:** Atomic / Triggered / Static / Automated / Fixed
**Java (ArchUnit) - microservice database isolation:**
```java
@ArchTest
static final ArchRule order_service_uses_only_order_db =
classes().that().resideInAPackage("..order.repository..")
.should().onlyAccessClassesThat()
.resideInAnyPackage("..order..", "java..", "javax..", "org.springframework..");
```
### Component Size Budgets
**Protects:** Maintainability, modularity
**Classification:** Atomic / Triggered / Static / Automated / Evolving
**Thresholds:**
- Maximum lines per file: 500 (recommend 300)
- Maximum methods per class: 20
- Maximum parameters per method: 5
- Maximum depth of inheritance: 4
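Two of these budgets can be checked with a short AST walk. The following is a sketch for Python sources only, assuming the method and parameter limits above; `size_budget_violations` is a hypothetical helper, and it counts `self` as a parameter, which a team may prefer to exclude.

```python
import ast

MAX_METHODS_PER_CLASS = 20
MAX_PARAMS_PER_FUNCTION = 5


def size_budget_violations(source, max_methods=MAX_METHODS_PER_CLASS,
                           max_params=MAX_PARAMS_PER_FUNCTION):
    """Return human-readable violations of the method and parameter budgets."""
    function_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            # Only direct members count as methods of this class.
            methods = [n for n in node.body if isinstance(n, function_types)]
            if len(methods) > max_methods:
                violations.append(f"class {node.name}: {len(methods)} methods")
        elif isinstance(node, function_types):
            args = node.args
            count = len(args.posonlyargs) + len(args.args) + len(args.kwonlyargs)
            if count > max_params:
                violations.append(f"function {node.name}: {count} parameters")
    return violations
```

Calling it on each file's text and asserting that the combined list is empty turns the budgets into a pass/fail gate.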
## Operational Fitness Functions
Operational fitness functions verify runtime behavior. They require a running system and are typically more expensive to implement.
### Response Time Percentiles
**Protects:** Performance, user experience
**Classification:** Atomic / Continuous / Dynamic / Automated / Fixed
**Thresholds (typical web application):**
| Percentile | Threshold | Why This Percentile |
|:----------:|:---------:|---------------------|
| p50 | < 100ms | Median user experience |
| p95 | < 200ms | 95% of users have acceptable experience |
| p99 | < 500ms | Even tail users are not abandoned |
| max | < 2000ms | No request should feel broken |
**Why percentiles, not averages:** A service with an average response time of 80ms could have a p99 of 5000ms. This means 1% of requests (potentially thousands per hour) take 5 seconds. Averages are misleading for latency distributions, which are typically long-tailed. Always use p95 or p99.
**Implementation:**
- Prometheus + Grafana with histogram metrics and alert rules
- Datadog APM with SLO monitors
- Custom: log response times, compute percentiles, alert on threshold breach
### Scalability Under Load
**Protects:** Scalability
**Classification:** Holistic / Triggered / Dynamic / Automated / Evolving
**Threshold:** Response time degradation < 20% under 2x load, < 50% under 5x load
**Implementation:**
- k6, Gatling, or Locust load test suites running in staging
- Compare p95 at baseline load vs 2x/5x load
- Fail the pipeline if degradation exceeds threshold
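The comparison step can be sketched as a small gate function. It assumes the load tool has already produced p95 values for the baseline and for each load multiplier; `DEGRADATION_LIMITS` and `scalability_gate` are hypothetical names, and the limits restate the thresholds above.

```python
# Allowed fractional p95 degradation per load multiplier.
DEGRADATION_LIMITS = {2: 0.20, 5: 0.50}


def scalability_gate(baseline_p95_ms, p95_by_multiplier, limits=DEGRADATION_LIMITS):
    """Return {multiplier: degradation} for every load level that breaks its limit."""
    failures = {}
    for multiplier, limit in limits.items():
        observed = p95_by_multiplier[multiplier]
        degradation = (observed - baseline_p95_ms) / baseline_p95_ms
        # The threshold is "less than", so reaching the limit already fails.
        if degradation >= limit:
            failures[multiplier] = round(degradation, 3)
    return failures
```

An empty result means the pipeline can proceed; a non-empty one identifies which load level regressed and by how much.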
### Chaos Engineering (Netflix Simian Army Model)
**Protects:** Resilience, fault tolerance, recoverability
**Classification:** Holistic / Continuous / Dynamic / Automated / Fixed
**Threshold:** System maintains availability SLA during injected failures
The Netflix Simian Army pioneered the concept of using controlled failure injection as a fitness function:
| Tool | What It Tests | Blast Radius / Focus |
|------|--------------|--------------|
| Chaos Monkey | Random instance termination | Single instance |
| Latency Monkey | Artificial network delays | Single service |
| Conformity Monkey | Best practice compliance | Configuration |
| Security Monkey | Security configuration | Vulnerability |
| Janitor Monkey | Unused resource cleanup | Cost optimization |
**Key principle:** If your architecture claims to be resilient, prove it by breaking things. A resilience characteristic without chaos testing is an untested assumption.
**Implementation levels:**
1. **Starter:** Kill random test environment instances, verify auto-recovery
2. **Intermediate:** Inject latency between services in staging, verify graceful degradation
3. **Advanced:** Run Chaos Monkey in production during business hours with blast radius controls
### Availability Monitoring
**Protects:** Availability
**Classification:** Atomic / Continuous / Dynamic / Automated / Fixed
**Threshold:** Based on SLA tier:
| SLA | Annual Downtime | Monthly Downtime |
|:---:|:--------------:|:---------------:|
| 99% | 3.65 days | 7.3 hours |
| 99.9% | 8.76 hours | 43.8 minutes |
| 99.95% | 4.38 hours | 21.9 minutes |
| 99.99% | 52.56 minutes | 4.38 minutes |
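The downtime figures in the table follow from simple arithmetic, sketched below. It assumes a 365-day year and an average month of one twelfth of that; `allowed_downtime_hours` and `availability_gate` are hypothetical helpers.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(sla_percent, period_hours=HOURS_PER_YEAR):
    """Downtime budget in hours for the given SLA over the given period."""
    return period_hours * (100 - sla_percent) / 100


def availability_gate(observed_downtime_hours, sla_percent, period_hours=HOURS_PER_YEAR):
    """True when observed downtime is still within the SLA budget."""
    return observed_downtime_hours <= allowed_downtime_hours(sla_percent, period_hours)


annual_budget = allowed_downtime_hours(99.9)                        # about 8.76 hours
monthly_budget_minutes = allowed_downtime_hours(99.9, HOURS_PER_YEAR / 12) * 60  # about 43.8 minutes
```

A continuous check would feed the running downtime total from monitoring into `availability_gate` and alert as the budget is approached, not only once it is exhausted.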
## Process Fitness Functions
Process fitness functions verify development workflow health. They govern how the team builds software, not just what it builds.
### Test Coverage Gates
**Protects:** Testability
**Classification:** Atomic / Triggered / Static / Automated / Evolving
**Thresholds:**
- Overall coverage: > 70% (minimum viable)
- Service/business logic layer: > 85%
- Critical paths (payment, authentication): > 95%
- New code (diff coverage): > 80%
### Deployment Pipeline Metrics
**Protects:** Deployability
**Classification:** Holistic / Continuous / Dynamic / Automated / Evolving
**Metrics (DORA performance tiers; ranges differ between annual reports, so confirm against the edition you cite):**
| Metric | Elite | High | Medium | Low |
|--------|:-----:|:----:|:------:|:---:|
| Deploy frequency | Multiple/day | Weekly-monthly | Monthly-biannually | Biannually+ |
| Change lead time | < 1 hour | 1 day - 1 week | 1 week - 1 month | 1-6 months |
| Change failure rate | 0-15% | 16-30% | 16-30% | 46-60% |
| MTTR | < 1 hour | < 1 day | < 1 day | 1 week - 1 month |
### Architecture Decision Compliance
**Protects:** All characteristics (meta-governance)
**Classification:** Holistic / Triggered / Static / Manual (with automation support) / Fixed
For each ADR, create a corresponding fitness function:
- ADR says "use event-driven for inter-service communication" -> fitness function scans for synchronous cross-service REST calls
- ADR says "all data at rest must be encrypted" -> fitness function scans database configs for encryption settings
- ADR says "maximum 3 layers of service dependency" -> fitness function analyzes the service dependency graph depth
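The dependency-depth rule from the last bullet can be sketched as a longest-chain computation. It assumes the service call graph is available as a dictionary and interprets "layers" as the number of services on the longest call chain, which is an assumption about the ADR's intent; `max_dependency_depth` and the service names are hypothetical.

```python
def max_dependency_depth(calls):
    """Return the number of services on the longest call chain.

    calls maps each service to the set of services it calls directly.
    Raises ValueError if the graph contains a cycle.
    """
    memo = {}

    def depth(service, path):
        if service in path:
            raise ValueError(f"dependency cycle through {service}")
        if service not in memo:
            downstream = calls.get(service, ())
            memo[service] = 1 + max(
                (depth(d, path | {service}) for d in downstream), default=0
            )
        return memo[service]

    return max((depth(s, frozenset()) for s in calls), default=0)


MAX_LAYERS = 3
compliant = {"gateway": {"orders"}, "orders": {"inventory"}, "inventory": set()}
too_deep = {"gateway": {"orders"}, "orders": {"inventory"}, "inventory": {"warehouse"}}
```

The fitness function would then assert `max_dependency_depth(graph) <= MAX_LAYERS`; `compliant` passes with a depth of 3 and `too_deep` fails with a depth of 4.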
## Composite Fitness Functions
Some characteristics cannot be measured by a single metric. Composite fitness functions combine multiple measurements:
### Agility Composite
Agility = testability + deployability + modularity
```
agility_score = (
  test_coverage_ratio * 0.3 +               # coverage normalized to 0-1
  deploy_frequency_score * 0.3 +            # normalized to 0-1
  (1 - distance_from_main_sequence) * 0.4   # distance runs 0 (ideal) to 1, so invert it
)
threshold: agility_score > 0.7
```
### Process Measure Composite (from the book)
Process measures like testability and deployability are composite by nature:
```
process_health = (
  cyclomatic_complexity_compliance +  # fraction (0-1) of functions with CC < 10
  test_coverage +                     # overall test coverage as a fraction (0-1)
  deployment_success_rate +           # fraction (0-1) of deploys without rollback
  change_lead_time_score              # normalized 0-1
) / 4
threshold: process_health > 0.75
```
## Anti-Patterns in Fitness Function Design
| Anti-Pattern | Description | Correction |
|-------------|-------------|------------|
| **All-or-nothing** | Setting perfect thresholds day one, causing mass failures | Start permissive, tighten over time (temporal) |
| **Noise factory** | Too many fitness functions generating constant warnings | Prioritize by risk; only block on critical violations |
| **Cargo cult** | Copying fitness functions from another team without adapting | Each function must protect a specific characteristic YOU care about |
| **Test theater** | Fitness functions that always pass (thresholds too loose) | Thresholds must be tight enough to catch real violations |
| **Pipeline blocker** | Dynamic tests on every commit, slowing development | Match function to appropriate lifecycle stage |
## Fitness Function Evolution Roadmap Template
| Phase | Duration | Focus | Actions |
|-------|----------|-------|---------|
| **Baseline** | Weeks 1-2 | Measure current state | Deploy all fitness functions in reporting-only mode |
| **Awareness** | Weeks 3-4 | Team alignment | Share reports, discuss violations, agree on thresholds |
| **Warning** | Months 2-3 | Soft enforcement | Fitness functions warn but don't block |
| **Enforcement** | Month 4+ | Hard governance | Fitness functions block pipeline on violation |
| **Tightening** | Quarterly | Continuous improvement | Review and tighten thresholds each quarter |
Create effective architecture diagrams following established diagramming standards (UML, C4, ArchiMate) with proper visual elements and presentation techniqu...
---
name: architecture-diagram-creator
description: Create effective architecture diagrams following established diagramming standards (UML, C4, ArchiMate) with proper visual elements and presentation techniques. Use this skill whenever the user needs to create, review, or improve architecture diagrams, wants guidance on which diagramming standard to use, needs help with diagram elements (titles, lines, shapes, labels, color, keys), is preparing architecture presentations with slides, wants to use incremental builds for presenting complex diagrams, is struggling with inconsistent notation across diagrams, or needs to maintain representational consistency across different zoom levels of their architecture — even if they don't explicitly say "diagram."
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/fundamentals-of-software-architecture/skills/architecture-diagram-creator
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
source-books:
- id: fundamentals-of-software-architecture
title: "Fundamentals of Software Architecture"
authors: ["Mark Richards", "Neal Ford"]
chapters: [21]
tags: [software-architecture, architecture, diagrams, presentation, UML, C4, ArchiMate, communication, visual]
depends-on: []
execution:
tier: 1
mode: full
inputs:
- type: none
description: "Architecture context from the user — system components, relationships, communication patterns, and target audience"
tools-required: [Read, Write]
tools-optional: []
mcps-required: []
environment: "Any agent environment. No codebase required."
---
# Architecture Diagram Creator
## When to Use
You need to create or improve architecture diagrams that effectively communicate system design. Typical triggers:
- The user needs to create a new architecture diagram and wants guidance on standards and best practices
- The user has existing diagrams with inconsistent notation and wants to standardize them
- The user is preparing a presentation of their architecture and wants advice on visual communication
- The user is choosing between diagramming standards (UML, C4, ArchiMate)
- The user has diagrams at different zoom levels that don't connect visually
- The user is spending too much time perfecting diagrams early in the design process (Irrational Artifact Attachment)
Before starting, verify:
- What system or architecture needs to be diagrammed?
- Who is the audience (developers, executives, operations, mixed)?
## Context
### Required Context (must have before proceeding)
- **System to diagram:** What system, service, or architecture needs visual representation?
-> Check prompt for: system names, service descriptions, component lists, technology mentions
-> If still missing, ask: "What system or architecture do you need to diagram? Can you describe its main components and how they communicate?"
- **Target audience:** Who will view these diagrams?
-> Check prompt for: "developers," "CTO," "stakeholders," "team," presentation context
-> If still missing, ask: "Who is the primary audience for this diagram — developers, executives, operations, or a mix?"
### Observable Context (gather from environment)
- **Existing diagrams:** Are there existing diagrams that need improvement or consistency fixes?
-> Check prompt for: references to current diagrams, notation complaints, inconsistency mentions
-> If unavailable: assume creating from scratch
- **Diagramming tool:** What tool is being used?
-> Check prompt for: tool names (Mermaid, PlantUML, draw.io, Lucidchart, OmniGraffle, Visio)
-> If unavailable: provide tool-agnostic guidance
- **Architecture style:** What architecture pattern is the system using?
-> Check prompt for: microservices, monolith, event-driven, layered, service-based
-> If unavailable: infer from component descriptions
### Default Assumptions
- If audience unknown -> assume mixed technical audience (developers + architects)
- If tool unknown -> provide text-based diagram descriptions and general guidelines
- If zoom level unknown -> start with the Container level (C4 terminology) as the most commonly useful view
### Sufficiency Threshold
```
SUFFICIENT when ALL of these are true:
- System components and relationships are known or described
- Target audience is known or can be inferred
- Communication patterns (sync/async) are understood
PROCEED WITH DEFAULTS when:
- System is described at a high level
- Audience can be assumed as technical
- Standard architecture pattern is used
MUST ASK when:
- No system description is provided at all
- The request is ambiguous between creating vs reviewing diagrams
```
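The threshold above can be read as a three-way classification. The following is a minimal sketch of that reading, not part of the skill itself: the `DiagramRequest` fields and the `Readiness` names are hypothetical labels for the bullet points, and the choice to check the MUST ASK conditions first is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Readiness(Enum):
    SUFFICIENT = "sufficient"
    PROCEED_WITH_DEFAULTS = "proceed with defaults"
    MUST_ASK = "must ask"


@dataclass
class DiagramRequest:
    # Hypothetical field names; each one mirrors a bullet in the threshold.
    has_system_description: bool
    components_and_relationships_known: bool = False
    audience_known_or_inferable: bool = False
    communication_patterns_understood: bool = False
    ambiguous_create_vs_review: bool = False


def assess_readiness(req: DiagramRequest) -> Readiness:
    # Assumption: MUST ASK conditions are checked first because nothing
    # useful can be produced while they hold.
    if not req.has_system_description or req.ambiguous_create_vs_review:
        return Readiness.MUST_ASK
    if (
        req.components_and_relationships_known
        and req.audience_known_or_inferable
        and req.communication_patterns_understood
    ):
        return Readiness.SUFFICIENT
    # A description exists but is incomplete: fall back to the defaults.
    return Readiness.PROCEED_WITH_DEFAULTS
```

For example, a request that only says "diagram our order system" would land in `PROCEED_WITH_DEFAULTS`, which is where the Default Assumptions apply.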
## Process
### Step 1: Select the Appropriate Diagramming Standard
**ACTION:** Based on the system type, audience, and organizational context, recommend a diagramming standard.
**WHY:** Different standards serve different purposes. UML remains widely understood for class and sequence diagrams, but most of its other diagram types have fallen into disuse. C4 provides four natural zoom levels, ideal for monolithic and service-based architectures where container and component relationships matter. ArchiMate serves enterprise-level modeling across business domains. Choosing the wrong standard wastes time and confuses the audience.
| Standard | Best For | Limitations |
|----------|----------|-------------|
| **UML** | Class diagrams, sequence diagrams, workflow | Most diagram types are disused; overly formal for architecture overviews |
| **C4** | Systems with clear container and component boundaries; monolithic and service-based | Less suited for distributed architectures like microservices where container/component relationships differ |
| **ArchiMate** | Enterprise architecture spanning business domains | Heavier; overkill for single-system diagrams |
| **Custom notation** | When no standard fits perfectly | Requires a key; risk of misinterpretation without one |
**IF** the user's organization mandates a standard -> use that standard
**IF** the system is a monolith or service-based -> recommend C4 for its four zoom levels (Context, Container, Component, Class)
**IF** the system spans multiple business domains -> recommend ArchiMate
**IF** the need is class structure or workflow -> recommend UML (class/sequence diagrams only)
**ELSE** -> recommend custom notation with a clear key
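The rule chain above can be sketched as a small function. This is an illustrative sketch only: the function name, parameter names, and return strings are hypothetical, and the rules are evaluated in the order they are listed, which is an assumption about precedence.

```python
from typing import Optional


def recommend_standard(
    mandated_standard: Optional[str] = None,
    architecture_style: Optional[str] = None,
    spans_multiple_business_domains: bool = False,
    needs_class_or_workflow_view: bool = False,
) -> str:
    # An organizational mandate overrides every other consideration.
    if mandated_standard:
        return mandated_standard
    # C4's four zoom levels fit monoliths and service-based systems.
    if architecture_style in {"monolith", "service-based"}:
        return "C4"
    if spans_multiple_business_domains:
        return "ArchiMate"
    if needs_class_or_workflow_view:
        return "UML (class/sequence diagrams only)"
    # Nothing matched: fall back to custom notation, which requires a key.
    return "Custom notation with a key"
```

Under these assumptions a microservices system with no mandate falls through to custom notation, which matches the first scenario in the Examples section.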
### Step 2: Check for Irrational Artifact Attachment Risk
**ACTION:** Assess whether the user is at risk of the Irrational Artifact Attachment anti-pattern — spending disproportionate time creating beautiful diagrams before the design is stable.
**WHY:** There is a proportional relationship between how long it takes to produce an artifact and how irrationally attached a person becomes to it. A four-hour diagram creates more attachment than a two-hour one. This attachment prevents architects from revising designs when they should, because they don't want to "waste" the time invested. Early in design, use low-fidelity tools (whiteboards, tablets, sticky notes) so the team feels free to throw away and iterate.
**IF** the user is in early design phase -> recommend low-fidelity tools first (whiteboard, tablet, index cards)
**IF** the user has a stable, finalized architecture -> recommend investing in high-fidelity diagrams
**IF** the user mentions spending hours perfecting diagrams -> flag Irrational Artifact Attachment and recommend reducing tool investment until the design stabilizes
### Step 3: Apply Diagram Element Guidelines
**ACTION:** For each diagram, ensure all six core visual elements are properly used. For detailed standards and examples, see [references/diagram-standards.md](references/diagram-standards.md).
**WHY:** Each element serves a specific communication purpose. Missing or misused elements create ambiguity, and a diagram that leads to misinterpretation is worse than no diagram at all.
**Elements to verify:**
1. **Titles** — Every element must have a title or be well known to the audience. Use rotation and visual effects to make titles "stick" to their shapes.
2. **Lines** — Must be thick enough to see clearly. Solid lines = synchronous communication. Dotted lines = asynchronous communication. Use arrows for directional flow. Be consistent with arrowhead styles.
3. **Shapes** — Use 3D boxes for deployable artifacts, rectangles for containment. Build a stencil of standard shapes for organizational consistency.
4. **Labels** — Label every item, especially if there is any ambiguity. When in doubt, label.
5. **Color** — Use sparingly to distinguish artifacts from one another (e.g., different services in different colors). Favor monochrome with selective color over full-color chaos.
6. **Keys** — If shapes are ambiguous, include a key explaining what each shape represents. A misinterpreted diagram is worse than no diagram.
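Several of these checks are mechanical enough to automate. The sketch below is one possible lint pass over a diagram description; the `Element`, `Line`, and `Diagram` types are hypothetical, and the set of shapes treated as self-explanatory is an assumption. It covers titles, labels, the solid/dotted convention, and keys, and leaves color and line thickness to human review.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    id: str
    title: str = ""
    shape: str = "rectangle"


@dataclass
class Line:
    source: str
    target: str
    communication: str  # "sync" or "async"
    style: str          # "solid" or "dotted"
    label: str = ""


@dataclass
class Diagram:
    elements: list[Element]
    lines: list[Line]
    key: dict[str, str] = field(default_factory=dict)  # shape -> meaning


EXPECTED_STYLE = {"sync": "solid", "async": "dotted"}
# Assumption: these shapes need no key entry; adjust per organization.
WELL_KNOWN_SHAPES = {"rectangle", "cylinder"}


def lint_diagram(diagram: Diagram) -> list[str]:
    problems = []
    for element in diagram.elements:
        if not element.title.strip():
            problems.append(f"element {element.id}: missing title")
        if element.shape not in WELL_KNOWN_SHAPES and element.shape not in diagram.key:
            problems.append(f"element {element.id}: shape '{element.shape}' is not in the key")
    for line in diagram.lines:
        name = f"line {line.source}->{line.target}"
        if not line.label.strip():
            problems.append(f"{name}: missing label")
        expected = EXPECTED_STYLE.get(line.communication)
        if expected is None:
            problems.append(f"{name}: unknown communication type '{line.communication}'")
        elif line.style != expected:
            problems.append(f"{name}: {line.communication} should be drawn {expected}")
    return problems
```

An empty result does not mean the diagram is good, only that none of the mechanical rules were broken.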
### Step 4: Ensure Representational Consistency
**ACTION:** If producing multiple diagrams at different zoom levels, ensure each maintains visual context showing where it fits in the larger architecture.
**WHY:** When an architect shows a portion of the architecture without indicating where it fits in the overall system, viewers lose context and become confused. Representational consistency means always showing the relationship between parts and the whole, either in diagrams or presentations, before changing views. For example, when drilling from a system overview into a specific service, first show the overview with the target service highlighted, then zoom into it.
**IF** creating multiple views -> include a small context indicator showing which part of the larger system is being detailed
**IF** presenting to an audience -> use the overview-then-zoom pattern: show the full system, highlight the area of focus, then drill in
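When the views are tracked as data, the link between a zoomed-in view and its overview can be checked automatically. This is a rough sketch under assumed conventions: the `View` type and its `parent` and `focus` fields are hypothetical, and a view with no parent is treated as an overview.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class View:
    name: str
    elements: set[str]
    parent: Optional[str] = None  # the view this one zooms in from
    focus: Optional[str] = None   # the parent element being detailed


def check_consistency(views: list[View]) -> list[str]:
    by_name = {view.name: view for view in views}
    problems = []
    for view in views:
        if view.parent is None:
            continue  # overview: nothing to anchor to
        parent = by_name.get(view.parent)
        if parent is None:
            problems.append(f"{view.name}: parent view '{view.parent}' does not exist")
        elif view.focus not in parent.elements:
            problems.append(
                f"{view.name}: focus '{view.focus}' does not appear in '{parent.name}'"
            )
    return problems
```

A detail view that passes this check can always be introduced with the overview-then-zoom pattern, because the element to highlight is known to exist in the overview.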
### Step 5: Apply Presentation Techniques (if presenting)
**ACTION:** If the diagrams will be presented (not just shared as documents), apply presentation-specific techniques.
**WHY:** Presentations and documents are fundamentally different media. In a presentation, the presenter controls how quickly an idea unfolds (manipulating time). In a document, the reader controls the pace. Treating a presentation as a document (Bullet-Riddled Corpse anti-pattern) wastes the presenter's most powerful tool: controlling the narrative flow.
**Techniques:**
- **Incremental Builds:** Never show a complex diagram all at once. Build it piece by piece using animations. Cover parts of the diagram with borderless white boxes, then use "build out" animations to reveal sections as you narrate. This maintains suspense and keeps the audience engaged.
- **Manipulating Time:** Use subtle transitions and dissolves to stitch slides into a continuous story. Use distinctly different transitions (door, cube) to signal topic changes.
- **Infodecks vs Presentations:** If the slides will be emailed, not presented, they are an "infodeck" — include all information, skip animations. If they will be presented live, slides should be half the story (the other half is the speaker).
- **Slides Are Half of the Story:** Don't put everything on the slide. The presenter is the other information channel. Adding less text to slides gives more punch to spoken points.
- **Invisibility:** Insert blank black slides when you want to refocus attention on the speaker. Turning off the visual channel automatically amplifies the verbal channel.
### Step 6: Generate the Diagram Specification
**ACTION:** Produce a complete diagram specification including all components, relationships, communication types, labels, and visual guidelines.
**WHY:** A specification serves as both the blueprint for creating the diagram in any tool and as documentation of what the diagram should contain. It prevents the "I drew it from memory" syndrome where critical elements get omitted.
## Inputs
- System description (components, services, data stores, external systems)
- Communication patterns (synchronous REST, asynchronous messaging, etc.)
- Target audience and purpose
- Optionally: existing diagrams to review, organizational standards, preferred tools
## Outputs
### Architecture Diagram Specification
```markdown
# Architecture Diagram: {System Name}
## Diagram Metadata
- **Standard:** {UML / C4 / ArchiMate / Custom}
- **C4 Level:** {Context / Container / Component / Class} (if C4)
- **Audience:** {who will view this}
- **Purpose:** {what decision or understanding this supports}
## Components
| ID | Name | Type | Description |
|----|------|------|-------------|
| 1 | {name} | {service/database/queue/external} | {what it does} |
## Relationships
| From | To | Type | Protocol | Label |
|------|----|------|----------|-------|
| {source} | {target} | sync/async | {REST/gRPC/AMQP/etc.} | {what is communicated} |
## Visual Guidelines
- **Line styles:** Solid = synchronous, Dotted = asynchronous
- **Colors:** {color scheme with rationale}
- **Shapes:** {shape conventions}
- **Key:** {if custom shapes are used}
## Presentation Notes (if applicable)
- **Build order:** {sequence for incremental reveals}
- **Narration points:** {what to say at each build step}
```
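If the chosen tool is text-based, the Components and Relationships tables translate fairly directly into diagram source. The sketch below emits a Mermaid flowchart, assuming Mermaid's flowchart syntax in which `-->` draws a solid arrow and `-.->` a dotted one. The `Relationship` type and `to_mermaid` function are hypothetical names, and the left-to-right layout is an arbitrary choice.

```python
from dataclasses import dataclass


@dataclass
class Relationship:
    source: str
    target: str
    kind: str   # "sync" or "async"
    label: str  # what is communicated


# Solid = synchronous, dotted = asynchronous, per the Visual Guidelines.
ARROWS = {"sync": "-->", "async": "-.->"}


def to_mermaid(components: dict[str, str], relationships: list[Relationship]) -> str:
    """components maps a short ID to the display name shown in the box."""
    lines = ["flowchart LR"]
    for component_id, name in components.items():
        lines.append(f'    {component_id}["{name}"]')
    for rel in relationships:
        if rel.kind not in ARROWS:
            raise ValueError(f"unknown relationship kind: {rel.kind}")
        lines.append(f'    {rel.source} {ARROWS[rel.kind]}|"{rel.label}"| {rel.target}')
    return "\n".join(lines)
```

Generating the source from the specification keeps the line-style convention in one place, so a relationship cannot silently be drawn with the wrong style.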
## Key Principles
- **Representational consistency is non-negotiable** — WHY: Viewers who see a zoomed-in diagram without context for where it fits in the overall architecture will misunderstand the scope and boundaries. Always show the relationship between parts and the whole before changing zoom levels.
- **Solid lines = synchronous, dotted lines = asynchronous is a near-universal convention** — WHY: It is one of the few diagramming conventions shared across the software industry. Violating it confuses everyone who has learned it, and most architects have. If you must deviate, include a key.
- **Low-fidelity early, high-fidelity late** — WHY: The Irrational Artifact Attachment anti-pattern causes architects to defend designs they should revise, simply because they invested hours in the diagram. Using whiteboards and sticky notes early frees the team to iterate without sunk-cost bias.
- **A misinterpreted diagram is worse than no diagram** — WHY: Diagrams carry authority. If a diagram is ambiguous and someone interprets it wrong, they will build or operate the system based on that wrong interpretation with high confidence. When in doubt, add labels and keys.
- **Incremental builds make presentations compelling** — WHY: The human brain cannot resist reading text that appears on screen. Showing a complex diagram all at once forces the audience to read ahead of the presenter, splitting their attention. Building the diagram piece by piece keeps narrator and visual in sync.
- **The presenter is half the presentation** — WHY: A presentation has two information channels: visual (slides) and verbal (speaker). Overloading the visual channel by putting everything on slides starves the verbal channel and makes the presenter redundant. The best presentations have sparse slides and a compelling narrator.
## Examples
**Scenario: Creating a microservices architecture diagram**
Trigger: "I need to create an architecture diagram for our microservices system with 6 services, an API gateway, message queue, and 3 databases."
Process: Selected custom notation over C4 (C4 is less suited for distributed microservices where container/component relationships differ). Applied the six element guidelines: assigned each service a distinct color, used solid lines for synchronous REST calls and dotted lines for asynchronous message queue communication, labeled every relationship with the protocol and data exchanged. Included a key explaining shapes (3D boxes = deployable services, cylinders = databases, hexagon = API gateway). Recommended incremental build order for presentation: start with the API gateway, build out to the services one by one, then show the async communication layer.
Output: Complete diagram specification with component table, relationship matrix, visual guidelines, and presentation build order.
**Scenario: Standardizing inconsistent team diagrams**
Trigger: "Our team has different diagrams at different zoom levels using inconsistent notation."
Process: Recommended adopting C4 as the standard since the system has clear context, container, and component boundaries. Created a notation guide: specific shapes for each component type, consistent color palette, solid/dotted line convention. For each existing diagram, identified which C4 level it corresponds to and added representational consistency indicators (small overview diagram in the corner showing which part is detailed). Created a shared stencil template for the team's diagramming tool.
Output: Notation standard document, C4 level mapping for existing diagrams, shared stencil template, and diagram review checklist.
**Scenario: Preparing architecture presentation for executives**
Trigger: "I'm presenting our new event-driven architecture to the CTO next week. I have 15 slides full of bullet points."
Process: Flagged the Bullet-Riddled Corpse anti-pattern — slides full of text that the presenter reads aloud. Redesigned the presentation using incremental builds: replaced bullet point slides with a single architecture diagram revealed in 6 build steps, each narrated by the presenter. Added invisibility slides (blank black slides) before key decision points to refocus attention on the speaker. Converted detailed technical content to an infodeck appendix for email distribution after the meeting. Advised: "slides are half the story — you are the other half."
Output: Restructured 15-slide deck into 8 slides with incremental builds, 3 invisibility slides, and a 12-page infodeck appendix.
## References
- For detailed diagramming standards, element guidelines, C4 level descriptions, and notation conventions, see [references/diagram-standards.md](references/diagram-standards.md)
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Fundamentals of Software Architecture by Mark Richards, Neal Ford.
## Related BookForge Skills
This skill is standalone. Browse more BookForge skills: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/diagram-standards.md
# Diagram Standards Reference
Detailed guide to architecture diagramming standards, element guidelines, and presentation techniques.
## Diagramming Standards
### UML (Unified Modeling Language)
UML is a standard that unified three competing design philosophies that coexisted in the 1980s. Although it was designed by committee and has largely fallen out of widespread use, two diagram types remain valuable:
- **Class diagrams** — Still the most effective way to show object relationships and inheritance hierarchies
- **Sequence diagrams** — Still the best tool for illustrating workflow and interaction timing between components
Most other UML diagram types (use case, activity, state machine, deployment) have been superseded by lighter-weight alternatives.
**When to use UML:** When you need to show class structure or detailed service interaction sequences. When the organization mandates UML.
### C4 (Context, Container, Component, Class)
Developed by Simon Brown to address UML's deficiencies. C4 provides four zoom levels:
| Level | What it shows | Best audience | Example |
|-------|-------------|---------------|---------|
| **Context** | The entire system, including users and external dependencies | Everyone — executives, developers, ops | "Our system talks to these 3 external APIs and serves these 2 user types" |
| **Container** | Physical/logical deployment boundaries within the architecture | Operations, architects | "The web app, API server, and database are in separate containers" |
| **Component** | The internal structure of a container — its modules and their relationships | Architects, senior developers | "The API server has these 5 modules with these internal dependencies" |
| **Class** | Same as UML class diagrams | Developers | "This module has these classes with these relationships" |
**When to use C4:** For monolithic and service-based architectures where the container and component relationships create meaningful zoom levels. C4 is best suited for systems where the container (deployment unit) and component (module) distinction is clear.
**Limitations:** C4 is less suited for distributed architectures like microservices where:
- Each microservice IS a container AND a component simultaneously
- The interesting relationships are between services, not within them
- Container and component levels may be redundant
### ArchiMate
An open source enterprise architecture modeling language from The Open Group. ArchiMate is:
- A technical standard (not just a convention)
- Lighter-weight than UML ("as small as possible")
- Designed for description, analysis, and visualization across business domains
**When to use ArchiMate:** When modeling architecture that spans multiple business domains or when you need a standard that covers business, application, and technology layers in a single view.
## Diagram Element Guidelines
### Titles
- Every element must have a title or be universally known to the audience
- Use rotation and visual effects to make titles "stick" to their shapes
- Make efficient use of space — titles should not dominate the diagram
### Lines
- **Thickness:** Lines must be thick enough to see clearly, especially when projected
- **Direction:** Use arrows to indicate information flow direction. Use bidirectional arrows for two-way traffic
- **Arrowhead style:** Different arrowhead types suggest different semantics — be consistent within a diagram
- **The synchronous/asynchronous convention:**
- **Solid lines** = synchronous communication (request-response, blocking)
- **Dotted lines** = asynchronous communication (fire-and-forget, message queues, events)
- This is one of the few near-universal standards in architecture diagrams
### Shapes
- **3D boxes** — Deployable artifacts (services, applications, servers)
- **Rectangles** — Containment, logical grouping, layers
- **Cylinders** — Databases and data stores (universally recognized)
- **No pervasive standard** — Each architect tends to build their own shape vocabulary
- **Build a stencil** — Create a library of standard shapes for your organization. This creates consistency across all architecture diagrams and speeds up diagram creation
### Labels
- Label every item in the diagram, especially if there is any chance of ambiguity
- When in doubt, add a label — the cost of a redundant label is near zero; the cost of ambiguity is high
- Label relationships (lines) with what is being communicated, not just the protocol
### Color
- Architects have historically underused color, partly because books were long printed in black and white
- Use color to **distinguish** artifacts from one another (e.g., different services in unique colors)
- Favor monochrome base with selective color over full-color schemes
- Color should carry semantic meaning (e.g., red = high risk, green = healthy)
- Never rely on color alone for meaning — always pair with labels or shapes for accessibility
### Keys
- If any shape is ambiguous, include a key on the diagram
- A diagram that leads to misinterpretation is worse than no diagram
- Keys should be visible without scrolling or paging
- For presentations: show the key on the first slide, then remove it to free space (the audience will remember)
## Tool Features to Look For
1. **Layers** — Link groups of elements together logically. Enable/disable layers to show different levels of detail. Essential for building incremental presentations from a single comprehensive diagram.
2. **Stencils/Templates** — Build reusable shape libraries for organizational consistency. Standard component shapes (microservice, database, queue, load balancer) should look identical across all diagrams.
3. **Magnets** — Automatic connection points on shapes for clean line routing. Create custom magnets for consistent visual style.
4. **Export formats** — Support for multiple output formats (PNG, SVG, PDF) for different contexts (docs, slides, wikis).
## Presentation Techniques
### Incremental Builds
The most powerful presentation technique for architecture diagrams:
1. Start with the complete diagram in your diagramming tool
2. Cover portions with **borderless white boxes** that hide sections
3. Use "build out" animations to remove the covers one at a time
4. Each build step reveals one new part of the architecture
5. Narrate each reveal — explain what it is and why it's there
**Why this works:** When presenting, you have two information channels: visual (slides) and verbal (speaker). Showing everything at once overloads the visual channel while starving the verbal channel. The audience reads ahead and stops listening. Incremental builds keep both channels synchronized.
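The reveal sequence can be planned before touching the slide tool. The sketch below computes, for each build step, which elements are visible and which are still hidden behind a cover box. The function name and the dictionary layout are hypothetical, and the element names in the usage note are only illustrative.

```python
def build_frames(
    all_elements: list[str], build_order: list[list[str]]
) -> list[dict[str, list[str]]]:
    """Return one frame per build step with its visible and covered elements."""
    revealed: list[str] = []
    frames = []
    for step in build_order:
        for element in step:
            if element not in all_elements:
                raise ValueError(f"unknown element in build order: {element}")
            if element not in revealed:
                revealed.append(element)
        frames.append(
            {
                "visible": list(revealed),
                # Everything not yet revealed still needs a cover box.
                "covered": [e for e in all_elements if e not in revealed],
            }
        )
    return frames
```

Calling `build_frames(["gateway", "orders", "queue"], [["gateway"], ["orders"], ["queue"]])` yields three frames, and the last one should have an empty `covered` list. If it does not, part of the diagram is never shown.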
**Anti-pattern: Bullet-Riddled Corpse** — Slides that are essentially the speaker's notes projected for all to see. If the slides contain everything, just email them as an infodeck instead.
### Manipulating Time
- Use **subtle transitions** (dissolve, fade) to stitch related slides into a continuous story
- Use **distinctly different transitions** (door, cube) to signal topic changes
- Avoid "splashy" transitions (dropping anvils, swirling effects) — they distract from content
### Infodecks vs Presentations
| Aspect | Infodeck | Presentation |
|--------|----------|-------------|
| **Delivery** | Emailed, read individually | Projected, narrated live |
| **Content** | Comprehensive — all information is in the slides | Sparse — slides + speaker = complete story |
| **Animations** | None needed | Essential for pacing |
| **Amount of text** | Full text, standalone | Minimal — keywords and visuals |
**Rule:** If you build comprehensive slides with no animations, you have an infodeck. Email it. Don't stand in front of it and read it aloud.
### Invisibility Pattern
Insert a **blank black slide** when you want to:
- Refocus attention solely on the speaker
- Make a key point that requires eye contact, not screen-staring
- Transition between major topics
When the visual channel goes dark, the verbal channel automatically gets amplified. The speaker becomes the only interesting thing in the room.
### Slides Are Half of the Story
- If the slides are comprehensive, spare everyone and email them
- If you are presenting, the slides should be half the content — you are the other half
- Adding less text to slides makes spoken points land with more impact
- Presenters who put everything on slides make themselves redundant
## The Irrational Artifact Attachment Anti-Pattern
**Definition:** The proportional relationship between time invested in creating an artifact and irrational attachment to it. A four-hour diagram creates more attachment than a two-hour one.
**Why it matters:** Attached architects resist revising designs they should change because they don't want to "waste" the time invested in the diagram. This is classic sunk-cost fallacy applied to architecture.
**Prevention:**
- In early design phases, use low-fidelity tools: whiteboards, tablets, sticky notes, index cards
- Use tablets attached to overhead projectors instead of physical whiteboards (unlimited canvas, easy to save, and none of the glare that spoils cell phone photos of whiteboards)
- Only invest in high-fidelity diagramming tools once the design has stabilized through iteration
- Follow the Agile principle: create just-in-time artifacts with as little ceremony as possible
**Benefits of low-fidelity:**
- Team members throw away what's not working without guilt
- The true nature of the architecture emerges through revision, collaboration, and discussion
- Faster iteration cycles — sketching takes minutes, not hours
Create structured Architecture Decision Records (ADRs) with 7 sections to document architecture decisions with full justification. Use this skill whenever th...
---
name: architecture-decision-record-creator
description: Create structured Architecture Decision Records (ADRs) with 7 sections to document architecture decisions with full justification. Use this skill whenever the user has made or needs to make an architecture decision, wants to document why a technical choice was made, is choosing between technologies or patterns, needs to create an ADR, or is experiencing repeated debates about past decisions — even if they don't explicitly mention "ADR" or "architecture decision record."
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/fundamentals-of-software-architecture/skills/architecture-decision-record-creator
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
- architecture-tradeoff-analyzer
source-books:
- id: fundamentals-of-software-architecture
title: "Fundamentals of Software Architecture"
authors: ["Mark Richards", "Neal Ford"]
chapters: [19]
tags: [software-architecture, architecture, decisions, documentation, adr, governance]
execution:
tier: 1
mode: full
inputs:
- type: none
description: "An architecture decision that needs to be documented — the choice, the context, and the alternatives"
tools-required: [Write]
tools-optional: [Read, Grep, Glob]
mcps-required: []
environment: "Any agent environment. If a codebase exists, can check for existing ADRs and suggest numbering."
---
# Architecture Decision Record Creator
## When to Use
An architecture decision has been made (or needs to be made) and it should be documented. Typical situations:
- A technology or pattern choice has been decided — needs formal documentation
- A decision keeps getting revisited ("didn't we already decide this?") — the Groundhog Day anti-pattern
- A stakeholder asks "why did we choose X?" and nobody can answer — missing documentation
- Before implementing a significant technical change — document BEFORE building
- An existing decision needs to be superseded by a new one
Before starting, verify:
- Is there actually a DECISION to document? (If it's still an open question, use `architecture-tradeoff-analyzer` first to analyze trade-offs, then come back here to document the result)
- Is this decision architecturally significant? (Step 1 below helps determine this)
## Context
### Required Context (must have before proceeding)
- **The decision:** What was decided (or what needs to be decided). Ask the user if not stated.
- **The alternatives:** What options were considered. If only one option was considered, that's a red flag — push back and identify at least one alternative.
### Observable Context (gather from environment if available)
- **Existing ADRs:** Check for prior decisions in the project
→ Look for: `docs/adr/`, `docs/decisions/`, `architecture/`, `*.adr.md`, files matching `ADR-*.md` or `*-adr.md`
→ If found: determine the next sequential number, check for related/conflicting prior decisions
→ If none: this will be ADR 1, suggest establishing an ADR directory
- **Codebase context:** What technologies, patterns, and structures currently exist
→ Look for: package.json, pyproject.toml, docker-compose, CI configs
→ This informs the Context section of the ADR
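Determining the next sequential number can be sketched as follows. This is a hedged illustration, not a prescribed implementation: it assumes the ADR number is the first run of digits in each Markdown filename, which holds for names like `ADR-012-use-kafka.md` or `0007.adr.md` but would misread a file such as `notes-2024.md`. The function name is hypothetical.

```python
import re
from typing import Iterable

FIRST_NUMBER = re.compile(r"\d+")


def next_adr_number(filenames: Iterable[str]) -> int:
    """Return the next ADR number given the filenames in the ADR directory."""
    numbers = []
    for name in filenames:
        if not name.lower().endswith(".md"):
            continue  # only Markdown files are treated as ADR candidates
        match = FIRST_NUMBER.search(name)
        if match:
            numbers.append(int(match.group()))
    # With no numbered ADRs present, this becomes ADR 1.
    return max(numbers, default=0) + 1
```

In practice the filenames might come from something like `(p.name for p in Path("docs/adr").glob("*.md"))`, with the directory adjusted to wherever the project keeps its ADRs.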
### Default Assumptions
- If no existing ADR numbering → start at ADR 1
- If no approval process exists → default to "Accepted" status (solo dev or small team)
- If compliance mechanism is unclear → suggest manual review as the starting point
## Process
### Step 1: Assess Architectural Significance
**ACTION:** Determine whether this decision is architecturally significant by evaluating it against five dimensions.
**WHY:** Not every technical decision needs an ADR. Over-documenting trivial choices creates noise and dilutes the value of ADRs. A decision is architecturally significant if it affects at least one of these dimensions — and it's the significance that justifies the effort of formal documentation.
Evaluate the decision against:
| Dimension | Question |
|-----------|---------|
| **Structure** | Does this affect the patterns or styles of architecture? |
| **Nonfunctional characteristics** | Does this impact a quality attribute that matters to the system? |
| **Dependencies** | Does this create or change coupling between components/services? |
| **Interfaces** | Does this affect how services or components are accessed? |
| **Construction techniques** | Does this impact platforms, frameworks, tools, or processes? |
**IF** the decision affects at least one dimension → it's architecturally significant, proceed to write the ADR.
**IF** it affects none → it's a technical implementation detail, not an architecture decision. Document it in code comments or a tech spec instead.
**IMPORTANT: Show your work.** Include the significance assessment as a visible section in your output BEFORE the ADR itself. This is not just an internal check — it demonstrates rigor and helps stakeholders understand why this decision warrants formal documentation.
Output the assessment as:
```
## Significance Assessment
| Dimension | Affected? | How |
|-----------|:---------:|-----|
| Structure | Yes/No | {explanation} |
| Nonfunctional characteristics | Yes/No | {explanation} |
| Dependencies | Yes/No | {explanation} |
| Interfaces | Yes/No | {explanation} |
| Construction techniques | Yes/No | {explanation} |
**Verdict:** Architecturally significant — affects {N} of 5 dimensions.
```
**CAUTION:** Don't assume technology decisions aren't architectural. If choosing Kafka over RabbitMQ directly supports a performance or scalability characteristic, it IS an architecture decision — the technology choice supports the architecture.
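The significance check and its table can be produced together. The sketch below is one way to do that; the function name and return shape are hypothetical, and an omitted dimension is assumed to mean "not affected". It returns the table only, leaving the verdict sentence to the author.

```python
DIMENSIONS = [
    "Structure",
    "Nonfunctional characteristics",
    "Dependencies",
    "Interfaces",
    "Construction techniques",
]


def assess_significance(affected: dict[str, str]) -> tuple[bool, int, str]:
    """affected maps a dimension name to a short explanation of the impact.

    Returns (is_significant, affected_count, markdown_table).
    """
    unknown = set(affected) - set(DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    rows = [
        "| Dimension | Affected? | How |",
        "|-----------|:---------:|-----|",
    ]
    count = 0
    for dimension in DIMENSIONS:
        how = affected.get(dimension, "").strip()
        if how:
            count += 1
            rows.append(f"| {dimension} | Yes | {how} |")
        else:
            rows.append(f"| {dimension} | No | Not affected |")
    # One affected dimension is enough to justify an ADR.
    return count >= 1, count, "\n".join(rows)
```

Rejecting unknown dimension names guards against typos that would otherwise silently drop an impact from the table.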
### Step 2: Determine Status
**ACTION:** Set the appropriate ADR status based on the decision's approval context.
**WHY:** Status isn't just metadata — it communicates where the decision is in its lifecycle and what action is needed. Setting the wrong status (e.g., "Accepted" when approval is needed) can lead to unauthorized implementations. Setting "Proposed" when the architect can self-approve adds unnecessary bureaucracy.
| Status | When to use |
|--------|------------|
| **Proposed** | Decision needs approval from a governance body or senior architect |
| **Accepted** | Decision is approved and ready for implementation |
| **Superseded by ADR N** | Decision has been replaced (link to the new ADR) |
| **RFC (with deadline)** | Architect wants broader input before deciding. MUST include a deadline date — otherwise it becomes an open-ended discussion that never concludes (Analysis Paralysis). |
Escalation triggers — the decision should be **Proposed** (not self-approved) when:
- **Cost** exceeds the team's authority (significant purchases, licensing)
- **Cross-team impact** — it affects other teams or systems
- **Security implications** — any security-relevant change needs governance review
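The status table and escalation triggers combine into a short decision function. This is a sketch under stated assumptions: the parameter names are hypothetical, the precedence (superseded, then RFC, then escalation) is a guess the source does not specify, and the exact wording of the RFC status string is illustrative.

```python
from datetime import date
from typing import Optional


def determine_status(
    exceeds_cost_authority: bool = False,
    cross_team_impact: bool = False,
    security_implications: bool = False,
    wants_broader_input: bool = False,
    rfc_deadline: Optional[date] = None,
    superseded_by: Optional[int] = None,
) -> str:
    if superseded_by is not None:
        return f"Superseded by ADR {superseded_by}"
    if wants_broader_input:
        # An RFC without a deadline invites open-ended discussion.
        if rfc_deadline is None:
            raise ValueError("an RFC status must include a deadline date")
        return f"RFC (deadline {rfc_deadline.isoformat()})"
    # Any single escalation trigger rules out self-approval.
    if exceeds_cost_authority or cross_team_impact or security_implications:
        return "Proposed"
    return "Accepted"
```

Raising an error for an RFC without a deadline makes the "MUST include a deadline date" rule impossible to skip by accident.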
### Step 3: Write the Context Section
**ACTION:** Describe the forces at play — what situation is forcing this decision? Include the alternatives considered.
**WHY:** Context serves double duty: it explains WHY the decision is needed AND documents the architecture. A future developer reading this ADR learns both the decision and the architectural context it applies to. Keep it concise — if alternatives need detailed analysis, add a separate Alternatives section or reference a trade-off analysis.
Format: A clear, concise statement of the situation + the alternatives.
Good: *"The order service must pass information to the payment service. This could be done using REST (synchronous) or asynchronous messaging via Kafka."*
Bad: *"We need to figure out how services should communicate."* (Too vague — which services? What are the options?)
**Before writing context, diagnose the situation for anti-patterns.** Check if the scenario shows signs of:
- **Covering Your Assets** — Has this decision been deferred repeatedly? Is the architect afraid to commit?
- **Groundhog Day** — Is this a decision that was already made but nobody recorded WHY, so it's being revisited?
- **Email-Driven Architecture** — Was a prior decision made but lost in email/Slack, so it's being re-made?
If an anti-pattern is present, NAME IT explicitly in the Context section and note how this ADR addresses it. For example: *"This decision is being re-made because the original rationale (ADR-12) did not document WHY the monolith was chosen — a classic Groundhog Day anti-pattern. This ADR includes full justification to prevent recurrence."*
### Step 4: Write the Decision Section
**ACTION:** State the decision in active, commanding voice with full justification emphasizing WHY.
**WHY:** "Why is more important than how" (Second Law of Software Architecture). Anyone can look at the system and figure out HOW it works. What they can't figure out is WHY it was built that way. Without WHY, future developers may undo good decisions. One architect replaced gRPC with messaging for "better decoupling," not knowing that gRPC had been chosen specifically to reduce latency; the switch caused timeouts throughout the system.
- Use **affirmative, commanding voice**: "We will use..." not "I think we should..."
- Lead with the decision, then justify
- Include BOTH technical AND business justification
- Apply the **business value litmus test**: if the decision provides no business value (cost savings, time to market, user satisfaction, or strategic positioning), reconsider whether it should be made at all
### Step 5: Write the Consequences Section
**ACTION:** Document BOTH positive and negative impacts of the decision.
**WHY:** Every architecture decision has trade-offs (this is the First Law). Documenting only positives is dishonest and sets up future surprises. Documenting negatives explicitly forces the architect to weigh whether the negative impacts outweigh the benefits. It also prevents the Groundhog Day anti-pattern: when someone questions the decision later, the consequences are already documented with the reasoning.
For each consequence, indicate whether it's positive or negative:
- **Positive:** What improves because of this decision?
- **Negative:** What gets worse or becomes more complex? What new risks are introduced?
- **Trade-off:** What are we accepting in exchange for the benefits?
### Step 6: Write the Compliance Section
**ACTION:** Specify HOW the decision will be measured and governed.
**WHY:** A decision without enforcement is a suggestion. Many architecture decisions erode over time because nobody checks whether they're being followed. The Compliance section forces the architect to think about governance at decision time, not as an afterthought. This is the difference between "we decided to use layered architecture" and "we decided to use layered architecture, AND here's the ArchUnit test that enforces it."
Two types of compliance:
| Type | When to use | Example |
|------|------------|---------|
| **Manual** | Decision is hard to check automatically, involves judgment | "Review service boundaries during quarterly architecture review" |
| **Automated fitness function** | Decision can be verified programmatically | "ArchUnit test ensures shared services reside in the services layer" |
For automated compliance, specify:
- How the fitness function would be written
- Where the test lives
- How and when it's executed (CI pipeline, pre-commit, scheduled)
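The skill names ArchUnit (a Java library) as the example tool. As a language-neutral illustration of what an automated fitness function does, here is a minimal Python sketch that uses the standard `ast` module instead. The `forbidden_imports` helper and the `app.persistence` / `app.services` package names are assumptions made for the example, not part of any real codebase or of ArchUnit.

```python
import ast


def forbidden_imports(source: str, forbidden_prefix: str) -> list[str]:
    """Return absolute imports in `source` that fall under `forbidden_prefix`.

    Relative imports such as `from . import x` are ignored in this sketch.
    """
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        found.extend(
            name for name in names
            if name == forbidden_prefix or name.startswith(forbidden_prefix + ".")
        )
    return found


# Hypothetical presentation-layer module that bypasses the services layer.
sample = "import app.persistence.orders\nfrom app.services import billing\n"
violations = forbidden_imports(sample, "app.persistence")
print(violations)
```

In a CI pipeline, a test would read each source file in the governed layer and fail the build when the returned list is non-empty. That covers the "where the test lives" and "how and when it's executed" items above.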
### Step 7: Write the Notes Section
**ACTION:** Add metadata: original author, approval date, last modified, approvers, supersession history.
**WHY:** Notes provide the audit trail. When a decision is questioned months later, the Notes section shows who made it, who approved it, and what changed. This is especially important in regulated environments where decision provenance matters.
## Inputs
- The decision to document (from user or from a completed trade-off analysis)
- Context: what alternatives were considered, what constraints apply
- Optionally: existing ADR directory for numbering and cross-referencing
## Outputs
### Architecture Decision Record
```markdown
## Significance Assessment
| Dimension | Affected? | How |
|-----------|:---------:|-----|
| Structure | Yes/No | {explanation} |
| Nonfunctional characteristics | Yes/No | {explanation} |
| Dependencies | Yes/No | {explanation} |
| Interfaces | Yes/No | {explanation} |
| Construction techniques | Yes/No | {explanation} |
**Verdict:** Architecturally significant — affects {N} of 5 dimensions.
---
# ADR {N}: {Short Descriptive Title}
## Status
{Proposed | Accepted | Superseded by ADR N | RFC, Deadline YYYY-MM-DD}
## Context
{Clear, concise description of the situation and forces at play.
What alternatives were considered?}
## Decision
{Active voice. Affirmative. Full justification emphasizing WHY.
Both technical and business justification.}
## Consequences
### Positive
- {What improves}
### Negative
- {What gets worse or becomes more complex}
### Trade-offs
- {What we're accepting in exchange}
## Compliance
{How this decision will be enforced}
- **Type:** Manual review | Automated fitness function
- **Mechanism:** {Specific enforcement mechanism}
- **Frequency:** {When/how often compliance is checked}
## Notes
- **Author:** {name}
- **Date:** {YYYY-MM-DD}
- **Approved by:** {name(s), if applicable}
- **Last modified:** {YYYY-MM-DD}
- **Supersedes:** {ADR N, if applicable}
- **Superseded by:** {ADR N, if applicable}
```
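A filled-in ADR can be checked mechanically against the template above. The following is a rough sketch, assuming the ADR uses exactly the `# ADR N: Title` and `## Section` headings shown in the template; `check_adr` and the sample draft are hypothetical.

```python
import re

REQUIRED_HEADINGS = ["Status", "Context", "Decision",
                     "Consequences", "Compliance", "Notes"]


def check_adr(text: str) -> list[str]:
    """Return a list of problems found in an ADR that follows the template."""
    problems = []
    if not re.search(r"^# ADR \d+: .+", text, flags=re.MULTILINE):
        problems.append("missing title line '# ADR N: Title'")
    for heading in REQUIRED_HEADINGS:
        if not re.search(rf"^## {heading}\s*$", text, flags=re.MULTILINE):
            problems.append(f"missing section: {heading}")
    # An RFC status without a deadline invites open-ended discussion.
    is_rfc = re.search(r"^RFC\b", text, flags=re.MULTILINE)
    has_deadline = re.search(
        r"^RFC, Deadline \d{4}-\d{2}-\d{2}\s*$", text, flags=re.MULTILINE
    )
    if is_rfc and not has_deadline:
        problems.append("RFC status has no deadline date")
    return problems


# Hypothetical draft: no Compliance section and an RFC without a deadline.
draft = """# ADR 7: Adopt event sourcing for the audit trail
## Status
RFC
## Context
...
## Decision
...
## Consequences
...
## Notes
..."""
print(check_adr(draft))
```

The check only verifies structure. Whether the Decision section actually explains WHY still needs a human reviewer.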
## Key Principles
- **WHY over HOW** — The Decision section's most powerful aspect is the justification. Anyone can see how a system works; only the ADR explains why it was built that way. Without WHY, good decisions get undone by well-meaning but uninformed future developers.
- **Decisions without enforcement are suggestions** — The Compliance section is what separates an ADR from a wish. If you can automate compliance (fitness functions, ArchUnit tests), do it. If not, schedule manual reviews. But never leave enforcement unspecified.
- **Both positive AND negative consequences** — Every decision has trade-offs. Documenting only positives is dishonest. The negative consequences, acknowledged upfront, prevent surprise later and provide ammunition when someone asks "did you consider X?"
- **Name the anti-pattern when you see it** — If a team avoids deciding (Covering Your Assets), revisits decisions repeatedly (Groundhog Day), or loses decisions in email (Email-Driven Architecture), name the dysfunction. These three anti-patterns form a progressive chain — overcoming one often reveals the next.
- **Last responsible moment, not last possible moment** — Decide when you have enough information to justify the choice, but before development teams are blocked. Too early = premature commitment. Too late = analysis paralysis. The sweet spot is "last responsible moment."
- **Business value litmus test** — If a decision provides no business value (cost, time to market, user satisfaction, strategic positioning), reconsider making it. Architecture decisions exist to serve business outcomes, not architectural purity.
## Examples
**Scenario: Documenting a messaging decision for an auction system**
Trigger: "We decided to use asynchronous messaging between the order and payment services. Can you write an ADR for this?"
Process: Assessed significance — affects structure (async vs sync), dependencies (service coupling), and nonfunctional characteristics (performance, reliability). Set status: Accepted (small team, self-approved). Context: order → payment communication, REST vs async messaging. Decision: "We will use asynchronous messaging via RabbitMQ" with WHY: reduces latency from 3,100ms to 25ms for review posting, decouples services. Consequences: positive (responsiveness, decoupling), negative (complex error handling for bad content). Compliance: automated test verifying no direct REST calls between these services.
Output: Complete ADR with all 7 sections, filed as ADR-42.
**Scenario: Superseding a previous technology decision**
Trigger: "We originally chose gRPC for service communication but now want to switch to messaging. There's an existing ADR for the gRPC decision."
Process: Assessed significance — affects structure and dependencies. Created new ADR with status "Accepted, supersedes ADR 23." Documented WHY the original decision (latency reduction) is no longer the priority and why decoupling now matters more. Explicitly noted the consequence: latency will increase, and upstream timeouts must be reconfigured. Updated ADR 23 status to "Superseded by ADR 45."
Output: New ADR 45 plus an updated status on ADR 23, creating a traceable decision history.
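The two-way link in this scenario can be sketched as follows. The `Adr` dataclass, the `supersede` helper, and the ADR titles are hypothetical; only the status wording is taken from the scenario.

```python
from dataclasses import dataclass


@dataclass
class Adr:
    number: int
    title: str
    status: str = "Proposed"


def supersede(old: Adr, new: Adr) -> None:
    """Link two ADRs in both directions so the history stays traceable."""
    old.status = f"Superseded by ADR {new.number}"
    new.status = f"Accepted, supersedes ADR {old.number}"


grpc = Adr(23, "Use gRPC for service communication", status="Accepted")
messaging = Adr(45, "Use messaging for service communication")
supersede(grpc, messaging)
print(grpc.status)
print(messaging.status)
```

Updating both records in one step avoids a one-sided link, where a reader of the old ADR cannot tell it has been replaced.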
**Scenario: Decision that needs broader input**
Trigger: "I think we should adopt event sourcing for our audit trail, but I want the team's input before committing."
Process: Assessed significance — affects structure (event store pattern), construction techniques (new tooling). Set status: RFC, Deadline 2026-04-15. Wrote Context explaining the audit requirements and alternatives (event sourcing vs append-only table vs CDC). Decision section presents the architect's recommendation with justification, inviting comments. Noted in Compliance: "if adopted, automated test verifying all state changes emit events."
Output: ADR in RFC status with deadline, ready for team review. After deadline, architect incorporates feedback and moves to Accepted.
## References
- For the ADR file template, see [assets/adr-template.md](assets/adr-template.md)
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Fundamentals of Software Architecture by Mark Richards, Neal Ford.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-architecture-tradeoff-analyzer`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:assets/adr-template.md
# ADR {N}: {Short Descriptive Title}
## Status
{Proposed | Accepted | Superseded by ADR N | RFC, Deadline YYYY-MM-DD}
## Context
{Clear, concise description of the situation and forces at play.
What alternatives were considered?}
## Decision
{Active voice. Affirmative. Full justification emphasizing WHY.
Both technical and business justification.}
## Consequences
### Positive
- {What improves because of this decision}
### Negative
- {What gets worse or becomes more complex}
### Trade-offs
- {What we're accepting in exchange for the benefits}
## Compliance
- **Type:** {Manual review | Automated fitness function}
- **Mechanism:** {Specific enforcement mechanism}
- **Frequency:** {When/how often compliance is checked}
## Notes
- **Author:** {name}
- **Date:** {YYYY-MM-DD}
- **Approved by:** {name(s), if applicable}
- **Last modified:** {YYYY-MM-DD}
- **Supersedes:** {ADR N, if applicable}
- **Superseded by:** {ADR N, if applicable}
Systematically identify, categorize, and prioritize architecture characteristics (quality attributes / -ilities) from requirements, domain concerns, and stak...
---
name: architecture-characteristics-identifier
description: Systematically identify, categorize, and prioritize architecture characteristics (quality attributes / -ilities) from requirements, domain concerns, and stakeholder input. Use this skill whenever the user is starting a new project, defining architecture requirements, translating business needs into technical characteristics, asking "what quality attributes matter?", figuring out nonfunctional requirements, or evaluating what -ilities to optimize for — even if they don't explicitly say "architecture characteristics."
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/fundamentals-of-software-architecture/skills/architecture-characteristics-identifier
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on: [] # Foundation skill — no dependencies
source-books:
- id: fundamentals-of-software-architecture
title: "Fundamentals of Software Architecture"
authors: ["Mark Richards", "Neal Ford"]
chapters: [4, 5]
tags: [software-architecture, architecture, quality-attributes, requirements, nonfunctional-requirements, ilities]
execution:
tier: 1
mode: full
inputs:
- type: none
description: "Requirements, domain concerns, or stakeholder priorities from the user"
tools-required: [Read, Write]
tools-optional: [Grep, Glob]
mcps-required: []
environment: "Any agent environment. If a codebase exists, can scan for existing architecture docs."
---
# Architecture Characteristics Identifier
## When to Use
You're at the start of an architecture decision — before choosing patterns, styles, or technologies. The team needs to understand which quality attributes actually matter for THIS system. Typical situations:
- New project kicking off — "what should we optimize for?"
- Requirements review — translating business language into technical characteristics
- Stakeholder disagreement — everyone has different priorities, need structured resolution
- Architecture audit — evaluating whether an existing system's characteristics match its needs
- Pre-requisite for other skills — `architecture-style-selector`, `architecture-fitness-function-designer`, and others depend on knowing the driving characteristics first
Before starting, verify:
- Are there requirements, domain concerns, or stakeholder inputs to work from? (If nothing exists, help the user articulate their domain concerns first)
- Is this about identifying characteristics (this skill) or analyzing trade-offs between options (use `architecture-tradeoff-analyzer` instead)?
## Context
### Required Context (must have before proceeding)
- **Domain or project description:** What is this system for? What problem does it solve? Ask the user if not stated.
- **At least one source of characteristics:** Requirements document, stakeholder concerns, domain description, or existing system to audit.
### Observable Context (gather from environment if available)
- **Requirements documents:** Search for requirements, PRDs, specs
→ Look for: `docs/`, `requirements/`, `*.prd.md`, `README` sections about goals
→ If unavailable: work from user's verbal description
- **Existing architecture docs:** Check for prior characteristic decisions
→ Look for: ADRs, architecture docs, existing `-ilities` lists
→ If found: audit and update rather than start from scratch
- **Codebase structure:** If a system exists, its structure reveals implicit characteristics
→ Look for: caching layers (performance), retry logic (reliability), auth modules (security), i18n files (localization)
→ If no codebase: greenfield, start from requirements
### Default Assumptions
- If no requirements exist → work from user's verbal domain description
- If no stakeholders available → ask the user to role-play key stakeholders (CTO, product, ops)
- If domain is unfamiliar → apply the three common implicit characteristics (availability, reliability, security) and ask probing questions
## Process
### Step 1: Gather Domain Concerns
**ACTION:** Identify what the business stakeholders care about. Translate their language into a domain concerns list.
**WHY:** Stakeholders speak in business language ("we need to merge with Company X", "time to market is critical", "users must love it"). Architects speak in -ilities ("interoperability", "deployability", "usability"). If you skip this translation, you'll optimize for the wrong things. This "lost in translation" problem is a common cause of architecture-business misalignment.
Common domain concerns and what they map to:
| Domain Concern | Architecture Characteristics |
|----------------|------------------------------|
| Mergers and acquisitions | Interoperability, scalability, adaptability, extensibility |
| Time to market | Agility, testability, deployability |
| User satisfaction | Performance, availability, fault tolerance, testability, deployability, agility, security |
| Competitive advantage | Agility, testability, deployability, scalability, availability, fault tolerance |
| Time and budget | Simplicity, feasibility |
For the full domain-concern mapping table, see [references/domain-concern-mapping.md](references/domain-concern-mapping.md).
**CAUTION:** Don't over-simplify the translation. "Time to market" is NOT the same as "agility": time to market = agility + testability + deployability. Focusing on only one ingredient is like forgetting to put the flour in the cake batter.
**IF** stakeholders are available → facilitate a brief discussion: "What are your top business concerns for this system?"
**ELSE** → ask the user to state the key domain concerns, or infer from the domain description.
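The translation step can be sketched as a table lookup that keeps track of which concern motivated each characteristic. The sketch assumes a hypothetical `translate` helper and a `CONCERN_MAP` holding only three rows copied from the table above; unmapped concerns are returned separately so they can be discussed instead of silently dropped.

```python
CONCERN_MAP = {
    "mergers and acquisitions": ["interoperability", "scalability",
                                 "adaptability", "extensibility"],
    "time to market": ["agility", "testability", "deployability"],
    "time and budget": ["simplicity", "feasibility"],
}


def translate(concerns: list[str]) -> tuple[dict[str, list[str]], list[str]]:
    """Map each characteristic to the concerns that call for it.

    Returns (characteristic -> concerns, concerns with no known mapping).
    """
    by_characteristic: dict[str, list[str]] = {}
    unmapped: list[str] = []
    for concern in concerns:
        key = concern.strip().lower()
        if key not in CONCERN_MAP:
            unmapped.append(concern)
            continue
        for characteristic in CONCERN_MAP[key]:
            by_characteristic.setdefault(characteristic, []).append(key)
    return by_characteristic, unmapped


# "Brand loyalty" is a made-up concern used to show the unmapped path.
mapped, unmapped = translate(["Time to market", "Time and budget", "Brand loyalty"])
print(mapped)
print(unmapped)
```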
### Step 2: Extract from Requirements
**ACTION:** Analyze requirements (explicit or stated by user) and extract architecture characteristics from each one.
**WHY:** Requirements contain encoded architecture characteristics. "Support 10,000 concurrent users" explicitly calls for scalability. But a single requirement often implies MULTIPLE characteristics. The classic trap: a stakeholder says "end-of-day fund pricing must complete on time" — an ineffective architect focuses only on performance. A good architect recognizes the need for performance AND availability AND scalability AND reliability AND recoverability AND auditability. It doesn't matter how fast the system is if it crashes at 85% load.
For each requirement:
1. Identify the EXPLICIT characteristic (what it directly states)
2. Probe for HIDDEN characteristics (what else must be true for this requirement to be met?)
3. Check if it requires special STRUCTURAL support (not just implementation) — if it doesn't influence structure, it's a design concern, not an architecture characteristic
### Step 3: Identify Implicit Characteristics
**ACTION:** Add characteristics that aren't in requirements but are necessary for the domain.
**WHY:** The most dangerous characteristics are the ones nobody writes down. Every web application needs availability, reliability, and security — but these rarely appear in requirements because stakeholders assume they're obvious. An architect who only addresses explicitly stated requirements will build a system that fails on implicit needs. Experience in the problem domain is what surfaces these.
Always consider these three for any system:
- **Availability** — can users access it?
- **Reliability** — does it stay up during interactions?
- **Security** — is it protected against threats?
Then probe domain-specific implicit characteristics:
- Handling payments? → security rises to architecture level (needs structural isolation)
- Serving global users? → localization, legal compliance, data residency
- Burst traffic patterns? → elasticity (not just scalability — elasticity handles SPIKES, scalability handles GROWTH)
### Step 4: Validate with the Three-Criteria Test
**ACTION:** For each candidate characteristic, verify it passes ALL three criteria:
1. **Specifies a nondomain design consideration** — It's about HOW to build, not WHAT to build
2. **Influences some structural aspect of the design** — It requires special architectural support, not just good implementation
3. **Is critical or important to application success** — The system would fail or significantly underperform without it
**WHY:** Without validation, the list inflates with everything anyone can think of. Every system COULD support every characteristic, but SHOULDN'T — each one adds complexity. The three-criteria test is the filter that separates real architecture characteristics from design concerns and wishful thinking. If a characteristic doesn't influence structure, handle it at the design level instead.
**IF** a characteristic fails criterion 2 (doesn't influence structure) → it's a design concern, not an architecture characteristic. Note it for the development team but don't include it in the architecture characteristics list.
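The three-criteria test can be expressed as a small decision function. This is a sketch under stated assumptions: the `Candidate` dataclass and `verdict` function are hypothetical, and treating every other failure as "Exclude" is an interpretation of the verdict column in the report template, not something the skill spells out.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    nondomain: bool             # criterion 1
    influences_structure: bool  # criterion 2
    critical: bool              # criterion 3


def verdict(candidate: Candidate) -> str:
    """Classify a candidate as Include, Design-only, or Exclude."""
    if (candidate.nondomain and candidate.influences_structure
            and candidate.critical):
        return "Include"
    if not candidate.influences_structure:
        # Failing criterion 2 makes it a design concern for the dev team.
        return "Design-only"
    return "Exclude"  # assumption: any other failure drops the candidate


# Security in the Silicon Sandwiches example: a third-party processor
# handles payments, so no special structure is needed.
security = Candidate("security", nondomain=True,
                     influences_structure=False, critical=True)
scalability = Candidate("scalability", nondomain=True,
                        influences_structure=True, critical=True)
print(verdict(security))
print(verdict(scalability))
```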
### Step 5: Categorize
**ACTION:** Organize the validated characteristics into three categories: Operational, Structural, Cross-Cutting.
**WHY:** Categorization reveals blind spots. If all your characteristics are operational (performance, scalability, availability) and none are structural (maintainability, extensibility), you might be building a fast system that's impossible to change. If they're all cross-cutting (security, legal, accessibility), you might be ignoring operational realities. A balanced list across categories is a sign of thorough analysis.
- **Operational:** How the system runs (availability, scalability, performance, reliability, elasticity)
- **Structural:** How the code is organized (maintainability, extensibility, modularity, testability, deployability)
- **Cross-Cutting:** Spans both (security, accessibility, observability, legal, privacy)
For the full taxonomy with definitions, see [references/characteristics-taxonomy.md](references/characteristics-taxonomy.md).
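Categorization and the blind-spot check can be sketched together. The membership sets below are copied from the three bullets in this step; `categorize` and `blind_spots` are hypothetical helpers, and anything outside the sets lands in an "Uncategorized" bucket for manual placement.

```python
CATEGORIES = {
    "Operational": {"availability", "scalability", "performance",
                    "reliability", "elasticity"},
    "Structural": {"maintainability", "extensibility", "modularity",
                   "testability", "deployability"},
    "Cross-Cutting": {"security", "accessibility", "observability",
                      "legal", "privacy"},
}


def categorize(characteristics: list[str]) -> dict[str, list[str]]:
    """Group characteristics by category, preserving input order."""
    grouped: dict[str, list[str]] = {category: [] for category in CATEGORIES}
    grouped["Uncategorized"] = []
    for name in characteristics:
        home = next(
            (category for category, members in CATEGORIES.items()
             if name in members),
            "Uncategorized",
        )
        grouped[home].append(name)
    return grouped


def blind_spots(grouped: dict[str, list[str]]) -> list[str]:
    """Return the categories that have no characteristics at all."""
    return [category for category in CATEGORIES if not grouped[category]]


grouped = categorize(["performance", "scalability", "availability"])
print(blind_spots(grouped))  # an all-operational list leaves two categories empty
```

An empty category is a prompt to ask more questions, not proof that something is missing.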
### Step 6: Prioritize to Top 3
**ACTION:** Force-rank to the top 3 driving characteristics. No more.
**WHY:** Trying to optimize for everything produces a generic architecture that optimizes for nothing. Each additional characteristic you support complicates the overall design — like flying a helicopter where every control affects every other control. The Swedish warship Vasa tried to be both a troop transport AND a gunship with two decks of oversized cannons. It capsized and sank on its maiden voyage. Three characteristics is the practical limit for what one architecture can genuinely drive.
Facilitation technique:
1. Present the validated list to stakeholders
2. Ask: "Pick your top 3. Not in priority order — just the 3 most critical."
3. If they resist eliminating any, use the elimination exercise: "If you MUST eliminate one, which would it be?"
4. The top 3 become the DRIVING characteristics. Others are still acknowledged but don't drive architecture decisions.
**IF** stakeholders insist on more than 3 → explain the Vasa story and the helicopter metaphor. More is not better — it's more complex, more expensive, and more fragile.
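The elimination exercise can be sketched as a loop that keeps asking "which one would you drop?" until three remain. The `narrow_to_top_three` helper is hypothetical, and the `eliminate` callback stands in for the stakeholders' answer. The numeric ranking is invented purely to drive the example; in a real session the answers come from the room.

```python
from typing import Callable


def narrow_to_top_three(
    validated: list[str],
    eliminate: Callable[[list[str]], str],
) -> tuple[list[str], list[str]]:
    """Drop one characteristic at a time until three driving ones remain.

    Returns (driving, acknowledged_but_not_driving).
    """
    driving = list(validated)
    acknowledged: list[str] = []
    while len(driving) > 3:
        dropped = eliminate(driving)
        driving.remove(dropped)
        acknowledged.append(dropped)
    return driving, acknowledged


validated = ["performance", "availability", "scalability",
             "reliability", "recoverability", "auditability"]
# Hypothetical stakeholder preferences (higher number = dropped sooner).
ranking = {"reliability": 1, "performance": 2, "auditability": 3,
           "availability": 4, "recoverability": 5, "scalability": 6}
driving, acknowledged = narrow_to_top_three(
    validated, lambda remaining: max(remaining, key=ranking.get)
)
print(driving)
print(acknowledged)
```

The dropped characteristics are returned, not discarded, which matches the point that they are still acknowledged but do not drive architecture decisions.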
### Step 7: Produce the Characteristics Report
**ACTION:** Document the identified, validated, categorized, and prioritized characteristics.
**WHY:** This report becomes the input for architecture style selection, fitness function design, and trade-off analysis. Without it, downstream decisions lack a foundation. It also creates alignment — stakeholders sign off on what matters, preventing the "Groundhog Day" anti-pattern (revisiting the same decisions because nobody recorded the rationale).
## Inputs
- Requirements document, PRD, or verbal project description
- Domain concerns from stakeholders (or user role-playing stakeholders)
- Optionally: existing codebase or architecture docs to audit
## Outputs
### Architecture Characteristics Report
```markdown
# Architecture Characteristics: {System Name}
## Domain Concerns
| Concern | Source | Mapped Characteristics |
|---------|--------|----------------------|
| {concern} | {stakeholder/requirement} | {characteristic1, characteristic2} |
## Identified Characteristics
### Explicit (from requirements)
| Characteristic | Source Requirement | Reasoning |
|---------------|-------------------|-----------|
| {characteristic} | {requirement} | {why this requirement implies this characteristic} |
### Implicit (from domain knowledge)
| Characteristic | Reasoning |
|---------------|-----------|
| {characteristic} | {why this is needed even though no one asked for it} |
## Three-Criteria Validation
| Characteristic | Nondomain? | Influences Structure? | Critical? | Verdict |
|---------------|:---:|:---:|:---:|---------|
| {char} | Yes/No | Yes/No | Yes/No | Include / Design-only / Exclude |
## Categorization
| Category | Characteristics |
|----------|----------------|
| Operational | {list} |
| Structural | {list} |
| Cross-Cutting | {list} |
## Top 3 Driving Characteristics
1. **{#1}** — {why this is driving}
2. **{#2}** — {why this is driving}
3. **{#3}** — {why this is driving}
### Acknowledged but not driving
- {characteristic}: {why it's important but not top 3}
## Characteristics NOT Included (and why)
- {candidate}: {failed criterion X / is a design concern / not critical enough}
```
## Key Principles
- **Three-criteria test is the gatekeeper** — A characteristic must be nondomain, influence structure, AND be critical to success. Anything less is a design concern, not an architecture characteristic. This filter prevents characteristic bloat.
- **Implicit characteristics are the dangerous ones** — What nobody writes down in requirements is often what kills the project. Availability, reliability, and security are almost always implicit. An architect's domain experience is what surfaces these.
- **Top 3, not top 10** — Every additional characteristic complicates the architecture like adding controls to a helicopter. The Vasa warship sank because it tried to optimize for too many things. Force stakeholders to choose 3 driving characteristics. This creates focus, not limitation.
- **Translate, don't transcribe** — Stakeholders say "time to market." That's NOT one characteristic — it's agility + testability + deployability. A single domain concern maps to multiple characteristics, and a single requirement often implies multiple characteristics. The translation table is your tool.
- **Over-specifying is as bad as under-specifying** — Adding characteristics you don't need is just as damaging as missing ones you do need. Each unnecessary characteristic adds complexity, cost, and design constraints. When in doubt, leave it out and handle at the design level.
- **Explicit vs implicit, not obvious vs hidden** — Explicit means it's stated in requirements. Implicit means it's necessary but unstated. Don't confuse "obvious" with "explicit" — security is obvious but almost always implicit (unstated in requirements). The distinction matters because implicit characteristics require the architect to proactively surface them.
## Examples
**Scenario: Online sandwich ordering system (Silicon Sandwiches)**
Trigger: "We're building a national online sandwich ordering platform for our franchise chain. What should we optimize for?"
Process: Gathered domain concerns: thousands to millions of users, mealtime burst traffic, franchise customization, online payments, overseas expansion plans, cost-conscious hiring. Extracted explicit characteristics: scalability (user volume), performance (peak times), customizability (franchise-specific behavior). Identified implicit: elasticity (mealtime bursts, lurking in the domain rather than stated in requirements), availability, reliability, security (payments). Validated all against the three-criteria test: security doesn't require special structure because payments are handled by a third-party processor, so it stays at design level. Categorized and prioritized: top 3 = scalability, elasticity, customizability.
Output: Characteristics report with 7 candidates, 4 validated, 3 driving. Customizability flagged as architecture-vs-design trade-off (microkernel structure vs Template Method pattern).
**Scenario: Regulatory financial system**
Trigger: "We need to build an end-of-day fund pricing system. The regulator says we absolutely must complete pricing on time."
Process: The naive approach: focus on performance. The thorough approach: "complete on time" requires performance AND availability (system must be up) AND scalability (handle growing fund count) AND reliability (no crashes at 85% load) AND recoverability (recover quickly if something fails) AND auditability (regulators need proof it completed). One requirement → six characteristics. Validated all, categorized, prioritized top 3: reliability, performance, auditability.
Output: Characteristics report showing how one business statement expanded into 6 characteristics, with justification for top 3 selection.
**Scenario: Startup MVP with stakeholder disagreement**
Trigger: "Our CTO wants scalability, product wants time-to-market, and our investor wants low cost. We're 4 developers. What matters?"
Process: Mapped domain concerns: CTO's scalability, product's time-to-market (= agility + testability + deployability), investor's cost (= simplicity + feasibility). Identified implicit: availability, security. Validated — for a 4-person startup MVP, scalability doesn't influence structure YET (can scale later with cloud auto-scaling, no special architecture needed now). Removed from architecture characteristics, noted as design concern. Top 3: agility, simplicity, availability. Elimination exercise confirmed: if forced to drop one, drop availability (cloud platforms provide baseline availability).
Output: Characteristics report that diplomatically resolves stakeholder disagreement by showing that scalability is valid but premature as an architecture driver for an MVP.
## References
- For the full taxonomy of architecture characteristics with definitions, see [references/characteristics-taxonomy.md](references/characteristics-taxonomy.md)
- For the domain-concern-to-characteristic translation table, see [references/domain-concern-mapping.md](references/domain-concern-mapping.md)
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Fundamentals of Software Architecture by Mark Richards, Neal Ford.
## Related BookForge Skills
This skill is standalone. Browse more BookForge skills: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/characteristics-taxonomy.md
# Architecture Characteristics Taxonomy
Complete list of architecture quality attributes organized by category. Use this as a reference when identifying characteristics — but remember, this list is never exhaustive. Any system may define custom characteristics based on unique factors.
## Operational Characteristics
These affect how the system runs in production.
| Characteristic | Definition |
|---------------|-----------|
| **Availability** | How long the system needs to be available (e.g., 24/7 with quick recovery from failure) |
| **Continuity** | Disaster recovery capability |
| **Performance** | Response times, throughput, capacity under stress |
| **Recoverability** | How quickly the system returns to normal after disaster |
| **Reliability / Safety** | Fail-safe behavior, mission-critical impact, financial cost of failure |
| **Robustness** | Handling error and boundary conditions (internet down, power outage, hardware failure) |
| **Scalability** | Ability to handle growing number of users or requests |
| **Elasticity** | Ability to handle sudden BURSTS of traffic (distinct from scalability which handles GROWTH) |
## Structural Characteristics
These affect how the codebase is organized and evolved.
| Characteristic | Definition |
|---------------|-----------|
| **Configurability** | End-user ability to easily change software configuration |
| **Extensibility** | How easy to plug in new functionality |
| **Installability** | Ease of installation on all necessary platforms |
| **Leverageability / Reuse** | Common components usable across multiple products |
| **Localization** | Multi-language, multi-currency, multi-unit support |
| **Maintainability** | Ease of applying changes and enhancements |
| **Portability** | Ability to run on multiple platforms |
| **Supportability** | Logging, debugging, and technical support facilities |
| **Upgradeability** | Ease of upgrading to newer versions |
## Cross-Cutting Characteristics
These span operational and structural concerns.
| Characteristic | Definition |
|---------------|-----------|
| **Accessibility** | Support for users with disabilities |
| **Archivability** | Data retention, archival, and deletion policies |
| **Authentication** | Verifying user identity |
| **Authorization** | Controlling access to functions by role, rule, or field |
| **Legal** | Legislative constraints (data protection, GDPR, SOX) |
| **Privacy** | Hiding transactions from internal employees (encryption beyond external threats) |
| **Security** | Database/network encryption, authentication for remote access, threat protection |
| **Usability** | Training requirements, user goal achievement |
## ISO Quality Characteristics
Additional standardized definitions:
- **Performance efficiency:** time behavior, resource utilization, capacity
- **Compatibility:** coexistence, interoperability
- **Usability:** recognizability, learnability, error protection, accessibility
- **Reliability:** maturity, availability, fault tolerance, recoverability
- **Security:** confidentiality, integrity, nonrepudiation, accountability, authenticity
- **Maintainability:** modularity, reusability, analyzability, modifiability, testability
- **Portability:** adaptability, installability, replaceability
## Custom Characteristics
Lists are always incomplete. Any system may define custom characteristics based on unique domain factors. The "Italy-ility" example: after a freak communication outage severed connections with Italian branches, a client required that all future architectures support "Italy-ility" — a unique combination of availability, recoverability, and resilience specific to their geography.
FILE:references/domain-concern-mapping.md
# Domain Concern to Architecture Characteristic Mapping
Use this table to translate business/stakeholder language into architecture characteristics. This is the bridge between "what the business cares about" and "what the architect designs for."
## Primary Mapping Table
| Domain Concern | Architecture Characteristics |
|----------------|------------------------------|
| Mergers and acquisitions | Interoperability, scalability, adaptability, extensibility |
| Time to market | Agility, testability, deployability |
| User satisfaction | Performance, availability, fault tolerance, testability, deployability, agility, security |
| Competitive advantage | Agility, testability, deployability, scalability, availability, fault tolerance |
| Time and budget | Simplicity, feasibility |
| Regulatory compliance | Auditability, security, legal, privacy, recoverability |
| Global expansion | Localization, scalability, legal, data residency |
| Cost reduction | Simplicity, feasibility, maintainability |
| Innovation speed | Agility, extensibility, testability, deployability |
| Customer trust | Security, reliability, availability, privacy |
| Operational efficiency | Performance, observability, automation, maintainability |
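The table can also be treated as a plain lookup when several concerns come up in the same conversation. The sketch below is illustrative only: `CONCERN_MAP` and `characteristics_for` are hypothetical names, and only three rows of the table are copied in.

```python
# Hypothetical lookup built from three rows of the mapping table above.
CONCERN_MAP = {
    "time to market": ["agility", "testability", "deployability"],
    "time and budget": ["simplicity", "feasibility"],
    "regulatory compliance": [
        "auditability", "security", "legal", "privacy", "recoverability",
    ],
}


def characteristics_for(concerns):
    """Return the de-duplicated characteristics implied by the given concerns,
    in first-seen order. Unknown concerns are skipped rather than guessed."""
    seen = []
    for concern in concerns:
        for characteristic in CONCERN_MAP.get(concern.strip().lower(), []):
            if characteristic not in seen:
                seen.append(characteristic)
    return seen


combined = characteristics_for(["Time to market", "Time and budget"])
```

A concern that is missing from the lookup returns nothing, which is a prompt to go back to the probing questions rather than to invent a mapping.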
## Translation Warnings
**"Agility" is NOT "time to market."** Time to market = agility + testability + deployability. Focusing on agility alone produces an incomplete architecture.
**One concern → many characteristics.** "Complete end-of-day fund pricing on time" implies performance + availability + scalability + reliability + recoverability + auditability. A single business statement can expand to 6+ architecture characteristics.
**Don't transcribe, translate.** The stakeholder's exact words are domain language. Your job is to decode the underlying technical needs, not echo the business terminology back as a characteristic name.
## Probing Questions by Domain
When stakeholders state a concern, use these questions to uncover the full set of characteristics:
| They say... | Ask... | Reveals... |
|------------|--------|-----------|
| "It needs to be fast" | "Fast for whom? Under what load? At what percentile?" | Performance vs scalability vs elasticity |
| "We might get acquired" | "By whom? What systems would need to integrate?" | Interoperability, adaptability, data portability |
| "Security is important" | "Important enough to influence architecture? Or standard best practices?" | Whether security is a design concern or architecture characteristic |
| "We need to scale" | "Scale how? More users (scalability)? Burst traffic (elasticity)? More features (extensibility)?" | The specific type of scaling needed |
| "Budget is tight" | "Tight for build or for operations? Short term or long term?" | Simplicity vs maintainability vs feasibility |
---
name: architect-role-assessor
description: Evaluate whether a software architect is fulfilling the 8 core expectations of the role and assess their technical breadth vs depth balance using the knowledge pyramid. Use this skill whenever the user asks what a software architect should be doing, questions whether they are performing the architect role correctly, wants to assess their own or someone else's architect performance, describes symptoms of role dysfunction (spending too much time coding, not attending stakeholder meetings, only recommending technologies they know, avoiding decisions), asks about transitioning from developer to architect, or encounters the Frozen Caveman anti-pattern where past experiences irrationally drive current decisions — even if they don't explicitly say "architect role" or "expectations."
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/fundamentals-of-software-architecture/skills/architect-role-assessor
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
source-books:
  - id: fundamentals-of-software-architecture
    title: "Fundamentals of Software Architecture"
    authors: ["Mark Richards", "Neal Ford"]
    chapters: [1, 2]
tags: [software-architecture, architecture, role-definition, career, leadership, technical-breadth, self-assessment]
depends-on: []
execution:
  tier: 1
  mode: hybrid
  inputs:
    - type: none
      description: "Current architect behavior and concerns from the user: what they spend time on, what challenges they face, how they interact with teams and stakeholders"
  tools-required: [Read, Write]
  tools-optional: []
  mcps-required: []
  environment: "Any agent environment. No codebase required."
---
# Architect Role Assessor
## When to Use
You need to evaluate whether an architect (the user or someone they manage) is fulfilling the core expectations of the role. Typical triggers:
- The user just got promoted to architect and wants to know what the role requires
- The user suspects they are spending time on the wrong activities (too much coding, not enough strategy)
- The user's manager says they need to "be more strategic" or "step up as an architect"
- The user only recommends technologies they have personal experience with (Frozen Caveman pattern)
- The user wants to assess another architect's effectiveness
- The user is transitioning from developer to architect and struggles with the shift
Before starting, verify:
- Is the user an architect being assessed, or assessing someone else?
- What specific concerns or symptoms triggered this question?
## Context
### Required Context (must have before proceeding)
- **Current activities:** What does the architect spend their time on?
  -> Check prompt for: coding, meetings, reviews, presentations, stakeholder interactions, technology evaluation
  -> If still missing, ask: "Can you describe how you currently spend your time as an architect? What activities take up most of your day/week?"
- **Specific concern:** What triggered this assessment?
  -> Check prompt for: feedback, frustration, role confusion, promotion, performance review
  -> If still missing, ask: "What prompted this question? Is there a specific concern about how the architect role is being performed?"
### Observable Context (gather from environment)
- **Organization type:** What kind of company and team structure?
  -> Check prompt for: startup, enterprise, team size, reporting structure
  -> If unavailable: assume mid-size tech company
- **Career stage:** How long has this person been an architect?
  -> Check prompt for: "just promoted," "been an architect for X years," experience references
  -> If unavailable: assess from behavior descriptions
- **Team interaction patterns:** How does the architect interact with development teams?
  -> Check prompt for: code reviews, pair programming, stand-ups, 1-on-1s, architectural reviews
  -> If unavailable: assess from activity descriptions
### Default Assumptions
- If career stage unknown -> assess all 8 expectations equally
- If organization type unknown -> apply general guidance
- If team interaction patterns unknown -> flag as an area to investigate
### Sufficiency Threshold
```
SUFFICIENT when ALL of these are true:
- At least 3-4 current activities are described
- The specific concern or trigger is understood
- The career context provides enough to assess depth vs breadth
PROCEED WITH DEFAULTS when:
- Some activities are described
- A general concern is expressed
- Career stage can be estimated
MUST ASK when:
- No current activities are described at all
- The concern is completely unclear
```
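The threshold above can be read as a three-way classification. The following is a minimal sketch under stated assumptions: the function name, the `"specific"` / `"general"` / `"unclear"` concern labels, and the choice of 3 as the cutoff for "3-4 activities" are all hypothetical, and either MUST ASK condition is assumed to be enough on its own.

```python
from enum import Enum


class Sufficiency(Enum):
    SUFFICIENT = "sufficient"
    PROCEED_WITH_DEFAULTS = "proceed with defaults"
    MUST_ASK = "must ask"


def classify_context(activity_count, concern, career_context_known):
    """Classify how much context is available before starting the assessment.

    concern is one of "specific", "general", or "unclear" (hypothetical labels).
    """
    # MUST ASK: nothing to assess, or no idea why the assessment was requested.
    if activity_count == 0 or concern == "unclear":
        return Sufficiency.MUST_ASK
    # SUFFICIENT: all three conditions hold (3 is the low end of "3-4").
    if activity_count >= 3 and concern == "specific" and career_context_known:
        return Sufficiency.SUFFICIENT
    # Anything in between falls back to the documented default assumptions.
    return Sufficiency.PROCEED_WITH_DEFAULTS
```

The MUST ASK check comes first so that a long list of activities cannot mask a concern that is still completely unclear.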
## Process
### Step 1: Assess Against the 8 Core Expectations
**ACTION:** Evaluate the architect against each of the eight core expectations. Score each as Strong, Adequate, Needs Improvement, or Missing. For detailed descriptions of each expectation, see [references/eight-expectations.md](references/eight-expectations.md).
**WHY:** These eight expectations define what an architect should be doing regardless of their title, organization, or seniority level. An architect who excels at 3 of 8 is failing the role even if those 3 are done brilliantly. The expectations are intentionally broad — they encompass technical, interpersonal, and organizational dimensions because the architect role sits at the intersection of all three.
| # | Expectation | What it means | Common failure mode |
|---|------------|---------------|---------------------|
| 1 | **Make architecture decisions** | Define architecture decisions and design principles to GUIDE (not specify) technology choices | Specifying "use React.js" instead of guiding "use a reactive-based framework" |
| 2 | **Continually analyze the architecture** | Assess how viable the architecture is given today's business and technology landscape, and recommend improvements | Designing once and never revisiting; architecture decay goes unnoticed |
| 3 | **Keep current with latest trends** | Stay up to date on technology and industry trends | Falling behind on technologies, making decisions based on outdated knowledge |
| 4 | **Ensure compliance with decisions** | Verify teams are following architecture decisions and design principles | Making rules but never checking if they're followed; architecture violations go uncaught |
| 5 | **Diverse exposure and experience** | Know multiple technologies, frameworks, and platforms — not just one stack | Only knowing Java/Spring and recommending it for every problem |
| 6 | **Have business domain knowledge** | Understand the business side, not just technology | Building technically elegant solutions that don't solve the actual business problem |
| 7 | **Possess interpersonal skills** | Teamwork, facilitation, leadership | Being technically brilliant but unable to collaborate, facilitate meetings, or lead teams |
| 8 | **Understand and navigate politics** | Navigate corporate politics effectively | Ignoring organizational dynamics and being surprised when good ideas get blocked |
**IF** the user describes their activities -> map each activity to one or more expectations and identify gaps
**IF** the user describes problems -> trace each problem to a failing expectation
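The gap-finding half of this step is simple bookkeeping once each activity has been mapped to expectations. The sketch below assumes the assessor has already made that mapping by judgment; `EXPECTATIONS` mirrors the table above, while `find_gaps` and the sample activities are hypothetical.

```python
# Expectation numbers and names, copied from the table above.
EXPECTATIONS = {
    1: "Make architecture decisions",
    2: "Continually analyze the architecture",
    3: "Keep current with latest trends",
    4: "Ensure compliance with decisions",
    5: "Diverse exposure and experience",
    6: "Have business domain knowledge",
    7: "Possess interpersonal skills",
    8: "Understand and navigate politics",
}


def find_gaps(activity_map):
    """activity_map: {activity description: [expectation numbers it supports]}.

    Returns the expectation numbers that no described activity covers.
    """
    covered = {n for numbers in activity_map.values() for n in numbers}
    return sorted(set(EXPECTATIONS) - covered)


# Hypothetical mapping for an architect who designs, writes ADRs, and reviews code.
gaps = find_gaps({
    "design systems": [1],
    "write ADRs": [1],
    "review code": [4],
})
for number in gaps:
    print(f"Expectation {number}: {EXPECTATIONS[number]} has no supporting activity")
```

An uncovered expectation is only a candidate for a Missing rating. The user may simply not have mentioned the relevant activity, so it is worth confirming before scoring.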
### Step 2: Evaluate Technical Breadth vs Depth
**ACTION:** Assess whether the architect has the right balance of technical breadth (knowing many technologies at a surface level) vs technical depth (knowing a few technologies deeply).
**WHY:** The knowledge pyramid has three zones: "stuff you know" (depth), "stuff you know you don't know" (breadth), and "stuff you don't know you don't know" (unknown unknowns). Developers should maximize depth. Architects should maximize breadth. When a developer becomes an architect, they must deliberately shift their learning investment: sacrifice some depth to expand breadth. An architect who only has depth makes poor decisions because they can only see solutions through the lens of technologies they know deeply.
**The Knowledge Pyramid:**
- **Stuff you know** (technical depth) — Your core expertise. As a developer, this is your strength. As an architect, this narrows your solution space.
- **Stuff you know you don't know** (technical breadth) — Technologies you're aware of and could evaluate but haven't used deeply. As an architect, THIS is your most valuable zone. It lets you identify the right technology for each problem, even if you need to dive deeper to implement it.
- **Stuff you don't know you don't know** (unknown unknowns) — The dangerous zone. Everything here is a potential blind spot. Expanding breadth shrinks this zone.
**Assessment criteria:**
- Does the architect recommend the same technology stack for every problem? -> Too much depth, not enough breadth
- Can the architect evaluate unfamiliar technologies against requirements? -> Good breadth
- Does the architect dismiss technologies they haven't used without evaluation? -> Depth bias
- Does the architect maintain awareness of the broader technology landscape? -> Good breadth practice
### Step 3: Check for the Frozen Caveman Anti-Pattern
**ACTION:** Determine whether the architect exhibits the Frozen Caveman pattern — irrationally reverting to past experience regardless of current context.
**WHY:** The Frozen Caveman Architect had a traumatic experience years or decades ago (a system failure due to scalability, a data breach, a vendor lock-in disaster) and now insists every new system must guard against that specific problem, even when it's irrelevant to the current context. This is not the same as learning from experience — it's the inability to objectively assess whether past lessons apply to the current situation.
**Warning signs:**
- Constantly references a specific past failure ("In 2018, our message queue crashed, so I never use message queues")
- Over-engineers for scenarios that are extremely unlikely in the current context
- Dismisses entire technology categories based on a single bad experience
- Cannot articulate current, evidence-based reasons for their position — only historical anecdotes
- Fear-driven decisions that don't match current requirements
**IF** Frozen Caveman pattern detected -> flag it explicitly with specific correction:
- Evaluate each architecture decision based on CURRENT context, requirements, and constraints
- Past experience should INFORM decisions but not DICTATE them
- Use risk assessment techniques to objectively evaluate whether historical concerns apply
- Ask: "What is the probability of that specific failure in THIS system, with TODAY's technology?"
### Step 4: Identify the Architecture vs Design Boundary
**ACTION:** Assess whether the architect is operating at the right level — making architecture decisions vs design decisions.
**WHY:** Architecture decisions affect the structure of the system and constrain or guide development teams. Design decisions affect implementation within those constraints. An architect who makes design decisions (choosing class structures, selecting design patterns, writing pseudocode) is micromanaging. An architect who doesn't make architecture decisions (letting the team decide service boundaries, communication protocols, data partitioning) is abdicating the role.
**The boundary test:** Does this decision affect the overall structure of the system?
- If YES -> architecture decision (architect's responsibility)
- If NO -> design decision (developer's responsibility)
- If UNCLEAR -> architecture decision, but the architect should guide rather than specify
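The boundary test reduces to a three-valued check. As a rough sketch (the function name and return shape are hypothetical, and `None` is used here to stand for UNCLEAR):

```python
def classify_decision(affects_structure):
    """Apply the boundary test.

    affects_structure: True, False, or None when the answer is unclear.
    Returns (decision kind, who owns it).
    """
    if affects_structure is True:
        return ("architecture", "architect decides")
    if affects_structure is False:
        return ("design", "developer decides")
    # Unclear cases default to architecture, with guidance instead of specification.
    return ("architecture", "architect guides rather than specifies")
```

For example, choosing service boundaries would be passed as `True`, while choosing a class structure inside one service would be passed as `False`.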
### Step 5: Generate the Assessment Report
**AGENT: EXECUTES** — produces the assessment
**ACTION:** Compile the findings into a structured assessment with specific recommendations for improvement.
**HANDOFF TO HUMAN** — the user implements the changes in their daily work
## Inputs
- Description of current architect activities and time allocation
- Specific concerns or symptoms
- Optionally: career history, team structure, organization type, manager feedback
## Outputs
### Architect Role Assessment
```markdown
# Architect Role Assessment
## Current Profile
- **Role tenure:** {how long as architect}
- **Organization context:** {company type, team size}
- **Primary concern:** {what triggered the assessment}
## Eight Expectations Scorecard
| # | Expectation | Rating | Evidence | Recommendation |
|---|------------|--------|----------|----------------|
| 1 | Make architecture decisions | {Strong/Adequate/Needs Improvement/Missing} | {what was observed} | {specific action} |
| 2 | Continually analyze | ... | ... | ... |
| 3 | Keep current with trends | ... | ... | ... |
| 4 | Ensure compliance | ... | ... | ... |
| 5 | Diverse exposure | ... | ... | ... |
| 6 | Business domain knowledge | ... | ... | ... |
| 7 | Interpersonal skills | ... | ... | ... |
| 8 | Navigate politics | ... | ... | ... |
## Technical Breadth vs Depth Assessment
- **Current balance:** {depth-heavy / balanced / breadth-heavy}
- **Knowledge pyramid status:** {description of current state}
- **Recommendation:** {specific actions to adjust balance}
## Anti-Pattern Check
- **Frozen Caveman:** {detected / not detected} — {evidence}
- **Other patterns:** {any other role anti-patterns observed}
## Architecture vs Design Boundary
- **Current boundary:** {operating too low / appropriate / too high}
- **Evidence:** {examples of boundary violations}
## Top 3 Priority Actions
1. {most impactful change}
2. {second priority}
3. {third priority}
```
## Key Principles
- **The 8 expectations are non-negotiable** — WHY: They define the role. An architect who excels at 3 expectations but ignores the other 5 is not an effective architect — they're a specialist with the wrong title. Every expectation matters because the role sits at the intersection of technology, business, and people.
- **Breadth over depth for architects** — WHY: An architect who only knows Java/Spring recommends Java/Spring for everything, even when Go, Rust, or Python would be better fits. Technical breadth enables pattern recognition across technologies: "This problem looks like an event-sourcing problem regardless of the implementation language." Depth can always be re-acquired when needed for a specific implementation.
- **The Frozen Caveman is one of the most damaging anti-patterns** — WHY: Unlike other anti-patterns that produce bad output, the Frozen Caveman produces fear-based decisions that LOOK prudent. "We must plan for extreme scale" sounds responsible. But over-engineering for a scenario from 2018 that doesn't apply to the current 50-user internal tool wastes budget, time, and team morale. The Frozen Caveman confuses caution with wisdom.
- **Guide, don't specify** — WHY: The first expectation says architects should GUIDE technology decisions, not SPECIFY them. "Use a reactive-based framework for the frontend" guides the team to evaluate Angular, React, Vue, and Svelte. "Use React.js" takes that evaluation away from the team, depriving them of learning and depriving the project of potentially better alternatives.
- **Architecture decisions have structural scope** — WHY: If a decision doesn't affect the system's structure, it's a design decision and belongs to the developer. Architects who make design decisions become bottlenecks and steal the art of programming from their teams. Architects who don't make architecture decisions leave structural voids that the team fills inconsistently.
## Examples
**Scenario: New architect spending 80% of time coding**
Trigger: "I just got promoted to architect from senior developer. I'm still spending 80% of my time coding. My team says I'm a bottleneck on code reviews."
Process: Assessed against the 8 expectations. Found Expectation 1 (architecture decisions) as Needs Improvement — the user is making design decisions via code reviews instead of architecture decisions. Expectations 2-4 likely Missing since coding leaves no time for architecture analysis, trend-following, or compliance verification. Expectations 6-8 (business domain, interpersonal, politics) at risk due to time allocation. Diagnosed the architecture-vs-design boundary violation: the architect is operating at the design level, which is the developer's domain. Technical breadth assessment: likely depth-heavy since coding maintains depth at the expense of breadth. Recommended: reduce coding to 20-30%, shift to proof-of-concepts and fitness functions rather than production code, delegate code review responsibility to senior developers, and invest freed time in Expectations 2-8.
Output: Assessment showing 2 of 8 expectations met, depth-heavy knowledge profile, design-level boundary violation, and a 90-day transition plan.
**Scenario: Architect who only recommends familiar technologies**
Trigger: "I've been an architect for 5 years but I only recommend technologies I've used before — Java/Spring for everything. My team wants to try Go and Kafka but I keep saying no because I had a bad experience with message queues in 2018."
Process: Identified two issues: (1) Frozen Caveman anti-pattern — a 2018 message queue failure is driving current technology decisions without evaluating whether the same risk applies to today's context with today's tools. (2) Expectation 5 (diverse exposure) failure — recommending Java/Spring for everything indicates depth without breadth. Assessed knowledge pyramid: "stuff you know" (Java/Spring) is deep, "stuff you know you don't know" (Go, Kafka, modern event streaming) is being actively suppressed rather than explored. Correction: objectively evaluate whether the 2018 failure conditions exist in the current system; explore Go and Kafka through a proof-of-concept rather than dismissing them; set a goal of evaluating 2 unfamiliar technologies per quarter through lightweight experiments.
Output: Assessment flagging Frozen Caveman anti-pattern and Expectation 5 failure, with a structured technology exploration plan.
**Scenario: Architect who avoids stakeholder work**
Trigger: "I'm an architect but I never attend stakeholder meetings. I design systems, write ADRs, and review code. My manager says I need to 'be more strategic.'"
Process: Assessed against 8 expectations. Strong on Expectation 1 (architecture decisions via ADRs), Adequate on Expectation 4 (compliance via code reviews). Missing on Expectations 6 (business domain knowledge — can't learn the business without attending stakeholder meetings), 7 (interpersonal skills — not being exercised), and 8 (navigate politics — completely absent). The manager's feedback ("be more strategic") translates to: "you're fulfilling the technical expectations but ignoring the organizational expectations." Architecture doesn't happen in a vacuum — it must align with business goals, and that requires understanding the business. Recommended: attend 2 stakeholder meetings per week, schedule 1-on-1s with product managers to understand business priorities, and frame all ADRs with a "Business Context" section.
Output: Assessment showing 3 of 8 expectations met, organizational-expectation gap analysis, and specific stakeholder engagement plan.
## References
- For detailed descriptions of each of the 8 expectations with self-assessment questions and improvement strategies, see [references/eight-expectations.md](references/eight-expectations.md)
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Fundamentals of Software Architecture by Mark Richards, Neal Ford.
## Related BookForge Skills
This skill is standalone. Browse more BookForge skills: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/eight-expectations.md
# Eight Expectations of a Software Architect
Detailed reference for each of the eight core expectations, with self-assessment questions and improvement strategies.
## Expectation 1: Make Architecture Decisions
*An architect is expected to define the architecture decisions and design principles used to guide technology choices within the team, the department, or across the enterprise.*
### Key Distinction: Guide vs Specify
The key operative word is **guide**. An architect should guide technology choices, not specify them.
- **Guiding** (correct): "Use a reactive-based framework for frontend web development" — the team evaluates Angular, React, Vue, Svelte and chooses the best fit
- **Specifying** (overreach): "Use React.js for the frontend" — this is a technical decision, not an architectural one, unless there's a specific architectural reason (e.g., server-side rendering requirements that favor Next.js)
An architect can make specific technology decisions when there is an architectural reason — when the choice affects scalability, performance, availability, or another architecture characteristic. But the default posture should be guidance.
### Self-Assessment Questions
- Do my decisions guide the team or dictate to them?
- Can I articulate the architecture characteristic that justifies each decision?
- Are my decisions documented (ideally in ADRs)?
- Do my decisions form constraints (boundaries) or do they make implementation choices?
### Improvement Strategies
- Review each recent technology decision and ask: "Is this architecture or design?"
- Convert specifications to guidelines where possible
- Start writing ADRs for significant decisions
## Expectation 2: Continually Analyze the Architecture
*An architect is expected to continually analyze the architecture and current technology environment and then recommend solutions for improvement.*
### What This Means
This expectation is about architecture vitality: assessing how viable an architecture designed three or more years ago still is today, given changes in both business and technology. Most architects don't focus enough energy on this. As a result, architectures experience structural decay as developers make coding or design changes that impact architecture characteristics.
### Self-Assessment Questions
- When was the last time I evaluated whether our current architecture still meets our needs?
- Am I monitoring architecture characteristics (performance, scalability, availability)?
- Are we tracking architecture fitness functions?
- Have I assessed our testing and release environments for agility?
### Improvement Strategies
- Schedule quarterly architecture health checks
- Implement fitness functions for critical architecture characteristics
- Create a technology radar for the team/organization
## Expectation 3: Keep Current with Latest Trends
*An architect is expected to keep current with the latest technology and industry trends.*
### Why This Matters
Architecture decisions are long-lasting and difficult to change. Understanding key trends helps the architect prepare for the future and make the correct decision. An architect making decisions based on 5-year-old knowledge is building architectures optimized for yesterday's constraints.
### Self-Assessment Questions
- Can I name 3 technology trends from the last 12 months that could affect our architecture?
- Am I reading/watching/attending conferences, tech talks, or community discussions?
- Do I maintain a technology radar or similar tracking mechanism?
### Improvement Strategies
- Dedicate 30 minutes per day to technology trend awareness
- Maintain a personal technology radar
- Attend at least 2 conferences or significant tech community events per year
- Follow thought leaders and industry publications
## Expectation 4: Ensure Compliance with Decisions
*An architect is expected to ensure compliance with architecture decisions and design principles.*
### What This Means
Making decisions is only half the job. The architect must verify that development teams are actually following those decisions. This includes:
- Architecture decisions (e.g., layering rules, service boundaries)
- Design principles (e.g., async messaging between services)
Without compliance checking, architecture violations accumulate until the actual system no longer matches the designed system.
### Self-Assessment Questions
- How do I verify teams follow architecture decisions?
- Do I have automated fitness functions checking structural compliance?
- When was the last time I found and corrected an architecture violation?
- Do developers know which decisions are architectural constraints vs guidelines?
### Improvement Strategies
- Implement automated fitness functions (e.g., ArchUnit, NetArchTest)
- Conduct periodic architecture reviews
- Create an architecture compliance checklist for code reviews
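ArchUnit and NetArchTest target Java and .NET respectively. To illustrate the idea of a structural fitness function without depending on either, here is a standard-library Python sketch that flags imports crossing a layering rule. The rule itself ("domain code must not import from `app.web`"), the package names, and the `forbidden_imports` helper are all assumptions made for the example.

```python
import ast


def forbidden_imports(source, forbidden_prefixes):
    """Return imported module names in `source` that fall under a forbidden package.

    Relative imports without a module name (e.g. `from . import x`) are ignored.
    """
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            # Match the package itself or any submodule, but not lookalike names.
            if any(name == p or name.startswith(p + ".") for p in forbidden_prefixes):
                violations.append(name)
    return violations


# Hypothetical domain-layer module that reaches into the web layer.
domain_source = "import json\nfrom app.web.views import render\n"
violations = forbidden_imports(domain_source, ["app.web"])
```

Run as part of a test suite, a check like this fails the build when a violation appears, so compliance does not depend on someone noticing it in code review.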
## Expectation 5: Diverse Exposure and Experience
*An architect is expected to have exposure to multiple and diverse technologies, frameworks, platforms, and environments.*
### The Breadth Imperative
This doesn't mean being an expert in everything. It means knowing enough about diverse technologies to:
- Evaluate them against requirements
- Understand their trade-offs
- Recognize when a familiar technology is the wrong choice
An architect who only knows Java/Spring will recommend Java/Spring for everything, even when Go, Python, or Rust would be better fits.
### Self-Assessment Questions
- How many distinct technology stacks have I worked with in the last 3 years?
- Can I evaluate a technology I haven't used against our requirements?
- Do I dismiss technologies I haven't used without investigation?
- When was the last time I recommended something outside my comfort zone?
### Improvement Strategies
- Set a goal of exploring 1-2 new technologies per quarter through lightweight experiments
- Build proof-of-concepts in unfamiliar stacks
- Read about technologies in domains adjacent to yours
- Deliberately sacrifice some depth for breadth in learning time allocation
## Expectation 6: Have Business Domain Knowledge
*An architect is expected to have a certain level of business domain knowledge.*
### Why Technical Skills Aren't Enough
An architect who doesn't understand the business domain will build technically elegant solutions that don't solve the actual business problem. Without business domain knowledge:
- Architecture characteristics can't be properly identified (which -ilities matter depends on the business)
- Trade-off decisions lack business context
- Communication with stakeholders breaks down
### Self-Assessment Questions
- Can I explain our business model without using any technical terms?
- Do I understand the competitive landscape we operate in?
- Can I identify which business processes are most critical to revenue/success?
- Do I attend stakeholder meetings regularly?
### Improvement Strategies
- Attend product/business meetings regularly
- Schedule 1-on-1s with product managers and business stakeholders
- Read industry publications (not just tech blogs)
- Frame architecture decisions in business terms
## Expectation 7: Possess Interpersonal Skills
*An architect is expected to possess exceptional interpersonal skills, including teamwork, facilitation, and leadership.*
### The Soft Skills Are Not Optional
Technical brilliance without interpersonal skills produces architects who:
- Cannot facilitate productive meetings
- Alienate team members with abrasive communication
- Fail to build consensus around architectural direction
- Lose influence because people avoid working with them
### Self-Assessment Questions
- Can I facilitate a productive meeting with 10 people who disagree?
- Do people seek out my input or avoid it?
- Can I give constructive feedback without creating defensiveness?
- Do I listen as much as I speak in technical discussions?
### Improvement Strategies
- Practice active listening — summarize others' points before responding
- Facilitate (not dominate) architecture review meetings
- Seek feedback on communication style from trusted colleagues
- Invest in leadership training
## Expectation 8: Understand and Navigate Politics
*An architect is expected to understand the political climate of the enterprise and be able to navigate the politics.*
### The Uncomfortable Truth
Almost every architecture decision is also a political decision. Budget allocation, team structure, technology standards, vendor selection — all involve stakeholders with competing interests. An architect who ignores politics will:
- Propose technically excellent solutions that never get funded
- Make enemies by disrupting established power structures without coalition-building
- Be outmaneuvered by less technically skilled but more politically savvy colleagues
### Self-Assessment Questions
- Do I understand who the key decision-makers are and what motivates them?
- Can I anticipate which stakeholders will support or oppose my recommendations?
- Do I build coalitions before proposing significant changes?
- Am I surprised when good ideas get blocked for "non-technical reasons"?
### Improvement Strategies
- Map the stakeholder landscape: who has power, what they care about, who influences whom
- Build relationships with key stakeholders before you need something from them
- Learn to frame proposals in terms that align with stakeholder priorities
- Practice the negotiation techniques from the stakeholder-negotiation-planner skill
## The Knowledge Pyramid
### Three Zones of Knowledge
```
/\
/ \
/ S \ "Stuff you don't know you don't know"
/ T U \ (Unknown unknowns — the dangerous zone)
/ U F \ Expanding breadth shrinks this zone
/ F F \
/ F \
/____________\
/ Stuff you \ "Stuff you know you don't know"
/ know you \ (Known unknowns — ARCHITECT's key zone)
/ don't know \ This is technical BREADTH
/___________________\
| |
| Stuff you know | "Stuff you know"
| | (Known knowns — DEVELOPER's key zone)
| (Technical DEPTH) | This is technical DEPTH
|_____________________|
```
### The Transition from Developer to Architect
As a developer, your value comes from **depth** — being the expert in your technology stack.
As an architect, your value shifts to **breadth** — being able to evaluate many technologies against requirements and select the right one for each problem.
This transition requires a **deliberate sacrifice**: you must give up some depth to build breadth. This feels uncomfortable because depth is how you built your career. But an architect who maintains developer-level depth in one stack at the expense of breadth across many stacks will make poor architecture decisions.
**Practical guidance:**
- Maintain depth in 1-2 technologies (enough to stay credible and build POCs)
- Build breadth across 10-20 technologies (enough to evaluate and compare)
- Invest 70% of learning time in breadth, 30% in depth maintenance
## The Frozen Caveman Anti-Pattern
A Frozen Caveman Architect always reverts to their past experience, regardless of whether it applies to the current situation.
**Example:** An architect who experienced a system failure in 1997 due to scalability issues now insists every system must handle extreme scale, even when the current application is an internal tool with 50 users.
**How to differentiate learning from experience vs Frozen Caveman:**
| Learning from experience (good) | Frozen Caveman (anti-pattern) |
|------|------|
| "We should consider scalability because our user base is growing 200% yearly" | "We MUST build for massive scale because my system crashed in 2018" |
| Decision grounded in current data | Decision grounded in past trauma |
| Can articulate current risk factors | Can only reference historical incidents |
| Considers probability in current context | Ignores probability differences |
| Open to evidence that the risk is different now | Refuses to reconsider |
**Correction:** For every decision driven by past experience, ask: "What is the probability of this specific failure in THIS system, with TODAY's technology, given our CURRENT requirements?" If the answer is "low," the past experience is informing a bias, not a decision.
Determine the appropriate level of architect control over a development team using a quantitative 5-factor scoring model (-100 to +100 scale). Use this skill...
---
name: architect-control-calibrator
description: Determine the appropriate level of architect control over a development team using a quantitative 5-factor scoring model (-100 to +100 scale). Use this skill whenever the user asks how much they should be involved in a team's decisions, how hands-on or hands-off to be as an architect, how to calibrate their leadership style, whether they are micromanaging developers, whether they should give the team more autonomy, or any question about architect involvement level, team oversight, or technical leadership balance — even if they don't explicitly say "control." Also triggers when the user describes team dysfunction symptoms like merge conflicts increasing, nobody speaking up in meetings, or tasks falling through cracks.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/fundamentals-of-software-architecture/skills/architect-control-calibrator
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
source-books:
- id: fundamentals-of-software-architecture
title: "Fundamentals of Software Architecture"
authors: ["Mark Richards", "Neal Ford"]
chapters: [22]
tags: [software-architecture, architecture, leadership, team-management, control, elastic-leadership]
depends-on: []
execution:
tier: 1
mode: full
inputs:
- type: none
description: "Team and project context from the user — team size, experience, familiarity, project complexity and duration"
tools-required: [Read, Write]
tools-optional: []
mcps-required: []
environment: "Any agent environment. No codebase required."
---
# Architect Control Calibrator
## When to Use
You need to determine how much hands-on control an architect should exercise over a development team. Typical triggers:
- The user is a new architect joining an existing team and wants to know how involved to be
- The user is wondering if they are micromanaging or under-managing their team
- The user describes team problems (merge conflicts, silence in reviews, dropped tasks) and wants guidance
- The user is starting a new project and needs to set the right leadership posture
- The user asks about "elastic leadership" or calibrating their architect involvement
Before starting, verify:
- Is there a team to assess? (At minimum, team size and general experience level)
- Is there a project context? (Complexity and expected duration)
## Context
### Required Context (must have before proceeding)
- **Team size:** How many developers on the team?
-> Check prompt for: numbers, team descriptions, "my team of N"
-> If still missing, ask: "How many developers are on the team?"
- **Team experience level:** Are they mostly junior, mid-level, or senior?
-> Check prompt for: "junior," "senior," "experienced," "fresh out of college," years of experience
-> If still missing, ask: "What is the overall experience level of the team — mostly junior, mid-level, or senior?"
### Observable Context (gather from environment if available)
- **Team familiarity:** How long has the team worked together?
-> Check prompt for: "new team," "worked together for X years," "just formed"
-> If unavailable: assume moderate familiarity (score 0)
- **Project complexity:** How complex is the system being built?
-> Check prompt for: "distributed," "microservices," "simple CRUD," "complex," architecture descriptions
-> If unavailable: assume moderate complexity (score 0)
- **Project duration:** How long is the project expected to last?
-> Check prompt for: months, years, timeline references
-> If unavailable: assume moderate duration (score 0)
- **Team dysfunction signals:** Are there signs of process loss, pluralistic ignorance, or diffusion of responsibility?
-> Check prompt for: merge conflicts, silence in meetings, dropped tasks, nobody speaking up
-> If present: flag and address in output
### Default Assumptions
- If team familiarity unknown -> score 0 (moderate) and note the assumption
- If project complexity unknown -> score 0 (moderate) and note the assumption
- If project duration unknown -> score 0 (moderate) and note the assumption
- If experience level is "mixed" -> score 0 (moderate), but note that mixed teams often need more guidance than the score suggests
### Sufficiency Threshold
```
SUFFICIENT when ALL of these are true:
- Team size is known
- Team experience level is known or can be inferred
- At least 1 other factor (familiarity, complexity, duration) is known
PROCEED WITH DEFAULTS when:
- Team size and experience are known
- Other factors can use moderate defaults
- Team dysfunction signals can be assessed from context
MUST ASK when:
- Team size is completely unknown (cannot calibrate without it)
- Experience level is unknown AND cannot be inferred from context
```
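The threshold above can be read as a small decision function. The following is a minimal Python sketch, not part of the original skill: the `Sufficiency` enum, the `check_sufficiency` name, and the convention that `None` means "unknown" are illustrative assumptions.

```python
from enum import Enum
from typing import Optional


class Sufficiency(Enum):
    SUFFICIENT = "sufficient"
    PROCEED_WITH_DEFAULTS = "proceed_with_defaults"
    MUST_ASK = "must_ask"


def check_sufficiency(
    team_size: Optional[int],
    experience: Optional[str],
    familiarity: Optional[str] = None,
    complexity: Optional[str] = None,
    duration: Optional[str] = None,
) -> Sufficiency:
    """Classify gathered context. None means the value is unknown (assumption)."""
    # Team size and experience are the two required inputs.
    if team_size is None or experience is None:
        return Sufficiency.MUST_ASK
    # At least one of the three remaining factors must be known to be SUFFICIENT.
    other_known = sum(v is not None for v in (familiarity, complexity, duration))
    if other_known >= 1:
        return Sufficiency.SUFFICIENT
    # Otherwise the remaining factors fall back to the moderate default (score 0).
    return Sufficiency.PROCEED_WITH_DEFAULTS
```

For instance, `check_sufficiency(8, "senior")` falls into the defaults branch, while a missing team size always yields `MUST_ASK`.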
## Process
### Step 1: Score Each Factor
**ACTION:** Score each of the five control factors on a scale from -20 to +20. Use the scoring guide below. For detailed breakdowns with intermediate values, see [references/control-scoring-guide.md](references/control-scoring-guide.md).
**WHY:** Each factor independently influences how much architect control is appropriate. Scoring them separately prevents one dominant factor from masking others. A senior team working on a complex project needs different handling than a junior team on a simple one — the individual factor scores reveal this nuance.
| Factor | -20 (less control) | 0 (moderate) | +20 (more control) |
|--------|-------------------|--------------|-------------------|
| **Team familiarity** | Established team, worked together 2+ years | Some familiarity, 6-12 months | Brand new team, never worked together |
| **Team size** | Small (4 or fewer) | Medium (7-9) | Large (12+) |
| **Overall experience** | Mostly senior (8+ years) | Mixed or mid-level | Mostly junior (0-2 years) |
| **Project complexity** | Simple (CRUD, well-understood domain) | Moderate | Highly complex (distributed, novel domain) |
| **Project duration** | Short (< 3 months) | Medium (6-12 months) | Long (18+ months) |
**IF** a factor is unknown -> score 0 and note the assumption
**IF** a factor falls between values -> interpolate (e.g., team of 10 is between medium and large, score +10)
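One way to apply the "unknown -> score 0 and note the assumption" rule is to store the note next to the score. This is a minimal Python sketch under stated assumptions: `FactorScore`, `score_factor`, and the short factor keys are hypothetical names chosen for illustration, not part of the skill.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical short keys for the five factors in the table above.
FACTORS = ("familiarity", "size", "experience", "complexity", "duration")


@dataclass(frozen=True)
class FactorScore:
    factor: str
    score: int
    rationale: str
    assumed: bool = False  # True when the moderate default was used


def score_factor(factor: str, score: Optional[int], rationale: str = "") -> FactorScore:
    """Validate one factor score; an unknown factor (None) becomes 0 with a note."""
    if factor not in FACTORS:
        raise ValueError(f"unknown factor: {factor}")
    if score is None:
        return FactorScore(factor, 0, "assumed moderate (not specified)", assumed=True)
    if not -20 <= score <= 20:
        raise ValueError(f"{factor} score must be between -20 and +20")
    return FactorScore(factor, score, rationale)
```

Keeping the `assumed` flag makes it easy to list every defaulted factor in the final report.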
### Step 2: Calculate Total Score
**ACTION:** Sum all five factor scores to get the total control score (range: -100 to +100).
**WHY:** The total determines the overall control posture. Individual scores matter for understanding WHY the total is what it is, but the total drives the primary recommendation. The scale is intentionally symmetrical — neither extreme is inherently better. The right level depends entirely on context.
Interpret the total:
- **+60 to +100:** High control — Be very hands-on. Attend stand-ups, review all major technical decisions, pair-program on critical paths, create detailed technical guidance. This is NOT micromanaging — the team needs this level of support.
- **+20 to +59:** Moderate-high control — Attend key meetings, review architecture-impacting decisions, provide templates and patterns, check in regularly but don't dictate implementation details.
- **-19 to +19:** Balanced — Facilitate rather than direct. Set architecture boundaries, let the team decide implementation. Be available for guidance. This is the "effective architect" zone.
- **-59 to -20:** Moderate-low control — Focus on high-level guardrails only. Trust the team to make most technical decisions. Intervene only when architecture principles are at risk.
- **-100 to -60:** Low control — Be a strategic advisor. Set vision and principles, then step back. The team is capable and cohesive. Over-involvement will frustrate them and slow them down.
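The summation and the five interpretation bands can be sketched as follows. This is an illustrative Python sketch only; the function names and factor keys are assumptions, while the band boundaries are taken from the list above.

```python
from typing import Mapping

# Hypothetical short keys for the five factors.
FACTORS = ("familiarity", "size", "experience", "complexity", "duration")


def total_score(scores: Mapping[str, int]) -> int:
    """Sum the five factor scores; a missing factor counts as 0 (moderate)."""
    for name, value in scores.items():
        if name not in FACTORS:
            raise ValueError(f"unknown factor: {name}")
        if not -20 <= value <= 20:
            raise ValueError(f"{name} must be between -20 and +20")
    return sum(scores.get(name, 0) for name in FACTORS)


def control_level(total: int) -> str:
    """Map a total in [-100, +100] to the control level named above."""
    if not -100 <= total <= 100:
        raise ValueError("total must be between -100 and +100")
    if total >= 60:
        return "High"
    if total >= 20:
        return "Moderate-High"
    if total >= -19:
        return "Balanced"
    if total >= -59:
        return "Moderate-Low"
    return "Low"
```

As a usage example, a brand new junior team of 6 on a short, simple project (+20, -10, +20, -15, -15) sums to 0, which lands in the Balanced band.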
### Step 3: Check for Architect Personality Anti-Patterns
**ACTION:** Based on the total score and the user's described behavior, check whether the architect is falling into a personality anti-pattern.
**WHY:** The two extremes of the control spectrum represent well-known anti-patterns. An architect who exercises high control over a team that scores -80 is a Control Freak. An architect who stays absent from a team that scores +80 is an Armchair Architect. The score tells you what the control level SHOULD be, but the user's described behavior might not match it.
**Control Freak Architect** (too much control for the context):
- Dictates class designs and design patterns to developers
- Restricts use of any external libraries without approval
- Writes pseudocode for the development team
- Makes implementation-level decisions that should belong to developers
- **Impact:** Takes the craft of programming away from developers, causing frustration and turnover
- **Correction:** Focus on component-level architecture, not implementation details
**Armchair Architect** (too little control for the context):
- Architecture diagrams are too high-level to be actionable
- Doesn't understand the technology stack the team is using
- Moves between projects without staying for implementation
- No regular time spent with the development team
- **Impact:** Architect is disconnected from reality, and the team is left to figure things out alone
- **Correction:** Stay involved through implementation, spend time with the team
**IF** the user describes behavior matching a personality anti-pattern AND the score supports it -> flag the anti-pattern and provide specific correction steps
**IF** the described behavior doesn't match the level the score recommends -> name the gap and recommend adjusting toward the scored level
### Step 4: Scan for Team Warning Signs
**ACTION:** Check the user's description for three critical team dysfunction signals. These indicate the team may be too large or that control needs recalibration regardless of the score.
**WHY:** These warning signs override the numerical score. A team scoring -40 (experienced, established) can still exhibit dysfunction that requires immediate architect intervention. Detecting these signs early prevents project failure.
**Process Loss (Brooks's Law):**
- **Signal:** Merge conflicts have increased, developers stepping on each other's code, adding people hasn't improved velocity
- **Root cause:** Too many people working in overlapping areas
- **Response:** Look for areas of parallelism. Move developers to parallel tracks where they won't conflict. Consider splitting the team.
**Pluralistic Ignorance:**
- **Signal:** Team agrees publicly with decisions but complains privately. Nobody raises concerns in architecture reviews. Silence during design discussions.
- **Root cause:** Social pressure to conform. Team members privately disagree but assume everyone else agrees.
- **Response:** Observe body language during meetings. Directly ask quieter members for their opinion. Create anonymous feedback channels. As the architect, act as a facilitator who draws out dissent.
**Diffusion of Responsibility:**
- **Signal:** Tasks get dropped because everyone assumes someone else will do it. Unclear ownership. "I thought you were handling that."
- **Root cause:** Accountability gaps that grow with team size. The larger the team, the easier it is for individuals to assume someone else is responsible.
- **Response:** Assign explicit owners to every task and architecture concern. Question whether new team members are actually needed. Reduce team size if possible.
**IF** warning signs are detected -> include them in the output with specific remediation steps
**IF** no warning signs mentioned -> note that the architect should monitor for these throughout the project
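A first-pass scan for the three signals can be as simple as phrase matching on the user's description. The sketch below is a rough Python heuristic, not a definitive detector: the phrase lists are assumptions loosely drawn from the signals described above, and a real assessment still needs human judgment.

```python
from typing import Dict, List, Tuple

# Illustrative trigger phrases per warning sign (assumed, not exhaustive).
WARNING_SIGNALS: Dict[str, Tuple[str, ...]] = {
    "process loss": ("merge conflict", "stepping on each other"),
    "pluralistic ignorance": ("nobody raised", "nobody speaking up", "silence", "complains privately"),
    "diffusion of responsibility": ("dropped", "unclear ownership", "i thought you were handling"),
}


def detect_warning_signs(description: str) -> List[str]:
    """Return the warning signs whose trigger phrases appear in the description."""
    text = description.lower()
    return [
        name
        for name, phrases in WARNING_SIGNALS.items()
        if any(phrase in text for phrase in phrases)
    ]
```

An empty result does not mean the team is healthy; it only means none of the assumed phrases appeared, so the monitoring note still applies.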
### Step 5: Generate Recommendations
**ACTION:** Produce a calibrated recommendation with specific behaviors the architect should adopt at the determined control level. Include what to do, what NOT to do, and when to recalibrate.
**WHY:** A number alone isn't actionable. The architect needs specific guidance on behaviors: which meetings to attend, what decisions to own vs delegate, how to provide guidance without over-controlling. The recommendation must be concrete enough to change behavior on Monday morning.
Include:
1. **Control posture summary** — one-sentence description of the recommended level
2. **Specific behaviors to adopt** — 4-6 concrete actions at this control level
3. **Specific behaviors to avoid** — 3-4 things NOT to do (the anti-pattern behaviors for this level)
4. **Recalibration triggers** — when to re-score (team membership changes, project phase shifts, complexity changes)
5. **Warning sign monitoring** — which dysfunction signals to watch for given the team profile
## Inputs
- Team description (size, experience, familiarity)
- Project description (complexity, duration, technology)
- Optionally: current architect behavior, team dynamics observations, specific concerns
## Outputs
### Architect Control Calibration Report
```markdown
# Architect Control Calibration
## Team & Project Profile
- **Team size:** {N developers}
- **Team familiarity:** {description}
- **Overall experience:** {description}
- **Project complexity:** {description}
- **Project duration:** {description}
## Control Score
| Factor | Score | Rationale |
|--------|-------|-----------|
| Team familiarity | {-20 to +20} | {why this score} |
| Team size | {-20 to +20} | {why this score} |
| Overall experience | {-20 to +20} | {why this score} |
| Project complexity | {-20 to +20} | {why this score} |
| Project duration | {-20 to +20} | {why this score} |
| **Total** | **{-100 to +100}** | |
## Control Level: {High/Moderate-High/Balanced/Moderate-Low/Low}
{One-sentence summary of recommended posture}
## Recommended Behaviors
### DO:
1. {specific action}
2. {specific action}
...
### DON'T:
1. {specific anti-pattern to avoid}
2. {specific anti-pattern to avoid}
...
## Team Health Assessment
{Warning signs detected or "No warning signs detected — monitor for: ..."}
## When to Recalibrate
- {trigger 1}
- {trigger 2}
- {trigger 3}
```
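The Control Score table in the template is mechanical to fill in once the factors are scored. Here is a minimal Python sketch of that step; `render_score_table` and the `(factor, score, rationale)` row shape are hypothetical, chosen only to mirror the template columns.

```python
from typing import Iterable, Tuple

Row = Tuple[str, int, str]  # (factor label, score, rationale)


def _signed(score: int) -> str:
    """Format a score with an explicit sign, leaving zero unsigned."""
    return "0" if score == 0 else f"{score:+d}"


def render_score_table(rows: Iterable[Row]) -> str:
    """Render the Control Score section as a Markdown table with a total row."""
    lines = ["| Factor | Score | Rationale |", "|--------|-------|-----------|"]
    total = 0
    for factor, score, rationale in rows:
        lines.append(f"| {factor} | {_signed(score)} | {rationale} |")
        total += score
    lines.append(f"| **Total** | **{_signed(total)}** | |")
    return "\n".join(lines)
```

Computing the total inside the renderer avoids a hand-typed sum that disagrees with the rows.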
## Key Principles
- **Control is not inherently good or bad** — Too much control on a senior, established team suffocates them. Too little control on a junior, new team leaves them floundering. The right level is determined by context, not by ideology. An architect who always defaults to high control or always defaults to low control is failing to read the room.
- **The score is a starting point, not a verdict** — The 5-factor model provides an objective baseline, but real teams are messier than any model. Use the score to check your instincts. If your gut says "more control" but the score says "less," one of you is wrong — investigate which.
- **Re-evaluate throughout the project lifecycle** — A team that starts at +60 (new team, complex project) may shift to -20 six months later as familiarity grows and complexity becomes understood. The factors are not static. Set a calendar reminder to re-score quarterly.
- **Watch for warning signs regardless of score** — Process loss, pluralistic ignorance, and diffusion of responsibility can appear at any control level. A score of -60 doesn't mean "ignore the team." It means "facilitate rather than direct" — but still observe, still engage, still monitor.
- **Architect personality anti-patterns are the real danger** — The biggest risk is not getting the score wrong. It's letting your natural personality override the score. Control Freaks will over-control even when the score says back off. Armchair Architects will under-engage even when the score says step up. Know your tendency and actively counter it.
- **Team size is a leading indicator** — When a team exceeds 10-12 developers, warning signs almost always appear. Process loss scales non-linearly with team size. If the team is large and there are no warning signs, either you're not looking hard enough or the team is exceptionally well-organized.
## Examples
**Scenario: New architect joining an established senior team**
Trigger: "I'm a new architect joining a team of 20 developers. They've been working together for 3 years on a complex distributed system. Most are senior engineers. The project is expected to last 2 more years."
Process: Scored factors: team familiarity -20 (established 3+ years), team size +20 (20 developers, very large), overall experience -20 (mostly senior), project complexity +20 (complex distributed), project duration +15 (2 more years, long). Total: +15 (balanced). However, flagged team size as a major concern — 20 developers is well above the threshold where process loss appears. Checked for warning signs: with 20 developers on a distributed system, process loss is almost certain. Recommended balanced approach with strong emphasis on monitoring warning signs and potentially recommending team splits. Anti-pattern check: the new architect may be tempted toward Armchair Architect with such a senior team, but the large size and complexity demand active presence.
Output: Control calibration report showing +15 (balanced) with warning about team size and specific monitoring plan for process loss.
**Scenario: Leading a junior team on a simple project**
Trigger: "I'm leading a team of 6 junior developers fresh out of college. We're building a simple internal CRUD tool that should take about 3 months. They've never worked together before."
Process: Scored factors: team familiarity +20 (brand new team), team size -10 (6 is small-to-medium), overall experience +20 (all junior), project complexity -15 (simple CRUD), project duration -15 (3 months). Total: 0 (balanced). Despite the balanced score, noted that two factors are at maximum (+20): familiarity and experience. This means the architect should lean toward more guidance on team process and technical mentoring, even though the simple project and short duration pull the score down. Recommended attending daily stand-ups for the first month, providing coding standards and review templates, but NOT dictating implementation details.
Output: Control calibration report showing 0 (balanced) with nuance that mentoring is critical for this team profile.
**Scenario: Team showing dysfunction warning signs**
Trigger: "I'm the architect for a team of 12 mid-level developers. We're 6 months into a complex microservices migration. Merge conflicts have tripled in the last month, and in our last architecture review, nobody raised any concerns even though I know there are issues."
Process: Scored factors: team familiarity 0 (assumed moderate, not specified), team size +15 (12 is large), overall experience 0 (mid-level), project complexity +20 (complex microservices migration), project duration +10 (assumed 12-18 months for migration). Total: +45 (moderate-high control). But the critical finding was two warning signs: process loss (tripled merge conflicts = developers stepping on each other's code) and pluralistic ignorance (silence in architecture reviews = false consensus). Recommended immediate interventions: for process loss, identify overlapping work areas and assign clear service ownership boundaries; for pluralistic ignorance, switch to 1-on-1 architecture discussions and create anonymous concern channels. Score says moderate-high control, and the warning signs confirm the team needs MORE architect involvement, not less.
Output: Control calibration report showing +45 (moderate-high) with urgent team health alerts for process loss and pluralistic ignorance, including specific remediation steps.
## References
- For detailed factor scoring with intermediate values and worked examples, see [references/control-scoring-guide.md](references/control-scoring-guide.md)
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Fundamentals of Software Architecture by Mark Richards, Neal Ford.
## Related BookForge Skills
This skill is standalone. Browse more BookForge skills: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/control-scoring-guide.md
# Control Scoring Guide
Detailed breakdown of each factor in the 5-factor architect control model, with intermediate values and scoring rationale.
## Factor 1: Team Familiarity
How well does the team know each other? Have they worked together before?
| Score | Description | Rationale |
|-------|-------------|-----------|
| +20 | Brand new team, never worked together | No established norms, communication patterns, or trust. Architect must help set team dynamics and resolve early conflicts. |
| +10 | Partially new — some members know each other, some are new | Mixed familiarity creates subgroups. Architect helps integrate new members and ensure consistent practices. |
| 0 | Team has worked together 6-12 months | Norms are forming but not fully established. Moderate guidance still beneficial. |
| -10 | Established team, 1-2 years together | Strong norms and communication patterns. Team self-organizes effectively on most issues. |
| -20 | Long-standing team, 2+ years together | Deep trust, established conflict resolution, proven collaboration. Architect intervention in team dynamics is unnecessary and counterproductive. |
**Key insight:** Team familiarity is not about individual skill — it's about how well the group functions as a unit. A team of senior engineers who have never worked together still needs more control than a team of mid-level engineers who have been collaborating for years.
## Factor 2: Team Size
How many developers are on the team?
| Score | Description | Rationale |
|-------|-------------|-----------|
| +20 | 12+ developers | Communication paths grow quadratically (n*(n-1)/2). At 12 people, there are 66 communication channels. Process loss is almost guaranteed. |
| +10 | 10-11 developers | Approaching the danger zone. Coordination overhead is significant. Sub-teams may be forming informally. |
| 0 | 7-9 developers | Standard team size. Manageable communication paths. Can function with moderate oversight. |
| -10 | 5-6 developers | Small enough for direct communication. Low coordination overhead. Fewer opportunities for things to fall through cracks. |
| -20 | 4 or fewer developers | Minimal coordination needed. Everyone knows what everyone else is doing. Over-managing this size team is wasteful. |
**Key insight:** Team size is the most reliable predictor of process loss. The communication formula (n*(n-1)/2) means adding just 2 people to a team of 10 increases communication channels from 45 to 66 — a 47% increase. This is why architects should monitor team size as a leading indicator.
**Warning thresholds:**
- At 10+: Actively look for process loss signals
- At 15+: Strongly recommend splitting into sub-teams
- At 20+: Process loss is almost certain without sub-team structure
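The channel arithmetic and the size bands from the table above can be checked with a few lines of Python. This is a sketch with assumed function names; the bands follow the table exactly and do not attempt finer interpolation.

```python
def communication_channels(n: int) -> int:
    """Number of pairwise communication paths in a team of n people: n*(n-1)/2."""
    if n < 0:
        raise ValueError("team size cannot be negative")
    return n * (n - 1) // 2


def team_size_score(n: int) -> int:
    """Map a developer count to the banded score in the table above."""
    if n < 1:
        raise ValueError("team must have at least one developer")
    if n <= 4:
        return -20
    if n <= 6:
        return -10
    if n <= 9:
        return 0
    if n <= 11:
        return 10
    return 20


# Growing from 10 to 12 developers, as in the key insight above.
before, after = communication_channels(10), communication_channels(12)
increase_pct = round((after - before) / before * 100)
```

Here `before` is 45, `after` is 66, and `increase_pct` is 47, matching the figures quoted in the key insight.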
## Factor 3: Overall Experience
What is the general experience level of team members?
| Score | Description | Rationale |
|-------|-------------|-----------|
| +20 | Mostly junior (0-2 years experience) | Need guidance on patterns, practices, and architectural thinking. Higher risk of implementation decisions that violate architecture principles. |
| +10 | Mix leaning junior (many 1-3 years, few seniors) | Some experience but lack the pattern recognition that comes with years of practice. Need guardrails more than direction. |
| 0 | Balanced mix or mostly mid-level (3-5 years) | Capable of good implementation decisions. Need architecture context and boundaries but not hand-holding. |
| -10 | Mix leaning senior (many 5+ years, few juniors) | Strong individual contributors who can make most technical decisions independently. Over-guidance feels patronizing. |
| -20 | Mostly senior (8+ years) | Deep expertise. Can identify architecture violations themselves. Architect role shifts to facilitator and vision-setter. |
**Key insight:** Experience level determines the GRAIN of control. With junior teams, the architect defines patterns and reviews implementation. With senior teams, the architect sets principles and trusts the team to apply them. The same boundary ("use event-driven communication between these services") means different things to a junior team (needs examples, templates, code reviews) and a senior team (needs the architectural intent, figures out implementation).
## Factor 4: Project Complexity
How architecturally complex is the project?
| Score | Description | Rationale |
|-------|-------------|-----------|
| +20 | Highly complex — distributed systems, novel domain, multiple integration points, real-time requirements | Architectural decisions have far-reaching consequences. Wrong choices are expensive to fix. Architect must be deeply involved in key technical decisions. |
| +10 | Moderately complex — some distributed components, familiar domain with new requirements | Some architectural decisions are critical, others are routine. Architect focuses on high-impact areas. |
| 0 | Standard complexity — well-understood patterns, moderate integration | Architecture is largely settled. Implementation is the main challenge, not design. |
| -10 | Low complexity — straightforward CRUD, single service, well-understood domain | Few architectural decisions to make. Over-architecting is the bigger risk. |
| -20 | Simple — internal tool, prototype, proof-of-concept | Architecture is trivial. Architect involvement beyond initial setup adds no value. |
**Key insight:** Complexity determines the STAKES of control. On a simple project, a bad architectural decision has limited blast radius. On a complex distributed system, a bad decision can take months to unwind. The architect's role scales with the cost of getting architecture wrong.
## Factor 5: Project Duration
How long is the project expected to last?
| Score | Description | Rationale |
|-------|-------------|-----------|
| +20 | Long (18+ months) | More time for architectural drift, team turnover, requirement changes. Architecture must be actively governed to prevent erosion. |
| +10 | Medium-long (12-18 months) | Sufficient time for problems to accumulate. Periodic architecture reviews needed. |
| 0 | Medium (6-12 months) | Standard project lifecycle. Regular check-ins sufficient. |
| -10 | Short (3-6 months) | Limited time for architecture to drift. Initial decisions carry through to completion. |
| -20 | Very short (< 3 months) | Sprint-like intensity. Architecture is set once and executed. Extended governance adds overhead without value. |
**Key insight:** Duration determines the FREQUENCY of recalibration. Long projects need quarterly re-scoring because factors change: team members leave and join, complexity becomes better understood, the team becomes more familiar. Short projects need one calibration at the start.
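The duration bands lend themselves to a sorted-boundary lookup. The Python sketch below uses the standard `bisect` module; the function name is hypothetical, and placing a boundary value (exactly 3, 6, 12, or 18 months) in the longer band is an assumption, since the table's ranges overlap at their edges.

```python
from bisect import bisect_right

# Band boundaries in months, taken from the table above.
_BOUNDS = (3, 6, 12, 18)
# Scores for: < 3, 3-6, 6-12, 12-18, 18+ months.
_SCORES = (-20, -10, 0, 10, 20)


def duration_score(months: float) -> int:
    """Map expected project duration in months to its banded score."""
    if months <= 0:
        raise ValueError("duration must be positive")
    # bisect_right counts how many boundaries are <= months,
    # which is the index of the band the duration falls into.
    return _SCORES[bisect_right(_BOUNDS, months)]
```

This lookup only returns the five anchor values; intermediate scores still call for judgment on top of it.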
## Worked Scoring Examples
### Example 1: Scores +35 (Moderate-High Control)
A fintech company is building a new payment processing platform.
| Factor | Score | Rationale |
|--------|-------|-----------|
| Team familiarity | +10 | Team of 8 pulled from different departments. 3 know each other, 5 are new to the group. |
| Team size | 0 | 8 developers — standard size. |
| Overall experience | +10 | Mix of 3 seniors and 5 mid-levels, skewing toward less experienced for a complex domain. |
| Project complexity | +20 | Payment processing with PCI compliance, multiple external integrations, real-time reconciliation. |
| Project duration | -5 | 9 months — medium with slight short lean. |
| **Total** | **+35** | Moderate-high control. Complex domain with partially new team. |
**Recommended posture:** Attend architecture syncs and key stand-ups. Define service boundaries and integration patterns. Review all external API integration decisions. Create architecture fitness functions for PCI compliance. Trust seniors for implementation but guide the mid-level developers on patterns.
### Example 2: Scores -55 (Moderate-Low Control)
A mature DevOps team is building a new internal deployment dashboard.
| Factor | Score | Rationale |
|--------|-------|-----------|
| Team familiarity | -20 | Team has worked together for 4 years on various internal tools. |
| Team size | -15 | 5 developers. |
| Overall experience | -15 | 4 seniors (10+ years each), 1 mid-level (4 years). |
| Project complexity | -10 | Standard web dashboard with APIs they've built before. Some new charting requirements. |
| Project duration | +5 | 15 months — longer than typical for this type of project due to phased rollout. |
| **Total** | **-55** | Moderate-low control. Experienced, established team on familiar ground. |
**Recommended posture:** Set high-level architecture direction (tech stack, deployment model, data schema approach), then step back. Monthly architecture review is sufficient. Be available for questions but don't attend daily stand-ups. Focus architect time on the phased rollout strategy, which is the one area of complexity.
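The totals in both worked examples can be re-derived, and the factor that pulls hardest on each total identified, with a short Python sketch. The dictionary keys and the `summarize` helper are illustrative names; the scores are copied from the two tables above.

```python
from typing import Dict, Tuple

WORKED_EXAMPLES: Dict[str, Dict[str, int]] = {
    "payment platform": {
        "familiarity": 10, "size": 0, "experience": 10, "complexity": 20, "duration": -5,
    },
    "deployment dashboard": {
        "familiarity": -20, "size": -15, "experience": -15, "complexity": -10, "duration": 5,
    },
}


def summarize(scores: Dict[str, int]) -> Tuple[int, str]:
    """Return the total and the factor with the largest absolute score."""
    total = sum(scores.values())
    dominant = max(scores, key=lambda name: abs(scores[name]))
    return total, dominant
```

Surfacing the dominant factor helps explain why the total is what it is: complexity drives the payment platform score, while familiarity drives the dashboard score.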
## Score Interpretation Zones
```
-100 -------- -60 --------- -20 --------- +20 --------- +60 --------- +100
| Low         | Mod-Low     | Balanced    | Mod-High    | High        |
|             |             |             |             |             |
| Advisor     | Guardrails  | Facilitate  | Guide       | Direct      |
| Set vision  | Set bounds  | Collaborate | Attend key  | Attend all  |
| Step back   | Check in    | Be present  | Review      | Review all  |
|             | monthly     | weekly      | decisions   | daily       |
```
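The zone lookup can be sketched in a few lines of Python. This is a minimal sketch, not part of the original scoring method: the name `control_zone` is hypothetical, and placing a score that sits exactly on a boundary into the lower-control zone is an assumption, since the diagram does not say which side a boundary belongs to.

```python
# Hypothetical helper: map a total control score to a zone from the diagram above.
# Assumption: a score exactly on a boundary (-60, -20, +20, +60) is placed in the
# zone to its left (lower control); the diagram leaves this unspecified.
ZONES = [
    (-60, "Low"),
    (-20, "Mod-Low"),
    (20, "Balanced"),
    (60, "Mod-High"),
    (100, "High"),
]


def control_zone(factor_scores):
    """Sum the factor scores and return (total, zone label)."""
    total = sum(factor_scores.values())
    if not -100 <= total <= 100:
        raise ValueError(f"total {total} is outside the -100..+100 scale")
    for upper_bound, label in ZONES:
        if total <= upper_bound:
            return total, label
    raise AssertionError("unreachable: total was range-checked above")


# Factor scores taken from Example 2 (internal deployment dashboard).
example_2 = {
    "team_familiarity": -20,
    "team_size": -15,
    "overall_experience": -15,
    "project_complexity": -10,
    "project_duration": 5,
}
print(control_zone(example_2))
```

Keeping the boundaries in one list makes it easy to adjust the inclusive side if a team decides boundary scores should round toward more control instead.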
## Re-scoring Triggers
Recalculate the score when any of these events occur:
1. **Team membership changes** — New members reduce familiarity; departures change experience mix
2. **Project phase transition** — Moving from design to implementation, or from implementation to maintenance
3. **Complexity revelation** — The project turns out to be more or less complex than initially estimated
4. **Warning signs appear** — Process loss, pluralistic ignorance, or diffusion of responsibility emerge
5. **Quarterly cadence** — Even without specific triggers, re-score every 3 months on long projects
Choose the correct transaction isolation level and serializability implementation for an application's concurrency patterns. Use when: selecting an isolation...
---
name: transaction-isolation-selector
description: |
Choose the correct transaction isolation level and serializability implementation for an application's concurrency patterns. Use when: selecting an isolation level for a new system; evaluating whether read committed or snapshot isolation is safe for your access patterns; deciding whether to upgrade to serializable and choosing between two-phase locking (2PL) vs. serializable snapshot isolation (SSI); producing an architecture decision record for isolation level choice; or explaining to a team why the database default is insufficient. Distinct from concurrency-anomaly-detector (which scans code for exposed anomalies) — this skill selects the level, not the bugs. Covers PostgreSQL, MySQL InnoDB, Oracle, SQL Server, and distributed databases. Applies a 6-anomaly × 4-isolation-level mapping matrix (dirty read, dirty write, read skew, lost update, write skew, phantom read vs. read uncommitted, read committed, snapshot isolation, serializable) to produce a concrete recommendation with implementation trade-off analysis. Works on any codebase, schema, or workload description.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/designing-data-intensive-applications/skills/transaction-isolation-selector
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on: []
source-books:
- id: designing-data-intensive-applications
title: "Designing Data-Intensive Applications"
authors: ["Martin Kleppmann"]
chapters: [7]
tags: [transactions, isolation-levels, serializability, snapshot-isolation, read-committed, write-skew, phantom-reads, lost-updates, dirty-reads, mvcc, two-phase-locking, serializable-snapshot-isolation, concurrency, race-conditions, postgresql, mysql, oracle, sql-server, acid, database-selection]
execution:
tier: 2
mode: hybrid
inputs:
- type: codebase
description: "Application codebase, schema files, docker-compose, or architecture description — any artifact that reveals data access patterns and transaction boundaries"
- type: document
description: "Workload description or requirements document if no codebase is available"
tools-required: [Read, Write, Grep]
tools-optional: [Bash]
mcps-required: []
environment: "Run inside a project directory where codebase or configuration files exist. Falls back to document/description input if no codebase."
discovery:
goal: "Identify the minimum safe isolation level for the application's concurrency patterns and produce a concrete recommendation with implementation trade-off analysis"
tasks:
- "Identify which of the 6 concurrency anomalies the application is exposed to"
- "Map those anomalies to the minimum isolation level that prevents them"
- "Assess performance requirements to select among serializability implementations if serializable is indicated"
- "Identify the database in use and its actual default isolation level"
- "Flag write skew exposure — the most commonly missed anomaly"
audience:
roles: ["backend-engineer", "software-architect", "data-engineer", "tech-lead", "site-reliability-engineer"]
experience: "intermediate-to-advanced — assumes experience with relational databases and SQL transactions"
triggers:
- "User is choosing an isolation level for a new service or database"
- "User has a concurrency bug and suspects a race condition in their transactions"
- "User wants to understand whether their database's default is safe for their workload"
- "User is migrating from one database to another and needs to verify isolation equivalence"
- "User needs to justify an isolation choice in an architecture decision record"
- "User suspects write skew but isn't sure how to detect or prevent it"
- "User wants to evaluate serializable snapshot isolation vs two-phase locking"
not_for:
- "Distributed transaction coordination across multiple databases — use two-phase commit analysis (Ch 9)"
- "Choosing between replication consistency models (eventual vs linearizable) — use consistency-model-selector"
- "Selecting a storage engine — use storage-engine-selector"
---
# Transaction Isolation Selector
## When to Use
You have a database with concurrent transaction access and need to choose the right isolation level — or you suspect an existing isolation level is inadequate for your concurrency patterns.
This skill applies when any of the following are true:
- You are building a new service and need to decide what isolation level to configure
- You have a bug that appears nondeterministically and involves concurrent reads and writes
- You are migrating to a new database and need to verify the isolation guarantees are equivalent
- Your application touches multiple rows or tables within a single transaction
- Your business logic reads a value and then conditionally writes based on it (the write skew pattern)
**Critical default:** Most databases do NOT default to serializable isolation. Oracle 11g does not implement true serializable at all — its "serializable" level is actually snapshot isolation. PostgreSQL defaults to read committed. MySQL InnoDB defaults to repeatable read (which is snapshot isolation in MySQL's implementation). If you have not explicitly set your isolation level, you are running at a weaker level than serializable, and some anomalies are possible.
**Related skills:**
- `concurrency-anomaly-detector` — if you already have a bug and need to identify the anomaly type
- `consistency-model-selector` — for distributed system consistency guarantees (linearizability, eventual consistency)
---
## Context & Input Gathering
### Required Context (must have — ask if missing)
- **Database in use and its version.** Why: The same isolation level name means different things in different databases. Oracle's "serializable" is snapshot isolation. PostgreSQL's "repeatable read" is snapshot isolation. MySQL's "repeatable read" does not detect lost updates automatically. Without knowing the database, the isolation level name alone is meaningless.
- Check environment for: `docker-compose.yml` (database service images), `requirements.txt` / `pom.xml` / `package.json` (database driver), schema files (database-specific syntax)
- If still missing, ask: "What database are you using, and what version?"
- **Transaction boundaries and what they read/write.** Why: Isolation level requirements are determined by what transactions do — specifically whether they read a value and then write based on what they read. A transaction that only does blind writes (INSERT without a preceding SELECT) has different requirements than one that does SELECT then UPDATE.
- Check environment for: application code (look for BEGIN TRANSACTION / BEGIN / with transaction context managers); ORM code (look for @Transactional, session.begin()); look for read-then-write patterns (SELECT followed by UPDATE or INSERT in the same function)
- If still missing, ask: "Can you describe a typical transaction your application performs? For example: 'we read an account balance, check if it's positive, then deduct an amount.'"
- **Concurrency pattern — how many concurrent users or processes access the same data.** Why: Anomalies only occur under concurrent access. A single-user system with no concurrency has no isolation requirements beyond atomicity. A high-concurrency system where many transactions access the same rows needs strong isolation.
- Check environment for: architecture descriptions (number of instances/workers), load testing configs, queue worker counts
- If still missing, ask: "Is this data accessed by a single process at a time, or by multiple concurrent users or worker processes?"
- **Performance requirements.** Why: Serializable isolation is the safest choice but has a performance cost. The choice between serializability implementations (serial execution, two-phase locking, serializable snapshot isolation) depends heavily on transaction throughput requirements and whether workloads are read-heavy or write-heavy.
- Check environment for: SLA definitions (requirements.md, architecture.md), load test results, existing query timeout configurations
- If still missing, ask: "Are there throughput or latency requirements? For example, transactions per second, or p99 latency SLA?"
### Observable Context (gather from environment)
- **Existing isolation level configuration.** Look for `SET TRANSACTION ISOLATION LEVEL`, `transaction_isolation` config vars, ORM transaction settings, or database configuration files. If already set, assess whether it is sufficient.
- **Read-then-write patterns.** Grep for: SELECT followed by UPDATE/INSERT/DELETE in the same transaction scope; ORM patterns like `find_then_update`; check-and-set patterns; aggregate queries (COUNT, SUM) used as a guard before a write.
- **Multi-object transactions.** Look for transactions that touch more than one table or row. Single-object transactions have simpler isolation requirements than multi-object ones.
- **Long-running transactions.** Look for background jobs, batch jobs, or backup processes that hold transactions open for minutes. These are particularly sensitive to read skew.
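As one way to gather the first item above, the following sketch scans text for explicit isolation-level clauses. It is an illustration under stated assumptions: the function name `find_isolation_settings` is hypothetical, the pattern only covers SQL-style `ISOLATION LEVEL <name>` clauses, and ORM or config-file settings would need their own patterns.

```python
import re

# Matches the "ISOLATION LEVEL <level>" clause, case-insensitively, wherever it
# appears (for example in SET TRANSACTION or BEGIN TRANSACTION statements).
ISOLATION_RE = re.compile(
    r"ISOLATION\s+LEVEL\s+"
    r"(READ\s+UNCOMMITTED|READ\s+COMMITTED|REPEATABLE\s+READ|SNAPSHOT|SERIALIZABLE)",
    re.IGNORECASE,
)


def find_isolation_settings(text):
    """Return (line_number, normalized_level) for each explicit clause found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in ISOLATION_RE.finditer(line):
            # Collapse whitespace and upper-case so results compare cleanly.
            level = " ".join(match.group(1).upper().split())
            hits.append((lineno, level))
    return hits


sample = """
BEGIN;
set transaction isolation  level repeatable read;
SELECT COUNT(*) FROM doctors WHERE on_call = true;
"""
print(find_isolation_settings(sample))
```

An empty result does not mean the application is safe; it usually means the database default from Step 1 is in effect.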
### Default Assumptions
When context cannot be observed and asking would be excessive:
- Database isolation level unknown → assume read committed (PostgreSQL/Oracle/SQL Server default); note this assumption explicitly
- Transaction length unknown → assume short OLTP transactions (< 1 second)
- Throughput requirements unknown → assume moderate concurrency; do not pre-optimize away serializable
- Write pattern unknown → assume read-then-write patterns exist (conservative); flag for user confirmation
---
## Process
### Step 1: Identify the Database Default and Current Isolation Level
**ACTION:** Determine what isolation level the database is actually operating at — not what is assumed.
**WHY:** The most common root cause of concurrency bugs is assuming the database provides stronger guarantees than it does. "We use PostgreSQL so we're ACID-compliant" is true for atomicity and durability but does not mean serializable isolation. PostgreSQL defaults to read committed, which allows read skew and does not prevent write skew or phantom reads. Establishing the actual current level — not the desired level — is the necessary starting point.
Check and record:
```
Database: [PostgreSQL | MySQL InnoDB | Oracle | SQL Server | other]
Actual default: [read uncommitted | read committed | snapshot isolation | serializable]
Current setting: [check docker-compose, app config, ORM settings, database session config]
```
**Default isolation levels by database (as of Kleppmann's analysis):**
| Database | Default Isolation Level | Notes |
|----------|------------------------|-------|
| PostgreSQL | Read committed | "Repeatable read" = snapshot isolation. "Serializable" = true SSI (since v9.1). |
| MySQL InnoDB | Repeatable read | MySQL's repeatable read does NOT automatically detect lost updates. Not the same as PostgreSQL's snapshot isolation. |
| Oracle 11g | Read committed | "Serializable" = snapshot isolation. True serializable is not available. |
| SQL Server | Read committed | Snapshot isolation requires ALLOW_SNAPSHOT_ISOLATION ON; READ_COMMITTED_SNAPSHOT ON only switches read committed to row versioning. |
| DB2 | Cursor stability (≈ read committed) | "Repeatable read" = serializable in IBM's terminology — opposite of everyone else. |
---
### Step 2: Map the Application's Transaction Patterns to Anomaly Exposure
**ACTION:** For each significant transaction type in the application, identify which of the 6 concurrency anomalies it is exposed to.
**WHY:** Not every application needs serializable isolation. The 6 anomalies exist on a spectrum of severity and commonality. Dirty reads and dirty writes are rare and catastrophic; write skew is subtle and frequently missed; phantom reads matter only in specific patterns. Identifying which anomalies are actually possible in your access patterns lets you select the minimum sufficient isolation level rather than defaulting to either "use serializable always" (overly conservative) or "read committed is fine" (frequently wrong). The minimum sufficient level is the correct engineering answer.
**The 6 anomalies and what they require to occur:**
| Anomaly | What must be true for it to occur | What it looks like |
|---------|----------------------------------|--------------------|
| **Dirty read** | Transaction A reads writes from transaction B that have not yet committed (B may later abort) | Application sees data that was never actually committed: changes that later vanish when B rolls back |
| **Dirty write** | Transaction A overwrites uncommitted writes from transaction B | Two concurrent writes to the same object mix their results; for example, car sale listing shows one buyer but invoice shows another |
| **Read skew (nonrepeatable read)** | Transaction reads different objects (or the same object twice) at different points in time; a concurrent write commits between the reads | Bank transfer example: Alice reads account 1 ($500) before the transfer and account 2 ($400) after it, so her total appears as $900 instead of $1000 |
| **Lost update** | Two transactions do read-modify-write cycles concurrently; one's write overwrites the other's | Counter increment race: both read 42, both write 43, result is 43 instead of 44 |
| **Write skew** | Two transactions read overlapping data, each makes a decision based on the read, each writes to disjoint objects | Doctor on-call: both doctors see 2 on-call, both go off-call, result is 0 doctors on call |
| **Phantom read** | Transaction reads a set of objects matching a condition; concurrent transaction inserts/deletes a row matching that condition | Booking system: check shows no conflicts, insert succeeds; concurrent check also shows no conflicts, concurrent insert also succeeds — double-booking |
**Write skew detection checklist** (the most commonly missed anomaly):
A transaction is vulnerable to write skew if ALL of the following are true:
1. It reads one or more rows matching some condition
2. It makes a decision based on the result of that read
3. It writes to the database (INSERT, UPDATE, or DELETE) based on that decision
4. The write changes the precondition that was checked in step 1
5. Another transaction could do the same thing concurrently
**Common write skew patterns by domain:**
| Pattern | Example | Risk |
|---------|---------|------|
| At-least-one constraint | Doctor on-call: check count >= 1, then remove self | Two concurrent removals both pass the check, both remove |
| No-overlap constraint | Meeting room booking: check no conflicts, then insert booking | Two concurrent bookings both pass the check, both insert |
| Unique-per-user constraint | Username claim: check username not taken, then insert user | Two concurrent registrations both pass, both insert |
| Budget constraint | Spending check: verify sum remains positive, then insert spend | Two concurrent spends both see positive sum, both insert — total goes negative |
| Game state validity | Chess: check move is valid, then update position | Two concurrent moves to the same position both pass validity |
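The at-least-one pattern can be reproduced without a database. The sketch below is a toy model, not a real transaction engine: the `SnapshotTxn` class and its method names are hypothetical, and it imitates only the two snapshot-isolation properties that matter here (reads come from a snapshot taken at start, writes to disjoint rows do not conflict).

```python
# In-memory stand-in for a table: doctor name -> on_call flag.
committed = {"alice": True, "bob": True}


class SnapshotTxn:
    """Toy transaction: reads see a private snapshot, writes are buffered."""

    def __init__(self, db):
        self.db = db
        self.snapshot = dict(db)  # consistent snapshot taken at transaction start
        self.writes = {}

    def count_on_call(self):
        return sum(1 for on_call in self.snapshot.values() if on_call)

    def go_off_call(self, doctor):
        # Check-then-write: the decision relies on the snapshot, not live data.
        if self.count_on_call() >= 2:
            self.writes[doctor] = False
            return True
        return False

    def commit(self):
        # This toy omits write-write conflict detection. It would not fire here
        # anyway, because Alice and Bob write disjoint rows.
        self.db.update(self.writes)


txn_alice = SnapshotTxn(committed)
txn_bob = SnapshotTxn(committed)  # starts before Alice commits
alice_ok = txn_alice.go_off_call("alice")
bob_ok = txn_bob.go_off_call("bob")
txn_alice.commit()
txn_bob.commit()

on_call_after = sum(committed.values())
print(alice_ok, bob_ok, on_call_after)
```

Both checks pass against their own snapshot, and the combined result leaves nobody on call, which is exactly the five-condition checklist above playing out.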
---
### Step 3: Apply the Anomaly-to-Isolation Mapping Matrix
**ACTION:** For each identified anomaly exposure, determine the minimum isolation level that prevents it.
**WHY:** Isolation levels exist on a spectrum where each level prevents some anomalies and allows others. The correct level is the minimum one that prevents all anomalies the application is actually exposed to. Choosing a weaker level than necessary risks data corruption; choosing a stronger level than necessary incurs unnecessary performance cost. The mapping matrix makes this selection systematic rather than intuitive.
**The anomaly-to-isolation mapping matrix:**
| Anomaly | Read Uncommitted | Read Committed | Snapshot Isolation | Serializable |
|---------|:----------------:|:--------------:|:-----------------:|:------------:|
| Dirty reads | allowed | **prevented** | prevented | prevented |
| Dirty writes | prevented | prevented | prevented | prevented |
| Read skew | allowed | allowed | **prevented** | prevented |
| Lost updates | allowed | allowed | sometimes* | **prevented** |
| Write skew | allowed | allowed | allowed | **prevented** |
| Phantom reads | allowed | allowed | partially** | **prevented** |
\* PostgreSQL and Oracle automatically detect lost updates in snapshot isolation. MySQL InnoDB does NOT.
\*\* **Snapshot isolation prevents straightforward phantom reads but NOT phantoms that cause write skew.** A phantom in a read-only query (e.g., a backup scan) is prevented by snapshot isolation. A phantom in a read-write transaction where the phantom affects a write decision (the write skew pattern) is NOT prevented by snapshot isolation; serializable isolation is required.
**Reading the matrix:**
- Find the hardest-to-prevent anomaly your application is exposed to (rows are ordered roughly from easiest to hardest to prevent)
- The minimum isolation level is the column where that anomaly is first marked "prevented"
- If multiple anomalies are present, take the maximum (most restrictive) required level
**Decision summary:**
```
Exposed to dirty reads only → Read committed is sufficient
Exposed to read skew → Snapshot isolation is the minimum
Exposed to lost updates → Snapshot isolation (PostgreSQL/Oracle) or
explicit locking (MySQL); verify database behavior
Exposed to write skew or phantoms → Serializable is required; no weaker level prevents these
```
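The matrix and decision summary can be encoded as a small lookup. This is a sketch under assumptions: the names `minimum_isolation_level`, `MIN_LEVEL`, and `AUTO_DETECTS_LOST_UPDATES` are hypothetical, and phantom reads are mapped conservatively to serializable, following the decision summary.

```python
# Isolation levels ordered from weakest to strongest.
LEVELS = ["read uncommitted", "read committed", "snapshot isolation", "serializable"]

# Weakest level at which the matrix marks each anomaly as "prevented".
# Phantom reads are mapped to serializable, following the decision summary
# (snapshot isolation only covers the read-only variant).
MIN_LEVEL = {
    "dirty write": "read uncommitted",
    "dirty read": "read committed",
    "read skew": "snapshot isolation",
    "lost update": "snapshot isolation",
    "write skew": "serializable",
    "phantom read": "serializable",
}

# Databases whose snapshot isolation auto-detects lost updates (per the footnote).
AUTO_DETECTS_LOST_UPDATES = {"postgresql", "oracle"}


def minimum_isolation_level(anomalies, database):
    """Return (level, notes) for a set of anomaly names and a database name."""
    if not anomalies:
        raise ValueError("no anomalies given; complete Step 2 first")
    # Take the strongest level required by any single anomaly.
    required = max((MIN_LEVEL[a] for a in anomalies), key=LEVELS.index)
    notes = []
    if (
        "lost update" in anomalies
        and required == "snapshot isolation"
        and database.lower() not in AUTO_DETECTS_LOST_UPDATES
    ):
        notes.append("add explicit locking (SELECT ... FOR UPDATE) or use serializable")
    return required, notes


print(minimum_isolation_level({"read skew", "write skew"}, "PostgreSQL"))
print(minimum_isolation_level({"lost update"}, "MySQL"))
```

The result still has to pass the naming check in Step 5, because the level name a database exposes may not match the guarantee returned here.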
---
### Step 4: If Serializable Is Required — Select an Implementation
**ACTION:** If Step 3 indicates serializable isolation is required, select among the three implementation approaches using the table and decision tree below.
**WHY:** Serializable isolation has a reputation for being unusably slow — this comes from two-phase locking (2PL), the only option for decades. Two newer approaches (serial execution and SSI) have very different profiles. The choice between them is the difference between "blocking all reads when a write is in progress" and "reads and writes never block each other." Selecting the wrong implementation is the primary reason teams unnecessarily abandon correctness for performance.
| Implementation | Key property | Use when | Do not use when |
|----------------|-------------|----------|-----------------|
| **Serial Execution** | No concurrency — one thread, serial order | Dataset in memory; transactions < 10ms; throughput fits a single core; stored procedures only | Long transactions; disk I/O in transactions; cross-partition coordination at scale |
| **Two-Phase Locking (2PL)** | Pessimistic — readers block writers, writers block readers | SSI unavailable (MySQL, SQL Server, DB2); moderate concurrency; low contention | Strict latency SLA with high contention; long + short transactions coexist |
| **Serializable Snapshot Isolation (SSI)** | Optimistic — proceed, detect conflicts at commit, abort if needed | Read-heavy; low-to-moderate contention; PostgreSQL >= 9.1 or FoundationDB | High contention (abort rate dominates); app cannot implement retry logic |
**Decision tree:**
```
Dataset in memory + transactions < 10ms + stored procedures?
  → Yes: Serial Execution (VoltDB, Redis, Datomic)
  → No: continue to the next question
Database supports SSI + workload read-heavy + low-moderate contention?
  → Yes: SSI (PostgreSQL SERIALIZABLE, FoundationDB)
  → No: Two-Phase Locking (MySQL SERIALIZABLE, SQL Server SERIALIZABLE)
```
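The decision tree translates directly into a function. This is a minimal sketch: the function and parameter names are hypothetical, and the inputs are assumed to be judgments the analyst has already made from the table above.

```python
def choose_serializable_implementation(
    dataset_in_memory,
    short_stored_procedures,  # transactions < 10 ms, submitted as stored procedures
    database_supports_ssi,
    read_heavy,
    contention,  # one of "low", "moderate", "high"
):
    """Walk the decision tree top to bottom and return the first match."""
    if contention not in ("low", "moderate", "high"):
        raise ValueError(f"unknown contention level: {contention!r}")
    if dataset_in_memory and short_stored_procedures:
        return "serial execution"
    if database_supports_ssi and read_heavy and contention in ("low", "moderate"):
        return "SSI"
    return "2PL"


# PostgreSQL service, disk-backed, read-heavy, moderate contention.
print(choose_serializable_implementation(False, False, True, True, "moderate"))
# MySQL InnoDB service: SSI is not available, so the tree falls through to 2PL.
print(choose_serializable_implementation(False, False, False, True, "low"))
```

The fall-through to 2PL mirrors the tree: it is the choice of last resort when neither of the other two profiles fits.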
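**SSI requirement:** The application must implement retry logic. SSI aborts a conflicting transaction with SQLSTATE 40001 (serialization failure), typically at commit time, although PostgreSQL can raise it at an earlier statement. The entire transaction must be re-executed from scratch. ORM frameworks generally do not retry by default.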
See `references/serializability-implementation-comparison.md` for full per-implementation detail, performance profiles, and retry patterns.
---
### Step 5: Check for the Naming Trap
**ACTION:** Verify that the isolation level name used in the database matches the actual guarantee, not just the name.
**WHY:** The SQL standard's isolation level definitions are ambiguous and inconsistently implemented. Different databases use the same names to mean different things. A team that sets `SERIALIZABLE` in Oracle 11g believes it has full serializability but actually has snapshot isolation, so write skew is still possible. A team using `REPEATABLE READ` in MySQL believes it has the same guarantee as PostgreSQL's repeatable read, but MySQL's implementation does not automatically detect lost updates. This naming confusion has caused real financial losses and data corruption.
**Critical naming mismatches to check:**
| Database | Name Used | What It Actually Provides |
|----------|-----------|--------------------------|
| Oracle 11g | SERIALIZABLE | Snapshot isolation (write skew still possible) |
| PostgreSQL | REPEATABLE READ | Snapshot isolation (does detect lost updates) |
| MySQL InnoDB | REPEATABLE READ | Snapshot isolation WITHOUT automatic lost update detection |
| MySQL InnoDB | SERIALIZABLE | Two-phase locking — true serializable |
| DB2 | REPEATABLE READ | Serializable (opposite of everyone else) |
| PostgreSQL | SERIALIZABLE | True serializable via SSI (since v9.1) |
**Action:** After selecting the required isolation level, verify the database's actual behavior against this table. If the level with that name in your database does not provide the required guarantee, select the next stronger level or apply compensating measures.
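The naming check can be made mechanical by transcribing the table into a lookup. This is a sketch, assuming hypothetical names (`ACTUAL_GUARANTEE`, `provides_serializable`); it covers only the combinations listed in the table and returns `None` for anything else, so that unknown databases are not silently treated as safe.

```python
# (database, level name as configured) -> guarantee actually provided,
# transcribed from the naming-mismatch table above.
ACTUAL_GUARANTEE = {
    ("oracle 11g", "SERIALIZABLE"): "snapshot isolation",
    ("postgresql", "REPEATABLE READ"): "snapshot isolation",
    ("mysql innodb", "REPEATABLE READ"): "snapshot isolation (no lost-update detection)",
    ("mysql innodb", "SERIALIZABLE"): "serializable",
    ("db2", "REPEATABLE READ"): "serializable",
    ("postgresql", "SERIALIZABLE"): "serializable",
}


def provides_serializable(database, level_name):
    """True/False from the table; None when the combination is not listed."""
    key = (database.strip().lower(), " ".join(level_name.upper().split()))
    guarantee = ACTUAL_GUARANTEE.get(key)
    if guarantee is None:
        return None  # unknown: consult the database's own documentation
    return guarantee == "serializable"


print(provides_serializable("Oracle 11g", "serializable"))
print(provides_serializable("PostgreSQL", "serializable"))
print(provides_serializable("DB2", "repeatable read"))
```

A `False` result for a level named SERIALIZABLE is the signal to apply compensating measures such as explicit locking.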
---
### Step 6: Produce the Recommendation
**ACTION:** Write a structured recommendation with: the anomaly exposure, the minimum required isolation level, the implementation choice (if serializable), the database-specific setting, and compensating measures if the database cannot provide the required level.
**WHY:** The recommendation must be actionable by an engineer configuring a database session or making a pull request. An abstract statement ("use serializable") is insufficient — the team needs the specific database configuration, any compensating measures, and the trade-offs being accepted.
**Output format:**
```
## Transaction Isolation Recommendation
### Current State
Database: [database + version]
Default isolation level: [what the database defaults to]
Current configured level: [what is actually set, if observable]
### Anomaly Exposure Analysis
[For each significant transaction pattern:]
Pattern: [description]
Exposed to: [list of anomalies from Step 2]
Minimum level required: [from Step 3 mapping]
### Recommendation
Isolation Level: [read committed | snapshot isolation | serializable]
Database Setting: [exact configuration statement for the specific database]
Implementation: [serial execution | 2PL | SSI | N/A]
### Trade-offs Accepted
[What anomalies are still possible at the chosen level, if below serializable]
[Performance cost if serializable is chosen]
### Compensating Measures
[If the database cannot provide the required level, or if a weaker level is
chosen deliberately, list the application-level compensations needed:]
- SELECT FOR UPDATE for write skew patterns (if staying below serializable)
- Retry logic for SSI aborts
- Explicit constraint checks at the application layer
### What to Monitor
[Deadlock rate (2PL), abort/retry rate (SSI), or lock contention metrics]
```
---
## What Can Go Wrong
Each of the 6 concurrency anomalies has a distinct detection signature and a specific minimum isolation level. The table below is a quick reference; full per-anomaly detail with worked examples is in `references/anomaly-isolation-matrix.md`.
| Anomaly | Detection signature | Prevented by | Level required |
|---------|--------------------|-----------|----|
| **Dirty read** | App acts on data that was never committed (in-flight write later rolled back) | Read committed and above | Read committed |
| **Dirty write** | Two concurrent writes to the same object produce a mixed result (car sale: listing says Bob, invoice says Alice) | All practical levels | Read uncommitted (every level in the matrix) |
| **Read skew** | Long-running read sees different states at different points in time (Alice's $1000 appears as $900 during a transfer) | Snapshot isolation and above | Snapshot isolation |
| **Lost update** | Concurrent read-modify-write cycles: both read 42, both write 43, result is 43 not 44 | Snapshot isolation (PG/Oracle auto-detect); explicit locks (MySQL) | Snapshot isolation* |
| **Write skew** | Two concurrent transactions both read a valid precondition, both write to disjoint objects, combined result violates invariant (doctor on-call count goes to 0) | Serializable only | Serializable |
| **Phantom read** | Check-then-insert: both transactions see zero conflicts, both insert, double-booking occurs | Serializable (write skew variant); snapshot isolation (read-only variant) | Serializable |
*MySQL InnoDB snapshot isolation does NOT automatically detect lost updates. Use `SELECT ... FOR UPDATE` or upgrade to SERIALIZABLE.
**The most dangerous gap:** Snapshot isolation does not prevent write skew. Oracle's "serializable" is snapshot isolation. If you are on Oracle and have any write skew pattern, you have no database-level protection. Use explicit `SELECT FOR UPDATE` on the precondition query (when rows exist) or add serializable-level protection at the application boundary.
---
## Key Principles
**The naming trap is the most dangerous pitfall.** Database isolation level names are not standardized in practice. Two databases can both claim to implement "serializable" while providing fundamentally different guarantees. Before configuring an isolation level, look up what that specific database's implementation actually provides, not what the name suggests. Oracle's "serializable" is snapshot isolation; IBM DB2's "repeatable read" is serializable. Trust the behavior, not the label.
**Write skew is the most commonly missed anomaly.** Dirty reads are well-known. Lost updates are familiar. Write skew — where two transactions read the same precondition and update disjoint objects — is subtle enough that it frequently goes undetected until a production incident. Any time application logic follows the pattern "check condition, make decision, write result," write skew is a possibility if two such transactions can run concurrently. The doctor on-call pattern appears in: inventory management, appointment booking, financial spending limits, membership count checks, and game state validation.
**Snapshot isolation is not serializable, despite what Oracle calls it.** Snapshot isolation prevents dirty reads, dirty writes, read skew, and (in some databases) lost updates. It does not prevent write skew or write skew phantoms. The "readers never block writers, writers never block readers" guarantee that makes snapshot isolation attractive is precisely what allows write skew: two transactions can both read the same precondition simultaneously and proceed to write.
**Serializable snapshot isolation (SSI) makes serializable practical.** The historical association of serializable isolation with heavy performance penalties comes from two-phase locking. SSI provides full serializability with much lower overhead — close to snapshot isolation performance at low-to-moderate contention. PostgreSQL has offered SSI since version 9.1. Teams that reject serializable isolation for performance reasons should evaluate whether SSI in their specific database actually has unacceptable overhead for their workload, rather than accepting that overhead as given.
**Explicit locking is the escape hatch, not the solution.** `SELECT FOR UPDATE` and other explicit locks can compensate for weaker isolation levels in specific cases. They work when the rows to lock are known in advance (you can lock the specific rows the precondition query returns). They fail when the write skew involves inserting a new row that matches a condition — there is no existing row to lock. Explicit locking also requires careful code discipline: forgetting one lock somewhere creates a race condition. Serializable isolation is more robust because it applies automatically.
**Application retry logic is not optional for SSI.** SSI aborts transactions that would violate serializability. Unlike 2PL, where a transaction usually blocks waiting for a lock and proceeds once the lock is released, SSI lets the transaction run and aborts it when a conflict is detected, typically at commit time. The application must detect the abort and retry the entire transaction from the beginning. ORM frameworks that do not retry aborted transactions (like Rails' ActiveRecord with default error handling) let the exception propagate: the error reaches the user and the operation is lost. SSI adoption requires explicit retry handling in the application layer.
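A retry wrapper for serialization failures might look like the sketch below. It is driver-agnostic and makes assumptions that should be checked against the driver in use: the names `run_with_retry` and `sqlstate_of` are hypothetical, and the SQLSTATE is assumed to be exposed on the exception as `pgcode` (as psycopg2 does) or `sqlstate` (as psycopg 3 does).

```python
import random
import time

SERIALIZATION_FAILURE = "40001"  # SQLSTATE for serialization failures


def sqlstate_of(exc):
    # Assumption: the driver exposes the SQLSTATE as `pgcode` (psycopg2)
    # or `sqlstate` (psycopg 3); other drivers may differ.
    return getattr(exc, "pgcode", None) or getattr(exc, "sqlstate", None)


def run_with_retry(txn_fn, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Call txn_fn() until it succeeds or a non-retryable error occurs.

    txn_fn must run the WHOLE transaction (begin, reads, writes, commit) and
    roll back before raising, so that each attempt starts from scratch.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_fn()
        except Exception as exc:
            retryable = sqlstate_of(exc) == SERIALIZATION_FAILURE
            if not retryable or attempt == max_attempts:
                raise
            # Exponential backoff with jitter to avoid retrying in lockstep.
            sleep(base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5))
```

Only SQLSTATE 40001 is retried; any other error, and the final failed attempt, propagate to the caller. Side effects outside the database (emails, external API calls) should stay out of `txn_fn`, because they would be repeated on every attempt.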
---
## Examples
### Example 1: Financial Transfer Service (Write Skew Exposure)
**Scenario:** A fintech service performs fund transfers between accounts. A transfer transaction reads both account balances, verifies the source account has sufficient funds, then decrements the source and increments the destination. The service processes ~500 concurrent requests per second. The database is PostgreSQL (default isolation: read committed).
**Trigger:** "We have a bug where accounts occasionally go negative, but only under load. Our transaction checks the balance before debiting."
**Process:**
- Step 1: PostgreSQL defaults to read committed. The transaction is: SELECT balance, verify it covers the transfer amount, UPDATE source balance, UPDATE destination balance.
- Step 2: The read-then-write pattern (check balance, then debit) is exposed to write skew if multiple transfers from the same source account run concurrently. Two concurrent transfers both read the balance as $100, both decide it's sufficient for a $60 debit, both debit — result is -$20.
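- Step 3: Write skew requires serializable isolation. Note that in this specific race both transfers update the same source row, so PostgreSQL's REPEATABLE READ (snapshot isolation) would abort the second writer with a serialization failure. It would not, however, protect balance invariants that span rows the two transactions do not both write. Serializable covers both cases.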
- Step 4: SSI evaluation: PostgreSQL supports SSI via SERIALIZABLE. Contention is moderate (multiple concurrent transfers from popular accounts). SSI is appropriate; retry logic required.
- Step 5: Naming check: PostgreSQL's SERIALIZABLE actually uses SSI — true serializable.
**Output:**
```
## Transaction Isolation Recommendation
Database: PostgreSQL 15
Default: Read committed
Current: Read committed (bug: insufficient for this workload)
Anomaly Exposure:
Pattern: Read balance → verify sufficient → debit source → credit destination
Exposed to: Write skew (concurrent transfers from same source account)
Minimum level required: Serializable
Recommendation:
Isolation Level: Serializable
Database Setting: SET default_transaction_isolation = 'serializable';
(or per-transaction: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE)
Implementation: SSI (PostgreSQL default for SERIALIZABLE since v9.1)
Trade-offs Accepted:
- SSI may abort transactions under high contention on the same account
- Application must implement retry logic for serialization failures (SQLSTATE 40001)
Compensating Measure (if serializable is not adopted):
- Add SELECT balance FROM accounts WHERE id = $1 FOR UPDATE in the transfer
transaction — locks the source account row before checking the balance
- This prevents concurrent transfers from the same source from passing the
balance check simultaneously
Monitor: Count SQLSTATE 40001 errors in the application; watch pg_stat_database.xact_rollback; track retry rate
```
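The interleaving described in Step 2 can be replayed deterministically. The following is a minimal sketch in plain Python, not database code: `transfer_steps` and the `accounts` dictionary are hypothetical stand-ins for the transfer transaction and the accounts table, and the generator pause models the gap between the balance check and the debit.

```python
def transfer_steps(accounts, source, amount):
    """Hypothetical transfer that pauses between the balance check and the debit."""
    balance = accounts[source]        # SELECT balance (value may go stale)
    yield "checked"
    if balance >= amount:             # decision is made on the earlier read
        accounts[source] -= amount    # UPDATE ... SET balance = balance - amount
        yield "debited"
    else:
        yield "rejected"


accounts = {"alice": 100}
t1 = transfer_steps(accounts, "alice", 60)
t2 = transfer_steps(accounts, "alice", 60)

# Both transfers run their balance check before either one debits.
next(t1)
next(t2)
outcomes = [next(t1), next(t2)]
final_balance = accounts["alice"]
print(outcomes, final_balance)
```

Both transfers pass the check against the same $100 balance, so the account ends at -$20. Running the two generators one after the other instead would reject the second transfer.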
---
### Example 2: Doctor On-Call Scheduling (Classic Write Skew)
**Scenario:** A hospital scheduling app allows doctors to release their on-call shift if at least one other doctor remains on call. The check is: `SELECT COUNT(*) FROM doctors WHERE on_call = true AND shift_id = $1`. If count >= 2, the requesting doctor is marked off-call. The database is MySQL InnoDB (default: repeatable read).
**Trigger:** "We had an incident where all doctors went off-call simultaneously. Our code checks that at least one other doctor is on call before allowing the change."
**Process:**
- Step 1: MySQL InnoDB defaults to repeatable read. MySQL's repeatable read does not automatically detect lost updates and does not prevent write skew.
- Step 2: Classic write skew — two concurrent off-call requests both pass the count check (both see count = 2), both update their own row (disjoint objects — Alice's record and Bob's record), result is count = 0.
- Step 3: Write skew requires serializable. Snapshot isolation alone does not prevent this.
- Step 4: MySQL InnoDB SERIALIZABLE uses two-phase locking (not SSI). At ~50 concurrent scheduling requests, contention on the doctors table is low — 2PL is acceptable.
- Step 5: Naming check: MySQL's REPEATABLE READ does NOT prevent write skew. MySQL's SERIALIZABLE uses 2PL — true serializable.
**Output:**
```
## Transaction Isolation Recommendation
Database: MySQL InnoDB
Default: Repeatable read (does not prevent write skew)
Current: Repeatable read (insufficient — confirmed root cause of incident)
Anomaly Exposure:
Pattern: COUNT doctors on-call → if >= 2 → update own record to off-call
Exposed to: Write skew (concurrent off-call requests)
Minimum level required: Serializable
Recommendation:
Isolation Level: Serializable
Database Setting: SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;
(or SET GLOBAL if appropriate)
Implementation: 2PL (MySQL InnoDB SERIALIZABLE uses 2PL, not SSI)
Trade-offs Accepted:
- 2PL will block concurrent transactions that conflict — scheduling requests may queue
- Deadlocks possible; application must handle and retry
Compensating Measure (if serializable not adopted):
- Upgrade the COUNT query to: SELECT COUNT(*) FROM doctors
WHERE on_call = true AND shift_id = $1 FOR UPDATE
- This locks all on-call rows for the shift, preventing concurrent reads of
the same precondition. Works because the rows being checked exist.
- This is application-level mitigation, not a substitute for serializable isolation.
Monitor: Deadlock rate (SHOW ENGINE INNODB STATUS); lock wait timeouts
```
---
### Example 3: Meeting Room Booking System (Phantom Causing Write Skew)
**Scenario:** A booking system checks for conflicting reservations before inserting a new booking. The check: `SELECT COUNT(*) FROM bookings WHERE room_id = $1 AND end_time > $2 AND start_time < $3`. If count = 0, insert the booking. Database is PostgreSQL with snapshot isolation (REPEATABLE READ).
**Trigger:** "We occasionally get double-booked rooms. The conflict check runs inside a transaction, so we thought it was safe."
**Process:**
- Step 1: PostgreSQL REPEATABLE READ = snapshot isolation. Not true serializable.
- Step 2: This is a write skew phantom — the transaction checks for the absence of rows, then inserts a new row matching the same condition. Two concurrent bookings both see count = 0, both proceed to insert. Neither transaction modified an existing row — they both inserted new rows that didn't exist when the check ran. Snapshot isolation does not prevent this.
- Step 3: Write skew phantom requires serializable. Snapshot isolation explicitly does NOT prevent phantoms in read-write transactions.
- Step 4: SSI evaluation: PostgreSQL SERIALIZABLE uses SSI. Booking conflicts are low-frequency per room (users don't typically book the same room simultaneously in high volume). SSI is appropriate.
- Step 5: Naming check: PostgreSQL SERIALIZABLE = SSI = true serializable. Correct.
**Output:**
```
## Transaction Isolation Recommendation
Database: PostgreSQL
Default / Current: Repeatable read (snapshot isolation) — insufficient for this pattern
Anomaly Exposure:
Pattern: SELECT count of conflicts → if zero → INSERT booking
Exposed to: Write skew phantom (two concurrent inserts both pass the zero-conflict check)
Minimum level required: Serializable
Recommendation:
Isolation Level: Serializable
Database Setting: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE
(applied to all booking transactions)
Implementation: SSI (PostgreSQL default for SERIALIZABLE)
Trade-offs Accepted:
- High-contention booking windows (popular rooms, peak hours) will see SSI aborts
- Application must retry on serialization failure (SQLSTATE 40001)
Compensating Measure (if serializable not adopted):
- Materializing conflicts: create a table of room/time-slot locks populated
ahead of time. Use SELECT FOR UPDATE on the relevant time slots before
checking for conflicts. This is complex and couples concurrency logic
into the data model — use as last resort only.
- A unique constraint on (room_id, time_slot) works for discrete time slots
but not for arbitrary time ranges.
Monitor: SSI abort rate per booking endpoint; alert if retry rate > 5%
```
---
## References
| File | Contents | When to read |
|------|----------|--------------|
| `references/anomaly-isolation-matrix.md` | Full 6×4 anomaly-to-isolation mapping matrix with per-cell explanations; database-specific implementation notes; examples for each anomaly type | When working through Step 3 or explaining anomaly coverage to a team |
| `references/serializability-implementation-comparison.md` | Side-by-side comparison of serial execution, two-phase locking, and SSI across 8 dimensions (throughput, latency, abort rate, contention behavior, deadlock risk, implementation complexity, database support, operational overhead); decision tree for selecting among them | When Step 4 selection is needed or when justifying implementation choice to a team |
| `references/write-skew-patterns.md` | Detailed catalog of 5 write skew patterns (at-least-one constraint, no-overlap, unique claim, budget enforcement, game state validity); detection checklist; SQL patterns for explicit locking mitigation per pattern | When diagnosing whether a specific transaction pattern is vulnerable to write skew |
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Designing Data-Intensive Applications by Martin Kleppmann.
## Related BookForge Skills
This skill is standalone. Browse more BookForge skills: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/anomaly-isolation-matrix.md
# Anomaly-to-Isolation Mapping Matrix
Full reference for Step 3 of the transaction-isolation-selector process.
---
## The Matrix
| Anomaly | Read Uncommitted | Read Committed | Snapshot Isolation | Serializable |
|---------|:----------------:|:--------------:|:-----------------:|:------------:|
| Dirty reads | allowed | **prevented** | prevented | prevented |
| Dirty writes | prevented | prevented | prevented | prevented |
| Read skew | allowed | allowed | **prevented** | prevented |
| Lost updates | allowed | allowed | sometimes* | **prevented** |
| Write skew | allowed | allowed | allowed | **prevented** |
| Phantom reads (read-only) | allowed | allowed | **prevented** | prevented |
| Phantom reads (write skew) | allowed | allowed | allowed | **prevented** |
*Snapshot isolation detects lost updates automatically in PostgreSQL and Oracle. MySQL InnoDB does NOT detect them automatically.
**Read this matrix:** Find your worst anomaly exposure. The column where it is first marked "prevented" is your minimum required isolation level.
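As a rough illustration, the lookup rule can be written as a small function. This is a sketch only: the dictionary below transcribes the first "prevented" column for each row of the matrix, and the names `FIRST_PREVENTED` and `minimum_level` are hypothetical. Lost updates are mapped to serializable because snapshot isolation prevents them only on some databases.

```python
# Isolation levels in increasing order of strength (the matrix columns).
LEVELS = ["read uncommitted", "read committed", "snapshot isolation", "serializable"]

# First column in which each anomaly is marked "prevented" in the matrix.
FIRST_PREVENTED = {
    "dirty writes": "read uncommitted",
    "dirty reads": "read committed",
    "read skew": "snapshot isolation",
    "phantom reads (read-only)": "snapshot isolation",
    "lost updates": "serializable",  # only "sometimes" under snapshot isolation
    "write skew": "serializable",
    "phantom reads (write skew)": "serializable",
}


def minimum_level(exposures):
    """Return the weakest level that prevents every listed anomaly, or None if none listed."""
    if not exposures:
        return None
    return max((FIRST_PREVENTED[a] for a in exposures), key=LEVELS.index)


print(minimum_level(["dirty reads", "read skew"]))
print(minimum_level(["dirty reads", "write skew"]))
```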
---
## Per-Cell Explanations
### Dirty Reads
**Read Uncommitted:** Allowed. A transaction can read uncommitted writes from other transactions. If the writing transaction aborts, the reading transaction has acted on data that never existed. Uncommon in production — almost no database defaults to this level.
**Read Committed and above:** Prevented. The database keeps the previously committed value available to readers while a write transaction is in progress. Only when the write commits do readers see the new value. Implementation: most databases use multi-version storage (keeping both old and new values) rather than requiring a read lock on every read.
---
### Dirty Writes
**All levels including Read Uncommitted:** Dirty writes are prevented at every practical isolation level. When transaction A is writing to an object, transaction B must wait until A commits or aborts before writing to the same object. Implementation: row-level write locks held until commit. This is the one anomaly that isolation levels universally prevent.
---
### Read Skew (Nonrepeatable Read)
**Read Committed:** Allowed. A transaction may see different values for the same object across two reads within the same transaction, because a concurrent transaction committed a write between the two reads. Each read is of committed data (no dirty reads), but the data is from different points in time.
**Snapshot Isolation and above:** Prevented. The transaction reads from a consistent snapshot taken at the start of the transaction. All reads within the transaction see the database as it was at that point in time, even if concurrent transactions commit writes in the interim. Implementation: multi-version concurrency control (MVCC) — the database retains multiple versions of each row, tagged with the transaction ID that created them.
**Critically affected operations:**
- Long-running backup processes (reading from multiple tables over minutes or hours)
- Integrity checks and aggregate queries that scan large portions of the database
- Multi-step reads where an early result influences a later query
---
### Lost Updates
**Read Committed:** Allowed unless the application explicitly prevents it. Two concurrent read-modify-write cycles can each read the same value, compute updates independently, and write back — the second write overwrites the first without incorporating it.
**Snapshot Isolation:** Database-dependent.
- **PostgreSQL:** Automatically detects lost updates and aborts the conflicting transaction. The transaction must be retried.
- **Oracle:** Automatically detects lost updates.
- **MySQL InnoDB:** Does NOT automatically detect lost updates. Two concurrent `UPDATE` statements where both read the old value can lose one update silently.
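- **SQL Server:** Under the SNAPSHOT isolation level (enabled with `ALLOW_SNAPSHOT_ISOLATION ON`), detects lost updates and fails the conflicting transaction with an update-conflict error. `READ_COMMITTED_SNAPSHOT` alone changes only how read committed performs its reads and does not detect them.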
**Prevention methods below serializable:**
1. Atomic operations: `UPDATE t SET v = v + 1 WHERE k = $1` — no application-layer read-modify-write; the database executes the increment atomically.
2. Explicit locking: `SELECT ... FOR UPDATE` locks the row on read, so concurrent writes and other locking reads of that row wait until the transaction completes. Plain (non-locking) reads are not blocked under MVCC.
3. Compare-and-set: `UPDATE t SET v = $new WHERE k = $1 AND v = $old` — only updates if the value hasn't changed. Caution: may not be safe if the database reads from a snapshot that doesn't reflect the latest committed value.
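A minimal compare-and-set sketch, using Python's built-in `sqlite3` module only because it runs without a server. The table `t`, the key `counter`, and the helper `compare_and_set` are illustrative assumptions; the point is that the caller checks the affected-row count to learn whether the update took effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k TEXT PRIMARY KEY, v INTEGER)")
conn.execute("INSERT INTO t VALUES ('counter', 10)")
conn.commit()


def compare_and_set(conn, key, old, new):
    """Update only if the stored value still equals `old`; report whether it did."""
    cur = conn.execute(
        "UPDATE t SET v = ? WHERE k = ? AND v = ?", (new, key, old)
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows means another writer changed the value first


first = compare_and_set(conn, "counter", 10, 11)   # value still 10: succeeds
second = compare_and_set(conn, "counter", 10, 12)  # stale expectation: no rows updated
value = conn.execute("SELECT v FROM t WHERE k = 'counter'").fetchone()[0]
print(first, second, value)
```

When the update reports zero rows, the application re-reads the current value and retries the whole read-modify-write cycle.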
---
### Write Skew
**Read Committed and Snapshot Isolation:** Allowed. This is the key gap in snapshot isolation. Two transactions each read the same precondition (a count, an existence check, an aggregate), make independent decisions, and write to different objects. Each write is individually valid, but together they violate a constraint, and neither transaction can see the violation because neither sees the other's write.
**Serializable:** Prevented. Serializable isolation detects that the two transactions' reads and writes form a dependency cycle and aborts one of them.
**Why snapshot isolation cannot prevent write skew:**
Snapshot isolation's defining property is "readers never block writers, and writers never block readers." This property is exactly what allows write skew: two transactions can observe the same state simultaneously and proceed to write independently. Preventing write skew requires that concurrent writers who read the same precondition be forced into a serial order, which requires either blocking (2PL) or detection-and-abort (SSI).
**Database-specific behavior:**
- Oracle 11g "SERIALIZABLE" = snapshot isolation. Write skew is possible despite the name.
- PostgreSQL REPEATABLE READ = snapshot isolation. Write skew is possible.
- PostgreSQL SERIALIZABLE = SSI = true serializable. Write skew is prevented.
- MySQL InnoDB SERIALIZABLE = 2PL = true serializable. Write skew is prevented.
---
### Phantom Reads
**Two forms with different prevention requirements:**
**Read-only phantom (prevented by snapshot isolation):** A transaction reads a set of rows matching a condition. A concurrent transaction inserts rows matching the same condition. Under read committed, if the first transaction re-executes the query, it sees the new rows. Under snapshot isolation, the first transaction reads from a frozen snapshot — new rows inserted after the snapshot was taken are invisible.
**Write skew phantom (NOT prevented by snapshot isolation):** A transaction reads a set of rows matching a condition (may return empty set), then inserts a row matching that same condition. A concurrent transaction does the same. Both transactions see no conflict; both insert. The concurrent inserts violate a constraint (no double-booking, no duplicate username, etc.).
Snapshot isolation cannot prevent this because:
1. The transactions are reading an empty set — there are no rows to lock with `FOR UPDATE`
2. The writes are inserts — they create new rows that didn't exist when the check ran
3. Snapshot isolation has no mechanism to detect that one transaction's write invalidates another transaction's read premise when the read found nothing
Prevention requires serializable isolation: predicate or index-range locks under 2PL, or SSI's tracking of reads so that a later conflicting write is detected.
---
## Database Default Isolation Levels
| Database | Default Level | "Repeatable Read" means | "Serializable" means |
|----------|--------------|------------------------|---------------------|
| PostgreSQL | Read committed | Snapshot isolation (MVCC; detects lost updates) | True serializable via SSI (since v9.1) |
| MySQL InnoDB | Repeatable read | Snapshot isolation (MVCC; does NOT detect lost updates automatically) | 2PL (true serializable) |
| Oracle 11g | Read committed | Snapshot isolation (Oracle calls this "serializable") | Not available — "serializable" = snapshot |
| SQL Server | Read committed | Lock-based repeatable read (snapshot isolation is a separate SNAPSHOT level, enabled with ALLOW_SNAPSHOT_ISOLATION) | 2PL (true serializable) |
| IBM DB2 | Cursor stability (≈ read committed) | Serializable (DB2 uses "repeatable read" to mean serializability) | Serializable |
| CockroachDB | Serializable | N/A (only level offered) | SSI |
| FoundationDB | Serializable | N/A (only level offered) | SSI |
---
## Minimum Level Required — Decision Table
| Worst anomaly exposure | Minimum required level |
|------------------------|----------------------|
| None (single-user system or no shared data) | None (atomicity sufficient) |
| Dirty reads possible | Read committed |
| Read skew in long-running reads (backups, analytics) | Snapshot isolation |
| Lost updates in read-modify-write cycles | Snapshot isolation (PostgreSQL/Oracle) OR explicit locking (MySQL) |
| Write skew in any transaction | Serializable |
| Write skew phantoms (check-then-insert pattern) | Serializable |
FILE:references/serializability-implementation-comparison.md
# Serializability Implementation Comparison
Reference for Step 4 of the transaction-isolation-selector process. Use when serializable isolation is required and you need to select among the three implementations.
---
## The Three Implementations
| Dimension | Serial Execution | Two-Phase Locking (2PL) | Serializable Snapshot Isolation (SSI) |
|-----------|:----------------:|:----------------------:|:-------------------------------------:|
| **Concurrency model** | None — single thread | Pessimistic — block on conflict | Optimistic — proceed, detect, abort |
| **Throughput ceiling** | Single CPU core / partition | Limited by lock contention | Near-snapshot-isolation throughput |
| **Latency profile** | Very low, very predictable | Unpredictable at high percentiles | Predictable (aborts, not waits) |
| **Read blocking** | All reads serialized | Readers block writers; writers block readers | Readers never block writers |
| **Write blocking** | All writes serialized | Writers block readers and writers | Writers never block readers |
| **Deadlocks** | None (no concurrency) | Common; database auto-detects and aborts | None from read tracking (it never blocks); engines that still take row locks on writes, such as PostgreSQL, can deadlock on writes |
| **Abort type** | None | On deadlock detection; requires retry | On serialization violation; requires retry |
| **Long-running reads** | Block all writes on that partition | Block all writes on locked rows | Safe — reads from consistent snapshot |
| **Cross-partition transactions** | ~10–1000x slower | Standard behavior | Standard behavior |
| **Dataset constraint** | Must fit in memory | No constraint | No constraint |
| **Transaction length** | Must be very short (ms) | Short preferred; long causes lock pile-ups | Short preferred; long increases abort risk |
| **Implementation complexity** | Simple (stored procedures required) | Moderate (lock management automatic) | Moderate (retry logic required in app) |
| **Database support** | VoltDB, Redis, Datomic | MySQL InnoDB, SQL Server, DB2 | PostgreSQL >= 9.1, FoundationDB |
---
## When to Use Each
### Serial Execution
Use when:
- The entire active dataset fits in RAM (disk access inside a serial transaction stalls the single thread)
- All transactions are milliseconds-fast (one slow transaction stalls every other transaction)
- Write throughput fits on a single CPU core, OR the data can be cleanly partitioned so most transactions are single-partition
- The application can express transactions as stored procedures submitted as a unit (no interactive client-server round-trips mid-transaction)
Do not use when:
- Transactions involve user interaction (a human deciding mid-transaction)
- Transactions access data not in memory
- Throughput requirements exceed a single core and data cannot be partitioned cleanly
- Transactions span many partitions
**Performance note from Kleppmann:** VoltDB reports single-partition throughput that scales linearly with CPU cores (each core gets its own partition). Cross-partition transactions are "orders of magnitude" slower — approximately 1,000 cross-partition writes/sec regardless of node count.
---
### Two-Phase Locking (2PL)
Use when:
- SSI is not available in the target database (MySQL InnoDB, SQL Server, DB2)
- Workload has moderate concurrency and short transactions
- Read/write contention on the same rows is low to moderate
Do not use when:
- Strict latency SLA (p99 < 10ms) AND high contention — lock wait queues make tail latency unbounded
- Workload has long-running transactions coexisting with short OLTP transactions — one slow transaction's locks stall all others
- Deadlock rate becomes operationally significant — each deadlock requires aborting and retrying a transaction
**Performance characterization from Kleppmann:** 2PL has "significantly worse" throughput and response times than weak isolation. Unstable latencies at high percentiles. Deadlocks frequent under 2PL (more so than under read committed). A slow transaction can cause the "rest of the system to grind to a halt."
**2PL lock mechanics:**
- Shared lock (read): multiple transactions can hold simultaneously; exclusive lock by any transaction blocks all
- Exclusive lock (write): blocks all readers and all writers
- Upgrade: transaction that reads then writes upgrades shared → exclusive
- Lock held until: commit or abort (both phases — acquire throughout, release at end)
- Predicate locks: prevent phantoms by locking a search condition (e.g., "all bookings for room 123 between noon and 1pm"), not just specific rows
- Index-range locks: practical approximation of predicate locks; attaches shared lock to an index range; slightly over-locks but much lower overhead
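The shared/exclusive compatibility rule above can be stated as a few lines of code. This is a sketch of the rule only, not of a lock manager: `can_grant` is a hypothetical helper, and `held_by_others` lists the modes other transactions currently hold on the same object.

```python
def can_grant(requested, held_by_others):
    """Return True if a lock in mode `requested` is compatible with locks others hold."""
    if requested == "shared":
        # Any number of shared holders may coexist; one exclusive holder blocks all.
        return "exclusive" not in held_by_others
    if requested == "exclusive":
        # An exclusive lock requires that no other transaction holds any lock.
        return len(held_by_others) == 0
    raise ValueError(f"unknown lock mode: {requested}")


print(can_grant("shared", ["shared", "shared"]))
print(can_grant("shared", ["exclusive"]))
print(can_grant("exclusive", ["shared"]))
print(can_grant("exclusive", []))
```

A request that cannot be granted waits, which is where 2PL's lock queues and deadlocks come from.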
---
### Serializable Snapshot Isolation (SSI)
Use when:
- The database supports SSI (PostgreSQL >= 9.1, FoundationDB)
- Workload is read-heavy — SSI's "readers never block writers" property makes read throughput nearly identical to snapshot isolation
- Contention between transactions is low to moderate — high contention causes high abort rates and retry overhead
- Transactions are short — long-running transactions accumulate more read/write tracking overhead and conflict with more concurrent transactions
Do not use when:
- The database does not support SSI
- The application cannot implement retry logic — SSI aborts at commit time with a serialization failure error (SQLSTATE 40001); the transaction must be retried from scratch
- Workload has very high contention (many concurrent writes to the same rows) — abort rate becomes high enough to dominate throughput
**SSI mechanics from Kleppmann:**
SSI detects two patterns that indicate a serialization conflict:
1. **Detecting stale MVCC reads:** SSI tracks when a transaction's read ignored another transaction's write because that write was still uncommitted when the snapshot was taken. When the reading transaction wants to commit, the database checks whether any of those ignored writes have since committed. If so, the reader's premise may be outdated, and it is aborted at commit time.
2. **Detecting writes that affect prior reads:** SSI tracks which transactions have read which key ranges (similar to index-range locks but non-blocking — they act as tripwires). When transaction B writes, it checks the tripwires to see if any concurrent transaction read the affected data. If so, SSI notifies those transactions that their read may be outdated. If the conflicting write commits before the reading transaction, the reader must abort.
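The second pattern (tripwires) can be illustrated with a toy tracker. This is a heavily simplified sketch, not how any database implements SSI: it tracks single keys rather than index ranges, ignores snapshots entirely, and the class `TripwireTracker`, the key `shift:1234`, and the transaction ids 42 and 43 are illustrative assumptions.

```python
class TripwireTracker:
    """Toy model: remember who read each key; flag readers when someone else writes it."""

    def __init__(self):
        self.readers = {}       # key -> set of transaction ids that read it
        self.suspects = {}      # reader id -> writer ids that may have invalidated its reads
        self.committed = set()  # transaction ids that have committed

    def read(self, txn, key):
        self.readers.setdefault(key, set()).add(txn)  # leave a non-blocking tripwire

    def write(self, txn, key):
        for reader in self.readers.get(key, set()):
            if reader != txn and reader not in self.committed:
                self.suspects.setdefault(reader, set()).add(txn)  # notify, do not block

    def commit(self, txn):
        # Abort if a writer that tripped one of our reads has already committed.
        if self.suspects.get(txn, set()) & self.committed:
            return False
        self.committed.add(txn)
        return True


ssi = TripwireTracker()
ssi.read(42, "shift:1234")   # both transactions check who is on call
ssi.read(43, "shift:1234")
ssi.write(42, "shift:1234")  # each then takes one doctor off call
ssi.write(43, "shift:1234")
first_commit = ssi.commit(42)
second_commit = ssi.commit(43)
print(first_commit, second_commit)
```

Neither transaction waits at any point. The first to commit succeeds, and the second is told at commit time that its read premise is stale and must retry.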
**SSI vs 2PL key difference:** 2PL blocks: a transaction waits for a lock. SSI detects and aborts: a transaction proceeds and is checked at commit time. Under SSI, read-only transactions read from a consistent snapshot and never block writers. They are not always immune to aborts, though: PostgreSQL can cancel a read-only serializable transaction unless it runs as `READ ONLY DEFERRABLE`, which waits for a safe snapshot before starting. This makes SSI particularly attractive for mixed read-write and analytics workloads.
**PostgreSQL SSI since v9.1:** PostgreSQL uses theory from Michael Cahill's research on serializable snapshot isolation (SIGMOD 2008 paper; PhD thesis 2009) to reduce unnecessary aborts. It can sometimes prove that a conflict would not actually violate serializability and allow the transaction to commit. This reduces the abort rate below what a naive SSI implementation would produce.
---
## Abort / Retry Requirements
Both 2PL (on deadlock) and SSI (on serialization conflict) require the application to retry transactions. This is not automatic in most frameworks.
**Conditions that require retry (OK to retry):**
- Deadlock detected (2PL): retry the aborted transaction
- Serialization failure (SSI, SQLSTATE 40001): retry the entire transaction from the beginning
- Transient network errors
**Conditions that do NOT warrant retry:**
- Constraint violations (duplicate key, foreign key failure): a retry without changing the data will fail again
- Business logic errors (insufficient balance): a retry will produce the same result
- Permanent errors
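One way to encode the retry/no-retry split is by SQLSTATE. The sketch below uses PostgreSQL's error codes (`40001` serialization_failure, `40P01` deadlock_detected, `23505` unique_violation, `23503` foreign_key_violation); other databases report different codes, and the helper name `should_retry` is hypothetical. Transient network errors are left out because whether a retry is safe there depends on whether the commit outcome is known.

```python
# PostgreSQL SQLSTATE codes; other databases use different values.
RETRYABLE = {
    "40001": "serialization_failure",
    "40P01": "deadlock_detected",
}
NOT_RETRYABLE = {
    "23505": "unique_violation",
    "23503": "foreign_key_violation",
}


def should_retry(sqlstate):
    """Retry only errors caused by concurrency, never by the data itself."""
    return sqlstate in RETRYABLE


for code in ("40001", "40P01", "23505", "23503"):
    print(code, should_retry(code))
```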
**ORM framework warning:** Rails' ActiveRecord and Django's ORM do not retry aborted transactions by default. A serialization failure typically bubbles up as an exception to the user. Applications that adopt SSI must implement explicit retry loops around transaction boundaries. The retry must re-execute the entire transaction (re-read all data, not reuse cached reads from the first attempt).
**Retry pattern:**
```python
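import time
from psycopg2 import errors  # assumes psycopg2 >= 2.8 against PostgreSQL

MAX_RETRIES = 5

def run_serializable(conn, perform_transaction):
    """Run perform_transaction(cursor) under SERIALIZABLE, retrying on SQLSTATE 40001."""
    conn.set_session(isolation_level="SERIALIZABLE")
    for attempt in range(MAX_RETRIES):
        try:
            with conn:  # commits on success, rolls back on exception
                with conn.cursor() as cur:
                    return perform_transaction(cur)  # all reads and writes happen here
        except errors.SerializationFailure:
            if attempt == MAX_RETRIES - 1:
                raise  # give up after MAX_RETRIES attempts
            time.sleep(0.01 * 2 ** attempt)  # brief backoff, then retry from scratch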
```
---
## Decision Tree
```
Serializable isolation is required.
Q1: Does the database support SSI?
(PostgreSQL >= 9.1, FoundationDB)
→ Yes: Go to Q2
→ No: Go to Q3
Q2: Is the workload read-heavy with low-to-moderate contention?
→ Yes: Use SSI
→ No (high contention, many aborts expected): Go to Q3
Q3: Is the dataset in memory, throughput fits a single core,
and transactions are short stored procedures?
→ Yes: Use Serial Execution
→ No: Use Two-Phase Locking (2PL)
Q4 (after selecting 2PL): Is there a strict latency SLA with high contention?
→ Yes: Consider partitioning the data to enable per-partition serial execution,
OR accept SSI with retry overhead as the lesser of two evils,
OR narrow the serializable scope to specific high-risk transactions only
and use explicit SELECT FOR UPDATE elsewhere
```
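Questions Q1 to Q3 of the tree can be sketched as a function. The function name and boolean parameters are hypothetical, and Q4 is omitted because it yields follow-up options rather than a single choice.

```python
def choose_implementation(db_supports_ssi, read_heavy_low_contention, fits_serial_execution):
    """Walk Q1-Q3 of the decision tree and return the suggested implementation."""
    # Q1 and Q2: SSI needs database support and a workload with tolerable abort rates.
    if db_supports_ssi and read_heavy_low_contention:
        return "SSI"
    # Q3: in-memory dataset, single-core throughput, short stored procedures.
    if fits_serial_execution:
        return "serial execution"
    return "2PL"


print(choose_implementation(True, True, False))
print(choose_implementation(False, True, True))
print(choose_implementation(True, False, False))
```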
FILE:references/write-skew-patterns.md
# Write Skew Patterns
Reference for Step 2 of the transaction-isolation-selector process. Use when diagnosing whether a specific transaction pattern is vulnerable to write skew.
---
## Write Skew Detection Checklist
A transaction is vulnerable to write skew if ALL of the following are true:
- [ ] The transaction executes a SELECT query that checks a condition (count, existence, sum, aggregate)
- [ ] Application code makes a decision based on the result of that query
- [ ] The transaction executes a write (INSERT, UPDATE, or DELETE) based on that decision
- [ ] The write changes data that affects the condition checked in the SELECT
- [ ] Another transaction could execute the same pattern concurrently
**If all five are true:** The transaction is vulnerable to write skew. Snapshot isolation will not prevent it. Serializable isolation is required, or explicit `SELECT FOR UPDATE` can be used as a mitigation when the rows being checked exist in advance.
**The phantom variant:** If the write is an INSERT (adding a new row that matches the checked condition), there are no rows to lock with `SELECT FOR UPDATE`. Serializable isolation is the only general fix for this form; a unique constraint or materialized conflicts cover specific cases.
---
## The 5 Write Skew Patterns
### Pattern 1: At-Least-One Constraint
**Description:** A resource requires at least one unit to remain active. Concurrent transactions each verify the count is >= 2 (safe to remove one), then each remove one.
**Domain examples:**
- Hospital: at least one doctor on call per shift
- Customer support: at least one agent assigned to a support queue
- Infrastructure: at least one replica must remain running
**Transaction structure:**
```sql
BEGIN;
SELECT COUNT(*) FROM doctors
WHERE on_call = true AND shift_id = $shift_id;
-- Application checks: if count >= 2, proceed
UPDATE doctors SET on_call = false
WHERE name = $doctor AND shift_id = $shift_id;
COMMIT;
```
**Race condition:** Two concurrent transactions both read count = 2. Both proceed. Both update their own row. Count becomes 0.
**Under snapshot isolation:** Both transactions read from their own consistent snapshot showing count = 2. Both see a valid precondition. Both commit. The combined result violates the constraint.
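A toy model of this behavior, assuming a dictionary copy is an adequate stand-in for a snapshot. The names `request_off_call`, `alice`, and `bob` are illustrative; each request reads the count from its own snapshot and writes only its own row in the shared state.

```python
def request_off_call(snapshot, db, name):
    """Check the precondition on the snapshot, then write one row to the live data."""
    on_call = sum(snapshot.values())  # SELECT COUNT(*) ... WHERE on_call = true
    if on_call >= 2:
        db[name] = False              # UPDATE doctors SET on_call = false WHERE name = ...
        return True
    return False


# Concurrent case: both snapshots are taken before either write.
doctors = {"alice": True, "bob": True}
snap_a, snap_b = dict(doctors), dict(doctors)
a_ok = request_off_call(snap_a, doctors, "alice")
b_ok = request_off_call(snap_b, doctors, "bob")
remaining_concurrent = sum(doctors.values())

# Serial case: the second snapshot is taken after the first write.
doctors = {"alice": True, "bob": True}
a_serial = request_off_call(dict(doctors), doctors, "alice")
b_serial = request_off_call(dict(doctors), doctors, "bob")
remaining_serial = sum(doctors.values())

print(remaining_concurrent, remaining_serial)
```

In the concurrent case both requests succeed and nobody remains on call. In the serial case the second request is refused and one doctor remains, which is the outcome serializable isolation guarantees.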
**Mitigation (without serializable):**
```sql
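SELECT name FROM doctors
WHERE on_call = true AND shift_id = $shift_id
FOR UPDATE;
-- Application counts the returned rows: if count >= 2, proceed
```
This locks all on-call rows for the shift. Selecting the rows and counting them in the application keeps the statement portable: PostgreSQL rejects `FOR UPDATE` combined with an aggregate such as `COUNT(*)`, while MySQL accepts it. A second concurrent transaction must wait for the first to commit before it can read the rows. After the first commits (count = 1), the second reads count = 1 and the application declines the change.
**Mitigation limitation:** `FOR UPDATE` locks every row the query returns. If the shift has 50 doctors on call, this locks all 50 rows for the duration of the transaction. Consider whether this lock scope is acceptable.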
---
### Pattern 2: No-Overlap Constraint
**Description:** Resources must not overlap in some dimension (time, space, ID range). A transaction checks for absence of overlap, then inserts a non-overlapping resource.
**Domain examples:**
- Meeting room booking: no two bookings for the same room in the same time window
- Flight seat assignment: no two passengers assigned the same seat
- IP address allocation: no two servers assigned the same IP
**Transaction structure:**
```sql
BEGIN;
SELECT COUNT(*) FROM bookings
WHERE room_id = $room AND
end_time > $start AND start_time < $end;
-- Application checks: if count = 0, proceed
INSERT INTO bookings (room_id, start_time, end_time, user_id)
VALUES ($room, $start, $end, $user);
COMMIT;
```
**Race condition:** Two transactions both query count = 0 for the same room and time range. Both insert a booking. Result: double-booked room.
**Under snapshot isolation:** Both transactions read from consistent snapshots that do not include each other's in-flight insert. Both see count = 0. Both insert. Both commit.
**Why `FOR UPDATE` does not help:** The SELECT returns zero rows (no conflicts found). `FOR UPDATE` on an empty result set locks nothing — there are no rows to lock. This is the phantom variant of write skew.
**Mitigation (without serializable):**
- **Materializing conflicts:** Create a table of time slots ahead of time (e.g., 15-minute slots for each room for the next 6 months). Lock the relevant slots with `SELECT FOR UPDATE` before checking for conflicts. Ugly but functional.
- **Unique constraint:** If bookings can be modeled as discrete units (e.g., seat numbers on a flight), a UNIQUE constraint on the pair that must not repeat, such as (flight_id, seat_number), lets the database enforce no-overlap with a constraint violation rather than application logic.
- **Serializable isolation:** The clean solution.
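Two pieces of the above lend themselves to a short sketch: the overlap predicate used in the conflict check, and the list of slots that a materialized-conflicts table would need to lock. Both helpers (`overlaps`, `slots_for`) are hypothetical, and `slots_for` assumes the booking starts on a slot boundary.

```python
from datetime import datetime, timedelta


def overlaps(existing_start, existing_end, new_start, new_end):
    """Same condition as the SQL check: end_time > $start AND start_time < $end."""
    return existing_end > new_start and existing_start < new_end


def slots_for(start, end, minutes=15):
    """Slot start times covering [start, end); assumes `start` is slot-aligned."""
    step = timedelta(minutes=minutes)
    slots = []
    slot = start
    while slot < end:
        slots.append(slot)
        slot += step
    return slots


noon = datetime(2024, 1, 15, 12, 0)
one_pm = datetime(2024, 1, 15, 13, 0)
half_past = datetime(2024, 1, 15, 12, 30)
two_pm = datetime(2024, 1, 15, 14, 0)

print(overlaps(noon, one_pm, half_past, two_pm))  # partially overlapping bookings
print(overlaps(noon, one_pm, one_pm, two_pm))     # back-to-back bookings
print(len(slots_for(noon, one_pm)))               # slot rows to lock for one hour
```

With materialized conflicts, the transaction would lock the slot rows returned by `slots_for` with `SELECT FOR UPDATE` before running the overlap check, so two bookings for the same window contend on rows that actually exist.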
---
### Pattern 3: Unique Claim
**Description:** A user claims a unique resource (username, identifier, role). The transaction checks for non-existence, then creates the claim.
**Domain examples:**
- Username registration: check username not taken, then create account
- Document lock: check document not locked, then lock it
- Prize claim: check prize not claimed, then record claim
**Transaction structure:**
```sql
BEGIN;
SELECT COUNT(*) FROM users WHERE username = $name;
-- Application checks: if count = 0, proceed
INSERT INTO users (username, email, ...) VALUES ($name, $email, ...);
COMMIT;
```
**Race condition:** Two concurrent registrations for the same username both see count = 0 and both insert. Result: two accounts with the same username.
**Mitigation:** A UNIQUE constraint on the `username` column is the correct solution here. The second INSERT will fail with a constraint violation. This is one case where a database constraint can enforce the invariant without requiring serializable isolation.
**Rule of thumb:** If the write skew pattern involves inserting a single "canonical" value that must be globally unique, a UNIQUE constraint is the right tool. If the constraint is more complex (involving multiple rows or aggregates), serializable isolation is needed.
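A minimal demonstration of the constraint doing the work, using Python's built-in `sqlite3` module only because it needs no server. The `users` schema and the `register` helper are illustrative assumptions; the same idea applies to any database that enforces UNIQUE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT UNIQUE, email TEXT)")


def register(conn, username, email):
    """Insert without a prior existence check; let the constraint reject duplicates."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO users (username, email) VALUES (?, ?)",
                (username, email),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # constraint violation: do not retry, report "username taken"


first = register(conn, "alice", "alice@example.com")
second = register(conn, "alice", "other@example.com")
count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(first, second, count)
```

The check-then-insert race disappears because there is no separate check: the insert itself is the test.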
---
### Pattern 4: Budget / Sum Constraint
**Description:** The sum of some values must remain within a bound (positive, below a limit). A transaction reads the current sum, verifies the constraint is not violated, then adds a new value.
**Domain examples:**
- Spending limit: verify total spending + new purchase <= credit limit
- Inventory allocation: verify allocated units + new allocation <= available stock
- Double-spend prevention: verify account balance - new spend >= 0
**Transaction structure:**
```sql
BEGIN;
SELECT SUM(amount) FROM spending WHERE user_id = $user;
-- Application checks: if sum + new_amount <= limit, proceed
INSERT INTO spending (user_id, amount, description)
VALUES ($user, $new_amount, $desc);
COMMIT;
```
**Race condition:** Two concurrent $80 purchases both read sum = $900 against a $1000 limit. Each sees $900 + $80 = $980 <= $1000 and inserts. Total spending becomes $1060, which is over the limit.
**Under snapshot isolation:** Both read from snapshots that predate each other's insert. Both see $900. Both pass the constraint check. Both insert.
**Mitigation (without serializable):**
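```sql
-- Lock the user's existing rows, then aggregate in a second statement.
-- (PostgreSQL rejects FOR UPDATE combined with an aggregate such as SUM.)
SELECT 1 FROM spending WHERE user_id = $user FOR UPDATE;
SELECT SUM(amount) FROM spending WHERE user_id = $user;
```
The first statement locks all existing spending rows for the user, so a second concurrent transaction must wait. After the first commits, the second computes the sum, which now includes the first transaction's insert. This works because the rows being aggregated exist (they are the user's existing spending records) and can therefore be locked, even though the write itself is a new INSERT. It also relies on the second transaction's `SUM` seeing the newly committed row, which holds when each statement reads the latest committed data (for example, at read committed); a transaction pinned to an older snapshot would still compute the stale sum.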
**Note:** This is the case where `FOR UPDATE` works on a write skew pattern because the read is over existing rows (not checking for absence of rows).
---
### Pattern 5: Game / Validity State
**Description:** A transition is valid only if the current state satisfies some condition. Multiple transactions each verify validity and each apply a transition to different parts of the state.
**Domain examples:**
- Chess: verify a move is valid (piece at correct position, move is legal), then update piece position
- Workflow: verify a document is in the correct state (e.g., "draft"), then transition it to "review"
- Two-player resource allocation: verify total allocations <= capacity, then add one allocation each
**Transaction structure:**
```sql
BEGIN;
SELECT position FROM figures
WHERE name = $piece AND game_id = $game;
-- Application validates move legality based on position
UPDATE figures SET position = $new_position
WHERE name = $piece AND game_id = $game;
COMMIT;
```
**Race condition:** Two concurrent moves of different pieces both pass validity checks based on the game state at the time of their read. One move can invalidate the other's precondition, but neither transaction sees the other's write.
**Mitigation (without serializable):**
```sql
SELECT * FROM figures
WHERE name = $piece AND game_id = $game
FOR UPDATE;
```
Locks the specific piece being moved. If validity depends on other pieces' positions, those must also be locked with `FOR UPDATE`. This works when the rows being checked exist. The chess example in Kleppmann uses this approach for the specific piece being moved — but two players moving different pieces to the same square requires serializable isolation or a unique constraint on (game_id, position).
---
## SQL Pattern Reference
### When SELECT FOR UPDATE Works
`FOR UPDATE` is effective when:
- The rows checked in the precondition SELECT exist in the database
- The write updates those same rows (or rows locked by the same query)
- The transaction is not checking for the absence of rows
```sql
-- WORKS: Lock specific rows that exist
SELECT * FROM doctors WHERE on_call = true AND shift_id = $shift FOR UPDATE;
UPDATE doctors SET on_call = false WHERE name = $name AND shift_id = $shift;
-- WORKS: Lock the existing rows, then aggregate over them
-- (PostgreSQL rejects FOR UPDATE combined with an aggregate such as SUM)
SELECT 1 FROM spending WHERE user_id = $user FOR UPDATE;
SELECT SUM(amount) FROM spending WHERE user_id = $user;
INSERT INTO spending (user_id, amount) VALUES ($user, $amount);
```
### When SELECT FOR UPDATE Does NOT Work
`FOR UPDATE` does not prevent write skew when:
- The SELECT returns no rows (checking for absence)
- The write is an INSERT that creates a new row matching the checked condition
```sql
-- DOES NOT WORK: Checking for absence, inserting if absent
SELECT COUNT(*) FROM bookings
WHERE room_id = $room AND overlap($start, $end) FOR UPDATE;
-- If count = 0, there are no rows to lock
INSERT INTO bookings (...) VALUES (...);
```
### Serializable Setting by Database
```sql
-- PostgreSQL
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- or: SET default_transaction_isolation = 'serializable';
-- MySQL
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- or: SET GLOBAL TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- SQL Server
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- Oracle: not available as true serializable — use application-level locking
-- Oracle's SERIALIZABLE = snapshot isolation (write skew still possible)
```
### Application Retry for SSI Aborts (PostgreSQL)
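```python
from psycopg2 import errors


def execute_with_retry(conn, transaction_fn, max_retries=5):
    """Run transaction_fn(cursor) under SERIALIZABLE, retrying on SSI aborts."""
    for attempt in range(max_retries):
        try:
            with conn.cursor() as cur:
                # psycopg2 begins the transaction implicitly on the first
                # execute (autocommit off), so set the level with
                # SET TRANSACTION rather than a second BEGIN.
                cur.execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")
                result = transaction_fn(cur)
            conn.commit()  # a serialization failure can also surface here
            return result
        except errors.SerializationFailure as exc:
            conn.rollback()
            if attempt == max_retries - 1:
                raise RuntimeError("Transaction failed after max retries") from exc
        except Exception:
            conn.rollback()
            raise
```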
---
name: stream-processing-designer
description: |
Design a stream processing system for unbounded, continuously arriving data. Use when choosing a message broker (Kafka vs RabbitMQ), implementing change data capture (CDC) from PostgreSQL, MySQL, or MongoDB via Debezium or Maxwell, selecting window types for aggregation (tumbling, hopping, sliding, session), joining event streams or enriching events from a table, or configuring exactly-once fault tolerance. Trigger phrases: "should I use Kafka or RabbitMQ?", "how do I sync my database to Elasticsearch in real time?", "how do I implement CDC for Postgres?", "how do I get exactly-once semantics in Flink or Kafka Streams?", "should I use Lambda or Kappa architecture?", "how do I keep derived data systems in sync without dual writes?", "how do I join two event streams?". Covers log-based vs. traditional broker selection, four window types, three join types (stream-stream, stream-table, table-table), CDC bootstrap strategy, and microbatching vs. checkpointing trade-offs. Does not apply to bounded offline datasets (see batch-pipeline-designer) or multi-store integration architecture (see data-integration-architect).
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/designing-data-intensive-applications/skills/stream-processing-designer
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on: [encoding-format-advisor]
source-books:
- id: designing-data-intensive-applications
title: "Designing Data-Intensive Applications"
authors: ["Martin Kleppmann"]
chapters: [11]
tags: [stream-processing, kafka, message-broker, log-based-broker, change-data-capture, cdc, event-sourcing, windowing, tumbling-window, hopping-window, sliding-window, session-window, stream-join, stream-table-join, fault-tolerance, microbatching, checkpointing, idempotent, exactly-once, lambda-architecture, kappa-architecture, debezium, maxwell, flink, spark-streaming, kafka-streams, event-time, processing-time]
execution:
tier: 2
mode: hybrid
inputs:
- type: document
description: "Event stream description, system architecture document (architecture.md), docker-compose.yml showing current infrastructure, or requirements document describing latency/throughput targets"
- type: code
description: "Application source files showing current data access patterns, existing pipeline code, or schema definitions for events"
tools-required: [Read, Write]
tools-optional: [Grep]
mcps-required: []
environment: "Any agent environment. Works with pasted system descriptions, architecture.md, docker-compose.yml, or codebase analysis. Codebase helps identify existing databases and their types."
discovery:
goal: "Produce a concrete stream processing architecture recommendation: broker type, CDC tool (if applicable), window type(s), join type(s), fault tolerance strategy — with trade-offs documented"
tasks:
- "Classify the event source (user activity, database changes, sensors, application events) to determine entry point"
- "Select broker type (log-based vs. traditional) based on replay, ordering, and throughput requirements"
- "If syncing from a database, select CDC tool per database type and design the bootstrap strategy"
- "Select window type(s) for aggregation requirements using the four-type framework"
- "Select join type(s) for enrichment/correlation requirements using the three-type framework"
- "Choose fault tolerance strategy based on latency requirements and output side-effect profile"
- "Apply the end-to-end exactly-once argument to the full pipeline, not just the stream processor"
audience:
roles: ["backend-engineer", "software-architect", "data-engineer", "tech-lead", "site-reliability-engineer"]
experience: "intermediate-to-advanced — assumes familiarity with databases, message queues, and building data pipelines"
triggers:
- "Team is building a real-time data pipeline and needs to choose between message broker options"
- "Application writes to a primary database and needs derived systems (search index, cache, data warehouse) kept in sync without dual writes"
- "Engineer needs to add windowed aggregations (hourly counts, rolling averages) to a stream processor"
- "Pipeline needs to join a stream of events with a reference dataset that changes over time"
- "Team is experiencing inconsistency between the primary database and a derived data store"
- "System needs to tolerate stream processor failures without reprocessing all historical data from the start"
- "Team is debating Lambda vs. Kappa architecture for a new data platform"
- "Application needs event sourcing and the team must design the event store and projection strategy"
not_for:
- "Batch pipeline design for bounded datasets — use batch-pipeline-designer"
- "Choosing the data model for event storage (relational vs. document) — use data-model-selector first"
- "Encoding format for events on the wire — use encoding-format-advisor"
- "Replication strategy for the source database itself — use replication-strategy-selector"
---
# Stream Processing Designer
## When to Use
You are building or evaluating a system that processes data continuously as it arrives — not in periodic batch runs — and need to make concrete decisions about message transport, windowed aggregation, stream joins, and fault tolerance.
This skill applies when:
- You need to keep derived data systems (search index, cache, data warehouse, analytics) in sync with a primary database in near-real time
- You are processing a stream of events with time-based aggregations (counts per minute, rolling averages, session analytics)
- You need to join two or more event streams, or enrich a stream with data from a reference table
- You are choosing between Kafka and a traditional message broker (RabbitMQ, ActiveMQ, Amazon SQS)
- You need to implement change data capture for PostgreSQL, MySQL, MongoDB, or Oracle
- You are designing event sourcing and need to understand how it differs from CDC
- You need to reason about exactly-once semantics end-to-end, not just within the stream processor
**This skill produces architecture decisions, not code.** For encoding format selection (Avro vs. Protobuf for event schemas), use `encoding-format-advisor` first. For batch pipeline design, use `batch-pipeline-designer`. For overall data integration across multiple systems, use `data-integration-architect`.
---
## Context and Input Gathering
Before applying the frameworks, collect:
### Required
- **Event source type:** Where do events come from? User actions (clicks, purchases), database writes, sensor readings, or application state changes?
- **Processing requirements:** What transformation is needed? Filtering, aggregation, enrichment (joining with reference data), pattern detection, or materialized view maintenance?
- **Latency requirements:** How stale can the output be? Sub-second, seconds, minutes, or hours?
- **Exactly-once requirement:** Does incorrect duplicate processing cause visible harm (double-charging a customer, double-counting inventory)? Or is it tolerable (approximate metrics)?
### Important
- **Consumer count and patterns:** How many downstream consumers read the same stream? Do they need to read independently, at their own pace, or replay past data?
- **Message ordering requirements:** Must messages be processed in the order they were produced? Per-key ordering, or global ordering?
- **Output side effects:** Does processing write to external systems (databases, email services, payment APIs)? This determines whether framework-level exactly-once is sufficient.
- **Existing infrastructure:** What databases, brokers, and stream processors already exist in the environment?
### Useful but Optional
- **Event volume:** Events per second at peak — this affects partitioning and consumer parallelism decisions
- **Message size distribution:** Mostly small events, or large payloads? Affects broker memory configuration
- **Retention requirement:** How long must past events be replayable? Affects log-based broker configuration
---
## Process
### Step 1 — Select Broker Type
**WHY:** The choice between a log-based and traditional message broker determines whether consumers can replay events, whether multiple consumers can read independently, and what happens when a consumer falls behind. Getting this wrong forces an expensive migration later.
Use the broker selection framework in `references/broker-selection-framework.md`. Key decision signals:
**Choose a log-based broker (Kafka, Amazon Kinesis, Apache Pulsar) when:**
- Multiple independent consumers need to read the same events (fan-out without coupling)
- Consumers may need to replay past events — for debugging, reprocessing after a bug fix, or bootstrapping a new derived data system
- Message ordering within a partition is important (log-based brokers give total order within a partition)
- High throughput with many small messages (log-based brokers write every message to disk regardless; throughput is constant and predictable)
- The stream processor may restart and needs to resume from where it left off (consumer offsets)
**Choose a traditional broker (RabbitMQ, ActiveMQ, Amazon SQS, Azure Service Bus) when:**
- Each message should be processed by exactly one consumer, and load balancing across consumers is the primary concern
- Messages are expensive to process individually and need fine-grained per-message acknowledgment with arbitrary redelivery
- Message ordering is not critical and you want per-message parallelism (traditional brokers assign individual messages to consumers; log-based brokers assign whole partitions)
- The team already operates one and the system does not need replay or fan-out
**Critical difference in replay behavior:** In a log-based broker, consuming a message is a read-only operation — the consumer's offset advances, but the log is unchanged. You can reprocess by resetting the offset. In a traditional broker, acknowledgment deletes the message — reprocessing is impossible unless you saved it elsewhere.
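The replay difference can be made concrete with two toy brokers. This is a minimal in-memory sketch; `LogBroker`, `QueueBroker`, and their methods are hypothetical names for illustration and do not correspond to any real client API.

```python
from collections import deque


class LogBroker:
    """Append-only log: reading never removes messages."""

    def __init__(self):
        self.log = []
        self.offsets = {}  # consumer name -> next offset to read

    def append(self, message):
        self.log.append(message)

    def poll(self, consumer):
        """Return unread messages and advance this consumer's offset."""
        start = self.offsets.get(consumer, 0)
        self.offsets[consumer] = len(self.log)
        return self.log[start:]

    def seek(self, consumer, offset):
        """Move a consumer's offset, for example back to 0 to replay."""
        self.offsets[consumer] = offset


class QueueBroker:
    """Traditional queue: an acknowledged message is deleted."""

    def __init__(self):
        self.queue = deque()

    def append(self, message):
        self.queue.append(message)

    def receive_and_ack(self):
        """Deliver the next message and delete it, or return None if empty."""
        return self.queue.popleft() if self.queue else None


log = LogBroker()
queue = QueueBroker()
for event in ["a", "b", "c"]:
    log.append(event)
    queue.append(event)

first_pass = log.poll("indexer")
log.seek("indexer", 0)  # e.g. after fixing a bug in the indexer
second_pass = log.poll("indexer")

drained = [queue.receive_and_ack() for _ in range(3)]
after_drain = queue.receive_and_ack()
```

Resetting the offset yields the same three messages again from the log, while the queue has nothing left to redeliver once every message is acknowledged.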
### Step 2 — Design the Event Source (CDC or Direct Production)
**WHY:** Two patterns exist for getting events into the stream: direct event production (application code writes to the broker) and change data capture (CDC, where the database's replication log is tapped). Each has different consistency guarantees, and the wrong choice exposes you to the dual-write race condition, in which two systems silently and permanently diverge.
**Avoid dual writes:** Writing to both the database and the broker in application code creates a race condition. Two concurrent writers can reach the database and broker in opposite orders, leaving the systems permanently inconsistent — with no error and no detection. See the race condition in the "Keeping Systems in Sync" section of the source chapter (page 452).
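The divergence is easy to reproduce on paper. The sketch below is a simplified model, assuming last-write-wins stores represented as dicts; `apply_writes` and the variable names are illustrative only.

```python
def apply_writes(store, writes):
    """Apply (key, value) writes in arrival order; the last write wins."""
    for key, value in writes:
        store[key] = value
    return store


write_a = ("x", "A")  # from client 1
write_b = ("x", "B")  # from client 2

# Network timing delivers the two writes in opposite orders to each system.
database = apply_writes({}, [write_a, write_b])
search_index = apply_writes({}, [write_b, write_a])
diverged = database != search_index

# With CDC, the index replays the database's own commit order instead.
change_log = [write_a, write_b]
cdc_index = apply_writes({}, change_log)
```

Neither dual write reports an error, yet the database ends with `B` and the index with `A`. Deriving the index from the database's change log removes the second, independently ordered write path.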
**When to use CDC:**
- The application already uses a mutable relational database (PostgreSQL, MySQL, Oracle) as the system of record
- You need to sync derived systems (search index, cache, data warehouse) from the database
- You cannot change the application code to produce events directly
- You need strong ordering guarantees (CDC preserves the database's write order)
**Per-database CDC tool mapping:**
| Database | Tool | Mechanism | Notes |
|---|---|---|---|
| PostgreSQL | Debezium, Bottled Water | Logical decoding of the WAL (write-ahead log) | Both build on PostgreSQL's logical decoding API; Debezium reads changes through a logical replication slot |
| MySQL | Debezium, Maxwell | binlog parsing | Both parse the MySQL binary log; Maxwell is simpler, Debezium has broader ecosystem |
| MongoDB | Debezium, Mongoriver | oplog tailing | MongoDB oplog is a capped collection; must be large enough to survive processing delays |
| Oracle | GoldenGate | LogMiner / proprietary | Requires Oracle licensing; GoldenGate is the standard production choice |
| Any | Kafka Connect | Plugin-based | Kafka Connect wraps CDC tools; use when events need to land in Kafka topics |
**CDC bootstrap strategy (initial snapshot):**
A CDC pipeline that taps the replication log only captures changes from the time it starts. For rebuilding a derived system from scratch, you need the full historical state.
1. Take a consistent snapshot of the database (e.g., `pg_dump` with a known log sequence number, or a tool-integrated snapshot)
2. Record the log offset at snapshot time
3. Load the snapshot into the derived system
4. Start the CDC consumer from that recorded offset — this ensures no changes are missed between snapshot and live CDC
Some CDC tools (Debezium) handle this bootstrap automatically. For others, it is a manual operation. Confirm before assuming automation.
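The four bootstrap steps can be sketched as follows. This is a toy model under stated assumptions: the change log is a Python list of `(key, value)` pairs with `None` meaning delete, and `bootstrap_derived_store` is a hypothetical helper, not part of any CDC tool.

```python
def bootstrap_derived_store(snapshot, snapshot_offset, change_log):
    """Load a snapshot, then apply every change from snapshot_offset onward."""
    derived = dict(snapshot)  # step 3: load the snapshot
    for key, value in change_log[snapshot_offset:]:  # step 4: resume from offset
        if value is None:
            derived.pop(key, None)  # delete
        else:
            derived[key] = value  # insert or update
    return derived


change_log = [
    ("p1", "lamp"),       # offset 0
    ("p2", "chair"),      # offset 1
    ("p1", "desk lamp"),  # offset 2, written after the snapshot
    ("p2", None),         # offset 3, written after the snapshot
]
snapshot = {"p1": "lamp", "p2": "chair"}  # step 1: state as of offset 2
snapshot_offset = 2                        # step 2: offset recorded with it

derived = bootstrap_derived_store(snapshot, snapshot_offset, change_log)
```

Starting the consumer at the recorded offset is what guarantees that the two changes made after the snapshot are not lost.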
**Log compaction as an alternative to periodic snapshots:** If using a log-based broker with log compaction enabled, the compacted topic always contains the most recent value for every key. A new derived system can bootstrap by reading the compacted topic from offset 0, then switch to live CDC without taking a database snapshot.
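As a rough illustration of why a compacted topic is enough to bootstrap from, the sketch below compacts a small log and compares it with a full replay. It assumes a value of `None` is a tombstone and drops tombstones immediately, which is a simplification; `compact` and `replay` are hypothetical helpers.

```python
def compact(log):
    """Keep only the latest record per key, in log order; drop tombstones."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_, value) in survivors if value is not None]


def replay(log):
    """Rebuild current state by applying every record in order."""
    state = {}
    for key, value in log:
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state


log = [("u1", "v1"), ("u2", "v1"), ("u1", "v2"), ("u2", None), ("u3", "v1")]
compacted = compact(log)
```

Replaying the compacted log gives the same state as replaying the full one, with fewer records to read.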
**When to use event sourcing instead of CDC:**
- The application is being designed from scratch and domain events are first-class
- Business events ("order cancelled", "seat reserved") are more meaningful than database row changes
- You need the full event history for audit, compliance, or behavioral analytics — not just the current state
**Key difference:** CDC operates at the infrastructure level (database internals); event sourcing operates at the application level (explicit business facts). In CDC, the application writes to a mutable database and the log is extracted afterward. In event sourcing, the application writes immutable events to an event log first, and current state is derived from the log. The event log is the system of record.
**Commands vs. events in event sourcing:** A command is a request that may fail (validation, constraint checks). When accepted, it becomes an immutable event. Validate commands synchronously before committing them as events — once an event is written to the log, downstream consumers cannot reject it.
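A minimal sketch of the command/event split, using the seat-reservation example. The `SeatReservations` class and its fields are assumptions made for illustration; a real system would persist the log durably.

```python
class SeatReservations:
    def __init__(self, seats):
        self.available = set(seats)
        self.event_log = []  # append-only list of accepted facts

    def handle_reserve(self, seat, customer):
        """Command handler: validate first; only accepted commands become events."""
        if seat not in self.available:
            return False  # rejected command, nothing is written to the log
        self.event_log.append(
            {"type": "seat_reserved", "seat": seat, "customer": customer}
        )
        self.available.discard(seat)
        return True


reservations = SeatReservations({"12A"})
first = reservations.handle_reserve("12A", "alice")
second = reservations.handle_reserve("12A", "bob")
```

The second command fails validation and leaves no trace in the log, so downstream consumers only ever see facts that actually happened.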
### Step 3 — Select Window Type(s)
**WHY:** Windows bound an otherwise infinite stream so aggregations are computable. The wrong window type either produces results that don't match business intent (tumbling where smoothing is needed), consumes excessive memory (sliding on high-volume streams), or fails to capture session behavior (any fixed window for user activity).
Use the window type reference in `references/window-type-selection.md`. Decision framework:
**Tumbling window** — Fixed-length, non-overlapping. Every event belongs to exactly one window.
- Use when: Producing periodic reports (hourly totals, daily counts, per-minute error rates)
- Implement: Round event timestamp down to nearest window boundary
- Example: Count requests per minute → 1-minute tumbling window on event timestamps
**Hopping window** — Fixed-length with overlap (hop size < window size). Events appear in multiple windows.
- Use when: Producing smoothed metrics where abrupt transitions between windows are undesirable
- Implement: Compute tumbling windows at the hop size, then aggregate over multiple adjacent tumbling windows
- Example: 5-minute rolling average updated every 1 minute → 5-minute window, 1-minute hop
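Window membership for tumbling and hopping windows reduces to integer arithmetic on the event timestamp. The helpers below are a sketch, assuming timestamps and sizes are whole seconds and windows are half-open intervals `[start, start + size)`; the function names are illustrative.

```python
def tumbling_window_start(timestamp, size):
    """Start of the single tumbling window that contains timestamp."""
    return timestamp - (timestamp % size)


def hopping_window_starts(timestamp, size, hop):
    """Starts of every hopping window [start, start + size) containing timestamp."""
    start = timestamp - (timestamp % hop)  # most recent hop boundary
    starts = []
    while start > timestamp - size:
        starts.append(start)
        start -= hop
    return sorted(starts)


one_minute = tumbling_window_start(1000, 60)
five_minute_hops = hopping_window_starts(1000, 300, 60)
```

An event at second 1000 lands in one 1-minute tumbling window but in five overlapping 5-minute windows with a 1-minute hop. With `hop == size`, the hopping case collapses to the tumbling one.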
**Sliding window** — Fixed-length, but with no fixed boundaries. Groups all events that occur within a given time interval of each other.
- Use when: Detecting events that co-occur within a time proximity (two events within 5 minutes of each other), regardless of fixed boundaries
- Implement: Buffer sorted by timestamp, evict events that expire from the window
- Example: Detect rapid successive login failures within 10 minutes for fraud detection
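The buffer-and-evict approach for a sliding window might look like the sketch below. It assumes events arrive in timestamp order per key, and `SlidingFailureDetector` with its parameters is a hypothetical name, not a framework API.

```python
from collections import deque


class SlidingFailureDetector:
    """Flags a key once `threshold` events fall within `window` seconds."""

    def __init__(self, window, threshold):
        self.window = window
        self.threshold = threshold
        self.recent = {}  # key -> deque of timestamps, oldest first

    def observe(self, key, timestamp):
        """Record one failure and report whether the threshold is reached."""
        times = self.recent.setdefault(key, deque())
        times.append(timestamp)
        # Evict events too old to share a window with the newest one.
        while times and times[0] <= timestamp - self.window:
            times.popleft()
        return len(times) >= self.threshold


detector = SlidingFailureDetector(window=600, threshold=3)
results = [
    detector.observe("alice", 0),
    detector.observe("alice", 300),
    detector.observe("alice", 590),
    detector.observe("alice", 1000),
]
```

The third failure trips the detector because all three fall within 10 minutes; by second 1000 the two oldest have been evicted.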
**Session window** — No fixed duration. Groups events from the same user/entity with gaps smaller than a timeout.
- Use when: Measuring user engagement, session duration, or any activity that has natural periods of inactivity
- Implement: Merge events into the same session if less than the gap threshold apart; close session on timeout
- Example: Website session analytics — group clicks within 30-minute inactivity window per user
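Session assignment can be sketched as a single pass over one user's sorted timestamps. This is a batch-style illustration with a hypothetical `sessionize` helper; a streaming implementation would keep the open session as state and close it on timeout.

```python
def sessionize(timestamps, gap):
    """Group event timestamps into sessions separated by gaps of at least `gap`."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)  # continues the current session
        else:
            sessions.append([ts])  # inactivity timeout: start a new session
    return sessions


clicks = [0, 600, 1500, 4000, 4100]  # seconds, for a single user
sessions = sessionize(clicks, gap=1800)
```

With a 30-minute (1800-second) gap, the 2500-second pause between the third and fourth click splits the activity into two sessions.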
**Event time vs. processing time:** Use event timestamps (event time) for window boundaries whenever correctness matters. Processing time (the wall clock of the stream processor) produces artifacts when the processor restarts: the backlog of old events is consumed in a burst and shows up as a sudden spike, producing false anomalies. Use processing time only when event delay is negligibly small and approximate results are acceptable.
**Straggler events:** When using event-time windows, events can arrive after the window has closed (delayed by network, buffering, or offline clients). Two options:
1. Ignore stragglers and track them as a metric — acceptable when straggler rate is low
2. Publish a corrected result when stragglers arrive — required for billing, compliance, or any system where downstream users act on window results
### Step 4 — Select Join Type(s)
**WHY:** Stream joins require the processor to maintain state — buffered events or a local copy of a table — to match events from different inputs. Each join type has different state requirements, ordering sensitivity, and time-dependence. Choosing the wrong type produces nondeterministic results or excessive memory consumption.
See `references/join-type-reference.md` for detailed implementation patterns.
**Stream-stream join (window join):**
- What: Correlate two event streams where related events occur close in time, joined by a shared key
- State: Buffer of recent events from both streams within the window, indexed by join key
- Use when: Correlating user actions across sessions (search query + subsequent click), matching request with response events, detecting cause-effect patterns within a time window
- Example: Join search events with click events by session ID within 1 hour to compute click-through rate
- Time-dependence: Results are nondeterministic if ordering across streams is undetermined. If the same job is rerun, events may interleave differently.
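A stream-stream join keeps a keyed buffer per input and probes the other side on every arrival. The sketch below is a simplified in-memory version; `WindowJoin`, the `"left"`/`"right"` labels, and the explicit `expire` call are assumptions, not a framework API.

```python
from collections import defaultdict


class WindowJoin:
    """Symmetric join of two streams on a key, within `window` seconds."""

    def __init__(self, window):
        self.window = window
        self.buffers = {"left": defaultdict(list), "right": defaultdict(list)}

    def on_event(self, side, key, timestamp, value):
        """Buffer the event and return (left, right) matches from the other stream."""
        other = "right" if side == "left" else "left"
        self.buffers[side][key].append((timestamp, value))
        matches = []
        for ts, other_value in self.buffers[other][key]:
            if abs(timestamp - ts) <= self.window:
                pair = (value, other_value) if side == "left" else (other_value, value)
                matches.append(pair)
        return matches

    def expire(self, now):
        """Drop buffered events too old to match anything arriving at `now`."""
        for side in self.buffers.values():
            for key in list(side):
                side[key] = [(ts, v) for ts, v in side[key] if now - ts <= self.window]
                if not side[key]:
                    del side[key]


join = WindowJoin(window=3600)
no_match_yet = join.on_event("left", "s1", 0, "shoes")     # search
matched = join.on_event("right", "s1", 120, "/p/42")       # click, same session
too_late = join.on_event("right", "s1", 5000, "/p/7")      # click after 1 hour
```

Because matching depends on which events are buffered when the other side arrives, a different interleaving of the two streams can change which pairs are emitted, which is the time-dependence noted above.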
**Stream-table join (stream enrichment):**
- What: Enrich each event from a stream with data from a reference table that changes over time
- State: Local copy of the entire reference table (or relevant partition), updated via a CDC stream from the source database
- Use when: Adding user profile data to activity events, looking up product metadata for order events, applying country-specific configuration to events
- Example: Enrich user activity stream with user profile (name, tier, preferences) from user database
- Key difference from batch: The reference table is not a static snapshot — it is updated via a separate CDC stream. The stream processor joins against the version of the table that exists at the time each event arrives.
- Time-dependence: The result depends on when the profile update arrives relative to activity events. Use slowly changing dimension (SCD) versioning if historical correctness is required.
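Stream enrichment amounts to a local lookup table kept current by a changelog. A minimal sketch follows, with a hypothetical `Enricher` class and dict-shaped events and profiles assumed for illustration.

```python
class Enricher:
    def __init__(self):
        self.profiles = {}  # local copy of the reference table

    def on_profile_change(self, user_id, profile):
        """Apply one CDC change; a profile of None means the row was deleted."""
        if profile is None:
            self.profiles.pop(user_id, None)
        else:
            self.profiles[user_id] = profile

    def on_activity(self, event):
        """Return the event merged with the profile known at arrival time."""
        profile = self.profiles.get(event["user_id"], {})
        return {**event, **profile}


enricher = Enricher()
enricher.on_profile_change("u1", {"tier": "free"})
before_upgrade = enricher.on_activity({"user_id": "u1", "page": "/a"})
enricher.on_profile_change("u1", {"tier": "pro"})
after_upgrade = enricher.on_activity({"user_id": "u1", "page": "/b"})
```

The same user's events are tagged `free` or `pro` depending on whether they arrive before or after the profile change, which is why historical correctness needs versioned profiles.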
**Table-table join (materialized view maintenance):**
- What: Maintain a continuously updated materialized view of the join between two tables, both represented as changelog streams
- State: Both tables maintained in local storage; recompute the join when either changes
- Use when: Maintaining a denormalized read model (like a user timeline cache) that combines data from multiple source tables
- Example: Maintain a per-user timeline by joining tweet events with follow-relationship events (equivalent to continuously refreshing `SELECT follows.follower_id, array_agg(tweets.*) FROM tweets JOIN follows ON follows.followee_id = tweets.sender_id GROUP BY follows.follower_id`)
- This pattern is how Twitter's home timeline cache works: each new tweet fans out to all followers' cached timelines
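The timeline example can be sketched as a view that is updated whenever either input changelog produces an event. `TimelineView` and its handlers are hypothetical names, and the sketch keeps everything in memory.

```python
from collections import defaultdict


class TimelineView:
    def __init__(self):
        self.followers = defaultdict(set)   # followee -> set of followers
        self.tweets = defaultdict(list)     # sender -> list of tweet texts
        self.timelines = defaultdict(list)  # follower -> [(sender, text)]

    def on_tweet(self, sender, text):
        """New tweet: store it and fan it out to every follower's timeline."""
        self.tweets[sender].append(text)
        for follower in self.followers[sender]:
            self.timelines[follower].append((sender, text))

    def on_follow(self, follower, followee):
        """New follow: backfill the followee's existing tweets."""
        if follower in self.followers[followee]:
            return
        self.followers[followee].add(follower)
        self.timelines[follower].extend((followee, t) for t in self.tweets[followee])

    def on_unfollow(self, follower, followee):
        """Unfollow: remove the followee's tweets from the timeline."""
        self.followers[followee].discard(follower)
        self.timelines[follower] = [
            entry for entry in self.timelines[follower] if entry[0] != followee
        ]


view = TimelineView()
view.on_follow("bob", "alice")
view.on_tweet("alice", "hi")
view.on_tweet("carol", "yo")
view.on_follow("bob", "carol")
before_unfollow = list(view.timelines["bob"])
view.on_unfollow("bob", "alice")
after_unfollow = list(view.timelines["bob"])
```

Changes on either side (a tweet or a follow/unfollow) update the joined result, so the view stays equivalent to re-running the join query.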
**Selecting the right join type:**
| Situation | Join Type |
|---|---|
| Two event streams, correlate events within a time window | Stream-stream |
| One event stream + one reference dataset | Stream-table |
| Maintaining a combined read model from two source tables | Table-table |
| Enriching events with slowly changing data | Stream-table with SCD versioning |
### Step 5 — Choose Fault Tolerance Strategy
**WHY:** A stream processor runs indefinitely. Unlike a batch job that can be fully restarted from its immutable input, restarting a stream job from scratch means reprocessing potentially years of events. The fault tolerance strategy determines how much reprocessing happens after a failure and whether outputs are produced multiple times.
See `references/fault-tolerance-comparison.md` for a full comparison.
**Microbatching (Spark Streaming, Spark Structured Streaming):**
- Mechanism: Break the stream into small fixed-size blocks (typically ~1 second). Treat each block as a miniature batch job. Failed blocks are retried; successful blocks are committed.
- Exactly-once guarantee: Within the framework — each input record contributes to exactly one output record in steady state
- Implicit window: Microbatching implicitly creates a tumbling window at the batch size. Larger windows require explicit state carry-over between microbatches.
- Latency floor: Cannot be lower than the batch interval (~1 second minimum)
- Choose when: Latency of 1-5 seconds is acceptable; team is familiar with batch processing semantics; existing Spark infrastructure
**Checkpointing (Apache Flink, Apache Samza):**
- Mechanism: Periodically snapshot all operator state and write it to durable storage. On failure, restart from the last checkpoint and discard any output produced between the checkpoint and the crash.
- Exactly-once guarantee: Within the framework, using barrier-based checkpointing (Flink) or changelog-based recovery (Samza)
- No implicit window: Window size is independent of checkpoint interval
- Latency floor: Sub-second (checkpoints happen asynchronously; normal processing is not interrupted)
- Choose when: Sub-second latency is required; large operator state (joins, aggregations) needs efficient recovery; you need large windows without per-window state carry-over overhead
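The recovery behaviour of checkpointing can be illustrated with a small counting job. This is a single-process sketch under simplifying assumptions: the checkpoint is an in-memory tuple standing in for durable storage, and `CountingJob` is a hypothetical class unrelated to the Flink or Samza APIs.

```python
import copy


class CountingJob:
    """Keyed counter that checkpoints (input offset, state) periodically."""

    def __init__(self, checkpoint_every):
        self.checkpoint_every = checkpoint_every
        self.counts = {}
        self.offset = 0            # next input offset to process
        self.checkpoint = (0, {})  # stands in for durable storage

    def process(self, log, stop_at=None):
        """Consume log from the current offset up to stop_at (exclusive)."""
        end = len(log) if stop_at is None else stop_at
        while self.offset < end:
            key = log[self.offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1
            if self.offset % self.checkpoint_every == 0:
                self.checkpoint = (self.offset, copy.deepcopy(self.counts))

    def recover(self):
        """Discard in-memory progress and resume from the last checkpoint."""
        self.offset, saved = self.checkpoint
        self.counts = copy.deepcopy(saved)


log = ["a", "b", "a", "c", "a"]
job = CountingJob(checkpoint_every=2)
job.process(log, stop_at=3)  # crash after offset 2; last checkpoint is at offset 2
job.recover()
resume_offset = job.offset
job.process(log)
final_counts = dict(job.counts)
```

Only the input since the last checkpoint is reprocessed, and the final counts match an uninterrupted run because the partial work done after the checkpoint is thrown away.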
**Idempotent processing:**
- Mechanism: Design every output operation to be safe to apply multiple times. On failure, replay input messages from the broker offset and re-apply. If an output was already applied, re-applying has no effect.
- Examples: Writing to a key-value store where the write sets a fixed value (not an increment), including the message offset in the write to detect duplicates, using database upserts keyed by event ID
- Exactly-once guarantee: Achieved without distributed transactions if all output operations are idempotent and the broker supports offset replay (log-based brokers)
- Constraints: The processing must be deterministic. Replay must produce the same messages in the same order (log-based brokers guarantee this within a partition). No other concurrent writer may update the same keys.
- Choose when: Output goes to a key-value store or a database with natural upsert semantics; you want exactly-once without the overhead of distributed transactions
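One way to make a non-idempotent update (an increment) safe to replay is to store the last applied offset next to the value. The sketch below assumes a single partition with increasing offsets and a single writer per key; `IdempotentSink` is a hypothetical class, and a real store would have to write the value and offset atomically.

```python
class IdempotentSink:
    """Key-value sink that remembers the last applied offset per key."""

    def __init__(self):
        self.values = {}       # key -> running total
        self.last_offset = {}  # key -> offset of the last applied message

    def apply(self, offset, key, amount):
        """Add amount to key unless this offset was already applied."""
        if offset <= self.last_offset.get(key, -1):
            return False  # duplicate delivered by a replay: ignore
        self.values[key] = self.values.get(key, 0) + amount
        self.last_offset[key] = offset
        return True


messages = [(0, "acct", 10), (1, "acct", 5), (2, "other", 7)]
sink = IdempotentSink()
for message in messages:
    sink.apply(*message)

# Simulate a restart that replays from offset 1.
replayed = [sink.apply(*message) for message in messages[1:]]
```

The replayed messages are recognized by their offsets and skipped, so the totals are the same as if each message had been delivered once.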
**The end-to-end exactly-once argument:**
Framework-level exactly-once (microbatching or checkpointing) only governs what happens inside the stream processor. As soon as output leaves the processor — a write to an external database, an email sent, a message published to another broker — the framework cannot discard that output on failure. If the processor restarts, the side effect happens again.
True end-to-end exactly-once requires that all output side effects are either:
1. **Idempotent** — safe to apply twice with the same result, or
2. **Atomically committed** — tied to advancing the consumer offset in a single atomic operation (available in some frameworks, e.g., Google Cloud Dataflow's atomic commit facility, Kafka transactions)
Practically: if you send emails or charge payments from a stream processor, microbatching and checkpointing alone are not sufficient. Design the output operation to be idempotent (e.g., deduplicate email sends by event ID), or use atomic commits if available.
### Step 6 — Evaluate Lambda vs. Kappa Architecture
**WHY:** Lambda architecture (separate batch and speed layers with a merge layer) was designed to get the correctness of batch processing with the low latency of stream processing. Kappa architecture challenges whether this complexity is necessary.
**Lambda architecture:**
- Batch layer: Reprocesses all historical data periodically; produces correct, complete output
- Speed layer: Processes recent data in real time; produces approximate or recent-only output
- Serving layer: Merges batch and speed outputs for queries
- Problem: Two separate code paths for the same computation. Business logic must be kept in sync across both. When the batch layer produces new results, it must replace the speed layer output atomically.
**Kappa architecture:**
- Single stream processing layer handles both historical reprocessing and live data
- Reprocessing is done by replaying the event log from offset 0 with a new version of the code, writing output to a new location, then switching consumers to the new output
- Requires: A log-based broker that retains the full event history (or long enough history)
- Advantage: One code path, no synchronization between batch and speed layers
- Limitation: Reprocessing large history is slower than batch processing on a distributed file system. For very large historical datasets, Kappa may not complete reprocessing quickly enough.
**Choose Lambda when:** You have very large historical datasets (petabytes) where stream reprocessing is too slow; you need the batch layer's correctness guarantees independently; the team already operates separate batch and stream infrastructure.
**Choose Kappa when:** Your log-based broker retains sufficient history; reprocessing speed is acceptable; you want operational simplicity; the stream processing framework can handle both historical and live data at acceptable throughput.
---
## Examples
### Example 1 — Real-Time Search Index Sync
**Scenario:** An e-commerce application uses PostgreSQL as its primary database. Product search is powered by Elasticsearch. Currently, the application writes to both PostgreSQL and Elasticsearch directly (dual write). The search index is frequently stale or inconsistent with the database.
**Trigger:** "How do I keep Elasticsearch in sync with PostgreSQL without dual writes?"
**Process:**
1. **Broker:** Log-based (Kafka). Multiple consumers needed: Elasticsearch sink, analytics warehouse, audit log. Replay is needed when Elasticsearch is rebuilt.
2. **Event source:** CDC with Debezium on PostgreSQL WAL. Debezium captures the database's write order, eliminating the race condition in dual writes. The database is the only writer; Debezium observes.
3. **Bootstrap:** Use Debezium's snapshot mode (automatically takes a consistent snapshot + records the WAL offset). Elasticsearch is populated from the snapshot, then Debezium switches to live CDC from that offset.
4. **Window/join:** Not applicable — this is a materialized view maintenance pipeline. No aggregation windows. The Kafka Connect Elasticsearch sink connector writes each change event as a document update.
5. **Fault tolerance:** Idempotent writes to Elasticsearch (document ID = primary key). If Debezium restarts and replays a change, Elasticsearch simply overwrites the same document with the same value.
**Output:** PostgreSQL → Debezium → Kafka topic (log-compacted) → Elasticsearch Sink Connector → Elasticsearch. No dual writes. Consistent ordering. Replay-safe.
### Example 2 — Fraud Detection with Session Windows
**Scenario:** A payments platform needs to detect when a user makes more than 5 failed payment attempts within 30 minutes (a pattern suggesting card testing fraud). Events arrive from multiple microservices with occasional delays of up to 2 minutes.
**Trigger:** "How do I detect rapid failed payment attempts in a stream processor?"
**Process:**
1. **Broker:** Kafka (log-based). Replay needed for reprocessing after rule changes; multiple consumers (fraud engine, analytics, compliance).
2. **Event source:** Application publishes `payment.attempt` events directly to Kafka when processing payments. Event sourcing pattern: the event is the record of what happened.
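3. **Window type:** Session window with 30-minute inactivity timeout, keyed by user ID. A session closes if no payment attempt arrives for 30 minutes. Counts failed attempts within each session. Because a session stays open as long as attempts keep arriving less than 30 minutes apart, it can span more than 30 minutes; use a sliding window instead if the rule must be strictly "within 30 minutes."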
4. **Event time:** Use event timestamp (when the payment was attempted), not processing time. The processor may restart and consume a backlog; processing-time windows would produce false spikes.
5. **Fault tolerance:** Checkpointing (Flink). Session state must survive restarts. Checkpointing writes session state to durable storage; on restart, sessions resume from the last checkpoint without losing partial counts.
6. **Exactly-once:** Fraud alerts go to a downstream Kafka topic (not email). The alert sink is idempotent (alert ID = session ID + window start). On reprocessing, duplicate alerts are deduplicated by the downstream alert router.
**Output:** Flink job with session window (30-minute gap) on `payment.attempt` stream → emits `fraud.alert` event when session count exceeds threshold → downstream alert router deduplicates and notifies.
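The session-window counting in this example can be sketched in plain Python, independent of Flink. This is a minimal illustration, not the Flink API: the function name `detect_sessions` and the `(user_id, event_time_seconds, succeeded)` tuple shape are assumptions, and the sketch sorts a finite list by event time where a real processor would rely on watermarks to handle the 2-minute delays.

```python
from collections import defaultdict

GAP_SECONDS = 30 * 60   # session inactivity gap from the example
THRESHOLD = 5           # alert when a session has more than 5 failed attempts

def detect_sessions(events):
    """events: iterable of (user_id, event_time_seconds, succeeded) (hypothetical shape).
    Returns (user_id, session_start, failed_count) for sessions over the threshold."""
    by_user = defaultdict(list)
    for user_id, ts, succeeded in events:
        by_user[user_id].append((ts, succeeded))

    alerts = []
    for user_id, attempts in by_user.items():
        attempts.sort()  # order by event time, not arrival order
        session_start, last_ts, failed = None, None, 0
        for ts, succeeded in attempts:
            if last_ts is None or ts - last_ts > GAP_SECONDS:
                # Gap exceeded: close the previous session, start a new one
                if failed > THRESHOLD:
                    alerts.append((user_id, session_start, failed))
                session_start, failed = ts, 0
            if not succeeded:
                failed += 1
            last_ts = ts
        if failed > THRESHOLD:  # close the final session
            alerts.append((user_id, session_start, failed))
    return alerts

events = [("u1", 60 * i, False) for i in range(6)]      # 6 failures, one minute apart
events += [("u2", 0, False), ("u2", 3 * 3600, False)]   # 2 failures, 3 hours apart
alerts = detect_sessions(events)
```

The `(user_id, session_start)` pair in each result corresponds to the alert ID described in step 6, which is what lets the downstream router deduplicate replays.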
### Example 3 — Activity Stream Enrichment (Stream-Table Join)
**Scenario:** A SaaS analytics platform processes a high-volume stream of user activity events (page views, feature usage). Each event contains a user ID but not user metadata (subscription tier, company, region). Analytics queries always segment by these dimensions. Currently, a nightly batch job denormalizes the data into a data warehouse. The 24-hour delay makes same-day analytics impossible.
**Trigger:** "How do I enrich activity events with user profile data in real time?"
**Process:**
1. **Broker:** Kafka. Fan-out to multiple analytics consumers; replay needed when enrichment logic changes.
2. **Event source:** Application produces activity events directly. User profile changes are captured via CDC (Debezium on the user PostgreSQL database) into a separate Kafka topic.
3. **Join type:** Stream-table join. Activity event stream joined with the user profile changelog stream. The stream processor loads the user profile table into local state (in-memory hash map or local RocksDB index), updated continuously via CDC events.
4. **Time-dependence:** A user who upgrades their subscription tier will have subsequent activity events enriched with the new tier. Activity events processed before the upgrade retain the old tier. This is correct: it reflects the tier at the time of activity. Document this behavior explicitly for downstream users.
5. **Fault tolerance:** Checkpointing. The local user profile state (potentially millions of records) must survive restarts without reloading from the source database each time.
6. **Sink:** Enriched events are written to a new Kafka topic, consumed by the data warehouse (Snowflake, BigQuery) and a real-time dashboard.
**Output:** Activity stream + User profile CDC stream → stream-table join in Flink → enriched event stream → warehouse and dashboard consumers. Latency reduced from 24 hours to seconds.
---
## References
- `references/broker-selection-framework.md` — Detailed comparison of log-based vs. traditional brokers with scoring criteria
- `references/window-type-selection.md` — Window type decision table, implementation notes, and straggler handling patterns
- `references/join-type-reference.md` — Join type patterns, state management requirements, and time-dependence analysis
- `references/fault-tolerance-comparison.md` — Microbatching vs. checkpointing vs. idempotent writes comparison, with exactly-once end-to-end argument
**Related skills:**
- `encoding-format-advisor` — Choose encoding format (Avro, Protobuf) for events on the broker (dependency)
- `batch-pipeline-designer` — For bounded dataset processing; compare Lambda vs. Kappa decisions
- `data-integration-architect` — Cross-system integration decisions when stream processing is one component of a larger data architecture
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Designing Data-Intensive Applications by Martin Kleppmann.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-encoding-format-advisor`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/broker-selection-framework.md
# Broker Selection Framework
Reference for Step 1 of `stream-processing-designer`.
---
## Taxonomy
### Traditional message brokers (JMS/AMQP style)
Examples: RabbitMQ, ActiveMQ, HornetQ, IBM MQ, Azure Service Bus, Amazon SQS, Google Cloud Pub/Sub (JMS mode)
Standard: JMS (Java Message Service) and AMQP define the interface. Behavior varies by implementation.
Key characteristics:
- Messages are deleted from the broker once successfully delivered and acknowledged
- Consumers register; the broker assigns individual messages to consumers (load balancing)
- Fan-out via topic subscriptions or exchange bindings (AMQP), but the default model is queue-per-consumer
- Broker maintains very little history — working assumption is the queue is short
- Secondary indexes not supported; topic subscriptions provide some filtering
- Clients are notified when new messages arrive (push model)
Operational implication: If a consumer is shut down, its queue must be explicitly deleted. Otherwise it accumulates messages and consumes broker memory. Contrast with log-based brokers where a stopped consumer simply pauses its offset with no broker-side state.
### Log-based message brokers
Examples: Apache Kafka, Amazon Kinesis Streams, Apache Pulsar, Twitter DistributedLog
Core mechanism: An append-only log on disk, partitioned across machines. Each partition is a separate, independently readable log. Producers append; consumers read sequentially and record their offset.
Key characteristics:
- Messages are NOT deleted on delivery — reading is a read-only operation
- Consumer offset is maintained by the consumer (or a consumer group registry); the broker does not track per-message delivery
- Fan-out is trivial: multiple consumers read the same partition independently
- Partition assignment: the broker assigns entire partitions to consumer group members (coarse-grained load balancing)
- Sequential reads from disk are fast; total-order within a partition is guaranteed
- Log compaction: the broker retains only the most recent value per key, enabling reconstruction of full database state without unbounded storage
Storage model: Log is divided into segments. Old segments are deleted or moved to archive when a configurable retention period (time or size) is reached. This creates a large but bounded circular buffer. At 150 MB/s sequential write speed on a 6 TB disk, the buffer holds ~11 hours of data at maximum write rate — in practice, days to weeks.
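The retention estimate above is simple arithmetic, worked through here as a short sketch. The 10 MB/s "typical" rate in the second calculation is an assumed figure for illustration only.

```python
def hours_to_fill(disk_bytes, write_bytes_per_sec):
    """Hours until a disk-sized log buffer wraps at a constant write rate."""
    return disk_bytes / write_bytes_per_sec / 3600

disk_bytes = 6 * 10**12                                  # 6 TB disk
at_max_rate = hours_to_fill(disk_bytes, 150 * 10**6)     # 150 MB/s sequential writes
at_typical_rate = hours_to_fill(disk_bytes, 10 * 10**6)  # assumed 10 MB/s average
days_at_typical_rate = at_typical_rate / 24
```

At the maximum rate the buffer wraps in about 11 hours; at the assumed lower average rate the same disk holds roughly a week of history, which is why retention in practice is measured in days to weeks.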
---
## Comparison Table
| Dimension | Traditional Broker | Log-Based Broker |
|---|---|---|
| Message replay | Not possible (deleted on ack) | Yes — reset consumer offset |
| Fan-out to N consumers | Via topic subscriptions | Trivially — each consumer reads independently |
| Load balancing | Per-message (round-robin or work queue) | Per-partition (coarse-grained) |
| Max consumers sharing a topic's work | Unlimited | Limited to partition count within a consumer group (one consumer per partition per group) |
| Head-of-line blocking | No — each consumer gets a different message | Yes — one slow message blocks its partition |
| Message ordering | Best-effort; redelivery can reorder | Total order within partition |
| Throughput behavior | Degrades as queue grows (disk spills) | Constant (always writes to disk) |
| Slow consumer effect | Memory pressure on broker | Consumer falls behind; does not affect others |
| Operational overhead | Must delete queues for stopped consumers | Consumers stop reading; offset remains |
| History / audit | No | Yes — up to retention period |
| Monitoring | Queue depth | Consumer lag (offset delta) |
---
## Decision Signals
**Choose log-based when any of the following are true:**
1. Multiple independent consumers need to read the same events
2. A consumer may need to replay events (debugging, bug fix reprocessing, bootstrapping a new derived system)
3. Message ordering within a key is important
4. High throughput with many small messages (log-based throughput is constant; traditional degrades under queue depth)
5. You want to replay historical events with a different processing version (Kappa architecture)
6. You are implementing CDC or event sourcing — immutability of the event log is a feature, not a side effect
**Choose traditional when all of the following are true:**
1. Each message is consumed by exactly one consumer (no fan-out)
2. Messages may be expensive to process and require per-message retry logic with arbitrary backoff
3. Message ordering does not matter
4. The team already operates a traditional broker and the system does not need replay
**Hybrid:** Most teams building serious data infrastructure end up with a log-based broker as the central spine and traditional queues for specific work queue patterns (job queues, email dispatch) where replay is not needed.
---
## Load Balancing vs. Fan-Out
Two consumer patterns (Figure 11-1 in source):
**Load balancing:** Each message is delivered to one consumer. Consumers share the work. In log-based brokers, implemented by assigning partitions to consumer group members. Maximum parallelism = partition count.
**Fan-out:** Each message is delivered to all consumers. Independent consumers each process every message. In log-based brokers, this is the default — each consumer group reads the full log independently.
**Combined:** Multiple consumer groups each receive all messages (fan-out between groups); within each group, partitions are divided for load balancing.
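The combined pattern can be sketched as follows. This is a simplified round-robin assignment written for illustration; it is not Kafka's actual assignor, and the group and member names are hypothetical.

```python
def assign_partitions(partitions, members):
    """Assign whole partitions to the members of one consumer group, round-robin."""
    assignment = {member: [] for member in members}
    for i, partition in enumerate(partitions):
        assignment[members[i % len(members)]].append(partition)
    return assignment

partitions = [0, 1, 2, 3]
groups = {
    "fraud": ["f1", "f2"],                       # 2 members share 4 partitions
    "analytics": ["a1", "a2", "a3", "a4", "a5"], # 5 members, only 4 partitions
}
# Fan-out: every group is assigned all partitions.
# Load balancing: within a group, each partition goes to exactly one member.
plan = {group: assign_partitions(partitions, members) for group, members in groups.items()}
```

The fifth `analytics` member receives no partition, which illustrates why maximum parallelism within a group equals the partition count.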
---
## Consumer Offsets and Fault Tolerance
In a log-based broker, each consumer or consumer group records its current offset per partition. On consumer restart, it resumes from the last committed offset.
If a consumer fails after processing messages but before committing its offset, those messages are processed again. This is at-least-once delivery by default. To achieve exactly-once, the consumer must make its downstream writes idempotent, or atomically commit the offset with the downstream write.
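A small simulation makes the at-least-once behavior concrete. This is a sketch with an in-memory list standing in for one partition and a module-level variable standing in for the durably committed offset; all names are hypothetical.

```python
log = ["e0", "e1", "e2", "e3"]   # one partition of an append-only log
committed_offset = 0              # stand-in for the durably committed offset
counter = 0                       # non-idempotent sink: a plain increment
seen = {}                         # idempotent sink: overwrite keyed by offset

def run(crash_before_commit_at=None):
    """Consume from the committed offset; optionally crash before one commit."""
    global committed_offset, counter
    for offset in range(committed_offset, len(log)):
        counter += 1                 # repeats if the message is replayed
        seen[offset] = log[offset]   # safe to repeat: same key, same value
        if offset == crash_before_commit_at:
            return                   # simulated crash: offset is not committed
        committed_offset = offset + 1

run(crash_before_commit_at=2)   # processes e0..e2 but commits only through e1
run()                           # restart: resumes at offset 2 and reprocesses e2
```

After both runs the increment-based sink has counted five events for four messages, while the offset-keyed sink holds exactly four entries.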
The consumer offset is analogous to the log sequence number (LSN) in database replication — the same principle that allows a replica to reconnect to a leader after a disconnect and resume replication without missing any writes.
---
## Disk Space and Retention
If the consumer falls so far behind that its offset points to a deleted segment, it will miss messages. The log-based broker effectively drops old messages when it fills. This is a large bounded buffer (days to weeks in practice), not unlimited storage.
Monitor consumer lag continuously. Alert when lag grows significantly. The large buffer gives the operations team time to intervene before messages are lost.
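A lag check along these lines can be sketched as below. The function name, status strings, and threshold are assumptions for illustration; a real deployment would read the earliest, committed, and latest offsets from the broker's admin or metrics interface.

```python
def partition_status(earliest, committed, latest, lag_alert_threshold):
    """Classify one partition from three offsets:
    earliest  = oldest offset still retained by the broker
    committed = the consumer's last committed offset
    latest    = the log end offset"""
    if committed < earliest:
        return "data_lost"   # the consumer's position was already deleted
    if latest - committed > lag_alert_threshold:
        return "lagging"     # alert while there is still time to intervene
    return "ok"

statuses = [
    partition_status(earliest=100, committed=900, latest=1000, lag_alert_threshold=500),
    partition_status(earliest=100, committed=200, latest=1000, lag_alert_threshold=500),
    partition_status(earliest=100, committed=50, latest=1000, lag_alert_threshold=500),
]
```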
FILE:references/fault-tolerance-comparison.md
# Fault Tolerance Comparison
Reference for Step 5 of `stream-processing-designer`.
---
## The Core Problem
Batch jobs tolerate failures by restarting failed tasks from their immutable input. The output of failed tasks is discarded. Because input is immutable and output is only made visible when a task completes successfully, the result is the same as if no failure occurred — exactly-once semantics without any special mechanism.
Stream jobs cannot use this approach because the stream never ends. You cannot "wait until the job finishes" to make output visible — the job runs indefinitely. And restarting from the beginning of the stream means reprocessing potentially years of events.
Three strategies handle this problem, each with different trade-offs.
---
## Strategy 1: Microbatching
**Mechanism:** Break the stream into small fixed-size blocks, typically ~1 second. Treat each block as a miniature batch job. Process the block, produce output, commit the output, advance the offset. Failed blocks are retried from the start of the block.
**Used by:** Apache Spark Streaming, Spark Structured Streaming
**Latency floor:** The batch interval. Cannot produce output more frequently than once per batch. Minimum practical batch interval: ~500ms–1s.
**Implicit window:** Microbatching implicitly creates a tumbling window at the batch size, windowed by processing time (not event time). Jobs that require windows larger than the batch interval must explicitly carry state forward across microbatches.
**Exactly-once guarantee:** Within the framework — each input record contributes to exactly one output batch in steady state. On failure, the failed microbatch is retried; the previous microbatch's output was already committed.
**State management:** State that spans microbatch boundaries (e.g., session state, running totals) must be explicitly persisted between batches. Spark handles this via stateful transformations (e.g., `updateStateByKey`, `mapGroupsWithState`).
**Overhead:** Each microbatch incurs scheduling and coordination overhead. Smaller batches (lower latency) = higher overhead. Performance tuning requires balancing batch size against acceptable latency.
**Choose microbatching when:**
- Latency of 1-5 seconds is acceptable
- Team is familiar with batch processing (Spark) semantics
- Existing Spark infrastructure is already operated
- Workload consists of high-throughput, low-state operations (filtering, simple aggregations)
---
## Strategy 2: Checkpointing
**Mechanism:** Periodically snapshot all operator state (window buffers, join state, aggregation accumulators) and write snapshots to durable storage (HDFS, S3, GCS). On failure, restart from the most recent checkpoint and discard any output generated between the checkpoint and the crash.
**Used by:** Apache Flink (barrier-based checkpointing), Apache Samza (changelog-based state recovery)
**Latency floor:** Sub-second. Checkpointing is asynchronous — normal processing continues while snapshots are written. The latency floor is the processing time per event, typically milliseconds.
**No implicit window:** Window size is entirely independent of checkpoint interval. A 1-hour window with 30-second checkpoints is entirely valid.
**Exactly-once guarantee:** Within the framework. Flink uses checkpoint barriers: special markers injected into the event stream. Each operator snapshots its state once the barrier has arrived on all of its inputs, and the checkpoint completes when every operator has done so. On recovery, all operators restore from the snapshot and replay only the events after the last barrier.
**State management:** Flink manages operator state automatically. Large state (e.g., a local copy of a reference table for stream-table joins) is stored in RocksDB on local disk and checkpointed to durable storage. Recovery reads from the checkpoint rather than reloading from the source database.
**Checkpoint interval trade-off:** More frequent checkpoints = less reprocessing after failure, but more I/O overhead. Typical production checkpoints: every 30 seconds to 5 minutes.
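The recover-and-replay idea can be sketched with a single operator that counts events per key. This is an illustration of the principle only, not Flink's barrier protocol; the `process` function and its parameters are hypothetical, and the "durable" checkpoint is just a Python tuple.

```python
import copy

events = ["a", "b", "a", "c", "a", "b"]

def process(events, checkpoint_every, crash_at=None, checkpoint=None):
    """Count events per key. Returns (state, last_checkpoint).
    On a simulated crash the in-memory state is lost and state is None."""
    offset, state = (0, {}) if checkpoint is None else copy.deepcopy(checkpoint)
    while offset < len(events):
        if offset == crash_at:
            return None, checkpoint
        key = events[offset]
        state[key] = state.get(key, 0) + 1
        offset += 1
        if offset % checkpoint_every == 0:
            # Snapshot input position and operator state together
            checkpoint = (offset, copy.deepcopy(state))
    return state, checkpoint

_, ckpt = process(events, checkpoint_every=2, crash_at=5)             # crash after 5 events
recovered, _ = process(events, checkpoint_every=2, checkpoint=ckpt)  # restore and replay
no_failure, _ = process(events, checkpoint_every=2)
```

Only the single event processed after the last checkpoint is replayed, and the recovered counts match a run with no failure. A shorter checkpoint interval shrinks that replayed span at the cost of more snapshots.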
**Choose checkpointing when:**
- Sub-second latency is required
- Large operator state must survive restarts efficiently (stream-table joins with large reference tables, session windows across millions of users)
- Large windows are needed (hours or days) without per-window carry-over overhead
- The team operates or is willing to operate Flink or Samza
---
## Strategy 3: Idempotent Processing
**Mechanism:** Design every output operation so that applying it multiple times produces the same result as applying it once. On failure, replay input messages from the broker offset and reapply. If the output was already applied, the duplicate application has no effect.
**Not a framework feature:** Idempotence is a property of the output operation, not of the stream processor. It can be used with any stream processor (or even without one).
**Exactly-once guarantee:** Achievable without distributed transactions, provided:
1. All output operations are idempotent
2. The broker supports offset replay (log-based broker — the consumer can reset its offset)
3. Processing is deterministic (replaying the same input produces the same output)
4. No other concurrent process writes to the same output keys
**Examples of naturally idempotent operations:**
- Setting a key in a key-value store to a fixed value: `SET user:42:tier premium` — applying twice sets it to the same value
- Upserting a database row by primary key: if the row already exists with the same value, no change
- Writing a file at a fixed path with fixed content: idempotent if the content is deterministic
**Examples of non-idempotent operations that can be made idempotent:**
- Incrementing a counter: non-idempotent. Make idempotent by writing the absolute value instead, or by deduplicating by message offset before incrementing.
- Sending an email: non-idempotent. Make idempotent by maintaining a "sent" set keyed by event ID. Check before sending.
- Publishing a message to another broker: non-idempotent. Make idempotent by including a unique event ID in the message and deduplicating at the consumer.
**Technique — offset in the write:**
When writing to an external database from a Kafka consumer, include the Kafka partition + offset in the write. Before writing, check if that offset has already been applied. If yes, skip. This makes the write idempotent even for counter increments.
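```python
# Non-idempotent: a replayed message increments the counter again.
cur.execute("UPDATE counters SET count = count + 1 WHERE key = ?", (key,))

# Made idempotent with offset tracking. Assumes `cur` is a DB-API cursor on
# connection `conn`, and a unique constraint on (kafka_partition, kafka_offset).
cur.execute("""
    INSERT INTO processed_offsets (kafka_partition, kafka_offset) VALUES (?, ?)
    ON CONFLICT DO NOTHING
""", (partition, offset))
if cur.rowcount > 0:  # A row was inserted, so this offset is not a duplicate
    cur.execute("UPDATE counters SET count = count + 1 WHERE key = ?", (key,))
conn.commit()  # Commit the marker row and the increment in the same transaction
```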
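For a version that runs end to end, the same technique can be sketched against an in-memory SQLite database. The table layout, the `apply_increment` helper, and the column names are assumptions for illustration; `INSERT OR IGNORE` is SQLite's spelling of the skip-on-conflict insert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE counters (key TEXT PRIMARY KEY, count INTEGER NOT NULL);
    CREATE TABLE processed_offsets (
        kafka_partition INTEGER,
        kafka_offset INTEGER,
        PRIMARY KEY (kafka_partition, kafka_offset));
    INSERT INTO counters VALUES ('page_views', 0);
""")

def apply_increment(conn, key, partition, offset):
    """Increment `key` once per (partition, offset), even if the message is replayed."""
    with conn:  # one transaction: marker row and increment commit or roll back together
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_offsets VALUES (?, ?)", (partition, offset))
        if cur.rowcount > 0:  # a row was inserted, so this offset is new
            conn.execute("UPDATE counters SET count = count + 1 WHERE key = ?", (key,))

# Offset 11 is delivered twice, as it would be after a crash before commit
for partition, offset in [(0, 10), (0, 11), (0, 11), (0, 12)]:
    apply_increment(conn, "page_views", partition, offset)

count = conn.execute("SELECT count FROM counters WHERE key = 'page_views'").fetchone()[0]
```

Keeping the marker insert and the increment in one transaction matters: if they were committed separately, a crash between them could record the offset without applying the increment.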
**Fencing:** When failing over from one consumer node to another, the old node may not know it has been replaced. If it sends a stale write after the new node has already written a newer value, it could overwrite the newer value with an older one. Use fencing tokens (monotonically increasing version numbers) to reject stale writes.
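A minimal sketch of fencing on the storage side, assuming the store itself can compare tokens. The `FencedStore` class and the token values are hypothetical; in practice the tokens would be issued by whatever service grants the consumer its lease or partition ownership.

```python
class FencedStore:
    """Key-value store that rejects writes carrying a token older than the newest seen."""

    def __init__(self):
        self.data = {}
        self.highest_token = 0

    def write(self, token, key, value):
        if token < self.highest_token:
            return False             # stale writer: reject the write
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStore()
old_ok = store.write(33, "user:42:tier", "basic")     # old node, token 33
new_ok = store.write(34, "user:42:tier", "premium")   # new node after failover, token 34
stale_ok = store.write(33, "user:42:tier", "basic")   # old node retries with its old token
```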
**Choose idempotent processing when:**
- Output goes to a system with natural upsert or set semantics (key-value store, document database with upsert)
- You want exactly-once without the overhead of distributed transactions or complex framework configuration
- The stream processor is simple (stateless filter + enrich, no large aggregation state)
- You can enforce determinism and sequential ordering via a log-based broker
---
## Comparison Table
| Dimension | Microbatching | Checkpointing | Idempotent Processing |
|---|---|---|---|
| Minimum latency | ~1 second | Sub-second | Depends on output system |
| Framework | Spark Streaming | Flink, Samza | Any |
| State recovery | Reprocess failed batch | Restore from checkpoint snapshot | Replay + reapply (no state to restore) |
| Large state efficiency | Poor (must carry state across batches) | Good (RocksDB + checkpoint) | Good (no framework state) |
| Window independence | No (implicit window = batch interval) | Yes | Yes |
| Exactly-once scope | Within framework | Within framework | End-to-end if all outputs are idempotent |
| Setup complexity | Low (batch semantics) | Medium | Medium (design each output for idempotence) |
---
## The End-to-End Exactly-Once Argument
This is the most important point in this reference and the most commonly misunderstood.
**Framework-level exactly-once is not end-to-end exactly-once.**
Microbatching and checkpointing guarantee that each input record contributes to exactly one output within the stream processing framework. But as soon as output leaves the framework — a write to an external database, a message published to another broker, an email sent, a payment charged — the framework cannot discard that side effect on failure.
Scenario:
1. Stream processor processes a batch/checkpoint period
2. The batch produces an email notification and writes to an external database
3. The processor crashes after sending the email but before committing the offset
4. On restart, the processor reprocesses the same events
5. The email is sent again, and the database write is applied again
The framework reports "exactly-once" — each input record was counted once in the framework's accounting. But the external world saw two emails and two database writes.
**True end-to-end exactly-once requires that all output side effects are either:**
1. **Idempotent** — applying twice produces the same result (safe to reprocess)
2. **Atomically committed** with the consumer offset: all side effects of a batch are committed in a single atomic transaction that also advances the consumer offset. If the transaction fails, neither the side effects nor the offset advance. On retry, the same events are reprocessed and their side effects are committed once, together with the offset.
Atomic commit is available in:
- Google Cloud Dataflow (managed atomic commit facility)
- VoltDB (continuous export with transactional guarantees)
- Kafka Transactions (atomic offset commit + producer publish in a single transaction, available within the Kafka ecosystem)
For side effects that go outside these systems (email, external payment APIs), idempotence is the practical path.
**Practical guidance:**
- For writes to a database or key-value store: design upserts, use offset-keyed deduplication
- For messages published to another Kafka topic: use Kafka Transactions (exactly-once within Kafka)
- For emails, SMS, or external API calls: maintain a "sent" deduplication log keyed by event ID; check before sending
- For payment charges: use idempotency keys (all major payment APIs support them)
- For webhook calls: include the event ID; build the receiver to deduplicate by event ID
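The "sent" deduplication log from the guidance above can be sketched as follows. The in-memory `sent_log` set and `outbox` list are stand-ins (assumptions) for a durable store and a real email provider.

```python
sent_log = set()   # stand-in for a durable deduplication store keyed by event ID
outbox = []        # stand-in for the email provider

def send_once(event_id, recipient, body):
    """Send at most once per event ID; return True only if a send happened."""
    if event_id in sent_log:
        return False               # replayed event: already sent
    outbox.append((recipient, body))
    sent_log.add(event_id)
    return True

first = send_once("evt-1001", "ops@example.com", "Fraud alert for u42")
replay = send_once("evt-1001", "ops@example.com", "Fraud alert for u42")
```

Note that a crash between the send and the log write still leaves a small window for a duplicate. Where the provider accepts an idempotency key, passing the event ID to it closes that window.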
FILE:references/join-type-reference.md
# Join Type Reference
Reference for Step 4 of `stream-processing-designer`.
---
## Why Joins Are Different on Streams
In batch processing, both sides of a join are bounded datasets loaded at a point in time. The join reads both sides fully, matches on the key, and produces output. The datasets do not change during the join.
On streams, both sides are unbounded and continuously changing. New events can arrive at any time on either side. The stream processor must maintain state — buffered events or a local copy of a reference table — to match events from different inputs. This state grows over time unless it is bounded by a window.
Three join types cover the common patterns. Each has different state requirements, ordering sensitivity, and time-dependence.
---
## Stream-Stream Join (Window Join)
**What it does:** Correlates events from two event streams where related events occur close in time, joined by a shared key.
**State required:** A buffer of recent events from both streams within a configurable time window, indexed by the join key.
**How it works:**
1. When an event from stream A arrives, add it to the A-side buffer indexed by join key
2. Check the B-side buffer for matching events (same join key, within the time window)
3. For each match, emit a joined output event
4. When an event from stream B arrives, same process in reverse
5. Evict events from both buffers when they fall outside the time window
**Time-dependence:** Results are time-dependent. If a user updates their profile between two correlated events, the join may pair them with different profile versions depending on the order of arrival. If events on different streams arrive in different orders on reruns, the join results differ.
**Ordering across streams:** There is typically no guaranteed ordering across different streams or partitions. Events on stream A and stream B may arrive in any interleaving order at the stream processor, even if they were produced in a clear causal order at the source.
**Use cases:**
- Click-through rate: join search query events with click events by session ID within 1 hour
- Advertising attribution: join ad impression events with conversion events by user ID within 24 hours
- Fraud detection: correlate transaction events with location events by user ID within 5 minutes
- Request-response matching: join request events with response events by request ID within a timeout
**Example — search click-through:**
```
Stream A: {session_id: "s1", query: "laptop", timestamp: 10:00}
Stream B: {session_id: "s1", url: "example.com/laptop", timestamp: 10:00:30}
Join key: session_id
Window: 1 hour
Output: {session_id: "s1", query: "laptop", clicked_url: "example.com/laptop", latency_sec: 30}
```
If no click arrives within the window, emit a "no click" event for the search.
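The buffering steps described above can be sketched as a small class. This is an illustration under simplifying assumptions: timestamps are plain seconds, eviction is triggered manually, and the "no click" emission is left out. The `WindowJoin` name and its methods are hypothetical.

```python
from collections import defaultdict

class WindowJoin:
    """Two-sided buffer join: events match on key when their timestamps
    are within `window_sec` of each other."""

    def __init__(self, window_sec):
        self.window_sec = window_sec
        self.buffers = {"A": defaultdict(list), "B": defaultdict(list)}

    def on_event(self, side, key, ts, payload):
        other = "B" if side == "A" else "A"
        self.buffers[side][key].append((ts, payload))   # buffer for future matches
        joined = []
        for other_ts, other_payload in self.buffers[other].get(key, []):
            if abs(ts - other_ts) <= self.window_sec:
                a, b = (payload, other_payload) if side == "A" else (other_payload, payload)
                joined.append({"key": key, "a": a, "b": b})
        return joined

    def evict(self, now):
        """Drop buffered events that have fallen outside the window."""
        for buf in self.buffers.values():
            for key in list(buf):
                buf[key] = [(ts, p) for ts, p in buf[key] if now - ts <= self.window_sec]
                if not buf[key]:
                    del buf[key]

join = WindowJoin(window_sec=3600)
r1 = join.on_event("A", "s1", 36000, "laptop")               # search at 10:00:00
r2 = join.on_event("B", "s1", 36030, "example.com/laptop")   # click at 10:00:30
r3 = join.on_event("B", "s2", 36040, "example.com/phone")    # click with no search
join.evict(now=36000 + 7200)
```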
**Configuration considerations:**
- Window size: Must be large enough to capture the natural delay between related events. Too small — misses real correlations. Too large — retains excessive state and joins unrelated events.
- State size: `buffer_size = event_rate * window_duration * avg_event_size`. For high-volume streams with large windows, this can be gigabytes. Choose checkpointing for recovery.
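As a worked instance of the state-size formula, with assumed numbers chosen only for illustration (5,000 events per second, a 1-hour window, 500-byte events):

```python
def join_buffer_bytes(events_per_sec, window_sec, avg_event_bytes):
    """buffer_size = event_rate * window_duration * avg_event_size, for one side."""
    return events_per_sec * window_sec * avg_event_bytes

one_side_bytes = join_buffer_bytes(events_per_sec=5_000, window_sec=3600, avg_event_bytes=500)
one_side_gb = one_side_bytes / 10**9
```

That is 9 GB for one side of the join before indexing overhead, and the other stream needs its own buffer sized the same way.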
---
## Stream-Table Join (Stream Enrichment)
**What it does:** Enriches each event from an event stream with data from a reference table that is continuously updated via a changelog stream (typically CDC from a source database).
**State required:** A local copy of the reference table (or the relevant partition of it), maintained in the stream processor's local storage (in-memory hash map for small tables, RocksDB index for large tables).
**How it works:**
1. The stream processor subscribes to both the event stream and a CDC changelog of the reference table
2. When a table change event arrives, update the local copy
3. When an event from the main stream arrives, look up the join key in the local table copy and enrich the event
4. Emit the enriched event
**Key difference from batch:** In a batch job, the reference table is a point-in-time snapshot. In a stream-table join, the local copy is continuously updated — the processor joins each event against the version of the table that exists at the time the event is processed.
**Time-dependence:** This is a form of time-dependence. An activity event processed before a profile update is enriched with the old profile. An event processed after the update is enriched with the new profile. This is usually the desired behavior (reflect the state at the time of activity), but it must be documented.
**For historical correctness:** If reprocessing historical events, the local table copy will reflect the current state, not the historical state at the time of the original events. This can produce different results than the original processing run. Use slowly changing dimension (SCD) versioning if historical correctness is required:
- Assign a version ID to each table record when it changes
- Include the version ID in the event at processing time
- On reprocessing, join against the versioned table using the stored version ID
**Use cases:**
- Enrich user activity events with user profile (name, subscription tier, region)
- Enrich order events with product metadata (category, price tier, weight class)
- Enrich IoT sensor readings with device metadata (location, calibration factors)
- Enrich log events with service metadata (owner, SLA tier, dependency graph)
**Example — activity enrichment:**
```
Activity stream: {user_id: "u42", action: "checkout", amount: 199.00, ts: 10:05}
Profile CDC stream: {user_id: "u42", tier: "premium", region: "us-east", updated: 09:50}
Local table state at 10:05: user "u42" → {tier: "premium", region: "us-east"}
Output: {user_id: "u42", action: "checkout", amount: 199.00, tier: "premium", region: "us-east", ts: 10:05}
```
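The same flow can be sketched in Python with a dictionary standing in for the local table copy. The handler names and the `deleted` flag on change events are assumptions; a real CDC payload (for example from Debezium) has its own envelope format.

```python
profiles = {}   # local copy of the reference table, keyed by user_id

def on_profile_change(change):
    """Apply one changelog event to the local table copy."""
    if change.get("deleted"):
        profiles.pop(change["user_id"], None)
    else:
        profiles[change["user_id"]] = {"tier": change["tier"], "region": change["region"]}

def enrich(event):
    """Join one activity event against the table state at processing time."""
    profile = profiles.get(event["user_id"], {"tier": None, "region": None})
    return {**event, **profile}

on_profile_change({"user_id": "u42", "tier": "basic", "region": "us-east"})
before = enrich({"user_id": "u42", "action": "view", "ts": "09:40"})
on_profile_change({"user_id": "u42", "tier": "premium", "region": "us-east"})
after = enrich({"user_id": "u42", "action": "checkout", "amount": 199.00, "ts": "10:05"})
unknown = enrich({"user_id": "u99", "action": "view", "ts": "10:06"})
```

The two results for `u42` carry different tiers because each event was joined against the table version present when it was processed.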
**Similarity to stream-stream join:** A stream-table join is a stream-stream join where the table-side join uses a conceptually infinite window (back to the beginning of time), with newer records overwriting older ones. The event-side join uses no window.
---
## Table-Table Join (Materialized View Maintenance)
**What it does:** Maintains a continuously updated materialized view of the join between two tables, both represented as changelog streams. When either table changes, the materialized view is recomputed.
**State required:** Both tables maintained in full in local storage. When either table changes, the affected portion of the join result is recomputed and emitted as a change event.
**How it works:**
1. Subscribe to changelog streams for both tables (Table A and Table B)
2. Maintain local copies of both tables
3. When a record in Table A changes, look up all matching records in Table B and emit updated join results
4. When a record in Table B changes, look up all matching records in Table A and emit updated join results
5. The output is a changelog stream of the materialized view
**Use cases:**
- Maintaining a per-user timeline cache (tweets table joined with follows table)
- Maintaining a denormalized product catalog view (products table joined with inventory table)
- Keeping a search index current (events table joined with metadata table)
**Twitter timeline example (canonical table-table join):**
```sql
-- The materialized view being maintained:
SELECT follows.follower_id AS timeline_id,
array_agg(tweets.* ORDER BY tweets.timestamp DESC)
FROM tweets
JOIN follows ON follows.followee_id = tweets.sender_id
GROUP BY follows.follower_id
```
Event processing rules:
- New tweet by user U → add tweet to timeline of every follower of U
- Tweet deleted → remove from all followers' timelines
- User A follows user B → add B's recent tweets to A's timeline
- User A unfollows user B → remove B's tweets from A's timeline
The stream processor maintains the follower list for each user as a local table, updated via the follows changelog.
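The four processing rules can be sketched as handlers over in-memory tables. This is a simplified illustration: timelines are plain lists in arrival order rather than sorted by timestamp, and all names are hypothetical.

```python
from collections import defaultdict

followers = defaultdict(set)    # followee_id -> follower ids (local follows table)
tweets_by = defaultdict(list)   # sender_id -> tweet ids (local tweets table)
timelines = defaultdict(list)   # follower_id -> tweet ids (the materialized view)

def on_tweet(sender, tweet_id):
    tweets_by[sender].append(tweet_id)
    for follower in followers[sender]:
        timelines[follower].append(tweet_id)

def on_tweet_deleted(sender, tweet_id):
    tweets_by[sender] = [t for t in tweets_by[sender] if t != tweet_id]
    for follower in followers[sender]:
        timelines[follower] = [t for t in timelines[follower] if t != tweet_id]

def on_follow(follower, followee):
    followers[followee].add(follower)
    timelines[follower].extend(tweets_by[followee])

def on_unfollow(follower, followee):
    followers[followee].discard(follower)
    gone = set(tweets_by[followee])
    timelines[follower] = [t for t in timelines[follower] if t not in gone]

on_follow("alice", "bob")
on_tweet("bob", "t1")
on_tweet("carol", "t2")
on_follow("alice", "carol")
after_follow = list(timelines["alice"])
on_tweet_deleted("bob", "t1")
after_delete = list(timelines["alice"])
on_unfollow("alice", "carol")
after_unfollow = list(timelines["alice"])
```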
**Scale consideration:** Table-table joins can produce large fan-out: one tweet from a user with 10M followers triggers 10M timeline updates. At scale, this fan-out may require rate limiting, batching, or selective materialization (only materialize timelines for active users).
---
## Summary Table
| Dimension | Stream-Stream | Stream-Table | Table-Table |
|---|---|---|---|
| State | Both-sided event buffer (bounded by window) | Full reference table (or partition) | Both tables in full |
| Time scope | Bounded window on both sides | Infinite window on table side; no window on event side | Infinite on both sides |
| Trigger | Matching event arrives on either side | Event arrives on the stream side | Change arrives on either side |
| Typical latency sensitivity | High (correlating near-simultaneous events) | Medium (enrichment per event) | Medium (view updates on table changes) |
| Memory cost | Window size × event rate | Table size | Both table sizes |
| Nondeterminism risk | High (ordering across streams) | Medium (table version at processing time) | Medium (ordering of table changes) |
| Reprocessing behavior | May differ (different interleaving) | Will differ if table has changed | Will differ if tables have changed |
---
## Time-Dependence in Joins: The Slowly Changing Dimension Problem
All three join types share a common challenge: the joined data changes over time. If you rerun the same job on the same input, and the reference data has changed since the original run, you get different output. This is nondeterminism from state mutation.
**In data warehouses**, this is called the slowly changing dimension (SCD) problem. Standard solutions:
**SCD Type 1 — Overwrite:** Keep only the current value. Simple. Reprocessing produces results based on current data, not historical data. Acceptable when historical correctness is not required.
**SCD Type 2 — Versioning:** Create a new record for each version, with validity dates. Join against the version valid at the time of the event. Preserves historical correctness. Prevents log compaction (all versions must be retained).
Apply SCD Type 2 in stream joins when either condition holds: billing, compliance, or audit requires that reprocessed results match the original results exactly; or the reference data changes frequently and the join result is sensitive to which version is used.
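A minimal sketch of a Type 2 style lookup, assuming each version carries a `valid_from` timestamp and stays valid until the next version begins. `VersionedTable`, `tax_rates`, and `enrich` are hypothetical names used only for illustration, and the rates are made-up sample values.

```python
from bisect import bisect_right


class VersionedTable:
    """Keeps every version of each key, ordered by valid-from time."""

    def __init__(self):
        self._valid_from = {}  # key -> sorted list of valid_from timestamps
        self._values = {}      # key -> values, parallel to _valid_from

    def put(self, key, valid_from, value):
        times = self._valid_from.setdefault(key, [])
        values = self._values.setdefault(key, [])
        i = bisect_right(times, valid_from)
        times.insert(i, valid_from)
        values.insert(i, value)

    def lookup(self, key, event_time):
        """Return the value valid at event_time, or None if no version existed yet."""
        times = self._valid_from.get(key, [])
        i = bisect_right(times, event_time)
        return self._values[key][i - 1] if i > 0 else None


tax_rates = VersionedTable()
tax_rates.put("DE", valid_from=0, value=0.16)
tax_rates.put("DE", valid_from=100, value=0.19)


def enrich(sale):
    # Join against the version valid when the sale happened, not the latest one.
    rate = tax_rates.lookup(sale["country"], sale["event_time"])
    return {**sale, "tax_rate": rate}
```

Because the lookup is keyed by event time, rerunning the job after a newer version has been added still produces the same output for older events.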
FILE:references/window-type-selection.md
# Window Type Selection Reference
Reference for Step 3 of `stream-processing-designer`.
---
## Why Windows Exist
A stream is infinite. Aggregations (count, sum, average, percentile) require a bounded dataset. Windows bound the stream by time (or event count, but time-based windows dominate in practice), making aggregations computable.
Windows are also the mechanism by which stream processing achieves the equivalent of `GROUP BY time_bucket(...)` in SQL — except the "table" is unbounded and rows arrive continuously.
---
## Event Time vs. Processing Time
**Event time:** The timestamp embedded in the event itself — the time the event actually occurred. Correct for analytics and any computation where the result should reflect when things happened.
**Processing time:** The wall-clock time on the stream processor machine when the event is being processed. Simpler (no timestamp parsing), but produces artifacts when:
- The processor restarts and processes a backlog: all backlog events are assigned the current processing time, producing a false spike
- Events are delayed by network, queue backlog, or client buffering: delayed events appear in the wrong window
- The processor is redeployed and reprocesses historical events: results differ from the original run
**Rule:** Use event time for any computation where correctness matters. Use processing time only for monitoring-of-the-monitor (e.g., measuring stream processor lag itself) or when delays are negligibly small and approximation is acceptable.
**Implementation note:** Using event time requires that the stream processor read the timestamp from each event and route the event to the correct window based on that timestamp, not the current clock. This is how Flink's EventTime processing mode works.
---
## The Four Window Types
### Tumbling Window
**Shape:** Fixed length, non-overlapping. Every event belongs to exactly one window.
**Visualization:**
```
Time: |---window1---|---window2---|---window3---|
Events: A B C D E F G H I J K L M
```
**Properties:**
- Each event belongs to exactly one window
- Contiguous — no gaps between windows
- Simplest to implement and reason about
**Implementation:** `window_id = floor(event_timestamp / window_size) * window_size`
**Use when:**
- Periodic reporting (hourly totals, daily summaries, per-minute request counts)
- Business metrics that align to calendar periods (hourly SLAs, daily revenue)
- Any aggregation where "which window does this event belong to" has a single clear answer
**Example:** Count HTTP requests per minute for a rate dashboard.
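
The window assignment formula can be sketched in a few lines. This assumes integer timestamps and a window size in the same unit (seconds here); the function names are illustrative.

```python
from collections import Counter


def tumbling_window_start(event_ts: int, window_size: int) -> int:
    """Start of the single tumbling window that contains event_ts."""
    return (event_ts // window_size) * window_size


def count_per_window(event_timestamps, window_size):
    """Count events per tumbling window, keyed by window start."""
    counts = Counter()
    for ts in event_timestamps:
        counts[tumbling_window_start(ts, window_size)] += 1
    return dict(counts)


# Five requests, one-minute windows.
print(count_per_window([0, 59, 60, 61, 125], window_size=60))
# {0: 2, 60: 2, 120: 1}
```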
---
### Hopping Window
**Shape:** Fixed length with overlap. Hop size < window size. Each event belongs to `window_size / hop_size` windows.
**Visualization (5-min window, 1-min hop):**
```
W1: |--10:00 to 10:05--|
W2: |--10:01 to 10:06--|
W3: |--10:02 to 10:07--|
```
**Properties:**
- Smoothing effect: the window advances in small steps, so abrupt changes at window boundaries are reduced
- Higher computation cost than tumbling: each event is processed multiple times
- Multiple windows overlap at any point in time
**Implementation:** Compute tumbling windows at the hop size, then aggregate over `window_size / hop_size` adjacent tumbling windows.
**Use when:**
- Rolling averages: "average over the last 5 minutes, updated every 1 minute"
- Trend detection where abrupt metric changes at tumbling window boundaries would create false alarms
- Smoothed metrics for dashboards
**Example:** 5-minute rolling 99th percentile response time, updated every 30 seconds.
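
A sketch of the "tumbling buckets at the hop size, then combine adjacent buckets" approach, using a simple count as the aggregate. It assumes integer timestamps and a window size that is an exact multiple of the hop size; `hopping_counts` is a hypothetical helper. Percentiles would need a mergeable summary per bucket rather than a plain counter.

```python
from collections import Counter


def hopping_counts(event_timestamps, window_size, hop_size):
    """Event count per hopping window, keyed by window start."""
    if window_size % hop_size != 0:
        raise ValueError("window_size must be a multiple of hop_size")
    # Step 1: tumbling buckets whose length equals the hop size.
    buckets = Counter((ts // hop_size) * hop_size for ts in event_timestamps)
    if not buckets:
        return {}
    hops_per_window = window_size // hop_size
    # Step 2: each hopping window sums hops_per_window adjacent buckets.
    first_start = min(buckets) - window_size + hop_size
    result = {}
    for start in range(first_start, max(buckets) + 1, hop_size):
        result[start] = sum(
            buckets.get(start + i * hop_size, 0) for i in range(hops_per_window)
        )
    return result


# 5-minute window, 1-minute hop, three events.
print(hopping_counts([600, 630, 700], window_size=300, hop_size=60))
# {360: 2, 420: 3, 480: 3, 540: 3, 600: 3, 660: 1}
```

Each of the three events is counted in five windows, matching `window_size / hop_size`.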
---
### Sliding Window
**Shape:** Variable-length based on event proximity. Groups all events within a fixed time interval of each other, regardless of fixed boundaries.
**Visualization (5-min sliding):**
```
Events at: 10:03:39, 10:08:12 → same window (8m12s - 3m39s = 4m33s < 5min)
Events at: 10:03:39, 10:09:00 → different windows (9m00s - 3m39s = 5m21s > 5min)
```
**Properties:**
- No fixed boundaries — window is defined by proximity between events
- Correctly captures co-occurrence regardless of which "period" the events fall in
- Higher memory cost: requires buffering all recent events sorted by time
**Implementation:** Maintain a buffer of recent events sorted by timestamp. When a new event arrives with event timestamp `t`, include all buffered events within `[t - window_size, t]` and evict events older than `t - window_size`.
**Use when:**
- Detecting events that co-occur within a time proximity (rapid successive failures, correlated errors)
- Computing "events within N minutes of each other" where fixed boundaries would split natural pairs
- Proximity-based detection in monitoring, fraud detection, anomaly detection
**Example:** Detect 5 failed login attempts within any 10-minute window, regardless of clock-aligned boundaries.
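
A minimal sketch of the buffer-and-evict approach for this example. It assumes events for one key arrive in timestamp order (out-of-order arrival would need a sorted insert) and that one detector instance is kept per user; the class name is hypothetical.

```python
from collections import deque


class SlidingWindowDetector:
    """Flags when `threshold` events fall within `window_size` of each other."""

    def __init__(self, window_size, threshold):
        self.window_size = window_size
        self.threshold = threshold
        self._events = deque()  # event timestamps, oldest first

    def observe(self, event_ts):
        """Record one event; return True if the threshold is reached."""
        self._events.append(event_ts)
        # Evict events that fall outside [event_ts - window_size, event_ts].
        while self._events[0] < event_ts - self.window_size:
            self._events.popleft()
        return len(self._events) >= self.threshold


detector = SlidingWindowDetector(window_size=600, threshold=5)  # 10 minutes, in seconds
for ts in (0, 100, 200, 300):
    detector.observe(ts)
print(detector.observe(590))  # True: five failures within 10 minutes
```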
---
### Session Window
**Shape:** No fixed duration. Groups events from the same entity (user, device, session) that occur close together, with a gap closing the window when no events arrive for a timeout period.
**Visualization (30-min timeout):**
```
User A: [click, click, click]---30min gap---[click, click]
|--- session 1 ---| |-- session 2 --|
```
**Properties:**
- Window duration is data-driven — active users have long sessions, inactive users have short ones
- Windows are per-entity (keyed by user ID, device ID, session ID)
- Complex state management: sessions must be merged if new events arrive that bridge a gap
**Implementation:** For each entity key, maintain the session start time and last-event time. On new event: if `event_time - last_event_time < gap_timeout`, extend session; otherwise, close current session and start new one.
**Use when:**
- Website or app session analytics (session duration, page views per session)
- User engagement measurement (time spent in app between inactivity periods)
- Any "activity burst" pattern where the natural unit is a continuous period of activity
**Example:** E-commerce session analytics — group all page views and clicks per user into sessions with a 30-minute inactivity timeout.
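
The per-key rule can be sketched as follows. This version assumes events for each key arrive in event-time order, so it never has to merge two sessions bridged by a late event; handling that case needs extra state. `Sessionizer` and its fields are hypothetical names.

```python
class Sessionizer:
    """Groups per-key events into sessions separated by an inactivity gap."""

    def __init__(self, gap_timeout):
        self.gap_timeout = gap_timeout
        self._open = {}   # key -> [session_start, last_event_time, event_count]
        self.closed = []  # (key, session_start, session_end, event_count)

    def observe(self, key, event_time):
        session = self._open.get(key)
        if session is not None and event_time - session[1] < self.gap_timeout:
            # Within the gap: extend the current session.
            session[1] = event_time
            session[2] += 1
            return
        if session is not None:
            # Gap exceeded: close the old session before starting a new one.
            self.closed.append((key, *session))
        self._open[key] = [event_time, event_time, 1]

    def flush(self):
        """Close all open sessions, e.g. at end of input."""
        for key, session in self._open.items():
            self.closed.append((key, *session))
        self._open.clear()


s = Sessionizer(gap_timeout=1800)  # 30 minutes, in seconds
for t in (0, 60, 120, 1920, 1980):
    s.observe("user-a", t)
s.flush()
print(s.closed)
# [('user-a', 0, 120, 3), ('user-a', 1920, 1980, 2)]
```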
---
## Decision Table
| Requirement | Window Type |
|---|---|
| Periodic reports aligned to calendar periods | Tumbling |
| Smoothed metrics without abrupt boundaries | Hopping |
| Events co-occurring within a time proximity, regardless of period | Sliding |
| User activity grouped into natural activity bursts | Session |
| Multiple aggregation periods simultaneously | Multiple tumbling windows |
| Rolling percentiles updated frequently | Hopping |
| Fraud pattern: N events within any M-minute period | Sliding |
| Session duration and depth analytics | Session |
---
## Straggler Event Handling
When using event-time windows, some events arrive after the window has been declared complete. Causes:
- Mobile clients buffering events offline and sending them when connectivity resumes (hours or days of delay)
- Network delays or queue backlog causing modest delays (seconds to minutes)
- Clock skew on the producing device
**Option 1 — Ignore stragglers:**
Declare the window complete once the watermark has passed the window end by a configurable margin (e.g., close the window when the watermark is 5 minutes past the window end). Emit the result. Track straggler count as a metric. Alert if straggler rate exceeds a threshold.
Appropriate when: Straggler rate is low; downstream users understand results are approximate; correction is operationally costly.
**Option 2 — Publish corrections:**
Emit a preliminary result when the watermark is reached. Emit a corrected result if stragglers arrive within a further allowance period. The downstream system must handle retraction and replacement of the earlier result.
Appropriate when: Results are used for billing, compliance, SLA reporting, or any context where downstream users act on results and need corrections.
**Watermark:** A signal from the stream processor asserting that all events with timestamps earlier than T are expected to have arrived. Watermarks advance as event timestamps advance. A common heuristic takes the highest event timestamp observed on each partition, uses the minimum of those values across all partitions, and subtracts a configurable lateness allowance.
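
A sketch of one way to derive a watermark, under the assumption that it is computed as the minimum, across partitions, of each partition's highest observed event timestamp, minus the lateness allowance. `WatermarkTracker` is a hypothetical class, not the API of any particular stream processor.

```python
class WatermarkTracker:
    """Derives a watermark from per-partition event-time progress."""

    def __init__(self, partitions, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self._max_ts = {p: None for p in partitions}

    def observe(self, partition, event_ts):
        current = self._max_ts[partition]
        if current is None or event_ts > current:
            self._max_ts[partition] = event_ts

    def watermark(self):
        """None until every partition has produced at least one event."""
        if any(ts is None for ts in self._max_ts.values()):
            return None
        # The slowest partition bounds overall progress.
        return min(self._max_ts.values()) - self.allowed_lateness

    def is_window_closed(self, window_end):
        wm = self.watermark()
        return wm is not None and wm >= window_end


tracker = WatermarkTracker(["p0", "p1"], allowed_lateness=300)
tracker.observe("p0", 1000)
tracker.observe("p1", 800)
print(tracker.watermark())            # 500
print(tracker.is_window_closed(600))  # False
```

Events that arrive with a timestamp below the current watermark are the stragglers that Option 1 drops and Option 2 turns into corrections.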
---
## Clock Skew on Producing Devices
For events produced by mobile devices or IoT sensors, the device clock may be wrong. Strategy: record three timestamps per event:
1. Event occurrence time (device clock) — the semantic timestamp
2. Event send time (device clock) — when the device attempted to send
3. Event receive time (server clock) — when the server received it
Estimate clock offset: `receive_time - send_time` (assuming negligible network delay). Apply this offset to the event occurrence timestamp to approximate the true event time.
This does not eliminate clock skew, but reduces systematic errors. Document the approach in the pipeline — downstream users need to know that event timestamps are estimates with bounded error.
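
The three-timestamp correction is plain arithmetic. The sketch below assumes all timestamps are in the same unit (seconds) and, as stated above, that network delay is negligible; the function and parameter names are illustrative.

```python
def corrected_event_time(occurred_at_device, sent_at_device, received_at_server):
    """Shift the device's event timestamp by the estimated device clock offset."""
    # Positive offset means the device clock is behind the server clock.
    offset = received_at_server - sent_at_device
    return occurred_at_device + offset


# Device clock runs 120 s behind the server; event occurred 60 s before it was sent.
print(corrected_event_time(occurred_at_device=1000,
                           sent_at_device=1060,
                           received_at_server=1180))
# 1120
```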
Select the right storage engine architecture (LSM-tree, B-tree, or in-memory) for a database workload using a 7-dimensional scored trade-off analysis. Use wh...
---
name: storage-engine-selector
description: |
Select the right storage engine architecture (LSM-tree, B-tree, or in-memory) for a database workload using a 7-dimensional scored trade-off analysis. Use when evaluating RocksDB vs InnoDB vs LevelDB, diagnosing write amplification in production, choosing between write-optimized vs read-optimized storage, selecting a compaction strategy (size-tiered vs leveled), or deciding whether to skip disk with an in-memory database. Also use for: comparing Cassandra vs PostgreSQL storage internals; justifying an existing engine choice to a team; assessing whether compaction pauses are causing latency spikes. Covers LSM-tree family (LevelDB, RocksDB, Cassandra, HBase), B-tree family (PostgreSQL, MySQL InnoDB, SQLite), and in-memory stores (Redis, Memcached, VoltDB).
For choosing between relational/document/graph models, use data-model-selector instead. For OLTP vs. analytics routing, use oltp-olap-workload-classifier instead. For replication topology, use replication-strategy-selector instead.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/designing-data-intensive-applications/skills/storage-engine-selector
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on: []
source-books:
- id: designing-data-intensive-applications
title: "Designing Data-Intensive Applications"
authors: ["Martin Kleppmann"]
chapters: [3]
tags: [storage-engine, lsm-tree, b-tree, sstable, compaction, write-amplification, rocksdb, leveldb, cassandra, hbase, innodb, postgresql, in-memory-database, write-throughput, read-performance, latency, space-efficiency, transactions, database-selection, workload-analysis]
execution:
tier: 2
mode: hybrid
inputs:
- type: codebase
description: "Application codebase, docker-compose, schema files, or architecture description — any artifact that reveals data access patterns"
- type: document
description: "Workload description or requirements document if no codebase is available"
tools-required: [Read, Write, Bash]
tools-optional: [Grep]
mcps-required: []
environment: "Run inside a project directory where codebase or configuration files exist. Falls back to document/description input if no codebase."
discovery:
goal: "Identify the optimal storage engine architecture for a given workload and produce a concrete recommendation backed by a 7-dimensional scored comparison"
tasks:
- "Classify the workload (read-heavy, write-heavy, mixed; point lookups vs range scans)"
- "Score LSM-tree, B-tree, and in-memory across 7 dimensions for this workload"
- "Select a compaction strategy if LSM-tree is recommended"
- "Identify concrete database products matching the recommendation"
- "Flag compaction risk, write amplification risk, or latency predictability issues"
audience:
roles: ["backend-engineer", "software-architect", "data-engineer", "site-reliability-engineer", "tech-lead"]
experience: "intermediate-to-advanced — assumes experience with databases and SQL"
triggers:
- "User is choosing between database products and wants to understand storage internals"
- "User has a write-heavy workload and wants to know if RocksDB or Cassandra is appropriate"
- "User is experiencing write amplification or compaction stalls in production"
- "User wants to justify switching from B-tree (PostgreSQL) to LSM-tree (RocksDB) storage"
- "User is evaluating whether to use an in-memory database"
- "User is comparing Cassandra vs PostgreSQL for a new service"
- "User is reviewing an architecture decision involving database storage"
not_for:
- "Choosing between OLTP and OLAP systems — use oltp-olap-workload-classifier first"
- "Selecting a data model (relational, document, graph) — use data-model-selector"
- "Replication or partitioning strategy — use replication-strategy-selector or partitioning-strategy-advisor"
---
# Storage Engine Selector
## When to Use
You have a workload — a new service, an existing system with performance problems, or an architecture decision pending — and need to choose between storage engine architectures: log-structured merge-tree (LSM-tree), B-tree, or in-memory.
This skill applies when the storage engine choice is open, contested, or needs justification. It produces a concrete recommendation (engine family + product + compaction strategy if applicable) backed by a scored 7-dimensional comparison your team can review and challenge.
**Prerequisite check:** If you do not yet know whether the workload is OLTP or OLAP, run `oltp-olap-workload-classifier` first. This skill assumes an OLTP or mixed workload. Column-oriented storage for analytics is out of scope here.
**Related skills:**
- `data-model-selector` — choose between relational, document, and graph models before choosing a storage engine
- `oltp-olap-workload-classifier` — classify workload type if uncertain
---
## Context & Input Gathering
### Required Context (must have — ask if missing)
- **Write/read ratio:** Why: The fundamental split. LSM-trees are write-optimized; B-trees are read-optimized. Without the ratio, the primary dimension cannot be scored.
- Check prompt for: "write-heavy", "read-heavy", "mixed", throughput numbers, writes/sec, reads/sec
- Check environment for: application code (count DB write vs read calls), docker-compose (any write-intensive services like queues, event logs), schema (append-only tables suggest write-heavy)
- If still missing, ask: "What is your approximate write-to-read ratio? For example: 80% writes / 20% reads, or mostly reads with occasional bulk imports?"
- **Query patterns:** Why: Range scans favor B-trees (keys are sorted in-place on disk); point lookups are fine for both but LSM-trees may need Bloom filters. Full-text search is usually served by a dedicated index such as Lucene, which uses SSTable-like structures for its term dictionary.
- Check prompt for: "range queries", "ORDER BY", "BETWEEN", "time-series", "prefix scan", "key lookup", "GET by ID"
- Check environment for: schema.sql (range indexes, composite indexes), application code (scan vs get patterns)
- If still missing, ask: "Do your queries primarily look up individual records by key, or do they scan ranges of records (e.g., 'all events between timestamp A and B', 'all users with names starting with X')?"
- **Latency requirements:** Why: B-trees provide more predictable latency because reads/writes go to a fixed page. LSM-trees have compaction pauses that can spike tail latency unpredictably. SLA-sensitive services need to know this.
- Check prompt for: SLA numbers (p99 latency), "latency-sensitive", "real-time", "user-facing"
- Check environment for: requirements.md, architecture.md (SLA definitions), config files (read timeout settings)
- If still missing, ask: "Do you have a latency SLA? For example, p99 response time under 50ms? Or is this a background/batch service where occasional slowdowns are acceptable?"
- **Durability requirements:** Why: In-memory databases lose data on restart (unless configured otherwise). If the dataset must survive process failures without replica recovery, disk-based storage is required.
- Check prompt for: "durable", "ACID", "must not lose data", "crash recovery", or conversely "cache", "ephemeral", "rebuilt on restart"
- Check environment for: docker-compose restart policies, backup configurations, any mention of replication
- If still missing, ask: "If the database process crashes, is it acceptable to lose recent writes (rebuilding from a replica or cache), or must every write be durable to disk immediately?"
### Observable Context (gather from environment)
- **Existing database choice:** Look for `docker-compose.yml`, `requirements.txt`, `pom.xml`, or `package.json` for database driver imports. Tells you what is already in use and whether this is a greenfield or migration decision.
- **Data volume:** Look for schema definitions (row counts, partitioning hints), README, or architecture docs. Affects in-memory feasibility.
- **Access pattern code:** Grep the codebase for DB write vs read call ratios; look for bulk insert patterns, streaming writes, or time-series data accumulation.
- **Existing compaction config:** Look for `rocksdb.ini`, `cassandra.yaml`, or similar — signals existing LSM-tree usage and whether compaction is already tuned.
### Default Assumptions
When context cannot be observed and asking would be excessive:
- Write/read ratio unknown → assume mixed (50/50); note this assumption explicitly
- Range query usage unknown → assume point lookups dominate
- Latency SLA unknown → assume best-effort; latency predictability is a "nice to have"
- Data volume unknown → assume larger than available RAM (rules out pure in-memory without further information)
---
## Process
### Step 1: Classify the Workload
**ACTION:** Determine the workload profile across three axes: (a) write intensity, (b) query pattern, (c) latency sensitivity.
**WHY:** The storage engine decision follows directly from these three inputs. Write intensity determines the primary axis (LSM vs B-tree). Query pattern determines whether range scan optimization matters. Latency sensitivity determines whether compaction jitter is acceptable. Getting this classification right avoids the most common mistake: choosing a write-optimized engine (LSM-tree) for a read-heavy workload with range scans, or a read-optimized engine (B-tree) for a write-heavy append-log workload.
Produce a one-line classification:
```
Workload: [write-heavy | read-heavy | mixed]
[point-lookup-dominant | range-scan-dominant | mixed queries]
[latency-SLA-strict | best-effort latency]
```
**IF** the user describes OLAP, analytics, or columnar access → stop here and recommend `oltp-olap-workload-classifier` + column-oriented storage (out of scope for this skill).
**ELSE** → proceed to Step 2 with the OLTP/mixed classification.
---
### Step 2: Score All Three Engine Families on 7 Dimensions
**ACTION:** Score LSM-tree, B-tree, and in-memory across all 7 dimensions for this specific workload. Use a 1–5 scale per dimension (5 = strong fit, 1 = poor fit or disqualifying).
**WHY:** Scoring all dimensions — not just the obvious one — prevents premature convergence. Engineers often pick "write-heavy → Cassandra" without checking latency predictability (LSM compaction can spike p99 badly for low-latency SLAs). Running the full matrix surfaces disqualifying factors that a shortcut misses. It also produces a reviewable artifact that makes the trade-off explicit for the team.
**The 7 dimensions:**
| Dimension | LSM-tree | B-tree | In-memory |
|-----------|----------|--------|-----------|
| **D1: Write Throughput** | High — sequential SSTable writes; compaction batches rewrites. Typical: 100K-1M+ writes/sec on commodity hardware. | Moderate — must update a page in-place; WAL + page write = 2+ disk writes per record. Random page updates are slow on HDDs. | Highest — writes go directly to RAM; disk persistence (if any) is async append. |
| **D2: Write Amplification** | Moderate-to-high — each write may be rewritten multiple times across compaction levels. Leveled compaction: ~10x typical. Size-tiered: lower initial amplification but larger space use. | High — B-tree index must write every piece of data at least twice (WAL + page); page splits cause additional parent writes. | None (for cache-only). Minimal (for durable in-memory with append log). |
| **D3: Read Performance** | Moderate — reads must check memtable + multiple SSTable levels. Bloom filters help for point lookups; absent key lookups require checking all levels. Range scans are efficient once SSTables are sorted. | High — O(log n) traversal to leaf page; keys are sorted in-place; range scans read contiguous pages. Predictable. | Highest — reads served entirely from RAM with no disk I/O. |
| **D4: Latency Predictability** | Low-to-moderate — compaction runs in background threads and competes for disk bandwidth. At high write throughput, compaction may lag; tail latency spikes (p99/p999). | High — reads and writes go to fixed pages; no background reorganization that spikes latency. Well-established, mature behavior. | High — no disk I/O on the read path; latency is consistent. Network round-trip dominates. |
| **D5: Space Efficiency** | Better than B-tree — no page fragmentation; leveled compaction removes redundant copies; compression across sorted blocks is more effective. | Lower — pages have reserved space for future inserts; page splits leave partially-full pages; fragmentation accumulates over time. | Lowest space-to-cost ratio — RAM is 10-100x more expensive per GB than SSD/HDD. |
| **D6: Transactional Semantics** | Weaker by default — a key may exist in multiple SSTable segments; each segment is a snapshot, not a single source of truth. Row-level locks are complex to implement. Some LSM engines (RocksDB transactions, Cassandra LWT) add transaction support, but it is not native. | Strong — each key exists in exactly one place in the index. Range locks attach directly to B-tree pages. Relational databases (PostgreSQL, MySQL InnoDB) build full ACID transactions on top of B-trees. | Varies — depends on implementation. VoltDB/MemSQL offer full ACID. Redis is single-threaded (atomic per command) but not ACID across commands. |
| **D7: Compaction Risk** | Real: if write throughput exceeds compaction rate, unmerged segment files accumulate; read performance degrades (more segments to check); disk space grows without bound. Requires active monitoring. Applies to both size-tiered and leveled compaction. | None: no compaction process. Pages are reused in-place. Fragmentation grows slowly but is managed by VACUUM (PostgreSQL) or similar. | None: no disk segments. |
**Scoring template:**
```
Dimension LSM-tree B-tree In-memory
D1: Write Throughput [1-5] [1-5] [1-5]
D2: Write Amplification [1-5] [1-5] [1-5]
D3: Read Performance [1-5] [1-5] [1-5]
D4: Latency Predictability [1-5] [1-5] [1-5]
D5: Space Efficiency [1-5] [1-5] [1-5]
D6: Transactional Semantics [1-5] [1-5] [1-5]
D7: Compaction Risk [1-5] [1-5] [1-5]
------ ------ ---------
Weighted Total [X/35] [X/35] [X/35]
```
Score each dimension relative to the workload classification from Step 1. A write-heavy workload makes D1 and D2 high-weight; a strict-latency workload makes D4 a potential disqualifier.
See `references/scoring-guide.md` for per-dimension scoring rubrics and worked examples.
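
One possible way to turn the High/Medium/Low weights into a weighted total that still fits the X/35 scale is sketched below. The numeric multipliers and the normalization are assumptions made for illustration; they are not taken from the scoring guide.

```python
from typing import Dict

WEIGHT_VALUES = {"High": 3, "Medium": 2, "Low": 1}  # hypothetical multipliers


def weighted_totals(scores: Dict[str, Dict[str, int]],
                    weights: Dict[str, str]) -> Dict[str, float]:
    """scores: {dimension: {engine: 1..5}}; weights: {dimension: 'High'|'Medium'|'Low'}.

    Returns a weighted total per engine, rescaled so that the maximum is
    5 * number_of_dimensions (35 when all seven dimensions are scored).
    """
    weight_sum = sum(WEIGHT_VALUES[weights[dim]] for dim in scores)
    totals: Dict[str, float] = {}
    for dim, per_engine in scores.items():
        w = WEIGHT_VALUES[weights[dim]]
        for engine, score in per_engine.items():
            if not 1 <= score <= 5:
                raise ValueError(f"{dim}/{engine}: score must be 1-5, got {score}")
            totals[engine] = totals.get(engine, 0.0) + w * score
    scale = len(scores) / weight_sum
    return {engine: total * scale for engine, total in totals.items()}
```

With equal weights on every dimension the result reduces to the plain sum of scores, so the weighting only changes the ranking when the workload makes some dimensions matter more than others.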
---
### Step 3: Apply Disqualifying Filters
**ACTION:** Check for hard disqualifiers before ranking by total score. A single disqualifying condition overrides a high total score.
**WHY:** Total score averaging can hide a fatal flaw. A storage engine that scores 4/5 on six dimensions but 1/5 on one critical dimension (e.g., durability for in-memory when data loss is unacceptable) should be eliminated, not ranked second. Disqualifiers must be applied before totaling.
**Disqualifying conditions:**
| Condition | Disqualifies |
|-----------|-------------|
| Data must survive crash without replica recovery | In-memory (cache-only configurations) |
| Strict ACID transactions required (e.g., financial ledger, inventory) | LSM-tree (unless RocksDB transactions or similar explicitly configured) |
| Latency SLA < 10ms p99, write throughput > 100K/sec simultaneously | LSM-tree (compaction stalls cannot be fully eliminated at high write rates) |
| Dataset >> available RAM, no tolerance for cache misses | In-memory |
| Range scans > 50% of queries AND write volume is low | LSM-tree (reads must check multiple SSTable levels; a B-tree range scan is an O(log n) seek followed by contiguous page reads) |
| Compaction monitoring is not feasible (no ops team) | LSM-tree (compaction lag is a production risk that requires active monitoring) |
Apply disqualifiers. If a family is disqualified, remove it from further scoring.
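
The table can be encoded as a simple filter that runs before any totals are compared. The field names on `Workload` are hypothetical, and vague conditions such as "dataset >> available RAM" are reduced to booleans the reviewer must set by judgment.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Workload:
    # Hypothetical fields; each maps to one row of the disqualifier table.
    needs_crash_durability_without_replica: bool = False
    needs_strict_acid: bool = False
    lsm_transactions_configured: bool = False
    p99_latency_sla_ms: Optional[float] = None
    writes_per_sec: int = 0
    dataset_exceeds_ram: bool = False
    tolerates_cache_misses: bool = True
    range_scan_fraction: float = 0.0
    write_volume_low: bool = False
    can_monitor_compaction: bool = True


def disqualified(w: Workload) -> Dict[str, List[str]]:
    """Return {engine_family: [reasons]} for every family that is ruled out."""
    reasons: Dict[str, List[str]] = {}

    def flag(engine: str, reason: str) -> None:
        reasons.setdefault(engine, []).append(reason)

    if w.needs_crash_durability_without_replica:
        # Applies to cache-only in-memory configurations.
        flag("in-memory", "data must survive a crash without replica recovery")
    if w.needs_strict_acid and not w.lsm_transactions_configured:
        flag("lsm-tree", "strict ACID required and no transaction layer configured")
    if (w.p99_latency_sla_ms is not None and w.p99_latency_sla_ms < 10
            and w.writes_per_sec > 100_000):
        flag("lsm-tree", "p99 SLA under 10 ms at more than 100K writes/sec")
    if w.dataset_exceeds_ram and not w.tolerates_cache_misses:
        flag("in-memory", "dataset exceeds RAM with no tolerance for cache misses")
    if w.range_scan_fraction > 0.5 and w.write_volume_low:
        flag("lsm-tree", "range scans dominate and write volume is low")
    if not w.can_monitor_compaction:
        flag("lsm-tree", "compaction monitoring is not feasible")
    return reasons
```

Returning the reasons rather than a bare list of survivors keeps the output usable in the "Disqualifiers Applied" part of the final recommendation.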
---
### Step 4: Select Compaction Strategy (if LSM-tree survives)
**ACTION:** If LSM-tree is not disqualified, select between size-tiered and leveled compaction based on the workload.
**WHY:** Compaction strategy is the primary tuning lever inside the LSM-tree family and has significant impact on space efficiency, read amplification, and write amplification. Choosing the wrong strategy is a common cause of LSM-tree production problems. LevelDB and RocksDB use leveled compaction by default; Cassandra supports both and defaults to size-tiered. This choice must be explicit.
**Size-tiered compaction:**
- How it works: Newer, smaller SSTables are merged into older, larger SSTables. Tables are organized in tiers by size.
- Best for: Write-heavy workloads where space amplification is acceptable and write throughput is paramount. Lower write amplification during active writes.
- Tradeoff: More space used temporarily (multiple copies of overlapping key ranges exist during compaction). Less suited for read-heavy workloads (more overlap = more segments to check per read).
- Used by: Cassandra (default), HBase.
**Leveled compaction:**
- How it works: The key range is split into smaller SSTables organized into levels. Each level is typically about 10x larger than the previous. Beyond the first level, whose files may overlap, a key appears in at most one SSTable per level.
- Best for: Balanced read/write workloads where space efficiency and read performance matter. Better for range scans. Less disk space wasted.
- Tradeoff: Higher write amplification (a key may be rewritten ~10x as it moves through levels).
- Used by: LevelDB, RocksDB (default).
**Decision rule:**
- Write throughput >> read performance AND space is not constrained → size-tiered
- Balanced workload OR space efficiency matters → leveled
- Unsure → leveled (better-known behavior; easier to reason about production issues)
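
The decision rule is small enough to state as a function. In this sketch `None` stands for "unsure", and the parameter names are illustrative.

```python
from typing import Optional


def choose_compaction(write_dominant: Optional[bool],
                      space_constrained: Optional[bool]) -> str:
    """Pick a compaction strategy; None means the answer is unknown."""
    # Size-tiered only when writes clearly dominate and space is clearly available.
    if write_dominant is True and space_constrained is False:
        return "size-tiered"
    # Balanced workloads, space-sensitive workloads, and "unsure" all fall back to leveled.
    return "leveled"
```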
---
### Step 5: Identify Concrete Database Products
**ACTION:** Map the winning engine family and compaction strategy (if LSM-tree) to specific database products available in the ecosystem.
**WHY:** The engine family is the architectural choice; the product is what the team actually installs. Different products within the same family have significantly different operational characteristics, ecosystem support, and cloud availability. The recommendation must be concrete to be actionable.
**LSM-tree family:**
| Product | Best for | Notes |
|---------|----------|-------|
| **RocksDB** | Embedded key-value store; high write throughput; used as the storage engine in other DBs (MySQL MyRocks, TiKV, and CockroachDB before it moved to Pebble) | Leveled compaction default; highly configurable; C++ library, not a standalone server |
| **LevelDB** | Embedded key-value, simpler than RocksDB | Leveled compaction; less tunable; good for learning or simple embedded use |
| **Cassandra** | Distributed, multi-datacenter, high write throughput at scale; time-series, IoT, event logs | Size-tiered or leveled compaction; CQL interface; no joins; eventual consistency by default |
| **HBase** | HDFS-backed, Hadoop ecosystem; wide-column, very large datasets | Based on Google Bigtable; size-tiered compaction; strong consistency with ZooKeeper |
| **Elasticsearch / Lucene** | Full-text search; inverted index built on SSTable-like structures | LSM-tree internals for term dictionaries; not a general-purpose key-value store |
**B-tree family:**
| Product | Best for | Notes |
|---------|----------|-------|
| **PostgreSQL** | Full ACID, complex queries, relational model, JSON support | B-tree indexes; MVCC; mature; best general-purpose choice for OLTP |
| **MySQL InnoDB** | ACID OLTP; clustered index (primary key = clustered B-tree); wide deployment | InnoDB is the default engine in MySQL; MyISAM is legacy |
| **SQLite** | Embedded; single-writer; mobile/desktop apps | B-tree storage; full ACID; not for concurrent high-throughput |
| **LMDB** | High-read, embedded; copy-on-write B-tree; used in OpenLDAP | No WAL; crash-safe by design; single-writer model |
**In-memory family:**
| Product | Best for | Notes |
|---------|----------|-------|
| **Redis** | Cache, session store, leaderboards, pub/sub | Weak durability (async AOF/RDB); not ACID; single-threaded per command |
| **Memcached** | Pure LRU cache; no persistence; simpler than Redis | Data loss on restart is expected |
| **VoltDB / MemSQL (SingleStore)** | ACID in-memory OLTP; financial transactions at speed | Full SQL, full ACID; data survives restart via disk snapshots/replication |
| **RAMCloud** | Research prototype; log-structured in-memory + disk | Durable in-memory with log-structured persistence |
---
### Step 6: Produce the Recommendation
**ACTION:** Write a structured recommendation covering the winning engine family, concrete product(s), compaction strategy (if applicable), and the key trade-offs that drove the decision.
**WHY:** A concrete recommendation with explicit rationale enables team alignment and prevents relitigating the decision. The scoring table makes trade-offs transparent. The "what we're giving up" section is essential — it prevents future surprise when the selected engine's weaknesses surface in production.
**Output format:**
```
## Storage Engine Recommendation
### Workload Classification
[One-line workload profile from Step 1]
### Recommendation
**Engine Family:** [LSM-tree | B-tree | In-memory]
**Compaction Strategy:** [Size-tiered | Leveled | N/A]
**Primary Product:** [Specific database product]
**Alternative Product:** [Second option if applicable]
### 7-Dimension Score Summary
| Dimension | LSM-tree | B-tree | In-memory | Weight for this workload |
|--------------------------|:--------:|:------:|:---------:|:------------------------:|
| D1: Write Throughput | [score] | [score]| [score] | [High/Medium/Low] |
| D2: Write Amplification | [score] | [score]| [score] | [High/Medium/Low] |
| D3: Read Performance | [score] | [score]| [score] | [High/Medium/Low] |
| D4: Latency Predictability | [score] | [score]| [score] | [High/Medium/Low] |
| D5: Space Efficiency | [score] | [score]| [score] | [High/Medium/Low] |
| D6: Transactional Semantics | [score] | [score]| [score] | [High/Medium/Low] |
| D7: Compaction Risk | [score] | [score]| [score] | [High/Medium/Low] |
| **Weighted Total** | **[X]** | **[X]**| **[X]** | |
### Why [Winning Engine]
[2-3 sentences connecting the workload classification to the winning dimensions]
### What We're Giving Up
[1-2 sentences on the primary trade-off — what the selected engine does poorly
and how the team should mitigate it]
### Disqualifiers Applied
[If any engine families were disqualified, state why]
### Compaction Risk Flag (LSM-tree only)
[If LSM-tree is selected: what write throughput level risks outpacing compaction,
what metric to monitor, and what the failure mode looks like]
### Operational Notes
[Specific tuning recommendations or gotchas for the selected product]
```
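The weighted total row in the template can be computed mechanically. The sketch below is one possible way to do it, assuming a hypothetical numeric mapping for the High/Medium/Low weights (the skill itself does not prescribe numbers) and using made-up scores purely for illustration.

```python
# Hypothetical weight mapping: the skill does not prescribe numeric weights.
WEIGHTS = {"High": 3, "Medium": 2, "Low": 1}


def weighted_totals(scores, weights):
    """Sum weight * score per engine.

    scores:  {dimension: {engine: score on the 1-5 scale}}
    weights: {dimension: "High" | "Medium" | "Low"}
    """
    totals = {}
    for dimension, per_engine in scores.items():
        w = WEIGHTS[weights[dimension]]
        for engine, score in per_engine.items():
            totals[engine] = totals.get(engine, 0) + w * score
    return totals


# Illustrative scores only; real scores come from Step 2.
example_scores = {
    "D1: Write Throughput": {"LSM-tree": 5, "B-tree": 2, "In-memory": 4},
    "D6: Transactional Semantics": {"LSM-tree": 2, "B-tree": 5, "In-memory": 3},
}
example_weights = {
    "D1: Write Throughput": "High",
    "D6: Transactional Semantics": "Low",
}

totals = weighted_totals(example_scores, example_weights)
winner = max(totals, key=totals.get)
print(totals, winner)
```

Disqualifiers from Step 3 should still be applied before reading the totals; a disqualified engine family does not win on score alone.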
---
## What Can Go Wrong
**Compaction cannot keep up with write rate.** If write throughput exceeds the rate at which compaction threads can merge SSTable files, the number of unmerged segments on disk grows without bound. Read performance degrades (each read must check more segments), disk usage grows, and eventually the system runs out of space. SSTable-based engines such as Cassandra typically do not throttle incoming writes when compaction lags; they rely on the operator to monitor this. RocksDB is a partial exception: it stalls and eventually stops writes once Level-0 file counts cross configured thresholds. Monitor: SSTable count per table (Cassandra), pending compaction tasks, and disk space growth rate.
**Write amplification degrades SSD lifespan.** B-tree and LSM-tree engines both cause write amplification — each logical write results in multiple physical writes. On SSDs, which have a limited number of program/erase cycles per block, sustained write amplification accelerates wear. Leveled compaction in LSM-trees has ~10x write amplification; B-trees typically have 2-4x. For write-heavy workloads on SSDs, track disk write bytes vs application write bytes to measure actual amplification.
**LSM-tree read amplification for absent keys.** If many reads query keys that do not exist in the database, LSM-tree engines must check the memtable and then each SSTable level before confirming the key is absent. Without Bloom filters, this causes multiple disk reads per absent-key lookup. Bloom filters (used by LevelDB, RocksDB, Cassandra) eliminate most absent-key disk reads but require additional memory.
**B-tree fragmentation and VACUUM costs.** B-trees leave partially-full pages after page splits and row deletions. In PostgreSQL, dead row versions accumulate until VACUUM reclaims them. A system that never runs VACUUM (or runs it too infrequently) will see table bloat, degraded scan performance, and transaction ID wraparound issues. VACUUM competes with production queries for I/O.
**In-memory data loss on restart.** Products like Redis (without persistent configuration) and Memcached lose all data on process restart. This is expected behavior for caches, but is a catastrophic failure mode for systems that stored durable state. Verify durability configuration before using any in-memory product for non-ephemeral data.
**Choosing B-tree for write-heavy append workloads on HDDs.** On magnetic hard drives, random writes to B-tree pages are dramatically slower than sequential writes. A write-heavy workload that uses a B-tree engine on HDD may see 10-100x worse write throughput than the same workload on an LSM-tree engine, because each page update requires a disk head seek. LSM-trees write sequentially by design, which aligns with HDD hardware characteristics.
---
## Key Principles
**The engine-workload match is the primary decision axis — not the database brand.** Choosing PostgreSQL vs Cassandra is often framed as a "SQL vs NoSQL" decision, but the real driver is B-tree vs LSM-tree storage internals. Many teams pick a database for API familiarity and suffer performance problems because the storage engine is mismatched to the workload. Map workload characteristics first, then identify products that match.
**Write amplification is a system-level concern, not just a performance metric.** Both B-trees (WAL + page write) and LSM-trees (compaction rewrites) amplify writes. On SSDs, write amplification directly affects hardware lifespan and effective I/O bandwidth. On HDDs, the pattern (random vs sequential) matters more than the amplification factor. Quantify write amplification before concluding that "SSD makes the difference negligible."
**LSM-tree compaction strategy is not a one-time decision — it requires ongoing operations.** Selecting leveled vs size-tiered compaction sets the default behavior, but production workloads change. A write rate that doubled over 6 months may outpace a compaction configuration that worked at launch. LSM-tree engines require operators who monitor compaction health and tune it as load evolves. If the team lacks this capacity, prefer B-tree or a managed cloud service (Amazon DynamoDB, Google Bigtable) that handles compaction operationally.
**In-memory is not just "cache." It is a storage architecture choice with trade-offs.** The performance advantage of in-memory databases is not primarily that they avoid disk reads — the OS page cache already caches hot data in memory for disk-based engines. In-memory databases are faster because they avoid the overhead of encoding in-memory data structures into on-disk formats. They also enable richer in-memory data structures (priority queues, sets, sorted sets in Redis) that are expensive to implement on disk. Choose in-memory for the data structure capabilities, not only for read speed.
**Range queries are inherently easier for B-trees.** In an LSM-tree, sorted key ranges exist within each SSTable, but range scans that span multiple SSTables and levels require merging results from multiple files. In a B-tree, keys are kept in sorted order across the leaf pages; a range scan locates the first key and then follows sibling pointers from leaf to leaf. For range-scan-dominant workloads (time-series ranges, alphabetical ranges, geospatial ranges), B-tree provides structurally lower read amplification.
**B-tree transaction semantics are a genuine advantage, not just a feature.** B-tree engines store each key in exactly one location. This makes it straightforward to attach range locks to tree nodes, implement MVCC (multi-version concurrency control), and guarantee snapshot isolation. LSM-tree engines may have multiple copies of the same key across SSTable segments; enforcing single-writer semantics or range locks requires coordination layers on top of the storage engine. For workloads that require serializable isolation or complex multi-key transactions, B-tree is the structurally simpler choice.
---
## Examples
### Example 1: Time-Series IoT Event Log (Write-Heavy, Eventual Consistency Acceptable)
**Scenario:** A device telemetry platform ingests sensor readings from 500K devices at 5 events/device/second (2.5M events/sec peak). Queries are mostly recent-window reads ("last 24 hours per device") and aggregate dashboards. There is no requirement for multi-row transactions. The team runs on AWS with a 4-person platform engineering team.
**Trigger:** "We're choosing between Cassandra and PostgreSQL for our IoT ingestion layer. Write volume is the primary constraint."
**Process:**
- Step 1: Write-heavy (99% writes at peak), mixed queries (mostly recent-range, some point lookups), best-effort latency (dashboards tolerate 1-2 second delays)
- Step 2: D1 write throughput → LSM-tree=5, B-tree=2; D4 latency predictability → not a disqualifier; D6 transactions → not required (score equally)
- Step 3: No disqualifiers for LSM-tree; B-tree disqualified by write throughput at this scale without extreme horizontal sharding
- Step 4: Size-tiered compaction — write rate is sustained and high; space amplification is acceptable; Cassandra size-tiered is default and well-tested at this scale
- Step 5: Cassandra (distributed, multi-region, proven at IoT scale); DynamoDB as managed alternative
**Output summary:**
```
## Storage Engine Recommendation
### Recommendation
Engine Family: LSM-tree
Compaction Strategy: Size-tiered (Cassandra default)
Primary Product: Apache Cassandra
Alternative Product: Amazon DynamoDB (managed service; removes compaction ops burden)
### Why LSM-tree
2.5M events/sec sustained writes require sequential SSTable appends, not random
B-tree page updates. At this write volume, B-tree random I/O would require 50+
nodes to achieve parity. LSM-tree sequential writes scale linearly with nodes.
### What We're Giving Up
Complex multi-row transactions and ad-hoc joins are not available in Cassandra.
Design data models around denormalized, query-first schemas.
### Compaction Risk Flag
Monitor: nodetool tpstats (Cassandra). Sustained CompactionExecutor pending
tasks > 15 is a warning sign; > 50 is critical. At 2.5M events/sec, size-tiered
compaction requires at least 2 dedicated compaction threads per node and ≥50%
free disk space headroom.
```
---
### Example 2: Financial Ledger Service (Read-Balanced, ACID Required)
**Scenario:** A fintech startup is building a transaction ledger. Every debit/credit must be ACID-compliant (no partial writes, no duplicate entries). The read/write ratio is approximately 60% reads / 40% writes. The business has a strict p99 < 20ms latency SLA for balance lookups. The dataset is expected to be 500GB after 2 years — fits comfortably in a well-provisioned RDS instance.
**Trigger:** "We need a database for our ledger. Someone suggested Cassandra because it's 'scalable.' Does that make sense?"
**Process:**
- Step 1: Mixed workload (60/40), point-lookup-dominant (balance queries by account ID), latency-SLA-strict (p99 < 20ms)
- Step 2: D6 transactions → critical; D4 latency predictability → critical; D1 write throughput → moderate importance
- Step 3: LSM-tree disqualified — ACID multi-row transactions required (debit + credit must be atomic); compaction jitter risks p99 latency SLA. In-memory disqualified — durability required without replica dependency.
- Step 4: N/A (B-tree selected; compaction strategy applies only to LSM-tree)
- Step 5: PostgreSQL (full ACID, B-tree, mature, proven for financial workloads)
**Output summary:**
```
## Storage Engine Recommendation
### Recommendation
Engine Family: B-tree
Compaction Strategy: N/A
Primary Product: PostgreSQL
Alternative Product: MySQL InnoDB (same storage family; more familiar to some teams)
### Why B-tree
ACID transactional semantics (debit + credit as atomic operation) require B-tree's
single-key-location model. LSM-tree's multi-segment structure makes range locking
and snapshot isolation operationally complex. PostgreSQL provides serializable
isolation, mature VACUUM tooling, and predictable p99 latency without compaction
stalls.
### What We're Giving Up
Write throughput ceiling is lower than LSM-tree at extreme scale. If the ledger
exceeds ~50K writes/sec, horizontal sharding or a distributed SQL database such
as CockroachDB should be evaluated.
```
---
### Example 3: Product Catalog with In-Memory Consideration (Read-Heavy, Small Dataset)
**Scenario:** An e-commerce platform serves a product catalog of 2 million SKUs (approx. 8GB uncompressed). Reads are 95% of traffic (product page loads); writes are batch updates (nightly price refresh from suppliers). The dataset is small enough to fit in RAM on a 32GB server. The team wants sub-millisecond reads.
**Trigger:** "Should we put our product catalog in Redis for speed? Or keep it in PostgreSQL?"
**Process:**
- Step 1: Read-heavy (95% reads), point-lookup-dominant (product by SKU), latency-SLA-strict (sub-millisecond)
- Step 2: D3 read performance → in-memory=5; D1 write throughput → not important (nightly batch); D6 transactions → not required; D7 compaction risk → not a concern for in-memory
- Step 3: No disqualifiers for in-memory if nightly refresh process handles data loss on restart via rebuild. B-tree (PostgreSQL) is viable but slower for sub-ms reads.
- Step 4: N/A (in-memory selected)
- Step 5: Redis (with RDB snapshots so a restart reloads from disk instead of waiting for a full rebuild); PostgreSQL retained as source of truth for writes
**Output summary:**
```
## Storage Engine Recommendation
### Recommendation
Engine Family: In-memory (cache layer)
Primary Product: Redis
Architecture Pattern: Batch-refreshed read cache. PostgreSQL is the source of
truth; Redis serves reads; a nightly batch job refreshes Redis from PostgreSQL.
### Why In-memory
8GB dataset fits in RAM. Sub-millisecond read SLA is achievable with Redis
(PostgreSQL with OS page cache typically serves at 1-5ms, not sub-millisecond).
Nightly batch writes are low-frequency and do not stress any storage engine.
### What We're Giving Up
Redis is not the source of truth. A Redis failure requires rebuilding from
PostgreSQL (nightly batch or on-demand rebuild, ~5-10 minute recovery time).
Do not store writes in Redis alone — all mutations go to PostgreSQL first.
### Durability Note
Configure Redis with RDB snapshots or AOF persistence if the 2-million-SKU
rebuild time on restart is unacceptable. Without persistence, a Redis restart
means serving reads from PostgreSQL until the cache warms.
```
---
## References
| File | Contents | When to read |
|------|----------|--------------|
| `references/scoring-guide.md` | Per-dimension scoring rubrics (1-5 scale) with worked examples for write-heavy, read-heavy, and mixed workloads; compaction strategy selection decision tree; workload-to-product routing table | When scoring Step 2 or selecting compaction in Step 4 |
| `references/engine-internals.md` | LSM-tree write path (WAL → memtable → SSTable flush → compaction), B-tree write path (WAL → page update → page split), in-memory durability patterns; Bloom filter mechanics; write amplification calculation method | When a deeper technical explanation is needed for a team discussion or ADR |
| `references/compaction-monitoring.md` | Cassandra compaction metrics (nodetool tpstats, cfstats), RocksDB compaction stats, write stall conditions, disk space headroom rules, alert thresholds | When LSM-tree is selected and operational guidance is needed |
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Designing Data-Intensive Applications by Martin Kleppmann.
## Related BookForge Skills
This skill is standalone. Browse more BookForge skills: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/compaction-monitoring.md
# Compaction Monitoring Reference
This reference provides operational guidance for monitoring and managing compaction in LSM-tree storage engines. Read this when LSM-tree is selected as the storage engine and you need to define production alerting, tune compaction, or diagnose a compaction problem.
---
## Why Compaction Monitoring Is Non-Optional
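LSM-tree engines (Cassandra, RocksDB, LevelDB, HBase) do not reliably protect themselves when compaction falls behind: some apply no backpressure by default, and others (RocksDB, for example) stall writes only after Level-0 file counts cross configured thresholds. If write throughput exceeds the rate at which compaction threads can merge SSTable files: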
1. **SSTable file count grows** — each uncompacted write creates or expands an SSTable file
2. **Read performance degrades** — each read must check more SSTable files (more Bloom filter checks, more potential disk reads)
3. **Disk space grows unboundedly** — old versions of keys accumulate in uncompacted SSTables; space is not reclaimed
4. **Eventually: disk full** — the system stops accepting writes or crashes
This failure mode is insidious because it degrades gradually. A workload that ran fine for months can suddenly become critical as data volume crosses a threshold.
**Rule:** When selecting an LSM-tree engine, define compaction monitoring before the system goes to production.
---
## Cassandra Compaction Metrics
### Key metrics (via `nodetool`)
```bash
# Show pending compaction tasks
nodetool tpstats | grep -A5 "CompactionExecutor"
# Show per-table SSTable count and size (cfstats is the older name for tablestats)
nodetool cfstats <keyspace>.<table>
# Show compaction history
nodetool compactionhistory
# Show active compaction operations
nodetool compactionstats
```
### Interpreting nodetool tpstats output
```
Pool Name            Active  Pending  Completed  Blocked  All time blocked
CompactionExecutor   0       3        85432      0        0
```
| Field | Healthy range | Warning threshold | Critical threshold |
|-------|--------------|------------------|--------------------|
| Active | ≤ configured compaction threads | — | — |
| Pending | < 5 | 5-20 | > 20 |
| Blocked | 0 | > 0 briefly | Sustained > 0 |
### SSTable count per table (cfstats)
```
Table: events
SSTable count: 12 ← target < 20 for size-tiered; < 5 for leveled
Space used (live): 45GB
Space used (total): 67GB ← difference = space waiting for compaction reclaim
```
**Target SSTable counts:**
- Size-tiered compaction: < 20 SSTables per table is normal; 20-50 is elevated; > 50 is a warning
- Leveled compaction: Level 0 should have < 4 SSTables (triggers compaction at 4); if Level 0 > 8, compaction is lagging
### Alert thresholds
| Metric | Warning | Critical | Action |
|--------|---------|----------|--------|
| Pending compaction tasks | > 15 | > 50 | Increase compaction thread count (`concurrent_compactors`); raise `compaction_throughput_mb_per_sec` only if disk I/O has headroom |
| SSTable count (size-tiered) | > 30 | > 60 | Check write rate vs compaction rate; reduce write throughput or add nodes |
| SSTable count (leveled, L0) | > 6 | > 12 | Compaction is behind; check disk I/O saturation |
| Disk space used / disk space total | > 60% | > 80% | Add disk capacity; compaction needs headroom |
| Space used (total) vs (live) ratio | > 1.5x | > 2.5x | Tombstones or stale versions not being collected; trigger manual compaction |
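
The table above translates directly into a small classifier. The following is a minimal sketch, not a monitoring integration: the metric key names are hypothetical, and the numbers are copied from the warning and critical columns of the table.

```python
# (warning, critical) pairs copied from the alert thresholds table.
# The dictionary keys are hypothetical names chosen for this sketch.
THRESHOLDS = {
    "pending_compaction_tasks": (15, 50),
    "sstable_count_size_tiered": (30, 60),
    "sstable_count_leveled_l0": (6, 12),
    "disk_used_fraction": (0.60, 0.80),
    "total_to_live_space_ratio": (1.5, 2.5),
}


def classify(metric, value):
    """Return "ok", "warning", or "critical" for a single metric reading."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


# Readings taken from the sample outputs earlier in this section:
# 3 pending tasks, and 67GB total vs 45GB live space.
print(classify("pending_compaction_tasks", 3))
print(classify("total_to_live_space_ratio", 67 / 45))
```

With the sample `cfstats` figures, the total-to-live ratio is just under 1.5x, so it sits right at the edge of the warning band.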
### Tuning compaction throughput (Cassandra)
```yaml
# cassandra.yaml
compaction_throughput_mb_per_sec: 64 # default; increase to 256+ for high-write workloads
concurrent_compactors: 2 # default; increase to 4-8 on nodes with > 8 CPU cores
```
Note: Increasing compaction throughput competes with read/write I/O. On disk-bound systems, more compaction throughput means less available for foreground queries. Test under production-representative load before increasing.
---
## RocksDB Compaction Metrics
### Key statistics (via RocksDB stats API or LOG file)
RocksDB writes compaction statistics to a LOG file and exposes them via `GetProperty`.
```bash
# From the RocksDB LOG file (location: /path/to/db/LOG)
grep "Compaction" /path/to/rocksdb/LOG | tail -50
```
### Key RocksDB properties
```cpp
// In application code (C++ API); `db` is an already-open rocksdb::DB*
std::string stats, pending, pending_bytes;
db->GetProperty("rocksdb.stats", &stats);
db->GetProperty("rocksdb.compaction-pending", &pending);
db->GetProperty("rocksdb.estimate-pending-compaction-bytes", &pending_bytes);
```
| Property | Meaning | Alert threshold |
|----------|---------|-----------------|
| `rocksdb.compaction-pending` | 1 if compaction is pending | Not a direct alert; use pending bytes |
| `rocksdb.estimate-pending-compaction-bytes` | Estimated bytes awaiting compaction | > 10GB: elevated; > 50GB: warning; > 100GB: critical |
| `rocksdb.num-files-at-level0` | Level 0 SSTable file count | > 20: warning; > 40: compaction cannot keep up |
| `rocksdb.total-sst-files-size` | Total SSTable files on disk | Compare to raw data size to measure space amplification |
### Write stall detection
RocksDB introduces write stalls and write stops when compaction cannot keep up. These are logged and can be monitored:
```
# In rocksdb LOG:
[WARN] Stalling writes because we have X level-0 files
# Write stop (engine refuses writes entirely):
[WARN] Stopping writes because we have X level-0 files
```
Write stall thresholds (configurable):
- Default write stall (`level0_slowdown_writes_trigger`): 20 Level-0 files
- Default write stop (`level0_stop_writes_trigger`): 36 Level-0 files
If write stalls are occurring, reduce write throughput, add compaction threads, or increase Level-0 file count thresholds (short-term mitigation only).
---
## HBase Compaction Metrics
HBase exposes compaction metrics via the HBase Master web UI (default: port 16010) and JMX.
### Key JMX metrics
| Metric | Path | Alert threshold |
|--------|------|-----------------|
| Compaction queue size | `Hadoop:service=HBase,name=RegionServer,sub=Server` → `compactionQueueLength` | > 10: elevated; > 50: warning |
| Store file count | HMaster UI → Tables → per-region store file count | > 10 per store: elevated |
| Write request rate | JMX → `writeRequestCount` | Compare to compaction throughput |
### Trigger manual major compaction (use sparingly)
```bash
# HBase shell
hbase shell
> major_compact 'tablename'
```
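Major compaction rewrites all store files (HFiles, HBase's equivalent of SSTables) in each store into a single file, so each region ends up with one file per column family, and it drops deleted cells along with their tombstones. It is I/O-intensive; run it during off-peak hours only.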
---
## Disk Headroom Rules
Insufficient disk headroom prevents compaction from proceeding. Once a disk is more than about 80% full, there may not be enough room to write compaction output files, and the engine enters a degraded state.
**Minimum free disk space by compaction strategy:**
| Compaction strategy | Minimum free space | Recommended free space |
|--------------------|-------------------|-----------------------|
| Leveled (RocksDB, LevelDB default) | 30% | 40% |
| Leveled (Cassandra) | 30% | 40% |
| Size-tiered (Cassandra default, HBase) | 50% | 60-70% |
| Size-tiered during peak write burst | 70% | 80% |
Size-tiered compaction needs more headroom because during a compaction cycle, the engine temporarily holds both the input SSTables and the output SSTable on disk simultaneously. If the table being compacted is 30GB, compaction requires 30GB of free space to write the output before deleting the inputs.
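
The headroom rule can be checked with simple arithmetic. This is a rough sketch under the worst-case assumption stated above (the output is as large as the inputs); the function names are hypothetical.

```python
def free_fraction(disk_total_gb, disk_used_gb):
    """Fraction of the disk that is currently free."""
    return (disk_total_gb - disk_used_gb) / disk_total_gb


def compaction_fits(table_size_gb, disk_total_gb, disk_used_gb):
    """Worst case for size-tiered compaction: the merged output is as large
    as the inputs, and inputs are deleted only after the output is written."""
    return (disk_total_gb - disk_used_gb) >= table_size_gb


# The 30GB table from the paragraph above on a 100GB disk.
print(compaction_fits(30, 100, 60))  # 40GB free
print(compaction_fits(30, 100, 80))  # 20GB free
```

In practice several compactions may run at once, so the free space needed can exceed the size of any single table.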
---
## Failure Mode: Compaction Cannot Keep Up
**Symptoms:**
- SSTable count grows monotonically over time despite active compaction
- Read latency increases gradually as more SSTables must be checked per query
- `estimate-pending-compaction-bytes` (RocksDB) or pending compaction tasks (Cassandra) grow without decreasing
- Disk usage grows faster than raw write rate (uncompacted duplicates accumulating)
**Diagnosis:**
```bash
# Step 1: Is compaction running at all?
nodetool compactionstats # Cassandra
grep "Compaction" /path/to/rocksdb/LOG | tail -20 # RocksDB
# Step 2: Is disk I/O saturated?
iostat -x 1 # Linux — watch %util for the disk device; > 90% = saturated
# Step 3: What is the write rate vs compaction throughput?
# If write rate (bytes/sec) > compaction throughput (bytes/sec), lag will grow
```
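
Step 3 of the diagnosis can be turned into a rough estimate of how quickly the backlog grows. The sketch below uses a deliberately simplified model (constant rates, backlog grows at the difference between write rate and compaction throughput); the function names and the sample rates are assumptions for illustration.

```python
def backlog_growth_mb_per_sec(write_mb_per_sec, compaction_mb_per_sec):
    """Backlog grows only when writes outpace compaction."""
    return max(0.0, write_mb_per_sec - compaction_mb_per_sec)


def seconds_until_backlog(write_mb_per_sec, compaction_mb_per_sec, backlog_limit_gb):
    """Seconds until the uncompacted backlog reaches backlog_limit_gb,
    or None if compaction keeps up with the write rate."""
    growth = backlog_growth_mb_per_sec(write_mb_per_sec, compaction_mb_per_sec)
    if growth == 0:
        return None
    return backlog_limit_gb * 1024 / growth


# Hypothetical rates: 80 MB/s of writes against 64 MB/s of compaction,
# measured against a 50GB pending-bytes limit.
eta = seconds_until_backlog(80, 64, 50)
print(eta)
```

A result of `None` means the lag is not growing under this model; any finite number is a deadline for one of the remediation options below.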
**Remediation options (in order of preference):**
1. **Reduce write rate** — shed load, batch writes, add rate limiting. Temporary relief.
2. **Increase compaction thread count** — more threads = more compaction throughput. Competes with foreground I/O.
3. **Switch compaction strategy** — size-tiered to leveled reduces space amplification; leveled to size-tiered reduces write amplification temporarily.
4. **Add nodes** — distribute the write load and compaction load across more machines.
5. **Add disk capacity** — buying time; does not fix the underlying throughput imbalance.
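6. **Reduce tombstone accumulation** — if the workload has frequent deletes or TTL expirations, tombstones accumulate and must be compacted away. Fewer deletes, or a shorter tombstone grace period (for example, `gc_grace_seconds` in Cassandra, kept longer than the repair interval), reduce the buildup.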
**What not to do:**
- Do not increase `compaction_throughput_mb_per_sec` beyond disk I/O capacity — it will not help and may make foreground performance worse.
- Do not stop and restart the database hoping compaction will "catch up" — restart does not change the throughput balance.
- Do not ignore growing SSTable counts — the system will eventually stop accepting writes or run out of disk space.
FILE:references/engine-internals.md
# Engine Internals Reference
This reference provides technical depth on LSM-tree, B-tree, and in-memory storage internals. Read this when preparing a team explanation, writing an Architecture Decision Record, or debugging a production storage issue.
Source: Designing Data-Intensive Applications, Chapter 3 (Kleppmann).
---
## LSM-Tree Write Path
LSM-tree (Log-Structured Merge-Tree) storage engines never modify data in place. Every write is an append.
### Write sequence
```
Client write (key=K, value=V)
│
▼
1. Write-Ahead Log (WAL)
- Append K,V to an append-only disk log
- WHY: If the process crashes before step 2 completes, the WAL lets
the engine recover the memtable. Without the WAL, recent writes
in the memtable would be permanently lost on crash.
│
▼
2. Memtable (in-memory balanced tree)
- Insert K,V into a red-black tree or AVL tree (keys maintained sorted)
- WHY: Keeping writes in a sorted in-memory structure allows the engine
to flush a complete, sorted SSTable to disk in one sequential write.
If writes went directly to disk unsorted, the engine would need
random I/O on every write.
│
[When memtable exceeds threshold — typically 64MB-256MB]
│
▼
3. SSTable flush to disk
- Write the sorted key-value pairs from memtable to a new SSTable file
- SSTable = Sorted String Table: keys in sorted order, one entry per key
- A sparse in-memory index tracks byte offsets of keys in the file
- A new memtable begins accepting writes while the flush happens
- WHY: SSTable flush is a sequential write (fast on HDD; efficient on SSD).
The sort order comes for free from the memtable's tree structure.
│
▼
4. Compaction (background)
- Merge multiple SSTable files, removing outdated values for the same key
- The mergesort algorithm merges sorted files efficiently, even when
the total size exceeds available RAM
- After merge, old SSTable files are deleted
- WHY: Without compaction, the number of SSTable files grows unboundedly.
Read performance degrades (must check more files per query).
Disk space grows (old values are never reclaimed).
```
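
The compaction step (step 4) can be illustrated with a toy merge. This is a minimal sketch, not how any particular engine implements it: SSTables are modeled as in-memory sorted lists of `(key, value)` pairs, and `TOMBSTONE` is a hypothetical deletion marker.

```python
import heapq

TOMBSTONE = None  # hypothetical marker for a deleted key


def compact(sstables):
    """Merge sorted SSTables into one, keeping the newest value per key.

    sstables: list of sorted [(key, value), ...] lists, newest table first.
    Dropping tombstones is only safe here because ALL tables are merged.
    """
    # Tag each entry with its table's age so equal keys sort newest-first.
    tagged = (
        [(key, age, value) for key, value in table]
        for age, table in enumerate(sstables)
    )
    merged = []
    last_key = object()  # sentinel that equals no real key
    for key, _age, value in heapq.merge(*tagged, key=lambda e: (e[0], e[1])):
        if key == last_key:
            continue  # older copy of a key already handled
        last_key = key
        if value is not TOMBSTONE:
            merged.append((key, value))
    return merged


newest = [("a", 2), ("c", TOMBSTONE)]
older = [("a", 1), ("b", 5), ("c", 9)]
print(compact([newest, older]))
```

`heapq.merge` streams its inputs, which mirrors the point above that a mergesort-style merge works even when the total size exceeds RAM; a real engine would read from files rather than lists.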
### Read sequence
```
Read request (key=K)
│
▼
1. Check memtable — is K in the current in-memory tree?
[Found] → return value
│
▼ [Not found]
2. Check Bloom filter for most recent SSTable
- Bloom filter: probabilistic structure; can say "definitely not in this file"
- False positives possible (file check triggered even if K is absent)
- False negatives impossible (if Bloom filter says absent, K is definitely absent)
- WHY: Without Bloom filters, every absent-key lookup requires reading
all SSTable levels, causing many unnecessary disk reads.
│
[Bloom filter says "maybe present"] → check SSTable file
[Bloom filter says "absent"] → skip this SSTable, check next level
│
▼
3. Binary search within SSTable file using sparse in-memory index
- Index stores offsets for some keys; binary search narrows to a block
- Scan forward within the block to find K
│
▼
4. Repeat for next-older SSTable level if not found
- Check all SSTable levels from newest to oldest
- WHY: The most recent SSTable has the most current value for a key.
Older SSTables may have stale values that have been superseded.
```
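
The newest-to-oldest lookup order can be sketched in a few lines. This toy version omits the Bloom filter and sparse index and binary-searches each table directly; the function names and the list-based table model are assumptions for illustration.

```python
import bisect


def sstable_get(table, key):
    """Binary-search a sorted [(key, value), ...] list.

    Searching for the 1-tuple (key,) works because it sorts immediately
    before any (key, value) pair with the same key.
    """
    i = bisect.bisect_left(table, (key,))
    if i < len(table) and table[i][0] == key:
        return True, table[i][1]
    return False, None


def lsm_get(memtable, sstables, key):
    """memtable: dict; sstables: sorted lists of pairs, newest table first."""
    if key in memtable:
        return memtable[key]
    for table in sstables:
        found, value = sstable_get(table, key)
        if found:
            return value  # newest copy wins; older tables are not consulted
    return None  # absent key: every table had to be checked


memtable = {"a": 10}
sstables = [[("b", 2)], [("a", 1), ("b", 1), ("c", 3)]]
print(lsm_get(memtable, sstables, "b"))
```

The final `return None` is the expensive path: an absent key forces a check of every table, which is exactly the cost Bloom filters are meant to avoid.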
### Write amplification in LSM-trees
Write amplification = (physical bytes written to disk) / (logical bytes written by application)
For leveled compaction:
- Level 0 → Level 1: key is rewritten when Level 0 SSTable is compacted into Level 1
- Level 1 → Level 2: rewritten again
- Typical: on the order of 10x total write amplification for a 7-level tree with a 10x size ratio between levels (higher under heavy write load)
For size-tiered compaction:
- Newer SSTables merge into older ones; fewer levels; initial write amplification is lower
- But temporary space usage is higher (multiple full-size SSTable copies exist during merge)
---
## B-Tree Write Path
B-trees organize data into fixed-size pages (blocks), typically 4KB-16KB. Every page has a fixed address on disk. Updates overwrite the page in place.
### Write sequence
```
Client write (key=K, value=V)
│
▼
1. Write-Ahead Log (WAL / redo log)
- Append the operation to the WAL before modifying the B-tree
- WHY: Page updates are not atomic at the hardware level. If the process
crashes mid-page-write, the page is corrupted. The WAL lets the engine
replay the operation and restore a consistent state on restart.
(This is the same purpose as the LSM-tree WAL, but the failure mode
being protected against is different: LSM-tree WAL protects memtable;
B-tree WAL protects partially-written pages.)
│
▼
2. Find the leaf page containing key K
- Traverse from root: at each internal node, follow the page reference
whose key range encompasses K
- Depth is O(log n): a 4-level tree with 4KB pages and branching factor
500 stores up to 256TB of data
│
▼
3a. Key K already exists in this leaf page → overwrite value V in place
│
3b. Key K does not exist in this page
├── Page has space → insert K,V into the page (maintain sorted order)
└── Page is full → page split
- Split the full page into two half-full pages
- Update parent page to add a reference to the new page
- If parent is also full → parent splits too (cascade upward)
- WHY: Page splits preserve the O(log n) tree depth property.
Without splitting, the tree would degrade toward O(n) lookup.
│
▼
4. Write modified page(s) to disk
- Overwrite the page at its fixed disk address
- WHY: The B-tree design assumes each key exists at exactly one page
address. Overwriting in place keeps this invariant. Moving pages
would require updating all parent references — expensive.
```
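
The capacity figure quoted in step 2 follows from simple arithmetic: each level multiplies the number of reachable pages by the branching factor. A quick check, treating 4KB as 4096 bytes and TB as 10^12 bytes:

```python
def btree_capacity_bytes(levels, branching_factor, page_size_bytes):
    """Upper bound on addressable data: branching_factor^levels pages."""
    return branching_factor ** levels * page_size_bytes


capacity = btree_capacity_bytes(levels=4, branching_factor=500, page_size_bytes=4096)
print(capacity // 10**12)  # 256 (decimal terabytes)
```

This is why B-tree lookups stay shallow: even very large datasets need only three or four page reads from root to leaf.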
### Write amplification in B-trees
- Minimum: 2 writes per logical write (WAL + page)
- On page split: 3+ writes (WAL + split pages + parent page update)
- If the engine avoids partial writes (writes full page even for 1-byte change): additional overhead proportional to page size
- Some engines write page contents twice to guard against partial (torn) page writes: InnoDB uses a doublewrite buffer, and PostgreSQL logs full-page images in the WAL (`full_page_writes`)
### Crash recovery
B-tree crash recovery uses the WAL to replay logged operations whose page writes may not have reached disk. The WAL is an append-only log; replaying it after a crash brings the B-tree pages to a consistent state. This is why the WAL must be written (and fsync'd) before the page write: the WAL is the source of truth for recovery.
---
## In-Memory Database Durability Patterns
In-memory databases serve reads entirely from RAM. Writes may or may not persist to disk, depending on configuration.
### No persistence (cache-only)
- Example: Memcached, Redis without AOF/RDB
- All data lost on process restart
- Acceptable for: session caches, computed view caches, leaderboards where rebuild is fast
- Not acceptable for: any data that cannot be reconstructed from another source
### Append-only file persistence (AOF)
- Example: Redis with AOF enabled
- Every write command is appended to a log file on disk (like a WAL)
- On restart: replay the log to reconstruct state
- Tradeoff: AOF can grow very large; requires periodic compaction (AOF rewrite)
- Durability: configurable — fsync every write (durable, slower) or fsync every second (fast, up to 1 second of data loss possible)
### Periodic snapshot (RDB)
- Example: Redis with RDB enabled
- Full in-memory dataset is serialized to disk periodically (e.g., every 5 minutes or every 1000 writes)
- On restart: load the most recent snapshot; recent writes (since last snapshot) are lost
- Tradeoff: faster restarts than AOF replay; more data loss risk
### Replication-based durability
- Example: Redis Sentinel, Redis Cluster
- Writes are replicated to one or more replicas (asynchronously by default in Redis)
- A failed primary can be replaced by a replica with minimal data loss
- Tradeoff: depends on replica availability; not a substitute for disk persistence in single-node deployments
### Full ACID in-memory (VoltDB, MemSQL/SingleStore)
- Write path: WAL written to disk + write applied to in-memory tables
- On restart: WAL replayed to rebuild in-memory state
- Full ACID semantics: serializable isolation, multi-table transactions
- Tradeoff: WAL write is on critical path; recovery time proportional to WAL size since last checkpoint
---
## Bloom Filter Mechanics
Bloom filters are used by LSM-tree engines (LevelDB, RocksDB, Cassandra, HBase) to avoid unnecessary SSTable file reads when looking up absent keys.
### How it works
A Bloom filter is a bit array of size m, with k hash functions.
**Insert key K:**
1. Hash K with all k hash functions → k bit positions
2. Set those k bits to 1
**Query "is K in the set?":**
1. Hash K with all k hash functions → k bit positions
2. If any bit is 0 → K is definitely NOT in the set (no false negatives)
3. If all bits are 1 → K is probably in the set (false positives possible)
False positive rate decreases as the bit array grows (more bits per key = fewer false positives). A well-tuned Bloom filter with 10 bits per key achieves ~1% false positive rate.
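
The insert and query steps above fit in a short class. This is an illustrative sketch, not the implementation used by any of the engines named here: it derives the k positions from one SHA-256 digest via double hashing, whereas production engines use faster non-cryptographic hashes. The false positive estimate uses the standard approximation (1 - e^(-k / bits_per_key))^k.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: position_i = (h1 + i * h2) mod m.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "maybe present".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


def expected_false_positive_rate(bits_per_key, num_hashes):
    return (1 - math.exp(-num_hashes / bits_per_key)) ** num_hashes


bf = BloomFilter(num_bits=10_000, num_hashes=7)  # 10 bits per key for 1000 keys
for i in range(1000):
    bf.add(f"key-{i}")
print(expected_false_positive_rate(10, 7))
```

With 10 bits per key and 7 hash functions, the approximation gives a rate a little under 1%, consistent with the figure quoted above.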
### Why it matters for LSM-trees
Without Bloom filters, a lookup for an absent key requires:
- Checking the memtable (1 in-memory lookup)
- Reading each SSTable level from disk (multiple disk reads)
With Bloom filters:
- Check the Bloom filter (in-memory, fast)
- If the filter says "absent" → skip this SSTable entirely (0 disk reads for absent keys)
- If the filter says "maybe present" → read the SSTable (only ~1% false positive overhead)
This is the primary optimization that makes LSM-tree read performance for point lookups competitive with B-trees.
---
## Write Amplification: Calculation Method
To measure write amplification in production:
```bash
# Linux: monitor disk write throughput for one device (set DISK_DEVICE, e.g. sda or nvme0n1)
iostat -x 1 | grep "$DISK_DEVICE"
# Compare disk writes (bytes/sec) to the application write rate (bytes/sec):
#   write_amplification = disk_write_bytes_per_sec / application_write_bytes_per_sec
```
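The ratio can also be computed from two counter samples taken over the same interval. The sketch below assumes the disk counter comes from a Linux `/proc/diskstats` line, where the tenth whitespace-separated field is sectors written in 512-byte units. The application-side byte counter is a hypothetical metric you would need to export yourself, and both function names are illustrative.

```python
SECTOR_BYTES = 512  # /proc/diskstats reports sectors in 512-byte units


def bytes_written_from_diskstats_line(line):
    """Extract cumulative bytes written from one /proc/diskstats line."""
    fields = line.split()
    # fields: major, minor, name, reads, reads_merged, sectors_read, ms_reading,
    #         writes, writes_merged, sectors_written, ...
    return int(fields[9]) * SECTOR_BYTES


def write_amplification(disk_start, disk_end, app_start, app_end):
    """Ratio of physical bytes written to logical bytes written over one interval."""
    disk_bytes = disk_end - disk_start
    app_bytes = app_end - app_start
    if app_bytes <= 0:
        raise ValueError("application wrote no bytes in this interval")
    return disk_bytes / app_bytes


if __name__ == "__main__":
    # Made-up sample lines for illustration; real values come from /proc/diskstats.
    before = "8 0 sda 1000 10 20000 500 2000 20 80000 900 0 1200 1400"
    after = "8 0 sda 1000 10 20000 500 4000 40 280000 1800 0 2400 2800"
    disk_a = bytes_written_from_diskstats_line(before)
    disk_b = bytes_written_from_diskstats_line(after)
    # Assume the application reported 10 MiB of logical writes in the same window.
    print(write_amplification(disk_a, disk_b, 0, 10 * 1024 * 1024))
```

Sample over a window long enough to include at least one compaction or checkpoint cycle; a short window can under- or over-state the ratio.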
Expected ranges:
| Engine | Expected write amplification |
|--------|------------------------------|
| B-tree, stable (no splits) | 2-4x |
| B-tree, high insert rate (frequent splits) | 4-8x |
| LSM-tree, size-tiered, low write rate | 3-5x |
| LSM-tree, leveled, steady state | 8-12x |
| LSM-tree, leveled, high write rate | 10-20x |
If measured write amplification significantly exceeds these ranges, investigate:
- B-tree: high page split rate (index fragmentation; consider FILLFACTOR tuning)
- LSM-tree: compaction not keeping up with writes; size-tiered compaction merging frequently
FILE:references/scoring-guide.md
# Scoring Guide: Storage Engine Selector
This reference provides per-dimension scoring rubrics (1–5), a compaction strategy decision tree, and a workload-to-product routing table for use with Step 2 and Step 4 of the `storage-engine-selector` skill.
---
## Per-Dimension Scoring Rubrics
### D1: Write Throughput
Score the engine's suitability for the workload's write volume.
| Score | Meaning | Example condition |
|-------|---------|-------------------|
| 5 | Engine is structurally optimized for this write volume; handles it with commodity hardware | LSM-tree for 100K+ writes/sec sustained |
| 4 | Engine handles this write volume well with standard configuration | B-tree for 10K-50K writes/sec on SSD with WAL tuning |
| 3 | Engine can handle this write volume with tuning or vertical scaling | B-tree for 50K-100K writes/sec; requires fast NVMe + significant RAM |
| 2 | Engine can technically handle this write volume but requires significant sharding/overprovisioning | B-tree for 200K+ writes/sec — expensive and operationally complex |
| 1 | Engine cannot sustain this write volume without architectural workarounds that change the fundamental access model | B-tree for 1M+ writes/sec — not viable without a queue/buffer layer |
**Write volume thresholds (approximate, commodity hardware, 2024 SSD):**
| Volume | LSM-tree fit | B-tree fit |
|--------|-------------|-----------|
| < 5K writes/sec | Both fine | Both fine |
| 5K-50K writes/sec | Both viable | Both viable |
| 50K-500K writes/sec | LSM-tree preferred | B-tree viable with tuning |
| 500K-5M writes/sec | LSM-tree required | B-tree not recommended |
| > 5M writes/sec | Distributed LSM-tree (Cassandra/HBase) | B-tree not viable |
---
### D2: Write Amplification
Score the engine's write amplification relative to the workload's durability requirements and hardware constraints.
| Score | Meaning |
|-------|---------|
| 5 | Minimal write amplification; each logical write causes ≤2 physical writes |
| 4 | Low write amplification (2-4x); typical for well-tuned B-tree with WAL |
| 3 | Moderate write amplification (4-8x); typical for LSM-tree with leveled compaction |
| 2 | High write amplification (8-15x); typical for LSM-tree with leveled compaction under sustained high load |
| 1 | Very high write amplification (>15x); SSD wear concern; B-tree with frequent page splits |
**Notes:**
- In-memory (cache-only): score 5 — no physical disk writes for writes; async persistence only
- In-memory (durable with AOF): score 4 — append-only log; no page-level amplification
- Size-tiered LSM: score 4 during active writes, score 2-3 during compaction bursts
- Leveled LSM: score 3 steady-state; consistent amplification (~10x per level traversal)
---
### D3: Read Performance
Score the engine's suitability for the workload's dominant read pattern.
| Score | Meaning |
|-------|---------|
| 5 | Engine is structurally optimized for this read pattern; predictable O(log n) or better |
| 4 | Good read performance; occasional extra lookups but Bloom filters or caching mitigate |
| 3 | Adequate read performance; requires tuning (Bloom filters, block cache size) |
| 2 | Suboptimal for this read pattern; works but degrades as dataset grows |
| 1 | Structurally poorly suited; each read requires checking many data structures |
**Read pattern to engine fit:**
| Read pattern | LSM-tree | B-tree | In-memory |
|-------------|----------|--------|-----------|
| Point lookup (key exists) | 4 (memtable + 1-2 SSTable levels with Bloom filter) | 5 (O(log n) to leaf) | 5 (O(1) hash) |
| Point lookup (key absent) | 2-4 (Bloom filter critical; without it = all levels) | 4 (O(log n) confirms absence) | 5 |
| Range scan, narrow | 4 (sorted within SSTable; merge across levels) | 5 (contiguous leaf pages) | 3 (no structural range support unless sorted set) |
| Range scan, wide | 3 (must merge multiple SSTable levels) | 5 (sibling page pointers; sequential I/O) | 3 |
| Full table scan | 3 (sequential SSTable reads; good compression) | 3 (sequential page scan; fragmentation affects) | 4 (RAM bandwidth limited) |
---
### D4: Latency Predictability
Score the engine's ability to maintain consistent latency under the workload's conditions.
| Score | Meaning |
|-------|---------|
| 5 | p99 and p999 latency are consistent; no background process interferes with foreground I/O |
| 4 | Mostly consistent; occasional brief spikes during maintenance (VACUUM, checkpoint) that are predictable and schedulable |
| 3 | Periodic latency spikes from background compaction; spikes are measurable but tolerable for non-SLA-critical workloads |
| 2 | Compaction competes with foreground I/O; p99 spikes are frequent at high write rates |
| 1 | Compaction lag causes persistent latency degradation; indistinguishable from a failure state |
**Latency SLA to engine fit:**
| Latency SLA | LSM-tree fit | B-tree fit | In-memory fit |
|------------|-------------|-----------|--------------|
| p99 < 5ms, always | 2 (compaction spikes risk this) | 4 | 5 |
| p99 < 20ms, always | 3 (manageable with compaction tuning) | 5 | 5 |
| p99 < 100ms, mostly | 4 | 5 | 5 |
| Best-effort, no SLA | 5 | 5 | 5 |
---
### D5: Space Efficiency
Score the engine's disk space usage relative to the raw data size.
| Score | Meaning |
|-------|---------|
| 5 | Space overhead < 10% of raw data size; excellent compression |
| 4 | Space overhead 10-30%; minor fragmentation or redundancy |
| 3 | Space overhead 30-60%; some fragmentation or compaction-in-progress redundancy |
| 2 | Space overhead 60-100%; significant fragmentation or size-tiered compaction temp space |
| 1 | Space overhead > 100%; multiple full copies of data exist simultaneously |
**Rule of thumb disk headroom requirements:**
- B-tree: maintain 20-30% free for page splits and VACUUM
- LSM-tree leveled: maintain 30-40% free for compaction output staging
- LSM-tree size-tiered: maintain 50-100% free to avoid compaction being unable to proceed
- In-memory: RAM cost is 10-100x SSD/HDD per GB; factor into cost analysis
---
### D6: Transactional Semantics
Score the engine's native support for the workload's transaction requirements.
| Score | Meaning |
|-------|---------|
| 5 | Full ACID transactions (atomicity, consistency, isolation, durability) natively supported; serializable isolation available |
| 4 | Read committed or snapshot isolation available; most OLTP workloads covered |
| 3 | Basic atomicity per key (single-row atomic); multi-row transactions require application-level coordination |
| 2 | Eventual consistency by default; lightweight transactions (CAS operations) available but expensive |
| 1 | No transaction support; all consistency is application-level |
**Engine to transaction support:**
| Engine family | Native transaction support |
|--------------|---------------------------|
| B-tree (PostgreSQL, InnoDB) | 5 — full ACID, serializable isolation, row-level locks |
| LSM-tree (RocksDB with transactions) | 4 — optimistic and pessimistic transactions available |
| LSM-tree (Cassandra, HBase) | 2 — eventual consistency default; CAS (Lightweight Transactions in Cassandra) |
| In-memory (VoltDB, MemSQL) | 5 — full ACID in-memory |
| In-memory (Redis) | 3 — single-command atomic; MULTI/EXEC runs a batch of commands serially without interleaving from other clients; no rollback if a command in the batch fails |
| In-memory (Memcached) | 1 — no transactions |
---
### D7: Compaction Risk
Score the operational risk introduced by the engine's background maintenance process.
| Score | Meaning |
|-------|---------|
| 5 | No compaction process; background maintenance is lightweight and schedulable |
| 4 | Background maintenance (VACUUM, checkpoint) is predictable and easily scheduled during off-peak hours |
| 3 | Compaction runs continuously; risk is low with proper monitoring and headroom |
| 2 | Compaction can lag at high write rates; requires active monitoring and capacity planning |
| 1 | Compaction lag is a known production issue at this write volume; requires dedicated ops attention |
**When to flag compaction risk as a disqualifier:**
- Team has no dedicated database operations capacity (< 2 engineers with DB expertise)
- Write rate is expected to grow 3x+ in 12 months without a corresponding ops team scaling plan
- Budget for disk capacity headroom is constrained (< 50% free disk; size-tiered compaction needs more)
---
## Compaction Strategy Decision Tree
```
Is LSM-tree the selected engine family?
├── NO → Compaction strategy N/A
└── YES
├── Is write throughput the primary optimization target?
│ (sustained writes > 500K/sec; insert-dominant; batch ingest)
│ ├── YES → Size-tiered compaction
│ │ └── Ensure 50-100% disk headroom for compaction temp space
│ └── NO (balanced or read-leaning)
│ ├── Are range queries > 20% of reads?
│ │ ├── YES → Leveled compaction (better range scans; each key appears at most once per level)
│ │ └── NO
│ │ ├── Is disk space constrained (< 50% headroom)?
│ │ │ ├── YES → Leveled compaction (less space waste)
│ │ │ └── NO → Either; default to leveled (better-known behavior)
│ └── Default → Leveled compaction
```
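The decision tree can be written as a small function. This is a sketch only: the function name, parameter names, and the returned reason strings are hypothetical, and the thresholds (20% range reads, 50% headroom) are copied from the tree above.

```python
def choose_compaction(is_lsm, write_throughput_primary, range_read_fraction, disk_headroom):
    """Return (strategy, reason) following the compaction decision tree.

    range_read_fraction: share of reads that are range queries, 0.0 to 1.0.
    disk_headroom: free disk as a fraction, 0.0 to 1.0.
    """
    if not is_lsm:
        return ("n/a", "compaction strategy applies only to LSM-tree engines")
    if write_throughput_primary:
        return ("size-tiered", "ensure 50-100% disk headroom for compaction temp space")
    if range_read_fraction > 0.20:
        return ("leveled", "better range scan behavior")
    if disk_headroom < 0.50:
        return ("leveled", "less space waste under constrained disk")
    return ("leveled", "default; better-known behavior")


if __name__ == "__main__":
    print(choose_compaction(True, True, 0.05, 0.8))
    print(choose_compaction(True, False, 0.35, 0.8))
```

Written this way, it is visible that every branch except the write-throughput-first branch ends at leveled compaction; the intermediate questions only change the justification to record.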
---
## Workload-to-Product Routing Table
| Workload | Recommended engine | Recommended product | Key consideration |
|----------|-------------------|--------------------|--------------------|
| High-volume event ingestion (IoT, clickstream, logs) | LSM-tree, size-tiered | Cassandra, Kafka + RocksDB | Monitor compaction lag at peak write rates |
| Time-series metrics | LSM-tree, leveled | InfluxDB (internal TSM = LSM variant), RocksDB, TimescaleDB (on PostgreSQL) | Retention policies drive compaction cadence |
| Financial ledger, inventory | B-tree, ACID | PostgreSQL, MySQL InnoDB | Serializable isolation required; horizontal scale via Citus or CockroachDB |
| Key-value cache, session store | In-memory | Redis, Memcached | Decide on persistence requirement before choosing |
| General-purpose OLTP (e-commerce, SaaS) | B-tree | PostgreSQL | Default choice; only deviate with evidence of write bottleneck |
| Full-text search | LSM-tree (inverted index) | Elasticsearch, OpenSearch, Meilisearch | Not a general-purpose storage choice; use alongside a primary DB |
| Embedded key-value (application-internal) | LSM-tree | RocksDB, LevelDB | Not a standalone server; embedded in the application process |
| Wide-column, Hadoop ecosystem | LSM-tree | HBase | Requires ZooKeeper, HDFS operational knowledge |
| Read-heavy product catalog | In-memory + B-tree | Redis (cache) + PostgreSQL (source of truth) | Write-through or write-behind cache pattern |
| Distributed SQL (multi-region ACID) | B-tree (distributed) | CockroachDB, Google Spanner, YugabyteDB | Higher latency than single-region B-tree; consensus protocol adds overhead |
Choose a replication topology (single-leader, multi-leader, or leaderless) and configure it correctly — including sync vs. async mode, quorum parameters (w +...
---
name: replication-strategy-selector
description: |
Choose a replication topology (single-leader, multi-leader, or leaderless) and configure it correctly — including sync vs. async mode, quorum parameters (w + r > n), and consistency guarantees. Use when designing replication for a new system, configuring quorum values for Cassandra/Riak/DynamoDB, deciding how to handle multi-leader write conflicts, or comparing PostgreSQL/MySQL streaming replication vs. CouchDB multi-leader vs. Cassandra leaderless for your architecture. Also use for: selecting a conflict resolution strategy (last-write-wins vs. version vectors); designing multi-datacenter replication; choosing between WAL shipping, logical replication, and statement-based replication log formats.
For diagnosing an existing replication failure (failover gone wrong, lag spike, quorum misconfiguration, split brain), use replication-failure-analyzer instead. For consistency model selection (eventual vs. causal vs. linearizable), use consistency-model-selector instead. For partitioning strategy, use partitioning-strategy-advisor instead.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/designing-data-intensive-applications/skills/replication-strategy-selector
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
- consistency-model-selector
- replication-failure-analyzer
source-books:
- id: designing-data-intensive-applications
title: "Designing Data-Intensive Applications"
authors: ["Martin Kleppmann"]
chapters: [5]
tags:
- replication
- single-leader
- multi-leader
- leaderless
- quorum
- consistency
- failover
- replication-lag
- read-after-write
- monotonic-reads
- consistent-prefix-reads
- cassandra
- riak
- postgresql
- kafka
- dynamo
- conflict-resolution
- last-write-wins
- version-vectors
- split-brain
- sloppy-quorum
- hinted-handoff
- wal-shipping
- logical-replication
- multi-datacenter
- availability
- durability
execution:
tier: 2
mode: hybrid
inputs:
- type: codebase
description: "Application codebase, docker-compose, database config files, or architecture description that reveals data access patterns and consistency requirements"
- type: document
description: "System requirements document or architecture description if no codebase is available"
tools-required: [Read, Write, Bash]
tools-optional: [Grep]
mcps-required: []
environment: "Run inside a project directory with codebase or configuration files. Falls back to document/description input."
discovery:
goal: "Produce a concrete replication strategy recommendation: topology + sync mode + quorum config + consistency guarantees + conflict resolution"
tasks:
- "Classify the system's availability vs. consistency priority"
- "Select the replication topology (single-leader, multi-leader, leaderless)"
- "Select synchronous, asynchronous, or semi-synchronous replication mode"
- "Configure quorum parameters (n, w, r) if leaderless"
- "Select consistency guarantees required by the application"
- "Select a conflict resolution strategy if multi-leader or leaderless"
- "Document failover risks and replication lag anomalies"
audience:
roles: ["backend-engineer", "software-architect", "data-engineer", "site-reliability-engineer", "tech-lead"]
experience: "intermediate-to-advanced — assumes experience with distributed systems and databases"
triggers:
- "User is choosing between replication strategies for a new database deployment"
- "User is configuring Cassandra, Riak, or DynamoDB quorum parameters"
- "User wants to add multi-datacenter replication to an existing system"
- "User is debugging stale reads, inconsistencies, or read-after-write failures"
- "User is assessing failover risk in a single-leader setup"
- "User is evaluating whether to use multi-leader replication for offline or multi-datacenter operation"
- "User is choosing a conflict resolution strategy for concurrent writes"
not_for:
- "Diagnosing replication failures in production — use replication-failure-analyzer"
- "Selecting consistency and isolation levels for transactions — use consistency-model-selector"
- "Choosing a partitioning/sharding scheme — use partitioning-strategy-advisor"
---
# Replication Strategy Selector
## When to Use
You are designing or evaluating a data system that replicates data across multiple nodes and need to select a replication topology, configure how data is propagated, and define what consistency guarantees the application requires.
This skill applies when the replication architecture is open (new system), contested (existing system with lag or consistency problems), or needs documented justification (team alignment, architecture decision record). It produces a concrete recommendation covering topology, sync mode, quorum configuration, consistency guarantees, and conflict resolution strategy.
**Prerequisite check:**
- If you need to understand what consistency model (linearizability, causal, eventual) the application requires before selecting a topology, run `consistency-model-selector` first.
- If you are diagnosing an active replication failure or anomaly in production, run `replication-failure-analyzer` instead.
- This skill does not cover partitioning/sharding — if the dataset is too large for a single machine, run `partitioning-strategy-advisor` after this skill.
---
## Context & Input Gathering
### Required Context (must have — ask if missing)
- **Availability vs. consistency priority:**
Why: This is the primary axis. Single-leader replication is the most consistent topology (all writes go through one node, so ordering is defined). Leaderless replication is the most available (writes are accepted even when nodes are down). Multi-leader sits between them but introduces conflict risk. You cannot choose a topology without knowing which trade-off matters more.
- Check prompt for: "high availability", "must not go down", "consistency is critical", "ACID", "always online", "offline-capable", "multi-region"
- Check environment for: docker-compose (replication factor configs, consistency-level settings), database config files (synchronous_commit, consistency_level), requirements.md
- If still missing, ask: "If a network partition separates some of your database nodes, which is more important: (a) all nodes continue accepting writes even if they may diverge temporarily, or (b) writes are rejected until you can confirm consistency across nodes?"
- **Geographic distribution:**
Why: Multi-datacenter operation strongly favors multi-leader or leaderless replication. In a single-leader setup with multiple datacenters, all writes must route to the datacenter containing the leader, adding cross-datacenter latency to every write. If users are geographically distributed and write latency matters, multi-leader or leaderless is necessary.
- Check prompt for: "multi-region", "multiple datacenters", "users in Europe and US", "global", "CDN", "geographic proximity"
- Check environment for: deployment configs (Kubernetes regions, terraform provider regions), docker-compose (multi-DC labels), architecture.md
- If still missing, ask: "Are your database nodes in a single datacenter, or are they spread across multiple geographic regions?"
- **Write conflict probability:**
Why: Multi-leader and leaderless replication can produce write conflicts — two nodes concurrently accepting writes to the same key. If writes to the same record from different clients or regions are likely, conflict resolution must be planned. If conflicts are unlikely (e.g., each user only writes to their own data), multi-leader is more tractable.
- Check prompt for: "concurrent writes", "multiple clients updating the same record", "collaborative editing", "last-write-wins is fine", "offline sync", "shopping cart"
- Check environment for: application code (multi-writer patterns), schema (shared mutable records, counters, aggregates)
- If still missing, ask: "Is it possible for two different clients or datacenters to write to the same record at approximately the same time? For example, two users editing the same document, or two datacenters updating the same account balance?"
- **Read/write ratio and latency requirements:**
Why: Affects whether asynchronous replication and read scaling are appropriate. Single-leader with many asynchronous followers enables horizontal read scaling. Leaderless quorums let you tune r and w to bias toward read or write performance. If write latency is critical, synchronous replication to all nodes is disqualifying.
- Check prompt for: throughput numbers, "read-heavy", "write-heavy", latency SLA (ms), "real-time", "user-facing"
- Check environment for: requirements.md, architecture.md (SLA definitions), application code (read/write ratios)
- If still missing, ask: "What is the approximate read-to-write ratio and your latency requirement? For example: 90% reads with p99 < 50ms, or mostly writes with best-effort latency?"
### Observable Context (gather from environment)
- **Existing database and replication config:** Look for `docker-compose.yml`, `postgresql.conf` (synchronous_standby_names, wal_level), `cassandra.yaml` together with keyspace definitions (replication_factor) and driver or query settings (consistency level), `my.cnf` (binlog_format, rpl_semi_sync_master_enabled). Reveals current topology and whether this is a greenfield or migration decision.
- **Number of replicas/nodes:** Look for replica count configurations. Affects quorum math (n must be odd for majority quorums to work cleanly; n=3 tolerates 1 failure, n=5 tolerates 2).
- **Durability configuration:** Look for fsync settings, WAL configuration, backup policies. Signals how much data loss risk is currently accepted.
- **Application consistency patterns:** Grep codebase for session tokens, user-ID-based routing, "read from primary" comments, timestamp-based staleness checks — signals that the team is already working around replication lag.
### Default Assumptions
- Replication factor unknown → assume n=3 (standard minimum for fault tolerance; tolerates 1 node failure)
- Geographic distribution unknown → assume single-datacenter
- Write conflict probability unknown → assume low (most applications have per-user or per-entity write ownership)
- Latency SLA unknown → assume best-effort; synchronous replication to all replicas is not required
---
## Process
### Step 1: Classify the System Profile
**ACTION:** Determine the system profile across three axes: (a) availability vs. consistency priority, (b) geographic distribution, (c) write conflict probability.
**WHY:** The replication topology selection is determined almost entirely by these three axes. Geographic distribution eliminates single-leader as a low-latency option. High conflict probability makes leaderless difficult to use safely. Strict consistency requirements eliminate fully asynchronous leaderless replication. Getting the profile right prevents the most common mistake: choosing Cassandra (leaderless, eventual) for a system that actually needs read-after-write guarantees, or using single-leader in a multi-datacenter deployment where the cross-datacenter write path adds unacceptable latency.
Produce a one-line profile:
```
Profile: [consistency-priority | availability-priority | balanced]
[single-datacenter | multi-datacenter]
[low-conflict | high-conflict | conflict-acceptable]
```
**IF** the user describes offline-first mobile clients → note that multi-leader (or leaderless) is structurally required; each device is effectively a leader.
**IF** the user describes collaborative editing (multiple users editing the same document) → flag that conflict resolution complexity is high; consider whether single-leader with leader-per-document is feasible.
---
### Step 2: Select the Replication Topology
**ACTION:** Choose between single-leader, multi-leader, and leaderless replication based on the Step 1 profile. Apply disqualifying filters first.
**WHY:** Each topology has a different fundamental design: who can accept writes, in what order writes are applied, and what happens when nodes disagree. These are architectural commitments — changing from single-leader to leaderless after data is in production requires schema redesign, operational migration, and application changes. The topology decision must be made with full awareness of what each option cannot do.
**Disqualifying filters:**
| Condition | Disqualifies |
|-----------|-------------|
| Users distributed across 2+ geographic regions AND write latency < 100ms required | Single-leader (all writes must cross datacenter to reach the leader) |
| Strict ACID transactions required across multiple records | Leaderless (no defined write ordering; transactions require coordination layers) |
| Writes must never be lost (zero data loss on failure) | Fully asynchronous single-leader (async follower may lag behind when leader fails) |
| Application cannot tolerate write conflicts (financial transactions, inventory reservation) | Multi-leader and leaderless (concurrent writes to same record may produce conflicts) |
| Clients must work offline and sync later | Single-leader (requires connectivity to the leader for writes) |
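The filters can be applied mechanically by starting from all three topologies and striking out the disqualified ones. The sketch below is illustrative: the `Requirements` dataclass, its field names, and `surviving_topologies` are hypothetical, and each field is a yes/no reading of one table row.

```python
from dataclasses import dataclass

TOPOLOGIES = ("single-leader", "multi-leader", "leaderless")


@dataclass
class Requirements:
    multi_region_low_write_latency: bool = False  # 2+ regions AND write latency < 100ms
    multi_record_acid: bool = False               # strict ACID across multiple records
    zero_data_loss: bool = False                  # writes must never be lost
    conflicts_intolerable: bool = False           # e.g. financial transactions
    offline_clients: bool = False                 # clients write offline and sync later


def surviving_topologies(req):
    """Return (sorted remaining topologies, notes) after applying the filters."""
    remaining = set(TOPOLOGIES)
    notes = []
    if req.multi_region_low_write_latency:
        remaining.discard("single-leader")
    if req.multi_record_acid:
        remaining.discard("leaderless")
    if req.zero_data_loss:
        # This row rules out a replication mode, not a whole topology.
        notes.append("single-leader must not be fully asynchronous")
    if req.conflicts_intolerable:
        remaining -= {"multi-leader", "leaderless"}
    if req.offline_clients:
        remaining.discard("single-leader")
    return sorted(remaining), notes


if __name__ == "__main__":
    print(surviving_topologies(Requirements(conflicts_intolerable=True, zero_data_loss=True)))
    print(surviving_topologies(Requirements(offline_clients=True, conflicts_intolerable=True)))
```

An empty result, as in the second call, means the stated requirements contradict each other and one of them has to be relaxed before a topology can be chosen.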
**Three-way topology comparison:**
| Dimension | Single-leader | Multi-leader | Leaderless |
|-----------|--------------|-------------|-----------|
| **Write ordering** | Total order (leader serializes all writes) | Partial order (writes within a leader are ordered; cross-leader conflicts possible) | No global order (concurrent writes to same key are possible) |
| **Write availability during leader failure** | Unavailable until failover completes (~30s typical) | Available (other datacenters continue accepting writes) | Available (quorum of remaining nodes accepts writes) |
| **Write latency in multi-datacenter** | High (all writes route to leader's datacenter) | Low (writes go to local datacenter's leader) | Low (writes go to any available replica; quorum locally) |
| **Conflict risk** | None (only one writer) | High (two datacenters can write to same record concurrently) | Medium (concurrent writes possible; quorum overlap reduces risk) |
| **Operational complexity** | Low (well-understood; all major DBs support it) | High (conflict resolution logic required; retrofitted in many DBs) | Medium (quorum config requires care; read repair needed) |
| **Consistency guarantees available** | Read-after-write, monotonic reads, consistent prefix (achievable with techniques) | Eventual at minimum; stronger guarantees require cross-leader coordination | Tunable via quorum config; does not naturally provide the session guarantees above |
| **Representative systems** | PostgreSQL, MySQL, MongoDB (default), Kafka | Tungsten (MySQL), BDR (PostgreSQL), CouchDB | Cassandra, Riak, Voldemort, Amazon's internal Dynamo (the hosted DynamoDB service uses a different architecture based on single-leader replication) |
**Selection rules:**
```
IF single-datacenter AND consistency-priority → Single-leader
IF single-datacenter AND availability-priority AND low-conflict → Leaderless
IF multi-datacenter AND write-latency-critical → Multi-leader OR Leaderless
IF offline-clients OR multi-datacenter-availability → Multi-leader
IF high-conflict AND no-conflict-resolution-capacity → Single-leader (simplest)
IF consistency-priority AND ACID-required → Single-leader (only safe choice)
```
**WHY the selection rules are structured this way:** Single-leader is the simplest and most consistent option — there is only one writer, so write ordering is total and conflicts are impossible. The cost is write availability during failover and cross-datacenter write latency. Multi-leader trades conflict risk for write availability across datacenters. Leaderless trades global ordering for the ability to tune availability and consistency independently via quorum parameters. Most systems should default to single-leader unless there is a concrete reason (multi-datacenter write latency, offline operation) that requires one of the more complex options.
**Record the selected topology and the primary disqualifiers applied.**
---
### Step 3: Select Synchronous vs. Asynchronous Replication Mode
**ACTION:** For the selected topology, determine whether replication to followers/replicas should be synchronous, asynchronous, or semi-synchronous.
**WHY:** Synchronous and asynchronous replication represent a direct trade-off between durability and write availability. Synchronous replication guarantees that a write is on at least two nodes before confirming to the client — if the leader fails immediately after acknowledging the write, the data is not lost. Asynchronous replication confirms the write to the client as soon as the leader applies it locally — if the leader fails before replicating, the write is lost even though the client was told it succeeded. This is not a subtle theoretical concern; it is the primary cause of data loss in leader-based database failures.
**For single-leader and multi-leader:**
| Mode | Behavior | When to use |
|------|----------|-------------|
| **Fully synchronous** | All followers must confirm before leader reports success. Write is durable on all nodes. | Never recommended for all followers — any single follower failure blocks all writes. |
| **Semi-synchronous (recommended default)** | One follower is synchronous; all others are asynchronous. If the synchronous follower fails, an async follower is promoted to synchronous. | When you need at least 2 durable copies of every write, while keeping write availability when some followers lag. In PostgreSQL, listing several standbys in `synchronous_standby_names` (for example `FIRST 1 (standby_a, standby_b)`, with placeholder standby names) gives this behavior. |
| **Fully asynchronous** | Leader confirms immediately; followers catch up in background. Replication lag may be milliseconds to minutes depending on load. | When write throughput is the primary constraint and some data loss on leader failure is acceptable (e.g., log aggregation, analytics events, non-critical telemetry). |
**Configuration implications:**
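- PostgreSQL: `synchronous_standby_names` selects which standbys are synchronous; `synchronous_commit` sets how far a commit waits. `on` waits for the synchronous standby to flush the WAL to disk, `remote_write` waits only until the standby has written it to its operating system (weaker durability, not a semi-synchronous mode), and `local` or `off` do not wait for any standby.
- MySQL: `rpl_semi_sync_master_enabled = 1` enables semi-synchronous replication (the variable is named `rpl_semi_sync_source_enabled` in MySQL 8.0.26 and later). At least one replica must acknowledge before the leader reports success.
- Kafka: `acks=all` waits for all in-sync replicas (ISR). `acks=1` waits for the partition leader only, so followers replicate asynchronously. `acks=0` is fire-and-forget.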
**WHY semi-synchronous is the recommended default:** Fully synchronous (all followers) means any follower slowdown or failure stalls all writes — one network blip kills write availability for the entire cluster. Fully asynchronous means leader failure always risks losing writes that were confirmed to the client, which violates most durability expectations. Semi-synchronous gives you at least 2 durable copies (leader + 1 synchronous follower) while keeping write availability when some nodes are slow or down.
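The trade-off between the three modes can be made concrete with a small model. This is a sketch under simplifying assumptions: one leader plus `follower_count` followers, semi-synchronous means exactly one synchronous follower at a time, and the function names are hypothetical.

```python
def copies_at_ack(mode, follower_count):
    """Nodes guaranteed to hold a write when the client receives its acknowledgement."""
    if follower_count < 1:
        raise ValueError("model assumes at least one follower")
    if mode == "sync":
        return 1 + follower_count  # leader plus every follower
    if mode == "semi-sync":
        return 2                   # leader plus the one synchronous follower
    if mode == "async":
        return 1                   # leader only
    raise ValueError(f"unknown mode: {mode}")


def writes_available(mode, followers_up, follower_count):
    """Whether the leader can still acknowledge writes with this many followers reachable."""
    if not 0 <= followers_up <= follower_count:
        raise ValueError("followers_up must be between 0 and follower_count")
    if mode == "sync":
        return followers_up == follower_count  # any follower outage blocks writes
    if mode == "semi-sync":
        return followers_up >= 1               # another follower can take the sync role
    if mode == "async":
        return True
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode in ("sync", "semi-sync", "async"):
        # Three followers, one of them down.
        print(mode, copies_at_ack(mode, 3), writes_available(mode, 2, 3))
```

With three followers and one of them down, only the fully synchronous mode stops accepting writes, and only the asynchronous mode acknowledges a write that exists on a single node.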
**For leaderless:**
Synchronous vs. asynchronous is expressed as quorum configuration (Step 4). Skip to Step 4.
---
### Step 4: Configure Quorum Parameters (Leaderless Only)
**ACTION:** If leaderless replication is selected, determine the values of n (replication factor), w (write quorum), and r (read quorum) using the formula w + r > n.
**WHY:** The quorum formula w + r > n guarantees that at least one node in every read set has seen the most recent write. This is because the w write nodes and r read nodes must overlap by at least one node (by the pigeonhole principle). Without this overlap, a read could hit only stale nodes and return an outdated value without knowing it. Quorum configuration is the primary lever for trading off availability vs. consistency in a leaderless system.
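The overlap argument can be checked by brute force for small clusters: enumerate every possible set of w write nodes and r read nodes and confirm that they always share a node exactly when w + r > n. The function name below is hypothetical, and the check is only practical for small n.

```python
from itertools import combinations


def quorums_always_overlap(n, w, r):
    """True if every w-node write set intersects every r-node read set among n nodes."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


if __name__ == "__main__":
    print(quorums_always_overlap(3, 2, 2))  # True: 2 + 2 > 3
    print(quorums_always_overlap(3, 1, 1))  # False: 1 + 1 <= 3
    # The brute-force result matches the formula for every small configuration.
    for n in range(1, 6):
        for w in range(1, n + 1):
            for r in range(1, n + 1):
                assert quorums_always_overlap(n, w, r) == (w + r > n)
```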
**Common configurations:**
| n | w | r | w+r>n? | Fault tolerance | Bias |
|---|---|---|--------|----------------|------|
| 3 | 2 | 2 | 4>3 ✓ | 1 node failure | Balanced (standard) |
| 3 | 3 | 1 | 4>3 ✓ | 0 for writes | Write-consistent, read-fast |
| 3 | 1 | 3 | 4>3 ✓ | 0 for reads | Write-fast, read-consistent |
| 5 | 3 | 3 | 6>5 ✓ | 2 node failures | Balanced |
| 5 | 5 | 1 | 6>5 ✓ | 0 for writes | Write durable, read-fast |
| 3 | 1 | 1 | 2>3 ✗ | N/A | High availability, stale reads possible |
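The fault-tolerance column follows directly from the parameters: writes keep succeeding while at least w nodes are up, and reads while at least r nodes are up. A minimal sketch (the function name and dictionary keys are hypothetical):

```python
def quorum_summary(n, w, r):
    """Summarize a quorum configuration: overlap guarantee and tolerated node failures."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must each be between 1 and n")
    return {
        "strict_quorum": w + r > n,          # read and write sets must overlap
        "write_failures_tolerated": n - w,   # nodes that can be down while writes succeed
        "read_failures_tolerated": n - r,    # nodes that can be down while reads succeed
        "failures_tolerated": min(n - w, n - r),  # both reads and writes still succeed
    }


if __name__ == "__main__":
    for config in [(3, 2, 2), (3, 3, 1), (3, 1, 3), (5, 3, 3), (5, 5, 1), (3, 1, 1)]:
        print(config, quorum_summary(*config))
```

For n=3, w=3, r=1 this reports zero tolerated failures for writes and two for reads, which is why the table lists that configuration as read-fast but fragile for writes.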
**Quorum selection rules:**
```
High availability priority → lower w and r (w+r ≤ n gives up the overlap guarantee; stale reads possible)
Read-heavy workload → lower r (r=1 with w<n only if stale reads are acceptable); higher w to keep writes durable
Write-heavy workload → lower w; higher r to keep w+r > n
Strong consistency needed → w+r > n with sloppy quorum disabled (strict quorum); w = n for the most durable writes
Multi-datacenter → use per-datacenter quorums (e.g., Cassandra LOCAL_QUORUM)
```
**WHY the common default is n=3, w=2, r=2:** This is the minimum configuration that provides both fault tolerance (1 node can fail) and quorum overlap (the 2 write nodes and 2 read nodes must share at least 1 node). Lowering to w=1 or r=1 increases availability but risks returning stale data — the read may miss the node that has the latest write.
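The overlap argument can be checked mechanically. The following is a minimal sketch, not part of any database API: the function names are illustrative, and the brute-force check simply enumerates every possible write set and read set to confirm that the formula w + r > n matches actual set overlap.

```python
from itertools import combinations


def overlap_guaranteed(n: int, w: int, r: int) -> bool:
    """Strict-quorum condition: every read set shares a node with every write set."""
    return w + r > n


def overlap_by_enumeration(n: int, w: int, r: int) -> bool:
    """Brute-force check of the same property over all possible node subsets."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


def failures_tolerated(n: int, w: int, r: int) -> dict:
    """Nodes that may be down while writes (need w) and reads (need r) still succeed."""
    return {"writes": n - w, "reads": n - r}


for n, w, r in [(3, 2, 2), (3, 3, 1), (5, 3, 3), (3, 1, 1)]:
    # The formula and the enumeration should always agree.
    assert overlap_guaranteed(n, w, r) == overlap_by_enumeration(n, w, r)
    print((n, w, r), overlap_guaranteed(n, w, r), failures_tolerated(n, w, r))
```

Running this over the rows of the table above is a quick way to sanity-check a proposed configuration before writing it into the recommendation.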
**Sloppy quorums and hinted handoff:**
During a network partition that cuts a client off from enough nodes to reach quorum, the database faces a choice: reject all writes (strict quorum) or accept writes on any available nodes even if they are not the designated "home" nodes for a key (sloppy quorum). Sloppy quorums improve write availability during partitions but break the quorum consistency guarantee — a read of r nodes is no longer guaranteed to overlap with the w write nodes. After the partition heals, hinted handoff sends the temporarily-routed writes to their home nodes. Sloppy quorums are appropriate for use cases that can tolerate occasional stale reads (Riak enables by default; Cassandra and Voldemort disable by default).
**WHY sloppy quorums break the consistency guarantee:** With a strict quorum, w + r > n ensures the read and write node sets overlap. With a sloppy quorum, writes may land on "neighbor" nodes not in the normal n-node set. When you read from r nodes in the normal set, those nodes may not include the neighbor nodes that received the write — so w + r > n no longer guarantees overlap. Sloppy quorum is a write-availability mechanism, not a consistency mechanism.
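A toy model makes the lost overlap concrete. This sketch assumes a hypothetical five-node ring in which nodes A, B, and C are the home replicas for a key and D and E are neighbors; the node names and the `sloppy_write` helper are illustrative and do not mirror any particular database's placement logic.

```python
HOME_NODES = {"A", "B", "C"}   # the n = 3 designated replicas for the key
NEIGHBOR_NODES = {"D", "E"}    # stand-in nodes that may accept writes during a partition


def sloppy_write(reachable: set, w: int) -> set:
    """Choose w reachable nodes, preferring home nodes and falling back to neighbors."""
    home = sorted(reachable & HOME_NODES)
    neighbors = sorted(reachable & NEIGHBOR_NODES)
    chosen = (home + neighbors)[:w]
    if len(chosen) < w:
        raise RuntimeError("not enough reachable nodes to satisfy w")
    return set(chosen)


# During the partition the writer can reach only home node C plus the neighbors.
write_set = sloppy_write({"C", "D", "E"}, w=2)

# A reader on the other side of the partition reaches home nodes A and B (r = 2).
read_set = {"A", "B"}

# w + r = 4 > n = 3, yet the two sets share no node, so the read is stale.
print(sorted(write_set), sorted(read_set), write_set & read_set)
```

Until hinted handoff delivers the write from D back to the home nodes, no read that touches only A and B can observe it.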
---
### Step 5: Select Consistency Guarantees
**ACTION:** Determine which consistency guarantees the application requires and verify that the selected topology and sync mode can provide them.
**WHY:** Replication lag is not theoretical — in asynchronous systems, followers can lag by milliseconds to minutes. Applications that read from followers without accounting for this lag expose users to anomalies: seeing a write they just made disappear (violating read-after-write), watching data appear to move backward in time (violating monotonic reads), or seeing an answer before the question that caused it (violating consistent prefix reads). Each anomaly requires a specific mitigation technique. Choosing the right mitigation depends on knowing which anomaly is unacceptable for your application's users.
**The three replication lag anomalies and their mitigations:**
**1. Read-after-write consistency (read-your-writes)**
- Anomaly: User submits data, immediately reads it back from a stale follower, sees the old value. Appears to the user as if their write was lost.
- When it matters: Any application where users write and immediately view their own data (profile updates, comment submission, form submission confirmation).
- Mitigations (apply one):
- Read the user's own data from the leader; read other users' data from followers. (Requires knowing which data the user can have modified.)
- For one minute after any write, route all that user's reads to the leader (or prevent reads from followers lagging > 1 minute).
- Track the timestamp of each user's most recent write as a logical timestamp (log sequence number). On read, only serve from replicas that have applied up to that timestamp.
- For multi-device access (user writes on phone, reads on laptop): centralize the timestamp metadata and route all devices to the same datacenter.
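The timestamp-tracking mitigation can be sketched as a small routing function. This is a hedged illustration rather than a real driver API: `Replica`, `applied_lsn`, and `pick_replica_for_read` are hypothetical names, and the sketch assumes the application can learn each replica's applied log position and remembers the position of each user's last write.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    applied_lsn: int          # highest log position this node has applied
    is_leader: bool = False


def pick_replica_for_read(replicas: list, last_write_lsn: int) -> Replica:
    """Prefer a follower that has applied the user's last write; else use the leader."""
    for replica in replicas:
        if not replica.is_leader and replica.applied_lsn >= last_write_lsn:
            return replica
    # No follower is caught up far enough, so only the leader is safe to read from.
    return next(r for r in replicas if r.is_leader)


replicas = [
    Replica("leader", applied_lsn=120, is_leader=True),
    Replica("follower-1", applied_lsn=100),
    Replica("follower-2", applied_lsn=118),
]

print(pick_replica_for_read(replicas, last_write_lsn=110).name)  # follower-2
print(pick_replica_for_read(replicas, last_write_lsn=119).name)  # leader
```

Users who have not written recently carry a low `last_write_lsn`, so their reads spread across followers and the leader only absorbs reads that genuinely need it.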
**2. Monotonic reads**
- Anomaly: User makes two successive reads; the second read returns data older than the first (because it hit a more-lagged replica). Data appears to move backward in time.
- When it matters: Any application where users make multiple reads of the same data in a session (social feeds, dashboards, chat messages).
- Mitigation: Route each user's reads to the same replica (e.g., hash the user ID to a replica). If that replica fails, reroute — but accept that the user may briefly see data go backward.
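One way to pin a user to a replica is a stable hash of the user ID. The sketch below is an assumption-laden illustration: the replica names and the `replica_for_user` helper are hypothetical, and it uses `hashlib` rather than Python's built-in `hash()` because the latter is randomized per process for strings.

```python
import hashlib


def replica_for_user(user_id: str, replicas: list, healthy: set) -> str:
    """Map a user to a stable replica; walk to the next one if it is unhealthy."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(replicas)
    for offset in range(len(replicas)):
        candidate = replicas[(start + offset) % len(replicas)]
        if candidate in healthy:
            return candidate
    raise RuntimeError("no healthy replica available")


replicas = ["replica-0", "replica-1", "replica-2"]
preferred = replica_for_user("user-42", replicas, healthy=set(replicas))

# The same user always lands on the same replica while it is healthy.
assert preferred == replica_for_user("user-42", replicas, healthy=set(replicas))

# If that replica fails, the user is rerouted and may briefly see older data.
fallback = replica_for_user("user-42", replicas, healthy=set(replicas) - {preferred})
print(preferred, fallback)
```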
**3. Consistent prefix reads**
- Anomaly: Observer reads writes in a different order than they were written, violating causality. An answer is visible before the question that prompted it.
- When it matters: Any application where the order of writes is semantically meaningful (conversation threads, audit logs, event sourcing, financial transaction history).
- Mitigation: Ensure causally related writes go to the same partition (so they are applied in the same order at all replicas). Alternatively, track causal dependencies with version vectors.
**Consistency guarantee availability by topology:**
| Guarantee | Single-leader | Multi-leader | Leaderless |
|-----------|-------------|-------------|-----------|
| Read-after-write | Achievable (read from leader; timestamp-based routing) | Complex (requires routing to the same leader that processed the write) | Not guaranteed by quorum alone; application must implement timestamp tracking |
| Monotonic reads | Achievable (sticky-session replica routing) | Complex (requires pinning each user's reads to the same datacenter) | Not guaranteed by quorum; implement user-to-replica affinity |
| Consistent prefix reads | Natural for single-partition data; requires causal tracking for partitioned data | Very hard (cross-datacenter write ordering is not defined) | Not guaranteed; requires version vectors and causal tracking |
**WHY leaderless quorums do not automatically provide session guarantees:** Even with w + r > n, the overlap node is not guaranteed to be the node you actually read from — quorum reads are sent to r nodes, and the freshest value is returned, but if two concurrent writes happened, the "freshest" determination uses version numbers or timestamps that can be unreliable (especially with clock skew). The session guarantees (read-after-write, monotonic reads, consistent prefix) require the application or database to track causal dependencies explicitly — a capability that is present in single-leader systems by construction (because the leader serializes all writes) but must be built explicitly in leaderless systems.
---
### Step 6: Select a Conflict Resolution Strategy (Multi-leader or Leaderless)
**ACTION:** If the topology is multi-leader or leaderless, select a conflict resolution strategy for concurrent writes to the same key.
**WHY:** In multi-leader and leaderless systems, two nodes can concurrently accept writes to the same key. When replication converges, the system must decide what the final value of that key should be. There is no "correct" answer — the right resolution depends on the application semantics. Making this choice explicit now prevents silent data loss from the default strategy (usually last-write-wins, which discards writes) surprising the team later.
**Conflict detection:** Two writes are concurrent if neither operation knew about the other when it was sent. Version vectors track which version each write was based on. If client A's write is based on version v1 and client B's write is also based on v1 (not knowing about A's write), both writes are concurrent — they must be resolved. If B's write is based on v2 (which incorporated A's write), B's write happens-after A's, and the later write wins without conflict.
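The happens-after versus concurrent distinction can be expressed as a comparison of version vectors. This is a minimal sketch assuming each vector is a plain dict mapping a replica name to a counter; the `compare` function and the replica names `r1` and `r2` are illustrative.

```python
def compare(a: dict, b: dict) -> str:
    """Classify two version vectors as equal, ordered, or concurrent."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(node, 0) > b.get(node, 0) for node in nodes)
    b_ahead = any(b.get(node, 0) > a.get(node, 0) for node in nodes)
    if a_ahead and b_ahead:
        return "concurrent"      # neither write knew about the other
    if a_ahead:
        return "a_after_b"
    if b_ahead:
        return "b_after_a"
    return "equal"


v1 = {"r1": 1}                       # the version both clients started from
write_a = {"r1": 2}                  # client A writes via replica r1, based on v1
write_b = {"r1": 1, "r2": 1}         # client B writes via replica r2, also based on v1
write_b_late = {"r1": 2, "r2": 1}    # B writes after having seen A's write

print(compare(write_a, write_b))       # concurrent
print(compare(write_b_late, write_a))  # a_after_b (B's write supersedes A's)
```

Only the `concurrent` case needs one of the resolution strategies below; the ordered cases can simply keep the later write.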
**Resolution strategies:**
| Strategy | How it works | Data loss? | When to use |
|----------|-------------|-----------|------------|
| **Last write wins (LWW)** | Each write is timestamped; the write with the highest timestamp wins. Other writes are silently discarded. | Yes: concurrent writes are lost | Only when data loss is acceptable (caches, analytics events, idempotent operations). The only conflict resolution method Cassandra supports; optional in Riak. |
| **Conflict avoidance** | Route all writes for a given record through the same leader. Conflicts cannot occur if one leader "owns" each record. | No | When write locality is predictable (user X always writes to datacenter A). Breaks down if user location changes. |
| **Merge / union** | Merge all concurrent versions into a combined value. For sets/lists: take the union. | No (but requires tombstones for deletes) | Collaborative data structures (shopping carts, sets, counters). Riak siblings; CRDTs. |
| **Application-level resolution on read** | Store all conflicting versions; return them all to the application on the next read. Application code resolves and writes back the merged value. | No | When only the application has the semantic context to resolve (CouchDB model). Requires application code changes. |
| **Application-level resolution on write** | Conflict handler executes immediately when conflict is detected in the replication log. Must execute quickly (no user prompts). | Depends on handler | When automated resolution is possible (Bucardo for PostgreSQL). |
**WHY LWW is dangerous and widely misused:** LWW requires a reliable total ordering of writes. In distributed systems, clocks cannot be trusted to provide this — two nodes' system clocks can differ by milliseconds or more (see clock skew), meaning a write with an earlier local timestamp may have actually occurred later in causal time. LWW silently discards writes that are reported as successful to the client. The only safe use of LWW is when each key is written exactly once and then treated as immutable — for example, using a UUID as the key so concurrent updates cannot happen.
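The clock-skew failure can be reproduced in a few lines. This sketch is illustrative only: the `Write` type and its `causal_order` field are hypothetical and exist purely so the example can show what the database cannot see, namely the true order of events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    wall_clock_ms: int   # timestamp taken from the writing node's local clock
    causal_order: int    # true order of events; invisible to the database


def lww_merge(a: Write, b: Write) -> Write:
    """Keep the write with the higher timestamp and silently drop the other."""
    return a if a.wall_clock_ms >= b.wall_clock_ms else b


# The first write is stamped by a node whose clock runs ahead.
first = Write("x=1", wall_clock_ms=1_000_050, causal_order=1)
# The second write happens later in reality, on a node with an accurate clock.
second = Write("x=2", wall_clock_ms=1_000_010, causal_order=2)

winner = lww_merge(first, second)
print(winner.value)  # x=1: the causally later write was discarded
```

Both clients were told their write succeeded, yet the surviving value is the older one.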
**WHY conflict avoidance is often the best strategy:** Most multi-leader deployments can be designed so that a given record has a "home" leader — the user's nearest datacenter, for example. If routing ensures that user X always writes to leader A, concurrent writes to the same record from two leaders cannot happen. Conflict avoidance is effectively making multi-leader behave like single-leader on a per-record basis. The limitation is that if the home leader changes (datacenter failure, user relocation), conflicts become possible again.
---
### Step 7: Produce the Recommendation
**ACTION:** Write a structured recommendation document covering all decisions made in Steps 1–6.
**WHY:** A concrete documented recommendation enables team alignment, prevents relitigating the decision, and creates an artifact for the architecture decision record. The "What Can Go Wrong" section is essential — it prevents future surprise when production behavior diverges from the expected replication model.
**Output format:**
Write the following to a file named `replication-strategy-recommendation.md` in the project root (or in `docs/architecture/` if that directory exists):
```markdown
## Replication Strategy Recommendation
### System Profile
[One-line profile from Step 1]
### Recommendation
**Topology:** [Single-leader | Multi-leader | Leaderless]
**Sync Mode:** [Synchronous | Semi-synchronous | Asynchronous | N/A (leaderless)]
**Quorum (if leaderless):** n=[x], w=[x], r=[x] (w+r=[x] > n=[x] ✓)
**Sloppy Quorum:** [Enabled | Disabled | Not applicable]
### Primary Database Products
| Role | Product | Configuration |
|------|---------|--------------|
| [Leader / All replicas] | [PostgreSQL / Cassandra / etc.] | [key config settings] |
| [Follower / Replica] | [as above] | [sync mode setting] |
### Consistency Guarantees Provided
| Guarantee | Provided? | Implementation |
|-----------|----------|----------------|
| Read-after-write | [Yes / No / Application-layer] | [Technique from Step 5] |
| Monotonic reads | [Yes / No / Application-layer] | [Technique from Step 5] |
| Consistent prefix reads | [Yes / No / Application-layer] | [Technique from Step 5] |
### Conflict Resolution (if multi-leader or leaderless)
**Strategy:** [LWW | Conflict avoidance | Merge | Application-level]
**Rationale:** [Why this strategy fits the application semantics]
### Why [Topology]
[2-3 sentences connecting the system profile to the topology choice]
### What We're Giving Up
[1-2 sentences on the primary trade-off and how to mitigate it]
### Disqualifiers Applied
[If any topology was disqualified, state the condition and why]
### What Can Go Wrong
[See "What Can Go Wrong" section below — include the relevant items for this topology]
```
---
## What Can Go Wrong
**Failover data loss (single-leader, async replication).** When an asynchronous leader fails, the new leader is the replica with the most up-to-date data — but "most up-to-date" does not mean "fully up-to-date." Any writes the old leader processed but had not yet replicated are lost when the new leader takes over, even though the client received an acknowledgment. This is the fundamental cost of asynchronous replication. If the old leader later comes back online, it may still have those writes — the system must ensure it discards them and follows the new leader, or data inconsistency results permanently.
**Split brain (single-leader and multi-leader).** In certain network partition scenarios, two nodes can each believe they are the leader. If both accept writes and there is no mechanism to resolve conflicts, data diverges permanently. The standard defense is fencing: when the system detects two leaders, it shuts one of them down. The operations literature calls this STONITH ("shoot the other node in the head"). If the fencing mechanism is not carefully designed, both nodes can end up shut down, leaving the cluster with no leader at all.
**Replication lag spikes during leader failure or high load.** Asynchronous followers can fall behind by seconds or minutes when the system is under load or recovering from a failure. If the application reads from followers assuming the lag is negligible, users see stale data. The lag can increase to several minutes during follower recovery, maintenance, or network problems. Monitor the replication lag metric on every follower (PostgreSQL: `pg_replication_slots`, `pg_stat_replication`; MySQL: `Seconds_Behind_Master`). Alert when lag exceeds your acceptable staleness threshold.
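A lag check can be sketched on top of PostgreSQL's textual LSN format, which is two hexadecimal numbers separated by a slash (high 32 bits, then low 32 bits). The helper names and the threshold below are assumptions for illustration; fetching the LSN values from the database is left out.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replay_lag_bytes(leader_lsn: str, replica_replay_lsn: str) -> int:
    """Bytes of WAL the replica has not yet replayed (never negative)."""
    return max(0, lsn_to_int(leader_lsn) - lsn_to_int(replica_replay_lsn))


def lagging_replicas(leader_lsn: str, replay_lsns: dict, max_lag_bytes: int) -> list:
    """Names of replicas whose replay position trails the leader by too much."""
    return sorted(
        name
        for name, lsn in replay_lsns.items()
        if replay_lag_bytes(leader_lsn, lsn) > max_lag_bytes
    )


# Hypothetical sample values, as they might be read from a monitoring query.
replay_lsns = {"follower-1": "0/3000000", "follower-2": "0/2000000"}
print(lagging_replicas("0/3000060", replay_lsns, max_lag_bytes=1024))
```

Byte lag complements time-based lag: a follower can be only seconds behind in time while still missing a large burst of writes.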
**Timeout misconfiguration in failover.** Automatic leader election uses a timeout to detect leader failure. If the timeout is too short, a temporarily slow network causes an unnecessary failover — the old leader is still alive but is now competing with the new leader (split brain risk). If the timeout is too long, the cluster is unavailable for writes for longer than necessary after a real failure. There is no universally correct timeout. Start with 30 seconds and calibrate based on observed network behavior. High-load systems should use longer timeouts because response times are naturally higher.
**Old leader rejoining after failover.** When the old leader comes back online after a failover, it may still believe it is the leader — particularly if the old leader was partitioned (slow, not dead). It will try to accept writes. The system must force the old leader to recognize the new leader and demote itself to follower. Without this mechanism, writes go to two nodes simultaneously. This is a known problem in many open-source databases and is a primary reason some operations teams prefer manual failovers even when software supports automatic failover.
**LWW silently discards acknowledged writes.** Last-write-wins is the only conflict resolution method Cassandra supports and an optional one in Riak. If two clients write to the same key concurrently, LWW picks one value and discards the other, even though both clients received a success acknowledgment. The client whose write was discarded has no way to know this happened. For any data where losing a write is unacceptable (financial records, reservations, inventory counts), LWW is a correctness bug, not just a performance trade-off. Use conflict avoidance, application-level resolution, or CRDTs instead.
**Quorum does not guarantee strong consistency.** Even with w + r > n, stale reads can occur in edge cases: (a) if two writes happen concurrently, the ordering between them is not defined and LWW may pick the wrong winner; (b) if a sloppy quorum is enabled, the write nodes and read nodes may not overlap; (c) if a write succeeds on fewer than w replicas, it is reported to the client as failed but is not rolled back on the replicas that applied it, so subsequent reads may or may not return that "failed" write's value. Quorums make stale reads unlikely; they do not provide linearizability. For linearizability, use a system built on a consensus protocol such as Raft or Paxos (for example etcd, ZooKeeper, or CockroachDB).
**Multi-leader conflict resolution logic is hard to get right.** Amazon is a frequently cited example: for some time, the conflict resolution handler for shopping carts in Dynamo (Amazon's internal key-value store, not the DynamoDB service) preserved items added to the cart but not items removed, so removed items would reappear after a conflict merge. Bugs like this are easy to miss because conflicts are infrequent and ordinary tests rarely exercise concurrent writes. Conflict resolution code must be explicitly tested with concurrent write scenarios; it cannot be validated by inspecting the code alone. Build a test harness that simulates concurrent writes to the same key before deploying multi-leader.
---
## Key Principles
**Default to single-leader unless there is a concrete requirement for multi-leader or leaderless.** Single-leader replication is the simplest topology — it provides total write ordering, makes consistency guarantees achievable, and is well-supported by every major database. Multi-leader and leaderless replication solve real problems (multi-datacenter write latency, offline operation, high write availability), but they introduce conflict risk and operational complexity that is routinely underestimated. The right default is single-leader with semi-synchronous replication.
**Replication mode is a durability commitment, not a performance setting.** Choosing asynchronous replication to improve write latency means accepting that some acknowledged writes will be lost if the leader fails before replication completes. This is a business and correctness decision, not purely a performance optimization. Make this trade-off explicitly with the team, document it, and build monitoring that alerts when replication lag reaches a level where data loss would be significant.
**The replication lag anomalies (read-after-write, monotonic reads, consistent prefix) are application bugs, not database bugs.** When an application reads from an asynchronous follower and returns stale data to the user, the database is behaving correctly: it told you it uses eventual consistency. The application is the layer responsible for implementing the session guarantees that mask this behavior. Either implement the mitigation techniques in Step 5 or use a configuration that provides these guarantees at the database layer (for example, PostgreSQL with `synchronous_commit = remote_apply` on synchronous standbys, or routing a session's reads to the leader).
**Quorum configuration is not set-and-forget; it must be re-evaluated as load and node count change.** A system that starts with n=3, w=2, r=2 and later grows to n=5 without changing w and r ends up with w + r = 4, which no longer exceeds n, so the overlap guarantee is silently lost. Raising w and r further than needed has the opposite cost: lower availability. Revisit quorum configuration whenever the replication factor changes or when the availability vs. consistency balance of the system changes.
**Conflict avoidance is almost always better than conflict resolution.** Every conflict resolution strategy (LWW, merge, application-level) is either lossy, complex, or application-specific. The cleanest approach is to avoid conflicts entirely by routing all writes for a given record through a single leader. If the system allows it, design the data model and routing rules so that conflicts cannot happen — then multi-leader or leaderless replication behaves like single-leader on a per-record basis, with the added benefit of datacenter-local write acceptance for other records.
---
## Examples
**Scenario: Global e-commerce platform with regional datacenters**
Trigger: "We have users in the US, EU, and Asia. Right now all writes go to our US leader. EU and Asia users complain about high write latency. We're considering multi-leader replication."
Process:
- Step 1: Multi-datacenter, write-latency-critical, low-conflict (each user writes to their own orders/cart)
- Step 2: Single-leader disqualified (cross-datacenter write latency). Multi-leader selected. Low conflict probability → conflict avoidance feasible.
- Step 3: Semi-synchronous within each datacenter; asynchronous between datacenters (inter-DC link is less reliable).
- Step 5: Read-after-write: route reads to the user's home datacenter's leader (same datacenter that accepted the write). Monotonic reads: user session pinned to home datacenter.
- Step 6: Conflict avoidance — each user's records are owned by their home datacenter. If home datacenter changes, implement a brief lock during migration.
Output: Multi-leader with one leader per datacenter, semi-sync within datacenter, async cross-datacenter, conflict avoidance via user-to-datacenter affinity. Read-after-write provided by datacenter-local routing.
**Scenario: Real-time leaderboard for a mobile game**
Trigger: "We need a leaderboard that survives node failures without going down. Players' scores update constantly. Occasional stale reads are acceptable — seeing a score a few seconds old is fine."
Process:
- Step 1: Availability-priority, single-datacenter, high-write, stale-reads-acceptable
- Step 2: Single-leader would work but creates write bottleneck for high-frequency score updates. Leaderless selected — high write throughput with tunable availability.
- Step 4: n=3, w=2, r=1. w+r=3, which is not > n=3. This is intentionally a non-overlapping (partial) quorum, not a sloppy quorum: reads stay fast and available at the cost of occasional staleness. For scores where exact order matters, increase r to 2.
- Step 5: Consistent prefix not required (leaderboard order is eventually consistent by design). Read-after-write not required (seeing a slightly old score is acceptable).
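- Step 6: LWW acceptable, provided each write stores the player's absolute score rather than an increment. Scores only go up, so a write that loses a conflict is corrected by the player's next update. If score writes were increments, LWW would silently drop concurrent increments, and a counter CRDT would be the safer choice. Use version vectors to detect whether an older value would overwrite a newer one.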
Output: Leaderless (Cassandra), n=3, w=2, r=1. Sloppy quorum enabled for maximum availability. LWW with version tracking. Monitoring alert if replication lag exceeds 5 seconds.
**Scenario: Financial transaction ledger**
Trigger: "We're building an account balance system. Every debit and credit must be consistent. We cannot lose writes. We're using PostgreSQL and considering adding read replicas for reporting."
Process:
- Step 1: Consistency-priority, single-datacenter, write-conflict-unacceptable (double-debit catastrophic)
- Step 2: Multi-leader disqualified (write conflicts are unacceptable; two DCs could both debit the same account). Leaderless disqualified (no strong ordering; quorum does not prevent double-debit). Single-leader selected.
- Step 3: Semi-synchronous. At least one standby must acknowledge before the leader confirms. Fully async is disqualified — data loss on leader failure is not acceptable for financial records.
- Step 5: Read-after-write: all balance reads for the user's own account come from the leader. Reporting queries (dashboards, audit logs) can use async followers with acceptable staleness.
- Step 6: N/A — single-leader, no conflicts possible.
Output: PostgreSQL single-leader, `synchronous_commit = remote_write` with one synchronous standby (semi-synchronous; use `on` if the standby must flush the commit to disk before it is acknowledged), multiple async replicas for reporting. All balance reads routed to leader. Replication lag monitoring alert at 10 seconds.
---
## References
For topology-specific deep dives, failure mode details, and quorum calculators, see the references directory:
- For quorum parameter calculations and availability modeling, see [quorum-calculator.md](references/quorum-calculator.md)
- For replication log format comparison (WAL shipping vs. logical/row-based vs. statement-based), see [replication-log-formats.md](references/replication-log-formats.md)
- For conflict resolution strategy comparison and implementation patterns, see [conflict-resolution-strategies.md](references/conflict-resolution-strategies.md)
- For failover procedure and the 3-step leader election process, see [failover-playbook.md](references/failover-playbook.md)
- For replication lag monitoring metrics by database (PostgreSQL, MySQL, Cassandra), see [replication-lag-monitoring.md](references/replication-lag-monitoring.md)
- Cross-reference: `replication-failure-analyzer` — for diagnosing active replication anomalies in production
- Cross-reference: `consistency-model-selector` — for selecting the right consistency model before this skill
- Cross-reference: `partitioning-strategy-advisor` — for sharding strategy after replication topology is decided
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Designing Data-Intensive Applications by Martin Kleppmann.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-consistency-model-selector`
- `clawhub install bookforge-replication-failure-analyzer`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/conflict-resolution-strategies.md
# Conflict Resolution Strategies
Use this reference when selecting a conflict resolution strategy for multi-leader or leaderless replication. A write conflict occurs when two nodes independently accept writes to the same key without either knowing about the other's write. All replicas must eventually converge to the same value — conflict resolution defines how that convergence happens.
## What Makes a Conflict
Two writes are **concurrent** if neither operation was aware of the other when it was submitted. Version vectors (or vector clocks) track which version each write was based on:
- If client B's write is based on version v2 (which included client A's write), B's write happens-after A's — no conflict.
- If both A and B wrote based on version v1 (neither knew about the other), both writes are concurrent — conflict.
Two writes are **not concurrent** (and therefore not a conflict) if one causally depends on the other.
## Strategy 1: Last Write Wins (LWW)
**How it works:** Attach a timestamp to every write. When two concurrent writes are detected, the write with the higher timestamp "wins" — the other is silently discarded.
**Data loss:** Yes. The losing write is dropped, even though the client received a success acknowledgment.
**When to use:**
- Data loss is acceptable (caches, idempotent operations, analytics events that can be replayed)
- Each key is written at most once and then immutable (use UUID as key to prevent reuse)
- You need the simplest possible convergence with no application code
**When NOT to use:**
- Financial records, inventory counts, reservation systems — any write that cannot be lost
- Counters or aggregates (two increments become one)
- Any write where the client expects their acknowledgment means the write is permanent
**The clock skew problem:** LWW relies on timestamps to pick the "most recent" write. In distributed systems, node clocks cannot be perfectly synchronized. A write with a "later" local timestamp may have actually been submitted before a write with an "earlier" timestamp on another node (due to clock drift, NTP adjustments, or clock skew). LWW will silently pick the wrong winner in these cases.
**Safe use of LWW:** The only safe pattern is to give every write a globally unique key (such as a UUID), so that two writes to the same key cannot happen and concurrent updates are ruled out by construction. This is a recommended way of using Cassandra.
**Implementation:**
- Cassandra: LWW is the only supported conflict resolution method. Each write carries a timestamp, typically assigned by the client driver or the coordinator node.
- Riak: LWW is optional; version vectors are the default.
- DynamoDB: Conditional writes (compare-and-set) prevent LWW data loss for critical updates.
---
## Strategy 2: Conflict Avoidance
**How it works:** Design the system so that all writes for a given record always go through the same leader. If one leader "owns" each record, concurrent writes to the same record from two leaders cannot happen — there is no conflict to resolve.
**Data loss:** No.
**When to use:**
- Each record has a natural "home" (user's home datacenter, record owner, geographic affinity)
- Write locality is predictable (a given user always writes from the same region)
- The system can tolerate routing all writes for a record to a single datacenter
**Limitation:** Conflict avoidance breaks down if the home leader changes. If a user moves to a different region, or if the home datacenter fails and traffic is rerouted, writes may go to a different leader than expected — creating conflicts. Implementing a clean leader-change handoff (lock the record, drain writes, switch leader, unlock) is non-trivial.
**Implementation:** Application-level routing. Configure your load balancer or service layer to send writes for a given key (e.g., user ID, account ID) to the datacenter that "owns" that key. Can be implemented with consistent hashing or explicit key-to-datacenter mappings.
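The routing layer can be sketched as a key-to-owner lookup. Everything named here is hypothetical: the datacenter names, the `HOME_OVERRIDES` table, and the helper functions. The fallback uses a plain hash-modulo for brevity, which is not consistent hashing and would reshuffle ownership if the datacenter list changed.

```python
import hashlib

DATACENTERS = ["us-east", "eu-west", "ap-south"]   # hypothetical datacenter names
HOME_OVERRIDES = {"user-1001": "eu-west"}          # explicit key-to-datacenter mapping


def home_datacenter(record_key: str) -> str:
    """Return the single datacenter that owns all writes for this record."""
    if record_key in HOME_OVERRIDES:
        return HOME_OVERRIDES[record_key]
    digest = hashlib.sha256(record_key.encode("utf-8")).digest()
    return DATACENTERS[int.from_bytes(digest[:8], "big") % len(DATACENTERS)]


def route_write(record_key: str, local_dc: str) -> str:
    """Accept the write locally only if this datacenter owns the record."""
    owner = home_datacenter(record_key)
    return "accept_locally" if owner == local_dc else f"forward_to:{owner}"


print(route_write("user-1001", local_dc="eu-west"))   # accept_locally
print(route_write("user-1001", local_dc="us-east"))   # forward_to:eu-west
```

Because every datacenter computes the same owner for a key, two leaders never accept writes for the same record as long as the mapping is stable.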
---
## Strategy 3: Merge / Union
**How it works:** When concurrent writes produce multiple versions of the same key, keep all versions and merge them into a single value. For sets and lists, merge = union.
**Data loss:** No — but merge may produce a result that is not exactly what either client intended.
**When to use:**
- The data is a collection (set, list, map) where union is semantically correct
- Collaborative data structures (shopping carts, task lists, tagged content)
- CRDTs (Conflict-free Replicated Data Types) — data structures designed for automatic merge
**Limitation:** Merge only works for additive operations. Deletions are problematic: if client A deletes an item and client B concurrently adds a different item, B's version still contains the item A deleted, so the union brings the deleted item back. The solution is tombstones: instead of dropping the item, a delete leaves a special marker indicating "this item was deleted." The merge algorithm must treat tombstones correctly.
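The deletion problem and the tombstone fix can be shown side by side. This is a simplified sketch in the style of a two-phase set, where a removed item can never be re-added; `CartVersion` and the cart contents are hypothetical, and real systems use richer schemes that allow re-adds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CartVersion:
    added: frozenset = frozenset()
    removed: frozenset = frozenset()   # tombstones: items explicitly deleted

    def visible(self) -> frozenset:
        return self.added - self.removed


def naive_union(a: CartVersion, b: CartVersion) -> frozenset:
    """Union of visible items only; deletions are forgotten."""
    return a.visible() | b.visible()


def merge(a: CartVersion, b: CartVersion) -> CartVersion:
    """Union both the additions and the tombstones, so deletions survive."""
    return CartVersion(a.added | b.added, a.removed | b.removed)


base = CartVersion(added=frozenset({"milk", "eggs"}))
client_a = CartVersion(base.added, base.removed | {"eggs"})    # A removes eggs
client_b = CartVersion(base.added | {"flour"}, base.removed)   # B adds flour concurrently

print(sorted(naive_union(client_a, client_b)))        # eggs reappear
print(sorted(merge(client_a, client_b).visible()))    # eggs stay deleted
```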
**CRDTs (Conflict-free Replicated Data Types):**
CRDTs are data structures designed so that concurrent modifications can always be merged automatically without conflict. Examples:
- **G-counter (grow-only counter):** Each node has its own counter; the global count is the sum of all per-node counts. Concurrent increments never conflict.
- **OR-Set (observed-remove set):** Adds and removes are tracked with unique identifiers; concurrent add+remove is resolved by keeping the add (with a tombstone for the remove).
- **LWW-register:** A single-value register that resolves concurrent writes with LWW. Replicas converge, but all except one of the concurrent writes are discarded, so it is safe only when losing a concurrent write is acceptable.
Implementations: Riak 2.0 (G-counters, sets, maps, registers, flags), Redis Enterprise (Active-Active databases), Akka Distributed Data (various CRDTs).
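A G-counter is small enough to sketch in full. The class below is an illustrative implementation under the usual assumptions (each node increments only its own slot, and merge takes the per-node maximum); it is not taken from Riak or any other library.

```python
class GCounter:
    """Grow-only counter: one slot per node, merged by per-node maximum."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Taking the maximum makes merge commutative, associative, and idempotent.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


node_a, node_b = GCounter("a"), GCounter("b")
node_a.increment()
node_a.increment()
node_b.increment()          # concurrent with node_a's increments

node_a.merge(node_b)
node_b.merge(node_a)
node_a.merge(node_b)        # merging again changes nothing
print(node_a.value(), node_b.value())  # 3 3
```

Contrast this with LWW on a plain integer, where one of the concurrent increments would have been lost.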
---
## Strategy 4: Application-Level Resolution on Read
**How it works:** When concurrent versions of a key exist, all versions are stored ("siblings" in Riak terminology). On the next read, the database returns all conflicting versions to the application. The application code resolves the conflict and writes back the merged value.
**Data loss:** No — but requires application code to handle multi-version responses correctly.
**When to use:**
- Only the application has the semantic context to merge two concurrent values correctly
- Conflict frequency is low (application complexity of handling multi-version responses is justified)
- The application already handles eventual consistency gracefully
**Limitation:** Every read code path must handle the case of multiple returned values. This is easy to get wrong — a naive implementation that picks the first value, or crashes on multi-value responses, will silently corrupt data.
**Implementation:** CouchDB uses this model. A read returns all conflicting revisions; the application picks or merges them and writes a new revision that resolves the conflict.
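A read path that handles siblings might look like the following sketch. It assumes a shopping-cart value where union is the right merge, and it uses an in-memory dict in place of a real database client; `store` and `read_resolved` are hypothetical names.

```python
# Hypothetical in-memory stand-in for a store that keeps siblings per key.
store = {"cart:42": [["milk", "eggs"], ["milk", "flour"]]}   # two siblings


def read_resolved(key):
    """Return one value for key, merging siblings and writing the result back."""
    siblings = store.get(key, [])
    if not siblings:
        return None
    if len(siblings) == 1:
        return siblings[0]
    # Application-specific merge: for a cart, keep every item from every sibling.
    merged = sorted(set().union(*siblings))
    store[key] = [merged]          # write back a single resolved version
    return merged


cart = read_resolved("cart:42")
```

The important property is that every branch (zero, one, or many versions) is handled explicitly rather than indexing the first element.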
---
## Strategy 5: Application-Level Resolution on Write
**How it works:** A conflict handler is registered with the database. As soon as the database detects a conflict in the log of replicated changes, it calls the handler. The handler must run quickly (it executes in the background replication process) and cannot prompt users.
**Data loss:** Depends on handler implementation.
**When to use:**
- Automated conflict resolution is possible without user input
- The conflict resolution logic is deterministic and fast
- The team can implement and test the handler thoroughly
**Limitation:** The handler must not block the replication process. It has no access to the user and cannot make external API calls. Bugs in the handler are difficult to detect because conflicts are infrequent — the handler may go untested in production for months.
**Implementation:** Bucardo (PostgreSQL), Oracle GoldenGate (custom PL/SQL handlers).
---
## Strategy Comparison Matrix
| Strategy | Data loss | Application code changes required | Conflict frequency sensitivity | Best use case |
|----------|----------|----------------------------------|-------------------------------|--------------|
| Last write wins | Yes | No | Low (conflicts infrequent) | Caches, idempotent ops |
| Conflict avoidance | No | Yes (routing) | N/A (no conflicts) | User-owned data with stable home |
| Merge / CRDTs | No | Sometimes (tombstones) | High (designed for concurrent writes) | Collections, collaborative data |
| App-level on read | No | Yes (multi-version handling) | Low-medium | Semantic conflicts only app can resolve |
| App-level on write | Depends | Yes (handler code) | Low-medium | Automated resolution possible |
## Recommendation Decision Tree
```
Can all writes for a given record always route through one leader?
YES → Conflict avoidance (simplest; no data loss; needs write routing but no merge code)
NO ↓
Is the data a collection (set, list, counter)?
YES → CRDTs / merge (Riak, or implement a CRDT data structure)
NO ↓
Can the conflict be resolved automatically without user input?
YES AND resolution is fast → App-level on write (conflict handler)
YES AND resolution needs read context → App-level on read
NO ↓
Is data loss acceptable (cache, analytics, idempotent)?
YES → Last write wins (with UUID keys if possible)
NO → Do not use multi-leader or leaderless; use single-leader instead
```
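The same tree can be written as a function, which is convenient for reviews or tests. This is a sketch: the parameter names are invented here, and it treats "resolution is fast" as the complement of "resolution needs read context", which the tree implies but does not state.

```python
def recommend_strategy(single_leader_per_record, is_collection,
                       auto_resolvable, needs_read_context, data_loss_ok):
    """Walk the decision tree top to bottom and return the first match."""
    if single_leader_per_record:
        return "conflict avoidance"
    if is_collection:
        return "merge / CRDTs"
    if auto_resolvable:
        return "app-level on read" if needs_read_context else "app-level on write"
    if data_loss_ok:
        return "last write wins"
    return "use single-leader replication"
```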
FILE:references/failover-playbook.md
# Failover Playbook
Use this reference when planning or executing a leader failover in a single-leader or multi-leader replication setup. Failover is the process of promoting a follower to become the new leader after the current leader fails. It is the most operationally risky event in a leader-based replication system.
## The Three-Step Failover Process
### Step 1: Determine that the leader has failed
There is no foolproof way to detect leader failure. The standard approach is a timeout:
- If the leader does not respond within a configured period (commonly 30 seconds), assume it is dead and initiate failover.
**The timeout calibration problem:**
- Too short (< 10 seconds): A temporary load spike or network hiccup causes false positives. The system performs an unnecessary failover while the leader is still alive — creating split-brain risk.
- Too long (> 60 seconds): The cluster is unavailable for writes for longer than necessary after a real failure.
- Starting point: 30 seconds. Adjust upward if your system operates under sustained high load (response times naturally higher). Adjust downward only after observing false-positive rate in production.
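The timeout check itself is simple; the hard part is picking the number. A minimal sketch, assuming heartbeats are recorded with timestamps in seconds (the `LeaderMonitor` class is hypothetical, and timestamps are passed in explicitly so the logic is easy to test):

```python
class LeaderMonitor:
    """Suspects the leader once no heartbeat has arrived within the timeout."""

    def __init__(self, timeout_seconds=30.0):
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = None

    def record_heartbeat(self, now):
        self.last_heartbeat = now

    def leader_suspected_dead(self, now):
        if self.last_heartbeat is None:
            return False   # nothing observed yet, so nothing to compare against
        return now - self.last_heartbeat > self.timeout_seconds


monitor = LeaderMonitor(timeout_seconds=30.0)
monitor.record_heartbeat(now=100.0)
```

A slow but alive leader looks identical to a dead one here, which is exactly why a short timeout produces false positives.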
### Step 2: Choose a new leader
**By election:** The remaining replicas vote; the replica with the most up-to-date data is elected. This requires consensus (a notoriously hard distributed systems problem — see Raft, Paxos, Zab in Chapter 9 of DDIA).
**By appointment:** A previously elected controller node (e.g., ZooKeeper) appoints the new leader. The controller is itself a replicated consensus system.
**Heuristic for the best candidate:** The replica that is most up-to-date with the old leader (fewest unreplicated writes) should become the new leader, to minimize data loss.
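The candidate heuristic reduces to picking the replica with the highest applied log position. A sketch, assuming positions are comparable integers (for example byte offsets or sequence numbers); the function name and the name-based tie-break are choices made for this example:

```python
def choose_new_leader(replicas):
    """replicas maps replica name -> last applied log position.

    Returns the most up-to-date replica; ties are broken by name so that
    every node evaluating the same input picks the same candidate.
    """
    if not replicas:
        raise ValueError("no replicas available for promotion")
    return max(sorted(replicas), key=lambda name: replicas[name])


candidate = choose_new_leader({"replica-a": 1040, "replica-b": 1100, "replica-c": 1100})
```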
### Step 3: Reconfigure the system to use the new leader
- All clients must send write requests to the new leader.
- All followers must consume the replication log from the new leader, not the old one.
- If the old leader comes back online, it must recognize the new leader and demote itself to follower. Without this, it may still accept writes as leader — creating split brain.
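One common way to make a returning leader step down is to attach a monotonically increasing epoch (or term) to each leadership, as consensus protocols such as Raft do. The sketch below assumes that scheme; it is not prescribed by the steps above, and the `Node` class is hypothetical.

```python
class Node:
    def __init__(self, name, epoch=0, role="follower"):
        self.name = name
        self.epoch = epoch
        self.role = role
        self.leader = name if role == "leader" else None

    def observe_leader(self, leader_name, leader_epoch):
        """Handle a message from a node claiming leadership in leader_epoch."""
        if leader_epoch > self.epoch:
            # A newer leadership exists: adopt it and step down if necessary.
            self.epoch = leader_epoch
            self.leader = leader_name
            self.role = "leader" if leader_name == self.name else "follower"
        # Claims from an older or equal epoch are ignored.

    def accepts_writes(self):
        return self.role == "leader"


old_leader = Node("db1", epoch=1, role="leader")
old_leader.observe_leader("db2", leader_epoch=2)   # db1 rejoins, learns of db2
```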
## Automatic vs. Manual Failover
| | Automatic failover | Manual failover |
|---|---|---|
| Recovery time | Timeout plus election time (just over 30s with a 30s timeout) | Minutes to hours (human response time) |
| Risk | Split brain, premature election, stale leader confusion | Human error during high-stress incident |
| Recommended for | Systems with high availability SLA, well-tested failover path | Systems where the failover code is untested or where human judgment is critical |
**Recommendation:** Use automatic failover with a conservative timeout (30-60 seconds). Test the failover path regularly (chaos engineering — kill the leader in staging, observe behavior). Do not rely on automatic failover as a substitute for testing.
## Failover Pitfalls
### Pitfall 1: Data loss from async replication
When the old leader fails, the new leader is chosen as the most up-to-date replica. But "most up-to-date" does not mean "identical to the old leader." Any writes the old leader applied but had not yet replicated are lost.
**How much data can be lost?** Determined by the replication lag at the moment of failure. With semi-synchronous replication (one synchronous follower), the loss is bounded: the synchronous follower was always up-to-date, so it becomes the new leader with no data loss. With fully asynchronous replication, the lag can be seconds to minutes.
**Mitigation:** Use semi-synchronous replication so at least one follower is always current.
### Pitfall 2: The old leader's unreplicated writes cause conflict when it rejoins
When the old leader comes back online, it has writes that the new leader does not have. The most common solution is to discard those writes — the old leader becomes a follower and its unreplicated writes are thrown away.
**Why discarding is dangerous:** The client received a success acknowledgment for those writes. The database silently breaks the durability guarantee. If those writes were coordinated with other systems (e.g., a Redis counter was incremented based on the DB write), the external system is now inconsistent with the database.
**Example:** At GitHub in 2012, an out-of-date MySQL follower was promoted to leader during a failover. Primary keys were assigned by an auto-incrementing counter, and the new leader's counter lagged behind the old leader's, so it reused primary keys that the old leader had already assigned. Those keys were also referenced in a Redis store, so the reuse made MySQL and Redis inconsistent and caused some private data to be disclosed to the wrong users.
### Pitfall 3: Split brain — two nodes both believe they are the leader
In certain network partition scenarios, the old leader is partitioned from the cluster but not actually dead. The cluster elects a new leader. Now two nodes believe they are the leader and both accept writes.
**Defense — fencing (STONITH):** The system shuts down one node when it detects two leaders. "Shoot The Other Node In The Head" — the node that loses the election is forcibly terminated.
**The fencing catch:** If the fencing mechanism is improperly designed, it can shut down both nodes, turning a consistency problem into a complete write outage. Design fencing carefully and test it explicitly.
### Pitfall 4: Premature failover during load spikes
High system load causes response times to increase. If the leader is under sustained high load, it may not respond to heartbeats within the timeout — even though it is alive and processing writes. This triggers an unnecessary failover.
**Symptoms:** Failovers that happen during high-traffic periods (not hardware failures). Post-failover analysis shows the old leader was still running and had no hardware failure.
**Mitigation:** Increase the timeout during known high-load windows. Monitor the leader's response time trend — if p99 response times are approaching the failover timeout, alert before a failover is triggered. Separate heartbeat traffic from write traffic so write backpressure does not cause heartbeat failures.
## Failover Configuration by Database
### PostgreSQL
- **pg_auto_failover:** Automated failover with a monitor node. Monitor detects primary failure, promotes standby.
- **Patroni:** ZooKeeper or etcd-backed leader election. The current leader must hold a lock; if it cannot renew the lock, it demotes itself.
- **Repmgr:** Simpler failover automation; relies on monitoring script, not consensus.
- Key config: `synchronous_standby_names` (for semi-sync); `wal_level = replica` (for streaming replication).
### MySQL
- **Orchestrator (GitHub):** Detects leader failure, elects most up-to-date replica, reconfigures replication topology automatically.
- **MHA (MySQL High Availability):** Similar to Orchestrator; widely used.
- Key config: `rpl_semi_sync_master_enabled` (semi-synchronous; named `rpl_semi_sync_source_enabled` from MySQL 8.0.26); `GTID_MODE = ON` (for easier failover topology reconfiguration).
### Kafka
- Kafka uses an ISR (In-Sync Replicas) list managed by ZooKeeper (older) or KRaft (newer, built-in consensus). Leader election happens automatically among ISR members.
- Key config: `acks=all` (write must be acknowledged by all ISR replicas before success); `min.insync.replicas` (minimum ISR size for writes to be accepted).
## Post-Failover Checklist
After every failover (automatic or manual), verify:
- [ ] Old leader has fully demoted itself to follower (not accepting writes)
- [ ] All clients are writing to the new leader
- [ ] New leader's replication lag to its followers is decreasing (not growing)
- [ ] Any data loss is documented (how many writes were discarded? Were they coordinated with external systems?)
- [ ] External systems that relied on the old leader's data (caches, search indexes) are invalidated or refreshed
- [ ] The root cause of the failover is identified (hardware failure, network partition, timeout misconfiguration, load spike)
- [ ] The failover timeout setting is re-evaluated based on the root cause
FILE:references/quorum-calculator.md
# Quorum Calculator
Use this reference when configuring leaderless replication (Cassandra, Riak, Voldemort, DynamoDB) to determine the correct values of n, w, and r for your availability and consistency requirements.
## The Core Formula
```
w + r > n
```
- **n** = replication factor (total number of nodes that store each key)
- **w** = write quorum (number of nodes that must acknowledge a write for it to succeed)
- **r** = read quorum (number of nodes queried on each read; freshest value returned)
When w + r > n, the write set and read set must overlap by at least one node. That overlapping node has the most recent write. This guarantees the read returns the latest value — provided no concurrent writes are happening and no sloppy quorum is in use.
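The overlap claim can be checked by brute force for small clusters. This sketch enumerates every possible write set and read set and confirms they always share a node exactly when `w + r > n`; the function names are illustrative.

```python
from itertools import combinations


def is_strict_quorum(n, w, r):
    return w + r > n


def always_overlaps(n, w, r):
    """True if every w-node write set intersects every r-node read set."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )
```

For example, `always_overlaps(3, 2, 2)` holds, while `always_overlaps(3, 1, 2)` does not: the single written node can be the one the read skipped.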
## Worked Examples
### Standard balanced configuration (n=3, w=2, r=2)
```
w + r = 4 > n = 3 ✓
Fault tolerance: 1 node can be unavailable (writes need 2/3; reads need 2/3)
Staleness risk: Low — any read will hit at least 1 node with the latest write
Space cost: 3x data stored
```
Best for: General-purpose workloads where both availability and consistency matter. This is a common configuration for Cassandra deployments (replication factor 3 with `QUORUM` reads and writes).
### Write-heavy, reads can be stale (n=3, w=1, r=2)
```
w + r = 3 = n ✗ (quorum condition not strictly satisfied)
Fault tolerance: 2 nodes can be unavailable for writes; 1 for reads
Staleness risk: High — write is on 1 node; read may hit 2 nodes that don't have it
Space cost: 3x data stored
```
Best for: High-throughput event ingestion where eventual consistency is explicitly acceptable. Do not use if read-after-write is required.
### Read-fast, write-durable (n=5, w=3, r=1)
```
w + r = 4 < n = 5 ✗ (quorum condition not satisfied)
```
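This configuration does not satisfy the quorum condition, so reads can return stale values. Keeping r=1 with n=5 would require w=5. A balanced alternative that does satisfy the condition: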
```
n=5, w=3, r=3: w + r = 6 > 5 ✓
Fault tolerance: 2 nodes can fail
Read cost: 3 parallel reads required per operation
Write cost: 3 acknowledgments required
```
### High availability bias — sloppy quorum (n=3, w=1, r=1)
```
w + r = 2 < n = 3 ✗ (intentionally not a strict quorum)
```
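Used when write availability is the top priority. With w=1 and r=1 the write set and read set need not overlap, so reads may return stale values. In a sloppy quorum, writes are additionally accepted by any reachable node, including nodes outside the key's n home nodes. Enable hinted handoff to propagate those writes to the home nodes when the partition heals.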
**Do not use if:** data loss is unacceptable, read-after-write is required, or application logic requires seeing the latest value.
## Availability vs. Consistency Trade-off Table
| n | w | r | w+r>n | Node failures tolerated | Stale read risk |
|---|---|---|-------|------------------------|----------------|
| 3 | 3 | 1 | 4>3 ✓ | 0 writes, 2 reads | Very low |
| 3 | 2 | 2 | 4>3 ✓ | 1 | Low |
| 3 | 1 | 3 | 4>3 ✓ | 2 writes, 0 reads | Low |
| 3 | 2 | 1 | 3=3 ✗ | 1 writes, 2 reads | High |
| 3 | 1 | 2 | 3=3 ✗ | 2 writes, 1 reads | High |
| 5 | 3 | 3 | 6>5 ✓ | 2 | Low |
| 5 | 4 | 2 | 6>5 ✓ | 1 writes, 3 reads | Low |
| 5 | 2 | 4 | 6>5 ✓ | 3 writes, 1 reads | Low |
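
The "node failures tolerated" column follows directly from n, w, and r: writes survive `n - w` unavailable nodes and reads survive `n - r`. A small sketch that reproduces the rows above (the function name and dict keys are made up for this example):

```python
def failures_tolerated(n, w, r):
    """Return how many nodes may be down before writes or reads start failing."""
    return {
        "writes": n - w,
        "reads": n - r,
        "strict_quorum": w + r > n,
    }


for n, w, r in [(3, 3, 1), (3, 2, 2), (3, 2, 1), (5, 4, 2)]:
    print(n, w, r, failures_tolerated(n, w, r))
```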
## Cassandra Consistency Level Mapping
In Cassandra, you configure consistency levels per-query rather than global n/w/r. Cassandra translates consistency levels to the quorum formula automatically based on the replication factor.
| Cassandra CL | Meaning | Equivalent |
|-------------|---------|-----------|
| ONE | w=1 or r=1 | Any single replica |
| QUORUM | w=floor(n/2)+1 or r=floor(n/2)+1 | Majority of replicas |
| LOCAL_QUORUM | Majority within local datacenter only | Multi-DC: quorum per-DC |
| ALL | w=n or r=n | All replicas |
| ANY | w=1, accepts hinted handoff node | Maximum write availability |
| SERIAL | Lightweight transactions (Paxos) | Linearizable per-key |
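**For multi-datacenter Cassandra:** Use `LOCAL_QUORUM` for both reads and writes. Each operation waits only for a majority of replicas in the coordinator's local datacenter, which avoids cross-datacenter latency on every operation. The quorum overlap then holds within a datacenter; a read in one datacenter may briefly miss a write acknowledged in another.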
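To see what each consistency level costs in a multi-datacenter cluster, the replica counts can be computed from the per-datacenter replication factors. This is a sketch of the arithmetic only; the function names and the `dc1`/`dc2` labels are hypothetical.

```python
def quorum_size(replicas):
    """Majority of a replica count: floor(n/2) + 1."""
    return replicas // 2 + 1


def required_acks(replicas_per_dc, local_dc):
    """Replicas that must respond for each consistency level."""
    total = sum(replicas_per_dc.values())
    return {
        "ONE": 1,
        "LOCAL_QUORUM": quorum_size(replicas_per_dc[local_dc]),
        "QUORUM": quorum_size(total),
        "ALL": total,
    }


acks = required_acks({"dc1": 3, "dc2": 3}, local_dc="dc1")
```

With three replicas in each of two datacenters, `QUORUM` needs four responses and therefore always crosses datacenters, while `LOCAL_QUORUM` needs two from the local one.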
## Sloppy Quorum Configuration by Database
| Database | Sloppy quorum setting | Default |
|----------|----------------------|---------|
| Riak | Sloppy quorum is the default behavior; the `pr` / `pw` settings require responses from primary replicas | Enabled by default |
| Cassandra | Implicit via `hinted_handoff_enabled = true` | Enabled, but strict quorum by default |
| Voldemort | Configurable `preferred-reads`, `preferred-writes` | Disabled by default |
| DynamoDB | Not directly configurable; managed by AWS | AWS-managed |
## Replication Factor Selection Rules
- **n=1:** No fault tolerance. Only acceptable for development or non-critical caches.
- **n=3:** Standard minimum. Tolerates 1 node failure with majority quorum (w=2, r=2).
- **n=5:** Higher fault tolerance (2 node failures). Use for mission-critical data or when rolling restarts (which take nodes offline one at a time) should not reduce quorum.
- **Odd n is preferred:** Simplifies majority quorum math. n=4 does not give more fault tolerance than n=3 for majority quorums (both tolerate 1 failure before majority is lost).
FILE:references/replication-lag-monitoring.md
# Replication Lag Monitoring
Use this reference when setting up monitoring for replication lag, or when diagnosing replication lag anomalies. Replication lag is the delay between when a write is applied on the leader and when it is visible on a follower/replica. Unmonitored lag is the primary cause of read-after-write, monotonic reads, and consistent prefix reads violations in production.
## Why Monitoring Lag Is Critical
Replication lag is not constant. In normal operation it may be milliseconds — imperceptible. But during:
- High write load
- Follower recovery after a crash or restart
- Network congestion between leader and follower
- Compaction or heavy disk I/O on the follower
...lag can increase to seconds or minutes. An application that reads from followers without accounting for lag will silently return stale data to users. This is not a theoretical concern — it is a common production incident.
Monitoring must alert before lag reaches the threshold where user-visible anomalies occur.
## PostgreSQL Lag Monitoring
### View current replication state
```sql
-- On the primary: see each standby's lag
SELECT
client_addr,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
write_lag,
flush_lag,
replay_lag,
sync_state
FROM pg_stat_replication;
```
- `replay_lag`: Time between the primary writing a WAL record and the standby applying it. This is the most meaningful lag metric for data visibility.
- `sync_state`: `sync` (synchronous standby), `async` (asynchronous), `potential` (can become synchronous).
### Check lag from the standby
```sql
-- On the standby: how far behind is this replica?
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
```
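Returns a time interval. Alert if this exceeds your staleness threshold (e.g., 30 seconds for user-facing reads, 5 minutes for analytics). On a primary with no write activity this value keeps growing even though the standby is fully caught up, so pair it with a byte-based check.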
### Check bytes of lag
```sql
-- On the primary: bytes of WAL not yet replicated
SELECT
client_addr,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
```
Large lag in bytes (> 1GB) indicates the replica is significantly behind and may take a long time to catch up even after load normalizes.
### Replication slot lag (prevents WAL recycling)
```sql
-- Check replication slot lag — can prevent WAL cleanup if lag is large
SELECT
slot_name,
active,
restart_lsn,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;
```
Alert when a replication slot's lag reaches several gigabytes (the table below warns at 5 GB and goes critical at 20 GB): WAL is accumulating on the primary and can fill the disk.
### Recommended alerts
| Metric | Warning threshold | Critical threshold |
|--------|------------------|-------------------|
| `replay_lag` (time) | 30 seconds | 5 minutes |
| Lag bytes | 500 MB | 5 GB |
| Replication slot lag bytes | 5 GB | 20 GB |
| `pg_stat_replication` row missing | Immediately | N/A (standby disconnected) |
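
The numeric thresholds in the table translate into a small classifier that a monitoring job could call with values read from the queries above. This is a sketch: the metric keys are invented here, and it assumes MB and GB mean binary units (MiB, GiB).

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# metric -> (warning threshold, critical threshold), taken from the table above
THRESHOLDS = {
    "replay_lag_seconds": (30, 5 * 60),
    "lag_bytes": (500 * MIB, 5 * GIB),
    "slot_lag_bytes": (5 * GIB, 20 * GIB),
}


def classify(metric, value):
    """Return 'ok', 'warning', or 'critical' for a measured value."""
    warning, critical = THRESHOLDS[metric]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"
```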
---
## MySQL Lag Monitoring
### Check lag from the replica
```sql
SHOW REPLICA STATUS\G
```
Key fields:
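- `Seconds_Behind_Source` (shown as `Seconds_Behind_Master` by the older `SHOW SLAVE STATUS`): Approximate lag in seconds. Warning: this metric can be misleading. It shows 0 when the replica SQL thread has caught up with what was received but the IO thread is lagging behind the primary.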
- `Replica_IO_Running`: Must be `Yes`. If `No`, the replica cannot receive new writes from the primary.
- `Replica_SQL_Running`: Must be `Yes`. If `No`, the replica is not applying received writes.
- `Last_SQL_Error`: Error message if replica SQL thread stopped. Common cause: a write that succeeded on the primary fails on the replica (schema divergence, constraint violation, duplicate key).
### GTID-based lag (more accurate)
```sql
-- On the primary
SELECT @@global.gtid_executed;
-- On the replica
SELECT @@global.gtid_executed;
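-- On the replica: transactions the primary has executed that are not yet applied here
-- (paste the primary's gtid_executed value in place of the placeholder)
SELECT GTID_SUBTRACT('<primary_gtid_executed>', @@global.gtid_executed) AS missing_gtids;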
```
GTID (Global Transaction ID) tracking gives a precise count of transactions not yet applied, rather than a time estimate.
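If the two `gtid_executed` values are collected by a script, the number of unapplied transactions can be counted client-side. The sketch below assumes the untagged `uuid:interval[:interval...]` text format and expands intervals into sets, which is fine for illustration but wasteful for very large ranges; the function names and the sample UUID are placeholders.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-5:7,uuid2:1-3' into {uuid: set of transaction numbers}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        txns = result.setdefault(uuid, set())
        for interval in intervals:
            start, _, end = interval.partition("-")
            txns.update(range(int(start), int(end or start) + 1))
    return result


def missing_transactions(primary_gtids, replica_gtids):
    """Count transactions executed on the primary but not on the replica."""
    primary = parse_gtid_set(primary_gtids)
    replica = parse_gtid_set(replica_gtids)
    return sum(len(txns - replica.get(uuid, set())) for uuid, txns in primary.items())


SOURCE_UUID = "3e11fa47-71ca-11e1-9e33-c80aa9429562"   # placeholder server UUID
behind = missing_transactions(f"{SOURCE_UUID}:1-100", f"{SOURCE_UUID}:1-90:95")
```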
### Recommended alerts
| Metric | Warning threshold | Critical threshold |
|--------|------------------|-------------------|
| `Seconds_Behind_Source` (`Seconds_Behind_Master` in `SHOW SLAVE STATUS`) | 30 seconds | 5 minutes |
| `Replica_IO_Running` = No | Immediately | N/A |
| `Replica_SQL_Running` = No | Immediately | N/A |
| `Last_SQL_Error` non-empty | Immediately | N/A |
---
## Cassandra Lag Monitoring
Cassandra uses leaderless replication, so there is no single "leader lag" metric. Instead, monitor:
### Repair status
Cassandra's anti-entropy repair process ensures all replicas have all data. If repair is not running regularly, stale data can persist indefinitely on replicas that missed writes.
```bash
nodetool repair --full <keyspace> # Force a full repair
nodetool compactionstats # Check compaction backlog
nodetool tpstats # Thread pool stats including repair threads
```
Key metrics:
- `AntiEntropyStage` pending tasks: if this grows, repair is not keeping up
- `RepairStage` pending tasks: repair jobs queued but not yet executed
### Hinted handoff
When a node is down and writes are directed to it via hinted handoff, the hints are stored on other nodes. Check hint delivery:
```bash
nodetool tpstats | grep HintedHandoff
```
If hints are accumulating and not being delivered, the target node may not have come back online. Hints also stop being written for a node once it has been down longer than `max_hint_window_in_ms` (default 3 hours); writes missed after that point are recovered only by repair.
### Read repair monitoring
Track `ReadRepair` metrics in Cassandra's JMX interface. Low read repair rates in a read-heavy system mean replicas are diverging without correction.
### Recommended alerts
| Metric | Warning threshold | Critical threshold |
|--------|------------------|-------------------|
| AntiEntropyStage pending tasks | > 100 | > 500 |
| HintedHandoffManager active hints | > 10,000 | Growing unbounded |
| Node status (nodetool status) | Any node DN (down) | Multiple nodes DN |
| Dropped messages (nodetool tpstats) | Any | Growing (indicates overload) |
---
## General Monitoring Principles
**Measure actual staleness, not just lag time.** Time-based lag metrics measure when the replica last applied a write, not how stale a specific read is. For example, PostgreSQL's `now() - pg_last_xact_replay_timestamp()` reports a 10-second lag on a replica that has received no writes in the last 10 seconds, even though the replica is fully up-to-date because no writes have happened. MySQL's `Seconds_Behind_Source` has its own blind spots. For user-facing staleness, track the timestamp of the most recently applied write per key, not just replication lag.
**Set a staleness budget based on your application's guarantees.** If your application provides read-after-write consistency by routing reads to the leader for 1 minute after a write, your lag alert threshold should be much lower than 1 minute — alerting at 30 seconds gives you time to route reads back to the leader before the user notices.
**Monitor the trend, not just the current value.** Lag that is at 10 seconds and growing is more dangerous than lag at 30 seconds and stable. Configure lag rate-of-change alerts in addition to threshold alerts.
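A rate-of-change alert can be as simple as the slope between the oldest and newest samples in a window. The sketch below assumes samples are `(timestamp_seconds, lag_seconds)` pairs, oldest first; the function names are hypothetical, and a production system would more likely use its monitoring tool's built-in derivative function.

```python
def lag_growth_rate(samples):
    """Seconds of lag gained per second of wall-clock time over the window."""
    if len(samples) < 2:
        return 0.0
    (t0, lag0), (t1, lag1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (lag1 - lag0) / (t1 - t0)


def seconds_until_threshold(samples, threshold):
    """Estimate time until lag crosses threshold; None if it is not growing."""
    if not samples:
        return None
    current = samples[-1][1]
    if current >= threshold:
        return 0.0
    rate = lag_growth_rate(samples)
    if rate <= 0:
        return None
    return (threshold - current) / rate


growing = [(0, 10), (60, 40)]     # 10s of lag, rising by 0.5s per second
stable = [(0, 30), (60, 30)]      # 30s of lag, flat
```

Here the growing series reaches a 100-second threshold in about two minutes, while the stable one, despite the higher starting lag, never does.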
**Lag monitoring for leaderless systems requires different approaches.** Without a single replication log, there is no single lag metric. Instead, monitor: (1) anti-entropy repair completion rate, (2) hinted handoff delivery, (3) read repair frequency, and (4) node availability (a node that is down will accumulate lag invisibly until it rejoins).
FILE:references/replication-log-formats.md
# Replication Log Formats
Leader-based replication (single-leader and multi-leader) requires the leader to send a stream of data changes to followers. How those changes are encoded determines upgrade flexibility, cross-database compatibility, and the risk of divergence between replicas.
## The Three Formats
### 1. Statement-Based Replication
**How it works:** The leader logs every write SQL statement (INSERT, UPDATE, DELETE) and forwards the statement text to followers. Each follower re-executes the statement.
**Problems:**
- Nondeterministic functions (`NOW()`, `RAND()`, `UUID()`) produce different values on each replica. The leader must replace these with fixed values before logging — a subtle requirement that is easy to miss.
- Statements that depend on existing data (e.g., `UPDATE table WHERE condition`) must execute in the same order on all replicas. Concurrent transactions can cause divergence if statements are re-ordered.
- Statements with side effects (triggers, stored procedures, user-defined functions) may produce different effects on each replica unless side effects are deterministic.
**Used by:** MySQL before version 5.1 (now row-based by default). VoltDB uses statement-based replication but requires all transactions to be deterministic.
**When appropriate:** Only for systems where all statements are guaranteed to be deterministic. Not recommended for general-purpose use.
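The nondeterminism problem is easy to reproduce in miniature. In this sketch a statement like `UPDATE users SET token = RAND()` is modeled by a function that each replica runs with its own random generator; the seeds, keys, and function names are illustrative assumptions.

```python
import random


def apply_statement(rows, rng):
    """Statement-based: each replica re-executes the statement and calls RAND() itself."""
    return {key: rng.random() for key in rows}


def apply_row_log(rows, row_log):
    """Row-based: each replica applies the values the leader already computed."""
    updated = dict(rows)
    updated.update(row_log)
    return updated


rows = {"user:1": None, "user:2": None}
leader = apply_statement(rows, random.Random(1))
follower_stmt = apply_statement(rows, random.Random(2))   # follower's own RNG state
follower_rows = apply_row_log(rows, leader)               # ships concrete values
```

The follower that re-executes the statement ends up with different tokens than the leader, while the follower that applies shipped row values matches exactly.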
---
### 2. Write-Ahead Log (WAL) Shipping
**How it works:** The storage engine appends every write to a WAL (write-ahead log) — an append-only sequence of bytes describing every change to every disk block. The leader ships this raw log to followers. Followers apply the log bytes directly to their own storage, building an identical disk structure to the leader.
**Advantages:**
- Followers apply the same physical changes as the leader, so they end up with the same on-disk structures and replicas are very unlikely to diverge.
**Problems:**
- The WAL contains physical details: which bytes changed in which disk blocks. This is tightly coupled to the specific storage engine version and format.
- If the leader and follower run different database software versions, the WAL format may differ — making zero-downtime upgrades impossible. Upgrading requires taking followers offline or using logical replication instead.
- Cannot be used to replicate between different database products (PostgreSQL WAL cannot be consumed by MySQL).
**Used by:** PostgreSQL (`wal_level = replica` or `logical`), Oracle Data Guard.
**When appropriate:** Single-database deployments where version homogeneity is maintained and operational simplicity is valued over upgrade flexibility.
---
### 3. Logical (Row-Based) Log Replication
**How it works:** A separate log format — decoupled from the storage engine's physical format — describes writes at the granularity of database rows:
- **Inserted row:** new values of all columns
- **Deleted row:** enough information to identify the row (primary key, or all column values if no primary key)
- **Updated row:** row identifier + new values of changed columns
**Advantages:**
- Decoupled from the storage engine internals → leader and follower can run different database versions during rolling upgrades.
- Easier to parse by external systems (data warehouses, caches, search indexes via change data capture).
- Can replicate between different database products if both understand the logical format.
**Problems:**
- More data volume than statement-based replication (full row contents vs. one SQL statement).
- Conflict detection in multi-leader setups must operate at the row level (which column changed? was the whole row replaced?).
**Used by:** MySQL binlog (`binlog_format = ROW`), PostgreSQL logical replication (`wal_level = logical`), Debezium (CDC).
**When appropriate:** The recommended default for most production deployments. Required if zero-downtime upgrades are needed. Required if change data capture (CDC) feeds a data warehouse or search index.
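The row-level events described above can be modeled as small records that any consumer (a follower, a cache, a search indexer) applies in order. This is a sketch with an invented `RowChange` type and a dict standing in for a table; real formats such as the MySQL binlog carry more metadata.

```python
from dataclasses import dataclass, field


@dataclass
class RowChange:
    op: str                     # "insert", "update", or "delete"
    key: int                    # primary key identifying the row
    values: dict = field(default_factory=dict)   # new column values, if any


def apply_change(table, change):
    """Apply one logical log record to a table held as {key: {column: value}}."""
    if change.op == "insert":
        table[change.key] = dict(change.values)
    elif change.op == "update":
        table[change.key].update(change.values)   # only the changed columns
    elif change.op == "delete":
        del table[change.key]
    else:
        raise ValueError(f"unknown operation: {change.op}")


log = [
    RowChange("insert", 1, {"name": "Ada", "plan": "free"}),
    RowChange("insert", 2, {"name": "Lin", "plan": "free"}),
    RowChange("update", 1, {"plan": "pro"}),
    RowChange("delete", 2),
]
follower = {}
for change in log:
    apply_change(follower, change)
```

Nothing in the records refers to disk blocks or storage-engine versions, which is what lets different versions or different systems consume the same log.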
---
### 4. Trigger-Based Replication
**How it works:** Application-level replication using database triggers or stored procedures. A trigger fires on every write, logs the change to a separate table, and an external process reads that table and replicates the change.
**Advantages:**
- Maximum flexibility: can replicate a subset of data, apply transformation logic, or replicate between different database systems.
- Does not require changes to the database software itself.
**Problems:**
- Higher overhead than built-in replication (trigger execution + log table writes on every write).
- More bugs and edge cases than built-in replication methods.
- Requires careful design to avoid trigger recursion and partial-write anomalies.
**Used by:** Databus (for Oracle), Bucardo (PostgreSQL). Oracle GoldenGate serves similar use cases but reads the database log rather than relying on triggers.
**When appropriate:** When built-in replication does not support the required topology (e.g., cross-database replication, selective replication, custom transformation during replication).
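The mechanism can be demonstrated end to end with SQLite from the Python standard library. The table, trigger, and column names below are invented for the example, and the "external process" is just a loop that replays the change table into a dict.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER);
CREATE TABLE change_log (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT, row_id INTEGER, balance INTEGER
);
CREATE TRIGGER accounts_ins AFTER INSERT ON accounts BEGIN
    INSERT INTO change_log (op, row_id, balance) VALUES ('insert', NEW.id, NEW.balance);
END;
CREATE TRIGGER accounts_upd AFTER UPDATE ON accounts BEGIN
    INSERT INTO change_log (op, row_id, balance) VALUES ('update', NEW.id, NEW.balance);
END;
CREATE TRIGGER accounts_del AFTER DELETE ON accounts BEGIN
    INSERT INTO change_log (op, row_id, balance) VALUES ('delete', OLD.id, NULL);
END;
""")

# Ordinary application writes; the triggers record each one.
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.execute("INSERT INTO accounts VALUES (2, 50)")
conn.execute("UPDATE accounts SET balance = 80 WHERE id = 1")
conn.execute("DELETE FROM accounts WHERE id = 2")

# Stand-in for the external replication process: replay the log in order.
changes = conn.execute(
    "SELECT op, row_id, balance FROM change_log ORDER BY seq"
).fetchall()
replica = {}
for op, row_id, balance in changes:
    if op == "delete":
        replica.pop(row_id, None)
    else:
        replica[row_id] = balance
```

Every application write now costs an extra insert into `change_log`, which is the overhead the section warns about.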
---
## Comparison Table
| Format | Decoupled from storage engine? | Zero-downtime upgrades? | External consumption (CDC)? | Divergence risk |
|--------|-------------------------------|------------------------|----------------------------|----------------|
| Statement-based | Yes | Yes | Difficult (parse SQL) | High (nondeterminism) |
| WAL shipping | No | No | Difficult (binary format) | Very low |
| Logical/row-based | Yes | Yes | Easy (structured rows) | Low |
| Trigger-based | Yes | Yes | Easy (custom table) | Medium (trigger bugs) |
## Recommendation
**Default choice:** Logical (row-based) replication. It provides upgrade flexibility, external CDC compatibility, and low divergence risk without the nondeterminism problems of statement-based replication.
**Use WAL shipping only if:** You want the simplest possible setup and guarantee that the leader and all followers will always run the same database version. Operational teams that can enforce version homogeneity may prefer WAL shipping for its byte-for-byte replica guarantee.
**Avoid statement-based replication** unless you can guarantee all statements are deterministic (VoltDB model) or you are constrained to a legacy MySQL version.
Diagnose active replication failures by mapping symptoms to leader failover pitfalls, replication lag anomalies, or quorum edge cases — and produce a structu...
---
name: replication-failure-analyzer
description: |
Diagnose active replication failures by mapping symptoms to leader failover pitfalls, replication lag anomalies, or quorum edge cases — and produce a structured remediation plan. Use when: data just written disappears or shows stale on re-read (read-after-write violation); records appear and vanish on refresh (monotonic reads violation); causally related events appear in impossible order (consistent prefix reads violation); a failover produces duplicate primary keys, write rejections, or incorrect routing; two replica nodes are both accepting writes (split brain in replication topology — for split brain via distributed locking, use distributed-failure-analyzer); quorum reads return stale values despite w + r > n; or a sloppy quorum with incomplete hinted handoff is serving old data. Applies to PostgreSQL, MySQL (single-leader), Cassandra, Riak, Voldemort, DynamoDB (leaderless). Use replication-strategy-selector first if the topology has not yet been chosen. Produces: symptom → failure class → mechanism → mitigation report, leader failover checklist, replication lag anomaly guide, and quorum edge case catalog (six ways w + r > n still fails).
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/designing-data-intensive-applications/skills/replication-failure-analyzer
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
- replication-strategy-selector
source-books:
- id: designing-data-intensive-applications
title: "Designing Data-Intensive Applications"
authors: ["Martin Kleppmann"]
chapters: [5]
tags:
- replication
- failover
- split-brain
- replication-lag
- read-after-write
- monotonic-reads
- consistent-prefix-reads
- quorum
- sloppy-quorum
- hinted-handoff
- leaderless
- single-leader
- cassandra
- riak
- postgresql
- mysql
- primary-key-conflict
- data-loss
- stale-reads
- anti-entropy
- version-vectors
- last-write-wins
- concurrent-writes
- failure-analysis
execution:
tier: 2
mode: full
inputs:
- type: description
description: "Current replication topology (single-leader, multi-leader, or leaderless), observed symptoms, and whether a failover recently occurred"
- type: codebase
description: "Application source code, docker-compose, or database configuration files that reveal quorum settings, read routing, conflict resolution strategy, and primary key generation"
- type: document
description: "Incident report, runbook, or architecture description if no codebase is available"
tools-required: [Read, Write, Grep]
tools-optional: [Bash]
mcps-required: []
environment: "Run inside a project directory with codebase or configuration files, or accept a verbal description of the failure. Produces a written replication failure analysis report."
discovery:
goal: "Produce a structured replication failure analysis report: classify each symptom into its failure class, identify the specific mechanism, and recommend concrete mitigations"
tasks:
- "Gather symptom description, replication topology, and recent operational events (failover, partition, node restart)"
- "Classify each symptom into leader failover failure, replication lag anomaly, or quorum edge case"
- "Identify the specific mechanism within the class"
- "Scan configuration and codebase for anti-patterns (timeout values, read routing, quorum parameters, primary key generation)"
- "Recommend mitigations matched to root cause"
- "Produce the failure analysis report"
audience:
roles: ["backend-engineer", "software-architect", "site-reliability-engineer", "data-engineer", "tech-lead"]
experience: "intermediate-to-advanced — assumes basic replication familiarity (what a leader is, what a follower is, what a quorum is)"
triggers:
- "User's submitted data disappeared or shows stale immediately after a write"
- "Data appears then vanishes on refresh — time appears to move backward"
- "Causally related records appear out of causal order (answer before question)"
- "Failover completed but writes are rejected, misrouted, or creating duplicate key conflicts"
- "Two nodes are both accepting writes simultaneously"
- "Quorum reads returning stale values despite w + r > n"
- "Hinted handoff in progress and fresh reads returning old data"
- "Infrequently-read keys diverging silently across replicas"
- "Proactive review of replication config before shipping to production"
not_for:
- "Choosing a replication topology — use replication-strategy-selector"
- "Diagnosing network faults, zombie leaders, or clock unreliability — use distributed-failure-analyzer"
- "Selecting consistency and isolation guarantees at the transaction level — use consistency-model-selector or transaction-isolation-selector"
---
## When to Use
You have a replication failure in progress or an anomaly you cannot explain. The symptom is one of: data that was written cannot be read back; data that was visible has disappeared; records arrive in causally impossible order; two nodes are writing simultaneously; a quorum read is stale despite the math being correct.
This skill imposes a diagnostic framework: every replication failure traces to one of three classes — **leader failover pitfalls**, **replication lag anomalies**, or **quorum edge cases**. Each class has a bounded set of mechanisms with known mitigations. The skill maps your symptoms to a class, narrows to the mechanism, and produces a remediation plan.
This is the companion to `replication-strategy-selector`. That skill helps you choose. This skill helps you diagnose. Use this one when something is already wrong, or when you want to audit a configuration for latent failure before it manifests.
Cross-references:
- `replication-strategy-selector` — for choosing topology, sync mode, quorum values, and conflict resolution strategy from scratch
- `distributed-failure-analyzer` — for failures whose root cause is a network fault, clock unreliability, or process pause (zombie leaders, LWW data loss via clock skew, cascading timeouts)
- `consistency-model-selector` — for selecting the right consistency and isolation guarantees to prevent a class of anomalies at the application layer
---
## Context and Input Gathering
Before analysis, collect the following. Ask the user for any that are missing.
**Required:**
1. **Symptom description** — what was observed vs. what was expected. Be specific: "a user submitted a comment and immediately reloaded the page but did not see their comment" is better than "stale reads."
2. **Replication topology** — single-leader, multi-leader, or leaderless (Dynamo-style). If leaderless: number of replicas (n), write quorum (w), read quorum (r). If single-leader: synchronous, semi-synchronous, or asynchronous?
3. **Recent operational events** — did a failover happen recently? Was a node restarted? Was there a network partition? Was a follower promoted? Any configuration changes in the past 48 hours?
**Useful:**
4. **Read routing strategy** — do reads go to the leader only, to any replica, or to a specific replica? Is there a load balancer routing reads randomly?
5. **Primary key / ID generation strategy** — autoincrement, UUID, application-generated? Is any external system (cache, secondary store) keyed on these IDs?
6. **Quorum configuration** — for leaderless systems: exact w, r, n values; whether sloppy quorums are enabled; whether an anti-entropy background process is configured.
7. **Replication log metrics** — replication lag in seconds or write offset delta, if available.
**If no codebase or configuration is available:** accept a verbal description and produce an analysis. The report will note which findings are confirmed vs. inferred.
---
## Process
### Step 1 — Classify the failure
WHY: The three failure classes have different root causes and completely different mitigations. Treating a replication lag anomaly as a failover problem (or vice versa) leads to wasted effort and leaves the actual failure in place. Classification first prevents this.
**Class A: Leader failover pitfalls**
Applies to single-leader replication. A failover is the process of promoting a follower to be the new leader when the current leader fails. Automatic failover typically follows three steps: (1) detect the leader has failed via timeout, (2) elect a new leader (usually the most up-to-date replica, chosen by election or by a previously elected controller node), (3) reconfigure clients to route writes to the new leader and ensure the old leader becomes a follower if it recovers.
Failover is "fraught with things that can go wrong." The four documented failure modes are:
| Failure mode | Mechanism | Signal |
|---|---|---|
| **Async data loss** | New leader was an async follower — had not received all writes from old leader before failure. Old leader's unreplicated writes are discarded. | Writes confirmed to client are missing after failover |
| **Primary key conflict** | New leader's autoincrement counter lagged behind old leader's. New leader reissues keys already assigned by the old leader. Any system keyed on these IDs (Redis cache, secondary DB, audit log) develops cross-system inconsistency. | Duplicate key errors or wrong data returned for existing IDs in external systems |
| **Split brain** | Old leader recovers and does not recognize the new leader. Both nodes accept writes simultaneously. Without a process to resolve conflicts, data is lost or corrupted. Some systems "shut down one node if two leaders are detected" — but if this mechanism is misconfigured, both nodes may shut down. | Two nodes both reporting as leader; writes going to both; diverging replica state |
| **Timeout miscalibration** | Timeout too short: unnecessary failovers under load spike or network glitch, making the situation worse. Timeout too long: prolonged unavailability during genuine failures. A temporary load spike can cause response time to exceed the timeout, triggering a failover that increases load further. | Repeated failovers during traffic spikes; or prolonged unavailability before failover triggers |
**Class B: Replication lag anomalies**
Applies to single-leader asynchronous replication with read-scaling (reads routed to followers). The replication lag — the delay between a write being applied on the leader and being reflected on a follower — may be milliseconds under normal conditions, but can grow to seconds or minutes under load or network issues. Three named anomaly patterns arise:
| Anomaly | Description | Mechanism | Named guarantee required |
|---|---|---|---|
| **Read-after-write violation** | User submits data; immediately reads it back; does not see it. From the user's perspective, the submission was lost. | Read was routed to a follower that had not yet received the write. | Read-after-write consistency (also called read-your-writes) |
| **Monotonic reads violation** | User reads data (e.g., a comment); reloads the page; the data is gone. Time appears to move backward. | Sequential reads were routed to different replicas with different lag. The second read went to a more-lagged replica that had not yet received the write the first read saw. | Monotonic reads |
| **Consistent prefix reads violation** | User sees causally related records in an impossible order — an answer appearing before the question it answers. | In a partitioned database where partitions operate independently, partition A (carrying the reply) had low lag and partition B (carrying the question) had high lag. The observer read partition A first. | Consistent prefix reads |
**Class C: Quorum edge cases**
Applies to leaderless (Dynamo-style) replication: Cassandra, Riak, Voldemort. (AWS DynamoDB, despite the name, is a different design built on single-leader replication, so this class does not apply to it.) The quorum condition w + r > n is designed to ensure that at least one node in every read set has seen every acknowledged write. However, six scenarios break this guarantee in practice even when the condition is mathematically satisfied:
| Edge case | Mechanism |
|---|---|
| **Sloppy quorum active** | A network interruption isolated the client from the n "home" nodes for a value. Writes were accepted by w nodes outside the home set (sloppy quorum). Even though w + r > n, the r read nodes are the home nodes — they have not seen the writes yet. Hinted handoff has not completed. |
| **Concurrent writes, no clear ordering** | Two writes to the same key occurred simultaneously. The quorum condition does not determine which write happened first. If last-write-wins is the conflict resolution strategy, the write with the lower timestamp (possibly the causally later write, if clocks are skewed) is silently discarded. |
| **Write concurrent with read** | A write was in flight when a read was issued, so the write was reflected on only some of the r replicas contacted. It is undetermined whether the read returns the old or the new value. Even after the write has been applied on some replicas, a later read that contacts a different replica subset may still return the old value. |
| **Partial write success, no rollback** | A write succeeded on some replicas but failed on others (e.g., disk full) and was reported as failed overall (fewer than w acknowledgements). The replicas that did succeed are not rolled back. Subsequent reads may or may not see the partially-written value. |
| **Node restored from stale replica** | A node carrying a new value fails and its data is restored from a replica carrying an old value. The number of replicas storing the new value falls below w, breaking the quorum condition retrospectively. |
| **Timing edge cases at linearizability boundary** | Even with w + r > n fully satisfied, quorum reads are not linearizable — there are race conditions where unlucky timing can produce stale reads. Quorums provide eventual consistency, not linearizability. |
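The overlap argument behind w + r > n, and the way a sloppy quorum defeats it, can be checked by brute force on a small cluster. This is a minimal sketch; the node names and the `every_read_overlaps` helper are illustrative assumptions, not part of any database API.

```python
from itertools import combinations

def every_read_overlaps(home_nodes, w, r, write_candidates=None):
    """Return True if every possible r-node read set (drawn from the home
    nodes) shares at least one node with every possible w-node write set.

    write_candidates defaults to the home nodes (strict quorum). Passing a
    larger pool models a sloppy quorum, where writes may land elsewhere.
    """
    write_pool = home_nodes if write_candidates is None else write_candidates
    for write_set in combinations(write_pool, w):
        for read_set in combinations(home_nodes, r):
            if not set(write_set) & set(read_set):
                return False  # found a read that misses the write entirely
    return True

home = ["a", "b", "c"]  # n = 3 home nodes (hypothetical names)

# Strict quorum, w + r > n: every read set sees the write.
strict = every_read_overlaps(home, w=2, r=2)

# w + r = n: overlap is no longer guaranteed.
weak = every_read_overlaps(home, w=1, r=2)

# Sloppy quorum: the same w and r, but writes may land on non-home nodes.
sloppy = every_read_overlaps(home, w=2, r=2, write_candidates=home + ["x", "y"])
```

The sloppy case fails even though the arithmetic is unchanged, which is why hinted handoff status matters more than the w, r, n values when diagnosing stale reads after a partition.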
---
### Step 2 — Identify the specific mechanism
WHY: The class narrows the diagnostic space; the specific mechanism determines which mitigation is effective. "Replication lag anomaly" does not tell you whether you need sticky routing, a timestamp-based threshold, or causal consistency at the partition level — only identifying the exact anomaly pattern does.
**For Class A (leader failover):**
Confirm which of the four failure modes is active by asking:
- Are writes that were confirmed to the client missing after failover? → Async data loss. Check whether the promoted follower was in synchronous or asynchronous mode with the old leader.
- Are duplicate key errors or cross-system inconsistencies appearing after failover? → Primary key conflict. Check whether the new leader's autoincrement counter was reset or is behind the old leader's. Check for any external systems keyed on the same IDs.
- Are two nodes both accepting writes? → Split brain. Check each node's leader status. Check whether fencing / STONITH (Shoot The Other Node In The Head) is configured and whether it fired correctly.
- Are failovers happening repeatedly or under load? → Timeout miscalibration. Check the configured timeout value against observed p99 latency under load.
**For Class B (replication lag anomalies):**
Confirm which anomaly pattern is active by asking:
- Does the stale read affect only the user who just wrote, immediately after a write? → Read-after-write violation.
- Does a previously visible record disappear on refresh? → Monotonic reads violation.
- Does a reply appear before its question, or an effect appear before its cause? → Consistent prefix reads violation.
For each anomaly, identify whether the read routing layer can be changed (application-level) or whether the database must be configured to provide the guarantee (database-level).
**For Class C (quorum edge cases):**
Confirm which edge case applies:
- Was there a network partition recently? Is hinted handoff in progress? → Sloppy quorum edge case.
- Were concurrent writes made to the same key? Is LWW the conflict resolution strategy? → LWW + concurrent write data loss. Check clock skew between nodes.
- Is the write partially applied (error reported on write, but no rollback)? → Partial write success.
- Was a node recently restored from backup or from a replica? → Stale node restoration.
- Is the failure intermittent under high concurrency? → Timing edge case at linearizability boundary.
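The question lists above amount to a lookup from (topology, symptom) to (class, mechanism). The sketch below encodes a subset of them; the symptom labels are hypothetical shorthand invented for this example, and a real diagnosis still needs the confirming checks described for each mechanism.

```python
from enum import Enum

class FailureClass(Enum):
    LEADER_FAILOVER = "A: leader failover pitfall"
    REPLICATION_LAG = "B: replication lag anomaly"
    QUORUM_EDGE_CASE = "C: quorum edge case"

# (topology, symptom label) -> (failure class, mechanism).
# The symptom labels are illustrative, not a fixed vocabulary.
MECHANISMS = {
    ("single-leader", "confirmed_writes_missing_after_failover"):
        (FailureClass.LEADER_FAILOVER, "async data loss"),
    ("single-leader", "duplicate_keys_after_failover"):
        (FailureClass.LEADER_FAILOVER, "primary key conflict"),
    ("single-leader", "two_nodes_accepting_writes"):
        (FailureClass.LEADER_FAILOVER, "split brain"),
    ("single-leader", "repeated_failovers_under_load"):
        (FailureClass.LEADER_FAILOVER, "timeout miscalibration"),
    ("single-leader", "own_write_not_visible"):
        (FailureClass.REPLICATION_LAG, "read-after-write violation"),
    ("single-leader", "record_disappears_on_refresh"):
        (FailureClass.REPLICATION_LAG, "monotonic reads violation"),
    ("single-leader", "effect_before_cause"):
        (FailureClass.REPLICATION_LAG, "consistent prefix reads violation"),
    ("leaderless", "stale_reads_after_partition"):
        (FailureClass.QUORUM_EDGE_CASE, "sloppy quorum, hinted handoff incomplete"),
    ("leaderless", "concurrent_write_lost"):
        (FailureClass.QUORUM_EDGE_CASE, "LWW with concurrent writes"),
    ("leaderless", "failed_write_partially_visible"):
        (FailureClass.QUORUM_EDGE_CASE, "partial write success"),
    ("leaderless", "stale_after_node_restore"):
        (FailureClass.QUORUM_EDGE_CASE, "stale node restoration"),
}

def classify(topology, symptom):
    """Return (FailureClass, mechanism), or None if the pair is not mapped."""
    return MECHANISMS.get((topology, symptom))

result = classify("single-leader", "record_disappears_on_refresh")
unknown = classify("leaderless", "something_else")
```

Returning `None` for unmapped pairs is deliberate: an unrecognized symptom should go to the report's open questions rather than be forced into a class.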
---
### Step 3 — Apply the mitigation
WHY: Each mechanism has a specific mitigation. Applying the wrong mitigation (e.g., increasing quorum for a read-after-write problem in a single-leader system) wastes effort and may introduce new problems. This step matches mechanism to fix precisely.
#### Class A mitigations: leader failover
**Async data loss:**
- Short-term: accept that unreplicated writes from the old leader are lost. Inform users; roll back any downstream effects (cache entries, external system records).
- Long-term: configure at least one synchronous follower (semi-synchronous replication), for example with PostgreSQL's `synchronous_standby_names` or MySQL's semi-synchronous replication, and accept the latency cost on writes. Alternatively, use a consensus-based replication protocol (Group Replication, Galera) where a write is not confirmed until a majority has acknowledged it.
**Primary key conflict:**
- Immediate: reset the new leader's autoincrement counter to a value above the old leader's highest known key. Query the old leader's `AUTO_INCREMENT` status before decommissioning it; if the old leader is unavailable, query all external systems for the highest key they have seen.
- Invalidate or reconcile any external system (Redis, cache, secondary DB) keyed on the affected ID range.
- Long-term: switch to UUIDs or application-generated IDs that are globally unique and do not depend on a per-node counter. This eliminates the class of failure entirely.
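Both fixes can be sketched in a few lines. The `safe_autoincrement_floor` helper and its `margin` parameter are assumptions made for illustration: the margin is a guess at how many IDs the old leader may have issued that no other system ever observed, so choose it from your own write rate.

```python
import uuid

def safe_autoincrement_floor(new_leader_max_id, external_max_ids, margin=1000):
    """Return a counter value that cannot collide with any ID seen so far.

    new_leader_max_id: highest ID present on the promoted follower.
    external_max_ids:  highest ID seen by each external system (cache, index, ...).
    margin:            extra headroom for IDs nobody observed (assumed value).
    """
    highest_seen = max([new_leader_max_id, *external_max_ids])
    return highest_seen + margin + 1

# Immediate fix: advance the counter past everything any system has seen.
floor = safe_autoincrement_floor(5000, [5120, 5087])

# Long-term fix: IDs that do not depend on a per-node counter at all.
new_row_id = str(uuid.uuid4())
```

The computed floor would then be applied with the database's own mechanism for resetting the counter before the new leader accepts traffic.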
**Split brain:**
- Immediate: manually fence the old leader — force it offline, revoke its write permissions at the network or storage layer, or trigger STONITH if configured.
- Reconcile diverged writes: identify writes accepted by both nodes during the split brain window. Resolve conflicts using the conflict resolution strategy appropriate to the data (last-write-wins if losing some writes is acceptable; application-level merge otherwise).
- Long-term: configure a fencing mechanism. Ensure the fencing path is tested periodically. Consider whether a consensus protocol (Raft, Paxos) for leader election would prevent the situation — these protocols guarantee that only one leader is active at a time.
**Timeout miscalibration:**
- Measure round-trip time distribution empirically under peak load. Set the failure-detection timeout at p99 or p99.9 latency plus a safety margin.
- If repeated unnecessary failovers are occurring: increase the timeout. If the timeout is too long and genuine failures are causing excessive unavailability: decrease it — but only after measuring the latency distribution.
- Consider a Phi Accrual failure detector (used in Cassandra, Akka) that adapts its failure threshold based on observed heartbeat jitter, rather than a fixed timeout.
- Some operations teams prefer manual failover even when automatic failover is available, specifically to avoid miscalibrated-timeout failures during load spikes.
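Deriving the timeout from measurements rather than picking a round number can look like the sketch below. The function name, the default 500 ms margin, and the synthetic samples are assumptions for illustration; real input would be heartbeat or round-trip latencies captured under peak load.

```python
import statistics

def failover_timeout_ms(latencies_ms, percentile=99, margin_ms=500.0):
    """Return the chosen latency percentile plus a fixed safety margin."""
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cut_points = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return cut_points[percentile - 1] + margin_ms

# Synthetic samples: 1 ms .. 100 ms, standing in for measured latencies.
samples = [float(ms) for ms in range(1, 101)]
timeout = failover_timeout_ms(samples)
```

Re-running this against fresh samples after any capacity or topology change keeps the timeout tied to observed behavior instead of to the value that was right at launch.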
#### Class B mitigations: replication lag anomalies
**Read-after-write violation:**
Multiple techniques can implement read-after-write consistency:
1. **Route the user's own data reads to the leader.** For data that only the user themselves can modify (e.g., a user profile), always read from the leader. Read other users' data from followers. This requires knowing, at read time, which data belongs to the current user.
2. **Time-window leader reads.** For one minute (or some configurable interval) after the user's last write, route all that user's reads to the leader. This covers the case where most things in the application are potentially editable by the user.
3. **Client-side write timestamp, replica lag threshold.** The client records the timestamp of its last write (or, better, the replication log position / log sequence number returned by the write). When reading, route to any replica that has advanced past that position. If no such replica is available, either wait or route to the leader.
4. **Cross-device consistency (additional complexity).** If the user accesses your service from multiple devices, the write timestamp must be stored centrally (not just in local storage on the device that made the write). All reads from all of that user's devices must be routed according to that centralized timestamp.
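Technique 3 can be sketched as a small routing function. The `Replica` type, the `applied_lsn` field, and the `"leader"` name are hypothetical; in practice the log position would come from the database's replication status, and the user's last write position would be stored centrally if cross-device consistency is needed.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    applied_lsn: int  # highest replication log position this follower has applied

def choose_read_target(last_write_lsn, followers, leader="leader"):
    """Pick a follower that has caught up to the user's last write,
    falling back to the leader if none has."""
    if last_write_lsn is None:
        # The user has not written anything: any follower is acceptable.
        return followers[0].name if followers else leader
    for follower in followers:
        if follower.applied_lsn >= last_write_lsn:
            return follower.name
    return leader

followers = [Replica("f1", applied_lsn=90), Replica("f2", applied_lsn=120)]
caught_up = choose_read_target(100, followers)   # f1 is behind, f2 qualifies
fallback = choose_read_target(130, followers)    # nobody has caught up yet
no_writes = choose_read_target(None, followers)
```

Comparing log positions avoids the clock-skew problems of comparing wall-clock timestamps, at the cost of the write path having to return the position to the client.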
**Monotonic reads violation:**
Ensure that a given user's sequential reads always go to the same replica. The replica can be chosen based on a hash of the user ID rather than randomly. This ensures the user's observed state only moves forward in time.
Caveat: if the assigned replica fails, the user's reads must be rerouted to another replica. At that moment, monotonic reads may be violated for the duration until the new replica catches up. This is generally acceptable — the guarantee is "best effort" in the face of replica failure.
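A sketch of hash-based replica assignment follows. The function name and replica names are illustrative. A cryptographic digest is used because Python's built-in `hash()` for strings is randomized per process, which would send the same user to different replicas from different application servers.

```python
import hashlib

def replica_for_user(user_id, replicas):
    """Deterministically map a user to one replica using a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]

replicas = ["follower-1", "follower-2", "follower-3"]
first_read = replica_for_user("user-1234", replicas)
second_read = replica_for_user("user-1234", replicas)  # same replica every time
```

With plain modulo assignment, removing a failed replica from the list remaps many users at once; that is the window in which the caveat above applies.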
**Consistent prefix reads violation:**
Ensure that writes with causal dependencies are written to the same partition. This prevents the ordering inversion — if the question and answer go to the same partition, the follower will always apply them in the correct order.
If causally related writes cannot always be co-located on the same partition (because the data model makes this impractical): use a database or middleware layer that tracks causal dependencies explicitly (causal consistency via version vectors) and ensures that a read does not return a causally later write without also returning its causal prerequisites.
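The rule "do not return a causally later write without its prerequisites" can be illustrated with an application-level filter. This is a deliberately simplified sketch: the `depends_on` field is a hypothetical stand-in for real dependency tracking such as version vectors, and the function name is invented for this example.

```python
def causally_consistent_view(records):
    """Return only the records whose whole causal chain is present.

    records: list of dicts with an "id" and an optional "depends_on"
    (the id of the record that caused this one). Input order is kept.
    """
    by_id = {rec["id"]: rec for rec in records}

    def chain_present(rec, visiting=()):
        cause = rec.get("depends_on")
        if cause is None:
            return True
        if cause not in by_id or cause in visiting:
            return False  # prerequisite missing, or a dependency cycle
        return chain_present(by_id[cause], visiting + (rec["id"],))

    return [rec for rec in records if chain_present(rec)]

question = {"id": "q1", "text": "How far ahead can you see?"}
answer = {"id": "a1", "depends_on": "q1", "text": "About ten seconds."}

# The answer's partition replicated quickly, the question's did not:
lagging_view = causally_consistent_view([answer])

# Once both have arrived, both are shown:
full_view = causally_consistent_view([answer, question])
```

Withholding the answer until its question arrives trades a little freshness for an ordering the observer can make sense of.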
#### Class C mitigations: quorum edge cases
**Sloppy quorum / hinted handoff in progress:**
- Wait for hinted handoff to complete before serving reads that require up-to-date data. Monitor the hinted handoff queue to determine when it is empty.
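- If waiting is not acceptable: turn off sloppy quorums to get strict quorum behavior, at the cost of lower availability during network partitions. In Riak, the `pr` and `pw` quorum parameters require that many primary (home) vnodes to respond. In Cassandra, avoid writing at consistency level `ANY`, the only level at which a stored hint alone satisfies the write.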
- Do not treat sloppy quorums as equivalent to strict quorums. They are a durability guarantee only — "w nodes somewhere accepted the write" — not a freshness guarantee.
**LWW + concurrent writes:**
- If data loss is not acceptable: do not use last-write-wins as the conflict resolution strategy. Switch to version vectors (Riak supports these natively as "dotted version vectors") which allow conflicting writes to be detected and merged rather than silently discarded.
- If LWW must be retained: use a UUID as the key so each write operation gets a unique key, making concurrent writes to the same key impossible. This is the recommended Cassandra pattern for LWW safety.
- Monitor clock skew between nodes. Alert when skew exceeds the acceptable data-loss threshold.
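The difference between last-write-wins and version vectors is that the latter can tell "newer" apart from "concurrent". A minimal comparison function, assuming version vectors are represented as plain dicts from node name to counter (a simplification of what a real store keeps):

```python
def compare_version_vectors(vv_a, vv_b):
    """Compare two version vectors (dict: node name -> counter).

    Returns "a_newer", "b_newer", "equal", or "concurrent".
    """
    nodes = set(vv_a) | set(vv_b)
    a_ahead = any(vv_a.get(n, 0) > vv_b.get(n, 0) for n in nodes)
    b_ahead = any(vv_b.get(n, 0) > vv_a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"  # neither write saw the other: keep both and merge
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"

overwrite = compare_version_vectors({"n1": 2, "n2": 1}, {"n1": 1, "n2": 1})
conflict = compare_version_vectors({"n1": 2, "n2": 1}, {"n1": 1, "n2": 2})
```

Under LWW the `conflict` pair would be resolved by timestamp alone and one write silently dropped; here it is surfaced so the application can merge the two values.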
**Partial write success:**
- Accept that a reported-failed write may have partially applied. Applications that require strict atomicity across all replicas must use a database with transaction support, not an eventually consistent leaderless store.
- Design operations to be idempotent where possible so they can be safely retried or re-applied during read repair.
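An increment is the classic non-idempotent operation: retrying it after a reported failure can apply it twice. One common way to make it safe, sketched here with hypothetical names and in-memory structures, is to attach a unique operation ID and apply each ID at most once.

```python
def apply_increment(counters, applied_op_ids, op_id, key, amount):
    """Apply an increment at most once per operation ID.

    Returns True if the increment was applied, False if this op_id
    had already been applied (for example, on a client retry).
    """
    if op_id in applied_op_ids:
        return False
    counters[key] = counters.get(key, 0) + amount
    applied_op_ids.add(op_id)
    return True

counters, seen_ops = {}, set()
first = apply_increment(counters, seen_ops, "op-42", "likes:post-1", 1)
# The client saw a failure (fewer than w acks) and retries the same operation:
retry = apply_increment(counters, seen_ops, "op-42", "likes:post-1", 1)
```

In a replicated store the set of applied operation IDs would have to be kept per replica alongside the data; the in-memory set here only demonstrates the check.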
**Stale node restoration:**
- Before restoring a node from backup or from another replica, assess how many replicas currently hold the latest value for affected keys. If restoring would drop the count below w, delay the restoration until a full-cluster repair can be run first.
- After restoring a node, run a full repair before routing reads to it (Cassandra: `nodetool repair`; Riak: let active anti-entropy finish exchanging data, or run a partition repair).
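The pre-restore check in the first bullet is a simple count. In this sketch the per-key version numbers and the function name are assumptions; a real check would sample versions for affected keys from each replica.

```python
def safe_to_restore(replica_versions, node_to_restore, w):
    """Check whether overwriting one node keeps at least w copies of the latest value.

    replica_versions: dict of node name -> version number it holds for a key.
    node_to_restore:  the node whose data is about to be replaced.
    """
    latest = max(replica_versions.values())
    holders_after_restore = [
        node for node, version in replica_versions.items()
        if version == latest and node != node_to_restore
    ]
    return len(holders_after_restore) >= w

versions = {"a": 7, "b": 7, "c": 6}  # nodes a and b hold the newest value
restore_a = safe_to_restore(versions, "a", w=2)  # would leave only b with version 7
restore_c = safe_to_restore(versions, "c", w=2)  # a and b still hold version 7
```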
**Linearizability requirement:**
- If the application genuinely needs linearizable reads (not just eventual consistency), quorums alone are insufficient. Options: (1) route all reads and writes for the affected data through a single leader node, (2) use a consensus protocol (Raft, Paxos), which provides linearizability, or (3) move to a database with linearizable transaction support (refer to `transaction-isolation-selector`).
---
### Step 4 — Scan configuration and codebase for anti-patterns
WHY: Latent replication failures exist in configuration and code before they manifest in production. Proactive scanning finds them at low cost. These are the specific patterns to search for.
**Anti-pattern 1: Reads routed to a random follower**
```
# Look for: round-robin read balancing, random replica selection
# or: load balancer distributing reads across all replicas without session affinity
```
Risk: Read-after-write and monotonic reads violations under any replication lag. Any write may be invisible to a read that lands on a different, more-lagged follower.
Fix: Implement user-session sticky reads or timestamp-gated replica selection.
**Anti-pattern 2: Autoincrement primary keys with asynchronous replication and external systems**
```
# Look for: AUTO_INCREMENT columns, SERIAL columns, sequences
# combined with: external system (Redis, Elasticsearch, audit log) using the same IDs
# combined with: asynchronous replication with manual or automatic failover
```
Risk: After failover, the new leader reissues IDs that were already assigned by the old leader but not yet replicated. The external system retains entries for the old IDs; the new leader's records point to different data.
Fix: Use UUIDs or application-generated globally unique IDs. Or, ensure the autoincrement sequence is advanced past the old leader's maximum before the new leader begins accepting writes.
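A first-pass scan for this anti-pattern can be a regular expression over schema files. The pattern below is a heuristic sketch with an invented helper name: it flags candidate lines only, and confirming the other two conditions (external systems sharing the IDs, asynchronous replication) still requires reading the code and configuration.

```python
import re

# Matches common counter-based key declarations in MySQL and PostgreSQL DDL.
AUTOINCREMENT_PATTERN = re.compile(
    r"\b(AUTO_INCREMENT|BIGSERIAL|SMALLSERIAL|SERIAL|NEXTVAL)\b",
    re.IGNORECASE,
)

def find_autoincrement_lines(sql_text):
    """Return (line_number, stripped_line) for each line declaring a counter-based key."""
    return [
        (number, line.strip())
        for number, line in enumerate(sql_text.splitlines(), start=1)
        if AUTOINCREMENT_PATTERN.search(line)
    ]

schema = """CREATE TABLE users (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  email VARCHAR(255)
);
CREATE TABLE sessions (
  id UUID PRIMARY KEY
);"""

hits = find_autoincrement_lines(schema)
```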
**Anti-pattern 3: No fencing / STONITH configured for leader failover**
```
# Look for: automatic failover configuration without a fencing mechanism
# e.g., Patroni without fencing, MHA without power fencing, manual failover runbooks
```
Risk: The old leader recovers and does not know it has been demoted. Both nodes accept writes. Split brain.
Fix: Configure a fencing mechanism. Test it periodically. See `distributed-failure-analyzer` for fencing token implementation details.
**Anti-pattern 4: Sloppy quorums enabled in a system requiring read freshness**
```
# Cassandra: read_repair_chance and dclocal_read_repair_chance < 1.0
# with no anti-entropy (nodetool repair) schedule
# Riak: sloppy quorums enabled (the default) with allow_mult = false (no sibling handling)
# Cassandra, Voldemort: sloppy quorums are off by default; check whether they were turned on
```
Risk: During and after network partitions, reads return stale data despite w + r > n being satisfied, because the w writes went to non-home nodes.
Fix: Either disable sloppy quorums (strict quorum mode) or implement application-layer awareness of hinted handoff status.
**Anti-pattern 5: No anti-entropy process, relying solely on read repair**
```
# Cassandra: nodetool repair not scheduled
# Voldemort: no anti-entropy configured
# Custom leaderless system: no background reconciliation
```
Risk: Values that are rarely read will diverge permanently across replicas. Read repair only runs when a value is actually read. Infrequently-read keys can remain stale indefinitely, violating durability guarantees.
Fix: Schedule regular anti-entropy runs. For Cassandra: `nodetool repair` on a weekly schedule (or more frequently for high-write workloads). Ensure the interval is shorter than the gc_grace_seconds (tombstone expiry period) to prevent deleted data from "coming back."
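The relationship between the repair interval and `gc_grace_seconds` can be checked mechanically. The helper name is an assumption; the default of 864000 seconds (10 days) is Cassandra's default for `gc_grace_seconds`, but a table may override it, so read the actual value from the schema.

```python
DEFAULT_GC_GRACE_SECONDS = 864_000  # Cassandra's default: 10 days
SECONDS_PER_DAY = 86_400

def repair_schedule_is_safe(repair_interval_seconds,
                            gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """Return True if every replica is repaired before tombstones can expire."""
    return repair_interval_seconds < gc_grace_seconds

weekly = repair_schedule_is_safe(7 * SECONDS_PER_DAY)
fortnightly = repair_schedule_is_safe(14 * SECONDS_PER_DAY)
```

A conservative audit would also account for how long a repair run takes, since a repair that starts within the window may not finish within it.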
---
### Step 5 — Produce the failure analysis report
Output a structured report with:
1. **Executive summary** — what failed, which failure class, operational impact.
2. **Symptom-to-mechanism mapping** — for each symptom: class, specific mechanism, confidence level (confirmed / inferred / possible).
3. **Immediate remediation** — ordered list of steps to stop the active failure.
4. **Anti-patterns found** — code or configuration locations (if available), description, risk.
5. **Long-term fixes** — architectural changes to prevent recurrence.
6. **Open questions** — what additional information (metrics, config values, code) would increase diagnostic confidence.
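The six sections can be captured in a small data structure so that every report has the same shape. This is one possible layout, not a required format; the class and field names are assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    symptom: str
    failure_class: str
    mechanism: str
    confidence: str  # "confirmed", "inferred", or "possible"

@dataclass
class FailureReport:
    summary: str
    findings: list = field(default_factory=list)
    immediate_steps: list = field(default_factory=list)
    anti_patterns: list = field(default_factory=list)
    long_term_fixes: list = field(default_factory=list)
    open_questions: list = field(default_factory=list)

    def to_markdown(self):
        lines = ["# Replication failure analysis", "", "## Executive summary", self.summary]
        lines += ["", "## Symptom-to-mechanism mapping"]
        for f in self.findings:
            lines.append(f"- {f.symptom}: {f.failure_class} / {f.mechanism} ({f.confidence})")
        sections = [
            ("Immediate remediation", self.immediate_steps),
            ("Anti-patterns found", self.anti_patterns),
            ("Long-term fixes", self.long_term_fixes),
            ("Open questions", self.open_questions),
        ]
        for title, items in sections:
            lines += ["", f"## {title}"]
            lines += [f"{i}. {item}" for i, item in enumerate(items, start=1)]
        return "\n".join(lines)

report = FailureReport(
    summary="Comments vanish on reload; reads are round-robin across followers.",
    findings=[Finding("comment disappears on refresh", "Class B",
                      "monotonic reads violation", "inferred")],
    immediate_steps=["Pin each user's reads to one follower"],
    open_questions=["What is the observed replication lag per follower?"],
)
markdown = report.to_markdown()
```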
---
## Common Misdiagnoses
**"The read is stale — this must be a replication bug."**
Stale reads in asynchronous replication are expected behavior, not a bug. The replication lag is working as designed. The issue is that the application assumed synchronous replication behavior but is running in asynchronous mode. The fix is application-level read routing, not replication reconfiguration.
**"We set w + r > n so our reads must be consistent."**
The quorum condition ensures overlap between write and read node sets under normal conditions. It does not guarantee freshness when: a sloppy quorum was used (writes went to non-home nodes), concurrent writes occurred with LWW resolution and clock skew, or a write partially succeeded. Quorums provide eventual consistency by default, not linearizability.
**"The failover succeeded — why are there duplicate key errors?"**
The promoted follower's autoincrement counter reflects the writes it received before failover. If the old leader had advanced its counter further (on writes not yet replicated), the new leader's counter is behind. When the new leader issues new IDs, it reuses IDs the old leader already assigned. This is especially dangerous when an external system (Redis, Elasticsearch) is keyed on these IDs — the external system retains entries for the old IDs, and the new leader's records now point to different data in the external system.
**"We have two leaders — one of them must be wrong."**
Both leaders may believe they are legitimate. The old leader did not receive the demotion message (it may have been partitioned when the new leader was elected, or the fencing mechanism failed to fire). The solution is not to query which one is "right" but to forcibly fence the old leader and then reconcile the writes it accepted during the split brain window.
**"Monotonic reads just means we need stronger consistency."**
Monotonic reads is a weaker guarantee than strong consistency. It only requires that a single user's reads do not observe an older state after having observed a newer state. It does not require that all users see the same state at the same time. Implementing it with sticky replica routing is significantly cheaper than requiring strong consistency across the cluster.
**"The quorum write failed, so the data wasn't written."**
A failed quorum write means fewer than w nodes acknowledged. But the nodes that did acknowledge are not rolled back. The write may be partially applied across some replicas. Subsequent reads may or may not return the partially-written value, depending on which replicas the read contacts. Applications that retry a failed write without making it idempotent can create inconsistencies.
---
## Examples
### Example 1: GitHub-style primary key conflict after MySQL failover
**Scenario:** A team runs MySQL with a single-leader replication topology. During maintenance, a follower is promoted to leader. Shortly after, users start seeing other users' private data — profile photos and messages belonging to a different account.
**Trigger:** Security incident report. Immediate investigation required.
**Process:**
1. Classify: failover occurred recently + cross-user data leakage → Class A, primary key conflict mechanism.
2. Mechanism: The old leader had issued autoincrement IDs for rows not yet replicated to the promoted follower. The new leader's `AUTO_INCREMENT` counter started below the old leader's maximum. The new leader reissued IDs already assigned by the old leader. A Redis cache was storing user profile data keyed on MySQL row IDs. Redis entries for the old IDs now returned different users' profile data because the new leader's rows with those IDs belong to different users.
3. Immediate remediation: (a) Take the service offline. (b) Query Redis for the highest user ID present. (c) Advance the new leader's `AUTO_INCREMENT` past that value. (d) Invalidate all Redis entries in the affected ID range. (e) Audit which users were served incorrect data and notify them.
4. Long-term fix: (a) Switch to UUID primary keys. (b) Decouple external system keys from database autoincrement IDs. (c) Add a post-failover checklist that explicitly includes advancing the autoincrement counter before opening traffic.
**Output:** Failure analysis report identifying primary key conflict as root cause. Immediate remediation steps. UUID migration plan. Updated failover runbook with autoincrement counter validation step.
---
### Example 2: Read-after-write and monotonic reads violations in a social feed
**Scenario:** A social application uses a single-leader MySQL setup with five followers. Reads are distributed across all followers via a round-robin load balancer. Users regularly report that comments they just posted do not appear when they reload the page. Occasionally, a comment that was visible disappears and then reappears.
**Trigger:** Support ticket volume on "my posts disappear" exceeds threshold. Product team requests investigation.
**Process:**
1. Classify: user's own writes not visible on reload → Class B, read-after-write violation. Comments appearing and disappearing → Class B, monotonic reads violation.
2. Mechanism for read-after-write: The write goes to the leader, but round-robin routing sends the reload to an arbitrary follower, which may be heavily lagged. That follower has not yet received the write the leader already acknowledged.
3. Mechanism for monotonic reads: The first read lands on a follower with low lag (comment visible). The reload lands on a follower with higher lag (comment not yet present). The user observes time moving backward.
4. Mitigation: (a) For the user's own content: route reads to the leader for one minute after any write from that user. Implement by storing `last_write_at` in the user session. After one minute, route to followers. (b) For monotonic reads: hash user ID to always route to the same follower. If that follower fails, reroute — accepting a brief monotonic reads violation during failover. (c) Scan the load balancer configuration to confirm round-robin routing is what is actually configured (vs. session-sticky or latency-aware routing).
**Output:** Failure analysis report. Read routing change specification. Session-layer implementation plan for `last_write_at` tracking. Load balancer reconfiguration recommendation.
---
### Example 3: Stale quorum reads after network partition in Cassandra
**Scenario:** A team runs Cassandra with n=3, w=2, r=2 (satisfying w + r > n). After a 20-minute network partition between two datacenters, quorum reads are returning values that are several minutes old. The partition healed 10 minutes ago.
**Trigger:** Monitoring alert showing read staleness exceeding acceptable threshold after a network event.
**Process:**
1. Classify: quorum reads stale after partition healing, despite w + r > n → Class C, sloppy quorum edge case.
2. Mechanism: During the partition, the datacenter-1 nodes were isolated from the three "home" nodes for affected keys. Cassandra's sloppy quorums (enabled by default in many configurations) allowed writes to be accepted by datacenter-1 nodes that are not in the home node set. After partition healing, hinted handoff is replaying those writes to the home nodes — but the process is not yet complete. Reads contacting the home nodes return the pre-partition values. w + r > n is satisfied in terms of node counts, but the r read nodes are not the same nodes that received the w sloppy-quorum writes.
3. Check: confirm that `nodetool tpstats` shows a non-empty hints queue. Monitor the queue draining rate.
4. Mitigation: (a) Wait for hinted handoff to complete (monitor `HintsService` metrics). (b) For values that must be current: force a read repair by issuing a `CONSISTENCY QUORUM` read or running `nodetool repair` on the affected keyspace. (c) Long-term: evaluate whether sloppy quorums provide acceptable trade-offs for this keyspace. For data requiring freshness guarantees, read and write at `LOCAL_QUORUM` (or `QUORUM`) so that only acknowledgements from the key's home replicas count toward w. Note that the keyspace option `durable_writes` only controls whether writes go through the commit log; it does not enable or disable sloppy quorums. For data tolerating eventual consistency, sloppy quorums increase availability and are appropriate.
**Output:** Failure analysis report. Hinted handoff monitoring procedure. Decision framework for which keyspaces should use strict vs. sloppy quorums. Updated runbook for post-partition recovery validation.
---
## References
- `references/failover-checklist.md` — step-by-step leader failover checklist: pre-failover verification, the four failure modes and their per-mode checks, post-failover validation steps, and rollback procedure
- `references/lag-anomaly-patterns.md` — complete replication lag anomaly reference: read-after-write, monotonic reads, and consistent prefix reads — each with formal definition, concrete example, implementation techniques, and cross-device complexity considerations
- `references/quorum-edge-cases.md` — the six quorum edge cases in detail: conditions that trigger each, detection signals, mitigation options, and the distinction between sloppy quorums (durability guarantee) and strict quorums (freshness guarantee)
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Designing Data-Intensive Applications by Martin Kleppmann.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-replication-strategy-selector`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/failover-checklist.md
# Leader Failover Checklist
Use this checklist when performing or recovering from a leader failover in a single-leader replication setup (PostgreSQL, MySQL, or any leader-follower topology).
---
## Pre-Failover Verification (Planned Failover)
Before initiating a planned failover (e.g., maintenance, upgrade):
1. **Identify the most up-to-date follower.**
- Check replication lag for each follower. Choose the one with the smallest lag.
- Ideal: zero lag (follower is fully caught up). If no follower is at zero lag, accept the one with the lowest lag and document the expected data loss window.
- PostgreSQL: `SELECT * FROM pg_stat_replication;` (on leader). Check `write_lag`, `flush_lag`, `replay_lag`.
- MySQL: `SHOW SLAVE STATUS\G` on each follower. Check `Seconds_Behind_Master`. (MySQL 8.0.22 and later: `SHOW REPLICA STATUS\G` and `Seconds_Behind_Source`.)
2. **Check the autoincrement / sequence counter on the old leader.**
- MySQL: `SHOW TABLE STATUS LIKE 'tablename'` → `Auto_increment` column.
- PostgreSQL: `SELECT last_value FROM tablename_id_seq;`
- Record this value. It will be needed to advance the new leader's counter after promotion.
3. **Check for any external systems keyed on primary IDs.**
- Identify Redis caches, Elasticsearch indices, audit logs, or secondary databases that store records keyed on the same IDs.
- List them. You will need to reconcile or invalidate these after failover.
4. **Verify fencing mechanism is operational.**
- Confirm that STONITH (Shoot The Other Node In The Head), power fencing, or network-level fencing is configured and can fire.
- Test: if this is a staging environment, trigger the fencing mechanism manually to confirm it works before you need it in production.
---
## During Failover
5. **Stop writes to the old leader** (planned failover only).
- Gracefully drain connections.
- Wait for replication lag to reach zero on the chosen follower.
6. **Promote the follower.**
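- PostgreSQL: `SELECT pg_promote();` (PostgreSQL 12+) or `pg_ctl promote`. On versions before 16, creating the configured trigger file (e.g., `touch /tmp/postgresql.trigger`) also works, provided `promote_trigger_file` (`trigger_file` before version 12) points at that path.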
- MySQL: `STOP SLAVE; RESET SLAVE ALL;` on the new leader.
- Patroni, MHA, Orchestrator: use their promotion commands rather than manual promotion.
7. **Reconfigure clients.**
- Update the write endpoint (connection string, load balancer, service discovery record) to point to the new leader.
- Existing connections to the old leader will fail. Applications must handle reconnection.
8. **Ensure the old leader becomes a follower** (if it recovers).
- Configure the old leader to replicate from the new leader when it comes back online.
- Do not allow the old leader to resume accepting writes.
---
## Post-Failover Validation
9. **Advance the new leader's autoincrement counter.**
- MySQL: `ALTER TABLE tablename AUTO_INCREMENT = <value above old leader's max>;`
- PostgreSQL: `SELECT setval('tablename_id_seq', <value above old leader's max>);`
- WHY: If the old leader had issued IDs that were not replicated, the new leader's counter may be behind the old leader's maximum. The new leader will reissue those IDs, causing conflicts in any external system that has already indexed them.
10. **Reconcile or invalidate external systems.**
- For each external system identified in step 3: invalidate any cached entries in the ID range between (new leader's starting counter) and (old leader's last known max).
- If the old leader is available: query it for the exact set of IDs it issued that were not replicated. Compare with the new leader.
11. **Verify write traffic is routing to the new leader only.**
- Check application logs. Look for write attempts to the old leader's address.
- Check the old leader's write count metric — it should be zero.
12. **Confirm no split brain.**
- Verify that the old leader (if it recovered) is in follower mode.
- Check its replication status — it should be replicating from the new leader, not accepting independent writes.
- If the old leader is accepting writes: immediately fence it (trigger STONITH or revoke its network write access). Then reconcile diverged writes.
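Steps 9 and 10 come down to simple arithmetic once both counters are known. The following is a minimal Python sketch, assuming integer IDs; the function name `plan_id_reconciliation` and the `safety_margin` parameter are hypothetical and not part of any database tooling.

```python
def plan_id_reconciliation(old_leader_last_id, new_leader_last_id, safety_margin=1000):
    """Return (next_counter_value, at_risk_ids).

    at_risk_ids covers IDs the old leader may have issued that the new leader
    never saw; external systems keyed on them should be invalidated.
    """
    if old_leader_last_id <= new_leader_last_id:
        # The new leader already saw every ID the old leader issued.
        return new_leader_last_id + 1, range(0)
    at_risk_ids = range(new_leader_last_id + 1, old_leader_last_id + 1)
    # safety_margin (an assumption) covers IDs issued after the last recorded value.
    return old_leader_last_id + safety_margin, at_risk_ids


next_value, at_risk = plan_id_reconciliation(
    old_leader_last_id=10_450, new_leader_last_id=10_400
)
print(f"set counter to {next_value}; invalidate IDs {at_risk[0]}..{at_risk[-1]}")
```

The returned counter value is what you would pass to `ALTER TABLE ... AUTO_INCREMENT` or `setval`; the ID range is the input to the cache and index invalidation in step 10.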
---
## The Four Failure Modes: Quick Reference
| Mode | Signal | Immediate action |
|---|---|---|
| Async data loss | Writes confirmed before failover are missing on new leader | Accept loss; roll back downstream effects; advance autoincrement; notify affected users |
| Primary key conflict | Duplicate key errors in application; cross-user data leakage in external systems | Take service offline; advance autoincrement counter past old leader's max; invalidate external system ID range |
| Split brain | Two nodes both reporting as leader; write divergence | Fence old leader immediately; reconcile diverged writes; audit data integrity |
| Timeout miscalibration | Repeated unnecessary failovers under load; or excessive unavailability before failover fires | Measure p99 latency under load; recalibrate timeout; add load-sensitive trigger guard |
---
## Rollback Procedure
If the failover cannot be completed cleanly (e.g., the promoted follower has corruption or the new leader cannot accept connections):
1. Fence the new leader (prevent it from accepting writes).
2. Restore the old leader to primary status.
3. Re-establish replication from the old leader to all followers.
4. Investigate the root cause before attempting another failover.
Do not attempt a second failover while the first failover's state is unresolved. You can end up with two concurrent leaders if the fencing from the first failover was never confirmed.
FILE:references/lag-anomaly-patterns.md
# Replication Lag Anomaly Patterns
Three named anomaly patterns arise when single-leader asynchronous replication is combined with read-scaling (reads routed to followers). Each has a formal definition, a concrete example, implementation techniques for the mitigation, and additional complexity notes for multi-device scenarios.
---
## Pattern 1: Read-After-Write Violation
**Also known as:** read-your-writes consistency violation.
**Formal definition:** A user makes a write, then issues a read. The read does not reflect the write. The user observes their own submission as "lost."
**Why it happens:** The write goes to the leader. The read is routed to a follower. If the follower has not yet received and applied the write (replication lag is nonzero), the follower returns the pre-write value. The user sees the old value, or nothing at all if the write created a new record.
**Concrete example:** A user edits their profile bio and clicks Save. They reload their profile page. Their new bio is not shown — the old bio is still displayed. From the user's perspective, the edit failed silently.
**Distinguishing characteristic:** The anomaly affects only the user who made the write, and only their own data. Other users' reads are unaffected (they never had the new value, so they are not surprised by its absence).
**Mitigation techniques:**
*Technique 1: Selective leader reads for own data.*
Route reads for data that the user themselves can modify directly to the leader. Route reads for other users' data to followers. This requires knowing at read time which data belongs to the requesting user.
- Example: user profile pages always read the current user's profile from the leader; other users' profiles are read from followers.
- Drawback: if the application is highly personalized and most data is potentially modifiable by the user, most reads must go to the leader, negating the read-scaling benefit.
*Technique 2: Time-window leader reads.*
For a fixed interval after the user's last write (e.g., one minute), route all reads from that user to the leader. After the interval, resume reading from followers.
- Requires storing `last_write_at` per user in session state.
- The interval should be longer than the expected maximum replication lag. If lag occasionally exceeds the interval, violations can still occur — this is a heuristic, not a guarantee.
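The time-window rule can be sketched in a few lines of Python. This is an illustration only: `choose_read_target`, the `session` dict, and the string return values are hypothetical names, and the 60-second window is the "one minute" from the text rather than a tuned value.

```python
import time

LEADER_READ_WINDOW_SECONDS = 60.0  # the "one minute" from the text


def choose_read_target(last_write_at, now=None, window=LEADER_READ_WINDOW_SECONDS):
    """Return "leader" while the user's last write is inside the window."""
    now = time.time() if now is None else now
    if last_write_at is not None and now - last_write_at < window:
        return "leader"
    return "follower"


session = {"last_write_at": 1_000.0}  # set by the write path (assumed)
print(choose_read_target(session["last_write_at"], now=1_030.0))
print(choose_read_target(session["last_write_at"], now=1_061.0))
```

Passing `now` explicitly keeps the function testable; in production the default `time.time()` would be used, with the caveat that the timestamp must come from a clock the routing layer and the write path agree on.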
*Technique 3: Replication position tracking.*
The write returns the replication log position (PostgreSQL: LSN; MySQL: binlog coordinates) at which the write was applied on the leader. The client stores this position. On subsequent reads, route to any follower whose replication position is at or beyond the stored position. If no such follower is available: either wait for one to catch up, or route to the leader.
- This provides a hard guarantee rather than a heuristic.
- Requires the load balancer or client driver to query each follower's current replication position before routing.
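- Support varies by driver and proxy, so verify it for your stack. Vitess can pin a session to the primary with the `@primary` target. PgBouncer is a connection pooler and does not route by replication position itself, so position-aware routing there has to live in the application or in a separate routing layer.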
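A routing decision based on replication position might look like the sketch below. It assumes positions have already been collected and converted to comparable integers (a real LSN or binlog coordinate needs parsing first); `route_read` and the follower names are hypothetical.

```python
def route_read(required_position, follower_positions, leader="leader"):
    """Pick a follower at or beyond required_position, else fall back to the leader.

    follower_positions maps follower name -> last applied position (int).
    """
    eligible = sorted(
        name
        for name, position in follower_positions.items()
        if position >= required_position
    )
    return eligible[0] if eligible else leader


positions = {"follower-1": 90, "follower-2": 120, "follower-3": 100}
print(route_read(100, positions))  # a follower that has applied the write
print(route_read(200, positions))  # nobody has caught up, so use the leader
```

Falling back to the leader is one of the two options named above; the other, waiting for a follower to catch up, trades read latency for leader load.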
**Cross-device complexity:** If the user submits data on one device (desktop) and reads it on another (mobile), techniques 2 and 3 may fail because the `last_write_at` timestamp (or the stored replication position) lives in the desktop session, which the mobile device cannot access. Centralizing this metadata in the user record (database or server-side session store) is required for correct cross-device read-after-write consistency.
---
## Pattern 2: Monotonic Reads Violation
**Formal definition:** A user makes multiple sequential reads. The second read returns a state that is older than the state returned by the first read. Time appears to move backward.
**Why it happens:** Sequential reads from the same user are routed to different replicas (e.g., random or round-robin routing). Replica A has low lag and returns the new value. Replica B has higher lag and has not yet received the write. The second read goes to Replica B and returns the old value.
**Concrete example:** A user refreshes a social media feed. The first refresh shows a comment posted by another user. The second refresh does not show the comment — the second request was routed to a more-lagged replica. The comment appears to have been deleted (but it was never deleted).
**Distinguishing characteristic:** The anomaly manifests as disappearing data on successive reads by the same user. It does not require the user to have made any writes. It is caused entirely by read routing hitting replicas with different lag.
**Mitigation technique:**
*Sticky replica routing:* Route all reads from a given user to the same replica. Choose the replica based on a hash of the user ID (or session ID) rather than randomly. This ensures that as long as the user's reads land on the same replica, the replica's state only moves forward — even if that replica is behind the leader, its local state is monotonically advancing.
```
replica_index = hash(user_id) % number_of_replicas
route_to = replicas[replica_index]
```
**Important caveat:** If the assigned replica fails, the user's reads must be rerouted. At the moment of rerouting, the new replica may be in a different lag state (possibly behind the failed replica's last-seen state). A brief monotonic reads violation is usually acceptable at this point: the failure case is typically handled by logging the user out and back in (resetting their read state), or by simply accepting the anomaly as a known rare occurrence.
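A runnable version of the sticky-routing pseudocode, including the failover rerouting described in the caveat, could look like this. It is a sketch under assumptions: `stable_hash` and `route_sticky_read` are hypothetical helpers, and SHA-256 is used only because Python's built-in `hash()` for strings is salted per process and would send the same user to different replicas after a restart.

```python
import hashlib


def stable_hash(user_id: str) -> int:
    """Process-independent hash of the user ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def route_sticky_read(user_id, replicas, healthy):
    """Return the user's assigned replica, or the next healthy one in list order."""
    if not replicas:
        raise ValueError("replica list is empty")
    start = stable_hash(user_id) % len(replicas)
    for offset in range(len(replicas)):
        candidate = replicas[(start + offset) % len(replicas)]
        if candidate in healthy:
            return candidate
    raise RuntimeError("no healthy replica available")


replicas = ["replica-0", "replica-1", "replica-2"]
primary_choice = route_sticky_read("user-123", replicas, healthy=set(replicas))
# Simulate the assigned replica failing: the read is rerouted deterministically.
fallback = route_sticky_read("user-123", replicas, healthy=set(replicas) - {primary_choice})
print(primary_choice, fallback)
```

Walking the list from the hashed slot keeps the fallback deterministic too, so a user is not bounced between several replicas while their assigned one is down.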
**Interaction with read-after-write:** If sticky routing assigns the user to a follower that has not yet received their own write, a read-after-write violation can occur even with sticky routing in place. Sticky routing solves monotonic reads but does not solve read-after-write. Both mitigations may be needed simultaneously. The read-after-write mitigation (leader reads after a write) overrides sticky routing for the relevant time window.
---
## Pattern 3: Consistent Prefix Reads Violation
**Formal definition:** A user reads a sequence of writes. The writes were causally ordered (write B depends on write A). The user sees write B but not write A — they see the effect before the cause.
**Why it happens:** In a partitioned (sharded) database, different partitions operate independently with no global write ordering. Write A goes to partition 1; write B (which depends on A) goes to partition 2. Partition 2 has low replication lag; partition 1 has high replication lag. The user's read contacts both partitions and sees partition 2's current state (write B present) but partition 1's stale state (write A not yet applied).
**Concrete example (from Kleppmann):** Mr. Poons asks: "How far into the future can you see?" Mrs. Cake replies: "About ten seconds usually." These are written to different partitions. An observer reads Mrs. Cake's reply (partition 2, low lag) before Mr. Poons' question (partition 1, high lag). The observer sees: "About ten seconds usually" followed by "How far into the future can you see?" — the answer precedes the question.
**Distinguishing characteristic:** The anomaly involves causally related writes that are written to different partitions. It is a problem of ordering, not of lag per se — even small lag differences can cause it if partitions diverge for even a moment.
**Mitigation techniques:**
*Technique 1: Causal co-location.*
Ensure that writes with causal dependencies go to the same partition. If the question and answer are always written to the same partition, each follower of that partition applies them in the order they were written.
- Practical implementation: use a consistent routing key for causally related writes (e.g., a conversation ID, a user ID, an entity ID). All writes for the same conversation go to the same partition.
- Limitation: does not work when causally related writes must span different users' data or different entity types.
*Technique 2: Causal dependency tracking.*
Track causal dependencies explicitly using version vectors or logical clocks. A read does not return a causally later value without also verifying that all its causal prerequisites are present.
- More complex to implement and requires database or middleware support.
- Some databases provide this as a built-in feature (for example, causally consistent sessions in MongoDB).
- References: version vectors (Riak), Lamport timestamps, hybrid logical clocks (CockroachDB).
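The read-side rule in technique 2 can be illustrated with explicit dependency lists, which are a deliberate simplification of version vectors or logical clocks. In this sketch `causally_visible` is a hypothetical helper and each write simply names the writes it depends on.

```python
def causally_visible(writes):
    """Return the IDs of fetched writes whose causal prerequisites are all present.

    writes maps write ID -> list of write IDs it depends on.
    """
    visible = set()
    progressed = True
    while progressed:
        progressed = False
        for write_id, depends_on in writes.items():
            if write_id not in visible and all(dep in visible for dep in depends_on):
                visible.add(write_id)
                progressed = True
    return visible


fetched = {"answer": ["question"]}   # the lagging partition has not delivered the question
hidden_case = causally_visible(fetched)
fetched["question"] = []             # the question arrives
complete_case = causally_visible(fetched)
print(hidden_case, sorted(complete_case))
```

The answer is withheld until the question is present, which is exactly the behavior that prevents the Mr. Poons and Mrs. Cake anomaly.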
*Technique 3: Single-partition writes for related data.*
Redesign the data model so that causally related data lives in the same partition by default. For the conversation example: store both the question and answer as records within the same conversation partition, rather than in different user partitions.
---
## Replication Lag Monitoring
Regardless of which anomaly mitigation is implemented, monitoring replication lag is essential:
- **Leader-based replication:** The database exposes per-replica lag metrics. PostgreSQL: `pg_stat_replication.replay_lag`. MySQL: `Seconds_Behind_Master` in `SHOW SLAVE STATUS`. These can be fed into a monitoring system (Prometheus, Datadog, CloudWatch).
- **Leaderless replication:** There is no single leader to compare against. Staleness must be inferred from version numbers or timestamps returned by reads. Some databases expose node-level metrics (Cassandra: `nodetool netstats`, `nodetool tpstats`), but a unified lag figure is not available the same way it is in leader-based systems.
Alert thresholds:
- Warning: lag > 5 seconds (investigate whether the follower is falling behind)
- Critical: lag > 60 seconds (all read-after-write time-window mitigations assume lag is short; long lag makes them unsafe)
- Emergency: lag growing continuously (follower cannot keep up; consider removing it from the read pool)
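The thresholds above can be expressed as a small classifier. This is a sketch: `classify_lag` is a hypothetical name, and "growing continuously" is interpreted here, as an assumption, to mean strictly increasing across the last five samples.

```python
WARNING_SECONDS = 5.0
CRITICAL_SECONDS = 60.0


def classify_lag(samples, min_trend_samples=5):
    """Classify recent lag readings (seconds, oldest first) into an alert level."""
    if not samples:
        raise ValueError("need at least one lag sample")
    recent = samples[-min_trend_samples:]
    growing = len(recent) == min_trend_samples and all(
        later > earlier for earlier, later in zip(recent, recent[1:])
    )
    if growing:
        return "emergency"
    if samples[-1] > CRITICAL_SECONDS:
        return "critical"
    if samples[-1] > WARNING_SECONDS:
        return "warning"
    return "ok"


print(classify_lag([1, 1, 7]))
print(classify_lag([10, 80, 70]))
print(classify_lag([1, 2, 3, 4, 5]))
```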
FILE:references/quorum-edge-cases.md
# Quorum Edge Cases
The quorum condition w + r > n is designed to guarantee that the read node set and the write node set overlap, ensuring at least one node in every read has the latest value. In practice, six scenarios break this guarantee even when the condition is mathematically satisfied.
Applies to leaderless (Dynamo-style) replication: Cassandra, Riak, Voldemort, DynamoDB, and custom Dynamo-inspired implementations.
---
## The Quorum Baseline
For n replicas:
- Every write must be confirmed by w nodes to be considered successful.
- Every read must query r nodes and return the most recent value seen.
- As long as w + r > n, at least one node in the r-node read set must also belong to the w-node write set.
Example: n=3, w=2, r=2. Any two 2-node subsets of a 3-node cluster share at least one member. If the write was confirmed by nodes {A, B}, any read contacting {A, C} or {B, C} or {A, B} will include at least one node that has the latest value.
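The overlap claim can be checked by brute force for small clusters. The sketch below enumerates every possible write set and read set; `quorums_always_overlap` is a hypothetical helper name.

```python
from itertools import combinations


def quorums_always_overlap(n, w, r):
    """True if every w-node write set intersects every r-node read set."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


print(quorums_always_overlap(3, 2, 2))  # w + r > n
print(quorums_always_overlap(5, 2, 3))  # w + r == n
```

With w + r equal to n, a write set and a read set can be exact complements of each other, which is why the inequality has to be strict.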
This guarantee holds under the assumptions: strict quorums, no concurrent writes, no partial failures, no node recovery from stale state, and no clock skew in conflict resolution. The six edge cases each violate one of these assumptions.
---
## Edge Case 1: Sloppy Quorum / Hinted Handoff in Progress
**Condition:** A network interruption isolated the client from the n "home" nodes for a key. The database accepted the write on w "non-home" nodes (sloppy quorum) to maintain write availability. After the partition heals, hinted handoff is replaying those writes to the home nodes — but the process is not yet complete.
**Why the guarantee breaks:** The w writes are on non-home nodes. The r reads go to the home nodes. Even though w + r > n is satisfied in terms of counts, there is no overlap between the write set (non-home nodes) and the read set (home nodes) until hinted handoff completes.
**Detection signal:** Recent network partition or node outage. Hinted handoff queue non-empty. Reads returning pre-partition values.
**Mitigation:**
- Wait for hinted handoff to complete before serving reads. Monitor `HintsService` metrics (Cassandra: `nodetool tpstats`; Riak: hints queue length).
- Force read repair: issue a `CONSISTENCY QUORUM` read or run `nodetool repair`.
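- Long-term: for keyspaces requiring freshness guarantees, disable sloppy quorums. For Cassandra: use strict consistency levels such as `QUORUM` or `LOCAL_QUORUM` for both reads and writes; note that `durable_writes` only controls commit-log usage and does not affect quorum behavior. For Riak: configure per-bucket quorum settings with `pw` (primary write quorum) rather than relying on sloppy handoff.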
**Key distinction:** A sloppy quorum is a durability guarantee — "w nodes somewhere in the cluster hold this write." It is not a freshness guarantee. Treating sloppy quorum confirmations as equivalent to strict quorum confirmations is the error.
---
## Edge Case 2: Concurrent Writes with Last-Write-Wins Resolution
**Condition:** Two clients write to the same key at approximately the same time. Last-write-wins (LWW) is the conflict resolution strategy. Clocks between nodes have skew.
**Why the guarantee breaks:** The quorum condition says nothing about which write wins when two writes are concurrent. The only safe solution is to merge concurrent writes (using version vectors or application-level conflict resolution). LWW picks a winner based on timestamp — but timestamps come from node clocks, which have skew. The write with the higher timestamp wins, even if it was causally earlier. The causally later write — which the application intended to be the final value — is silently discarded. No error is reported.
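A small simulation makes the silent loss concrete. The values here are invented for illustration: node X's clock is assumed to run 5 ms fast, and `lww_merge` is a hypothetical stand-in for the database's conflict resolution.

```python
def lww_merge(a, b):
    """Last-write-wins: keep whichever write carries the higher timestamp."""
    return a if a["ts"] >= b["ts"] else b


SKEW_X_MS = 5  # node X's clock runs 5 ms ahead of node Y's (assumed)

# Client 1 writes through node X at true time 1000 ms.
first = {"value": "v1", "ts": 1000 + SKEW_X_MS}
# Client 2 writes through node Y at true time 1003 ms, i.e. later in real time.
second = {"value": "v2", "ts": 1003}

winner = lww_merge(first, second)
print(winner["value"])
```

The earlier write wins because its timestamp is inflated by the skew, and the later write is dropped without any error reaching either client.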
**Detection signal:** Concurrent writes to the same key. LWW conflict resolution enabled (Cassandra default). Clock skew present between nodes. Missing writes with no error in the application.
**Mitigation:**
- Replace LWW with version vectors (Riak: dotted version vectors; custom systems: per-replica version numbers). Version vectors detect concurrent writes and preserve both versions as siblings, allowing application-level or CRDT-based merge.
- If LWW must be retained: use UUIDs as keys so each write has a unique key and concurrent writes to the same key cannot occur. This is the recommended Cassandra pattern for LWW safety.
- Monitor inter-node clock skew. Alert when skew exceeds the acceptable loss threshold (even 3ms skew can cause LWW data loss under high write concurrency).
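Detecting concurrency with version vectors reduces to a pairwise comparison of per-replica counters. The following sketch uses plain dicts; `compare_version_vectors` and the replica names are hypothetical, and a missing entry is treated as zero.

```python
def compare_version_vectors(a, b):
    """Classify the relationship between two version vectors (dict replica -> counter)."""
    replicas = set(a) | set(b)
    a_covers_b = all(a.get(r, 0) >= b.get(r, 0) for r in replicas)
    b_covers_a = all(b.get(r, 0) >= a.get(r, 0) for r in replicas)
    if a_covers_b and b_covers_a:
        return "equal"
    if a_covers_b:
        return "a_dominates"
    if b_covers_a:
        return "b_dominates"
    return "concurrent"  # neither saw the other: keep both as siblings


print(compare_version_vectors({"r1": 2, "r2": 1}, {"r1": 1, "r2": 1}))
print(compare_version_vectors({"r1": 2}, {"r2": 1}))
```

Only the `concurrent` result requires a merge; in the dominating cases the older version can be discarded safely, which is the distinction LWW cannot make.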
**Key distinction:** LWW achieves eventual convergence at the cost of durability. All w writes were durably stored; LWW then deliberately discards some of them. This is a design choice, not a quorum failure — but it violates the application's intuition that "confirmed writes survive."
---
## Edge Case 3: Write Concurrent with Read
**Condition:** A write is in-flight when a read is issued. The write has been applied to some replicas but not others. The read contacts r replicas.
**Why the guarantee breaks:** While the write is in flight, some of the r replicas may hold the new value and others the old one, so the read may return either. Successive reads can also disagree: one read may contact a replica that already has the new value, while a later read that contacts a different subset, none of which has received the write yet, still returns the old value. The system is in a transient inconsistent state during the in-flight write window.
**Detection signal:** Very high write concurrency. Intermittent stale reads on keys that are actively being written. Reads sometimes returning new values and sometimes old values non-deterministically.
**Mitigation:**
- For strong consistency requirements: use a consensus protocol (Raft, Paxos) rather than quorums. Consensus protocols serialize reads and writes and provide linearizability.
- For eventual consistency workloads: this edge case is inherent and expected. Ensure the application can tolerate reading either the old or new value during the write window.
- Read repair helps: if a read returns conflicting versions from different replicas, it writes the latest version back to the stale replicas, accelerating convergence.
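Read repair can be sketched with an in-memory model of the cluster. This is a simplification under stated assumptions: each value carries a single integer version instead of a version vector, and `quorum_read_with_repair` is a hypothetical function, not a database API.

```python
def quorum_read_with_repair(replicas, key, contacted):
    """Read key from the contacted replicas and write the newest version back to stale ones.

    replicas maps node -> {key: (version, value)}.
    """
    responses = {node: replicas[node].get(key) for node in contacted}
    known = [resp for resp in responses.values() if resp is not None]
    if not known:
        return None
    latest = max(known, key=lambda resp: resp[0])
    for node, resp in responses.items():
        if resp is None or resp[0] < latest[0]:
            replicas[node][key] = latest  # repair the stale replica
    return latest[1]


cluster = {
    "A": {"k": (2, "new")},
    "B": {"k": (1, "old")},
    "C": {"k": (1, "old")},
}
result = quorum_read_with_repair(cluster, "k", contacted=["A", "B"])
print(result, cluster["B"]["k"], cluster["C"]["k"])
```

Replica C was not contacted, so it stays stale; this is the gap that the anti-entropy process is meant to close.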
---
## Edge Case 4: Partial Write Success, No Rollback
**Condition:** A write succeeded on some replicas but failed on others (e.g., a replica's disk is full, or it was temporarily unreachable). The overall write returned an error to the client (fewer than w acknowledgements). The replicas that did succeed are not rolled back.
**Why the guarantee breaks:** The write was reported as failed, so the client may retry it. But the replicas that succeeded already have the new value. On the next read, those replicas may return the partially-written value. The state is now: some replicas at new value, some at old value, and the client does not know which is "correct."
**Detection signal:** Write errors under disk pressure or replica unavailability. Subsequent reads intermittently returning values the client believes were never written.
**Mitigation:**
- Design operations to be idempotent. A safe retry of a partially-applied write must produce the same result as a full write.
- If atomicity across all replicas is required: use a database that provides transactional writes with rollback, not an eventually consistent leaderless store.
- Use conditional writes (compare-and-swap) where available to make partial writes detectable: the write includes a precondition (expected version), and only succeeds if the precondition is met. If the write partially applied, the replicas that got it are at a new version; replicas that did not are at the old version. The next attempt can detect the split by checking versions.
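The interaction between a partial write and a conditional retry can be modeled in memory as follows. The `Replica` class and `versions_diverge` helper are hypothetical and only mimic compare-and-set semantics; they are not a client for any particular store.

```python
class Replica:
    """In-memory replica holding one versioned value."""

    def __init__(self):
        self.version = 0
        self.value = None

    def compare_and_set(self, expected_version, value):
        """Apply the write only if the replica is still at expected_version."""
        if self.version != expected_version:
            return False
        self.version += 1
        self.value = value
        return True


def versions_diverge(replicas):
    """True if the replicas are not all at the same version."""
    return len({replica.version for replica in replicas}) > 1


a, b = Replica(), Replica()
a.compare_and_set(0, "new")               # first attempt reaches A only; B was unreachable
split_detected = versions_diverge([a, b])
# Retry with the same precondition: a no-op on A, applied on B.
retry_results = [replica.compare_and_set(0, "new") for replica in (a, b)]
print(split_detected, retry_results, versions_diverge([a, b]))
```

Because the precondition rejects the retry on the replica that already applied the write, the retry is idempotent: every replica ends at version 1 with the same value.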
---
## Edge Case 5: Node Restored from Stale Replica
**Condition:** A node that held the latest value for a key fails. Its data is restored from a backup or from another replica that held an older value. After restoration, the number of replicas storing the latest value for that key falls below w.
**Why the guarantee breaks:** The quorum condition was satisfied when the write was made (w nodes confirmed). But the quorum is a snapshot in time — it is not a permanent guarantee. If nodes fail and are restored with stale data, the count of replicas holding the latest value can shrink below w, retroactively breaking the quorum guarantee for that write.
**Detection signal:** Node failure followed by restoration from backup. Reads intermittently returning values that should have been superseded. `nodetool repair` showing large amounts of data being resynchronized.
**Mitigation:**
- Before restoring a node from backup: run `nodetool repair` on the affected keyspace across all remaining healthy replicas first. This ensures the latest values are fully propagated before the restored node's stale data can contaminate reads.
- After restoring a node: do not route reads to it until a repair completes. Mark it as not eligible for read quorum until it has been resynchronized.
- Schedule regular anti-entropy runs (see below) to prevent silent value drift that makes restoration events worse.
---
## Edge Case 6: Timing Edge Cases at the Linearizability Boundary
**Condition:** No specific failure event has occurred. The system is operating normally with w + r > n fully satisfied. Yet under certain timing conditions, a read returns a stale value.
**Why the guarantee breaks:** Quorum reads provide eventual consistency by default; they are not linearizable. Linearizability requires that every read reflects the most recently completed write as of the moment the read is issued. Quorum reads do not guarantee this: there are race conditions where unlucky timing causes stale reads even when the quorum condition is met.
**Detection signal:** Intermittent stale reads under high concurrency with no concurrent writes or node failures. Reads appearing stale by only milliseconds. The anomaly is not reproducible consistently.
**Mitigation:**
- Accept that quorum reads provide eventual consistency, not linearizability. Design the application to tolerate occasional stale reads.
- If linearizability is genuinely required: route reads through a single leader (leaderless systems can designate one replica as the read coordinator for linearizable reads), or use a consensus protocol (Raft, Paxos). Stronger guarantees require transactions or consensus — see `transaction-isolation-selector` and `consistency-model-selector`.
---
## Anti-Entropy: The Missing Piece
All six edge cases are worsened by the absence of an anti-entropy background process.
**Read repair** (the default in most Dynamo-style databases): when a read detects that some replicas returned a stale value, the reading client writes the latest value back to the stale replicas. This only runs when a value is actually read. Infrequently-read keys can diverge permanently.
**Anti-entropy process**: a background process that compares replicas and propagates missing writes. Unlike read repair, it does not depend on reads being issued. Cassandra: `nodetool repair` (run on an operator-defined schedule rather than automatically). Riak: background anti-entropy (enabled by default). Voldemort: does not implement anti-entropy and relies solely on read repair.
**Without anti-entropy:** values that are rarely read may be missing from some replicas indefinitely. This reduces the effective durability below the theoretical guarantee (if only 1 of n replicas has a value for an infrequently-read key, that single replica's failure causes permanent data loss).
**Anti-entropy schedule recommendation:**
- Run `nodetool repair` on a schedule shorter than the gc_grace_seconds period (Cassandra default: 10 days).
- If gc_grace_seconds expires before repair runs, tombstones (deletion markers) are garbage-collected. If a replica that was offline during the original deletion comes back online after tombstone GC, it will "resurrect" the deleted data — deleted rows will reappear.
- Recommended: weekly repair at minimum. Daily repair for high-write workloads or large clusters.
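A schedule can be sanity-checked against `gc_grace_seconds` with a one-line comparison. In this sketch, `repair_schedule_is_safe` is a hypothetical helper, and including the repair's own run time in the check is an added assumption: a repair that starts in time but finishes after the grace period does not help.

```python
GC_GRACE_SECONDS_DEFAULT = 864_000  # Cassandra default: 10 days
DAY = 86_400


def repair_schedule_is_safe(interval_seconds, duration_seconds,
                            gc_grace_seconds=GC_GRACE_SECONDS_DEFAULT):
    """True if a repair started every interval finishes before tombstones are collected."""
    return interval_seconds + duration_seconds < gc_grace_seconds


print(repair_schedule_is_safe(7 * DAY, 1 * DAY))   # weekly repair taking a day
print(repair_schedule_is_safe(10 * DAY, 0))        # interval equal to the grace period
```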
---
## Quick Reference: Quorum Edge Cases
| Edge case | w + r > n satisfied? | Root cause | Mitigation |
|---|---|---|---|
| Sloppy quorum / hinted handoff | Yes (count-wise) | Write and read node sets don't overlap | Wait for hinted handoff; force read repair; consider strict quorums |
| LWW + concurrent writes | Yes | Conflict resolution discards valid writes | Use version vectors; use UUID keys; monitor clock skew |
| Write concurrent with read | Yes | Transient in-flight state | Design for eventual consistency; use consensus for linearizability |
| Partial write, no rollback | No (write failed) | Partial state left on some replicas | Idempotent operations; conditional writes; transactional DB if atomicity needed |
| Stale node restoration | Was yes, now no | Node count with latest value fell below w | Repair before routing reads to restored node; schedule regular anti-entropy |
| Timing at linearizability boundary | Yes | Quorums are not linearizable | Accept eventual consistency; use consensus for linearizable reads |
Select the right partitioning (sharding) strategy — range, hash, or compound key — and configure secondary indexes, rebalancing, and request routing for a di...
---
name: partitioning-strategy-advisor
description: |
Select the right partitioning (sharding) strategy — range, hash, or compound key — and configure secondary indexes, rebalancing, and request routing for a distributed database. Use when: designing a partition key for a new system; diagnosing write hotspots on monotonically increasing keys (timestamps, auto-increment IDs); evaluating whether an existing sharding scheme supports required query patterns; choosing between document-partitioned (local) vs. term-partitioned (global) secondary indexes and weighing scatter/gather read costs against global index write amplification; or selecting a rebalancing approach (fixed partitions, dynamic partitions, proportional-to-nodes) and routing topology (gossip, ZooKeeper coordination, partition-aware client). Covers Cassandra compound primary key patterns for range queries within hash-distributed partitions, HBase/SSTables range partitioning, Riak consistent hashing, and MongoDB/Elasticsearch index partitioning. Distinct from replication-strategy-selector (topology and consistency) and data-model-selector (schema design). Produces a concrete recommendation: partition key, partitioning method, secondary index approach, rebalancing configuration, and routing topology. Depends on data-model-selector for schema and access pattern context.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/designing-data-intensive-applications/skills/partitioning-strategy-advisor
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
- data-model-selector
source-books:
- id: designing-data-intensive-applications
title: "Designing Data-Intensive Applications"
authors: ["Martin Kleppmann"]
chapters: [6]
tags:
- partitioning
- sharding
- range-partitioning
- hash-partitioning
- compound-key
- hotspot
- skew
- secondary-index
- local-index
- global-index
- document-partitioned-index
- term-partitioned-index
- scatter-gather
- rebalancing
- dynamic-partitioning
- fixed-partitions
- proportional-partitioning
- consistent-hashing
- request-routing
- zookeeper
- gossip-protocol
- cassandra
- hbase
- mongodb
- riak
- elasticsearch
execution:
tier: 2
mode: hybrid
inputs:
- type: codebase
description: "Application codebase, docker-compose, database schema files, or architecture description that reveals data access patterns, query patterns, and write volume characteristics"
- type: document
description: "System requirements document or architecture description if no codebase is available"
tools-required: [Read, Write, Bash]
tools-optional: [Grep]
mcps-required: []
environment: "Run inside a project directory with codebase or configuration files. Falls back to document/description input."
discovery:
goal: "Produce a concrete partitioning recommendation: partition key, partitioning method, secondary index approach, rebalancing configuration, and request routing topology"
tasks:
- "Identify the primary data entity and its natural access key"
- "Classify the data distribution risk (monotonic keys, celebrity keys, uniform keys)"
- "Select partitioning method: range, hash, or compound"
- "Assess secondary index requirements and select local vs. global indexing"
- "Select rebalancing approach: fixed, dynamic, or proportional-to-nodes"
- "Select request routing approach: gossip, coordination service, or partition-aware client"
- "Document hotspot risks and mitigation strategies"
audience:
roles: ["backend-engineer", "software-architect", "data-engineer", "site-reliability-engineer", "tech-lead"]
experience: "intermediate-to-advanced — assumes experience with distributed databases and data modeling"
triggers:
- "User is designing a partitioning scheme for a new distributed database deployment"
- "User is experiencing write hotspots or uneven partition load"
- "User needs efficient range queries on a hash-partitioned database"
- "User is choosing between local and global secondary indexes"
- "User is configuring rebalancing for a growing cluster"
- "User is evaluating whether their sharding key causes skew"
- "User is deciding how to route requests to the correct partition"
not_for:
- "Choosing a replication topology — use replication-strategy-selector"
- "Selecting a storage engine or data model — use storage-engine-selector or data-model-selector"
- "Diagnosing distributed system failures — use distributed-failure-analyzer"
---
# Partitioning Strategy Advisor
## When to Use
You are designing or evaluating a distributed database where data must be spread across multiple nodes (sharding) and need to select a partitioning key, a partitioning method, secondary index approach, rebalancing configuration, and request routing topology.
This skill applies when the partitioning scheme is open (new system), problematic (existing system with hotspots or unsupported query patterns), or needs documented justification (architecture decision record, team alignment). It produces a concrete recommendation covering partition key choice, range vs. hash vs. compound method, secondary index strategy, rebalancing approach, and routing topology.
**Prerequisite check:**
- If you haven't selected a data model (relational, document, graph) for your system, run `data-model-selector` first — partitioning key design depends on the schema and access patterns that the data model determines.
- If you are also designing replication across nodes, run `replication-strategy-selector` after this skill — partitioning and replication are largely independent choices, but the replication factor affects partition sizing.
- If you need to diagnose hotspots or routing failures in a live production system, use `distributed-failure-analyzer` for root-cause analysis after using this skill to select the corrected strategy.
---
## Context & Input Gathering
### Required Context (must have — ask if missing)
**1. Primary access pattern — how records are most frequently looked up**
Why: The access pattern determines what the partition key must be. If records are always looked up by a single primary key (user ID, order ID), hashing that key achieves even distribution. If records are frequently fetched in sorted ranges (sensor readings by time, events by date), range partitioning on that key enables efficient range scans. Choosing the wrong key forces scatter/gather reads or hot single partitions on every query.
- Check prompt for: "look up by", "query by", "fetch all records for", "time-series", "range query", "primary key is"
- Check environment for: schema.sql (PRIMARY KEY, index definitions), application code (WHERE clauses, query methods), architecture.md (data access descriptions)
- If still missing, ask: "How does your application most frequently look up records? By a single ID, by a range of values (e.g., date range), or by filtering on secondary attributes (e.g., color, status)?"
**2. Write distribution characteristics — is the natural key monotonically increasing?**
Why: Monotonically increasing keys (timestamps, auto-increment IDs, sequential order numbers) create sequential write patterns that concentrate all writes on the latest partition in range partitioning. This is the most common hotspot source. Even under hash partitioning, extremely popular individual keys (celebrity users, viral content IDs) can overwhelm a single partition. Understanding write distribution is required before selecting a partition key.
- Check prompt for: "timestamp", "created_at", "auto-increment", "sequential ID", "time-series writes", "celebrity", "viral", "high-traffic key"
- Check environment for: schema.sql (SERIAL, AUTO_INCREMENT, TIMESTAMP columns as primary keys), application code (INSERT patterns, event ingestion pipelines)
- If still missing, ask: "Does your most common write operation use a key that increases sequentially (like a timestamp or auto-increment ID)? And are any individual records likely to receive dramatically more writes than others (e.g., a popular user or trending item)?"
**3. Secondary query requirements — does the application filter on non-primary-key attributes?**
Why: Secondary indexes on partitioned data require a choice between document-partitioned (local) and term-partitioned (global) indexes. This choice forces a fundamental trade-off: local indexes are cheap to write (only the local partition is updated) but expensive to read (scatter/gather across all partitions). Global indexes are cheap to read (single partition lookup) but expensive to write (updates span multiple partition index entries, often asynchronously). You must know which queries are required before designing the index strategy.
- Check prompt for: "filter by", "search by color", "find all users where", "full-text search", "secondary index", "non-primary attribute lookup"
- Check environment for: schema.sql (CREATE INDEX, secondary index definitions), application code (multi-attribute WHERE clauses), architecture.md (search or filter requirements)
- If still missing, ask: "Does your application need to query records by attributes other than the primary key — for example, filtering orders by status, searching products by category, or looking up users by email?"
**4. Data volume and growth trajectory**
Why: Affects rebalancing strategy selection. If total data volume is small and known, a fixed number of partitions sized correctly upfront is simplest. If data will grow significantly over time, dynamic partitioning (split/merge) adapts partition count automatically. If the cluster will scale out by adding nodes, proportional-to-nodes partitioning keeps partition size stable. Choosing a fixed partition count that's too small locks out future growth; too large creates excessive overhead.
- Check prompt for: "expected data size", "TB", "GB", "millions of records", "growing fast", "scaling out", "adding nodes"
- Check environment for: docker-compose (volume mounts, sizing comments), requirements.md (capacity planning), architecture.md (growth estimates)
- If still missing, ask: "Roughly how much data will this system store now, and how is it expected to grow over the next 1–2 years? Will you be adding database nodes as it grows?"
**5. Target database**
Why: Different databases implement partitioning in fundamentally different ways: HBase and RethinkDB use dynamic key-range partitioning; Cassandra uses hash partitioning with proportional-to-nodes rebalancing; Riak and Elasticsearch use hash partitioning with a fixed number of partitions; MongoDB supports both key-range and hash partitioning. The recommendation must fit what the target database actually supports. In particular, the compound primary key pattern in Step 2 (hashed partition key plus sorted clustering columns) follows Cassandra's data model and may not be available in other databases.
- Check prompt for: database names (Cassandra, HBase, MongoDB, DynamoDB, Riak, Elasticsearch, PostgreSQL with Citus, etc.)
- Check environment for: docker-compose (image names), requirements.md (technology constraints), package files (database drivers)
- If still missing, ask: "Which database are you partitioning? (e.g., Cassandra, HBase, MongoDB, DynamoDB, Elasticsearch, or a custom partitioning layer on top of another store)"
### Optional Context (enriches recommendation)
- **Consistency requirements under rebalancing:** If the system requires zero-downtime rebalancing with continued read/write availability, manual-approval rebalancing (Couchbase, Riak, Voldemort model) is safer than fully automatic.
- **Request routing constraints:** If the system cannot depend on an external coordination service like ZooKeeper, gossip-based routing (Cassandra, Riak) or partition-aware client libraries are required.
- **Read/write ratio:** Heavily read-dominated workloads benefit more from global (term-partitioned) secondary indexes; heavily write-dominated workloads benefit from local (document-partitioned) secondary indexes.
---
## Process
### Step 1 — Classify Data Distribution Risk
Categorize the workload's skew profile before selecting any partitioning method.
**Why:** Partitioning is only useful if load is evenly distributed. A poorly chosen partition key can negate all the benefits of horizontal scaling by concentrating all reads or writes on one node (a hotspot). Identifying skew risk upfront determines whether extra mitigation (key prefixing, salting, compound keys) is needed on top of the base partitioning method.
Classify into one of three profiles:
| Profile | Signal | Risk |
|---|---|---|
| **Monotonic key** | Timestamps, sequential IDs, auto-increment PKs | All writes go to the single "latest" partition under range partitioning |
| **Celebrity key** | A small number of keys receive orders of magnitude more traffic (viral content, celebrity users) | Single partition overloaded even under hash partitioning |
| **Uniform key** | Random UUIDs, user IDs with no natural ordering, hashed values | Low hotspot risk; both range and hash partitioning are viable |
**Mitigation for monotonic keys (range partitioning):** Prefix the timestamp with a high-cardinality dimension (e.g., sensor name + timestamp, or user ID + timestamp). This distributes writes across the key space while preserving range-scan ability within each prefix bucket.
**Mitigation for celebrity keys (hash partitioning):** Append a two-digit random suffix (00–99) to the hot key at write time. This splits writes across 100 distinct keys, which hash to different partitions. Reads must then query all 100 suffix variants and merge the results, so track which keys are "hot" in a separate lookup table and apply the suffix only where needed.
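The celebrity-key mitigation can be sketched as follows. This is a toy in-memory counter store, not any database's API: `SaltedCounterStore`, `salted_key`, and the `#NN` suffix format are hypothetical names chosen for illustration.

```python
import random
from collections import defaultdict

SALT_BUCKETS = 100  # two-digit suffix 00-99

def salted_key(key: str, rng: random.Random) -> str:
    """Write path: append a random two-digit suffix to the key."""
    return f"{key}#{rng.randrange(SALT_BUCKETS):02d}"

def all_salted_keys(key: str) -> list[str]:
    """Read path: every suffix variant of a hot key."""
    return [f"{key}#{i:02d}" for i in range(SALT_BUCKETS)]

class SaltedCounterStore:
    """Toy key-value store of counters; only known hot keys are salted."""

    def __init__(self, hot_keys: set[str], seed: int = 0):
        self.hot_keys = hot_keys        # the separate "hot key" lookup table
        self.data = defaultdict(int)
        self.rng = random.Random(seed)

    def increment(self, key: str) -> None:
        target = salted_key(key, self.rng) if key in self.hot_keys else key
        self.data[target] += 1

    def read(self, key: str) -> int:
        if key in self.hot_keys:
            # Gather all 100 variants and merge (here: sum the counters).
            return sum(self.data.get(k, 0) for k in all_salted_keys(key))
        return self.data.get(key, 0)

store = SaltedCounterStore(hot_keys={"celebrity"})
for _ in range(1000):
    store.increment("celebrity")
store.increment("regular_user")
```

The read path is 100 times more expensive for salted keys, which is why the sketch salts only keys listed in `hot_keys`.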
### Step 2 — Select Partitioning Method
Choose range partitioning, hash partitioning, or compound key (hybrid) based on access patterns and skew profile.
**Why:** This is the central decision. Range partitioning preserves sort order, enabling efficient range scans but requiring careful key design to avoid hotspots. Hash partitioning destroys sort order (making range queries require scatter/gather) but distributes load evenly. Compound keys (Cassandra model) combine both — hashing on the first part distributes across partitions, sorting within the partition on the remaining parts supports range scans within a single partition.
**Decision framework:**
```
PRIMARY ACCESS PATTERN
├── Range queries are critical (fetch records within a time window,
│ scan sorted ranges, support range-based pagination)
│ ├── Key is monotonically increasing → RANGE with prefix mitigation
│ │ Example: (sensor_name, timestamp) compound range key
│ └── Key has natural, non-monotonic distribution → RANGE partitioning
│ Example: alphabetically distributed last names (encyclopedia volumes)
│
├── Point lookups dominate (fetch a single record by ID) AND
│ range queries are not needed
│ → HASH partitioning
│ Example: user profiles by user_id, order records by order_id
│
└── Point lookups on partition key AND range queries within that key
→ COMPOUND KEY (hash partition key + range sort key)
Example: (user_id [hash], update_timestamp [range]) in Cassandra
→ Efficiently fetch all updates for a user, sorted by time,
in a single partition lookup
```
**Range partitioning:** Used by Bigtable, HBase, RethinkDB, MongoDB (pre-2.4). Partition boundaries must adapt to data distribution — automatic boundary management (dynamic partitioning) is strongly preferred over manually configured boundaries to avoid under/over-filled partitions.
**Hash partitioning:** Used by Cassandra, Riak, Voldemort, MongoDB (hash mode). The hash function must be stable across processes — language-native hash functions (Java's `Object.hashCode()`, Ruby's `Object#hash`) are not safe for this purpose because they can return different values in different processes. Use a purpose-built stable function: Cassandra and MongoDB use MD5, Voldemort uses Fowler-Noll-Vo. Assign each partition a range of hash values (not `hash mod N` — see rebalancing step).
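A minimal sketch of assigning partitions by hash range rather than `hash mod N`, assuming MD5 truncated to 32 bits as the stable hash. The function names and the evenly spaced boundaries are illustrative assumptions; real databases choose boundaries and hash widths differently.

```python
import hashlib
from bisect import bisect_right

HASH_SPACE = 2 ** 32

def stable_hash(key: str) -> int:
    """First 4 bytes of the MD5 digest as an unsigned 32-bit integer."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def make_boundaries(num_partitions: int) -> list[int]:
    """Inclusive lower bound of each evenly sized hash range."""
    return [i * HASH_SPACE // num_partitions for i in range(num_partitions)]

def partition_for(key: str, boundaries: list[int]) -> int:
    """Index of the hash range that contains the key's hash."""
    return bisect_right(boundaries, stable_hash(key)) - 1

boundaries = make_boundaries(4)
print(partition_for("user:42", boundaries))
```

Because the digest depends only on the key bytes, every process computes the same partition for the same key.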
**Compound key (Cassandra):** The partition key (first component) is hashed to determine the partition. The clustering columns (remaining components) determine sort order within the partition. A query that fixes the partition key can perform an efficient range scan over the clustering columns without touching other partitions. This is the primary pattern for one-to-many relationships in Cassandra: `(user_id, post_timestamp)` stores all posts for a user in one partition, sorted by time.
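The compound key behaviour can be imitated in a few lines of Python. This is a toy model only: `CompoundKeyTable` is a hypothetical class, MD5 stands in for the database's partitioner, and a sorted list stands in for on-disk clustering order.

```python
import hashlib
from bisect import bisect_left, insort
from collections import defaultdict

NUM_PARTITIONS = 8  # hypothetical cluster size

def partition_of(partition_key: str) -> int:
    """Map the hashed partition key onto one of NUM_PARTITIONS hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return (int.from_bytes(digest[:4], "big") * NUM_PARTITIONS) >> 32

class CompoundKeyTable:
    def __init__(self):
        # One dict per partition: partition_key -> rows sorted by clustering key.
        self.partitions = [defaultdict(list) for _ in range(NUM_PARTITIONS)]

    def insert(self, user_id: str, ts: int, value: str) -> None:
        rows = self.partitions[partition_of(user_id)][user_id]
        insort(rows, (ts, value))  # keep rows ordered by timestamp

    def range_scan(self, user_id: str, start_ts: int, end_ts: int):
        """Rows with start_ts <= ts < end_ts, read from a single partition."""
        rows = self.partitions[partition_of(user_id)].get(user_id, [])
        lo = bisect_left(rows, (start_ts,))
        hi = bisect_left(rows, (end_ts,))
        return rows[lo:hi]

table = CompoundKeyTable()
for ts, text in [(10, "a"), (30, "c"), (20, "b")]:
    table.insert("alice", ts, text)
table.insert("bob", 15, "x")
print(table.range_scan("alice", 15, 31))
```

The scan never consults any partition other than the one that `user_id` hashes to, which is the property the compound key is chosen for.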
### Step 3 — Design the Secondary Index Strategy
If the application requires queries on non-primary-key attributes, choose between document-partitioned (local) and term-partitioned (global) secondary indexes.
**Why:** Secondary indexes do not map cleanly to partitions. The choice of local vs. global index makes a fundamental trade-off between write cost and read cost that cannot be undone without a full reindex. Making this decision explicitly — and documenting it — prevents later surprises when a secondary index query turns out to require scatter/gather across every partition in the cluster.
**Document-partitioned (local) index:**
- Each partition maintains its own secondary index covering only the documents in that partition.
- Write: only the partition being written must be updated. Fast, local, no cross-partition coordination.
- Read: a secondary index query must be sent to all partitions and results merged (scatter/gather). Tail latency amplification — the query takes as long as the slowest partition.
- Used by: MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, VoltDB.
- Choose when: writes are frequent and latency-sensitive; slower secondary index reads are acceptable; or secondary index queries usually target a single partition (because the queried attribute correlates with the partition key).
**Term-partitioned (global) index:**
- A single global index covers all partitions, but is itself partitioned (by the indexed term or a hash of the term).
- Read: a secondary index query hits a single index partition — fast, no scatter/gather.
- Write: a single document write may need to update index entries spread across multiple index partitions. Updates are typically asynchronous (DynamoDB global secondary indexes update within a fraction of a second normally, but with potential delays under faults).
- Used by: Amazon DynamoDB (global secondary indexes), Riak's search feature, Oracle data warehouse.
- Choose when: secondary index read frequency and latency are critical; writes are less frequent or can tolerate asynchronous index propagation; consistency requirements can accept eventually-consistent secondary indexes.
**Index partition boundary choice (for global indexes):**
- Range-partitioned term index: supports range scans on the indexed attribute (e.g., price ranges).
- Hash-partitioned term index: more uniform load distribution, no range scan support on the index.
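The write/read trade-off between the two index types can be made concrete with a toy model that reports which partitions each operation touches. Everything here is an illustrative assumption: two partitions, documents range-partitioned by ID, and a global index range-partitioned by the first letter of the term.

```python
from collections import defaultdict

NUM_PARTITIONS = 2

def doc_partition(doc_id: int) -> int:
    """Documents are range-partitioned by ID: 0-499 -> 0, 500 and up -> 1."""
    return 0 if doc_id < 500 else 1

def term_partition(term: str) -> int:
    """Global index is range-partitioned by term: a-r -> 0, s-z -> 1."""
    return 0 if term[0].lower() <= "r" else 1

class LocalIndex:
    """Document-partitioned: each partition indexes only its own documents."""
    def __init__(self):
        self.index = [defaultdict(set) for _ in range(NUM_PARTITIONS)]

    def write(self, doc_id, terms):
        p = doc_partition(doc_id)
        for term in terms:
            self.index[p][term].add(doc_id)
        return {p}  # partitions touched by the write

    def query(self, term):
        touched = set(range(NUM_PARTITIONS))  # scatter to every partition
        result = set()
        for p in touched:
            result |= self.index[p].get(term, set())  # gather and merge
        return result, touched

class GlobalIndex:
    """Term-partitioned: each index partition covers all documents for its terms."""
    def __init__(self):
        self.index = [defaultdict(set) for _ in range(NUM_PARTITIONS)]

    def write(self, doc_id, terms):
        touched = set()
        for term in terms:
            p = term_partition(term)
            self.index[p][term].add(doc_id)
            touched.add(p)
        return touched

    def query(self, term):
        p = term_partition(term)
        return set(self.index[p].get(term, set())), {p}

local, global_ = LocalIndex(), GlobalIndex()
local_write = local.write(191, ["red", "sedan"])
global_write = global_.write(191, ["red", "sedan"])
local.write(768, ["red", "suv"])
global_.write(768, ["red", "suv"])
```

In this model a write touches one partition under the local index and two under the global index, while a query for `"red"` touches two partitions locally and one globally.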
### Step 4 — Select Rebalancing Approach
Choose fixed-partition, dynamic, or proportional-to-nodes rebalancing.
**Why:** As data grows and nodes are added or removed, partitions must be redistributed. The rebalancing strategy determines how disruptive this is, how much data moves, and how much operational overhead it requires. Using `hash mod N` — the naive approach — is explicitly wrong: it causes the vast majority of keys to move when N changes, making rebalancing catastrophically expensive.
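To see why `hash mod N` is so costly, the sketch below counts how many keys change node when a cluster grows from 10 to 11 nodes. For uniformly distributed hashes, a key stays put only when `h mod 10 == h mod 11`, which holds for 10 of every 110 hash values, so about 10/11 (roughly 91%) of keys move. The key names and the MD5-based `stable_hash` helper are assumptions for illustration.

```python
import hashlib

def stable_hash(key: str) -> int:
    """First 8 bytes of the MD5 digest as an unsigned integer."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")

def fraction_moved(keys, old_n: int, new_n: int) -> float:
    """Share of keys whose `hash mod N` node changes when N changes."""
    moved = sum(
        1 for k in keys if stable_hash(k) % old_n != stable_hash(k) % new_n
    )
    return moved / len(keys)

keys = [f"user:{i}" for i in range(10_000)]
print(f"{fraction_moved(keys, 10, 11):.1%} of keys change node")
```

The rebalancing approaches below avoid this by moving whole partitions between nodes while leaving the key-to-partition mapping unchanged.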
**Fixed number of partitions:**
- Create many more partitions than nodes at setup (e.g., 1,000 partitions for a 10-node cluster = ~100 partitions/node).
- When a node is added, it steals a few partitions from every existing node. Only entire partitions move; no keys are remapped.
- Partition count is fixed at creation time. Choose it high enough to accommodate maximum expected cluster size.
- Operationally simple. Used by Riak, Elasticsearch, Couchbase, Voldemort.
- Risk: if data size is highly variable, partition size grows proportionally to total data, making very large partitions expensive to rebalance and recover from failure.
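A minimal sketch of fixed-partition rebalancing, assuming 1,000 partitions on 10 nodes as in the example above. The function names, node names, and the "take from the most loaded node" policy are hypothetical; real systems use their own placement logic.

```python
from collections import Counter

def assign_initial(num_partitions: int, nodes: list[str]) -> dict[int, str]:
    """Round-robin assignment of partition index -> node name."""
    return {p: nodes[p % len(nodes)] for p in range(num_partitions)}

def add_node(assignment: dict[int, str], new_node: str):
    """Move whole partitions to new_node until it holds its fair share."""
    assignment = dict(assignment)
    node_count = len(set(assignment.values())) + 1
    target = len(assignment) // node_count
    moved = []
    while len(moved) < target:
        load = Counter(assignment.values())
        # Steal from whichever existing node currently holds the most partitions.
        donor = max((n for n in load if n != new_node), key=lambda n: load[n])
        partition = next(p for p, n in assignment.items() if n == donor)
        assignment[partition] = new_node
        moved.append(partition)
    return assignment, moved

before = assign_initial(1000, [f"node{i}" for i in range(10)])
after, moved = add_node(before, "node10")
print(len(moved), "of", len(before), "partitions moved")
```

Only the stolen partitions change owner; every key still belongs to the same partition as before, so nothing is rehashed.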
**Dynamic partitioning:**
- Partitions split when they exceed a configured size threshold (HBase default: 10 GB); partitions merge when they shrink below a lower threshold.
- Partition count adapts to data volume: small datasets use few partitions (low overhead), large datasets use many.
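- An empty database starts with a single partition, so all writes hit one node until the first split. Configure an initial set of partitions (pre-splitting) to avoid this; with key-range partitioning, pre-splitting requires knowing the expected key distribution in advance.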
- Used by HBase, RethinkDB. MongoDB supports both dynamic and fixed partitioning.
- Choose when: data volume is highly variable or expected to grow significantly; range-partitioned databases where fixed boundaries would be very wrong.
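The split half of dynamic partitioning can be sketched as below. This is a toy model under stated assumptions: a key count stands in for the byte-size threshold, merging of shrunken partitions is omitted, and `DynamicRangePartitions` is a hypothetical class.

```python
from bisect import bisect_right, insort

class DynamicRangePartitions:
    def __init__(self, max_keys_per_partition: int):
        self.max_keys = max_keys_per_partition
        self.lower_bounds = [""]  # partition i holds keys >= lower_bounds[i]
        self.partitions = [[]]    # sorted keys held by each partition

    def _locate(self, key: str) -> int:
        """Index of the partition whose key range contains `key`."""
        return bisect_right(self.lower_bounds, key) - 1

    def insert(self, key: str) -> None:
        i = self._locate(key)
        insort(self.partitions[i], key)
        if len(self.partitions[i]) > self.max_keys:
            self._split(i)

    def _split(self, i: int) -> None:
        """Split partition i at its median key into two adjacent partitions."""
        keys = self.partitions[i]
        mid = len(keys) // 2
        left, right = keys[:mid], keys[mid:]
        self.partitions[i:i + 1] = [left, right]
        self.lower_bounds.insert(i + 1, right[0])

table = DynamicRangePartitions(max_keys_per_partition=4)
all_keys = [f"k{n:02d}" for n in range(20)]
for key in all_keys:
    table.insert(key)
print(len(table.partitions), "partitions:", table.lower_bounds)
```

The keys here are inserted in increasing order, so every insert lands in the last partition: the same monotonic-key hotspot described in Step 1, which dynamic splitting alone does not remove.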
**Proportional to nodes:**
- Fixed number of partitions per node (Cassandra default: 256 per node).
- When a node is added, it randomly chooses a fixed number of existing partitions to split and takes ownership of one half of each, leaving the other half in place.
- Partition size remains stable as the cluster grows because adding nodes also increases partition count.
- Used by Cassandra and Ketama. Requires hash-based partitioning (partition boundaries are drawn from the hash space).
- Choose when: using Cassandra or a gossip-based hash-partitioned system; cluster will scale out by adding nodes.
**Automatic vs. manual rebalancing:**
- Fully automatic: the system decides when and how to move partitions, without operator intervention. More convenient but unpredictable — a node that is temporarily slow may be misidentified as dead, triggering rebalancing that adds load to the already-overloaded node, potentially causing cascading failure.
- Semi-automatic (recommended for production): the system generates a suggested rebalancing plan, but an operator must approve it before it executes. Used by Couchbase, Riak, Voldemort.
- Fully manual: an administrator configures partition assignment explicitly.
- **Recommendation:** Use semi-automatic rebalancing in production. The human-in-the-loop prevents rebalancing storms triggered by false-positive failure detection.
### Step 5 — Select Request Routing Approach
Choose how clients discover which node owns a given partition.
**Why:** As partitions rebalance, the mapping from partition to node changes. Clients or proxies must track these changes to route requests to the correct node. Using stale routing metadata results in forwarded requests (extra latency), re-tried requests, or errors. The routing approach must match the database's coordination model.
Three options:
**Option 1 — Any-node with forwarding (gossip-based):**
- Clients send requests to any node. If that node owns the partition, it handles the request; otherwise it forwards to the correct node.
- Nodes disseminate routing metadata via gossip protocol — no external coordination service required.
- Used by Cassandra and Riak.
- Trade-off: adds complexity to database nodes; routing metadata may be slightly stale (eventual convergence). Good for systems that must avoid a single-point-of-failure coordination service.
**Option 2 — Routing tier (coordination service):**
- A dedicated routing layer (partition-aware load balancer) receives all requests, consults a coordination service (ZooKeeper, etcd), and forwards to the correct node.
- ZooKeeper maintains the authoritative partition-to-node mapping. Nodes register themselves; the routing tier subscribes to changes.
- Used by HBase, SolrCloud, Kafka (ZooKeeper), LinkedIn Espresso (Helix+ZooKeeper).
- Trade-off: introduces ZooKeeper as an operational dependency; provides strong consistency for routing metadata. Best for systems with complex multi-partition query routing.
**Option 3 — Partition-aware client:**
- The client library maintains a local copy of the partition-to-node mapping and connects directly to the correct node.
- Requires a mechanism to learn about partition changes — usually subscribing to ZooKeeper, or refreshing from a config server.
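- Used by: Cassandra drivers in "token-aware" mode. MongoDB's `mongos` daemon, by contrast, is a routing tier (Option 2) backed by MongoDB's own config servers rather than ZooKeeper.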
- Trade-off: pushes routing logic into each client; reduces hop count for simple key lookups.
---
## What Can Go Wrong
**Hotspot from monotonically increasing write key**
The most common partitioning failure. A table keyed on `created_at`, `id SERIAL`, or any auto-increment column with range partitioning sends 100% of writes to the partition covering the current moment. The newest partition is overloaded while all others sit idle.
Fix: Switch to compound key with a high-cardinality prefix, or switch to hash partitioning if range queries are not needed. For time-series with range-query requirements, use `(source_id, timestamp)` compound range key.
**Scatter/gather latency on secondary index queries**
Document-partitioned (local) secondary indexes require querying all partitions. With 100 partitions, the query waits for the slowest of 100 responses (tail latency amplification), so secondary index queries that were fast on a single node can become substantially slower after partitioning.
Fix: If secondary index reads are frequent and latency-sensitive, switch to a global (term-partitioned) index and accept asynchronous write propagation. Alternatively, structure the primary partition key to co-locate records that will be queried together (e.g., partition by tenant_id if secondary queries are always within a tenant).
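A short worked calculation shows how quickly tail latency amplifies. Assuming each partition independently has a 1% chance of responding slowly (an illustrative figure, not a measurement), a scatter/gather query over 100 partitions is slow roughly 63% of the time.

```python
def prob_any_slow(num_partitions: int, p_slow: float) -> float:
    """Probability that at least one of the parallel sub-queries is slow."""
    return 1 - (1 - p_slow) ** num_partitions

for n in (1, 10, 100):
    print(n, round(prob_any_slow(n, 0.01), 3))
```

Real partitions are rarely independent, so treat the result as an order-of-magnitude guide for how fan-out affects the slow-query rate.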
**Rebalancing storm from automatic rebalancing + false-positive failure detection**
A temporarily overloaded node responds slowly. Other nodes declare it dead. Automatic rebalancing begins moving its partitions to other nodes. The extra rebalancing traffic overloads the already-struggling node further. The added load on the other nodes then triggers further false-positive failure detections, and the failure cascades.
Fix: Use semi-automatic (operator-approved) rebalancing in production. Tune failure detection timeouts conservatively. Set rebalancing rate limits to bound the bandwidth consumed during rebalancing.
**Wrong partition count with fixed-partition scheme**
The partition count chosen at setup cannot be changed later (in databases that do not support dynamic partitioning). Too few partitions caps how many nodes can be added; too many creates excessive per-partition overhead. Choosing 10 partitions for a system that grows to 100 nodes means at most 10 nodes can hold data, and the other 90 sit idle.
Fix: For fixed-partition systems, choose initial partition count 10–20x the expected maximum node count. For variable-growth scenarios, prefer dynamic partitioning.
**Language-native hash functions used for partitioning**
Java's default `Object.hashCode()` and Ruby's `Object#hash` can return different values for the same key in different processes (Java's default is identity-based, and Ruby seeds string hashing randomly per process as a defense against hash-flooding attacks). Using these for partitioning produces incorrect routing: the partition computed at write time can differ from the partition computed at read time.
Fix: Use a purpose-built stable hash function: MD5 (Cassandra, MongoDB), Fowler-Noll-Vo (Voldemort), MurmurHash, or similar. Verify hash stability across process restarts before deploying.
**Stale routing metadata causes misrouted requests**
If routing metadata is cached client-side and not refreshed after rebalancing, requests are sent to nodes that no longer own the target partition. The node either returns an error or silently returns stale data.
Fix: Use a routing approach that subscribes to partition assignment changes (ZooKeeper watches, gossip convergence). Test routing after every rebalancing event. Implement retry-with-redirect at the client layer.
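Retry-with-redirect at the client layer might look like the sketch below. All names here (`NotOwner`, `Node`, `PartitionAwareClient`) are hypothetical, and the shared `ownership` dict stands in for whatever authoritative metadata the real system keeps.

```python
class NotOwner(Exception):
    """Raised by a node that no longer owns the requested partition."""
    def __init__(self, partition, current_owner):
        super().__init__(f"partition {partition} is owned by {current_owner}")
        self.current_owner = current_owner

class Node:
    def __init__(self, name, authoritative_map):
        self.name = name
        self.authoritative_map = authoritative_map  # partition -> owner name
        self.data = {}

    def get(self, partition, key):
        owner = self.authoritative_map[partition]
        if owner != self.name:
            raise NotOwner(partition, owner)  # redirect instead of stale data
        return self.data.get((partition, key))

class PartitionAwareClient:
    def __init__(self, nodes, routing_snapshot, max_redirects=3):
        self.nodes = nodes                     # node name -> Node
        self.routing = dict(routing_snapshot)  # cached copy, possibly stale
        self.max_redirects = max_redirects
        self.redirects_followed = 0

    def get(self, partition, key):
        for _ in range(self.max_redirects + 1):
            node = self.nodes[self.routing[partition]]
            try:
                return node.get(partition, key)
            except NotOwner as redirect:
                # Refresh the stale cache entry and retry against the new owner.
                self.routing[partition] = redirect.current_owner
                self.redirects_followed += 1
        raise RuntimeError("too many redirects; routing metadata keeps changing")

# Hypothetical two-node cluster; partition 1 starts on node "b".
ownership = {0: "a", 1: "b"}
nodes = {name: Node(name, ownership) for name in ("a", "b")}
nodes["b"].data[(1, "k")] = "v"
client = PartitionAwareClient(nodes, ownership)  # snapshot taken before rebalancing

# Rebalancing moves partition 1 to node "a"; the client's cache is now stale.
nodes["a"].data[(1, "k")] = nodes["b"].data.pop((1, "k"))
ownership[1] = "a"

value = client.get(1, "k")  # first attempt hits "b", then is redirected to "a"
```

The important design choice is on the node side: a node that no longer owns a partition refuses the request rather than answering from leftover data.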
---
## Examples
### Example 1 — IoT sensor time-series platform
**Scenario:** Storing sensor readings where each reading has `(sensor_name, timestamp, value)`. Queries: fetch all readings for a sensor within a date range; write throughput is high and continuous.
**Trigger:** "We're building a sensor data platform on HBase. Our current key is just the timestamp, and we're seeing one region server getting all the writes."
**Process:**
1. Skew classification: Monotonic key — `timestamp` as the sole partition key causes all writes to go to the "now" partition.
2. Method selection: Range partitioning is required (range scans by date are the primary read pattern). Apply compound range key with sensor name as prefix: key = `(sensor_name, timestamp)`.
3. Secondary index: None required — all queries fix `sensor_name` (partition prefix) and scan a `timestamp` range within the partition.
4. Rebalancing: Dynamic partitioning (HBase default). Pre-split with known sensor names to avoid cold-start single-partition writes.
5. Routing: ZooKeeper-coordinated routing (HBase default; clients locate the owning region server through ZooKeeper and the `hbase:meta` table).
**Output:** Compound range key `(sensor_name, timestamp)`. Writes for a given sensor land in that sensor's key range, so the overall write load spreads across sensors instead of concentrating on the newest partition. Range queries within a sensor are efficient single-partition scans. Dynamic rebalancing handles data growth.
---
### Example 2 — Social media post feed (Cassandra)
**Scenario:** Storing user posts where queries are: "fetch all posts by user X sorted by time" (dominant) and "fetch a single post by post_id" (secondary). Write pattern: each user writes to their own posts; no cross-user write conflicts.
**Trigger:** "We're modeling user posts in Cassandra. We need to efficiently retrieve all posts by a user, sorted from newest to oldest, but we also need to look up individual posts by ID."
**Process:**
1. Skew classification: User ID as partition key is approximately uniform (assuming large user base). No celebrity-user mitigation needed unless specific users are known to be viral.
2. Method selection: Compound key pattern — `user_id` as hash partition key (evenly distributes users across nodes), `post_timestamp DESC` as clustering column (sorts posts within the partition, newest first).
3. Secondary index: Post-by-ID lookup can be handled by a separate table keyed on `post_id` (denormalization, Cassandra idiom) rather than a secondary index, avoiding scatter/gather entirely.
4. Rebalancing: Proportional-to-nodes (Cassandra default, 256 vnodes/node). Automatic with gossip-based partition assignment.
5. Routing: Gossip protocol with token-aware driver (Cassandra default). No ZooKeeper dependency.
**Output:** Compound primary key `(user_id, post_timestamp DESC)`. Fetching all posts by user is a single-partition range scan. Individual post lookup uses a separate `posts_by_id` table. Scales horizontally by adding Cassandra nodes.
---
### Example 3 — E-commerce product catalog with multi-attribute search
**Scenario:** Product records with attributes `(product_id, category, color, price)`. Queries: look up by `product_id` (dominant write target), filter by `category + color` (secondary index query), filter by `price` range (secondary index query).
**Trigger:** "We have 50 million products partitioned by product_id hash across 20 Elasticsearch shards. Our 'filter by category and color' queries are slow and getting slower as we add shards."
**Process:**
1. Skew classification: `product_id` hash partitioning is uniform — no hotspot risk on writes.
2. Method selection: Hash partitioning on `product_id` is correct for primary key lookups. Do not change this.
3. Secondary index analysis: Elasticsearch uses document-partitioned (local) indexes by default. The `category + color` filter query is scatter/gathered across all 20 shards. Query latency scales with shard count.
4. Secondary index strategy: Two options:
- If `category` queries are always within a tenant/store, consider routing documents by `(store_id, product_id)` compound key to co-locate products from the same store, reducing scatter/gather to intra-store shards.
- If global catalog search is required, accept scatter/gather and optimize by reducing shard count (fewer, larger shards perform better for scatter/gather than many small shards) or pre-aggregate category+color facets.
5. Rebalancing: Fixed partitions (Elasticsearch uses fixed shard count). Choose initial shard count = projected node count × 2–3. Do not over-shard.
6. Routing: Elasticsearch coordinator node (partition-aware proxy, equivalent to routing tier option).
**Output:** Keep hash partitioning on `product_id`. Accept scatter/gather for secondary index queries — this is the Elasticsearch design. Reduce total shard count to the minimum needed; over-sharding amplifies scatter/gather cost. Add replica shards for read scaling on hot index queries.
---
## Cross-References
- **replication-strategy-selector** — Select replication topology (single-leader, multi-leader, leaderless) after partitioning is configured. Replication factor affects partition sizing.
- **data-model-selector** — Must be run before this skill. The schema and access pattern decisions made there determine viable partition keys and secondary index requirements.
- **storage-engine-selector** — The storage engine (LSM-tree vs. B-tree) affects how range scans perform within a partition. Relevant when range partitioning is selected.
---
## References
See `references/` for:
- `partitioning-decision-matrix.md` — Scoring rubric comparing range, hash, and compound strategies across 6 workload dimensions
- `rebalancing-strategies.md` — Detailed comparison of fixed, dynamic, and proportional rebalancing with configuration guidance
- `secondary-index-trade-offs.md` — Local vs. global index trade-off analysis with cost model
- `hotspot-mitigation-patterns.md` — Patterns for monotonic key hotspots, celebrity key hotspots, and write skew mitigation
- `request-routing-comparison.md` — Gossip vs. ZooKeeper vs. partition-aware client with operational trade-offs per database
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Designing Data-Intensive Applications by Martin Kleppmann.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-data-model-selector`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/hotspot-mitigation-patterns.md
# Hotspot Mitigation Patterns
Concrete patterns for detecting and relieving the most common partitioning hotspot types.
---
## Pattern 1: Monotonic Key Hotspot (Range Partitioning)
### Symptom
All writes are routed to a single partition — typically the "latest" one. Other partitions are idle. The affected partition's CPU and I/O are saturated while cluster utilization is very low overall.
### Root cause
The partition key is monotonically increasing: `created_at TIMESTAMP`, `id BIGSERIAL`, `event_sequence BIGINT`, `order_id INT AUTO_INCREMENT`. Range partitioning assigns contiguous key ranges to partitions. All new data falls in the highest range, and all current writes go to the partition holding that range.
### Mitigation: Compound prefix key
Prepend a high-cardinality, non-monotonic value to the key before the monotonic component.
**Before (bad):**
```
partition key: timestamp
range: 2024-01-01T00:00:00 → 2024-01-01T23:59:59 → Partition 0
range: 2024-01-02T00:00:00 → 2024-01-02T23:59:59 → Partition 1 ← ALL current writes
range: 2024-01-03T00:00:00 → ... → Partition 2 ← empty
```
**After (good):**
```
partition key: (sensor_name, timestamp)
range: sensor_A, 2024-01-01 → sensor_A, 2024-01-03 → Partition 0 ← sensor A writes
range: sensor_B, 2024-01-01 → sensor_B, 2024-01-03 → Partition 1 ← sensor B writes
range: sensor_C, 2024-01-01 → sensor_C, 2024-01-03 → Partition 2 ← sensor C writes
```
**Query impact:** To fetch all sensor readings within a time range (regardless of sensor), you must issue one range query per sensor prefix. This is an accepted trade-off. If you need a global time range query, consider a secondary index on `timestamp` or accept scatter/gather.
**Alternative prefix choices:** User ID, device ID, region code, category ID — any dimension with sufficient cardinality to spread load, which is also available at write time.
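As a rough illustration of the compound prefix key, the sketch below maps `(sensor_name, timestamp)` keys onto range partitions and shows why a global time-range query becomes one range scan per sensor. The split points, function names, and ISO-8601 string timestamps are assumptions made for the example, not any particular database's API.

```python
from bisect import bisect_right

# Hypothetical split points: partition i holds keys below SPLIT_POINTS[i]
# and at or above SPLIT_POINTS[i - 1]. Tuples compare element by element,
# so the sensor name dominates and the timestamp only orders within a sensor.
SPLIT_POINTS = [("sensor_B", ""), ("sensor_C", "")]


def partition_for(sensor_name, timestamp):
    """Return the index of the range partition that owns this compound key."""
    return bisect_right(SPLIT_POINTS, (sensor_name, timestamp))


def partitions_for_time_range(sensors, start, end):
    """A time-range query across sensors needs one range scan per prefix."""
    plan = {}
    for sensor in sensors:
        first = partition_for(sensor, start)
        last = partition_for(sensor, end)
        plan[sensor] = list(range(first, last + 1))
    return plan


plan = partitions_for_time_range(
    ["sensor_A", "sensor_B", "sensor_C"], "2024-01-01", "2024-01-03"
)
```

Each sensor's writes land on its own partition, and the price is that `plan` contains one entry (one scan) per sensor.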
---
## Pattern 2: Monotonic Key Hotspot (Hash Partitioning on Sequential IDs)
### Symptom
Hash partitioning is in use, but the key is an auto-increment ID or a time-based ID such as UUID-v1, and writes still cluster on a few partitions. Keys created close together in time share most of their bits, so a weak or order-preserving hash maps them to nearby values that land in the same partition.
### Root cause
UUID-v1 and ULID embed a timestamp, so IDs generated within a short interval are nearly sequential. A well-mixed hash function scatters such keys evenly; the hotspot appears when the partitioner hashes poorly or preserves key order, which partially defeats hash partitioning.
### Mitigation
- Use UUID-v4 (random) as the partition key — fully random hash distribution.
- Use ULID but reverse the timestamp component to distribute writes: `timestamp_reversed + random_component`.
- Use application-generated random IDs rather than database-generated sequential IDs.
- If you must retain sequential IDs for business reasons, use a hash of the sequential ID as the partition key (not the ID itself), and store the original ID as a separate non-partitioning column.
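The last option above can be sketched as follows: the sequential ID stays in the record, and only its hash decides the partition. The partition count and the use of MD5 are assumptions for the example; a real database applies its own hash function.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical, fixed partition count


def partition_for_id(sequential_id: int) -> int:
    """Derive the partition from a hash of the ID, not from the ID itself."""
    digest = hashlib.md5(str(sequential_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


# 10,000 consecutive IDs spread across all partitions instead of one range.
counts = Counter(partition_for_id(i) for i in range(10_000))
```

The modulo here is over a fixed partition count, not the node count, so the key-to-partition mapping does not change when nodes are added.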
---
## Pattern 3: Celebrity Key Hotspot (Hash Partitioning)
### Symptom
One or a few specific keys receive orders of magnitude more traffic than all others. Hash partitioning distributes keys evenly in aggregate, but the hot keys are each still concentrated in a single partition. The partition containing the celebrity key is overloaded; others are not.
### Root cause
The partitioning is correct, but the data distribution is inherently skewed: a celebrity user, a viral post, a globally-shared resource (e.g., a global counter, a shared cart, a trending topic ID). Hashing cannot help — `hash(celebrity_key)` is still a single value.
### Mitigation: Write-time key salting
Append a random suffix (00–99) to hot keys at write time, spreading one logical key across 100 physical keys.
**Write:**
```python
import random

def write_celebrity_key(db, key, value):
    # is_hot_key() is an application-supplied lookup (see Bookkeeping below)
    if is_hot_key(key):
        # Only known-hot keys are salted: spread writes across 100 physical keys
        suffix = random.randint(0, 99)
        physical_key = f"{key}_{suffix:02d}"
    else:
        physical_key = key
    db.write(physical_key, value)
```
**Read:**
```python
def read_celebrity_key(db, key):
    if is_hot_key(key):
        # Must read all 100 suffix variants and merge
        results = [db.read(f"{key}_{i:02d}") for i in range(100)]
        # merge() is application-specific, e.g. a sum for a counter
        return merge(results)
    return db.read(key)
```
**Bookkeeping:** Maintain a lookup table (or application config) of which keys are "hot" so that the salting and merge logic is applied selectively. Applying it to all keys is unnecessary overhead for the vast majority that are not hot.
**Trade-off:** Reads on hot keys now require fetching 100 records and merging. This is acceptable when writes vastly outnumber reads for the hot key, or when the merge is simple (e.g., summing a counter across suffix partitions).
### Alternative: Application-level sharding for celebrity resources
For resources like a global counter or a shared leaderboard that receive extreme write volume, use application-level partitioned counters:
- Maintain N counter shards in the application layer.
- Each write randomly increments one shard.
- To read the total, sum all N shards.
- N can be tuned based on write throughput requirements.
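A minimal in-memory sketch of the partitioned counter described above. The class name and default shard count are illustrative; in practice each shard would be a separate row or key in the database.

```python
import random


class ShardedCounter:
    """Counter split into N shards so concurrent writes do not hit one key."""

    def __init__(self, num_shards=16):
        self.shards = [0] * num_shards

    def increment(self, amount=1):
        # Each write picks one shard at random.
        self.shards[random.randrange(len(self.shards))] += amount

    def total(self):
        # Reading the total means summing every shard.
        return sum(self.shards)


counter = ShardedCounter(num_shards=4)
for _ in range(1000):
    counter.increment()
```

Raising `num_shards` spreads writes further but makes every read sum more shards, which mirrors the salting trade-off.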
---
## Pattern 4: Write Skew from Unbalanced Partition Boundaries
### Symptom
Some partitions are very large; others are nearly empty. This occurs most often with range partitioning where partition boundaries were chosen based on assumed key distribution that did not match actual data.
### Root cause
In range partitioning, partition boundaries must match the actual data distribution, not an assumed uniform distribution. For example, partitioning a name field alphabetically with equal letter ranges (A–B, C–D, E–F, ...) will produce very unequal partitions because letters do not appear with equal frequency.
### Mitigation
- Use dynamic partitioning (HBase, RethinkDB, MongoDB): boundaries adapt automatically based on actual data volume.
- If using fixed boundaries, analyze actual key distribution before setting boundaries. HBase supports pre-splitting with custom split points derived from a key distribution analysis.
- Use hash partitioning if range queries are not required — hash partitioning produces near-uniform distribution regardless of key distribution.
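One way to derive fixed boundaries from a key distribution analysis is to take quantiles of a key sample, so each partition receives about the same number of keys. This is only a sketch under that assumption; the function name is hypothetical and the sample must be representative of production data.

```python
def split_points(sample_keys, num_partitions):
    """Pick boundaries so each partition gets a similar share of the sample."""
    ordered = sorted(sample_keys)
    step = len(ordered) / num_partitions
    return [ordered[int(i * step)] for i in range(1, num_partitions)]


# Skewed sample: most keys are small, a few are very large.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100, 1000, 5000, 9000]
boundaries = split_points(sample, 3)  # [5, 100]
```

Equal-width ranges over 1 to 9000 would put eight of the twelve sample keys in the first partition; the quantile boundaries put four keys in each.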
---
## Detection: How to Identify Hotspots Before They Cause Outages
**Metrics to monitor per partition:**
- Write requests/second per partition (look for orders-of-magnitude differences)
- Storage size per partition (look for one partition growing much faster)
- CPU and I/O utilization per node (hotspot node will be saturated)
**Commands to identify a hot partition in Cassandra:**
```bash
nodetool tpstats                              # Thread pool stats: look for saturated pools on one node
nodetool tablehistograms <keyspace> <table>   # Per-table latency and partition-size histograms (formerly cfhistograms)
nodetool tablestats <keyspace>.<table>        # Per-table metrics, including maximum partition size
```
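**Requests to identify hot shards in Elasticsearch** (`localhost:9200` is a placeholder for your cluster address):
```bash
curl -s 'http://localhost:9200/_cat/shards?v&s=store'   # Shards sorted by store size
curl -s 'http://localhost:9200/_nodes/stats/indices'    # Per-node indexing/search counters (sample twice to get rates)
```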
**Signal:** If one partition/shard/node shows 10x or more the request rate or storage of others, a hotspot is present.
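The 10x signal can be checked mechanically once per-partition metrics are collected. The sketch below compares each partition against the median rate; the function name, the input shape, and the choice of the median as the baseline are assumptions for the example.

```python
from statistics import median


def find_hot_partitions(requests_per_sec, factor=10.0):
    """Return partitions whose request rate is >= factor x the median rate."""
    baseline = median(requests_per_sec.values())
    threshold = factor * baseline
    return sorted(
        name
        for name, rate in requests_per_sec.items()
        if rate > 0 and rate >= threshold
    )


rates = {"p0": 120, "p1": 95, "p2": 4000, "p3": 110}
hot = find_hot_partitions(rates)  # ["p2"]
```

The same function works for storage size per partition; only the input metric changes.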
FILE:references/partitioning-decision-matrix.md
# Partitioning Decision Matrix
Scoring rubric for selecting among range, hash, and compound key partitioning strategies. Score each dimension 1–5, then total. The highest-scoring method is the primary recommendation; use the runner-up to inform mitigation strategies.
---
## Dimensions
### 1. Range Query Support
Does the application require efficient range scans on the partition key (e.g., date ranges, alphabetical ranges, numeric ranges)?
| Score | Meaning |
|---|---|
| 5 | Range queries are the primary access pattern — efficient range scan is critical |
| 3 | Range queries are needed but not dominant |
| 1 | All queries are point lookups — range queries are never needed |
### 2. Write Distribution (inverted — higher = more uniform)
How uniformly are writes distributed across the key space?
| Score | Meaning |
|---|---|
| 5 | Keys are randomly/uniformly distributed (UUIDs, random hashes) |
| 3 | Moderate skew — some keys receive more writes but not extreme concentration |
| 1 | Strongly monotonic or celebrity-key workload — severe skew without mitigation |
### 3. Cross-Partition Query Avoidance
Does the application's primary access pattern allow most queries to be served from a single partition?
| Score | Meaning |
|---|---|
| 5 | Primary queries always fix the partition key (single-partition lookups) |
| 3 | Mixed — some queries span partitions, most do not |
| 1 | Most queries fan out across all partitions (analytics, global aggregations) |
### 4. Secondary Index Requirements
How demanding are secondary index query patterns (non-primary-key attribute filters)?
| Score | Meaning |
|---|---|
| 5 | No secondary index queries — all queries use the primary key |
| 3 | Secondary index queries exist but are infrequent or can tolerate scatter/gather |
| 1 | Secondary index queries are frequent, latency-sensitive, and cannot scatter/gather |
### 5. One-to-Many Relationship Access
Does the application frequently fetch all child records for a given parent entity, sorted?
| Score | Meaning |
|---|---|
| 5 | Yes — dominant pattern is "fetch all X for user Y, sorted by timestamp" |
| 3 | Occasionally needed |
| 1 | Never — all lookups are for individual records |
### 6. Operational Simplicity
How much operational complexity can the team absorb for partition boundary management?
| Score | Meaning |
|---|---|
| 5 | Prefer fully managed/automatic boundary assignment |
| 3 | Can manage semi-automatic with operator approval |
| 1 | Team has capacity for manual partition boundary administration |
---
## Scoring by Strategy
For each dimension, score the strategy being evaluated:
| Dimension | Range | Hash | Compound |
|---|---|---|---|
| Range query support | High score when range queries are critical | Low score — hash destroys sort order | High score — range scans within partition |
| Write distribution | Low score for monotonic keys | High score — hash distributes evenly | Medium — distributes on hash part, sequential within |
| Cross-partition query avoidance | High score if queries always fix prefix | High score for point lookups | High score — compound key collocates related records |
| Secondary index requirements | Medium | Medium | Medium (same trade-offs apply to secondary indexes) |
| One-to-many relationship access | Low score — all parents share the key space | Low score — cannot sort within partition | High score — clustering columns sort within partition |
| Operational simplicity | Medium — dynamic partitioning helps | High — hash ranges are simple to manage | Medium — compound key design requires upfront modeling |
---
## Quick-Decision Table
| Access pattern | Write distribution | Recommendation |
|---|---|---|
| Range queries dominant | Uniform keys | Range partitioning |
| Range queries dominant | Monotonic keys | Range + compound prefix (prefix + timestamp) |
| Point lookups only | Any | Hash partitioning |
| Point lookup on entity + range scan within entity | Any | Compound key (hash entity, range sort key) |
| Analytics / multi-attribute filter | Any | Hash + accept scatter/gather, or global secondary index |
---
## Example Scores
### IoT sensor platform (sensor_name + timestamp key)
| Dimension | Score |
|---|---|
| Range queries (date range scans) | 5 |
| Write distribution (timestamp = monotonic → needs prefix) | 3 (with prefix mitigation) |
| Cross-partition avoidance (prefix fixes partition) | 5 |
| Secondary index requirements (none needed) | 5 |
| One-to-many (all readings per sensor) | 4 |
| Operational simplicity (HBase dynamic) | 4 |
| **Total** | **26 / 30** |
**Winner: Range with compound prefix**
### User post feed (Cassandra: user_id + post_timestamp)
| Dimension | Score |
|---|---|
| Range queries (fetch posts by time) | 4 |
| Write distribution (user IDs uniform) | 5 |
| Cross-partition avoidance (user_id fixes partition) | 5 |
| Secondary index requirements (none — denormalized) | 5 |
| One-to-many (all posts per user) | 5 |
| Operational simplicity (Cassandra default) | 5 |
| **Total** | **29 / 30** |
**Winner: Compound key (hash + range)**
FILE:references/rebalancing-strategies.md
# Rebalancing Strategies
Detailed comparison of the three main rebalancing approaches, with configuration guidance and known failure modes.
---
## Why Not hash mod N
The naive approach to rebalancing is `partition = hash(key) mod N` where N is the number of nodes.
**The problem:** When N changes (a node is added or removed), the modulo result changes for nearly every key. If you had 10 nodes and add one (N = 11), `hash(key) mod 10` ≠ `hash(key) mod 11` for roughly 10 out of every 11 keys, so most keys must move to a different node. This makes rebalancing catastrophically expensive: adding a single node triggers a near-complete data shuffle.
**The fix:** Assign each partition a fixed range of hash values (not a modulo). When a node is added, it takes over some partitions' hash ranges from existing nodes. Keys within a given hash range always map to the same partition; only the partition-to-node assignment changes, not the key-to-partition mapping.
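The difference can be measured with a small simulation. The sketch below counts how many of 10,000 keys change nodes when a cluster grows from 10 to 11 nodes, first with `hash mod N` and then with fixed hash-range partitions. MD5, the partition count of 110, and the helper names are assumptions chosen to keep the arithmetic simple.

```python
import hashlib


def stable_hash(key: str) -> int:
    """64-bit hash of the key (MD5 is only a stand-in for the database's hash)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


keys = [f"key-{i}" for i in range(10_000)]

# Naive scheme: node = hash(key) mod N, with N growing from 10 to 11.
moved_mod = sum(1 for k in keys if stable_hash(k) % 10 != stable_hash(k) % 11)

# Fixed scheme: 110 partitions, each owning a fixed range of the hash space.
NUM_PARTITIONS = 110


def partition_of(key: str) -> int:
    return (stable_hash(key) * NUM_PARTITIONS) >> 64


before = {p: p % 10 for p in range(NUM_PARTITIONS)}  # 11 partitions per node
after = dict(before)
for p in range(10):   # partitions 0..9 are owned by nodes 0..9, one each
    after[p] = 10     # the new node (id 10) takes one partition from every node

moved_fixed = sum(
    1 for k in keys if before[partition_of(k)] != after[partition_of(k)]
)

print(f"hash mod N:       {moved_mod / len(keys):.0%} of keys move")
print(f"fixed partitions: {moved_fixed / len(keys):.0%} of keys move")
```

With `hash mod N` about ten in eleven keys change nodes; with fixed partitions only the keys in the ten reassigned partitions move, about one in eleven.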
---
## Strategy 1: Fixed Number of Partitions
**How it works:**
- Create many more partitions than nodes at initial setup. Common ratio: 10:1 to 100:1 (e.g., 1,000 partitions for a 10-node cluster).
- When a node is added: it steals a few complete partitions from every existing node until load is balanced.
- When a node is removed: its partitions are distributed to remaining nodes.
- Key-to-partition mapping never changes. Only partition-to-node assignment changes.
- Used by: Riak, Elasticsearch, Couchbase, Voldemort.
**Configuration:**
- Initial partition count: set at cluster creation, cannot be changed later in most fixed-partition databases.
- Rule of thumb: `initial_partitions = max_expected_nodes × 10`
- Elasticsearch: `number_of_shards` per index (set at index creation, immutable without reindex)
- Riak: `ring_size` (power of 2, set at cluster creation)
**When to choose:**
- Total data size is known and bounded — partition count can be sized correctly upfront.
- Operational simplicity is preferred — no dynamic splitting/merging to manage.
- Database supports it natively (Riak, Elasticsearch, Couchbase, Voldemort).
**Failure modes:**
- Too few partitions: limits maximum cluster size. A 10-partition cluster can never have more than 10 nodes effectively.
- Too many partitions: each partition has per-partition overhead (metadata, connections, background processes). Very high partition counts degrade performance.
- Data grows much larger than expected: partition size grows proportionally, making each partition very large and recovery from node failure very slow.
---
## Strategy 2: Dynamic Partitioning
**How it works:**
- Partitions split when they exceed a size threshold; they merge when they fall below a lower threshold.
- Partition count adapts to total data volume automatically.
- After a large partition splits, one of the two halves can be transferred to another node to balance load.
- Used by: HBase (default 10 GB threshold), RethinkDB, MongoDB (both hash and range modes).
**Configuration (HBase example):**
```
hbase.regionserver.region.split.policy = IncreasingToUpperBoundRegionSplitPolicy
hbase.hregion.max.filesize = 10737418240 # 10 GB — split threshold
```
**Pre-splitting (required for empty databases):**
An empty database starts with a single partition. All writes hit one node until the first split. For key-range partitioning, pre-splits require knowing the key distribution in advance.
```
# HBase: create a table with 10 pre-splits for known key prefixes
create 'sensor_data', 'readings', SPLITS => ['sensor01', 'sensor02', ..., 'sensor09']
```
For hash partitioning, MongoDB supports pre-splitting by hash value ranges.
**When to choose:**
- Data volume is highly variable or will grow significantly and unpredictably.
- Key-range partitioning is used (dynamic partitioning is the natural complement).
- The database natively supports it (HBase, RethinkDB, MongoDB).
**Failure modes:**
- Cold-start bottleneck: without pre-splitting, all early writes go to one partition/node.
- Split storms: a sudden data ingestion spike can trigger many rapid splits simultaneously, creating high I/O load on the splitting nodes.
- Pre-split key guessing: key-range pre-splits require knowing key distribution upfront, which may not be available for new systems.
---
## Strategy 3: Proportional to Nodes
**How it works:**
- Fixed number of partitions per node (Cassandra: 256 vnodes per node by default through 3.x; Cassandra 4.0 lowered the default to 16).
- Total partition count = nodes × partitions_per_node.
- When a new node joins: it randomly selects existing partitions, splits them, and takes ownership of half of each split partition.
- Partition size stays roughly constant as nodes are added (more nodes = more partitions = same data per partition).
- Used by: Cassandra (virtual nodes / vnodes), Ketama.
**Configuration (Cassandra):**
```yaml
# cassandra.yaml
num_tokens: 256   # vnodes per node (partitions per node); default through 3.x, 4.0 defaults to 16
# Vnodes rebalance load automatically as nodes join and leave
# Lower values (16, 32) reduce overhead but decrease balance quality
```
**Virtual nodes (vnodes):** Rather than assigning each node a single contiguous range of the hash space, each node owns many small non-contiguous ranges. This improves load distribution when nodes have heterogeneous capacity (a stronger node gets more vnodes). It also reduces the data movement needed when a node fails (failure load is spread across all other nodes, not just neighbors in the ring).
**When to choose:**
- Using Cassandra, Ketama, or a gossip-based ring system.
- Cluster will grow by adding nodes — partition size stays stable automatically.
- Hash partitioning is in use (proportional rebalancing requires hash-based boundary selection).
**Failure modes:**
- Unfair splits from randomization: with few partitions per node, random split selection can create unbalanced partition sizes. Cassandra 3.0 introduced a deterministic allocation algorithm to address this.
- New node join overhead: a new node must stream data from all nodes it takes partitions from, generating simultaneous I/O on multiple existing nodes.
---
## Automatic vs. Manual Rebalancing Decision
| Approach | Operational burden | Predictability | Risk |
|---|---|---|---|
| Fully automatic | Low — system handles all moves | Low — unpredictable timing | Rebalancing storm from false-positive failure detection |
| Semi-automatic (recommended) | Medium — operator approves proposed plan | High — operator sees full plan before execution | Delayed response to cluster changes |
| Fully manual | High — DBA configures partition assignment | High | Human error in assignment |
**Recommendation for production:** Use semi-automatic. The operator-approval step costs minutes but prevents the failure mode where a temporarily slow node triggers a rebalancing cascade that makes the situation worse.
Systems that support semi-automatic rebalancing: Couchbase, Riak, Voldemort (generate plan → operator commits → execution begins).
FILE:references/request-routing-comparison.md
# Request Routing Comparison
Detailed comparison of the three request routing approaches for partitioned databases, with operational trade-offs and database-specific guidance.
---
## The Problem
When a client sends a read or write request, the database must route it to the node that owns the target partition. As partitions rebalance (nodes added, removed, or failed over), the mapping from partition to node changes. Every component that makes routing decisions must have an up-to-date copy of this mapping.
Using stale routing metadata results in:
- Request forwarded to wrong node → error or extra hop latency
- Silent stale reads (wrong node returns stale data)
- Write to wrong node → data on the correct node is not updated
---
## Option 1: Any-Node with Forwarding (Gossip Protocol)
### How it works
1. Client sends request to any node (e.g., via round-robin load balancer or DNS).
2. If the receiving node owns the target partition, it handles the request directly.
3. If not, the receiving node looks up the correct owner in its local routing table and forwards the request to that node.
4. The correct node handles the request and returns the result (either directly to client or via the forwarding node).
Nodes maintain an up-to-date routing table by gossiping — periodically exchanging cluster state information with random peers. Changes propagate through the cluster over time (eventual consistency in metadata, not data).
### Used by
- Apache Cassandra (gossip protocol, token-aware drivers)
- Riak (riak_core gossip)
### Cassandra-specific
Cassandra drivers support "token-aware routing": the driver computes the partition key hash and connects directly to the owning node, bypassing the forwarding step. This reduces latency by eliminating the unnecessary hop.
```python
# Python cassandra-driver: token-aware load balancing policy
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy
from cassandra.cluster import Cluster
cluster = Cluster(
contact_points=['10.0.0.1', '10.0.0.2'],
load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy())
)
```
### Operational trade-offs
| Dimension | Assessment |
|---|---|
| External dependency | None — no ZooKeeper or etcd required |
| Routing consistency | Eventual — gossip convergence may take seconds; brief stale routing is possible |
| Node complexity | Higher — routing logic lives in each database node |
| Failure tolerance | Good — no single coordinator to fail; gossip continues without central coordination |
| Configuration | Minimal — gossip is enabled by default in Cassandra/Riak |
### Best for
Systems that must avoid external coordination service dependencies. Multi-datacenter deployments where cross-DC ZooKeeper consensus would add latency. Teams with operational experience with Cassandra or Riak.
---
## Option 2: Routing Tier with Coordination Service
### How it works
1. A dedicated routing tier (partition-aware proxy/load balancer) sits between clients and database nodes.
2. Routing tier consults a coordination service (ZooKeeper, etcd) to look up the current partition-to-node mapping.
3. Routing tier forwards the request to the correct node and returns the response to the client.
4. Nodes register their partition assignments in ZooKeeper on join/leave. ZooKeeper notifies the routing tier of changes via watches.
### Used by
- Apache HBase (HMaster + ZooKeeper)
- Apache SolrCloud (ZooKeeper)
- Apache Kafka (ZooKeeper, pre-KRaft)
- LinkedIn Espresso (Helix + ZooKeeper)
- MongoDB (config server + mongos routing daemon)
### ZooKeeper-based routing (HBase/SolrCloud pattern)
```
Client → Routing Tier → ZooKeeper (watches partition assignment)
↓
Node N (owns target partition)
```
ZooKeeper stores:
- Partition-to-node assignment map
- Node health/liveness registration
- Cluster epoch (version of the assignment)
Routing tier subscribes to ZooKeeper watches on the assignment node. When a partition moves, ZooKeeper notifies all subscribed routing tier instances, which update their local routing tables.
### MongoDB routing (mongos)
MongoDB uses a dedicated `config server` (replicated MongoDB deployment) to store shard metadata. `mongos` daemons act as the routing tier, caching shard metadata locally and refreshing from config server on stale-route errors.
```
Client → mongos → config server (shard map)
↓
shard N (mongod)
```
### Operational trade-offs
| Dimension | Assessment |
|---|---|
| External dependency | Required — ZooKeeper/etcd cluster must be maintained and highly available |
| Routing consistency | Strong — ZooKeeper watch notifications are reliable; routing tier can have up-to-date metadata at all times |
| Node complexity | Lower — routing logic is isolated to the routing tier and coordination service |
| Failure tolerance | ZooKeeper itself becomes a critical path; ZooKeeper outage impairs routing tier operation |
| Operational overhead | Higher — ZooKeeper cluster requires its own deployment, monitoring, and maintenance |
### Best for
Systems with complex multi-partition query routing (e.g., analytics queries spanning multiple shards). Systems where routing correctness is critical and brief stale routing is unacceptable. Teams already operating ZooKeeper for other services (e.g., Kafka, HBase).
---
## Option 3: Partition-Aware Client
### How it works
1. The client library maintains a local cache of the partition-to-node mapping.
2. Client computes the target partition for the request and connects directly to the owning node.
3. Client refreshes its local routing map from a known source (ZooKeeper, config server, a seed node) on startup and periodically, or upon receiving a routing error from a node.
### Used by
- Cassandra drivers in token-aware mode (driver holds the token ring map)
- MongoDB application drivers (can bypass mongos and connect directly with partition info)
- DynamoDB (fully managed — AWS handles routing; client only calls the service endpoint)
### Cassandra token-aware driver (detailed)
The Cassandra driver downloads the full token ring topology from the cluster (via the `system.peers` table). For each write/read, the driver computes `hash(partition_key)` to determine the token, then looks up the token ring to find the owning node's IP address, and connects directly.
When the token ring changes (node added/removed), the driver receives a topology change event and updates its local copy. This update is asynchronous — there is a brief window where the driver has stale routing information.
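The lookup a token-aware client performs can be sketched as a binary search over a sorted token list. This is an illustration only, not the driver's implementation: the class and method names are hypothetical, MD5 stands in for Cassandra's Murmur3 partitioner, and replication is ignored.

```python
import hashlib
from bisect import bisect_left


def token_for(partition_key: str) -> int:
    """Stand-in token function (real clusters use the configured partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TokenRing:
    def __init__(self, token_owners):
        # token_owners maps token -> node address; a node with vnodes
        # appears under many tokens.
        self.owners = dict(token_owners)
        self.tokens = sorted(self.owners)

    def owner_for_token(self, token):
        # A node owns the range ending at its token, so pick the first
        # token >= the key's token and wrap around past the last one.
        idx = bisect_left(self.tokens, token) % len(self.tokens)
        return self.owners[self.tokens[idx]]

    def owner(self, partition_key):
        return self.owner_for_token(token_for(partition_key))


ring = TokenRing({100: "10.0.0.1", 200: "10.0.0.2", 300: "10.0.0.3"})
```

When a topology change event arrives, the client rebuilds `tokens` and `owners`; until then lookups use the stale ring, which is the brief window described above.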
### Operational trade-offs
| Dimension | Assessment |
|---|---|
| External dependency | None to minimal — may need initial seed node or config server endpoint |
| Routing consistency | Eventually consistent — client cache is refreshed asynchronously |
| Node complexity | Low — database nodes have no routing responsibility |
| Client complexity | Higher — routing logic is in each client library; library versions must be kept current |
| Latency | Lowest: no forwarding hop; the client connects directly to the owning node |
### Best for
Performance-critical applications where eliminating the forwarding hop matters. Systems using Cassandra with token-aware drivers. Managed database services (DynamoDB, Azure Cosmos DB) where routing is fully abstracted.
---
## Summary Comparison
| | Any-node + Forward | Routing Tier + Coord Service | Partition-Aware Client |
|---|---|---|---|
| Routing consistency | Eventual (gossip) | Strong (ZooKeeper watch) | Eventual (client cache refresh) |
| Hop count | 1–2 | 2 | 1 (direct) |
| External dependency | None | ZooKeeper/etcd | None (or seed node) |
| Operational complexity | Low | High | Medium |
| Best databases | Cassandra, Riak | HBase, SolrCloud, Kafka | Cassandra (token-aware), managed DBs |
| Failure of routing component | Gossip-based self-healing | ZooKeeper outage impairs routing | Client cache staleness |
---
## Handling Stale Routing in All Approaches
Regardless of which approach is used, clients should implement a stale-routing recovery path:
1. Send request using current routing information.
2. If the target node returns an error indicating it no longer owns the partition (e.g., HBase's `NotServingRegionException`), refresh routing information from the authoritative source.
3. Retry the request with updated routing information.
4. If retry fails, surface the error to the application layer.
This retry-on-stale-route pattern ensures correctness even during rebalancing events when routing information is temporarily inconsistent.
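The four steps above can be sketched as a small wrapper. Everything here is hypothetical: `StaleRouteError`, `fetch_route`, and `send` stand in for whatever error type, metadata lookup, and transport a given database client provides.

```python
class StaleRouteError(Exception):
    """Hypothetical error raised when a node no longer owns the partition."""


def send_with_route_refresh(partition, request, routing_cache, fetch_route, send):
    # Step 1: use the cached route, falling back to the authoritative source.
    node = routing_cache.get(partition) or fetch_route(partition)
    routing_cache[partition] = node
    try:
        return send(node, request)
    except StaleRouteError:
        # Step 2: the cached route was stale, so refresh it.
        node = fetch_route(partition)
        routing_cache[partition] = node
        # Steps 3 and 4: retry once; a second failure propagates to the caller.
        return send(node, request)


# Simulated cluster in which partition "p1" has moved from node-a to node-b.
current_owner = {"p1": "node-b"}
cache = {"p1": "node-a"}


def fetch_route(partition):
    return current_owner[partition]


def send(node, request):
    if node != current_owner["p1"]:
        raise StaleRouteError(node)
    return f"{node}:{request}"


result = send_with_route_refresh("p1", "get", cache, fetch_route, send)
```

Retrying only once keeps a persistently wrong route from turning into a retry loop.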
FILE:references/secondary-index-trade-offs.md
# Secondary Index Trade-offs: Local vs. Global
This reference covers the detailed trade-off analysis between document-partitioned (local) and term-partitioned (global) secondary indexes in a partitioned database.
---
## The Core Problem
Secondary indexes identify records by attributes that are not the partition key. Because the partition key determines which node holds a record, a secondary index on a different attribute spans all partitions — records with a given attribute value may exist in every partition.
There is no way to partition both the primary data and the secondary index by the same key simultaneously. Every secondary index strategy involves a trade-off between write complexity and read complexity.
---
## Document-Partitioned (Local) Index
### Architecture
Each partition maintains its own secondary index, containing entries only for documents stored in that partition.
```
Partition 0                          Partition 1
PRIMARY DATA                         PRIMARY DATA
product_id=191  color=red            product_id=515  color=silver
product_id=214  color=black          product_id=768  color=red
product_id=306  color=red            product_id=893  color=silver
LOCAL SECONDARY INDEX                LOCAL SECONDARY INDEX
color:red   → [191, 306]             color:red    → [768]
color:black → [214]                  color:silver → [515, 893]
```
### Write path
A write to `product_id=191` goes to Partition 0. Only Partition 0 updates its local index. No other partition is involved.
### Read path
A query for `color=red` must be sent to Partition 0 AND Partition 1 (scatter), wait for both to respond, then merge the results (gather). With N partitions, the query fans out to all N partitions.
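A minimal in-memory sketch of the write and read paths for local indexes, using the product data from the diagram above. The dictionaries stand in for partitions, and the function names are illustrative.

```python
# Primary data, partitioned by product_id.
partitions = [
    {191: "red", 214: "black", 306: "red"},       # Partition 0
    {515: "silver", 768: "red", 893: "silver"},   # Partition 1
]


def build_local_index(partition):
    """Each partition indexes only the documents it stores."""
    index = {}
    for product_id, color in partition.items():
        index.setdefault(color, []).append(product_id)
    return index


local_indexes = [build_local_index(p) for p in partitions]


def query_by_color(color):
    results = []
    for index in local_indexes:               # scatter: ask every partition
        results.extend(index.get(color, []))
    return sorted(results)                    # gather: merge partial results


red_products = query_by_color("red")  # [191, 306, 768]
```

The loop over `local_indexes` is the fan-out: its cost grows with the number of partitions even when most of them hold no matching documents.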
### Tail latency amplification
Scatter/gather means the query's latency is determined by the slowest partition's response. With 20 partitions, the query is as slow as the slowest of 20 responses. Adding partitions worsens this unless all partitions are identical in performance.
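The amplification can be quantified with a simple model. Assuming each partition is independently slow with probability p (an assumption that real clusters only approximate), a scatter/gather query is slow whenever at least one of its N partitions is slow.

```python
def prob_slow_query(p_slow_partition, num_partitions):
    """P(at least one of N independent partitions is slow) = 1 - (1 - p)^N."""
    return 1 - (1 - p_slow_partition) ** num_partitions


# If 1% of individual partition responses are slow:
for n in (1, 20, 100):
    print(n, round(prob_slow_query(0.01, n), 3))
```

With p = 0.01, about 18% of queries over 20 partitions and about 63% over 100 partitions wait on at least one slow response, which is why reducing shard count helps.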
### When to choose local indexes
- Writes are frequent and write latency is critical.
- Secondary index queries are infrequent or can tolerate higher latency.
- Secondary index queries are structured so they can be scoped to a single partition (e.g., always filter by tenant_id first, and data is partitioned by tenant_id — so the secondary index query hits only one partition).
- The database of choice uses local indexes by default and you cannot change it (MongoDB, Cassandra, Elasticsearch, SolrCloud, Riak, VoltDB).
### Mitigation for scatter/gather cost
- Reduce partition count (fewer partitions = less scatter). Over-sharding (e.g., 500 shards for a dataset that fits in 20) amplifies scatter/gather with no benefit.
- Ensure secondary index queries always include the partition key as a filter, so the query can be routed to a single partition.
- Precompute frequently used aggregate queries (materialized views, denormalized rollup tables) instead of relying on secondary index scatter/gather for high-frequency queries.
---
## Term-Partitioned (Global) Index
### Architecture
A single global index covers all partitions, but the index itself is partitioned by the indexed term (or a hash of the term).
```
Global Index Partition 0             Global Index Partition 1
(terms a–r)                          (terms s–z)
color:black → [214]                  color:silver → [515, 893]
color:red   → [191, 306, 768]        make:Volvo   → [768]
make:Audi   → [893]
make:Dodge  → [214]
make:Ford   → [306, 515]
make:Honda  → [191]
The primary data is still partitioned by `product_id`. The global index is separately partitioned by the indexed attribute value.
### Write path
A write to `product_id=191` (color=red, make=Honda):
- Updates primary data in the `product_id` partition (e.g., Partition 0)
- Must also update `color:red` in Global Index Partition 0
- Must also update `make:Honda` in Global Index Partition 0
A single write now touches multiple partitions (the data partition + all relevant index partitions). In a term-partitioned global index, writes require distributed transactions across partitions — not supported by all databases, so global secondary indexes are often updated **asynchronously**.
### Read path
A query for `color=red` goes to exactly one global index partition (the one covering terms starting with "r"). That partition returns the full list of matching document IDs from all primary partitions. The query can then fetch the actual documents, potentially with targeted lookups to specific primary partitions.
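The write and read paths above can be sketched as follows. This is a toy model under stated assumptions: the index is range-partitioned on the first letter of the indexed value (a-r in partition 0, s-z in partition 1), and `index_partition_for`, `index_write`, and `index_read` are hypothetical helper names, not a real database API.

```python
from collections import defaultdict

def index_partition_for(value: str) -> int:
    """Range partitioning on the indexed value: a-r -> partition 0, s-z -> partition 1."""
    return 0 if value[0].lower() <= "r" else 1

# Two global index partitions, each mapping "attribute:value" -> sorted product IDs.
global_index = [defaultdict(list), defaultdict(list)]

def index_write(product_id: int, attrs: dict) -> set:
    """Add a product to the global index; return the set of index partitions touched."""
    touched = set()
    for attr, value in attrs.items():
        p = index_partition_for(value)
        postings = global_index[p][f"{attr}:{value}"]
        postings.append(product_id)
        postings.sort()
        touched.add(p)
    return touched

def index_read(attr: str, value: str) -> list:
    """A lookup is served by exactly one index partition."""
    partition = global_index[index_partition_for(value)]
    return list(partition.get(f"{attr}:{value}", []))

index_write(191, {"color": "red", "make": "Honda"})
index_write(306, {"color": "red", "make": "Ford"})
index_write(768, {"color": "red", "make": "Volvo"})
touched = index_write(515, {"color": "silver", "make": "Ford"})  # spans both partitions

red_ids = index_read("color", "red")
```

The last write touches both index partitions (`silver` lands in partition 1, `Ford` in partition 0), while the `color=red` read is answered by partition 0 alone.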
### Asynchronous index updates
Because synchronous updates across multiple index partitions would require distributed transactions, most global secondary index implementations update asynchronously. Amazon DynamoDB's global secondary indexes are normally updated within a fraction of a second, but may lag during infrastructure faults.
**Implication:** If you read the global index immediately after a write, the change may not yet be reflected. Design applications that use global secondary indexes to tolerate brief eventual consistency in index reads.
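A minimal sketch of that read-after-write gap, assuming a queue stands in for the background index updater. The names `write`, `apply_pending`, and `ids_by_color` are hypothetical and only illustrate the ordering of events.

```python
from collections import deque

primary = {}        # product_id -> attributes (updated synchronously)
color_index = {}    # color -> set of product_ids (updated asynchronously)
pending = deque()   # index updates accepted but not yet applied

def write(product_id: int, color: str) -> None:
    primary[product_id] = {"color": color}
    pending.append((product_id, color))   # the index catches up later

def apply_pending() -> None:
    """Stands in for the asynchronous index updater."""
    while pending:
        product_id, color = pending.popleft()
        color_index.setdefault(color, set()).add(product_id)

def ids_by_color(color: str) -> list:
    return sorted(color_index.get(color, set()))

write(191, "red")
stale = ids_by_color("red")   # read immediately after the write: index not updated yet
apply_pending()
fresh = ids_by_color("red")   # after propagation the write is visible
```

An application that reads the index right after writing sees the `stale` result, so it should either tolerate that or read the primary record by key instead.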
### Global index partition strategy
- **Range-partitioned term index:** Terms are assigned to partitions by sorted ranges. Supports range queries on the indexed attribute (e.g., price between 100 and 500). Used when the indexed attribute has range-query requirements.
- **Hash-partitioned term index:** Terms are assigned to partitions by hash of the term. More uniform load distribution. No range scan support on the index itself.
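The difference between the two strategies shows up when a range query has to decide which index partitions to contact. The sketch below assumes an index on a hypothetical integer `price` attribute; the boundaries, partition counts, and function names are illustrative.

```python
import bisect
import hashlib

# Range partitioning: boundaries between partitions.
# Partition 0: price < 100, 1: 100-499, 2: 500-999, 3: price >= 1000.
RANGE_BOUNDS = [100, 500, 1000]

def range_partition(price: int) -> int:
    return bisect.bisect_right(RANGE_BOUNDS, price)

def range_partitions_for(lo: int, hi: int) -> list:
    """Partitions that can contain prices in the inclusive range [lo, hi]."""
    return list(range(range_partition(lo), range_partition(hi) + 1))

NUM_HASH_PARTITIONS = 4

def hash_partition(price: int) -> int:
    digest = hashlib.md5(str(price).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_HASH_PARTITIONS

def hash_partitions_for(lo: int, hi: int) -> list:
    """Hashing scatters adjacent values, so a range query must ask every partition."""
    return list(range(NUM_HASH_PARTITIONS))

narrow = range_partitions_for(150, 300)
wide = range_partitions_for(100, 500)
hashed = hash_partitions_for(150, 300)
```

A range-partitioned index answers `price between 150 and 300` from one partition, whereas the hash-partitioned index has to query all four even though point lookups stay single-partition.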
### When to choose global indexes
- Secondary index reads are frequent and latency-sensitive.
- Writes are less frequent or can tolerate asynchronous propagation.
- The application can tolerate eventual consistency in secondary index reads.
- The database supports global secondary indexes (DynamoDB, Oracle, Riak search feature).
---
## Comparison Summary
| Dimension | Local (Document-Partitioned) | Global (Term-Partitioned) |
|---|---|---|
| Write cost | Low — single partition update | High — multiple index partitions updated |
| Write consistency | Synchronous, always up-to-date | Often asynchronous — brief eventual consistency |
| Read cost (secondary index query) | High — scatter/gather all partitions | Low — single index partition |
| Read tail latency | Amplified by partition count | Stable — single partition |
| Operational complexity | Low — no separate index management | Higher — global index is a separate partitioned structure |
| Databases | MongoDB, Cassandra, Elasticsearch, Riak, VoltDB | DynamoDB global secondary indexes, Oracle, Riak search |
---
## Hybrid Pattern: Co-location to Avoid Secondary Index
In some cases, secondary index overhead can be eliminated entirely by redesigning the partition key to co-locate records that will be queried together.
**Pattern:** If secondary index queries always filter by a high-cardinality attribute (e.g., `tenant_id`, `store_id`, `user_id`), partition the primary data by that attribute. Secondary index queries within a tenant are now single-partition lookups — no scatter/gather, no global index needed.
**Trade-off:** This works only when secondary index queries are always scoped to the partition key value. If global cross-partition secondary index queries are also needed, co-location does not eliminate the scatter/gather requirement for those queries.
Classify a data workload as OLTP, OLAP, or hybrid, then recommend the appropriate database architecture — transactional database, dedicated data warehouse, o...
---
name: oltp-olap-workload-classifier
description: |
Classify a data workload as OLTP, OLAP, or hybrid, then recommend the appropriate database architecture — transactional database, dedicated data warehouse, or both with an ETL pipeline. Use when asked "should I use a data warehouse?", "why are my analytics queries slow on my production database?", "should I use Redshift/BigQuery/Snowflake?", or "can one database handle both transactions and reporting?" Also use for: designing star or snowflake schemas for analytics; deciding when column-oriented storage is appropriate; planning ETL pipeline structure between operational and analytical systems; evaluating whether HTAP (hybrid) databases fit a workload.
For choosing between relational/document/graph models, use data-model-selector instead. For storage engine internals (LSM-tree vs B-tree), use storage-engine-selector instead. For batch/stream pipeline design, use batch-pipeline-designer or stream-processing-designer instead.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/designing-data-intensive-applications/skills/oltp-olap-workload-classifier
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on: []
source-books:
- id: designing-data-intensive-applications
title: "Designing Data-Intensive Applications"
authors: ["Martin Kleppmann"]
chapters: [3]
pages: [90-103]
tags: [data-architecture, oltp, olap, data-warehouse, star-schema, column-storage, etl, analytics, database-selection]
execution:
tier: 1
mode: hybrid
inputs:
- type: description
description: "Query patterns, data volume, response time requirements, user count and access patterns — the skill guides gathering via structured questions"
- type: file
description: "Schema definitions, architecture docs, or existing query samples if available"
tools-required: [Read, Write]
tools-optional: [Grep, Glob]
mcps-required: []
environment: "Any agent environment. If a codebase or schema exists, scan for data access patterns."
---
# OLTP/OLAP Workload Classifier
## When to Use
You need to determine whether a system's data access patterns require a transactional database (OLTP), an analytic data warehouse (OLAP), or both — and then recommend the right architecture for each workload type.
Typical triggers:
- "Our analytic queries are slowing down production"
- "Should we build a data warehouse?"
- "What kind of database should we use for our reporting layer?"
- "Business analysts are running heavy queries on our app database"
- "We need to analyze years of transaction history"
- Designing a new system where both operational and reporting needs exist
Before starting: if the user is asking specifically about which storage engine (B-tree vs LSM-tree) to use for an OLTP system, use `storage-engine-selector` instead. If the user is asking about ETL pipeline architecture in detail, use `batch-pipeline-designer` after this skill.
## Context & Input Gathering
### Input Sufficiency Check
### Required Context (must have — ask if missing)
- **Query patterns:** What do the most important queries do?
- Check prompt for: SELECT patterns, GROUP BY, aggregations, JOIN count, filter on key vs scan
- If missing, ask: "Can you describe 2-3 of your most important or frequent queries? Are they looking up specific records by ID, or scanning large ranges of data to compute totals and statistics?"
- **Who runs the queries:** End users/customers or internal analysts?
- Check prompt for: mentions of "our users", "business analysts", "BI team", "dashboards", "reports"
- If missing, ask: "Who runs these queries — end users of your application, or internal analysts and data scientists?"
- **Write pattern:** How does data get in?
- Check prompt for: user-triggered writes, batch imports, event streams, ETL mentions
- If missing, ask: "How does data get written — individual transactions triggered by user actions, or bulk imports/batch jobs?"
### Important Context (strongly recommended)
- **Data volume:** Current and expected scale
- Ask: "How much data are we talking about — gigabytes, terabytes, or petabytes? How many rows in the main tables?"
- **Response time requirements:**
- Ask: "What latency is acceptable? Sub-100ms for user-facing? Seconds for analyst queries? Minutes for nightly reports?"
- **How many columns per query:** Does each query need all columns, or a few at a time?
- This determines column-oriented storage benefit
- **Historical data need:** Are queries over current state or history?
- Ask: "Are queries over the current state of data (e.g., 'what is the inventory right now?') or historical trends (e.g., 'how did sales vary across all of last year?')?"
### Observable Context (gather from environment)
- **Existing schema:** Scan for table structures — row count estimates in comments, presence of `fact_` or `dim_` table prefixes, heavy normalization vs wide flat tables
- **Query files:** Look for `.sql` files — presence of `GROUP BY`, `SUM`, `COUNT`, `AVG`, `JOIN` chains, `WHERE year =` predicates is an OLAP signal
- **ORM or application code:** Heavy `find_by_id`, `save`, `update` patterns = OLTP. Analytical query builders = OLAP.
- **Infrastructure config:** Presence of Redshift, BigQuery, Snowflake, dbt, Airflow configs = OLAP already in use or planned
### Default Assumptions
- If query patterns unknown → ask before proceeding (this is the classification's core input)
- If data volume unknown → assume "terabytes or less" (affects column storage urgency, not classification outcome)
- If response time unknown → assume user-facing needs <500ms (OLTP), analyst queries can tolerate seconds to minutes (OLAP)
- If write pattern unknown → assume user-triggered individual writes (OLTP default)
### Sufficiency Threshold
```
SUFFICIENT: query patterns + who runs them + write pattern are known
PROCEED WITH DEFAULTS: query patterns known but volume/latency not quantified
MUST ASK: query patterns are unknown
```
## Process
### Step 1: Score the Workload on 6 Dimensions
**ACTION:** Evaluate the described workload against the 6-dimension comparison table (Table 3-1 from the book). Assign OLTP or OLAP for each dimension. Count the majority.
**WHY:** OLTP and OLAP workloads differ not in degree but in kind. A single database engine is optimized for one or the other — not both. Row-oriented storage excels at OLTP (low-latency point lookups, frequent small writes) but is inefficient for OLAP (must load entire rows even when only 3 of 100 columns are needed). Column-oriented storage compresses well and accelerates aggregate scans but makes individual row inserts expensive. Classifying on all 6 dimensions prevents misclassification from a single misleading signal (e.g., a small analytic workload might have fast query times — that alone doesn't make it OLTP).
| Dimension | OLTP Signal | OLAP Signal |
|-----------|-------------|-------------|
| **Main read pattern** | Small number of records fetched by key (point lookup) | Aggregate over large number of records (scan + compute) |
| **Main write pattern** | Random-access, low-latency writes from user input | Bulk import (ETL) or event stream ingestion |
| **Primarily used by** | End user / customer via web or mobile application | Internal analyst, for business intelligence and decision support |
| **What data represents** | Latest state (current point in time) | History of events that happened over time |
| **Dataset size** | Gigabytes to terabytes | Terabytes to petabytes |
| **Bottleneck** | Disk seek time (index lookup cost) | Disk bandwidth (volume of data scanned) |
**Scoring:**
- Count OLTP vs OLAP signals across the 6 dimensions
- 5-6 OLTP signals → pure OLTP workload
- 5-6 OLAP signals → pure OLAP workload
- 3-4 signals for one type → mixed/hybrid — flag as HTAP candidate (see Step 4)
**Output:** Classification label + dimension-by-dimension score table.
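The scoring rule can be written down as a small function. This is a sketch, not part of the source method: the dimension keys and the `classify_workload` name are hypothetical, and it assumes every dimension has already been judged as either `"OLTP"` or `"OLAP"`.

```python
from collections import Counter

DIMENSIONS = (
    "read_pattern", "write_pattern", "used_by",
    "data_represents", "dataset_size", "bottleneck",
)

def classify_workload(signals: dict) -> tuple:
    """Return (label, oltp_count, olap_count) for a fully scored workload."""
    for dim in DIMENSIONS:
        if signals.get(dim) not in ("OLTP", "OLAP"):
            raise ValueError(f"dimension {dim!r} must be scored 'OLTP' or 'OLAP'")
    counts = Counter(signals[dim] for dim in DIMENSIONS)
    oltp, olap = counts["OLTP"], counts["OLAP"]
    if oltp >= 5:
        label = "pure OLTP"
    elif olap >= 5:
        label = "pure OLAP"
    else:
        label = "hybrid"   # 3-4 signals on the majority side
    return label, oltp, olap

# Example: user-facing writes, but reads are historical aggregate scans.
example = {
    "read_pattern": "OLAP",
    "write_pattern": "OLTP",
    "used_by": "OLTP",
    "data_represents": "OLAP",
    "dataset_size": "OLTP",
    "bottleneck": "OLAP",
}
result = classify_workload(example)
```

Raising on an unscored dimension mirrors the rule that unknown query patterns must be asked about rather than guessed.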
### Step 2: Route to Architecture
**ACTION:** Based on classification, select the recommended architecture path and proceed to the corresponding sub-step.
**WHY:** The routing decision is binary at its core — OLTP and OLAP systems have fundamentally different storage engine designs. Running analytic queries on an OLTP database does not just perform poorly; it actively harms the concurrent transactions by consuming disk I/O and CPU that the storage engine's indexes cannot help with. Enterprises learned this in the late 1980s and began extracting analytic workloads into separate data warehouses specifically to protect OLTP availability. The routing step enforces this architectural boundary.
- **Pure OLTP** → Step 3A (OLTP architecture guidance)
- **Pure OLAP** → Step 3B (data warehouse + schema design)
- **Mixed/hybrid** → Step 4 (HTAP separation strategy), then Step 3A + 3B
### Step 3A: OLTP Architecture Guidance
**ACTION:** For OLTP workloads, confirm the storage engine class and index strategy. This step is a quick check — detailed OLTP guidance lives in `storage-engine-selector`.
**WHY:** OLTP systems are optimized for individual record access: fast point reads (via B-tree or LSM-tree indexes), low-latency writes, and high concurrent user support. The storage engine should be selected to match the write-to-read ratio and durability requirements. Key decisions are: log-structured (LSM-tree) for write-heavy workloads, update-in-place (B-tree) for read-heavy workloads with mixed updates.
Guidance:
- Index on primary key + frequently queried foreign keys
- Normalize schema to minimize update anomalies
- Use transactions with appropriate isolation level (see `transaction-isolation-selector`)
- Do NOT allow ad-hoc analytic queries to run directly against the primary OLTP database; plan for read replicas or export
**If the user also needs analytics** → proceed to Step 3B after this step, and plan ETL from OLTP to OLAP.
### Step 3B: OLAP Architecture — Data Warehouse Design
**ACTION:** Design the data warehouse schema using the star or snowflake schema pattern. Identify the central fact table and its dimension tables.
**WHY:** Data warehouses use a different schema paradigm than operational databases. Operational databases are normalized to minimize write anomalies. But normalization hurts analytic queries — analysts need to join many tables, and every join adds latency at scan scale. The star schema deliberately denormalizes into one wide fact table (every business event as a row) surrounded by dimension tables (the who, what, where, when, why of each event). This layout makes the most common analytic queries simple: filter dimensions, join to fact table, aggregate. Column-oriented storage then further optimizes this by reading only the few columns each query actually touches rather than loading full rows.
**Sub-step 3B-1: Identify the fact table**
The fact table is the center of the star. Each row represents one business event — one sale, one page view, one sensor reading, one log entry. Key characteristics:
- Rows represent events, not entities (events are immutable once they occur)
- Very wide: typically 100+ columns including metrics (quantities, prices, durations) and foreign keys to dimension tables
- Very tall: large enterprises may hold tens of petabytes of transaction history, most of it in fact tables
- Columns include measurable facts (quantity sold, net price, response time) plus foreign keys (product_sk, store_sk, customer_sk, date_key)
Ask: "What is the core business event you are tracking? What gets recorded every time that event occurs?"
**Sub-step 3B-2: Identify dimension tables**
Dimension tables answer the "who, what, where, when, how, why" of each event in the fact table. Key characteristics:
- One row per entity (one row per product, per store, per customer, per calendar day)
- Wide but short: many descriptive columns, relatively few rows
- Connected to fact table by surrogate keys (integer foreign keys, not natural business keys)
- Even time gets a dimension table (`dim_date`) — this allows encoding attributes like `is_holiday`, `weekday`, `fiscal_quarter` that enable time-based filtering without date arithmetic
Standard dimension table set for a retail fact table:
| Dimension | Purpose | Example columns |
|-----------|---------|-----------------|
| `dim_date` | Time-based filtering and grouping | `date_key`, `year`, `month`, `day`, `weekday`, `is_holiday` |
| `dim_product` | Product attributes for filtering/grouping | `product_sk`, `sku`, `description`, `brand`, `category` |
| `dim_store` | Store/location attributes | `store_sk`, `state`, `city`, `store_type`, `open_date` |
| `dim_customer` | Customer attributes | `customer_sk`, `name`, `date_of_birth`, `segment` |
| `dim_promotion` | Promotion/campaign attributes | `promotion_sk`, `name`, `ad_type`, `coupon_type` |
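To make the fact/dimension split concrete, here is a minimal sketch using Python's built-in `sqlite3` module. SQLite is row-oriented and is used only because it needs no setup. The `fact_sales` table, the trimmed column lists, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER, is_holiday INTEGER);
CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, sku TEXT, category TEXT);
CREATE TABLE fact_sales (
    date_key   INTEGER REFERENCES dim_date(date_key),
    product_sk INTEGER REFERENCES dim_product(product_sk),
    quantity   INTEGER,
    net_price  REAL
);
""")
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?, ?)",
                 [(20240101, 2024, 1, 1), (20240102, 2024, 1, 0)])
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "SKU-1", "fruit"), (2, "SKU-2", "candy")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(20240101, 1, 3, 6.0), (20240101, 2, 1, 2.5), (20240102, 1, 2, 4.0)])

# Typical star-schema query: filter on dimensions, join to the fact table, aggregate.
rows = conn.execute("""
    SELECT p.category, SUM(f.quantity) AS units, SUM(f.net_price) AS revenue
    FROM fact_sales f
    JOIN dim_date d    ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_sk = f.product_sk
    WHERE d.is_holiday = 1
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

Because `is_holiday` is a precomputed attribute of `dim_date`, the holiday filter needs no date arithmetic in the query.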
**Sub-step 3B-3: Choose star vs snowflake schema**
| Schema | Structure | When to use |
|--------|-----------|-------------|
| **Star schema** | Dimension tables are flat (denormalized) | Preferred for analyst usability — simpler SQL, fewer joins, faster iteration |
| **Snowflake schema** | Dimensions further normalized into sub-dimensions (e.g., `dim_brand` split out from `dim_product`) | When storage is a constraint or dimension data integrity is critical; harder for analysts to query |
Default to **star schema** unless there is a specific normalization requirement. Analysts work with the schema daily — simplicity compounds.
**Sub-step 3B-4: Apply column-oriented storage**
**WHY column-oriented storage matters for OLAP:** A typical analytic query accesses 4-5 columns out of 100+ in the fact table. Row-oriented storage must load every column of every matching row from disk, paying I/O cost for 95+ columns that are immediately discarded. Column-oriented storage keeps each column in a separate file, so the query reads only the files for the columns it needs. For a 100-column fact table, a 5-column query reads roughly 1/20 of the data (a 20x reduction in I/O, assuming columns of similar width).
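The I/O arithmetic can be checked with a back-of-the-envelope estimate. The sketch assumes uncompressed, fixed-width 8-byte values and a scan of one million rows. These are simplifying assumptions, and `bytes_read` is a hypothetical helper.

```python
def bytes_read(num_rows: int, total_columns: int, columns_needed: int,
               bytes_per_value: int = 8, column_oriented: bool = False) -> int:
    """Estimate bytes scanned, assuming every value has the same fixed width."""
    columns_scanned = columns_needed if column_oriented else total_columns
    return num_rows * columns_scanned * bytes_per_value

row_io = bytes_read(1_000_000, total_columns=100, columns_needed=5)
col_io = bytes_read(1_000_000, total_columns=100, columns_needed=5, column_oriented=True)
reduction = row_io / col_io
```

Under these assumptions the row layout scans 800 MB and the column layout 40 MB. Real ratios differ once column widths vary and compression is applied.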
Additional benefits of column storage:
- **Compression:** Column files contain repetitive values (e.g., `product_sk` repeating across millions of purchases). Run-length encoding and bitmap encoding compress these dramatically — often 10x or more.
- **Vectorized processing:** CPU can iterate over compressed column data in tight loops (fits L1 cache), enabling SIMD instruction optimization — faster than row-by-row processing with branch conditions.
- **Sort optimization:** Sorting fact table rows by the most common filter column (e.g., `date_key`) enables range scans that skip large portions of data. Store multiple sort orders across replicas for different query patterns.
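The compression and sort-order points reinforce each other, which a short run-length encoding sketch makes visible. The `product_sk` values are made-up sample data. Sorting the single list here stands in for sorting the whole table by that column, since a real column store must keep all columns in the same row order.

```python
from itertools import groupby

def rle_encode(values: list) -> list:
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]

def rle_decode(runs: list) -> list:
    return [value for value, count in runs for _ in range(count)]

product_sk = [69, 69, 69, 69, 74, 31, 31, 31, 31, 29, 30, 30, 31, 31, 31, 68, 69, 69]

runs_unsorted = rle_encode(product_sk)
sorted_column = sorted(product_sk)        # stand-in for a table sorted by product_sk
runs_sorted = rle_encode(sorted_column)
```

Sorting brings equal values together, so the 18 values shrink from 8 runs to 6. On a real fact table with millions of repeats per key the effect is far larger.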
**Column storage decision criteria:**
Use column-oriented storage when:
- Fact table has 20+ columns and queries touch fewer than 10 at a time
- Dataset is terabytes or larger
- Queries are read-heavy (analysts, not concurrent transactional writes)
- Aggregate functions (SUM, COUNT, AVG, MAX) dominate query patterns
Defer column storage (use row-oriented) when:
- Dataset fits comfortably in RAM
- Queries regularly need all or most columns per row
- Write throughput is the bottleneck (column storage writes are more complex — use LSM-tree ingestion pattern if column storage is needed)
**Sub-step 3B-5: Plan the ETL pipeline**
**WHY a separate ETL pipeline:** OLTP databases must remain highly available for user-facing transactions. Business analysts running heavy scans on the OLTP database consume disk I/O and can hold locks, starving concurrent transactions. The solution is to export data from OLTP systems (as a periodic dump or a continuous stream of change events), transform it into the warehouse schema, and load it into a separate read-only warehouse. This is Extract–Transform–Load (ETL).
ETL design decisions:
| Decision | Options | Guidance |
|----------|---------|----------|
| **Extraction method** | Periodic full dump, incremental delta export, change data capture (CDC) stream | CDC is lowest-latency; full dumps are simplest but expensive at scale |
| **Transformation** | In the pipeline (ETL) or in the warehouse after loading (ELT) | ELT preferred for modern cloud warehouses with strong SQL compute (Snowflake, BigQuery, Redshift) |
| **Load frequency** | Nightly batch, hourly micro-batch, near-real-time | Match to analyst freshness requirements — nightly is often sufficient |
| **Schema management** | Separate warehouse schema from OLTP schema | Never query OLTP tables directly from analyst tools |
For detailed ETL/batch pipeline architecture, use `batch-pipeline-designer`.
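As a minimal sketch of the incremental delta export option, the example below uses two in-memory SQLite databases to stand in for the OLTP system and the warehouse. The `orders` and `fact_orders` tables, the `updated_at` watermark column, and `run_incremental_etl` are assumptions made for illustration. A production pipeline would also need to handle deletes, late-arriving updates, and failures mid-batch.

```python
import sqlite3

oltp = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

oltp.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, product TEXT, qty INTEGER,
                     unit_price REAL, updated_at TEXT);
INSERT INTO orders VALUES (1, 'apple', 2, 1.50, '2024-01-01T10:00:00'),
                          (2, 'pear',  1, 2.00, '2024-01-02T09:30:00');
""")
warehouse.execute(
    "CREATE TABLE fact_orders (order_id INTEGER PRIMARY KEY, product TEXT, "
    "quantity INTEGER, net_price REAL, date_key INTEGER)"
)

def run_incremental_etl(watermark: str) -> str:
    """Extract rows changed after `watermark`, transform, load; return the new watermark."""
    changed = oltp.execute(
        "SELECT id, product, qty, unit_price, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    for order_id, product, qty, unit_price, updated_at in changed:
        date_key = int(updated_at[:10].replace("-", ""))   # transform: derive date_key
        warehouse.execute(
            "INSERT OR REPLACE INTO fact_orders VALUES (?, ?, ?, ?, ?)",
            (order_id, product, qty, qty * unit_price, date_key),
        )
        watermark = updated_at
    warehouse.commit()
    return watermark

watermark = run_incremental_etl("")   # first run loads everything
oltp.execute("INSERT INTO orders VALUES (3, 'plum', 4, 0.50, '2024-01-03T08:00:00')")
watermark = run_incremental_etl(watermark)   # second run picks up only the new row

loaded = warehouse.execute(
    "SELECT order_id, net_price, date_key FROM fact_orders ORDER BY order_id"
).fetchall()
```

The watermark comparison relies on ISO 8601 timestamps sorting lexicographically. A CDC stream would replace the `updated_at` polling with change events from the database log.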
### Step 4: Hybrid Workload (HTAP) — Separation Strategy
**ACTION:** When the workload shows both OLTP and OLAP signals (typically 3-4 signals on one side and 2-3 on the other), design a two-tier architecture that separates operational and analytical processing.
**WHY:** A hybrid workload, the kind that Hybrid Transactional/Analytical Processing (HTAP) databases target, is the most common real-world scenario: an application database that also needs to support reporting. Running both on the same database is the default, and the default fails at scale: analytic queries can hold locks, consume disk bandwidth, and compete with user-facing transactions for buffer pool space. The solution recommended here is architectural separation: one system optimized for transactions, one for analytics, with a data pipeline connecting them. The separation can be light (read replica + materialized views for small scale) or full (dedicated warehouse at large scale).
Separation options by scale:
| Scale | Approach | Tooling |
|-------|----------|---------|
| Small (GB, few analysts) | Read replica + materialized views | PostgreSQL read replica, scheduled view refresh |
| Medium (TB, regular reporting) | Lightweight warehouse or columnar extension | DuckDB, ClickHouse, or PostgreSQL + TimescaleDB |
| Large (multi-TB, BI team) | Dedicated data warehouse + ETL pipeline | Snowflake, BigQuery, Redshift, Apache Hive |
| Very large (PB, real-time analytics) | Streaming pipeline + columnar store | Kafka CDC + Apache Pinot, Druid, or Flink + Iceberg |
**Decision rule:** If OLTP system availability is critical (SLA requirements), always separate — even at small scale. Analytic queries running on a production database are an availability risk that compounds as data grows.
### Step 5: Document the Decision
**ACTION:** Write a concise architecture decision document capturing the classification, recommendation, and key trade-offs.
**WHY:** OLTP/OLAP decisions have downstream consequences for schema design, team structure (DBA vs data engineer), tooling procurement, and pipeline work. Documenting the reasoning prevents the decision from being relitigated as systems grow, and gives future engineers the context for why the architecture is the way it is.
Use the output template below.
## Inputs
- Description of query patterns (point lookups vs scans and aggregations)
- Write pattern (user-triggered vs batch)
- Who uses the system (end users vs analysts)
- Data volume (current and projected)
- Response time requirements
- Existing schema or codebase (optional — scan if available)
## Outputs
### Workload Classification Report
```markdown
# Workload Classification: {System Name}
## Classification Result: {OLTP / OLAP / Hybrid}
### 6-Dimension Scorecard
| Dimension | Your Workload | Signal |
|-----------|--------------|--------|
| Main read pattern | {description} | OLTP / OLAP |
| Main write pattern | {description} | OLTP / OLAP |
| Primary users | {description} | OLTP / OLAP |
| What data represents | {description} | OLTP / OLAP |
| Dataset size | {estimate} | OLTP / OLAP |
| Primary bottleneck | {seek time / bandwidth} | OLTP / OLAP |
**Score: {X} OLTP / {Y} OLAP → Classification: {label}**
## Architecture Recommendation
### {OLTP / OLAP / Both}
**Recommended architecture:** {description}
**Key decisions:**
- Storage engine: {row-oriented / column-oriented / both}
- Schema: {normalized / star schema / snowflake schema}
- Data pipeline: {none needed / ETL nightly / CDC streaming}
- Separation strategy: {single DB / read replica / dedicated warehouse}
### Schema Design (if OLAP)
**Fact table:** `fact_{event}` — one row per {event type}
- Metrics: {list measurable columns}
- Foreign keys: {list dimension references}
**Dimension tables:**
- `dim_date` — {key time attributes}
- `dim_{entity}` — {key descriptive attributes}
- {additional dimensions}
**Schema choice:** Star / Snowflake — {reason}
### Column Storage Decision
{Apply / Defer} column-oriented storage — {reasoning}
### ETL Plan (if applicable)
- Extraction: {method}
- Transformation: {ETL in pipeline / ELT in warehouse}
- Load frequency: {schedule}
- Tools: {recommended}
## Trade-offs
**What we gain:** {performance, separation, scalability}
**What we accept:** {operational complexity, pipeline latency, cost}
## Next Steps
1. {First concrete action}
2. {Second action}
3. {If OLAP: use batch-pipeline-designer for ETL pipeline design}
```
## What Can Go Wrong
**Running analytics on the OLTP database.** The most common mistake. Business analysts get direct database credentials "temporarily" and the arrangement becomes permanent. As data grows, analytic queries take longer, table scans compete with user transactions for buffer pool, and OLTP latency degrades. Prevention: enforce separation early, before it becomes a political problem.
**Misclassifying a hybrid workload as pure OLAP.** If a system needs both fresh operational data (OLTP) and historical aggregate analysis (OLAP), designing only a warehouse misses the operational layer. The warehouse will always lag behind the operational system by the ETL interval — if analysts need current-minute data, a warehouse-only design fails.
**Choosing snowflake schema when star schema suffices.** Snowflake schemas are more normalized but require more joins in every analytic query. For most data warehouses, the storage savings don't justify the analyst experience penalty. Default to star schema and only move to snowflake when storage is genuinely constrained or dimension table updates are frequent.
**Not accounting for write complexity of column-oriented storage.** Column storage is optimized for reads. An update-in-place approach requires rewriting all column files for each affected row. Use LSM-tree ingestion (batch writes accumulate in a row-oriented in-memory store, then merge-flush to column files) for any column store that needs ingestion throughput. Systems like Vertica do this natively.
**Building a data cube too early and losing query flexibility.** Materialized aggregates (OLAP cubes) precompute answers to known queries and are very fast. But they can't answer questions their dimensions don't include. Most warehouses keep raw fact table data as the primary store and use cubes only as a performance layer for known high-frequency queries. Don't replace raw data with cubes.
**ETL pipeline latency mismatch.** A nightly ETL batch is inadequate if analysts need same-day data for decisions. Design the freshness requirement into the ETL architecture from the start — nightly batch, hourly micro-batch, or CDC streaming have very different pipeline architectures.
## Key Principles
- **The OLTP/OLAP divide is structural, not a matter of scale.** A small dataset can have an OLAP access pattern. A large dataset can be OLTP. Classification is about query shape and write pattern, not volume alone.
- **Protect OLTP availability.** OLTP systems power user-facing transactions — they must remain highly available. Any analytic access that risks OLTP availability must be separated architecturally, not managed by query limits or time windows.
- **Fact tables capture events, not entities.** The fact table records what happened — each purchase, each click, each sensor reading — as an immutable event row. Dimension tables describe the participants (the who/what/where). This separation is what enables later flexibility in analysis.
- **Star schema prioritizes analyst usability.** Fewer joins = faster iteration for analysts. Most data warehouses use star schema even at the cost of some normalization because the analyst experience pays compound dividends over time.
- **Column storage = query the columns you need, not the rows you have.** The insight is simple: if your query only needs 4 of 100 columns, reading all 100 columns for every row is 25x more I/O than necessary. Column-oriented storage eliminates that waste.
- **ETL separates concerns cleanly.** OLTP systems export data in their format; the ETL pipeline transforms it into the warehouse's format. Neither system needs to know the details of the other. This decoupling is the architectural payoff of the separate warehouse pattern.
## Examples
**Scenario: E-commerce company with slow reporting**
Trigger: "Our monthly sales reports are taking 20 minutes to run, and our DBAs say it's affecting checkout latency."
Process: Step 1 scoring — monthly reports aggregate all sales by region, category, and promotion (OLAP: read pattern, data represents history, disk bandwidth bottleneck); checkout is user-triggered individual record writes (OLTP: write pattern, used by end users, latest state). Score: 3 OLTP / 3 OLAP → Hybrid. Step 4: two-tier separation. OLTP: PostgreSQL for checkout and inventory. OLAP: dedicated warehouse (Redshift or Snowflake). ETL: nightly batch from OLTP to warehouse — analysts run against warehouse, never production DB. Step 3B: star schema with `fact_orders` at center (one row per order line), dimensions: `dim_product`, `dim_store`, `dim_customer`, `dim_date`, `dim_promotion`. Column storage on warehouse — fact table has 85 columns, typical reports use 6-8.
Output: Hybrid architecture. Nightly ETL extracts from PostgreSQL, loads star schema in Snowflake. Checkout latency restored. Analysts get dedicated compute without SLA risk.
**Scenario: SaaS application building its first analytics feature**
Trigger: "We want to add a dashboard showing our customers their usage trends over the past 12 months."
Process: Step 1 scoring: usage trends require scanning all events per customer over a year, aggregating by day/week (OLAP: read pattern, aggregation, history). But the events themselves are written one at a time by user actions (OLTP write pattern), so the workload is mixed. Volume: currently 50GB, projected 500GB in two years. Response time: dashboard can tolerate 2-3 second loads. Step 4: small-scale separation. Read replica with pre-aggregated materialized views is sufficient now; design schema to migrate to a proper warehouse when data exceeds 1TB or query complexity grows. Step 3B: `fact_usage_events` (one row per usage event per user per day), `dim_date`, `dim_user`, `dim_feature`. Star schema. Column-oriented storage deferred until >1TB because current data fits in memory.
Output: Start with PostgreSQL read replica + materialized views refreshed hourly. Schema designed as star schema from day one. Explicit migration trigger: add dedicated warehouse when data exceeds 1TB or query refresh time exceeds 10 seconds.
**Scenario: IoT sensor data platform**
Trigger: "We collect readings from 10,000 sensors every 30 seconds. We need to detect anomalies in real-time but also run monthly trend analysis."
Process: Step 1 scoring: 10,000 sensors × 2 readings/min × 60 min × 24 hr = ~28.8M rows/day. Historical trend analysis scans months of data for aggregation (OLAP: read pattern, dataset size, bandwidth bottleneck, history). Real-time anomaly detection reads the latest N readings per sensor (OLTP: read by key, latest state, low-latency). Score: 3 OLTP / 3 OLAP → Hybrid, but the write pattern (streaming ingestion at high throughput) tilts toward OLAP infrastructure. Step 4: streaming architecture, with Kafka for ingestion, Flink or Spark Streaming for real-time anomaly detection (OLTP path), and Apache Iceberg or ClickHouse for historical columnar storage (OLAP path). Step 3B: `fact_readings` (one row per sensor reading with timestamp, sensor_id, value, unit), `dim_sensor` (location, type, calibration metadata), `dim_date`. Column storage applied immediately: ~10.5B rows/year (28.8M rows/day × 365), and queries scan months of data for trend analysis.
Output: Two-path architecture. Real-time: Kafka → Flink anomaly detection → alert system. Historical: Kafka → ClickHouse (columnar) for trend queries. Shared ingestion pipeline, separate read paths. Cross-reference `batch-pipeline-designer` for ETL scheduling on the historical path.
## References
- For OLTP storage engine selection (B-tree vs LSM-tree), use `storage-engine-selector`
- For ETL and batch pipeline architecture, use `batch-pipeline-designer`
- For detailed comparison table and schema templates, see [references/workload-comparison-table.md](references/workload-comparison-table.md)
- Source: Designing Data-Intensive Applications, Ch. 3, "Transaction Processing or Analytics?" (pp. 90-103), Martin Kleppmann
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Designing Data-Intensive Applications by Martin Kleppmann.
## Related BookForge Skills
This skill is standalone. Browse more BookForge skills: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/workload-comparison-table.md
# OLTP vs OLAP Workload Reference
Source: Designing Data-Intensive Applications, Ch. 3 (Table 3-1), Martin Kleppmann
## 6-Dimension Comparison Table
| Property | OLTP (Transaction Processing) | OLAP (Analytic Systems) |
|----------|------------------------------|------------------------|
| **Main read pattern** | Small number of records per query, fetched by key | Aggregate over large number of records |
| **Main write pattern** | Random-access, low-latency writes from user input | Bulk import (ETL) or event stream |
| **Primarily used by** | End user / customer, via web application | Internal analyst, for decision support |
| **What data represents** | Latest state of data (current point in time) | History of events that happened over time |
| **Dataset size** | Gigabytes to terabytes | Terabytes to petabytes |
| **Bottleneck** | Disk seek time (index lookup) | Disk bandwidth (data scanned per query) |
## Bottleneck Explained
**OLTP — disk seek time:** Each user query touches a small number of records. The storage engine uses an index (B-tree or LSM-tree) to find those records. The cost is the seek — how fast the disk can jump to the right location. Optimizing for OLTP means fast index lookups and low per-record I/O.
**OLAP — disk bandwidth:** Each analytic query scans millions or billions of rows, reading a few columns from each. The bottleneck is not seek time (the query is doing a sequential scan, not random access) but total bytes read from disk. Optimizing for OLAP means reducing the volume of data that must be transferred — column-oriented storage is the primary tool.
## Star Schema Components
### Fact Table
| Characteristic | Detail |
|---------------|--------|
| **What it stores** | One row per business event (one row per sale, click, reading) |
| **Row count** | Billions to trillions — very tall |
| **Column count** | 100+ — very wide (all metrics + all dimension foreign keys) |
| **Column types** | Measurable facts (quantities, prices, durations) + surrogate key foreign keys |
| **Mutability** | Append-only — events are immutable once recorded |
### Dimension Tables
| Characteristic | Detail |
|---------------|--------|
| **What they store** | Descriptive attributes of the entities involved in events |
| **Row count** | Thousands to millions — relatively short |
| **Column count** | Wide — all descriptive attributes relevant to analysis |
| **Relationship to fact** | Connected by surrogate keys (integer, not natural business key) |
| **Examples** | `dim_date`, `dim_product`, `dim_store`, `dim_customer`, `dim_promotion` |
### Surrogate Keys
Dimension tables use surrogate keys (auto-increment integers like `product_sk`) rather than natural business keys (like `sku`). Reasons:
- Natural keys can change (product SKUs get reassigned); surrogate keys never change
- Integers join faster than strings
- Allows encoding multiple versions of a dimension record (slowly changing dimensions)
## Star vs Snowflake Schema
| Factor | Star | Snowflake |
|--------|------|-----------|
| **Dimension structure** | Flat, denormalized | Further normalized into sub-dimensions |
| **Join complexity** | Lower — one join per dimension | Higher — multiple joins to traverse sub-dimensions |
| **Query simplicity** | High — analysts prefer | Lower — more complex SQL |
| **Storage efficiency** | Lower — some redundancy in dimension tables | Higher — normalized sub-tables eliminate redundancy |
| **Data integrity** | Lower: a repeated attribute value must be updated in every dimension row that carries it | Higher: each attribute value is stored once in a sub-dimension |
| **Default choice** | Yes — preferred in most warehouses | Only when storage or update frequency justifies it |
## Column-Oriented Storage Decision Criteria
### Apply column-oriented storage when:
- Fact table has 20+ columns and typical queries access fewer than 10
- Dataset exceeds 1TB (compression and I/O reduction pays off at scale)
- Query mix is dominated by aggregations (SUM, COUNT, AVG, GROUP BY)
- Write throughput is manageable (column stores typically buffer incoming writes and merge them into the column files in bulk, an LSM-tree-style approach)
### Defer column-oriented storage when:
- Dataset fits comfortably in RAM (caching eliminates disk I/O bottleneck)
- Most queries need all or most columns (row storage is more efficient here)
- Write-heavy workload with individual row updates (column stores prefer batch writes)
### Column Compression Techniques
| Technique | When it helps | Compression ratio |
|-----------|--------------|-------------------|
| **Run-length encoding** | Low-cardinality columns, sorted data | 10x-100x for sorted low-cardinality |
| **Bitmap encoding** | Low-cardinality columns (< ~100K distinct values) | Very efficient for WHERE IN queries |
| **Dictionary encoding** | String columns with repeated values | Significant for category, status, region columns |
| **Delta encoding** | Monotonically increasing values (timestamps, IDs) | 4-8x for sequential IDs |
Bitmap encoding is particularly powerful for WHERE predicates in data warehouse queries:
```sql
WHERE product_sk IN (30, 68, 69)
```
Load bitmaps for `product_sk = 30`, `product_sk = 68`, `product_sk = 69` and bitwise OR them. Very fast.
```sql
WHERE product_sk = 31 AND store_sk = 3
```
Load bitmaps for both, bitwise AND them. Both columns must be in the same row order — this is preserved by column storage layout.
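The two predicates above can be sketched with plain Python integers acting as bitmaps. This is an illustration under simplifying assumptions, not how any particular warehouse implements it: the column data is hypothetical, and `build_bitmaps` and `matching_rows` are names invented for this example.

```python
# Bitmap-encode a low-cardinality column and evaluate predicates with bitwise ops.
# Each distinct value gets one integer used as a bitmap: bit i is set when row i holds it.

def build_bitmaps(column):
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps

def matching_rows(bitmap):
    return [row for row in range(bitmap.bit_length()) if bitmap >> row & 1]

# Hypothetical column data; both columns share the same row order.
product_sk = [69, 69, 31, 30, 31, 68, 31, 29]
store_sk = [3, 1, 3, 2, 3, 3, 1, 3]

product_bits = build_bitmaps(product_sk)
store_bits = build_bitmaps(store_sk)

# WHERE product_sk IN (30, 68, 69): OR the three bitmaps.
in_list = product_bits[30] | product_bits[68] | product_bits[69]
# WHERE product_sk = 31 AND store_sk = 3: AND the two bitmaps.
both = product_bits[31] & store_bits[3]

print(matching_rows(in_list))  # [0, 1, 3, 5]
print(matching_rows(both))     # [2, 4]
```

The AND only gives the right answer because row `i` means the same record in both columns, which is the row-order requirement stated above.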
## Example Star Schema: Grocery Retailer
From Designing Data-Intensive Applications, Figure 3-9.
```sql
-- Dimension tables are created first so that the fact table's
-- foreign-key references resolve when the script runs top to bottom.
-- Date dimension: encodes time attributes useful for analysis
CREATE TABLE dim_date (
    date_key    INTEGER PRIMARY KEY,
    year        INTEGER,
    month       VARCHAR(3),
    day         INTEGER,
    weekday     VARCHAR(3),
    is_holiday  BOOLEAN
);
-- Product dimension: all attributes relevant to product analysis
CREATE TABLE dim_product (
    product_sk   INTEGER PRIMARY KEY,
    sku          VARCHAR(20),
    description  VARCHAR(200),
    brand        VARCHAR(100),
    category     VARCHAR(100)
);
-- Store dimension: all attributes relevant to store analysis
CREATE TABLE dim_store (
    store_sk  INTEGER PRIMARY KEY,
    state     VARCHAR(2),
    city      VARCHAR(100)
);
-- Customer dimension
CREATE TABLE dim_customer (
    customer_sk    INTEGER PRIMARY KEY,
    name           VARCHAR(200),
    date_of_birth  DATE
);
-- Promotion dimension (NULL promotion_sk in fact table = no promotion applied)
CREATE TABLE dim_promotion (
    promotion_sk  INTEGER PRIMARY KEY,
    name          VARCHAR(200),
    ad_type       VARCHAR(50),
    coupon_type   VARCHAR(50)
);
-- Fact table: one row per customer purchase (line item)
CREATE TABLE fact_sales (
    date_key        INTEGER REFERENCES dim_date(date_key),
    product_sk      INTEGER REFERENCES dim_product(product_sk),
    store_sk        INTEGER REFERENCES dim_store(store_sk),
    promotion_sk    INTEGER REFERENCES dim_promotion(promotion_sk), -- NULL if no promotion
    customer_sk     INTEGER REFERENCES dim_customer(customer_sk),   -- NULL if unknown
    quantity        INTEGER,
    net_price       DECIMAL(10,2),
    discount_price  DECIMAL(10,2)
);
```
### Example Analytic Query (from the book)
```sql
-- Analyzing whether people are more inclined to buy fresh fruit or candy
-- depending on the day of the week (2013 calendar year)
SELECT
dim_date.weekday,
dim_product.category,
SUM(fact_sales.quantity) AS quantity_sold
FROM fact_sales
JOIN dim_date ON fact_sales.date_key = dim_date.date_key
JOIN dim_product ON fact_sales.product_sk = dim_product.product_sk
WHERE
dim_date.year = 2013
AND dim_product.category IN ('Fresh fruit', 'Candy')
GROUP BY
dim_date.weekday,
dim_product.category;
```
This query reads only 3 columns from `fact_sales` (`date_key`, `product_sk`, `quantity`). In a 100-column fact table with row-oriented storage, the engine must load all 100 columns of every row it scans. With column-oriented storage, it reads only those 3 column files: roughly a 33x I/O reduction (100 / 3), assuming columns of similar width and ignoring compression.
## ETL Pipeline Overview
```
OLTP Systems Pipeline OLAP Warehouse
─────────────────────────────────────────────────────────────────────
┌─────────────┐ extract ┌──────────┐ transform ┌─────────────┐
│ Sales DB │ ─────────► │ │ ──────────► │ │
├─────────────┤ │ ETL │ │ Data │
│ Inventory │ ─────────► │ Process │ ──────────► │ Warehouse │◄── Business
│ DB │ │ │ │ │ Analysts
├─────────────┤ extract └──────────┘ load └─────────────┘
│ Geo DB │ ─────────────────────────────────►
└─────────────┘
```
- **Extract:** Periodic full dump, incremental delta, or CDC stream from OLTP
- **Transform:** Apply business rules, clean data, map to warehouse schema, resolve surrogate keys
- **Load:** Bulk insert into fact and dimension tables
### ELT vs ETL
| Approach | Description | When to use |
|----------|-------------|-------------|
| **ETL** | Transform before loading into warehouse | Legacy warehouses with limited compute; transformation logic is complex |
| **ELT** | Load raw data, transform inside warehouse using SQL | Modern cloud warehouses (Snowflake, BigQuery, Redshift) — warehouse compute is cheap and powerful |
## Materialized Views and OLAP Cubes
**Materialized view:** An actual copy of query results stored on disk, updated when underlying data changes. Unlike a regular view (which is just a saved query), a materialized view is precomputed. Reads are fast; writes are more expensive (must update the view).
**OLAP cube (data cube):** A special case of materialized view. A multi-dimensional grid of precomputed aggregates. For example, `SUM(net_price)` grouped by every combination of `(date_key × product_sk)` — a 2D cube. With 5 dimensions, a 5-dimensional hypercube.
**Trade-off:** Cubes are very fast for queries that match the cube's dimensions, but cannot answer queries involving attributes not included in the cube. Raw fact table data is more flexible. Most warehouses keep raw data as the primary store and layer cubes as a performance optimization for known high-frequency queries.
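A two-dimensional cube and its trade-off can be sketched in a few lines. The fact rows below are hypothetical, prices are in cents to keep the arithmetic exact, and a real warehouse would maintain these aggregates itself rather than in application code.

```python
from collections import defaultdict

# Hypothetical fact rows: (date_key, product_sk, net_price in cents).
fact_sales = [
    (140101, 31, 250),
    (140101, 69, 1400),
    (140102, 31, 250),
    (140102, 31, 300),
    (140102, 74, 999),
]

# 2D cube: SUM(net_price) for every (date_key, product_sk) combination.
cube = defaultdict(int)
for date_key, product_sk, net_price in fact_sales:
    cube[(date_key, product_sk)] += net_price

# Collapsing one dimension gives per-date and per-product totals.
by_date = defaultdict(int)
by_product = defaultdict(int)
for (date_key, product_sk), total in cube.items():
    by_date[date_key] += total
    by_product[product_sk] += total

print(cube[(140102, 31)])  # 550
print(by_date[140101])     # 1650
print(by_product[31])      # 800
```

Once only `cube` is kept, a question such as "how many sales were above 300 cents?" can no longer be answered, because the individual prices were summed away. That is the flexibility cost described above.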
## Common OLAP Database Technologies
| Scale | Technology | Type | Notes |
|-------|-----------|------|-------|
| Small | DuckDB, SQLite | Embedded (DuckDB is columnar; SQLite is row-oriented) | Excellent for <100GB, local analysis |
| Medium | ClickHouse, TimescaleDB | Columnar OLAP | Self-hosted, high performance |
| Large (cloud) | Snowflake, BigQuery, Redshift | Managed cloud warehouse | Fully managed, scales to PB |
| Large (open source) | Apache Hive, Spark SQL, Presto | SQL-on-Hadoop | Open source equivalents |
| Very large | Apache Druid, Apache Pinot | Real-time OLAP | Sub-second queries on streaming data |
| Columnar format | Apache Parquet, ORC | Storage format | Used with Spark, Flink, Presto |
Select a data encoding format (JSON, Protobuf, Thrift, or Avro) and design a schema evolution strategy that preserves backward and forward compatibility thro...
---
name: encoding-format-advisor
description: |
Select a data encoding format (JSON, Protobuf, Thrift, or Avro) and design a schema evolution strategy that preserves backward and forward compatibility through rolling upgrades. Use when asked "should I use Protobuf or JSON?", "how do I evolve my schema without breaking old clients?", "how does Avro schema evolution work?", "what's the difference between Thrift and Protocol Buffers?", or "how do I add/remove fields without breaking compatibility?" Also use for: choosing text vs. binary encoding for internal services; checking whether a schema change breaks compatibility; diagnosing unknown field loss bugs during rolling upgrades; planning per-dataflow encoding strategy (database storage vs. REST/RPC vs. message broker).
Covers five encoding families: language-specific, JSON/XML/CSV, binary JSON, Thrift/Protobuf, and Avro — with writer/reader schema reconciliation and per-dataflow-mode analysis.
For data model selection (relational/document/graph), use data-model-selector instead. For message broker or stream pipeline design, use stream-processing-designer instead.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/designing-data-intensive-applications/skills/encoding-format-advisor
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on: [data-model-selector]
source-books:
- id: designing-data-intensive-applications
title: "Designing Data-Intensive Applications"
authors: ["Martin Kleppmann"]
chapters: [4]
tags: [encoding, serialization, schema-evolution, protobuf, thrift, avro, json, xml, backward-compatibility, forward-compatibility, rolling-upgrade, binary-encoding, schema-registry, dataflow, message-broker, rpc, grpc, evolvability]
execution:
tier: 1
mode: hybrid
inputs:
- type: document
description: "Current data format, schema definition files (.proto, .thrift, .avsc, .json), system topology description, or architecture document describing services and their data exchange"
- type: code
description: "Application source files using encoding libraries, or schema files to analyze for compatibility issues"
tools-required: [Read, Write]
tools-optional: [Grep]
mcps-required: []
environment: "Any agent environment. Works with pasted schema definitions, .proto/.thrift/.avsc files, docker-compose.yml, architecture.md documents, or codebase analysis."
discovery:
goal: "Produce a concrete encoding format recommendation with per-format compatibility rules and a schema evolution plan — not a survey of encoding options"
tasks:
- "Classify the dataflow mode (database, service calls, async messaging) to constrain format selection"
- "Score each encoding family against six criteria for the specific system"
- "Apply compatibility rules for the recommended format to validate each planned schema change"
- "Produce a schema evolution plan with safe change procedures and prohibited operations"
- "Document dataflow-specific encoding guidance and watch signals"
audience:
roles: ["backend-engineer", "software-architect", "data-engineer", "tech-lead", "site-reliability-engineer"]
experience: "intermediate-to-advanced — assumes familiarity with building network services and basic serialization concepts"
triggers:
- "Team is designing a new internal service API and needs to choose an encoding format"
- "System uses rolling upgrades and needs a format that supports backward and forward compatibility"
- "Engineer needs to add/remove/rename a field and is unsure whether it breaks compatibility"
- "Service uses JSON but is hitting performance or payload size problems at scale"
- "Team is choosing between gRPC (Protobuf) and REST (JSON) for internal services"
- "Data pipeline needs a schema evolution strategy for records stored in a message broker or data lake"
- "Old-code/new-code coexistence is required and unknown fields are being silently dropped"
not_for:
- "Choosing specific RPC framework (gRPC vs Thrift RPC vs Avro RPC) beyond encoding format — these inherit their encoding's compatibility properties"
- "Database replication encoding format — typically database-internal and not configurable"
- "Encryption or compression decisions — separate from encoding format selection"
- "Data model shape (relational vs. document vs. graph) — use data-model-selector first"
---
# Encoding Format Advisor
## When to Use
You are designing or evolving a system that passes data between processes — over the network, through a message broker, or persisted to disk — and need to choose how to encode that data and how to evolve the schema over time without breaking running services.
This skill applies when:
- You are choosing an encoding format before building or extending a service API
- Your system uses rolling upgrades: new and old versions of code run simultaneously, reading and writing the same data
- You plan schema changes (add field, remove field, rename field, change type) and need to know whether each change is safe
- You are experiencing performance or payload problems with a text-based format and evaluating binary alternatives
- Different teams or services read the same data and cannot all upgrade simultaneously
- You need to store data durably (database, file, Kafka topic) where records outlive the code that wrote them
**This skill addresses format selection and schema evolution.** For data shape decisions (relational vs. document vs. graph), use `data-model-selector` first. For stream processing pipeline design using encoded data, see `stream-processing-designer`.
---
## Context & Input Gathering
Before running the selection framework, collect:
### Required
- **Dataflow mode:** How does data move? Database writes/reads, synchronous service calls (REST/RPC), or asynchronous message passing (message broker/queue)?
- **System boundaries:** Is this data exchanged within one organization's services, or across organizational boundaries (public API, partner integrations)?
- **Rolling upgrade requirement:** Do you need old code and new code to read the same data simultaneously? Or can you do full-fleet restarts?
- **Schema change plan:** What changes are you planning (or expecting over the system's lifetime)? Add fields? Remove fields? Change field types? Rename fields?
### Important
- **Language and platform:** What programming languages are involved on both the writer and reader side? Are they statically typed (Java, Go, C++) or dynamically typed (Python, JavaScript, Ruby)?
- **Current format (if any):** Provide existing schema files (`.proto`, `.thrift`, `.avsc`, JSON Schema) or representative payload samples for analysis
- **Performance constraints:** Are payload size or parse throughput meaningful concerns (terabyte-scale data, high-frequency messaging)?
- **Team tooling familiarity:** What encoding formats does the team already operate? (Operational familiarity is a legitimate factor.)
### Optional (improves recommendation precision)
- **Architecture or data flow documents:** Help identify all producers and consumers of each data type
- **docker-compose.yml or service topology:** Identifies message brokers, data stores, and service-to-service connections
- **Existing codebase:** Grep for encoding library imports and schema definitions to understand current baseline
If the dataflow mode and rolling upgrade requirement are missing, ask before proceeding. A format recommendation without knowing these two factors is unreliable.
---
## Process
### Step 1: Classify the Dataflow Mode
**Action:** Determine how data flows between the processes you are encoding for. The three dataflow modes impose different constraints on format selection — especially on schema version negotiation and how long compatibility must be maintained.
**WHY:** Encoding format compatibility is a property of a relationship between a writer process and a reader process. That relationship looks fundamentally different in each dataflow mode. In databases, data written today may be read five years from now by the same application running a different schema version: data outlives code. In synchronous service calls (REST/RPC), you can assume servers upgrade before clients, which reduces the requirement to backward compatibility on requests and forward compatibility on responses. In asynchronous message passing, a consumer may be processing a message written weeks ago by a producer that has since been upgraded or decommissioned. Each mode changes which compatibility properties you need, and therefore which formats are viable.
**Dataflow Mode A — Databases (data-at-rest)**
- One process writes to the database; another (possibly the same process, later) reads it
- Data persists independently of the code that wrote it — data outlives code
- Schema changes may leave old and new records mixed in the same table or collection for years
- Backward compatibility is essential: new code must read data written by old code
- Forward compatibility is also required when multiple application versions run simultaneously (rolling upgrade): old code reads data written by new code
- Archival export or snapshot files also fall here
**Dataflow Mode B — Synchronous service calls (REST, RPC)**
- Client sends a request; server responds; both parties are online simultaneously
- Reasonable assumption: servers upgrade before clients (staged rollout: update servers first, then clients)
- Backward compatibility on requests (old client sends, new server reads) and forward compatibility on responses (new server sends, old client reads)
- Across organizational boundaries (public APIs): clients you don't control may never upgrade — compatibility must be maintained indefinitely
- Within one organization: API versioning gives you control over the upgrade window
**Dataflow Mode C — Asynchronous message passing (message brokers, event streams)**
- Producer encodes a message; broker holds it temporarily; consumer decodes it later
- Producer and consumer are decoupled — they may be at different schema versions
- If a consumer republishes messages to a downstream topic, it must preserve unknown fields it cannot interpret (or it will silently corrupt the event stream)
- Schema must support backward and forward compatibility: messages written by old producers must be readable by new consumers, and vice versa
- Schema registry (e.g., Confluent Schema Registry for Avro/Kafka) becomes essential at scale
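The unknown-field hazard in Mode C can be shown with plain JSON messages. This is a sketch under assumptions: the message shape, the `session_id` field, and both function names are hypothetical, and a real consumer would use its broker's client library rather than raw bytes.

```python
import json

def republish_lossy(raw: bytes) -> bytes:
    """Old consumer that rebuilds the message from the fields it knows."""
    msg = json.loads(raw)
    known = {"user_id": msg["user_id"], "action": msg["action"].lower()}
    return json.dumps(known).encode()

def republish_preserving(raw: bytes) -> bytes:
    """Old consumer that edits in place, so unknown fields pass through."""
    msg = json.loads(raw)
    msg["action"] = msg["action"].lower()
    return json.dumps(msg).encode()

# A newer producer has added `session_id`, which this consumer does not know.
incoming = json.dumps(
    {"user_id": 7, "action": "CLICK", "session_id": "abc"}
).encode()

print(json.loads(republish_lossy(incoming)))       # session_id is gone
print(json.loads(republish_preserving(incoming)))  # session_id survives
```

The lossy version raises no error, which is why this kind of corruption tends to be noticed only downstream.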
---
### Step 2: Score Each Encoding Family Against Your Requirements
**Action:** Score each of the five encoding families against six criteria relevant to your system. Score 1–5 per criterion per family. Skip "language-specific" family unless evaluating whether to use it (the answer is almost always no).
**WHY:** Engineers frequently default to JSON because it is familiar or default to Protocol Buffers because they "heard it's faster," without evaluating the actual criteria relevant to their system. The scoring forces examination of the criteria that change the decision: whether the schema must be dynamically generated (favors Avro over Protobuf), whether human-readability is required for debugging (favors JSON), whether the data crosses organizational boundaries (favors JSON/REST), whether statically typed code generation is valued (favors Thrift/Protobuf), and whether schema evolution flexibility is the primary constraint (favors Avro). Running all families — even the obvious misfits — produces the rationale needed for a technical decision document.
**Encoding families:**
| Family | Examples | Key characteristic |
|--------|----------|-------------------|
| Language-specific | Java Serializable, Python pickle, Ruby Marshal | Built into the language; no cross-language support |
| Text-based | JSON, XML, CSV | Human-readable; self-describing; no schema required |
| Binary JSON variants | MessagePack, BSON, CBOR | JSON data model; binary encoding; still no schema |
| Schema-driven binary (tag-based) | Apache Thrift (BinaryProtocol, CompactProtocol), Protocol Buffers | Schema required; field tags in encoded data; compact |
| Schema-driven binary (name-based) | Apache Avro | Schema required; no tags in encoded data; writer and reader schemas resolved at decode time |
**Scoring criteria:**
**1. Cross-language support** — Can both writer and reader sides use this format regardless of programming language?
- 5 = Fully cross-language with libraries for all major languages
- 3 = Supported in most major languages; some gaps
- 1 = Tied to a single language or platform (disqualifying for most systems)
**2. Schema evolution safety** — Does the format provide explicit mechanisms for backward and forward compatibility as the schema changes?
- 5 = Explicit rules; format enforces compatibility (field tags / name-based resolution); incompatible changes detected at schema check time
- 3 = Possible with discipline; no automatic detection of incompatible changes
- 1 = No versioning support; any schema change may break readers
**3. Payload compactness** — How large are encoded payloads compared to the logical data?
- 5 = Very compact; field names omitted from encoded data; variable-length integers
- 3 = Moderate; some overhead (binary JSON keeps field names; text has quotes and punctuation)
- 1 = Verbose; full field names plus type metadata in every record (text-based formats at scale)
**4. Human-readability and debuggability** — Can engineers read and debug encoded data without tooling?
- 5 = Directly human-readable; pasteable into browser/curl for debugging
- 3 = Readable with lightweight tooling (e.g., `protoc --decode_raw`, `avro-tools`)
- 1 = Binary; requires schema and decode tool to inspect
**5. Code generation and type safety** — Does the format support generating typed structs or classes from a schema, enabling compile-time type checking?
- 5 = First-class code generation; strong type safety in statically typed languages
- 3 = Optional code generation; usable in both statically and dynamically typed contexts
- 1 = No schema; all types inferred at runtime; no static type checking
**6. Dynamically generated schema support** — Can schemas be generated programmatically (e.g., from a database table definition) without manual field tag assignment?
- 5 = Tag-free schema; generation is trivial; column name → field name directly
- 3 = Schema can be generated but requires careful tag management to avoid conflicts
- 1 = Manual tag assignment required; automation requires bookkeeping infrastructure
**Score each family:**
```
JSON/XML Binary JSON Thrift/Protobuf Avro
Cross-language support [1-5] [1-5] [1-5] [1-5]
Schema evolution safety [1-5] [1-5] [1-5] [1-5]
Payload compactness [1-5] [1-5] [1-5] [1-5]
Human-readability [1-5] [1-5] [1-5] [1-5]
Code generation/type safety [1-5] [1-5] [1-5] [1-5]
Dynamic schema support [1-5] [1-5] [1-5] [1-5]
Total [6-30] [6-30] [6-30] [6-30]
```
See `references/format-comparison-table.md` for pre-filled scores with rationale for each criterion, plus byte counts for the same record encoded in all five formats.
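Totalling the matrix is simple enough to script, which also catches out-of-range entries. In the sketch below the scores are placeholders chosen only to exercise the code, not recommended values, and `totals` is a helper name invented for this example.

```python
CRITERIA = [
    "cross_language", "schema_evolution", "compactness",
    "readability", "codegen", "dynamic_schema",
]

# Placeholder scores for illustration only; replace with your own assessment.
scores = {
    "JSON/XML":        [5, 2, 1, 5, 1, 3],
    "Binary JSON":     [4, 2, 3, 2, 1, 3],
    "Thrift/Protobuf": [5, 5, 5, 2, 5, 2],
    "Avro":            [4, 5, 5, 2, 3, 5],
}

def totals(score_table):
    """Validate that each family has six scores in 1-5, then sum them."""
    for family, values in score_table.items():
        if len(values) != len(CRITERIA) or not all(1 <= v <= 5 for v in values):
            raise ValueError(f"{family}: expected {len(CRITERIA)} scores in 1-5")
    return {family: sum(values) for family, values in score_table.items()}

ranked = sorted(totals(scores).items(), key=lambda item: item[1], reverse=True)
for family, total in ranked:
    print(f"{family:<16} {total}/30")
```

A tie in the totals, as with these placeholder numbers, is a signal to fall back on the decision rules in Step 3 rather than to pick by row order.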
---
### Step 3: Apply the Format Selection Decision Rules
**Action:** Apply explicit if/then rules to produce a primary format recommendation. These rules encode the structural logic of the format characteristics — they are not heuristics but direct consequences of how each format handles field identification, schema negotiation, and type encoding.
**WHY:** Scoring produces numbers; decision rules produce a recommendation. The rules encode the non-obvious consequences of format choice: JSON's number ambiguity (no distinction between integers and floats, no precision specification) can silently corrupt data, for example when integers above 2^53 pass through a parser that stores numbers as IEEE 754 doubles; Thrift/Protobuf's tag-based schema evolution is robust for most cases but breaks when schemas are generated dynamically (because tags must be managed manually); Avro's writer/reader schema resolution is powerful but requires a schema distribution mechanism (file header, schema registry, version negotiation) that must be built or operated. These consequences are expensive to discover after deployment.
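The JSON number ambiguity is easy to reproduce. Python's `json` module keeps integers exact, so the sketch below passes `parse_int=float` to emulate a parser that stores every number as a double; the `id` field is a hypothetical example.

```python
import json

payload = '{"id": 9007199254740993}'  # 2**53 + 1

# Python's json module keeps integers exact by default.
exact = json.loads(payload)["id"]

# parse_int=float emulates a parser that stores every number as an IEEE 754 double.
as_double = json.loads(payload, parse_int=float)["id"]

print(exact)                    # 9007199254740993
print(int(as_double))           # 9007199254740992
print(exact == int(as_double))  # False
```

No error is raised in either case; the value simply changes. A common workaround is to transmit such identifiers as strings.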
**Rule 1 — Use JSON (or REST/JSON) if ANY of the following are true:**
- Data crosses organizational boundaries (public API, partner integrations, browser clients) — JSON is the de facto standard; the difficulty of getting external parties to adopt anything else outweighs efficiency gains
- The primary consumer is a web browser or JavaScript runtime
- Human-readability for debugging and manual API testing is a hard requirement
- The team has no existing schema management infrastructure and schema evolution is low-frequency
- Note: Use JSON with explicit schema validation (JSON Schema) if you need schema documentation; without validation, the implicit schema lives only in application code, where drift is slow to surface and hard to detect
**Rule 2 — Use Protocol Buffers (Protobuf) if ALL of the following are true:**
- Data flows within one organization's services (internal service-to-service)
- Both sides use statically typed languages (Java, Go, C++, Rust) — code generation adds significant value
- Schema changes are infrequent and managed by the team owning the schema (not dynamically generated from another source)
- You need both compact encoding and explicit compatibility rules enforced by a schema checker
- Rolling upgrade compatibility is required (field tags provide this; see Step 4)
- Note: gRPC uses Protocol Buffers and inherits its compatibility properties
**Rule 3 — Use Apache Thrift if ALL of the following are true:**
- Same conditions as Protocol Buffers, AND
- The existing codebase already uses Thrift (e.g., inherited from a Facebook/Twitter-lineage stack)
- Note: Thrift and Protocol Buffers are functionally equivalent for most decisions; the choice between them is primarily ecosystem and existing adoption. CompactProtocol is Thrift's most efficient encoding; use it over BinaryProtocol unless compatibility with older tooling is required.
**Rule 4 — Use Apache Avro if ANY of the following are true:**
- Schemas are dynamically generated from another source (e.g., a relational database schema dump, an Elasticsearch mapping, or code-generated from an ORM) — Avro's tag-free encoding means column names map directly to field names without manual tag assignment
- The data is stored in large file archives where all records share one schema (Hadoop, data lake, archival exports) — Avro object container files embed the schema once per file
- Dynamically typed languages (Python, JavaScript, Ruby) are primary consumers and code generation adds no value — Avro works well without generated code
- You are using Apache Kafka with a schema registry (Confluent Schema Registry natively supports Avro)
- Note: Avro requires a schema distribution mechanism — either embed the writer's schema in the file (object container files), store schema versions in a database (one version number per record), or negotiate schema on connection setup (Avro RPC)
**Rule 5 — Avoid language-specific encodings (Java Serializable, Python pickle, Ruby Marshal) unless:**
- Data is purely transient (in-memory cache within a single process, never written to disk or network)
- Security implications are understood and mitigated: deserializing untrusted bytes can execute arbitrary code
- No cross-language communication is needed now or in the foreseeable future
**Tie-breaker when rules 2 and 4 both apply (schema-driven binary required, but dynamic generation is also needed):**
Choose Avro if schema generation frequency is high (schemas change when the source schema changes, e.g., database column added). Choose Protobuf if schema changes are infrequent and controlled by your team (field tag management overhead is acceptable).
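The rules and tie-breaker can be written down as a small function, which makes the any/all logic explicit. This is a simplified sketch: `SystemProfile` and its field names are invented here, it covers only a subset of the listed conditions, and evaluating Rule 1 before Rule 4 before Rules 2 and 3 is an assumption about precedence.

```python
from dataclasses import dataclass

@dataclass
class SystemProfile:
    crosses_org_boundary: bool = False
    browser_is_primary_consumer: bool = False
    human_readable_required: bool = False
    internal_services_only: bool = False
    statically_typed_both_sides: bool = False
    schema_dynamically_generated: bool = False
    large_file_archives: bool = False
    kafka_with_schema_registry: bool = False
    already_uses_thrift: bool = False

def recommend(p: SystemProfile) -> str:
    # Rule 1: any one condition is enough for JSON.
    if (p.crosses_org_boundary or p.browser_is_primary_consumer
            or p.human_readable_required):
        return "JSON"
    # Rule 4: any one condition is enough for Avro. Checking it before
    # Protobuf also applies the tie-breaker for dynamically generated schemas.
    if (p.schema_dynamically_generated or p.large_file_archives
            or p.kafka_with_schema_registry):
        return "Avro"
    # Rules 2 and 3: all conditions must hold.
    if p.internal_services_only and p.statically_typed_both_sides:
        return "Thrift" if p.already_uses_thrift else "Protobuf"
    return "undecided: gather more input"

profile = SystemProfile(internal_services_only=True,
                        statically_typed_both_sides=True)
print(recommend(profile))  # Protobuf
```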
---
### Step 4: Apply Compatibility Rules for the Recommended Format
**Action:** For each planned or expected schema change, check it against the per-format compatibility rules. Classify each change as: safe (backward and forward compatible), backward-only (new code reads old data, but not vice versa), forward-only (old code reads new data, but not vice versa), or breaking (incompatible in at least one direction).
**WHY:** The core problem encoding formats solve is not just efficiency — it is allowing old and new versions of code to coexist while reading the same data. During a rolling upgrade, some nodes run new code and some run old code; they write data to the same database or send messages to the same topic. Forward compatibility (old code reads data written by new code) is the harder direction: it requires old code to safely ignore additions made by new code rather than crashing. Each format handles this differently, and the permitted changes differ significantly.
#### Protocol Buffers and Thrift: Field Tag Rules
Field tags (the numbers `= 1`, `= 2`, `= 3` in Protobuf; `1:`, `2:`, `3:` in Thrift) are the identity of a field in the encoded data — not the field name. The encoded data contains only tags and values; names are only in the schema. This is what enables forward compatibility: a reader that sees an unknown tag number can skip that field using the type annotation to determine how many bytes to skip.
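The skipping behaviour can be sketched with a stripped-down decoder for the Protobuf wire format. It handles only varint and length-delimited fields, and the record, the schema dict, and the extra field at tag 4 are hypothetical; it is not a substitute for a real Protobuf library.

```python
def read_varint(buf, pos):
    """Read one base-128 varint starting at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def decode(buf, schema):
    """Decode a record, keeping only the tags listed in `schema`."""
    record, pos = {}, 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited (strings, bytes)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if tag in schema:     # unknown tags are parsed past, then dropped
            record[schema[tag]] = value
    return record

# Hypothetical record: tag 1 = "Martin", tag 2 = 1337, and a newer field
# at tag 4 = 42 that the old schema does not declare.
encoded = bytes([0x0A, 0x06]) + b"Martin" + bytes([0x10, 0xB9, 0x0A, 0x20, 0x2A])

old_schema = {1: "user_name", 2: "favorite_number"}
print(decode(encoded, old_schema))
# {'user_name': b'Martin', 'favorite_number': 1337}
```

The wire type stored next to each tag is what lets the old reader step over tag 4 without knowing what it means. This sketch discards the unknown field; a production library may retain it so that it survives re-encoding.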
**Safe changes (backward and forward compatible):**
- Add a new field with a new (previously unused) tag number — old code ignores it (forward compatible), new code can read old data that lacks the field (backward compatible, provided the field is optional or has a default)
- Rename a field — names are not in the encoded data; tags are; renaming is invisible to the wire format
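**Backward compatible only (new code reads old data; old code cannot read new data):**
- Change a field from `required` to `optional`: new code reads old data unchanged, but old code still enforces the `required` check and rejects new records that omit the field; keep writing the field until every reader is upgraded
- Widen `int32` to `int64`: new code reads old values without loss; old code truncates values that exceed the 32-bit range (see the datatype change rules below)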
**Breaking changes — never do these:**
- Remove a required field — old code will fail to parse records written by new code that omits it
- Change a field's tag number — all existing encoded data referencing the old tag becomes unreadable
- Reuse a tag number that was previously used for a removed field — new code will misinterpret old data that contains the old field's data under that tag
- Add a new field as `required`: records written before the field existed lack it, so they fail the `required` check when new code reads them; every new field added after initial deployment must be `optional` or have a default value
- Change the datatype of a field in a way the parser cannot convert (e.g., `int32` to `string`) — type mismatch causes a parse error or silent truncation
**Datatype change rules (Protobuf):**
- `int32` → `int64`: backward compatible; new code fills missing high bits with zeros; old code reads the 64-bit value into a 32-bit variable and truncates it if the value exceeds the 32-bit range
- `optional` (single-value) → `repeated` (multi-value): safe; new code reading old data sees a list with zero or one elements; old code reading new data sees only the last element of the list
- `repeated` → `optional`: risky; new code reading old data that contains several elements sees only the last one, so make this change only when every stored list has at most one element
#### Apache Avro: Writer/Reader Schema Rules
Avro does not use field tags. The encoded data contains only values concatenated in schema field order — no type annotations, no tags. The reader must have access to both the writer's schema (which defined the byte layout) and the reader's schema (which defines what the application expects). The Avro library resolves the difference by matching fields by name and filling defaults for missing fields.
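The resolution step can be sketched in a few lines. This is a simplified illustration, not the Avro library: it supports only `string` and `long`, and the schema tuples, `NO_DEFAULT`, and `decode` are hypothetical names chosen for the example.

```python
NO_DEFAULT = object()  # sentinel: the reader schema declares no default


def read_long(buf: bytes, pos: int) -> tuple[int, int]:
    """Avro long: a zigzag-encoded base-128 varint."""
    raw = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        raw |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1), pos


def read_string(buf: bytes, pos: int) -> tuple[str, int]:
    """Avro string: a long length followed by UTF-8 bytes."""
    length, pos = read_long(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length


DECODERS = {"long": read_long, "string": read_string}


def decode(buf: bytes, writer_schema, reader_schema) -> dict:
    # 1. The byte layout is defined only by the writer's schema.
    written = {}
    pos = 0
    for name, avro_type in writer_schema:
        written[name], pos = DECODERS[avro_type](buf, pos)
    # 2. Resolve by field name against the reader's schema.
    record = {}
    for name, default in reader_schema:
        if name in written:
            record[name] = written[name]
        elif default is not NO_DEFAULT:
            record[name] = default      # field missing in old data
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return record                       # writer-only fields are ignored


WRITER_V1 = [("userName", "string"), ("favoriteNumber", "long")]
# Reader v2: different field order, plus a new field with a default.
READER_V2 = [("favoriteNumber", NO_DEFAULT), ("userName", NO_DEFAULT), ("email", None)]

data = bytes([0x0C]) + b"Martin" + bytes([0xF2, 0x14])
print(decode(data, WRITER_V1, READER_V2))
```

Without `WRITER_V1` the bytes are undecodable, which is why the schema distribution mechanism is a hard requirement rather than an optimization.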
**Safe changes (backward and forward compatible):**
- Add a field with a default value: new readers get the default when reading old data that lacks the field (backward compatible); old readers ignore the field when reading new data that contains it (forward compatible)
- Remove a field that has a default value: old readers that still expect the field get the default when reading new data (forward compatible); new readers ignore the field when reading old data that still contains it (backward compatible)
- Change field order — Avro matches by name, not position; order changes are transparent
**Backward compatible only:**
- Remove a field without a default value: new readers ignore the field in old data; old readers cannot read new data (the new writer omits the field and the old reader has no default to fall back to)
**Forward compatible only:**
- Add a field without a default value: old readers ignore the new field in new data; new readers cannot read old data (the old writer never wrote the field and the new reader has no default to fall back to)
**Breaking changes:**
- Add a field that has no default value AND remove a field that has no default value simultaneously — breaks in both directions
- Change a field name without adding the old name as an alias in the reader's schema — the Avro resolution algorithm matches by name; a renamed field without an alias is treated as a deleted field plus a new field
**Avro null handling:** Avro does not allow null as a value for a field unless the field's type is a union that includes null (e.g., `union { null, long } favoriteNumber = null`). This is more explicit than Protocol Buffers' optional fields and prevents bugs by forcing you to declare nullability in the schema.
**Avro schema distribution checklist:**
- Large file with many records: embed writer's schema once in the Avro object container file header
- Database with individually written records: store schema version number per record; maintain a schema version registry in the database
- Network connection between two services: negotiate schema version on connection setup (Avro RPC protocol); client and server exchange schemas at handshake time
- Kafka with schema registry: each message includes a schema ID (not the full schema); schema registry stores schemas by ID; producers register schemas; consumers fetch schemas on first encounter
#### JSON/XML: Compatibility Through Convention
JSON and XML have no built-in compatibility mechanism. Compatibility is achieved by convention and application discipline.
**Safe by convention:**
- Add a new field: forward compatible if all readers ignore unknown fields (the convention is to use lenient parsers); backward compatible because new readers handle missing fields with null/default in application code
**Common failure modes:**
- Renaming a field: breaking; all readers must update simultaneously
- Removing a field: readers that depend on it crash or silently use a null/zero default
- Changing a field's type — JSON does not distinguish integers from floats; a field that was always an integer may be parsed as a float by some implementations, causing precision loss for integers > 2^53
- Binary strings — JSON has no binary type; binary data requires Base64 encoding, which increases size by 33% and requires encoding/decoding logic on both sides
---
### Step 5: Analyze Dataflow-Mode-Specific Encoding Guidance
**Action:** Apply dataflow-specific guidance for the recommended format. The same format behaves differently in each mode, and there are additional rules and failure modes specific to each.
**WHY:** The encoded format is not used in isolation — it is used within a dataflow mode that imposes additional constraints. Ignoring these constraints produces systems that are correct in isolation but fail in production: a consumer that reprocesses Kafka messages without schema version handling will fail on old messages; a service that decodes a JSON request into a model object and re-encodes it for storage will silently drop unknown fields added by a newer client; a database that stores model objects will lose unknown fields when old code reads and re-writes a record that contains fields it doesn't understand.
**Mode A — Databases:**
- Data outlives code. All schema changes must be backward compatible (or require a full data migration). Plan for records written by version 1 to be readable by version 5. During rolling upgrades, old code also reads records written by new code, so forward compatibility is needed as well.
- Unknown field loss: Decoding a record into a typed model object and re-encoding it will silently drop unknown fields (written by newer schema versions). Protobuf parsers preserve unknown fields in a side-channel. Avro resolution discards writer-only fields when decoding into an older reader schema, so re-encode with the writer's schema when a record must round-trip intact. In JSON, use a passthrough "unknown fields" map in the struct, or avoid read-modify-write on fields you don't touch.
- Archival exports: Use Avro object container files — schema embedded once per file; readable years later with only the Avro library.
- Prefer schema-driven binary formats (Thrift/Protobuf/Avro) over JSON for long-lived stored data: explicit schema serves as documentation that cannot drift from reality.
**Mode B — Synchronous service calls (REST, RPC):**
- Reasonable assumption: servers upgrade before clients. Validate: backward compatibility on requests (new server + old client) AND forward compatibility on responses (old client + new server).
- Across organizational boundaries: never break backward compatibility; use URL versioning (`/v1/`, `/v2/`) for breaking changes; maintain deprecated versions with explicit sunset dates.
- REST vs. gRPC: REST with JSON for public APIs (curl-testable, no client tooling required). gRPC (Protobuf) for internal services in statically typed stacks (type safety, compact encoding, HTTP/2 streaming).
**Mode C — Asynchronous message passing (message brokers):**
- Full bidirectional compatibility required simultaneously: producers and consumers deploy independently; messages persist for hours or days.
- Consumer-republish risk: a consumer that republishes messages to a downstream topic must preserve fields it cannot interpret. Failure corrupts the event stream for downstream consumers. Use Protobuf (the parser preserves unknown fields) or, with Avro, re-encode using the writer's schema, because resolution into an older reader schema discards writer-only fields.
- Schema registry: embed writer's schema once per Avro file (archival), or use a schema registry (Confluent Schema Registry with Kafka). Each message carries a 4-byte schema ID; consumer fetches schema on first encounter and caches it.
- Distributed actor defaults: Akka uses Java serialization (no compatibility); Erlang OTP rolling upgrades require careful planning. Replace with Protobuf before assuming rolling upgrades work.
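The per-message schema ID and the fetch-once-then-cache behavior can be sketched as follows. This assumes the common Confluent-style framing of one zero "magic" byte followed by a 4-byte big-endian schema ID; `frame`, `unframe`, `SchemaCache`, and the `fetch` callable are hypothetical names, and a real consumer would use the registry client library instead.

```python
import struct

MAGIC_BYTE = 0
HEADER = struct.Struct(">bI")  # 1 magic byte + 4-byte big-endian schema ID


def frame(schema_id: int, payload: bytes) -> bytes:
    """Prefix an encoded record with the schema ID header."""
    return HEADER.pack(MAGIC_BYTE, schema_id) + payload


def unframe(message: bytes) -> tuple[int, bytes]:
    """Split a framed message into (schema_id, encoded_record)."""
    if len(message) < HEADER.size:
        raise ValueError("message too short to carry a schema ID")
    magic, schema_id = HEADER.unpack(message[:HEADER.size])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte {magic}")
    return schema_id, message[HEADER.size:]


class SchemaCache:
    """Fetch each writer schema once, on first encounter, then reuse it."""

    def __init__(self, fetch):
        self._fetch = fetch      # hypothetical callable: schema_id -> schema
        self._schemas = {}

    def get(self, schema_id: int):
        if schema_id not in self._schemas:
            self._schemas[schema_id] = self._fetch(schema_id)
        return self._schemas[schema_id]


message = frame(42, b"avro-bytes")
schema_id, payload = unframe(message)
print(schema_id, payload)
```

The header adds five bytes per message, which is small next to embedding the full schema in every record.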
---
### Step 6: Produce the Encoding Decision Document
**Action:** Write a structured recommendation covering format selection, compatibility assessment of planned changes, schema evolution plan, and dataflow-specific guidance. See the full output template in the three examples below.
**WHY:** A recommendation without explicit rationale cannot be reviewed or revised when requirements change. The schema evolution plan is especially important: it must specify not just which format to use, but the exact rules for each type of change the team will make over the system's lifetime, so that engineers making future changes have explicit guidance rather than relying on informal knowledge.
**Required sections:**
1. Recommended format with primary rationale (2 sentences connecting dataflow mode and rolling upgrade requirement)
2. Scoring summary table (six criteria, three to four families)
3. Schema evolution plan (table: each planned change, safe/breaking, compatibility direction, procedure)
4. Dataflow-mode-specific rules and watch signals
5. Ruled-out analysis (one sentence per format explaining the deciding criterion)
6. Implementation checklist with CI tooling and schema version management
**Related decisions:** Data model shape → `data-model-selector`. Stream processing pipeline → `stream-processing-designer`.
---
## What Can Go Wrong
These are the most common failure modes when selecting encoding formats and planning schema evolution. Review each before finalizing a recommendation.
**Adding a required field after initial deployment (Thrift/Protobuf).**
Every new field added after first deployment must be `optional` (or have a default). A `required` field will fail at parse time when reading old records that never wrote it. This is a silent misconfiguration: the schema compiles and tests pass with new test data, but fails at runtime on real old records. Rule: after initial deployment, required fields are permanently forbidden.
**Reusing a field tag number (Thrift/Protobuf).**
A retired field's tag number must be permanently marked `reserved`. Reusing a tag for a new field causes old data with the original field's bytes to be misinterpreted as the new field's type — silent data corruption or a parse error. Use `reserved 3; reserved "old_field_name";` in every `.proto` file when removing a field.
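A pre-merge check for this rule can be approximated with plain dictionaries. This is a minimal sketch under the assumption that each schema version is available as a `{tag: field_name}` mapping plus a set of reserved tags; `check_tags` is a hypothetical helper, and a dedicated tool such as `buf breaking` covers far more cases.

```python
def check_tags(old: dict[int, str], new: dict[int, str], reserved: set[int]) -> list[str]:
    """Report reserved tags that were reassigned and removed tags left unreserved."""
    problems = []
    for tag in sorted(reserved & set(new)):
        problems.append(f"tag {tag} is reserved but assigned to {new[tag]!r}")
    for tag in sorted(set(old) - set(new) - reserved):
        problems.append(f"tag {tag} ({old[tag]!r}) was removed without being reserved")
    return problems


v1 = {1: "user_name", 2: "score", 3: "legacy_score"}

# Tag 3 was retired and reserved, then reused for a new field.
print(check_tags(v1, {1: "user_name", 2: "score", 3: "confidence"}, reserved={3}))

# Tag 3 was removed but never reserved, so a later reuse would go unnoticed.
print(check_tags(v1, {1: "user_name", 2: "score"}, reserved=set()))
```

A rename keeps the tag and changes only the name, so this check deliberately ignores name differences on tags that still exist.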
**Avro field without a default value breaks compatibility in one direction.**
Adding a field without a default breaks backward compatibility (old writers didn't include it; readers have no fallback). Removing a field without a default breaks forward compatibility (new writers omit it; old readers have no fallback). Rule: every Avro field that may be added or removed across versions must declare a default value.
**Unknown field loss in read-modify-write cycles (all formats).**
Reading a record into a typed model object, modifying one field, and writing back silently drops any fields the model type doesn't know about. Affects databases (old code reads and rewrites new records, drops new fields) and message brokers (consumer republishes a modified message, drops new producer fields). Protobuf parsers preserve unknown fields in a side-channel. Avro resolution discards writer-only fields when decoding into an older reader schema, so re-encode with the writer's schema when the record must round-trip intact. In JSON, the struct must include an explicit "unknown fields" passthrough map.
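The JSON passthrough map can look roughly like this. It is a minimal sketch: the `Workspace` model, its field names, and the `extra` attribute are illustrative assumptions, and the input is assumed to contain every known field.

```python
import json
from dataclasses import dataclass, field

KNOWN_FIELDS = ("name", "plan_name")  # fields this code version understands


@dataclass
class Workspace:
    name: str
    plan_name: str
    # Everything this version does not understand is carried along untouched.
    extra: dict = field(default_factory=dict)


def from_json(text: str) -> Workspace:
    raw = json.loads(text)
    known = {key: raw.pop(key) for key in KNOWN_FIELDS}
    return Workspace(**known, extra=raw)


def to_json(ws: Workspace) -> str:
    # Unknown fields are written back alongside the known ones.
    return json.dumps({"name": ws.name, "plan_name": ws.plan_name, **ws.extra})


# A record written by newer code that already knows `subscription_tier`.
incoming = '{"name": "acme", "plan_name": "pro", "subscription_tier": "gold"}'

ws = from_json(incoming)
ws.plan_name = "team"              # read-modify-write by old code
print(to_json(ws))
```

Without the `extra` map, the rewrite would silently delete `subscription_tier` from the stored record.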
**Number precision loss with JSON at scale.**
Integers greater than 2^53 cannot be represented exactly in IEEE 754 double-precision float (the JavaScript `Number` type). Twitter returns tweet IDs as both a JSON number and a decimal string because JavaScript clients parse the numeric form incorrectly. Mitigation: string-encode large integers in JSON APIs, or use a format with explicit 64-bit integer types.
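The effect is easy to reproduce. The sketch below uses Python's `json` module, whose `parse_int` hook is set to `float` here to imitate a parser that maps every number to a double; the field names `id` and `id_str` are illustrative.

```python
import json

big_id = 2**53 + 1  # 9007199254740993, one past the exact-integer range of a double

# A double cannot hold this value, so conversion rounds it to a neighbor.
assert float(big_id) == float(2**53)

payload = json.dumps({"id": big_id, "id_str": str(big_id)})

# Imitate a client whose JSON parser stores all numbers as doubles.
lossy = json.loads(payload, parse_int=float)
print(int(lossy["id"]) == big_id)      # numeric form: off by one
print(int(lossy["id_str"]) == big_id)  # string form: exact
```

Sending the ID in both forms lets clients with exact integers keep using the number while double-based clients read the string.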
**Adopting binary format without schema version management (Avro).**
Avro requires a mechanism for the reader to obtain the writer's schema — file header, schema registry, or connection negotiation. Without this, Avro is unusable. Retrofitting schema version IDs into records after gigabytes of data have been written is expensive. Choose a schema distribution mechanism before writing the first record.
**Switching to binary format to solve a performance problem that isn't encoding.**
For payloads under 1KB at under 10K requests/second, the encoding/decoding difference between JSON and Protobuf is negligible compared to network latency and business logic. Profile first. The operational cost of binary formats (schema management, debugging complexity) is only worth paying when encoding is confirmed as the bottleneck.
---
## Inputs / Outputs
### Inputs
- Dataflow mode description (required)
- Rolling upgrade requirement (required)
- Planned schema changes or data shape description (required)
- Existing schema files or payload samples (optional but strongly improves precision)
- Language and platform of writer and reader services (important)
- Performance constraints (optional)
### Outputs
- Encoding format recommendation with rationale and scored decision matrix
- Per-change compatibility classification (safe / breaking / direction)
- Schema evolution plan with explicit permitted and prohibited operations
- Dataflow-mode-specific encoding guidance
- Implementation checklist with tooling recommendations
- Watch signals for the most likely failure modes
---
## Key Principles
**The compatibility direction that matters depends on the dataflow mode.** In databases, both backward and forward compatibility are required simultaneously (data outlives code). In service calls, you can assume servers upgrade before clients (backward on requests, forward on responses). In async messaging, full bidirectional compatibility is required (decoupled producers and consumers at independent schema versions).
**Field tags are a permanent commitment (Thrift/Protobuf).** A field's tag number is its identity in the encoded data for the lifetime of the schema. It cannot be changed, cannot be reused after removal, and a `required` field cannot be removed. Treat tag assignments as permanent as column IDs in a relational database — they outlive any individual deployment.
**Avro's writer/reader schema resolution requires infrastructure.** Avro achieves the most compact encoding (32 bytes for the example record vs. 33 for Protobuf, 34 for Thrift CompactProtocol, 59 for Thrift BinaryProtocol, 66 for MessagePack, 81 for JSON) by omitting all field identification from the encoded bytes. The cost is that the reader must have access to the writer's schema. This is not optional; it is a hard requirement that must be designed for before adopting Avro.
**Data outlives code.** A database record written today may be read five years from now by code that uses a schema three versions newer. A Kafka message written by a producer that has since been decommissioned may be replayed by a new consumer. The encoding format you choose today must support reading that data with future schema versions — not just today's.
**Schemas are documentation.** A schema registry of past schema versions is a historical record of every data structure the system has ever used. It serves as documentation that is guaranteed to be accurate (because decoding fails if it is wrong, unlike manually maintained documentation). Build schema versioning infrastructure even if you don't use it for compatibility checking immediately — the documentation value alone is worth it.
---
## Examples
### Example 1: Internal gRPC Service (Statically Typed Stack)
**Scenario:** A platform team is building a new internal recommendation service written in Go, consumed by a Java API gateway and a Python data pipeline. The service will undergo rolling upgrades — no fleet-wide restarts. The team expects to add fields to the recommendation response over time as the ML model evolves (adding score components, confidence intervals, explanation fields). Payload volume is high: ~50K recommendations/second.
**Trigger:** "Should we use JSON REST or gRPC Protobuf for our internal recommendation service? We need rolling upgrades."
**Process:**
- Step 1: Dataflow Mode B — synchronous service calls, internal organization boundary, rolling upgrades required
- Step 2: Protobuf scores high on schema evolution safety (field tags, optional fields), compactness (33 bytes vs. 81 for JSON for equivalent record), and code generation (Go and Java strongly benefit from typed structs). JSON scores high on human-readability but low on schema evolution safety. Avro scores well but dynamic schema generation is not needed here — schemas are hand-authored and evolving incrementally.
- Step 3: Rule 2 applies — internal service, statically typed languages (Go and Java), infrequent schema changes, rolling upgrade required.
- Step 4: Planned changes assessed:
- Add `confidence_interval` field: safe — new tag, optional, default 0.0
- Add `explanation` field (list of strings): safe — new tag, repeated field
- Remove `legacy_score` field: safe if retired tag is marked `reserved`; never reuse tag 3
**Output (abbreviated):**
```
## Encoding Format Decision
**System:** Recommendation Service → API Gateway, Data Pipeline
**Dataflow mode:** B — Synchronous service calls (internal)
### Recommended Format: Protocol Buffers (gRPC)
**Primary rationale:** Internal service with Go and Java consumers undergoing
rolling upgrades. Protobuf's field tags provide explicit backward/forward
compatibility; code generation gives typed structs in both languages; 33-byte
encoding vs 81-byte JSON reduces bandwidth at 50K req/sec by ~60%.
### Schema Evolution Plan
| Change | Safe? | Direction | Procedure |
|--------|-------|-----------|-----------|
| Add confidence_interval (float) | Yes | Both | New tag (e.g., 4), optional, default 0.0 |
| Add explanation (repeated string) | Yes | Both | New tag (e.g., 5), no default needed |
| Remove legacy_score | Yes | Both | Mark tag 3 as reserved; never reuse |
| Rename legacy_score to base_score | Yes | Both | Rename in .proto only; tag unchanged |
| Change score from float to double | No | Breaking | float (32-bit) and double (64-bit) use different wire types; add a new double field under a new tag instead |
### Ruled Out
**JSON/REST:** No schema-enforced compatibility; number precision issues for
score floats; human-readability benefit outweighed by 50K req/sec bandwidth
cost at this scale.
**Avro:** Dynamic schema generation not needed; a schema registry adds
operational overhead that hand-authored schemas do not justify here.
### Implementation Checklist
- [ ] Define .proto schema file; assign initial field tags (never reuse)
- [ ] Add `reserved 3; reserved "legacy_score";` when legacy_score is removed
- [ ] Set up buf lint and buf breaking in CI to catch incompatible changes before merge
- [ ] Python data pipeline: use the protobuf Python library with `protoc`-generated `_pb2` modules
- [ ] Watch: if the schema starts being generated from the ML model's feature definition
rather than hand-authored, re-evaluate Avro (generating Protobuf schemas requires tag bookkeeping)
```
---
### Example 2: Kafka Event Stream with Independent Producers and Consumers
**Scenario:** An e-commerce platform publishes `OrderPlaced` events to a Kafka topic. Three consumer services (inventory, fulfillment, analytics) subscribe to the topic. Services are deployed independently, so the inventory service may be running version 1 of the schema when the order service publishes version 2. Messages are retained for 7 days. The analytics team wants to export the Kafka topic to Parquet files in a data lake.
**Trigger:** "We're publishing order events to Kafka. How do we handle schema changes when consumers deploy at different times?"
**Process:**
- Step 1: Dataflow Mode C — async message passing, internal organization, multiple independent consumers, 7-day message retention
- Step 2: Avro scores highest: Confluent Schema Registry supports it natively with Kafka; name-based resolution handles independently deployed producers and consumers; each message carries only a schema ID rather than the full schema; Hadoop/Parquet interop supports the data lake export. Protobuf is viable, but tag management becomes complex once the analytics team generates schemas from Parquet column definitions.
- Step 3: Rule 4 applies — Kafka ecosystem (Confluent Schema Registry natively supports Avro), data lake export (Avro object container files embed schema for archival), dynamically generated schemas for analytics are a future requirement.
- Step 4: Compatibility rules:
- All new fields must have default values (backward and forward compatibility required simultaneously)
- Schema must be registered in schema registry before first producer uses it
- Consumers must preserve unknown fields before republishing to downstream topics
**Output (abbreviated):**
```
## Encoding Format Decision
**System:** Order events — Kafka topic, 3 consumers, data lake export
**Dataflow mode:** C — Async message passing
### Recommended Format: Apache Avro + Confluent Schema Registry
**Primary rationale:** Kafka with independent producers and consumers at
different schema versions requires simultaneous backward and forward compatibility.
Avro's name-based resolution handles field additions and removals with defaults;
schema registry provides schema distribution without embedding full schema in
every message; Avro object container files support archival export to data lake.
### Schema Evolution Plan
| Change | Safe? | Direction | Procedure |
|--------|-------|-----------|-----------|
| Add shipping_address field | Yes (if default) | Both | Add with default null; register new schema version first |
| Add discount_codes (array) | Yes (if default) | Both | Add with default [] (empty array) |
| Remove coupon_code (deprecated) | Yes (if had default) | Both | Confirm default exists; remove from the schema; old readers fall back to the default |
| Rename order_id to orderId | Backward only | Backward | Add "order_id" as alias for orderId in reader schema; forward breaks |
| Change amount from int to long | Backward only | Backward | Avro promotes int to long for new readers; old int readers cannot read long data, so upgrade consumers first |
### Dataflow-Specific Rules
- Register new schema version in registry BEFORE deploying the producer that uses it
- Consumers must implement "preserve unknown fields" pattern when republishing to downstream topics
- Data lake export: use Avro object container files (schema embedded once per file)
- Schema registry: use BACKWARD_TRANSITIVE compatibility mode (new schema must be compatible with ALL previous versions, not just the immediately preceding one)
### Ruled Out
**Protobuf:** Tag management adds friction when analytics team generates schemas
from Parquet column definitions — Avro's name-based approach maps column names
directly to field names.
**JSON:** No schema versioning mechanism; unknown field behavior is parser-dependent
(some parsers drop unknowns, some error); 7-day retention means old records will
definitely be processed by new consumers.
### Implementation Checklist
- [ ] Set up Confluent Schema Registry; configure BACKWARD_TRANSITIVE compatibility mode
- [ ] Write schema as Avro IDL; define default values for all fields
- [ ] Producer: register schema before first publish; include schema ID in message header
- [ ] Consumer: fetch schema by ID; implement unknown field preservation before republish
- [ ] Data lake export: use Avro object container files; Parquet conversion tool reads embedded schema
- [ ] Watch: if a consumer must republish without a schema registry, embed writer's schema version in message metadata instead
```
---
### Example 3: Public REST API with Long-Lived Clients
**Scenario:** A SaaS company exposes a REST API for third-party integrations. Clients are external developers who cannot be forced to upgrade. The team needs to add `subscription_tier` to `Workspace` and deprecate `plan_name`.
**Trigger:** "How do we evolve our REST API schema without breaking external clients?"
**Process:**
- Step 1: Mode B — service calls across organizational boundaries; clients not controlled; indefinite compatibility required
- Step 2: JSON only — external clients, browser consumers, no client-side code-gen toolchain, human-readability required for developer experience. Rule 1 applies immediately.
- Step 4: Add `subscription_tier` — safe (additive). Deprecate `plan_name` (keep it populated, mark deprecated in docs) — safe. Remove `plan_name` — breaking; requires `/v2/` with a minimum 12-month sunset period for `/v1/`. Change `user_id` int → string — breaking; API version bump required.
**Output (key sections):**
```
Recommended Format: JSON (REST)
Dataflow mode: B — cross-organizational boundary
Schema Evolution Plan:
- Add subscription_tier: Safe — additive; lenient clients ignore unknown fields
- Deprecate plan_name (keep populated): Safe — document as deprecated in OpenAPI spec
- Remove plan_name: Breaking — /v2/ required; 12-month /v1/ sunset window
- Change user_id int → string: Breaking — /v2/ required; document migration guide
Dataflow rules:
- Never remove response fields without a versioned sunset period
- Adding optional request params is safe; adding required params is breaking
- Watch: if IDs exceed 2^53, return as both JSON number and decimal string (Twitter pattern)
Ruled out — Protobuf/Avro: external clients cannot be required to install code-gen
toolchains; binary format is not curl-testable.
```
---
## References
| File | Contents |
|------|----------|
| `references/format-comparison-table.md` | Full scoring matrix for all four encoding families; byte counts for the same example record in JSON (81 bytes), MessagePack (66 bytes), Thrift BinaryProtocol (59 bytes), Thrift CompactProtocol (34 bytes), Protocol Buffers (33 bytes), Avro (32 bytes); compatibility matrix comparing each format's handling of add/remove/rename/type-change operations |
| `references/schema-evolution-rules.md` | Complete per-format compatibility rule reference: all Protobuf/Thrift field tag rules, all Avro writer/reader schema resolution rules, JSON convention guidelines, with explicit permitted/prohibited change classification for each change type |
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Designing Data-Intensive Applications by Martin Kleppmann.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-data-model-selector`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/format-comparison-table.md
# Format Comparison Table
## Byte Counts: Same Record in Five Formats
Reference record (from Kleppmann Chapter 4):
```json
{
"userName": "Martin",
"favoriteNumber": 1337,
"interests": ["daydreaming", "hacking"]
}
```
| Format | Bytes | Notes |
|--------|-------|-------|
| JSON (no whitespace) | 81 | Field names repeated in every record; string quoting; no binary type |
| MessagePack | 66 | Binary JSON; same data model (no schema); field names still embedded |
| Thrift BinaryProtocol | 59 | Field tags + type annotations; no field names in encoded data |
| Thrift CompactProtocol | 34 | Field type + tag packed into one byte; varint integers |
| Protocol Buffers | 33 | Field tags; varint encoding for integers; single encoding format |
| Avro | 32 | No tags, no type annotations; values concatenated in schema field order |
Key observation: The difference between JSON (81 bytes) and Avro (32 bytes) is 60% — not negligible at terabyte scale. The difference between JSON (81 bytes) and MessagePack (66 bytes) is only 19% — often not worth the loss of human-readability for moderate data volumes.
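The percentages follow directly from the byte counts in the table. A short sketch of the arithmetic, using only the figures above (`SIZES` and `saving_vs_json` are illustrative names):

```python
# Encoded size in bytes of the reference record, taken from the table above.
SIZES = {
    "JSON": 81,
    "MessagePack": 66,
    "Thrift BinaryProtocol": 59,
    "Thrift CompactProtocol": 34,
    "Protocol Buffers": 33,
    "Avro": 32,
}


def saving_vs_json(fmt: str) -> float:
    """Percentage reduction in size relative to JSON, rounded to one decimal."""
    return round(100 * (SIZES["JSON"] - SIZES[fmt]) / SIZES["JSON"], 1)


for fmt in SIZES:
    print(f"{fmt:24s} {SIZES[fmt]:3d} bytes  {saving_vs_json(fmt):5.1f}% smaller than JSON")
```

These ratios hold for this one small record; records dominated by long strings or binary blobs compress far less, because the savings come mostly from dropping field names and quoting.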
---
## Scoring Matrix: Six Criteria Across Four Encoding Families
Scores are 1–5 per criterion. These represent typical scores for a general case — adjust for your specific system using the notes in each cell.
### Cross-language support (can writer and reader use different programming languages?)
| Format | Score | Notes |
|--------|-------|-------|
| JSON/XML | 5 | Libraries in every language; the default for public APIs |
| Binary JSON (MessagePack, BSON) | 4 | Libraries for major languages; smaller ecosystem than JSON |
| Thrift / Protocol Buffers | 5 | Official libraries for Java, Go, Python, C++, Ruby, C#, JavaScript, Rust, and more; gRPC expands the ecosystem |
| Avro | 4 | Java-native (Hadoop ecosystem); good Python support; thinner in Go and Rust |
### Schema evolution safety (does the format enforce backward/forward compatibility?)
| Format | Score | Notes |
|--------|-------|-------|
| JSON/XML | 2 | No built-in mechanism; relies on convention (lenient parsers, optional fields); no tooling to detect incompatible changes |
| Binary JSON | 2 | Same as JSON — binary encoding but no schema; same compatibility limitations |
| Thrift / Protocol Buffers | 5 | Field tags provide explicit compatibility; `reserved` prevents tag reuse; `buf breaking` can detect breaking changes in CI |
| Avro | 5 | Name-based resolution with defaults; schema registry compatibility modes (BACKWARD, FORWARD, FULL, TRANSITIVE variants) enforce rules before a schema is registered |
### Payload compactness (how large is the encoded output relative to the logical data?)
| Format | Score | Notes |
|--------|-------|-------|
| JSON/XML | 1 | Field names repeated in every record; string quoting; no compact integer encoding |
| Binary JSON | 3 | Field names still embedded; varint integers in some implementations; moderate improvement |
| Thrift CompactProtocol / Protocol Buffers | 5 | No field names; varint integers; field type packed with tag |
| Avro | 5 | No field names, no tags, no type annotations; varint integers; most compact of all formats |
### Human-readability and debuggability (can you read encoded data without tooling?)
| Format | Score | Notes |
|--------|-------|-------|
| JSON/XML | 5 | Paste into a browser, cat to terminal, curl directly; debugging is trivial |
| Binary JSON | 3 | Requires `msgpack-tool` or equivalent; not human-readable but decode tools are lightweight |
| Thrift / Protocol Buffers | 2 | Binary; Protobuf data can be inspected with `protoc --decode_raw`, Thrift data needs a schema-aware decoder; schema needed for named field output |
| Avro | 2 | Binary; requires `avro-tools tojson` or `avro-tools getschema`; object container files are self-describing but binary |
### Code generation and type safety (does the format generate typed structs/classes?)
| Format | Score | Notes |
|--------|-------|-------|
| JSON/XML | 1 | No schema; types inferred at runtime; JSON Schema validation is optional and rarely enforced at compile time |
| Binary JSON | 1 | No schema; same as JSON |
| Thrift / Protocol Buffers | 5 | First-class code generation for all supported languages; IDE autocompletion; compile-time type checking in statically typed languages |
| Avro | 3 | Optional code generation; fully usable without it (especially in dynamic languages); self-describing object container files enable schema-free consumption |
### Dynamically generated schema support (can schemas be generated programmatically without manual work?)
| Format | Score | Notes |
|--------|-------|-------|
| JSON/XML | 5 | No schema required; any JSON is valid; trivial to generate |
| Binary JSON | 5 | No schema; same as JSON |
| Thrift / Protocol Buffers | 2 | Field tags must be assigned and managed; generating schemas from a database table requires careful tag assignment and bookkeeping to avoid reuse of old tag numbers |
| Avro | 5 | Field names map directly to column names; no tag management; generating a new Avro schema from an updated database table is mechanical (no bookkeeping) |
### Typical total scores (general case)
| Format | Cross-lang | Evolution | Compact | Readable | Code-gen | Dynamic | Total |
|--------|-----------|-----------|---------|----------|----------|---------|-------|
| JSON/XML | 5 | 2 | 1 | 5 | 1 | 5 | 19 |
| Binary JSON | 4 | 2 | 3 | 3 | 1 | 5 | 18 |
| Thrift/Protobuf | 5 | 5 | 5 | 2 | 5 | 2 | 24 |
| Avro | 4 | 5 | 5 | 2 | 3 | 5 | 24 |
Thrift/Protobuf and Avro tie on typical totals. The deciding criteria are: dynamic schema generation (Avro wins), code generation in statically typed languages (Protobuf/Thrift win), and Kafka ecosystem fit (Avro wins via schema registry).
---
## Compatibility Matrix: Change Types vs. Format
For each type of schema change, whether it is safe for each encoding family.
| Change type | JSON convention | Protobuf/Thrift | Avro |
|-------------|-----------------|-----------------|------|
| Add optional field | Safe (lenient parsers) | Safe (new tag, optional/default) | Safe (requires default value) |
| Add required field | Breaking | Breaking — never do post-deployment | N/A (no required concept; use union without null default) |
| Remove field (had default) | Breaking for dependents | Safe (mark tag reserved) | Safe (old readers get default) |
| Remove field (no default) | Breaking | Safe (mark tag reserved) | Breaks forward compatibility (old readers expect the field and have no default to fill in) |
| Rename field | Breaking | Safe (name not in encoded data) | Backward-only (add alias in reader schema) |
| Change field type (compatible) | Risky (no type safety) | Limited (see type rules) | Possible (Avro converts compatible types) |
| Change field type (incompatible) | Breaking | Breaking | Breaking |
| Reorder fields | Safe | Safe (tags not position-based) | Safe (resolution by name, not position) |
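| Add enum value | Risky (old code may error) | Risky for forward compatibility (old code may reject or default the unknown value, depending on implementation) | Risky for forward compatibility (old readers fail on the unknown symbol unless the enum declares a default) |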
| Remove enum value | Breaking for users of that value | Breaking | Breaking |
---
## Format Selection Summary by Scenario
| Scenario | Recommended format | Key reason |
|----------|--------------------|-----------|
| Public REST API | JSON | Cross-org boundary; external clients; browser-testable |
| Internal gRPC service, statically typed | Protocol Buffers | Code generation; explicit compatibility; compact encoding |
| Internal service, Facebook/Twitter stack | Apache Thrift | Same as Protobuf; existing ecosystem |
| Kafka event stream with schema registry | Apache Avro | Schema registry native integration; name-based evolution |
| Data lake archival (Hadoop, Parquet pipeline) | Apache Avro | Object container files embed schema; analytics tools native support |
| Database-to-file export with changing schema | Apache Avro | Dynamic schema generation from DB table; no tag management |
| Lightweight in-process cache | Language-specific (cautiously) | Only if truly transient; never persisted or sent over network |
| Dynamic language stack (Python/JS only) | Avro or JSON | Code generation adds no value; name-based resolution works without generated classes |
FILE:references/schema-evolution-rules.md
# Schema Evolution Rules Reference
## Definitions
**Backward compatibility:** New code can read data written by old code.
- If you deploy new code and it must read records written by old code (e.g., from a database), backward compatibility is required.
**Forward compatibility:** Old code can read data written by new code.
- If old and new code versions run simultaneously (rolling upgrade), old code must be able to read records written by new code without crashing.
**Both directions simultaneously:** Required during rolling upgrades and in message broker scenarios where producers and consumers deploy independently.
---
## Protocol Buffers Rules
### Field Tag Invariants (permanent constraints)
1. **A field's tag number is its permanent identity.** The tag number — not the field name — is what appears in the encoded bytes. It cannot change after any data has been written.
2. **A removed field's tag number is permanently retired.** Mark it with `reserved` in the `.proto` file. Never reuse it for a new field.
3. **A removed field's name is also retired.** Mark it with `reserved "field_name"` as well. This prevents a future developer from accidentally adding a new field with the same name but a different tag.
```protobuf
// Correct way to retire a field:
message Person {
reserved 3, 4;
reserved "legacy_score", "old_interests";
// Fields 3 and 4 existed before; they are gone now.
// New fields must use tags 5, 6, 7, ...
string user_name = 1;
int64 favorite_number = 2;
}
```
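The three invariants lend themselves to a mechanical check. The following is a minimal sketch, assuming two versions of a message are available as plain `name -> tag` dictionaries. That representation and the function name `find_tag_violations` are hypothetical and are not part of `protoc` or `buf`.

```python
def find_tag_violations(old_fields, new_fields, reserved_tags=(), reserved_names=()):
    """Return a list of violations of the field-tag invariants.

    old_fields / new_fields map field name -> tag number (a hypothetical,
    simplified view of two versions of one message definition).
    """
    violations = []
    for name, tag in new_fields.items():
        if tag in reserved_tags:
            violations.append(f"{name}: reuses reserved tag {tag}")
        if name in reserved_names:
            violations.append(f"{name}: reuses reserved name")
        if name in old_fields and old_fields[name] != tag:
            violations.append(f"{name}: tag changed {old_fields[name]} -> {tag}")
    # A tag that disappears from the new version must be listed as reserved.
    new_tags = set(new_fields.values())
    for name, tag in old_fields.items():
        if tag not in new_tags and tag not in reserved_tags:
            violations.append(f"{name}: tag {tag} removed but not reserved")
    return violations


old = {"user_name": 1, "favorite_number": 2, "legacy_score": 3, "old_interests": 4}
good = {"user_name": 1, "favorite_number": 2, "email": 5}
bad = {"user_name": 1, "favorite_number": 2, "email": 3}  # reuses retired tag 3
reserved = {"reserved_tags": (3, 4), "reserved_names": ("legacy_score", "old_interests")}

ok_result = find_tag_violations(old, good, **reserved)
bad_result = find_tag_violations(old, bad, **reserved)
print(ok_result)
print(bad_result)
```

A rename that keeps the same tag is deliberately not flagged, since names do not appear in the encoded bytes.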
### Change Classification Table
| Change | Backward compatible? | Forward compatible? | Notes |
|--------|---------------------|---------------------|-------|
| Add optional field (new tag) | Yes | Yes | New code reads old data: missing field gets default. Old code reads new data: skips unknown tag. |
| Add required field | No | No | Old data didn't write the field; required check fails on read. Never add required after initial deployment. |
| Remove optional field (mark reserved) | Yes | Yes | Old readers: field is absent, gets default or zero. New readers: ignore old data's bytes for retired tag. |
| Remove required field | Yes | No | New code reading old data skips the unknown tag. Old code reading new data fails its required check because the field is missing. Change required → optional first, deploy everywhere, then remove. |
| Rename a field (same tag) | Yes | Yes | Names are not in the encoded bytes. Tag is unchanged. |
| Change tag number | No | No | Breaking permanently. All existing encoded data is now misinterpreted. |
| Reuse a retired tag number | No | No | Old data with that tag now misinterpreted as the new field type. Silent data corruption. |
| Change int32 → int64 | Yes | Partial | New code reads old int32 data: widened to 64 bits (safe). Old code reads new int64 data: truncated to 32 bits if the value falls outside the int32 range (above 2^31 - 1 or below -2^31). Risky if values may exceed the int32 range. |
| Change optional → repeated | Yes | Yes | New code reading old data: list with 0 or 1 element. Old code reading new data: reads last element of repeated field. |
| Change repeated → optional | Partial | Partial | Risky: new code reading old data sees only last element; data loss if old data had multiple values. |
| Add enum value | Yes | Partial | Old code reading new data: may receive unknown enum value. Depends on implementation (some error, some use 0/default). |
| Remove enum value | Partial | No | Old code that reads data with the removed value may get an error or the default. Risky. |
### Datatype Compatibility (Protobuf wire types)
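Fields that share a wire type can be parsed without error, but only some type changes preserve the value:
- Varint (wire type 0): `int32`, `int64`, `uint32`, `uint64`, and `bool` are mutually compatible, subject to truncation/extension rules. `sint32` and `sint64` are compatible with each other but not with the other integer types, because they use ZigZag encoding. `enum` is compatible with the plain integer types, but values outside the enum's declared range may not be preserved as expected.
- 64-bit (wire type 1): `fixed64` and `sfixed64` are compatible. `double` shares the wire type but reinterprets the bits, so the value changes.
- Length-delimited (wire type 2): `string` and `bytes` are compatible as long as the bytes are valid UTF-8; an embedded message is compatible with `bytes` that hold an encoded instance of that message. Packed repeated fields also use this wire type.
- 32-bit (wire type 5): `fixed32` and `sfixed32` are compatible. `float` shares the wire type but reinterprets the bits.
Cross-wire-type changes are breaking: the wire type recorded in existing data no longer matches what the parser expects for that tag, so the value cannot be decoded as the declared type.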
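To make the wire-type and truncation rules concrete, here is a minimal sketch of varint and field-key handling in plain Python. It illustrates the encoding only and is not the Protobuf library API. The helper names are hypothetical, and `encode_varint` handles non-negative integers only.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def split_key(key: int):
    """A field key packs the tag number and the wire type into one varint."""
    return key >> 3, key & 0x07  # (field_number, wire_type)


def to_int32(value: int) -> int:
    """What a reader that still declares int32 keeps of a wider value."""
    return ((value & 0xFFFFFFFF) ^ 0x80000000) - 0x80000000


# Field 2, wire type 0 (varint), holding the value 1337.
encoded = encode_varint((2 << 3) | 0) + encode_varint(1337)
key, pos = decode_varint(encoded)
field_number, wire_type = split_key(key)
value, _ = decode_varint(encoded, pos)
print(encoded.hex(), field_number, wire_type, value)
print(to_int32(2**32 + 5))  # high bits are dropped
```

Because the key carries only the tag number and wire type, a reader can skip a field it does not know, but it cannot tell `int64` from `sint64` or `fixed64` from `double`. That is why same-wire-type changes parse without error yet may still change the value.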
### Recommended CI Tooling
- `buf breaking --against .git#branch=main` — detects breaking changes against the main branch before merge
- `buf lint` — enforces naming conventions and structural rules in `.proto` files
- Configure as a required CI check on all `.proto` file changes
---
## Apache Thrift Rules
Thrift has the same field-tag-based rules as Protocol Buffers with minor variations.
### BinaryProtocol vs. CompactProtocol
Both protocols encode the same schema and have the same compatibility rules. CompactProtocol is preferred for production use:
- BinaryProtocol: type (1 byte) + tag (2 bytes) + value = 59 bytes for the example record
- CompactProtocol: type + tag packed into 1 byte + varint value = 34 bytes for the same record
The two protocols are not wire-compatible: a reader configured for one cannot decode data written with the other. Switch between BinaryProtocol and CompactProtocol only at a coordinated, fleet-wide migration point where all readers and writers of a stream change together.
### Differences from Protocol Buffers
1. **List datatype:** Thrift has a dedicated `list<T>` datatype (parameterized by element type). This does not support the `optional` → `list` promotion that Protobuf's `repeated` supports.
2. **`required` vs. `optional`:** Thrift enforces `required` strictly. Same rule as Protobuf: never add a required field after initial deployment.
3. **Nested lists:** Thrift's dedicated list type supports nested lists (e.g., `list<list<string>>`). Protobuf's repeated fields do not directly support nesting — nested lists require wrapping in a message type.
### Change Classification Table
Same as the Protocol Buffers table above, except that the optional → repeated promotion has no Thrift equivalent:
| Change | Backward compatible? | Forward compatible? | Notes |
|--------|---------------------|---------------------|-------|
| Change optional field to list | No | No | Thrift's list type is distinct from optional; no automatic promotion |
---
## Apache Avro Rules
### Core Mechanism: Writer/Reader Schema Resolution
The Avro library takes the writer's schema and the reader's schema and translates the data:
- Field present in writer, absent in reader: ignored (the bytes are skipped correctly because the writer's schema gives the type needed to skip)
- Field present in reader, absent in writer: filled with the default value declared in the reader's schema
- Field present in both: value is translated from writer's type to reader's type (if compatible)
- Fields are matched by name, not position (order in the schema does not matter for compatibility)
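The resolution rules above can be sketched on already-decoded records. This is a simplified illustration under stated assumptions, not the Avro library: schemas are reduced to lists of field names and dicts, type promotion is omitted, and `resolve` is a hypothetical helper name.

```python
def resolve(record, writer_fields, reader_fields):
    """Translate a decoded record from the writer's schema to the reader's.

    writer_fields: field names declared in the writer's schema.
    reader_fields: dicts with "name" and optionally "default" and "aliases"
    (a simplified stand-in for an Avro record schema).
    """
    resolved = {}
    for field in reader_fields:
        # Match by name first, then by any alias the reader declares.
        candidates = [field["name"], *field.get("aliases", [])]
        source = next((n for n in candidates if n in writer_fields), None)
        if source is not None:
            resolved[field["name"]] = record[source]
        elif "default" in field:
            resolved[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value or default for field {field['name']!r}")
    # Writer fields the reader never asked for are simply dropped.
    return resolved


writer_fields = ["user_id", "userName", "legacyScore"]
record = {"user_id": 7, "userName": "Martin", "legacyScore": 3}
reader_fields = [
    {"name": "userName"},
    {"name": "userId", "aliases": ["user_id"]},
    {"name": "favoriteNumber", "default": None},
]

result = resolve(record, writer_fields, reader_fields)
print(result)
```

The reader's field order differs from the writer's, `legacyScore` is ignored, `userId` is matched through its alias, and `favoriteNumber` falls back to the reader's default.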
### Schema Distribution Mechanisms (required — choose one)
| Context | Mechanism | Implementation |
|---------|-----------|----------------|
| Large file, all records same schema | Embed writer's schema in file header | Avro object container file format; reader gets schema from file header |
| Database with per-record writes | Version number per record + schema registry DB | Store schema version as integer at start of every record; reader fetches schema by version from a schema table in the database |
| Network connection (two services) | Negotiate on connection setup | Avro RPC protocol: client and server exchange schemas at handshake; use same schema for connection lifetime |
| Kafka topics | Schema ID per message + schema registry service | Magic byte plus 4-byte schema ID prefixed to the message payload; Confluent Schema Registry stores schemas by ID; consumer fetches on first encounter |
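For the Kafka row, the framing can be sketched as follows. This assumes the Confluent wire format as commonly documented: one magic byte of 0, then a 4-byte big-endian schema ID, then the Avro-encoded body. The function name and the sample payload are hypothetical, and no registry lookup is performed.

```python
import struct

MAGIC_BYTE = 0  # assumption: Confluent framing uses a single zero byte


def split_confluent_message(message: bytes):
    """Return (schema_id, avro_payload) for a framed message value."""
    if len(message) < 5:
        raise ValueError("message too short for the 5-byte framing prefix")
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte {magic}")
    return schema_id, message[5:]


# Build a framed message by hand to show the layout.
framed = struct.pack(">bI", MAGIC_BYTE, 42) + b"avro-encoded-bytes"
schema_id, payload = split_confluent_message(framed)
print(schema_id, payload)
```

A consumer would use `schema_id` to fetch the writer's schema once, cache it, and then resolve it against its own reader schema.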
### Change Classification Table
| Change | Backward compatible? | Forward compatible? | Notes |
|--------|---------------------|---------------------|-------|
| Add field with default value | Yes | Yes | New readers reading old data: fill in the default for the missing field. Old readers reading new data: ignore the field they do not know. |
| Add field without default value | No | Yes | New readers reading old data: the field is missing and there is no default to supply, so resolution fails. Old readers reading new data: ignore the extra field. |
| Remove field that had default | Yes | Yes | Old readers: field is absent in new writer's data; default is used. New readers: field not present in new schema; old writer's data has it; ignored. |
| Remove field without default | Yes | No | New readers reading old writer's data: field is present in writer's schema but not reader's; ignored (safe). Old readers reading new writer's data: expected field not present; no default; fails. |
| Rename field | Yes (with alias) | No | Add the old name as an alias in the reader's schema so that new readers can match old data. Example: reader schema has `"name": "userId", "aliases": ["user_id"]`. Old readers cannot resolve the new name. |
| Reorder fields | Yes | Yes | Resolution is by name; field order is irrelevant. |
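| Change type (Avro-compatible) | Yes | Depends | The Avro spec lets a reader promote the writer's type: int→long, int→float, int→double, long→float, long→double, float→double, string→bytes, bytes→string. Numeric promotions only widen, so old readers cannot read data written with the wider type; string↔bytes works in both directions. |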
| Change type (incompatible) | No | No | Any other type change is breaking. |
| Add branch to union type | Yes | No | Old readers don't know the new branch and fail when a value uses it. |
| Remove branch from union type | No | Yes | New readers can't handle old data that uses the removed branch. |
| Add null to union (make nullable) | Yes | Partial | A special case of adding a branch: `long` becomes `["null", "long"]`. Old readers fail only on records where the value is actually null. |
### Avro Null and Default Values
Avro's default value must be of the type of the first branch in a union:
```json
// Correct: null is first branch; default is null
{"name": "favoriteNumber", "type": ["null", "long"], "default": null}
// Correct: long is first branch; default is a long
{"name": "favoriteNumber", "type": ["long", "null"], "default": 0}
// WRONG: default null but null is not first branch — Avro schema parse error
{"name": "favoriteNumber", "type": ["long", "null"], "default": null}
```
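The first-branch rule can be checked before a schema is ever registered. Below is a minimal sketch that covers only primitive type names. It is not a full Avro schema validator, and `default_matches_first_branch` is a hypothetical helper.

```python
def _is_int(v):
    return isinstance(v, int) and not isinstance(v, bool)


def _is_number(v):
    return isinstance(v, (int, float)) and not isinstance(v, bool)


CHECKS = {
    "null": lambda v: v is None,
    "boolean": lambda v: isinstance(v, bool),
    "int": _is_int,
    "long": _is_int,
    "float": _is_number,
    "double": _is_number,
    "string": lambda v: isinstance(v, str),
}


def default_matches_first_branch(field: dict) -> bool:
    """Check a field's default against the first branch of its union."""
    if "default" not in field:
        return True  # nothing to validate
    type_ = field["type"]
    first = type_[0] if isinstance(type_, list) else type_
    return CHECKS[first](field["default"])


null_first = {"name": "favoriteNumber", "type": ["null", "long"], "default": None}
long_first = {"name": "favoriteNumber", "type": ["long", "null"], "default": 0}
mismatch = {"name": "favoriteNumber", "type": ["long", "null"], "default": None}

results = [default_matches_first_branch(f) for f in (null_first, long_first, mismatch)]
print(results)
```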
### Schema Registry Compatibility Modes (Confluent Schema Registry)
| Mode | Meaning | Recommended for |
|------|---------|-----------------|
| BACKWARD | New schema is backward compatible with the immediately previous version | Most common; simple use cases |
| BACKWARD_TRANSITIVE | New schema is backward compatible with ALL previous versions | Recommended for Kafka topics with long retention (messages from old versions may be replayed) |
| FORWARD | New schema is forward compatible with the immediately previous version | Rare; use when producers upgrade before consumers |
| FORWARD_TRANSITIVE | Forward compatible with ALL previous versions | Rare |
| FULL | Both BACKWARD and FORWARD with the immediately previous version | Use when producers and consumers upgrade independently at any time |
| FULL_TRANSITIVE | Both BACKWARD_TRANSITIVE and FORWARD_TRANSITIVE | Strictest; use for long-lived event streams with independent deployments |
| NONE | No compatibility checking | Only for development/experimentation |
---
## JSON / XML Rules (Convention-Based)
JSON has no built-in compatibility mechanism. These are conventions:
### Safe by convention (if all consumers follow lenient parsing)
| Change | Notes |
|--------|-------|
| Add new field to response | Consumers must ignore unknown fields (lenient parsing). This is the JSON convention but not enforced by the format. |
| Add optional query parameter to request | Servers must ignore unknown parameters or return a helpful error. |
| Add optional body field to request | Server reads known fields; unknown fields ignored. |
### Breaking changes (require API version bump)
| Change | Why it's breaking |
|--------|------------------|
| Remove a field from response | Consumers that read the field get null/undefined and may crash or produce wrong results |
| Rename a field | Same as remove + add; old name becomes null |
| Add a required field to request | Old clients that don't send it will fail validation |
| Change field type | JSON does not enforce types; consumers parse to their expected type; mismatch produces silent corruption or parse error |
| Change number precision | Integers above 2^53 cannot all be represented exactly in an IEEE 754 double; JavaScript consumers silently round them |
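The precision row is easy to reproduce. Python's own `json` module parses integers exactly, so this sketch uses `float()` to stand in for a parser that stores every number as a double. Sending large IDs as strings is shown as one common workaround, not the only one.

```python
import json

big = 2**53 + 1           # one past the largest contiguous exact integer
as_double = float(big)    # what a double-based JSON parser would store
lossy = int(as_double)    # the neighbouring representable value

print(big, lossy, big == lossy)

# Workaround sketch: transmit large identifiers as strings.
safe = json.loads(json.dumps({"id": str(big)}))
print(int(safe["id"]) == big)
```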
### API Versioning Patterns
| Pattern | Example | When to use |
|---------|---------|-------------|
| URL versioning | `/v1/users`, `/v2/users` | Recommended for breaking changes; easy to route in reverse proxy |
| Accept header versioning | `Accept: application/vnd.myapi.v2+json` | RESTful purists prefer this; harder to test with browser |
| Query parameter | `?version=2` | Simplest; not RESTful; fine for internal APIs |
| Feature flags | `?include=newField` | For optional capabilities; not for breaking changes |
---
## Dataflow-Specific Compatibility Checklists
### Database (Mode A)
- [ ] Every schema change is backward compatible (data may be years old)
- [ ] New code handles missing fields from old records (null, zero, or explicit default)
- [ ] Read-modify-write operations preserve unknown fields (test this explicitly)
- [ ] Schema migration scripts do not rewrite data unnecessarily (expensive and introduces risk)
- [ ] Archival dumps use a self-describing format (Avro object container files or JSON with schema snapshot)
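The read-modify-write item is the one most often missed, so a small test sketch may help. It assumes records stored as JSON and an older code version that knows only two fields. All names here are hypothetical.

```python
import json

KNOWN_FIELDS = {"user_name", "favorite_number"}  # fields the old code knows


def update_lossy(raw: str, favorite_number: int) -> str:
    """Anti-pattern: decode into known fields only, then re-encode."""
    doc = json.loads(raw)
    model = {k: v for k, v in doc.items() if k in KNOWN_FIELDS}
    model["favorite_number"] = favorite_number
    return json.dumps(model)


def update_preserving(raw: str, favorite_number: int) -> str:
    """Keep the full decoded document and change only the target field."""
    doc = json.loads(raw)
    doc["favorite_number"] = favorite_number
    return json.dumps(doc)


# Record written by newer code that added "photo_url".
stored = json.dumps({"user_name": "Martin", "favorite_number": 1337,
                     "photo_url": "https://example.com/m.png"})

lost = json.loads(update_lossy(stored, 42))
kept = json.loads(update_preserving(stored, 42))
print("photo_url" in lost, "photo_url" in kept)
```

The lossy variant mirrors what happens when a record is decoded into a typed model object that has no slot for unknown fields and is then written back.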
### Service Calls (Mode B)
- [ ] Server-side schema changes are backward compatible with the current client version
- [ ] Client-side schema changes are forward compatible with the current server version
- [ ] For public APIs: changes are backward compatible with ALL deployed client versions
- [ ] API versioning strategy is defined before first breaking change is needed
- [ ] Compatibility is tested with both old-client → new-server and new-client → old-server scenarios
### Message Passing (Mode C)
- [ ] Schema changes are both backward and forward compatible simultaneously
- [ ] All new fields have default values (Avro) or are optional (Protobuf)
- [ ] Consumers that republish messages preserve unknown fields
- [ ] Schema registry or equivalent schema distribution mechanism is operational before producers use new schema
- [ ] Consumer handles messages written by schema versions older than the current version
- [ ] Schema registry compatibility mode is set to BACKWARD_TRANSITIVE or FULL_TRANSITIVE for long-retention topics
Diagnose distributed system failures caused by network faults, unreliable clocks, or process pauses — and map each to its correct mitigation. Use when: a nod...
---
name: distributed-failure-analyzer
description: |
Diagnose distributed system failures caused by network faults, unreliable clocks, or process pauses — and map each to its correct mitigation. Use when: a node is intermittently timing out with no clear network outage; a lock-holder or leader keeps acting after being declared dead (zombie leader / split brain via distributed locking, not replication topology — use replication-failure-analyzer for replica split brain); stale reads persist beyond expected replication lag; wall-clock-based lease checks or last-write-wins conflict resolution is producing data loss under clock skew; or cascading node-death declarations are occurring under load. Also use proactively to audit timing assumptions in new system designs (absence of fencing tokens, NTP drift exposure, GC pause risk). Distinct from replication-failure-analyzer (replication lag anomalies, failover pitfalls, quorum edge cases). Produces a structured failure report: symptom → fault category → mechanism → mitigation. Covers: asynchronous network behavior, timeout tuning and cascade risk, NTP drift and clock jump mechanics, process pause causes (GC, VM migration, paging, SIGSTOP), fencing tokens with ZooKeeper zxid/cversion, Byzantine fault scoping, and system model selection (crash-stop vs. crash-recovery vs. Byzantine; synchronous vs. partially synchronous vs. asynchronous).
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/designing-data-intensive-applications/skills/distributed-failure-analyzer
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
- replication-strategy-selector
- consistency-model-selector
source-books:
- id: designing-data-intensive-applications
title: "Designing Data-Intensive Applications"
authors: ["Martin Kleppmann"]
chapters: [8]
tags:
- distributed-systems
- failure-analysis
- network-faults
- unreliable-clocks
- process-pauses
- fencing-tokens
- zombie-leader
- split-brain
- distributed-locking
- last-write-wins
- clock-skew
- ntp-drift
- garbage-collection
- partial-failure
- timeout-tuning
- byzantine-faults
- system-model
- crash-recovery
- quorum
- zookeeper
- lease-expiry
- partial-synchrony
execution:
tier: 2
mode: hybrid
inputs:
- type: codebase
description: "Application source code, infrastructure configs (docker-compose, k8s manifests), or architecture description revealing timing assumptions, locking patterns, and clock usage"
- type: document
description: "Incident report, runbook, or architecture description if no codebase is available"
- type: description
description: "User-described symptoms: what failed, when, topology, observed behavior vs. expected behavior"
tools-required: [Read, Write, Grep]
tools-optional: [Bash]
mcps-required: []
environment: "Run inside a project directory with codebase or configuration files, or accept a verbal description of the failure. Produces a written failure analysis report."
discovery:
goal: "Produce a structured failure analysis report: classify each symptom into its fault category, identify the specific mechanism, and recommend concrete mitigations"
tasks:
- "Gather failure symptoms, system topology, and observed vs. expected behavior"
- "Classify each symptom into network fault, clock unreliability, or process pause"
- "Identify the specific failure mechanism within the category"
- "Scan codebase for anti-patterns (clock-based ordering, unguarded lease checks, missing fencing tokens)"
- "Recommend mitigations matched to the root cause"
- "Assess Byzantine fault risk and whether it is relevant to this context"
- "Select appropriate system model assumptions for the architecture"
audience:
roles: ["backend-engineer", "software-architect", "site-reliability-engineer", "tech-lead", "data-engineer"]
experience: "intermediate-to-advanced — assumes familiarity with distributed systems concepts"
triggers:
- "Intermittent timeouts with no clear network cause"
- "Data disappearing or being silently overwritten after concurrent writes"
- "A node or leader that continues acting after being declared dead"
- "Stale reads that persist longer than replication lag explains"
- "Cascading node-death declarations under load"
- "Distributed lock / lease behavior that appears correct but causes corruption under failure"
- "Designing a system and wanting to audit timing and locking assumptions before shipping"
not_for:
- "Choosing a replication topology — use replication-strategy-selector"
- "Selecting consistency and isolation levels — use consistency-model-selector"
- "Diagnosing replication lag anomalies specifically — use replication-failure-analyzer"
---
## When to Use
You are diagnosing or preventing a failure in a distributed system and the root cause is not immediately obvious. The symptom could be: a timeout that might mean the remote node is dead, or might mean the network is congested, or might mean the node is alive but paused. A write that appeared to succeed but whose data is now missing. A leader that is still writing after another leader was elected. A lock that was held correctly but still allowed two writers simultaneously.
This skill imposes a diagnostic framework: every unexplained distributed system failure traces to one of three root fault categories — **network faults**, **clock unreliability**, or **process pauses** — and each category has a bounded set of mechanisms and well-understood mitigations. The skill maps symptoms to categories, categories to mechanisms, and mechanisms to concrete fixes.
Use it reactively (incident post-mortem, production debugging) or proactively (design review, codebase audit for timing anti-patterns).
Cross-references:
- `replication-failure-analyzer` — for failures specific to replication lag, failover, and quorum behavior
- `consistency-model-selector` — for selecting isolation and consistency guarantees that prevent a class of failures at the application layer
---
## Context and Input Gathering
Before analysis, collect the following. Ask the user for any that are missing.
**Required:**
1. **Symptom description** — what was observed vs. what was expected. Be concrete: "node A sent a write that was confirmed, but the data is now absent on node B" is better than "data loss."
2. **System topology** — number of nodes, roles (leader/follower, lock-holder/client, partitioned shards), deployment environment (bare metal, VMs, cloud, containers).
3. **Failure timeline** — when did it start, is it reproducible, does it correlate with load spikes, deploys, or maintenance events?
**Useful:**
4. **Infrastructure configuration** — timeout values, NTP setup, GC settings (heap size, collector type), VM migration policies.
5. **Relevant code** — any section that checks wall-clock time, holds a lease or lock, does conflict resolution (especially last-write-wins), or makes assumptions about execution timing.
6. **Logs or metrics** — round-trip time distributions, GC pause logs, NTP offset metrics, CPU steal time.
**If no codebase is available:** accept a verbal architecture description and produce an analysis based on stated behavior patterns. The output will note which findings are speculative vs. confirmed.
---
## Process
### Step 1 — Map symptoms to the fault taxonomy
WHY: Distributed failures are non-deterministic and the same symptom can arise from multiple root causes. The three-category taxonomy forces explicit elimination: you cannot address "network issues" without first determining which of the six network failure modes is actually occurring, or whether it is a process pause mimicking a network failure.
For each reported symptom, classify it against this taxonomy:
**Category A: Network faults**
The network is an asynchronous packet network with unbounded delays. When a request is sent and no response arrives, it is impossible to distinguish which of these occurred:
1. Request lost in transit (cable, queue drop)
2. Request queued, not yet delivered (network congestion, switch buffer overflow)
3. Remote node crashed or was powered down
4. Remote node temporarily unresponsive (GC pause — overlaps Category C)
5. Response lost in transit (misconfigured switch, asymmetric fault)
6. Response delayed (network or receiver overloaded)
Key diagnostic signal: **timeouts tell you nothing about whether the remote node executed your request.** A timeout means you gave up waiting; it does not mean the request failed.
**Category B: Clock unreliability**
Each node has its own hardware clock (quartz oscillator). Clocks drift and can jump:
- **Time-of-day clock** (wall-clock time): synchronized via NTP, but NTP accuracy is limited by network round-trip time. On a congested network, NTP error can be 35ms to over 100ms. Clocks may jump backward if they are ahead of the NTP server.
- **Monotonic clock**: suitable for measuring elapsed time (durations, timeouts) on a single node. Cannot be compared across nodes — the absolute value is meaningless.
- **Sources of clock error**: quartz drift (up to 200 ppm = 17 seconds/day without sync), NTP misconfiguration (firewall blocking NTP traffic), NTP server errors (some servers are wrong or misconfigured), leap seconds (a minute can be 59 or 61 seconds), VM virtualization (CPU time-sharing pauses the VM; from the application's view, the clock jumps forward).
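The drift figure and the clock distinction can be made concrete. This minimal sketch uses only the standard library; the helper names are hypothetical.

```python
import time


def drift_seconds_per_day(ppm: float) -> float:
    """Worst-case drift accumulated over 24 hours at the given ppm rate."""
    return ppm * 1e-6 * 24 * 60 * 60


def measure_elapsed(fn) -> float:
    """Time fn() with the monotonic clock, which is never stepped backward."""
    start = time.monotonic()
    fn()
    return time.monotonic() - start


daily_drift = drift_seconds_per_day(200)
elapsed = measure_elapsed(lambda: time.sleep(0.01))

print(f"200 ppm drift: about {daily_drift:.2f} s/day")
print(f"elapsed: {elapsed:.4f} s")

# time.time() is the wall clock: fine for display, unsafe for durations,
# because NTP may step it forward or backward between two readings.
wall_now = time.time()
```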
**Category C: Process pauses**
A thread can be preempted at any point in its execution and paused for an arbitrary duration without being notified. Pause causes:
- **Stop-the-world GC**: JVM, .NET, Ruby, and others pause all threads during major GC. Pauses of several seconds are possible even with "concurrent" collectors like CMS (which still has stop-the-world phases).
- **VM suspension**: a hypervisor may suspend a VM (save memory to disk) and resume it seconds or minutes later — live migration between hosts does this.
- **OS context switches / CPU steal**: in multi-tenant environments, another VM consuming a shared CPU core creates "steal time" — the paused VM's threads wait with no awareness that real time is passing.
- **Disk I/O**: synchronous disk access pauses the thread. In Java, class loading can trigger disk I/O unexpectedly. On network-attached storage (EBS, NFS), I/O latency inherits network variability.
- **Memory paging (thrashing)**: if the OS swaps pages to disk under memory pressure, a thread can be paused waiting for page-in. In extreme cases the system spends most of its time paging.
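- **SIGSTOP**: a Unix signal that immediately halts a process until it receives SIGCONT. It can be sent accidentally by operators or by tooling (Ctrl-Z in a shell sends the closely related SIGTSTP, which has the same effect unless the process handles it).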
After categorizing, state the most probable category per symptom and note any ambiguity.
---
### Step 2 — Identify the specific failure mechanism
WHY: The category narrows the space; the mechanism identifies the specific causal chain and determines which mitigation is effective. A "clock unreliability" failure caused by NTP being blocked at the firewall needs a different fix than one caused by VM time virtualization.
**For network faults:**
- Distinguish between node death (crash-stop) and node pause (crash-recovery or process pause). If the node later recovers, it is likely a pause, not a crash.
- Check for asymmetric faults: a node can receive messages but its outgoing messages are dropped. This node will appear dead to the rest of the cluster despite being healthy. Symptom: the "dead" node does not know it has been declared dead.
- Check for timeout calibration: is the timeout shorter than the 99th percentile round-trip time under load? Premature timeouts under load cause cascading failures — a node declared dead transfers its load to other nodes, which increases their load, which makes them slower, which causes more timeouts, which declares more nodes dead.
**For clock unreliability:**
- Identify whether the code uses wall-clock time for **ordering events across nodes** (anti-pattern) or for **measuring elapsed time on a single node** (acceptable with monotonic clock).
- Last-write-wins (LWW) conflict resolution using wall-clock timestamps is the canonical clock anti-pattern: writes from a node with a lagging clock are silently discarded in favor of earlier writes from a node with a fast clock, because the fast node's timestamps appear "more recent." Data is lost with no error reported to the application.
- Lease checks using wall-clock time are unsafe if the clock can jump or if a process pause occurs between the check and the protected operation (see Step 4).
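The LWW failure can be reproduced in a few lines. This is an illustrative sketch with made-up timestamps and a hypothetical `LwwRegister` class, not the conflict-resolution code of any particular database.

```python
class LwwRegister:
    """Keeps the value with the highest timestamp seen so far."""

    def __init__(self):
        self.value = None
        self.timestamp = float("-inf")

    def write(self, value, timestamp) -> bool:
        """Apply the write if its timestamp is newer; return True if applied."""
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp
            return True
        return False  # dropped, and the writer is never told


register = LwwRegister()

# Real time 100.000: node A, whose clock runs 50 ms fast, writes x=1.
first_applied = register.write("x=1", timestamp=100.050)
# Real time 100.020: node B, with an accurate clock, writes x=2 afterwards.
second_applied = register.write("x=2", timestamp=100.020)

print(first_applied, second_applied, register.value)
```

The later write loses because its timestamp is smaller, and node B cannot overwrite the value until its own clock passes 100.050.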
**For process pauses:**
- GC pause: correlate with GC logs. Look for stop-the-world events. Check heap size and object allocation rate. Long-lived objects accumulate and force full GC.
- VM migration: check hypervisor or cloud provider logs for live migration events.
- Thrashing: check system swap metrics and page fault rates.
- The defining characteristic: **the paused node does not know it was paused.** When it resumes, it continues from exactly where it stopped, and until it next reads a clock it has no sign that any time has passed. It has no awareness that the rest of the cluster has moved on, elected a new leader, or granted the lock to another node.
---
### Step 3 — Scan the codebase for anti-patterns (if codebase is available)
WHY: Many distributed system bugs are latent — they exist in the code but only manifest under specific timing conditions (high load, GC pressure, VM migration). Proactive scanning finds them before they trigger in production.
Search for:
**Anti-pattern 1: Wall-clock time used for cross-node event ordering**
```
# Look for: System.currentTimeMillis(), time.time(), Date.now(), new Date()
# used to timestamp events that are replicated or compared across nodes
```
Risk: If last-write-wins is the conflict resolution strategy and timestamps come from node clocks, any clock skew causes silent data loss.
Fix: Use logical clocks (version vectors, Lamport timestamps) for causal ordering. Use physical clocks only for duration measurement (monotonic clock).
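As an illustration of the fix, here is a minimal Lamport clock sketch. The class and method names are hypothetical. Lamport timestamps give an order consistent with causality but cannot detect concurrent writes; version vectors are needed for that.

```python
class LamportClock:
    """Logical clock: a counter plus a node ID used as a tie-breaker."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        """Local event or message send: advance and return a timestamp."""
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, remote_timestamp):
        """Message receipt: catch up to the sender's counter, then advance."""
        remote_counter, _ = remote_timestamp
        self.counter = max(self.counter, remote_counter)
        return self.tick()


a, b = LamportClock("A"), LamportClock("B")
t_write = a.tick()            # A performs a write
t_seen = b.receive(t_write)   # B learns of it
t_later = b.tick()            # B writes afterwards

print(t_write, t_seen, t_later)
```

Tuples compare by counter first and node ID second, so any event that causally follows another always carries a larger timestamp, regardless of wall-clock skew.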
**Anti-pattern 2: Client-side lease/lock validity check before a protected operation**
```
# Look for: if (lease.isValid()) { ... do protected operation ... }
# or: if (System.currentTimeMillis() < leaseExpiryTime) { ... }
```
Risk: A process pause between the check and the operation can outlast the lease. The node proceeds, believing it still holds the lock, while another node has already acquired it.
Fix: The resource itself must enforce the fencing token (see Step 4). Client-side checks are defense-in-depth only, not a correctness guarantee.
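The race can be reproduced deterministically with a fake clock. The sketch below is a simulation only; `FakeClock` and `unsafe_protected_write` are hypothetical names, and the pause is modeled by advancing the clock between the check and the act.

```python
class FakeClock:
    """Manually advanced clock so the race can be replayed deterministically."""

    def __init__(self, now: float = 0.0):
        self.now = now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def unsafe_protected_write(clock: FakeClock, lease_expiry: float, pause_seconds: float):
    """Check-then-act. Returns (wrote, lease_still_valid_at_write_time)."""
    if clock.now >= lease_expiry:
        return (False, False)  # lease already expired, do nothing
    # A GC pause, VM migration, or SIGSTOP can land exactly here.
    clock.advance(pause_seconds)
    # The write executes without re-checking, so the earlier check is stale.
    return (True, clock.now < lease_expiry)


clock = FakeClock(now=0.0)
outcome = unsafe_protected_write(clock, lease_expiry=10.0, pause_seconds=15.0)
# outcome is (True, False): the write happened although the lease had expired.
```

Re-checking the lease immediately before the write only narrows the window; a pause can still fall between the second check and the write.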
**Anti-pattern 3: Timeout values hard-coded without accounting for load variability**
```
# Look for: timeout = 5000 (fixed constant), no jitter, no adaptive adjustment
```
Risk: Timeouts calibrated for p50 latency will fire on p99 latency spikes, causing false node-death declarations.
Fix: Measure round-trip time distribution empirically. Set timeout at p99 + safety margin. Consider adaptive failure detectors (Phi Accrual, used in Akka and Cassandra) that adjust based on observed jitter.
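A minimal sketch of the calibration step, assuming round-trip samples have already been collected in milliseconds. The function names, the 1.5x safety factor, and the 10% jitter spread are illustrative assumptions, not recommended values.

```python
import math
import random


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of the samples (pct in the range 0 to 100)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def calibrated_timeout(rtts_ms: list[float], pct: float = 99.0,
                       safety_factor: float = 1.5) -> float:
    """Timeout derived from tail latency instead of a hard-coded constant."""
    return percentile(rtts_ms, pct) * safety_factor


def with_jitter(timeout_ms: float, spread: float = 0.1, rng=random) -> float:
    """Randomize the timeout by +/- spread so nodes do not all time out together."""
    return timeout_ms * (1 + rng.uniform(-spread, spread))


rtts = [float(ms) for ms in range(1, 101)]  # stand-in for measured round trips
base_timeout = calibrated_timeout(rtts)      # p99 is 99 ms, times 1.5
```

The samples should come from production-like load; a percentile computed from an idle cluster reproduces the original mistake.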
**Anti-pattern 4: Distributed locking without fencing tokens**
```
# Look for: acquire lock → use resource → release lock
# without any monotonically-increasing token passed to the resource
```
Risk: A zombie lock-holder (paused during lock hold, revived after expiry) will corrupt shared state by writing concurrently with the legitimate new lock-holder.
Fix: See Step 4.
---
### Step 4 — Apply the fencing token pattern (for zombie leader / zombie lock-holder failures)
WHY: Any lease or lock system that relies on the lock-holder checking its own lease status is vulnerable to process pauses. The lock-holder cannot detect that it was paused; the check passes because, from the holder's perspective, no time has elapsed. The only correct solution puts enforcement in the resource, not the client.
**The fencing token pattern:**
Every time the lock service grants a lock or lease, it returns a **fencing token** — a monotonically increasing integer (each new grant increments the counter). The lock-holder includes its token in every request to the protected resource. The resource tracks the highest token it has seen and rejects any request with a lower token.
Concrete mechanics:
1. Client 1 acquires lock → receives token 33.
2. Client 1 goes into a stop-the-world GC pause for 15 seconds.
3. Client 1's lock expires. Client 2 acquires lock → receives token 34.
4. Client 2 writes to the resource with token 34. Resource records "highest seen: 34."
5. Client 1 resumes. It believes it still holds the lock (lease check passes — its clock says almost no time has elapsed). It sends a write with token 33.
6. Resource rejects Client 1's write: token 33 < 34. Corruption prevented.
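The six steps above can be replayed against a toy resource that enforces the token check. This is a minimal in-memory sketch; `FencedResource` and `StaleTokenError` are hypothetical names, and a real storage service would persist the highest token durably.

```python
class StaleTokenError(Exception):
    """Raised when a request carries a token lower than one already seen."""


class FencedResource:
    """Resource-side enforcement: remembers the highest fencing token seen."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.data: dict[str, str] = {}

    def write(self, token: int, key: str, value: str) -> None:
        if token < self.highest_token:
            raise StaleTokenError(f"token {token} < highest seen {self.highest_token}")
        self.highest_token = token
        self.data[key] = value


resource = FencedResource()
resource.write(34, "file", "from client 2")       # step 4: new lock-holder writes

try:
    resource.write(33, "file", "from client 1")   # step 5: zombie resumes and writes
    client_1_rejected = False
except StaleTokenError:
    client_1_rejected = True                       # step 6: resource rejects the write
```

Equal tokens are accepted so that the current holder can issue several writes under one grant; only strictly lower tokens are rejected.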
**ZooKeeper integration:** If ZooKeeper is used as the lock service, either the transaction ID `zxid` or the node version `cversion` can serve as the fencing token. Both are guaranteed to be monotonically increasing.
**When the resource does not support fencing tokens natively:** For a file storage service, encode the fencing token in the filename or as a conditional write (compare-and-swap on a version field). Some kind of server-side enforcement is required — client-side enforcement alone is not sufficient for correctness.
**Important:** Fencing tokens protect against *inadvertent* zombie behavior (a node that does not know it has been declared dead). They do not protect against *deliberate* misbehavior — that requires Byzantine fault tolerance (see Step 5).
---
### Step 5 — Assess Byzantine fault risk
WHY: Byzantine fault tolerance is expensive and complex. For most datacenter systems it is unnecessary. The analysis must be explicit about whether Byzantine faults are in scope, so engineering effort is not misallocated.
**Byzantine faults are relevant when:**
- Nodes may send arbitrary, incorrect, or malicious messages (not just slow or silent).
- The system spans multiple organizations where participants may have conflicting incentives (blockchain, inter-bank settlement).
- Hardware operates in high-radiation environments (aerospace, embedded safety systems) where memory or CPU registers may be silently corrupted.
**Byzantine faults are NOT relevant for most server-side data systems when:**
- All nodes are controlled by your organization.
- The threat model is accidental failure, not adversarial behavior.
- Radiation levels are low enough that memory corruption is not a realistic concern.
For typical datacenter systems: assume crash-recovery faults (nodes may fail and restart, stable storage survives crashes, in-memory state is lost on crash), not Byzantine faults. Standard authentication, checksums in the application protocol, and NTP with multiple servers cover the "weak lying" cases (corrupted packets, misconfigured servers) without the overhead of full Byzantine fault-tolerant protocols.
---
### Step 6 — Select the appropriate system model
WHY: Distributed algorithms are designed for specific system model assumptions. Running an algorithm in a system whose actual behavior violates its model assumptions causes correctness failures. This step makes the model explicit so algorithm selection is grounded.
**Timing models:**
| Model | Assumption | Reality fit |
|---|---|---|
| Synchronous | Bounded network delay, bounded process pauses, bounded clock error | Not realistic for most packet networks and commodity hardware |
| Partially synchronous | Usually synchronous, occasionally exceeds bounds | Realistic for most production systems |
| Asynchronous | No timing assumptions, no timeouts | Very restrictive; few practical algorithms can operate without timeouts |
**Node failure models:**
| Model | Assumption | When to use |
|---|---|---|
| Crash-stop | Node fails by stopping; never recovers | Simplest; safe assumption for algorithm design even if nodes do recover |
| Crash-recovery | Nodes may crash and restart; stable storage survives; in-memory state is lost | Realistic for server-side systems with durable storage |
| Byzantine | Nodes may do anything, including sending false messages | Only for multi-party untrusted environments or high-radiation hardware |
**Recommended default for most server-side data systems:** partially synchronous model + crash-recovery faults.
State the chosen model explicitly in the analysis report. Any algorithm the team adopts (leader election, distributed locking, consensus) must be evaluated against this model.
---
### Step 7 — Produce the failure analysis report
Output a structured report with:
1. **Executive summary** — one paragraph: what failed, root fault category, severity.
2. **Symptom-to-mechanism mapping** — for each symptom: category, mechanism, confidence level.
3. **Anti-patterns found** — code locations (if codebase available), description, risk.
4. **Mitigations** — prioritized list, matched to root cause, with implementation notes.
5. **System model recommendation** — which timing model and node failure model the system should be designed against.
6. **Open questions** — what additional information (logs, metrics, code) would increase diagnostic confidence.
---
## Common Misdiagnoses
These are the mistakes most frequently made when reasoning about distributed failures. Treat each as an active risk to check.
**"The network is reliable inside our datacenter — this must be a software bug."**
Network faults occur inside datacenters. One study found ~12 network faults per month in a medium-sized datacenter, half disconnecting an entire rack. Redundant networking gear does not reduce faults proportionally because human error (misconfiguration) is a primary cause.
**"The timeout fired, so the node must be dead."**
A timeout means you stopped waiting. The remote node may have received your request and processed it, with only the response being lost or delayed. Acting on a false timeout (e.g., promoting a new leader) can create two leaders simultaneously.
**"We use NTP so our clocks are synchronized."**
NTP accuracy is limited by network round-trip time. On a congested network, NTP error can exceed 100ms. VMs are paused by the hypervisor and see their clock jump forward when resumed. Clocks behind a firewall that blocks NTP silently drift. "Synchronized" does not mean "accurate to the millisecond."
**"The node checked its lease and it was valid, so it was safe to proceed."**
A check-then-act sequence is not atomic in a distributed system. A GC pause, VM migration, or OS context switch between the check and the act can make the lease expire. Only server-side enforcement (fencing tokens) provides correctness.
**"Last-write-wins is fine because our timestamps are accurate enough."**
LWW silently discards writes from a node with a lagging clock when they conflict with values written by a node with a fast clock, even if the lagging node's write happened later. A clock skew of less than 3 ms between nodes is enough to cause this. The application receives no error. The data is simply gone.
**"The node is dead — it stopped responding."**
The node may be in a stop-the-world GC pause. It will resume, discover that it was declared dead, and attempt to continue its previous role. Without fencing tokens, this zombie behavior can corrupt state.
**"We need Byzantine fault tolerance because we can't trust all nodes."**
In a datacenter where your organization controls all nodes, Byzantine fault tolerance is almost certainly not needed and its cost (algorithmic complexity, performance overhead) is not justified. Standard authentication and checksums handle the realistic "lying" cases.
---
## Examples
### Example 1: Silent data loss in a multi-leader database
**Scenario:** A team runs Cassandra with multi-datacenter replication. After a network partition heals, some writes from one datacenter appear to be silently overwritten or lost. No errors were logged.
**Trigger:** Post-incident investigation after user complaints about missing data. Developer wonders if there is a replication bug.
**Process:**
1. Symptom classification: data loss with no error, after concurrent writes during a partition → Category B (clock unreliability) combined with conflict resolution policy.
2. Mechanism: Cassandra uses last-write-wins by default. During the partition, both datacenters accepted writes to the same key. On reconciliation, the write with the higher timestamp wins. If one datacenter's clock was ahead, its writes carried higher timestamps and survived, even when they occurred causally *before* the other datacenter's writes. The causally later writes were silently discarded.
3. Anti-pattern confirmed: timestamp-based conflict resolution with wall-clock time.
4. Mitigation: Replace LWW with application-level conflict resolution using version vectors (CRDTs for counters/sets, explicit merge logic for other data types). If LWW is retained, monitor clock offsets between datacenters and alert when skew exceeds acceptable threshold. Declare nodes with excessive clock drift dead and remove them from the cluster.
**Output:** Failure analysis report identifying LWW + clock skew as root cause. Migration plan to version-vector-based conflict resolution. Alert thresholds for inter-datacenter clock offset monitoring.
---
### Example 2: Zombie leader corrupts shared file storage
**Scenario:** A service uses a ZooKeeper-based distributed lock before writing to shared object storage. Occasionally, two clients write to the same object simultaneously, corrupting it. The team has confirmed the locking code "looks correct."
**Trigger:** Data corruption incident. Developer audits the locking implementation.
**Process:**
1. Symptom classification: two concurrent writers despite a mutual exclusion lock → Category C (process pause) causing zombie lock-holder behavior.
2. Mechanism: Client 1 acquires the lock and begins writing. A stop-the-world GC pause of 18 seconds occurs. The lock expires. Client 2 acquires the lock and begins writing. Client 1 resumes, checks `lease.isValid()`, finds the clock says only milliseconds have elapsed, concludes it still holds the lock, and continues writing. Two writers are now active simultaneously.
3. Anti-pattern confirmed: client-side lease validity check without server-side fencing token enforcement.
4. Mitigation: Implement fencing tokens. ZooKeeper's `zxid` is a suitable fencing token — it is monotonically increasing and returned with each lock grant. Pass the `zxid` with every write to the storage service. The storage service rejects writes with a `zxid` lower than the highest it has seen. Alternatively: use conditional writes (compare-and-swap on an object version field) at the storage layer to detect and reject stale writes.
**Output:** Failure analysis report. Code diff showing where to extract the `zxid` from ZooKeeper and pass it to storage writes. Storage service modification to track and enforce the fencing token.
---
### Example 3: Cascading node-death declarations under load spike
**Scenario:** During a traffic spike, a distributed database cluster declares multiple nodes dead in rapid succession. The cluster degrades severely. When traffic drops, all nodes recover and show healthy, but the cluster has partially lost quorum and requires manual intervention.
**Trigger:** Post-incident review. SRE team wants to understand why nodes were declared dead when they were actually alive.
**Process:**
1. Symptom classification: multiple simultaneous timeout-triggered node-death declarations under load → Category A (network fault) + timeout calibration error.
2. Mechanism: Timeouts were configured at 2 seconds — calibrated for p50 latency. During the load spike, p99 latency jumped to 4–6 seconds due to network congestion and CPU queueing. Health-check RPCs timed out. Nodes were declared dead. Their load was redistributed to remaining nodes, increasing their latency further, causing more timeouts, triggering more declarations — a cascading failure.
3. Anti-pattern: static timeout value calibrated for median, not tail, latency.
4. Mitigation: (a) Recalibrate timeout to p99.9 latency under expected peak load. (b) Add jitter to timeout values to prevent synchronized timeout storms. (c) Implement backpressure: before declaring a node dead, check whether the local node is itself under load (a proxy for network congestion). (d) Consider a Phi Accrual failure detector (used in Akka, Cassandra) that adapts to observed jitter. (e) Tune load shedding to shed excess traffic rather than redistributing it to already-loaded nodes.
**Output:** Failure analysis report. Timeout recalibration recommendation with p99.9 measurement methodology. Configuration changes. Recommendation to evaluate Phi Accrual failure detector.
---
## References
- `references/failure-taxonomy.md` — complete taxonomy: network fault modes, clock error sources, process pause causes, with detection signals and mitigation options
- `references/fencing-token-pattern.md` — fencing token mechanics, ZooKeeper integration, implementation for storage services without native fencing support
- `references/system-models.md` — timing models (synchronous, partially synchronous, asynchronous), node failure models (crash-stop, crash-recovery, Byzantine), safety vs. liveness properties
- `references/clock-pitfalls.md` — NTP accuracy limits, LWW data loss mechanics, clock confidence intervals, Google Spanner TrueTime approach
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Designing Data-Intensive Applications by Martin Kleppmann.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-replication-strategy-selector`
- `clawhub install bookforge-consistency-model-selector`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/clock-pitfalls.md
# Clock Pitfalls in Distributed Systems
Full reference for Step 2 (clock unreliability) of the distributed-failure-analyzer skill.
---
## The Core Problem
Each node in a distributed system has its own hardware clock. Clocks are not perfectly accurate and cannot be perfectly synchronized. Software that assumes clock accuracy across nodes will behave incorrectly in ways that are subtle, silent, and difficult to reproduce.
"Synchronized" does not mean "accurate." Even with NTP running correctly, clocks across nodes may differ by tens of milliseconds. Under network congestion, errors can exceed 100 ms. In virtualized environments, clocks can jump forward by arbitrary amounts.
---
## Time-of-Day vs. Monotonic Clocks
### Time-of-day clock (wall-clock time)
- **API examples:** `clock_gettime(CLOCK_REALTIME)` (Linux), `System.currentTimeMillis()` (Java), `time.time()` (Python), `Date.now()` (JavaScript)
- **Returns:** seconds (or milliseconds) since the Unix epoch (midnight UTC, January 1, 1970)
- **Synchronized via:** NTP
- **Jump risk:** YES — if the local clock is ahead of the NTP server, it may be forcibly reset backward. Applications observing time before and after this reset see time go backward or jump forward.
- **Use for:** timestamps of events for human display, expiry dates, calendar scheduling.
- **Do NOT use for:** measuring elapsed time (can jump), ordering events across nodes (skew causes incorrect orderings).
### Monotonic clock
- **API examples:** `clock_gettime(CLOCK_MONOTONIC)` (Linux), `System.nanoTime()` (Java), `time.monotonic()` (Python)
- **Returns:** nanoseconds since an arbitrary point (usually system boot). The absolute value is meaningless.
- **Synchronized via:** not synchronized — does not need to be, because it is only used to measure duration on a single node.
- **Jump risk:** NO — guaranteed to always move forward. NTP may *slew* (speed up or slow down) the monotonic clock by up to 0.05%, but it cannot jump.
- **Use for:** timeouts, measuring response latency, measuring elapsed time on a single node.
- **Do NOT use for:** comparing times across different nodes (absolute values are incomparable).
**Rule:** Use monotonic clocks for timeouts and durations. Use time-of-day clocks only where you need a point in calendar time, and never for cross-node ordering.
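A minimal sketch of the rule in Python, using `time.monotonic()` for the deadline. The `Deadline` class is a hypothetical helper; the injectable `clock` parameter exists only so the behavior can be tested without sleeping.

```python
import time


class Deadline:
    """Timeout tracked on the monotonic clock, immune to wall-clock jumps."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + timeout_s

    def remaining(self) -> float:
        """Seconds left before expiry, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def expired(self) -> bool:
        return self._clock() >= self._expires_at


deadline = Deadline(timeout_s=0.5)   # duration: monotonic clock
created_at = time.time()             # calendar timestamp for display: wall clock
```

The value returned by `time.monotonic()` is only meaningful as a difference on the same node, so `Deadline` objects must not be serialized and sent to other machines.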
---
## Sources of Clock Error
### 1. Quartz drift
- Quartz crystal oscillators run faster or slower depending on temperature.
- Google's assumption for server hardware: 200 ppm drift.
- 200 ppm = 6 ms drift for a clock resynchronized every 30 seconds; about 17 seconds of drift per day for a clock synced once daily.
- Even with NTP, clock drift between syncs accumulates.
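The drift figures follow from simple arithmetic: parts per million times the interval between syncs. The helper name below is an assumption made for this sketch.

```python
def max_drift_seconds(drift_ppm: int, sync_interval_s: int) -> float:
    """Worst-case accumulated drift between two synchronizations."""
    return drift_ppm * sync_interval_s / 1_000_000


drift_30s = max_drift_seconds(200, 30)         # 0.006 s, i.e. 6 ms
drift_daily = max_drift_seconds(200, 86_400)   # 17.28 s
```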
### 2. NTP accuracy limits
- NTP accuracy is bounded by network round-trip time to the NTP server.
- Best case observed in one experiment when synchronizing over the internet: ~35 ms minimum error.
- Under network congestion: errors can exceed 100 ms.
- Against public internet NTP servers, variable latency makes occasional larger errors possible.
### 3. NTP misconfiguration and blocking
- If a firewall blocks NTP traffic (UDP port 123), the NTP client cannot sync.
- The misconfiguration is silent — the clock keeps running and drifting.
- Anecdotal evidence: this happens in practice and goes unnoticed for extended periods.
- If the NTP client detects it is too far from the server's time, it may refuse to sync (to avoid destabilizing the system with a large jump).
### 4. Wrong or misconfigured NTP servers
- Some public NTP servers report incorrect times (hours off).
- NTP clients query multiple servers and reject outliers, but this protection is not perfect.
### 5. Leap seconds
- A leap second inserts an extra second (23:59:60) or deletes a second, making a minute 61 or 59 seconds.
- Systems not designed with leap seconds in mind have crashed.
- Best practice: configure NTP servers to "smear" the leap second — spread the adjustment gradually over the course of a day, so clocks see a slightly slower/faster rate of advance rather than a discontinuity.
- NTP server behavior on leap seconds varies in practice.
### 6. VM clock virtualization
- When a VM is paused (CPU time given to another VM), the VM's hardware clock is frozen.
- When the VM resumes, the hypervisor updates the VM's clock to the current time.
- From the application's perspective, the clock suddenly jumps forward by the pause duration.
- Pause durations for live migration: seconds to minutes, depending on memory write rate.
- This manifests as a time-of-day clock jump on resume.
---
## Clock-Based Anti-Patterns
### Anti-pattern 1: Last-write-wins with wall-clock timestamps
**How it fails:**
- In a multi-leader or leaderless database, concurrent writes to the same key from different nodes are resolved by keeping the write with the higher timestamp.
- If Node A's clock is 5 ms ahead of Node B's, a causally later write from Node B (timestamp = Node B's time) may have a lower timestamp than an earlier write from Node A (timestamp = Node A's time).
- Node B's write is discarded. No error is logged. The application has no idea.
- Example: Cassandra and Riak use LWW as a conflict resolution option.
**Concrete illustration (from the book, Figure 8-3):**
- Client A writes `x=1` on Node 1 at timestamp `42.004`.
- Client B increments `x` on Node 3, producing `x=2` at timestamp `42.003`. The increment happens after Client A's write, but Node 3's clock lags Node 1's, so it receives the lower timestamp.
- After replication, Node 2 receives both. LWW picks `x=1` (higher timestamp). Client B's increment is lost.
- A clock skew of less than 3 ms is sufficient to cause this data loss. Typical inter-datacenter skew is much larger.
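The scenario above can be replayed with a toy LWW merge. The `Write` record and `lww_merge` function are hypothetical; they model only the "highest timestamp wins" rule, not any particular database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    key: str
    value: int
    timestamp: float   # wall-clock time assigned by the originating node
    origin: str


def lww_merge(writes: list[Write]) -> dict[str, Write]:
    """Keep, per key, only the write with the highest timestamp."""
    winners: dict[str, Write] = {}
    for w in writes:
        current = winners.get(w.key)
        if current is None or w.timestamp > current.timestamp:
            winners[w.key] = w
    return winners


write_a = Write("x", 1, 42.004, "node1")
write_b = Write("x", 2, 42.003, "node3")   # causally later, but lower timestamp

state_on_node2 = lww_merge([write_a, write_b])
assert state_on_node2["x"].value == 1      # Client B's increment is gone
```

No exception is raised anywhere in the merge, which is why this failure never shows up in logs.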
**Fix:**
- Replace LWW timestamp-based resolution with logical clocks (version vectors) for causal ordering.
- If LWW is required by the storage system, monitor clock offsets between nodes and alert when skew exceeds an acceptable threshold. Nodes with excessive drift should be removed.
### Anti-pattern 2: Using clock timestamps as transaction IDs for ordering
**How it fails:**
- Generating monotonically increasing transaction IDs from wall-clock time in a distributed system requires that clocks are synchronized well enough that a later transaction on any node always gets a higher timestamp than earlier transactions.
- This is not achievable with NTP alone. Two transactions on different nodes within the same millisecond may receive the same timestamp.
- Transaction ID ordering breaks; snapshot isolation semantics are violated.
**Partial solution (Google Spanner's TrueTime):**
- Spanner's TrueTime API returns `[earliest, latest]` for the current time.
- Spanner waits out the uncertainty interval before committing a transaction — ensuring that any future transaction's earliest possible timestamp is after the current transaction's latest possible timestamp.
- This provides correct ordering at the cost of commit latency proportional to the uncertainty interval (~7 ms with GPS/atomic clocks in each datacenter).
- Not practical without GPS/atomic clock infrastructure.
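The commit-wait idea can be sketched with a simulated uncertain clock. Everything here is an assumption for illustration: `TimeInterval`, `FakeUncertainClock`, and `commit_wait` are hypothetical names, the half-width of 3.5 ms is chosen so the interval is 7 ms wide, and the code does not reflect Spanner's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    earliest: float
    latest: float

    def definitely_before(self, other: "TimeInterval") -> bool:
        """True only when the two intervals do not overlap."""
        return self.latest < other.earliest


class FakeUncertainClock:
    """Simulated clock whose readings carry +/- epsilon of uncertainty."""

    def __init__(self, start: float, epsilon: float):
        self.t = start
        self.epsilon = epsilon

    def now(self) -> TimeInterval:
        return TimeInterval(self.t - self.epsilon, self.t + self.epsilon)

    def sleep(self, seconds: float) -> None:
        self.t += seconds  # simulated: advances time instead of blocking


def commit_wait(commit_time: TimeInterval, clock: FakeUncertainClock) -> None:
    """Wait until the commit's latest possible time is certainly in the past."""
    while not commit_time.definitely_before(clock.now()):
        clock.sleep(0.001)


clock = FakeUncertainClock(start=100.0, epsilon=0.0035)
commit_time = clock.now()
commit_wait(commit_time, clock)
waited = clock.t - 100.0   # roughly the full width of the uncertainty interval
```

The wait grows with the interval width, which is why tight clock bounds matter so much for this approach.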
**Alternative:** Use logical clocks (Lamport timestamps, vector clocks) or a centralized sequence number generator for transaction IDs. Twitter's Snowflake generates approximately monotonically increasing IDs in a distributed way (but cannot guarantee consistency with causality).
### Anti-pattern 3: Clock-based lease validity check before a protected operation
Documented separately in `fencing-token-pattern.md`. In brief: checking `System.currentTimeMillis() < leaseExpiryTime` before acting on the lease is not atomic with the action. A process pause between the check and the action can cause the lease to have expired by the time the action executes.
---
## Clock Monitoring
Incorrect clocks are insidious because most things continue to work. The system does not crash or throw errors — it silently produces wrong results or drops data. This is why explicit monitoring is essential.
**Recommended monitoring:**
1. **NTP offset metrics**: track the offset between each node's clock and the NTP server. Alert when offset exceeds your acceptable threshold (typically 50–100 ms for most applications; much tighter for financial or high-precision systems).
2. **Inter-node clock skew**: for systems using timestamp-based conflict resolution, directly measure clock differences between nodes. Alert when skew approaches the granularity of your timestamps.
3. **Node removal on excessive drift**: any node whose clock drifts too far from the others should be declared dead and removed from the cluster. The node's incorrect timestamps can corrupt data or cause incorrect ordering.
**Tools:** `ntpq -p` (NTP status), `chronyc tracking` (chrony), Prometheus `node_timex_offset_seconds` metric, AWS CloudWatch `ClockErrorBound` for Spanner-equivalent services.
---
## When Clock Accuracy Really Matters
For most server-side data systems, the mitigations above are sufficient. Accuracy requirements vary by domain, and a few are far stricter:
- **High-frequency trading:** MiFID II (European regulation) requires financial institutions to synchronize clocks to within 100 microseconds of UTC.
- **Distributed tracing and log correlation:** millisecond-accurate clocks are usually sufficient; sub-millisecond is rarely needed.
- **Distributed snapshot isolation (Spanner-style):** requires GPS receivers or atomic clocks per datacenter.
For these use cases, the Precision Time Protocol (PTP, IEEE 1588) and GPS/atomic clock hardware are the relevant tools, not NTP.
FILE:references/failure-taxonomy.md
# Failure Taxonomy: Network, Clock, Process
Full reference for Step 1 and Step 2 of the distributed-failure-analyzer skill.
---
## Category A: Network Faults
### Background
The distributed systems discussed in this context are **shared-nothing systems**: nodes have their own memory and disk; communication happens only via messages over a network. Most internal datacenter networks (Ethernet) and the internet are **asynchronous packet networks**: there is no upper bound on delivery time, and no delivery guarantee.
### The Six Network Failure Modes (for a single request/response)
When a client sends a request and receives no response, any of these may have occurred:
| # | What failed | Distinguishable? | Notes |
|---|---|---|---|
| 1 | Request lost in transit | No | Cable unplugged, queue drop, routing failure |
| 2 | Request queued, not yet delivered | No | Network congestion, switch buffer full |
| 3 | Remote node crashed | No | Indistinguishable from 1, 5, 6 via timeout alone |
| 4 | Remote node temporarily unresponsive | No | GC pause, high load — node will recover |
| 5 | Response lost in transit | No | Misconfigured switch, asymmetric fault |
| 6 | Response delayed | No | Overloaded network or receiver node |
**Critical implication:** A timeout tells you that you stopped waiting. It does not tell you whether the remote node executed your request. If you retry after a timeout, the operation may execute twice.
### Network Fault Detection Signals
- **TCP RST or FIN**: the OS on the remote machine closed the port (process crashed, port not listening). Does not tell you how much work the process completed before crashing.
- **ICMP Destination Unreachable**: a router reports that it believes the IP is unreachable. The router has no special failure-detection ability, so its report is subject to the same limitations as any other participant's.
- **Management interface query**: for switches you control, can confirm link-level failure at hardware level. Not available in shared/cloud environments.
- **Application-level acknowledgment**: the only reliable confirmation that a request was processed correctly. TCP acknowledgment confirms delivery to the OS, not to the application.
### Network Fault Statistics (real-world)
- One study of a medium-sized datacenter: ~12 network faults per month. Half disconnected a single machine; half disconnected an entire rack.
- Adding redundant networking gear does not proportionally reduce faults — human error (misconfiguration) is a primary cause.
- Public clouds (e.g., EC2): frequent transient network glitches are documented.
- Asymmetric faults observed in practice: a network interface that drops all inbound packets but sends outbound packets successfully.
### Timeout Calibration
**Problem:** A long timeout delays failure detection. A short timeout causes false positives — declaring a live-but-slow node dead.
**Cascade failure mechanism:** Node declared dead → load redistributed to remaining nodes → remaining nodes become slower → their health checks time out → more nodes declared dead → cluster collapses.
**Calibration approach:**
1. Measure round-trip time distribution over an extended period, across many machines.
2. Set timeout at p99 or p99.9 latency under expected peak load, not median.
3. Add jitter to prevent synchronized timeout storms.
4. Consider adaptive failure detectors: **Phi Accrual** (used in Akka and Cassandra) dynamically adjusts the failure threshold based on observed jitter. TCP retransmission timeouts work on a similar principle.
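An accrual detector can be sketched as follows. This is a simplified variant that assumes heartbeat gaps follow an exponential distribution, which is an assumption of this sketch and not the only formulation; the class name and the threshold of 8.0 are illustrative.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Outputs a suspicion level (phi) instead of a binary alive/dead verdict."""

    def __init__(self, window: int = 100):
        self._intervals: deque[float] = deque(maxlen=window)
        self._last_heartbeat: float | None = None

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat arrival time (monotonic seconds)."""
        if self._last_heartbeat is not None:
            self._intervals.append(now - self._last_heartbeat)
        self._last_heartbeat = now

    def phi(self, now: float) -> float:
        """-log10 of the probability that a heartbeat is merely late."""
        if not self._intervals:
            return 0.0
        mean = sum(self._intervals) / len(self._intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self._last_heartbeat
        # Exponential model: P(gap > elapsed) = exp(-elapsed / mean)
        return elapsed / (mean * math.log(10))

    def suspect(self, now: float, threshold: float = 8.0) -> bool:
        return self.phi(now) > threshold


detector = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):
    detector.heartbeat(t)
```

Because `phi` scales with the observed mean interval, a cluster whose heartbeats slow down under load raises its own tolerance automatically instead of tripping a fixed timeout.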
**Formula for a theoretical synchronous network:** If max packet delay is `d` and max processing time is `r`, timeout = `2d + r`. In asynchronous networks there is no `d`, so there is no correct value — only empirical calibration.
---
## Category B: Clock Unreliability
### Background
Each node has its own hardware clock — typically a quartz crystal oscillator. Clocks drift (run faster or slower than real time). Distributed systems cannot share a clock.
### Clock Types
| Type | Purpose | Cross-node comparison | Jump risk |
|---|---|---|---|
| Time-of-day (wall-clock) | Points in time (timestamps for display, expiry dates) | Not safe | Yes: can jump backward on NTP reset |
| Monotonic | Elapsed time (durations, timeouts) | Never safe | No — guaranteed always forward |
**Rule of thumb:** Use monotonic clocks for timeouts and measuring durations on a single node. Never use time-of-day clocks to order events across nodes.
### Sources of Clock Error
| Source | Typical magnitude | Notes |
|---|---|---|
| Quartz drift | Up to 200 ppm (Google's server assumption) | 200 ppm = 6 ms drift if synced every 30 s; about 17 s of drift if synced once a day |
| NTP sync error | ~35 ms at best when syncing over the internet; above 100 ms under network congestion | Limited by network round-trip time |
| NTP firewall block | Unbounded — grows without bound | Goes unnoticed until clock is severely wrong |
| Wrong/misconfigured NTP servers | Hours off | Some public NTP servers are misconfigured |
| Leap seconds | ±1 second | A minute may be 59 or 61 seconds; has crashed production systems |
| VM time virtualization | Tens of milliseconds per pause | When a VM is paused (CPU given to another VM), the clock jumps forward on resume |
| User-controlled clocks | Arbitrary | On user devices (mobile, embedded); users may set clocks to incorrect values intentionally |
### Clock Confidence Interval
A clock reading is not a point in time — it is a range. `clock_gettime()` returns a single value but does not expose its uncertainty. The actual time may be anywhere within a confidence interval.
Google's **TrueTime API** (used in Spanner) explicitly returns `[earliest, latest]` — the earliest and latest possible current time. Spanner waits out the confidence interval before committing read-write transactions to ensure causal ordering. This requires GPS receivers or atomic clocks in each datacenter (uncertainty: ~7 ms).
For most systems without TrueTime: assume clock uncertainty of tens of milliseconds on a LAN, and design accordingly.
### Last-Write-Wins (LWW) Data Loss Mechanism
1. Client A writes `x=1` on Node 1 at wall-clock time `t=42.004`.
2. Client B increments `x` on Node 3, resulting in `x=2` at wall-clock time `t=42.003` (Node 3's clock lags Node 1's by less than 3 ms, so the later write receives the lower timestamp).
3. Both writes replicate to Node 2.
4. Node 2 applies LWW: `t=42.004 > t=42.003`, so `x=1` wins.
5. Client B's increment (`x=2`) is silently discarded. No error is reported.
**Detection:** Monitor inter-node clock offset. Alert when skew exceeds the acceptable threshold for your LWW use case. Nodes with excessive drift should be removed from the cluster before they cause data loss.
**Fix:** Replace LWW timestamp-based conflict resolution with **logical clocks** (version vectors, Lamport timestamps) for causal ordering. Logical clocks do not measure wall-clock time — they track only causal happens-before relationships, which are correct regardless of clock skew.
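A version-vector comparison can be sketched in a few lines. The function name and the string results are assumptions for this example; vectors are modeled as plain dicts mapping node ID to a counter.

```python
def compare_version_vectors(a: dict[str, int], b: dict[str, int]) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent' for vector a relative to b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # b has seen everything a has: b may safely overwrite a
    if b_le_a:
        return "after"
    return "concurrent"      # neither saw the other: keep both and merge


# Client B's increment was made after seeing Client A's write.
write_a = {"node1": 1}
write_b = {"node1": 1, "node3": 1}
relation = compare_version_vectors(write_a, write_b)

# Two writes made during a partition, neither aware of the other.
conflict = compare_version_vectors({"node1": 2}, {"node1": 1, "node3": 1})
```

Unlike LWW, a "concurrent" result surfaces the conflict to the application instead of silently picking a winner.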
---
## Category C: Process Pauses
### Background
A thread can be preempted at any arbitrary point in its execution and paused for an arbitrary duration. The thread has no awareness that it was paused — when it resumes, it continues from where it left off, with no knowledge that real time has passed.
### Pause Causes
| Cause | Typical duration | Notes |
|---|---|---|
| Stop-the-world GC | Milliseconds to several seconds | JVM, .NET, Ruby. "Concurrent" collectors (CMS) still have stop-the-world phases. |
| VM suspension / live migration | Seconds to minutes | Hypervisor saves VM memory to disk; restores on another host. Cloud providers do this without notice. |
| OS context switch | Microseconds to milliseconds | Normal, but under CPU steal (multi-tenant) the paused thread may wait much longer. |
| CPU steal time | Variable | Another VM is using the shared CPU core. The paused VM's threads wait with no awareness. |
| Synchronous disk I/O | Milliseconds to seconds | Worse on network-attached storage (EBS, NFS). Java class loading can trigger this unexpectedly. |
| Memory paging (thrashing) | Seconds | OS swaps pages to disk under memory pressure. Extreme: system spends most time swapping. |
| SIGSTOP / SIGTSTP signal | Arbitrary, until SIGCONT | Ctrl-Z in a shell sends SIGTSTP; `kill -STOP` or operations tooling can send SIGSTOP, sometimes accidentally. |
### The Zombie Problem
When a paused node resumes:
1. It does not know it was paused.
2. Its wall-clock time jumps forward (in virtualized environments).
3. It checks its state (e.g., lease validity) and finds it appears valid.
4. It continues its previous role — even though the rest of the cluster has declared it dead, elected a new leader, or granted a new lock.
5. Result: two nodes acting as leader simultaneously, or two lock-holders writing to the same resource.
### GC Pause Mitigation (without real-time systems)
- **Treat GC pauses as planned node outages**: drain requests from the node before GC, let other nodes handle traffic during GC, resume the node after collection.
- **Use short-lived object patterns**: restart processes periodically before long-lived objects accumulate enough to trigger a full GC. This works like a rolling restart.
- **Tune GC settings**: increase heap size (delays GC), tune allocation patterns. Reduces frequency but does not eliminate pauses.
- **Monitor**: track GC pause durations. Alert on pauses that exceed your timeout budget.
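One way to approximate the monitoring step from inside the process is a watchdog loop that sleeps for a fixed interval and measures how late it wakes. This is a sketch under that assumption; `find_pauses` and `sample_ticks` are hypothetical helpers, and runtime GC logs remain the more precise source.

```python
import time


def find_pauses(tick_times, interval_s, budget_s):
    """Return (index, extra_delay) for gaps that exceed interval + budget.

    tick_times: monotonic timestamps recorded by a loop that sleeps interval_s.
    """
    pauses = []
    for i in range(1, len(tick_times)):
        extra = (tick_times[i] - tick_times[i - 1]) - interval_s
        if extra > budget_s:
            pauses.append((i, extra))
    return pauses


def sample_ticks(count, interval_s):
    """Record monotonic timestamps from a sleeping loop (a hypothetical watchdog)."""
    ticks = [time.monotonic()]
    for _ in range(count):
        time.sleep(interval_s)
        ticks.append(time.monotonic())
    return ticks


# Synthetic trace: 10 ms ticks with one 400 ms stall before the fourth tick.
trace = [0.00, 0.01, 0.02, 0.43, 0.44]
print(find_pauses(trace, interval_s=0.01, budget_s=0.1))
```

The watchdog can only report a pause after the process resumes, so it supports alerting and timeout calibration, not prevention.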
---
## Summary: Symptom-to-Category Quick Reference
| Symptom | Most likely category | Secondary category |
|---|---|---|
| Intermittent timeout, node recovers | A (network) or C (process pause) | — |
| Node declared dead, then recovers with no awareness of downtime | C (process pause) | A (asymmetric network fault) |
| Two leaders active simultaneously | C (pause caused zombie) or A (split brain) | — |
| Silent data loss after concurrent writes | B (clock / LWW) | — |
| Lock corruption despite mutual exclusion code | C (pause caused zombie) | — |
| Stale reads longer than replication lag | A (delayed response) or B (clock) | — |
| Cascading node-death declarations under load | A (timeout miscalibration) | — |
| Clock-ordered events appear out of causal order | B (clock skew) | — |
FILE:references/fencing-token-pattern.md
# Fencing Token Pattern
Full reference for Step 4 of the distributed-failure-analyzer skill.
---
## Problem
A distributed lock or lease is meant to ensure that only one node acts as the chosen one (leader, lock-holder) at any time. But any locking scheme based on time-limited leases is vulnerable to process pauses:
1. Client 1 acquires a lease with a 30-second expiry.
2. Client 1 checks `lease.isValid()` and the check passes. Before it can issue the write, it enters a stop-the-world GC pause for 40 seconds.
3. The lease expires. Client 2 acquires the lease.
4. Client 2 begins writing to the shared resource.
5. Client 1 resumes. It does not know that 40 seconds have passed, so it proceeds with the write on the strength of the check it made before the pause.
6. Two writers are now active simultaneously. The shared resource is corrupted.
**This bug is documented in the HBase codebase** and is not theoretical.
The client-side lease check is the anti-pattern. The check is not atomic with the protected operation — any pause between them invalidates the check's conclusion.
---
## Solution: The Fencing Token
The lock service issues a **fencing token** with every lock or lease grant. A fencing token is a monotonically increasing integer — each new grant increments the counter. The lock-holder must include its token in every request to the protected resource. The resource tracks the highest token it has processed and rejects any request with a lower token.
### Sequence Diagram
```
Lock service               Client 1               Client 2               Storage
|                          |                      |                      |
|<-- acquire lock ---------|                      |                      |
|-- ok, token=33 --------->|                      |                      |
|                          |                      |                      |
|            [Client 1: stop-the-world GC pause, 40 seconds]             |
|                          |                      |                      |
| lease expired            |                      |                      |
|<-- acquire lock --------------------------------|                      |
|-- ok, token=34 -------------------------------->|                      |
|                          |                      |-- write, token=34 -->|
|                          |                      |<-- ok ---------------|
|                          |                      |                      |
|             [Client 1 resumes, unaware that it was paused]             |
|                          |                      |                      |
|                          |-- write, token=33 ------------------------->|
|                          |<-- REJECTED: token 33 < 34 seen ------------|
```
Client 1's write is rejected. No corruption.
---
## Implementation
### Requirements on the lock service
- Every lock grant returns a token that is strictly greater than all previously issued tokens.
- Tokens need not be sequential (gaps are allowed) — they must only be monotonically increasing.
### Requirements on the resource
- Maintain state: `highest_token_seen`.
- On receiving a write request with token `t`:
  - If `t >= highest_token_seen`: accept the write and set `highest_token_seen = t`. Equality must be accepted so that the current holder can issue more than one write under the same grant.
  - If `t < highest_token_seen`: reject the write with an error indicating a stale token.
- This state must be persisted durably (survives crashes).
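The resource-side requirements can be sketched as follows. This is a single-process illustration, assuming a JSON file as the durable store; `FencedResource` and `StaleTokenError` are hypothetical names. It rejects only strictly lower tokens, matching "rejects any request with a lower token" in the Solution section, so the current holder can write repeatedly under one grant.

```python
import json
import os
import tempfile


class StaleTokenError(Exception):
    """Raised when a write carries a token lower than one already seen."""


class FencedResource:
    def __init__(self, state_path):
        self.state_path = state_path
        self.highest_token_seen = -1
        self.data = None
        if os.path.exists(state_path):
            # Recover the token high-water mark after a restart.
            with open(state_path) as f:
                state = json.load(f)
            self.highest_token_seen = state["highest_token_seen"]
            self.data = state["data"]

    def write(self, data, token):
        if token < self.highest_token_seen:
            raise StaleTokenError(f"token {token} < {self.highest_token_seen} seen")
        # Persist token and data together so a crash cannot separate them.
        tmp = self.state_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"highest_token_seen": token, "data": data}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.state_path)
        self.highest_token_seen = token
        self.data = data


path = os.path.join(tempfile.mkdtemp(), "resource.json")
storage = FencedResource(path)
storage.write("from client 2", token=34)
try:
    storage.write("from client 1", token=33)  # the zombie's delayed write
except StaleTokenError as err:
    print("rejected:", err)
```

A server handling concurrent requests would also need to make the check and the write atomic, for example by holding a mutex around `write`.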
### Requirements on the client
- Extract the fencing token from the lock service response.
- Pass the token with every request to the protected resource.
- Handle rejections: a rejection means the lock has been superseded. The client should not retry — it should abandon the operation and re-acquire the lock if it still needs access.
---
## ZooKeeper Integration
If ZooKeeper is used as the lock service:
- **`zxid` (ZooKeeper transaction ID)**: a globally monotonically increasing ID assigned to every transaction in ZooKeeper. When a client acquires a lock (creates an ephemeral node), the `zxid` of that transaction serves as the fencing token.
- **`cversion` (child version)**: the version number of a node's children, incremented on each child modification. Also monotonically increasing and usable as a fencing token.
Both are guaranteed to be monotonically increasing and are available from the ZooKeeper API response.
**Example (Java, using Curator):**
```java
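import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
// getLockPath() is protected in InterProcessMutex; a small subclass exposes it.
class FencedMutex extends InterProcessMutex {
    FencedMutex(CuratorFramework client, String path) { super(client, path); }
    String lockNodePath() { return getLockPath(); }
}
FencedMutex lock = new FencedMutex(client, "/locks/my-resource");
lock.acquire();
try {
    // czxid of this client's own lock node; the parent's czxid never changes between grants
    long fencingToken = client.checkExists().forPath(lock.lockNodePath()).getCzxid();
    storageService.write(data, fencingToken); // pass token to resource
} finally {
    lock.release(); // release even if the write throws
}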
```
---
## When the Resource Does Not Natively Support Fencing Tokens
Not all storage services expose a fencing token check API. Options:
### Option 1: Conditional writes (compare-and-swap)
Most databases support conditional updates. Use an object version field:
```sql
-- Write only if version matches what we last read
UPDATE resource SET data = ?, version = version + 1
WHERE id = ? AND version = ?
```
This is not exactly fencing tokens, but achieves a similar effect: stale writers whose version does not match the current version will have their writes rejected.
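A runnable sketch of the same conditional update, assuming SQLite as a stand-in database. The `resource` table and `conditional_write` helper are hypothetical; the point is that the affected-row count tells the writer whether it lost the race.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resource (id INTEGER PRIMARY KEY, data TEXT, version INTEGER)")
conn.execute("INSERT INTO resource VALUES (1, 'initial', 1)")


def conditional_write(conn, resource_id, new_data, expected_version):
    """Apply the update only if the row still has the version we read."""
    cur = conn.execute(
        "UPDATE resource SET data = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_data, resource_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated means another writer got there first


# Both writers read version 1; only the first update matches.
print(conditional_write(conn, 1, "writer B", expected_version=1))          # True
print(conditional_write(conn, 1, "writer A (stale)", expected_version=1))  # False
```

A writer that gets `False` should re-read the row and decide whether its change still applies, not blindly retry with the new version.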
### Option 2: Encode the token in the resource name
For file storage: include the fencing token in the filename or object key.
- Client acquires lock with token 33: writes to `data-33.json`.
- Client acquires lock with token 34: writes to `data-34.json`.
- Readers always use the file with the highest token number.
This is a workaround, not a full solution — it requires readers to be aware of the convention.
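The reader side of this convention might look like the sketch below, assuming keys of the form `data-<token>.json` as in the example above. `latest_key` is a hypothetical helper.

```python
import re

TOKEN_KEY = re.compile(r"^data-(\d+)\.json$")


def latest_key(keys):
    """Return the key with the highest fencing token, or None if none match."""
    best = None
    for key in keys:
        m = TOKEN_KEY.match(key)
        if m is None:
            continue  # ignore objects that do not follow the naming convention
        token = int(m.group(1))  # compare numerically: 100 > 34, unlike string order
        if best is None or token > best[0]:
            best = (token, key)
    return best[1] if best else None


print(latest_key(["data-33.json", "data-34.json", "data-100.json", "notes.txt"]))
# data-100.json
```

Comparing tokens as integers matters: a plain string sort would rank `data-34.json` above `data-100.json`.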
### Option 3: External serialization layer
Wrap the resource access in a proxy or middleware that enforces fencing token ordering before forwarding requests to the underlying resource.
---
## What Fencing Tokens Do NOT Protect Against
Fencing tokens protect against **inadvertent** zombie behavior — a node that does not know it has been declared dead, acting in good faith.
They do NOT protect against:
- A node that **deliberately** sends a faked high fencing token to bypass the check. This is a Byzantine fault.
- A resource that does not enforce the token check (client-side checking alone is insufficient).
- Race conditions at the storage layer if the fencing token check and the write are not atomic.
---
## Relationship to Leases and Heartbeats
A lease (lock with timeout) combined with fencing tokens is a robust pattern:
- The lease ensures the lock is eventually released if the holder crashes (liveness).
- The fencing token ensures that a zombie holder cannot corrupt the resource after expiry (safety).
Heartbeats alone (leader sends keep-alive to remain leader) do not solve the zombie problem — a paused leader stops sending heartbeats, gets declared dead, then resumes and continues acting as leader.
---
## Further Reading
- Martin Kleppmann, "Designing Data-Intensive Applications," Chapter 8, pp. 302–303 (fencing tokens) and pp. 295–297 (process pauses and the lease anti-pattern).
- Martin Kleppmann, "How to do distributed locking" (blog post) — expands on why Redlock (Redis-based distributed locking without fencing tokens) is unsafe.
FILE:references/system-models.md
# System Models for Distributed Algorithms
Full reference for Step 6 of the distributed-failure-analyzer skill.
---
## Why System Models Matter
A distributed algorithm is correct only within a specific system model. If the actual system violates the model's assumptions, the algorithm's correctness guarantees no longer hold. Making the model explicit allows:
1. Selecting algorithms that match the system's real behavior.
2. Identifying cases where the implementation may fail even if the algorithm is "correct."
3. Reasoning about safety and liveness properties in the presence of faults.
---
## Timing Models
### Synchronous Model
**Assumptions:** Bounded network delay, bounded process pauses, bounded clock error.
Does not mean perfectly synchronized clocks or zero delay — only that known fixed upper bounds exist.
**Reality fit:** Not realistic for most packet-switched datacenter networks or commodity hardware running general-purpose operating systems. Packet delays are unbounded; GC pauses are unbounded; clock drift has no fixed bound without dedicated hardware (GPS, atomic clocks).
**Use case:** Algorithm analysis baseline. Some embedded and real-time systems (RTOS + dedicated hardware) can approximate this model.
### Partially Synchronous Model
**Assumptions:** The system behaves like a synchronous system *most of the time*, but occasionally exceeds the bounds for network delay, process pauses, and clock drift.
**Reality fit:** Realistic for most production server-side systems. Networks are usually well-behaved; processes usually respond quickly. But any timing assumption may be violated occasionally (load spikes, GC events, VM migrations, network congestion).
**Implication:** Algorithms designed for this model must be correct even when timing bounds are temporarily exceeded. Safety properties must hold always; liveness properties are allowed to wait until the system returns to synchronous behavior.
**Recommended default for most datacenter systems.**
### Asynchronous Model
**Assumptions:** No timing assumptions whatsoever. No clock, no timeouts.
**Reality fit:** Very conservative. Some algorithms (e.g., certain consensus protocols) are designed for this model, making them maximally portable. However, the model is extremely restrictive — without timeouts, failure detection is impossible, and many practical algorithms cannot be expressed.
---
## Node Failure Models
### Crash-Stop Faults
**Assumptions:** A node fails by stopping responding. It never comes back. Once a node is declared dead, it stays dead.
**Notes:** Simplest model. It matches reality only if a node that restarts rejoins as a fresh node rather than resuming its old identity; an algorithm designed for crash-stop has no rules for a node that returns with lost or stale state, which is the case the crash-recovery model covers.
### Crash-Recovery Faults
**Assumptions:** Nodes may crash at any time. They may start responding again after an unknown amount of time. On crash-recovery: stable storage (non-volatile disk) is preserved across crashes; in-memory state is lost.
**Notes:** Realistic for server-side systems with durable storage (databases, write-ahead logs). Most production distributed databases assume this model.
**Implication:** Algorithms must handle the case where a recovered node has forgotten its in-memory state. This is why distributed databases write decisions to disk before responding — the disk is the source of truth after recovery.
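A minimal sketch of "write to disk before responding", assuming an append-only JSON-lines file as stable storage. `DurableDecisionLog` and the vote record are hypothetical; real systems use a write-ahead log with checksums and compaction.

```python
import json
import os
import tempfile


class DurableDecisionLog:
    """Append-only log: a decision is acknowledged only after it reaches disk."""

    def __init__(self, path):
        self.path = path
        self.decisions = []
        if os.path.exists(path):
            # Recovery: rebuild in-memory state from stable storage.
            with open(path) as f:
                self.decisions = [json.loads(line) for line in f if line.strip()]

    def decide(self, decision):
        with open(self.path, "a") as f:
            f.write(json.dumps(decision) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the write to disk before replying
        self.decisions.append(decision)
        return "ack"


log_path = os.path.join(tempfile.mkdtemp(), "decisions.log")
node = DurableDecisionLog(log_path)
node.decide({"term": 1, "voted_for": "node-2"})

# Simulate a crash: the object (memory) is lost, the file (disk) survives.
del node
recovered = DurableDecisionLog(log_path)
print(recovered.decisions)
```

If the acknowledgement were sent before `fsync`, a crash in between could leave other nodes relying on a decision the recovered node no longer remembers.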
### Byzantine (Arbitrary) Faults
**Assumptions:** Nodes may do anything — crash, respond slowly, send incorrect or conflicting messages, lie about their state.
**Notes:** Covers adversarial behavior and hardware memory corruption. Requires supermajority quorums (>2/3 of nodes functioning correctly). Protocols are significantly more complex and expensive.
**When relevant:**
- Multi-party untrusted environments (blockchain, inter-bank settlement, peer-to-peer networks).
- High-radiation environments (aerospace, space systems) where memory corruption is a realistic concern.
- NOT relevant for typical datacenter systems where your organization controls all nodes.
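The supermajority requirement in the notes above is usually stated as `n >= 3f + 1` for `f` Byzantine nodes, compared with `n >= 2f + 1` for majority quorums under crash faults. The short calculation below illustrates the cost difference; the function names are chosen for this sketch.

```python
def max_byzantine_faults(n: int) -> int:
    """Largest f such that n >= 3f + 1."""
    return (n - 1) // 3


def max_crash_faults(n: int) -> int:
    """Largest f such that n >= 2f + 1 (majority quorum)."""
    return (n - 1) // 2


for n in (3, 4, 7, 10):
    print(n, max_crash_faults(n), max_byzantine_faults(n))
# 3 1 0
# 4 1 1
# 7 3 2
# 10 4 3
```

A three-node cluster tolerates one crashed node but no Byzantine node, which is part of why Byzantine tolerance is rarely worth its cost inside a single trusted datacenter.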
---
## Safety and Liveness Properties
Every correctness property of a distributed algorithm is classified as either safety or liveness.
### Safety Properties
**Definition:** "Nothing bad happens."
Formally: if a safety property is violated, there exists a specific point in time at which the violation occurred. The violation cannot be undone — the damage is done.
**Examples:**
- Uniqueness (fencing tokens): no two requests for a fencing token return the same value.
- Monotonic sequence: tokens are always increasing.
- Linearizability: reads always see the most recent write.
**Requirement:** Safety properties must hold in *all* situations in the system model — even if all nodes crash or all network delays become infinite. The algorithm must never return a wrong result.
### Liveness Properties
**Definition:** "Something good eventually happens."
Liveness properties often include the word "eventually" in their definition.
**Examples:**
- Availability: a node that requests a fencing token and does not crash eventually receives a response.
- Eventual consistency: if no new writes occur, all replicas eventually converge to the same value.
**Requirement:** Liveness properties may include caveats. For example: "a request receives a response *if* a majority of nodes are functional *and* the network eventually recovers." The partially synchronous model requires that any period of timing violation is finite.
**Note:** Eventual consistency is a liveness property. Linearizability is a safety property. These have fundamentally different guarantees and operational implications.
---
## Recommended System Model for Most Server-Side Data Systems
**Timing model:** Partially synchronous
**Node failure model:** Crash-recovery
This combination is used by most well-known consensus algorithms (Raft, Paxos, ZooKeeper's Zab). PBFT, by contrast, targets the Byzantine fault model.
**Implications for your system:**
1. Timeouts will sometimes fire incorrectly — design for this.
2. A recovered node must re-read its state from stable storage before acting.
3. Safety properties (uniqueness, linearizability) must hold even during timing violations.
4. Liveness properties (eventually getting a response, eventual leader election) are allowed to stall during partitions or excessive delays.
---
## Mapping System Models to Reality
System models are abstractions. Real systems frequently encounter scenarios the model does not cover:
- **Stable storage assumed to survive crashes** — but what if the disk is corrupted by a firmware bug? What if the server fails to recognize its drives on reboot?
- **Crash-recovery assumed** — but what if a node has amnesia and forgets previously stored data (hardware failure wiping the disk)?
- **Non-Byzantine assumed** — but what if a software bug causes a node to send incorrect responses (a bug is effectively a Byzantine fault that a Byzantine-tolerant algorithm cannot save you from, if all nodes run the same buggy code)?
These edge cases do not make system models useless. They mean that:
1. Theoretical algorithm analysis and empirical testing are both required.
2. Real implementations must handle cases the model declares impossible — even if that handling is just an alert + manual intervention.
3. Safety properties in a real system are best-effort, with monitoring as the backstop.
Choose between relational, document, and graph data models for an application by analyzing data shape, relationship complexity, and query patterns. Use when...
---
name: data-model-selector
description: |
Choose between relational, document, and graph data models for an application by analyzing data shape, relationship complexity, and query patterns. Use when asked "should I use MongoDB or PostgreSQL?", "when does a graph database make sense?", "how do I choose between SQL and NoSQL?", or "what data model fits my access patterns?" Also use for: evaluating impedance mismatch between data model and application code; deciding schema-on-read vs. schema-on-write for heterogeneous data; diagnosing whether many-to-many relationships call for relational or graph model; choosing between property graphs and triple-stores; deciding when polyglot persistence is appropriate. Produces a concrete recommendation with trade-off analysis — not "it depends." Covers relational (PostgreSQL, MySQL), document (MongoDB, CouchDB), and graph (Neo4j, Datomic) models including schema enforcement strategies and data locality trade-offs.
For storage engine internals (LSM-tree vs B-tree), use storage-engine-selector instead. For OLTP vs. analytics routing, use oltp-olap-workload-classifier instead.
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/designing-data-intensive-applications/skills/data-model-selector
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on: []
source-books:
- id: designing-data-intensive-applications
title: "Designing Data-Intensive Applications"
authors: ["Martin Kleppmann"]
chapters: [2]
tags: [data-model, relational, document, graph, nosql, schema-on-read, schema-on-write, data-locality, joins, many-to-many, polyglot-persistence, neo4j, mongodb, postgresql, property-graph, triple-store, cypher, sparql, architecture-decision]
execution:
tier: 1
mode: hybrid
inputs:
- type: document
description: "System description, data relationship description, and expected query patterns — as a written brief, existing schema files (.sql, .json), architecture document, or codebase to analyze"
tools-required: [Read, Write]
tools-optional: [Grep]
mcps-required: []
environment: "Any agent environment. Works with pasted descriptions, schema files, docker-compose.yml, or architecture.md documents."
discovery:
goal: "Produce a concrete data model recommendation with trade-off analysis — not a balanced overview of options"
tasks:
- "Classify data shape (tree-structured vs. interconnected graph vs. tabular)"
- "Score each model against relationship complexity, join requirements, schema flexibility needs, and query patterns"
- "Identify schema enforcement strategy (schema-on-read vs. schema-on-write)"
- "Analyze data locality requirements"
- "Produce a recommendation document with trade-off rationale and implementation guidance"
audience:
roles: ["backend-engineer", "software-architect", "data-engineer", "tech-lead", "technical-manager"]
experience: "intermediate-to-advanced — assumes familiarity with relational databases and SQL"
triggers:
- "User is designing a new system and needs to choose a database model"
- "User has an existing system with schema complexity or join explosion symptoms"
- "User is evaluating document databases (MongoDB, CouchDB, RethinkDB) vs. relational"
- "User has graph-like data (social networks, recommendation engines, fraud detection, routing)"
- "User is experiencing object-relational impedance mismatch and wants to know if document model helps"
- "User needs to decide between schema-on-read and schema-on-write for a heterogeneous data source"
- "User wants to validate whether their current data model fits their access patterns"
not_for:
- "Choosing specific databases within a model (e.g., PostgreSQL vs. MySQL, MongoDB vs. CouchDB) — this skill selects the model, not the product"
- "Storage engine selection (LSM-tree vs. B-tree) — use storage-engine-selector"
- "Partitioning strategy for distributed deployments — use partitioning-strategy-advisor"
- "Replication topology decisions — out of scope for this skill"
---
# Data Model Selector
## When to Use
You are designing a new system or evaluating an existing one and need to decide how to structure your data: relational tables with joins, self-contained documents with nested structure, or a graph of vertices and edges.
This skill applies when:
- You are choosing a data model before picking a specific database product
- Your application code has become complex managing joins or object-relational translation
- You have data with heterogeneous structure or schema requirements that vary per record
- You are seeing many-to-many relationships proliferating and are unsure whether relational or graph is the right solution
- You want to document the trade-off rationale for an architecture decision record
**This skill selects the model, not the product.** Once you have a recommendation, proceed to product selection based on operational requirements (replication, consistency guarantees, operational complexity). For storage engine decisions within your chosen model, see `storage-engine-selector`. For partitioning strategy once you have a model and product, see `partitioning-strategy-advisor`.
---
## Context & Input Gathering
Before running the decision framework, collect:
### Required
- **Data shape description:** What entities exist in your system, and how do they relate to each other? (e.g., "users have many orders, orders have many line items, line items reference products")
- **Relationship type:** Are relationships one-to-many (tree structure), many-to-many (network), or highly interconnected (graph)? Do you know the depth of nesting?
- **Primary query patterns:** How will data most commonly be read? By single entity ID? By traversing relationships? By aggregation across many records? By pattern-matching across connections?
- **Schema stability:** Is your data structure uniform across all records, or does it vary (different fields per record type, externally controlled structure, frequently evolving fields)?
### Important
- **Join frequency expectations:** Will your application routinely query data that spans multiple entity types, or mostly retrieve a single entity and all its associated data at once?
- **Write pattern:** Is data written as complete self-contained units, or updated field-by-field with references to other entities?
- **Team/existing stack:** What models does your team already operate? (Operational familiarity is a legitimate factor in model selection.)
### Optional (improves recommendation precision)
- **Existing schema files:** Provide `.sql`, `.json` schema, or ORM model definitions for analysis
- **Architecture or data flow documents:** Help identify implied relationships not stated explicitly
- **Application code with data access patterns:** ORM queries, aggregation pipelines, or graph traversals reveal actual access patterns vs. assumed ones
If the data shape and relationship description are missing, ask for them before proceeding. A model recommendation without relationship complexity analysis is unreliable.
---
## Process
### Step 1: Classify the Data Shape
**Action:** Determine the primary structure of your data. Every data model has a structural assumption — violate it, and complexity moves into your application code.
**WHY:** The three models evolved to solve different structural problems. The hierarchical model (predecessor to document databases) worked for one-to-many tree structures but broke on many-to-many relationships. The relational model solved many-to-many but introduced a translation layer between objects and tables. Graph databases are the right tool when connections between entities are as important as the entities themselves. Choosing a model that matches your data's natural structure minimizes the impedance mismatch between your application and your storage layer — meaning less translation code, simpler queries, and a data model that evolves naturally with your application.
Identify which category best describes your data:
**Category A — Tree structure (one-to-many, self-contained)**
- Each primary entity has subordinate data that "belongs to" it
- Subordinate data is accessed almost always together with the parent (user + all their profile data; order + all its line items)
- Relationships between different top-level entities are rare or non-existent
- Example: a résumé — one user, many positions, many education records, many contact items; rarely do positions reference other positions
**Category B — Interconnected entities (many-to-many, tabular)**
- Multiple entity types reference each other in both directions
- You need to join across entity types routinely to answer queries
- Normalization matters: you want a single source of truth for shared data (e.g., one record for each city, referenced by many users)
- Example: e-commerce — users place orders, orders contain products, products have categories, users write reviews of products
**Category C — Highly connected graph (many-to-many with traversal)**
- Relationships between entities are as important as the entities themselves
- Queries require following chains of relationships of variable or unknown depth
- Different types of entities and relationship types are mixed in a single data store
- Example: social network — people know people who know people; fraud detection — transactions connect accounts, devices, IP addresses; routing — roads connect junctions across variable-length paths
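To make the Category C traversal requirement concrete, the sketch below answers "who is reachable within N hops" on a relational store using a recursive CTE. It assumes SQLite and a hypothetical `knows` edge table; it illustrates the relational workaround and is not a recommendation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE knows (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO knows VALUES (?, ?)",
    [("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("dave", "erin")],
)

REACHABLE = """
WITH RECURSIVE reach(person, depth) AS (
    SELECT ?, 0
    UNION
    SELECT knows.dst, reach.depth + 1
    FROM knows JOIN reach ON knows.src = reach.person
    WHERE reach.depth < ?
)
SELECT person, MIN(depth) AS hops
FROM reach
WHERE depth > 0
GROUP BY person
ORDER BY hops
"""


def within_hops(conn, start, max_hops):
    """People reachable from start by following 1..max_hops 'knows' edges."""
    return conn.execute(REACHABLE, (start, max_hops)).fetchall()


print(within_hops(conn, "alice", 3))
# [('bob', 1), ('carol', 2), ('dave', 3)]
```

The query works, but the traversal logic lives in a recursive construct that is harder to read and extend than a path pattern in a graph query language. If queries like this dominate, the data is likely Category C.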
---
### Step 2: Score Each Model Against Your Data
**Action:** Apply the five-criteria scoring rubric to rate each model's fit. Score each criterion 1–5 per model.
**WHY:** Scoring all three models — even the obvious misfits — forces explicit trade-off reasoning. Practitioners who skip this step often discover later that their "obvious" choice scores poorly on a criterion they hadn't considered (e.g., choosing a document model for "flexibility" without noticing that many-to-many relationships are already present, which will require application-side join emulation). Running the full rubric also produces the rationale needed for an architecture decision record.
**Scoring criteria:**
**1. Data locality fit** — Does the model match how data is naturally retrieved?
- 5 = Data is almost always fetched as a single unit. The model stores it that way.
- 3 = Data is sometimes fetched as a unit, sometimes across entities.
- 1 = Data is routinely fetched by joining many separate entities.
**2. Relationship complexity fit** — Does the model handle the relationship types present?
- 5 = Model is designed for the relationship types present (tree for document, normalized for relational, graph for traversal).
- 3 = Model can emulate the relationship type but with added application-side complexity.
- 1 = Model is a poor fit; relationships must be emulated awkwardly (e.g., document model for many-to-many, relational for variable-depth traversal).
**3. Schema flexibility** — Does the model handle heterogeneous or evolving record structures?
- 5 = Records naturally have different structures; the model imposes no schema.
- 3 = Records are mostly uniform; occasional variation manageable.
- 1 = All records have the same structure; enforcing a schema is an asset.
**4. Query pattern support** — Does the model's native query language match how you will query?
- 5 = Primary queries are natively supported (SQL joins for relational, document retrieval for document, Cypher traversal for graph).
- 3 = Queries are possible but require workarounds (recursive CTEs in SQL for graph traversal, application-side joins in document databases).
- 1 = Primary queries require the model to be emulated in a way that's significantly slower or more complex.
**5. Normalization requirement** — Does the model support the deduplication level your data needs?
- 5 = Model naturally enforces a single source of truth for shared data.
- 3 = Partial normalization possible with workarounds.
- 1 = Model encourages denormalization, creating update anomalies for shared data.
**Score each model:**
```
                          Relational   Document   Graph
Data locality             [1-5]        [1-5]      [1-5]
Relationship complexity   [1-5]        [1-5]      [1-5]
Schema flexibility        [1-5]        [1-5]      [1-5]
Query pattern support     [1-5]        [1-5]      [1-5]
Normalization             [1-5]        [1-5]      [1-5]
Total                     [5-25]       [5-25]     [5-25]
```
See `references/decision-matrix.md` for a pre-filled scoring guide with example systems and their typical scores.
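The rubric can be tallied mechanically, as in this sketch. The scores shown are hypothetical values for an e-commerce system and are not taken from `references/decision-matrix.md`; `total_scores` is an illustrative helper.

```python
CRITERIA = (
    "data_locality",
    "relationship_complexity",
    "schema_flexibility",
    "query_pattern_support",
    "normalization",
)
MODELS = ("relational", "document", "graph")


def total_scores(scores):
    """Validate a 1-5 score per criterion and return the total per model."""
    totals = {}
    for model in MODELS:
        row = scores[model]
        for criterion in CRITERIA:
            value = row[criterion]  # KeyError if a criterion was skipped
            if not 1 <= value <= 5:
                raise ValueError(f"{model}/{criterion}: {value} is outside 1-5")
        totals[model] = sum(row[c] for c in CRITERIA)
    return totals


# Hypothetical scores, for illustration only.
example = {
    "relational": dict(zip(CRITERIA, (3, 5, 2, 5, 5))),
    "document": dict(zip(CRITERIA, (4, 2, 4, 2, 2))),
    "graph": dict(zip(CRITERIA, (2, 4, 4, 3, 4))),
}
print(total_scores(example))
# {'relational': 20, 'document': 14, 'graph': 17}
```

The totals are an input to the decision, not the decision itself: a model with the highest total can still be ruled out by a single low score on relationship complexity.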
---
### Step 3: Apply the Decision Rules
**Action:** Apply explicit if/then decision rules to produce a primary model recommendation. These rules encode the structural logic of the chapter's analysis — they are not heuristics, but direct consequences of how each model handles relationships.
**WHY:** Scoring produces numbers; decision rules produce a recommendation. The rules encode the non-obvious consequences of model choice that practitioners discover only after the fact: document model users discovering that many-to-many relationships require application-side join emulation; relational model users discovering that object-relational impedance mismatch creates thousands of lines of translation code; graph database adopters discovering that simple one-to-many queries are awkward in Cypher. The rules prevent these surprises.
**Rule 1 — Use the document model if ALL of the following are true:**
- Your data has mostly one-to-many relationships (tree structure, Category A)
- You typically fetch all subordinate data together with the parent entity in one query
- Many-to-many relationships are rare or absent today — and features on your roadmap are unlikely to introduce them
- Records in your collection have heterogeneous structure (different fields per type, or externally controlled structure), making schema enforcement more hindrance than help
- Documents can be kept small (locality advantage degrades sharply for large documents loaded partially)
**Rule 2 — Use the relational model if ANY of the following are true:**
- Many-to-many relationships are already present or expected as features evolve
- You need a single source of truth for shared data (normalization to avoid update anomalies)
- Your query patterns require joining across multiple entity types
- Records have uniform structure and you want the database to enforce schema correctness
- You need full-text search, aggregations, or reporting across many records simultaneously
- Your application is evolving rapidly and you want to add query patterns without restructuring data
**Rule 3 — Use the graph model if ALL of the following are true:**
- Anything is potentially related to everything — connection patterns are as important as the entities
- Queries require following chains of relationships of variable or unknown depth (e.g., "find all friends-of-friends," "shortest path between two nodes," "all transactions connected to this account within 3 hops")
- You have multiple entity types and multiple relationship types coexisting in a single data store
- Relationships themselves carry properties (timestamp, weight, type label) that you need to query
**Rule 4 — Consider polyglot persistence if:**
- Different subsystems of your application clearly fall into different categories (e.g., the user profile is Category A, the recommendation engine is Category C)
- Operational complexity of running multiple datastores is acceptable
- The performance or expressiveness benefit of model specialization outweighs the operational cost
**Tie-breaker when rules conflict:** The relationship complexity criterion is the primary tie-breaker. Many-to-many relationships are the historical dividing line between models — they are why the relational model was invented (to solve the hierarchical model's many-to-many problem) and why graph databases exist (to handle cases where relational joins become impractical). If many-to-many relationships are present or expected, the document model is strongly contraindicated.
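The rules and tie-breaker above can be sketched as a small function. This is a minimal illustration, not a complete encoding: the `DataShape` field names are hypothetical, only one of Rule 2's "ANY" conditions is modeled, Rule 4 is left out, and checking Rule 3 before Rule 2 is an assumption (graph workloads also contain many-to-many relationships).

```python
from dataclasses import dataclass


@dataclass
class DataShape:
    """Hypothetical yes/no answers collected in Steps 1-2 (names are illustrative)."""
    mostly_one_to_many: bool
    fetched_whole: bool
    many_to_many_now_or_planned: bool
    heterogeneous_records: bool
    small_documents: bool
    variable_depth_traversal: bool
    densely_connected: bool
    multiple_entity_and_edge_types: bool
    edges_carry_properties: bool


def recommend(shape: DataShape) -> str:
    # Rule 3 (graph): ALL conditions must hold. Checked first by assumption.
    if all([shape.densely_connected, shape.variable_depth_traversal,
            shape.multiple_entity_and_edge_types, shape.edges_carry_properties]):
        return "graph"
    # Tie-breaker / Rule 2 (relational): many-to-many present or expected.
    # Only this one "ANY" condition is modeled in the sketch.
    if shape.many_to_many_now_or_planned:
        return "relational"
    # Rule 1 (document): ALL conditions must hold.
    if all([shape.mostly_one_to_many, shape.fetched_whole,
            shape.heterogeneous_records, shape.small_documents]):
        return "document"
    return "undecided: fall back to the scored matrix"


user_profiles = DataShape(
    mostly_one_to_many=True, fetched_whole=True, many_to_many_now_or_planned=False,
    heterogeneous_records=True, small_documents=True, variable_depth_traversal=False,
    densely_connected=False, multiple_entity_and_edge_types=False,
    edges_carry_properties=False,
)
print(recommend(user_profiles))  # document
```

Flipping `many_to_many_now_or_planned` to `True` for the same profile moves the result to `"relational"`, which is the tie-breaker in action.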
---
### Step 4: Analyze Schema Enforcement Strategy
**Action:** Determine whether schema-on-read or schema-on-write is appropriate, independent of the model choice. Note that most relational databases now support document-like JSON columns, and most document databases are adding join-like capabilities — model convergence means this is increasingly a configuration choice, not just a database product choice.
**WHY:** Schema-on-read (structure is implicit, interpreted by application code at read time) vs. schema-on-write (structure is explicit, enforced by the database at write time) is a distinct decision from model selection, but it profoundly affects how easily your data can evolve. Schema-on-write is like static type checking: it catches structural errors at write time, provides documentation, and makes queries predictable. Schema-on-read is like dynamic type checking: it is more flexible, tolerates heterogeneous records, but pushes structural validation into application code. Neither is superior — the right choice depends on whether your records have heterogeneous structure and how rapidly your schema evolves.
**Prefer schema-on-write (relational enforced schema, JSON Schema validation) when:**
- All records are expected to have the same structure
- Multiple applications or services write to the same database (shared schema is the contract)
- You want the database to catch structural errors before bad data enters
- Schema changes are infrequent and go through a migration process
**Prefer schema-on-read (document databases, JSON columns without schema validation) when:**
- Records have different structures within a collection (many types of events, different product attributes, externally-sourced data)
- Structure is controlled by an external system you don't own and that may change
- You need to roll out schema changes incrementally (new fields written by new code, old fields still read by old code simultaneously)
**Schema-on-read migration pattern:** When changing a field format, write new documents with new fields and handle both old and new format at read time in application code — no migration required.
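A minimal sketch of the read-time handling described above, assuming a hypothetical change where a combined `name` field is being supplemented by a dedicated `first_name` field. The field names and sample documents are illustrative only.

```python
def first_name(user: dict) -> str:
    """Return the first name from either the old or the new document format."""
    if "first_name" in user:
        # New format: documents written by new code carry a dedicated field.
        return user["first_name"]
    # Old format: only the combined "name" field exists, so derive the value.
    return user["name"].split(" ")[0]


old_doc = {"name": "Ada Lovelace"}
new_doc = {"name": "Grace Hopper", "first_name": "Grace"}

print(first_name(old_doc))  # Ada
print(first_name(new_doc))  # Grace
```

The cost of this pattern is that the branch on the old format has to stay in the code for as long as old documents exist.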
**Schema-on-write migration pattern:** Use `ALTER TABLE` to change the structure, then `UPDATE` to backfill existing rows. Schema changes have a reputation for downtime that is mostly undeserved: PostgreSQL executes most `ALTER TABLE` statements in milliseconds, while MySQL has historically been the exception because it copies the entire table. See `references/decision-matrix.md` for tooling options.
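The `ALTER TABLE` plus `UPDATE` sequence can be tried end to end with Python's built-in `sqlite3` module. SQLite is used here only as a convenient stand-in; the table, column names, and the string functions (`instr`, `substr`) are assumptions for this sketch, and PostgreSQL or MySQL would use their own string functions for the backfill.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [("Ada Lovelace",), ("Plato",)])

# Step 1: change the schema. The new column is NULL for existing rows.
conn.execute("ALTER TABLE users ADD COLUMN first_name TEXT")

# Step 2: backfill existing rows. instr() returns 0 when the name has no
# space, in which case the whole name is used as the first name.
conn.execute("""
    UPDATE users
    SET first_name = CASE
        WHEN instr(name, ' ') > 0 THEN substr(name, 1, instr(name, ' ') - 1)
        ELSE name
    END
""")

rows = conn.execute("SELECT name, first_name FROM users ORDER BY id").fetchall()
print(rows)  # [('Ada Lovelace', 'Ada'), ('Plato', 'Plato')]
```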
---
### Step 5: Evaluate Data Locality
**Action:** Determine whether data locality is a meaningful performance factor for your application, and whether it argues for or against the document model.
**WHY:** The document model's storage locality advantage — all related data stored as a single continuous string, fetched in one disk read — is real but conditional. It only matters if you need large parts of the document at the same time. If you access only part of a document (e.g., you only need the user's name, not all their positions and education history), the database still loads the entire document — making locality wasteful. For documents that are large or frequently partially accessed, relational row-level access is more efficient. Furthermore, on writes, modifying a document that grows in size usually requires rewriting the entire document. Keep this in mind: locality is an advantage only when documents are small and accessed whole.
**Locality strongly favors document model when:**
- The entire document is rendered or processed together on most reads (e.g., rendering a profile page, sending a complete order to a fulfillment service)
- Documents are small (kilobytes, not megabytes)
- Write patterns append or replace the whole document rather than updating individual fields
**Locality advantage is diminished when:**
- You routinely access only a small subset of a document's fields
- Documents grow large over time (accumulated history, logs, nested lists)
- High write throughput updates individual fields rather than replacing whole documents
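The partial-access case can be made concrete with a toy model: a profile stored as one serialized JSON string, where reading a single field still means loading and parsing everything. This is an illustration of the reasoning only, not a measurement of any real database; the profile contents are made up, and some stores offer partial reads that this sketch ignores.

```python
import json

# A hypothetical profile that has accumulated a long job history.
profile = {
    "user_id": 251,
    "name": "Ada",
    "positions": [{"title": f"Role {i}", "organization": f"Org {i}"}
                  for i in range(200)],
}
stored = json.dumps(profile)  # the contiguous blob a document store would keep


def read_name(raw: str) -> str:
    # The whole serialized document is parsed just to reach one field.
    return json.loads(raw)["name"]


bytes_loaded = len(stored.encode("utf-8"))
bytes_needed = len(json.dumps(profile["name"]).encode("utf-8"))
waste_ratio = bytes_loaded / bytes_needed

print(read_name(stored), bytes_loaded, bytes_needed)
```

When the ratio of loaded to needed bytes is this lopsided on most reads, locality is working against you rather than for you.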
Note: Locality is not unique to document databases. Google Spanner's interleaved tables and Oracle's multi-table index cluster tables bring locality to relational models. Cassandra and HBase use column families for locality. If locality is the primary driver, evaluate whether your target relational database offers nested table features before switching models entirely.
---
### Step 6: Assess Graph Model Subtypes (if applicable)
**Action:** If the graph model scored highest in Steps 2–3, determine which graph subtype is appropriate: property graph or triple-store.
**WHY:** Both subtypes model the same underlying graph structure (vertices and edges with properties), but they use different storage formats, query languages, and tooling. The property graph model (Neo4j, Titan, InfiniteGraph) stores structured property key-value pairs on vertices and edges; Cypher, the query language created for Neo4j, is its best-known declarative query language. The triple-store model (Datomic, AllegroGraph) stores all information as subject-predicate-object triples and is queried with SPARQL (RDF stores) or Datalog (Datomic). The choice matters because it determines your query language, ecosystem, and how naturally your data maps to the storage format. For most application development use cases, property graphs are the more practical starting point. Triple-stores become more relevant when you need to combine data from multiple external sources (the original motivation for RDF and SPARQL).
**Property graph (Neo4j) is appropriate when:**
- Vertices and edges are well-defined entities with named properties
- You need to query traversals expressively (Cypher reads like ASCII-art of graph patterns)
- Your data is primarily internal to your application
- You need good tooling, visualization, and community support
**Triple-store (RDF/SPARQL) is appropriate when:**
- You need to merge data from multiple external sources with different schemas
- You want to leverage the semantic web ecosystem (linked data, ontologies)
- Your team already uses Datalog-family languages (Datomic)
- You need maximum flexibility in predicate types (RDF treats properties and relationships uniformly as predicates)
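To see the difference in storage shape, the same facts can be written both ways in plain Python. This is a sketch with made-up data and hypothetical key names (`tail`, `head`, `label`); edge properties are left out because representing them as triples needs extra modeling that is beyond this illustration.

```python
# Property graph: vertices and edges each carry a label and a property map.
vertices = {
    "lucy": {"label": "Person", "properties": {"name": "Lucy"}},
    "idaho": {"label": "Location", "properties": {"name": "Idaho", "type": "state"}},
}
edges = [
    {"tail": "lucy", "head": "idaho", "label": "BORN_IN"},
]

# Triple-store: everything is a (subject, predicate, object) statement.
# The object is either a literal (a property) or another vertex (an edge).
triples = [
    ("lucy", "a", "Person"),
    ("lucy", "name", "Lucy"),
    ("idaho", "a", "Location"),
    ("idaho", "name", "Idaho"),
    ("idaho", "type", "state"),
    ("lucy", "BORN_IN", "idaho"),
]


def to_triples(vertices: dict, edges: list) -> list:
    """Flatten the property graph above into triples (edge properties not handled)."""
    out = []
    for vertex_id, vertex in vertices.items():
        out.append((vertex_id, "a", vertex["label"]))
        for key, value in vertex["properties"].items():
            out.append((vertex_id, key, value))
    for edge in edges:
        out.append((edge["tail"], edge["label"], edge["head"]))
    return out
```

Note how the triple form treats `name` and `BORN_IN` identically: both are just predicates, which is the uniformity the last bullet refers to.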
**Graph traversal in relational databases:** SQL supports variable-depth traversal via recursive common table expressions (`WITH RECURSIVE`), but the syntax is verbose: in the source book's example, a query that takes 4 lines in Cypher takes 29 lines in SQL. If graph queries are occasional (not primary), staying in a relational model with `WITH RECURSIVE` is a viable option. If traversal queries are central to your application, a native graph model pays for itself quickly in query expressiveness and performance.
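A small runnable sketch of variable-depth traversal with `WITH RECURSIVE`, using Python's built-in `sqlite3` module. The `follows` table, its columns, and the sample edges are assumptions for illustration; the hop limit is what keeps the recursion from running forever if the data contains cycles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE follows (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO follows VALUES (?, ?)",
                 [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")])


def reachable(start: str, max_hops: int) -> list[tuple[str, int]]:
    """Vertices reachable from `start` within `max_hops`, with their minimum hop count."""
    return conn.execute("""
        WITH RECURSIVE walk(node, hops) AS (
            SELECT ?, 0                          -- start vertex
            UNION
            SELECT f.dst, w.hops + 1             -- follow one more edge
            FROM walk AS w JOIN follows AS f ON f.src = w.node
            WHERE w.hops < ?                     -- depth limit
        )
        SELECT node, MIN(hops) FROM walk
        WHERE hops > 0
        GROUP BY node
        ORDER BY MIN(hops), node
    """, (start, max_hops)).fetchall()


print(reachable("a", 3))  # [('b', 1), ('c', 2), ('d', 3)]
```

For an occasional "within N hops" query this is workable. Once path conditions, multiple edge types, and edge properties enter the query, the SQL grows quickly.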
---
### Step 7: Produce the Recommendation Document
**Action:** Write a structured recommendation that covers the primary model, the rationale, key trade-offs, what was ruled out and why, and implementation guidance.
**WHY:** A recommendation without explicit rationale cannot be reviewed, challenged, or revised when requirements change. Stating what was ruled out — and why — is as important as stating what was chosen: it prevents the recommendation from being relitigated by team members who weren't present for the analysis, and it records the assumptions so that if those assumptions change (e.g., a many-to-many relationship is added), the decision can be revisited.
**Output format:**
```
## Data Model Recommendation
**System:** [System name / brief description]
**Date:** [Date]
---
### Recommended Model: [Relational / Document / Graph / Polyglot]
**Primary rationale:** [1–2 sentences connecting the data shape and relationship
complexity analysis to the model choice]
**Data shape classification:** [Category A (tree) / B (interconnected) / C (graph)]
---
### Scoring Summary
| Criterion               | Relational | Document | Graph  |
|-------------------------|------------|----------|--------|
| Data locality           | [1-5]      | [1-5]    | [1-5]  |
| Relationship complexity | [1-5]      | [1-5]    | [1-5]  |
| Schema flexibility      | [1-5]      | [1-5]    | [1-5]  |
| Query pattern support   | [1-5]      | [1-5]    | [1-5]  |
| Normalization           | [1-5]      | [1-5]    | [1-5]  |
| **Total**               | [5-25]     | [5-25]   | [5-25] |
**Winner:** [Model] with [score]/25
---
### Key Trade-offs of the Recommended Model
**Strengths for this system:**
- [Specific strength tied to your data shape]
- [Specific strength tied to your query patterns]
**Limitations to manage:**
- [Specific limitation and how to mitigate it]
- [Specific limitation and how to mitigate it]
---
### Schema Enforcement Strategy
**Recommendation:** Schema-on-[read/write]
**Rationale:** [Why this fits the data heterogeneity and team workflow]
---
### Ruled Out
**[Model]:** [1–2 sentences on why it scored lower and what specific criterion failed]
**[Model]:** [1–2 sentences on why it scored lower]
---
### Implementation Guidance
**Immediate next steps:**
1. [Concrete first step]
2. [Concrete second step]
**Watch for:**
- [Specific signal that would indicate the model choice needs revisiting]
- [Specific signal]
**Related decisions:**
- Storage engine selection: see storage-engine-selector
- Partitioning strategy: see partitioning-strategy-advisor (if distributing)
```
---
## What Can Go Wrong
These are the most common failure modes when selecting a data model. Review each before finalizing a recommendation.
**Choosing document model for "flexibility," then adding many-to-many relationships later.**
The document model's join support is weak by design. When many-to-many relationships appear (as they almost always do as applications evolve — organizations become entities, recommendations reference users, tags cross-reference content), the join work moves into application code. Application-side joins are slower than database joins (multiple round trips vs. optimized index lookups inside the database) and create significant complexity. If your roadmap includes any features that create cross-entity references, score the relational model at least as a baseline.
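What an application-side join looks like in practice can be sketched with an in-memory stand-in for a document store. The `DocumentStore` class, collection names, and document fields are hypothetical; each `get()` call models one round trip to the database.

```python
class DocumentStore:
    """In-memory stand-in for a document database; each get() models one round trip."""

    def __init__(self, collections: dict) -> None:
        self.collections = collections
        self.round_trips = 0

    def get(self, collection: str, doc_id: str) -> dict:
        self.round_trips += 1
        return self.collections[collection][doc_id]


store = DocumentStore({
    "organizations": {"org1": {"name": "Acme Corp"}, "org2": {"name": "Globex"}},
    "users": {"u1": {"name": "Ada", "positions": [
        {"title": "Engineer", "org_id": "org1"},
        {"title": "Director", "org_id": "org2"},
    ]}},
})


def load_profile(store: DocumentStore, user_id: str) -> dict:
    user = store.get("users", user_id)
    # The "join": every organization reference costs a follow-up lookup.
    positions = [
        {"title": p["title"],
         "organization": store.get("organizations", p["org_id"])["name"]}
        for p in user["positions"]
    ]
    return {"name": user["name"], "positions": positions}
```

Loading this one profile takes three lookups (one user plus two organizations), and the count grows with every reference the document holds.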
**Treating "schemaless" as "no schema."**
Document databases don't eliminate schema — they move it from the database into the application code that reads data. This is sometimes called the implicit schema problem. Every piece of code that reads a document assumes something about its structure. When that structure changes, all reading code must be updated. The difference from schema-on-write is that the database cannot help you find all the places where the assumption was made. Budget for the discipline required to manage implicit schemas.
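One way to make the implicit schema visible is to write it down once and check documents against it in application code. The sketch below is a deliberately small, hand-rolled check; `REQUIRED_FIELDS` and the field names are assumptions for illustration, and a real project might prefer an established validation library.

```python
# The implicit schema, stated explicitly in one place (hypothetical fields).
REQUIRED_FIELDS = {"user_id": int, "name": str, "contact_info": dict}


def validate_profile(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the document matches."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected):
            problems.append(
                f"{field}: expected {expected.__name__}, "
                f"got {type(doc[field]).__name__}"
            )
    return problems


print(validate_profile({"user_id": 1, "name": "Ada", "contact_info": {}}))  # []
```

Running the same check on both the write path and the read path gives at least one place where a structural change shows up as a failure instead of a silent wrong value.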
**Choosing a graph database for simple data.**
Graph databases excel at traversal queries (variable-depth path following). For simple one-to-many reads, Cypher is more verbose than SQL, and a graph database adds operational complexity without benefit. Do not choose a graph model because your data "could" be a graph — choose it because your primary queries require graph traversal.
**Underestimating the impedance mismatch tax.**
If your application code uses objects, and your data store uses tables with rows, every read and write passes through a translation layer (ORM). ORMs reduce boilerplate but cannot eliminate the impedance mismatch. When this translation layer becomes a bottleneck (complex join queries expressed as object graphs, N+1 query problems, schema migrations that require coordinating application and database changes simultaneously), it is a signal that the data shape mismatches the model — not that the ORM is inadequate.
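The N+1 query problem mentioned above can be reproduced without an ORM. The sketch below uses raw SQL through Python's `sqlite3` module and a hypothetical `CountingDb` wrapper to count queries; the schema and data are made up. An ORM that lazily loads related objects tends to generate the first pattern unless told otherwise.

```python
import sqlite3


class CountingDb:
    """Wraps a connection and counts how many queries the application sends."""

    def __init__(self) -> None:
        self.conn = sqlite3.connect(":memory:")
        self.count = 0
        self.conn.executescript("""
            CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
            CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
            INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
            INSERT INTO posts VALUES (1, 1, 'Notes'), (2, 2, 'Compilers'), (3, 1, 'Engines');
        """)

    def query(self, sql: str, params: tuple = ()) -> list:
        self.count += 1
        return self.conn.execute(sql, params).fetchall()


def titles_n_plus_one(db: CountingDb) -> list:
    # One query for the posts, then one more per post to load its author.
    posts = db.query("SELECT title, author_id FROM posts ORDER BY id")
    return [(title, db.query("SELECT name FROM authors WHERE id = ?", (aid,))[0][0])
            for title, aid in posts]


def titles_joined(db: CountingDb) -> list:
    # The same result, resolved inside the database in a single query.
    return db.query("""
        SELECT p.title, a.name FROM posts AS p
        JOIN authors AS a ON a.id = p.author_id
        ORDER BY p.id
    """)
```

With three posts, the first function sends four queries and the second sends one; both return the same rows.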
**Choosing model based on write throughput alone, ignoring read patterns.**
NoSQL databases were partly adopted for write scalability. But the data model decision (relational vs. document vs. graph) is a separate concern from the scalability decision. A relational database can scale writes with appropriate partitioning. A document database with many-to-many relationships will have read complexity problems regardless of how fast it writes. Separate the model-fit question from the scalability question.
**Locking in on a model before the data shape is understood.**
Starting with a document database because JSON is "natural for web applications" without first analyzing relationship complexity is a common path to regret. Run Step 1 (data shape classification) before any other discussion. If the shape is not yet clear because the product is early-stage, prefer relational: it can be refactored to document or graph more easily than the reverse, because it enforces the normalization that makes data portable.
---
## Inputs / Outputs
### Inputs
- System description or data entity list (required)
- Relationship description between entities (required)
- Primary query patterns (required)
- Schema stability assessment (required)
- Existing schema files or architecture documents (optional)
- Application code with data access patterns (optional)
### Outputs
- Data shape classification (Category A / B / C)
- Scored decision matrix (three models, five criteria)
- Primary model recommendation with rationale
- Schema enforcement strategy recommendation
- Graph subtype recommendation (if graph model selected)
- Ruled-out analysis for the other models
- Implementation guidance and watch signals
---
## Key Principles
**Data relationships are the primary decision axis.** The three models evolved to solve different relationship problems: the hierarchical model handled one-to-many, the relational model solved many-to-many, graph databases handle variable-depth traversal where relational joins become impractical. Map your data's relationship type to the model designed for it.
**Many-to-many relationships are the historical dividing line.** When many-to-many relationships appear, the document model loses its advantage — joins move into application code. The entire history of database systems (IMS → relational/network model debate of the 1970s) is the history of solving this problem. Do not repeat it by choosing a model that doesn't handle the relationships you have.
**Schema-on-read is not schemaless — it is implicit schema.** Every piece of code that reads a document assumes a structure. That assumption is your schema; the question is only whether the database enforces it (schema-on-write) or the application enforces it (schema-on-read). Use schema-on-read for heterogeneous records; schema-on-write when records are uniform and correctness must be enforced at write time.
**Document and relational databases are converging.** PostgreSQL 9.3+ supports JSON, MySQL 5.7+ supports JSON, RethinkDB supports joins, MongoDB reinvented SQL as an aggregation pipeline. The practical choice is increasingly about which model fits your data as primary — secondary capabilities are often available in either system.
---
## Examples
### Example 1: User Profile Service (SaaS Application)
**Scenario:** A SaaS application needs to store user profiles. Each profile contains basic info (name, email), job history (multiple positions with title and organization), education history (multiple schools with dates), and contact info (multiple URLs by type). The primary read pattern is: load the complete profile for display on a page. Cross-profile queries are rare — mostly search by email or user ID.
**Trigger:** "Should we use MongoDB or PostgreSQL for our user profiles?"
**Process:**
- Step 1: Category A — tree structure. Each user is the root; positions, education, contact_info are subordinate. Rarely queried across users.
- Step 2: Document scores high on locality (whole profile loaded together), relationship complexity (one-to-many tree is native), schema flexibility (profiles may have different contact types). Relational scores high on normalization but the normalization benefit is minimal here — organizations and schools are just strings, not shared entities referenced by multiple users.
- Step 3: Rule 1 applies — mostly one-to-many, fetch whole profile together, no many-to-many present.
- Watch for: If "organizations" become full entities with their own pages, logos, and employee counts (as LinkedIn did), the model needs to shift. Organization references become many-to-many: each résumé links to an organization entity, and each organization is referenced by many résumés. At that point, the document model requires application-side joins.
**Output:**
```
## Data Model Recommendation
**System:** User Profile Service
### Recommended Model: Document
**Primary rationale:** Profile data forms a self-contained tree — one user with
many subordinate records that are always fetched together. The document model's
locality advantage applies directly: one query returns the complete profile.
**Data shape classification:** Category A (tree, one-to-many)
### Scoring Summary
| Criterion               | Relational | Document | Graph |
|-------------------------|------------|----------|-------|
| Data locality           | 2          | 5        | 1     |
| Relationship complexity | 4          | 5        | 2     |
| Schema flexibility      | 3          | 4        | 4     |
| Query pattern support   | 3          | 5        | 1     |
| Normalization           | 5          | 3        | 3     |
| **Total**               | 17         | 22       | 11    |
**Winner:** Document with 22/25
### Key Trade-offs
**Strengths:**
- One query fetches a complete profile; no joins
- Flexible contact_info types (blog, twitter, phone) without schema changes
**Limitations to manage:**
- If organizations become shared entities, many-to-many appears — reassess model
- Application must manage implicit schema (validate structure at write time in code)
### Schema Enforcement Strategy
**Recommendation:** Schema-on-read
**Rationale:** Contact info types vary per user; new contact types can be added
without migration. Application validates structure at write time.
### Ruled Out
**Relational:** Normalized representation requires multi-table joins on every
profile read; locality benefit is lost. Normalization benefit is minimal since
organizations and schools are not shared entities needing deduplication.
**Graph:** No traversal queries; no many-to-many relationship patterns present.
### Implementation Guidance
1. Store each profile as a single document; keep documents under ~16KB
2. Index on user_id and email for primary lookup patterns
3. Monitor for feature requests that add cross-profile references (organization entities, recommendations from other users) — these signal relational migration
**Watch for:** Any feature where one user's data references another user's data
(recommendations, shared organization entities) — this introduces many-to-many
and breaks the document model's join-free assumption.
```
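The totals in the scoring summary above are plain column sums, and the winner is the highest total. A short sketch that reproduces them, using shortened criterion keys chosen for this example:

```python
# Scores copied from the Example 1 scoring summary.
scores = {
    "Relational": {"locality": 2, "relationships": 4, "schema": 3,
                   "queries": 3, "normalization": 5},
    "Document":   {"locality": 5, "relationships": 5, "schema": 4,
                   "queries": 5, "normalization": 3},
    "Graph":      {"locality": 1, "relationships": 2, "schema": 4,
                   "queries": 1, "normalization": 3},
}

totals = {model: sum(criteria.values()) for model, criteria in scores.items()}
winner = max(totals, key=totals.get)

print(totals)  # {'Relational': 17, 'Document': 22, 'Graph': 11}
print(winner)  # Document
```

The sum treats all five criteria as equally important. If one criterion dominates for your system, the decision rules in Step 3 are the place to express that, not the total.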
---
### Example 2: E-commerce Platform
**Scenario:** An e-commerce platform has: users, orders, order line items, products, product categories, and reviews. Users write reviews of products. Products belong to multiple categories. Orders reference products. The team needs analytics queries (top products by category, revenue by user segment). The team currently uses MongoDB and is experiencing complex application-side join code.
**Trigger:** "Our MongoDB queries are getting complex and we're doing a lot of application-side joining. Should we migrate to PostgreSQL?"
**Process:**
- Step 1: Category B — interconnected entities. Products are referenced by orders, categories, and reviews. Users are referenced by orders and reviews. Queries span multiple entity types.
- Step 2: Many-to-many relationships are present throughout (products ↔ categories, users ↔ products via reviews). Relational model's join support is exactly what's missing. Document model's weakness (application-side joins) is the stated pain point.
- Step 3: Rule 2 applies — many-to-many relationships present, normalization needed (single source of truth for product data referenced by many orders and reviews), query patterns span entity types.
- Schema: Uniform structure per entity type; schema-on-write appropriate.
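The many-to-many relationship between products and categories maps to a join table in the relational model. The following sketch uses Python's `sqlite3` module with a hypothetical, heavily simplified schema (no prices, users, or reviews) to show one analytics query from the scenario: top products by units sold within a category.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Join table: one row per (product, category) pair models many-to-many.
    CREATE TABLE product_categories (
        product_id  INTEGER NOT NULL REFERENCES products(id),
        category_id INTEGER NOT NULL REFERENCES categories(id),
        PRIMARY KEY (product_id, category_id)
    );
    CREATE TABLE order_items (
        order_id   INTEGER NOT NULL,
        product_id INTEGER NOT NULL REFERENCES products(id),
        quantity   INTEGER NOT NULL
    );
    INSERT INTO products   VALUES (1, 'Tent'), (2, 'Stove'), (3, 'Lantern');
    INSERT INTO categories VALUES (1, 'Camping'), (2, 'Kitchen');
    INSERT INTO product_categories VALUES (1, 1), (2, 1), (2, 2), (3, 1);
    INSERT INTO order_items VALUES (100, 1, 2), (100, 2, 1), (101, 2, 4), (102, 3, 1);
""")


def top_products(category: str) -> list[tuple[str, int]]:
    """Products in a category, ordered by total units sold."""
    return conn.execute("""
        SELECT p.name, SUM(oi.quantity) AS units
        FROM categories AS c
        JOIN product_categories AS pc ON pc.category_id = c.id
        JOIN products AS p            ON p.id = pc.product_id
        JOIN order_items AS oi        ON oi.product_id = p.id
        WHERE c.name = ?
        GROUP BY p.id, p.name
        ORDER BY units DESC, p.name
    """, (category,)).fetchall()


print(top_products("Camping"))  # [('Stove', 5), ('Tent', 2), ('Lantern', 1)]
```

The product name lives in exactly one row, so renaming "Stove" is a single `UPDATE` no matter how many orders or categories reference it.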
**Output (abbreviated — full format shown in Example 1):**
```
## Data Model Recommendation: E-commerce Platform
### Recommended Model: Relational (Score: 21/25 vs Document 9/25 vs Graph 13/25)
Primary rationale: Many-to-many relationships are pervasive (products ↔ categories,
users ↔ products via reviews, orders ↔ products). Application-side join complexity
is the document model failing at exactly what it was not designed for.
Schema strategy: Schema-on-write — entity types are uniform; multiple services
write to the same database; schema is the contract.
Ruled out — Document: Application-side joins are the current pain point; migrating
to another document store does not solve it.
Ruled out — Graph: Many-to-many but not variable-depth; SQL joins are sufficient.
Next steps:
1. Migrate one entity type at a time; start with products (most referenced)
2. Use PostgreSQL JSON columns for product attributes that vary by category
Related: See storage-engine-selector for covering index strategy on category +
price range queries.
```
---
### Example 3: Fraud Detection System
**Scenario:** A financial services company needs to detect fraud by finding connections between transactions, accounts, devices, IP addresses, and known fraudsters. Queries include: "find all accounts connected to this device within 3 hops," "identify clusters of accounts that share IP addresses," "shortest path between a flagged account and any account in this transaction."
**Trigger:** "We're trying to build fraud detection. Our PostgreSQL queries for finding connected accounts are very slow and rely on 29-line recursive CTEs. Would a graph database help?"
**Process:**
- Step 1: Category C — highly connected graph. Queries require variable-depth traversal. Connections between entities are as important as the entities. Multiple entity types (accounts, devices, IPs, transactions) coexist.
- Step 2: Graph model is native for traversal queries; relational requires `WITH RECURSIVE` which the team has already found unwieldy. The 29-line recursive CTE is the canonical signal that graph traversal in relational has exceeded its practical limit.
- Step 3: Rule 3 applies — variable-depth traversal is the primary query, multiple entity types and relationship types coexist, relationships carry properties (timestamp, amount).
- Subtype: Property graph (Neo4j) — data is internal, relationships carry properties (amounts, timestamps), Cypher is expressive for pattern-matching fraud queries.
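The "within N hops" query from the scenario is a breadth-first search over the connection graph. The sketch below runs it in plain Python over a made-up set of undirected links to show what the traversal computes; it is not how a graph database executes the query, and the vertex naming scheme (`acct:`, `device:`, `ip:`) is an assumption for illustration.

```python
from collections import deque

# Hypothetical undirected links between accounts, devices, and IP addresses.
edges = [
    ("acct:1", "device:A"), ("device:A", "acct:2"),
    ("acct:2", "ip:10.0.0.7"), ("ip:10.0.0.7", "acct:3"),
    ("acct:4", "device:B"),
]
graph: dict[str, set[str]] = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def within_hops(start: str, max_hops: int) -> dict[str, int]:
    """Breadth-first search: each vertex reachable within max_hops and its distance."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    del dist[start]
    return dist


print(within_hops("acct:1", 3))
```

With a limit of 3 hops, `acct:1` reaches `device:A`, `acct:2`, and `ip:10.0.0.7`; `acct:3` only appears once the limit is raised to 4, and `acct:4` never appears because it shares no device or IP with the others.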
**Output (abbreviated — full format shown in Example 1):**
```
## Data Model Recommendation: Fraud Detection System
### Recommended Model: Graph — Property Graph subtype (Score: 23/25)
Relational: 12/25. Document: 9/25.
Primary rationale: The 29-line recursive CTE is the canonical signal that graph
traversal in relational has exceeded its practical limit. Variable-depth traversal
(account → device → account → IP → account within N hops) is exactly what the
property graph model is designed for.
Subtype rationale: Property graph (Neo4j) — data is application-internal, edge
properties (amount, timestamp) must be queryable, Cypher pattern-matching maps
directly to ring detection and shared-device cluster queries.
Schema strategy: Schema-on-read — fraud patterns require different vertex/edge
property sets; heterogeneous structure is expected.
Ruled out — Relational: WITH RECURSIVE is syntactically clumsy and cannot use
graph-optimized index structures; query expressiveness gap is the deciding factor.
Ruled out — Document: No traversal capability; relationship chain following is
the entire problem.
Next steps:
1. Model as labeled vertices (Account, Device, IPAddress, Transaction) + labeled
edges (USES_DEVICE, SHARES_IP, TRANSFERS_TO) with amount/timestamp properties
2. Start with shared-device cluster detection as the first migrated query
3. Polyglot: keep relational for transaction ledger and compliance reporting
Watch for: If queries simplify to fixed-depth lookups (always 2 hops),
re-evaluate whether SQL joins are sufficient.
```
---
## References
| File | Contents |
|------|----------|
| `references/decision-matrix.md` | Pre-filled scoring guide with 8 common system types (e-commerce, social network, IoT time-series, CMS, fraud detection, recommendation engine, ERP, analytics pipeline) showing typical scores per model and decision rationale; cross-reference table mapping query patterns to model fit |
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Designing Data-Intensive Applications by Martin Kleppmann.
## Related BookForge Skills
This skill is standalone. Browse more BookForge skills: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/decision-matrix.md
# Data Model Decision Matrix
Reference for `data-model-selector`. Use this during Step 2 (scoring) and Step 3 (decision rules) of the skill process.
---
## Scoring Rubric — Full Criteria Definitions
### 1. Data Locality Fit
Measures whether the model stores data the way the application retrieves it.
| Score | Meaning |
|-------|---------|
| 5 | Data is almost always retrieved as a single unit; the model stores it that way (document as one JSON blob, graph vertex with inline properties). One query = one retrieval. |
| 4 | Data is usually retrieved as a unit; occasional secondary lookups for referenced data. |
| 3 | Data is retrieved partially as a unit (some subordinate data inline, some via join/reference). |
| 2 | Multiple queries or joins required for most primary reads. |
| 1 | Every primary read requires joining 3+ entity types or following multiple document references. |
**Notes:**
- Document model locality advantage applies only when documents are small and loaded whole. Large documents (>100KB) or partial-access patterns eliminate the advantage.
- Relational databases with row interleaving (Google Spanner, Oracle cluster tables) can match document locality.
- Graph vertex properties are inline; edge traversal requires index lookups — locality varies by query.
---
### 2. Relationship Complexity Fit
Measures whether the model is designed for the relationship types present in the data.
| Score | Meaning |
|-------|---------|
| 5 | Model is designed exactly for the relationship type present: document for one-to-many trees, relational for many-to-many normalized, graph for variable-depth traversal. |
| 4 | Model handles the relationships with minor workarounds (relational for shallow graph traversal with fixed-depth joins, document with occasional cross-document references). |
| 3 | Model can emulate the relationship type, but with application-side complexity (document model emulating many-to-many with application-side joins). |
| 2 | Key relationship types are awkward in this model; significant complexity moves to application layer. |
| 1 | Model is a poor fit; the primary relationship type cannot be expressed without major workarounds. |
**Decision rule trigger:** If document model scores 2 or below on this criterion, Rule 1 (document model) does not apply. Escalate to Rule 2 (relational) or Rule 3 (graph).
---
### 3. Schema Flexibility
Measures whether the model accommodates heterogeneous or frequently evolving record structure.
| Score | Meaning |
|-------|---------|
| 5 | Records naturally have different structures within the same collection; schema-on-read is the natural fit. |
| 4 | Records are mostly similar but with meaningful variation (optional fields, polymorphic types). |
| 3 | Records are mostly uniform; occasional heterogeneity manageable. |
| 2 | Records are uniform; schema enforcement is desirable but adds overhead. |
| 1 | All records have identical structure; schema enforcement is strictly beneficial. |
**Note:** Schema flexibility scoring is independent of model choice. PostgreSQL with JSON columns can score 4–5 on this criterion while still being a relational model.
---
### 4. Query Pattern Support
Measures how naturally the model's native query language handles primary read patterns.
| Score | Meaning |
|-------|---------|
| 5 | Primary queries are expressed directly in the model's native language with no workarounds. |
| 4 | Primary queries are well-supported; secondary queries require minor workarounds. |
| 3 | Primary queries are possible but verbose or require features not universally supported (e.g., JSON path expressions, recursive CTEs). |
| 2 | Primary queries require combining multiple queries or significant application-side processing. |
| 1 | Primary queries cannot be expressed efficiently; require application-level emulation. |
**Signal:** If relational scores 1–2 on this criterion due to graph traversal requirements, this is the canonical signal for graph model. The 29-line recursive CTE is the concrete indicator.
---
### 5. Normalization Requirement
Measures how important it is that shared data has a single source of truth.
| Score | Meaning |
|-------|---------|
| 5 | Shared data is referenced by many records; update anomalies are a serious correctness risk without normalization. |
| 4 | Normalization is important; some shared data across records. |
| 3 | Some duplication is acceptable; data changes rarely enough that update anomalies are manageable. |
| 2 | Most data is entity-specific; normalization adds overhead without major benefit. |
| 1 | All data is unique to each record; denormalization has no downside. |
**Note:** Normalization scores high when: shared reference data exists (regions, categories, organizations), data is updated (not append-only), multiple records reference the same entity.
---
## Pre-filled Scores: 8 Common System Types
### 1. User Profile Service (LinkedIn-style résumé)
| Criterion | Relational | Document | Graph |
|------------------------|-----------|----------|-------|
| Data locality | 2 | 5 | 1 |
| Relationship complexity| 4 | 5 | 2 |
| Schema flexibility | 3 | 4 | 3 |
| Query pattern support | 3 | 5 | 1 |
| Normalization | 4 | 3 | 3 |
| **Total** | 16 | 22 | 10 |
**Recommendation:** Document
**Key reason:** One-to-many tree (user → positions, education, contact_info); whole profile fetched together
**Watch for:** When organizations become entities (many-to-many) — migrate toward relational or polyglot
---
### 2. E-commerce Platform
| Criterion | Relational | Document | Graph |
|------------------------|-----------|----------|-------|
| Data locality | 3 | 2 | 2 |
| Relationship complexity| 5 | 1 | 3 |
| Schema flexibility | 3 | 4 | 2 |
| Query pattern support | 5 | 2 | 2 |
| Normalization | 5 | 1 | 3 |
| **Total** | 21 | 10 | 12 |
**Recommendation:** Relational
**Key reason:** Products, categories, orders, reviews form many-to-many; normalization prevents product name update anomalies across thousands of orders
**Watch for:** Product attribute heterogeneity (electronics vs. clothing attributes) — use JSON columns in PostgreSQL for this specific case while keeping relational model overall
---
### 3. Social Network / Friend Graph
| Criterion | Relational | Document | Graph |
|------------------------|-----------|----------|-------|
| Data locality | 2 | 2 | 5 |
| Relationship complexity| 2 | 1 | 5 |
| Schema flexibility | 3 | 4 | 5 |
| Query pattern support | 2 | 1 | 5 |
| Normalization | 4 | 2 | 4 |
| **Total** | 13 | 10 | 24 |
**Recommendation:** Graph
**Key reason:** Variable-depth traversal (friends-of-friends, 6 degrees), multiple entity types (people, posts, events, groups) connected by multiple relationship types
**Subtype:** Property graph (Neo4j) — connection properties (when friended, interaction weight) matter; Cypher patterns match social queries naturally
---
### 4. IoT Time-Series (Sensor Data)
| Criterion | Relational | Document | Graph |
|------------------------|-----------|----------|-------|
| Data locality | 3 | 5 | 1 |
| Relationship complexity| 4 | 4 | 1 |
| Schema flexibility | 2 | 4 | 3 |
| Query pattern support | 3 | 4 | 1 |
| Normalization | 4 | 3 | 2 |
| **Total** | 16 | 20 | 8 |
**Recommendation:** Document (or time-series database)
**Key reason:** Each event is a self-contained record (device_id, timestamp, readings); events are never joined to other events; schema varies by sensor type
**Note:** For very high ingest rates, a purpose-built time-series database (InfluxDB, TimescaleDB) may outperform a general document store — but the data model is document-like in either case
---
### 5. Content Management System (CMS)
| Criterion | Relational | Document | Graph |
|------------------------|-----------|----------|-------|
| Data locality | 2 | 5 | 2 |
| Relationship complexity| 4 | 3 | 3 |
| Schema flexibility | 2 | 5 | 4 |
| Query pattern support | 4 | 4 | 2 |
| Normalization | 4 | 3 | 3 |
| **Total** | 16 | 20 | 14 |
**Recommendation:** Document (with relational for structured metadata)
**Key reason:** Content is heterogeneous (articles have different fields from products, events, etc.); each content item is a self-contained unit fetched whole
**Polyglot note:** User accounts, permissions, and structured metadata often fit relational; content bodies and rich media fit document — CMS is a classic polyglot use case
---
### 6. Fraud Detection / Anti-Money Laundering
| Criterion | Relational | Document | Graph |
|------------------------|-----------|----------|-------|
| Data locality | 2 | 1 | 4 |
| Relationship complexity| 2 | 1 | 5 |
| Schema flexibility | 3 | 3 | 5 |
| Query pattern support | 1 | 1 | 5 |
| Normalization | 4 | 2 | 4 |
| **Total** | 12 | 8 | 23 |
**Recommendation:** Graph
**Key reason:** Fraud queries are path-finding and cluster-detection by nature (shared device, IP rings, mule account networks); variable-depth traversal is the primary operation
**Subtype:** Property graph — edge properties (transaction amounts, timestamps) are queried; Cypher pattern matching maps directly to fraud pattern expressions
**Polyglot note:** Keep relational for the transaction ledger (compliance reporting, aggregation); use graph for traversal and pattern matching only
---
### 7. ERP / Enterprise Business Application
| Criterion | Relational | Document | Graph |
|------------------------|-----------|----------|-------|
| Data locality | 3 | 2 | 2 |
| Relationship complexity| 5 | 1 | 3 |
| Schema flexibility | 2 | 3 | 3 |
| Query pattern support | 5 | 2 | 2 |
| Normalization | 5 | 1 | 3 |
| **Total** | 20 | 9 | 13 |
**Recommendation:** Relational
**Key reason:** ERP data is the canonical many-to-many, normalized use case — customers reference orders reference products reference vendors reference accounts. Normalization is not optional; update anomalies in billing or inventory data are business-critical failures
**Note:** Business data processing of this kind is the workload the relational model grew up around, and it remains the model's strongest fit
---
### 8. Recommendation Engine
| Criterion | Relational | Document | Graph |
|------------------------|-----------|----------|-------|
| Data locality | 2 | 3 | 4 |
| Relationship complexity| 2 | 2 | 5 |
| Schema flexibility | 3 | 4 | 5 |
| Query pattern support | 2 | 2 | 5 |
| Normalization | 3 | 2 | 4 |
| **Total** | 12 | 13 | 23 |
**Recommendation:** Graph
**Key reason:** Recommendation queries are "users who bought X also bought Y" or "find items similar to Z via shared attributes" — these are graph traversal queries (user → item → user → item chains)
**Subtype:** Property graph — similarity weights, interaction types (viewed, purchased, rated), and timestamps on edges matter for ranking
**Alternative:** Collaborative filtering at scale often uses matrix factorization (not a graph query), which can run on top of any model. Graph is optimal for neighborhood-based recommendations; matrix factorization can use document or relational storage.
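
The totals in the tables above are plain column sums, and the recommendation is the model with the highest total. A minimal sketch of that arithmetic, using the User Profile Service scores; the names `PROFILE_SCORES`, `total_scores`, and `recommend` are hypothetical helpers, not part of the skill:

```python
# Hypothetical helper names; scores are copied from the
# "User Profile Service" table above, in criterion order:
# data locality, relationship complexity, schema flexibility,
# query pattern support, normalization.
PROFILE_SCORES = {
    "relational": [2, 4, 3, 3, 4],
    "document": [5, 5, 4, 5, 3],
    "graph": [1, 2, 3, 1, 3],
}


def total_scores(scores):
    """Sum the per-criterion scores for each data model."""
    return {model: sum(values) for model, values in scores.items()}


def recommend(scores):
    """Return the model with the highest total (first one wins on a tie)."""
    totals = total_scores(scores)
    return max(totals, key=totals.get)


print(total_scores(PROFILE_SCORES))
print(recommend(PROFILE_SCORES))
```

A close total (for example Document 13 vs. Relational 12 in the Recommendation Engine table) is a signal to read the key reason and alternative notes rather than to trust the sum alone.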
---
## Query Pattern to Model Fit Mapping
| Query pattern | Natural model | Workaround cost in alternatives |
|--------------|--------------|--------------------------------|
| Fetch entire entity with all subordinate data | Document | Relational: multi-table join (medium). Graph: vertex + edge traversal (medium). |
| Join multiple entity types | Relational | Document: application-side join (high). Graph: pattern match (low-medium). |
| Find all X connected to Y within N hops | Graph | Relational: recursive CTE (high). Document: not feasible without graph-like indexing. |
| Aggregate across all records (GROUP BY, SUM) | Relational | Document: aggregation pipeline (medium). Graph: awkward (high). |
| Full-text search within records | Relational (with FTS extensions) or dedicated search index | All models benefit from a dedicated search index (Elasticsearch) regardless of primary model. |
| Retrieve record by ID | All models (similar cost) | No meaningful difference for simple key lookups. |
| Heterogeneous records with different fields | Document | Relational: JSON columns (low). Graph: schema-free (low). |
| Variable-depth path traversal | Graph | Relational: WITH RECURSIVE (high — verbose, slow for deep paths). Document: not supported. |
| Pattern matching across connections | Graph | Relational: self-joins or recursive CTEs (high). Document: not supported. |
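
To make the "recursive CTE (high)" workaround cost concrete, here is a sketch of an N-hop friend query in a relational store, run through Python's built-in `sqlite3` module. The `friendships` table, its columns, and the sample names are assumptions made for illustration only:

```python
import sqlite3


def friends_within(conn, start, max_hops):
    """Return people reachable from `start` in 1..max_hops hops."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(person, hops) AS (
            SELECT friend, 1 FROM friendships WHERE person = ?
            UNION
            SELECT f.friend, r.hops + 1
            FROM reach AS r
            JOIN friendships AS f ON f.person = r.person
            WHERE r.hops < ?
        )
        SELECT DISTINCT person FROM reach WHERE person <> ? ORDER BY person
        """,
        (start, max_hops, start),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friendships (person TEXT, friend TEXT)")
edges = [("ann", "bob"), ("bob", "cat"), ("cat", "dan")]
# Store each friendship in both directions so traversal is symmetric.
conn.executemany(
    "INSERT INTO friendships VALUES (?, ?)",
    edges + [(b, a) for a, b in edges],
)

print(friends_within(conn, "ann", 2))
```

The hop bound keeps the recursion finite even though the friendship data contains cycles. In a graph query language the same question is typically a single variable-length path pattern, which is the gap the table's cost column describes.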
---
## Relationship Type Quick Reference
| Relationship type | Definition | Natural model | Example |
|-------------------|-----------|--------------|---------|
| One-to-one | Each A has exactly one B | Relational or Document | User ↔ UserProfile |
| One-to-many | Each A has many B; B belongs to one A | Document (if B is always fetched with A) or Relational | User → Orders |
| Many-to-one | Many A reference one B (normalization) | Relational | Many Users → one Region |
| Many-to-many | A references many B; B referenced by many A | Relational | Products ↔ Categories |
| Variable-depth traversal | Follow relationship chains of unknown length | Graph | Friends-of-friends, shortest path, cluster detection |
| Heterogeneous graph | Multiple entity types and relationship types in one store | Graph | Social graph (people, posts, events, groups, comments) |
---
## Schema Strategy Decision Tree
```
Is data structure uniform across all records?
├── Yes → Are multiple services writing to the same database?
│ ├── Yes → Schema-on-write (enforced)
│ └── No → Schema-on-write preferred; schema-on-read acceptable
└── No → Why is it heterogeneous?
├── Different object types (event types, product categories) → Schema-on-read
├── Externally controlled structure (third-party API, webhook payload) → Schema-on-read
└── Evolving fields (rolling out new fields incrementally) → Schema-on-read
```
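
The tree above can also be written as a small function, which is convenient when the decision has to be recorded per collection or per table. This is a sketch only; the function name, parameter names, and reason labels are hypothetical:

```python
from typing import Optional

# Hypothetical labels for the three heterogeneity branches in the tree.
HETEROGENEITY_REASONS = {
    "different_object_types",
    "external_structure",
    "evolving_fields",
}


def schema_strategy(
    uniform: bool,
    multiple_writers: bool = False,
    reason: Optional[str] = None,
) -> str:
    """Walk the schema strategy decision tree and return its leaf."""
    if uniform:
        if multiple_writers:
            return "schema-on-write (enforced)"
        return "schema-on-write preferred; schema-on-read acceptable"
    if reason not in HETEROGENEITY_REASONS:
        raise ValueError(f"unknown heterogeneity reason: {reason!r}")
    # Every heterogeneous branch in the tree ends at schema-on-read.
    return "schema-on-read"


print(schema_strategy(uniform=True, multiple_writers=True))
print(schema_strategy(uniform=False, reason="external_structure"))
```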
---
## Convergence: What Modern Databases Support Across Model Lines
| Capability | Relational | Document | Graph |
|-----------|-----------|----------|-------|
| JSON document storage | PostgreSQL 9.3+, MySQL 5.7+, SQL Server | Native | Via vertex properties |
| Joins | Native (SQL) | RethinkDB native; MongoDB limited | Via traversal |
| Schema enforcement | Native (DDL) | Optional (JSON Schema) | Optional (constraints) |
| Graph traversal | WITH RECURSIVE (SQL:1999) | Not supported | Native (Cypher, SPARQL) |
| Full-text search | Native (FTS) or via extension | Via text index | Via external index |
| Aggregation | Native (GROUP BY, window functions) | Aggregation pipeline | Limited |
| ACID transactions | Native | Varies by product | Varies by product |
**Implication:** If your primary model is relational, JSON columns cover most document-flexibility needs. If your primary model is document, check how well your product supports joins before relying on them (native in RethinkDB, limited in MongoDB, per the table above). The models are converging: optimize for data-shape fit first, then check whether your target product covers secondary needs.
---
name: data-integration-architect
description: |
Design the integration architecture for systems with multiple specialized data stores (Postgres, Elasticsearch, Redis, data warehouses) that must stay in sync. Use when deciding how data flows between components, avoiding dual writes, reasoning about correctness across system boundaries (idempotency, end-to-end operation identifiers), choosing between Lambda and Kappa architecture, or applying the "unbundling databases" pattern to compose specialized tools instead of relying on a single monolith. Trigger phrases: "how do I keep Postgres and Elasticsearch in sync?", "should I use CDC or event sourcing to propagate data?", "how do I avoid dual writes across microservices?", "my downstream systems are going out of sync — how do I fix the architecture?", "how do I design derived data pipelines?", "what is the system of record pattern?", "how do I integrate OLTP with a search index and an analytics warehouse?", "how do I design for end-to-end idempotency?". This is the capstone skill for data systems design — it synthesizes batch pipelines, stream integration, consistency, and replication into a single architecture recommendation. Produces a component map (systems of record vs derived views), data flow diagram, and correctness analysis. Does not replace batch-pipeline-designer or stream-processing-designer — delegates to them for pipeline internals.
model: sonnet
context: 1M
execution:
tier: 1
mode: hybrid
inputs:
- type: document
description: "Current architecture description, data flow diagram, or system design document (architecture.md, data-flow.md)"
- type: codebase
description: "docker-compose.yml, schema.sql, config files showing current infrastructure and database choices"
- type: none
description: "Skill can work from a verbal description of the system and its data stores"
tools-required: [Read, Write, TodoWrite]
tools-optional: [Grep, Bash]
environment: "Any agent environment. Codebase access enables concrete analysis of existing infrastructure. Works equally well from a written system description."
depends-on:
- batch-pipeline-designer
- stream-processing-designer
- consistency-model-selector
- replication-strategy-selector
source-books:
- id: designing-data-intensive-applications
title: "Designing Data-Intensive Applications"
authors: ["Martin Kleppmann"]
chapters: [12]
tags:
- data-integration
- derived-data
- system-of-record
- unbundling-databases
- lambda-architecture
- kappa-architecture
- change-data-capture
- event-sourcing
- total-order-broadcast
- idempotency
- end-to-end-correctness
- dataflow
- federated-database
- write-path
- read-path
- dual-writes
- multi-system-consistency
- operation-identifiers
- coordination-avoiding
- timeliness
- integrity
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/designing-data-intensive-applications/skills/data-integration-architect
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
---
# Data Integration Architect
## When to Use
Use this skill when you are designing or evaluating an architecture where **multiple specialized data stores need to work together** and you must reason about how data flows between them, stays consistent, and remains correct in the face of faults.
Concrete preconditions:
- You have (or are planning) more than one data store: for example, a primary relational database plus a search index, an analytics warehouse, a cache, or a machine learning feature store.
- You need to decide the **integration strategy**: should writes go to all stores directly, or should one store be the source of truth with others derived from it?
- You need to choose the **propagation mechanism**: synchronous distributed transactions, change data capture, event sourcing, or batch ETL.
- You need to evaluate a **processing architecture**: Lambda (separate batch and stream layers), Kappa (unified stream layer), or a single integrated system.
- You have a correctness problem: downstream systems are drifting, duplicate events are being applied, or a system boundary is leaking bugs.
**Do not use this skill** if you have a single-database workload with no integration requirements — use `oltp-olap-workload-classifier` and `storage-engine-selector` instead.
---
## Context and Input Gathering
Before proceeding, gather the following. If any are unknown, ask the user.
**Required:**
1. **Current data stores** — list every database, index, cache, or data warehouse in the system (e.g., PostgreSQL, Elasticsearch, Redis, Snowflake, Kafka).
2. **Data sources** — what writes new data into the system, and at what volume/rate?
3. **Data consumers** — who reads from each store, and what access patterns do they require (full-text search, OLAP aggregation, low-latency lookup, ML training)?
4. **Consistency requirements** — which parts of the system require linearizability (user-facing correctness), and which can tolerate eventual consistency (derived views)?
5. **Growth trajectory** — current data volume, expected growth rate, and peak write throughput.
**Optional but valuable:**
- Existing architecture diagram or `architecture.md` / `docker-compose.yml`
- Current integration mechanism (if any: ETL scripts, dual writes, triggers, CDC)
- Known pain points: lag, drift, data loss, duplicate processing, schema migration friction
**If dependencies are installed**, invoke them to fill gaps:
- `consistency-model-selector` → determine per-operation consistency requirements
- `replication-strategy-selector` → determine replication topology for the primary store
- `batch-pipeline-designer` → design the batch processing layer if historical reprocessing is needed
- `stream-processing-designer` → design the stream processing layer if low-latency propagation is needed
---
## Process
### Step 1: Map the Current Dataflow
**ACTION:** Identify every system that stores data and every transformation that moves data between systems.
Draw (or describe) the current state:
```
[Write source] → [Store A] → ??? → [Store B]
→ ??? → [Store C]
```
For each data path, classify the current integration mechanism:
- **Dual writes** (application writes to A and B directly) — flag as anti-pattern
- **Distributed transactions** (2PC across stores) — note heterogeneity constraints
- **Change data capture** (CDC from A's replication log to B)
- **Event sourcing** (events written to a log; A and B derived independently)
- **Batch ETL** (periodic full or incremental extract from A to B)
**WHY:** You cannot improve what you cannot see. Dual writes are the most common source of drift between systems — if the write to Store B fails after the write to Store A succeeds, the stores diverge permanently. Making the dataflow explicit immediately surfaces where this risk exists.
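
The failure mode described above can be reproduced in a few lines. This is a toy simulation, not a real client: `Store`, `StoreUnavailable`, and `dual_write` are hypothetical names, and the in-memory dictionaries stand in for two independent data stores:

```python
class StoreUnavailable(Exception):
    """Raised when a simulated store rejects a write."""


class Store:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.fail_next_write = False

    def write(self, key, value):
        if self.fail_next_write:
            self.fail_next_write = False
            raise StoreUnavailable(self.name)
        self.data[key] = value


def dual_write(store_a, store_b, key, value):
    """The anti-pattern: the application writes to both stores itself."""
    store_a.write(key, value)
    store_b.write(key, value)  # if this raises, A has already been updated


store_a, store_b = Store("A"), Store("B")
dual_write(store_a, store_b, "sku-1", {"stock": 5})

store_b.fail_next_write = True  # simulate a transient outage of Store B
try:
    dual_write(store_a, store_b, "sku-1", {"stock": 4})
except StoreUnavailable:
    pass  # the caller sees an error, but Store A keeps the new value

print(store_a.data["sku-1"], store_b.data["sku-1"])
```

Nothing in this design records that Store B missed the update, so the two stores stay different until some separate process notices and repairs the gap.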
---
### Step 2: Identify the System of Record
**ACTION:** For each logical data entity (user, order, product, event), determine which single store is the **system of record** — the authoritative source of truth that other representations are derived from.
Apply this test: if Store A and Store B disagree about the value of an entity, which one is correct by definition?
- If the answer is "A" → A is the system of record; B is a derived view.
- If the answer is "it depends" or "neither" → you have a multi-master problem that must be resolved by architecture (see Step 4).
Document the result as a table:
| Entity | System of Record | Derived Views |
|--------|-----------------|---------------|
| User profile | PostgreSQL users table | Elasticsearch user index, Redis session cache |
| Order | PostgreSQL orders table | Kafka order-events log, Snowflake orders_fact |
| Search index | (none — fully derived) | Elasticsearch ← PostgreSQL via CDC |
**WHY:** The system of record designation is not just documentation — it determines write authority. Only the system of record should accept writes for its entities. All other stores are read-optimized derived views that are populated by processing the source of truth. This is what prevents split-brain inconsistency between stores.
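
One way to make write authority checkable rather than purely documentary is to keep the table in code and consult it before any write. The sketch below assumes a simple in-process registry; the entity names, store names, and function names are illustrative, not part of the skill:

```python
# Hypothetical registry mirroring the example table above.
SYSTEM_OF_RECORD = {
    "user_profile": "postgresql",
    "order": "postgresql",
}


def authorize_write(entity: str, store: str) -> None:
    """Raise unless `store` is the declared system of record for `entity`."""
    owner = SYSTEM_OF_RECORD.get(entity)
    if owner is None:
        raise KeyError(f"no system of record declared for {entity!r}")
    if store != owner:
        raise PermissionError(
            f"{store!r} is a derived view of {entity!r}; write to {owner!r}"
        )


def is_allowed(entity: str, store: str) -> bool:
    """Convenience wrapper that returns a boolean instead of raising."""
    try:
        authorize_write(entity, store)
    except (KeyError, PermissionError):
        return False
    return True


print(is_allowed("order", "postgresql"))
print(is_allowed("order", "snowflake"))
```

An entity with no declared owner is rejected as well, which surfaces the "it depends" case from the test above instead of letting it pass silently.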
---
### Step 3: Choose the Integration Mechanism
**ACTION:** For each system of record → derived view pair, select the integration mechanism from this decision framework:
**Decision tree:**
```
Is the derived view latency requirement < 1 minute?
├─ YES → Use stream-based propagation (CDC or event sourcing)
│ → invoke stream-processing-designer for details
└─ NO → Can you afford to reprocess all historical data?
├─ YES → Consider batch ETL (simpler, replayable)
│ → invoke batch-pipeline-designer for details
└─ NO → Use incremental batch or hybrid (CDC + periodic batch)
```
**For the propagation mechanism, choose:**
| Mechanism | Use when | Trade-off |
|-----------|----------|-----------|
| **Change data capture (CDC)** | Existing database cannot be changed; need low-latency propagation | Requires access to replication log; schema changes need care |
| **Event sourcing** | New system or greenfield; want full audit log; need replayability | Application must be redesigned around immutable events |
| **Batch ETL** | High latency acceptable; historical reprocessing needed regularly | Simple but creates lag windows; schema evolution is manual |
| **Log-based messaging (Kafka)** | High throughput; multiple consumers; need replay; decoupled teams | Operational complexity; ordering only within partition |
**Key principle — prefer the event log over distributed transactions:**
Distributed transactions (2PC) across heterogeneous stores are fragile: they have poor fault tolerance (coordinator failure leaves participants in-doubt), poor performance (blocking protocol), and require all participating systems to speak the same transaction protocol. An ordered log of events with idempotent consumers preserves integrity just as well (every event takes effect exactly once in every store), with better fault isolation and looser coupling; the price is that derived stores update asynchronously rather than atomically with the original write. A fault in one consumer is contained locally; it does not abort the writes to all other systems.
**WHY:** The integration mechanism determines whether failures in one part of the system cascade to other parts. Synchronous coupling (dual writes, distributed transactions) amplifies failures across system boundaries. Asynchronous event logs contain failures: a slow or failed consumer falls behind, but the producer and other consumers continue unaffected.
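
The containment property can be illustrated with a toy append-only log and two consumers that track their own offsets. This is an in-memory sketch under obvious simplifying assumptions (single partition, no persistence); `EventLog` and `Consumer` are hypothetical classes, not a broker API:

```python
class EventLog:
    """Append-only, ordered log; consumers read from their own offset."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read_from(self, offset):
        return self._events[offset:]


class Consumer:
    def __init__(self, log):
        self.log = log
        self.offset = 0      # position of the next event to apply
        self.view = {}       # the derived view this consumer maintains
        self.healthy = True

    def poll(self):
        """Apply all unread events; a failed consumer applies nothing."""
        if not self.healthy:
            return 0
        applied = 0
        for event in self.log.read_from(self.offset):
            self.view[event["key"]] = event["value"]  # idempotent upsert
            self.offset += 1
            applied += 1
        return applied


log = EventLog()
search, warehouse = Consumer(log), Consumer(log)

warehouse.healthy = False                  # the warehouse consumer is down
log.append({"key": "sku-1", "value": 5})   # the producer is unaffected
log.append({"key": "sku-1", "value": 4})
search.poll()
warehouse.poll()
lag_during_outage = len(log.read_from(warehouse.offset))

warehouse.healthy = True                   # recovery: catch up from the log
warehouse.poll()
print(lag_during_outage, search.view, warehouse.view)
```

During the outage the warehouse view is stale but nothing is lost; after recovery it converges to the same state as the search view because both read the same ordered events.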
---
### Step 4: Decide — Single System vs. Composed Pipeline
**ACTION:** Evaluate whether the requirements can be satisfied by a single integrated system or require a composed pipeline of specialized tools.
Apply this test:
**Use a single system if:**
- One technology satisfies all access patterns with adequate performance (e.g., PostgreSQL with full-text search is sufficient for the scale)
- The team lacks operational capacity to run multiple systems
- The workload fits within the throughput and latency envelope of a single product
**Use a composed pipeline if:**
- No single technology satisfies all access patterns (OLTP + full-text + analytics + ML feature serving)
- Scale requirements exceed what a single system can handle
- Different teams own different parts of the data, requiring independent deployment
**The unbundling principle:** A database internally implements secondary indexes, materialized views, replication logs, and caching. When you compose specialized systems — a primary database plus a search index plus an analytics warehouse plus a feature store — you are "unbundling" those features into separate, independently deployable components. The event log plays the role of the replication log that connects them.
The goal of unbundling is **breadth**: achieving good performance across a wider range of workloads than any single system supports. It is not to maximize the number of moving parts.
**WHY:** Many teams reach for multi-system architectures prematurely and add operational complexity without proportionate benefit. The decision must be grounded in actual access pattern requirements that cannot be satisfied by a simpler alternative.
---
### Step 5: Choose the Processing Architecture
**ACTION:** If a composed pipeline is warranted, select the processing architecture:
**Lambda Architecture:**
- Two parallel layers: a batch layer (reprocesses all historical data; produces accurate views) and a speed layer (processes recent events; produces approximate views)
- Reads merge results from both layers
- **Use when:** Batch processing is significantly simpler and less bug-prone than the stream processor you can operate, AND the latency of batch processing alone is unacceptable
- **Problems:** Maintaining the same logic in two frameworks doubles complexity; merging batch and stream outputs is non-trivial for operations beyond simple aggregations; batch layer often ends up doing incremental processing anyway, negating its simplicity advantage
**Kappa Architecture:**
- Single stream processing layer that handles both historical reprocessing (by replaying the log) and recent events
- **Use when:** A stream processor with replay capability (e.g., Apache Flink reading from Kafka with historical retention, or Apache Beam on Google Cloud Dataflow) can match the correctness guarantees of batch processing
- **Required capabilities:** Log-based broker with configurable retention (replay); exactly-once semantics in the stream processor; event-time windowing (not processing-time windowing, which is meaningless during replay)
- **Preferred** when these capabilities are available — eliminates the dual-system burden of Lambda
**Single unified system:**
- Modern stream processors (Flink, Beam) unify batch and stream by treating batch as a bounded stream
- **Use when** the team can operate one such system well; eliminates the architecture choice entirely
**WHY:** Lambda architecture was an important idea that made event reprocessing central to data system design. But the practical problems — dual codebases, merged outputs, incremental batch complexity — are significant. The Kappa architecture achieves the same benefits (replayability, correctness, low latency) without the dual-system burden, provided the stream processor supports replay and exactly-once semantics.
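
The requirement for event-time windowing under Kappa can be shown with a small experiment: compute the same tumbling-window counts once while events arrive live and once during a bulk replay. The event fields, arrival times, and function name below are made up for illustration:

```python
from collections import Counter


def window_counts(events, arrival_times, use_event_time, window_seconds=60):
    """Count events per tumbling window, keyed by window start time."""
    counts = Counter()
    for event, arrived_at in zip(events, arrival_times):
        ts = event["event_time"] if use_event_time else arrived_at
        counts[ts - ts % window_seconds] += 1
    return dict(counts)


# Timestamps are seconds; event_time is embedded in the event itself.
LOG = [
    {"id": "e1", "event_time": 5},
    {"id": "e2", "event_time": 59},
    {"id": "e3", "event_time": 61},
]
live_arrivals = [6, 60, 62]            # processed shortly after they occurred
replay_arrivals = [1000, 1000, 1001]   # the same log replayed much later

event_time_live = window_counts(LOG, live_arrivals, use_event_time=True)
event_time_replay = window_counts(LOG, replay_arrivals, use_event_time=True)
proc_time_live = window_counts(LOG, live_arrivals, use_event_time=False)
proc_time_replay = window_counts(LOG, replay_arrivals, use_event_time=False)

print(event_time_live, event_time_replay)
print(proc_time_live, proc_time_replay)
```

With event time, the live run and the replay produce identical windows, so a replayed view can replace the live one. With processing time, the replay collapses every event into the window in which the replay happened to run.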
---
### Step 6: Design for Total Ordering
**ACTION:** Identify which data flows require a total order and evaluate whether total order is achievable given your deployment constraints.
**Total ordering is required when:**
- Multiple systems derive state from the same events and must remain consistent with each other (they must process events in the same order)
- A uniqueness constraint must be enforced (who claimed the username first?)
- Causal dependencies exist between events in different partitions or services (the "unfriend before message" example: a message notification system must not deliver a message to a user who was unfriended before the message was sent)
**Constraints on total ordering:**
| Constraint | Effect |
|------------|--------|
| Single leader | Total order is feasible; the leader serializes all writes |
| Multiple partitions | Order within a partition is guaranteed; across partitions is not |
| Multiple datacenters | Total order requires synchronous cross-datacenter coordination (high cost) |
| Independent microservices | Events originating in different services have no defined order |
**For causal ordering without total order, consider:**
- **Logical timestamps** (Lamport clocks, vector clocks): provide causal ordering without a single leader; see `consistency-model-selector`
- **Causal event identifiers**: record the identifier of the event that reflects the state the user saw before making a decision; downstream consumers can reference this identifier to reconstruct the causal dependency
- **Partition routing by entity**: route all events for a given entity (user ID, order ID) to the same partition, ensuring per-entity ordering
**WHY:** Total order broadcast is equivalent to consensus: in practice a single leader (or a Raft/Paxos cluster that elects one) serializes all events. This works well while event throughput fits on a single leader, but it becomes a bottleneck at very high throughput or across geographically distributed systems. Understanding exactly which parts of your data flow require total ordering, and which can tolerate partial ordering, lets you apply the strong guarantee only where it is necessary.
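
Partition routing by entity, the cheapest of the options above, can be sketched as a stable hash of the entity ID. The hash choice (SHA-256) and the function names are assumptions for this example; real brokers ship their own partitioners, and the only property that matters here is that the mapping is deterministic:

```python
import hashlib


def partition_for(entity_id: str, num_partitions: int) -> int:
    """Map an entity ID to a partition using a stable hash."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def route(events, num_partitions):
    """Append each event to the partition chosen by its entity ID."""
    partitions = [[] for _ in range(num_partitions)]
    for event in events:
        index = partition_for(event["entity_id"], num_partitions)
        partitions[index].append(event)
    return partitions


events = [
    {"entity_id": "user-1", "seq": 1},
    {"entity_id": "user-2", "seq": 1},
    {"entity_id": "user-1", "seq": 2},
    {"entity_id": "user-2", "seq": 2},
    {"entity_id": "user-1", "seq": 3},
]
partitions = route(events, num_partitions=4)

for index, partition in enumerate(partitions):
    print(index, partition)
```

Each entity's events land in exactly one partition in their original order, while events for different entities carry no ordering guarantee relative to each other. Python's built-in `hash()` is avoided on purpose because it is randomized per process for strings.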
---
### Step 7: Enforce End-to-End Correctness
**ACTION:** Apply the end-to-end argument to the integration architecture. Verify that correctness guarantees are not assumed from any single layer.
**The end-to-end argument:** TCP suppresses duplicate packets within a connection. Databases provide transactional atomicity. Stream processors provide exactly-once semantics within the processing framework. But none of these individually prevent a user from submitting a duplicate request after a network timeout. Solving the problem requires passing a unique operation identifier all the way from the end-user client to the final data store.
**Correctness checklist — apply to every data flow:**
1. **Idempotency at the consumer:** Can the consumer safely process the same event twice? If not, implement idempotency using a deduplication table keyed by operation ID.
```sql
-- Pattern: unique constraint on request_id suppresses duplicates
ALTER TABLE requests ADD UNIQUE (request_id);
INSERT INTO requests (request_id, ...) VALUES ('uuid-here', ...);
-- Second identical insert fails at the constraint level — safe to retry
```
2. **Operation identifier propagation:** Is a unique operation ID generated by the client (UUID or hash of request fields) and passed through every hop — HTTP request → message broker → stream processor → database write?
3. **Single-message atomicity for multi-partition operations:** When an operation must affect multiple partitions (e.g., debit account A and credit account B), do not use distributed atomic commit. Instead:
- Write the entire request as a single message to a log partition (by request ID)
- A stream processor reads the request and emits individual instructions to each partition's stream (with the request ID included)
- Downstream processors for each partition deduplicate by request ID
- This achieves equivalent correctness to atomic commit without cross-partition coordination
4. **Timeliness vs. integrity separation:** Distinguish timeliness (users observe up-to-date state, for example reading their own writes immediately) from integrity (data must not be lost, duplicated, or corrupted). A timeliness violation is temporary staleness that resolves on its own; an integrity violation is permanent corruption unless it is explicitly detected and repaired.
- **Timeliness failures:** the user sees stale data temporarily → recover by waiting
- **Integrity failures:** data is lost, double-charged, or corrupted → cannot self-heal
Design the architecture so integrity is maintained in all cases, even if timeliness guarantees are weak.
5. **Loose constraint enforcement:** Not every uniqueness constraint requires synchronous linearizable enforcement. Evaluate whether the business cost of a constraint violation is recoverable:
- If two users concurrently claim the same username, one claim can be rejected after the fact and that user asked to choose another name (recoverable)
- If a financial transaction is applied twice to an account — not recoverable
For recoverable constraints, asynchronous detection and compensation (apology workflow) may be sufficient, enabling coordination-avoiding architectures with better availability and performance.
**WHY:** This is the most critical step and the most commonly skipped. Engineers assume that because their database provides transactions and their message broker provides exactly-once delivery, their system is correct. The end-to-end argument shows this is false: each layer handles its own scope, but correctness across the entire request path — from client to database — requires an explicit end-to-end mechanism. The operation identifier is the minimal such mechanism.
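
As a runnable counterpart to the SQL pattern in checklist item 1, the sketch below uses Python's built-in `sqlite3` module. The table layout, the integer amounts, and the function names are assumptions chosen for the example; the essential point is that the request record and the balance change commit in the same transaction:

```python
import sqlite3


def open_ledger():
    """Create an in-memory database with one account holding 100 units."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE requests ("
        "request_id TEXT UNIQUE, account_id INTEGER, amount INTEGER)"
    )
    conn.execute(
        "CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, balance INTEGER)"
    )
    conn.execute("INSERT INTO accounts VALUES (1, 100)")
    conn.commit()
    return conn


def apply_credit(conn, request_id, account_id, amount):
    """Apply a credit once per request_id; return False for a duplicate."""
    try:
        with conn:  # one transaction: commit on success, roll back on error
            conn.execute(
                "INSERT INTO requests VALUES (?, ?, ?)",
                (request_id, account_id, amount),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE account_id = ?",
                (amount, account_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the unique constraint rejected a repeated request_id


def balance(conn, account_id):
    row = conn.execute(
        "SELECT balance FROM accounts WHERE account_id = ?", (account_id,)
    ).fetchone()
    return row[0]


conn = open_ledger()
print(apply_credit(conn, "req-1", 1, 25))  # first delivery
print(apply_credit(conn, "req-1", 1, 25))  # retry after a timeout
print(balance(conn, 1))
```

If the request insert and the balance update were committed separately, a crash between them would leave a recorded request with no effect, which is exactly the integrity failure the checklist is meant to prevent.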
---
### Step 8: Produce the Integration Architecture Document
**ACTION:** Write an `integration-architecture.md` with three sections:
**Section 1: Component Map**
List every data store with its role (system of record vs. derived view), the entity types it owns or serves, and the access patterns it satisfies.
```
[System of Record]
PostgreSQL — users, orders, products
Access: OLTP reads/writes, foreign key enforcement
[Derived Views: Elasticsearch and Snowflake fed by Kafka CDC; Redis filled from PostgreSQL on login]
Elasticsearch — full-text search on products and orders
Snowflake — orders_fact, products_fact for OLAP analytics
Redis — user session cache, hot product cache
[Processing Layer]
Kafka — event log (ordered message delivery, 7-day retention)
Debezium — CDC from PostgreSQL to Kafka
Apache Flink — stream processor (CDC → Elasticsearch, Snowflake)
```
**Section 2: Data Flow Diagram**
For each entity type, show the write path (how new data enters) and the read path (how consumers access the derived view).
```
Write Path:
User action → Application server → PostgreSQL (system of record)
→ Debezium reads WAL → Kafka topic (ordered)
→ Flink consumer → Elasticsearch (search index)
→ Flink consumer → Snowflake (analytics)
Read Path:
Full-text search → Elasticsearch
OLAP query → Snowflake
OLTP query → PostgreSQL
Session lookup → Redis (populated from PostgreSQL on login)
```
**Section 3: Correctness Analysis**
Document the ordering guarantees, idempotency mechanisms, and constraint enforcement strategy for each critical data flow.
---
## Examples
### Example 1: E-commerce Platform — OLTP + Search + Analytics
**Scenario:** Online retailer with PostgreSQL for orders/products, Elasticsearch for product search, Snowflake for business analytics. Current architecture uses dual writes from application code. Products sometimes appear in search before inventory is updated; analytics dashboards occasionally show orders that do not exist in PostgreSQL.
**Trigger:** "Our Elasticsearch and PostgreSQL are drifting. Products show in search that are out of stock. How do we fix the architecture?"
**Process:**
- Step 1 identifies dual writes as the anti-pattern causing drift
- Step 2 designates PostgreSQL as system of record for products, orders, inventory
- Step 3 selects CDC (Debezium) from PostgreSQL WAL → Kafka → Elasticsearch/Snowflake consumers
- Step 6 confirms that per-product ordering (route by product_id partition) is sufficient; total ordering across all products is not required
- Step 7 verifies Flink consumers are idempotent (upsert by product_id handles redelivery)
**Output:** Replace dual writes with Debezium CDC pipeline. Elasticsearch and Snowflake become read-only derived views, populated exclusively from the Kafka event log. Drift is eliminated because both stores process the same ordered event sequence from the same source.
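
The idempotency check from Step 7 in this example can be sketched as an upsert guarded by a log position. The `position` field stands in for whatever monotonically increasing marker the pipeline exposes (an offset or log sequence number); that field, the event shape, and the function name are assumptions made for this illustration:

```python
def apply_change(index, event):
    """Upsert a product document unless this position was already applied."""
    current = index.get(event["product_id"])
    if current is not None and current["position"] >= event["position"]:
        return False  # redelivered or stale event: ignore it
    index[event["product_id"]] = {
        "position": event["position"],
        "doc": event["doc"],
    }
    return True


changes = [
    {"product_id": "p1", "position": 1, "doc": {"stock": 5}},
    {"product_id": "p1", "position": 2, "doc": {"stock": 4}},
]

search_index = {}
# Deliver the whole batch twice, as happens when a consumer restarts
# from an earlier committed offset.
results = [apply_change(search_index, event) for event in changes + changes]

print(results)
print(search_index)
```

A plain upsert already tolerates an in-order replay; the position guard additionally prevents an older redelivered event from briefly overwriting a newer document.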
---
### Example 2: Financial Services — Multi-Partition Transfer with End-to-End Correctness
**Scenario:** Payment processing system where transferring money requires debiting one account (partition A) and crediting another (partition B). Current implementation uses two-phase commit across partitions; this causes availability problems when the coordinator fails.
**Trigger:** "We're using 2PC for cross-account transfers and it's killing our availability. How do we redesign this?"
**Process:**
- Step 3 selects event sourcing: log the transfer request as a single message
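- Step 6 identifies that routing by request_id is sufficient: retries of the same request land in the same partition, where the duplicate can be detected, and no total order across different requests is required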
- Step 7 applies the multi-partition correctness pattern:
1. Client generates UUID request_id; application appends transfer request to Kafka (keyed by request_id)
2. Stream processor reads request; emits debit instruction to partition A's stream (with request_id) and credit instruction to partition B's stream (with request_id)
3. Account processors for A and B each deduplicate by request_id using a unique constraint on a requests table
- If the stream processor crashes and reprocesses the request, it produces identical debit/credit instructions; the unique constraint suppresses the duplicates
**Output:** Remove 2PC. Achieve equivalent correctness (every transfer applied exactly once to both accounts) without cross-partition coordination. Availability improves because no coordinator failure mode exists.
---
### Example 3: Social Platform — Causal Ordering Across Services
**Scenario:** Social network with friendship status stored in service A, notification delivery in service B. Users report receiving notifications from people they have unfriended — the unfriend event and the message-send event are processed in the wrong order.
**Trigger:** "Users are getting messages from people they unfriended. The unfriend event seems to arrive after the message sometimes."
**Process:**
- Step 1 identifies that friendship and messaging are independent event streams; no total order exists between them
- Step 6 identifies a causal dependency: the message-send event causally depends on the friendship status the sender observed
- Fix: When the sender sends a message, log the event identifier of the most recent friendship-status read (the state the sender saw). The notification service checks this event ID; if it has not yet processed past that event in the friendship log, it defers the notification.
- Alternative: Route all friendship and messaging events through a single Kafka topic partitioned by the pair of user IDs, imposing a per-pair total order
**Output:** Causal dependency captured via event identifiers. Notification service becomes causally consistent without requiring total ordering across all users.
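A minimal sketch of the deferral logic described in the fix, under stated assumptions: friendship events carry a monotonically increasing log offset, each message records the offset the sender had observed (`depends_on_offset`), and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: str
    recipient: str
    text: str
    # Offset of the latest friendship-log event the sender had observed.
    depends_on_offset: int


@dataclass
class NotificationService:
    applied_offset: int = -1  # highest friendship-log offset applied so far
    friends: set = field(default_factory=set)
    deferred: list = field(default_factory=list)
    delivered: list = field(default_factory=list)

    def apply_friendship_event(self, offset: int, kind: str, a: str, b: str) -> None:
        pair = frozenset((a, b))
        if kind == "friend":
            self.friends.add(pair)
        elif kind == "unfriend":
            self.friends.discard(pair)
        self.applied_offset = offset
        # Re-check messages that were waiting for this part of the log.
        waiting, self.deferred = self.deferred, []
        for msg in waiting:
            self.on_message(msg)

    def on_message(self, msg: Message) -> None:
        if msg.depends_on_offset > self.applied_offset:
            self.deferred.append(msg)  # friendship state has not caught up yet
            return
        if frozenset((msg.sender, msg.recipient)) in self.friends:
            self.delivered.append(msg)
```

A message that depends on offset 1 is held until the service has applied offset 1; if that event was an unfriend, the message is then dropped instead of delivered.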
---
## References
- [Architecture patterns: federated databases vs. unbundled databases](references/unbundling-patterns.md)
- [Lambda vs. Kappa architecture decision guide](references/lambda-kappa-decision.md)
- [End-to-end correctness and operation identifier implementation](references/end-to-end-correctness.md)
- [Timeliness vs. integrity: distinguishing the two dimensions of consistency](references/timeliness-integrity.md)
**Cross-skill references:**
- `batch-pipeline-designer` — design the batch reprocessing layer (historical data, schema migrations)
- `stream-processing-designer` — design the stream propagation layer (CDC, event sourcing, window types, join types)
- `consistency-model-selector` — determine per-operation consistency requirements; distinguish linearizability from serializability; select consensus mechanisms where needed
- `replication-strategy-selector` — determine the replication topology of the primary system of record
- `transaction-isolation-selector` — evaluate isolation requirements for the system of record's OLTP workload
- `distributed-failure-analyzer` — diagnose correctness failures in the existing integration
**Source:** Designing Data-Intensive Applications, Martin Kleppmann (O'Reilly, 2017), Chapter 12: The Future of Data Systems, pp. 489–544.
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Designing Data-Intensive Applications by Martin Kleppmann.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-batch-pipeline-designer`
- `clawhub install bookforge-stream-processing-designer`
- `clawhub install bookforge-consistency-model-selector`
- `clawhub install bookforge-replication-strategy-selector`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/end-to-end-correctness.md
# End-to-End Correctness and Operation Identifier Implementation
## The End-to-End Argument
The end-to-end argument (Saltzer, Reed, and Clark, 1984) states:
> The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the endpoints of the communication system. Therefore, providing that questioned function as a feature of the communication system itself is not possible.
Applied to data systems: **correctness guarantees provided by individual layers — TCP, database transactions, stream processor exactly-once — cannot, by themselves, guarantee correctness across the entire request path from client to database.**
### The Concrete Problem
A user submits a bank transfer via a web form:
1. Browser sends HTTP POST to application server
2. Application server opens a database transaction
3. Transaction updates two account balances
4. Transaction commits (ACID guarantee: atomically)
5. Database sends success response to application server
6. Application server sends 200 OK to browser
**What if the network fails at step 6?** The transaction committed in step 4, but the browser never received the success response. The browser shows an error. The user clicks "Submit" again. Now:
- TCP duplicate suppression: does not help — this is a new TCP connection
- Database transaction: does not help — this is a new transaction; the database does not know it is a retry
- 2PC: does not help — it handles coordinator failure during the commit protocol, not client-level retries after commit
The transfer executes twice. The user is charged $22 instead of $11.
**Real banks do not work this way.** Real banks use end-to-end idempotency: each transfer request carries a unique identifier that is stored in the database. A retry of the same request is detected and rejected at the application level.
## The Operation Identifier Pattern
### Structure
Every mutating operation that must be idempotent end-to-end needs a unique operation identifier:
1. **Generated at the client** — not at the application server. The client generates the ID before the first attempt and reuses it on every retry.
2. **Stable across retries** — either a random UUID (globally unique, generated once and kept by the client) or a hash of the request's meaningful fields (from_account, to_account, amount, initiating_session). Only the hash variant is deterministic: it allows the ID to be reconstructed if the client loses it, which a random UUID does not.
3. **Passed through every hop** — HTTP header, message broker key, stream processor message field, database column. The ID is never dropped or regenerated at any intermediate layer.
4. **Enforced at the final storage layer** — a unique constraint on the operation ID in the database prevents duplicate execution, even under concurrent retries.
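The two ways of producing an operation identifier can be sketched as below. The function names and the `|`-separated canonical form are illustrative assumptions; amounts are passed as integer cents to avoid formatting ambiguity in the hashed string.

```python
import hashlib
import uuid


def random_operation_id() -> str:
    """Random UUID: generated once by the client and reused on every retry."""
    return str(uuid.uuid4())


def derived_operation_id(from_account: int, to_account: int,
                         amount_cents: int, session_id: str) -> str:
    """Hash of the request's meaningful fields: same inputs, same ID."""
    canonical = f"{from_account}|{to_account}|{amount_cents}|{session_id}"
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first_attempt = derived_operation_id(4321, 1234, 1100, "sess-1")
retry = derived_operation_id(4321, 1234, 1100, "sess-1")
```

One design caveat: the hashed form treats two intentionally identical transfers in the same session as a single operation. If that can happen, include a client-side sequence number among the hashed fields.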
### Implementation
```sql
-- Schema: operation ID as a unique constraint
CREATE TABLE money_transfers (
request_id UUID PRIMARY KEY,
from_account BIGINT NOT NULL,
to_account BIGINT NOT NULL,
amount DECIMAL NOT NULL,
initiated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- The unique constraint on request_id makes concurrent retries safe
-- even at weak isolation levels (read committed, snapshot isolation)
-- because uniqueness constraints are enforced correctly at all isolation levels
-- in most relational databases
```
```sql
-- Execution: idempotent transfer using operation identifier (PostgreSQL syntax)
BEGIN TRANSACTION;
-- Insert the transfer request. On a retry the row already exists, so
-- ON CONFLICT DO NOTHING inserts nothing and RETURNING yields zero rows.
WITH new_transfer AS (
    INSERT INTO money_transfers (request_id, from_account, to_account, amount)
    VALUES ('0286FDB8-D7E1-423F-B40B-792B3608036C', 4321, 1234, 11.00)
    ON CONFLICT (request_id) DO NOTHING
    RETURNING from_account, to_account, amount
)
-- Apply the debit and credit only if the insert produced a row
-- (i.e., this is the first time this request_id has been processed).
-- An EXISTS check against money_transfers is not enough: on a retry the row
-- from the first attempt is already present, so the updates would run again.
UPDATE accounts AS a
SET balance = a.balance
    + CASE WHEN a.account_id = t.to_account THEN t.amount ELSE -t.amount END
FROM new_transfer AS t
WHERE a.account_id IN (t.from_account, t.to_account);
COMMIT;
```
**Note on isolation level:** The unique constraint on `request_id` works correctly even at read committed isolation, because uniqueness constraints are enforced at a lower level than MVCC snapshot visibility. A concurrent insert of the same `request_id` will either serialize (one succeeds, one fails) or abort. The `ON CONFLICT DO NOTHING` clause makes the retry a safe no-op.
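The retry behavior can be exercised locally with a small sketch. It assumes SQLite in place of PostgreSQL, so `INSERT OR IGNORE` stands in for `ON CONFLICT DO NOTHING`; the `accounts` layout, integer cents, and the `transfer` helper are illustrative assumptions.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE accounts (
            account_id    INTEGER PRIMARY KEY,
            balance_cents INTEGER NOT NULL
        );
        CREATE TABLE money_transfers (
            request_id   TEXT PRIMARY KEY,
            from_account INTEGER NOT NULL,
            to_account   INTEGER NOT NULL,
            amount_cents INTEGER NOT NULL
        );
        INSERT INTO accounts VALUES (4321, 10000), (1234, 0);
    """)
    return conn


def transfer(conn, request_id, from_account, to_account, amount_cents) -> bool:
    """Apply the transfer at most once; return True only on first application."""
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO money_transfers VALUES (?, ?, ?, ?)",
            (request_id, from_account, to_account, amount_cents),
        )
        if cur.rowcount == 0:
            return False  # request_id already recorded: this is a retry
        conn.execute(
            "UPDATE accounts SET balance_cents = balance_cents - ? WHERE account_id = ?",
            (amount_cents, from_account),
        )
        conn.execute(
            "UPDATE accounts SET balance_cents = balance_cents + ? WHERE account_id = ?",
            (amount_cents, to_account),
        )
        return True


conn = make_db()
first = transfer(conn, "req-1", 4321, 1234, 1100)
retry = transfer(conn, "req-1", 4321, 1234, 1100)  # same ID: no second debit
```

The balance updates are gated on whether the insert actually added a row, not on whether a row with that `request_id` exists, which is what keeps the retry from charging the account again.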
### The money_transfers Table as an Event Log
The `money_transfers` table in the pattern above functions as an event sourcing log for transfer requests. The account balance updates do not need to happen in the same transaction as the transfer request insertion — they can be applied asynchronously by a downstream consumer that reads the `money_transfers` table as an event log, because:
- The request is durably recorded (the INSERT committed)
- The downstream consumer processes each request exactly once (it deduplicates by `request_id`)
- The downstream consumer is idempotent (applying the same transfer twice is prevented by the deduplication check)
This is the event sourcing pattern: record the intent durably first; derive the effects asynchronously.
## Multi-Partition Correctness Without Distributed Transactions
When an operation spans multiple partitions — for example, debiting account A (in partition 1) and crediting account B (in partition 2) — distributed atomic commit (2PC) is the traditional solution. 2PC achieves atomicity but has high cost: blocking during coordinator failure, heterogeneous protocol requirements, poor fault tolerance.
The alternative achieves equivalent correctness without cross-partition coordination:
### The Pattern
**Step 1:** Client generates a unique `request_id`. Application server appends a single transfer-request message to a Kafka topic partitioned by `request_id`. The message contains `{request_id, from_account, to_account, amount}`. This is a single-object write — atomic in almost all message brokers.
**Step 2:** A stream processor reads the transfer-request log. For each message, it emits two derived messages:
- A debit instruction to the `account-debits` topic, partitioned by `from_account`, containing `{request_id, from_account, amount}`
- A credit instruction to the `account-credits` topic, partitioned by `to_account`, containing `{request_id, to_account, amount}`
If the stream processor crashes and replays from its last checkpoint, it re-emits the same debit and credit instructions. Because the derivation is deterministic and the `request_id` is preserved, the downstream processors can safely deduplicate.
**Step 3:** Separate account processors for debits and credits each consume their respective topics. They maintain a deduplication table keyed by `request_id`. For each message:
- If `request_id` already in the deduplication table → skip (already applied)
- Otherwise → apply the debit/credit and record the `request_id`
**Why this achieves the same correctness as 2PC:**
- Every valid transfer request is applied exactly once to both the payer and payee accounts
- This holds even in the presence of stream processor crashes and message redelivery
- No cross-partition coordination is required at any step
- A failure of the debit processor does not affect the credit processor (they are independent consumers)
**The key insight:** Distributing the operation into two separately partitioned stages, connected by a durably logged request, avoids the need for atomicity across partitions. The `request_id` propagated through all stages is the mechanism that ties the distributed operation together and ensures exactly-once application.
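The three steps can be simulated in memory to show why a replay is harmless. This is a sketch under stated assumptions: plain Python lists and dicts stand in for Kafka topics, an in-memory set stands in for the deduplication table, and the function and class names are hypothetical. The topic names follow the text above.

```python
from collections import defaultdict


def derive_instructions(request: dict) -> list:
    """Deterministically derive one debit and one credit from a transfer request."""
    rid = request["request_id"]
    return [
        ("account-debits", request["from_account"],
         {"request_id": rid, "delta": -request["amount"]}),
        ("account-credits", request["to_account"],
         {"request_id": rid, "delta": request["amount"]}),
    ]


class AccountProcessor:
    """Consumes one instruction topic and deduplicates by request_id."""

    def __init__(self):
        self.balances = defaultdict(int)
        self.seen = set()  # stands in for the deduplication table

    def handle(self, account: int, instruction: dict) -> None:
        if instruction["request_id"] in self.seen:
            return  # already applied: redelivery or replay
        # In a real store these two writes must share one local transaction.
        self.balances[account] += instruction["delta"]
        self.seen.add(instruction["request_id"])


def run_stream_processor(request_log: list, processors: dict) -> None:
    for request in request_log:
        for topic, account, instruction in derive_instructions(request):
            processors[topic].handle(account, instruction)


log = [{"request_id": "r1", "from_account": 4321, "to_account": 1234, "amount": 1100}]
procs = {"account-debits": AccountProcessor(), "account-credits": AccountProcessor()}
run_stream_processor(log, procs)
run_stream_processor(log, procs)  # simulated crash and replay from checkpoint
```

After the second run the balances are unchanged, because both processors recognize `r1` and skip it.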
## Timeliness vs. Integrity
These are two distinct dimensions of correctness that are conflated by the term "consistency":
**Timeliness:** Users observe the system in an up-to-date state. A timeliness violation means a user reads stale data — they see a version of the state that has since been updated. Timeliness violations are **temporary**: waiting and retrying will eventually show the correct state (assuming eventual consistency).
**Integrity:** No data is lost and no contradictory or false data exists. An integrity violation means the data is permanently corrupted — a record is missing, an account balance is wrong, a derived view contains a record that does not exist in the source. Integrity violations are **permanent**: waiting and retrying does not fix database corruption.
| Property | Violation | Recovery | Example |
|----------|-----------|----------|---------|
| Timeliness | Stale read | Wait and retry | User's profile update not yet visible in search index |
| Integrity | Data corruption | Manual repair | Account charged twice; search index contains records deleted from the source |
**Design principle:** In most applications, integrity is far more important than timeliness. Design the integration architecture to preserve integrity unconditionally — even at the cost of weak timeliness guarantees. Stale data is annoying; corrupted data is catastrophic.
The event log + idempotent consumer pattern preserves integrity while accepting weak timeliness: derived views will eventually be consistent with the source of record (when the consumer catches up), and the derivation will not corrupt data even if it runs multiple times.
## Auditing and Self-Verification
The end-to-end argument implies that end-to-end integrity checks are valuable. Because each individual layer assumes the layers above and below it are working correctly, bugs in any layer can produce corruption that only manifests at the application level.
**Useful audit patterns:**
1. **Periodic reconciliation:** Count or sum records in the source of record and each derived view. Discrepancies indicate lost events or failed derivations.
2. **Event log as audit trail:** If the event log is immutable and append-only, it provides a complete, replayable record of all writes. Any derived view can be verified by replaying the log and comparing the output to the current state.
3. **Deterministic reprocessing as verification:** Run the derivation function twice on the same input. If the outputs differ, the function is non-deterministic (a bug). If the output differs from the stored derived view, events were lost or applied out of order.
4. **Cryptographic integrity:** Hash the event log periodically and store the hashes. A mismatch indicates tampering or silent corruption. This is the principle behind Merkle trees used in certificate transparency systems.
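The audit patterns above can be combined into one check, sketched here under simplifying assumptions: the event log fits in memory, events are JSON-serializable dicts with `account` and `delta` fields, and a single running SHA-256 digest is used rather than a Merkle tree. All function names are hypothetical.

```python
import hashlib
import json


def log_digest(events: list) -> str:
    """Running SHA-256 over the events in log order."""
    digest = hashlib.sha256()
    for event in events:
        digest.update(json.dumps(event, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")  # delimiter so event boundaries affect the hash
    return digest.hexdigest()


def derive_balances(events: list) -> dict:
    """Deterministic derivation function: replaying the log rebuilds the view."""
    balances = {}
    for event in events:
        balances[event["account"]] = balances.get(event["account"], 0) + event["delta"]
    return balances


def audit(events: list, stored_view: dict, stored_digest: str) -> list:
    """Return findings; an empty list means no discrepancy was detected."""
    findings = []
    if log_digest(events) != stored_digest:
        findings.append("event log digest mismatch")
    replayed = derive_balances(events)
    if replayed != derive_balances(events):
        findings.append("derivation is non-deterministic")
    if replayed != stored_view:
        findings.append("stored view differs from replayed log")
    return findings


events = [
    {"account": 1, "delta": 5},
    {"account": 1, "delta": -2},
    {"account": 2, "delta": 7},
]
digest = log_digest(events)
```

An empty result from `audit` only shows that these particular checks passed; it does not prove the absence of corruption.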
FILE:references/lambda-kappa-decision.md
# Lambda vs. Kappa Architecture Decision Guide
## The Problem Both Architectures Solve
When batch processing is used to reprocess historical data (producing accurate, complete views) and stream processing is used to handle recent events (producing low-latency, approximate views), you need to decide how to combine them. Lambda and Kappa are the two main architectural patterns for this decision.
The underlying principle they share: **incoming data should be recorded as an immutable append-only log of events. Read-optimized views are derived from this log.** The difference is in how many systems derive those views.
## Lambda Architecture
### Structure
```
Incoming events
      │
      ├──► [Batch Layer] ────────────────────────────► Batch views (accurate)
      │      Hadoop MapReduce / Spark                       │
      │      Reprocesses all historical data                │
      │      Produces corrected, complete views             ▼
      │                                              [Serving Layer] ──► Query response
      └──► [Speed Layer] ────────────────────────────► Real-time views
             Apache Storm / Kafka Streams              (approximate)
             Processes recent events quickly
             Produces approximate, low-latency views
```
### Core Idea
- The batch layer is authoritative and simple: it reprocesses everything, so there is no state to manage, no fault tolerance to implement beyond restarting the job, and no incremental complexity.
- The speed layer is fast but imprecise: it handles the lag between the last batch run and now, accepting that its output may have small errors.
- Queries merge results from both layers at read time.
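For a simple count, the read-time merge might look like the sketch below. It assumes page-view events with a numeric `ts` field and a single cutoff timestamp marking the end of the last batch run; the function names are hypothetical.

```python
from collections import Counter


def batch_view(events: list, batch_cutoff: int) -> Counter:
    """Exact counts over everything the last batch run covered."""
    return Counter(e["page"] for e in events if e["ts"] < batch_cutoff)


def speed_view(events: list, batch_cutoff: int) -> Counter:
    """Counts for events that arrived after the last batch run."""
    return Counter(e["page"] for e in events if e["ts"] >= batch_cutoff)


def query(page: str, batch: Counter, speed: Counter) -> int:
    """Serving layer: merge both views at read time."""
    return batch[page] + speed[page]


events = [
    {"page": "/a", "ts": 1},
    {"page": "/a", "ts": 5},
    {"page": "/b", "ts": 6},
]
b = batch_view(events, batch_cutoff=5)
s = speed_view(events, batch_cutoff=5)
total_a = query("/a", b, s)
```

Addition works here only because counts are trivially mergeable. Sessions, joins, and other stateful results need merge logic that mirrors the derivation itself.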
### When Lambda Makes Sense
- The batch processing logic is significantly simpler and less bug-prone than the equivalent stream processing logic.
- The team's stream processing infrastructure is less mature or reliable than their batch infrastructure.
- The workload uses approximate algorithms in the speed layer (e.g., HyperLogLog cardinality estimates, sketches) that are fast but slightly inaccurate, while the batch layer computes exact results.
### Practical Problems with Lambda
1. **Dual codebase burden.** The same business logic must be implemented in two different frameworks — the batch framework (Spark, MapReduce) and the streaming framework (Storm, Flink, Kafka Streams). These frameworks have very different APIs, semantics, and operational models. Any change to the logic requires updating both implementations.
Libraries like Apache Beam or Summingbird attempt to abstract this by providing a single API that compiles to both batch and streaming backends, but debugging, tuning, and operational behavior still differ between the two runtimes.
2. **Merging batch and stream outputs is non-trivial.** Merging a simple aggregation (sum, count) over a tumbling window is straightforward. Merging session-based aggregations, joins, or stateful computations is much harder. The serving layer must implement query-time merge logic that mirrors the derivation logic — a third implementation of the same business logic.
3. **Batch layer ends up doing incremental processing.** Reprocessing all historical data from scratch on every batch run is expensive at scale. The batch layer is usually configured to process only recent data (e.g., the last hour), which means it faces the same windowing and late-data problems as the stream layer — negating its simplicity advantage.
4. **Two systems to operate.** Maintaining two independent distributed systems (a batch cluster and a streaming cluster) doubles infrastructure and operational complexity.
## Kappa Architecture
### Structure
```
Incoming events (immutable log, long retention)
      │
      └──► [Stream Processing Layer] ──────────────────────────────────► Views
             Apache Flink / Apache Beam / Kafka Streams
             Processes both historical (replay) and recent events
             Produces accurate views via exactly-once semantics
```
### Core Idea
Unify batch and stream processing in a single system by treating batch jobs as a special case of stream processing: a job over a bounded (finite) stream. The stream processor reads from the beginning of the log for historical reprocessing, and from the current offset for ongoing processing. No separate batch layer is needed.
### Required Capabilities for Kappa
For Kappa to work correctly, three capabilities are needed:
1. **Log-based broker with configurable retention and replay.** The message broker must retain events long enough to replay all history needed for reprocessing. Kafka's configurable log retention (time-based or size-based) satisfies this. The stream processor reads from offset 0 to reprocess history, then continues from the current offset.
2. **Exactly-once semantics in the stream processor.** The output of a reprocessing run must be the same as if no faults had occurred. This requires one of the following: (a) exactly-once message delivery within the processor, (b) idempotent output writes, or (c) transactional output commits. Apache Flink's checkpointing with transactional sinks achieves this.
3. **Event-time windowing.** When reprocessing historical data, processing-time timestamps are meaningless (the job runs now, but the events are from the past). The stream processor must window on the event's original timestamp, not on when the processor handles it. Apache Beam and Flink both support event-time windowing with watermarks for handling late-arriving events.
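The third capability is the easiest to illustrate in isolation. The sketch below assigns events to tumbling windows by their own `event_time` (an assumed integer field in seconds), so a replay produces the same windows as live processing regardless of arrival order. It deliberately omits watermarks and late-event handling, which a real stream processor would need.

```python
from collections import defaultdict


def tumbling_window_counts(events: list, window_seconds: int) -> dict:
    """Count events per tumbling window, keyed by the event's own timestamp."""
    counts = defaultdict(int)
    for event in events:
        window_start = (event["event_time"] // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)


# Arrival order during live processing (out of order by event time).
arrived = [{"event_time": 5}, {"event_time": 65}, {"event_time": 59}, {"event_time": 61}]
live = tumbling_window_counts(arrived, 60)

# A replay from the log may see the same events in a different order.
replay = tumbling_window_counts(sorted(arrived, key=lambda e: e["event_time"]), 60)
```

Had the windows been keyed by processing time, the replay would place every event in the window of the moment the reprocessing job ran.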
### When Kappa Wins
- A mature stream processor with the three required capabilities (replay, exactly-once, event-time) is available and the team can operate it.
- Business logic is complex enough that maintaining two implementations is a significant burden.
- Schema evolution or business logic changes are frequent (each change requires reprocessing the log; with Kappa, this is one reprocessing job, not a coordinated update to two systems).
### When Lambda May Still Be Preferred
- The stream processor available to the team lacks replay capability or exactly-once semantics.
- The batch computation is a genuinely simpler algorithm than the equivalent streaming computation (e.g., sorting a bounded dataset vs. approximate sorting in a stream).
- The team has significantly more operational expertise in batch systems than streaming systems.
## The Convergence Trend
Modern processing engines are converging toward the Kappa model:
- **Apache Spark:** Originally batch-only (MapReduce replacement). Added Structured Streaming as a micro-batch stream processor. Spark can read Kafka topics as a stream or as a bounded dataset for batch — same API.
- **Apache Flink:** Built around a streaming runtime, with a separate DataSet API for batch processing. Flink 1.12+ unified both under a single streaming model: batch is a bounded stream.
- **Apache Beam:** Provides a single portable API for both batch and streaming. Backends include Flink, Dataflow, and Spark.
The practical consequence: if the team uses Flink or Beam, the Lambda vs. Kappa choice largely dissolves — the same code runs in both modes.
## Decision Matrix
| Criterion | Lambda | Kappa |
|-----------|--------|-------|
| Separate batch and stream codebases | Required | Not needed |
| Exactly-once stream semantics required | No (batch corrects errors) | Yes |
| Log replay capability required | No | Yes (long retention) |
| Query-time merge of batch + stream outputs | Required | Not needed |
| Appropriate when stream processor is immature | Yes | No |
| Appropriate when team has high streaming maturity | Less relevant | Yes |
| Schema evolution / logic change cost | High (update 2 codebases + reprocess) | Lower (one reprocessing job) |
| Operational complexity | High (2 clusters) | Medium (1 cluster) |
## Recommendation
**Default to Kappa** if the team can operate a stream processor with replay, exactly-once, and event-time capabilities (Flink, Beam on Dataflow, or Kafka Streams with Kafka log retention).
**Use Lambda** only if the stream processing capability is genuinely immature relative to the batch capability, or if the batch algorithms being used are fundamentally simpler than their streaming equivalents (which is increasingly rare).
**Use a unified batch/stream system** (Flink, Spark with Structured Streaming, Beam) to make the choice irrelevant.
FILE:references/timeliness-integrity.md
# Timeliness vs. Integrity: Two Dimensions of Consistency
## The Core Distinction
The term "consistency" in distributed systems conflates two properties that are worth separating:
**Timeliness** — ensuring that users observe the system in an up-to-date state. If data has been written, any subsequent read should return the new value, not a stale version.
**Integrity** — ensuring the absence of corruption; no data loss, no contradictory or false data. If a derived dataset is maintained as a view onto underlying data, the derivation must be correct and complete.
### The Slogan
- Violations of timeliness are **"eventual consistency"** — the system will catch up.
- Violations of integrity are **"perpetual inconsistency"** — the corruption is permanent.
## Why the Distinction Matters for Data Integration
### Traditional Transactions Conflate Both
ACID transactions usually provide both timeliness (for example, linearizability: a committed write is immediately visible to all subsequent reads) and integrity (for example, atomicity and durability: a write either commits completely or rolls back completely, with no partial effects).
Because transactions provide both together, architects using transactional systems rarely need to distinguish them. This creates a false assumption: that any system providing "consistency" must provide both simultaneously.
### Asynchronous Log-Based Systems Decouple Them
Event-log-based derived data systems (CDC, event sourcing, stream processing) are designed to preserve integrity while accepting weak timeliness.
- **Integrity is central:** Exactly-once semantics, idempotent consumers, and operation identifiers ensure that every event is applied exactly once to every derived view. No data is lost; no event is applied twice.
- **Timeliness is weak:** Derived views lag behind the source of record by the propagation latency (milliseconds to minutes, depending on the system). A read from the derived view may return stale data.
This is not a bug — it is a deliberate design trade-off that enables better performance, availability, and fault isolation than synchronous distributed transactions.
## Practical Implications
### What Integrity Requires in a Composed Pipeline
1. **Exactly-once effect on each consumer.** An event that is lost causes the derived view to permanently miss that update. An event that is applied twice may corrupt the derived view (unless the consumer is idempotent).
2. **Correct ordering where ordering matters.** If two events must be applied in a specific order (a delete must not precede the record's creation), the consumer must receive them in that order or detect and handle out-of-order delivery.
3. **Idempotent consumers or deduplication.** Because message delivery may provide at-least-once semantics (duplicates possible), consumers must be idempotent (applying the same event twice produces the same result) or must deduplicate by event ID.
4. **Durable event log.** If the event log is lost, derived views cannot be reconstructed. The log must be replicated and retained for at least as long as any consumer might need to replay it.
### What Timeliness Requires
1. **Low propagation latency.** The gap between a write to the source of record and its visibility in derived views must be small enough for the use case (sub-second for real-time dashboards; minutes may be acceptable for analytics).
2. **Read-your-own-writes consistency** (if required). A user who updates their profile and immediately views the result should see their new profile, not the pre-update version. This requires either routing the user's reads to the system of record (not the derived view) for a period after the write, or using causal consistency mechanisms.
3. **Causal consistency** (if required). If Event B causally depends on Event A (B happened because of A), any user who observes B should also observe A. See the causal ordering section of the main skill.
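The routing option for read-your-own-writes can be sketched as follows. The class name, the store labels, and the fixed pinning period are assumptions; in practice the period would be chosen to exceed the observed propagation lag.

```python
import time


class ReadRouter:
    """Route a user's reads to the system of record shortly after their own write."""

    def __init__(self, pin_seconds: float, clock=time.monotonic):
        self.pin_seconds = pin_seconds
        self.clock = clock  # injectable so the behavior can be tested
        self.last_write = {}  # user_id -> time of that user's latest write

    def record_write(self, user_id: str) -> None:
        self.last_write[user_id] = self.clock()

    def choose_store(self, user_id: str) -> str:
        written_at = self.last_write.get(user_id)
        if written_at is not None and self.clock() - written_at < self.pin_seconds:
            return "system_of_record"
        return "derived_view"


now = [0.0]  # fake clock for the demonstration
router = ReadRouter(pin_seconds=5.0, clock=lambda: now[0])
router.record_write("u1")
now[0] = 2.0
store_soon_after_write = router.choose_store("u1")
now[0] = 6.0
store_later = router.choose_store("u1")
```

Only the writing user is pinned; other users keep reading the derived view and may briefly see the older value, which is a timeliness lag and not an integrity problem.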
### The Trade-Off Is Asymmetric
In most applications, integrity violations are far more costly than timeliness violations:
| Scenario | Timeliness violation | Integrity violation |
|----------|---------------------|---------------------|
| E-commerce order | Product shows as available for 30s after selling out | Customer is charged twice for the same order |
| Bank account | Balance shows the pre-deposit value for 500ms | $11 is debited but only $5.50 is credited |
| Social profile | Updated bio visible to some users 2s before others | Profile update is permanently lost |
| Search index | New document not searchable for 10s | Deleted document remains in search results forever |
Timeliness violations are annoying and recoverable. Integrity violations are catastrophic and may be unrecoverable without manual intervention.
## Coordination-Avoiding Architectures
An important corollary: if integrity is more important than timeliness, and integrity can be maintained without synchronous cross-system coordination, then coordination-avoiding architectures are often the better choice for such workloads.
**Coordination-avoiding systems** maintain integrity guarantees on derived data without requiring:
- Atomic commit across multiple stores
- Linearizability (synchronous coordination for recency guarantee)
- Synchronous cross-partition or cross-region coordination
They achieve this through:
- Deterministic derivation functions (same input → same output, always)
- Exactly-once processing (idempotent consumers, operation identifiers)
- Immutable event logs (replay reconstructs any derived view)
A coordination-avoiding system can operate across multiple datacenters in a multi-leader configuration, with weak timeliness guarantees, while still maintaining strong integrity guarantees. This is the sweet spot for most large-scale data integration scenarios.
## Loose Constraint Enforcement
Not every constraint must be enforced synchronously. Many business constraints can be violated temporarily and repaired after the fact ("apologize later").
### Hard Constraints (must be synchronous)
- A payment card must never be charged twice
- Data that a user may want back must never be permanently deleted
- Any constraint guarding an operation that cannot be reversed or compensated
### Soft Constraints (can be eventual or compensating)
- Two users claiming the same username simultaneously — apologize to the loser; ask them to choose another
- Overselling a product — apologize to the excess purchasers; offer a refund or backorder
- Overbooking an airline seat — apologize; offer compensation; bump a passenger
The cost of the apology (money, reputation, customer service effort) determines whether the constraint must be enforced synchronously. If the apology cost is low and the apology workflow must exist anyway (for other reasons like fraud, cancellations, etc.), strict synchronous enforcement may be unnecessary.
**The key question:** "What is the cost of temporarily violating this constraint, and can we detect and repair the violation before it causes irreversible harm?"
If the answer is "low cost, yes we can detect and repair" → soft constraint, coordination-avoiding enforcement is sufficient.
If the answer is "high cost or irreversible" → hard constraint, synchronous enforcement required.
## Applying This in Practice
When designing a data integration architecture, apply this framework to each constraint:
1. Identify the constraint (uniqueness, referential integrity, balance non-negativity, etc.)
2. Classify: hard (synchronous enforcement required) vs. soft (can be eventually enforced with compensation)
3. For hard constraints: design the partition routing to ensure conflicts go to a single partition, enabling a stream processor to enforce the constraint without distributed coordination
4. For soft constraints: design the async detection and compensation workflow; keep the write path fast and coordination-free
5. Ensure integrity (no data loss, no double-application) in all cases, regardless of timeliness classification
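Step 3 can be illustrated with username uniqueness. The sketch assumes a stable hash routes every claim for the same username to the same partition, where a single-threaded processor decides claims in log order; `partition_for` and `UsernamePartition` are hypothetical names.

```python
import hashlib


def partition_for(username: str, num_partitions: int) -> int:
    """Stable hash so every claim for a username lands on the same partition."""
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class UsernamePartition:
    """Single-threaded processor for one partition: decides claims in log order."""

    def __init__(self):
        self.owner = {}  # username -> user_id that won the claim

    def claim(self, username: str, user_id: str) -> bool:
        if username in self.owner:
            # A retry by the winner still succeeds; anyone else is rejected.
            return self.owner[username] == user_id
        self.owner[username] = user_id
        return True


partitions = [UsernamePartition() for _ in range(4)]
p = partition_for("alice", 4)
first_claim = partitions[p].claim("alice", "u1")
competing_claim = partitions[p].claim("alice", "u2")
```

Because conflicting claims always meet in one partition, the constraint is enforced without any coordination between partitions.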
FILE:references/unbundling-patterns.md
# Unbundling Patterns: Federated Databases vs. Composed Pipelines
## The Core Analogy
A relational database internally maintains several subsystems:
- **Secondary indexes** — allow efficient lookup by non-primary-key fields
- **Materialized views** — precomputed query results stored for fast read access
- **Replication logs** — propagate changes to follower replicas
- **Full-text indexes** — built into some databases (PostgreSQL tsvector, MySQL FULLTEXT)
- **Query optimizer** — selects efficient execution plans across indexes and joins
When you compose specialized systems — a primary OLTP database, a search index, an analytics warehouse, a feature store — you are implementing these same subsystems outside the database, as independent services connected by an event log. The event log plays the role of the replication log.
This is the "unbundling databases" or "database inside-out" approach: taking features that are built into databases as tightly coupled subsystems and implementing them as loosely coupled independent services.
## Two Approaches to Composition
### Federated Databases (Unified Reads)
A federated database (or polystore) provides a unified query interface over multiple underlying storage engines. Users query a single endpoint; the federated layer routes subqueries to the appropriate backend and merges results.
**Examples:** PostgreSQL Foreign Data Wrappers (FDW), BigQuery federated queries, Presto/Trino.
**Strengths:**
- Single query language for users
- No data movement required for read-only analytical queries
- Applications that need a specialized data model still have direct access to their native storage
**Weaknesses:**
- Query optimization across heterogeneous engines is complex
- No good answer to synchronizing writes across backends — reads are unified, writes are not
- Performance depends on the least-capable backend for cross-store queries
**When to use:** When you need unified read access to existing independent stores and can tolerate the write-synchronization problem being handled separately.
### Composed Pipelines (Unified Writes)
The composed pipeline approach focuses on write synchronization: one system is the source of truth; all others are derived. The event log synchronizes writes deterministically.
**The analogy to Unix pipes:**
```
# Unix: small tools, uniform interface (byte streams), composable
cat access.log | grep "ERROR" | awk '{print $1}' | sort | uniq -c
# Data systems: small tools, uniform interface (event log), composable
PostgreSQL → Debezium CDC → Kafka → Flink → Elasticsearch
→ Snowflake
→ Redis
```
Each tool does one thing well. The event log is the pipe. Application code is the shell script that wires them together.
**Strengths:**
- Loose coupling: a fault in Elasticsearch does not affect PostgreSQL or Snowflake
- Each store is optimized for its specific access pattern
- Historical reprocessing: replay the event log to rebuild any derived view
- Independent team ownership: each team operates their downstream consumer independently
**Weaknesses:**
- Operational complexity: more systems to deploy, monitor, and upgrade
- Asynchronous propagation: derived views lag behind the source of truth
- No standardized "pipe" yet: unlike Unix pipes, there is no universal protocol for composing storage systems
## Creating an Index as the Template
The `CREATE INDEX` operation in a relational database illustrates the pattern:
1. The database takes a consistent snapshot of the table
2. It scans all rows, extracts the indexed field values, sorts them, and writes the index structure
3. It processes the backlog of writes that were made since the snapshot was taken
4. It continues maintaining the index on every subsequent write
This is exactly the bootstrap pattern for setting up a new derived view in a composed pipeline:
1. Take a consistent snapshot of the source of record (initial export)
2. Apply the snapshot to the derived store
3. Switch to consuming the ongoing event log from the point of the snapshot
4. Continue consuming the log indefinitely
The difference is that `CREATE INDEX` is built into the database and runs automatically. In a composed pipeline, you implement this bootstrap manually — but the logic is identical.
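The four bootstrap steps can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a real CDC setup: `Source`, `DerivedView`, and the `(key, value)` log entries are hypothetical names chosen for the example, and the snapshot is assumed to be taken atomically together with its log offset.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """Hypothetical source of record: a key-value table plus an append-only change log."""
    table: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # entries are (key, value) tuples

    def write(self, key, value):
        self.table[key] = value
        self.log.append((key, value))

    def snapshot(self):
        # Return a copy of the table and the log offset that the copy corresponds to.
        return dict(self.table), len(self.log)


class DerivedView:
    """Hypothetical derived store, for example a search index keyed like the source."""

    def __init__(self):
        self.data = {}
        self.offset = 0  # next log position to consume

    def bootstrap(self, source):
        snapshot, offset = source.snapshot()  # steps 1-2: export and load the snapshot
        self.data = snapshot
        self.offset = offset                  # step 3: resume from the snapshot's offset

    def catch_up(self, source):
        # Step 4: apply every change recorded after the last consumed offset.
        for key, value in source.log[self.offset:]:
            self.data[key] = value
        self.offset = len(source.log)


source = Source()
source.write("user:1", "alice")
view = DerivedView()
view.bootstrap(source)
source.write("user:2", "bob")       # arrives after the snapshot
source.write("user:1", "alice2")    # update to a row that was in the snapshot
view.catch_up(source)
```

Recording the offset together with the snapshot is the important design choice: it is what lets the consumer switch from the export to the log without skipping or double-applying a change.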
## The Missing Piece
Unix has the shell as a high-level language for composing tools via pipes. Data systems do not yet have an equivalent. The vision:
```
# Hypothetical declarative composition
mysql | elasticsearch # = CREATE INDEX in MySQL, continuously replicated to Elasticsearch
```
This would mean: take all documents in MySQL, index them in Elasticsearch, and continuously apply all future changes. Tools like Debezium + Kafka + a stream processor approximate this, but with significantly more configuration and operational overhead than a pipe operator.
## When a Single System Wins
The goal of unbundling is breadth — serving more access patterns than any single system can. It is not a goal in itself.
**Use a single system when:**
- One technology satisfies all access patterns at the required scale
- The marginal benefit of specialized stores does not justify the operational cost
- The team does not have capacity to operate multiple systems reliably
"If there is a single technology that does everything you need, you're most likely best off simply using that product rather than trying to reimplement it yourself from lower-level components." — Kleppmann, Ch. 12
Premature unbundling — composing systems before the single-system ceiling is reached — adds complexity without benefit and may lock the team into an inflexible design.
Choose the correct consistency model (linearizability, causal consistency, or eventual consistency) for each operation in a distributed system, and select th...
---
name: consistency-model-selector
description: |
Choose the correct consistency model (linearizability, causal consistency, or eventual consistency) for each operation in a distributed system, and select the matching implementation mechanism. Use when designing a new distributed data system, deciding whether ZooKeeper or etcd is needed for coordination, evaluating whether two-phase commit is appropriate for cross-node transactions, debugging correctness violations (stale reads, split-brain, uniqueness constraint failures), or distinguishing linearizability from serializability. Also use when applying the CAP theorem correctly (beyond the "pick 2 of 3" oversimplification), selecting total order broadcast as a consensus primitive, evaluating 2PC failure modes and lock-holding cost, or assessing whether causal consistency is sufficient in place of linearizability. Produces a per-operation consistency recommendation with replication mechanism, ordering guarantee, and — when consensus is needed — protocol selection (Raft, Zab, Paxos) with documented failure modes. Does not cover replication topology or failure recovery strategy (see replication-strategy-selector, distributed-failure-analyzer).
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/designing-data-intensive-applications/skills/consistency-model-selector
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
- replication-strategy-selector
- distributed-failure-analyzer
source-books:
- id: designing-data-intensive-applications
title: "Designing Data-Intensive Applications"
authors: ["Martin Kleppmann"]
chapters: [9]
tags:
- consistency
- linearizability
- causal-consistency
- eventual-consistency
- consensus
- total-order-broadcast
- two-phase-commit
- cap-theorem
- zookeeper
- etcd
- raft
- paxos
- lamport-timestamps
- distributed-transactions
- atomic-commit
- serializability
- ordering-guarantees
- leader-election
- uniqueness-constraints
- distributed-locks
execution:
tier: 1
mode: hybrid
inputs:
- type: codebase
description: "Application codebase, docker-compose, database config, or architecture description revealing operations and their correctness requirements"
- type: document
description: "System requirements document, architecture diagram, or written description of operations if no codebase is available"
tools-required: [Read, Write]
tools-optional: [Grep, Bash]
mcps-required: []
environment: "Run inside a project directory with codebase or architecture artifacts. Falls back to interactive document/description input."
discovery:
goal: "Produce a per-operation consistency recommendation: model selection + replication mechanism + ordering guarantee + consensus protocol if needed"
tasks:
- "Classify each operation by its consistency requirement"
- "Select the minimum sufficient consistency model per operation (eventual / causal / linearizable)"
- "Identify which operations require consensus and match them to the 6-problem checklist"
- "Select a replication mechanism compatible with the chosen model"
- "Evaluate 2PC applicability and document failure modes if selected"
- "Recommend ZooKeeper/etcd for consensus-dependent coordination tasks"
- "Document the CAP trade-off in force for each linearizable operation"
audience:
roles: ["backend-engineer", "software-architect", "data-engineer", "site-reliability-engineer", "tech-lead"]
experience: "intermediate-to-advanced — assumes familiarity with distributed systems, replication, and transactions"
triggers:
- "Choosing consistency guarantees for a new distributed data system"
- "Deciding whether to use ZooKeeper, etcd, or a custom leader election mechanism"
- "Evaluating whether two-phase commit is appropriate for cross-service transactions"
- "Debugging stale reads, causality violations, or uniqueness constraint failures"
- "Assessing whether leaderless replication is strong enough for an operation"
- "Distinguishing when serializability is sufficient vs. when linearizability is required"
- "Designing distributed lock or leader election infrastructure"
not_for:
- "Selecting isolation levels for single-node transactions — use transaction-isolation-selector"
- "Diagnosing replication lag and failover failures — use replication-failure-analyzer"
- "Choosing a replication topology — use replication-strategy-selector"
---
## When to Use
Use this skill when you need to decide **how strongly consistent** a distributed operation must be, and what mechanism enforces that guarantee.
Invoke it for:
- Any operation where two nodes could disagree on the current value (leader election, uniqueness check, account balance enforcement, distributed lock)
- Cross-service or cross-database transactions where atomic commit is required
- Systems where stale reads cause incorrect application behavior (not just stale UI)
- Designing coordination infrastructure (service registry, job scheduler, partition assignment)
- Evaluating whether a Dynamo-style leaderless database is strong enough for a given workload
Do not invoke it for single-node transaction isolation tuning — that is the domain of `transaction-isolation-selector`.
---
## Context and Input Gathering
Before selecting a model, collect the following per operation or component:
1. **Operation type**: read, write, read-modify-write, uniqueness check, lock acquisition, leader election, atomic commit across nodes
2. **Correctness requirement**: Can stale data cause incorrect behavior? Can two nodes diverge temporarily? Is the worst-case outcome data loss, a user-visible error, or a constraint violation?
3. **Availability requirement**: Must this operation succeed during a network partition, or is it acceptable to return an error?
4. **Cross-channel timing dependencies**: Does any other system or user observe an out-of-band signal about the write (e.g., a webhook, message queue, user notification) before reading back?
5. **Replication topology in use**: single-leader, multi-leader, or leaderless (see `replication-strategy-selector`)
6. **Throughput and latency constraints**: Are response times acceptable with a synchronous round-trip to a leader or quorum?
If a codebase is available, search for:
- Database client configuration (isolation level, quorum parameters `w`, `r`, `n`)
- Distributed lock acquisition code
- Uniqueness constraint enforcement logic
- Leader election or service discovery configuration
- Cross-service transaction boundaries
---
## Process
### Step 1 — Identify the Required Guarantee for Each Operation
**WHY**: Different operations have fundamentally different correctness requirements. Over-provisioning consistency wastes latency and availability; under-provisioning introduces bugs that only appear under concurrency or network faults — the hardest class of bugs to detect in testing.
Apply this decision table per operation:
| Scenario | Minimum Required Model |
|---|---|
| Display a user's own recent writes to that same user | Read-your-writes (causal) |
| Show a feed where replies never appear before questions | Causal consistency |
| Enforce a hard uniqueness constraint (username, seat, stock limit) | Linearizability |
| Acquire a distributed lock that prevents split-brain | Linearizability |
| Elect a single leader across nodes | Linearizability (via consensus) |
| Atomically commit a transaction across multiple nodes | Atomic commit (consensus-equivalent) |
| Show analytics dashboard (staleness of seconds acceptable) | Eventual consistency |
| Show a social feed where ordering is approximate | Eventual consistency |
| Replicate writes across datacenters for disaster recovery | Eventual consistency (async replication) |
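One way to apply the table mechanically is a lookup that returns the strongest minimum model among the scenarios an operation touches. The sketch below is only an illustration: the category keys (`read_own_writes`, `uniqueness_constraint`, and so on) are hypothetical labels invented for this example, and atomic commit is left out because it is evaluated separately in Step 4.

```python
from enum import IntEnum


class Model(IntEnum):
    """Consistency models ordered from weakest to strongest."""
    EVENTUAL = 1
    CAUSAL = 2
    LINEARIZABLE = 3


# Hypothetical scenario labels mapped to the minimum model from the table above.
MINIMUM_MODEL = {
    "read_own_writes": Model.CAUSAL,
    "causally_ordered_feed": Model.CAUSAL,
    "uniqueness_constraint": Model.LINEARIZABLE,
    "distributed_lock": Model.LINEARIZABLE,
    "leader_election": Model.LINEARIZABLE,
    "analytics_dashboard": Model.EVENTUAL,
    "approximate_feed": Model.EVENTUAL,
    "disaster_recovery_replication": Model.EVENTUAL,
}


def minimum_model(scenarios):
    """Return the strongest of the minimum models required by the given scenarios."""
    return max(MINIMUM_MODEL[s] for s in scenarios)


# A signup flow that shows the user their own profile and enforces a unique username.
signup = minimum_model(["read_own_writes", "uniqueness_constraint"])
```

An operation that matches several rows inherits the strongest requirement, which is why the function takes the maximum rather than the first match.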
**Key distinction — linearizability vs. serializability** (commonly confused):
- **Serializability** is an isolation property of *transactions*: multi-object operations behave as if they executed in some serial order. It says nothing about recency.
- **Linearizability** is a recency guarantee on *individual objects*: once a write completes, all subsequent reads see that value — no stale reads from any replica.
- A system can have one without the other. Serializable snapshot isolation (SSI) is explicitly *not* linearizable by design — reads come from a consistent snapshot, which may not include the latest write.
- When you need both: use strict serializability (strong one-copy serializability), implemented by systems that combine 2-phase locking with single-leader replication, or by actual serial execution.
### Step 2 — Select the Minimum Sufficient Consistency Model
**WHY**: Causal consistency is the strongest model that does not slow down under network delays and remains available during network failures. Linearizability is strictly stronger but imposes real latency costs proportional to network uncertainty. Always use the weakest model that is correct.
**Eventual consistency** — use when:
- Staleness is acceptable or expected (analytics, social feeds, DNS lookups)
- High availability during partitions is the primary requirement
- Conflict resolution is application-managed or last-write-wins is acceptable
- Systems: Cassandra (default), Riak, DynamoDB (eventual mode), CouchDB
**Causal consistency** — use when:
- Operations have cause-and-effect ordering that users would notice if violated (reply before question, update before its acknowledgment)
- Multiple communication channels exist (e.g., message queue + file storage), and race conditions between them would cause incorrect behavior
- You need read-your-writes, monotonic reads, and consistent prefix reads across replicas
- Linearizability overhead is too high but eventual is too weak
- Implementation: single-leader replication with reads from leader or synchronously updated follower; causal dependency tracking via version vectors or Lamport timestamps
- Note: causal consistency is not a standard off-the-shelf setting in most databases — it requires careful design of how reads are routed and how dependency information is propagated
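Causal dependency tracking with version vectors comes down to a partial-order comparison: one version either happened before another, happened after it, or the two are concurrent. The following is a minimal sketch that assumes a version vector is a plain dict from replica ID to counter, with missing entries treated as zero; the function name `compare` is chosen for this example.

```python
def compare(a, b):
    """Compare two version vectors (dicts of replica id -> counter).

    Returns 'equal', 'before' (a happened before b), 'after', or 'concurrent'.
    """
    replicas = set(a) | set(b)
    a_le_b = all(a.get(r, 0) <= b.get(r, 0) for r in replicas)
    b_le_a = all(b.get(r, 0) <= a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    # Neither vector dominates the other: the writes are causally unrelated.
    return "concurrent"


question = {"asia": 1}
reply = {"asia": 1, "europe": 1}      # written after seeing the question
unrelated = {"us": 1}                 # written without seeing either
```

Only the `concurrent` case needs conflict handling; `before` and `after` tell a replica which order it must preserve.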
**Linearizability** — use when:
- A uniqueness constraint must be enforced as data is written (username, email, seat booking, stock level)
- A distributed lock must be held by exactly one node at a time (split-brain prevention)
- A leader election must produce exactly one agreed-upon leader
- A cross-channel timing dependency exists and cannot be controlled (an external system reads after an out-of-band signal)
- The operation is a compare-and-set that must be atomic across replicas
- Systems: single-leader replication with reads from leader (potentially), consensus algorithms (ZooKeeper, etcd — linearizable writes, quorum reads for linearizable reads), 2-phase locking + single-leader, actual serial execution
- **Not provided by**: multi-leader replication (concurrent writes to different leaders), leaderless/Dynamo-style replication (last-write-wins with clock skew is not linearizable; even strict quorums are not necessarily linearizable when network delays vary; see reference)
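When auditing quorum parameters found in a codebase, a quick check is whether read and write quorums overlap at all. The sketch below assumes the usual `n`, `w`, `r` notation; the function name is hypothetical. Overlap is only a precondition: as noted above, a configuration that passes this check can still behave non-linearizably.

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """Return True if every read quorum shares at least one node with every write quorum."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must each be between 1 and n")
    return w + r > n
```

For example, `n=3, w=2, r=2` overlaps, while `n=3, w=1, r=1` does not, so a read in the second configuration can miss the latest write even without any fault.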
### Step 3 — Check If the Operation Requires Consensus
**WHY**: Consensus is harder than it looks. Many operations that appear simple are actually reducible to consensus, meaning they require a consensus algorithm to implement correctly in a fault-tolerant way. Identifying this early prevents building brittle custom solutions.
The following 6 problems are all equivalent to consensus — if you need to solve any one of them in a fault-tolerant distributed system, you need a consensus algorithm:
1. **Linearizable compare-and-set registers** — atomically decide whether to set a value based on its current state
2. **Atomic transaction commit** — decide whether to commit or abort a distributed transaction (all nodes must agree)
3. **Total order broadcast** — decide the order in which messages are delivered to all nodes
4. **Distributed locks and leases** — decide which client successfully acquired the lock
5. **Membership/coordination service** — decide which nodes are alive and should be considered current members
6. **Uniqueness constraints** — decide which of concurrent conflicting writes wins
If your operation matches any of the above, evaluate whether to:
- Use an existing consensus service (ZooKeeper, etcd — preferred, see Step 5)
- Use a database that internally implements consensus (VoltDB, CockroachDB, Spanner)
- Implement two-phase commit for atomic cross-node transactions (see Step 4 for failure modes)
### Step 4 — Evaluate Two-Phase Commit (2PC) If Cross-Node Atomic Commit Is Required
**WHY**: 2PC is the standard algorithm for atomic commit across multiple nodes. It solves a real problem but introduces a single point of failure (the coordinator) and can block indefinitely — understanding its failure modes is essential before choosing it.
**How 2PC works**:
1. Coordinator sends `prepare` to all participant nodes
2. Each participant votes `yes` (promises it can commit) or `no` (aborts)
3. If all vote `yes`, coordinator writes commit decision to its log (the commit point), then sends `commit` to all participants
4. If any vote `no`, coordinator sends `abort` to all participants
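The four steps can be sketched as a small in-memory simulation. This is an illustration of the happy path and the abort path only, under simplifying assumptions: `Participant` and `Coordinator` are hypothetical classes, a Python list stands in for the coordinator's durable log, and there are no timeouts, crashes, or retries.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Voting yes is a promise: a prepared participant may no longer abort on its own.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


class Coordinator:
    def __init__(self):
        self.log = []  # stands in for the durable decision log

    def run(self, txid, participants):
        votes = [p.prepare() for p in participants]   # phase 1: collect votes
        decision = "commit" if all(votes) else "abort"
        self.log.append((txid, decision))             # commit point when the decision is commit
        for p in participants:                        # phase 2: deliver the decision
            if decision == "commit":
                p.commit()
            else:
                p.abort()
        return decision


coordinator = Coordinator()
a, b = Participant("orders"), Participant("payments")
first = coordinator.run("tx-1", [a, b])

x, y = Participant("orders"), Participant("payments", can_commit=False)
second = coordinator.run("tx-2", [x, y])
```

The decision is appended to the log before any participant is told about it, which mirrors why the log write, not the first `commit` message, is the commit point.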
**The critical failure mode — coordinator crash after prepare**:
- Once a participant has voted `yes`, it cannot unilaterally abort or commit — it must wait for the coordinator's decision
- If the coordinator crashes after receiving `yes` votes but before sending the commit/abort: participants are **in-doubt** and blocked indefinitely
- In-doubt participants hold row-level locks on all modified rows — no other transaction can proceed on those rows
- The only resolution is coordinator recovery (reads its log) or manual administrator intervention
- Orphaned in-doubt transactions (coordinator log lost or corrupted) may hold locks permanently until manual rollback
**2PC failure mode catalog**:
| Failure | Outcome |
|---|---|
| Coordinator crashes before prepare | Safe: participants can abort |
| Participant crashes before voting | Coordinator aborts on timeout |
| Network partitions participant from coordinator | Coordinator aborts on timeout (before commit point) |
| Coordinator crashes after commit point, before all `commit` sent | Remaining participants are in-doubt; blocked until coordinator recovers |
| Coordinator log lost after crash | Orphaned transactions; manual intervention required |
| Long coordinator restart (e.g., 20 minutes) | All participant locks held for that duration; application may be largely unavailable |
**2PC performance cost**: Disk fsyncs at each phase, additional network round-trips, lock-holding during coordination. Distributed transactions in MySQL are reported to be over 10 times slower than single-node transactions.
**When 2PC is appropriate**:
- Database-internal distributed transactions (all nodes run the same software — e.g., VoltDB, MySQL Cluster NDB) — failure modes are more manageable
- Heterogeneous systems that all support XA transactions (PostgreSQL, MySQL, Oracle + JTA-compatible message brokers)
**When to avoid 2PC**:
- High-availability requirements: a coordinator crash can make the application unavailable
- Heterogeneous systems where not all participants support XA or 2PC
- When the coordinator is not replicated (single point of failure for the entire transaction system)
- Stateless application server deployments where coordinator logs can be lost on restart
**Alternatives to 2PC for cross-service correctness**:
- Sagas (compensating transactions) — eventual consistency with explicit rollback logic
- Outbox pattern — atomic local write + reliable event relay (see `distributed-failure-analyzer`)
- Total order broadcast + idempotent consumers — exactly-once processing without 2PC
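The saga alternative can be sketched as a runner that executes local steps in order and, on failure, runs the compensations of the steps that already completed, in reverse. This is a minimal in-process sketch: `run_saga` and the step names are hypothetical, and a production saga would also persist its progress so that compensation survives a crash of the runner.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action raises, run the compensations of all completed steps in
    reverse order and return False. Return True if every action succeeds.
    """
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensation)
    return True


events = []


def charge_card():
    raise RuntimeError("payment declined")


ok = run_saga([
    (lambda: events.append("reserve seat"), lambda: events.append("release seat")),
    (charge_card, lambda: events.append("refund payment")),
])
```

The failed step's own compensation is not run, because the step never took effect; only the earlier seat reservation is undone.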
### Step 5 — Select the Consensus Implementation
**WHY**: Implementing consensus from scratch has a very poor success record. Well-tested consensus systems exist and should be used as building blocks. The key insight is that total order broadcast (the core of Raft, Zab, Paxos) is the practical primitive that enables all 6 consensus-equivalent problems to be solved safely.
**Total order broadcast** (also called atomic broadcast) provides:
- **Reliable delivery**: if a message is delivered to one node, it is delivered to all
- **Totally ordered delivery**: all nodes receive messages in exactly the same order
- This is equivalent to repeated rounds of consensus — and it is what ZooKeeper, etcd, and Kafka (with ZooKeeper or KRaft) implement
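A uniqueness constraint is a convenient way to see why identical ordering matters: if every replica applies the same log of claims in the same order, every replica reaches the same verdict about who won. The sketch below assumes the log has already been totally ordered by some broadcast layer; `UsernameRegistry` and the `(username, user_id)` message shape are hypothetical.

```python
class UsernameRegistry:
    """Deterministic state machine driven by a totally ordered log of claims."""

    def __init__(self):
        self.owner = {}

    def apply(self, message):
        username, user_id = message
        # The first claim in log order wins; later claims for the same name are rejected.
        if username not in self.owner:
            self.owner[username] = user_id
            return True
        return False


# The same ordered log is delivered to both replicas.
ordered_log = [("alice", "u1"), ("bob", "u2"), ("alice", "u3")]
replica_a, replica_b = UsernameRegistry(), UsernameRegistry()
results_a = [replica_a.apply(m) for m in ordered_log]
results_b = [replica_b.apply(m) for m in ordered_log]
```

No replica needs to coordinate with another while applying the log; agreement on the order is the only coordination required.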
**Consensus algorithm selection**:
| Algorithm | Implemented by | Notes |
|---|---|---|
| Raft | etcd, CockroachDB, TiKV, Consul | Well-specified, good documentation, widely adopted |
| Zab | ZooKeeper | Total order broadcast directly; basis of Hadoop/HBase/Kafka coordination |
| Multi-Paxos | Google Chubby, Spanner | Highly proven, complex to implement correctly |
| Viewstamped Replication | Basis for VR-based systems | Theoretically important, less common in production |
**Do not implement your own consensus algorithm.** Use one of the above systems.
**ZooKeeper / etcd as outsourced consensus** — prefer this model when:
- You need distributed locks, leader election, service discovery, or membership tracking
- Your application should not embed consensus logic
- You need the combination: linearizable atomic operations + total ordering + failure detection + change notifications
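- Note: ZooKeeper writes are linearizable by default; reads may be stale unless you request a linearizable read (a `sync()` call before the read in ZooKeeper, a quorum read in etcd v2; etcd v3 reads are linearizable by default unless a serializable read is requested)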
**Fault-tolerant consensus limitations**:
- Requires a strict majority: tolerate 1 failure → need 3 nodes; tolerate 2 failures → need 5 nodes
- Safety properties (agreement, integrity, validity) are always maintained even if a majority fails
- Liveness (termination) requires fewer than half the nodes to be failed or unreachable
- Performance degrades with high network variance (frequent false leader timeouts trigger leader elections, reducing throughput)
- Fixed membership by default — dynamic membership extensions exist but are less well-understood
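The majority arithmetic behind the first bullet is small enough to write down. The helper names below are chosen for this example only.

```python
def nodes_needed(failures_to_tolerate: int) -> int:
    """Smallest cluster size that keeps a strict majority after f failures: 2f + 1."""
    return 2 * failures_to_tolerate + 1


def majority(n: int) -> int:
    """Number of nodes that form a strict majority of an n-node cluster."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Failures an n-node cluster can absorb while a majority remains reachable."""
    return (n - 1) // 2
```

Note that `tolerated_failures(4)` equals `tolerated_failures(3)`: adding a fourth node raises the majority size without raising fault tolerance, which is why odd cluster sizes are the usual choice.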
### Step 6 — Document the CAP Trade-Off Per Linearizable Operation
**WHY**: CAP is widely misunderstood. The correct framing is: when a network partition occurs, a linearizable system must choose between staying consistent (refusing requests) or becoming available (serving potentially stale data). This is not a design choice you make once — it is a per-operation consequence of requiring linearizability.
**What CAP actually says**:
- If an application *requires* linearizability: when a replica is disconnected from others, it must wait or return an error — it becomes *unavailable*
- If an application *does not require* linearizability: replicas can process requests independently during a partition — it remains *available* but behavior is not linearizable
- The "pick 2 of 3" framing is misleading: network partitions are not optional. A better framing: *either Consistent or Available when Partitioned*
- CAP is narrow in scope: it considers only one consistency model (linearizability) and one kind of fault (network partitions), and says nothing about latency, node crashes, or other faults. It has been superseded by more precise results for system design purposes
**Practical consequence**: For each operation you mark as requiring linearizability, document the expected behavior during a partition: error returned, request queued, or application component unavailable. Stakeholders should understand this before the system is deployed.
---
## Examples
### Example 1 — Seat Booking Service
**Scenario**: An event ticketing platform lets users book the last seat in a venue. Two users submit requests concurrently.
**Trigger**: "We're seeing double-bookings in our seat reservation system."
**Process**:
1. Gather: the seat availability check + reservation write is a uniqueness constraint — exactly one booking must win
2. Model selection: linearizability required. Both concurrent reads see availability = 1; without a linearizable compare-and-set, both can commit.
3. Consensus check: uniqueness constraint → problem 6 in the consensus-equivalent list → requires consensus
4. Implementation: use a single-leader database with a serializable transaction (or at minimum a linearizable compare-and-set on the seat record). A Dynamo-style leaderless database with last-write-wins is not safe here.
5. 2PC: only needed if the seat record and the payment record live in different databases. If so, use a saga with compensation (refund) instead of 2PC to avoid coordinator-failure blocking.
**Output**: Linearizability required for the reservation write. Use a single-leader database with serializable isolation for the seat-payment transaction. If cross-database: saga pattern with idempotent payment reversal.
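The double-booking race and its fix can be sketched with an atomic compare-and-set on the seat record. In this illustration a `threading.Lock` stands in for the single leader that serializes writes; `SeatRegister` and the user names are hypothetical, and a real system would get the same effect from a serializable transaction or a conditional update on the leader.

```python
import threading


class SeatRegister:
    """One seat record with an atomic compare-and-set."""

    def __init__(self):
        self._holder = None
        self._lock = threading.Lock()  # stands in for a single leader ordering all writes

    def compare_and_set(self, expected, new):
        # The check and the write happen as one indivisible step.
        with self._lock:
            if self._holder == expected:
                self._holder = new
                return True
            return False

    @property
    def holder(self):
        return self._holder


seat = SeatRegister()
results = {}


def book(user):
    # Succeeds only if the seat is still unassigned at the moment of the write.
    results[user] = seat.compare_and_set(None, user)


threads = [threading.Thread(target=book, args=(u,)) for u in ("alice", "bob")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Which user wins depends on scheduling, but exactly one request succeeds and the loser receives an explicit failure it can show to the user.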
### Example 2 — Multi-Region Comment Feed
**Scenario**: A social platform shows threaded comments. Replies should never appear before the question being replied to. Comments can be up to 2 seconds stale. The system uses multi-leader replication across 3 regions.
**Trigger**: "Users in Asia see replies to questions that haven't appeared yet."
**Process**:
1. Gather: the anomaly is a causal ordering violation — replies appearing before their parent
2. Model selection: causal consistency is sufficient. Linearizability is not required (no uniqueness constraint, no lock, stale display is acceptable within 2 seconds).
3. Multi-leader replication is not causally consistent by default — writes on different leaders can be applied in any order on followers.
4. Options: (a) route all reads and writes for a given thread to a single leader per partition; (b) propagate causal dependency metadata (version vectors) with writes and delay delivery of writes whose dependencies haven't arrived yet; (c) use a single-leader database and accept the write-latency increase for cross-region authors.
5. Total order broadcast is overkill — causal ordering per thread is sufficient.
**Output**: Causal consistency required for comment ordering. Recommended: single-leader-per-partition with reads from leader. Multi-leader without dependency tracking is not safe for this workload.
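Option (b) can be sketched as a buffer that holds back a comment until its dependency has been delivered. To keep the example short, the dependency metadata is reduced to a single parent ID rather than a full version vector; `CausalInbox` and the comment IDs are hypothetical.

```python
class CausalInbox:
    """Delivers a comment only after the comment it replies to has been delivered."""

    def __init__(self):
        self.delivered = []   # comment ids in delivery order
        self._seen = set()
        self._waiting = {}    # parent id -> comment ids held back until the parent arrives

    def receive(self, comment_id, parent_id=None):
        if parent_id is not None and parent_id not in self._seen:
            # Dependency missing: buffer instead of showing a reply without its question.
            self._waiting.setdefault(parent_id, []).append(comment_id)
            return
        self._deliver(comment_id)

    def _deliver(self, comment_id):
        self.delivered.append(comment_id)
        self._seen.add(comment_id)
        # Release anything that was waiting on this comment.
        for child in self._waiting.pop(comment_id, []):
            self._deliver(child)


inbox = CausalInbox()
inbox.receive("reply-1", parent_id="question-1")   # arrives first and is held back
held_back = list(inbox.delivered)
inbox.receive("question-1")                        # releases the buffered reply
```

The cost of this approach is added latency for the buffered write, which fits this scenario because comments may be up to 2 seconds stale.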
### Example 3 — Leader Election for a Job Scheduler
**Scenario**: A distributed job scheduler must have exactly one active scheduler node at a time. If two nodes believe they are the leader, jobs execute twice.
**Trigger**: "We have a split-brain problem — two scheduler instances both claim leadership and jobs are running twice."
**Process**:
1. Gather: leader election maps to problems 4 and 5 in the consensus-equivalent list (locks/leases and membership/coordination) and requires a linearizable lock
2. Model selection: linearizability required. The lock must be held by exactly one node at a time — all nodes must agree who holds it.
3. Consensus check: leader election → consensus required
4. Implementation: use ZooKeeper or etcd for leader election. Acquire an ephemeral node (ZooKeeper) or a lease (etcd). Use fencing tokens (the monotonically increasing `zxid` in ZooKeeper) to prevent a slow previous leader from acting on a stale lock after a new leader is elected.
5. 2PC is not relevant — this is a lock acquisition, not a cross-node transaction.
6. Do not implement this with a custom distributed lock on a regular database row; without a consensus-backed lease and fencing tokens, it will not correctly handle a lock holder that crashes, pauses, or loses connectivity.
**Output**: Linearizability required. Use etcd or ZooKeeper for leader election with leases or ephemeral nodes, plus fencing tokens. Never use a custom distributed lock without a consensus-backed service.
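The fencing-token check lives on the resource being protected, not in the lock service. A minimal sketch, assuming the lock service hands out a monotonically increasing integer with each lease; `FencedStore` and the token values are hypothetical.

```python
class FencedStore:
    """Resource that rejects writes carrying an older fencing token than one already seen."""

    def __init__(self):
        self.highest_token = -1
        self.writes = []

    def write(self, token, value):
        if token < self.highest_token:
            # A newer leader has already written; this caller holds a stale lock.
            return False
        self.highest_token = token
        self.writes.append(value)
        return True


store = FencedStore()
old_leader_first = store.write(33, "job-1 scheduled by node A")
new_leader = store.write(34, "job-2 scheduled by node B")
old_leader_late = store.write(33, "job-3 scheduled by node A")   # node A woke up after a pause
```

The paused old leader cannot detect on its own that its lease expired, so the rejection has to come from the store.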
---
## References
- [consistency-model-spectrum.md](references/consistency-model-spectrum.md) — Formal definitions of all consistency models with ordering properties
- [linearizability-vs-serializability.md](references/linearizability-vs-serializability.md) — Side-by-side distinction with examples of systems that provide each
- [cap-theorem-analysis.md](references/cap-theorem-analysis.md) — What CAP actually says, its limitations, and how to apply it correctly
- [consensus-equivalence-checklist.md](references/consensus-equivalence-checklist.md) — The 6 consensus-equivalent problems with implementation guidance
- [2pc-failure-modes.md](references/2pc-failure-modes.md) — Complete 2PC failure catalog, in-doubt recovery, XA limitations, alternatives
- [total-order-broadcast-primitives.md](references/total-order-broadcast-primitives.md) — How to build linearizable storage and uniqueness constraints from total order broadcast
Cross-references:
- `replication-strategy-selector` — replication topology must be compatible with the selected consistency model
- `distributed-failure-analyzer` — for failure mode diagnosis when consistency violations appear in production
- `transaction-isolation-selector` — for isolation level selection within a single-node or single-leader transaction context
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Designing Data-Intensive Applications by Martin Kleppmann.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-replication-strategy-selector`
- `clawhub install bookforge-distributed-failure-analyzer`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/2pc-failure-modes.md
# Two-Phase Commit: Failure Modes and Alternatives
## How 2PC Works (Normal Path)
Two-phase commit (2PC) ensures atomic transaction commit across multiple nodes. It requires a *coordinator* (transaction manager) and one or more *participants* (database nodes).
**Phase 1 — Prepare**:
1. Application requests a globally unique transaction ID from the coordinator
2. Application executes the transaction on each participant under that transaction ID
3. When ready to commit, coordinator sends `prepare` to all participants
4. Each participant either:
- Replies `yes`: writes all transaction data to disk, promises it can commit under any future circumstance (crash, power failure, etc. are no longer excuses for refusing to commit later)
- Replies `no`: aborts the transaction
**Phase 2 — Commit or Abort**:
5. If all participants replied `yes`: coordinator writes commit decision to its own durable log (the *commit point*), then sends `commit` to all participants
6. If any participant replied `no`: coordinator sends `abort` to all participants
7. Coordinator retries `commit` or `abort` indefinitely until all participants acknowledge
**The commit point** is the critical moment: writing the commit decision to the coordinator's log is the irrevocable decision. Everything before this point can still be aborted; everything after must be committed.
## Failure Modes
### Participant Failure Before Voting
**What happens**: Coordinator detects timeout; aborts the transaction and sends `abort` to all other participants.
**Safe**: The participant hasn't voted yet, so it can roll back on recovery.
### Participant Failure After Voting "Yes"
**What happens**: Participant is in-doubt. When it recovers, it checks in with the coordinator for the commit/abort decision and applies it.
**Safe**: The coordinator still holds the decision.
### Coordinator Failure Before Sending Prepare
**What happens**: Coordinator crashes before phase 1. Participants have no in-doubt transactions; they can safely abort.
**Safe**: No participant has committed anything yet.
### Coordinator Failure After Prepare, Before Commit Point (Most Dangerous)
**What happens**:
- Participants have voted `yes` — they are in-doubt
- Participants CANNOT unilaterally abort (they cannot rule out that the coordinator decided to commit and another participant has already committed)
- Participants CANNOT unilaterally commit (another participant may have voted `no` or aborted)
- Participants hold row-level locks on all modified rows — no other transaction can proceed on those rows
- Participants wait indefinitely until the coordinator recovers
**Recovery**: When the coordinator recovers, it reads its transaction log:
- If the commit record is present: send `commit` to in-doubt participants
- If no commit record: the commit point was not reached; send `abort` to all
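The recovery rule is a lookup against the coordinator's log. The sketch below assumes the log has already been reduced to the set of transaction IDs that have a durable commit record; the function name and transaction IDs are hypothetical.

```python
def resolve_in_doubt(commit_records, in_doubt_txids):
    """Decide the outcome of each in-doubt transaction after a coordinator restart.

    commit_records: set of transaction IDs with a durable commit record in the log.
    in_doubt_txids: transaction IDs that participants are still waiting on.
    """
    decisions = {}
    for txid in in_doubt_txids:
        # A commit record means the commit point was reached; otherwise abort.
        decisions[txid] = "commit" if txid in commit_records else "abort"
    return decisions


decisions = resolve_in_doubt({"tx-1"}, ["tx-1", "tx-2"])
```

Defaulting to abort is safe only because the coordinator writes the commit record before sending any `commit` message, so a missing record guarantees that no participant was told to commit.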
**Lock-holding duration**: If the coordinator takes 20 minutes to restart, participant locks are held for 20 minutes. During this time, other transactions accessing those rows are blocked.
### Coordinator Log Lost or Corrupted After Crash
**What happens**: Orphaned in-doubt transactions — the coordinator cannot determine the correct outcome.
**Resolution**: Manual administrator intervention required. Administrator must:
1. Examine the state of each participant (has any participant committed or aborted?)
2. Apply the same outcome to all remaining participants
**Operational risk**: This typically happens during a serious production outage, under time pressure, requiring expert judgment about distributed transaction state.
### XA Heuristic Decisions (Emergency Escape Hatch)
Some XA transaction implementations allow a *heuristic decision* — a participant unilaterally decides to commit or abort an in-doubt transaction without a definitive decision from the coordinator.
In this context, "heuristic" is a euphemism for "probably breaking atomicity." It is intended only for escaping catastrophic situations (coordinator permanently down, locks held for an unacceptable duration). Regular use violates 2PC's correctness guarantees.
## Complete Failure Catalog
| Failure Timing | Blocked? | Data Safe? | Resolution |
|---|---|---|---|
| Before prepare sent | No | Yes | Coordinator aborts on timeout |
| Participant fails before voting | No | Yes | Coordinator aborts on timeout |
| Participant fails after voting yes | Participant blocked | Yes | Coordinator sends decision on participant recovery |
| Coordinator fails before prepare | No | Yes | Participants can abort |
| Coordinator fails after prepare, before commit | **Yes — in-doubt** | Yes (no commit yet) | Wait for coordinator recovery |
| Coordinator fails after commit point, partial sends | **Partial commit — in-doubt** | Partially committed | Wait for coordinator recovery |
| Coordinator log lost after crash | **Indefinitely blocked** | Unknown | Manual administrator intervention |
## Performance Cost
- **Disk fsyncs**: The coordinator must fsync its log before sending commit (at the commit point). Each participant must fsync its transaction data before replying yes to prepare. Total: at least one fsync on the coordinator plus one on every participant per transaction.
- **Network round-trips**: At minimum 2 round-trips (prepare + commit) vs. 1 for single-node commit.
- **Lock-holding**: Participants hold locks throughout both phases — much longer than single-node transactions.
- **Benchmark**: Distributed transactions in MySQL are reported to be over 10 times slower than single-node transactions.
## 2PC vs. Fault-Tolerant Consensus
| Property | 2PC | Fault-Tolerant Consensus (Raft/Zab) |
|---|---|---|
| Uniform agreement | Yes | Yes |
| Integrity | Yes | Yes |
| Validity | Yes | Yes |
| Termination | **No** (blocks if coordinator crashes) | Yes (majority quorum sufficient) |
| Coordinator | Single point of failure | Leader elected by consensus |
| Requires all nodes? | **Yes** — any participant failure can block | No — majority quorum sufficient |
| Amplifies failures | Yes — one broken participant blocks all | No — minority failures are tolerated |
2PC is a kind of consensus algorithm, but not a very good one — it does not satisfy the termination property.
## Distributed Transaction Types
**Database-internal distributed transactions**: All participants run the same database software (VoltDB, MySQL Cluster NDB, CockroachDB). The coordinator and protocol are optimized for that specific system. Failure modes are more manageable; performance is better.
**Heterogeneous distributed transactions (XA)**: Participants are different technologies (PostgreSQL + Oracle + ActiveMQ). XA is the standard C API for cross-technology 2PC (Java Transaction API/JTA wraps XA for Java). Limitations:
- Cannot detect deadlocks across systems (no shared lock information)
- Does not work with SSI (SSI would need a protocol for detecting conflicts across the different systems)
- Coordinator log on application server makes the server stateful — breaks horizontal scaling model
- Coordinator is a single point of failure unless explicitly replicated (most XA coordinators are not highly available by default)
## Alternatives to 2PC
### Saga Pattern (Compensating Transactions)
Break a multi-step transaction into a sequence of local transactions, each with a compensating action for rollback.
- Provides eventual atomicity (not strict atomicity)
- Requires idempotent compensating actions
- Application must handle partial state during saga execution
- Best for: long-running business transactions where strict atomicity is not required
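A saga runner can be sketched as a loop that remembers the compensation for each completed step. This is an illustrative sketch only: `run_saga` and the step functions are hypothetical, and a real implementation would persist progress so that compensations can be retried after a crash (which is why they must be idempotent).

```python
from typing import Callable, List, Tuple

Step = Tuple[Callable[[], None], Callable[[], None]]  # (action, compensation)


def run_saga(steps: List[Step]) -> bool:
    """Run local transactions in order; on failure, compensate completed steps in reverse."""
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


events: List[str] = []


def charge_card() -> None:
    raise RuntimeError("payment declined")


succeeded = run_saga([
    (lambda: events.append("reserve stock"), lambda: events.append("release stock")),
    (charge_card, lambda: events.append("refund payment")),
])
```

Between the first step and its compensation, other transactions can observe the reserved stock. That window is the "partial state" the application must tolerate.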
### Outbox Pattern
Write the event/message to the same database as the main write, in the same local transaction. A relay process reads the outbox and publishes to the message broker.
- Guarantees at-least-once delivery without cross-system 2PC
- Requires idempotent consumers
- Best for: reliably publishing events to a message broker after a database write
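The pattern can be sketched with SQLite from the Python standard library. The table names, `place_order`, and `relay_once` are assumptions made for illustration, and `publish` stands in for whatever broker client is in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def place_order(conn, order_id, item):
    # One local transaction covers both the business write and the outbox row.
    with conn:
        conn.execute("INSERT INTO orders (id, item) VALUES (?, ?)", (order_id, item))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (f"order_created:{order_id}",))


def relay_once(conn, publish):
    # Publish first, mark second: a crash in between causes a re-publish
    # on the next run, which is why delivery is at-least-once.
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(payload)
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

If the order insert fails, the transaction rolls back and no outbox row exists, so no event is ever published for a write that did not happen.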
### Total Order Broadcast + Idempotent Consumers
Use a log (Kafka, Kinesis) as the source of truth. All consumers process messages in the same order. Idempotency handles duplicate delivery.
- Best for: systems where all state is derived from a log (event sourcing, CQRS)
- Does not require 2PC across the producer and consumer systems
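One common way to make a log consumer idempotent is to remember the offset of the last message it applied. The sketch below assumes offsets increase monotonically within a partition; `IdempotentConsumer` is a hypothetical class, and in a real system the offset would be persisted atomically with the derived state.

```python
class IdempotentConsumer:
    """Applies each log message at most once by tracking the last applied offset."""

    def __init__(self):
        self.balance = 0
        self.last_offset = -1  # would be stored together with the state in practice

    def handle(self, offset: int, amount: int) -> bool:
        if offset <= self.last_offset:
            return False  # duplicate delivery: already applied, so ignore it
        self.balance += amount
        self.last_offset = offset
        return True
```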
FILE:references/cap-theorem-analysis.md
# CAP Theorem Analysis
## What CAP Actually Says
CAP (Consistency, Availability, Partition tolerance) is often summarized as "pick 2 of 3." This framing is misleading. The correct interpretation:
**When a network partition occurs, a linearizable (consistent) system must choose between**:
- **Remaining consistent**: refuse requests from nodes that cannot contact the majority — become *unavailable*
- **Remaining available**: process requests from disconnected nodes — violate *linearizability*
A better statement: **either Consistent or Available when Partitioned (CAAP)**.
## Why "Pick 2 of 3" Is Wrong
Network partitions are not a design choice. They happen whether you want them to or not — packets are delayed, lost, or reordered; hardware fails; datacenter links go down. You cannot opt out of partition tolerance. The real choice is what to do *when* a partition occurs.
Additionally:
- CAP's definition of "availability" is idiosyncratic — it means every request receives a response, not that the system is "highly available" in the operational sense. Many "highly available" systems (that use consensus) do not meet CAP's availability definition.
- CAP only considers one consistency model (linearizability) and one fault type (network partitions). It says nothing about node crashes, network delays, or other faults.
- CAP has been superseded by more precise results (PACELC, etc.) for system design purposes.
## Correct Application
For each operation that requires linearizability, document the expected behavior during a network partition:
| Situation | Linearizable System Behavior | Non-Linearizable System Behavior |
|---|---|---|
| Node disconnected from leader | Must wait or return error | Can serve requests (potentially stale) |
| Cross-datacenter link down (single-leader) | Follower datacenter unavailable for writes and linearizable reads | Multi-leader: both datacenters continue, may conflict |
| Minority partition | Minority nodes unavailable (cannot form quorum) | May continue serving (leaderless, eventual) |
## Partition Behavior by Replication Strategy
**Single-leader (linearizable reads from leader)**:
- Clients connected to leader datacenter: unaffected
- Clients connected to follower datacenter: unavailable for writes and linearizable reads during partition
- This is the expected trade-off: linearizability is preserved, at the cost of availability in the follower datacenter
**Multi-leader**:
- Both datacenters continue operating during partition
- Writes accepted independently; conflict resolution required when partition heals
- Behavior is NOT linearizable
**Leaderless (Dynamo-style)**:
- Sloppy quorums or degraded quorums may serve requests during partition
- Data may diverge; read repair resolves after partition heals
- Behavior is NOT linearizable
**Consensus algorithm (Raft/Zab)**:
- Majority partition: continues normally
- Minority partition: unavailable (cannot make progress without quorum)
- Behavior IS linearizable for the majority partition
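The majority rule for consensus-based systems is simple arithmetic. A minimal sketch (`has_quorum` is a hypothetical helper, not an API of Raft or Zab implementations):

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """A side of a partition can make progress only with a strict majority."""
    return reachable_nodes > cluster_size // 2
```

For a 5-node cluster split 3/2, only the 3-node side can continue. For a 4-node cluster split 2/2, neither side has a majority, which is one reason odd cluster sizes are usually preferred.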
## The Latency-Consistency Trade-off (Beyond CAP)
Even on a network that is working correctly (no partition), linearizability is slow. The Attiya-Welch theorem proves that in a network with variable delays, the response time of linearizable reads and writes is at least proportional to the uncertainty of those delays.
This means: in a geographically distributed system with high inter-region latency, linearizable operations will be slow even when nothing is failing. Many distributed databases choose weaker consistency models primarily for latency reasons, not partition tolerance.
**Practical implication**: Even within a single machine, RAM on a modern multi-core CPU is not linearizable (CPU caches create multiple copies of data, updated asynchronously). Linearizability was sacrificed for performance; memory barriers (fences) re-introduce ordering when needed.
## When to Apply CAP Analysis
Apply CAP analysis to each operation you have marked as requiring linearizability:
1. What happens during a network partition? Which clients lose access?
2. Is partial availability acceptable (clients on majority partition continue; minority partition returns errors)?
3. What is the expected duration of partitions in your network? (Datacenter link failure: minutes to hours. Within a datacenter: milliseconds.)
4. Is the cost of unavailability during a partition acceptable, given how rare partitions are?
If unavailability during a partition is not acceptable, reconsider whether linearizability is truly required for that operation, or whether causal consistency with application-level conflict handling is sufficient.
## CAP Does Not Apply to Causal Consistency
Causal consistency is the strongest model that remains available during network failures and does not slow down due to network delays. The CAP theorem does not apply to causally consistent systems because they do not make the recency guarantee that linearizability requires.
This is why, for many operations that appear to need linearizability (but actually only need causal ordering), causal consistency is the correct — and much cheaper — choice.
FILE:references/consensus-equivalence-checklist.md
# Consensus Equivalence Checklist
## What Consensus Means
Consensus means getting several nodes to agree on a single value, such that:
- **Uniform agreement**: No two nodes decide differently
- **Integrity**: No node decides twice
- **Validity**: The decided value was proposed by some node (not invented)
- **Termination**: Every non-crashed node eventually decides
The key challenge is termination under faults. 2PC satisfies the first three properties but fails termination (in-doubt participants block indefinitely when the coordinator crashes). True consensus algorithms (Raft, Zab, Paxos) satisfy all four.
## The 6 Consensus-Equivalent Problems
These problems are all reducible to each other. If you can solve one in a fault-tolerant way, you can transform that solution to solve any of the others. All require a consensus algorithm for a correct fault-tolerant implementation.
### 1. Linearizable Compare-and-Set Registers
**Problem**: Atomically read a value, compare it to an expected value, and set it to a new value only if the comparison succeeds — and have all nodes agree on whether the operation succeeded.
**Why it needs consensus**: The "compare" and "set" must be atomic across replicas. Without consensus, two nodes can each see the old value and both succeed their CAS, producing split state.
**Implementation**: ZooKeeper's `setData` with version check, etcd's transactional `compare-and-swap`.
**Practical use**: Username uniqueness, lock acquisition, leader election tokens, optimistic concurrency control.
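The version-check style of compare-and-set can be illustrated with a single-process stand-in. `VersionedRegister` is hypothetical and is not the ZooKeeper or etcd client API. A lock makes the check and the write atomic here, whereas a replicated register needs consensus to achieve the same effect.

```python
import threading


class VersionedRegister:
    """Single-process stand-in for a linearizable register with a version check."""

    def __init__(self, value=None):
        self._lock = threading.Lock()
        self.value = value
        self.version = 0

    def read(self):
        with self._lock:
            return self.value, self.version

    def set_if_version(self, new_value, expected_version: int) -> bool:
        # The comparison and the write happen under one lock, so they are atomic.
        with self._lock:
            if self.version != expected_version:
                return False
            self.value = new_value
            self.version += 1
            return True
```

Two clients that both read version 0 and then race to write will see exactly one success, because the loser's expected version no longer matches.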
### 2. Atomic Transaction Commit
**Problem**: A transaction spans multiple nodes. Either all nodes commit, or all abort — no partial commits.
**Why it needs consensus**: Nodes must agree on the commit/abort decision despite failures. 2PC approximates this but fails the termination property when the coordinator crashes.
**Implementation**: Two-phase commit (approximate, not full consensus), or database-internal consensus (VoltDB, CockroachDB).
**Practical use**: Cross-shard transactions, multi-table atomic updates in partitioned databases.
### 3. Total Order Broadcast
**Problem**: All nodes must receive messages in exactly the same order, with no messages lost.
**Why it needs consensus**: Deciding the position of each message in the total order requires agreement among nodes. A single leader can serialize messages, but choosing that leader — and handling leader failures — requires consensus.
**Implementation**: ZooKeeper (Zab), etcd (Raft), Kafka (KRaft or ZooKeeper-based).
**Practical use**: Database replication log, serializable transactions via stored procedures, event sourcing with global ordering.
### 4. Distributed Locks and Leases
**Problem**: Multiple clients race to acquire a lock. Exactly one must succeed. The lock must be released on client failure (leases with expiry).
**Why it needs consensus**: The decision of "which client holds the lock" must be agreed upon by all nodes. Without consensus, a network partition can cause two clients to each believe they hold the lock (split-brain).
**Implementation**: ZooKeeper ephemeral nodes, etcd leases. Always use fencing tokens to handle slow lock holders.
**Practical use**: Single-active-instance patterns (job schedulers, cron, leader-only tasks).
**Fencing token requirement**: A distributed lock is not sufficient on its own. If the lock holder is paused (garbage collection, slow network) after acquiring the lock, it may act on the lock after a new holder has been elected. Fencing tokens — a monotonically increasing number attached to every request that uses the lock — allow the resource being protected to reject stale requests from the previous lock holder. ZooKeeper provides this via `zxid`.
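The check performed by the protected resource can be sketched as follows. `FencedStore` is a hypothetical storage service; the only assumption is that the lock service hands out a strictly increasing token with each lock grant.

```python
class FencedStore:
    """Hypothetical storage service that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token:
            return False  # a newer lock holder has already written
        self.highest_token = token
        self.data[key] = value
        return True
```

If client 1 acquires the lock with token 33 and pauses, and client 2 then acquires it with token 34 and writes, client 1's delayed write with token 33 is rejected by the store itself, without relying on client 1 noticing that its lease expired.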
### 5. Membership and Coordination Service
**Problem**: Nodes must agree on which other nodes are currently alive and part of the cluster. A failed node should be removed from membership; a recovered node should be re-admitted.
**Why it needs consensus**: Without agreement on membership, different nodes have divergent views of the cluster. Operations that depend on "the current member set" (partition assignment, quorum calculation) become inconsistent.
**Implementation**: ZooKeeper (failure detection via session timeouts + ephemeral nodes), etcd (similar model).
**Practical use**: Service discovery, partition assignment, quorum determination, failover coordination.
### 6. Uniqueness Constraints
**Problem**: Multiple concurrent operations attempt to create records with the same unique key (same username, same email, same seat). Exactly one must succeed; the rest must fail with a constraint violation.
**Why it needs consensus**: The decision of "which request won" must be agreed upon by all nodes before any of them responds to the client. Lamport timestamps can define a total order after the fact, but a node receiving a request cannot know in real time whether another node is concurrently processing a conflicting request with a lower timestamp.
**Implementation**: Single-leader database with a unique index, or a linearizable compare-and-set register (which itself requires consensus).
**Practical use**: Username registration, email uniqueness, seat booking, inventory reservation.
## Quick Checklist: Do I Need Consensus?
Answer yes to any of the following → consensus is required:
- [ ] Two concurrent operations could both believe they succeeded but only one should
- [ ] I need to enforce a uniqueness constraint as data is written (not just detect violations after the fact)
- [ ] I need a distributed lock that prevents any form of split-brain
- [ ] I need to elect a single leader and all nodes must agree who it is
- [ ] A transaction spans multiple nodes or databases and must be atomic
- [ ] Messages must be delivered to all nodes in the same order for correctness
- [ ] I need to know which nodes are live members of a cluster with agreement
## How to Use the Equivalence
If you already have ZooKeeper or etcd (which implement total order broadcast = consensus):
- You can build all 6 primitives on top of it — linearizable CAS, atomic commit coordination, distributed locks, membership, uniqueness constraints, and message ordering
- You do not need to implement a separate consensus mechanism for each problem
If you do not have a consensus service:
- Do not build your own consensus algorithm (correct implementations are notoriously hard to get right)
- Deploy ZooKeeper or etcd as infrastructure and use it for all consensus-equivalent operations
- Alternatively, use a database that internally implements consensus (CockroachDB, TiKV, Spanner, VoltDB)
FILE:references/consistency-model-spectrum.md
# Consistency Model Spectrum
Consistency models form a hierarchy from weakest to strongest. Stronger models are easier to reason about but impose higher latency and availability costs.
## The Spectrum (Weakest to Strongest)
```
Eventual Consistency
|
Causal Consistency ← strongest model with no network-delay slowdown
|
Sequential Consistency
|
Linearizability ← strongest common model; recency guarantee
|
Strict Serializability ← linearizability + serializability combined
```
## Definitions
### Eventual Consistency (Convergence)
**Guarantee**: If you stop writing, all replicas will eventually converge to the same value.
**What it does NOT guarantee**: When convergence happens. Until convergence, reads may return anything — including stale values, missing values, or different values from different replicas.
**Ordering**: None. Operations are not ordered relative to each other.
**Examples**: Cassandra (default), Riak, DynamoDB (eventual mode), CouchDB, DNS.
**Best for**: High-availability workloads where staleness is acceptable.
### Causal Consistency
**Guarantee**: Operations that are causally related (cause → effect) are seen in that order by all nodes. Operations with no causal relationship may be seen in any order (they are concurrent).
**Ordering**: Partial order. Causal dependencies define a directed acyclic graph; concurrent operations are incomparable.
**Key property**: Causal consistency is the *strongest* model that does not slow down due to network delays and remains available during network failures. CAP theorem does not apply to causal consistency.
**Subsumes**: Read-your-writes, monotonic reads, consistent prefix reads.
**Implementation**: Version vectors (per-key causal tracking), Lamport timestamps (total order consistent with causality), or single-leader replication with reads from the leader.
**Examples**: Snapshot isolation provides causal consistency. COPS, Eiger, Occult (research systems).
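A Lamport clock, one of the implementation options listed above, can be sketched in a few lines. This is the classic variant under assumed names (`LamportClock`, `tick`, `receive`). Timestamps are `(counter, node_id)` pairs, compared as tuples so that ties on the counter are broken by node ID.

```python
class LamportClock:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counter = 0

    def tick(self) -> tuple:
        """Local event or message send: increment and return the timestamp."""
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, remote_counter: int) -> tuple:
        """On receipt, move past the sender's counter so the effect orders after the cause."""
        self.counter = max(self.counter, remote_counter) + 1
        return (self.counter, self.node_id)
```

The resulting total order is consistent with causality, but two concurrent events still receive distinct, ordered timestamps, so the timestamps alone cannot tell whether two events were concurrent.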
### Sequential Consistency
**Guarantee**: All nodes see operations in the same total order, and that order is consistent with the program order on each individual node.
**What it does NOT guarantee**: That the total order reflects real time. A write may be "seen" after a long delay, but once seen, it appears to have happened at a consistent point relative to other operations.
**Note**: Less common in distributed databases; more common in CPU memory models.
### Linearizability (Atomic Consistency, Strong Consistency, Immediate Consistency, External Consistency)
**Guarantee**: The system behaves as if there is a single copy of the data, and all operations are atomic. Once a write completes, all subsequent reads — from any replica — see that write. This is a *recency guarantee*.
**Ordering**: Total order. There is one global timeline; no operation is concurrent with another in the observable history.
**Key property**: After any one read returns a new value, all following reads (on the same or any other client) must also return the new value.
**Cost**: Response time is proportional to the uncertainty of network delays — linearizability is inherently slow in high-latency networks.
**Examples**: ZooKeeper (writes), etcd (writes + linearizable reads), single-leader replication with reads from leader (potentially), Spanner, CockroachDB.
### Strict Serializability (Strong One-Copy Serializability)
**Guarantee**: Transactions behave as if they executed serially on a single copy of the data, AND that serial order is consistent with real time (linearizable).
**Implementation**: Two-phase locking (2PL) + single-leader replication, or actual serial execution.
**Examples**: FaunaDB, older VoltDB configurations.
**Note**: Serializable snapshot isolation (SSI) is serializable but NOT linearizable — reads come from a snapshot that may not include the most recent writes.
## Ordering Properties Compared
| Model | Total Order? | Recency Guarantee? | Available During Partition? | Network-delay-free? |
|---|---|---|---|---|
| Eventual | No | No | Yes | Yes |
| Causal | Partial | No | Yes | Yes |
| Sequential | Total (program-order-consistent) | No | No | No |
| Linearizable | Total (real-time-consistent) | Yes | No | No |
| Strict Serializable | Total (real-time-consistent, multi-object) | Yes | No | No |
## Replication Method Compatibility
| Replication Method | Linearizable? | Notes |
|---|---|---|
| Single-leader, reads from leader | Potentially | Not every single-leader DB is linearizable; snapshot isolation breaks it |
| Single-leader, reads from follower | No | Followers may lag |
| Consensus algorithm (Raft/Zab) | Yes | This is how ZooKeeper/etcd achieve it |
| Multi-leader | No | Concurrent writes to different leaders; conflicts must be resolved |
| Leaderless (Dynamo-style) | Probably not | Quorums with variable network delays are not linearizable; LWW with clock skew is definitely not |
| Leaderless with synchronous read repair | Possible for reads/writes, not CAS | Cassandra does not do this by default; CAS requires consensus |
FILE:references/linearizability-vs-serializability.md
# Linearizability vs. Serializability
These two terms are among the most commonly confused in distributed systems. They sound similar and both involve ordering, but they are orthogonal guarantees about different things.
## The Core Distinction
| Property | Serializability | Linearizability |
|---|---|---|
| What it applies to | *Transactions* (multi-object operations) | *Individual objects* (single registers/keys) |
| What it guarantees | Transactions behave as if they ran in *some* serial order | Reads always return the most recently written value |
| Recency guarantee? | No — the serial order can differ from real time | Yes — once a write completes, all reads see it |
| Concurrent operations? | Concurrent transactions allowed; conflicts prevented or serialized | No concurrent operations: one global timeline |
| Prevents write skew? | Yes (serializable transactions prevent write skew) | No — write skew requires multi-object transactions |
| Prevents stale reads? | No — serializable reads may come from a snapshot | Yes — no stale reads from any replica |
## Serializability
Serializability is a property of a *transaction execution schedule*. It guarantees that the result of executing a set of concurrent transactions is the same as if they had been executed one at a time, in *some* serial order — but that serial order need not match the wall-clock order in which the transactions actually ran.
**What serializability does NOT prevent**:
- Stale reads from a replica that hasn't received the latest write
- A read seeing data that is "in the past" relative to real time
**What serializability DOES prevent**:
- Write skew (two transactions reading overlapping data and each making a write that is only safe based on what the other hasn't written yet)
- Lost updates, dirty reads, and the other anomalies that weaker isolation levels address only partially
**Implementations**:
- Two-phase locking (2PL): serializable, and typically also linearizable for individual objects
- Actual serial execution: serializable, linearizable for individual objects
- Serializable snapshot isolation (SSI): serializable, but **NOT linearizable**
**SSI and non-linearizability**: SSI deliberately reads from a consistent snapshot taken at transaction start. Writes committed after the snapshot is taken are not visible to the transaction, so a read within an SSI transaction may miss the latest committed write, which is exactly the stale read that linearizability forbids.
## Linearizability
Linearizability is a property of a *single-object execution history*. It guarantees that the system behaves as if there is only one copy of the data, and every operation takes effect atomically at a single point in time between its start and completion.
**What linearizability does NOT prevent**:
- Write skew across multiple objects (because it applies per-object, not per-transaction)
- Dirty reads in multi-object transactions (no transaction concept)
**What linearizability DOES prevent**:
- Stale reads: once a write completes, no subsequent read from any client on any replica may return the old value
- Time-travel reads: reads cannot go "backward" in the value sequence
**Implementations**:
- Single-leader with reads from leader: potentially linearizable (depends on implementation)
- Consensus algorithms (Raft, Zab): linearizable writes; linearizable reads require quorum reads or `sync()`
- Multi-leader, leaderless: NOT linearizable
## Combining Both: Strict Serializability
When a system provides both serializability and linearizability, it is called *strict serializability* or *strong one-copy serializability (strong-1SR)*. This is the strongest practical guarantee:
- Multi-object transactions behave as if executed serially (serializability)
- That serial order is consistent with real time (linearizability)
**Implementations**: 2PL + single-leader replication, actual serial execution on a single leader.
## Decision Guide
Use this to select which guarantee you need:
**Do you need multi-object atomicity (atomic read-modify-write across multiple keys or tables)?**
- Yes → You need serializability (or at minimum, a weaker isolation level that prevents the specific anomaly you care about — see `transaction-isolation-selector`)
- No → Serializability is not needed
**Do you need recency: once a write completes, every subsequent read must see it, from any replica?**
- Yes → You need linearizability
- No → A weaker model (causal or eventual) may suffice
**Combining both needs → strict serializability**. Be aware this has the highest latency and availability cost.
## Practical Examples
| System / Config | Serializable? | Linearizable? |
|---|---|---|
| PostgreSQL with SERIALIZABLE isolation | Yes (SSI) | No (snapshot reads) |
| PostgreSQL with SERIALIZABLE + synchronous replication, reads from leader | Yes | Yes (for single objects) |
| MySQL with 2PL (SERIALIZABLE isolation) | Yes (2PL) | Yes (single objects) |
| Cassandra (default) | No | No |
| ZooKeeper writes | N/A (single-object) | Yes |
| ZooKeeper reads (without `sync()`) | N/A | No (may be stale) |
| etcd writes | N/A | Yes |
| etcd linearizable reads (default) | N/A | Yes |
| CockroachDB | Yes (SSI variant) | Yes (single-key) |
| Spanner | Yes | Yes |
FILE:references/total-order-broadcast-primitives.md
# Total Order Broadcast as a Practical Consensus Primitive
## What Total Order Broadcast Provides
Total order broadcast (also called atomic broadcast) is a messaging protocol with two safety properties that must hold even during faults:
1. **Reliable delivery**: No messages are lost — if a message is delivered to one node, it is delivered to all nodes
2. **Totally ordered delivery**: Messages are delivered to every node in exactly the same order
These two properties, together, are equivalent to repeated rounds of consensus — each delivery decision is one consensus round. This is why fault-tolerant consensus algorithms (Raft, Zab, Multi-Paxos) implement total order broadcast directly.
## Relationship to Consensus
Total order broadcast and consensus are equivalent in expressive power:
- **Total order broadcast → consensus**: To reach consensus on a value, broadcast it; the first message in the total order is the consensus decision
- **Consensus → total order broadcast**: Use repeated consensus rounds to decide the next message in the sequence
This equivalence means: if you have ZooKeeper or etcd (which implement total order broadcast), you already have consensus, and all 6 consensus-equivalent problems can be solved using it.
## Relationship to Linearizability
Total order broadcast and linearizability are related but not identical:
- Total order broadcast is **asynchronous**: messages are delivered in a fixed order, but there is no guarantee about *when* a message will be delivered (one node may lag behind)
- Linearizability is a **recency guarantee**: a read is guaranteed to return the most recently written value
**From total order broadcast → linearizable storage**:
You can build linearizable storage on top of total order broadcast. Example: a linearizable compare-and-set that enforces username uniqueness:
1. Append a message to the log: "I want to claim username X"
2. Wait for the message to be delivered back to you (it's now in the global order)
3. Scan all preceding log entries for any claim on username X
4. If your entry is first: claim succeeds. If another claim appears before yours: abort.
Because all nodes receive log entries in the same order, they all agree on who claimed the username first. Writes performed this way are linearizable.
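The four-step username procedure can be sketched against an in-memory list that stands in for the total order broadcast log. `Log` and `claim_username` are hypothetical names. In a real deployment, appending and waiting for delivery are separate asynchronous steps.

```python
class Log:
    """Single-process stand-in for a total order broadcast log."""

    def __init__(self):
        self.entries = []

    def append(self, entry) -> int:
        self.entries.append(entry)
        return len(self.entries) - 1  # position of the entry in the total order


def claim_username(log: Log, node: str, username: str) -> bool:
    position = log.append((node, username))      # step 1: append the claim
    delivered = log.entries[: position + 1]      # step 2: entries up to our own
    first = next(                                # step 3: find the earliest claim
        i for i, (_, name) in enumerate(delivered) if name == username
    )
    return first == position                     # step 4: win only if ours is first
```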
**Making reads linearizable** (choose one):
- Sequence reads through the log: append a read message, perform the read when the message is delivered back (used by etcd quorum reads)
- Fetch the latest log position in a linearizable way, wait for all entries up to that position, then read (ZooKeeper `sync()`)
- Read from a replica that is synchronously updated on every write (chain replication)
**From linearizable storage → total order broadcast**:
If you have a linearizable register with an atomic increment-and-get operation, you can assign monotonically increasing sequence numbers to messages and implement total order broadcast. This is why linearizable sequence number generators inevitably lead to consensus algorithms when you think about fault tolerance.
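The reverse direction can be illustrated as follows. `Sequencer` stands in for a linearizable increment-and-get register (single-process and hypothetical), and `Receiver` shows why gap-free sequence numbers are enough for ordered delivery: a receiver that holds message 2 but not message 1 knows it must wait.

```python
import itertools


class Sequencer:
    """Stand-in for a linearizable increment-and-get register."""

    def __init__(self):
        self._counter = itertools.count(0)

    def next(self) -> int:
        return next(self._counter)


class Receiver:
    """Delivers messages strictly in sequence order, buffering any that arrive early."""

    def __init__(self):
        self.expected = 0
        self.pending = {}
        self.delivered = []

    def on_message(self, seq: int, payload: str) -> None:
        self.pending[seq] = payload
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1


sequencer = Sequencer()
messages = [(sequencer.next(), payload) for payload in ["a", "b", "c"]]

r1, r2 = Receiver(), Receiver()
for seq, payload in [messages[2], messages[0], messages[1]]:
    r1.on_message(seq, payload)
for seq, payload in [messages[1], messages[2], messages[0]]:
    r2.on_message(seq, payload)
```

Both receivers end up delivering the messages in the same order even though the network reordered them differently for each.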
## Total Order Broadcast as a Replication Log
Total order broadcast is exactly what is needed for database replication:
- Each message represents a write to the database
- Every replica processes the same writes in the same order
- Replicas remain consistent with each other (aside from temporary replication lag)
This principle is called *state machine replication* — if you start from the same initial state and apply the same sequence of deterministic operations, you arrive at the same final state.
**Kafka** uses this model: producers append messages to a partition log; consumers process messages in partition order. Kafka's ordering guarantee is per-partition total order (not global total order across partitions).
**ZooKeeper** uses Zab to implement total order broadcast for its own replication, making ZooKeeper itself a strongly consistent, fault-tolerant data store.
## Building Linearizable Operations from Total Order Broadcast
### Linearizable Compare-and-Set (CAS)
1. Append a CAS proposal to the log: `CAS(key, expected_value, new_value)`
2. Wait for delivery back to yourself
3. Apply all preceding log entries
4. Check if the key's current value matches `expected_value`
5. If yes: apply the CAS, either immediately or by appending a confirming entry to the log
6. If no: reject the CAS
All nodes apply the same log in the same order, so all agree on the CAS outcome.
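Because the outcome of each CAS is a deterministic function of the log prefix before it, any replica can compute it independently. A minimal sketch, with entries represented as hypothetical `(key, expected_value, new_value)` tuples:

```python
def apply_log(log: list) -> tuple:
    """Replay CAS entries in log order; return the final state and each entry's outcome."""
    state, outcomes = {}, []
    for key, expected, new in log:
        succeeded = state.get(key) == expected
        if succeeded:
            state[key] = new
        outcomes.append(succeeded)
    return state, outcomes


shared_log = [
    ("lock", None, "A"),   # A acquires the free lock
    ("lock", None, "B"),   # B's attempt loses: the lock is no longer free
    ("lock", "A", None),   # A releases the lock
]
replica_1 = apply_log(shared_log)
replica_2 = apply_log(shared_log)
```

The two replicas never communicate in this sketch, yet they reach identical states and identical verdicts for every entry.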
### Atomic Transaction Coordination
Total order broadcast can coordinate the commit/abort decision that 2PC's coordinator holds:
- Broadcast the "prepare" outcome (yes/no votes from all participants) via total order broadcast
- All nodes see the same votes in the same order; all can independently determine the commit/abort outcome
- Eliminates the coordinator as a single point of failure
### Serializable Multi-Object Transactions
If every message represents a deterministic stored procedure, and every node processes those procedures in the same order, the result is serializable multi-object transactions — this is the model used by VoltDB and H-Store (actual serial execution).
## Fault-Tolerant Consensus Algorithm Summary
| Algorithm | Key Systems | Notes |
|---|---|---|
| Raft | etcd, CockroachDB, TiKV, Consul | Well-documented; leader-based; strong safety |
| Zab | ZooKeeper | Total order broadcast natively; basis of Hadoop/Kafka/HBase coordination |
| Multi-Paxos | Google Chubby, Spanner, some Cassandra configs | Highly proven; complex to implement correctly; "nobody really understands Paxos" |
| Viewstamped Replication (VSR) | Research, some production | Theoretically important; similar structure to Raft |
**Epoch numbers and quorums**: All these algorithms use some form of epoch (term, ballot, view) numbering. Each epoch has a unique leader. If a conflict arises between leaders from different epochs, the higher epoch wins. Leaders must collect a quorum of votes before making a decision, and the quorum for leader election must overlap with the quorum for proposals — ensuring that any new leader knows about all committed decisions from previous epochs.
**Key difference from 2PC**: Fault-tolerant consensus requires votes from a *majority* of nodes. 2PC requires a `yes` from *every* participant. This means:
- A single slow or failed participant can block 2PC indefinitely
- A minority of failures does not block Raft/Zab/Paxos
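The arithmetic behind both points is small enough to check directly. The sketch below uses hypothetical helper names and brute-forces the overlap property for a small cluster; it models only the counting argument, not any real protocol.

```python
from itertools import combinations

def majority(n):
    """Smallest group size that is more than half of n nodes."""
    return n // 2 + 1

def quorums_always_overlap(n):
    """Check that every pair of majority quorums shares at least one node."""
    size = majority(n)
    return all(
        set(a) & set(b)
        for a in combinations(range(n), size)
        for b in combinations(range(n), size)
    )

def consensus_can_decide(n, failed):
    # A majority of nodes must still be reachable.
    return n - failed >= majority(n)

def two_phase_commit_can_decide(n, failed):
    # Every participant must answer, so any failure blocks the decision.
    return failed == 0
```

With five nodes, consensus keeps making progress with two nodes down, while 2PC stalls as soon as one participant is unreachable.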
## When to Use ZooKeeper / etcd vs. Database-Internal Consensus
**Use ZooKeeper or etcd** for:
- Distributed locks and leases (with fencing tokens)
- Leader election for application components
- Service discovery and membership tracking
- Configuration that must be consistent across all nodes
- Partition assignment coordination (Kafka, HBase, YARN)
**Use database-internal consensus** (CockroachDB, Spanner, TiKV) for:
- Linearizable data storage at scale
- Cross-shard transactions with linearizability
- When you want the database to handle consensus transparently
**Do not use** ZooKeeper/etcd for:
- High-throughput application data storage — it is designed for small, slow-changing coordination data
- Per-request state — the overhead of consensus is not worth it for data that can be eventually consistent
Scan application code, SQL queries, or ORM code for exposure to the 6 database concurrency anomalies and produce a findings report with severity, affected lo...
---
name: concurrency-anomaly-detector
description: |
Scan application code, SQL queries, or ORM code for exposure to the 6 database concurrency anomalies and produce a findings report with severity, affected locations, and fix recommendations. Use when: debugging a nondeterministic data corruption or race condition bug under concurrent load; auditing transaction code before deployment or after switching databases (isolation defaults differ across engines); a read-modify-write cycle or check-then-act pattern may be exposed to lost updates or write skew; an aggregate query (COUNT, SUM) guards an INSERT or UPDATE (phantom read exposure); or multiple tables are updated in one transaction without serializable isolation. Distinct from transaction-isolation-selector (which chooses the isolation level) — this skill scans code to find which anomalies existing code is already exposed to. Covers Python, Java, Go, JavaScript, Ruby; raw SQL; ORM code (SQLAlchemy, Hibernate, ActiveRecord, GORM); PostgreSQL, MySQL InnoDB, Oracle, SQL Server, and distributed databases. Maps code patterns (read-modify-write, SELECT/INSERT pairs, cross-table boundaries, snapshot boundary reads) to anomaly type, trigger conditions, and minimum fix (isolation upgrade vs. application-level mitigation).
version: 1.0.0
homepage: https://github.com/bookforge-ai/bookforge-skills/tree/main/books/designing-data-intensive-applications/skills/concurrency-anomaly-detector
metadata: {"openclaw":{"emoji":"📚","homepage":"https://github.com/bookforge-ai/bookforge-skills"}}
status: draft
depends-on:
- transaction-isolation-selector
source-books:
- id: designing-data-intensive-applications
title: "Designing Data-Intensive Applications"
authors: ["Martin Kleppmann"]
chapters: [7]
tags: [transactions, concurrency, race-conditions, dirty-read, dirty-write, read-skew, lost-update, write-skew, phantom-read, isolation-levels, snapshot-isolation, serializable, mvcc, postgresql, mysql, oracle, sql-server, code-review, audit]
execution:
tier: 2
mode: full
inputs:
- type: codebase
description: "Application source code containing transaction logic, SQL queries, or ORM calls — the primary input"
- type: document
description: "Transaction description or architecture summary if no codebase is available"
tools-required: [Read, Grep, Write]
tools-optional: [Bash]
mcps-required: []
environment: "Run inside a project directory. Grepping for SQL keywords and transaction patterns is the primary analysis method."
discovery:
goal: "Identify all code locations exposed to at least one of the 6 concurrency anomalies; classify each finding by anomaly type and severity; produce actionable fix recommendations"
tasks:
- "Determine the database in use and its default isolation level"
- "Grep for transaction boundaries and SQL patterns that indicate anomaly exposure"
- "Classify each finding into one of the 6 anomaly types"
- "Assign severity based on the anomaly type and its business impact"
- "Produce fix recommendations — isolation upgrade or application-level mitigation — per finding"
audience:
roles: ["backend-engineer", "software-architect", "data-engineer", "tech-lead", "site-reliability-engineer"]
experience: "intermediate-to-advanced — assumes familiarity with relational databases and SQL transactions"
triggers:
- "A data corruption or inconsistency bug is suspected but hard to reproduce"
- "Code review for a new service or feature that uses database transactions"
- "Migration to a database with a different default isolation level"
- "Audit of existing codebase for concurrency safety"
- "Post-incident analysis of a race condition"
not_for:
- "Choosing an isolation level from scratch without existing code — use transaction-isolation-selector instead"
- "Distributed transaction coordination across multiple databases — use distributed-failure-analyzer"
- "Replication-level consistency issues — use replication-strategy-selector"
---
# Concurrency Anomaly Detector
## When to Use
You have existing application code that interacts with a database and you need to know whether it is safe under concurrent execution.
This skill applies when:
- A bug only manifests under concurrent load and is hard to reproduce in tests
- Code is being reviewed before deployment to a high-concurrency environment
- The application recently migrated to a database with a different default isolation level
- The codebase accesses multiple tables in a single transaction
- Any code follows the pattern: read a value, make a decision, write a result
**The core insight from Kleppmann:** Concurrency bugs caused by weak transaction isolation are not just theoretical. They cause real financial losses and data corruption. They are triggered only by unfortunate timing, making them nearly impossible to catch by testing. The only reliable approach is to analyze the code structure — not the test results — and identify which patterns are structurally vulnerable.
**Companion skill:** `transaction-isolation-selector` — once anomalies are identified, that skill selects the minimum safe isolation level. This skill identifies what anomalies exist in current code; the companion skill recommends what to do about them.
---
## Context and Input Gathering
### Required Context (must have — ask if missing)
- **Database in use and version.** Why: the same isolation level name has different behaviors across databases. MySQL's "repeatable read" does not automatically detect lost updates; PostgreSQL's does. Oracle's "serializable" is actually snapshot isolation. Without knowing the database, severity assessments cannot be calibrated to actual risk.
- Check: `docker-compose.yml`, `requirements.txt` / `pom.xml` / `go.mod` / `package.json` for database drivers, schema file syntax
- If missing, ask: "What database are you using, and what version?"
- **Current isolation level.** Why: the same code pattern has different severity depending on the isolation level in effect. A lost update pattern is high severity at read committed but handled automatically at PostgreSQL's repeatable read. If the isolation level is unknown, assume the database default (usually read committed).
- Check: ORM configuration, database session setup, application config files, environment variables
- If missing, assume the database's default and note the assumption explicitly
- **Application code or transaction descriptions.** Why: this is the primary input. The scan requires reading transaction logic to identify patterns.
- Gather: entry points for significant transactions, files containing SQL or ORM calls, service layer code
### Observable Context (gather from environment)
Before asking anything, scan the environment:
```
Grep targets:
- Transaction boundaries: BEGIN, START TRANSACTION, @Transactional, with_transaction,
session.begin(), db.transaction(), conn.cursor()
- Read-modify-write: SELECT ... followed by UPDATE in same function scope
- Check-then-act: SELECT COUNT(*), SELECT SUM(), SELECT EXISTS() followed by INSERT/UPDATE
- Explicit locking: FOR UPDATE, FOR SHARE, LOCK TABLE, SELECT ... WITH (UPDLOCK)
- Isolation settings: ISOLATION LEVEL, transaction_isolation, SET SESSION
- ORM patterns: find_by, where().first(), session.query(), Model.where()
followed by .save(), .update(), .create() in the same scope
```
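A first pass over these targets could look like the sketch below. The pattern table covers only a subset of the list above, the labels are invented for illustration, and line-by-line matching will miss patterns that span several lines, so treat the hits as candidates to review rather than findings.

```python
import re

# Hypothetical labels mapped to a subset of the grep targets listed above.
PATTERNS = {
    "transaction_boundary": re.compile(
        r"\b(BEGIN|START TRANSACTION)\b|@Transactional|session\.begin\(|db\.transaction\(",
        re.IGNORECASE,
    ),
    "explicit_lock": re.compile(
        r"\bFOR (UPDATE|SHARE)\b|\bLOCK TABLE\b|UPDLOCK", re.IGNORECASE
    ),
    "isolation_setting": re.compile(
        r"ISOLATION LEVEL|transaction_isolation", re.IGNORECASE
    ),
    "aggregate_guard": re.compile(
        r"SELECT\s+(COUNT|SUM|EXISTS)\s*\(", re.IGNORECASE
    ),
}

def scan_source(text):
    """Return (line number, label) pairs for every pattern hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits
```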
### Default Assumptions
When context cannot be observed and asking would be excessive:
- Isolation level unknown → assume read committed (PostgreSQL/Oracle/SQL Server default)
- Transaction boundaries unknown → look for natural HTTP request/response boundaries
- Concurrency level unknown → assume multiple concurrent users access the same data (conservative)
---
## Process
### Step 1: Identify the Database and Its Actual Isolation Guarantees
**ACTION:** Determine the database, its default isolation level, and any overrides in the codebase.
**WHY:** Every subsequent severity assessment depends on what the database actually prevents at its current isolation level. "We use PostgreSQL so we're fine" is a common and dangerous assumption. PostgreSQL's default is read committed — it does not prevent read skew, lost updates, write skew, or phantom reads. Establishing the ground truth of what the database currently prevents is the prerequisite for everything else.
Record:
```
Database: [PostgreSQL | MySQL InnoDB | Oracle | SQL Server | other]
Default isolation level: [read committed | repeatable read | serializable]
Configured isolation: [from code scan or config — override if found]
Effective isolation: [the level actually in use]
```
**Isolation level defaults (critical to get right):**
| Database | Default | What it prevents | What it allows |
|----------|---------|-----------------|----------------|
| PostgreSQL | Read committed | Dirty reads, dirty writes | Read skew, lost updates, write skew, phantoms |
| MySQL InnoDB | Repeatable read | Dirty reads, dirty writes, read skew | Lost updates (silently!), write skew, phantoms |
| Oracle 11g | Read committed | Dirty reads, dirty writes | Everything else. Oracle "SERIALIZABLE" = snapshot isolation — write skew still possible |
| SQL Server | Read committed | Dirty reads, dirty writes | Read skew, lost updates, write skew, phantoms |
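One way to keep this table usable during a scan is to encode it as data. The sketch below mirrors the four rows above; the dictionary keys and function names are hypothetical, and when an override is configured it deliberately returns `None` instead of guessing what the new level allows.

```python
DEFAULT_ISOLATION = {
    "postgresql": "read committed",
    "mysql": "repeatable read",
    "oracle": "read committed",
    "sqlserver": "read committed",
}

# Anomalies still possible at each engine's default level, per the table above.
ALLOWED_AT_DEFAULT = {
    "postgresql": {"read skew", "lost update", "write skew", "phantom"},
    "mysql": {"lost update", "write skew", "phantom"},
    "oracle": {"read skew", "lost update", "write skew", "phantom"},
    "sqlserver": {"read skew", "lost update", "write skew", "phantom"},
}

def effective_isolation(database, configured=None):
    """An explicit override wins; otherwise fall back to the engine default."""
    return configured or DEFAULT_ISOLATION[database]

def exposures(database, configured=None):
    """Anomalies possible at the default level, or None if an override applies."""
    if configured and configured != DEFAULT_ISOLATION[database]:
        return None  # the table above no longer applies; assess the override separately
    return ALLOWED_AT_DEFAULT[database]
```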
---
### Step 2: Grep for Anomaly-Indicating Code Patterns
**ACTION:** Search the codebase systematically for the structural patterns that expose each of the 6 anomaly types.
**WHY:** Concurrency anomalies cannot be found by running the code — the bugs only manifest when multiple transactions interleave at precisely the wrong time. What can be found reliably is the code structure that makes the anomaly possible. Each anomaly type has a distinct structural fingerprint. Grepping for these patterns is more reliable than reading every file manually and is the same approach a static analysis tool would take.
**Pattern catalog — what to grep for and what it indicates:**
**Dirty reads and dirty writes — Pattern: missing transaction boundary or read uncommitted**
```
Signal: Explicit SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED, or autocommit enabled
across multi-step write operations with no BEGIN/COMMIT wrapper.
Grep: READ UNCOMMITTED, autocommit=1 (MySQL) with multi-step writes
Note: Read committed and above prevent dirty reads automatically in all practical
databases. Flag only if read uncommitted is explicitly configured.
```
**Read skew — Pattern: multi-read transaction without snapshot isolation**
```
Signal: Multiple SELECTs in the same transaction over related tables where combined
results must be consistent (balance totals, integrity checks, backup scans).
Grep: Multiple SELECT statements in the same transaction scope accessing related
tables (e.g., accounts + transfers, orders + order_items)
Risk: At read committed, each SELECT reads a fresh snapshot. Concurrent commits
between reads produce internally inconsistent results.
```
**Lost updates — Pattern: read-modify-write cycle**
```
Signal: A transaction reads a value, computes a new value in application code,
then writes the new value back.
Grep: SELECT followed by UPDATE in the same transaction scope where the UPDATE
value depends on the SELECT result (look for the variable being used in both)
ORM: find() / find_by() followed by .update() or .save()
Python: result = session.query(...).one() ... result.field += delta ... session.commit()
Java: Entity e = em.find(...) ... e.setCount(e.getCount() + 1) ... em.merge(e)
Risk: Two concurrent read-modify-write cycles both read the old value, both compute
an update, and the second write overwrites the first without incorporating it.
At read committed: always vulnerable.
At PostgreSQL/Oracle snapshot isolation: automatically detected and aborted.
At MySQL InnoDB repeatable read: NOT detected — second write silently overwrites first.
Severity: HIGH at read committed or MySQL repeatable read. MEDIUM at PostgreSQL/Oracle
snapshot isolation (aborts, requires retry). NOT PRESENT at serializable.
```
**Write skew — Pattern: check-then-act (the most commonly missed anomaly)**
```
Signal: A transaction reads an aggregate or existence condition, makes a decision
based on the result, then writes to the database. The write changes the
state that the condition was checking.
Grep: SELECT COUNT(*) / SELECT SUM() / SELECT EXISTS() / SELECT MAX() followed by
INSERT, UPDATE, or DELETE in the same transaction scope.
Also: any SELECT that is used as a guard condition ("if the query returns X,
proceed with the write")
ORM: Model.where(...).count > 0 followed by Model.create() or model.update()
Model.where(...).exists? followed by record.save()
Risk: Two concurrent transactions both pass the guard check because each reads
from its own snapshot. Both write. The combined state violates the invariant
the guard was enforcing.
Critical: Write skew is NOT prevented by snapshot isolation. Oracle's "serializable"
is snapshot isolation. If the database is at any level below true serializable,
ALL check-then-act patterns are vulnerable.
Severity: CRITICAL at read committed, snapshot isolation, or Oracle "serializable".
NOT PRESENT at true serializable (PostgreSQL SERIALIZABLE, MySQL SERIALIZABLE).
```
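The interleaving is easy to reproduce without a database. In the sketch below, plain dictionaries stand in for the shared table and for each transaction's private snapshot; the `release` function and the staff names are illustrative only.

```python
def release(snapshot, db, staff):
    """Check the guard against a private snapshot, then write to shared state."""
    on_call = sum(1 for assigned in snapshot.values() if assigned)
    if on_call < 2:
        return False      # guard fails: releasing would leave nobody on call
    db[staff] = False     # each transaction writes to a different row
    return True

# Concurrent case: both snapshots are taken before either write lands.
db = {"alice": True, "bob": True}
snap_a, snap_b = dict(db), dict(db)
ok_a = release(snap_a, db, "alice")
ok_b = release(snap_b, db, "bob")
remaining_concurrent = sum(db.values())

# Serial case: the second transaction's snapshot reflects the first write.
db2 = {"alice": True, "bob": True}
ok_first = release(dict(db2), db2, "alice")
ok_second = release(dict(db2), db2, "bob")
remaining_serial = sum(db2.values())
```

Neither write touches a row the other transaction wrote, which is why lost-update detection under snapshot isolation does not catch this.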
**Phantom reads — Pattern: check-for-absence then insert**
```
Signal: SELECT COUNT(*) = 0 or empty-result check, followed by INSERT into the
same table. Booking conflict checks, username existence checks.
Grep: COUNT.*= 0 / .count() == 0 followed by INSERT or .create() in same scope
Risk: Both transactions see zero matching rows. Both insert. Constraint violated.
FOR UPDATE does not help — no rows exist to lock when SELECT returns empty.
Severity: CRITICAL at any level below serializable (including snapshot isolation).
Note: Phantom variant of write skew. Fix: serializable isolation or materializing
conflicts. A UNIQUE constraint is sufficient only for exact-match uniqueness
(for example usernames), not for range overlaps such as bookings.
```
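For the exact-match uniqueness case, the check-for-absence can be dropped entirely and the constraint left to arbitrate. The sketch below uses an in-memory SQLite database purely as a stand-in, with a hypothetical `users` table and `claim_username` helper; it does not cover range-overlap checks such as booking conflicts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT UNIQUE)")

def claim_username(conn, name):
    """Insert without a prior SELECT; the UNIQUE constraint rejects duplicates."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO users (username) VALUES (?)", (name,))
        return True
    except sqlite3.IntegrityError:
        return False

first = claim_username(conn, "alice")
second = claim_username(conn, "alice")
row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```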
---
### Step 3: Classify Each Finding
**ACTION:** For each code location identified in Step 2, produce a structured finding entry.
**WHY:** An unclassified finding such as "there might be a race condition somewhere" does not lead to action. A structured finding with anomaly type, trigger conditions, severity, and a concrete fix recommendation does. The classification also determines severity precisely: the same read-modify-write pattern is high severity at read committed but not present at serializable.
**Finding structure:**
```
Finding #N
File: [path/to/file.py, line N]
Anomaly type: [dirty read | dirty write | read skew | lost update | write skew | phantom]
Code pattern: [brief description of what the code does]
Trigger: [concurrency condition required to produce the anomaly]
Severity: [CRITICAL | HIGH | MEDIUM | LOW] (see severity table below)
Affected data: [which tables/entities are involved]
Fix: [isolation upgrade | SELECT FOR UPDATE | atomic operation | unique constraint | materializing conflicts]
Fix detail: [specific change to make]
```
**Severity classification:**
| Anomaly | Severity | Rationale |
|---------|----------|-----------|
| Dirty read | HIGH | Reads data that may never have existed (rolled-back write). Direct data integrity violation. |
| Dirty write | HIGH | Mixes writes from concurrent transactions into a single object. Produces corrupted state. |
| Read skew | MEDIUM | Transaction sees internally inconsistent state. Dangerous for analytics, backups, multi-step reads. |
| Lost update | HIGH | Silent data loss — one write disappears without error. Counter increments, balance updates. |
| Write skew | CRITICAL | Invariant violation that no isolation level below true serializable prevents. Often breaks business-logic invariants (zero doctors on call, double-booked rooms, negative balances). |
| Phantom (write skew variant) | CRITICAL | Same as write skew but additionally cannot be mitigated by SELECT FOR UPDATE — only serializable isolation or materializing conflicts works. |
**Downgrade severity if:**
- The anomaly is prevented by the effective isolation level (e.g., read skew at PostgreSQL repeatable read → NOT PRESENT)
- The code path is read-only and cannot cause the write side of the pattern
- The affected data has a compensating unique constraint in the schema
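A rough encoding of the severity table and the first two downgrade rules is sketched below. The isolation-level names and the `classify_severity` function are hypothetical, and two simplifications are assumed: a read-only path is treated as not exposed to write-side anomalies, and lost updates are left out of the snapshot-isolation set because detection depends on the engine. The unique-constraint rule is not modeled.

```python
BASE_SEVERITY = {
    "dirty read": "HIGH",
    "dirty write": "HIGH",
    "read skew": "MEDIUM",
    "lost update": "HIGH",
    "write skew": "CRITICAL",
    "phantom": "CRITICAL",
}

WRITE_ANOMALIES = {"dirty write", "lost update", "write skew", "phantom"}

# Anomalies each level prevents. Lost update is omitted from snapshot
# isolation on purpose: whether it is detected is engine-specific.
PREVENTED_AT = {
    "read committed": {"dirty read", "dirty write"},
    "snapshot isolation": {"dirty read", "dirty write", "read skew"},
    "serializable": set(BASE_SEVERITY),
}

def classify_severity(anomaly, isolation, read_only=False):
    if anomaly in PREVENTED_AT[isolation]:
        return "NOT PRESENT"
    if read_only and anomaly in WRITE_ANOMALIES:
        return "NOT PRESENT"  # assumption: no write side, so no exposure
    return BASE_SEVERITY[anomaly]
```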
---
### Step 4: Produce the Findings Report
**ACTION:** Write a structured anomaly findings report with all classified findings, a summary table, and prioritized fix recommendations.
**WHY:** The findings report is the deliverable. It must be readable by an engineer without re-reading this skill, actionable as a review ticket or backlog item, and precise enough to be used as evidence in an incident post-mortem. The summary table gives a quick severity overview; the detailed findings give the information needed to implement a fix.
**Output format:**
```markdown
# Concurrency Anomaly Scan — [Project Name]
## Scan Context
Database: [database + version]
Effective isolation: [actual isolation level in use]
Files scanned: [count or list]
Findings: [N total — X CRITICAL, Y HIGH, Z MEDIUM, W LOW]
---
## Summary Table
| # | File | Anomaly Type | Severity | Fix Type |
|---|------|-------------|----------|----------|
| 1 | payments/transfer.py:47 | Write skew | CRITICAL | Upgrade to serializable |
| 2 | scheduling/shift.py:112 | Write skew | CRITICAL | SELECT FOR UPDATE or serializable |
| 3 | accounts/balance.py:33 | Lost update | HIGH | Atomic operation or serializable |
...
---
## Findings
### Finding #1 — [Anomaly Type] — [Severity]
**File:** [path:line]
**Pattern:** [what the code does]
**Trigger:** [concurrency scenario that produces the anomaly]
**Affected data:** [tables/entities]
**Code excerpt:**
[relevant code snippet]
**Fix:** [fix type]
[specific change description]
---
[repeat for each finding]
---
## Recommendations
### Immediate (CRITICAL findings)
[List of changes required before the next production deployment]
### Short-term (HIGH findings)
[List of changes to address in the current sprint]
### For review (MEDIUM findings)
[List of changes to assess — may depend on workload characteristics]
### Related skills
- `transaction-isolation-selector` — select the minimum safe isolation level for this codebase
- `replication-failure-analyzer` — if findings include distributed transaction concerns
```
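If findings are collected as dictionaries, the summary table can be generated rather than typed by hand. This is a small sketch with hypothetical field names (`file`, `anomaly`, `severity`, `fix`); it sorts by severity so that CRITICAL rows come first.

```python
SEVERITY_ORDER = ["CRITICAL", "HIGH", "MEDIUM", "LOW"]

def summary_table(findings):
    """Render findings as a Markdown table, most severe first."""
    ranked = sorted(findings, key=lambda f: SEVERITY_ORDER.index(f["severity"]))
    lines = [
        "| # | File | Anomaly Type | Severity | Fix Type |",
        "|---|------|-------------|----------|----------|",
    ]
    for number, f in enumerate(ranked, start=1):
        lines.append(
            f"| {number} | {f['file']} | {f['anomaly']} | {f['severity']} | {f['fix']} |"
        )
    return "\n".join(lines)
```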
---
## The 6 Anomalies — Quick Reference
| Anomaly | What happens | Code signal | Minimum fix |
|---------|-------------|-------------|-------------|
| **Dirty read** | Transaction reads uncommitted data that later rolls back — decision based on data that never existed | Read uncommitted isolation explicitly set | Enable read committed (database default) |
| **Dirty write** | Two uncommitted writes to the same object mix results — listing shows one buyer, invoice shows another | Multi-step writes with no transaction boundary; autocommit enabled | Wrap in transaction (read committed prevents) |
| **Read skew** | Long-running read sees two tables at different points in time — Alice's $1000 appears as $900 mid-transfer | Multiple SELECTs over related tables in one transaction | Upgrade to snapshot isolation (REPEATABLE READ) |
| **Lost update** | Read-modify-write cycle: both transactions read 42, both write 43, result is 43 not 44 — one update silently lost | SELECT followed by UPDATE where new value computed in app code | Atomic SQL (`value = value + 1`) or SELECT FOR UPDATE |
| **Write skew** | Check-then-act: two transactions both pass a guard check (count >= 2), both write to different rows, combined result violates the invariant (count = 0). The 5 forms: at-least-one, no-overlap, unique claim, budget, game state | SELECT COUNT/SUM/EXISTS followed by INSERT/UPDATE/DELETE in same transaction | SELECT FOR UPDATE (rows exist) or serializable isolation |
| **Phantom (write skew)** | Check-for-absence then insert: both see zero conflicts, both insert, producing a double-booking or duplicate username | COUNT = 0 check followed by INSERT matching same condition | Serializable isolation; UNIQUE constraint for exact-match uniqueness (for example usernames); materializing conflicts as last resort |
Full detail with SQL and ORM patterns per language is in `references/anomaly-detection-patterns.md`.
---
## What Can Go Wrong
**The most dangerous gap:** Write skew is the anomaly teams most commonly miss. It looks like safe code — each transaction individually is correct. The invariant violation only appears in the combined outcome of two concurrent transactions. At any isolation level below true serializable (including PostgreSQL's snapshot isolation, Oracle's "serializable"), write skew is possible.
**The naming trap:** Oracle's `SERIALIZABLE` is snapshot isolation. Teams that set Oracle isolation to SERIALIZABLE and assume they have full protection against write skew do not. MySQL's `REPEATABLE READ` does not detect lost updates. These naming mismatches have caused real production data corruption.
**Testing cannot catch these:** Concurrency anomalies only manifest when two transactions interleave at precisely the wrong time. This is nondeterministic and depends on load. Unit tests run single-threaded. Load tests may not hit the exact timing window. The only reliable method is structural code analysis — which is what this skill does.
---
## Examples
### Example 1: E-Commerce Inventory Deduction (Lost Update)
**Scenario:** A Python Flask service handles concurrent purchase requests. When an order is placed, the code reads inventory, checks that enough stock is available, deducts the quantity, and saves.
**Trigger:** "We occasionally oversell products — orders go through for items that are actually out of stock. It only happens during flash sales."
**Process:**
Step 1: Database is PostgreSQL, default isolation is read committed. No isolation override found in the codebase.
Step 2: Grep finds this pattern in `orders/inventory.py`:
```python
# orders/inventory.py:34
with db.session.begin():
    item = db.session.query(InventoryItem).filter_by(sku=sku).one()
    if item.quantity < requested_qty:
        raise InsufficientStock()
    item.quantity -= requested_qty
    db.session.commit()
```
This is a read-modify-write cycle: read `item.quantity`, compute `item.quantity - requested_qty` in application code, write back.
Step 3 classification:
```
Finding #1
File: orders/inventory.py:34
Anomaly type: Lost update
Code pattern: Read quantity → check quantity >= requested → deduct in application code → save
Trigger: Two concurrent purchase requests for the same SKU both read quantity=5,
both check 5 >= 2 (requested), both compute 5-2=3, both write 3.
Result: quantity=3 instead of quantity=1. One purchase's deduction is lost.
Severity: HIGH
Affected data: InventoryItem.quantity
Fix: Atomic SQL operation (no read-modify-write cycle needed)
Fix detail: Replace with:
UPDATE inventory_items
SET quantity = quantity - :requested_qty
WHERE sku = :sku AND quantity >= :requested_qty
Check rows_affected == 1 to detect insufficient stock.
This removes the application-layer read-modify-write cycle entirely.
```
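The guarded UPDATE from the fix detail can be exercised end to end. The sketch below uses an in-memory SQLite database as a stand-in for the service's PostgreSQL instance, and the `deduct` helper and sample SKU are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory_items (sku TEXT PRIMARY KEY, quantity INTEGER)")
conn.execute("INSERT INTO inventory_items VALUES ('widget', 5)")
conn.commit()

def deduct(conn, sku, requested_qty):
    """Deduct stock in one statement; report whether the guard let it through."""
    with conn:
        cur = conn.execute(
            "UPDATE inventory_items SET quantity = quantity - :qty "
            "WHERE sku = :sku AND quantity >= :qty",
            {"qty": requested_qty, "sku": sku},
        )
    # Zero affected rows means the stock was insufficient (or the SKU is unknown).
    return cur.rowcount == 1

results = [deduct(conn, "widget", 2) for _ in range(3)]
remaining = conn.execute(
    "SELECT quantity FROM inventory_items WHERE sku = 'widget'"
).fetchone()[0]
```

The third call is rejected by the WHERE guard instead of driving the quantity negative, and the application never holds a stale copy of the quantity.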
**Output excerpt:**
```markdown
# Concurrency Anomaly Scan — E-Commerce Service
Database: PostgreSQL 14
Effective isolation: Read committed (default, no override found)
Files scanned: 47
Findings: 1 total — 0 CRITICAL, 1 HIGH, 0 MEDIUM, 0 LOW
## Summary Table
| # | File | Anomaly Type | Severity | Fix Type |
|---|------|-------------|----------|----------|
| 1 | orders/inventory.py:34 | Lost update | HIGH | Atomic SQL operation |
## Finding #1 — Lost Update — HIGH
...
Fix: Replace Python read-modify-write with atomic SQL UPDATE with WHERE guard.
The atomic UPDATE eliminates the window for concurrent reads of the same value.
No isolation level change required for this fix.
```
---
### Example 2: Scheduling Service (Write Skew — Classic Pattern)
**Scenario:** A Java Spring Boot service manages staff shift assignments. Staff can voluntarily release a shift as long as at least two staff members, including themselves, are currently assigned, so that at least one person remains to cover it.
**Trigger:** "A shift ended up with zero assigned staff after two people simultaneously clicked 'release' on the app. Both got a confirmation that their release was accepted."
**Process:**
Step 1: Database is MySQL InnoDB, isolation is repeatable read (default). No override configured.
Step 2: Grep finds this pattern in `ShiftService.java`:
```java
// ShiftService.java:87
@Transactional
public void releaseShift(long staffId, long shiftId) {
    long coverCount = shiftRepository.countAssigned(shiftId);  // SELECT COUNT(*)
    if (coverCount < 2) {
        throw new InsufficientCoverageException();
    }
    staffShiftRepository.markReleased(staffId, shiftId);  // UPDATE staff_shifts
}
```
Step 3 classification:
```
Finding #1
File: ShiftService.java:87
Anomaly type: Write skew
Code pattern: COUNT assigned staff → if >= 2 → mark self as released
Trigger: Alice and Bob are both assigned, both click Release simultaneously.
Both transactions read COUNT = 2 (their own snapshot). Both pass the
check. Alice's UPDATE sets her record to released. Bob's UPDATE sets
his record to released. Result: COUNT = 0. Zero staff on shift.
Severity: CRITICAL
Affected data: staff_shifts table
Fix: SELECT FOR UPDATE on the precondition query, or upgrade to SERIALIZABLE
Fix detail:
Option A — SELECT FOR UPDATE:
Replace countAssigned() with a locking query:
SELECT COUNT(*) FROM staff_shifts
WHERE shift_id = ? AND status = 'assigned'
FOR UPDATE
This locks all assigned rows for the shift. Bob's transaction must wait
for Alice's to commit. After Alice commits (count becomes 1), Bob reads
count = 1 and throws InsufficientCoverageException. Correct behavior.
Option B — Serializable isolation:
Add @Transactional(isolation = Isolation.SERIALIZABLE) to the method.
MySQL SERIALIZABLE uses two-phase locking — true serializable.
Note: adds lock contention overhead on the staff_shifts table.
```
---
### Example 3: Multi-Tenant SaaS Subscription Billing (Read Skew + Write Skew)
**Scenario:** A Node.js billing service runs a monthly billing job. The job reads each account's subscription plan and usage, computes the invoice total, and inserts an invoice record. The job runs while normal usage continues.
**Trigger:** "Our monthly billing report shows totals that don't match usage records. Some invoices have usage figures that don't correspond to any committed state we can find."
**Process:**
Step 1: Database is PostgreSQL, isolation is read committed (default). The billing job uses a long-running transaction.
Step 2: Grep finds two issues:
Issue A — `billing/job.js:23`: The billing job reads usage in a loop over accounts. Each loop iteration issues a new SELECT. Between iterations, usage records for already-billed accounts may be updated by concurrent write transactions. The job reads accounts at different points in time.
Issue B — `billing/proration.js:67`: A proration credit calculation reads the current plan price and the days remaining, computes a credit, then inserts a credit record. A concurrent plan upgrade transaction can commit between the two reads, leaving the proration based on the old plan price while the new plan is active.
Step 3 classification:
```
Finding #1
File: billing/job.js:23
Anomaly type: Read skew
Code pattern: Long-running job reads multiple tables across loop iterations with
separate SELECT queries per account — no consistent snapshot
Trigger: Concurrent write transactions commit usage updates between account reads.
Job sees Account A's usage before the update and Account B's usage after.
Severity: MEDIUM
Affected data: usage_records, subscriptions, accounts tables
Fix: Set transaction isolation to REPEATABLE READ or SERIALIZABLE for the
billing job transaction. PostgreSQL REPEATABLE READ = snapshot isolation.
All reads within the transaction see the database at the transaction's
start time. Concurrent writes are invisible until the transaction commits.
Finding #2
File: billing/proration.js:67
Anomaly type: Write skew
Code pattern: Read current plan price → read days remaining → compute credit →
insert credit record. A concurrent plan upgrade changes the precondition
(plan price) that the credit was computed from.
Trigger: Plan upgrade commits between the two reads. Credit is computed at
old plan price but inserted while new plan is active.
Severity: CRITICAL
Fix: SELECT plan_price FROM subscriptions WHERE id = ? FOR UPDATE before
reading days remaining. This locks the subscription row for the
duration of the transaction. A concurrent plan upgrade must wait.
```
---
## References
| File | Contents | When to read |
|------|----------|--------------|
| `references/anomaly-detection-patterns.md` | SQL and ORM grep patterns for all 6 anomaly types; per-language examples (Python/SQLAlchemy, Java/Hibernate, Go/sqlx, Ruby/ActiveRecord, Node.js/Sequelize); false positive filters | Step 2 — systematic grep sweep |
| `references/severity-and-fix-matrix.md` | Full severity table per anomaly type per isolation level; fix decision tree (isolation upgrade vs FOR UPDATE vs atomic op vs unique constraint vs materializing conflicts); fix applicability conditions | Step 3 — classifying each finding and selecting a fix |
**Cross-reference:** `transaction-isolation-selector` — use after this skill produces its findings report. That skill takes the anomaly exposure as input and produces the minimum safe isolation level recommendation with database-specific configuration.
## License
This skill is licensed under [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Source: [BookForge](https://github.com/bookforge-ai/bookforge-skills) — Designing Data-Intensive Applications by Martin Kleppmann.
## Related BookForge Skills
Install related skills from ClawHub:
- `clawhub install bookforge-transaction-isolation-selector`
Or install the full book set from GitHub: [bookforge-skills](https://github.com/bookforge-ai/bookforge-skills)
FILE:references/anomaly-detection-patterns.md
# Anomaly Detection Patterns
Reference for Step 2 of the concurrency-anomaly-detector process. SQL and ORM grep patterns for all 6 concurrency anomaly types, with per-language examples and false positive filters.
---
## How to Use This Reference
Run the grep patterns below in the codebase to find candidate code locations. Each pattern produces a list of files and line numbers. For each hit, apply the false positive filter to determine whether it is an actual exposure. Then classify into a finding using Step 3 of the skill.
---
## Pattern 1: Lost Update Detection
### What to find
A transaction that reads a value, computes a new value in application code, and writes it back. The key marker is application-layer arithmetic between read and write — the new value depends on the old value.
### SQL grep patterns
```
SELECT.*FROM.*\n.*UPDATE
UPDATE.*SET.*=.*\+
UPDATE.*SET.*=.*-
SELECT.*FOR UPDATE.*\n.*UPDATE ← explicit lock present; lower risk but verify
```
### ORM grep patterns (Python/SQLAlchemy)
```python
# High risk: find then modify in loop or immediate scope
session.query(Model).filter(...).one()
session.query(Model).filter(...).first()
# Followed by: obj.field += or obj.field -= or obj.field = obj.field +
```
### ORM grep patterns (Java/Hibernate)
```java
em.find(Entity.class, id)
repository.findById(id)
// Followed by: entity.setField(entity.getField() + delta)
// or entity.setCount(entity.getCount() + 1)
```
### ORM grep patterns (Ruby/ActiveRecord)
```ruby
record = Model.find(id)
record = Model.find_by(...)
# Followed by: record.field += delta, record.save
# Also: Model.update_counters(id, column: 1) ← atomic SQL increment, NOT a lost update
```
### ORM grep patterns (Go/sqlx or GORM)
```go
db.First(&entity, id)
db.Where(...).First(&entity)
// Followed by: entity.Field = entity.Field + delta
// db.Save(&entity)
```
### ORM grep patterns (Node.js/Sequelize)
```javascript
await Model.findOne({ where: ... })
await Model.findByPk(id)
// Followed by: instance.field += delta, instance.save()
```
### False positives (NOT lost updates)
```sql
-- Pure SQL atomic expression: no application-layer read needed
UPDATE counters SET value = value + 1 WHERE key = 'x';
UPDATE accounts SET balance = balance - :amount WHERE id = :id AND balance >= :amount;
```
```python
# SQLAlchemy atomic update: no Python-layer arithmetic
session.execute(update(Counter).where(...).values(value=Counter.value + 1))
```
```ruby
# ActiveRecord atomic: no Ruby-layer arithmetic
Account.where(id: id).update_all(["balance = balance - ?", amount]) # bound parameter, not string interpolation
Model.increment_counter(:count, id)
```
These are safe — the arithmetic happens inside the database atomically, with no window for a concurrent read to see the old value.
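The difference can be replayed deterministically. The sketch below uses the standard-library `sqlite3` module and a single connection that imitates the unlucky interleaving of two clients; it is not real concurrency, and the `counters` table is only an illustrative assumption borrowed from the SQL example above.

```python
import sqlite3

def make_db():
    """Create a throwaway in-memory database with one counter set to 0."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE counters (key TEXT PRIMARY KEY, value INTEGER NOT NULL)")
    conn.execute("INSERT INTO counters VALUES ('x', 0)")
    conn.commit()
    return conn

def current_value(conn):
    return conn.execute("SELECT value FROM counters WHERE key = 'x'").fetchone()[0]

def interleaved_read_modify_write(conn):
    """Both clients read before either writes: one increment is overwritten."""
    read_a = current_value(conn)  # client A reads 0
    read_b = current_value(conn)  # client B also reads 0
    conn.execute("UPDATE counters SET value = ? WHERE key = 'x'", (read_a + 1,))
    conn.execute("UPDATE counters SET value = ? WHERE key = 'x'", (read_b + 1,))
    conn.commit()
    return current_value(conn)

def atomic_increments(conn):
    """Both clients let the database do the arithmetic: nothing is lost."""
    for _ in range(2):
        conn.execute("UPDATE counters SET value = value + 1 WHERE key = 'x'")
    conn.commit()
    return current_value(conn)

lost = interleaved_read_modify_write(make_db())
safe = atomic_increments(make_db())
```

After two logical increments, `lost` holds 1 while `safe` holds 2, which is the signature to look for when confirming a grep hit.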
---
## Pattern 2: Write Skew Detection
### What to find
A transaction that reads an aggregate or existence condition (COUNT, SUM, EXISTS, MAX, MIN), makes an application-level decision based on the result, then performs a write that changes the state of the condition.
This is the single most commonly missed pattern. It looks like safe code — each transaction is individually correct. The race condition only appears when two transactions run simultaneously.
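A minimal in-memory simulation makes the point concrete. It assumes each transaction checks the invariant against its own private snapshot (modeled here with `deepcopy`) and then writes to the shared state; the doctor names and the `request_leave` helper are hypothetical.

```python
from copy import deepcopy

def request_leave(snapshot, live, doctor):
    """Check the invariant on a private snapshot, then write to the live table."""
    on_call = sum(1 for row in snapshot.values() if row["on_call"])
    if on_call >= 2:                      # check passes on the snapshot
        live[doctor]["on_call"] = False   # write touches only this doctor's row
        return True
    return False

def on_call_count(table):
    return sum(1 for row in table.values() if row["on_call"])

# Concurrent run: both snapshots are taken before either write lands.
live = {"alice": {"on_call": True}, "bob": {"on_call": True}}
snap_a, snap_b = deepcopy(live), deepcopy(live)
ok_a = request_leave(snap_a, live, "alice")
ok_b = request_leave(snap_b, live, "bob")
concurrent_remaining = on_call_count(live)

# Serial run: the second snapshot is taken after the first write.
serial = {"alice": {"on_call": True}, "bob": {"on_call": True}}
request_leave(deepcopy(serial), serial, "alice")
serial_ok_b = request_leave(deepcopy(serial), serial, "bob")
serial_remaining = on_call_count(serial)
```

In the concurrent run both requests are approved and nobody remains on call; in the serial run the second request is refused. Neither transaction contains a bug when read in isolation.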
### SQL grep patterns
```sql
SELECT COUNT(*) -- check count, then write based on result
SELECT SUM( -- check aggregate, then insert/update
SELECT EXISTS( -- check existence, then write
SELECT MAX( -- check bound, then write
```
All followed (in the same transaction scope) by INSERT, UPDATE, or DELETE.
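The "check followed by write" rule can be approximated with a small scanner. This is a rough sketch, not part of the skill: it assumes one statement per line, treats a fixed line window as a stand-in for "same transaction scope", and the function name and `window` parameter are hypothetical.

```python
import re

# Aggregate or existence checks named in the patterns above.
CHECK = re.compile(r"\bSELECT\s+(COUNT|SUM|EXISTS|MAX|MIN)\s*\(", re.IGNORECASE)
# A write statement starting a line.
WRITE = re.compile(r"^\s*(INSERT|UPDATE|DELETE)\b", re.IGNORECASE)

def check_then_write_candidates(sql_text, window=10):
    """Return (check_line, write_line) pairs, 1-indexed, for manual review."""
    lines = sql_text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if not CHECK.search(line):
            continue
        # Look ahead a few lines for the first write after the check.
        for j in range(i + 1, min(i + 1 + window, len(lines))):
            if WRITE.match(lines[j]):
                hits.append((i + 1, j + 1))
                break
    return hits

sample = """SELECT COUNT(*) FROM bookings WHERE resource_id = 1;
-- app: if count = 0, proceed
INSERT INTO bookings (resource_id) VALUES (1);
SELECT name FROM users WHERE id = 2;
"""
candidates = check_then_write_candidates(sample)
read_only = check_then_write_candidates("SELECT COUNT(*) FROM bookings;\n")
```

Every pair it reports is only a candidate; the false positive filters below still have to be applied by hand.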
### ORM grep patterns
```python
# Python/SQLAlchemy
session.query(Model).filter(...).count() # check count
session.query(func.sum(Model.field)).scalar() # check sum
session.query(Model).filter(...).first() # check existence
# All followed by session.add(...) or session.execute(update(...))
```
```java
// Java/Hibernate
repository.countBy...()
repository.existsBy...()
entityManager.createQuery("SELECT SUM(...)").getSingleResult()
// Followed by: entityManager.persist(new Entity(...)) or repository.save(new Entity(...))
```
```ruby
# Ruby/ActiveRecord
Model.where(...).count
Model.where(...).exists?
Model.where(...).sum(:field)
# Followed by: Model.create(...) or record.update(...)
```
```javascript
// Node.js/Sequelize
await Model.count({ where: ... })
await Model.sum('field', { where: ... })
await Model.findOne({ where: ... }) // guard for uniqueness
// Followed by: await Model.create(...)
```
### The 5 write skew patterns — specific code signatures
**At-least-one constraint (doctor on-call):**
```sql
SELECT COUNT(*) FROM resources WHERE status = 'active' AND group_id = $1;
-- app: if count >= 2, proceed
UPDATE resources SET status = 'inactive' WHERE id = $2;
```
Signal: COUNT check with threshold, followed by UPDATE on a different row than was counted.
**No-overlap constraint (booking system):**
```sql
SELECT COUNT(*) FROM bookings
WHERE resource_id = $1 AND end_time > $2 AND start_time < $3;
-- app: if count = 0, proceed
INSERT INTO bookings (resource_id, start_time, end_time, user_id)
VALUES ($1, $2, $3, $4);
```
Signal: Time-range overlap check followed by INSERT into same table.
**Unique claim (username registration):**
```sql
SELECT COUNT(*) FROM users WHERE username = $1;
-- app: if count = 0, proceed
INSERT INTO users (username, ...) VALUES ($1, ...);
```
Signal: Existence check followed by INSERT with the checked value.
Note: A UNIQUE constraint on `username` is sufficient here — no serializable isolation needed.
**Budget / sum constraint (spending limit):**
```sql
SELECT SUM(amount) FROM spending WHERE account_id = $1;
-- app: if sum + new_amount <= limit, proceed
INSERT INTO spending (account_id, amount, ...) VALUES ($1, $2, ...);
```
Signal: SUM aggregate check followed by INSERT that would increase the sum.
**Game state validity:**
```sql
SELECT position FROM game_pieces WHERE piece_id = $1 AND game_id = $2;
-- app: validate move legality based on position
UPDATE game_pieces SET position = $3 WHERE piece_id = $1;
```
Signal: State read for validity check, followed by UPDATE that changes other game state.
### False positives (NOT write skew)
- Read-only queries with no subsequent write in the same transaction
- Writes that do not change the condition checked by the SELECT (e.g., SELECT from table A, write to unrelated table B with no logical dependency)
- Cases where a UNIQUE database constraint catches the conflict at the insert level (duplicate username pattern)
---
## Pattern 3: Phantom Read Detection
### What to find
A transaction checks for zero matching rows, then inserts a row that matches the same condition. This is the phantom variant of write skew. Unlike write skew over existing rows, `SELECT FOR UPDATE` cannot help — there are no rows to lock when the SELECT returns empty.
### Key distinguishing marker
The SELECT returns zero rows (or checks for absence), and the INSERT creates a row that would match the SELECT's WHERE condition.
### SQL grep patterns
```sql
-- Absence check followed by insert
SELECT COUNT(*) FROM table WHERE condition = $1 -- returning 0
INSERT INTO table (condition, ...) VALUES ($1, ...) -- values match condition
-- Or: SELECT returns empty, INSERT follows
SELECT id FROM table WHERE unique_key = $1;
-- if no rows: INSERT INTO table (unique_key, ...) VALUES ($1, ...);
```
### ORM grep patterns
```python
# Booking conflict check
existing = session.query(Booking).filter(
    Booking.room_id == room_id,
    Booking.end_time > start_time,
    Booking.start_time < end_time
).count()
if existing == 0:
    session.add(Booking(room_id=room_id, ...))
```
```java
// Room booking
long conflicts = bookingRepo.countConflicting(roomId, startTime, endTime);
if (conflicts == 0) {
    bookingRepo.save(new Booking(roomId, startTime, endTime, userId));
}
```
### Differentiation from general write skew
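- Write skew over existing rows: the SELECT returns rows, so `FOR UPDATE` is available as a mitigation.
- Phantom write skew: the SELECT returns no rows, so `FOR UPDATE` has nothing to lock. Use serializable isolation, a UNIQUE constraint where the invariant is a uniqueness rule, or materialized conflicts.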
---
## Pattern 4: Read Skew Detection
### What to find
A long-running transaction that reads from multiple tables or reads the same data multiple times, where the combined result must be internally consistent.
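To see why the combined result can be inconsistent, consider a toy audit that sums two account balances while a transfer commits between its two reads. The sketch below is an assumption-laden model (plain dictionaries stand in for tables, and `dict(accounts)` stands in for a database snapshot), not a description of any particular engine.

```python
def transfer(accounts, amount):
    """Move money from account a to account b (total stays constant)."""
    accounts["a"] -= amount
    accounts["b"] += amount

def audit_read_committed():
    """Each read sees the latest committed state at the moment it runs."""
    accounts = {"a": 500, "b": 500}
    seen_a = accounts["a"]
    transfer(accounts, 100)      # another transaction commits between the reads
    seen_b = accounts["b"]
    return seen_a + seen_b

def audit_snapshot():
    """Both reads come from one snapshot taken at transaction start."""
    accounts = {"a": 500, "b": 500}
    snapshot = dict(accounts)
    seen_a = snapshot["a"]
    transfer(accounts, 100)      # invisible to the snapshot
    seen_b = snapshot["b"]
    return seen_a + seen_b

skewed_total = audit_read_committed()
consistent_total = audit_snapshot()
```

The read-committed audit reports 1100 even though the true total never left 1000; the snapshot audit reports 1000.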
### Code signals
- Long-running batch or background jobs inside a transaction
- Multiple SELECT queries over related tables in one transaction scope
- Backup or export operations that read the entire database or large subsets
- Integrity check queries that join or compare multiple tables
### SQL grep patterns
```sql
-- Two or more SELECTs from related tables in the same transaction
BEGIN;
SELECT ... FROM table_a WHERE ...;
-- other processing
SELECT ... FROM table_b WHERE ...; -- related to table_a result
COMMIT;
```
### ORM grep patterns (Python)
```python
with session.begin():
    accounts = session.query(Account).filter(...).all()
    # ... some processing ...
    transfers = session.query(Transfer).filter(...).all()
    # If accounts and transfers must balance, this is read skew exposure
```
### False positives
- Single-table reads (no cross-table consistency requirement)
- Queries that tolerate slightly stale or mutually inconsistent results and can simply be re-run; read committed isolation is acceptable for these
- Short transactions where the window for a concurrent write to commit between reads is negligible
---
## Pattern 5: Dirty Read and Dirty Write Detection
### Dirty reads
Only possible at isolation level `READ UNCOMMITTED`. Rare in production — almost no database defaults to this level.
### SQL grep patterns
```sql
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
isolation_level='READ UNCOMMITTED'
```
### Dirty writes
Prevented by all practical isolation levels (read committed and above) through row-level write locks. Flag only if:
- Autocommit is enabled per-statement and transactions are not used
- The application manually manages locking and a lock is missing
### SQL grep patterns for missing transaction boundaries
```sql
-- INSERTs, UPDATEs, DELETEs outside any transaction context
-- Autocommit enabled with no explicit transaction wrapping multi-step operations
SET autocommit = 1 -- MySQL
-- Followed by multi-step write operations with no BEGIN/COMMIT
```
---
## Per-Database Notes on What Is and Is Not Automatically Prevented
### PostgreSQL
| Anomaly | At READ COMMITTED (default) | At REPEATABLE READ | At SERIALIZABLE |
|---------|--------------------------|---------------------|-----------------|
| Dirty read | Prevented | Prevented | Prevented |
| Dirty write | Prevented | Prevented | Prevented |
| Read skew | **EXPOSED** | Prevented | Prevented |
| Lost update | **EXPOSED** | Auto-detected, aborts | Prevented |
| Write skew | **EXPOSED** | **EXPOSED** | Prevented |
| Phantom (write skew) | **EXPOSED** | **EXPOSED** | Prevented |
**PostgreSQL SERIALIZABLE uses SSI (serializable snapshot isolation)** — true serializable with optimistic concurrency. Aborts transactions rather than blocking. Application must implement retry on SQLSTATE 40001.
### MySQL InnoDB
| Anomaly | At READ COMMITTED | At REPEATABLE READ (default) | At SERIALIZABLE |
|---------|------------------|------------------------------|-----------------|
| Dirty read | Prevented | Prevented | Prevented |
| Dirty write | Prevented | Prevented | Prevented |
| Read skew | **EXPOSED** | Prevented | Prevented |
| Lost update | **EXPOSED** | **EXPOSED (NOT auto-detected!)** | Prevented |
| Write skew | **EXPOSED** | **EXPOSED** | Prevented |
| Phantom (write skew) | **EXPOSED** | **EXPOSED** | Prevented |
**Critical MySQL difference:** MySQL InnoDB at REPEATABLE READ does NOT automatically detect lost updates. PostgreSQL does. Two concurrent read-modify-write cycles will silently lose one update in MySQL. PostgreSQL will abort one and require a retry.
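**MySQL SERIALIZABLE uses two-phase locking (2PL)**, a pessimistic scheme: readers block writers and writers block readers. There are no serialization-failure aborts, but lock contention is higher and deadlocks can occur. InnoDB rolls back one transaction when it detects a deadlock (error 1213), so the application should still retry on that error.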
### Oracle 11g
| Anomaly | At READ COMMITTED (default) | At "SERIALIZABLE" (= snapshot isolation) |
|---------|-----------------------------|------------------------------------------|
| Dirty read | Prevented | Prevented |
| Dirty write | Prevented | Prevented |
| Read skew | **EXPOSED** | Prevented |
| Lost update | **EXPOSED** | Auto-detected |
| Write skew | **EXPOSED** | **EXPOSED** |
| Phantom (write skew) | **EXPOSED** | **EXPOSED** |
**Oracle's SERIALIZABLE is snapshot isolation, not true serializable.** Write skew is still possible. There is no true serializable isolation available in Oracle 11g.
---
## Quick Grep Commands
### Find read-modify-write cycles (lost update candidates)
```bash
# Python: find() followed by attribute modification
grep -n "session.query\|db.session.get\|Model.get\|Model.find" -A 5 \
$(find . -name "*.py") | grep -B 3 "+="
# Java: find followed by setter
grep -rn "findById\|em.find\|repository.find" --include="*.java" -A 5 | \
grep -B 3 "\.set[A-Z]"
```
### Find check-then-act write skew candidates
```bash
# SQL files: locate COUNT, SUM, or EXISTS checks, then inspect each hit for a following INSERT/UPDATE
grep -rn "SELECT COUNT\|SELECT SUM\|SELECT EXISTS" --include="*.sql" .
# Python: count() or exists() in transaction context
grep -rn "\.count()\|\.exists()" --include="*.py" -B 2 -A 5 | \
grep -B 5 "session.add\|\.save()\|\.create("
# Java: countBy or existsBy followed by save/persist
grep -rn "countBy\|existsBy\|\.count()\|\.exists()" --include="*.java" -A 10 | \
grep -B 8 "\.save\|\.persist\|\.create"
```
### Find phantom-variant candidates (absence check + insert)
```bash
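# Application code: "count == 0" check followed by an insert
# (--include is left unquoted so bash brace-expands it into one option per extension)
grep -rn "COUNT.*= 0\|count.*== 0\|count.*=== 0" -A 5 \
  --include=*.{py,java,go,js,rb} . | \
  grep -B 3 "INSERT\|\.create\|\.add\|\.save"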
```
### Find isolation level configuration
```bash
# --include is left unquoted so bash brace-expands it into one option per extension
grep -rn "ISOLATION LEVEL\|isolation_level\|transaction_isolation\|@Transactional" \
  --include=*.{py,java,go,js,rb,yml,yaml,properties,xml} .
```
FILE:references/severity-and-fix-matrix.md
# Severity and Fix Matrix
Reference for Step 3 of the concurrency-anomaly-detector process. Severity per anomaly type per isolation level, fix decision tree, and fix applicability conditions.
---
## Severity by Anomaly and Effective Isolation Level
A finding's severity depends on both the anomaly type and the isolation level currently in effect. The same code pattern that is CRITICAL at read committed may be NOT PRESENT at serializable.
| Anomaly | Read Uncommitted | Read Committed | Snapshot Isolation | Serializable |
|---------|:----------------:|:--------------:|:-----------------:|:------------:|
| Dirty read | HIGH | NOT PRESENT | NOT PRESENT | NOT PRESENT |
| Dirty write | HIGH | NOT PRESENT | NOT PRESENT | NOT PRESENT |
| Read skew | MEDIUM | MEDIUM | NOT PRESENT | NOT PRESENT |
| Lost update (PostgreSQL/Oracle SI) | HIGH | HIGH | MEDIUM* | NOT PRESENT |
| Lost update (MySQL InnoDB RR) | HIGH | HIGH | HIGH** | NOT PRESENT |
| Write skew | CRITICAL | CRITICAL | CRITICAL | NOT PRESENT |
| Phantom (write skew variant) | CRITICAL | CRITICAL | CRITICAL | NOT PRESENT |
*PostgreSQL and Oracle automatically detect and abort one of the conflicting transactions at snapshot isolation. The anomaly does not silently corrupt data, but the application must handle the abort and retry. Severity is MEDIUM because it requires application-level retry handling that is often missing.
**MySQL InnoDB repeatable read does NOT automatically detect lost updates. Two concurrent read-modify-write cycles silently lose one update. This is HIGH severity, not MEDIUM — the data corruption is silent.
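For tooling that emits findings automatically, the matrix can be encoded as a lookup. The following is a sketch under stated assumptions: the `severity` function, its argument spellings, and the `database` switch are hypothetical, and MySQL REPEATABLE READ is mapped onto the "snapshot isolation" column exactly as the table above does.

```python
LEVELS = ("read uncommitted", "read committed", "snapshot isolation", "serializable")

# One tuple per table row, ordered to match LEVELS.
_ROWS = {
    "dirty read": ("HIGH", "NOT PRESENT", "NOT PRESENT", "NOT PRESENT"),
    "dirty write": ("HIGH", "NOT PRESENT", "NOT PRESENT", "NOT PRESENT"),
    "read skew": ("MEDIUM", "MEDIUM", "NOT PRESENT", "NOT PRESENT"),
    "lost update (auto-detected)": ("HIGH", "HIGH", "MEDIUM", "NOT PRESENT"),
    "lost update (mysql)": ("HIGH", "HIGH", "HIGH", "NOT PRESENT"),
    "write skew": ("CRITICAL", "CRITICAL", "CRITICAL", "NOT PRESENT"),
    "phantom": ("CRITICAL", "CRITICAL", "CRITICAL", "NOT PRESENT"),
}

def severity(anomaly, isolation, database="postgresql"):
    """Look up the severity for an anomaly at the effective isolation level."""
    key = anomaly.lower()
    if key == "lost update":
        # Only MySQL InnoDB fails to detect lost updates at repeatable read.
        if database.lower() == "mysql":
            key = "lost update (mysql)"
        else:
            key = "lost update (auto-detected)"
    return _ROWS[key][LEVELS.index(isolation.lower())]

pg_lost_update = severity("lost update", "snapshot isolation", "postgresql")
mysql_lost_update = severity("lost update", "snapshot isolation", "mysql")
```

An unknown anomaly or isolation name raises `KeyError` or `ValueError`, which is preferable to silently returning a default severity.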
---
## Severity Definitions
| Severity | Meaning | Action required |
|----------|---------|-----------------|
| CRITICAL | Invariant violation that cannot be prevented without serializable isolation (or specific mitigation). Business logic constraints can be violated silently with no database error. Examples: double-booking, zero staff on shift, negative balance. | Fix before next production deployment. |
| HIGH | Silent data corruption or reads of data that never existed. Direct integrity violation. Examples: lost counter increment, overwritten balance update. | Fix in current sprint. |
| MEDIUM | Transaction sees internally inconsistent state. Dangerous for analytics, backups, integrity checks. May produce incorrect results without error. | Assess impact; fix if long-running reads are involved. |
| LOW | Possible exposure but requires unusual timing or very high concurrency to trigger. Or: the anomaly is possible in theory but the affected data has compensating constraints. | Note in code review; monitor. |
| NOT PRESENT | The effective isolation level prevents this anomaly. No code change required for this specific finding. | No action required. |
---
## Fix Decision Tree
For each finding, select the fix using this tree:
```
Is the effective isolation level already sufficient to prevent this anomaly?
  YES → NOT PRESENT; skip this finding
  NO → Continue
Is the anomaly a lost update?
  YES →
    Is the write value computed purely in SQL (no application-layer arithmetic)?
      YES → Already safe (atomic SQL operation). Mark as false positive.
      NO →
        Can the write be expressed as a pure SQL expression? (e.g., value = value + ?)
          YES → Fix A: Replace read-modify-write with atomic SQL UPDATE
          NO →
            Is the database PostgreSQL or Oracle at snapshot isolation?
              YES → Aborts automatically; add retry logic (Fix B)
              NO → Fix C: Add SELECT FOR UPDATE before the read, OR
                   Fix D: Upgrade to serializable isolation
  NO → Continue (not a lost update)
Is the anomaly write skew (non-phantom variant)?
  YES →
    Does the precondition SELECT return existing rows?
      YES → Fix E: Add SELECT FOR UPDATE to the precondition query
      NO → Fix F: Upgrade to serializable isolation
           (or Fix G: Materialize conflicts, as a last resort)
  NO → Continue
Is the anomaly a phantom (check-for-absence then insert)?
  YES →
    Is a UNIQUE database constraint sufficient to enforce the invariant?
      YES → Fix H: Add UNIQUE constraint (no application change needed)
      NO → Fix F: Upgrade to serializable isolation
           (or Fix G: Materialize conflicts, as a last resort)
  NO → Continue
Is the anomaly read skew?
  YES → Fix I: Upgrade the transaction to snapshot isolation (REPEATABLE READ or higher)
  NO → Continue
Is the anomaly a dirty read or dirty write?
  YES → Fix J: Enable read committed isolation (the database default for most systems)
```
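The tree above can be sketched as a small function. This is an illustrative encoding only: the `Finding` dataclass, its field names, and the returned labels are hypothetical, and each boolean corresponds to one question in the tree.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    anomaly: str                              # e.g. "lost update", "write skew", "phantom"
    already_prevented: bool = False           # isolation level already sufficient
    computed_in_sql: bool = False             # write value computed purely in SQL
    expressible_in_sql: bool = False          # could be rewritten as value = value + ?
    auto_detects_lost_update: bool = False    # PostgreSQL or Oracle at snapshot isolation
    precondition_returns_rows: bool = False   # rows exist to lock
    unique_constraint_suffices: bool = False  # invariant is a uniqueness rule

def select_fix(f):
    """Walk the decision tree top to bottom and return the fix label."""
    if f.already_prevented:
        return "NOT PRESENT"
    if f.anomaly == "lost update":
        if f.computed_in_sql:
            return "FALSE POSITIVE"
        if f.expressible_in_sql:
            return "A"
        return "B" if f.auto_detects_lost_update else "C or D"
    if f.anomaly == "write skew":
        return "E" if f.precondition_returns_rows else "F (or G)"
    if f.anomaly == "phantom":
        return "H" if f.unique_constraint_suffices else "F (or G)"
    if f.anomaly == "read skew":
        return "I"
    if f.anomaly in ("dirty read", "dirty write"):
        return "J"
    raise ValueError(f"unknown anomaly: {f.anomaly}")

booking_fix = select_fix(Finding("phantom"))
counter_fix = select_fix(Finding("lost update", expressible_in_sql=True))
```

Here `booking_fix` is `"F (or G)"` and `counter_fix` is `"A"`. Keeping the questions as explicit fields makes it easy to record in the findings report which answer led to the chosen fix.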
---
## Fix Catalog
### Fix A: Replace Read-Modify-Write with Atomic SQL UPDATE
**Applies to:** Lost update — when the new value can be expressed as a SQL expression.
**What to change:** Remove the SELECT that reads the old value. Replace the application-layer arithmetic with a SQL expression in the UPDATE statement. Add a WHERE guard to check preconditions.
**Before:**
```python
# Python — read-modify-write cycle (vulnerable)
item = session.query(InventoryItem).filter_by(id=item_id).one()
if item.quantity < requested:
    raise InsufficientStock()
item.quantity -= requested
session.commit()
```
**After:**
```python
# Atomic SQL update with guard (safe)
from sqlalchemy import update

result = session.execute(
    update(InventoryItem)
    .where(InventoryItem.id == item_id)
    .where(InventoryItem.quantity >= requested)
    .values(quantity=InventoryItem.quantity - requested)
)
if result.rowcount == 0:  # no row matched: item missing or insufficient stock
    session.rollback()
    raise InsufficientStock()
session.commit()
```
**SQL equivalent:**
```sql
UPDATE inventory_items
SET quantity = quantity - :requested
WHERE id = :item_id AND quantity >= :requested;
-- Check rows_affected == 1; if 0, insufficient stock
```
**Why it works:** The arithmetic happens atomically inside the database. No window exists for a concurrent read to see the old value and compute the same update.
**Limitation:** Requires that the new value can be expressed as a function of the current value in SQL. Not applicable when the update logic requires complex application code (e.g., parsing a JSON document, applying business rules, calling an external service).
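The guarded update can be exercised end to end without a server. The sketch below uses the standard-library `sqlite3` module as a stand-in database; the `reserve` helper and the seeded row are assumptions made for illustration, and the table name simply mirrors the SQL equivalent above.

```python
import sqlite3

class InsufficientStock(Exception):
    pass

def reserve(conn, item_id, requested):
    """Decrement stock in one statement; the WHERE guard enforces the precondition."""
    cur = conn.execute(
        "UPDATE inventory_items SET quantity = quantity - ? "
        "WHERE id = ? AND quantity >= ?",
        (requested, item_id, requested),
    )
    if cur.rowcount == 0:
        # No row matched: either the item does not exist or stock is too low.
        conn.rollback()
        raise InsufficientStock(f"item {item_id}: cannot reserve {requested}")
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory_items (id INTEGER PRIMARY KEY, quantity INTEGER NOT NULL)")
conn.execute("INSERT INTO inventory_items VALUES (1, 5)")
conn.commit()

reserve(conn, 1, 3)              # succeeds, leaves 2 in stock
try:
    reserve(conn, 1, 3)          # guard fails, nothing is written
    refused = False
except InsufficientStock:
    refused = True
remaining = conn.execute("SELECT quantity FROM inventory_items WHERE id = 1").fetchone()[0]
```

The second reservation is refused and the quantity stays at 2. Note that a zero row count does not distinguish a missing item from insufficient stock; a caller that needs the difference has to check existence separately.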
---
### Fix B: Add Retry Logic for Snapshot Isolation Aborts (PostgreSQL/Oracle)
**Applies to:** Lost update — PostgreSQL or Oracle at snapshot isolation (automatic detection aborts the transaction; application must retry).
**What to change:** Wrap the transaction in a retry loop. Catch the serialization failure exception and retry from scratch.
```python
# Python / psycopg2
import psycopg2
from psycopg2 import errors

def execute_with_retry(conn, transaction_fn, max_retries=5):
    for attempt in range(max_retries):
        try:
            # psycopg2 opens a transaction implicitly on the first statement,
            # so no explicit BEGIN is needed
            with conn.cursor() as cur:
                result = transaction_fn(cur)
            conn.commit()
            return result
        except errors.SerializationFailure:
            conn.rollback()
            if attempt == max_retries - 1:
                raise
        except Exception:
            conn.rollback()
            raise
```java
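// Java / Spring: retry template
// Retry only concurrency failures; the retry advice must wrap the
// transaction so that each attempt runs in a fresh transaction.
@Retryable(value = {ConcurrencyFailureException.class}, maxAttempts = 5)
@Transactional(isolation = Isolation.REPEATABLE_READ)
public void performUpdate(...) {
    // transaction body
}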
```
**Error codes to catch:**
- PostgreSQL: `40001` (serialization_failure)
- Oracle: `ORA-08177` (can't serialize access)
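**Why it works:** The database detects the lost update conflict and aborts one transaction. In PostgreSQL the error is raised when the second transaction tries to write a row that changed after its snapshot was taken, which can be before commit. The retry re-executes the transaction from the beginning, at which point it reads the updated value and proceeds correctly.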
---
### Fix C: SELECT FOR UPDATE (Read Lock)
**Applies to:** Lost update — when Fix A is not applicable (complex application logic between read and write).
**What to change:** Add `FOR UPDATE` to the SELECT that reads the value being modified. This acquires an exclusive lock on the row, blocking any concurrent transaction that tries to read the same row with FOR UPDATE.
**Before:**
```sql
SELECT balance FROM accounts WHERE id = $1;
-- application computes new balance
UPDATE accounts SET balance = $new_balance WHERE id = $1;
```
**After:**
```sql
SELECT balance FROM accounts WHERE id = $1 FOR UPDATE;
-- application computes new balance
UPDATE accounts SET balance = $new_balance WHERE id = $1;
```
**Why it works:** The FOR UPDATE lock prevents a concurrent transaction from reading the same row in its own FOR UPDATE SELECT until the first transaction commits. The second transaction reads the updated value rather than the stale value.
**Limitation:** Requires remembering to add FOR UPDATE everywhere the row is read for modification. Missing one lock anywhere in the codebase creates the vulnerability. Serializable isolation is more robust because it applies automatically.
---
### Fix D: Upgrade Isolation Level
**Applies to:** Lost update — when Fix A, B, and C are not applicable or not sufficient.
Per-database settings:
```sql
-- PostgreSQL
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- or: SET default_transaction_isolation = 'repeatable read';
-- MySQL InnoDB (REPEATABLE READ does NOT detect lost updates)
-- Must upgrade to SERIALIZABLE
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- SQL Server
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
-- Requires ALLOW_SNAPSHOT_ISOLATION ON at database level
-- Oracle
-- No true serializable available; use SELECT FOR UPDATE (Fix C)
```
---
### Fix E: SELECT FOR UPDATE on Precondition Query (Write Skew over Existing Rows)
**Applies to:** Write skew — where the precondition SELECT returns existing rows that can be locked.
**Applicable when:** The check-then-act pattern reads rows that exist in the database (not checking for absence).
**Not applicable when:** The SELECT returns zero rows (checking for absence) — no rows exist to lock.
**Before:**
```sql
BEGIN;
SELECT COUNT(*) FROM doctors
WHERE on_call = true AND shift_id = $1;
-- if count >= 2, proceed
UPDATE doctors SET on_call = false WHERE name = $2 AND shift_id = $1;
COMMIT;
```
**After:**
```sql
BEGIN;
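SELECT name FROM doctors
WHERE on_call = true AND shift_id = $1
FOR UPDATE; -- lock all on-call rows
-- PostgreSQL rejects FOR UPDATE on an aggregate such as COUNT(*),
-- so select the rows and count them in the application
-- concurrent transactions must wait until this one commits
-- if row count >= 2, proceed
UPDATE doctors SET on_call = false WHERE name = $2 AND shift_id = $1;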
COMMIT;
```
**Why it works:** FOR UPDATE locks every row returned by the SELECT. A second concurrent transaction's FOR UPDATE SELECT must wait until the first transaction commits. After the first commits (count = 1), the second reads count = 1 and fails the check.
**Limitation:** If the shift has many on-call doctors, this locks all of them for the duration of the transaction. Assess whether this lock scope is acceptable for the workload.
**ORM equivalents:**
```python
# SQLAlchemy
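# Load and lock the rows, then count in Python (avoids FOR UPDATE on an aggregate)
on_call_count = len(
    session.query(Doctor).filter_by(on_call=True, shift_id=shift_id).with_for_update().all()
)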
```
```java
// Spring Data JPA
@Lock(LockModeType.PESSIMISTIC_WRITE)
@Query("SELECT d FROM Doctor d WHERE d.onCall = true AND d.shiftId = :shiftId")
List<Doctor> findOnCallForUpdate(@Param("shiftId") long shiftId);
// Count in the application: findOnCallForUpdate(shiftId).size()
```
```ruby
# ActiveRecord
# Load and lock the rows, then count in Ruby (avoids FOR UPDATE on an aggregate)
Doctor.where(on_call: true, shift_id: shift_id).lock("FOR UPDATE").to_a.size
```
---
### Fix F: Upgrade to Serializable Isolation
**Applies to:** Write skew (phantom variant, or where FOR UPDATE is insufficient or impractical).
**Per-database settings:**
```sql
-- PostgreSQL: SSI (optimistic; aborts conflicting transactions at commit)
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- or: SET default_transaction_isolation = 'serializable';
-- MySQL InnoDB: 2PL (pessimistic; blocks conflicting transactions)
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- SQL Server: 2PL
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- Oracle: not available as true serializable.
-- Oracle "SERIALIZABLE" = snapshot isolation. Write skew still possible.
-- Use Fix E (FOR UPDATE) or Fix G (materializing conflicts) for Oracle.
```
**Application requirements for PostgreSQL SSI:**
- Implement retry logic for SQLSTATE 40001 (serialization_failure)
- The entire transaction must be re-executed on abort — ORM frameworks do not retry by default
- See the Fix B code patterns for a retry implementation
---
### Fix G: Materialize Conflicts (Last Resort)
**Applies to:** Phantom write skew — where the SELECT returns no rows, FOR UPDATE cannot be used, and upgrading to serializable isolation is not possible (e.g., Oracle).
**What to change:** Create a table of "lock rows" that represent the space of possible conflicts. Lock the appropriate rows before checking for conflicts and inserting.
**Example — booking system:**
```sql
-- Create a time slot lock table ahead of time (all rooms × all 15-minute slots for next 6 months)
CREATE TABLE booking_locks (
room_id INT,
slot_start TIMESTAMP,
PRIMARY KEY (room_id, slot_start)
);
-- In the booking transaction:
BEGIN;
-- Lock all slots that overlap the requested time range
SELECT * FROM booking_locks
WHERE room_id = $1
AND slot_start >= $2
AND slot_start < $3
FOR UPDATE;
-- Now check for conflicting bookings
SELECT COUNT(*) FROM bookings
WHERE room_id = $1 AND end_time > $2 AND start_time < $3;
-- If count = 0, insert
INSERT INTO bookings (...) VALUES (...);
COMMIT;
```
**Why it works:** Now there are rows to lock with FOR UPDATE — the lock rows. Two concurrent booking transactions for the same room and time range both try to lock the same lock rows. One must wait. After the first commits, the second reads the count and finds the conflict.
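One detail worth checking is which lock rows cover a requested range. A booking that starts mid-slot overlaps the slot that began earlier, so the lower bound arguably needs rounding down to a slot boundary before it is compared with `slot_start`. The helper below sketches that calculation, assuming the 15-minute slots from the example; `slots_to_lock` is a hypothetical name.

```python
from datetime import datetime, timedelta

SLOT = timedelta(minutes=15)

def slots_to_lock(start, end, slot=SLOT):
    """Return the slot_start value of every slot overlapping [start, end)."""
    midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    whole_slots = (start - midnight) // slot   # completed slots since midnight
    current = midnight + whole_slots * slot    # round start down to a boundary
    slots = []
    while current < end:
        slots.append(current)
        current += slot
    return slots

unaligned = slots_to_lock(datetime(2024, 1, 1, 10, 5), datetime(2024, 1, 1, 10, 40))
aligned = slots_to_lock(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30))
```

A booking from 10:05 to 10:40 needs the 10:00, 10:15, and 10:30 lock rows. Passing the raw 10:05 as the lower bound of `slot_start >= $2` would skip the 10:00 row, so two bookings that both touch 10:05 to 10:15 might not contend on any common lock row.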
**Why this is a last resort:**
- Requires creating and maintaining an additional table
- Lock rows must be populated ahead of time (cron job or migration)
- Coupling concurrency control mechanics into the data model is ugly
- Harder to reason about correctness than serializable isolation
- Only use when serializable isolation is genuinely unavailable or has unacceptable overhead
---
### Fix H: UNIQUE Database Constraint
**Applies to:** Phantom write skew where the uniqueness can be enforced at the single-row level (username, single-column unique values).
**What to change:** Add a UNIQUE constraint to the table. Let the database enforce the invariant at the storage level rather than at the application level.
```sql
-- Username uniqueness: UNIQUE constraint is the right tool
ALTER TABLE users ADD CONSTRAINT unique_username UNIQUE (username);
-- Seat assignment: UNIQUE constraint on (flight_id, seat_number)
ALTER TABLE seat_assignments ADD CONSTRAINT unique_seat UNIQUE (flight_id, seat_number);
```
**Application change:** Catch the unique constraint violation exception and return the appropriate application error.
```python
from sqlalchemy.exc import IntegrityError

try:
    db.session.add(User(username=username, ...))
    db.session.commit()
except IntegrityError:
    db.session.rollback()
    raise UsernameAlreadyTakenError()
```
**Why it works:** The UNIQUE constraint is enforced by the database at the storage layer, regardless of isolation level. Two concurrent inserts for the same username — one will succeed, one will fail with a constraint violation. No race condition possible.
**Limitation:** Only works when the invariant can be expressed as a uniqueness constraint on a single row. Does not apply to multi-row invariants (e.g., "at most one booking per room per overlapping time range" requires range comparison, not row uniqueness).
---
### Fix I: Upgrade to Snapshot Isolation for Long-Running Reads
**Applies to:** Read skew — long-running transactions that read multiple related tables.
**What to change:** Set the transaction isolation to REPEATABLE READ (snapshot isolation) for transactions that perform long-running reads across multiple tables.
```python
# Python: per-transaction isolation
from sqlalchemy import select, text

with engine.connect() as conn:
    # Must be the first statement of the transaction
    conn.execute(text("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ"))
    # All reads within this transaction see a consistent snapshot
    accounts = conn.execute(select(accounts_table)).fetchall()
    transfers = conn.execute(select(transfers_table)).fetchall()
    conn.commit()
```
```java
// Java Spring: per-method isolation
@Transactional(isolation = Isolation.REPEATABLE_READ)
public BillingReport generateBillingReport() {
    // All reads see the database state at transaction start
}
```
**Why it works:** Snapshot isolation takes a consistent snapshot of the entire database at the start of the transaction. All reads within the transaction — regardless of how many there are or how long the transaction runs — see the same point-in-time state. Concurrent writes by other transactions are invisible.
---
### Fix J: Enable Read Committed (Baseline Fix)
**Applies to:** Dirty reads — when the database is at read uncommitted.
**What to change:** Set the isolation level to read committed (the default for most production databases).
```sql
-- PostgreSQL
SET default_transaction_isolation = 'read committed';
-- MySQL
SET GLOBAL TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- Application-level: remove any explicit READ UNCOMMITTED settings
```
**Note:** Dirty writes are prevented at read committed and above through row-level write locks — no additional change needed beyond ensuring read uncommitted is not in use.
---
## Fix Applicability Summary
| Anomaly | Fix A | Fix B | Fix C | Fix D | Fix E | Fix F | Fix G | Fix H | Fix I | Fix J |
|---------|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| Dirty read / dirty write | | | | | | | | | | ✓ |
| Read skew | | | | | | | | | ✓ | |
| Lost update (expressible as SQL) | ✓ | | | | | | | | | |
| Lost update (complex logic, PG/Oracle) | | ✓ | ✓ | | | | | | | |
| Lost update (complex logic, MySQL) | | | ✓ | ✓ | | | | | | |
| Write skew (rows exist to lock) | | | | | ✓ | ✓ | | | | |
| Write skew (phantom, unique claim) | | | | | | ✓ | | ✓ | | |
| Write skew (phantom, range/overlap) | | | | | | ✓ | ✓ | | | |
| Write skew (Oracle, no true serializable) | | | | | ✓ | n/a | ✓ | | | |